Author Need to improve throughput - Any thoughts
Gladstone Daniel - dglads

2006-02-27, 6:58 pm

I have a text file with around 1 million lines and I need to do a search
and replace on over 9000 words. I am currently reading a line and passing
a hash table against it and any matches it is replacing the word in the
string. It is running real slow. Any thoughts on how to improve it?

Daniel Gladstone (Daniel.Gladstone@Acxiom.com)

Timothy Johnson

2006-02-27, 6:58 pm


Since you haven't provided any code, it is difficult to be specific. Here
are some thoughts:

* Make sure you're doing a "while (<INFILE>) {" instead of "@array =
<INFILE>", so that you read one line at a time rather than loading the
whole file into memory.
* Try precompiling your regular expressions.
(see qr// in perlop under "Quote and Quote-like Operators")
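
As a rough sketch of both suggestions together: the patterns are compiled once with qr// before the loop, and the input is read one line at a time. The %change table, the fix_line name, and the in-memory filehandle standing in for INFILE are all made up for illustration; nothing here is the original poster's code.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample table; the real one would hold the 9000+ words.
my %change = (colour => 'color', centre => 'center');

# Compile each pattern once, before the read loop, not once per line.
# \Q...\E escapes regex metacharacters; \b limits matches to whole words.
my @rules = map { [ qr/\b\Q$_\E\b/, $change{$_} ] } sort keys %change;

# Apply every precompiled rule to one line and return the result.
sub fix_line {
    my ($line) = @_;
    for my $rule (@rules) {
        my ($re, $to) = @$rule;
        $line =~ s/$re/$to/g;
    }
    return $line;
}

# Read one line at a time; "my @array = <$in>" would load the whole file.
# An in-memory filehandle stands in for INFILE here.
my $text = "The colour of the centre\nNothing to change here\n";
open my $in, '<', \$text or die "Cannot open input: $!";
while (my $line = <$in>) {
    print fix_line($line);
}
close $in;
```

Note that this still makes one pass over the line per word, so with 9000 words it should be treated as a baseline rather than a cure.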


-----Original Message-----
From: Gladstone Daniel - dglads [mailto:Daniel.Gladstone@acxiom.com]
Sent: Monday, February 27, 2006 8:50 AM
To: beginners@perl.org
Subject: Need to improve throughput - Any thoughts

I have a text file with around 1 million lines and I need to do a search
and replace on over 9000 words. I am currently reading a line and passing
a hash table against it and any matches it is replacing the word in the
string. It is running real slow. Any thoughts on how to improve it?

Daniel Gladstone (Daniel.Gladstone@Acxiom.com)



DJ Stunks

2006-02-27, 6:58 pm


Gladstone Daniel - dglads wrote:
> I have a text file with around 1 million lines and I need to do a search
>
> And replace on over 9000 words. I am currently reading a line and
> passing
> A hash table against it and any matches it is replacing the word in the
> string. It is running real slow. Any thoughts on how to improve it?


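Hash lookups themselves are cheap. If, as it sounds, you are trying every
one of the 9000 hash keys against every line, that is what is expensive:
over a million lines it comes to roughly nine billion substitution
attempts. You may need a better algorithm for that portion of your code.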

-jp

usenet@DavidFilmer.com

2006-02-27, 6:58 pm

Gladstone Daniel - dglads wrote:
> I have a text file with around 1 million lines and I need to do a search
> And replace on over 9000 words. I am currently reading a line and
> passing A hash table against it...


We could help you more if you would post actual code instead of an
English description of it. English is ambiguous, but Perl is precise.

You also don't show us what your data looks like (is it freeform text
with punctuation characters, etc? Do you have singular/plural word
forms to deal with? Does uppercase/lowercase matter? etc, etc, etc). It
really does matter.

Something like this might be OK for a start (without knowing more about
the data); refinement may be necessary to suit the actual data:

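#!/usr/bin/perl
use strict; use warnings;

# Map each word to its replacement (whitespace-separated pairs).
my %change = qw/Fred     Fredrick
                drives   pedals
                car      vehicle
                Barney   Bernard
                log      tree/;

while (my $line = <DATA>) {    # or your actual filehandle
    # One pass per line: look up each whitespace-delimited token in the
    # hash, and leave tokens that have no entry unchanged.
    $line =~ s/(\S+)/exists $change{$1} ? $change{$1} : $1/ge;
    print $line;
}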

__DATA__
Fred drives the Flintstone family car
Barney has a car that looks like a log
car Car cars car's cargo encarta Nascar - only first 'car' matches!

--
http://DavidFilmer.com

Zentara

2006-02-28, 6:57 pm

On Mon, 27 Feb 2006 10:49:35 -0600, Daniel.Gladstone@acxiom.com
("Gladstone Daniel - dglads") wrote:

>I have a text file with around 1 million lines and I need to do a search
>
>And replace on over 9000 words. I am currently reading a line and
>passing
>A hash table against it and any matches it is replacing the word in the
>string. It is running real slow. Any thoughts on how to improve it?


Look on CPAN for Regexp-Optimizer.

It will help optimize the regexp for your list. If you think about
it, 9000 words will have a lot in common, so you should be able
to find patterns in those sets of words, and write regexps for them.
There is no need to look for each word separately.
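
To show the general idea without depending on the CPAN module, here is a minimal sketch that joins the word list into one alternation and uses the captured word as the hash key. This is a plain hand-built alternation, not Regexp-Optimizer itself; the sample %change table and the replace_words name are made up for illustration. Perl 5.10 and later compile alternations of literal strings into a trie internally, so on those versions much of the shared-prefix optimization comes for free.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample table; the real one would hold the 9000+ words.
my %change = (Fred => 'Fredrick', car => 'vehicle', log => 'tree');

# Build a single alternation from all the keys. quotemeta escapes any
# regex metacharacters; longest keys go first so that a longer key is
# tried before a shorter one that is its prefix.
my $alternation = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) or $a cmp $b }
    keys %change;
my $words_re = qr/\b($alternation)\b/;

# One pass over the line: every whole-word match is replaced by the
# hash value for the word that was captured.
sub replace_words {
    my ($line) = @_;
    $line =~ s/$words_re/$change{$1}/g;
    return $line;
}

print replace_words("Fred parked the car by a log near the cargo bay\n");
```

Each line is scanned once however many words are in the list, and the hash is consulted only for actual matches.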

You also might experiment with breaking your 9000 word
list into smaller lists, use the optimizer on those smaller
lists, and run each one against each line separately.
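
A sketch of the chunking experiment, assuming a hand-built alternation per chunk rather than the optimizer module. The sample table, the $chunk_size value, and the replace_chunked name are hypothetical; the right chunk size would have to be found by timing runs against the real data.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample table and chunk size; tune both for real data.
my %change = (
    Fred   => 'Fredrick',
    Barney => 'Bernard',
    car    => 'vehicle',
    drives => 'pedals',
    log    => 'tree',
);
my $chunk_size = 2;

# Split the sorted word list into chunks and compile one regex per chunk.
my @words = sort keys %change;
my @chunk_res;
while (my @chunk = splice(@words, 0, $chunk_size)) {
    my $alternation = join '|', map { quotemeta($_) } @chunk;
    push @chunk_res, qr/\b($alternation)\b/;
}

# Run each chunk's regex against the line in turn.
# Caveat: a replacement written by one chunk can be rewritten by a later
# chunk if it is also a key; a single combined regex avoids that.
sub replace_chunked {
    my ($line) = @_;
    for my $re (@chunk_res) {
        $line =~ s/$re/$change{$1}/g;
    }
    return $line;
}

print replace_chunked("Barney drives a car past a log\n");
```

Comparing this against one combined regex with the core Benchmark module would show whether chunking actually helps on a given Perl version.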






--
I'm not really a human, but I play one on earth.
http://zentara.net/japh.html