Author 15 Million RAW
Lorenzo Caggioni

2005-11-24, 6:56 pm

Hi,



What I have to do is:

1- Read a line from an input file
2- Validate the row (for example: is the second char == 2?)
3- Split the line
4- Write the validated and split row to an output file in a different
order (for example: the last 2 digits have to be written as the first 2 digits)

I have to loop over this for 15 MILLION lines!!!!

How long does Perl take?

The program I have written takes 25 sec for 10,000 lines... too much....

Do you have any suggestions?
Do I have to think in a parallel way? (fork ?!?!)



Thanks

Lorenzo

Pierre Smolarek

2005-11-24, 6:56 pm

Lorenzo Caggioni wrote:
> The program I have written takes 25 sec for 10,000 lines... too much....
>

How quickly do you need it to run if 25 seconds is too long?

--
Best Regards,

Pierre Smolarek
Unify Media Ltd

tel. 1-403-681-8054

John W. Krahn

2005-11-24, 9:55 pm

Lorenzo Caggioni wrote:
> Hi,


Hello,

> What I have to do is:
>
> 1- Read a line from an input file
> 2- Validate the row (for example: is the second char == 2?)
> 3- Split the line
> 4- Write the validated and split row to an output file in a different
> order (for example: the last 2 digits have to be written as the first 2 digits)


Have you done that? Where is it?


John
--
use Perl;
program
fulfillment
Chris Devers

2005-11-24, 9:55 pm

usenet@DavidFilmer.com

2005-11-25, 3:55 am

Lorenzo Caggioni wrote:
> What I have to do is:
> 1- Read a line from an input file


What does the input file look like? Are lines fixed-length? Variable?
20 bytes long or 20000 bytes long?

> 2- Validate the row (for example: is the second char == 2?)


What if it isn't 'valid'? Do you just skip it, or run an exception
routine?

> 3- Split the line


Using what criteria? Why? (Maybe split isn't the way to go, if all you
want to do is write the line out in a different format.)

> How long does Perl take?


That has much more to do with your hardware than your scripting
language. Your bottleneck will be file I/O. Can your files
(input/output) reside on two different hard drives on two different I/O
channels? That would help a lot.

> Do I have to think in a parallel way? (fork ?!?!)


That may make things worse (it can cause disk thrashing, and you would
need to consider race conditions).

Since you didn't give us a sample of your code, nor a sample of your
data, nor an effective description of your task, it's hard to give you
much specific advice.
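
For reference, the four steps from the original post can be sketched as below. This is only a minimal sketch under stated assumptions: the record layout (1 flag char, 1 type char, 6 payload chars, 2 trailing digits), the ';' output delimiter and the name convert_line are all made up for illustration, and in-memory filehandles stand in for the real files.

```perl
use strict;
use warnings;

# Hypothetical record layout (an assumption, not from the thread):
# 1 flag char, 1 type char, 6 payload chars, 2 trailing digits.
my $layout = 'A1 A1 A6 A2';

# Validate, split and reorder one line; returns undef for invalid lines.
sub convert_line {
    my ($line) = @_;
    return if length($line) < 10;               # too short for the layout
    return if substr($line, 1, 1) ne '2';       # step 2: second char must be '2'
    my ($flag, $type, $payload, $tail) = unpack $layout, $line;   # step 3
    return join ';', $tail, $flag, $type, $payload;               # step 4: tail first
}

# In-memory filehandles stand in for the real input and output files.
my $input  = "A2abcdef99\nB3ghijkl11\nC2mnopqr42\n";
my $output = '';
open my $in,  '<', \$input  or die "cannot open input: $!";
open my $out, '>', \$output or die "cannot open output: $!";

while (my $line = <$in>) {              # step 1: read one line at a time
    chomp $line;
    my $converted = convert_line($line);
    next unless defined $converted;     # skip invalid lines
    print {$out} "$converted\n";
}
close $in;
close $out;

print $output;
```

With the sample data this writes 99;A;2;abcdef and 42;C;2;mnopqr and skips the middle line, whose second character is not 2. Reading line by line keeps memory use flat no matter how many lines the file has.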

Gary Stainburn

2005-11-25, 3:55 am

Here's my two penn'orth.

Avoid regex. While it's powerful, it's also expensive.

Short but sweet

Gary

On Friday 25 November 2005 3:31 am, Chris Devers wrote:
> On Thu, 24 Nov 2005, Pierre Smolarek wrote:
>
> If 10,000 lines take 25 seconds, you're doing 400 lines per second.
>
> At that rate, 15,000,000 lines will take 37,500 seconds, or 10h25m.
>
> While asking for a firmer definition for "faster" is a fair question,
> it's fair to assume that he wants to do better than 10.4 hours :-)
>
> That said, the canned answer applies here. If the problem is --
>
> 1 Read a line from an input file
> 2 Validate the row (for example: is the second char == 2?)
> 3 Split the line
> 4 Write the validated and split row to an output file in a
> different order (for example: the last 2 digits have to be
> written as the first 2 digits)
>
> -- then, in order to give *any* constructive advice, we need:
>
> * to see the code in question
> * to know if the code has been benchmarked
>
> If we can't see the code, we can't possibly offer useful suggestions.
>
> If we don't have benchmark info to know what part of the code is
> taking so long, we can't even speculate as to where to start
> optimizing things.
>
> One of the suggestions in Damian Conway's _Perl Best Practices_ is a
> simple piece of advice: "Don't Optimize Code -- Benchmark It". For
> details, look over this excerpt from the book:
>
> http://www.perl.com/lpt/a/2005/07/14/bestpractices.html
>
> It's sound advice. The book's next suggestion -- which I can't seem
> to find a reference to online, so you're just going to have to find a
> copy of the book itself -- is "Don't optimize data structures --
> measure them." This is also sound advice. If you use a module like
> Devel::Size to determine how space is being allocated, you can get a
> better sense of where you might be choking on data and, in turn, have
> a sense of where you need to fix things.
>
> Once you've used such tools to map out how your program is consuming
> time and space, you can start making decisions about how to reduce
> that consumption, by speeding up critical sections, reducing memory
> use, or just throwing more RAM and CPU at the problem if you're
> starved there and software optimizations seem like they might not be
> enough. But until you've figured out where the time is being spent,
> or what system resource is being exhausted, you can't properly
> address the problem.
>
> Really, you could do a whole lot worse than by just getting a copy of
> _Perl Best Practices_ and using its advice to rewrite your program
> from scratch. Almost everyone could improve their code this way...
> :-)
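
The "benchmark it" advice quoted above can be tried with the core Benchmark module. The following is a minimal sketch, not a measurement of the poster's program: the sample record and the two candidates (taking the last two characters with substr or with a regex) are assumptions chosen only to show the mechanics.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $line = 'A2abcdef99';   # made-up sample record

# Two candidate ways of pulling out the last two characters.
my %candidates = (
    substr => sub { my $tail = substr $line, -2; return $tail },
    regex  => sub { my ($tail) = $line =~ /(..)\z/; return $tail },
);

# Both candidates must agree before their speed is worth comparing.
my %results = map { $_ => $candidates{$_}->() } keys %candidates;
die "candidates disagree\n" if $results{substr} ne $results{regex};

# A negative count means "run each candidate for at least that many
# CPU seconds"; the table printed depends entirely on the machine.
cmpthese(-1, \%candidates);
```

Checking that the candidates return the same value first avoids the common mistake of benchmarking two pieces of code that do different things.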


--
Gary Stainburn

This email does not contain private or confidential material as it
may be snooped on by interested government parties for unknown
and undisclosed purposes - Regulation of Investigatory Powers Act, 2000

Chris Devers

2005-11-25, 7:55 am

John Doe

2005-11-25, 7:55 am

Lorenzo Caggioni am Freitag, 25. November 2005 11.04:
> Attached you can find the code and an input file to try it.
>
> I'm sorry if the code is not really commented and not really clear,
> but I had to delete some lines because it is based on a database....


From a quick look at the code, I see optimization potential
(some may have quite an effect, others may not...) in:

a) main::SplitRowByLength:

instead of substr, you could try and benchmark direct extraction of the fields
with a single regex along the lines of my @fields=$line=~/(.{1})(.{4})/;

unpack may be better; I don't have much experience with it.

b) in the top level while loop:

avoid the repeated eval (can't see a purpose for that...). I may have
overlooked something, but why

$xFieldValue = '($cdr[0]';
$xFieldValue .= ',\@cdr,\$cdrsline,\$dbh)';
eval ("fmtTLGInternationalFormatTelegramTEST".$xFieldValue);

instead of a simple

fmtTLGInternationalFormatTelegramTEST($cdr[0],\@cdr,\$cdrsline,\$dbh)

(where the ref to $dbh is unnecessary since it is an object, and $cdr[0]
could be replaced by a preceding my $cdr0=$cdr[0] and then use $cdr0)

?

Then, first make a my variable instead of using the same hash lookup several
times. E.g. $globalParameters{"OutputFileFieldDelimiter"} is used many times.


c) generally

Avoid most of the string interpolation where not necessary (hash keys, around
integers, left of '=>' etc.)

d) shorten some subs

sub fmtCurrencyCodeTEST {
    my($xCurr) = "EUR";
    return $xCurr;
}
=>
sub fmtCurrencyCodeTEST {'EUR'}

sub fmtTLGATTR2_int_natTEST {
    my ($xServiceCode,$xInputCDR) = @_;
    return $xInputCDR->[20];
}
=>
sub fmtTLGATTR2_int_natTEST {$_[1]->[20]}

etc.

e) fmtTLGConvertDateTEST

here the many substr calls could be avoided
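
As an illustration of point a), here is a minimal sketch of the three ways of extracting fixed-width fields mentioned there (substr, a single regex, unpack). The sample record and its field widths (1 and 4 characters, matching the regex above) are assumptions; the real layout used by SplitRowByLength is not shown in the thread.

```perl
use strict;
use warnings;

my $line = 'X2005rest';   # made-up record: 1-char field, 4-char field, remainder

# Variant 1: one substr call per field.
my @by_substr = (substr($line, 0, 1), substr($line, 1, 4));

# Variant 2: a single regex with one capture group per field.
my @by_regex = $line =~ /\A(.{1})(.{4})/s;

# Variant 3: unpack with a template of fixed-width fields.
# 'a' keeps the bytes exactly as they are ('A' would strip trailing spaces).
my @by_unpack = unpack 'a1 a4', $line;

print join(',', @by_substr), "\n";
print join(',', @by_regex),  "\n";
print join(',', @by_unpack), "\n";
```

All three variants yield the same two fields here; which one is fastest for a given record layout is something to measure rather than assume.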


Since I'm still a beginner, be careful with my advice...
hopefully at least 2 cents,

joe

Rob Coops

2005-11-25, 7:55 am

Making the subs shorter may help processing speed a little,
but it will make things a lot more difficult for the person who has to take
over the maintenance. When you know what you are doing and why, it is easy to
read, but when you get a big program written like that and are asked to
support it... you will go looking for the guy that wrote it and give him a
good old kick in the .... because of all the headache he cost you.
So unless this is a very personal script that will never be handed over
to anyone, and your memory is good enough to remember what you are doing
where and why, please do not write subs like that unless you
are very good at documenting your code as you are writing it.




John Doe

2005-11-25, 6:56 pm

Rob Coops am Freitag, 25. November 2005 14.13:
> Making the subs shorter may help processing speed a little,
> but it will make things a lot more difficult for the person who has to take
> over the maintenance. When you know what you are doing and why, it is easy to
> read, but when you get a big program written like that and are asked to
> support it... you will go looking for the guy that wrote it and give him a
> good old kick in the .... because of all the headache he cost you.
> So unless this is a very personal script that will never be handed over
> to anyone, and your memory is good enough to remember what you are doing
> where and why, please do not write subs like that unless you
> are very good at documenting your code as you are writing it.


Hi Rob

[see inline]
> On 11/25/05, John Doe <security.department@tele2.ch> wrote:
[...]

OK, making subs shorter with fewer local variables won't improve performance
significantly. That's why I listed it at the end :-)

Concerning maintainability, I don't see much of a problem in my examples, since
no algorithm is obfuscated; the arguments are simply accessed directly,
so the difference is not very big.

And of course a sub should be documented:
- purpose
- side effects
- parameter description
- description of the return values
- (etc.)

Compare:

# purpose: return currency
# in: --
# out: constant string 'EUR'
#
sub fmtCurrencyCodeTEST {
my($xCurr) = "EUR";
return $xCurr;
}

# purpose: return currency
# in: --
# out: constant string 'EUR'
#
sub fmtCurrencyCodeTEST {'EUR'}

In this example, you could even omit the comments, since the purpose of
the sub is obvious.


Have a look into the perl source; you will find lots of such examples.

greetings,

joe
Lorenzo Caggioni

2005-11-25, 6:56 pm

I made some changes to the program (deleted the eval, adjusted the subs... )

Now the program takes less than 3 sec, but it loses all its structure...

The main thing that increased performance was deleting the eval("fun name").
I did it that way because the name of the function is retrieved from a
database.
Is there another way to call a function whose name is retrieved from a variable?

Any suggestions?

Thanks




Jeff 'japhy' Pinyan

2005-11-25, 6:56 pm

On Nov 25, Lorenzo Caggioni said:

> I made some changes to the program (deleted the eval, adjusted the subs... )
>
> Now the program takes less than 3 sec, but it loses all its structure...
>
> The main thing that increased performance was deleting the eval("fun name").
> I did it that way because the name of the function is retrieved from a
> database.
> Is there another way to call a function whose name is retrieved from a variable?


Yes, it's called a dispatch table:

my %functions = (
    abc => \&do_this,
    def => \&do_that,
    ghi => \&do_something_else,
);

Those \&... things are REFERENCES to functions. So you do:

while (my @row = get_stuff_from_database()) {
    # assuming $row[0] is abc or def or ghi
    # that is, $row[0] holds the nickname of the function
    my $code = $functions{$row[0]};

    $code->(@arguments);
}

So when $row[0] is 'abc', we call do_this(...). Etc.
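
A self-contained version of the dispatch table idea might look like the sketch below. The sub names (fmt_currency, fmt_upper), the table keys and the call_by_name helper are hypothetical stand-ins, since the real formatter subs and the database lookup are not shown in the thread.

```perl
use strict;
use warnings;

# Hypothetical formatter subs standing in for the real ones.
sub fmt_currency { return 'EUR' }
sub fmt_upper    { my ($value) = @_; return uc $value }

# Dispatch table: name (as it might be stored in the database) => code ref.
my %functions = (
    currency => \&fmt_currency,
    upper    => \&fmt_upper,
);

# Look the name up and call it; die on names that are not in the table.
sub call_by_name {
    my ($name, @arguments) = @_;
    my $code = $functions{$name}
        or die "unknown function name '$name'\n";
    return $code->(@arguments);
}

print call_by_name('currency'), "\n";        # EUR
print call_by_name('upper', 'abc'), "\n";    # ABC

# An unknown name is reported instead of being evaluated as code.
my $ok = eval { call_by_name('no_such_name'); 1 };
print "rejected: $@" unless $ok;
```

Compared with a string eval, the hash lookup involves no recompilation per call, and a name that is not in the table is rejected rather than executed.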

--
Jeff "japhy" Pinyan % How can we ever be the sold short or
RPI Acacia Brother #734 % the cheated, we who for every service
http://www.perlmonks.org/ % have long ago been overpaid?
http://princeton.pm.org/ % -- Meister Eckhart
Dr.Ruud

2005-11-25, 6:56 pm

Lorenzo Caggioni:

Please don't top-post, and trim any quoted text that you aren't replying to.

> Is there another way to call a function whose name is retrieved from a
> variable?


If the set of functions is limited, use if:

if ('abc' eq $func) {
    abc();
} elsif ('def' eq $func) {
    def();
}

Or put them in a hash.
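
When the set of names is not known in advance, one possible variation (not proposed in the thread) is to resolve each name to a code reference once with can() and cache the result in a hash. This is a sketch under assumptions: the subs abc and def and the helper resolve are hypothetical, and the names are taken from a fixed list rather than a database.

```perl
use strict;
use warnings;

# Hypothetical subs; in the thread the names would come from a database.
sub abc { return 'abc was called' }
sub def { return 'def was called' }

# Resolve a sub name to a code reference once, outside the hot loop.
# can() returns undef when the current package has no sub of that name.
sub resolve {
    my ($name) = @_;
    my $code = __PACKAGE__->can($name)
        or die "no such function '$name'\n";
    return $code;
}

my %cache;   # name => code ref, filled on first use
for my $func (qw(abc def abc)) {
    my $code = $cache{$func} ||= resolve($func);
    print $code->(), "\n";
}
```

Note that can() will find any sub visible in the package, including resolve itself, so names coming from outside the program should still be checked against a list of allowed names.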

--
Affijn, Ruud

"Gewoon is een tijger."

Tom Allison

2005-11-25, 6:56 pm

Chris Devers wrote:
> On Thu, 24 Nov 2005, Pierre Smolarek wrote:
>
>
>
>
> If 10,000 lines take 25 seconds, you're doing 400 lines per second.
>
> At that rate, 15,000,000 lines will take 37,500 seconds, or 10h25m.
>
> While asking for a firmer definition for "faster" is a fair question,
> it's fair to assume that he wants to do better than 10.4 hours :-)
>
> That said, the canned answer applies here. If the problem is --
>
> 1 Read a line from an input file
> 2 Validate the row (for example: is the second char == 2?)
> 3 Split the line
> 4 Write the validated and split row to an output file in a
> different order (for example: the last 2 digits have to be
> written as the first 2 digits)
>
> -- then, in order to give *any* constructive advice, we need:
>
> * to see the code in question
> * to know if the code has been benchmarked
>
> If we can't see the code, we can't possibly offer useful suggestions.
>
> If we don't have benchmark info to know what part of the code is taking
> so long, we can't even speculate as to where to start optimizing things.
>
> One of the suggestions in Damian Conway's _Perl Best Practices_ is a
> simple piece of advice: "Don't Optimize Code -- Benchmark It". For
> details, look over this excerpt from the book:
>
> http://www.perl.com/lpt/a/2005/07/14/bestpractices.html
>
> It's sound advice. The book's next suggestion -- which I can't seem to
> find a reference to online, so you're just going to have to find a copy
> of the book itself -- is "Don't optimize data structures -- measure
> them." This is also sound advice. If you use a module like Devel::Size
> to determine how space is being allocated, you can get a better sense of
> where you might be choking on data and, in turn, have a sense of where
> you need to fix things.
>


There's also a profiler (Devel::Profile) to find out where you are spending
those 25 seconds.
John W. Krahn

2005-11-25, 9:55 pm

Lorenzo Caggioni wrote:
> Attached you can find the code and an input file to try it.
>
> I'm sorry if the code is not really commented and not really clear, but
> I had to delete some lines because it is based on a database....
>
> Now the program can run without any DB.
> You can find even a profile for the program.


Others have mentioned optimizations but I noticed a few errors:

Line 89:

if ($InvalidReason eq undef)

You cannot use the value undef in a comparison; that should be:

if ( ! defined $InvalidReason )

And line 311:

@{$inputCDR_HASH{"0"}} = @{$xInputCDR} if $xInputCDR != undef;

which should be:

@{$inputCDR_HASH{"0"}} = @{$xInputCDR} if defined $xInputCDR;

Lines 392 and 393:

return $globalParameters{"GNV_INTERF_MODIFIER"}{"11"}{"NATTLG"} if $xServiceCode = 9510;
return $globalParameters{"GNV_INTERF_MODIFIER"}{"10"}{"INTTLG"} if $xServiceCode = 9520;

If you had warnings enabled then perl would have warned you that you are doing
an assignment (=) instead of a comparison (==). You should have these two lines at the
beginning of your program:

use warnings;
use strict;
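
The corrections above can be seen working together in a small sketch. The service codes 9510 and 9520 are the ones quoted from the program; the sub name modifier_for and the returned labels are made up for illustration and do not reproduce the real %globalParameters lookup.

```perl
use strict;
use warnings;

# Return a label for a service code; 9510 and 9520 are the codes quoted
# in the thread, the labels are made up for this sketch.
sub modifier_for {
    my ($service_code) = @_;
    return 'unknown'       if !defined $service_code;  # test for undef with defined
    return 'national'      if $service_code == 9510;   # == compares numbers
    return 'international' if $service_code == 9520;
    return 'other';
}

print modifier_for(9510),  "\n";   # national
print modifier_for(9520),  "\n";   # international
print modifier_for(1234),  "\n";   # other
print modifier_for(undef), "\n";   # unknown
```

With = in place of == the first test would assign 9510 to the variable and always be true, so every defined code would be reported as national.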



John
--
use Perl;
program
fulfillment

Copyright 2008 codecomments.com