

Home > Archive > PERL Beginners > September 2006 > Your opinion on large file processing










 

Your opinion on large file processing
Andrej Kastrin

2006-09-23, 3:57 am

Dear all,

The script below counts word occurrences in the input file. It uses a simple
hash to store the unique words and their frequencies.
--------------------
use strict;
my %words;
while (<>) {
    chop;
    foreach my $wd (split) {
        $words{$wd}++;
    }
}

foreach my $w (keys %words) {
    print "$w|$words{$w}\n";
}
--------------------

To process large amounts of data (10,000,000 lines) and to avoid memory
problems, I use the DB_File module to store the hash %words in a local
file and then read the data back from it.

--------------------
use strict;
use DB_File;
tie my %words, 'DB_File', 'words.db';
while (<>) {
    chop;
    foreach my $wd (split) {
        $words{$wd}++;
    }
}

foreach my $w (keys %words) {
    print "$w|$words{$w}\n";
}
untie(%words);
--------------------


Is that a sensible solution in terms of good programming practice?

Thanks in advance for any opinion,
Andrej

nobull67@gmail.com

2006-09-23, 7:57 am


Andrej Kastrin wrote:
> Dear all,
>
> The script below counts word occurrences in the input file. It uses a simple
> hash to store the unique words and their frequencies.
> --------------------
> use strict;
> my %words;
> while (<> ) {
> chop;
> foreach my $wd (split) {
> $words{$wd}++;
> }
> }
>
> foreach my $w (keys %words) {
> print "$w|$words{$w}\n";
> }
> --------------------
>
> To process large amounts of data (10,000,000 lines) and to avoid memory
> problems, I use the DB_File module to store the hash %words in a local
> file and then read the data back from it.
>
> --------------------
> use strict;


Consider using warnings too.

> use DB_File;
> tie my %words, 'DB_File', 'words.db';
> while (<> ) {
> chop;


Get into the habit of using chomp(), not chop(). For details, see
perldoc -f chomp.
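As a minimal sketch of why this matters (the sample strings are made up for illustration): chop() removes the last character whatever it is, while chomp() removes only a trailing newline. In the script above, the later split on whitespace hides the difference on most lines, but a final line with no trailing newline would lose the last character of its last word.

```perl
use strict;
use warnings;

# Hypothetical sample data: one line ending in a newline, one without,
# as can happen on the last line of a file.
my $with_newline    = "last word\n";
my $without_newline = "last word";

# chop removes the final character unconditionally.
chop(my $chopped_a = $with_newline);       # newline removed
chop(my $chopped_b = $without_newline);    # a real character is removed

# chomp removes only a trailing newline (the input record separator).
chomp(my $chomped_a = $with_newline);      # newline removed
chomp(my $chomped_b = $without_newline);   # string left unchanged

print "chop:  [$chopped_a] [$chopped_b]\n";
print "chomp: [$chomped_a] [$chomped_b]\n";
```

This prints "chop:  [last word] [last wor]" followed by "chomp: [last word] [last word]".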

> foreach my $wd (split) {
> $words{$wd}++;
> }
> }
>
> foreach my $w (keys %words) {
> print "$w|$words{$w}\n";
> }


Since you are worried about %words being huge, evaluating the list
keys(%words) is probably a bad idea: Perl 5 doesn't have lazy lists, so
the whole list of keys is built in memory before the loop starts.

while (my ($w, $c) = each %words) {
    print "$w|$c\n";
}

I can't recall for sure if DB_File has a lazy each() but I think it
does.
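Putting the suggestions so far together, the script might look roughly like the sketch below. Several things here are assumptions, not part of the original post: the in-memory sample input (the original reads from <>), the temporary directory, and the explicit flags, mode and $DB_HASH arguments to tie. The temporary database is used because a words.db left over from an earlier run would otherwise have its old counts added to the new ones.

```perl
use strict;
use warnings;
use Fcntl;                      # O_RDWR, O_CREAT
use DB_File;                    # also exports $DB_HASH
use File::Temp qw(tempdir);

# Hypothetical sample input; the original script reads from <> instead.
my $sample = "the cat sat\non the mat\nthe end";
open my $in, '<', \$sample or die "Cannot open sample input: $!";

# A fresh database file, so counts from an earlier run cannot leak in.
my $dir     = tempdir(CLEANUP => 1);
my $db_path = "$dir/words.db";

tie my %words, 'DB_File', $db_path, O_RDWR | O_CREAT, 0666, $DB_HASH
    or die "Cannot tie $db_path: $!";

while (my $line = <$in>) {
    chomp $line;                          # strip only a trailing newline
    $words{$_}++ for split ' ', $line;    # split on runs of whitespace
}

# each() hands back one key/value pair per call, so no key list is built.
# The order of the pairs is unspecified.
while (my ($w, $c) = each %words) {
    print "$w|$c\n";
}

untie %words;
```

For real data, replace <$in> with <> and choose a permanent path for the database, deleting any stale file first if the counts should start from zero.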

Peter Scott

2006-09-25, 6:57 pm

On Sat, 23 Sep 2006 11:51:54 +0200, Andrej Kastrin wrote:
> The script below counts word occurrences in the input file. It uses a simple
> hash to store the unique words and their frequencies.

[...]
> foreach my $w (keys %words) {
> print "$w|$words{$w}\n";
> }

[...]
> Is that a sensible solution in terms of good programming practice?


Good start, but you just shot yourself in the foot. Read 'perldoc -f tie'
and pay especial attention starting at the second paragraph.

--
Peter Scott
http://www.perlmedic.com/
http://www.perldebugged.com/

Andrej Kastrin

2006-09-25, 6:57 pm

Peter Scott wrote:
> On Sat, 23 Sep 2006 11:51:54 +0200, Andrej Kastrin wrote:
>
> [...]
>
> [...]
>
>
> Good start, but you just shot yourself in the foot. Read 'perldoc -f tie'
> and pay especial attention starting at the second paragraph.
>
>

Peter, thanks for your response. I have already implemented the 'each'
function to iterate over the hash without building the entire list in memory.

Cheers, Andrej

usenet@DavidFilmer.com

2006-09-25, 6:57 pm

Andrej Kastrin wrote:
> Peter, thanks for your response. I have already implemented the 'each'
> function to iterate over the hash without building the entire list in memory.


You don't show that in your code:

foreach my $w (keys %words) {

You do realize that the each() function is not the same as foreach()?
Your code builds the entire list in memory, which is what Mr. Scott
warned you about.
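One way to see the difference for yourself is a small tied hash that records every request for a key. The sketch below is only an illustration: LoggingHash is a hypothetical helper class built on the core Tie::StdHash, standing in for DB_File, and the three sample words are made up.

```perl
use strict;
use warnings;

# Hypothetical helper class: a tied hash that logs each iterator call.
package LoggingHash;
require Tie::Hash;
our @ISA = ('Tie::StdHash');
our @log;

sub FIRSTKEY {
    my $self = shift;
    push @log, 'fetch-key';
    return $self->SUPER::FIRSTKEY(@_);
}

sub NEXTKEY {
    my $self = shift;
    push @log, 'fetch-key';
    return $self->SUPER::NEXTKEY(@_);
}

package main;

tie my %words, 'LoggingHash';
%words = (apple => 1, pear => 2, plum => 3);

# foreach over keys: the whole key list is collected before the body runs.
@LoggingHash::log = ();
foreach my $w (keys %words) {
    push @LoggingHash::log, 'body';
}
my @keys_trace = @LoggingHash::log;

# while over each: one key is requested, then the body runs, and so on.
@LoggingHash::log = ();
while (my ($w, $c) = each %words) {
    push @LoggingHash::log, 'body';
}
my @each_trace = @LoggingHash::log;

print "keys: @keys_trace\n";
print "each: @each_trace\n";
```

In the keys trace every fetch-key entry comes before the first body entry, whereas in the each trace the two alternate. With ten million words behind the tie, the first pattern means holding all the keys in memory at once.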

--
David Filmer (http://DavidFilmer.com)
