Your opinion on large file processing
| Andrej Kastrin 2006-09-23, 3:57 am |
| Dear all,
the script below counts word occurrences in the input file. It uses a simple
hash to store the unique words and their frequencies.
--------------------
use strict;
my %words;
while (<> ) {
chop;
foreach my $wd (split) {
$words{$wd}++;
}
}
foreach my $w (keys %words) {
print "$w|$words{$w}\n";
}
--------------------
In order to process large amounts of data (10.000.000 lines) and to
avoid memory problems, I use the DB_File module to store the hash %words in a
local file and then read the data back from it.
--------------------
use strict;
use DB_File;
tie my %words, 'DB_File', 'words.db';
while (<> ) {
chop;
foreach my $wd (split) {
$words{$wd}++;
}
}
foreach my $w (keys %words) {
print "$w|$words{$w}\n";
}
untie(%words);
--------------------
Is that a smart solution in terms of good programming practice?
Thanks in advance for any opinion,
Andrej
| |
| nobull67@gmail.com 2006-09-23, 7:57 am |
|
Andrej Kastrin wrote:
> Dear all,
>
> the script below counts word occurrences in the input file. It uses a simple
> hash to store the unique words and their frequencies.
> --------------------
> use strict;
> my %words;
> while (<> ) {
> chop;
> foreach my $wd (split) {
> $words{$wd}++;
> }
> }
>
> foreach my $w (keys %words) {
> print "$w|$words{$w}\n";
> }
> --------------------
>
> In order to process large amounts of data (10.000.000 lines) and to
> avoid memory problems, I use the DB_File module to store the hash %words in a
> local file and then read the data back from it.
>
> --------------------
> use strict;
Consider using warnings too.
> use DB_File;
> tie my %words, 'DB_File', 'words.db';
> while (<> ) {
> chop;
Get into the habit of using chomp(), not chop(). For details, see
perldoc -f chomp.
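To illustrate the difference: chop() removes the last character of a string whatever it is, while chomp() removes only a trailing newline. The two disagree on a final line that has no newline. A minimal sketch follows; the helper names strip_with_chop and strip_with_chomp are made up for this example.

```perl
use strict;
use warnings;

# chop() removes the last character unconditionally.
sub strip_with_chop  { my ($s) = @_; chop $s;  return $s; }

# chomp() removes only a trailing newline (the current value of $/).
sub strip_with_chomp { my ($s) = @_; chomp $s; return $s; }

# The second string stands for a last line with no newline at the end.
for my $line ("last word\n", "last word") {
    printf "chop: [%s]  chomp: [%s]\n",
        strip_with_chop($line), strip_with_chomp($line);
}
```

On the second string, chop() cuts off the final "d", so the word counted would be "wor" instead of "word".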
> foreach my $wd (split) {
> $words{$wd}++;
> }
> }
>
> foreach my $w (keys %words) {
> print "$w|$words{$w}\n";
> }
Since you are worried about %words being huge, evaluating the list
keys(%words) is probably a bad idea, because the whole list of keys is
built in memory first. (Perl 5 doesn't have lazy lists.)
while( my ($w,$c) = each %words) {
print "$w|$c\n";
}
I can't recall for sure if DB_File has a lazy each() but I think it
does.
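Putting the suggestions in this reply together (warnings, chomp, each), a revised script might look like the sketch below. It is only a sketch: count_words() and its parameters, the explicit flags passed to tie, and the temporary directory used for the demo are assumptions for illustration, not part of the original script.

```perl
use strict;
use warnings;
use Fcntl;                     # O_RDWR, O_CREAT
use DB_File;                   # exports $DB_HASH
use File::Temp qw(tempdir);    # used only by the demo below

# Count words read from $in into a DB_File-backed hash stored at
# $db_path, then print "word|count" lines to $out.
# (count_words and its parameters are illustrative names.)
sub count_words {
    my ($in, $db_path, $out) = @_;

    tie my %words, 'DB_File', $db_path, O_RDWR|O_CREAT, 0666, $DB_HASH
        or die "Cannot tie $db_path: $!";

    while (my $line = <$in>) {
        chomp $line;
        $words{$_}++ for split ' ', $line;
    }

    # each() fetches one key/value pair per call, so the complete
    # list of keys is never built in memory.
    while (my ($w, $c) = each %words) {
        print {$out} "$w|$c\n";
    }

    untie %words;
}

# Demo: in-memory input and a throwaway database file.
my $text = "the cat saw the dog\nthe end\n";
open my $in, '<', \$text or die "Cannot open in-memory input: $!";
my $dir = tempdir(CLEANUP => 1);
count_words($in, "$dir/words.db", \*STDOUT);
```

Note that the database file persists between runs, so with a fixed name such as words.db the counts from an earlier run would be added to; the demo avoids that by using a fresh temporary directory. The order of the output lines is whatever order the database returns.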
| |
| Peter Scott 2006-09-25, 6:57 pm |
| On Sat, 23 Sep 2006 11:51:54 +0200, Andrej Kastrin wrote:
> the script below counts word occurrences in the input file. It uses a simple
> hash to store the unique words and their frequencies.
[...]
> foreach my $w (keys %words) {
> print "$w|$words{$w}\n";
> }
[...]
> Is that a smart solution in terms of good programming practice?
Good start, but you just shot yourself in the foot. Read 'perldoc -f tie'
and pay especial attention starting at the second paragraph.
--
Peter Scott
http://www.perlmedic.com/
http://www.perldebugged.com/
| |
| Andrej Kastrin 2006-09-25, 6:57 pm |
| Peter Scott wrote:
> On Sat, 23 Sep 2006 11:51:54 +0200, Andrej Kastrin wrote:
>
> [...]
>
> [...]
>
>
> Good start, but you just shot yourself in the foot. Read 'perldoc -f tie'
> and pay especial attention starting at the second paragraph.
>
>
Peter, thanks for your response. I have already implemented the 'each' function to
iterate over the hash without building the entire list in memory.
Cheers, Andrej
| |
| usenet@DavidFilmer.com 2006-09-25, 6:57 pm |
| Andrej Kastrin wrote:
> Peter, thanks for your response. I have already implemented the 'each' function to
> iterate over the hash without building the entire list in memory.
You don't show that in your code:
foreach my $w (keys %words) {
You do realize that the each() function is not the same as foreach()?
Your code builds the entire list in memory, which is what Mr. Scott
warned you about.
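For comparison, here are the two loop styles side by side on a small ordinary hash. This is a sketch only; the sub names total_via_keys and total_via_each and the sample data are made up for illustration. Both give the same result, but only the first needs the whole list of keys up front.

```perl
use strict;
use warnings;

# foreach evaluates keys() in list context before the first
# iteration, so every key is held in memory at once.
sub total_via_keys {
    my ($h) = @_;
    my $total = 0;
    foreach my $w (keys %$h) {
        $total += $h->{$w};
    }
    return $total;
}

# each() returns one (key, value) pair per call and an empty
# list when the hash is exhausted.
sub total_via_each {
    my ($h) = @_;
    my $total = 0;
    keys %$h;    # void context: resets the hash's shared iterator
    while (my ($w, $c) = each %$h) {
        $total += $c;
    }
    return $total;
}

my %words = (apple => 3, pear => 1, plum => 2);
print total_via_keys(\%words), "\n";
print total_via_each(\%words), "\n";
```

With a hash tied to a DBM file the same distinction applies, which is why 'perldoc -f tie' recommends each() for large objects.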
--
David Filmer (http://DavidFilmer.com)