Home > Archive > PERL Beginners > March 2008 > diff says memory exhausted need help with perl
diff says memory exhausted need help with perl
| tc314@hotmail.com 2008-03-28, 7:04 pm |
| I've got two similar large files with one word per line and they're
sorted.
Each file has a few words not in the other.
I typically identify the unique words in each file using diff, grep, and cut.
When the files are too big (2 GB), diff dies with "memory exhausted".
I want to search for the unique words in file1 but I might need to
ping-pong since neither file is a superset of the other.
I don't want to be limited by physical RAM as the file sizes exceed
RAM.
I assume I'm not the first to have this problem.
Can someone point me to Perl code?
TIA
| |
| Lawrence Statton 2008-03-28, 7:04 pm |
|
If you're using GNU diff (i.e. the diff that comes with most Linux
distributions), --speed-large-files might help you, without having to
jump through a Perl hoop.
--L
| |
| John W. Krahn 2008-03-28, 10:04 pm |
| tc314@hotmail.com wrote:
> I've got two similar large files with one word per line and they're
> sorted.
> Each file has a few words not in the other.
> I typically identify the unique words in each file using diff, grep, and cut.
> When the files are too big (2 GB), diff dies with "memory exhausted".
>
> I want to search for the unique words in file1 but I might need to
> ping-pong since neither file is a superset of the other.
> I don't want to be limited by physical RAM as the file sizes exceed
> RAM.
>
> I assume I'm not the first to have this problem.
> Can someone point me to Perl code?
This appears to do what you require:
#!/usr/bin/perl
use warnings;
use strict;

my ( $file1, $file2 ) = ( 'file1', 'file2' );

open my $F1, '<', $file1 or die "Cannot open '$file1' $!";
open my $F2, '<', $file2 or die "Cannot open '$file2' $!";

# Prime both streams; undef marks a file that has run out.
# Assumes every line, including the last, ends with a newline.
my $first  = <$F1>;
my $second = <$F2>;

# Loop on the lines themselves rather than on eof, so the last lines
# read from each file are still compared.
while ( defined $first or defined $second ) {
    if ( !defined $second or ( defined $first and $first lt $second ) ) {
        print "$file1: $first";
        $first = <$F1>;
    }
    elsif ( !defined $first or $first gt $second ) {
        print "$file2: $second";
        $second = <$F2>;
    }
    else {
        # Same word in both files: skip it and advance both.
        $first  = <$F1>;
        $second = <$F2>;
    }
}

__END__
John
--
Perl isn't a toolbox, but a small machine shop where you
can special-order certain sorts of tools at low cost and
in short order. -- Larry Wall
| |
| tc314@hotmail.com 2008-03-29, 8:01 am |
| On Mar 28, 6:54 pm, lawre...@cluon.com (Lawrence Statton) wrote:
> If you're using GNU diff (i.e. the diff that comes with most Linux
> distributions), --speed-large-files might help you, without having to
> jump through a Perl hoop.
>
> --L
Problems:
1) It runs out of memory on 8 GB of files with 2 GB of RAM.
2) It assumes a number of lines (3999) because it doesn't know whether it
will find a difference in one line or a million lines.
(2b: This goes against the *nix pipe concept because it then pushes this
unwieldy block to the next pipe, 'cut', rather than gracefully streaming
from pipe to pipe.)
3) The heuristic approach is an imprecise solution to an exact problem.
It doesn't work perfectly every time.
For most files the simple bash scripts are clean, self-documenting, and
fine.
It's natural in Perl.
I'm battling syntax and trying to avoid physical RAM issues entirely.
Thanks
| |
| Rob Dixon 2008-03-29, 7:12 pm |
| tc314@hotmail.com wrote:
>
> On Mar 28, 6:54 pm, lawre...@cluon.com (Lawrence Statton) wrote:
>
> Problems:
[snip]
>
> 3) The heuristic approach is an imprecise solution to an exact
> problem. It doesn't work perfectly every time.
>
[snip]
Come to think of it:
What heuristic approach?
Rob
| |
| Rob Dixon 2008-03-29, 7:12 pm |
| tc314@hotmail.com wrote:
>
> On Mar 28, 6:54 pm, lawre...@cluon.com (Lawrence Statton) wrote:
>
> Problems:
> 1) It runs out of memory on 8 GB of files with 2 GB of RAM.
> 2) It assumes a number of lines (3999) because it doesn't know whether it
> will find a difference in one line or a million lines.
> (2b: This goes against the *nix pipe concept because it then pushes this
> unwieldy block to the next pipe, 'cut', rather than gracefully streaming
> from pipe to pipe.)
> 3) The heuristic approach is an imprecise solution to an exact problem.
> It doesn't work perfectly every time.
>
> For most files the simple bash scripts are clean, self-documenting, and
> fine.
> It's natural in Perl.
> I'm battling syntax and trying to avoid physical RAM issues entirely.
The diff utility is a general-purpose application that generates a
minimal edit (list of changes) to transform one file into another. If
both files are sorted the problem reduces enormously, and using diff is
overkill. Please take a look at the code that John posted. If it doesn't
do what you require in a tiny fraction of the time taken by your current
method I will be astonished.
Rob
(For those interested, the algorithm employed by diff has a performance
between O(N) (no changes) and O(N^2). It is documented at
http://www.xmailserver.org/diff2.pdf)
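
The point about sorted input can be sketched as a single merge-style pass that holds only one line from each file in memory, so file size relative to RAM stops mattering. This is an illustrative sketch rather than anything posted in the thread: the helper names make_reader and sorted_unique are hypothetical, and it assumes both files were sorted in plain byte order (for example with LC_ALL=C sort), which is the order Perl's lt and gt use when "use locale" is not in effect. The reader dies if it sees a line out of order, because the comparison would silently give wrong answers otherwise.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns a closure that yields
# the next chomped line from $fh, or undef at end of file. It dies if a line
# sorts before the previous one, since the merge below relies on the order
# that Perl's lt/gt operators use.
sub make_reader {
    my ( $fh, $name ) = @_;
    my $prev;
    return sub {
        my $line = <$fh>;
        return undef unless defined $line;
        chomp $line;
        die "$name is not sorted at '$line'\n"
            if defined $prev and $line lt $prev;
        return $prev = $line;
    };
}

# Hypothetical helper: one pass over both inputs, holding one line from
# each in memory. Calls $only_a or $only_b for words found in just one.
sub sorted_unique {
    my ( $read_a, $read_b, $only_a, $only_b ) = @_;
    my $word_a = $read_a->();
    my $word_b = $read_b->();
    while ( defined $word_a or defined $word_b ) {
        if ( !defined $word_b or ( defined $word_a and $word_a lt $word_b ) ) {
            $only_a->($word_a);
            $word_a = $read_a->();
        }
        elsif ( !defined $word_a or $word_a gt $word_b ) {
            $only_b->($word_b);
            $word_b = $read_b->();
        }
        else {    # the word is in both inputs, so skip it
            $word_a = $read_a->();
            $word_b = $read_b->();
        }
    }
    return;
}

# Usage with in-memory filehandles standing in for the two big files.
my ( $data_a, $data_b ) = ( "apple\nbanana\ncherry\n", "banana\ncherry\ndate\n" );
open my $fh_a, '<', \$data_a or die "Cannot open in-memory file: $!";
open my $fh_b, '<', \$data_b or die "Cannot open in-memory file: $!";
my ( @left, @right );
sorted_unique(
    make_reader( $fh_a, 'first input' ),
    make_reader( $fh_b, 'second input' ),
    sub { push @left,  $_[0] },
    sub { push @right, $_[0] },
);
print "only in first:  @left\n";
print "only in second: @right\n";
```

For real files, the two callbacks could print to two separate output handles instead of pushing onto arrays, which keeps memory use flat and gives the "ping-pong" result (words unique to each side) in one pass.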