Home > Archive > PERL Miscellaneous > February 2006 > Parsing/sorting big file problem









Author Parsing/sorting big file problem
mcvallet@hotmail.com

2006-02-23, 7:00 pm

Hi,
I am writing a program that parses a 370 MB file. As long as I keep this
number below 1000 in this portion:
# basically tells me when to stop reading the file
if ($ligne =~ m/^.*1000>>>(\w+).*/){
$stop= 1;
}
it works, but as soon as I increase the number (the maximum being
2225), so that I am not even reading half of the file, the program stops
responding. Does anybody have a suggestion for this?
thank you,


########################################
######################################
$#complete = 4000000;

open(OUTPUTFILE, $outPut)
|| die "cannot open file";

#variable initialisation
my $countTotPositive = 0;
my $countTotNegative = 0;
my $stop= 0;
my $countTotProt = 0;
my @start = times();


while(($ligne = <OUTPUTFILE> ) && $stop == 0){
#identifying the protein being compared
if ($ligne =~ m/^.+(\d*)+>>>\s*(\w+).*/){
#the next commented lignes are here for test purposes
if ($ligne =~ m/^.*1200>>>(\w+).*/){
$stop= 1;
}
$protName1 = $2;
$protName1 =~ s/_//g;
$count = 0;
}
#parsing the results
else{
$_=$ligne ;
my $evalue= 0;
/^\s?(\w+).*\s+\(\s*(\d+)\)\W+(\d+)\W+(\d*)\.?(\d*)\W+(\d*)\.?(\d*)e?\+?(\d{1,2})$/so;
my $protName2=$1;
my $nbAa=$2;
my $eval3=$3;
my $eval4=$4;
my $eval5=$5;
$eval[0]="$6";
$eval[1]=$7;
my $eval8=$8;
$protName2 =~ s/_//g;
#finding out what is the evalue for this result
if ($ligne =~ m/e\+(\d{2,2})$/so){
$evalue = $eval[0].".".@eval;
for ($i = 0; $i < $eval8; $i++){
$evalue = $evalue * 10;
}
}else{
if ($eval[0] =~ m/^0/){
$evalue = $eval[0].".".$eval[1].$eval8;
}else{
$evalue = $eval[0].$eval[1].$eval8;
}
}

@sortedCouple = sort($protName1,$protName2);

if ($complete{"$sortedCouple[0]-$sortedCouple[1]"}[0]
|| $sortedCouple[0] =~ m/$sortedCouple[1]/i){

$evalue2 = $evalue;
#modifying the evalue 1 if the identical couple
if($sortedCouple[0] =~ m/$sortedCouple[1]/i){
$evalue1 = $evalue;
$identical =1;
$countTotPositive++;
}else{
$evalue1 = $complete{"$sortedCouple[0]-$sortedCouple[1]"}[0];
$identical =$complete{"$sortedCouple[0]-$sortedCouple[1]"}[1];
}
$complete{"$sortedCouple[0]-$sortedCouple[1]"} = [$protName1,
$protName2, $evalue1 + $evalue2, $identical, $evalue1, $evalue2];
$count++;
}
# temporaly saving the partial results
else{
$class1 = $classes{$protName1};
$class2 = $classes{$protName2};
$identical = ( $class1=~ m/$class2/ ? 1 : 0);
if ($identical == 1){
$countTotPositive++;
}else{
$countTotNegative++;
}
$complete{"$sortedCouple[0]-$sortedCouple[1]"} = [$evalue,
$identical];
}

}

}
close OUTPUTFILE;
#variable initialisation
$countPositive = 0;
$countNegative = 0;
foreach $complete (sort{$complete{$a}[2]<=> $complete{$b}[2]} keys
%complete) {
if ($complete{$complete}[3] == 1){
$countPositive++;
}else{
$countNegative++;
}
$newLigne =
$complete{$complete}[0]."\t".$complete{$complete}[1]."\t".$complete{$complete}[2]."\t".$complete{$complete}[3]."\t".$countPositive/$countTotPositive."\t".$countNegative/$countTotNegative."\t".$complete{$complete}[4]."\t".$complete{$complete}[5]."\n";
push @results,$newLigne;

}

@end = times();
# ============= Analyse results

print "Reading and parsing file took ",$end[0]-$start[0]," cpu seconds\n";

# creation du document
print "\n";
@start = times();
open (F,">results/5out.test");
print F "@results";
close F;
@end = times();
# ============= Analyse results

print "Writing the file results/5out.test took ",$end[0]-$start[0]," cpu seconds\n";


}
########################################
######################################

John W. Krahn

2006-02-23, 9:57 pm

mcvallet@hotmail.com wrote:
> I am writing a program that parses a 370 MB file. As long as I keep this
> number below 1000 in this portion:
> # basically tells me when to stop reading the file
> if ($ligne =~ m/^.*1000>>>(\w+).*/){
> $stop= 1;
> }
> it works, but as soon as I increase the number (the maximum being
> 2225), so that I am not even reading half of the file, the program stops
> responding. Does anybody have a suggestion for this?
> thank you,
>
>
> ########################################
> ######################################
> $#complete = 4000000;


You are expanding the array @complete to contain 4,000,001 elements but it
doesn't look like you are using that array anywhere. Perhaps it is causing
your problem?


John
--
use Perl;
program
fulfillment

mcvallet@hotmail.com

2006-02-24, 3:57 am

The only thing I know is that the array will contain 2225*2225 = 4,950,625
elements, and I thought I was using this array here:
$complete{"$sortedCouple[0]-$sortedCouple[1]"} = [$protName1,
$protName2, $evalue1 + $evalue2, $identical, $evalue1, $evalue2];
Did I mix up the $ and @?

Furthermore, at the beginning I was not expanding the array to this
size, but it was not working then either, which is why I tried to expand
the array.

mc

John W. Krahn

2006-02-24, 3:57 am

mcvallet@hotmail.com wrote:
> The only thing I know is that the array will contain 2225*2225 = 4,950,625
> elements, and I thought I was using this array here:
> $complete{"$sortedCouple[0]-$sortedCouple[1]"} = [$protName1,


That is using the hash %complete, not the array @complete.
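
To illustrate the point: Perl keeps arrays and hashes in separate namespaces, so pre-extending @complete does nothing for %complete. A minimal sketch follows; the key 'a-b' and the stored values are made up for the example.

```perl
use strict;
use warnings;

my @complete;   # the array that $#complete refers to
my %complete;   # the hash that $complete{...} refers to

$#complete = 9;                    # pre-extends the ARRAY to 10 slots
$complete{'a-b'} = [ 0.5, 1 ];     # stores into the HASH

my $array_size = scalar @complete;        # number of array elements
my $hash_size  = scalar keys %complete;   # number of hash keys

print "array: $array_size elements, hash: $hash_size key(s)\n";

# To pre-size a hash, assign to keys() instead. This only avoids
# repeated rehashing; it does not make the hash use less memory.
keys(%complete) = 4_000_000;
```

Pre-sizing would not have cured the memory problem discussed in this thread; it is shown only because it is the hash counterpart of the $#complete assignment.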


John
--
use Perl;
program
fulfillment

MSG

2006-02-24, 3:57 am


mcvallet@hotmail.com wrote:
> The only thing I know is that the array will contain 2225*2225 = 4,950,625
> elements, and I thought I was using this array here:
> $complete{"$sortedCouple[0]-$sortedCouple[1]"} = [$protName1,
> $protName2, $evalue1 + $evalue2, $identical, $evalue1, $evalue2];
> Did I mix up the $ and @?
>
> Furthermore, at the beginning I was not expanding the array to this
> size, but it was not working then either, which is why I tried to expand
> the array.
>
> mc


Where are 'use strict' and 'use warnings'?!
You can catch a lot of problems simply by using those, such as your
mixing up $complete{ } and $#complete (hash / array).
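
As a rough demonstration of what strict would have reported here, the sketch below compiles a two-line snippet with only the hash declared and captures the compile error. The snippet itself is an illustration, not the poster's full program.

```perl
use strict;
use warnings;

# Compile a small snippet under strict and capture the error, if any.
my $snippet = q{
    use strict;
    my %complete;
    $#complete = 4_000_000;   # refers to @complete, which was never declared
    1;
};

my $ok    = eval $snippet;
my $error = $@;

print $ok ? "compiled fine\n" : "strict says: $error";
```

Without 'use strict' the same assignment silently creates a global @complete, which is what happened in the posted code.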

January Weiner

2006-02-24, 3:57 am

mcvallet@hotmail.com wrote:
> Hi,


Hello,
first of all: I think you are parsing output of some sequence comparison
program. Maybe you could describe in more detail what you are trying to
do? Your code is long, incomplete, with messy indentation and
practically uncommented, so it is hard to see what you are doing. For
example, what about the %classes hash? Where does it come from, where is
it defined?

> 2225) so I am not even reading 1/2 of it, the program does not respond.
> Does anybody have a suggestion for this ?
> thank you,


Hm. From my experience with large protein data sets -- looks like your
program exhausts all of the memory. A couple of suggestions:

1) As far as I can tell, you do the following: you first parse the search
results (I assume these are search results) and evaluate them at the
same time, then you sort them according to e-value, then you save them
in a file. You can do the following:

- first do the parsing, and save the data on the fly to a temporary
file

- then open the temporary file, make the evaluation, sort the
results, remove redundant etc.

- how long are the protein names? Maybe that is the problem? If you
have hundreds of thousands of fasta-style descriptions, using them
for a hash table in Perl (your "%complete" hash) may be very
inefficient. Try to use only short ids.

- if everything else fails, instead of spending ws on correcting
your program (and there is, methinks, a lot to correct), try to get
your hands on a machine with more memory or a better OS and run
your calculations there.

- clean up your code, comment it, post it again here.

2) if I am correct in my assumption and you are writing a parser for
blast or ssearch or the results of a similar program, why don't you
use Bioperl?

(snip the code fragment)

j.

--
------------ January Weiner 3 -------------------------------------
Division of Bioinformatics, University of Muenster

January Weiner

2006-02-24, 3:57 am

mcvallet@hotmail.com wrote:
> The only thing I know is that the array will contain 2225*2225 = 4,950,625
> elements, and I thought I was using this array here:
> $complete{"$sortedCouple[0]-$sortedCouple[1]"} = [$protName1,


this is a hash. When you write $blah{foo}, you access the hash %blah and
get the value stored for the key 'foo'.

> $protName2, $evalue1 + $evalue2, $identical, $evalue1, $evalue2];
> Did I mix up the $ and @ ?


you mixed up the % and the @.

However, I think that your problem is rather the size of your data. You
have a hash with 5 million elements, right? Try to roughly estimate how
much memory this will take. You need to store 5 million keys, right? Each
key being at least some 10 characters, right? Not to mention the arrays
that you store in the hash, correct?

1) Make the hash keys as short as possible.

2) Maybe instead of using protein names as keys, encode the file with
results (protein name1 = 0 ; protein name2 = 1 etc.). And instead of
using a hash, use a two-dimensional array:

my $matrix = [ ] ;

while( <INPUT_FILE> ) {
... # do your stuff

my ($prot_a, $prot_b) ; # these will be numerical IDs, and not names

if($prot_a > $prot_b) { # sort
($prot_a, $prot_b) = ($prot_b, $prot_a) ;
}

$result = [ ] ;
... # do some more stuff
# fill up $result

# store the $result in the matrix
$matrix->[$prot_a][$prot_b] = $result ;
}
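
A runnable version of the idea above might look like the following. It is only a sketch: the helper id_for, the sample triples, and the choice to store a single summed e-value per cell are assumptions made for the example.

```perl
use strict;
use warnings;

my %id_of;      # protein name -> small integer ID
my @name_of;    # integer ID   -> protein name

# Hand out IDs in order of first appearance (hypothetical helper).
sub id_for {
    my ($name) = @_;
    unless (exists $id_of{$name}) {
        $id_of{$name} = scalar @name_of;
        push @name_of, $name;
    }
    return $id_of{$name};
}

my @matrix;     # $matrix[$lo][$hi] = summed e-value for the pair

# Made-up (query, hit, e-value) triples standing in for parsed lines.
my @hits = (
    [ 'd1tima', '1dqzB0', 99    ],
    [ '1dqzB0', 'd1tima', 120   ],
    [ 'd1tima', '1hbnC0', 1e+02 ],
);

for my $hit (@hits) {
    my ($query, $target, $evalue) = @$hit;
    my ($lo, $hi) = sort { $a <=> $b } id_for($query), id_for($target);
    $matrix[$lo][$hi] += $evalue;   # a-b and b-a land in the same cell
}

my $sum = $matrix[ id_for('d1tima') ][ id_for('1dqzB0') ];
print "d1tima-1dqzB0: $sum\n";
```

Storing one number per cell instead of a six-element array per hash key is where most of the saving would come from; the names can be recovered from @name_of when the results are written out.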

j.

--
------------ January Weiner 3 ---------------------+---------------
Division of Bioinformatics, University of Muenster

mcvallet@hotmail.com

2006-02-24, 6:58 pm

The entire code is not here, but you were correct, I was not using them.
thanks,
mc

mcvallet@hotmail.com

2006-02-24, 6:58 pm


> first of all: I think you are parsing output of some sequence comparison
> program.

exactly
> Maybe you could describe in more detail what you are trying to
> do? Your code is long, incomplete, with messy indentation and
> practically uncommented, so it is hard to see what you are doing.

Sorry
> For example, what about the %classes hash? Where does it come from, where is
> it defined?

%classes is a hash that contains the structural family (class) of each
protein. It is defined at the beginning of my code, which I did not post
because it works correctly.



>1) As far as I can tell, you do the following: you first parse the search
>results (I assume these are search results) and evaluate them at the
>same time, then you sort them according to e-value, then you save them
>in a file. You can do the following:
> - first do the parsing, and save the data on the fly to a temporary
> file


Not exactly, the results are already pre-parsed, but there are still
things in them that are not necessary. The file looks a bit like this:
1>>> d1tima_ 244 fragments - 244 aa
1dqzB0 ( 277) 4276 20.6 99
1hbnC0 ( 244) 4193 20.4 1e+02
1cxpD0 ( 463) 4140 20.3 2e+02
......
2225>>> another protein
the last 2225 results....

> - first do the parsing, and save the data on the fly to a temporary
> file


> - then open the temporary file, make the evaluation, sort the
> results, remove redundant etc.


> - how long are the protein names? Maybe that is the problem? If you
> have hundreds of thousands of fasta-style descriptions, using them
> for a hash table in Perl (your "%complete" hash) may be very
> inefficient. Try to use only short ids.

5 letters long

> - if everything else fails, instead of spending ws on correcting
> your program (and there is, methinks, a lot to correct), try to get
> your hands on a machine with more memory or a better OS and run
> your calculations there.


> - clean up your code, comment it, post it again here.

ok
thanks again,
mc

mcvallet@hotmail.com

2006-02-24, 6:58 pm

> Maybe you could describe in more detail what you are trying to
> do?

I want to get all the couples a-b and the sum of their e-values, eval_ab
+ eval_ba, and sort the results according to that sum.
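
Stripped of the parsing, that goal could be sketched as below. The hit list is invented, and storing one plain number per pair (rather than an array per pair) is a choice made for this example, not something taken from the posted program.

```perl
use strict;
use warnings;

# Made-up parsed hits: [query, target, e-value].
my @hits = (
    [ 'a', 'b', 3   ],
    [ 'b', 'a', 4   ],
    [ 'a', 'c', 0.5 ],
    [ 'c', 'a', 0.2 ],
);

my %sum_for;    # "a-b" -> eval_ab + eval_ba
for my $hit (@hits) {
    my ($query, $target, $evalue) = @$hit;
    my $pair = join '-', sort($query, $target);   # same key for a-b and b-a
    $sum_for{$pair} += $evalue;
}

# Pairs ordered by ascending summed e-value.
my @ranked = sort { $sum_for{$a} <=> $sum_for{$b} } keys %sum_for;

printf "%s\t%g\n", $_, $sum_for{$_} for @ranked;
```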

Michael Zawrotny

2006-02-24, 6:58 pm

mcvallet@hotmail.com <mcvallet@hotmail.com> wrote:
> Hi,
> I am writing a program that parses a 370 MB file. As long as I keep this
> number below 1000 in this portion:
> # basically tells me when to stop reading the file
> if ($ligne =~ m/^.*1000>>>(\w+).*/){
> $stop= 1;
> }
> it works, but as soon as I increase the number (the maximum being
> 2225), so that I am not even reading half of the file, the program stops
> responding. Does anybody have a suggestion for this?
> thank you,

[ snip ]
>
>
> if ($ligne =~ m/^.+(\d*)+>>>\s*(\w+).*/){
> #the next commented lignes are here for test purposes
> if ($ligne =~ m/^.*1200>>>(\w+).*/){
> $stop= 1;
> }


I think that the problem is in your regexps. A leading or trailing
".*" is almost always a mistake. It says "match 0 or more of
any single character" (not exactly, but pretty much). Because it is
greedy, it first grabs as much of the line as it can and then gives
characters back one at a time until the rest of the pattern matches.

Doing that at both the beginning and end of the line can lead to an
enormous amount of backtracking. You could try adding a non-greedy
qualifier ("?") after the ".*", or better yet, just drop the ".*"
entirely since it always matches and thus doesn't change the overall
outcome of the attempted match.
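
A small comparison of the two styles, assuming (from the sample data posted in this thread) that the record number sits at the very start of the header line and is followed by optional whitespace:

```perl
use strict;
use warnings;

my $line = '1200>>> d1tima_ 244 fragments - 244 aa';

# Original style: leading and trailing ".*" that contribute nothing.
my ($slow_name) = $line =~ m/^.*1200>>>\s*(\w+).*/;

# Anchored version: capture the record number instead of hard-coding it.
my ($number, $name) = $line =~ m/^(\d+)>>>\s*(\w+)/;

print "record $number, protein $name\n";
```

With the number captured, the stop condition becomes an ordinary numeric comparison on $number rather than a second pattern match.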


Mike

--
Michael Zawrotny
Institute of Molecular Biophysics
Florida State University | email: zawrotny@sb.fsu.edu
Tallahassee, FL 32306-4380 | phone: (850) 644-0069

Tad McClellan

2006-02-24, 6:58 pm

mcvallet@hotmail.com <mcvallet@hotmail.com> wrote:

> I am writing a program that parses a 370 MB file. As long as I keep this
> number below 1000 in this portion:
> # basically tells me when to stop reading the file
> if ($ligne =~ m/^.*1000>>>(\w+).*/){
> $stop= 1;
> }
> it works, but as soon as I increase the number



There is NO number in your pattern.

The "1000" is a string, not a number.


> $#complete = 4000000;



You can avoid getting fingerprints on the screen (from counting zeros):

$#complete = 4_000_000;



> open(OUTPUTFILE, $outPut)
> || die "cannot open file";



You are opening OUTPUTFILE for *input*.

That is a pretty strange choice of filehandle name...

You should include the $! variable in your die message.
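
For instance, with a lexical filehandle and the three-argument form of open. The path is hypothetical and deliberately does not exist, so the message can be seen; real code would normally just write "open ... or die ...".

```perl
use strict;
use warnings;

# Hypothetical path; it is expected NOT to exist, to show the message.
my $path = 'no/such/dir/results.txt';

my $message = '';
if (open my $in, '<', $path) {
    close $in;
}
else {
    # $! holds the operating system's reason for the failure.
    $message = "cannot open '$path' for reading: $!";
}

print "$message\n";
```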


> while(($ligne = <OUTPUTFILE> ) && $stop == 0){



You don't need the $stop flag if you simply last() out of the
while loop at the appropriate place.
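
Roughly like this, using an in-memory filehandle as a stand-in for the real file. The data and the $limit variable are made up for the example.

```perl
use strict;
use warnings;

# Stand-in for the input file: five records of two lines each.
my $data = '';
$data .= "$_>>> prot$_\nhit line\n" for 1 .. 5;
open my $in, '<', \$data or die "cannot open in-memory file: $!";

my $limit = 3;          # stop when this record number is reached
my @seen;

while (my $line = <$in>) {
    if ($line =~ /^(\d+)>>>/) {
        last if $1 == $limit;   # leave the loop right here, no flag needed
        push @seen, $1;
    }
}
close $in;

print "processed records: @seen\n";
```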


> #identifying the protein being compared
> if ($ligne =~ m/^.+(\d*)+>>>\s*(\w+).*/){

^^^^^^
^^^^^^

That part of your pattern makes no sense to me.

Did you mean (\d+) instead?


> #the next commented lignes are here for test purposes



The next lines are not "commented"...


> if ($ligne =~ m/^.*1200>>>(\w+).*/){
> $stop= 1;



last; # exit the while loop, avoid the problem immediately below


> }
> $protName1 = $2;



If that pattern matches, then it will wipe out $2 from the
earlier pattern match, and you will store an undef into $protName1.

The dollar-digit variables are set/reset at each successful pattern match.
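
A minimal demonstration of that reset, with two successful matches in a row on a made-up header line (success checks are left out only to keep the demonstration short):

```perl
use strict;
use warnings;

my $line = '1200>>> d1tima_';

$line =~ /^(\d+)>>>\s*(\w+)/;    # sets $1 = '1200', $2 = 'd1tima_'
my $name_saved = $2;             # copy the capture right away

$line =~ /1200>>>\s*(\w+)/;      # succeeds with ONE group, so $2 is reset
my $name_late = $2;              # now undef

print "saved: $name_saved, late: ",
      defined $name_late ? $name_late : 'undef', "\n";
```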


> $protName1 =~ s/_//g;



Regexes are for strings. tr/// is for characters.

$protName1 =~ tr/_//d;
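
Both forms give the same result on a made-up name; tr/// simply skips the regex engine:

```perl
use strict;
use warnings;

my $protein = 'd1_tim_a_';

(my $with_s  = $protein) =~ s/_//g;    # regex substitution
(my $with_tr = $protein) =~ tr/_//d;   # character deletion, no regex engine

print "$with_s $with_tr\n";
```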


> /^\s?(\w+).*\s+\(\s*(\d+)\)\W+(\d+)\W+(\d*)\.?(\d*)\W+(\d*)\.?(\d*)e?\+?(\d{1,2})$/so;
> my $protName2=$1;



You should *never* use the dollar-digit variables unless you
have first ensured that the pattern match *succeeded*.
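
One way to do that is to assign the captures in list context inside the condition, so the variables exist only when the match succeeded. The sketch assumes each result sits on a single line, and the column names (score, zscore) are guesses at what the fields mean.

```perl
use strict;
use warnings;

# One result line, assuming name, (length), score, z-score and e-value
# all sit on a single line in the real file.
my $line = '1hbnC0 ( 244) 4193 20.4 1e+02';

my %hit;
if (my ($name, $length, $score, $zscore, $evalue) =
        $line =~ /^\s*(\w+)\s+\(\s*(\d+)\)\s+(\d+)\s+(\S+)\s+(\S+)\s*$/) {
    %hit = (
        name   => $name,
        length => $length,
        score  => $score,
        zscore => $zscore,
        evalue => $evalue + 0,   # Perl converts "1e+02" to a number itself
    );
}
else {
    warn "unparsable line: $line\n";
}

print "$hit{name}: e-value $hit{evalue}\n" if %hit;
```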


> my $eval3=$3;
> my $eval4=$4;
> my $eval5=$5;



Sequentially named variables very often indicate that there is
a better choice of data structure, such as an array rather than
a bunch of independent scalars.


> $eval[0]="$6";



What were you hoping that those double quotes would do for you?

perldoc -q vars


> #finding out what is the evalue for this result
> if ($ligne =~ m/e\+(\d{2,2})$/so){



You should not throw modifiers on the end willy-nilly like that.

Add modifiers when they will make a difference, and that difference
is what you want to happen.


m//s changes the meaning of dot (.), it has no effect when there
is no dot in your pattern.

m//o is used when you have variables in your pattern, it has
no effect when there are no variables in your pattern.

if ($ligne =~ m/e\+(\d{2})$/){
or
if ($ligne =~ m/e\+(\d\d)$/){

Is probably easier to read and understand.



> for ($i = 0; $i < $eval8; $i++){
> $evalue = $evalue * 10;
> }



$evalue *= 10 for 1 .. $eval8; # replaces that entire for-loop
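
It may also be worth noting that Perl converts strings in exponent notation to numbers on its own, so if the e-value field can be captured as one token, the mantissa and exponent need not be reassembled by hand. A sketch with made-up field values:

```perl
use strict;
use warnings;

# E-value fields as they might appear in the file (values are examples).
my @fields = ('99', '1e+02', '2e+02', '3.5e-07');
my @values = map { $_ + 0 } @fields;    # numeric conversion is built in

# The one-liner above, applied to a hypothetical mantissa 2 and exponent 2.
my ($evalue, $eval8) = (2, 2);
$evalue *= 10 for 1 .. $eval8;

print "@values | $evalue\n";
```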



> $newLigne =
> $complete{$complete}[0]."\t".$complete{$complete}[1]."\t".$complete{$complete}[2]."\t".$complete{$complete}[3]."\t".$countPositive/$countTotPositive."\t".$countNegative/$countTotNegative."\t".$complete{$complete}[4]."\t".$complete{$complete}[5]."\n";



That is simply too horrid to look upon.

This should do the same thing (assuming that there are only 6
elements in the array) without making you scream:

$newLigne = join("\t", @{ $complete{$complete} }) . "\n";


> push @results,$newLigne;



You don't even need the $newLigne temporary variable:

push @results, join("\t", @{ $complete{$complete} }) . "\n";
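
Note that the original line also puts two running ratios between elements 3 and 4, so a plain join over the whole array would drop them. If those columns are wanted, array slices keep the line readable. The record and counter values below are invented; the variable names mirror the posted code.

```perl
use strict;
use warnings;

# Hypothetical record and counters, named as in the posted code.
my $record = [ 'protA', 'protB', 0.7, 1, 0.5, 0.2 ];
my ($countPositive, $countTotPositive) = (1, 4);
my ($countNegative, $countTotNegative) = (0, 2);

my $newLigne = join("\t",
    @{$record}[0 .. 3],                   # names, summed e-value, identical flag
    $countPositive / $countTotPositive,   # running positive ratio
    $countNegative / $countTotNegative,   # running negative ratio
    @{$record}[4, 5],                     # the two individual e-values
) . "\n";

print $newLigne;
```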



> open (F,">results/5out.test");



You should always, yes *always*, check the return value from open():

open (F,">results/5out.test") or
die "could not open 'results/5out.test' $!";


--
Tad McClellan SGML consulting
tadmc@augustmail.com Perl programming
Fort Worth, Texas

Salvador Fandino

2006-02-25, 7:56 am

mcvallet@hotmail.com wrote:
> Hi,
> I am writing a program that parses a 370 MB file. As long as I keep this
> number below 1000 in this portion:
> # basically tells me when to stop reading the file
> if ($ligne =~ m/^.*1000>>>(\w+).*/){
> $stop= 1;
> }
> it works, but as soon as I increase the number (the maximum being
> 2225), so that I am not even reading half of the file, the program stops
> responding. Does anybody have a suggestion for this?
> thank you,
> ...


read the file in blocks, sort and save them in temp. files and finally
perform a merge sort:

Untested:

use warnings;
use strict;
use Sort::Key 'keysort_inplace';
use Sort::Key::Merger 'filekeymerger';
use File::Temp ...;

my @lines;
my @tempfn;

sub extract_key {
# extract the key that has to be used for sorting
# from $_, for instance:
/foo: (\w+)/;
$1
}

sub sort_and_write_block {
&keysort_inplace(\&extract_key, \@lines);
my ($fh, $filename) = File::Temp::tempfile(...); # tempfile() returns (handle, name)
print $fh $_ for @lines;
close $fh;
push @tempfn, $filename;
@lines = ();
}

while (<> ) {
unless ($fh) {
($fh, $fn) = File::Temp->new(...);
}
sort_and_write_block() if @lines > 1000000
}

sort_and_write_block() if @lines;

my $merger = &filekeymerger(\&extract_key, @tempfn);

while (defined (my $line = $merger->())) {
# your lines arrive sorted here,
# do whatever you need with them!
...
}



Cheers,

- Salva

Salvador Fandino

2006-02-25, 7:56 am

Salvador Fandino wrote:

> ...
> while (<> ) {
> unless ($fh) {
> ($fh, $fn) = File::Temp->new(...);
> }
> sort_and_write_block() if @lines > 1000000
> }


oops, that should be...

while(<> ) {
push @lines, $_;
sort_and_write_block() if @lines > 1000000;
}
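
For readers without Sort::Key installed, the same block-sort-then-merge idea can be sketched with core modules only. This is not the code above: it swaps Sort::Key and Sort::Key::Merger for plain sort and a hand-written merge, and the key function, the sample lines, and the tiny block size are all assumptions for the demonstration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Sort key for one line; here, hypothetically, the numeric last column.
sub key_of {
    my ($line) = @_;
    my @fields = split ' ', $line;
    return $fields[-1];
}

# Sort one block in memory and write it to its own temporary file.
sub write_sorted_block {
    my ($lines) = @_;
    my ($fh, $filename) = tempfile(UNLINK => 1);
    print {$fh} sort { key_of($a) <=> key_of($b) } @$lines;
    close $fh or die "cannot close $filename: $!";
    return $filename;
}

# Merge already-sorted files, holding only one line per file in memory.
sub merge_sorted_files {
    my ($callback, @filenames) = @_;
    my @sources;
    for my $filename (@filenames) {
        open my $fh, '<', $filename or die "cannot open $filename: $!";
        my $line = <$fh>;
        push @sources, { fh => $fh, line => $line } if defined $line;
    }
    while (@sources) {
        # Pick the source whose current line has the smallest key.
        my ($best) = sort { key_of($a->{line}) <=> key_of($b->{line}) } @sources;
        $callback->($best->{line});
        $best->{line} = readline($best->{fh});
        @sources = grep { defined $_->{line} } @sources;
    }
}

# Demo with a tiny block size; a real run would use something far larger.
my $block_size = 2;
my @input = map { "pair$_\t$_\n" } (5, 3, 9, 1, 7);
my (@block, @tempfiles);
for my $line (@input) {
    push @block, $line;
    if (@block >= $block_size) {
        push @tempfiles, write_sorted_block(\@block);
        @block = ();
    }
}
push @tempfiles, write_sorted_block(\@block) if @block;

my @sorted;
merge_sorted_files(sub { push @sorted, $_[0] }, @tempfiles);
print @sorted;
```

Re-sorting the list of sources for every output line is fine for a handful of temporary files; with many files a heap would be the usual replacement.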


Cheers,

- Salva

mcvallet@hotmail.com

2006-02-27, 7:01 pm

Thank you everybody,
it seems to work now, not on my computer but on a bigger one, and
it does not take very long either...
thanks,
mc
