Home > Archive > PERL Miscellaneous > August 2004 > Performance Improvement of complex data structure (hash of hashes of hashes)
You are viewing an archived Text-only version of the thread.
To view this thread in it's original format and/or if you want to reply to
this thread please [click here]
| Author |
Performance Improvement of complex data structure (hash of hashes of hashes)
|
|
| Scott Gilpin 2004-08-24, 4:40 pm |
| Hi everyone -
I'm trying to improve the performance (runtime) of a program that
processes large files. The output of the processing is some fixed
number of matrices (that can vary between invocations of the program),
each of which has a different number of rows, and the same number of
columns. However, the number of rows and columns may not be known
until the last row of the original file is read. The original file
contains approximately 100 millon rows. Each individual matrix has
between 5 and 200 rows, and between 50 and 10000 columns. The data
structure I'm using is a hash of hashes of hashes that stores this
info. N is the total number of columns, M1 is the total number of
rows in matrix #1, M2 is the total number of rows in matrix 2, etc,
etc. The total number of matrices is between 3 and 15.
matrix #1 => row name 1 => col name 1 => value of 1,1
col name 2 => value of 1,2
......
col name N => value of 1,N
row name 2 => col name 1 => value of 2,1
col name 2 => value of 2,2
......
col name N => value of 2,N
.....
row name M1=> col name 1 => value of M1,1
col name 2 => value of M1,2
......
col name N => value of M1,N
matrix #2 => row name 1 => col name 1 => value of 2,1
col name 2 => value of 2,1
......
col name N => value of 2,1
.....
row name M2=> col name 1 => value of M2,1
col name 2 => value of M2,N
......
col name N => value of M2,N
etc, etc...
Here is the code that I'm using to build up this data structure. I'm
running perl version 5.8.3 on solaris 8 (sparc processor). The system
is not memory bound or cpu bound - this program is really the only
thing that runs. There are several gigabytes of memory, and this
program doesn't grow bigger than around 100 MB. Right now the run time
for the following while loop with 100 million rows of data is about 6
hours. Any small improvements would be great.
## loop to process each row of the original data
while(<INDATA> )
{
chomp($_);
## Each row is delimited with |
my @original_row = split(/\|/o,$_);
## The cell value and the column name are always in the same
position
my $cell_value = $original_row[24];
my $col_name = $original_row[1];
## Add this column name to the list of ones we've seen
$columns_seen{$col_name}=1;
## For each matrix, loop through and increment the
row/column value
foreach my $matrix (@matrixList)
{
## positionHash tells the position of the value for
## this matrix in the original data row
my $row_name = $original_row[$positionHash{$matrix}];
$matrix_values{$matrix}{$row_name}{$col_
name} +=
$cell_value;
}
} ## end while
I tried using DProf & dprofpp, but that didn't reveal anything
interesting. I also tried setting the initial size of each hash using
'keys', but this didn't show any improvement. I could only initialize
the hash of hashes - and not the third level of hashes (since I don't
know the values in the second hash until they are read in from the
file). I know that memory allocation in C is expensive, as is
re-hashing - I suspect that's what's taking up a lot of the time.
My specific questions are:
Is there a profiler for perl that will produce output with information
about the underlying C function calls? (eg - malloc) Or at least
more information than DProf?
Is there a more suitable data structure that I should use?
Is there a way to allocate all the memory I would need at the beginning
of the program, to eliminate subsequent memory allocation and
rehashing? (My system has plenty of memory)
Anything else I'm missing?
Thanks in advance.
Scott
| |
| Anno Siegel 2004-08-25, 9:00 am |
| Scott Gilpin <sgilpin@gmail.com> wrote in comp.lang.perl.misc:
> Hi everyone -
>
> I'm trying to improve the performance (runtime) of a program that
> processes large files. The output of the processing is some fixed
> number of matrices (that can vary between invocations of the program),
> each of which has a different number of rows, and the same number of
> columns. However, the number of rows and columns may not be known
> until the last row of the original file is read. The original file
> contains approximately 100 millon rows. Each individual matrix has
> between 5 and 200 rows, and between 50 and 10000 columns. The data
> structure I'm using is a hash of hashes of hashes that stores this
> info. N is the total number of columns, M1 is the total number of
> rows in matrix #1, M2 is the total number of rows in matrix 2, etc,
> etc. The total number of matrices is between 3 and 15.
[...]
> Here is the code that I'm using to build up this data structure. I'm
> running perl version 5.8.3 on solaris 8 (sparc processor). The system
> is not memory bound or cpu bound - this program is really the only
> thing that runs. There are several gigabytes of memory, and this
> program doesn't grow bigger than around 100 MB. Right now the run time
> for the following while loop with 100 million rows of data is about 6
> hours. Any small improvements would be great.
It shouldn't take that long, unless the data structure blows up way
beyond 100 MB.
> ## loop to process each row of the original data
> while(<INDATA> )
> {
> chomp($_);
>
>
> ## Each row is delimited with |
> my @original_row = split(/\|/o,$_);
>
> ## The cell value and the column name are always in the same
> position
> my $cell_value = $original_row[24];
> my $col_name = $original_row[1];
>
> ## Add this column name to the list of ones we've seen
> $columns_seen{$col_name}=1;
Where is this used?
> ## For each matrix, loop through and increment the
> row/column value
> foreach my $matrix (@matrixList)
Where is @matrixList set?
> {
>
> ## positionHash tells the position of the value for
> ## this matrix in the original data row
> my $row_name = $original_row[$positionHash{$matrix}];
Where is %positionHash set?
> $matrix_values{$matrix}{$row_name}{$col_
name} +=
> $cell_value;
> }
>
> } ## end while
This code isn't runnable. How are we to improve code we can't run?
To make it runnable, I had to realize that %positionHash is nowhere
set and come up with a plausible one. Same for @matrixList. I had
to find that %columns_seen is nowhere used, and discard it. Then I
had to generate a set of input data of for the program to run with.
It would have been your job to do that, and you are far better equipped
to do it.
That said, I don't see room for fundamental improvement. Apparently
each "cell value" contributes to all matrices in the same column,
but in lines that are determined indirectly (though %positionHash).
You program does that without doing any excessive extra work. There
may be re-arrangements of the data structure with corresponding code
adaptions that make it marginally faster, but I wouldn't expect
anything better than 10%.
> I tried using DProf & dprofpp, but that didn't reveal anything
> interesting.
It can't. DProf works on subroutine basis, but your code doesn't
use any subroutines.
> I also tried setting the initial size of each hash using
> 'keys', but this didn't show any improvement. I could only initialize
> the hash of hashes - and not the third level of hashes (since I don't
> know the values in the second hash until they are read in from the
> file). I know that memory allocation in C is expensive, as is
> re-hashing - I suspect that's what's taking up a lot of the time.
One thing to observe is whether program speed deteriorates over
time. Just print something to stderr every so-many records and
watch the rhythm. If it gets slower with time, the problem is most
likely memory related. If it doesn't, you're none the wiser.
Anno
| |
| Scott Gilpin 2004-08-25, 3:57 pm |
| Anno Siegel wrote:
> Scott Gilpin <sgilpin@gmail.com> wrote in comp.lang.perl.misc:
program),[color=darkred]
of[color=darkred]
>
> [...]
>
I'm[color=darkred]
system[color=darkred]
time[color=darkred]
6[color=darkred]
>
> It shouldn't take that long, unless the data structure blows up way
> beyond 100 MB.
>
>
> Where is this used?
>
>
> Where is @matrixList set?
>
>
> Where is %positionHash set?
>
>
> This code isn't runnable. How are we to improve code we can't run?
Thanks for your reply. I apologize for not being more clear in my
original post. I've included the entire code to produce the desired
output. The first 10 lines of the input file are:
928219|7|6|MI|2
928219|9|5|MO|1
928219|11|5|CA|41
928219|8|6|MA|1
928219|5|5|WY|3
701396|10|7|QC|8
701396|17|1|MI|1
928219|0|3|CA|2
701396|13|1|CA|2
928219|1|1|CA|2
The header is:
col_name|matrix1_rows|matrix2_rows|matri
x3_rows|cell_values
The source code is:
#!/usr/local/bin/perl5.8.3
use strict;
## The list of matrices actually varies between invocations
## of the program - anywhere from 3 - 15
my @matrixList = ("matrix1", "matrix2", "matrix3");
## Position hash has the row positions of the values for each matrix
my %positionHash = (matrix1 => 1, matrix2 => 2, matrix3 => 3);
## Keep track of the columns we've seen so far
my %columns_seen = ();
## hash of hashes of hashes - matrix => rows => columns
my %matrix_values = ();
open (INDATA, "data.txt") ||
die "can't open data.txt";
while(<INDATA> ) {
chomp($_);
## Each row is variable width, delimited with |
my @original_row = split(/\|/,$_);
## The cell value and the column name are always in the same
## position
my $cell_value = $original_row[4];
my $col_name = $original_row[0];
## Add this column name to the list of ones we've seen
$columns_seen{$col_name}=1;
## For each matrix, loop through and increment the
## row/column value
foreach my $matrix (@matrixList) {
my $row_name = $original_row[$positionHash{$matrix}];
$matrix_values{$matrix}{$row_name}{$col_
name} += $cell_value;
}
} ## end while
## The following code runs very quicky compare to the
## while loop above (2 mins vs. 6 hrs)
## I'm only including it to produce the desired output
## Create a header row with column names that is the same
## for all matrices
my $header = "";
foreach my $col_name (sort keys %columns_seen) {
$header = $header . "," . "$col_name";
}
## Create a file for each separate matrix
foreach my $matrix (@matrixList) {
## Open output file
my $OUT_FILE = $matrix . ".csv";
open (OUTFILE, ">$OUT_FILE") || die "can't open $OUT_FILE";
## Now we create the first line of file.
## Starting with the matrix name and a comma.
## Then printing out the column names.
my $firstline = $matrix . "$header";
print OUTFILE "$firstline\n";
## Loop for each row in the matrix
foreach my $row_name (keys(%{$matrix_values{$matrix} } )) {
my $line = $row_name;
## Loop for each column in the matrix
foreach my $col_name (sort keys %columns_seen) {
my $cell_value;
if ($matrix_values{$matrix}{$row_name}{$col
_name}) {
$cell_value =
$matrix_values{$matrix}{$row_name}{$col_
name};
} else {
$cell_value = ".";
}
$line = $line . ",$cell_value";
}
print OUTFILE "$line\n";
}
close OUTFILE;
}
>
> To make it runnable, I had to realize that %positionHash is nowhere
> set and come up with a plausible one. Same for @matrixList. I had
> to find that %columns_seen is nowhere used, and discard it. Then I
> had to generate a set of input data of for the program to run with.
> It would have been your job to do that, and you are far better
equipped
> to do it.
>
> That said, I don't see room for fundamental improvement. Apparently
> each "cell value" contributes to all matrices in the same column,
> but in lines that are determined indirectly (though %positionHash).
>
> You program does that without doing any excessive extra work. There
> may be re-arrangements of the data structure with corresponding code
> adaptions that make it marginally faster, but I wouldn't expect
> anything better than 10%.
>
>
> It can't. DProf works on subroutine basis, but your code doesn't
> use any subroutines.
>
using[color=darkred]
initialize[color=darkred]
don't[color=darkred]
>
> One thing to observe is whether program speed deteriorates over
> time. Just print something to stderr every so-many records and
> watch the rhythm. If it gets slower with time, the problem is most
> likely memory related. If it doesn't, you're none the wiser.
The runtime scales linearly with the number of rows, so
I don't believe it to be a memory issue.
>
> Anno
| |
| ctcgag@hotmail.com 2004-08-26, 3:56 am |
| "Scott Gilpin" <sgilpin@gmail.com> wrote:
> Here is the code that I'm using to build up this data structure. I'm
> running perl version 5.8.3 on solaris 8 (sparc processor). The system
> is not memory bound or cpu bound -
If it is not memory or cpu bound, then it must be I/O bound. That is
pretty much the only other choice, right? Are you sure it isn't CPU bound?
The fact that it is the only thing that runs on that machine doesn't mean
it isn't CPU bound.
> ## loop to process each row of the original data
> while(<INDATA> )
> {
> chomp($_);
>
> ## Each row is delimited with |
> my @original_row = split(/\|/o,$_);
} #Let us say you end the loop right there.
#How long does it take to run now?
Xho
--
-------------------- http://NewsReader.Com/ --------------------
Usenet Newsgroup Service $9.95/Month 30GB
| |
| Anno Siegel 2004-08-26, 8:56 am |
| Scott Gilpin <sgilpin@gmail.com> wrote in comp.lang.perl.misc:
> Anno Siegel wrote:
[...]
[color=darkred]
> Thanks for your reply. I apologize for not being more clear in my
> original post. I've included the entire code to produce the desired
> output. The first 10 lines of the input file are:
>
> 928219|7|6|MI|2
> 928219|9|5|MO|1
> 928219|11|5|CA|41
> 928219|8|6|MA|1
> 928219|5|5|WY|3
> 701396|10|7|QC|8
> 701396|17|1|MI|1
> 928219|0|3|CA|2
> 701396|13|1|CA|2
> 928219|1|1|CA|2
>
> The header is:
>
> col_name|matrix1_rows|matrix2_rows|matri
x3_rows|cell_values
>
> The source code is:
Yeah, that's much better.
I have added some critical remarks to your code, but to summarize,
I still don't see why it is slow, and I agree that it shouldn't take
6 hours on a modern machine.
To pinpoint the problem, exclude parts of the code from processing.
How long does it take just to read the file and do nothing with
the lines? What if you don't split the lines but use a fixed array
instead of the line data? What if you *do* split, but don't make
the matrix entries? Etc... See my suggestion for timing code below
to facilitate the evaluation.
So:
> #!/usr/local/bin/perl5.8.3
>
> use strict;
No warnings? You should always switch warnings on, especially when
juggling nested data structures.
> ## The list of matrices actually varies between invocations
> ## of the program - anywhere from 3 - 15
> my @matrixList = ("matrix1", "matrix2", "matrix3");
>
> ## Position hash has the row positions of the values for each matrix
> my %positionHash = (matrix1 => 1, matrix2 => 2, matrix3 => 3);
>
> ## Keep track of the columns we've seen so far
> my %columns_seen = ();
>
> ## hash of hashes of hashes - matrix => rows => columns
> my %matrix_values = ();
>
> open (INDATA, "data.txt") ||
> die "can't open data.txt";
>
To facilitate run-time evaluation, add this:
my $linecount = 0;
my $time = - times;
> while(<INDATA> ) {
$linecount ++;
Code in {} should be indented!
>
> chomp($_);
Just
chomp;
does the same thing. The way things are you don't need to chomp
at all (because the last field is numeric and doesn't suffer from
a trailing "\n"), but I'd leave it in.
>
> ## Each row is variable width, delimited with |
> my @original_row = split(/\|/,$_);
>
> ## The cell value and the column name are always in the same
> ## position
> my $cell_value = $original_row[4];
> my $col_name = $original_row[0];
Shorter, but essentially no different:
my ( $col_name, $cell_value) = @original_row[ 0, 4];
Also, the "magic number" 4 is one more than the elements in
@matrixList, so:
my ( $col_name, $cell_value) = @original_row[ 0, 1 + @matrixList];
> ## Add this column name to the list of ones we've seen
> $columns_seen{$col_name}=1;
>
> ## For each matrix, loop through and increment the
> ## row/column value
> foreach my $matrix (@matrixList) {
Next level of indentation, please!
> my $row_name = $original_row[$positionHash{$matrix}];
> $matrix_values{$matrix}{$row_name}{$col_
name} += $cell_value;
> }
>
> } ## end while
Now show the number of lines and the time used:
$time += times;
warn "$linecount lines read in $time s cpu\n";
> ## The following code runs very quicky compare to the
> ## while loop above (2 mins vs. 6 hrs)
> ## I'm only including it to produce the desired output
Including it is good. This way we can generate valid results for
comparison.
> ## Create a header row with column names that is the same
> ## for all matrices
> my $header = "";
>
>
> foreach my $col_name (sort keys %columns_seen) {
> $header = $header . "," . "$col_name";
> }
A shorter and more standard way, though without the leading "," is
my $header = join ',', sort keys %columns_seen;
> ## Create a file for each separate matrix
> foreach my $matrix (@matrixList) {
>
> ## Open output file
> my $OUT_FILE = $matrix . ".csv";
> open (OUTFILE, ">$OUT_FILE") || die "can't open $OUT_FILE";
>
> ## Now we create the first line of file.
> ## Starting with the matrix name and a comma.
> ## Then printing out the column names.
>
> my $firstline = $matrix . "$header";
Originally there was no need to quote "$header". Now there is,
because we must insert a ",":
my $firstline = $matrix . ",$header";
> print OUTFILE "$firstline\n";
>
> ## Loop for each row in the matrix
> foreach my $row_name (keys(%{$matrix_values{$matrix} } )) {
>
> my $line = $row_name;
>
> ## Loop for each column in the matrix
> foreach my $col_name (sort keys %columns_seen) {
Sorting again for each line that is printed is wasteful. Sort them
once and for all before you enter any loop.
> my $cell_value;
> if ($matrix_values{$matrix}{$row_name}{$col
_name}) {
> $cell_value =
> $matrix_values{$matrix}{$row_name}{$col_
name};
> } else {
> $cell_value = ".";
> }
This would more commonly be written
my $cell_value = $matrix_values{$matrix}{$row_name}{$col_
name} || '.';
but that is arguably wrong. What if the cell value is defined, but
happens to be 0? Should 0 be replaced by "."? If not,
my $cell_value = $matrix_values{$matrix}{$row_name}{$col_
name};
defined or $_ = '.' for $cell_value;
> $line = $line . ",$cell_value";
> }
> print OUTFILE "$line\n";
>
> }
> close OUTFILE;
> }
Anno
| |
| Scott Gilpin 2004-08-26, 4:01 pm |
|
ctcgag@hotmail.com wrote:
> "Scott Gilpin" <sgilpin@gmail.com> wrote:
>
I'm[color=darkred]
system[color=darkred]
>
> If it is not memory or cpu bound, then it must be I/O bound. That is
> pretty much the only other choice, right? Are you sure it isn't CPU
bound?
> The fact that it is the only thing that runs on that machine doesn't
mean
> it isn't CPU bound.
I don't believe it's CPU bound, because when I monitor the process
using top, the CPU is always about 75% idle. Also, vmstat doesn't seem
to reveal anything useful - there is no swapping occuring, and there
aren't an abnormal number of context switches.
>
>
>
> } #Let us say you end the loop right there.
> #How long does it take to run now?
Just doing a split and chomp takes about 1/2 the time as the full
processing. (3 hrs for 100 million rows)
>
> Xho
>
> --
> -------------------- http://NewsReader.Com/ --------------------
> Usenet Newsgroup Service $9.95/Month 30GB
| |
| Scott Gilpin 2004-08-26, 4:01 pm |
| Thanks again for your help. I'll definitely make these improvements
and see how things go.
btw - I'm using groups-beta.google.com to post messages, and it seems
to remove any indentation from messages that I post (hence the poorly
indented code).
-Scott
Anno Siegel wrote:
> Scott Gilpin <sgilpin@gmail.com> wrote in comp.lang.perl.misc:
>
> [...]
>
>
> Yeah, that's much better.
>
> I have added some critical remarks to your code, but to summarize,
> I still don't see why it is slow, and I agree that it shouldn't take
> 6 hours on a modern machine.
>
> To pinpoint the problem, exclude parts of the code from processing.
> How long does it take just to read the file and do nothing with
> the lines? What if you don't split the lines but use a fixed array
> instead of the line data? What if you *do* split, but don't make
> the matrix entries? Etc... See my suggestion for timing code below
> to facilitate the evaluation.
>
> So:
>
>
> No warnings? You should always switch warnings on, especially when
> juggling nested data structures.
>
matrix[color=darkred]
>
> To facilitate run-time evaluation, add this:
>
> my $linecount = 0;
> my $time = - times;
>
>
> $linecount ++;
>
> Code in {} should be indented!
>
>
> Just
>
> chomp;
>
> does the same thing. The way things are you don't need to chomp
> at all (because the last field is numeric and doesn't suffer from
> a trailing "\n"), but I'd leave it in.
>
>
> Shorter, but essentially no different:
>
> my ( $col_name, $cell_value) = @original_row[ 0, 4];
>
> Also, the "magic number" 4 is one more than the elements in
> @matrixList, so:
>
> my ( $col_name, $cell_value) = @original_row[ 0, 1 +
@matrixList];
>
>
> Next level of indentation, please!
>
>
> Now show the number of lines and the time used:
>
> $time += times;
> warn "$linecount lines read in $time s cpu\n";
>
>
> Including it is good. This way we can generate valid results for
> comparison.
>
>
> A shorter and more standard way, though without the leading "," is
>
> my $header = join ',', sort keys %columns_seen;
>
>
> Originally there was no need to quote "$header". Now there is,
> because we must insert a ",":
>
> my $firstline = $matrix . ",$header";
>
>
> Sorting again for each line that is printed is wasteful. Sort them
> once and for all before you enter any loop.
>
>
> This would more commonly be written
>
> my $cell_value = $matrix_values{$matrix}{$row_name}{$col_
name} ||
'.';
>
> but that is arguably wrong. What if the cell value is defined, but
> happens to be 0? Should 0 be replaced by "."? If not,
>
> my $cell_value = $matrix_values{$matrix}{$row_name}{$col_
name};
> defined or $_ = '.' for $cell_value;
>
>
> Anno
| |
| ctcgag@hotmail.com 2004-08-26, 4:01 pm |
| anno4000@lublin.zrz.tu-berlin.de (Anno Siegel) wrote:
>
> Yeah, that's much better.
>
> I have added some critical remarks to your code, but to summarize,
> I still don't see why it is slow, and I agree that it shouldn't take
> 6 hours on a modern machine.
The sample data he posted had [4] as the last index per line, while
the original code (presumably, which is the one that took 6 hours)
had [24] as the index. Taking that change of size into account, I would
say that 6 hours is a reasonable time for it to take.
Xho
--
-------------------- http://NewsReader.Com/ --------------------
Usenet Newsgroup Service $9.95/Month 30GB
| |
| ctcgag@hotmail.com 2004-08-26, 4:01 pm |
| "Scott Gilpin" <sgilpin@gmail.com> wrote:
> ctcgag@hotmail.com wrote:
> I'm
> system
> bound?
> mean
>
> I don't believe it's CPU bound, because when I monitor the process
> using top, the CPU is always about 75% idle.
Is this a 4 processor machine? If so, then you could probably parallelize
the program to make it about 4 times faster. If not, then that is strange
and perhaps you really are IO bound, although that is hard to believe.
>
> Just doing a split and chomp takes about 1/2 the time as the full
> processing. (3 hrs for 100 million rows)
When just reading and splitting the data takes half the time, that doesn't
bode well for optimization potential. At that point, I'd recommend rearing
back and looking at the big picture. How do you get these 100 million row
files in the first place? Presumably you get a lot of these (because a 6
hour run-time probably isn't a big problem for a one-time deal), so could
you tap directly into the data source that is used to create the files,
rather than using the intermediate files?
Xho
--
-------------------- http://NewsReader.Com/ --------------------
Usenet Newsgroup Service $9.95/Month 30GB
| |
| Anno Siegel 2004-08-26, 4:01 pm |
| <ctcgag@hotmail.com> wrote in comp.lang.perl.misc:
> anno4000@lublin.zrz.tu-berlin.de (Anno Siegel) wrote:
>
> The sample data he posted had [4] as the last index per line, while
> the original code (presumably, which is the one that took 6 hours)
> had [24] as the index. Taking that change of size into account, I would
> say that 6 hours is a reasonable time for it to take.
You may be right. After I read that just splitting the data takes
half of the time I had second thoughts of my own.
It also means that there is no gross inefficiency in the other half
of the program, so there won't be a dramatic improvement through
some super algorithm.
You mentioned parallelization... that looks easy. Split the data
in as many parts as you have CPUs and run them in parallel (on one or
more machines). A program that adds up the resulting matrices wouldn't
be hard (though not trivial). More importantly, it wouldn't run long.
Then there's C, but the problem seems to involve some essential string
processing, so that's not particularly attractive. If only those lines
and columns had numbers instead of names.
Anno
|
|
|
|
|