

Home > Archive > PERL Beginners > September 2006 > Best way to read a large file in line by line?










 

Best way to read a large file in line by line?
jplee3@gmail.com

2006-09-28, 6:57 pm

Hi all,
I'm a n00b to Perl [and scripting in general] and have been trying
to figure out the best way to process large log files (1gb and higher
in size). I've seen that many people use "@array = <FILE>;" to
accomplish this, but things seem to slow down quite a bit with the
files I'm working with.

I guess what I'm getting at, is if there's a faster way to do this
(i.e. using a while loop to read through the file first, and dumping
each line of the file into an array)

What I would like to accomplish is being able to access each line as an
array element - so $array[0] would return line 1, $array[1] would
return line 2, $array[2] would return line 3, and so on...

Any ideas?

Thanks!

usenet@DavidFilmer.com

2006-09-29, 3:57 am

jplee3@gmail.com wrote:
> I guess what I'm getting at, is if there's a faster way to do this
> (i.e. using a while loop to read through the file first, and dumping
> each line of the file into an array)


Why mess with the array at all? Just read the file one line at a time
and process each line as it is read. It's usually a bad idea to try to
slurp the entire file unless you REALLY need random-access to it. Most
file processing is sequential.

> What I would like to accomplish is being able to access each line as an
> array element - so $array[0] would return line 1, $array[1] would
> return line 2, $array[2] would return line 3, and so on...


If you must, use Tie::File (which lets you treat a file like any other
Perl array).
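
A minimal sketch of the Tie::File approach mentioned above. The temporary file and its three lines are made up so the example runs on its own; with a real log, $path would simply be the log's filename. The file is opened read-only here on the assumption that the log should not be modified, since assigning to a tied array element otherwise rewrites the file.

```perl
use strict;
use warnings;
use Fcntl 'O_RDONLY';
use Tie::File;
use File::Temp qw(tempfile);

# Create a small throwaway file so the sketch is self-contained;
# in practice $path would point at the real log file.
my ($tmp, $path) = tempfile(UNLINK => 1);
print {$tmp} "first line\nsecond line\nthird line\n";
close $tmp or die "Cannot close $path: $!";

# Tie the file to an array. Records are fetched from disk on demand
# instead of being loaded into memory all at once.
tie my @lines, 'Tie::File', $path, mode => O_RDONLY
    or die "Cannot tie $path: $!";

my $first = $lines[0];   # line 1, without its trailing newline
my $third = $lines[2];   # line 3
my $count = @lines;      # number of lines (has to scan the file)

print "$first\n$third\n$count\n";

untie @lines;
```

Note that this trades memory for disk reads: reaching a line far into a very large file, or asking for the line count, still means reading through the file up to that point.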

--
David Filmer (http://DavidFilmer.com)

Paul Lalli

2006-09-29, 7:57 am

jplee3@gmail.com wrote:
> I'm a n00b to Perl [and scripting in general] and have been trying
> to figure out the best way to process large log files (1gb and higher
> in size). I've seen that many people use "@array = <FILE>;" to
> accomplish this, but things seem to slow down quite a bit with the
> files I'm working with.


Please burn any tutorial or other example that shows that as a good way
to read from files.

> I guess what I'm getting at, is if there's a faster way to do this
> (i.e. using a while loop to read through the file first, and dumping
> each line of the file into an array)


Why do you think you need the file in an array at all? Why not just
process each line of the file as you read it, and then discard that
line in favor of the next line?

while (my $line = <$fh>) {
    # do something with $line
}
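
A fuller sketch of the same loop, for illustration only. The helper name count_matching and the sample data are hypothetical; an in-memory filehandle stands in for the log so the example runs without a file on disk.

```perl
use strict;
use warnings;

# Count the lines matching a pattern while holding only one line
# in memory at a time. count_matching() is a hypothetical helper.
sub count_matching {
    my ($fh, $pattern) = @_;
    my $count = 0;
    while (my $line = <$fh>) {
        chomp $line;
        $count++ if $line =~ $pattern;
    }
    return $count;
}

# For a real log, open the file instead:
#   open my $fh, '<', $path or die "Cannot open $path: $!";
my $sample = "login ok\nlogin failed\nlogout\nlogin failed\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
my $failed = count_matching($fh, qr/failed/);
close $fh;

print "$failed\n";   # prints 2
```

Memory use stays flat regardless of file size, because each line is discarded as soon as the next one is read.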

> What I would like to accomplish is being able to access each line as an
> array element - so $array[0] would return line 1, $array[1] would
> return line 2, $array[2] would return line 3, and so on...


Why? What makes you think you need all the lines of the file to be in
memory at the same time? I'm not denying the *possibility* that your
goal might have that requirement, but it's generally unlikely.

Paul Lalli!

jplee3@gmail.com

2006-09-29, 6:57 pm


Paul Lalli wrote:
> jplee3@gmail.com wrote:
>
> Please burn any tutorial or other example that shows that as a good way
> to read from files.
>


I guess there's quite a few sites that need burning : )

>
> Why do you think you need the file into an array at all? Why not just
> process each line of the file as you read it, and then discard that
> line in favor of the next line?
>


Thanks, I actually started out with a "while (<LOGFILE> ) { }" type of
loop for line-by-line processing. And using the split() function, I'm
able to get each word into an array (so in essence, each line becomes
an array).

> while (my $line = <$fh> ) {
> #do something with $line;
> }
>
>
> Why? What makes you think you need all the lines of the file to be in
> memory at the same time? I'm not denying the *possibility* that your
> goal might have that requirement, but it's generally unlikely.
>

I guess with such a huge file this wouldn't be a good idea, right? I
think I just wanted to be able to access each log entry as I would be
able to when grepping/uniqing/lessing the original file.

So would I be looking at implementing an 'array of arrays?' At this
point, with the while loop I've implemented, I'm able to have an array
of each line with each string being an element. This is beneficial
because I can 'filter' out and include/exclude the strings I want by
accessing their respective elements.

Even with these 'modified' lines, is it still too expensive to put them
all into an array as elements?

One of the main issues is that there are duplicate line entries - each
line has a unique ID which is the only difference. I want to remove
that unique ID from each string; then put all the lines into an array
(or hash) so that I can compare them as elements and remove the
duplicates.

> Paul Lalli!


Paul Lalli

2006-09-29, 6:57 pm

jpl...@gmail.com wrote:
> Paul Lalli wrote:
> I guess there's quite a few sites that need burning : )


Yes, there really are....

>
> Thanks, I actually started out with a "while (<LOGFILE> ) { }" type of
> loop for line-by-line processing. And using the split() function, I'm
> able to get each word into an array (so in essence, each line becomes
> an array).


With you so far....

>
> I guess with such a huge file this wouldn't be a good idea, right?


In general, no it's not a good idea unless you *need* to have multiple
lines in memory at the same time. Most of the time, there is no need
for that requirement.

> I
> think I just wanted to be able to access each log entry as I would be
> able to when grepping/uniqing/lessing the original file.


Not seeing how this relates. When you do any of those commands from
the shell, they spit out each matching line. You can do that just fine
reading line by line. Read a line, print it if need be, read the next
line, etc.

> So would I be looking at implementing an 'array of arrays?'


For what? You haven't yet said what you want all this data for. If
you wanted an array of all "lines" where each "line" is now actually an
array of words, then yes, you'd need a two-d array.
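
As a rough sketch of what such a two-d array looks like, assuming whitespace-separated words and using made-up sample lines read from an in-memory filehandle:

```perl
use strict;
use warnings;

# Made-up sample data; a real script would open the log file instead.
my $sample = "alice logon ok\nbob logon failed\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";

my @rows;   # array of arrays: one inner array of words per line
while (my $line = <$fh>) {
    chomp $line;
    push @rows, [ split ' ', $line ];
}
close $fh;

print $rows[1][2], "\n";   # third word of the second line: failed
```

Every line stays in memory this way, so for a log of 1gb or more the same memory concerns apply as with "@array = <FILE>;".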

> At this
> point, with the while loop I've implemented, I'm able to have an array
> of each line with each string being an element. This is beneficial
> because I can 'filter' out and include/exclude the strings I want by
> accessing their respective elements.
>
> Even with these 'modified' lines, is it still too expensive to put all
> them into an array as elements?


Only you can answer that. Perl certainly is not going to complain
about using too much memory, until you actually use more memory than
your system has. But the more memory you use, the slower your program
will execute. Only you can decide how slow is "too" slow.

> One of the main issues is that there are duplicate line entries - each
> line has a unique ID which is the only difference. I want to remove
> that unique ID from each string; then put all the lines into an array
> (or hash) so that I can compare them as elements and remove the
> duplicates.


But you still haven't said *WHY* you want them all in an array,
duplicates or no. WHAT IS THE END GOAL? Do you simply want to only
print out the non-duplicate lines? If that's all it is, there is no
need for storing all the lines in memory. Just store each unique
identifier. If the current line's identifier has already been seen,
don't print the line. If it hasn't, add the identifier to the hash,
and print the line.

For example, if your unique identifier was the first sequence of
non-whitespace on your line:

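my %ids;
while (my $line = <$fh>) {
    # capture the first run of non-whitespace; skip lines without one
    my ($id) = $line =~ /^(\S+)/ or next;
    # the post-increment is false only the first time an ID is seen
    if (!$ids{$id}++) {
        print $line;
    }
}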

If you can tell us what the *actual* goal of your program is, rather
than intermediate steps you *think* you want to use to achieve that
goal, we can probably give you better advice...


Paul Lalli

jplee3@gmail.com

2006-09-29, 6:57 pm


Paul Lalli wrote:
> jpl...@gmail.com wrote:
>
> Yes, there really are....
>
>
> With you so far....
>
>
> In general, no it's not a good idea unless you *need* to have multiple
> lines in memory at the same time. Most of the time, there is no need
> for that requirement.
>
>
> Not seeing how this relates. When you do any of those commands from
> the shell, they spit out each matching line. You can do that just fine
> reading line by line. Read a line, print it if need be, read the next
> line, etc.
>
>
> For what? You haven't yet said what you want all this data for. If
> you wanted an array of all "lines" where each "line" is now actually an
> array of words, then yes, you'd need a two-d array.
>
>
> Only you can answer that. Perl certainly is not going to complain
> about using too much memory, until you actually use more memory than
> your system has. But the more memory you use, the slower your program
> will execute. Only you can decide how slow is "too" slow.
>
>
> But you still haven't said *WHY* you want them all in an array,
> duplicates or no. WHAT IS THE END GOAL? Do you simply want to only
> print out the non-duplicate lines? If that's all it is, there is no
> need for storing all the lines in memory. Just store each unique
> identifier. If the current line's identifier has already been seen,
> don't print the line. If it hasn't, add the identifier to the hash,
> and print the line.
>
> For example, if your unique identifier was the first sequence of
> non-whitespace on your line:
>
> my %ids;
> while (my $line = <$fh>) {
>     my ($id) = $line =~ /^(\S+)/ or next;
>     if (!$ids{$id}++) {
>         print $line;
>     }
> }
>
> If you can tell us what the *actual* goal of your program is, rather
> than intermediate steps you *think* you want to use to achieve that
> goal, we can probably give you better advice...
>


Thanks for the advice so far... it's been super helpful.

Sorry guys, I'm basically trying to concoct something that will run
through a Windows Event Security Log, single out specific entries
(whether by user logon, failure codes, etc), and output these to XML
(or insert them into a DB). So far, I've been able to single out what I
need with regex and am able to output them to XML or into a DB.
The main issue here is the duplicate lines, which makes the output
really bloated.

>
> Paul Lalli


jplee3@gmail.com

2006-09-29, 6:57 pm


jplee3@gmail.com wrote:
> Paul Lalli wrote:
RE the unique identifier: the problem is that every line, including
the duplicates, has a unique ID. I need to print everything out (well,
whatever I specify in my regex) and exclude the duplicate messages.

i.e.
Sep 28 04:03:38 dc DC EvntSLog: RealSource:"DC" [AUF] Thu Sep 28
04:03:38 2006: HQ-DC-02/Security (680) - "303217949: [AUF] Thu Sep 28
04:03:38 2006: NT AUTHORITY\SYSTEM/Security/DC/Security (680) - "Logon
attempt by: MICROSOFT_AUTHENTICATION_PACKAGE_V1_0 Logon account:
testuser Source Workstation: testmachine Error Code: 0xC000006A ""

Sep 28 04:03:38 dc DC EvntSLog: RealSource:"DC" [AUF] Thu Sep 28
04:03:38 2006: HQ-DC-02/Security (680) - "303217950: [AUF] Thu Sep 28
04:03:38 2006: NT AUTHORITY\SYSTEM/Security/DC/Security (680) - "Logon
attempt by: MICROSOFT_AUTHENTICATION_PACKAGE_V1_0 Logon account:
testuser Source Workstation: testmachine Error Code: 0xC000006A ""

- These are obviously duplicate entries; the problem is that the IDs
"303217949" and "303217950" differentiate them.
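
One possible way to handle this, sketched under some assumptions: each log entry sits on a single line in the real file, the ID is the run of digits right after `- "` and before `: `, and two entries count as duplicates when they are identical once that ID is removed. The helper names dedup_key and print_unique and the shortened sample lines are made up for illustration.

```perl
use strict;
use warnings;

# Return the line with its per-entry numeric ID removed, for use as a
# hash key. Assumes the ID looks like:  - "303217949: ...
sub dedup_key {
    my ($line) = @_;
    (my $key = $line) =~ s/- "\d+: /- "/;
    return $key;
}

# Copy lines from $in to $out, printing the original line only the
# first time its ID-less form is seen.
sub print_unique {
    my ($in, $out) = @_;
    my %seen;
    while (my $line = <$in>) {
        print {$out} $line unless $seen{ dedup_key($line) }++;
    }
}

# Shortened, made-up entries modeled on the log lines above.
my $log = join '',
    qq{HQ-DC-02/Security (680) - "303217949: [AUF] Logon account: testuser\n},
    qq{HQ-DC-02/Security (680) - "303217950: [AUF] Logon account: testuser\n},
    qq{HQ-DC-02/Security (680) - "303217951: [AUF] Logon account: otheruser\n};

open my $in,  '<', \$log        or die "Cannot open input: $!";
my $output = '';
open my $out, '>', \$output     or die "Cannot open output: $!";
print_unique($in, $out);
close $in;
close $out;

print $output;   # the first and third entries only
```

The hash still holds one key per distinct entry, so memory grows with the number of unique messages rather than with the total number of lines.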
>
> Thanks for the advice so far... it's been super helpful.
>
> Sorry guys, I'm basically trying to concoct something that will run
> through a Windows Event Security Log, single out specific entries
> (whether by user logon, failure codes, etc), and output these to XML
> (or insert them into a DB). So far, I've been able to single out what I
> need with regex and am able to output them to XML or into a DB.
> The main issue here is the duplicate lines, which makes the output
> really bloated.
>

