For Programmers: Free Programming Magazines  


Home > Archive > PERL Beginners > May 2007 > Outlook CSV Parser









You are viewing an archived Text-only version of the thread. To view this thread in it's original format and/or if you want to reply to this thread please [click here]

 

Author Outlook CSV Parser
Laxminarayan G Kamath A

2007-05-30, 3:58 am


Hi PERLers,
We here at DeepRoot Linux were trying to parse Outlook's csv so
that I can add them to ldap addressbook.. After several futile tries
around with lots of built in packages, we decided it was up to us to
device an algorithm. As time was of importance , we wrote a "well it
works!" "character by character" parsing perl algorithm.
Wondering about how I could make it run faster, I thought I
will give regular expressions a try. The attached file is what I have
come up to.. but it still takes more than 10 seconds on my 1.6 GHZ P4
with 1G RAM to parse a 6500+ lines of CSV . and seperate them out ..
Let alone importing them to LDAP. Any ways of optimising it further?
The stubling blocks : there are several types of problems in
Outlook's CSV ..
1. One line need not be one record. They may cointain multine
fields.
2. A sigh of relief but : only multi-line fields are wrapped in
double quotes.
3. commas are both inside and outside the fields. the ones in
the fileds must not be treated as "seperator" - again fields with
commans are wrapped in double quotes.

I hope I am on the right mailing list.. Else, please direct me to the
proper one.

--
Cheers,
Laxminarayan G Kamath A
e-mail: laxminarayan@deeproot.co.in
Work URL: http://deeproot.in
Laxminarayan G Kamath A

2007-05-30, 3:58 am

On Wed, 30 May 2007 11:10:00 +0530, Laxminarayan G Kamath A
<laxminarayan@deeproot.co.in> wrote:

> The attached file is what I have
> come up to.. but it still takes more


.... Had forgotten to attach the file..

--
Cheers,
Laxminarayan G Kamath A
e-mail: laxminarayan@deeproot.co.in
Work URL: http://deeproot.in

Mumia W.

2007-05-30, 3:58 am

On 05/30/2007 12:40 AM, Laxminarayan G Kamath A wrote:
> Hi PERLers,
> We here at DeepRoot Linux were trying to parse Outlook's csv so
> that I can add them to ldap addressbook.. [...]


The Perl module Text::CSV_XS would make your work much simpler, and it
might execute a little faster.




Laxminarayan G Kamath A

2007-05-30, 3:58 am

On Wed, 30 May 2007 01:26:30 -0500, "Mumia W." <mumia.w.18.spam
+nospam@earthlink.net> wrote:


> The Perl module Text::CSV_XS would make your work much simpler, and
> it might execute a little faster.


Thank you for pointing out .. but we have already tried it!
Unfortunately, it failed to seperate the records in the right fashion.
We have also tried the several more modules from CPAN.. and they were
not able to parse the OutLook's CSV.

If you read my mail again, you might find that I already mentioned that
we tried several modules before falling back to writing our own code.

What I am expecting is help with the variant of the regex I used as the
condition for while loop. I am sure If we modify that regexp a little
bit, then we can just use it on the record like this :

$_ = $record;
@fields = /regexp/g;

I tried a lot of different ways but just could not get the right
regexp :-(.

--
Cheers,
Laxminarayan G Kamath A
e-mail: laxminarayan@deeproot.co.in
Work URL: http://deeproot.in
Dr.Ruud

2007-05-30, 7:58 am

Laxminarayan G Kamath A schreef:

> The stubling blocks : there are several types of problems in
> Outlook's CSV ..


You forgot to supply a link to such a file. Or show a __DATA__ section
for testing.

--
Affijn, Ruud

"Gewoon is een tijger."

Mumia W.

2007-05-30, 7:58 am

On 05/30/2007 03:04 AM, Laxminarayan G Kamath A wrote:
> [...]
> I tried a lot of different ways but just could not get the right
> regexp :-(.
>


I reiterate what the eminent Dr. Ruud said. I need some data to play
with before I play with the code you posted.



Ken Foskey

2007-05-30, 7:58 am

On Wed, 2007-05-30 at 13:34 +0530, Laxminarayan G Kamath A wrote:

> What I am expecting is help with the variant of the regex I used as the
> condition for while loop. I am sure If we modify that regexp a little
> bit, then we can just use it on the record like this :
>
> $_ = $record;
> @fields = /regexp/g;
>
> I tried a lot of different ways but just could not get the right
> regexp :-(.


CSV is a horrible format. Far too unreliable, we have exported CSV
from excel that imported differently into excel.

Is there another option, eg connecting to Outlook via a remote
connection?

Is there another format available?

I doubt a simple regex will do it if the CSV modules do not work.

What data do you have problems with? Without samples there is nothing
we can do.


--
Ken Foskey
FOSS developer

Chas Owens

2007-05-30, 6:59 pm

On 5/30/07, Laxminarayan G Kamath A <laxminarayan@deeproot.co.in> wrote:
snip
> Any ways of optimising it further?

snip

Premature optimization is the root of all evil. Have you profiled the
code yet? If not then here is some documentation that will point you
in the right direction

http://www.perl.com/pub/a/2004/06/25/profiling.html
http://search.cpan.org/~nwclark/per...tils/dprofpp.PL

But while I am looking lets see what is going on.

snip
> 1. One line need not be one record. They may cointain multine
> fields.
> 2. A sigh of relief but : only multi-line fields are wrapped in
> double quotes.
> 3. commas are both inside and outside the fields. the ones in
> the fileds must not be treated as "seperator" - again fields with
> commans are wrapped in double quotes.

snip

The following code seems to speed up the parsing by two orders of
magnitude (2.214 seconds for the old code and 0.036 seconds for this
code on 100 records). Also, there seems to be a bug in your original
code. I setup a test file with a 100 records of 30 fields each and it
found

found 33 fields in 1 records
found 34 fields in 1 records
found 36 fields in 3 records
found 37 fields in 5 records
found 38 fields in 10 records
found 39 fields in 9 records
found 40 fields in 12 records
found 41 fields in 17 records
found 42 fields in 15 records
found 43 fields in 13 records
found 44 fields in 7 records
found 45 fields in 5 records
found 46 fields in 1 records
found 48 fields in 1 records

===code to generate test file===
#!/usr/bin/perl

use strict;
use warnings;

my $fields = 30;
my $fieldlen = 30;
my @fieldtype = qw(normal quoted comma);
my $records = shift;

for my $rec (1 .. $records) {
for my $field (1 .. $fields) {
my $type = $fieldtype[int rand @fieldtype];
if ($type eq 'normal') {
print 'n' x $fieldlen, ",";
} elsif ($type eq 'quoted') {
print '"';
my $i = 0;
until ($i < $fieldlen) {
my $len = int rand $fieldlen;
print 'q' x $len, "\n";
$i += $len;
}
print '",';
} elsif ($type eq 'comma') {
print '"';
my $i = 0;
until ($i == $fieldlen) {
my $len = int rand $fieldlen;
$len = $fieldlen - $i if $i+$len > $fieldlen;
print 'c' x ($len/2), ',', 'c' x ($len/2), "\n";
$i += $len;
}
print '",';
}
}
print "\n";
}

===code to parse test file===
#!/usr/bin/perl

use strict;
use warnings;

my $record = "";
my $quotes = 0;
my @records;
while (defined (my $line = <> )) {
next if $record eq "" and $line =~ /^\s*$/;

$record .= $line;

#count the number of quotes
$quotes += () = $line =~ /"/g;

#if $quotes is even then we have a full record
if ($quotes % 2 == 0) {
$quotes = 0;
chomp $record;
my @fields;
my $unbalanced = 0;
for my $field (split /,/, $record) {
my $count = $field =~ s/"//g;
if ($count % 2) {
if ($unbalanced) {
$unbalanced = 0;
$fields[-1] .= ",$field";
next;
}
$unbalanced = 1;
push @fields, $field;
next;
}
if ($unbalanced) {
$fields[-1] .= ",$field";
} else {
push @fields, $field;
}
}
push @records, { whole => $record, fields => \@fields};
$record = "";
}

}

for my $rec (@records) {
print join "|", @{$rec->{fields}},"\n===\n";
}
Chas Owens

2007-05-30, 6:59 pm

On 5/30/07, Ken Foskey <foskey@optushome.com.au> wrote:
snip
> CSV is a horrible format. Far too unreliable, we have exported CSV
> from excel that imported differently into excel.

snip

Just pedantic nitpick, but CSV is an incredibly reliable format, the
problem is find programs that actually use CSV rather than a CSV-like
format. It works out to the same thing, but it isn't CSV's fault.
For an example of a programmer using a CSV-like format where he/she
should be using the real thing look at my other post on this thread.
My code fails to handle escaped double quotes correctly.
Laxminarayan G Kamath A

2007-05-31, 3:59 am

On Wed, 30 May 2007 10:38:40 +0200, "Dr.Ruud" <rvtol+news@isolution.nl>
wrote:

> You forgot to supply a link to such a file. Or show a __DATA__ section
> for testing.


http://download.deeproot.in/~kamath...pted-sample.csv

--
Cheers,
Laxminarayan G Kamath A
e-mail: laxminarayan@deeproot.co.in
Work URL: http://deeproot.in
Mumia W.

2007-05-31, 7:58 am

On 05/31/2007 02:32 AM, Laxminarayan G Kamath A wrote:
>
> http://download.deeproot.in/~kamath...pted-sample.csv
>


Well I asked for it. :-)

It's impossible to tell where one record ends and another record begins
with that file.



Dr.Ruud

2007-05-31, 7:58 am

Laxminarayan G Kamath A:
> Ruud:


>
> http://download.deeproot.in/~kamath...pted-sample.csv


OK, lets check how wellformed it is:

perl -we'
local $/;
$_ = <>;
s/"[^"]*"//g;
s/(?<=,)[^",]+(?=,)//g;
s/^[^,"]*,+//;
print
' outlook-encrtypted-sample.csv


That prints:
-----------------------------------------------
"3snji

dsnjidsknk@nmkkski.smk
-----------------------------------------------

so AFAICS there is a problem somewhere at the end.

Maybe try a zipped version?

--
Affijn, Ruud

"Gewoon is een tijger."
Dr.Ruud

2007-05-31, 6:59 pm

"Mumia W." schreef:
> Laxminarayan G Kamath A:


>
> Well I asked for it. :-)
>
> It's impossible to tell where one record ends and another record
> begins with that file.


Maybe not, because the rule was that it ends at newline, unless inside
quotes.

--
Affijn, Ruud

"Gewoon is een tijger."

Chaocheebye

2007-05-31, 9:35 pm

http://Celine-Dion-facestanding-mov...p?movie=1673286
Sponsored Links







Also available: Server administration forum archive | Web Design forum archive | Software forum archive | Hardware reviews archive

Copyright 2009 codecomments.com