Home > Archive > PERL Miscellaneous > December 2004 > Reading poorly structured data
Reading poorly structured data
| Alan Mead 2004-12-08, 3:57 am |
| I have five files of contact info (one for each year of a conference).
All five have slightly different fairly unstructured formats. One looks
like this:
Bush, George, President, 1 White House Way, Washington,
DC 00000; gbush@whitehouse.gov
Kerry, John, 1 Main, Detroit, MI 00000; jkerry@yahoo.com
Williams, Robin, 2 Main, Burbank, CA 00000
Newman, Paul, President and Principal Spokesperson,
Paul Newmans's Own Brand Foods, 123 Main Street,
Olympia Fields, WY 00000; paul@newmans.org
Blair, Tony, 1 Downing Street, London, UK 0000000
.... etc..
So the fields are comma-separated, except for email which may be absent,
and the record may be split over two or three lines.
In a later file dozens of records appear on the same line.
I'd like to output
lname=Bush
fname=George
address=President, 1 White House Way, Washington, DC 00000
email=gbush@whitehouse.gov
Any ideas how to parse this using Perl? So far I can parse about 60% of
the records with the hack below. It gets tripped up when a record
contains many commas (some people have five lines of address with
embedded commas), in which case it parses the first half of the
record fairly well and then tries to parse the second half as a
new record.
-Alan
my $i=0;
while($i<=$count) {
    $i++;
    my($lname,$fname,$address,$email)=('','','','');
    my $line = $lines{$i};
    if ($line =~ /[,;]$/) {               # clearly more on next line
        $lines{$i+1} = "$line $lines{$i+1}";
        next;
    }
    if ( (scalar split/,/,$line) > 4) {   # a proper name and address will
                                          # have at least 5 parts
        if ($line =~ /@/) {
            my @bits = split(/;/,$line);  # email is last element when split
                                          # on semicolons, so save it
            $email = pop(@bits);
            $line = join(';',@bits);      # put line back together (just
                                          # in case there's more than one
                                          # semi-colon in the record)
        }
        my @bits = split(/,/,$line);      # now split on commas
        $lname = shift @bits;             # lname is first bit
        $fname = shift @bits;             # followed by fname
        $address = join(',',@bits);       # the rest is the address
    } else {
        $lines{$i+1} = "$line $lines{$i+1}";
        next;
    }
    ....
}
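The continuation-line step in the hack above (a line ending in a comma or semicolon is glued to the next one) can be isolated into a small helper. What follows is only a minimal sketch under that one assumption; the name join_continuations is hypothetical and not from the original post, and it takes a plain list of lines rather than the %lines hash.

```perl
use strict;
use warnings;

# Glue wrapped lines together: a line that ends in "," or ";" is assumed
# to continue on the next line (the same test the hack above uses).
sub join_continuations {
    my @lines = @_;
    my @joined;
    my $pending = '';
    for my $line (@lines) {
        $line =~ s/\s+$//;                       # drop trailing whitespace
        $pending = length $pending ? "$pending $line" : $line;
        next if $pending =~ /[,;]$/;             # clearly more on next line
        push @joined, $pending;
        $pending = '';
    }
    push @joined, $pending if length $pending;   # keep an unfinished tail
    return @joined;
}

my @sample_lines = (
    'Bush, George, President, 1 White House Way, Washington,',
    'DC 00000; gbush@whitehouse.gov',
    'Williams, Robin, 2 Main, Burbank, CA 00000',
);
print "$_\n" for join_continuations(@sample_lines);
```

This only covers records that break right after a comma or semicolon. A record that wraps in the middle of a field, or a file with many records on one line, still needs a different boundary test.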
| |
| A. Sinan Unur 2004-12-08, 3:57 am |
| Alan Mead <amead@comcast.net> wrote in
news:pan.2004.12.08.02.40.13.661534@comcast.net:
> I have five files of contact info (one for each year of a conference).
> All five have slightly different fairly unstructured formats. One looks
> like this:
>
> Bush, George, President, 1 White House Way, Washington,
> DC 00000; gbush@whitehouse.gov
> Kerry, John, 1 Main, Detroit, MI 00000; jkerry@yahoo.com
> Williams, Robin, 2 Main, Burbank, CA 00000
> Newman, Paul, President and Principal Spokesperson,
> Paul Newmans's Own Brand Foods, 123 Main Street,
> Olympia Fields, WY 00000; paul@newmans.org
> Blair, Tony, 1 Downing Street, London, UK 0000000
> ... etc..
Here is somewhat of a kludge that "works" for the snippet you posted. Hope
this helps.
#! perl
use strict;
use warnings;
use File::Slurp;

my $input = read_file(\*DATA);
$input =~ tr/\n/ /;
my @records;
while(length $input) {
    my %record;
    $record{lname} = grab_name($input);
    $record{fname} = grab_name($input);
    $input =~ /[A-Z]{2} \d+/g;
    $record{address} = substr $input, 0, pos($input);
    $input = substr $input, pos($input);
    if($input =~ /^;\s*(\w+\@\w+\.\w+)\s*/g) {
        $record{email} = $1;
        $input = substr $input, pos $input;
    }
    push @records, \%record;
}

use Data::Dumper;
print Dumper \@records;

sub grab_name {
    my $off = index $_[0], ',';
    my $name = substr $_[0], 0, $off;
    $_[0] = substr $_[0], $off + 2;
    return $name;
}
__DATA__
Bush, George, President, 1 White House Way, Washington,
DC 00000; gbush@whitehouse.gov
Kerry, John, 1 Main, Detroit, MI 00000; jkerry@yahoo.com
Williams, Robin, 2 Main, Burbank, CA 00000
Newman, Paul, President and Principal Spokesperson,
Paul Newmans's Own Brand Foods, 123 Main Street,
Olympia Fields, WY 00000; paul@newmans.org
Blair, Tony, 1 Downing Street, London, UK 0000000
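For comparison, the same heuristic (an address ends at two capital letters followed by digits, optionally followed by "; email") can be written as a single regex applied repeatedly to the flattened text. This is a sketch under that assumption only, not a replacement for the code above; the name parse_contacts is hypothetical, and it uses a here-document instead of File::Slurp so that it needs no non-core modules.

```perl
use strict;
use warnings;

# Parse free-form contact text into a list of hash references.
# Assumption: every address ends with two capital letters, a space and
# a run of digits; an email, if present, follows after a semicolon.
sub parse_contacts {
    my ($text) = @_;
    $text =~ s/\s+/ /g;    # flatten wrapped lines into one string
    my @records;
    while ($text =~ /\G\s*([^,]+),\s*([^,]+),\s*(.+?\b[A-Z]{2} \d+)(?:;\s*(\S+\@\S+))?/g) {
        my %record = (lname => $1, fname => $2, address => $3);
        $record{email} = $4 if defined $4;
        push @records, \%record;
    }
    return @records;
}

my $sample = <<'END';
Bush, George, President, 1 White House Way, Washington,
DC 00000; gbush@whitehouse.gov
Williams, Robin, 2 Main, Burbank, CA 00000
Blair, Tony, 1 Downing Street, London, UK 0000000
END

# Print each record in the lname=/fname=/address=/email= layout
# asked for in the original post.
for my $rec (parse_contacts($sample)) {
    print "$_=$rec->{$_}\n"
        for grep { exists $rec->{$_} } qw(lname fname address email);
    print "\n";
}
```

Because of the \G anchor, parsing stops at the first record that does not fit the pattern, so anything after that point is not returned and has to be checked by hand.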
| |
| Alan Mead 2004-12-08, 3:57 am |
| On Wed, 08 Dec 2004 04:04:53 +0000, A. Sinan Unur wrote:
> Here is somewhat of a kludge that "works" for the snippet you posted. Hope
> this helps.
>
> #! perl
> use strict;
> use warnings;
> use File::Slurp;
> my $input = read_file(\*DATA);
> $input =~ tr/\n/ /;
> my @records;
> while(length $input) {
> my %record;
> $record{lname} = grab_name($input);
> $record{fname} = grab_name($input);
> $input =~ /[A-Z]{2} \d+/g;
> $record{address} = substr $input, 0, pos($input);
> $input = substr $input, pos($input);
> if($input =~ /^;\s*(\w+\@\w+\.\w+)\s*/g) {
> $record{email} = $1;
> $input = substr $input, pos $input;
> }
> push @records, \%record;
> }
[...]
And so it does very nicely. I think you are making use of the fact that
these all had a pair of capital letters near the end (including the
convenient UK) but there is a 'D.C.' in my data and some other
addresses outside the US (that lack this feature). I should have included
a better sample. But this may get me to 95% ... The way you've slurped the
file makes this perfectly applicable to the rest of the files which is a
REALLY BIG help.
Thanks!
-Alan
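One way to cope with spellings such as 'D.C.' is to keep the "region" part of the end-of-address test in a list that can grow as new variants turn up. The sketch below is illustrative only: @region_patterns and split_address are hypothetical names, and the list covers just the two forms mentioned in this thread. Addresses that end in neither form would still need patterns of their own.

```perl
use strict;
use warnings;

# Hypothetical list of region spellings that may precede the postal code.
my @region_patterns = (
    qr/[A-Z]{2}/,    # MI, CA, UK, ...
    qr/D\.C\./,      # the dotted form mentioned above
);
my $region = join '|', @region_patterns;

# Split text into (address, remainder) at the first "region digits"
# match; returns an empty list when no such match exists.
sub split_address {
    my ($text) = @_;
    return unless $text =~ /\A(.*?\b(?:$region) \d+)(.*)\z/s;
    return ($1, $2);
}

my ($address, $rest) =
    split_address('1 Main, Washington, D.C. 00000; someone@example.org');
print "address=$address\n" if defined $address;
```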
| |
| A. Sinan Unur 2004-12-08, 3:57 am |
| Alan Mead <amead@comcast.net> wrote in
news:pan.2004.12.08.04.29.10.851237@comcast.net:
> On Wed, 08 Dec 2004 04:04:53 +0000, A. Sinan Unur wrote:
>
....
> And so it does very nicely. I think you are making use of the fact
> that these all had a pair of capital letters near the end (including
> the convenient UK) but there is a 'D.C.' in my data and some other
> addresses outside the US (that lack this feature).
Actually, that pattern is standing in for some kind of country/state code
followed by a numeric postal code, because all your addresses seemed to
end with that.
The "two capital letters followed by some digits as end of mailing address
indicator" was one of the things that made the code kludgy.
I am sure others will provide better ways once the sun comes up. Good luck.
Sinan.