Author Web data clean up - Tie::File splice question(s)
Chris

2006-03-21, 9:58 pm

Hi:

I work for a small non-profit and was assigned a research and report
task. The problem in my case is legging it from research to report,
i.e. taking dynamic web page data and processing it for input into a
FileMaker database. I was blessed with this task because I know the
most Excel, a laughable qualification. Nevertheless, throw thyself
on the bungee stakes.

The code you see reflects just how new a beginner I am, as will my
questions, probably.

A brief description of the fetch-the-data activity:

We search a foundation database to find possible donors for our unlikely
adventure. When found, the prospective donor's web page is saved to
local disk. The listings returned have between, say, thirteen and forty
entries. The foundation database is a heavily Java- and PHP-driven site,
and a certain amount of armchair browsing is required to select which of
the offered search results to save. These conditions taken together have
guided me away from trying to spider in there.

Internal, organizational note:

We span generations: our seventy-year-olds feel comfortable with
FileMaker, which is fine; our hipster web designer doesn't know Perl; and
I suggested this site.


The data:

The foundation info returned is between thirteen and forty entries per
record. The order of the entries, if present, is standard. In general
form it has Name/Address information, Foundation information, and
sometimes sample grants. Happily, they are in a single-column table at
a standard depth on the page.

Each page will have:

Foundation Name

A blank line between Name/Address info and Foundation Info

EIN

And then maybe up to ten sample grants from that reporting year

So, I need to:

Extract the table data and insert placeholders for unused entries that
PHP didn't display when serving the page. So I tried the following:

#!/usr/bin/perl -w
use strict;
use HTML::TableExtract;
use Tie::File;
# use Data::Dumper; # just holding this in reserve as I don't know how to use it

my $content = "C:\\perl\\Foundations\\Barringer.htm"; # set target file
my $file    = "C:\\perl\\Foundations\\Barringer.txt"; # extracted text goes here

open(my $out, '>', $file) or die "can't open outfile: $!";
# so this is just to rename Barringer to .txt on write/close? Well, it worked.
# Though HTML::TableExtract kicks out text if HTML is not specified

my $te = HTML::TableExtract->new( depth => 5 ); # get that table
# parse_file opens the page itself, so no separate input handle is needed
$te->parse_file($content) or die "Can't open $content: $!";

foreach my $row ($te->rows) {
    print $out join(',', @$row), "\n";
}

close $out or die "could not close '$file': $!";

my @new = (" "); # one-space record to splice in where a field is missing

tie my @array, 'Tie::File', $file or die "couldn't tie?: $!";

# my $lvar;
# my $i=0;
# foreach $lvar (@array){
# print "$i is $lvar\n";
# $i++;
# } #prints to console and Tie worked

#Now, regular expressions to compare entries against full template
#and splice in blank lines if template entry missing

#for (@array) { #Start doing some regex tests
# if ($array[0] =~ /\s/){
# shift (@array);
# }
# }

# Test the record once. Wrapping this in "for (@array)" would repeat the
# same test for every record while the array is being changed.
if (defined $array[3] && $array[3] =~ /^p/i) {
    # Tie::File writes the line ending itself, so the record
    # does not need a trailing "\n"
    splice @array, 3, 0, @new;
}

my $i = 0;
foreach my $lvar (@array) {
    print "$i is $lvar\n";
    $i++;
}

untie @array;

___________________

And this is as far as I've gotten with the writing. It extracts the table
data I want. The challenge is to return the shorter web pages to the full
template. I seem to be able to move things with splice, though at the
moment I'm more or less guessing if and where it might take action.

As it's developing, it looks like I'm going to have forty
for (@array) { test - do something } blocks.

Questions:

1) Should I be doing this with a hash or an array? I've used an array
because the order is important to the final product, and Tie::File gave
me a way to insert into arrays (files).

2) What is the syntax for splice, or rather, what's happening in
splice @array, 3, 0, "some line or value";
Does "some line or value" come after $array[3] or displace it? Does the 0
simply mean no lines to replace/delete?

3) I imagine an array comparison between the tied array and a 'template
array', iterating the template over the tied array. Can I pass
$template_array[x] to a variable to use in the for tests, like:

for (@array) {
    if ($array[3] =~ $regex ) { # where $regex = say $template_array[2]
        splice @array, 3, 0, "\n"; # and [2] = /^p/i
    }
}

Well anyway, I think a loop within a loop will come into play.

4) Since each final array will eventually have 40 positions, should I
just start by expanding the initial record array and then process it further?

5) If I tried to solve this as an intersection of the arrays, resulting
in processing 'cases' perhaps, what would constitute a duplicate? Right
now I'm using very simple, literal tests of the first letter. Would two
entries starting with P be duplicates? I guess I'm asking: does
"duplicate" in array intersections refer to the value or the array position?
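
The splice question can be checked with a throwaway array. The following is only a minimal sketch: the array contents and the variable names @lines, @inserted, @replaced and @removed are made up for illustration and are not part of the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# splice ARRAY, OFFSET, LENGTH, LIST
# removes LENGTH elements starting at OFFSET and puts LIST in their place.
my @lines = ('name', 'street', 'city', 'phone', 'fax');

# LENGTH 0 removes nothing, so this is a pure insert.
# The new element lands AT index 3; the old $lines[3] moves up to index 4.
my @inserted = @lines;
splice @inserted, 3, 0, 'PLACEHOLDER';
print "insert:  @inserted\n";

# LENGTH 1 removes one element, so the old $lines[3] is replaced.
# splice returns whatever it removed.
my @replaced = @lines;
my @removed  = splice @replaced, 3, 1, 'PLACEHOLDER';
print "replace: @replaced (removed: @removed)\n";
```

So with a LENGTH of 0 the new value goes in front of the element that used to be at index 3, and nothing is deleted.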

Wish list - a little function to grab the first one or two letters of
each record in the tied array?
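
The wish-list item could probably be covered with map and substr. This is a sketch only: leading_chars is a hypothetical helper name, and the sample records are invented.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the first $n characters of every record.
sub leading_chars {
    my ($n, @records) = @_;
    return map { substr($_, 0, $n) } @records;
}

# Invented sample records; the last one is an empty placeholder line.
my @records  = ('Telephone: 555-0100', 'EIN: 00-0000000', '');
my @prefixes = leading_chars(2, @records);
print join('|', @prefixes), "\n";
```

Because a tied Tie::File array behaves like an ordinary array when read, a call such as leading_chars(2, @array) should work on it as well.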

Thanks for taking a look at this.

Chris
Chris

2006-03-26, 3:57 am

Well, I guess I'll just document my progress, or flailing about.

#!/usr/bin/perl -w

use strict;
use HTML::TableExtract;
use Data::Dumper;

#This accomplishes first leg, open HTML page, extract given table,
#and write back to a text file for further processing
#Three Days

my $content = "C:\\perl\\foundations\\camp.htm"; # target page
my $process = "C:\\perl\\Foundations\\camp.txt"; # extracted text

open(my $out, '>', $process) or die "couldn't open '$process': $!";

my $te = HTML::TableExtract->new( depth => 5 );
# parse_file opens the page itself, so no separate input handle is needed
$te->parse_file($content) or die "Can't open '$content': $!";

foreach my $row ($te->rows) {
    print $out join(',', @$row), "\n";
}

close $out or die "could not close '$process': $!";
----------------------
This is a direction I am thinking about taking to build back into the
template that I'm interpolating. My hope is to figure out
Array::Compare and see if I can feed the returned info into a Tie::File
splice and voila. I assume it will come to more than that so I'm also
looking at Alak or some sort of game tree approach without jumping.
--------------------------
#!/usr/bin/perl -w
use strict;
use Array::Compare;
use Data::Dumper;

my @raw_meme = ();

my $data_file = "c:\\perl\\foundations\\camp.txt";
open(my $dat, '<', $data_file) or die "couldn't open $data_file: $!";

while ( my $line = <$dat> ) {
    push @raw_meme, substr($line, 0, 5); # first five characters of each line
}
close $dat;

$#raw_meme = 39; # the standard size: 40 entries, indexes 0..39

#foreach $meme (@raw_data) {
# @raw_meme = substr($meme, 0, 3);
# $h++;
# }

# my $lvar;
# my $i=0;
# foreach $lvar (@raw_meme){
# print "$i is $lvar\n";
# $i++;
# }

# Single quotes keep the backslashes; in double quotes "\s" would become "s".
my @template = (
    'm/\s/', 'm/\s/', 'm/^\w/', 'm/^\(form/', 'm/^c\/o\s./',
    'm/^P\.O\.B/', 'm/\d\.\.\.\./', 'm/\Z\.\d\d\d\d/',
    'Telep', 'Conta', '/^FAX../', 'E-Mail', 'Appli', 'm/\s/',
    'Donor', 'Type ', 'Backg', 'Purpo', 'Progr', 'Field',
    'Geogr', 'Types', 'Limit', 'Publi', 'Appli', 'Offic',
    'Finan', 'EIN: ', 'Selec',
    ('/^\$\d\d\.\d\d/') x 10, # up to ten sample grants
);

# my $lvar;
# my $i=0;
# foreach $lvar (@template){
# print "$i is $lvar\n";
# $i++;
# }

my $comp = Array::Compare->new(DefFull => 1);
# with DefFull set, compare() in list context should return the
# indexes of the positions that differ
my @diffs = $comp->compare(\@raw_meme, \@template); # full comparison

# my $lvar;
# my $i=0;
# foreach $lvar (@diffs){
# print "$i is $lvar\n";
# $i++;
# }

print "differing positions: @diffs\n";
--------------------
Still don't know if those patterns are going to match as patterns in the
array comparison. So, more reading on Lists, eq, & etc.
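
On whether the patterns will match as patterns: a quoted string such as "m/^\w/" is just text, and as far as I understand Array::Compare it tests elements for string equality rather than applying them as regular expressions. One possible alternative, sketched here with core Perl only, is to store compiled patterns with qr// and walk the template against the records. Everything in this sketch is an assumption for illustration: the shortened five-slot template, the fill_template name, the sample page, and the choice of an empty string as the placeholder.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical, shortened template: one compiled pattern per expected
# slot, in page order. qr// builds a pattern that can live in an array.
my @template = (
    qr/^\w/,          # foundation name
    qr/^Telephone/i,  # telephone
    qr/^FAX/i,        # fax
    qr/^E-Mail/i,     # e-mail
    qr/^EIN:/,        # EIN
);

# Walk the template and the records side by side. If the next record
# matches the slot's pattern, use it; otherwise emit an empty placeholder
# and keep that record for the following slot.
sub fill_template {
    my ($template, @records) = @_;
    my @full;
    for my $pattern (@$template) {
        if (@records && $records[0] =~ $pattern) {
            push @full, shift @records;
        }
        else {
            push @full, '';
        }
    }
    return @full;
}

# Invented sample page that lacks the FAX and E-Mail lines.
my @page = ('Example Foundation', 'Telephone: 555-0100', 'EIN: 00-0000000');
my @full = fill_template(\@template, @page);
printf "%d: %s\n", $_, $full[$_] for 0 .. $#full;
```

This only works when each pattern is specific enough not to match a later entry. A loose pattern like qr/^\w/ would swallow the telephone line if the name were missing, so the real template would need tighter tests.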

Any thoughts would be appreciated.

Chris
flicka@ix.netcom.com



