For Programmers: Free Programming Magazines  


Home > Archive > PERL Beginners > June 2006 > Another (hopefully more clear) plea for help with CSV data









You are viewing an archived Text-only version of the thread. To view this thread in it's original format and/or if you want to reply to this thread please [click here]

 

Author Another (hopefully more clear) plea for help with CSV data
Ralph H. Stoos Jr.

2006-06-24, 8:01 am

All,

I have been asked to do something in Access or Excel which I find
disagreeable. Open Source is the way to go and also, PERL is more
flexible. I might even obtain a little professional development on this.

My last post seemed to confuse folks (thanks to all who responded). I
will try to clarify.

Machine,PurgeSuccess,PurgePrepared,Purge
Started,PurgeCauseFaultID,PurgeModId,Pur
geStopModId,ModIdJobCfg,NameJobCfg,Full_
cfg,FinisherCfg,plex,PPM,PropID,AtreeNod
eID,JobID,MediaID,Width,Height,Color,Wei
ght,Caliper,Drilled,Finish,Grain,Coating
Front,CoatingBa
ck,SW,additional_info,debuglog
1125785731,N,Y,Y,927,6,5,_2_3_4_5_6,_SFM
20_IOT7_SFM7_BFM20_BFM2,_SFM20_IOT7_SFM7
_BFM20_BFM2,DUAL_BFM,Simplex,120,44366,1
228,392,527,279400,431800,white,75,104,F
ALSE,regular,y,none,none,RV0.6.5.27,,DebugMsgLog.2006_05_24.07_48_00
1125785731,Y,Y,N,1003,6,,_2_3_4_5_6,_SFM
20_IOT7_SFM7_BFM20_BFM2,_SFM20_IOT7_SFM7
_BFM20_BFM2,DUAL_BFM,Duplex,120,69206,75
,408,29,279400,431800,white,75,104,FALSE
,regular,y,none,none,RV0.6.5.27,,DebugMsgLog.2006_05_31.14_33_25A


Above is the same slice of data. It all lines up in neat little columns
when you drop it into a spreadsheet. Admittedly, it could be plopped
into all sorts of database programs, but I think doing the PERL thing
will make it more flexible and easier to change.

OK, so here is the task. The first row I want to turn into variable
names. Machine, PurgeSuccess, etc. Then with that removed the real work
happens. The script would then request which variables I would like to
sort on or find records (rows) that contain the "matches" specified.

As an example, of how it should work, lets say I wanted to find all
records that had a specific value in the Machine column that also
contained another specific value in the Purge Success column. The
script would parse each row to see if there were matches in the
"record". Iterating through the entire file (could be as many as 15000
records), the matching records (complete rows) should be written to a
new CSV file with either the name created automatically or entered by
the user.

Processing
for each row:
if Machine = "blah" AND Purge Success = "Y" and PPM = "120 etc.


The original file would be left intact for another round of parsing with
different parameters.

Hope this makes it clear. I really would like to get this put up on one
of my Linux servers so that the developers could use it to track trends
in the data. I would just append new data to the object file as it is
available. TRying hard to learn PERL as my first language. Have found
the PERL doc to be excellent and I have purchased three PERL books
hoping to be able to do this mtself and support it over time. I even
bought two Python books to research which scripting language might
handle it better. PERL, being the senior language, should in theory
have the most capability. Looked at Ruby too, but don't want the RAILS
overhead.

Thanks a bunch for any help and accept my apologies for the HUGE post.

Regards,

Ralph (the total noob)
Dr.Ruud

2006-06-24, 8:01 am

"Ralph H. Stoos Jr." schreef:

> Machine,PurgeSuccess,PurgePrepared,Purge
Started,...
> 1125785731,N,Y,Y,...
> 1125785731,Y,Y,N,....
>
> Above is the same slice of data. It all lines up in neat little
> columns when you drop it into a spreadsheet. Admittedly, it could be
> plopped into all sorts of database programs, but I think doing the
> PERL thing will make it more flexible and easier to change.


Perl (mind the spelling) and databases work very well together.


> OK, so here is the task. The first row I want to turn into variable
> names.


Why? I would use an array of arrays (or maybe a hash of arrays), but
only if a database was not possible.


> Machine, PurgeSuccess, etc. Then with that removed the real
> work happens. The script would then request which variables I would
> like to sort on or find records (rows) that contain the "matches"
> specified.


A perfect job for an RDBMS.


> As an example, of how it should work, lets say I wanted to find all
> records that had a specific value in the Machine column that also
> contained another specific value in the Purge Success column. The
> script would parse each row to see if there were matches in the
> "record". Iterating through the entire file (could be as many as
> 15000 records), the matching records (complete rows) should be
> written to a new CSV file with either the name created automatically
> or entered by the user.


With 15000 records the data would only be a few megabytes, so yes, it is
possible to do it in memory.


> Processing
> for each row:
> if Machine = "blah" AND Purge Success = "Y" and PPM = "120 etc.


To save memory, another way would be to access the data from file, line
by line, and test the conditions on the go.


> The original file would be left intact for another round of parsing
> with different parameters.
>
> Hope this makes it clear. I really would like to get this put up on
> one of my Linux servers so that the developers could use it to track
> trends in the data. I would just append new data to the object file
> as it is available. TRying hard to learn PERL as my first language.
> Have found the PERL doc to be excellent and I have purchased three
> PERL books hoping to be able to do this mtself and support it over
> time. I even bought two Python books to research which scripting
> language might handle it better. PERL, being the senior language,
> should in theory have the most capability. Looked at Ruby too, but
> don't want the RAILS overhead.


Search on CPAN for CSV:
http://search.cpan.org/search?m=module&q=CSV&s=1&n=100

Maybe this one is best for your situation:
http://search.cpan.org/~bigj/Tie-CS....21/CSV_File.pm

Use a hash to convert colum-name to column-number.

--
Affijn, Ruud

"Gewoon is een tijger."


Charles K. Clarkson

2006-06-24, 8:01 am

Ralph H. Stoos Jr. wrote:

Perl is a name, not an acronym. Write Perl or perl, but never
PERL (unless you are yelling).


: OK, so here is the task. The first row I want to turn into
: variable names. Machine, PurgeSuccess, etc.

You probably don't want to do that. You may end up needing
symbolic references to access those variables. Using symbolic
references is frowned on. Another approach might use a hash which
has keys that match the column names. The values associated with
those keys would reference arrays of values from all the records.


: Then with that removed the real work happens. The script would
: then request which variables I would like to sort on or find
: records (rows) that contain the "matches" specified.

Are you going to rewrite the script for each different record
search? How often will different searches be needed? Who decides
the need for new searches (you, the boss, other programmers, users,
etc.)?


HTH,

Charles K. Clarkson
--
Mobile Homes Specialist
Free Market Advocate
Web Programmer

254 968-8328

Don't tread on my bandwidth. Trim your posts.

Chad Perrin

2006-06-24, 8:01 am

On Thu, Jun 22, 2006 at 03:07:32AM -0500, Charles K. Clarkson wrote:
>
> : OK, so here is the task. The first row I want to turn into
> : variable names. Machine, PurgeSuccess, etc.
>
> You probably don't want to do that. You may end up needing
> symbolic references to access those variables. Using symbolic
> references is frowned on. Another approach might use a hash which
> has keys that match the column names. The values associated with
> those keys would reference arrays of values from all the records.


I must be missing something. Please explain the sentence "You may end
up needing symbolic references to access those variables."


>
> : Then with that removed the real work happens. The script would
> : then request which variables I would like to sort on or find
> : records (rows) that contain the "matches" specified.
>
> Are you going to rewrite the script for each different record
> search? How often will different searches be needed? Who decides
> the need for new searches (you, the boss, other programmers, users,
> etc.)?


Why would he need to rewrite the script? Why couldn't it just loop
through input data to (for instance) create an array containing each
row, starting with the column headings row, then shift from each array
to name new arrays from the column headings row and populate it in order
from the other rows? I'm not saying that's necessarily the best way to
do it, but it seems like a reasonable example of how the script could
just dynamically handle different data from each run if need be without
having to rewrite it each time. Some search functions could be created
that simply take input that allows for search terms to be specified when
the script is run. Voila, it works. Did I misunderstand your point?

Of course, this seems like a lot of unnecessary work, since it basically
involves writing one's own simple DBMS query system, but it would seem
to suit the OP's requirements.

--
CCD CopyWrite Chad Perrin [ http://ccd.apotheon.org ]
"The measure on a man's real character is what he would do
if he knew he would never be found out." - Thomas McCauley
DJ Stunks

2006-06-24, 8:01 am


Ralph H. Stoos Jr. wrote:
> All,
>
> I have been asked to do something in Access or Excel which I find
> disagreeable. Open Source is the way to go and also, PERL is more
> flexible. I might even obtain a little professional development on this.
>
> My last post seemed to confuse folks (thanks to all who responded). I
> will try to clarify.
>
> Machine,PurgeSuccess,PurgePrepared,Purge
Started,PurgeCauseFaultID,PurgeModId,Pur
geStopModId,ModIdJobCfg,NameJobCfg,Full_
cfg,FinisherCfg,plex,PPM,PropID,AtreeNod
eID,JobID,MediaID,Width,Height,Color,Wei
ght,Caliper,Drilled,Finish,Grain,Coating
Front,Coating

Back,SW,additional_info,debuglog
> 1125785731,N,Y,Y,927,6,5,_2_3_4_5_6,_SFM
20_IOT7_SFM7_BFM20_BFM2,_SFM20_IOT7_SFM7
_BFM20_BFM2,DUAL_BFM,Simplex,120,44366,1
228,392,527,279400,431800,white,75,104,F
ALSE,regular,y,none,none,RV0.6.5.27,,DebugMsgLog.2006_05_24.07_48_00
> 1125785731,Y,Y,N,1003,6,,_2_3_4_5_6,_SFM
20_IOT7_SFM7_BFM20_BFM2,_SFM20_IOT7_SFM7
_BFM20_BFM2,DUAL_BFM,Duplex,120,69206,75
,408,29,279400,431800,white,75,104,FALSE
,regular,y,none,none,RV0.6.5.27,,DebugMsgLog.2006_05_31.14_33_25A
>
>
> Above is the same slice of data. It all lines up in neat little columns
> when you drop it into a spreadsheet. Admittedly, it could be plopped
> into all sorts of database programs, but I think doing the PERL thing
> will make it more flexible and easier to change.
>
> OK, so here is the task. The first row I want to turn into variable
> names. Machine, PurgeSuccess, etc. Then with that removed the real work
> happens. The script would then request which variables I would like to
> sort on or find records (rows) that contain the "matches" specified.
>
> As an example, of how it should work, lets say I wanted to find all
> records that had a specific value in the Machine column that also
> contained another specific value in the Purge Success column. The
> script would parse each row to see if there were matches in the
> "record". Iterating through the entire file (could be as many as 15000
> records), the matching records (complete rows) should be written to a
> new CSV file with either the name created automatically or entered by
> the user.
>
> Processing
> for each row:
> if Machine = "blah" AND Purge Success = "Y" and PPM = "120 etc.
>
>
> The original file would be left intact for another round of parsing with
> different parameters.
>
> Hope this makes it clear. I really would like to get this put up on one
> of my Linux servers so that the developers could use it to track trends
> in the data. I would just append new data to the object file as it is
> available. TRying hard to learn PERL as my first language. Have found
> the PERL doc to be excellent and I have purchased three PERL books
> hoping to be able to do this mtself and support it over time. I even
> bought two Python books to research which scripting language might
> handle it better. PERL, being the senior language, should in theory
> have the most capability. Looked at Ruby too, but don't want the RAILS
> overhead.
>
> Thanks a bunch for any help and accept my apologies for the HUGE post.
>
> Regards,
>
> Ralph (the total noob)


use DBD::CSV and manipulate your data using the column headings via
SQL.

-jp

Timothy Johnson

2006-06-24, 8:01 am

Here's one thought:

Why not create a hash of arrays? Each hash key would be a column name,
and each record would be spread across the arrays using the same array
index. If you really wanted to get fancy you could create a hash of
cached results to speed up future queries. I wouldn't recommend it over
using an RDBMS or even Excel, but as an exercise it can be done. =20

As a side note, all religious convictions about Open Source aside, using
DBI and Access would be a decent solution to a problem like this, since
it sounds like you already own it. If you wanted to use Linux
exclusively, though there are plenty of other options available to you,
like MySQL. =20




-----Original Message-----
From: Ralph H. Stoos Jr. [mailto:rstoos@rochester.rr.com]=20
Sent: Wednesday, June 21, 2006 9:34 PM
To: beginners@perl.org
Subject: Another (hopefully more clear) plea for help with CSV data

All,

I have been asked to do something in Access or Excel which I find
disagreeable. Open Source is the way to go and also, PERL is more
flexible. I might even obtain a little professional development on
this.

My last post seemed to confuse folks (thanks to all who responded). I
will try to clarify.

Machine,PurgeSuccess,PurgePrepared,Purge
Started,PurgeCauseFaultID,PurgeM
odId,PurgeStopModId,ModIdJobCfg,NameJobC
fg,Full_cfg,FinisherCfg,plex,PPM
,PropID,AtreeNodeID,JobID,MediaID,Width,
Height,Color,Weight,Caliper,Dril
led,Finish,Grain,CoatingFront,CoatingBac
k,SW,additional_info,debuglog
1125785731,N,Y,Y,927,6,5,_2_3_4_5_6,_SFM
20_IOT7_SFM7_BFM20_BFM2,_SFM20_I
OT7_SFM7_BFM20_BFM2,DUAL_BFM,Simplex,120
,44366,1228,392,527,279400,43180
0,white,75,104,FALSE,regular,y,none,none
,RV0.6.5.27,,DebugMsgLog.2006_05
_24.07_48_00
1125785731,Y,Y,N,1003,6,,_2_3_4_5_6,_SFM
20_IOT7_SFM7_BFM20_BFM2,_SFM20_I
OT7_SFM7_BFM20_BFM2,DUAL_BFM,Duplex,120,
69206,75,408,29,279400,431800,wh
ite,75,104,FALSE,regular,y,none,none,RV0
.6.5.27,,DebugMsgLog.2006_05_31.
14_33_25A


<snip>long explanation of searchable records problem</snip>


Charles K. Clarkson

2006-06-24, 8:01 am

Chad Perrin wrote:

: On Thu, Jun 22, 2006 at 03:07:32AM -0500, Charles K. Clarkson wrote:
::
::: OK, so here is the task. The first row I want to turn into
::: variable names. Machine, PurgeSuccess, etc.
::
:: You probably don't want to do that. You may end up needing
:: symbolic references to access those variables. Using symbolic
:: references is frowned on. Another approach might use a hash which
:: has keys that match the column names. The values associated with
:: those keys would reference arrays of values from all the records.
:
: I must be missing something. Please explain the sentence "You
: may end up needing symbolic references to access those variables."


Ralph wants to create variables named as columns. Since each
variable will hold an array, he will probably go with these.

@Machine,
@PurgeSuccess,
@PurgePrepared,
@PurgeStarted,
@PurgeCauseFaultID,
@PurgeModId,
@PurgeStopModId,
@ModIdJobCfg,
@NameJobCfg,
@Full_cfg,
@FinisherCfg,
@plex,
@PPM,
@PropID,
@AtreeNodeID,
@JobID,
@MediaID,
@Width,
@Height,
@Color, and
@Weight

Assume Ralph writes a generic program which allows the user to
define searches using column names. Problem: How do we use the
variable with only the column name? Assuming the column name is in
variable named $column_name, this comes to mind. Note that
@ModIdJobCfg cannot be lexically scoped. Neither can the other 20
variables above.

# For testing. Normally comes from user.
my $column_name = 'ModIdJobCfg';

# For testing. This data would normally come from a file.
our @ModIdJobCfg = 1 .. 3;

{
no strict 'refs';

print "$_\n" foreach @{ $column_name };
}


If we used a hash of arrays, we would be able to avoid
symbolic references. Note that we can also test for column name
availability easier with the hash as well.

# For testing. Normally comes from user.
my $column_name = 'ModIdJobCfg';

# For testing. This data would normally come from a file and
# contain more keys.
my %jobs = (
ModIdJobCfg => [ 1 .. 3 ],
);

print "$_\n" foreach @{ $jobs{ $column_name } };



: : Are you going to rewrite the script for each different
: : record search? How often will different searches be needed?
: : Who decides the need for new searches (you, the boss, other
: : programmers, users, etc.)?
:
: Why would he need to rewrite the script?

To do a significantly different search. If he's not rewriting
the script than he is creating his own query lanqage.


: Why couldn't it just loop through input data to (for instance)
: create an array containing each row, starting with the column
: headings row, then shift from each array to name new arrays from
: the column headings row and populate it in order from the other
: rows?

I was asking questions, not making statements. If the searches
are going to change regularly and if non perl programmers are
going to do the searches, then Ralph is basically creating his
own SQL language and a database may be a better solution. If only
perl programmers are going to do the searches then they can just
rewrite the script each time they do new search. I don't know what
Ralph needs, hence the reason for my questions.


: I'm not saying that's necessarily the best way to do it, but it
: seems like a reasonable example of how the script could just
: dynamically handle different data from each run if need be
: without having to rewrite it each time. Some search functions
: could be created that simply take input that allows for search
: terms to be specified when the script is run. Voila, it works.
: Did I misunderstand your point?

I wasn't making a point. I was asking for more information.
That's why all the sentences had question marks at the end.

HTH,

Charles K. Clarkson
--
Mobile Homes Specialist
Free Market Advocate
Web Programmer

254 968-8328

Don't tread on my bandwidth. Trim your posts.

Todd W

2006-06-24, 8:01 am


""Ralph H. Stoos Jr."" <rstoos@rochester.rr.com> wrote in message
news:449A1DA2.40003@rochester.rr.com...
> All,
>
> I have been asked to do something in Access or Excel which I find
> disagreeable. Open Source is the way to go and also, PERL is more
> flexible. I might even obtain a little professional development on this.
>
> My last post seemed to confuse folks (thanks to all who responded). I
> will try to clarify.
>

<snip data>
>
> OK, so here is the task. The first row I want to turn into variable
> names. Machine, PurgeSuccess, etc. Then with that removed the real work
> happens. The script would then request which variables I would like to
> sort on or find records (rows) that contain the "matches" specified.
>


Ralph,

The other suggestions are correct, you want this data in a rdbms. But perl
has modules that treat csv files as a database:

$ cat driver.pl
use warnings;
use strict;

use DBD::CSV;
my $dbh = DBI->connect("DBI:CSV:f_dir=./shop;csv_eol=\n;");

my $sql = 'select * from machines';
my $sth = $dbh->prepare( $sql );
$sth->execute;

while ( my $row = $sth->fetchrow_hashref ) {
print( "$row->{Machine}: $row->{NameJobCfg}\n");
}

using your data, I get this:

$ perl driver.pl
1125785731: _SFM20_IOT7_SFM7_BFM20_BFM2
1125785731: _SFM20_IOT7_SFM7_BFM20_BFM2

In other words, do NOT write a parser for the file. Your client will loose
interest long before you get it right. Its a wheel that has already been
invented.

DBD::CSV (a la SQL::Statement) allows arbitrairily complex WHERE and ORDER
BY clauses for filtering and sorting.

Here is how to set up the directory structure for the above program:

$ ls -l
total 8
-rw-rw-r-- 1 trwww trwww 285 Jun 23 12:26 driver.pl
drwxrwxr-x 2 trwww trwww 4096 Jun 23 12:12 shop
$ ls -l shop
total 4
-rw-rw-r-- 1 trwww trwww 741 Jun 23 12:04 machines

driver.pl is the program above. The file shop/machines is your csv file.

Todd W.


Sponsored Links







Also available: Server administration forum archive | Web Design forum archive | Software forum archive | Hardware reviews archive

Copyright 2008 codecomments.com