

Home > Archive > PERL Beginners > February 2006 > Extract multiple lines










 

Author Extract multiple lines
Jack Daniels

2006-02-23, 3:55 am

It's driving me bonkers and I can't afford any more psychiatric bills. The data
is a .txt file saved while viewing records on a website. The vendor will not give
us an actual file even though we paid a monthly fee for use of the database.
I have around 5000 records that need to be converted to MARC cataloging
records. I need to either have the data from each heading on 1 line or have
the script extract each heading and all the subsequent lines.

The script is only extracting the first line of each heading. I can only
have 1 blank line between each record, which works in the script. If I right
click then import to Excel when viewing the records at the website, each
heading is a continuous string, which is what I need. I can then save as a
tab delimited file and the lines for each heading remain continuous, which
works. But we have ceased our subscription and I now only have saved .txt
files of the 5000 records. I can't figure out how and where to modify the
script to work on the files. I suppose I could spend a couple months
manually joining lines, but that really cuts into naptime.

Sample data:


Title 10 fastest growing careers: jobs for the future part four
business
and computer technology (03616)
Physical Color; Sound; 15 minutes
Copyrighted 1990
Producer GUIDANCE ASSOCIATES (GUID)
Dewey 371.425
Synopsis Contents: The business community depends on up-to-the minute
technology - technology that is changing rapidly. As a result, careers
in
technology, especially computers and specialized areas such as
accounting
are much in demand. Takes a look at three business and computer
careers:
software engineering, computer programming and accounting.
Subjects CAREER GUIDANCE; CAREER SERVICES
Holdings
1/2 VHS video: Head Office, 1 copy



Title 10 fastest growing careers: jobs for the future part one legal
and
health (03613)
Physical Color; Sound; 15 minutes
Copyrighted 1990
Producer GUIDANCE ASSOCIATES (GUID)
Dewey 371.425
Synopsis Contents: Takes a look at the fast growing health and legal
fields. Talks to a registered nurse about her changing role in a major
hospital, a physician's assistant who works with two doctors in a busy
family practice, and a paralegal who works with an attorney.
Subjects CAREER GUIDANCE; CAREER SERVICES
Holdings
1/2 VHS video: Head Office, 1 copy


HERE IS THE SCRIPT

open(MYINPUTFILE, "<1000chomp.txt"); # open for input

my(@lines) = <MYINPUTFILE>; # read file into list


my $title;
my $series;
my $subjects;
my $physical;
my $synopsis;
my $producer;
my $copyrighted;
my $dewey;
for my $line (@lines)
{

    $line =~ /Title/ and $title = $line;
    $line =~ /Title/ and print "=LDR 00000nam 2200000Ia 45e0\n","=245 00\$a",$line;

    $line =~ /Dewey/ and $dewey = $line;
    $line =~ /Dewey/ and print "=082 \\\\\$a",$line;

    $line =~ /Producer/ and $producer = $line;
    $line =~ /Producer/ and print "=040 \\\\\$aCaSRRI\n","=260 \\\\\$a",$line;

    $line =~ /Copyrighted/ and $copyrighted = $line;
    $line =~ /Copyrighted/ and print "=261 \\\\\$c",$line;

    $line =~ /Physical/ and $physical = $line;
    $line =~ /Physical/ and print "=300 \\\\\$a1 videocassette ( min.) :\$bsd., col. ;\$c13 mm.",$line;

    $line =~ /Series/ and $series = $line;
    $line =~ /Series/ and print "=440 0\\\$a",$line;

    $line =~ /Synopsis/ and $synopsis = $line;
    $line =~ /Synopsis/ and print "=520 \\\\\$a",$line;

    $line =~ /Subjects/ and $subjects = $line;
    $line =~ /Subjects/ and print "=550 \\\\\$a",$line,"\n";
}



Hans Meier

2006-02-23, 7:55 am

Jack Daniels (Butch) wrote on Thursday, 23 February 2006, 10:30:
> It's driving me bonkers and I can't afford any more psychiatric bills. The
> data is a .txt file saved while viewing records on a website. The vendor will not
> give us an actual file even though we paid a monthly fee for use of the
> database. I have around 5000 records that need to be converted to MARC
> cataloging records. I need to either have the data from each heading on 1
> line or have the script extract each heading and all the subsequent lines.
>
> The script is only extracting the first line of the heading..


Yes, with every pass through the loop over @lines, you overwrite your variables $title to $dewey.

> I can only
> have 1 blank line between each record which works in the script. If I right
> click then import to excel when viewing the records at the website, each
> heading is a continuous string, which is what I need. I can then save as a
> tab delimited file and the lines for each heading remain continuous, which
> works. But we have ceased our subscription and I now only have saved .txt
> files of the 5000 records. I can't figure out how and where to modify the
> script to work on the files. I suppose I could spend a couple months
> manually joining lines, but that really cuts into naptime.


I don't know what MARC cataloging records are, my English is not good enough to understand exactly what you mean, and I don't know whether the leading spaces on every line below are part of the sample data. Still, it may help you to produce a CSV file from the data.

So you can adjust my script below, or wait for the better solutions that will undoubtedly arrive:

[...]

> HERE IS THE SCRIPT
>
> open(MYINPUTFILE, "<1000chomp.txt"); # open for input
>
> my(@lines) = <MYINPUTFILE>; # read file into list
>
>
> my $title;
> my $series;
> my $subjects;
> my $physical;
> my $synopsis;
> my $producer;
> my $copyrighted;
> my $dewey;
> for my $line (@lines)
> {
>
>     $line =~ /Title/ and $title = $line;
>     $line =~ /Title/ and print "=LDR 00000nam 2200000Ia 45e0\n","=245 00\$a",$line;
>
>     $line =~ /Dewey/ and $dewey = $line;
>     $line =~ /Dewey/ and print "=082 \\\\\$a",$line;
>
>     $line =~ /Producer/ and $producer = $line;
>     $line =~ /Producer/ and print "=040 \\\\\$aCaSRRI\n","=260 \\\\\$a",$line;
>
>     $line =~ /Copyrighted/ and $copyrighted = $line;
>     $line =~ /Copyrighted/ and print "=261 \\\\\$c",$line;
>
>     $line =~ /Physical/ and $physical = $line;
>     $line =~ /Physical/ and print "=300 \\\\\$a1 videocassette ( min.) :\$bsd., col. ;\$c13 mm.",$line;
>
>     $line =~ /Series/ and $series = $line;
>     $line =~ /Series/ and print "=440 0\\\$a",$line;
>
>     $line =~ /Synopsis/ and $synopsis = $line;
>     $line =~ /Synopsis/ and print "=520 \\\\\$a",$line;
>
>     $line =~ /Subjects/ and $subjects = $line;
>     $line =~ /Subjects/ and print "=550 \\\\\$a",$line,"\n";


========================
#!/usr/bin/perl
use strict;
use warnings;


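local $/ = ""; # paragraph mode: split data at 1..n empty lines

# btw: Series does not occur in the sample data
# A heading only counts at the start of a line (after optional indentation),
# so the same word inside a synopsis is not mistaken for a heading.
my $stops = qr/^[ \t]*(Title|Physical|Copyrighted|Producer|Dewey|Synopsis|Subjects|Series)\b/m;

for my $record (<DATA>) {

    # split returns: leading field, heading, value, heading, value, ...
    my @pairs = split /$stops/, $record;
    shift @pairs; # remove the (normally empty) field before the first heading

    my %keyed = @pairs;

    # join the continuation lines of every value into one line
    $keyed{$_} =~ s/\s+/ /g for keys %keyed;

    # now you have one record as key/one-line-value pairs
    # for further processing, see:

    print join "\n", map {"$_=>$keyed{$_}"} keys %keyed;
    print "\n\n";

    # you could sort it, produce a CSV-file, ...
}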


__DATA__
Title 10 fastest growing careers: jobs for the future part four
business
and computer technology (03616)
Physical Color; Sound; 15 minutes
Copyrighted 1990
Producer GUIDANCE ASSOCIATES (GUID)
Dewey 371.425
Synopsis Contents: The business community depends on up-to-the minute
technology - technology that is changing rapidly. As a result, careers
in
technology, especially computers and specialized areas such as
accounting
are much in demand. Takes a look at three business and computer
careers:
software engineering, computer programming and accounting.
Subjects CAREER GUIDANCE; CAREER SERVICES
Holdings
1/2 VHS video: Head Office, 1 copy



Title 10 fastest growing careers: jobs for the future part one legal
and
health (03613)
Physical Color; Sound; 15 minutes
Copyrighted 1990
Producer GUIDANCE ASSOCIATES (GUID)
Dewey 371.425
Synopsis Contents: Takes a look at the fast growing health and legal
fields. Talks to a registered nurse about her changing role in a major
hospital, a physician's assistant who works with two doctors in a busy
family practice, and a paralegal who works with an attorney.
Subjects CAREER GUIDANCE; CAREER SERVICES
Holdings
1/2 VHS video: Head Office, 1 copy
=============
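
The CSV idea from the last comment of the script above could look roughly like this. It is only a sketch: the column order in @columns, the sub name record_to_tsv and the choice of a tab as separator are assumptions for illustration, not part of the posted script.

========================
use strict;
use warnings;

# Assumed column order; headings missing from a record become empty fields.
my @columns = qw(Title Physical Copyrighted Producer Dewey Synopsis Subjects Series);

# Hypothetical helper: turn one heading/value record into a tab-delimited line.
sub record_to_tsv {
    my (%keyed) = @_;
    my @fields;
    for my $col (@columns) {
        my $value = defined $keyed{$col} ? $keyed{$col} : '';
        $value =~ s/\s+/ /g;     # collapse line breaks and tabs to one space
        $value =~ s/^ | $//g;    # trim the single leading/trailing space
        push @fields, $value;
    }
    return join "\t", @fields;
}

print record_to_tsv(
    Title => " 10 fastest growing careers: jobs for the future part one legal\nand\nhealth (03613)\n",
    Dewey => " 371.425\n",
), "\n";
=============

Inside the record loop this would be called as record_to_tsv(%keyed), giving one line per record that a spreadsheet can import.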
Hans Meier

2006-02-23, 7:55 am

Hans Meier (John Doe) wrote on Thursday, 23 February 2006, 13:07:
[...]
sorry for replying to myself...
>
> Yes, with every pass through the loop over @lines, you overwrite your
> variables $title to $dewey.


This doesn't matter, since you print the contents before overwriting them.
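
What does matter is that a continuation line matches none of the heading patterns, so the loop prints nothing for it. A minimal sketch with three lines from the sample data; the sub heading_of and the array @sample are made up for illustration, and unlike the original script the pattern is anchored to the start of the line:

========================
use strict;
use warnings;

# Hypothetical helper: returns the heading a line starts with,
# or undef for a continuation line.
sub heading_of {
    my ($line) = @_;
    return $line =~ /^\s*(Title|Dewey|Producer|Copyrighted|Physical|Series|Synopsis|Subjects)\b/
        ? $1
        : undef;
}

my @sample = (
    "Title 10 fastest growing careers: jobs for the future part four",
    "business",
    "and computer technology (03616)",
);

for my $line (@sample) {
    my $heading = heading_of($line);
    printf "%-12s %s\n", defined $heading ? $heading : '(no match)', $line;
}
=============

Only the first line is recognised, so a fix has to remember the current heading and append the unmatched lines to it.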

Should get some sleep....

Greetings
Hans
John W. Krahn

2006-02-23, 9:56 pm

Jack Daniels (Butch) wrote:
> It's driving me bonkers and I can't afford any more psychiatric bills. The data
> is a .txt file saved while viewing records on a website. The vendor will not give
> us an actual file even though we paid a monthly fee for use of the database.
> I have around 5000 records that need to be converted to MARC cataloging
> records. I need to either have the data from each heading on 1 line or have
> the script extract each heading and all the subsequent lines.



This appears to do what you want:


use warnings;
use strict;

# The headers in the correct order for printing
my @ordered = qw[ Title Dewey Producer Copyrighted Physical Series Synopsis
Subjects ];
my $alternation = join '|', @ordered;


my %prepend = (
Title => "=LDR 00000nam 2200000Ia 45e0\n=24500\$a",
Dewey => "=082 \\\\\$a",
Producer => "=040 \\\\\$aCaSRRI\n=260\\\\\$a",
Copyrighted => "=261 \\\\\$c",
Physical => "=300 \\\\\$a1 videocassette ( min.):\$bsd., col. ;\$c13 mm.",
Series => "=440 0\\\$a",
Synopsis => "=520 \\\\\$a",
Subjects => "=550 \\\\\$a",
);


open MYINPUTFILE, '<', '1000chomp.txt'
or die "Cannot open '1000chomp.txt' $!";


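my ( $heading, %record );
while ( <MYINPUTFILE> ) {
    # Heading line: six leading spaces, then one of the known headings.
    if ( /^ {6}(($alternation)\s+.*)/ ) {
        $record{ $heading = $2 } = $1;
    }
    # Any other non-blank line continues the current heading.
    elsif ( /(\S.*)/ ) {
        $record{ $heading } .= " $1";
    }
    # A blank line ends the record; skip headings the record does not have.
    elsif ( %record ) {
        print "$prepend{$_}$record{$_}\n" for grep { exists $record{$_} } @ordered;
        print "\n";
        %record = ();
    }
}
# Print the last record.
if ( %record ) {
    print "$prepend{$_}$record{$_}\n" for grep { exists $record{$_} } @ordered;
    print "\n";
}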

__END__
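
The same heading/continuation/blank-line logic can be tried on in-memory lines. This is a sketch under a few assumptions that differ from the script above: the sub name parse_records is made up, leading whitespace is optional (the sample as archived shows no indentation), the stored value leaves out the heading word, and Holdings is treated as a heading of its own so that it does not run into Subjects.

use strict;
use warnings;

my @known = qw(Title Dewey Producer Copyrighted Physical Series Synopsis Subjects Holdings);
my $alternation = join '|', @known;

# Hypothetical helper: group lines into records (hash refs),
# with one joined string per heading.
sub parse_records {
    my @lines = @_;
    my ( @records, %record, $heading );
    for my $line (@lines) {
        if ( $line =~ /^\s*($alternation)\b\s*(.*)/ ) {
            $heading = $1;
            $record{$heading} = $2;          # value without the heading word
        }
        elsif ( $line =~ /(\S.*)/ ) {
            next unless defined $heading;    # stray text before any heading
            $record{$heading} .= ( length $record{$heading} ? ' ' : '' ) . $1;
        }
        elsif (%record) {                    # a blank line ends the record
            push @records, {%record};
            %record  = ();
            $heading = undef;
        }
    }
    push @records, {%record} if %record;     # the last record
    return @records;
}

my @records = parse_records(
    "Title 10 fastest growing careers: jobs for the future part one legal",
    "and",
    "health (03613)",
    "Dewey 371.425",
    "Holdings",
    "1/2 VHS video: Head Office, 1 copy",
    "",
    "Title Second record",
);

print "$_->{Title}\n" for @records;

Keeping the parsing in a sub that returns plain hashes separates it from the MARC formatting, so the %prepend table can be applied to each record afterwards.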



John
--
use Perl;
program
fulfillment








Copyright 2008 codecomments.com