Why doesn't this work: matching capturing
| Kevin Zembower 2008-02-26, 10:02 pm |
| I have a data file that looks like this:
uSF1 MD15000000 009214935522451020 9 0101001
88722397N07209999900
116759 0Block Group 1
S 1158 662+39283007-076574503
uSF1 MD15000000 009215035522451020 9 0101002
88722397N07209999900
109338 0Block Group 2
S 842 547+39280857-076573636
uSF1 MD15000000 009215135522451020 9 0101003
88722397N07209999900
182248 135142Block Group 3
S 920 442+39279557-076574311
This is actually three lines that all start with 'uSF1'. This is the
Summary File from the US 2000 Census. I want to print all the census
tracts and blockgroup numbers for FIPS state code = "24" (Maryland) and
FIPS county code "510" (Baltimore City) for summary level '150'. These
are all fixed-length records. I tried:
[kevinz@www UScensus]$ perl -ne '($tract, $bg) =
/^.{8}150.{18}24510.{21}(.{6})(.)/; print "Tract $tract BLKGRP $bg\n";'
mdgeo.uf1 |head
Tract BLKGRP
Tract BLKGRP
Tract BLKGRP
<snip>
I thought that this would:
skip 8 characters and match '150'
skip 19 more characters and match '24' and '510'
skip 21 more characters and capture the next 6 in $tract
capture the next character in $bg
and print them.
The first two matches work, but nothing is captured. Any ideas what I'm
doing wrong?
Thanks for your help and advice.
-Kevin
Kevin Zembower
Internet Services Group manager
Center for Communication Programs
Bloomberg School of Public Health
Johns Hopkins University
111 Market Place, Suite 310
Baltimore, Maryland 21202
410-659-6139
| |
| Paul Lalli 2008-02-26, 10:02 pm |
| On Feb 26, 1:19 pm, kzemb...@jhuccp.org (Kevin Zembower) wrote:
> I have a data file that looks like this:
> uSF1  MD15000000  009214935522451020                9  0101001
> 88722397N07209999900
> 116759             0Block Group 1
> S      1158      662+39283007-076574503
>
> uSF1  MD15000000  009215035522451020                9  0101002
> 88722397N07209999900
> 109338             0Block Group 2
> S       842      547+39280857-076573636
>
> uSF1  MD15000000  009215135522451020                9  0101003
> 88722397N07209999900
> 182248        135142Block Group 3
> S       920      442+39279557-076574311
>
> This is actually three lines that all start with 'uSF1'. This is the
> Summary File from the US 2000 Census. I want to print all the census
> tracts and blockgroup numbers for FIPS state code = "24" (Maryland) and
> FIPS county code "510" (Baltimore City) for summary level '150'. These
> are all fixed-length records. I tried:
> [kevinz@www UScensus]$ perl -ne '($tract, $bg) =
> /^.{8}150.{18}24510.{21}(.{6})(.)/; print "Tract $tract BLKGRP $bg\n";'
> mdgeo.uf1 |head
> Tract  BLKGRP
> Tract  BLKGRP
> Tract  BLKGRP
> <snip>
>
> I thought that this would:
>    skip 8 characters and match '150'
>    skip 19 more characters and match '24' and '510'
>    skip 21 more characters and capture the next 6 in $tract
>    capture the next character in $bg
>    and print them.
>
> The first two matches work, but nothing is captured. Any ideas what I'm
> doing wrong?
On what do you base your assumption that "the first two matches
work"? Nothing in your code or output indicates that, as you are
never checking the return value of the pattern match.
FWIW, your code did work for me when I copy and pasted your sample
text, and joined the lines as they should have been. Therefore, I
think it's pretty likely that your datafile does not contain what you
think it does. I think it's more likely that the one line you think
you have that starts with uSF is actually broken up into a few lines.
Try some debugging prints of $_ to see what you actually have, like:
print "Line $.: <<$_>>";
Try checking the return value of your regexp:
/^.{8}150.{18}24510.{21}(.{6})(.)/ and print "Tract $1 BLKGRP $2\n";
Try enabling warnings to see if your two variables are undefined
(which they would be if the pattern didn't match) or just empty
strings (which they would be if the pattern matched but nothing was
captured - this, of course, isn't possible, since a six-character
match can't possibly be the empty string).
Paul Lalli
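[Editorial sketch, not part of Paul's post.] The checks suggested above can be tried without the real data file. The following is a minimal sketch under stated assumptions: make_record and describe_match are hypothetical helper names, and the records are synthetic, built with pack so that the fields sit at the offsets the regex in this thread expects. It prints each record between << >> markers and reports whether the variables ended up undefined.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): build a synthetic fixed-width
# record with summary level, state, county, tract and block group at the
# offsets the regex below expects. Unused columns are padded with spaces.
sub make_record {
    my ($sumlev, $state, $county, $tract, $bg) = @_;
    return pack('A8 A3 A18 A2 A3 A21 A6 A1',
                'uSF1  MD', $sumlev, '', $state, $county, '', $tract, $bg);
}

# Hypothetical helper: run the thread's regex and say what the list
# assignment left in the two variables.
sub describe_match {
    my ($line) = @_;
    my ($tract, $bg) = $line =~ /^.{8}150.{18}24510.{21}(.{6})(.)/;
    return 'no match: variables are undefined' unless defined $tract;
    return "matched: Tract $tract BLKGRP $bg";
}

my @records = (
    make_record('040', '24', '',    '',       ''),    # not summary level 150
    make_record('150', '24', '510', '010100', '1'),   # block-group record
);

for my $i (0 .. $#records) {
    printf "Line %d: <<%s>>\n", $i + 1, $records[$i];  # show the raw record
    print '  ', describe_match($records[$i]), "\n";
}
```

The first record reports "no match", the second reports tract 010100 and block group 1. A failed match leaves both variables undefined rather than empty, which is the distinction drawn above.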
| |
| Jim Gibson 2008-02-26, 10:02 pm |
| In article
<B68EB32ADEE6D74594B63A7D2931F0BA075214@XCH-VN01.sph.ad.jhsph.edu>,
Kevin Zembower <kzembowe@jhuccp.org> wrote:
> I have a data file that looks like this:
> uSF1 MD15000000 009214935522451020 9 0101001 ...
> uSF1 MD15000000 009215035522451020 9 0101002 ...
> uSF1 MD15000000 009215135522451020 9 0101003 ...
>
> This is actually three lines that all start with 'uSF1'. This is the
> Summary File from the US 2000 Census. I want to print all the census
> tracts and blockgroup numbers for FIPS state code = "24" (Maryland) and
> FIPS county code "510" (Baltimore City) for summary level '150'. These
> are all fixed-length records. I tried:
> [kevinz@www UScensus]$ perl -ne '($tract, $bg) =
> /^.{8}150.{18}24510.{21}(.{6})(.)/; print "Tract $tract BLKGRP $bg\n";'
> mdgeo.uf1 |head
> Tract BLKGRP
> Tract BLKGRP
> Tract BLKGRP
> <snip>
>
> I thought that this would:
> skip 8 characters and match '150'
> skip 19 more characters and match '24' and '510'
> skip 21 more characters and capture the next 6 in $tract
> capture the next character in $bg
> and print them.
>
> The first two matches work, but nothing is captured. Any ideas what I'm
> doing wrong?
It works for me:
% perl -ne '($t,$b)=m/^.{8}150.{18}24510.{21}(.{6})(.)/;print"Tract $t\tBLKGRP $b\n";' mdgeo.uf1
Tract 010100 BLKGRP 1
Tract 010100 BLKGRP 2
Tract 010100 BLKGRP 3
Perhaps your files do not contain what you think they do.
I would use the unpack function for this task (each record below is one
long line):
#!/usr/local/bin/perl
use strict;
use warnings;

# 'x55' skips the first 55 characters of the record, 'A6' reads the
# six-character tract and 'A' the one-character block group. Each record
# in __DATA__ has to sit on a single physical line for the offsets to hold.
while ( my $line = <DATA> ) {
    my ( $tract, $bg ) = unpack( 'x55 A6 A', $line );
    print "Tract $tract, BLKGRP $bg\n";
}
__DATA__
uSF1  MD15000000  009214935522451020                9  010100188722397N07209999900116759             0Block Group 1S      1158      662+39283007-076574503
uSF1  MD15000000  009215035522451020                9  010100288722397N07209999900109338             0Block Group 2S       842      547+39280857-076573636
uSF1  MD15000000  009215135522451020                9  010100388722397N07209999900182248        135142Block Group 3S       920      442+39279557-076574311
Output:
Tract 010100, BLKGRP 1
Tract 010100, BLKGRP 2
Tract 010100, BLKGRP 3
--
Jim Gibson
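[Editorial sketch, not part of Jim's post.] The unpack approach above prints every record. To also apply the summary-level, state and county filter from the original question, the skipped columns can be unpacked as fields and compared. This is only a sketch: the template widths are taken from the regex in this thread, tract_and_bg is a hypothetical helper name, and the sample records (including the second county code and tract) are synthetic.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Widths taken from the regex in this thread: skip 8, summary level (3),
# skip 18, state (2), county (3), skip 21, tract (6), block group (1).
my $TEMPLATE = 'x8 A3 x18 A2 A3 x21 A6 A1';

# Hypothetical helper: return (tract, block group) for a summary level 150
# record in state 24, county 510, or an empty list for anything else.
sub tract_and_bg {
    my ($line) = @_;
    chomp $line;
    return if length($line) < 62;    # 'x' would die on a too-short record
    my ($sumlev, $state, $county, $tract, $bg) = unpack($TEMPLATE, $line);
    return unless $sumlev eq '150' && $state eq '24' && $county eq '510';
    return ($tract, $bg);
}

# Synthetic records (assumption: only the columns used above are filled in).
my $layout = 'A8 A3 A18 A2 A3 A21 A6 A1';
my @lines = (
    pack($layout, 'uSF1  MD', '040', '', '24', '',    '', '',       ''),
    pack($layout, 'uSF1  MD', '150', '', '24', '510', '', '010100', '1'),
    pack($layout, 'uSF1  MD', '150', '', '24', '005', '', '400100', '2'),
    'short line',
);

for my $line (@lines) {
    # The list assignment counts the returned values, so an empty list
    # skips the record.
    my ($tract, $bg) = tract_and_bg($line) or next;
    print "Tract $tract, BLKGRP $bg\n";
}
```

Only the second synthetic record passes all three comparisons, so the loop prints a single line for tract 010100, block group 1. The length check is there because unpack raises an error when an 'x' skip runs past the end of the string.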
| |
| Kevin Zembower 2008-02-26, 10:02 pm |
| Paul, thank you very much for your helpful reply. To answer your
question, I am certain that the first two matches worked because I
produced the output I showed with:
[kevinz@www UScensus]$ perl -ne 'print if /^.{8}150.{18}24510.{21}(.{6})(.)/;' mdgeo.uf1 |head -1
uSF1 MD15000000 009214935522451020 9 0101001 88722397N07209999900 116759 0Block Group 1 S 1158 662+39283007-076574503
[kevinz@www UScensus]$
Sorry if I left out this information and wasted anyone's time.
A person responded to me privately and suggested this modification,
which seems to work fine:
[kevinz@www UScensus]$ perl -ne 'print "Tract $1 BLKGRP $2\n" if /^.{8}150.{18}24510.{21}(.{6})(.)/;' mdgeo.uf1 |head -3
Tract 010100 BLKGRP 1
Tract 010100 BLKGRP 2
Tract 010100 BLKGRP 3
[kevinz@www UScensus]$
I've been bitten by this bug before and can never remember the solution
to when variables are assigned values. I thought assigning them in a
previous command would have worked, but I must have overlooked
something.
Thank you, again, for your help.
-Kevin
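[Editorial sketch, not part of Kevin's post.] One reading that fits the outputs shown in this thread: the assignment in the first one-liner does work, but its print runs for every record, including those that do not match, where both variables are undefined. If the first records in the file are not summary level 150 records, piping through head would show only blank values. The sketch below compares the two behaviours on synthetic records; print_always and print_if_matched are hypothetical names, and the claim about the real file's leading records is an assumption.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $re = qr/^.{8}150.{18}24510.{21}(.{6})(.)/;

# Mirrors the first one-liner: assign, then print regardless of the match.
sub print_always {
    my @out;
    for my $line (@_) {
        my ($tract, $bg) = $line =~ $re;
        no warnings 'uninitialized';    # the one-liner ran without -w
        push @out, "Tract $tract BLKGRP $bg";
    }
    return @out;
}

# Same assignment, but the print is guarded by the result of the match.
sub print_if_matched {
    my @out;
    for my $line (@_) {
        if ( my ($tract, $bg) = $line =~ $re ) {
            push @out, "Tract $tract BLKGRP $bg";
        }
    }
    return @out;
}

# Synthetic records (assumption): one that does not match, then one that does.
my $layout  = 'A8 A3 A18 A2 A3 A21 A6 A1';
my @records = (
    pack($layout, 'uSF1  MD', '040', '', '24', '',    '', '',       ''),
    pack($layout, 'uSF1  MD', '150', '', '24', '510', '', '010100', '1'),
);

print "always:     <<$_>>\n" for print_always(@records);
print "if matched: <<$_>>\n" for print_if_matched(@records);
```

With these two records, print_always yields a blank "Tract  BLKGRP " entry followed by the real one, while print_if_matched yields only the real one. Named variables can therefore still be used, as long as the print is made conditional on the match.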
-----Original Message-----
From: Paul Lalli [mailto:mritty@gmail.com]
Sent: Tuesday, February 26, 2008 3:41 PM
To: beginners@perl.org
Subject: Re: Why doesn't this work: matching capturing
On Feb 26, 1:19 pm, kzemb...@jhuccp.org (Kevin Zembower) wrote:
> I have a data file that looks like this:
> uSF1  MD15000000  009214935522451020                9  0101001
> 88722397N07209999900
> 116759             0Block Group 1
> S      1158      662+39283007-076574503
>
> uSF1  MD15000000  009215035522451020                9  0101002
> 88722397N07209999900
> 109338             0Block Group 2
> S       842      547+39280857-076573636
>
> uSF1  MD15000000  009215135522451020                9  0101003
> 88722397N07209999900
> 182248        135142Block Group 3
> S       920      442+39279557-076574311
>
> This is actually three lines that all start with 'uSF1'. This is the
> Summary File from the US 2000 Census. I want to print all the census
> tracts and blockgroup numbers for FIPS state code = "24" (Maryland) and
> FIPS county code "510" (Baltimore City) for summary level '150'. These
> are all fixed-length records. I tried:
> [kevinz@www UScensus]$ perl -ne '($tract, $bg) =
> /^.{8}150.{18}24510.{21}(.{6})(.)/; print "Tract $tract BLKGRP $bg\n";'
> mdgeo.uf1 |head
> Tract  BLKGRP
> Tract  BLKGRP
> Tract  BLKGRP
> <snip>
>
> I thought that this would:
>    skip 8 characters and match '150'
>    skip 19 more characters and match '24' and '510'
>    skip 21 more characters and capture the next 6 in $tract
>    capture the next character in $bg
>    and print them.
>
> The first two matches work, but nothing is captured. Any ideas what I'm
> doing wrong?
On what do you base your assumption that "the first two matches
work"? Nothing in your code or output indicates that, as you are
never checking the return value of the pattern match.
FWIW, your code did work for me when I copy and pasted your sample
text, and joined the lines as they should have been. Therefore, I
think it's pretty likely that your datafile does not contain what you
think it does. I think it's more likely that the one line you think
you have that starts with uSF is actually broken up into a few lines.
Try some debugging prints of $_ to see what you actually have, like:
print "Line $.: <<$_>>";
Try checking the return value of your regexp:
/^.{8}150.{18}24510.{21}(.{6})(.)/ and print "Tract $1 BLKGRP $2\n";
Try enabling warnings to see if your two variables are undefined
(which they would be if the pattern didn't match) or just empty
strings (which they would be if the pattern matched but nothing was
captured - this, of course, isn't possible, since a six-character
match can't possibly be the empty string).
Paul Lalli