Need help with repeating match
Kevin Zembower

2006-09-25, 6:57 pm

I'm trying to process a file that mostly has lines like:
http://www.cpsp.edu.pk/jcpsp/ARCHIE...06/article5.pdf 342740
http://www.scielo.br/pdf/bjid/v10n1/a04v10n1.pdf 342741

However, it sometimes has more than one URL on a line, like:
http://db.jhuccp.org/docs/732301.pd...docs/732301FRE.
pdfhttp://db.jhuccp.org/docs/732301SPA.pdfhttp://db.jhuccp.org/docs/7323
01POR.pdf 16875
http://db.jhuccp.org/docs/732302.pd...docs/732302FRE.
pdfhttp://db.jhuccp.org/docs/732302POR.pdf 18024

I want to capture the portion from the start of 'http://' to either the
first whitespace or to the start of the next 'http://' and loop through,
doing something with the portion captured, until it fails to capture any
more. I also need the six digit number at the end inside each loop.

I think the way to do this is with a look-ahead assertion, but don't
understand this very well. I also don't understand how to write this so
it doesn't fail to match the lines with only one URL in them. Can anyone
give me a hand getting started with this task? Thanks for your help and
suggestions.

-Kevin

Kevin Zembower
Internet Services Group manager
Center for Communication Programs
Bloomberg School of Public Health
Johns Hopkins University
111 Market Place, Suite 310
Baltimore, Maryland 21202
410-659-6139
Rob Dixon

2006-09-25, 6:57 pm

Zembower, Kevin wrote:
>
> I'm trying to process a file that mostly has lines like:
> http://www.cpsp.edu.pk/jcpsp/ARCHIE...06/article5.pdf 342740
> http://www.scielo.br/pdf/bjid/v10n1/a04v10n1.pdf 342741
>
> However, it sometimes has more than one URL on a line, like:
> http://db.jhuccp.org/docs/732301.pd...docs/732301FRE.
> pdfhttp://db.jhuccp.org/docs/732301SPA.pdfhttp://db.jhuccp.org/docs/7323
> 01POR.pdf 16875
> http://db.jhuccp.org/docs/732302.pd...docs/732302FRE.
> pdfhttp://db.jhuccp.org/docs/732302POR.pdf 18024
>
> I want to capture the portion from the start of 'http://' to either the
> first whitespace or to the start of the next 'http://' and loop through,
> doing something with the portion captured, until it fails to capture any
> more. I also need the six digit number at the end inside each loop.
>
> I think the way to do this is with a look-ahead assertion, but don't
> understand this very well. I also don't understand how to write this so
> it doesn't fail to match the lines with only one URL in them. Can anyone
> give me a hand getting started with this task? Thanks for your help and
> suggestions.



This code is hopefully a little more readable because it first builds a regex
for a single URL and then uses it in a global match to say that what we want
is a URL followed by either another URL or whitespace.

Hope it does the trick.

Rob



use strict;
use warnings;

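while (<DATA>) {

    # The number at the end of the line
    my ($n) = /(\d+)\s*$/;

    # A single URL, matched non-greedily so that it stops as early as it can
    my $url = qr#http://\S*?#;

    # Every URL that is followed by another URL or by whitespace
    my @urls = m#$url(?=$url|\s)#g;

    print "$_\n" foreach @urls;
    print $n, "\n\n";
}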


__DATA__
http://www.cpsp.edu.pk/jcpsp/ARCHIE...06/article5.pdf 342740
http://www.scielo.br/pdf/bjid/v10n1/a04v10n1.pdf 342741
http://db.jhuccp.org/docs/732301.pd...s/732301POR.pdf 16875
http://db.jhuccp.org/docs/732302.pd...s/732302POR.pdf 18024


**OUTPUT**

http://www.cpsp.edu.pk/jcpsp/ARCHIE...06/article5.pdf
342740

http://www.scielo.br/pdf/bjid/v10n1/a04v10n1.pdf
342741

http://db.jhuccp.org/docs/732301.pdf
http://db.jhuccp.org/docs/732301FRE.pdf
http://db.jhuccp.org/docs/732301SPA.pdf
http://db.jhuccp.org/docs/732301POR.pdf
16875

http://db.jhuccp.org/docs/732302.pdf
http://db.jhuccp.org/docs/732302FRE.pdf
http://db.jhuccp.org/docs/732302POR.pdf
18024
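
The original question also asked to do something with each URL inside a loop, with the trailing number available every time. A minimal sketch of that variation follows, assuming the same line format as above. The sub name urls_and_number and the sample line are made up for illustration and are not from the code above. It also uses \S+? rather than \S*? so that a bare 'http://' with nothing after it is not returned as a URL, and the lookahead tests for a plain 'http://'.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns the trailing number
# followed by every URL found on one line.
sub urls_and_number {
    my ($line) = @_;

    # Digits at the very end of the line, ignoring trailing whitespace.
    my ($n) = $line =~ /(\d+)\s*$/;

    my @urls;
    # In scalar context, /g resumes where the previous match ended, so the
    # loop body runs once per URL. The lookahead checks for "http://" or
    # whitespace without consuming it.
    while ($line =~ m#(http://\S+?)(?=http://|\s)#g) {
        push @urls, $1;    # "do something" with each URL here
    }
    return ($n, @urls);
}

# Made-up sample line: two URLs run together, then the number.
my ($n, @urls) = urls_and_number("http://example.com/a.pdfhttp://example.com/b.pdf 123456\n");
print "$n $_\n" for @urls;
```

A line with a single URL goes through the same loop exactly once, because the whitespace branch of the lookahead ends the match.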
Kevin Zembower

2006-09-26, 7:57 am

Rob, thanks so much for helping me with this perl task. I'm still going
over your solution character-by-character to fully understand it. I
really appreciate your efforts in working it out.

-Kevin
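
For anyone else reading the pattern piece by piece, here is a sketch of the same idea written with the /x modifier so that each part can carry a comment. The name $url_re and the sample lines are assumptions made for this example. The lookahead is written as a plain 'http://', which should behave the same as repeating the whole URL pattern, because \S*? is allowed to match nothing.

```perl
use strict;
use warnings;

# The URL pattern spelled out with /x. $url_re is a name chosen for
# this sketch only.
my $url_re = qr{
    http://        # a URL starts here
    \S*?           # as few non-whitespace characters as possible
    (?=            # look ahead without consuming anything:
        http://    #   either the next URL begins here,
      | \s         #   or whitespace ends the URL
    )
}x;

# Made-up sample lines: one URL, and two URLs run together.
my @one = "http://example.com/a.pdf 111111\n" =~ /$url_re/g;
my @two = "http://example.com/a.pdfhttp://example.com/b.pdf 222222\n" =~ /$url_re/g;

print scalar(@one), " URL: @one\n";
print scalar(@two), " URLs: @two\n";
```

Because the lookahead consumes nothing, the next match can begin right at the following 'http://', and a line with only one URL still matches through the whitespace branch.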

-----Original Message-----
From: Rob Dixon [mailto:rob.dixon@350.com]
Sent: Monday, September 25, 2006 5:22 PM
To: beginners@perl.org
Subject: Re: Need help with repeating match
