Need help with repeating match
Kevin Zembower 2006-09-25, 6:57 pm
I'm trying to process a file that mostly has lines like:
http://www.cpsp.edu.pk/jcpsp/ARCHIE...06/article5.pdf 342740
http://www.scielo.br/pdf/bjid/v10n1/a04v10n1.pdf 342741
However, it sometimes has more than one URL on a line, like:
http://db.jhuccp.org/docs/732301.pd...docs/732301FRE.pdfhttp://db.jhuccp.org/docs/732301SPA.pdfhttp://db.jhuccp.org/docs/732301POR.pdf 16875
http://db.jhuccp.org/docs/732302.pd...docs/732302FRE.pdfhttp://db.jhuccp.org/docs/732302POR.pdf 18024
I want to capture the portion from the start of 'http://' to either the
first whitespace or to the start of the next 'http://' and loop through,
doing something with the portion captured, until it fails to capture any
more. I also need the six digit number at the end inside each loop.
I think the way to do this is with a look-ahead assertion, but don't
understand this very well. I also don't understand how to write this so
it doesn't fail to match the lines with only one URL in them. Can anyone
give me a hand getting started with this task? Thanks for your help and
suggestions.
-Kevin
Kevin Zembower
Internet Services Group manager
Center for Communication Programs
Bloomberg School of Public Health
Johns Hopkins University
111 Market Place, Suite 310
Baltimore, Maryland 21202
410-659-6139

Kevin Zembower 2006-09-26, 7:57 am
Rob, thanks so much for helping me with this Perl task. I'm still going
over your solution character-by-character to fully understand it. I
really appreciate your efforts in working it out.
-Kevin
-----Original Message-----
From: Rob Dixon [mailto:rob.dixon@350.com]
Sent: Monday, September 25, 2006 5:22 PM
To: beginners@perl.org
Subject: Re: Need help with repeating match
This code is hopefully a little more readable because it first builds a
regex for a single URL, then uses it in a global match to say that what
we want is a URL followed by either another URL or whitespace.
Hope it does the trick.
Rob
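use strict;
use warnings;
while (<DATA>) {
    # The trailing number on the line
    my ($n) = /(\d+)\s*$/;
    # A single URL, matched as lazily as possible
    my $url = qr#http://\S*?#;
    # Every URL that is followed by another URL or by whitespace
    my @urls = m#$url(?=$url|\s)#g;
    print "$_\n" foreach @urls;
    print $n, "\n\n";
}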
__DATA__
http://www.cpsp.edu.pk/jcpsp/ARCHIE...06/article5.pdf 342740
http://www.scielo.br/pdf/bjid/v10n1/a04v10n1.pdf 342741
http://db.jhuccp.org/docs/732301.pd...docs/732301FRE.pdfhttp://db.jhuccp.org/docs/732301SPA.pdfhttp://db.jhuccp.org/docs/732301POR.pdf 16875
http://db.jhuccp.org/docs/732302.pd...docs/732302FRE.pdfhttp://db.jhuccp.org/docs/732302POR.pdf 18024
**OUTPUT**
http://www.cpsp.edu.pk/jcpsp/ARCHIE...06/article5.pdf
342740
http://www.scielo.br/pdf/bjid/v10n1/a04v10n1.pdf
342741
http://db.jhuccp.org/docs/732301.pdf
http://db.jhuccp.org/docs/732301FRE.pdf
http://db.jhuccp.org/docs/732301SPA.pdf
http://db.jhuccp.org/docs/732301POR.pdf
16875
http://db.jhuccp.org/docs/732302.pdf
http://db.jhuccp.org/docs/732302FRE.pdf
http://db.jhuccp.org/docs/732302POR.pdf
18024
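
For anyone who wants to handle each URL inside a loop, with the trailing
number available on every pass as the original question asked, here is
a minimal sketch of the same look-ahead idea using m//g in scalar
context and the /x flag so the pattern can carry comments. The helper
name url_number_pairs and the example.com sample lines are made up for
illustration and are not from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns one [url, number]
# pair for every URL found on the line.
sub url_number_pairs {
    my ($line) = @_;

    # Trailing run of digits, allowing trailing whitespace or a newline.
    my ($n) = $line =~ /(\d+)\s*$/;

    my @pairs;
    # In scalar context, each pass of m//g resumes where the last match ended.
    while ($line =~ m{
            ( http:// \S*? )      # a URL, matched as lazily as possible
            (?= http:// | \s )    # stop just before the next URL or whitespace
        }gx) {
        push @pairs, [ $1, $n ];
    }
    return @pairs;
}

# Made-up sample lines using the reserved example.com domain.
for my $line (
    "http://example.com/a.pdf 342740\n",
    "http://example.com/b.pdfhttp://example.com/bFRE.pdf 16875\n",
) {
    printf "%s => %s\n", @$_ for url_number_pairs($line);
}
```

Because the look-ahead consumes nothing, the next pass of the loop starts
exactly at the following 'http://', so a line with a single URL and a
line with several run through the same code.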