

Home > Archive > PERL Beginners > February 2006 > regexp problem










 

Author regexp problem
pdc124@yahoo.co.uk

2006-02-01, 6:56 pm

I want to extract part of a web page and put bits of it into an email.
I've got this far:
#retrieve bit of web page
$o=~/startoftable(.*)<tr>(.*)<\/table>(.*)endoftable/i;
my $p=$1;

#remove table tags
$p=~tr/(\/?t{r|d})+//d;


This leaves me with a variable number of '<>'s, depending on whether
it's the start of a line or not, e.g.
<><>21<><>15<><><><>NC2<><>12<><><><>NC4<><>7 ......
I want to change these to a single character, then split the
string into array elements and put them in the email.
I've tried combinations of

$p=~tr/(<> ){2|4}/#/;

but it replaces each element with a '#', not the whole lot with a
single '#'.

What's wrong?

Paul Lalli

2006-02-01, 6:56 pm




You are confusing the transliteration operator ( tr/// ) with the
substitution operator ( s/// ). tr/// takes a list of characters, and
replaces each character on the left with the corresponding character on
the right. If the list on the right is shorter than the list on the left,
its final character is repeated as many times as needed (unless you use
the /d modifier, in which case the unmatched characters are deleted).
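
To make the character-by-character behaviour concrete, here is a minimal sketch. The strings and variable names are made up purely for illustration:

```perl
use strict;
use warnings;

# One-to-one mapping: a->x, b->y, c->z
(my $mapped = "aabbcc") =~ tr/abc/xyz/;     # "xxyyzz"

# Shorter right-hand list: its final character is reused,
# so the mapping becomes a->x, b->y, c->y
(my $padded = "aabbcc") =~ tr/abc/xy/;      # "xxyyyy"

# With /d, characters that have no counterpart on the right
# are deleted instead: a->x, b and c removed
(my $deleted = "aabbcc") =~ tr/abc/x/d;     # "xx"

print "$mapped $padded $deleted\n";
```

Note that nothing in the left-hand list is treated as a pattern: parentheses, braces, `|` and `+` are just more characters to translate.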

s///, on the other hand, takes a regular expression on the right, and
replaces it with the string on the left.

Your original operation:
$p=~tr/(\/?t{r|d})+//d;
is not removing all tr, td, /tr, and /td groups. It is removing all (,
/, ?, t, {, r, |, d, }, ), and + characters. (removing because you used
the /d modifier)

Your second operation:
$p=~tr/(<> ){2|4}/#/;
is replacing every (, <, >, space, ), {, 2, |, 4, and } character with a #
character.
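
You can see both effects by running the two tr/// operations on small test strings. This is only a sketch: the sample row below is invented (the real page is not shown in the thread), and the cell text "total" is there just to show that ordinary letters get deleted too.

```perl
use strict;
use warnings;

# Hypothetical table row; "total" is ordinary cell text
my $row = "<tr><td>total</td><td>21</td></tr>";

# The first operation deletes every listed character wherever it occurs,
# including the t in "total"
(my $stripped = $row) =~ tr/(\/?t{r|d})+//d;
print "$stripped\n";    # <><>oal<><>21<><>

# The second operation turns each listed character into '#',
# including the digit 2 in the data
(my $hashed = "<><>21<><>15") =~ tr/(<> ){2|4}/#/;
print "$hashed\n";      # #####1####15
```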

Those should be, respectively:
s/(\/?t[rd])//g;
and
s/(<><> ){1,2}/#/g;
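
Putting the steps together, something like the following sketch would get from the captured fragment to an array. The sample data and the @fields name are assumptions for illustration. The patterns here are a variation on the two substitutions above, not the same ones: the first is anchored on the angle brackets so that letters in the cell text are left alone, and the second collapses a run of any length, since the sample output in the question shows no space between the '<>' pairs.

```perl
use strict;
use warnings;

# Hypothetical fragment standing in for the captured part of the page
my $p = "<tr><td>21</td><td>15</td></tr><tr><td>NC2</td><td>12</td></tr>";

# Reduce each tr/td tag (opening or closing) to an empty '<>'
$p =~ s{</?t[rd]>}{<>}gi;    # <><>21<><>15<><><><>NC2<><>12<><>

# Collapse every run of '<>' into a single '#'
$p =~ s/(?:<>)+/#/g;         # #21#15#NC2#12#

# Split on '#' and drop the empty leading field
my @fields = grep { length } split /#/, $p;

print join(",", @fields), "\n";    # 21,15,NC2,12
```

If the real data does have whitespace between the pairs, the second pattern could be loosened to something like (?:<>\s*)+ instead.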

Of course, you should be using an actual HTML parser (like, for
example, HTML::Parser) rather than doing this via regexps in the first
place.

Read more about s/// and tr/// in:
perldoc perlop
perldoc perlretut
perldoc perlre
perldoc perlreref

Paul Lalli

Paul Lalli

2006-02-01, 6:56 pm

Paul Lalli wrote:
> s///, on the other hand, takes a regular expression on the right, and
> replaces it with the string on the left.


Wow. I'm dyslexic today. s/// takes a RegExp on the *left* and
replaces it with the string on the *right*.

Sorry about that.

Paul Lalli
