Home > Archive > PERL Beginners > November 2006 > pattern substitution
You are viewing an archived Text-only version of the thread.
To view this thread in it's original format and/or if you want to reply to
this thread please [click here]
| Author |
pattern substitution
|
|
| Adriano Allora 2006-11-18, 7:56 am |
| hi to all,
I've got a list of tagged words, like this one (only a little bit
longest):
<tLn nr=11>
e CON e
le DET:def il
ha VER:pres avere|riavere
detto VER:pper dire
< NOM <unknown>
CORR VER:infi corre
> NOM <unknown>
e CON e
a PRE a
I need to transform the list below in (in which the CORR tag isn't
tagged):
<tLn nr=11>
e CON e
le DET:def il
ha VER:pres avere|riavere
detto VER:pper dire
<CORR>
e CON e
a PRE a
So I tried to write this awful script:
#!/usr/bin/perl -w
use strict;
$^I = '';
my $tic = 0;
my $toc = 0;
while(<> )
{
if(/^< NOM <unknown>.*/i)
{
$tic = 1;
next;
}
next if /^> NOM <unknown>.*/i;
next if $toc == 1;
$toc = 0;
if($tic==1)
{
s/^(\/?\w+).+/$1/gi;
chomp();
$_ = "<$_>";
$toc = 1;
$tic = 0;
}
s/<>//g;
print;
}
it doesn't return errors, but it stop printing the output after the
first correction. Someone can explain me why and eventually suggest how
to correct the corrector?
Thanks at all,
alladr
PS: another strange thing: if I declare at the beginning of the script:
my($tic,$toc); it returns me an error...
|^|_|^|_|^| |^|_|^|_|^|
| | | |
| | | |
| |*\_/*\_/*\_/*\_/*\_/* | |
| |
| |
| |
| http://www.e-allora.net |
| |
| |
**************************************
| |
| D. Bolliger 2006-11-18, 7:56 am |
| Adriano Allora am Samstag, 18. November 2006 11:52:
> hi to all,
Ciao Adriano
> I've got a list of tagged words, like this one (only a little bit
> longest):
>
> <tLn nr=11>
> e CON e
> le DET:def il
> ha VER:pres avere|riavere
> detto VER:pper dire
> < NOM <unknown>
> CORR VER:infi corre
>
>
> e CON e
> a PRE a
>
> I need to transform the list below in (in which the CORR tag isn't
> tagged):
>
> <tLn nr=11>
> e CON e
> le DET:def il
> ha VER:pres avere|riavere
> detto VER:pper dire
> <CORR>
> e CON e
> a PRE a
>
> So I tried to write this awful script:
>
> #!/usr/bin/perl -w
>
> use strict;
>
> $^I = '';
>
> my $tic = 0;
> my $toc = 0;
>
> while(<> )
> {
> if(/^< NOM <unknown>.*/i)
You don't need the .* in the regex here (and below).
> {
> $tic = 1;
> next;
> }
> next if /^> NOM <unknown>.*/i;
> next if $toc == 1;
$toc can only have the values 0 and 1. So, if you get here, $toc is 0...
> $toc = 0;
....and this won't change $toc.
> if($tic==1)
> {
> s/^(\/?\w+).+/$1/gi;
> chomp();
> $_ = "<$_>";
> $toc = 1;
> $tic = 0;
> }
> s/<>//g;
> print;
> }
>
> it doesn't return errors, but it stop printing the output after the
> first correction. Someone can explain me why
Didn't look deeply enough in the code, so I can't :-)
> and eventually suggest how to correct the corrector?
The script below seems to do what you want. It's not very elegant, but (I
think) easy to understand. I use a $inside variable that does what you maybe
intended with $tic and $toc.
> PS: another strange thing: if I declare at the beginning of the script:
> my($tic,$toc); it returns me an error...
You don't say what error, but I got errors like
"Use of uninitialized value in numeric eq (==) at ./script.pl line 19,
<DATA> line 1.".
$tic/toc is used in a numeric comparison before a value has
been assigned (my ($tic, $toc) leaves both undefined). The program flow may be
different from what you expect, and maybe also the reason for a stop after
the first correction.
I hope this helps,
Dani
#!/usr/bin/perl
use strict;
use warnings;
my $inside; # are we within a tagged area?
while(<DATA> ) {
if (/^<\s+NOM\s+<unknown>/i) {
$inside=1;
next;
}
elsif (/^>\s+NOM\s+<unknown>/i) {
$inside=0;
next;
}
elsif ($inside) {
my ($str)=/(^\w+)/ or die;
print "<$1>\n";
}
else {
print;
}
}
__DATA__
<tLn nr=11>
e CON e
le DET:def il
ha VER:pres avere|riavere
detto VER:pper dire
< NOM <unknown>
CORR VER:infi corre
> NOM <unknown>
e CON e
a PRE a
<tLn nr=11>
e CON e
le DET:def il
ha VER:pres avere|riavere
detto VER:pper dire
< NOM <unknown>
BLA VER:infi corre
> NOM <unknown>
e CON e
a PRE a
| |
| John W. Krahn 2006-11-18, 7:56 am |
| Adriano Allora wrote:
> hi to all,
Hello,
> I've got a list of tagged words, like this one (only a little bit longest):
>
> <tLn nr=11>
> e CON e
> le DET:def il
> ha VER:pres avere|riavere
> detto VER:pper dire
> < NOM <unknown>
> CORR VER:infi corre
> e CON e
> a PRE a
>
> I need to transform the list below in (in which the CORR tag isn't tagged):
>
> <tLn nr=11>
> e CON e
> le DET:def il
> ha VER:pres avere|riavere
> detto VER:pper dire
> <CORR>
> e CON e
> a PRE a
>
> So I tried to write this awful script:
>
> #!/usr/bin/perl -w
>
> use strict;
>
> $^I = '';
>
> my $tic = 0;
> my $toc = 0;
>
> while(<> )
> {
> if(/^< NOM <unknown>.*/i)
> {
> $tic = 1;
> next;
> }
> next if /^> NOM <unknown>.*/i;
> next if $toc == 1;
Below you set $toc = 1 for the line you want modified so everything after the
line you want modified is bypassed.
> $toc = 0;
> if($tic==1)
> {
> s/^(\/?\w+).+/$1/gi;
> chomp();
> $_ = "<$_>";
> $toc = 1;
> $tic = 0;
> }
> s/<>//g;
> print;
> }
>
> it doesn't return errors, but it stop printing the output after the
> first correction. Someone can explain me why and eventually suggest how
> to correct the corrector?
This should do what you want:
while ( <> ) {
if ( /^<\s+NOM\s+<unknown>/i .. /^>\s+NOM\s+<unknown>/i ) {
s!^(/?\w+).+!<$1>!g or next;
}
print;
}
> PS: another strange thing: if I declare at the beginning of the script:
> my($tic,$toc); it returns me an error...
Probably because they are undefined and you are using them with a comparison
operator?
John
--
Perl isn't a toolbox, but a small machine shop where you can special-order
certain sorts of tools at low cost and in short order. -- Larry Wall
| |
| Rob Dixon 2006-11-18, 6:56 pm |
| Adriano Allora wrote:
>
> hi to all,
Hi Adriano. Read my comments in-line and my solution at the end.
> I've got a list of tagged words, like this one (only a little bit longest):
>
> <tLn nr=11>
> e CON e
> le DET:def il
> ha VER:pres avere|riavere
> detto VER:pper dire
> < NOM <unknown>
> CORR VER:infi corre
> e CON e
> a PRE a
Posting a short example is fine, as long as the full data doesn't contain any
records significantly different from any of those in your sample. See later.
> I need to transform the list below in (in which the CORR tag isn't tagged):
>
> <tLn nr=11>
> e CON e
> le DET:def il
> ha VER:pres avere|riavere
> detto VER:pper dire
> <CORR>
> e CON e
> a PRE a
>
> So I tried to write this awful script:
Don't think I'm not going to tell you it's awful just because you've said so
already :)
> #!/usr/bin/perl -w
>
> use strict;
and always
use warnings;
as well. It didn't show any problems in this case but it's a useful thing to
have around.
> $^I = '';
Better not to use this at all while you're testing, as you'll keep overwriting
the input data and have to restore it.
> my $tic = 0;
> my $toc = 0;
You're getting by having two flags when you need only one: you're
either inside a tag or you're not.
> while(<> )
> {
> if(/^< NOM <unknown>.*/i)
No need for the .* at the end of the regex. And isn't it unnecessarily long?
Unless there are other records that look similar in your source data that you
don't want to match here I'd say
if (/^<\s/) {
was adequate. (Starts with a left angle bracket followed by white space.)
> {
> $tic = 1;
> next;
> }
> next if /^> NOM <unknown>.*/i;
> next if $toc == 1;
Here is your problem. $toc is getting set inside the following if statement and
is never reset, so the loop then cycles unproductively until the end of the
input.
> $toc = 0;
> if($tic==1)
> {
> s/^(\/?\w+).+/$1/gi;
There's no need for the /i modifier as you don't have any literal letters to
match. And there's no need for the /g as you have no /m so ^ can match only
once. (The line has only one beginning!) It's also clearer to use different
delimiters when you're matching slashes instead of escaping them, like this:
s|^(/?\w+).+|$1|gi;
It looks like your records can also begin with a slash. You didn't show us any
data like that. This is the sort of thing I was talking about earlier - your
example data can be a subset of the real thing but it needs to be fully
representative.
And in the end all you've done is strip off the tail of the line after the first
word. As far as I can tell
s/\s.*//;
is adequate. (Remove everything at and after the first white space.)
> chomp();
A bit late to be doing the chomp here isn't it? Put it straight after the while
statement and then we know where we are for the entirety oif the loop. And
anyway the substitution you just did in the line before will have lopped off the
trailing newline anyway.
By the way, you don't need the parentheses.
> $_ = "<$_>";
It's a misconception to try to turn the actual input record into what the output
should be; after all, you didn't say $_='' above when the output should be
blank, you simply called next. Simply build the output record that you want and
print it. You'll see this in my solution below.
> $toc = 1;
> $tic = 0;
> }
> s/<>//g;
What's this for? If it's to fix a bug in your code that's generating '<>' when
it shouldn't then you should fix the bug instead. Or if there are records in
your source data that need this stripping out then you haven't shown any.
Otherwise I don't understand its purpose.
> print;
> }
>
> it doesn't return errors, but it stop printing the output after the
> first correction. Someone can explain me why and eventually suggest how
> to correct the corrector?
Yep, awful script. Mainly because of unclear thinking I believe. If I was to ask
you what states $tic and $toc represented you couldn't easily tell me, and
because you don't really know what they do (they are yours after all and you
should know what they're for!) you've tried setting and clearing them at odd
places and failed to get it right.
> Thanks at all,
>
>
> alladr
>
> PS: another strange thing: if I declare at the beginning of the script:
> my($tic,$toc); it returns me an error...
It would have been nice if you had told us the error, but my guess is that it's
"variable masks earlier declaration", because you will then have declared them
twice.
Here's the fix. It has a single flag $tag, and it means the script is processing
'inside' a tag. It's set when a line beginning with a lone '<' is found and
cleared when the corresponding '>' appears.
HTH,
Rob
use strict;
use warnings;
#$^I = '';
my $tag;
while (<> ) {
chomp;
if(/^[<>]\s/) {
$tag = /^</;
}
elsif ($tag) {
/(^\S+)/;
print "<$1>\n";
}
else {
print "$_\n";
}
}
|
|
|
|
|