| Author |
how to capture multiple lines?
|
|
| Geoff Cox 2004-03-29, 5:38 am |
| Hello
I thought following code using / /s should get multiple lines such as
<p> hdajhksdh jash djh jaskd a
d ahjkd jakdljkaksdkjlad a
d ajkd jadklj aldkj ald </p>
but it is only capturing where <p> and </p> are on the same line..
Help!
if ($line =~ /<p>(.*)<\/p>/s) {
print ("\$1 = $1 \n");
}
Cheers
Geoff
| |
| Tassilo v. Parseval 2004-03-29, 5:38 am |
| Also sprach Geoff Cox:
> I thought following code using / /s should get multiple lines such as
>
><p> hdajhksdh jash djh jaskd a
> d ahjkd jakdljkaksdkjlad a
> d ajkd jadklj aldkj ald </p>
>
> but it is only capturing where <p> and </p> are on the same line..
>
> Help!
>
> if ($line =~ /<p>(.*)<\/p>/s) {
> print ("\$1 = $1 \n");
> }
This works for me as expected:
$line = <<EOC;
<p> hdajhksdh jash djh jaskd a
d ahjkd jakdljkaksdkjlad a
d ajkd jadklj aldkj ald </p>
EOC
if ($line =~ /<p>(.*)<\/p>/s) {
print ("\$1 = $1 \n");
}
__END__
$1 = hdajhksdh jash djh jaskd a
d ahjkd jakdljkaksdkjlad a
d ajkd jadklj aldkj ald
Did you check that $line really contains what you think it contains?
Maybe you read into this variable line-wise and so quite naturally you
only get a match when <p>...</p> happen to be on one line.
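A minimal sketch of that difference, assuming the file is being read with a loop over the input operator. The sample text and variable names are made up for illustration and are not from the original script:

```perl
use strict;
use warnings;

my $html = "<p> first line\nsecond line\nthird line </p>\n";

# Reading line by line: each $line holds a single line, so a pattern
# that needs both <p> and </p> never sees them together.
open my $fh, '<', \$html or die $!;
my $linewise_hits = 0;
while (defined(my $line = <$fh>)) {
    $linewise_hits++ if $line =~ /<p>(.*)<\/p>/s;
}
close $fh;

# Matching against the whole string: /s lets . cross the newlines.
my ($whole) = $html =~ /<p>(.*)<\/p>/s;

print "line-wise matches: $linewise_hits\n";
print "whole-string capture: $whole\n";
```

The /s modifier only changes what . may match; it cannot make text from other lines appear in a variable that holds one line.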
Btw: I hope the appearance of <p> and </p> is only falsely indicating
that you are working with HTML because you cannot parse HTML properly
with regexes. But if the above really is HTML, you'll be happier with
one of the HTML parsing modules, such as HTML::Parser.
Tassilo
--
$_=q#",}])!JAPH!qq(tsuJ[{@"tnirp}3..0}_$;//::niam/s~=)]3[))_$-3(rellac(=_$({
pam{rekcahbus})(rekcah{lrePbus})(lreP{rehtonabus})!JAPH!qq(rehtona{tsuJbus#;
$_=reverse,s+(?<=sub).+q#q!'"qq.\t$&."'!#+sexisexiixesixeseg;y~\n~~dddd;eval
| |
| Gunnar Hjalmarsson 2004-03-29, 5:38 am |
| Geoff Cox wrote:
> I thought following code using / /s should get multiple lines such
> as
>
> <p> hdajhksdh jash djh jaskd a
> d ahjkd jakdljkaksdkjlad a
> d ajkd jadklj aldkj ald </p>
>
> but it is only capturing where <p> and </p> are on the same line..
No, it's not.
> if ($line =~ /<p>(.*)<\/p>/s) {
> print ("\$1 = $1 \n");
> }
That works fine for me.
--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl
| |
| Geoff Cox 2004-03-29, 5:38 am |
| On 29 Mar 2004 09:25:13 GMT, "Tassilo v. Parseval"
<tassilo.parseval@rwth-aachen.de> wrote:
>Also sprach Geoff Cox:
Tassilo,
I should have said that the <p> ... </p> is from an html file ....
I have just tried following which works for above but breaks the rest
of the input
$/ = "\0a\0d";
$line =~ /<p>(.*?)<\/p>/s;
$/ = "\0a";
The 3rd line does not appear to put $/ back to the default value??
Cheers
Geoff
>
>
>This works for me as expected:
>
> $line = <<EOC;
> <p> hdajhksdh jash djh jaskd a
> d ahjkd jakdljkaksdkjlad a
> d ajkd jadklj aldkj ald </p>
> EOC
>
> if ($line =~ /<p>(.*)<\/p>/s) {
> print ("\$1 = $1 \n");
> }
> __END__
> $1 = hdajhksdh jash djh jaskd a
> d ahjkd jakdljkaksdkjlad a
> d ajkd jadklj aldkj ald
>
>Did you check that $line really contains what you think it contains?
>Maybe you read into this variable line-wise and so quite naturally you
>only get a match when <p>...</p> happen to be on one line.
>
>Btw: I hope the appearance of <p> and </p> is only falsely indicating
>that you are working with HTML because you cannot parse HTML properly
>with regexes. But if the above really is HTML, you'll be happier with
>one of the HTML parsing modules, such as HTML::Parser.
>
>Tassilo
| |
| Geoff Cox 2004-03-29, 5:38 am |
| On Mon, 29 Mar 2004 11:28:48 +0200, Gunnar Hjalmarsson
<noreply@gunnar.cc> wrote:
Gunnar
I should have said that the <p> ... </p> is from an html file ....
I have just tried following which works for above but breaks the rest
of the input
$/ = "\0a\0d";
$line =~ /<p>(.*?)<\/p>/s;
$/ = "\0a";
The 3rd line does not appear to put $/ back to the default value??
Cheers
Geoff
>Geoff Cox wrote:
>
>No, it's not.
>
>
>That works fine for me.
| |
| Geoff Cox 2004-03-29, 5:38 am |
| On Mon, 29 Mar 2004 09:56:23 GMT, Geoff Cox
<geoffacox@dontspamblueyonder.co.uk> wrote:
>On 29 Mar 2004 09:25:13 GMT, "Tassilo v. Parseval"
><tassilo.parseval@rwth-aachen.de> wrote:
>
>
>Tassilo,
>
>I should have said that the <p> ... </p> is from an html file ....
>
>I have just tried following which works for above but breaks the rest
>of the input
>
> $/ = "\0a\0d";
> $line =~ /<p>(.*?)<\/p>/s;
> $/ = "\0a";
I think this should have been the following (I had 0d and 0a in the wrong order):
$/ = "\0d\0a";
$line =~ /<p>(.*?)<\/p>/s;
$/ = "\0a";
but still breaks rest of the script in that $/ does not seem to be
back to the default value ...?
Geoff
>
>The 3rd line does not appear to put $/ back to the default value??
>
>Cheers
>
>Geoff
>
>
>
>
>
| |
| Gunnar Hjalmarsson 2004-03-29, 5:38 am |
| Geoff Cox wrote:
> I should have said that the <p> ... </p> is from an html file ....
Then you should consider using a module instead.
> I have just tried following which works for above but breaks the
> rest of the input
>
> $/ = "\0a\0d";
> $line =~ /<p>(.*?)<\/p>/s;
> $/ = "\0a";
>
> The 3rd line does not appear to put $/ back to the default value??
I'm not sure what you are trying to do. If the file isn't really huge,
why don't you just slurp it into a scalar variable instead of reading
it line by line?
--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl
| |
| Tassilo v. Parseval 2004-03-29, 6:31 am |
| Also sprach Geoff Cox:
> On 29 Mar 2004 09:25:13 GMT, "Tassilo v. Parseval"
><tassilo.parseval@rwth-aachen.de> wrote:
>
>
> Tassilo,
>
> I should have said that the <p> ... </p> is from an html file ....
As if we didn't know. ;-)
Another thing you should have done is choose a more effective
follow-up style. Put your reply below the text you are replying to,
and cut out the parts you don't refer to.
> I have just tried following which works for above but breaks the rest
> of the input
>
> $/ = "\0a\0d";
> $line =~ /<p>(.*?)<\/p>/s;
> $/ = "\0a";
>
> The 3rd line does not appear to put $/ back to the default value??
Your handling of $/ looks a bit fishy. First of all, I suspect that
"\0a\0d" is supposed to be a Windows line-ending. Well, it's not. That
would be "\0d\0a".
Secondly, you shouldn't even need to set $/ explicitly.
Usually perl will be able to read a file with Windows newlines even on
other platforms. You can also force newline translation so that perl
will automatically replace "\0d\0a" with "\0a" on input (and do the
reverse on output):
open HTML, "<:crlf", "file.html" or die $!;
This works ever since 5.8.0, AFAIK.
If you really want to tamper with $/ manually, use local() so that perl
recovers the old value for you:
{ # need a block here
local $/ = "\0d\0a";
...
}
# here $/ has its previous value again
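A small self-contained sketch of the same idea, assuming an in-memory string stands in for the file. The separator is written with the hex escapes "\x0d\x0a" for CR LF here, which is an assumption about what was intended:

```perl
use strict;
use warnings;

my $data = "one\x0d\x0atwo\x0d\x0athree\x0d\x0a";
my @records;

{
    # Inside this block only, a record ends at CR LF.
    local $/ = "\x0d\x0a";
    open my $fh, '<', \$data or die $!;
    @records = <$fh>;
    close $fh;
}

# Outside the block the previous value is back without any manual reset.
my $restored = ($/ eq "\n") ? 1 : 0;
print scalar(@records), " records read, \$/ restored: $restored\n";
```

Because the restore happens when the block is left, no later read can pick up the changed separator by accident.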
Tassilo
--
$_=q#",}])!JAPH!qq(tsuJ[{@"tnirp}3..0}_$;//::niam/s~=)]3[))_$-3(rellac(=_$({
pam{rekcahbus})(rekcah{lrePbus})(lreP{rehtonabus})!JAPH!qq(rehtona{tsuJbus#;
$_=reverse,s+(?<=sub).+q#q!'"qq.\t$&."'!#+sexisexiixesixeseg;y~\n~~dddd;eval
| |
| Geoff Cox 2004-03-29, 7:34 am |
| On Mon, 29 Mar 2004 12:08:39 +0200, Gunnar Hjalmarsson
<noreply@gunnar.cc> wrote:
>Geoff Cox wrote:
>
>Then you should consider to use a module instead.
>
>
>I'm not sure what you are trying to do. If the file isn't really huge,
>why don't you just slurp it into a scalar variable instead of reading
>it line by line?
I think I will have to do the slurp ...
re above - in the html file the ends of lines are 0D0A, so by changing
the value of $/ to 0D0A I get all the text between <p> and </p>.
Problem is that I then find that the script is finding text which I do
not want! I wondered whether that was because I have not been able to
change the value of $/ back to the default value. If I print out $/
before changing it to 0D0A I get
$/ =
so I assumed that I could get $/ back to default value by
$/ = "";
but not quite working ...
Cheers
Geoff
| |
| Geoff Cox 2004-03-29, 7:34 am |
| On 29 Mar 2004 10:22:17 GMT, "Tassilo v. Parseval"
<tassilo.parseval@rwth-aachen.de> wrote:
Tassilo,
>If you really want to tamper with $/ manually, use local() so that perl
>recovers the old value for you:
>
> { # need a block here
> local $/ = "\0d\0a";
The local did the trick - the rest of the code works fine now!
Thanks a lot...
Cheers
Geoff
| |
| Geoff Cox 2004-03-29, 7:34 am |
| On 29 Mar 2004 10:22:17 GMT, "Tassilo v. Parseval"
<tassilo.parseval@rwth-aachen.de> wrote:
>If you really want to tamper with $/ manually, use local() so that perl
>recovers the old value for you:
>
> { # need a block here
> local $/ = "\0d\0a";
Oops! I spoke too soon! If I have the
local $/ = "\0D\0A";
in a sub routine - I do not get the <p> .... </p> text. If I have
$/ = "\0D\0A";
I do get the text but I then get some data which I do not want!
Geoff
> ...
> }
> # here $/ has its previous value again
>
>Tassilo
| |
| Gunnar Hjalmarsson 2004-03-29, 7:34 am |
| Geoff Cox wrote:
> I think I will have to do the slurp ...
That would probably make things much easier. :)
> re above - in the html file the end of lines have ODOA so by
> changing the value of $/ to ODOA I get all the text between <p> and
> </p>.
I don't understand.
> I assumed that I could get $/ back to default value by
>
> $/ = "";
That does not set it to default. This does:
$/ = "\n";
But if you for some reason want to fiddle with $/, you'd better do it
locally within a block, as Tassilo suggested.
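To make the difference between these values concrete, here is a short sketch. The helper name read_records and the sample text are invented for illustration:

```perl
use strict;
use warnings;

my $text = "a\nb\n\nc\nd\n";

# Hypothetical helper: read $text using the given record separator.
sub read_records {
    my ($separator) = @_;
    local $/ = $separator;
    open my $fh, '<', \$text or die $!;
    my @records = <$fh>;
    close $fh;
    return @records;
}

my @lines      = read_records("\n");   # default: one record per line
my @paragraphs = read_records("");     # paragraph mode: split at blank lines
my @whole      = read_records(undef);  # slurp mode: everything in one record

printf "%d lines, %d paragraphs, %d slurped record\n",
    scalar @lines, scalar @paragraphs, scalar @whole;
```

So the empty string is a mode of its own (paragraph mode), which is why assigning it does not bring back line-by-line reading.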
--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl
| |
| Geoff Cox 2004-03-29, 8:35 am |
| On Mon, 29 Mar 2004 13:53:51 +0200, Gunnar Hjalmarsson
<noreply@gunnar.cc> wrote:
>That does not set it to default. This does:
>
> $/ = "\n";
The best I can get is as follows
sub para {
local ($/ = "\0a\0d");
my ($linepara) = @_;
$linepara =~ /<p>(.*?)<\/p>/s;
# print ("\$1 = $1 \n");
print OUT ("<tr><td colspan=2>" . $1 . "<\/td><\/tr> \n");
$/ = "";
}
Now, this does get the
<p> jahjsdkaljk al
asdjk aksdj klad
kajsd akl </p>
text but it also gets some lines which I do not want and do not get if
I do not use $/ - so am a bit lost. Tempted to put the whole code up
but that would be asking too much!
I would like to use the slurp approach but am not sure how to do it:
as I parse through an html file and find the first line of the
first <p> etc block of text, how do I get that text and put it into a
file, and then when I find the second <p> block put it in the right
place? I do not want to put all the <p> etc text together; the blocks
appear at different places in the html file.
So, I find the first line with <p>, slurp in the whole of the file,
but only wish to get the first line of the <p> already found and the
next few lines until the end of the first line with a </p>.
How to do that?!
Cheers
Geoff
>
>But if you for some reason want to fiddle with $/, you'd better do it
>locally within a block, as Tassilo suggested.
| |
| Anno Siegel 2004-03-29, 8:35 am |
| Geoff Cox <geoffacox@dontspamblueyonder.co.uk> wrote in comp.lang.perl.misc:
> On Mon, 29 Mar 2004 13:53:51 +0200, Gunnar Hjalmarsson
> <noreply@gunnar.cc> wrote:
>
>
>
> The best I can get is as follows
>
> sub para {
>
> local ($/ = "\0a\0d");
The parentheses counteract the intention of "local". Parenthesized like
this, "\0d\0a" is assigned to $/ and that value is localized. You want
to localize $/ first:
local $/ = "\0a\0d";
[...]
> text but it also get some lines which I do not want and do not get if
That's surely because $/ carries its new value out of the sub.
Anno
| |
| Tassilo v. Parseval 2004-03-29, 8:35 am |
| Also sprach Geoff Cox:
> On Mon, 29 Mar 2004 13:53:51 +0200, Gunnar Hjalmarsson
><noreply@gunnar.cc> wrote:
>
>
>
> The best I can get is as follows
>
> sub para {
>
> local ($/ = "\0a\0d");
>
> my ($linepara) = @_;
> $linepara =~ /<p>(.*?)<\/p>/s;
> # print ("\$1 = $1 \n");
> print OUT ("<tr><td colspan=2>" . $1 . "<\/td><\/tr> \n");
> $/ = "";
> }
>
> Now, this does get the
><p> jahjsdkaljk al
> asdjk aksdj klad
> kajsd akl </p>
>
> text but it also gets some lines which I do not want and do not get if
> I do not use $/ - so am a bit lost. Tempted to put the whole code up
> but that would be asking too much!
>
> I would like to use the slurp approach but am not sure how to do it:
> as I parse through an html file and find the first line of the
> first <p> etc block of text, how do I get that text and put it into a
> file, and then when I find the second <p> block put it in the right
> place? I do not want to put all the <p> etc text together; the blocks
> appear at different places in the html file.
If I understand you right, you want to grab everything that appears in
<p> tags? Here's an example using HTML::Parser:
#! /usr/bin/perl -w
package MyParser;
use strict;
use base qw/HTML::Parser/;
our $in_para;
sub start {
my (undef, $tagname) = @_;
$in_para = 1 if $tagname eq 'p';
}
sub end {
my (undef, $tagname) = @_;
$in_para = 0 if $tagname eq 'p';
}
sub text {
my (undef, $text) = @_;
print $text if $in_para;
}
package main;
my $p = MyParser->new;
$p->parse_file("file.html");
It's dead simple: You create a subclass of HTML::Parser (MyParser) that
overrides the start(), end() and text() methods. The start() method
simply sets the global variable $in_para to a true value when it
encounters a <p> start tag. The end() method sets it to false when </p>
is encountered. The method text() is triggered for ordinary text. It
will only print it when $in_para is true.
This solution is very robust and since the basic skeleton is only a few
lines, it is easily extensible. You most probably want to change the
text() method to let it print into a file or so. If you want to grab
anything between <p> and </p> (including other tags) you must extend
start() and end() a bit to print their last argument (which is the
original text of the tag as it appeared in the HTML-file). Something
like:
sub start {
my (undef, $tagname, undef, undef, $origtext) = @_;
print $origtext if $in_para;
$in_para = 1 if $tagname eq 'p';
}
sub end {
my (undef, $tagname, $origtext) = @_;
$in_para = 0 if $tagname eq 'p';
print $origtext if $in_para;
}
Tassilo
--
$_=q#",}])!JAPH!qq(tsuJ[{@"tnirp}3..0}_$;//::niam/s~=)]3[))_$-3(rellac(=_$({
pam{rekcahbus})(rekcah{lrePbus})(lreP{rehtonabus})!JAPH!qq(rehtona{tsuJbus#;
$_=reverse,s+(?<=sub).+q#q!'"qq.\t$&."'!#+sexisexiixesixeseg;y~\n~~dddd;eval
| |
| Gunnar Hjalmarsson 2004-03-29, 9:45 am |
| Tassilo v. Parseval wrote:
> If I understand you right, you want to grab everything that appears
> in <p> tags? Here's an example using HTML::Parser:
<code example>
> It's dead simple:
Hmm.. Not sure I agree on "dead simple".
If grabbing everything between <p> tags is *all* there is, I don't
understand why something like this wouldn't be sufficient:
open FH, 'file.html' or die $!;
$_ = do { local $/; <FH> };
close FH;
my @paras;
push @paras, $1 while m!<\s*p[^>]*>(.*?)<\s*/\s*p\s*>!igs;
To me, again if that's all there is, this appears to be even simpler
than "dead simple". ;-)
--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl
| |
| Geoff Cox 2004-03-29, 9:45 am |
| On 29 Mar 2004 13:02:51 GMT, "Tassilo v. Parseval"
<tassilo.parseval@rwth-aachen.de> wrote:
>If I understand you right, you want to grab everything that appears in
><p> tags? Here's an example using HTML::Parser:
Tassilo
The code below will take a bit of thinking about! However I need to
get the <p> .... </p> text in the order in which it appears in the
html file, not all together.
The html file has say
<p> ajdkjs ak lsdjas
asdja dkasj dl asd
lad akl;sdk a;dkl; </p>
<h2 align= etc </h2>
<option value = "docs/ etc >text</option>
(when the option line is met I take the path part and use it to search
another file in order to get some related text)
<h2 etc
<option etc
<p> etc
So, if I used the slurp idea - not clear how I would get the above in
order??
Cheers
Geoff
>
> #! /usr/bin/perl -w
>
> package MyParser;
>
> use strict;
> use base qw/HTML::Parser/;
>
> our $in_para;
>
> sub start {
> my (undef, $tagname) = @_;
> $in_para = 1 if $tagname eq 'p';
> }
>
> sub end {
> my (undef, $tagname) = @_;
> $in_para = 0 if $tagname eq 'p';
> }
>
> sub text {
> my (undef, $text) = @_;
> print $text if $in_para;
> }
>
> package main;
>
> my $p = MyParser->new;
> $p->parse_file("file.html");
>
>It's dead simple: You create a subclass of HTML::Parser (MyParser) that
>overwrites the start(), end() and text() method. The start() method
>simply sets the global variable $in_para to a true value when it
>encountered a <p>-starttag. It's set to false when </p> is encountered.
>The method text() is triggered for ordinary text. It will only print it
>when $in_para is true.
>
>This solution is very robust and since the basic skeleton is only a few
>lines, it is easily extensible. You most probably want to change the
>text() method to let it print into a file or so. If you want to grab
>anything between <p> and </p> (including other tags) you must extend
>start() and end() a bit to print their last argument (which is the
>original text of the tag as it appeared in the HTML-file). Something
>like:
>
> sub start {
> my (undef, $tagname, undef, undef, $origtext) = @_;
> print $origtext if $in_para;
> $in_para = 1 if $tagname eq 'p';
> }
>
> sub end {
> my (undef, $tagname, $origtext) = @_;
> $in_para = 0 if $tagname eq 'p';
> print $origtext if $in_para;
> }
>
>Tassilo
| |
| Geoff Cox 2004-03-29, 9:45 am |
| On 29 Mar 2004 12:50:44 GMT, anno4000@lublin.zrz.tu-berlin.de (Anno
Siegel) wrote:
>The parentheses counteract the intention of "local". Parenthesized like
>this, "\0d\0a" is assigned to $/ and that value is localized. You want
>to localize $/ first:
>
> local $/ = "\0a\0d";
Anno,
I do not understand this! If I use
local $/ = "\0D\0A"; in the sub routine
I do not get the <p> ...... </p> text.
If I use
local ($/ = "\0D\0A");
I do get it !! But then I get some text which I do not wish to have!
Any ideas?
Cheers
Geoff
>
>[...]
>
>
>That's surely because $/ carries its new value out of the sub.
>
>Anno
| |
| Geoff Cox 2004-03-29, 9:45 am |
| On Mon, 29 Mar 2004 15:43:25 +0200, Gunnar Hjalmarsson
<noreply@gunnar.cc> wrote:
>Tassilo v. Parseval wrote:
>
><code example>
>
>
>Hmm.. Not sure I agree on "dead simple".
Gunnar
not dead simple to me!
>
>If grabbing everything between <p> tags is *all* there is, I don't
>understand why something like this wouldn't be sufficient:
It is not all there is - at least I do not think so...the file
contains blocks of text between <p> and </p>s mixed in with other
lines such as <h2> haskhdjk ashj </h2>, <option jjaksdjka </option>
etc and I want to parse through the file in order, taking the <p> </p>
and <h2> </h2> data into another file. Also when finding the <option
line I use the extracted path to search another file to get related
text...
The nearest I get with $/ is
sub para {
local ($/ = "\0D\0A");
my ($linepara) = @_;
$linepara =~ /<p>(.*?)<\/p>/s;
# print ("\$1 = $1 \n");
print OUT ("<tr><td colspan=2>" . $1 . "<\/td><\/tr> \n");
$/ = "";
}
but as I say, this gets the <p> </p>, <h2> ... </h2> info in the right
places. It also gets the related text using the path info from the
<option lines BUT it also puts in the option selection boxes which I
do not want and does not happen if I do not use $/ ...!!!??
If I use slurp idea then I would put the whole html file into $total
say but how do I parse through it in order of appearance of <p>, <h2>
<option etc as above??
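One possible sketch of scanning a slurped string in document order, assuming the markup is as regular as the samples quoted in this thread. The sample HTML, the @found list and the pattern are illustrative only, and a regex like this remains fragile on real-world HTML:

```perl
use strict;
use warnings;

my $total = <<'HTML';
<p> first
paragraph </p>
<h2 align="left">Heading</h2>
<option value="docs/unit1">text</option>
<p> second </p>
HTML

my @found;
# With /g the match position moves left to right through the string,
# so the items arrive in the order in which they appear in the file.
while ($total =~ m{<(p|h2)\b[^>]*>(.*?)</\1>|<option\s+value\s*=\s*"([^"]*)"}gis) {
    if (defined $3) { push @found, [ 'option', $3 ]; }   # path from <option>
    else            { push @found, [ lc $1, $2 ]; }      # <p> or <h2> content
}

print "$_->[0]: $_->[1]\n" for @found;
```

Each entry records which kind of tag was seen, so the later processing (writing a table row, or looking up the option path in another file) can be chosen per item while keeping the original order.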
Cheers
Geoff
>
> open FH, 'file.html' or die $!;
> $_ = do { local $/; <FH> };
> close FH;
>
> my @paras;
> push @paras, $1 while m!<\s*p[^>]*>(.*?)<\s*/\s*p\s*>!igs;
>
>To me, again if that's all there is, this appears to be even simpler
>than "dead simple". ;-)
| |
| Tassilo v. Parseval 2004-03-29, 9:45 am |
| Also sprach Gunnar Hjalmarsson:
> Tassilo v. Parseval wrote:
>
><code example>
>
>
> Hmm.. Not sure I agree on "dead simple".
>
> If grabbing everything between <p> tags is *all* there is, I don't
> understand why something like this wouldn't be sufficient:
>
> open FH, 'file.html' or die $!;
> $_ = do { local $/; <FH> };
> close FH;
>
> my @paras;
> push @paras, $1 while m!<\s*p[^>]*>(.*?)<\s*/\s*p\s*>!igs;
>
> To me, again if that's all there is, this appears to be even simpler
> than "dead simple". ;-)
There are some contrived edge cases not captured by the above. For
instance, there could be a closing </p> in an HTML comment (yeah, I
know, this happens all the time;-).
Another situation where a regex could fail is with attributes. The
quoted string in such attributes could contain something that looks like
a tag.
Tassilo
--
$_=q#",}])!JAPH!qq(tsuJ[{@"tnirp}3..0}_$;//::niam/s~=)]3[))_$-3(rellac(=_$({
pam{rekcahbus})(rekcah{lrePbus})(lreP{rehtonabus})!JAPH!qq(rehtona{tsuJbus#;
$_=reverse,s+(?<=sub).+q#q!'"qq.\t$&."'!#+sexisexiixesixeseg;y~\n~~dddd;eval
| |
| Tassilo v. Parseval 2004-03-29, 9:45 am |
| Also sprach Geoff Cox:
> On 29 Mar 2004 13:02:51 GMT, "Tassilo v. Parseval"
><tassilo.parseval@rwth-aachen.de> wrote:
>
>
>
> Tassilo
>
> The code below will take a bit of thinking about! However I need to
> get the <p> .... </p> text in the order in which it appears in the
> html file, not all together.
Did you try the little program? The text() method is triggered for every
paragraph in the order in which they appear.
> The html file has say
>
><p> ajdkjs ak lsdjas
> asdja dkasj dl asd
> lad akl;sdk a;dkl; </p>
>
><h2 align= etc </h2>
>
><option value = "docs/ etc >text</option>
>
> (when the option line is met I take the path part and use it to search
> another file in order to get some related text)
This is yet another good reason not to use regular expressions and
employ a real parser for that. Recursing into the next file when an
<option> tag is found is incredibly easy. All you have to do is create
another instance of the parser in the start() method and have it parse
the file given through 'value'. That's around three more lines in my
example and you have that working as well:
sub start {
my (undef, $tagname, $attr) = @_;
$in_para = 1 if $tagname eq 'p';
if ($tagname eq 'option') {
__PACKAGE__->new->parse_file($attr->{value});
}
}
Tassilo
--
$_=q#",}])!JAPH!qq(tsuJ[{@"tnirp}3..0}_$;//::niam/s~=)]3[))_$-3(rellac(=_$({
pam{rekcahbus})(rekcah{lrePbus})(lreP{rehtonabus})!JAPH!qq(rehtona{tsuJbus#;
$_=reverse,s+(?<=sub).+q#q!'"qq.\t$&."'!#+sexisexiixesixeseg;y~\n~~dddd;eval
| |
| Tad McClellan 2004-03-29, 9:45 am |
| Geoff Cox <geoffacox@dontspamblueyonder.co.uk> wrote:
> $/ = "\0a\0d";
> $line =~ /<p>(.*?)<\/p>/s;
> $/ = "\0a";
>
> The 3rd line does not appear to put $/ back to the default value??
^^^^^^
^^^^^^
There is nothing about the effect of $/ that can be observed
from the code you posted.
The $/ variable affects the <INPUT> operator.
The code you've posted does not use the <INPUT> operator...
[ snip TOFU ]
--
Tad McClellan SGML consulting
tadmc@augustmail.com Perl programming
Fort Worth, Texas
| |
| Gunnar Hjalmarsson 2004-03-29, 10:35 am |
| Geoff Cox wrote:
> On Mon, 29 Mar 2004 15:43:25 +0200, Gunnar Hjalmarsson
> <noreply@gunnar.cc> wrote:
>
> It is not all there is - at least I do not think so...the file
> contains blocks of text between <p> and </p>s mixed in with other
> lines such as <h2> haskhdjk ashj </h2>, <option jjaksdjka
> </option> etc and I want to parse through the file in order, taking
> the <p> </p> and <h2> </h2> data into another file. Also when
> finding the <option line I use the extracted path to search another
> file to get related text...
Then you'd better just forget about my suggestion and follow Tassilo's
advice.
--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl
| |
| Gunnar Hjalmarsson 2004-03-29, 10:35 am |
| Tassilo v. Parseval wrote:
> Also sprach Gunnar Hjalmarsson:
>
> There are some contrived edge cases not captured by the above. For
> instance, there could be a closing </p> in an HTML comment (yeah, I
> know, this happens all the time;-).
Yeah, yeah. As do angle brackets in quoted attribute names...
I know you need to be watchful of those 'edge cases'. If you don't
know they won't appear, even I would go for a module rather than
tweaking the regex further. ;-)
--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl
| |
| Geoff Cox 2004-03-29, 11:47 am |
| On Mon, 29 Mar 2004 07:51:24 -0600, Tad McClellan
<tadmc@augustmail.com> wrote:
>Geoff Cox <geoffacox@dontspamblueyonder.co.uk> wrote:
>
Tad,
but surely if I change the value of $/ from its default value to 0d0a
can I not change it back to its default value after getting the <p>
.... </p> text?
certainly changing it to
$/ = "";
seems to give the best results in terms of the main aim of the whole
code. It is just that it gives me text which I do not get when I do not
use the $/ change idea!
Can you de-mystify this for me?!
Geoff
> ^^^^^^
> ^^^^^^
>
>There is nothing about the effect of $/ that can be observed
>from the code you posted.
>
>The $/ variable affects the <INPUT> operator.
>
>The code you've posted does not use the <INPUT> operator...
>
>
>
>[ snip TOFU ]
| |
| Geoff Cox 2004-03-29, 11:47 am |
| On Mon, 29 Mar 2004 07:51:24 -0600, Tad McClellan
<tadmc@augustmail.com> wrote:
>Geoff Cox <geoffacox@dontspamblueyonder.co.uk> wrote:
>
> ^^^^^^
> ^^^^^^
>
>There is nothing about the effect of $/ that can be observed
>from the code you posted.
Tad,
the whole code follows !! I know it has little elegance but apart from
giving me some text which I do not need, it works! Perhaps you can see
why the $/ is not working correctly? The aim by the way is to convert a
web site from one which uses MySQL etc to one which does not use a
database.....
Cheers
Geoff
warnings;
use strict;
use File::Find;
my $dir = 'd:/a-keep9/prog-nondb/old-prog';
find sub {
my $name = $_;
if ($name =~ /.htm/) {
open (IN, "$name");
open (OUT, ">>d:/a-keep9/prog-nondb/progs/test/$name");
my $html = "<html>\n<header>\n<title>$name</title>
\n<link rel='stylesheet' type='text/css'
href='assets/style/style-1.css'>
\n</header>\n<body>\n";
print OUT $html;
print OUT ("<table border=1 cellpadding=10>");
while (defined (my $line = <IN> )) {
if (($line =~ /<h2 align/ ) || ($line =~ /<b>/) || ($line =~
/<strong>/) || ($line =~ /<p /)) {
print OUT ("<tr><td colspan=2>" . $line . "<\/td><\/tr>
\n");
}
if ($line =~ /<p>/) {
&para($line);
}
if ($line =~ /<option value="(.*?)">/) {
&choice($1);
}
}
print OUT ("<\/table>\n<\/body>\n<\/html>");
}
}, $dir;
sub choice {
my ($path) = @_;
if ($path =~/btec-first/) {
&intro($path);
&applefirst($path);
} elsif ($path =~ /classroom-notes/) {
&intro($path);
&clasroomnotes($path);
}elsif ($path =~/pears\/assignments/) {
&intro($path);
&pearsassignments($path);
} else {
&intro($path);
&other($path);
}
}
sub intro {
my ($pathhere) = @_;
open (INN, "d:/a-keep9/prog-nondb/db/total-260304.txt");
my $lineintro = <INN>;
while (defined ($lineintro = <INN> )) {
if ($lineintro =~ /$pathhere','(.*?)'\)\;/) {
print OUT ("<tr><td>$1 <p> \</td>\n");
}
}
}
sub applefirst {
my($pattern) = @_;
my $linee = $pattern;
my $c=0;
$linee =~ /.*unit(\d).*?chap(\d)/;
my $u = $1;
my $chap = $2;
open (INNN, "d:/a-keep9/prog-nondb/allphp/allphp.htm");
while (<INNN> ){
last if /$pattern/;
}
my ($curr, $next1, $next2, $next3) = <INNN>;
close (INNN);
if ($next3 =~ /\$i\<(\d);/) {
my $nn = $1;
print OUT ("<td>\n");
for (my $c=1;$c<$nn;$c++) {
print OUT ('<a href="'. $pattern . "/unit" . $u . "-chap" .
$chap . "-doc" . $c . ".zip" . '">' . "Document$c" . "</a><br>" .
"\n");
}
print OUT ("</td></tr>\n");
}
}
sub clasroomnotes {
my($pattern) = @_;
my $c=0;
open (INNN, "d:/a-keep9/prog-nondb/allphp/allphp.htm");
while (<INNN> ){
last if /$pattern/;
}
my ($curr, $next1, $next2, $next3) = <INNN>;
close (INNN);
if ($next3 =~ /\$i\<(\d);/) {
my $nn = $1;
print OUT ("<td>\n");
for (my $c=1;$c<$nn;$c++) {
print OUT ('<a href="'. $pattern . "-doc" . $c . ".zip" . '">' .
"Document$c" . "</a><br>" . "\n");
}
print OUT ("</td></tr>\n");
}
}
sub other {
my ($pattern) = @_;
print OUT ("<td> \n");
print OUT ('<a href="'. $pattern . ".zip" . '">' . "Document" .
"</a><br>" . "\n");
print OUT ("</td></tr>\n");
}
sub pearsassignments {
my($pattern) = @_;
print OUT ("<td> \n");
print OUT ('<a href="'. $pattern . ".zip" . '">' . "Document" .
"</a><br>" . "\n");
print OUT ('<a href="'. $pattern . "-grid" . ".zip" . '">' . "Grid" .
"</a><br>" . "\n");
print OUT ("</td></tr>\n");
}
sub para {
local ($/ = "\0D\0A");
my ($linepara) = @_;
$linepara =~ /<p>(.*?)<\/p>/s;
# print ("\$1 = $1 \n");
print OUT ("<tr><td colspan=2>" . $1 . "<\/td><\/tr> \n");
$/ = "";
}
>
>The $/ variable affects the <INPUT> operator.
>
>The code you've posted does not use the <INPUT> operator...
>
>
>
>[ snip TOFU ]
| |
| Tad McClellan 2004-03-29, 12:33 pm |
| Gunnar Hjalmarsson <noreply@gunnar.cc> wrote:
> If grabbing everything between <p> tags is *all* there is,
> push @paras, $1 while m!<\s*p[^>]*>(.*?)<\s*/\s*p\s*>!igs;
^^ ^^^ ^^^
^^ ^^^ ^^^
Whitespace is not allowed there in HTML.
--
Tad McClellan SGML consulting
tadmc@augustmail.com Perl programming
Fort Worth, Texas
| |
| Tad McClellan 2004-03-29, 12:33 pm |
| Geoff Cox <geoffacox@dontspamblueyonder.co.uk> wrote:
> The nearest I get with $/ is
>
> sub para {
>
> local ($/ = "\0D\0A");
>
> my ($linepara) = @_;
> $linepara =~ /<p>(.*?)<\/p>/s;
> # print ("\$1 = $1 \n");
> print OUT ("<tr><td colspan=2>" . $1 . "<\/td><\/tr> \n");
> $/ = "";
> }
Why are you still changing the value of $/ ?
The value of $/ does NOT affect pattern matching.
The value of $/ may affect the string that the pattern is attempting
to match against, but you do not show that the string is being
input anywhere.
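A quick way to check this, sketched with an invented sample string: the same match is run under three different values of $/ and the captured text is compared.

```perl
use strict;
use warnings;

my $string = "<p> one\ntwo </p>\n";

my @captures;
for my $separator ("\n", "\x0d\x0a", "") {
    local $/ = $separator;     # changes how the input operator splits records
    my ($inner) = $string =~ /<p>(.*?)<\/p>/s;
    push @captures, $inner;    # the match itself never looks at $/
}

my $all_same = ($captures[0] eq $captures[1] && $captures[1] eq $captures[2]) ? 1 : 0;
print "captures identical: $all_same\n";
```

Any difference seen in a real program therefore comes from what ended up in the string during reading, not from the match.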
--
Tad McClellan SGML consulting
tadmc@augustmail.com Perl programming
Fort Worth, Texas
| |
| Tad McClellan 2004-03-29, 12:33 pm |
| Geoff Cox <geoffacox@dontspamblueyonder.co.uk> wrote:
> $linepara =~ /<p>(.*?)<\/p>/s;
> print OUT ("<tr><td colspan=2>" . $1 . "<\/td><\/tr> \n");
You should never use the dollar-digit variables unless you
have first ensured that the match *succeeded*.
Slash characters are not special in strings, there is no
need to backslash them.
if ( $linepara =~ /<p>(.*?)<\/p>/s ) {
print OUT "<tr><td colspan=2>$1</td></tr>\n";
}
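The reason is that a failed match leaves $1 holding whatever an earlier successful match put there. A short sketch with made-up strings:

```perl
use strict;
use warnings;

my $good = "<p>first</p>";
my $bad  = "no paragraph here";

$good =~ /<p>(.*?)<\/p>/s;
my $after_good = $1;           # "first"

$bad =~ /<p>(.*?)<\/p>/s;      # fails, so $1 is left untouched
my $after_bad = $1;            # still "first": stale data

# Guarded form: only read $1 when the match succeeded.
my $guarded = ($bad =~ /<p>(.*?)<\/p>/s) ? $1 : undef;

print "unguarded: $after_bad\n";
print "guarded:   ", (defined $guarded ? $guarded : "(no match)"), "\n";
```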
--
Tad McClellan SGML consulting
tadmc@augustmail.com Perl programming
Fort Worth, Texas
| |
| Anno Siegel 2004-03-29, 1:49 pm |
| Geoff Cox <geoffacox@dontspamblueyonder.co.uk> wrote in comp.lang.perl.misc:
> On 29 Mar 2004 12:50:44 GMT, anno4000@lublin.zrz.tu-berlin.de (Anno
> Siegel) wrote:
>
>
> Anno,
>
> I do not understand this! If I use
>
> local $/ = "\0D\0A"; in the sub routine
>
> I do not get the <p> ...... </p> text.
>
> If I use
>
> local ($/ = "\0D\0A");
>
> I do get it !! But then I get some text which I do not wish to have!
>
> Any ideas?
Well, for one I suggest that you print out "\0D\0A". You will see
that it isn't what you think it is. You want "\x0d\x0a".
Setting $/ = "\0D\0A" means that the next read will slurp in the rest of
the file (because that sequence is unlikely to be met). What that means
for the behavior of your program I don't know. In any case, you ought
to get the intended end-of-line sequence right first.
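Printing the character codes makes the difference visible. This is only a sketch for inspection; the variable names are arbitrary:

```perl
use strict;
use warnings;

my $wrong = "\0D\0A";      # NUL, "D", NUL, "A": four characters
my $right = "\x0d\x0a";    # carriage return, line feed: two characters

my @wrong_bytes = map { sprintf("%02x", ord($_)) } split //, $wrong;
my @right_bytes = map { sprintf("%02x", ord($_)) } split //, $right;

print "wrong: @wrong_bytes\n";   # wrong: 00 44 00 41
print "right: @right_bytes\n";   # right: 0d 0a
```

In a double-quoted string \0 starts an octal escape, and since "D" and "A" are not octal digits they stay as ordinary letters after a NUL character.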
The behavior of "local ( $/ = 'something')" is a bit mystifying. Since
local() happens after the assignment, it should render $/ undefined,
but it preserves the value assigned. It's probably the DWIMmer.
In any case, it assigns to $/ *before* local() happens, and so makes
local() useless.
Anno
| |
| Gunnar Hjalmarsson 2004-03-29, 1:49 pm |
| Tad McClellan wrote:
> Gunnar Hjalmarsson <noreply@gunnar.cc> wrote:
>
> ^^ ^^^ ^^^
> ^^ ^^^ ^^^
>
> Whitespace is not allowed there in HTML.
Hmm.. I put in those just before sending in order to *prevent*
objections to using a regex... ;-)
I'm sure you are right. My Mozilla browser doesn't seem to know,
though, but to my surprise MSIE does.
--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl
| |
| Tad McClellan 2004-03-29, 1:49 pm |
| Geoff Cox <geoffacox@dontspamblueyonder.co.uk> wrote:
> On Mon, 29 Mar 2004 07:51:24 -0600, Tad McClellan
><tadmc@augustmail.com> wrote:
>
> Tad,
>
> the whole code follows !!
Don't do that!
Make a short and complete program that we can run that illustrates
the problem you are asking about.
Have you seen the Posting Guidelines that are posted here frequently?
First make a short (less than 20-30 lines) and *complete* program
that illustrates the problem you are having. People should be able
to run your program by copy/pasting the code from your article. (You
will find that doing this step very often reveals your problem
directly. Leading to an answer much more quickly and reliably than
posting to Usenet.)
Describe *precisely* the input to your program. Also provide example
input data for your program. If you need to show file input, use the
__DATA__ token (perldata.pod) to provide the file contents inside of
your Perl program.
> I know it has little elegance
It is downright horrid style.
You can make your life easier if you make your code easier to
read and understand.
> Perhaps you can see
> why the $/ is not working correctly?
Because you set it *after* you have already read from the file.
It affects input; you must change it before you do the input
that you want to affect.
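A minimal sketch of that ordering, reading from an in-memory string instead of a real file. The sample text, the helper name first_record and the choice of "</p>\n" as a separator are assumptions made for the demonstration, not something taken from the posted program.

```perl
use strict;
use warnings;

# Hypothetical two-line paragraph standing in for the real file.
my $data = "<p>one\ntwo</p>\nrest of file\n";

# Open a fresh handle, set $/ first, then do exactly one read.
sub first_record {
    my ($separator) = @_;
    open my $fh, '<', \$data or die "could not open in-memory file: $!";
    local $/ = $separator;    # in effect *before* the read below
    return scalar <$fh>;
}

my $by_line = first_record("\n");        # stops at the first newline
my $by_para = first_record("</p>\n");    # runs on to the closing tag

print "line-wise read:      $by_line";
print "paragraph-wise read: $by_para";
```

The same data yields a different first record depending only on what $/ held at the moment of the read.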
> warnings;
Please post your *actual* code:
use warnings;
Have you seen the Posting Guidelines that are posted here frequently?
Do not re-type Perl code
Use copy/paste or your editor's "import" function rather than
attempting to type in your code. If you make a typo you will get
followups about your typos instead of about the question you are
trying to get answered.
> use File::Find;
> my $dir = 'd:/a-keep9/prog-nondb/old-prog';
>
> find sub {
> my $name = $_;
There is no need to copy it from one scalar to another.
Why do you copy it from one scalar to another?
> if ($name =~ /.htm/) {
That will match if $name = 'nightmare'.
Is that what you want?
Probably not, so:
if ( /\.htm$/ ) {
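A quick sketch comparing the patterns against a few made-up file names. The third pattern, with an optional "l", is an assumption added here in case files ending in ".html" should be picked up as well; the anchored pattern above skips them.

```perl
use strict;
use warnings;

# Hypothetical file names for illustration.
my @names = ('nightmare', 'index.htm', 'index.html', 'notes.htm.bak');

my @loose    = grep { /.htm/ }     @names;    # "." matches any character
my @anchored = grep { /\.htm$/ }   @names;    # literal dot, at the end
my @either   = grep { /\.html?$/ } @names;    # also accepts ".html"

print "loose:    @loose\n";
print "anchored: @anchored\n";
print "either:   @either\n";
```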
> open (IN, "$name");
perldoc -q vars
What's wrong with always quoting "$vars"?
Lose the useless use of quotes.
You should always, yes *always*, check the return value from open():
open IN, $_ or die "could not open '$_' $!";
or
open IN, $name or die "could not open '$name' $!";
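The same check can be sketched with a three-argument open and a lexical filehandle, a common alternative to the bareword IN. The helper name open_for_reading and the file name are made up, and the file is assumed not to exist so that the failure path shows.

```perl
use strict;
use warnings;

sub open_for_reading {
    my ($name) = @_;
    # "or" binds loosely, so die only runs when open itself fails.
    open my $in, '<', $name
        or die "could not open '$name': $!";
    return $in;
}

# Deliberately missing file (assumed not to exist).
my $handle = eval { open_for_reading('no-such-dir-for-this-sketch/missing.htm') };
my $error  = $@;

print "open failed as expected: $error" unless $handle;
```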
> while (defined (my $line = <IN> )) {
You need to change $/ *before* that line of code...
> if ($line =~ /<p>/) {
> &para($line);
.... but you change it *after* in the para() subroutine.
Too late, changing it there will not affect the contents of $line.
> print OUT ("<\/table>\n<\/body>\n<\/html>");
Slashes are not special in strings, no backslashing needed.
print OUT ("</table>\n</body>\n</html>");
> sub choice {
> my ($path) = @_;
> if ($path =~/btec-first/) {
> &intro($path);
> &applefirst($path);
> } elsif ($path =~ /classroom-notes/) {
> &intro($path);
> &clasroomnotes($path);
> }elsif ($path =~/pears\/assignments/) {
> &intro($path);
> &pearsassignments($path);
> } else {
> &intro($path);
> &other($path);
> }
> }
You call intro($path) for every alternative!
You should just call it once at the top of the sub instead.
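A sketch of what that could look like. The five subs called from choice() were not shown in full, so the versions here are stand-ins that only record that they were called, and the path passed in is hypothetical.

```perl
use strict;
use warnings;

# Stand-ins for the real subs; each one just records its own name.
my @calls;
sub intro            { push @calls, 'intro' }
sub applefirst       { push @calls, 'applefirst' }
sub clasroomnotes    { push @calls, 'clasroomnotes' }
sub pearsassignments { push @calls, 'pearsassignments' }
sub other            { push @calls, 'other' }

sub choice {
    my ($path) = @_;
    intro($path);    # common to every branch, so call it once up front
    if    ($path =~ /btec-first/)         { applefirst($path) }
    elsif ($path =~ /classroom-notes/)    { clasroomnotes($path) }
    elsif ($path =~ /pears\/assignments/) { pearsassignments($path) }
    else                                  { other($path) }
    return;
}

choice('d:/site/classroom-notes/week1.htm');    # hypothetical path
print "@calls\n";
```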
>
Please compose your followups properly.
Soon.
Like on your very next followup.
Have you seen the Posting Guidelines that are posted here frequently?
Use an effective followup style
When composing a followup, quote only enough text to establish the
context for the comments that you will add. Always indicate who
wrote the quoted material. Never quote an entire article. Never
quote a .signature (unless that is what you are commenting on).
Intersperse your comments *following* each section of quoted text to
which they relate. Unappreciated followup styles are referred to as
"top-posting", "Jeopardy" (because the answer comes before the
question), or "TOFU" (Text Over, Fullquote Under).
Reversing the chronology of the dialog makes it much harder to
understand (some folks won't even read it if written in that style).
For more information on quoting style, see:
http://web.presby.edu/~nnqadmin/nnq/nquote.html
--
Tad McClellan SGML consulting
tadmc@augustmail.com Perl programming
Fort Worth, Texas
| |
| Geoff Cox 2004-03-29, 2:39 pm |
| On 29 Mar 2004 17:16:31 GMT, anno4000@lublin.zrz.tu-berlin.de (Anno
Siegel) wrote:
>
>Well, for one I suggest that you print out "\0D\0A". You will see
>that it isn't what you think it is. You want "\x0d\x0a".
>
>Setting $/ = "\0D\0A" means that the next read will slurp in the rest of
>the file (because that sequence is unlikely to be met). What that means
>for the behavior of your program I don't know. In any case, you ought
>to get the intended end-of-line sequence right first.
>
>The behavior of "local ( $/ = 'something')" is a bit mystifying. Since
>local() happens after the assignment, it should render $/ undefined,
>but it preserves the value assigned. It's probably the DWIMmer.
>In any case, it assigns to $/ *before* local() happens, and so makes
>local() useless.
Anno
many thanks for this - will have to try a little later - must go out
for short while!
Cheers
Geoff
>
>Anno
| |
| Geoff Cox 2004-03-29, 8:35 pm |
| On Mon, 29 Mar 2004 11:27:56 -0600, Tad McClellan
<tadmc@augustmail.com> wrote:
Tad,
>
>It is downright horrid style.
I know, but it is all working except for the reading of the <p> ..
</p>
>Because you set it *after* you have already read from the file.
>
>It affects input, you much change it before you do the input
>that you want to affect.
Have only now realised what you mean by this. I had missed the point that
the sub's data comes from an earlier event, i.e. the reading in of the
file. This makes me think that changing $/ just will not work,
as I only wish to apply it to the text between <p> and </p> on
different lines. Changing $/ for the whole file will break the
rest of the code... so is it back to slurping in the whole file? I do
not see how this can deal with the parts of the file in the order in
which they appear in the file. HTML::Parser is perhaps the best
way but that will take me a little while to get to grips with!
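On the ordering question, a minimal sketch under stated assumptions: the page content is made up and read through an in-memory filehandle. Once the file is slurped into one string, a /g match in a while loop visits the paragraphs one after another in the order they appear. As noted earlier in the thread, a real HTML parser such as HTML::Parser remains the more robust route.

```perl
use strict;
use warnings;

# Hypothetical page content, read through an in-memory filehandle.
my $page = "<h1>Title</h1>\n<p>first\nparagraph</p>\n<hr>\n<p>second</p>\n";
open my $in, '<', \$page or die "could not open in-memory file: $!";

# Slurp: with $/ undefined inside the do block, one read returns everything.
my $html = do { local $/; <$in> };
close $in;

# /g in a while loop resumes after the previous match, so the paragraphs
# come out in file order; /s lets "." cross the newlines.
my @paragraphs;
while ($html =~ /<p>(.*?)<\/p>/sg) {
    push @paragraphs, $1;
}

print scalar(@paragraphs), " paragraphs found\n";
```

Because $/ is only localized inside the do block, any line-by-line reading elsewhere in the program is unaffected.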
>
>There is no need to copy it from one scalar to another.
OK.
>That will match if $name = 'nightmare'.
>
>Is that what you want?
>
>Probably not, so:
>
> if ( /\.htm$/ ) {
OK
>
>
>Lose the useless use of quotes.
OK
>You should always, yes *always*, check the return value from open():
>
> open IN, $_ or die "could not open '$_' $!";
>or
> open IN, $name or die "could not open '$name' $!";
>Too late, changing it there will not affect the contents of $line.
OK.
>
>Slashes are not special in strings, no backslashing needed.
>
> print OUT ("</table>\n</body>\n</html>");
OK
>
>
>
>
>
>You call intro($path) for every alternative!
>
>You should just call it once at the top of the sub instead.
OK
Thanks for your suggestions.
Geoff
>
> http://web.presby.edu/~nnqadmin/nnq/nquote.html
|