Looking for Regexp that strips newlines inside of a tag
| weston 2005-08-26, 6:58 pm |
| I'm trying to streamline workflow from Word Documents to HTML. There
are numerous atrocities perpetrated in the process of saving a Word Doc
to filtered HTML, but there's one that I find particularly interesting
(and annoying): sometimes tags have newlines within them. Especially
<span> tags. For example:
<p><span lang=JA style='font-family:
"MS Mincho"'>(</span>
Is there a regular expression that can pull the span up onto the same
line?
So far, I've tried slurping the whole file into a single string, and
doing:
s/(<span.*?)^+([^>]*>)/$1 $2/mig;
which seems to have no effect, and this:
s/(<span.*?)\n+([^>]*>)/$1 $3/mig;
which seems to lop off everything from the first line.
It seems likely there's a way to do this, but I'm sort of stuck on what
to try next. Any ideas?
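A note on the two attempts, with a minimal sketch of a working two-capture variant. In the first pattern, `.` does not match a newline (there is no /s modifier), and `^` under /m only matches just after a newline that the pattern never consumes, so nothing matches. The second pattern does match, but it substitutes `$3` although only two capture groups exist, so the second half of the tag is dropped. The variable names and the loop below are illustrative assumptions, not something taken from the thread:

```perl
use strict;
use warnings;

# Sample input taken from the question above.
my $html = qq{<p><span lang=JA style='font-family:\n"MS Mincho"'>(</span>\n};

# [^>]* keeps the match inside one tag and, unlike ".", also matches "\n".
# Both halves of the tag are captured and rejoined with a single space.
# The loop repeats until no <span> tag contains a newline.
my $fixed = $html;
1 while $fixed =~ s/(<span[^>]*?)\s*\n+\s*([^>]*>)/$1 $2/gi;

print $fixed;
```

This only handles `<span` start tags and assumes attribute values never contain a literal `>`; real-world HTML can break both assumptions.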
| |
| A. Sinan Unur 2005-08-26, 6:58 pm |
| "weston" <notsew-reversePreceedingAndRemoveThis@canncentral.org> wrote
in news:1125094812.057093.89070@g49g2000cwa.googlegroups.com:
> I'm trying to streamline workflow from Word Documents to HTML. There
> are numerous atrocities perpetrated in the process of saving a Word
> Doc to filtered HTML, but there's one that I find particularly
> interesting (and annoying): sometimes tags have newlines within them.
> Especially <span> tags. For example:
>
> <p><span lang=JA style='font-family:
> "MS Mincho"'>(</span>
>
> Is there a regular expression that can pull the span up onto the same
> line?
You should use an HTML parser to parse HTML. See
perldoc -q html
With HTML::TokeParser::Simple, this can be achieved simply using code
similar to the following (untested):
#! /usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Sample input: a <span> tag whose attribute value spans two lines.
my $html = <<HTML
<p><span lang=JA style='font-family:
"MS Mincho"'>(</span></p>
HTML
;

my $p = HTML::TokeParser::Simple->new(\$html);

while( my $token = $p->get_token ) {
    if( $token->is_start_tag ) {
        # Collapse runs of whitespace (including newlines) in each attribute value.
        my $attrs = $token->get_attr;
        for my $attr (keys %{ $attrs }) {
            $attrs->{$attr} =~ s/\s+/ /sg;
            $token->set_attr($attr, $attrs->{$attr});
        }
    }
    # Print every token, followed by a newline.
    print $token->as_is . "\n";
}
__END__
Sinan
--
A. Sinan Unur <1usa@llenroc.ude.invalid>
(reverse each component and remove .invalid for email address)
comp.lang.perl.misc guidelines on the WWW:
http://mail.augustmail.com/~tadmc/c...guidelines.html
| |
| weston 2005-08-26, 9:56 pm |
| > You should use an HTML parser to parse HTML.
You're quite correct, and I appreciate your help in pushing me this
way. I've been avoiding the actual parsers because regexps are what I'm
familiar with, and perhaps this is a good time to change.
However, as much as I'm interested in solving the problem at hand, I'm
also very curious about the potential gaps in my regexp knowledge.
And the plot, as they say, thickens. It appears that the second regular
expression only fails in Perl (5.8.5) under Cygwin. It works when I run
it under the native Windows Command Prompt on my XP system (same perl
install), and also when I try it under OpenBSD (5.8.6). It would seem
this is a platform-related issue rather than a regexp one...
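One possible explanation, offered as an assumption rather than a diagnosis: if the file has Windows CRLF line endings and a given Perl build reads it without translating them, every "\n" is preceded by a "\r". Replacing only the "\n" then leaves a bare carriage return in the middle of the joined line, and many terminals display that by jumping back to column 0 and overwriting what was already printed, which can look like the first line was lopped off. A minimal sketch of normalizing line endings before any newline-based matching (the sample string is made up for illustration):

```perl
use strict;
use warnings;

# Sample input with CRLF line endings (an assumption, for illustration).
my $html = qq{<p><span lang=JA style='font-family:\r\n"MS Mincho"'>(</span>\r\n};

# Turn CRLF (or a lone CR) into a plain LF so later patterns only see "\n".
(my $normalized = $html) =~ s/\r\n?/\n/g;

# Report whether any carriage returns survived the normalization.
print "carriage returns left: ", ($normalized =~ /\r/ ? "yes" : "no"), "\n";
```

Whether this is what actually happened under Cygwin cannot be confirmed from the thread; checking the input for "\r" would settle it.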
| |
| William James 2005-08-26, 9:56 pm |
| weston wrote:
> I'm trying to streamline workflow from Word Documents to HTML. There
> are numerous atrocities perpetrated in the process of saving a Word Doc
> to filtered HTML, but there's one that I find particularly interesting
> (and annoying): sometimes tags have newlines within them. Especially
> <span> tags. For example:
>
> <p><span lang=JA style='font-family:
> "MS Mincho"'>(</span>
>
> Is there a regular expression that can pull the span up onto the same
> line?
Look-ahead lets you make sure the newline is within a tag.
s/\n(?=[^<]*>)/ /g;
transforms
<p>
<span lang=JA
style='font:
"Mincho"'>ÿ
</span>
into
<p>
<span lang=JA style='font: "Mincho"'>ÿ
</span>
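The look-ahead approach can be tried out with a short self-contained script. This is a sketch under assumptions: the sample text uses the word "text" in place of the character shown in the post, and the variable names are illustrative. The look-ahead `(?=[^<]*>)` succeeds only when a `>` follows the newline before any `<` does, which is what places the newline inside a tag:

```perl
use strict;
use warnings;

# Sample input modelled on the example in the post above.
my $html = <<'HTML';
<p>
<span lang=JA
style='font:
"Mincho"'>text
</span>
HTML

# Replace a newline with a space only if a '>' follows before any '<'.
(my $joined = $html) =~ s/\n(?=[^<]*>)/ /g;

print $joined;
```

One caveat: a literal `>` in ordinary text (for example "a\n> b" with no `<` in between) would satisfy the same look-ahead, so the pattern relies on the input escaping `>` outside of tags.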
| |
| A. Sinan Unur 2005-08-27, 3:56 am |
| "weston" <notsew-reversePreceedingAndRemoveThis@canncentral.org> wrote
in news:1125103491.246595.109070@g44g2000cwa.googlegroups.com:
[ Please do not omit attributions when quoting ]
>
....
> And the plot, as they say, thickens. It appears that the second regular
> expression only fails in Perl (5.8.5) under Cygwin.
[ Please quote an appropriate amount of context when replying.
There are no regular expressions in your post, so looking for
the second one is a futile exercise. ]
Sinan