Looking for Regexp that strips newlines inside of a tag
weston

2005-08-26, 6:58 pm

I'm trying to streamline workflow from Word Documents to HTML. There
are numerous atrocities perpetrated in the process of saving a Word Doc
to filtered HTML, but there's one that I find particularly interesting
(and annoying): sometimes tags have newlines within them. Especially
<span> tags. For example:

<p><span lang=JA style='font-family:
"MS Mincho"'>(</span>

Is there a regular expression that can pull the span up onto the same
line?

So far, I've tried slurping the whole file into a single string, and
doing:

s/(<span.*?)^+([^>]*>)/$1 $2/mig;

which seems to have no effect, and this:

s/(<span.*?)\n+([^>]*>)/$1 $3/mig;

which seems to lop off everything from the first line.

It seems likely there's a way to do this, but I'm sort of stuck on what
to try next. Any ideas?

A. Sinan Unur

2005-08-26, 6:58 pm

"weston" <notsew- reversePreceedingAndRemoveThis@canncentral.org> wrote
in news:1125094812.057093.89070@g49g2000cwa.googlegroups.com:

> I'm trying to streamline workflow from Word Documents to HTML. There
> are numerous atrocities perpetrated in the process of saving a Word
> Doc to filtered HTML, but there's one that I find particularly
> interesting (and annoying): sometimes tags have newlines within them.
> Especially <span> tags. For example:
>
> <p><span lang=JA style='font-family:
> "MS Mincho"'>(</span>
>
> Is there a regular expression that can pull the span up onto the same
> line?


You should use an HTML parser to parse HTML. See

perldoc -q html

With HTML::TokeParser::Simple, this can be achieved simply using code
similar to the following (untested):

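#!/usr/bin/perl

use strict;
use warnings;

use HTML::TokeParser::Simple;

# Sample input: a start tag with a newline inside an attribute value.
my $html = <<HTML;
<p><span lang=JA style='font-family:
"MS Mincho"'>(</span></p>
HTML

my $p = HTML::TokeParser::Simple->new(\$html);

while ( my $token = $p->get_token ) {
    if ( $token->is_start_tag ) {
        # Collapse each run of whitespace (newlines included)
        # inside an attribute value into a single space.
        my $attrs = $token->get_attr;
        for my $attr ( keys %{ $attrs } ) {
            $attrs->{$attr} =~ s/\s+/ /sg;
            $token->set_attr( $attr, $attrs->{$attr} );
        }
    }
    # Print every token, modified or not, on its own line.
    print $token->as_is . "\n";
}


__END__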

Sinan

--
A. Sinan Unur <1usa@llenroc.ude.invalid>
(reverse each component and remove .invalid for email address)

comp.lang.perl.misc guidelines on the WWW:
http://mail.augustmail.com/~tadmc/c...guidelines.html
weston

2005-08-26, 9:56 pm

> You should use an HTML parser to parse HTML.

You're quite correct, and I appreciate your help in pushing me this
way. I've been avoiding the actual parsers because regexps are what I'm
familiar with, and perhaps this is a good time to change.

However, as much as I'm interested in solving the problem at hand, I'm
also very curious about the potential gaps in my regexp knowledge.

And the plot, as they say, thickens. It appears that the second regular
expression only fails in Perl (5.8.5) under Cygwin. It works when I run
it under the native Windows Command Prompt on my XP system (same perl
install), and also when I try it under OpenBSD (5.8.6). It would seem
this is a platform-related issue rather than a regexp one...
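
One possible cause of a platform-dependent result like this, which nothing
in the thread confirms, is line endings. If the file has Windows CRLF line
endings and the Perl in use does not translate them on input, \n+ removes
only the line feed and leaves the carriage return in the middle of the tag.
Printed to a terminal, that stray \r moves the cursor back to column one,
so the rest of the tag overwrites the start of the line. A minimal sketch,
assuming that is what is happening; the sample string and the \r?\n pattern
are illustrative and not taken from the original posts:

```perl
use strict;
use warnings;

# Hypothetical input: the tag from the original post, with CRLF line endings.
my $html = qq{<p><span lang=JA style='font-family:\r\n"MS Mincho"'>(</span>\r\n};

# Matching only \n leaves the carriage return in the middle of the tag.
(my $lf_only = $html) =~ s/(<span[^>]*?)\n+([^>]*>)/$1 $2/g;

# Matching an optional \r before each \n removes the whole line ending.
(my $crlf_aware = $html) =~ s/(<span[^>]*?)(?:\r?\n)+([^>]*>)/$1 $2/g;

print "stray CR left behind\n" if $lf_only    =~ /font-family:\r/;
print "tag joined cleanly\n"   if $crlf_aware =~ /font-family: "MS Mincho"/;
```

If this is the cause, stripping carriage returns from the slurped string
before running the substitution would be another way around it.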

William James

2005-08-26, 9:56 pm

weston wrote:
> I'm trying to streamline workflow from Word Documents to HTML. There
> are numerous atrocities perpetrated in the process of saving a Word Doc
> to filtered HTML, but there's one that I find particularly interesting
> (and annoying): sometimes tags have newlines within them. Especially
> <span> tags. For example:
>
> <p><span lang=JA style='font-family:
> "MS Mincho"'>(</span>
>
> Is there a regular expression that can pull the span up onto the same
> line?


Look-ahead lets you make sure the newline is within a tag.

s/\n(?=[^<]*>)/ /g;

transforms

<p>
<span lang=JA
style='font:
"Mincho"'>ÿ
</span>

into

<p>
<span lang=JA style='font: "Mincho"'>ÿ
</span>
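
The substitution above can be tried on its own. A minimal sketch, assuming
the whole document has already been read into a single string; the sample
input reuses the span from the original question, and $html is just an
illustrative name. The look-ahead only checks that a ">" turns up before
the next "<", so this is a heuristic: a newline in ordinary text that
happens to be followed by a literal ">" would be replaced as well.

```perl
use strict;
use warnings;

# Sample input: the multi-line span from the original question.
my $html = <<'HTML';
<p><span lang=JA style='font-family:
"MS Mincho"'>(</span>
</p>
HTML

# Replace a newline with a space only when a ">" appears before the
# next "<", i.e. when the newline sits inside an open tag.
$html =~ s/\n(?=[^<]*>)/ /g;

print $html;
```

For this input the span ends up on one line, while the newline after
</span> is left alone because the next character is a "<".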

A. Sinan Unur

2005-08-27, 3:56 am

"weston" <notsew- reversePreceedingAndRemoveThis@canncentral.org> wrote
in news:1125103491.246595.109070@g44g2000cwa.googlegroups.com:

[ Please do not omit attributions when quoting ]

>

....

> And plot, as they say, thickens. It appears that the second regular
> expression only fails in Perl (5.8.5) under Cygwin.


[ Please quote an appropriate amount of context when replying.
There are no regular expressions in your post, so looking for
the second one is a futile exercise. ]

Sinan
