Home > Archive > PERL CGI Beginners > September 2005 > deleting scripts

Author deleting scripts
Adriano Allora

2005-09-18, 6:55 pm

Hi all,

I need to delete all the scripts in a webpage read by LWP::get, but:

1) I cannot install modules on server;

2) this regexp doesn't work: s/<script
(?:type)|(?:language).+<\/script>//gisx;

Can someone help me?

Thanks!

alladr

|^|_|^|_|^| |^|_|^|_|^|
| | | |
| | | |
| |*\_/*\_/*\_/*\_/*\_/* | |
| |
| |
| |
| http://www.e-allora.net |
| |
| |
**************************************

wisefamily@integrity.com

2005-09-19, 9:55 pm

> 2) this regexp doesn't work: s/<script
> (?:type)|(?:language).+<\/script>//gisx;


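I think that should be s/<script (?:type|language).+<\/script>//gis.
Otherwise, it will be interpreted as "match '<script (?:type)' or match
'(?:language).+<\/script>'", because alternation applies to everything
separated by "|"s within the same enclosing group. (I also dropped the
x flag: with it, whitespace in the pattern is ignored, so the space
after "<script" would never match.)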
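As a minimal sketch of the grouping difference, the following runs both forms on a made-up two-line string (the string and variable names are invented for illustration). The x flag is left off here on the assumption that the space after "<script" is meant to match literally.

```perl
use strict;
use warnings;

# Hypothetical test input: a paragraph that happens to contain the word
# "language", followed by one script element.
my $html = qq{<p>The language attribute</p>\n}
         . qq{<script type="text/javascript">alert(1);</script>\n};

# Ungrouped: "|" splits the whole pattern into two alternatives,
#   <script (?:type)     or     (?:language).+<\/script>
# so the second alternative can start at any "language" in the text.
(my $ungrouped = $html) =~ s/<script (?:type)|(?:language).+<\/script>//gis;

# Grouped: "|" only chooses between "type" and "language".
(my $grouped = $html) =~ s/<script (?:type|language).+<\/script>//gis;

print "ungrouped: [$ungrouped]\n";
print "grouped:   [$grouped]\n";
```

With the ungrouped pattern, the match starts at the word "language" inside the paragraph and runs to the closing script tag, so part of the ordinary text is lost as well.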

However, you have the . metacharacter followed by the + quantifier,
representing one or more of any character (including line breaks,
because of the s flag). Quantifiers are, by default, "greedy". That
means the pattern matches from the first opening script tag to the
last closing script tag, and erases everything between them. So, this
file...

<html>
<head>
<title>Example File with script</title>
<script language="JavaScript" type = "text/javascript"><!--
document.write("Example script 1");
--></script>
</head>
<body>
<p>Example Text</p>
<script language = "JavaScript" type = "text/javascript"><!--
document.write("Example script 2");
--></script>
<p>More example Text</p>
</body>
</html>

....would be replaced with this...

<html>
<head>
<title>Example File with script</title>

<p>More example Text</p>
</body>
</html>

....which is probably not what you wanted! :-)

To avoid this problem, you should use this regular expression:
s/<script (?:type|language).+?<\/script>//gis.
The "?" after the quantifier makes it non-greedy. This means that it
will try to match the shortest string possible, instead of the longest
string possible. So this expression will replace the example file with
this:

<html>
<head>
<title>Example File with script</title>

</head>
<body>
<p>Example Text</p>

<p>More example Text</p>
</body>
</html>
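The greedy and non-greedy behaviour can be checked side by side with a short script. This is only a sketch: the example file is pasted in as a string, the variable names are invented, and the x flag is omitted on the assumption that the space after "<script" should match literally.

```perl
use strict;
use warnings;

# The example file from this post, as a single string.
my $html = <<'END_HTML';
<html>
<head>
<title>Example File with script</title>
<script language="JavaScript" type = "text/javascript"><!--
document.write("Example script 1");
--></script>
</head>
<body>
<p>Example Text</p>
<script language = "JavaScript" type = "text/javascript"><!--
document.write("Example script 2");
--></script>
<p>More example Text</p>
</body>
</html>
END_HTML

# Greedy: .+ runs to the LAST </script>, taking <p>Example Text</p> with it.
(my $greedy = $html) =~ s/<script (?:type|language).+<\/script>//gis;

# Non-greedy: .+? stops at the FIRST </script> after each opening tag.
(my $lazy = $html) =~ s/<script (?:type|language).+?<\/script>//gis;

print "greedy keeps 'Example Text'?     ",
    ($greedy =~ /Example Text/ ? "yes" : "no"), "\n";   # no
print "non-greedy keeps 'Example Text'? ",
    ($lazy =~ /Example Text/ ? "yes" : "no"), "\n";     # yes
```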

However, this approach still has a flaw: if script tags appear inside
quotation marks in your HTML, or in any other non-parsed place, they,
too, will get replaced. For example, with a pattern that does not
require an attribute, such as s/<script.+?<\/script>//gis, the
following (very unlikely) tag...

<img src = "/images/example.gif" alt = "The <script> tag starts a
script and the </script> tag ends one" />

.... would get replaced with...

<img src = "/images/example.gif" alt = "The  tag ends one" />

or worse, these tags...

<img src = "/images/example1.gif" alt = "The <script> tag starts a
script" />
<p>Example text</p>
<img src = "/images/example2.gif" alt = "The </script> tag ends a
script" />

....would get replaced by this:

<img src = "/images/example1.gif" alt = "The  tag ends a
script" />

But, again, this is very unlikely, and will probably not be a problem.
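The two-image case can be reproduced with a few lines. This is a sketch using the attribute-less pattern s/<script.+?<\/script>//gis; the input is a one-line-per-tag version of the example, and the variable names are made up.

```perl
use strict;
use warnings;

# Two img tags whose alt text mentions script tags, with a paragraph between.
my $html = qq{<img src="/images/example1.gif" alt="The <script> tag starts a script" />\n}
         . qq{<p>Example text</p>\n}
         . qq{<img src="/images/example2.gif" alt="The </script> tag ends a script" />\n};

# The regex knows nothing about quoting, so it treats the text inside the
# alt attributes as a real opening and closing script tag.
(my $stripped = $html) =~ s/<script.+?<\/script>//gis;

print $stripped;
```

Everything from the "<script>" in the first alt text to the "</script>" in the second is removed, including the paragraph in between.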

It is possible to avoid this unlikely problem by using a very complex
regular expression with beginning and ending anchors, and accounting
for both single and double quotation marks, as well as escaped
quotation marks, but then you still have to account for "<![CDATA[" and
"]]>" tags, ignore <![CDATA[ enclosed in quotation marks, and remember
to ignore quotation marks inside html tags. The list goes on!

The best way would be to use a module, like HTML::Parser, but that
brings us back to the problem of not being able to install modules on
your server.

So assuming you don't have "<script>" or "</script>" anywhere else in
the html files besides an actual script, the previous regular
expression would probably work.

One more point: in the regular expression s/<script
(?:type|language).+?<\/script>//gis, the "(?:type|language)" part will
make sure that the script tag has a type or language attribute right
after "<script". However, some scripts are written with no attributes
at all:

<script><!--
// Script here...
--></script>

The previous regular expression would not remove that script. If
you do want to make sure that you remove only scripts with a type or
language attribute, though, I would use "\s+" instead of a space, so
that a tab, a line break, or several spaces after "<script" also match:
s/<script\s+(?:type|language).+?<\/script>//gisx.
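A small check of how a literal space and "\s+" behave, with and without the x flag. The two sample tags and the variable names are invented for this sketch.

```perl
use strict;
use warnings;

my $one_line  = '<script type="text/javascript">x();</script>';
my $two_lines = qq{<script\n  type="text/javascript">x();</script>};

# Under /x the literal space is ignored, so this is really
# /<script(?:type|language)/ and cannot match a normal tag.
my $space_with_x = $one_line =~ /<script (?:type|language)/ix ? 1 : 0;

# Without /x the space matches, but only a single space character.
my $space_plain    = $one_line  =~ /<script (?:type|language)/i ? 1 : 0;
my $space_plain_nl = $two_lines =~ /<script (?:type|language)/i ? 1 : 0;

# \s+ matches spaces, tabs and line breaks, with or without /x.
my $ws_one = $one_line  =~ /<script\s+(?:type|language)/ix ? 1 : 0;
my $ws_two = $two_lines =~ /<script\s+(?:type|language)/ix ? 1 : 0;

print "literal space, /x, one line:     $space_with_x\n";    # 0
print "literal space, no /x, one line:  $space_plain\n";     # 1
print "literal space, no /x, two lines: $space_plain_nl\n";  # 0
print "\\s+, /x, one line:               $ws_one\n";          # 1
print "\\s+, /x, two lines:              $ws_two\n";          # 1
```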

So, the regular expression I would use is:

s/<script.+?<\/script>//gisx

or if you want to make sure that the script has a type or language
attribute:

s/<script\s+(?:type|language).+?<\/script>//gisx
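Put together, the two substitutions could be wrapped in a small helper like the one below. The sub name strip_scripts, its second argument, and the sample page are all hypothetical; in a real script the string would be the page content fetched with LWP.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): removes script blocks from an
# HTML string and returns the new string plus the number of blocks removed.
# With a true second argument, only scripts whose tag begins with a type or
# language attribute are removed.
sub strip_scripts {
    my ($html, $need_attr) = @_;
    # s///g returns the number of substitutions, or false if there were none.
    my $count = $need_attr
        ? ($html =~ s/<script\s+(?:type|language).+?<\/script>//gis)
        : ($html =~ s/<script.+?<\/script>//gis);
    return ($html, $count || 0);
}

# Made-up page: one bare script and one script with a type attribute.
my $page = qq{<p>a</p><script><!--\n// bare\n--></script>}
         . qq{<p>b</p><script type="text/javascript">x();</script><p>c</p>};

my ($all,  $n_all)  = strip_scripts($page);     # every script block
my ($some, $n_some) = strip_scripts($page, 1);  # only type/language scripts

print "all:  removed $n_all\n$all\n";
print "some: removed $n_some\n$some\n";
```

Note that the attribute version only matches when type or language is the first attribute after "<script", so a tag such as <script src="x.js" type="text/javascript"> would be left in place by it.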

Hope this helps,
David


Copyright 2008 codecomments.com