#36112 [Opn->Csd]: preg_replace example suggests poor patterns, which are harmful
Author: colder@php.net

2006-06-17, 8:07 am

ID: 36112
Updated by: colder@php.net
Reported By: pornel at despammed dot com
-Status: Open
+Status: Closed
Bug Type: Documentation problem
PHP Version: Irrelevant
Assigned To: colder
New Comment:

This bug has been fixed in the documentation's XML sources. Since the
online and downloadable versions of the documentation need some time
to get updated, we would like to ask you to be a bit patient.

Thank you for the report, and for helping us make our documentation
better.

I simply removed the example for now.


Previous Comments:
------------------------------------------------------------------------

[2006-03-12 17:06:18] colder@php.net

There are a lot of inconsistencies in this example:

1) About @<script[^>]*?>.*?</script>@si :
a) the first ? is useless.

2) About @<[\/\!]*?[^<>]*?>@si :
a) / and ! don't have to be escaped.
b) [\/\!]*? is useless, as it's already matched by [^<>]*?.
c) the ? of [^<>]*? is useless.
d) the PCRE_DOTALL modifier is useless, there is no dot.
e) the PCRE_CASELESS modifier is useless.
f) what is the point of excluding "<" inside a tag?

3) About @([\r\n])[\s]+@ :
a) no need to put \s in a char class.
b) every \r\n will be changed to \r, as \s matches \n.
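
A few of the points above can be checked directly. This is only a sketch: the sample strings are invented, the simplified patterns are one possible reading of points 1 and 2, and the '\1' replacement for the third pattern is an assumption, since the thread does not quote the replacement array.

```php
<?php
// Point 1a: dropping the first "?" does not change the result,
// because [^>]* can never run past the closing ">".
$js = '<script type="text/javascript">alert(1)</script>ok';
var_dump(
    preg_replace('@<script[^>]*?>.*?</script>@si', '', $js)
    === preg_replace('@<script[^>]*>.*?</script>@si', '', $js)
); // bool(true)

// Points 2a-2e: the tag pattern reduces to "<", anything but "<" or ">", ">".
$original   = '@<[\/\!]*?[^<>]*?>@si';
$simplified = '@<[^<>]*>@';
$html = '<p>Hello <!-- note --> <B>world</B><br/></p>';
var_dump(
    preg_replace($original, '', $html)
    === preg_replace($simplified, '', $html)
); // bool(true)

// Point 3b: ([\r\n]) captures the "\r", then [\s]+ swallows the "\n"
// and the indentation, so a Windows line ending collapses to a bare "\r".
// The '\1' replacement is assumed here.
$text = "line one\r\n   line two";
$out  = preg_replace('@([\r\n])[\s]+@', '\1', $text);
var_dump($out === "line one\rline two"); // bool(true)
```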

I think the whole example has to be reconsidered, because there are
already functions to do some of the job, like strip_tags() and
html_entity_decode().
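
As a rough illustration of that suggestion, the two built-in functions can be combined as below. The input string is made up, and the flags and charset passed to html_entity_decode() are just one reasonable choice, not something this thread prescribes.

```php
<?php
// Hypothetical input, for illustration only.
$html = '<p>Fish &amp; chips &lt;3</p>';

// strip_tags() removes the markup; html_entity_decode() then turns
// the remaining entities back into plain characters.
$text = html_entity_decode(strip_tags($html), ENT_QUOTES, 'UTF-8');

var_dump($text === 'Fish & chips <3'); // bool(true)
```

Note that strip_tags() removes the tags themselves but keeps the text between them, and the decoded result is plain text: it has to be escaped again before being placed back into an HTML page.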

------------------------------------------------------------------------

[2006-01-20 23:54:03] pornel at despammed dot com

Description:
------------
The code on http://uk.php.net/preg_replace:

$search = array ('@<script[^>]*?>.*?</script>@si', // Strip out javascript
                 '@<[\/\!]*?[^<>]*?>@si',          // Strip out HTML tags

doesn't work as advertised. For example it will leave
contents of:
<script>xxx</script >
and worse, it will output valid script tags if given:
<<>script>evil<<>/script>

If these patterns were used on a website (for stripping
markup from users' comments, for example), they would allow
an XSS attack.
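
Both failures can be reproduced with a short script. This is a sketch that uses only the two patterns quoted above and replaces every match with an empty string; the manual's full search and replacement arrays are not reproduced in this report, so that part is an assumption.

```php
<?php
// The two patterns quoted in this report.
$search = array(
    '@<script[^>]*?>.*?</script>@si', // Strip out javascript
    '@<[\/\!]*?[^<>]*?>@si',          // Strip out HTML tags
);

// Case 1: the space in "</script >" defeats the first pattern, so only
// the tags are removed and the script body is left behind.
$spaced = preg_replace($search, '', '<script>xxx</script >');
var_dump($spaced); // string(3) "xxx"

// Case 2: only the inner "<>" pairs match the tag pattern; removing
// them assembles a working script element out of the leftovers.
$nested = preg_replace($search, '', '<<>script>evil<<>/script>');
var_dump($nested); // string(21) "<script>evil</script>"
```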


Since it's nearly impossible to parse HTML properly with
regular expressions, I suggest:
* renaming the example from 'Convert HTML to text' to 'Remove
HTML markup'
* adding a replacement of '<' with '&lt;'
* suggesting the use of more robust methods, like strip_tags,
nl2br, htmlspecialchars or the DOM interface.
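
For the user-comment case, one way to follow these suggestions is to escape the text rather than try to strip markup from it. The sketch below assumes the comment should be displayed as plain text with line breaks preserved; the variable names and the sample comment are hypothetical.

```php
<?php
// Hypothetical user comment containing the bypass string from above.
$comment = "<<>script>evil<<>/script>\nsecond line";

// Escape every "<", ">", "&" and quote, then turn newlines into <br />.
// Nothing is removed, so there is nothing for an attacker to reassemble.
$safe = nl2br(htmlspecialchars($comment, ENT_QUOTES, 'UTF-8'));

echo $safe, "\n";
```

After escaping, the only markup left in the output is the <br /> inserted by nl2br(); the attempted script tag is shown to the reader as literal text.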




------------------------------------------------------------------------

