Best way to parse a url for validity?
| Rick Stem 2007-04-26, 6:59 pm |
I have checkURL(http://globalwarmingawareness2007.org.uk,
globalwarmingawareness2007.org.uk)
I see almost everyone using regular expressions, but I don't completely
trust them. I don't know if this code is the best way to check whether a user
entered a valid URL and to avoid SQL injection from the URL.
function checkURL($url, $name)
{
    global $incorrect_input;

    // Prepend a scheme only when one is missing; otherwise an input that
    // already starts with http:// would be parsed as "http://http://..."
    if (!preg_match('/^[a-z][a-z0-9+.\-]*:\/\//i', $url))
        $data = parse_url("http://".$url);
    else
        $data = parse_url($url);
    if (!$data || !isset($data['host']))
        die($incorrect_input[1].$name);

    // parse_url() leaves out components that are absent, so default them
    $host     = $data['host'];
    $path     = isset($data['path'])     ? $data['path']     : '';
    $query    = isset($data['query'])    ? $data['query']    : '';
    $fragment = isset($data['fragment']) ? $data['fragment'] : '';

    // host must start with a letter or number
    if (!preg_match('/^[A-Za-z0-9]/', $host))
        die($incorrect_input[1].$name);

    // host must contain a .
    if (!preg_match('/([A-Za-z0-9]+\.)+/', $host))
        die($incorrect_input[1].$name);

    // host must not end with a .
    if (preg_match('/\.$/', $host))
        die($incorrect_input[1].$name);

    // each dot-separated label may contain only letters, digits, - and _
    // (explode() replaces split(), which was removed in PHP 7)
    foreach (explode('.', $host) as $label)
    {
        if (preg_match('/[^A-Za-z0-9\-\_]+/', $label))
            die($incorrect_input[1].$name);
    }

    // path: only allow letters, digits and - . _ /
    $len = strlen($path);
    for ($i = 0; $i < $len; $i++)
    {
        $ascii = ord($path[$i]);
        $alnum = ($ascii >= 65 && $ascii <= 90) ||
                 ($ascii >= 48 && $ascii <= 57) ||
                 ($ascii >= 97 && $ascii <= 122);
        if (!$alnum && $ascii != 45 && $ascii != 46 && $ascii != 95 && $ascii != 47)
            die($incorrect_input[1].$name);
    }

    // do not allow more than one consecutive slash in the path
    if (preg_match('/[\/]{2,}/', $path))
        die($incorrect_input[1].$name);

    if ($query !== '')
    {
        // query: only letters, digits and / - _ = &
        if (preg_match('/[^A-Za-z0-9\/\-\_\=\&]+/', $query))
            die($incorrect_input[1].$name);
        // no runs of = or &
        if (preg_match('/[\=\&]{2,}/', $query))
            die($incorrect_input[1].$name);
    }

    if ($fragment !== '')
    {
        // fragment: only letters, digits and - _ .
        if (preg_match('/[^A-Za-z0-9\-\_\.]+/', $fragment))
            die($incorrect_input[1].$name);
    }

    return($url);
}
| |
| shimmyshack 2007-04-26, 9:58 pm |
On Apr 26, 11:52 pm, Rick Stem <ricks...@yahoo.com> wrote:
> I have checkURL(http://globalwarmingawareness2007.org.uk,
> globalwarmingawareness2007.org.uk)
>
> I see almost everyone using regular expressions. But I don't completely
> trust them. Don't know if this code is the best way to find if a user
> entered a valid URL and to avoid SQL injection from the URL.
It isn't the best way, no. The above code restricts the URL to a small
subset of valid URLs, and doesn't prevent SQL injection, which can occur
inside a POST payload as well as GET.
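To illustrate the point that injection is a database-layer problem whatever route the input takes, here is a minimal sketch using a PDO prepared statement. The table name links, the helper saveLink() and the in-memory SQLite database are assumptions made only so the example is self-contained; they are not from the original post.

```php
<?php
// Hypothetical schema; in-memory SQLite (needs the pdo_sqlite extension)
// keeps the sketch self-contained.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE links (id INTEGER PRIMARY KEY, url TEXT)');

// Hypothetical helper: the value is bound as a parameter, never
// concatenated into the SQL text, so quotes in it cannot change the query.
function saveLink(PDO $pdo, $url)
{
    $stmt = $pdo->prepare('INSERT INTO links (url) VALUES (:url)');
    $stmt->execute(array(':url' => $url));
}

// The same call works whether the value came from $_GET or $_POST.
$hostile = "http://example.com/?q='; DROP TABLE links; --";
saveLink($pdo, $hostile);

$count  = (int) $pdo->query('SELECT COUNT(*) FROM links')->fetchColumn();
$stored = $pdo->query('SELECT url FROM links')->fetchColumn();
```

With the value bound this way, the hostile string is stored as plain data, so the URL check no longer has to double as an injection defence.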
Architecturally it isn't the right way to think about the problem
either, IMHO. It's the easy answer (restrict, restrict, restrict), but it's
no substitute for allowing all the valid URLs, even ones with
injection attempts, and then filtering the input/output of your scripts.
This kind of approach can have validity, though; have you tried using
mod_security?
Doing it within PHP means you will be restricting yourself from application
adjustments, rewrites, and non-ASCII language implementation. Besides all
this, the approach above doesn't lend itself to easy adjustment,
whereas a simple block of more readable regexps would do, once you've
made the leap of faith (shown by others to be a worthwhile leap) into
the world of regexps, which you can indeed trust despite their
complexity.
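As a rough idea of what "a simple block of more readable regexps" might look like, here is one commented pattern using PCRE's extended (x) mode. The function name looksLikeHttpUrl() is hypothetical, and the pattern is a deliberately simple assumption: it accepts only ASCII http(s) URLs and is not a full RFC 3986 validator.

```php
<?php
// Hypothetical helper: checks only the *shape* of an http(s) URL.
// It says nothing about whether the host or page actually exists.
function looksLikeHttpUrl($url)
{
    $pattern = '~^
        https?://                               # scheme
        (?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+  # one or more dot-terminated labels
        [a-z]{2,}                               # top-level domain
        (?::\d{1,5})?                           # optional port
        (?:/[^\s?\#]*)?                         # optional path
        (?:\?[^\s\#]*)?                         # optional query
        (?:\#\S*)?                              # optional fragment
    $~ixD';
    return preg_match($pattern, $url) === 1;
}
```

Each component sits on its own line, so loosening or tightening one rule means editing one line. Internationalised (non-ASCII) host names are rejected by this sketch and would need separate handling.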
| |
"Rick Stem" <rickstem@yahoo.com> wrote in message
news:f0rag812fcb@news4.newsguy.com...
|I have checkURL(http://globalwarmingawareness2007.org.uk,
| globalwarmingawareness2007.org.uk)
|
| I see almost everyone using regular expressions. But I don't completely
| trust them. Don't know if this code is the best way to find if a user
| entered a valid URL and to avoid SQL injection from the URL.
JESUS CHRIST!!!
'don't trust them'? you mean 'i couldn't write one if it meant i'd get
laid'.
i don't 'trust' the code you've just written! have you completely overlooked
the fact that PHP has built-in functions that break out a URL into the
pieces you're looking for? do you not know that even if it 'looks' valid, it
may point to nowhere?
'don't trust them'...i'm still laughing.
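The "may point to nowhere" remark can be sketched as a check that the host at least resolves in DNS. The function hostResolves() and its $resolver parameter are assumptions introduced here (the parameter exists only so the lookup can be swapped out when testing); by default it uses PHP's checkdnsrr().

```php
<?php
// Hypothetical helper: pulls the host out with parse_url() and asks
// whether it resolves. $resolver is an assumed hook for testing.
function hostResolves($url, $resolver = null)
{
    // With a component constant, parse_url() returns null when the
    // component is absent and false when the URL is seriously malformed.
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        return false;
    }
    if ($resolver === null) {
        // Default: a real DNS lookup for an A or AAAA record.
        $resolver = function ($h) {
            return checkdnsrr($h, 'A') || checkdnsrr($h, 'AAAA');
        };
    }
    return (bool) call_user_func($resolver, $host);
}

// Usage with a stand-in resolver, so no network access is needed:
$fake = function ($h) { return $h === 'example.com'; };
$ok   = hostResolves('http://example.com/path', $fake);
$dead = hostResolves('http://nowhere.invalid/', $fake);
```

A successful lookup shows only that the name exists, not that a page is served there; confirming that would take an actual HTTP request.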