Author Regular expression does not return all matches?
jcsnippets.atspace.com

2006-09-17, 8:09 am

Hi everybody,

I am trying to read the HTML source of a web page and find all the
thumbnails and linked addresses within.

The code below works, but with one strange exception: if I test it on an
HTML page, I get all the links except for one.

So to test and debug it, I took out that html line, and stuck it in a
test example. Whatever I do, I only get the last occurrence. Even though
the first two match the expression, they are not returned.

I always seem to get only the last one - if I delete it, I get the last
one from the newly formed string.

The regular expression tests for the following:
<a href="(1)"><img src="(2)"></a>
It takes into account the fact that there may be other stuff within (like
border, height, width, ...). (1) and (2) are the addresses I should
receive.

Here is the code:
--- code ---
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexTest
{

    public RegexTest()
    {
        String html =
            "<a href=\"http://www.domain.com/\"><img src=\"domain.gif\" alt=\"Domain\"></a>" +
            "</td><td align=center valign=middle width=\"33%\">" +
            "<a href=\"http://www.domain.org/\"><img border=0 width=130 height=70 src=\"domain.jpg\" " +
            "alt=\"Domain\"></a></td></tr></table></td>" +
            "<a href=\"http://www.domein.be/\"><img border=0 width=130 height=70 src=\"domein.gif\" " +
            "alt=\"Domein\"></a></td></tr></table></td>";
        String expression = "<a.*href=\"([^\"]*)\".*>.*<img.*src=\"([^\"]*)\".*>.*</a>";
        Pattern p = Pattern.compile(expression);
        Matcher m = p.matcher(html);
        while (m.find())
        {
            System.out.println("Found a match!");
            System.out.println(" -> href: " + m.group(1));
            System.out.println(" -> isrc: " + m.group(2));
        }
    }

    public static void main(String[] args)
    {
        new RegexTest();
    }

}
--- /code ---

I do not understand why only the last match is returned - can somebody
please clarify this, or point me in the right direction?

Thanks in advance,

JayCee
--
http://jcsnippets.atspace.com/
a collection of source code, tips and tricks
hiwa

2006-09-17, 7:02 pm

jcsnippets.atspace.com wrote:

> String expression = "<a.*href=\"([^\"]*)\".*>.*<img.*src=\"([^\"]*)\".*>.*</a>";

Your .* is a greedy match that swallows all the characters up to the
last 'href'.
Use a more reasonable regexp string. I would simply use an <a href= pattern.
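
To see the effect in isolation, here is a minimal sketch on a much shorter
string. The class name GreedyDemo, the helper captures() and the <b> sample
input are made up for this illustration; only Pattern and Matcher come from
java.util.regex. Appending ? to a quantifier (.*?) makes it reluctant, so it
stops at the first place where the rest of the pattern can match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyDemo {

    // Hypothetical helper: returns group(1) of every match found in input.
    static List<String> captures(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "<b>one</b><b>two</b>";

        // Greedy: .* runs to the end of the input, then backs off only as far
        // as the last "</b>", so there is a single match covering both tags.
        System.out.println(captures("<b>(.*)</b>", input));

        // Reluctant: .*? stops at the first "</b>", so each tag matches separately.
        System.out.println(captures("<b>(.*?)</b>", input));
    }
}
```

The greedy call prints [one</b><b>two] and the reluctant call prints
[one, two]. The same thing happens with the <a.*href= pattern: one match that
starts at the first <a and captures only the last link.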

hiwa

2006-09-17, 7:02 pm

hiwa wrote:

For example, this one works:
"<a href=\"([^\"]*)\".*?>.*?<img.*?src=\"([^\"]*)\".*?>.*?</a>"
But you could simplify it much further than this.
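
One possible direction, as a rough sketch rather than the definitive answer:
replace each .*? with a negated character class such as [^>]*, which cannot
run past the end of the current tag. The class name LinkExtractor and the
method extract() are invented for this example. The pattern assumes lowercase
tag names, double-quoted attribute values, no '>' inside an attribute, and
nothing but whitespace between the <a> and <img> tags.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkExtractor {

    // [^>]* stays inside one tag and [^"]* stays inside one attribute value,
    // so no quantifier here can reach across to a later link.
    static final Pattern LINK = Pattern.compile(
        "<a\\s[^>]*href=\"([^\"]*)\"[^>]*>\\s*"
        + "<img\\s[^>]*src=\"([^\"]*)\"[^>]*>\\s*</a>");

    // Hypothetical helper: returns one {href, src} pair per linked image.
    static List<String[]> extract(String html) {
        List<String[]> links = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            links.add(new String[] { m.group(1), m.group(2) });
        }
        return links;
    }

    public static void main(String[] args) {
        // Shortened version of the test string from the original post.
        String html =
            "<a href=\"http://www.domain.com/\"><img src=\"domain.gif\" alt=\"Domain\"></a>"
            + "</td><td align=center valign=middle width=\"33%\">"
            + "<a href=\"http://www.domain.org/\"><img border=0 width=130 height=70 "
            + "src=\"domain.jpg\" alt=\"Domain\"></a></td></tr></table></td>";

        for (String[] link : extract(html)) {
            System.out.println(link[0] + " -> " + link[1]);
        }
    }
}
```

With the sample string above this finds both links, each with its own href
and src.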

jcsnippets.atspace.com

2006-09-18, 10:03 pm

"hiwa" <HGA03630@nifty.ne.jp> wrote in
news:1158534301.201363.179490@d34g2000cwd.googlegroups.com:
<snipped>
> \".*>.*</a>";
> Your .* is a greedy match that swallows all the characters up to the
> last 'href'.
> Use a more reasonable regexp string. I would simply use an <a href=
> pattern.


Hi Hiwa,

Now I get it - by the way, thank you for posting your working example in
your other post! Another lesson learned.

Best regards,

JayCee
--
http://jcsnippets.atspace.com/
a collection of source code, tips and tricks