Author regex help
Joel S

2005-08-12, 5:03 pm

i have

<something>blah</something>

where blah and something can be anything. what regex would i use for
this?

Oliver Wong

2005-08-12, 5:03 pm

"Joel S" <jjshoe@gmail.com> wrote in message
news:1123873450.488395.114310@z14g2000cwz.googlegroups.com...
>i have
>
> <something>blah</something>
>
> where blah and something can be anything. what regex would i use for
> this?


1) If "<something>" and "</something>" have to match, then you're
working with a context-free language, which regular expressions can't
handle.

2) If I were you, I'd use an XML parser.

- Oliver


Jeff Schwab

2005-08-12, 10:33 pm

Joel S wrote:
> i have
>
> <something>blah</something>
>
> where blah and something can be anything. what regex would i use for
> this?


Can "anything" include other tags? Assuming you want only leaf
elements, and that for anything more complicated you would use SAX:

import java.io.PrintWriter;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagTeam {

    static PrintWriter out = new PrintWriter(System.out, true);

    public static void main(String[] args) {

        String sampleInput = "<something>blah</something>"
                + "<hello>world</hello>";

        // Group 1 is the tag name, group 2 the text content (no nested
        // tags); \\1 requires the closing tag to repeat the same name.
        Pattern pat = Pattern.compile("<([^>]+)>([^<]*)</\\1>");

        Matcher m = pat.matcher(sampleInput);

        while (m.find()) {
            out.println("Tag: " + m.group(1));
            out.println("Content: " + m.group(2));
            out.println();
        }
    }
}

Joel S

2005-08-12, 10:33 pm

2) You aren't me.

Joel S

2005-08-12, 10:33 pm

SAX was one thing i looked at before heading down the road that i am
going. I however don't need such a HUGE item included in my project for
just a few lines that i need to do this to.

i ended up just doing a split and using <.*?> on it.
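
A minimal sketch of what such a split might look like, assuming the input is a single leaf element with no nested tags. The class name SplitOnTags and the helper textBetweenTags are hypothetical and not taken from the poster's project; only String.split is the standard library call.

```java
import java.util.ArrayList;
import java.util.List;

class SplitOnTags {
    // Hypothetical helper: splits the input on every tag and keeps the
    // non-empty pieces that remain between the tags.
    static List<String> textBetweenTags(String input) {
        List<String> parts = new ArrayList<>();
        // <.*?> is non-greedy, so each tag is matched on its own.
        for (String piece : input.split("<.*?>")) {
            if (!piece.isEmpty()) {
                parts.add(piece);
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        // Prints [blah]
        System.out.println(textBetweenTags("<something>blah</something>"));
    }
}
```

Note that this approach never compares the opening and closing tag names, so an input such as <a>b</c> yields its text content just the same.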

Oliver Wong

2005-08-15, 5:03 pm


"Joel S" <jjshoe@gmail.com> wrote in message
news:1123886513.499776.82420@g49g2000cwa.googlegroups.com...
> SAX was one thing i looked at before heading down the road that i am
> going. I however don't need such a HUGE item included in my project for
> just a few lines that i need to do this to.
>
> i ended up just doing a split and using <.*?> on it.


Wouldn't <.*> suffice? * means 0 or more, so no need to make it
optional. Plus, depending on how you coded it, it may incorrectly accept
the following:

<a>b</c>

- Oliver


Jeff Schwab

2005-08-17, 4:07 am

Oliver Wong wrote:
> "Joel S" <jjshoe@gmail.com> wrote in message
> news:1123886513.499776.82420@g49g2000cwa.googlegroups.com...
>
>
>
> Wouldn't <.*> suffice? * means 0 or more, so no need to make it
> optional.


The ? there doesn't mean "optional." It makes the * non-greedy.

> Plus, depending on how you coded it, it may incorrectly accept
> the following:
>
> <a>b</c>


I think you've got it backwards: <.*> could match all of what you've
written, but <.*?> would match <a> and </c> separately.
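
The difference between the two quantifiers can be checked with a short sketch like the following. The class name GreedyVsLazy and the helper findAll are made up for illustration; Pattern and Matcher are the standard java.util.regex classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsLazy {
    // Collects every substring that find() reports for the given regex.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String input = "<a>b</c>";
        // Greedy: .* runs to the last '>', so there is one match.
        System.out.println("greedy: " + findAll("<.*>", input));
        // Non-greedy: .*? stops at the first '>', so each tag is a match.
        System.out.println("lazy:   " + findAll("<.*?>", input));
    }
}
```

With the input <a>b</c>, the greedy pattern reports the whole string as a single match, while the non-greedy pattern reports <a> and </c> as two matches.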

Oliver Wong

2005-08-17, 9:16 am


"Jeff Schwab" <jeffrey.schwab@rcn.com> wrote in message
news:YLCdnezW8blNMJ_eRVn-gQ@rcn.net...
>
> Oliver Wong wrote:
>
> The ? there doesn't mean "optional." It makes the * non-greedy.


Interesting. I wasn't aware of a "non-greedy" operator. I'll have to
look more into that.

>
> I think you've got it backwards: <.*> could match all of what you've
> written, but <.*?> would match <a> and </c> separately.


This example was a separate issue from greediness versus non-greediness.
Since I'm assuming the OP is working with XML (as (s)he mentions considering
SAX), then the string "<a>b</c>" doesn't match a valid document because the
opening and closing tags don't match.

I was just trying to further explain a point I was making earlier in the
thread about how the XML language is a context-free language, not a regular
language, so it's not possible to express that language using only regular
expressions. At best, you'd need a regular expression processor plus a stack
to keep track of what the last opening tag was.

- Oliver
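
The "regular expression plus a stack" idea can be sketched roughly as follows. This is an illustration under simplifying assumptions, not a real XML parser: it ignores attributes, self-closing tags, comments, and CDATA. The class name TagBalanceCheck and the method isBalanced are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagBalanceCheck {
    // Matches an opening tag like <a> or a closing tag like </a>.
    // Group 1 is "/" for a closing tag, group 2 is the tag name.
    private static final Pattern TAG = Pattern.compile("<(/?)([^<>/]+)>");

    // The regex only finds individual tags; the stack remembers which
    // opening tags are still waiting for their closing tag.
    static boolean isBalanced(String input) {
        Deque<String> open = new ArrayDeque<>();
        Matcher m = TAG.matcher(input);
        while (m.find()) {
            String name = m.group(2);
            if (m.group(1).isEmpty()) {
                open.push(name);                  // opening tag
            } else if (open.isEmpty() || !open.pop().equals(name)) {
                return false;                     // mismatched closing tag
            }
        }
        return open.isEmpty();                    // nothing left unclosed
    }

    public static void main(String[] args) {
        System.out.println(isBalanced("<a>b</a>"));   // true
        System.out.println(isBalanced("<a>b</c>"));   // false
    }
}
```

Here the regular expression does only the tokenizing; the nesting check lives entirely in the stack.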


Jeff Schwab

2005-08-17, 10:02 pm

Oliver Wong wrote:
> "Jeff Schwab" <jeffrey.schwab@rcn.com> wrote in message
> news:YLCdnezW8blNMJ_eRVn-gQ@rcn.net...
>
>
>
> Interesting. I wasn't aware of a "non-greedy" operator. I'll have to
> look more into that.
>
>
>
>
> This example was a separate issue from greediness versus non-greediness.
> Since I'm assuming the OP is working with XML (as (s)he mentions considering
> SAX), then the string "<a>b</c>" doesn't match a valid document because the
> opening and closing tags don't match.
>
> I was just trying to further explain a point I was making earlier in the
> thread about how the XML language is a context-free language, not a regular
> language, so it's not possible to express that language using only regular
> expressions. At best, you'd need a regular expression processor plus a stack
> to keep track of what the last opening tag was.


I suppose if you wanted a single regex to match the entire document and
capture each element into a separate group, you might be hard-pressed.
Matching an individual element is pretty simple though, if your regex
engine supports back-references.

Jon Haugsand

2005-08-18, 6:00 pm

* Jeff Schwab
> I suppose if you wanted a single regex to match the entire document
> and capture each element into a separate group, you might be
> hard-pressed. Matching an individual element is pretty simple though,
> if your regex engine supports back-references.


Like Oliver Wong said, you'll need a stack, not just back references.
Or else, how do you match the outer A element here:


<A>text<B><A>core</A>...</B>...</A>

--
Jon Haugsand
Dept. of Informatics, Univ. of Oslo, Norway, mailto:jonhaug@ifi.uio.no
http://www.ifi.uio.no/~jonhaug/, Phone: +47 22 85 24 92

Jeff Schwab

2005-08-18, 6:00 pm

Jon Haugsand wrote:
> * Jeff Schwab
>
>
>
> Like Oliver Wong said, you'll need a stack, not just back references.
> Or else, how do you match the outer A element here:
>
>
> <A>text<B><A>core</A>...</B>...</A>


The outer A happens to be the easiest of all the above elements to
match: "<([^>]+)>(.*)</\\1>".
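
A small sketch of how that pattern behaves on the nested example, assuming the input holds exactly one outer element and no line breaks (by default . does not match line terminators). The class name OuterElementMatch and the helper firstContent are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OuterElementMatch {
    // Greedy (.*) runs to the last closing tag whose name equals group 1.
    private static final Pattern ELEMENT =
            Pattern.compile("<([^>]+)>(.*)</\\1>");

    // Returns the content captured for the first element found, or null
    // if the input contains no matching element.
    static String firstContent(String input) {
        Matcher m = ELEMENT.matcher(input);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        String input = "<A>text<B><A>core</A>...</B>...</A>";
        // Prints text<B><A>core</A>...</B>...
        System.out.println(firstContent(input));
    }
}
```

The greedy quantifier is what makes this work: it skips past the inner </A> and settles on the last one in the input.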

Oliver Wong

2005-08-18, 6:00 pm


"Jeff Schwab" <jeffrey.schwab@rcn.com> wrote in message
news:s4WdndRAZod24pneRVn-3A@rcn.net...
> Jon Haugsand wrote:
>
>
> The outer A happens to be the easiest of all the above elements to match:
> "<([^>]+)>(.*)</\\1>".


I guess a better example would be:

<A>text<B><A>core</A>...</B>...</A><A>text<B><A>core</A>...</B>...</A>

Here neither a purely greedy nor purely non-greedy matcher would be able
to see that there are two "outer A"s

- Oliver
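
To illustrate the point, here is a sketch that runs both the greedy and the non-greedy variant of the back-reference pattern over the doubled input. The class name TwoOuterElements and the helper findAll are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TwoOuterElements {
    static final String ONE = "<A>text<B><A>core</A>...</B>...</A>";
    static final String INPUT = ONE + ONE;

    // Collects every substring that find() reports for the given regex.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Greedy: a single match that swallows both outer elements.
        System.out.println(findAll("<([^>]+)>(.*)</\\1>", INPUT));
        // Non-greedy: two matches, each cut off at the inner </A>.
        System.out.println(findAll("<([^>]+)>(.*?)</\\1>", INPUT));
    }
}
```

The greedy variant returns one match covering the entire input, and the non-greedy variant returns two matches that each end at the inner </A>, so neither one isolates the two outer A elements.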



Copyright 2008 codecomments.com