regex help
| Joel S 2005-08-12, 5:03 pm |
| i have
<something>blah</something>
where blah and something can be anything. what regex would i use for
this?
| |
| Oliver Wong 2005-08-12, 5:03 pm |
| "Joel S" <jjshoe@gmail.com> wrote in message
news:1123873450.488395.114310@z14g2000cwz.googlegroups.com...
>i have
>
> <something>blah</something>
>
> where blah and something can be anything. what regex would i use for
> this?
1) If "<something>" and "</something>" have to match, then you're
working with a context-free language, which regular expressions can't
handle.
2) If I were you, I'd use an XML parser.
- Oliver
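
A minimal sketch of the XML-parser route suggested above, assuming the input is a single well-formed element and using the DOM parser bundled with the JDK. The class name DomDemo and the helper tagAndText are illustrative and not from the thread.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class DomDemo {
    // Returns {tag name, text content} of the root element.
    // Malformed input such as "<a>b</c>" is rejected by the parser and
    // reported here as an IllegalArgumentException.
    static String[] tagAndText(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            return new String[] { root.getTagName(), root.getTextContent() };
        } catch (Exception e) {
            throw new IllegalArgumentException("Not well-formed XML: " + xml, e);
        }
    }

    public static void main(String[] args) {
        String[] result = tagAndText("<something>blah</something>");
        System.out.println("Tag: " + result[0]);
        System.out.println("Content: " + result[1]);
    }
}
```

The parser checks that opening and closing tags match, which is the part a plain regular expression cannot guarantee in general.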
| |
| Jeff Schwab 2005-08-12, 10:33 pm |
| Joel S wrote:
> i have
>
> <something>blah</something>
>
> where blah and something can be anything. what regex would i use for
> this?
Can "anything" include other tags? Assuming you want only leaf
elements, and that for anything more complicated you would use SAX:
import java.io.PrintWriter;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagTeam {
    static PrintWriter out = new PrintWriter(System.out, true);

    public static void main(String[] args) {
        String sampleInput = "<something>blah</something>"
                + "<hello>world</hello>";
        // Group 1 is the tag name, group 2 the tag-free content; \1 requires
        // the closing tag to repeat the opening tag's name.
        Pattern pat = Pattern.compile("<([^>]+)>([^<]*)</\\1>");
        Matcher m = pat.matcher(sampleInput);
        while (m.find()) {
            out.println("Tag: " + m.group(1));
            out.println("Content: " + m.group(2));
            out.println();
        }
    }
}
| |
| Joel S 2005-08-12, 10:33 pm |
| 2) You aren't me.
| |
| Joel S 2005-08-12, 10:33 pm |
| SAX was one thing i looked at before heading down the road that i am
going. I however don't need such a HUGE item included in my project for
just a few lines that i need to do this to.
i ended up just doing a split and using <.*?> on it.
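
The split approach described above might look roughly like the following sketch. The exact code used is not shown in the thread, so the class name SplitDemo, the helper textBetweenTags, and the filtering of empty pieces are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;

class SplitDemo {
    // Splits on anything that looks like a tag and keeps the non-empty pieces.
    // Note that this never checks that opening and closing tags match.
    static List<String> textBetweenTags(String input) {
        List<String> parts = new ArrayList<>();
        for (String piece : input.split("<.*?>")) {
            if (!piece.isEmpty()) {
                parts.add(piece);
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        // prints [blah, world]
        System.out.println(textBetweenTags(
                "<something>blah</something><hello>world</hello>"));
    }
}
```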
| |
| Oliver Wong 2005-08-15, 5:03 pm |
|
"Joel S" <jjshoe@gmail.com> wrote in message
news:1123886513.499776.82420@g49g2000cwa.googlegroups.com...
> SAX was one thing i looked at before heading down the road that i am
> going. I however don't need such a HUGE item included in my project for
> just a few lines that i need to do this to.
>
> i ended up just doing a split and using <.*?> on it.
Wouldn't <.*> suffice? * means 0 or more, so no need to make it
optional. Plus, depending on how you coded it, it may incorrectly accept
the following:
<a>b</c>
- Oliver
| |
| Jeff Schwab 2005-08-17, 4:07 am |
| Oliver Wong wrote:
> "Joel S" <jjshoe@gmail.com> wrote in message
> news:1123886513.499776.82420@g49g2000cwa.googlegroups.com...
>
>
>
> Wouldn't <.*> suffice? * means 0 or more, so no need to make it
> optional.
The ? there doesn't mean "optional." It makes the * non-greedy.
> Plus, depending on how you coded it, it may incorrectly accept
> the following:
>
> <a>b</c>
I think you've got it backwards: <.*> could match all of what you've
written, but <.*?> would match <a> and </c> separately.
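
A small sketch that makes the greedy versus non-greedy difference visible with java.util.regex. The class name GreedyDemo and the helper allMatches are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyDemo {
    // Collects every successive match of the regex in the input.
    static List<String> allMatches(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "<a>b</c>";
        // Greedy: .* runs to the last '>' it can reach. Prints [<a>b</c>]
        System.out.println(allMatches("<.*>", input));
        // Non-greedy: .*? stops at the first '>'. Prints [<a>, </c>]
        System.out.println(allMatches("<.*?>", input));
    }
}
```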
| |
| Oliver Wong 2005-08-17, 9:16 am |
|
"Jeff Schwab" <jeffrey.schwab@rcn.com> wrote in message
news:YLCdnezW8blNMJ_eRVn-gQ@rcn.net...
>
> Oliver Wong wrote:
>
> The ? there doesn't mean "optional." It makes the * non-greedy.
Interesting. I wasn't aware of a "non-greedy" operator. I'll have to
look more into that.
>
> I think you've got it backwards: <.*> could match all of what you've
> written, but <.*?> would match <a> and </c> separately.
This example was a separate issue from greediness versus non-greediness.
Since I'm assuming the OP is working with XML (as (s)he mentions considering
SAX), then the string "<a>b</c>" doesn't match a valid document because the
opening and closing tags don't match.
I was just trying to further explain a point I was making earlier in the
thread about how the XML language is a context-free language, not a regular
language, so it's not possible to express that language using only regular
expressions. At best, you'd need a regular expression processor plus a stack
to keep track of what the last opening tag was.
- Oliver
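
The "regular expression processor plus a stack" idea can be sketched as follows, assuming simple tags only (no attributes, self-closing tags, or comments). The class name TagStackCheck and the method isBalanced are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagStackCheck {
    // Matches an opening tag like <a> or a closing tag like </a>.
    // Group 1 is "/" for a closing tag, group 2 is the tag name.
    private static final Pattern TAG = Pattern.compile("<(/?)([^<>/]+)>");

    // The regex only finds tags; the stack remembers which tag is open.
    static boolean isBalanced(String input) {
        Deque<String> open = new ArrayDeque<>();
        Matcher m = TAG.matcher(input);
        while (m.find()) {
            String name = m.group(2);
            if (m.group(1).isEmpty()) {
                open.push(name);
            } else if (open.isEmpty() || !open.pop().equals(name)) {
                return false; // closing tag does not match the last opening tag
            }
        }
        return open.isEmpty(); // anything left on the stack was never closed
    }

    public static void main(String[] args) {
        System.out.println(isBalanced("<a>b</a>")); // true
        System.out.println(isBalanced("<a>b</c>")); // false
    }
}
```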
| |
| Jeff Schwab 2005-08-17, 10:02 pm |
| Oliver Wong wrote:
> "Jeff Schwab" <jeffrey.schwab@rcn.com> wrote in message
> news:YLCdnezW8blNMJ_eRVn-gQ@rcn.net...
>
>
>
> Interesting. I wasn't aware of a "non-greedy" operator. I'll have to
> look more into that.
>
>
>
>
> This example was a separate issue from greediness versus non-greediness.
> Since I'm assuming the OP is working with XML (as (s)he mentions considering
> SAX), then the string "<a>b</c>" doesn't match a valid document because the
> opening and closing tags don't match.
>
> I was just trying to further explain a point I was making earlier in the
> thread about how the XML language is a context-free language, not a regular
> language, so it's not possible to express that language using only regular
> expressions. At best, you'd need a regular expression processor plus a stack
> to keep track of what the last opening tag was.
I suppose if you wanted a single regex to match the entire document and
capture each element into a separate group, you might be hard-pressed.
Matching an individual element is pretty simple though, if your regex
engine supports back-references.
| |
| Jon Haugsand 2005-08-18, 6:00 pm |
| * Jeff Schwab
> I suppose if you wanted a single regex to match the entire document
> and capture each element into a separate group, you might be
> hard-pressed. Matching an individual element is pretty simple though,
> if your regex engine supports back-references.
Like Oliver Wong said, you'll need a stack, not just back references.
Or else, how do you match the outer A element here:
<A>text<B><A>core</A>...</B>...</A>
--
Jon Haugsand
Dept. of Informatics, Univ. of Oslo, Norway, mailto:jonhaug@ifi.uio.no
http://www.ifi.uio.no/~jonhaug/, Phone: +47 22 85 24 92
| |
| Jeff Schwab 2005-08-18, 6:00 pm |
| Jon Haugsand wrote:
> * Jeff Schwab
>
>
>
> Like Oliver Wong said, you'll need a stack, not just back references.
> Or else, how do you match the outer A element here:
>
>
> <A>text<B><A>core</A>...</B>...</A>
The outer A happens to be the easiest of all the above elements to
match: "<([^>]+)>(.*)</\\1>".
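
A quick sketch that runs the pattern above against the nested example, next to the leaf-only pattern from earlier in the thread for contrast. The class name OuterElementDemo and the helper firstMatch are illustrative. Note that '.' does not match line breaks by default, so this assumes single-line input.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OuterElementDemo {
    // Greedy content: reaches the last closing tag with the same name.
    static final Pattern OUTER = Pattern.compile("<([^>]+)>(.*)</\\1>");
    // Tag-free content: only matches elements with no nested tags.
    static final Pattern LEAF = Pattern.compile("<([^>]+)>([^<]*)</\\1>");

    // Returns the first match of the pattern, or null if there is none.
    static String firstMatch(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "<A>text<B><A>core</A>...</B>...</A>";
        // prints the whole input, i.e. the outer A element
        System.out.println(firstMatch(OUTER, input));
        // prints <A>core</A>
        System.out.println(firstMatch(LEAF, input));
    }
}
```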
| |
| Oliver Wong 2005-08-18, 6:00 pm |
|
"Jeff Schwab" <jeffrey.schwab@rcn.com> wrote in message
news:s4WdndRAZod24pneRVn-3A@rcn.net...
> Jon Haugsand wrote:
>
>
> The outer A happens to be the easiest of all the above elements to match:
> "<([^>]+)>(.*)</\\1>".
I guess a better example would be:
<A>text<B><A>core</A>...</B>...</A><A>text<B><A>core</A>...</B>...</A>
Here neither a purely greedy nor a purely non-greedy matcher would be able
to see that there are two "outer A"s.
- Oliver
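
For the doubled input above, one way to recover both outer elements is to pair a tag-finding regex with a stack, as suggested earlier in the thread. This is only a sketch under the assumption of simple tags without attributes; the class name TopLevelElements and the method split are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TopLevelElements {
    // Group 1 is "/" for a closing tag, group 2 is the tag name.
    private static final Pattern TAG = Pattern.compile("<(/?)([^<>/]+)>");

    // Returns each outermost element as a substring of the input.
    static List<String> split(String input) {
        List<String> elements = new ArrayList<>();
        Deque<String> open = new ArrayDeque<>();
        int start = -1;
        Matcher m = TAG.matcher(input);
        while (m.find()) {
            String name = m.group(2);
            if (m.group(1).isEmpty()) {
                if (open.isEmpty()) {
                    start = m.start(); // a new outermost element begins here
                }
                open.push(name);
            } else {
                if (open.isEmpty() || !open.pop().equals(name)) {
                    throw new IllegalArgumentException(
                            "Mismatched closing tag: " + m.group());
                }
                if (open.isEmpty()) {
                    elements.add(input.substring(start, m.end()));
                }
            }
        }
        if (!open.isEmpty()) {
            throw new IllegalArgumentException("Unclosed tag: " + open.peek());
        }
        return elements;
    }

    public static void main(String[] args) {
        String one = "<A>text<B><A>core</A>...</B>...</A>";
        // prints 2
        System.out.println(split(one + one).size());
    }
}
```

By comparison, the greedy back-reference pattern would return the whole doubled string as a single match, and the non-greedy variant would stop at the first inner closing A tag.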