Converting grep/sed combination to Java
Jonny

2005-06-04, 8:58 am

Hi,

I am using the following grep/sed combination to extract hyperlinks from
a document on Windows 2000. This is a one-line command:

grep -o -E "href=\"[^&#:;+?/\.0-9A-Za-z_-]*\"" Contents.html|sed -e
"s/href=\"//" -e "s/\"//"

Please could someone tell me how to pass the regular expression part
(the argument to grep), to Pattern.compile in Java. I have tried
putting \\ before characters which I think may cause problems, but Java
still reports errors within the argument passed to Pattern.compile.

Note: I realised that I had to remove the ^ before the & as this was
only included to make the command valid to the Windows 2000 command
interpreter.
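
[Editor's sketch] One way the grep expression above might be written as a Java string literal. This assumes the ^ before & really was only the command-interpreter escape, as the note says, so the character class is not negated. The class name GrepPatternSketch, the method firstMatch and the sample input are made up for illustration and do not come from the thread.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name; not from the thread.
class GrepPatternSketch {
    // grep -E regex:  href="[&#:;+?/\.0-9A-Za-z_-]*"
    // In Java source, each " is written \" and each \ is written \\.
    static final Pattern HREF =
            Pattern.compile("href=\"[&#:;+?/\\.0-9A-Za-z_-]*\"");

    // Returns the first match (roughly what grep -o prints), or null if none.
    static String firstMatch(String line) {
        Matcher m = HREF.matcher(line);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("<a href=\"page_1.html#top\">x</a>"));
    }
}
```

The match still includes href=" and the closing quote, which is what the two sed expressions strip off afterwards.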

Any help would be much appreciated.

Regards,
Jonny

Stefan Schulz

2005-06-04, 3:57 pm

On Sat, 04 Jun 2005 10:20:50 +0000, Jonny wrote:
> Note: I realised that I had to remove the ^ before the & as this was
> only included to make the command valid to the Windows 2000 command
> interpreter.


man grep. ;)

The ^ inverts the selection, which means you inverted the logic by
removing the ^. Don't do this.

--
In pioneer days they used oxen for heavy pulling, and when one ox
couldn't budge a log, they didn't try to grow a larger ox. We shouldn't
be trying for bigger computers, but for more systems of computers.
--- Rear Admiral Grace Murray Hopper

Jonny

2005-06-04, 3:57 pm

Stefan Schulz wrote:

> On Sat, 04 Jun 2005 10:20:50 +0000, Jonny wrote:
>
> man grep. ;)
>
> The ^ inverts the selection, which means you inverted the logic by
> removing the ^. Don't do this.


Thanks for your reply, Stefan.

The problem I am having is not how to use grep, but how to pass the
regular expression to Pattern.compile without getting Java compilation
errors. Do certain characters need to be escaped?

Regards,
Jonny

Nemo

2005-06-04, 8:57 pm

In message <6gfoe.7317$%21.3303@newsfe2-gui.ntli.net>, Jonny
<www.mail@ntlworld.com> writes
>Hi,
>
>I am using the following grep/sed combination to extract hyperlinks from
>a document on Windows 2000. This is a one-line command:
>
>grep -o -E "href=\"[^&#:;+?/\.0-9A-Za-z_-]*\"" Contents.html|sed -e
>"s/href=\"//" -e "s/\"//"
>
>Please could someone tell me how to pass the regular expression part
>(the argument to grep), to Pattern.compile in Java. I have tried
>putting \\ before characters which I think may cause problems, but Java
>still reports errors within the argument passed to Pattern.compile.


In Java, you can combine the grep and the 2 sed commands.

Try something like:

href=\"(.*?)\"

i.e. look for href=" then collect all characters up to but not including
the first ".

The ? makes the .* non-greedy.
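
[Editor's sketch] A small illustration of the point about greediness, and of how a capturing group can stand in for the sed step. The class HrefLinks, the method extract and the sample HTML are hypothetical; only the pattern href=\"(.*?)\" comes from this post.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; class and method names are illustrative only.
class HrefLinks {
    // Group 1 captures the text between the quotes, so no separate
    // step is needed to strip href=" and the closing quote.
    static List<String> extract(String html, String regex) {
        List<String> links = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"a.html\">A</a> <a href=\"b.html\">B</a>";
        // Non-greedy: each match stops at the first closing quote.
        System.out.println(extract(html, "href=\"(.*?)\""));
        // Greedy: a single match that runs on to the last quote in the input.
        System.out.println(extract(html, "href=\"(.*)\""));
    }
}
```

With two links on one line, the non-greedy form yields two entries, while the greedy form yields one entry spanning both links.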

HTH

Dale King

2005-06-05, 3:58 am

Jonny wrote:
> Stefan Schulz wrote:
>
> Thanks for your reply, Stefan.
>
> The problem I am having is not how to use grep, but how to pass the
> regular expression to Pattern.compile without getting Java compilation
> errors. Do certain characters need to be escaped?


The string to compile is just a string. The only thing that has to be
escaped is the backslash. So every backslash in the pattern must be doubled.
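
[Editor's sketch] A minimal illustration of doubling the backslash, using a made-up regex for a decimal number rather than anything from the thread. The class name BackslashDoubling is hypothetical. Writing a single backslash, as in "\d", would be rejected by the Java compiler as an illegal escape before the regex is ever seen.

```java
import java.util.regex.Pattern;

// Illustrative only; the class name is made up.
class BackslashDoubling {
    // Regex  \d+\.\d+  (digits, a literal dot, digits).
    // Every backslash is doubled in the Java source.
    static final Pattern DECIMAL = Pattern.compile("\\d+\\.\\d+");

    // Same regex without the escape before the dot: "." matches any character.
    static final Pattern LOOSE = Pattern.compile("\\d+.\\d+");

    public static void main(String[] args) {
        System.out.println(DECIMAL.matcher("3.14").matches()); // true
        System.out.println(DECIMAL.matcher("3x14").matches()); // false
        System.out.println(LOOSE.matcher("3x14").matches());   // true
    }
}
```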

--
Dale King

Jonny

2005-06-05, 8:57 am

Nemo wrote:

> In message <6gfoe.7317$%21.3303@newsfe2-gui.ntli.net>, Jonny
> <www.mail@ntlworld.com> writes
>
> In Java, you can combine the grep and the 2 sed commands.
>
> Try something like:
>
> href=\"(.*?)\"
>
> i.e. look for href=" then collect all characters up to but not including
> the first ".
>
> The ? makes the .* non-greedy.


Thanks Nemo.

That's exactly what I was looking for. I didn't realise Java used the
advanced regexp syntax.

The statement I finally used is:

regexp_pattern = Pattern.compile("href=\\\"(.*?)\\\"",
        Pattern.CASE_INSENSITIVE);

Your help is appreciated.

Regards,
Jonny

Jonny

2005-06-05, 8:57 am

Dale King wrote:

> Jonny wrote:
>
> The string to compile is just a string. The only thing that has to be
> escaped is the backslash. So every backslash in the pattern must be doubled.


There's more to it than that. Characters also have to be escaped within
the regular expression in order to be compiled by the regexp compiler.
These characters are also escaped by using a backslash.

See my response to Nemo's post, where I have to use three backslashes to
escape a double-quote character. A regexp compiler would expect a
double-quote to be passed as \", but so would the Java compiler, which
in addition expects the backslash itself to be escaped. Hence the need
for \\\"
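
[Editor's sketch] To make the two layers visible, the following prints the text the regex engine actually receives after the Java compiler has processed the string literal. The class EscapeLayers and the method seenByRegexEngine are hypothetical names; Pattern.pattern() simply returns the string the pattern was compiled from.

```java
import java.util.regex.Pattern;

// Illustrative only; the class and method names are made up.
class EscapeLayers {
    // Returns the regex text as the regex engine sees it, i.e. after
    // the Java compiler has already resolved the string-literal escapes.
    static String seenByRegexEngine(String regex) {
        return Pattern.compile(regex).pattern();
    }

    public static void main(String[] args) {
        // Source text \\\" becomes two characters: a backslash and a quote.
        System.out.println(seenByRegexEngine("href=\\\"(.*?)\\\""));
        // Source text \" becomes one character: a quote.
        System.out.println(seenByRegexEngine("href=\"(.*?)\""));
    }
}
```

The first line printed contains a backslash before each quote; the second contains only the bare quotes.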

Thanks for your reply.

Regards,
Jonny

Dale King

2005-06-06, 3:59 am

Jonny wrote:
> Dale King wrote:
>
> There's more to it than that. Characters also have to be escaped within
> the regular expression in order to be compiled by the regexp compiler.
> These characters are also escaped by using a backslash.


I assume he knew that. I was referring to what had to be escaped on top
of the normal regular expression escapes.

> See my response to Nemo's post, where I have to use three backslashes to
> escape a double-quote character. A regexp compiler would expect a
> double-quote to be passed as \", but so would the Java compiler, which
> in addition expects the backslash itself to be escaped. Hence the need
> for \\\"


D'oh. Yes forgot to mention the quote. My point was that the thing that
really trips people up is the fact that backslashes must be escaped.

I really wish they would adopt an alternate way to allow you to specify
strings that would let you get around the escaping for things like
regular expressions and path names. Such a thing has been proposed here
before:

<http://groups-beta.google.com/group...ae847810dfcb8e1>

--
Dale King

Nemo

2005-06-08, 4:01 am

In message <fazoe.1861$K5.16@newsfe4-win.ntli.net>, Jonny
<www.mail@ntlworld.com> writes
>See my response to Nemo's post, where I have to use three backslashes to
>escape a double-quote character. A regexp compiler would expect a
>double-quote to be passed as \", but so would the Java compiler, which


Are you sure that you need to quote " in an RE?
However, it won't do any harm, if not required.

I now realise my previous posting was a bit ambiguous/incomplete.
I quoted the " but left off the enclosing " and " - all of which are
needed to turn it into a Java String.

I should have written:

String pattern="href=\"(.*?)\"";

OK?
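
[Editor's sketch] A quick check of the question above: in java.util.regex a backslash before a non-alphabetic character just makes that character literal, so the version with the extra backslashes and the version without should behave the same. The class QuoteEscapeCheck, the method firstLink and the sample HTML are hypothetical; CASE_INSENSITIVE is taken from Jonny's final statement.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only; the class and method names are made up.
class QuoteEscapeCheck {
    // Returns the contents of the first href attribute, or null if none is found.
    static String firstLink(String regex, String html) {
        Matcher m = Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<A HREF=\"index.html\">Home</A>";
        // Quote escaped for the Java compiler only.
        System.out.println(firstLink("href=\"(.*?)\"", html));
        // Quote escaped for the Java compiler and again for the regex engine.
        System.out.println(firstLink("href=\\\"(.*?)\\\"", html));
    }
}
```

Both calls print index.html, so the shorter literal is enough.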

Copyright 2008 codecomments.com