Code Comments

Converting grep/sed combination to Java
Hi,

I am using the following grep/sed combination to extract hyperlinks from
a document on Windows 2000.  This is a one-line command:

grep -o -E "href=\"[^&#:;+?/\.0-9A-Za-z_-]*\"" Contents.html|sed -e
"s/href=\"//" -e "s/\"//"

Please could someone tell me how to pass the regular expression part
(the argument to grep), to Pattern.compile in Java.  I have tried
putting \\ before characters which I think may cause problems, but Java
still reports errors within the argument passed to Pattern.compile.

Note: I realised that I had to remove the ^ before the & as this was
only included to make the command valid to the Windows 2000 command
interpreter.

Any help would be much appreciated.

Regards,
Jonny

Jonny
06-04-05 01:58 PM


Re: Converting grep/sed combination to Java
On Sat, 04 Jun 2005 10:20:50 +0000, Jonny wrote:
> Note: I realised that I had to remove the ^ before the & as this was
> only included to make the command valid to the Windows 2000 command
> interpreter.

man grep. ;)

The ^ inverts the selection: at the start of a bracket expression it
negates the character class. Removing it therefore inverts the logic.
Don't do this.
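
For anyone unsure what a leading ^ does inside square brackets, here is a minimal Java sketch. The class and method names (NegatedClassDemo, matchesWhole) are made up for illustration; only java.util.regex.Pattern is real API.

```java
import java.util.regex.Pattern;

class NegatedClassDemo {
    // True if the whole input matches the regex (hypothetical helper).
    static boolean matchesWhole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // [abc]+ accepts only the listed characters.
        System.out.println(matchesWhole("[abc]+", "cab"));
        // [^abc]+ accepts only characters that are NOT listed.
        System.out.println(matchesWhole("[^abc]+", "cab"));
        System.out.println(matchesWhole("[^abc]+", "xyz"));
    }
}
```

The first and third calls return true and the second returns false, so dropping a ^ that really is part of the regex flips which characters are accepted.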

--
In pioneer days they used oxen for heavy pulling, and when one ox
couldn't budge a log, they didn't try to grow a larger ox. We shouldn't
be trying for bigger computers, but for more systems of computers.
--- Rear Admiral Grace Murray Hopper


Stefan Schulz
06-04-05 08:57 PM


Re: Converting grep/sed combination to Java
Stefan Schulz wrote:

> On Sat, 04 Jun 2005 10:20:50 +0000, Jonny wrote: 
>
> man grep. ;)
>
> The ^ inverts the selection: at the start of a bracket expression it
> negates the character class. Removing it therefore inverts the logic.
> Don't do this.

Thanks for your reply, Stefan.

The problem I am having is not how to use grep, but how to pass the
regular expression to Pattern.compile without getting Java compilation
errors.  Do certain characters need to be escaped?

Regards,
Jonny

Jonny
06-04-05 08:57 PM


Re: Converting grep/sed combination to Java
In message <6gfoe.7317$%21.3303@newsfe2-gui.ntli.net>, Jonny
<www.mail@ntlworld.com> writes
>Hi,
>
>I am using the following grep/sed combination to extract hyperlinks from
>a document on Windows 2000.  This is a one-line command:
>
>grep -o -E "href=\"[^&#:;+?/\.0-9A-Za-z_-]*\"" Contents.html|sed -e
>"s/href=\"//" -e "s/\"//"
>
>Please could someone tell me how to pass the regular expression part
>(the argument to grep), to Pattern.compile in Java.  I have tried
>putting \\ before characters which I think may cause problems, but Java
>still reports errors within the argument passed to Pattern.compile.

In Java, you can combine the grep and the 2 sed commands.

Try something like:

href=\"(.*?)\"

i.e. look for href=" then collect all characters up to but not including
the first ".

The ? makes the .* non-greedy.
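
A minimal sketch of how this pattern might be used from Java to do the work of the grep and both sed commands in one pass. The class name HrefExtractor, the method extractLinks and the sample HTML are assumptions made up for illustration; Pattern, Matcher, find() and group(1) are standard java.util.regex API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefExtractor {
    // Group 1 captures everything between href=" and the next double quote.
    private static final Pattern HREF =
            Pattern.compile("href=\"(.*?)\"", Pattern.CASE_INSENSITIVE);

    // Returns every href value found in the given text, in order.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1)); // the capture group replaces the two sed substitutions
        }
        return links;
    }

    public static void main(String[] args) {
        String sample = "<a href=\"index.html\">Home</a> <A HREF=\"docs/api.html#top\">API</A>";
        System.out.println(extractLinks(sample)); // prints [index.html, docs/api.html#top]
    }
}
```

Because the group sits between the two quotes, there is nothing left to strip afterwards.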

HTH

Nemo
06-05-05 01:57 AM


Re: Converting grep/sed combination to Java
Jonny wrote:
> Stefan Schulz wrote:
>
> Thanks for your reply, Stefan.
>
> The problem I am having is not how to use grep, but how to pass the
> regular expression to Pattern.compile without getting Java compilation
> errors.  Do certain characters need to be escaped?

The string to compile is just a string. The only thing that has to be
escaped is the backslash. So every backslash in the pattern must be doubled.
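
As a rough illustration, here is how the original grep character class might look as a Java string literal. This assumes, per Jonny's note, that the ^ before & was only there for the Windows command interpreter and is not part of the regex. The names GrepClassInJava and firstMatch are hypothetical. One difference worth checking: in a POSIX bracket expression a backslash is a literal character, whereas in Java \. inside [...] is an escaped dot.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GrepClassInJava {
    // Regex as grep sees it (assumed):  href="[&#:;+?/\.0-9A-Za-z_-]*"
    // In the Java literal each " becomes \" and the backslash in \. is doubled.
    static final Pattern LINK =
            Pattern.compile("href=\"[&#:;+?/\\.0-9A-Za-z_-]*\"");

    // Returns the first whole match, including href=" and the closing quote, or null.
    static String firstMatch(String text) {
        Matcher m = LINK.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("<a href=\"page1.html#top\">x</a>"));
    }
}
```

Like grep -o, this returns the whole match, so the href=" prefix and trailing quote would still need stripping (the job sed did in the original command).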

--
Dale King

Dale King
06-05-05 08:58 AM


Re: Converting grep/sed combination to Java
Nemo wrote:

> In message <6gfoe.7317$%21.3303@newsfe2-gui.ntli.net>, Jonny
> <www.mail@ntlworld.com> writes 
>
> In Java, you can combine the grep and the 2 sed commands.
>
> Try something like:
>
> href=\"(.*?)\"
>
> i.e. look for href=" then collect all characters up to but not including
> the first ".
>
> The ? makes the .* non-greedy.

Thanks Nemo.

That's exactly what I was looking for.  I didn't realise Java used the
advanced regexp syntax.

The statement I finally used is:

regexp_pattern =
Pattern.compile("href=\\\"(.*?)\\\"",
Pattern.CASE_INSENSITIVE);

Your help is appreciated.

Regards,
Jonny

Jonny
06-05-05 01:57 PM


Re: Converting grep/sed combination to Java
Dale King wrote:

> Jonny wrote: 
>
> The string to compile is just a string. The only thing that has to be
> escaped is the backslash. So every backslash in the pattern must be doubled.

There's more to it than that.  Characters also have to be escaped within
the regular expression in order to be compiled by the regexp compiler.
These characters are also escaped by using a backslash.

See my response to Nemo's post, where I have to use three backslashes to
escape a double-quote character.  A regexp compiler would expect a
double-quote to be passed as \", but so would the Java compiler, which
in addition expects the backslash itself to be escaped.  Hence the need
for \\\"
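
To make the two layers of escaping visible, here is a small sketch that compiles the href pattern both ways and runs each against the same input. The class name EscapeLayers, the helper firstHref and the sample HTML are invented for illustration. The comments show what the regex engine receives after the Java compiler has processed the string literal.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapeLayers {
    // Returns the first captured href value, or null if the regex finds nothing.
    static String firstHref(String regex, String html) {
        Matcher m = Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String tripled = "href=\\\"(.*?)\\\""; // regex engine sees: href=\"(.*?)\"
        String single  = "href=\"(.*?)\"";     // regex engine sees: href="(.*?)"
        String html = "<a href=\"index.html\">Home</a>";

        // Printing the strings shows the regex text after Java's own unescaping.
        System.out.println(tripled + " -> " + firstHref(tripled, html));
        System.out.println(single + " -> " + firstHref(single, html));
    }
}
```

Both lines end in index.html: the Java regex engine treats a backslash before a non-alphabetic character such as " as an escape for that literal character, so the tripled form works but is not required.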

Thanks for your reply.

Regards,
Jonny


Jonny
06-05-05 01:57 PM


Re: Converting grep/sed combination to Java
Jonny wrote:
> Dale King wrote:
>
> There's more to it than that.  Characters also have to be escaped within
> the regular expression in order to be compiled by the regexp compiler.
> These characters are also escaped by using a backslash.

I assume he knew that. I was referring to what had to be escaped on top
of the normal regular expression escapes.

> See my response to Nemo's post, where I have to use three backslashes to
> escape a double-quote character.  A regexp compiler would expect a
> double-quote to be passed as \", but so would the Java compiler, which
> in addition expects the backslash itself to be escaped.  Hence the need
> for \\\"

D'oh. Yes, I forgot to mention the quote. My point was that the thing that
really trips people up is the fact that backslashes must be escaped.

I really wish they would adopt an alternate way to allow you to specify
strings that would let you get around the escaping for things like
regular expressions and path names. Such a thing has been proposed here
before:

<http://groups-beta.google.com/group...ae847810dfcb8e1>

--
Dale King

Dale King
06-06-05 08:59 AM


Re: Converting grep/sed combination to Java
In message <fazoe.1861$K5.16@newsfe4-win.ntli.net>, Jonny
<www.mail@ntlworld.com> writes
>See my response to Nemo's post, where I have to use three backslashes to
>escape a double-quote character.  A regexp compiler would expect a
>double-quote to be passed as \", but so would the Java compiler, which

Are you sure that you need to quote " in an RE? It isn't a regex
metacharacter, so the extra backslash isn't required, though it won't
do any harm.

I now realise my previous posting was a bit ambiguous/incomplete.
I quoted the " but left off the enclosing " and " - all of which are
needed to turn it into a Java String.

I should have written:

String pattern="href=\"(.*?)\"";

OK?

Nemo
06-08-05 09:01 AM

