
Author How to grab awk's regexp submatches?
feaber

2007-11-19, 7:01 pm

For example, I have something like this:

$ echo "test4325363test" | awk '/(.*)([0-9]+)(.*)/ {print NUMBER HERE!}'

awk gets some text to parse, and it matches. But I want to get part
of that text (the number).

In the Apache 2 module mod_rewrite this was easy: submatches go into
variables ($0-$n). But here in awk the $-variables mean something
else, right? :)
Kenny McCormack

2007-11-19, 7:01 pm

In article <9826656f-be67-4113-b4cf-167ab020d87f@d61g2000hsa.googlegroups.com>,
feaber <feaber@gmail.com> wrote:
>For example, I have something like this:
>
>$echo "test4325363test" | awk "/(.*)([0-9]+)(.*)/ {print NUMBER
>HERE!}"
>
>awk gets some text to parse. And it match. But I want to get some part
>of that text (the number).
>
>In apache2 module, mod_rewrite it was easy. Submatches goes into
>variables ($0-$n), but here in awk the $-variables meaning something
>else right? :)


Both GAWK and TAWK have extensions that do this. Standard (plain
POSIX) AWK has no way to retrieve parenthesized submatches.

Bob Harris

2007-11-19, 7:01 pm

In article
<9826656f-be67-4113-b4cf-167ab020d87f@d61g2000hsa.googlegroups.com>,

feaber <feaber@gmail.com> wrote:

> For example, I have something like this:
>
> $echo "test4325363test" | awk "/(.*)([0-9]+)(.*)/ {print NUMBER
> HERE!}"
>
> awk gets some text to parse. And it match. But I want to get some part
> of that text (the number).
>
> In apache2 module, mod_rewrite it was easy. Submatches goes into
> variables ($0-$n), but here in awk the $-variables meaning something
> else right? :)


echo "test4325363test" | awk '
# match() sets RSTART and RLENGTH to the position and length of the match
match($0, /[0-9]+/) {
    print substr($0, RSTART, RLENGTH)
}
'
Ed Morton

2007-11-19, 9:58 pm



On 11/19/2007 4:53 PM, feaber wrote:
> For example, I have something like this:
>
> $echo "test4325363test" | awk "/(.*)([0-9]+)(.*)/ {print NUMBER
> HERE!}"
>
> awk gets some text to parse. And it match. But I want to get some part
> of that text (the number).
>
> In apache2 module, mod_rewrite it was easy. Submatches goes into
> variables ($0-$n), but here in awk the $-variables meaning something
> else right? :)


This might be what you're looking for (GNU awk):

gawk '{print gensub(/(.*)([0-9]+)(.*)/,"\\2","")}'

Ed.

feaber

2007-11-20, 7:58 am

Thx Guys! :)
Steffen Schuler

2007-11-20, 6:58 pm

Hi feaber, hello netlanders,

On Mon, 19 Nov 2007 14:53:04 -0800, feaber wrote:

> For example, I have something like this:
>
> $echo "test4325363test" | awk "/(.*)([0-9]+)(.*)/ {print NUMBER HERE!}"
>
> awk gets some text to parse. And it match. But I want to get some part
> of that text (the number).
>
> In apache2 module, mod_rewrite it was easy. Submatches goes into
> variables ($0-$n), but here in awk the $-variables meaning something
> else right? :)


In awk, $i means the i-th field of the input record ($0 is the whole
record, without the record separator). Normally a record is one line of
text and the record separator is a newline.
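To make the field variables concrete, here is a small sketch that should work in any POSIX awk. The function name show_fields is made up for this illustration; it assumes the default field separator (runs of blanks).

```shell
# Hypothetical helper: report the field count, the second field and the
# whole record for one line of input (default whitespace field splitting).
show_fields() {
    printf '%s\n' "$1" |
    awk '{ printf "fields=%d second=%s record=%s\n", NF, $2, $0 }'
}

show_fields "alpha beta gamma"
```

So $2 is a field of the input line, not the second parenthesized group of a regexp.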

POSIX awk has no support for parenthesized submatches in the sense you
mean, but gawk provides it in two ways: an optional third (array)
argument to match(), and the gensub() function.

(A) match()-Extension
*********************

Gawk's extension match(s, re, a) does what you want:

a[0] is set to the text matched by the whole of re, and the submatches
inside parentheses in re are assigned to the array elements

a[1], a[2], ..., a[n]

a[i, "start"] is the start position of a[i] in s, and a[i, "length"]
is its length.
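As a rough illustration of these array elements, the sketch below prints the second submatch together with its start position and length. It needs gawk, since the third argument to match() is a gawk extension; the function name submatch_info is made up here, and the pattern is chosen so that the second group captures the whole number.

```shell
# Hypothetical helper, gawk only: print group 2 with its start and length.
submatch_info() {
    # Fail early if gawk is not installed.
    command -v gawk >/dev/null 2>&1 || return 1
    printf '%s\n' "$1" |
    gawk 'match($0, /([^0-9]*)([0-9]+)(.*)/, a) {
        print a[2], a[2, "start"], a[2, "length"]
    }'
}

submatch_info "test4325363test" || echo "gawk is required for this sketch" >&2
```

For the input test4325363test this should print "4325363 5 7": the number starts at character 5 and is 7 characters long.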

Please observe that (g)awk matching is greedy: the leading (.*) takes
as many characters as it can, leaving only one digit for ([0-9]+).
After match("test4325363test", "(.*)([0-9]+)(.*)", a):

a[1] is "test432536"
a[2] is "3"
a[3] is "test"

Therefore use:

echo test4325363test |
gawk 'match($0, "([^0-9]*)([0-9]+)(.*)", a) { print a[2] }'

to extract the number.
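If gawk is not available, the same three pieces can be recovered with plain POSIX match(), RSTART and RLENGTH, as in Bob Harris's answer. This is only a sketch; the function name split_on_number and the "|" output separator are arbitrary choices made for the illustration.

```shell
# Hypothetical helper, POSIX awk: split a line around its first run of
# digits, emulating the groups ([^0-9]*), ([0-9]+) and (.*).
split_on_number() {
    printf '%s\n' "$1" |
    awk 'match($0, /[0-9]+/) {
        before = substr($0, 1, RSTART - 1)       # text before the digits
        number = substr($0, RSTART, RLENGTH)     # the digits themselves
        after  = substr($0, RSTART + RLENGTH)    # everything after them
        print before "|" number "|" after
    }'
}

split_on_number "test4325363test"
```

For the input above this prints test|4325363|test. Lines without any digit produce no output, because match() returns 0 and the action is skipped.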

Please note that gawk 3.1.5 has some bugs in the match() function.
These should be fixed in gawk 3.1.6 (see ftp://ftp.gnu.org).

(B) gensub()
************

As Ed Morton told you, gensub() is the other alternative with gawk.
The correct usage in your case (see the point about greedy matching
above) is:

echo test4325363test |
gawk '/[0-9]/ { print gensub(/([^0-9]*)([0-9]+)(.*)/, "\\2", "1") }'
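gensub() can also put several submatches back into the result, not just one. The sketch below wraps the number in brackets and keeps the text around it. It needs gawk, and the function name bracket_number is made up for this illustration.

```shell
# Hypothetical helper, gawk only: rebuild the line from groups 1, 2 and 3,
# putting brackets around the number (group 2).
bracket_number() {
    # Fail early if gawk is not installed.
    command -v gawk >/dev/null 2>&1 || return 1
    printf '%s\n' "$1" |
    gawk '/[0-9]/ {
        print gensub(/([^0-9]*)([0-9]+)(.*)/, "\\1[\\2]\\3", 1)
    }'
}

bracket_number "test4325363test" || echo "gawk is required for this sketch" >&2
```

With the input test4325363test this should give test[4325363]test. Unlike sub() and gsub(), gensub() returns the new string and leaves $0 unchanged.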


Hope I could help you,

Steffen "goedel" Schuler

Copyright 2008 codecomments.com