Regular expression with 0 or 1 matches option
O.B.

2004-08-23, 3:58 am

I am trying to create a regular expression that will "optionally" match each
expression. In the example below, I would like regexp to attempt to match
all three expressions, but I am fine if it can only match one or two of
them. So in the case below, I expected it to match the "quick " and
"fox" expressions, but it failed to match any expression. What's wrong here?

set regString "(quick )+.*?(brown )+.*?(fox)"

set story "The quick fox jumped"

set results [regexp -nocase -indices -inline $regString $story]

set numMatches 0
set matchNum 0
foreach pair $results {
    set indexA [lindex $pair 0]
    set indexB [lindex $pair 1]
    puts "$matchNum: [string range $story $indexA $indexB]"
    if { (0 < $matchNum) && ($indexA < $indexB) } {
        incr numMatches
    }
    incr matchNum
}

# This should be 2.
puts "Number of expressions matched = $numMatches"

Andreas Leitgeb

2004-08-23, 8:58 am

O.B. <funkjunk@bellsouth.net> wrote:
> I am trying to create a regular expression that will "optionally" match each
> expression. In the example below, I would like regexp to attempt to match
> all three expressions, but I am fine if it can only match one or two of the
> expressions. So in the case below, I expected it to match the "quick " and
> "fox" expressions, but it failed to match any expression. What's wrong here?
>
> set regString "(quick )+.*?(brown )+.*?(fox)"


The RE you supplied does not do what you wrote it should do.

The RE says:
match ONE OR MORE instances of "quick ",
then match as few arbitrary characters as necessary,
then match ONE OR MORE instances of "brown ",
then match as few arbitrary characters as necessary,
then finally match ONE instance of "fox",
so, clearly, it cannot possibly match if the word "brown" is missing.

You forgot some enclosing parentheses to collect the optional parts:
set regString "(?:(quick )+.*)?(?:(brown )+.*)?(fox)"

(The "?:" inside each of the new parentheses makes them
"non-capturing". See the re_syntax man page for details.)
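
A minimal sketch of the difference between the two REs, run against the story from the original post. The variable names (original, fixed, numMatched) are chosen here for illustration, and the test for an unmatched group (a start index of -1) is an assumption about how one might count groups, not code from the thread.

```tcl
set story "The quick fox jumped"

# Original RE: "(brown )+" requires at least one "brown ", so nothing matches.
set original {(quick )+.*?(brown )+.*?(fox)}
set r1 [regexp -nocase -inline $original $story]
puts "original: [llength $r1] elements returned"

# Suggested RE: each word group sits inside an optional non-capturing group.
set fixed {(?:(quick )+.*)?(?:(brown )+.*)?(fox)}
set r2 [regexp -nocase -indices -inline $fixed $story]
puts "fixed: $r2"   ;# {4 12} {4 9} {-1 -1} {10 12}

# Count the capture groups that took part in the match.
# A group that did not participate is reported as {-1 -1}.
set numMatched 0
foreach pair [lrange $r2 1 end] {
    if {[lindex $pair 0] >= 0} {
        incr numMatched
    }
}
puts "groups matched: $numMatched"   ;# 2
```

Testing the start index against -1 also counts single-character matches, which a comparison like indexA < indexB would miss.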
O.B.

2004-08-23, 4:01 pm

Andreas Leitgeb wrote:
> O.B. <funkjunk@bellsouth.net> wrote:
>
>
>
> The RE you supplied surely does not do what you wrote it should do.
>
> The RE says:
> match ONE OR MORE instances of "quick ",
> then match as few as necessary arbitrary characters,
> then match ONE OR MORE instances of "brown ",
> then match as few as necessary arbitrary characters,
> then finally match ONE instance of "fox",
> so, clearly it cannot possibly match if the word "brown" was missing.
>
> you forgot some enclosing parentheses to collect the optional parts:
> set regString "(?:(quick )+.*)?(?:(brown )+.*)?(fox)"
>
> (the "?:" inside each of the new parentheses will make them
> "non-capturing". see man-page of re_syntax for details)


Good catch. I've tried to expand the example to further explain what I'm trying
to do. Of all the expressions, I'd prefer for the program to attempt to match
as many as possible. Am I asking too much of regular expressions?

# Test 1
set regString "(?:(quick )+.*)?(?:(brown )+.*)?(?:(fox ).*)?"
set story "The quick fox jumped over another brown fox "
set results [regexp -all -nocase -indices -inline $regString $story]

For this test, I get 5 sets of data. Looping through the data, it appears that
there were no complete matches. I was expecting the regular expression to match
the 2nd, 7th, and 8th words of "story".

Using the "same" regString: In the event that the story contains only "The quick
fox jumped ", I was expecting the regular expression to match the 2nd and 3rd
words into the 1st and 3rd expression.

# Test 2
set story "The quick fox jumped "
set results [regexp -all -nocase -indices -inline $regString $story]


FYI, the following code is used for debugging the returned results:

# Each match contributes 4 entries: the whole match plus 3 capture groups.
set numMatches [expr {[llength $results] / 4}]

puts "Number of matched sets = $numMatches"
puts "Results = $results"

set counter 0
set setNum 1
for {set i 0} {$i < [llength $results]} {incr i} {
    if { $counter == 0 } {
        puts "Set $setNum:"
    }

    set pair [lindex $results $i]
    set indexA [lindex $pair 0]
    set indexB [lindex $pair 1]
    puts " $counter: [string range $story $indexA $indexB]"

    if { $counter == 3 } {
        set counter 0
        incr setNum
    } else {
        incr counter
    }
}
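
The same walk over the results can be written with a multi-variable foreach, which takes the flat list four entries at a time. This is only a sketch: printMatchSets is a hypothetical helper name, it assumes an RE with exactly three capture groups, and it labels groups reported as {-1 -1} instead of printing an empty string for them.

```tcl
# Hypothetical helper (not from the thread): walk the flat list returned by
# regexp -all -indices -inline for an RE with exactly three capture groups.
proc printMatchSets {story results} {
    set setNum 0
    foreach {whole g1 g2 g3} $results {
        incr setNum
        puts "Set $setNum:"
        set counter 0
        foreach pair [list $whole $g1 $g2 $g3] {
            set indexA [lindex $pair 0]
            set indexB [lindex $pair 1]
            if {$indexA < 0} {
                # {-1 -1} means this group took no part in the match.
                puts " $counter: (did not participate)"
            } else {
                puts " $counter: [string range $story $indexA $indexB]"
            }
            incr counter
        }
    }
    return $setNum
}

set regString {(?:(quick )+.*)?(?:(brown )+.*)?(?:(fox ).*)?}
set story "The quick fox jumped "
set results [regexp -all -nocase -indices -inline $regString $story]
set numSets [printMatchSets $story $results]
puts "Number of matched sets = $numSets"
```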





Andreas Leitgeb

2004-08-23, 4:01 pm

O.B. <funkjunk@bellsouth.net> wrote:
>
> Good catch. I've tried to expand the example to further explain what
> I'm trying to do. Of all the expressions, I'd prefer for the program
> to attempt to match as many as possible.


Oh, here starts the lousy part :-/

At first glance,
set regString "(?:(quick )?.*?)?(?:(brown )+.*?)?(fox)?"
should do it by making the .* non-greedy, but it seems
REs that use non-greedy matching behave more strangely
than one might think.

The problem with the naive approach is that a non-match
of any word can happen at any place, and if a non-match is
acceptable (through use of the ? or * quantifier), the RE engine
will not necessarily search for longer matches that would start
at a later position.
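
A small sketch of that effect, using the all-optional RE from Test 1 rather than the non-greedy variant. Since every part of the RE may match nothing, an empty match at the very start of the string is a valid result, and the engine reports it without ever reaching "quick" further along. The variable name first is chosen here for illustration.

```tcl
set regString {(?:(quick )+.*)?(?:(brown )+.*)?(?:(fox ).*)?}
set story "The quick fox jumped "

# Without -all, regexp reports only the leftmost match. At offset 0 none of
# the three words is present, so the only possible match there is empty.
set first [regexp -nocase -indices -inline $regString $story]
puts $first   ;# {0 -1} {-1 -1} {-1 -1} {-1 -1}

# An empty match is reported with an end index one less than its start,
# and every capture group is reported as {-1 -1}.
set whole [lindex $first 0]
if {[lindex $whole 1] < [lindex $whole 0]} {
    puts "leftmost match is empty"
}
```

With -all, the search resumes after each match, which would explain getting several sets back where some contain no words at all.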

I can't say it's impossible, but I can't think of a solution
either.

PS: If it's just about finding as many of the words as possible,
without caring about the order in which they occur, then
regexp -all -inline -indices "quick|brown|fox" $story
may do it for you.
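
A hedged sketch of how the alternation could be used to count how many of the three words occur. It drops -indices so that the matched text itself is returned; the lsort -unique step and the variable names hits and distinct are additions made here for illustration, not part of the suggestion above.

```tcl
set story "The quick fox jumped over another brown fox "

# Collect every occurrence of any of the three words, in order of appearance.
set hits [regexp -all -inline -nocase {quick|brown|fox} $story]
puts "hits: $hits"   ;# quick fox brown fox

# Reduce to the distinct words found. Lower-casing the whole list as a string
# is safe here only because each hit is a plain word with no special characters.
set distinct [lsort -unique [string tolower $hits]]
puts "distinct words found: [llength $distinct]"   ;# 3
```

This ignores the order of the words entirely, so it answers "how many of the expressions occur" but not "do they occur in sequence".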

Copyright 2008 codecomments.com