regexp advice
| Jeff Godfrey 2004-05-18, 1:32 pm |
| Hi All,
I am working on a "file parsing" app using regular expressions. The twist
is that the parser will be User definable, in that the User can say "Here's
the things I'm interested in". In order to do this, the User will define
individual "tokens" (a Snit based object) for each "item of interest" in the
file. Among other things, each "token" will contain a regular expression
used for matching the appropriate text within the parsed file.
The goal is to parse an input file (one line at a time) and return its
tokenized representation, based on the User-defined tokens. But, I need to
return the tokenized data in the order the tokens were found in the original
input string. Since there is no "defined" token order in the input file, I
need to determine which tokens match something in the current line, and
where.
While I have working code to do this, I would appreciate any advice on
improving my basic design - as I don't use regular expressions very often,
and I think there might be a better (faster) way to approach this problem.
Basically, I spin through all my regular expressions and determine if and
where (using -indices) each one matches. I build a list containing each
matched regular expression along with its match indices (beginning char
position). Next, I sort the list by this index in order to get the regexps
in the same order as the text line. Finally, I spin through all regexps
again, but this time in "matched order", in order to retrieve the actual
matched data. The final result of the code is a list consisting of each
matched regular expression, along with any subMatches (created from the
parenthesized portion of the regular exp).
Below is some simplified example code to provide a basic illustration of the
logic.
Any advice on improving this would be appreciated.
Thanks,
Jeff
=====================
catch {console show}
# --- some sample data
set regList {Y([0-9]+) G02 G00 G01 X([0-9]+) G03}
set string "N0010G01X10Y12M10"
# --- determine if and where each regexp matches in the string
set matchList [list]
foreach regexp $regList {
    if {[regexp -indices -- $regexp $string match]} {
        lappend matchList [list [lindex $match 0] $regexp]
    }
}
puts "String --> $string"
# --- now, sort the returned list by match location
puts "Unsorted --> $matchList"
set matchList [lsort -integer -index 0 $matchList]
puts "Sorted --> $matchList"
# --- now, gather all matched data in the proper order
set finalList [list]
foreach {loc regexp} [join $matchList] {
    regexp -- $regexp $string --> subvar
    # --- the real code uses the "token name" in place of $regexp
    #     in the following lappend line
    lappend finalList $regexp $subvar
}
puts "Final --> $finalList"
| |
| Ramon Ribó 2004-05-18, 1:32 pm |
| Hello,
What about this code:
set regList {Y([0-9]+) G02 G00 G01 X([0-9]+) G03}
set string "N0010G01X10Y12M10"
set finalList [regexp -inline -all [join $regList |] $string]
Maybe you can take some idea of it.
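For reference, here is a minimal sketch of what the combined pattern returns on the sample data, assuming a Tcl release whose regexp supports -all and -inline. The names line, combined, flat, ySub and xSub are illustrative and not from the post, and the foreach stride of three assumes exactly two capture groups in the combined pattern, as in this sample.

```tcl
# Sample data from the thread
set regList {Y([0-9]+) G02 G00 G01 X([0-9]+) G03}
set line "N0010G01X10Y12M10"

# One alternation built from all token patterns
set combined [join $regList |]

# -all -inline returns, for every match in left-to-right order, the full
# match followed by one element per capture group in the combined pattern
# (two groups here); groups that took no part in a match are empty strings.
set flat [regexp -all -inline -- $combined $line]
puts "Flat  --> $flat"

# Collapse each group of three into "matched text, sub-match" pairs.
# At most one of the two groups is non-empty for any single match.
set finalList [list]
foreach {whole ySub xSub} $flat {
    lappend finalList $whole $ySub$xSub
}
puts "Final --> $finalList"
```

Unlike the loop-and-sort version, this reports every occurrence of a token on the line rather than only the first, and matches cannot overlap. Only the matched text comes back, so recovering which token pattern produced each match takes extra bookkeeping.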
Regards,
--
Ramon Ribó
http://gatxan.cimne.upc.es/ramsan
| |
| Juan C. Gil 2004-05-19, 5:32 am |
| "Jeff Godfrey" <jeff_godfrey@pobox.com> wrote in message news:<10akd1gesbn4qae@corp.supernews.com>...
> Hi All,
>
> I am working on a "file parsing" app using regular expressions.
>
> [chunk deleted]
>
> Any advice on improving this would be appreciated.
>
> Thanks,
>
> Jeff
>
> [code deleted]
>
I'd accumulate the actual match when building matchList:
lappend matchList [list [lindex $match 0] \
    $regexp [string range $string \
    [lindex $match 0] [lindex $match 1]]]
so that it is not required to invoke [regexp] twice
for each RE.
Juan Carlos---
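Put into the full loop, the suggestion could look like the sketch below. It additionally keeps the first parenthesized sub-match, which the original second pass collected. That extension and the names re, line, first, last, whole, sub and subText are assumptions for illustration, not part of the post.

```tcl
set regList {Y([0-9]+) G02 G00 G01 X([0-9]+) G03}
set line "N0010G01X10Y12M10"

set matchList [list]
foreach re $regList {
    # With -indices, match and sub receive {first last} index pairs;
    # sub is {-1 -1} when the pattern has no parenthesized group.
    if {[regexp -indices -- $re $line match sub]} {
        set first [lindex $match 0]
        set last  [lindex $match 1]
        set whole [string range $line $first $last]
        # string range yields an empty string for the {-1 -1} pair
        set subText [string range $line [lindex $sub 0] [lindex $sub 1]]
        lappend matchList [list $first $re $whole $subText]
    }
}

# Sort by start index to recover the order of appearance in the line
set matchList [lsort -integer -index 0 $matchList]

foreach entry $matchList {
    puts $entry
}
```

Each entry holds the start index, the pattern, the matched text and the sub-match, so no second regexp call per pattern is needed. As in the original code, only the first occurrence of each pattern is recorded.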
| |
| Jeff Godfrey 2004-05-19, 11:32 am |
| Juan,
Good idea - I've modified my code as per your suggestion.
Thanks,
Jeff
| |
| Bruce Hartweg 2004-05-19, 12:33 pm |
|
Did you try Ramon's suggestion? It is much simpler and faster, and it does
all you want with a single call to regexp. It was what I was going to reply
until I saw he already had. In case your newsfeed ate his post, here it is:
Ramon Ribó wrote:
> Hello,
>
> What about this code:
>
> set regList {Y([0-9]+) G02 G00 G01 X([0-9]+) G03}
> set string "N0010G01X10Y12M10"
> set finalList [regexp -inline -all [join $regList |] $string]
>
> Maybe you can take some idea of it.
>
> Regards,
>
Bruce
| |
| Jeff Godfrey 2004-05-19, 12:33 pm |
|
Bruce, thanks for pointing out Ramon's post - as you guessed, it never made
it through my feed....
Ramon, thanks for the suggestion - I'll give it a look and let you know how
it works out.
Jeff