For Programmers: Free Programming Magazines  


Home > Archive > Compilers > September 2007 > Precedence Rules for '$' and '^'









You are viewing an archived Text-only version of the thread. To view this thread in it's original format and/or if you want to reply to this thread please [click here]

 

Author Precedence Rules for '$' and '^'
jamin.hanson@googlemail.com

2007-09-13, 4:21 am

Hi there,

I am author of lexertl (http://www.benhanson.net/lexertl.html), a
lexer generator library, and I have recently added support for '^' and
'$'. As position matching in regex is not described in any books I
have looked at, does anyone know what the precedence rules should be?
If an earlier rule starts with '$', does that mean that a subsequent
rule starting with '^' has the '^' ignored?

Thanks,

Ben Hanson
Joachim Durchholz

2007-09-13, 7:12 pm

jamin.hanson@googlemail.com schrieb:
> Hi there,
>
> I am author of lexertl (http://www.benhanson.net/lexertl.html), a
> lexer generator library, and I have recently added support for '^' and
> '$'. As position matching in regex is not described in any books I
> have looked at, does anyone know what the precedence rules should be?


I think Perl regexes are what people expect, so I'd stick with the
precedences that PCRE uses.

> If an earlier rule starts with '$', does that mean that a subsequent
> rule starting with '^' has the '^' ignored?


I didn't understand that. A concrete example might help.

Regards,
Jo
[I don't understand it either. My understanding of typical REs is
that they special case ^ at the beginning of a pattern or chunk
that could match at the beginning, and $ at the end. -John]
Joachim Durchholz

2007-09-13, 7:12 pm

John Levine wrote:
> [I don't understand it either. My understanding of typical REs is
> that they special case ^ at the beginning of a pattern or chunk
> that could match at the beginning, and $ at the end. -John]


That's just implementation; the OP was after precedences.

Assigning a precedence to ^ and $ does make sense. For example,

^asd|jkl

could mean "asd at the beginning of the text, or jkl anywhere in the
text", or it could mean "asd or jkl at the beginning of the text".
PCRE says it's either ^asd or jkl, so it assigns a higher precedence to
^ than to |.

Regards,
Jo
jamin.hanson@googlemail.com

2007-09-15, 7:11 pm

On 13 Sep, 22:06, Joachim Durchholz <j...@durchholz.org> wrote:
> John Levine wrote:
>
> That's just implementation; the OP was after precedences.
>
> Assigning a precedence to ^ and $ does make sense. For example,
>
> ^asd|jkl
>
> could mean "asd at the beginning of the text, or jkl anywhere in the
> text", or it could mean "asd or jkl at the beginning of the text".
> PCRE says it's either ^asd or jkl, so it assigns a higher precedence to
> ^ than to |.


The way I deal with ^ or $ being coincidental is to include all other
possible inputs in the state following a match of ^ or $. To use your
example:

State 0:
^ -> State 1
j -> State 2

State 1:
a -> State 3
j -> State 2

State 2:

k -> State 4

State 3:
s -> State 5

State 4:

l -> State 6

State 5:
d -> State 6

State 6
(end state)

The reasoning is that you can't get to match 'a' without matching '^',
but you can match 'j' ieven if you match '^', as the '^' is irrelevant
to the 'j' path. I'd be interested to hear what people think about
this approach.

However, the real question is that if you allow '^' and '$' to occur
anywhere in a regex (boost::regex works this way), how you handle '^'
and '$' clashes, because you may have declared a '$' rule before a '^'
rule, yet my code always checks '^' before '$' regardless. As you
have to check both possibilities on lookup (otherwise how can you ever
match them ;-) ), the right thing to do appears to be to suppress the
'^' if it occurs at a position in the rules that a '$' has already
occurred at.

Note that using PERL rules is not the answer, as lexers use left-most
longest and compile to a DFA. PERL uses leftmost precedence and uses
NFA.

Regards,

Ben
Russ Cox

2007-09-15, 7:11 pm

> If an earlier rule starts with '$', does that mean that a subsequent
> rule starting with '^' has the '^' ignored?


A related question, perhaps the one you were asking, is whether once $
has matched, ^ can still match the same position in the case of an
empty line/string.

For example, ^$ clearly matches the empty string, but does $^ match it
too? What about ^$^, $^$, and ^^$$?

The answers you expect depend on whether you think about ^ and $ as
matching positions in a line or imaginary characters at the beginning
and end of the line. In the position model, you'd expect that for an
empty string, both match position 0, so /$^/ and all the others would
work just fine. In the imaginary character model, "abc" is really
"(imaginary-^)abc(imaginary-$)" and the empty string is really
"(imaginary-^)(imaginary-$)". Then you'd expect only /^$/ to match,
of the examples given above.

Most implementations use the position model, but regexp parsers
sometimes get in the way when trying to find out. In Perl, for
example, '' =~ '$$' is true but '' =~ /$$/ is not, because in the
latter, $$ is some interpolated variable. In most greps, ^ and $ are
only recognized as special at the beginning and end of the pattern, so
^$$ is a regexp matching a line containing just a dollar sign, while
in egrep, ^ and $ are always special, so ^$$ is usually a regexp
matching an empty line.

Russ
Joachim Durchholz

2007-09-16, 7:15 pm

jamin.hanson@googlemail.com schrieb:
> However, the real question is that if you allow '^' and '$' to occur
> anywhere in a regex (boost::regex works this way),


I may be missing something, but it seems to me that such a rule
wouldn't match anything if it has a nonempty pattern before the ^ or
after the $.

I.e. asd^jkl, while a perfectly valid regex, will never match, or will it?

> how you handle '^'
> and '$' clashes, because you may have declared a '$' rule before a '^'
> rule, yet my code always checks '^' before '$' regardless.


Some example would help. Rule order definitely doesn't affect lexing,
at least not unless you provide mechanisms that go beyond regular
expressions.

> As you have to check both possibilities on lookup (otherwise how can
> you ever match them ;-) ), the right thing to do appears to be to
> suppress the '^' if it occurs at a position in the rules that a '$'
> has already occurred at.


In my book, the Right Thing would be to spit out a warning somewhere.

> Note that using PERL rules is not the answer, as lexers use left-most
> longest and compile to a DFA. PERL uses leftmost precedence and uses
> NFA.


I'm not sure how "leftmost longest" and "leftmost precedence" relate.
DFA and NFA are certainly not at odds, since they are equivalent.

I'd still look at Perl whenever I'm unsure, simply because that's how
most people expect regexes to work. That doesn't mean you have to do
everything as Perl does, particularly if you have good reasons to do
otherwise :-)

Regards,
Jo
Chris F Clark

2007-09-18, 8:11 am

I doubt that there is a completely general consensus on what the ^ and
$ meta-characters do. Moreover, if one uses regular expressions for
searching (or on matching data with an internal record structure),
these other aspects come more into play.

However, I think that understanding what Perl does is a good starting
point. And, one important factor to consider is that Perl mentions
them applying at "record" boundaries, where the normal record is
synonymous with a "line" meaning the newline character(s) (which is
it's own bag of worms) is the boundary marker. However, one can use
Perl with other definitions of records. I'm not sure how far Perl
takes the concept of records, but I do know that it is more than just
lines in a file. I don't recall off-hand whether there are ways to
compare a variable holding a bag-of-records to a pattern. However,
given the nature of Perl I'd be more surprised if you couldn't do it,
than if you could.

So, what "should" ^ mean, it should match the start of a "record".
For a text file composed of lines, it means the start of the file, and
also after each occurrence of a newline. In a CSV (comma separated
value) file, it should be the start of the file, and before each
comma. In a file of fixed length records, it should be the start of
each file and at each modulo record length boundary (note, there will
be no characgter in the stream in that case). Similarly, if one has a
file of variable length records, where the record has a record that
contains a specific length field, you figure out where your record
boundaries are and make the ^ point to the start of each record--and
there may be no "character" explicitly representing the end of one
record and start of the next. For example in network protocols, a
packet boundary (which has no character representing it at all) is
generally considered a record boundary, if that is your problem
domain, you probably want your meanings of ^ and $ to include packet
boundaries.

Similarly, the $ represents the end of a "record". In the text file
case, this is the character before the newline. In the CSV case,
the character before the comma. And, so forth and so on. The end of
file is also usually considered the end of a record.

Now, as Russ Cox mentioned, the pattern ^$ matches a record that has
no characters in between the start and end, thus is must be an empty
record.

However, I would interpret the pattern $^ as matching the end of one
record and beginning of the next, thus, it matches the boundaries
between records, and does not match the beginning of the first record
(because there is no previous record that the $ is the end of) nor the
end of the last record (for the converse reason). I would not argue
with those that want to see the end-of-record character(s), e.g. "\n"
or "," in between the $ and ^ for the above case, as in $\n^ for
matching all record separating newlines that are not at the
end-of-file.

I would also allow some of Russ's more complex cases, such as ^$$,
which would mean start-of-record, end-of-record, end-of-record
(i.e. two empty records and the same as ^$^$, i.e. omitting a ^ after
a $ is unimportant, and also the same as ^^$ with a similar
rationale). I'm assuming that one doesn't have a hierarchical record
system where one has "nested" records which whould make the concept of
two consecutive starts might be sensible in an entirely different way
(start of a containing record and start of a nested record).

If you follow that rationale, you will see that the pattern 'abc^def'
is actually well-defined, and means 'abc' start-of-record 'def'.
Again, assuming flat record structures, that's the same as 'abc$^def'
or 'abc$def'. I believe I have used the last one in emacs and gotten
the expected results, where it dealt properly with DOS \r\n terminated
lines mixed with Unix \n terminated ones.

Hope this helps, (or atleast makes sense)
-Chris

****************************************
*************************************
Chris Clark Internet : compres@world.std.com
Compiler Resources, Inc. Web Site : http://world.std.com/~compres
23 Bailey Rd voice : (508) 435-5016
Berlin, MA 01503 USA fax : (978) 838-0263 (24 hours)
jamin.hanson@googlemail.com

2007-09-18, 8:11 am

On 15 Sep, 21:41, Joachim Durchholz <j...@durchholz.org> wrote:
> jamin.han...@googlemail.com schrieb:
>
>
> I may be missing something, but it seems to me that such a rule
> wouldn't match anything if it has a nonempty pattern before the ^ or
> after the $.
>
> I.e. asd^jkl, while a perfectly valid regex, will never match, or will it?


No that regex won't ever match.

>
>
> Some example would help.


I will have to hunt around for some good examples! Do you know of any
example lex specs what actually use ^ and $ at all?

>Rule order definitely doesn't affect lexing,
> at least not unless you provide mechanisms that go beyond regular
> expressions.


Yes they do! Rule ordering is used to determine which rule matched in
the event of ambiguity. That's why the more general rules such as [a-
z]+ (for example) come AFTER keywords in lex specs.

>
> In my book, the Right Thing would be to spit out a warning somewhere.


As I mentioned, boost::regex deals with ^ and $ anywhere in a regex
without complaining.

>
> I'm not sure how "leftmost longest" and "leftmost precedence" relate.


Here's the difference: [-+]?(\d+[.]?|\d*[.]\d+) in PERL will match 0.
when presented with 0.1

This is because it gives up as soon as the leftmost rule matches.
This is in contrast with a leftmost longest strategy (typically used
with a DFA representation) which will keep going until it can't match
anymore regardless of the ordering of the ored expressions.

> DFA and NFA are certainly not at odds, since they are equivalent.


In theory, but not the way PERL implements lookup.

> I'd still look at Perl whenever I'm unsure, simply because that's how
> most people expect regexes to work. That doesn't mean you have to do
> everything as Perl does, particularly if you have good reasons to do
> otherwise :-)


PERL is the wrong model for lexers. Extended POSIX is closer.
Sponsored Links







Also available: Server administration forum archive | Web Design forum archive | Software forum archive | Hardware reviews archive

Copyright 2008 codecomments.com