Home > Archive > Compilers > January 2007 > Interface (Grammar precedence/notation) question
You are viewing an archived Text-only version of the thread.
To view this thread in it's original format and/or if you want to reply to
this thread please [click here]
| Author |
Interface (Grammar precedence/notation) question
|
|
| Chris F Clark 2007-01-06, 7:29 pm |
| I am in the process of considering a new feature for a new version of
Yacc++ and I want some input on the notation I am planning to use, in
particular the precedence of the "and" and "or" operators I will be
using. Typically, "and" binds tighter than "or", so that clauses like
"a and b or c", mean if c is true the whole clause is true otherwise
both a and b must be true for the whole clause to be true. In the
notation I am considering it would be more convenient (as I will
explain) to reverse this precedence, so that "a and b or c" means a
mut be true for the whole clause to be true and also either b or c
must also be true.
<For those of you who want to quote a minimal amount and retain
context, here is the core line to quote:>
The question is, will this reversing of the precedence of and and or
be a bad interface idea?
<end question to quote>
The notation I am considering looks something like this:
family nterm1, nterm2, nterm3: ll or lr and mostly lr;
The grammar for the notation looks like this:
familyrule: "family" nterm ("," nterm)* ":" familyand ";";
familyand: familyor ("and" familyand)*;
familyor: familyprinciple ("or" familyprinciple)*;
familyprinciple: familyname | "(" familyand ")";
familyname: "mostly"?
( "any"
| "ll" number? // also allow parens around number
| "lr" number?
| "lalr" number?
| "slr" number?
| "glr"
| "peg"
| "regular"
| "no" "recursion"
// more alternatives may be added
);
Now, the meaning of a familyrule is that the parser generator should
check the non-terminals listed for obeying (being in) the grammar
class listed.
Thus, a typical "yacc" grammar would be annotated like:
family yacc_grammar: lalr 1;
This means that the entire grammar rooted at the non-terminal
yacc_grammar should describe a language that is LALR(1) parseable,
that is that a LALR(1) parser can be constructed which parses the
language described by the grammar. If the grammar doesn't meet that
constraint (the grammar isn't LALR(1)), an error should be generated.
An "antlr" (or "pccts") grammar would be annotated:
family antlr_grammar: ll;
Thus, an antlr grammar must be parseable by an LL parser, but it
doesn't need to be an LL(1) parser.
A rats! grammar would be annotated:
family rats_grammar: peg;
Finally, a glr grammar would be anotated:
family glr_grammar: glr;
If one wanted a grammar that was parseable by mosy any parser
generator, one might specify it as:
family nice_grammar: ll(1) and slr(1);
This notation means that the grammar nice_grammar must be both LL(1)
and SLR(1), which would mean it would be acceptable to nearly any
tool.
Finally, perhapes we have a grammar that all we care about is that it
is parseable by some other tool and we don't care whether that tool is
yacc-like or pccts-like. We might annotate that grammar as.
family ok_grammar: ll or lalr(1);
This means that ok_grammar can be either LL or it can be LALR(1) and
no error will be generated. However, we don't care which language
class the grammar belongs to, as long as it is one or the other.
Finally, if we didn't want the grammar checked, we would mark it
any.
family unchecked_grammar: any;
No errors are reported for grammars marked any. If they don't parse
what you expect, well you told us not to warn you.
Ok, that's the simple cases. Let's deal with grammars inside grammars
(sub-grammars) so to speak. Suppose, we have a grammar
(inside_grammar) that we know is LL(1) and we want to embed it in a
larger grammar (outside_grammar), and we want that larger grammar to
be LALR(1). We mould mark that case as:
family inside_grammar: LL(1);
family outside_grammar: LALR(1);
Now, note that because the inside_grammar is embedded inside the
outside_grammar, it must be LALR(1), because otherwise the checks on
the outside grammar will fail.
If we wanted to relax that constraint, we would write:
family inside_grammar: LL(1);
family outside_grammar: MOSTLY LALR(1);
That would mean the parser generator would start checking
outside_grammar using the LALR(1) rules, but when it came to
inside_grammar, since it was marked LL(1), it would apply the LL(1)
checks instead. (There's a clever paper by one of Nigel Horspool's
students whose name escapes me now, since I don't have it in front of
me, that shows how one can do exactly that--well, with a *little*
handwaving.)
The point of the "mostly" clause is that it constrains the top level
non-terminal to be of a certain class (and any unmarked non-terminals
below it), but allows phrases inside to follow different rules, as
long as they are marked as to what rules they follow. A real common
use might be to make an LL grammar for the bulk of the language, but
have expressions and if_statements be LR.
family grammar: mostly ll;
family expression, if_statement: lr;
The nice thing about specifying something that way, is that one gets
assurances that the vast majority of the language is easy to parse, as
it is LL, but one has the freedom to use the more powerful LR parsing
algorithm in the places where one needs it, so that one doesn't have
to create all those nit-picky non-terminals to encode the precedence
hierarchy.
<another quotable question:>
What should I call LL grammars that depend on the hack that allow them
to deal with the if-then-else ambiguity?
<end quotable question>
Technically, such grammars aren't actually LL. However, since most LL
parser generators can deal with them, it seems wrong to not allow one
to express that a grammar fits in the class the such parser generators
can handle. I'm thinking of calling them "near LL". However, if there
is an established term for them (or if near LL is used for something
else), I'm open to other names.
And, now to why I want the precedence reversed. Suppose one wants a
gramar where any fragment should either be LL or LALR, but the top set
of rules should be LALR. It seems natural to write it as thus.
family grammar: mostly lalr and lalr or ll;
or
family grammar: lalr or ll and mostly lalr;
Instead of requiring parenthesis as in:
family grammar: mostly lalr and (lalr or ll);
However, I am well aware that little issue of precedence can make
languages harder to write (it was an often laid critique of Wirth's
languages that with each one he got a different set of precedences
wrong), and the precedence I am proposing is "reversed" from normal
Boolean logic. My worry is that people will not trust the unusual
precedence and simply parenthesize all and/or combinations.
I guess if I implicitly "and" multiple family rules, then one could
write it as:
family grammar: lalr or ll; // any part of this grammar must be lalr or ll
family grammar: mostly lalr; // the top rule(s) must be lalr
Is that more clear?
Any thoughts?
-Chris
****************************************
*************************************
Chris Clark Internet : compres@world.std.com
Compiler Resources, Inc. Web Site : http://world.std.com/~compres
23 Bailey Rd voice : (508) 435-5016
Berlin, MA 01503 USA fax : (978) 838-0263 (24 hours)
------------------------------------------------------------------------------
| |
| Hans-Peter Diettrich 2007-01-06, 7:29 pm |
| Chris F Clark wrote:
> I am in the process of considering a new feature for a new version of
> Yacc++
I'm not familiar with Yacc nor Yacc++, so my following remarks may be of
little help.
> The question is, will this reversing of the precedence of and and or
> be a bad interface idea?
Yes, definitely.
I'd suggest a different wording, at least.
> family inside_grammar: LL(1);
> family outside_grammar: LALR(1);
>
> Now, note that because the inside_grammar is embedded inside the
> outside_grammar, it must be LALR(1), because otherwise the checks on
> the outside grammar will fail.
Some kind of inheritance?
> If we wanted to relax that constraint, we would write:
>
> family inside_grammar: LL(1);
> family outside_grammar: MOSTLY LALR(1);
>
> That would mean the parser generator would start checking
> outside_grammar using the LALR(1) rules, but when it came to
> inside_grammar, since it was marked LL(1), it would apply the LL(1)
> checks instead.
I wonder what's the purpose of the grammar classification.
Shall it describe the properties (and syntax?) of existing grammars, or
shall it indicate what grammar class is acceptable for the intended parser?
In the latter case, what's the purpose of such restrictions?
As indicated above, wouldn't it be sufficient to specify an
compatibility tree (or graph) for the grammar types? Then the parser
type can be determined from the given grammars, and it can be restricted
to some "level" in the tree. This would lead to a distinction between
the grammar and parser types, where grammar types describe the given
grammars, and the parser type describes the desired parser.
> family grammar: mostly ll;
> family expression, if_statement: lr;
>
> The nice thing about specifying something that way, is that one gets
> assurances that the vast majority of the language is easy to parse, as
> it is LL, but one has the freedom to use the more powerful LR parsing
> algorithm in the places where one needs it, so that one doesn't have
> to create all those nit-picky non-terminals to encode the precedence
> hierarchy.
IMO precedence and binding can be added to LL grammars, in the same way
as they already can be added to LR grammars. It's a matter of the
implementation, whether such an annotated LL grammar will be
auto-expanded into a traditional LL grammar, internally, or whether it
will be implemented immediatly in a specialized rule (subparser) class.
I'd also favor an optional disambiguation by sequential application of
alternatives (in LL parsers), turning LL(1) errors into non-fatal
warnings. The resulting code would look like:
if (token in FirstOfAlt1) Alt1()
else if (token in FirstOfAlt2) Alt2()
...
instead of the traditional
case (token) of {...}
> What should I call LL grammars that depend on the hack that allow them
> to deal with the if-then-else ambiguity?
Also the implied handling of the dangling else, by a "longest match"
attribute, IMO is applicable to every grammar type?
> Technically, such grammars aren't actually LL. However, since most LL
> parser generators can deal with them, it seems wrong to not allow one
> to express that a grammar fits in the class the such parser generators
> can handle. I'm thinking of calling them "near LL". However, if there
> is an established term for them (or if near LL is used for something
> else), I'm open to other names.
I'd suggest ELL, according to EBNF ;-)
> And, now to why I want the precedence reversed. Suppose one wants a
> gramar where any fragment should either be LL or LALR, but the top set
> of rules should be LALR. It seems natural to write it as thus.
>
> family grammar: mostly lalr and lalr or ll;
> or
> family grammar: lalr or ll and mostly lalr;
>
> Instead of requiring parenthesis as in:
>
> family grammar: mostly lalr and (lalr or ll);
What about :
family grammar: mostly lalr, lalr or ll;
where IMO "mostly" describes the parser type, the remainder the grammar
types.
> I guess if I implicitly "and" multiple family rules, then one could
> write it as:
>
> family grammar: lalr or ll; // any part of this grammar must be lalr or ll
Why that? What's the purpose or the consequences of according
(mis-)behaviour?
> family grammar: mostly lalr; // the top rule(s) must be lalr
Here:
family parser: lalr;
As a general note, you should make clear what's the purpose of the
presented syntax, before going into details. I'm missing indications,
when e.g. grammar syntax, checks or extensions are addressed, and where
the parser type and generation is of concern.
Apart from LL, LR etc. grammar types, the grammar syntax should be
addressed. IMO EBNF will allow to describe both LL and LR grammars,
whereas BNF may exclude an LL interpretation, due to explicitly unrolled
repetitions (into left or right recursion).
DoDi
| |
| Hans Aberg 2007-01-06, 7:29 pm |
| Chris F Clark <cfc@shell01.TheWorld.com> wrote:
> The question is, will this reversing of the precedence of and and or
> be a bad interface idea?
It will be no worse than interchanging the meaning of 'true' and 'false',
or '0' and '1', or 'US' and 'China'. That is,_almost every user will do it
wrong, even if they try hard to do it right. :-)
--
Hans Aberg
| |
| Chris F Clark 2007-01-06, 7:29 pm |
| I wrote:
>
Hans-Peter Diettrich <DrDiettrich1@aol.com> writes:
[color=darkred]
>
> Yes, definitely.
> I'd suggest a different wording, at least.
Hans said:
H> It will be no worse than interchanging the meaning of 'true' and 'false',
H> or '0' and '1', or 'US' and 'China'. That is,_almost every user will do it
H> wrong, even if they try hard to do it right. :-)
After, hearing both yours and Hans' replies, I am going to not do
something non-standard. The point that most users would simply get it
wrong even when trying is most poignant and credible. Fortunately,
your reply suggests an alternate solution that preserves the
simplicity of what I want and doesn't require me to mess-up the and/or
precedence in the process. *THANKS!!!* I'll point out the key idea
later.
>
> Some kind of inheritance?
Yes, effectively, although we won't call it inheritance because we
alrady have inehritance in grammars used a completely different way.
If you have a rule where non-terminal a is defined in terms of
non-terminal b. The restrictions of non-terminal a are also applied
to b, unless the restriction on non-terminal a was a "mostly"
restriction and b had a specific restriction declared for it (like an
override). If the restrctions on non-terminal a were not "mostly"
restrictions, they apply to non-terminal b, even if b has specific
restrictions declared.
That makes the restrictions work "right". If you want the whole
grammar to be LL(1), then all of its parts must be LL(1). If you want
some part of a grammar to be LR(1), then you want all the sub-parts of
that grammar to be LR(1). If you embed one type of grammar in the
other, you wnat the outer checks applied to the inner grammar and thus
the inner grammar must be both.
The idea of the "mostly" restriction, is for cases where you know that
most of your grammar should follow some standard (i.e. most grammars
one naturally writes are both LL(1) and LR(1)), but you may have
certain problem parts that you want verified to a looser standard, or
perhaps not verified at all because you can't write them in a
class conformant manner and you have hand checked them.
> I wonder what's the purpose of the grammar classification.
> Shall it describe the properties (and syntax?) of existing grammars, or
> shall it indicate what grammar class is acceptable for the intended parser?
The purpose is to help users write "trusted" grammars. If your
grammar is LL or LR, one knows that it is unambiguous and a tool can
verify that your grammar fits those requirements.
However, if one is writing a more general grammar, a predicated one, a
PEG (parsing expression grammar), or a GLR grammar, one might write an
ambiguous grammar without knowing it. Moreover, we cannot
mechanically check such grammars for consistency.
Fortunately, if you don't have any recursive productions that escape
from LL or LR sub-parts on ones grammar, then you can verify the
sub-parts and be certain that the outer levels which are using
backtracking are still relatively "safe"--you won't have complicated
nested ambiguous cases, all your ambiguities will be simple.
> As indicated above, wouldn't it be sufficient to specify an
> compatibility tree (or graph) for the grammar types? Then the parser
> type can be determined from the given grammars, and it can be restricted
> to some "level" in the tree. This would lead to a distinction between
> the grammar and parser types, where grammar types describe the given
> grammars, and the parser type describes the desired parser.
My intent is to have a separate declaration that describes the "output
type" such as recursive descent, LL tables, LR tables, GLR tables,
DFA, or NFA. The first implementation is likely to support only 1 or
2 output types.
> IMO precedence and binding can be added to LL grammars, in the same way
> as they already can be added to LR grammars. It's a matter of the
> implementation, whether such an annotated LL grammar will be
> auto-expanded into a traditional LL grammar, internally, or whether it
> will be implemented immediatly in a specialized rule (subparser) class.
I'm pretty certain that you are right about this. I'll know more as I
work through the details.
> I'd also favor an optional disambiguation by sequential application of
> alternatives (in LL parsers), turning LL(1) errors into non-fatal
> warnings. The resulting code would look like:
> if (token in FirstOfAlt1) Alt1()
> else if (token in FirstOfAlt2) Alt2()
> ...
> instead of the traditional
> case (token) of {...}
Yes, I've thought about that as a useful thing. I don't know if I
will incorporate it into what I'm doing, as I don't know if I'm even
going to have a "recursive descent" output option yet.
>
> Also the implied handling of the dangling else, by a "longest match"
> attribute, IMO is applicable to every grammar type?
That thought is worth considering. The more things that I factor the
better the categorization scheme will be.
> What about :
> family grammar: mostly lalr, lalr or ll;
I like this comma notation, it is a nice implied "and" that doesn't
use the "and" word and thus won't get with the "boolean and"
which can have normal precedence. It works well with
I can then talk about the list of rules being implicitly anded
together (i.e. they all apply) and yet still leave the normal Boolean
"and" and "or" with their normal precedence. That will be natural and
not confusing.
[color=darkred]
> Why that? What's the purpose or the consequences of according
> (mis-)behaviour?
Worth emphasizing: the purpose of this notation is to facilitate error
checking of the grammar. The notation that Yacc++ supports allows one
to write ambiguous grammars, which can be a good thing, it simplifies
the writing of grammars. Except that with predicates, you can get
subtle ambiguous cases that may not mean what you think. This will be
more true as I extend Yacc++ into handling the GLR and PEG grammar
classes.
In fact, it was the PEG grammar class that got me thinking about this.
When one writes a PEG grammar there is an implicit ordering of rules
that affects how things are parsed. And, in many cases this is fine.
In fact, it is part of why PEGs are "composable". However, it isn't
fine when one writes a recursive case where the recursion depends on
an ordered sub-non-terminal. In that case, one can get the type of
ambiguity that is the same thing I don't like about hand-written
recursive descent. That is you can get the type of ambiguity where
you aren't really sure what is or is not in the language being parsed
(and if it is parsed what it is recognized as).
That problem doesn't occur if all the recursive parts of ones grammars
are LL or LR. Thus, if one had a tool where one could specify that
the different fragments were in the various classes (what I proposed
above) and the tool checks those constraints, and you also specified
that the parts outside the checked classes weren't recursive, you
would eliminate the worst ambiguity problems--the nested ones.
You still wouldn't know whether two fragments of the grammar parsed
the same input sequence, but at least that problem would only be at
the "gross" level and there the case where something might be both a
declaration and an expression might be an acceptable ambiguity that
you want resolved in a specific way.
Most importantly for me, it can help one "maintenance error" proof the
grammar. One does that by marking the parts of the grammar one knows
are supposed to be well-formed (e.g. LL and/or LR), so that if someone
makes a change that breaks that, the change is flagged.
The reason behind this is that some Yacc++ grammars are "venerable"
these days. I have clients who have grammars where none of the
original authors are still on the team. Moreover, they are large and
being maintained often by more inexperienced developers. Sometimes,
even I have trouble figuring out what the consequences of a given
subtle change to a grammar does to the language parsed. Keeping these
grammars working can be a major task. The more I can help the users
write sound bug-free code the better everyone's life is.
> Apart from LL, LR etc. grammar types, the grammar syntax should be
> addressed. IMO EBNF will allow to describe both LL and LR grammars,
> whereas BNF may exclude an LL interpretation, due to explicitly unrolled
> repetitions (into left or right recursion).
Yes, Yacc++ already accepts EBNF with extensions. And, I agree that
writing things as regular expressions is usually better than explicit
recursions.
Again, thanks for the detailed feedback. I hope my responses make it
more clear what I am attempting to do.
BTW, it is my intent that a version of this will be released as free
(open source) software (either GPL or BSD, I'm not sure yet), as well
as under the current license. I will let you know more as it gets
closer to reality.
-Chris
****************************************
*************************************
Chris Clark Internet : compres@world.std.com
Compiler Resources, Inc. Web Site : http://world.std.com/~compres
23 Bailey Rd voice : (508) 435-5016
Berlin, MA 01503 USA fax : (978) 838-0263 (24 hours)
|
|
|
|
|