For Programmers: Free Programming Magazines  


Home > Archive > Compilers > March 2005 > Q2. Why do you split a monolitic grammar into the lexing and parsing rules?









You are viewing an archived Text-only version of the thread. To view this thread in it's original format and/or if you want to reply to this thread please [click here]

 

Author Q2. Why do you split a monolitic grammar into the lexing and parsing rules?
valentin tihomirov

2005-02-20, 8:58 pm

Looks like this is a group discussing parsing-translating issues as a matter
of compiler front-ends. So, the problem is to enable using reserwed keywords
as identifiers. The problem is purely artificial; the input "(name name)" is
can be generated by

name -> '(' 'name' ID ')';
ID -> ('a'..'z')+;

CF grammar; therefore, it must be recognizable. However, the lexer will turn
the 2nd "name" into a literal token which is different from ID token. As a
result, the token stream "LP NAME NAME RP" will not match the input. The
issue merely does not exist for unified grammar recognizes. On the on hand,
it looks natural that we need to combine letters into words as it is easier
to process a text as a stream of words and separators. In addition, the
grammar of natural languages explicitly introduces the alphabet (set of
letters) and dictionary (set of words). Nevertheless, a solid grammar does
not deny constructing high-level terms from the letters or other terms.
Actually, a grammar is more powerful device than a stream of words as they
are not limited by two levels. The ANTLR's lexer is based on the same
algorithms as the parser; that is, it can perform a complete translatorion.
So, I do not understand why do we need the artificial obstacle, the 2nd
level?
[Try writing a single-level parser that disregards spaces and optional spaces in
the usual way. A separate lexer makes that a whole lot easier. -John]

Ron Pinkas

2005-02-28, 3:59 am

> So, I do not understand why do we need the artificial obstacle, the
> 2nd level?


SimpLex (http://sourceforge.net/projects/simplex) is a generic, open source,
lexical engine, which does not have such restriction. It allows context
rules to differentiate between reserved words, as reserved words, vs.
reserved words as Identifiers. For example, in the xHarbour compiler
(http://www.xHarbour.org) it replaced a Flex based scanner, specifically to
eliminate such restriction. Interestingly it produced a faster scanner,
which is about 1/4 the size of the original Flex generated scanner.

Ron
Vladimir

2005-02-28, 3:59 am

As it is known, it is not possible to write CF grammar, which
describes the language with identifiers declaration before their
using. This condition cannot be expressed by CF grammar but only by
context sensitive one. So, as most of syntax analysers are written to
parse CF languages the condition 'declaration before using" is dropped
from the grammar and implemented as "semantic" property of the
compiler in name tables. Because lexer returns one terminal "ID" with
the ID's string as a "semantic" value. It is rather easier to use such
the strategy than incorporate names parsing in context sensitive
grammar. Another reason to use lex analyser is the identifier's
language is regular and can be parsed by using finite state machines.
It is obvious way of homo sapiens to devide some complex task on the
number of easier ones.


"valentin tihomirov" <spam@abelectron.com> wrote in message news:<05-02-087@comp.compilers>...
> Looks like this is a group discussing parsing-translating issues as a matter
> of compiler front-ends. So, the problem is to enable using reserwed keywords
> as identifiers. The problem is purely artificial; the input "(name name)" is
> can be generated by
>
> name -> '(' 'name' ID ')';
> ID -> ('a'..'z')+;
>
> CF grammar; therefore, it must be recognizable. However, the lexer will turn
> the 2nd "name" into a literal token which is different from ID token. As a
> result, the token stream "LP NAME NAME RP" will not match the input. The
> issue merely does not exist for unified grammar recognizes. On the on hand,
> it looks natural that we need to combine letters into words as it is easier
> to process a text as a stream of words and separators. In addition, the
> grammar of natural languages explicitly introduces the alphabet (set of
> letters) and dictionary (set of words). Nevertheless, a solid grammar does
> not deny constructing high-level terms from the letters or other terms.
> Actually, a grammar is more powerful device than a stream of words as they
> are not limited by two levels. The ANTLR's lexer is based on the same
> algorithms as the parser; that is, it can perform a complete translatorion.
> So, I do not understand why do we need the artificial obstacle, the 2nd
> level?
> [Try writing a single-level parser that disregards spaces and optional spaces in
> the usual way. A separate lexer makes that a whole lot easier. -John]


Vidar Hokstad

2005-02-28, 3:59 am

valentin tihomirov wrote:
> So, I do not understand why do we need the artificial obstacle, the
> 2nd level?


There's nothing inherent in parsing that prevents you from writing
parsers without any distinction between lexical symbols and compound
rules. In fact I've written several parsers that doesn't enforce any
difference at all.

Some points to keep in mind:
- It's often simpler to split the two. You can write a lexer and verify
it/debug it separately, and then nicely layer the parser on top.
- It's often nicer for error reporting to separate the two - you might
want to report on the full token expected, and point the user to a
token boundary.
- When hand writing a parser it is often much easier to understand if
the higher levels deals with a stream of tokens instead of having to
take into account issues such as whitespace, comments and
disambiguation of token boundaries that can be filtered out in the
lexer.
- The split often has little practical effect, but presents a
conceptual separation that sometimes simplify understanding of the
grammar.
- It's a split that at least I believe encourages good design, in that
it makes you think consciously about the recognition of tokens and how
to reduce ambiguity in the parser and promotes context free grammars.

Vidar
Norm Dresner

2005-02-28, 3:59 am

"valentin tihomirov" <spam@abelectron.com> wrote in message
> Looks like this is a group discussing parsing-translating issues as
> a matter of compiler front-ends. So, the problem is to enable using
> reserwed keywords as identifiers. ....


> So, I do not understand why do we need the artificial obstacle, the
> 2nd level?


> [Try writing a single-level parser that disregards spaces and
> optional spaces in the usual way. A separate lexer makes that a
> whole lot easier. -John] >


The answer to your subject question for me is "to reduce complexity".
There was (many years ago) a tradition in modular programming that no
function should be longer than a single printed page. It is a good
measure of what an ordinary human can comprehend easily. Well, we
(almost never) use printed pages anymore and my rule is that a
function shouldn't be longer than a few screens. Applying the same
principle to grammars, once a grammar gets beyond a certain point in
complexity it's easier to comprehend what's going on (and much easier
to find problems) if there's a split into the lexical and syntactical
halves -- at least for me.

Norm

Roy Haddad

2005-03-04, 8:58 pm

valentin tihomirov wrote:
> Looks like this is a group discussing parsing-translating issues as a matter
> of compiler front-ends. So, the problem is to enable using reserwed keywords
> as identifiers. The problem is purely artificial; the input "(name name)" is
> can be generated by
>
> name -> '(' 'name' ID ')';
> ID -> ('a'..'z')+;
>
> CF grammar; therefore, it must be recognizable. However, the lexer will turn
> the 2nd "name" into a literal token which is different from ID token. As a
> result, the token stream "LP NAME NAME RP" will not match the input. The
> issue merely does not exist for unified grammar recognizes.


I'm assuming this would be like being able to do this in C:
int else = 4;
printf("%i",else);

A character is a different class of object than a word, and the two
operate on different levels. Tokens that are considered to have direct
meaning, like ELSE, IF, WHILE, etc., should be differentiated from
collections of characters: 'e','l','s','e', and etc. This is how we
think of it anyway (you don't think of words as collections of
characters or sounds - otherwise it would be just as easy to memorize a
passage in an unknown language as a passage in your native language),
and because of this, with a unified grammar I would want to have all
"words" constructed out of "letters" and nothing else derived from
"letters", so there would be no practical difference between a separate
lexer and parser and a unified parser...

So I don't think the problem is artificial, or at least the problem is
not artificial in quite that way.

I the most natural solution I can think of that works purely by
grammar is to make the lexer and parser non-deterministic, so 'else'
above would come out of the lexer as the set { ID, ELSE
}. Practically, you could probably factor out the non-determinism from
there with the usual method of replacing every set with multiple
elements by a single token, and then creating rules from the rules for
each single token. This may result in a nasty grammar, but I don't
think this method will fail where any other will succeed.

So, for example, from this:

IFBLOCK -> IF STATEMENT BLOCK [ELSE BLOCK]
VARDECL -> TYPE ID

we would get this:

IFBLOCK -> IF STATEMENT BLOCK [ELSISH BLOCK]
VARDECL -> TYPE ELSISH
VARDECL -> TYPE ID

where ELSISH is the single token version of { ID, ELSE } (since there is
no way for a lexer to read an ELSE without also recognizing ID, we don't
need to keep the original IFBLOCK rule.)

Sponsored Links







Also available: Server administration forum archive | Web Design forum archive | Software forum archive | Hardware reviews archive

Copyright 2008 codecomments.com