Home > Archive > Compilers > April 2006 > Writing a compiler compiler
You are viewing an archived Text-only version of the thread.
To view this thread in it's original format and/or if you want to reply to
this thread please [click here]
| Author |
Writing a compiler compiler
|
|
| Vladimir Lushnikov 2006-04-17, 4:06 am |
| Hello,
Firstly, let it be known that I am quite new to the subject of parser
generators, and having written only simple parsers for frivoulous
grammars by hand have little experience.
My objective is thus - to create a parser generator that will
eventually generate a parser for a dynamic language. However, my
undestanding of different parser types and distinctions between is
minimal; but as the web is a good enough resourse please assume my
knowledge of those areas.
The target for this parser generator would be a language similar to
C#, and I have found Jay (a yacc clone to C# and Java). The initial
implementation language is to be C++.
So here goes my question - how do you start writing a parser generator?
I am probably considering a LALR parser, but what are the main
differences between the different types? Specifically, why is an LR
parser unable to parse Python or C++? [source: Wikipedia]
Thank you for your time,
Vladimir Lushnikov
[These topics are all covered in compiler textbooks, many of which are
listed in the FAQ. Assuming by "dyamic language" you mean that the
syntax can change on the fly, you should learn about Early and Tomita
parsers, and pay particular attention to the reasons that even though
they're technically perfectly sound, few people use them. As to why
an LR parser can't parse Python or C++, LR parsers only handle a
subset of BNF. In particular they can't handle the ambiguity of C++
declarations. -John
| |
| Hans Aberg 2006-04-17, 7:05 pm |
| John Levine wrote:
> As to why an LR parser can't parse Python or C++, LR parsers only
> handle a subset of BNF. In particular they can't handle the
> ambiguity of C++ declarations.
There appears to be techniques around this for C++, getting even down to
LALR(1):
http://www.parashift.com/c++-faq-li....html#faq-38.11
--
Hans Aberg
| |
| Russ Cox 2006-04-17, 7:05 pm |
| > So here goes my question - how do you start writing a parser generator?
> I am probably considering a LALR parser, but what are the main
> differences between the different types? Specifically, why is an LR
> parser unable to parse Python or C++? [source: Wikipedia]
I think you start writing a parser generator by writing a parser.
Once you have the parsing details down, writing the traditional
yacc interface on top is pretty simple. The hard part is getting
the parsing engine running in the first place.
Even if you need something more powerful than LALR, I'd start
with LR(0) and SLR just to cement your understanding of
whta's going on. I recently implemented a GLR parser.
It was only a reasonable amount of work because I first
made sure I understood LR(0) and then SLR and LR(1) well.
Although few people do use Earley and Tomita parsers in
practice now, I think general approaches, especially GLR,
are gaining ground.
Some references I have found useful:
John Aycock and Nigel Horspool,
Directly-Executable Earley Parsing, CC 2001
John Aycock and Nigel Horspool,
Faster Generalized LR Parsing, CC 1999.
Bryan Ford,
Parsing Expression Grammars: A Recognition-Based Syntactic Foundation,
POPL 2004
Scott McPeak and George C. Necula,
Elkhound: a Fast, Practical GLR Parser Generator,
CC 2004
Scott McPeak,
Elkhound: a Fast, Practical GLR Parser Generator,
UC Berkeley Tech Report UCB/CSD-2-1214, December 2002
(a longer more detailed version of the CC paper)
Russ
| |
| Quinn Tyler Jackson 2006-04-17, 10:02 pm |
| Vladamir Lushnikov said:
> My objective is thus - to create a parser generator that will
> eventually generate a parser for a dynamic language. However, my
> undestanding of different parser types and distinctions between is
> minimal; but as the web is a good enough resourse please assume my
> knowledge of those areas.
> The target for this parser generator would be a language similar
> to C#,...
and our moderator noted:
> [These topics are all covered in compiler textbooks, many of which are
> listed in the FAQ. Assuming by "dyamic language" you mean that the
> syntax can change on the fly, you should learn about Early and Tomita
> parsers, and pay particular attention to the reasons that even though
> they're technically perfectly sound, few people use them. As to why
> an LR parser can't parse Python or C++, LR parsers only handle a
> subset of BNF. In particular they can't handle the ambiguity of C++
> declarations.] -John
Chapter 9 (section 9.3.3) of Adapting to Babel (see sig line below for link)
covers the parsing of Perl and C++ using adaptive grammars, and deals with
C++ declarations, including the one case in C++ where declarations can
follow the first use of a call (that is, member functions called by member
functions in inline member functions in a class declaration). The C++
grammar used in the experiments has some 292+ productions, but it handles
quite a lot of C++'s constructions, and does so without any disambiguating
code. In a sense, a language like C++ might be said to be "mildly
dynamic" -- although this might be debated. (For instance, i++ is
syntactically legal, but semantically so only if the type of i has the ++
operator defined.)
Anyone interested in dynamic languages might, therefore, consider ATB as a
newly available look at such parsers.
Part of what makes writing grammars for such languages more manageable is
the fact that they can be developed in an IDE, much like programming
languages, and "debugged" in minute detail (and profiled) against test
cases. To date, the engine in its current incarnation has shown no
real-world language parses in O(n^3) or greater time, and in many cases, on
real-world languages, expensive productions are mitigated by their
locality -- that is, the most expensive productions tend to fire only
rarely, and thus not slow down the whole. (Of course, in theory-space, such
productions would predominate the big-O, but in practice-space, there is no
such thing as an infinitely long input.)
Perl in particular has some constructions that make adaptive grammars very
useful. The C# grammar put together in this formalism didn't require a
single adaptive production (and Vladamir mentions his goal is a language
similar to C#), so it is likely that any adaptive productions would be
minimal, and thus mitigated.
Cheers.
--
Quinn Tyler Jackson FRSA MBCS
http://members.shaw.ca/qtj/
Author of Adapting to Babel:
http://members.shaw.ca/qtj/writing/...ingToBabel.html
|
|
|
|
|