For Programmers: Free Programming Magazines  


Home > Archive > Compilers > June 2007 > Am I parsing this correctly? (when do I build the symbol table)









You are viewing an archived Text-only version of the thread. To view this thread in it's original format and/or if you want to reply to this thread please [click here]

 

Author Am I parsing this correctly? (when do I build the symbol table)
Ryan Dary

2007-05-19, 8:06 am

I'm working on a phase of my compiler. The language is similar to
Visual Basic, and I want to know if I'm parsing this correctly.

Right now, I am walking through the tokens, and building up a tree
like structure so that I have nodes like Program, Function, Class,
etc. Then, as I'm parsing some of the source lines, I am actually
skipping them right now because it seems impossible to know what the
tokens mean until I have built a symbol table of all the functions,
etc. Perhaps I have this phase a bit backwards...

Take a look at this source code, and then I'll explain some problems
that I'm having...


Sub do_something_else( byRef i As Integer )
i = i * 10
End Sub

Function do_something( i As Integer ) As Integer
Dim a As Integer = i + 10
do_something_else i
Return a * 5
End Function


My understanding of the syntax tree is that I'm not supposed to be
worried about the "meaning" of the code, but rather the "structure" of
the code. So, I'm not building a symbol tree at this phase... the
problem with that seems to be that I'm unable to make heads or tails
of the lines of code within the function declaration. For instance,
as I parse the Dim statement (which is used to declare a variable), I
am able to parse the components "Dim a As Integer = <exp>" where the
<exp> (expression) seems to be impossible to really parse without
having a symbol table thus far in the parsing. I wouldn't know if "i"
is a variable or a function or a constant, because I don't have any
way of looking it up in a symbol table. So, should I be building the
symbol table as I'm parsing the syntax tree from the tokens?

So, hopefully, my nonsense explanation of my problem will make sense to
someone who can shed some light on my question. Thanks in advance.

- Ryan Dary
SM Ryan

2007-05-20, 10:13 pm

Ryan Dary <iecc@ryandary.com> wrote:
# I'm working on a phase of my compiler. The language is similar to
# Visual Basic, and I want to know if I'm parsing this correctly.

Unclean! Unclean!

# as I parse the Dim statement (which is used to declare a variable), I
# am able to parse the components "Dim a As Integer = <exp>" where the
# <exp> (expression) seems to be impossible to really parse without
# having a symbol table thus far in the parsing. I wouldn't know if "i"
# is a variable or a function or a constant, because I don't have any

Rather than make a parse tree node labelled 'variable' or 'function' or
'constant', make a node labelled 'identifier' and then come back later
and edit the node with parse tree information. You only have a problem
when you cannot proceed with the parse when confronted by a identifier
with no further information. For example in C, the construct
f(X);
can be either a function call, if f is a function, or declaration of X
if f is type name. To parse C you have to maintain a stack of typedef
names as you parse. Algol68 has a similar problem with tab tag = value
is (tab tag) = value if tab is operator, or a declaration tab (tag = value)
if tab is a mode indication; further complicated because Algol68 allows
forward references.

If in your language you know that Dim must be followed by a variable name
and that = must be followed by an expression and an identifier in an
expression must be an application of variable, function, or constant
defined elsewhere, then you can parse without knowing exactly what it
will be defined as.

# way of looking it up in a symbol table. So, should I be building the
# symbol table as I'm parsing the syntax tree from the tokens?

If the scope rules forbid any use of a identifier before it is defined,
you can possibly combine the symbol identification with the parse; this
is case with Pascal, Ada, and now C. If the scope rules allow forward
references, as Algol 60 and 68, you have to complete the parse to ensure
you have all definitions available, which might follow usage.

--
SM Ryan http://www.rawbw.com/~wyrmwif/
There are subtler ways of badgering a witness.
Dmitry A. Kazakov

2007-05-20, 10:13 pm

On Thu, 17 May 2007 21:43:38 -0700, Ryan Dary wrote:

> Sub do_something_else( byRef i As Integer )
> i = i * 10
> End Sub
>
> Function do_something( i As Integer ) As Integer
> Dim a As Integer = i + 10
> do_something_else i
> Return a * 5
> End Function
>
> My understanding of the syntax tree is that I'm not supposed to be
> worried about the "meaning" of the code, but rather the "structure" of
> the code. So, I'm not building a symbol tree at this phase... the
> problem with that seems to be that I'm unable to make heads or tails
> of the lines of code within the function declaration. For instance,
> as I parse the Dim statement (which is used to declare a variable), I
> am able to parse the components "Dim a As Integer = <exp>" where the
> <exp> (expression) seems to be impossible to really parse without
> having a symbol table thus far in the parsing. I wouldn't know if "i"
> is a variable or a function or a constant, because I don't have any
> way of looking it up in a symbol table.


Why do you want to know anything more than "'i' is an identifier"? The only
thing you really need to know when parsing <expr> is the tokens of.
Identifier is a token and usually the language defines a formal way of
recognizing identifiers in the source without looking into any symbol
tables. So it could be something like: starts with a letter, has digits,
letters and underscore, not a reserved word. It is also a good idea to add
an additional level of delimiters like semicolons at the statements ends.
This makes reading programs easier and for parsing it would mean a
possibility to recover after typo errors.

--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de
Hans-Peter Diettrich

2007-05-20, 10:13 pm

Ryan Dary wrote:

> Right now, I am walking through the tokens, and building up a tree
> like structure so that I have nodes like Program, Function, Class,
> etc. Then, as I'm parsing some of the source lines, I am actually
> skipping them right now because it seems impossible to know what the
> tokens mean until I have built a symbol table of all the functions,
> etc. Perhaps I have this phase a bit backwards...


I'd create the symbol tables while parsing, and attach them to the
appropriate nodes. At that moment you have at hand all informataion,
telling where symbol definitions are located, and which scope they
belong to. This would allow to add unique references from symbol usage
nodes to the according symbol tables.

I've heard that VB requires up to 3 passes through the source files,
until all declarations are well known. This might indicate that
finishing the symbol tables must occur after parsing, while dealing
with the syntax tree, but it depends on your specific language
semantics.

DoDi
Mike Burrell

2007-05-20, 10:13 pm

On 2007-05-18 00:43:38 -0400, Ryan Dary <iecc@ryandary.com> said:
> Dim a As Integer = i + 10

*snip*
> My understanding of the syntax tree is that I'm not supposed to be
> worried about the "meaning" of the code, but rather the "structure" of
> the code. So, I'm not building a symbol tree at this phase... the
> problem with that seems to be that I'm unable to make heads or tails
> of the lines of code within the function declaration. For instance,
> as I parse the Dim statement (which is used to declare a variable), I
> am able to parse the components "Dim a As Integer = <exp>" where the
> <exp> (expression) seems to be impossible to really parse without
> having a symbol table thus far in the parsing. I wouldn't know if "i"
> is a variable or a function or a constant, because I don't have any
> way of looking it up in a symbol table. So, should I be building the
> symbol table as I'm parsing the syntax tree from the tokens?


That's one way to do it, and it's been done before.

The usual way to do it is to just let 'i' be an identifier. Thus in
your syntax tree, you'll have IDENTIFIER + CONSTANT for your
expression. If your language has dynamic semantics/typing, then you'll
be content with that and go on your merry way executing, only checking
what 'i' is once you've reached that line in your thread of execution.
If your language has static semantics/typing, then after you do your
parsing, you'll have another phase in your compiler where you make
sense of what the identifiers refer to.

Mike
Jeff Kenton

2007-05-20, 10:13 pm

Ryan,

A few things are unclear. First, look at your language and make sure
that there's nothing that's truly ambiguous. Second, it looks from your
example as if identifiers aren't necessarily declared before they're
used, except perhaps in function declarations. If that's true, when do
you know what they are? Are they declared anywhere? Dimensioned? Or
else, what are the rules for assuming types and dimensions? Furthermore,
in your example it seems that "i" is declared in the statement "Function
do_something(i As Integer)". What information about "i" are you missing?

In general, you should be building the symbol table as you parse. If
there is information about an identifier that's missing, that should be
explicitly noted as you proceed. If it's still missing when you finish
parsing, you will need to report errors or apply defaults, depending on
the language specifications.

Overall, my first reaction is that you language sample looks a little
confusing. Your function declarations have parentheses but your
function calls don't. You seem to have mixed your "Dim" statement with
an assignment. Even if your language isn't exactly ambiguous, you might
find that you have made it difficult to parse (and to code) correctly,
and that a little redesign would help. In any case, when you're
parsing, you should be gathering as much information as you can, as soon
as you can. Even noting that information is missing and will have to be
found later is useful to know.

Good luck,

jeff


Ryan Dary wrote:
> I'm working on a phase of my compiler. The language is similar to
> Visual Basic, and I want to know if I'm parsing this correctly.
> ...
>
> Function do_something( i As Integer ) As Integer
> Dim a As Integer = i + 10
> do_something_else i
> Return a * 5
> End Function
> ... So, I'm not building a symbol tree at this phase... the
> problem with that seems to be that I'm unable to make heads or tails
> of the lines of code within the function declaration. For instance,
> as I parse the Dim statement (which is used to declare a variable), I
> am able to parse the components "Dim a As Integer = <exp>" where the
> <exp> (expression) seems to be impossible to really parse without
> having a symbol table thus far in the parsing. I wouldn't know if "i"
> is a variable or a function or a constant, because I don't have any
> way of looking it up in a symbol table.

George Neuner

2007-05-20, 10:13 pm

On Thu, 17 May 2007 21:43:38 -0700, Ryan Dary <iecc@ryandary.com>
wrote:

>Take a look at this source code, ...
>
>Sub do_something_else( byRef i As Integer )
> i = i * 10
>End Sub
>
>Function do_something( i As Integer ) As Integer
> Dim a As Integer = i + 10
> do_something_else i
> Return a * 5
>End Function
>
>My understanding of the syntax tree is that I'm not supposed to be
>worried about the "meaning" of the code, but rather the "structure" of
>the code. So, I'm not building a symbol tree at this phase... the
>problem with that seems to be that I'm unable to make heads or tails
>of the lines of code within the function declaration. For instance,
>as I parse the Dim statement (which is used to declare a variable), I
>am able to parse the components "Dim a As Integer = <exp>" where the
><exp> (expression) seems to be impossible to really parse without
>having a symbol table thus far in the parsing. I wouldn't know if "i"
>is a variable or a function or a constant, because I don't have any
>way of looking it up in a symbol table. So, should I be building the
>symbol table as I'm parsing the syntax tree from the tokens?


Strictly speaking, it is never *necessary* to build a symbol table
while parsing - it can always be done by a separate pass (or passes)
through the syntax tree. However, as a practical matter, there are
languages like C for which parsing is very difficult to perform
without the symbol information - it can be done but the result is very
messy and the work to disambiguate everything is more trouble than it
is worth. Regardless of the language, constructing the symbol table
as early as possible saves work later.


That said ... a VB like syntax is easy enough that you really could
leave symbol processing until later if you wanted to. While parsing
you only need to know the context in which the identifier is used -
declaration, expression, function argument, etc. You structure your
syntax tree to record the context but you leave the identifier
references generic.

Assuming your parser recognizes keywords for built-in types - and
without going into too much detail - your syntax tree might initially
look something like:

(program
(proc
( (id "do_something_else")
(params (id "i" (type (ref integer)))) )
(code
(expr (op =
(expr (id "i"))
(expr (op * (id "i") (const 10))))) ))

(func
( (id "do_something")
(type integer)
(params (id "i" (type integer)) )
(vars
(id "a"
(type array
(type integer)
(dim (expr (op + (id "i") (const 10)))) )))
(code
(call (id "do_something_else")
(args (expr (id "i"))))
(return (expr (op * (id "a") (const 5)))) ))
:
)

Since user types like records are constructed from built-in types or
recursively from previously declared user types, it is sufficient for
the parser to recognize the built-ins.

Then you walk the syntax tree, constructing your type and symbol
tables(s) as you determine what the identifiers mean. During this
pass you can check the code for undeclared and duplicate identifiers,
unfinished forward declarations, etc. If the language syntax requires
all symbols be declared before use then you can also perform some
usage checking in this pass. If your language allows identifiers to
be referenced before they are declared (e.g., function names) then
you'll have to wait until the entire symbol table is constructed
before checking usage.

Once your types and symbols are defined, you can rewrite the code
parts of the tree so the generic ID references point directly to the
symbol table entries.

Hope this helps.
George
Karsten Nyblad

2007-05-20, 10:13 pm

Ryan Dary wrote:
> Take a look at this source code, and then I'll explain some problems
> that I'm having...
>
>
> Sub do_something_else( byRef i As Integer )
> i = i * 10
> End Sub
>
> Function do_something( i As Integer ) As Integer
> Dim a As Integer = i + 10
> do_something_else i
> Return a * 5
> End Function
>
>
> My understanding of the syntax tree is that I'm not supposed to be
> worried about the "meaning" of the code, but rather the "structure" of
> the code. So, I'm not building a symbol tree at this phase... the
> problem with that seems to be that I'm unable to make heads or tails
> of the lines of code within the function declaration. For instance,
> as I parse the Dim statement (which is used to declare a variable), I
> am able to parse the components "Dim a As Integer = <exp>" where the
> <exp> (expression) seems to be impossible to really parse without
> having a symbol table thus far in the parsing. I wouldn't know if "i"
> is a variable or a function or a constant, because I don't have any
> way of looking it up in a symbol table. So, should I be building the
> symbol table as I'm parsing the syntax tree from the tokens?
>
> So, hopefully, my nonsense explanation of my problem will make sense to
> someone who can shed some light on my question. Thanks in advance.
>
> - Ryan Dary


There is no such things as a fixed set of rules for how to design a
compiler. The design will always be influenced by the language being
compiled, and a design that will be fine for compiling one language
might be clumsy for compiling a different language. You can build the
symbol table currently with building the syntax tree, if you want to.
That is the normal way of doing it in C and C++ compilers, because those
languages are very difficult to parse without knowing if an identifier
is a type or something else. Many of us would do it anyway, because we
do not want the trouble of analyzing the tree an extra time.

However, I always recommend the KISS principle for newcomers to the
compiler business. KISS = keep it simple stupid. Do not try to do too
many things at once. Compilers are complex programs, and you are
normally best of not trying to be too smart. Otherwise you will have to
go bug hunting in routines you can barely understand, and your compiler
may never become reliable. You may be better of by building a parse
tree first and a symbol table later, simply to keep down complexity.

Also, it is common to transform the parse tree into a tree, that better
reflect the semantics of the program. At first you build a tree, where
you do not distinguish between identifiers being a constant, a variable,
or a function. When you have built the symbol table you transform the
tree into a new tree with nearly the same structure, but where you label
nodes of identifiers in accordances with what they are. If an
identifier denotes a variable, you label the nodes of that identifier
with a label stating that this is a variable. If it is a constant you
label the nodes with a label stating that this is a constant, etc. I am
currently writing on a problem where I already transform the tree twice,
and I anticipate transforming the three a third time. Each
transformation makes the tree easier to deal with for the the next phase.

Karsten Nyblad
148f3wg02 at sneakemail dot com
Uli Kusterer

2007-05-20, 10:13 pm

On 18.05.2007, at 06:43, Ryan Dary wrote:
> For instance,
> as I parse the Dim statement (which is used to declare a variable), I
> am able to parse the components "Dim a As Integer = <exp>" where the
> <exp> (expression) seems to be impossible to really parse without
> having a symbol table thus far in the parsing. I wouldn't know if "i"
> is a variable or a function or a constant, because I don't have any
> way of looking it up in a symbol table. So, should I be building the
> symbol table as I'm parsing the syntax tree from the tokens?


Now, I don't have any formal teaching in compilers and parsers, but
I've wiggled through quite well with books and the web and my own
attempts so far, and I think you're right here. You will need to know
whether a certain word is a user-defined type, a variable or
whatever, unless you do like PHP or some other languages which
explicitly mark variables with an operator ("$myVar" and the likes,
or have keywords uppercase and the rest lowercase, or whatever...).

Of course, depending on the nature of your language, you might get
away with breaking it up some more without having a symbol table .
The end of the line is indicated by return characters, a line that
starts with "Dim" is a variable definition/declaration, a function
call is an identifier immediately followed by a parameter list in
brackets... You can scan ahead to find out about such things for some
languages. For the code snippet you posted, that might even work.
Depends on what else the language can do.

I don't know enough about the rules of the syntax tree to tell you
whether that's a good idea, mind you. I usually just tokenize and
then parse right away, building my symbol table as I go. But then I
usually create a variant of HyperTalk, where you neither specially
mark, nor declare variables, so it's simply impossible to parse
without the kind of context provided by a symbol table.

Cheers,
-- M. Uli Kusterer
http://www.zathras.de
Chris Dollin

2007-05-21, 7:10 pm

Jeff Kenton wrote:

> In general, you should be building the symbol table as you parse.


I would have said that /in general/ you want to keep the symbol
table well out of it while parsing. Having the syntax depend on
what the declarations are makes, I believe, for a fragile grammar.
Some languages are pretty much stuck with that fragility, but that's
not an excuse to copy it.

Decoupling the parse from the semantics like this makes the parser
easier to write, and makes the symbol table easier to work with
as well.

> If there is information about an identifier that's missing, that
> should be explicitly noted as you proceed. If it's still missing
> when you finish parsing, you will need to report errors or apply
> defaults, depending on the language specifications.


Even if you decouple the symbol table from the parser, you can still
do this one layer up (read a top-level construct, do the symbol-table
working, etc). What's important, IMAO, is that the parser not need
access to the symbol table just to make parsing decisions.

--
"I just wonder when we're going to have to sit down and re-evaluate /Sahara/
our decision-making paradigm."

Hewlett-Packard Limited registered office: Cain Road, Bracknell,
registered no: 690597 England Berks RG12 1HN
Uli Kusterer

2007-05-23, 4:28 am

On 21.05.2007, at 10:01, Chris Dollin wrote:
> Decoupling the parse from the semantics like this makes the parser
> easier to write, and makes the symbol table easier to work with
> as well.


In some cases one can just design the language so it's easy to
parse. But sometimes one has to implement a particular pre-defined
language, or try to approximate some other language, and in those
cases one can't or doesn't want to just change the language to make it
easier to parse.

Cheers,
-- M. Uli Kusterer
http://www.zathras.de

glen herrmannsfeldt

2007-05-24, 10:08 pm

Uli Kusterer wrote:

> In some cases one can just design the language so it's easy to
> parse. But sometimes one has to implement a particular pre-defined
> language, or try to approximate some other language, and in those
> cases one can't or doesn't want to just change the language to make it
> easier to parse.


At some point there is the choice of easy to parse by computer
or easy to write by people. Some designers choose one, some choose
the other.

One design point is reserved words or no reserved words.

Choosing reserved words makes it easier to parse in many cases,
and some might argue easier for people to read. In many cases,
though, it makes it harder to write, especially with a large
language with many keywords which one might not normally need
to know.

COBOL has a very large list of reserved words, many of which
one might find useful as descriptive variable or subroutine
names. PL/I has no reserved words to remove this restriction
on the programmer.

-- glen
Paul Robinson

2007-05-31, 10:10 pm

On May 18, 12:43 am, Ryan Dary <iecc AT ryandary.com> wrote:

> Sub do_something_else( byRef i As Integer )
> i = i * 10
> End Sub
>
> Function do_something( i As Integer ) As Integer
> Dim a As Integer = i + 10
> do_something_else i
> Return a * 5
> End Function
>
> My understanding of the syntax tree is that I'm not supposed to be
> worried about the "meaning" of the code, but rather the "structure" of
> the code. So, I'm not building a symbol tree at this phase... the
> problem with that seems to be that I'm unable to make heads or tails
> of the lines of code within the function declaration.


Well, personally I'm a bit of a procrastinator, but when I've written
compilers I've always been a fan of the "do it now" philosophy, also
because I tend to like to write compilers as one pass whenever I can
get away with it. The sooner you do anything the less you have to
fuss with it and the faster you get rid of it.

Note that in the following I use "procedure" to refer to both the sub
and the function. This is so I don't have to remember which is which,
because, for the purposes of this discussion it doesn't matter whether
these are subs or functions.

So, given the above code, in the first procedure (do_something_else),
you define i as a (writable) parameter passed to you from the caller,
I would create the identifier table as soon as I see the identifier.
The variable i then becomes a kind of global variable, say an
internediate variable, so that if there is a global variable named i
the parameter overrides it,or however you would handle a parameter
variable as opposed to a local variable or a global one (if you handle
them differently).

Now, in the second procedure (do_something), what I would do is change
the name of the parameter i to something else with a name that is only
available for internal use. Now, you do a create of a local variable
named i of the same type as this internal variable that parameter i
was named to (integer in this case). Now, I would set up a "prefix
code" array or table, that says, before you write the executable code
for this procedure, you generate this code at the beginning of the
procedure, e.g. you prefix this code ahead of any code in the
procedure. And in that prefix code, I would do an assignment of
parameter i to local i (because your parameter here is non-
writable). By translating the parameter into a local variable and
copying it, you don't have to worry if the procedure trashes it or
what it does with the variable, it's a local variable, instantiated
when the procedure is invoked and is destroyed when the procedure
ends.

Now, when you get to the statement Dim a As Integer = i + 10 you
create a local variable a, and in the prefix code table, as the second
thing to do before generating any code for this procedure, is a
statement assigning the value of I+10 to a. Since the first thing in
the prefix code is to copy parameter i to local i, you're done
worrying about it.

What you could do is, to make it simpler and push the code generation
to later, is have the "prefix code" table consist of statments in the
language, and the compiler simply does this: when it reaches the first
executable line of code in a procedure, if the prefix code table is
not empty, it reads those first (until exhausted) as if the table was
a form of an include file immediately before the first line of
executable code. The compiler won't care, all it's doing is compiling
valid code, and the result provides what you need.

Now, that's how I would do it. Maybe that's not the "best" way, but
it is fairly simple - at least I think it's simple - and I like simple
ways to solve problems.

Paul Robinson - http://paul-robinson.us (My Blog)
"The lessons of history teach us - if they teach us anything - that
nobody learns the lessons that history teaches us."
Eddermotardten

2007-06-12, 5:58 am

Paris Hilton and Jessica Alba In Pussy Cucumber!
Sponsored Links







Also available: Server administration forum archive | Web Design forum archive | Software forum archive | Hardware reviews archive

Copyright 2008 codecomments.com