Home > Archive > Compilers > April 2006 > Need Information on how to create bytecode
You are viewing an archived Text-only version of the thread.
To view this thread in it's original format and/or if you want to reply to
this thread please [click here]
| Author |
Need Information on how to create bytecode
|
|
| megavlad@gmail 2006-04-08, 7:03 pm |
| Looked for days in google, here, everywhere. Theres alot of
informaltion here about virtual machines, but I don't think anyone
actually has pointed to some tutorials show how to create bytecode or
what the structure of bytecode is.
Some info.
My goal is to create a scripting language which I can use to make 3d
video games. I know OpenGL, C++. I have already created a parser and
lexer using Bison and Flex for my language. I have also created the
framework for the graphics in C++ and OpenGL. But I'm stuck now. I can
parse my language, check for syntax erros, insert information into
symbol tables... that kind of stuff. But that's it. Now I want to to
turn my source language into Bytecode and then interpret that bytecode
(for instance, load a 3d model, print to the screen, etc).
I'm not familiar with the structure of bytecode. I know that there
isn't just one kind of bytecode, but I'm sure that alot of them share
many similarities. I'm sure the have some type of header, maybe
followed by the globals they the hard coded values, that kind of
stuff.I've read here that it's a stream of 8 bit values (sometimes 16
bits), which has instructions that jump to a given set of other
instructions; and other explanations which are TOO abstract for me. I
mean, I'm so eager to translate my source language into bytecode, yet
have no idea where to start. For instance, if I have this is my source
language:
$someValue = 1 + 2;
What does that look like in bytecode? What about:
$i = true;
if($i)
$otherValue = someFunction();
else
$otherValue = "Not true";
There is a hard coded string here, so I figure that somehow it has to
be inside the bytecode file. I read alot about JIT compiling, yet
don't know how it relates to bytecode. And forget it, once I, at
least, learn how I can structure my source program into a bytecode
file, I'm gonna have to tackle interpreting the actual bytecode in my
vm(which doens't yet exists).
So many questions, so many questions. I appreciate ANY input.
Thanks
| |
| Jürgen Kahrs 2006-04-09, 7:06 pm |
| Sun's spec is actually quite precise:
http://java.sun.com/docs/books/vmsp...ssFile.doc.html
As usual, WikiPedia is also a good starting point. Here they present
a list of byte codes for the JVM mnemonics:
http://en.wikipedia.org/wiki/Java_byte_code
The basic principle of P-Code is explained in an excellent article at
WikiPedia:
http://en.wikipedia.org/wiki/P-Code_machine
Similar documents exist for .NET's CLR. You should find these on you
own. If it really takes you days to find them, there's something wrong
with the way you are searching.
[I suspect he was searching for info on bytecode design principles
rather than on the spec for various existing bytecodes. It's a good
question. We all know that the two main models are stack and
pseudo-registers, but beyond that, for issues like how you encode
constants and addresses and calls and interroutine links, as far as I
know it's all folklore. -John]
| |
| Hans Aberg 2006-04-09, 7:06 pm |
| <megavlad@gmail.com> wrote:
> Looked for days in google, here, everywhere. Theres alot of
> informaltion here about virtual machines, but I don't think anyone
> actually has pointed to some tutorials show how to create bytecode or
> what the structure of bytecode is.
I was once told that Simon Peyton-Jones has written a book describing how
to create bytecodes, but I do not know which one.
--
Hans Aberg
| |
| Satyam 2006-04-12, 10:02 pm |
| ----- Original Message -----
From: "Jürgen Kahrs" <Juergen.Kahrs@vr-web.de>
Sent: Sunday, April 09, 2006 10:59 PM
Subject: Re: Need Information on how to create bytecode
...
> [I suspect he was searching for info on bytecode design principles
> rather than on the spec for various existing bytecodes. It's a good
> question. We all know that the two main models are stack and
> pseudo-registers, but beyond that, for issues like how you encode
> constants and addresses and calls and interroutine links, as far as I
> know it's all folklore. -John]
The bytecodes of early processors came from the hardware designers.
Most often, sections of the bytecode would directly go to act on
specific decoders and gates in the circuit. Current microprocessors,
having many translation stages before acting on any hardware that
actually executes the instruction (bytecodes are entry points into a
translation table that gives the start address of the microprogram
that actually guides the processing) can be arbitrary and in Intel
80x86 family of backward compatible processors, they are chosen as to
keep that compatibility. Thus, even in actual hardware bytecodes, you
might as well assign them whichever way you want..
At most, the only precaution I would take, if 16 bits of bytecode is
not enough for you instruction set (which would be strange), is that
you assign single word bytecodes to the instructions used most often
and use prefixes for seldom used instructions. Sixteen bits, though,
is often quite enough so you might pack some extra information into a
bytecode, for example, to access a word at a certain offset from the
current stack frame, you might have two instructions, one which allows
you to specify a short offset in the least few significant bytes of
the instruction and another with a full extra word to specify the
offset. Since most function calls have few parameters, the short
offset bytecode would be used very often.
Same thing with jumps, specially conditional jumps. If the address to
be jumped is a short offset from the current pointer, see if you can
assign some of the bits in the bytecode to that offset. Since these
short addresses will be used far more frequently than longer ones,
packing that extra info into the bytecode will shorten the generated
bytecode and if it were to be executed on real hardware, it would
speed up execution as well. Actually, some processors have only short
conditional jumps (subroutine calls fall into this as well) and
arbitrary length unconditional jumps, if you want to conditionally
jump far away, you first have to make a short jump to the long
unconditional jump.
How many bits would you assign to those offsets? Well, do a first
draft of your bytecodes, see how many you have and then see how many
bits you can spare for offsets. And since you would probably have the
bytecodes used as index to a jump table in your interpreter, make the
bytecodes containing offsets consecutive and at the end so you know
that if you read a bytecode higher than whatever you got, you don't
use the regular jump table but first mask out those offsets, shift it
right and use what's left as the index to a separate table. Like:
111111xxcccooooo where 1's are ones, xx can be load, store, jump or
call, ccc would be the condition for the jump and ooooo would be the
offset or, in case of load and store, all of cccooooo would be the
offset.
Jumps and calls also have an easily recognizable prefix to signal
lookahead hardware to stop bothering looking ahead sequentially since
what's after won't get executed.
Few of this is applicable to virtual machines though still, shorter
programs tend to be faster anyway.
Satyam
[The design tradeoffs for hardware vs. software architectures are
different. In hardware, you can pipline decoding and execution so you
typically want to make instructions simple and all take about the same
time to keep the pipeline simple, and you want to limit the number of
different opcodes to limit the amount of execution hardware. In
software, the decode is a bottleneck while there's little penalty for
having lots of operations, so you want more complex instructions to
minimize the number of trips through the interpreter. -John]
| |
| scavadini@ucse.edu.ar 2006-04-12, 10:02 pm |
| Maybe Jasmin (or its documentation) is useful for you
http://jasmin.sourceforge.net/
Best regards
Salvador Cavadini
| |
| DavidM 2006-04-14, 7:07 pm |
| Just out of curiousity, why would you not use Lua or Small or some
other existing stable and well documented implementation. Lua 5 in
particular is lightweight and very fast, plus it produces bytecodes.
http://www.lua.org/
http://www.compuphase.com/small.htm
If your focus is to make a game, such a task can only bog you down.
-DavidM
| |
| megavlad@gmail 2006-04-17, 10:02 pm |
| Thanks alot for the info guys, specially to Oliver for the very
awesome examples. I have a couple of more questions, although this
time a little more general. I'm still trying to get the hang of this,
even though I've been reading the dragon book on and off for the past
6 months. I find that book tough.
Let's say I have an Input Source File (SF) written in my language.
Is this the general process?:
1) As the SF is parsed by the parser (Bison generated), information is
entered into a symbol table and the AST as semantic actions are encountered.
2) After the parsing is done, I can optionally optimize the AST.
3) Then, I run some algorithm that traverses the AST and outputs the
information found on the AST and symbol-table somewhere, in my case to
bytecode.
4) Have the vm run the bytecode.
I've read there are other kinds of intermediate representations, but it
seems that the AST seems to be the most balanced between ease of use and
benefits.
How much information should I put into the AST compared to the symbol-table?
Is there any other structure that I need, beside the two I mentioned above?
thanks so much
| |
|
| I have a question, not necessarily for you, but for the knowledgeable
residents of comp.compilers: is p-code/bytecode the best solution? I
wrote a little scripting language once and felt sort of resigned to
generating bytecode just as you are doing, and while it ended up
working, debugging the scripts was a nightmare!
In contrast, a co-worker was recently showing me a scripting language
he'd developed which generated a parse-tree, with classes representing
the various node-types all descended from a single basic node. The
interpreter executed over the tree rather than P-code, and I found it
more straightforward to step through. I also found it fairly
flexible, allowing the development of continuations and other nifty
stuff, and it could be stored and read from an XML file.
I wonder why I don't see such a solution more often, except for the
fact that whenever we open up a book on compilers, they all assume
you're generating p-code/assembly.
[There isn't really that much practical difference between bytecodes
and a parse tree. A typical RPN byte code is easily generated by doing
a depth first walk of a parse tree, and you can rebuild the tree just
as easily as you read in the codes. -John]
| |
| Anton Ertl 2006-04-25, 7:15 pm |
| "kov" <ken.overton@gmail.com> writes:
>I have a question, not necessarily for you, but for the knowledgeable
>residents of comp.compilers: is p-code/bytecode the best solution?
If you want execution speed, using a virtual machine (what you
probably mean with p-code/bytecode) is certainly necessary for the
best solution. In other respects I don't see di vantages from using
a virtual machine.
>I wrote a little scripting language once and felt sort of resigned to
>generating bytecode just as you are doing, and while it ended up
>working, debugging the scripts was a nightmare!
I don't know how you tried to debug the scripts. Trying to use gdb or
similar debuggers to single-step through interpreters is a waste of
time for any kind of interpreter, because you will see lots of
uninteresting interpreter internals, and will have a hard time seeing
the higher-level stuff you are interested in.
>In contrast, a co-worker was recently showing me a scripting language
>he'd developed which generated a parse-tree, with classes representing
>the various node-types all descended from a single basic node. The
>interpreter executed over the tree rather than P-code, and I found it
>more straightforward to step through. I also found it fairly
>flexible, allowing the development of continuations and other nifty
>stuff, and it could be stored and read from an XML file.
Having a virtual-machine representation also has the advantage that it
decouples the front end from the interpreter, whereas using a parse
tree does not. If you enhance the syntax, you usually have to enhance
a tree-interpreter, while you are often able to keep a VM interpreter
unchanged. This is especially useful if you store the compiled code
in files as you suggest, but I also found it helpful on other
occasions.
>[There isn't really that much practical difference between bytecodes
>and a parse tree. A typical RPN byte code is easily generated by doing
>a depth first walk of a parse tree, and you can rebuild the tree just
>as easily as you read in the codes. -John]
That's true for expression evaluation and simple statements, but for
control flow there is a big difference between virtual-machine code
(which typcially uses virtual-machine branches) and a parse tree
(which typically represents structured control flow constructs).
- anton
--
M. Anton Ertl
anton@mips.complang.tuwien.ac.at
http://www.complang.tuwien.ac.at/anton/home.html
|
|
|
|
|