For Programmers: Free Programming Magazines  


Home > Archive > Compilers > December 2004 > Taking an AST back into C









You are viewing an archived Text-only version of the thread. To view this thread in it's original format and/or if you want to reply to this thread please [click here]

 

Author Taking an AST back into C
Neil Bradley

2004-11-29, 4:10 pm

I'm currently working on a code recompiler that translates binary code
back into its C equivalent. Fortunately, just about every processor
instruction can be represented as an expression (computing carry,
overflow for example).

I've created an abstract syntax tree for the binary code and am
translating it back into C, and now I'm trying to figure out when to
emit parenthesis in my expressions (so that operator precedence will
be honored).

Does anyone know of a rule or approach that would let me know when
parentheses should be emitted? As near as I can tell, it's when the
node on the right tree of the current operator has an operator that is
of differing precedence, but that just doesn't seem complete enough.

Any advice/help/hints would be greatly appreciated. Thanks!

-->Neil
[There's been a lot of work on decompilers over the years, including
one I tried that disassembled x86 object code and turned it into C.
It worked, but the results were so low-level that they were useless.
-John]
Martin Ward

2004-12-02, 3:57 am

> [There's been a lot of work on decompilers over the years, including
> one I tried that disassembled x86 object code and turned it into C.
> It worked, but the results were so low-level that they were useless.
> -John]


FermaT has been sucessfully used to translate 544,000 lines of x86
assembler (an embedded system) into efficient and maintainable C code.
The totally automated technique is to translate the assembler into
WSL, apply several thousand WSL to WSL transformations (per module),
and then translate the restructured and simplified WSL into C. See
the paper "Pigs from Sausages? Reengineering from Assembler to C via
FermaT Transformations" at
http://www.cse.dmu.ac.uk/~mward/martin/papers/

Sample assembler code:

extrn dsaft :abs
extrn adtn1 :word
extrn hrfft :abs
extrn oldgs :byte

no_pick:
mov dx,dsaft
mov bx,adtn1
call far ptr tstbt
jnz htst_irf_ret
mov bx,adtn1
mov dx,hrfft
call far ptr tstbt
jz htst_irf
mov oldgs,0
call far ptr hwal
jnz htst_irf_ret
jmp htst_irf
htst_irf_ret:
ret


and the corresponding C code:


void
no_pick()
{
if ((adtn1->dsaft == 0 && adtn1->hrfft == 0))
{
htst_irf();
}
else if (adtn1->dsaft == 0)
{
oldgs = 0;
hwal_zf = hwal();
if (hwal_zf != 0)
{
htst_irf();
}
}
return;
}

--
Martin
Martin.Ward@durham.ac.uk http://www.cse.dmu.ac.uk/~mward/ Erdos number: 4
[I'd think that translating assembler into C should be a lot easier
than decompiling object code because you have the symbols and
labels. -John]
Torben Ęgidius Mogensen

2004-12-02, 3:57 am

"Neil Bradley" <nb_no_spam@synthcom.com> writes:

> I'm currently working on a code recompiler that translates binary code
> back into its C equivalent. Fortunately, just about every processor
> instruction can be represented as an expression (computing carry,
> overflow for example).
>
> I've created an abstract syntax tree for the binary code and am
> translating it back into C, and now I'm trying to figure out when to
> emit parenthesis in my expressions (so that operator precedence will
> be honored).
>
> Does anyone know of a rule or approach that would let me know when
> parentheses should be emitted? As near as I can tell, it's when the
> node on the right tree of the current operator has an operator that is
> of differing precedence, but that just doesn't seem complete enough.


Placing parentheses is the least of your problems when decompiling.
There are standard methods for doing that: If you have an abstract
syntax tree of arithmetic operations (or, equivalently, a stack
program), you first compute the infix string representation of the two
subexpressions (subtrees / substacks) and the priority of the
top-level operator of each (including whether it is left- right- or
non-associative). You then compare these precedences with the
operator of the whole expression.

If the priority of the topmost operator is greater (i.e., binds
tighter) than that of a subexpression, you add parentheses around this
subexpression. If the priority of the topmost operator is lower than
that of a subexpression, you don't add parentheses around this. If
the priority of the topmost operator is the same as that of a of the
subexpression, you look at the associativity. If non-associative, you
add parentheses around the subexpression. If left-associative, you
add parentheses around the subexpression if it is the right operand of
the topmost operator but not if it is the left operand. If
right-assiciative, you add parenthesis around the subexpression if it
is the left operand of the topmost operator but not if it is the right
operand.

Note that this implies that operators with identical priority have
identical associativity. This is the usual case, but if not you
shoudl add parentheses around the subexpression regardless if it to
the left or right.

> [There's been a lot of work on decompilers over the years, including
> one I tried that disassembled x86 object code and turned it into C.
> It worked, but the results were so low-level that they were useless.
> -John]


If the code doesn't use any data structures, it is usually not so bad,
as operators translate fairly directly back into high-level operators.
Your main problem will be to translate register and stack references
back into variable references, but if you know the layout of
activation records, that's doable too. Control structure isn't so
bad, as loops and if-then-else constructs are fairly easy to detect.
The main problem is data structures, as different data structures are
translated into similar code. Alan Mycroft had a paper called
"type-based decompilation" at ESOP'99 (Springer LNCS 1576) that
attempted to do this, but admitted that the method doesn't work for
all programs.

Torben Mogensen


VBDis

2004-12-02, 3:57 am

"Neil Bradley" <nb_no_spam@synthcom.com> schreibt:

>I'm currently working on a code recompiler that translates binary code
>back into its C equivalent. Fortunately, just about every processor
>instruction can be represented as an expression (computing carry,
>overflow for example).


That's a reasonable approach, for expressions.

>I've created an abstract syntax tree for the binary code and am
>translating it back into C, and now I'm trying to figure out when to
>emit parenthesis in my expressions (so that operator precedence will
>be honored).


Perhaps it's easier to emit parentheses by default, and to determine
when the parentheses can be omitted for sure.

>Does anyone know of a rule or approach that would let me know when
>parentheses should be emitted? As near as I can tell, it's when the
>node on the right tree of the current operator has an operator that
>is of differing precedence, but that just doesn't seem complete
>enough.


The procedures depend on your tree structure. You'll often have
operators in both arguments of an operation.

In my own decompiler I used an tree restructuring pass before trying
to produce syntactically valid C code. The resulting tree then was
arranged and annotated as appropriate for the source code
generation. More passes may be required for control flow analysis, and
for x86 or other CISC machines with their processor specific idioms.

>Any advice/help/hints would be greatly appreciated. Thanks!


Feel free to contact me by e-mail. Decompilers have been my hobby for
decades, and I know about many related problems and possible
solutions.

DoDi
VBDis

2004-12-07, 4:21 am

Martin Ward <Martin.Ward@durham.ac.uk> schreibt:

>void
>no_pick()
>{
> if ((adtn1->dsaft == 0 && adtn1->hrfft == 0))
> {
> htst_irf();
> }
> else if (adtn1->dsaft == 0)
> {
> oldgs = 0;
> hwal_zf = hwal();
> if (hwal_zf != 0)
> {
> htst_irf();
> }
> }
> return;
>}


Another possible representation:

void
no_pick()
{
if (adtn1->dsaft == 0 && adtn1->hrfft != 0 && oldgs = 0, hwal() == 0)
htst_irf();
}
(Please ignore possible errors in the code, I decompiled manually ;-)


Provided that your assembler code actually was disassembler output,
I'd furthermore suspect a typo in the original C source code, which
probably should have read as:

if (adtn1->dsaft == 0 && adtn1->hrfft != 0 && oldgs == 0 && hwal() == 0)
htst_irf();

It were not the first bug of that kind that I found while decompiling ;-)


>[I'd think that translating assembler into C should be a lot easier
>than decompiling object code because you have the symbols and
>labels. -John]


True, the separation of code and data often is not trivial. Microsoft
code can contain some very ugly hacks, confusing even sophisticated
automated and experienced human disassemblers. The more stupid the
development tools are, the more creative are the coders :-(

Symbols also are found in object and library modules. It's a good idea
to determine the compiler, OS, and the use of (according) libraries
and declarations. With that knowledge it's possible to re-import many
names and proper data types, and to separate library from application
functions, etc. For the user it's much more helpful to identify the
library functions, which usually are well documented elsewhere, and to
propagate the related data (argument...) types into the application
code, instead of attempting to decompile such functions. I used that
technique in my own C decompiler, and the IDA disassembler uses the
equivalent FLIRT technique.

DoDi
VBDis

2004-12-07, 4:21 am

torbenm@diku.dk (=?iso-8859-1?q?Torben_=C6gidius_Mogensen?=) schreibt:

>Control structure isn't so bad, as loops and if-then-else constructs
>are fairly easy to detect.


One of the usually ignored or inappropriately "solved" problems are
complex conditional expressions, with short circuit evaluation. It's
not so easy to determine the range of conditional expressions, and how
to translate it back into high level code.

>The main problem is data structures, as different data structures are
>translated into similar code.


That's true, but not the hardest problem. Often it's undecidable what
data type a pointer actually refers to, be it a structure, or the
first member of that structure, or the first element in an array
etc. It's a real puzzle to make all subroutine argument data types
match all related calls, in detail when typecasts were used in the
original code.

DoDi
VBDis

2004-12-11, 3:59 pm

Martin Ward <Martin.Ward@durham.ac.uk> schreibt:

>The totally automated technique is to translate the assembler into
>WSL, apply several thousand WSL to WSL transformations (per module),
>and then translate the restructured and simplified WSL into C. See
>the paper "Pigs from Sausages? Reengineering from Assembler to C via
>FermaT Transformations" at
>http://www.cse.dmu.ac.uk/~mward/martin/papers/


Very nice, working in the sub-atomic area. And for completeness, your
example is impressioning for an automated analysis :-)

DoDi
Martin Ward

2004-12-11, 3:59 pm

On Monday 06 Dec 2004 2:34 am, VBDis wrote:
> Martin Ward <Martin.Ward@durham.ac.uk> schreibt:
>
> Another possible representation:
>
> void
> no_pick()
> {
> if (adtn1->dsaft == 0 && adtn1->hrfft != 0 && oldgs = 0, hwal() == 0)
> htst_irf();
> }
> (Please ignore possible errors in the code, I decompiled manually ;-)


This is not equivalent to the original: in the original code
htst_irf() is called whenever both adtn1->dsaft and adtn1->hrfft are
zero. Yours does nothing in that case. Also, your code never executes
the assignment hwal_zf = hwal().

> Provided that your assembler code actually was disassembler output,
> I'd furthermore suspect a typo in the original C source code, which
> probably should have read as:
>
> if (adtn1->dsaft == 0 && adtn1->hrfft != 0 && oldgs == 0 && hwal() == 0)
> htst_irf();
>
> It were not the first bug of that kind that I found while decompiling ;-)


The assembler code was actually human implemented. Human implemented
assembler is easier in some ways because you have a better idea of the
meaning of data structures (eg if a variable is declared as an
integer, or character string: but that doesn't stop the programmer
from *using* the data in any way they choose), but harder in other
ways: you would be amazed at just how many different ways there are to
write a jump table, for example, all of which need to be detected and
accounted for by the translator.

--
Martin

Martin.Ward@durham.ac.uk http://www.cse.dmu.ac.uk/~mward/ Erdos number: 4
G.K.Chesterton web site: http://www.cse.dmu.ac.uk/~mward/gkc/
VBDis

2004-12-13, 3:57 am

Martin Ward <Martin.Ward@durham.ac.uk> schreibt:

>This is not equivalent to the original: in the original code
>htst_irf() is called whenever both adtn1->dsaft and adtn1->hrfft are
>zero. Yours does nothing in that case.


I already sent you my detailed answer. Here the essential parts for
completeness:

You are right, my manual translation was incorrect, it should read:

if (adtn1->dsaft == 0 && (adtn1->hrfft == 0 || oldgs = 0, hwal() == 0))
htst_irf();

The reduction of the expression, resulting in the omission of the
second call to htst_irf (why call???) is based on the placement of the
True and False labels for the if-statement as follows:

jnz htst_irf_ret
True:
jmp htst_irf
False:
htst_irf_ret:
ret

These labels can be interpreted as Then and Else in an If, or as Continue and
Break in a loop, e.g.:

void htst_irf()
{
do { whatever_not_listed_before_no_pick; }
while (adtn1->dsaft == 0 && (adtn1->hrfft == 0 || oldgs = 0, hwal() == 0));
}


> Also, your code never executes the assignment hwal_zf = hwal().


I couldn't find such an assignment in your assembly code?

DoDi
Sponsored Links







Also available: Server administration forum archive | Web Design forum archive | Software forum archive | Hardware reviews archive

Copyright 2008 codecomments.com