Home > Archive > Compilers > February 2006 > 1 - 1, 1 -1, 1-1, 1 - -1 and -2147483648
You are viewing an archived Text-only version of the thread.
To view this thread in it's original format and/or if you want to reply to
this thread please [click here]
| Author |
1 - 1, 1 -1, 1-1, 1 - -1 and -2147483648
|
|
| Edsko de Vries 2006-01-31, 7:07 pm |
| Hi,
I have a question regarding lexical analysis. I recently came across a
bug in our lexical analyser in phc (www.phpcompiler.org), that I am
unsure how to solve. This is the problem: our current definition for
integer constant looks something like
INT ([1-9][0-9]*)|0
In particular, note that it does not allow for an (optional) "+" or "-"
at the start of the integer. This means that the strings "1 - 1", "1
-1" and "1-1" all generate the same sequence of three tokens INT(1),
OP(-), INT(1), for which the syntax analyser generates the subtree
BIN_OP(-, 1, 1).
For the string "1 - -1", the lexer (unsurprisingly) generates INT(1),
OP(-), OP(-), INT(1). The syntax analyser recognises this as BIN_OP(1,
UNARY_OP(-, 1)). In other words, the second "-" is treated as a unary
operator, rather than as part of the number.
This works fine, with the sole exception of the number "-2147483648".
The problem is, of course, overflow: -2147483648 is a valid negative
number (assuming 32-bit numbers), but the integer 2147483648 is _not_ a
valid positive number. Thus, the above method of dealing with "-" as a
unary operator breaks down.
The solution is to interpret the "-" as part of the number, and
generate INT(-2147483648), rather than OP(-), INT(...). However,
changing the definition of INT to
INT [+-]?([1-9][0-9]*)|0
causes "1-1" to be recognised as INT(1), INT(-1), which is incorrect.
Even requiring a space before the operator
INT (" "[+-])?([1-9][0-9]*)|0
does not help, because "1 -1" will be recognised as INT(1), INT(-1)
instead of INT(1), OP(-), INT(1).
Is there a good solution that solves both problems? At the moment, I
have added a new rule to the lexer that matches this exact number,
which solves the problem - but it's a hack.. (actually, there are two
rules, one for 32-bit machines, one for 64-bit machines..)
Thanks,
Edsko
| |
| Russ Cox 2006-01-31, 7:07 pm |
| > This works fine, with the sole exception of the number "-2147483648".
> The problem is, of course, overflow: -2147483648 is a valid negative
> number (assuming 32-bit numbers), but the integer 2147483648 is _not_ a
> valid positive number. Thus, the above method of dealing with "-" as a
> unary operator breaks down.
I know a few vendor-supplied C compilers that get this wrong,
printing warnings about "positive constant overflows" or something
like that, which is all the more confusing an error when you look at
it and say "but that's *not* a positive constant!".
> The solution is to interpret the "-" as part of the number, and
> generate INT(-2147483648),
I think a better solution is to let the parser do its job and determine
whether the - is a negation or a subtraction, and then once that
is done, you can warn about INT(2147483648) that isn't inside a
negation. I put this into the Plan 9 C compilers recently and it
worked well.
Russ
| |
| Torben Ęgidius Mogensen 2006-01-31, 7:07 pm |
| "Edsko de Vries" <devriese@cs.tcd.ie> writes:
> our current definition for integer constant looks something like
>
> INT ([1-9][0-9]*)|0
>
> In particular, note that it does not allow for an (optional) "+" or "-"
> at the start of the integer. In other words, "-" is treated as a unary
> operator, rather than as part of the number.
>
> This works fine, with the sole exception of the number "-2147483648".
> The problem is, of course, overflow: -2147483648 is a valid negative
> number (assuming 32-bit numbers), but the integer 2147483648 is _not_ a
> valid positive number.
>
> At the moment, I have added a new rule to the lexer that matches
> this exact number, which solves the problem - but it's a hack..
One solution is to let the lexer keep the number as a string and
handle unary minus applied to a constant in the parser (at this time
converting the string to a number). Or you can use a bignum package
in the compiler, only converting to machine integers when you generate
code.
Torben
| |
| Jean-Marc Bourguet 2006-01-31, 7:07 pm |
| "Edsko de Vries" <devriese@cs.tcd.ie> writes:
> Is there a good solution that solves both problems?
I don't know if you consider it "good", but doing compile time
evaluation of constant expressions with an unbounded precision package
would solve the issue. It immediately extend to having unbounded
precision integers at runtime when you'll add it.
--
Jean-Marc
| |
| Dmitry A. Kazakov 2006-01-31, 7:07 pm |
| On 31 Jan 2006 09:57:37 -0500, Edsko de Vries wrote:
[...]
> Is there a good solution that solves both problems? At the moment, I
> have added a new rule to the lexer that matches this exact number,
> which solves the problem - but it's a hack.. (actually, there are two
> rules, one for 32-bit machines, one for 64-bit machines..)
In Ada all literals are considered of some indefinite numeric type
[Universal_Integer]. The compiler evaluates all expressions exactly and
generates error when the result is converted to the target [finite] type.
So, for example, if Int_8 has range -128..127, nevertheless the following
is legal Ada:
X : Int_8 := 1000 - 900; -- The result is in the range
With this approach, literals are unsigned and unary minus works just fine.
Of course, you will not be able to compile something like
2**98756466 - 2**98756466
which is also legal. But it is not a big loss. (:-))
--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de
| |
| David R Tribble 2006-02-01, 3:58 am |
| Edsko de Vries writes:
>
Torben Ęgidius Mogensen wrote:
> One solution is to let the lexer keep the number as a string and
> handle unary minus applied to a constant in the parser (at this time
> converting the string to a number). Or you can use a bignum package
> in the compiler, only converting to machine integers when you generate
> code.
Another method is to convert all but the last digit into an integer,
so that <2147483648> becomes <214748364, 8>. Then a simple
range comparison can be made to determine if the token is too
big for a valid integer value. This method avoids overflow.
It's interesting to note that the C and C++ grammars do not
treat leading signs as part of a constant but as a separate
unary operator. Other languages such as COBOL do treat
leading signs as part of a literal constant, but these kinds of
languages also typically require additional spaces to separate
operands from operators.
-drt
| |
| Henry Spencer 2006-02-01, 3:58 am |
| Edsko de Vries <devriese@cs.tcd.ie> wrote:
>The problem is, of course, overflow: -2147483648 is a valid negative
>number (assuming 32-bit numbers), but the integer 2147483648 is _not_ a
>valid positive number... Is there a good solution...?
The hack that has been used for this in the past is to compute and store a
number as the negative of its actual value (temporarily, until you know
whether there's a unary minus in front).
If that's too distasteful, you really have to find an extra bit somewhere:
+ Store the constant as an unsigned number (32-bit unsigned goes up to
4_294_967_295 without overflow, so 2_147_483_648 is no problem),
until you know the sign.
+ Use a larger signed type, if you have one.
+ Use a bignum (unlimited-length integer) library.
+ Store the number as a string, unconverted, until you know its sign.
--
spsystems.net is temporarily off the air; | Henry Spencer
mail to henry at zoo.utoronto.ca instead. | henry@spsystems.net
| |
| George Neuner 2006-02-01, 3:58 am |
| On 31 Jan 2006 09:57:37 -0500, "Edsko de Vries" <devriese@cs.tcd.ie>
wrote:
>-2147483648 is a valid negative number (assuming 32-bit numbers),
>but the integer 2147483648 is _not_ a valid positive number.
>Even requiring a space before the operator
>does not help, because "1 -1" will be recognised as INT(1), INT(-1)
>instead of INT(1), OP(-), INT(1).
>
>Is there a good solution that solves both problems?
You can solve it two ways: require that white space or another
operator (such as parentheses) follow the binary operator, or restrict
the range of integer constants so that unary negation is always
possible.
Or you could try a suitably condescending error message ... it's
obvious that "1-1" is a misspelling of "H" 8-)
George
| |
| Thomas David Rivers 2006-02-01, 3:58 am |
| Russ Cox wrote:
>
>
> I know a few vendor-supplied C compilers that get this wrong,
> printing warnings about "positive constant overflows" or something
> like that, which is all the more confusing an error when you look at
> it and say "but that's *not* a positive constant!".
>
Your C compiler is correct. In standard C there are no "negative"
integer constants.
See section 6.4.4.1 of the C99 standard.
What is happening here, on a 32-bit 2's compliment machine is that
the value is too large to fit into a signed 'int', so
it becomes an 'unsigned int', following the type hierarchy
in section 6.4.4.1. You've then applied a unary negation
to an 'unsigned int' and the result is also an 'unsigned int' (see
section 6.5.3.3).... it's not clear what the arithmetic
negation of an 'unsigned int' would be.
For this example, our compiled generates:
Warning #2296: unary negation applied to an unsigned type
Some other compilers also generate warnings like "constant is so large
that it is unsigned", I think that's one I've seen from GCC before.
So - to go back to what the original post was about, the C language
has the same problem.
Our solution is to keep constants as strings. When it is time for
the value of the constant to be evaluated, we have to examine it
to see which type can "hold" it, and place it in appropriate
"containers" for that type.
This can be particularly cumbersome in a cross-platform environment,
such as a 32-bit host compiling for a 64-bit machine... where
there would be a requirement for manipulating large values in
a non-native fashion (e.g. some "bignum" implemenation.)
- Dave Rivers -
--
rivers@dignus.com Work: (919) 676-0847
Get your mainframe programming tools at http://www.dignus.com
| |
| glen herrmannsfeldt 2006-02-02, 7:07 pm |
| Edsko de Vries wrote:
> I have a question regarding lexical analysis. I recently came across a
> bug in our lexical analyser in phc (www.phpcompiler.org), that I am
> unsure how to solve. This is the problem: our current definition for
> integer constant looks something like
(snip)
> This works fine, with the sole exception of the number "-2147483648".
> The problem is, of course, overflow: -2147483648 is a valid negative
> number (assuming 32-bit numbers), but the integer 2147483648 is _not_ a
> valid positive number. Thus, the above method of dealing with "-" as a
> unary operator breaks down.
(snip)
Look at the language definition. C allows only positive constants
and unary operators, as you say, though a constant expression might
have the value -2147483648.
Fortran allows only positive constants in expressions, but optionally
signed constants in DATA statements. It might not be required, but I
believe most compilers will allow the maximum negative number in a DATA
statement.
-- glen
| |
| Hans-Peter Diettrich 2006-02-02, 7:07 pm |
| Edsko de Vries wrote:
> This works fine, with the sole exception of the number "-2147483648".
> The problem is, of course, overflow: -2147483648 is a valid negative
> number (assuming 32-bit numbers), but the integer 2147483648 is _not_ a
> valid positive number. Thus, the above method of dealing with "-" as a
> unary operator breaks down.
When you separate the sign from the numbers, you should use unsigned
values to prevent overflows. Shouldn't your code also accept unsigned
integers > 2^31?
DoDi
| |
| David R Tribble 2006-02-08, 9:46 am |
| David R Tribble wrote:
> Another method is to convert all but the last digit into an integer,
> so that <2147483648> becomes <214748364, 8>. Then a simple
> range comparison can be made to determine if the token is too
> big for a valid integer value. This method avoids overflow.
Just to add a note...
There is a rule (I thought it was Horner's Rule, but that's not right),
which allows you to check whether adding a digit to a number will
result in overflow before doing the actual multiply and add of the
digit.
-drt
|
|
|
|
|