Home > Archive > Compilers > February 2006 > Problem with flex, parsing a large file
You are viewing an archived Text-only version of the thread.
To view this thread in it's original format and/or if you want to reply to
this thread please [click here]
| Author |
Problem with flex, parsing a large file
|
|
| David Deharbe 2006-02-08, 9:46 am |
| Dear all,
I am writing a compiler for TSTP, a small language (or format), for
theorem provers. I am using flex-2.5.4 and GNU bison-1.28, and I am
encountering a problem that seems to be related to flex.
Currently, I have developped the following files:
- tstp.lex: input to flex
- tstp.y: input to bison
- test.c: a program that open a given file, and calls the generated
parser on this file.
While parsing a large file (>25MB), the program stopped and output the
following message:
"input buffer overflow, can't enlarge buffer because scanner uses REJECT".
I was puzzled, as the scanner does not use REJECT explicitly. I
googled the error message and found out that this may be caused by the
yylineno option. So I rewrote the scanner and coded myself the yyline
no option.
After that, parsed again the large file, and the parser seems to get
lost as it does not complete execution in a reasonable amount of time.
So I tried to localize the source of the error and split this large
file into several mid-sized files, each with 1000 lines. The program
was able to parse all of them in 9 seconds!
I included actions in the parser to print the line number to see where
things went wrong (line 3437). I regenerated the program using the
available debug options. The resulting program was able to parse past
line 3437 (I had to stop it when it reached line 7200)!
I include the scanner at the end of this message. If anyone can
provide me a solution of this problem, or a clue on how to solve it, I
would be grateful! You can reach me at "deharbe@gmail.com".
By the way: uname -a yields:
Darwin david-deharbes-ibook-g4.local 8.3.0 Darwin Kernel Version
8.3.0: Mon Oct 3 20:04:04 PDT 2005;
root:xnu-792.6.22.obj~2/RELEASE_PPC Power Macintosh powerpc
Best regards,
David.
--
/* scanner begins*/
%{
#include "tstpsyn.h"
extern int tstp_lineno;
%}
%option noyywrap
%x c_comment
/* <upper word> ::= <A-Z><a-z0-9A-Z_>* */
upper_word [A-Z][a-zA-Z0-9_]*
/* <lower word> ::= <a-z><a-z0-9A-Z_>* */
lower_word [a-z][a-zA-Z0-9_]*
/* <single quoted> ::= '<char>*' */
single_quoted '[^']*'
/* <double quoted> ::= "<char>*" */
double_quoted \"[^\"]*\"
/* <real> ::= <integer><decimal part>
<sign> ::= + | - | <null>
<unsigned integer> ::= <0-9>+
<decimal part> ::= .<0-9>+ | <null> */
/* DD: we cannot maintain the ambiguity in the TPTP3/TSTP definition that
any integer is also a real */
unsigned_integer [0-9]+
integer [\-\+]?{unsigned_integer}
decimal_part \.[0-9]*
real {integer}{decimal_part}
percent_comment \#[^$]*$
%%
"/*" BEGIN(c_comment);
<c_comment>[^*\n]*
<c_comment>[^*\n]*\n { printf("%i\n", ++tstp_lineno); }
<c_comment>"*"+[^*/\n]*
<c_comment>"*"+[^*/\n]*\n { printf("%i\n", ++tstp_lineno); }
<c_comment><<EOF>> { tstp_error("unterminated comment"); yyterminate(); }
<c_comment>"*"+"/" BEGIN(INITIAL);
"equality:s" { return THEORY_NAME; } /* DD: see comment in file tstp.y */
"equality" { return THEORY_NAME; }
"axiom_of_choice" { return TK_AXIOM_OF_CHOICE ; }
"axiom" { return TK_AXIOM ; }
"cnf" { return TK_CNF ; }
"conjecture" { return TK_CONJECTURE ; }
"creator" { return TK_CREATOR ; }
"definition" { return TK_DEFINITION ; }
"description" { return TK_DESCRIPTION ; }
"fof" { return TK_FOF ; }
"file" { return TK_FILE ; }
"hypothesis" { return TK_HYPOTHESIS ; }
"include" { return TK_INCLUDE ; }
"inference" { return TK_INFERENCE ; }
"introduced" { return TK_INTRODUCED ; }
"iquote" { return TK_IQUOTE ; }
"lemma_conjecture" { return TK_LEMMA_CONJECTURE ; }
"lemma" { return TK_LEMMA ; }
"negated_conjecture" { return TK_NEGATED_CONJECTURE ; }
"plain" { return TK_PLAIN ; }
"refutation" { return TK_REFUTATION ; }
"status" { return TK_STATUS ; }
"tautology" { return TK_TAUTOLOGY ; }
"theorem" { return TK_THEOREM ; }
"theory" { return TK_THEORY ; }
"unknown" { return TK_UNKNOWN ; }
"&" { return TK_AND ; }
"!=" { return TK_DIFF ; }
"=" { return TK_EQU ; }
"equal" { return TK_EQUAL ; }
"<=>" { return TK_EQV ; }
"$false" { return TK_FALSE ; }
"=>" { return TK_LIMPL ; }
"<~>" { return TK_NEQV ; }
"~&" { return TK_NAND ; }
"~|" { return TK_NOR ; }
"~" { return TK_NOT ; }
"|" { return TK_OR ; }
"<=" { return TK_RIMPL ; }
"$true" { return TK_TRUE ; }
"!" { return TK_BANG ; }
"?" { return TK_QUESTION_MARK ; }
"+" { return TK_PLUS ; }
"-" { return TK_MINUS ; }
":" { return TK_COLON ; }
"," { return TK_COMMA ; }
"." { return TK_DOT ; }
"(" { return TK_LP ; }
")" { return TK_RP ; }
"[" { return TK_LSQ ; }
"]" { return TK_RSQ ; }
{upper_word} { return UPPER_WORD ; }
{lower_word} { return LOWER_WORD ; }
{single_quoted} { return SINGLE_QUOTED ; }
{double_quoted} { return DOUBLE_QUOTED ; }
{unsigned_integer} { return UNSIGNED_INTEGER ; }
{real} { return REAL ; }
{percent_comment} { printf("%i\n", ++tstp_lineno); return PERCENT_COMMENT ; }
"\n" { printf("%i\n", ++tstp_lineno); }
[ \t]+ /* eat up space */
/*. { tstp_error("unrecognized character"); }*/
%%
/* user code */
/* tstp_clean frees the resources allocated by the scanner */
void
tstp_clean()
{
}
/* scanner ends */
| |
| Chris Dodd 2006-02-08, 9:46 am |
| David Deharbe <deharbe@gmail.com> wrote in news:06-02-036@comp.compilers:
> I am writing a compiler for TSTP, a small language (or format), for
> theorem provers. I am using flex-2.5.4 and GNU bison-1.28, and I am
> encountering a problem that seems to be related to flex.
>
> While parsing a large file (>25MB), the program stopped and output the
> following message:
> "input buffer overflow, can't enlarge buffer because scanner uses
> REJECT".
The problem almost certainly is coming from your "string" tokens:
> /* <single quoted> ::= '<char>*' */
> single_quoted '[^']*'
> /* <double quoted> ::= "<char>*" */
> double_quoted \"[^\"]*\"
:
> {single_quoted} { return SINGLE_QUOTED ; }
> {double_quoted} { return DOUBLE_QUOTED ; }
For either of these tokens, on seeing a quote character, it will
slurp up the input until it finds the matching end quote. If that
is missing, however, it will read in the entire rest of the input
into the input buffer (resizing it larger and larger to hold it). So
when you have REJECT (from automatic yylineno), you get the error,
and if you get rid of REJECT, you still try to pull in the entire input.
Since you then apparently have no rule that can match a quote without
a matching quote, it gets lost.
The usual solution is to not allow newlines in strings and more carefully
detect malformed strings:
single_quoted '[^'\n]*'
double_quoted \"[^\"\n]*\"
%%
{single_quoted} { return SINGLE_QUOTED ; }
{double_quoted} { return DOUBLE_QUOTED ; }
' { error("unmatched '"); }
\" { error("unmatched \""); }
Chris Dodd
cdodd@acm.org
| |
| David Deharbe 2006-02-11, 7:01 pm |
| The suggestion of Chris Dodd indeed resolves my problem (although I do
not understand quite why, since the problematic input does not have any
unmatched quote). I also uncommented the default rule to handle
unrecognized characters.
Thanks,
David.
--
|
|
|
|
|