Home > Archive > Compilers > October 2006 > Create nfa from a regular expression
You are viewing an archived Text-only version of the thread.
To view this thread in it's original format and/or if you want to reply to
this thread please [click here]
| Author |
Create nfa from a regular expression
|
|
| Felix Dorner 2006-10-06, 7:03 pm |
| hi,
I Tried To Write A Simple Compiler That Creates A Nondeterministic
Finite Automaton From A Regular Expression, However I Am Really Not
Sure My Solution Is The Way To Go, Most Because I Cannot Figure Out How
To Introduce Operator Precedence. So the given expression is evaluated
simply from the left to the right. (Operator precedence can then be
forced by putting braces) There are some cases that are still to be
catched, but the main work is done. I would be really pleased to get
some comments on my code.
Thanks,
Felix
public class NFAFactory {
private static Stack<NFA> nfaStack;
private static Stack<Character> braceStack;
private static Tokenizer t;
private static int idCounter;
public static NFA makeNFA(String[] literals, String regex) throws
ParseException {
idCounter = 0;
nfaStack = new Stack<NFA>();
braceStack = new Stack<Character>();
t = new Tokenizer(literals, regex);
makeit();
return nfaStack.pop();
}
private static void makeit() throws ParseException {
char token;
while (true) {
token = t.next();
if (token == 0){
if (!braceStack.empty()){
throw new ParseException("ERROR: ) expected");
}
System.out.println("ENDE");
break;
}
else if (token == Tokenizer.LITERAL){
handleLiteral();
}
else if (token == Tokenizer.CONCAT){
handleConcat();
}
else if (token == Tokenizer.UNION){
handleUnion();
}
else if (token == Tokenizer.STAR){
handleStar();
}
else if (token == Tokenizer.LEFTB){
handleLeftBrace();
}
else if (token == Tokenizer.RIGHTB){
handleRightBrace();
break;
}
}
}
private static void handleLiteral() throws ParseException {
if (t.getLastToken() == Tokenizer.RIGHTB){
throw new ParseException("ERROR: ) cannot be followed by a
Literal.");
}
if (t.getLastToken() == Tokenizer.STAR){
throw new ParseException("ERROR: * cannot be followed by a
Literal");
}
NFA n = new NFA();
State s0 = new State(idCounter++);
State s1 = new State(idCounter++);
n.addState(s0);
n.addState(s1);
s0.addTransition(t.getLastLiteral(), s1);
n.setInitialState(s0);
n.setFinalState(s1);
nfaStack.push(n);
}
private static void handleUnion() throws ParseException{
if (t.getLastToken() == Tokenizer.LEFTB){
throw new ParseException("ERROR: ( cannot be followed by |");
}
if (t.getLastToken() == Tokenizer.UNION){
throw new ParseException("ERROR: | cannot be followed by |");
}
if (t.getLastToken() == Tokenizer.CONCAT){
throw new ParseException("ERROR: ; cannot be followed by |");
}
if (nfaStack.isEmpty()) {
throw new ParseException("ERROR: RegExp cannot start with '|'");
}
makeit();
NFA n2 = nfaStack.pop();
if (nfaStack.isEmpty()) {
throw new ParseException("ERROR: RegExp cannot end with '|'");
}
NFA n1 = nfaStack.pop();
n1.addStates(n2.getStates());
State start = new State(idCounter++);
State end = new State(idCounter++);
n1.addState(start);
n1.addState(end);
start.addTransition(null, n1.getInitialState());
start.addTransition(null, n2.getInitialState());
n1.getFinalState().addTransition(null, end);
n2.getFinalState().addTransition(null, end);
n1.setInitialState(start);
n1.setFinalState(end);
nfaStack.push(n1);
}
private static void handleConcat() throws ParseException {
if (t.getLastToken() == Tokenizer.LEFTB){
throw new ParseException("ERROR: ( cannot be followed by ;");
}
if (nfaStack.isEmpty()) {
throw new ParseException("ERROR: RegExp cannot start with ';'");
}
makeit();
concat();
}
private static void handleStar() throws ParseException {
if (t.getLastToken() != Tokenizer.RIGHTB){
throw new ParseException("* must be preceded by )");
}
NFA n1 = nfaStack.pop();
n1.getFinalState().addTransition(null, n1.getInitialState());
n1.getInitialState().addTransition(null, n1.getFinalState());
nfaStack.push(n1);
}
private static void handleLeftBrace() throws ParseException {
braceStack.push(Tokenizer.LEFTB);
// Handle )(, *( , a( as concatenations
if (t.getLastToken() == Tokenizer.LITERAL ||
t.getLastToken() == Tokenizer.RIGHTB ||
t.getLastToken() == Tokenizer.STAR) {
makeit();
concat();
}
else {
makeit();
}
}
private static void handleRightBrace() throws ParseException{
if (braceStack.isEmpty()){
throw new ParseException("ERROR: Too many )´s");
}
if (t.getLastToken() == Tokenizer.CONCAT){
throw new ParseException("ERROR: ; cannot be followed by )");
}
if (t.getLastToken() == Tokenizer.UNION){
throw new ParseException("ERROR: | cannot be followed by )");
}
braceStack.pop();
}
private static void concat() throws ParseException {
NFA n2 = nfaStack.pop();
if (nfaStack.isEmpty()) {
throw new ParseException("ERROR: RegExp cannot end with ';'");
}
NFA n1 = nfaStack.pop();
for (State s : n2.getStates()){
n1.addState(s);
}
n1.getFinalState().addTransition(null, n2.getInitialState());
n1.setFinalState(n2.getFinalState());
nfaStack.push(n1);
}
public static void main(String[] args) throws ParseException {
String[] lit = {"north", "east", "west", "south"};
String exp = "(south)*";
NFA f = NFAFactory.makeNFA(lit, exp);
f.print();
}
}
| |
| Pascal Bourguignon 2006-10-07, 4:14 am |
| "Felix Dorner" <felix.dorner@gmail.com> writes:
> I Tried To Write A Simple Compiler That Creates A Nondeterministic
> Finite Automaton From A Regular Expression, However I Am Really Not
> Sure My Solution Is The Way To Go, Most Because I Cannot Figure Out How
> To Introduce Operator Precedence. So the given expression is evaluated
> simply from the left to the right. (Operator precedence can then be
> forced by putting braces) There are some cases that are still to be
> catched, but the main work is done. I would be really pleased to get
> some comments on my code.
Precedence is implicit, deduced from the grammar rules, or explicitely
specified by parentheses in the expression.
So, the first step is to define your grammar for your regular
expressions.
For example, reading:
man 7 regex
Start Non Terminal:
re
Productions:
re ::= branches .
branches ::= branch | branch '|' branches .
branch ::= piece | piece branch .
piece ::= atom | atom '*' | atom '+' | atom '?' | atom bound .
bound ::= '{' integer '}' | '{' integer ',' '}'
| '{' integer ',' integer '}' .
atom ::= '(' re ')' | '(' ')' | bracket | '.'
| escaped-special-char | normal-char .
bracket ::= '[' ['^'] [']'|'-'] list-of-char ['-'] ']'
list-of-char ::= bracket-char | collating-sequence
| equivalence-class | character-class .
collating-sequence ::= '[.' collating-item '.]' .
collating-item ::= character | multi-character | collating-sequence-name .
equivalence-class ::= '[=' character | multi-character '=]' .
character-class ::= '[:' class-name ':]' .
class-name ::= identifier .
collating-sequence-name ::= identifier .
Terminals (in addition to the literal terminals in the grammar rules):
integer ::= /[0-9]+/ .
escaped-char ::= /\\./ .
normal-char ::= /[^][\\|()*+?]/
bracket-char ::= /[^]-]/
identifier ::= /[a-zA-Z]+/
character ::= /./
multi-character ::= /.+/
Don't mind the complexity of list-of-char, it merely reduces to set of
characters.
The precedence of * over |, or of [] over *, is given by the grammar
rules. A parser that analyses this grammar will automatically
implement these operator precedences.
So, once you've parsed the regular expression, you get a parse tree
with nodes of the form:
node ::= (ALT node...) . -- a|b|c
node ::= (SEQ node...) . -- abc
node ::= (REP min max node) . -- x* x+ x? x{...}
node ::= (SET characters) . -- bracket [xxx]
node ::= (INV characters) . -- bracket [^xxx]
For example, (ab)+|cd(e[a-e]){3,}|[^a]{0,5}
would result in this tree:
(ALT (REP 1 INFINITE (SEQ (SET "a") (SET "b")))
(SEQ (SET "c")
(SET "d")
(REP 3 INFINITE (SEQ (SET "e")
(SET "abcde"))))
(SEQ (REP 0 5 (INV "a"))))
Here, there's no notion of operator precedence anymore. You can have
subnodes of any kind, the structure of the tree directly indicates the
order of the operators.
Now, to generate a NFA from such a tree, you can easily do it
processing each kind of node:
(SET characters) translates to:
+-----(character-1)----+
| |
*---+-----(character-2)----+---->*
| |
| ... |
| |
+-----(character-n)----+
(INV characters) translates to the same, but for the characters that
are not in the set given in INV.
(SEQ node...) translates to:
*--(ε)-->[NFA for node 1]--+
|
+-----------(ε)------------+
|
v
*--(ε)-->[NFA for node 2]--+
|
+-----------(ε)------------+
|
v
....
*--(ε)-->[NFA for node n]--(ε)-->*
Or you can directly merge the final state of the NFA for node i with
the start state of the NFA for node i+1:
[NFA for node 1]--->[NFA for node 2]--...-->[NFA for node n]
(ALT node...) translates to:
+--(ε)--[NFA for node 1]--(ε)--+
| |
*---+--(ε)--[NFA for node 2]--(ε)--+---->*
| |
| ... |
| |
+--(ε)--[NFA for node n]--(ε)--+
(REP min max node) with max < INFINITE translates to:
*--(ε)-->[NFA for node]--+ \
| |
+-----------(ε)----------+ |
| |
v |
*--(ε)-->[NFA for node]--+ |
| |
+-----------(ε)----------+ | min times
| |
v |
.... |
*--(ε)-->[NFA for node]--+ |
| |
+-----------(ε)----------+ /
|
v
*------------(ε)-------->*---(ε)--+
| |
+-----------(ε)----------+ |
| |
v |
*--(ε)-->[NFA for node]--+---(ε)--+ \
| | |
+-----------(ε)----------+ | |
| | |
v | |
*--(ε)-->[NFA for node]--+---(ε)--+ |
| | |
+-----------(ε)----------+ | |
| | | (max - min - 1 ) times
v | |
*--(ε)-->[NFA for node]--+---(ε)--+ |
| | |
+-----------(ε)----------+ | |
| | /
v v
*--(ε)-->[NFA for node]---(ε)-----+--->*
(REP min INFINITE node) translates to:
*--(ε)-->[NFA for node]--+ \
| |
+-----------(ε)----------+ |
| |
v |
*--(ε)-->[NFA for node]--+ |
| |
+-----------(ε)----------+ | min times
| |
v |
.... |
*--(ε)-->[NFA for node]--+ |
| |
+-----------(ε)----------+ /
|
v
*------------(ε)------------>*---(ε)--->*
^ |
| |
+--(ε)--[NFA for node]--(ε)--+
This gives a NFA with some useless epsilon transaction, but you can
easily reduce this NFA or transform it into a DFA.
--
__Pascal Bourguignon__ http://www.informatimago.com/
|
|
|
|
|