For Programmers: Free Programming Magazines  


Home > Archive > Compilers > October 2007 > Perfecting State Elimination









You are viewing an archived Text-only version of the thread. To view this thread in it's original format and/or if you want to reply to this thread please [click here]

 

Author Perfecting State Elimination
Bjoern Hoehrmann

2007-10-01, 7:18 pm

Hi,

I am trying to find a method that will make regular expressions as
short as possible (say, using only concatenation, alternation, zero or
more repetition, characters and epsilon, I want to minimize the number
of characters) given limited resources. I could not find much material
on the problem on the web, except that I can turn the expression into
a minimal DFA and the DFA into a regular expression using state elimi-
nation, and that some removal sequences are better than others.

Unfortunately it turns out even the best removal sequence does not ge-
nerate the shortest regular expression. For example, with a DFA like

'>'
- - I -----------------+
'>' \
- - J -------------------+ Final State
'>' /
- - K -----------------+

the shortest expression would have only a single '>' at the end, but
as far as I understand it, plain state elimination will duplicate it.
Short of another approach, I wondered how to fix it and it's easy to
see that in the automaton above a Guard State can be introduced like

epsilon
- - I -----------------+
epsilon \ '>'
- - J ------------------ G -----------> Final State
epsilon /
- - K -----------------+

It's clear that this change does not semantically alter the automaton,
it does not make matters worse since eliminating the guard state first
would take us back to the original, and if it is eliminated after what
it is guarding, the regular expression will be shorter for some removal
sequence.

Put differently, the transformation would add the minimal amount of
guard states such that the automaton is deterministic from both ends
for all non-epsilon transitions, meaning it maximizes left-factoring
and right-factoring while minimizing the number of states [1]. From
this description it sounds similar to Brzozozski's DFA minimization
algorithm.

My question now is, is this transformation sufficient to ensure that
there is a state removal sequence such that state elimination would
generate the shortest possible regular expression as defined above?
If not, is there anything that would make it so?

Is it correct to say, in the model above, removing a state while it
has an incoming epsilon transition will never generate a shorter
regular expression than when removing the state after all epsilon
coming to it have been removed?

[1] Or, it minimizes the number of characters in the automaton that
state elimination could duplicate. Presumably though the result
does not need to be fully deterministic, specifically, if there
is a state with only one incoming transition on c and loops in c,
it does not need a guard, but I haven't thought much about that.

(Followup-To set to comp.compilers)

Thanks,
--
Bjvrn Hvhrmann 7 mailto:bjoern@hoehrmann.de 7 http://bjoern.hoehrmann.de
Weinh. Str. 22 7 Telefon: +49(0)621/4309674 7 http://www.bjoernsworld.de
68309 Mannheim 7 PGP Pub. KeyID: 0xA4357E78 7 http://www.websitedev.de/
Anton Ertl

2007-10-01, 7:18 pm

Bjoern Hoehrmann <bjoern@hoehrmann.de> writes:
> I am trying to find a method that will make regular expressions as
>short as possible (say, using only concatenation, alternation, zero or
>more repetition, characters and epsilon, I want to minimize the number
>of characters) given limited resources. I could not find much material
>on the problem on the web, except that I can turn the expression into
>a minimal DFA and the DFA into a regular expression using state elimi-
>nation, and that some removal sequences are better than others.


My feeling is that that approach is wrong, because a regexp
corresponds to an NFA, and a DFA can be much larger than an equivalent
NFA, so a regexp generated from the DFA is likely to be suboptimal,
unless the DFA->regexp step reintroduces nondeterminism.

- anton
--
M. Anton Ertl
anton@mips.complang.tuwien.ac.at
http://www.complang.tuwien.ac.at/anton/

Bjoern Hoehrmann

2007-10-02, 4:27 am

* Bjoern Hoehrmann wrote in comp.compilers:
>[...]


I originally posted this in August, I suppose it got stuck in the
moderator's queue. [Stuck at his ISP, actually. -John]

I might as well answer my own question. While the transformation I
discussed is generally a good one, there is of course the inevitable
blowup caused by determinization of the auto- maton.

Unfortunately in my case I already start out with a DFA and would
need a small epsilon-NFA instead. Now NFA minimization is hard
and I haven't found any suitable reduction algorithm that would be
useful for my purposes. What I did come up with is this algorithm
that will compute a comperatively good removal sequence:

1. Ensure the automaton has only a single start and a single
final state with no incoming and respectively outgoing
transitions by adding new states and epsilon transitions
as necessary.

2. Compute for each state x which states are on all paths
Start --> x --> Final. This can be done in two breadth-
first traversals; for endpoint --> endpoint only the end-
point is on the path. For all others, the states that are
on the paths for all successors/predecessors are on the
path plus the state itself.

A simple approach would be to use a bitmap, starting with
all cells set to true, then setting the row for the end-
point, and during the traversal binary-and'ing the rows
of sucessors/predecessors with the current state's row
(while keeping bitmap[x][x] true).

3. Remove a state only after removing all states that must
visit it. In simple terms, this will collapse the auto-
maton starting with the most deeply nested states and
going outwards until only Start and Final are left.

regards,
--
Bjvrn Hvhrmann 7 mailto:bjoern@hoehrmann.de 7 http://bjoern.hoehrmann.de
Weinh. Str. 22 7 Telefon: +49(0)621/4309674 7 http://www.bjoernsworld.de
68309 Mannheim 7 PGP Pub. KeyID: 0xA4357E78 7 http://www.websitedev.de/
Sponsored Links







Also available: Server administration forum archive | Web Design forum archive | Software forum archive | Hardware reviews archive

Copyright 2008 codecomments.com