For Programmers: Free Programming Magazines  


Home > Archive > Python > March 2004 > Re: Regular Expression AND mach









You are viewing an archived Text-only version of the thread. To view this thread in it's original format and/or if you want to reply to this thread please [click here]

 

Author Re: Regular Expression AND mach
Win Carus

2004-03-27, 12:20 am

michael@foord.net (Fuzzyman) wrote in message news:<8089854e.0403220021.db39215@posting.google.com>...[color=darkred]
> "Robert Brewer" <fumanchu@amor.org> wrote in message news:<mailman.194.1079807509.742.python-list@python.org>...
>
> True
>
> Thanks for your reply, it was both helpful and interesting.
> Unfortunately it only confirmed my suspicion that what I wanted to do
> wasn't possible :-(
>
> The database 'module' I'm using (which is basically fine for my other
> purposes) does searches through it's records using regular
> expressions. A full search through 1800 records (each record can be a
> couple of k of text) takes about 0.2 seconds - which is fine.... but
> doing 4 or 5 searches and comparing results *isn't* (0.2 seconds delay
> is ok - 0.8-1.0 seconds isn't).... So I'm currently just searching for
> the longest word - KirbBase then returns the full *text* of each song
> containing that word... and I'm just checking each song to see if it
> has the other words :-)
>
> This works fine, isn't noticeably slow, but isn't as elegant as I'd
> hoped...
>
> Regards,
>
>
> Fuzzy
>
> http://www.voidspace.org.uk/atlanti...ythonutils.html
>

Hi Fuzzyman.

The heuristic solution you've come up with (i.e, searching using the
longest term and then filtering/ranking the results against the
remaining terms) is in fact quite well founded in practice. Many
years ago, Miller and Newman determined that there are two
length-frequency statistical patterns in written English, one for
content words and another for function words. The content-word pattern
is regular and demonstrates a very strong inverse relationship between
word length and word frequency (i.e., the longer the word, the less
frequent). There are, of course, many exceptions, but the pattern is
very strong; and there are interesting information theoretic arguments
why this makes sense (if you accept that human languages operate to
minimize the costs of communication). In the absence of statistics
collected on your specific body of text, this is a very useful crutch.

Miller, George A. and Edwin B. Newman. 1958. "Tests of a Statistical
Explanation of the Rank-Frequency Relation for Word in Written
English." American Journal of Psychology (71): 209-258.

Miller, George A., E. B. Newman, and E. A. Friedman. 1958.
"Length-Frequency Statistics for Written English." Information and
Control (1): 370-389.

On the other hand, it is also possible to (a) identify the full set of
candidate search terms available (this would allow you to fail quickly
on search terms that can't be found); and (b) determine their
frequencies (this would allow you to search for terms in inverse
frequency, if that's what you really want). Creating an inverted index
is, of course, a possibility, but text search is so fast and,
furthermore, you don't have the complexities of creating and
maintaining the index itself. Wu and Manber, the developers of the
'agrep' tool (which uses some really fancy and powerful partial
matching techniques and is used by the Glimpse and Webglimpse search
engines), have argued precisely this point.

Sun Wu and Udi Manber: Fast Text Searching Allowing Errors.
Communications of the ACM, pp. 83-91, Vol. 35, No. 10, Oct. 1992, USA.

Sun Wu and Udi Manber: AGREP - A Fast Approximate Pattern-matching
Tool. Proceedings of the Winter 1992 USENIX Conference San Francisco,
USA, 20.-24. Jan. 1992, pp. 153-162, Berkeley, USA, 1991.

Udi Manber and Sun Wu: Approximate Pattern Matching. BYTE pp. 281-292,
November 1992.

There is a Python port of agrep available as a module called 'agrepy'
at the following site:

http://www.bio.cam.ac.uk/~mw263/pyagrep.html

As noted in an earlier post, regexes alone can't do what you want.
There are other techniques that might do what you're looking for, for
example, Aho-Corasick (and some other) automata. Aho-Corasick
automata are able to 'parallelize' pattern matching so that multiple
strings can be matched against a text in a single pass. Aho-Corasick
automata are easy to implement in both their deterministic and
non-deterministic forms in Python.

Aho, A., and Corasick, M. Efficient String Matching: An Aid to
Bibliographic Search. Commun. ACM 18, 6 (June 1975), 333--340.

http://www-sr.informatik.uni-tuebin.../Automaton.html

Good luck,

Win
Fuzzyman

2004-03-27, 12:20 am

wcarus@comcast.net (Win Carus) wrote in message news:<fc91898a.0403250748.5ceb5a41@posting.google.com>...
> michael@foord.net (Fuzzyman) wrote in message news:<8089854e.0403220021.db39215@posting.google.com>...

[snip..]
[color=darkred]
>
> Hi Fuzzyman.
>
> The heuristic solution you've come up with (i.e, searching using the
> longest term and then filtering/ranking the results against the
> remaining terms) is in fact quite well founded in practice. Many
> years ago, Miller and Newman determined that there are two
> length-frequency statistical patterns in written English, one for
> content words and another for function words. The content-word pattern
> is regular and demonstrates a very strong inverse relationship between
> word length and word frequency (i.e., the longer the word, the less
> frequent). There are, of course, many exceptions, but the pattern is
> very strong; and there are interesting information theoretic arguments
> why this makes sense (if you accept that human languages operate to
> minimize the costs of communication). In the absence of statistics
> collected on your specific body of text, this is a very useful crutch.
>
> Miller, George A. and Edwin B. Newman. 1958. "Tests of a Statistical
> Explanation of the Rank-Frequency Relation for Word in Written
> English." American Journal of Psychology (71): 209-258.
>
> Miller, George A., E. B. Newman, and E. A. Friedman. 1958.
> "Length-Frequency Statistics for Written English." Information and
> Control (1): 370-389.
>
> On the other hand, it is also possible to (a) identify the full set of
> candidate search terms available (this would allow you to fail quickly
> on search terms that can't be found); and (b) determine their
> frequencies (this would allow you to search for terms in inverse
> frequency, if that's what you really want). Creating an inverted index
> is, of course, a possibility, but text search is so fast and,
> furthermore, you don't have the complexities of creating and
> maintaining the index itself. Wu and Manber, the developers of the
> 'agrep' tool (which uses some really fancy and powerful partial
> matching techniques and is used by the Glimpse and Webglimpse search
> engines), have argued precisely this point.
>
> Sun Wu and Udi Manber: Fast Text Searching Allowing Errors.
> Communications of the ACM, pp. 83-91, Vol. 35, No. 10, Oct. 1992, USA.
>
> Sun Wu and Udi Manber: AGREP - A Fast Approximate Pattern-matching
> Tool. Proceedings of the Winter 1992 USENIX Conference San Francisco,
> USA, 20.-24. Jan. 1992, pp. 153-162, Berkeley, USA, 1991.
>
> Udi Manber and Sun Wu: Approximate Pattern Matching. BYTE pp. 281-292,
> November 1992.
>
> There is a Python port of agrep available as a module called 'agrepy'
> at the following site:
>
> http://www.bio.cam.ac.uk/~mw263/pyagrep.html
>
> As noted in an earlier post, regexes alone can't do what you want.
> There are other techniques that might do what you're looking for, for
> example, Aho-Corasick (and some other) automata. Aho-Corasick
> automata are able to 'parallelize' pattern matching so that multiple
> strings can be matched against a text in a single pass. Aho-Corasick
> automata are easy to implement in both their deterministic and
> non-deterministic forms in Python.
>
> Aho, A., and Corasick, M. Efficient String Matching: An Aid to
> Bibliographic Search. Commun. ACM 18, 6 (June 1975), 333--340.
>
> http://www-sr.informatik.uni-tuebin.../Automaton.html
>
> Good luck,
>
> Win



Interesting - and thanks for your help.

Regards,

Fuzzy

http://www.voidspace.org.uk/atlanti...ythonutils.html
Fuzzyman

2004-03-27, 12:20 am

"Greg Ewing (using news.cis.dfn.de)" <ieyf4fu02@sneakemail.com> wrote in message news:<c3tjon$2c6trb$1@ID-169208.news.uni-berlin.de>...
> Fuzzyman wrote:
>
> If you need to do a lot of searches of this kind, you might
> find it worthwhile to build an index indicating which songs
> each word occurs in.


Hmmmm...... possibly. It seems a shame to store a database with all
the songs in *and* build a full index.

For this project, the solution I have is fine. I may well need to look
at an indexer for another one soon though.

Thanks.

Regards,


Fuzzy

http://www.voidspace.org.uk/atlanti...ythonutils.html
Sponsored Links







Also available: Server administration forum archive | Web Design forum archive | Software forum archive | Hardware reviews archive

Copyright 2010 codecomments.com