Code Comments

Programming Forum and web based access to our favorite programming groups.
For Programmers: Free Programming Magazines
Registration is free! Edit your profileCalendarFind other membersFrequently Asked QuestionsSearch -> 
Post New Thread











Thread
Author

Re: Regular Expression AND mach
michael@foord.net (Fuzzyman) wrote in message news:<8089854e.0403220021.db39215@posting.goo
gle.com>...
> "Robert Brewer" <fumanchu@amor.org> wrote in message news:<mailman.194.107
9807509.742.python-list@python.org>... 
> 
>  True 
>
> Thanks for your reply, it was both helpful and interesting.
> Unfortunately it only confirmed my suspicion that what I wanted to do
> wasn't possible :-(
>
> The database 'module' I'm using (which is basically fine for my other
> purposes) does searches through it's records using regular
> expressions. A full search through 1800 records (each record can be a
> couple of k of text) takes about 0.2 seconds - which is fine.... but
> doing 4 or 5 searches and comparing results *isn't* (0.2 seconds delay
> is ok - 0.8-1.0 seconds isn't).... So I'm currently just searching for
> the longest word - KirbBase then returns the full *text* of each song
> containing that word... and I'm just checking each song to see if it
> has the other words :-)
>
> This works fine, isn't noticeably slow, but isn't as elegant as I'd
> hoped...
>
> Regards,
>
>
> Fuzzy
>
> http://www.voidspace.org.uk/atlanti...ythonutils.html
> 

Hi Fuzzyman.

The heuristic solution you've come up with (i.e, searching using the
longest term and then filtering/ranking the results against the
remaining terms) is in fact quite well founded in practice.  Many
years ago, Miller and Newman determined that there are two
length-frequency statistical patterns in written English, one for
content words and another for function words. The content-word pattern
is regular and demonstrates a very strong inverse relationship between
word length and word frequency (i.e., the longer the word, the less
frequent). There are, of course, many exceptions, but the pattern is
very strong; and there are interesting information theoretic arguments
why this makes sense (if you accept that human languages operate to
minimize the costs of communication). In the absence of statistics
collected on your specific body of text, this is a very useful crutch.

Miller, George A. and Edwin B. Newman. 1958. "Tests of a Statistical
Explanation of the Rank-Frequency Relation for Word in Written
English." American Journal of Psychology (71): 209-258.

Miller, George A., E. B. Newman, and E. A. Friedman. 1958.
"Length-Frequency Statistics for Written English." Information and
Control (1): 370-389.

On the other hand, it is also possible to (a) identify the full set of
candidate search terms available (this would allow you to fail quickly
on search terms that can't be found); and (b) determine their
frequencies (this would allow you to search for terms in inverse
frequency, if that's what you really want). Creating an inverted index
is, of course, a possibility, but text search is so fast and,
furthermore, you don't have the complexities of creating and
maintaining the index itself.  Wu and Manber, the developers of the
'agrep' tool (which uses some really fancy and powerful partial
matching techniques and is used by the Glimpse and Webglimpse search
engines), have argued precisely this point.

Sun Wu and Udi Manber: Fast Text Searching Allowing Errors.
Communications of the ACM, pp. 83-91, Vol. 35, No. 10, Oct. 1992, USA.

Sun Wu and Udi Manber: AGREP - A Fast Approximate Pattern-matching
Tool. Proceedings of the Winter 1992 USENIX Conference San Francisco,
USA, 20.-24. Jan. 1992, pp. 153-162, Berkeley, USA, 1991.

Udi Manber and Sun Wu: Approximate Pattern Matching. BYTE pp. 281-292,
November 1992.

There is a Python port of agrep available as a module called 'agrepy'
at the following site:

http://www.bio.cam.ac.uk/~mw263/pyagrep.html

As noted in an earlier post, regexes alone can't do what you want.
There are other techniques that might do what you're looking for, for
example, Aho-Corasick (and some other) automata.  Aho-Corasick
automata are able to 'parallelize' pattern matching so that multiple
strings can be matched against a text in a single pass. Aho-Corasick
automata are easy to implement in both their deterministic and
non-deterministic forms in Python.

Aho, A., and Corasick, M. Efficient String Matching: An Aid to
Bibliographic Search. Commun. ACM 18, 6 (June 1975), 333--340.

http://www-sr.informatik.uni-tuebin.../Automaton.html

Good luck,

Win

Report this thread to moderator Post Follow-up to this message
Old Post
Win Carus
03-27-04 05:20 AM


Re: Regular Expression AND mach
wcarus@comcast.net (Win Carus) wrote in message news:<fc91898a.0403250748.5ceb5a41@posting.
google.com>...
> michael@foord.net (Fuzzyman) wrote in message news:<8089854e.0403220021.db
39215@posting.google.com>... 

[snip..]
 
>
> Hi Fuzzyman.
>
> The heuristic solution you've come up with (i.e, searching using the
> longest term and then filtering/ranking the results against the
> remaining terms) is in fact quite well founded in practice.  Many
> years ago, Miller and Newman determined that there are two
> length-frequency statistical patterns in written English, one for
> content words and another for function words. The content-word pattern
> is regular and demonstrates a very strong inverse relationship between
> word length and word frequency (i.e., the longer the word, the less
> frequent). There are, of course, many exceptions, but the pattern is
> very strong; and there are interesting information theoretic arguments
> why this makes sense (if you accept that human languages operate to
> minimize the costs of communication). In the absence of statistics
> collected on your specific body of text, this is a very useful crutch.
>
> Miller, George A. and Edwin B. Newman. 1958. "Tests of a Statistical
> Explanation of the Rank-Frequency Relation for Word in Written
> English." American Journal of Psychology (71): 209-258.
>
> Miller, George A., E. B. Newman, and E. A. Friedman. 1958.
> "Length-Frequency Statistics for Written English." Information and
> Control (1): 370-389.
>
> On the other hand, it is also possible to (a) identify the full set of
> candidate search terms available (this would allow you to fail quickly
> on search terms that can't be found); and (b) determine their
> frequencies (this would allow you to search for terms in inverse
> frequency, if that's what you really want). Creating an inverted index
> is, of course, a possibility, but text search is so fast and,
> furthermore, you don't have the complexities of creating and
> maintaining the index itself.  Wu and Manber, the developers of the
> 'agrep' tool (which uses some really fancy and powerful partial
> matching techniques and is used by the Glimpse and Webglimpse search
> engines), have argued precisely this point.
>
> Sun Wu and Udi Manber: Fast Text Searching Allowing Errors.
> Communications of the ACM, pp. 83-91, Vol. 35, No. 10, Oct. 1992, USA.
>
> Sun Wu and Udi Manber: AGREP - A Fast Approximate Pattern-matching
> Tool. Proceedings of the Winter 1992 USENIX Conference San Francisco,
> USA, 20.-24. Jan. 1992, pp. 153-162, Berkeley, USA, 1991.
>
> Udi Manber and Sun Wu: Approximate Pattern Matching. BYTE pp. 281-292,
> November 1992.
>
> There is a Python port of agrep available as a module called 'agrepy'
> at the following site:
>
> http://www.bio.cam.ac.uk/~mw263/pyagrep.html
>
> As noted in an earlier post, regexes alone can't do what you want.
> There are other techniques that might do what you're looking for, for
> example, Aho-Corasick (and some other) automata.  Aho-Corasick
> automata are able to 'parallelize' pattern matching so that multiple
> strings can be matched against a text in a single pass. Aho-Corasick
> automata are easy to implement in both their deterministic and
> non-deterministic forms in Python.
>
> Aho, A., and Corasick, M. Efficient String Matching: An Aid to
> Bibliographic Search. Commun. ACM 18, 6 (June 1975), 333--340.
>
> http://www-sr.informatik.uni-tuebin.../Automaton.html
>
> Good luck,
>
> Win


Interesting - and thanks for your help.

Regards,

Fuzzy

http://www.voidspace.org.uk/atlanti...ythonutils.html

Report this thread to moderator Post Follow-up to this message
Old Post
Fuzzyman
03-27-04 05:20 AM


Re: Regular Expression AND mach
"Greg Ewing (using news.cis.dfn.de)" <ieyf4fu02@sneakemail.com> wrote in message news:<c3tj
on$2c6trb$1@ID-169208.news.uni-berlin.de>...
> Fuzzyman wrote: 
>
> If you need to do a lot of searches of this kind, you might
> find it worthwhile to build an index indicating which songs
> each word occurs in.

Hmmmm...... possibly. It seems a shame to store a database with all
the songs in *and* build a full index.

For this project, the solution I have is fine. I may well need to look
at an indexer for another one soon though.

Thanks.

Regards,


Fuzzy

http://www.voidspace.org.uk/atlanti...ythonutils.html

Report this thread to moderator Post Follow-up to this message
Old Post
Fuzzyman
03-27-04 05:20 AM


Sponsored Links




Last Thread Next Thread Next
Search this forum -> 
Post New Thread

Python archive

Show a Printable Version Send to friend Email This Page to Someone! subscribe to this thread Receive updates to this thread
Computer Consultants
Programming Jobs
Visual Basic Controls
SQL Server Programming
Webservices
Java Security
Visual Studio
C# Programming
Visual J++
Software engineering
Open source Software
Perl Programming
PHP Programming
ASP Programming
ASP .NET Programming
Visual Basic Programming
Windows Scripting Host
Java Programming
Java Help
Java Beans
VBScript
Cobol
MAC Applications
Unix Programming
Forum Jump:
All times are GMT. The time now is 04:18 AM.

 

Programming forum archive

Copyrights CodeComments.com 2004 - 2006

Powered by vBulletin Copyright 2000-2006 Jelsoft Enterprises Limited.