Code Comments
Programming Forum and web based access to our favorite programming groups.michael@foord.net (Fuzzyman) wrote in message news:<8089854e.0403220021.db39215@posting.goo gle.com>... > "Robert Brewer" <fumanchu@amor.org> wrote in message news:<mailman.194.107 9807509.742.python-list@python.org>... > > True > > Thanks for your reply, it was both helpful and interesting. > Unfortunately it only confirmed my suspicion that what I wanted to do > wasn't possible :-( > > The database 'module' I'm using (which is basically fine for my other > purposes) does searches through it's records using regular > expressions. A full search through 1800 records (each record can be a > couple of k of text) takes about 0.2 seconds - which is fine.... but > doing 4 or 5 searches and comparing results *isn't* (0.2 seconds delay > is ok - 0.8-1.0 seconds isn't).... So I'm currently just searching for > the longest word - KirbBase then returns the full *text* of each song > containing that word... and I'm just checking each song to see if it > has the other words :-) > > This works fine, isn't noticeably slow, but isn't as elegant as I'd > hoped... > > Regards, > > > Fuzzy > > http://www.voidspace.org.uk/atlanti...ythonutils.html > Hi Fuzzyman. The heuristic solution you've come up with (i.e, searching using the longest term and then filtering/ranking the results against the remaining terms) is in fact quite well founded in practice. Many years ago, Miller and Newman determined that there are two length-frequency statistical patterns in written English, one for content words and another for function words. The content-word pattern is regular and demonstrates a very strong inverse relationship between word length and word frequency (i.e., the longer the word, the less frequent). There are, of course, many exceptions, but the pattern is very strong; and there are interesting information theoretic arguments why this makes sense (if you accept that human languages operate to minimize the costs of communication). In the absence of statistics collected on your specific body of text, this is a very useful crutch. Miller, George A. and Edwin B. Newman. 1958. "Tests of a Statistical Explanation of the Rank-Frequency Relation for Word in Written English." American Journal of Psychology (71): 209-258. Miller, George A., E. B. Newman, and E. A. Friedman. 1958. "Length-Frequency Statistics for Written English." Information and Control (1): 370-389. On the other hand, it is also possible to (a) identify the full set of candidate search terms available (this would allow you to fail quickly on search terms that can't be found); and (b) determine their frequencies (this would allow you to search for terms in inverse frequency, if that's what you really want). Creating an inverted index is, of course, a possibility, but text search is so fast and, furthermore, you don't have the complexities of creating and maintaining the index itself. Wu and Manber, the developers of the 'agrep' tool (which uses some really fancy and powerful partial matching techniques and is used by the Glimpse and Webglimpse search engines), have argued precisely this point. Sun Wu and Udi Manber: Fast Text Searching Allowing Errors. Communications of the ACM, pp. 83-91, Vol. 35, No. 10, Oct. 1992, USA. Sun Wu and Udi Manber: AGREP - A Fast Approximate Pattern-matching Tool. Proceedings of the Winter 1992 USENIX Conference San Francisco, USA, 20.-24. Jan. 1992, pp. 153-162, Berkeley, USA, 1991. Udi Manber and Sun Wu: Approximate Pattern Matching. BYTE pp. 281-292, November 1992. There is a Python port of agrep available as a module called 'agrepy' at the following site: http://www.bio.cam.ac.uk/~mw263/pyagrep.html As noted in an earlier post, regexes alone can't do what you want. There are other techniques that might do what you're looking for, for example, Aho-Corasick (and some other) automata. Aho-Corasick automata are able to 'parallelize' pattern matching so that multiple strings can be matched against a text in a single pass. Aho-Corasick automata are easy to implement in both their deterministic and non-deterministic forms in Python. Aho, A., and Corasick, M. Efficient String Matching: An Aid to Bibliographic Search. Commun. ACM 18, 6 (June 1975), 333--340. http://www-sr.informatik.uni-tuebin.../Automaton.html Good luck, Win
Post Follow-up to this messagewcarus@comcast.net (Win Carus) wrote in message news:<fc91898a.0403250748.5ceb5a41@posting. google.com>... > michael@foord.net (Fuzzyman) wrote in message news:<8089854e.0403220021.db 39215@posting.google.com>... [snip..] > > Hi Fuzzyman. > > The heuristic solution you've come up with (i.e, searching using the > longest term and then filtering/ranking the results against the > remaining terms) is in fact quite well founded in practice. Many > years ago, Miller and Newman determined that there are two > length-frequency statistical patterns in written English, one for > content words and another for function words. The content-word pattern > is regular and demonstrates a very strong inverse relationship between > word length and word frequency (i.e., the longer the word, the less > frequent). There are, of course, many exceptions, but the pattern is > very strong; and there are interesting information theoretic arguments > why this makes sense (if you accept that human languages operate to > minimize the costs of communication). In the absence of statistics > collected on your specific body of text, this is a very useful crutch. > > Miller, George A. and Edwin B. Newman. 1958. "Tests of a Statistical > Explanation of the Rank-Frequency Relation for Word in Written > English." American Journal of Psychology (71): 209-258. > > Miller, George A., E. B. Newman, and E. A. Friedman. 1958. > "Length-Frequency Statistics for Written English." Information and > Control (1): 370-389. > > On the other hand, it is also possible to (a) identify the full set of > candidate search terms available (this would allow you to fail quickly > on search terms that can't be found); and (b) determine their > frequencies (this would allow you to search for terms in inverse > frequency, if that's what you really want). Creating an inverted index > is, of course, a possibility, but text search is so fast and, > furthermore, you don't have the complexities of creating and > maintaining the index itself. Wu and Manber, the developers of the > 'agrep' tool (which uses some really fancy and powerful partial > matching techniques and is used by the Glimpse and Webglimpse search > engines), have argued precisely this point. > > Sun Wu and Udi Manber: Fast Text Searching Allowing Errors. > Communications of the ACM, pp. 83-91, Vol. 35, No. 10, Oct. 1992, USA. > > Sun Wu and Udi Manber: AGREP - A Fast Approximate Pattern-matching > Tool. Proceedings of the Winter 1992 USENIX Conference San Francisco, > USA, 20.-24. Jan. 1992, pp. 153-162, Berkeley, USA, 1991. > > Udi Manber and Sun Wu: Approximate Pattern Matching. BYTE pp. 281-292, > November 1992. > > There is a Python port of agrep available as a module called 'agrepy' > at the following site: > > http://www.bio.cam.ac.uk/~mw263/pyagrep.html > > As noted in an earlier post, regexes alone can't do what you want. > There are other techniques that might do what you're looking for, for > example, Aho-Corasick (and some other) automata. Aho-Corasick > automata are able to 'parallelize' pattern matching so that multiple > strings can be matched against a text in a single pass. Aho-Corasick > automata are easy to implement in both their deterministic and > non-deterministic forms in Python. > > Aho, A., and Corasick, M. Efficient String Matching: An Aid to > Bibliographic Search. Commun. ACM 18, 6 (June 1975), 333--340. > > http://www-sr.informatik.uni-tuebin.../Automaton.html > > Good luck, > > Win Interesting - and thanks for your help. Regards, Fuzzy http://www.voidspace.org.uk/atlanti...ythonutils.html
Post Follow-up to this message"Greg Ewing (using news.cis.dfn.de)" <ieyf4fu02@sneakemail.com> wrote in message news:<c3tj on$2c6trb$1@ID-169208.news.uni-berlin.de>... > Fuzzyman wrote: > > If you need to do a lot of searches of this kind, you might > find it worthwhile to build an index indicating which songs > each word occurs in. Hmmmm...... possibly. It seems a shame to store a database with all the songs in *and* build a full index. For this project, the solution I have is fine. I may well need to look at an indexer for another one soon though. Thanks. Regards, Fuzzy http://www.voidspace.org.uk/atlanti...ythonutils.html
Post Follow-up to this messagePowered by vBulletin
Copyright 2000-2006 Jelsoft Enterprises Limited.