Code Comments
Programming Forum and web based access to our favorite programming groups.What's the neatest and/or most efficient way of testing if one of a set of strings (contained in a dictionary, list or similar) is a sub-string of a given string? I.e. I have a string delivered into my program and I want to see if any of a set of strings is a substring of the string I have been given. It's quite OK to stop at the first one found. Ideally the strings being searched through will be the keys of a dictionary but this isn't a necessity, they can just be in a list if it could be done more efficiently using a list. Is this the best one can do (ignoring the likelihood that I've got some syntax wrong) :- # l is the list # str is the incoming string answer = "" for x in l: if str.find(x) < 0: continue answer = x -- Chris Green
Post Follow-up to this messageOn Apr 3, 12:37 pm, tinn...@isbd.co.uk wrote: > What's the neatest and/or most efficient way of testing if one of a > set of strings (contained in a dictionary, list or similar) is a > sub-string of a given string? > > I.e. I have a string delivered into my program and I want to see if > any of a set of strings is a substring of the string I have been > given. It's quite OK to stop at the first one found. Ideally the > strings being searched through will be the keys of a dictionary but > this isn't a necessity, they can just be in a list if it could be done > more efficiently using a list. > > Is this the best one can do (ignoring the likelihood that I've got > some syntax wrong) :- > > # l is the list > # str is the incoming string > answer = "" > for x in l: > if str.find(x) < 0: > continue > answer = x I'd not use 'l' (with '1') or 'str' (a standard module) as variable names. Your code checks every string in the list even when it's found one... you can reverse the test and break when the first one is found. Using 'in' rather than testing the return value of find is nicer as a substring test. Finally, using the 'else' clause lets you make it clear that answer is set to the empty string when no match is found. for answer in l: if str in answer: break else: answer = '' -- Paul Hankin
Post Follow-up to this messagedef foo(sample, strings): for s in strings: if sample in s: return True return False This was an order of magnitude faster for me than using str.find or str.index. That was finding rare words in the entire word-list (w/ duplicates) of War and Peace.
Post Follow-up to this messagePaul Hankin <paul.hankin@gmail.com> wrote: > On Apr 3, 12:37 pm, tinn...@isbd.co.uk wrote: > > I'd not use 'l' (with '1') or 'str' (a standard module) as > variable names. Neither would I, it was just thrown together to show how I was thinking. > Your code checks every string in the list even when > it's found one... you can reverse the test and break when the first > one is found. Using 'in' rather than testing the return value of find > is nicer as a substring test. Finally, using the 'else' clause lets > you make it clear that answer is set to the empty string when no match > is found. > > for answer in l: > if str in answer: break > else: > answer = '' > OK, that does improve things somewhat, thanks. -- Chris Green
Post Follow-up to this messageJeff <jeffober@gmail.com> wrote: > def foo(sample, strings): > for s in strings: > if sample in s: > return True > return False > > This was an order of magnitude faster for me than using str.find or > str.index. That was finding rare words in the entire word-list (w/ > duplicates) of War and Peace. However it's the wrong way around, in my case 'sample' is the longer string and I want to know if s is in it. It's simple enough to do it the other way around though:- def foo(sample, strings): for s in strings: if s in sample: return True return False Using in rather than find() and making it a function would seem to be the way to go, thanks. -- Chris Green
Post Follow-up to this messageOn Apr 3, 8:03 am, Jeff <jeffo...@gmail.com> wrote:
> def foo(sample, strings):
> for s in strings:
> if sample in s:
> return True
> return False
>
> This was an order of magnitude faster for me than using str.find or
> str.index. That was finding rare words in the entire word-list (w/
> duplicates) of War and Peace.
If you test against the same substrings over and over again, an
alternative would be to build a regular expression:
import re
search = re.compile('|'.join(re.escape(x)
for x in substrings)).search
p = search(somestring)
if p is not None:
print 'Found', p.group()
George
Post Follow-up to this messageOn Apr 3, 12:37 pm, tinn...@isbd.co.uk wrote: > What's the neatest and/or most efficient way of testing if one of a A different approach: True False Perhaps not as obvious or readable as Jeff's example, but is essentially doing the same thing using generator syntax.
Post Follow-up to this messageOn Apr 3, 8:19=A0am, George Sakkis <george.sak...@gmail.com> wrote:
> On Apr 3, 8:03 am, Jeff <jeffo...@gmail.com> wrote:
>
>
>
> If you test against the same substrings over and over again, an
> alternative would be to build a regular expression:
>
> import re
> search =3D re.compile('|'.join(re.escape(x)
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0for x in substr=[/color
]
ings)).search
> p =3D search(somestring)
> if p is not None:
> =A0 print 'Found', p.group()
>
> George
That would be an enormous regular expression and eat a lot of memory.
But over an enormous number of substrings, it would be O(log n),
rather than O(n).
Post Follow-up to this messageOn Apr 3, 8:44=A0am, Ant <ant...@gmail.com> wrote: > On Apr 3, 12:37 pm, tinn...@isbd.co.uk wrote: > > > A different approach: > > > True > > > False > > Perhaps not as obvious or readable as Jeff's example, but is > essentially doing the same thing using generator syntax. That's pretty :)
Post Follow-up to this messageOn Apr 3, 1:37 pm, tinn...@isbd.co.uk wrote: > What's the neatest and/or most efficient way of testing if one of a > set of strings (contained in a dictionary, list or similar) is a > sub-string of a given string? > [...] You could use the Aho-Corasick algorithm <http://en.wikipedia.org/wiki/ Aho-Corasick_algorithm>. I don't know if there's a Python implementation yet. Dennis Benzinger
Post Follow-up to this messagePowered by vBulletin
Copyright 2000-2006 Jelsoft Enterprises Limited.