Hi,
I would like to make life easier for myself by automating (as best as
possible) the removal of messages supplied by users.
If my incoming string is $input, I originally thought of searching as
follows:
<Pseudocode>
foreach my $rudeword (@RudeWordList) {
if ($input =~ /$rudeword/i) {
REJECT;
}
}
</Pseudocode>
However, this seems a rather unoptimised method of searching. Is there a
more optimised way of doing this?
Cheers,
Greg
Greg wrote:
> I would like to make life easier for myself by automating (as best as
> possible) the removal of messages supplied by users.
> if ($input =~ /$rudeword/i) {
The above code would improperly reject these:
Scunthorpe Hospital Radio (www.shronline.co.uk)
Sussex and Essex in England
going off half-cocked
Matsushita is the parent corporation of Panasonic
farther and farther
Failing to take context into account would reject these:
breast cancer survivor
Tom, Dick, and Harry
cute little pussy cat
prize-winning XXXXX and her puppies
Joe Smith wrote:
> Greg wrote:
> >
> The above code would improperly reject these:
Hi,
In order to reply, I've re-ordered your (very good) examples:
> Scunthorpe Hospital Radio (www.shronline.co.uk)
> Matsushita is the parent corporation of Panasonic
Word boundaries - very good point!
> Sussex and Essex in England
> going off half-cocked
> breast cancer survivor
> farther and farther
> Tom, Dick, and Harry
"Sex, cocked, breast, fart and dick" - these were not the type of words I
was planning to look for. I wouldn't call these words particularly rude :P
They are certainly "acceptable" for where this code will be deployed.
> Failing to take context into account would reject these:
> cute little pussy cat
> prize-winning XXXXX and her puppies
VERY good point. Perhaps a better solution would be:
<Pseudocode>
foreach my $rudeword (@RudeWordList) {
    if ($input =~ /\b$rudeword\b/i) {
        PLACE MESSAGE ON ICE;
        FLAG MESSAGE "Awaiting acceptance from moderator";
    }
}
</Pseudocode>
Thanks Joe!
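For readers following along, here is a minimal runnable sketch of the word-boundary idea in the pseudocode above. It is not Greg's actual code: the word list is a mild stand-in borrowed from the examples in this thread, and the sub name first_rude_word is made up for illustration. Instead of looping over the list, it joins the words into one precompiled alternation so each message is scanned once.

```perl
use strict;
use warnings;

# Hypothetical stand-in list; the real list is not given in the thread.
my @RudeWordList = ('sex', 'fart', 'dick');

# Build one alternation so each message is scanned once, not once per word.
# quotemeta() stops punctuation in a list entry being read as regex syntax.
my $alternation = join '|', map { quotemeta($_) } @RudeWordList;
my $rude_re     = qr/\b(?:$alternation)\b/i;

# Returns the first matched word, or undef if nothing on the list matched.
sub first_rude_word {
    my ($input) = @_;
    return $input =~ /($rude_re)/ ? $1 : undef;
}

for my $input ('Sussex and Essex in England',
               'farther and farther',
               'Tom, Dick, and Harry') {
    my $hit = first_rude_word($input);
    # Matches are held for a moderator rather than rejected outright.
    print defined $hit
        ? "HOLD for moderator ($hit): $input\n"
        : "accept: $input\n";
}
```

With this stand-in list the first two inputs are accepted, because the listed words only occur inside longer words, while the third is held. That is the point made above: word boundaries deal with the substring problem but not with context, which is why holding for a moderator is safer than rejecting.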
Greg <gmills@nilELEPHANTdram.com> writes:
> Hi,
>
> I would like to make life easier for myself by automating (as best as
> possible) the removal of messages supplied by users.
>
> If my incoming string is $input, I originally thought of searching as
> follows:
>
> <Pseudocode>
> foreach my $rudeword (@RudeWordList) {
> if ($input =~ /$rudeword/i) {
> REJECT;
> }
> }
> </Pseudocode>
>
>
> However, this seems a rather unoptimised method of searching. Is there a
> more optimised way of doing this?
>
> Cheers,
>
>
Hi Greg,
You're right, it's not a very good search approach. The problem is, you
will be comparing the input against every rude word in the list of
rude words. So, if you had 1000 rude words, you would do 1000
comparisons for each input.
Something which may help might be to use a hash instead of a list for
your rude words. Use the rude word as the key and just put a 1 in for
the value. Once the input is split into words, this would allow you to
do a single lookup for each word, rather than multiple comparisons with
the whole list. If performance is still not good enough, you could then
look at other optimizations - for example, you may be able to skip any
input word with fewer than 4 characters as there are not many rude words
within that set. You could also eliminate any words with more characters
than your longest 'rude' word.
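A rough sketch of the hash idea, under a few assumptions: the word list is a stand-in taken from examples in this thread, the sub name rude_words_in is invented, and words are taken to be runs of letters and apostrophes. Rather than hard-coding a 4-character cut-off, the length limits are derived from the list itself.

```perl
use strict;
use warnings;
use List::Util qw(min max);

# Hypothetical stand-in list; the real list is not given in the thread.
my @RudeWordList = ('fart', 'dick');

# Rude word as the key, 1 as the value, as suggested above.
my %is_rude;
$is_rude{ lc $_ } = 1 for @RudeWordList;

# Length limits taken from the list: anything shorter or longer cannot match.
my @lengths = map { length($_) } @RudeWordList;
my ($min_len, $max_len) = (min(@lengths), max(@lengths));

# Returns the list of input words found in the hash (lowercased).
sub rude_words_in {
    my ($input) = @_;
    my @hits;
    # Split on anything that is not a letter or apostrophe.
    for my $word (grep { length($_) } split /[^a-z']+/, lc $input) {
        next if length($word) < $min_len || length($word) > $max_len;
        push @hits, $word if $is_rude{$word};    # one hash lookup per word
    }
    return @hits;
}

print join(', ', rude_words_in('Tom, Dick, and Harry went farther')), "\n";
```

Because only whole words are looked up, this behaves like the word-boundary match: "farther" is never confused with a shorter listed word. The trade-off is that it cannot catch multi-word phrases or deliberately run-together words.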
Tim
--
Tim Cross
The e-mail address on this message is FALSE (obviously!). My real e-mail is
to a company in Australia called rapttech and my login is tcross - if you
really need to send mail, you should be able to work it out!
Joe Smith <joe@inwap.com> writes:
> Greg wrote:
> >
> The above code would improperly reject these:
> Scunthorpe Hospital Radio (www.shronline.co.uk)
> Sussex and Essex in England
> going off half-cocked
> Matsushita is the parent corporation of Panasonic
> farther and farther
>
> Failing to take context into account would reject these:
> breast cancer survivor
> Tom, Dick, and Harry
> cute little pussy cat
> prize-winning XXXXX and her puppies
This is a common problem with any filtering approach. In Australia, the
government passed legislation which required ISPs to block access to
sites considered offensive (whatever that is). All the technical people,
professors of computing science, programmers etc, tried to explain the
problems. The response from the senator pushing this through was to
accuse these people of being pornographers.
The problem comes down to more than context, it comes down to an area of
computing science called natural language processing (NLP). The aim here
is to try and write software which can understand natural language. This
is a very difficult problem, particularly in languages such as English,
because the rules have so many exceptions and are difficult to specify
in a concise way. However, unless the computer can 'understand' what is
being expressed, there is no solution that is 100% guaranteed to work.
You can reduce the number of false positives by extending the match
criteria to look for more information - for example, if the banned word
was 'breast' (which isn't really a rude word - unless you're one of
those prudes who finds penis and vagina rude), you could also look for
the word 'cancer' within x number of words and you would be less likely
to flag it as indicating 'rude' content, i.e. getting a false positive.
However, this is still what the AI world refers to as a heuristic rule,
more commonly known as a 'rule of thumb'.
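The 'cancer within x words' rule of thumb could be sketched roughly as follows. Everything specific here is an assumption made for illustration: the rule table %innocent_near, the extra word 'chicken', the window of 3 words and the sub name looks_rude are not from the post.

```perl
use strict;
use warnings;

# Hypothetical rule table: flagged word => words suggesting an innocent context.
my %innocent_near = (
    breast => [ 'cancer', 'chicken' ],
);
my $window = 3;    # the "x number of words"; 3 is an arbitrary choice

# Returns 1 if a flagged word appears with no innocent word within
# $window words on either side, otherwise 0.
sub looks_rude {
    my ($input) = @_;
    my @words = grep { length($_) } split /[^a-z']+/, lc $input;
    for my $i (0 .. $#words) {
        my $excuses = $innocent_near{ $words[$i] } or next;   # not a flagged word
        my $lo = $i - $window < 0       ? 0       : $i - $window;
        my $hi = $i + $window > $#words ? $#words : $i + $window;
        my %nearby = map { ($_ => 1) } @words[ $lo .. $hi ];
        next if grep { $nearby{$_} } @$excuses;               # innocent context found
        return 1;
    }
    return 0;
}

print looks_rude('breast cancer survivor') ? "flag\n" : "ok\n";
```

This is still only a heuristic: every new innocent context has to be added to the table by hand, and a message can always be worded to slip past it.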
Note that if you are trying to eliminate spam with ads for porn sites
etc, you would be better off looking at some of the quite good anti-spam
algorithms. There are a number of interesting approaches currently being
developed. One approach is to use a collection of weighted rules and
apply some basic statistical calculations which give you a probability
score for the likelihood of the message being spam. Some of these
systems use a training process to adjust the weights of each rule.
Another very interesting network approach to spam detection is the use
of a centralised database and a server. Users send copies of spam to
this server, which computes an MD5 checksum of each message and puts
this info in a database. You then have a client which calculates the MD5
checksum of the incoming message and queries the remote anti-spam
database to see if it knows of any other messages with the same
checksum. The theory here is that most spam is sent to a large number of
users and has the same message body contents. As MD5 checksums are based
on the actual content of the message, the odds are very high that if you
have the same checksum, you have the same message, and therefore you can
be fairly confident it is spam. Of course, if this begins to work, the
spammers will just begin to add random characters or blank spaces to
each message, which will change the MD5 checksum for that message.
Tim
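To make the checksum scheme concrete, here is a small sketch using the core Digest::MD5 module. The remote server and database are replaced by a local hash so the example runs on its own; the names %known_spam, report_spam and is_known_spam are invented for illustration and do not belong to any real anti-spam service.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Stand-in for the shared database of checksums of reported spam.
# In the scheme described above this lives on a remote server.
my %known_spam;

# "Users send copies of spam to this server": store the body's checksum.
sub report_spam {
    my ($body) = @_;
    $known_spam{ md5_hex($body) } = 1;
    return;
}

# Client side: checksum the incoming body and ask whether it is known.
sub is_known_spam {
    my ($body) = @_;
    return exists($known_spam{ md5_hex($body) }) ? 1 : 0;
}

report_spam("Buy cheap pills now!");

print is_known_spam("Buy cheap pills now!")  ? "spam\n" : "unknown\n";  # identical body
print is_known_spam("Buy cheap pills now! ") ? "spam\n" : "unknown\n";  # one extra space
```

The second lookup shows the weakness mentioned above: a single trailing space gives a different checksum, so an exact-match database no longer recognises the message.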