regex for l33t speak
| Andrew Gaffney 2005-03-24, 3:56 am |
| I'm trying to come up with a regex for my IRC bot that detects 1337 (in order to
kick them from the channel). I can't seem to come up with one that will have few
false positives but also work most of the time. Has anyone done something like
this before? Does anyone have any suggestions?
--
Andrew Gaffney
Network Administrator
Skyline Aeronautics, LLC.
| |
| Tim Johnson 2005-03-24, 3:56 am |
|
First off, "perldoc perlre" is a good place to start.
What do you have so far?
Does something like /\b1337\b/ work? Or am I taking you too literally?
-----Original Message-----
From: Andrew Gaffney [mailto:agaffney@skylineaero.com]
Sent: Wednesday, March 23, 2005 5:52 PM
To: beginners@perl.org
Subject: regex for l33t speak
I'm trying to come up with a regex for my IRC bot that detects 1337 (in order to
kick them from the channel). I can't seem to come up with one that will have few
false positives but also work most of the time. Has anyone done something like
this before? Does anyone have any suggestions?
| |
| Andrew Gaffney 2005-03-24, 3:56 am |
| Tim Johnson wrote:
> First off, "perldoc perlre" is a good place to start.
>
> What do you have so far?
>
> Does something like /\b1337\b/ work? Or am I taking you too literally?
Too literally. Basically, I'm trying to match a word that contains a mix of >=2
numbers (possibly next to each other) and letters. My current regex is:
\b\d*[a-zA-Z]*(\d+[a-zA-Z]+)+\d*[a-zA-Z]*[^:,]\b
but that seems to catch too much.
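A quick way to see how the pattern behaves is to run it against a few one-word messages. This is only a sketch: the `flagged` helper and the sample words are made up for illustration. Note that the trailing `[^:,]` has to consume one character, so "md5sum" and "h4x0r" match, while "w00t" sent on its own does not.

```perl
use strict;
use warnings;

# The pattern exactly as posted above.
my $leet = qr/\b\d*[a-zA-Z]*(\d+[a-zA-Z]+)+\d*[a-zA-Z]*[^:,]\b/;

# Hypothetical helper: 1 if the message matches the pattern, else 0.
sub flagged {
    my ($message) = @_;
    return $message =~ $leet ? 1 : 0;
}

for my $message ( 'md5sum', 'h4x0r', 'w00t' ) {
    printf "%-7s %s\n", $message, flagged($message) ? 'matched' : 'not matched';
}
```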
--
Andrew Gaffney
Network Administrator
Skyline Aeronautics, LLC.
| |
| Chris Devers 2005-03-24, 3:56 am |
| On Wed, 23 Mar 2005, Andrew Gaffney wrote:
> I'm trying to come up with a regex for my IRC bot that detects 1337
> (in order to kick them from the channel).
For those unfamiliar with 'leet, see here:
<http://www.microsoft.com/athome/sec...en/kidtalk.mspx>
<http://en.wikipedia.org/wiki/Leetspeak>
<http://www.straightdope.com/columns/030110.html>
> I can't seem to come up with one that will have few false positives
> but also work most of the time. Has anyone done something like this
> before? Does anyone have any suggestions?
I strongly suspect that there is no general solution for this.
The problem is that the set you're trying to match against is completely
unbounded, and the whole point of 'leet is to be unconventional with
rules for spelling, grammar, diction, courtesy, etc.
You could go halfway with code to catch the most common terms -- 1337,
w00t, pr0n, warez, 0\/\/n3d, etc -- but note how dissimilar those are.
* One is all numbers, while another is all letters, so they both look
like normal text.
* You could consider a rule to catch ones with mixed numbers & letters,
but that would catch legit terms like "perl6", "md5", or "mp3".
* One mixes in punctuation, so now you have to deal with anywhere that
alphanumeric characters are adjacent to symbols. Like, for example,
everywhere you have a comma, a hyphenated-word, or: a period. Nuts!
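To make the second bullet concrete, here is a minimal sketch of a "mixed letters and digits" rule. The `looks_mixed` helper is hypothetical, not something from this thread. It flags w00t and pr0n, but also perl6, md5 and mp3, and it lets 1337 and warez through.

```perl
use strict;
use warnings;

# Hypothetical rule (not from the thread): flag any token that contains
# at least one letter and at least one digit.
sub looks_mixed {
    my ($word) = @_;
    return ( $word =~ /[a-zA-Z]/ && $word =~ /[0-9]/ ) ? 1 : 0;
}

for my $word (qw(w00t pr0n perl6 md5 mp3 1337 warez)) {
    printf "%-6s %s\n", $word, looks_mixed($word) ? 'flagged' : 'not flagged';
}
```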
Ultimately, you can't win. If the users can guess what the matching
patterns might be -- and remember, this is IRC, so assume that they'll
talk to each other as they figure things out -- then they can *always*
come up with text that will get around your filters.
The most reasonable approach is probably to set up some hard-coded rules
for the most common terms -- see the URLs above for examples -- and some
very broad rules to warn (but *not* kick) possible offenders, and with
that have actual human moderators to catch whatever slips through.
Anything more aggressive than that and you're going to be buried in a
pile of false positives & false negatives... :-/
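A rough sketch of that two-tier idea, assuming a bot that hands over one whitespace-delimited token at a time. The `classify` helper and its 'kick' / 'warn' / 'ok' return values are invented for illustration; the known-term list is just the handful of words mentioned above.

```perl
use strict;
use warnings;

# Terms quoted in the post above; a real list would be longer.
my %known = map { $_ => 1 } ( '1337', 'w00t', 'pr0n', 'warez', '0\/\/n3d' );

# Hypothetical classifier for one whitespace-delimited token.
# Returns 'kick' for a known term, 'warn' for a letter/digit mix, else 'ok'.
sub classify {
    my ($token) = @_;
    my $bare = lc $token;
    $bare =~ s/[.,!?;:]+\z//;    # drop trailing sentence punctuation
    return 'kick' if $known{$bare};
    return 'warn' if $bare =~ /[a-z]/ && $bare =~ /[0-9]/;
    return 'ok';
}

for my $token (qw(w00t! mp3 hello)) {
    print "$token => ", classify($token), "\n";
}
```

Anything that comes back as 'warn' would go to a human moderator rather than trigger a kick.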
--
Chris Devers
| |
| Thomas Bätzler 2005-03-24, 8:56 am |
| Andrew Gaffney <agaffney@skylineaero.com> wrote:
> Too literally. Basically, I'm trying to match a word that
> contains a mix of >=2 numbers (possibly next to each other)
> and letters. My current regex is:
>
> \b\d*[a-zA-Z]*(\d+[a-zA-Z]+)+\d*[a-zA-Z]*[^:,]\b
>
> but that seems to catch too much.
Ever considered doing this w/o a regex? Maybe it would be
easier to split the text into words first, and then count
letters and numbers using tr//, like
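#!/usr/bin/perl -w
use strict;
# A word is "bad" if it has at least two letters and at least two digits.
sub badword {
    my $word = shift;
    # tr/// with an empty replacement list only counts; $word is not modified
    return $word =~ tr/a-zA-Z// >= 2 && $word =~ tr/0-9// >= 2;
}
my $text = 'I confess to 0wn1ng the email address hax0r@1337c0.de';
foreach my $word (split /\s+/, $text) {
    print "bad: $word\n" if badword( $word );
}
__END__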
HTH,
Thomas
| |
| Andrew Gaffney 2005-03-24, 8:56 am |
| Thomas Bätzler wrote:
Thanks. That's an interesting solution.
--
Andrew Gaffney
Network Administrator
Skyline Aeronautics, LLC.
| |
| Randy W. Sims 2005-03-24, 8:56 am |
| Andrew Gaffney wrote:
> I'm trying to come up with a regex for my IRC bot that detects 1337 (in
> order to kick them from the channel). I can't seem to come up with one
> that will have few false positives but also work most of the time. Has
> anyone done something like this before? Does anyone have any suggestions?
>
Write a converter to translate common "symbols" to the correct letter.
If the translated "word" is a valid dictionary word, flag it.
3X@mP1e
3 => E
X => X
@ => A
m => M
P => P
1 => L
e => E
3X@mP1e => EXAMPLE
EXAMPLE is a dictionary word, so 3X@mP1e must be leet since the
conversion rules produced meaningful results.
It's not perfect, but should work with very few if any false positives.
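A minimal sketch of the idea, assuming a one-to-one symbol table and a tiny in-memory word list standing in for a real dictionary. The names `deleet` and `is_leet`, the table, and the word list are all made up for illustration.

```perl
use strict;
use warnings;

# Stand-in dictionary; a real one would come from a word list file.
my %dictionary = map { $_ => 1 } qw(example hacker elite);

# Hypothetical helper: map a few common symbols back to letters.
# Only handles one-to-one substitutions (3=>e, @=>a, 1=>l, 0=>o, 4=>a, 7=>t, 5=>s).
sub deleet {
    my ($word) = @_;
    ( my $plain = lc $word ) =~ tr/3@10475/ealoats/;
    return $plain;
}

# A word counts as leet only if translating it changed something and the
# result is a dictionary word.
sub is_leet {
    my ($word) = @_;
    my $plain = deleet($word);
    return ( $plain ne lc($word) && exists $dictionary{$plain} ) ? 1 : 0;
}

for my $word (qw(3X@mP1e example mp3)) {
    print "$word => ", ( is_leet($word) ? 'leet' : 'not leet' ), "\n";
}
```

Because "mp3" turns into "mpe", which is not in the word list, it is not flagged.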
Randy.
| |
| Andrew Gaffney 2005-03-24, 8:56 am |
| Randy W. Sims wrote:
Thanks for yet another very interesting approach.
--
Andrew Gaffney
Network Administrator
Skyline Aeronautics, LLC.
| |
| Paul Johnson 2005-03-24, 8:56 am |
| On Thu, Mar 24, 2005 at 02:25:19AM -0600, Andrew Gaffney wrote:
> Randy W. Sims wrote:
>
> Thanks for yet another very interesting approach.
Check out Lingua::31337 on CPAN. That C really does stand for
comprehensive.
It works the other way around, ie it converts normal text to 31337, but
you could probably reverse the conversions it uses. Best of all, it's
written by the founder of this list (hi Casey!) but I don't think it has
ever been plugged here. It's about time that was remedied.
I'm sure Casey would be happy to accept a patch to add a 313372text
function.
--
Paul Johnson - paul@pjcj.net
http://www.pjcj.net
| |
| Andrew Gaffney 2005-03-25, 3:55 am |
| Randy W. Sims wrote:
> The only problem with that is that a dictionary is required for it to
> work because each "symbol" can have multiple translations. Taking info
> from the wikipedia[1]: a final "s" can be changed to "z" to get the
> l33t, but to reverse it you have to check first with the "z" because it
> might be an actual "z". Then if it is not a dictionary word perform the
> translation and check for a word ending in "s".
Wow, this is more difficult than I first thought. I think I'm just going to drop
the whole idea as the channel is relatively low traffic and it was more for fun
than usefulness. Thanks for all the suggestions, though.
--
Andrew Gaffney
Network Administrator
Skyline Aeronautics, LLC.
| |
| Randy W. Sims 2005-03-25, 3:55 am |
| Paul Johnson wrote:
> On Thu, Mar 24, 2005 at 02:25:19AM -0600, Andrew Gaffney wrote:
>
>
>
> Check out Lingua::31337 on CPAN. That C really does stand for
> comprehensive.
>
> It works the other way around, ie it converts normal text to 31337, but
> you could probably reverse the conversions it uses. Best of all, it's
> written by the founder of this list (hi Casey!) but I don't think it has
> ever been plugged here. It's about time that was remedied.
>
> I'm sure Casey would be happy to accept a patch to add a 313372text
> function.
The only problem with that is that a dictionary is required for it to
work because each "symbol" can have multiple translations. Taking info
from the wikipedia[1]: a final "s" can be changed to "z" to get the
l33t, but to reverse it you have to check first with the "z" because it
might be an actual "z". Then if it is not a dictionary word perform the
translation and check for a word ending in "s".
For example, given the l33t word "h4x0rz", an algorithm would have to
perform something like the following translations, checking each one
till it finds a dictionary entry if any:
(done by hand and I don't know much about l33t, so...)
h4x0rz
h4x0rs
h4xorz
h4xors
h4xerz
h4xers
h4ck0rz
h4ck0rs
h4ckorz
h4ckors
h4ckerz
h4ckers
h4cks0rz
h4cks0rs
h4cksorz
h4cksors
h4ckserz
h4cksers
hack0rz
hack0rs
hackorz
hackors
hackerz
hackers => BINGO
(More permutations here, but we already found a dictionary word, so we
stop.)
The basic algorithm for anyone who wants to try it, and it's pretty
commonly seen in parsing, so it's relatively straightforward:
scan string till you reach the end of a "word"
check dictionary for the "word"
LOOP:
    back up
    apply conversion(s)
    check dictionary
    repeat until success or no more permutations
END LOOP
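One way to read those steps is as a small backtracking search. The sketch below is an illustration under assumptions, not the module posted later in the thread: `find_word`, the `%alternatives` table, and the three-word dictionary are invented here. Each character is first kept as-is and then replaced by each alternative, and the dictionary is checked once a full candidate has been built.

```perl
use strict;
use warnings;

# Stand-in dictionary and a deliberately tiny substitution table.
my %dictionary   = map { $_ => 1 } qw(hackers pawn own);
my %alternatives = (
    '4' => ['a'],
    '0' => [ 'o', 'e' ],
    'x' => [ 'ck', 'cks' ],
    'z' => ['s'],
);

# Hypothetical helper: returns the first dictionary word reachable from
# $rest by substitution, or undef if there is none.
sub find_word {
    my ( $rest, $built ) = @_;
    $built = '' unless defined $built;
    if ( $rest eq '' ) {
        return $dictionary{$built} ? $built : undef;
    }
    my $ch   = substr $rest, 0, 1;
    my $tail = substr $rest, 1;
    # Try the character unchanged first, then back up and try each conversion.
    for my $option ( $ch, @{ $alternatives{$ch} || [] } ) {
        my $found = find_word( $tail, $built . $option );
        return $found if defined $found;
    }
    return;
}

my $word = find_word('h4x0rz');
print defined $word ? "h4x0rz => $word\n" : "not found\n";
```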
This would probably make a good QotW, or rather the original question
would make a good quiz while the above would be one possible solution.
So would implementing an efficient dictionary lookup without loading the
entire dictionary in memory.
Randy.
1. <http://en.wikipedia.org/wiki/Leetspeak>
| |
| Chris Devers 2005-03-25, 3:55 am |
| On Thu, 24 Mar 2005, Randy W. Sims wrote:
> The only problem with that is that a dictionary is required for
> it to work because each "symbol" can have multiple translations.
Not only that -- a 'leet word could have multiple possible meanings.
For example, "pwn" ("own") could just be a typo for "pawn".
Any attempt to get back from a 'leet term to a real word is going to be
extremely prone to false positives & false negatives. You could cheat
and assume a list of banned words and suspect words, and try to find
probable correlations between the two sets, but that's logically wrong:
you're starting from the conclusion that every word is probably banned,
then digging through what you find until you get what you wanted. The
false positive rate will be huge with such an approach, but it's about
the only approach that has a chance of working at all.
The problem of differentiating between 'leet and conventional English is
very similar to the problem of detecting spam and "ham" email. In that
case, you can use various approaches that do a decent guesstimate --
Bayesian statistical filters, various hard-wired heuristics, a cocktail
of both approaches, etc -- but there's *always* going to be some level
of both false negatives (spam or 'leet that gets through) and false
positives (good messages that get blocked). This is unavoidable -- all
you can do is make reasonable attempts to minimize it.
Maybe the IRC bot should be hooked up to SpamAssassin :-)
--
Chris Devers
| |
| Randy W. Sims 2005-03-25, 8:56 am |
| Andrew Gaffney wrote:
> Wow, this is more difficult than I first thought.
Not really. Just for kicks here is a simple driver:
#!/usr/bin/perl
use strict;
use warnings;
use Leetspeak;
# $Leetspeak::DEBUG = 1;
my $word = shift( @ARGV );
my $l33t = Leetspeak->new();
my $translation = $l33t->translate( $word );
if ( $translation ) {
    print "$word => ", $translation , "\n";
} else {
    print "not found\n";
}
__END__
for the following module:
package Leetspeak;
use strict;
use warnings;
our $DEBUG = 0;
sub new {
    my $package = shift;
    my %args = @_;
    my $dict =
        (grep defined && -e, ( $args{dict}, '/usr/share/dict/words' ))[0];
    my %data = ( dict => $dict );
    my $self = bless( \%data, $package );
    $self->_read_dict();
    return $self;
}
{
    my %trans_tbl = (
        '@' => [ 'A' ],
        '$' => [ 'S' ],
        '+' => [ 'T' ],
        '0' => {
            '0' => [ 'O' ],
            '0r' => [ 'ER' ],
        },
        '1' => [ 'I', 'L' ],
        '2' => [ 'Z' ],
        '3' => {
            '3' => [ 'E' ],
            '3y3' => [ 'I' ],
        },
        '4' => [ 'A' ],
        '5' => [ 'S', 'Z' ],
        '6' => [ 'B', 'G' ],
        '7' => [ 'T' ],
        '8' => [ 'B' ],
        '9' => [ 'P', 'Q' ],
        'l' => [ 'I' ],
        'p' => {
            'p' => [ 'O' ],
            'ph' => [ 'F' ],
        },
        'x' => [ 'CK', 'CKS' ],
        'z' => [ 'S' ],
    );
    sub translate {
        my $self = shift;
        my $word = shift;
        my $start = shift || 0;
        print "translate( $word, $start )\n" if $DEBUG;
        return $word if $self->_has_word( $word );
        for my $i ( $start .. length( $word ) - 1 ) {
            my $ch = substr( $word, $i, 1 );
            next unless exists( $trans_tbl{$ch} );
            my $trans = ( ref( $trans_tbl{$ch} ) eq 'HASH' ) ?
                $trans_tbl{$ch} : { $ch => $trans_tbl{$ch} };
            foreach my $key ( keys( %$trans ) ) {
                my $key_len = length( $key );
                if ( substr( $word, $i, $key_len ) eq $key ) {
                    foreach my $tr ( @{ $trans->{$key} } ) {
                        print "substr( $word, $i, 1 ) = $tr\n" if $DEBUG;
                        my $new_word = $word;
                        substr( $new_word, $i, $key_len ) = $tr;
                        $new_word = lc( $new_word );
                        my $offset = $key_len - length( $tr );
                        $offset ||= 1;
                        my $result =
                            $self->translate( $new_word, $i + $offset );
                        return $result if $result;
                    }
                }
            }
        }
        return undef;
    }
}
sub dict { return $_[0]->{dict} }
sub _read_dict {
    my $self = shift;
    my %words;
    open( my $fh, '<', $self->{dict} ) or die $!;
    while (defined( my $word = <$fh> )) {
        chomp( $word );
        $words{$word} = 1;
    }
    close( $fh );
    $self->{words} = \%words;
    return scalar %words;
}
sub _has_word {
    my $self = shift;
    my $word = shift;
    return 1 if exists( $self->{words}{$word} );
}
1;
__END__
| |
| Randy W. Sims 2005-03-25, 8:56 am |
|
Oops, I forgot to detab before inlining that, and there were some loose
ends. Here it is again inlined and attached. It's not optimized in any
way, opting instead for a quick straightforward implementation.
Incomplete, possibly buggy, completely undocumented. I didn't write it
as a patch for Casey's module because of the requirement for a
dictionary file, which is fairly standard on unixish systems but must be
obtained separately for other OSes.
package Leetspeak;
use strict;
use warnings;
our $DEBUG = 0;
# Constructor. Accepts dict => '/path/to/wordlist'; falls back to the
# usual Unix word list when none is given or the file does not exist.
sub new {
    my $package = shift;
    my %args = @_;
    my $dict =
        (grep { defined($_) && -e $_ } ( $args{dict}, '/usr/share/dict/words' ))[0];
    my %data = ( dict => $dict );
    my $self = bless( \%data, $package );
    $self->_read_dict();
    return $self;
}
{
    # Translation table, private to translate(). An array ref lists the
    # letters a symbol may stand for; a hash ref maps sequences starting
    # with that symbol to their possible replacements.
    my %trans_tbl = (
        '@' => [ 'A' ],
        '$' => [ 'S' ],
        '+' => [ 'T' ],
        '0' => {
            '0' => [ 'O' ],
            '0r' => [ 'ER' ],
        },
        '1' => [ 'I', 'L' ],
        '2' => [ 'Z' ],
        '3' => {
            '3' => [ 'E' ],
            '3y3' => [ 'I' ],
        },
        '4' => [ 'A' ],
        '5' => [ 'S', 'Z' ],
        '6' => [ 'B', 'G' ],
        '7' => [ 'T' ],
        '8' => [ 'B' ],
        '9' => [ 'P', 'Q' ],
        'l' => [ 'I' ],
        'p' => {
            'p' => [ 'O' ],
            'ph' => [ 'F' ],
        },
        'x' => [ 'CK', 'CKS' ],
        'z' => [ 'S' ],
    );
    # Recursively try substitutions from position $start onward and return
    # the first candidate found in the dictionary, or undef if there is none.
    sub translate {
        my $self = shift;
        my $word = shift;
        my $start = shift || 0;
        print "translate( $word, $start )\n" if $DEBUG;
        return $word if $self->_has_word( $word );
        for my $i ( $start .. length( $word ) - 1 ) {
            my $ch = substr( $word, $i, 1 );
            next unless exists( $trans_tbl{$ch} );
            # Normalize single-symbol entries to the hash form
            my $trans = ( ref( $trans_tbl{$ch} ) eq 'HASH' ) ?
                $trans_tbl{$ch} : { $ch => $trans_tbl{$ch} };
            foreach my $key ( keys( %$trans ) ) {
                my $key_len = length( $key );
                if ( substr( $word, $i, $key_len ) eq $key ) {
                    foreach my $tr ( @{ $trans->{$key} } ) {
                        print "substr( $word, $i, $key_len ) = $tr\n"
                            if $DEBUG;
                        my $new_word = $word;
                        substr( $new_word, $i, $key_len ) = lc( $tr );
                        # Where the recursive call resumes scanning; defaults
                        # to the next character when the lengths are equal
                        my $offset = $key_len - length( $tr );
                        $offset ||= 1;
                        my $result =
                            $self->translate( $new_word, $i + $offset );
                        return $result if $result;
                    }
                }
            }
        }
        return undef;
    }
}
# Accessor for the dictionary path in use
sub dict { return $_[0]->{dict} }
# Load the whole word list into a hash; returns true if any words were read
sub _read_dict {
    my $self = shift;
    my %words;
    open( my $fh, '<', $self->{dict} ) or die $!;
    while (defined( my $word = <$fh> )) {
        chomp( $word );
        $words{$word} = 1;
    }
    close( $fh );
    $self->{words} = \%words;
    return scalar %words;
}
# The candidate is lower-cased before lookup; dictionary entries are stored
# as read, so capitalized entries never match.
sub _has_word {
    my $self = shift;
    my $word = shift;
    $word = lc( $word );
    return 1 if exists( $self->{words}{$word} );
}
1;