Home > Archive > PERL Beginners > March 2004 > Fuzzy string matching










Fuzzy string matching
Juman

2004-03-26, 11:14 pm

I have two strings that I want to compare using some kind of fuzzy matching.
Is there a good way to do that in Perl, or could someone help with a
routine that matches word by word and gives a percentage result?

Like

String 1 : This is a ten characters long string is it not
String 2 : This is not so long

String 1 compared to String 2 gives 40% (four of its ten words are in String 2)
String 2 compared to String 1 gives 80% (four of its five words are in String 1)

/juman
James Edward Gray II

2004-03-26, 11:14 pm

On Mar 24, 2004, at 4:31 AM, juman wrote:

> I have two strings that I want to compare using some kind of fuzzy matching.
> Is there a good way to do that in Perl, or could someone help with a
> routine that matches word by word and gives a percentage result?
>
> Like
>
> String 1 : This is a ten characters long string is it not
> String 2 : This is not so long
>
> String 1 compared to String 2 gives 40% (four words are the same)
> String 2 compared to String 1 gives 80% (four words are the same)


See if this gives you some ideas:

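#!/usr/bin/perl

use strict;
use warnings;

my $string1 = 'This is a ten characters long string is it not';
my $string2 = 'This is not so long';

# The percentage is relative to the word count of the second argument,
# so these two calls print 80% and then 40%.
print compare_words($string1, $string2), "%\n";
print compare_words($string2, $string1), "%\n";

sub compare_words {
    my($str1, $str2) = @_;

    my @words = split ' ', $str2;    # the divisor is the word count of $str2
    my $in_both_count = 0;
    my %seen;
    foreach (split ' ', $str1) {
        next if $seen{$_}++;    # count each distinct word of $str1 only once
        # \Q...\E keeps regex metacharacters in a word from being interpreted
        $in_both_count++ if $str2 =~ m/\b\Q$_\E\b/;
    }

    return sprintf '%.0f', $in_both_count / scalar(@words) * 100;
}

__END__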

James
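
A possible variant of the routine above, offered only as a sketch: it replaces the regex test with a hash lookup and divides by the word count of the first argument, so the argument order follows the examples in the question. The name word_overlap_percent is made up for illustration, and the comparison is assumed to be exact and case-sensitive.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: percentage of the words in $from that also occur in $in.
# Each distinct shared word counts once; the divisor is the total word count of $from.
sub word_overlap_percent {
    my ($from, $in) = @_;

    my @from_words = split ' ', $from;
    return 0 unless @from_words;    # avoid dividing by zero

    # Exact, case-sensitive lookup table of the words in the other string
    my %in_other = map { $_ => 1 } split ' ', $in;

    my %seen;
    my $shared = grep { $in_other{$_} && !$seen{$_}++ } @from_words;

    return sprintf '%.0f', $shared / @from_words * 100;
}

my $string1 = 'This is a ten characters long string is it not';
my $string2 = 'This is not so long';

print word_overlap_percent($string1, $string2), "%\n";    # 40%
print word_overlap_percent($string2, $string1), "%\n";    # 80%
```

The hash lookup sidesteps regex metacharacters entirely and avoids rescanning the second string for every word.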

Juman

2004-03-26, 11:14 pm

Great... got it! :) Now my little script is running... Thanks for the
help (again)...

/juman

On Wed, Mar 24, 2004 at 09:34:58AM -0600, James Edward Gray II wrote:
> See if this gives you some ideas:

Damon Allen Davison

2004-03-26, 11:14 pm

On Wed, Mar 24, 2004 at 11:31:15AM +0100, juman wrote:
> String 1 compared to String 2 gives 40% (four words are the same)
> String 2 compared to String 1 gives 80% (four words are the same)


You can find a summary of possibilities here:

http://www.perlmonks.org/index.pl?node_id=162038

Basically, what it comes down to are these modules:

String::Approx
Text::Levenshtein
Algorithm::Diff

My favorite is the Levenshtein distance module.
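
To give a rough idea of what a Levenshtein distance measures, here is a hand-rolled sketch in core Perl. It is not the code of Text::Levenshtein or any other module; the names edit_distance and similarity_percent are invented for this example, and turning the distance into a percentage by dividing by the longer string's length is just one possible normalization.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(min max);

# Hypothetical helper: the number of single-character insertions, deletions
# and substitutions needed to turn $s into $t (dynamic programming, two rows).
sub edit_distance {
    my ($s, $t) = @_;
    my @s = split //, $s;
    my @t = split //, $t;

    # @prev holds the distances from the previous prefix of $s to every prefix of $t
    my @prev = (0 .. @t);
    for my $i (1 .. @s) {
        my @cur = ($i);
        for my $j (1 .. @t) {
            my $cost = $s[$i - 1] eq $t[$j - 1] ? 0 : 1;
            push @cur, min(
                $prev[$j] + 1,            # deletion
                $cur[$j - 1] + 1,         # insertion
                $prev[$j - 1] + $cost,    # substitution or match
            );
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Assumed normalization: 100% for identical strings, lower as more edits are needed.
sub similarity_percent {
    my ($s, $t) = @_;
    my $longer = max(length($s), length($t));
    return 100 unless $longer;    # two empty strings are identical
    return sprintf '%.0f', (1 - edit_distance($s, $t) / $longer) * 100;
}

print edit_distance('kitten', 'sitting'), "\n";          # 3
print similarity_percent('kitten', 'sitting'), "%\n";    # 57%
```

Note that this compares character by character, so it answers a slightly different question than the word-by-word percentage asked for.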

Good luck,

Damon

Chris McMahon

2004-03-26, 11:14 pm

Hi Juman...


You might try fooling around with the List::Compare module
http://search.cpan.org/~jkeenan/Lis...0.22/Compare.pm . I've
been using it recently and it's pretty nifty. It won't give you the
percentages you want, but I think it could supply the raw comparison
data, and then you could compute the percentages yourself.
And if that's not quite right, the List::Compare page on CPAN
has references to similar modules at the bottom of the page; maybe one
of the other diff/compare modules might get you there.
Hope that helps.
-Chris
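
As a sketch of the "raw comparison data first, percentages second" idea, the following uses plain hashes from core Perl as a stand-in for a comparison module; it does not call List::Compare itself. The helper name compare_word_lists and the keys of the hash it returns are assumptions made for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns the distinct words two strings share and the
# distinct words found in only one of them, each as a sorted array reference.
sub compare_word_lists {
    my ($left, $right) = @_;
    my %in_left  = map { $_ => 1 } split ' ', $left;
    my %in_right = map { $_ => 1 } split ' ', $right;

    my @shared     = grep {  $in_right{$_} } keys %in_left;
    my @left_only  = grep { !$in_right{$_} } keys %in_left;
    my @right_only = grep { !$in_left{$_}  } keys %in_right;

    return {
        shared     => [ sort @shared ],
        left_only  => [ sort @left_only ],
        right_only => [ sort @right_only ],
    };
}

my $string1 = 'This is a ten characters long string is it not';
my $string2 = 'This is not so long';

my $result = compare_word_lists($string1, $string2);
my @words1 = split ' ', $string1;
my @words2 = split ' ', $string2;
my $shared = @{ $result->{shared} };    # number of distinct shared words

print "shared: @{ $result->{shared} }\n";           # shared: This is long not
printf "%.0f%%\n", $shared / @words1 * 100;         # 40%
printf "%.0f%%\n", $shared / @words2 * 100;         # 80%
```

With List::Compare installed, its documented get_intersection method should be able to supply the shared list in place of the hash bookkeeping here.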