Home > Archive > PERL Miscellaneous > June 2005 > Re: Link List matching
You are viewing an archived Text-only version of the thread.
To view this thread in it's original format and/or if you want to reply to
this thread please [click here]
| Author |
Re: Link List matching
|
|
| Matteo D'Amato 2005-06-07, 8:57 pm |
|
Hi,
I have a pipe delimited text file containing country codes with their
country names. i.e.
30741|Greece - Corinth
30751|Greece - Argos
30821|Greece - Crete
31|Netherlands
316|Netherlands - Mobile
319|Netherlands - Mobile
3110|Netherlands - Rotterdam
3120|Netherlands - Amsterdam
3123|Netherlands - Haarlem/Heemstede
This is just a sample, the actual file contains 19732 lines. Now I need to
write a program that will take in a full phone number as an argument i.e.
308212398721 -> and will return 'Greece - Crete' and matched prefix '30821'
I tried to used the link list example from the PERL FAQ, but it takes 7
minutes just to read the file and build the tree. Can anyone give me some
guide lines on how to write this in the fastest way using Perl. Thanks
--Matteo
| |
| Ruey-Cheng Chen 2005-06-07, 8:57 pm |
| Assume you have an hash mapping codes to names, like:
(You could use whichever function you prefer, either split or m// would
be fine.)
%map = (
'30821' => "Greece - Corinth",
# ...
);
$p = "308212398721";
@prefix = map { substr $p, 0, $_ } 1..length $p;
print map { $map{$_} } grep { $map{$_} } @prefix;
| |
| Eric Bohlman 2005-06-07, 8:57 pm |
| "Matteo D'Amato" <matteod@intelcocomm.com> wrote in
news:Xns966EB3FD5CE67matteodintelcocommc
o@216.40.28.72:
> I have a pipe delimited text file containing country codes
> with their
> country names. i.e.
>
>
> 30741|Greece - Corinth
> 30751|Greece - Argos
> 30821|Greece - Crete
> 31|Netherlands
> 316|Netherlands - Mobile
> 319|Netherlands - Mobile
> 3110|Netherlands - Rotterdam
> 3120|Netherlands - Amsterdam
> 3123|Netherlands - Haarlem/Heemstede
>
>
> This is just a sample, the actual file contains 19732 lines. Now I
> need to write a program that will take in a full phone number as an
> argument i.e.
>
> 308212398721 -> and will return 'Greece - Crete' and matched prefix
> '30821'
>
> I tried to used the link list example from the PERL FAQ, but it takes
> 7 minutes just to read the file and build the tree. Can anyone give me
> some guide lines on how to write this in the fastest way using Perl.
> Thanks
I'd populate a hash (let's call it %codes) with the country codes as keys
and the locations as values, and while doing that I'd keep track of the
longest country code seen (in your example, it would be 5 digits); let's
call that $maxcode. To match a number, I'd start by taking the first
$maxcode digits of the number and looking it up in the hash. If not
found, I'd repeat using the first $maxcode-1 digits, and so on.
I'm being a little vague here deliberately because your request looks
slightly like a homework problem so I don't want to fill in *all* the
details.
| |
| Matteo D'Amato 2005-06-08, 3:59 am |
| Ok, this sounds faster. You gave me an idea, how about grouping them by
prefix size? i.e
group size 5:
30741|Greece - Corinth
30751|Greece - Argos
...
group size 2:
31|Netherlands
...
while also keeping track of max size. So start at max size and only look in
the group matching that size and work my way down. How does this sound? Can
it be improved?
--Matteo
Eric Bohlman <ebohlman@omsdev.com> wrote in
news:Xns966EB08DF2A02ebohlmanomsdevcom@1
30.133.1.4:
> "Matteo D'Amato" <matteod@intelcocomm.com> wrote in
> news:Xns966EB3FD5CE67matteodintelcocommc
o@216.40.28.72:
>
>
> I'd populate a hash (let's call it %codes) with the country codes as
> keys and the locations as values, and while doing that I'd keep track
> of the longest country code seen (in your example, it would be 5
> digits); let's call that $maxcode. To match a number, I'd start by
> taking the first $maxcode digits of the number and looking it up in
> the hash. If not found, I'd repeat using the first $maxcode-1 digits,
> and so on.
>
> I'm being a little vague here deliberately because your request looks
> slightly like a homework problem so I don't want to fill in *all* the
> details.
| |
| Anno Siegel 2005-06-08, 8:58 am |
| Matteo D'Amato <matteod@intelcocomm.com> wrote in comp.lang.perl.misc:
>
> Hi,
> I have a pipe delimited text file containing country codes with their
> country names. i.e.
>
>
> 30741|Greece - Corinth
> 30751|Greece - Argos
> 30821|Greece - Crete
> 31|Netherlands
> 316|Netherlands - Mobile
> 319|Netherlands - Mobile
> 3110|Netherlands - Rotterdam
> 3120|Netherlands - Amsterdam
> 3123|Netherlands - Haarlem/Heemstede
>
>
> This is just a sample, the actual file contains 19732 lines. Now I need to
> write a program that will take in a full phone number as an argument i.e.
>
> 308212398721 -> and will return 'Greece - Crete' and matched prefix '30821'
>
> I tried to used the link list example from the PERL FAQ, but it takes 7
> minutes just to read the file and build the tree. Can anyone give me some
> guide lines on how to write this in the fastest way using Perl. Thanks
The canonical data structure for a prefix-oriented search is the "trie".
A trie (from "retrieval") is a tree-like collection of data organized
so that all codes that begin with the same prefix are stored together.
A classic treatment is in Knuth, _TAoCP_, Sorting and Searching, but
most textbooks with "Data Structures" in the title will deal with tries.
In view of the fact that codes consist of digits only, we can use a
Perl array(ref) as the basic structure, using the individual digits
as indices. Thus the data above would be represented as an array
with a single entry under index 3 (because all codes begin with 3).
This location holds a sub-trie (an array again) whose two entries
are stored at indices 0 and 1 (the only digits in second position
in the data) with 0 holding the Gr fraction and 1 the Dutch one.
And so on... Since "31" is a valid entry (for "Netherlands"),
we use an index that can't be a digit (10) to mark the fact.
Retrieval is fast and simple: At each level, (that is, for each digit
of a code to search) follow the sub-trie indicated by the digit.
If it doesn't exist, the search has failed. If the last digit of
the search code is reached, look if an entry under index 10 exists in
the corresponding sub-trie. If it does, we have found the entry.
If it doesn't the search also fails.
This standard trie search finds only exact matches, but a variant
would find the longest possible prefix of a given number.
There are trie modules on CPAN you could use, but below is an ad-hoc
solution.
Anno
#!/usr/bin/perl
use strict; use warnings; $| = 1; # @^~`
use Vi::QuickFix;
my $trie;
while ( <DATA> ) {
print;
chomp;
add_to_trie( $trie, split /\|/);
}
print "\n";
for (
qw( 4 99 30742 30741999), # non-existing
qw( 30741 30751 30821 31 316 319 3110 3120 3123), # existing
) {
my $res = find_in_trie( $trie, $_) || '-none-';
print "$_ -> $res\n";
}
########################################
################################
use constant VAL => 10;
sub add_to_trie {
$_[ 0] ||= [];
my ( $trie, $str, $val) = @_;
if ( my ( $char, $rem) = $str =~ /(.)(.*)/ ) {
add_to_trie( $trie->[ $char], $rem, $val);
} else {
$trie->[ VAL] = $val;
}
}
sub find_in_trie {
my ( $trie, $str) = @_;
if ( my ( $char, $rem) = $str =~ /(.)(.*)/ ) {
return find_in_trie( $trie->[ $char], $rem);
} else {
return $trie->[ VAL];
}
}
__DATA__
30741|Greece - Corinth
30751|Greece - Argos
30821|Greece - Crete
31|Netherlands
316|Netherlands - Mobile
319|Netherlands - Mobile
3110|Netherlands - Rotterdam
3120|Netherlands - Amsterdam
3123|Netherlands - Haarlem/Heemstede
|
|
|
|
|