Home > Archive > PERL Beginners > June 2007 > Count co-occurrences
You are viewing an archived Text-only version of the thread.
To view this thread in it's original format and/or if you want to reply to
this thread please [click here]
| Author |
Count co-occurrences
|
|
| Andrej Kastrin 2007-06-22, 9:59 pm |
| Dear all,
I wrote a simple sql querry to count co-occurrences between words but it
performs very very slow on large datasets. So, it's time to do it with
Perl. I need just a short tip to start out: which structure to use to
count all possible occurrences between letters (e.g. A, B and C) under
the particular document number. My dataset looks like following:
1 A
1 B
1 C
1 B
2 A
2 A
2 B
2 C
etc. till doc. number 100.000
The result file should than be similar to:
A B 4 ### 2 co-occurrences under doc. number 1 + 2 co-occurrences
under doc. number 2
A C 3 ### 1 co-occurrence under doc. number 1 + 2 co-occurrences under
doc. number 2
B C 3 ### 2 co-occurrences under doc. number 1 + 1 co-occurrence under
doc. number 2
Thanks in advance for any pointers.
Best, Andrej
| |
| Chas Owens 2007-06-22, 9:59 pm |
| On 6/22/07, Andrej Kastrin <andrej.kastrin@siol.net> wrote:
> Dear all,
>
> I wrote a simple sql querry to count co-occurrences between words but it
> performs very very slow on large datasets. So, it's time to do it with
> Perl. I need just a short tip to start out: which structure to use to
> count all possible occurrences between letters (e.g. A, B and C) under
> the particular document number. My dataset looks like following:
snip
Here is my naive solution.
#!/usr/bin/perl
use strict;
use warnings;
my %co;
my %doc;
my $oldid;
while (<DATA> ) {
chomp;
my ($id, $val) = split / /;
unless ($id eq $oldid or not defined $oldid) {
my @vals = sort keys %doc;
while (@vals) {
my $val = shift @vals;
for my $other (@vals) {
$co{"$val $other"} += $doc{$val} * $doc{$other};
}
}
%doc = ();
}
$doc{$val}++;
$oldid = $id;
}
my @vals = sort keys %doc;
while (@vals) {
my $val = shift @vals;
for my $other (@vals) {
$co{"$val $other"} += $doc{$val} * $doc{$other};
}
}
for my $co (sort keys %co) {
print "$co $co{$co}\n";
}
__DATA__
1 A
1 B
1 C
1 B
2 A
2 A
2 B
2 C
|
|
|
|
|