For Programmers: Free Programming Magazines  


Home > Archive > PERL Beginners > June 2007 > Count co-occurrences









You are viewing an archived Text-only version of the thread. To view this thread in it's original format and/or if you want to reply to this thread please [click here]

 

Author Count co-occurrences
Andrej Kastrin

2007-06-22, 9:59 pm

Dear all,

I wrote a simple sql querry to count co-occurrences between words but it
performs very very slow on large datasets. So, it's time to do it with
Perl. I need just a short tip to start out: which structure to use to
count all possible occurrences between letters (e.g. A, B and C) under
the particular document number. My dataset looks like following:

1 A
1 B
1 C
1 B
2 A
2 A
2 B
2 C
etc. till doc. number 100.000

The result file should than be similar to:
A B 4 ### 2 co-occurrences under doc. number 1 + 2 co-occurrences
under doc. number 2
A C 3 ### 1 co-occurrence under doc. number 1 + 2 co-occurrences under
doc. number 2
B C 3 ### 2 co-occurrences under doc. number 1 + 1 co-occurrence under
doc. number 2

Thanks in advance for any pointers.

Best, Andrej



Chas Owens

2007-06-22, 9:59 pm

On 6/22/07, Andrej Kastrin <andrej.kastrin@siol.net> wrote:
> Dear all,
>
> I wrote a simple sql querry to count co-occurrences between words but it
> performs very very slow on large datasets. So, it's time to do it with
> Perl. I need just a short tip to start out: which structure to use to
> count all possible occurrences between letters (e.g. A, B and C) under
> the particular document number. My dataset looks like following:

snip

Here is my naive solution.

#!/usr/bin/perl

use strict;
use warnings;

my %co;
my %doc;
my $oldid;

while (<DATA> ) {
chomp;
my ($id, $val) = split / /;
unless ($id eq $oldid or not defined $oldid) {
my @vals = sort keys %doc;
while (@vals) {
my $val = shift @vals;
for my $other (@vals) {
$co{"$val $other"} += $doc{$val} * $doc{$other};
}
}
%doc = ();
}
$doc{$val}++;
$oldid = $id;
}
my @vals = sort keys %doc;
while (@vals) {
my $val = shift @vals;
for my $other (@vals) {
$co{"$val $other"} += $doc{$val} * $doc{$other};
}
}

for my $co (sort keys %co) {
print "$co $co{$co}\n";
}

__DATA__
1 A
1 B
1 C
1 B
2 A
2 A
2 B
2 C
Sponsored Links







Also available: Server administration forum archive | Web Design forum archive | Software forum archive | Hardware reviews archive

Copyright 2008 codecomments.com