Home > Archive > PERL Beginners > September 2006 > Position Weight Matrix of Set of Strings with Perl
You are viewing an archived Text-only version of the thread.
To view this thread in it's original format and/or if you want to reply to
this thread please [click here]
| Author |
Position Weight Matrix of Set of Strings with Perl
|
|
| Wijaya Edward 2006-09-06, 7:57 am |
|
Dear Experts,
I am looking for a really efficient way to compute a position weight matrix (PWM) from a set of strings. In each set the strings are of the same length. Basically PWM compute the frequency (or probabilities) of bases [ATCG] occur in each position/column o
f a string. For example the set of strings below:
AAA
ATG
TTT
GTC
Note that the length of these strings in the set
maybe greater than 3.
Would give the following result:
$VAR1 = {
'A' => [2,1,1],
'T' => [1,3,1],
'C' => [0,0,1],
'G' => [1,0,1]
};
So the size of the array is the same with the length of the string.
In my case I need the variation of it, namely the probability of the
each base occur in the particular position:
$VAR = {
'A' => ['0.5','0.25','0.25'],
'T' => ['0.25','0.75','0.25'],
'C' => ['0','0','0.25'],
'G' => ['0.25','0','0.25']
}
In this link you can find my incredibly naive and inefficient code.
Can any body suggest a better and faster solution than this:
http://www.rafb.net/paste/results/c6T7B629.html
Thanks and Regards,
Edward WIJAYA
SINGAPORE
------------ Institute For Infocomm Research - Disclaimer -------------
This email is confidential and may be privileged. If you are not the intended recipient, please delete it and notify us immediately. Please do not copy or use it for any purpose, or disclose its contents to any other person. Thank you.
--------------------------------------------------------
| |
| Mumia W. 2006-09-06, 7:57 am |
| On 09/06/2006 04:02 AM, Wijaya Edward wrote:
> Dear Experts,
>
> I am looking for a really efficient way to compute a position weight matrix (PWM) from a set of strings. In each set the strings are of the same length. Basically PWM compute the frequency (or probabilities) of bases [ATCG] occur in each position/column
of a string. For example the set of strings below:
>
> AAA
> ATG
> TTT
> GTC
>
> Note that the length of these strings in the set
> maybe greater than 3.
>
> Would give the following result:
>
> $VAR1 = {
> 'A' => [2,1,1],
> 'T' => [1,3,1],
> 'C' => [0,0,1],
> 'G' => [1,0,1]
> };
>
> So the size of the array is the same with the length of the string.
> In my case I need the variation of it, namely the probability of the
> each base occur in the particular position:
>
> $VAR = {
> 'A' => ['0.5','0.25','0.25'],
> 'T' => ['0.25','0.75','0.25'],
> 'C' => ['0','0','0.25'],
> 'G' => ['0.25','0','0.25']
> }
>
> In this link you can find my incredibly naive and inefficient code.
> Can any body suggest a better and faster solution than this:
>
> http://www.rafb.net/paste/results/c6T7B629.html
>
>
> Thanks and Regards,
> Edward WIJAYA
> SINGAPORE
>
Although I'm sure that smarter posters than I will turn this
into a one-liner, I think that my solution is not so atrocious:
use strict;
use warnings;
use Data::Dumper;
local our @deep;
local $; = ','; # A vestige of a previous version
my @data = qw(AAA ATG TTT GTC);
my @d2 = map [ split // ], @data;
my (%hash);
for my $entry (@d2) {
*deep = $entry;
for my $nx (0..$#deep) {
$hash{$deep[$nx]}[$nx]++;
}
}
foreach my $entry (values %hash) {
$entry = [ map defined $_ ? $_ : 0, @$entry ];
}
print Dumper(\%hash);
__HTH__
| |
| Rob Dixon 2006-09-06, 7:57 am |
| Wijaya Edward wrote:
> Dear Experts,
>
> I am looking for a really efficient way to compute a position weight matrix
> (PWM) from a set of strings. In each set the strings are of the same length.
> Basically PWM compute the frequency (or probabilities) of bases [ATCG] occur
> in each position/column of a string. For example the set of strings below:
>
> AAA
> ATG
> TTT
> GTC
>
> Note that the length of these strings in the set
> maybe greater than 3.
>
> Would give the following result:
>
> $VAR1 = {
> 'A' => [2,1,1],
> 'T' => [1,3,1],
> 'C' => [0,0,1],
> 'G' => [1,0,1]
> };
>
> So the size of the array is the same with the length of the string.
> In my case I need the variation of it, namely the probability of the
> each base occur in the particular position:
>
> $VAR = {
> 'A' => ['0.5','0.25','0.25'],
> 'T' => ['0.25','0.75','0.25'],
> 'C' => ['0','0','0.25'],
> 'G' => ['0.25','0','0.25']
> }
>
> In this link you can find my incredibly naive and inefficient code.
> Can any body suggest a better and faster solution than this:
>
> http://www.rafb.net/paste/results/c6T7B629.html
Hi Edward.
A nice little problem. Thank you.
The main reason for the length of your own solution is that you haven't taken
the opportunity to use hashes to store data that is parallel across the four
possible characters, so the code is about four times as long as it needs to be!
Here is my solution. I have written it to pull data from the pseudo-filehandle
DATA, as it is unlikely that you will want your actual data hard-coded as an
array.
HTH.
Rob Dixon
use strict;
use warnings;
my %pwm;
while (<DATA> ) {
my $col = 0;
foreach my $c (/\S/g) {
$pwm{$c}[$col++]++;
}
}
foreach my $freq (values %pwm) {
$_ = $_ ? $_ / keys %pwm : 0 foreach @$freq;
}
use Data::Dumper;
print Dumper \%pwm;
__END__
AAA
ATG
TTT
GTC
OUTPUT
$VAR1 = {
'A' => [
'0.5',
'0.25',
'0.25'
],
'T' => [
'0.25',
'0.75',
'0.25'
],
'C' => [
0,
0,
'0.25'
],
'G' => [
'0.25',
0,
'0.25'
]
};
| |
| Dr.Ruud 2006-09-06, 6:57 pm |
| Rob Dixon schreef:
> use strict;
> use warnings;
>
> my %pwm;
>
> while (<DATA> ) {
> my $col = 0;
> foreach my $c (/\S/g) {
> $pwm{$c}[$col++]++;
> }
> }
>
> foreach my $freq (values %pwm) {
> $_ = $_ ? $_ / keys %pwm : 0 foreach @$freq;
> }
>
> use Data::Dumper;
> print Dumper \%pwm;
>
>
> __END__
> AAA
> ATG
> TTT
> GTC
Is "keys %pwm" the right divisor, or is the number of lines?
Variant:
#!/usr/bin/perl
use warnings ;
use strict ;
my ($c, $r, %pwm) ;
/\s/ ? ($r++, $c=0) : $pwm{$_}[$c++]++
for do {local $/; <DATA> =~ /(.)/sg} ;
for (values %pwm) { ($_||=0)/=$r for @$_ } ;
use Data::Dumper ;
print Dumper \%pwm ;
__DATA__
AAA
ATG
TTT
GTC
:)
--
Affijn, Ruud
"Gewoon is een tijger."
| |
| Wijaya Edward 2006-09-06, 9:57 pm |
|
Dear Rob,
I was trying your script with this set of strings:
__DATA__
CAGGTG
CAGGTG
But how come it returns:
$VAR1 = {
'A' => [ 0, '0.5' ],
'T' => [ 0, 0, 0, 0, '0.5' ],
'C' => [ '0.5' ],
'G' => [ 0, 0, '0.5', '0.5', 0, '0.5' ]
};
Instead of the correct:
$VAR1 = {
'A' => [ '0', '1', '0', '0', '0', '0' ],
'T' => [ '0', '0', '0', '0', '1', '0' ],
'C' => [ '1', '0', '0', '0', '0', '0' ],
'G' => [ '0', '0', '1', '1', '0', '1' ]
};
Hope to hear from you again.
Regards,
Edward WIJAYA
SINGAPORE
________________________________
From: Rob Dixon [mailto:rob.dixon@350.com]
Sent: Wed 9/6/2006 8:14 PM
To: beginners@perl.org
Subject: Re: Position Weight Matrix of Set of Strings with Perl
use strict;
use warnings;
my %pwm;
while (<DATA> ) {
my $col = 0;
foreach my $c (/\S/g) {
$pwm{$c}[$col++]++;
}
}
foreach my $freq (values %pwm) {
$_ = $_ ? $_ / keys %pwm : 0 foreach @$freq;
}
use Data::Dumper;
print Dumper \%pwm;
__END__
AAA
ATG
TTT
GTC
OUTPUT
$VAR1 = {
'A' => [
'0.5',
'0.25',
'0.25'
],
'T' => [
'0.25',
'0.75',
'0.25'
],
'C' => [
0,
0,
'0.25'
],
'G' => [
'0.25',
0,
'0.25'
]
};
--
To unsubscribe, e-mail: beginners-unsubscribe@perl.org
For additional commands, e-mail: beginners-help@perl.org
<http://learn.perl.org/> <http://learn.perl.org/first-response>
------------ Institute For Infocomm Research - Disclaimer -------------
This email is confidential and may be privileged. If you are not the intended recipient, please delete it and notify us immediately. Please do not copy or use it for any purpose, or disclose its contents to any other person. Thank you.
--------------------------------------------------------
| |
| Mumia W. 2006-09-07, 3:57 am |
| On 09/06/2006 05:41 AM, Mumia W. wrote:
> On 09/06/2006 04:02 AM, Wijaya Edward wrote:
>
> Although I'm sure that smarter posters than I will [...]
do it right.
Ugh, I forgot about Wijaya's requirement that the PWM be
calculated in probabilities, and I also forgot that the
lengths of the base-pair strings can be different. Here is my
updated code:
use strict;
use warnings;
use Data::Dumper;
local our @deep;
local $" = ', ';
my $length = 5;
my @data = qw(AAA ATG TTT GTC);
@data = map [ split // ], @data;
my (%hash);
for my $entry (@data) {
*deep = $entry;
for my $nx (0..$#deep) {
$hash{$deep[$nx]}[$nx]++;
}
}
my $count = keys %hash;
while (my ($key, $values) = each %hash) {
$#{$values} = $length;
@$values = map defined $_ ? $_ / ($count) : 0, @$values;
@$values = map sprintf('%4.2f',$_), @$values;
print "$key => [ @{$hash{$key}} ]\n";
}
__END__
Output:
A => [ 0.50, 0.25, 0.25, 0.00, 0.00, 0.00 ]
T => [ 0.25, 0.75, 0.25, 0.00, 0.00, 0.00 ]
C => [ 0.00, 0.00, 0.25, 0.00, 0.00, 0.00 ]
G => [ 0.25, 0.00, 0.25, 0.00, 0.00, 0.00 ]
End output.
Note: I use $length to set the maximal array index value.
| |
| Rob Dixon 2006-09-07, 3:57 am |
| Wijaya Edward wrote:
>
> From: Rob Dixon
>
> I was trying your script with this set of strings:
>
> __DATA__
> CAGGTG
> CAGGTG
>
> But how come it returns:
>
> $VAR1 = {
> 'A' => [ 0, '0.5' ],
> 'T' => [ 0, 0, 0, 0, '0.5' ],
> 'C' => [ '0.5' ],
> 'G' => [ 0, 0, '0.5', '0.5', 0, '0.5' ]
> };
>
> Instead of the correct:
>
> $VAR1 = {
> 'A' => [ '0', '1', '0', '0', '0', '0' ],
> 'T' => [ '0', '0', '0', '0', '1', '0' ],
> 'C' => [ '1', '0', '0', '0', '0', '0' ],
> 'G' => [ '0', '0', '1', '1', '0', '1' ]
> };
Hi Edward
(Please bottom-post your replies)
Yes, I'm sorry, Ruud pointed out that I was dividing by the number of symbols
instead of the number of rows to get the probabilities. You have also found that
each symbol was given no probability value after the last column where it
occurred. This revision fixes both of these problems. I've also borrowed Ruud's
neater construct to normalise the hash values in the final loop. Thanks Ruud!
use strict;
use warnings;
my %pwm;
my ($rows, $cols) = (0, 0);
while (<DATA> ) {
my $col = 0;
$rows++ if /\S/;
foreach my $c (/\S/g) {
$pwm{$c}[$col++]++;
$cols = $col if $col > $cols;
}
}
foreach my $freq (values %pwm) {
($_ |= 0) /= $rows foreach @$freq[0 .. $cols - 1];
}
use Data::Dumper;
print Dumper \%pwm;
__DATA__
CAGGTG
CAGGTG
*OUTPUT*
$VAR1 = {
'A' => [
'0',
'1',
'0',
'0',
'0',
'0'
],
'T' => [
'0',
'0',
'0',
'0',
'1',
'0'
],
'C' => [
'1',
'0',
'0',
'0',
'0',
'0'
],
'G' => [
'0',
'0',
'1',
'1',
'0',
'1'
]
};
|
|
|
|
|