For Programmers: Free Programming Magazines  


Home > Archive > PERL Beginners > September 2006 > Position Weight Matrix of Set of Strings with Perl









You are viewing an archived Text-only version of the thread. To view this thread in it's original format and/or if you want to reply to this thread please [click here]

 

Author Position Weight Matrix of Set of Strings with Perl
Wijaya Edward

2006-09-06, 7:57 am


Dear Experts,

I am looking for a really efficient way to compute a position weight matrix (PWM) from a set of strings. In each set the strings are of the same length. Basically PWM compute the frequency (or probabilities) of bases [ATCG] occur in each position/column o
f a string. For example the set of strings below:

AAA
ATG
TTT
GTC

Note that the length of these strings in the set
maybe greater than 3.

Would give the following result:

$VAR1 = {
'A' => [2,1,1],
'T' => [1,3,1],
'C' => [0,0,1],
'G' => [1,0,1]
};

So the size of the array is the same with the length of the string.
In my case I need the variation of it, namely the probability of the
each base occur in the particular position:

$VAR = {
'A' => ['0.5','0.25','0.25'],
'T' => ['0.25','0.75','0.25'],
'C' => ['0','0','0.25'],
'G' => ['0.25','0','0.25']
}

In this link you can find my incredibly naive and inefficient code.
Can any body suggest a better and faster solution than this:

http://www.rafb.net/paste/results/c6T7B629.html


Thanks and Regards,
Edward WIJAYA
SINGAPORE

------------ Institute For Infocomm Research - Disclaimer -------------
This email is confidential and may be privileged. If you are not the intended recipient, please delete it and notify us immediately. Please do not copy or use it for any purpose, or disclose its contents to any other person. Thank you.
--------------------------------------------------------
Mumia W.

2006-09-06, 7:57 am

On 09/06/2006 04:02 AM, Wijaya Edward wrote:
> Dear Experts,
>
> I am looking for a really efficient way to compute a position weight matrix (PWM) from a set of strings. In each set the strings are of the same length. Basically PWM compute the frequency (or probabilities) of bases [ATCG] occur in each position/column

of a string. For example the set of strings below:
>
> AAA
> ATG
> TTT
> GTC
>
> Note that the length of these strings in the set
> maybe greater than 3.
>
> Would give the following result:
>
> $VAR1 = {
> 'A' => [2,1,1],
> 'T' => [1,3,1],
> 'C' => [0,0,1],
> 'G' => [1,0,1]
> };
>
> So the size of the array is the same with the length of the string.
> In my case I need the variation of it, namely the probability of the
> each base occur in the particular position:
>
> $VAR = {
> 'A' => ['0.5','0.25','0.25'],
> 'T' => ['0.25','0.75','0.25'],
> 'C' => ['0','0','0.25'],
> 'G' => ['0.25','0','0.25']
> }
>
> In this link you can find my incredibly naive and inefficient code.
> Can any body suggest a better and faster solution than this:
>
> http://www.rafb.net/paste/results/c6T7B629.html
>
>
> Thanks and Regards,
> Edward WIJAYA
> SINGAPORE
>


Although I'm sure that smarter posters than I will turn this
into a one-liner, I think that my solution is not so atrocious:

use strict;
use warnings;
use Data::Dumper;
local our @deep;
local $; = ','; # A vestige of a previous version

my @data = qw(AAA ATG TTT GTC);
my @d2 = map [ split // ], @data;

my (%hash);
for my $entry (@d2) {
*deep = $entry;
for my $nx (0..$#deep) {
$hash{$deep[$nx]}[$nx]++;
}
}
foreach my $entry (values %hash) {
$entry = [ map defined $_ ? $_ : 0, @$entry ];
}
print Dumper(\%hash);

__HTH__

Rob Dixon

2006-09-06, 7:57 am

Wijaya Edward wrote:

> Dear Experts,
>
> I am looking for a really efficient way to compute a position weight matrix
> (PWM) from a set of strings. In each set the strings are of the same length.
> Basically PWM compute the frequency (or probabilities) of bases [ATCG] occur
> in each position/column of a string. For example the set of strings below:
>
> AAA
> ATG
> TTT
> GTC
>
> Note that the length of these strings in the set
> maybe greater than 3.
>
> Would give the following result:
>
> $VAR1 = {
> 'A' => [2,1,1],
> 'T' => [1,3,1],
> 'C' => [0,0,1],
> 'G' => [1,0,1]
> };
>
> So the size of the array is the same with the length of the string.
> In my case I need the variation of it, namely the probability of the
> each base occur in the particular position:
>
> $VAR = {
> 'A' => ['0.5','0.25','0.25'],
> 'T' => ['0.25','0.75','0.25'],
> 'C' => ['0','0','0.25'],
> 'G' => ['0.25','0','0.25']
> }
>
> In this link you can find my incredibly naive and inefficient code.
> Can any body suggest a better and faster solution than this:
>
> http://www.rafb.net/paste/results/c6T7B629.html


Hi Edward.

A nice little problem. Thank you.

The main reason for the length of your own solution is that you haven't taken
the opportunity to use hashes to store data that is parallel across the four
possible characters, so the code is about four times as long as it needs to be!

Here is my solution. I have written it to pull data from the pseudo-filehandle
DATA, as it is unlikely that you will want your actual data hard-coded as an
array.

HTH.

Rob Dixon


use strict;
use warnings;

my %pwm;

while (<DATA> ) {
my $col = 0;
foreach my $c (/\S/g) {
$pwm{$c}[$col++]++;
}
}

foreach my $freq (values %pwm) {
$_ = $_ ? $_ / keys %pwm : 0 foreach @$freq;
}

use Data::Dumper;
print Dumper \%pwm;


__END__
AAA
ATG
TTT
GTC


OUTPUT


$VAR1 = {
'A' => [
'0.5',
'0.25',
'0.25'
],
'T' => [
'0.25',
'0.75',
'0.25'
],
'C' => [
0,
0,
'0.25'
],
'G' => [
'0.25',
0,
'0.25'
]
};
Dr.Ruud

2006-09-06, 6:57 pm

Rob Dixon schreef:

> use strict;
> use warnings;
>
> my %pwm;
>
> while (<DATA> ) {
> my $col = 0;
> foreach my $c (/\S/g) {
> $pwm{$c}[$col++]++;
> }
> }
>
> foreach my $freq (values %pwm) {
> $_ = $_ ? $_ / keys %pwm : 0 foreach @$freq;
> }
>
> use Data::Dumper;
> print Dumper \%pwm;
>
>
> __END__
> AAA
> ATG
> TTT
> GTC



Is "keys %pwm" the right divisor, or is the number of lines?


Variant:

#!/usr/bin/perl
use warnings ;
use strict ;

my ($c, $r, %pwm) ;
/\s/ ? ($r++, $c=0) : $pwm{$_}[$c++]++
for do {local $/; <DATA> =~ /(.)/sg} ;
for (values %pwm) { ($_||=0)/=$r for @$_ } ;

use Data::Dumper ;
print Dumper \%pwm ;

__DATA__
AAA
ATG
TTT
GTC

:)

--
Affijn, Ruud

"Gewoon is een tijger."


Wijaya Edward

2006-09-06, 9:57 pm


Dear Rob,

I was trying your script with this set of strings:

__DATA__
CAGGTG
CAGGTG

But how come it returns:

$VAR1 = {
'A' => [ 0, '0.5' ],
'T' => [ 0, 0, 0, 0, '0.5' ],
'C' => [ '0.5' ],
'G' => [ 0, 0, '0.5', '0.5', 0, '0.5' ]
};

Instead of the correct:

$VAR1 = {
'A' => [ '0', '1', '0', '0', '0', '0' ],
'T' => [ '0', '0', '0', '0', '1', '0' ],
'C' => [ '1', '0', '0', '0', '0', '0' ],
'G' => [ '0', '0', '1', '1', '0', '1' ]
};


Hope to hear from you again.


Regards,
Edward WIJAYA
SINGAPORE






________________________________

From: Rob Dixon [mailto:rob.dixon@350.com]
Sent: Wed 9/6/2006 8:14 PM
To: beginners@perl.org
Subject: Re: Position Weight Matrix of Set of Strings with Perl



use strict;
use warnings;

my %pwm;

while (<DATA> ) {
my $col = 0;
foreach my $c (/\S/g) {
$pwm{$c}[$col++]++;
}
}

foreach my $freq (values %pwm) {
$_ = $_ ? $_ / keys %pwm : 0 foreach @$freq;
}

use Data::Dumper;
print Dumper \%pwm;


__END__
AAA
ATG
TTT
GTC


OUTPUT


$VAR1 = {
'A' => [
'0.5',
'0.25',
'0.25'
],
'T' => [
'0.25',
'0.75',
'0.25'
],
'C' => [
0,
0,
'0.25'
],
'G' => [
'0.25',
0,
'0.25'
]
};

--
To unsubscribe, e-mail: beginners-unsubscribe@perl.org
For additional commands, e-mail: beginners-help@perl.org
<http://learn.perl.org/> <http://learn.perl.org/first-response>





------------ Institute For Infocomm Research - Disclaimer -------------
This email is confidential and may be privileged. If you are not the intended recipient, please delete it and notify us immediately. Please do not copy or use it for any purpose, or disclose its contents to any other person. Thank you.
--------------------------------------------------------
Mumia W.

2006-09-07, 3:57 am

On 09/06/2006 05:41 AM, Mumia W. wrote:
> On 09/06/2006 04:02 AM, Wijaya Edward wrote:
>
> Although I'm sure that smarter posters than I will [...]


do it right.

Ugh, I forgot about Wijaya's requirement that the PWM be
calculated in probabilities, and I also forgot that the
lengths of the base-pair strings can be different. Here is my
updated code:

use strict;
use warnings;
use Data::Dumper;
local our @deep;
local $" = ', ';

my $length = 5;
my @data = qw(AAA ATG TTT GTC);
@data = map [ split // ], @data;

my (%hash);
for my $entry (@data) {
*deep = $entry;
for my $nx (0..$#deep) {
$hash{$deep[$nx]}[$nx]++;
}
}

my $count = keys %hash;
while (my ($key, $values) = each %hash) {
$#{$values} = $length;
@$values = map defined $_ ? $_ / ($count) : 0, @$values;
@$values = map sprintf('%4.2f',$_), @$values;
print "$key => [ @{$hash{$key}} ]\n";
}

__END__

Output:
A => [ 0.50, 0.25, 0.25, 0.00, 0.00, 0.00 ]
T => [ 0.25, 0.75, 0.25, 0.00, 0.00, 0.00 ]
C => [ 0.00, 0.00, 0.25, 0.00, 0.00, 0.00 ]
G => [ 0.25, 0.00, 0.25, 0.00, 0.00, 0.00 ]
End output.

Note: I use $length to set the maximal array index value.

Rob Dixon

2006-09-07, 3:57 am

Wijaya Edward wrote:
>
> From: Rob Dixon
>
> I was trying your script with this set of strings:
>
> __DATA__
> CAGGTG
> CAGGTG
>
> But how come it returns:
>
> $VAR1 = {
> 'A' => [ 0, '0.5' ],
> 'T' => [ 0, 0, 0, 0, '0.5' ],
> 'C' => [ '0.5' ],
> 'G' => [ 0, 0, '0.5', '0.5', 0, '0.5' ]
> };
>
> Instead of the correct:
>
> $VAR1 = {
> 'A' => [ '0', '1', '0', '0', '0', '0' ],
> 'T' => [ '0', '0', '0', '0', '1', '0' ],
> 'C' => [ '1', '0', '0', '0', '0', '0' ],
> 'G' => [ '0', '0', '1', '1', '0', '1' ]
> };


Hi Edward

(Please bottom-post your replies)

Yes, I'm sorry, Ruud pointed out that I was dividing by the number of symbols
instead of the number of rows to get the probabilities. You have also found that
each symbol was given no probability value after the last column where it
occurred. This revision fixes both of these problems. I've also borrowed Ruud's
neater construct to normalise the hash values in the final loop. Thanks Ruud!

use strict;
use warnings;

my %pwm;
my ($rows, $cols) = (0, 0);

while (<DATA> ) {
my $col = 0;
$rows++ if /\S/;
foreach my $c (/\S/g) {
$pwm{$c}[$col++]++;
$cols = $col if $col > $cols;
}
}

foreach my $freq (values %pwm) {
($_ |= 0) /= $rows foreach @$freq[0 .. $cols - 1];
}

use Data::Dumper;
print Dumper \%pwm;


__DATA__
CAGGTG
CAGGTG

*OUTPUT*

$VAR1 = {
'A' => [
'0',
'1',
'0',
'0',
'0',
'0'
],
'T' => [
'0',
'0',
'0',
'0',
'1',
'0'
],
'C' => [
'1',
'0',
'0',
'0',
'0',
'0'
],
'G' => [
'0',
'0',
'1',
'1',
'0',
'1'
]
};
Sponsored Links







Also available: Server administration forum archive | Web Design forum archive | Software forum archive | Hardware reviews archive

Copyright 2008 codecomments.com