Home > Archive > PERL Beginners > October 2004 > counting gaps in sequence data
You are viewing an archived Text-only version of the thread.
To view this thread in it's original format and/or if you want to reply to
this thread please [click here]
| Author |
counting gaps in sequence data
|
|
| Michael Robeson 2004-10-14, 3:56 pm |
| I have a set of data that looks something like the following:
>human
acgtt---cgatacg---acgact-----t
>chimp
acgtacgatac---actgca---ac
>mouse
acgata---acgatcg----acgt
I am having trouble setting up a hash etc., to count the number and
types of continuous gaps. For example the 'human' sequence above has 2
sets of 3 gaps and 1 set of 5 gaps. The 'chimp' has 2 sets of 3 gaps
and finally the 'mouse' has 1 set of 3 gaps and 1 set of 4 gaps.
So, I am having trouble being able to assign a dynamic variable (i.e.
gap length) and place that in a pattern match so that it can count how
many gaps of that length are in that particular sequence. I know how to
set up a hash to count the number of times a gap appears:
'$gaptype{$gap}++' or something. The problem is: what is the best way
(and how) can I set '$gap' to be dynamic.
I need to know the length of each consecutive string of gaps. I know
how to count the gaps by using the 'tr' function. But it gets confusing
when I need to add counts to every instance of that gap length. I also
need to know the position of each gap (denoted by the position of the
first gap in that particular instance). I know that I can use the
'pos()' command for this.
So, my problem is that I think I know some of the bits of code to put
into place the problem is I am getting lost on how to structure it all
together. For now I am just trying to get my output to look like this:
Human
number of 3 base pair gaps: 2
at positions: 6, 16
number of 5 base pair gaps: 1
at positions: 25
Chimp
..... and so on ...
So, any suggestions would be greatly appreciated. If anyone can help me
out with all or even just bits of this I would greatly appreciate it.
This should help me get started on some more advanced parsing I need to
do after this. I like to try and figure things out on my own if I can,
so even pseudo code would be of great help!
-Thanks
-Mike
| |
| Errin Larsen 2004-10-14, 3:56 pm |
| On Thu, 14 Oct 2004 11:02:06 -0600, Michael Robeson <popgen23@mac.com> wrote:
> I have a set of data that looks something like the following:
>
<<SNIP>>
>
> So, any suggestions would be greatly appreciated. If anyone can help me
> out with all or even just bits of this I would greatly appreciate it.
> This should help me get started on some more advanced parsing I need to
> do after this. I like to try and figure things out on my own if I can,
> so even pseudo code would be of great help!
>
> -Thanks
> -Mike
>
Hi Mike,
This list works best if you show us some code you have tried and
then we can help you troubleshoot it. It lets us know that you've
already tried to solve your problem and aren't looking for a free
scripting service! Show us some of that code you talked about and
we'll help you out
--Errin
PS: is this a common problem/exercise in some class somewhere? I keep
seeing requests for help having to do with those exact strings of DNA
data. Is there a bunch of people working on DNA projects using Perl
somewhere? Or, is this some homework?
| |
| Willy West 2004-10-14, 3:56 pm |
| > PS: is this a common problem/exercise in some class somewhere? I keep
> seeing requests for help having to do with those exact strings of DNA
> data. Is there a bunch of people working on DNA projects using Perl
> somewhere? Or, is this some homework?
bio-informatics is a big area in which Perl is involved... there's even
a book from O'reilly on the subject...
also, a mailing-list is available...
from http://lists.perl.org/ -
bioperl-announce-l List is for people only interested in announcements
of Bioperl code releases, updates and events.
bioperl-l Unmoderated list for general discussion of bioperl modules,
current & future efforts and related topics.
-
in the latter of the two lists, i counted about 80 messages in the first half
of this month.
hmm... i might join it... :) might be fun!!
--
Willy
http://www.hackswell.com/corenth
| |
| Errin Larsen 2004-10-14, 8:55 pm |
| On Thu, 14 Oct 2004 15:40:24 -0400, Willy West <corenth@gmail.com> wrote:
>
> bio-informatics is a big area in which Perl is involved... there's even
> a book from O'reilly on the subject...
>
> also, a mailing-list is available...
>
> from http://lists.perl.org/ -
>
> bioperl-announce-l List is for people only interested in announcements
> of Bioperl code releases, updates and events.
>
> bioperl-l Unmoderated list for general discussion of bioperl modules,
> current & future efforts and related topics.
>
> -
>
> in the latter of the two lists, i counted about 80 messages in the first half
> of this month.
>
> hmm... i might join it... :) might be fun!!
>
> --
> Willy
> http://www.hackswell.com/corenth
>
>
If what you say is true, then maybe Mike needs to take his questions
to those list? I mean, if the problem he's describing is common and
the data format he's using is common, I bet it's been solved already.
Hey mike, have you searched on CPAN (search.cpan.org) for this?
| |
| Willy West 2004-10-14, 8:55 pm |
| On Thu, 14 Oct 2004 14:47:57 -0500, Errin Larsen <errinlarsen@gmail.com> wrote:
>
[color=darkred]
> If what you say is true, then maybe Mike needs to take his questions
> to those list? I mean, if the problem he's describing is common and
> the data format he's using is common, I bet it's been solved already.
>
> Hey mike, have you searched on CPAN (search.cpan.org) for this?
>
that might be fine- but his question is fundamentally Perl in nature-
he may use the information for bio-informatics, but he is looking for
an answer regarding effective Perl use when dealing with strings-
that's classic Perl and a classic question for this list :)
i wish i could answer his question right off the bat, but i can't :/
awell...
--
Willy
http://www.hackswell.com/corenth
| |
| Errin Larsen 2004-10-14, 8:55 pm |
| On Thu, 14 Oct 2004 16:08:44 -0400, Willy West <corenth@gmail.com> wrote:
> On Thu, 14 Oct 2004 14:47:57 -0500, Errin Larsen <errinlarsen@gmail.com> wrote:
>
>
>
> that might be fine- but his question is fundamentally Perl in nature-
> he may use the information for bio-informatics, but he is looking for
> an answer regarding effective Perl use when dealing with strings-
> that's classic Perl and a classic question for this list :)
>
> i wish i could answer his question right off the bat, but i can't :/
>
> awell...
>
very true. I've been playin' with this problem, it seems very
fun/interesting. I was thinking this would be alot easier to play
with if the characters in the string were all in an array. so I
played with the length() function and the substr() function to push
them all into an array in a loop. something like this (prolly a
prettier/easier way to do this):
use Data::Dumper;
my $human = "acgtt---cgatacg---acgact-----t";
my @human;
for my $pos( 0 .. (length($human) -1)) {
push @human, substr($human, $pos, 1);
}
print Dumper(\@human};
The above produces the following output, which I was thinking might be
easier to work with for Mike's problem:
$VAR1 = [
'a',
'c',
'g',
't',
't',
'-',
'-',
'-',
'c',
'g',
'a',
't',
'a',
'c',
'g',
'-',
'-',
'-',
'a',
'c',
'g',
'a',
'c',
't',
'-',
'-',
'-',
'-',
'-',
't'
];
This is kinda fun!
--Errin
| |
| Paul Johnson 2004-10-14, 8:55 pm |
| On Thu, Oct 14, 2004 at 11:02:06AM -0600, Michael Robeson wrote:
> I have a set of data that looks something like the following:
> So, my problem is that I think I know some of the bits of code to put
> into place the problem is I am getting lost on how to structure it all
> together.
> I like to try and figure things out on my own if I can,
> so even pseudo code would be of great help!
Here's some code that works, incorporating most of your ideas. It's
only a partial solution, but it think it covers the bit that was causing
you problems. I'll present it without comment, but I'm sure people will
be able to explain anything you don't understand.
#!/usr/bin/perl -w
use strict;
use Data::Dumper;
while (<DATA> )
{
print, next unless /-/;
my %gaps;
push @{$gaps{length $1}}, pos() + 1 - length $1 while /(-+)/g;
print Dumper \%gaps;
}
__DATA__
human
acgtt---cgatacg---acgact-----t
chimp
acgtacgatac---actgca---ac
mouse
acgata---acgatcg----acgt
--
Paul Johnson - paul@pjcj.net
http://www.pjcj.net
| |
| Michael Robeson 2004-10-14, 8:55 pm |
| Here is as far as I can get with some real code and pseudo code. This
is just to show you that I am actually trying. :-)
Pseudo - code
################################
# open DNA sequence file
################################
print "Enter in the name of the DNA sequence file:\n";
chomp (my $dna_seq = <STDIN> );
open(DNA_SEQ, $dna_seq)
or die "Can't open sequence file for input: $!";
########################################
################
# read sequence data into a hash - not sure if this is how I should do
it?
########################################
################
my %sequences;
$/ = '>'; # set to read in paragraph mode
print "\n***Discovered the following DNA sequences:***\n";
while ( <DNA_SEQ> ) {
chomp;
next unless s/^\s*(.+)//;
my $name = $1;
s/\s//g;
$sequences{$name} = $_;
print "$name found!\n";
}
close DNA_SEQ;
###################################
# search for and characterize gaps
###################################
<<somehow get data from hash and present it to a loop>>
%gaptype;
<<major pseudo code below>>
foreach /\D(-+)\D/ found in each sequece # searches for gaps flanked by
sequence
$position = pos($1);
$gaplength = $1;
$gaplength =~ tr/-//g; # count the number of '-' for that particular
# gap being processed
$gaptype{gaplength}++; # count the number of times each gap type
appears
<<somehow get information from loop an print as seen below>>
____OUTPUT_____
Human
number of 3 base pair gaps: 2
at positions: 6, 16
number of 5 base pair gaps: 1
at positions: 25
..
... and so on...
..
__DATA__
human
acgtt---cgatacg---acgact-----t
chimp
acgtacgatac---actgca---ac
mouse
acgata---acgatcg----acgt
| |
| Errin Larsen 2004-10-14, 8:55 pm |
| On Thu, 14 Oct 2004 23:23:48 +0200, Paul Johnson <paul@pjcj.net> wrote:
> On Thu, Oct 14, 2004 at 11:02:06AM -0600, Michael Robeson wrote:
>
>
Hi Paul,
I think you missed a critical part of Mike's post!:
[color=darkred]
> For now I am just trying to get my output to look like this:
>
>Human
>number of 3 base pair gaps: 2
> at positions: 6, 16
>number of 5 base pair gaps: 1
> at positions: 25
>
>Chimp
>.... and so on ...
I've put together something that will get the first part done
(counting base pair gaps, I guess is the point!) Code is as follows:
use warnings;
use strict;
use Data::Dumper;
my %sequences = (
'human' => "acgtt---cgatacg---acgact-----t",
'chimp' => "acgtt---cgatacg---acgact-----t",
'mouse' => "acgata---acgatcg----acgt",
);
my %results;
foreach my $species( keys %sequences ) {
my $base_pair = 0;
my $base_pair_value;
foreach( split( / */, $sequences{$species} )) {
if( /-/ ) {
$base_pair = 1;
$base_pair_value .= $_;
} elsif( $base_pair ) {
$results{$species}{length($base_pair_val
ue)} += 1;
$base_pair = 0;
$base_pair_value = undef;
}
}
}
foreach my $species( keys %results ) {
print "$species = $sequences{$species}\n";
foreach my $base_pair( keys %{$results{$species}} ) {
print " Number of $base_pair base pair
gaps:\t$results{$species}{$base_pair}\n";
}
}
This will produce the following output:
# dnatest
human = acgtt---cgatacg---acgact-----t
Number of 3 base pair gaps: 2
Number of 5 base pair gaps: 1
chimp = acgtt---cgatacg---acgact-----t
Number of 3 base pair gaps: 2
Number of 5 base pair gaps: 1
mouse = acgata---acgatcg----acgt
Number of 4 base pair gaps: 1
Number of 3 base pair gaps: 1
I put the sequence in the output for easy troubleshooting and
checking. I'm still working on figuring out the positional data.
This IS fun. I'll post when I've got it figured out
--Errin
| |
| Paul Johnson 2004-10-14, 8:55 pm |
| On Thu, Oct 14, 2004 at 04:50:30PM -0500, Errin Larsen wrote:
> Hi Paul,
> I think you missed a critical part of Mike's post!:
No, I didn't miss it. I just thought it likely that Mike could get to
there from where I left off, so I gave a solution to the bit that seemed
most troublesome.
> I'm still working on figuring out the positional data.
That data is included in the code I posted.
> This IS fun. I'll post when I've got it figured out
I look forward to it.
--
Paul Johnson - paul@pjcj.net
http://www.pjcj.net
| |
| Michael Robeson 2004-10-14, 8:55 pm |
| Yeah, I have just submitted that same question verbatim to the bio-perl
list. I am still running through some ideas though. I have both
Bioinformatics perl books. They are not very effective teaching books.
The books spend too much time on using modules. Though while I
understand the usefulness of not having to re-write code, it is a bad
idea for beginners like me. Because re-writing code at first gives me a
lot of practice. Some of the scripts in the books use like 3-5 modules,
so it gets confusing on what is going on.
I mean the books are not useless, but they definitely are structured
for a class with a teacher.
:-)
-Mike
| |
| Philipp Traeder 2004-10-14, 8:55 pm |
| On Thursday 14 October 2004 21:44, Michael Robeson wrote:
> Here is as far as I can get with some real code and pseudo code. This
> is just to show you that I am actually trying. :-)
Hi Michael,
this looks quite promising - I'll try to sketch a solution without giving
everything away ;-)
>
> Pseudo - code
This is really important: Always start your scripts with
use warnings;
use strict;
This will help you avoiding errors.
>
> ################################
> # open DNA sequence file
> ################################
>
> print "Enter in the name of the DNA sequence file:\n";
> chomp (my $dna_seq = <STDIN> );
> open(DNA_SEQ, $dna_seq)
> or die "Can't open sequence file for input: $!";
>
> ########################################
################
> # read sequence data into a hash - not sure if this is how I should do
> it?
> ########################################
################
>
> my %sequences;
>
> $/ = '>'; # set to read in paragraph mode
>
> print "\n***Discovered the following DNA sequences:***\n";
>
> while ( <DNA_SEQ> ) {
> chomp;
> next unless s/^\s*(.+)//;
> my $name = $1;
> s/\s//g;
> $sequences{$name} = $_;
> print "$name found!\n";
> }
> close DNA_SEQ;
This part is not working completely for me - I get the first sequence only.
I've never got used to working with the $/ operator, therefore I'd rewrite
this part like this (without wanting to say that this would be easier than
your approach):
print "\n***Discovered the following DNA sequences:***\n";
# each line holds either a caption or a sequence
my $is_sequence_line = 0; # first line holds a caption
my $name;
while ( <DNA_SEQ> ) {
chomp;
if ($is_sequence_line) {
# this line holds a sequence
$sequences{$name} = $_;
print "sequence for $name found!\n";
}
else {
# this line holds a name - remember it
print "new name : $_\n";
$name = $_;
}
$is_sequence_line = ! $is_sequence_line;
}
close DNA_SEQ;
print "\n";
>
>
> ###################################
> # search for and characterize gaps
> ###################################
>
> <<somehow get data from hash and present it to a loop>>
>
> %gaptype;
>
> <<major pseudo code below>>
> foreach /\D(-+)\D/ found in each sequece # searches for gaps flanked by
> sequence
> $position = pos($1);
> $gaplength = $1;
> $gaplength =~ tr/-//g; # count the number of '-' for that particular
> # gap being processed
> $gaptype{gaplength}++; # count the number of times each gap type
> appears
>
> <<somehow get information from loop an print as seen below>>
>
# first, we need to iterate over the sequences
while (my ($name, $sequence) = each(%sequences)) {
# now we can work on each sequence
print "working on sequence $name\n";
print "sequence : $sequence\n";
# find all gaps in the sequence
# (the g modifier means to search for all occurences)
while ($sequence =~ /(-+)/g) {
print "\n";
# the special variables $`, $& and $' give you the
# text left of the match, the match itself, and the
# text right of the match.
# Using those variables and the length() function,
# you can print something like this:
print "found the following gap : '" . $& . "'\n";
print "the text before the gap : '" . $` . "'\n";
print "the text after the gap : '" . $' . "'\n";
print "the text before the gap is " . length ($`) . " chars long.\n";
}
print "\n";
}
I hope this might give you an approach for handling this - with the variables
I printed, you should be able to fill your %gaptype hash like you sketched
above (you find the length of the gap in length($&), and the gap's position
is the same as the "length of the text left of the gap" + 1).
> ____OUTPUT_____
>
> Human
> number of 3 base pair gaps: 2
> at positions: 6, 16
> number of 5 base pair gaps: 1
> at positions: 25
> .
> .. and so on...
You could either (easy variant) print this information while running through
the while loop or (a bit harder, but might be easier in the long run if you
need to process the data further) store the amount and position of the gaps
inside a two-dimensional hash of arrays - something like:
my %gaps =
( human =>
{ 3 => [6, 16],
5 => [25]
},
chimp =>
...
)
It's a bit tricky to deal with structures like this, but once you got used to
them, you must love them ;-)
Hope this helps.
Philipp
| |
| Errin Larsen 2004-10-15, 3:55 pm |
| On Thu, 14 Oct 2004 16:11:42 -0600, Michael Robeson <popgen23@mac.com> wrote:
> Yeah, I have just submitted that same question verbatim to the bio-perl
> list. I am still running through some ideas though. I have both
> Bioinformatics perl books. They are not very effective teaching books.
>
> The books spend too much time on using modules. Though while I
> understand the usefulness of not having to re-write code, it is a bad
> idea for beginners like me. Because re-writing code at first gives me a
> lot of practice. Some of the scripts in the books use like 3-5 modules,
> so it gets confusing on what is going on.
>
> I mean the books are not useless, but they definitely are structured
> for a class with a teacher.
>
> :-)
>
> -Mike
>
Hi again, Mike!
I've thrown together the following code. I have not commented this!
If you have some questions, just ask. I hard coded the sequences for
my ease-of-use. It looked to me like you have figured out how to grab
the sequences out of a file and throw them in a hash. This code uses
some deep nested references, and therefore, some crazy dereferences.
Have fun with it, I know I did! Things that might look weird: check
out perldoc -f split for info on using a null-string to split with
(That's were I found it!) and of course perldoc perlref for all the
deep nested references and dereferencing stuff! I'm currently reading
"Learning Perl Objects, References & Modules" by Randal Schwartz. I
highly recommend it. It helped a lot in this exercise. Here's the
code:
use warnings;
use strict;
my %sequences = (
'Human' => "acgtt---cgatacg---acgact-----t",
'Chimp' => "acgtt---cgatacg---acgact-----t",
'Mouse' => "acgata---acgatcg----acgt",
);
my %results;
foreach my $species( keys %sequences ) {
my $is_base_pair_gap = 0;
my $base_pair_gap;
my $base_pair_gap_pos;
my $position = 1;
foreach( split( / */, $sequences{$species} )) {
if( /-/ ) {
unless( $is_base_pair_gap ) {
$base_pair_gap_pos = $position;
}
$is_base_pair_gap = 1;
$base_pair_gap .= $_;
} elsif( $is_base_pair_gap ) {
push
@{$results{$species}{length($base_pair_g
ap)}}, $base_pair_gap_pos;
$is_base_pair_gap = 0;
$base_pair_gap = undef;
}
$position++;
}
}
foreach my $species( keys %results ) {
print "$species:\n";
foreach my $base_pair_gap( keys %{$results{$species}} ) {
print " Number of $base_pair_gap base pair gaps:\t",
scalar( @{$results{$species}{$base_pair_gap}}), "\n";
print " at position(s) ", join( ',',
@{$results{$species}{$base_pair_gap}} ), ".\n";
}
print "\n";
}
The heart of this code is this line:
push @{$results{$species}{length($base_pair_g
ap)}}, $base_pair_gap_pos;
there is a %results hash which has keys that are the different
species, and values that point to another hash. THAT hash (the inner
hash) has keys that are the length of the base-pair-gaps, and values
that point to an array. The array holds a list of the positions of
those base-pair gaps! The first base pair gap in the human sequence
is '---' at the 6th character. That looks like this (warning: pseudo
code for clarity!)
%results->{'Human'}->{ 3 }->[6]
When we find the second '---' gap, we add it's position to the array:
%results->{'Human'}->{ 3 }->[6,16]
Then, we find a new base-pair-gap ('-----') so we add a new key to inner hash:
%results->{'Human'}->{ 3 }->[6,16]
->{ 5 }->[25]
Next, we move on to the next species ...
%results->{'Human'}->{ 3 }->[6,16]
->{ 5 }->[25]
->{'Mouse'}->{ 3 }->[7]
So, finally, with Data::Dumper, we can see the %results hash when the
code is done processing the sequence:
%results = {
'Human' => {
'3' => [
6,
16
],
'5' => [
25
]
},
'Mouse' => {
'4' => [
17
],
'3' => [
7
]
},
'Chimp' => {
'3' => [
6,
16
],
'5' => [
25
]
}
};
I hope this is helpful! This really was a lot of fun.
--Errin
| |
| Chris Cole 2004-10-15, 3:55 pm |
| On Thu, 14 Oct 2004 15:40:24 -0400, Willy West wrote:
>
> bio-informatics is a big area in which Perl is involved... there's
> even a book from O'reilly on the subject...
I think you'll find it's called bioinformatics, and yes it's a huge
(scientific) field.
The reason why perl is used so much in bioinformatics is that the majority
of biological data out there is textual and there's terabytes of it around
(if not more). There are large databases with +100K entries which are
parsed to extract either protein or DNA data (or other data) for further
analysis. Look here if you're interested in finding out more:
http://www.ebi.ac.uk
> also, a mailing-list is available...
>
>
> from http://lists.perl.org/ -
>
> bioperl-announce-l List is for people only interested in announcements
> of Bioperl code releases, updates and events.
>
> bioperl-l Unmoderated list for general discussion of bioperl modules,
> current & future efforts and related topics.
>
>
The problem with bioperl mailing lists is that they are primarily for
discussing BioPerl, which is a collection of perl modules designed to aid
the anaylsis of biological data. The problem I find with BioPerl is that
it isn't very good IMHO. In fact, no-one I know uses it and the few times
I've tried using it I ended getting so frustrated with how it was written
that I ended up writing my own code from scratch.
I actually think there should a specific bioinformatics ng which is
language agnostic to aid solving common problems, of which there are many.
Chris.
| |
| Michael Robeson 2004-10-15, 3:55 pm |
| Errin,
Thanks so much! I will spend the w end going over what you've posted.
Looks like I will learn a lot from this post alone. This stuff is so
addictive. I can spend hours doing this and not realize it. If I am
successful or not is another story! :-)
I'll definitely let you know if I have any trouble.
-Cheers!
-Mike
On Oct 15, 2004, at 7:22 AM, Errin Larsen wrote:
> On Thu, 14 Oct 2004 16:11:42 -0600, Michael Robeson <popgen23@mac.com>
> wrote:
>
> Hi again, Mike!
>
> I've thrown together the following code. I have not commented this!
> If you have some questions, just ask. I hard coded the sequences for
> my ease-of-use. It looked to me like you have figured out how to grab
> the sequences out of a file and throw them in a hash. This code uses
> some deep nested references, and therefore, some crazy dereferences.
> Have fun with it, I know I did! Things that might look weird: check
> out perldoc -f split for info on using a null-string to split with
> (That's were I found it!) and of course perldoc perlref for all the
> deep nested references and dereferencing stuff! I'm currently reading
> "Learning Perl Objects, References & Modules" by Randal Schwartz. I
> highly recommend it. It helped a lot in this exercise. Here's the
> code:
>
> use warnings;
> use strict;
>
> my %sequences = (
> 'Human' => "acgtt---cgatacg---acgact-----t",
> 'Chimp' => "acgtt---cgatacg---acgact-----t",
> 'Mouse' => "acgata---acgatcg----acgt",
> );
> my %results;
>
> foreach my $species( keys %sequences ) {
> my $is_base_pair_gap = 0;
> my $base_pair_gap;
> my $base_pair_gap_pos;
> my $position = 1;
> foreach( split( / */, $sequences{$species} )) {
> if( /-/ ) {
> unless( $is_base_pair_gap ) {
> $base_pair_gap_pos = $position;
> }
> $is_base_pair_gap = 1;
> $base_pair_gap .= $_;
> } elsif( $is_base_pair_gap ) {
> push
> @{$results{$species}{length($base_pair_g
ap)}}, $base_pair_gap_pos;
> $is_base_pair_gap = 0;
> $base_pair_gap = undef;
> }
> $position++;
> }
> }
>
> foreach my $species( keys %results ) {
> print "$species:\n";
> foreach my $base_pair_gap( keys %{$results{$species}} ) {
> print " Number of $base_pair_gap base pair gaps:\t",
> scalar( @{$results{$species}{$base_pair_gap}}), "\n";
> print " at position(s) ", join( ',',
> @{$results{$species}{$base_pair_gap}} ), ".\n";
> }
> print "\n";
> }
>
>
>
>
> The heart of this code is this line:
> push @{$results{$species}{length($base_pair_g
ap)}}, $base_pair_gap_pos;
>
> there is a %results hash which has keys that are the different
> species, and values that point to another hash. THAT hash (the inner
> hash) has keys that are the length of the base-pair-gaps, and values
> that point to an array. The array holds a list of the positions of
> those base-pair gaps! The first base pair gap in the human sequence
> is '---' at the 6th character. That looks like this (warning: pseudo
> code for clarity!)
> %results->{'Human'}->{ 3 }->[6]
> When we find the second '---' gap, we add it's position to the array:
> %results->{'Human'}->{ 3 }->[6,16]
> Then, we find a new base-pair-gap ('-----') so we add a new key to
> inner hash:
> %results->{'Human'}->{ 3 }->[6,16]
> ->{ 5 }->[25]
> Next, we move on to the next species ...
> %results->{'Human'}->{ 3 }->[6,16]
> ->{ 5 }->[25]
> ->{'Mouse'}->{ 3 }->[7]
>
> So, finally, with Data::Dumper, we can see the %results hash when the
> code is done processing the sequence:
>
> %results = {
> 'Human' => {
> '3' => [
> 6,
> 16
> ],
> '5' => [
> 25
> ]
> },
> 'Mouse' => {
> '4' => [
> 17
> ],
> '3' => [
> 7
> ]
> },
> 'Chimp' => {
> '3' => [
> 6,
> 16
> ],
> '5' => [
> 25
> ]
> }
> };
>
>
> I hope this is helpful! This really was a lot of fun.
>
> --Errin
| |
| Zeus Odin 2004-10-24, 3:55 pm |
| This is a really problem. See solution below. Michael, next time post
some code please.
Thanks,
ZO
--------------
#!/usr/bin/perl
use warnings;
use strict;
use Data::Dumper;
my(%gap, $animal);
while (<DATA> ) {
if (/>(\w+)/) {
$animal = $1;
} else {
while (/(-+)/) {
$gap{$animal}{ length $1 } += s/-+//;
}
}
}
print Dumper \%gap;
__DATA__
>human
acgtt---cgatacg---acgact-----t
>chimp
acgtacgatac---actgca---ac
>mouse
acgata---acgatcg----acgt
--------------
even[color=darkred]
>
>
>
> that might be fine- but his question is fundamentally Perl in nature-
> he may use the information for bio-informatics, but he is looking for
> an answer regarding effective Perl use when dealing with strings-
> that's classic Perl and a classic question for this list :)
>
> i wish i could answer his question right off the bat, but i can't :/
>
> awell...
|
|
|
|
|