Code Comments
Programming Forum and web based access to our favorite programming groups.I have a set of data that looks something like the following:
>human
acgtt---cgatacg---acgact-----t
>chimp
acgtacgatac---actgca---ac
>mouse
acgata---acgatcg----acgt
I am having trouble setting up a hash etc., to count the number and
types of continuous gaps. For example the 'human' sequence above has 2
sets of 3 gaps and 1 set of 5 gaps. The 'chimp' has 2 sets of 3 gaps
and finally the 'mouse' has 1 set of 3 gaps and 1 set of 4 gaps.
So, I am having trouble being able to assign a dynamic variable (i.e.
gap length) and place that in a pattern match so that it can count how
many gaps of that length are in that particular sequence. I know how to
set up a hash to count the number of times a gap appears:
'$gaptype{$gap}++' or something. The problem is: what is the best way
(and how) can I set '$gap' to be dynamic.
I need to know the length of each consecutive string of gaps. I know
how to count the gaps by using the 'tr' function. But it gets confusing
when I need to add counts to every instance of that gap length. I also
need to know the position of each gap (denoted by the position of the
first gap in that particular instance). I know that I can use the
'pos()' command for this.
So, my problem is that I think I know some of the bits of code to put
into place the problem is I am getting lost on how to structure it all
together. For now I am just trying to get my output to look like this:
Human
number of 3 base pair gaps: 2
at positions: 6, 16
number of 5 base pair gaps: 1
at positions: 25
Chimp
.... and so on ...
So, any suggestions would be greatly appreciated. If anyone can help me
out with all or even just bits of this I would greatly appreciate it.
This should help me get started on some more advanced parsing I need to
do after this. I like to try and figure things out on my own if I can,
so even pseudo code would be of great help!
-Thanks
-Mike
Post Follow-up to this messageOn Thu, 14 Oct 2004 11:02:06 -0600, Michael Robeson <popgen23@mac.com> wrote: > I have a set of data that looks something like the following: > <<SNIP>> > > So, any suggestions would be greatly appreciated. If anyone can help me > out with all or even just bits of this I would greatly appreciate it. > This should help me get started on some more advanced parsing I need to > do after this. I like to try and figure things out on my own if I can, > so even pseudo code would be of great help! > > -Thanks > -Mike > Hi Mike, This list works best if you show us some code you have tried and then we can help you troubleshoot it. It lets us know that you've already tried to solve your problem and aren't looking for a free scripting service! Show us some of that code you talked about and we'll help you out --Errin PS: is this a common problem/exercise in some class somewhere? I keep seeing requests for help having to do with those exact strings of DNA data. Is there a bunch of people working on DNA projects using Perl somewhere? Or, is this some homework?
Post Follow-up to this message> PS: is this a common problem/exercise in some class somewhere? I keep > seeing requests for help having to do with those exact strings of DNA > data. Is there a bunch of people working on DNA projects using Perl > somewhere? Or, is this some homework? bio-informatics is a big area in which Perl is involved... there's even a book from O'reilly on the subject... also, a mailing-list is available... from http://lists.perl.org/ - bioperl-announce-l List is for people only interested in announcements of Bioperl code releases, updates and events. bioperl-l Unmoderated list for general discussion of bioperl modules, current & future efforts and related topics. - in the latter of the two lists, i counted about 80 messages in the first hal f of this month. hmm... i might join it... :) might be fun!! -- Willy http://www.hackswell.com/corenth
Post Follow-up to this messageOn Thu, 14 Oct 2004 15:40:24 -0400, Willy West <corenth@gmail.com> wrote: > > bio-informatics is a big area in which Perl is involved... there's even > a book from O'reilly on the subject... > > also, a mailing-list is available... > > from http://lists.perl.org/ - > > bioperl-announce-l List is for people only interested in announcements > of Bioperl code releases, updates and events. > > bioperl-l Unmoderated list for general discussion of bioperl modules, > current & future efforts and related topics. > > - > > in the latter of the two lists, i counted about 80 messages in the first h alf > of this month. > > hmm... i might join it... :) might be fun!! > > -- > Willy > http://www.hackswell.com/corenth > > If what you say is true, then maybe Mike needs to take his questions to those list? I mean, if the problem he's describing is common and the data format he's using is common, I bet it's been solved already. Hey mike, have you searched on CPAN (search.cpan.org) for this?
Post Follow-up to this messageOn Thu, 14 Oct 2004 14:47:57 -0500, Errin Larsen <errinlarsen@gmail.com> wrote: > > If what you say is true, then maybe Mike needs to take his questions > to those list? I mean, if the problem he's describing is common and > the data format he's using is common, I bet it's been solved already. > > Hey mike, have you searched on CPAN (search.cpan.org) for this? > that might be fine- but his question is fundamentally Perl in nature- he may use the information for bio-informatics, but he is looking for an answer regarding effective Perl use when dealing with strings- that's classic Perl and a classic question for this list :) i wish i could answer his question right off the bat, but i can't :/ awell... -- Willy http://www.hackswell.com/corenth
Post Follow-up to this messageOn Thu, 14 Oct 2004 16:08:44 -0400, Willy West <corenth@gmail.com> wrote:
> On Thu, 14 Oct 2004 14:47:57 -0500, Errin Larsen <errinlarsen@gmail.com> w
rote:
>
>
>
> that might be fine- but his question is fundamentally Perl in nature-
> he may use the information for bio-informatics, but he is looking for
> an answer regarding effective Perl use when dealing with strings-
> that's classic Perl and a classic question for this list :)
>
> i wish i could answer his question right off the bat, but i can't :/
>
> awell...
>
very true. I've been playin' with this problem, it seems very
fun/interesting. I was thinking this would be alot easier to play
with if the characters in the string were all in an array. so I
played with the length() function and the substr() function to push
them all into an array in a loop. something like this (prolly a
prettier/easier way to do this):
use Data::Dumper;
my $human = "acgtt---cgatacg---acgact-----t";
my @human;
for my $pos( 0 .. (length($human) -1)) {
push @human, substr($human, $pos, 1);
}
print Dumper(\@human};
The above produces the following output, which I was thinking might be
easier to work with for Mike's problem:
$VAR1 = [
'a',
'c',
'g',
't',
't',
'-',
'-',
'-',
'c',
'g',
'a',
't',
'a',
'c',
'g',
'-',
'-',
'-',
'a',
'c',
'g',
'a',
'c',
't',
'-',
'-',
'-',
'-',
'-',
't'
];
This is kinda fun!
--Errin
Post Follow-up to this messageOn Thu, Oct 14, 2004 at 11:02:06AM -0600, Michael Robeson wrote:
> I have a set of data that looks something like the following:
> So, my problem is that I think I know some of the bits of code to put
> into place the problem is I am getting lost on how to structure it all
> together.
> I like to try and figure things out on my own if I can,
> so even pseudo code would be of great help!
Here's some code that works, incorporating most of your ideas. It's
only a partial solution, but it think it covers the bit that was causing
you problems. I'll present it without comment, but I'm sure people will
be able to explain anything you don't understand.
#!/usr/bin/perl -w
use strict;
use Data::Dumper;
while (<DATA> )
{
print, next unless /-/;
my %gaps;
push @{$gaps{length $1}}, pos() + 1 - length $1 while /(-+)/g;
print Dumper \%gaps;
}
__DATA__
human
acgtt---cgatacg---acgact-----t
chimp
acgtacgatac---actgca---ac
mouse
acgata---acgatcg----acgt
--
Paul Johnson - paul@pjcj.net
http://www.pjcj.net
Post Follow-up to this messageHere is as far as I can get with some real code and pseudo code. This
is just to show you that I am actually trying. :-)
Pseudo - code
################################
# open DNA sequence file
################################
print "Enter in the name of the DNA sequence file:\n";
chomp (my $dna_seq = <STDIN> );
open(DNA_SEQ, $dna_seq)
or die "Can't open sequence file for input: $!";
########################################
################
# read sequence data into a hash - not sure if this is how I should do
it?
########################################
################
my %sequences;
$/ = '>'; # set to read in paragraph mode
print "\n***Discovered the following DNA sequences:***\n";
while ( <DNA_SEQ> ) {
chomp;
next unless s/^\s*(.+)//;
my $name = $1;
s/\s//g;
$sequences{$name} = $_;
print "$name found!\n";
}
close DNA_SEQ;
###################################
# search for and characterize gaps
###################################
<<somehow get data from hash and present it to a loop>>
%gaptype;
<<major pseudo code below>>
foreach /\D(-+)\D/ found in each sequece # searches for gaps flanked by
sequence
$position = pos($1);
$gaplength = $1;
$gaplength =~ tr/-//g; # count the number of '-' for that particular
# gap being processed
$gaptype{gaplength}++; # count the number of times each gap type
appears
<<somehow get information from loop an print as seen below>>
____OUTPUT_____
Human
number of 3 base pair gaps: 2
at positions: 6, 16
number of 5 base pair gaps: 1
at positions: 25
.
.. and so on...
.
__DATA__
human
acgtt---cgatacg---acgact-----t
chimp
acgtacgatac---actgca---ac
mouse
acgata---acgatcg----acgt
Post Follow-up to this messageOn Thu, 14 Oct 2004 23:23:48 +0200, Paul Johnson <paul@pjcj.net> wrote:
> On Thu, Oct 14, 2004 at 11:02:06AM -0600, Michael Robeson wrote:
>
>
Hi Paul,
I think you missed a critical part of Mike's post!:
> For now I am just trying to get my output to look like this:
>
>Human
>number of 3 base pair gaps: 2
> at positions: 6, 16
>number of 5 base pair gaps: 1
> at positions: 25
>
>Chimp
>.... and so on ...
I've put together something that will get the first part done
(counting base pair gaps, I guess is the point!) Code is as follows:
use warnings;
use strict;
use Data::Dumper;
my %sequences = (
'human' => "acgtt---cgatacg---acgact-----t",
'chimp' => "acgtt---cgatacg---acgact-----t",
'mouse' => "acgata---acgatcg----acgt",
);
my %results;
foreach my $species( keys %sequences ) {
my $base_pair = 0;
my $base_pair_value;
foreach( split( / */, $sequences{$species} )) {
if( /-/ ) {
$base_pair = 1;
$base_pair_value .= $_;
} elsif( $base_pair ) {
$results{$species}{length($base_pair_val
ue)} += 1;
$base_pair = 0;
$base_pair_value = undef;
}
}
}
foreach my $species( keys %results ) {
print "$species = $sequences{$species}\n";
foreach my $base_pair( keys %{$results{$species}} ) {
print " Number of $base_pair base pair
gaps:\t$results{$species}{$base_pair}\n";
}
}
This will produce the following output:
# dnatest
human = acgtt---cgatacg---acgact-----t
Number of 3 base pair gaps: 2
Number of 5 base pair gaps: 1
chimp = acgtt---cgatacg---acgact-----t
Number of 3 base pair gaps: 2
Number of 5 base pair gaps: 1
mouse = acgata---acgatcg----acgt
Number of 4 base pair gaps: 1
Number of 3 base pair gaps: 1
I put the sequence in the output for easy troubleshooting and
checking. I'm still working on figuring out the positional data.
This IS fun. I'll post when I've got it figured out
--Errin
Post Follow-up to this messageOn Thu, Oct 14, 2004 at 04:50:30PM -0500, Errin Larsen wrote: > Hi Paul, > I think you missed a critical part of Mike's post!: No, I didn't miss it. I just thought it likely that Mike could get to there from where I left off, so I gave a solution to the bit that seemed most troublesome. > I'm still working on figuring out the positional data. That data is included in the code I posted. > This IS fun. I'll post when I've got it figured out I look forward to it. -- Paul Johnson - paul@pjcj.net http://www.pjcj.net
Post Follow-up to this message
Show a Printable Version
Email This Page to Someone!
Receive updates to this thread
Powered by vBulletin
Copyright 2000-2006 Jelsoft Enterprises Limited.