Code Comments

Programming Forum and web based access to our favorite programming groups.
For Programmers: Free Programming Magazines | New: Database administration forum
Registration is free! Edit your profileCalendarFind other membersFrequently Asked QuestionsSearch -> 
Post New Thread











Thread
Author

counting gaps in sequence data
I have a set of data that looks something like the following:

>human
acgtt---cgatacg---acgact-----t
>chimp
acgtacgatac---actgca---ac
>mouse
acgata---acgatcg----acgt

I am having trouble setting up a hash etc., to count the number and
types of continuous gaps. For example the 'human' sequence above has 2
sets of 3 gaps and 1 set of 5 gaps. The 'chimp' has 2 sets of 3 gaps
and finally the 'mouse' has 1 set of 3 gaps and 1 set of 4 gaps.

So, I am having trouble being able to assign a dynamic variable (i.e.
gap length) and place that in a pattern match so that it can count how
many gaps of that length are in that particular sequence. I know how to
set up a hash to count the number of times a gap appears:
'$gaptype{$gap}++' or something. The problem is: what is the best way
(and how) can I set '$gap' to be dynamic.

I need to know the length of each consecutive string of gaps. I know
how to count the gaps by using the 'tr' function. But it gets confusing
when I need to add counts to every instance of that gap length. I also
need to know the position of each gap (denoted by the position of the
first gap in that particular instance). I know that I can use the
'pos()' command for this.

So, my problem is that I think I know some of the bits of code to put
into place the problem is I am getting lost on how to structure it all
together. For now I am just trying to get my output to look like this:

Human
number of 3 base pair gaps:		2
at positions:		6, 16
number of 5 base pair gaps:		1
at positions:		25

Chimp
.... and so on ...

So, any suggestions would be greatly appreciated. If anyone can help me
out with all or even just bits of this I would greatly appreciate it.
This should help me get started on some more advanced parsing I need to
do after this. I like to try and figure things out on my own if I can,
so even pseudo code would be of great help!

-Thanks
-Mike



Report this thread to moderator Post Follow-up to this message
Old Post
Michael Robeson
10-14-04 08:56 PM


Re: counting gaps in sequence data
On Thu, 14 Oct 2004 11:02:06 -0600, Michael Robeson <popgen23@mac.com> wrote:
> I have a set of data that looks something like the following:
>
<<SNIP>>
>
> So, any suggestions would be greatly appreciated. If anyone can help me
> out with all or even just bits of this I would greatly appreciate it.
> This should help me get started on some more advanced parsing I need to
> do after this. I like to try and figure things out on my own if I can,
> so even pseudo code would be of great help!
>
> -Thanks
> -Mike
>

Hi Mike,

This list works best if you show us some code you have tried and
then we can help you troubleshoot it.  It lets us know that you've
already tried to solve your problem and aren't looking for a free
scripting service!  Show us some of that code you talked about and
we'll help you out

--Errin

PS: is this a common problem/exercise in some class somewhere?  I keep
seeing requests for help having to do with those exact strings of DNA
data.  Is there a bunch of people working on DNA projects using Perl
somewhere?  Or, is this some homework?

Report this thread to moderator Post Follow-up to this message
Old Post
Errin Larsen
10-14-04 08:56 PM


Re: counting gaps in sequence data
> PS: is this a common problem/exercise in some class somewhere?  I keep
> seeing requests for help having to do with those exact strings of DNA
> data.  Is there a bunch of people working on DNA projects using Perl
> somewhere?  Or, is this some homework?

bio-informatics is a big area in which Perl is involved...   there's even
a book from O'reilly on the subject...

also, a mailing-list is available...


from http://lists.perl.org/   -

bioperl-announce-l List is for people only interested in announcements
of Bioperl code releases, updates and events.

bioperl-l Unmoderated list for general discussion of bioperl modules,
current & future efforts and related topics.

-

in the latter of the two lists, i counted about 80 messages in the first hal
f
of this month.

hmm... i might join it... :)  might be fun!!

--
Willy
http://www.hackswell.com/corenth

Report this thread to moderator Post Follow-up to this message
Old Post
Willy West
10-14-04 08:56 PM


Re: counting gaps in sequence data
On Thu, 14 Oct 2004 15:40:24 -0400, Willy West <corenth@gmail.com> wrote: 
>
> bio-informatics is a big area in which Perl is involved...   there's even
> a book from O'reilly on the subject...
>
> also, a mailing-list is available...
>
> from http://lists.perl.org/   -
>
> bioperl-announce-l List is for people only interested in announcements
> of Bioperl code releases, updates and events.
>
> bioperl-l Unmoderated list for general discussion of bioperl modules,
> current & future efforts and related topics.
>
> -
>
> in the latter of the two lists, i counted about 80 messages in the first h
alf
> of this month.
>
> hmm... i might join it... :)  might be fun!!
>
> --
> Willy
> http://www.hackswell.com/corenth
>
>

If what you say is true, then maybe Mike needs to take his questions
to those list?  I mean, if the problem he's describing is common and
the data format he's using is common, I bet it's been solved already.

Hey mike, have you searched on CPAN (search.cpan.org) for this?

Report this thread to moderator Post Follow-up to this message
Old Post
Errin Larsen
10-15-04 01:55 AM


Re: counting gaps in sequence data
On Thu, 14 Oct 2004 14:47:57 -0500, Errin Larsen <errinlarsen@gmail.com> wrote:
>
 


> If what you say is true, then maybe Mike needs to take his questions
> to those list?  I mean, if the problem he's describing is common and
> the data format he's using is common, I bet it's been solved already.
>
> Hey mike, have you searched on CPAN (search.cpan.org) for this?
>

that might be fine- but his question is fundamentally Perl in nature-
he may use the information for bio-informatics, but he is looking for
an answer regarding effective Perl use when dealing with strings-
that's classic Perl and a classic question for this list :)

i wish i could answer his question right off the bat, but i can't :/

awell...



--
Willy
http://www.hackswell.com/corenth

Report this thread to moderator Post Follow-up to this message
Old Post
Willy West
10-15-04 01:55 AM


Re: counting gaps in sequence data
On Thu, 14 Oct 2004 16:08:44 -0400, Willy West <corenth@gmail.com> wrote:
> On Thu, 14 Oct 2004 14:47:57 -0500, Errin Larsen <errinlarsen@gmail.com> w
rote: 
> 
> 
>
> that might be fine- but his question is fundamentally Perl in nature-
> he may use the information for bio-informatics, but he is looking for
> an answer regarding effective Perl use when dealing with strings-
> that's classic Perl and a classic question for this list :)
>
> i wish i could answer his question right off the bat, but i can't :/
>
> awell...
>

very true.  I've been playin' with this problem, it seems very
fun/interesting.  I was thinking this would be alot easier to play
with if the characters in the string were all in an array.  so I
played with the length() function and the substr() function to push
them all into an array in a loop.  something like this (prolly a
prettier/easier way to do this):

use Data::Dumper;

my $human = "acgtt---cgatacg---acgact-----t";
my @human;

for my $pos( 0 .. (length($human) -1)) {
push @human, substr($human, $pos, 1);
}

print Dumper(\@human};

The above produces the following output, which I was thinking might be
easier to work with for Mike's problem:

$VAR1 = [
'a',
'c',
'g',
't',
't',
'-',
'-',
'-',
'c',
'g',
'a',
't',
'a',
'c',
'g',
'-',
'-',
'-',
'a',
'c',
'g',
'a',
'c',
't',
'-',
'-',
'-',
'-',
'-',
't'
];

This is kinda fun!

--Errin

Report this thread to moderator Post Follow-up to this message
Old Post
Errin Larsen
10-15-04 01:55 AM


Re: counting gaps in sequence data
On Thu, Oct 14, 2004 at 11:02:06AM -0600, Michael Robeson wrote:

> I have a set of data that looks something like the following:

> So, my problem is that I think I know some of the bits of code to put
> into place the problem is I am getting lost on how to structure it all
> together.

>                I like to try and figure things out on my own if I can,
> so even pseudo code would be of great help!

Here's some code that works, incorporating most of your ideas.  It's
only a partial solution, but it think it covers the bit that was causing
you problems.  I'll present it without comment, but I'm sure people will
be able to explain anything you don't understand.

#!/usr/bin/perl -w

use strict;
use Data::Dumper;

while (<DATA> )
{
print, next unless /-/;
my %gaps;
push @{$gaps{length $1}}, pos() + 1 - length $1 while /(-+)/g;
print Dumper \%gaps;
}

__DATA__
human
acgtt---cgatacg---acgact-----t
chimp
acgtacgatac---actgca---ac
mouse
acgata---acgatcg----acgt

--
Paul Johnson - paul@pjcj.net
http://www.pjcj.net

Report this thread to moderator Post Follow-up to this message
Old Post
Paul Johnson
10-15-04 01:55 AM


Re: counting gaps in sequence data
Here is as far as I can get with some real code and pseudo code. This
is just to show you that I am actually trying. :-)

Pseudo - code

################################
# open DNA sequence file
################################

print "Enter in the name of the DNA sequence file:\n";
chomp (my $dna_seq = <STDIN> );
open(DNA_SEQ, $dna_seq)
or die "Can't open sequence file for input: $!";

 ########################################
################
# read sequence data into a hash - not sure if this is how I should do
it?
 ########################################
################

my %sequences;

$/ = '>'; # set to read in paragraph mode

print "\n***Discovered the following DNA sequences:***\n";

while ( <DNA_SEQ> ) {
chomp;
next unless s/^\s*(.+)//;
my $name = $1;
s/\s//g;
$sequences{$name} = $_;
print "$name found!\n";
}
close DNA_SEQ;


###################################
# search for and characterize gaps
###################################

<<somehow get data from hash and present it to a loop>>

%gaptype;

<<major pseudo code below>>
foreach /\D(-+)\D/ found in each sequece # searches for gaps flanked by
sequence
$position = pos($1);
$gaplength = $1;
$gaplength =~ tr/-//g;	# count the number of '-' for that particular
# gap being processed
$gaptype{gaplength}++;	# count the number of times each gap type
appears

<<somehow get information from loop an print as seen below>>

____OUTPUT_____

Human
number of 3 base pair gaps:		2
at positions:		6, 16
number of 5 base pair gaps:		1
at positions:		25
.
.. and so on...
.


__DATA__
human
acgtt---cgatacg---acgact-----t
chimp
acgtacgatac---actgca---ac
mouse
acgata---acgatcg----acgt


Report this thread to moderator Post Follow-up to this message
Old Post
Michael Robeson
10-15-04 01:55 AM


Re: counting gaps in sequence data
On Thu, 14 Oct 2004 23:23:48 +0200, Paul Johnson <paul@pjcj.net> wrote:
> On Thu, Oct 14, 2004 at 11:02:06AM -0600, Michael Robeson wrote:
> 
> 

Hi Paul,
I think you missed a critical part of Mike's post!:

> For now I am just trying to get my output to look like this:
>
>Human
>number of 3 base pair gaps:             2
>                      at positions:           6, 16
>number of 5 base pair gaps:             1
>                       at positions:           25
>
>Chimp
>.... and so on ...

I've put together something that will get the first part done
(counting base pair gaps, I guess is the point!)  Code is as follows:

use warnings;
use strict;

use Data::Dumper;

my %sequences = (
'human' => "acgtt---cgatacg---acgact-----t",
'chimp' => "acgtt---cgatacg---acgact-----t",
'mouse' => "acgata---acgatcg----acgt",
);
my %results;

foreach my $species( keys %sequences ) {
my $base_pair = 0;
my $base_pair_value;
foreach( split( / */, $sequences{$species} )) {
if( /-/ ) {
$base_pair = 1;
$base_pair_value .= $_;
} elsif( $base_pair ) {
 $results{$species}{length($base_pair_val
ue)} += 1;
$base_pair = 0;
$base_pair_value = undef;
}
}
}

foreach my $species( keys %results ) {
print "$species = $sequences{$species}\n";
foreach my $base_pair( keys %{$results{$species}} ) {
print "   Number of $base_pair base pair
gaps:\t$results{$species}{$base_pair}\n";
}
}

This will produce the following output:

# dnatest
human = acgtt---cgatacg---acgact-----t
Number of 3 base pair gaps:  2
Number of 5 base pair gaps:  1
chimp = acgtt---cgatacg---acgact-----t
Number of 3 base pair gaps:  2
Number of 5 base pair gaps:  1
mouse = acgata---acgatcg----acgt
Number of 4 base pair gaps:  1
Number of 3 base pair gaps:  1

I put the sequence in the output for easy troubleshooting and
checking.  I'm still working on figuring out the positional data.

This IS fun.  I'll post when I've got it figured out
--Errin

Report this thread to moderator Post Follow-up to this message
Old Post
Errin Larsen
10-15-04 01:55 AM


Re: counting gaps in sequence data
On Thu, Oct 14, 2004 at 04:50:30PM -0500, Errin Larsen wrote:

>   Hi Paul,
>   I think you missed a critical part of Mike's post!:

No, I didn't miss it.  I just thought it likely that Mike could get to
there from where I left off, so I gave a solution to the bit that seemed
most troublesome.

>            I'm still working on figuring out the positional data.

That data is included in the code I posted.

> This IS fun.  I'll post when I've got it figured out

I look forward to it.

--
Paul Johnson - paul@pjcj.net
http://www.pjcj.net

Report this thread to moderator Post Follow-up to this message
Old Post
Paul Johnson
10-15-04 01:55 AM


Sponsored Links




Last Thread Next Thread Next
Pages (2): [1] 2 »
Search this forum -> 
Post New Thread

PERL Beginners archive

Show a Printable Version Send to friend Email This Page to Someone! subscribe to this thread Receive updates to this thread
Computer Consultants
Programming Jobs
Visual Basic Controls
SQL Server Programming
Webservices
Java Security
Visual Studio
C# Programming
Visual J++
Software engineering
Open source Software
Perl Programming
PHP Programming
ASP Programming
ASP .NET Programming
Visual Basic Programming
Windows Scripting Host
Java Programming
Java Help
Java Beans
VBScript
Cobol
MAC Applications
Unix Programming
Forum Jump:
All times are GMT. The time now is 05:55 PM.

 
Free MCSE Braindumps | Real Estate Topics

Programming forum archive

Copyrights CodeComments.com 2004 - 2006

Powered by vBulletin Copyright 2000-2006 Jelsoft Enterprises Limited.