Home > Archive > PERL Beginners > January 2006 > new for reading file containing multiple records










Author new for reading file containing multiple records
Chen Li

2006-01-10, 4:02 am

Hi all,

I have a big file (2.7G) containing multiple records
in this format:
>gi|618748|dbj|D21618.1| MUS74F01 mouse embryonal

carcinoma cell line F9 Mus mus culus cDNA clone 74F01,
mRNA sequence
GCTGCCTCGACGATCTTCGCTTGCNTCCTCGCTCGCTGTC
CCGTTGTCCTAGCCCGCCGCCGCCCGCTGAGCTTGTCTTT

ACCCTGCTTGCAGACATGGCTGACATCAAGAACAACCCCG
AATATTTCTTTCGTNANCCGGTGTNATGGCGCTCGTCCGC

AATGTTTTAGCGGCATGGGCCGCTATTGACAGCAAGAG
>gi|618749|dbj|D21619.1| MUS74F09 mouse embryonal
carcinoma cell line F9 Mus mus culus cDNA clone 74F09,
mRNA sequence
GGCGNNNTGGCCTCGGGCGGCTGGACGTGCCCAGCGCCCG
ATTAACAAGATACATTTAATTGCTGTGTTTAACCAAATGT

TTGAAGGCTGTGGGACTTTTTGAAATCATATGATCTCCTA
AAAGCTGTTCACATTGTTCATTAA

Each record starts with ">". I want to read one record
at a time. I hear that a special variable called $/
might do the job, but I am not sure how to use it. I
wonder if anyone could help me out.

Thanks,

Li






________________________________________
__
Yahoo! DSL – Something to write home about.
Just $16.99/mo. or less.
dsl.yahoo.com

usenet@DavidFilmer.com

2006-01-10, 4:02 am

Chen Li wrote:
> I have a big file (2.7G) containing multiple records
> in this format:

<snip>
> Each record starts with ">". I want to read each
> record once at a time.I hear about a special variable
> call $/ might do the job...


It's not clear exactly how you want to define each "record." I presume
you want to compress each "record" into a single line of text
(eliminating linefeeds). But your sample data wraps oddly when viewed
in Google Groups, and it's not clear which linebreaks are caused by
wrapping and which represent actual newlines within your input file.

$/ may indeed be of help to you. Since I can't really discern the
exact nature of your input data, consider this script which generalizes
the situation:

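#!/usr/bin/perl

$\ = "\n";  # output record separator
$/ = '>';   # input record separator
while (<DATA>) {
    s/\n+/ /g;      # join the lines of a record with single spaces
    s/\s*\>?$//;    # strip trailing whitespace and the ">" separator
    print if $_;    # skip the empty first "record"
}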

__DATA__
>Now is the time for

all good men to come
to the aid of their party.

>The quick fox jumped over

the lazy brown dog.

>Today is the first

day of the rest of
your life.

### OUTPUT ######################
Now is the time for all good men to come to the aid of their party.
The quick fox jumped over the lazy brown dog.
Today is the first day of the rest of your life.
#################################

In this situation, the script assumed that line breaks within the input
record should be replaced with a single whitespace (to prevent it from
running the last word of one line together with the first word of the
next, such as: "Now is the time forall..."). This assumption may not
apply to your input data, as it seems that maybe there are linefeeds
between blocks of enzyme designations, and presumably you wouldn't want
to insert a whitespace there. So you may need to provide a "smarter"
(i.e., more programmatically developed) way to handle linefeeds within
your record (replace linefeeds with whitespace under some conditions
and replace linefeeds with nothing in other conditions).
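One possible shape for that "smarter" handling is sketched below. It
assumes, which the thread cannot confirm, that the header of each record
sits on a single line in the real file and that every following line is
sequence data. The helper name tidy_record is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): split one raw record into
# a one-line header and a sequence string with no whitespace in it.
# Assumption: the first line of a record is the header, the rest is sequence.
sub tidy_record {
    my ($record) = @_;
    $record =~ s/^>//;        # drop a leading ">" if present
    $record =~ s/\s*>?\z//;   # drop trailing whitespace and separator
    my ($header, @rest) = split /\n/, $record;
    return unless defined $header && length $header;
    my $sequence = join '', @rest;   # sequence lines are joined with nothing
    $sequence =~ s/\s+//g;           # remove any stray whitespace
    return ($header, $sequence);
}

my ($header, $sequence) = tidy_record(">id1 sample record\nACGT\nTTGA\n>");
print "$header\n$sequence\n";
```

Here the header keeps its spaces while the sequence lines are joined
without any, so the sample prints "id1 sample record" followed by
"ACGTTTGA".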

Note also that the script now treats ">" as the SEPARATOR BETWEEN
records. Thus, if the very first character in the input data is a ">"
(as it is here, and probably is in your data), Perl will assume this is
a SEPARATOR BETWEEN records. Since there's nothing before that first ">"
in the input data, the first "record" Perl returns is just the ">"
itself, which the regex reduces to an empty string. That's why I
said "print if $_;" (so it would not print anything for the null "first
record"). If I had simply said "print" then you would observe a blank
line as the first line in the output, because Perl insists that
something comes BEFORE AND AFTER the separator.

Note also that the separator (">") is not "stripped" out of the data
stream. You still need to strip it off. You could use a chop(), but I
prefer a regex for safety.

Needless to say, if your data happens to contain a ">" character within
it, you're screwed.
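As an alternative to chop() or a regex, chomp() removes whatever $/
currently holds from the end of a string, and does nothing if the string
does not end that way. A minimal sketch, using an in-memory sample in
place of the real file (the sample data and variable names are made up
for illustration):

```perl
use strict;
use warnings;

# In-memory sample standing in for the real input file.
my $data = ">first\nAAA\n>second\nCCC\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

my @records;
{
    local $/ = '>';                   # records end at the next ">"
    while ( my $record = <$fh> ) {
        chomp $record;                # removes one trailing ">" (the current $/), if any
        next unless length $record;   # skip the empty chunk before the first ">"
        push @records, $record;
    }
}
close $fh;

print "[$_]" for @records;
print "\n";
```

Unlike chop(), chomp() leaves the last record alone, since it does not
end with ">".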

--
http://DavidFilmer.com

Xavier Noria

2006-01-10, 4:02 am

On Jan 6, 2006, at 4:12, chen li wrote:

> Hi all,
>
> I have a big file (2.7G) containing multiple records
> in this format:
> carcinoma cell line F9 Mus mus culus cDNA clone 74F01,
> mRNA sequence
> GCTGCCTCGACGATCTTCGCTTGCNTCCTCGCTCGCTGTC
CCGTTGTCCTAGCCCGCCGCCGCCCGCTGA
> GCTTGTCTTT
> ACCCTGCTTGCAGACATGGCTGACATCAAGAACAACCCCG
AATATTTCTTTCGTNANCCGGTGTNATGGC
> GCTCGTCCGC
> AATGTTTTAGCGGCATGGGCCGCTATTGACAGCAAGAG
> carcinoma cell line F9 Mus mus culus cDNA clone 74F09,
> mRNA sequence
> GGCGNNNTGGCCTCGGGCGGCTGGACGTGCCCAGCGCCCG
ATTAACAAGATACATTTAATTGCTGTGTTT
> AACCAAATGT
> TTGAAGGCTGTGGGACTTTTTGAAATCATATGATCTCCTA
AAAGCTGTTCACATTGTTCATTAA
>
> Each record starts with ">". I want to read each
> record once at a time.I hear about a special variable
> call $/ might do the job but not sure how to use it. I
> wonder if anyone could help me out.


Word wrapping possibly mangled the example records. Could you please
upload a handful of them in a file somewhere?

-- fxn


Chris

2006-01-10, 4:02 am

Chen Li wrote:
> Hi all,
>
> I have a big file (2.7G) containing multiple records
> in this format:
>
>
> carcinoma cell line F9 Mus mus culus cDNA clone 74F01,
> mRNA sequence
> GCTGCCTCGACGATCTTCGCTTGCNTCCTCGCTCGCTGTC
CCGTTGTCCTAGCCCGCCGCCGCCCGCTGAGCTTGTCTTT

> ACCCTGCTTGCAGACATGGCTGACATCAAGAACAACCCCG
AATATTTCTTTCGTNANCCGGTGTNATGGCGCTCGTCCGC

> AATGTTTTAGCGGCATGGGCCGCTATTGACAGCAAGAG
>
>
> carcinoma cell line F9 Mus mus culus cDNA clone 74F09,
> mRNA sequence
> GGCGNNNTGGCCTCGGGCGGCTGGACGTGCCCAGCGCCCG
ATTAACAAGATACATTTAATTGCTGTGTTTAACCAAATGT

> TTGAAGGCTGTGGGACTTTTTGAAATCATATGATCTCCTA
AAAGCTGTTCACATTGTTCATTAA
>
> Each record starts with ">". I want to read each
> record once at a time.I hear about a special variable
> call $/ might do the job but not sure how to use it. I
> wonder if anyone could help me out.
>
> Thanks,
>
> Li
>

This file is in FASTA format, which is very common for storing
biological data. Have a look at BioPerl; it should have a
module for reading and writing this type of file.

Chen Li

2006-01-10, 4:02 am

Hi Xicheng,

Thanks. I searched the list before I posted the question
but I couldn't find similar topics. Could you please point
me to some earlier posts? I also tried your code on a very
small file containing only these two records. Here is what
I got:

This is record 1.
This is sequence:
This is record 2.
This is sequence:
This is record 3.
This is sequence:

I can't print each record out. How do I fix it?

And here is my code:

#!/usr/bin/perl-w
use strict;

my $filename='sequence.fasta';
open(FILENAME,$filename) or die " This $filename
cannot be open!!!\n\n";

{local $/='>';

my $count_record=0;

while (<FILENAME> ){
++$count_record;

print "This is record $count_record.\n";

print "This is sequence:$sequence\n"; }

}
exit;
~


--- Xicheng <xicheng@gmail.com> wrote:

> This is a FAQ problem, so I would not post it in the
> group, pls see the
> following code.
> ===================
> #!/usr/bin/perl -w
> use strict;
> open FH, "<data.txt" or die "cant open data.txt: $!";
> {
> local $/='>';
> while (<FH> ) {
> # do sth on $_; # $_ does not include the leading '>' though.
> }
> }
> Dont know if this can go smoothly with your 2.7G
> file though. Good
> luck,
> XC
> =====================
>

Chen Li

2006-01-10, 4:02 am


> Word wrapping possibly mangled the example records,
> could you please
> upload a handful of them in a file somewhere?
>
> -- fxn



Hi,

I am just a newbie. What is word wrapping? Is it a
perl module or something else?

Thanks,

Li




John Doe

2006-01-10, 4:02 am

chen li am Freitag, 6. Januar 2006 11.27:
> Hi Xicheng,


Hi Chen

> Thanks. I search the list before I post the question
> but I can't find similar topics. Could you please tell
> me some ealier posts? Also I try to use your code to
> read a very small file containing only these two
> records. Here is what I got:
>
> This is record 1.
> This is sequence:
> This is record 2.
> This is sequence:
> This is record 3.
> This is sequence:
>
> I can't print each record out. How do I fix it?
>
> And here is my code:
>
> #!/usr/bin/perl-w


A space before -w is missing. Alternatively, just add the following line
above 'use strict;':

use warnings;

> use strict;
>
> my $filename='sequence.fasta';
> open(FILENAME,$filename) or die " This $filename
> cannot be open!!!\n\n";
>
> {local $/='>';
>
> my $count_record=0;
>
> while (<FILENAME> ){
> ++$count_record;
>
> print "This is record $count_record.\n";
>
> print "This is sequence:$sequence\n";


$sequence is not defined; when you run the script under 'use warnings;'
(or '#!/usr/bin/perl -w'), a warning should be displayed.

To actually print the sequence out:

print "This is sequence: $_\n";

see (on the cmdline)

perldoc perlvar

for documentation of all the funny $... vars


> }
>
> }
> exit;


exit is not needed.
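Putting the corrections above together, the script could look roughly
like the sketch below. Wrapping the loop in a sub named print_records
and using a lexical filehandle are additions for illustration, not part
of the original post.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# File name taken from the post; change it to suit.
my $filename = 'sequence.fasta';

# Print each chunk read with $/ set to ">" and return the number of chunks.
# (print_records is a hypothetical name introduced for this sketch.)
sub print_records {
    my ($fh) = @_;
    local $/ = '>';
    my $count_record = 0;
    while (<$fh>) {
        ++$count_record;
        print "This is record $count_record.\n";
        print "This is sequence: $_\n";
    }
    return $count_record;
}

if ( open my $fh, '<', $filename ) {
    print_records($fh);
    close $fh;
}
else {
    warn "This $filename cannot be opened: $!\n";
}
```

For a file with two records this still reports three, because the first
chunk read is only the leading ">".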




Shawn Corey

2006-01-10, 4:02 am

chen li wrote:
> Each record starts with ">". I want to read each
> record once at a time.I hear about a special variable
> call $/ might do the job but not sure how to use it. I
> wonder if anyone could help me out.


See `perldoc perlvar` and search for INPUT_RECORD_SEPARATOR.

Here is a simple script which might do:

#!/usr/bin/perl

use strict;
use warnings;

use Data::Dumper;

$/ = ">";
while( <DATA> ){
    print Dumper \$_;
}

__END__
>gi|618748|dbj|D21618.1| MUS74F01 mouse embryonal carcinoma cell line

F9 Mus mus culus cDNA clone 74F01, mRNA sequence
GCTGCCTCGACGATCTTCGCTTGCNTCCTCGCTCGCTGTC
CCGTTGTCCTAGCCCGCCGCCGCCCGCTGAGCTTGTCTTT
ACCCTGCTTGCAGACATGGCTGACATCAAGAACAACCCCG
AATATTTCTTTCGTNANCC
GGTGTNATGGCGCTCGTCCGCAATGTTTTAGCGGCATGGG
CCGCTATTGACAGCAAGAG
>gi|618749|dbj|D21619.1| MUS74F09 mouse embryonal carcinoma cell line

F9 Mus mus culus cDNA clone 74F09, mRNA sequence
GGCGNNNTGGCCTCGGGCGGCTGGACGTGCCCAGCGCCCG
ATTAACAAGATACATTTAATTGCTGTGTTTAACCAAATGT
TTGAAGGCTGTGGGACTTTTTGAAATCATATGATCTCCTA
AAAGCTGTTCACATTGTTC
ATTAA


--

Just my 0.00000002 million dollars worth,
--- Shawn

"Probability is now one. Any problems that are left are your own."
SS Heart of Gold, _The Hitchhiker's Guide to the Galaxy_

* Perl tutorials at http://perlmonks.org/?node=Tutorials
* A searchable perldoc is available at http://perldoc.perl.org/

Tom Phoenix

2006-01-10, 4:02 am

On 1/5/06, chen li <chen_li3@yahoo.com> wrote:
> Each record starts with ">". I want to read each
> record once at a time.I hear about a special variable
> call $/ might do the job but not sure how to use it.


The perlvar manpage documents $/ and all of Perl's other special
variables. In particular, this variable contains a string ("\n" by
default) which Perl expects to come at the end of every "line" of
input.

Although it's tempting to set $/ to "\n>" for the file format you
describe, that's probably not correct for the first or last record in
your file. I recommend that you write code to identify each record
(with regular expressions, perhaps?) instead of using $/.
Alternatively, you could pre-process the data file in some way so that
using $/ would be a good solution.

Good luck with it!

--Tom Phoenix
Stonehenge Perl Training

usenet@DavidFilmer.com

2006-01-10, 4:02 am

Chen Li wrote:
> I am just a newbie. What is word wrapping? Is it a
> perl module or something else?


Word wrapping has nothing to do with Perl or even computers. It's what
happens when you get to the end of a line of text - you begin a new
line of text. Computers often do this for you (so you don't need to
"manually" insert returns, like we do when we use a typewriter or paper
and pencil). That's "wrapping."

The problem with your question is that you provide sample data with
some VERY long lines and some line breaks. We can't tell which line
breaks are really in your input data, and which line breaks may have
been inserted by your (or our) newsreader when you pasted the data into
your question.

Since we can't tell which linebreaks are "real" and which are inserted
by wrapping, we can't really give you a specific answer to your
question, so we must generalize.

When you post questions, you can help us help you by making sure that
any code snippets or sample data are fairly short (65 characters or
less). No newsreader will "wrap" such short lines (and it gives us a
few extra characters if we quote you), so there's never any ambiguity.

Chen Li

2006-01-10, 4:02 am

Hi Shawn,

I used your code to do the job:

#!/usr/bin/perl

use strict;
use warnings;
use Data::Dumper;
my $filename='sequence.fasta';
open (DATA,$filename) or die;

{local $/ = '>';
while( <DATA> ){
print Dumper \$_;
}
}
exit;


And I get the following output:

$VAR1 = '>';
$VAR1 = 'gi|618748|dbj|D21618.1| MUS74F01 mouse
embryonal carcinoma cell line F9 Mus musculus cDNA
clone 74F01, mRNA sequence
GCTGCCTCGACGATCTTCGCTTGCNTCCTCGCTCGCTGTC
CCGTTGTCCTAGCCCGCCGCCGCCCGCTGAGCTTGTCTTT

ACCCTGCTTGCAGACATGGCTGACATCAAGAACAACCCCG
AATATTTCTTTCGTNANCCGGTGTNATGGCGCTCGTCCGC

AATGTTTTAGCGGCATGGGCCGCTATTGACAGCAAGAG
>';
$VAR1 = 'gi|618749|dbj|D21619.1| MUS74F09 mouse
embryonal carcinoma cell line F9 Mus musculus cDNA
clone 74F09, mRNA sequence
GGCGNNNTGGCCTCGGGCGGCTGGACGTGCCCAGCGCCCG
ATTAACAAGATACATTTAATTGCTGTGTTTAACCAAATGT

TTGAAGGCTGTGGGACTTTTTGAAATCATATGATCTCCTA
AAAGCTGTTCACATTGTTCATTAA
';

Is it possible to remove ($VAR1 = '>';), ($VAR1 =
'), (>';), and (';) from the output directly, or are
several lines of regular expressions needed to
do the job?

Thanks,

Li






Wijaya Edward

2006-01-10, 4:02 am


Hi,

I would suggest you use BioPerl's Bio::SeqIO module for reading
FASTA (or other) formats, for example:

sub get_sequence_from_fasta
{
    #designed for getting sequences into array from a file (fasta format),
    #input: file name

    use Bio::SeqIO;

    my $file = shift;
    my @seqs = ();

    open INFILE, "<$file" or die "$0: Can't open file $file: $!";
    my $in = Bio::SeqIO->new(-format  => 'fasta',
                             -noclose => 1,
                             -fh      => \*INFILE);

    while ( my $seq = $in->next_seq() ) {
        push @seqs, $seq->seq();
    } #end while

    return \@seqs;
}

Hope that helps.

Regards,
Edward WIJAYA

----- Original Message -----
From: chen li <chen_li3@yahoo.com>
Date: Saturday, January 7, 2006 9:27 am
Subject: Re: new for reading file containing multiple records

> Hi Shawn,
>
> I use the your code to do the job:
>
> #!/usr/bin/perl
>
> use strict;
> use warnings;
> use Data::Dumper;
> my $filename='sequence.fasta';
> open (DATA,$filename) or die;
>
> {local $/ = '>';
> while( <DATA> ){
> print Dumper \$_;
> }
> }
> exit;
>
>




Chen Li

2006-01-10, 4:02 am

Hi Tom,

Thanks for the reply.

> Although it's tempting to set $/ to "\n>" for the
> file format you
> describe, that's probably not correct for the first
> or last record in
> your file.


You are 50% right. This method is not correct for the
first record (which actually contains ">" only) but it
is correct for the last record (and others in between).

>I recommend that you write code to
> identify each record
> (with regular expressions, perhaps?) instead of
> using $/.
> Alternatively, you could pre-process the data file
> in some way so that
> using $/ would be a good solution.


I want to edit the file first and try to delete the
first ">" in this big file. I browsed Programming Perl
and the Perl Cookbook, but there is no such example of
just deleting the first character in a file. They do have
examples of deleting the last line from a file, though.
It seems odd to me.


Li




Shawn Corey

2006-01-10, 4:02 am

chen li wrote:

> You are 50% right. This method is not correct for the
> first record(which actually contains ">' only) but it
> is correct for the last record(and others in between).


> I want to edit the file first and try to delete the
> first ">" in this big file. I browse Programming Perl
> and Perl Cookbook there is not such example: just
> delete the first charater in a file. But they have
> examples to delete the last line from a file. It seems
> odd to me.


First, I'd recommend against changing a large file. Unless your program
is the only user of the file, you would have to change all the other
programs. And in this case, you would be unable to distinguish one
record from another.

There are three ways to distinguish records in a file: by record
separators, by beginning-of-record tokens, and by end-of-record tokens.
Your file may use one, two or all three methods. When writing code, your
preference should be (in order) record separator, end-of-record token,
and finally beginning-of-record token.

In Perl, the variable $/ is used to distinguish the end-of-record token;
even though it is called the INPUT_RECORD_SEPARATOR. Its name is
misleading. If it was a true record separator, your code would never
have to process the record separator; it would be discarded at a lower
level.

The records in your file are distinguished only by a beginning-of-record
token, specifically a greater-than sign at the beginning of a record.
You can process the file in two ways: treat the beginning-of-record
token as an end-of-record token, or read ahead in the file and process
the record only after reading the beginning of the next record. Both
have their advantages and disadvantages.

If you want to treat the beginning-of-record token as an end-of-record
one, your records are going to have some anomalies. The first record is
going to have a beginning-of-record token attached to it. Your last
record is not going to have an end-of-record token. For your case, it
would look something like this:

my $beginning_token = '>';
my $end_token = "\n$beginning_token";
$/ = $end_token;
my $first = 1;
while( <FH> ){
    if( $first ){
        s/^\Q$beginning_token//;
        $first = 0;
    }
    s/\Q$end_token\E$//;
    process_record( $_ );
}

If you want to use only the beginning-of-record token, you will have to
do at least a partial read ahead. This means you have to store the
read-ahead, and the last record will be processed outside the read loop.
For your case:

my $beginning_token = '>';
my $record = '';
while( <FH> ){
    if( /^\Q$beginning_token/ ){
        if( $record =~ /^\Q$beginning_token/ ){
            process_record( $record );
        }
        $record = '';
    }
    $record .= $_;
}
if( $record =~ /^\Q$beginning_token/ ){
    process_record( $record );
}
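The snippets above leave process_record() undefined. A self-contained
sketch of the read-ahead approach follows, with a hypothetical
process_record() that treats the first line of each record as the header
and the remaining lines as sequence. The in-memory sample, the @summary
array and the printed format are all made up for illustration.

```perl
use strict;
use warnings;

my $beginning_token = '>';
my @summary;   # filled by process_record(); exists only for this illustration

# Hypothetical process_record(): first line is the header, the rest is sequence.
sub process_record {
    my ($record) = @_;
    my ($header, @lines) = split /\n/, $record;
    $header =~ s/^\Q$beginning_token\E//;   # drop the leading ">"
    my $sequence = join '', @lines;         # join sequence lines with nothing
    push @summary, [ $header, length $sequence ];
}

# In-memory sample standing in for the real file.
my $sample = ">rec1 first\nACGT\nAC\n>rec2 second\nGGG\n";
open my $fh, '<', \$sample or die "Cannot open in-memory sample: $!";

my $record = '';
while ( my $line = <$fh> ) {
    if ( $line =~ /^\Q$beginning_token\E/ ) {
        # A new record starts here, so the stored one is complete.
        process_record($record) if $record =~ /^\Q$beginning_token\E/;
        $record = '';
    }
    $record .= $line;
}
# The last record is processed outside the read loop.
process_record($record) if $record =~ /^\Q$beginning_token\E/;
close $fh;

printf "%s: %d bases\n", @$_ for @summary;
```

Only one record is held in memory at a time, which matters for a file
of this size.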



Chen Li

2006-01-10, 4:02 am

Hi Shawn,

Thanks for the detailed explanations.

But Edward (see one of the posts in this thread) told me to
try Bio::SeqIO from www.bioperl.org. After trying it, I
think it is what I really need.

Once again thank you so much for the help.

Li





Copyright 2009 codecomments.com