Home > Archive > PERL Beginners > June 2006 > A loop to parse a large text file--output is empty!
You are viewing an archived Text-only version of the thread.
To view this thread in it's original format and/or if you want to reply to
this thread please [click here]
| Author |
A loop to parse a large text file--output is empty!
|
|
| Michael Oldham 2006-06-17, 8:05 am |
| Dear all,
I am a Perl newbie struggling to accomplish a conceptually simple
bioinformatics task. I have a single large file containing about
200,000 DNA probe sequences (from an Affymetrix microarray), each of
which is accompanied by a header, like so (this is in FASTA format):
>probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631;
Antisense;
TGGCTCCTGCTGAGGTCCCCTTTCC
>probe:HG_U95Av2:107_at:543:519; Interrogation_Position=258; Antisense;
CTACTCTCGTGGTGCACAAGGAGTG
>probe:HG_U95Av2:1156_at:528:483; Interrogation_Position=2054;
Antisense;
TGCAGGTGGCAGATCTGCAGTCCAT
>probe:HG_U95Av2:1102_s_at:541:589; Interrogation_Position=4316;
Antisense;
GTGAAGGTTGCTGAGGCTCTGACCC
..........etc.
What I would like to do is extract from this file a subset of ~130,800
probes (both the header and the sequence) and output this subset into a
new text file in the same (FASTA) format. These 130,800 probes
correspond to 8,175 probe set IDs ("1138_at" is an example of a probe
set ID in the header listed above). I have these 8,175 IDs listed in a
separate file called "ID.txt" and the 200,000 probe sequences in a file
called "HG_U95Av2_probe_fasta.txt". The script below is missing
something because the output file ("probe_subset.txt") is blank. This
is also the case if I replace the file "ID.txt" with a file consisting
of a single probe set ID (e.g. 1138_at). Does anyone know what I am
missing? I am running this script in Cygwin on Windows XP. I
appreciate any suggestions!
~ Mike O.
#!/usr/bin/perl -w
use strict;
my $IDs = 'ID.txt';
unless (open(IDFILE, $IDs)) {
print "Could not open file $IDs!\n";
}
my $probes = 'HG_U95Av2_probe_fasta.txt';
unless (open(PROBES, $probes)) {
print "Could not open file $probes!\n";
}
open (OUT,'>','probe_subset.txt') or die "Can't write output: $!";
my @ID = <IDFILE>;
chomp @ID;
while (<PROBES> ) {
foreach (@ID) {
if(/^>probe:\w+:(\w+):/) {
print OUT;
print OUT scalar(<PROBES> );
}
}
}
exit;
--
No virus found in this outgoing message.
Checked by AVG Free Edition.
Version: 7.1.394 / Virus Database: 268.8.4/363 - Release Date: 6/13/2006
| |
| Timothy Johnson 2006-06-17, 8:05 am |
|
One problem is that you are using the $_ variable twice.
"while(<FILE> )" assigns $_ to the current line being read, and
"foreach(@array)" assigns $_ to the current element of the array in
question.
It's usually a good idea to be more explicit anyway, and keep the $_
usage to a minimum so you don't have to worry about this kind of thing.
Also, I'm not sure what you're trying to accomplish by this line:
print OUT scalar(<PROBES> );
As far as I can see, you're grabbing the next line, assigning it to $_
(maybe), and printing it out in scalar context. I'm assuming that you
actually wanted to print the line you read instead, so that's what I
did.
Try this and see if it is closer to what you want:
###################
#!/usr/bin/perl -w
use strict;
my $IDs =3D 'ID.txt';
unless (open(IDFILE, $IDs)) {
print "Could not open file $IDs!\n";
}
my $probes =3D 'HG_U95Av2_probe_fasta.txt';
unless (open(PROBES, $probes)) {
print "Could not open file $probes!\n";
}
open (OUT,'>','probe_subset.txt') or die "Can't write output: $!";
my @ID =3D <IDFILE>;
chomp @ID;
# vvvvvvvvvvvvvvvvvvv
while (my $line =3D <PROBES> ) {
# vvvvvvvv =20
foreach my $find(@ID) {
if($find =3D~ /^>probe:\w+:(\w+):/) {
print OUT $find."\n";
# VVVVVV
print OUT $line."\n";
}
}
}
exit;
###########################
-----Original Message-----
From: Michael Oldham [mailto:oldham@ucla.edu]=20
Sent: Tuesday, June 13, 2006 6:42 PM
To: Beginners@Perl. Org
Subject: A loop to parse a large text file--output is empty!
Dear all,
I am a Perl newbie struggling to accomplish a conceptually simple
bioinformatics task. I have a single large file containing about
200,000 DNA probe sequences (from an Affymetrix microarray), each of
which is accompanied by a header, like so (this is in FASTA format):
>probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=3D2631;
Antisense;
TGGCTCCTGCTGAGGTCCCCTTTCC
>probe:HG_U95Av2:107_at:543:519; Interrogation_Position=3D258; =
Antisense;
CTACTCTCGTGGTGCACAAGGAGTG
>probe:HG_U95Av2:1156_at:528:483; Interrogation_Position=3D2054;
Antisense;
TGCAGGTGGCAGATCTGCAGTCCAT
>probe:HG_U95Av2:1102_s_at:541:589; Interrogation_Position=3D4316;
Antisense;
GTGAAGGTTGCTGAGGCTCTGACCC
..........etc.
What I would like to do is extract from this file a subset of ~130,800
probes (both the header and the sequence) and output this subset into a
new text file in the same (FASTA) format. These 130,800 probes
correspond to 8,175 probe set IDs ("1138_at" is an example of a probe
set ID in the header listed above). I have these 8,175 IDs listed in a
separate file called "ID.txt" and the 200,000 probe sequences in a file
called "HG_U95Av2_probe_fasta.txt". The script below is missing
something because the output file ("probe_subset.txt") is blank. This
is also the case if I replace the file "ID.txt" with a file consisting
of a single probe set ID (e.g. 1138_at). Does anyone know what I am
missing? I am running this script in Cygwin on Windows XP. I
appreciate any suggestions!
~ Mike O.
#!/usr/bin/perl -w
use strict;
my $IDs =3D 'ID.txt';
unless (open(IDFILE, $IDs)) {
print "Could not open file $IDs!\n";
}
my $probes =3D 'HG_U95Av2_probe_fasta.txt';
unless (open(PROBES, $probes)) {
print "Could not open file $probes!\n";
}
open (OUT,'>','probe_subset.txt') or die "Can't write output: $!";
my @ID =3D <IDFILE>;
chomp @ID;
while (<PROBES> ) {
foreach (@ID) {
if(/^>probe:\w+:(\w+):/) {
print OUT;
print OUT scalar(<PROBES> );
}
}
}
exit;
--
No virus found in this outgoing message.
Checked by AVG Free Edition.
Version: 7.1.394 / Virus Database: 268.8.4/363 - Release Date: 6/13/2006
--=20
To unsubscribe, e-mail: beginners-unsubscribe@perl.org
For additional commands, e-mail: beginners-help@perl.org
<http://learn.perl.org/> <http://learn.perl.org/first-response>
| |
| Michael Oldham 2006-06-17, 8:05 am |
| Thanks Timothy. I tried the code you supplied and unfortunately the
output file is still empty. Do you think there might be a problem with
the regular expression in:
if($find =~ /^>probe:\w+:(\w+):/)
?
Mike
-----Original Message-----
From: Timothy Johnson [mailto:tjohnson@zonelabs.com]
Sent: Tuesday, June 13, 2006 6:59 PM
To: Michael Oldham; Beginners@Perl. Org
Subject: RE: A loop to parse a large text file--output is empty!
One problem is that you are using the $_ variable twice.
"while(<FILE> )" assigns $_ to the current line being read, and
"foreach(@array)" assigns $_ to the current element of the array in
question.
It's usually a good idea to be more explicit anyway, and keep the $_
usage to a minimum so you don't have to worry about this kind of thing.
Also, I'm not sure what you're trying to accomplish by this line:
print OUT scalar(<PROBES> );
As far as I can see, you're grabbing the next line, assigning it to $_
(maybe), and printing it out in scalar context. I'm assuming that you
actually wanted to print the line you read instead, so that's what I
did.
Try this and see if it is closer to what you want:
###################
#!/usr/bin/perl -w
use strict;
my $IDs = 'ID.txt';
unless (open(IDFILE, $IDs)) {
print "Could not open file $IDs!\n";
}
my $probes = 'HG_U95Av2_probe_fasta.txt';
unless (open(PROBES, $probes)) {
print "Could not open file $probes!\n";
}
open (OUT,'>','probe_subset.txt') or die "Can't write output: $!";
my @ID = <IDFILE>;
chomp @ID;
# vvvvvvvvvvvvvvvvvvv
while (my $line = <PROBES> ) {
# vvvvvvvv
foreach my $find(@ID) {
if($find =~ /^>probe:\w+:(\w+):/) {
print OUT $find."\n";
# VVVVVV
print OUT $line."\n";
}
}
}
exit;
###########################
-----Original Message-----
From: Michael Oldham [mailto:oldham@ucla.edu]
Sent: Tuesday, June 13, 2006 6:42 PM
To: Beginners@Perl. Org
Subject: A loop to parse a large text file--output is empty!
Dear all,
I am a Perl newbie struggling to accomplish a conceptually simple
bioinformatics task. I have a single large file containing about
200,000 DNA probe sequences (from an Affymetrix microarray), each of
which is accompanied by a header, like so (this is in FASTA format):
>probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631;
Antisense;
TGGCTCCTGCTGAGGTCCCCTTTCC
>probe:HG_U95Av2:107_at:543:519; Interrogation_Position=258; Antisense;
CTACTCTCGTGGTGCACAAGGAGTG
>probe:HG_U95Av2:1156_at:528:483; Interrogation_Position=2054;
Antisense;
TGCAGGTGGCAGATCTGCAGTCCAT
>probe:HG_U95Av2:1102_s_at:541:589; Interrogation_Position=4316;
Antisense;
GTGAAGGTTGCTGAGGCTCTGACCC
..........etc.
What I would like to do is extract from this file a subset of ~130,800
probes (both the header and the sequence) and output this subset into a
new text file in the same (FASTA) format. These 130,800 probes
correspond to 8,175 probe set IDs ("1138_at" is an example of a probe
set ID in the header listed above). I have these 8,175 IDs listed in a
separate file called "ID.txt" and the 200,000 probe sequences in a file
called "HG_U95Av2_probe_fasta.txt". The script below is missing
something because the output file ("probe_subset.txt") is blank. This
is also the case if I replace the file "ID.txt" with a file consisting
of a single probe set ID (e.g. 1138_at). Does anyone know what I am
missing? I am running this script in Cygwin on Windows XP. I
appreciate any suggestions!
~ Mike O.
#!/usr/bin/perl -w
use strict;
my $IDs = 'ID.txt';
unless (open(IDFILE, $IDs)) {
print "Could not open file $IDs!\n";
}
my $probes = 'HG_U95Av2_probe_fasta.txt';
unless (open(PROBES, $probes)) {
print "Could not open file $probes!\n";
}
open (OUT,'>','probe_subset.txt') or die "Can't write output: $!";
my @ID = <IDFILE>;
chomp @ID;
while (<PROBES> ) {
foreach (@ID) {
if(/^>probe:\w+:(\w+):/) {
print OUT;
print OUT scalar(<PROBES> );
}
}
}
exit;
--
No virus found in this outgoing message.
Checked by AVG Free Edition.
Version: 7.1.394 / Virus Database: 268.8.4/363 - Release Date: 6/13/2006
--
To unsubscribe, e-mail: beginners-unsubscribe@perl.org
For additional commands, e-mail: beginners-help@perl.org
<http://learn.perl.org/> <http://learn.perl.org/first-response>
--
No virus found in this incoming message.
Checked by AVG Free Edition.
Version: 7.1.394 / Virus Database: 268.8.4/363 - Release Date: 6/13/2006
--
No virus found in this outgoing message.
Checked by AVG Free Edition.
Version: 7.1.394 / Virus Database: 268.8.4/363 - Release Date: 6/13/2006
| |
| Timothy Johnson 2006-06-17, 8:05 am |
|
Oops! =20
Try changing
if($find =3D~ /^>probe:\w+:(\w+):/)
to
if($line =3D~ /^>probe\:\w+\:$find\:/) {
I can't remember if you have to escape colons or not. If you do, then
you're probably a pearlfish.
-----Original Message-----
From: Michael Oldham [mailto:oldham@ucla.edu]=20
Sent: Tuesday, June 13, 2006 7:13 PM
To: Timothy Johnson; Beginners@Perl. Org
Subject: RE: A loop to parse a large text file--output is empty!
Thanks Timothy. I tried the code you supplied and unfortunately the
output file is still empty. Do you think there might be a problem with
the regular expression in:
if($find =3D~ /^>probe:\w+:(\w+):/)
?
Mike
-----Original Message-----
From: Timothy Johnson [mailto:tjohnson@zonelabs.com]
Sent: Tuesday, June 13, 2006 6:59 PM
To: Michael Oldham; Beginners@Perl. Org
Subject: RE: A loop to parse a large text file--output is empty!
One problem is that you are using the $_ variable twice.
"while(<FILE> )" assigns $_ to the current line being read, and
"foreach(@array)" assigns $_ to the current element of the array in
question.
It's usually a good idea to be more explicit anyway, and keep the $_
usage to a minimum so you don't have to worry about this kind of thing.
Also, I'm not sure what you're trying to accomplish by this line:
print OUT scalar(<PROBES> );
As far as I can see, you're grabbing the next line, assigning it to $_
(maybe), and printing it out in scalar context. I'm assuming that you
actually wanted to print the line you read instead, so that's what I
did.
Try this and see if it is closer to what you want:
###################
#!/usr/bin/perl -w
use strict;
my $IDs =3D 'ID.txt';
unless (open(IDFILE, $IDs)) {
print "Could not open file $IDs!\n";
}
my $probes =3D 'HG_U95Av2_probe_fasta.txt';
unless (open(PROBES, $probes)) {
print "Could not open file $probes!\n";
}
open (OUT,'>','probe_subset.txt') or die "Can't write output: $!";
my @ID =3D <IDFILE>;
chomp @ID;
# vvvvvvvvvvvvvvvvvvv
while (my $line =3D <PROBES> ) {
# vvvvvvvv
foreach my $find(@ID) {
if($line =3D~ /^>probe:\w+:$find:/) {
print OUT $find."\n";
# VVVVVV
print OUT $line."\n";
}
}
}
exit;
###########################
-----Original Message-----
From: Michael Oldham [mailto:oldham@ucla.edu]
Sent: Tuesday, June 13, 2006 6:42 PM
To: Beginners@Perl. Org
Subject: A loop to parse a large text file--output is empty!
Dear all,
I am a Perl newbie struggling to accomplish a conceptually simple
bioinformatics task. I have a single large file containing about
200,000 DNA probe sequences (from an Affymetrix microarray), each of
which is accompanied by a header, like so (this is in FASTA format):
>probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=3D2631;
Antisense;
TGGCTCCTGCTGAGGTCCCCTTTCC
>probe:HG_U95Av2:107_at:543:519; Interrogation_Position=3D258; =
Antisense;
CTACTCTCGTGGTGCACAAGGAGTG
>probe:HG_U95Av2:1156_at:528:483; Interrogation_Position=3D2054;
Antisense;
TGCAGGTGGCAGATCTGCAGTCCAT
>probe:HG_U95Av2:1102_s_at:541:589; Interrogation_Position=3D4316;
Antisense;
GTGAAGGTTGCTGAGGCTCTGACCC
..........etc.
What I would like to do is extract from this file a subset of ~130,800
probes (both the header and the sequence) and output this subset into a
new text file in the same (FASTA) format. These 130,800 probes
correspond to 8,175 probe set IDs ("1138_at" is an example of a probe
set ID in the header listed above). I have these 8,175 IDs listed in a
separate file called "ID.txt" and the 200,000 probe sequences in a file
called "HG_U95Av2_probe_fasta.txt". The script below is missing
something because the output file ("probe_subset.txt") is blank. This
is also the case if I replace the file "ID.txt" with a file consisting
of a single probe set ID (e.g. 1138_at). Does anyone know what I am
missing? I am running this script in Cygwin on Windows XP. I
appreciate any suggestions!
~ Mike O.
#!/usr/bin/perl -w
use strict;
my $IDs =3D 'ID.txt';
unless (open(IDFILE, $IDs)) {
print "Could not open file $IDs!\n";
}
my $probes =3D 'HG_U95Av2_probe_fasta.txt';
unless (open(PROBES, $probes)) {
print "Could not open file $probes!\n";
}
open (OUT,'>','probe_subset.txt') or die "Can't write output: $!";
my @ID =3D <IDFILE>;
chomp @ID;
while (<PROBES> ) {
foreach (@ID) {
if(/^>probe:\w+:(\w+):/) {
print OUT;
print OUT scalar(<PROBES> );
}
}
}
exit;
--
No virus found in this outgoing message.
Checked by AVG Free Edition.
Version: 7.1.394 / Virus Database: 268.8.4/363 - Release Date: 6/13/2006
--
To unsubscribe, e-mail: beginners-unsubscribe@perl.org
For additional commands, e-mail: beginners-help@perl.org
<http://learn.perl.org/> <http://learn.perl.org/first-response>
--
No virus found in this incoming message.
Checked by AVG Free Edition.
Version: 7.1.394 / Virus Database: 268.8.4/363 - Release Date: 6/13/2006
--
No virus found in this outgoing message.
Checked by AVG Free Edition.
Version: 7.1.394 / Virus Database: 268.8.4/363 - Release Date: 6/13/2006
| |
| Tom Phoenix 2006-06-17, 8:05 am |
| On 6/13/06, Michael Oldham <oldham@ucla.edu> wrote:
> while (<PROBES> ) {
> print OUT scalar(<PROBES> );
You don't want that second use of <PROBES>. Check the documentation
for readline() in the perlfunc manpage. Hope this helps!
--Tom Phoenix
Stonehenge Perl Training
| |
| Muma W. 2006-06-17, 8:05 am |
| Michael Oldham wrote:
> Dear all,
>
> I am a Perl newbie struggling to accomplish a conceptually simple
> bioinformatics task. I have a single large file containing about
> 200,000 DNA probe sequences (from an Affymetrix microarray), each of
> which is accompanied by a header, like so (this is in FASTA format):
>
> Antisense;
> TGGCTCCTGCTGAGGTCCCCTTTCC
> CTACTCTCGTGGTGCACAAGGAGTG
> Antisense;
> TGCAGGTGGCAGATCTGCAGTCCAT
> Antisense;
> GTGAAGGTTGCTGAGGCTCTGACCC
>
> .........etc.
>
> What I would like to do is extract from this file a subset of ~130,800
> probes (both the header and the sequence) and output this subset into a
> new text file in the same (FASTA) format. These 130,800 probes
> correspond to 8,175 probe set IDs ("1138_at" is an example of a probe
> set ID in the header listed above). I have these 8,175 IDs listed in a
> separate file called "ID.txt" and the 200,000 probe sequences in a file
> called "HG_U95Av2_probe_fasta.txt". The script below is missing
> something because the output file ("probe_subset.txt") is blank. This
> is also the case if I replace the file "ID.txt" with a file consisting
> of a single probe set ID (e.g. 1138_at). Does anyone know what I am
> missing? I am running this script in Cygwin on Windows XP. I
> appreciate any suggestions!
>
> ~ Mike O.
> [...]
You'd have to modify this to work on your data, but it's close to what
you're looking for I think. For my convenience, I considered only the
numeric portions of the IDs, and I put the 'Antisense' strings on the
same lines as the 'probe' lines. (I assume it's this way in the original
file and the current look is due to MUA wrapping). I hope this helps.
use strict;
use warnings;
# This data should come from HG_U95Av2_probe_fasta.txt.
my $probe = q{
>probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631; Antisense;
TGGCTCCTGCTGAGGTCCCCTTTCC
>probe:HG_U95Av2:107_at:543:519; Interrogation_Position=258; Antisense;
CTACTCTCGTGGTGCACAAGGAGTG
>probe:HG_U95Av2:1156_at:528:483; Interrogation_Position=2054; Antisense;
TGCAGGTGGCAGATCTGCAGTCCAT
>probe:HG_U95Av2:1102_s_at:541:589; Interrogation_Position=4316; Antisense;
GTGAAGGTTGCTGAGGCTCTGACCC
};
# @ids should come from ID.txt.
my @ids = (1102);
my $fh;
open $fh, '<', \$probe or die("Couldn't open memory file\n");
while ($_ = <$fh> ) {
next unless m/^>probe:[^:]+:(\d+)/;
my $id = $1;
if (grep $id eq $_, @ids) {
print;
$_ = <$fh>;
print;
}
}
close $fh;
| |
| Timothy Johnson 2006-06-17, 8:05 am |
|
-----Original Message-----
From: Jenda Krynicky [mailto:Jenda@Krynicky.cz]=20
Sent: Wednesday, June 14, 2006 8:02 AM
To: Timothy Johnson; Michael Oldham; Beginners@Perl. Org
Subject: RE: A loop to parse a large text file--output is empty!
<snip>[color=darkred]
$_[color=darkred]
<snip>[color=darkred]
>
> Nope, it's not changing $_. It just reads one line from PROBES and=20
> prints it to OUT. (Erm ... reads from PROBES until it finds $/ and=20
> then prints whatever it read and $\ .)
> The scalar is necessary there since otherwise the print would provide=20
> list context to <PROBES> and the statement would read and then write=20
> the rest of the file, not just a line.
Thank you for correcting me. Good to know.
| |
| M. Kristall 2006-06-18, 3:57 am |
| [snip]
....[color=darkred]
....[color=darkred]
....[color=darkred]
[snip][color=darkred]
> if(/^>probe:\w+:(\w+):/) {
vvv
Perhaps you meant /^> probe:\w+:$id:/ ?
use strict;
use warnings;
my $file = "> probe:HG_U95Av2:1138_at:395:301;
Interrogation_Position=2631; Antisense;\nTGGCTCCTGCTGAGGTCCCCTTTCC\n";
my @ID = qw(1138_at);
open (FILE, '<', \$file);
while (<FILE> ) {
foreach my $id (@ID) { # Keep magically assigned $_
print $_ . <FILE> if (/^> probe:\w+:$id:/);
}
}
__END__
Here's the output I get:
> probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631; Antisense;
TGGCTCCTGCTGAGGTCCCCTTTC
HTH
| |
| Michael Oldham 2006-06-24, 8:01 am |
| Hello again,
Thanks to everyone for their helpful suggestions. I finally got it to
work, using the following script. However, it takes about 5 hours to
run on a fast computer. Using grep (in bash), on the other hand, takes
about 5 minutes (see below if you are interested). Thanks again!
SLOW perl script:
#!/usr/bin/perl -w
use strict;
my $IDs = 'ID_all_X';
unless (open(IDFILE, $IDs)) {
print "Could not open file $IDs!\n";
}
my $probes = 'HG_U95Av2_probe_fasta';
unless (open(PROBES, $probes)) {
print "Could not open file $probes!\n";
}
open (OUT,'>','probe_subset.txt') or die "Can't write output: $!";
my @ID = <IDFILE>;
print @ID;
chomp @ID;
while (my $line = <PROBES> ) {
foreach my $identifier (@ID) {
if($line=~/^>probe:\w+:$identifier:/) {
print OUT $line;
print OUT scalar(<PROBES> );
}
}
}
exit;
FAST bash script:
#!/usr/bin/bash
exec<"ID_all_X"
while read line; do
echo $line
grep -A 1 :$line: HG_U95Av2_probe_fasta >>myresults.txt
done
-----Original Message-----
From: Muma W. [mailto:mumia.w.18.spam+nospam@earthlink.net]
Sent: Wednesday, June 14, 2006 12:20 AM
To: Beginners List
Subject: Re: A loop to parse a large text file--output is empty!
Michael Oldham wrote:
> Dear all,
>
> I am a Perl newbie struggling to accomplish a conceptually simple
> bioinformatics task. I have a single large file containing about
> 200,000 DNA probe sequences (from an Affymetrix microarray), each of
> which is accompanied by a header, like so (this is in FASTA format):
>
> Antisense;
> TGGCTCCTGCTGAGGTCCCCTTTCC
Antisense;[color=darkred]
> CTACTCTCGTGGTGCACAAGGAGTG
> Antisense;
> TGCAGGTGGCAGATCTGCAGTCCAT
> Antisense;
> GTGAAGGTTGCTGAGGCTCTGACCC
>
> .........etc.
>
> What I would like to do is extract from this file a subset of ~130,800
> probes (both the header and the sequence) and output this subset into
a
> new text file in the same (FASTA) format. These 130,800 probes
> correspond to 8,175 probe set IDs ("1138_at" is an example of a probe
> set ID in the header listed above). I have these 8,175 IDs listed in
a
> separate file called "ID.txt" and the 200,000 probe sequences in a
file
> called "HG_U95Av2_probe_fasta.txt". The script below is missing
> something because the output file ("probe_subset.txt") is blank. This
> is also the case if I replace the file "ID.txt" with a file consisting
> of a single probe set ID (e.g. 1138_at). Does anyone know what I am
> missing? I am running this script in Cygwin on Windows XP. I
> appreciate any suggestions!
>
> ~ Mike O.
> [...]
You'd have to modify this to work on your data, but it's close to what
you're looking for I think. For my convenience, I considered only the
numeric portions of the IDs, and I put the 'Antisense' strings on the
same lines as the 'probe' lines. (I assume it's this way in the original
file and the current look is due to MUA wrapping). I hope this helps.
use strict;
use warnings;
# This data should come from HG_U95Av2_probe_fasta.txt.
my $probe = q{
>probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631;
Antisense;
TGGCTCCTGCTGAGGTCCCCTTTCC
>probe:HG_U95Av2:107_at:543:519; Interrogation_Position=258; Antisense;
CTACTCTCGTGGTGCACAAGGAGTG
>probe:HG_U95Av2:1156_at:528:483; Interrogation_Position=2054;
Antisense;
TGCAGGTGGCAGATCTGCAGTCCAT
>probe:HG_U95Av2:1102_s_at:541:589; Interrogation_Position=4316;
Antisense;
GTGAAGGTTGCTGAGGCTCTGACCC
};
# @ids should come from ID.txt.
my @ids = (1102);
my $fh;
open $fh, '<', \$probe or die("Couldn't open memory file\n");
while ($_ = <$fh> ) {
next unless m/^>probe:[^:]+:(\d+)/;
my $id = $1;
if (grep $id eq $_, @ids) {
print;
$_ = <$fh>;
print;
}
}
close $fh;
--
To unsubscribe, e-mail: beginners-unsubscribe@perl.org
For additional commands, e-mail: beginners-help@perl.org
<http://learn.perl.org/> <http://learn.perl.org/first-response>
--
No virus found in this incoming message.
Checked by AVG Free Edition.
Version: 7.1.394 / Virus Database: 268.8.4/363 - Release Date: 6/13/2006
--
No virus found in this outgoing message.
Checked by AVG Free Edition.
Version: 7.1.394 / Virus Database: 268.9.2/373 - Release Date: 6/22/2006
| |
| D. Bolliger 2006-06-24, 6:58 pm |
| Michael Oldham am Freitag, 23. Juni 2006 18:20:
> Hello again,
Hi Michael again,
> Thanks to everyone for their helpful suggestions. I finally got it to
> work, using the following script. However, it takes about 5 hours to
> run on a fast computer. Using grep (in bash), on the other hand, takes
> about 5 minutes (see below if you are interested). Thanks again!
>
> SLOW perl script:
>
> #!/usr/bin/perl -w
>
> use strict;
>
> my $IDs = 'ID_all_X';
>
> unless (open(IDFILE, $IDs)) {
> print "Could not open file $IDs!\n";
> }
>
> my $probes = 'HG_U95Av2_probe_fasta';
>
> unless (open(PROBES, $probes)) {
> print "Could not open file $probes!\n";
> }
>
> open (OUT,'>','probe_subset.txt') or die "Can't write output: $!";
>
> my @ID = <IDFILE>;
> print @ID;
> chomp @ID;
>
> while (my $line = <PROBES> ) {
> foreach my $identifier (@ID) {
> if($line=~/^>probe:\w+:$identifier:/) {
> print OUT $line;
> print OUT scalar(<PROBES> );
> }
> }
> }
> exit;
[...]
here's a skeleton how I would do it. It should run quite fast.
The creation of the lookup hash from file is left off as well
as the usage of a real input/output data files.
Hope this helps,
Dani
#!/usr/bin/perl
use strict;
use warnings;
# just a dummy, must be built from file.
#
my %lookup=map {$_=>1} ('1138_at','1102_s_at');
# This is the only regex we have to use.
# Maybe split is faster, didn't test it.
#
my $re=qr/^.*?:.*?:(.*?):/;
# extract the target string for the selection test,
# test, and print if test ok
#
while (<DATA> ) {
print $_ if ~/$re/ and exists $lookup{$1};
}
__DATA__
>probe:HG_U95Av2:1138_at:395:301;
Interrogation_Position=2631;Antisense;TG
GCTCCTGCTGAGGTCCCCTTTCC
>probe:HG_U95Av2:107_at:543:519; Interrogation_Position=258;
Antisense;CTACTCTCGTGGTGCACAAGGAGTG[colo
r=darkred]
>probe:HG_U95Av2:1156_at:528:483;[/color]
Interrogation_Position=2054;Antisense;TG
CAGGTGGCAGATCTGCAGTCCAT
>probe:HG_U95Av2:1102_s_at:541:589;
Interrogation_Position=4316;Antisense;GT
GAAGGTTGCTGAGGCTCTGACCC
|
|
|
|
|