For Programmers: Free Programming Magazines  


Home > Archive > PERL Beginners > June 2006 > A loop to parse a large text file--output is empty!









You are viewing an archived Text-only version of the thread. To view this thread in it's original format and/or if you want to reply to this thread please [click here]

 

Author A loop to parse a large text file--output is empty!
Michael Oldham

2006-06-17, 8:05 am

Dear all,

I am a Perl newbie struggling to accomplish a conceptually simple
bioinformatics task. I have a single large file containing about
200,000 DNA probe sequences (from an Affymetrix microarray), each of
which is accompanied by a header, like so (this is in FASTA format):

>probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631;

Antisense;
TGGCTCCTGCTGAGGTCCCCTTTCC

>probe:HG_U95Av2:107_at:543:519; Interrogation_Position=258; Antisense;

CTACTCTCGTGGTGCACAAGGAGTG

>probe:HG_U95Av2:1156_at:528:483; Interrogation_Position=2054;

Antisense;
TGCAGGTGGCAGATCTGCAGTCCAT

>probe:HG_U95Av2:1102_s_at:541:589; Interrogation_Position=4316;

Antisense;
GTGAAGGTTGCTGAGGCTCTGACCC

..........etc.

What I would like to do is extract from this file a subset of ~130,800
probes (both the header and the sequence) and output this subset into a
new text file in the same (FASTA) format. These 130,800 probes
correspond to 8,175 probe set IDs ("1138_at" is an example of a probe
set ID in the header listed above). I have these 8,175 IDs listed in a
separate file called "ID.txt" and the 200,000 probe sequences in a file
called "HG_U95Av2_probe_fasta.txt". The script below is missing
something because the output file ("probe_subset.txt") is blank. This
is also the case if I replace the file "ID.txt" with a file consisting
of a single probe set ID (e.g. 1138_at). Does anyone know what I am
missing? I am running this script in Cygwin on Windows XP. I
appreciate any suggestions!

~ Mike O.

#!/usr/bin/perl -w

use strict;

my $IDs = 'ID.txt';

unless (open(IDFILE, $IDs)) {
print "Could not open file $IDs!\n";
}

my $probes = 'HG_U95Av2_probe_fasta.txt';

unless (open(PROBES, $probes)) {
print "Could not open file $probes!\n";
}

open (OUT,'>','probe_subset.txt') or die "Can't write output: $!";

my @ID = <IDFILE>;
chomp @ID;

while (<PROBES> ) {
foreach (@ID) {
if(/^>probe:\w+:(\w+):/) {
print OUT;
print OUT scalar(<PROBES> );
}
}

}
exit;
--
No virus found in this outgoing message.
Checked by AVG Free Edition.
Version: 7.1.394 / Virus Database: 268.8.4/363 - Release Date: 6/13/2006

Timothy Johnson

2006-06-17, 8:05 am



One problem is that you are using the $_ variable twice.
"while(<FILE> )" assigns $_ to the current line being read, and
"foreach(@array)" assigns $_ to the current element of the array in
question.

It's usually a good idea to be more explicit anyway, and keep the $_
usage to a minimum so you don't have to worry about this kind of thing.

Also, I'm not sure what you're trying to accomplish by this line:

print OUT scalar(<PROBES> );

As far as I can see, you're grabbing the next line, assigning it to $_
(maybe), and printing it out in scalar context. I'm assuming that you
actually wanted to print the line you read instead, so that's what I
did.

Try this and see if it is closer to what you want:

###################

#!/usr/bin/perl -w

use strict;

my $IDs =3D 'ID.txt';

unless (open(IDFILE, $IDs)) {
print "Could not open file $IDs!\n";
}

my $probes =3D 'HG_U95Av2_probe_fasta.txt';

unless (open(PROBES, $probes)) {
print "Could not open file $probes!\n";
}

open (OUT,'>','probe_subset.txt') or die "Can't write output: $!";

my @ID =3D <IDFILE>;
chomp @ID;

# vvvvvvvvvvvvvvvvvvv
while (my $line =3D <PROBES> ) {
# vvvvvvvv =20
foreach my $find(@ID) {
if($find =3D~ /^>probe:\w+:(\w+):/) {
print OUT $find."\n";
# VVVVVV
print OUT $line."\n";
}
}

}
exit;

###########################

-----Original Message-----
From: Michael Oldham [mailto:oldham@ucla.edu]=20
Sent: Tuesday, June 13, 2006 6:42 PM
To: Beginners@Perl. Org
Subject: A loop to parse a large text file--output is empty!

Dear all,

I am a Perl newbie struggling to accomplish a conceptually simple
bioinformatics task. I have a single large file containing about
200,000 DNA probe sequences (from an Affymetrix microarray), each of
which is accompanied by a header, like so (this is in FASTA format):

>probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=3D2631;

Antisense;
TGGCTCCTGCTGAGGTCCCCTTTCC

>probe:HG_U95Av2:107_at:543:519; Interrogation_Position=3D258; =

Antisense;
CTACTCTCGTGGTGCACAAGGAGTG

>probe:HG_U95Av2:1156_at:528:483; Interrogation_Position=3D2054;

Antisense;
TGCAGGTGGCAGATCTGCAGTCCAT

>probe:HG_U95Av2:1102_s_at:541:589; Interrogation_Position=3D4316;

Antisense;
GTGAAGGTTGCTGAGGCTCTGACCC

..........etc.

What I would like to do is extract from this file a subset of ~130,800
probes (both the header and the sequence) and output this subset into a
new text file in the same (FASTA) format. These 130,800 probes
correspond to 8,175 probe set IDs ("1138_at" is an example of a probe
set ID in the header listed above). I have these 8,175 IDs listed in a
separate file called "ID.txt" and the 200,000 probe sequences in a file
called "HG_U95Av2_probe_fasta.txt". The script below is missing
something because the output file ("probe_subset.txt") is blank. This
is also the case if I replace the file "ID.txt" with a file consisting
of a single probe set ID (e.g. 1138_at). Does anyone know what I am
missing? I am running this script in Cygwin on Windows XP. I
appreciate any suggestions!

~ Mike O.

#!/usr/bin/perl -w

use strict;

my $IDs =3D 'ID.txt';

unless (open(IDFILE, $IDs)) {
print "Could not open file $IDs!\n";
}

my $probes =3D 'HG_U95Av2_probe_fasta.txt';

unless (open(PROBES, $probes)) {
print "Could not open file $probes!\n";
}

open (OUT,'>','probe_subset.txt') or die "Can't write output: $!";

my @ID =3D <IDFILE>;
chomp @ID;

while (<PROBES> ) {
foreach (@ID) {
if(/^>probe:\w+:(\w+):/) {
print OUT;
print OUT scalar(<PROBES> );
}
}

}
exit;
--
No virus found in this outgoing message.
Checked by AVG Free Edition.
Version: 7.1.394 / Virus Database: 268.8.4/363 - Release Date: 6/13/2006


--=20
To unsubscribe, e-mail: beginners-unsubscribe@perl.org
For additional commands, e-mail: beginners-help@perl.org
<http://learn.perl.org/> <http://learn.perl.org/first-response>



Michael Oldham

2006-06-17, 8:05 am

Thanks Timothy. I tried the code you supplied and unfortunately the
output file is still empty. Do you think there might be a problem with
the regular expression in:

if($find =~ /^>probe:\w+:(\w+):/)

?

Mike



-----Original Message-----
From: Timothy Johnson [mailto:tjohnson@zonelabs.com]
Sent: Tuesday, June 13, 2006 6:59 PM
To: Michael Oldham; Beginners@Perl. Org
Subject: RE: A loop to parse a large text file--output is empty!




One problem is that you are using the $_ variable twice.
"while(<FILE> )" assigns $_ to the current line being read, and
"foreach(@array)" assigns $_ to the current element of the array in
question.

It's usually a good idea to be more explicit anyway, and keep the $_
usage to a minimum so you don't have to worry about this kind of thing.

Also, I'm not sure what you're trying to accomplish by this line:

print OUT scalar(<PROBES> );

As far as I can see, you're grabbing the next line, assigning it to $_
(maybe), and printing it out in scalar context. I'm assuming that you
actually wanted to print the line you read instead, so that's what I
did.

Try this and see if it is closer to what you want:

###################

#!/usr/bin/perl -w

use strict;

my $IDs = 'ID.txt';

unless (open(IDFILE, $IDs)) {
print "Could not open file $IDs!\n";
}

my $probes = 'HG_U95Av2_probe_fasta.txt';

unless (open(PROBES, $probes)) {
print "Could not open file $probes!\n";
}

open (OUT,'>','probe_subset.txt') or die "Can't write output: $!";

my @ID = <IDFILE>;
chomp @ID;

# vvvvvvvvvvvvvvvvvvv
while (my $line = <PROBES> ) {
# vvvvvvvv
foreach my $find(@ID) {
if($find =~ /^>probe:\w+:(\w+):/) {
print OUT $find."\n";
# VVVVVV
print OUT $line."\n";
}
}

}
exit;

###########################

-----Original Message-----
From: Michael Oldham [mailto:oldham@ucla.edu]
Sent: Tuesday, June 13, 2006 6:42 PM
To: Beginners@Perl. Org
Subject: A loop to parse a large text file--output is empty!

Dear all,

I am a Perl newbie struggling to accomplish a conceptually simple
bioinformatics task. I have a single large file containing about
200,000 DNA probe sequences (from an Affymetrix microarray), each of
which is accompanied by a header, like so (this is in FASTA format):

>probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631;

Antisense;
TGGCTCCTGCTGAGGTCCCCTTTCC

>probe:HG_U95Av2:107_at:543:519; Interrogation_Position=258; Antisense;

CTACTCTCGTGGTGCACAAGGAGTG

>probe:HG_U95Av2:1156_at:528:483; Interrogation_Position=2054;

Antisense;
TGCAGGTGGCAGATCTGCAGTCCAT

>probe:HG_U95Av2:1102_s_at:541:589; Interrogation_Position=4316;

Antisense;
GTGAAGGTTGCTGAGGCTCTGACCC

..........etc.

What I would like to do is extract from this file a subset of ~130,800
probes (both the header and the sequence) and output this subset into a
new text file in the same (FASTA) format. These 130,800 probes
correspond to 8,175 probe set IDs ("1138_at" is an example of a probe
set ID in the header listed above). I have these 8,175 IDs listed in a
separate file called "ID.txt" and the 200,000 probe sequences in a file
called "HG_U95Av2_probe_fasta.txt". The script below is missing
something because the output file ("probe_subset.txt") is blank. This
is also the case if I replace the file "ID.txt" with a file consisting
of a single probe set ID (e.g. 1138_at). Does anyone know what I am
missing? I am running this script in Cygwin on Windows XP. I
appreciate any suggestions!

~ Mike O.

#!/usr/bin/perl -w

use strict;

my $IDs = 'ID.txt';

unless (open(IDFILE, $IDs)) {
print "Could not open file $IDs!\n";
}

my $probes = 'HG_U95Av2_probe_fasta.txt';

unless (open(PROBES, $probes)) {
print "Could not open file $probes!\n";
}

open (OUT,'>','probe_subset.txt') or die "Can't write output: $!";

my @ID = <IDFILE>;
chomp @ID;

while (<PROBES> ) {
foreach (@ID) {
if(/^>probe:\w+:(\w+):/) {
print OUT;
print OUT scalar(<PROBES> );
}
}

}
exit;
--
No virus found in this outgoing message.
Checked by AVG Free Edition.
Version: 7.1.394 / Virus Database: 268.8.4/363 - Release Date: 6/13/2006


--
To unsubscribe, e-mail: beginners-unsubscribe@perl.org
For additional commands, e-mail: beginners-help@perl.org
<http://learn.perl.org/> <http://learn.perl.org/first-response>



--
No virus found in this incoming message.
Checked by AVG Free Edition.
Version: 7.1.394 / Virus Database: 268.8.4/363 - Release Date: 6/13/2006

--
No virus found in this outgoing message.
Checked by AVG Free Edition.
Version: 7.1.394 / Virus Database: 268.8.4/363 - Release Date: 6/13/2006

Timothy Johnson

2006-06-17, 8:05 am


Oops! =20

Try changing

if($find =3D~ /^>probe:\w+:(\w+):/)

to

if($line =3D~ /^>probe\:\w+\:$find\:/) {



I can't remember if you have to escape colons or not. If you do, then
you're probably a pearlfish.



-----Original Message-----
From: Michael Oldham [mailto:oldham@ucla.edu]=20
Sent: Tuesday, June 13, 2006 7:13 PM
To: Timothy Johnson; Beginners@Perl. Org
Subject: RE: A loop to parse a large text file--output is empty!

Thanks Timothy. I tried the code you supplied and unfortunately the
output file is still empty. Do you think there might be a problem with
the regular expression in:

if($find =3D~ /^>probe:\w+:(\w+):/)

?

Mike



-----Original Message-----
From: Timothy Johnson [mailto:tjohnson@zonelabs.com]
Sent: Tuesday, June 13, 2006 6:59 PM
To: Michael Oldham; Beginners@Perl. Org
Subject: RE: A loop to parse a large text file--output is empty!




One problem is that you are using the $_ variable twice.
"while(<FILE> )" assigns $_ to the current line being read, and
"foreach(@array)" assigns $_ to the current element of the array in
question.

It's usually a good idea to be more explicit anyway, and keep the $_
usage to a minimum so you don't have to worry about this kind of thing.

Also, I'm not sure what you're trying to accomplish by this line:

print OUT scalar(<PROBES> );

As far as I can see, you're grabbing the next line, assigning it to $_
(maybe), and printing it out in scalar context. I'm assuming that you
actually wanted to print the line you read instead, so that's what I
did.

Try this and see if it is closer to what you want:

###################

#!/usr/bin/perl -w

use strict;

my $IDs =3D 'ID.txt';

unless (open(IDFILE, $IDs)) {
print "Could not open file $IDs!\n";
}

my $probes =3D 'HG_U95Av2_probe_fasta.txt';

unless (open(PROBES, $probes)) {
print "Could not open file $probes!\n";
}

open (OUT,'>','probe_subset.txt') or die "Can't write output: $!";

my @ID =3D <IDFILE>;
chomp @ID;

# vvvvvvvvvvvvvvvvvvv
while (my $line =3D <PROBES> ) {
# vvvvvvvv
foreach my $find(@ID) {
if($line =3D~ /^>probe:\w+:$find:/) {
print OUT $find."\n";
# VVVVVV
print OUT $line."\n";
}
}

}
exit;

###########################

-----Original Message-----
From: Michael Oldham [mailto:oldham@ucla.edu]
Sent: Tuesday, June 13, 2006 6:42 PM
To: Beginners@Perl. Org
Subject: A loop to parse a large text file--output is empty!

Dear all,

I am a Perl newbie struggling to accomplish a conceptually simple
bioinformatics task. I have a single large file containing about
200,000 DNA probe sequences (from an Affymetrix microarray), each of
which is accompanied by a header, like so (this is in FASTA format):

>probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=3D2631;

Antisense;
TGGCTCCTGCTGAGGTCCCCTTTCC

>probe:HG_U95Av2:107_at:543:519; Interrogation_Position=3D258; =

Antisense;
CTACTCTCGTGGTGCACAAGGAGTG

>probe:HG_U95Av2:1156_at:528:483; Interrogation_Position=3D2054;

Antisense;
TGCAGGTGGCAGATCTGCAGTCCAT

>probe:HG_U95Av2:1102_s_at:541:589; Interrogation_Position=3D4316;

Antisense;
GTGAAGGTTGCTGAGGCTCTGACCC

..........etc.

What I would like to do is extract from this file a subset of ~130,800
probes (both the header and the sequence) and output this subset into a
new text file in the same (FASTA) format. These 130,800 probes
correspond to 8,175 probe set IDs ("1138_at" is an example of a probe
set ID in the header listed above). I have these 8,175 IDs listed in a
separate file called "ID.txt" and the 200,000 probe sequences in a file
called "HG_U95Av2_probe_fasta.txt". The script below is missing
something because the output file ("probe_subset.txt") is blank. This
is also the case if I replace the file "ID.txt" with a file consisting
of a single probe set ID (e.g. 1138_at). Does anyone know what I am
missing? I am running this script in Cygwin on Windows XP. I
appreciate any suggestions!

~ Mike O.

#!/usr/bin/perl -w

use strict;

my $IDs =3D 'ID.txt';

unless (open(IDFILE, $IDs)) {
print "Could not open file $IDs!\n";
}

my $probes =3D 'HG_U95Av2_probe_fasta.txt';

unless (open(PROBES, $probes)) {
print "Could not open file $probes!\n";
}

open (OUT,'>','probe_subset.txt') or die "Can't write output: $!";

my @ID =3D <IDFILE>;
chomp @ID;

while (<PROBES> ) {
foreach (@ID) {
if(/^>probe:\w+:(\w+):/) {
print OUT;
print OUT scalar(<PROBES> );
}
}

}
exit;
--
No virus found in this outgoing message.
Checked by AVG Free Edition.
Version: 7.1.394 / Virus Database: 268.8.4/363 - Release Date: 6/13/2006


--
To unsubscribe, e-mail: beginners-unsubscribe@perl.org
For additional commands, e-mail: beginners-help@perl.org
<http://learn.perl.org/> <http://learn.perl.org/first-response>



--
No virus found in this incoming message.
Checked by AVG Free Edition.
Version: 7.1.394 / Virus Database: 268.8.4/363 - Release Date: 6/13/2006

--
No virus found in this outgoing message.
Checked by AVG Free Edition.
Version: 7.1.394 / Virus Database: 268.8.4/363 - Release Date: 6/13/2006


Tom Phoenix

2006-06-17, 8:05 am

On 6/13/06, Michael Oldham <oldham@ucla.edu> wrote:

> while (<PROBES> ) {


> print OUT scalar(<PROBES> );


You don't want that second use of <PROBES>. Check the documentation
for readline() in the perlfunc manpage. Hope this helps!

--Tom Phoenix
Stonehenge Perl Training
Muma W.

2006-06-17, 8:05 am

Michael Oldham wrote:
> Dear all,
>
> I am a Perl newbie struggling to accomplish a conceptually simple
> bioinformatics task. I have a single large file containing about
> 200,000 DNA probe sequences (from an Affymetrix microarray), each of
> which is accompanied by a header, like so (this is in FASTA format):
>
> Antisense;
> TGGCTCCTGCTGAGGTCCCCTTTCC
> CTACTCTCGTGGTGCACAAGGAGTG
> Antisense;
> TGCAGGTGGCAGATCTGCAGTCCAT
> Antisense;
> GTGAAGGTTGCTGAGGCTCTGACCC
>
> .........etc.
>
> What I would like to do is extract from this file a subset of ~130,800
> probes (both the header and the sequence) and output this subset into a
> new text file in the same (FASTA) format. These 130,800 probes
> correspond to 8,175 probe set IDs ("1138_at" is an example of a probe
> set ID in the header listed above). I have these 8,175 IDs listed in a
> separate file called "ID.txt" and the 200,000 probe sequences in a file
> called "HG_U95Av2_probe_fasta.txt". The script below is missing
> something because the output file ("probe_subset.txt") is blank. This
> is also the case if I replace the file "ID.txt" with a file consisting
> of a single probe set ID (e.g. 1138_at). Does anyone know what I am
> missing? I am running this script in Cygwin on Windows XP. I
> appreciate any suggestions!
>
> ~ Mike O.
> [...]


You'd have to modify this to work on your data, but it's close to what
you're looking for I think. For my convenience, I considered only the
numeric portions of the IDs, and I put the 'Antisense' strings on the
same lines as the 'probe' lines. (I assume it's this way in the original
file and the current look is due to MUA wrapping). I hope this helps.


use strict;
use warnings;

# This data should come from HG_U95Av2_probe_fasta.txt.
my $probe = q{
>probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631; Antisense;

TGGCTCCTGCTGAGGTCCCCTTTCC

>probe:HG_U95Av2:107_at:543:519; Interrogation_Position=258; Antisense;

CTACTCTCGTGGTGCACAAGGAGTG

>probe:HG_U95Av2:1156_at:528:483; Interrogation_Position=2054; Antisense;

TGCAGGTGGCAGATCTGCAGTCCAT

>probe:HG_U95Av2:1102_s_at:541:589; Interrogation_Position=4316; Antisense;

GTGAAGGTTGCTGAGGCTCTGACCC
};

# @ids should come from ID.txt.
my @ids = (1102);

my $fh;
open $fh, '<', \$probe or die("Couldn't open memory file\n");

while ($_ = <$fh> ) {
next unless m/^>probe:[^:]+:(\d+)/;
my $id = $1;
if (grep $id eq $_, @ids) {
print;
$_ = <$fh>;
print;
}
}

close $fh;



Timothy Johnson

2006-06-17, 8:05 am



-----Original Message-----
From: Jenda Krynicky [mailto:Jenda@Krynicky.cz]=20
Sent: Wednesday, June 14, 2006 8:02 AM
To: Timothy Johnson; Michael Oldham; Beginners@Perl. Org
Subject: RE: A loop to parse a large text file--output is empty!

<snip>[color=darkred]
$_[color=darkred]
<snip>[color=darkred]
>
> Nope, it's not changing $_. It just reads one line from PROBES and=20
> prints it to OUT. (Erm ... reads from PROBES until it finds $/ and=20
> then prints whatever it read and $\ .)


> The scalar is necessary there since otherwise the print would provide=20
> list context to <PROBES> and the statement would read and then write=20
> the rest of the file, not just a line.


Thank you for correcting me. Good to know.

M. Kristall

2006-06-18, 3:57 am

[snip]
....[color=darkred]
....[color=darkred]
....[color=darkred]
[snip][color=darkred]
> if(/^>probe:\w+:(\w+):/) {

vvv
Perhaps you meant /^> probe:\w+:$id:/ ?

use strict;
use warnings;
my $file = "> probe:HG_U95Av2:1138_at:395:301;
Interrogation_Position=2631; Antisense;\nTGGCTCCTGCTGAGGTCCCCTTTCC\n";
my @ID = qw(1138_at);
open (FILE, '<', \$file);
while (<FILE> ) {
foreach my $id (@ID) { # Keep magically assigned $_
print $_ . <FILE> if (/^> probe:\w+:$id:/);
}
}
__END__

Here's the output I get:
> probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631; Antisense;

TGGCTCCTGCTGAGGTCCCCTTTC

HTH
Michael Oldham

2006-06-24, 8:01 am

Hello again,

Thanks to everyone for their helpful suggestions. I finally got it to
work, using the following script. However, it takes about 5 hours to
run on a fast computer. Using grep (in bash), on the other hand, takes
about 5 minutes (see below if you are interested). Thanks again!

SLOW perl script:

#!/usr/bin/perl -w

use strict;

my $IDs = 'ID_all_X';

unless (open(IDFILE, $IDs)) {
print "Could not open file $IDs!\n";
}

my $probes = 'HG_U95Av2_probe_fasta';

unless (open(PROBES, $probes)) {
print "Could not open file $probes!\n";
}

open (OUT,'>','probe_subset.txt') or die "Can't write output: $!";

my @ID = <IDFILE>;
print @ID;
chomp @ID;

while (my $line = <PROBES> ) {
foreach my $identifier (@ID) {
if($line=~/^>probe:\w+:$identifier:/) {
print OUT $line;
print OUT scalar(<PROBES> );
}
}
}
exit;


FAST bash script:

#!/usr/bin/bash
exec<"ID_all_X"
while read line; do
echo $line
grep -A 1 :$line: HG_U95Av2_probe_fasta >>myresults.txt
done


-----Original Message-----
From: Muma W. [mailto:mumia.w.18.spam+nospam@earthlink.net]
Sent: Wednesday, June 14, 2006 12:20 AM
To: Beginners List
Subject: Re: A loop to parse a large text file--output is empty!


Michael Oldham wrote:
> Dear all,
>
> I am a Perl newbie struggling to accomplish a conceptually simple
> bioinformatics task. I have a single large file containing about
> 200,000 DNA probe sequences (from an Affymetrix microarray), each of
> which is accompanied by a header, like so (this is in FASTA format):
>
> Antisense;
> TGGCTCCTGCTGAGGTCCCCTTTCC
Antisense;[color=darkred]
> CTACTCTCGTGGTGCACAAGGAGTG
> Antisense;
> TGCAGGTGGCAGATCTGCAGTCCAT
> Antisense;
> GTGAAGGTTGCTGAGGCTCTGACCC
>
> .........etc.
>
> What I would like to do is extract from this file a subset of ~130,800
> probes (both the header and the sequence) and output this subset into

a
> new text file in the same (FASTA) format. These 130,800 probes
> correspond to 8,175 probe set IDs ("1138_at" is an example of a probe
> set ID in the header listed above). I have these 8,175 IDs listed in

a
> separate file called "ID.txt" and the 200,000 probe sequences in a

file
> called "HG_U95Av2_probe_fasta.txt". The script below is missing
> something because the output file ("probe_subset.txt") is blank. This
> is also the case if I replace the file "ID.txt" with a file consisting
> of a single probe set ID (e.g. 1138_at). Does anyone know what I am
> missing? I am running this script in Cygwin on Windows XP. I
> appreciate any suggestions!
>
> ~ Mike O.
> [...]


You'd have to modify this to work on your data, but it's close to what
you're looking for I think. For my convenience, I considered only the
numeric portions of the IDs, and I put the 'Antisense' strings on the
same lines as the 'probe' lines. (I assume it's this way in the original
file and the current look is due to MUA wrapping). I hope this helps.


use strict;
use warnings;

# This data should come from HG_U95Av2_probe_fasta.txt.
my $probe = q{
>probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631;

Antisense;
TGGCTCCTGCTGAGGTCCCCTTTCC

>probe:HG_U95Av2:107_at:543:519; Interrogation_Position=258; Antisense;

CTACTCTCGTGGTGCACAAGGAGTG

>probe:HG_U95Av2:1156_at:528:483; Interrogation_Position=2054;

Antisense;
TGCAGGTGGCAGATCTGCAGTCCAT

>probe:HG_U95Av2:1102_s_at:541:589; Interrogation_Position=4316;

Antisense;
GTGAAGGTTGCTGAGGCTCTGACCC
};

# @ids should come from ID.txt.
my @ids = (1102);

my $fh;
open $fh, '<', \$probe or die("Couldn't open memory file\n");

while ($_ = <$fh> ) {
next unless m/^>probe:[^:]+:(\d+)/;
my $id = $1;
if (grep $id eq $_, @ids) {
print;
$_ = <$fh>;
print;
}
}

close $fh;




--
To unsubscribe, e-mail: beginners-unsubscribe@perl.org
For additional commands, e-mail: beginners-help@perl.org
<http://learn.perl.org/> <http://learn.perl.org/first-response>


--
No virus found in this incoming message.
Checked by AVG Free Edition.
Version: 7.1.394 / Virus Database: 268.8.4/363 - Release Date: 6/13/2006

--
No virus found in this outgoing message.
Checked by AVG Free Edition.
Version: 7.1.394 / Virus Database: 268.9.2/373 - Release Date: 6/22/2006

D. Bolliger

2006-06-24, 6:58 pm

Michael Oldham am Freitag, 23. Juni 2006 18:20:
> Hello again,


Hi Michael again,

> Thanks to everyone for their helpful suggestions. I finally got it to
> work, using the following script. However, it takes about 5 hours to
> run on a fast computer. Using grep (in bash), on the other hand, takes
> about 5 minutes (see below if you are interested). Thanks again!
>
> SLOW perl script:
>
> #!/usr/bin/perl -w
>
> use strict;
>
> my $IDs = 'ID_all_X';
>
> unless (open(IDFILE, $IDs)) {
> print "Could not open file $IDs!\n";
> }
>
> my $probes = 'HG_U95Av2_probe_fasta';
>
> unless (open(PROBES, $probes)) {
> print "Could not open file $probes!\n";
> }
>
> open (OUT,'>','probe_subset.txt') or die "Can't write output: $!";
>
> my @ID = <IDFILE>;
> print @ID;
> chomp @ID;
>
> while (my $line = <PROBES> ) {
> foreach my $identifier (@ID) {
> if($line=~/^>probe:\w+:$identifier:/) {
> print OUT $line;
> print OUT scalar(<PROBES> );
> }
> }
> }
> exit;

[...]


here's a skeleton how I would do it. It should run quite fast.
The creation of the lookup hash from file is left off as well
as the usage of a real input/output data files.

Hope this helps,
Dani


#!/usr/bin/perl
use strict;
use warnings;

# just a dummy, must be built from file.
#
my %lookup=map {$_=>1} ('1138_at','1102_s_at');

# This is the only regex we have to use.
# Maybe split is faster, didn't test it.
#
my $re=qr/^.*?:.*?:(.*?):/;

# extract the target string for the selection test,
# test, and print if test ok
#
while (<DATA> ) {
print $_ if ~/$re/ and exists $lookup{$1};
}

__DATA__
>probe:HG_U95Av2:1138_at:395:301;

Interrogation_Position=2631;Antisense;TG
GCTCCTGCTGAGGTCCCCTTTCC
>probe:HG_U95Av2:107_at:543:519; Interrogation_Position=258;

Antisense;CTACTCTCGTGGTGCACAAGGAGTG[colo
r=darkred]
>probe:HG_U95Av2:1156_at:528:483;[/color]
Interrogation_Position=2054;Antisense;TG
CAGGTGGCAGATCTGCAGTCCAT
>probe:HG_U95Av2:1102_s_at:541:589;

Interrogation_Position=4316;Antisense;GT
GAAGGTTGCTGAGGCTCTGACCC

Sponsored Links







Also available: Server administration forum archive | Web Design forum archive | Software forum archive | Hardware reviews archive

Copyright 2008 codecomments.com