split, no repeat- Regular expression
| I have a file whose content looks like this:
ATATTTGATTGGCCAGCCCTGCGTTTGCGGTTTTTTTTTG
TTTTTTTATTTCCTGTATTTTTTTTGGGGGGGAAAAATTG
CAGTTCCACGGA
4f-rnp Gene 204:267
ACCTTATCGACTAGTATAAAAGGCACTGTCAGCTCTCCAG
CCCGAACAAAATCGATCAAAATGCGCCCGCAATCAGCTGC
GTGTCTATTACT
44D JMB 166:101
ATGGGAGCGGTATGCTTAAATAGGGGCACCTTTTAATCCC
TCTGGCCATTGGCAATCGATCCATTTAGTGGGAGCCATGT
TCAAGTTGCTGG
44L JMB 166:101
AACTTATGTAATCATATAGATTCTATAATAAACAAAGAAA
CAAAACTAGTTGTAAAACAAACACGATTCCTGTGTGTCAT
TGCGGGATATGG
74F EMBO 3:289
TTTCCACACGATCGTGCTGCCTCCCAATAAACCCGGTGCA
GTGAGTCAGTGTGTTGTGTGCCCCAGTCGCGAGCGGACGA
TCCGTGGAGATC
Abdb EMBO 7:3223
TGCGGATCAATTAAACCGTAAAAAACAGAGCAGGCGAGCG
TAAGCAAGAGAGAGAGGTGAAGCCAGAGGCGGAGGCGCAA
GACAAAGTGCAT
abl p1 Oncogene 3:33
AAAAAACAGAGCAGGCGAGCGTAAGCAAGAGAGAGAGGTG
AAGCCAGAGGCGGAGGCGCAAGACAAAGTGCATTTTCAGG
GCGTGTTTTTGA
abl p2 Oncogene 3:33
TAATAGTCGCTCAAAAGCTGTCGAGAGAGAGGGAGAGAAA
AGAGAGAGTGAAAGCATAGTCCCGCTATTTTGCCGAGAGA
AATAAAGAGCAG
ace JMB 210:15
For example, for the first sequence, what I want is the name that follows
the sequence: 4f-rnp. Then I want to collect all these names in a new file,
so that the new file looks like this:
4f-rnp
44D
44L
74F
Abdb
abl
*Here I don't want another abl, so the output should not contain repeats.*
ace
I know how to write a script that splits each line and gets the name, but
how can I avoid the repetition?
Thanks!
| |
| Anno Siegel 2005-08-31, 7:56 am |
| Nina <tin_tint@hotmail.com> wrote in comp.lang.perl.misc:
> I know how to write a script that splits each line and gets the name, but
> how can I avoid the repetition?
Use a hash. See the FAQ "How can I remove duplicate elements from a
list or array?". It talks about arrays, not files, but the technique
is the same.
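A minimal sketch of that hash technique applied to lines like the ones in the sample. It assumes that a name line is any line with more than one whitespace-separated field and that the name is its first field. The sub name unique_names is made up for illustration and is not from the FAQ.

```perl
use strict;
use warnings;

# Return the first field of every multi-field line, keeping only the
# first occurrence of each name (hypothetical helper for illustration).
sub unique_names {
    my @lines = @_;
    my %seen;     # counts how often each name has been met
    my @names;
    for my $line (@lines) {
        my @fields = split ' ', $line;   # split on runs of whitespace
        next unless @fields > 1;         # sequence lines have one field
        my $name = $fields[0];
        push @names, $name unless $seen{$name}++;
    }
    return @names;
}

my @sample = (
    "GACAAAGTGCAT\n",
    "abl p1 Oncogene 3:33\n",
    "GCGTGTTTTTGA\n",
    "abl p2 Oncogene 3:33\n",
    "AATAAAGAGCAG\n",
    "ace JMB 210:15\n",
);
print "$_\n" for unique_names(@sample);   # prints abl, then ace
```

For a real file, the same sub could be handed the lines read from a filehandle, and the returned list printed to the output file.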
Anno
--
If you want to post a followup via groups.google.com, don't use
the broken "Reply" link at the bottom of the article. Click on
"show options" at the top of the article, then click on the
"Reply" at the bottom of the article headers.
| |
| William James 2005-08-31, 7:56 am |
| Nina wrote:
> so the new file is like:
> 4f-rnp
> 44D
> 44L
> 74F
> Abdb
> abl
> *Here I don't want another abl, so the output should not contain repeats.*
> ace
awk 'NF > 1 && 1 == ++a[$1] { print $1 }' datafile
| |
|
| I like it!
Such a quick response!
Thanks!
Another question:
if I want to replace spaces with \t in the same file as above, how can
I add a \n after the last replaced TAB? What's the problem with this script?
#!usr/bin/perl -w
my $file="dcpd.txt";
open (FILE, "$file") or die "Cannot open file.\n";
@file = <FILE>;
foreach (@file) {s/\s+/\t+/g;}
open (OUT, ">dcpd_tab.txt")or die "Cannot open file.\n";
print OUT @file;
close OUT;
close FILE;
thanks
| |
| Tad McClellan 2005-08-31, 7:56 am |
| Nina <tin_tint@hotmail.com> wrote:
> what's the problem with this script?
>
> #!usr/bin/perl -w
You are missing a slash character there.
Please post _actual_ code!
> open (FILE, "$file") or die "Cannot open file.\n";
You should include the $! variable in your die() message.
You should not have quotes around a lone variable:
    perldoc -q vars

        What's wrong with always quoting "$vars"?
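Both points can be sketched together, roughly like this. The three-argument open and the lexical filehandle go beyond what is said above and are only a suggestion. The sub name open_for_reading and the path in the demo are hypothetical.

```perl
use strict;
use warnings;

# Open a file for reading, or die with the system's reason from $!.
sub open_for_reading {
    my ($file) = @_;
    open(my $fh, '<', $file)             # plain $file, not "$file"
        or die "Cannot open '$file': $!\n";
    return $fh;
}

# A missing file now reports which file failed and why,
# instead of only "Cannot open file."
my $fh = eval { open_for_reading('no-such-dir/dcpd.txt') };
print "open failed: $@" unless $fh;
```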
> foreach (@file) {s/\s+/\t+/g;}
That replaces _runs_ of whitespace characters (any of: line feed,
carriage return, tab, formfeed, space) with a tab and a plus sign.
The 2nd part of s/// is a *string*, not a regular expression.
To TRansliterate space characters to tab characters:
tr/ /\t/;
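A small sketch comparing the two on one line from the sample data. It also shows why the newline disappears with the original substitution: \s matches "\n" too, so the line ending is replaced like any other whitespace. The third variant is an assumption about what may have been intended, not something stated above.

```perl
use strict;
use warnings;

my $line = "abl p1 Oncogene 3:33\n";

# Original substitution: every run of whitespace, including the
# trailing newline, becomes a tab followed by a literal plus sign.
(my $broken = $line) =~ s/\s+/\t+/g;

# Transliteration: each space becomes one tab; the newline is untouched.
(my $fixed = $line) =~ tr/ /\t/;

# Assumed alternative: collapse each run of spaces into a single tab,
# still leaving the newline alone.
(my $collapsed = "abl  p1\n") =~ s/ +/\t/g;

print $fixed;
```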
--
Tad McClellan SGML consulting
tadmc@augustmail.com Perl programming
Fort Worth, Texas