For Programmers: Free Programming Magazines  


Home > Archive > PERL Beginners > October 2005 > How to put brackets in a string given substrings









You are viewing an archived Text-only version of the thread. To view this thread in it's original format and/or if you want to reply to this thread please [click here]

 

Author How to put brackets in a string given substrings
Wijaya Edward

2005-10-28, 7:56 am

Dear Sirs,

I have the following problem.
I am trying to put the bracket in a string given the set of its substrings.
Those bracketed region is "bounded" by the given substrings.
Like this, given input "String" and it's "substrings"

String
1.CCCATCTGTCCTTATTTGCTG
2.ACCCATCTGTCCTTGGCCAT
3.CCACCAGCACCTGTC
4.CCCAACACCTGCTGCCT
5.CTGGGTATGGGT
6.AGGAACTTGCCTGTACCACAGGAAG

Substrings:
1. ATCTG ATTTG
2. CCATC
3. CCACC CCAGC GCAAC
4. CCAAC ACACC
5. GTATG TGGGT
6. CAGGA AGGAA

The desired answer are:
1. CCC[ATCTG]TCCTT[ATTTG]CTG
2. AC[CCATC]TGTCCTTGGCCAT
3. [CCACCAGCACC]TGTC *
4. C[CCAACACC]TGCTGCCT *
5. CTGG[GTATGGGT] **
6. AGGAACTTGCCTGTACCA[CAGGAA]G **

Please note that in example 3 and 4 the substrings are "overlapping".
Pay attention also to for example 5 and 6, there exist substrings that occur
twice. So the answer for example 5 and 6 are NOT

5. C[TGGGTATGGGT] ----this is wrong
6. [AGGAA]CTTGCCTGTACCA[CAGGAA]G ----this is wrong

Since they do not follow the order from the given substrings (array -- see my code).
Below is my code. It only work for example 1 and 2.
How can I approach this problem so that it can handle all those cases?


__BEGIN__
#!/usr/bin/perl -w
use strict;

my $s1 ='CCCATCTGTCCTTATTTGCTG'; my @a1 = qw(ATCTG ATTTG);
my $s2 ='ACCCATCTGTCCTTGGCCAT'; my @a2 = qw(CCATC);
my $s3 ='CCACCAGCACCTGTC'; my @a3 = qw(CCACC CCAGC GCACC);
my $s4 ='CCCAACACCTGCTGCCT'; my @a4 = qw(CCAAC ACACC);
my $s5 ='CTGGGTATGGGT'; my @a5 = qw(GTATG TGGGT);
my $s6 = 'AGGAACTTGCCTGTACCACAGGAAG'; my @a6 = qw( CAGGA AGGAA );

#These two work fine.
put_bracket($s1,\@a1);
put_bracket($s2,\@a2);

#But these the rest don't work
put_bracket($s3,\@a3);
put_bracket($s4,\@a4);
put_bracket($s5,\@a3);
put_bracket($s6,\@a4);

sub put_bracket
{
my ($str,$ar) = @_;
my $bstr;
my $slen = length $ar->[0];

foreach my $subs ( @$ar )
{
my $idx = index($str,$subs);
my $bgn = $idx;
my $end = $idx+$slen+1;
substr($str,$bgn,0,"[");
substr($str,$end,0,"]");
}
print "$str\n";
return ;


__END__

Really hope to hear from you again.
---
Regards,
Edward WIJAYA
SINGAPORE

John W. Krahn

2005-10-28, 7:56 am

Wijaya Edward wrote:
>
> I have the following problem.
> I am trying to put the bracket in a string given the set of its substrings.
> Those bracketed region is "bounded" by the given substrings.
> Like this, given input "String" and it's "substrings"
>
> String
> 1.CCCATCTGTCCTTATTTGCTG
> 2.ACCCATCTGTCCTTGGCCAT
> 3.CCACCAGCACCTGTC
> 4.CCCAACACCTGCTGCCT
> 5.CTGGGTATGGGT
> 6.AGGAACTTGCCTGTACCACAGGAAG
>
> Substrings:
> 1. ATCTG ATTTG
> 2. CCATC
> 3. CCACC CCAGC GCAAC
> 4. CCAAC ACACC
> 5. GTATG TGGGT
> 6. CAGGA AGGAA
>
> The desired answer are:
> 1. CCC[ATCTG]TCCTT[ATTTG]CTG
> 2. AC[CCATC]TGTCCTTGGCCAT
> 3. [CCACCAGCACC]TGTC *
> 4. C[CCAACACC]TGCTGCCT *
> 5. CTGG[GTATGGGT] **
> 6. AGGAACTTGCCTGTACCA[CAGGAA]G **
>
> Please note that in example 3 and 4 the substrings are "overlapping".
> Pay attention also to for example 5 and 6, there exist substrings that occur
> twice. So the answer for example 5 and 6 are NOT
>
> 5. C[TGGGTATGGGT] ----this is wrong
> 6. [AGGAA]CTTGCCTGTACCA[CAGGAA]G ----this is wrong
>
> Since they do not follow the order from the given substrings (array -- see my code).
> Below is my code. It only work for example 1 and 2.
> How can I approach this problem so that it can handle all those cases?
>
>
> __BEGIN__
> #!/usr/bin/perl -w
> use strict;
>
> my $s1 ='CCCATCTGTCCTTATTTGCTG'; my @a1 = qw(ATCTG ATTTG);
> my $s2 ='ACCCATCTGTCCTTGGCCAT'; my @a2 = qw(CCATC);
> my $s3 ='CCACCAGCACCTGTC'; my @a3 = qw(CCACC CCAGC GCACC);
> my $s4 ='CCCAACACCTGCTGCCT'; my @a4 = qw(CCAAC ACACC);
> my $s5 ='CTGGGTATGGGT'; my @a5 = qw(GTATG TGGGT);
> my $s6 = 'AGGAACTTGCCTGTACCACAGGAAG'; my @a6 = qw( CAGGA AGGAA );
>
> #These two work fine.
> put_bracket($s1,\@a1);
> put_bracket($s2,\@a2);
>
> #But these the rest don't work
> put_bracket($s3,\@a3);
> put_bracket($s4,\@a4);
> put_bracket($s5,\@a3);
> put_bracket($s6,\@a4);
>
> sub put_bracket
> {
> my ($str,$ar) = @_;
> my $bstr;
> my $slen = length $ar->[0];
>
> foreach my $subs ( @$ar )
> {
> my $idx = index($str,$subs);
> my $bgn = $idx;
> my $end = $idx+$slen+1;
> substr($str,$bgn,0,"[");
> substr($str,$end,0,"]");
> }
> print "$str\n";
> return ;
>
>
> __END__
>
> Really hope to hear from you again.


This appears to do what you want:

sub put_bracket {
my ( $str, $ar ) = @_;

my $x = 0;
for my $subs ( @$ar ) {
if ( substr( $str, $x ) =~ /$subs/i ) {
$x += $-[ 0 ];
substr( $str, $x, length $subs ) =~ tr/A-Z/a-z/;
}
}
$str =~ s/([a-z]+)/[\U$1\E]/g;

print "$str\n";
return;
}



John
--
use Perl;
program
fulfillment
Edward WIJAYA

2005-10-28, 6:56 pm

Dear John,

Thanks so much for your life saving response.
There are one minor issue I still couldn't solve.

It is the fact that when the bounded region marked
by the array may occur more than once.
(See example no. 7 and 8 in my code below)

To disambiguate the situation, I can give the array that
comes along with the index.

I tried to modify your code below to handle
the matters. But I still cannot solve it.
I think I'm almost there but not quite yet.

Can you advice, how can I go about it?
Thanks so much beforehand. Really hope to hear
from you again.

__BEGIN__
my $t1 ='CCCATCTGTCCTTATTTGCTG'; my @ar1 = qw(ATCTG-3 ATTTG-13);
my $t2 ='ACCCATCTGTCCTTGGCCAT'; my @ar2 = qw(CCATC-2);
my $t3 ='CCACCAGCACCTGTC'; my @ar3 = qw(CCACC-0 CCAGC-3 GCACC-6);
my $t4 ='CCCAACACCTGCTGCCT'; my @ar4 = qw(CCAAC-1 ACACC-4);
my $t5 ='CTGGGTATGGGT'; my @ar5 = qw(GTATG-4 TGGGT-1);
my $t6 = 'AGGAACTTGCCTGTACCACAGGAAG'; my @ar6 = qw( CAGGA-18 AGGAA-19 );

#The above example should yield the same result as previously

# These two examples below are the 'ambiguous' cases.

my $t7 = 'CAGGACTTGCCTGTACCACAGGAAG'; my @ar7 = qw( CAGGA-18 );
my $t8 = 'CAGGATTTGAGGAAGTACCACAGGAAG'; my @ar8 = qw( CAGGA-18 AGGAA-19 );

# Answer 7 -- CAGGACTTGCCTGTACCA[CAGGA]AG Instead of --
[CAGGA]CTTGCCTGTACCACAGGAAG
# Answer 8 -- CAGGATTTGAGGAAGTACCA[CAGGAA]G Instead of --
[CAGGA]TTTG[AGGAA]GTACCACAGGAAG


print put_bracket_jk_idx($t8,\@ar8),"\n";

sub put_bracket_jk_idx {
my ( $str, $ar ) = @_;

for my $subs ( @$ar ) {

my ($sb,$id) = split("-",$subs);
print "$sb $id\n";

if ( substr( $str, $id ) =~ /$subs/i ) {
$id += $-[ 0 ];
substr( $str, $id, length $subs ) =~ tr/A-Z/a-z/;
}
}
$str =~ s/([a-z]+)/[\U$1\E]/g;

return $str;
}


print "\n";

__END__


--
Regards,
Edward WIJAYA
SINGAPORE

On Fri, 28 Oct 2005 18:42:11 +0800, John W. Krahn <krahnj@telus.net> wrote:

> Wijaya Edward wrote:
> --

Regards,
Edward WIJAYA
SINGAPORE>

John W. Krahn

2005-10-28, 6:56 pm

Edward WIJAYA wrote:
> Dear John,
>
> Thanks so much for your life saving response.
> There are one minor issue I still couldn't solve.
>
> It is the fact that when the bounded region marked
> by the array may occur more than once.
> (See example no. 7 and 8 in my code below)
>
> To disambiguate the situation, I can give the array that
> comes along with the index.
>
> I tried to modify your code below to handle
> the matters. But I still cannot solve it.
> I think I'm almost there but not quite yet.
>
> Can you advice, how can I go about it?
> Thanks so much beforehand. Really hope to hear
> from you again.
>
> __BEGIN__
> my $t1 ='CCCATCTGTCCTTATTTGCTG'; my @ar1 = qw(ATCTG-3 ATTTG-13);
> my $t2 ='ACCCATCTGTCCTTGGCCAT'; my @ar2 = qw(CCATC-2);
> my $t3 ='CCACCAGCACCTGTC'; my @ar3 = qw(CCACC-0 CCAGC-3 GCACC-6);
> my $t4 ='CCCAACACCTGCTGCCT'; my @ar4 = qw(CCAAC-1 ACACC-4);
> my $t5 ='CTGGGTATGGGT'; my @ar5 = qw(GTATG-4 TGGGT-1);
> my $t6 = 'AGGAACTTGCCTGTACCACAGGAAG'; my @ar6 = qw( CAGGA-18 AGGAA-19 );
>
> #The above example should yield the same result as previously
>
> # These two examples below are the 'ambiguous' cases.
>
> my $t7 = 'CAGGACTTGCCTGTACCACAGGAAG'; my @ar7 = qw( CAGGA-18 );
> my $t8 = 'CAGGATTTGAGGAAGTACCACAGGAAG'; my @ar8 = qw( CAGGA-18 AGGAA-19 );
>
> # Answer 7 -- CAGGACTTGCCTGTACCA[CAGGA]AG Instead of --
> [CAGGA]CTTGCCTGTACCACAGGAAG
> # Answer 8 -- CAGGATTTGAGGAAGTACCA[CAGGAA]G Instead of --
> [CAGGA]TTTG[AGGAA]GTACCACAGGAAG
>
>
> print put_bracket_jk_idx($t8,\@ar8),"\n";
>
> sub put_bracket_jk_idx {
> my ( $str, $ar ) = @_;
>
> for my $subs ( @$ar ) {
>
> my ($sb,$id) = split("-",$subs);
> print "$sb $id\n";
>
> if ( substr( $str, $id ) =~ /$subs/i ) {

^^^^^

> $id += $-[ 0 ];
> substr( $str, $id, length $subs ) =~ tr/A-Z/a-z/;

^^^^^
You should be using $sb instead of $subs.

> }
> }
> $str =~ s/([a-z]+)/[\U$1\E]/g;
>
> return $str;
> }
>
>
> print "\n";



John
--
use Perl;
program
fulfillment
Sponsored Links







Also available: Server administration forum archive | Web Design forum archive | Software forum archive | Hardware reviews archive

Copyright 2008 codecomments.com