Subject: problem with splitting on "words"
Charlotte Hee

2004-07-30, 3:55 pm



Hello All,

I am having trouble splitting words from titles from a list of research
papers. I thought I could split the title into words like so:

#!/usr/local/bin/perl
use locale;

%forums = ( 1 => 'B0->K+K-Ks',
            2 => 'B+->K+KsKs Decays',
            3 => 'Measurement of the Total Width',
            4 => 'Asymmetries in B0->K0s pi0 Decays'
          );

foreach $forum ( sort keys %forums ){
    my $title = $forums{$forum};
    foreach $w (split /[^\w-]+/, $title) {
        next unless ($w =~ /^[A-Za-z]/);
        $title =~ /\b\Q$w\E\b/;
        print "Journal $forum indexed word = " . ucfirst($w) . "\n";
    }
}

exit;

But the results show that I'm losing some characters:

Journal 1 indexed word = B0- # this should be B0->
Journal 1 indexed word = K # what happened to the '+'?
Journal 1 indexed word = K-Ks

Journal 2 indexed word = B # '+->' missing
Journal 2 indexed word = K # '+' missing
Journal 2 indexed word = KsKs
Journal 2 indexed word = Decays

Journal 3 indexed word = Measurement
Journal 3 indexed word = Of
Journal 3 indexed word = The
Journal 3 indexed word = Total
Journal 3 indexed word = Width

Journal 4 indexed word = Asymmetries
Journal 4 indexed word = In
Journal 4 indexed word = B0- # should be 'B0->'
Journal 4 indexed word = K0s
Journal 4 indexed word = Pi0
Journal 4 indexed word = Decays

These are only example titles but the other titles have similar characters
in them as part of a "word". I tried adding the '-' and '>' to my character
class but that did not work. What am I doing wrong here?

thanks, Chee
Bob Showalter

2004-07-30, 3:55 pm

Charlotte Hee wrote:
> Hello All,
>
> I am having trouble splitting words from titles from a list of
> research papers. I thought I could split the title into words like so:
>
> #!/usr/local/bin/perl
> use locale;
>
> %forums = ( 1 => 'B0->K+K-Ks',
> 2 => 'B+->K+KsKs Decays',
> 3 => 'Measurement of the Total Width',
> 4 => 'Asymmetries in B0->K0s pi0 Decays'
> );
>
> foreach $forum ( sort keys %forums ){
> my $title = $forums{$forum};
> foreach $w (split /[^\w-]+/, $title) {
> next unless ($w =~ /^[A-Za-z]/);
> $title =~ /\b\Q$w\E\b/;
> print "Journal $forum indexed word = " . ucfirst($w) . "\n";
> }
> }
>
> exit;
>
> But the results show that I'm losing some characters:
>
> Journal 1 indexed word = B0- # this should be B0->


No, because > matches the character class [^\w-]

> Journal 1 indexed word = K # what happened to the '+'?


Same as above.

> Journal 1 indexed word = K-Ks
>
> Journal 2 indexed word = B # '+->' missing


The '-' is there, but you're only printing tokens that start with a letter.
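
To make the three answers above concrete, here is a minimal sketch that prints every token split returns before the "starts with a letter" filter is applied. The helper name raw_tokens is made up for illustration and is not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): split a title exactly as the
# original script does and return every token, including the ones the
# "starts with a letter" test would later skip.
sub raw_tokens {
    my ($title) = @_;
    return split /[^\w-]+/, $title;
}

for my $title ('B0->K+K-Ks', 'B+->K+KsKs Decays') {
    my @tokens = raw_tokens($title);
    print "$title => ", join(' | ', map { "'$_'" } @tokens), "\n";
}
# B0->K+K-Ks => 'B0-' | 'K' | 'K-Ks'
# B+->K+KsKs Decays => 'B' | '-' | 'K' | 'KsKs' | 'Decays'
```

The '>' and '+' never show up because split throws away whatever the pattern matches, and the lone '-' in the second title is a real token that the next-unless line then discards.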

> Journal 2 indexed word = K # '+' missing
> Journal 2 indexed word = KsKs
> Journal 2 indexed word = Decays
>
> Journal 3 indexed word = Measurement
> Journal 3 indexed word = Of
> Journal 3 indexed word = The
> Journal 3 indexed word = Total
> Journal 3 indexed word = Width
>
> Journal 4 indexed word = Asymmetries
> Journal 4 indexed word = In
> Journal 4 indexed word = B0- # should be 'B0->'
> Journal 4 indexed word = K0s
> Journal 4 indexed word = Pi0
> Journal 4 indexed word = Decays
>
> These are only example titles but the other titles have similar
> characters in them as part of a "word". I tried adding the '-' and
> '>' to my character class but that did not work. What am I doing
> wrong here?


It's not clear what you're defining as a "word". I'm wondering why you
aren't just splitting on whitespace?

foreach $w (split ' ', $title) {
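
As a rough sketch of what splitting on whitespace would give for one of the sample titles (the variable names here are only for illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $title = 'Asymmetries in B0->K0s pi0 Decays';

# split ' ' splits on runs of whitespace and ignores leading whitespace,
# so punctuation such as '->' and '+' stays attached to its neighbours.
my @words = split ' ', $title;
print join(' | ', @words), "\n";
# Asymmetries | in | B0->K0s | pi0 | Decays
```

Nothing is lost this way, but a title with no spaces, such as 'B0->K+K-Ks', comes back as a single token.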
Charlotte Hee

2004-07-30, 3:55 pm



Hi Bob,

In one of my tests I added the '>' to the character class [^\w->] but
I still didn't get 'B0->'. I've just learned about character classes
so I am trying to get a better handle on how they work. A lot of my titles
contain physics terms like B0->K- and I would consider 'B0->' a word and
'K-' another word.

thanks for the quick reply. Chee

On Fri, 30 Jul 2004, Bob Showalter wrote:

> Date: Fri, 30 Jul 2004 13:29:54 -0400
> From: Bob Showalter <Bob_Showalter@taylorwhite.com>
> To: 'Charlotte Hee' <chee@slac.stanford.edu>, beginners@perl.org
> Subject: RE: problem with splitting on "words"
>
> Charlotte Hee wrote:
>
> No, because > matches the character class [^\w-]
>
>
> Same as above.
>
>
> The '-' is there, but you're only printing tokens that start with a letter.
>
>
> It's not clear what you're defining as a "word". I'm wondering why you
> aren't just splitting on whitespace?
>
> foreach $w (split ' ', $title) {
>

Bob Showalter

2004-07-30, 3:55 pm

Charlotte Hee wrote:
> Hi Bob,
>
> In one of my tests I added the '>' to the character class [^\w->] but
> I still didn't get 'B0->'.


I'm guessing it's because that looks like a range. Using [^\w\->] should
work.
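
A quick check of the escaped class, as a sketch only: it does keep '->' attached to 'B0', but split still cannot produce 'B0->' on its own, because there is no separator character between '>' and the 'K' that follows.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $title = 'B0->K+K-Ks';

# With '-' escaped, the class excludes word characters, '-' and '>',
# so for this title only '+' is left to act as a separator.
my @tokens = split /[^\w\->]+/, $title;
print join(' | ', @tokens), "\n";
# B0->K | K-Ks
```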

> I've just learned about character classes
> so I am trying to get a better handle on how they work. A lot of my
> titles contain physics terms like B0->K- and I would consider 'B0->'
> a word and 'K-' another word.


OK. Instead of using split, why not capture the tokens you're interested in?
Something like:

for my $w ($title =~ /([A-Za-z]+[^A-Za-z\s]*)\s*/g) {
Charlotte Hee

2004-07-30, 3:55 pm



On Fri, 30 Jul 2004, Bob Showalter wrote:

> Date: Fri, 30 Jul 2004 13:52:57 -0400
> From: Bob Showalter <Bob_Showalter@taylorwhite.com>
> To: 'Charlotte Hee' <chee@slac.stanford.edu>
> Cc: beginners@perl.org
> Subject: RE: problem with splitting on "words"
>
> Charlotte Hee wrote:
>
> I'm guessing it's because that looks like a range. Using [^\w\->] should
> work.
>
>
> OK. Instead of using split, why not capture the tokens you're interested in?
> Something like:
>
> for my $w ($title =~ /([A-Za-z]+[^A-Za-z\s]*)\s*/g) {
>


That's amazing! Yes, that works.

Let me see if I understand this expression:
/([A-Za-z]+
This matches any letter, uppercase or lowercase, 1 or more times.

[^A-Za-z\s]*)
This matches anything that's not a letter, uppercase or lowercase, or a
space, zero or more times. Here is how I will match my '->'.

\s*/g
This matches a blank space zero or more times and the 'g' means apply the
whole thing globally.

But why do I need the character classes in parentheses?

thanks again! Chee
Bob Showalter

2004-07-30, 3:55 pm

Charlotte Hee wrote:
> On Fri, 30 Jul 2004, Bob Showalter wrote:
>
> That's amazing! Yes, that works.
>
> Let me see if I understand this expression:
> /([A-Za-z]+
> This matches any letter, uppercase or lowercase, 1 or more times.


Yes. A token needs to start with a letter.

>
> [^A-Za-z\s]*)
> This matches anything that's not a letter, uppercase or lowercase, or
> a space, zero or more times. Here is how I will match my '->'.


Right. And it will stop at the next letter or whitespace char.

>
> \s*/g
> This matches a blank space zero or more times and the 'g' means apply
> the whole thing globally.
>
> But why do I need the character classes in parentheses?


I did that so as not to capture the whitespace. Actually, I don't think you
need it; try leaving it out. I'm not the strongest on regexes; other folks
can probably improve on my approach here...
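
A small sketch of what the parentheses change, assuming the pattern is used in list context with /g as in the loop above. The sample string and variable names are only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $title = 'pi0 Decays';

# With a capture group, a /g match in list context returns only what the
# parentheses captured, so the trailing \s* is matched but not returned.
my @captured = $title =~ /([A-Za-z]+[^A-Za-z\s]*)\s*/g;

# Without a group, the whole match is returned, trailing whitespace included.
my @whole = $title =~ /[A-Za-z]+[^A-Za-z\s]*\s*/g;

print join('|', @captured), "\n";   # pi0|Decays
print join('|', @whole), "\n";      # pi0 |Decays
```

So the parentheses only matter while the \s* is part of the pattern; once the \s* is dropped, the group is no longer needed.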
John W. Krahn

2004-07-30, 8:55 pm

Charlotte Hee wrote:
>
> On Fri, 30 Jul 2004, Bob Showalter wrote:
>
> Let me see if I understand this expression:
>
> [snip]
>
> \s*/g
> This matches a blank space zero or more times and the 'g' means apply the
> whole thing globally.
>
> But why do I need the character classes in parentheses?


You don't really need the parentheses; this should work as well:

for my $w ( $title =~ /[A-Za-z]+[^A-Za-z\s]*/g ) {
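
For reference, a self-contained sketch that runs this pattern over the four sample titles from the original post. The hash name %titles and the helper tokens are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %titles = (
    1 => 'B0->K+K-Ks',
    2 => 'B+->K+KsKs Decays',
    3 => 'Measurement of the Total Width',
    4 => 'Asymmetries in B0->K0s pi0 Decays',
);

# Hypothetical helper: a token is one or more letters followed by any run
# of characters that are neither letters nor whitespace.
sub tokens {
    my ($title) = @_;
    return $title =~ /[A-Za-z]+[^A-Za-z\s]*/g;
}

for my $id (sort { $a <=> $b } keys %titles) {
    print "Journal $id: ", join(' | ', tokens($titles{$id})), "\n";
}
# Journal 1: B0-> | K+ | K- | Ks
# Journal 2: B+-> | K+ | KsKs | Decays
# Journal 3: Measurement | of | the | Total | Width
# Journal 4: Asymmetries | in | B0-> | K0 | s | pi0 | Decays
```

Note that 'K0s' comes out as 'K0' and 's', because the letter after the digit starts a new token; whether that matters depends on how such terms should be indexed.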



John
--
use Perl;
program
fulfillment