Code Comments


problem with splitting on "words"

Hello All,

I am having trouble splitting words from titles from a list of research
papers. I thought I could split the title into words like so:

#!/usr/local/bin/perl
use locale;

%forums = ( 1 => 'B0->K+K-Ks',
            2 => 'B+->K+KsKs Decays',
            3 => 'Measurement of the Total Width',
            4 => 'Asymmetries in B0->K0s pi0 Decays'
);

foreach $forum ( sort keys %forums ){
    my $title = $forums{$forum};
    foreach $w (split /[^\w-]+/, $title) {
        next unless ($w =~ /^[A-Za-z]/);
        $title =~ /\b\Q$w\E\b/;
        print "Journal $forum indexed word = " .  ucfirst($w) . "\n";
    }
}

exit;

But the results show that I'm losing some characters:

Journal 1 indexed word = B0-    # this should be B0->
Journal 1 indexed word = K      # what happened to the '+'?
Journal 1 indexed word = K-Ks

Journal 2 indexed word = B      # '+->' missing
Journal 2 indexed word = K      # '+' missing
Journal 2 indexed word = KsKs
Journal 2 indexed word = Decays

Journal 3 indexed word = Measurement
Journal 3 indexed word = Of
Journal 3 indexed word = The
Journal 3 indexed word = Total
Journal 3 indexed word = Width

Journal 4 indexed word = Asymmetries
Journal 4 indexed word = In
Journal 4 indexed word = B0-   # should be 'B0->'
Journal 4 indexed word = K0s
Journal 4 indexed word = Pi0
Journal 4 indexed word = Decays

These are only example titles but the other titles have similar characters
in them as part of a "word". I tried adding the '-' and '>' to my character
class but that did not work. What am I doing wrong here?

thanks, Chee

Charlotte Hee
07-30-04 08:55 PM


RE: problem with splitting on "words"
Charlotte Hee wrote:
> Hello All,
>
> I am having trouble splitting words from titles from a list of
> research papers. I thought I could split the title into words like so:
>
>   #!/usr/local/bin/perl
>   use locale;
>
>   %forums = ( 1 => 'B0->K+K-Ks',
>               2 => 'B+->K+KsKs Decays',
>               3 => 'Measurement of the Total Width',
>               4 => 'Asymmetries in B0->K0s pi0 Decays'
>   );
>
>   foreach $forum ( sort keys %forums ){
>      my $title = $forums{$forum};
>      foreach $w (split /[^\w-]+/, $title) {
>         next unless ($w =~ /^[A-Za-z]/);
>         $title =~ /\b\Q$w\E\b/;
>         print "Journal $forum indexed word = " .  ucfirst($w) . "\n";
>       }
>   }
>
> exit;
>
> But the results show that I'm losing some characters:
>
> Journal 1 indexed word = B0-    # this should be B0->

No, because > matches the character class [^\w-]

> Journal 1 indexed word = K      # what happened to the '+'?

Same as above.

> Journal 1 indexed word = K-Ks
>
> Journal 2 indexed word = B      # '+->' missing

The '-' is there, but you're only printing tokens that start with a letter.
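
To see which characters the pattern consumes, it can help to print every token that split returns before the `next unless` filter runs. The following is only a sketch using two of the sample titles; the sub name `raw_tokens` is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return every token produced by the original split,
# without the "starts with a letter" filter that the script applies afterwards.
sub raw_tokens {
    my ($title) = @_;
    return split /[^\w-]+/, $title;
}

# '+' and '>' are neither word characters nor '-', so they act as separators
# and disappear. The lone '-' survives as its own token and is filtered later.
print join('|', raw_tokens('B+->K+KsKs Decays')), "\n";   # prints B|-|K|KsKs|Decays
print join('|', raw_tokens('B0->K+K-Ks')), "\n";          # prints B0-|K|K-Ks
```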

> Journal 2 indexed word = K      # '+' missing
> Journal 2 indexed word = KsKs
> Journal 2 indexed word = Decays
>
> Journal 3 indexed word = Measurement
> Journal 3 indexed word = Of
> Journal 3 indexed word = The
> Journal 3 indexed word = Total
> Journal 3 indexed word = Width
>
> Journal 4 indexed word = Asymmetries
> Journal 4 indexed word = In
> Journal 4 indexed word = B0-   # should be 'B0->'
> Journal 4 indexed word = K0s
> Journal 4 indexed word = Pi0
> Journal 4 indexed word = Decays
>
> These are only example titles but the other titles have similar
> characters in them as part of a "word". I tried adding the '-' and
> '>' to my character class but that did not work. What am I doing
> wrong here?

It's not clear what you're defining as a "word". I'm wondering why you
aren't just splitting on whitespace?

foreach $w (split ' ', $title) {
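
For comparison, here is a sketch of what splitting on whitespace returns for the sample titles. No characters are lost, but a run such as B0->K+K-Ks stays in one piece because it contains no whitespace. The sub name `ws_tokens` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the whitespace split suggested above.
sub ws_tokens {
    my ($title) = @_;
    # split ' ' is the special awk-style mode: leading whitespace is skipped
    # and the string is split on runs of whitespace.
    return split ' ', $title;
}

print join('|', ws_tokens('Asymmetries in B0->K0s pi0 Decays')), "\n";
# prints Asymmetries|in|B0->K0s|pi0|Decays
print join('|', ws_tokens('B0->K+K-Ks')), "\n";
# prints B0->K+K-Ks
```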

Bob Showalter
07-30-04 08:55 PM


RE: problem with splitting on "words"

Hi Bob,

In one of my tests I added the '>' to the character class [^\w->] but
I still didn't get 'B0->'. I've just learned about character classes
so I am trying to get a better handle on how they work. A lot of my titles
contain physics terms like B0->K- and I would consider 'B0->' a word and
'K-' another word.

thanks for the quick reply.  Chee

On Fri, 30 Jul 2004, Bob Showalter wrote:

> Date: Fri, 30 Jul 2004 13:29:54 -0400
> From: Bob Showalter <Bob_Showalter@taylorwhite.com>
> To: 'Charlotte Hee' <chee@slac.stanford.edu>, beginners@perl.org
> Subject: RE: problem with splitting on "words"
>
> Charlotte Hee wrote: 
>
> No, because > matches the character class [^\w-]
> 
>
> Same as above.
> 
>
> The '-' is there, but you're only printing tokens that start with a letter.
> 
>
> It's not clear what you're defining as a "word". I'm wondering why you
> aren't just splitting on whitespace?
>
>    foreach $w (split ' ', $title) {
>

Charlotte Hee
07-30-04 08:55 PM


RE: problem with splitting on "words"
Charlotte Hee wrote:
> Hi Bob,
>
> In one of my tests I added the '>' to the character class [^\w->] but
> I still didn't get 'B0->'.

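I'm guessing it's because that looks like a range. Using [^\w\->] makes the
hyphen unambiguous, but split still only cuts at separator characters, so
with '>' kept in the token you get 'B0->K' rather than 'B0->'.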
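
A quick way to check the escaped class is to print what split returns with it. In this sketch (the sub name `split_keep_arrow` is hypothetical), '-' and '>' are no longer separators, so only '+' and whitespace cut the sample titles, and split has nowhere to cut between '>' and the letter that follows it.

```perl
use strict;
use warnings;

# Hypothetical helper: split on anything that is not a word character,
# a hyphen, or '>'. The hyphen is escaped so it cannot be read as a range.
sub split_keep_arrow {
    my ($title) = @_;
    return split /[^\w\->]+/, $title;
}

print join('|', split_keep_arrow('B0->K+K-Ks')), "\n";          # prints B0->K|K-Ks
print join('|', split_keep_arrow('B+->K+KsKs Decays')), "\n";   # prints B|->K|KsKs|Decays
```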

> I've just learned about character classes
> so I am trying to get a better handle on how they work. A lot of my
> titles contain physics terms like B0->K- and I would consider 'B0->'
> a word and 'K-' another word.

OK. Instead of using split, why not capture the tokens you're interested in.
Something like:

for my $w ($title =~ /([A-Za-z]+[^A-Za-z\s]*)\s*/g) {
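
As a sketch of what that loop iterates over, the match can be wrapped in a small sub (the name `capture_tokens` is made up here) and the tokens printed for two of the sample titles.

```perl
use strict;
use warnings;

# Hypothetical helper: in list context, m//g returns the captured group
# from every successive match in the string.
sub capture_tokens {
    my ($title) = @_;
    return $title =~ /([A-Za-z]+[^A-Za-z\s]*)\s*/g;
}

print join('|', capture_tokens('B0->K+K-Ks')), "\n";    # prints B0->|K+|K-|Ks
print join('|', capture_tokens('B0->K0s pi0')), "\n";   # prints B0->|K0|s|pi0
```

Note that a letter following a digit starts a new token, so K0s comes back as K0 and s. Whether that matters depends on how the titles are meant to be indexed.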

Bob Showalter
07-30-04 08:55 PM


RE: problem with splitting on "words"

On Fri, 30 Jul 2004, Bob Showalter wrote:

> Date: Fri, 30 Jul 2004 13:52:57 -0400
> From: Bob Showalter <Bob_Showalter@taylorwhite.com>
> To: 'Charlotte Hee' <chee@slac.stanford.edu>
> Cc: beginners@perl.org
> Subject: RE: problem with splitting on "words"
>
> Charlotte Hee wrote: 
>
> I'm guessing it's because that looks like a range. Using [^\w\->] should
> work.
> 
>
> OK. Instead of using split, why not capture the tokens you're interested in.
> Something like:
>
>     for my $w ($title =~ /([A-Za-z]+[^A-Za-z\s]*)\s*/g) {
>

That's amazing! Yes, that works.

Let me see if I understand this expression:
/([A-Za-z]+
This matches any letter, uppercase or lowercase, 1 or more times.

[^A-Za-z\s]*)
This matches anything that's not a letter, uppercase or lowercase, or a
space, zero or more times. Here is how I will match my '->'.

\s*/g
This matches a blank space zero or more times and the 'g' means apply the
whole thing globally.
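
The same breakdown can be written into the pattern itself with the /x modifier, which lets a regex carry whitespace and comments. This is only a sketch; the variable name `$token_re` is an assumption. One detail: \s matches any whitespace character (tabs and newlines too), not only a blank space.

```perl
use strict;
use warnings;

# Same pattern as in the post, spread out with /x so each piece has a comment.
my $token_re = qr/
    (               # start of the captured part
      [A-Za-z]+     # one or more letters
      [^A-Za-z\s]*  # zero or more characters that are neither letters nor whitespace
    )               # end of the captured part
    \s*             # trailing whitespace, matched but not captured
/x;

my $title  = 'B+->K+KsKs Decays';
my @tokens = $title =~ /$token_re/g;   # g: repeat the match across the whole string
print join('|', @tokens), "\n";        # prints B+->|K+|KsKs|Decays
```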

But why do I need the character classes in parentheses?

thanks again!  Chee

Charlotte Hee
07-30-04 08:55 PM


RE: problem with splitting on "words"
Charlotte Hee wrote:
> On Fri, 30 Jul 2004, Bob Showalter wrote: 
>
> That's amazing! Yes, that works.
>
> Let me see if I understand this expression:
> /([A-Za-z]+
> This matches any letter, uppercase or lowercase, 1 or more times.

Yes. A token needs to start with a letter.

>
> [^A-Za-z\s]*)
> This matches anything that's not a letter, uppercase or lowercase, or
> a space, zero or more times. Here is how I will match my '->'.

Right. And it will stop at the next letter or whitespace char.

>
>  \s*/g
> This matches a blank space zero or more times and the 'g' means apply
> the whole thing globally.
>
> But why do I need the character classes in parentheses?

I did that so as not to capture the whitespace. Actually, I don't think you
need it; try leaving it out. I'm not the strongest on regexes; other folks
can probably improve on my approach here...
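
The effect of the parentheses can be seen by running the pattern both ways on one title. A minimal sketch, with variable names chosen for illustration: if the parentheses are dropped but the trailing \s* is kept, each returned token carries its trailing whitespace.

```perl
use strict;
use warnings;

my $title = 'Measurement of the Total Width';

# With parentheses, list-context m//g returns only the captured text.
my @with_parens    = $title =~ /([A-Za-z]+[^A-Za-z\s]*)\s*/g;
# Without parentheses it returns each whole match, trailing whitespace included.
my @without_parens = $title =~ /[A-Za-z]+[^A-Za-z\s]*\s*/g;

print join('|', @with_parens), "\n";      # prints Measurement|of|the|Total|Width
print join('|', @without_parens), "\n";   # prints Measurement |of |the |Total |Width
```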

Bob Showalter
07-30-04 08:55 PM


Re: problem with splitting on "words"
Charlotte Hee wrote:
>
> On Fri, 30 Jul 2004, Bob Showalter wrote: 
>
> Let me see if I understand this expression:
>
> [snip]
>
>  \s*/g
> This matches a blank space zero or more times and the 'g' means apply the
> whole thing globally.
>
> But why do I need the character classes in parentheses?

You don't really need parentheses, this should work as well:

for my $w ( $title =~ /[A-Za-z]+[^A-Za-z\s]*/g ) {
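
Putting the pieces together, the original script might look like the sketch below once the split is replaced by that match. The helper `indexed_lines` is hypothetical, and `use strict` and `use warnings` are additions that the original did not have.

```perl
use strict;
use warnings;

my %forums = (
    1 => 'B0->K+K-Ks',
    2 => 'B+->K+KsKs Decays',
    3 => 'Measurement of the Total Width',
    4 => 'Asymmetries in B0->K0s pi0 Decays',
);

# Hypothetical helper: build the output lines for one title.
sub indexed_lines {
    my ($forum, $title) = @_;
    my @lines;
    # No parentheses and no trailing \s*: each match is exactly one token.
    for my $w ( $title =~ /[A-Za-z]+[^A-Za-z\s]*/g ) {
        push @lines, "Journal $forum indexed word = " . ucfirst($w);
    }
    return @lines;
}

foreach my $forum ( sort keys %forums ) {
    print "$_\n" for indexed_lines($forum, $forums{$forum});
}
```

For the first title this prints the words B0->, K+, K- and Ks, one per line.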



John
--
use Perl;
program
fulfillment

John W. Krahn
07-31-04 01:55 AM


PERL Beginners archive
