Help with split using multiple delimiters
geeknc@yahoo.com

2005-07-27, 5:05 pm

I have a file that contains 5 elements per line, each separated by white
space; however, the 4th element is surrounded by quotes.

Each line in a file looks like this:

ItemA ItemB 1.1.1.1.1 "xxx xx xxxxxx" ItemD

I was hoping to do something like this....

($a,$b,$c,$d,$e) = split(/split on white space or "...."/, $string);

and end up with....

$a = "ItemA";
$b = "ItemB";
$c = "1.1.1.1.1";
$d = "xxx xx xxxxxx";
$e = "ItemD";

I have tried multiple delimiters, but nothing seems to return 5
elements. Thank you, in advance, for any help you can offer.

it_says_BALLS_on_your forehead

2005-07-27, 5:05 pm

don't use split--use a regex.

($a, $b, $c, $d, $e) = $string =~
/(\S+)\s+(\S+)\s+(\S+)\s+"(.+)"\s+(\S+)/;

or if using $_

($a, $b, $c, $d, $e) = /(\S+)\s+(\S+)\s+(\S+)\s+"(.+)"\s+(\S+)/;

you can wrap each element in double quotes later.

you may be able to do

@array = /(\S+)\s+(\S+)\s+(\S+)\s+"(.+)"\s+(\S+)/;

for (@array) {
$_ = qq{"$_"};
}
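
A minimal, self-contained sketch of the fixed-layout regex approach above, run against the sample line from the original post. The sub name parse_fixed is made up for illustration, and the pattern is anchored and uses [^"]* for the quoted field, which is a small variation on the regex posted above rather than a copy of it.

```perl
use strict;
use warnings;

# Parse one line of the fixed five-field layout from the original post.
# In list context this returns the five captures, or an empty list if
# the line does not match. (parse_fixed is a hypothetical helper name.)
sub parse_fixed {
    my ($line) = @_;
    return $line =~ /^\s*(\S+)\s+(\S+)\s+(\S+)\s+"([^"]*)"\s+(\S+)\s*$/;
}

my @fields = parse_fixed('ItemA ItemB 1.1.1.1.1 "xxx xx xxxxxx" ItemD');
die "line did not match the expected layout\n" unless @fields;
print join('|', @fields), "\n";   # ItemA|ItemB|1.1.1.1.1|xxx xx xxxxxx|ItemD
```

This only works when the quoted field is always the 4th one, as in the example given.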

Paul Lalli

2005-07-27, 5:05 pm

gnc@yahoo.com wrote:
> I have a file that contains 5 elements per line, each separated by white
> space; however, the 4th element is surrounded by quotes.


Can you explain what was wrong with the solution you found in the FAQ?
You did, of course, search the FAQ before asking hundreds of other
people for help, right?

perldoc -q split
How can I split a [character] delimited string except when
inside [character]? (Comma-separated files)

In your case, the first [character] is a space, the second is a
double quote.

Paul Lalli
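
For readers who want to try the module route, here is a short sketch using Text::ParseWords, which ships with Perl. It assumes the input looks like the sample line in the original post (plain double-quoted fields, no doubled or escaped quotes inside them); it is one possible reading of the FAQ advice, not the only one.

```perl
use strict;
use warnings;
use Text::ParseWords qw(quotewords);

my $line = 'ItemA ItemB 1.1.1.1.1 "xxx xx xxxxxx" ItemD';

# quotewords(DELIMITER_REGEX, KEEP, LINES...): with KEEP false, the
# surrounding quotes are stripped from the returned fields.
my @fields = quotewords('\s+', 0, $line);

print scalar(@fields), " fields: ", join('|', @fields), "\n";
# 5 fields: ItemA|ItemB|1.1.1.1.1|xxx xx xxxxxx|ItemD
```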

James Taylor

2005-07-28, 10:00 pm

In article <1122488128.477319.145360@g43g2000cwa.googlegroups.com>,
<simon.chao@fmr.com> wrote:
>
> don't use split--use a regex.
>
> ($a, $b, $c, $d, $e) = $string =~
> /(\S+)\s+(\S+)\s+(\S+)\s+"(.+)"\s+(\S+)/;


If you don't know in advance which fields will be quoted,
you can use this regex instead:

my ($a, $b, $c, $d, $e) = $string =~ /("[^"]*"|\S+)/g;
# but then you need to remove any quotes by saying:
s/^"([^"]*)"$/$1/ foreach $a, $b, $c, $d, $e;

If you don't mind the fields all going in one array, you
could do it all in one go like this:

my @fields;
push @fields, $+ while $string =~ /"([^"]*)"|(\S+)/g;

Of course, nothing stops you then assigning the @fields
array to individual scalar variables:

my ($a, $b, $c, $d, $e) = @fields;
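
Put together as a runnable sketch (the sub name split_fields is invented here for illustration), the one-line loop can be wrapped up and tried on lines where the quoted field is in a different position:

```perl
use strict;
use warnings;

# Split a line on whitespace, treating "..." as a single field and
# dropping the surrounding quotes. Quoted fields may appear anywhere.
# (split_fields is a hypothetical helper name.)
sub split_fields {
    my ($line) = @_;
    my @fields;
    # $+ holds the text of whichever capture group matched this time.
    push @fields, $+ while $line =~ /"([^"]*)"|(\S+)/g;
    return @fields;
}

print join('|', split_fields('ItemA ItemB 1.1.1.1.1 "xxx xx xxxxxx" ItemD')), "\n";
# ItemA|ItemB|1.1.1.1.1|xxx xx xxxxxx|ItemD
print join('|', split_fields('"first item" ItemB "third item"')), "\n";
# first item|ItemB|third item
```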

If a single line while loop with a fairly simple regex seems too
easy or too efficient, you can always spend time reading up on
the various CPAN modules suggested by the FAQ (perldoc -q split),
work out how to set up the necessary OO object instances, how
to call the provided methods to get the result you require,
test that it does what you expect, pray that there are no
earlier versions of the module around that are buggy, pray
that no future versions will be buggy, load the whole module
at compile time and hope that this and the method call interface
don't hit performance too much, and then sit back and enjoy
the somewhat dubious pleasures of OPC (Other People's Code)
in the knowledge that at least you didn't have to do the
work yourself. (Irony intended.)

Even if you wanted to use a module, I note that the FAQ
entry "How can I split a [character] delimited string except
when inside [character]?" recommends the use of Text::CVS or
Text::CVS_XS but I don't believe CVS is what's needed here. :-)

--
James Taylor, London, UK PGP key: 3FBE1BF9
To protect against spam, the address in the "From:" header is not valid.
In any case, you should reply to the group so that everyone can benefit.
If you must send me a private email, use james at oakseed demon co uk.

it_says_BALLS_on_your forehead

2005-07-28, 10:00 pm

i don't know if that would work because of greedy matching. you may
need a ? after your asterisk, to make it stingy matching.

Anno Siegel

2005-07-29, 4:00 am

James Taylor <spam-block-@-SEE-MY-SIG.com> wrote in comp.lang.perl.misc:
> In article <1122488128.477319.145360@g43g2000cwa.googlegroups.com>,
> <simon.chao@fmr.com> wrote:


[...]

> Even if you wanted to use a module, I note that the FAQ
> entry "How can I split a [character] delimited string except
> when inside [character]?" recommends the use of Text::CVS or
> Text::CVS_XS but I don't believe CVS is what's needed here. :-)


That must be a typo in the FAQ. s/CVS/CSV/g.

Anno
--
If you want to post a followup via groups.google.com, don't use
the broken "Reply" link at the bottom of the article. Click on
"show options" at the top of the article, then click on the
"Reply" at the bottom of the article headers.

James Taylor

2005-07-29, 9:04 am

In article <dccplg$oqk$1@mamenchi.zrz.TU-Berlin.DE>,
Anno Siegel <anno4000@lublin.zrz.tu-berlin.de> wrote:
>
> James Taylor wrote:
>
> That must be a typo in the FAQ. s/CVS/CSV/g.


Who's responsible for maintaining the FAQ?
What's the correct procedure for nudging them?


James Taylor

2005-07-29, 9:04 am

Simon, I'm not sure which bit of my post you were replying
to, or even if it was me you were replying to, as you did
not quote any context. I will therefore attempt to rebuild
the relevant context below with the correct attributions.
You probably need to get a better news reader if you can.

In article <1122599411.163024.159890@g44g2000cwa.googlegroups.com>,
<simon.chao@fmr.com> wrote:
>
> In article <ant2823381cbfNdQ@riscpc.jtnet>,
> James Taylor wrote:
>
> i don't know if that would work because of greedy matching. you may
> need a ? after your asterisk, to make it stingy matching.


If we're sure that the OP's input lines contain simple
double quoted strings that do not themselves contain double
quotes (and this is what his example illustrated) then a
greedy [^"]* will swallow everything up to the next double
quote just as we require. Obviously, if the closing quote was
missing, it wouldn't capture the correct thing. (I think it
would backtrack and treat the opening quote as part of a
space delimited word instead). The OP could check there are
an even number of double quotes beforehand by saying:

die "Bad input line: $string\n" if $string =~ tr/"// % 2;

If the input lines were similar to CSV in allowing strings
that themselves contain double quotes, doubled up like this:

ItemA ItemB 1.1.1.1.1 "He said ""Hello"" to me" ItemD

then a more complex regex would be required. If this is what the
OP wants he can ask, but I don't believe it is. What he shouldn't
do, though, is use Text::ParseWords because, contrary to popular
belief, it doesn't handle CSV style quotes.
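
As a rough sketch of what such a "more complex regex" could look like, the following combines the quote-count check with a pattern that accepts doubled quotes inside a quoted field. It assumes the only escaping convention is "" for a literal quote, as in the example line above; the sub name split_csv_style is invented for illustration, and this is not a full CSV parser.

```perl
use strict;
use warnings;

# Whitespace-separated fields where a quoted field may contain a literal
# double quote written as two double quotes ("").
# (split_csv_style is a hypothetical helper name.)
sub split_csv_style {
    my ($line) = @_;
    # An odd number of quote characters means a quote is unbalanced.
    die "Bad input line: $line\n" if ($line =~ tr/"//) % 2;
    my @fields;
    while ($line =~ /"((?:[^"]|"")*)"|(\S+)/g) {
        my ($quoted, $bare) = ($1, $2);
        if (defined $quoted) {
            $quoted =~ s/""/"/g;      # turn "" back into a single quote
            push @fields, $quoted;
        }
        else {
            push @fields, $bare;
        }
    }
    return @fields;
}

print join('|', split_csv_style(
    'ItemA ItemB 1.1.1.1.1 "He said ""Hello"" to me" ItemD')), "\n";
# ItemA|ItemB|1.1.1.1.1|He said "Hello" to me|ItemD
```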


it_says_BALLS_on_your forehead

2005-07-29, 9:04 am

> James Taylor wrote:
> If you don't know in advance which fields will be quoted,
> you can use this regex instead:



...so based on that (you said fieldS), the greedy matching would have
caused the regex to do something that was unintended.

> James Taylor also wrote:
> If this is what the
> OP wants he can ask, but I don't believe it is.


...referring to nested quotes. you're right, he didn't ask that. nor did
i assume he did. the example that he gave suggests that the 4th field
would always be the quoted field, so that's why i gave him the simple
regex that i did.

i was simply pointing out what i thought was an oversight in your
regex, because my interpretation was that you thought the OP may have
to deal with multiple quoted fields, and if that were the case, the
default greedy matching would eat up all but the last quote.

xhoster@gmail.com

2005-07-29, 5:03 pm

"it_says_BALLS_on_your forehead" <simon.chao@fmr.com> wrote:
>
> ...so based on that (you said fieldS), the greedy matching would have
> caused the regex to do something that was unintended.


Can you illustrate this alleged problem?

Xho

it_says_BALLS_on_your forehead

2005-07-29, 5:03 pm

whoops, you're right. the [^"] deals with that, so it wouldn't be a
problem. and nested quotes would be a problem regardless of whether the
non-greedy modifier (?) was used. sorry about that...


i was thinking something like:
my $str = q{"one" "two" "three" "four" "five"};
my @fields = $str =~ /(".*")/g;

...
which would put the whole string in $fields[0];

again, sorry about that James Taylor.
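
To make the difference concrete, here is a small side-by-side sketch of the three patterns discussed in this sub-thread, applied to the same test string (the array names are only for illustration):

```perl
use strict;
use warnings;

my $str = q{"one" "two" "three" "four" "five"};

my @greedy  = $str =~ /(".*")/g;     # .* runs on to the LAST quote in the string
my @lazy    = $str =~ /(".*?")/g;    # .*? stops at the next quote
my @negated = $str =~ /("[^"]*")/g;  # [^"]* can never cross a quote

print scalar(@greedy), " ", scalar(@lazy), " ", scalar(@negated), "\n";   # 1 5 5
```

So the greedy-matching concern applies to .* but not to the negated character class used in the regex under discussion.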

James Taylor

2005-07-29, 5:03 pm

In article <1122652691.622586.260570@g14g2000cwa.googlegroups.com>,
<simon.chao@fmr.com> wrote:
>
> sorry about that James Taylor.


No problem Simon. :-)

