For Programmers: Free Programming Magazines  


Home > Archive > AWK > August 2007 > nested field separators









You are viewing an archived Text-only version of the thread. To view this thread in it's original format and/or if you want to reply to this thread please [click here]

 

Author nested field separators
Sashi

2007-07-10, 6:57 pm

Hi All,

I have a file whose fields are separated by commas. The tricky thing
is that a field could have a comma inside of it, in which case it's
delimited by double quotes.

How do I get awk to recognize the fields correctly?
E.g:

field1, field2, field3, "this is, field4", field5, "and , this is,
field6"

Thanks,
Sashi

Jürgen Kahrs

2007-07-10, 6:57 pm

Sashi wrote:

> I have a file whose fields are separated by commas. The tricky thing
> is that a field could have a comma inside of it, in which case it's
> delimited by double quotes.


The problem of comma-separated-values comes up often
in this newsgroup. Google for CSV in comp.lang.awk.
Kenny McCormack

2007-07-10, 6:57 pm

In article <5fi0puF3cejkaU1@mid.individual.net>,
Jürgen Kahrs <Juergen.KahrsDELETETHIS@vr-web.de> wrote:
>Sashi wrote:
>
>
>The problem of comma-separated-values comes up often
>in this newsgroup. Google for CSV in comp.lang.awk.


Incidentally, the best (and, IMHO, only) solution is to fix whatever
process is generating the file to output it with a different delimiter
(TAB is best). In fact, comma is just about the worst possible choice
for a delimiter.

Jürgen Kahrs

2007-07-10, 6:57 pm

Kenny McCormack wrote:

> Incidentally, the best (and, IMHO, only) solution is to fix whatever
> process is generating the file to output it with a different delimiter
> (TAB is best). In fact, comma is just about the worst possible choice
> for a delimiter.


Some people suggest using XML instead of using delimiters.
But this idea also has some drawbacks: Characters like
& and < inside data have to be escaped and processing
XML is not as straightforward as it looks at first sight.
Ed Morton

2007-07-10, 6:57 pm

Sashi wrote:
> Hi All,
>
> I have a file whose fields are separated by commas. The tricky thing
> is that a field could have a comma inside of it, in which case it's
> delimited by double quotes.
>
> How do I get awk to recognize the fields correctly?
> E.g:
>
> field1, field2, field3, "this is, field4", field5, "and , this is,
> field6"
>
> Thanks,
> Sashi
>


This will replace all sequences of a comma followed by white space
outside of double quotes by a tab character and use that tab character
as the field separator, and it gets rid of those superfluous double quotes:

$ cat file
field1, field2, field3, "this is, field4", field5, "and , this is, field6"
field1, field2, field3, "this is, field4", field5, "and , this is,
field6", and field seven

$ cat decsv.awk
BEGIN { FS="\t" }
function decsv( inStr,quoteCnt,subStr) {
inStr=$0
$0=""
quoteCnt=0
while (match(inStr,/[^"]*\"/)) {
subStr = substr(inStr,1,RLENGTH-1)
if ((++quoteCnt)%2) {
gsub(/,[[:space:]]*/,FS,subStr)
}
$0 = $0 subStr
inStr = substr(inStr,RLENGTH+1)
}
gsub(/,[[:space:]]*/,FS,inStr)
$0 = $0 inStr
}
{ decsv() }
{
print "\n"$0
for (i=1;i<=NF;i++) {
printf "%d:<%s>\n",i, $i
}
}

$ awk -f decsv.awk file

field1 field2 field3 this is, field4 field5 and , this is, field6
1:<field1>
2:<field2>
3:<field3>
4:<this is, field4>
5:<field5>
6:<and , this is, field6>

field1 field2 field3 this is, field4 field5 and , this is, field6
and field seven
1:<field1>
2:<field2>
3:<field3>
4:<this is, field4>
5:<field5>
6:<and , this is, field6>
7:<and field seven>
$

If your text within quotes can contain a tab, use some other character
for the FS (e.g. SUBSEP perhaps) or just gsub(/\t/," ") up-front. If
your text within quotes can contain escaped quotes (\"), then gsub()
those up-front to some other char/sequence at the start of decsv() then
gsub() them back at the end of decsv().

Ed.
Ed Morton

2007-07-10, 9:56 pm

Ed Morton wrote:
> Sashi wrote:
>
>
> This will replace all sequences of a comma followed by white space
> outside of double quotes by a tab character and use that tab character
> as the field separator, and it gets rid of those superfluous double quotes:
>
> $ cat file
> field1, field2, field3, "this is, field4", field5, "and , this is, field6"
> field1, field2, field3, "this is, field4", field5, "and , this is,
> field6", and field seven
>
> $ cat decsv.awk
> BEGIN { FS="\t" }
> function decsv( inStr,quoteCnt,subStr) {
> inStr=$0
> $0=""
> quoteCnt=0
> while (match(inStr,/[^"]*\"/)) {
> subStr = substr(inStr,1,RLENGTH-1)
> if ((++quoteCnt)%2) {
> gsub(/,[[:space:]]*/,FS,subStr)
> }
> $0 = $0 subStr
> inStr = substr(inStr,RLENGTH+1)
> }
> gsub(/,[[:space:]]*/,FS,inStr)
> $0 = $0 inStr
> }
> { decsv() }
> {
> print "\n"$0
> for (i=1;i<=NF;i++) {
> printf "%d:<%s>\n",i, $i
> }
> }
>
> $ awk -f decsv.awk file
>
> field1 field2 field3 this is, field4 field5 and , this is, field6
> 1:<field1>
> 2:<field2>
> 3:<field3>
> 4:<this is, field4>
> 5:<field5>
> 6:<and , this is, field6>
>
> field1 field2 field3 this is, field4 field5 and , this is, field6
> and field seven
> 1:<field1>
> 2:<field2>
> 3:<field3>
> 4:<this is, field4>
> 5:<field5>
> 6:<and , this is, field6>
> 7:<and field seven>
> $
>
> If your text within quotes can contain a tab, use some other character
> for the FS (e.g. SUBSEP perhaps) or just gsub(/\t/," ") up-front. If
> your text within quotes can contain escaped quotes (\"), then gsub()
> those up-front to some other char/sequence at the start of decsv() then
> gsub() them back at the end of decsv().
>
> Ed.



The OPs input shows a field that may contain a newline so I tweaked my
script to handle those:

$ cat decsv.awk
BEGIN { fs=",[[:space:]]"; FS="\t"; nl=" " }

function RS2nl( inStr,quoteCnt) {
# Converts RS characters inside of quotes to the nl character.
inStr = $0
quoteCnt = gsub(/"/,"",inStr)
while ( (quoteCnt%2) && ((getline inStr) > 0) ) {
$0 = $0 nl inStr
quoteCnt += gsub(/"/,"",inStr)
NR--
}
}

function fs2FS( inStr,quoteCnt,subStr) {
# Converts original field separators outside of quotes to the new
# FS character and discards the now-redundant pairs of double
quotes.
inStr = $0
$0 = ""
quoteCnt = 0
while (match(inStr,/[^"]*\"/)) {
subStr = substr(inStr,1,RLENGTH-1)
if ((++quoteCnt)%2) {
gsub(fs,FS,subStr)
}
$0 = $0 subStr
inStr = substr(inStr,RLENGTH+1)
}
gsub(fs,FS,inStr)
$0 = $0 inStr
}

{ RS2nl(); fs2FS() }

{
print NR": "$0
for (i=1;i<=NF;i++) {
printf "\t%d:<%s>\n",i, $i
}
print ""
}

Regards,

Ed.
Sashi

2007-07-11, 6:57 pm

On Jul 10, 3:42 pm, gaze...@xmission.xmission.com (Kenny McCormack)
wrote:
> In article <5fi0puF3cejk...@mid.individual.net>,
> J=FCrgen Kahrs <Juergen.KahrsDELETET...@vr-web.de> wrote:
>
>
>
>
> Incidentally, the best (and, IMHO, only) solution is to fix whatever
> process is generating the file to output it with a different delimiter
> (TAB is best). In fact, comma is just about the worst possible choice
> for a delimiter.


I could not agree with you more.
But such is the data that I receive and I need to live with it.

Sashi

2007-07-11, 6:57 pm

On Jul 10, 8:54 pm, Ed Morton <mor...@lsupcaemnt.com> wrote:
> Ed Morton wrote:
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> The OPs input shows a field that may contain a newline so I tweaked my
> script to handle those:
>
> $ cat decsv.awk
> BEGIN { fs=",[[:space:]]"; FS="\t"; nl=" " }
>
> function RS2nl( inStr,quoteCnt) {
> # Converts RS characters inside of quotes to the nl character.
> inStr = $0
> quoteCnt = gsub(/"/,"",inStr)
> while ( (quoteCnt%2) && ((getline inStr) > 0) ) {
> $0 = $0 nl inStr
> quoteCnt += gsub(/"/,"",inStr)
> NR--
> }
>
> }
>
> function fs2FS( inStr,quoteCnt,subStr) {
> # Converts original field separators outside of quotes to the new
> # FS character and discards the now-redundant pairs of double
> quotes.
> inStr = $0
> $0 = ""
> quoteCnt = 0
> while (match(inStr,/[^"]*\"/)) {
> subStr = substr(inStr,1,RLENGTH-1)
> if ((++quoteCnt)%2) {
> gsub(fs,FS,subStr)
> }
> $0 = $0 subStr
> inStr = substr(inStr,RLENGTH+1)
> }
> gsub(fs,FS,inStr)
> $0 = $0 inStr
>
> }
>
> { RS2nl(); fs2FS() }
>
> {
> print NR": "$0
> for (i=1;i<=NF;i++) {
> printf "\t%d:<%s>\n",i, $i
> }
> print ""
>
> }
>
> Regards,
>
> Ed.


Thank you, Ed. That is really helpful.
Regards,
Sashi

bjonkman@sobac.com

2007-07-11, 6:57 pm

On Jul 11, 10:17 am, Sashi <small...@gmail.com> wrote:

> Thank you, Ed. That is really helpful.
> Regards,
> Sashi


I've done something similar to Ed, but not as elegantly. See the
function parsecsv() in the code at

http://sobac.com/cgi-bin/awk/code2html.awk?library.awk

parsecsv() does not (yet) deal with embedded newlines, but it does
deal with embedded quotes that have been escaped by doubling, eg.

....,"these are ""embedded"" quotes",...

There's actually an RFC for the CSV format: http://tools.ietf.org/html/rfc4180

--Bob.





Ed Morton

2007-07-12, 7:57 am

bjonkman@sobac.com wrote:
> On Jul 11, 10:17 am, Sashi <small...@gmail.com> wrote:
>
>
>
>
> I've done something similar to Ed, but not as elegantly. See the
> function parsecsv() in the code at
>
> http://sobac.com/cgi-bin/awk/code2html.awk?library.awk
>
> parsecsv() does not (yet) deal with embedded newlines, but it does
> deal with embedded quotes that have been escaped by doubling, eg.
>
> ...,"these are ""embedded"" quotes",...
>
> There's actually an RFC for the CSV format: http://tools.ietf.org/html/rfc4180


Thanks for the RFC tip. I've modified my script to handle "escaped
quotes" as defined there as a contiguous pair of double quotes, to
convert tabs in the original file to spaces and to only use a comma as
the separator by default, instead of a comma followed by a space as in
the OP. So, the script is now:

----------------
$ cat decsv.awk
# This script converts CSV files that conform to the RFC
# http://tools.ietf.org/html/rfc4180 to a tab-separated
# fields format by default.
BEGIN {
# co = whatever string really represents a "comma" in CSV.
# FS = the eventual Field Separator you want to use.
# fs = the string you want to replace a FS within fields with.
# rs = the string you want to replace a RS within fields with.
# So, to process a file with a separator of a comma followed
# by a space instead of just a comma would be:
# awk -v co=",[[:space:]]" ...
co=(co ~ /^$/ ? "," : co)
FS=(FS ~ /^ $/ ? "\t" : FS)
fs=(fs ~ /^$/ ? " " : fs)
rs=(rs ~ /^$/ ? " " : rs)
}

function qq2dQ() {
# Converts every letter d to dD and every pair of contiguous
# double quotes to dQ (which due to the preceding
# modification cannot exist in the text otherwise).
gsub( /d/,"dD")
gsub(/""/,"dQ")
}

function dQ2q() {
# Converts every dQ to a single double quote (since escaping
# it is no longer necessary) and every dD back to d.
gsub(/dQ/,"\"")
gsub(/dD/, "d")
}

function FS2fs() {
# Converts all FS characters to the fs string.
gsub(FS,fs)
}

function RS2rs( inStr,quoteCnt) {
# Converts RS characters inside of quotes to the rs string.
inStr = $0
quoteCnt = gsub(/"/,"",inStr)
while ( (quoteCnt%2) && ((getline inStr) > 0) ) {
$0 = $0 rs inStr
quoteCnt += gsub(/"/,"",inStr)
NR--
}
}

function co2FS( inStr,quoteCnt,subStr) {
# Converts comma separators outside of quotes to the new
# FS character and discards the now-redundant pairs of
# double quotes.
inStr = $0
$0 = ""
quoteCnt = 0
while (match(inStr,/[^"]*\"/)) {
subStr = substr(inStr,1,RLENGTH-1)
if ((++quoteCnt)%2) {
gsub(co,FS,subStr)
}
$0 = $0 subStr
inStr = substr(inStr,RLENGTH+1)
}
gsub(co,FS,inStr)
$0 = $0 inStr
}

{ qq2dQ(); FS2fs(); RS2rs(); co2FS(); dQ2q(); print }
------------------

So, since the OPs file doesn't exactly conform to that RFC (because it
uses a comma followed by a space as the separator rather than just a
comma) the OP would now need to invoke it as:

$ awk -v co=", " -f decsv.awk file

If anyone finds a problem with it, let me know and if it's something I
can fix while keeping the script simple, I'll probably do it.

Regards,

Ed.
Ed Morton

2007-07-12, 9:57 pm

Ed Morton wrote:
<snip>
> Thanks for the RFC tip. I've modified my script to handle "escaped
> quotes" as defined there as a contiguous pair of double quotes, to
> convert tabs in the original file to spaces and to only use a comma as
> the separator by default, instead of a comma followed by a space as in
> the OP. So, the script is now:


Some cleanup, a bug fix, and some clarification:

> ----------------
> $ cat decsv.awk

# This script, decsv.awk, converts CSV files that conform to
# the RFC http://tools.ietf.org/html/rfc4180 to a tab-separated
# fields format by default, leaving the result in $0. To
# use this as a preprocessor for your own script, just invoke
# it as:
# awk -f decsv.awk -f myscript.awk csvfile

BEGIN {
# decsv_co = the string that represents a "comma" in CSV.
# FS = the eventual Field Separator you want to use.
# decsv_fs = the string to replace a FS within fields with.
# decsv_rs = the string to replace a RS within fields with.
# So, to process a file with a separator of a comma followed
# by a space instead of just a comma would be:
# awk -v decsv_co=",[[:space:]]" ...
decsv_co=(decsv_co ~ /^$/ ? "," : decsv_co)
FS=(FS ~ /^ $/ ? "\t" : FS)
decsv_fs=(decsv_fs ~ /^$/ ? " " : decsv_fs)
decsv_rs=(decsv_rs ~ /^$/ ? " " : decsv_rs)
}

function decsv_qq2dQ() {
# Converts every letter d to dD and every pair of contiguous
# double quotes to dQ (which due to the preceding
# modification cannot exist in the text otherwise).
gsub( /d/,"dD")
gsub(/""/,"dQ")
}

function decsv_dQ2q() {
# Converts every dQ to a single double quote (since escaping
# it is no longer necessary) and every dD back to d.
gsub(/dQ/,"\"")
gsub(/dD/, "d")
}

function decsv_FS2fs() {
# Converts all FS strings to the decsv_fs string.
gsub(FS,decsv_fs)
}

function decsv_RS2rs( inStr,quoteCnt) {
# Converts RS strings inside of quotes to the decsv_rs string.
inStr = $0
quoteCnt = gsub(/"/,"",inStr)
while ( (quoteCnt%2) && ((getline inStr) > 0) ) {
$0 = $0 decsv_rs inStr
quoteCnt += gsub(/"/,"",inStr)
FNR--
NR--
}
}

function decsv_co2FS( inStr,quoteCnt,subStr) {
# Converts decsv_co separators outside of quotes to the
# new FS string and discards the now-redundant pairs of
# enclosing double quotes.
inStr = $0
$0 = ""
quoteCnt = 0
while (match(inStr,/[^"]*\"/)) {
subStr = substr(inStr,1,RLENGTH-1)
if ((++quoteCnt)%2) {
gsub(decsv_co,FS,subStr)
}
$0 = $0 subStr
inStr = substr(inStr,RLENGTH+1)
}
gsub(decsv_co,FS,inStr)
$0 = $0 inStr
}

function decsv() {
decsv_qq2dQ()
decsv_FS2fs()
decsv_RS2rs()
decsv_co2FS()
decsv_dQ2q()
}

{ decsv() }
> ------------------


1) I fixed a bug above where I wasn't decrementing FNR appropriately
inside RS2rs.
2) I cleaned up the global name space so that the above function names
and global variables (co, fs, and rs) would be unlikely to clash with a
subsequent script by prefixing them all with "decsv_".
3) I got rid of the "; print" at the end of the script so you don't need
an intermediate file to process your CSV file since the conversion above
stores each record from the csvfile as you'd expect in $0. So all you
need to do is:

awk -f decsv.awk -f myscript.awk csvfile

where "myscript.awk" contains whatever script you really want to run on
your file based on it being formated as FS-separated fields rather than
CSV. If you just want the file printed, then you just need to put "1" or
"{ print }" in "myscript.awk" (or at the end of "decsv.awk").

For example:
------
$ cat csvfile
1,abc,def ghi,jkl,unquoted character strings
2,"abc","def ghi","jkl",quoted character strings
3,123,4""56,789,numbers with escaped quote
4,,"", ,empty fields
5,abc,"def
ghi",234,embedded newline
6,abc,"def "" ghi",789,quoted fields

------
$ cat field4num.awk
# print records that contain a number in the 4th field
$4 ~ /[[:digit:]]/ {
printf "NR=%d, FNR=%d, NF=%d, $0=<%s>\n",NR,FNR,NF,$0
for (i=1;i<=NF;i++) {
printf "\t%d:<%s>\n",i,$i
}
}

------
$ awk -f decsv.awk -f field4num.awk csvfile
NR=3, FNR=3, NF=5, $0=<3 123 4"56 789 numbers with
escaped quote>
1:<3>
2:<123>
3:<4"56>
4:<789>
5:<numbers with escaped quote>
NR=5, FNR=5, NF=5, $0=<5 abc def ghi 234 embedded newline>
1:<5>
2:<abc>
3:<def ghi>
4:<234>
5:<embedded newline>
NR=6, FNR=6, NF=5, $0=<6 abc def " ghi 789 quoted
fields>
1:<6>
2:<abc>
3:<def " ghi>
4:<789>
5:<quoted fields>
------

Regards,

Ed.

Lukin27

2007-08-11, 2:11 pm


SHOCKING P0RN VIDEOS!


















homosexual pic, gay boy film, Links to gay interracial clip, gay muscle sex clip, free gay bareback video, Amateur gay videos from Bobbie, Amateur gay videos from Boontjie, sexo homosexual, free gay latino sex video, Links to gay bear film
Sponsored Links







Also available: Server administration forum archive | Web Design forum archive | Software forum archive | Hardware reviews archive

Copyright 2008 codecomments.com