Home > Archive > AWK > September 2006 > Closing a field
You are viewing an archived Text-only version of the thread.
To view this thread in it's original format and/or if you want to reply to
this thread please [click here]
|
|
|
| Just for fun I'm learning how relational databases might have been done
from the Unix command-line.
Anyway, I have a table file in ASCII with the first field sometimes
having two words and sometimes having one word. I want to do something
to make all the words of the field always readable as a single field.
What's the best way to do that so that it works with the POSIX command
line database worth utilities like join, cut, paste, and etc.
It seems like embedded colons is preferred, but I wonder why there is a
proliferation in the Windows world of so-called comma-delimited tables
with additional quotes around strings that shouldn't be interpreted as
numbers.
But my real question is how to use AWK to make that first field readable
for my relational database on which I am working.
The second field is always a number, so I think I can use that as a clue.
Ex. of table:
Bastard Sword 1 10 1
Halberd 3 6 0
Here I want "Bastard Sword" to be the first field of record one and
"Halberd" to be the first field of record two.
--
Jeff
| |
|
| Jeff wrote:
> It seems like embedded colons is preferred,
Or are tabs preferred?
If so, please inspire me to learn AWK by showing me how easy it is to
change the the inter-field spaces to tabs while leaving the intra-field
spaces as spaces.
I apologize for the monologue. ^ ^
Thanks in advance.
--
Jeff
| |
| Janis Papanagnou 2006-09-15, 6:56 pm |
| Jeff wrote:
> Just for fun I'm learning how relational databases might have been done
> from the Unix command-line.
>
> Anyway, I have a table file in ASCII with the first field sometimes
> having two words and sometimes having one word. I want to do something
> to make all the words of the field always readable as a single field.
>
> What's the best way to do that so that it works with the POSIX command
> line database worth utilities like join, cut, paste, and etc.
>
> It seems like embedded colons is preferred, but I wonder why there is a
> proliferation in the Windows world of so-called comma-delimited tables
> with additional quotes around strings that shouldn't be interpreted as
> numbers.
>
> But my real question is how to use AWK to make that first field readable
> for my relational database on which I am working.
>
> The second field is always a number, so I think I can use that as a clue.
That would be possible. But if your number of elements per line does
not vary, other than having 4 or 5 fields, I would use this approach...
BEGIN { OFS="|" }
NF==4 { print $1, $2, $3, $4 ; next }
NF==5 { print $1" "$2, $3, $4, $5 ; next }
> Ex. of table:
>
> Bastard Sword 1 10 1
> Halberd 3 6 0
>
> Here I want "Bastard Sword" to be the first field of record one and
> "Halberd" to be the first field of record two.
....which produces records like...
Bastard Sword|1|10|1
Halberd|3|6|0
Janis
| |
| Janis Papanagnou 2006-09-15, 6:56 pm |
| Janis Papanagnou wrote:
> Jeff wrote:
>
>
>
> That would be possible. But if your number of elements per line does
> not vary, other than having 4 or 5 fields, I would use this approach...
>
> BEGIN { OFS="|" }
> NF==4 { print $1, $2, $3, $4 ; next }
> NF==5 { print $1" "$2, $3, $4, $5 ; next }
Umm.., you can of course omit the "next"...
BEGIN { OFS="|" }
NF==4 { print $1, $2, $3, $4 }
NF==5 { print $1" "$2, $3, $4, $5 }
Janis
>
>
>
> ...which produces records like...
>
> Bastard Sword|1|10|1
> Halberd|3|6|0
>
>
> Janis
| |
|
| Janis Papanagnou wrote:
[color=darkred]
> That would be possible. But if your number of elements per line does
> not vary, other than having 4 or 5 fields, I would use this approach...
>
> BEGIN { OFS="|" }
> NF==4 { print $1, $2, $3, $4 ; next }
> NF==5 { print $1" "$2, $3, $4, $5 ; next }
Looks fine to me. Thank you. I think I learn as much from what people
say as don't say about the topic of many a newsgroup.
I thought of the same solution, that of using the field number count as
the discriminant, mainly because I don't have any idea how to use AWK to
validate a field as a number. Also, I didn't really know the syntax of
using the field count yet.
[color=darkred]
> ...which produces records like...
> Bastard Sword|1|10|1
> Halberd|3|6|0
Great. I was thinking of using tabs as delimiters because cut--I think
it was--expected them by default. I'm not very practiced or well read
yet. I'm really more of a C++ programmer.
By the way, part of my interest in AWK is what it is revealing to me
about some of the strangeness in Ruby that I've never used in that
language. Ruby also has BEGIN{} and END{}, for example. So, I would
assume Ruby can be used in a way I've never used it before.
But Ruby's not POSIX, so I'll continue on with AWK and SED instead while
I'm transforming tables into a POSIX-compliant command line database. :-)
Do I suppose right that I could use AWK to make some more SQL-like
filters for the command line? Surely, I wouldn't be the first to have
done such a thing, would I?
--
Jeff
| |
| Janis Papanagnou 2006-09-16, 7:56 am |
| Jeff wrote:
> Janis Papanagnou wrote:
>
>
>
>
>
> Looks fine to me. Thank you. I think I learn as much from what people
> say as don't say about the topic of many a newsgroup.
>
> I thought of the same solution, that of using the field number count as
> the discriminant, mainly because I don't have any idea how to use AWK to
> validate a field as a number. Also, I didn't really know the syntax of
> using the field count yet.
The field count is the least obfuscated pattern, while constructing
regular expressions is the more ambitious approach. Testing for
integral numbers is the easier part (using /^[0-9]+$/), but testing
for real numbers can be a pain if you don't consider locale effects.
>
>
>
>
>
>
>
> Great. I was thinking of using tabs as delimiters because cut--I think
> it was--expected them by default. I'm not very practiced or well read
> yet. I'm really more of a C++ programmer.
You can change the OFS in the program as you like; to TAB, to colon,
comma, or semicolon, without touching the rest of the awk program.
And you don't need a separate cut; you may do that also within awk.
>
> By the way, part of my interest in AWK is what it is revealing to me
> about some of the strangeness in Ruby that I've never used in that
> language. Ruby also has BEGIN{} and END{}, for example. So, I would
> assume Ruby can be used in a way I've never used it before.
>
> But Ruby's not POSIX, so I'll continue on with AWK and SED instead while
> I'm transforming tables into a POSIX-compliant command line database. :-)
>
> Do I suppose right that I could use AWK to make some more SQL-like
> filters for the command line? Surely, I wouldn't be the first to have
> done such a thing, would I?
I don't know what you have specifically in mind with "SQL-like" filters,
but yes, such filtering is possible, depending on the requirements.
The simplest is filtering by column number, like in
($2 == "hello" && $4 ~ /xy?z*/) || ($5 * $6 < $3 + 100)
If you have a heading line in your data file where the column names are
defined you can also search by column name; something like...
BEGIN { select="ColName" } # or pass name by command line argument
NR==1 { for(f=1;f<=NF;f++) if($f==select) colNum=f }
{ print $ colNum }
Which can be extended further to allow searching multiple column names
with a single call.
Janis
| |
| Kenny McCormack 2006-09-16, 7:56 am |
| In article <sgQOg.8337$v%4.5993@newsread1.news.pas.earthlink.net>,
Jeff <jefferio007@earthlink.net> wrote:[color=darkred]
>Janis Papanagnou wrote:
>
>
That's actually pretty good, but it does have a couple of "issues":
1) It assumes (as you state) that the text field never has more
than 2 space separated strings.
2) I don't like "duplicative" code - that writing the same (or
almost the same) code twice. So, I try to avoid that.
Here's an approach that I kind of like (admitedly, it is a little more
C-like and less AWK-like than I might prefer, but so it goes):
{
for (i=1; i<NF; i++)
printf("%s%s",$i,i<NF-3 ? " " : "|")
print $i
}
| |
|
| >>> BEGIN { OFS="|" }
Kenny McCormack wrote:
[color=darkred]
> That's actually pretty good, but it does have a couple of "issues":
> 1) It assumes (as you state) that the text field never has more
> than 2 space separated strings.
True, and as it turns out I forgot that I have one or two such in my
table. Also, I had one fewer numeric fields than what I had said.
It was easy to doctor up what Janis has shown me, though. I adjusted for
the fewer fields and I added another line like this.
BEGIN { OFS="|" }
NF==3 { print $1, $2, $3 }
NF==4 { print $1" "$2, $3, $4 }
NF==5 { print $1" "$2" "$3, $4, $5 }
> 2) I don't like "duplicative" code - that writing the same (or
> almost the same) code twice. So, I try to avoid that.
Occasionally, though, I think simplicity can make maintenance less error
prone than duplication avoidance, although when one has three similar
lines one begins to push the line.
> Here's an approach that I kind of like (admitedly, it is a little more
> C-like and less AWK-like than I might prefer, but so it goes):
>
> {
> for (i=1; i<NF; i++)
> printf("%s%s",$i,i<NF-3 ? " " : "|")
> print $i
> }
That is a sweeter solution in the sense of allowing a much greater range
of strings in the text field.
In order to use that on my actual database which is different than the
one I had described in this forum, I had to change the magic number, 3,
to a 2.
So, I guess what might be even sweeter is if it could be generalized for
more or fewer numeric fields and the number of them passed on the
command line.
But I myself would still not have any idea how one would get a command
line argument into the AWK script.
--
Jeff
| |
|
| >>>> The second field is always a number, so I think I can use that as a
[color=darkred]
[color=darkred]
Janis Papanagnou wrote:
[color=darkred]
> The field count is the least obfuscated pattern, while constructing
> regular expressions is the more ambitious approach. Testing for
> integral numbers is the easier part (using /^[0-9]+$/), but testing
> for real numbers can be a pain if you don't consider locale effects.
Like the way much of the rest of the world interchanges commas and
periods as separators? Plus the problem designating signed mantissas.
[color=darkred]
[color=darkred]
[color=darkred]
[color=darkred]
> You can change the OFS in the program as you like; to TAB, to colon,
> comma, or semicolon, without touching the rest of the awk program.
The treatment of tabs as an alias for spaces in some POSIX is motivating
me to stay away from them. For example, I tried using 'tr' to change the
'|'s to '\t's for two different files. It worked correctly for one of
them, changing them into tabs, but for the other, all of the '|'s got
changed to spaces. Beats me why. So, I guess I'll stick with non-Tabs as
delimiters. I still want something easy to type, though, so I'll
probably end up going with a ',' or a ';', both of which are easy to type.
> And you don't need a separate cut; you may do that also within awk.
Shouldn't I use the right tool for the job to make the code always the
simplest it can be? A POSIX programmer should be expected know all of
the POSIX utilties, shouldn't he be?
[color=darkred]
> I don't know what you have specifically in mind with "SQL-like" filters,
> but yes, such filtering is possible, depending on the requirements.
Oh, sorry. I should have written, "SQL-like Unix pipe filters." We
already have a filter called 'join' so why not use AWK to create a
filter called 'select' and for other SQL commands as well.
> The simplest is filtering by column number, like in
>
> ($2 == "hello" && $4 ~ /xy?z*/) || ($5 * $6 < $3 + 100)
Great. I'd want to call the script by its SQL command name and use it as
an SQL-like Unix pipe filter. I'd have to give the script its search
parameters on the command line or in the environment variables--whatever
works with AWK and I don't know that part yet.
For example, I'd have to give the column numbers (the $2 and the $4 in
your ex.) and the search strings (the "hello" in your ex.) on the
command line or in an environment variable. I'm not yet exactly sure
what the "~ /xy?z*/" you have in your example does.
> If you have a heading line in your data file where the column names are
> defined you can also search by column name; something like...
That's interesting. I haven't read about that. Is it a Unix database idiom?
> BEGIN { select="ColName" } # or pass name by command line argument
> NR==1 { for(f=1;f<=NF;f++) if($f==select) colNum=f }
> { print $ colNum }
>
> Which can be extended further to allow searching multiple column names
> with a single call.
I get it. Sweet enough to go in my coffee. :-)
--
Jeff
| |
| Jon LaBadie 2006-09-16, 6:56 pm |
| Jeff wrote:
>
> So, I guess what might be even sweeter is if it could be generalized for
> more or fewer numeric fields and the number of them passed on the
> command line.
>
> But I myself would still not have any idea how one would get a command
> line argument into the AWK script.
In all awk's that I know of except "oawk" (old awk) you use the -v option
awk -v awkvariable1="$shellvariable" -v awkvariable2="any string" ... '
<awkprogram>
' datafile
| |
| Janis Papanagnou 2006-09-16, 6:56 pm |
| Jeff wrote:
> Janis Papanagnou wrote:
>
>
>
> Shouldn't I use the right tool for the job to make the code always the
> simplest it can be? A POSIX programmer should be expected know all of
> the POSIX utilties, shouldn't he be?
Yes, to both questions. But the point is that shell expressions with
long pipe sequences usually get complicated and slower than necessary.
In principle, if you have command pipes that make heavy use of head,
tail, cat, cut, sed, grep, and many other filters, you are better off
using a fast, coherent, extensible and powerful awk solution. So if
you know all the specialized POSIX tools, including all the facilities
of awk, you'll not have any difficulty selecting the most appropriate
solution.
> [...]
>
> Great. I'd want to call the script by its SQL command name and use it as
> an SQL-like Unix pipe filter. I'd have to give the script its search
> parameters on the command line or in the environment variables--whatever
> works with AWK and I don't know that part yet.
Both is possible. Pass variables using the -v option as in...
awk -v awkvar="${shellvar}" '$2 ~ awkvar { print NR, $2, awkvar }'
There's also the ARGV[] and ARGC available as predefined variables.
> [...] I'm not yet exactly sure
> what the "~ /xy?z*/" you have in your example does.
It's the matching operator: variable ~ /regularExpression/
The regular expression /xy?z*/ means an x followed optionally by one y
and arbitrary amounts (0..n) of z's.
Janis
| |
| Ed Morton 2006-09-16, 9:56 pm |
| Jeff wrote:
> Just for fun I'm learning how relational databases might have been done
> from the Unix command-line.
>
> Anyway, I have a table file in ASCII with the first field sometimes
> having two words and sometimes having one word. I want to do something
> to make all the words of the field always readable as a single field.
>
> What's the best way to do that so that it works with the POSIX command
> line database worth utilities like join, cut, paste, and etc.
>
> It seems like embedded colons is preferred, but I wonder why there is a
> proliferation in the Windows world of so-called comma-delimited tables
> with additional quotes around strings that shouldn't be interpreted as
> numbers.
>
> But my real question is how to use AWK to make that first field readable
> for my relational database on which I am working.
>
> The second field is always a number, so I think I can use that as a clue.
>
> Ex. of table:
>
> Bastard Sword 1 10 1
> Halberd 3 6 0
>
> Here I want "Bastard Sword" to be the first field of record one and
> "Halberd" to be the first field of record two.
awk '{text=rest=$0
sub(/[ 0-9]*$/,"",text)
sub(text,"",rest)
gsub(/ +/,":",rest)
printf "\"%s\"%s\n",text,rest}' file
Regards,
Ed.
|
|
|
|
|