Code Comments
Hi,

I would like to be able to split a file according to values in a field.

For example, to split the following file by surname, into separate
files:

John Smith
Dave Green
Kim Brown
Steve West
Tom Brown
Dave West
Mary Green
Steve Smith
Jack Brown

giving separate files with contents:

John Smith
Steve Smith

Dave Green
Mary Green

Kim Brown
Tom Brown
Jack Brown

Steve West
Dave West

It doesn't matter how the resulting files are named, as long as I can
specify a directory in which they are to be created.

I'm asking in this group but I'm not sure if awk is the most appropriate
tool. I would post to comp.unix.shell also, but I'm not sure if I'm
supposed to do that, so I'll just post it here for the moment.

If anyone knows how to do this in awk, or of a more appropriate tool,
please could you let me know.

Thanks for your help.

Regards,
Jonny
Jonny wrote:
> Hi,
>
> I would like to be able to split a file according to values in a field.
>
> For example, to split the following file by surname, into separate
> files:
>
> John Smith
> Dave Green
> Kim Brown
> Steve West
> Tom Brown
> Dave West
> Mary Green
> Steve Smith
> Jack Brown
>
> giving separate files with contents:
>
> John Smith
> Steve Smith
>
> Dave Green
> Mary Green
>
> Kim Brown
> Tom Brown
> Jack Brown
>
> Steve West
> Dave West
>
{ print > $NF }
--
Regards,
---Robert
Jonny wrote:
> Hi,
>
> I would like to be able to split a file according to values in a field.
>
> For example, to split the following file by surname, into separate
> files:
>
> John Smith
> Dave Green
> Kim Brown
> Steve West
> Tom Brown
> Dave West
> Mary Green
> Steve Smith
> Jack Brown
>
> giving separate files with contents:
>
> John Smith
> Steve Smith
>
> Dave Green
> Mary Green
>
> Kim Brown
> Tom Brown
> Jack Brown
>
> Steve West
> Dave West
>
> It doesn't matter how the resulting files are named, as long as I can
> specify a directory in which they are to be created.
>
> I'm asking in this group but I'm not sure if awk is the most appropriate
> tool. I would post to comp.unix.shell also, but I'm not sure if I'm
> supposed to do that, so I'll just post it here for the moment.
>
> If anyone knows how to do this in awk, or of a more appropriate tool,
> please could you let me know.
Awk seems the most appropriate tool to me. The following program
will create files named West, Smith, Green, and Brown containing
the respective entries.
{ print > $2 }
Janis
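The original question also asked for the files to land in a chosen directory, which the one-liners above do not cover. A minimal sketch of one way to do that, assuming a hypothetical input file names.txt and a hypothetical output directory split_by_surname (neither name comes from the thread), with the directory passed in through -v:

```shell
# Hypothetical names: names.txt (input) and split_by_surname/ (output directory).
rm -rf split_by_surname
mkdir -p split_by_surname
printf '%s\n' 'John Smith' 'Dave Green' 'Kim Brown' 'Steve West' 'Tom Brown' \
  'Dave West' 'Mary Green' 'Steve Smith' 'Jack Brown' > names.txt

# dir is an awk variable set on the command line. The parentheses around
# the concatenated file name keep the redirection unambiguous in awks
# other than gawk.
awk -v dir=split_by_surname '{ print > (dir "/" $2) }' names.txt

cat split_by_surname/Brown
```

With the sample data this should leave four files (Brown, Green, Smith, West) in the directory, and the final cat shows the three Brown lines in input order.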
Janis Papanagnou wrote:
> Jonny wrote:
>
> Awk seems the most appropriate tool to me. The following program
> will create files named West, Smith, Green, and Brown containing
> the respective entries.
>
> { print > $2 }
Thanks for the reply Janis.
I had no idea it was so simple. Because of the awk implementation I'm
using, I had to use:
{ print >> $2; close($2) }
otherwise I got a "Too many open files" error when dealing with large
files containing many surnames.
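The reason the append form is needed: once a file has been closed, a plain > would truncate it the next time it is reopened, so every record has to be appended instead. A minimal sketch of that workaround as a complete command, assuming a hypothetical input file names.txt and output directory split_close:

```shell
# Hypothetical names: names.txt (input) and split_close/ (output directory).
# The directory is recreated first because >> would otherwise append to
# files left over from an earlier run.
rm -rf split_close
mkdir -p split_close
printf '%s\n' 'John Smith' 'Dave Green' 'Kim Brown' 'Steve West' 'Tom Brown' \
  'Dave West' 'Mary Green' 'Steve Smith' 'Jack Brown' > names.txt

awk -v dir=split_close '{
    fn = dir "/" $2
    print >> fn      # append: the file is reopened for every record
    close(fn)        # release the descriptor immediately
}' names.txt

cat split_close/West
```

Only one output file is open at any time, at the cost of an open and close per input line, which is likely to be slower on large inputs.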
However, since my initial post, I have found out that there also needs
to be a limit on the number of lines in each file produced. That is, in
my initial example, if there was a limit of 2 lines per file, then
instead of producing the files:
Smith
Green
Brown
West
the following files would need to be produced:
Smith
Green
Brown_1
Brown_2
West
which I suppose makes things not so simple. I appreciate that this
makes the problem more complex - or rather, completely different - but
if you have any ideas of how to use awk (or any other tool) to do this,
I would be very grateful.
Regards,
Jonny
In article <tkNce.11104$5A3.2716@newsfe4-win.ntli.net>,
Jonny <www.mail@ntlworld.com> wrote:
...
>Thanks for the reply Janis.
>
>I had no idea it was so simple. Because of the awk implementation I'm
>using, I had to use:
>
>{ print >> $2; close($2) }
Get gawk. Then come back.
>However, since my initial post, I have found out that there also needs
>to be a limit on the number of lines in each file produced. That is, in
>my initial example, if there was a limit of 2 lines per file, then
>instead of producing the files:
Consider that your homework. We wouldn't want to spoil you, now would we?
Jonny wrote:
> Janis Papanagnou wrote:
>
>
>
>
>
> Thanks for the reply Janis.
>
> However, since my initial post, I have found out that there also needs
> to be a limit on the number of lines in each file produced. That is, in
> my initial example, if there was a limit of 2 lines per file, then
> instead of producing the files:
>
> Smith
> Green
> Brown
> West
>
> the following files would need to be produced:
>
> Smith
> Green
> Brown_1
> Brown_2
> West
>
> which I suppose makes things not so simple.
Well, if you need to index file names *ONLY AFTER* the first two entries,
then it's still simple.
{ print > $NF ((m = int(a[$NF]++/2)) > 0 ? "_" m : "") }
However that'll produce files
Brown
Brown_1
Green
Smith
West
--
Regards,
---Robert
In article <tkNce.11104$5A3.2716@newsfe4-win.ntli.net>,
Jonny <www.mail@ntlworld.com> wrote:
% However, since my initial post, I have found out that there also needs
% to be a limit on the number of lines in each file produced. That is, in
% my initial example, if there was a limit of 2 lines per file, then
% instead of producing the files:
You could just keep counts of how many of each name you've spat out.
  if (++ncount[$2] > limit) {
    ncount[$2] = 1
    fcount[$2]++
  }
  else if (! ($2 in fcount))
    fcount[$2] = 1
  fn = $2 "_" fcount[$2]
  print >> fn
  close(fn)
Depending on the size of your data, you could keep all the names in
memory then spit them out when you go over the limit. Something like
  {
    c[$2]++
    if (c[$2] > l) {
      f[$2]++
      fn = $2 "_" f[$2]
      for (i = 1; i <= l; i++)
        print n[$2,i] >> fn
      c[$2] = 1
    }
    n[$2,c[$2]] = $0
  }
  END {
    for (nm in c) {
      if (nm in f) {
        f[nm]++
        for (i = 1; i <= c[nm]; i++)
          print n[nm,i] >> (nm "_" f[nm])
      }
      else {
        for (i = 1; i <= c[nm]; i++)
          print n[nm,i] >> nm
      }
    }
  }
--
Patrick TJ McPhee
North York Canada
ptjm@interlog.com
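The first, counting approach is posted as the body of an action rather than a whole program. A sketch of how it might be run end to end, assuming the limit is supplied with -v and using a hypothetical input file names.txt and output directory split_limit (the dir variable is an addition, not part of the post):

```shell
# Hypothetical names: names.txt (input) and split_limit/ (output directory).
rm -rf split_limit
mkdir -p split_limit
printf '%s\n' 'John Smith' 'Dave Green' 'Kim Brown' 'Steve West' 'Tom Brown' \
  'Dave West' 'Mary Green' 'Steve Smith' 'Jack Brown' > names.txt

awk -v limit=2 -v dir=split_limit '{
    # ncount: lines written to the current file for this surname
    # fcount: index of the current file for this surname
    if (++ncount[$2] > limit) {
        ncount[$2] = 1
        fcount[$2]++
    } else if (!($2 in fcount))
        fcount[$2] = 1
    fn = dir "/" $2 "_" fcount[$2]
    print >> fn
    close(fn)
}' names.txt

ls split_limit
```

With the sample data and limit=2 this should give Brown_1, Brown_2, Green_1, Smith_1 and West_1. Every file carries a suffix here, including surnames that never reach the limit.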
Robert Katz wrote:
> Jonny wrote:
>
>
>
> Well, if you need to index file names *ONLY AFTER* the first two
> entries, then it's still simple.
>
> { print > $NF ((m = int(a[$NF]++/2)) > 0 ? "_" m : "") }
>
> However that'll produce files
>
> Brown
> Brown_1
> Green
> Smith
> West
>
How about,
{ print > $NF ((m = int(a[$NF]++/2)) > 0 ? "_" m + 1 : "") }
END {
for (x in a)
if (a[x] > 2)
system("mv " x " " x "_1")
}
--
Regards,
---Robert
Jonny wrote:
> However, since my initial post, I have found out that there also needs
> to be a limit on the number of lines in each file produced. That is, in
> my initial example, if there was a limit of 2 lines per file, then
> instead of producing the files:
>
> Smith
> Green
> Brown
> West
>
> the following files would need to be produced:
>
> Smith
> Green
> Brown_1
> Brown_2
> West
Thanks to Patrick and Robert for the additional replies. They all
produce the five files expected. I will have to test them on larger
files to determine which is the quickest or most feasible.
It's not going to be straightforward to test because I ran one of the
scripts on a file containing over a million names, and almost 300,000
files were produced, before I stopped awk. It then took around 2 hours
for the directory containing the files to be deleted. But that's NTFS
for you.
Thanks again for your help. I appreciate it.
Regards,
Jonny
Robert Katz wrote:
> Jonny wrote:
>
>
>
> Well, if you need to index file names *ONLY AFTER* the first two
> entries, then it's still simple.
>
> { print > $NF ((m = int(a[$NF]++/2)) > 0 ? "_" m : "") }
>
> However that'll produce files
>
> Brown
> Brown_1
> Green
> Smith
> West
>
It might be more efficient to do:
$ awk -vCONVFMT="%d" '{print > ($2 "_" a[$2]++/2)}' file
since on every line:
a) You don't have to evaluate NF (which I vaguely recall hearing took
some extra processing time compared to just using the field number)
b) You don't need to call the int() function
c) You don't need to test for a value before appending the suffix
It will mean you end up with the first file names having "_0"
appended but it seems more appropriate to have a common format for all
file names rather than having a special case for the first files anyway.
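CONVFMT is defined by POSIX rather than being specific to gawk, although
POSIX leaves the result unspecified when it is set to something other
than a floating-point format, such as "%d". Pre-POSIX awks use OFMT for
this conversion instead.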
Ed.