Code Comments

Split a file according to values in a field
Hi,

I would like to be able to split a file according to values in a field.

For example, to split the following file by surname, into separate
files:

John Smith
Dave Green
Kim Brown
Steve West
Tom Brown
Dave West
Mary Green
Steve Smith
Jack Brown

giving separate files with contents:

John Smith
Steve Smith

Dave Green
Mary Green

Kim Brown
Tom Brown
Jack Brown

Steve West
Dave West

It doesn't matter how the resulting files are named, as long as I can
specify a directory in which they are to be created.

I'm asking in this group but I'm not sure if awk is the most appropriate
tool.  I would post to comp.unix.shell also, but I'm not sure if I'm
supposed to do that, so I'll just post it here for the moment.

If anyone knows how to do this in awk, or of a more appropriate tool,
please could you let me know.

Thanks for your help.

Regards,
Jonny

Jonny
04-30-05 01:55 PM


Re: Split a file according to values in a field
Jonny wrote:
> Hi,
>
> I would like to be able to split a file according to values in a field.
>
> For example, to split the following file by surname, into separate
> files:
>
> John Smith
> Dave Green
> Kim Brown
> Steve West
> Tom Brown
> Dave West
> Mary Green
> Steve Smith
> Jack Brown
>
> giving separate files with contents:
>
> John Smith
> Steve Smith
>
> Dave Green
> Mary Green
>
> Kim Brown
> Tom Brown
> Jack Brown
>
> Steve West
> Dave West
>

{ print > $NF }

--
Regards,

---Robert

Robert Katz
04-30-05 01:55 PM


Re: Split a file according to values in a field
Jonny wrote:
> Hi,
>
> I would like to be able to split a file according to values in a field.
>
> For example, to split the following file by surname, into separate
> files:
>
> John Smith
> Dave Green
> Kim Brown
> Steve West
> Tom Brown
> Dave West
> Mary Green
> Steve Smith
> Jack Brown
>
> giving separate files with contents:
>
> John Smith
> Steve Smith
>
> Dave Green
> Mary Green
>
> Kim Brown
> Tom Brown
> Jack Brown
>
> Steve West
> Dave West
>
> It doesn't matter how the resulting files are named, as long as I can
> specify a directory in which they are to be created.
>
> I'm asking in this group but I'm not sure if awk is the most appropriate
> tool.  I would post to comp.unix.shell also, but I'm not sure if I'm
> supposed to do that, so I'll just post it here for the moment.
>
> If anyone knows how to do this in awk, or of a more appropriate tool,
> please could you let me know.

Awk seems the most appropriate tool to me. The following program
will create files named West, Smith, Green, and Brown containing
the respective entries.

{ print > $2 }


Janis

Janis Papanagnou
04-30-05 01:55 PM
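
Neither one-liner above covers the other half of the question, choosing the directory in which the files are created. A minimal sketch of one way to do it, assuming the input lives in a file called names.txt and the directory is passed in an awk variable called dir (both names are illustrative, not from the thread):

```shell
# Illustrative setup: a scratch directory holding the sample data from the question.
workdir=$(mktemp -d)
printf '%s\n' 'John Smith' 'Dave Green' 'Kim Brown' 'Steve West' 'Tom Brown' \
    'Dave West' 'Mary Green' 'Steve Smith' 'Jack Brown' > "$workdir/names.txt"
mkdir -p "$workdir/out"

# dir is a hypothetical variable name. The parentheses around the file name
# keep the concatenation unambiguous across awk implementations.
awk -v dir="$workdir/out" '{ print > (dir "/" $NF) }' "$workdir/names.txt"

ls "$workdir/out"
```

The directory has to exist before awk runs, since awk creates the files but not the directories that contain them.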


Re: Split a file according to values in a field
Janis Papanagnou wrote:

> Jonny wrote: 
>
> Awk seems the most appropriate tool to me. The following program
> will create files named West, Smith, Green, and Brown containing
> the respective entries.
>
> { print > $2 }


Thanks for the reply Janis.

I had no idea it was so simple.  Because of the awk implementation I'm
using, I had to use:

{ print >> $2; close($2) }

otherwise I got a "Too many open files" error when dealing with large
files containing many surnames.

However, since my initial post, I have found out that there also needs
to be a limit on the number of lines in each file produced.  That is, in
my initial example, if there was a limit of 2 lines per file, then
instead of producing the files:

Smith
Green
Brown
West

the following files would need to be produced:

Smith
Green
Brown_1
Brown_2
West

which I suppose makes things not so simple.  I appreciate that this
makes the problem more complex - or rather, completely different - but
if you have any ideas of how to use awk (or any other tool) to do this,
I would be very grateful.

Regards,
Jonny

Jonny
04-30-05 08:55 PM
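
The append-and-close workaround described above can be written out in full as below. This is only a sketch: the names fn, dir and names.txt are illustrative, and it assumes the output directory starts out empty, because >> would otherwise add to files left over from an earlier run.

```shell
# Illustrative setup: a scratch directory and a few sample lines.
workdir=$(mktemp -d)
printf '%s\n' 'John Smith' 'Dave Green' 'Kim Brown' 'Tom Brown' > "$workdir/names.txt"
mkdir -p "$workdir/out"

awk -v dir="$workdir/out" '{
    fn = dir "/" $NF   # build the output file name once per line
    print >> fn        # append: a closed file reopened with ">" would be truncated
    close(fn)          # release the file straight away so few files are open at once
}' "$workdir/names.txt"

cat "$workdir/out/Brown"
```

Opening and closing a file for every input line is slower than keeping the files open, so it is a trade-off that mainly pays off when the number of distinct surnames exceeds what the awk in use can keep open.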


Re: Split a file according to values in a field
In article <tkNce.11104$5A3.2716@newsfe4-win.ntli.net>,
Jonny  <www.mail@ntlworld.com> wrote:
...
>Thanks for the reply Janis.
>
>I had no idea it was so simple.  Because of the awk implementation I'm
>using, I had to use:
>
>{ print >> $2; close($2) }

Get gawk.  Then come back.

>However, since my initial post, I have found out that there also needs
>to be a limit on the number of lines in each file produced.  That is, in
>my initial example, if there was a limit of 2 lines per file, then
>instead of producing the files:

Consider that your homework.  We wouldn't want to spoil you, now would we?


Kenny McCormack
04-30-05 08:55 PM


Re: Split a file according to values in a field
Jonny wrote:
> Janis Papanagnou wrote:
>
> Thanks for the reply Janis.
>
> However, since my initial post, I have found out that there also needs
> to be a limit on the number of lines in each file produced.  That is, in
> my initial example, if there was a limit of 2 lines per file, then
> instead of producing the files:
>
> Smith
> Green
> Brown
> West
>
> the following files would need to be produced:
>
> Smith
> Green
> Brown_1
> Brown_2
> West
>
> which I suppose makes things not so simple.

Well, if you need to index file names *ONLY AFTER* the first two entries,
then it's still simple.

{ print > $NF ((m = int(a[$NF]++/2)) > 0 ? "_" m : "") }

However that'll produce files

Brown
Brown_1
Green
Smith
West

--
Regards,

---Robert

Robert Katz
04-30-05 08:55 PM


Re: Split a file according to values in a field
In article <tkNce.11104$5A3.2716@newsfe4-win.ntli.net>,
Jonny  <www.mail@ntlworld.com> wrote:

% However, since my initial post, I have found out that there also needs
% to be a limit on the number of lines in each file produced.  That is, in
% my initial example, if there was a limit of 2 lines per file, then
% instead of producing the files:

You could just keep counts of how many of each name you've spat out.

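{
    # limit is the per-file line limit, e.g. awk -v limit=2
    if (++ncount[$2] > limit) {
        # current file for this name is full: move on to the next one
        ncount[$2] = 1
        fcount[$2]++
    }
    else if (! ($2 in fcount))
        fcount[$2] = 1

    fn = $2 "_" fcount[$2]
    print >> fn
    close(fn)
}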

Depending on the size of your data, you could keep all the names in
memory then spit them out when you go over the limit. Something like

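{
    # l is the per-file line limit, e.g. awk -v l=2
    c[$2]++
    if (c[$2] > l) {
        # buffer for this name is full: write it out and start again
        f[$2]++
        fn = $2 "_" f[$2]
        for (i = 1; i <= l; i++) print n[$2,i] >> fn
        c[$2] = 1
    }
    n[$2,c[$2]] = $0
}

END {
    # write whatever is still buffered for each name
    for (nm in c) {
        if (nm in f) {
            f[nm]++
            for (i = 1; i <= c[nm]; i++) print n[nm,i] >> (nm "_" f[nm])
        }
        else {
            for (i = 1; i <= c[nm]; i++) print n[nm,i] >> nm
        }
    }
}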
--

Patrick TJ McPhee
North York  Canada
ptjm@interlog.com

Patrick TJ McPhee
04-30-05 08:55 PM


Re: Split a file according to values in a field
Robert Katz wrote:
> Jonny wrote:
>
> Well, if you need to index file names *ONLY AFTER* the first two
> entries, then it's still simple.
>
>  { print > $NF ((m = int(a[$NF]++/2)) > 0 ? "_" m : "") }
>
> However that'll produce files
>
>  Brown
>  Brown_1
>  Green
>  Smith
>  West
>


How about,

{ print > $NF ((m = int(a[$NF]++/2)) > 0 ? "_" m + 1 : "") }
END {
    for (x in a)
        if (a[x] > 2)
            system("mv " x " " x "_1")
}

--
Regards,

---Robert

Robert Katz
05-01-05 08:55 AM


Re: Split a file according to values in a field
Jonny wrote:

> However, since my initial post, I have found out that there also needs
> to be a limit on the number of lines in each file produced.  That is, in
> my initial example, if there was a limit of 2 lines per file, then
> instead of producing the files:
>
> Smith
> Green
> Brown
> West
>
> the following files would need to be produced:
>
> Smith
> Green
> Brown_1
> Brown_2
> West

Thanks to Patrick and Robert for the additional replies.  They all
produce the five files expected.  I will have to test them on larger
files to determine which is the quickest or most feasible.

It's not going to be straightforward to test because I ran one of the
scripts on a file containing over a million names, and almost 300,000
files were produced, before I stopped awk.  It then took around 2 hours
for the directory containing the files to be deleted.  But that's NTFS
for you.

Thanks again for your help.  I appreciate it.

Regards,
Jonny

Jonny
05-01-05 08:56 PM


Re: Split a file according to values in a field

Robert Katz wrote:

> Jonny wrote:
>
> Well, if you need to index file names *ONLY AFTER* the first two
> entries, then it's still simple.
>
>  { print > $NF ((m = int(a[$NF]++/2)) > 0 ? "_" m : "") }
>
> However that'll produce files
>
>  Brown
>  Brown_1
>  Green
>  Smith
>  West
>

It might be more efficient to do:

$ awk -v CONVFMT="%d" '{print > ($2 "_" a[$2]++/2)}' file

since on every line:

a) You don't have to evaluate NF (which I vaguely recall hearing took
some extra processing time compared to just using the field number)
b) You don't need to call the int() function
c) You don't need to test for a value before appending the suffix

It will mean you end up with the first file names having "_0"
appended, but it seems more appropriate to have a common format for all
file names rather than a special case for the first files anyway.

CONVFMT is the POSIX variable that controls number-to-string conversion.
Some older awks that predate it use OFMT for that purpose, so setting
OFMT there should produce the same result.

Ed.



Ed Morton
05-01-05 08:56 PM

