Processing columns instead of rows
mickey

2006-01-10, 3:58 am

Hi,

I have a text processing problem which I think awk should be able to
handle better than my current approach does.

I have a file with the following format
------

ACGTAC
AGTCGT
ACGA?T
GTACGT

-------

What I would like as output is

------
000111
011000
0001?0
111000
-------

i.e., it counts the number of times each character occurs in a column,
and replaces the majority character with a 0 and any other character
with a 1, except for ?, which it leaves as is. Is there a way to tackle
this without transposing the character matrix? The file I am currently
dealing with is a couple of hundred megabytes, and I ran out of memory on
a machine with 2 GB of RAM while trying to transpose it using a simple
awk script.

Thanks in advance,

-M
Ed Morton

2006-01-10, 3:58 am


This doesn't transpose it, but it does use 3 arrays that could get
fairly large:

cnt[] = one entry for each column/character pair
maxVal[] = one entry for each column
maxCnt[] = one entry for each column

and it parses the input file twice: the first time to figure out which
character occurs most frequently in each column (field), the second to
replace the non-? characters with zeros and ones.

Put this in a file called "map.awk":

------------------
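BEGIN { FS = OFS = ""; keep = "?"; ARGV[ARGC++] = ARGV[1] }  # queue the file a second time

# Pass 1: count each character per column and track the most frequent one.
NR == FNR {
    for (i = 1; i <= NF; i++) {
        if ($i != keep) {
            cnt[i,$i]++
            if (cnt[i,$i] > maxCnt[i]) {
                maxVal[i] = $i
                maxCnt[i] = cnt[i,$i]
            }
        }
    }
    next
}

# Pass 2: the majority character becomes 0, anything else 1; "?" is left alone.
{
    for (i = 1; i <= NF; i++)
        if ($i != keep)
            $i = ($i == maxVal[i] ? 0 : 1)
}
1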
--------------------

then invoke it as:

awk -f map.awk file

Using FS="" to specify that the input records should be split into
separate characters may be a gawk extension. If so, use split() or
substr() if necessary to get the same effect in another awk. (Hint:
get gawk!)
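
For an awk where FS="" does not split records into characters, a
substr()-based variant might look like the sketch below. It is not the
script above, only a sketch along the same lines: the shell function name
map_columns is made up for illustration, and the file is named twice on
the command line instead of using the ARGV trick. Output is written one
character at a time so that no second copy of a very long line is built.

```sh
# map_columns FILE: majority character per column -> 0, others -> 1, "?" kept.
# FILE is passed twice so awk reads it in two passes (NR == FNR is pass 1).
map_columns() {
    awk -v keep="?" '
    NR == FNR {                              # pass 1: count characters per column
        n = length($0)
        for (i = 1; i <= n; i++) {
            c = substr($0, i, 1)
            if (c == keep) continue
            if (++cnt[i, c] > maxCnt[i]) {   # first character to reach a new high wins ties
                maxCnt[i] = cnt[i, c]
                maxVal[i] = c
            }
        }
        next
    }
    {                                        # pass 2: emit the mapped row
        n = length($0)
        for (i = 1; i <= n; i++) {
            c = substr($0, i, 1)
            printf "%s", (c == keep ? c : (c == maxVal[i] ? 0 : 1))
        }
        print ""
    }' "$1" "$1"
}
```

Calling map_columns on the four-line sample from the original question
should reproduce the requested output. The per-column arrays are still
needed, so memory use grows with the line length just as in map.awk.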

Regards,

Ed.
mickey

2006-01-10, 6:59 pm


Thanks a lot. That works well. My memory usage shot up to 1.3G but the
system held on.

-M
Jürgen Kahrs

2006-01-10, 6:59 pm

mickey wrote:


This looks like genomic data. But what kind of
operation are you trying to do?

> Thanks a lot. That works well. My memory usage shot up to 1.3G but the
> system held on.


Looks like you pushed a whole genome through poor old AWK.
mickey

2006-01-10, 6:59 pm

Jürgen Kahrs wrote:
> mickey wrote:
>
> This looks like genomic data. But what kind of
> operation are you trying to do?


I needed to convert this to binary data for a test.

> Looks like you pushed a whole genome through poor old AWK.


That's correct. :)

-M
Jürgen Kahrs

2006-01-10, 6:59 pm

mickey wrote:

>
> I needed to convert this to binary data for a test.


But why does it make sense to identify all "majority"
bases with 0 and the "minority" with 1? Or are you
doing some calculations with the ratios of A-T to
C-G?
mickey

2006-01-10, 6:59 pm

Jürgen Kahrs wrote:
> mickey wrote:
>
> But why does it make sense to identify all "majority"
> bases with 0 and the "minority" with 1? Or are you
> doing some calculations with the ratios of A-T to
> C-G?


It's an experimental test to correlate changes in genotypes with
phenotypes, but it presently works only with binary data. 1s and 0s were
the most convenient choice.

-M
Marek Simon

2006-01-25, 6:56 pm

So, let's solve the problem.
You have a file with four very long lines and you want to either pull
them into awk by column, or transpose the file and then process it
normally. Awk takes input as a stream of characters, so there are no
columns, only lines. If you want to work with columns, it would be better
to transpose the data into a new file.

There are several ways to solve it, depending on how much RAM or
disk space you have. Your file is too big to fit in memory, so I
suggest this solution:
First split the file into four files, one per line, for example with a
combination of the head and tail commands. Then compile this small C
program:

// --------------------- BEGIN OF CODE ----------------
#include <stdio.h>
#include <stdlib.h>

#define NFILES 4

int main(int argc, char **argv)
{
    FILE *inF[NFILES];
    int chr[NFILES];
    int i, done = 0;

    if (argc != NFILES + 1) {
        fprintf(stderr, "usage: %s line1 line2 line3 line4\n", argv[0]);
        return EXIT_FAILURE;
    }
    for (i = 0; i < NFILES; i++) {
        if ((inF[i] = fopen(argv[i + 1], "r")) == NULL) {
            perror(argv[i + 1]);
            return EXIT_FAILURE;
        }
    }
    /* Read one character from each file per iteration. Stop as soon as
       any file reaches EOF (or an error) or a newline, so that neither
       EOF nor the trailing newline is ever printed as data. */
    while (!done) {
        for (i = 0; i < NFILES; i++) {
            chr[i] = fgetc(inF[i]);
            if (chr[i] == EOF || chr[i] == '\n')
                done = 1;
        }
        if (!done)
            printf("%c%c%c%c\n", chr[0], chr[1], chr[2], chr[3]);
    }
    for (i = 0; i < NFILES; i++)
        fclose(inF[i]);
    return EXIT_SUCCESS;
}
// --------------------- END OF CODE -----------------------


It takes 4 filenames as parameters and produces the transposed file
on standard output, four characters per line. It will be quicker and
use less memory than any awk script.
I know this group is an awk group, but it is better to select the
proper tool for a problem than to bend other tools to fit it.
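
The splitting step mentioned above (one file per input line) can also be
done in a single pass instead of with head and tail. A minimal sketch,
assuming output names of the form PREFIX.1, PREFIX.2, and so on; the
function name split_lines and that naming scheme are made up for
illustration:

```sh
# split_lines FILE PREFIX: write line N of FILE to the file PREFIX.N
split_lines() {
    awk -v prefix="$2" '{
        out = prefix "." NR     # one output file per input line
        print > out
        close(out)              # do not keep finished files open
    }' "$1"
}
```

Note that awk still holds one complete line in memory at a time here, so
this only helps when a single line fits in RAM.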

Then, with the file transposed, you can work with awk normally. You may
want to set FIELDWIDTHS="1 1 1 1" (a gawk variable) in the BEGIN section.
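
Once the data is transposed, each line holds the characters of one
original column, so the majority can be found line by line with constant
memory. A rough sketch of that step, using substr() rather than the
gawk-only FIELDWIDTHS so it runs in any POSIX awk; the function name
map_transposed is hypothetical:

```sh
# map_transposed [FILE...]: each input line is one original column.
# The most frequent character on the line -> 0, others -> 1, "?" kept.
map_transposed() {
    awk -v keep="?" '{
        split("", cnt)                  # clear the counts for this line
        best = ""; bestCnt = 0
        n = length($0)
        for (i = 1; i <= n; i++) {      # find the most frequent character
            c = substr($0, i, 1)
            if (c != keep && ++cnt[c] > bestCnt) { bestCnt = cnt[c]; best = c }
        }
        for (i = 1; i <= n; i++) {      # emit the mapped line
            c = substr($0, i, 1)
            printf "%s", (c == keep ? c : (c == best ? 0 : 1))
        }
        print ""
    }' "$@"
}
```

The result is still in transposed form, so it would have to be transposed
back to match the layout asked for in the original question.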

Marek
Juergen Kahrs

2006-01-25, 6:56 pm

Marek Simon wrote:

> There are several ways to solve it, depending on how much RAM or
> disk space you have. Your file is too big to fit in memory, so I


That's correct.

> It takes 4 filenames as parameters and it produces the transposed file
> on standard output, four characters on each line. It will be quicker and
> less memory eating than any awk script.


No, the problem of transposing a file has been solved
in AWK many times. You _can_ solve this in AWK without
reading the complete file into memory. I think such a
solution was already described in TAPL ("The AWK Programming
Language") back in 1988.

> I know this group is an awk group, but it is better to select the
> proper tool for a problem than to bend other tools to fit it.


Yes, choose the right tools, but don't underestimate AWK.
Marek Simon

2006-01-25, 6:56 pm

I can imagine an advanced algorithm which transposes the data without
storing it in memory. But in this case a single line is a few hundred
MB, so a conventional use of awk wouldn't be a good solution, though some
non-conventional one might be. I would rather try that C code; it would
surely be faster than the awk interpreter.
Marek
Ed Morton

2006-01-25, 6:56 pm



Marek Simon wrote:
> I can imagine an advanced algorithm which transposes the data without
> storing it in memory.


If you look back up the thread you'll see on 9th Jan a solution posted
that's pretty basic and doesn't need to transpose the data and doesn't
require creating a separate file for each line of input.

> But in this case a single line is a few hundred
> MB, so a conventional use of awk wouldn't be a good solution, though some
> non-conventional one might be.


The posted solution is perfectly conventional; it just reads the input
file twice.
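
The "read the file twice" idiom can be seen in isolation in a small
sanity check that might be worth running before the mapping: pass 1
records the longest line, pass 2 reports any line whose length differs.
This is only a sketch; the function name check_widths is made up, and
the file is simply named twice on the command line.

```sh
# check_widths FILE: report lines that are shorter than the longest line
check_widths() {
    awk '
    NR == FNR { if (length($0) > max) max = length($0); next }   # pass 1: longest line
    length($0) != max {                                          # pass 2: report mismatches
        printf "line %d: %d of %d columns\n", FNR, length($0), max
    }' "$1" "$1"
}
```

It prints nothing when every row has the same number of columns.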

> I would rather try that C code; it would
> surely be faster than the awk interpreter.
> Marek


I doubt that what you proposed would be faster, since what you suggested
was multiple steps using various shell commands to create multiple input
files before you even start to execute the C program.

Please read http://cfaj.freeshell.org/google before posting again as
you're not quoting the articles you're responding to properly.

Ed.