For Programmers: Free Programming Magazines  


Home > Archive > AWK > December 2005 > general compression with awk









You are viewing an archived Text-only version of the thread. To view this thread in it's original format and/or if you want to reply to this thread please [click here]

 

Author general compression with awk
Rufus T. Firefly

2005-11-16, 6:55 pm

Hello,

my problem is to compress a long text file with recurrent patterns. I
experimented with awk but could not find a satisfying solution. The
compression should be lossless in the sense that line numbering is
preserved when it is decompressed.

The original file structure may look like this:

a
a
a
a
b
a
b
a
b
c
d
d

which should be compressed to something similar as

4 a
2 b a <-- note pattern [b a] spans several lines
1 b
1 c
2 d

or even better

3 a
3 a b <-- or [a b] here
1 c
2 d

where the numbers are the counts of the occurrence of pattern.

This is similar to the bash command uniq but should span patterns
including more than one line (I am aware of the possible ambiguity here
that does not really matter). Using uniq twice (with sort) on the above
example is not really what I want (the pattern [a b] is lost):
$ cat example |uniq -c |sort |uniq -c
2 1 a
3 1 b
1 1 c
1 2 d
1 4 a
(Also I loose the original sequence with the application of sort.)

Has anybody a simple answer to this question?
Thanks!
Ed Morton

2005-11-16, 6:55 pm

Rufus T. Firefly wrote:

> Hello,
>
> my problem is to compress a long text file with recurrent patterns. I
> experimented with awk but could not find a satisfying solution. The
> compression should be lossless in the sense that line numbering is
> preserved when it is decompressed.
>
> The original file structure may look like this:
>
> a
> a
> a
> a
> b
> a
> b
> a
> b
> c
> d
> d
>
> which should be compressed to something similar as
>
> 4 a
> 2 b a <-- note pattern [b a] spans several lines
> 1 b
> 1 c
> 2 d
>
> or even better
>
> 3 a
> 3 a b <-- or [a b] here
> 1 c
> 2 d
>
> where the numbers are the counts of the occurrence of pattern.
>
> This is similar to the bash command uniq but should span patterns
> including more than one line (I am aware of the possible ambiguity here
> that does not really matter). Using uniq twice (with sort) on the above
> example is not really what I want (the pattern [a b] is lost):
> $ cat example |uniq -c |sort |uniq -c
> 2 1 a
> 3 1 b
> 1 1 c
> 1 2 d
> 1 4 a
> (Also I loose the original sequence with the application of sort.)
>
> Has anybody a simple answer to this question?
> Thanks!


So, if your input file was twice that length and repeated once, i.e.:

a
a
a
a
b
a
b
a
b
c
d
d
a
a
a
a
b
a
b
a
b
c
d
d

would you expect the result to be:

2 a a a a b a b a b c d d

or:

3 a
3 a b
1 c
2 d
3 a
3 a b
1 c
2 d

or something else? I assume the former since it uses less characters but
really what you should want is something like this:

2 3 a 3 a b 1 c 2 d

Now, if you provide the algorithm to do that (i.e. the hard part), I
expect some of the regulars here could help you code it in awk. Don't
expect it to run fast though ;-).

Ed.
Ed Morton

2005-11-16, 9:55 pm

Ed Morton wrote:
> Rufus T. Firefly wrote:
>
<snip>[color=darkred]
> Now, if you provide the algorithm to do that (i.e. the hard part), I
> expect some of the regulars here could help you code it in awk. Don't


You may want to take a look at
http://www.cs.princeton.edu/courses...5/cos333/j2.pdf as a
starting point.

Ed.
Michael Zawrotny

2005-11-17, 6:55 pm

Rufus T. Firefly <mb.atelier@web.de> wrote:
>
> my problem is to compress a long text file with recurrent patterns. I
> experimented with awk but could not find a satisfying solution. The
> compression should be lossless in the sense that line numbering is
> preserved when it is decompressed.
>
> The original file structure may look like this:
>
> a
> a
> a
> a


[ snip ]

>
> which should be compressed to something similar as
>
> 4 a


[ snip ]

This kind of compression is called "run length encoding". You should
be able to find it in books on algorithms and/or compression. I have
a vague recollection of it being mentioned in Kernighan & Pike's "The
Practice of Programming". You may be lucky if it is, since many of
the programs in the book were written in awk (and other languages as
well). I tried to look for it online, but the Bell labs site seems
to be down now.


Mike

--
Michael Zawrotny
Institute of Molecular Biophysics
Florida State University | email: zawrotny@sb.fsu.edu
Tallahassee, FL 32306-4380 | phone: (850) 644-0069
William James

2005-11-18, 7:55 am

Rufus T. Firefly wrote:
> Hello,
>
> my problem is to compress a long text file with recurrent patterns. I
> experimented with awk but could not find a satisfying solution. The
> compression should be lossless in the sense that line numbering is
> preserved when it is decompressed.
>
> The original file structure may look like this:
>
> a
> a
> a
> a
> b
> a
> b
> a
> b
> c
> d
> d
>
> which should be compressed to something similar as
>
> 4 a
> 2 b a <-- note pattern [b a] spans several lines
> 1 b
> 1 c
> 2 d
>
> or even better
>
> 3 a
> 3 a b <-- or [a b] here
> 1 c
> 2 d
>
> where the numbers are the counts of the occurrence of pattern.



Since no one has given an Awk solution, and since
it seems easier in Ruby:

def compress( data )

if "" == data
return [ ]
end

params = [ 0, 1, data, "", "" ]

data.scan( /[^\n]+\n/ ) { |s|
before = $~.pre_match
s << $~.post_match
s =~ /\A(.+\n)\1+/m
if $&
saved_bytes = $&.size - $1.size
times = $&.size / $1.size
if saved_bytes > params.first
params = [ saved_bytes, times, $1, before, $~.post_match ]
end
end
}

[ compress(params[-2]), params[1], params[2], compress(params[-1]) ]
end

data = ARGF.read

compress( data ).flatten.compact.each_with_index { |x,i|
if i%2 == 0
print x, " "
else
puts x.gsub(/\n/, " ").strip
end
}

---------------------------------------------------------

Using your input, the output is
3 a
3 a b
1 c
2 d

Rufus T. Firefly

2005-11-19, 6:55 pm

William James wrote:
> Rufus T. Firefly wrote:
>
>
>
> Since no one has given an Awk solution, and since
> it seems easier in Ruby:


> Using your input, the output is
> 3 a
> 3 a b
> 1 c
> 2 d
>


Hi!
Thanks, William, for your ruby solution. I tried it out, first on some
sample data, then on my actual data file. While on the former it worked
quite as it should, the larger data file turned out wrong:

2 a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a
a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a
a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a
a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a
a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a b c
a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a
a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a
a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a
a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a
a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a
a a a a a a a a a a a a a a a a a b c a a a a a a a a a a a a a a a a a
a a a
2 a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a
a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a
a a a a a a a a a a a a a a a a a
1 a b c
2 a a a a a a a a a a a a
1 a

Which is obviously flawed as a series of a's could be condensed.

(I am not giving the source here, it is too long. I just took the first
1000 entries. The whole file has 142145 lines. The strings were recoded
as a,b,c for convenience - in this case:

1 a C7 00 01 01 C6 C8
2 b C6 00 01 02 C5 C8
3 c C7 00 02 01 C5 C8

I am not familiar with ruby and have difficulties improving it.

As I am looking for ways to accomplish the task in awk, my first attempt
was to calculate the run length, as Michael indicated. But I got stuck
quickly. Maybe someone here on the list can spot the idea that I had:

BEGIN { m=rep=0 }
{ r=p=NR-nr0 # position p=1,2,..
while(r--) if(n=N[q=p-r,$0]++) break # q=1,2,..,p
if(n && !rep) rep=q # position q repeated!
if(rep) m=p-rep # m after repeat
print p, q, rep, m, n=N[q,$0], $0
}

(the printing is only for visualizing the counters. What I wanted to do
is basing the decision when to print on a break in the flow of a
repeating sequence and start over.)

So far,

RTF
nt4-ever

2005-11-19, 6:55 pm

> a vague recollection of it being mentioned in Kernighan & Pike's "The
> Practice of Programming". You may be lucky if it is, since many of
> the programs in the book were written in awk


it is here: http://cm.bell-labs.com/cm/cs/tpop/index.html
The Practice of Programming
http://cm.bell-labs.com/cm/cs/tpop/code.html

also from Bell Labs: Programming Pearls
http://www.cs.bell-labs.com/cm/cs/pearls/index.html
Web References for Programming Pearls
http://www.cs.bell-labs.com/cm/cs/pearls/webrefs.html
"Column 10: Squeezing Space
The column talks about some simple approaches to
squeezing space. For a high-tech compression scheme,
let Craig Nevill-Manning and Ian Witten demonstrate
the sequitur system for inferring hierarchies from sequences. "

:rod--sacramento

William James

2005-11-20, 7:55 am

Rufus T. Firefly wrote:
> William James wrote:
>
>
> Hi!
> Thanks, William, for your ruby solution. I tried it out, first on some
> sample data, then on my actual data file. While on the former it worked
> quite as it should, the larger data file turned out wrong:
>
> 2 a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a
> a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a
> a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a
> a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a
> a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a b c
> a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a
> a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a
> a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a
> a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a
> a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a
> a a a a a a a a a a a a a a a a a b c a a a a a a a a a a a a a a a a a
> a a a
> 2 a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a
> a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a
> a a a a a a a a a a a a a a a a a
> 1 a b c
> 2 a a a a a a a a a a a a
> 1 a
>
> Which is obviously flawed as a series of a's could be condensed.
>
> (I am not giving the source here, it is too long. I just took the first
> 1000 entries. The whole file has 142145 lines. The strings were recoded
> as a,b,c for convenience - in this case:
>
> 1 a C7 00 01 01 C6 C8
> 2 b C6 00 01 02 C5 C8
> 3 c C7 00 02 01 C5 C8



A new version. With the original data:

3 a
3 a b
1 c
2 d

With the new data:

176 a
4 1 a b c 196 a
1 a b c
25 a

------------------------------------------------------------

def compress( data )

if "" == data
return nil
end

params = [ 0, 1, data.dup, "", "" ]

data.scan( /[^\n]+\n/ ) { |s|
before = $~.pre_match
s << $~.post_match

s =~ /\A(.+?\n)\1+/m

if $&
saved_bytes = $&.size - $1.size
times = $&.size / $1.size
if saved_bytes > params.first
params = [ saved_bytes, times, $1, before, $~.post_match ]
end
end

}

if params.first == 0
[ compress(params[-2]), params[1], params[2],
compress(params[-1]) ]
else
[ compress(params[-2]), [params[1], compress(params[2])],
compress(params[-1]) ]
end
end

def num?(x); x.class.to_s=="Fixnum"; end

def show_line( ary )
ary = ary.flatten.compact
limit = ary.select{|x| num?(x) }.size > 2 ? 1 : 0
ary = ary.inject([]){|a,x|
a << x unless num?(x) && num?(a.last) && a.size > limit
a
}
puts ary.map{|x| x.to_s.gsub("\n"," ").strip}.join(" ")
end

def show( ary )
ary.compact!
if ary == ary.flatten
show_line( ary )
else
ary.each {|x|
x.compact!
if num?(x.first)
show_line( x )
else
show( x )
end
}
end
end

data = ARGF.read
show( compress( data ) )

Rufus T. Firefly

2005-11-20, 7:55 am

William James wrote:

> A new version. With the original data:
>
> 3 a
> 3 a b
> 1 c
> 2 d
>
> With the new data:
>
> 176 a
> 4 1 a b c 196 a
> 1 a b c
> 25 a
>



Many thanks for your effort. Unfortunately computation time seems to
grow quadratically with string length *and* number of lines (sigh)!
Maybe because of the programs recursive nature?

As a workaround I compressed once with `uniq -c` before applying it.
That should do for now.

Thanks to everyone and for valuable references...
RTF
Loki Harfagr

2005-11-29, 7:55 am

Le Sun, 20 Nov 2005 14:47:08 +0100, Rufus T. Firefly a écrit_:

> William James wrote:
>
>
>
> Many thanks for your effort. Unfortunately computation time seems to
> grow quadratically with string length *and* number of lines (sigh)!
> Maybe because of the programs recursive nature?
>
> As a workaround I compressed once with `uniq -c` before applying it.
> That should do for now.
>
> Thanks to everyone and for valuable references...
> RTF


If you still want to try awk, here's a very
timid starter.
The commented prints are for looking at the engine
being clumsy (It doesn't pair doubletons like "a b").

With your first original sample data it gives :
a 4
b 1
a 1
b 1
a 1
b 1
c 1
d 2

Maybe someone (maybe me) will find some time to improve
the poor thing :D)

$ cat RLEinawk.awk
BEGIN{
### print "========="
}
(imp[$0]>0){
pot=$0
imp[pot]++;
### print "["FNR"] put "$0" in accu, count "imp[$0]
next
}
{
### print "["FNR"] We read a newt "$0", first depot "pot" ->"imp[pot]
if(FNR>1)print pot" "imp[pot]
imp[pot]=0
pot=$0
### print "Then put the newt "pot" into accu"
imp[pot]++
next
}
END{
### print "In the end, depot the rest"
if(FNR>1)print pot" "imp[pot]
### print "========="
}

Rufus T. Firefly

2005-11-30, 7:55 am

Loki Harfagr wrote:
>
> If you still want to try awk, here's a very
> timid starter.
> The commented prints are for looking at the engine
> being clumsy (It doesn't pair doubletons like "a b").
>

discovering (and compressing) recurrent patterns IS the whole point of
this exercise

> With your first original sample data it gives :
> a 4
> b 1
> a 1
> b 1
> a 1
> b 1
> c 1
> d 2
>
> Maybe someone (maybe me) will find some time to improve
> the poor thing :D)
>
> $ cat RLEinawk.awk
> BEGIN{
> ### print "========="
> }
> (imp[$0]>0){
> pot=$0
> imp[pot]++;
> ### print "["FNR"] put "$0" in accu, count "imp[$0]
> next
> }
> {
> ### print "["FNR"] We read a newt "$0", first depot "pot" ->"imp[pot]
> if(FNR>1)print pot" "imp[pot]
> imp[pot]=0
> pot=$0
> ### print "Then put the newt "pot" into accu"
> imp[pot]++
> next
> }
> END{
> ### print "In the end, depot the rest"
> if(FNR>1)print pot" "imp[pot]
> ### print "========="
> }
>


Your code does `uniq -c` in an obscure way.

$ uniq -c example
4 a
1 b
1 a
1 b
1 a
1 b
1 c
2 d

Maybe I did not quite see your point. - Looking forward to your
improvement...

As Ed pointed out

$ cat example example |awk -f RLEinawk.awk

should print something like
2 example

if you understand what I mean.

RTF
Loki Harfagr

2005-12-02, 8:51 pm

Le Wed, 30 Nov 2005 13:09:53 +0100, Rufus T. Firefly a écrit_:

> Loki Harfagr wrote:
> discovering (and compressing) recurrent patterns IS the whole point of
> this exercise


Oh yes! Indeed I missed your point, I thought you were lacking of
a starting architecture for a compressor type engine. While you
really did want an algo !-) Sorru for this !

....

> Your code does `uniq -c` in an obscure way.


Mmm, I wouldn't say it is that obscure :D)

> $ uniq -c example
> 4 a
> 1 b
> 1 a
> 1 b
> 1 a
> 1 b
> 1 c
> 2 d
>
> Maybe I did not quite see your point.


It seems we both had difficulties in reading each other, I think now it
is clear, and yes it *has* to look like a 'uniq -c' which is a good start
for a compression engine (collect and count element as to build a hash
dictionnary). It could be a basis for a huffman compressor, why don't you
use a huffman ? Is your research for research (intelectual curiosity) or
do you intend to actually use the thing in real process ?

> - Looking forward to your
> improvement...


Well, I don't know if I'll have time to try but if I can
get some relief off preparing Xmas presents, cards, apologies for parties,
I'll have a try :-)

> As Ed pointed out
>
> $ cat example example |awk -f RLEinawk.awk
>
> should print something like
> 2 example
>
> if you understand what I mean.


Yes, I do, of course, but you reckon that using this type of algo
on a "freely sized" dictionnary will be a very resource consuming
task for a computer, at best it'll need "factory of number of elements"
loop, if you have big files and very heterogenous you'll crash fairly
quick !-)
Of course if you can define a closed context of data you may find
out some tricks to do it in a refined optimized way :-)

See you.
James

2005-12-05, 9:55 pm


Rufus T. Firefly wrote:
> William James wrote:
>
>
>
> Many thanks for your effort. Unfortunately computation time seems to
> grow quadratically with string length *and* number of lines (sigh)!
> Maybe because of the programs recursive nature?
>
> As a workaround I compressed once with `uniq -c` before applying it.
> That should do for now.
>
> Thanks to everyone and for valuable references...
> RTF


Using perl,

#!/bin/env perl -w

my ($match, $prev, $next, $a, $b, $m, $c);
$_ = join "",<>;

while (/(.+?)\1{1,}/gs) {
$match = $&;
$prev = $`;
$next = $';
if ($prev) {for $b (split /[\s\n]/,$prev) {print "1 $b\n" if
$b}};
$a = $1;
$m = $match;
$c = 0;
$c++ while $m =~ /$a/g;
$a =~ s/\n/ /g;
$a =~ s/(^\s+|\s+$)//;
print "$c $a\n" if $a;
s/$prev$match//;
}
print $next;


James

Kenny McCormack

2005-12-05, 9:55 pm

In article <1133834179.408266.255260@o13g2000cwo.googlegroups.com>,
James <hslee@ind.alcatel.com> wrote:
....
>Using perl,


Off topic gibberish - deleted.

Sponsored Links







Also available: Server administration forum archive | Web Design forum archive | Software forum archive | Hardware reviews archive

Copyright 2008 codecomments.com