Home > Archive > AWK > December 2005 > general compression with awk
You are viewing an archived Text-only version of the thread.
To view this thread in it's original format and/or if you want to reply to
this thread please [click here]
| Author |
general compression with awk
|
|
| Rufus T. Firefly 2005-11-16, 6:55 pm |
| Hello,
my problem is to compress a long text file with recurrent patterns. I
experimented with awk but could not find a satisfying solution. The
compression should be lossless in the sense that line numbering is
preserved when it is decompressed.
The original file structure may look like this:
a
a
a
a
b
a
b
a
b
c
d
d
which should be compressed to something similar as
4 a
2 b a <-- note pattern [b a] spans several lines
1 b
1 c
2 d
or even better
3 a
3 a b <-- or [a b] here
1 c
2 d
where the numbers are the counts of the occurrence of pattern.
This is similar to the bash command uniq but should span patterns
including more than one line (I am aware of the possible ambiguity here
that does not really matter). Using uniq twice (with sort) on the above
example is not really what I want (the pattern [a b] is lost):
$ cat example |uniq -c |sort |uniq -c
2 1 a
3 1 b
1 1 c
1 2 d
1 4 a
(Also I loose the original sequence with the application of sort.)
Has anybody a simple answer to this question?
Thanks!
| |
| Ed Morton 2005-11-16, 6:55 pm |
| Rufus T. Firefly wrote:
> Hello,
>
> my problem is to compress a long text file with recurrent patterns. I
> experimented with awk but could not find a satisfying solution. The
> compression should be lossless in the sense that line numbering is
> preserved when it is decompressed.
>
> The original file structure may look like this:
>
> a
> a
> a
> a
> b
> a
> b
> a
> b
> c
> d
> d
>
> which should be compressed to something similar as
>
> 4 a
> 2 b a <-- note pattern [b a] spans several lines
> 1 b
> 1 c
> 2 d
>
> or even better
>
> 3 a
> 3 a b <-- or [a b] here
> 1 c
> 2 d
>
> where the numbers are the counts of the occurrence of pattern.
>
> This is similar to the bash command uniq but should span patterns
> including more than one line (I am aware of the possible ambiguity here
> that does not really matter). Using uniq twice (with sort) on the above
> example is not really what I want (the pattern [a b] is lost):
> $ cat example |uniq -c |sort |uniq -c
> 2 1 a
> 3 1 b
> 1 1 c
> 1 2 d
> 1 4 a
> (Also I loose the original sequence with the application of sort.)
>
> Has anybody a simple answer to this question?
> Thanks!
So, if your input file was twice that length and repeated once, i.e.:
a
a
a
a
b
a
b
a
b
c
d
d
a
a
a
a
b
a
b
a
b
c
d
d
would you expect the result to be:
2 a a a a b a b a b c d d
or:
3 a
3 a b
1 c
2 d
3 a
3 a b
1 c
2 d
or something else? I assume the former since it uses less characters but
really what you should want is something like this:
2 3 a 3 a b 1 c 2 d
Now, if you provide the algorithm to do that (i.e. the hard part), I
expect some of the regulars here could help you code it in awk. Don't
expect it to run fast though ;-).
Ed.
| |
| Ed Morton 2005-11-16, 9:55 pm |
| Ed Morton wrote:
> Rufus T. Firefly wrote:
>
<snip>[color=darkred]
> Now, if you provide the algorithm to do that (i.e. the hard part), I
> expect some of the regulars here could help you code it in awk. Don't
You may want to take a look at
http://www.cs.princeton.edu/courses...5/cos333/j2.pdf as a
starting point.
Ed.
| |
| Michael Zawrotny 2005-11-17, 6:55 pm |
| Rufus T. Firefly <mb.atelier@web.de> wrote:
>
> my problem is to compress a long text file with recurrent patterns. I
> experimented with awk but could not find a satisfying solution. The
> compression should be lossless in the sense that line numbering is
> preserved when it is decompressed.
>
> The original file structure may look like this:
>
> a
> a
> a
> a
[ snip ]
>
> which should be compressed to something similar as
>
> 4 a
[ snip ]
This kind of compression is called "run length encoding". You should
be able to find it in books on algorithms and/or compression. I have
a vague recollection of it being mentioned in Kernighan & Pike's "The
Practice of Programming". You may be lucky if it is, since many of
the programs in the book were written in awk (and other languages as
well). I tried to look for it online, but the Bell labs site seems
to be down now.
Mike
--
Michael Zawrotny
Institute of Molecular Biophysics
Florida State University | email: zawrotny@sb.fsu.edu
Tallahassee, FL 32306-4380 | phone: (850) 644-0069
| |
| William James 2005-11-18, 7:55 am |
| Rufus T. Firefly wrote:
> Hello,
>
> my problem is to compress a long text file with recurrent patterns. I
> experimented with awk but could not find a satisfying solution. The
> compression should be lossless in the sense that line numbering is
> preserved when it is decompressed.
>
> The original file structure may look like this:
>
> a
> a
> a
> a
> b
> a
> b
> a
> b
> c
> d
> d
>
> which should be compressed to something similar as
>
> 4 a
> 2 b a <-- note pattern [b a] spans several lines
> 1 b
> 1 c
> 2 d
>
> or even better
>
> 3 a
> 3 a b <-- or [a b] here
> 1 c
> 2 d
>
> where the numbers are the counts of the occurrence of pattern.
Since no one has given an Awk solution, and since
it seems easier in Ruby:
def compress( data )
if "" == data
return [ ]
end
params = [ 0, 1, data, "", "" ]
data.scan( /[^\n]+\n/ ) { |s|
before = $~.pre_match
s << $~.post_match
s =~ /\A(.+\n)\1+/m
if $&
saved_bytes = $&.size - $1.size
times = $&.size / $1.size
if saved_bytes > params.first
params = [ saved_bytes, times, $1, before, $~.post_match ]
end
end
}
[ compress(params[-2]), params[1], params[2], compress(params[-1]) ]
end
data = ARGF.read
compress( data ).flatten.compact.each_with_index { |x,i|
if i%2 == 0
print x, " "
else
puts x.gsub(/\n/, " ").strip
end
}
---------------------------------------------------------
Using your input, the output is
3 a
3 a b
1 c
2 d
| |
| Rufus T. Firefly 2005-11-19, 6:55 pm |
| William James wrote:
> Rufus T. Firefly wrote:
>
>
>
> Since no one has given an Awk solution, and since
> it seems easier in Ruby:
> Using your input, the output is
> 3 a
> 3 a b
> 1 c
> 2 d
>
Hi!
Thanks, William, for your ruby solution. I tried it out, first on some
sample data, then on my actual data file. While on the former it worked
quite as it should, the larger data file turned out wrong:
2 a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a
a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a
a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a
a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a
a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a b c
a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a
a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a
a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a
a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a
a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a
a a a a a a a a a a a a a a a a a b c a a a a a a a a a a a a a a a a a
a a a
2 a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a
a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a
a a a a a a a a a a a a a a a a a
1 a b c
2 a a a a a a a a a a a a
1 a
Which is obviously flawed as a series of a's could be condensed.
(I am not giving the source here, it is too long. I just took the first
1000 entries. The whole file has 142145 lines. The strings were recoded
as a,b,c for convenience - in this case:
1 a C7 00 01 01 C6 C8
2 b C6 00 01 02 C5 C8
3 c C7 00 02 01 C5 C8
I am not familiar with ruby and have difficulties improving it.
As I am looking for ways to accomplish the task in awk, my first attempt
was to calculate the run length, as Michael indicated. But I got stuck
quickly. Maybe someone here on the list can spot the idea that I had:
BEGIN { m=rep=0 }
{ r=p=NR-nr0 # position p=1,2,..
while(r--) if(n=N[q=p-r,$0]++) break # q=1,2,..,p
if(n && !rep) rep=q # position q repeated!
if(rep) m=p-rep # m after repeat
print p, q, rep, m, n=N[q,$0], $0
}
(the printing is only for visualizing the counters. What I wanted to do
is basing the decision when to print on a break in the flow of a
repeating sequence and start over.)
So far,
RTF
| |
|
|
| William James 2005-11-20, 7:55 am |
| Rufus T. Firefly wrote:
> William James wrote:
>
>
> Hi!
> Thanks, William, for your ruby solution. I tried it out, first on some
> sample data, then on my actual data file. While on the former it worked
> quite as it should, the larger data file turned out wrong:
>
> 2 a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a
> a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a
> a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a
> a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a
> a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a b c
> a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a
> a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a
> a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a
> a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a
> a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a
> a a a a a a a a a a a a a a a a a b c a a a a a a a a a a a a a a a a a
> a a a
> 2 a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a
> a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a
> a a a a a a a a a a a a a a a a a
> 1 a b c
> 2 a a a a a a a a a a a a
> 1 a
>
> Which is obviously flawed as a series of a's could be condensed.
>
> (I am not giving the source here, it is too long. I just took the first
> 1000 entries. The whole file has 142145 lines. The strings were recoded
> as a,b,c for convenience - in this case:
>
> 1 a C7 00 01 01 C6 C8
> 2 b C6 00 01 02 C5 C8
> 3 c C7 00 02 01 C5 C8
A new version. With the original data:
3 a
3 a b
1 c
2 d
With the new data:
176 a
4 1 a b c 196 a
1 a b c
25 a
------------------------------------------------------------
def compress( data )
if "" == data
return nil
end
params = [ 0, 1, data.dup, "", "" ]
data.scan( /[^\n]+\n/ ) { |s|
before = $~.pre_match
s << $~.post_match
s =~ /\A(.+?\n)\1+/m
if $&
saved_bytes = $&.size - $1.size
times = $&.size / $1.size
if saved_bytes > params.first
params = [ saved_bytes, times, $1, before, $~.post_match ]
end
end
}
if params.first == 0
[ compress(params[-2]), params[1], params[2],
compress(params[-1]) ]
else
[ compress(params[-2]), [params[1], compress(params[2])],
compress(params[-1]) ]
end
end
def num?(x); x.class.to_s=="Fixnum"; end
def show_line( ary )
ary = ary.flatten.compact
limit = ary.select{|x| num?(x) }.size > 2 ? 1 : 0
ary = ary.inject([]){|a,x|
a << x unless num?(x) && num?(a.last) && a.size > limit
a
}
puts ary.map{|x| x.to_s.gsub("\n"," ").strip}.join(" ")
end
def show( ary )
ary.compact!
if ary == ary.flatten
show_line( ary )
else
ary.each {|x|
x.compact!
if num?(x.first)
show_line( x )
else
show( x )
end
}
end
end
data = ARGF.read
show( compress( data ) )
| |
| Rufus T. Firefly 2005-11-20, 7:55 am |
| William James wrote:
> A new version. With the original data:
>
> 3 a
> 3 a b
> 1 c
> 2 d
>
> With the new data:
>
> 176 a
> 4 1 a b c 196 a
> 1 a b c
> 25 a
>
Many thanks for your effort. Unfortunately computation time seems to
grow quadratically with string length *and* number of lines (sigh)!
Maybe because of the programs recursive nature?
As a workaround I compressed once with `uniq -c` before applying it.
That should do for now.
Thanks to everyone and for valuable references...
RTF
| |
| Loki Harfagr 2005-11-29, 7:55 am |
| Le Sun, 20 Nov 2005 14:47:08 +0100, Rufus T. Firefly a écrit_:
> William James wrote:
>
>
>
> Many thanks for your effort. Unfortunately computation time seems to
> grow quadratically with string length *and* number of lines (sigh)!
> Maybe because of the programs recursive nature?
>
> As a workaround I compressed once with `uniq -c` before applying it.
> That should do for now.
>
> Thanks to everyone and for valuable references...
> RTF
If you still want to try awk, here's a very
timid starter.
The commented prints are for looking at the engine
being clumsy (It doesn't pair doubletons like "a b").
With your first original sample data it gives :
a 4
b 1
a 1
b 1
a 1
b 1
c 1
d 2
Maybe someone (maybe me) will find some time to improve
the poor thing :D)
$ cat RLEinawk.awk
BEGIN{
### print "========="
}
(imp[$0]>0){
pot=$0
imp[pot]++;
### print "["FNR"] put "$0" in accu, count "imp[$0]
next
}
{
### print "["FNR"] We read a newt "$0", first depot "pot" ->"imp[pot]
if(FNR>1)print pot" "imp[pot]
imp[pot]=0
pot=$0
### print "Then put the newt "pot" into accu"
imp[pot]++
next
}
END{
### print "In the end, depot the rest"
if(FNR>1)print pot" "imp[pot]
### print "========="
}
| |
| Rufus T. Firefly 2005-11-30, 7:55 am |
| Loki Harfagr wrote:
>
> If you still want to try awk, here's a very
> timid starter.
> The commented prints are for looking at the engine
> being clumsy (It doesn't pair doubletons like "a b").
>
discovering (and compressing) recurrent patterns IS the whole point of
this exercise
> With your first original sample data it gives :
> a 4
> b 1
> a 1
> b 1
> a 1
> b 1
> c 1
> d 2
>
> Maybe someone (maybe me) will find some time to improve
> the poor thing :D)
>
> $ cat RLEinawk.awk
> BEGIN{
> ### print "========="
> }
> (imp[$0]>0){
> pot=$0
> imp[pot]++;
> ### print "["FNR"] put "$0" in accu, count "imp[$0]
> next
> }
> {
> ### print "["FNR"] We read a newt "$0", first depot "pot" ->"imp[pot]
> if(FNR>1)print pot" "imp[pot]
> imp[pot]=0
> pot=$0
> ### print "Then put the newt "pot" into accu"
> imp[pot]++
> next
> }
> END{
> ### print "In the end, depot the rest"
> if(FNR>1)print pot" "imp[pot]
> ### print "========="
> }
>
Your code does `uniq -c` in an obscure way.
$ uniq -c example
4 a
1 b
1 a
1 b
1 a
1 b
1 c
2 d
Maybe I did not quite see your point. - Looking forward to your
improvement...
As Ed pointed out
$ cat example example |awk -f RLEinawk.awk
should print something like
2 example
if you understand what I mean.
RTF
| |
| Loki Harfagr 2005-12-02, 8:51 pm |
| Le Wed, 30 Nov 2005 13:09:53 +0100, Rufus T. Firefly a écrit_:
> Loki Harfagr wrote:
> discovering (and compressing) recurrent patterns IS the whole point of
> this exercise
Oh yes! Indeed I missed your point, I thought you were lacking of
a starting architecture for a compressor type engine. While you
really did want an algo !-) Sorru for this !
....
> Your code does `uniq -c` in an obscure way.
Mmm, I wouldn't say it is that obscure :D)
> $ uniq -c example
> 4 a
> 1 b
> 1 a
> 1 b
> 1 a
> 1 b
> 1 c
> 2 d
>
> Maybe I did not quite see your point.
It seems we both had difficulties in reading each other, I think now it
is clear, and yes it *has* to look like a 'uniq -c' which is a good start
for a compression engine (collect and count element as to build a hash
dictionnary). It could be a basis for a huffman compressor, why don't you
use a huffman ? Is your research for research (intelectual curiosity) or
do you intend to actually use the thing in real process ?
> - Looking forward to your
> improvement...
Well, I don't know if I'll have time to try but if I can
get some relief off preparing Xmas presents, cards, apologies for parties,
I'll have a try :-)
> As Ed pointed out
>
> $ cat example example |awk -f RLEinawk.awk
>
> should print something like
> 2 example
>
> if you understand what I mean.
Yes, I do, of course, but you reckon that using this type of algo
on a "freely sized" dictionnary will be a very resource consuming
task for a computer, at best it'll need "factory of number of elements"
loop, if you have big files and very heterogenous you'll crash fairly
quick !-)
Of course if you can define a closed context of data you may find
out some tricks to do it in a refined optimized way :-)
See you.
| |
|
|
Rufus T. Firefly wrote:
> William James wrote:
>
>
>
> Many thanks for your effort. Unfortunately computation time seems to
> grow quadratically with string length *and* number of lines (sigh)!
> Maybe because of the programs recursive nature?
>
> As a workaround I compressed once with `uniq -c` before applying it.
> That should do for now.
>
> Thanks to everyone and for valuable references...
> RTF
Using perl,
#!/bin/env perl -w
my ($match, $prev, $next, $a, $b, $m, $c);
$_ = join "",<>;
while (/(.+?)\1{1,}/gs) {
$match = $&;
$prev = $`;
$next = $';
if ($prev) {for $b (split /[\s\n]/,$prev) {print "1 $b\n" if
$b}};
$a = $1;
$m = $match;
$c = 0;
$c++ while $m =~ /$a/g;
$a =~ s/\n/ /g;
$a =~ s/(^\s+|\s+$)//;
print "$c $a\n" if $a;
s/$prev$match//;
}
print $next;
James
| |
| Kenny McCormack 2005-12-05, 9:55 pm |
| In article <1133834179.408266.255260@o13g2000cwo.googlegroups.com>,
James <hslee@ind.alcatel.com> wrote:
....
>Using perl,
Off topic gibberish - deleted.
|
|
|
|
|