Author gzip
iskywalker@gmx.de

2005-04-27, 8:55 am

Hi!
I have two files, bla.pgm and blu.pgm.
If I concatenate them:
cat blu.pgm bla.pgm > blo.pgm
gzip blo.pgm
and look at the size of blo.pgm.gz, it is bigger than
the sum of blu.pgm.gz and bla.pgm.gz.
How can that be? Is the dictionary badly constructed?
At first I thought it was the hash table size, so I reduced
the size of blo.pgm to below 32700, but it makes no difference...
What would be a plausible explanation for this?
Thanks in advance for any answers.
Fernando Benites

Matt Mahoney

2005-04-27, 3:55 pm


iskywal...@gmx.de wrote:
> Hi!
> I have two files, bla.pgm and blu.pgm.
> If I concatenate them:
> cat blu.pgm bla.pgm > blo.pgm
> gzip blo.pgm
> and look at the size of blo.pgm.gz, it is bigger than
> the sum of blu.pgm.gz and bla.pgm.gz.
> How can that be? Is the dictionary badly constructed?
> At first I thought it was the hash table size, so I reduced
> the size of blo.pgm to below 32700, but it makes no difference...
> What would be a plausible explanation for this?
> Thanks in advance for any answers.
> Fernando Benites


If the files are statistically different, then you are better off
compressing them separately. In the Calgary corpus, compressing geo
and pic separately from the text files gives a smaller total than
compressing a tar file of the whole corpus, with most compressors.
If the files are similar, like paper1 and paper2, you are better off
compressing them together.
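
This is easy to try out. Below is a minimal sketch in Python, assuming the standard zlib module (the same deflate algorithm gzip uses, without the gzip file header) and synthetic stand-in data rather than the actual Calgary corpus files. The names deflated_size, compare, and make_text are made up for the illustration.

```python
import random
import zlib
from typing import Tuple


def deflated_size(data: bytes) -> int:
    """Size in bytes of data after zlib (deflate) compression at level 9."""
    return len(zlib.compress(data, 9))


def compare(a: bytes, b: bytes) -> Tuple[int, int]:
    """Return (total size compressed separately, size compressed together)."""
    return deflated_size(a) + deflated_size(b), deflated_size(a + b)


rng = random.Random(0)  # fixed seed so runs are repeatable

# Stand-ins for two similar files: both draw phrases from one shared pool.
letters = "abcdefghijklmnopqrstuvwxyz"
pool = ["".join(rng.choice(letters) for _ in range(20)) for _ in range(200)]


def make_text(n_phrases: int) -> bytes:
    """Build a text of n_phrases phrases picked at random from the pool."""
    return " ".join(rng.choice(pool) for _ in range(n_phrases)).encode("ascii")


paper_a = make_text(300)
paper_b = make_text(300)

# Stand-in for a statistically different file: unstructured high-valued bytes.
binary = bytes(rng.randrange(200, 256) for _ in range(len(paper_a)))

similar = compare(paper_a, paper_b)
different = compare(paper_a, binary)
print("similar files   (separate, together):", similar)
print("different files (separate, together):", different)
```

For real files, read them with open(path, "rb").read() and pass the bytes to compare. The exact numbers depend on the data and on the zlib version, so treat the output as an indication only.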

-- Matt Mahoney

iskywalker@gmx.de

2005-05-01, 8:55 pm

"Matt Mahoney" <matmahoney@yahoo.com> wrote in message news:<1114603728.819584.23010@o13g2000cwo.googlegroups.com>...
> If the files are statistically different, then you are better off
> compressing them separately. In the Calgary corpus, compressing geo
> and pic separately from the text files gives a smaller total than
> compressing a tar file of the whole corpus, with most compressors.
> If the files are similar, like paper1 and paper2, you are better off
> compressing them together.
>
> -- Matt Mahoney

Hi!
Thanks for the answer! But do you have an explanation for it?
I mean, gzip should compress optimally, independent of the input size...
If I reorder parts of the data, maybe I get a better result
(though of course I won't sort the data alphabetically). Why can't gzip
build the best dictionary for compressing the data? (I am really asking
for the reason, not for a better method of compressing it, since I am
interested in writing a better program that uses entropy coding and LZ
algorithms for compression. The speed of the method is not important.)
Thanks again
Fernando Benites

Matt Mahoney

2005-05-01, 8:55 pm

iskywal...@gmx.de wrote:
> "Matt Mahoney" <matmahoney@yahoo.com> wrote in message news:<1114603728.819584.23010@o13g2000cwo.googlegroups.com>...
> Hi!
> Thanks for the answer! But do you have an explanation for it?
> I mean, gzip should compress optimally, independent of the input size...
> If I reorder parts of the data, maybe I get a better result
> (though of course I won't sort the data alphabetically). Why can't gzip
> build the best dictionary for compressing the data? (I am really asking
> for the reason, not for a better method of compressing it, since I am
> interested in writing a better program that uses entropy coding and LZ
> algorithms for compression. The speed of the method is not important.)
> Thanks again
> Fernando Benites


For an ideal compressor, concatenating files with different statistics
shouldn't hurt compression, but in practice it often does, because a
compressor may fit its model to the average statistics of the whole
input. Some advanced compressors find boundaries between data types,
but they are slower and the method is not perfect.
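
One crude way to picture boundary finding, and not how any particular compressor does it: try a number of cut points, compress the two parts separately, and keep the cut that gives the smallest total. The sketch below assumes Python's zlib module and synthetic data; best_split and its step parameter are hypothetical names. It also shows why such methods are slower, since every candidate cut costs two extra compressions.

```python
import random
import zlib
from typing import Tuple


def size(data: bytes) -> int:
    """Size in bytes of data after zlib compression at level 9."""
    return len(zlib.compress(data, 9))


def best_split(data: bytes, step: int = 512) -> Tuple[int, int]:
    """Return (cut offset, total compressed size) of the best single split.

    A cut offset of 0 means that compressing the data in one piece was
    at least as good as every split that was tried.
    """
    best_cut, best_total = 0, size(data)
    for cut in range(step, len(data), step):
        total = size(data[:cut]) + size(data[cut:])
        if total < best_total:
            best_cut, best_total = cut, total
    return best_cut, best_total


rng = random.Random(1)  # fixed seed so runs are repeatable

# Synthetic input: a text-like section followed by an unstructured section.
words = [b"alpha", b"beta", b"gamma", b"delta"]
text = b" ".join(rng.choice(words) for _ in range(1500))
noise = bytes(rng.randrange(200, 256) for _ in range(8000))
sample = text + noise

cut, total = best_split(sample)
print("true boundary at", len(text))
print("best cut found at", cut, "with total size", total)
print("size in one piece:", size(sample))
```

A finer step gets closer to the true boundary at the price of more compressions. Real inputs may need more than one cut, which makes the search more expensive still.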

gzip works by replacing repeated strings with pointers to previous
occurrences in a 32 KB sliding window. It can either use a default
(fixed) Huffman table to code the literals and pointers, or explicitly
transmit a custom table matched to the statistics of the data for
better compression. If it uses the default table, then concatenation
should not make any difference, because there will be few pointers
from one section back into the other. However, if it uses a custom
table, then the table will not be optimized for either section, so
compression will suffer.
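
Both points can be probed with Python's zlib module. This is a rough sketch on synthetic data, not a statement about what gzip did with the original .pgm files. The helper names deflate, random_bytes, and cost_of_repeat are invented here. zlib.Z_FIXED is the zlib strategy that prevents the use of custom (dynamic) Huffman tables.

```python
import random
import zlib


def deflate(data: bytes, strategy: int = zlib.Z_DEFAULT_STRATEGY) -> bytes:
    """Compress at level 9 with a 32 KB window and the given strategy."""
    comp = zlib.compressobj(level=9, wbits=15, strategy=strategy)
    return comp.compress(data) + comp.flush()


rng = random.Random(2)  # fixed seed so runs are repeatable


def random_bytes(n: int) -> bytes:
    """Return n unstructured bytes (practically incompressible)."""
    return bytes(rng.getrandbits(8) for _ in range(n))


# 1. Sliding window: a second copy of a chunk is cheap only while the
#    first copy is still inside the 32 KB window.
chunk = random_bytes(4000)


def cost_of_repeat(gap: int) -> int:
    """Extra compressed bytes for repeating chunk after gap filler bytes."""
    filler = random_bytes(gap)
    with_repeat = len(deflate(chunk + filler + chunk))
    without_repeat = len(deflate(chunk + filler))
    return with_repeat - without_repeat


near = cost_of_repeat(10_000)  # first copy still within the window
far = cost_of_repeat(40_000)   # first copy has slid out of the window
print("cost of repeat, 10000 bytes back:", near)
print("cost of repeat, 40000 bytes back:", far)

# 2. Huffman tables: compare the default strategy (which may send a
#    custom table) with Z_FIXED (default table only).
words = [b"gzip", b"window", b"huffman", b"pointer", b"table", b"match"]
text = b" ".join(rng.choice(words) for _ in range(3000))
mixed = text + random_bytes(len(text))

for name, data in (("text", text), ("mixed", mixed)):
    default_size = len(deflate(data))
    fixed_size = len(deflate(data, zlib.Z_FIXED))
    print(name, "default strategy:", default_size, "fixed table:", fixed_size)
```

To test the explanation on real data, replace text and mixed with the contents of the individual files and of the concatenated file, and compare the separate and combined sizes under each strategy.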

-- Matt Mahoney

Copyright 2008 codecomments.com