Code Comments
Hi!

I have two files, bla.pgm and blu.pgm. If I concatenate them and compress the result:

    cat blu.pgm bla.pgm > blo.pgm
    gzip blo.pgm

then blo.pgm.gz is bigger than the sum of the sizes of blu.pgm.gz and bla.pgm.gz. How can that be? Is the dictionary badly constructed? At first I thought it was the hash table size, so I reduced blo.pgm to below 32700 bytes, but it made no difference.

What would be a plausible explanation?

Thanks in advance for the answers,
Fernando Benites

iskywal...@gmx.de wrote:
> Hi!
> I have two files, bla.pgm and blu.pgm. If I concatenate them:
>     cat blu.pgm bla.pgm > blo.pgm
>     gzip blo.pgm
> then blo.pgm.gz is bigger than the sum of the sizes of
> blu.pgm.gz and bla.pgm.gz.
> How can that be? Is the dictionary badly constructed?
> At first I thought it was the hash table size, so I reduced
> blo.pgm to below 32700 bytes, but it made no difference.
> What would be a plausible explanation?
> Thanks in advance for the answers,
> Fernando Benites

If the files are statistically different, you are better off compressing them separately. In the Calgary corpus, with most compressors, compressing geo and pic separately from the text files gives a smaller total than compressing a tar file of the whole corpus. If the files are similar, like paper1 and paper2, you are better off compressing them together.

-- Matt Mahoney
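
The two cases described in this reply can be tried with a small experiment. What follows is only a minimal sketch under stated assumptions: it uses Python's zlib module (the same deflate algorithm as gzip, but a different container format) and synthetic pseudo-random data in place of the .pgm files or the Calgary corpus. The helper names `make_sample` and `compressed_sizes` are made up for illustration.

```python
import random
import zlib


def make_sample(alphabet: bytes, n: int, seed: int) -> bytes:
    """Return n pseudo-random bytes drawn uniformly from alphabet.

    Hypothetical helper: stands in for a file with its own statistics.
    """
    rng = random.Random(seed)
    return bytes(rng.choice(alphabet) for _ in range(n))


def compressed_sizes(a: bytes, b: bytes, level: int = 9) -> tuple[int, int]:
    """Return (separate, together) compressed sizes in bytes.

    separate: a and b compressed as two independent zlib streams.
    together: the concatenation a + b compressed as one stream.
    """
    separate = len(zlib.compress(a, level)) + len(zlib.compress(b, level))
    together = len(zlib.compress(a + b, level))
    return separate, together


if __name__ == "__main__":
    # Two inputs with disjoint alphabets, i.e. very different statistics.
    first = make_sample(b"abcd", 8000, seed=1)
    second = make_sample(bytes(range(200, 256)), 8000, seed=2)
    print("different statistics:", compressed_sizes(first, second))
    # The same content twice, i.e. as similar as two inputs can be.
    print("identical content:   ", compressed_sizes(first, first))
```

The exact byte counts depend on the zlib version, so none are quoted here. The identical-content case should clearly favour compressing together, because the second copy lies inside the 32K window and can be coded almost entirely as pointers to the first.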

"Matt Mahoney" <matmahoney@yahoo.com> wrote in message news:<1114603728.819584.23010@o13g2000cwo.googlegroups.com>...
> If the files are statistically different, you are better off
> compressing them separately. In the Calgary corpus, with most
> compressors, compressing geo and pic separately from the text files
> gives a smaller total than compressing a tar file of the whole
> corpus. If the files are similar, like paper1 and paper2, you are
> better off compressing them together.
>
> -- Matt Mahoney

Hi!

Thanks for the answer! But do you have an explanation for it? I mean, gzip should compress optimally and not depend on size. If I rearranged parts of the data, I might get a better result (though of course I won't sort the data alphabetically). Why can't gzip build the best dictionary for compressing the data? I am really asking for the reason, not for a better compression method, since I am interested in writing a better program that uses entropy and LZ algorithms for compression. Speed is not important.

Thanks again,
Fernando Benites

iskywal...@gmx.de wrote:
> "Matt Mahoney" <matmahoney@yahoo.com> wrote in message
> news:<1114603728.819584.23010@o13g2000cwo.googlegroups.com>...
>
> Hi!
> Thanks for the answer! But do you have an explanation for it?
> I mean, gzip should compress optimally and not depend on size.
> If I rearranged parts of the data, I might get a better result
> (though of course I won't sort the data alphabetically). Why can't
> gzip build the best dictionary for compressing the data? I am really
> asking for the reason, not for a better compression method, since I
> am interested in writing a better program that uses entropy and LZ
> algorithms for compression. Speed is not important.
> Thanks again,
> Fernando Benites

For an ideal compressor, concatenating files with different statistics shouldn't hurt compression, but often it does, because a compressor may optimize its model for the average statistics of the whole input. Some advanced compressors find the boundaries between data types, but they are slower and the method is not perfect.

gzip works by replacing repeated strings with pointers to previous occurrences in a 32K sliding window. It then Huffman-codes the literals and pointers, either with a default table or with a custom table that it transmits explicitly to match the statistics of the data for better compression. If it uses the default table, concatenation should make no difference, because there will be few pointers between the two sections. If it uses a custom table, however, that table will not be optimized for either section, so compression will suffer.

-- Matt Mahoney
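
The custom-table argument can be illustrated with an order-0 entropy calculation. This is a simplified sketch, not what gzip actually computes: it ignores the LZ77 pointers, the cost of transmitting the table, and the fact that Huffman code lengths are whole numbers of bits. The function name `order0_cost_bits` and the two sample byte strings are assumptions introduced here for illustration.

```python
import math
from collections import Counter


def order0_cost_bits(data: bytes) -> float:
    """Ideal cost in bits of coding data with one code fitted to its
    own byte frequencies (n times the order-0 entropy)."""
    counts = Counter(data)
    n = len(data)
    return -sum(c * math.log2(c / n) for c in counts.values())


if __name__ == "__main__":
    # Two sections with different byte statistics (disjoint alphabets).
    section_a = b"aaaaaaab" * 100
    section_b = b"xyxyxyzz" * 100

    # One code per section versus a single code shared by both.
    separate = order0_cost_bits(section_a) + order0_cost_bits(section_b)
    shared = order0_cost_bits(section_a + section_b)

    print(f"separate codes: {separate:.0f} bits")
    print(f"shared code:    {shared:.0f} bits")
    print(f"penalty:        {shared - separate:.0f} bits")
```

Because entropy is concave, the shared cost in this model is never below the separate cost, and the two are equal only when both sections have the same byte frequencies. A real compressor also has to pay for each table it transmits, which this sketch leaves out.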