Code Comments

gzip
Hi!
I have two files, bla.pgm and blu.pgm.
If I concatenate them:
cat blu.pgm bla.pgm > blo.pgm
gzip blo.pgm
and look at the size of blo.pgm.gz, it is bigger than
the sum of blu.pgm.gz and bla.pgm.gz.
How can that be? Is the dictionary badly constructed?
At first I thought it was the hash table size, so I reduced
the size of blo.pgm to below 32700, but it makes no difference...
What would be a plausible explanation for this?
Thanks in advance for any answers.
Fernando Benites

iskywalker@gmx.de
04-27-05 01:55 PM


Re: gzip
iskywal...@gmx.de wrote:
> Hi!
> I have two files, bla.pgm and blu.pgm.
> If I concatenate them:
> cat blu.pgm bla.pgm > blo.pgm
> gzip blo.pgm
> and look at the size of blo.pgm.gz, it is bigger than
> the sum of blu.pgm.gz and bla.pgm.gz.
> How can that be? Is the dictionary badly constructed?
> At first I thought it was the hash table size, so I reduced
> the size of blo.pgm to below 32700, but it makes no difference...
> What would be a plausible explanation for this?
> Thanks in advance for any answers.
> Fernando Benites

If the files are different statistically, then you are better off
compressing them separately.  In the Calgary corpus, compressing geo
and pic separately from the text files will compress smaller than a tar
file of the whole corpus with most compressors.  If the files are
similar, like paper1 and paper2, you are better off compressing them
together.

-- Matt Mahoney


Matt Mahoney
04-27-05 08:55 PM
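
The separate-versus-together effect described in the reply above can be reproduced on synthetic data. The following is only a minimal sketch, not the original poster's experiment: it assumes Python's zlib module (which produces the same DEFLATE stream that gzip wraps), and the two byte sections are made-up stand-ins for blu.pgm and bla.pgm, drawn from disjoint sets of byte values so that their statistics differ sharply. The helper names make_section and deflate_size are hypothetical.

```python
import random
import zlib


def make_section(rng, low, high, n):
    """Return n pseudo-random bytes drawn uniformly from range(low, high)."""
    return bytes(rng.randrange(low, high) for _ in range(n))


def deflate_size(data):
    """Size in bytes of data after zlib (DEFLATE) compression at level 9."""
    return len(zlib.compress(data, 9))


rng = random.Random(0)  # fixed seed so the run is repeatable

# Two sections with very different statistics (assumed stand-ins for
# the two image files): byte values 0..63 versus 128..191.
section_a = make_section(rng, 0, 64, 6000)
section_b = make_section(rng, 128, 192, 6000)

separate = deflate_size(section_a) + deflate_size(section_b)
together = deflate_size(section_a + section_b)

print("separate:", separate)
print("together:", together)
```

With these inputs the concatenated stream should come out larger than the two separate streams combined. Each section alone needs codes for only 64 distinct byte values, while a single code table covering both sections has to cover 128, which costs roughly one extra bit per byte.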


Re: gzip
"Matt Mahoney" <matmahoney@yahoo.com> wrote in message news:<1114603728.819584.23010@o13g2000cwo.googlegroups.com>...
> If the files are different statistically, then you are better off
> compressing them separately.  In the Calgary corpus, compressing geo
> and pic separately from the text files will compress smaller than a tar
> file of the whole corpus with most compressors.  If the files are
> similar, like paper1 and paper2, you are better off compressing them
> together.
>
> -- Matt Mahoney
Hi!
Thanks for the answer! But do you have an explanation for it?
I mean, gzip should compress optimally, independent of size...
If I rearrange some parts of the data, maybe I get a better result
(of course I won't sort the data alphabetically). Why can't gzip build the
best dictionary for compressing the data? I am really asking why (the
reason), not which other method would compress it better, since I am
interested in writing a better program that uses entropy coding and LZ
algorithms for compression. The speed of the method is not important.
Thanks again
Fernando Benites

iskywalker@gmx.de
05-02-05 01:55 AM


Re: gzip
iskywal...@gmx.de wrote:
> "Matt Mahoney" <matmahoney@yahoo.com> wrote in message
> news:<1114603728.819584.23010@o13g2000cwo.googlegroups.com>...
> Hi!
> Thanks for the answer! But do you have an explanation for it?
> I mean, gzip should compress optimally, independent of size...
> If I rearrange some parts of the data, maybe I get a better result
> (of course I won't sort the data alphabetically). Why can't gzip build the
> best dictionary for compressing the data? I am really asking why (the
> reason), not which other method would compress it better, since I am
> interested in writing a better program that uses entropy coding and LZ
> algorithms for compression. The speed of the method is not important.
> Thanks again
> Fernando Benites

For an ideal compressor, concatenating files with different statistics
shouldn't hurt compression, but often it does because a compressor may
optimize its model for the average-case statistics.  There are some
advanced compressors that find boundaries between data types, but they
are slower and the method is not perfect.

gzip works by replacing repeated strings with pointers to previous
occurrences in a 32 KB sliding window.  It can either use a default
Huffman table to code the pointers, or explicitly transmit a better
table to match the statistics of the file for better compression.  If
it uses the default table, then it should not make any difference
because there will be few pointers between the two sections.  However,
if it uses a custom table, then that table will not be optimized for
either section, so compression will suffer.

-- Matt Mahoney


Matt Mahoney
05-02-05 01:55 AM
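
The 32 KB sliding window mentioned in the reply above can be observed directly. This is a rough sketch under stated assumptions: it uses Python's zlib module (the same DEFLATE format gzip uses) and pseudo-random filler bytes, which are incompressible on their own, so any saving has to come from a pointer back to an earlier copy. The helper random_bytes and the chosen sizes are illustrative only.

```python
import random
import zlib

WINDOW = 32 * 1024  # DEFLATE back-references reach at most 32 KiB


def random_bytes(rng, n):
    """Return n pseudo-random bytes (essentially incompressible)."""
    return bytes(rng.getrandbits(8) for _ in range(n))


rng = random.Random(1)  # fixed seed so the run is repeatable
block = random_bytes(rng, 20000)  # shorter than the window
gap = random_bytes(rng, 40000)    # longer than the window

# Second copy starts 20000 bytes after the first: inside the window.
near = zlib.compress(block + block)

# Second copy starts 60000 bytes after the first: outside the window.
far = zlib.compress(block + gap + block)

print("window size:          ", WINDOW)
print("block + block:        ", len(block) * 2, "->", len(near))
print("block + gap + block:  ", len(block) * 2 + len(gap), "->", len(far))
```

In the first case the repeated block should cost very little, because it can be coded as pointers into the window. In the second case the earlier copy has already slid out of the window by the time the repeat arrives, so the repeat is coded as if it were new data.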

