Home > Archive > Compression > May 2005 > Does this mean I have no way to compress it?
You are viewing an archived Text-only version of the thread.
To view this thread in it's original format and/or if you want to reply to
this thread please [click here]
| Author |
Does this mean I have no way to compress it?
|
|
| maplewang 2005-05-16, 3:55 pm |
| Hi :
I have tested my copressed data (about 130Kbytes, compressed by
some compressing method like WinRar/Winzip), I have used
sixpack/arithmetic/lz77/lzw/huffman/PPM/GngCode but none can compress
data less than 7 bits/every 8 bits. the best are 7.2 bits/per 8 bits.
Does this mean the data is almost uncompressible? I have not test
it on wavelet or BWT, will these two be a candidate?
thanks ahead.
| |
| Matt Mahoney 2005-05-16, 3:55 pm |
| maplewang wrote:
> Hi :
> I have tested my copressed data (about 130Kbytes, compressed by
> some compressing method like WinRar/Winzip), I have used
> sixpack/arithmetic/lz77/lzw/huffman/PPM/GngCode but none can compress
> data less than 7 bits/every 8 bits. the best are 7.2 bits/per 8 bits.
> Does this mean the data is almost uncompressible? I have not test
> it on wavelet or BWT, will these two be a candidate?
> thanks ahead.
If your data is random, encrypted, or already compressed, then you
probably can't compress it further without special tricks. For
example, if it is compressed as a zip file you might try unzipping it
and compressing it with a better algorithm. (I think Stuffit does
something like this with jpeg files, which you normally can't
compress).
Sometimes random looking data that won't compress by an off the shelf
program might be compressible if you know of some non-obvious
redundancy. For example, it was believed that protein is
incompressible:
http://www.data-compression.info/Corpora/ProteinCorpus/
But in the data they used, there are actually repeats separated by long
intervals that were picked up by some of the better compressors that
the authors missed.
Another example, the million random digits file
http://www.rand.org/publications/classics/randomdigits/
was widely believed to be incompressible, but there is actually at
least 50 bits of redundancy in that if you add up every 50'th digit,
the result is even.
Another example is complementary palindromes in DNA, such as e.coli in
the large Canterbury corpus. If you know that "ACCC" has a certain
frequency, then "GGGT" will have about the same frequency.
Kolmogorov proved there is no general way to find all the redundancy in
data. You have to know something about the data source to know where to
look.
-- Matt Mahoney
| |
| maplewang 2005-05-17, 3:55 am |
| Since these data are user data(i.e user uploaded data), I can not make
sure which method they used to compresse it.
| |
| Matt Mahoney 2005-05-18, 3:55 pm |
| Matt Mahoney wrote:
> Sometimes random looking data that won't compress by an off the shelf
> program might be compressible if you know of some non-obvious
> redundancy. For example, it was believed that protein is
> incompressible:
> http://www.data-compression.info/Corpora/ProteinCorpus/
> But in the data they used, there are actually repeats separated by
long
> intervals that were picked up by some of the better compressors that
> the authors missed.
Frank Schwellinger kindly pointed out to me that since my original post
on this topic in Jan. 2004, the files originally posted were found to
be incorrect and have since been updated, removing the long repeats, as
it says on the above web page.
-- Matt Mahoney
|
|
|
|
|