For Programmers: Free Programming Magazines  


Home > Archive > Compression > November 2004 > When Huffman compression is useless









You are viewing an archived Text-only version of the thread. To view this thread in it's original format and/or if you want to reply to this thread please [click here]

 

Author When Huffman compression is useless
Scout

2004-11-12, 3:55 pm

Hi all!

I know that when there are many different symbols from a message
(or from an image,i.e. 256 different values)
Huffman code is much longer than the code without compression,
BUT using some particular frequency distribution (i.e. normal distribution).

The question is:
When the Huffman code is bad in comparison to a statistical distribution
?
Which parameter can point out such eventuality (i.e. parameters for
the normal distribution such as the standard deviation, the mean,ecc.) ?

I hope that the question is clear.

Thanks,
~ Scout ~


Matt Mahoney

2004-11-13, 8:55 pm

"Scout" <user@domain.com> wrote in message
news:RZ7ld.30105$Ni.1046285@twister1.libero.it...
> Hi all!
>
> I know that when there are many different symbols from a message
> (or from an image,i.e. 256 different values)
> Huffman code is much longer than the code without compression,
> BUT using some particular frequency distribution (i.e. normal

distribution).
>
> The question is:
> When the Huffman code is bad in comparison to a statistical

distribution
> ?
> Which parameter can point out such eventuality (i.e. parameters for
> the normal distribution such as the standard deviation, the mean,ecc.) ?
>
> I hope that the question is clear.
>
> Thanks,
> ~ Scout ~
>
>


Huffman coding is never much worse than no compression. On random data
there is some expansion due to transmitting the table, either explicitly, or
implicitly through adaptive decompression.

There are many cases where you can do better than Huffman codes. If the
symbol probabilities are not powers of 1/2, then arithmetic coding will do
better. The difference is greatest when one symbol has a probability near
1.

Huffman coding is not well suited for context models (higher than order 0).
It could be done in theory, but would be slow from having to recompute the
table for each symbol. Order 0 Huffman coders like UNIX pack compress text
poorly and aren't widely used. However, Huffman codes are widely used as
parts of other algorithms, for example, to code pointers and lengths in LZ77
(gzip), and to encode some transformed data in JPEG.

-- Matt Mahoney


Scout

2004-11-13, 8:55 pm

"Matt Mahoney" <matmahoney@yahoo.com> ha scritto nel messaggio
news:Kswld.10278$_J2.9303@newsread2.news.atl.earthlink.net...
> "Scout" <user@domain.com> wrote in message
> news:RZ7ld.30105$Ni.1046285@twister1.libero.it...
> distribution).
> distribution
>
> Huffman coding is never much worse than no compression. On random data
> there is some expansion due to transmitting the table, either explicitly,
> or
> implicitly through adaptive decompression.
>
> There are many cases where you can do better than Huffman codes. If the
> symbol probabilities are not powers of 1/2, then arithmetic coding will do
> better. The difference is greatest when one symbol has a probability near
> 1.
>
> Huffman coding is not well suited for context models (higher than order
> 0).
> It could be done in theory, but would be slow from having to recompute the
> table for each symbol. Order 0 Huffman coders like UNIX pack compress
> text
> poorly and aren't widely used. However, Huffman codes are widely used as
> parts of other algorithms, for example, to code pointers and lengths in
> LZ77
> (gzip), and to encode some transformed data in JPEG.
>
> -- Matt Mahoney
>
>


Hi Matt,
thanks for your answer.

My problem about Huffman code is on the number of symbols.
Let's say we have 150 (!) different symbols to encode using Huffman
algorithm (with their respective frequencies;
it doesn't care too much which distribution they match).
So, if you try to build the Huffman tree
the lenght of codewords will grow a lot.
I 've done some tests and the 'text' with no compression
(i.e. using 8 bit for each symbol)
is shorter than the Huffman coding of the 'text'!
I want to understand this fact.

Bye,
~Scout~


cr88192

2004-11-14, 3:55 am


"Scout" <user@domain.com> wrote in message
news:cExld.32442$Es2.705895@twister2.libero.it...
> "Matt Mahoney" <matmahoney@yahoo.com> ha scritto nel messaggio
> news:Kswld.10278$_J2.9303@newsread2.news.atl.earthlink.net...
>
> Hi Matt,
> thanks for your answer.
>
> My problem about Huffman code is on the number of symbols.
> Let's say we have 150 (!) different symbols to encode using Huffman
> algorithm (with their respective frequencies;
> it doesn't care too much which distribution they match).
> So, if you try to build the Huffman tree
> the lenght of codewords will grow a lot.
> I 've done some tests and the 'text' with no compression
> (i.e. using 8 bit for each symbol)
> is shorter than the Huffman coding of the 'text'!
> I want to understand this fact.
>

dunno, if one builds a huffman tree that has the same structure as the input
text, little will change really (and the data will look the same).

this may depend on how one constructs the tree.
if one uses just the plain variant I have heard described sometimes, then
likely compression will not be so well, given one gets a long list-like
tree.

err, myself I took a somewhat different approach to tree construction, and
skipped the whole bit-scheme thing entirely, so I would guess my algo is not
huffman.
mine does not guerantee optimal bit lengths, instead, it just constructs a
number of "levels", each with a different bit-count, and puts symbols there.
I don't think it is too horrible, the results are not that far from what I
get from arithmatic coding. I could probably optimize the coding some,
however, according to the profiler, most of the time is still spent in
matters of tree-construction anyways...

I had also considered messing with shannon-fano coding, but didn't really
finish it...



James K.

2004-11-14, 3:55 am

On Sat, 13 Nov 2004 23:07:22 GMT, "Matt Mahoney"
<matmahoney@yahoo.com> wrote:

>There are many cases where you can do better than Huffman codes. If the
>symbol probabilities are not powers of 1/2, then arithmetic coding will do
>better. The difference is greatest when one symbol has a probability near
>1.


To my knowledge, there is somehow an unclear point in your discussion.
Comparing arithmetic coding (for vector) to Huffman coding with symbol
is unfair. I'm sure we can do something better if we concatenated
subsequent symbols as a vector before applying Huffman coding than
Huffman coding with each symbol.

-James
BR,
------
James K. (txdiversity@hotmail.com)
- Any remarks, proposal and/or indicator would be greatly respected.
- Private opinions: These are not the opinions from my affiliation.
[Home] http://home.naver.com/txdiversity
James K.

2004-11-14, 3:55 am

On Sat, 13 Nov 2004 23:07:22 GMT, "Matt Mahoney"
<matmahoney@yahoo.com> wrote:

>There are many cases where you can do better than Huffman codes. If the
>symbol probabilities are not powers of 1/2, then arithmetic coding will do
>better. The difference is greatest when one symbol has a probability near
>1.


To my knowledge, there is somehow an unclear point in your discussion.
Comparing arithmetic coding (for vector) to Huffman coding with symbol
is unfair. I'm sure we can do something better if we concatenated
subsequent symbols as a vector before applying Huffman coding than
Huffman coding with each symbol.

-James
BR,
------
James K. (txdiversity@hotmail.com)
- Any remarks, proposal and/or indicator would be greatly respected.
- Private opinions: These are not the opinions from my affiliation.
[Home] http://home.naver.com/txdiversity
Matt Mahoney

2004-11-14, 3:55 pm


"James K." <txdiversity@hotmail.com> wrote in message
news:lhfdp0dto9kgatq9313i63rkno5c0b7pkg@
4ax.com...
> On Sat, 13 Nov 2004 23:07:22 GMT, "Matt Mahoney"
> <matmahoney@yahoo.com> wrote:
>
do[color=darkred]
near[color=darkred]
>
> To my knowledge, there is somehow an unclear point in your discussion.
> Comparing arithmetic coding (for vector) to Huffman coding with symbol
> is unfair. I'm sure we can do something better if we concatenated
> subsequent symbols as a vector before applying Huffman coding than
> Huffman coding with each symbol.


I am comparing arithmetic coding and Huffman coding independent of the
model. Either could be used for an order 0 model or for a context sensitive
model. Arithmetic coding gives better compression because the code lengths
are not constrained to be integers. If you have 3 symbols each with
probability 1/3, then their Huffman codes would have lengths 1, 2, and 2,
for an average symbol length of 5/3 = 1.667. An arithmetic code would use
lg(3) = 1.585 bits each.

If all your symbols have low probability (several bits) then the advantage
of arithmetic coding is very slight. If the codes are mostly n bits long,
then the advantage is around 1/n^2. You can achieve this by coding several
symbols at once (vectors), but this is not very practical for Huffman coding
because you still have to transmit the Huffman tree, which grows
exponentially with the length of the vectors.

-- Matt Mahoney


Matt Mahoney

2004-11-14, 3:55 pm

"Scout" <user@domain.com> wrote in message
news:cExld.32442$Es2.705895@twister2.libero.it...
> My problem about Huffman code is on the number of symbols.
> Let's say we have 150 (!) different symbols to encode using Huffman
> algorithm (with their respective frequencies;
> it doesn't care too much which distribution they match).
> So, if you try to build the Huffman tree
> the lenght of codewords will grow a lot.
> I 've done some tests and the 'text' with no compression
> (i.e. using 8 bit for each symbol)
> is shorter than the Huffman coding of the 'text'!
> I want to understand this fact.


If your probability model is right then your data (not including the Huffman
tree) should never expand. Be sure you do not have any 0 probabilities,
because their Huffman codes would be infinitely long. I suggest you
estimate the probabilities by counting the symbols and adding 1 to all the
counts. If your input is length n, then no symbol should be longer than
lg(n+256)+1 bits (lg = log base 2).

-- Matt Mahoney


James K.

2004-11-18, 3:55 pm

On Sun, 14 Nov 2004 17:50:23 GMT, "Matt Mahoney"
<matmahoney@yahoo.com> wrote:

>If all your symbols have low probability (several bits) then the advantage
>of arithmetic coding is very slight. If the codes are mostly n bits long,
>then the advantage is around 1/n^2. You can achieve this by coding several
>symbols at once (vectors), but this is not very practical for Huffman coding
>because you still have to transmit the Huffman tree, which grows
>exponentially with the length of the vectors.


As you said, It is obvious that arithmatic coding is preferable than
Huffman coding for actual implementation. However, still arithmatic
coding is often not of practical importance in many cases in terms of
speed and storage requirement. So, Lampel-Ziff that is not sort of
statistical compression is most popular in real life.
BR,
------
James K. (txdiversity@hotmail.com)
[Home] http://home.naver.com/txdiversity
Sponsored Links







Also available: Server administration forum archive | Web Design forum archive | Software forum archive | Hardware reviews archive

Copyright 2008 codecomments.com