Code Comments
Programming Forum and web based access to our favorite programming groups.Hi all! I know that when there are many different symbols from a message (or from an image,i.e. 256 different values) Huffman code is much longer than the code without compression, BUT using some particular frequency distribution (i.e. normal distribution). The question is: When the Huffman code is bad in comparison to a statistical distribution ? Which parameter can point out such eventuality (i.e. parameters for the normal distribution such as the standard deviation, the mean,ecc.) ? I hope that the question is clear. Thanks, ~ Scout ~
Post Follow-up to this message"Scout" <user@domain.com> wrote in message news:RZ7ld.30105$Ni.1046285@twister1.libero.it... > Hi all! > > I know that when there are many different symbols from a message > (or from an image,i.e. 256 different values) > Huffman code is much longer than the code without compression, > BUT using some particular frequency distribution (i.e. normal distribution). > > The question is: > When the Huffman code is bad in comparison to a statistical distribution > ? > Which parameter can point out such eventuality (i.e. parameters for > the normal distribution such as the standard deviation, the mean,ecc.) ? > > I hope that the question is clear. > > Thanks, > ~ Scout ~ > > Huffman coding is never much worse than no compression. On random data there is some expansion due to transmitting the table, either explicitly, or implicitly through adaptive decompression. There are many cases where you can do better than Huffman codes. If the symbol probabilities are not powers of 1/2, then arithmetic coding will do better. The difference is greatest when one symbol has a probability near 1. Huffman coding is not well suited for context models (higher than order 0). It could be done in theory, but would be slow from having to recompute the table for each symbol. Order 0 Huffman coders like UNIX pack compress text poorly and aren't widely used. However, Huffman codes are widely used as parts of other algorithms, for example, to code pointers and lengths in LZ77 (gzip), and to encode some transformed data in JPEG. -- Matt Mahoney
Post Follow-up to this message"Matt Mahoney" <matmahoney@yahoo.com> ha scritto nel messaggio news:Kswld.10278$_J2.9303@newsread2.news.atl.earthlink.net... > "Scout" <user@domain.com> wrote in message > news:RZ7ld.30105$Ni.1046285@twister1.libero.it... > distribution). > distribution > > Huffman coding is never much worse than no compression. On random data > there is some expansion due to transmitting the table, either explicitly, > or > implicitly through adaptive decompression. > > There are many cases where you can do better than Huffman codes. If the > symbol probabilities are not powers of 1/2, then arithmetic coding will do > better. The difference is greatest when one symbol has a probability near > 1. > > Huffman coding is not well suited for context models (higher than order > 0). > It could be done in theory, but would be slow from having to recompute the > table for each symbol. Order 0 Huffman coders like UNIX pack compress > text > poorly and aren't widely used. However, Huffman codes are widely used as > parts of other algorithms, for example, to code pointers and lengths in > LZ77 > (gzip), and to encode some transformed data in JPEG. > > -- Matt Mahoney > > Hi Matt, thanks for your answer. My problem about Huffman code is on the number of symbols. Let's say we have 150 (!) different symbols to encode using Huffman algorithm (with their respective frequencies; it doesn't care too much which distribution they match). So, if you try to build the Huffman tree the lenght of codewords will grow a lot. I 've done some tests and the 'text' with no compression (i.e. using 8 bit for each symbol) is shorter than the Huffman coding of the 'text'! I want to understand this fact. Bye, ~Scout~
Post Follow-up to this message"Scout" <user@domain.com> wrote in message news:cExld.32442$Es2.705895@twister2.libero.it... > "Matt Mahoney" <matmahoney@yahoo.com> ha scritto nel messaggio > news:Kswld.10278$_J2.9303@newsread2.news.atl.earthlink.net... > > Hi Matt, > thanks for your answer. > > My problem about Huffman code is on the number of symbols. > Let's say we have 150 (!) different symbols to encode using Huffman > algorithm (with their respective frequencies; > it doesn't care too much which distribution they match). > So, if you try to build the Huffman tree > the lenght of codewords will grow a lot. > I 've done some tests and the 'text' with no compression > (i.e. using 8 bit for each symbol) > is shorter than the Huffman coding of the 'text'! > I want to understand this fact. > dunno, if one builds a huffman tree that has the same structure as the input text, little will change really (and the data will look the same). this may depend on how one constructs the tree. if one uses just the plain variant I have heard described sometimes, then likely compression will not be so well, given one gets a long list-like tree. err, myself I took a somewhat different approach to tree construction, and skipped the whole bit-scheme thing entirely, so I would guess my algo is not huffman. mine does not guerantee optimal bit lengths, instead, it just constructs a number of "levels", each with a different bit-count, and puts symbols there. I don't think it is too horrible, the results are not that far from what I get from arithmatic coding. I could probably optimize the coding some, however, according to the profiler, most of the time is still spent in matters of tree-construction anyways... I had also considered messing with shannon-fano coding, but didn't really finish it...
Post Follow-up to this messageOn Sat, 13 Nov 2004 23:07:22 GMT, "Matt Mahoney" <matmahoney@yahoo.com> wrote: >There are many cases where you can do better than Huffman codes. If the >symbol probabilities are not powers of 1/2, then arithmetic coding will do >better. The difference is greatest when one symbol has a probability near >1. To my knowledge, there is somehow an unclear point in your discussion. Comparing arithmetic coding (for vector) to Huffman coding with symbol is unfair. I'm sure we can do something better if we concatenated subsequent symbols as a vector before applying Huffman coding than Huffman coding with each symbol. -James BR, ------ James K. (txdiversity@hotmail.com) - Any remarks, proposal and/or indicator would be greatly respected. - Private opinions: These are not the opinions from my affiliation. [Home] http://home.naver.com/txdiversity
Post Follow-up to this messageOn Sat, 13 Nov 2004 23:07:22 GMT, "Matt Mahoney" <matmahoney@yahoo.com> wrote: >There are many cases where you can do better than Huffman codes. If the >symbol probabilities are not powers of 1/2, then arithmetic coding will do >better. The difference is greatest when one symbol has a probability near >1. To my knowledge, there is somehow an unclear point in your discussion. Comparing arithmetic coding (for vector) to Huffman coding with symbol is unfair. I'm sure we can do something better if we concatenated subsequent symbols as a vector before applying Huffman coding than Huffman coding with each symbol. -James BR, ------ James K. (txdiversity@hotmail.com) - Any remarks, proposal and/or indicator would be greatly respected. - Private opinions: These are not the opinions from my affiliation. [Home] http://home.naver.com/txdiversity
Post Follow-up to this message"James K." <txdiversity@hotmail.com> wrote in message news:lhfdp0dto9kgatq9313i63rkno5c0b7pkg@ 4ax.com... > On Sat, 13 Nov 2004 23:07:22 GMT, "Matt Mahoney" > <matmahoney@yahoo.com> wrote: > do near > > To my knowledge, there is somehow an unclear point in your discussion. > Comparing arithmetic coding (for vector) to Huffman coding with symbol > is unfair. I'm sure we can do something better if we concatenated > subsequent symbols as a vector before applying Huffman coding than > Huffman coding with each symbol. I am comparing arithmetic coding and Huffman coding independent of the model. Either could be used for an order 0 model or for a context sensitive model. Arithmetic coding gives better compression because the code lengths are not constrained to be integers. If you have 3 symbols each with probability 1/3, then their Huffman codes would have lengths 1, 2, and 2, for an average symbol length of 5/3 = 1.667. An arithmetic code would use lg(3) = 1.585 bits each. If all your symbols have low probability (several bits) then the advantage of arithmetic coding is very slight. If the codes are mostly n bits long, then the advantage is around 1/n^2. You can achieve this by coding several symbols at once (vectors), but this is not very practical for Huffman coding because you still have to transmit the Huffman tree, which grows exponentially with the length of the vectors. -- Matt Mahoney
Post Follow-up to this message"Scout" <user@domain.com> wrote in message news:cExld.32442$Es2.705895@twister2.libero.it... > My problem about Huffman code is on the number of symbols. > Let's say we have 150 (!) different symbols to encode using Huffman > algorithm (with their respective frequencies; > it doesn't care too much which distribution they match). > So, if you try to build the Huffman tree > the lenght of codewords will grow a lot. > I 've done some tests and the 'text' with no compression > (i.e. using 8 bit for each symbol) > is shorter than the Huffman coding of the 'text'! > I want to understand this fact. If your probability model is right then your data (not including the Huffman tree) should never expand. Be sure you do not have any 0 probabilities, because their Huffman codes would be infinitely long. I suggest you estimate the probabilities by counting the symbols and adding 1 to all the counts. If your input is length n, then no symbol should be longer than lg(n+256)+1 bits (lg = log base 2). -- Matt Mahoney
Post Follow-up to this messageOn Sun, 14 Nov 2004 17:50:23 GMT, "Matt Mahoney" <matmahoney@yahoo.com> wrote: >If all your symbols have low probability (several bits) then the advantage >of arithmetic coding is very slight. If the codes are mostly n bits long, >then the advantage is around 1/n^2. You can achieve this by coding several >symbols at once (vectors), but this is not very practical for Huffman codin g >because you still have to transmit the Huffman tree, which grows >exponentially with the length of the vectors. As you said, It is obvious that arithmatic coding is preferable than Huffman coding for actual implementation. However, still arithmatic coding is often not of practical importance in many cases in terms of speed and storage requirement. So, Lampel-Ziff that is not sort of statistical compression is most popular in real life. BR, ------ James K. (txdiversity@hotmail.com) [Home] http://home.naver.com/txdiversity
Post Follow-up to this message
Show a Printable Version
Email This Page to Someone!
Receive updates to this thread
Powered by vBulletin
Copyright 2000-2006 Jelsoft Enterprises Limited.