Code Comments
Programming Forum and web based access to our favorite programming groups.The standard algorithm for generating Huffman codes assumes that the codes can be arbitrarily long. Is there an efficient algorithm to generate an optimal set of codes subject to the constraint that no code can be longer than a specified length? The only algorithm I have found that addresses this problem is the one in gzip, which does not produce an optimal result. It works in three passes: 1. Generate an optimal Huffman tree. If no code exceeds the maximum length, we are done. 2. Modify the tree so that its depth does not exceed max_depth, where max_depth is the maximum code length. In this step, we don't actually perform the modifications described below. Instead, we just keep track of the number of leaves at each depth that would exist if we performed the modifications. A. Remove all nodes whose depth is greater than max_depth from the tree. Store the removed leaf nodes in set S, and discard the non-leaf nodes. B. Remove all non-leaf nodes whose depth is max_depth from the tree, replacing them with leaf nodes in set S. Remove these leaf nodes from set S. C. For each node N1 in set S, chose a leaf node N2 from the tree such that 1. The depth of N2 is less than max_depth. 2. The depth of N2 is at least as great as any other leaf whose depth is less than max_depth. Remove N2 from the tree, and replace it with a non-leaf node whose children are N1 and N2. 3. Count the number of leaves at each depth in the tree produced in step 2. Sort the leaves by symbol frequency, and assign depths to the leaves such that the leaves representing the most frequently occurring symbols have the smallest depth values. The following example demonstrates that the algorithm used in gzip doesn't produce optimal results. The first two columns below specify the frequency counts for a set of symbols. The third column shows the code lengths for an optimal Huffman tree, assuming that there is no limit on the length of a code. The fourth column shows the code lengths produced by the gzip algorithm when the maximum code length is set to three. The last column shows a set of code lengths which are better than those produced by the gzip algorithm. -------- lengths --------- symbol freqency initial gzip optimal A 8 1 1 2 B 7 2 3 2 C 4 3 3 2 D 1 4 3 3 E 1 4 3 3 compressed size: 42 47 44
Post Follow-up to this message"Kenneth Almquist" <ka@sorry.no.email> wrote in message news:zk56d.16293$M45.7358@trndny09. . > The standard algorithm for generating Huffman codes assumes that > the codes can be arbitrarily long. Is there an efficient algorithm > to generate an optimal set of codes subject to the constraint that > no code can be longer than a specified length? > [snip] Could you detail what causes such a requirement? Are there any practical applications for that? Why doesn't canonical Huffman code suit your application? Regards, Alex Vinokur http://mathforum.org/library/view/10978.html http://sourceforge.net/users/alexvn
Post Follow-up to this message"Alex Vinokur" <alexvn@big-foot.com> wrote in message news:<2rsavpF1cddj2U1@uni-berlin.de>. . > Are there any practical applications for that? Yes, for example performance reasons. Limiting the code length to 32 bits can improve performance of the encoding/decoding. A Google search for "huffman length limited" give some interesting results. /Jesper Nordenberg
Post Follow-up to this messageka@sorry.no.email (Kenneth Almquist) wrote in news:zk56d.16293$M45.7358@trndny09: > > The following example demonstrates that the algorithm used in gzip > doesn't produce optimal results. The first two columns below specify > the frequency counts for a set of symbols. The third column shows the > code lengths for an optimal Huffman tree, assuming that there is no > limit on the length of a code. The fourth column shows the code > lengths produced by the gzip algorithm when the maximum code length is > set to three. The last column shows a set of code lengths which are > better than those produced by the gzip algorithm. > > -------- lengths --------- > symbol freqency initial gzip optimal > A 8 1 1 2 > B 7 2 3 2 > C 4 3 3 2 > D 1 4 3 3 > E 1 4 3 3 > > compressed size: 42 47 44 > One way though not the fastest is to increase the wieght of less common symbols till a shorter tree is first obtained that matches the length you are striving for in this example start with 1 and go up small amounts. say you reach 2.1 then the frequency count is A 8 B 7 C 4 D 2.1 E 2.1 you bring DE to togeth for a weith of 4.2 then combint that with next lowest one which is C with a weight of 4 the tree know weigh 8.2 the next two lowest nodes are the A and B so they are combined then the last two left givine the following symbol length huffman code A 2 10 B 2 01 C 2 11 D 3 001 E 3 000 If you don't shorten you get the folowing where D and E freq of 1 symbol length huffman code A 1 1 B 2 01 C 3 001 D 4 0001 E 4 0000 David A. Scott -- My Crypto code http://bijective.dogma.net/crypto/scott19u.zip http://www.jim.com/jamesd/Kong/scott19u.zip old version My Compression code http://bijective.dogma.net/ **TO EMAIL ME drop the roman "five" ** Disclaimer:I am in no way responsible for any of the statements made in the above text. For all I know I might be drugged. As a famous person once said "any cryptograhic system is only as strong as its weakest link"
Post Follow-up to this message"Alex Vinokur" <alexvn@big-foot.com> asked: > "Kenneth Almquist" <ka@sorry.no.email> wrote in message news:zk56d.16293$M 45.7358@trndny09... > > Could you detail what causes such a requirement? > Are there any practical applications for that? > Why doesn't canonical Huffman code suit your application? Two reasons. First, with a canonical Huffman code, the tree is encoded by encoding the length of each code. So the amount of space required to store the tree depends on how much space is required to store the lengths. Less space is required if you have a known maximum code length. Second, if the maximum length of a code is reasonably small, you can perform decoding using a single table lookup. Otherwise, you need a more complicated scheme using multiple tables. I am trying to read and write data structures. To read in a data structure, I write a parser which calls get_next_symbol(encoding) when it's ready to process the next symbol. The "encoding" parameter specifies the Huffman codes used to represent the symbol; different symbol encodings are used at different points in the parsing process. The point is that get_next_symbol is called in a lot of places. I would like it to be simple enough that the compiler will inline it. Kenneth Almquist
Post Follow-up to this messagemegagurka@yahoo.com (Jesper Nordenberg) wrote: > A Google search for "huffman length limited" give some interesting results.[/color ] Thanks. The "package-merge" algorithm by Lawrence Larmore and Daniel Hirschberg runs in O(n*L) algorithm, where n is the number of symbols and L is the maximum code length.[1] There's also the "warm-up" algorithm, which is not optimal but produces optimal results in many cases and has a provable worst bound.[2, 3]. In the case of gzip, replacing the existing algorithm with an optimal algorithm such as package-merge produces a minimal benefit. The best result reported in [4] is that using the package-merge algorithm reduced the size of a compressed file from 1,817,536 bytes to 1,817,533 bytes, a savings of 0.00017%. On the hand, in the example I gave at the start of the thread, using an optimal algorithm reduced the size of the compressed data by 6.4%, so there are values of L and n for which using an optimal algorithm will produce a significant benefit. Kenneth Almquist [1] A Fast Algorithm for Optimal Length-Limited Huffman Codes, by Lawrence L. Larmore and Daniel S. Hirschberg <http://www.ics.uci.edu/~dan/pubs/LenLimHuff.ps.gz> [2] The WARM-UP Algorithm: A Lagrangean Construction of Length Restricted Huffman Codes, by Ruy Luiz Milidiu, Eduardo Sany Laber <ftp://ftp.inf.puc-rio.br/pub/docs/t...5_milidiu.ps.gz> [3] Efficient Implementation of the WARM-UP Algorithm for the Construction of Length-Restricted Prefix Codes, by Ruy Luiz Milidiu, Artur Alves Pessoa, Eduardo Sany Laber <http://citeseer.ist.psu.edu/rd/0%2C...efficient.ps.gz> [4] Improved GNU gzip Data Compression, by Roger Sayle <http://www.daylight.com/meetings/mug00/Sayle/gzip.html>
Post Follow-up to this messageOn Tue, 28 Sep 2004, Kenneth Almquist wrote: > The standard algorithm for generating Huffman codes assumes that > the codes can be arbitrarily long. Is there an efficient algorithm > to generate an optimal set of codes subject to the constraint that > no code can be longer than a specified length? See [1] Lawrence L. Larmore, Daniel S. Hirschberg "A Fast Algorithm for Optimal Length-Limited Huffman Codes" [2] Andrew Turpin, Alistair Moffat "Practical Length-Limited Codes for Large Alphabets" In [1] an optimal O(nL) time algorithm is presented where n=symbol count and L=maximum code word length. This algorithm is further optimized in [2] in terms of memory demand / speed. The problem "OLBPC" (optimal length bounded prefix code) can be reduced to CCP (Coin Collector's Problem) which is in turn similar to the KnappSack problem but can be solved much faster. The optimized algorithm presented in [2] is nearly as fast as an optimized Huffman algorithm (still O(nL) time complexity but much faster than the first version of Lawrence and Hirschberg) This optimized algorithm looks suprisingly simple though you won't understand why & how it works by looking at the pseudo code instead of studying these papers and their proofs. HTH, Sebastian -- PGP-Key-ID (long): 572B1778A4CA0707
Post Follow-up to this messageOn Thu, 30 Sep 2004, Kenneth Almquist wrote: > Thanks. The "package-merge" algorithm by Lawrence Larmore and Daniel > Hirschberg runs in O(n*L) algorithm, where n is the number of symbols > and L is the maximum code length.[1] There's also the "warm-up" > algorithm, which is not optimal but produces optimal results in many > cases and has a provable worst bound.[2, 3]. Oups... sorry, I didn't see this message. Yeah, the package-merge algorithm is the way to go. :) You might be interested in the optimization of Turpin and Moffat as well. It's a very fast version of the package-merge algo. Ghis! Sebastian -- PGP-Key-ID (long): 572B1778A4CA0707
Post Follow-up to this message
"Kenneth Almquist" <ka@sorry.no.email> wrote in message news:WKr6d.3195$OX.2510@trndny07...
> "Alex Vinokur" <alexvn@big-foot.com> asked:
>
> Two reasons. First, with a canonical Huffman code, the tree is encoded
> by encoding the length of each code. So the amount of space required to
> store the tree depends on how much space is required to store the lengths.
> Less space is required if you have a known maximum code length.
>
> Second, if the maximum length of a code is reasonably small, you can
> perform decoding using a single table lookup. Otherwise, you need a
> more complicated scheme using multiple tables.
[snip]
In the context of this question, several kinds of criterions for building Hu
ffman trees can be considered.
Here are some of them:
1) Minimum cost (weighted external path length the tree) - criterion of the
canonical Huffman algorithm;
2) Limit on height of the tree;
3) Limit on sum of codeword sizes.
Minimum cost is Data Transmission criterion,
Limits on height and sum are Data Storage criterions.
Those criterions can be combine in a triple-criterion.
Here is a table of possible triple-criterions for Huffman codes.
Table 1. Huffman algorithm triple-criterions
----------------------------------------------------------------------------
------
| Criterion : Triple-criterion : Note
|
| : priorities P[i] :
|
| :--------------------------------------:
|
| : P[1] : P[2] : P[3] :
|
|---------------------------------------------------------------------------
-----|
| TC-0 : Cost : No : No : Canonical Huffman algor
ithm |
| TC-1 : Cost : Height : Code-Sizes : Canonical Huffman algor
ithm |
| TC-2 : Cost : Code-Sizes : Height : Canonical Huffman algor
ithm |
| TC-3 : Height : Cost : Code-Sizes : Limited Length
|
| TC-4 : Height : Code-Sizes : Cost : Limited Length & Code-S
izes |
| TC-5 : Code-Sizes : Cost : Height : Limited Code-Sizes Code
|
| TC-6 : Code-Sizes : Height : Cost : Limited Code-Sizes & Le
ngth |
----------------------------------------------------------------------------
------
Note. 'Cost' means 'Huffman tree cost'
'Code-Sizes' means 'Sum of codeword sizes'
For instance, TC-3 means: while building Huffman
first of all one should limit height of tree,
after that minimize cost of the tree,
after that minimize sum of codeword sizes.
Of course, Huffman trees for different triple-criterions are usually quite d
ifferent ones (trees).
How to build Huffman trees for different triple-criterions?
n-ary Huffman Template Algorithm was used to do that.
That algorithm can be seen at
* http://alexvn.freeservers.com/s1/hu..._algorithm.html
* http://sourceforge.net/projects/huffman-ta/
The Huffman Template Algorithm is not aimed to _efficiently_ generate Huffma
n trees.
It is aimed
* to use _canonical_ Huffman algorithm non-numerical weights as well
and
* to produce _comprehensive_ output.
To build Huffman trees for different triple-criterions an extended weight wa
s used.
The extended weight is a couple <actual weight, delta weight>.
Actual weight is ordinary weight (of a symbol), delta weight is generated wh
ile some loop.
Here is a test set of symbols and their (actual) weights.
A 1
B 1
C 1
D 2
E 3
F 7
G 10
H 13
I 15
J 17
K 26
L 35
Table 2. Several Huffman codes for the set above.
-----------------------------------------------------------------
| Method : Cri- : Tree : Tree cost : Code- : Note |
| : te- : Height :----------------: Sizes : |
| : rion : : Actual : Delta : : |
|---------------------------------------------------------------|
| 1 : TC-0 : 8 : 388 : 0 : 56 : |
| : : : : : : |
| 2 : TC-1 : 7 : 388 : 0 : 54 : Height <= 7 |
| : : : : : : |
| 3 : TC-3 : 6 : 390 : 36 : 52 : Height <= 6 |
| 4 : TC-4 : 6 : 394 : 48 : 49 : Height <= 6 |
| : : : : : : |
| 5 : TC-3 : 5 : 398 : 60 : 48 : Height <= 5 |
| 6 : TC-4 : 5 : 413 : 146 : 46 : Height <= 5 |
| : : : : : : |
| 7 : TC-3 : 4 : 428 : 156 : 45 : Height <= 4 |
| 8 : TC-4 : 4 : 431 : 436 : 44 : Height <= 4 |
-----------------------------------------------------------------
######### Method-1 : BEGIN #########
===== Set of weight+delta: 1+0 1+0 1+0 2+0 3+0 7+0 10+0 13+0 15+0 17+0 26+0
35+0 =====
Alphabet size = 12
Shortest code size = 2
Longest code size = 8
Weights sum = 131+0
Average weight = 10+0
Code-sizes sum = 56
Average code-size = 4.66667
Weighted code-sizes sum = 388+0
Ave. weighted code-size = 32+0
=======================
Codes and their symbols
-> Sorted by Code Size
=======================
Weight Symbol Code
------ ------ ----
26+0 K 00
35+0 L 10
15+0 I 011
17+0 J 110
7+0 F 0100
10+0 G 1110
13+0 H 1111
3+0 E 01010
2+0 D 010110
1+0 C 0101110
1+0 A 01011110
1+0 B 01011111
######### Method-1 : END ###########
######### Method-2 : BEGIN #########
===== Set of weight+delta: 1+0 1+0 1+0 2+0 3+0 7+0 10+0 13+0 15+0 17+0 26+0
35+0 =====
Alphabet size = 12
Shortest code size = 2
Longest code size = 7
Weights sum = 131+0
Average weight = 10+0
Code-sizes sum = 54
Average code-size = 4.5
Weighted code-sizes sum = 388+0
Ave. weighted code-size = 32+0
=======================
Codes and their symbols
-> Sorted by Code Size
=======================
Weight Symbol Code
------ ------ ----
26+0 K 00
35+0 L 10
15+0 I 010
17+0 J 110
7+0 F 0110
10+0 G 1110
13+0 H 1111
1+0 C 011100
2+0 D 011101
3+0 E 011111
1+0 A 0111100
1+0 B 0111101
######### Method-2 : END ###########
######### Method-3 : BEGIN #########
===== Set of weight+delta: 1+2 1+2 1+2 2+0 3+0 7+0 10+0 13+0 15+0 17+0 26+0
35+0 =====
Alphabet size = 12
Shortest code size = 2
Longest code size = 6
Weights sum = 131+6
Average weight = 10+0
Code-sizes sum = 52
Average code-size = 4.33333
Weighted code-sizes sum = 390+36
Ave. weighted code-size = 32+3
=======================
Codes and their symbols
-> Sorted by Code Size
=======================
Weight Symbol Code
------ ------ ----
26+0 K 00
35+0 L 10
15+0 I 010
17+0 J 011
10+0 G 1101
13+0 H 1111
3+0 E 11000
7+0 F 11101
2+0 D 110010
1+2 A 110011
1+2 B 111000
1+2 C 111001
######### Method-3 : END ###########
######### Method-4 : BEGIN #########
===== Set of weight+delta: 1+2 1+3 1+3 2+1 3+0 7+0 10+0 13+0 15+0 17+0 26+0
35+0 =====
Alphabet size = 12
Shortest code size = 2
Longest code size = 6
Weights sum = 131+9
Average weight = 10+0
Code-sizes sum = 49
Average code-size = 4.08333
Weighted code-sizes sum = 394+48
Ave. weighted code-size = 32+4
=======================
Codes and their symbols
-> Sorted by Code Size
=======================
Weight Symbol Code
------ ------ ----
35+0 L 10
13+0 H 000
15+0 I 010
17+0 J 011
26+0 K 111
7+0 F 0011
10+0 G 1101
3+0 E 00100
1+3 A 00101
1+3 B 11000
1+2 C 110010
2+1 D 110011
######### Method-4 : END ###########
######### Method-5 : BEGIN #########
===== Set of weight+delta: 1+3 1+3 1+3 2+2 3+1 7+0 10+0 13+0 15+0 17+0 26+0
35+0 =====
Alphabet size = 12
Shortest code size = 2
Longest code size = 5
Weights sum = 131+12
Average weight = 10+1
Code-sizes sum = 48
Average code-size = 4
Weighted code-sizes sum = 398+60
Ave. weighted code-size = 33+5
=======================
Codes and their symbols
-> Sorted by Code Size
=======================
Weight Symbol Code
------ ------ ----
35+0 L 10
13+0 H 000
15+0 I 001
17+0 J 011
26+0 K 111
10+0 G 1100
1+3 A 01000
1+3 B 01001
1+3 C 01010
2+2 D 01011
3+1 E 11010
7+0 F 11011
######### Method-5: END ###########
######### Method-6 : BEGIN #########
===== Set of weight+delta: 1+7 1+7 1+7 2+6 3+5 7+1 10+0 13+0 15+0 17+0 26+0
35+0 =====
Alphabet size = 12
Shortest code size = 2
Longest code size = 5
Weights sum = 131+33
Average weight = 10+2
Code-sizes sum = 46
Average code-size = 3.83333
Weighted code-sizes sum = 413+146
Ave. weighted code-size = 34+12
=======================
Codes and their symbols
-> Sorted by Code Size
=======================
Weight Symbol Code
------ ------ ----
35+0 L 01
17+0 J 100
26+0 K 110
1+7 C 0000
2+6 D 0001
3+5 E 0010
7+1 F 0011
10+0 G 1010
13+0 H 1011
15+0 I 1110
1+7 A 11110
1+7 B 11111
######### Method-6 : END ###########
######### Method-7 : BEGIN #########
===== Set of weight+delta: 1+8 1+8 1+8 2+7 3+6 7+2 10+0 13+0 15+0 17+0 26+0
35+0 =====
Alphabet size = 12
Shortest code size = 2
Longest code size = 4
Weights sum = 131+39
Average weight = 10+3
Code-sizes sum = 45
Average code-size = 3.75
Weighted code-sizes sum = 428+156
Ave. weighted code-size = 35+13
=======================
Codes and their symbols
-> Sorted by Code Size
=======================
Weight Symbol Code
------ ------ ----
35+0 L 00
26+0 K 110
1+8 A 0100
1+8 B 0101
1+8 C 0110
2+7 D 0111
3+6 E 1000
7+2 F 1001
10+0 G 1010
13+0 H 1011
15+0 I 1110
17+0 J 1111
######### Method-7 : END ###########
######### Method-8 : BEGIN #########
===== Set of weight+delta: 1+17 1+17 1+17 2+16 3+15 7+11 10+8 13+5 15+3 17+1
26+0 35+0 =====
Alphabet size = 12
Shortest code size = 3
Longest code size = 4
Weights sum = 131+110
Average weight = 10+9
Code-sizes sum = 44
Average code-size = 3.66667
Weighted code-sizes sum = 431+436
Ave. weighted code-size = 35+36
=======================
Codes and their symbols
-> Sorted by Code Size
=======================
Weight Symbol Code
------ ------ ----
15+3 I 000
17+1 J 001
26+0 K 010
35+0 L 011
1+17 A 1000
1+17 B 1001
1+17 C 1010
2+16 D 1011
3+15 E 1100
7+11 F 1101
10+8 G 1110
13+5 H 1111
######### Method-8 : END ###########
--
Alex Vinokur
email: alex DOT vinokur AT gmail DOT com
http://mathforum.org/library/view/10978.html
http://sourceforge.net/users/alexvn
Post Follow-up to this messageka@sorry.no.email (Kenneth Almquist) wrote in news:WKr6d.3195$OX.2510@trndny07: > > Two reasons. First, with a canonical Huffman code, the tree is > encoded by encoding the length of each code. So the amount of space > required to store the tree depends on how much space is required to > store the lengths. Less space is required if you have a known maximum > code length. > Actually this is not true. THere are several ways to store the tree if one must. The length is not the best way not even close. As a simple example suspose you with to store the tree as for a file the has the basic 256 symbols. Where roughly half the symbols are used. if only three lengths are being used. say 10 bits 8 bits 4 bits. You need only store 00 for the not in use 01 for shortest length 10 for next length and 11 for the longest length. This would carry the exact same information but would require less information than the actually length there would be only way to construct the tree so it would be a waste to use actaul length. You could also do it as a fixed length permutation table or a varible length permutation table if one wanted that would save more length. You could even do it completely bijective to the file space though it would be harder to code and if needed I suspect an adaptive approach would be better most of the time. > Second, if the maximum length of a code is reasonably small, you can > perform decoding using a single table lookup. Otherwise, you need a > more complicated scheme using multiple tables. > > As computers get faster and since memorty only seems to get cheaper this should seldom be a concern. David A. Scott -- My Crypto code http://bijective.dogma.net/crypto/scott19u.zip http://www.jim.com/jamesd/Kong/scott19u.zip old version My Compression code http://bijective.dogma.net/ **TO EMAIL ME drop the roman "five" ** Disclaimer:I am in no way responsible for any of the statements made in the above text. For all I know I might be drugged. As a famous person once said "any cryptograhic system is only as strong as its weakest link"
Post Follow-up to this message
Show a Printable Version
Email This Page to Someone!
Receive updates to this thread
Powered by vBulletin
Copyright 2000-2006 Jelsoft Enterprises Limited.