Home > Archive > Compression > September 2006 > Huffman - beginner question
You are viewing an archived Text-only version of the thread.
To view this thread in it's original format and/or if you want to reply to
this thread please [click here]
| Author |
Huffman - beginner question
|
|
| lallous 2006-09-21, 7:55 am |
| Hello
I am learning about Huffman compression algorithm:
quote:
- The two free nodes with the lowest weights are located.
- A parent node for these two nodes is created. It is assigned a weight
equal to the sum of the two
child nodes.
- The parent node is added to the list of free nodes, and the two child
nodes are removed from the
list.
- One of the child nodes is designated as the path taken from the parent
node when decoding a 0
bit. The other is arbitrarily set to the 1 bit.
- The previous steps are repeated until only one free node is left. This
free node is designated the
root of the tree.
1) I built the starting nodes with their initial weight values based on
their occurences
2) I built the tree based on this algorithm, however, the tree depth
sometimes grows to 9 which means that a certain symbol will have 9 bits to
encode it.
How can I calculate what will be the maximum theoretical depth a tree can
have independant from the input?
I initially thought it would be a max depth of 8, but tests showed that it
can go more than that.
Please explain and advise.
Regards,
Elias
| |
| Willem 2006-09-21, 6:55 pm |
| lallous wrote:
) How can I calculate what will be the maximum theoretical depth a tree can
) have independant from the input?
) I initially thought it would be a max depth of 8, but tests showed that it
) can go more than that.
The max depth depends on the number of symbols, and the largest ratio
between weights. If you're using symbol occurrences, the largest ratio
obviously depends on the length of the input.
In theory, the largest depth is equal to the number of symbols, minus one.
In practise, the dependancy on the input length plays a large role, because
it's roughly logarithmic (IIRC).
SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
| |
| Mark Adler 2006-09-21, 6:55 pm |
| lallous wrote:
> - One of the child nodes is designated as the path taken from the parent
> node when decoding a 0 bit. The other is arbitrarily set to the 1 bit.
You can wait until later to assign which one is 0 and which is 1. In
most applications the Huffman algorithm is used to only produce code
lengths (in number of bits) for each symbol. Then later you can use
the symbol lexical order to assign codes within each code length to get
a "canoncial" code for that set of probabilities. The advantage is
that when you have to transmit the code, you only need to send the
lengths -- not the codes themselves.
> How can I calculate what will be the maximum theoretical depth a tree can
> have independant from the input?
As was stated by another poster, the maximum is n-1, where n is the
number of symbols. Since typical applications have hundreds of
symbols, if not more, you could end up with codes hundreds of bits
long.
Since the really long codes are, by definition, unlikely, you can make
the code very slightly less efficient by length-limiting the codes.
The advantage then is that your coder and decoder can be simpler and
faster if all of the codes fit in, for example, a processor register.
The deflate Huffman codes are length-limited to 15 bits, even though
they could otherwise be 284 bits long in the worst case.
How exactly you length-limit the codes is a much longer discussion.
There is a whole section on that, and a good one too, in "Compression
and Coding Algorithms" by Moffat and Turpin. There is also a paper by
Moffat on a good optimal algorithm for the common case when most of the
codes already fit in the length limit.
mark
| |
| cr88192 2006-09-21, 6:55 pm |
|
"Mark Adler" <madler@alumni.caltech.edu> wrote in message
news:1158858162.027514.69790@m7g2000cwm.googlegroups.com...
> lallous wrote:
>
> You can wait until later to assign which one is 0 and which is 1. In
> most applications the Huffman algorithm is used to only produce code
> lengths (in number of bits) for each symbol. Then later you can use
> the symbol lexical order to assign codes within each code length to get
> a "canoncial" code for that set of probabilities. The advantage is
> that when you have to transmit the code, you only need to send the
> lengths -- not the codes themselves.
>
>
> As was stated by another poster, the maximum is n-1, where n is the
> number of symbols. Since typical applications have hundreds of
> symbols, if not more, you could end up with codes hundreds of bits
> long.
>
> Since the really long codes are, by definition, unlikely, you can make
> the code very slightly less efficient by length-limiting the codes.
> The advantage then is that your coder and decoder can be simpler and
> faster if all of the codes fit in, for example, a processor register.
> The deflate Huffman codes are length-limited to 15 bits, even though
> they could otherwise be 284 bits long in the worst case.
>
though theoretically possible, it is difficult to imagine input that would
provide these results.
eg: I would estimate for that n symbols, with power of 2 probabilities, you
would need a file at least 2^n symbols long (and a little more, taking into
account the lz runs). likewise, the file would have to be carefully crafted
to not be compressible in any way otherwise (that or a custom encoder).
otherwise, the pattern would occure for more probable symbols, but break
down for less probable ones, resulting in necissarily shorter code
lengths...
> How exactly you length-limit the codes is a much longer discussion.
> There is a whole section on that, and a good one too, in "Compression
> and Coding Algorithms" by Moffat and Turpin. There is also a paper by
> Moffat on a good optimal algorithm for the common case when most of the
> codes already fit in the length limit.
>
dunno myself.
I intuited my way through the problem, and my solution came out as roughly:
step along the tree, and, on unwinding, for any nodes that exceed the height
limit, rebalance the node.
I found that this algo doesn't always work in a single pass, so my encoder
may end up calling it multiple times until the tree is ok.
this algo worked out ok myself, so it can't be too terrible.
> mark
>
| |
| Matt Mahoney 2006-09-21, 6:55 pm |
| cr88192 wrote:[color=darkred]
> "Mark Adler" <madler@alumni.caltech.edu> wrote in message
> news:1158858162.027514.69790@m7g2000cwm.googlegroups.com...
>
> though theoretically possible, it is difficult to imagine input that would
> provide these results.
>
> eg: I would estimate for that n symbols, with power of 2 probabilities, you
> would need a file at least 2^n symbols long (and a little more, taking into
> account the lz runs). likewise, the file would have to be carefully crafted
> to not be compressible in any way otherwise (that or a custom encoder).
>
> otherwise, the pattern would occure for more probable symbols, but break
> down for less probable ones, resulting in necissarily shorter code
> lengths...
>
>
>
> dunno myself.
> I intuited my way through the problem, and my solution came out as roughly:
> step along the tree, and, on unwinding, for any nodes that exceed the height
> limit, rebalance the node.
>
> I found that this algo doesn't always work in a single pass, so my encoder
> may end up calling it multiple times until the tree is ok.
>
> this algo worked out ok myself, so it can't be too terrible.
>
A symbol with probability p always has a Huffman code of length lg 1/p
bits, rounded up or down, where lg means log base 2. So the longest
possible symbol is ceil(lg 1/p) for the least probable symbol.
-- Matt Mahoney
| |
| cr88192 2006-09-21, 6:55 pm |
|
"Matt Mahoney" <matmahoney@yahoo.com> wrote in message
news:1158869800.962307.236680@i3g2000cwc.googlegroups.com...
> cr88192 wrote:
>
> A symbol with probability p always has a Huffman code of length lg 1/p
> bits, rounded up or down, where lg means log base 2. So the longest
> possible symbol is ceil(lg 1/p) for the least probable symbol.
>
yes, and this is why I am saying, to get a file with a bunch of power-of-2
probabilities (the theoretically worst case) you would need a file at least
about 2^n symbols long...
so, ignoring LZ or similar:
ABBCCCCDDDDDDDDEEEEEEEEEEEEEEEE
....
this defeats the logarithm, by making an exponential probability
distribution...
so, my thought is, for any sane input the worst case, or even anything near
it, is highly improbable. however, it is fairly common to get a file that
produces trees that do not fit so nicely within a 15 bit limit, but this is
different...
> -- Matt Mahoney
>
| |
| lallous 2006-09-22, 3:55 am |
| Thank you all for your replies.
As an experimentation tool for huffman codes, I wrote a small program, that
I'ld like to share and hope to get comments on it.
Here are some outputs based on cr88192 theoretical inputs:
Output for string of power of 2 occurences from A to E
http://lgwm.org/downloads/huffman/a2e.txt
http://lgwm.org/downloads/huffman/a2e.jpg
Output for string of power of 2 occurences from A to J
http://lgwm.org/downloads/huffman/a2j.txt
http://lgwm.org/downloads/huffman/a2j.jpg
The program's C++ source code:
http://lgwm.org/downloads/huffman/huff1.cpp
Regards,
Elias
"cr88192" <cr88192@NOSPAM.hotmail.com> wrote in message
news:9187e$4512f95d$ca83a8d6$27465@saipa
n.com...
>
> "Matt Mahoney" <matmahoney@yahoo.com> wrote in message
> news:1158869800.962307.236680@i3g2000cwc.googlegroups.com...
>
> yes, and this is why I am saying, to get a file with a bunch of power-of-2
> probabilities (the theoretically worst case) you would need a file at
> least about 2^n symbols long...
>
> so, ignoring LZ or similar:
> ABBCCCCDDDDDDDDEEEEEEEEEEEEEEEE
> ...
>
> this defeats the logarithm, by making an exponential probability
> distribution...
>
>
> so, my thought is, for any sane input the worst case, or even anything
> near it, is highly improbable. however, it is fairly common to get a file
> that produces trees that do not fit so nicely within a 15 bit limit, but
> this is different...
>
>
>
| |
| Mark Adler 2006-09-22, 9:55 pm |
| cr88192 wrote:
> though theoretically possible, it is difficult to imagine input that would
> provide these results.
Actually it's quite easy to imagine. However it would be absurdly long
and repetitive and would only occur during malicious testing of the
algorithm. However if you want your algorithm to always work (which i
think is a good criteria) and you don't want to deal with n-1 bit
codes, then you have to deal with this.
In any case, if you limit your code lengths to, say, 15 bits like
deflate does, then it is very common in practice to exceed that code
length with reasonable sized inputs.
Matt Mahoney wrote:
> A symbol with probability p always has a Huffman code of length lg 1/p
> bits, rounded up or down, where lg means log base 2. So the longest
> possible symbol is ceil(lg 1/p) for the least probable symbol.
Well, that would be nice, but it's not the case. You can construct
cases using Fibonacci numbers for the frequencies in which the longest
code is several bits longer than your quoted limit. Having said that,
yes, you can come up with a correct limit based on the total number of
cases you're coding that is usually much less than the number of
symbols minus one. But it can also be more than your defined maximum
code length, requiring that you do something to the Huffman algorithm
or post-processing the results to length limit the codes.
mark
| |
| Mark Adler 2006-09-22, 9:55 pm |
| Mark Adler wrote:
> Matt Mahoney wrote:
>
> Well, that would be nice, but it's not the case. You can construct
> cases using Fibonacci numbers for the frequencies in which the longest
> code is several bits longer than your quoted limit.
I should have provided an explicit example of that. Suppose I wanted
to limit my code lengths to 15 bits. Using the rule above, I would
figure I'm safe if allow only 2^14 samples, even rounded up that should
only give me at most a 15-bit code. However if my frequencies adding
up to 16384 are a Fibonacci sequence, or close to it, my maximum code
length from the Huffman algorithm will actually be 23 bits.
Very roughly, the maximum length will be around the golden ratio times
the log base 2 of the number of samples. If I really wanted to stick
with the Huffman algorithm as is, and limit my code lengths to 15 bits,
I would have to limit the input to something less than 512 samples! (I
think the actual limit is 376 samples, but I'd have to check my
arithmetic on that.)
mark
| |
| cr88192 2006-09-22, 9:55 pm |
|
"Mark Adler" <madler@alumni.caltech.edu> wrote in message
news:1158944668.418698.230760@m7g2000cwm.googlegroups.com...
> cr88192 wrote:
>
> Actually it's quite easy to imagine. However it would be absurdly long
> and repetitive and would only occur during malicious testing of the
> algorithm. However if you want your algorithm to always work (which i
> think is a good criteria) and you don't want to deal with n-1 bit
> codes, then you have to deal with this.
>
I meant sane data, and not some specially designed and very long file (of
the sort mentioned in a later post).
> In any case, if you limit your code lengths to, say, 15 bits like
> deflate does, then it is very common in practice to exceed that code
> length with reasonable sized inputs.
>
yes.
I mentioned elsewhere the teqnique I used to limit to this length...
> Matt Mahoney wrote:
>
> Well, that would be nice, but it's not the case. You can construct
> cases using Fibonacci numbers for the frequencies in which the longest
> code is several bits longer than your quoted limit. Having said that,
> yes, you can come up with a correct limit based on the total number of
> cases you're coding that is usually much less than the number of
> symbols minus one. But it can also be more than your defined maximum
> code length, requiring that you do something to the Huffman algorithm
> or post-processing the results to length limit the codes.
>
ok.
> mark
>
| |
| Matt Mahoney 2006-09-23, 9:55 pm |
|
Mark Adler wrote:
> Mark Adler wrote:
>
> I should have provided an explicit example of that. Suppose I wanted
> to limit my code lengths to 15 bits. Using the rule above, I would
> figure I'm safe if allow only 2^14 samples, even rounded up that should
> only give me at most a 15-bit code. However if my frequencies adding
> up to 16384 are a Fibonacci sequence, or close to it, my maximum code
> length from the Huffman algorithm will actually be 23 bits.
>
> Very roughly, the maximum length will be around the golden ratio times
> the log base 2 of the number of samples. If I really wanted to stick
> with the Huffman algorithm as is, and limit my code lengths to 15 bits,
> I would have to limit the input to something less than 512 samples! (I
> think the actual limit is 376 samples, but I'd have to check my
> arithmetic on that.)
>
> mark
Yes, you're right. Here is an example of a longer code:
n code
85 0
53 10
32 110
21 1110
11 11110
10 111110
10 111111 (p = 10/222, lg 1/p = 4.47)
and an example of a shorter code:
n code
999 0
1 1 (p = 1/1000, lg 1/p = 9.97)
-- Matt Mahoney
|
|
|
|
|