| Author |
Re: How to write 3 symbol alphabets to bit stream?
|
|
| moogie 2004-06-25, 6:56 pm |
| i would have never thought about combining two or more symbols to
effectively tranfsorming the bit stream from a 3 symbol alphabet to a
larger alphabet and allowing the huffman coder to give more optimal
bit representation. Clever.
I understand huffman codes and have created an huffman tree/code
creater which i could use for this application.
The probablilities i gave are the average probablilities. In real life
they will vary a little ammount.
e.g. once such case will have probablilities:
0.24967804821660688
0.3888744059921865
0.3614475457912067
Yes, I did miscalculate the optimal bits required in my first post.
Please do correct me if i am still wrong in my thinking for the above
probablilites i derive the optimal bit representations should be:
1.6254829276750897
1.4166883910117203
1.45782868131319
using the following forumla:
(1/3-PROB)*(log(3)/log(2))+(log(3)/log(2)) where PROB is the
probablility of a certain symbol.
I will try using huffman codes with the combination of symbols to see
how well it achieves the optimum bit representations.
Thanks!
"David A. Scott" <daVvid_a_scott@email.com> wrote in message news:< Xns950BA154C4C58H110W296LC45WIN3030R@130
.133.1.4>...
> Michael Schöberl <MSchoeberl@ratnet.stw.uni-erlangen.de> wrote in
> news:2jdbi0F10jc8nU1@uni-berlin.de:
>
>
> With the three probalities given
> Symbol a = .375
> b = .375
> c = .250
> an optimal huffman table in order of high probanility to low is
> as follows for the nine states.
> aa = 100
> ab = 010
> ba = 110
> bb = 001 Note the first 4 states equal likely 9/64
> ac = 101 Note the next 4 states eaual likelely 6/64
> ca = 011
> bc = 111
> cb = 0001
> cc = 0000 The least likelyest 4/64
> note that in 7 of the pairs you use 3 bits for an average of
> 1,5 bits per symbol.
> Go to more states for a better fit.
>
> This is what one would do for a bit stream. But its also
> the table you would use for a bijective file compressor
> to bijectively file compress pretend the ending are all c's
> if the code actually ended in c or c followed by any number of
> a's add accc... Note you end end after picking first or second
> symbol
| |
| David A. Scott 2004-06-25, 6:56 pm |
| budgetanime@mystarship.com (moogie) wrote in
news:e353ade.0406172017.b61f37c@posting.google.com:
>
> e.g. once such case will have probablilities:
>
> 0.24967804821660688
gives 2.00185911031008809930435169850278
> 0.3888744059921865
gives 1.36262380885070434233747791811146
> 0.3614475457912067
gives 1.46814180018051756241153899683201
>
> Yes, I did miscalculate the optimal bits required in my first post.
>
> Please do correct me if i am still wrong in my thinking for the above
> probablilites i derive the optimal bit representations should be:
>
> 1.6254829276750897
> 1.4166883910117203
> 1.45782868131319
For the multiple symbol huffman you don't need the optimal
lengths that existed for the optimal arithmetic to derive
the lengths of the multi symbol huffma table.
>
> using the following forumla:
> (1/3-PROB)*(log(3)/log(2))+(log(3)/log(2)) where PROB is the
> probablility of a certain symbol.
>
try this formula (ln(1/p)/ln(2)) with p = .25 you get 2 for
p = .125 you get 3 and p = 1 gives 0 this the symbol length or try
-(ln(p)/ln(2))
>
> I will try using huffman codes with the combination of symbols to see
> how well it achieves the optimum bit representations.
>
>
The target length for the optimal arithmetic is
compressed length = (p1*l1 + p2*l2 +p3*l3)*n
max compressed length = l1*n
min compressed length = l3*n
where n is large and there are small changes due to end effects
pn probability of symbol subn and ln is length of symbol subn
l1 is length of longest arimetic symbol while l3 was for shortest.
It takes a lot of work to calculte the average length for the huffman
but it can be done. For the huffman max and min compressed length just
look at result table and calculate from witch entry has largest length
per starting symbol and which has shortest. Then you can compare that with
an arithmetic limits. General the huffman will have both a longer max
length and shorter min length as you increat the numbers of entries in
the huffman you approach the arithmetic. In the example we are talking
about you go from 3 to 9 to 27 to 81 ,,, to 3**n symbol so the table
would get big fast. There are also alternate ways to encreast the size
from the values a b c you could go to aa ab ac b c thats 3 to 5 values
you have to be careful which one you use be there are many huffman
multisymbol approaches.
I however would not do the huffman and instead go for the gusto with
2 binary 2 state fixed encoders each with 64 bit state registers. Its
just my nature but the huffman ok is ok.
David A. Scott
--
My Crypto code
http://cryptography.org/cgi-bin/crypto.cgi/Misc/scott19u.zip
http://cryptography.org/cgi-bin/crypto.cgi/Misc/scott16u.zip
http://www.jim.com/jamesd/Kong/scott19u.zip old version
My Compression code http://bijective.dogma.net/
**TO EMAIL ME drop the roman "five" **
Disclaimer:I am in no way responsible for any of the statements
made in the above text. For all I know I might be drugged.
As a famous person once said "any cryptograhic
system is only as strong as its weakest link"
| |
| Michael Schöberl 2004-06-25, 6:56 pm |
| > e.g. once such case will have probablilities:
>
> 0.24967804821660688
> 0.3888744059921865
> 0.3614475457912067
as you want binary representation the optimum wordlength for
encoding a symbol is -log2(p) ...
so huffman is best for probablilities p = 2^-n
in your case you get wordlengths of
2.0019 1.3626 1.4681
sum(wordlength * p) gives the (longtime) average rate ...
optimum would be 1.5604 Bits/Symbol
the wordlength of 2 Bits is okay for Huffman ... but 1.3 and 1.4
is not possible so you would assign wordlengths of 1 and 2 ...
here goes the optimum ...
example of huffman-tree: a -> 00, b -> 1, c -> 01
with this you wold get an average of 1.6111 Bits/Symbol
for the double-symbol encoding you get probablilities of
0.0623 0.0971 0.0902
0.0971 0.1512 0.1406
0.0902 0.1406 0.1306
and optimum wordlengths of
4.0037 3.3645 3.4700
3.3645 2.7252 2.8308
3.4700 2.8308 2.9363
this gives an average of 3.1207 Bits / 2 Symbols
(the same as above ... )
now I constructed an example huffman-tree and got these
mappings:
aa -> 1111 ab -> 011 ac -> 1110
ba -> 001 bb -> 110 bc -> 101
ca -> 101 cb -> 100 cc -> 010
the wordlengths are as you can see:
4 3 4
3 3 3
3 3 3
sum(wordlength * p ) gives the average ... in this case
3.1526 Bits / 2Symbols ...
this gives 1.5763 Bits / Symbol and is better than the
1.6111 Bits / Symbol for the huffman with one-Symbol
encoding ...
> Please do correct me if i am still wrong in my thinking for the above
> probablilites i derive the optimal bit representations should be:
> [...]
> using the following forumla:
> (1/3-PROB)*(log(3)/log(2))+(log(3)/log(2)) where PROB is the
> probablility of a certain symbol.
looks complicated ;-) ... where is that from ..?!
as you can see above you only need -log2( p ) ...
if you have no log2 you can substitute
log2( x ) = ln(x) / ln(2)
the hardest above is the construction of a huffman tree from
the probabilities (on a piece of paper) ... but I'm sure you can
find that somewhere on the net ...
bye,
Michael
| |
| moogie 2004-06-25, 6:57 pm |
| > try this formula (ln(1/p)/ln(2)) with p = .25 you get 2 for
> p = .125 you get 3 and p = 1 gives 0 this the symbol length or try
> -(ln(p)/ln(2))
Cheers, i admit that proablilies was never my strong point...
> The target length for the optimal arithmetic is
> compressed length = (p1*l1 + p2*l2 +p3*l3)*n
> max compressed length = l1*n
> min compressed length = l3*n
> where n is large and there are small changes due to end effects
> pn probability of symbol subn and ln is length of symbol subn
> l1 is length of longest arimetic symbol while l3 was for shortest.
>
> It takes a lot of work to calculte the average length for the huffman
> but it can be done. For the huffman max and min compressed length just
> look at result table and calculate from witch entry has largest length
> per starting symbol and which has shortest. Then you can compare that with
> an arithmetic limits. General the huffman will have both a longer max
> length and shorter min length as you increat the numbers of entries in
> the huffman you approach the arithmetic. In the example we are talking
> about you go from 3 to 9 to 27 to 81 ,,, to 3**n symbol so the table
> would get big fast. There are also alternate ways to encreast the size
> from the values a b c you could go to aa ab ac b c thats 3 to 5 values
> you have to be careful which one you use be there are many huffman
> multisymbol approaches.
Yes, i have also noticed how hard it was to get and average length
(apart from actually performing the compression and counting the
frequencies of the lengths of codes). The table does indeed get large
very quickly. As a compromize between better compression, table size
and my algorithms complexitiy i have decided to go with 4 symbol
combinations (i.e. 3^4 or 81)
>
> I however would not do the huffman and instead go for the gusto with
> 2 binary 2 state fixed encoders each with 64 bit state registers. Its
> just my nature but the huffman ok is ok.
Hmm, if i knew what you meant by "2 binary 2 state fixed encoders each
with 64 bit state registers" i might agree ;) I will have to research
into that as well.
Cheers
N Klaebe
| |
| David A. Scott 2004-06-25, 6:57 pm |
| budgetanime@mystarship.com (moogie) wrote in
news:e353ade.0406201717.5132a044@posting.google.com:
>
> Hmm, if i knew what you meant by "2 binary 2 state fixed encoders each
> with 64 bit state registers" i might agree ;) I will have to research
> into that as well.
>
> Cheers
>
>
Here is what I mean the first hard binary arithmetic coder says is
this symbol A or the symbols B or C. If it was A you start over. If it
was B or C you go to second binary coder and see if it was B or C.
So each coder always has only two states. You can think of it like
a Huffman tree except the splits and not exactly balances 1 0.
David A. Scott
--
My Crypto code
http://cryptography.org/cgi-bin/crypto.cgi/Misc/scott19u.zip
http://cryptography.org/cgi-bin/crypto.cgi/Misc/scott16u.zip
http://www.jim.com/jamesd/Kong/scott19u.zip old version
My Compression code http://bijective.dogma.net/
**TO EMAIL ME drop the roman "five" **
Disclaimer:I am in no way responsible for any of the statements
made in the above text. For all I know I might be drugged.
As a famous person once said "any cryptograhic
system is only as strong as its weakest link"
| |
| David A. Scott 2004-06-25, 6:57 pm |
| "David A. Scott" <daVvid_a_scott@email.com> wrote in
news:Xns950EC6D85848BH110W296LC45WIN3030
R@130.133.1.4:
> budgetanime@mystarship.com (moogie) wrote in
> news:e353ade.0406201717.5132a044@posting.google.com:
>
>
> Here is what I mean the first hard binary arithmetic coder says is
> this symbol A or the symbols B or C. If it was A you start over. If it
> was B or C you go to second binary coder and see if it was B or C.
> So each coder always has only two states. You can think of it like
> a Huffman tree except the splits and not exactly balances 1 0.
>
>
As a model check out
http://bijective.dogma.net/compres1bse.htm
its for 26 symbol compress A to Z to match ENGLISH.
David A. Scott
--
My Crypto code
http://cryptography.org/cgi-bin/crypto.cgi/Misc/scott19u.zip
http://cryptography.org/cgi-bin/crypto.cgi/Misc/scott16u.zip
http://www.jim.com/jamesd/Kong/scott19u.zip old version
My Compression code http://bijective.dogma.net/
**TO EMAIL ME drop the roman "five" **
Disclaimer:I am in no way responsible for any of the statements
made in the above text. For all I know I might be drugged.
As a famous person once said "any cryptograhic
system is only as strong as its weakest link"
| |
| moogie 2004-06-25, 6:57 pm |
| >
> As a model check out
> http://bijective.dogma.net/compres1bse.htm
> its for 26 symbol compress A to Z to match ENGLISH.
>
Thanks, i will look into it.
|
|
|
|