For Programmers: Free Programming Magazines  


Home > Archive > Compression > January 2006 > Re: Quantized Indexing Source Code (update & alg. history)









You are viewing an archived Text-only version of the thread. To view this thread in it's original format and/or if you want to reply to this thread please [click here]

 

Author Re: Quantized Indexing Source Code (update & alg. history)
nightlight

2006-01-11, 6:55 pm

This is the post:
http://groups.google.com/group/comp...f1ee67d18b63f5a

All you are asking is right there, answered, plus many you didn't ask.
For example, you ask here:

> How many bits did this so called test of yours compress to?


There it answers:
----
The size it produces is 1584962.50... bits, which compared
to the exact N*log(3) entropy has an excess of 1.62 e-04 bits on the
total of 10^6 symbols (i.e. the excess per symbol is 1.6e-10 bits).
---

> And did you include the lengths of the extra fields for your
> so called length and your so called number of ones?


There it says how it added these numbers to match the AC coding
conditions from the numbers QI.exe gave it (which wasn't running
a test for AC but stand alone function) -- It says righ below it added
70 bits (which was very generously rounded up, all pieces separately):

--------
To compare that with AC output size, one option is to make AC work in
static mode without adapting to probabilities and make it not count the
transmission of frequency table or number of symbols n (which is the
same condition that the QI.exe figure applies to).

Alternatively, you can add to QI's output the size to transmit N, A and
the frequency table. QI.exe has a command QI cl<int> which computes
self-delimiting size for <int>, or just "QI cl" to list a table for
common values. There you get for N=10^6 its self-delimiting length
L(N)=27.543 and for L(A)=2.49 bits. The cost for frequency table with
QI/EC is the log of the binomial C(N+A-1,A-1), for N=10^6 and A=3,
which is log(C(1000002,2))=38.863 bits, which totals (each rounded
separately, which they don't need to) 28+3+39=70 bits to be added to
QI's output to match the adaptive AC's coding conditions. Since the
QI's output was essentially the entropy, the QI's total is 70 at most
whole bits above the "entropy" (note the "entropy" N*log(3) didn't
include N; also in high entropy limit QI doesn't need to transmit freq.
table, but one would need to modify AC to work in high entropy limit,
so I added table to QI, which distorts a bit comparison to entropy H).
-----------


> And was it really a static zero entropy type of compression where
> each of the 3 symbols assumed equally likely?


That was answered right there on the top line: the equiprobable symbols
are "high entropy limit" (as opposed to "low entropy limit" which is
for highly sparse arrays, which one symbol vastly dominating others).

------------
The QI.exe file which you may already have (from the source; current
source version is 1.03) has a command line option to test it on that
same input (which is a high entropy limit for multi-alphabet coding,
and which in I call radix codes):

QI cr3 n1000000 i100

which tells it to code inputs in radix 3 (this can be any 32 bit value
above 2), to use input of 1 million symbols (there is a constant
MAXRDIG in Intro.h which limits the input size to max 2^20 or 1M
digits, you can change that to allow larger sizes e.g. to 16 MEG) and
to run the test 100 times on 100 random inputs (i100 for 100
iterations).
----------------------------

The comparison to Moffat98 answer (for 10^6 symbol file): QI output
(adjusted generously to favor AC) came out 3403 bits shorter than AC's:
----------------------------
Now, running the Moffat98 coder (in 8 bit max symbol coding mode &
frugal bits enabled), it outputs: 1588435.52 bits (this is avg. over
100 iterations), which is 3473 bits above the entropy, or 3403 bits
above the comparable QI output size. (Note that Mofat98 coder has
generally a slight bias to code worst for the max entropy inputs, but
it gains in return on very low entropy inputs.)
-------------------------------

All that you can run right there and look at the source that there was
no cheating. The test was run on 100 random inputs. And each encode
decode cycle checked that decoded data matches inpuit. For QI, the
outputs all come out to exactly same size, i.e. the QI takes for size
in high entropy limit the max one can get for converting number of 10^6
digits given in radix 3 into binary number (which is QI's output for
high entropy limit coding, essentially the radix coding/decoding).

David A. Scott

2006-01-14, 6:55 pm

"nightlight" <nightlight@omegapoint.com> wrote in
news:1137211901.606040.273310@z14g2000cwz.googlegroups.com:

>
> The AC generally doesn't have a deterministic output size due to
> truncation of infinite fractions when computing the split of the range
> in terms of 32 or 64 bit integers, and these may add up differently
> based on variation in frequencies & symbol order. Thus the AC output
> will generally vary by several bits (usually O(log(n)) bits) and
> Moffat98 coder shows this small variation on random samples from the
> Set_QI (you can verify that, Moffat98 coder is publicly avialable,see
>


Actually one of the great things about an aithmetic coder that you
don't seem to grasp is that the size of output changes very little
based on order. The order can be random or sorted. The output will be
roughly the same for an adaptive arithmetic coder. Another point if one
is using a standard adaptive coder to handle say compression there is
a cost as the model is adapting to the problem. Take arb255 as such
a compressor. Its bijective there are no gaps any file will compress
or decompress. If you feed it a file of just random A B C it will
compress to longer than QI since it will handle any set of data. In
fact the file of 1,000,000 A compresses to less than a 1/1000 of what
arbabc ccompresses to. However neither compressor is wasting space
both are doing what they are desinged for. It not clear to me from
what you wrote if you understand Moffat code well enough to modify
it to do the proper conparasion. Why do I say this. I say it because
you seem to think that my compression designed to compress based on
equal probability of each symbol should have compressed the all A file
smaller. It didn't because it was designed for the case you claim
you coder is doing. If the 1,000,000 file is truely random than its
possible the file could all be all the same symbol. Its as possible as
any other given series. This is true since all the combinations of
series of three are possible. True its not likely but it represents
a good worse case test which is why I choose it.

I just want to clarify for others reading this I say for a given
sequence that arithmetic compresses to basically the same size
and order is not a function is not true based on design of many
arithmetic compresses due to how they teminate. This problem becomes
larger when there is one symbol occuring at a much higher rate than
the others. Since the compressor will not ouput a single bit for long
streches. of that symbol. And then when the low probability symbol hits
it will out put a large number of bits. Termination is given little
thought in most arithmetic compressor designs. Often as in versions of
fpaq0 a low probability symbol is written at end to basically flush
out the unused bits. To attempt to keep things easy when one is using
a large number of states there is an EOF symbol that is using up space
in 2 state coders like fpaq0 instead of table space it checks at some
predetermined point to output a "continue" or "nocontinue" bit. This
wastes some space but it is a very small part of the compressor and
most times people could care less about squeezing a few bytes. But
this is the kind of compressor where symbol order should not make
a length difference based on symbol order. Assuming a properly
designed 2 state coder.

I don't want to argue to hard that arithmetic is perfect it has
problems. Though the coder arbabc is using 64 bit high and low
the file length seems to be optimal. In my code you change the
number of bits used here by changing the number of F's in one
line where Top_value is defined. This code designed to be easy
to modify and test. Changeing this one line see arbx and arb255
changes the length of the high and low and each change of one bit
results in half the number of possible internal states of the
compressor. This will result in using values farther and farther
away from the 1/3 that the coder is trying to use. This means
eventually the coder will produce some files longer than the
optimin and some files shorter. People can test this for them
selves. The code however will still have no errors. Its just
the spread will increase slightly. If the files random like
in your test I suspect you will have to reduce this several bits.
I will not quote theory as to how many bits need to be changed
the code is there for people to play with. The neat thing is
that even in the reduced state if will have NO EORRORS. It
will have NOT GAPS. It will be a COMPLETE MAPPING. And even
here many files would compress shorter than the full 64 bit
version and some longer. Since both full use bijective the
output space of all files to the space of input files.

ENJOY!


David A. Scott
--
My Crypto code
http://bijective.dogma.net/crypto/scott19u.zip
http://www.jim.com/jamesd/Kong/scott19u.zip old version
My Compression code http://bijective.dogma.net/
**TO EMAIL ME drop the roman "five" **
Disclaimer:I am in no way responsible for any of the statements
made in the above text. For all I know I might be drugged.
As a famous person once said "any cryptograhic
system is only as strong as its weakest link"

Sponsored Links







Also available: Server administration forum archive | Web Design forum archive | Software forum archive | Hardware reviews archive

Copyright 2008 codecomments.com