| nightlight 2006-01-24, 9:55 pm |
| > If so, then your collection at least had a
> very strong tendency.
What kind of "tendency"? To show what AC shortcomings were
solved by QI? At the top is says what are the 4 elements that
follow, and that is exactly what was done:
"For the rest of this post I will address the QI's
solutions for the four principal remaining weaknesses
and tradeoffs introduced by the AC algorithm."
>
> This depends. If the CPU has to idle for many cycles,
> it will draw more power during the idling that during
> the data fetch.
The pertinent comparison is between mul/div vs memory read
(RAM or ROM). The latter takes less power.
>
> Do you mean (N^2)/4 or N^(2/4) = sqrt(N)?
The binomial table is quadratic (see N2 p. 9 [T3] where table
sizes are discussed). Hence the size is N*N/4. The C source
shows a more accurate value: N*N/4 - N +1 (see Qi.c function
new_qbct(), the rows K=0,1 are not stored).
> Thus, table size grows O(N^0.5) where N is the size
> of the message.
Double _no_: It is O(N^2), not O(N^0,5).
The N is not the size of the message, it is table size. The
table size determines max block which can be indexed as single
index. The input string length may be longer or shorter. The QI
modeler performs decomposition of the input string S(L) of
length L into enumerative classes (see [T2] pp. 27-35). The
output of modeler is a set of strings X1(n1), X2(n2),... The
table size N only limits n1<=N, n2<=N,... The strings X1(n1),...
_may_ be segments of S(L), such as for the simple quasi-
stationary memoryless source. For stationary source one would
also have n1=n2... Generally, neither is true i.e. the strings
X1(n1),... have no simple relation to S(L) or to each other e.g.
they could be fragments of the BWT output column R, each of its
own length (determined by the optimal, in MDL sense, partition
of R, see (A3) & my previous reply).
> That is, in realistic setups, the question remains whether you
> receive at all a point where the advantage pays off. Or,
> mathematically, there is a lower bound M on the memory
> requirements and a random source such that for all table sizes
> smaller than M, QI is outperformed by AC.
Note that even for table limits N=32 or 64, where exact EC has
been used, in hierarchical approximation (see Oktem) is fairly
competitive with AC. With QI, when you change the max block size
N, what changes is the balance between how much is being output
as index (within enumerative class) and how much as "class tags"
(the model info, such as count of 1's). Since QI packages "class
tags" if they are compressible (see "entropy pump" in [T2] p.
27), the only difference is how much work it does. It will work
faster if it uses maximum blocks provided by the table limit N.
But the upper limit N on table size depends on CPU cache and if
you make it too large it can slow the coder down. Value N=1K
happens to be the right balance for most machines tested,
although on newer machines, the table limit N=2K even 4K works
faster than N=1K. Now, that doesn't mean that X1(n1) has to have
n1=1K or 2K or some such. N is only the upper bound on lengths
of enumerative classes strings modeler can hand to coder.
The the effect of the input S(L) length L is that speed ratio
generally increases for longer L. The high level reason for this
is that the QI's better division of labor pays off better at
larger scales. The lower level mechanisms by which this general
pattern is realized may vary depending on coding task e.g. on
stationary source with p=probability of 1, after multiple blocks
of length N, the table elements around the line y=p*x will load
into cache, so that later passes will work faster.
Note also that timers & timing methods included in [QIC] QI.exe
may not be very accurate on short blocks (for N<128) due to
timer granularity (~280 ns per tick). The EC.exe program (and
source) includes type of loop needed to tests speed of exact EC
on short blocks (N=32).
> It remains to check what M is and whether M is small enough
> to make the above a realistic argument for the proposed
> application, namely embedded devices.
If the total input length L is very small, e.g. 32 or 64 bits,
the AC and QI times are so short and comparable to timer
ganularity and also subject to general OS time-slicing effects,
that the variation from test to test will be of similar
magnitude as the times themselves. One would need a batch method
used in EC.exe loops to refine these timings (QI will still come
out faster since it does much less work and the tiny table
easily fits in cache).
> Pretty small block sizes, actually. There's no blocking
> requirement for AC, so it may happen that for a given
> random source AC outperforms QI in the long run. Now
> since it has been understood that there are random sources
> where AC might perform better,
How does that follow? You need a finer resolution of
the concepts 'input length L', 'table limit N' and coder's
message lengths n1, n2,... There is no basis in theory
or in experiment for any such 'conjecture', be it for
speed or for compression effectivness.
> what are the random sources QI might outperform AC.
Any. As explained earlier, QI can code exactly as AC, in stream
mode, if there is anything for which that is the best method,
with the only difference that QI quantization is more accurate
(due to QI's better utilization of g bit mantissa, which AC
allows to vary to well below g bits; plus due to QI's integer
arithmetic without needless infinite fractions which have to
be dropped by a finite precision AC, one way or the other).
> And if so, are they relevant to the target application? As
> I read you, QI shows an advantage in the high entropy domain.
> What is your target application? (Video/audio data that come
> out of a predictor are IMHO not 'high entropy' because there's
> even for lossless a compression gain of 1:2 to 1:4. But then,
> what's high entropy?. How high?)
I think you need to clarify first what is "perform"? You seem to
have switched somewhere along the way from speed (table sizes,
cache) to compression ratios. For any set of messages M=
{M1,M2,...}, provided all coders have a "perfect" model of M, QI
will _always_ compress M to a smaller or equal size as AC
(trivially, AC or any compressor, can compress individual
message from M such as M1 shorter by simply assigning it
codeword of length 1).
Of course, with "imperfect model" of M, a random pick of
codewords (satisfying Kraft inequality) may do better on M than
any other coder. Regarding the "imperfect" models of M, though,
as illustrated by array "Vary" (p.10, [T3]), the QI's
"descriptive" modeling is much more resilient to "surprise" than
the usual "predictive" AC modeling. The "descriptive" method is
always "surprised" less than the "predictive" method.
The AC codewords are an approximate form of QI codewords, which
use needless scaling of addends (which results in infinite
fractions for exact AC which are "forgotten" by a finite
precision AC), randomly fluctuating mantissa length (a very bad
idea for precision), sub-optimal top down quantization (the
largest addends quantized first, in contrast to optimal bottom
up QI quantization of (21) in [T3]).
The difference is essentially as if you went out and randomly
changed all codeword lengths for set M and then checked if the
total violates Kraft inequality and retried the whole thing if
the violation was found. If the original set was optimal (the
shortest total output size on M for given precision g), as it is
with QI (assuming M is a valid enumerative class or a set of
such classes, i.e. that we have a "perfect" model of M used by
all coders), anything you get from your random variation will
produce always a larger (or at best remain equal if you are
very lucky) output for M.
Also, I have no idea what is this about "high entropy"? I did
mention radix codes, and we used that for little tests here
in this thread. But that is not the only advantage. You can
check the table in [T3] or do tests with the source [QIC].
The compression advantage is across the spectrum of inputs.
The point of (A3) in the summary is that those small advantages
are merely a tip of the iceberg, and tip on which AC is at its
best. Compared to AC, QI has intrinsically higher precision,
lower noise coding and finer-grained more flexible modeling
interface and language more suitable for finite sequences. The
test results on "conventional" compression tasks show only a
relatively small QI compression advantage, since "conventional"
is by definition the subset of finite sequences where AC
performs well. (A3) points out that there is much more beyond
the "conventional" subset and the QI compression gains are not
limited to the few percent shown in the [T3] table.
>
> Huh? N^3/2 grows faster than N^2/4.
With full parentheses that should be N^(3/2). The other one
is (N^2)/4. It should have been clear from the immediate
context that N^3/2, which came from N*sqrt(N), is N^(3/2).
>
> How does this compare to AC?
The genric QI results vs Moffat98 AC are in the table in [T3].
You can extrapolate the top rows speed ratios (which is where
sparse coder applies) by such factor.
> How to specific high-speed implementations as MQ?
I have given you [Said04] reference which has results for MQ
(and other quasi-ACs), along with full precision AC. The MQ is
only 2-3 times faster than full ACs, which doesn't even come
close to QI vs full AC ratios, even for generic QI, much less
for specialized QI version optimized for some range of inputs
(which would be a more fair comparison against specialized ACs).
You could, of course, use runlength coders, which will run as
fast as QI on very sparse inputs, some coarse grained ones even
faster than QI. They will also have larger redundancy than even
AC and will perform very badly (in speed and compression) if the
input isn't exactly the "ideal" low entropy source for which
their codewords were precomputed.
> I don't deny that QI might be interesting. However,
> I don't like unproved claims...
Placing the source code for public access is equivalent to
publishing a mathematical proof (in non peer reviewed preprint).
Now, if you can't compile source on your system, that is
equivalent of saying you can't read the published proof because
you don't read English. That is not the same thing as "unproven
claims." There are hundreds of millions win32 PCs which can
compile or run the [QIC] code.
>
> ... as this one.
The QI source code has been out there. With the level of
hostilty from some corners, anyone who wanted to empirically
falsify that claim, has a plenty of chance. Or, with preprints
publicly available, which show why it is so, one could have
shown that some key mathematical step is fatally wrong, making
the whole scheme flawed. Aside from minor typos and a minor
complaint of one of the authors whose EC contribution was cited
in [T3] (he though I should have put emphasis on a different
aspect of his contribution, which I will in the next revision),
no falsification of even a minor point has turned up.
>
> Does it? With blocking? It means that the "memory"
> of the coder and thus probabilities are only based
> on a relatively short sample set, where "short" =
> block size.
The blocking doesn't limit the modeler state variables or how
the modeling is done. The modeler considers entire input
sequence as a request to provide minimum decodable description
of that sequence, taking into account any shared knowledge coder
& decoder may have about it. Adaptive/predictive modeling is a
proper subset of descriptive modeling. Descriptive modeling
merely removes needless self-imposed constraints of the adaptive
modeler.
The blocking only limits how big is the maximum single index
that coder can compute. If the enumerative class provided by
modeler has sequences longer than coder's table limit N, coder
breaks them into smaller blocks. The block boundaries don't
cause any bit fraction loss since QI codes these fractions via
mixed radix codes (cf. N4 p. 9 [T3]).
Your conclusion is no different than concluding that one can't
compute Pi to more than 15-16 digits on a PC, since that is how
many digits floating point unit handles in a 64 bit 'double'.
> I don't know how they implemented MQ, but you can get
> it to a speed where it is close to writing out bits
> uncompressed. At least, I would want to do an
> independent measurement.
Well, you can emil him and ask. From the paper, he did seem to
go out of his way to optimize the coders tested. Moffat98 also
has a test of their coder against Q coder, which they claim is
the coder and implementation to beat (on speed), again showing
only 2-3 ratio, which is not even close to QI's ratios. Based on
those two, I decided not to waste any time 'reserching' the
quasi-AC flora and fauna.
Note also that even a mere examiniong and copying of
uncompressed bits, bit by bit, isn't the fastest way to do it.
Look for example at the QI's block coding loop (EncDec.c) which
at the top has:
do n+=32; // Encode next 32 MPS
while((w=*s++)==0); // find next nonzero dword
Anything that examines and merely copies bits, bit by bit,
on a sparse array will run slower than doing it 32 or 64 bits
per instruction (the output index was padded with 0 bits via
memset). And the above isn't even optimized QI code (e.g. it
doesn't need to do n+=32 inside loop, since the pointer's
increment already keeps track of the count, hence we can
calculate n outside of the loop as n=((int)s-s0)<<3; further
one could unroll loops as done in Moffat98,... etc).
>
> Not really. If you make this claim, you possibly should
> back it up because I do not yet find the arguments too
> convincing.
Regarding speed & power consumption, as stated above, the QI's
division of labor is clearly better, as explained there at the
two levels, within the coder and between the coder and modeler.
Taking calculations out of the coding loop for the values which
are universal constants for a given class of sequences is better
for speed and power consumption. Reorganizing modeler calls to
coder, so they don't go symbol by symbol but for the entire
sequence of symbols is better for the speed and power
consumption.
What you're mixing in, as your next comment shows, is the
separate question, whether such organization will affect
compression negatively. A fair question, but whatever its
answer, the above "better for speed & power..." is still
perfectly valid, as stated.
> The question is: Is this a useful thing to do? I do
> have situations where I know in advance that the bitstream
> I'm pushing into the coder consists actually of several
> interleaved random sources which behaive quite differently.
> AC allows me to define contexts. Can QI do that? With its
> tight modelling, is it able to exploit this statistics?
Of course it can, in more ways than one. That was already
answered at the end of the original post:
M1.
http://groups.google.com/group/comp...7c6f329038a1bdc
which points to methods (a) and (b) in an earlier post:
M2.
http://groups.google.com/group/comp...314ff87da597fad
The more relevant method for your setup is (b). That is all
assuminmg only use of the existent AC modeling engine i.e. which
parametrizes the properties of the finite sequences in the
probabilistic language and which converts all that it can
extract from the sequence into "probability of the next symbol".
Generally, that is only a coarse grained, impoverished way of
modeling the finite sequences. But, if as you say, that is all
one is allowed to have for the sake of argument, then QI can use
methods (a) and (b) above. The price paid is the compression
quality of adaptive AC and in case of method (a), the coding
speed of AC.
Of course, there is neither law nor theorem, be it in practice
or in theory, requiring that one has to use coarse grained
probabilistic parametrization on finite sequences or gratuituous
constraints and communication bottleneck "probability of single
next symbol" or model+code "online" (meaning analog line on-
line, a la Morse telegraph). Needlessly imposing upon onself
such gratuituous constraints is simply a result of conceptual
intertia and a herd mentality.
> Example: Consider a source consisting of two first-order
> Markov sources A and B. At an even timestep, I draw from
> A, at odd timesteps, I draw from B. With modelling, the
> implementation of an AC coder that is optimized to this
> situation is easy. What would I need to do with QI? Not
> exploiting the special nature of this source might be
> very wasteful.
QI would simply classify odd and even elements into two separate
enumerative classes, each having its own index. The QI output
will consist of 2 indices I1 and I2. Since QI also knows the
exact upper bounds M1 and M2 on I1 and I2 (these are all
integers with QI, M1 & M2 are SWI mantissas from quantized
addend tables), it would encode the top g=32 bits of I1 and I2,
call them D1 and D2, as digits in mixed radix M1, M2, i.e. as I
= D1 + M1*D2, and this "I" will have upper bound M=M1*M2, in
case you need to package similarly that index via mixed radix
with some other components of your output. The combined index
will be Iq consisting of concatenated leftover sections of I1
and I2 (which now fill the full bits since SWI addends which are
quantized full length upper bounds on I1 & I2, and whose
mantissas are M1 & M2, have all bits beyond the top g bits=0,
i.e. no additional redundancy beyond that contained in the QI
quantized addends is produced by these truncated I1 & I2, while
I has no additional error since it is computed as a 64 bit
product without any approximation) with the mixed radix number I
as the high 64 bits of Iq.
Note that AC will produce single index Ia ( longer than QI's Iq,
due to suboptimal AC quantization + unavoidable O(1) term at
least), but unlike QI's index Iq where QI knows its exact upper
bound M, AC's coarse grained probabilistic parametrization
requires it to normalize all its indices to have the same upper
bound 1.00 (since they 'supposed' to be 'cummulative
probabilities'), obliterating in the process the bit fractions.
Hence the AC's complete index is always in integer number of
bits, while QI's is in fractional bits (which can be packaged
exactly via mixed radix codes if there are more than just one
item to send, or in all cases via tapered/flat Huffman code,
where one wastes on average less than 1-log(e)+log(log(e)) =
0.086... bits per index from the exact bit fraction).
> If you care about compression performance, the only
> thing that is meaingful is the limit N->infinity.
Why? Have you ever coded an infinite sequence? Do you expect to
ever code one? It is meaningful only in the sense that such
coarse grained parametrization (in terms of infinite limit
values, after all the 'vanishing' terms can be dropped) has an
advantage of having often simple closed form expression which
are easy for human consumption.
> Is there a finite memory implementation of QI that
> runs optimal in this limit?
AC or QI implementations tested were limited to max inputs
of 2^30 bits. Any finite precision coder, AC or QI or
any other, has redundancy terms O(N) due to quantization.
Hence none of them will approach entropy in the limit n->inf.
For binary alphabet, QI's O(N) term is 4 times smaller than
AC's O(N) term (see (A3) and refes there, already discussed).
The QI table size limit, call it TN, has nothing to do with
these O(N) terms. (They would have a relation if one were
outputing blocks in whole number of bits, which QI doesn't
do, cf. N4, p. 9 [T3]). Of course, the addends at N=TN have
particular quantization error, which is a special case of
these O(N) terms, as do addends at N>TN and addends at N<TN.
{ The program QI.exe included with [QIC] has a command "ct"
which displays table stats, showng such redundancies for
the whole table, max, avg, last row, etc. Command "cbr"
shows the same for any row, plus individual quantizations,
on the last step and cumulative, for each binomial. These
commands, unlike the full coding tables, keep only one
row at a time, so they go up to much higher values N
e.g. N=1M while coding tables max is N=16K. }
> This wouldn't cover the random source example above.
> Of course, I can model this as some 2nd order markov
> chain, but I then pay the penalty of making it harder
> to adapt to it. In other words, I do not make optimal
> usage of my apriori knowledge on the source.
The method (b) in (M2), with the extra for your even/odd example
would have no problem with non-stationary even/edd example. The
quasi-stationary source for QI modeler means it has to select
the proper block boundaries (so that within a block it has
approximately stationary data i.e. a proper enumerative class).
The additional even/odd condition you specified only means that
this segementation is done separately for even and for odd input
array indices.
> What means "descriptive"? As I see it, you keep relative
frequencies or symbol counts here.
Minor point: relative frequences (that enumerative mode AC
would use) are less accurate parametrization of finite
sequences for limited precision coders than the exact
integer symbol counts (that QI uses).
> Whether you call this "descriptive" (as in, I have counted
> so and so much sequences of this kind) or "predictive" (as
> in, I predict the probability of a symbol due to the counts)
> is just a matter of language. Thus, what *is* the difference,
> leaving language issues alone?
No, that is not the difference. In "predictive" modeling scheme
modeler is trying to predict next symbol at position i+1 (in the
sense of calculating probabilities for values of the next
symbol) based _only_ on the symbols at positions 1,..,i. In
"descriptive" modeling there is no such restriction on what kind
of correlations symbols are allowed to have. Whenever the
symbols 1..i are not correlated well enough with i+1, be it
because i is too small or 1..i is low entropy data with too
little info altogether or because of nature of data, the
predictor pays a price for a wrong presumption.
Of course, any practical predictive modeler has to cheat the
"ideal" predictor scheme and use various ad hoc Escape schemes
(where encoder is allowed to look ahead of decoder and clue it
about sudden change) to get around the worst case penalties for
entirely wrong prediction that a "ideal" predictor would pay.
Consider array "Vary" = int32 { ..-2,-1,0,+1,+2,.. } which was
shown in [T3] p.10. { Any array with sudden changes will do
here, but this one had advantage (in view of 10 page limit) that
I could specify it unambiguously in less than half a line. }
Here you have a big "surprise" in the middle of the array, where
mostly 1's change to mostly 0's. Also at each 32 bit block,
there are little "surprises" of similar kind. Both QI and AC
were order-0 coders (i.e. no correlation assumed from a[i] to
a[i+1]), and both are allowed to assume quasi-stationary source,
i.e. that densities of 1's & 0's may vary along the array. The
predictive Moffat98 AC did very poorly here, especially for
shorter arrays where at N=4K it produced twice as long output as
QI. The QI quasi-stationary order-0 modeler used was very
simple: it looks at the whole array as a single block and splits
the block in exact half if the two halves would encode shorter
(including the cost to indicate the split, which uses longer
Huffman codes for the counts) and it is allowed to do the split
down to 1K and then stop. { NOTE: There are much better ways to
do this kind of modeling via sliding window splits, which are
nearly as fast as the simple method used but much more accurate.
There are also better ways to code counts instead of precomputed
Huffman tables for binomial distribution, but none of it was
used in the tests in [T3].}
The negative effect of the "suprprise" for the descriptive
modeler are limited to the second order items (small compared to
enumerative index), such as counts in this case, which cost
slightly more to encode when non-default block size is used for
enumerative class. With "predictive" AC modeler, the "surprise"
affected every "next symbol" (especially from the middle of the
array) with codeword lengths for MPS and LPS selected exactly
backwards, until eventually the frequences evened out (thus it
finally realized what is the actual MPS from now on), but that
happened at the very end of the array, well, too late. The
freuency scaling didn't help much for shorter arrays. Of course,
Moffat98 had additional di vantage with this kind of input
since it is tuned to give extra advantage to MPS, giving it
shorter codes than what even the past frequences suggest. So the
wrong MPS vs LPS for much of the second half, penalized it
more than necessary. This quirk of Moffat98 AC was explained
in the earlier post:
M3.
http://groups.google.com/group/comp...fa6336f483bbb89
There isn't really a good way for _pure_ predictive modeler to
fix this problem. Namely, if it changes the adaptation rate, to
adapt faster, thus scale accumulated frequences down more often,
that will make the coding of genuine stationary sequences less
efficient due to greater error in the probabilities (fewer bits
used for p, hence greater effect of single count fluctuations).
The only way to get around for practical "predictive" coders is
to cheat from the "pure predictive" ideal and clue in the
decoder to switch gears via some ad hoc escape mechanism. Since
"descriptive" modeler is not bound by some imagined "predictive
ideal" (but only to MDL considerations, it is much more
consistent and resilent in real life situations, where
everything is non-ideal. Of course, it doesn't mean that
"descriptive" coder has sworn never to select "predictor" as a
method to describe some sequence if that provides the minimum
length for total output. It only means that "prediction" isn't
set on a pedestal as the "only true way".
Naturally, as with any specialized compressor, good or bad,
there is trivially a subset of finite sequences on which "pure
predictive" modeler will do slightly better than the
"descriptive" modeler. You're welcome to offer an example of
array on which predictive modeler will do a lot better (of the
similar ratios as with "Vary" results) than descriptive, if you
believe there is any. I don't think there is. (While it is quite
easy to trick pure predictor to perform very badly.)
> All fine with me. For that, I would prefer:
>
> a) not to state results for tests that haven't
> been verified yet.
I consider providing publicly source code in the most commonly
available standard language C for the most common platform
win32/MSVC, equivalent to providing a mathematical proof in a
conventional formalism and in the most commonly read human
language.
> b) to have a source code here I can actually make
> use of. ...
>
> I would need to get rid of Windows.h/conio.h.
> A makefile is put toghether fastly, and I don't
> mind about the speed.
Sorry again about win32 & MSVC. Jasen Betts has made
a unix compilable version (which includes his windows.h
substitute and few changes for timer functions) which
he could email you. His email is in his posts in this
thread.
> The entropy coder is often only a minor part of
> an overall design, and you often do not have
> the choice of defining the coding order at
> your will.
I was talking about a more powerful and much faster
modeler+coder scheme which is opened by the existence
of a universally applicable, very fast across the spectrum
of inputs, high precision, low noise (the absolute optimum
at any arithmetic precision available) coder.
To take the full advantage of this largely unexplored
algorithmic territory (A3), being opened (the wonder
algorithm of the 90s, the Burrows-Wheeler transform
with its mesmerizing beauty, was a mere early precursor
of the powerful 'magical' algorithms from this territory),
there is lots of work to be done for anyone who decides
to explore these virgin lands ahead of the crowd.
For the existent AC modelers which can't be upgraded, QI can
still produce an equal compression as AC, but still code it
much faster via method (b) described earlier. { This is an
area I am less interested in, although I may do some QI
adaptations of this kind for our company clients if that
is requested. }
> Besides, hardware applications cannot buffer
> large amounts of data.
Well, even just two symbols ahead is more than "single next
symbol" ahead. Even that little flexibility gives you a slight
edge, when you don't needlessly constrain coder to "one next
symbol". I have yet to run into a coding task where coder is
required (by the nature of the task, not by some arbitrary
dogma) to encode and output 1 symbol at a time. Everything that
is compressed needs packaging and framing of some sort, in
hardware and in software, and doing such framing symbol by
symbol makes no sense.
|