For Programmers: Free Programming Magazines  


Home > Archive > Compression > February 2007 > PPM rescaling









You are viewing an archived Text-only version of the thread. To view this thread in it's original format and/or if you want to reply to this thread please [click here]

 

Author PPM rescaling
alexandru.mosoi@gmail.com

2007-02-08, 6:56 pm

It i very common in PPM based compressors to rescale frequencies
stored in a context when total count (or maximum frequency) exceed a
certain threshold. I think that choosing different method to decide
when to do rescaling is done would improve compression. I think that
rescaling should be done when the total entropy of symbols (including
novel symbol), encoded since the last rescale of current context,
exceeds a certain threshold. This would way, the context will adapt
faster if the symbols, which it predicts most often, change. For
example in text: "aabaabaabaabaabaacaacaacaacaac", in context "aa"
when c appears it has a low probability, thus high entropy, so context
will be forced for rescale much earlier.

Did anybody tried something similar?

Matt Mahoney

2007-02-08, 6:56 pm

On Feb 8, 8:55 am, "alexandru.mo...@gmail.com"
<alexandru.mo...@gmail.com> wrote:
> It i very common in PPM based compressors to rescale frequencies
> stored in a context when total count (or maximum frequency) exceed a
> certain threshold. I think that choosing different method to decide
> when to do rescaling is done would improve compression. I think that
> rescaling should be done when the total entropy of symbols (including
> novel symbol), encoded since the last rescale of current context,
> exceeds a certain threshold. This would way, the context will adapt
> faster if the symbols, which it predicts most often, change. For
> example in text: "aabaabaabaabaabaacaacaacaacaac", in context "aa"
> when c appears it has a low probability, thus high entropy, so context
> will be forced for rescale much earlier.
>
> Did anybody tried something similar?


The best rescaling strategy depends on whether your data is stationary
(uniform statistics) or not. Suppose that some context has appeared
20 times and the sequence of next bits in this context is
00000000001111111111. What is the probability that the next bit is a
1? The strategy used in PAQ6 is to count the zeros and ones, but if
the opposite count is large, then discard some of that count. I found
after lots of experimenting that a good strategy is to discard half of
the opposite count over 2. For example, given this sequence, the (0
count, 1 count) would be:

0 = (0, 0)
0 = (1, 0)
0 = (2, 0)
.... (count zeros normally)
0 = (10, 0)
1 = (6, 1) (discard half of zero count above 2)
1 = (4, 2)
1 = (3, 3)
1 = (2, 4) (when opposite count reaches 2, stop discarding)
1 = (2, 5)
1 = (2, 6)
1 = (2, 7)
1 = (2, 8)
1 = (2, 9)
1 = (2, 10)

This works pretty well when models are combined by weighted averaging
of the counts (and adjusting the weights to favor better models), but
it is not ideal. For example, it will expand random data about 8%
because the probability never converges to 1/2. (This is fixed in
later SSE stages which allow large counts for low order contexts).

The method used in PAQ7/8 is fully adaptive. In each context we
record the bit history as an 8 bit state, which indexes an adaptive
table to look up the probability. The tables (for each model) are
updated slowly (about 0.4%) depending on the actual bit. The bit
history represents either the entire sequence of bits up to 4, or a
pair of counts plus a flag indicating whether a 0 or 1 was last. In
order to keep the number of states to 256, the maximum allowed counts
are (0, 41), (1, 40), (2, 12), (3, 5), (4, 4), (5, 3), (12, 2), (40,
1), (41, 0). Also, the last bit is recorded only for total counts up
to 14 (like (13, 1, last bit = 0)). Opposite counts are not discarded
as in PAQ6. Instead, if the counts exceed the limits, then the next
state is one with smaller counts but about the same ratio. For
example, if the state is (4, 4, last bit = 1) and a 0 bit is observed,
the next state will not be (5, 4, last bit = 0), it will be (4, 3,
last bit = 0). The complete state table can be found in http://
cs.fit.edu/~mmahoney/compression/paq8f.cpp in State_table[256][4].
I believe it is the same for all PAQ8 versions.

-- Matt Mahoney

Sponsored Links







Also available: Server administration forum archive | Web Design forum archive | Software forum archive | Hardware reviews archive

Copyright 2008 codecomments.com