Home > Archive > Compression > March 2006 > Interleaved Context (Re: PAQ8C and PAQAR 4.5 released)
You are viewing an archived Text-only version of the thread.
To view this thread in it's original format and/or if you want to reply to
this thread please [click here]
| Author |
Interleaved Context (Re: PAQ8C and PAQAR 4.5 released)
|
|
| niels.froehling@seies.de 2006-03-13, 6:55 pm |
| > I was rather surprised to find that indirect context modeling improves
> compression on almost every file I tested, text or binary.
Why? It's columnising/rasterizing data on an alinear context-grid.
There
where PPM-variants that used a linear context-grid for improving
compressing on 'pic' (the dimensions where known, so the grid-stepsize
was choosen to be part of 640). Because of too expensive possible
mixers
(in that time) it crunched down to only observe deterministic
grid-contexts: for example if every (640/8)th byte was predicted by a
deterministic context favor that over prediction by a PPM-context.
> An indirect
> context is a context model in which a history is kept only within
> another context. For example, given input:
>
> AB...AC...AB...AC...AB...AC...A_
>
> then in the direct order-1 context "A" we have the sequence "BCBCBC".
> We observe that in the order-(1,2) context (A,BC), which occurred twice
> before, that in both cases the next symbol is "B", so that is what the
> model predicts to be most likely. An ordinary context model could not
> make such a prediction. paq8f models indirect contexts with orders
> (1,1), (1,2), (1,3), (2,1), and (2,2).
>
> An order-(m,n) indirect context would be equivalent to a BWT compressor
> sorting on the first m characters (using a stable sort) and using an
> order-n context model on the transform. Normally BWT uses an order-0
> model.
In effect it's a pattern-detector, but a not so good adjustable one.
You
have a parameter-space that's very hard to control, because you can in
principle construct every form of repetitive pattern: for example a
context that observes at position 1,2,4,8,16 of each call, or
1,2,1,1,2.
It's unclear which exact pattern may work beforehand, and it's even
extreme expensive to run a two-pass algorithm detecting optimal pattern
configurations.
Also it's unclear when to switch between patterns and off the patterns
in favor of a possible (for ex.) PPM-context.
I don't say it's a bad method (I did it also, see
http://cvs.sourceforge.net/viewcvs....1.1&view=markup
) but it's blowing up optimality (the per se detection that it models
better on most files, doesn't say anything, you know).
Ciao
Niels
| |
| Matt Mahoney 2006-03-13, 9:55 pm |
| niels.froehling@seies.de wrote:
>
> Why? It's columnising/rasterizing data on an alinear context-grid.
> There
> where PPM-variants that used a linear context-grid for improving
> compressing on 'pic' (the dimensions where known, so the grid-stepsize
> was choosen to be part of 640). Because of too expensive possible
> mixers
> (in that time) it crunched down to only observe deterministic
> grid-contexts: for example if every (640/8)th byte was predicted by a
> deterministic context favor that over prediction by a PPM-context.
>
>
> In effect it's a pattern-detector, but a not so good adjustable one.
> You
> have a parameter-space that's very hard to control, because you can in
> principle construct every form of repetitive pattern: for example a
> context that observes at position 1,2,4,8,16 of each call, or
> 1,2,1,1,2.
> It's unclear which exact pattern may work beforehand, and it's even
> extreme expensive to run a two-pass algorithm detecting optimal pattern
> configurations.
> Also it's unclear when to switch between patterns and off the patterns
> in favor of a possible (for ex.) PPM-context.
>
> I don't say it's a bad method (I did it also, see
> http://cvs.sourceforge.net/viewcvs....1.1&view=markup
> ) but it's blowing up optimality (the per se detection that it models
> better on most files, doesn't say anything, you know).
>
> Ciao
> Niels
An indirect context model is a little different from interleaving for
PPM. With interleaving on 2-D structured data you gain the bytes above
as context but lose the context of the immediately preceding byte. paq
allows both contexts by mixing the predictions of different models
(automatically detecting width). An indirect context model is distinct
from both of these contexts allowing further compression by combining 3
models. It does not require repeating at regular intervals.
Here is an interesting experiment I did. First I created the following
49999 line file, count (with CR LF line terminators):
1
2
3
....
49998
49999
This file has low Kolmogorov complexity, so ought to be highly
compressible. Here are some results:
Original size: 338887
pkzip: 103598
bzip2: 67670
GRZipII: 64425
slim v21: 35880
paqar 4.1 -6: 2111
ppmonstr J: 992
paq8a -4: 721
paq8f -6: 671
paqar uses the paq6 style based model mixing, where each model outputs
a prediction based on the history of 0 and 1 counts. Starting in paq7,
the context model also remembers what the last bit was, and also if
there are 4 or fewer observations, the exact order of these bits. Then
that state is mapped to a prediction using an adaptive table. This is
a simple type of indirect context model applied to all the regular
contexts at the bit level. This might account for some the improvement
of paq8a over paqar, although I think it is more from adding a new
model based on column position in text files. The indirect context
model I described was added to paq8f, so you might say there are
actually 2 levels of this model.
-- Matt Mahoney
| |
| niels.froehling@seies.de 2006-03-14, 6:55 pm |
| >>> An indirect
>
> An indirect context model is a little different from interleaving for
> PPM. With interleaving on 2-D structured data you gain the bytes above
> as context but lose the context of the immediately preceding byte.
I still say it's not. :)
Your only applying an alinear transform on a
linear grid. The alinear transform is (for example) creating columns
after every occurance of an 'A'. Because it's an adaptive process you
don't store the transform (positions) but determine (build them up)
while running the algorithm. In the end you have a transpositional
matrix (or at least can extract it).
Also a simple modification (adding CM) of the pure PPM leads to the
same conceptual algorithm:
a->b->r->a->c->a->d->a->b->a->grid->frequencies
---------------------------- ---- -----------
1) 2) 3)
1) the prefix-context PPM
2) any additional context CM
3) the probabalistic model
I guess that that is even the conceptual clearer variant because
you can incorporate full symbol-exlusion, even with indirect
context.
[color=darkred]
> paq
> allows both contexts by mixing the predictions of different models
> (automatically detecting width). An indirect context model is distinct
> from both of these contexts allowing further compression by combining 3
> models.
I know. But:
- the RLI is distinct from the PPM / indirect context model is distinct
from any other model
- the context for the RLI is the linear grid-position / the indirect
context is the transformed grid-position
- the mixer is 'choose-only-deterministic-for-rli-model' / the mixer is
neural-net based
So where do you see the conceptual difference?
RLI is part of the indirect context model, you can exactly create a
RLI
by changing only a handfull of characters in your code (I guess :-).
> It does not require repeating at regular intervals.
I know:
Also it's unclear when to switch between patterns and off the patterns
in favor of a possible (for ex.) other PAQ-model.
I'm speaking here about a algorithm computing an optimal indirect
context
model. Because there are (+- in the quantity of) n! (n the number of
characters to code) possible indirect context models it's not very
obvious
how to do that.
We're always screaming after pigeon-hole and
what-makes-it-smaller-makes-
it-bigger-too, so:
- for _which_ data is your (1,2) worst? (which data obeys
pattern-rules,
or can be constructed by pattern-based generators? is there an (for
ex.)
english text without this properties?)
- _why_ is it good for the files you tested? (is it only compensating
slow adaption in other models? at which point reach other models the
same or better predictions?)
[color=darkred]
> Here is an interesting experiment I did. First I created the following
> 49999 line file, count (with CR LF line terminators):
>
> 1
> 2
> 3
> ...
> 49998
> 49999
Why don't you incorporate a delta-rle-model into PAQ? ;^)
> This file has low Kolmogorov complexity, so ought to be highly
> compressible. Here are some results:
>
> Original size: 338887
> pkzip: 103598
> bzip2: 67670
> GRZipII: 64425
> slim v21: 35880
> paqar 4.1 -6: 2111
> ppmonstr J: 992
> paq8a -4: 721
> paq8f -6: 671
>
> paqar uses the paq6 style based model mixing, where each model outputs
> a prediction based on the history of 0 and 1 counts. Starting in paq7,
> the context model also remembers what the last bit was, and also if
> there are 4 or fewer observations, the exact order of these bits. Then
> that state is mapped to a prediction using an adaptive table. This is
> a simple type of indirect context model applied to all the regular
> contexts at the bit level. This might account for some the improvement
> of paq8a over paqar, although I think it is more from adding a new
> model based on column position in text files. The indirect context
> model I described was added to paq8f, so you might say there are
> actually 2 levels of this model.
>
> -- Matt Mahoney
Ciao
Niels
| |
| Matt Mahoney 2006-03-14, 6:55 pm |
| niels.froehling@seies.de wrote:
>
>
> I still say it's not. :)
> Your only applying an alinear transform on a
> linear grid. The alinear transform is (for example) creating columns
> after every occurance of an 'A'. Because it's an adaptive process you
> don't store the transform (positions) but determine (build them up)
> while running the algorithm. In the end you have a transpositional
> matrix (or at least can extract it).
> Also a simple modification (adding CM) of the pure PPM leads to the
> same conceptual algorithm:
>
> a->b->r->a->c->a->d->a->b->a->grid->frequencies
> ---------------------------- ---- -----------
> 1) 2) 3)
>
> 1) the prefix-context PPM
> 2) any additional context CM
> 3) the probabalistic model
>
> I guess that that is even the conceptual clearer variant because
> you can incorporate full symbol-exlusion, even with indirect
> context.
I suppose that's another way to look at it.
> We're always screaming after pigeon-hole and
> what-makes-it-smaller-makes-
> it-bigger-too, so:
> - for _which_ data is your (1,2) worst? (which data obeys
> pattern-rules,
> or can be constructed by pattern-based generators? is there an (for
> ex.)
> english text without this properties?)
> - _why_ is it good for the files you tested? (is it only compensating
> slow adaption in other models? at which point reach other models the
> same or better predictions?)
Typically paq has many models running, most of which are not very
effective. The mixer turns these off, although not completely without
cost. First it takes time to learn that a model is ineffective, and
meanwhile compression is worse. You will see this effect on random
data (where all models are ineffective). Second, the extra models take
away memory from the more effective ones. Mixing models is a hard
problem.
An indirect context model does improve compression on text as well as
many binary formats. In text I believe this is due to the cache
effect. A word seen recently is likely to be seen again in the near
future.
-- Matt Mahoney
|
|
|
|
|