For Programmers: Free Programming Magazines  


Home > Archive > Compression > April 2005 > Looking for best way to post-process BWT data









You are viewing an archived Text-only version of the thread. To view this thread in it's original format and/or if you want to reply to this thread please [click here]

 

Author Looking for best way to post-process BWT data
Rick D.

2005-04-24, 3:55 pm

Hi all,

I'm looking for the best way to post-process a bwt stream to get the
best order 0 or 1 result. So far i've tried bwt+rle+mtf, and distance
coding (DC). And got book1 from calgary corpus down to +/- 230.000
with my bwt+dc implementation using an order 0 arithmetic coder.

But in this paper:
http://sun.iinf.polsl.gliwice.pl/~sdeor/pub/deo01.pdf,

they claim DC can bring book1 down to 2.241 bpp. Which would be
(2.241/8)*768.771=215.352 bytes. Does anybody know exactly how they
code their distances (i use 1 byte up to 254 and a flag byte + 3 byte
distance for the rest)? (Or what other transformations they add to
boost the performance?)

This paper also mentions that wfc might be a better method to use, but
the description they gave was pretty cryptic and i couldn't find any
other comprehensible information on it. I do understand it is
basically an adapted mtf coder which uses multiple weights from a
local context. And i'm sure it's not that hard when you get how it
works, but right now i haven't been able to implement it. Does anybody
know a link to a comprehensive article on this?

Any help with this will be much appreciated.

Best regards,
Rick

michael

2005-04-24, 3:55 pm


Rick D. wrote:
> Hi all,
>
> I'm looking for the best way to post-process a bwt stream to get the
> best order 0 or 1 result. So far i've tried bwt+rle+mtf, and distance
> coding (DC). And got book1 from calgary corpus down to +/- 230.000
> with my bwt+dc implementation using an order 0 arithmetic coder.
>
> But in this paper:
> http://sun.iinf.polsl.gliwice.pl/~sdeor/pub/deo01.pdf,
>
> they claim DC can bring book1 down to 2.241 bpp. Which would be
> (2.241/8)*768.771=215.352 bytes. Does anybody know exactly how they
> code their distances (i use 1 byte up to 254 and a flag byte + 3 byte
> distance for the rest)? (Or what other transformations they add to
> boost the performance?)
>
> This paper also mentions that wfc might be a better method to use,

but
> the description they gave was pretty cryptic and i couldn't find any
> other comprehensible information on it. I do understand it is
> basically an adapted mtf coder which uses multiple weights from a
> local context. And i'm sure it's not that hard when you get how it
> works, but right now i haven't been able to implement it. Does

anybody
> know a link to a comprehensive article on this?
>
> Any help with this will be much appreciated.
>
> Best regards,
> Rick


book1 can be better compressed with end of line coding but this is
more of a trick than a legitimate improvement on compression.

http://www.data-compression.info/Ju..._BWT_Stages.pdf
is
a good paper which touches on WFC.

The best compression method for BWT currently is (to the best of
my knowledge) my M03 algorithm.

A really rough paper is at
http://www.michael-maniscalco.com/index2.htm

There's a fully functional version at
http://mij4x.datacompression.jp/

This achieves around 795,000 for the 18 file calgary corpus. Around
211,000 for book1.


- Michael A Maniscalco

Edgar Binder

2005-04-24, 3:55 pm

Hi,

Rick D. wrote:

> I'm looking for the best way to post-process a bwt stream to get the
> best order 0 or 1 result. So far i've tried bwt+rle+mtf, and distance
> coding (DC). And got book1 from calgary corpus down to +/- 230.000
> with my bwt+dc implementation using an order 0 arithmetic coder.
>
> But in this paper:
> http://sun.iinf.polsl.gliwice.pl/~sdeor/pub/deo01.pdf,
>
> they claim DC can bring book1 down to 2.241 bpp. Which would be
> (2.241/8)*768.771=215.352 bytes. Does anybody know exactly how they
> code their distances (i use 1 byte up to 254 and a flag byte + 3 byte
> distance for the rest)? (Or what other transformations they add to
> boost the performance?)


In the original distance coder, I coded the distances by their log2
(number of needed bits) and then sent those bits (raw, not arithmetic coded)
without the leading one (always 1). The logs were encoded with an order-1
arithmetic coder with a predictor as context.

The predictor was computed from previous distances of the same symbol like this:

pred_log = pred_log * weight1 + last_log * weight2

Of course there are a lot of things you could use additionally for predicting
(use your imagination), but AFAIR you should get down to 217k for start with
almost any reasonable choice for weight1, weight2 (like for example both 0.5).

There probably are some other better ways to encode the distances, IIRC there
are distance coders that compress book down to 213k. Nowadays I'd do it
quite differently too, but I can't seem to find any spare time :-)

--
Edgar

Sponsored Links







Also available: Server administration forum archive | Web Design forum archive | Software forum archive | Hardware reviews archive

Copyright 2008 codecomments.com