For Programmers: Free Programming Magazines  


Home > Archive > Compression > October 2004 > IMPROVED SECOND STAGE COMPRESSION ALGORITHM?









You are viewing an archived Text-only version of the thread. To view this thread in it's original format and/or if you want to reply to this thread please [click here]

 

Author IMPROVED SECOND STAGE COMPRESSION ALGORITHM?
BHARATH

2004-10-11, 3:55 am

hi ,
I am doing a project on compression .can any one suggest any good
compression algortims other than MFT,WFC,SIF,RLE_EXP ,RLE_BIT which
can be used in the place of second stage algorithms to improve
compression ratio in an effecient time .if you find any new algorithms
please let me know,which will be very much helpfull to me .thanking
you
your's friendly
bharath
Matt Mahoney

2004-10-11, 8:55 pm


"BHARATH" <bharathkj@hotmail.com> wrote in message
news:57dda274.0410102319.33baa2b1@posting.google.com...
> hi ,
> I am doing a project on compression .can any one suggest any good
> compression algortims other than MFT,WFC,SIF,RLE_EXP ,RLE_BIT which
> can be used in the place of second stage algorithms to improve
> compression ratio in an effecient time .if you find any new algorithms
> please let me know,which will be very much helpfull to me .thanking
> you
> your's friendly
> bharath


What is the first stage? Burrows-Wheeler?

-- Matt Mahoney


BHARATH

2004-10-12, 8:55 am

"Matt Mahoney" <matmahoney@yahoo.com> wrote in message news:<AgFad.722$SZ5.140@newsread2.news.atl.earthlink.net>...
> "BHARATH" <bharathkj@hotmail.com> wrote in message
> news:57dda274.0410102319.33baa2b1@posting.google.com...
>
> What is the first stage? Burrows-Wheeler?
>
> -- Matt Mahoney


yes i am using BWT in the first stage.Is their any good compression
algorithms which can be used in the second stage .it will be more
helpfull if i get the combination of algorthims to be used in the
second stage.Is their any transformation algorithm other than bwt
which can transform the data efficiently when compare to BWT?thanking
you
Bob

2004-10-15, 3:55 am

On 12 Oct 2004 04:29:29 -0700, bharathkj@hotmail.com (BHARATH) wrote:

>
>yes i am using BWT in the first stage.Is their any good compression
>algorithms which can be used in the second stage .it will be more



Best I've come up with (so far) is to use bwt, then move-to-front,
and then arithmetic compression. The size of the bwt block might
also affect the final compression ratio ... not entirely sure.
I've got my bet block size at 8k right now.

Still does not do as good as pkzip though. :(

Matt Mahoney

2004-10-15, 3:55 am


"Bob" <rock@almost.bestweb.net> wrote in message
news:pesqm0hbbaau9idf1mftmsrndtsa3cdabl@
4ax.com...
> On 12 Oct 2004 04:29:29 -0700, bharathkj@hotmail.com (BHARATH) wrote:
>
>
>
> Best I've come up with (so far) is to use bwt, then move-to-front,
> and then arithmetic compression. The size of the bwt block might
> also affect the final compression ratio ... not entirely sure.
> I've got my bet block size at 8k right now.
>
> Still does not do as good as pkzip though. :(
>


A bigger block will give you better compression, especially for text. 8K is
pretty small. PKZIP gets better compression because it uses a bigger
context window (not sure how big, but gzip uses 32K). In their original
paper, Burrows and Wheeler tested on the 103 MB Hector corpus and got the
best compression using one block for the entire input.
http://gatekeeper.dec.com/pub/DEC/S...ts/SRC-124.ps.Z

For modeling the transformed stream, I suspect a 1/t model would work well,
where t is the distance back in the stream. For example, if your stream is
....ABCD, then you would assign relative probabilities of A=1/4, B=1/3,
C=1/2, D=1 for the next symbol. I haven't tried it, though.

-- Matt Mahoney



Bob

2004-10-17, 8:55 pm

On Fri, 15 Oct 2004 02:30:20 GMT, "Matt Mahoney"
<matmahoney@yahoo.com> wrote:

>
>"Bob" <rock@almost.bestweb.net> wrote in message
> news:pesqm0hbbaau9idf1mftmsrndtsa3cdabl@
4ax.com...



oops, should be "bwt block size", not "bet block size". My mistake.
[color=darkred]
>
>A bigger block will give you better compression, especially for text. 8K is
>pretty small.


OK. I'll have to try bigger block sizes.

>PKZIP gets better compression because it uses a bigger
>context window (not sure how big, but gzip uses 32K).


I didn't think PKzip used BWT. Dunno about gzip.

>In their original
>paper, Burrows and Wheeler tested on the 103 MB Hector corpus and got the
>best compression using one block for the entire input.
>http://gatekeeper.dec.com/pub/DEC/S...ts/SRC-124.ps.Z
>


More support for the idea of a larger BWT blocksize. Thank
you. I will check out that link.

>For modeling the transformed stream, I suspect a 1/t model would work well,
>where t is the distance back in the stream. For example, if your stream is
>...ABCD, then you would assign relative probabilities of A=1/4, B=1/3,
>C=1/2, D=1 for the next symbol. I haven't tried it, though.
>


Huh? You lost me with this one.



Willem

2004-10-17, 8:55 pm

Bob wrote:
)>A bigger block will give you better compression, especially for text. 8K is
)>pretty small.
)
) OK. I'll have to try bigger block sizes.

When he sais 'bigger block size' he means like a few megabytes in size.
As a rule of thumb, doubling the block size gives you in the order of 1%
better compression. That is, using 8MB blocksize versus 4MB on the linux
kernel source gave me roughly a 1% improvement.

)>PKZIP gets better compression because it uses a bigger
)>context window (not sure how big, but gzip uses 32K).
)
) I didn't think PKzip used BWT. Dunno about gzip.

It doesn't, it uses a 32K sliding context window.
But that's still (somewhat) comparable.

P.S. bzip2 uses a blocksize of between 100k and 900k depending on the
commandline flag.


SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
Matt Mahoney

2004-10-17, 8:55 pm


"Bob" <rock@almost.bestweb.net> wrote in message
news:oq25n090h0l7bkm6qve11dvigrcilmst5d@
4ax.com...
> On Fri, 15 Oct 2004 02:30:20 GMT, "Matt Mahoney"
> <matmahoney@yahoo.com> wrote:
>
> I didn't think PKzip used BWT. Dunno about gzip.


They both use Limpel-Ziv. But there is a compression-memory tradeoff in
just about all compression algorithms.
well,[color=darkred]
is[color=darkred]
>
> Huh? You lost me with this one.


You want to use a locally adaptive order 0 model. The move-to-front model
is good for BWT because the probability of a symbol is inversely related to
the time since it was last seen. For example, if the first 4 symbols in
your BWT stream was ABCD, then the MTF model would assign the highest
probability to D since it was seen last, followed by C, B, and A in that
order. Now if the stream was AAABAABBCD, then the MTF model would assign
exactly the same probabilities since they would be in the same position in
the queue. What I am suggesting is a model that also takes frequency into
account, i.e. raise the probabilities of A and B to account for their higher
frequency, but in inverse proportion to the time since they were last seen,
for example:

A = 1/10 + 1/9 + 1/8 + 1/6 + 1/5
B = 1/7 + 1/4 + 1/3
C = 1/2
D = 1

I can't say that this would work because I haven't tried it, just suggesting
an experiment. It probably isn't done this way because there isn't an
efficient way to compute it that I can think of.

-- Matt Mahoney


Sponsored Links







Also available: Server administration forum archive | Web Design forum archive | Software forum archive | Hardware reviews archive

Copyright 2008 codecomments.com