Author Calgary Corpus compression challenge updated
xleobx@qmailcomq.com

2005-04-06, 12:40 pm

See http://www.mailcom.com/challenge/ for details.

Leo

Matt Mahoney

2005-04-06, 12:40 pm

xleobx@qmailcomq.com wrote:
> See http://www.mailcom.com/challenge/ for details.
>
> Leo


This is a very nice result by Przemyslaw Skibinski. It appears to be a
stripped down version of PASQDA with a tiny 225 word dictionary. The
archive is compressed to 592,486 bytes, a huge improvement. The total
size is 603,416 bytes including the decompressor. The decompressor is
C++ source compressed with RAR, then packed into the archive with HA.
The challenge entry corresponds to a release of PASQDA 4.0, which
compresses the Calgary corpus to 568,318 bytes using a larger
dictionary. I linked both from
http://cs.fit.edu/~mmahoney/compression/

-- Matt Mahoney

michael

2005-04-06, 12:40 pm

Congrats to the author!

However, I am left to wonder how many are still working on the concept
of truly effective compression. It concerns me that so many people are
focusing on sharpening the sword of a method devoted to a stacked
deck.

To Leo, I suggest that it is time to change the challenge a little. If
the intent of the challenge is to advance the state of the art for data
compression then I offer that the existing challenge is far beyond
obsolete. It is well known how to "cheat" on the Calgary corpus. And I
don't mean "cheat" to indicate that anyone is being dishonest. But we
can all reorder pic by 216. It does not advance the concept of a
re-order detection algorithm. For instance, most compressors which
break 780,000 bytes on the 18 files of the Calgary corpus would fail to
do so if pic was made smaller by deleting every 216th byte. And all
would fail to approach the current levels if the text files were
written in a different language or if book1 had its CR/LF altered. Or
if geo was altered to not be re-orderable by 4 to improve compression.
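
For readers unfamiliar with the trick, here is a minimal Python sketch of what "reordering by N" could look like: bytes are regrouped by their position modulo a fixed stride, so that bytes from the same column of a fixed-width record end up next to each other. The function names and the zlib comparison are assumptions made for illustration, not taken from any challenge entry. The stride of 216 for pic assumes the usual description of that file as a 1728-pixel-wide bitmap (1728 / 8 = 216 bytes per row), and 4 for geo assumes 4-byte records.

```python
import zlib


def reorder_by_stride(data: bytes, stride: int) -> bytes:
    """Regroup bytes by position modulo `stride` (column-major order)."""
    if stride <= 0:
        raise ValueError("stride must be positive")
    # data[i::stride] collects every byte whose index is i modulo stride.
    return b"".join(data[i::stride] for i in range(stride))


def restore_from_stride(data: bytes, stride: int) -> bytes:
    """Invert reorder_by_stride, including lengths not divisible by stride."""
    if stride <= 0:
        raise ValueError("stride must be positive")
    n = len(data)
    out = bytearray(n)
    pos = 0
    for i in range(stride):
        # Number of original indices i, i+stride, i+2*stride, ... below n.
        count = len(range(i, n, stride))
        out[i::stride] = data[pos:pos + count]
        pos += count
    return bytes(out)


def compressed_sizes(data: bytes, stride: int) -> tuple[int, int]:
    """Return (plain, reordered) zlib sizes; zlib is only a stand-in here."""
    plain = len(zlib.compress(data, 9))
    reordered = len(zlib.compress(reorder_by_stride(data, stride), 9))
    return plain, reordered
```

A hard-coded call such as `reorder_by_stride(pic_bytes, 216)` only helps when the file really has that row width, which is the point being made above: delete every 216th byte and the fixed stride no longer lines up.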

This (coupled with my *personal* opinion of "so what" to the
compressors which are not practical in terms of time and memory) drives
me to suggest that Leo change his well-intended challenge.

Leo, I suggest that the challenge (if intended to advance the real art
of data compression) needs to be changed by changing the test set
itself (if not by considering time and/or resources required). If the
idea is to improve the state of the art by introducing real challenge
then the test set should be either unknown to all or the test set
should be a random subset from a known set of test signals. And this
test set should not be formed mostly from English. It should be large
enough to have a fair chance of being unpredictable in content.

I wish to comment further but, for the time being, will leave it at
this to the comp.compression community at large with the hope that this
will inspire worthwhile dialog concerning this topic.

We can continue down this unproductive road or we can advance the true
state of the art. I hope that we "change the deck" to see if
we are truly on the correct path. Because right now, I feel like we
are all trying to make the best poker hand of a deck with five
cards which are all known to everyone.

- Michael A Maniscalco



Matt Mahoney wrote:
> xleobx@qmailcomq.com wrote:
>
> This is a very nice result by Przemyslaw Skibinski. It appears to be a
> stripped down version of PASQDA with a tiny 225 word dictionary. The
> archive is compressed to 592,486 bytes, a huge improvement. The total
> size is 603,416 bytes including the decompressor. The decompressor is
> C++ source compressed with RAR, then packed into the archive with HA.
> The challenge entry corresponds to a release of PASQDA 4.0, which
> compresses the Calgary corpus to 568,318 bytes using a larger
> dictionary. I linked both from
> http://cs.fit.edu/~mmahoney/compression/
>
> -- Matt Mahoney


Matt Mahoney

2005-04-06, 12:40 pm

michael wrote:
> Congrats to the author!
>
> However, I am left to wonder how many are still working on the concept
> of truly effective compression. It concerns me that so many people are
> focusing on sharpening the sword of a method devoted to a stacked
> deck.

....

I agree this is a problem. There are a couple of benchmarks where it
is not possible to tune the compressor to the benchmark because the
data has not been released.

http://www.freewebs.com/emilcont/benchmark.htm
http://www.maximumcompression.com/data/summary_mf.php

But this means that the results can't be independently verified. The
Calgary challenge isn't perfect but I think it ought to continue, just
because it has a long history, and aside from the tricks and tuning to
the corpus, there are still some general techniques used that apply to
other data types. All of the challenge winners were based on
variations of top ranked general purpose compressors at the time,
starting in 1997 with RK, followed by PPMN, SLIM, PAQ6 and PAQAR. I
think that all of these introduced some significant advances. I don't
think it is possible to design a "perfect" data set.

-- Matt Mahoney
