Code Comments

Calgary Corpus compression challenge updated
See http://www.mailcom.com/challenge/ for details.

Leo


xleobx@qmailcomq.com
04-06-05 05:40 PM


Re: Calgary Corpus compression challenge updated
xleobx@qmailcomq.com wrote:
> See http://www.mailcom.com/challenge/ for details.
>
> 	Leo

This is a very nice result by Przemyslaw Skibinski.  It appears to be a
stripped down version of PASQDA with a tiny 225 word dictionary.  The
archive is compressed to 592,486 bytes, a huge improvement.  The total
size is 603,416 bytes including the decompressor.  The decompressor is
C++ source compressed with RAR, then packed into the archive with HA.
The challenge entry corresponds to a release of PASQDA 4.0, which
compresses the Calgary corpus to 568,318 bytes using a larger
dictionary.  I linked both from
http://cs.fit.edu/~mmahoney/compression/

-- Matt Mahoney


Matt Mahoney
04-06-05 05:40 PM


Re: Calgary Corpus compression challenge updated
Congrats to the author!

However, I am left to wonder how many are still working on the concept
of truly effective compression.  It concerns me that so many people are
focusing on sharpening the sword of a method devoted to a stacked
deck.

To Leo, I suggest that it is time to change the challenge a little.  If
the intent of the challenge is to advance the state of the art in data
compression, then I offer that the existing challenge is far beyond
obsolete.  It is well known how to "cheat" on the Calgary corpus.  And I
don't mean "cheat" to indicate that anyone is being dishonest.  But we
can all reorder pic by 216, and doing so does not advance the concept of
a reorder detection algorithm.  For instance, most compressors which
break 780,000 bytes on the 18 files of the Calgary corpus would fail to
do so if pic was made smaller by deleting every 216th byte.  And all
would fail to approach the current levels if the text files were
written in a different language, if book1 had its CR/LF altered, or
if geo was altered so that it could no longer be reordered by 4 to
improve compression.
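
For readers who have not seen the trick, here is a minimal sketch of one
reading of "reorder by N".  It assumes the file is a sequence of
fixed-size records of N bytes, and it groups together the bytes that sit
at the same offset in each record before the data is handed to a
compressor.  The function names are hypothetical and are not taken from
any challenge entry; real entries may well use the stride as model
context instead of physically moving bytes.

```python
def reorder_by_stride(data: bytes, stride: int) -> bytes:
    """Group bytes by their offset within each fixed-size record.

    All bytes at offset 0 of every record come first, then all bytes at
    offset 1, and so on (a byte-level transpose of the record layout).
    """
    return b"".join(data[i::stride] for i in range(stride))


def restore_from_stride(reordered: bytes, stride: int) -> bytes:
    """Invert reorder_by_stride, including a short final record."""
    n = len(reordered)
    out = bytearray(n)
    pos = 0
    for i in range(stride):
        # Number of bytes that originally sat at offset i of some record.
        count = len(range(i, n, stride))
        out[i::stride] = reordered[pos:pos + count]
        pos += count
    return bytes(out)


if __name__ == "__main__":
    sample = b"abcdefgh"
    shuffled = reorder_by_stride(sample, 4)
    print(shuffled)
    print(restore_from_stride(shuffled, 4) == sample)
```

The stride (216 for pic, 4 for geo in the post above) is supplied by
hand here, which is exactly the complaint: nothing in this sketch
detects the record length from the data.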

This (coupled with my *personal* opinion of "so what" about
compressors which are not practical in terms of time and memory) drives
me to suggest that Leo change his well-intended challenge.

Leo, I suggest that the challenge (if intended to advance the real art
of data compression) needs to be changed by changing the test set
itself (if not also by considering the time and/or resources required).
If the idea is to improve the state of the art by introducing a real
challenge, then the test set should be either unknown to all or a
random subset from a known set of test signals.  And this test set
should not be formed mostly from English.  It should be large enough to
have a fair chance of being unpredictable in content.
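
As a rough illustration of the "random subset from a known set" idea,
the sketch below draws the test files from a published pool using a
seed.  The seed handling is an assumption added for illustration: if the
seed is revealed only after entries are submitted, anyone can reproduce
the draw afterwards.  The function name and the pool contents are
hypothetical.

```python
import random


def pick_test_set(candidates, k, seed):
    """Draw k distinct items from a known pool, reproducibly for a given seed."""
    rng = random.Random(seed)  # private generator, independent of global state
    return sorted(rng.sample(list(candidates), k))


if __name__ == "__main__":
    # Hypothetical pool of candidate test files.
    pool = [f"signal{i:02d}.dat" for i in range(20)]
    print(pick_test_set(pool, 5, seed=2005))
```

Whether a seeded draw is unpredictable enough depends on how large and
varied the pool is, which is the point made above about size.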

I wish to comment further but, for the time being, will leave it at
this for the comp.compression community at large, with the hope that it
will inspire worthwhile dialog on this topic.

We can continue down this unproductive road or we can advance the true
state of the art.  I hope that we "change the deck" to see if we are
truly on the correct path.  Because right now, I feel like we are all
trying to make the best poker hand from a deck with five cards which
are all known to everyone.

- Michael A Maniscalco



Matt Mahoney wrote:
> xleobx@qmailcomq.com wrote:
>
> This is a very nice result by Przemyslaw Skibinski.  It appears to be a
> stripped down version of PASQDA with a tiny 225 word dictionary.  The
> archive is compressed to 592,486 bytes, a huge improvement.  The total
> size is 603,416 bytes including the decompressor.  The decompressor is
> C++ source compressed with RAR, then packed into the archive with HA.
> The challenge entry corresponds to a release of PASQDA 4.0, which
> compresses the Calgary corpus to 568,318 bytes using a larger
> dictionary.  I linked both from
> http://cs.fit.edu/~mmahoney/compression/
>
> -- Matt Mahoney


michael
04-06-05 05:40 PM


Re: Calgary Corpus compression challenge updated
michael wrote:
> Congrats to the author!
>
> However, I am left to wonder how many are still working on the concept
> of truly effective compression.  It concerns me that so many people are
> focusing on sharpening the sword of a method devoted to a stacked
> deck.
...

I agree this is a problem.  There are a couple of benchmarks where it
is not possible to tune the compressor to the benchmark because the
data has not been released.

http://www.freewebs.com/emilcont/benchmark.htm
http://www.maximumcompression.com/data/summary_mf.php

But this means that the results can't be independently verified.  The
Calgary challenge isn't perfect but I think it ought to continue, just
because it has a long history, and aside from the tricks and tuning to
the corpus, there are still some general techniques used that apply to
other data types.  All of the challenge winners were based on
variations of top ranked general purpose compressors at the time,
starting in 1997 with RK, followed by PPMN, SLIM, PAQ6 and PAQAR.  I
think that all of these introduced some significant advances.  I don't
think it is possible to design a "perfect" data set.

-- Matt Mahoney


Matt Mahoney
04-06-05 05:40 PM


All times are GMT.