Code Comments
See http://www.mailcom.com/challenge/ for details.

Leo
xleobx@qmailcomq.com wrote:
> See http://www.mailcom.com/challenge/ for details.
>
> Leo

This is a very nice result by Przemyslaw Skibinski. It appears to be a stripped down version of PASQDA with a tiny 225 word dictionary. The archive is compressed to 592,486 bytes, a huge improvement. The total size is 603,416 bytes including the decompressor. The decompressor is C++ source compressed with RAR, then packed into the archive with HA.

The challenge entry corresponds to a release of PASQDA 4.0, which compresses the Calgary corpus to 568,318 bytes using a larger dictionary. I linked both from http://cs.fit.edu/~mmahoney/compression/

-- Matt Mahoney
Congrats to the author!

However, I am left to wonder how many are still working on the concept of truly effective compression. It concerns me that so many people are focusing on sharpening the sword of a method devoted to a stacked deck.

To Leo, I suggest that it is time to change the challenge a little. If the intent of the challenge is to advance the state of the art for data compression, then I submit that the existing challenge is far beyond obsolete. It is well known how to "cheat" on the Calgary corpus. And I don't mean "cheat" to indicate that anyone is being dishonest. But we can all reorder pic by 216. That does not advance the concept of a re-order detection algorithm.

For instance, most compressors which break 780,000 bytes on the 18 files of the Calgary corpus would fail to do so if pic was made smaller by deleting every 216th byte. And all would fail to approach the current levels if the text files were written in a different language, or if book1 had its CR/LF altered, or if geo was altered so that it could no longer be reordered by 4 to improve compression.

This (coupled with my *personal* opinion of "so what" about compressors which are not practical in terms of time and memory) drives me to suggest that Leo change his well intended challenge.

Leo, I suggest that the challenge (if intended to advance the real art of data compression) needs to be changed by changing the test set itself, if not by also considering the time and/or resources required. If the idea is to improve the state of the art by introducing a real challenge, then the test set should either be unknown to all or be a random subset of a known set of test signals. And this test set should not be formed mostly from English. It should be large enough to have a fair chance of being unpredictable in content.
I wish to comment further but, for the time being, will leave it at this to the comp.compression community at large, with the hope that this will inspire worthwhile dialog concerning this topic. We can continue down this unproductive road or we can advance the true state of the art. I hope that we "change the deck" to see if we are truly on the correct path. Because right now, I feel like we are all trying to make the best poker hand from a deck of five cards which are all known to everyone.

- Michael A Maniscalco

Matt Mahoney wrote:
> xleobx@qmailcomq.com wrote:
>
> This is a very nice result by Przemyslaw Skibinski. It appears to be a
> stripped down version of PASQDA with a tiny 225 word dictionary. The
> archive is compressed to 592,486 bytes, a huge improvement. The total
> size is 603,416 bytes including the decompressor. The decompressor is
> C++ source compressed with RAR, then packed into the archive with HA.
> The challenge entry corresponds to a release of PASQDA 4.0, which
> compresses the Calgary corpus to 568,318 bytes using a larger
> dictionary. I linked both from
> http://cs.fit.edu/~mmahoney/compression/
>
> -- Matt Mahoney
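For readers unfamiliar with the "reorder pic by 216" and "reorder geo by 4" trick mentioned in the post above, here is a minimal Python sketch of stride reordering. It is only an illustration, not code from any challenge entry: the function names `reorder_by_stride` and `restore_from_stride` are made up for this example, the test data is synthetic 4-byte records rather than a Calgary corpus file, and zlib stands in for whatever compressor one actually uses. The idea is that bytes at the same offset within each fixed-size record (or image row) are grouped together, so a general-purpose compressor sees more regular input.

```python
import struct
import zlib


def reorder_by_stride(data: bytes, stride: int) -> bytes:
    """Group bytes by position modulo `stride`.

    Output is all bytes at offset 0 of each record, then all bytes at
    offset 1, and so on. A trailing partial record is handled naturally.
    """
    return b"".join(data[i::stride] for i in range(stride))


def restore_from_stride(data: bytes, stride: int) -> bytes:
    """Invert reorder_by_stride for the same `stride`."""
    n = len(data)
    out = bytearray(n)
    pos = 0
    for i in range(stride):
        # Number of original positions i, i+stride, i+2*stride, ... below n.
        count = len(range(i, n, stride))
        out[i::stride] = data[pos:pos + count]
        pos += count
    return bytes(out)


if __name__ == "__main__":
    # Synthetic stand-in for record-structured data: 5000 big-endian
    # 32-bit integers that grow slowly (an assumption for illustration).
    records = b"".join(struct.pack(">I", 1000 + 7 * i) for i in range(5000))
    shuffled = reorder_by_stride(records, 4)

    # The transform is lossless.
    assert restore_from_stride(shuffled, 4) == records

    # Compare compressed sizes with and without the reordering.
    print("plain    :", len(zlib.compress(records, 9)))
    print("reordered:", len(zlib.compress(shuffled, 9)))
```

The stride has to be known in advance (4 for 32-bit records here), which is exactly the objection raised in the post: hard-coding 216 for pic exploits knowledge of one file rather than detecting the structure automatically.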
michael wrote:
> Congrats to the author!
>
> However, I am left to wonder how many are still working on the concept
> of truly effective compression. It concerns me that so many people are
> focusing on sharpening the sword of a method devoted to a stacked
> deck.
...

I agree this is a problem. There are a couple of benchmarks where it is not possible to tune the compressor to the benchmark because the data has not been released.

http://www.freewebs.com/emilcont/benchmark.htm
http://www.maximumcompression.com/data/summary_mf.php

But this means that the results can't be independently verified.

The Calgary challenge isn't perfect but I think it ought to continue, just because it has a long history, and aside from the tricks and tuning to the corpus, there are still some general techniques used that apply to other data types. All of the challenge winners were based on variations of top ranked general purpose compressors at the time, starting in 1997 with RK, followed by PPMN, SLIM, PAQ6 and PAQAR. I think that all of these introduced some significant advances.

I don't think it is possible to design a "perfect" data set.

-- Matt Mahoney