For Programmers: Free Programming Magazines  


Home > Archive > Compression > February 2008 > New LZW variant









You are viewing an archived Text-only version of the thread. To view this thread in it's original format and/or if you want to reply to this thread please [click here]

 

Author New LZW variant
encode

2008-01-30, 6:57 pm

OK, playing with UNWHAP-like decompression I've invented(?) a new LZW
variant.

I hope you remember that with UNWHAP, unlike with a generic LZW, we do
not construct any tries, we just keep pointers into buffer+phrase
lengths:

add(i, len+1); // like in generic LZW - add a new phrase -
concatenation of a current phrase with a new symbol

First thing.
With UNWHAP decoder each phrase is independent - since we do not use
any prefixes we may freely overwrite old entries. Thus if dictionary
became full, just overwrite the oldest entry. As a result we
constantly have 1<<N active phrases!

Second thing.
An accelerated dictionary loading.
Instead of just adding:
add(i, len+1);
we may additionally add
add(i+1, len);
add(i-1, len+2);
i.e. additional concatenation with a new symbol.
And this requires just minor decoder changes.

As a result we have a nice compression with super fast (200 MB/sec)
decompression. In my variant, I use 16-bit codes - i.e. no flat codes,
no Huffman, no arithmetic, whatsoever, and compression in some cases
is very close to Deflate's.

Although, in my implementation the encoder is far not so fast, since
we must do some additional stuff with such compression, however...

Link:
http://www.encode.ru/downloads/lzw01.zip (42 KB)

P.S.
Maybe patent issues killed LZW potential?

-- Ilia Muraviev
Hans-Peter Diettrich

2008-01-30, 9:58 pm

encode wrote:

> With UNWHAP decoder each phrase is independent - since we do not use
> any prefixes we may freely overwrite old entries. Thus if dictionary
> became full, just overwrite the oldest entry. As a result we
> constantly have 1<<N active phrases!


As long as you don't know how useful (how often applicable) a phrase is,
you cannot know which phrase(s) to discard. Discarding the oldest
phrases will be useful when a file has a certain "locality", where a
phrase is repeated nearby.

DoDi
cr88192

2008-01-31, 3:57 am


"Hans-Peter Diettrich" <DrDiettrich1@aol.com> wrote in message
news:60ct8jF1qc6l3U4@mid.individual.net...
> encode wrote:
>
>
> As long as you don't know how useful (how often applicable) a phrase is,
> you cannot know which phrase(s) to discard. Discarding the oldest phrases
> will be useful when a file has a certain "locality", where a phrase is
> repeated nearby.
>


MRU can give a kind of hueristic.
as such, if a string falls out of the dictionary, it is rarely, if ever,
used...

of course, this is likely to result in a distribution that is itself subject
to compression (strings with smaller indices being more frequent), and thus,
best represented with a shorter code.

an example of this would be, for example, LZW+Rice.


another possibility which was once implemented by me, which from what I
remember got decent compression but, at the time, was fairly slow, was the
direct combination of LZW with adaptive huffman (basically, the individual
strings were leaves on a fairly large huffman tree, and as such, each match
involved updating the weight and potentially rebalancing that part of the
tree).

if I implemented it now, it is much more likely to be faster.

I remember, it was pretty good with text, but less good with other data
types. part of the problem may be that such a design would likely adapt too
slowly, wheras a design with a separate ranking structure and entropy coding
(such as MRU+Huffman or MRU+Rice) is likely to do much better here.


other possibilities exist as well...


> DoDi


encode

2008-01-31, 3:57 am

MRU, MTF and similar stuff just will kill the decompression speed.
Deleting the oldest phrase can be done using circular table - i.e.
needs no additional calculations... :)
Hans-Peter Diettrich

2008-01-31, 6:58 pm

encode wrote:

> MRU, MTF and similar stuff just will kill the decompression speed.
> Deleting the oldest phrase can be done using circular table - i.e.
> needs no additional calculations... :)


I wonder what takes more time, managing data structures in memory, or
reading the data from an external device (disk, network...). Where's the
*real* bottleneck?

DoDi
Mark Nelson

2008-01-31, 6:58 pm

On Jan 30, 2:47=A0pm, encode <enc...@narod.ru> wrote:
> P.S.
> Maybe patent issues killed LZW potential?


They certainly stopped people from working on LZW, so they didn't
help.

But I think more than anything, the superior performance of deflate is
what really shut down LZW.

As I understand it, LZ77 and LZ78 will have the same performance
compression, asymptotically, but in the case of these two
implementations, deflate simply compresses better, faster, and with a
smaller footprint than any LZ78 variant, including LZW.

If there were no patent issues and people continued tinkering with
LZ78-based schemes (like you seem to be doing!) maybe we'd see some
attempts at parity.

--
|
| Mark Nelson - http://marknelson.us
|
| Got extra books? Got extra cash? Use either to bring books to Texas
prison libraries.
| Find out how to help at http://booksforcrooks.org
|
encode

2008-01-31, 6:58 pm

On 31 =D1=CE=D7, 16:30, Hans-Peter Diettrich <DrDiettri...@aol.com> wrote:
> encode wrote:
>
> I wonder what takes more time, managing data structures in memory, or
> reading the data from an external device (disk, network...). Where's the
> *real* bottleneck?
>
> DoDi


The real bottleneck for super fast coders should be the disk I/O and
only. That's why many benchmarkers do in-memory testing of such
software...
cr88192

2008-02-01, 3:57 am


"encode" <encode@narod.ru> wrote in message
news:e53b9fd2-9c26-4134-8656-8e99f30d95de@q21g2000hsa.googlegroups.com...
> MRU, MTF and similar stuff just will kill the decompression speed.
> Deleting the oldest phrase can be done using circular table - i.e.
> needs no additional calculations... :)


potentially, albeit simply dropping the oldest entries, may destroy
frequently used strings in the process, thus hurting compression.

MRU is much safer, and will compress better. yes, performance is a possible
cost...


we have 2 major options:
MRU sorted array, which may require shifting the entire array on an update,
is slower, but introduces a usable statistical skew.

a linked list can be made much faster in this case, but by itself (in the
case of linked but unsorted entries), will not lead to a usable skew (apart
from, maybe, using a secondary stage with a kind of high-width huffman
scheme or arithmetic coding or similar, which could cost in terms of either
performance or huffman table size...).


so, it is a speed ratio tradoff I think.

one reason I have generally preferred LZ77 over LZW is that it is IME easier
to get a better speed/ratio tradoff...


Hans-Peter Diettrich

2008-02-01, 3:57 am

encode wrote:

> The real bottleneck for super fast coders should be the disk I/O and
> only. That's why many benchmarkers do in-memory testing of such
> software...


That's why it is *necessary* to exclude the disk access time from
benchmarks, in order to get reasonable (comparable) values. At the same
time the user is almost only interested in the compression ratio,
because this is what makes the (disk...) transfer times shorter, and
disk usage lower at the same time.

DoDi
Sponsored Links







Also available: Server administration forum archive | Web Design forum archive | Software forum archive | Hardware reviews archive

Copyright 2008 codecomments.com