Code Comments
Programming Forum and web based access to our favorite programming groups.afaik, LZW is supposed to be good. maybe it is just me, but I have generally had poor success with it (it tending to come up slower and with often poorer compression than what I can get from LZ77...). ok, so I tried combining a variation of LZW with my BTEC algo (maybe part of the problem, maybe not). the rough idea here is that some leaves are values (eg: special control codes and single byte values), and the rest are a large collection of strings. I am using a hash-table to look up the strings (however, I can only use the first 2 bytes for the hash as many of the strings are only 2 bytes). I also limit searches to the first 256 strings, and generally keep the lists mru-sorted (removing this limit hurts compressor speed and only slightly increases ratio). just to clarify, my understanding of LZW is: while there is more input: look up "best" string matching the current input; emit the code for the matched string (adjusts weights); append the current string to the last string, adding the result to the dictionary. or for decoding: decode the string (adjusts weights); emit the string; append the current string to the last string, adding the result to the dictionary. the dictionary is also the leaves of the huffman tree, which are concievably shifted around as the file is processed. eg, for a test consisting of an approximately 3MB text file: compression: approx 357kB/s, approx 23% original size (or about 77% compression); decompression: approx 3.4MB/s; final tree: 24 bit depth, approx 374k leaves. (I tried allowing partial matches on strings, which seemed only to hurt the compression ratio). a lot seems to depend on how I define "best". presently I am using "longest match" as that seems to give best results. a possible variation could include checking at different offsets and looking for the shortest combination of code lengths, but this would likely slow down the encoder considerably and I am unsure if it would improve compression. it is also possible that just the "mass" of the huffman trees serves to hurt compression, and a possibility could be to occasionally prune the huffman trees (thus possibly generally shortening code lengths). I am unsure though, it is possible that none of the strings has much weight and such pruning would not be helpful. I am more sure how the algo works in the case of a small number of leaves, it is more difficult to imagine how effective the algo is in this case. as for pruning, a good idea is lacking. one possibility would be to drop all the internal nodes and any sufficiently light leaves, then using good-ol' huffman to rebuild the tree from there. this, however, would likely expose a fair amount of the internals of the tree management... other algos (eg: top-down tree rebuilding) seem a lot more hairy. compared with a trivial combination of BTEC and LZ77: compression: approx 1.5MB/s, 28% original size (72% compression); decompression: approx 4MB/s. yeah, it compresses better than my LZ77 variant does, but is a lot slower (the lz77 decoder speed is based heavily on what entropy coder I use). both are defeated on both fronts by deflate (most specifically gzip), which seems to get ratios better than what I can easily understand from either the rfc describing the algo, or the way things are generally encoded. I am not quite sure of the mystery. it might be something special in the way they choose their strings?... maybe I am stupid though or missing something obvious? comments?...
Post Follow-up to this messagecr88192 wrote: > afaik, LZW is supposed to be good. maybe it is just me, but I have generally > had poor success with it (it tending to come up slower and with often poorer > compression than what I can get from LZ77...). > > > ok, so I tried combining a variation of LZW with my BTEC algo (maybe part of > the problem, maybe not). the rough idea here is that some leaves are values > (eg: special control codes and single byte values), and the rest are a large > collection of strings. > I am using a hash-table to look up the strings (however, I can only use the > first 2 bytes for the hash as many of the strings are only 2 bytes). > I also limit searches to the first 256 strings, and generally keep the lists > mru-sorted (removing this limit hurts compressor speed and only slightly > increases ratio). > > just to clarify, my understanding of LZW is: > while there is more input: > look up "best" string matching the current input; > emit the code for the matched string (adjusts weights); > append the current string to the last string, > adding the result to the dictionary. > > or for decoding: > decode the string (adjusts weights); > emit the string; > append the current string to the last string, > adding the result to the dictionary. > > > the dictionary is also the leaves of the huffman tree, which are concievably > shifted around as the file is processed. > > eg, for a test consisting of an approximately 3MB text file: > compression: approx 357kB/s, approx 23% original size (or about 77% > compression); > decompression: approx 3.4MB/s; > final tree: 24 bit depth, approx 374k leaves. > > (I tried allowing partial matches on strings, which seemed only to hurt the > compression ratio). > a lot seems to depend on how I define "best". presently I am using "longest > match" as that seems to give best results. > a possible variation could include checking at different offsets and looking > for the shortest combination of code lengths, but this would likely slow > down the encoder considerably and I am unsure if it would improve > compression. > > it is also possible that just the "mass" of the huffman trees serves to hurt > compression, and a possibility could be to occasionally prune the huffman > trees (thus possibly generally shortening code lengths). > I am unsure though, it is possible that none of the strings has much weight > and such pruning would not be helpful. > I am more sure how the algo works in the case of a small number of leaves, > it is more difficult to imagine how effective the algo is in this case. > > as for pruning, a good idea is lacking. > one possibility would be to drop all the internal nodes and any sufficiently > light leaves, then using good-ol' huffman to rebuild the tree from there. > this, however, would likely expose a fair amount of the internals of the > tree management... > other algos (eg: top-down tree rebuilding) seem a lot more hairy. > > > compared with a trivial combination of BTEC and LZ77: > compression: approx 1.5MB/s, 28% original size (72% compression); > decompression: approx 4MB/s. > > yeah, it compresses better than my LZ77 variant does, but is a lot slower > (the lz77 decoder speed is based heavily on what entropy coder I use). > > > both are defeated on both fronts by deflate (most specifically gzip), which > seems to get ratios better than what I can easily understand from either the > rfc describing the algo, or the way things are generally encoded. > > > I am not quite sure of the mystery. it might be something special in the way > they choose their strings?... > > maybe I am stupid though or missing something obvious? > > comments?... I think your understanding of LZW is correct. There are variations of building the dictionary. You can append one byte to a matched string, or append two consecutive matched strings as you do, or other variations. Which is best depends on the data. LZW should give you better compression than LZ77 in theory. That's because LZW removes a type of redundancy present in LZ77. If you have a string ...ABC...ABD...ABE, then the third AB could be matched to either the first for the second occurrence, and this arbitrary choice requires one bit to code. LZW would just have a pointer to AB in the dictionary. The divantage of LZW is that there are many symbols in the dictionary that are never used, such as ABC and ABD, so you need more bits to code a symbol. It also depends on how your symbols are coded. In practice, gzip/zlib (LZ77) compresses better than UNIX compress (LZW). gzip uses Huffman codes to code the pointer/length pairs using either a default Huffman tree or a tree explicitly transmitted in a header. Usually smaller numbers are coded with fewer bits. IIRC compress always uses ceil(log2(n)) bits to code a symbol from a dictionary of size n. Your Huffman code should improve on this. -- Matt Mahoney
Post Follow-up to this messagecr88192 wrote: > afaik, LZW is supposed to be good. maybe it is just me, but I have generally > had poor success with it (it tending to come up slower and with often poorer > compression than what I can get from LZ77...). Getting my computing chops in the 1980s, LZW seemed to make the most sense: LZ78 was released later than LZ77 so people assumed it was a better method; Welch's take on LZ78 in 1984 became LZW which people jumped on; Thom Henderson's ARC program (variant of Unix' "compress (LZ78) made Fidonet possible; Compuserve chose LZW for GIF; etc. etc. And then, at some point, people realized that LZ77 was actually a more efficient method for many implementations: The length/offset matches + literals took up the same or less space than an LZ78 dictionary of the same material, and LZ77 had other benefits like much simpler decoding requirements (which directly translates to decompression speed). I'm not sure if anyone (Phil Katz? Ross Williams?) led a "crue" for LZ77, but Deflate (RFC http://www.ietf.org/rfc/rfc1951.txt) certainly popularized it. > I am using a hash-table to look up the strings (however, I can only use the > first 2 bytes for the hash as many of the strings are only 2 bytes). > I also limit searches to the first 256 strings, and generally keep the lists > mru-sorted (removing this limit hurts compressor speed and only slightly > increases ratio). Maybe you haven't been testing with decent materal? 256 means you're not exceeding a 9-bit dictionary, right? My LZ78 implementation knowledge isn't very strong, but IIRC Unix compress uses variable codes up to a 16-bit dictionary. (If I'm misunderstanding you, please ignore :-) > a possible variation could include checking at different offsets and looking > for the shortest combination of code lengths, but this would likely slow > down the encoder considerably and I am unsure if it would improve > compression. Actually, this is a key component of getting 1%-5% more performance out of any coder that uses a variable-sized dictionary. I believe this is called "lazy coding" (the technique, not an attitude). > both are defeated on both fronts by deflate (most specifically gzip), which > seems to get ratios better than what I can easily understand from either the > rfc describing the algo, or the way things are generally encoded. > I am not quite sure of the mystery. it might be something special in the way > they choose their strings?... The strength of any LZ77 coder is in the pattern matching/combining. Check the code and also http://www.gzip.org/zlib/feldspar.html has some basic notes on compressor cleverness. Also try http://www.cs.tut.fi/~albert/Dev/pucrunch/#Graph
Post Follow-up to this message"Matt Mahoney" <matmahoney@yahoo.com> wrote in message news:1110300510.841947.38870@l41g2000cwc.googlegroups.com... > <snip> > > I think your understanding of LZW is correct. There are variations of > building the dictionary. You can append one byte to a matched string, > or append two consecutive matched strings as you do, or other > variations. Which is best depends on the data. > ok, I was unsure, I have seen differing definitions of LZW. I had found though that only appending a single character at a time seemed to somewhat hurt the compression ratio. > LZW should give you better compression than LZ77 in theory. That's > because LZW removes a type of redundancy present in LZ77. If you have > a string ...ABC...ABD...ABE, then the third AB could be matched to > either the first for the second occurrence, and this arbitrary choice > requires one bit to code. LZW would just have a pointer to AB in the > dictionary. > yes, though typically afaik LZ77 would take more bits, as it encodes both a length and an offset. however, LZ77 can refer to any arbitrary string that had occured previously, wheras LZW is limited to whatever is in the dictionary. ... > The divantage of LZW is that there are many symbols in the > dictionary that are never used, such as ABC and ABD, so you need more > bits to code a symbol. > yes. if all is working well, the more often used codes should become shorter and the less often used ones longer, however, with a large number of leaves, it would take some pretty common strings to really be able to shift the mass of the dictionary. a possibility would be adding strings with effectively 0 weight. just tried this, this signifigantly hurts sompression and leads to very deep trees. I boosted the weights for actually used strings some, which helped the ratio slightly, but if the weight is too high compression goes down. a good weight of about 4 boosts the tree depth by about 3 bits (to 27). > It also depends on how your symbols are coded. In practice, gzip/zlib > (LZ77) compresses better than UNIX compress (LZW). gzip uses Huffman > codes to code the pointer/length pairs using either a default Huffman > tree or a tree explicitly transmitted in a header. Usually smaller > numbers are coded with fewer bits. IIRC compress always uses > ceil(log2(n)) bits to code a symbol from a dictionary of size n. Your > Huffman code should improve on this. > yes, maybe. the exact cost for a string depends less on the total size of the dictionary but on its probability, which still tends to be less as the dictionary gets large...
Post Follow-up to this message"Jim Leonard" <MobyGamer@gmail.com> wrote in message news:1110315316.898244.101210@g14g2000cwa.googlegroups.com... > cr88192 wrote: > generally > poorer > > Getting my computing chops in the 1980s, LZW seemed to make the most > sense: LZ78 was released later than LZ77 so people assumed it was a > better method; Welch's take on LZ78 in 1984 became LZW which people > jumped on; Thom Henderson's ARC program (variant of Unix' "compress > (LZ78) made Fidonet possible; Compuserve chose LZW for GIF; etc. etc. > > And then, at some point, people realized that LZ77 was actually a more > efficient method for many implementations: The length/offset matches + > literals took up the same or less space than an LZ78 dictionary of the > same material, and LZ77 had other benefits like much simpler decoding > requirements (which directly translates to decompression speed). I'm > not sure if anyone (Phil Katz? Ross Williams?) led a "crue" for > LZ77, but Deflate (RFC http://www.ietf.org/rfc/rfc1951.txt) certainly > popularized it. > yeah, LZ77 is looking good, just I figured maybe I could squeeze more out of LZW, eg, by mapping the dictionary directly to the huffman tree so that in effect the bit-code refers directly to the string in question, vs a length/offset pair. LZ77, however, still seems to have some magic left. > use the > the lists > slightly > > Maybe you haven't been testing with decent materal? 256 means you're > not exceeding a 9-bit dictionary, right? My LZ78 implementation > knowledge isn't very strong, but IIRC Unix compress uses variable codes > up to a 16-bit dictionary. (If I'm misunderstanding you, please ignore > :-) > not 256 per dictionary, but 256 per hash entry. this was due to speed reasons, as checking all the strings in a single hash entry was taking too damn long, and got progressively slower as file-size increased... just tried dropping weights for strings that fall outside the 256 limit (possibly to allow them to sink out of the table). no real positive effect, everything I do seems to hurt the ratio... > looking > slow > > Actually, this is a key component of getting 1%-5% more performance out > of any coder that uses a variable-sized dictionary. I believe this is > called "lazy coding" (the technique, not an attitude). > ok. > which > either the > the way > > The strength of any LZ77 coder is in the pattern matching/combining. > Check the code and also http://www.gzip.org/zlib/feldspar.html has some > basic notes on compressor cleverness. Also try > http://www.cs.tut.fi/~albert/Dev/pucrunch/#Graph > yeah. as for the "forward looking" bit in the example, this seems troublesome for the decoder. I guess for the decoder it could be added as a hack: while distance<length copy distance bytes to output; offset window by distance; subtract distance from length. (for the encoder, I allow it to read past the end of the window). I guess this is a lot cheaper than my stepping forwards-hacks (which are expensive, ugly, and typically only marginally helped with compression). I mentioned elsewhere that my lzw variant seems to defeat deflate for english text at present, but falls short in comparrison to non-english text (eg: c source or xml) and data (eg: exe files). or such...
Post Follow-up to this message
Show a Printable Version
Email This Page to Someone!
Receive updates to this thread
Powered by vBulletin
Copyright 2000-2006 Jelsoft Enterprises Limited.