Code Comments
Programming Forum and web based access to our favorite programming groups.Hi All, The subject is the result of my experiments on Data Compression. This is quite different from the existing logic used for comp. (RLE and lookups) Any opinions/Suggestion are welcome. find more details in www.geocities.com/vln_prj/. donot bother abt the styles in the site am an Elec and Comm Engg.
Post Follow-up to this messageHello Lakshmi, Actually, what you're doing is a kind of indexed Hufmann compression. I have done the same, created a compressor for bitmaps that is very fast (much faster than PNG) but with less compression in most cases. Using the example on your site, your "Huffmann table" is: A = 10 B = 11 The first 1 is the 1 in the index you add after writing "AB", the second 0 or 1 is the bit in the final complement index. D = 010 C = 011 The first 0 is the 0 appearing in the index for "AB", the second 1 is the 1 appearing in the index for "DC" and the last 0 or 1 is the bit in the final complement index. With some streamlining in code you can make this such that you don't have to work with codes that cross byte boundaries, so can be very efficient for streams. Note that your Huffmann table and compression is probably not optimal for all texts. Take for example this text: ABACABACABADABA (15 bytes = 120 bits) 8x A 4x B 2x C 1x D With your method: <length> AB 111011101110111 CD 111 010001000101010 This is 8 + 4*8 + 15 + 3 + 15 = 73 bits. Without the unneccesary "111" this is 70 bits. Another method could just specify most occuring character, and its occurences, like so: <length> A 101010101010101 B 1010101 C 110 D This method only takes 8 + 4*8 + 15 + 7 + 3 = 65 bits. You then have a Huffmann table like: A = 1 B = 01 C = 001 D = 0001 Which is more suitable for this data. My compressor actually first does a histogram of the data, then finds the best Huffmann table and then codes this using an indexing scheme like here. It always uses byte boundaries for the index sequences, to keep things byte aligned. It only uses Huffman codes with length 1, 2, 4 or back to 8. If it goes back to 8, or when the remaining places are just a few, it writes out the remaining bytes directly. For bitmaps it is often worth it to first do a Diff of the data (store each byte as difference to the previous byte), and what also works is doing this compression followed by RLE and then again this compression. What I also found is that you can still compress the end-result nicely with flate or zip, achieving an overall compression that is often better than PNG directly. Same will probably hold for text. Kind regards, Nils Haeck www.simdesign.nl "Lakshmi Narayanan. V" <lakshminarayanan.v@gmail.com> wrote in message news:ddf430c3.0410090310.550fd97a@posting.google.com... > Hi All, > > The subject is the result of my experiments on Data Compression. > > This is quite different from the existing logic used for comp. (RLE > and lookups) Any opinions/Suggestion are welcome. > > find more details in www.geocities.com/vln_prj/. donot bother abt the > styles in the site am an Elec and Comm Engg.
Post Follow-up to this message"Lakshmi Narayanan. V" <lakshminarayanan.v@gmail.com> wrote in message news:ddf430c3.0410090310.550fd97a@posting.google.com... > Hi All, > > The subject is the result of my experiments on Data Compression. > > This is quite different from the existing logic used for comp. (RLE > and lookups) Any opinions/Suggestion are welcome. > > find more details in www.geocities.com/vln_prj/. donot bother abt the > styles in the site am an Elec and Comm Engg. Interesting. Your algorithm seems to be optimal for a stationary (uniform statistics), memoryless (order 0 - no correlation between adjacent bytes) source where the bytes have the probability distribution 1/4, 1/4, 1/8, 1/8, 1/16, 1/16, 1/32, 1/32... A huffman coder like UNIX pack could probably do better on other data because it makes no assumption about the distribution, although it is only optimal for probabilities that are powers of 1/2. Otherwise, an order-0 arithmetic coder like fpaq0 which I posted earlier ( http://groups.google.com/groups?sel...t&output=gplain or search comp.comression for fpaq0 in Google groups) could do a bit better. However real gains in compression exploit redundancies in strings longer than 1 byte. All of the popluar compression programs (compress, zip, gzip, rar, bzip2, paqar, etc) and algorithms (LZ, Burrows-Wheeler, PPM, context mixing) do this. -- Matt Mahoney
Post Follow-up to this message"Matt Mahoney" <matmahoney@yahoo.com> wrote in message news:3OW9d.13755$Vm1.10978@newsread3.news.atl.earthlink.net... > Otherwise, an order-0 arithmetic coder like fpaq0 which I posted earlier ( > http://groups.google.com/groups?sel...t&output=gplain Or here: http://cs.fit.edu/~mmahoney/compression/fpaq0.cpp -- Matt Mahoney
Post Follow-up to this message"Matt Mahoney" <matmahoney@yahoo.com> wrote in message news:TyY9d.13652$gs1.10768@newsread2.news.atl.earthlink.net... > > "Matt Mahoney" <matmahoney@yahoo.com> wrote in message > news:3OW9d.13755$Vm1.10978@newsread3.news.atl.earthlink.net... > http://groups.google.com/groups?sel...t&output=gplain > > Or here: http://cs.fit.edu/~mmahoney/compression/fpaq0.cpp > interesting. how does it compare with huffman and other artith coders in terms of compression? I say this while, as of yet, being unable to write a working arithmetic coder... I am not sure why, maybe it is that I can't fully understand arithmetic coding. huffman coding, though I guess more 'complex', was somewhat easier to write an encoder for and get working, and is fairly conceptually easy to understand though. I don't get why it is so often viewed as working with binary trees and having to reorganize the trees some to become "optimal", the approach I came up with seems to work (a stack of multi-leaf "levels"). for building the tree it looks at the statistics, and figures about how many bits would be optimal for that level. however: with huffman coding, 0 still needs 1 bit, but statistics say it should be closer to 0.5 bits. arith coding seems better, but the working coder I have (ripped from elsewhere) is too damn slow. huffman coding seems to generate results 25..30% larger than what I get from the arithmetic coder (dunno, everything looks fine, I think the limiting factor is the fractional optimal bits for a few values). so I continue to persue arithmetic encoding, hoping for that gain, but am troubled by issues... the issues I deal with my arith encoder: my encoder works ok, for a few bytes, but then goes all to hell after that (along the lines of seemingly random data); I can't understand what is going on well enough to determine what the problem is (is it encoding, decoding, both, ...); dumps of intermediate values (eg: for the first 16 bytes) just show show me a numerical soup that I can't fully process. yes, I am handling underflow (in the same way the encoder I was using before did). oh well, I guess I could get back to it.
Post Follow-up to this message> however: > with huffman coding, 0 still needs 1 bit, but statistics say it should be > closer to 0.5 bits. arith coding seems better, but the working coder I hav e > (ripped from elsewhere) is too damn slow. > huffman coding seems to generate results 25..30% larger than what I get fr om > the arithmetic coder (dunno, everything looks fine, I think the limiting > factor is the fractional optimal bits for a few values). > so I continue to persue arithmetic encoding, hoping for that gain, but am > troubled by issues... have you considered grouping the symbols and then creating a huffman tree around these groupings? This way so can effectively achieve sub-bit Say i have the following symbols and probablilities: A=0.8 B=0.1 C=0.1 the usual huffman tree would be similar to: / \ A / \ B C and thus the following string: ABCABBAACACCABAAAAAA 12212211212212111111=28 would be compressed to 28 bits now if we group two symbols together: AA=0.64 AB=0.08 BA=0.08 AC=0.08 CA=0.08 BB=0.01 BC=0.001 CB=0.001 CC=0.001 which generates a tree: / \ AA / \ AB / \ / \ / \ / \ AC BA CA / \ BB / \ BC / \ CC CB and the string would be compressed in and thus the following string: ABCABBAACACCABAAAAAA 2 3 4 1 3 6 2 1 1 1 =24 would be compressed to 24 bits By increasing the size of the groupings from 2 to more symbols brings the effective compression via huffman codes closer to that of Arithmetric compression.
Post Follow-up to this messageI made a mistake with my other post... it should be: which generates a tree: / \ AA / \ AB / \ / \ / \ / \ AC BA CA / \ BB / \ BC / \ CC CB and the string would be compressed in and thus the following string: ABCABBAACACCABAAAAAA 2 4 5 1 4 7 2 1 1 1 =28 would be compressed to 28 bits Which is not a good example so a better one would be: AAAAAAAAAAAAAAAAAAAAAC would be compressed to: 1 1 1 1 1 1 1 1 1 1 4 =14 bits by the two symbol group huffman tree while with the single symbol huffman tree it soul be compressed to: 1111111111111111111112 = 23 bits By increasing the size of the groupings from 2 to more symbols brings the effective compression via huffman codes closer to that of Arithmetric compression.
Post Follow-up to this messagebudgetanime@mystarship.com (moogie) writes:
> I made a mistake with my other post...
Your model didn't match your data.
...
> By increasing the size of the groupings from 2 to more symbols brings
> the effective compression via huffman codes closer to that of
> Arithmetric compression.
There's no need to have a fixed length for the tokens being encoded.
{AA, AB, AC, B, C} with ratios {64,8,8,10,10} fits into a {1,3,3,3,3}
tree quite well.
Tunstall codes are the things to google for.
Phil
--
They no longer do my traditional winks tournament lunch - liver and bacon.
It's just what you need during a winks tournament lunchtime to replace lost
... liver. -- Anthony Horton, 2004/08/27 at the Cambridge 'Long Vac.'
Post Follow-up to this message"moogie" <budgetanime@mystarship.com> wrote in message news:e353ade.0410100151.33fe1b@posting.google.com... > > have you considered grouping the symbols and then creating a huffman > tree around these groupings? This way so can effectively achieve > sub-bit > > Say i have the following symbols and probablilities: > > A=0.8 > B=0.1 > C=0.1 > > the usual huffman tree would be similar to: > > / \ > A > / \ > B C > > and thus the following string: > > ABCABBAACACCABAAAAAA > 12212211212212111111=28 > > would be compressed to 28 bits > > now if we group two symbols together: > > AA=0.64 > AB=0.08 > BA=0.08 > AC=0.08 > CA=0.08 > BB=0.01 > BC=0.001 > CB=0.001 > CC=0.001 > > which generates a tree: > > / \ > AA > / \ > AB > / \ > / \ > / \ / \ > AC BA CA > / \ > BB > / \ > BC > / \ > CC CB > > and the string would be compressed in > > > > and thus the following string: > > ABCABBAACACCABAAAAAA > 2 3 4 1 3 6 2 1 1 1 =24 > > would be compressed to 24 bits > > By increasing the size of the groupings from 2 to more symbols brings > the effective compression via huffman codes closer to that of > Arithmetric compression. it is an interesting idea, but it seems a little like a hack... it is also unclear what the probability of 2 versions of a character vs just one is, and supporting the 2 value version reduces the efficiency of the single value case. my arithmatic coder is sort of working. I will just state that, given the nature of arithmatic coding, it is quite sensitive to minor variations in the mathy parts of the encoder it seems. an annoying problem is that of "possibility" for the occurance of a given value. with fixed statistics, this can be glossed over, with variable statistics, I have a problem. I am going to need a way to work up an "escape" mechanism. it is lame, I can't just stuff extra bits in the stream and be like "oh look, an escape code, better read a few bits for the code". I am probably going to have to encode/decode the values from the stream, which is itself scary seeming... or I could just live with the kind of worse but much simpler and faster approach of huffman. hell, it is at least close...
Post Follow-up to this message"cr88192" <cr88192@remove.hotmail.com> wrote in message news:f7%9d.1863$pM5.1337@news.flashnewsgroups.com... > > "Matt Mahoney" <matmahoney@yahoo.com> wrote in message > news:TyY9d.13652$gs1.10768@newsread2.news.atl.earthlink.net... earlier http://groups.google.com/groups?sel...r /> ut=gplain > interesting. how does it compare with huffman and other artith coders in > terms of compression? Arithmetic coding does better than Huffman when the symbol proabilities are not powers of 1/2. Howver, for practical cases this only matters when you have one or a few symbols with large probabilities. For example, if A, B, C each have probability 1/3 then a Huffman code might be A=00, B=01, C=1 giving you an average symbol length of 5/3 = 1.667 bits. Arithmetic coding would give you log2(3) = 1.585 bits. OTOH, if all your symbol probabilities were around 0.01, the difference would be very small. The arithmetic coder in fpaq0 (and most PAQ versions) uses a range (x1=low, x2=high in the code) but differs from a normal range coder in that it codes 1 bit at a time, so you only give it one probability, P(next bit = 1), rather than the probabilities for all the symbols in the alphabet. This lets you calculate 8 probabilities per byte rather than 256. The output of fpaq0's coder is about 0.01% larger than an ideal coder because the probabilties have to be approximated using integer arithmetic. First, it represents a probability with 12 bits of resolution. Second, when the range straddles a byte boundary so that no output is possible, the resolution of the range gets reduced, possibly to 2 in rare cases, at which point the next probability is forced by integer rounding to 1/2 to remove the split. This strategy only works when coding 1 bit at a time. The normal way to handle this problem is to keep a count of pending output bits (x1 = 011111..., x2 = 100000...). In fact, David Scott modified the coder to do this in some versions of PAQ, including PAQAR I believe. -- Matt Mahoney
Post Follow-up to this message
Show a Printable Version
Email This Page to Someone!
Receive updates to this thread
Powered by vBulletin
Copyright 2000-2006 Jelsoft Enterprises Limited.