Code Comments

Programming Forum and web based access to our favorite programming groups.
For Programmers: Free Programming Magazines | New: Database administration forum
Registration is free! Edit your profileCalendarFind other membersFrequently Asked QuestionsSearch -> 
Post New Thread











Thread
Author

Complementary Positional Encoding
Hi All,

The subject is the result of my experiments on Data Compression.

This is quite different from the existing logic used for comp. (RLE
and lookups) Any opinions/Suggestion are welcome.

find more details in www.geocities.com/vln_prj/. donot bother abt the
styles in the site am an Elec and Comm Engg.

Report this thread to moderator Post Follow-up to this message
Old Post
Lakshmi Narayanan. V
10-09-04 01:55 PM


Re: Complementary Positional Encoding
Hello Lakshmi,

Actually, what you're doing is a kind of indexed Hufmann compression. I have
done the same, created a compressor for bitmaps that is very fast (much
faster than PNG) but with less compression in most cases.

Using the example on your site, your "Huffmann table" is:

A = 10
B = 11

The first 1 is the 1 in the index you add after writing "AB", the second 0
or 1 is the bit in the final complement index.

D = 010
C = 011

The first 0 is the 0 appearing in the index for "AB", the second 1 is the 1
appearing in the index for "DC" and the last 0 or 1 is the bit in the final
complement index.

With some streamlining in code you can make this such that you don't have to
work with codes that cross byte boundaries, so can be very efficient for
streams.

Note that your Huffmann table and compression is probably not optimal for
all texts. Take for example this text:

ABACABACABADABA (15 bytes = 120 bits)

8x A
4x B
2x C
1x D

With your method:

<length> AB 111011101110111 CD 111 010001000101010

This is 8 + 4*8 + 15 + 3 + 15 = 73 bits. Without the unneccesary "111" this
is 70 bits.

Another method could just specify most occuring character, and its
occurences, like so:

<length> A 101010101010101 B 1010101 C 110 D

This method only takes 8 + 4*8 + 15 + 7 + 3 = 65 bits.

You then have a Huffmann table like:

A = 1
B = 01
C = 001
D = 0001

Which is more suitable for this data.

My compressor actually first does a histogram of the data, then finds the
best Huffmann table and then codes this using an indexing scheme like here.
It always uses byte boundaries for the index sequences, to keep things byte
aligned. It only uses Huffman codes with length 1, 2, 4 or back to 8. If it
goes back to 8, or when the remaining places are just a few, it writes out
the remaining bytes directly.

For bitmaps it is often worth it to first do a Diff of the data (store each
byte as difference to the previous byte), and what also works is doing this
compression followed by RLE and then again this compression.

What I also found is that you can still compress the end-result nicely with
flate or zip, achieving an overall compression that is often better than PNG
directly. Same will probably hold for text.

Kind regards,

Nils Haeck
www.simdesign.nl


"Lakshmi Narayanan. V" <lakshminarayanan.v@gmail.com> wrote in message
news:ddf430c3.0410090310.550fd97a@posting.google.com...
> Hi All,
>
> The subject is the result of my experiments on Data Compression.
>
> This is quite different from the existing logic used for comp. (RLE
> and lookups) Any opinions/Suggestion are welcome.
>
> find more details in www.geocities.com/vln_prj/. donot bother abt the
> styles in the site am an Elec and Comm Engg.



Report this thread to moderator Post Follow-up to this message
Old Post
Nils
10-09-04 08:55 PM


Re: Complementary Positional Encoding
"Lakshmi Narayanan. V" <lakshminarayanan.v@gmail.com> wrote in message
news:ddf430c3.0410090310.550fd97a@posting.google.com...
> Hi All,
>
> The subject is the result of my experiments on Data Compression.
>
> This is quite different from the existing logic used for comp. (RLE
> and lookups) Any opinions/Suggestion are welcome.
>
> find more details in www.geocities.com/vln_prj/. donot bother abt the
> styles in the site am an Elec and Comm Engg.

Interesting.  Your algorithm seems to be optimal for a stationary (uniform
statistics), memoryless (order 0 - no correlation between adjacent bytes)
source where the bytes have the probability distribution 1/4, 1/4, 1/8, 1/8,
1/16, 1/16, 1/32, 1/32...  A huffman coder like UNIX pack could probably do
better on other data because it makes no assumption about the distribution,
although it is only optimal for probabilities that are powers of 1/2.
Otherwise, an order-0 arithmetic coder like fpaq0 which I posted earlier (
http://groups.google.com/groups?sel...t&output=gplain
or search comp.comression for fpaq0 in Google groups) could do a bit better.

However real gains in compression exploit redundancies in strings longer
than 1 byte.  All of the popluar compression programs (compress, zip, gzip,
rar, bzip2, paqar, etc) and algorithms (LZ, Burrows-Wheeler, PPM, context
mixing) do this.

-- Matt Mahoney



Report this thread to moderator Post Follow-up to this message
Old Post
Matt Mahoney
10-09-04 08:55 PM


Re: Complementary Positional Encoding
"Matt Mahoney" <matmahoney@yahoo.com> wrote in message
news:3OW9d.13755$Vm1.10978@newsread3.news.atl.earthlink.net...
> Otherwise, an order-0 arithmetic coder like fpaq0 which I posted earlier (
>
http://groups.google.com/groups?sel...t&output=gplain

Or here: http://cs.fit.edu/~mmahoney/compression/fpaq0.cpp

-- Matt Mahoney



Report this thread to moderator Post Follow-up to this message
Old Post
Matt Mahoney
10-10-04 02:12 AM


Re: Complementary Positional Encoding
"Matt Mahoney" <matmahoney@yahoo.com> wrote in message
news:TyY9d.13652$gs1.10768@newsread2.news.atl.earthlink.net...
>
> "Matt Mahoney" <matmahoney@yahoo.com> wrote in message
> news:3OW9d.13755$Vm1.10978@newsread3.news.atl.earthlink.net... 
> http://groups.google.com/groups?sel...t&output=gplain
>
> Or here: http://cs.fit.edu/~mmahoney/compression/fpaq0.cpp
>
interesting. how does it compare with huffman and other artith coders in
terms of compression?

I say this while, as of yet, being unable to write a working arithmetic
coder...
I am not sure why, maybe it is that I can't fully understand arithmetic
coding. huffman coding, though I guess more 'complex', was somewhat easier
to write an encoder for and get working, and is fairly conceptually easy to
understand though.
I don't get why it is so often viewed as working with binary trees and
having to reorganize the trees some to become "optimal", the approach I came
up with seems to work (a stack of multi-leaf "levels").
for building the tree it looks at the statistics, and figures about how many
bits would be optimal for that level.

however:
with huffman coding, 0 still needs 1 bit, but statistics say it should be
closer to 0.5 bits. arith coding seems better, but the working coder I have
(ripped from elsewhere) is too damn slow.
huffman coding seems to generate results 25..30% larger than what I get from
the arithmetic coder (dunno, everything looks fine, I think the limiting
factor is the fractional optimal bits for a few values).
so I continue to persue arithmetic encoding, hoping for that gain, but am
troubled by issues...

the issues I deal with my arith encoder:
my encoder works ok, for a few bytes, but then goes all to hell after that
(along the lines of seemingly random data);
I can't understand what is going on well enough to determine what the
problem is (is it encoding, decoding, both, ...);
dumps of intermediate values (eg: for the first 16 bytes) just show show me
a numerical soup that I can't fully process.

yes, I am handling underflow (in the same way the encoder I was using before
did).

oh well, I guess I could get back to it.




Report this thread to moderator Post Follow-up to this message
Old Post
cr88192
10-10-04 02:12 AM


Re: Complementary Positional Encoding
> however:
> with huffman coding, 0 still needs 1 bit, but statistics say it should be
> closer to 0.5 bits. arith coding seems better, but the working coder I hav
e
> (ripped from elsewhere) is too damn slow.
> huffman coding seems to generate results 25..30% larger than what I get fr
om
> the arithmetic coder (dunno, everything looks fine, I think the limiting
> factor is the fractional optimal bits for a few values).
> so I continue to persue arithmetic encoding, hoping for that gain, but am
> troubled by issues...

have you considered grouping the symbols and then creating a huffman
tree around these groupings? This way so can effectively achieve
sub-bit

Say i have the following symbols and probablilities:

A=0.8
B=0.1
C=0.1

the usual huffman tree would be similar to:

/ \
A
/ \
B C

and thus the following string:

ABCABBAACACCABAAAAAA
12212211212212111111=28

would be compressed to 28 bits

now if we group two symbols together:

AA=0.64
AB=0.08
BA=0.08
AC=0.08
CA=0.08
BB=0.01
BC=0.001
CB=0.001
CC=0.001

which generates a tree:

/ \
AA
/ \
AB
/ \
/   \
/ \   / \
AC BA CA
/ \
BB
/ \
BC
/ \
CC CB

and the string would be compressed in



and thus the following string:

ABCABBAACACCABAAAAAA
2 3 4 1 3 6 2 1 1 1 =24

would be compressed to 24 bits

By increasing the size of the groupings from 2 to more symbols brings
the effective compression via huffman codes closer to that of
Arithmetric compression.

Report this thread to moderator Post Follow-up to this message
Old Post
moogie
10-10-04 01:55 PM


Re: Complementary Positional Encoding
I made a mistake with my other post...

it should be:

which generates a tree:

/ \
AA
/ \
AB
/ \
/   \
/ \   / \
AC BA CA
/ \
BB
/ \
BC
/ \
CC CB

and the string would be compressed in



and thus the following string:

ABCABBAACACCABAAAAAA
2 4 5 1 4 7 2 1 1 1 =28

would be compressed to 28 bits

Which is not a good example so a better one would be:

AAAAAAAAAAAAAAAAAAAAAC

would be compressed to:
1 1 1 1 1 1 1 1 1 1 4 =14 bits by the two symbol group huffman tree

while with the single symbol huffman tree it soul be compressed to:
1111111111111111111112 = 23 bits

By increasing the size of the groupings from 2 to more symbols brings
the effective compression via huffman codes closer to that of
Arithmetric compression.

Report this thread to moderator Post Follow-up to this message
Old Post
moogie
10-10-04 01:55 PM


Re: Complementary Positional Encoding
budgetanime@mystarship.com (moogie) writes:
> I made a mistake with my other post...

Your model didn't match your data.

...
> By increasing the size of the groupings from 2 to more symbols brings
> the effective compression via huffman codes closer to that of
> Arithmetric compression.

There's no need to have a fixed length for the tokens being encoded.
{AA, AB, AC, B, C} with ratios {64,8,8,10,10} fits into a {1,3,3,3,3}
tree quite well.

Tunstall codes are the things to google for.

Phil
--
They no longer do my traditional winks tournament lunch - liver and bacon.
It's just what you need during a winks tournament lunchtime to replace lost
... liver.   -- Anthony Horton, 2004/08/27 at the Cambridge 'Long Vac.'

Report this thread to moderator Post Follow-up to this message
Old Post
Phil Carmody
10-10-04 08:55 PM


Re: Complementary Positional Encoding
"moogie" <budgetanime@mystarship.com> wrote in message
news:e353ade.0410100151.33fe1b@posting.google.com... 
>
> have you considered grouping the symbols and then creating a huffman
> tree around these groupings? This way so can effectively achieve
> sub-bit
>
> Say i have the following symbols and probablilities:
>
> A=0.8
> B=0.1
> C=0.1
>
> the usual huffman tree would be similar to:
>
>            / \
>              A
>           / \
>           B C
>
> and thus the following string:
>
> ABCABBAACACCABAAAAAA
> 12212211212212111111=28
>
> would be compressed to 28 bits
>
> now if we group two symbols together:
>
> AA=0.64
> AB=0.08
> BA=0.08
> AC=0.08
> CA=0.08
> BB=0.01
> BC=0.001
> CB=0.001
> CC=0.001
>
> which generates a tree:
>
>            / \
>           AA
>             / \
>            AB
>              / \
>             /   \
>           / \   / \
>          AC BA CA
>                  / \
>                 BB
>                   / \
>                  BC
>                    / \
>                   CC CB
>
> and the string would be compressed in
>
>
>
> and thus the following string:
>
> ABCABBAACACCABAAAAAA
> 2 3 4 1 3 6 2 1 1 1 =24
>
> would be compressed to 24 bits
>
> By increasing the size of the groupings from 2 to more symbols brings
> the effective compression via huffman codes closer to that of
> Arithmetric compression.

it is an interesting idea, but it seems a little like a hack...

it is also unclear what the probability of 2 versions of a character vs just
one is, and supporting the 2 value version reduces the efficiency of the
single value case.


my arithmatic coder is sort of working.
I will just state that, given the nature of arithmatic coding, it is quite
sensitive to minor variations in the mathy parts of the encoder it seems. an
annoying problem is that of "possibility" for the occurance of a given
value. with fixed statistics, this can be glossed over, with variable
statistics, I have a problem. I am going to need a way to work up an
"escape" mechanism. it is lame, I can't just stuff extra bits in the stream
and be like "oh look, an escape code, better read a few bits for the code".
I am probably going to have to encode/decode the values from the stream,
which is itself scary seeming...

or I could just live with the kind of worse but much simpler and faster
approach of huffman. hell, it is at least close...




Report this thread to moderator Post Follow-up to this message
Old Post
cr88192
10-10-04 08:55 PM


Re: Complementary Positional Encoding
"cr88192" <cr88192@remove.hotmail.com> wrote in message
news:f7%9d.1863$pM5.1337@news.flashnewsgroups.com...
>
> "Matt Mahoney" <matmahoney@yahoo.com> wrote in message
> news:TyY9d.13652$gs1.10768@newsread2.news.atl.earthlink.net... 
earlier 
http://groups.google.com/groups?sel...r />
ut=gplain 
> interesting. how does it compare with huffman and other artith coders in
> terms of compression?

Arithmetic coding does better than Huffman when the symbol proabilities are
not powers of 1/2.  Howver, for practical cases this only matters when you
have one or a few symbols with large probabilities.  For example, if A, B, C
each have probability 1/3 then a Huffman code might be A=00, B=01, C=1
giving you an average symbol length of 5/3 = 1.667 bits.  Arithmetic coding
would give you log2(3) = 1.585 bits.  OTOH, if all your symbol probabilities
were around 0.01, the difference would be very small.

The arithmetic coder in fpaq0 (and most PAQ versions) uses a range (x1=low,
x2=high in the code) but differs from a normal range coder in that it codes
1 bit at a time, so you only give it one probability, P(next bit = 1),
rather  than the probabilities for all the symbols in the alphabet.  This
lets you calculate 8 probabilities per byte rather than 256.

The output of fpaq0's coder is about 0.01% larger than an ideal coder
because the probabilties have to be approximated using integer arithmetic.
First, it represents a probability with 12 bits of resolution.  Second, when
the range straddles a byte boundary so that no output is possible, the
resolution of the range gets reduced, possibly to 2 in rare cases, at which
point the next probability is forced by integer rounding to 1/2 to remove
the split.  This strategy only works when coding 1 bit at a time.  The
normal way to handle this problem is to keep a count of pending output bits
(x1 = 011111..., x2 = 100000...).  In fact, David Scott modified the coder
to do this in some versions of PAQ, including PAQAR I believe.

-- Matt Mahoney



Report this thread to moderator Post Follow-up to this message
Old Post
Matt Mahoney
10-10-04 08:55 PM


Sponsored Links




Last Thread Next Thread Next
Pages (2): [1] 2 »
Search this forum -> 
Post New Thread

Compression archive

Show a Printable Version Send to friend Email This Page to Someone! subscribe to this thread Receive updates to this thread
Computer Consultants
Programming Jobs
Visual Basic Controls
SQL Server Programming
Webservices
Java Security
Visual Studio
C# Programming
Visual J++
Software engineering
Open source Software
Perl Programming
PHP Programming
ASP Programming
ASP .NET Programming
Visual Basic Programming
Windows Scripting Host
Java Programming
Java Help
Java Beans
VBScript
Cobol
MAC Applications
Unix Programming
Forum Jump:
All times are GMT. The time now is 05:52 PM.

 
Free MCSE Braindumps | Real Estate Topics

Programming forum archive

Copyrights CodeComments.com 2004 - 2006

Powered by vBulletin Copyright 2000-2006 Jelsoft Enterprises Limited.