For Programmers: Free Programming Magazines  


Home > Archive > Compression > October 2004 > Complementary Positional Encoding









You are viewing an archived Text-only version of the thread. To view this thread in it's original format and/or if you want to reply to this thread please [click here]

 

Author Complementary Positional Encoding
Lakshmi Narayanan. V

2004-10-09, 8:55 am

Hi All,

The subject is the result of my experiments on Data Compression.

This is quite different from the existing logic used for comp. (RLE
and lookups) Any opinions/Suggestion are welcome.

find more details in www.geocities.com/vln_prj/. donot bother abt the
styles in the site am an Elec and Comm Engg.
Nils

2004-10-09, 3:55 pm

Hello Lakshmi,

Actually, what you're doing is a kind of indexed Hufmann compression. I have
done the same, created a compressor for bitmaps that is very fast (much
faster than PNG) but with less compression in most cases.

Using the example on your site, your "Huffmann table" is:

A = 10
B = 11

The first 1 is the 1 in the index you add after writing "AB", the second 0
or 1 is the bit in the final complement index.

D = 010
C = 011

The first 0 is the 0 appearing in the index for "AB", the second 1 is the 1
appearing in the index for "DC" and the last 0 or 1 is the bit in the final
complement index.

With some streamlining in code you can make this such that you don't have to
work with codes that cross byte boundaries, so can be very efficient for
streams.

Note that your Huffmann table and compression is probably not optimal for
all texts. Take for example this text:

ABACABACABADABA (15 bytes = 120 bits)

8x A
4x B
2x C
1x D

With your method:

<length> AB 111011101110111 CD 111 010001000101010

This is 8 + 4*8 + 15 + 3 + 15 = 73 bits. Without the unneccesary "111" this
is 70 bits.

Another method could just specify most occuring character, and its
occurences, like so:

<length> A 101010101010101 B 1010101 C 110 D

This method only takes 8 + 4*8 + 15 + 7 + 3 = 65 bits.

You then have a Huffmann table like:

A = 1
B = 01
C = 001
D = 0001

Which is more suitable for this data.

My compressor actually first does a histogram of the data, then finds the
best Huffmann table and then codes this using an indexing scheme like here.
It always uses byte boundaries for the index sequences, to keep things byte
aligned. It only uses Huffman codes with length 1, 2, 4 or back to 8. If it
goes back to 8, or when the remaining places are just a few, it writes out
the remaining bytes directly.

For bitmaps it is often worth it to first do a Diff of the data (store each
byte as difference to the previous byte), and what also works is doing this
compression followed by RLE and then again this compression.

What I also found is that you can still compress the end-result nicely with
flate or zip, achieving an overall compression that is often better than PNG
directly. Same will probably hold for text.

Kind regards,

Nils Haeck
www.simdesign.nl


"Lakshmi Narayanan. V" <lakshminarayanan.v@gmail.com> wrote in message
news:ddf430c3.0410090310.550fd97a@posting.google.com...
> Hi All,
>
> The subject is the result of my experiments on Data Compression.
>
> This is quite different from the existing logic used for comp. (RLE
> and lookups) Any opinions/Suggestion are welcome.
>
> find more details in www.geocities.com/vln_prj/. donot bother abt the
> styles in the site am an Elec and Comm Engg.



Matt Mahoney

2004-10-09, 3:55 pm

"Lakshmi Narayanan. V" <lakshminarayanan.v@gmail.com> wrote in message
news:ddf430c3.0410090310.550fd97a@posting.google.com...
> Hi All,
>
> The subject is the result of my experiments on Data Compression.
>
> This is quite different from the existing logic used for comp. (RLE
> and lookups) Any opinions/Suggestion are welcome.
>
> find more details in www.geocities.com/vln_prj/. donot bother abt the
> styles in the site am an Elec and Comm Engg.


Interesting. Your algorithm seems to be optimal for a stationary (uniform
statistics), memoryless (order 0 - no correlation between adjacent bytes)
source where the bytes have the probability distribution 1/4, 1/4, 1/8, 1/8,
1/16, 1/16, 1/32, 1/32... A huffman coder like UNIX pack could probably do
better on other data because it makes no assumption about the distribution,
although it is only optimal for probabilities that are powers of 1/2.
Otherwise, an order-0 arithmetic coder like fpaq0 which I posted earlier (
http://groups.google.com/groups?sel...t&output=gplain
or search comp.comression for fpaq0 in Google groups) could do a bit better.

However real gains in compression exploit redundancies in strings longer
than 1 byte. All of the popluar compression programs (compress, zip, gzip,
rar, bzip2, paqar, etc) and algorithms (LZ, Burrows-Wheeler, PPM, context
mixing) do this.

-- Matt Mahoney


Matt Mahoney

2004-10-09, 9:12 pm


"Matt Mahoney" <matmahoney@yahoo.com> wrote in message
news:3OW9d.13755$Vm1.10978@newsread3.news.atl.earthlink.net...
> Otherwise, an order-0 arithmetic coder like fpaq0 which I posted earlier (
>

http://groups.google.com/groups?sel...t&output=gplain

Or here: http://cs.fit.edu/~mmahoney/compression/fpaq0.cpp

-- Matt Mahoney


cr88192

2004-10-09, 9:12 pm


"Matt Mahoney" <matmahoney@yahoo.com> wrote in message
news:TyY9d.13652$gs1.10768@newsread2.news.atl.earthlink.net...
>
> "Matt Mahoney" <matmahoney@yahoo.com> wrote in message
> news:3OW9d.13755$Vm1.10978@newsread3.news.atl.earthlink.net...
> http://groups.google.com/groups?sel...t&output=gplain
>
> Or here: http://cs.fit.edu/~mmahoney/compression/fpaq0.cpp
>

interesting. how does it compare with huffman and other artith coders in
terms of compression?

I say this while, as of yet, being unable to write a working arithmetic
coder...
I am not sure why, maybe it is that I can't fully understand arithmetic
coding. huffman coding, though I guess more 'complex', was somewhat easier
to write an encoder for and get working, and is fairly conceptually easy to
understand though.
I don't get why it is so often viewed as working with binary trees and
having to reorganize the trees some to become "optimal", the approach I came
up with seems to work (a stack of multi-leaf "levels").
for building the tree it looks at the statistics, and figures about how many
bits would be optimal for that level.

however:
with huffman coding, 0 still needs 1 bit, but statistics say it should be
closer to 0.5 bits. arith coding seems better, but the working coder I have
(ripped from elsewhere) is too damn slow.
huffman coding seems to generate results 25..30% larger than what I get from
the arithmetic coder (dunno, everything looks fine, I think the limiting
factor is the fractional optimal bits for a few values).
so I continue to persue arithmetic encoding, hoping for that gain, but am
troubled by issues...

the issues I deal with my arith encoder:
my encoder works ok, for a few bytes, but then goes all to hell after that
(along the lines of seemingly random data);
I can't understand what is going on well enough to determine what the
problem is (is it encoding, decoding, both, ...);
dumps of intermediate values (eg: for the first 16 bytes) just show show me
a numerical soup that I can't fully process.

yes, I am handling underflow (in the same way the encoder I was using before
did).

oh well, I guess I could get back to it.



moogie

2004-10-10, 8:55 am

> however:
> with huffman coding, 0 still needs 1 bit, but statistics say it should be
> closer to 0.5 bits. arith coding seems better, but the working coder I have
> (ripped from elsewhere) is too damn slow.
> huffman coding seems to generate results 25..30% larger than what I get from
> the arithmetic coder (dunno, everything looks fine, I think the limiting
> factor is the fractional optimal bits for a few values).
> so I continue to persue arithmetic encoding, hoping for that gain, but am
> troubled by issues...


have you considered grouping the symbols and then creating a huffman
tree around these groupings? This way so can effectively achieve
sub-bit

Say i have the following symbols and probablilities:

A=0.8
B=0.1
C=0.1

the usual huffman tree would be similar to:

/ \
A
/ \
B C

and thus the following string:

ABCABBAACACCABAAAAAA
12212211212212111111=28

would be compressed to 28 bits

now if we group two symbols together:

AA=0.64
AB=0.08
BA=0.08
AC=0.08
CA=0.08
BB=0.01
BC=0.001
CB=0.001
CC=0.001

which generates a tree:

/ \
AA
/ \
AB
/ \
/ \
/ \ / \
AC BA CA
/ \
BB
/ \
BC
/ \
CC CB

and the string would be compressed in



and thus the following string:

ABCABBAACACCABAAAAAA
2 3 4 1 3 6 2 1 1 1 =24

would be compressed to 24 bits

By increasing the size of the groupings from 2 to more symbols brings
the effective compression via huffman codes closer to that of
Arithmetric compression.
moogie

2004-10-10, 8:55 am

I made a mistake with my other post...

it should be:

which generates a tree:

/ \
AA
/ \
AB
/ \
/ \
/ \ / \
AC BA CA
/ \
BB
/ \
BC
/ \
CC CB

and the string would be compressed in



and thus the following string:

ABCABBAACACCABAAAAAA
2 4 5 1 4 7 2 1 1 1 =28

would be compressed to 28 bits

Which is not a good example so a better one would be:

AAAAAAAAAAAAAAAAAAAAAC

would be compressed to:
1 1 1 1 1 1 1 1 1 1 4 =14 bits by the two symbol group huffman tree

while with the single symbol huffman tree it soul be compressed to:
1111111111111111111112 = 23 bits

By increasing the size of the groupings from 2 to more symbols brings
the effective compression via huffman codes closer to that of
Arithmetric compression.
Phil Carmody

2004-10-10, 3:55 pm

budgetanime@mystarship.com (moogie) writes:
> I made a mistake with my other post...


Your model didn't match your data.

....
> By increasing the size of the groupings from 2 to more symbols brings
> the effective compression via huffman codes closer to that of
> Arithmetric compression.


There's no need to have a fixed length for the tokens being encoded.
{AA, AB, AC, B, C} with ratios {64,8,8,10,10} fits into a {1,3,3,3,3}
tree quite well.

Tunstall codes are the things to google for.

Phil
--
They no longer do my traditional winks tournament lunch - liver and bacon.
It's just what you need during a winks tournament lunchtime to replace lost
.... liver. -- Anthony Horton, 2004/08/27 at the Cambridge 'Long Vac.'
cr88192

2004-10-10, 3:55 pm


"moogie" <budgetanime@mystarship.com> wrote in message
news:e353ade.0410100151.33fe1b@posting.google.com...
>
> have you considered grouping the symbols and then creating a huffman
> tree around these groupings? This way so can effectively achieve
> sub-bit
>
> Say i have the following symbols and probablilities:
>
> A=0.8
> B=0.1
> C=0.1
>
> the usual huffman tree would be similar to:
>
> / \
> A
> / \
> B C
>
> and thus the following string:
>
> ABCABBAACACCABAAAAAA
> 12212211212212111111=28
>
> would be compressed to 28 bits
>
> now if we group two symbols together:
>
> AA=0.64
> AB=0.08
> BA=0.08
> AC=0.08
> CA=0.08
> BB=0.01
> BC=0.001
> CB=0.001
> CC=0.001
>
> which generates a tree:
>
> / \
> AA
> / \
> AB
> / \
> / \
> / \ / \
> AC BA CA
> / \
> BB
> / \
> BC
> / \
> CC CB
>
> and the string would be compressed in
>
>
>
> and thus the following string:
>
> ABCABBAACACCABAAAAAA
> 2 3 4 1 3 6 2 1 1 1 =24
>
> would be compressed to 24 bits
>
> By increasing the size of the groupings from 2 to more symbols brings
> the effective compression via huffman codes closer to that of
> Arithmetric compression.


it is an interesting idea, but it seems a little like a hack...

it is also unclear what the probability of 2 versions of a character vs just
one is, and supporting the 2 value version reduces the efficiency of the
single value case.


my arithmatic coder is sort of working.
I will just state that, given the nature of arithmatic coding, it is quite
sensitive to minor variations in the mathy parts of the encoder it seems. an
annoying problem is that of "possibility" for the occurance of a given
value. with fixed statistics, this can be glossed over, with variable
statistics, I have a problem. I am going to need a way to work up an
"escape" mechanism. it is lame, I can't just stuff extra bits in the stream
and be like "oh look, an escape code, better read a few bits for the code".
I am probably going to have to encode/decode the values from the stream,
which is itself scary seeming...

or I could just live with the kind of worse but much simpler and faster
approach of huffman. hell, it is at least close...



Nils

2004-10-10, 3:55 pm

In addition to Moogies reply (which is very good), you can also consider a
reduced state arithmetic coder. I have been able to find quite some articles
about them. It speeds up the arithmetic coding considerably, because it
removes the need to do a division for each addition of a symbol.

The trouble though is that most of these are patented.

Kind regards,

Nils


Matt Mahoney

2004-10-10, 3:55 pm


"cr88192" <cr88192@remove.hotmail.com> wrote in message
news:f7%9d.1863$pM5.1337@news.flashnewsgroups.com...
>
> "Matt Mahoney" <matmahoney@yahoo.com> wrote in message
> news:TyY9d.13652$gs1.10768@newsread2.news.atl.earthlink.net...
earlier[color=darkred]
http://groups.google.com/groups?sel...t&output=gplain[color=darkred]
> interesting. how does it compare with huffman and other artith coders in
> terms of compression?


Arithmetic coding does better than Huffman when the symbol proabilities are
not powers of 1/2. Howver, for practical cases this only matters when you
have one or a few symbols with large probabilities. For example, if A, B, C
each have probability 1/3 then a Huffman code might be A=00, B=01, C=1
giving you an average symbol length of 5/3 = 1.667 bits. Arithmetic coding
would give you log2(3) = 1.585 bits. OTOH, if all your symbol probabilities
were around 0.01, the difference would be very small.

The arithmetic coder in fpaq0 (and most PAQ versions) uses a range (x1=low,
x2=high in the code) but differs from a normal range coder in that it codes
1 bit at a time, so you only give it one probability, P(next bit = 1),
rather than the probabilities for all the symbols in the alphabet. This
lets you calculate 8 probabilities per byte rather than 256.

The output of fpaq0's coder is about 0.01% larger than an ideal coder
because the probabilties have to be approximated using integer arithmetic.
First, it represents a probability with 12 bits of resolution. Second, when
the range straddles a byte boundary so that no output is possible, the
resolution of the range gets reduced, possibly to 2 in rare cases, at which
point the next probability is forced by integer rounding to 1/2 to remove
the split. This strategy only works when coding 1 bit at a time. The
normal way to handle this problem is to keep a count of pending output bits
(x1 = 011111..., x2 = 100000...). In fact, David Scott modified the coder
to do this in some versions of PAQ, including PAQAR I believe.

-- Matt Mahoney


moogie

2004-10-10, 8:55 pm

>
> it is an interesting idea, but it seems a little like a hack...
>
> it is also unclear what the probability of 2 versions of a character vs just
> one is, and supporting the 2 value version reduces the efficiency of the
> single value case.


i am not sure what you are saying here... efficiency in time or bits?
it will actually increase both!

It is very easy to workout the probablility of the two grouping (or
more) symbols... you just multiply the single symbol probablilities
representing the grouped symbol.

i.e.

A = 0.5
B = 0.25
C = 0.25

so AA = 0.5 * 0.5 = 0.25
AB = 0.5 * .25 = 0.125
CB = .25 * .25 = 0.0625
etc...


This does obviously assume the data to be compressed will be more than
one symbol in length.
cr88192

2004-10-11, 3:55 am


"moogie" <budgetanime@mystarship.com> wrote in message
news:e353ade.0410101445.758f401e@posting.google.com...
>
> i am not sure what you are saying here... efficiency in time or bits?
> it will actually increase both!
>


what I meant is:
A
B
C D E
F G H

now, assume A and B also have a paired encoding, then you might also have:
A B AA
BB C D
F G H

but A is no longer 1 bit, it is now 2 bits (this is a likely case in my
situation given the probability skew).

now, AA takes 2 bits (the same as before), but A also takes 2 bits (an
increase).

I could treat AA as a seperate symbol though and effectively reduce the
probability of A whenever there is an AA, in which case I could end up with
a tree like:
AA
BB
A B C
D E F
G
H

> It is very easy to workout the probablility of the two grouping (or
> more) symbols... you just multiply the single symbol probablilities
> representing the grouped symbol.
>
> i.e.
>
> A = 0.5
> B = 0.25
> C = 0.25
>
> so AA = 0.5 * 0.5 = 0.25
> AB = 0.5 * .25 = 0.125
> CB = .25 * .25 = 0.0625
> etc...
>
>
> This does obviously assume the data to be compressed will be more than
> one symbol in length.


and, also, would be sub-optimal in many cases (AA might actually occure far
more often than a lone A), imo it is better to count the probabilities
explicitly.

it will be a hack, but it could work. will probably have to modify my
encoder to not assume that it is not necissarily working with 8 bit chunks
(this could also allow encoding the end of stream data as a special value if
I so felt like it...).

this could be useful, and it is an interesting possibility.

so, yes, a generally idea.

now, an issue:
I will need to boost the value size in my stored trees, which will boost the
size some, but this is ok so long as I am not compressing really small
chunks.



cr88192

2004-10-11, 3:55 am


"Nils" <nospam@chello.nl> wrote in message
news:Gtcad.4252$rx.1383@amsnews03-serv.chello.com...
> In addition to Moogies reply (which is very good), you can also consider a
> reduced state arithmetic coder. I have been able to find quite some
> articles
> about them. It speeds up the arithmetic coding considerably, because it
> removes the need to do a division for each addition of a symbol.
>
> The trouble though is that most of these are patented.
>

yes, that is pretty lame...
I don't like patented stuff.

ammusingly enough the half working arith coder I wrote still by far beats
out the one I was using before in terms of speed (but seems to compress a
little worse).
eg:
old arith coder 30kB, new arith coder 36kB, current huffman coder 40kB.

I am now considering the paired symbol idea given by moogie, it can get
close at least, and huffman is still a lot easier to actually get working in
my experience (having discovered that my previous problems were related
largely to minor math issues in computing statistics and finding values).
also, for whatever reason if any values get too close to "impossible" then
they can no longer be decoded properly, and there seems to be no good fix
(apart from one of monitoring the probabilities and adjusting them if they
get too skewed, which hurts compression).
huffman involves far less twiddling with math code, and can be optimized
somewhat (consider, ie, using an index for each symbol which tells how many
zero bits you need and a leaf value to encode). similarly, said
optimizations don't involve altering the bitstream.

and so, I have something that is inbetween wrt speed and performance the old
arith coder and my huffman coder. I have a few ideas though for helping my
huffman coder along, which I will try and see if things get any better...



cr88192

2004-10-11, 3:55 am


"Matt Mahoney" <matmahoney@yahoo.com> wrote in message
news:Y_cad.14823$Vm1.6650@newsread3.news.atl.earthlink.net...
>
> "cr88192" <cr88192@remove.hotmail.com> wrote in message
> news:f7%9d.1863$pM5.1337@news.flashnewsgroups.com...
> earlier
> http://groups.google.com/groups?sel...t&output=gplain
>
> Arithmetic coding does better than Huffman when the symbol proabilities
> are
> not powers of 1/2. Howver, for practical cases this only matters when you
> have one or a few symbols with large probabilities. For example, if A, B,
> C
> each have probability 1/3 then a Huffman code might be A=00, B=01, C=1
> giving you an average symbol length of 5/3 = 1.667 bits. Arithmetic
> coding
> would give you log2(3) = 1.585 bits. OTOH, if all your symbol
> probabilities
> were around 0.01, the difference would be very small.
>

this is the kind of data I was dealing with, there is a rather large amount
of skew in the data (eg: the probabilities for a single symbol are around
80%, then 1 and -1 are a remaining 7%, and values outside the range
of -14..14 are approx 0.1%, ...).

> The arithmetic coder in fpaq0 (and most PAQ versions) uses a range
> (x1=low,
> x2=high in the code) but differs from a normal range coder in that it
> codes
> 1 bit at a time, so you only give it one probability, P(next bit = 1),
> rather than the probabilities for all the symbols in the alphabet. This
> lets you calculate 8 probabilities per byte rather than 256.
>
> The output of fpaq0's coder is about 0.01% larger than an ideal coder
> because the probabilties have to be approximated using integer arithmetic.
> First, it represents a probability with 12 bits of resolution. Second,
> when
> the range straddles a byte boundary so that no output is possible, the
> resolution of the range gets reduced, possibly to 2 in rare cases, at
> which
> point the next probability is forced by integer rounding to 1/2 to remove
> the split. This strategy only works when coding 1 bit at a time. The
> normal way to handle this problem is to keep a count of pending output
> bits
> (x1 = 011111..., x2 = 100000...). In fact, David Scott modified the coder
> to do this in some versions of PAQ, including PAQAR I believe.
>

ok.

as for pending output bits: I had before wondered if it would be possible to
remove them with a kludge, eg, since the encoder and decoder are roughly in
sync, the decoder could possibly predict underflow, and thus modify the
bitstream to fake them, but at this point things get less obvious and I am
wondering if that would just break things (since then the code value would
be lot more chaotic which could break reliable decoding...).

I just saw the underflow bits being used, and I used them in my encoder,
though I still don't understand what they are doing, or how half the stuff
is working...
in this way, I can't write an encoder on my own (sort of like inverse
kenetics, I was able to write something by cloning something else, but on my
own I am lost).
I at least have a model now for the structure of an encoder, but not
understanding it is a bit of an impairment.
I understand the general concept, but not much more specific.

I am stupid sometimes...



moogie

2004-10-11, 3:55 pm

> what I meant is:
> A
> B
> C D E
> F G H


I have finding hard what the above symbols mean? is it in a huffman
tree? is it the data to be compressed?
>
> now, assume A and B also have a paired encoding, then you might also have:
> A B AA
> BB C D
> F G H


There is no point in which you will have a combination of single and
paired encoding... it will be either one or the other.

>
> but A is no longer 1 bit, it is now 2 bits (this is a likely case in my
> situation given the probability skew).


As above... there is no "single" 'A' so if the symbol 'AA' takes 1 bit
to encode the effective single 'A' is 0.5 bits.

>
> now, AA takes 2 bits (the same as before), but A also takes 2 bits (an
> increase).


You must be doing something wrong as combining symbols will never
increase the effective bits needed to represent each single symbol.

>
> I could treat AA as a seperate symbol though and effectively reduce the
> probability of A whenever there is an AA, in which case I could end up with
> a tree like:
> AA
> BB
> A B C
> D E F
> G
> H


Again, there is no need to mix paired and un-paired symbols

>
> and, also, would be sub-optimal in many cases (AA might actually occure far
> more often than a lone A), imo it is better to count the probabilities
> explicitly.


how would 'AA' occuring more often than 'A' be suboptimal? However
that is a moot point as single symbol compression and grouped symbol
compression cannot be combined. I am not sure how you would explicitly
count the probabilites for the combined symbols as what you are doing
is still effectively single symbol compression but combining each two
symbols to achieve better compression.

>
> it will be a hack, but it could work. will probably have to modify my
> encoder to not assume that it is not necissarily working with 8 bit chunks
> (this could also allow encoding the end of stream data as a special value if
> I so felt like it...).
>
> this could be useful, and it is an interesting possibility.
>
> so, yes, a generally idea.


Agreed, it may well be a hack to fit it into your existing system. A
simple way to incorporate a paired symbol huffman compression to your
existing system would be to leave your current symbol counting as is.
And to use these counts to create the paired counts. These paired
symbols would be 16 bits. so instead of reading in the stream and
compressing in 8bits you will read and compress in 16 bits.

This will potentially create a huge tree depending on how many symbols
are actually used in the data stream.


>
> now, an issue:
> I will need to boost the value size in my stored trees, which will boost the
> size some, but this is ok so long as I am not compressing really small
> chunks.


Yes, it will increase the size of the tree so it is necessary to make
changes to be able to store the tree.
cr88192

2004-10-11, 3:55 pm


"moogie" <budgetanime@mystarship.com> wrote in message
news:e353ade.0410110450.43edea51@posting.google.com...
>
> I have finding hard what the above symbols mean? is it in a huffman
> tree? is it the data to be compressed?


it is a representation of how I construct huffman trees.

>
> There is no point in which you will have a combination of single and
> paired encoding... it will be either one or the other.
>

err, but there might be a single byte of a value, or maybe more...

>
> As above... there is no "single" 'A' so if the symbol 'AA' takes 1 bit
> to encode the effective single 'A' is 0.5 bits.
>

but, herin lies a problem: there may be runs of only a single A in some
cases, A does not allways exist in pairs.
>
> You must be doing something wrong as combining symbols will never
> increase the effective bits needed to represent each single symbol.
>

I don't really get it then.

>
> Again, there is no need to mix paired and un-paired symbols
>
>
> how would 'AA' occuring more often than 'A' be suboptimal? However
> that is a moot point as single symbol compression and grouped symbol
> compression cannot be combined. I am not sure how you would explicitly
> count the probabilites for the combined symbols as what you are doing
> is still effectively single symbol compression but combining each two
> symbols to achieve better compression.
>

I increased the codespace and gave the compound symbols different codes.
this tended to cause AA to be one of the main symbols, with A closely
following, but the skew is reduced and there is a slight increase in the
compressed results (since now AA takes 1 bit but A takes 3).

AA
A BB B
....

>
> Agreed, it may well be a hack to fit it into your existing system. A
> simple way to incorporate a paired symbol huffman compression to your
> existing system would be to leave your current symbol counting as is.
> And to use these counts to create the paired counts. These paired
> symbols would be 16 bits. so instead of reading in the stream and
> compressing in 8bits you will read and compress in 16 bits.
>
> This will potentially create a huge tree depending on how many symbols
> are actually used in the data stream.
>

oh, now I get it...
you were talking of working in larger units, and not just adding more
symbols to represent the paired cases...

this is a somewhat more signifigant change (err, mostly in that I would have
to double my current compression unit size, eg, to 18, 20, or 24 bits...).
then my huffman trees would become huge and likely elimate any real gain
from tighter packing...

>
>
> Yes, it will increase the size of the tree so it is necessary to make
> changes to be able to store the tree.


I boosted it to 12 bits for now...

now, I have just added a simple lz variant to my rle compressor.
my computer got rebooted by a power bump, which threw my testing...

now I have it working, but now for whatever reason it is killing the
compression (allowing a bigger window makes the compression worse...).
maybe this is related to deflate's reason for using a seperate huffman tree
for the offsets, or for compressing the tables. the size for the tables
looks ok (err, but a little on the large side, eg, a few kB). the data being
emmitted by the rle/lz compresser is somewhat smaller, but in general there
appears to be less skew and a lot more values.

hmm, I could divide the space in half (2 11 bit spaces, each with a seperate
table), or I could add a seperate 12 bit space (making an in-total 13 bit
space). splitting the tables would reduce conflicts between encoding
characters and encoding lz offsets, which could help compression.

this could be interesting...

or something...



Matt Mahoney

2004-10-11, 3:55 pm


"cr88192" <cr88192@remove.hotmail.com> wrote in message
news:N0oad.3087$pM5.1763@news.flashnewsgroups.com...
>
> "Matt Mahoney" <matmahoney@yahoo.com> wrote in message
> news:Y_cad.14823$Vm1.6650@newsread3.news.atl.earthlink.net...
you[color=darkred]
B,[color=darkred]
> this is the kind of data I was dealing with, there is a rather large

amount
> of skew in the data (eg: the probabilities for a single symbol are around
> 80%, then 1 and -1 are a remaining 7%, and values outside the range
> of -14..14 are approx 0.1%, ...).


You could code this efficiently using run length codes for strings of up to
several zeros followed by Huffman coding.

-- Matt Mahoney


cr88192

2004-10-11, 3:55 pm


"Matt Mahoney" <matmahoney@yahoo.com> wrote in message
news:vkyad.246$NX5.18@newsread3.news.atl.earthlink.net...
>
> "cr88192" <cr88192@remove.hotmail.com> wrote in message
> news:N0oad.3087$pM5.1763@news.flashnewsgroups.com...
> you
> B,
> amount
>
> You could code this efficiently using run length codes for strings of up
> to
> several zeros followed by Huffman coding.
>

yes, doing this now.

now the probabilities are less skewed (0 is only about 30% now, after rle
encoding), and the encoder is fairly close in terms of size of the results
to the arithmetic encoder...

I have generally been tweaking things in the encoder for much of the day.



Sponsored Links







Also available: Server administration forum archive | Web Design forum archive | Software forum archive | Hardware reviews archive

Copyright 2008 codecomments.com