For Programmers: Free Programming Magazines  


Home > Archive > Compression > February 2008 > Best existing binary compressor method?









You are viewing an archived Text-only version of the thread. To view this thread in it's original format and/or if you want to reply to this thread please [click here]

 

Author Best existing binary compressor method?
Einstein

2008-02-18, 7:56 am

I was wondering which compression routine is established as the best
binary compression system. Total capacity.

I would like to see how it works so I can see if the different values
and permutations I obtained can launch any results.
Jim Leonard

2008-02-18, 6:56 pm

On Feb 18, 7:35 am, Einstein <michae...@gmail.com> wrote:
> I was wondering which compression routine is established as the best
> binary compression system. Total capacity.
>
> I would like to see how it works so I can see if the different values
> and permutations I obtained can launch any results.


There is no one best compressor. Compression performance depends on
the input and the amount of contextual information/relationships in
the source data.
biject

2008-02-18, 6:56 pm

On Feb 18, 6:35 am, Einstein <michae...@gmail.com> wrote:
> I was wondering which compression routine is established as the best
> binary compression system. Total capacity.
>
> I would like to see how it works so I can see if the different values
> and permutations I obtained can launch any results.


It depends on what you mean by binary. But if you
change the string to ascii 1's and 0's your can use BWTS
to get the sort of the string then use a rle + entropy coder
that works on the resultant string to compress it back to
the space of all strings.

You can bijective convert the string to a 8-bit byte style file and
then use something like my bijective LZW.

If the sting is long use some version of PAQ* since the last stage
is nothing but a 2 state compressor. convert to bytes use PAQ*
then convert to binary.

Actually no best compressor in general. You can only compress
based on assuming the data fits some model. Since no matter what
you do other than identity trying to compress all binary strings.
you will expand more strings than you make smaller its just the
way things are.

David A. Scott
--
My Crypto code
http://bijective.dogma.net/crypto/scott19u.zip
http://www.jim.com/jamesd/Kong/scott19u.zip old version
My Compression code http://bijective.dogma.net/
**TO EMAIL ME drop the roman "five" **
Disclaimer:I am in no way responsible for any of the statements
made in the above text. For all I know I might be drugged.
As a famous person once said "any cryptograhic
system is only as strong as its weakest link"
Industrial One

2008-02-18, 6:56 pm

On Feb 18, 6:35 am, Einstein <michae...@gmail.com> wrote:
> I was wondering which compression routine is established as the best
> binary compression system. Total capacity.
>
> I would like to see how it works so I can see if the different values
> and permutations I obtained can launch any results.


Arithmetic is the best. Ironically, it is the simplest to code, yet
achieves the best compression ratio.

OTOH, it takes XXXXing FOREVER.
Mark Nelson

2008-02-19, 8:01 am

On Feb 18, 6:17=A0pm, Industrial One <industrial_...@hotmail.com> wrote:
> On Feb 18, 6:35 am, Einstein <michae...@gmail.com> wrote:
>
> Arithmetic is the best. Ironically, it is the simplest to code, yet
> achieves the best compression ratio.
>
> OTOH, it takes XXXXing FOREVER.


"Arithmetic" is not a compression routine.

You can make a general statement that data compression is achieved by
combining a model and an encoder. The model looks for redundancy, and
uses the encoder to take advantage of that redundancy. (Yes, you can
nitpick this generalization to death, but it's a useful place to
start.)

Arithmetic coding is used in data compression to encode data very
efficiently, using nearly the optimal number of digits dictated by
entropy of the model. But by itself, it has nothing to do with data
compression, despite the fact that the wikipiedia article says:
"Arithmetic coding is a method for lossless data compression."

You can find arithmetic coding used as an encoder for many different
compression algorithms, and to provide a reasonable answer to this
(unanswerable) question, you need to specify which algorithm you are
suggesting. LZARI? Lossless JPEG? PPMZ? All quite different.

|
| Mark Nelson - http://marknelson.us
|
Industrial One

2008-02-21, 7:56 am

On Feb 19, 6:06 am, Mark Nelson <snorkel...@gmail.com> wrote:
> On Feb 18, 6:17 pm, Industrial One <industrial_...@hotmail.com> wrote:
>
>
>
>
> "Arithmetic" is not a compression routine.


My bad, I meant "Arithmetic coding."

> You can make a general statement that data compression is achieved by
> combining a model and an encoder. The model looks for redundancy, and
> uses the encoder to take advantage of that redundancy. (Yes, you can
> nitpick this generalization to death, but it's a useful place to
> start.)


True.

> Arithmetic coding is used in data compression to encode data very
> efficiently, using nearly the optimal number of digits dictated by
> entropy of the model. But by itself, it has nothing to do with data
> compression, despite the fact that the wikipiedia article says:
> "Arithmetic coding is a method for lossless data compression."


Yup.

> You can find arithmetic coding used as an encoder for many different
> compression algorithms, and to provide a reasonable answer to this
> (unanswerable) question, you need to specify which algorithm you are
> suggesting. LZARI? Lossless JPEG? PPMZ? All quite different.
>
> |
> | Mark Nelson -http://marknelson.us
> |


Eg: source: ABCBAA (6 bytes).
A = 3, B = 2, C = 1.
p(A) = 3 / 6 = 0,50 -> RANGE: [0,00 - less than 0,50] -> LR = 0,00, HR
= 0,50
p(B) = 2 / 6 = 0,33 -> RANGE: [0,50 - less than 0,83] -> LR = 0,50, HR
= 0,83
p(C) = 1 / 6 = 0,16 -> RANGE: [0,83 - less than 1,00] -> LR = 0,83, HR
= 1,00
Encoding:
Define starting variable value: LOW = 0, HIGH = 1
RANGE = HIGH (previous) - LOW (previous)
LOW = LOW (previous) + (RANGE * LR of current symbol)
HIGH = LOW (previous) + (RANGE * HR of current symbol)

The output value is 0,400975.

Decoding:
Requires 5 variables, they are RANGE, LR (Low Range), HR (High Range),
VALUE and RD (Range Different).
1. Output the symbol by determine in which range the VALUE is.
2. Get a new VALUE using this calculation:
RD = HR - LR
VALUE = (VALUE - LR) / RD

Can tweaks be made to significantly improve the ratio this algo would
achieve?
Thomas Richter

2008-02-21, 7:56 am

Industrial One schrieb:

> Eg: source: ABCBAA (6 bytes).


The best way to encode this source is in zero bits. You know the sequence, thus
there is no problem to transmit it.

Note well, an encoder design *never* compresses a specific sequence. It only compresses
on grounds of a specific statistical model. This model has an entropy, and you cannot
compress below this entropy. If the data you feet into the compressor does not fit the
model, you compress worst.

> A = 3, B = 2, C = 1.
> p(A) = 3 / 6 = 0,50 -> RANGE: [0,00 - less than 0,50] -> LR = 0,00, HR
> = 0,50
> p(B) = 2 / 6 = 0,33 -> RANGE: [0,50 - less than 0,83] -> LR = 0,50, HR
> = 0,83
> p(C) = 1 / 6 = 0,16 -> RANGE: [0,83 - less than 1,00] -> LR = 0,83, HR
> = 1,00
> Encoding:
> Define starting variable value: LOW = 0, HIGH = 1
> RANGE = HIGH (previous) - LOW (previous)
> LOW = LOW (previous) + (RANGE * LR of current symbol)
> HIGH = LOW (previous) + (RANGE * HR of current symbol)
>
> The output value is 0,400975.
>
> Decoding:
> Requires 5 variables, they are RANGE, LR (Low Range), HR (High Range),
> VALUE and RD (Range Different).
> 1. Output the symbol by determine in which range the VALUE is.
> 2. Get a new VALUE using this calculation:
> RD = HR - LR
> VALUE = (VALUE - LR) / RD
>
> Can tweaks be made to significantly improve the ratio this algo would
> achieve?


Yes, zero bits. Don't transmit anything, we know the sequence, problem solved.

If your question rather is, given a probability source whose alphabet contains
three codes (a,b,c) and which is a zero-order i.i.d. Markov source with probabilities
given as above, then you cannot compress better in the limit.

Thus, the question rather is: Is your input source indeed a zero order i.i.d.
Markov process?

So long,
Thomas


biject

2008-02-21, 6:56 pm

On Feb 21, 6:22 am, Thomas Richter <t...@math.tu-berlin.de> wrote:
> Industrial One schrieb:
>
>
> The best way to encode this source is in zero bits. You know the sequence, thus
> there is no problem to transmit it.
>
> Note well, an encoder design *never* compresses a specific sequence. It only compresses
> on grounds of a specific statistical model. This model has an entropy, and you cannot
> compress below this entropy. If the data you feet into the compressor does not fit the
> model, you compress worst.
>
>
>
>
>
>
>
> Yes, zero bits. Don't transmit anything, we know the sequence, problem solved.
>
> If your question rather is, given a probability source whose alphabet contains
> three codes (a,b,c) and which is a zero-order i.i.d. Markov source with probabilities
> given as above, then you cannot compress better in the limit.
>
> Thus, the question rather is: Is your input source indeed a zero order i.i.d.
> Markov process?
>
> So long,
> Thomas


Actually you over looked one thing. If you have an I.I.d. you would
tend to compress
close to the entopy. But its a statistical thing. Some strings of ABC
will compress
right on some will be smaller some longer? if you use arithmetic
cmpression you
get closer than huffman to the entropy on average. But in either case
some will
compress smaller than the entropy and some longer. That is assuming
you do a
proper bijective style of entropy compression.

David A. Scott
--
My Crypto code
http://bijective.dogma.net/crypto/scott19u.zip
http://www.jim.com/jamesd/Kong/scott19u.zip old version
My Compression code http://bijective.dogma.net/
**TO EMAIL ME drop the roman "five" **
Disclaimer:I am in no way responsible for any of the statements
made in the above text. For all I know I might be drugged.
As a famous person once said "any cryptograhic
system is only as strong as its weakest link"

Matt Mahoney

2008-02-21, 6:56 pm

On Feb 18, 8:35 am, Einstein <michae...@gmail.com> wrote:
> I was wondering which compression routine is established as the best
> binary compression system. Total capacity.
>
> I would like to see how it works so I can see if the different values
> and permutations I obtained can launch any results.


The absolute best algorithm is to try running all possible programs of
length 0, 1, 2... until you find one that outputs a file identical to
the one you want to compress. Then send that program.

Some of the programs may run for a long time. If a program goes into
an infinite loop, you can kill it and go to the next one. If you
aren't sure, then let it run for 2^n clocks, where n is the number of
bits of memory in your computer. If it doesn't stop by then, then you
can be sure it never will and you can kill it.

This is provably the best you can do, but it may be rather slow.

-- Matt Mahoney
Einstein

2008-02-22, 3:56 am

I have never been worried about the time issues involved, just the
absolute best for near random binary compression
Thomas Richter

2008-02-22, 3:56 am

biject schrieb:
>
> Actually you over looked one thing. If you have an I.I.d. you would
> tend to compress
> close to the entopy. But its a statistical thing. Some strings of ABC
> will compress
> right on some will be smaller some longer?


It's a statistical statement I'm making, not an absolute one. Given the
statistics described as above, then the *expected* number of bits per
sample will approach the entropy in the limit of infinite input streams
on average.

> if you use arithmetic
> cmpression you
> get closer than huffman to the entropy on average.


That is correct, but only due to the limitation of huffman not
being able assign symbols a fractional number of bits. IOW, as long as
the probabilities of the source alphabet are inverse powers of two, there
is no difference between huffman and arithmetic in the limit.

> But in either case
> some will
> compress smaller than the entropy and some longer. That is assuming
> you do a
> proper bijective style of entropy compression.


"Bijective" is completely irrelevant for this topic. There is absolutely
nothing in the proof that requires this.

So long,
Thomas
Thomas Richter

2008-02-22, 3:56 am

Matt Mahoney schrieb:
> On Feb 18, 8:35 am, Einstein <michae...@gmail.com> wrote:
>
> The absolute best algorithm is to try running all possible programs of
> length 0, 1, 2... until you find one that outputs a file identical to
> the one you want to compress. Then send that program.
>
> Some of the programs may run for a long time. If a program goes into
> an infinite loop, you can kill it and go to the next one. If you
> aren't sure, then let it run for 2^n clocks, where n is the number of
> bits of memory in your computer. If it doesn't stop by then, then you
> can be sure it never will and you can kill it.


To be picky, "n" should also include the internal state of the CPU
performing the action. (-; Thus, n is a bit larger.

> This is provably the best you can do, but it may be rather slow.


I like this "rather slow" (-:

So long,
Thomas
Mark Nelson

2008-02-22, 7:56 am

On Feb 21, 4:41=A0pm, Matt Mahoney <matmaho...@yahoo.com> wrote:
> On Feb 18, 8:35 am, Einstein <michae...@gmail.com> wrote:
>
> The absolute best algorithm is to try running all possible programs of
> length 0, 1, 2... until you find one that outputs a file identical to
> the one you want to compress. =A0Then send that program.
>
> Some of the programs may run for a long time. =A0If a program goes into
> an infinite loop, you can kill it and go to the next one. =A0If you


But what if the optimal program takes a long, long, long time, and you
kill it prematurely?

Surely we can devise an algorithm that will analyse the program to
determine whether it is going to halt or not. After all, this is
comp.compression.

|
| Mark Nelson - http://marknelson.us
|
Thomas Richter

2008-02-22, 9:56 pm

Mark Nelson schrieb:
> On Feb 21, 4:41 pm, Matt Mahoney <matmaho...@yahoo.com> wrote:
>
> But what if the optimal program takes a long, long, long time, and you
> kill it prematurely?


You can't, according to Matt's computation. If the internal state of the machine
has n bits (counting CPU registers, hidden flip-flops in the CPU, all bits in the ram and so on),
then provably after 2^n clock cycles, the machine has to enter a state where it has been before.

Unfortunately, even for moderate n, this is a rather large number. (-:

> Surely we can devise an algorithm that will analyse the program to
> determine whether it is going to halt or not. After all, this is
> comp.compression.


One can. But again, provably, one cannot be sure whether the program that
analyzes whether a Turing machine stops, stops. (-: Luckely, today's machines have
a finite number of internal states, thus...

So long,
Thomas
biject

2008-02-22, 9:57 pm

On Feb 22, 12:51 am, Thomas Richter <t...@math.tu-berlin.de> wrote:
> biject schrieb:



>
> That is correct, but only due to the limitation of huffman not
> being able assign symbols a fractional number of bits. IOW, as long as
> the probabilities of the source alphabet are inverse powers of two, there
> is no difference between huffman and arithmetic in the limit.
>


Actually that is not even exactly true. If you ignore the tables
and
if doing static, huffman or arihmetic then the statement above is
true.

But when doing adaptive on finite files its much more complex and
often even when the powers at end of file settle down to nice powers
of two the huffman can win.

In fact its possibe to have a source where the it has exactly powers
of two
for the symbols. The fuffman settles down at some point ann compresses
perfectly, While on the adaptive its never settles down and the longer
the
file being compressed the bigger the error.

>
> "Bijective" is completely irrelevant for this topic. There is absolutely
> nothing in the proof that requires this.
>


Actually it doe have something to do with it when you trying to
compress to the entropy limit if your not compressing bijectively
in either huffman or airhmetic you end up with longer compressed
files which is what is trying to be avoided when try to compress to
entropy limits so its not irrelevant for this topic in fact it the
opposite,
David A. Scott
--
My Crypto code
http://bijective.dogma.net/crypto/scott19u.zip
http://www.jim.com/jamesd/Kong/scott19u.zip old version
My Compression code http://bijective.dogma.net/
**TO EMAIL ME drop the roman "five" **
Disclaimer:I am in no way responsible for any of the statements
made in the above text. For all I know I might be drugged.
As a famous person once said "any cryptograhic
system is only as strong as its weakest link



Phil Carmody

2008-02-22, 9:57 pm

Mark Nelson <snorkelman@gmail.com> writes:
> On Feb 21, 4:41_pm, Matt Mahoney <matmaho...@yahoo.com> wrote:
>
> But what if the optimal program takes a long, long, long time, and you
> kill it prematurely?


Just 'long time'

> Surely we can devise an algorithm that will analyse the program to
> determine whether it is going to halt or not. After all, this is
> comp.compression.


Just 'comp.ression'

FP
--
Dear aunt, let's set so double the killer delete select all.
-- Microsoft voice recognition live demonstration
Stefano Brocchi

2008-02-23, 7:56 am

> Surely we can devise an algorithm that will analyse the program to
> determine whether it is going to halt or not. After all, this is
> comp.compression.


For how it could seem strange, it has been proven that that no program
that analyses any other program and always gives a result 'it will
halt' or 'it will not halt' can exist.
It is a main result of computability theory.

So long,
Stefano
Willem

2008-02-23, 9:58 pm

Stefano wrote:
) Mark wrote:
)> Surely we can devise an algorithm that will analyse the program to
)> determine whether it is going to halt or not. After all, this is
)> comp.compression.
)
) For how it could seem strange, it has been proven that that no program
) that analyses any other program and always gives a result 'it will
) halt' or 'it will not halt' can exist.
) It is a main result of computability theory.

I think your sarcasm detectors need a bit of tuning.


SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
Thomas Richter

2008-02-24, 7:56 am

biject schrieb:
> On Feb 22, 12:51 am, Thomas Richter <t...@math.tu-berlin.de> wrote:
>
>
>
> Actually that is not even exactly true. If you ignore the tables
> and
> if doing static, huffman or arihmetic then the statement above is
> true.


Note that I'm talking about a zero-order i.i.d. Markov process. There is
absolutely no adaption needed here.

> Actually it doe have something to do with it when you trying to
> compress to the entropy limit if your not compressing bijectively
> in either huffman or airhmetic you end up with longer compressed
> files


The "final file length" I'm talking about is infinite. No, it's not
longer. It's exactly as long, namely infinite. The proper question is
on the average number of bits per symbol, and this *will* approach
the entropy on AC in the limit. As I already said above, I'm making
statistical statements and whether a code is bijective or not is completely
irrelevant for this result.

So long,
Thomas

Einstein

2008-02-24, 7:56 am

Ok after looking at both of them (methods), seems I have already
pretty much 'remade' both methods. I do not see any extra advantages
gained in either atm, so neither method seems to be really effective
for me :(
Mark Nelson

2008-02-24, 6:56 pm

On Feb 23, 7:46=A0am, Willem <wil...@stack.nl> wrote:
>
> I think your sarcasm detectors need a bit of tuning.
>


The perpetual problem with text-only communications...
|
| Mark Nelson - http://marknelson.us
|
Stefano Brocchi

2008-02-25, 7:56 am

> I think your sarcasm detectors need a bit of tuning.

Ok, in fact I read the thread a bit hastily and I didn't catch the
context of the discussion... I promise I'll be more 'tuned' next
time :)

So long,
Stefano
Sponsored Links







Also available: Server administration forum archive | Web Design forum archive | Software forum archive | Hardware reviews archive

Copyright 2008 codecomments.com