For Programmers: Free Programming Magazines  


Home > Archive > Compression > February 2008 > Effective Lossy Compression of Bitstream









You are viewing an archived Text-only version of the thread. To view this thread in it's original format and/or if you want to reply to this thread please [click here]

 

Author Effective Lossy Compression of Bitstream
Arsène Lupin

2008-02-22, 7:56 am


Hi.

I want to compress some data in binary format with a lossy algorithm.
The data is used to represent the behavior of my program, so even if
there's some loss of info the program can recover itself. Is there any
kind os lossy that is "tuned" for binary files? What's the easiest one
to implement?

Thanks for any tips and help.

Ars=E8ne
Thomas Richter

2008-02-22, 7:56 am

Arsène Lupin schrieb:
> Hi.
>
> I want to compress some data in binary format with a lossy algorithm.
> The data is used to represent the behavior of my program, so even if
> there's some loss of info the program can recover itself. Is there any
> kind os lossy that is "tuned" for binary files? What's the easiest one
> to implement?


The easiest is 'rm', compressing lossy to zero bytes. (-:

In all other cases, you would need the definition of a "distortion" that
measures the amount of loss. For example, is the "Hamming Distance" for
your application a suitable distortion metric?

As for all lossy compression algorithms, you do not get a "compression
rate", you rather get a rate-distortion curve identifying the amount of
distortion you get for a given output rate (or vice-versa,the output
rate if you tolerate a given distortion).

Depending on your program, I can only guess what a good distortion would
be, but given that you talk about "reconstructing" the missing part of
information, I can only assume that your initial data is redundant and
you are not looking "really" for lossless compression, but rather for a
way that exploits this redundancy. That, however, is then a lossless
compression algorithm all the way (namely, if you can *always*
reconstruct this information). Otherwise, you have to tolerate data loss.

IOW, are you sure you can *always* reconstruct the missing data? If so,
we are in the lossless data compression domain of a redundant source,
which is a different business. (Note that if your source has no
rendundancy built-in, then there is no way of reconstructing this
missing piece of information and you always have loss. Can you tolerate
this?)

So long,
Thomas
George Johnson

2008-02-22, 9:56 pm

"Arsène Lupin" <detentor@gmail.com> wrote in message
news:dc3793ac-ad94-4659-af5b-ba29f6fa0aa7@q70g2000hsb.googlegroups.com...
|
| Hi.
|
| I want to compress some data in binary format with a lossy algorithm.
| The data is used to represent the behavior of my program, so even if
| there's some loss of info the program can recover itself. Is there any
| kind os lossy that is "tuned" for binary files? What's the easiest one
| to implement?
|
| Thanks for any tips and help.
|
| Arsène

In simpler terms, ROUNDING ERRORS.
If you've got a stream of 12 digit decimal numbers and you can round
down to 3 decimal digits without unacceptable loss, then you've achieved a
good lossy compression without unacceptable data loss. It won't be an
accurate representation of the original data set of 12 digit decimal
numbers, but can be good enough if "good enough" is tolerable.

The bigger question is "What can you afford to lose?"
Any smart gambler never bets more than they are willing to lose.

Another form of cheap "lossy compression" is using error-correcting
codes (logically, if you're here asking for basic info on compression, to
some degree I have lost you as a reader). In short, run your data chunks
through a error-correcting coder and then figure out what you can clip off
and still have acceptably recoverable data.

The last and most final important question to know is, "What are you
compressing and what do you consider NOISE?"
Is it minimally-noisy numeric streams that can be represented as points
on a 2D graph much more efficiently as an equation or spline graph?
Is it audio data that can be ran through a MP3 compressor?
Is it simple program variable time-change list like a LOG file?

The reason is that there are as many "fine-tuned for a particular data
set type" compressors out there as there are equations in an average Algebra
book.


Arsène Lupin

2008-02-22, 9:57 pm


Thanks for the posts.

>IOW, are you sure you can *always* reconstruct the missing data? If so,
>we are in the lossless data compression domain of a redundant source,
>which is a different business. (Note that if your source has no
>rendundancy built-in, then there is no way of reconstructing this
>missing piece of information and you always have loss. Can you tolerate
>this?)


>The last and most final important question to know is, "What are you
> compressing and what do you consider NOISE?"


As I stated previously, I'm coding a program behavior. When the
program processes it's data (which is a calculation), it can "change"
it's state. So, when the program encounter an "1", it change its way
of processing the data. When I encounter the zero, it continues
processing the same way as before.

I told the program can recover the information because when some
things happen while processing the data, the program can identificate
that it has to change it's state, but it'll be one step AFTER it
should, so the processed data get's processed not the way it should
(the calculation get's rounded).

Another way to see the problem could be (as I think of lossy
compression): given a string consisting of 1's and 0's, in which the
probability of a 1 if 30%, if the entropy of the given string is x,
and you set the wanted size to x/2, what would be the reliability of
the string (i.e. what would be the number of 1's encountered)?

Is there a math process that given those numbers it returns a function
that generates that sequence? Eventually, the sequence can lose some
information in the process (the noise of the data). I wanted to know
if the size to transmit this sequence is lesser than the entropy of
the string itself.

If I wanted manually induce errors, I could, by, let's say, reducing
by half the number of 1's (transmitting only one at each two). But
that doesn't solve the problem that I wanted to solve (finding a
function to maximize the information).

Thanks for the time.

Ars=E8ne.


Thomas Richter

2008-02-23, 7:56 am

Arsène Lupin schrieb:
> Thanks for the posts.
>
>
>
> As I stated previously, I'm coding a program behavior. When the
> program processes it's data (which is a calculation), it can "change"
> it's state. So, when the program encounter an "1", it change its way
> of processing the data. When I encounter the zero, it continues
> processing the same way as before.


I don't understand. Is the stream (to be compressed in a lossy way) the *input*
to your program, or is it the output where you can recover simply be re-doing
the missing computation? If the former is the case, then not compression
losslessy will cause a change in the program behavior, and you're in a mess. If
the latter is the case, you can compress to zero bytes by re-doing the
computation all the way, so where's the problem?

> I told the program can recover the information because when some
> things happen while processing the data, the program can identificate
> that it has to change it's state, but it'll be one step AFTER it
> should, so the processed data get's processed not the way it should
> (the calculation get's rounded).


And you do know that this will have an isolated effect on the output of
the program? If so, try to understand how to relate the error generated by
the program to the error in the input. As vague as you are, I do no see
how to give any better advice.

Examples would be if the program is an orthogonal transform, and the error
is measured as mean square error of the values, then the input error is
exactly equal to the output error. A second example is that the program is
a non-orthogonal transform and the error is as above. Then, if one can
reasonably assume that the input is decorrelated and has zero mean, the output
error is given by the input error times the l^2-norm of the impulse response
of the transform.

But, again, as vague as you are, there is nothing that can be answered.

> Another way to see the problem could be (as I think of lossy
> compression): given a string consisting of 1's and 0's, in which the
> probability of a 1 if 30%, if the entropy of the given string is x,


Pardon me, *which* entropy? Zero order? If so, then x is given by
0.3 * log(0.3) + 0.7 * log(0.7) and there is no ambiguity.

> and you set the wanted size to x/2, what would be the reliability of
> the string (i.e. what would be the number of 1's encountered)?


What would be your error definition? You can look into Shanon's theorem,
stating that if the channel capacity is given (your x/2) you can only
transmit data of entropy below the capacity. Otherwise, you get errors.
Since x/2 < x, yes, you do get errors. Now, again, what is your
definition of "reliability"?

For a zero-order process, you can compute what the probabilities would have
to look like to get an entropy x/2. Given the input probabilities of the
symbols, take a random generator that flips 1 to 0 such that the probabilities
of the output sequence are as needed, and then compress this sequence.

> Is there a math process that given those numbers it returns a function
> that generates that sequence? Eventually, the sequence can lose some
> information in the process (the noise of the data). I wanted to know
> if the size to transmit this sequence is lesser than the entropy of
> the string itself.
>
> If I wanted manually induce errors, I could, by, let's say, reducing
> by half the number of 1's (transmitting only one at each two). But
> that doesn't solve the problem that I wanted to solve (finding a
> function to maximize the information).


IOW, you want to generate a sequence such that the conditional entropy of
the generated sequence given the original is maximal, and the entropy of
the generated sequence is bounded by a threshold?

So long,
Thomas
Arsène Lupin

2008-02-23, 7:56 am


Hi, Thomas.

Simplifying, I have a bitstring of ones and zeros, which the
probability of a one is about 30%.
I want to compress lossy this bitstring, and the process can erase
some 1's, but it cannot add any ones.

> IOW, you want to generate a sequence such that the conditional entropy of
> the generated sequence given the original is maximal, and the entropy of
> the generated sequence is bounded by a threshold?


IOW, yes. I wanted a function to "transform" the original sequence, in
a way that the new sequence could be better compressed, while
retaining as much reliability (as I measure by the relation of the
number of ones in the original bitstring and the compressed one) as
possible.

As an example, if the most ones follow a exponential function (ex:
x^2), then I could transmit the function to recreate the sequence. Of
course maybe not all '1's of the original could be recovered, but if
the size to transmit the function is small, then the gain would
justify it's use.

Really sorry for not making myself much clear, to me it seemed an easy
problem.

Ars=E8ne.
George Johnson

2008-02-23, 6:56 pm

Arsène Lupin wrote:
| Hi, Thomas.
|
| Simplifying, I have a bitstring of ones and zeros, which the
| probability of a one is about 30%.
| I want to compress lossy this bitstring, and the process can erase
| some 1's, but it cannot add any ones.
|
|> IOW, you want to generate a sequence such that the conditional
|> entropy of the generated sequence given the original is maximal, and
|> the entropy of the generated sequence is bounded by a threshold?
|
| IOW, yes. I wanted a function to "transform" the original sequence, in
| a way that the new sequence could be better compressed, while
| retaining as much reliability (as I measure by the relation of the
| number of ones in the original bitstring and the compressed one) as
| possible.
|
| As an example, if the most ones follow a exponential function (ex:
| x^2), then I could transmit the function to recreate the sequence. Of
| course maybe not all '1's of the original could be recovered, but if
| the size to transmit the function is small, then the gain would
| justify it's use.
|
| Really sorry for not making myself much clear, to me it seemed an easy
| problem.
|
| Arsène.

Okay, now to get to the core of your desire.
How many bits per chunk of data stream? Is it variable in size or not
counted as important?

You state that the odds of "1's" appearing is 30% and no "1's" can be
added (per individual data chunk or for the meta-bit location of your data
stream?).
You also want parity on the number of "1's" in the uncompressed and
compressed streams.

To me, it sounds basically like you want an error-correction routine
with an emphasis on bit limiting per chunk.
The downside is that most error-correcting codes ADD DATA to the stream.
My basic suggestion is to think your problem through a bit more.

I could go on about some useful probability coding schemes, but without
more information about what you consider "acceptable noise loss", I would
just be wasting my time. If there are specific bits in your data chunks you
can throw away, then do it (while counting the "1's") and generate a random
data (30% "1's") set to replace it upon inaccurate data set recreation with
a limitation of using only the number of "1" bits that equal your prestored
counted amount.

There are ways to measure "noise density" to gauge an even higher
probability of recreating an imitation of your original data set, but that
also requires a USENET page or two of information to go over the basics.
Heck, I could even toss out some pseudo-random coding techniques to give you
an accurate high-parity match of your original data stream, but without
knowing what you can toss, why you are coding it (polling data looks like a
high match for your data set as a guess), and what you consider "noise" in a
more accurate manner then there is simply no point in throwing time and
energy at solving your problems for my own unpaid amusement.

Basic Error-Correcting Code info
http://en.wikipedia.org/wiki/Error_..._and_correction

If your data stream is fairly consistent, then difference coding between
nearly identical chunks would be optimal.
If your data stream is fairly random, then there is a potential bonus to
code by relative locational ID's (if the data chunk is identical to the one
that is back 16 chunks ago, then just point to it).
If your data has a tendency to be identical for one bit location, then
it might be more optimal to code each bit location in your chunks as an
individual stream which is then Hauffman or Aritmitic-Coded using standard
compression routines.


George Johnson

2008-02-24, 10:03 pm

Arsène Lupin wrote:
| Hi.
|
| I want to compress some data in binary format with a lossy algorithm.
| The data is used to represent the behavior of my program, so even if
| there's some loss of info the program can recover itself. Is there any
| kind os lossy that is "tuned" for binary files? What's the easiest one
| to implement?
|
| Thanks for any tips and help.
|
| Arsène

Now that I think about it (given your description), are you trying to
fudge up polling data or actual voting data sets?

You do know that this is a violation of Federal Law and a crime equal to
many years in a Federal Prison?


Arsène Lupin

2008-02-25, 7:56 am


> =A0 =A0 Now that I think about it (given your description), are you trying=

to
> fudge up polling data or actual voting data sets?
>
> =A0 =A0 You do know that this is a violation of Federal Law and a crime eq=

ual to
> many years in a Federal Prison?


LOL. You couldn't be more wrong.
By the way, I'm not american. :)

Ars=E8ne
Sponsored Links







Also available: Server administration forum archive | Web Design forum archive | Software forum archive | Hardware reviews archive

Copyright 2008 codecomments.com