For Programmers: Free Programming Magazines  


Home > Archive > Compression > November 2005 > compression of gaussian distribution?









You are viewing an archived Text-only version of the thread. To view this thread in it's original format and/or if you want to reply to this thread please [click here]

 

Author compression of gaussian distribution?
budgetanime@mystarship.com

2005-11-10, 6:55 pm

Hi,

I have a set of data points stored as 32 bit integers. The data points
form a gaussian distribution between 0 and some value. There is no
correlation between consequtive data points. The higest value depends
on parameters to the algorithm which generates the data points. The
number of data points depends on input to the algorithm.

I am able to find the mean and standard deviation.

>From this information is it possible to create a compression algorithm

will compress these data points?

A typical situation is that there is 6000 data points which range from
0 to 2^18.

Nils

2005-11-10, 6:55 pm

Some counter-questions:

- Does the compression have to be exact, or is some (very small) deviation
between exact and compressed value allowed?
If it doesn't have to be 100% exact, you could do some things:
1. Transform the distribution so it is no longer gaussian, but approximately
linear
2. Quantise the values

- Is the order of values important?
If the order is not important, you can sort the values, and just code the
difference of previous vs current value (delta). Combined with quantisation
(see pt 2 above) this can bring considerable compression.

Nils

<budgetanime@mystarship.com> schreef in bericht
news:1131660187.425238.156640@g44g2000cwa.googlegroups.com...
> Hi,
>
> I have a set of data points stored as 32 bit integers. The data points
> form a gaussian distribution between 0 and some value. There is no
> correlation between consequtive data points. The higest value depends
> on parameters to the algorithm which generates the data points. The
> number of data points depends on input to the algorithm.
>
> I am able to find the mean and standard deviation.
>
> will compress these data points?
>
> A typical situation is that there is 6000 data points which range from
> 0 to 2^18.
>



budgetanime@mystarship.com

2005-11-10, 6:55 pm


Nils wrote:
> Some counter-questions:
>
> - Does the compression have to be exact, or is some (very small) deviation
> between exact and compressed value allowed?


Unfortunetly lossless compression is needed, these data points are used
further on in a very input sensative algoritm.

> - Is the order of values important?


Likewise the order is important...

I am not too optimistic that compression is achieveable but thought i
would ask.

My previous typical data is not correct, typically there are 3500000
odd data points.

Matt Mahoney

2005-11-10, 9:55 pm

budgetanime@mystarship.com wrote:
> Nils wrote:
>
> Unfortunetly lossless compression is needed, these data points are used
> further on in a very input sensative algoritm.
>
>
> Likewise the order is important...
>
> I am not too optimistic that compression is achieveable but thought i
> would ask.
>
> My previous typical data is not correct, typically there are 3500000
> odd data points.


1, Map the data points to a linear distribution in [0,1] using a Z
table, for example, mean = 0.5, mean + 1 stdev = 0.84, etc. You can
use a piecewise linear approximation to compute the Z function.

2. Code each mapped number in binary using just enough bits to
distinguish it from the values before and after it.

This code wastes at most one bit per data point. If you can't afford
this, you could replace step 2 with an arithmetic code.

-- Matt Mahoney

Jani Huhtanen

2005-11-11, 7:55 am

budgetanime@mystarship.com wrote:

> Hi,
>
> I have a set of data points stored as 32 bit integers. The data points
> form a gaussian distribution between 0 and some value. There is no
> correlation between consequtive data points. The higest value depends
> on parameters to the algorithm which generates the data points. The
> number of data points depends on input to the algorithm.
>
> I am able to find the mean and standard deviation.
>
> will compress these data points?
>
> A typical situation is that there is 6000 data points which range from
> 0 to 2^18.


Huffman would do it, but the immense amount of required codes will probably
render this approach useless. Altough you could store the N lower bits
straight and code the 32-N bits with Huffman. This way the number of
required codewords is lower, but so is the compression.

Perhaps you could use some parameterizable variable length coding similar to
Golomb-codes. Golomb-codes are optimal for exponential distributions, but
there might be similar codes for gaussian distributions.

--
----
Jani Huhtanen
Tampere University of Technology, Pori
budgetanime@mystarship.com

2005-11-13, 6:55 pm

Sorry for the late reply, i was unable to access the internet this
wend.

I will attempt the method you describe. Thanks!

budgetanime@mystarship.com

2005-11-13, 6:55 pm

Yes, golomb-rice codes were my first thought, however as you also
mentioned, they model exponetinal distributions...

hmm... maybe hybrid could work...

attempt to obtain the equation of the curve for the distribution and
then using the function find the x value which gives the result. Record
this x value as an integer approximation and then use huffman to code
the difference between the approximate value and the actual value.

lets say the range is 2^16

and lets say the integer approximation is 8 bits. This gives
compartmentalizes the range into 256 buckets.

then record the difference between the aproximated value and the actual
value. and Huffman Code this difference.

hmm... it might even be benefital to huffman code the integer
approximations.

Sponsored Links







Also available: Server administration forum archive | Web Design forum archive | Software forum archive | Hardware reviews archive

Copyright 2008 codecomments.com