For Programmers: Free Programming Magazines  


Home > Archive > Compression > March 2007 > Suggestions for compression algorithm please









You are viewing an archived Text-only version of the thread. To view this thread in it's original format and/or if you want to reply to this thread please [click here]

 

Author Suggestions for compression algorithm please
john.leavey@gmail.com

2007-03-30, 3:55 am

My application builds compound data files that consist of a number of
blocks of data. Each block consists of about 50-300 items - either 4
byte integers or 4 or 8 byte floats (the blocks are of a uniform size
and type for a particular file).

Can someone please suggest an appropriate methodology for compressing
each block of data prior to storing in the file - the goal is to
optimize compression ratio and decompression time, compression time is
less critical.

Thanks

John Leavey

Willem

2007-03-30, 3:55 am

john.leavey@gmail.com wrote:
) My application builds compound data files that consist of a number of
) blocks of data. Each block consists of about 50-300 items - either 4
) byte integers or 4 or 8 byte floats (the blocks are of a uniform size
) and type for a particular file).
)
) Can someone please suggest an appropriate methodology for compressing
) each block of data prior to storing in the file - the goal is to
) optimize compression ratio and decompression time, compression time is
) less critical.

How is the data correlated ?
Are the numbers related to each other ?
Are they distributed evenly ?


SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
cr88192

2007-03-30, 9:55 pm


<john.leavey@gmail.com> wrote in message
news:1175242817.718793.119430@o5g2000hsb.googlegroups.com...
> My application builds compound data files that consist of a number of
> blocks of data. Each block consists of about 50-300 items - either 4
> byte integers or 4 or 8 byte floats (the blocks are of a uniform size
> and type for a particular file).
>
> Can someone please suggest an appropriate methodology for compressing
> each block of data prior to storing in the file - the goal is to
> optimize compression ratio and decompression time, compression time is
> less critical.
>


depends on a lot of factors (any patterns in the data, ...).


a few possible options come to mind:
for a very simple compressor for integers (assumably biased twards 0), one
can employ a VLI scheme.

in such a scheme, values within a small range, say 0..127 or -64..63, fit
within a single byte, 0..16383 or -8192..8191 within 2 bytes, ...

floats are harder. one possibility would require a bitwise coding (such as
arithmetic or huffman), and storing a variably truncated mantissa.

an example of this would be to huffman code the sign and exponent.
the mantissa could be coded as a chain of bits terminating under some
terminal condition, such as all the bits being encoded, or a certain length
run of 0s or 1s.


maybe also possible is to use a generic compressor, such as deflate, and see
how that works (if the files compress acceptably, say, with gzip, then
deflate may be worth looking into).


> Thanks
>
> John Leavey
>



Josiah Carlson

2007-03-30, 9:55 pm

john.leavey@gmail.com wrote:
> My application builds compound data files that consist of a number of
> blocks of data. Each block consists of about 50-300 items - either 4
> byte integers or 4 or 8 byte floats (the blocks are of a uniform size
> and type for a particular file).
>
> Can someone please suggest an appropriate methodology for compressing
> each block of data prior to storing in the file - the goal is to
> optimize compression ratio and decompression time, compression time is
> less critical.


Generally, the BWT and related transforms seem to work pretty well for
fixed byte width formats when you don't know in advance the size of the
structures. Using the bzip2 implementation of BWT may or may not be
productive - it depends on how quickly it is able to adapt to the
relatively small amount of data you are trying to compress.

For 4 byte integers and floats (which are 1 sign bit, 7 exponent bits,
and 24 mantissa bits), I've found that borrowing ideas from database and
RGB image compression works well. Separate the component parts of the
values, put them all in sequence, then compress them. That is, if your
file has two 4-byte integers (or 4 byte floats):
10ab7d00a993f200
break that into 4 2-byte sequences:
10a9, ab93, 7df2, 0000
then compress with your favorite entropy coder (deflate and bzip2 are
good after the above transformation).

For data in which there is nontrivial clustering of values in
4-dimensional (or 8-dimensional) space, the above method will work quite
well (see the GEO file of the calgary corpus for an example of this on
4-byte floats).

8 byte floats are a bit different, as they have 1 bit of sign, and 10
bits of exponent (in IEEE 764 fp doubles). You could try to do a
similar thing to the above, or you could try to use a bitwise BWT (there
is a name for this, but I can't remember off the top of my head). The
standard BWT could work ok, but it will depend on your data.

- Josiah
Sponsored Links







Also available: Server administration forum archive | Web Design forum archive | Software forum archive | Hardware reviews archive

Copyright 2008 codecomments.com