Home > Archive > Compression > November 2004 > Compression with specified output-size.
You are viewing an archived Text-only version of the thread.
To view this thread in it's original format and/or if you want to reply to
this thread please [click here]
| Author |
Compression with specified output-size.
|
|
| Bram Stolk 2004-11-17, 3:55 pm |
| Hello,
Is there a compression library that automatically
stops when a specified output-size has been reached?
Normally, when compressing, e.g. with zlib, you
specify the size of the source buffer, and zlib
will tell you how large the output is.
I want to turn this around: I specify the output
size, and the compression lib tells me how much
of the input data it could compress into the
destination buffer.
I think this is related to the issue of compressing
streams versus compressing files?
Bram
| |
| Jim Leonard 2004-11-18, 3:55 pm |
| Bram Stolk <bram@geenspam.sara.nl> wrote in message news:<1100707491.177591@blaat.sara.nl>...
> Normally, when compressing, e.g. with zlib, you
> specify the size of the source buffer, and zlib
> will tell you how large the output is.
>
> I want to turn this around: I specify the output
> size, and the compression lib tells me how much
> of the input data it could compress into the
> destination buffer.
This isn't normally possible because actual compression of the source
has to occur before you know what the output is. Instead of trying to
answer your question on how to do it, could you tell us *why* you want
to do this?
| |
| Thomas Richter 2004-11-18, 3:55 pm |
| Hi,
> Is there a compression library that automatically
> stops when a specified output-size has been reached?
Depends a lot on what you're compressing and how. If you're talking
about "generic" lossless file compression, then this question doesn't
really make a lot of sense. If you truncate the compressed output
before everything has been compressed, you'd loose data and the
ability to reconstruct lossless. If you'd generate files longer
than necessary, you could have done better. Thus, there's no
lossless compressor for which this subject makes sense.
For lossy compression: Yes, clearly. This concept of defining
a target quality to define a compression ratio is one of the
building-stones of modern rate-allocation mechanisms as they
are used in some codec designs, for example in the EZW (Shapiro
Zero-Tree) or EBCOT (Embedded bit stream coding by truncation)
algorithms. The goal of all these algorithms is to grant
optimal quality for given output file size, i.e. "optimization
under constraints". They are used in compression standards like
JPEG2000 (not in traditional JPEG, though).
> Normally, when compressing, e.g. with zlib, you
> specify the size of the source buffer, and zlib
> will tell you how large the output is.
> I want to turn this around: I specify the output
> size, and the compression lib tells me how much
> of the input data it could compress into the
> destination buffer.
Unless you give me the type of data you'd like to compress, I can't
give a definite answer. For "general purpose" compressors there can't
be no such thing (except a greedy algorithm that just tries to
compress and stops when the size gets too large) just because there is
no given model for the source data, and thus there is no way of
knowing how large the output would be when getting compressed. In
other words, there's no way of knowing whether a given file is "easily
compressible" except by trying to compress it in first place. For
specific data, e.g. natural images, such models exist and allow (to a
certain extend) to give estimates of the output size.
> I think this is related to the issue of compressing
> streams versus compressing files?
No, not really.
So long,
Thomas
| |
| Bram Stolk 2004-11-18, 3:55 pm |
| Jim Leonard wrote:
> Bram Stolk <bram@geenspam.sara.nl> wrote in message news:<1100707491.177591@blaat.sara.nl>...
>
>
>
> This isn't normally possible because actual compression of the source
> has to occur before you know what the output is. Instead of trying to
> answer your question on how to do it, could you tell us *why* you want
> to do this?
Because I want the compressed data to fit the MTU size of a
UDP datagram.
Additionally, I want to be each datagram selfcontained, so that
it can decompress its part of the data, irrespective of wether
the other datagrams arrived or not.
So: lossless compression (of image data), over lossy network (udp).
Parts of the image that do not survive the network, I take for
granted. Parts that do make it, should be decompressible in their
own right (self contained).
bram
| |
| Matt Mahoney 2004-11-19, 3:55 am |
| "Bram Stolk" <bram@geenspam.sara.nl> wrote in message
news:1100800359.335989@blaat.sara.nl...
> Jim Leonard wrote:
news:<1100707491.177591@blaat.sara.nl>...[color=darkred]
>
> Because I want the compressed data to fit the MTU size of a
> UDP datagram.
>
> Additionally, I want to be each datagram selfcontained, so that
> it can decompress its part of the data, irrespective of wether
> the other datagrams arrived or not.
>
> So: lossless compression (of image data), over lossy network (udp).
> Parts of the image that do not survive the network, I take for
> granted. Parts that do make it, should be decompressible in their
> own right (self contained).
>
> bram
zlib will compress until either the input buffer is empty or the output
buffer is full. So just set the output buffer to your MTU size and compress
as much data as you can. Start a new zlib stream for each packet.
Actually it's not quite this simple because some data needs to be flushed,
so you might need to reserve a little space at the end of the packet.
-- Matt Mahoney
| |
| Pete Fraser 2004-11-19, 3:55 am |
|
"Matt Mahoney" <matmahoney@yahoo.com> wrote in message
news:bW7nd.3016$Qh3.1797@newsread3.news.atl.earthlink.net...
> "Bram Stolk" <bram@geenspam.sara.nl> wrote in message
>
> zlib will compress until either the input buffer is empty or the output
> buffer is full. So just set the output buffer to your MTU size and
> compress
> as much data as you can. Start a new zlib stream for each packet.
>
> Actually it's not quite this simple because some data needs to be flushed,
> so you might need to reserve a little space at the end of the packet.
>
Presumably he also needs to reserve space for position data for
the decoded image portion.
| |
| Jim Leonard 2004-11-19, 3:55 am |
| Bram Stolk <bram@geenspam.sara.nl> wrote in message news:<1100800359.335989@blaat.sara.nl>...
> Because I want the compressed data to fit the MTU size of a
> UDP datagram.
>
> Additionally, I want to be each datagram selfcontained, so that
> it can decompress its part of the data, irrespective of wether
> the other datagrams arrived or not.
Your best bet, for speed and simplicity of implementation, is to
define a blocksize equal to that of the MTU+UDP header size, and
compress. Some data will not be compressable, and so will fit in the
largest possible packet that can be sent. Some data will compress,
and you just use a smaller packet to contain the compressed data.
There are standards for IP compression; check RFC 2395 and RFC 3173
for starters.
| |
| AberAber 2004-11-19, 3:55 am |
| As long as your final compression is lower than your output size specified,
simply do your compression, then pad your data with 00s or whatever you want
at the end. You could have your compression terminate on a sequence and it
would then just ignore the rest of the data.
It really depends what kind of compression you're looking for, but either
way you can pad your data to get it to the size, and your decompressor would
know to ignore anything after the terminating sequence.
You could even use known compression libraries, take the output, and then
pad it so you don't need to write the compression yourself. Decompressor
should just decompress properly and terminate, and the rest of the data
won't be used. If already done compression isn't easy, just write it
yourself.
----- Original Message -----
From: "Bram Stolk" <bram@geenspam.sara.nl>
Newsgroups: comp.compression
Sent: Wednesday, November 17, 2004 11:04 AM
Subject: Compression with specified output-size.
> Hello,
>
> Is there a compression library that automatically
> stops when a specified output-size has been reached?
>
> Normally, when compressing, e.g. with zlib, you
> specify the size of the source buffer, and zlib
> will tell you how large the output is.
>
> I want to turn this around: I specify the output
> size, and the compression lib tells me how much
> of the input data it could compress into the
> destination buffer.
>
> I think this is related to the issue of compressing
> streams versus compressing files?
>
> Bram
>
| |
| Bram Stolk 2004-11-19, 8:55 am |
| Matt Mahoney wrote:
>
> zlib will compress until either the input buffer is empty or the output
> buffer is full. So just set the output buffer to your MTU size and compress
> as much data as you can. Start a new zlib stream for each packet.
>
> Actually it's not quite this simple because some data needs to be flushed,
> so you might need to reserve a little space at the end of the packet.
Well, I already tried this approach with bzlib,
which I think has the same stream interface?
If I do this, I get:
BZ_SEQUENCE_ERROR during the second BZ2_bzCompress call to write
out the remaining stuff.
Once you've promized bzlib a certain nr of input bytes, you cannot
withdraw that promise, and finish up by resetting avail_in to 0.
Bram
> -- Matt Mahoney
>
>
| |
| Bram Stolk 2004-11-19, 8:55 am |
| AberAber wrote:
> As long as your final compression is lower than your output size specified,
> simply do your compression, then pad your data with 00s or whatever you want
> at the end. You could have your compression terminate on a sequence and it
> would then just ignore the rest of the data.
No, this is not the issue I try to solve.
The complete text compressed is larger than my desired output size,
so I want to know how much of my text could fit when compressed.
E.g.: source text is 1Mbyte, I want lossless compression, but my
compressed output should be 8000 bytes. Let's say the 1Mbyte compresses
to 300Kb.
What I want to know: how many bytes of my 1Mbyte will fit, when
compressed into the 8000 bytes?
I've given it more thought, and clearly this is a very difficult problem.
A non-trivial compressor needs to 'look ahead' into the entire
1Mb. Only something simple as RLE can process a stream without
looking ahead, and stop at any time when the output buffer is
full. When stopped, it has a valid compressed stream.
Unfortunately, photographic images will do very poorly on RLE
is my guess.
Bram
| |
| Bram Stolk 2004-11-19, 8:55 am |
| Jim Leonard wrote:
>
> Your best bet, for speed and simplicity of implementation, is to
> define a blocksize equal to that of the MTU+UDP header size, and
> compress. Some data will not be compressable, and so will fit in the
> largest possible packet that can be sent. Some data will compress,
> and you just use a smaller packet to contain the compressed data.
>
This defeats my purpose.
I want to use each datagram as efficiently as possible.
My MTU size is 8192.
I want this to be fully used.
My guess is that RLE could e.g. start compressing a huge input,
and automatically stop when 8192 bytes have been written. It
will then have a valid compressed stream for X nr of bytes.
bzlib, zlib or liblzo do not seem to posses such a feature.
If I can get away with the huge computational load, I could
ofcourse do lots of compression runs per MTU and with binary
search optimize the size of the source chunk.
E.g.
- Compress 32K -> Output 12111 bytes
- too large: lower source size
- Compress 16K -> Output 6666 bytes
- too small: increase source size
- Compress 24K -> Output 8001 bytes
- too small: increase source size
- Compress 28K ->
etc...
Untill I'm comfartably close to 8192 bytes.
This sounds like a huge computational waste.
Surely, some mechanism should be possible to stop
a compressor upon a desired output size, and still
have a valid compressed stream?
Bram
| |
| cr88192 2004-11-19, 3:55 pm |
|
"Bram Stolk" <bram@geenspam.sara.nl> wrote in message
news:1100800359.335989@blaat.sara.nl...
> Jim Leonard wrote:
>
> Because I want the compressed data to fit the MTU size of a
> UDP datagram.
>
> Additionally, I want to be each datagram selfcontained, so that
> it can decompress its part of the data, irrespective of wether
> the other datagrams arrived or not.
>
> So: lossless compression (of image data), over lossy network (udp).
> Parts of the image that do not survive the network, I take for
> granted. Parts that do make it, should be decompressible in their
> own right (self contained).
>
this makes me think that a combination of adaptive huffman and lz-style
encoding might work. you keep encoding so long as space remains in the
output, and break it off there.
effectively, all the contexts are reset for the next chunk to retain the
"independence" aspect (albeit at the cost of compressor efficiency).
this may not be the "best" approach, but it should be workable.
| |
| cr88192 2004-11-19, 3:55 pm |
|
"Bram Stolk" <bram@nospam.sara.nl> wrote in message
news:dJhnd.8088$lN.7748@amsnews05.chello.com...
> AberAber wrote:
>
> No, this is not the issue I try to solve.
> The complete text compressed is larger than my desired output size,
> so I want to know how much of my text could fit when compressed.
>
> E.g.: source text is 1Mbyte, I want lossless compression, but my
> compressed output should be 8000 bytes. Let's say the 1Mbyte compresses
> to 300Kb.
>
> What I want to know: how many bytes of my 1Mbyte will fit, when compressed
> into the 8000 bytes?
>
> I've given it more thought, and clearly this is a very difficult problem.
> A non-trivial compressor needs to 'look ahead' into the entire
> 1Mb. Only something simple as RLE can process a stream without
> looking ahead, and stop at any time when the output buffer is
> full. When stopped, it has a valid compressed stream.
>
> Unfortunately, photographic images will do very poorly on RLE
> is my guess.
>
I will argue: not necissarily true.
as stated in another post, lz+adaptive huffman could work well here.
of course, in general these would not be very good for images. for images it
would make more sense to use a more specialized approach.
| |
| AberAber 2004-11-19, 3:55 pm |
| Oh, I misunderstood you. It seems like you will compress your input data
fully, then just take the chunk that is 8000 bytes, decompress it ignoring a
bitwise end that is not the full bits, and then you will see what you have?
Perhaps some form of LZ77 with RLE would work better than huffman because
huffman would need to store the entire tree, well at least with standard
implementation. Still, you won't get astounding compression.
"Bram Stolk" <bram@nospam.sara.nl> wrote in message
news:dJhnd.8088$lN.7748@amsnews05.chello.com...
> AberAber wrote:
>
> No, this is not the issue I try to solve.
> The complete text compressed is larger than my desired output size,
> so I want to know how much of my text could fit when compressed.
>
> E.g.: source text is 1Mbyte, I want lossless compression, but my
> compressed output should be 8000 bytes. Let's say the 1Mbyte compresses
> to 300Kb.
>
> What I want to know: how many bytes of my 1Mbyte will fit, when compressed
> into the 8000 bytes?
>
> I've given it more thought, and clearly this is a very difficult problem.
> A non-trivial compressor needs to 'look ahead' into the entire
> 1Mb. Only something simple as RLE can process a stream without
> looking ahead, and stop at any time when the output buffer is
> full. When stopped, it has a valid compressed stream.
>
> Unfortunately, photographic images will do very poorly on RLE
> is my guess.
>
> Bram
>
| |
| Ingbert Grimpe 2004-11-19, 3:55 pm |
| On Fri, 19 Nov 2004 07:59:05 GMT, Bram Stolk <bram@nospam.sara.nl> wrote:
Have your trieb ZLib with defalte option Z_FULL_FLUSH? According to the
manual that should be what you are looking for. If you set a very large
input buffer and a 8000 byte output buffer, ZLib will probably try to fill
the outbuffer as good as possible, at the same time ensuring that each
'block' can be uncompressed independently if a block should miss or is
damaged.
--
Using Opera's revolutionary e-mail client: http://www.opera.com/m2/
| |
| Jim Leonard 2004-11-19, 8:55 pm |
| Bram Stolk <bram@nospam.sara.nl> wrote in message news:<dJhnd.8088$lN.7748@amsnews05.chello.com>...
> What I want to know: how many bytes of my 1Mbyte will fit, when
> compressed into the 8000 bytes?
You can't know that until you actually compress the source.
> I've given it more thought, and clearly this is a very difficult problem.
> A non-trivial compressor needs to 'look ahead' into the entire
> 1Mb. Only something simple as RLE can process a stream without
> looking ahead, and stop at any time when the output buffer is
> full. When stopped, it has a valid compressed stream.
This is what Matt was saying: Start compressing with zlib and stop
when the output buffer is full. On next packet, start a *new* zlib
stream.
| |
| Matt Mahoney 2004-11-20, 3:55 am |
| "Bram Stolk" <bram@nospam.sara.nl> wrote in message
news:HBhnd.16085$jf.15376@amsnews03-serv.chello.com...
> Matt Mahoney wrote:
>
compress[color=darkred]
flushed,[color=darkred]
>
> Well, I already tried this approach with bzlib,
> which I think has the same stream interface?
>
> If I do this, I get:
> BZ_SEQUENCE_ERROR during the second BZ2_bzCompress call to write
> out the remaining stuff.
>
> Once you've promized bzlib a certain nr of input bytes, you cannot
> withdraw that promise, and finish up by resetting avail_in to 0.
That problem is specific to Burrows Wheeler compression (I assume that's
what BZ2* is). It sorts the whole input block and then compresses it, so a
partial input block won't work. zlib uses LZ77, replacing common substrings
with pointers back to the previous occurrence. The pointers are never more
than a few bytes so it should be possible to stop when the output buffer is
about full.
-- Matt Mahoney
| |
|
| Bram Stolk <bram@nospam.sara.nl> wrote in message news:<MShnd.8173$lN.1027@amsnews05.chello.com>...
>
> This defeats my purpose.
> I want to use each datagram as efficiently as possible.
>
> My MTU size is 8192.
> I want this to be fully used.
> My guess is that RLE could e.g. start compressing a huge input,
> and automatically stop when 8192 bytes have been written. It
> will then have a valid compressed stream for X nr of bytes.
> bzlib, zlib or liblzo do not seem to posses such a feature.
>
> If I can get away with the huge computational load, I could
> ofcourse do lots of compression runs per MTU and with binary
> search optimize the size of the source chunk.
>
> E.g.
> - Compress 32K -> Output 12111 bytes
> - too large: lower source size
> - Compress 16K -> Output 6666 bytes
> - too small: increase source size
> - Compress 24K -> Output 8001 bytes
> - too small: increase source size
> - Compress 28K ->
>
> etc...
> Untill I'm comfartably close to 8192 bytes.
>
> This sounds like a huge computational waste.
> Surely, some mechanism should be possible to stop
> a compressor upon a desired output size, and still
> have a valid compressed stream?
>
> Bram
Why not write your own compression routine similar to zlib that will
automatically write packets or create an array of packets that are of
size 8192. Each packet must have a "position" pointer (so the
decompressor knows where in the uncompressed stream this packet
begins) and an "end" code (if you implemented huffman, you'd need 257
codes instead of 256).
pseudo-code for your compressor would be: (being a java specialist so
some of the code may appear in java rather than c/c++ style, but the
languages are similar in syntax so deal with it :P)
//begin packet program code:
define compressor as the compressor
define packet[] as an array of packets
compressor.feedallinput(input);
int currentindex = 0;
while(!compressor.finished())
{
packet[currentindex] = compressor.getnextpacket();
currentindex++;
}
//end packet program code
//begin compressor code
globaltoallfunctions int currentposition;
function packet getnextpacket()
{
create new packet;
reset compression dictionary since the compressor may not have all
of the preceding data;
packet.write(currentposition)
while(lengthofpacket<(desiredsize-lengthofendcode))
//lengthofpacket, desiredsize and lengthofendcode are best off
being
//measured in bits rather than bytes
{
write the next byte;
}
write end code;
return packet;
}
//end compressor code
At the end of this code; you know how many packets you have to
transmit, each packet is of the desired length, and each packet
contains a position pointer and an end code for the decompressor to
use. You probably don't even need packet[] you can just send each
packet as the compressor returns them.
Two things you may want to look into:
If I remember correctly, most routers (the ones you buy for
home/office use, anyway) automatically default to an MTU size of 1500
regardless of what you send it, and the header counts toward the MTU
size. So if the header size of the UDP packet is 40 (I'm not sure
what it is, but TCP is 40 I believe), then you can only really send
1460 bytes of data.
I'm not sure if you can control individual packets in C/C++ but as far
as I know you can't in Java unless you write your own network handling
Java classes; you can only write to streams.
| |
| Bram Stolk 2004-11-20, 8:55 am |
| Matt Mahoney wrote:
>
> That problem is specific to Burrows Wheeler compression (I assume that's
> what BZ2* is). It sorts the whole input block and then compresses it, so a
> partial input block won't work. zlib uses LZ77, replacing common substrings
> with pointers back to the previous occurrence. The pointers are never more
> than a few bytes so it should be possible to stop when the output buffer is
> about full.
Thanks for the information.
I will try this approach with zlib then.
Bram
> -- Matt Mahoney
>
>
| |
| Mark Adler 2004-11-20, 3:55 pm |
| Bram Stolk <bram@geenspam.sara.nl> wrote in message news:<1100707491.177591@blaat.sara.nl>...
> I want to turn this around: I specify the output
> size, and the compression lib tells me how much
> of the input data it could compress into the
> destination buffer.
Though zlib isn't designed to do this, you can coerce it to compress
to a fixed-size output with some effort. It would require
decompressing the data as it's compressed, and a second pass of
compression on the last deflate block of the stream (see RFC 1951 for
details on the deflate format), using information from decompression
on how much will fit. You would be able to get within zero to two
bytes of exactly filling the output buffer.
However I'm not sure why you have imposed a requirement to be able to
decompress each piece independently. I suspect that any piece of the
uncompressed data is not useful by itself, and needs to be reassembled
with the other pieces of uncompressed data. In that case, you would
be better off reassembling the compressed data instead of the
uncompressed data since a) you will get much better compression on
longer streams, and b) the compressed data is most likely smaller than
the uncompressed data, and so the reassembly will require less memory
and time.
You would not however want the compressed stream to run on forever,
since then you would be subject to losing all subsequent data if one
packet got lost. For zlib, I have found that breaking up the
uncompressed data into approximately one megabyte chunks provides a
good compromise between compression effectiveness and block size. You
would take a penalty of only about 1% in compressed size by breaking
up the input every 1 MB.
mark
| |
| Mark Adler 2004-11-21, 3:55 pm |
| "Ingbert Grimpe" <i.grimpe@igrimpe.de> wrote in message news:<opshp0u1nathqce3@alpha>...
> Have your trieb ZLib with defalte option Z_FULL_FLUSH? According to the
> manual that should be what you are looking for. If you set a very large
> input buffer and a 8000 byte output buffer, ZLib will probably try to fill
> the outbuffer as good as possible, at the same time ensuring that each
> 'block' can be uncompressed independently if a block should miss or is
> damaged.
zlib will not try to just fill the provided output buffer as suggested
above. Z_FULL_FLUSH will cleanly delimit the output when all of the
*input* provided up to that point has been compressed. There will
likely be more compressed data beyond the provided output buffer,
requiring another call to deflate() with more output space. So it
doesn't help for this problem, unless you know somehow how much input
to provide to get the desired amount of output.
By the way, in this case you're better off using Z_FINISH instead of
Z_FULL_FLUSH, and starting a new deflate stream for the next packet.
That will simplify decoding, and will provide a check value for the
block.
mark
| |
| Mark Adler 2004-11-21, 3:55 pm |
| madler@alumni.caltech.edu (Mark Adler) wrote in message news:<7ab39013.0411200804.3219cbec@posting.google.com>...
> Though zlib isn't designed to do this, you can coerce it to compress
> to a fixed-size output with some effort. It would require
> decompressing the data as it's compressed, and a second pass of
> compression on the last deflate block of the stream (see RFC 1951 for
> details on the deflate format), using information from decompression
> on how much will fit. You would be able to get within zero to two
> bytes of exactly filling the output buffer.
Following up on my own post ...
I actually tried what I wrote above, and it doesn't work as well as I
thought it would for small output sizes. I underestimated how much
the compressed data would reduce in size when the dynamic deflate
block was shortened. This results in a lower number of occuring
lengths, distances, and perhaps literals, which compress to fewer bits
and requires a smaller header on the dynamic block to define the
codes. As a result, when I compress enough input to get 1500 bytes of
output, deflate() produces a burst of about 27K of compressed data as
the first deflate dynamic block. When I decompress the first 1500
bytes of that and recompress it, it only comes out to about 1300
bytes. So to really get the approach above to work, you would need a
few more iterations to fill out the 1500 byte buffer.
What I wrote in the previous post does work as advertised if I turn
off dynamic blocks. Using only fixed blocks, there is no variation in
the coding, and there are no headers, so I get 1498 to 1500 bytes in
the output after the two passes of compression. However the
compression is degraded somewhat.
mark
| |
| Ingbert Grimpe 2004-11-21, 3:55 pm |
| On 21 Nov 2004 05:26:22 -0800, Mark Adler <madler@alumni.caltech.edu>
wrote:
> zlib will not try to just fill the provided output buffer as suggested
> above. Z_FULL_FLUSH will cleanly delimit the output when all of the
Since you know this stuff much better, I stand ashamed and corrected ;)
Another question (the manual does not dig very deep into thsi topi8c, but
I would be the last one to throw stones ... ;):
I'd like to compress database records with ZLib, so I have the problem
that compressing a single record (say between 50 and 1000 byte) will not
have a very good compression ratio. Since I know the contents of the
database in total (compression is done on a server, uncompressing on a
client), I chose to use a dictionary which I build from the database
content like described here:
http://www.netnam.vn/unescocourse/c...rvision/106.htm
So far it seems to work (I put the most used words towards the 'end' of
the memory block), but I would like to know if there's a better (and
faster, cause I'm not a coder-god) solution to create a dictionary.
| |
| Bram Stolk 2004-11-22, 8:55 am |
| Mark Adler wrote:
> Bram Stolk <bram@geenspam.sara.nl> wrote in message news:<1100707491.177591@blaat.sara.nl>...
>
>
>
> Though zlib isn't designed to do this, you can coerce it to compress
> to a fixed-size output with some effort. It would require
> decompressing the data as it's compressed, and a second pass of
> compression on the last deflate block of the stream (see RFC 1951 for
> details on the deflate format), using information from decompression
> on how much will fit. You would be able to get within zero to two
> bytes of exactly filling the output buffer.
Sounds complicated, I take it then, that the solution provided
by Matt Mahoney will not work? :
*zlib will compress until either the input buffer is empty or the output
*buffer is full. So just set the output buffer to your MTU size and compress
*as much data as you can. Start a new zlib stream for each packet.
*
*Actually it's not quite this simple because some data needs to be flushed,
*so you might need to reserve a little space at the end of the packet.
>
> However I'm not sure why you have imposed a requirement to be able to
> decompress each piece independently. I suspect that any piece of the
> uncompressed data is not useful by itself, and needs to be reassembled
Nope, this is not the case:
Each packet *is* usefull to me, regardless of wether the others
arrive or not. That is why I want them to compress independently.
I transfer them over a lossy network protocol, so chunks are bound
to be missing.
Bram
| |
| Bram Stolk 2004-11-22, 8:55 am |
| AberAber wrote:
> Oh, I misunderstood you. It seems like you will compress your input data
> fully, then just take the chunk that is 8000 bytes, decompress it ignoring a
> bitwise end that is not the full bits, and then you will see what you have?
>
No, sorry.
Maybe it's time for me to specify this mathematically to
avoid further ambiguity:
I have text X,
I want to find Y0...Yn, so that
X = concatenation of Y0, Yn.
For each Yi, I require that len(F(Yi)) < 8192, whereby F(x) is the
compression function.
In other words:
What are the positions in the stream X, where I cut them up, and
compress them independently, so that their compressed sizes are < 8192.
Ofcourse I want to minimize the nr of cuts I make, which means:
compressed sizes are close to 8192.
Bram
> Perhaps some form of LZ77 with RLE would work better than huffman because
> huffman would need to store the entire tree, well at least with standard
> implementation. Still, you won't get astounding compression.
>
>
> "Bram Stolk" <bram@nospam.sara.nl> wrote in message
> news:dJhnd.8088$lN.7748@amsnews05.chello.com...
>
>
>
>
| |
| Mark Adler 2004-11-22, 8:55 pm |
| "Ingbert Grimpe" <i.grimpe@igrimpe.de> wrote in message news:<opshtrogmfthqce3@alpha>...
> So far it seems to work (I put the most used words towards the 'end' of
> the memory block), but I would like to know if there's a better (and
> faster, cause I'm not a coder-god) solution to create a dictionary.
No, you have it exactly right. You construct a dictionary by somehow
identifying what you expect to be the most commonly occurring strings
in your data. You then fill the 32K dictionary with those strings,
starting with the most common at the end of the 32K.
mark
| |
|
| Hi Mark
In article <7ab39013.0411220827.7be92f75@posting.google.com>,
madler@alumni.caltech.edu says...
> "Ingbert Grimpe" <i.grimpe@igrimpe.de> wrote in message news:<opshtrogmfthqce3@alpha>...
>
> No, you have it exactly right. You construct a dictionary by somehow
> identifying what you expect to be the most commonly occurring strings
> in your data. You then fill the 32K dictionary with those strings,
> starting with the most common at the end of the 32K.
>
If I can have a large collection of send -and alike- msg's, is there a
utility that can give a dictionary that's rather good? It's my
understanding that it's a NP-type problem to create the optimal
dictionary, but I hope to get a better one than by hand and easy. It's
no problem if a fast PC has to run a few day's (or w ) for it.
Alain
> mark
>
| |
| Mark Adler 2004-11-23, 3:55 am |
| Bram Stolk <bram@geenspam.sara.nl> wrote in message news:<1101117205.678281@blaat.sara.nl>...
> Sounds complicated, I take it then, that the solution provided
> by Matt Mahoney will not work?
No, it won't. The reason is that zlib's deflate implementation
decides on its own how to chunk the data into "dynamic blocks", where
each block has its own set of Huffman codes that are defined in a
header to the block. Those blocks will not fall where you want them
to, simply by limiting the output buffer space on a single deflate()
call. deflate() will just wait for more output space to complete the
block, and go past your desired output buffer size.
deflate() will only complete the stream when it has compressed all of
the input you have provided. So the only solution is to provide the
right amount of input to get the desired output size. Unfortunately,
you have no way of knowing that, since it depends on how well zlib
will be able to compress the data.
What I'm thinking of doing is adding a function to zlib that will tell
you how much deflate() will output if the next thing you did was to
ask it to finish and provide no more input. This would be
computationally expensive, so you wouldn't want to call it for every
byte of input. But used judiciously, you could do a root-finding
approach, taking advantage of the deflateCopy() capability to save the
state to allow you to back up. Then you could optimize filling an
output buffer with compressed data.
mark
| |
| Mark Adler 2004-11-23, 3:55 pm |
| Alain <yivol5402@sneakemail.com> wrote in message news:<MPG.1c0c4ed58b5bd2898969f@news.telenet.be>...
> If I can have a large collection of send -and alike- msg's, is there a
> utility that can give a dictionary that's rather good?
I am not aware of such a utility. It is definitely a non-trivial
problem, and would require some considerable experimentation to find a
good approach to automatically generating a dictionary.
mark
| |
| Bram Stolk 2004-11-23, 3:55 pm |
| Mark Adler wrote:
> Bram Stolk <bram@geenspam.sara.nl> wrote in message news:<1101117205.678281@blaat.sara.nl>...
>
>
>
> No, it won't. The reason is that zlib's deflate implementation
> decides on its own how to chunk the data into "dynamic blocks", where
> each block has its own set of Huffman codes that are defined in a
> header to the block. Those blocks will not fall where you want them
> to, simply by limiting the output buffer space on a single deflate()
> call. deflate() will just wait for more output space to complete the
> block, and go past your desired output buffer size.
>
> deflate() will only complete the stream when it has compressed all of
> the input you have provided. So the only solution is to provide the
> right amount of input to get the desired output size. Unfortunately,
> you have no way of knowing that, since it depends on how well zlib
> will be able to compress the data.
>
> What I'm thinking of doing is adding a function to zlib that will tell
> you how much deflate() will output if the next thing you did was to
> ask it to finish and provide no more input. This would be
> computationally expensive, so you wouldn't want to call it for every
> byte of input. But used judiciously, you could do a root-finding
> approach, taking advantage of the deflateCopy() capability to save the
> state to allow you to back up. Then you could optimize filling an
> output buffer with compressed data.
>
> mark
Thanks for all the info.
For the time being, I have adopted the following approach:
Do a lot of seperate compression runs of 8000 bytes input each.
Then concatenate the compressed messages as long as the total
is below the 8192 MTU.
This will ofcourse give sub optimal compression, but it takes
little computation, and gives reasonable large datagrams.
A better fill-rate of the 8192 datagrams could be had by
a selective concatenation, and choose the concatenation parts
based on their sizes. But this is a hard scheduling problem I
think.
Bram
| |
| Mark Adler 2004-11-25, 3:55 am |
| I came up with an approach that works pretty well. It occurred to me
that if I did three compressions, with the second one being close to
the right amount, but with enough excess to cover the shortfalls I was
seeing with the two-pass approach, that it should get pretty close.
And it does. The attached code fills the output buffer to within
typically ten or fewer bytes. The penalty is that you have to
compress the same data three times.
mark
/* fitblk.c: example of fitting compressed output to a specified size
Not copyrighted -- provided to the public domain
Version 1.0 24 November 2004 Mark Adler */
/* Version history:
1.0 24 Nov 2004 First version
*/
#include <stdio.h>
#include <stdlib.h>
#include "zlib.h"
#define local static
/* print nastygram and leave */
local void quit(char *why)
{
fprintf(stderr, "fitblk abort: %s\n", why);
exit(1);
}
#define EXCESS 256 /* empirically determined stream overage */
#define MARGIN 8 /* amount to back off for completion */
/* compress from stdin to fixed-size block on stdout */
int main(int argc, char **argv)
{
int ret; /* return code */
int flush; /* flush parameter for next deflate() */
unsigned size; /* requested fixed output block size */
unsigned have; /* bytes written by deflate() call */
z_stream def, inf; /* zlib deflate and inflate states */
char *raw; /* uncompressed input */
char *blk; /* intermediate and final stream */
char *tmp; /* close to desired size stream */
/* get requested output size */
if (argc != 2)
quit("need one argument: size of output block");
ret = strtol(argv[1], argv + 1, 10);
if (argv[1][0] != 0)
quit("argument must be a number");
if (ret < 8) /* 8 is minimum zlib stream size */
quit("need positive size of 8 or greater");
size = (unsigned)ret;
/* allocate memory for buffers and compression engine */
blk = malloc(size + EXCESS);
raw = malloc(size);
def.zalloc = Z_NULL;
def.zfree = Z_NULL;
def.opaque = Z_NULL;
ret = deflateInit2(&def, Z_DEFAULT_COMPRESSION, Z_DEFLATED, 15, 8,
Z_DEFAULT_STRATEGY);
if (ret != Z_OK || blk == NULL || raw == NULL)
quit("out of memory");
/* compress from stdin until output full, or no more input */
def.avail_out = size + EXCESS;
def.next_out = blk;
flush = Z_NO_FLUSH;
do {
def.avail_in = fread(raw, 1, size, stdin);
if (ferror(stdin))
quit("error reading input");
def.next_in = raw;
if (feof(stdin))
flush = Z_FINISH;
ret = deflate(&def, flush);
if (ret == Z_STREAM_ERROR)
quit("internal error");
} while (def.avail_out != 0 && flush == Z_NO_FLUSH);
(void)deflateReset(&def);
/* if it all fit, then size was undersubscribed -- done! */
if (ret == Z_STREAM_END && def.avail_out >= EXCESS) {
(void)deflateEnd(&def);
free(raw);
have = size + EXCESS - def.avail_out;
ret = fwrite(blk, 1, have, stdout);
free(blk);
if (ret != have || ferror(stdout))
quit("error writing output");
fprintf(stderr,
"%u bytes unused out of %u requested (all input)\n",
size - have, size);
return 0;
}
/* it didn't all fit -- set up for recompression */
tmp = malloc(size + EXCESS);
inf.zalloc = Z_NULL;
inf.zfree = Z_NULL;
inf.opaque = Z_NULL;
inf.avail_in = 0;
inf.next_in = Z_NULL;
ret = inflateInit(&inf);
if (ret != Z_OK || tmp == NULL)
quit("out of memory");
/* do first recompression close to the right amount */
inf.avail_in = size + EXCESS;
inf.next_in = blk;
def.avail_out = size + EXCESS;
def.next_out = tmp;
flush = Z_NO_FLUSH;
do {
/* decompress */
inf.avail_out = size;
inf.next_out = raw;
ret = inflate(&inf, Z_NO_FLUSH);
switch (ret) {
case Z_NEED_DICT:
case Z_DATA_ERROR:
case Z_STREAM_ERROR:
quit("internal error");
case Z_MEM_ERROR:
quit("out of memory");
}
/* recompress */
def.avail_in = size - inf.avail_out;
def.next_in = raw;
if (inf.avail_out != 0)
flush = Z_FINISH;
ret = deflate(&def, flush);
if (ret == Z_STREAM_ERROR)
quit("internal error");
} while (ret != Z_STREAM_END && def.avail_out != 0);
(void)inflateReset(&inf);
(void)deflateReset(&def);
/* do final recompression */
inf.avail_in = size - MARGIN; /* assure stream will complete */
inf.next_in = tmp;
def.avail_out = size;
def.next_out = blk;
flush = Z_NO_FLUSH;
do {
/* decompress */
inf.avail_out = size;
inf.next_out = raw;
ret = inflate(&inf, Z_NO_FLUSH);
switch (ret) {
case Z_NEED_DICT:
case Z_DATA_ERROR:
case Z_STREAM_ERROR:
quit("internal error");
case Z_MEM_ERROR:
quit("out of memory");
}
/* recompress */
def.avail_in = size - inf.avail_out;
def.next_in = raw;
if (inf.avail_out != 0)
flush = Z_FINISH;
ret = deflate(&def, flush);
if (ret == Z_STREAM_ERROR)
quit("internal error");
} while (ret != Z_STREAM_END && def.avail_out != 0);
if (ret != Z_STREAM_END)
quit("final output did not complete! (MARGIN too small)");
/* done -- write block, clean up, print results */
(void)inflateEnd(&inf);
free(tmp);
(void)deflateEnd(&def);
free(raw);
have = size - def.avail_out;
ret = fwrite(blk, 1, have, stdout);
free(blk);
if (ret != have || ferror(stdout))
quit("error writing output");
fprintf(stderr,
"%u bytes unused out of %u requested (%lu input)\n",
size - have, size, def.total_in);
return 0;
}
|
|
|
|
|