| Does zlib compressed/deflated data contain NULLs?
Is there a way I can have zlib compressed/deflated data not contain
NULLs?
Thanks
Greg
| |
| Mark Adler 2007-08-27, 7:02 pm |
| On Aug 27, 10:55 am, Greg <trifus...@gmail.com> wrote:
> Does zlib compressed/deflated data contain NULLs?
Yes.
> Is there a way I can have zlib compressed/deflated data not contain
> NULLs?
No. You would need to post-process the compressed data to remove the
zeros (losslessly), with some small expansion of the data.
Mark
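
Mark's "Yes" is easy to confirm by looking at the output bytes. A minimal sketch using Python's standard zlib module; the helper name count_nuls and the sample CSV string are made up for illustration.

```python
import zlib


def count_nuls(data: bytes, level: int = -1) -> int:
    """Return how many 0x00 bytes appear in the zlib-compressed form of data."""
    return zlib.compress(data, level).count(0)


# Even an empty input produces NULs: its Adler-32 trailer is 00 00 00 01.
print(count_nuls(b""))

# Level 0 emits a stored block; its 16-bit length field has a zero high
# byte whenever the block is shorter than 256 bytes.
print(count_nuls(b"id,name\n1,alpha\n", 0))
```

Real CSV payloads at normal compression levels will usually contain zero bytes as well, simply because every byte value is about equally likely in well-compressed output.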
| |
|
| I would think that the best way to do this post-processing would be to
simply replace the NULLs with an escape sequence.
Is the zlib wrapped data completely random or pseudo random such that
specific sequences of bytes could never occur?
Is there a specific escape sequence that will not appear in zlib
compressed data that could be safely used?
Are there sequences that would rarely occur? This way I could search
for the sequence and then use the one that is not in the data.
Thanks
Greg
| |
| John Reiser 2007-08-29, 6:56 pm |
| > Is the zlib wrapped data completely random or pseudo random such that
> specific sequences of bytes could never occur?
There are sequences of length 780 (3 * 258 + 6) which never occur. ;-)
> Is there a specific escape sequence that will not appear in zlib
> compressed data that could be safely used?
Not any short ones.
> Are there sequences that would rarely occur? This way I could search
> for the sequence and then use the one that is not in the data.
Considered over all possible inputs, the frequency in the output
is approximately 2**-(8 * n) for small positive numbers 'n' of
consecutive bytes. For typical short inputs (such as all e-mail
messages of fewer than 10,000 characters) there are many triples
that have very low frequency in the output. But for unrestricted
inputs, it's hopeless.
--
| |
|
| I am only compressing comma separated values (CSV) that are
alphanumeric; the resulting zlib output will always be less than 60K
(embedded device limitation).
I was thinking of replacing a single NULL with something like four
consecutive 01 bytes, in HEX it would be 0x01010101, but I was
wondering if there is a specific four byte sequence that was rare for
compressed data.
Is there a 4 byte sequence that is rare for compressed alpha numeric
data?
Thanks
Greg
| |
| John Reiser 2007-08-29, 6:56 pm |
| > I am only compressing comma separated values (CSV) that are alpha
> numeric, the resulting zlib will always be less than 60K (embedded
> device limitation).
[snip]
> Is there a 4 byte sequence that is rare for compressed alpha numeric
> data?
Generate 1,000,000 random inputs that have the proper format.
Compress using zlib. Construct a histogram of all 4-byte sequences
in the outputs. (Sort as many 4-byte sequences as will fit into RAM,
then merge the counts into 16GB of disk space. Repeat in batches
until done.)
--
| |
|
| That would work ... I was however hoping, because of zlib's age and
common usage, that this was already known. :(
So there is then no current post processing system for zlib data to
remove NULLs?
Greg
| |
| Mark Adler 2007-08-29, 6:56 pm |
| On Aug 29, 6:06 am, Greg <trifus...@gmail.com> wrote:
> Is the zlib wrapped data completely random or pseudo random such that
> specific sequences of bytes could never occur?
It's not completely random, since random data will usually generate a
decompression error within a few hundred bytes. However there are no
sequences that never occur.
> Is there a specific escape sequence that will not appear in zlib
> compressed data that could be safely used?
No.
> Are there sequences that would rarely occur?
No. It is a bit-oriented format, so there are no special byte
sequences.
Mark
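
Mark's remark that random data usually triggers a decompression error can be tried out directly. A rough sketch, assuming Python's standard zlib module; inflate_fails and failure_rate are hypothetical helper names, and wbits=-15 selects a raw deflate stream so the two-byte zlib header check does not reject the input first.

```python
import random
import zlib


def inflate_fails(blob: bytes) -> bool:
    """True if zlib rejects blob as a raw deflate stream."""
    d = zlib.decompressobj(-15)  # negative wbits: no zlib header or trailer
    try:
        d.decompress(blob)
    except zlib.error:
        return True
    return False


def failure_rate(trials: int = 200, size: int = 1000, seed: int = 0) -> float:
    """Fraction of random blobs of the given size that fail to inflate."""
    rng = random.Random(seed)
    failures = sum(
        inflate_fails(bytes(rng.getrandbits(8) for _ in range(size)))
        for _ in range(trials)
    )
    return failures / trials


print(failure_rate())
```

A blob that happens to end in a valid final block is counted as a success here, so the printed rate is a lower bound on how often random bytes are not a valid stream.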
| |
| Willem 2007-08-29, 6:56 pm |
| Mark wrote:
) It's not completely random, since random data will usually generate a
) decompression error within a few hundred bytes. However there are no
) sequences that never occur.
Isn't that a contradiction ? I would say that any data that generates
a decompression error is a sequence that can't occur in compressed data.
SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
| |
| Hans-Peter Diettrich 2007-08-29, 6:56 pm |
| Greg wrote:
> I am only compressing comma separated values (CSV) that are alpha
> numeric, the resulting zlib will always be less than 60K (embedded
> device limitation).
Why are you concerned with the compressed representation of your data
at all? You might do better to replace your programming language with
one that is not sensitive to NULL bytes in strings ;-)
DoDi
| |
|
| Not sure if anyone has a stats background but here goes:
Suppose I want to replace the NULLs in some compressed data with a
sequence of bytes.
I know that the maximum size of the data is 64K bytes (each byte is a
number from 0 to 255) and the data can be considered random.
What is the probability if I use a randomly generated 2, 3, 4, or 5
byte sequence, that this sequence will occur in the 64 K byte data?
My basic math skills would lead me to use:
(256^byte sequence)/64K
for 2 bytes = 1 time
for 3 bytes = once every 256 times
for 4 bytes = once every 64K times
for 5 bytes = once every 16.7M
for 6 bytes = once every 4.3B
Does this sound about right?
If so, then I could sequence through 3 byte sequences until I get one
not in the data and then send the sequence with the NULL removed data.
Thanks
Greg
| |
| Mark Adler 2007-08-29, 6:56 pm |
| On Aug 29, 9:40 am, Greg <trifus...@gmail.com> wrote:
> So there is then no current post processing system for zlib data to
> remove NULLs?
Not that I'm aware of. It would be trivial to implement if you don't
mind a non-optimal expansion. Something like every 0x00 is replaced
by 0xff 0x01 and every 0xff is replaced by 0xff 0xff. About 0.8%
expansion.
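
The escape scheme just described can be sketched as follows. This is an illustration in Python, not code from the thread; the function names are invented, and the decoder's error handling is an assumption about what a careful implementation would do.

```python
ESC = 0xFF  # escape byte from the scheme above


def escape_nuls(data: bytes) -> bytes:
    """Replace 0x00 with FF 01 and 0xFF with FF FF; other bytes pass through."""
    out = bytearray()
    for b in data:
        if b == 0x00:
            out += b"\xff\x01"
        elif b == ESC:
            out += b"\xff\xff"
        else:
            out.append(b)
    return bytes(out)


def unescape_nuls(blob: bytes) -> bytes:
    """Invert escape_nuls; raises ValueError on a malformed escape pair."""
    out = bytearray()
    it = iter(blob)
    for b in it:
        if b != ESC:
            out.append(b)
            continue
        nxt = next(it, None)  # byte following the escape, if any
        if nxt == 0x01:
            out.append(0x00)
        elif nxt == ESC:
            out.append(ESC)
        else:
            raise ValueError("malformed escape sequence")
    return bytes(out)
```

Two of the 256 byte values double in size, so on data that looks random the growth is about 2/256, which is the roughly 0.8% quoted above.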
A nearly optimal solution would be to code every 1415 bits as a
177-digit base 255 number. So you only lose one bit out of every 1416
bits, or a 0.07% expansion.
There are solutions of varying complexity and optimality between those
two.
Mark
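
The base-255 idea works because 255^177 is slightly larger than 2^1415, so every 1415-bit value fits in 177 digits, each of which can be stored as a byte from 1 to 255. The sketch below is a simplified variant, not the chunked scheme Mark describes: it treats the whole buffer as one big integer, which gives about the same expansion but is slow on large inputs. The function names and the 0x01 sentinel byte are assumptions made for this example.

```python
def to_base255(data: bytes) -> bytes:
    """Encode data as base-255 digits stored as bytes 1..255 (never 0x00)."""
    # A leading 0x01 sentinel preserves leading zero bytes and the length.
    n = int.from_bytes(b"\x01" + data, "big")
    digits = bytearray()
    while n:
        n, d = divmod(n, 255)
        digits.append(d + 1)  # shift digit range 0..254 to 1..255
    digits.reverse()
    return bytes(digits)


def from_base255(blob: bytes) -> bytes:
    """Invert to_base255; raises ValueError on input it could not have produced."""
    n = 0
    for b in blob:
        if b == 0:
            raise ValueError("encoded data must not contain 0x00")
        n = n * 255 + (b - 1)
    raw = n.to_bytes((n.bit_length() + 7) // 8, "big")
    if raw[:1] != b"\x01":
        raise ValueError("missing sentinel byte")
    return raw[1:]
```

A production version on an embedded device would more likely work in fixed-size chunks, as in Mark's description, to avoid arbitrary-precision arithmetic over the whole buffer.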
| |
| Mark Adler 2007-08-29, 6:56 pm |
| On Aug 29, 10:15 am, Willem <wil...@stack.nl> wrote:
> Mark wrote:
>
> ) It's not completely random, since random data will usually generate a
> ) decompression error within a few hundred bytes. However there are no
> ) sequences that never occur.
>
> Isn't that a contradiction ? I would say that any data that generates
> a decompression error is a sequence that can't occur in compressed data.
It is not a contradiction for floating sequences. Yes, if you fix the
start of the sequence at the beginning of the stream, there are
sequences that cannot occur. However there are no sequences that
cannot occur anywhere in the stream.
Okay, that may be a little strong. But I can say with certainty that
there are no sequences of 65535 bytes or less that cannot occur
anywhere in a valid deflate stream, since that is the length of a
stored block. You can give me a much longer stream, and I bet I could
find a dynamic block Huffman code for which the end-of-block code does
not appear in the stream (when decoded into a sequence of symbols).
There may be specially constructed streams of a few hundred thousand
bytes that you can prove will never appear anywhere in a valid deflate
stream. But it would take some effort to prove, since you have to
take into account a lot of possible ways to preface that sequence.
Mark
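
The stored-block argument can be seen in practice: at compression level 0, zlib writes the input into stored blocks unchanged, so a short byte sequence of your choosing shows up verbatim in a valid stream. A small sketch, assuming Python's standard zlib module; embeds_verbatim is a made-up helper name.

```python
import zlib


def embeds_verbatim(payload: bytes) -> bool:
    """True if payload appears unchanged inside a level-0 (stored) zlib stream."""
    return payload in zlib.compress(payload, 0)


# Any short payload works, including one made entirely of NULs.
print(embeds_verbatim(bytes(range(256))))
print(embeds_verbatim(b"\x00\x00\x00\x00"))
```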
| |
| Mark Adler 2007-08-30, 3:56 am |
| On Aug 29, 10:44 am, Greg <trifus...@gmail.com> wrote:
> Does this sound about right?
It's close, though it overestimates the two-byte case, which is about
1 - e^-1 ~= 0.63. Also the estimates are good only for patterns with no
repeated bytes. If any bytes are repeated, then matches at different
locations are not uncorrelated, which increases the probabilities
some.
> If so, then I could sequence through 3 byte sequences until I get one
> not in the data and then send the sequence with the NULL removed data.
You would then need to also transmit the three-byte sequence with the
block. That's not too bad, but it expands the same amount as a
simpler solution (actually a smidge more due to the three-byte
prefix). The simpler solution is to replace, for example, all zeros
and ones with two-byte sequences (described in another post).
Mark
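
The figures in this exchange can be reproduced with a short calculation. The sketch below assumes the data behaves like independent uniform random bytes and that each start position matches independently, which is the approximation behind the 1 - e^-1 value; hit_probability is a hypothetical name.

```python
import math


def hit_probability(k: int, n: int = 65536) -> float:
    """Approximate chance that a fixed k-byte pattern occurs in n random bytes.

    Treats the n - k + 1 start positions as independent, each matching with
    probability 256**-k. Reasonable for patterns with no repeated bytes.
    """
    positions = n - k + 1
    return 1.0 - math.exp(-positions / 256.0 ** k)


for k in range(2, 7):
    print(k, hit_probability(k))
```

For three bytes and longer the result is very close to Greg's 64K / 256^k estimate; only the two-byte case differs noticeably, because a "probability" of 1 has to be capped at about 0.63.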
| |
| Keith Thompson 2007-08-30, 10:00 pm |
| Greg <trifusion@gmail.com> writes:
> Does zlib compressed/deflated data contain NULLs?
>
> Is there a way I can have zlib compressed/deflated data not contain
> NULLs?
Why does it matter to you? Why can't you cope with data that contains
null bytes?
--
Keith Thompson (The_Other_Keith) kst-u@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"
| |
|
| I am sending the compressed zlib data via HTTP as multipart form data
from an embedded device to a web server. The HTTP library
(http://www.pdadevelopers.com/homehttps.htm) I am using calculates the
HTTP Content-Length header using strlen(). Any NULLs in the data stream
will cause the Content-Length to be incorrect. There is a bug in the
library that prevents me from setting the Content-Length header
manually.
All this to say it's a workaround for a bug in software I am using.
In the end I did use the 3 byte replace function since it was easier
to implement, and then I simply use str_replace on the web server side
to put the NULLs back in.
Greg
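
For readers who want to try the three-byte replacement approach, here is one way it might look. This is a sketch in Python for illustration only, not Greg's actual device or server code, and all function names are invented. It adds one precaution that the thread does not spell out: the marker's first and last bytes must differ, because a marker such as 01 01 01 becomes ambiguous when the byte just before a NULL is also 01.

```python
from itertools import product


def pick_marker(data: bytes) -> bytes:
    """Find a 3-byte marker that is absent from data, contains no 0x00,
    and whose first and last bytes differ (so it cannot overlap itself)."""
    seen = {data[i:i + 3] for i in range(len(data) - 2)}
    for a, b, c in product(range(1, 256), repeat=3):
        if a != c:
            marker = bytes((a, b, c))
            if marker not in seen:
                return marker
    # Not reachable for the sub-64K inputs discussed in this thread.
    raise ValueError("no usable marker found")


def strip_nuls(data: bytes) -> bytes:
    """Return marker + data with every 0x00 replaced by the marker."""
    marker = pick_marker(data)
    return marker + data.replace(b"\x00", marker)


def restore_nuls(blob: bytes) -> bytes:
    """Invert strip_nuls using the 3-byte marker sent as a prefix."""
    marker, body = blob[:3], blob[3:]
    return body.replace(marker, b"\x00")
```

As Mark points out earlier in the thread, each NULL grows to three bytes here, so on random-looking data this expands by about the same amount as the two-byte escape scheme, plus the three-byte prefix.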