GZIP / ZLIB Streaming
| frank@0xD.com 2006-12-11, 6:57 pm |
| Hello,
I have been trying to implement a GZIP streaming compressor for HTTP,
with no luck. I have read over RFC1950, RFC1952, the ZLIB and GZIP
docs, various newsgroups, etc., but cannot locate one particular bit of
information.
When decompressing, the first chunk is always correct. But all
subsequent chunks are corrupted.
I have a response, with Content-Encoding: gzip and Transfer-Encoding:
Chunked. As the chunked response data is generated, each chunk is
compressed. The first chunk that I encounter gets the GZIP header
prepended. I calculate the running size and CRC32 of the source data
until EOF, at which point I finalize the stream by writing the GZIP
trailer. Each chunk gets deflated independently of the previous, and
the chunk is prepared with ZLIB header and Adler32. This is where I
suspect my problem is.
I experimented with removing the ZLIB header from each chunk after
chunk #1, and the Adler32 until the last chunk. But that also failed.
[Although I did not maintain the Adler32 *across* chunks, which I
imagine contributes to a failed stream.]
So I guess my actual question is this : what is the prescribed method
for maintaining a valid compression stream across chunks? Should each
chunk have an independent ZLIB header and Adler32? Or am I missing
something completely [which is probably the case]?
Thanks.
Frank Pelliccio
| |
|
|
<frank@0xD.com> wrote in message
1165592415.881887.264870@73g2000cwn.googlegroups.com:
> Hello,
>
> I have been trying to implement a GZIP streaming compressor for HTTP,
> with no luck...I have read over RFC1950, RFC1952, the ZLIB and GZIP
> docs, various newgroups, etc, but cannot locate one particular bit of
> information.
>
> When decompressing, the first chunk is always correct. But all
> subsequent chunks are corrupted.
>
> I have a response, with Content-Encoding: gzip and Transfer-Encoding:
> Chunked. As the chunked response data is generated, each chunk is
> compressed. The first chunk that I encounter gets the GZIP header
> prepended. I calculate the running size and CRC32 of the source data
> until EOF, at which point I finalize the stream by writing the GZIP
> trailer. Each chunk gets deflated independently of the previous, and
> the chunk is prepared with ZLIB header and Adler32. This is where I
> suspect my problem is.
I also experimented with gzip + chunked encoding, but I couldn't get it
running. My conclusion was that something was wrong with the browser (IE),
as I was pretty sure my deflated data was OK and chunked correctly.
Which browser are you trying it with?
And what do you mean by each chunk getting deflated independently?
When you have deflated output to send (i.e. your output buffer needs to be
flushed), you should create a chunk and send it.
>
> I experimented with removing the ZLIB header from each chunk after
> chunk #1, and the Adler32 until the last chunk. But that also failed.
> [Although I did not maintain the Adler32 *across* chunks, which I
> imagine contributes to a failed stream.]
Adler32? This may be a problem. Why Adler32? Also, you should maintain the CRC
value *across* the chunks, I guess.
>
> So I guess my actual question is this : what is the prescribed method
> for maintaining a valid compression stream across chunks? Should each
> chunk have an independent ZLIB header and Adler32? Or am I missing
> something completely [which is probably the case]?
No. As I said, when you need to flush ZLIB output, make a chunk and send it.
>
> Thanks.
>
>
> Frank Pelliccio
>
| |
| frank@0xD.com 2006-12-11, 6:57 pm |
| Aslan,
Thanks for replying...
> I also experimented with gzip + chunked encoding. But I couldn't get it
> running. My conclusion was there was something wrong with the browser (IE)
> as I was pretty sure my deflated data was OK and chunked correctly.
> Which browser are you trying it with?
It doesn't work with *any* browser. Furthermore, I stripped the core
code down and wrote a stand-alone app to simulate reading a file in chunks,
etc. I save the output stream as a .gz file, and the duplicate
logic fails in the same way with any GZIP decompressor.
> And what do you mean each chunk gets deflated independently?
> When you have to output deflated (i.e your output buffer needs to flushed)
> you should created a chunk and send.
That is what I was doing. I started each independent deflate operation
with a fresh buffer [with the current chunk of data].
> Adler32? This maybe a problem. Why Adler32? Also you should maintain CRC
> value *across* the chunks, I guess.
Well, each ZLIB chunk requires at minimum [per RFC1950] the 2 ZLIB ID
bytes, and the 4 byte Adler32 value...Or at least that was my
interpretation of the spec. I am maintaining the CRC32 of the source
data from first to last byte, then writing it in the GZIP trailer.
> No. As I said, when you need to flush ZLIB output, make a chunk and send it.
Well, this confirms how I was expecting it to work. I suppose that's
the bit I was really trying to find out.
So in conclusion, the proper method for streaming the response is:
- Prepare GZIP header / write to client
- First to last chunk
    - Calc CRC32 of source data
    - deflate current chunk
    - write to client
- Write GZIP trailer [CRC32, Size]
Does that sound correct? I hope not, because it doesn't seem to work
properly for me.
Thanks again.
Frank Pelliccio
| |
|
|
<frank@0xD.com> wrote in message
1165597757.982177.164510@j72g2000cwa.googlegroups.com:
> Aslan,
>
> Thanks for replying...
>
> It doesn't work with *any* browser. Furthermore, I stripped the core
> code, and wrote a stand-alone app to simulate reading a file in chunks,
> etc...I save the output stream as a .gz file, and the the duplicate
> logic fails in the same way with any GZIP decompressor.
>
> That is what I was doing. I started each independent deflate operation
> with a fresh buffer [with the current chunk of data].
>
>
> Well, each ZLIB chunk requires at minimum [per RFC1950] the 2 ZLIB ID
> bytes, and the 4 byte Adler32 value...Or at least that was my
> interpretation of the spec. I am maintaining the CRC32 of the source
> data from first to last byte, then writing it in the GZIP trailer.
>
> Well, this confirms how I was expecting it to work. I suppose that's
> the bit I was really trying to find out.
>
> So in conclusion, the proper method for streaming the response is :
>
> - Prepare GZIP header / write to client
> - First to last chunk
> - Calc CRC32 of source data
> - deflate current chunk
> - write to client
> - Write GZIP trailer [CRC32, Size]
>
> Does that sound correct? I hope not, because it doesn't seem to work
> properly for me.
>
> Thanks again.
I will copy some part of my code:
int SZlib::GZCompress(ptr p)
{
    if (p == 0 || ((ZZ*)p)->__magic != __MAGIC_NO__) return -1;
    int err;
    ZZ* pzz = (ZZ*)p;
    /* windowBits is passed < 0 to suppress zlib header */
    err = deflateInit2(&pzz->__z, pzz->__level,
                       Z_DEFLATED, -MAX_WBITS, DEF_MEM_LEVEL,
                       Z_DEFAULT_STRATEGY);
    if (err != Z_OK) return -1;
    pzz->__z.next_in = pzz->__pin;
    pzz->__z.avail_in = 0;
    u8* px = pzz->__pout;
    // WRITE A VERY SIMPLE .GZ HEADER: (TEN BYTES)
    *px++ = gz_magic[0]; *px++ = gz_magic[1];
    *px++ = Z_DEFLATED;
    *px++ = 0;/*flags*/ *px++ = 0; *px++ = 0; *px++ = 0; *px++ = 0;/*time*/
    *px++ = 0;/*xflags*/
    *px++ = OS_CODE;
    // The first FlushOutput() sends the header together with the first data.
    pzz->__z.next_out = pzz->__pout + 10;
    pzz->__z.avail_out = (uInt)pzz->__outlen - 10;
    pzz->__crc = crc32(0L, Z_NULL, 0); // CRC32 OF UNCOMPRESSED DATA
    // Compress all input as ONE deflate stream; the running CRC covers
    // every input block from first to last.
    while (1) {
        if ( !LoadInput(p) )
            break;
        pzz->__crc = crc32(pzz->__crc, pzz->__pin, pzz->__z.avail_in);
        err = deflate(&pzz->__z, Z_NO_FLUSH);
        FlushOutput(p);
        if ( err != Z_OK )
            break;
    }
    // No more input: drain whatever deflate() still holds.
    while (1) {
        err = deflate(&pzz->__z, Z_FINISH);
        if ( !FlushOutput(p) )
            break;
        if (err != Z_OK)
            break;
    }
    err = deflateEnd(&pzz->__z);
    // GZIP TRAILER, written once: CRC32, then input size, both little-endian.
    px = pzz->__pout;
    {
        uLong x = pzz->__crc;
        for (int n = 0; n < 4; n++) {
            *px++ = (Bytef)(x & 0xff);
            x >>= 8;
        }
    }
    {
        uLong x = pzz->__z.total_in;
        for (int n = 0; n < 4; n++) {
            *px++ = (Bytef)(x & 0xff);
            x >>= 8;
        }
    }
    __pw->Write(pzz->__pout, 8);
    if ( err != Z_OK && err != Z_STREAM_END )
        err = -1;
    else
        err = 0;
    return err;
}

// Writes the pending output bytes and resets the output buffer.
// Returns the number of bytes written (0 if there was nothing to write).
uint SZlib::FlushOutput(ptr p)
{
    if (__pw == 0) return 0;
    ZZ* pzz = (ZZ*)p;
    uint n = pzz->__outlen - pzz->__z.avail_out;
    if (n) {
        __pw->Write(pzz->__pout, n);
        pzz->__z.next_out = pzz->__pout;
        pzz->__z.avail_out = pzz->__outlen;
    }
    return n;
}
I hope you can figure it out. Now take this FlushOutput function. It just
writes n bytes from a buffer. Those n bytes should be your chunk. You
need to add the chunk header and send it through the socket. Write a special
Write() function that first adds the chunk header and then sends the n bytes
passed to it through the socket.
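The chunk-writing step described above could be sketched roughly as follows. This is only an illustration: `frame_chunk` and `last_chunk` are hypothetical names, and the sketch returns the framed bytes instead of writing to a socket. It relies on the HTTP/1.1 chunked transfer coding, where each chunk is the payload size in hexadecimal, CRLF, the payload, CRLF, and a zero-length chunk ends the body.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: wrap one block of already-compressed bytes in
// HTTP/1.1 chunked framing: <size in hex> CRLF <payload> CRLF.
std::string frame_chunk(const std::string& payload)
{
    // A zero-length chunk marks the end of the body, so it must never be
    // produced for ordinary data.
    assert(!payload.empty());
    char size_line[32];
    std::snprintf(size_line, sizeof size_line, "%zx\r\n", payload.size());
    std::string out(size_line);
    out += payload;   // std::string keeps embedded NUL bytes intact
    out += "\r\n";
    return out;
}

// The terminating chunk: size 0 and no trailer fields.
std::string last_chunk()
{
    return "0\r\n\r\n";
}
```

The chunk boundaries carry no meaning for the compressed data. The receiver strips the framing and concatenates the payloads before it decompresses anything.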
>
>
> Frank Pelliccio
>
| |
| frank@0xD.com 2006-12-11, 6:57 pm |
| Aslan,
I assume __MAGIC_NO__ is 0x1f, 0x8b?
Just a question for you...Is GZCompress() being called for each chunk
you are generating? If that is the case, then it appears that your
function is writing the GZIP header and Trailer for each chunk. I
don't believe that is the desired behaviour.
When attempting to decipher my problem, I ran some Wireshark traces to
sites like Google, which GZIP + chunk their response data. I found
that they only send one GZIP header / trailer per response.
I'll continue to look this over.
Frank
| |
| Mark Adler 2006-12-11, 6:57 pm |
| The open source mod_gzip in apache does this, so you might want to look
there for answers to further questions.
frank@0xD.com wrote:
> The first chunk that I encounter gets the GZIP header
> prepended. I calculate the running size and CRC32 of the source data
> until EOF, at which point I finalize the stream by writing the GZIP
> trailer. Each chunk gets deflated independently of the previous, and
> the chunk is prepared with ZLIB header and Adler32. This is where I
> suspect my problem is.
Yep, that's where your problem is. For some reason, you are trying to
embed the zlib format in the gzip format. The browsers are expecting
just the gzip format. All you need to do, and all you're allowed to
do, is to generate one continuous gzip stream and then chunk it. That is
what HTTP 1.1 means by gzip content encoding followed by chunked
transfer encoding.
zlib 1.2.3 supports generating gzip streams with the deflate()
function. See deflateInit2() in zlib.h. So you don't have to worry
about the gzip header and trailer or computing the crc -- zlib will
handle that for you. You just need to do the chunking.
As an aside, the HTTP 1.1 standard also allows "deflate" content
encoding, which replaces the gzip format with the zlib format (again a
single large zlib stream that is then chunked). However I have heard
anecdotally -- not checked it myself -- that the deflate content
encoding is not consistently and perhaps not even correctly supported
by browsers. So stick with gzip content encoding.
mark
| |
| frank@0xD.com 2006-12-11, 6:57 pm |
| Mark,
Thanks for the reply. I'll have a look at mod_gzip.
OK, so you've helped me to identify one faux pas.
But I'm still unclear about another point. When I receive data in the
stream, I have no idea of where it's been, where it's going, its size,
etc. So does "one continuous gzip stream" mean that each chunk in my
response has its own GZIP header and trailer? I thought I had tried
that with no luck. Again, I suspect the mod_gzip code answers this for
me.
And I had read about deflate being improperly implemented by many
browsers...too bad, it would have simplified the process.
Thanks.
Frank Pelliccio
Mark Adler wrote:
> The open source mod_gzip in apache does this, so you might want to look
> there for answers to further questions.
>
> frank@0xD.com wrote:
>
> Yep, that's where your problem is. For some reason, you are trying to
> imbed the zlib format in the gzip format. The browsers are expecting
> just the gzip format. All you need to do, and all you're allowed to
> do, is to generate one continuous gzip stream and then chunk it. That
> what HTTP 1.1 means by gzip content encoding followed by chunked
> transfer encoding.
>
> zlib 1.2.3 supports generating gzip streams with the deflate()
> function. See deflateInit2() in zlib.h. So you don't have to worry
> about the gzip header and trailer or computing the crc -- zlib will
> handle that for you. You just need to do the chunking.
>
> As an aside, the HTTP 1.1 standard also allows "deflate" content
> encoding, which replaces the gzip format with the zlib format (again a
> single large zlib stream that is then chunked). However I have heard
> anecdotally -- not checked it myself -- that the deflate content
> encoding is not consistently and perhaps not even correctly supported
> by browsers. So stick with gzip content encoding.
>
> mark
| |
|
|
<frank@0xD.com> wrote in message
1165601173.266170.299890@16g2000cwy.googlegroups.com:
> Aslan,
>
> I assume __MAGIC_NO__ is 0x1f, 0x8b?
yes
>
> Just a question for you...Is GZCompress() being called for each chunk
> you are generating? If that is the case, then it appears that your
> function is writing the GZIP header and Trailer for each chunk. I
> don't believe that is the desired behaviour.
No. I call deflate() as seen in the code I copied.
>
> When attempting to decipher my problem, I ran some Wireshark traces to
> sites like Google, which GZIP + chunk their response data. I found
> that they only send one GZIP header / trailer per response.
Yes, of course. Actually, my code also inserts one gzip header at the beginning
(you should see it in that code).
>
> I'll continue to look this over.
>
>
> Frank
>
| |
|
|
"Mark Adler" <madler@alumni.caltech.edu> wrote in message
1165601183.868680.81530@j72g2000cwa.googlegroups.com:
> As an aside, the HTTP 1.1 standard also allows "deflate" content
> encoding, which replaces the gzip format with the zlib format (again a
> single large zlib stream that is then chunked). However I have heard
> anecdotally -- not checked it myself -- that the deflate content
> encoding is not consistently and perhaps not even correctly supported
> by browsers. So stick with gzip content encoding.
>
> mark
>
Yes. I tried it. "deflate" seems OK, and I prefer "deflate" to "gzip"
encoding. I also tried sending already-deflated data (copied from a
zip file, for example) using "deflate" encoding, and the browser (IE) seems to be
happy with it.
| |
| Mark Adler 2006-12-11, 6:57 pm |
| frank@0xD.com wrote:
> So does "one continuous gzip stream" mean that each chunk in my
> response has its own GZIP header and trailer?
No. It means one continuous gzip stream with one header and one
trailer that you break up into chunks as you see fit.
mark
| |
| Mark Adler 2006-12-11, 6:57 pm |
| Aslan wrote:
> Yes. I tried it. "deflate" seems OK. And I prefer "deflate" to "gzip"
> encoding. I also tried sending an already deflated data (copying it from a
> zip file for example) using "deflate" encoding, and browser (IE) seem to be
> happy with it.
This is good information, but it needs clarification. When you say you
pulled the deflated data from the zip file, did you then send it as is
to IE, or did you wrap it first with a zlib header and trailer?
I had heard that IE was incorrectly expecting a raw deflate stream (RFC
1951) when "deflate" is specified as the content encoding, rather than
what the HTTP 1.1 standard says, which is that deflate is a zlib stream
(RFC 1950).
One of the reasons that I think the "deflate" content encoding is
messed up in browsers is simply the terminology chosen by the author(s)
of HTTP 1.1. What they *called* deflate is really the zlib format,
which is a short header and trailer wrapped around raw deflate
compressed data. The standard is clear about that, but I'm afraid
browser implementors don't read the standard carefully enough, and
incorrectly interpret "deflate" content encoding as a raw deflate
stream.
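To make the three formats concrete, here is a rough sketch of how a receiver might guess which wrapper it has been sent from the first two bytes. `sniff_format` and the `Wrapper` enum are hypothetical names. The checks come from RFC 1952 (gzip magic bytes 0x1f 0x8b) and RFC 1950 (compression method 8 in the low nibble of the first byte, and the first two bytes read as a big-endian 16-bit number being a multiple of 31). Raw deflate (RFC 1951) has no header at all, so it can only be assumed by elimination, and a raw stream could in principle pass the zlib check by coincidence.

```cpp
#include <cassert>
#include <cstddef>

enum class Wrapper { Gzip, Zlib, RawOrUnknown };

// Hypothetical heuristic: guess the wrapper format from the first two bytes.
Wrapper sniff_format(const unsigned char* data, std::size_t len)
{
    assert(data != nullptr || len == 0);
    if (len < 2) return Wrapper::RawOrUnknown;

    // RFC 1952: a gzip member starts with ID1 = 0x1f, ID2 = 0x8b.
    if (data[0] == 0x1f && data[1] == 0x8b) return Wrapper::Gzip;

    // RFC 1950: CM (low nibble of CMF) is 8 for deflate, CINFO (high
    // nibble) is at most 7, and CMF * 256 + FLG is a multiple of 31.
    const unsigned cmf = data[0];
    const unsigned flg = data[1];
    if ((cmf & 0x0f) == 8 && (cmf >> 4) <= 7 && ((cmf << 8) | flg) % 31 == 0)
        return Wrapper::Zlib;

    // No recognizable header: possibly raw deflate, possibly something else.
    return Wrapper::RawOrUnknown;
}
```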
mark
| |
|
|
"Mark Adler" <madler@alumni.caltech.edu> wrote in message
1165619201.880719.224410@f1g2000cwa.googlegroups.com:
> Aslan wrote:
>
> This is good information, but it needs clarification. When you say you
> pulled the deflated data from the zip file, did you then send it as is
> to IE, or did you wrap it first with a zlib header and trailer?
*As is*. If the browser sends an "Accept-encoding: gzip, deflate\r\n" line, I first
send a proper header (with "Content-Encoding: deflate\r\n"), then I simply
copy the deflated data from the zip file (as many bytes as the compressed
size of the zip entry). That's it.
>
> I had heard that IE was incorrectly expecting a raw deflate stream (RFC
> 1951) when "deflate" is specified as the content encoding, rather than
> what the HTTP 1.1 standard says, which is that deflate is a zlib stream
> (RFC 1950).
>
> One of the reasons that I think the "deflate" content encoding is
> messed up in browsers is simply the terminology chosen by the author(s)
> of HTTP 1.1. What they *called* deflate is really the zlib format,
> which is a short header and trailer wrapped around raw deflate
> compressed data. The standard is clear about that, but I'm afraid
> browser implementors don't read the standard carefully enough, and
> incorrectly interpret "deflate" content encoding as a raw deflate
> stream.
I also tried sending it from a Windows machine to a browser (I don't
remember the name) running on a Linux machine. That was also OK.
>
> mark
>
| |
| Mark Adler 2006-12-11, 6:57 pm |
| Aslan wrote:
> *As is*. If browser sends "Accept-encoding: gzip, deflate\r\n" line, I first
> send a proper header (with "Content-Encoding: deflate\r\n"), then I simply
> copy the deflated data from the zip file (as many bytes as the compressed
> size of the zip entry), that's it.
Thanks. That confirms that the IE authors didn't read the HTTP 1.1
standard. By the way, what version of IE were you using?
mark
| |
|
|
"Mark Adler" <madler@alumni.caltech.edu> wrote in message
1165680100.136106.64660@j72g2000cwa.googlegroups.com:
> Aslan wrote:
>
> Thanks. That confirms that the IE authors didn't read the HTTP 1.1
> standard. By the way, what version of IE were you using?
>
IE6. I also checked it with Firefox 2 today, and it seems to work
there too (but only after pressing the refresh button). But I have to
correct myself: with IE6 it works most of the time (90%),
but with some files the browser doesn't display the content.
What do you mean, the IE authors didn't read the HTTP 1.1 standard? According to
the standard, do I need to add some headers for deflate if I pull it from a zip
file? If so, let me know and I can give it a try.
> mark
>
| |
|
| On 2006-12-09, Aslan <aslanski2002@yahoo.com> wrote:
> *As is*. If browser sends "Accept-encoding: gzip, deflate\r\n" line, I first
> send a proper header (with "Content-Encoding: deflate\r\n"), then I simply
> copy the deflated data from the zip file (as many bytes as the compressed
> size of the zip entry), that's it.
>
about a year ago I was implementing this, I followed the RFC and deflate only
worked with mozilla etc, msie and konqueror required something non-standard be
done.
these days mozilla will work either way, as will msie, not sure about konq.
the monkeys at M$ seem to do that with many standards. eg the SMTP
auth clause...
--
Bye.
Jasen
| |
| Mark Adler 2006-12-11, 6:57 pm |
| Aslan wrote:
> What do you mean IE authors didn't read the HTTP 1.1 standard? According to
> standard, do I need to add some headers for deflate, if I pull it from a zip
> file? If so let me know, I can give it a try.
Yes, according to the HTTP 1.1 standard, you would have to add a zlib
header and trailer per the RFC 1950 specification for the zlib format.
Generating the trailer would require decompressing the data and
computing a check value.
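A minimal sketch of that wrapping step, assuming the caller already has the uncompressed bytes (for example by inflating the zip entry first). The function names are hypothetical. The header bytes 0x78 0x9C are one valid RFC 1950 header (deflate, 32K window, no preset dictionary), and the trailer is the Adler-32 of the uncompressed data, stored most significant byte first.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Adler-32 as defined in RFC 1950, computed over the UNCOMPRESSED data.
std::uint32_t adler32_of(const std::string& data)
{
    const std::uint32_t MOD = 65521;  // largest prime below 2^16
    std::uint32_t a = 1, b = 0;
    for (unsigned char c : data) {
        a = (a + c) % MOD;
        b = (b + a) % MOD;
    }
    return (b << 16) | a;
}

// Hypothetical helper: wrap a raw deflate stream (RFC 1951) in the zlib
// format (RFC 1950): 2-byte header, raw data, 4-byte Adler-32 trailer.
std::string wrap_raw_deflate(const std::string& raw,
                             const std::string& uncompressed)
{
    assert(!raw.empty());  // even empty input deflates to at least one block
    std::string out;
    out += '\x78';         // CMF: CM = 8 (deflate), CINFO = 7 (32K window)
    out += '\x9c';         // FLG: no dictionary; makes CMF*256+FLG % 31 == 0
    out += raw;
    const std::uint32_t check = adler32_of(uncompressed);
    for (int shift = 24; shift >= 0; shift -= 8)   // big-endian trailer
        out += static_cast<char>((check >> shift) & 0xff);
    return out;
}
```

The per-byte modulo keeps the sketch short. A production implementation would normally use zlib's own adler32(), which defers the modulo for speed.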
mark