I have a binary file with N individually compressed segments. Each segment has some Info and D Data segments. I read the Info sections of the file. Later I want to go back and read the D data segments. I keep an index to the beginning of each data section in the file so I can get back there. I was hoping to just copy the z_stream structure and save it for each of the D data segments, but I get Z_DATA_ERROR. I'm thinking it has to do with the state field of z_stream not really being copied, since it's a pointer. Any ideas on how to do this?

Chris
cchgroupmail@gmail.com wrote:
> ... I was hoping to just copy the z_stream structure and
> save it for each of the D data segments, but I get Z_DATA_ERROR. I'm
> thinking it has to do with the state field of z_stream not really being
> copied since it's a pointer. Any ideas on how to do this?

RTFM, particularly the comments in zlib.h about Z_SYNC_FLUSH, Z_FULL_FLUSH, and Z_FINISH as arguments to deflate().
I read the FM. Actually, I found a function in the header file (I didn't see it in the manual), inflateCopy, which says it copies the z_stream entirely to a dest pointer. This is exactly what I was looking for. But thanks for your kind suggestion to read the manual.
cchgroupmail@gmail.com wrote:
> Actually, I found a function in the header file (I
> didn't see it in the manual), inflateCopy, which says it copies the
> z_stream entirely to a dest pointer. This is exactly what I was looking
> for.

That may have been what you were looking for, but it may not be what you want. Every copy of the state takes more than 32K bytes of memory. It would be better to occasionally (say, once every few MB) use Z_SYNC_FLUSH and mark the position. You can then start raw inflation from that point later without having to keep an inflate state for it.

mark
Well, the problem (as I perceive it) is that the compressed data is of arbitrary length (these are actually compressed variables in a Matlab MAT file). I am always using inflate with Z_SYNC_FLUSH. I had tried just marking the file position where the variable's data starts, and then restarting inflation from that point later, but kept getting data errors.

I can handle the normal compressed variables, where there's only a single data position I need to restart inflation from, because I create a z_stream structure for each variable in the file and can just leave the stream hanging there until I either return to it or end it. But in struct-type Matlab variables, each field of the structure has its own data header followed by data, and then it repeats. So what I do is read the header for the structure, then the first variable's header, stopping at the data; copy the state and save it with that field (to my code, a struct variable's data is just an array of more variables); then skip over the data, read the next header, etc.

I'm certainly new to compression, so any other helpful ideas would be appreciated. Not that I expect anyone to look at the file format, but if you are interested or just want to see how these files are set up, the PDF documentation of the MAT file format is at http://www.mathworks.com/access/helpdesk/help/pdf_doc/matlab/matfile_format.pdf

Any help/advice is greatly appreciated.
cchgroupmail@gmail.com wrote:
> I am always using inflate with
> Z_SYNC_FLUSH. I had tried just marking the file position where the
> variable's data starts, and then restarting inflation from that point
> later, but kept getting Data errors.

First off, if you're not using raw inflate, you need to. See inflateInit2().

Second, you may not be getting all the compressed data out after the Z_SYNC_FLUSH, and so not marking the position correctly. (By the way, you don't want to "always" use Z_SYNC_FLUSH. You want to use it only when marking a breakpoint. If you use it too often, compression will be significantly degraded.)

Third, from the description of your application, you may not need to bother with any of this. It sounds like you could simply compress each variable individually, ending with Z_FINISH, and starting a new stream for the next variable.

mark
I don't want to uncompress the entire variable at once. Otherwise I'd
inflate the first 8 bytes to find out the uncompressed size of the
whole variable, then go back and just call uncompress on the entire
compressed variable (which is what I used to do). I will look into using
inflateInit2 instead. I'm using Z_SYNC_FLUSH b/c I want to read in
small bits of the entire compressed stream at a time. I need to read the
entire variable header (and optionally sub-headers if it's a structure,
cell array, or object) but stop before the data, and I don't know how
big the header is b/c it depends on what's in there. Below is a code
snippet of what I'm doing. If you want the complete source, let me know.
matvar->name = NULL;
matvar->data = NULL;
matvar->dims = NULL;
matvar->nbytes = 0;
matvar->data_type = 0;
matvar->class_type = 0;
matvar->data_size = 0;
matvar->mem_conserve = 0;
matvar->compression = 1;
matvar->fpos = fpos;
matvar->z.zalloc = NULL;
matvar->z.zfree = NULL;
matvar->z.opaque = NULL;
/* Matlab uses magic number 56??? */
if (nBytes < 56)
nbytes = nBytes;
else
nbytes=56;
bytesread += fread(comp_buf,1,nbytes,mat->fp);
matvar->z.next_in = comp_buf;
matvar->z.next_out = uncomp_buf;
matvar->z.avail_in = nbytes;
matvar->z.avail_out = 8; /* First uncompress type */
err = inflateInit(&(matvar->z));
if ( err != Z_OK ) {
Scats_Critical("inflateInit returned %d",err);
Scats_MatVarFree(matvar);
break;
}
/* Read Variable tag */
ptr = uncomp_buf;
bytesread += InflateVarTag(mat,matvar,uncomp_buf);
matvar->class_type = *(int*)ptr;
ptr += 4;
nbytes = *(int*)ptr;
if ( matvar->class_type != miMATRIX ) {
Scats_Critical("Uncompressed type not miMATRIX");
for ( i = 0; i < matvar->z.avail_out; i++ )
fprintf(stderr,"%02x ",(int)*(uint8_t *)ptr);
fprintf(stderr,"\n");
/* Skip the rest of this variable's compressed bytes */
fseek(mat->fp,nBytes-bytesread,SEEK_CUR);
Scats_MatVarFree(matvar);
matvar = NULL;
break;
}
/* Inflate Array Flags */
ptr = uncomp_buf;
bytesread += InflateArrayFlags(mat,matvar,uncomp_buf);
/* Array Flags */
if ( *(int *)ptr == miUINT32 ) {
ptr += 8;
array_flags = *(uint32_t*)ptr;
if ( mat->byteswap )
array_flags = int32Swap((int32_t*)&array_flags);
matvar->class_type = (int)(array_flags & miCLASS_T);
matvar->isComplex = (int)(array_flags & miCOMPLEX);
matvar->isGlobal = (int)(array_flags & miGLOBAL);
matvar->isLogical = (int)(array_flags & miLOGICAL);
}
ptr = uncomp_buf;
/* Inflate Dimensions */
bytesread += InflateDimensions(mat,matvar,uncomp_buf);
/* Rank and Dimension */
if ( *(int *)ptr == miINT32 ) {
ptr += 4;
nbytes = (int)*(int32_t*)ptr;
ptr += 4;
matvar->rank = nbytes / 4;
matvar->dims = (int *)malloc(matvar->rank*sizeof(int));
for ( i = 0; i < matvar->rank; i++ ) {
int32_t dim;
dim = ((int32_t*)ptr)[0];
matvar->dims[i] = (int)dim;
ptr += 4;
}
if ( matvar->rank % 2 != 0 )
ptr += 4;
}
/* Inflate variable name tag */
ptr = uncomp_buf;
bytesread += InflateVarNameTag(mat,matvar,uncomp_buf);
/* Name of variable */
if ( *(int*)ptr == miINT8 ) { /* Name not in tag */
int len;
ptr += 4;
len = *(int*)ptr;
if ( len % 8 == 0 )
i = len;
else
i = len+(8-(len % 8));
matvar->name = (char *)malloc(i+1);
/* Inflate variable name */
bytesread += InflateVarName(mat,matvar,matvar->name,i);
matvar->name[len] = '\0';
} else if ( *(int16_t*)ptr == miINT8 &&
*(int16_t*)(ptr+2) | 0x00 ) { /* Name in tag */
int len;
len = (int)*(int16_t*)(ptr+2);
ptr+=4;
matvar->name = (char *)malloc(len+1);
memcpy(matvar->name,ptr,len);
matvar->name[len] = '\0';
}
/*
*-------------------------------------------------------------------
* ZLIB Decompression (Inflate) Routines
*-------------------------------------------------------------------
*/
/*
* Inflate the data until nbytes of uncompressed data has been inflated
*/
static int
InflateSkip(scats_mat_t *mat, SCATS_MATVAR *matvar, int nbytes)
{
uint8_t comp_buf[32],uncomp_buf[32];
int bytesread = 0, err, cnt = 0;
if ( !matvar->z.avail_in ) {
matvar->z.avail_in = 1;
matvar->z.next_in = comp_buf;
bytesread += fread(comp_buf,1,1,mat->fp);
}
matvar->z.avail_out = 1;
matvar->z.next_out = uncomp_buf;
err = inflate(&matvar->z,Z_SYNC_FLUSH);
if ( err != Z_OK ) {
Scats_Critical("InflateSkip: inflate returned %d",err);
return bytesread;
}
if ( !matvar->z.avail_out ) {
matvar->z.avail_out = 1;
matvar->z.next_out = uncomp_buf;
cnt++;
}
while ( cnt < nbytes ) {
if ( !matvar->z.avail_in ) {
matvar->z.avail_in = 1;
matvar->z.next_in = comp_buf;
bytesread += fread(comp_buf,1,1,mat->fp);
}
err = inflate(&matvar->z,Z_SYNC_FLUSH);
if ( err != Z_OK ) {
Scats_Critical("InflateSkip: inflate returned %d",err);
return bytesread;
}
if ( !matvar->z.avail_out ) {
matvar->z.avail_out = 1;
matvar->z.next_out = uncomp_buf;
cnt++;
}
}
return bytesread;
}
/*
* Inflates the variable's tag. buf must hold at least 8 bytes
*/
static int
InflateVarTag(scats_mat_t *mat, SCATS_MATVAR *matvar, void *buf)
{
uint8_t comp_buf[32];
int bytesread = 0, err;
assert(buf != NULL);
if ( !matvar->z.avail_in ) {
matvar->z.avail_in = 1;
matvar->z.next_in = comp_buf;
bytesread += fread(comp_buf,1,1,mat->fp);
}
matvar->z.avail_out = 8;
matvar->z.next_out = buf;
err = inflate(&(matvar->z),Z_SYNC_FLUSH);
if ( err != Z_OK ) {
Scats_Critical("InflateVarTag: inflate returned %d",err);
return bytesread;
}
while ( matvar->z.avail_out && !matvar->z.avail_in ) {
matvar->z.avail_in = 1;
matvar->z.next_in = comp_buf;
bytesread += fread(comp_buf,1,1,mat->fp);
err = inflate(&matvar->z,Z_SYNC_FLUSH);
if ( err != Z_OK ) {
Scats_Critical("InflateVarTag: inflate returned %d",err);
return bytesread;
}
}
return bytesread;
}
/*
* Inflates the Array Flags Tag and the Array Flags data. buf must
* hold at least 16 bytes
*/
static int
InflateArrayFlags(scats_mat_t *mat, SCATS_MATVAR *matvar, void *buf)
{
uint8_t comp_buf[32];
int bytesread = 0, err;
assert(buf != NULL);
if ( !matvar->z.avail_in ) {
matvar->z.avail_in = 1;
matvar->z.next_in = comp_buf;
bytesread += fread(comp_buf,1,1,mat->fp);
}
matvar->z.avail_out = 16;
matvar->z.next_out = buf;
err = inflate(&matvar->z,Z_SYNC_FLUSH);
if ( err != Z_OK ) {
Scats_Critical("InflateArrayFlags: inflate returned %d",err);
return bytesread;
}
while ( matvar->z.avail_out && !matvar->z.avail_in ) {
matvar->z.avail_in = 1;
matvar->z.next_in = comp_buf;
bytesread += fread(comp_buf,1,1,mat->fp);
err = inflate(&matvar->z,Z_SYNC_FLUSH);
if ( err != Z_OK ) {
Scats_Critical("InflateArrayFlags: inflate returned %d",err);
return bytesread;
}
}
return bytesread;
}
/*
* Inflates the Dimensions Tag and the Dimensions data. buf must hold
* at least (8+4*rank) bytes
*/
static int
InflateDimensions(scats_mat_t *mat, SCATS_MATVAR *matvar, void *buf)
{
uint8_t comp_buf[32];
int bytesread = 0, err, rank, i;
assert(buf != NULL);
if ( !matvar->z.avail_in ) {
matvar->z.avail_in = 1;
matvar->z.next_in = comp_buf;
bytesread += fread(comp_buf,1,1,mat->fp);
}
matvar->z.avail_out = 8;
matvar->z.next_out = buf;
err = inflate(&matvar->z,Z_SYNC_FLUSH);
if ( err != Z_OK ) {
Scats_Critical("InflateDimensions: inflate returned %d",err);
return bytesread;
}
while ( matvar->z.avail_out && !matvar->z.avail_in ) {
matvar->z.avail_in = 1;
matvar->z.next_in = comp_buf;
bytesread += fread(comp_buf,1,1,mat->fp);
err = inflate(&matvar->z,Z_SYNC_FLUSH);
if ( err != Z_OK ) {
Scats_Critical("InflateDimensions: inflate returned %d",err);
return bytesread;
}
}
if ( *(int *)buf != miINT32 ) {
Scats_Critical("Reading dimensions expected type miINT32");
return bytesread;
}
rank = ((int *)buf)[1];
if ( rank % 8 != 0 )
i = 8-(rank %8);
else
i = 0;
rank+=i;
if ( !matvar->z.avail_in ) {
matvar->z.avail_in = 1;
matvar->z.next_in = comp_buf;
bytesread += fread(comp_buf,1,1,mat->fp);
}
matvar->z.avail_out = rank;
/* buf is void *, so cast before doing pointer arithmetic */
matvar->z.next_out = (uint8_t *)buf+8;
err = inflate(&matvar->z,Z_SYNC_FLUSH);
if ( err != Z_OK ) {
Scats_Critical("InflateDimensions: inflate returned %d",err);
return bytesread;
}
while ( matvar->z.avail_out && !matvar->z.avail_in ) {
matvar->z.avail_in = 1;
matvar->z.next_in = comp_buf;
bytesread += fread(comp_buf,1,1,mat->fp);
err = inflate(&matvar->z,Z_SYNC_FLUSH);
if ( err != Z_OK ) {
Scats_Critical("InflateDimensions: inflate returned %d",err);
return bytesread;
}
}
return bytesread;
}
static int
InflateVarNameTag(scats_mat_t *mat, SCATS_MATVAR *matvar, void *buf)
{
uint8_t comp_buf[32];
int bytesread = 0, err;
assert(buf != NULL);
if ( !matvar->z.avail_in ) {
matvar->z.avail_in = 1;
matvar->z.next_in = comp_buf;
bytesread += fread(comp_buf,1,1,mat->fp);
}
matvar->z.avail_out = 8;
matvar->z.next_out = buf;
err = inflate(&(matvar->z),Z_SYNC_FLUSH);
if ( err != Z_OK ) {
Scats_Critical("InflateVarNameTag: inflate returned %d",err);
return bytesread;
}
while ( matvar->z.avail_out && !matvar->z.avail_in ) {
matvar->z.avail_in = 1;
matvar->z.next_in = comp_buf;
bytesread += fread(comp_buf,1,1,mat->fp);
err = inflate(&matvar->z,Z_SYNC_FLUSH);
if ( err != Z_OK ) {
Scats_Critical("InflateVarNameTag: inflate returned %d",err);
return bytesread;
}
}
return bytesread;
}
static int
InflateVarName(scats_mat_t *mat, SCATS_MATVAR *matvar, void *buf, int N)
{
uint8_t comp_buf[32];
int bytesread = 0, err;
assert(buf != NULL);
if ( !matvar->z.avail_in ) {
matvar->z.avail_in = 1;
matvar->z.next_in = comp_buf;
bytesread += fread(comp_buf,1,1,mat->fp);
}
matvar->z.avail_out = N;
matvar->z.next_out = buf;
err = inflate(&matvar->z,Z_SYNC_FLUSH);
if ( err != Z_OK ) {
Scats_Critical("InflateVarName: inflate returned %d",err);
return bytesread;
}
while ( matvar->z.avail_out && !matvar->z.avail_in ) {
matvar->z.avail_in = 1;
matvar->z.next_in = comp_buf;
bytesread += fread(comp_buf,1,1,mat->fp);
err = inflate(&matvar->z,Z_SYNC_FLUSH);
if ( err != Z_OK ) {
Scats_Critical("InflateVarName: inflate returned %d",err);
return bytesread;
}
}
return bytesread;
}
cchgroupmail@gmail.com wrote:
> I don't want to uncompress the entire variable at once, ...
> I'm using Z_SYNC_FLUSH b/c I want to read in
> small bits of the entire compression at a time.

You have two options, which are not all that different. For either, decide on an acceptable chunk size for your variables that balances random access speed with compression effectiveness. If you break it up into too many independent chunks, compression will suffer. Too few chunks, and it will take longer to get to the point you need, since you always have to start decompressing at the beginning of a chunk. I have found that around 1 MB of uncompressed data is a good chunk size, and only degrades compression by about 1%. However, that is highly data dependent, so you should experiment with your data.

Given the chunk size, you compress in one of two ways (I assume that you are the person doing the compression here). Either compress each chunk individually, ending each with a deflate(strm, Z_FINISH), and save the start of each stream. Or, requiring a little more care, create a single deflate stream, but end each chunk with deflate(strm, Z_SYNC_FLUSH), and mark the start of the next chunk (one byte after the last byte emitted from the flush).

To decompress the first option, simply use inflate normally. For the second option, use raw inflate (see inflateInit2()) to start decompressing after the flush.

The advantage of the first method is simplicity, and you get an integrity check for each chunk. The advantage of the second method is that you don't have the overhead of a header and trailer (which is insignificant for large chunks), and raw inflate is a little bit faster since it's not calculating a check value. Also, you can more easily cross chunk boundaries if the requested output requires that.

I recommend the first method for simplicity. In fact, even if you want some advantage of the second method, you should probably get the first method working first, and then modify it for the second method.

For either method, you will need to find a place in your data to save those marks where you can start decompression from. The marks should contain both where you can start decompressing from in the compressed data, and what offset in the uncompressed data that decompressed data begins at. For random access, you simply find the largest uncompressed offset less than or equal to where you want data from, and start decompressing from there until you get what you need.

> err = inflate(&matvar->z,Z_SYNC_FLUSH);

You're missing the point of Z_SYNC_FLUSH. Its purpose is to put restart points in the compressed stream, and so it only makes sense when used with deflate(). The flush parameter of inflate() has, to first order, no effect on the operation of inflate().

mark
I am not the one compressing the data. It's being compressed by Matlab. By default in Matlab versions 7+, the Matlab file type (.mat) uses compression. Each variable in the file is compressed individually. Again, more detailed information is on the MathWorks website: http://www.mathworks.com/access/helpdesk/help/pdf_doc/matlab/matfile_format.pdf

I am reading their files in my C code and uncompressing them. I don't know how or where they put flush points. Also, the header is not likely to be very large, probably less than 100 bytes except for structures, etc. I used Z_SYNC_FLUSH b/c according to the manual the flush parameter for inflate is undefined for all values except Z_SYNC_FLUSH and Z_FINISH, and Z_FINISH says it's really for single calls to inflate.

If Matlab does not put a sync point at the beginning of the data (I guess if they just run compress or deflate once on the whole buffer), then you couldn't start decompression from there. This is why I was thinking of using inflateCopy. Using that for the structures, though, I get a sig11:

Program received signal SIGSEGV, Segmentation fault.
0x080572e8 in inflate (strm=0x9df263c, flush=2) at inflate.c:898
898 this = state->lencode[BITS(state->lenbits)];

I'm guessing that BITS(state->lenbits) is exceeding the bounds of lencode, maybe b/c the accumulator has changed?
cchgroupmail@gmail.com wrote:
> I am not the one compressing data. It's being compressed by Matlab. ...
> I don't know how/where they put flush points.

Almost certainly they do not insert any flush points. In that case, in order to decompress data from the middle of the variable, you must start decompressing at the beginning of the variable. There is no way out of that.

The only hope you have is this: having done that once, if you want to go back and get some data before that middle point, you can save time by having used inflateCopy() or some other approach to save some states along the way on that first pass. You can then restart somewhere closer to your desired access point on the second pass, instead of starting from the beginning again. You would need to select the frequency of inflateCopy() calls to balance the speed of random access against the memory requirements of the copies (more than 32K bytes each).

Alternatively, you could cache the decompressed data and access it directly. This whole discussion goes away if the memory to save the decompressed variable is not prohibitive.

Or you could reprocess the file, recompressing the variables yourself with flush points. That may be a win, depending on how many times you need to access parts of variables, and on whether the variables are very large.

> I used Z_SYNC_FLUSH b/c according to the manual the flush
> parameter for inflate is undefined for all values except Z_SYNC_FLUSH
> and Z_FINISH and Z_FINISH says it's really for single calls to inflate.

It does? Here's what it says in zlib.h (apologies in advance if Google messes up the line breaks):

    The flush parameter of inflate() can be Z_NO_FLUSH, Z_SYNC_FLUSH,
    Z_FINISH, or Z_BLOCK. Z_SYNC_FLUSH requests that inflate() flush as much
    output as possible to the output buffer. Z_BLOCK requests that inflate()
    stop if and when it gets to the next deflate block boundary. When decoding
    the zlib or gzip format, this will cause inflate() to return immediately
    after the header and before the first block. When doing a raw inflate,
    inflate() will go ahead and process the first block, and will return when
    it gets to the end of that block, or when it runs out of data.
    ...
    The use of Z_FINISH is never required, but can be used to inform inflate
    that a faster approach may be used for the single inflate() call.

The bottom line is that if you're not trying to do a single inflate call and not trying to decompress a block at a time, then you can use Z_NO_FLUSH or Z_SYNC_FLUSH. As it turns out, they don't behave any differently: inflate always generates as much output as it can with the provided input.

> This is why i was
> thinking of using inflateCopy. Using that though for the structures, i
> get a sig11:
>
> Program received signal SIGSEGV, Segmentation fault.
> 0x080572e8 in inflate (strm=0x9df263c, flush=2) at inflate.c:898
> 898 this = state->lencode[BITS(state->lenbits)];
>
> I'm guessing that BITS(state->lenbits) is exceeding the bounds of
> lencode maybe b/c the accumulator has changed?

The only way to get this error is if the strm you provided to inflate is invalid or corrupted, and the state pointer in that structure is therefore pointing off into la la land.

By the way, and this should be in the documentation: each copy made by inflateCopy() needs to be freed by inflateEnd(). If you don't, you'll end up with a massive memory leak.

mark