Code Comments

enhanced fast_gztell/fast_gzseek idea
Random-access seeking in a compressed file can be slow, since it
usually requires rewinding and decompressing from the beginning of the
file to the requested location.

That's necessary if you haven't already decompressed the file up to
the requested location, but with an enhanced fast_gztell/fast_gzseek
pair, it should be possible to produce an 'offset' that can get you
back to a pre-decompressed location that's relative to the start of
the last compression block.  Then seeking back to that 'offset' could
be done via a raw seek to the start of the compression block followed
by decompression up to the offset within the block.

Something like this:

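#include <sys/types.h>  /* off_t */

typedef struct {
    off_t   start_of_last_compression_block;  /* raw file offset of the block */
    off_t   offset_within_block;              /* uncompressed bytes into the block */
} GZ_OFFSET;

int fast_gztell(GZ_OFFSET *offset);  /* save the current position in *offset */
int fast_gzseek(GZ_OFFSET *offset);  /* return to a position saved earlier */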

Seems like this would be fairly easy to implement.  Does anybody think
it's worth it?

Rob


Rob Y
08-14-07 11:56 PM


Re: enhanced fast_gztell/fast_gzseek idea
Rob Y <ryampolsky@gmail.com> writes:
> Random-access seeking in a compressed file can be slow, since it
> usually requires rewinding and decompressing from the beginning of the
> file to the requested location.
>
> That's necessary if you haven't already decompressed the file up to
> the requested location, but with an enhanced fast_gztell/fast_gzseek
> pair, it should be possible to produce an 'offset' that can get you
> back to a pre-decompressed location that's relative to the start of
> the last compression block.  Then seeking back to that 'offset' could
> be done via a raw seek to the start of the compression block followed
> by decompression up to the offset within the block.
>
> Something like this:
>
> typedef struct {
>       off_t   start_of_last_compression_block;
>       off_t   offset_within_block;
> } GZ_OFFSET;
>
> int fast_gztell(GZ_OFFSET *offset);
> int fast_gzseek(GZ_OFFSET *offset);
>
> Seems like this would be fairly easy to implement.  Does anybody think
> it's worth it?


If your intention is to return to a place where you've
previously been, then surely the easiest thing to do is
simply to clone the whole state? Throw away the one you
no longer want to use. I have no idea how practical that
would be.

Phil
--
Dear aunt, let's set so double the killer delete select all.
-- Microsoft voice recognition live demonstration

Phil Carmody
08-14-07 11:56 PM


Re: enhanced fast_gztell/fast_gzseek idea
Rob Y wrote:

> Random-access seeking in a compressed file can be slow, since it
> usually requires rewinding and decompressing from the beginning of the
> file to the requested location.
>
> That's necessary if you haven't already decompressed the file up to
> the requested location, but with an enhanced fast_gztell/fast_gzseek
> pair, it should be possible to produce an 'offset' that can get you
> back to a pre-decompressed location that's relative to the start of
> the last compression block.  Then seeking back to that 'offset' could
> be done via a raw seek to the start of the compression block followed
> by decompression up to the offset within the block.

In a general model, one can track all decompressed blocks, with their
physical and relative (decompressed) offsets. Then a backward seek can
start with the block that contains the given offset.
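
A minimal sketch of that lookup, assuming the reader appends one entry per
block to a sorted array as it decompresses. The names block_entry and
find_block are made up for illustration and are not part of zlib. For a
deflate stream an entry would need more than the two offsets shown here
before decompression could actually restart at it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record for one block that has already been decompressed. */
typedef struct {
    long comp_off;    /* physical offset of the block in the compressed file */
    long uncomp_off;  /* offset of the block's first byte in the uncompressed data */
} block_entry;

/* Binary search over entries sorted by ascending uncomp_off.
   Returns the index of the last entry with uncomp_off <= target,
   or -1 if the target lies before every entry. */
static long find_block(const block_entry *idx, size_t n, long target)
{
    size_t lo = 0, hi = n;              /* candidate range is [lo, hi) */

    assert(idx != NULL || n == 0);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (idx[mid].uncomp_off <= target)
            lo = mid + 1;               /* entry starts at or before target */
        else
            hi = mid;                   /* entry starts after target */
    }
    return (long)lo - 1;
}
```

A backward seek would then do a raw seek to idx[i].comp_off and decompress
target - idx[i].uncomp_off bytes to reach the requested position.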

DoDi

Hans-Peter Diettrich
08-14-07 11:56 PM


Re: enhanced fast_gztell/fast_gzseek idea
On Aug 14, 8:06 am, Rob Y <ryampol...@gmail.com> wrote:
> That's necessary if you haven't already decompressed the file up to
> the requested location, but with an enhanced fast_gztell/fast_gzseek
> pair, it should be possible to produce an 'offset' that can get you
> back to a pre-decompressed location that's relative to the start of
> the last compression block.  Then seeking back to that 'offset' could
> be done via a raw seek to the start of the compression block followed
> by decompression up to the offset within the block.
>
> Something like this:
>
> typedef struct {
>       off_t   start_of_last_compression_block;
>       off_t   offset_within_block;
>
> } GZ_OFFSET;

This is done in examples/zran.c in the zlib distribution.  It takes a
little more state information than noted.

Mark


Mark Adler
08-14-07 11:56 PM


Re: enhanced fast_gztell/fast_gzseek idea
> This is done in examples/zran.c in the zlib distribution.  It takes a
> little more state information than noted.
>
> Mark

I figured it would be easy enough to do - maybe I'll clone the zran.c
example.

My note was just a schematic for an API.  The suggestion was to
provide a supported API in zlib that does whatever it takes to
efficiently support a tell/seek pair.  Whatever info is required would
be filled into the GZ_OFFSET structure - kind of like a setjmp/longjmp
pair for random access to a zipped file.

It looks like zran.c pre-indexes the entire file setting up a
reasonable array of access points.  These seem to take a fairly large
structure to represent, including a 32K data 'window' to seed the
inflate dictionary.  I guess this is where I don't understand the zlib
internals enough to know what's involved.  It looks like an access
point consists of the offset to a compressed block in the file *plus*
whatever data is left over from the prior block, because it hasn't
been consumed yet at the 'fast_gztell' point so some data at
fast_gztell+x might need to come from that window instead of the next
block, right?  This requires zran.c to prescan the file.

I was looking for something conceptually simpler.  Since my app knows
exactly what access points it wants, I'd call the 'fast_gztell'
function to return me one.  I guess zran.c could be cannibalized to
support this, but I was just assuming that, since zlib needs to
effectively prescan the file up to any particular gztell point, it
could maintain the access point and window data automatically while
reading and just return it to the caller when asked for.

I'm guessing that the zlib code just isn't structured in a way that
lets it easily return one of these 'access point' structs.  So zran.c
has to scan the entire file and construct them itself.  The code in
zran.c to actually reset to an access point looks pretty
straightforward, though.  Or is zran.c just a cute example to show you
how to use the low-level inflateInit2, inflatePrime,
inflateSetDictionary functions, independent of how hard it would be to
implement a fast_gztell?

Thanks,
Rob



Rob Y
08-16-07 11:56 PM


Re: enhanced fast_gztell/fast_gzseek idea
>>This is done in examples/zran.c in the zlib distribution.  It takes a 

> I was looking for something conceptually simpler.  Since my app knows
> exactly what access points it wants ...

As long as those access points are few, then use Z_FULL_FLUSH
and regular *tell/*seek.

--

John Reiser
08-16-07 11:56 PM


Re: enhanced fast_gztell/fast_gzseek idea
On Aug 16, 10:15 am, Rob Y <ryampol...@gmail.com> wrote:
> It looks like zran.c pre-indexes the entire file setting up a
> reasonable array of access points.

Yes, but you can do the same thing on demand instead of a priori.  When
an access point is requested well after the last access point
requested, you can decompress from the last access point to the
current request, building indices as you go.

> These seem to take a fairly large
> structure to represent, including a 32K data 'window' to seed the
> inflate dictionary.  I guess this is where I don't understand the zlib
> internals enough to know what's involved.

This is a natural characteristic of any compression program.  Just
about all compression software uses the uncompressed data up to point
X in order to provide historical context, e.g. strings, statistics,
correlations, etc., which makes it possible to compress the data that
follows point X.  (The inherent assumption of compression is that the
stuff after X is like the stuff before X.)

Different compressors have different amounts of history used.  The
deflate format only uses 32K of history.  As a result, part of the
state information for decompression of the deflate format is the 32K
of uncompressed data immediately preceding the data about to be
decompressed.  Some compressors automatically reset the history
periodically, e.g. BWT compressors.  Deflate compressors can reset the
history when requested (see below), but in general they do not do this
by default.

> I was looking for something conceptually simpler.

zran was written for the situation where you want random access to a
very large deflate stream that you did not create, and you want to
make that access rapid over separate invocations of the application.
For that purpose, zran creates states that can be saved to an index
file, and used again later for the same deflate stream.

An alternative for random access to a stream that you did not create,
within a single invocation, is to use the inflateCopy() routine of
zlib.  That copies the entire internal zlib inflate state (not
surprisingly, on the order of 32K -- as I recall, less than 48K).  You
can then stop anywhere while decompressing, copy the state, and then
be able to go back there later.  The downside is the state has a bunch of
pointers in it, so you can't save it to a file and use it in another
invocation.

Lastly, all of that was assuming that you didn't make the deflate
stream.  If you are the one making the deflate stream, or if you can
reprocess the one you got, then you can put in historyless flush
points at byte boundaries using Z_FULL_FLUSH (as pointed out by
Reiser).  You can then go to those locations (and only those
locations) to start decompressing without memory of the previous 32K
of uncompressed data.  All you'd need are the byte offsets in that
case.

In general, resetting the history will reduce the compression ratio,
since on average there's less information to go on.  However, for
deflate if you stick in these points every 1MB, or further separated,
the impact will likely be minimal, less than a percent.

Mark


Mark Adler
08-16-07 11:56 PM


Re: enhanced fast_gztell/fast_gzseek idea
Wow.  Thanks for all the info.  It even looks like inflateCopy() does
exactly what I was asking for.  I'll give it a try.


Rob Y
08-20-07 11:56 PM

