Home > Archive > Compression > September 2004 > Zlib speed optiimization
You are viewing an archived Text-only version of the thread.
To view this thread in it's original format and/or if you want to reply to
this thread please [click here]
| Author |
Zlib speed optiimization
|
|
| Steve Heller 2004-09-12, 8:55 pm |
| Hi all,
I'm quite impressed with zlib's tradeoff between speed and
compression. But currently I need the utmost speed compatible with a
reasonable level of compression; in other words, speed is the priority
and compression efficiency secondary. Memory use could be fairly
liberal, e.g., a few megabytes of extra memory consumption would be no
problem if it would help the speed.
I need to use zlib or another "freely available" program (not
including ones licensed under GPL, unfortunately) to avoid licensing
problems in certain European countries.
The application is to compress a large number (millions or tens of
millions) of blocks of fixed length (typically 4K) that have to be
able to be decompressed in any order, but most commonly in the same
order in which they were compressed.
Is there anyone here who could suggest how to improve the speed of
zlib beyond the existing maximum speed optimizations?
Here is what I have done to get the highest throughput so far:
1. Reduced window_bits to 13
2. Reduced mem_level to 5
3. #defined FASTEST
4. Implemented my own memory allocator
The first two of these settings resulted from experimentation, and the
motivation for the others should be pretty obvious.
If anyone wants to actually do the coding, some pay might be
available. In this case, let me know what you would be willing to do,
and how much money you would want.
Of course, any improvements would be made available to the public.
Steve
| |
| John Reiser 2004-09-12, 8:55 pm |
| > Is there anyone here who could suggest how to improve the speed of
> zlib beyond the existing maximum speed optimizations?
The time taken to pack the variable-width tokens into the output buffer
can be reduced by using machine-specific code. In the case of gzip, and
depending on the data and parameters, the speedup can be worth the effort.
Similar changes are definitely worthwhile (upto 20% to 45% overall CPU time)
for decompression. http://bitwagon.com/ftp/gzip_x86-0.91.tgz (GPLv2).
--
| |
| Mark Adler 2004-09-13, 3:55 am |
| Steve Heller <steve@steveheller.com> wrote in message news:<f7a9k0t2cplkgp8718uujeipqp9chs51rs@4ax.com>...
> But currently I need the utmost speed compatible with a
> reasonable level of compression; in other words, speed is the priority
> and compression efficiency secondary.
You could try looking at LZO to see if that meets your needs:
http://www.oberhumer.com/opensource/lzo/
mark
| |
| Johann Burkard 2004-09-13, 8:55 am |
| Steve Heller wrote:
> I'm quite impressed with zlib's tradeoff between speed and
> compression. But currently I need the utmost speed compatible with a
> reasonable level of compression;
I found LZRW [1] to be well implemented and easily reusable.
[1] <http://www.ross.net/compression/>
Johann
--
Biiiiiiiiiiiirgit?!? Betest du für uns? hrhr...
(*Tönnes ist echt witzig in
<f4bb5dfc.0401052316.1d6c9165@posting.google.com> )
| |
| John Reiser 2004-09-13, 3:55 pm |
| > Memory use could be fairly
> liberal, e.g., a few megabytes of extra memory consumption would be no
> problem if it would help the speed.
For somewhere between 2MB and about 128MB or so, you can deal directly
with the substrings of length 3: 256**3 is 16M, and each element
of the array might occupy somewhere between 1 bit and several bytes.
Many compressors, including zlib, use a search that begins with hashing
and/or a data structure based on substrings of length 2. The search
is often the most expensive part of compression, and perhaps using
direct indexing based on substrings of length 3 can speed up the
search. [128MB of DRAM used to be unavailable or very expensive.]
Beware of demand paging and cache misses.
--
| |
| Hannah Schroeter 2004-09-14, 3:55 pm |
| Hello!
Mark Adler <madler@alumni.caltech.edu> wrote:
>Steve Heller <steve@steveheller.com> wrote in message
>news:<f7a9k0t2cplkgp8718uujeipqp9chs51rs@4ax.com>...
[color=darkred]
>You could try looking at LZO to see if that meets your needs:
> http://www.oberhumer.com/opensource/lzo/
IIRC the OP excluded GPL (or similar) licensing, which LZO is alas.
Perhaps the OP should just check whether zlib with at most parameter
tweaks is perhaps fast enough.
Kind regards,
Hannah.
| |
| Anthony J Bybell 2004-09-14, 3:55 pm |
| Steve Heller <steve@steveheller.com> wrote in message news:<f7a9k0t2cplkgp8718uujeipqp9chs51rs@4ax.com>...
> The application is to compress a large number (millions or tens of
> millions) of blocks of fixed length (typically 4K) that have to be
> able to be decompressed in any order, but most commonly in the same
> order in which they were compressed.
http://www.aha.com/show_prod.php?id=30 sounds like the fastest single
threaded low latency solution out there. Afternatively, if you have
compute facilities in place over a network of workstations connected
by NFS/AFS/whatever, that could provide a fairly good speedup:
1) break up data into a series of consecutive blocks (for efficiency
reasons)
2) various machines compress different chunks
3) data is reassembled by some central authority, block indexed, etc.
Note that I do the *reverse* of this for the reading/decompression of
gtkwave VZT databases and such a trick may help your read access
times:
1) data exists in offset indexed blocks in the database
2) blocks are prefetched as needed and decompressed across various
pthreads
3) pointers to decompressed blocks are made visible in the supervising
thread when the data is finished decompressing
I deliberately partitioned the database in order to allow this. On
multiprocessor machines with decent dcache-memory bandwidth (Opteron),
the speedup is excellent. On uniprocessor machines there is still
speedup as the prefetching/decompression can occur when gtkwave's main
thread (the GUI) is idle due to a lack of user intervention.
BTW, you do say tens of millions of blocks. If you're using them,
keep in mind that gzs () and gztell() will require a little bit of
tweaking with flags and such to work with > 2GB data. You might need
to use fs o(), etc on your system; I don't know.
-t
| |
| Steve Heller 2004-09-16, 3:55 am |
| Johann Burkard <johannburkard@nexgo.de> wrote:
>Steve Heller wrote:
>
>I found LZRW [1] to be well implemented and easily reusable.
>
>[1] <http://www.ross.net/compression/>
Thank you for the pointer. Unfortunately it is also unusable for my
purposes. I need an algorithm with no patent or licensing problems.
Steve
| |
| Steve Heller 2004-09-16, 3:55 am |
| John Reiser <jreiser@BitWagon.com> wrote:
>
>For somewhere between 2MB and about 128MB or so, you can deal directly
>with the substrings of length 3: 256**3 is 16M, and each element
>of the array might occupy somewhere between 1 bit and several bytes.
>Many compressors, including zlib, use a search that begins with hashing
>and/or a data structure based on substrings of length 2. The search
>is often the most expensive part of compression, and perhaps using
>direct indexing based on substrings of length 3 can speed up the
>search. [128MB of DRAM used to be unavailable or very expensive.]
>Beware of demand paging and cache misses.
Do you know where I can get an explanation of the way hashing is used
in zlib? The implementation I have uses some quite unobvious macros.
Steve
| |
| Steve Heller 2004-09-16, 3:55 am |
| hannah@schlund.de (Hannah Schroeter) wrote:
>Perhaps the OP should just check whether zlib with at most parameter
>tweaks is perhaps fast enough.
As I mentioned in my original post, I've already done that.
Steve
| |
| Steve Heller 2004-09-16, 3:55 am |
| bybell@rocketmail.com (Anthony J Bybell) wrote:
>Steve Heller <steve@steveheller.com> wrote in message news:<f7a9k0t2cplkgp8718uujeipqp9chs51rs@4ax.com>...
>
>
>http://www.aha.com/show_prod.php?id=30 sounds like the fastest single
>threaded low latency solution out there.
Thanks, but this is for use in a backup program to be sold
commercially, and that needs to work with laptops. So an expensive PCI
bus card won't really help us.
>Afternatively, if you have
>compute facilities in place over a network of workstations connected
>by NFS/AFS/whatever, that could provide a fairly good speedup:
This also won't help. for the same reasons.
....
>BTW, you do say tens of millions of blocks. If you're using them,
>keep in mind that gzs () and gztell() will require a little bit of
>tweaking with flags and such to work with > 2GB data. You might need
>to use fs o(), etc on your system; I don't know.
That's not a problem, as I'm storing and retrieving the blocks of data
in my own custom database which can handle considerably more than 2
GB.
One possible speed optimization I'm considering is to concatenate
several blocks before compressing or decompressing, as there seems to
be a significant amount of overhead in starting and finishing.
Unfortunately, that will complicate the storage and retrieval
algorithms considerably.
Steve
| |
| Mark Adler 2004-09-16, 3:55 am |
| John Reiser <jreiser@BitWagon.com> wrote in message news:<ci4hgq0297h@enews4.newsguy.com>...
> Many compressors, including zlib, use a search that begins with hashing
> and/or a data structure based on substrings of length 2.
Actually, zlib hashes on strings of length 3.
mark
| |
| Mark Adler 2004-09-16, 4:01 pm |
| Steve Heller <steve@steveheller.com> wrote in message news:<9a1ik0dvafnhkv7qfm932rnuneshcin0om@4ax.com>...
> One possible speed optimization I'm considering is to concatenate
> several blocks before compressing or decompressing, as there seems to
> be a significant amount of overhead in starting and finishing.
Are you using deflateReset() (and inflateReset())? You should only
have to call deflateInit() or deflateInit2() once in your application
and then use deflateReset() when you want to start a new stream. The
purpose of deflateReset() is to reduce significantly the restart
overhead.
There is a compression size penalty in doing small chunks versus large
chunks, but above about 1 MB chunks, the penalty is typically less
than 1%.
mark
| |
| Matt Mahoney 2004-09-16, 9:02 pm |
|
"Steve Heller" <steve@steveheller.com> wrote in message
news:iuohk05k3ftnf1rurru4haaca61qn3j9q3@
4ax.com...
> Johann Burkard <johannburkard@nexgo.de> wrote:
>
>
> Thank you for the pointer. Unfortunately it is also unusable for my
> purposes. I need an algorithm with no patent or licensing problems.
The LZ78 and LZW patents (LZRW5) have expired. You'll need to check if
4,814,746 applies. In any case, I don't think any of the algorithms
(including the unpublished SAKDC) compressed any better than zlib.
-- Matt Mahoney
| |
| Anthony J Bybell 2004-09-17, 8:55 pm |
| Steve Heller <steve@steveheller.com> wrote in message news:<9a1ik0dvafnhkv7qfm932rnuneshcin0om@4ax.com>...
> One possible speed optimization I'm considering is to concatenate
> several blocks before compressing or decompressing, as there seems to
> be a significant amount of overhead in starting and finishing.
> Unfortunately, that will complicate the storage and retrieval
> algorithms considerably.
The accesses that go on in the client don't necessarily have to
reflect the geometry of your database. Can't you add a buffer cache
layer between your R/W API and the client which does the
buffering/conversion and makes accesses appear to be 4k in size to the
client? In general (not maybe your case), that often can be the
cleanest solution: let the middleman handle the split/join ops in
order to simplify code in the client. (Think of layered networking
models.) The bonus also is that upgrades to the database (different
buffer sizes/compression algs) can be performed transparent to the
client which still thinks that it's accessing the world as
individually addressable 4k blocks.
YMMV, without knowing specifically what you're doing it's hard to
comment.
-t
| |
| Steve Heller 2004-09-17, 8:55 pm |
| bybell@rocketmail.com (Anthony J Bybell) wrote:
>Steve Heller <steve@steveheller.com> wrote in message news:<9a1ik0dvafnhkv7qfm932rnuneshcin0om@4ax.com>...
>
>
>The accesses that go on in the client don't necessarily have to
>reflect the geometry of your database. Can't you add a buffer cache
>layer between your R/W API and the client which does the
>buffering/conversion and makes accesses appear to be 4k in size to the
>client? In general (not maybe your case), that often can be the
>cleanest solution: let the middleman handle the split/join ops in
>order to simplify code in the client. (Think of layered networking
>models.) The bonus also is that upgrades to the database (different
>buffer sizes/compression algs) can be performed transparent to the
>client which still thinks that it's accessing the world as
>individually addressable 4k blocks.
Yes, that is one solution we have considered. However, in our
particular situation it has significant drawbacks besides code
complexity, which I'm not really at liberty to discuss. That's why I
have been trying to figure out how to speed up the zlib code even at
the expense of some compression. E.g., perhaps there would be a way to
load fewer strings into the hash table, which would of course mean
that compression opportunities would be lost. Or maybe there is a way
to reduce the time taken in the Huffman coding, again at the expense
of compression efficiency. The problem is that I don't really know
where the time is going, and don't understand the details of the
implementation.
>YMMV, without knowing specifically what you're doing it's hard to
>comment.
Of course. Thanks for your comments.
Steve
| |
| John Reiser 2004-09-30, 3:55 pm |
| > I'm quite impressed with zlib's tradeoff between speed and
> compression. But currently I need the utmost speed compatible with a
> reasonable level of compression ...
> The application is to compress a large number (millions or tens of
> millions) of blocks of fixed length (typically 4K) ...
"Get a better compiler." Tailor the instructions that get executed
to the specifics of the hardware and the actual runtime environment.
By constructing a special-purpose compiler, including a runtime code
generator, I have saved 7% to 19% CPU time on i686 with essentially
no changes to the algorithms of zlib-1.2.1.
--
|
|
|
|
|