Home > Archive > Compression > November 2005 > LZMA parameters
You are viewing an archived Text-only version of the thread.
To view this thread in it's original format and/or if you want to reply to
this thread please [click here]
|
|
| Nicolas \Nescafe\ Le Gland 2005-11-16, 9:55 pm |
| Hi.
Not being an expert in data compression but a daily comp.compression reader,
I was wondering how many of you have had a try at the 7zip SDK and more
specifically at the LZMA command line compressor.
I am interrested in the LZMA algorithm as the final encoding step of a
homemade family of multimedia formats, a bit like JPEG's Huffman coder, so I
don't really care about the filters and other compression schemes in the 7z
format. My goal is to identify the meaning of the encoder settings to make a
decision about which value to use.
I was wondering if someone had a clear explanation of what the parameters of
the LZMA encoder do. Igor Pavlov's answers I found on the SourceForge forum
are just not detailed enough for me to understand the impact of this or that
value.
-a{N} Compression mode [0, 2] : 0 is fast, 2 compresses better. Can't find
if it has something to do with a number of passes, which I doubt.
-d{N} Dictionary size [0, 28] : I got that one pretty easily. It's a kind of
table of size 2^N containing previous sequences. The bigger the better,
whatever the file to compress is. A dictionary two times bigger than the
file itself doesn't help but the decoding process will require that size to
be allocated for decoding (seems to be an implementation need, as the size
is defined in the headers but it does not change the compressed stream over
that limit).
-fb{N}Fast bytes [5, 255] or maybe [5, 272] : According to Igor Pavlov
himself on the SF forum, "Match finder checks only this number of bytes". I
don't think this has anything to do with windowing, but maybe with the upper
limit of the size of a match. The bigger the better, whatever the file is,
according to my tests. It also slows down the compression but I don't care
about that.
-lc{N}Literal Context bits [0, 8] : Defaulted to 3, it is documented that
"sometimes lc=4 gives gain for big files". I'm still wondering what that is
and what a bigger value would mean.
-lp{N} Literal Pos bits [0, 4] : Looks like it's intended for adjusting to
periodical data with a 2^N byte period. Seems clear that for example 32 bits
aligned data should take advantage of "-lp2". By default set to 0, it's said
that any other value should be used with "-lc0". I'm still wondering why.
-pb{N} Pos Bits [0, 4] : Said to be "intended for periodical data when
period is equal 2^N". What about "literal" position bits then ? Have "-pb"
and "-lp" always to be the same ?
-mf{MF_ID}Match Finder : The match finder algorithm seems to be the next
most important setting after the dictionary size. When dealing with small
files, binary trees seem to be enough. Patricia trees maybe are better
suited for really large files. If anyone had an rough idea of how big and
why, it would be nice. Hash chains seem to really fast but "miss a lot of
matches". If these are deterministic, I am wondering why a "brute force
match finder" that just searches the whole file isn't available for small
ones.
Thank for any help on any of those parameters, or link to a well documented
LZMA analysis.
Nicolas "Nescafe" Le Gland
- Every comprehension mistakes are solely due to my bad english.
| |
| Malcolm Taylor 2005-11-16, 9:55 pm |
| Hi Nicolas,
> -a{N} Compression mode [0, 2] : 0 is fast, 2 compresses better. Can't find
> if it has something to do with a number of passes, which I doubt.
Don't know, and am curious myself :)
> -fb{N}Fast bytes [5, 255] or maybe [5, 272] : According to Igor Pavlov
> himself on the SF forum, "Match finder checks only this number of bytes". I
> don't think this has anything to do with windowing, but maybe with the upper
> limit of the size of a match. The bigger the better, whatever the file is,
> according to my tests. It also slows down the compression but I don't care
> about that.
This one is hard to explain... To my knowledge (please correct me if I
am wrong), this refers to the optimal parsing algorithm. The algorithm
tries many different combinations of matches to find the best one. If a
match is found that is over the fb value, then it will not be optimised,
and will just be used straight.
This speeds up corner cases such as pic.
> -lc{N}Literal Context bits [0, 8] : Defaulted to 3, it is documented that
> "sometimes lc=4 gives gain for big files". I'm still wondering what that is
> and what a bigger value would mean.
The context for the literal coder is 2^(lc) long. The longer it is, the
better the statistics, but also the slower it adapts. A tradeoff, which
is why 3 or 4 is reccommended.
> -lp{N} Literal Pos bits [0, 4] : Looks like it's intended for adjusting to
> periodical data with a 2^N byte period. Seems clear that for example 32 bits
> aligned data should take advantage of "-lp2". By default set to 0, it's said
> that any other value should be used with "-lc0". I'm still wondering why.
This allows you to add some bits from the position to the literal
context. Again, it will reduce the adaption speed of the literal model,
and so most of the time will hurt compression.
> -pb{N} Pos Bits [0, 4] : Said to be "intended for periodical data when
> period is equal 2^N". What about "literal" position bits then ? Have "-pb"
> and "-lp" always to be the same ?
IIRC this refers to the match flag encoding for MRU matches, allowing
them to have a few bits of the position as context. Again, only useful
in very periodic files.
> -mf{MF_ID}Match Finder : The match finder algorithm seems to be the next
> most important setting after the dictionary size. When dealing with small
> files, binary trees seem to be enough. Patricia trees maybe are better
IIRC, in theory, a binary tree or a patricia tree should be perfect (ie.
find all possible matches). However in practice, the implementations
probably have limits to depth in order to avoid worst case performance
(eg. pic), and so the various match finders will perform differently in
different situations.
A general rule though:
Pick a hash table if using the fastest, unoptimised parsing mode
(perhaps this is -a0??), as this will help you get a faster result.
Otherwise pick bt4 or similar as the binary tree tends to give the best
speed results over a wide range of files.
Hash tables are useless for optimal parsing, because they do not provide
many options to the optimiser. They also have bad worst case behaviour
when matching to every byte.
Malcolm
| |
| Ignorant 2005-11-18, 7:55 am |
|
I can comment on the back end codec. only as that was the part i
could understand.
It is a nicely implemented range codec.(Arithmetc coder..).
|
|
|
|
|