Home > Archive > Compression > February 2005 > Characterizing Source Data for Data Compression
You are viewing an archived Text-only version of the thread.
To view this thread in it's original format and/or if you want to reply to
this thread please [click here]
| Author |
Characterizing Source Data for Data Compression
|
|
| george 2005-02-05, 3:55 pm |
| Data Compression algorithm used today, consists of a number of
sub-algorithms. Each sub-block's
algorithm generates a suitable data for the next block. This continues
until the desired
compression is needed.
"However the flow path is to be modified for a different type of the
source data or depending on
the result of a sub-block, Certain blocks are not selected sometimes"
Considering a frequency count of the symbols present in the source
data, the frequency count is
same irrespective of its arrangement. (i.e., Though the location of the
data is changed the
occurrences count will be the same) .
Considering the "Loss less Compression" in which preservation of
the arrangement is very important,
may use algorithms change the order of the arrangement to bring
redundancy in the source data and
restores the arrangement to original on the reverse phase (i.e.,
Decompression)
Questions:
1. How the source data may be characterized?
2. How the source data may be characterized against these algorithms to
choose which algorithm to be used first.
3. How Sub Algorithms will be chosen to achieve the best compression.
| |
| Matt Mahoney 2005-02-05, 8:55 pm |
| george wrote:
> Data Compression algorithm used today, consists of a number of
> sub-algorithms. Each sub-block's
> algorithm generates a suitable data for the next block. This
continues
> until the desired
> compression is needed.
>
> "However the flow path is to be modified for a different type of the
> source data or depending on
> the result of a sub-block, Certain blocks are not selected sometimes"
>
> Considering a frequency count of the symbols present in the source
> data, the frequency count is
> same irrespective of its arrangement. (i.e., Though the location of
the
> data is changed the
> occurrences count will be the same) .
>
> Considering the "Loss less Compression" in which preservation of
> the arrangement is very important,
> may use algorithms change the order of the arrangement to bring
> redundancy in the source data and
> restores the arrangement to original on the reverse phase (i.e.,
> Decompression)
>
> Questions:
> 1. How the source data may be characterized?
> 2. How the source data may be characterized against these algorithms
to
> choose which algorithm to be used first.
> 3. How Sub Algorithms will be chosen to achieve the best compression.
Compression is not based on algorithms and data flow, but on choosing a
representation. You have a model, or assumed probability distribution
P over the set of all strings, x. Your goal is first to estimate P
accurately (it is normally unknown), and second to choose a mapping
such that the code for x has length approximately log2 1/P(x) bits (the
Shannon limit).
The most common algorithms, from fastest to slowest, and from worst to
best compression, are LZ, BW, PPM and context mixing.
LZ (Limpel-Ziv) is used by compress, pkzip and gzip. The idea is to
replace repeated substrings in the input with pointers to the previous
occurrences in the input buffer (LZ77) or dictionary (LZW). The
pointer and length of the match are usually Huffman coded. A Huffman
code is a variable bit length code such that the most common symbols
have the shortest codes, and no code is a prefix of any other code.
BW (Burrows-Wheeler) is used in bzip2 and GRZipII. The idea is to sort
the input bytes by context to bring similar bytes together (the
transform is reversible), then assign shorter codes to bytes that have
occurred recently in the transformed stream. These can be Huffman
coded or arithmetic coded. An arithmetic code codes a string x as a
high precision fixed point number between P(< x) and P(<= x) where <
and <= refer to the lexicographical ordering over x. Arithmetic coding
is more compact than Huffman and can be efficiently computed
incrementally.
PPM (prediction by partial match) assigns a probability to each symbol
in x using the distribution of symbols found where the immediately
preceding context matches, trying the longest matches first. PPM
variants differ in how they estimate probabilities of unseen symbols
when falling back to shorter matches (the escape probability). The
escape probability is learned adaptively in PPMZ-like algorithms (slim,
durilca, RK, ppmonstr). The output is arithmetic coded.
Context mixing differs from PPM in that the input is treated as a bit
(not byte) stream and all of the predictions by different length
contexts are combined using a weighted sum of 0 and 1 counts. The
weights are adjusted during compression to minimize prediction errors.
The improved compression comes from the ability to model noncontiguous
contexts, such as the preceding row in a bitmap image or spreadsheet,
and the ability to mix predictions from large numbers of independent
models and secondary symbol estimators (SSE) organized into a tree or
lattice. SSE adaptively maps a probability and context into a new,
more accurate probability. Context mixing is used in all the PAQ
variants and in WinRK/PWCM.
-- Matt Mahoney
| |
| george 2005-02-07, 3:55 pm |
| I agree with your statement.
However the probability distribution of source data many not fall into
the Normal distribution, for which Entropy coders work the best.
To achieve such a distribution, the data is manipulated some way such
as transformation based or prediction or of some other kind.
"I am interested in finding when the data should be applied to
transformation based algorithm such as Wavelet or predication based, of
course in both of which the transformation Co-efficients or errors in
prediction follow Normal distribution"
Finally the data follows the Normal distribution when given to any
Entropy Codes such as Arithmetic Coder or Range Coder Compression is
achieved.
I hope now the objective of the post is clear.
George Wilson
| |
| Matt Mahoney 2005-02-08, 3:55 am |
| george wrote:
> I agree with your statement.
> However the probability distribution of source data many not fall
into
> the Normal distribution, for which Entropy coders work the best.
An entropy coder does not work better with a Normal distribution. It
works with any distribution. An arithmetic coder comes within 1 bit of
the Shannon limit. This part of the problem is solved. The hard part
is estimating the distribution accurately.
-- Matt Mahoney
|
|
|
|
|