| Matt Mahoney 2006-07-25, 6:55 pm |
|
Ben Rudiak-Gould wrote:
> [added comp.compression]
>
> Zeljko Vrba wrote:
>
> This is actually a data compression question. Suppose you take a large
> statistical sample of input strings and group them into equivalence classes
> by hash value. In order to minimize the number of collisions, each
> equivalence class should have about the same number of elements. This is the
> same property that a good data compression algorithm has, if you group by a
> fixed-length substring of the compressed output. Many compression algorithms
> violate your ordering constraint, but order-preserving compression is not a
> hard problem. The best way to construct your hash function (ignoring
> performance issues) is to take the first N bits of the output of a good
> order-preserving compressor applied to the input string.
>
> Note that it's unavoidable that your hash values will only depend on the
> first few bytes of the input for most inputs. E.g. for English text you
> can't do better than about 8:1 compression, which means that an N-bit hash
> will typically depend on only the first 8N bits of the input.
>
> If your input strings are something like ASCII text, and you want a
> reasonably fast hash function, you could try a static order-0 Huffman model.
>
> Question for comp.compression people (since I'm curious): do any of the
> algorithms mentioned at maximumcompression.com preserve lexicographic order?
>
> -- Ben
Most of the PAQ algorithms, when used without preprocessing transforms,
preserve lexicographic order, or in some versions, reverse
lexicographic order. FPAQ0 does also, using a simpler model.
-- Matt Mahoney
|