| Ben Rudiak-Gould 2006-07-24, 6:55 pm |
| [added comp.compression]
Zeljko Vrba wrote:
> Is there a hash function which preserves ordering of the data? I.e. given any
> two strings s1, s2 with s1 <= s2, and a hash function H, then H(s1) <= H(s2)
> must also hold. (Here, <= is the usual lexicographical ordering on raw bytes).
This is actually a data compression question. Suppose you take a large
statistical sample of input strings and group them into equivalence classes
by hash value. In order to minimize the number of collisions, each
equivalence class should have about the same number of elements. This is the
same property that a good data compression algorithm has, if you group by a
fixed-length substring of the compressed output. Many compression algorithms
violate your ordering constraint, but order-preserving compression is not a
hard problem. The best way to construct your hash function (ignoring
performance issues) is to take the first N bits of the output of a good
order-preserving compressor applied to the input string.
Note that it's unavoidable that your hash values will only depend on the
first few bytes of the input for most inputs. E.g. for English text you
can't do better than about 8:1 compression, which means that an N-bit hash
will typically depend on only the first 8N bits of the input.
If your input strings are something like ASCII text, and you want a
reasonably fast hash function, you could try a static order-0 Huffman model.
Question for comp.compression people (since I'm curious): do any of the
algorithms mentioned at maximumcompression.com preserve lexicographic order?
-- Ben
|