For Programmers: Free Programming Magazines  


Home > Archive > Compression > July 2006 > Re: Order-preserving hashes?









You are viewing an archived Text-only version of the thread. To view this thread in it's original format and/or if you want to reply to this thread please [click here]

 

Author Re: Order-preserving hashes?
Matt Mahoney

2006-07-25, 6:55 pm


Ben Rudiak-Gould wrote:
> [added comp.compression]
>
> Zeljko Vrba wrote:
>
> This is actually a data compression question. Suppose you take a large
> statistical sample of input strings and group them into equivalence classes
> by hash value. In order to minimize the number of collisions, each
> equivalence class should have about the same number of elements. This is the
> same property that a good data compression algorithm has, if you group by a
> fixed-length substring of the compressed output. Many compression algorithms
> violate your ordering constraint, but order-preserving compression is not a
> hard problem. The best way to construct your hash function (ignoring
> performance issues) is to take the first N bits of the output of a good
> order-preserving compressor applied to the input string.
>
> Note that it's unavoidable that your hash values will only depend on the
> first few bytes of the input for most inputs. E.g. for English text you
> can't do better than about 8:1 compression, which means that an N-bit hash
> will typically depend on only the first 8N bits of the input.
>
> If your input strings are something like ASCII text, and you want a
> reasonably fast hash function, you could try a static order-0 Huffman model.
>
> Question for comp.compression people (since I'm curious): do any of the
> algorithms mentioned at maximumcompression.com preserve lexicographic order?
>
> -- Ben


Most of the PAQ algorithms, when used without preprocessing transforms,
preserve lexicographic order, or in some versions, reverse
lexicographic order. FPAQ0 does also, using a simpler model.

-- Matt Mahoney

Sponsored Links







Also available: Server administration forum archive | Web Design forum archive | Software forum archive | Hardware reviews archive

Copyright 2008 codecomments.com