6/07/2010

Hashing Without Collision

There are many data look-up operations in today's Internet scale web data processing, for example, from URL to a unique integer index value. It's essentially a basic hashing problem except that its data and operation scale is very large. Usually, it's required that no collision should exist in these hash function, which is called "Perfect Hashing"

A perfect hash function maps a static set of n keys into a set of m integer numbers without collisions, where m is greater than or equal to n. If m is equal to n, the function is called minimal (perfect hash function).

Constructing such a perfect hash function from a large set of unique keys is not a simple task. But there are still many algorithms and implementations about it.

Algorithms:

1. Perfect Hashing for Data Management Applications
2. External Perfect Hashing for Very Large Key Sets
3. Gperf: A Perfect Hash Function Generator
4. Monotone Minimal Perfect Hashing

Implementations:

C Minimal Perfect Hashing Library:
http://cmph.sourceforge.net/

Gperf:
http://www.gnu.org/software/gperf/

Minimal Perfect Hashing by Bob Jenkins:
http://burtleburtle.net/bob/hash/perfect.html

General Purpose Hash Function Algorithms:
http://www.partow.net/programming/hashfunctions

These solutions require a static key set known in advance. Sometime, this is not the case, so you need to construct it dynamically. Dynamic perfect hashing is somewhat complicated, an alternative is to use Cuckoo Hashing

The basic idea is to use two hash functions(h1 and h2) to determine the hash value of a key. It's not a perfect hash function, because there is a chance that two distinct keys(k1 and k2) have the same hash value with h1 and h2. (i.e. h1(k1) == h1(k2) and h2(k1) == h2(k2)) But the probability is quite low and the worst case needs only two loop up operations.

Here are some tutorials about this algorithm:

Cuckoo Hashing for Undergraduate
Cuckoo Hashing - Theory and Practice (Part 1, Part 2 and Part 3)

Paper:

1. Initial paper about Cuckoo Hashing
2. Another version of the paper

No comments: