My favourite small hash table

25 points by bitshift 2 months ago on lobsters | 7 comments

tobin_baker | 2 months ago

Shameless plug for bidirectional linear probing, which outperformed Robin Hood in my benchmarks:

https://github.com/senderista/hashtable-benchmarks

matheusmoreira | 2 months ago

Thanks a lot, this looks awesome!! My language's hash tables use unidirectional linear probing without tombstones. Knuth's algorithm looks like it could be a straight upgrade. Can't believe it was published half a century ago. Makes me wonder what else I'm sleeping on.

tobin_baker | 2 months ago

Even FCFS is a major upgrade over linear probing WRT worst-case probe length (see my benchmarks), despite being a 1-line change. Yet I’ve virtually never seen it in the wild.

A concurrent (though not lock-free) version of BLP is described here:

https://fmt.ewi.utwente.nl/media/92.pdf

I stole part of my algorithm from this paper but have not implemented it so can’t vouch for its performance.

Small hash tables can be very cute :-)

If keys are 32-bit integers but are not randomly distributed, then we just need an invertible hash function from 32 bits to 32 bits.

A nice way to do that is Fibonacci hashing.
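For reference, a minimal sketch of 32-bit Fibonacci hashing (the constant is floor(2^32/φ); the function name and signature here are mine, not from any particular library):

```python
GOLDEN_32 = 2654435769  # floor(2^32 / phi); odd, so multiplication mod 2^32 is invertible

def fib_hash_32(key: int, log2_buckets: int) -> int:
    """Map a 32-bit key to a bucket index by multiplying by the golden
    ratio constant mod 2^32 and keeping the top log2_buckets bits."""
    h = (key * GOLDEN_32) & 0xFFFFFFFF
    return h >> (32 - log2_buckets)
```

Because the multiplier is odd, multiplication mod 2^32 is a bijection, so the full 32-bit hash (before the shift) is invertible as required.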

tobin_baker | 2 months ago

In my experiments, “Fibonacci hashing” only generates acceptable pseudorandomness for sequential inputs. Multiplication mixes low bits into high bits, but you also need to mix high bits into low bits with e.g. shift/xor. A fast mixing function like the Murmur3 finalizer is almost as fast and gives much better results.
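For concreteness, the Murmur3 32-bit finalizer mentioned above looks like this (these are the standard fmix32 constants; written in Python with explicit 32-bit masking):

```python
def murmur3_fmix32(h: int) -> int:
    """Murmur3's 32-bit finalizer: alternating xor-shifts (high bits
    into low) and multiplications (low bits into high) mix every
    input bit into every output bit."""
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & 0xFFFFFFFF
    h ^= h >> 16
    return h
```

Like Fibonacci hashing it is invertible, since xor-shift and multiplication by an odd constant are both bijections on 32-bit words.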

Fibonacci hashing calculates the array index by throwing away the low bits, so there's not much point mixing the high bits into them. A hash table needs indexes to be dispersed well enough to keep probe sequences short; good randomness is sufficient, but the point of Fibonacci hashing is that something much weaker is often good enough.

[OP] bitshift | 2 months ago

Maybe this is obvious to folks who have seen Robin Hood hashing before, but it took a minute for me to get my head around the significance of the author's Score function:

Score(Index, Key) is (Index - Key) & mask

Because keys are hashes and hashes are where probe sequences start, this is just a clever way to implicitly store how far away each entry is from its preferred location. And that's the intuition for why tombstones aren't needed: if deleting something leaves a gap, and you know the entries in front of it are already shifted forward, that means it's safe to shift them backward to fill the gap.
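For the curious, a minimal sketch of that score plus backward-shift deletion, assuming a power-of-two table whose stored keys are their own hashes (this is my reconstruction of the idea, not the author's code; the `EMPTY` sentinel and function names are mine):

```python
EMPTY = None

def score(index: int, key: int, mask: int) -> int:
    # How far the entry at `index` is displaced from its preferred
    # slot (key & mask), computed modulo the table size.
    return (index - key) & mask

def delete(table, key):
    """Backward-shift deletion: after removing `key`, pull each
    following displaced entry back one slot so no tombstone is needed."""
    mask = len(table) - 1
    i = key & mask
    while table[i] != key:           # linear probe for the key
        if table[i] is EMPTY:
            return                   # not present
        i = (i + 1) & mask
    while True:
        j = (i + 1) & mask
        # Stop at an empty slot, or at an entry already in its
        # preferred slot (score 0), which must not be moved back.
        if table[j] is EMPTY or score(j, table[j], mask) == 0:
            table[i] = EMPTY
            return
        table[i] = table[j]          # shift the displaced entry back
        i = j
```

The invariant from the comment above is exactly what makes the inner loop safe: any entry with a nonzero score was shifted forward past a slot it would rather occupy, so moving it one slot back can only improve its probe distance.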