Hacker News
New high-quality hash measures 71GB/s on M4 (github.com/nicoshev)
28 points by nicoshev11 5 hours ago | 15 comments





It's not the fastest on smhasher anymore. umash is better and faster, and gxhash is twice as fast. It is, however, very hashmap-friendly because of its small size. It's a wyhash variant.

Hey Reini, the rapidhash version on the SMHasher repo is outdated. The latest version was published last Tuesday, I'll submit it to SMHasher shortly. This new version is much faster than the previous one, and still passes all tests. It should beat umash on your benchmarks.

At some point, hashes are fast enough. I'm not going to be switching from xxhash to shave 40 picoseconds off of hashing small keys, and for large keys it's already a few orders of magnitude faster than my NVMe drive.

If whatever you are doing is heavy on hashmap lookups (and you are OK with not rewriting it into something more bespoke and complicated), then the faster the hash function and the cheaper the baseline cost of calling it, the better (e.g. XXH3 can be at a disadvantage here, since its popular implementation dispatches to the appropriate routine at runtime).

This looks like an interesting potential alternative to GxHash. GxHash is nice, but sadly I found AES intrinsics to have somewhat high latency on AMD's Zen 2/3 cores, making it a loss on short strings (until I measured it on our server hardware, the M4 Max had me fooled: it has much better latency despite retiring more AES operations!).


Question: what do you do if there is a collision? I saw the GitHub table also mentions collisions.

They're extremely common in hashtables. You follow standard open addressing or separate chaining procedures: https://en.m.wikipedia.org/wiki/Hash_collision

Fall back on secondary hashes or do probing
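
For illustration, a minimal open-addressing lookup with linear probing might look like the sketch below (hypothetical table layout, not specific to rapidhash; assumes the table always keeps at least one empty slot):

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    typedef struct {
        const char *key;   /* NULL marks an empty slot */
        int value;
    } slot_t;

    /* Find the slot holding `key`, or the first empty slot where it
       could be inserted. `capacity` must be a power of two, and the
       table must never be completely full or this loop won't end. */
    slot_t *probe(slot_t *table, size_t capacity,
                  const char *key, uint64_t hash) {
        size_t mask = capacity - 1;
        for (size_t i = hash & mask; ; i = (i + 1) & mask) {
            if (table[i].key == NULL || strcmp(table[i].key, key) == 0)
                return &table[i];
        }
    }

Two keys that hash to the same slot simply end up in adjacent slots and are told apart by the full key comparison.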

Is rapidhash cache-friendly?

Hash functions don't have any data reuse, so none of them are cache-friendly.

I think they may be asking about the CPU cache.

You have to go out of your way to make a hash that doesn't fit into L1, so again they're all basically the same.

You'll probably end up fitting entirely inside the reorder buffer plus a sequential stream from memory, with the actual caches almost irrelevant.


Sure. Any worthwhile hash function will fit in the instruction cache. But there are ways to make more or less efficient use of the data cache.

> Any worthwhile hash function will fit in the instruction cache.

Yes, I'm talking about the data cache.

> But there are ways to make more or less efficient use of the data cache.

How?

You need to touch every byte of input, and nothing should be faster than going right through from start to end.


I don't know your experience with hash functions, so you may already know what I'm about to say.

This is a minor example, but since you asked...

https://github.com/Cyan4973/xxHash/blob/dev/xxhash.h#L6432

That's an example of a fair number of accumulators being maintained as xxHash goes through its input buffer.

Many modern hash functions store more state/accumulators than they used to. Previous generations would often just have one or two accumulators and run through the data, while modern ones might even keep multiple wide SIMD variables for better mixing (see the sketch below).

And if you're storing enough state that it doesn't fit in your registers, the CPU will put it into the data cache.
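
As a toy sketch of that multi-accumulator pattern (made-up constants and mixing, not xxHash's actual algorithm):

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Four independent accumulators consume interleaved 8-byte lanes,
       breaking the serial dependency chain of a one-accumulator loop. */
    uint64_t toy_hash(const uint8_t *data, size_t len) {
        uint64_t acc[4] = { 0x9e3779b97f4a7c15ULL, 0xbf58476d1ce4e5b9ULL,
                            0x94d049bb133111ebULL, 0x2545f4914f6cdd1dULL };
        size_t i = 0;
        for (; i + 32 <= len; i += 32) {
            for (int lane = 0; lane < 4; lane++) {
                uint64_t v;
                memcpy(&v, data + i + 8 * lane, sizeof v);
                acc[lane] = (acc[lane] ^ v) * 0x9e3779b97f4a7c15ULL;
                acc[lane] ^= acc[lane] >> 29;
            }
        }
        /* Tail handling and final avalanche omitted for brevity. */
        return acc[0] ^ acc[1] ^ acc[2] ^ acc[3] ^ (uint64_t)len;
    }

All of that state fits comfortably in registers; the point is that the four lanes have independent dependency chains, so the CPU can keep several multiplies in flight per iteration.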


> And if you're storing enough state that it doesn't fit in your registers, the CPU will put it into the data cache.

And there are 150+ registers in the actual chip.

But my argument is more that there isn't really an efficient or inefficient way to use L1. So unless you have an enormous amount of state, the question is moot. And if you have so much state you're spilling to L2, that's not when you worry about good or bad cache use, that's a weird bloat problem.



