Technoid in the Valley: MySQL 64 bit hash

What do you do to create a fast lookup for about 40 million email addresses for a client? I was inspired by this article in the MySQL performance blog

And by fast I mean around 10 ms. Well the first thing was to normalize, since there was not a table with unique email addresses, but that was about 1GB of data, and another 1GB of index.

So I hashed it, but 32 bits is only enough of a hash (without collision handling) for about 10,000 to 100,000 items. crc32 is actually quite a bit worse than a random hash, it collides almost 1% of the time; which frankly seems remarkably bad, almost malicious really. And MySQL doesn't have a 64 bit hash, so I did a 64 bit hash using crc32. My first (failed) attempt was this:

crc32(invitee_email)*1234567890 + crc32(concat('x', invitee_email))

However, after the first inevitable 32 bit hash collision, it turns out that

if crc(x) == crc(y), then crc( salt + x ) == crc( salt + y )

as well. Which I guess makes almost enough sense I should have noticed that going in. Anyway, anagram collisions aside, the following overcomes that, and did not generate any collisions across all 40 million addresses:

crc32(invitee_email)*1234567890 + crc32(reverse(invitee_email))

So this is about 28% of the size of the original both data and index-wise, and produces a three times speed up, from about 40 ms to about 12 ms, and is easier on the cache.

Technoid in the Valley

Thursday, March 13, 2008

MySQL 64 bit hash

No comments:

About Me

Blog Archive

Access Codes