Thursday, March 13, 2008

MySQL 64 bit hash

What do you do to create a fast lookup for about 40 million email addresses for a client? I was inspired by this article in the MySQL performance blog

And by fast I mean around 10 ms. Well the first thing was to normalize, since there was not a table with unique email addresses, but that was about 1GB of data, and another 1GB of index.

So I hashed it, but 32 bits is only enough of a hash (without collision handling) for about 10,000 to 100,000 items. crc32 is actually quite a bit worse than a random hash, it collides almost 1% of the time; which frankly seems remarkably bad, almost malicious really. And MySQL doesn't have a 64 bit hash, so I did a 64 bit hash using crc32. My first (failed) attempt was this:

crc32(invitee_email)*1234567890 + crc32(concat('x', invitee_email))

However, after the first inevitable 32 bit hash collision, it turns out that

if crc(x) == crc(y), then crc( salt + x ) == crc( salt + y )

as well. Which I guess makes almost enough sense I should have noticed that going in. Anyway, anagram collisions aside, the following overcomes that, and did not generate any collisions across all 40 million addresses:

crc32(invitee_email)*1234567890 + crc32(reverse(invitee_email))


So this is about 28% of the size of the original both data and index-wise, and produces a three times speed up, from about 40 ms to about 12 ms, and is easier on the cache.

No comments: