And by fast I mean around 10 ms. Well the first thing was to normalize, since there was not a table with unique email addresses, but that was about 1GB of data, and another 1GB of index.
So I hashed it, but 32 bits is only enough of a hash (without collision handling) for about 10,000 to 100,000 items. crc32 is actually quite a bit worse than a random hash, it collides almost 1% of the time; which frankly seems remarkably bad, almost malicious really. And MySQL doesn't have a 64 bit hash, so I did a 64 bit hash using crc32. My first (failed) attempt was this:
crc32(invitee_email)*1234567890 + crc32(concat('x', invitee_email))
However, after the first inevitable 32 bit hash collision, it turns out that
if crc(x) == crc(y), then crc( salt + x ) == crc( salt + y )
as well. Which I guess makes almost enough sense I should have noticed that going in. Anyway, anagram collisions aside, the following overcomes that, and did not generate any collisions across all 40 million addresses:
crc32(invitee_email)*1234567890 + crc32(reverse(invitee_email))
So this is about 28% of the size of the original both data and index-wise, and produces a three times speed up, from about 40 ms to about 12 ms, and is easier on the cache.
No comments:
Post a Comment