Watch Out For Modulo And Hashes

Get a better understanding of how you can randomly sample a population with special data selection tricks.

By Den Delimarsky in Data

January 28, 2020

A colleague and I were working on some data analysis tools today, and they encountered a puzzling error when processing a large chunk of data. Due to the volume of unique users that they were analyzing telemetry for, they decided to use sampling by hashing the user IDs and then taking a slice of the group they wanted to investigate.

Because they were using Kusto, they could rely on hash() - a function that returns a hash based on the input value (it uses the xxhash algorithm behind the scenes). This would, in a way, normalize the user IDs under the same structure, that can then be sequenced.

There is one example in the aforementioned function document, that shows how to use the hash() function with a little bit of statistical sugar - modulo (written as %). Recall how I mentioned that they needed to take a slice of the overall audience? Hashing the IDs and then applying modulo allows doing just that. They were puzzled because the sample used the following code snippet:

//The following example uses the hash function to
//run a query on 10% of the data
| where hash(StartTime, 10) == 0

When they tried to adjust the modulo parameter to 20, they thought they would get 20% of the users, respectively, however the output didn’t quite match the expectations - the number was much smaller. So what is going on here? Let’s explore how the hashing approach works to select a reliable sample of the audience.

First, you start with a set of IDs:


These values are subsequently hashed with the algorithm of choice. For the purposes of this article, I will be using xxhash as well (don’t focus too much on the choice of the hashing algorithm right now, really). Running the set above through a hashing function can be done using xxhash in Python:

import xxhash

x = xxhash.xxh64()
digest = x.hexdigest()

The output should be similar to this:


See how all values are of uniform length. Next, we need to apply the modulo function. You might wonder - how would I do that to a hash? First, we need to convert the hexadecimal value to an integer. You can do that in Python with intdigest(), or in bash:

echo $((0x367b50760441849e))

Running this on our hashes, we get:

UserID DecimalValue
367b50760441849e 3925819967991284894
2d0c85895cf04f08 3246116256443551496
32da77b165ab8d0e 3664372850617978126
0e0c5bbc8559f1fc 1012284881500762620
c22f1c8a59ec2e4c 13992433947803135564
0a5dfc841d9d9586 747030757576119686
b2bee630cc26677f 12879985081584084863
ba69d71a5b3d66c5 13432503871809087173
63e75af38354423b 7198822531301917243
77f3d9157770388a 8643490796075497610

Now, when we specify the value for the modulo operation, we do not define the percentage of the audience we need, but rather the number of uniform buckets we want to have where we’ll classify user IDs. By now, this probably rings a bell if you’re familiar with hash tables. An expression like hash(x) % 5 would mean that we want to get five (5) buckets of hashes. Going back to our earlier Kusto example, where hash(StartTime, 10) meant that there are 10 buckets, and each represents 10% of the overall audience. If we split this in 20 buckets, that would mean that we’d be looking at 5% of the audience in each bucket.

Taking the table above, if we’d want to get 20% of the hashes, we would write a function that would check if hash(x) % 5 == 0, to get the chunk of the audience representative of the segment we want. Running modulo on DecimalValue would give us this:

UserID DecimalValue ModuloOutput
367b50760441849e 3925819967991284894 4
2d0c85895cf04f08 3246116256443551496 1
32da77b165ab8d0e 3664372850617978126 1
0e0c5bbc8559f1fc 1012284881500762620 0
c22f1c8a59ec2e4c 13992433947803135564 4
0a5dfc841d9d9586 747030757576119686 1
b2bee630cc26677f 12879985081584084863 3
ba69d71a5b3d66c5 13432503871809087173 3
63e75af38354423b 7198822531301917243 3
77f3d9157770388a 8643490796075497610 0

And there we have it - two values that are of interest are 0e0c5bbc8559f1fc and 77f3d9157770388a. And it’s representative of 20%. Rememember - doing modulo-based sampling means you are splitting values by bucket, not by percent of records.

Subscribe to The Den:

A monthly newsletter about technology, machine learning, security, and just tinkering with code.


Have any thoughts? Let me know on Twitter!