Watch Out For Modulo And Hashes
Get a better understanding of how you can randomly sample a population with special data selection tricks.
By Den Delimarsky in Data
January 28, 2020
A colleague and I were working on some data analysis tools today, and they encountered a puzzling error when processing a large chunk of data. Due to the volume of unique users that they were analyzing telemetry for, they decided to use sampling by hashing the user IDs and then taking a slice of the group they wanted to investigate.
Because they were using Kusto, they could rely on hash()
 a function that returns a hash based on the input value (it uses the xxhash algorithm behind the scenes). This would, in a way, normalize the user IDs under the same structure, that can then be sequenced.
There is one example in the aforementioned function document, that shows how to use the hash()
function with a little bit of statistical sugar  modulo (written as %). Recall how I mentioned that they needed to take a slice of the overall audience? Hashing the IDs and then applying modulo allows doing just that. They were puzzled because the sample used the following code snippet:


When they tried to adjust the modulo parameter to 20, they thought they would get 20% of the users, respectively, however the output didn’t quite match the expectations  the number was much smaller. So what is going on here? Let’s explore how the hashing approach works to select a reliable sample of the audience.
First, you start with a set of IDs:
UserID 

38235e66021b4092b5e072687ba6dcce 
090d75526843404c85ac30e9d93f265f 
83d4f1b7061b439b83fa6bf4e6105811 
91e7af98783d4ecba416734997b0ce27 
5de7e32f5de14307adc0fca54c33ead8 
97aa5ec050114446a3a2eb4db541f127 
04ca250b253049babcf1c28b81f834ed 
2938e539eb424c5c816d3ef6e6d052b2 
f8f39fe53bba4eaea7a8fd6e589eb5a0 
c015bc77dafd495aa15e2692c0e35731 
These values are subsequently hashed with the algorithm of choice. For the purposes of this article, I will be using xxhash as well (don’t focus too much on the choice of the hashing algorithm right now, really). Running the set above through a hashing function can be done using xxhash in Python:


The output should be similar to this:
UserID 

367b50760441849e 
2d0c85895cf04f08 
32da77b165ab8d0e 
0e0c5bbc8559f1fc 
c22f1c8a59ec2e4c 
0a5dfc841d9d9586 
b2bee630cc26677f 
ba69d71a5b3d66c5 
63e75af38354423b 
77f3d9157770388a 
See how all values are of uniform length. Next, we need to apply the modulo function. You might wonder  how would I do that to a hash? First, we need to convert the hexadecimal value to an integer. You can do that in Python with intdigest()
, or in bash:


Running this on our hashes, we get:
UserID  DecimalValue 

367b50760441849e  3925819967991284894 
2d0c85895cf04f08  3246116256443551496 
32da77b165ab8d0e  3664372850617978126 
0e0c5bbc8559f1fc  1012284881500762620 
c22f1c8a59ec2e4c  13992433947803135564 
0a5dfc841d9d9586  747030757576119686 
b2bee630cc26677f  12879985081584084863 
ba69d71a5b3d66c5  13432503871809087173 
63e75af38354423b  7198822531301917243 
77f3d9157770388a  8643490796075497610 
Now, when we specify the value for the modulo operation, we do not define the percentage of the audience we need, but rather the number of uniform buckets we want to have where we’ll classify user IDs. By now, this probably rings a bell if you’re familiar with hash tables. An expression like hash(x) % 5
would mean that we want to get five (5) buckets of hashes. Going back to our earlier Kusto example, where hash(StartTime, 10)
meant that there are 10 buckets, and each represents 10% of the overall audience. If we split this in 20 buckets, that would mean that we’d be looking at 5% of the audience in each bucket.
Taking the table above, if we’d want to get 20% of the hashes, we would write a function that would check if hash(x) % 5 == 0
, to get the chunk of the audience representative of the segment we want. Running modulo on DecimalValue
would give us this:
UserID  DecimalValue  ModuloOutput 

367b50760441849e  3925819967991284894  4 
2d0c85895cf04f08  3246116256443551496  1 
32da77b165ab8d0e  3664372850617978126  1 
0e0c5bbc8559f1fc  1012284881500762620  0 
c22f1c8a59ec2e4c  13992433947803135564  4 
0a5dfc841d9d9586  747030757576119686  1 
b2bee630cc26677f  12879985081584084863  3 
ba69d71a5b3d66c5  13432503871809087173  3 
63e75af38354423b  7198822531301917243  3 
77f3d9157770388a  8643490796075497610  0 
And there we have it  two values that are of interest are 0e0c5bbc8559f1fc
and 77f3d9157770388a
. And it’s representative of 20%. Rememember  doing modulobased sampling means you are splitting values by bucket, not by percent of records.
Feedback
Have any thoughts? Let me know on Twitter!