# Watch Out For Modulo And Hashes

A colleague and I were working on some data analysis tools today, and they encountered a puzzling error when processing a large chunk of data. Due to the volume of unique users that they were analyzing telemetry for, they decided to use sampling by hashing the user IDs and then taking a slice of the group they wanted to investigate.

Because they were using Kusto, they could rely on `hash()`

- a function that returns a hash based on the input value (*it uses the xxhash algorithm behind the scenes*). This would, in a way, normalize the user IDs under the same structure, that can then be sequenced.

There is one example in the aforementioned function document, that shows how to use the `hash()`

function with a little bit of statistical sugar - modulo (*written as %*). Recall how I mentioned that they needed to take a slice of the overall audience? Hashing the IDs and then applying modulo allows doing just that. They were puzzled because the sample used the following code snippet:

```
//The following example uses the hash function to
//run a query on 10% of the data
| where hash(StartTime, 10) == 0
```

When they tried to adjust the modulo parameter to 20, they thought they would get 20% of the users, respectively, however the output didn’t quite match the expectations - the number was much smaller. So what is going on here? Let’s explore how the hashing approach works to select a reliable sample of the audience.

First, you start with a set of IDs:

UserID |
---|

`38235e66-021b-4092-b5e0-72687ba6dcce` |

`090d7552-6843-404c-85ac-30e9d93f265f` |

`83d4f1b7-061b-439b-83fa-6bf4e6105811` |

`91e7af98-783d-4ecb-a416-734997b0ce27` |

`5de7e32f-5de1-4307-adc0-fca54c33ead8` |

`97aa5ec0-5011-4446-a3a2-eb4db541f127` |

`04ca250b-2530-49ba-bcf1-c28b81f834ed` |

`2938e539-eb42-4c5c-816d-3ef6e6d052b2` |

`f8f39fe5-3bba-4eae-a7a8-fd6e589eb5a0` |

`c015bc77-dafd-495a-a15e-2692c0e35731` |

These values are subsequently hashed with the algorithm of choice. For the purposes of this article, I will be using xxhash as well (*don’t focus too much on the choice of the hashing algorithm right now, really*). Running the set above through a hashing function can be done using xxhash in Python:

```
import xxhash
x = xxhash.xxh64()
x.update(b'38235e66-021b-4092-b5e0-72687ba6dcce')
digest = x.hexdigest()
```

The output should be similar to this:

UserID |
---|

`367b50760441849e` |

`2d0c85895cf04f08` |

`32da77b165ab8d0e` |

`0e0c5bbc8559f1fc` |

`c22f1c8a59ec2e4c` |

`0a5dfc841d9d9586` |

`b2bee630cc26677f` |

`ba69d71a5b3d66c5` |

`63e75af38354423b` |

`77f3d9157770388a` |

See how all values are of uniform length. Next, we need to apply the modulo function. You might wonder - how would I do that to a hash? First, we need to convert the hexadecimal value to an integer. You can do that in Python with `intdigest()`

, or in bash:

```
echo $((0x367b50760441849e))
```

Running this on our hashes, we get:

UserID | DecimalValue |
---|---|

`367b50760441849e` |
`3925819967991284894` |

`2d0c85895cf04f08` |
`3246116256443551496` |

`32da77b165ab8d0e` |
`3664372850617978126` |

`0e0c5bbc8559f1fc` |
`1012284881500762620` |

`c22f1c8a59ec2e4c` |
`13992433947803135564` |

`0a5dfc841d9d9586` |
`747030757576119686` |

`b2bee630cc26677f` |
`12879985081584084863` |

`ba69d71a5b3d66c5` |
`13432503871809087173` |

`63e75af38354423b` |
`7198822531301917243` |

`77f3d9157770388a` |
`8643490796075497610` |

Now, when we specify the value for the modulo operation, we do not define the percentage of the audience we need, but rather **the number of uniform buckets we want to have** where we’ll classify user IDs. By now, this probably rings a bell if you’re familiar with hash tables. An expression like `hash(x) % 5`

would mean that we want to get five (*5*) buckets of hashes. Going back to our earlier Kusto example, `where hash(StartTime, 10)`

meant that there are 10 buckets, and each represents 10% of the overall audience. If we split this in 20 buckets, that would mean that we’d be looking at 5% of the audience in each bucket.

Taking the table above, if we’d want to get 20% of the hashes, we would write a function that would check if `hash(x) % 5 == 0`

, to get the chunk of the audience representative of the segment we want. Running modulo on `DecimalValue`

would give us this:

UserID | DecimalValue | ModuloOutput |
---|---|---|

`367b50760441849e` |
`3925819967991284894` |
`4` |

`2d0c85895cf04f08` |
`3246116256443551496` |
`1` |

`32da77b165ab8d0e` |
`3664372850617978126` |
`1` |

`0e0c5bbc8559f1fc` |
`1012284881500762620` |
`0` |

`c22f1c8a59ec2e4c` |
`13992433947803135564` |
`4` |

`0a5dfc841d9d9586` |
`747030757576119686` |
`1` |

`b2bee630cc26677f` |
`12879985081584084863` |
`3` |

`ba69d71a5b3d66c5` |
`13432503871809087173` |
`3` |

`63e75af38354423b` |
`7198822531301917243` |
`3` |

`77f3d9157770388a` |
`8643490796075497610` |
`0` |

And there we have it - two values that are of interest are `0e0c5bbc8559f1fc`

and `77f3d9157770388a`

. And it’s representative of 20%. Rememember - doing modulo-based sampling means you are splitting values by bucket, not by percent of records.