Unsupervised Distributions Clustering

Last time I discussed how users' service usage could be clustered using the Kolmogorov–Smirnov statistic (known on Wikipedia as the two-sample K-S test). This time around, for our latest project in the automotive area, I re-used the same principles to cluster speed and acceleration profiles. In fact, the approach generalises to clustering any set of distributions, and that is the process I will walk you through today. In short, it lets us perform unsupervised clustering on sets of distributions and find the natural clusters within them.

If you want to jump right into the code, you can head to my GitHub repository for this example. Otherwise, stick with me and I'll explain the process with an example.

As a first step, I generated some data to play with. Here we deliberately create four distinct clusters of data. Each cluster is composed of a random number (between 10 and 50) of normal distributions. Each of these distributions in turn contains a random number of elements (between 5 and 50), and its characteristics (mean and standard deviation) are centred around values picked for the cluster, each with its own bias.
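The generation step above can be sketched as follows. This is a minimal illustration, not the code from the repository: the helper `make_cluster` and the four `(mean, std)` cluster centres are hypothetical values chosen for demonstration.

```python
import numpy as np

rng = np.random.default_rng(42)

def make_cluster(mean_center, std_center):
    """One cluster: a random number (10-50) of normal distributions
    whose means and stds are centred around the cluster's own values."""
    dists = []
    for _ in range(rng.integers(10, 51)):
        n_samples = rng.integers(5, 51)              # 5 to 50 elements each
        mean = rng.normal(mean_center, 1.0)          # cluster-specific bias
        std = abs(rng.normal(std_center, 0.5)) + 0.1 # keep std positive
        dists.append(rng.normal(mean, std, n_samples))
    return dists

# Four clusters with distinct (mean, std) characteristics (hypothetical)
centers = [(0, 1), (10, 2), (20, 1), (30, 3)]
distributions, labels = [], []
for label, (m, s) in enumerate(centers):
    cluster = make_cluster(m, s)
    distributions.extend(cluster)
    labels.extend([label] * len(cluster))
```

The result is a flat list of variable-length samples plus the ground-truth cluster label of each, which we can later compare against the recovered clusters.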

So, in the end, we have a collection of normal random distributions of variable length, each with its own specific characteristics, but polarising around four main clusters, as we can see in the following figure.

Now we calculate the Kolmogorov–Smirnov statistic on each pair of distributions. This forms our distance matrix, where 0 indicates two identical distributions and 1 two completely different ones. In our case, the values will fall between 0 and 1, except on the diagonal, where they are all 0 (a distribution is identical to itself!).
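Building that matrix is a double loop over distribution pairs. Here is a minimal sketch using SciPy's `ks_2samp`; the function name `ks_distance_matrix` is mine, not from the original repository.

```python
import numpy as np
from scipy.stats import ks_2samp

def ks_distance_matrix(distributions):
    """Pairwise two-sample K-S statistic, as a symmetric distance matrix."""
    n = len(distributions)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            # ks_2samp returns (statistic, p-value); the statistic is in [0, 1]
            stat, _ = ks_2samp(distributions[i], distributions[j])
            dist[i, j] = dist[j, i] = stat
    return dist
```

The diagonal stays at 0 by construction, and symmetry comes for free since the K-S statistic does not depend on the order of the two samples.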

We can use seaborn's clustermap to visualise that distance matrix and see the hierarchical clustering within it right away.
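A minimal sketch of that visualisation step, using a small toy distance matrix in place of the real K-S matrix (the block values and colour map are illustrative assumptions):

```python
import numpy as np
import seaborn as sns

# Toy symmetric distance matrix standing in for the K-S matrix:
# two blocks of 5 similar distributions, far from each other.
rng = np.random.default_rng(0)
near = lambda: rng.uniform(0.0, 0.2, (5, 5))
far = lambda: rng.uniform(0.8, 1.0, (5, 5))
dist_matrix = np.block([[near(), far()],
                        [far(), near()]])
dist_matrix = (dist_matrix + dist_matrix.T) / 2  # enforce symmetry
np.fill_diagonal(dist_matrix, 0)                 # zero self-distance

# clustermap reorders rows and columns by hierarchical clustering,
# so groups of similar distributions show up as blocks on the diagonal.
grid = sns.clustermap(dist_matrix, cmap="viridis")
grid.savefig("ks_clustermap.png")
```

The dendrograms drawn on the margins are exactly the hierarchical tree we cut in the next step.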

But for us, the next step is to perform hierarchical clustering on the Kolmogorov–Smirnov distances ourselves. Knowing we have four clusters, we cut the hierarchical tree at the depth that yields four clusters and, tada, we recover our four clusters!
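The cut can be done with SciPy's hierarchy tools. This sketch assumes complete linkage, which the original post does not specify; the helper name is mine.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_distributions(dist_matrix, n_clusters):
    """Cut the hierarchical tree at the depth that yields n_clusters."""
    # linkage expects a condensed (upper-triangular) distance vector
    condensed = squareform(dist_matrix, checks=False)
    tree = linkage(condensed, method="complete")
    # 'maxclust' cuts the tree so that at most n_clusters labels remain
    return fcluster(tree, t=n_clusters, criterion="maxclust")
```

Passing a different `n_clusters` gives the alternative cuts discussed below.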

But what if we had looked for two clusters… or six? We would still have found clusters, but as you can see below, some of them would obviously need to be split further, or merged.

Just for the sake of it, let's try a different set of randomly generated distributions. Again we generate four clusters, but this time they overlap in pairs.
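The overlapping setup can be sketched like the first generation step, only with cluster centres placed pairwise close together. All the centre values here are hypothetical, chosen just to make clusters 0/1 and 2/3 overlap.

```python
import numpy as np

rng = np.random.default_rng(7)

# Four clusters whose (mean, std) centres overlap in pairs:
# 0/1 share one region of parameter space, 2/3 another.
overlapping_centers = [(0, 1), (0.5, 1), (10, 2), (10.5, 2)]
distributions, labels = [], []
for label, (m, s) in enumerate(overlapping_centers):
    for _ in range(rng.integers(10, 51)):        # 10-50 distributions
        n_samples = rng.integers(5, 51)          # 5-50 elements each
        distributions.append(rng.normal(rng.normal(m, 0.5), s, n_samples))
        labels.append(label)
```

Because the paired centres sit well within each other's spread, the K-S statistic between members of a pair stays small, which is why only two clusters emerge.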

We calculate the Kolmogorov–Smirnov metric again as a distance between the different distributions, and visualise the result with seaborn. This time around, we can see that there should really be only two clusters.

Let's cluster them into four clusters anyway and see whether we recover the original distributions.