Unsupervised Clustering of Distributions

Last time I discussed how user service usage could be clustered using the Kolmogorov-Smirnov statistic (referenced on Wikipedia as the two-sample K-S test). This time around, for our latest project in the automotive area, I re-used the same principles to cluster speed and acceleration profiles. In fact, this can be generalised to clustering any set of distributions, and that is the process I will go through with you today. Basically, it allows us to perform unsupervised clustering on sets of distributions and find the natural clusters within them.

If you want to jump right into the code, you can head to my GitHub repository for this example. Otherwise stick with me and I’ll explain the process with an example.

As a first step, I generated some data to play with. Here we deliberately create four distinct clusters of data. Each cluster is composed of a number of normal distributions (randomly chosen between 10 and 50). Each of these normal distributions in turn contains a number of elements (randomly chosen between 5 and 50), and its characteristics (mean and standard deviation) are centred around the cluster's randomly picked values, each distribution with its own bias.
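Here is a minimal sketch of what that generation could look like in Python with numpy; the cluster centres, spreads and seed below are illustrative choices, not necessarily the exact values behind the figures.

```python
import numpy as np

rng = np.random.default_rng(42)

# Four cluster "profiles": the (mean, std) values around which each
# distribution's own parameters are drawn.
cluster_profiles = [(0, 1), (5, 2), (10, 1), (20, 3)]

distributions = []   # list of 1-D samples of variable length
true_labels = []     # ground-truth cluster ids, kept only to check the result later

for cluster_id, (mu_c, sigma_c) in enumerate(cluster_profiles):
    n_dists = rng.integers(10, 51)                 # 10 to 50 distributions per cluster
    for _ in range(n_dists):
        mu = mu_c + rng.normal(0, 0.5)             # each distribution gets its own bias
        sigma = abs(sigma_c + rng.normal(0, 0.3))  # keep the spread positive
        n_points = rng.integers(5, 51)             # 5 to 50 elements per distribution
        distributions.append(rng.normal(mu, sigma, n_points))
        true_labels.append(cluster_id)
```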

So, in the end, we have a set of random normal distributions of variable length, each with its own specific characteristics, but grouping around four main clusters, as we can see in the following figure.

GenData1.PNG
Randomly generated set of normal distributions

Now we calculate the Kolmogorov–Smirnov statistic for each pair of distributions. This forms our distance matrix, where 0 indicates two identical distributions and 1 indicates two completely different ones. In our case, the values fall between 0 and 1, except on the diagonal, where they are all 0 (a distribution is identical to itself!).
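A minimal sketch of that pairwise computation with scipy's `ks_2samp`, assuming the `distributions` list from the generation sketch above:

```python
import numpy as np
from scipy.stats import ks_2samp

n = len(distributions)
dist_matrix = np.zeros((n, n))

for i in range(n):
    for j in range(i + 1, n):
        # The two-sample K-S statistic is our distance between distributions i and j
        result = ks_2samp(distributions[i], distributions[j])
        dist_matrix[i, j] = result.statistic
        dist_matrix[j, i] = result.statistic   # symmetric matrix, the diagonal stays 0
```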

We can use seaborn's clustermap to visualise that distance matrix and immediately see the hierarchical clustering within it.
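Something along these lines, assuming the `dist_matrix` computed above:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# clustermap reorders rows and columns according to a hierarchical clustering,
# which makes the block structure of the distance matrix visible.
sns.clustermap(dist_matrix, cmap="viridis")
plt.show()
```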

DistMatrix1.PNG
Natural clustering of the KS distances

But for us the next step is to perform hierarchical clustering on the Kolmogorov–Smirnov distances. Since we know we generated four clusters, we cut the hierarchical tree at the depth that yields four clusters and tada! We recover our four clusters!
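A sketch of that step with scipy's hierarchy module, again assuming the `dist_matrix` from before:

```python
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# linkage expects a condensed distance matrix, hence the squareform call
condensed = squareform(dist_matrix)
linkage_matrix = linkage(condensed, method="average")

# Cut the tree so that exactly four clusters remain
cluster_ids = fcluster(linkage_matrix, t=4, criterion="maxclust")
```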

Clustered4Dist1.PNG
Recovered clusters

But what if we had looked for two clusters… or six? Well, we would still have found clusters, but as you can see below, some obviously should have been split further, or merged.
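The same tree can simply be cut at a different depth; a quick what-if, assuming the `linkage_matrix` from the sketch above:

```python
from scipy.cluster.hierarchy import fcluster

clusters_2 = fcluster(linkage_matrix, t=2, criterion="maxclust")   # merge down to two
clusters_6 = fcluster(linkage_matrix, t=6, criterion="maxclust")   # split into six
```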

Just for the sake of it, let's try a different set of randomly generated distributions. Again, we generate four clusters, but this time they overlap in pairs.
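In terms of the generation sketch from earlier, overlapping in pairs just means picking cluster centres that sit close together two by two, for instance:

```python
# Cluster centres chosen so that the four clusters overlap two by two
# (illustrative values only, reusing the generator sketched earlier).
cluster_profiles = [(0, 1), (1, 1.5), (10, 2), (11, 2.5)]
```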

GenData2.PNG
Another randomly generated set of normal distributions

We again calculate the Kolmogorov-Smirnov statistic as a distance between the different distributions and visualise the result with seaborn. We can see that this time around there appear to be only two clusters.

DistMatrix2.PNG
Natural clustering of the KS distances

Let's cluster them into four clusters anyway and see whether we recover the original groups.

Clustered4Dist2.PNG
Recovered clusters

Well, it comes close, even with such overlapping distributions. We can also play the what-if game here and see what happens if we look for two or six clusters of distributions.

It is easy to see that two clusters would most probably make more sense here.

In summary, the Kolmogorov-Smirnov statistic can be used as a distance measure between distributions. In my previous post I showed how this can be applied to user service usage segmentation; more recently I have used it to cluster driving profiles (speed and acceleration profiles of cars); and in this post I have generalised it to any distributions of your choice. Once the distance matrix is computed, you can cluster it (here I used hierarchical clustering, but another algorithm could be used, and I will most probably experiment with others in the future). Also, in this example you must specify how many clusters you expect. We could further develop the process to automatically detect that “optimal” number of clusters, maybe using DBSCAN or HDBSCAN. Another thing I should try.
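As a rough illustration of that last idea, scikit-learn's DBSCAN can work directly on the precomputed KS distance matrix; the `eps` value below is only a placeholder that would need tuning for real data:

```python
from sklearn.cluster import DBSCAN

# DBSCAN accepts a precomputed distance matrix and does not need the number
# of clusters up front; eps is a placeholder that would need tuning.
db = DBSCAN(eps=0.3, min_samples=5, metric="precomputed")
auto_labels = db.fit_predict(dist_matrix)   # -1 marks samples treated as noise
```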

If you apply this technique to your data, do not be shy and let me know how well it went!
