User segmentation through unsupervised clustering

For one of our current engagements, our team came up with a nice way to perform meaningful user segmentation based on unsupervised clustering. In this post, I want to share a little bit of the process; it might be useful to somebody else with a similar problem.

A bit of background first. We have access to the internet usage data of users. This data states the number of times each user accessed certain pre-configured “service providers”, e.g. Google, Apple, Amazon, DropBox, … In some cases the granularity is finer, e.g. iMessage, Apple Push Notification Service, Apple, iTunes, … All in all, we have over 150 such service providers defined. The objective is to get meaningful clusters out of this usage data.

The first step was to look at the correlation between those services and see if some obvious ones could be merged. For that I used a trick I showed before about clustering a correlation matrix, and ended up with something like the following figure.
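One common way to do this kind of reordering (not necessarily the exact trick referred to above) is to run hierarchical clustering on a distance derived from the correlation matrix. The following is a minimal sketch under that assumption; the function name `reorder_correlation`, the choice of `1 - |correlation|` as distance, the average linkage and the synthetic data are all illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list
from scipy.spatial.distance import squareform

def reorder_correlation(X):
    """Return the column correlation matrix of X, reordered so that
    correlated columns sit next to each other, plus the order used."""
    corr = np.corrcoef(X, rowvar=False)
    # Assumption: use 1 - |corr| as a distance (0 = perfectly correlated).
    dist = 1.0 - np.abs(corr)
    dist = (dist + dist.T) / 2.0      # enforce exact symmetry
    np.fill_diagonal(dist, 0.0)
    # Hierarchical clustering on the condensed distance matrix.
    tree = linkage(squareform(dist, checks=False), method="average")
    order = leaves_list(tree)
    return corr[np.ix_(order, order)], order

# Synthetic stand-in for the usage data: columns 0 and 2 are correlated,
# and so are columns 1 and 3.
rng = np.random.default_rng(0)
base = rng.normal(size=(500, 2))
X = np.column_stack([
    base[:, 0],
    base[:, 1],
    base[:, 0] + 0.1 * rng.normal(size=500),
    base[:, 1] + 0.1 * rng.normal(size=500),
])
reordered, order = reorder_correlation(X)
print(order)
```

Plotting `reordered` as a heat map would show correlated services as blocks along the diagonal, which is what makes merge candidates easy to spot.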


Obviously not much correlation… Then, looking at the service providers themselves, I could see that some are of lower interest in our case. We want to look at user usage patterns, so if a user gets redirected to an analytics or advertisement site, it doesn’t really indicate their usage; it is a consequence, not a cause. So I removed all analytics and advertisement service providers. The same goes for virus protection and a few other smaller categories. After that feature clean-up, I still had close to 110 features… I then decided to introduce the concept of “mega” service providers, each of which joins together similar service providers. In the end, I got the following mega service providers:

  • Google
  • Apple
  • Social Media
  • eMail
  • Video Streaming
  • Over-the-top Call
  • Online Shopping
  • Music Streaming
  • Cloud solutions
  • Health applications
  • News
  • Business applications
  • Gaming
  • Asian sites
  • General information sites
  • Russian sites
  • Peer-2-Peer
  • Dating
  • Porn
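
The grouping into mega service providers listed above amounts to summing access counts through a lookup table. Here is a minimal sketch, assuming per-user counts stored in a dictionary; the mapping `MEGA_OF`, the provider names in it and the helper `to_mega` are hypothetical and only illustrate the idea.

```python
from collections import defaultdict

# Hypothetical mapping from individual service providers to mega service
# providers. Providers mapped to None (e.g. advertisement) are dropped.
MEGA_OF = {
    "iMessage": "MSP_Apple",
    "iTunes": "MSP_Apple",
    "SomeVideoSite": "MSP_Video",
    "SomeAdNetwork": None,
}

def to_mega(counts, mega_of=MEGA_OF):
    """Sum one user's per-provider access counts per mega service provider."""
    totals = defaultdict(int)
    for provider, n in counts.items():
        mega = mega_of.get(provider)
        if mega is not None:          # skip dropped or unmapped providers
            totals[mega] += n
    return dict(totals)

user = {"iMessage": 12, "iTunes": 3, "SomeVideoSite": 7, "SomeAdNetwork": 40}
print(to_mega(user))
```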

As interesting as those might be, the goal is not to micro-categorize people, but to get meaningful broad categories. So, by applying the whole process I describe here a few times, we could remove a number of those mega service providers (features) as being of less importance for broadly categorizing people. In the end I kept 9 features:

  • Google
  • Apple
  • Social Media
  • eMail
  • Video Streaming
  • Over-the-top Call
  • Online Shopping
  • Cloud solutions
  • Business applications

Now, if you look at those features, you will notice they follow heavily Poisson-like distributions. I’d like to “normalise” them a bit. Through an iterative process, I found that for most of them, raising the feature to a fractional exponent really helped a lot. For example, MSP_Google is raised to the power 0.2, MSP_Apple to the power 0.5, MSP_Mail to the power 0.05, etc. We now get somewhat more normal curves, as shown in the following figure.
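
The transform itself is a one-liner per feature. The sketch below uses the three exponents quoted above; the helper `power_transform` and the synthetic count data are assumptions made for illustration, and skewness is used here only as a rough indicator of how "normal" the result looks.

```python
import numpy as np
from scipy.stats import skew

# Exponents quoted in the text; other features would need their own values.
EXPONENTS = {"MSP_Google": 0.2, "MSP_Apple": 0.5, "MSP_Mail": 0.05}

def power_transform(features, exponents=EXPONENTS):
    """Raise each named feature to its fractional exponent."""
    return {
        name: np.power(np.asarray(values, dtype=float), exponents[name])
        for name, values in features.items()
    }

# Synthetic right-skewed counts standing in for the real usage data.
rng = np.random.default_rng(0)
raw = {"MSP_Google": rng.poisson(rng.gamma(shape=2.0, scale=100.0, size=5000))}
transformed = power_transform(raw)

skew_before = skew(raw["MSP_Google"])
skew_after = skew(transformed["MSP_Google"])
print(f"skewness before: {skew_before:.2f}, after: {skew_after:.2f}")
```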

Now we can perform clustering. I used DBSCAN since it is based on density. But which density to use? I ran several trials and concentrated on the number of clusters found. In my case I wanted to maximise the number of clusters. I got the following figure, in which the maximum of 19 clusters is reached with a density of 0.05.
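
Such a sweep could look like the sketch below. It assumes that "density" corresponds to the `eps` parameter of scikit-learn's `DBSCAN` (the neighbourhood radius); the `min_samples` value, the range of `eps` values and the synthetic blobs are placeholders, not the settings used on the real data.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

def sweep_eps(X, eps_values, min_samples=5):
    """Run DBSCAN for each eps and return (eps, number of clusters) pairs."""
    results = []
    for eps in eps_values:
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        # Label -1 marks outliers and does not count as a cluster.
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        results.append((float(eps), n_clusters))
    return results

# Synthetic data standing in for the transformed usage features.
X, _ = make_blobs(n_samples=600, centers=4, cluster_std=0.4, random_state=0)
results = sweep_eps(X, np.linspace(0.05, 1.0, 20))

# Keep the eps that yields the largest number of clusters.
best_eps, best_n = max(results, key=lambda r: r[1])
print(best_eps, best_n)
```

Maximising the cluster count is the criterion described above; other criteria, such as limiting the share of outliers, would only change the `max(...)` line.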


Performing an LDA (Linear Discriminant Analysis) on the clusters confirms that they are mostly isolated, as shown in the following figure.
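
A check of this kind could be sketched as follows, assuming scikit-learn's `LinearDiscriminantAnalysis` with the DBSCAN labels as classes. Excluding the outliers (label -1), projecting on two components and using training accuracy as a rough separation score are choices made for this illustration; the data is synthetic.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def lda_projection(X, labels, n_components=2):
    """Project clustered points with LDA and report how well the
    discriminants separate the clusters (training accuracy)."""
    mask = labels != -1               # assumption: ignore DBSCAN outliers
    lda = LinearDiscriminantAnalysis(n_components=n_components)
    projected = lda.fit_transform(X[mask], labels[mask])
    accuracy = lda.score(X[mask], labels[mask])
    return projected, accuracy

# Three synthetic clusters in 4 dimensions plus 20 scattered outliers.
rng = np.random.default_rng(0)
centers = np.array([[0, 0, 0, 0], [5, 5, 0, 0], [0, 5, 5, 0]], dtype=float)
X = np.vstack(
    [c + 0.3 * rng.normal(size=(100, 4)) for c in centers]
    + [rng.uniform(-5, 10, size=(20, 4))]
)
labels = np.array([0] * 100 + [1] * 100 + [2] * 100 + [-1] * 20)

projected, accuracy = lda_projection(X, labels)
print(projected.shape, accuracy)
```

A scatter plot of `projected`, coloured by cluster label, gives the kind of picture described above.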


The next step is to make some automatic interpretation of those results. What characterises these users? If we look, for example, at one specific cluster and depict the in-cluster distribution in orange and the out-of-cluster distribution in green, we obtain the following figure.


One of the members of my team had the idea to apply the Kolmogorov-Smirnov statistic to those distributions (this is the number following the name of the feature). I added a small twist: if the mean of the in-cluster distribution is lower than the mean of the out-of-cluster distribution, the Kolmogorov-Smirnov statistic is negated. This way, the more significantly lower the in-cluster distribution is compared to the out-of-cluster distribution, the closer we get to -1. Conversely, if the in-cluster distribution is higher we get close to 1, and if both distributions are the same we get 0. Making a heat map based on that metric for each cluster, we get a good idea of what makes them unique, as in the following figure.
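
The signed statistic is short to write down with SciPy's two-sample test `ks_2samp`. In this sketch the function names and the synthetic data are illustrative, and treating every row outside the cluster (outliers included) as "out-of-cluster" is an assumption.

```python
import numpy as np
from scipy.stats import ks_2samp

def signed_ks(in_cluster, out_cluster):
    """KS statistic, negated when the in-cluster mean is the lower one."""
    stat = ks_2samp(in_cluster, out_cluster).statistic
    if np.mean(in_cluster) < np.mean(out_cluster):
        return -stat
    return stat

def signed_ks_matrix(X, labels):
    """One row per cluster, one column per feature, values in [-1, 1]."""
    clusters = sorted(c for c in np.unique(labels) if c != -1)
    scores = np.zeros((len(clusters), X.shape[1]))
    for row, c in enumerate(clusters):
        inside = labels == c
        for col in range(X.shape[1]):
            scores[row, col] = signed_ks(X[inside, col], X[~inside, col])
    return clusters, scores

# Two synthetic clusters: cluster 1 is higher on feature 0, lower on feature 1.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(200, 2)),
    rng.normal([3.0, -3.0], 1.0, size=(200, 2)),
])
labels = np.array([0] * 200 + [1] * 200)

clusters, scores = signed_ks_matrix(X, labels)
print(np.round(scores, 2))
```

The `scores` array is what would feed the heat map: rows are clusters, columns are features.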


Last automated step: if we put good thresholds on the lower and upper limits for that statistic, we can simply express those differences as +, – or nothing if the distributions are equivalent. We then get:

a = MSP_Google
b = MSP_Apple
c = MSP_SocMed
d = MSP_Mail
e = MSP_Video
f = MSP_Call
g = MSP_Purchase
h = MSP_Cloud
i = MSP_Prof

cl# a b c d e f g h i
 -1                   1089   <- The outliers
  0 - - - - - - - - - 10264
  1         + + + -   214
  2         - - - +   198
  3         + - + -   281
  4       - + - - +   203
  5 +   + + + + + + + 1200
  6         + - - +   1079
  7         + - - -   504
  8         + + - +   312
  9       + + - + + + 709
 10       + + - + +   1382
 11         + + - -   167
 12         + - - + + 204
 13       - + - - -   205
 14         - - - -   695
 15 +   +   + + + +   863
 16 +       + + - + + 179
 17         - - + -   150
 18         - + - -   102
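
Turning one row of signed statistics into a line of the table above is a simple thresholding step. The sketch below shows the idea; the threshold values of -0.3 and 0.3 and the example scores are made up for illustration and are not the ones used to produce the table.

```python
def to_symbol(score, low=-0.3, high=0.3):
    """Map a signed KS score to '+', '-' or a blank (hypothetical thresholds)."""
    if score >= high:
        return "+"
    if score <= low:
        return "-"
    return " "

def cluster_signature(scores, low=-0.3, high=0.3):
    """Render one cluster's scores (features a..i) as a row of symbols."""
    return " ".join(to_symbol(s, low, high) for s in scores)

# Made-up scores for the nine features of a single cluster.
row = [0.8, 0.1, 0.6, -0.05, 0.9, 0.7, 0.5, 0.4, -0.1]
print(repr(cluster_signature(row)))
```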

The last step was that we took a shot at naming those clusters (if you have a keen eye, you can see those names on a previous figure), stereotyping them for easier consumption. But in the end, what is interesting is that from a list of more than 150 features, we could extract 19 meaningful clusters of usage, and can now use them as a smaller set of features to characterise the users in other dimensions, e.g. data consumption per cluster, satisfaction per cluster, …