Unsupervised Distributions Clustering

Last time, I discussed how users’ service usage could be clustered using the Kolmogorov-Smirnov statistic (described on Wikipedia as the two-sample K-S test). This time around, for our latest project in the automotive area, I re-used the same principle to cluster speed and acceleration profiles. In fact, it generalises to clustering any set of distributions, and this is the process I will go through with you today. Basically, it allows us to perform unsupervised clustering on sets of distributions and find the natural clusters in those sets.

If you want to jump right into the code, you can head to my GitHub for this example. Otherwise, stick with me and I’ll explain the process with an example.

As a first step, I generated some data to play with. Here we deliberately create four distinct clusters of data. Each cluster is composed of a number (randomly selected between 10 and 50) of normal distributions. Each of these normal distributions is made of several elements (randomly selected between 5 and 50) whose characteristics (mean and standard deviation) are centred around randomly picked values, each with its own bias.

So, in the end, we have several sets of normally distributed values of variable length, each with its own specific characteristics, but polarising around four main clusters, as we can see in the following figure.

GenData1.PNG
Randomly generated set of normal distributions
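
The generation step described above could be sketched as follows. This is only an illustration, not the code from my repository: the function name `generate_distributions`, the ranges used for the cluster centres and the size of the per-distribution bias are all assumptions picked for the example.

```python
import numpy as np

def generate_distributions(n_clusters=4, seed=0):
    """Return a list of 1-D sample arrays and the cluster each one came from."""
    rng = np.random.default_rng(seed)
    samples, labels = [], []
    for cluster in range(n_clusters):
        # Cluster-level centre for mean and standard deviation (assumed ranges).
        centre_mean = rng.uniform(-10.0, 10.0)
        centre_std = rng.uniform(0.5, 2.0)
        for _ in range(rng.integers(10, 51)):      # 10 to 50 distributions per cluster
            mean = centre_mean + rng.normal(0.0, 0.3)            # per-distribution bias
            std = abs(centre_std + rng.normal(0.0, 0.1)) + 1e-3  # keep std strictly positive
            size = rng.integers(5, 51)             # 5 to 50 elements per distribution
            samples.append(rng.normal(mean, std, size))
            labels.append(cluster)
    return samples, np.array(labels)

samples, true_labels = generate_distributions()
```

Each entry of `samples` is one distribution of variable length, and `true_labels` keeps track of which of the four clusters it was generated from so the result of the clustering can be checked later.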

Now we calculate the Kolmogorov–Smirnov statistic for each pair of distributions. This forms our distance matrix, where 0 indicates two identical distributions and 1 indicates two completely different distributions. In our case, values range between 0 and 1, except on the diagonal, where they are all 0 (a distribution is the same as itself!).
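
A minimal sketch of that distance matrix, assuming SciPy's `ks_2samp` is used for the two-sample statistic. The helper name `ks_distance_matrix` and the three demo samples are made up for the illustration.

```python
import numpy as np
from scipy.stats import ks_2samp

def ks_distance_matrix(samples):
    """Symmetric matrix of two-sample K-S statistics between all pairs of samples."""
    n = len(samples)
    dist = np.zeros((n, n))                 # the diagonal stays at 0
    for i in range(n):
        for j in range(i + 1, n):
            d = ks_2samp(samples[i], samples[j]).statistic
            dist[i, j] = dist[j, i] = d     # the statistic is symmetric
    return dist

rng = np.random.default_rng(1)
demo = [rng.normal(0, 1, 30), rng.normal(0, 1, 40), rng.normal(8, 1, 30)]
D = ks_distance_matrix(demo)
```

In this small demo the first two samples come from the same distribution and end up close to each other, while the third one is far from both.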

We can use seaborn’s clustermap to visualise that distance matrix and see right away the hierarchical clustering within it.

DistMatrix1.PNG
Natural clustering of the KS distances

But for us the next step is to perform the hierarchical clustering of the Kolmogorov–Smirnov distances. Knowing we have four clusters, we cut the hierarchical tree at the depth that gives four clusters and tada! We recover our four clusters!

Clustered4Dist1.PNG
Recovered clusters
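
The clustering step could look like the sketch below, which uses SciPy's hierarchical clustering on the K-S distances and cuts the tree at a requested number of clusters. The choice of average linkage and the name `cluster_distributions` are assumptions made for this example, and other linkage methods may give different results.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from scipy.stats import ks_2samp

def cluster_distributions(samples, n_clusters, method="average"):
    """Cluster samples hierarchically, using the K-S statistic as the distance."""
    n = len(samples)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = ks_2samp(samples[i], samples[j]).statistic
    # linkage() expects the condensed (upper-triangle) form of the distance matrix.
    tree = linkage(squareform(dist, checks=False), method=method)
    # Cut the tree so that at most n_clusters clusters remain.
    return fcluster(tree, t=n_clusters, criterion="maxclust")

rng = np.random.default_rng(2)
demo = [rng.normal(0, 1, 40) for _ in range(5)] + [rng.normal(10, 1, 40) for _ in range(5)]
demo_labels = cluster_distributions(demo, n_clusters=2)
```

The number of clusters has to be given explicitly here, which is exactly the limitation discussed in the rest of this post.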

But what if we had tried to look for two clusters… or six clusters? Well, we would have found clusters, but as you can see below, some should obviously have been split further, or merged.

Just for the sake of it, let’s try with a different set of randomly generated distributions. Again, we generate four clusters, but this time they are overlapping in pairs.

GenData2.PNG
Another randomly generated set of normal distributions

We calculate again the Kolmogorov-Smirnov metric as a distance between the different distributions and we visualise the results with seaborn. We can see that this time around there should be only two clusters.

DistMatrix2.PNG
Natural clustering of the KS distances

Let’s cluster them into four clusters anyway and see if it finds the original groups of distributions.

Clustered4Dist2.PNG
Recovered clusters

Well, it is close, even with such overlapping distributions. We can also play the what if game here and see what happens if we look for two or six clusters of distributions.

Easy to see that most probably two clusters would make more sense here.

In summary, the Kolmogorov-Smirnov statistic can be used as a distance measure between distributions. I showed in my previous post how this can be applied to user service usage segmentation, I recently used it to cluster driving profiles (speed and acceleration profiles of cars), and in this post I generalise it to any distributions of your choice. Once the distance matrix is computed, you can cluster it. Here I used hierarchical clustering, but another algorithm could be used, and I’ll most probably experiment with others in the future. Also, in this example, you must specify how many clusters you expect. Again, we could further develop the process to automatically detect that “optimal” number of clusters, maybe using DBSCAN or HDBSCAN. Another thing I should try.

If you apply this technique to your data, do not be shy and let me know how well it went!


User segmentation through unsupervised clustering

For one of our current engagements, our team came up with a nice way to perform meaningful user segmentation based on unsupervised clustering. In this post, I want to share a little bit of the process; it might be useful to somebody else with a similar problem.

A bit of background first. We have access to the internet usage data of users. This data basically states the number of times a user accessed certain pre-configured “service providers”, e.g. Google, Apple, Amazon, DropBox, … In some cases, the granularity is finer, e.g. iMessage, Apple Push Notification Service, Apple, iTunes, … All in all, we have over 150 such service providers defined. Now, the objective is to get meaningful clusters of usage out of this data.

The first step was to look at the correlation between those services and see if some obvious ones could be merged. For that I used a trick I showed before about clustering a correlation matrix and ended up with something like the following figure.

sp_corr_matrix

Obviously not so much correlation… Then, by looking at the service providers themselves, I could see that some are of lower interest in our case. We want to look at user usage patterns, so if a user gets redirected to an analytics or advertisement site, it doesn’t really indicate their usage; it is a consequence, not a cause. So, I removed all Analytic and Advertisement service providers. The same goes for virus protection and a few other smaller categories. After that feature clean-up, I still had close to 110 features… I then decided to introduce the concept of “mega” service providers, which join together similar service providers. In the end, I got the following mega service providers:

  • Google
  • Apple
  • Social Media
  • eMail
  • Video Streaming
  • Over-the-top Call
  • Online Shopping
  • Music Streaming
  • Cloud solutions
  • Health applications
  • News
  • Business applications
  • Gaming
  • Asian sites
  • General information sites
  • Russian sites
  • Peer-2-Peer
  • Dating
  • Porn

As interesting as those might be, the goal is not to micro-categorise people, but to get meaningful broad categories. So, by applying the whole process I describe here a few times, we could remove a number of those mega service providers (features) as being of less importance for broadly categorising people. In the end I kept 9 features:

  • Google
  • Apple
  • Social Media
  • eMail
  • Video Streaming
  • Over-the-top Call
  • Online Shopping
  • Cloud solutions
  • Business applications

Now if you look at those features, you will notice that their distributions are heavily Poisson-like. I’d like to “normalise” them a bit. So, through an iterative process, I found out that for most of them, raising the feature to a fractional exponent really helped a lot. For example, MSP_Google is raised to the power 0.2, MSP_Apple to the power 0.5, MSP_Mail to the power 0.05, etc. We now get somewhat more normal-looking curves, as shown in the following figure.
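
As a rough illustration of that transformation, the sketch below raises each feature to its own exponent and compares the skewness before and after. The three exponents are the ones quoted above; the helper names and the synthetic data (an exponential distribution standing in for the real, skewed usage counts) are assumptions for the example.

```python
import numpy as np

# Exponents quoted in the text; any feature not listed is left untouched.
exponents = {"MSP_Google": 0.2, "MSP_Apple": 0.5, "MSP_Mail": 0.05}

def power_transform(features, exponents):
    """Raise each (non-negative) feature to its own fractional exponent."""
    return {
        name: np.power(np.asarray(values, dtype=float), exponents.get(name, 1.0))
        for name, values in features.items()
    }

def skewness(values):
    """Sample skewness: 0 for a symmetric distribution, positive for a long right tail."""
    values = np.asarray(values, dtype=float)
    centred = values - values.mean()
    return (centred ** 3).mean() / (centred ** 2).mean() ** 1.5

rng = np.random.default_rng(0)
raw = {"MSP_Apple": rng.exponential(scale=50.0, size=5000)}  # synthetic stand-in data
transformed = power_transform(raw, exponents)
skew_before = skewness(raw["MSP_Apple"])
skew_after = skewness(transformed["MSP_Apple"])
```

Comparing `skew_before` and `skew_after` for a few candidate exponents is one simple way to run the iterative search; I did it mostly by looking at the histograms.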

Now we can perform the clustering. I used DBSCAN since it is based on density. But which density to use? I made several trials and concentrated on the number of clusters found. In my case I wanted to maximise the number of clusters. I got the following figure, in which the maximum of 19 clusters is reached with a density of 0.05.

ClustersVSDensity
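
The search for the density could be sketched like this, assuming scikit-learn's `DBSCAN` and assuming that the "density" above corresponds to its `eps` parameter. The `min_samples` value, the helper name and the toy data are made up for the illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def count_clusters(X, eps_values, min_samples=5):
    """Number of clusters (outliers excluded) found by DBSCAN for each eps value."""
    counts = {}
    for eps in eps_values:
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        counts[eps] = len(set(labels.tolist()) - {-1})   # -1 marks the outliers
    return counts

# Toy data: three tight, well separated blobs in two dimensions.
rng = np.random.default_rng(0)
centres = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]
X = np.vstack([rng.normal(c, 0.02, size=(50, 2)) for c in centres])

counts = count_clusters(X, eps_values=[0.0001, 0.1, 3.0])
best_eps = max(counts, key=counts.get)   # eps giving the largest number of clusters
```

On the toy data a very small `eps` finds no cluster at all, a very large one merges everything into a single cluster, and the value in between recovers the three blobs.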

Performing an LDA (Linear Discriminant Analysis) on the clusters confirms that they are mostly isolated, as shown in the following figure.

UserSegLDA
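
For reference, a projection of this kind can be obtained with scikit-learn's `LinearDiscriminantAnalysis`, using the cluster labels as classes. This is a sketch under assumptions: the helper name, the choice to drop the DBSCAN outliers before fitting and the toy data are not taken from the original analysis.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def project_clusters(X, labels, n_components=2):
    """Project clustered points with LDA, ignoring the outliers (label -1)."""
    labels = np.asarray(labels)
    keep = labels != -1
    lda = LinearDiscriminantAnalysis(n_components=n_components)
    return lda.fit_transform(X[keep], labels[keep]), labels[keep]

# Toy data: three clusters in four dimensions.
rng = np.random.default_rng(0)
centres = np.array([[0, 0, 0, 0], [3, 3, 0, 0], [0, 3, 3, 0]], dtype=float)
X = np.vstack([rng.normal(c, 0.3, size=(50, 4)) for c in centres])
labels = np.repeat([0, 1, 2], 50)
labels[:5] = -1                      # pretend a few points were flagged as outliers

projected, kept_labels = project_clusters(X, labels)
```

Plotting the two columns of `projected`, coloured by `kept_labels`, gives a scatter plot in which well isolated clusters show up as separate groups of points.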

The next step is to make some automatic interpretation of those results. What is characterising these users? If we look for example at one specific cluster and depict the in-cluster distribution as orange and the out-of-cluster distribution as green we obtain the following figure.

AClusterDist

One of the members of my team had the idea to apply the Kolmogorov-Smirnov statistic to those distributions (this is the number following the name of the feature). I added a small twist: if the mean of the in-cluster distribution is lower than the mean of the out-of-cluster distribution, the Kolmogorov-Smirnov statistic is negated. This way, the more significantly lower the in-cluster distribution is compared to the out-of-cluster distribution, the closer we get to -1. Conversely, if the in-cluster distribution is higher we get close to 1, and if both distributions are the same we get 0. Making a heat map based on that metric for each cluster, we get a good idea of what makes them unique, as in the following figure.

KSStat
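
The signed statistic is simple enough to sketch in a few lines, assuming SciPy's `ks_2samp`. The function name `signed_ks` and the demo values are made up for the example.

```python
import numpy as np
from scipy.stats import ks_2samp

def signed_ks(in_cluster, out_cluster):
    """K-S statistic, negated when the in-cluster mean is below the out-of-cluster mean."""
    stat = ks_2samp(in_cluster, out_cluster).statistic
    return -stat if np.mean(in_cluster) < np.mean(out_cluster) else stat

rng = np.random.default_rng(0)
low = rng.normal(0.0, 1.0, 200)
high = rng.normal(3.0, 1.0, 200)

lower_than_rest = signed_ks(low, high)    # negative: the cluster sits below the rest
higher_than_rest = signed_ks(high, low)   # positive: the cluster sits above the rest
same_as_rest = signed_ks(low, low)        # 0: identical distributions
```

Computing this value for every feature of every cluster gives the matrix that is displayed as a heat map.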

Last automated step: if we put good thresholds on the lower and higher limits for that statistic, we can simply express those differences as +, – or nothing if the distributions are equivalent. We then get:

a = MSP_Google
b = MSP_Apple
c = MSP_SocMed
d = MSP_Mail
e = MSP_Video
f = MSP_Call
g = MSP_Purchase
h = MSP_Cloud
i = MSP_Prof

cl# a b c d e f g h i
 -1                   1089   <- The outliers
  0 - - - - - - - - - 10264
  1         + + + -   214
  2         - - - +   198
  3         + - + -   281
  4       - + - - +   203
  5 +   + + + + + + + 1200
  6         + - - +   1079
  7         + - - -   504
  8         + + - +   312
  9       + + - + + + 709
 10       + + - + +   1382
 11         + + - -   167
 12         + - - + + 204
 13       - + - - -   205
 14         - - - -   695
 15 +   +   + + + +   863
 16 +       + + - + + 179
 17         - - + -   150
 18         - + - -   102
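
Turning the signed statistics into a row of symbols like the ones in the table above could be done as follows. The threshold values of -0.3 and 0.3 are placeholders chosen for the illustration, not the thresholds actually used, and `cluster_signature` is a hypothetical helper name.

```python
def cluster_signature(signed_stats, low=-0.3, high=0.3):
    """Map each signed K-S statistic to '+', '-' or a blank, one symbol per feature."""
    symbols = []
    for value in signed_stats:
        if value <= low:
            symbols.append("-")     # clearly lower than the rest of the users
        elif value >= high:
            symbols.append("+")     # clearly higher than the rest of the users
        else:
            symbols.append(" ")     # no notable difference
    return " ".join(symbols)

row = cluster_signature([0.5, 0.0, -0.6])
```

Here the first feature is marked as higher, the second one is left blank and the third one is marked as lower.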

The last step was that we gave a shot at naming those clusters (if you have a keen eye, you can see those names on a previous figure), stereotyping them for easier consumption. But in the end, what is interesting is that from a list of more than 150 features, we could extract 19 meaningful clusters of usage and can now use them as a smaller set of features to characterise the users in other dimensions, e.g. data consumption per cluster, satisfaction per cluster, …

Temporal word cloud: into the future

Let me continue on the word cloud craziness. If you read my last blog entry, you will have seen how I made a series of word clouds which show the evolution of the words I use in my blog over time. I also colour coded the words by theme to show how my blog reflects the tasks I am assigned to. Over time it switched from researching ideas for the telecom sector, to developing a Cloud framework and lately to Machine Learning and Data Science.

I am using the same colour scheme that was explained in the previous blog entry, but instead of having one window for each year, I have a window of about a year that moves forward by increments of 5 days. I end up with more than 700 word cloud images that I can assemble into a short one-minute video. And here is the result.

It would be nice if I could make it so that important words move even less in the word cloud, as the result is still pretty jumpy, but it gives a feel of the change in time, probably even better than the previous still images.
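
The sliding window itself is easy to reproduce. Here is a minimal sketch that only computes the window boundaries, assuming a window of exactly 365 days; the function name and the demo dates are made up, and the word cloud generation for each window is left out.

```python
from datetime import date, timedelta

def sliding_windows(start, end, window_days=365, step_days=5):
    """List of (window_start, window_end) pairs moving forward step_days at a time."""
    windows = []
    window_start = start
    while window_start + timedelta(days=window_days) <= end:
        windows.append((window_start, window_start + timedelta(days=window_days)))
        window_start += timedelta(days=step_days)
    return windows

windows = sliding_windows(date(2012, 1, 1), date(2013, 1, 10))
```

Each pair would then be used to select the posts published inside that window and to generate one frame of the video.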

From there, one potential application would be to have a few categories of words coming out of a sentiment analysis, e.g. happy, sad, … and colour those words in the word cloud accordingly. Then you could take a feed of information and generate a word cloud video as I did, which would show the evolution of the writer’s sentiments over time. Wouldn’t it be nice?

Temporal word cloud or a recursive blog story

No, I have not become totally crazy, as the title of this blog might let you think and as some people might say. From my latest experiments with word clouds came the idea of applying it to my blog. Some might think that a dozen or so posts are not enough, and they would be right. However, my blogging habit didn’t start here on WordPress and I have in the past 5 years or so about a hundred more postings on an Ericsson internal blog (conveniently also named TheLoneNut). Thus, if I sum up my internal presence and my newly external presence, I have written about 50k words in 5 years. According to Wikipedia, this puts my writings in the size range of a novel. Quite enough for a good word cloud experiment.

Where possible I have split the words in subject buckets:

  • Yellow: Having new ideas,
  • Red: Learning,
  • Dark Green: People,
  • Light Blue: Cloud Computing,
  • Dark Blue: Ericsson and core telecom subjects,
  • Purple: Programming,
  • Light Green: Data Science, Machine Learning, …

The rest stays grey. Some of the grey words could make it into one of the subject buckets, however they are sometimes used across different subjects, so I preferred to keep them grey. So without further ado, let’s take a look at the overall word cloud for my 5 years of blogging.

overallBlogWC
Overall word cloud of five years of TheLoneNut blogging.

From the overall word cloud, we can see a few trends. A lot of having new ideas, a good level of cloud, a fair level of programming, telecommunication and people and a little bit of Data Science… But we can do more. An interesting aspect of blogging, compared to a novel, is that it strongly embeds a temporal component. Posts are distributed through time in a semi-uniform fashion. Thus we can make an evolving word cloud, starting five years ago and coming to today. Let’s have a look.

BlogWC5
TheLoneNut blogging five years ago.

In 2012, I started a new role at Ericsson. Something more research oriented, and I had to come up with some good research subjects. As the word cloud shows, lots of ideation and idea exploration. Ideas were mostly oriented at Ericsson core telecommunication aspects. A small team and the need for help from the others. We can also see a little bit of trials in Genetic Algorithms (and Genetic Programming) and Data Science in general.

BlogWC4
… four years ago.

As time passed, I still went through a lot of ideation and it was still focused on Ericsson core telecommunication aspects, but a lot more programming came in, as we were trying out and implementing some of those ideas. The team and the help from others are still present, and we can see the need as well to learn new things, to figure out if some of the problems we were facing were already fixed somehow by someone else.

BlogWC3
… three years ago.

Going forward, we see that the main research focus became the cloud (not the word cloud!). The core telecommunication aspects became secondary. With this clear research agenda, I started in parallel a Master in Computer Engineering on that new found subject of the Telecommunication Cloud.

BlogWC2
… two years ago.

With time, the research project got a name: Unity (not the 3D framework) and a specialisation around the Actor Model (from computer science). A lot of programming for my team and me.

BlogWC1
TheLoneNut blogging in the last year.

About a year ago, the Master thesis and the Unity project were concluded and I took on a new role. We can see it from the word cloud, where the cloud is shrinking and Data Science related topics are on the rise. Still a lot of ideation required, but not as much as in the early days. The core telecommunication aspects are pretty much off the radar as Machine Learning and Cloud are taking the floor of my interests.

Hope you found this word cloud analysis of my five years of blogging interesting. But more importantly I hope you see how a temporal word cloud analysis can reveal interesting insights!

Now for the recursive part… let’s add this blog entry and see if it changes anything for this year’s blogging word cloud…

BlogRecursive

Well, not much for a single blog entry (as expected)… so much for recursive word clouding!

Correlation Matrix Clustering

Still in pursuit of a better understanding of the cellular service satisfaction survey I mentioned earlier, I came upon another interesting problem to solve. I have encoded most of the responses to a cellular service survey answered by around a thousand users and created a pandas data frame with them. The columns have then been scaled with min-max to the 0..1 range. Now I want to see if any of the variables (questions in the survey) are correlated with any other variable.

Using python, the answer is quite simple: pandas provides the function corr(), which builds the correlation matrix from a data frame. You can then use pyplot matshow() in order to visualise that correlation matrix. Below is an example of such a visualisation I made on the data set I have. Thankfully, in this example there is already a certain level of clustering that comes from the order in which the data was imported, lucky me! But if you take a random example, such as the one I built to demonstrate this process in a jupyter notebook on github, then there is not much to understand without an effort.

CorrolationMatrix1
Correlation Matrix of a survey answers data frame.
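
In code, the scaling and the correlation matrix come down to a couple of lines of pandas. The data frame below is synthetic and its column names are invented for the example; it is not the survey data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)
df = pd.DataFrame({
    "q1": base,
    "q2": base + rng.normal(scale=0.1, size=200),    # strongly correlated with q1
    "q3": -base + rng.normal(scale=0.1, size=200),   # inversely correlated with q1
    "q4": rng.normal(size=200),                      # unrelated to the others
})

# Min-max scaling of every column to the 0..1 range.
scaled = (df - df.min()) / (df.max() - df.min())

# Pairwise (Pearson) correlation between the columns.
corr = scaled.corr()
```

Passing `corr` to `matplotlib.pyplot.matshow()` then draws the matrix as an image.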

Obtaining the correlation matrix was easy. What is less obvious at first glance is how to cluster that correlation matrix in order to get better and easier understanding of our data.

You can follow the process in my jupyter notebook, but basically it involves performing hierarchical clustering on the correlation matrix and tada! You obtain a clustered correlation matrix such as below.

CorrolationMatrix2
Clustered Correlation Matrix of a survey answers data frame.
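
One possible way to do this with SciPy is sketched below: the rows of the correlation matrix are clustered hierarchically and the columns are then reordered so that members of the same cluster sit next to each other. The complete linkage, the cut at half of the largest distance and the name `cluster_corr` are assumptions for this sketch; refer to the notebook for the exact procedure.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

def cluster_corr(corr, threshold_ratio=0.5):
    """Reorder a correlation matrix so that similar variables are adjacent."""
    pairwise = pdist(corr.values)                 # distances between rows of the matrix
    tree = linkage(pairwise, method="complete")
    labels = fcluster(tree, threshold_ratio * pairwise.max(), criterion="distance")
    order = np.argsort(labels, kind="stable")     # group the columns cluster by cluster
    return corr.iloc[order, order], labels[order]

# Toy data: two groups of correlated columns, interleaved on purpose.
rng = np.random.default_rng(0)
base_a, base_b = rng.normal(size=500), rng.normal(size=500)
df = pd.DataFrame({
    "a1": base_a + rng.normal(scale=0.1, size=500),
    "b1": base_b + rng.normal(scale=0.1, size=500),
    "a2": base_a + rng.normal(scale=0.1, size=500),
    "b2": base_b + rng.normal(scale=0.1, size=500),
})
clustered, cluster_ids = cluster_corr(df.corr())
```

After the reordering, the two "a" columns and the two "b" columns end up side by side, which is what produces the visible blocks in the clustered figure.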

It could end there. Obviously in my simple example in the jupyter notebook there is not much more to gain in terms of clarity. However, on the real data we can sense there is more to see. I can go forward and do a multi-pass clustering.

  • Cluster the correlation matrix.
  • For each cluster:
    • Sub-cluster the cluster.

Doing this yields the following clustering, which is marginally better as we can now see some sub-clustering within the big clusters.

CorrolationMatrix3
Multi-pass Clustering of a Correlation Matrix of a survey answers data frame.

This process could be extended to n-pass correlation matrix clustering. It could be done through a recursive process which would stop when you reach a minimum size sub-cluster.
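
A recursive version could look like the following sketch. It returns only the column order, re-clusters each cluster on its own sub-matrix and stops when a sub-cluster is small enough. The stopping size, the linkage method, the cut threshold and the function name are all assumptions for the illustration.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

def multipass_order(corr, min_size=4, threshold_ratio=0.5):
    """Column order obtained by clustering, then recursively sub-clustering."""
    names = list(corr.columns)
    if len(names) <= min_size:
        return names                               # small enough: stop recursing
    pairwise = pdist(corr.values)
    if pairwise.max() == 0:
        return names                               # all rows identical: nothing to split
    tree = linkage(pairwise, method="complete")
    labels = fcluster(tree, threshold_ratio * pairwise.max(), criterion="distance")
    if len(set(labels.tolist())) == 1:
        return names                               # no split found: stop recursing
    ordered = []
    for label in sorted(set(labels.tolist())):
        members = [n for n, l in zip(names, labels) if l == label]
        # Recurse on the sub-matrix restricted to this cluster's variables.
        ordered.extend(multipass_order(corr.loc[members, members], min_size, threshold_ratio))
    return ordered

# Toy data: two groups of three correlated columns, interleaved on purpose.
rng = np.random.default_rng(0)
base_a, base_b = rng.normal(size=500), rng.normal(size=500)
df = pd.DataFrame({
    f"{group}{i}": base + rng.normal(scale=0.1, size=500)
    for i in (1, 2, 3)
    for group, base in (("a", base_a), ("b", base_b))
})
order = multipass_order(df.corr(), min_size=2)
```

The clustered matrix is then obtained with `df.corr().loc[order, order]`.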

So what can we see in this last figure? Well, some obvious findings are that experiencing poor coverage areas or having the desire to switch to another carrier is inversely correlated with satisfaction with the service. We can also see that the more devices you have, the more you report using services. It appears there is no correlation between service usage (how much you use it) and the satisfaction expressed toward it. There is a slight inverse correlation between the age, and to a lesser degree the sex, of the respondent and the usage they make of the service (again, quantity). And thanks to the multi-pass clustering, we can also see in the satisfaction quadrant that voice and texting satisfaction are more tied together and form a sub-cluster, while data satisfaction forms another sub-cluster.

There are a few more findings you can see from that picture. I’ll let you find them as an exercise!

10 Years of Ericsson Patents

ericssonpatents1
10 years of Ericsson Inventors

I got interested in a recent blog post, The Real Difference Between Google and Apple, and started wondering: what about Ericsson?

Well, the USPTO data is available for all to grab, so I headed there and downloaded some files of interest, namely the assignee, inventor, patent, patent_assignee and patent_inventor data files.

The first step, as always, is to process the data so that we can make use of it. In this case I wanted to get used to Gephi, so I created a nodes.csv containing the list of inventors and an edge.csv containing the relationships between these inventors, i.e. who has a patent in common with whom. This process is handled by a Jupyter notebook I created, which is available on GitHub. For some reason it doesn’t display properly on GitHub, but you can still clone the repo and make use of it.
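
The core of that processing is to turn (patent, inventor) pairs into inventor-to-inventor edges. Here is a minimal sketch using only the standard library; the function name and the tiny demo rows are made up, and the real patent_inventor file would first need to be read and filtered down to the patents of interest.

```python
from collections import Counter
from itertools import combinations

def co_inventor_edges(patent_inventor_rows):
    """Count, for every pair of inventors, how many patents they share."""
    inventors_by_patent = {}
    for patent_id, inventor_id in patent_inventor_rows:
        inventors_by_patent.setdefault(patent_id, set()).add(inventor_id)
    edges = Counter()
    for inventors in inventors_by_patent.values():
        # Sorting gives each undirected pair a single, consistent key.
        for source, target in combinations(sorted(inventors), 2):
            edges[(source, target)] += 1
    return edges

rows = [("p1", "A"), ("p1", "B"), ("p2", "A"), ("p2", "B"), ("p2", "C"), ("p3", "D")]
edges = co_inventor_edges(rows)
```

Each key of `edges` can then be written as one line of the edge file, with the count as a weight. Inventors with no co-inventor, such as "D" here, only appear in the node file.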

The next step was to import that data in Gephi and create a relationship graph with some heavy use of the “Force Atlas 2” layout algorithm. I then proceeded further by performing a statistical Modularity analysis and using it as the basis to display communities within the inventors. So what you see above are the inventors. The bigger the dot, the more heavily linked that person is (maybe I should eventually change it so the size represents the number of patents awarded). The bluish lines are the connections between the inventors. And the bubble colour represents a “community” of inventors.

The end result, I think, shows a middle ground between the centralised development structure seen at Apple and the distributed approach seen at Google. From this graph, Ericsson would appear to have some strong centralised communities of invention, but also a large number of independent or small-group inventions that we can see at the periphery (left in grey). Some of the main communities of invention are intertwined and some are more well defined.

Well, without going into the details of the internal structure and ways of working at Ericsson, this is where the analysis will stop.

UPDATE (2017-03-01): I have updated the picture in order for the size of the bubbles to be proportional to the number of patents generated by an inventor. Also, below is the same kind of picture made for Huawei patents, 10 years’ worth of patents.

huaweipatents
10 years of Huawei Inventors
