Correlation Matrix Clustering

Still in pursuit of a better understanding of the cellular service satisfaction survey I mentioned earlier, I came upon another interesting problem to solve. I have encoded most of the responses to a cellular service survey answered by around a thousand users and created a pandas data frame with them. The columns have then been min-max scaled to the 0..1 range. Now I want to see if any of the variables (questions in the survey) are correlated with any other variable.
Using Python, the answer is quite simple: pandas provides the function corr(), which builds the correlation matrix from a data frame. You can then use pyplot's matshow() to visualise that correlation matrix. Below is an example of such a visualisation made on the data set I have. Thankfully, in this example there is already a certain level of clustering that comes from the order in which the data was imported, lucky me! But if you take a random example, such as the one I built to demonstrate this process in a Jupyter notebook on GitHub, then there is not much to understand without some effort.

Correlation Matrix of a survey answers data frame.
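
Here is a minimal sketch of those two steps. It does not use the survey data: the data frame is random stand-in data, and the column names q0 to q5 as well as the artificial link between q0 and q3 are assumptions made only so that something shows up in the matrix.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Hypothetical stand-in for the encoded survey: 1000 respondents, 6 questions.
answers = pd.DataFrame(rng.random((1000, 6)), columns=[f"q{i}" for i in range(6)])
# Make q3 depend on q0 so that at least one pair is visibly correlated.
answers["q3"] = 0.8 * answers["q0"] + 0.2 * answers["q3"]

# Min-max scale every column to the 0..1 range.
scaled = (answers - answers.min()) / (answers.max() - answers.min())

# Pearson correlation between every pair of columns (the pandas default).
corr = scaled.corr()

# Draw the matrix; a fixed -1..1 colour range keeps plots comparable.
fig, ax = plt.subplots()
image = ax.matshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
fig.colorbar(image)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=90)
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(corr.columns)
```

In a notebook the figure is displayed automatically; in a script you would finish with plt.show() or fig.savefig(...).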

Obtaining the correlation matrix was easy. What is less obvious at first glance is how to cluster that correlation matrix in order to get a better and easier understanding of our data.

You can follow the process in my jupyter notebook, but basically it involves performing hierarchical clustering on the correlation matrix and tada! You obtain a clustered correlation matrix such as below.

Clustered Correlation Matrix of a survey answers data frame.
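
One possible way to do this with SciPy is sketched below. It is not necessarily what the notebook does: the function name cluster_corr, the complete linkage method and the cut at half of the largest distance are all assumptions chosen for illustration, and the toy data (two groups of noisy copies of a common signal) is made up.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist


def cluster_corr(corr, threshold_ratio=0.5):
    """Reorder a correlation matrix so that similar variables sit together."""
    # Each row of the matrix is treated as the coordinates of one variable.
    distances = pdist(corr.to_numpy())
    tree = linkage(distances, method="complete")
    # Cut the tree at a fraction of the largest distance (an arbitrary choice).
    labels = fcluster(tree, threshold_ratio * distances.max(), criterion="distance")
    order = np.argsort(labels, kind="stable")
    return corr.iloc[order, order], labels[order]


# Toy data: columns a1..a3 follow one signal, b1..b3 another, interleaved.
rng = np.random.default_rng(1)
base_a, base_b = rng.normal(size=(2, 500))
noise = rng.normal(scale=0.3, size=(500, 6))
data = pd.DataFrame(
    {"a1": base_a, "b1": base_b, "a2": base_a,
     "b2": base_b, "a3": base_a, "b3": base_b}
) + noise

clustered, labels = cluster_corr(data.corr())
print(list(clustered.columns))  # the a* and b* columns end up grouped
```

The reordered matrix can be passed to matshow() exactly like the original one.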

It could end there. Obviously, in my simple example in the Jupyter notebook there is not much more to gain in terms of clarity. However, on the real data we can sense there is more to see. I can go further and do a multi-pass clustering.

  • Cluster the correlation matrix.
  • For each cluster:
    • Sub-cluster the cluster.

Doing this yields the following clustering, which is marginally better as we can better see some sub-clustering within the big clusters.

Multi-pass Clustering of a Correlation Matrix of a survey answers data frame.

This process could be extended to n-pass correlation matrix clustering. It could be done through a recursive process which would stop when you reach a minimum size sub-cluster.
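
A rough sketch of such a recursive process follows. The function name recursive_order, the min_size and threshold_ratio parameters and the toy data are hypothetical; the stopping rule is simply "stop when a cluster has fewer than min_size variables".

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist


def recursive_order(corr, min_size=3, threshold_ratio=0.5):
    """Order the variables by clustering, then sub-clustering each cluster."""
    names = list(corr.columns)
    if len(names) < max(min_size, 2):
        return names  # too small to split further
    distances = pdist(corr.to_numpy())
    if distances.max() == 0:
        return names  # all rows identical, nothing to separate
    tree = linkage(distances, method="complete")
    labels = fcluster(tree, threshold_ratio * distances.max(), criterion="distance")
    if len(set(labels)) == 1:
        return names  # no further structure at this threshold
    ordered = []
    for label in sorted(set(labels)):
        members = [n for n, l in zip(names, labels) if l == label]
        # Recurse on the correlation sub-matrix of this cluster only.
        ordered.extend(recursive_order(corr.loc[members, members],
                                       min_size, threshold_ratio))
    return ordered


# Toy data: group "a" contains two sub-groups (ax*, ay*), group "b" has none.
rng = np.random.default_rng(2)
n = 1000
base_a, base_b, sub_x, sub_y = rng.normal(size=(4, n))


def noisy(signal):
    return signal + rng.normal(scale=0.3, size=n)


data = pd.DataFrame({
    "b1": noisy(base_b), "ax1": noisy(base_a + 0.5 * sub_x),
    "ay1": noisy(base_a + 0.5 * sub_y), "b2": noisy(base_b),
    "ax2": noisy(base_a + 0.5 * sub_x), "ay2": noisy(base_a + 0.5 * sub_y),
    "b3": noisy(base_b),
})

corr = data.corr()
order = recursive_order(corr)
multi_pass = corr.loc[order, order]  # matrix reordered by the n-pass clustering
```

Each level only looks at the correlations inside the current cluster, which is what lets the smaller sub-structures emerge.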

So what can we see in this last figure? Some obvious findings are that experiencing poor coverage areas, or wanting to switch to another carrier, is inversely correlated with satisfaction with the service. We can also see that the more devices you have, the more you report using services. It appears there is no correlation between service usage (how much you use it) and the satisfaction expressed toward it. There is a slight inverse correlation between the age, and to a lesser degree the sex, of the respondent and the usage they make of the service (again, quantity). And thanks to the multi-pass clustering we can also see in the satisfaction quadrant that voice and texting satisfaction are more closely tied together and form a sub-cluster, while data satisfaction forms another sub-cluster.

There are a few more findings you can see in that picture. I will leave them for you to find as an exercise!

Categorical variables encoding

A post from a colleague at Ericsson on Yammer brought me back to a question I had kind of left aside for a while: how to best encode categorical data. I dug up an old blog post by Will McGinnis which covers a lot of approaches, Beyond one-hot: an exploration of categorical variable, and wanted to expand on it. Or at the least clarify my mind around the subject!

Categorical variables are the ones that, instead of being continuous, can only take a finite number of values. Some examples could be:

  • Shirt sizes (XS, S, M, L, XL, …)
  • City of origin (Montréal, Québec, Vancouver, …)
  • Age group (0-15, 16-20, 21-24, 25-35, …)
  • Username of a contributor (Sheldon, Leonard, Raj, Howard, Stuart, …)

Well, anything as long as it is finite and can be completely expressed as a list. The problem with that type of data is that machine learning algorithms usually like numbers, so we need to express the categories as numbers. But this step is not so simple because, in the list above, there are two types of categorical variables: ordinal data and nominal data. From the examples above we can classify the shirt sizes and the age groups as ordinal data. The categories in that case imply a growing order, and if you encode the values following that order you get a semi-continuous spread of data. Again from the examples above, city of origin and username are nominal data. You cannot assume that Sheldon comes before or after Leonard; there is no way to put them in an increasing order. And no, making the list alphabetical does not make it continuous. The alphabetical order is an artificial order and does not reflect any order in the categories themselves. Sometimes you can get away with making those ordinal for some specific use cases, e.g. the name of the city could be encoded as the size (in inhabitants) of that city, or the contributor name could be encoded as the number of contributions made by that person, but it takes some analysis to figure out whether that is appropriate for a specific usage.
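
To make the distinction concrete, here is a small pandas sketch. The values are made up: the shirt sizes get integer codes that follow their natural order, while the contributor names are replaced by their number of contributions, which is one of the use-case specific tricks mentioned above.

```python
import pandas as pd

# Ordinal data: the order of the categories is meaningful, so we state it.
shirts = pd.Series(["M", "XS", "XL", "S", "L"])
size_order = ["XS", "S", "M", "L", "XL"]
shirt_codes = pd.Categorical(shirts, categories=size_order, ordered=True).codes
print(list(shirt_codes))  # [2, 0, 4, 1, 3]

# Nominal data: without an explicit order pandas sorts alphabetically,
# which gives codes that carry no real meaning.
commits = pd.Series(["Sheldon", "Leonard", "Sheldon", "Raj", "Sheldon", "Leonard"])
alphabetical_codes = pd.Categorical(commits).codes
print(list(alphabetical_codes))  # [2, 0, 2, 1, 2, 0]

# A use-case specific alternative: encode each contributor by the number
# of contributions they made (hypothetical commit log).
contribution_counts = commits.map(commits.value_counts())
print(contribution_counts.tolist())  # [3, 2, 3, 1, 3, 2]
```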

So going back to Will’s blog, one thing which is not explicitly mentioned is that since categorical variables are not all created equal, your choice of encoding is limited to what is possible for that variable. Hence the conclusion that binary coding consistently performs well must be taken with a grain of salt. Binary coding applies to ordinal data and is not appropriate for nominal data. You can use any categorical encoding on ordinal data, but you cannot use an ordinal encoding on nominal data… So what is missing is a classification of the encoders into the general categorical or the ordinal variety. Well, I could also state a third variety, where categorical encoders can be used on ordinal data but need special attention.

Ordinal encoders (cannot be used on nominal data):

  • Ordinal
  • Binary

Categorical encoders (work on all types of categorical variables):

  • One-Hot (or Dummy)
  • Simple
  • Sum (Deviation)
  • Backward Difference

Categorical encoders which require special attention when used on nominal data:

  • Helmert
  • Orthogonal Polynomial

Explanations of those encoders are well detailed in Will’s blog and, as he points out, in the documentation for statsmodels.
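
For a feel of what some of these encodings produce, here is a small hand-rolled sketch using only pandas and numpy (not the encoders from Will’s post). The shirt-size data is made up, and the binary encoding is written by hand as the ordinal code expressed in base 2.

```python
import numpy as np
import pandas as pd

size_order = ["XS", "S", "M", "L", "XL"]          # k = 5 categories
sizes = pd.Series(["M", "XS", "XL", "S", "L"], name="size")

# One-hot: one column per category (k columns).
one_hot = pd.get_dummies(sizes)

# Dummy: same idea, but the first category becomes the implicit baseline
# (k - 1 columns).
dummy = pd.get_dummies(sizes, drop_first=True)

# Ordinal: a single column of integer codes following the stated order.
codes = pd.Categorical(sizes, categories=size_order, ordered=True).codes

# Binary: the ordinal code written in base 2, one column per bit
# (ceil(log2(k)) columns).
n_bits = int(np.ceil(np.log2(len(size_order))))
binary = pd.DataFrame({f"bit{b}": (codes >> b) & 1 for b in range(n_bits)})

print(one_hot.shape[1], dummy.shape[1], binary.shape[1])  # 5 4 3
```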

There is, however, one last encoder I recently grew fond of and would like to mention: Owen Zhang’s Leave-One-Out encoder. Honestly, I am not mathematician enough to prove or disprove its validity. My gut feeling and the trials I made tell me it works and can be applied to any type of categorical data. But feel free to point me toward better or more background information, proofs of validity, etc. if you find (or derive) any. The Leave-One-Out encoder is described by Owen Zhang in his presentation at the NYC Data Science Academy.

One of the nice properties of the leave-one-out encoder is that the categorical data is encoded on a single variable, while most of the other encoders discussed above encode on k, k-1 or log(k) variables. This becomes especially interesting when you want to compare the importance of features with respect to each other when trained for a specific output. In that case, if you expanded your categorical variable over a number of variables, how can you roll them back together into one category? You do not have that problem if you encoded it as only one variable.
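
As I understand the idea, each row is encoded with the mean of the target over the other rows of the same category. Below is a minimal sketch under that assumption; the function name, the fallback to the global mean for categories seen only once, and the toy data are mine, and the sketch leaves out any randomisation or smoothing that a fuller version might add.

```python
import pandas as pd


def leave_one_out_encode(categories, target):
    """Encode each row as the mean target of the other rows in its category."""
    grouped = target.groupby(categories)
    total = grouped.transform("sum")
    count = grouped.transform("count")
    # A category seen only once has no "other" rows: leave it undefined...
    encoded = (total - target) / (count - 1).where(count > 1)
    # ...and fall back to the global mean of the target (an assumption).
    return encoded.fillna(target.mean())


phone = pd.Series(["Android", "Android", "Android", "iOS", "iOS", "Other"])
satisfaction = pd.Series([5, 3, 4, 2, 4, 1])

encoded = leave_one_out_encode(phone, satisfaction)
print(encoded.tolist()[:5])  # [3.5, 4.5, 4.0, 4.0, 2.0]
```

Whatever the number of categories, the result is a single numeric column.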

As an example, let us say you have one variable which is the age of a respondent and a second variable which is the type of phone they have, e.g. Android, iOS, … and let us assume a simple problem where you know their level of satisfaction with their phone. After encoding the categorical phone type variable, you could train a classifier for their satisfaction based on their age (continuous variable) and their phone type (categorical variable, possibly expanded over multiple features). If you use a linear classifier, you could look at the coefficients of the trained algorithm to determine the importance of the features in determining the satisfaction of the user, but how do you interpret the different features expanded from the categorical variable?

Well, you can only go so far with such an analysis, as explained by Karen Grace-Martin in Interpreting Regression Coefficients. The last paragraph is especially interesting and basically tells you that there is no way to roll them back together… In most cases, the best you can do is interpret it as: “with all other parameters fixed, having category X influences the response by such an amount compared to the base category”. Hence my eagerness to encode on a single feature using leave-one-out.
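
The sketch below illustrates that "compared to the base category" reading. It swaps the classifier for a plain least-squares regression with numpy, and the data is synthetic: satisfaction is built as 2.0 + 0.05 × age, plus 1.0 for iOS and minus 0.5 for Other, with Android as the base category. All of these numbers are invented for the illustration.

```python
import numpy as np
import pandas as pd

respondents = pd.DataFrame({
    "age": [20, 30, 40, 50, 25, 35, 45, 55, 60],
    "phone": ["Android", "iOS", "Other"] * 3,
})
# Synthetic, noise-free response with a known effect per phone type.
effect = respondents["phone"].map({"Android": 0.0, "iOS": 1.0, "Other": -0.5})
satisfaction = 2.0 + 0.05 * respondents["age"] + effect

# Dummy encoding: Android (first category) is dropped and becomes the baseline.
dummies = pd.get_dummies(respondents["phone"], drop_first=True).astype(float)
design = np.column_stack([np.ones(len(respondents)), respondents["age"], dummies])

coefficients, *_ = np.linalg.lstsq(design, satisfaction.to_numpy(), rcond=None)
names = ["intercept", "age"] + list(dummies.columns)
result = dict(zip(names, coefficients.round(3)))
print(result)
```

The iOS and Other coefficients only make sense relative to Android: they recover the +1.0 and -0.5 offsets, but there is no single number that tells you how important "phone type" is as a whole.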

That’s all folks. Do choose the right encoder for the job at hand, and remember that the interpretation of the regression coefficients is dependent on the encoder chosen.

Sentiment analysis as a tool for better data exploration

Here is the problem I was facing: a lot of comments describing in plain text why a person is not entirely satisfied with the cellular service they received. On the other hand, I also had an overall rating of their cellular service satisfaction. I decided to use sentiment analysis to better understand what makes those people happy or unhappy with their cellular service. So let’s take a look at the process I followed and some of the results I found.

The first step was to convince myself there was a difference in the words used to answer the question “Why is the cellular network quality not excellent?” when people say they are overall satisfied with the cellular service (rating the service with a 4 or 5 out of 5) versus the words used to answer the same question when people say they are overall not satisfied with the cellular service (rating the service with a 1 or 2 out of 5). The quick way of doing so is via a Word Cloud. I used the excellent word_cloud generator from Andreas Mueller for python. I got the following figures for satisfied versus unsatisfied users.

We can see some similarities, but there are obvious differences in the frequency of words used between satisfied and unsatisfied users. Thus the next step was to extract the words from the comments and filter out the stop words. For that purpose I used a list from XPO6 which I expanded on an as-needed basis (certain variations of stop words were missing from the original list). Then, out of the remaining words, I built a Python dictionary of the frequency of occurrence of the words in each comment. I used those dictionaries to build a pandas data frame where each row holds the comment of one user, and I labelled those rows with either 0 for satisfied users or 1 for unsatisfied users.
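
A minimal sketch of that preparation step follows. The comments, the labels and the tiny stop-word set are invented for the example (the real list from XPO6 is much longer), and the helper name word_counts is hypothetical.

```python
import re
from collections import Counter

import pandas as pd

# Tiny illustrative stop-word list, standing in for the real one.
STOP_WORDS = {"IS", "AT", "MY", "IN", "WITH"}

# Hypothetical comments and labels (0 = satisfied, 1 = unsatisfied).
comments = [
    "Coverage is weak at my farm",
    "No reception in multiple places",
    "Multiple areas with terrible reception",
]
labels = [0, 1, 1]


def word_counts(text):
    """Return a dictionary of word frequencies, stop words removed."""
    words = re.findall(r"[A-Z0-9']+", text.upper())
    return Counter(word for word in words if word not in STOP_WORDS)


# One row per comment, one column per word; missing words count as 0.
features = pd.DataFrame([word_counts(c) for c in comments]).fillna(0).astype(int)
features["unsatisfied"] = labels
print(features)
```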

Now we are ready to apply supervised classification. To keep things simple I use a Support Vector Machine with a linear kernel (LinearSVC from scikit-learn). Since I am not interested in building a generalised model for re-use, but rather in grouping comments around certain words so that I can extract meaning from them, I do not split the data set and use all of it for training. I still look at the score to make sure the classification is somewhat coherent. It is, with a score of 97.6%.
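
In code, this step can be as short as the sketch below. The word-count frame is a made-up toy, so the score it prints has nothing to do with the 97.6% above; random_state is only set to make the run repeatable.

```python
import pandas as pd
from sklearn.svm import LinearSVC

# Hypothetical word-count frame: one row per comment, one column per word.
features = pd.DataFrame({
    "MULTIPLE": [1, 1, 0, 0, 0, 0],
    "TERRIBLE": [0, 1, 1, 0, 0, 0],
    "FARM":     [0, 0, 0, 1, 1, 0],
    "ROAD":     [0, 0, 0, 0, 1, 1],
})
labels = [1, 1, 1, 0, 0, 0]  # 0 = satisfied, 1 = unsatisfied

# No train/test split: the goal is to explore this data, not to generalise.
model = LinearSVC(random_state=0)
model.fit(features, labels)

# Mean accuracy on the training data itself, used only as a sanity check.
training_accuracy = model.score(features, labels)
print(training_accuracy)
```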

Now the interesting part. We can look at the coefficients of the fitted model. They are the weights assigned to the features once the model is trained. If we sort the features (words) by their coefficients, the words associated with satisfied subscribers get the lowest scores (the most negative values), since the satisfied label is 0. In the same fashion, the words associated with unsatisfied subscribers get the highest scores, as the unsatisfied label is 1.
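
Here is a small sketch of that sorting, again on invented toy counts. For a two-class problem LinearSVC exposes a single row of weights in coef_, where positive values push toward the class labelled 1.

```python
import numpy as np
from sklearn.svm import LinearSVC

words = np.array(["MULTIPLE", "TERRIBLE", "FARM", "ROAD"])
# Hypothetical counts: rows are comments, columns follow `words`.
counts = np.array([
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
])
labels = np.array([1, 1, 1, 0, 0, 0])  # 0 = satisfied, 1 = unsatisfied

model = LinearSVC(random_state=0).fit(counts, labels)

weights = model.coef_[0]     # one weight per word
order = np.argsort(weights)  # from most negative to most positive

top_n = 2
satisfied_words = words[order[:top_n]]          # lowest weights, label 0
unsatisfied_words = words[order[::-1][:top_n]]  # highest weights, label 1
print(satisfied_words, unsatisfied_words)
```

On the real data, top_n would be 10 and words would be the columns of the word-count data frame.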

Out of this analysis we get the following top 10 positive and top 10 negative words indicative of sentiment toward the cellular carrier.

Words most indicative of negative sentiment:

Words most indicative of positive sentiment:
[‘USAGE’, ‘ROAD’, ‘FARM’, ‘MAIL’, ‘WEAK’, ‘MONTH’, ‘LAND’, ‘CELL’, ‘4G’, ‘TRYIN’]

The next step is to look up the comments containing those words and do some “human learning” on them to figure out the associated problem. For example, if we look at “multiple” we get comments like (obviously anonymized):

  • […] no reception in multiple places […]
  • […] multiple area with terrible reception.
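
Pulling up those comments is a one-liner with pandas. In the sketch below the helper name comments_with is hypothetical, and the third comment is invented to show a non-matching row.

```python
import re

import pandas as pd

comments = pd.Series([
    "no reception in multiple places",
    "multiple area with terrible reception",
    "texts arrive late",  # invented example
])


def comments_with(word, comments):
    """Return the comments containing `word` as a whole word, any case."""
    pattern = rf"\b{re.escape(word)}\b"
    return comments[comments.str.contains(pattern, case=False, regex=True)]


matches = comments_with("MULTIPLE", comments)
print(matches.tolist())
```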

It is easy to see that having no or bad reception in multiple areas is the problem. And since multiple is the most important word indicating a negative perception of the service being received, it is probably the first point to address.

If we go down the list of words and look at the comments, we can understand that:

  • “terrible” coverage is an issue,
  • “late” text messages are an issue,
  • Loss of signal while “roaming” is an issue,
  • etc.

Conversely, we can look at the words most associated with a positive perception of the service being received. Those are still things that can be addressed, but they are not serious enough to turn the user’s perception into a negative one (lower importance issues). If we look at that list, we can understand that:

  • Cost of data “usage” is an issue,
  • Dropped call or no coverage while on “road” is an issue,
  • No service on a “farm” is an issue,
  • Calls going to voice “mail” is an issue,
  • etc.

But again, although those are answers to “Why is the cellular network quality not excellent?”, they are not reason enough to be unsatisfied with the overall cellular service received.

I think this hybrid approach is a good one: supervised classification groups the comments around the words that best discriminate satisfied from unsatisfied users, and a human can then concentrate on the “most important” comments and the “nice to have” comments, and in this way find meaningful insights in the free-text comments.