Still in pursue of better understanding the cellular service satisfaction survey I mentioned earlier, I came upon another interesting problem to solve. I have encoded most of the response to a cellular service survey answered by around a thousand users and created a panda data frame with it. The columns have then been scaled with min-max in the 0..1 range. Now I want to see if any of the variables (questions in the survey) are correlated with any other variable.
Using python, the answer is quite simple, pandas provide the function corr() which build the correlation matrix from a data frame. You can then use pyplot matshow() in order to visualise that correlation matrix. Below is an example of such a visualisation I made on the data set I have. Thankfully, in this example there is already a certain level of clustering that comes from order the data was imported, lucky I am! But if you take a random example, such as the one I built to demonstrate this process in a jupyter notebook on github, then there is not much to understand without an effort.
Obtaining the correlation matrix was easy. What is less obvious at first glance is how to cluster that correlation matrix in order to get better and easier understanding of our data.
You can follow the process in my jupyter notebook, but basically it involves performing hierarchical clustering on the correlation matrix and tada! You obtain a clustered correlation matrix such as below.
It could end there. Obviously in my simple example in the jupyter notebook there is not much more to gain in term of clarity. However on the real data we can sense there is more to see. I can go forward and do a multi-pass clustering.
- Cluster the correlation matrix.
- For each cluster:
- Sub-cluster the Clusters
Doing this yields to the following clustering which is marginally better as we can better see some sub-clustering within the big clusters.
This process could be extended to n-pass correlation matrix clustering. It could be done through a recursive process which would stop when you reach a minimum size sub-cluster.
So what can we see in this last figure? Well some obvious findings are experiencing poor coverage areas or having the desire to switch to another carrier is inversely correlated to the satisfaction with the service. We can also see that the more devices you have, the more you report using services. It appears there is no correlation between the service usage (how much you use it) and the satisfaction expressed toward it. There is a slight inverse correlation between the age and to a lesser degree to the sex of the respondent and the usage he makes of the service (again quantity). And thanks to the multi-pass clustering we can also see in the satisfaction quadrant that voice and texting satisfaction are more tied together and form a sub-cluster, and data satisfaction form another sub-cluster.
There are a few more findings you can see from that picture. I let you find them as an exercise!