Encoding categorical variables

A post from a colleague at Ericsson on Yammer brought me back to a question I had set aside for a while: how best to encode categorical data. I dug up an old blog post by Will McGinnis which covers a lot of approaches, Beyond One Hot: an exploration of categorical variables, and wanted to expand on it. Or at the very least, clarify my own thinking on the subject!

Categorical variables are those that, instead of being continuous, can only take a finite number of values. Some examples could be:

  • Shirt sizes (XS, S, M, L, XL, …)
  • City of origin (Montréal, Québec, Vancouver, …)
  • Age group (0-15, 16-20, 21-24, 25-35, …)
  • Username of a contributor (Sheldon, Leonard, Raj, Howard, Stuart, …)

Well, anything, as long as it is finite and can be completely expressed as a list. The problem with that type of data is that machine learning algorithms usually like numbers, so we need to express the categories as numbers. But this step is not so simple, because the list above mixes two types of categorical variables: ordinal data and nominal data.

From the examples above, shirt sizes and age groups are ordinal data: the categories imply a growing order, and if you encode the values following that order you get a semi-continuous spread of data. City of origin and username, on the other hand, are nominal data. You cannot assume that Sheldon comes before or after Leonard; there is no way to put them in an increasing order. And no, sorting the list alphabetically does not make it continuous: the alphabetical order is artificial and does not reflect any order in the categories themselves. Sometimes you can get away with making a nominal variable ordinal for a specific use case, e.g. the name of a city could be encoded as the size (in inhabitants) of that city, or a contributor's name as the number of contributions made by that person, but it takes some analysis to figure out whether that is appropriate for a given usage.
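To make the distinction concrete, here is a minimal sketch in pandas, with made-up data, encoding shirt sizes by their explicit real-world order rather than the alphabetical one:

```python
import pandas as pd

df = pd.DataFrame({"shirt_size": ["M", "XS", "L", "S", "M", "XL"]})

# Explicit order: encode sizes by their real-world rank, not alphabetically.
size_order = {"XS": 0, "S": 1, "M": 2, "L": 3, "XL": 4}
df["size_ordinal"] = df["shirt_size"].map(size_order)

# An alphabetical encoding would give L < M < S < XL < XS,
# which does not reflect any real order in the sizes.
print(df)
```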

So going back to Will’s blog, one thing that is not explicitly mentioned is that since categorical variables are not all made equal, your choice of encoding is limited to what is possible for that variable. Hence the conclusion that binary coding consistently performs well must be taken with a grain of salt: binary coding applies to ordinal data and is not appropriate for nominal data. You can use any categorical encoding on ordinal data, but you cannot use an ordinal encoding on nominal data… So what is missing is a classification of the encoders into the general categorical or ordinal varieties. I could also add a third variety: categorical encoders which can be used on nominal data but need special attention.

Ordinal encoders (cannot be used on nominal data):

  • Ordinal
  • Binary

Categorical encoders (work on all types of categorical variables):

  • One-Hot (or Dummy)
  • Simple
  • Sum (Deviation)
  • Backward Difference

Categorical encoders which require special attention when used on nominal data:

  • Helmert
  • Orthogonal Polynomial

Explanations of those encoders are well detailed in Will’s blog and, as he points out, in the documentation for statsmodels.
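If you want to try them out, Will’s category_encoders package implements most of these behind a common fit/transform interface. A quick sketch, assuming the package is installed (exact output column names may vary with the version):

```python
import pandas as pd
import category_encoders as ce

X = pd.DataFrame({"city": ["Montréal", "Québec", "Vancouver", "Montréal"]})

# Compare the columns each encoder produces for the same nominal variable.
for encoder in (ce.OneHotEncoder(cols=["city"]),
                ce.BinaryEncoder(cols=["city"]),
                ce.HelmertEncoder(cols=["city"])):
    print(type(encoder).__name__)
    print(encoder.fit_transform(X))
```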

There is however one last encoder I recently grew fond of which I would like to mention, and it is Owen Zhang’s Leave-One-Out encoder. Honestly, I am not mathematician enough to prove or disprove its validity. My gut feeling and the trials I made tell me it works and can be applied to any type of categorical data. But feel free to point me toward better or more background information, proofs of validity, etc. if you find (or derive) any. The Leave-One-Out encoder is described by Owen Zhang in his presentation at the NYC Data Science Academy.
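To make the idea concrete, here is a minimal sketch of the leave-one-out computation in pandas, assuming a binary target; the fallback to the global mean for single-occurrence categories is one possible choice among others, and Zhang’s presentation also mentions adding random noise on the training side to limit overfitting:

```python
import pandas as pd

df = pd.DataFrame({
    "user": ["Sheldon", "Leonard", "Sheldon", "Raj", "Leonard", "Sheldon"],
    "y":    [1, 0, 0, 1, 1, 1],
})

grp = df.groupby("user")["y"]
sums, counts = grp.transform("sum"), grp.transform("count")

# Each row gets the mean target of its category computed over all
# *other* rows of that category (hence "leave one out"); categories
# seen only once fall back to the global target mean.
df["user_loo"] = ((sums - df["y"]) / (counts - 1)).fillna(df["y"].mean())
print(df)
```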

One of the nice properties of the leave-one-out encoder is that the categorical data is encoded in a single variable, while most of the other encoders discussed above encode it in k, k-1 or log(k) variables. This becomes especially interesting when you want to compare the importance of features with respect to each other once trained for a specific output. If you expanded your categorical variable into a number of variables, how can you roll them back together into one category? You do not have that problem if you encoded it as a single variable.
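For a sense of scale, a back-of-the-envelope comparison of how many features each family produces for a variable with k levels:

```python
import math

k = 12  # e.g. twelve cities of origin
print("one-hot / dummy:", k)                 # or k-1 with a dropped baseline
print("binary:         ", math.ceil(math.log2(k)))
print("leave-one-out:  ", 1)
```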

As an example, let’s say you have one variable which is the age of a respondent and a second variable which is the type of phone they have (e.g. Android, iOS, …), and let us assume a simple problem where you know their level of satisfaction with their phone. After encoding the categorical phone-type variable, you could train a classifier for their satisfaction based on their age (a continuous variable) and their phone type (a categorical variable, possibly expanded into multiple features). If you use a linear classifier, you could look at the coefficients of the trained model to determine the importance of the features in determining the satisfaction of the user, but how do you interpret the different features expanded from the categorical variable?

Well, you can only go so far with such an analysis, as explained by Karen Grace-Martin in Interpreting Regression Coefficients. The last paragraph is especially interesting and basically tells you that there is no way to roll them back together… In most cases, the best you can do is interpret it as: “with all other parameters fixed, having category X influences the response by such an amount compared to the base category”. Hence my eagerness to encode on a single feature using leave-one-out.
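Here is a sketch of that interpretation problem with made-up data: after one-hot encoding, the single “phone type” concept is spread across several coefficients, each only meaningful relative to the others:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({
    "age":       [22, 35, 41, 29, 53, 38],
    "phone":     ["Android", "iOS", "Android", "iOS", "Other", "Android"],
    "satisfied": [1, 1, 0, 1, 0, 1],
})

# One-hot expansion: the single "phone" concept becomes three columns.
X = pd.get_dummies(df[["age", "phone"]], columns=["phone"])
model = LogisticRegression().fit(X, df["satisfied"])

# One coefficient for age, but three for phone type; there is no
# principled way to roll the three back into a single importance.
for name, coef in zip(X.columns, model.coef_[0]):
    print(f"{name:15s} {coef:+.3f}")
```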

That’s all folks. Do choose the right encoder for the job at hand, and remember that the interpretation of the regression coefficients is dependent on the encoder chosen.
