Last time I showed you how a fairly simple character-based recurrent neural network (RNN) can be used as a generative model for text. For that purpose I fed it all my blog posts and asked it to generate a new post from them. It is easy to see that this was a toy example of what an RNN can do.
This time around I want to show you how an RNN can be used for binary classification with quite good results. The binary classification problem we will look at involves a time series of measurements from a mobile subscriber in an operator network. We know that most of those subscribers are humans using a mobile phone. We also know that some of those “subscribers” are IoT machines. Given labelled examples of humans and machines, can we tell which category a subscriber belongs to?
In fact, two of my colleagues have taken on the same challenge as I did, each with a different approach. Steven developed a bespoke classifier from a statistical analysis, and you can read about it in Data Examination and Home Made Classifiers. Marc-Olivier upped the ante by developing a “classical” machine learning (ML) approach to the same problem, which he details in Machine vs Human. One interesting aspect of Marc-Olivier’s work is that he wanted to consider the temporal aspect of some features, and he spent quite some time on feature engineering in order to capture that temporal aspect in the ML model.
I am still in the process of fine-tuning the network architecture and the hyper-parameters, yet even with a not-so-optimized network I get better results than the two other approaches. Those results come at a cost though. The first cost is the computing resources required. Steven’s bespoke classifier uses very little computing resources for training and prediction. Marc-Olivier’s method gives better results but is a bit more involved; still, training and prediction can be done easily and quite quickly on a CPU. My deep learning approach was executed on a GPU (GTX 1080Ti), and yet the training takes more time than Marc-Olivier’s. I’m pretty confident I can run the prediction on a CPU, as only the forward propagation has to be performed, yet I think it will take more resources than Marc-Olivier’s method. Anyway, we will come up with a detailed summary of the pros and cons of the three approaches in a later blog post.
Now for the RNN approach. Below is a symbolic view of the network I used.
“Wait Pascal! You’ve put the exact same picture as before for TheLoneNut AI!” Yes, from a high-level architectural point of view it is pretty much the same deep neural network, and on the original picture I omitted certain details… So the toy example ports to a real-world problem quite easily. If we look at the details of the network, though, it is a bit different. Below is the output of model.summary() from Keras.
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
batch_normalization_5 (Batch (None, 12, 208)           832
_________________________________________________________________
lstm_9 (LSTM)                (None, 12, 512)           1476608
_________________________________________________________________
lstm_10 (LSTM)               (None, 512)               2099200
_________________________________________________________________
batch_normalization_6 (Batch (None, 512)               2048
_________________________________________________________________
dense_6 (Dense)              (None, 1)                 513
=================================================================
Total params: 3,579,201
Trainable params: 3,577,761
Non-trainable params: 1,440
_________________________________________________________________
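As a sanity check, the parameter counts in the summary can be reproduced by hand. The sketch below is not taken from the original code. It assumes the standard Keras layouts: an LSTM has four gates, each with an input kernel, a recurrent kernel and a bias, and batch normalization keeps four values per feature, two trainable and two non-trainable. The helper names are made up for illustration.

```python
def lstm_params(input_dim, units):
    # Four gates, each with an input kernel (input_dim x units),
    # a recurrent kernel (units x units) and a bias (units).
    return 4 * (units * (input_dim + units) + units)


def batch_norm_params(features):
    # gamma and beta are trainable; the moving mean and moving
    # variance are non-trainable. Four values per feature in total.
    return 4 * features


def dense_params(input_dim, units):
    # One weight per input and output pair, plus one bias per output.
    return input_dim * units + units


# Layer names and sizes are the ones shown in the summary above.
layers = {
    "batch_normalization_5": batch_norm_params(208),
    "lstm_9": lstm_params(208, 512),
    "lstm_10": lstm_params(512, 512),
    "batch_normalization_6": batch_norm_params(512),
    "dense_6": dense_params(512, 1),
}

total = sum(layers.values())
# Only the moving statistics of the two batch normalization layers
# are non-trainable: two values per feature.
non_trainable = 2 * 208 + 2 * 512
trainable = total - non_trainable

for name, count in layers.items():
    print(f"{name:24s} {count:>9,d}")
print(f"total={total:,d} trainable={trainable:,d} non_trainable={non_trainable:,d}")
```

The numbers match the summary line for line, which also shows that almost all of the capacity sits in the two LSTM layers.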
The differences are not so big. First, there is no need for an embedding layer, as I have fixed-size feature vectors coming in (208 features) instead of characters embedded over 24 features. I take those features over 12 days instead of 40 characters at a time. Next, the last LSTM layer produces a “non-time distributed” output, i.e. I’m not interested in the sequence but rather in the “result” of that sequence: the class the example belongs to. Finally, I go through a dense layer which outputs a single value using a sigmoid activation function. On that output I simply apply a cut-off value of 0.5, i.e. everything below 0.5 is considered 0 and everything above is considered 1.
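The final step, turning the sigmoid output into a class, can be sketched in a few lines of plain Python. This is only an illustration: the function names and input values are made up, and sending a value of exactly 0.5 to class 1 is my assumption, since the text only covers values below and above the cut-off.

```python
import math


def sigmoid(z):
    # Squash a real-valued pre-activation into the (0, 1) range.
    return 1.0 / (1.0 + math.exp(-z))


def to_class(probability, cutoff=0.5):
    # Below the cut-off is class 0, above is class 1.
    # Assumption: a value exactly at the cut-off goes to class 1.
    return 1 if probability >= cutoff else 0


# Made-up pre-activation values standing in for the dense layer output.
pre_activations = [-2.0, 0.3, 4.1]
probabilities = [sigmoid(z) for z in pre_activations]
classes = [to_class(p) for p in probabilities]

print(classes)
```

The cut-off does not have to stay at 0.5; moving it trades one kind of misclassification for the other.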
After training, I evaluate the model on a hold-out set of 1947 examples, and below is the confusion matrix I obtain. We can see a low proportion of misclassified examples, which is good!
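For readers who have not built one before, here is a minimal sketch of how a binary confusion matrix and the proportion of misclassified examples are computed from true and predicted labels. The labels below are toy values made up for illustration, not the results of the experiment, and the function names are hypothetical.

```python
def confusion_matrix(y_true, y_pred):
    # Count the four possible outcomes of a binary prediction.
    counts = {"tn": 0, "fp": 0, "fn": 0, "tp": 0}
    for truth, predicted in zip(y_true, y_pred):
        if truth == 1 and predicted == 1:
            counts["tp"] += 1
        elif truth == 1 and predicted == 0:
            counts["fn"] += 1
        elif truth == 0 and predicted == 1:
            counts["fp"] += 1
        else:
            counts["tn"] += 1
    return counts


def misclassification_rate(counts):
    # Share of examples that ended up in the wrong class.
    total = sum(counts.values())
    return (counts["fp"] + counts["fn"]) / total


# Toy labels only, NOT the 1947-example hold-out set.
y_true = [0, 0, 1, 1, 0, 1, 0, 0]
y_pred = [0, 0, 1, 0, 0, 1, 1, 0]

counts = confusion_matrix(y_true, y_pred)
rate = misclassification_rate(counts)
print(counts, rate)
```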
In conclusion, the experiment confirms what we already know 😉. ML gives better results than statistical analysis, and DL gives better results than ML. However, there is a computational cost to it. On the other hand, and we will discuss it in a later post, statistical analysis is more intellectually involved than ML feature engineering, and feature engineering is more involved than, well… no feature engineering. Though with DL you have to come up with network engineering, and to paraphrase someone else talking about deep reinforcement learning: “On the scale of art versus science, Deep Learning network architecture is still more on the art side of the balance”, but that’s another story!