Where the F**k do I execute my model?

or: Toward a Machine Learning Deployment Environment.

Nowadays, big names in machine learning have their own data science analysis environments and in-production machine learning execution environments. The others have a mishmash of custom-made parts, or are lucky enough that an existing commercially available machine learning environment fits their needs and they can use it. There are several data science environments commercially available; Gartner mentions the best-known players (although new ones pop up every week) in its Magic Quadrant for Data Science and Machine-Learning Platforms. However, most (if not all) of those platforms suffer from a limitation which might prevent some industries from adopting them: most of them start from the premise that they will execute everything on a single cloud (whether public or private). Let’s see why that might not hold for every use case.

Some machine learning models might need to be executed remotely. Think, for example, of the autonomous vehicle industry: latency and security prevent execution in a cloud (unless that cloud is onboard the vehicle). Some industrial use cases might require models to be executed in an edge-computing or fog-computing fashion to satisfy latency requirements. Data sensitivity in some industries may require the execution of some algorithms on customer equipment. There are many more reasons why you may want to execute your model somewhere other than the cloud where you performed the data science analysis.

As mentioned, most commercially available offerings do not cater to that requirement. And it is not a trivial thing one can simply slap on top of an existing solution as a feature. Allowing such a distributed and heterogeneous analysis and deployment environment has, in some cases, profound implications. Let’s just look at some of the considerations.

First, one must recognize that there is a distinction between the machine learning model and the complete use case to be covered, or as some would like to call it, the AI. A machine learning model is simply provided a set of data and gives back an “answer”. It could be a classification task, a regression or prediction task, etc., but this is where a machine learning model stops. To get value from that model, one must wrap it in a complete use case; some call that an AI. How do you reliably acquire the data it requires? How do you present or act on the answer given by the model? These questions, and many more, need to be answered by a machine learning deployment environment.

Recognizing that, one of the first things required to deploy a full use case is access to data. In most industries, the sources of data are limited (databases, web queries, csv files, log files, …) and the way to handle them is repetitive i.e. once I have figured out how to do database queries, the next time most of my code will look the same, except for the query itself. As such, data access should be facilitated by a machine learning deployment environment, which should provide “data connectors” that can be configured for the needs at hand and deployed where the data is available.
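
To make the idea more concrete, here is a minimal Python sketch of what a configurable “data connector” could look like. The class and method names are hypothetical illustrations, not part of any existing platform.

```python
# Hypothetical sketch of a generic, configurable data connector.
from abc import ABC, abstractmethod


class DataConnector(ABC):
    """Configured once per use case, deployed wherever the data lives."""

    def __init__(self, config: dict):
        self.config = config

    @abstractmethod
    def fetch(self):
        """Return an iterable of records from the configured source."""


class SQLConnector(DataConnector):
    def fetch(self):
        import sqlite3  # stand-in for whatever database driver is needed
        with sqlite3.connect(self.config["dsn"]) as conn:
            yield from conn.execute(self.config["query"])


# Only the configuration changes from one use case to the next.
connector = SQLConnector({"dsn": "metrics.db", "query": "SELECT * FROM kpis"})
```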

Once you have access to data, you will need “rules” as to when the machine learning model needs to be executed: is it once a day, on request, … Again, there are many possibilities (although when you start thinking about it, a lot of them are the same), but expressing those “rules” should be facilitated by the deployment environment so that you don’t have to rewrite a new “data dispatcher” for every use case, but simply configure a generic one.

Now we have data and we are ready to call a model, right? Not so fast. Although some think of data preparation as part of the model, I would like to consider it an intermediary step. Why, you ask? Simply because data preparation is a deterministic step where there should be no learning involved, and because in many cases you will significantly reduce the size of the data in that step, data that you might want to store to monitor the model’s behavior. But I’ll come back to this later. For now, just consider that there might be a need for “data reduction”, and this one cannot be generic. You can think of it as a pre-model which formats the data in a way your model is ready to use. The deployment environment should facilitate the packaging of such a component and provide ways to easily deploy it (again, anywhere it needs to be).
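
As an illustration, here is a minimal sketch of such a “data reduction” step: a deterministic, learning-free transform that shrinks raw records into the features the model actually consumes. The field names and statistics are assumptions for illustration only.

```python
# Hypothetical data reduction: aggregate raw events into per-minute features.
import statistics
from collections import defaultdict


def reduce_events(events):
    """Turn raw event records into per-minute statistical features."""
    per_minute = defaultdict(list)
    for event in events:
        per_minute[event["timestamp"] // 60].append(event["duration"])

    features = []
    for minute in sorted(per_minute):
        durations = per_minute[minute]
        features.append({
            "minute": minute,
            "count": len(durations),
            "mean_duration": statistics.mean(durations),
        })
    return features
```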

We are now ready for the machine learning execution! You already produced a model from your data science activities and this model needs to be called. As with the “data reduction” component, the deployment environment should facilitate the packaging and deployment of this “model execution” component.

For those who have been through the loops of creating models, you certainly have a question by now: but how was that model trained? So yes, we might need a “model training” component, which is also dependent on the model itself. A deployment environment should also facilitate the use and deployment of a training component. However, this raises another important question: where does the training data come from? And what if the model drifts, is no longer accurate and needs re-training? You will need data… So, another required component is a “data sampling” component. I say data sampling because you may not need all the data; maybe a sample of it is sufficient. This can be provided by the model execution environment and configured per use case. You remember the discussion about data reduction earlier? Well, it might be wise to store only samples coming from the reduced data… You may also want to store the associated prediction made by the model.
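
A minimal sketch of what such a “data sampling” component could look like is below; the class, its parameters and the storage format are assumptions for illustration.

```python
# Hypothetical data sampler: keep a configurable fraction of the reduced
# inputs together with the prediction the model made on them.
import random


class DataSampler:
    def __init__(self, store, sample_rate=0.01, seed=None):
        self.store = store              # e.g. the "sample database"
        self.sample_rate = sample_rate  # configured per use case
        self._rng = random.Random(seed)

    def maybe_store(self, reduced_input, prediction):
        if self._rng.random() < self.sample_rate:
            self.store.append({"input": reduced_input,
                               "prediction": prediction})


samples = []
sampler = DataSampler(samples, sample_rate=0.05)
sampler.maybe_store({"mean_duration": 1.2}, "normal")
```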

At any rate, you will need a “sample database” which will need to be configured with proper retention policies on a per-use-case basis (unless you want to keep that data for eternity).

As we said, models can drift, so data ops teams will have to monitor the model/use case. To facilitate that, a “model monitoring” component should be available which takes cues from the execution environment itself, but also from the sample database, which means that you will need a way to configure which values are to be watched.

Those cover the most basic components required, but more may be needed. If you are to deploy this environment in a distributed and heterogeneous fashion, you will need some “information transfer” mechanism or component to exchange information in a secure and easy fashion between different domains.

[Figure: Machine Learning Execution Environment overview]

You will also need a model orchestrator which will take care of scaling all those parts in or out as needed. And what about model life-cycle management, canary deployment or A/B testing… you see, there is even more to consider.

One thing to notice is that even at this stage, you only have the model “answer” … you still need to use it in a way which is useful for your use case. Maybe it is a dashboard, maybe it is used to actuate some process… the story simply does not end here.

For my friends at Ericsson, you can find way more information in the memorandum and architecture document I wrote on the subject: “Toward a Machine Learning Deployment Environment”. For the rest of you folks, if you are in the process of establishing such an environment, I hope those few thoughts can help you out.


Cover photo by Frans Van Heerden at Pexels.


How to potty train a Siamese Network

Time for an update on my One-Shot learning approach using a Siamese LSTM-based Deep Neural Network we developed for telecommunication network fault identification through traffic analysis. A lot of small details had to change as we upgraded our machine to the latest TensorFlow and Keras. That alone introduced a few new behaviors… We also obtained new data for new examples and found some problems with our model. I don’t intend to go through all the changes, but I will cover some of the main ones as well as some interesting findings. It feels a lot like potty training a cat… If you are new to this series, you can refer to my previous posts: “Do Telecom Networks Dreams of Siamese Memories?” and “What Siamese Dreams are made of…”

First, Batch Normalization in Keras is now on my black magic list 😊 . I’ll have to dig more into how it is implemented, especially the differences between training time and prediction time. For a long time, I was wondering why I was getting an extremely good training loss and a poor validation loss, until I removed the Batch Normalization I had on the input layer. So, something to investigate there.

Secondly, I introduced data generators for the training and validation data. For a Siamese network approach, where you must provide tons of similar and dissimilar pairs, mastering generators is a must at some point! Once you get the gist of it, it is quite convenient. I found Shervine Amidi’s blog post, “A detailed example of how to use data generators with Keras”, to be a very well explained example to build upon. I would recommend it to anyone learning about Keras data generators.
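
For readers who want a starting point, here is a hedged sketch of what such a generator can look like for triplets, following the keras.utils.Sequence pattern from Amidi’s post; the array shapes, label handling and batch size are assumptions, not the project’s actual code.

```python
# Sketch of a triplet generator built on keras.utils.Sequence.
import numpy as np
from tensorflow.keras.utils import Sequence


class TripletGenerator(Sequence):
    """Yields batches of (anchor, positive, negative) windows plus dummy targets."""

    def __init__(self, examples, labels, batch_size=32):
        self.examples = np.asarray(examples)   # shape: (n, 60, n_feat)
        self.labels = np.asarray(labels)       # one label per example
        self.batch_size = batch_size

    def __len__(self):
        return len(self.examples) // self.batch_size

    def __getitem__(self, idx):
        anchors, positives, negatives = [], [], []
        for _ in range(self.batch_size):
            label = np.random.choice(self.labels)
            same = np.where(self.labels == label)[0]
            diff = np.where(self.labels != label)[0]
            a, p = np.random.choice(same, 2)
            n = np.random.choice(diff)
            anchors.append(self.examples[a])
            positives.append(self.examples[p])
            negatives.append(self.examples[n])
        # The triplet loss ignores y_true, so dummy targets are enough.
        dummy = np.zeros((self.batch_size, 1))
        return [np.stack(anchors), np.stack(positives), np.stack(negatives)], dummy
```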

Along the way I found that my triplet_loss function, as shown in the previous post, was flawed… Because of the way I am packing the output of the base neural network with a Keras concatenate, I must explicitly specify the ranges. Moreover, I painfully understood that a loss function in Keras is passed a mini-batch of y_true/y_pred values, not individual values. Well, that was not clear to me at first sight… I also took the opportunity to rework the logic to use more of a Keras approach than a TensorFlow one (subtle changes). Below is the new loss function.
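
The original code is not reproduced here, but below is a hedged reconstruction of what the reworked loss can look like: the concatenated output is sliced back into anchor, positive and negative, and everything operates on a whole mini-batch at once. The embedding size and margin are assumptions for illustration.

```python
# Keras-style triplet loss working on a concatenated mini-batch output.
from tensorflow.keras import backend as K

EMB_SIZE = 10   # dimensionality of the base network output (assumption)
ALPHA = 0.2     # margin (assumption)


def triplet_loss(y_true, y_pred):
    # y_pred has shape (batch, 3 * EMB_SIZE): explicit ranges are needed
    # because the three outputs were packed with a Keras concatenate.
    anchor = y_pred[:, 0:EMB_SIZE]
    positive = y_pred[:, EMB_SIZE:2 * EMB_SIZE]
    negative = y_pred[:, 2 * EMB_SIZE:3 * EMB_SIZE]

    pos_dist = K.sum(K.square(anchor - positive), axis=1)
    neg_dist = K.sum(K.square(anchor - negative), axis=1)

    # One loss value per triplet in the mini-batch; Keras averages them.
    return K.maximum(pos_dist - neg_dist + ALPHA, 0.0)
```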

The fourth interesting thing to mention is that while I was debugging all those issues, I felt a need to visualize the results better than simply looking at the prediction value. I reduced the output vector space from 10 dimensions to 3 dimensions, as I do not have that many different examples for now anyway, so 3D should be more than enough to separate them. Furthermore, I changed my output layer to use a sigmoid activation function to limit the output space to the [0,1] range. Those changes in turn enabled me to look at the location of the predicted point in the transformed space, e.g. a traffic pattern now corresponds to a 3D location in this output space.
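
The change itself is tiny; here is a sketch of the revised output layer, assuming the rest of the base network is unchanged.

```python
# The base network now ends in a 3-wide sigmoid layer, so each traffic
# window maps to a point inside the unit cube that can be plotted in 3D.
from tensorflow.keras.layers import Dense

output_layer = Dense(3, activation='sigmoid')  # replaces the old 10-wide output
# base_model.predict(window) then returns shape (1, 3): x, y, z in [0, 1].
```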

[Figure: separation of the traffic patterns in the 3D output space]

Below I made a video of how this projection evolves through training. Initially, as the neural net is initialized with random values, the output points clutter together at the center. But quickly we see them being separated, each taking a corner of the space. Sure, there is a lot of bouncing back and forth as the neural net tries to find a better solution, but we can see that there is a sweet spot where the different traffic patterns are well separated. As a side note, we see three different traffic patterns here: normal traffic in green and two different error cases, one dramatic in red where all traffic is blocked, and one subtler in orange where we reach the capacity limit of the communication link.

Now, while acquiring more data from our test bed, we are trying out different loss functions to separate the traffic. One of my colleagues has just posted a comparison between different loss functions: “Lossless Triplet Loss”. I might also try some different loss functions and show my findings.

I hope this shows that One-Shot learning using Siamese networks can be used for purposes other than face recognition. In this case we are successfully using it for signalling traffic categorization and fault detection.


Cover photo by Jan-Mallander at Pixabay.

What Siamese Dreams are made of…

In my last post I wrote a high-level description of a One-Shot learning approach we developed for telecommunication network fault identification through traffic analysis. The One-Shot learning approach is implemented using a Siamese Deep Neural Network. In this post I will describe in more detail how this can be achieved with the use of Keras and TensorFlow. As said in the previous post, this is early work and subject to a lot of change, but if it can help someone else alleviate some of the pain of building such a network, let it be!

The first step is probably to understand what a Siamese Network is and how it works. What we want our network to produce is a representation of the data we feed it, e.g. a vector representing the input data, like word embeddings, but in this case for telecom network traffic data. At the end of the day, this representation vector should have small distances for similar traffic and larger distances for dissimilar traffic. Hence, when the network is properly trained we can use those distances to determine which network traffic is the closest and thus the most representative. But how do we implement it?

For that, let’s look at the cute kitten image I have put on this and the previous post. The crème-colored cute one hiding at the bottom is Aristotle. The other crème-colored one is Peter Pan and the black one is Napoleon. Aristotle is our Anchor, the kitten we want to compare to. If another kitten is similar, let’s say Peter Pan, then the vector representing Peter Pan should be close in distance to the vector representing Aristotle. This is our Positive example. Similarly, when a kitten is different from Aristotle, let’s say Napoleon, we want the vector representing it to be far in distance from the one representing Aristotle. This is our Negative example.

Simplifying things, training a deep neural network consists of predicting a result from a training example; finding out how far we are from the expected value using a loss function; and then correcting the weights of the deep neural network based on that error, so next time we are a bit closer. Here we do not know what the expected value for our training examples is, but we know that whatever that value is, it should be close in distance to the Anchor if we present the Positive example, and far in distance if we present the Negative example. Thus, we will build our loss function in that way. It receives, through y_pred, the representations of the Anchor, the Positive example and the Negative example. It then computes the distance between the Anchor and the Positive (AP), and between the Anchor and the Negative (AN). As we said, AP should get close to 0 while AN should get large. For this exercise, let’s set “large” to 0.2. So, we want AP = 0 and AN ≥ 0.2, i.e. AN – 0.2 ≥ 0. Ideally, we want both of those to hold, hence we want to minimize the loss where loss = AP – (AN – 0.2). That being explained, below is the loss function we defined.
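
The original code is not reproduced here; below is a conceptual sketch of that logic, written as if the three representations were available as separate tensors (the exact packing through y_pred is discussed in the follow-up post “How to potty train a Siamese Network”).

```python
# Conceptual sketch of the triplet loss described above:
# minimize AP - (AN - 0.2), clipped at zero.
import tensorflow as tf

ALPHA = 0.2  # how "large" we want the anchor-negative distance to be


def triplet_loss(anchor, positive, negative, alpha=ALPHA):
    ap = tf.reduce_sum(tf.square(anchor - positive), axis=-1)  # AP distance
    an = tf.reduce_sum(tf.square(anchor - negative), axis=-1)  # AN distance
    # We want AP -> 0 and AN >= alpha, i.e. minimize AP - (AN - alpha).
    return tf.maximum(ap - (an - alpha), 0.0)
```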

Now, having a loss function to train a network with, we need a network to be defined. The network should receive as input our network traffic information and output a vector representation of it. I already mentioned the network before, so here is the function that creates it using the Keras Sequential model.
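
The original gist is not shown here; the following is a hedged sketch of such a builder, with layer sizes taken from the model summary reproduced in the “Do Telecom Networks Dreams of Siamese Memories?” post (60 time steps, 132 features, two 512-unit LSTMs, a 10-dimensional output). The activation choices are assumptions.

```python
# Sketch of the base network: BatchNorm -> LSTM -> LSTM -> Dense -> embedding.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import BatchNormalization, LSTM, Dense

N_MINS = 60     # minutes of traffic per example
N_FEAT = 132    # statistical features per minute
EMB_SIZE = 10   # size of the output representation


def create_base_network(n_mins=N_MINS, n_feat=N_FEAT, emb_size=EMB_SIZE):
    model = Sequential()
    model.add(BatchNormalization(input_shape=(n_mins, n_feat)))
    model.add(LSTM(512, return_sequences=True))
    model.add(LSTM(512))
    model.add(BatchNormalization())
    model.add(Dense(512, activation='relu'))   # activation is an assumption
    model.add(BatchNormalization())
    model.add(Dense(emb_size))
    model.add(BatchNormalization())
    return model
```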

Now that we have that base model, we need to embed it within a Siamese “framework”. After all, that base network simply computes one vector representation for a specific piece of network traffic data, and the loss function we defined calls for three of those representations, i.e. the anchor, the positive and the negative. So, what we will do is define three inputs which will be evaluated through the SAME base network, hence the name Siamese network. The output of that Siamese network is then simply the concatenation of the three vectors, which is what we ask our loss function to evaluate. Note that at this point we define the input and output dimensions. The inputs will be in the shape of N_MINS minutes of network traffic characterization (60 minutes for now), where each minute is characterized by n_feat features (the 130 or so features I mentioned in my previous post).
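
A hedged sketch of that Siamese wrapper, reusing the base model sketched above, could look like this; the input names and dimensions are illustrative.

```python
# Three inputs evaluated through the SAME base network (shared weights),
# with the three embeddings concatenated for the triplet loss to unpack.
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, concatenate


def create_siamese_network(base_model, n_mins=60, n_feat=132):
    anchor_in = Input(shape=(n_mins, n_feat), name='anchor')
    positive_in = Input(shape=(n_mins, n_feat), name='positive')
    negative_in = Input(shape=(n_mins, n_feat), name='negative')

    # Reusing the same base_model instance means the weights are shared.
    merged = concatenate([base_model(anchor_in),
                          base_model(positive_in),
                          base_model(negative_in)])

    return Model(inputs=[anchor_in, positive_in, negative_in], outputs=merged)
```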

Everything is now in place to train the base model through the Siamese “framework” using our defined loss function. Note that the y values we pass to the fit method are dummy values, since our loss function does not care about the real targets (which we do not know).
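
A sketch of that training call, assuming the helper functions sketched above and a triplet_loss written with the usual Keras (y_true, y_pred) signature; the optimizer, batch size and epochs are illustrative assumptions.

```python
# anchors, positives and negatives are assumed to be numpy arrays of
# shape (n, 60, 132) prepared from the labelled traffic windows.
import numpy as np

siamese = create_siamese_network(create_base_network())
siamese.compile(optimizer='adam', loss=triplet_loss)

dummy_y = np.zeros((len(anchors), 1))   # ignored by the triplet loss
siamese.fit([anchors, positives, negatives], dummy_y,
            batch_size=32, epochs=10)
```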

Now we could save the model (really, just the base model is needed here). But more importantly, we can use the base model to perform some evaluation of what the vector representation would be. For me, this was the part which was unclear from other tutorials. You simply perform a predict on the base model and do not care anymore about the Siamese “framework”. You kind of throw it away.

For completeness’ sake, since what we want to do is find the “closest” vector representation to the trained faults we want to detect, we could create a method to identify the traffic case such as the following.
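
Here is a hedged sketch of such a lookup, assuming the reference representations are kept in a simple dictionary mapping a label to its stored embedding.

```python
# Compare the embedding of the current traffic window against stored
# reference embeddings and return the label of the nearest one.
import numpy as np


def identify_traffic_case(base_model, window, reference_embeddings):
    """window: array of shape (60, n_feat); reference_embeddings: {label: vector}."""
    current = base_model.predict(window[np.newaxis, ...])[0]
    distances = {label: np.linalg.norm(current - ref)
                 for label, ref in reference_embeddings.items()}
    return min(distances, key=distances.get)
```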

Assuming proper training of our Siamese network on our training data, we can use the above to create a database of the different traffic conditions we can identify in a specific network (as traffic patterns can change from network to network, but hopefully not the way to represent them), and then identify the current traffic using the function created above.

Et voilà, you should now have all the pieces to properly use Aristotle, Peter Pan and Napoleon to train a Siamese Network, and then sadly throw them away when you do not need them anymore… This metaphor of Siamese cats is heartbreakingly getting closer and closer to reality… Nevertheless, I hope it can help you out there creating all sorts of Siamese Networks!

Do Telecom Networks Dreams of Siamese Memories?

In this post I will try to make understandable a Deep Neural Network I developed lately. We are still in the early stages and a lot of improvements will need to go in, but the preliminary results are positive. I have been told I am not so great at explaining things at a high level, so a word of warning: some parts may get deeply technical! So, let’s start with the buzzwords: what I will describe is a One-Shot Learning approach using a Siamese Deep Neural Network which characterizes ongoing data traffic patterns in a telecom network to identify faults in real time.

Telecom network nodes (think pieces of equipment) often suffer from recurring faults. There are things done by human operators, traffic patterns exhibited by the users, or situations in adjacent nodes which can impact the performance of a specific node. Once degradation is identified, an analyst goes through the alarms raised by the equipment or the logs, figures out the issue and fixes it. Some of those faults recur for one reason or another. Analysts probably get better and better at identifying and fixing them, but it still takes some of their precious time. Wouldn’t it be nice if we could identify those automatically and then act to fix the problem? This is pretty much in line with the ONAP vision of complete life-cycle management of a service. Let’s say it is a small part of the mechanism required to make that vision real.

The objective is to develop a Machine Learning trained analytic module for a specific set of Network Function Virtualization (NFV) components which can feed into the ONAP policy engine architecture. The analytic module monitors the NFV service levels in real time and informs the policy engine about the NFV service status, i.e. normal working status or degraded/failure mode, and in the latter case why it is failing.

Ideally, we want a trained analytic module which knows about a lot of different error characteristics and can adapt as easily as possible to different network conditions, i.e. nodes in different networks may be subject to different traffic patterns, but still be subject to the same errors. In other terms, it would be nice to be able to deploy this module to supervise nodes deployed at different operators without having to retrain it completely…

For the purpose of this experiment we use, as data, traffic information collected by probes on the control plane traffic coming into/out of a specific node (a P-CSCF (a Proxy Server) of an IP Multimedia Subsystem (IMS)). The probes are part of an Ericsson product, Ericsson Expert Analytics, and take care of the collection and storage of the data from the NFV component. The P-CSCF is part of a test network we created for the experiment and is subject to a realistic traffic model simulated by network traffic generation servers.

The first step is to characterize statistically the traffic going through the P-CSCF and collected by the probes. In this step we create a set of about 130 statistical features describing the traffic, based on 1-minute intervals. For example: the number of Registrations in a minute; the number of Session Initiations; the number of operations presenting error codes and the count of those error codes, e.g. the number of Registrations with return code 2xx, 3xx, …; the average time required to complete operations; the standard deviation of those times; etc.

A first choice is how long a stream we should base our decision on. We decided to go with 1-hour intervals, thus we use 60 consecutive examples of that 130-or-so-feature vector for training and for predictions. We label our examples such that if no error is present for the whole period, it is “normal traffic”; if we introduced an error during that 60-minute period, so that the example partly exhibits a specific error, then it is labelled with that error.

To fulfil our need for easy adaptation of the trained analytic module, we decided to go with a One-Shot learning approach. Our hope is that we can train a Deep Neural Network which characterizes the traffic it is presented with as a “small” vector (here we initially selected a vector of 10 values), akin to word embeddings in Natural Language Processing (NLP). We also hope that the vector arithmetic properties observed in that field for translation purposes will hold, e.g. king – man + woman = queen; Paris – France + Poland = Warsaw. If such properties hold, deployment of the trained analytic module in a different environment will consist simply of observing a few examples of regular traffic and adjusting to the specific traffic pattern through arithmetic operations. But I am getting ahead of myself here!

To perform training according to the One-Shot learning strategy, we developed a base LSTM-based Deep Neural Network (DNN) which is trained in a Siamese Network framework, akin to what is done for image recognition. To do so we create Anchor-Positive-Negative triplets of 60 minutes / 130 features data. In other words, we select an anchor label, e.g. normal traffic or error X, we then select a second example of the same category and a third example from another label category. This triplet of examples becomes what we provide as examples to our Siamese framework to train our LSTM-based DNN. Our initial results were obtained with as few as 100k triplets, thus we expect better results when we train with more examples.

Our Siamese framework can be described as follows: the three data points from a triplet are evaluated through the base LSTM-based DNN, and our loss function sees to minimizing the distance between Anchor-Positive examples and maximizing the distance between Anchor-Negative examples. The base LSTM-based DNN is highly inspired by my previous trials with time series and consists of the following:

_________________________________________________________________
Layer (type)                 Output Shape              Param #  
=================================================================
batch_normalization_1 (Batch (None, 60, 132)           528      
_________________________________________________________________
lstm_1 (LSTM)                (None, 60, 512)           1320960  
_________________________________________________________________
lstm_2 (LSTM)                (None, 512)               2099200  
_________________________________________________________________
batch_normalization_2 (Batch (None, 512)               2048     
_________________________________________________________________
dense_1 (Dense)              (None, 512)               262656   
_________________________________________________________________
batch_normalization_3 (Batch (None, 512)               2048     
_________________________________________________________________
dense_2 (Dense)              (None, 10)                5130     
_________________________________________________________________
batch_normalization_4 (Batch (None, 10)                40        
=================================================================
Total params: 3,692,610
Trainable params: 3,690,278
Non-trainable params: 2,332
_________________________________________________________________

Once the base LSTM-based DNN is trained, we can compute the vector representation of each of the traffic cases we are interested in, e.g. Normal Traffic, Error X traffic, …, and store them.

When we want to evaluate the status of the node in real time, we pick the last hour of traffic data and compute its vector representation through the trained base LSTM-based DNN. The closest match between the stored vector representations of the traffic cases and the current traffic gives our predicted current traffic state.

At this point in time we only have data collected for one specific error, where a link between the P-CSCF and the Home Subscriber Server (HSS) is down. The diagram below shows our predictions on a previously unseen validation set, i.e. one not used for training.

[Figure: Siamese LSTM-based Deep Neural Network traffic condition prediction on real-time traffic]

As we can see, there are quite a few small false predictions along the way, but when the real error is presented to the trained model, it identifies it correctly.

Our next steps will be to collect data for other errors and train our model accordingly. As I said in the beginning, these are quite early results, but promising nonetheless. So stay tuned for more in the new year!

 

The Fallacious Simplicity of Deep Learning: zat is ze question?

This post is the fifth and last in a series of posts about the “Fallacious Simplicity of Deep Learning”. I have seen too many comments from non-practitioners who think Machine Learning (ML) and Deep Learning (DL) are easy; that any computer programmer following a few hours of training should be able to tackle any problem because, after all, there are plenty of libraries nowadays… (or other such excuses). This series of posts is adapted from a presentation I will give at the Ericsson Business Area Digital Services technology day on December 5th. So, for my Ericsson fellows, if you happen to be in Kista that day, don’t hesitate to come see it!

In the previous posts, we saw that the first complexity lies in the size of the machine learning and deep learning community: there are not enough skilled and knowledgeable people in the field. The second complexity lies in the fact that the technology is relatively new, thus the frameworks are quickly evolving and require software stacks that range all the way down to the specialized hardware we use. The third complexity was all about hyper-parameter setting, a skill specific to machine learning that you will need to acquire. The fourth complexity dealt with data: how to obtain it, how to clean it, how to tame it.

The next challenge we will look at is ze question. When we start a new machine learning project, we might have a question we want to answer. Through data exploration we might find that this question cannot be answered directly with the data we have. Maybe we figure the question is not that interesting after all. It all boils down to fast feedback. You need to explore your data, try to answer a question and see where it leads. Then it is time for discussion with the stakeholders: is it what they are looking for? Does it bring value?

There are different types of questions machine learning can answer, but they are not unlimited. Do we want to sort things into similar buckets, do we want to predict such and such value, do we want to find examples that are abnormal? For a machine learning exercise to be successful, you need a really specific question to answer: is this subscriber an IoT device or a Human? Then you need proper data. Using Canadian census data to try to figure out whether a mobile phone subscriber is a Human or a Machine might not work! But with the proper data we could start to explore whether there is a model linking some data, for example IP addresses visited, the time of those visits, etc., to the fact that the subscriber is a Machine or a Human.

Often the question will evolve with time, through discussion. You need to be ready for that evolution, for that change. You might have to bring in new data, new methods, new algorithms. It is all about searching and researching, trials and errors, finding new paths. Getting access to the data might be the biggest difficulty, but finding the right question is certainly the second.

The fifth complexity: finding the right question.

Through this series I have detailed five complexities of deep learning (and machine learning in general). There are many more. The machine learning “process” is not like software development. In general, it requires a lot more exploration and research than regular software development. It requires a higher level of “artistic flair” than you would need to write a regular software application. There are other things that differentiate machine learning from software development, but I think these are the first five and biggest complexities one can face when developing a machine learning model:

  • First you need access to data, and that might not be trivial. You will also need to clean that data and ensure you have a consistent flow.
  • Second you will need to find the right question. This may require many iterations and might require new data sources as well.
  • Third, the results may look simple, but the code itself does not show everything that is hidden behind the curtain. A lot has to do with hyper-parameter setting and tweaking and there is no cookbook for this. The APIs do not tell you what values will give good results.
  • Fourth, machine learning requires specific competences. Some of those competences have to do with software development, but some others are quite different. It is a relatively new domain and the community is still smaller than others in software development. Moreover, this is a highly in-demand skill set which is hard to find in the wild.
  • Finally, this is a quickly evolving domain which ranges from specialized hardware to specialized software stacks. In software development people are accustomed to quickly evolving environments, but the pace and breadth at which machine learning evolves might well be unprecedented.

I hope this series could give you a better appreciation of the complexities of deep learning and machine learning. It may look easy when you look at the results and the code that supports them, but there is a lot you are not shown! The Artificial Intelligence field will continue to grow in the coming years and will enter many more aspects of your daily lives. This will require people trained in deep learning, machine learning and data science, as this is not simply the usage of yet another software library.


Originally published at medium.com/@TheLoneNut on November 28, 2017.

The Fallacious Simplicity of Deep Learning: wild data

This post is the fourth in a series of posts about the “Fallacious Simplicity of Deep Learning”. I have seen too many comments from non-practitioners who think Machine Learning (ML) and Deep Learning (DL) are easy; that any computer programmer following a few hours of training should be able to tackle any problem because, after all, there are plenty of libraries nowadays… (or other such excuses). This series of posts is adapted from a presentation I will give at the Ericsson Business Area Digital Services technology day on December 5th. So, for my Ericsson fellows, if you happen to be in Kista that day, don’t hesitate to come see it!

In the previous posts, we saw that the first complexity lies in the size of the machine learning and deep learning community: there are not enough skilled and knowledgeable people in the field. The second complexity lies in the fact that the technology is relatively new, thus the frameworks are quickly evolving and require software stacks that range all the way down to the specialized hardware we use. The third complexity was all about hyper-parameter setting, a skill specific to machine learning that you will need to acquire.

The next challenge with machine learning, or should I say the first challenge, is the data! You need data to perform a machine learning task. For deep learning, you arguably need even more of said data.

This is an angle that you will not see if you take courses (online or in schools) about data science, machine learning or deep learning. At least I’ve never seen it properly presented as a difficulty. When you take a course, the data is most of the time provided. Or you get a clear indication of where to obtain it from. In most cases this is well-curated data, with well explained and documented fields and formats. If the data is not already available and the exercise wants you to concentrate on the data scraping aspect, then the other parameters of the exercise will be well defined. They will tell you where to scrape the data from and what is of importance for the exercise. It will make sure the scraping can be done in a structured fashion. If the exercise wants you to concentrate on data cleaning or data augmentation, the dataset will have been prepared to properly show you how this can be done. Not to say that those courses or exercises are easy, but they do not show the real difficulty of wild data.

I think that for most companies, it all starts with wild data. As data science grows in a company, people put structure and pipelines around data. But this comes with time and size. And it certainly does not prevent some wild data beast from entering the zoo from time to time. So, assuming you have access to data, it might really be wild data and you will have to tame it. Some say deep learning is less about data engineering and more about model architecture creation. That is certainly true in computer vision, where the format of the data was agreed upon some time ago. But what about other domains; is the format so widely accepted? You might still have to do some feature engineering to feed your data into your model, adding the feature engineering problem to the hyper-parameter tuning problem…

On the other hand, you might well be in a situation where you do not have access to data. What if your business is selling a specific type of equipment? That equipment might not be connected to a communication network. If it is, the customer might never have been asked whether you could use their data. How do you put in place a contract or license that allows you to collect data? Does the legislation in the regions where you sell that equipment allow for that collection, or are there restrictions you need to dance with? Is there personally identifiable information you will need to deal with? Will you need to anonymize the data? If you start a new business, based on a new business model, you might be able to build it into your product. But if you have a continuing business, how do you incorporate it into your product? Does it need to be gradual?

I guess you now have a good understanding of our fourth complexity. Where do you get your data from?

Data might be hard to come by. It might be wild and messy. It might not even relate to the question you want to answer… and that will be the subject of our fifth complication: do you have the right question?

The Fallacious Simplicity of Deep Learning: hyper-parameters tuning

This post is the third in a series of posts about the “Fallacious Simplicity of Deep Learning”. I have seen too many comments from non-practitioners who think Machine Learning (ML) and Deep Learning (DL) are easy; that any computer programmer following a few hours of training should be able to tackle any problem because, after all, there are plenty of libraries nowadays… (or other such excuses). This series of posts is adapted from a presentation I will give at the Ericsson Business Area Digital Services technology day on December 5th. So, for my Ericsson fellows, if you happen to be in Kista that day, don’t hesitate to come see it!

In the previous posts, we saw that the first complexity lies in the size of the machine learning and deep learning community: there are not enough skilled and knowledgeable people in the field. The second complexity lies in the fact that the technology is relatively new, thus the frameworks are quickly evolving and require software stacks that range all the way down to the specialized hardware we use. We also said that to illustrate the complexities we would show an example of deep learning using Keras. I described the model I use in a previous post, “This is not me blogging!”. The model can generate new blog posts that look like mine after being trained on all my previous posts. So, without any further ado, here is the short code example we will use.

 

[Figure: A Keras code example]

In these few lines of code you can see the gist of how one would program a text-generating neural network such as the one pictured beside the code. There is more code required to prepare the data and generate text from model predictions than simply the call to model.predict. But the part of the code related to creating, training and making predictions with a deep neural network is all in those few lines.

You can easily see the definition of each layer: the embedding in green, and the two Long Short-Term Memory (LSTM) layers, a form of Recurrent Neural Network, here in blue, followed by a fully connected dense layer in orange. You can see that the inputs are passed when we train the model, or fit it, in yellow, as well as the expected output, our labels, in orange. Here that label is, given the beginning of a sentence, what the next character in the sequence would be: the subject of our prediction. Once you have trained that network, you can ask for the next character, and the next, and the next… until you have a new blog post… more or less, as you have seen in a previous post.
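
The original code is only shown as an image above, so here is a minimal sketch consistent with that description: an embedding layer, two LSTM layers and a dense output, followed by the fit and predict calls. The vocabulary size, layer widths and training settings are assumptions, not the original values.

```python
# Sketch of a character-level text-generating network in Keras.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

VOCAB_SIZE = 100   # number of distinct characters (assumption)
SEQ_LEN = 40       # characters of context fed to the model (assumption)

model = Sequential([
    Embedding(VOCAB_SIZE, 32),                 # the embedding layer ("green")
    LSTM(256, return_sequences=True),          # first LSTM layer ("blue")
    LSTM(256),                                 # second LSTM layer ("blue")
    Dense(VOCAB_SIZE, activation='softmax'),   # fully connected output ("orange")
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

# x has shape (n_examples, SEQ_LEN) with character indices, and y holds the
# index of the character that follows each sequence:
# model.fit(x, y, batch_size=128, epochs=20)
#
# After training, the most likely next character for a batch of seeds is:
# next_char_ids = model.predict(seed_batch).argmax(axis=-1)
```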

For people with a programming background, there is nothing complicated here. You have a library, Keras, you look at its API and you code accordingly, right? Well, how do you choose which layers to use and in which order? The API will not tell you that… there is no cookbook. So, the selection of layers is part of our next complexity. But before stating it as such, let me introduce a piece of terminology: hyper-parameters. Hyper-parameters are, in deep learning and machine learning, any parameters whose values you can vary but ultimately have to fine-tune to your data if you want your model to behave properly.

So according to that definition of hyper-parameter, the deep neural network topology or architecture is a hyper-parameter. You must decide which layers to use and in what order. Hyper-parameter selection does not stop at the neural network topology though. Each layer has its own set of hyper-parameters.

The first layer is an embedding layer. In this case it converts character input into a vector of real numbers; after all, computers can only work with numbers. How big will this encoding vector be? How long will the sentences we train with be? Those are all hyper-parameters.

On the LSTM layers, how wide will they be, or how many neurons will we use? Will we use all the outputs all the time or drop some of them (a technique called dropout which helps regularize neural networks and reduce cases of overfitting)? Overfitting is when a neural network learns your training examples so well that it cannot generalize to new examples, meaning that when you try to predict on a new value, the results are erratic. Not a situation you desire.

You have hyper-parameters to select and tweak right up to model compilation time and model training (fit) time. How big will the tweak to your neural network weights be at each computation pass? How big will each pass be in terms of the examples you give to the neural network? How many passes will you perform?

If you take all of this into consideration, you end up with most of the code written being subject to hyper-parameter selection. And again, there is no cookbook or recipe yet to tell you how to set them. The API tells you how to enter those values in the framework, but cannot tell you what the effect will be. And the effect will be different for each problem.

It’s a little bit as if the argument you give to a print statement, you know, like print(“hello world!”), were not “hello world” but some value (the hyper-parameter) which would print something based on that value and on whatever has been printed in the past, and you had to tweak it so that at training time you get the expected results!!! This would make any good programmer go insane. But currently there is no other way with deep neural networks.

[Figure: Hello World with hyper-parameters]

So our third complexity is not only the selection of the neural network topology but also all the hyper-parameters that come with it.

It sometimes requires insight into the mathematics behind the neural net, as well as imagination and lots of trial and error, while being rigorous about what you try. This definitely does not follow the “normal” software development experience. As said in my previous post, you need a special crowd to perform machine learning or deep learning.

My next post will look at what is sometimes the biggest difficulty for machine learning: obtaining the right data. Before anything can really be done, you need data. This is not always a trivial task…