Automation and Sampling

As I mentioned earlier, I have transitioned from Ericsson to Shopify. As part of this transition I am getting a taste of public transport (previously work was a 15-minute drive from home; now, working in the city center, I have to take the train to commute). This morning there was a woman just beside me who was working on her computer, apparently editing a document or, more probably, writing comments in it. A few minutes into her edits, she made a phone call, asking someone over the phone to change some wording in the document and painfully dictating those changes (a few words). This game went on a few times: writing comments in the document, then calling someone to make the appropriate edits. What could have been a simple edit followed by sending the edited document via email became an apparently painful exercise in dictation. The point is not to figure out why she was not sending the document via email; it could be as simple as not having a data plan and not being willing to wait for a wifi connection, who knows. The point is that using a non-automated “process” turned something ordinarily quite simple (editing a couple of sentences in a document) into a painful dictation experience. It also limits the bandwidth, so only a few comments can make their way into corrections on that document.

This reminded me of a conversation I had with a friend some time ago. He mentioned the pride he took in having put in place a data pipeline at his organization for the extraction, transformation and storage of two data sources in a local database. Some data is generated by a system in his company. Close to that data source he has a server which collects and reduces / transforms the data and stores the results in the local file system as text files. Every day he looks at the extraction process on that server to make sure it is still running, and every few days he downloads the new text files from that server to a server farm and a database he usually uses to perform his analysis on the data. As you can see, this is also a painful, non-automated process. As a consequence, the amount of data is most probably more limited than it could be with an automated process, as my friend has to cater to the needs of those pipelines manually.

At Shopify I have the pleasure of having access to an automated ETL (Extract, Transform and Load) process for the data I may want to analyze. If you want to get a feel for what is available at Shopify with respect to ETL, I invite you to watch the Data Science at Shopify video presentation from Françoise Provencher, who touches a bit on that as well as on the other aspects of the job of a data scientist at Shopify. In short, we use pyspark with custom libraries developed by our data engineers to extract, transform and load data from our sources into a front room database which anyone in the company can use to get information and insight about our business. If you listen through Françoise’s video, you will understand that one of the benefits of that automated ETL scheme is that we transform the raw data (mostly unusable) into information that we store in the front room database. This information is then available for further processing to extract valuable insight for the company. You immediately see the benefit. Once such a pipeline is established, it performs its work autonomously and, as an added benefit, thanks to our data engineering team, monitors itself all the time. Obviously, if something goes wrong somebody will have to act and correct the situation, but otherwise you can forget about that pipeline, and its always-updated data is available to all. No need for a single person to spend a sizable amount of time monitoring and manually importing data. A corollary to this is that the bandwidth for new information is quite high and we get a lot of information on which we can do our analysis.

Having that much information at our fingertips brings new challenges mostly not encountered by those who have manual pipelines. It becomes increasingly difficult and inefficient to do analysis on the whole population. You need to start thinking in terms of samples. There are a couple of considerations to keep in mind when you do sampling: the sample size and what you are going to sample.

There are mathematical and analytical ways to determine your sample size, but a quick way to get it right is to start with a modest random sample, perform your analysis, look at your results and keep them. Then you redo the cycle a few times and watch whether you keep getting the same results. If your results vary wildly, you probably do not have a big enough sample. Otherwise you are good. If it is important for future repeatability to be as efficient as possible, you can try to reduce your sample size until your results start to vary (at which point you should revert to the previous sample size), but if not, good enough is good enough! Just remember that those samples must be random! If you redo your analysis using the same sample over and over again, you haven’t proven anything. In SQL terms, it is the difference between:


Which would produce a random sample of 10% of the table. On the other hand:

LIMIT 1000

Will most likely always produce the same first 1000 elements… this is not a random sample!
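
The same contrast can be sketched outside of SQL. The snippet below is only an illustration in plain Python, not the queries from this post: the table, its `value` column and the helper names are all made up, and the rows are deliberately stored in an order that correlates with the value, as often happens with tables filled chronologically.

```python
import random

# Hypothetical table: 10,000 rows stored in insertion order, where older rows
# (low ids) happen to have smaller values than recent ones.
table = [{"id": i, "value": i / 100} for i in range(10_000)]

def first_n(rows, n):
    """Similar in spirit to LIMIT n: always returns the same leading rows."""
    return rows[:n]

def random_fraction(rows, fraction, rng):
    """Keep each row independently with probability `fraction`."""
    return [row for row in rows if rng.random() < fraction]

def mean_value(rows):
    """The 'analysis' run on each sample: a simple average."""
    return sum(row["value"] for row in rows) / len(rows)

rng = random.Random(42)  # seeded only so the sketch is repeatable
population_mean = mean_value(table)
head_mean = mean_value(first_n(table, 1000))
sample_means = [mean_value(random_fraction(table, 0.10, rng)) for _ in range(5)]

print(f"whole table:        {population_mean:.2f}")
print(f"first 1000 rows:    {head_mean:.2f}")
print("random 10% samples:", [f"{m:.2f}" for m in sample_means])
```

Every random sample differs a little from the others yet stays close to the figure for the whole table, while the first 1000 rows give the same, badly biased answer every time.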

The other consideration you should keep in mind is what you should sample: what is the population you want to observe? To use an example in the lingo of Shopify, let’s say I have a table of all my merchants’ customers which, amongst other things, contains a foreign key to an orders table. If I want to get a picture of how many orders a customer places, the population under observation is the customers, not the orders. In other words, in that case I should randomly sample my customers, then look up how many orders they have. I should not sample the orders, aggregate them per customer and hope this will produce the expected results.

Visually, we can see that sampling from orders will lead us to wrongly think that each customer places on average two orders. Random resampling will lead to the same erroneous result.


Figure: Sampling orders leads to the wrong conclusion that each customer places an average of 2 orders.

Whereas sampling from customers will lead to the correct answer that each customer places on average four orders.


Figure: Sampling customers leads to the right conclusion that each customer places an average of 4 orders.
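
For readers who prefer code to pictures, here is a minimal sketch of the same effect in plain Python. The data is entirely made up (1,000 customers with exactly four orders each) and the function names are mine; it is meant as an illustration of the reasoning, not as the way this is done at Shopify.

```python
import random

# Hypothetical data: 1,000 customers who each placed exactly 4 orders.
orders = [(customer_id, order_no) for customer_id in range(1000) for order_no in range(4)]

def average_from_sampled_orders(orders, fraction, rng):
    """Sample orders first, then count the sampled orders per customer."""
    counts = {}
    for customer_id, _ in orders:
        if rng.random() < fraction:
            counts[customer_id] = counts.get(customer_id, 0) + 1
    return sum(counts.values()) / len(counts)

def average_from_sampled_customers(orders, fraction, rng):
    """Sample customers first, then count all the orders of each sampled customer."""
    customer_ids = sorted({customer_id for customer_id, _ in orders})
    counts = {c: 0 for c in customer_ids if rng.random() < fraction}
    for customer_id, _ in orders:
        if customer_id in counts:
            counts[customer_id] += 1
    return sum(counts.values()) / len(counts)

rng = random.Random(7)  # seeded only to make the sketch repeatable
from_orders = average_from_sampled_orders(orders, 0.5, rng)
from_customers = average_from_sampled_customers(orders, 0.5, rng)
print(f"sampling orders:    {from_orders:.2f} orders per customer")
print(f"sampling customers: {from_customers:.2f} orders per customer")
```

Sampling the orders drops some of each customer's orders before they are counted, so the per-customer average lands well below the true value of 4. Sampling the customers keeps every order of the selected customers and recovers it.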

To summarize things, let’s just say that if you have manual (or even semi-manual) ETL pipelines, you need to automate them to give you consistency and throughput. Once this is done, you will eventually discover the joys (and need) of sampling. When sampling, you must make sure you select the proper population to sample from and that your sample is randomly selected. Finally, you could always analytically find the proper sample size, but with a few trials you will most probably be just fine if your findings stay consistent through a number of random samples.
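
The trial-and-error approach to sample size can be sketched as follows. This is a minimal illustration under my own assumptions: the "analysis" is a plain average, the population is synthetic, and the doubling strategy, the tolerance and the function names are choices made for the example rather than a recommended procedure.

```python
import random
import statistics

def estimate(population, sample_size, rng):
    """The 'analysis': here simply the mean of a fresh random sample."""
    return statistics.mean(rng.sample(population, sample_size))

def find_stable_sample_size(population, start_size, tolerance, rng, repeats=5):
    """Double the sample size until repeated random samples agree within `tolerance`."""
    size = start_size
    while size < len(population):
        results = [estimate(population, size, rng) for _ in range(repeats)]
        spread = max(results) - min(results)
        if spread <= tolerance:
            return size, results
        size *= 2
    # No smaller sample was stable enough: fall back to the whole population.
    return len(population), [statistics.mean(population)]

rng = random.Random(2018)  # seeded only so the sketch is repeatable
# Hypothetical population: 100,000 order values drawn from a skewed distribution.
population = [rng.expovariate(1 / 50) for _ in range(100_000)]
size, results = find_stable_sample_size(population, start_size=100, tolerance=2.0, rng=rng)
print("sample size retained:", size)
print("results of the last round:", [round(r, 1) for r in results])
```

Each round draws new random samples, which is the whole point: rerunning the analysis on the same sample would always agree with itself and prove nothing.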

Cover photo by Stefan Schweihofer at Pixabay.

AI market place is not what you are looking for (in the telecommunication industry).

In a faraway land was the kingdom of Kadana. Kadana was a vast country with few inhabitants. The fact that, on the warmest days of summer, the temperature was seldom above -273°C was probably a reason for it. The land was cold, but people were warm.

In Kadana there were 3 major telecom operators: B311, Steven’s and Telkad. There were also 3 regional ones: Northlink, Southlink and Audiotron. Many neighboring kingdoms also had telecom operators, some a lot bigger than the ones in Kadana. Dollartel, Southtel and Purpletel were all big players, and many more competed in that environment.

It was a time of excitement. A new technology called AI was becoming popular in other fields and the telecommunications operators wanted to get the benefits as well. Before going further in our story, it can be of interest to understand a little bit what this AI technology is all about. Without going into too much detail, let’s just say that traditionally, if you wanted a computer to do something for you, you had to feed it a program handcrafted with passion by a software developer. The AI promise was that from now on, you could feed a computer a ton of data about what you want done and it would figure out the specific conditions and provide the proper output without (much) programming. For those aware of AI this looks like an overly simplistic (if not outright false) summary of the technology, but let’s keep it that way for now…

Going back to the telecommunication world, somebody with nice ideas decided to create Akut05. Akut05 was a new product combining the idea of a marketplace with the technology of AI. Cool! The benefit of a marketplace, as demonstrated by the Apple App Store or Google Play, combined with the power of AI.

This is so interesting, I too want to join that party, and I immediately create my company. So now I need to create a nice AI model that I could sell on the Akut05 marketplace platform.

Well, let’s not be so fast… You see, AI models are built from data, as I said before. What data will I use? That’s just a small hurdle for my company… we go out and talk with operators. Nobody knows us, it’s a new company, so let’s start with local operators. B311, Steven’s and Telkad all think we are too small a player to give us access to their data. After all, their data is a treasure trove they should benefit from, so why would they give us access to it? We then go to smaller regional players, and Northlink shows some interest. They are small and cannot invest massively in a data science team to build nice models, so with a proper NDA, they agree to give us access to their data; in return, they will have access to our model on Akut05 at a substantial rebate.

Good! We need to start somewhere. I’ll skip all the adventures along the way of getting the data, preparing it and building a model… but let me tell you, it was full of adventures. We deploy a nice model in an Akut05 store and it works wonderfully… for a while. After some time, the subscribers from Northlink change their behavior a bit, and Northlink sees that our model does not respond properly anymore. How do they figure that out? I have no idea, since Akut05 does not provide any real model monitoring capabilities besides the regular “cloud” monitoring metrics. More alarming, we see 1-star reviews pouring in from B311, Steven’s and Telkad, who tried our model and got poor results from the get-go. And there is nothing we can do about it because, after all, we never got deals with those big names to access their data. A few weeks later, having discounted the model to Northlink and getting only bad press from all other operators, my company goes bankrupt and we never hear from it again. The same happens to a lot of other small model developers who tried their hand at it, and in no time the Akut05 store is empty of any valuable model.

So, contrary to an App Store, a Model Store is generally a bad idea. To get a model right (assuming you can) you need data. This data needs to come from representative examples of what you want the model to apply to. But it’s easy, we just need all the operators to agree to share their data! Well, if you don’t see the irony, then good luck. But this is a nice story, so let’s put aside the irony. All the operators in our story decide to make their data available to any model developer on the Akut05 platform. What else could go wrong?

Let us think about a model that uses the monthly payment a subscriber pays to the operator. In Kadana this amount is provided in the data pool in $KAD, and it works fine for all Kadanian operators. Dollartel tries it out and (not) surprisingly it fails miserably. You see, in the market of Dollartel, the money in use is not the $KAD, but some other currency… The model builder, even if he has data from Dollartel, may have to do “local” adjustments. Can a model still make good money for the model builder if the market is small and fractured, i.e. needs special care? Otherwise you’ll get 1-star reviews and again disappear after a short while.
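
To make the currency problem concrete, here is a toy sketch in Python. Everything in it is invented for the sake of the story: the "model" is reduced to a single threshold, the "DTL" currency code for Dollartel's market and the exchange rate are hypothetical, and a real model would fail in subtler ways than this.

```python
from dataclasses import dataclass

# Hypothetical rule learned from Kadanian data only: subscribers paying more
# than 80 $KAD per month are flagged as "premium".
PREMIUM_THRESHOLD_KAD = 80.0

# Hypothetical conversion rates to $KAD; "DTL" stands for Dollartel's currency.
TO_KAD = {"KAD": 1.0, "DTL": 0.01}

@dataclass
class Payment:
    amount: float
    currency: str

def is_premium_naive(payment):
    """Applies the Kadanian threshold to the raw amount, whatever the currency."""
    return payment.amount > PREMIUM_THRESHOLD_KAD

def is_premium_adjusted(payment):
    """Same rule after the 'local' adjustment: convert to $KAD first."""
    return payment.amount * TO_KAD[payment.currency] > PREMIUM_THRESHOLD_KAD

kadanian_subscriber = Payment(amount=30.0, currency="KAD")
dollartel_subscriber = Payment(amount=3000.0, currency="DTL")  # worth about 30 $KAD

for name, payment in [("Kadana", kadanian_subscriber), ("Dollartel", dollartel_subscriber)]:
    print(name, "naive:", is_premium_naive(payment), "adjusted:", is_premium_adjusted(payment))
```

The two subscribers pay roughly the same amount in real terms, yet the unadjusted rule treats the Dollartel one as premium. The adjustment is trivial here, but someone still has to know it is needed, do it, and maintain it for every market.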

Ok, so Akut05 is not a good idea for independent model builders. Maybe it can still be used by Purpletel, which is a big telecom operator that can hire a great number of data scientists. But in that case, if it’s their data scientists who will do the job, why would they share their data? If they don’t share their data and hire their own data scientists, why would they need a marketplace in the first place?

Independent model builders can’t find their worth in a model marketplace, operators can’t either… can the telecom manufacturer make money there? Well, why would it be more valuable for a manufacturer than for an independent model builder? Maybe it could get easier access to data, but the constraints are basically the same and it wouldn’t be a winning market either, I bet.

Well, therefore a marketplace for AI is not what you are looking for… In a future post I’ll try to say a little bit about what you should be looking for in the telecom sector when it comes to AI.

For sure this story is an oversimplification of the issue; still, I think we can get the point. You have a different view? Please feel free to share it in the comments below so we can all learn from a nice discussion!

Cover photo by Ed Gregory at Pexels.

How to become a good data scientist

After being so vocal about how to be a bad data scientist, I thought I should level the playing field by giving some hints on how to become a good data scientist. The other side of the coin.

My strong feeling is that if you just start in the field for employment or salary reasons, you start on the wrong foot. You should first look at your passions. Here it is interesting to take a few seconds to look up the word passion as defined on


[pashuh n]


  1. any powerful or compelling emotion or feeling, as love or hate.
  2. strong amorous feeling or desire; love; ardor.
  3. strong sexual desire; lust.
  4. an instance or experience of strong love or sexual desire.
  5. a person toward whom one feels strong love or sexual desire.
  6. strong or extravagant fondness, enthusiasm, or desire for anything: a passion for music.
  7. the object of such a fondness or desire: Accuracy became a passion with him.

Hopefully the scope of your passion for data science does not involve definitions 2, 3, 4 or 5, but is driven by a strong fondness and enthusiasm for data science! If so, you are on the right track and my first advice would be: do not try to swallow the ocean in one sip. Zoom in on one aspect of that passion, the one that piqued your interest first. See how you could apply it to a real-world problem and learn along the way. For example, in my case, I got passionate about artificial life a long time ago. That evolved into a fondness for a form of reinforcement learning, genetic algorithms and genetic programming, around 2012. As time passed, my interest in machine learning and deep learning grew, and I learned about them by reading books, taking online courses and taking a graduate course while studying for my master’s degree. At that time, I had the hope of applying it to the project I had for my master’s thesis, but sometimes plans change. So, in short, you need to follow your heart here.

If you go with such an approach, you will avoid many of the pitfalls I mentioned in the first post. You won’t come to expect a “clean” data set as your input, since you’ll have applied what you learned to a few real-life examples along the way. You will learn how to gather data, how to clean it, how to interpret it… It will benefit you in two ways. First, you will learn one of the essential skills, data cleaning. But most importantly, it will grow your inquisitive mind, something that I have never seen a single course be able to do. Again, I do not think this is a skill you can get in a few weeks; it requires a mind shift that you will acquire through repeated practice.

Another benefit of following your passion is that if you don’t already have the necessary mathematical background, you will pick it up along the way. If you find maths hard, it is probably easier to pick it up on an as-needed basis as you expand your knowledge through your own passionate experiments! I will also reiterate that, regardless of what you might think or have been told, mathematics is not so hard. Moreover, it is way easier to grasp if you start with a positive attitude, telling yourself that you can do it.

The next benefit of such an approach is that you will have to define and refine your problem. You will decide what is important to you, what your “research” question is and how it relates to the activities you are doing along the way. When I was doing my master’s degree, I saw two types of students. The first type already had a research agenda, a question they wanted to explore, or at least sat down early with their advisor and set up such a research question in line with their interests and passions. Those students usually made high-quality presentations, followed courses highly relevant to answering their research questions and became highly proficient in their field of research. The second type of student waited for their advisors to give them a research project, were never really involved in it, gave average or poor presentations, and followed any courses without really seeing how they related to their research topic (well, in most cases they did not…). In the end they probably still graduated, but with a subject to forget about… You want to be like the first type of student: even if you do it on your own, you want to take control of it and reap the benefits.

Lastly, it is good for you to write or talk about your findings and learnings. I found it helps crystallize my thoughts and (sometimes) get some feedback from like-minded peers. All this to say that academic papers are not the only way to communicate your findings; blogs, videos and reports can all help you if you have the passion. Sure, one advantage of an academic paper is the peer review system, which provides you with feedback on your research, but you should not limit yourself to that single medium of communication if it is not suited to your reality. Expose plainly what you found, and do not claim you are something you are not, or not yet. When the time comes, others will recognize you as a data scientist, and that day you will know you are one for sure!

Along the same lines as my previous post, learn hard: it is easier when you are following a personal research/interest goal. Work hard: again, something easier (not necessarily easy) when you follow a passion. And at all times be honest with yourself (but also with others) about what you know or found out. If you think of yourself as a full-grown data scientist on day one, you might not put in the work necessary to ever become one. On the other hand, if you follow your interests and passions, you might become a data scientist before you even think of yourself as one.

Cover photo by Magda Ehlers at Pexels.

How to be a bad data scientist!

So, you want to be a data scientist, or better, you think you are now a data scientist and you are ready for your first job… Well, make sure you are not one of the stereotypes of “wannabe data scientists” I list below, otherwise you may well go through numerous rejections in interviews. I do not claim it is a complete list of all the stereotypes out there. In fact, if you can think of other stereotypes, please share them in the comments! These are only a few stereotypes of people I have met or seen over time, and which sadly seem to repeat over and over again.

I want to be a data scientist [because of the money] where do I start?

This type of person has heard that there is good money to be made in data science and wants their share of it… Little does this type of person know that a lot of hard work is involved in acquiring the knowledge and skills required to perform the job. Little does this type of person know, too, that data science is a constant work of research. Seldom is a clear path to the solution in front of you. This is even truer with deep learning, where new techniques and ideas pop up every day and where you will have to come up with new ideas. If you need to post the question “where do I start?” on social media, you don’t have what it takes to be one. Get a learn-it-all attitude, build an innovative spirit and then come back later.

I can do data science, please give me the “clean” data.

If you just came from (god forbid) a single data science course, or hopefully a few, and if you took part in one or a few Kaggle-like competitions, you might be under the impression that data comes to you all cleaned up (or mostly ready) and that with a couple of statements or commands it will all be well and ready for machine learning. The thing is that those courses and competitions prepare the data for you, so that you can go to the core of the problem faster and learn the subject matter of machine learning. In real life, data comes wild. It comes untamed and you must prepare it yourself. You might have to collect it yourself. A good part of most data scientists’ job is to play with the data, prepare it, clean it, etc. If you have not done this, figure out a problem of your own, solve it end-to-end and then come back later.

I don’t know any math or I’m bad at it, but people say I can do data science.

No, it is a fallacy. If you don’t have a mathematical mind, one day or the next you will end up in a situation where you just cannot progress anymore. The good thing is that you can learn mathematics. First, get out of the syndrome of “this is too hard”. Anyway, data science is harder, so better start with something as simple as mathematics. Learn some calculus, some statistics, learn to speak and think mathematics and then come back later.

Just give me a “well” defined problem.

Some people just want their little box with well-defined interfaces: what comes in, what is expected to go out. Again, a syndrome of someone who just did some well-canned courses in the field… In reality, not only is the data messy, but the problems you have to solve are messy, ill-defined, muddy… you have to figure it out. Sometimes you can define and refine the problem by yourself, sometimes you have to accept the messiness and play around with it. If you cannot be given vague and approximate objectives and refine them through thinking, research and discussions with the stakeholders until you come up with a solution, don’t expect to be a data scientist. A big misconception here is that if you have a PhD you are immune to that problem… well, not so fast, I have seen PhDs struggling with this as much as anyone else. So, grow a spine, accept the challenge and then come back later.

I’ve learned data science, I have a blog/portfolio/… I can do anything.

Not so fast. This kind of person learned data science and, being more marketing oriented and knowing it can help to build a personal brand, built a portfolio or wrote blog posts, articles, etc., but never went to the point of trying it himself in real life. That person thinks he knows it all and that he can solve anything. That type of person is probably single-handedly responsible for the over-hype of what data science and machine learning can achieve and is more of a problem to the profession than of any help. Do some real work, grow some honesty and then come back later.

If you want to be a data scientist, it all boils down to a simple recipe. Learn hard and work hard. You must follow your path and put passion in it. Seek to grow knowledge along your interests, learn about them, try things. Continuously learn new things, and not only on connected subjects. Do not limit yourself to courses, find real-world examples to practice on, and stay honest about what you can do, about what you know and do not know. Be a good human!

Cover image by tookapic at Pixabay.

The Fallacious Simplicity of Deep Learning: zat is ze question?

This post is the fifth and last in a series of posts about the “Fallacious Simplicity of Deep Learning”. I have seen too many comments from non-practitioners who think Machine Learning (ML) and Deep Learning (DL) are easy, that any computer programmer following a few hours of training should be able to tackle any problem because after all there are plenty of libraries nowadays… (or other such excuses). This series of posts is adapted from a presentation I will give at the Ericsson Business Area Digital Services technology day on December 5th. So, for my Ericsson fellows, if you happen to be in Kista that day, don’t hesitate to come see it!

In the last posts, we’ve seen that the first complexity lies in the size of the machine learning and deep learning community: there are not enough skilled and knowledgeable people in the field. The second complexity lies in the fact that the technology is relatively new, thus the frameworks are quickly evolving and require software stacks that range all the way down to the specialized hardware we use. The third complexity was all about hyper-parameter setting, a skill specific to machine learning that you will need to acquire. The fourth complexity dealt with data: how to obtain it, how to clean it, how to tame it.

The next challenge we will look at is ze question. When we start a new machine learning project, we might have a question we want to answer. Through data exploration we might find that this question cannot be answered directly with the data we have. Maybe we figure out the question is not that interesting after all. It all boils down to fast feedback. You need to explore your data, try to answer a question and see where it leads. Then it is time for discussion with the stakeholders: is it what they are looking for? Does it bring value?

There are different types of questions machine learning can answer, but the list is not unlimited. Do we want to sort things into similar buckets, do we want to predict such and such value, do we want to find examples that are abnormal? In order for a machine learning exercise to be successful, you need a really specific question to answer: is this subscriber an IoT device or a Human? Then you need proper data. Using the Canadian census data to try to figure out if a mobile phone subscriber is a Human or a Machine might not work! But with the proper data we could start to explore whether there is a relationship between some data, for example IP addresses visited, time of those visits, etc., and the fact that the subscriber is a Machine or a Human.

Often the question will evolve with time, through discussion. You need to be ready for that evolution, for that change. You might have to bring in new data, new methods, new algorithms. It is all about searching and researching, trial and error, finding new paths. Getting access to the data might be the biggest difficulty, but finding the right question is certainly the second.

The fifth complexity: finding the right question.

Through this series I have detailed five complexities of deep learning (and machine learning in general). There are many more. The machine learning “process” is not like software development. In general, it requires a lot more exploration and research than regular software development. It requires a higher level of “artistic flair” than you would need to write a regular software application. There are other things that differentiate machine learning from software development, but I think those are the first five and biggest complexities one can face when developing a machine learning model:

  • First you need access to data, and that might not be trivial. You will also need to clean that data and ensure you have a consistent flow.
  • Second you will need to find the right question. This may require many iterations and might require new data sources as well.
  • Third, the results may look simple, but the code itself does not show everything that is hidden behind the curtain. A lot has to do with hyper-parameter setting and tweaking and there is no cookbook for this. The APIs do not tell you what values will give good results.
  • Fourth, machine learning requires specific competences. Some of those competences have to do with software development, but some others are quite different. It is a relatively new domain and the community size is still smaller than others in software development. Moreover, this is a highly in-demand skill set and hard to find in the wild.
  • Finally, this is a quickly evolving domain which ranges from specialized hardware to specialized software stacks. In software development people are accustomed to quickly evolving environments, but the pace and breadth at which it goes in machine learning might well be unprecedented.

I hope this series has given you a better appreciation of the complexities of deep learning and machine learning. It may look easy when you look at the results and the code that supports them, but there are a lot of things you are not given to see! The Artificial Intelligence field will continue to grow in the coming years and will enter many more aspects of your daily life. This will require people trained in deep learning, machine learning and data science, as this is not simply the usage of yet another software library.

Originally published at on November 28, 2017.

The Fallacious Simplicity of Deep Learning: wild data

This post is the fourth in a series of posts about the “Fallacious Simplicity of Deep Learning”. I have seen too many comments from non-practitioners who think Machine Learning (ML) and Deep Learning (DL) are easy, that any computer programmer following a few hours of training should be able to tackle any problem because after all there are plenty of libraries nowadays… (or other such excuses). This series of posts is adapted from a presentation I will give at the Ericsson Business Area Digital Services technology day on December 5th. So, for my Ericsson fellows, if you happen to be in Kista that day, don’t hesitate to come see it!

In the last posts, we’ve seen that the first complexity lies in the size of the machine learning and deep learning community: there are not enough skilled and knowledgeable people in the field. The second complexity lies in the fact that the technology is relatively new, thus the frameworks are quickly evolving and require software stacks that range all the way down to the specialized hardware we use. The third complexity was all about hyper-parameter setting, a skill specific to machine learning that you will need to acquire.

The next challenge with machine learning, or should I say the first challenge, is the data! You need data to perform a machine learning task. For deep learning, you arguably need even more of it.

This is an angle that you will not see if you take courses (online or in schools) about data science, machine learning or deep learning. At least I’ve never seen it properly presented as a difficulty. When you take a course, the data is provided most of the time, or you get a clear indication as to where to obtain it. In most cases this is well-curated data, with well-explained and documented fields and formats. If the data is not already available and the exercise wants you to concentrate on the data scraping aspect, then the other parameters of the exercise will be well defined. They will tell you where to scrape the data from and what is of importance for the exercise. They will make sure the scraping can be done in a structured fashion. If the exercise wants you to concentrate on data cleaning, or data augmentation, the dataset will have been prepared to properly show you how this can be done. This is not to say that those courses or exercises are easy, but they do not show the real difficulty of wild data.

I think that for most companies, it all started with wild data. As data science grows in a company, people put structure and pipelines around data. But this comes with time and size. And it certainly does not prevent the entry of some wild data beast in the zoo from time to time. So, assuming you have access to data, it might really be wild data and you will have to tame it. Some say deep learning is less about data engineering and more about model architecture creation. It is certainly true in computer vision, where the format of the data was agreed on some time ago. But what about another domain, is the format so widely accepted? You might still have to do some feature engineering to input your data in your model, adding the feature engineering problem to the hyper-parameter tuning problem…

On the other hand, you might well be in a situation where you do not have access to data. What if you are used to selling a specific type of equipment? That equipment might not be connected to a communication network. If it is, the customer might never have been asked whether you could use their data. How do you put in place a contract or license that allows you to collect data? Does the legislation in the regions where you sell that equipment allow that collection, or are there restrictions you need to dance with? Is there personally identifiable information you will need to deal with? Will you need to anonymize the data? If you start a new business, based on a new business model, you might be able to build it into your product. But if you have a continuing business, how do you incorporate it in your product? Does it need to be gradual?

I guess you now have a good understanding of our fourth complexity. Where do you get your data from?

Data might be hard to come by. It might be wild and messy. It might not even relate to the question you want to answer… and that will be the subject of our fifth complication: do you have the right question?

The Fallacious Simplicity of Deep Learning: hyper-parameters tuning

This post is the third in a series of posts about the “Fallacious Simplicity of Deep Learning”. I have seen too many comments from non-practitioners who think Machine Learning (ML) and Deep Learning (DL) are easy, and that any computer programmer following a few hours of training should be able to tackle any problem because after all there are plenty of libraries nowadays… (or other such excuses). This series of posts is adapted from a presentation I will give at the Ericsson Business Area Digital Services technology day on December 5th. So, for my Ericsson fellows, if you happen to be in Kista that day, don’t hesitate to come see it!

In the last posts, we’ve seen that the first complexity lies in the size of the machine learning and deep learning community. There are not enough skilled and knowledgeable people in the field. The second complexity lies in the fact that the technology is relatively new, thus the frameworks are quickly evolving and require software stacks that reach all the way down to the specialized hardware we use. We also said that to illustrate the complexities we would show an example of deep learning using Keras. I have described the model I use in a previous post, “This is not me blogging!”. The model can generate new blog posts looking like mine after being trained on all my previous posts. So without any further ado, here is the short code example we will use.


A Keras code example.

In these few lines of code you can see the gist of how one would program a text-generating neural network such as the one pictured beside the code. More code is required to prepare the data and to generate text from the model predictions than the simple call to model.predict. But the part of the code which relates to creating, training and making predictions with a deep neural network is all in those few lines.

You can easily see the definition of each layer: the embedding in green, the two Long Short Term Memory (LSTM) layers, a form of Recurrent Neural Network, here in blue, and a fully connected dense layer in orange. You can see that the inputs are passed when we train the model, or fit it, in yellow, as well as the expected output, our labels, in orange. Here the label is, given the beginning of a sentence, the next character in the sequence. That is the subject of our prediction. Once you have trained that network, you can ask for the next character, and the next, and the next… until you have a new blog post… more or less, as you have seen in a previous post.
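To make the idea of inputs and labels more concrete, here is a minimal sketch in plain Python of how such training examples could be built. It is not the data preparation code of my model; the function name make_training_pairs and the toy text are only assumptions for illustration. Each input is a window of seq_length characters turned into integers, and the label is the character that follows.

```python
def make_training_pairs(text, seq_length):
    """Return (char_to_id, inputs, labels) for next-character prediction."""
    # Map every distinct character to an integer id (sorted for reproducibility).
    vocabulary = sorted(set(text))
    char_to_id = {ch: i for i, ch in enumerate(vocabulary)}

    inputs, labels = [], []
    # Slide a window over the text: seq_length characters in, the next one out.
    for start in range(len(text) - seq_length):
        window = text[start:start + seq_length]
        next_char = text[start + seq_length]
        inputs.append([char_to_id[ch] for ch in window])
        labels.append(char_to_id[next_char])
    return char_to_id, inputs, labels


# Toy text standing in for the blog posts (an assumption for this sketch).
char_to_id, inputs, labels = make_training_pairs("hello world", seq_length=4)
print(len(inputs), "examples; first input", inputs[0], "-> label", labels[0])
```

Note that seq_length is already a value you have to choose yourself, which is a first taste of what comes next.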

For people with a programming background, there is nothing complicated here. You have a library, Keras, you look at its API and you code accordingly, right? Well, how do you choose which layers to use and their order? The API will not tell you that… there is no cookbook. So, the selection of layers is part of our next complexity. But before stating it as such let me introduce a piece of terminology: hyper-parameters. In deep learning and machine learning, a hyper-parameter is any parameter whose value you can vary, but which you ultimately have to fine-tune to your data if you want your model to behave properly.

So according to that definition of hyper-parameter, the deep neural network topology or architecture is a hyper-parameter. You must decide which layers to use and in what order. Hyper-parameter selection does not stop at the neural network topology though. Each layer has its own set of hyper-parameters.

The first layer is an embedding layer. In this case it converts character input into a vector of real numbers; after all, computers can only work with numbers. How big will this encoding vector be? How long will the sentences we train with be? Those are all hyper-parameters.

On the LSTM layers, how wide will they be, or in other words how many neurons will we use? Will we use all the outputs all the time or drop some of them (a technique called dropout, which helps regularize the neural network and reduces cases of overfitting)? Overfitting is when a neural network learns your training examples so well that it cannot generalize to new examples. This means that when you try to predict on a new value, the results are erratic. Not a situation you desire.
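For the curious, dropout itself is a very small idea. Below is a sketch in plain Python of the common “inverted dropout” variant, under the assumption that a layer output is a simple list of numbers; the function name dropout and the values are made up for the illustration, and a framework such as Keras does this for you, only during training. The rate is the hyper-parameter: it is the probability of zeroing each output.

```python
import random


def dropout(activations, rate, rng):
    """Inverted dropout: zero each value with probability `rate`, rescale the rest."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    # Surviving values are scaled up so the expected sum stays the same.
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else value * keep_scale
            for value in activations]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
layer_output = [1.0] * 10  # hypothetical outputs of a layer
dropped = dropout(layer_output, rate=0.5, rng=rng)
print(dropped)
```

With a rate of 0.5, each output is either dropped or doubled. Which rate works best for a given network and dataset is, again, something you have to find out by trying.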

You have hyper-parameters to select and tweak all the way up to the model compilation time and the model training (fit) time. How big will the tweaking of your neural network weights be at each computation pass? How big will each pass be in terms of examples you give to the neural network? How many passes will you perform?
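Those three questions correspond to what is usually called the learning rate, the batch size and the number of epochs. To show where they enter the computation, here is a toy sketch in plain Python: a single weight w fitted by mini-batch gradient descent on made-up data following y = 2 * x. This is of course not a neural network, and the function name train_linear is an assumption for the illustration, but the three hyper-parameters play the same role.

```python
def train_linear(xs, ys, learning_rate, batch_size, epochs):
    """Fit y = w * x with mini-batch gradient descent on squared error."""
    w = 0.0
    for _ in range(epochs):  # how many passes over the data
        for start in range(0, len(xs), batch_size):  # examples per update
            batch_x = xs[start:start + batch_size]
            batch_y = ys[start:start + batch_size]
            # Gradient of the mean squared error with respect to w.
            gradient = sum(2 * (w * x - y) * x
                           for x, y in zip(batch_x, batch_y)) / len(batch_x)
            w -= learning_rate * gradient  # how big each weight update is
    return w


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # toy data generated by y = 2 * x

good = train_linear(xs, ys, learning_rate=0.01, batch_size=2, epochs=200)
too_big = train_linear(xs, ys, learning_rate=0.5, batch_size=2, epochs=20)
print("reasonable settings:", good)
print("learning rate too large:", too_big)
```

With the first settings w ends up very close to 2. With the larger learning rate, the same code on the same data diverges to a huge value. Nothing in the function signature warns you about that.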

If you take all of this into consideration, you end up with most of the code written being subject to hyper-parameter selection. And again, there is no cookbook or recipe yet to tell you how to set them. The API tells you how to enter those values in the framework, but cannot tell you what the effect will be. And the effect will be different for each problem.
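To give a feel for the size of the problem, here is a small sketch in plain Python that enumerates a search space. The names and candidate values are hypothetical, chosen only for the illustration; they are not the values I used for my model.

```python
from itertools import product

# Hypothetical candidate values; real ranges depend on the data and the problem.
search_space = {
    "embedding_size": [16, 32, 64],
    "sequence_length": [20, 40, 80],
    "lstm_units": [64, 128, 256],
    "dropout_rate": [0.0, 0.2, 0.5],
    "learning_rate": [0.1, 0.01, 0.001],
    "batch_size": [32, 128],
    "epochs": [10, 50],
}

names = list(search_space)
# Every combination of one value per hyper-parameter is a model to train.
combinations = [dict(zip(names, values))
                for values in product(*search_space.values())]

print(len(combinations), "configurations to train and compare")
print("first one:", combinations[0])
```

Even with only two or three candidate values for each of seven hyper-parameters, and a fixed topology, this already gives 972 configurations, each of which requires a full training run to evaluate.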

It’s a little bit as if what you give as argument to a print statement, you know, like print(“hello world!”), would not be “hello world!”, but some values which would print something based on those values (the hyper-parameters) and on whatever has been printed in the past, and you would have to tweak them so that at training time you get the expected results!!! This would drive any good programmer insane. But currently there is no other way with deep neural networks.

Hello World with Hyper-Parameters.

So our third complexity is not only the selection of the neural network topology but also all the hyper-parameters that come with it.

It sometimes requires insight into the mathematics behind the neural net, as well as imagination and lots of trial and error, while being rigorous about what you try. This definitely does not follow the “normal” software development experience. As said in my previous post, you need a special crowd to perform machine learning or deep learning.

My next post will look at what is sometimes the biggest difficulty for machine learning: obtaining the right data. Before anything can really be done, you need data. This is not always a trivial task…


The Fallacious Simplicity of Deep Learning: the proliferation of frameworks.

This post is the second in a series of posts about the “Fallacious Simplicity of Deep Learning”. I have seen too many comments from non-practitioners who think Machine Learning (ML) and Deep Learning (DL) are easy, and that any computer programmer following a few hours of training should be able to tackle any problem because after all there are plenty of libraries nowadays… (or other such excuses). This series of posts is adapted from a presentation I will give at the Ericsson Business Area Digital Services technology day on December 5th. So, for my Ericsson fellows, if you happen to be in Kista that day, don’t hesitate to come see it!

In the last post, we’ve seen that the first complexity lies in the size of the machine learning and deep learning community. There are not enough skilled and knowledgeable people in the field. To illustrate the other complexities, I’ll show an example of deep learning using Keras. Don’t worry if you are not used to it, or even if you have not programmed in a while, or at all; I’ll keep it simple. Below is one of the software stacks you can use in order to perform deep learning. This one can be deployed on CPU, so your usual computer or computer server, but it can also be deployed on Graphics Processing Units, or GPUs. Basically, the video card in your computer.

The stack we will use for demonstration.

To use the video card to do the type of computation required for deep learning, one of the GPU manufacturers, Nvidia, has created a software layer to program those GPUs. CUDA is the Compute Unified Device Architecture, and allows someone to program the GPU to do any highly parallelizable task. On top of that layer, Nvidia has created yet another layer targeting the task of running deep neural networks. This is the cuDNN layer, or CUDA Deep Neural Network library. For my example, I’ll use on top of cuDNN the Google framework for graph computation, TensorFlow. Lastly, to simplify my task, since I won’t build new kinds of neurons or new kinds of layers, I’ll use Google’s Keras library, which simplifies the process of defining a deep neural network, deploying it, training it and testing it. For something simple, we already have 5 layers of libraries, and I have not even mentioned the language I’ll use and the libraries required for it (note that in the latest release of TensorFlow, Keras has been integrated). But no biggie, in software development we are used to having many layers of software piling up.

The software stack I’m using for this example is only one of the possible ones. Just for the Nvidia GPUs there are already more than a dozen frameworks that build on top of cuDNN. Moreover, Intel, AMD and Google are coming up with their own deep neural network hardware accelerators. Many other companies are doing the same, creating accelerated hardware for deep neural networks. All this new hardware will come with its own equivalent of CUDA and cuDNN, and frameworks will proliferate for a while.

Some of the cuDNN accelerated frameworks.

I’m not even going to talk about the next layer of frameworks (e.g. TensorFlow and Keras). Hopefully, they will adapt to the new hardware… otherwise, we’ll have even more frameworks. The same goes for the layer above: Keras, for example, builds on top of TensorFlow (or Theano or CNTK, but let’s not open that door now). Hence, we can see our next complexity.

Second complexity, the piling of frameworks (including specialized hardware) and the proliferation of frameworks. Which one to learn? Which one will become irrelevant?

The machine learning landscape, and especially the deep learning one, is evolving rapidly. To be efficient it requires new kinds of hardware that were not common in industrial servers even a few years ago. This means that the whole development stack, from hardware to released data product, is evolving quickly. Changing requirements are a known issue in software development; it is no different in data product development.

My next post will tackle through an example the next complexity: Hyper-parameter tuning, something you do not see in software development but which is necessary for the development of a data product.

The Fallacious Simplicity of Deep Learning: the lack of skills.

This post is the first in a series of posts about the “Fallacious Simplicity of Deep Learning”. I have seen too many comments from non-practitioners who think Machine Learning (ML) and Deep Learning (DL) are easy, and that any computer programmer following a few hours of training should be able to tackle any problem because after all there are plenty of libraries nowadays… (or other such excuses). This series of posts is adapted from a presentation I will give at the Ericsson Business Area Digital Services technology day on December 5th. So, for my Ericsson fellows, if you happen to be in Kista that day, don’t hesitate to come see it!

Artificial Intelligence (AI) is currently touching our lives. It is present on your phone, in your personal assistants, in your thermostats and on pretty much all the web sites you visit. It helps you choose what you will watch, what you will listen to, what you will read, what you will purchase and so on. AI is becoming a cornerstone of user experience.

Today, if you look at cutting-edge AI technology, you must conclude that you can no longer trust what you see and hear. Some neural network techniques allow researchers to impersonate anybody in video, saying whatever they want with the right intonations, the right visual cues, etc. Neural networks are creating art pieces, for example by applying the style of great master painters to any of your photographs.

Soon, it will become even more relevant to your everyday life. We can already see the days of autonomous cars looming. Eventually it will be all transportation, then the regulation of all technological aspects of our lives, and even further…

The reach of Artificial Intelligence technology is growing so fast and is so transformational that we sometimes have the impression that it must all be easy to apply. But is it so?

AI may look easy if you look at all the available resources: all the books and the plethora of online courses. You cannot visit a web page nowadays without a machine learning course being proposed to you! There are tons of videos available online, from those courses, but also from enthusiasts or university teachers. And if you start digging into the available software frameworks, you’ll find plenty of them.

So why would someone like Andrew Ng, one of the world’s best-known AI experts, come forth with the mission of training a million AI experts? Well, Machine Learning, Deep Learning and Artificial Intelligence knowledge is still sparse. Big companies have grabbed a lot of the talented people, leaving universities empty. A lot of universities still don’t have programs dedicated to that topic. Those that do mostly offer only a couple of introductory courses. Then on the online front, countless people will start to learn but will give up along the way for many reasons. Online course completion rates are much lower than university program completion rates.

Moreover, this is quite a difficult subject matter. There are many considerations which are quite different from what you would have from a software development background.

A first complexity: not enough skilled and knowledgeable people, and a smaller community than, say, web programming.

Stay tuned for my next post which will tackle the next complexity: the piling and proliferation of frameworks.