The Fallacious Simplicity of Deep Learning: wild data

This post is the fourth in a series of posts about the “Fallacious Simplicity of Deep Learning”. I have seen too many comments from non-practitioner who thinks Machine Learning (ML) and Deep Learning (DL) are easy. That any computer programmer following a few hours of training should be able to tackle any problem because after all there are plenty of libraries nowadays… (or other such excuses). This series of posts is adapted from a presentation I will give at the Ericsson Business Area Digital Services technology day on December 5th. So, for my Ericsson fellows, if you happen to be in Kista that day, don’t hesitate to come see it!

In the last posts, we’ve seen that the first complexity lay around the size of the machine learning and deep learning community. There are not enough skilled and knowledgeable peoples in the field. The second complexity lay in the fact that the technology is relatively new, thus the frameworks are quickly evolving and requires software stacks that range from all the way down to the specialized hardware we use. The third complexity was all about hyper-parameter setting, a skill specific to machine learning and that you will need to acquire.

Next challenge with machine learning, or should I say first challenge is the data! You need data to perform a machine learning task. For deep learning, you arguably need even more said data.

This is an angle that you will not see if you take courses (online or in schools) about data science, machine learning or deep learning. At least I’ve never seen it being properly presented as a difficulty. When you take a course, the data is most of the time provided. Or you get a clear indication as where to obtain it from. In most case this is well curated data. Well explained and documented fields and formats. If the data is not already available and the exercise wants you to concentrate on the data scraping aspect, then the other parameters of the exercise will be well defined. They will tell you where to scrape the data from and what is of importance for the exercise. It will make sure the scraping can be done in a structure fashion. If the exercise wants you to concentrate on data cleaning, or data augmentation, the dataset will have been prepared to properly show you how this can be done. Not to say that those courses or exercises are easy, but they do not show the real difficulty of wild data.

I think that for most companies, it all started with wild data. As data science grows in a company, peoples put structure and pipelines around data. But this comes with time and size. And it certainly does not prevent the entry of some wild data beast in the zoo from time to time. So, assuming you have access to data, it might really be wild data and you will have to tame it. Some says deep learning is less about data engineering and more about model architecture creation. It is certainly true in computer vision where the format of the data has been agreed on some time ago. But what about another domain, is the format so widely accepted? You might still have to do some feature engineering to input your data in your model. Adding the feature engineering problem to the hyper-parameter tuning problem…

On the other hand, you might well be in a situation where you do not have access to data. What if you are used to sell a specific type of equipment. That equipment might not be connected to a communication network. If it is, the customer might never have been asked if you could use its data. How do you put in place a contract or license that allows for you to collect data? Does the legislation in the regions where you sell that equipment allows for that collection or is there any restrictions you need to dance with? Is there personally identifiable information you will need to deal with? Will you need to anonymize the data? If you start a new business, based on a new business model, you might be able to build it in your product. But if you have a continuing business, how to you incorporate it in your product? Does it need to be gradual?

I guess you now have a good understanding of our fourth complexity. Where do you get your data from?

Data might be hard to come by. It might be wild and messy. It might not even relate to the question you want to answer… and that will be the subject of our fifth complication: do you have the right question?

Advertisements

One thought on “The Fallacious Simplicity of Deep Learning: wild data

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s