How to be a bad data scientist!

So, you want to be a data scientist, or better, you think you are now a data scientist and are ready for your first job… Well, make sure you are not one of the stereotypes of “wannabe data scientists” I list below, otherwise you may well go through numerous rejections in interviews. I do not claim this is a complete list of all the stereotypes out there. In fact, if you can think of other stereotypes, please share them in the comments! These are only a few stereotypes of people I have met or seen over time, and who sadly seem to repeat over and over again.

I want to be a data scientist [because of the money], where do I start?

This type of person has heard that there is good money to be made in data science and wants their share of it… Little does this type of person know that a lot of hard work is involved in learning the knowledge and skills required to perform the job. Little do they know, either, that data science is constant research work. Seldom is a clear path to the solution in front of you. This is even truer with deep learning, where new techniques and ideas pop up every day and where you will have to come up with new ideas. If you need to post the question “where do I start?” on social media, you don’t have what it takes to be one. Get a learn-it-all attitude, build an innovative spirit and then come back later.

I can do data science, please give me the “clean” data.

If you just came from (God forbid) a single data science course, or hopefully a few of them, and if you entered one or a few Kaggle-like competitions, you might be under the impression that data comes to you all cleaned up (or mostly ready) and that with a couple of statements or commands it will all be well and ready for machine learning. The thing is that those courses and competitions prepare the data for you, so that you can get to the core of the problem faster and learn the subject matter of machine learning. In real life, data comes wild. It comes untamed and you must prepare it yourself. You might even have to collect it yourself. A good part of most data scientists’ job is to play with the data, prepare it, clean it, etc. If you have not done this, figure out a problem of your own, solve it end-to-end and then come back later.

I don’t know any math or I’m bad at it, but people say I can do data science.

No, that is a fallacy. If you don’t have a mathematical mind, sooner or later you will end up in a situation where you just cannot progress anymore. The good thing is that you can learn mathematics. First, get out of the “this is too hard” syndrome. Data science is harder anyway, so you had better start with something as simple as mathematics. Learn some calculus, some statistics, learn to speak and think mathematics, and then come back later.

Just give me a “well” defined problem.

Some people just want their little box with well-defined interfaces: what comes in, what is expected to go out. Again, a syndrome of someone who has only done some well-canned courses in the field… In reality, not only is the data messy, but the problems you have to solve are messy, ill-defined, muddy… you have to figure them out. Sometimes you can define and refine a problem by yourself; sometimes you have to accept the messiness and play around with it. If you cannot be given vague and approximate objectives and refine them through thinking, research and discussions with the stakeholders until you come up with a solution, don’t expect to be a data scientist. A big misconception here is that if you have a PhD you are immune to this problem… well, not so fast, I have seen PhDs struggle with this as much as anyone else. So, grow a spine, accept the challenge and then come back later.

I’ve learned data science, I have a blog/portfolio/… I can do anything.

Not so fast. This kind of person learned data science and, being more marketing-oriented and knowing it can help to build a personal brand, built a portfolio or wrote blogs, articles, etc., but never went to the point of trying it in real life. That person thinks they know it all and can solve anything. That type of person is probably singlehandedly responsible for the over-hype of what data science and machine learning can achieve, and is more of a problem to the profession than any help. Do some real work, grow some honesty and then come back later.

If you want to be a data scientist, it all boils down to a simple recipe: learn hard and work hard. You must follow your own path and put passion into it. Seek to grow your knowledge along your interests, learn about them, try things. Continuously learn new things, and not only on related subjects. Do not limit yourself to courses; find real-world examples to practice on, and stay honest about what you can do, about what you know and do not know. Be a good human!


Cover image by tookapic at Pixabay.


How to potty train a Siamese Network

Time for an update on the One-Shot learning approach using a Siamese LSTM-based Deep Neural Network that we developed for telecommunication network fault identification through traffic analysis. A lot of small details had to change as we upgraded our machine to the latest TensorFlow and Keras. That alone introduced a few new behaviors… We also obtained new data for new examples and found some problems with our model. I don’t intend to go through all the changes, only some of the main ones, as well as some interesting findings. It feels a lot like potty training a cat… If you are new to this series, you can refer to my previous posts: “Do Telecom Networks Dreams of Siamese Memories?” and “What Siamese Dreams are made of…”

First, Batch Normalization in Keras is now on my black magic list 😊. I’ll have to dig more into how it is implemented, especially the differences between training time and prediction time. For a long time, I wondered why I was getting extremely good training losses and poor validation losses, until I removed the Batch Normalization I had on the input layer. So, something to investigate there.
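To illustrate the kind of train/inference discrepancy I mean, here is a minimal sketch using tf.keras (my assumption; the exact versions we ran may differ). In training mode the layer normalizes with the current mini-batch statistics, while in inference mode it uses the moving averages it has accumulated so far:

```python
import numpy as np
import tensorflow as tf

# Data deliberately far from zero mean / unit variance.
x = np.random.normal(loc=5.0, scale=2.0, size=(32, 10)).astype("float32")

bn = tf.keras.layers.BatchNormalization()

train_out = bn(x, training=True)   # normalizes with this mini-batch's statistics
infer_out = bn(x, training=False)  # uses the moving averages (still near their init here)

print(float(tf.reduce_mean(train_out)))  # close to 0
print(float(tf.reduce_mean(infer_out)))  # close to 5: the moving mean has barely moved
```

If the moving averages have not converged to the actual data statistics, what the network sees at prediction time differs from what it saw at training time, which is at least consistent with the good training loss / poor validation loss symptom above.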

Secondly, I introduced data generators for the training and validation data. For a Siamese network approach, where you must provide tons of similar and dissimilar pairs, generators are a must to master at some point! Once you get the gist of it, they are quite convenient. I found Shervine Amidi’s blog post “A detailed example of how to use data generators with Keras” to be a very well explained example to build upon. I would recommend it to anyone learning about Keras data generators. A rough sketch of such a generator follows below.
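As a rough idea of the shape such a generator can take, here is a sketch of a keras.utils.Sequence that assembles (anchor, positive, negative) triplets, i.e. one similar and one dissimilar pair per anchor, to feed the triplet loss discussed next. The `samples` dictionary, the batch sizes and the shapes are hypothetical placeholders, not our actual data:

```python
import numpy as np
from tensorflow import keras

class TripletGenerator(keras.utils.Sequence):
    """Yields batches of (anchor, positive, negative) examples.

    `samples` maps a traffic class label to a NumPy array of examples;
    names and shapes here are hypothetical, not our actual data.
    """

    def __init__(self, samples, batch_size=32, batches_per_epoch=100):
        self.samples = samples
        self.labels = list(samples.keys())
        self.batch_size = batch_size
        self.batches_per_epoch = batches_per_epoch

    def __len__(self):
        # Number of batches Keras will request per epoch.
        return self.batches_per_epoch

    def __getitem__(self, index):
        anchors, positives, negatives = [], [], []
        for _ in range(self.batch_size):
            # One similar pair (anchor/positive) plus one dissimilar example (negative).
            pos_label, neg_label = np.random.choice(self.labels, size=2, replace=False)
            pos_pool = self.samples[pos_label]
            neg_pool = self.samples[neg_label]
            anchors.append(pos_pool[np.random.randint(len(pos_pool))])
            positives.append(pos_pool[np.random.randint(len(pos_pool))])
            negatives.append(neg_pool[np.random.randint(len(neg_pool))])
        # The triplet loss only looks at y_pred, so the targets are dummies.
        return ([np.array(anchors), np.array(positives), np.array(negatives)],
                np.zeros(self.batch_size))
```

Depending on the Keras version, an instance of this class can be passed directly to model.fit, or to fit_generator in older releases, for both the training and validation data.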

Along the way I found that my triplet_loss function, as shown in the previous post, was flawed… Because of the way I pack the output of the base neural network with Keras concatenate, I must explicitly specify the ranges. Moreover, I painfully understood that a loss function in Keras is passed mini-batches of y_true/y_pred values, not individual values. Well, that was not clear to me at first sight… I also took the opportunity to rework the logic to use more of a Keras approach than a TensorFlow one (subtle changes). The reworked loss function looks like the sketch below.
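This is a minimal reconstruction rather than the exact code, assuming the base network outputs are concatenated as [anchor | positive | negative] along the feature axis; the embedding size and margin values are my placeholders:

```python
from tensorflow.keras import backend as K

def triplet_loss(y_true, y_pred, emb_size=10, margin=0.2):
    # y_pred holds a whole mini-batch; each row is the concatenation
    # [anchor | positive | negative] produced by Keras concatenate,
    # hence the explicit column ranges.
    anchor   = y_pred[:, 0:emb_size]
    positive = y_pred[:, emb_size:2 * emb_size]
    negative = y_pred[:, 2 * emb_size:3 * emb_size]

    # Squared Euclidean distances, computed row-wise over the batch.
    pos_dist = K.sum(K.square(anchor - positive), axis=1)
    neg_dist = K.sum(K.square(anchor - negative), axis=1)

    # Standard triplet hinge: push negatives at least `margin` farther
    # away than positives, averaged over the mini-batch.
    return K.mean(K.maximum(pos_dist - neg_dist + margin, 0.0))
```

Working with K.mean over the batch (instead of assuming a single example) is exactly the point that tripped me up: Keras hands the loss function whole mini-batches of y_true/y_pred at once.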

The fourth interesting thing to mention is that while I was debugging all those issues, I felt the need to visualize the results better than by simply looking at the prediction value. I reduced the output vector space from 10 dimensions to 3 dimensions, as I do not have that many different examples for now anyway, so 3D should be more than enough to separate them. Furthermore, I changed my output layer to use a sigmoid activation function to limit the output space to the [0,1] range. Those changes in turn enabled me to look at the location of the predicted point in the transformed space, e.g. a traffic pattern now corresponds to a 3D location in this output space.
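Concretely, that amounts to an output layer along the lines of Dense(3, activation='sigmoid'), after which every example lands somewhere in the unit cube and can be scattered in 3D. A minimal plotting sketch, where base_network, x_batch and labels are hypothetical placeholders for the trained embedding network and a batch of examples with their classes:

```python
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (registers "3d" on older matplotlib)

# Hypothetical placeholders: the trained base network and a labeled batch.
embeddings = base_network.predict(x_batch)  # shape (n, 3), values in [0, 1]

fig = plt.figure()
ax = fig.add_subplot(111, projection="3d")
ax.scatter(embeddings[:, 0], embeddings[:, 1], embeddings[:, 2], c=labels)
ax.set_xlim(0, 1); ax.set_ylim(0, 1); ax.set_zlim(0, 1)
ax.set_xlabel("dim 0"); ax.set_ylabel("dim 1"); ax.set_zlabel("dim 2")
plt.show()
```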

[Figure: SiameseSeparation — traffic patterns separating into distinct regions of the 3D output space]

Below is a video I made of how this projection evolves through training. Initially, as the neural net is initialized with random values, the output points are cluttered together at the center. But quickly we see them being separated, each taking a corner of the space. Sure, there is a lot of bouncing back and forth as the neural net tries to find a better solution, but we can see that there is a sweet spot where the different traffic patterns are well separated. As a side note, we see three different traffic patterns here: normal traffic in green and two different error cases, a dramatic one in red where all traffic is blocked, and a subtler one in orange where we reach the capacity limit of the communication link.

Now, while acquiring more data from our test bed, we are trying out different loss functions to separate the traffic. One of my colleagues has just posted a comparison between different loss functions: “Lossless Triplet Loss”. I might also try some different loss functions and show my findings.

I hope this shows that One-Shot learning using Siamese networks can be used for purposes other than face recognition. In this case we are successfully using it for signalling traffic categorization and fault detection.


Cover photo by Jan-Mallander at Pixabay.