Who are those Strangers?

This post is a follow-up to Who am I connected to? As stated in the previous post, a problem that arises a lot is figuring out how things are connected. Is this server directly or indirectly connected to that pool? Or who am I connected to through a chain of friends? If you ever have to implement such an algorithm (and for that you can refer to my previous post), one thing you might encounter is superstars, false friends or black holes. Name them the way you want 😉. Those are “nodes” which are connected to an abnormally high number of other nodes. Well, when someone has 50k friends, you should suspect that those are not all real friends! The problem with reporting fake friends is twofold.

First, if you go through the process described last time, you get a number of very high density groups which normally wouldn’t be grouped together were it not for those black hole nodes. This may well make any conclusion pointless, so you should take care of removing (or not considering) those superstar nodes to start with.

Second, assuming you start with big data, joining a number of those superstars on themselves will lead to an explosive growth of your data set (at least temporarily): a single node with 50k connections joined on itself produces 2.5 billion pairs. It will take forever to complete the associated Spark tasks (if they succeed at all). Ok, those might be legit friends; in that case you might not have a choice, and maybe Fighting the Skew in Spark can help you solve that issue. But otherwise, if those are indeed false friends, you should take the step of removing those black hole nodes beforehand.

In an ever changing world of data, it may not be easy to spot those black holes, but a good first filter may be as simple as (using PySpark notation this time, just to keep you on your toes):

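from pyspark.sql import functions as F
# Nodes with more connections than the threshold are suspected black holes.
filter_out = node_table.filter(F.col('count') > black_hole_threshold)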

The nodes captured by that filter-out “rule” can then be automatically removed from your node table, or examined and added to black lists if need be. To automatically remove the filter_out nodes from your node_table, the anti join is your friend!

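# Assumes both tables share a 'node' key column (the original join key is not shown).
output = node_table.join(filter_out, on='node', how='left_anti')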

You still need to run the connection-finding algorithm on this “output”, but at least you will have removed from your inputs all nodes whose number of connections exceeds black_hole_threshold.
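To make the two steps concrete without a Spark cluster, here is a minimal pure-Python sketch of the same idea: count connections per node, flag the nodes above a threshold, then drop every edge touching them. The edge-list layout, the sample data, the threshold value and the name `remove_black_holes` are all illustrative assumptions, not part of the original pipeline.

```python
from collections import Counter


def remove_black_holes(edges, black_hole_threshold):
    """Drop every edge touching a node with more than `black_hole_threshold` connections.

    `edges` is a list of (node, other_node) pairs (an assumed layout).
    Returns (kept_edges, black_holes).
    """
    # Count how many connections each node takes part in.
    counts = Counter()
    for a, b in edges:
        counts[a] += 1
        counts[b] += 1

    # Equivalent of the filter step: nodes above the threshold.
    black_holes = {node for node, count in counts.items() if count > black_hole_threshold}

    # Equivalent of the anti join: keep only edges with no black hole endpoint.
    kept = [(a, b) for a, b in edges if a not in black_holes and b not in black_holes]
    return kept, black_holes


# Hypothetical data: "star" is connected to everyone, the rest form small groups.
edges = [("star", n) for n in ("a", "b", "c", "d", "e")] + [("a", "b"), ("d", "e")]
kept, black_holes = remove_black_holes(edges, black_hole_threshold=3)
print(black_holes)  # {'star'}
print(kept)         # [('a', 'b'), ('d', 'e')]
```

Without the filter, "a", "b", "d" and "e" would all end up in one group through "star"; with it, the two genuine pairs stay separate.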

What else can go wrong? Again, if you have big data, this process as a whole (especially since it is iterative) can take some serious time to execute. Moreover, even with the black holes removed, the self-join part may consume a lot of resources from your cluster. The interesting part is that if you keep your “node” definition constant, you could run the algorithm in an online, additive fashion, which would run faster because most of the data wouldn’t change and would already be reduced to who is whose friend, so only the additional delta would in fact “move”. I know it is not that simple and quick, but it is still quicker than running the process on the initial input data again and again…
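One way to picture the additive idea is a union-find (disjoint set) structure. This is a different technique from the join-based process of the previous post, swapped in here only because it makes the "only the delta moves" behaviour easy to see. The class name `ConnectionGroups` and the sample edges are hypothetical.

```python
class ConnectionGroups:
    """Union-find structure that can absorb new edges incrementally."""

    def __init__(self):
        self.parent = {}

    def find(self, node):
        # Register unseen nodes as their own group.
        self.parent.setdefault(node, node)
        root = node
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every visited node straight at the root.
        while self.parent[node] != root:
            self.parent[node], node = root, self.parent[node]
        return root

    def add_edges(self, edges):
        # Merge the groups of the two endpoints of every new edge.
        for a, b in edges:
            root_a, root_b = self.find(a), self.find(b)
            if root_a != root_b:
                self.parent[root_b] = root_a

    def groups(self):
        result = {}
        for node in list(self.parent):
            result.setdefault(self.find(node), set()).add(node)
        return sorted(sorted(members) for members in result.values())


groups = ConnectionGroups()
groups.add_edges([(0, 1), (1, 7), (2, 3)])  # initial full run
print(groups.groups())  # [[0, 1, 7], [2, 3]]
groups.add_edges([(7, 2), (5, 6)])          # later on: only the delta
print(groups.groups())  # [[0, 1, 2, 3, 7], [5, 6]]
```

The second call touches only the nodes named in the new edges; the groups found earlier are reused as they are.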

Again, I hope this can be of help. If you apply this method or another equivalent one, let me know and let’s discuss our experiences!

Cover photo by Felix Mittermeier at Pixabay.

Who am I connected to?

A problem that arises a lot when you play with data is figuring out how things are connected. It could be, for example, to determine from all your friends, your friends’ connections, and your friends’ friends’ connections, … to whom you are directly or indirectly connected, or how many degrees of separation you have with such and such connection. Luckily there are some tools at your disposal to perform such analysis. Those tools come under the umbrella of Network Theory and I will cover some basic tricks in this post.

First let’s go with some terminology. Nodes (also called vertices) are the things we are connecting, e.g. you, your friends, your friends’ friends. Edges are how those nodes are connected. For example, below, node 0 is connecting to nodes 1 and 2 using two edges to describe those connections. Node 1 is connecting to node 3 via one edge, etc. For this first example, we use uni-directional (directed) edges, but nothing prevents us from using bi-directional edges. In general, if all edges are bi-directional we talk of an undirected graph, which would be the case of friends (usually) since you know your friend, and he knows you as well!

A first important concept to introduce in network analysis is that of an Adjacency matrix. The adjacency matrix is a matrix representing the edges between the nodes. The first row in the Adjacency matrix represents the connections of node 0. Node 0 is connecting to nodes 1 and 2, but not to itself or to node 3, so the first row is 0, 1, 1, 0. The second row represents the connections of node 1, which is only connecting to node 3, so the second row is 0, 0, 0, 1. Note that we could have bi-directional connections; in such a case the connection would appear in both the row and the column, but this is not the case in this example.

By inspecting the Adjacency matrix, we can reconstruct the node/edge graph. It informs us about the first hop connections: who are your friends. But how can we know about the second hop connections, e.g. node 0 is connected to node 3 via node 1 and node 2? A really simple way is to multiply the adjacency matrix by itself (A*A). The result of this multiplication is the second hop connections. Here, we see that node 0 is connecting through 2 hops to node 1 (via node 2), and is connecting through 2 hops to node 3. We can even see that there are 2 such connections in two hops from node 0 to node 3. Lastly we see that node 2 is connecting through 2 hops to node 3 (via node 1).

If we were to multiply again A*A by A itself, we would get the three hop connections, which in this case is limited to node 0 being connected to node 3.
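The multiplications described above can be checked with a few lines of plain Python. This is only a sketch: the first two rows of A are the ones given in the text, while rows 2 and 3 are inferred from the hop descriptions (an assumption, since the original figure is not available here). The helper name `matmul` is also just illustrative.

```python
def matmul(x, y):
    """Multiply two square matrices stored as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


A = [
    [0, 1, 1, 0],  # node 0 -> nodes 1 and 2 (from the text)
    [0, 0, 0, 1],  # node 1 -> node 3 (from the text)
    [0, 1, 0, 1],  # node 2 -> nodes 1 and 3 (inferred from the 2-hop description)
    [0, 0, 0, 0],  # node 3 has no outgoing connection (inferred)
]

A2 = matmul(A, A)   # second hop connections
A3 = matmul(A2, A)  # third hop connections

print(A2)  # [[0, 1, 0, 2], [0, 0, 0, 0], [0, 0, 0, 1], [0, 0, 0, 0]]
print(A3)  # [[0, 0, 0, 1], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
```

The 2 in the first row of A2 is the pair of two-hop routes from node 0 to node 3, and the single 1 left in A3 is the only three-hop route (0 to 2 to 1 to 3).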

In general, the networks that will interest us are way bigger than this simple 4-node diagram. Also in general, not every node is connected to every other node. Well, they say everyone is connected to everyone by six degrees of separation (six hops), but for most other practical applications, not all nodes are connected to each other. Let’s take a look at a bigger example to see how the principles illustrated above can apply at scale. Let’s assume the following undirected network graph. Since the graph is undirected, you will see the connection values appear in both the rows and the columns: the matrix is symmetric about the diagonal.

As before, if we compute A*A, we get the second hop connections. Notice that nodes become connected to themselves via a second hop. For example, node 1 is connected 3 times to itself through a second hop, via nodes 0, 7 and 8.

If you are interested in all the first hop and second hop connections, you could add A*A and A together, leading to the following matrix. You could proceed forward to find the third hops onward, but in this example nothing else is connected, so although the numbers you see here would grow, the pattern of zeros would not change. We have found all connections of this graph. We found that node 0 is connected to nodes 1, 7 and 8. Nodes 2, 3 and 4 are connected. Nodes 5, 6 and 9 are connected. Finally we see that node 10 is not connected to any other node.
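The same "keep hopping until the pattern stops changing" idea can be sketched with plain Python sets instead of a dense matrix. Only the edges 0-1, 1-7 and 1-8 are taken from the text; the other edges are hypothetical choices that reproduce the groups listed above, and `connected_groups` is an illustrative name.

```python
def connected_groups(neighbours):
    """Group nodes by expanding one hop at a time until nothing new is reached."""
    groups = []
    seen = set()
    for start in sorted(neighbours):
        if start in seen:
            continue
        reached = {start}
        frontier = {start}
        while frontier:
            # One more hop: neighbours of the frontier that were not reached yet.
            frontier = {n for node in frontier for n in neighbours[node]} - reached
            reached |= frontier
        seen |= reached
        groups.append(sorted(reached))
    return groups


# Undirected edges: 0-1, 1-7 and 1-8 come from the text, the rest are assumed.
edges = [(0, 1), (1, 7), (1, 8), (2, 3), (3, 4), (5, 6), (6, 9)]
neighbours = {node: set() for node in range(11)}
for a, b in edges:
    neighbours[a].add(b)
    neighbours[b].add(a)

print(connected_groups(neighbours))
# [[0, 1, 7, 8], [2, 3, 4], [5, 6, 9], [10]]
```

Storing only the existing connections, as this dictionary of sets does, is also the intuition behind using a sparse representation when most entries of the matrix are 0.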

In practice the matrix multiplication works well to find the next hop neighbours. If it also happens that for your problem (as in the one above) most connections are non-existent, i.e. 0, then you could use sparse matrices to store (and potentially compute with) your Adjacency matrix. However, those quickly become really huge matrices which require a lot of operations to compute. A nice trick if you are using SQL or Spark could be to use joins on tables.

To do so, you need to turn the problem on its head. Instead of creating an Adjacency matrix of how the nodes are connected, you will create a table of the connections. So to keep with our second example, you could have something like the following network graph being turned into a node/connection table.

Node Connection
0 A
1 A
1 B
7 B
1 C
8 C

Now that we have that node/connection table, our goal will be to reduce the number of connections to the minimum possible and in the end get something like the following as a way to see everything connected (we won’t care about how many hops lead us there).

To get there we will iterate through a two step process. First we will perform connection reduction and then update the node/connection table. Then we rinse and repeat until we can no longer reduce the number of connections.

Assuming the above node/connection table (node_connections), we can reduce the number of connections via the following SQL query and store the result as the new_connections table:

SELECT A.connection, MIN(B.connection) AS new_connection
FROM node_connections AS A
JOIN node_connections AS B
ON A.node = B.node
GROUP BY A.connection

Then you can update the node_connections table with the following SQL query:

SELECT DISTINCT B.new_connection AS connection, A.node
FROM node_connections AS A
JOIN new_connections AS B
ON A.connection = B.connection

You iterate those two steps until the node_connections table no longer changes, et voilà, you have a map of all nodes connected through distinct connections.
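To see the two queries iterate to a fixed point, here is a minimal sketch that runs them against an in-memory SQLite database from Python. SQLite and the function name `reduce_connections` are assumptions made for the sake of a self-contained example; the original application ran on PySpark.

```python
import sqlite3


def reduce_connections(rows):
    """Iterate the two queries until the (node, connection) pairs stop changing."""
    rows = list(rows)
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE node_connections (node INTEGER, connection TEXT)")
    db.executemany("INSERT INTO node_connections VALUES (?, ?)", rows)
    current = set(rows)
    while True:
        # Step 1: map every connection to the smallest connection sharing a node with it.
        db.execute("DROP TABLE IF EXISTS new_connections")
        db.execute("""
            CREATE TABLE new_connections AS
            SELECT A.connection AS connection, MIN(B.connection) AS new_connection
            FROM node_connections AS A
            JOIN node_connections AS B ON A.node = B.node
            GROUP BY A.connection""")
        # Step 2: relabel the nodes with the reduced connections.
        updated = set(db.execute("""
            SELECT DISTINCT A.node, B.new_connection
            FROM node_connections AS A
            JOIN new_connections AS B ON A.connection = B.connection"""))
        if updated == current:
            db.close()
            return sorted(updated)
        db.execute("DELETE FROM node_connections")
        db.executemany("INSERT INTO node_connections VALUES (?, ?)", list(updated))
        current = updated


# The node/connection table from the example above.
table = [(0, "A"), (1, "A"), (1, "B"), (7, "B"), (1, "C"), (8, "C")]
print(reduce_connections(table))
# [(0, 'A'), (1, 'A'), (7, 'A'), (8, 'A')]
```

On this small table one pass is enough to collapse B and C into A; a second pass only confirms that nothing changes anymore. Longer chains of connections need more passes.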

This is only one of the possible use cases, but for large scale applications it is probably easier and quicker to join tables than to create and multiply Adjacency matrices. I showed the logic with SQL, but obviously you could achieve similar results using Spark (for my specific application, I use PySpark).

If you have questions or interesting ideas of application of the network theory to some problem, feel free to jump in the conversation!

Cover photo by Michael Gaida at Pixabay.

Automation and Sampling

As I mentioned earlier, I have transitioned from Ericsson to Shopify. As part of this transition I am starting to get a taste of public transport (previously work was a 15-minute drive from home; now, working in the city center, I have to take the train to commute). This morning there was a woman just beside me who was working on her computer, apparently editing a document or more probably writing comments in it. A few minutes into her edits, she made a phone call, asking someone over the phone to change some wording in the document and painfully dictating those changes (a few words). This game went on a few times: writing comments in the document, then calling someone to make the appropriate edits. What could have been a simple edit followed by sending the edited document via email became an apparently painful exercise in dictation. The point is not to figure out why she was not sending the document via email; it could be as simple as not having a data plan and not being willing to wait for a wifi connection, who knows. But the use of a non-automated “process” turned something which is ordinarily quite simple (editing a couple of sentences in a document) into a painful dictation experience. This also has the consequence of limiting bandwidth, and thus only a few comments can make their way into corrections on that document.

This reminded me of a conversation I had with a friend some time ago. He mentioned the pride he took in having put in place a data pipeline at his organization for the extraction, transformation and storage of two data sources in a local database. Some data is generated by a system in his company. Close to that data source he has a server which collects and reduces / transforms the data and stores the results in the local file system as text files. Every day he looks at the extraction process on that server to make sure it is still running, and every few days he downloads the new text files from that server to a server farm and a database he usually uses to perform his analysis on the data. As you can see, this is also a painful, non-automated process. As a consequence, the amount of data is most probably more limited than it could be with an automated process, as my friend has to cater to the needs of those pipelines manually.

At Shopify I have the pleasure of having access to an automated ETL (Extract, Transform and Load) process for the data I may want to analyze. If you want to get a feel for what is available at Shopify with respect to ETL, I invite you to watch the Data Science at Shopify video presentation from Françoise Provencher, who touches a bit on that as well as on the other aspects of the job of a data scientist at Shopify. In short, we use PySpark with custom libraries developed by our data engineers to extract, transform and load data from our sources into a front room database which anyone in the company can use to get information and insight about our business. If you listen through Françoise’s video, you will understand that one of the benefits of that automated ETL scheme is that we transform the raw data (mostly unusable) into information that we store in the front room database. This information is then available to be further processed to extract valuable insight for the company. You immediately see the benefit. Once such a pipeline is established, it performs its work autonomously and, as an added benefit, thanks to our data engineering team, monitors itself all the time. Obviously if something goes wrong somebody will have to act and correct the situation, but otherwise you can forget about that pipeline, and its always updated data is available to all. No need for a single person to spend a sizable amount of time monitoring and manually importing data. A corollary to this is that the bandwidth for new information is quite high and we get a lot of information on which we can do our analysis.

Having that much information at our fingertips brings new challenges mostly not encountered by those who have manual pipelines. It becomes increasingly difficult and inefficient to do analysis on the whole population. You need to start thinking in terms of samples. There are a couple of considerations to keep in mind when you do sampling: the sample size and what you are going to sample.

There are mathematical and analytical ways to determine your sample size, but a quick way to get it right is to start with a modest random sample, perform your analysis, look at your results and keep them. Then you redo the cycle a few times and watch whether you keep getting the same results. If your results vary wildly, you probably do not have a big enough sample. Otherwise you are good. If it is important for future repeatability to be as efficient as possible, you can try to reduce your sample size until your results start to vary (at which point you should revert to the previous sample size), but if not, good enough is good enough! Just remember that those samples must be random! If you redo your analysis using the same sample over and over again, you haven’t proven anything. In SQL terms, it is the difference between:


Which would produce a random sample of 10% of table, on the other hand:

LIMIT 1000

Will most likely always produce the same first 1000 elements… this is not a random sample!
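The contrast can be tried out on a throwaway in-memory SQLite table. This is a sketch under assumptions: the `orders` table and its size are made up, and `RANDOM()` is SQLite's own function (other engines spell random sampling differently, for example with `RAND()` or a `TABLESAMPLE` clause).

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY)")
db.executemany("INSERT INTO orders (id) VALUES (?)", [(i,) for i in range(10_000)])


def first_rows(n):
    # No randomness involved: in practice the engine hands back the same rows every time.
    return [row[0] for row in db.execute("SELECT id FROM orders LIMIT ?", (n,))]


def random_sample_10_percent():
    # RANDOM() is evaluated per row, so each row is kept with a probability of about 10%.
    return [row[0] for row in db.execute("SELECT id FROM orders WHERE RANDOM() % 10 = 0")]


print(first_rows(1000) == first_rows(1000))  # the "sample" never changes
print(len(random_sample_10_percent()))       # close to 1000, different rows on each call
```

Note that the random version returns roughly, not exactly, 10% of the rows, which is usually fine for the kind of repeated trials described above.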

The other consideration you should keep in mind is what you should sample: what is the population you want to observe? Let’s say, to use an example in the lingo of Shopify, I have a table of all my merchants’ customers which, amongst other things, contains a foreign key to an orders table. If I want to get a picture of how many orders a customer places, the population under observation is the customers, not the orders. In other words, in that case I should randomly sample my customers, then look up how many orders they have. I should not sample the orders, aggregate those per customer and hope this will produce the expected results.

Visually, we can see that sampling from orders will lead us to wrongly think each customer places on average two orders. Random resampling will lead to the same erroneous results.


Figure: Sampling orders leads to the wrong conclusion that each customer places an average of 2 orders.

Whereas sampling from customers will lead to the correct answer that each customer places on average four orders.


Figure: Sampling customers leads to the right conclusion that each customer places an average of 4 orders.
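The effect is easy to reproduce on made-up data. In this sketch every one of 100 hypothetical customers has exactly 4 orders; the table layout, the sample sizes and the function names are assumptions chosen to mirror the figures, not real Shopify data.

```python
import random
from collections import Counter

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical data: 100 customers with exactly 4 orders each.
orders = [(customer_id, order_no) for customer_id in range(100) for order_no in range(4)]
orders_per_customer = Counter(customer_id for customer_id, _ in orders)


def estimate_from_customers(sample_size):
    """Sample customers, then look up ALL of their orders."""
    sampled = rng.sample(sorted(orders_per_customer), sample_size)
    return sum(orders_per_customer[c] for c in sampled) / sample_size


def estimate_from_orders(sample_size):
    """Sample orders, then aggregate per customer (the flawed approach)."""
    sampled = rng.sample(orders, sample_size)
    seen = Counter(customer_id for customer_id, _ in sampled)
    return sum(seen.values()) / len(seen)


print(estimate_from_customers(50))  # 4.0
print(estimate_from_orders(200))    # noticeably below 4
```

Sampling half of the orders only catches some of the orders of each customer, so the per-customer count is underestimated no matter how many times the sample is redrawn.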

To summarize, let’s just say that if you have manual (or even semi-manual) ETL pipelines, you need to automate them to give you consistency and throughput. Once this is done, you will eventually discover the joys (and need) of sampling. When sampling, you must make sure you select the proper population to sample from and that your sample is randomly selected. Finally, you could always analytically find the proper sample size, but with a few trials you will most probably be just fine if your findings stay consistent through a number of random samples.
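The trial-based approach to sample size can be sketched as a small helper that redraws random samples and checks how much a result moves. The function name `results_are_stable`, the 5% tolerance and the number of trials are arbitrary illustrative choices, not a statistical rule.

```python
import random
import statistics


def results_are_stable(population, sample_size, metric, trials=5, tolerance=0.05, seed=None):
    """Draw `trials` independent random samples and check that `metric` barely moves.

    Returns (stable, results). `tolerance` is the accepted relative spread between
    the smallest and the largest result.
    """
    rng = random.Random(seed)
    results = [metric(rng.sample(population, sample_size)) for _ in range(trials)]
    spread = max(results) - min(results)
    scale = abs(statistics.mean(results)) or 1.0  # avoid dividing by zero
    return spread / scale <= tolerance, results


population = list(range(100))
# Degenerate case: "sampling" the whole population always gives the same mean.
stable, results = results_are_stable(population, sample_size=100, metric=statistics.mean)
print(stable, results)  # True [49.5, 49.5, 49.5, 49.5, 49.5]
```

In real use you would start with a modest `sample_size`, and grow it while the helper reports unstable results (or shrink it until it does, then step back).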

Cover photo by Stefan Schweihofer at Pixabay.

Where the F**k do I execute my model?

or: Toward a Machine Learning Deployment Environment.

Nowadays, big names in machine learning have their own data science analysis environments and in-production machine learning execution environments. The others have a mishmash of custom-made parts or are lucky enough that an existing commercially available machine learning environment fits their needs. There are several data science environments commercially available; Gartner mentions the best-known players (although new ones pop up every week) in its Magic Quadrant for Data Science and Machine-Learning Platforms. However, most (if not all) of those platforms suffer from a limitation which might prevent some industries from adopting them. Most of those platforms start with the premise that they will execute everything on a single cloud (whether public or private). Let’s see why this might not be the case for every use case.

Some machine learning models might need to be executed remotely. Let’s think for example of the autonomous vehicle industry. Latency and security prevent execution in a cloud (unless that cloud is onboard the vehicle). Some industrial use cases might require models to be executed in an edge-computing or fog-computing fashion to satisfy latency requirements. Data sensitivity in some industries may require the execution of some algorithms on customer equipment. There are many more reasons why you may want to execute your model in some location other than the cloud where you made the data science analysis.

As said before, most commercially available offerings do not cater to that requirement. And it is not a trivial thing that one may slap on top of an existing solution as a simple feature. There are in some cases profound implications to allowing such a distributed and heterogeneous analysis and deployment environment. Let’s just look at some of the considerations.

First, one must recognize there is a distinction between the machine learning model and the complete use case to be covered, or as some would like to call it, the AI. A machine learning model is simply provided a set of data and gives back an “answer”. It could be a classification task, a regression or prediction task, etc., but this is where a machine learning model stops. To get value from that model, one must wrap it in a complete use case; some call that an AI. How do you reliably acquire the data it requires? How do you present or act on the answer given by the model? Those, and many more questions, need to be answered by a machine learning deployment environment.

Recognizing this, one of the first things required to deploy a full use case is access to data. In most industries, the sources of data are limited (databases, web queries, csv files, log files, …) and the way to handle them is repetitive, i.e. once I have figured out a way to do database queries, the next time most of my code will look the same, except for the query itself. As such, data access should be facilitated by a machine learning deployment environment, which should provide “data connectors” that can be configured for the need at hand and deployed where the data is available.

Once you have access to data, you will need “rules” as to when the machine learning model needs to be executed: is it once a day, on request, … Again, there are many possibilities (although when you start thinking about it, a lot are the same), but expressing those “rules” should be facilitated by the deployment environment so that you don’t have to rewrite a new “data dispatcher” for every use case, but simply configure a generic one.

Now we have data and we are ready to call a model, right? Not so fast. Although some think of data preparation as part of the model, I would like to consider it as an intermediary step. Why, you may ask? Simply because data preparation is a deterministic step where there should be no learning involved, and because in many cases you will significantly reduce the size of the data in that step, data that you might want to store to monitor the model behavior. But I’ll come to this later. For now, just consider there might be a need for “data reduction”, and this one cannot be generic. You can think of it as a pre-model which formats the data in a way your model is ready to use. The deployment environment should facilitate the packaging of such a component and provide a way to easily deploy it (again, anywhere it needs to be).

We are now ready for the machine learning execution! You already produced a model from your data science activities and this model needs to be called. As with the “data reduction”, the deployment environment should facilitate the packaging and the deployment of the “model execution”.

For those who have been through the loops of creating models, you certainly have the question: but how have you trained that model? So yes, we might need a “model training” component, which is also dependent on the model itself. A deployment environment should also facilitate the use/deployment of a training component. However, this raises another important question. Where does the data used for training come from? And what if the model drifts, is no longer accurate and needs re-training? You will need data… So, another required component is a “data sampling” component. I say data sampling because you may not need all the data; maybe some sample of it is sufficient. This can be something provided by the model execution environment and configured per use case. You remember the discussion about data reduction earlier? Well, it might be wise to store only samples coming from reduced data… You may also want to store the associated prediction made by the model.

At any rate, you will need a “sample database” which will need to be configured with proper retention policies on a use case basis (unless you want to keep that data for eternity).

As we said, models can drift, so data ops teams will have to monitor that model/use case. To facilitate that, a “model monitoring” component should be available which will take cues from the execution environment itself, but also from the sample database, which means that you will need a way to configure what are the values to be watched.

Those cover the most basic components required, but more may be needed. If you are to deploy this environment in a distributed and heterogeneous fashion, you will need some “information transfer” mechanism or component to exchange information in a secure and easy fashion between different domains.

Machine Learning Execution Environment Overview.

You will also need a model orchestrator which will take care of scaling in or out all those parts on a need basis. And what about the model life-cycle management, canary deployment or A/B testing… you see, there is even more to consider there.
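To make the chain of components more tangible, here is a toy sketch of how a data connector, a data reduction step, a model and a sample database could be wired together for one use case. Every name in it (`UseCase`, `run_once`, the fields and the sample records) is hypothetical; it illustrates the separation of concerns and is in no way the design of any existing platform.

```python
import random
from dataclasses import dataclass, field
from typing import Any, Callable, Iterable, List, Tuple


@dataclass
class UseCase:
    """Wires together the components discussed above for a single use case."""
    data_connector: Callable[[], Iterable[Any]]  # fetches raw records from a source
    data_reduction: Callable[[Any], Any]         # deterministic pre-model step
    model: Callable[[Any], Any]                  # trained model, called as a function
    sample_rate: float = 0.1                     # share of executions kept for monitoring
    sample_database: List[Tuple[Any, Any]] = field(default_factory=list)
    rng: random.Random = field(default_factory=random.Random)

    def run_once(self) -> List[Any]:
        """One execution, as a generic data dispatcher rule would trigger it."""
        answers = []
        for raw in self.data_connector():
            reduced = self.data_reduction(raw)
            answer = self.model(reduced)
            # Data sampling: keep the reduced input together with the prediction.
            if self.rng.random() < self.sample_rate:
                self.sample_database.append((reduced, answer))
            answers.append(answer)
        return answers


use_case = UseCase(
    data_connector=lambda: [{"a": 1, "b": 2, "noise": "x"}, {"a": 3, "b": 4, "noise": "y"}],
    data_reduction=lambda record: (record["a"], record["b"]),
    model=lambda features: sum(features),
    sample_rate=1.0,  # keep everything so the example is deterministic
)
print(use_case.run_once())       # [3, 7]
print(use_case.sample_database)  # [((1, 2), 3), ((3, 4), 7)]
```

A monitoring component would then read from `sample_database`, and a retention policy would decide when its entries get dropped. Both are left out of this sketch.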

One thing to notice is that even at this stage, you only have the model “answer” … you still need to use it in a way which is useful for your use case. Maybe it is a dashboard, maybe it is used to actuate some process… the story simply does not end here.

For my friends at Ericsson, you can find way more information in the memorandum and architecture document I wrote on the subject: “Toward a Machine Learning Deployment Environment”. For the rest of you folks, if you are in the process of establishing such an environment, I hope those few thoughts can help you out.

Cover photo by Frans Van Heerden at Pexels.

Goodbye and Thank You!

The goodbye is not intended for you my blogging crowd! Rather to some other dear friends I will leave behind.

If there is a constant in life, it is change. In a few weeks it will be time to change my place of work. With such a change I need to say goodbye to a lot of nice people and friends I have worked with over the last 21 years. I want to thank you all for the fantastic environment you surrounded me with. I want to thank you all for the great challenges you gave me to undertake and solve. I want to thank you all for the help you provided through all those years for small and big questions, the mentoring and the learning, and above all the animated and insightful discussions we had. All of this was enabled by a fantastic workplace. I will never forget Ericsson.

Above all I want to thank my current manager at Ericsson, Steven Rochefort, with whom I have been closely collaborating for 11 of those years. He is a fantastic guy and I will miss him dearly.

Now it is time to say hello to a new workplace. I have already met some wonderful and bright people at Shopify and I’m looking forward to the new challenges opening in front of me! #LifeAtShopify

This kind of message is traditionally expressed via email to a selected crowd on the last day of work. I’m not traditional 🙂 and I’m quite transparent, so you know it all now.

I’ll continue blogging, do not worry. Feel free to continue to contact me on any of my channels, here or elsewhere.

Cover photo by Claudia Beer at Pixabay (bonus points for those who catch the reference).

AI market place is not what you are looking for (in the telecommunication industry).

In a far away land was the kingdom of Kadana. Kadana was a vast country with few inhabitants. The fact that in the warmest days of summer, temperature was seldom above -273°C was probably a reason for it. The land was cold, but people were warm.

In Kadana there were 3 major telecom operators: B311, Steven’s and Telkad. There were also 3 regional ones: Northlink, Southlink and Audiotron. Many neighboring kingdoms also had telecom operators, some a lot bigger than the ones in Kadana. Dollartel, Southtel and Purpletel were all big players, and many more competed in that environment.

It was a time of excitement. A new technology called AI was becoming popular in other fields and the telecommunications operators wanted to get the benefits as well. Before going further in our story, it can be of interest to understand a little bit of what this AI technology is all about. Without going into too much detail, let’s just say that traditionally, if you wanted a computer to do something for you, you had to feed it a program handcrafted with passion by a software developer. The AI promise was that from now on, you could feed a computer a ton of data about what you want done and it would figure out the specific conditions and provide the proper output without (much) programming. For those aware of AI this looks like an overly simplistic (if not outright false) summary of the technology, but let’s keep it that way for now…

Going back to the telecommunication world, somebody with nice ideas decided to create Akut05. Akut05 was a new product combining the idea of a marketplace with the technology of AI. Cool! The benefit of a market place as demonstrated by the Apple App Store or Google Play, combined with the power of AI.

This is so interesting, I too want to get into that party, and I immediately create my company, TheLoneNut.ai. So now I need to create a nice AI model that I could sell on the Akut05 marketplace platform.

Well, let’s not be so fast… You see, AI models are built from data, as I said before. What data will I use? That’s just a small hurdle for the TheLoneNut.ai company… we go out and talk with operators. Nobody knows TheLoneNut.ai, it’s a new company, so let’s start with local operators. B311, Steven’s and Telkad all think we are too small a player to give us access to their data. After all, their data is a treasure trove they should benefit from, so why would they give us access to it? We then go to smaller regional players, and Northlink shows some interest. They are small and cannot invest massively in a data science team to build nice models, so with a proper NDA they agree to give us access to their data; in return, they will have access to our model on Akut05 with a substantial rebate.

Good! We need to start somewhere. I’ll skip all the adventures along the way of getting the data, preparing it and building a model… but let me tell you it was full of adventures. We deploy a nice model in an Akut05 store and it works wonderfully… for a while. After some time, the subscribers from Northlink change their behavior a bit, and Northlink sees that our model does not respond properly anymore. How do they figure that out? I have no idea, since Akut05 does not provide any real model monitoring capabilities besides the regular “cloud” monitoring metrics. More alarming, we see 1-star reviews pouring in from B311, Steven’s and Telkad, who tried our model and got poor results from the get-go. And there is nothing we can do about it, because after all we never got deals with those big names to access their data. A few weeks later, having discounted the model to Northlink and getting only bad press from all other operators, TheLoneNut.ai goes bankrupt and we never hear from it again. The same happens to a lot of other small model developers who tried their hand at it, and in no time the Akut05 store is empty of any valuable model.

So contrary to an App Store, a Model Store is generally a bad idea. To get a model right (assuming you can) you need data. This data needs to come from representative examples of what you want the model to apply to. But it’s easy, we just need all the operators to agree to share their data! Well, if you don’t see the irony, then good luck. But this is a nice story, so let’s put the irony aside. All the operators in our story decide to make their data available to any model developer on the Akut05 platform. What else could go wrong?

Let us think about a model that uses the monthly payment a subscriber pays to the operator. In Kadana this amount is provided in the data pool in $KAD, and it works fine for all Kadanian operators. Dollartel tries it out and (not) surprisingly it fails miserably. You see, in the market of Dollartel, the money in use is not the $KAD, but some other currency… The model builder, even if he has data from Dollartel, may have to make “local” adjustments. Can a model still provide good money to the model builder if the market is small and fractured, i.e. needs special care to be taken? Otherwise you’ll get 1-star reviews and again disappear after a short while.

Ok, so Akut05 is not a good idea for independent model builders. Maybe it can still be used by Purpletel, which is a big telecom operator that can hire a great number of data scientists. But in that case, if it’s their data scientists who will do the job, why would they share their data? If they don’t share their data and hire their own data scientists, why would they need a market place in the first place?

Independent model builders can’t find their worth in a model market place, operators can’t either… can the telecom manufacturer make money there? Well, why would it be more valuable for them than for an independent model builder? Maybe they could get easier access to data, but the constraints are basically the same and it wouldn’t be a winning market either, I bet.

Well, therefore a market place for AI is not what you are looking for… In a next post I’ll try to say a little bit about what you should be looking for in the telecom sector when it comes to AI.

For sure this story is an oversimplification of the issue, still, I think we can get the point. You have a different view? Please feel free to share it in the comments below so we can all learn from a nice discussion!

Cover photo by Ed Gregory at Pexels.

How to become a good data scientist

After being so vocal about how to be a bad data scientist, I thought I should level the playing field by giving some hints on how to become a good data scientist. The other side of the coin.

My strong feeling is that if you just start in the field for employment or salary reasons, you start on the wrong foot. You should first look at your passions. Here it is interesting to take a few seconds to look up the word passion as defined on Dictionary.com:


[pashuh n]


  1. any powerful or compelling emotion or feeling, as love or hate.
  2. strong amorous feeling or desire; love; ardor.
  3. strong sexual desire; lust.
  4. an instance or experience of strong love or sexual desire.
  5. a person toward whom one feels strong love or sexual desire.
  6. strong or extravagant fondness, enthusiasm, or desire for anything: a passion for music.
  7. the object of such a fondness or desire: Accuracy became a passion with him.

Hopefully the scope of your passion for data science does not involve definitions 2, 3, 4 or 5, but is driven by a strong fondness and enthusiasm for data science! If so, you are on the right track and my first advice would be: do not try to swallow the ocean in one sip. Zoom in on one aspect of that passion, the one that piqued your interest first. See how you could apply it to a real-world problem and learn along the way. For example, in my case, I got passionate about artificial life a long time ago. That evolved into a fondness for a form of reinforcement learning, genetic algorithms and genetic programming, around 2012. As time passed, my interest in machine learning and deep learning grew, and I learned about them by reading books, taking online courses and taking a graduate course while studying for my master’s degree. At that time, I hoped to apply it to the project I had for my master’s thesis, but sometimes plans change. So, in short, you need to follow your heart here.

If you go with such an approach, you will avoid many of the pitfalls I mentioned in the first post. You won’t come to expect a “clean” data set as your input, since you’ll have applied what you learned to a few real cases along the way. You will learn how to gather data, how to clean it, how to interpret it… and it will benefit you in two ways. First, you will learn one of the essential skills, data cleaning. But most importantly, it will grow your inquisitive mind, something that I have never seen a single course be able to do. Again, I do not think this is a skill you can get in a few weeks; it requires a mind shift that you will acquire through repeated practice.

Another benefit of following your passion is that if you don’t already have the necessary mathematical background, you will pick it up along the way. If you find maths hard, it is probably easier to pick it up on an as-needed basis as you expand your knowledge through your own passionate experiments! I will also reiterate that regardless of what you might think or have been told, mathematics is not so hard. Moreover, it is way easier to grasp if you start with a positive attitude, telling yourself that you can do it.

The next benefit of such an approach is that you will have to define and refine your problem. You will decide what is important to you, what your “research” question is and how it relates to the activities you are doing along the way. When I was doing my master’s degree, I saw two types of students. The first type already had a research agenda, a question they wanted to explore, or at least sat down early with their advisor and set up such a research question in line with their interests and passions. Those students usually made high-quality presentations, followed courses highly relevant to their research questions and became highly proficient in their field of research. The second type of student waited for their advisors to give them a research project, were never really involved in it, gave average or poor presentations and followed any courses without really seeing how they related to their research topic (well, in most cases they did not…). In the end they probably still graduated, but with a subject to forget about… You want to be like the first type of student: even if you do it on your own, you want to take control of it and reap the benefits.

Lastly, it is good for you to write or talk about your findings and learnings. Personally, I found it helps crystallize my thoughts and (sometimes) get some feedback from like-minded peers. All this to say that academic papers are not the only way to communicate your findings; blogs, videos and reports can all help you if you have the passion. Sure, one advantage of an academic paper is the peer review system, which provides you with feedback on your research, but you should not limit yourself to that single medium of communication if it is not suited to your reality. State plainly what you found, and do not claim you are something you are not, or not yet. When the time comes, others will recognize you as a data scientist, and that day you will know you are one for sure!

Along the same lines as my previous post, learn hard: it is easier when you are following a personal research/interest goal. Work hard: again, something easier (not necessarily easy) when you follow a passion. And at all times be honest with yourself (but also with others) about what you know or found out. If you think of yourself as a full-grown data scientist on day one, you might not put in the work necessary to ever become one. On the other hand, if you follow your interests and passions, you might become a data scientist before you even think of yourself as one.

Cover photo by Magda Ehlers at Pexels.

How to be a bad data scientist!

So, you want to be a data scientist, or better, you think you are now a data scientist and you are ready for your first job… Well, make sure you are not one of the stereotypes of “wannabe data scientists” I list below, otherwise you may well go through numerous rejections in interviews. I do not claim it is a complete list of all the stereotypes out there. In fact, if you can think of other stereotypes, please share them in the comments! These are only a few stereotypes of people I have met or seen over time, and which sadly seem to repeat over and over again.

I want to be a data scientist [because of the money] where do I start?

This type of person has heard that there is good money to be made in data science and wants their share of it… Little does this type of person know that a lot of hard work is involved in acquiring the knowledge and skills required to perform the job. Little does this type of person know, too, that data science is constant research work. Seldom is a clear path to the solution in front of you. This is even truer with deep learning, where new techniques and ideas pop up every day and where you will have to come up with new ideas. If you need to post the question “where do I start?” on social media, you don’t have what it takes to be one. Get a learn-it-all attitude, build an innovative spirit and then come back later.

I can do data science, please give me the “clean” data.

If you just came from (god forbid) a single data science course, or hopefully a few, and if you took part in one or a few Kaggle-like competitions, you might be under the impression that data comes to you all cleaned up (or mostly ready) and that with a couple of statements or commands it will all be well and ready for machine learning. The thing is that those courses and competitions prepare the data for you, so that you can go to the core of the problem faster and learn the subject matter of machine learning. In real life, data comes wild. It comes untamed and you must prepare it yourself. You might have to collect it yourself. A good part of most data scientists’ job is to play with the data, prepare it, clean it, etc. If you have not done this, figure out a problem of your own, solve it end-to-end and then come back later.

I don’t know any math or I’m bad at it, but people say I can do data science.

No, it is a fallacy. If you don’t have a mathematical mind, sooner or later you will end up in a situation where you just cannot progress anymore. The good thing is that you can learn mathematics. First, get out of the “this is too hard” syndrome. Anyway, data science is harder, so better start with something as simple as mathematics. Learn some calculus, some statistics, learn to speak and think mathematics and then come back later.

Just give me a “well” defined problem.

Some people just want their little box with well-defined interfaces: what comes in, what is expected to go out. Again, a symptom of someone who just took some well-canned courses in the field… In reality, not only is data messy, but the problems you have to solve are messy, ill-defined, muddy… you have to figure them out. Sometimes you can define and refine them by yourself, sometimes you have to accept the messiness and play around with it. If you cannot take vague and approximate objectives and refine them through thinking, research and discussions with the stakeholders until you come up with a solution, don’t expect to be a data scientist. A big misconception here is that if you have a PhD you are immune to that problem… well, not so fast, I have seen PhDs struggle with this as much as anyone else. So, grow a spine, accept the challenge and then come back later.

I’ve learned data science, I have a blog/portfolio/… I can do anything.

Not so fast. This kind of person learned data science and, being more marketing oriented and knowing it can help build a personal brand, built a portfolio or wrote blog posts, articles, etc., but never went to the point of trying it himself in real life. That person thinks he knows it all and that he can solve anything. That type of person is probably single-handedly responsible for the over-hype of what data science and machine learning can achieve, and is more of a problem to the profession than of any help. Do some real work, grow some honesty and then come back later.

If you want to be a data scientist, it all boils down to a simple recipe. Learn hard and work hard. You must follow your path and put passion in it. Seek to grow your knowledge along your interests, learn about them, try things. Continuously learn new things, and not only on connected subjects. Do not limit yourself to courses, find real-world examples to practice on, and stay honest about what you can do, about what you know and do not know. Be a good human!

Cover image by tookapic at Pixabay.

How to potty train a Siamese Network

Time for an update on my One-Shot learning approach using a Siamese LSTM-based Deep Neural Network we developed for telecommunication network fault identification through traffic analysis. A lot of small details had to change as we upgraded our machine to the latest TensorFlow and Keras. That alone introduced a few new behaviors… We also obtained new data for new examples and found some problems with our model. I don’t intend to go through all the changes, only some of the main ones as well as some interesting findings. It feels a lot like potty training a cat… If you are new to this series, you can refer to my previous posts: “Do Telecom Networks Dreams of Siamese Memories?” and “What Siamese Dreams are made of…”.

First, Batch Normalization in Keras is now on my black magic list 😊 . I’ll have to dig more into how it is implemented, especially the differences between train time and prediction time. For a long time, I was wondering why I was getting extremely good train loss and poor validation losses until I removed the Batch Normalization I had on the input layer. So, something to investigate there.
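
To make the train-time versus prediction-time difference more concrete, here is a minimal NumPy sketch of what a batch normalization layer conceptually does. It is not the Keras implementation and not necessarily the cause of the gap I observed: `batch_norm`, the `momentum` and `eps` values and the toy data are all assumptions for illustration. In training mode the layer normalizes with the statistics of the current mini-batch, while in prediction mode it relies on moving averages accumulated during training.

```python
import numpy as np

def batch_norm(x, moving_mean, moving_var, training, momentum=0.99, eps=1e-3):
    """Conceptual batch normalization, without the learned scale and offset."""
    if training:
        # Training mode: normalize with the statistics of this mini-batch.
        mean, var = x.mean(axis=0), x.var(axis=0)
        # The moving statistics only drift slowly toward the batch statistics.
        moving_mean = momentum * moving_mean + (1.0 - momentum) * mean
        moving_var = momentum * moving_var + (1.0 - momentum) * var
    else:
        # Prediction mode: normalize with the accumulated moving statistics.
        mean, var = moving_mean, moving_var
    return (x - mean) / np.sqrt(var + eps), moving_mean, moving_var

rng = np.random.default_rng(0)
batch = rng.normal(loc=50.0, scale=5.0, size=(256, 4))  # toy un-normalized features

moving_mean, moving_var = np.zeros(4), np.ones(4)  # assumed initial values
train_out, moving_mean, moving_var = batch_norm(batch, moving_mean, moving_var, training=True)
infer_out, _, _ = batch_norm(batch, moving_mean, moving_var, training=False)

print("train-mode mean:", train_out.mean())
print("prediction-mode mean:", infer_out.mean())
```

In this toy run the same batch is centered around zero in training mode but stays far from zero in prediction mode, because the moving averages have barely moved after a single update. A mismatch of that kind is one way a model can look good on the training loss and poor at validation time.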

Secondly, I introduced data generators for training and validation data. For a Siamese network approach where you must provide tons of similar and dissimilar pairs, using generators is a must to master at some point! Once you get the gist of it, it is quite convenient. I found Shervine Amidi’s blog post “A detailed example of how to use data generators with Keras” to be a very well explained example to build upon. I would recommend it to anyone learning about Keras data generators.
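
As a rough illustration of the idea, a triplet generator can be sketched in plain NumPy as below. This is not the generator used in our project: `triplet_generator`, `data_by_label`, the toy shapes and the shape of the dummy targets are all hypothetical, and a Keras `Sequence` subclass as described in the blog post above is likely the more robust choice.

```python
import numpy as np

def triplet_generator(data_by_label, batch_size, rng):
    """Yield ([anchors, positives, negatives], dummy_targets) forever.

    data_by_label -- dict mapping a label to an array of shape
                     (n_examples, n_steps, n_feat), with at least 2 examples each
    """
    labels = list(data_by_label)
    while True:
        anchors, positives, negatives = [], [], []
        for _ in range(batch_size):
            # One label for the anchor/positive pair, a different one for the negative.
            pos_label, neg_label = rng.choice(len(labels), size=2, replace=False)
            pos_pool = data_by_label[labels[pos_label]]
            neg_pool = data_by_label[labels[neg_label]]
            # Anchor and positive are two distinct examples of the same label.
            a_idx, p_idx = rng.choice(len(pos_pool), size=2, replace=False)
            anchors.append(pos_pool[a_idx])
            positives.append(pos_pool[p_idx])
            negatives.append(neg_pool[rng.integers(len(neg_pool))])
        x = [np.stack(anchors), np.stack(positives), np.stack(negatives)]
        # The triplet loss ignores the targets, so zeros are used as placeholders.
        yield x, np.zeros((batch_size, 1))

rng = np.random.default_rng(42)
toy_data = {
    "normal": rng.normal(size=(10, 60, 5)),
    "error": rng.normal(size=(8, 60, 5)),
}
x_batch, y_batch = next(triplet_generator(toy_data, batch_size=4, rng=rng))
print([part.shape for part in x_batch], y_batch.shape)
```

Each call to `next` produces a fresh random batch, so the full set of possible triplets never has to be materialized in memory.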

Along the way I found that my triplet_loss function as shown in the previous post was flawed… Because of the way I am packing the output of the base neural network with Keras concatenate, I must explicitly specify the ranges. Moreover, I painfully understood that a loss function in Keras is passed a mini-batch of y_true/y_pred values, not individual values. Well, that was not clear to me at first sight… I also took the opportunity to rework the logic to use more of a Keras approach than a TensorFlow one (subtle changes). Below is the new loss function.

from keras import backend as K

# ALPHA is the margin constant, defined elsewhere in the script.
def triplet_loss(y_true, y_pred, alpha = ALPHA):
    """
    Implementation of the triplet loss function
    Arguments:
    y_true -- true labels, required when you define a loss in Keras, you don't need it in this function.
    y_pred -- mini-batch of concatenated encodings, one row per triplet:
            anchor -- the encodings for the anchor data (columns 0 to 2)
            positive -- the encodings for the positive data (similar to anchor, columns 3 to 5)
            negative -- the encodings for the negative data (different from anchor, columns 6 to 8)
    Returns:
    loss -- value of the loss, one per row of the mini-batch
    """
    anchor = y_pred[:,0:3]
    positive = y_pred[:,3:6]
    negative = y_pred[:,6:9]
    # distance between the anchor and the positive
    pos_dist = K.sum(K.square(anchor - positive), axis=1)
    # distance between the anchor and the negative
    neg_dist = K.sum(K.square(anchor - negative), axis=1)
    # compute loss
    basic_loss = pos_dist - neg_dist + alpha
    loss = K.maximum(basic_loss, 0.0)
    return loss

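
To check the slicing and the mini-batch behaviour outside of Keras, the same computation can be mirrored in plain NumPy. This is only a sketch for experimenting: `triplet_loss_np` is a hypothetical helper, the encodings are made up, and a margin of 0.2 is assumed (the value used in the previous post).

```python
import numpy as np

def triplet_loss_np(y_pred, alpha=0.2, dims=3):
    """NumPy mirror of the Keras loss: one loss value per row of the mini-batch."""
    anchor = y_pred[:, 0:dims]
    positive = y_pred[:, dims:2 * dims]
    negative = y_pred[:, 2 * dims:3 * dims]
    pos_dist = np.sum(np.square(anchor - positive), axis=1)
    neg_dist = np.sum(np.square(anchor - negative), axis=1)
    return np.maximum(pos_dist - neg_dist + alpha, 0.0)

# Two concatenated triplets (anchor | positive | negative), 3 dimensions each.
y_pred = np.array([
    [0.1, 0.1, 0.1,  0.1, 0.1, 0.2,  0.9, 0.9, 0.9],  # negative is far from the anchor
    [0.5, 0.5, 0.5,  0.9, 0.5, 0.5,  0.5, 0.5, 0.6],  # negative is closer than the positive
])
losses = triplet_loss_np(y_pred)
print(losses)
```

The first row is already well separated and contributes no loss, while the second row, where the negative sits closer to the anchor than the positive, is penalized.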

The fourth interesting thing to mention is that while I was debugging all those issues, I felt the need for a better way to visualize the results than simply looking at the prediction values. I reduced the output vector space from 10 dimensions to 3 dimensions since, for now, I do not have that many different examples anyway, so 3D should be more than enough to separate them. Furthermore, I changed my output layer to use a sigmoid activation function to limit the output space to the [0,1] range. Those changes in turn enabled me to look at the location of the predicted point in the transformed space, i.e. a traffic pattern now corresponds to a 3D location in this output space.


Below I made a video of how this projection evolves through training. Initially, as the neural net is initialized with random values, the output points cluster together at the center. But quickly we see them being separated, each taking a corner of the space. Sure, there is a lot of bouncing back and forth as the neural net tries to find a better solution, but we can see that there is a sweet spot where the different traffic patterns are well separated. As a side note, we see three different traffic patterns here: normal traffic in green and two different error cases, a dramatic one in red where all traffic is blocked, and a subtler one in orange where we reach the capacity limit of the communication link.

Now, while acquiring more data from our test bed, we are trying out different loss functions to separate the traffic. One of my colleagues has just posted a comparison between different loss functions: “Lossless Triplet Loss”. I might also try some different loss functions and show my findings.

I hope this shows that One-Shot learning using Siamese networks can be used for purposes other than face recognition. In this case we are successfully using it for signalling traffic categorization and fault detection.

Cover photo by Jan-Mallander at Pixabay.

What Siamese Dreams are made of…

In my last post I wrote a high-level description of a One-Shot learning approach we developed for telecommunication network fault identification through traffic analysis. The One-Shot learning approach is implemented using a Siamese Deep Neural Network. In this post I will describe in more detail how this can be achieved with Keras and TensorFlow. As said in the previous post, this is early work and subject to a lot of change, but if it can help someone else alleviate some of the pain of building such a network, let it be!

The first step is probably to understand what a Siamese Network is and how it works. What we want our network to produce is a representation of the data we feed it, e.g. a vector representing the input data, like word embeddings, but in this case for telecom network traffic data. At the end of the day, this representation vector should have a small distance for similar traffic and a larger distance for dissimilar traffic. Hence, when the network is properly trained, we can use those distances to determine which known network traffic is the closest and thus the most representative. But how do we implement it?

For that, let’s look at the cute kitten image I have put on this and the previous post. The cute cream-colored one hiding at the bottom is Aristotle. The other cream-colored one is Peter Pan and the black one is Napoleon. Aristotle is our Anchor, the kitten we want to compare to. If another kitten is similar, let’s say Peter Pan, then the vector representing Peter Pan should be close in distance to the vector representing Aristotle. This is our Positive example. Similarly, when a kitten is different from Aristotle, let’s say Napoleon, we want the vector representing it to be far in distance from Aristotle’s. This is our Negative example.

Simplifying things, training a deep neural network consists of predicting a result from a training example; finding out how far we are from the expected value using a loss function to find the error; and then correcting the weights of the deep neural network based on that error, so next time we are a bit closer. Here we do not know the expected value for our training examples, but we know that whatever that value is, it should be close in distance to the Anchor if we present the Positive example, and far in distance if we present the Negative example. Thus, we will build our loss function that way. It receives a Python list of the representations of the Anchor, the Positive example and the Negative example through y_pred. Then it computes the distance between the Anchor and the Positive (AP), and between the Anchor and the Negative (AN). As we said, AP should get close to 0 while AN should get large. For this exercise, let’s set “large” to 0.2. So, we want AP = 0 and AN = 0.2, that is AN – 0.2 = 0. Ideally, we want both of those to hold, hence we want to minimize the loss where loss = AP – (AN – 0.2) = AP – AN + 0.2. In the code, this value is clipped at 0, so a triplet whose Negative is already far enough from the Anchor no longer contributes to the loss. That being explained, below is the loss function we defined.

import tensorflow as tf

def triplet_loss(y_true, y_pred, alpha = 0.2):
    """
    Implementation of the triplet loss function
    Arguments:
    y_true -- true labels, required when you define a loss in Keras, not used in this function.
    y_pred -- python list containing three objects:
            anchor:   the encodings for the anchor data
            positive: the encodings for the positive data (similar to anchor)
            negative: the encodings for the negative data (different from anchor)
    Returns:
    loss -- real number, value of the loss
    """
    anchor, positive, negative = y_pred[0], y_pred[1], y_pred[2]
    # distance between the anchor and the positive
    pos_dist = tf.reduce_sum(tf.square(tf.subtract(anchor, positive)))
    # distance between the anchor and the negative
    neg_dist = tf.reduce_sum(tf.square(tf.subtract(anchor, negative)))
    # compute loss
    basic_loss = pos_dist - neg_dist + alpha
    loss = tf.maximum(basic_loss, 0.0)
    return loss

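
To get a feel for the arithmetic, here is a tiny worked example in plain Python that works directly on the two distances instead of on encodings. `triplet_loss_from_distances` and the three (AP, AN) pairs are made up for illustration only.

```python
def triplet_loss_from_distances(ap, an, margin=0.2):
    """Loss for one triplet, given its Anchor-Positive and Anchor-Negative distances."""
    return max(ap - (an - margin), 0.0)

# (AP, AN) pairs: the ideal case, a comfortably separated case, and a bad case.
cases = [(0.0, 0.2), (0.1, 0.5), (0.3, 0.1)]
for ap, an in cases:
    print(f"AP={ap}, AN={an} -> loss={triplet_loss_from_distances(ap, an):.2f}")
```

The first two triplets already satisfy the margin and give a loss of 0, while the third one, where the Negative is closer to the Anchor than the Positive, gives a loss of about 0.4.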

Now that we have a loss function to train a network with, we need to define the network. It should receive our network traffic information as input and output a vector representation of it. I already mentioned the network before, so here is the function that creates it from a Keras Sequential model.

from keras.models import Sequential
from keras.layers import LSTM, Dense

def create_base_network(in_dims, out_dims):
    """
    Base network to be shared.
    """
    # Note: in_dims is not used in this function as written.
    model = Sequential()
    model.add(LSTM(512, return_sequences=True, dropout=0.2, recurrent_dropout=0.2, implementation=2))
    model.add(LSTM(512, return_sequences=False, dropout=0.2, recurrent_dropout=0.2, implementation=2))
    model.add(Dense(512, activation='relu'))
    model.add(Dense(out_dims, activation='relu'))
    return model


Now that we have that base model, we need to embed it within a Siamese “framework”. After all, that base network simply computes one vector representation for a specific piece of network traffic data, and the loss function we defined calls for three of those representations, i.e. the anchor, the positive and the negative. So, what we will do is define three inputs which will be evaluated through the SAME base network, hence the name Siamese network. The three outputs of that Siamese network are then simply concatenated, which is what we are asking our loss function to evaluate. Note that at this point we define the input and output dimensions. The inputs will be in the shape of N_MINS minutes of network traffic characterization (60 minutes for now), where each minute is characterized by n_feat features (the 130 or so features I mentioned in my previous post).

in_dims = (N_MINS, n_feat)
out_dims = N_FACTORS
# Network definition
with tf.device(tf_device):
    # Create the 3 inputs
    anchor_in = Input(shape=in_dims)
    pos_in = Input(shape=in_dims)
    neg_in = Input(shape=in_dims)
    # Share base network with the 3 inputs
    base_network = create_base_network(in_dims, out_dims)
    anchor_out = base_network(anchor_in)
    pos_out = base_network(pos_in)
    neg_out = base_network(neg_in)
    merged_vector = concatenate([anchor_out, pos_out, neg_out], axis=1)
    # Define the trainable model
    model = Model(inputs=[anchor_in, pos_in, neg_in], outputs=merged_vector)


Everything is now in place to train the base model through the Siamese “framework” using our defined loss function. Note that the y values we pass to the fit method are dummy values, since our loss function does not care about the real targets (which we do not know).

# Training the model
model.fit(train_data, y_dummie, batch_size=256, epochs=10)


Now we could save the model (really, just the base model is needed here). But more importantly, we can use the base model to evaluate what the vector representation of a given traffic sample would be. For me, this was the part that was unclear in other tutorials. You simply perform a predict on the base model and no longer care about the Siamese “framework”. You kind of throw it away.

def traffic_to_encoding(x, model):
    return model.predict(np.array([x]))


For completeness sake, since what we want to do is to evaluate the “closest” vector representation to the trained faults we want to detect, we could create a method to identify the traffic case such as the following.

import numpy as np

def identify_traffic(x, database, model):
    """
    Implements traffic recognition.
    Arguments:
    x -- the traffic to identify
    database -- database containing recognized traffic encodings
    model -- the encoding model
    Returns:
    min_dist -- the minimum distance between traffic encoding and the encodings from the database
    identity -- string, the traffic prediction name
    """
    # Compute the target "encoding" for the traffic.
    encoding = traffic_to_encoding(x, model)
    # Find the closest encoding
    min_dist = 100
    identity = 'unknown'
    for (name, db_enc) in database.items():
        # Compute L2 distance between the target "encoding" and the current "emb" from the database.
        dist = np.linalg.norm(db_enc - encoding)
        # If this distance is less than the min_dist, then set min_dist to dist, and identity to name.
        if dist < min_dist:
            min_dist = dist
            identity = name
    return min_dist, identity


Assuming proper training of our Siamese network on our training data, we can use the above to create a database of the different traffic conditions we can identify in a specific network (as traffic patterns can change from network to network, but hopefully not the way to represent them), and then identify the current traffic using the function created above.

database = {}
database['normal'] = traffic_to_encoding(get_example_label(train_cases_df, df_lens, 0), base_network)
database['error2'] = traffic_to_encoding(get_example_label(train_cases_df, df_lens, 1), base_network)
# Prediction on traffic
identify_traffic(x, database, base_network)


Et voilà, you should now have all the pieces to properly use Aristotle, Peter Pan and Napoleon to train a Siamese Network, and then sadly throw them away when you do not need them anymore… This metaphor of Siamese cats is heartbreakingly getting closer and closer to reality… Nevertheless, I hope it can help you out there creating all sorts of Siamese Networks!