Who are those Strangers?

This post is a follow-up to Who am I connected to? As stated in the previous post, a problem that arises a lot is figuring out how things are connected. Is this server directly or indirectly connected to that pool? Or who am I connected to through a chain of friends? If you ever have to implement such an algorithm (for that, you can refer to my previous post), one thing you might encounter is superstars, false friends or black holes. Name them the way you want 😉. Those are “nodes” which are connected to an abnormally high number of other nodes. When someone has 50k friends, you should suspect that they are not all real friends! The problems caused by fake friends are manifold.

First, if you go through the process described last time, you get a number of very high density groups which normally wouldn’t be grouped together were it not for those black hole nodes. This may well make any conclusion pointless, so you should remove (or ignore) those superstar nodes to start with.

Second, assuming you start with big data, joining a number of those superstars on themselves will make your data set blow up (at least temporarily), since a node with n connections produces on the order of n² rows in a self-join, and it will take forever to complete the associated Spark tasks (if they succeed at all). Ok, those might be legit friends; in that case you might not have a choice, and maybe Fighting the Skew in Spark can help you solve that issue. But otherwise, if those are indeed false friends, you should remove those black hole nodes beforehand.

In an ever changing world of data, it may not be easy to spot those black holes, but a good first filter may be as simple as (using PySpark notation this time, just to keep you on your toes):

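from pyspark.sql import functions as F
# Assumes node_table already carries a per-node 'count' column.
filter_out = (node_table
  .filter(F.col('count') > black_hole_threshold))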

The nodes captured by that filter-out “rule” can then be automatically removed from your node table, or examined and added to blacklists if need be. To automatically remove the filter_out nodes from your node_table, the anti join (a join with how='left_anti' in PySpark) is your friend!

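# Anti join on the node key ('node' is an assumed column name)
output = node_table.join(filter_out, on='node', how='left_anti')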

You still need to perform the connection finding algorithm on this “output”, but at least you will have removed from your inputs every node whose number of connections exceeds black_hole_threshold.
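For those who want to try the logic without a cluster, here is a minimal plain-Python sketch of the same two steps (the filter, then the anti join). It is only an analogue of the PySpark version: the function name, the (node, connection) row layout and the sample data are assumptions made for illustration.

```python
from collections import Counter


def remove_black_holes(node_table, black_hole_threshold):
    """Split (node, connection) rows into kept rows and flagged black-hole nodes."""
    # Count how many rows (connections) each node appears in.
    counts = Counter(node for node, _ in node_table)
    # Equivalent of the filter: nodes whose count exceeds the threshold.
    filter_out = {node for node, count in counts.items() if count > black_hole_threshold}
    # Equivalent of the anti join: keep only rows whose node is not flagged.
    output = [(node, conn) for node, conn in node_table if node not in filter_out]
    return output, filter_out


# Hypothetical data: node "star" is attached to far more connections than the rest.
node_table = [("alice", "c1"), ("bob", "c1"), ("bob", "c2")]
node_table += [("star", f"c{i}") for i in range(1, 8)]

output, filter_out = remove_black_holes(node_table, black_hole_threshold=5)
print(filter_out)  # {'star'}
print(output)      # [('alice', 'c1'), ('bob', 'c1'), ('bob', 'c2')]
```

The flagged set can be reviewed by hand before it is dropped, which matches the “examine and blacklist” option mentioned above.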

What else can go wrong? Again, if you have big data, this process as a whole (especially since it is iterative) can take some serious time to execute. Moreover, even with the black holes removed, the self-join part may consume a lot of resources on your cluster. The interesting part is that if you keep your “node” definition constant, you could run the algorithm in an online, additive fashion. That would run faster because most of the data wouldn’t change and would already be reduced to who is whose friend, so only the additional delta would in fact “move”. I know it is not that simple and quick, but it is still quicker than running the process on the initial input data again and again…
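To give a feel for the additive idea, here is a small sketch using a disjoint-set (union-find) structure. Note that this is a different technique from the table joins of the previous post, swapped in because it makes the “only the delta moves” behaviour easy to see on a single machine. The class and method names are hypothetical.

```python
class ConnectionGroups:
    """Incremental grouping of nodes with a disjoint-set (union-find) structure."""

    def __init__(self):
        self.parent = {}

    def find(self, item):
        # Register unseen items as their own group, then walk up to the root.
        self.parent.setdefault(item, item)
        root = item
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every visited item straight at the root.
        while self.parent[item] != root:
            self.parent[item], item = root, self.parent[item]
        return root

    def add(self, node, connection):
        # Merging a node with its connection only touches the two groups involved.
        a, b = self.find(("node", node)), self.find(("conn", connection))
        if a != b:
            self.parent[a] = b

    def same_group(self, node_a, node_b):
        return self.find(("node", node_a)) == self.find(("node", node_b))


groups = ConnectionGroups()
for node, conn in [(0, "A"), (1, "A"), (1, "B"), (7, "B")]:
    groups.add(node, conn)
print(groups.same_group(0, 7))  # True
print(groups.same_group(0, 8))  # False

# Later, only the new rows (the delta) are added; nothing is recomputed.
for node, conn in [(1, "C"), (8, "C")]:
    groups.add(node, conn)
print(groups.same_group(0, 8))  # True
```

A real deployment on Spark would look different, but the principle is the same: the already-reduced state is kept and only new rows cause work.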

Again, I hope this can be of help. If you apply this method or an equivalent one, let me know and let’s discuss our experiences!

Cover photo by Felix Mittermeier at Pixabay.


Who am I connected to?

A problem that arises a lot when you play with data is figuring out how things are connected. It could be, for example, to determine from all your friends, your friends’ connections, and your friends’ friends’ connections, … to whom you are directly or indirectly connected, or how many degrees of separation you have with such and such a connection. Luckily there are some tools at your disposal to perform such analysis. Those tools come under the umbrella of Network Theory and I will cover some basic tricks in this post.

First let’s go with some terminology. Nodes (also called vertices) are the things we are connecting e.g. you, your friends, your friends’ friends. Edges are how those nodes are connected. For example, below, node 0 connects to nodes 1 and 2 using two edges to describe those connections. Node 1 connects to node 3 via one edge, etc. For this first example, we use uni-directional (directed) edges, but nothing prevents us from using bi-directional edges. In general, if all edges are bi-directional we talk of an undirected graph, which would (usually) be the case for friends, since you know your friend, and your friend knows you as well!

A first important concept to introduce in network analysis is that of an adjacency matrix. The adjacency matrix is a matrix representing the edges between the nodes. The first row of the adjacency matrix represents the connections of node 0. Node 0 connects to nodes 1 and 2, but not to itself or to node 3, so the first row is 0, 1, 1, 0. The second row represents the connections of node 1, which only connects to node 3, so the second row is 0, 0, 0, 1. Note that we could have bi-directional connections; in such a case the connection would appear in both the row and the column, but this is not the case in this example.

By inspecting the adjacency matrix, we can reconstruct the node/edge graph. It informs us about the first hop connections: who your friends are. But how can we know about the second hop connections e.g. node 0 is connected to node 3 via node 1 and via node 2? A really simple way is to multiply the adjacency matrix by itself (A*A). The result of this multiplication gives the second hop connections. Here, we see that node 0 connects through 2 hops to node 1 (via node 2), and connects through 2 hops to node 3. We can even see that there are 2 such two-hop connections from node 0 to node 3. Lastly we see that node 2 connects through 2 hops to node 3 (via node 1).

If we were to multiply again A*A by A itself, we would get the three hop connections, which in this case is limited to node 0 being connected to node 3.
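The hop counting above can be checked with a few lines of plain Python. This is a sketch under one assumption: the rows for nodes 0 and 1 come straight from the text, while the rows for nodes 2 and 3 are inferred from the hop descriptions (node 2 connects to nodes 1 and 3, node 3 connects to nothing), since the figure is not reproduced here.

```python
def matmul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


A = [
    [0, 1, 1, 0],  # node 0 -> nodes 1 and 2 (from the text)
    [0, 0, 0, 1],  # node 1 -> node 3 (from the text)
    [0, 1, 0, 1],  # node 2 -> nodes 1 and 3 (inferred from the hop descriptions)
    [0, 0, 0, 0],  # node 3 has no outgoing connection (inferred)
]

A2 = matmul(A, A)   # two-hop connections
A3 = matmul(A2, A)  # three-hop connections

print(A2[0])  # [0, 1, 0, 2]: one 2-hop path to node 1, two 2-hop paths to node 3
print(A2[2])  # [0, 0, 0, 1]: node 2 reaches node 3 in 2 hops via node 1
print(A3[0])  # [0, 0, 0, 1]: the single 3-hop path 0 -> 2 -> 1 -> 3
```

Each entry of the k-th power counts the number of distinct k-hop paths between two nodes, which is why the value 2 shows up between node 0 and node 3.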

In general, the networks that interest us are way bigger than this simple 4-node diagram. Also in general, not every node is connected to every other node. Well, they say everyone is connected to everyone by six degrees of separation (six hops), but in most practical applications, not all nodes are connected to each other. Let’s take a look at a bigger example to see how the principles illustrated above can apply at scale. Let’s assume the following undirected network graph. Since the graph is undirected, the connection values appear in both the rows and the columns, so the matrix is symmetric about the diagonal.

As before, if we compute A*A, we get the second hop connections. Notice that nodes become connected to themselves via a second hop. For example, node 1 is connected 3 times to itself through a second hop, via nodes 0, 7 and 8.

If you are interested in both the first hop and the second hop connections, you could add together A*A and A, leading to the following matrix. You could proceed forward to find the third hops onward, but in this example nothing else is connected, so although the numbers you see here would grow, the pattern of zeros would not change. We have found all connections of this graph. We found that node 0 is connected to nodes 1, 7 and 8. Nodes 2, 3 and 4 are connected. Nodes 5, 6 and 9 are connected. Finally we see that node 10 is not connected to any other node.
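Here is a partial sketch of that computation in plain Python. It only uses the edges this post states explicitly (0-1, 1-7 and 1-8); the edges inside the other groups are not spelled out in the text, so they are left out rather than guessed.

```python
def matmul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


n = 11  # nodes 0 to 10
A = [[0] * n for _ in range(n)]
# Undirected graph: every edge is written in both the row and the column.
for a, b in [(0, 1), (1, 7), (1, 8)]:
    A[a][b] = A[b][a] = 1

A2 = matmul(A, A)
# First hop plus second hop connections.
reach = [[A[i][j] + A2[i][j] for j in range(n)] for i in range(n)]

print(A2[1][1])                              # 3: node 1 returns to itself via 0, 7 and 8
print([j for j in range(n) if reach[0][j]])  # [0, 1, 7, 8]
```

The non-zero pattern of row 0 is exactly the group containing node 0 (including node 0 itself, reached back through node 1).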

In practice the matrix multiplication works well to find the next hop neighbours. If it also happens that for your problem (as in the one above) most connections are non-existent i.e. 0, then you could use sparse matrices to store (and potentially compute with) your Adjacency matrix. However, those matrices quickly become really huge and require a lot of operations to compute. A nice trick if you are using SQL or Spark could be to use joins on tables.

To do so, you need to turn the problem on its head. Instead of creating an Adjacency matrix of how the nodes are connected, you will create a table of the connections. So to keep with our second example, you could have something like the following network graph being turned into a node/connection table.

Node Connection
0 A
1 A
1 B
7 B
1 C
8 C

Now that we have that node/connection table, our goal will be to reduce the number of connections to the minimum possible and in the end get something like the following as a way to see everything connected (we won’t care about how many hops lead us there).

To get there we will iterate through a two step process. First we will perform connection reduction and then update the node/connection table. Then we rinse and repeat until we can no longer reduce the number of connections.

Assuming the above node/connection table (node_connections), we can reduce the number of connections via the following SQL query and store it as the new_connections table:

SELECT A.connection, MIN(B.connection) AS new_connection
FROM node_connections AS A
JOIN node_connections AS B
ON A.node = B.node
GROUP BY A.connection

Then you can update the node_connections table with the following SQL query:

SELECT DISTINCT B.new_connection AS connection, A.node
FROM node_connections AS A
JOIN new_connections AS B
ON A.connection = B.connection

You iterate those two steps until the node_connections table no longer changes, et voilà, you have a map of all nodes connected through distinct connections.
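If you want to see the whole loop run end to end, here is a minimal sketch using Python’s built-in sqlite3 module. The two queries are the ones shown above; the in-memory database, the helper name reduce_connections and the intermediate table name updated are assumptions made for this illustration, not part of the original application.

```python
import sqlite3


def reduce_connections(rows):
    """Iterate the two SQL steps until the node/connection table stops changing."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE node_connections (node INTEGER, connection TEXT)")
    conn.executemany("INSERT INTO node_connections VALUES (?, ?)", rows)
    while True:
        before = set(conn.execute("SELECT node, connection FROM node_connections"))
        # Step 1: map every connection to the smallest connection sharing a node with it.
        conn.execute("DROP TABLE IF EXISTS new_connections")
        conn.execute("""
            CREATE TABLE new_connections AS
            SELECT A.connection, MIN(B.connection) AS new_connection
            FROM node_connections AS A
            JOIN node_connections AS B
            ON A.node = B.node
            GROUP BY A.connection""")
        # Step 2: rewrite the node/connection table with the reduced connections.
        conn.execute("""
            CREATE TABLE updated AS
            SELECT DISTINCT A.node AS node, B.new_connection AS connection
            FROM node_connections AS A
            JOIN new_connections AS B
            ON A.connection = B.connection""")
        conn.execute("DROP TABLE node_connections")
        conn.execute("ALTER TABLE updated RENAME TO node_connections")
        after = set(conn.execute("SELECT node, connection FROM node_connections"))
        if after == before:
            conn.close()
            return sorted(after)


table = [(0, "A"), (1, "A"), (1, "B"), (7, "B"), (1, "C"), (8, "C")]
result = reduce_connections(table)
print(result)  # [(0, 'A'), (1, 'A'), (7, 'A'), (8, 'A')]
```

On this small table a single pass is enough to collapse connections A, B and C into A; the second pass only confirms that nothing changes anymore.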

This is only one of the possible use cases, but for large scale applications it is probably easier and quicker to join tables than to create and multiply Adjacency matrices. I showed the logic with SQL, but obviously you could achieve similar results using Spark (for my specific application, I use PySpark).

If you have questions or interesting ideas for applying network theory to some problem, feel free to jump into the conversation!

Cover photo by Michael Gaida at Pixabay.

Automation and Sampling

As I mentioned earlier, I have transitioned from Ericsson to Shopify. As part of this transition I am getting a taste of public transport (previously work was a 15-minute drive from home; now, working in the city center, I have to take the train to commute). This morning there was a woman just beside me who was working on her computer, apparently editing a document or, more probably, writing comments in it. A few minutes into her edits, she made a phone call, asking someone to change some wording in the document and painfully dictating those changes (a few words). This game went on a few times: writing comments in the document, then calling someone to make the appropriate edits. What could have been a simple edit followed by sending the edited document via email became an apparently painful exercise in dictation. The point is not to figure out why she was not sending the document via email; it could be as simple as not having a data plan and not being willing to wait for a wifi connection, who knows. The point is that using a non-automated “process” turned something ordinarily quite simple (editing a couple of sentences in a document) into a painful dictation experience. It also limits bandwidth, so only a few comments can make their way into corrections on that document.

This reminded me of a conversation I had with a friend some time ago. He mentioned how proud he was of having put in place a data pipeline at his organization to extract, transform and store two data sources in a local database. Some data is generated by a system in his company. Close to that data source he has a server which collects and reduces / transforms the data and stores the results in the local file system as text files. Every day he looks at the extraction process on that server to make sure it is still running, and every few days he downloads the new text files from that server to a server farm and a database he usually uses to perform his analysis on the data. As you can see, this is also a painful, non-automated process. As a consequence, the amount of data is most probably more limited than it could be with an automated process, as my friend has to cater to the needs of those pipelines manually.

At Shopify I have the pleasure of having access to an automated ETL (Extract, Transform and Load) process for the data I may want to analyze. If you want to get a feel for what is available at Shopify with respect to ETL, I invite you to watch the Data Science at Shopify video presentation from Françoise Provencher, who touches a bit on that as well as the other aspects of the job of a data scientist at Shopify. In short, we use PySpark with custom libraries developed by our data engineers to extract, transform and load data from our sources into a front room database which anyone in the company can use to get information and insight about our business. If you listen through Françoise’s video, you will understand that one of the benefits of that automated ETL scheme is that we transform the raw data (mostly unusable) into information that we store in the front room database. This information is then available to be processed further to extract valuable insight for the company. You immediately see the benefit. Once such a pipeline is established, it performs its work autonomously and, as an added benefit, thanks to our data engineering team, monitors itself all the time. Obviously if something goes wrong somebody will have to act and correct the situation, but otherwise, you can forget about that pipeline and its always updated data is available to all. No need for a single person to spend a sizable amount of time monitoring and manually importing data. A corollary to this is that the bandwidth for new information is quite high and we get a lot of information on which we can do our analysis.

Having that much information at our fingertips brings new challenges mostly not encountered by those who have manual pipelines. It becomes increasingly difficult and inefficient to do analysis on the whole population. You need to start thinking in terms of samples. There are a couple of considerations to keep in mind when you do sampling: the sample size and what you are going to sample.

There are mathematical and analytical ways to determine your sample size, but a quick way to get it right is to start with a modest random sample, perform your analysis, look at your results and keep them. Then, you redo the cycle a few times with a fresh random sample each time and watch whether you keep getting the same results. If your results vary wildly, you probably do not have a big enough sample. Otherwise you are good. If it is important for future repeatability to be as efficient as possible, you can try to reduce your sample size until your results start to vary (at which point you should revert to the previous sample size), but if not, good enough is good enough! Just remember that those samples must be random! If you redo your analysis using the same sample over and over again, you haven’t proven anything. In SQL terms, it is the difference between:


Which would produce a random sample of 10% of table, on the other hand:

LIMIT 1000

Will most likely always produce the same first 1000 elements… this is not a random sample!
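The same contrast can be sketched in plain Python. The helper names and the toy, ordered “table” below are hypothetical; the point is only to show how a first-N selection repeats itself (and can be badly biased when the data is ordered), while fresh random samples give slightly different but consistent results.

```python
import random

population = list(range(10_000))  # hypothetical, ordered table of values


def mean(values):
    return sum(values) / len(values)


def first_n(rows, n):
    # Analogue of LIMIT n: the same leading rows every time.
    return rows[:n]


def random_fraction(rows, fraction, rng):
    # Analogue of a random sample: each row is kept with probability `fraction`.
    return [row for row in rows if rng.random() < fraction]


rng = random.Random(42)
limit_means = [mean(first_n(population, 1000)) for _ in range(3)]
random_means = [mean(random_fraction(population, 0.10, rng)) for _ in range(3)]

print(mean(population))  # 4999.5
print(limit_means)       # [499.5, 499.5, 499.5]
print(random_means)      # three values close to the population mean
```

Repeating the first-N “sample” three times proves nothing, since it is the same 1000 rows each time. The three random samples are the repeated trials described above: if their results agree, the sample size is probably good enough.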

The other consideration you should keep in mind is what you should sample, that is, what population you want to observe. Let’s say, to use an example in the lingo of Shopify, I have a table of all my merchants’ customers which, amongst other things, contains a foreign key to an orders table. If I want to get a picture of how many orders a customer places, the population under observation is the customers, not the orders. In other words, in that case I should randomly sample my customers, then look up how many orders they have. I should not sample the orders, aggregate those per customer and hope this will produce the expected results.

Visually, we can see that sampling from orders will lead us to wrongly think each customer places on average two orders. Random resampling will lead to the same erroneous results.


Sampling Orders leads to the wrong conclusion that each customer places an average of 2 orders.

Whereas sampling from customers will lead to the correct answer that each customer places on average four orders.


Sampling Customers leads to the right conclusion that each customer places an average of 4 orders.
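A quick simulation makes the difference tangible. The data below is hypothetical (1,000 customers with exactly 4 orders each, and a 50% sample of orders), chosen only to mirror the two pictures above.

```python
import random
from collections import Counter

rng = random.Random(0)

# Hypothetical data: 1,000 customers with exactly 4 orders each.
orders = [(customer_id, order_no) for customer_id in range(1000) for order_no in range(4)]
orders_per_customer = Counter(customer_id for customer_id, _ in orders)

# Right: sample customers, then look up all of their orders.
sampled_customers = rng.sample(range(1000), 100)
avg_from_customers = (
    sum(orders_per_customer[c] for c in sampled_customers) / len(sampled_customers)
)

# Wrong: sample half of the orders, then aggregate them per customer.
sampled_orders = rng.sample(orders, len(orders) // 2)
seen = Counter(customer_id for customer_id, _ in sampled_orders)
avg_from_orders = sum(seen.values()) / len(seen)

print(avg_from_customers)  # 4.0
print(avg_from_orders)     # a bit above 2, well below the true 4
```

Sampling orders only shows a fraction of each customer’s orders, so the per-customer average shrinks with the sampling rate. Drawing a new random sample of orders does not fix it, since the bias comes from the choice of population and not from bad luck.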

To summarize things, let’s just say that if you have manual (or even semi-manual) ETL pipelines, you need to automate them to give you consistency and throughput. Once this is done, you will eventually discover the joys (and need) of sampling. When sampling, you must make sure you select the proper population to sample from and that your sample is randomly selected. Finally, you could always analytically find the proper sample size, but with a few trials you will most probably be just fine if your findings stay consistent through a number of random samples.

Cover photo by Stefan Schweihofer at Pixabay.

Where the F**k do I execute my model?

or: Toward a Machine Learning Deployment Environment.

Nowadays, big names in machine learning have their own data science analysis environments and in-production machine learning execution environments. The others have a mishmash of custom-made parts, or are lucky enough that an existing commercially available machine learning environment fits their needs. There are several data science environments commercially available; Gartner mentions the best-known players (although new ones pop up every week) in its Magic Quadrant for Data Science and Machine-Learning Platforms. However, most (if not all) of those platforms suffer from a limitation which might prevent some industries from adopting them. Most of those platforms start from the premise that they will execute everything on a single cloud (whether public or private). Let’s see why this might not be the case for every use case.

Some machine learning models might need to be executed remotely. Let’s think for example of the autonomous vehicle industry. Latency and security prevent execution in a cloud (unless that cloud is onboard the vehicle). Some industrial use cases might require models to be executed in an edge-computing or fog-computing fashion to satisfy latency requirements. Data sensitivity in some industries may require the execution of some algorithms on customer equipment. There are many more reasons why you may want to execute your model in some other location than the cloud where you made the data science analysis.

As said before, most commercially available offerings do not cater to that requirement. And it is not a trivial thing that one may slap on top of an existing solution as a simple feature. In some cases there are profound implications to allowing such a distributed and heterogeneous analysis and deployment environment. Let’s just look at some of the considerations.

First, one must recognize there is a distinction between the machine learning model and the complete use case to be covered, or as some would like to call it, the AI. A machine learning model is simply provided a set of data and gives back an “answer”. It could be a classification task, a regression or prediction task, etc., but this is where a machine learning model stops. To get value from that model, one must wrap it in a complete use case; some call that an AI. How do you reliably acquire the data it requires? How do you present or act on the answer given by the model? Those, and many more questions, need to be answered by a machine learning deployment environment.

Recognizing this, one of the first things required to deploy a full use case is access to data. In most industries, the sources of data are limited (databases, web queries, csv files, log files, …) and the way to handle them is repetitive i.e. once I have figured out a way to do database queries, the next time most of my code will look the same, except for the query itself. As such, data access should be facilitated by a machine learning deployment environment, which should provide “data connectors” that can be configured for the needs at hand and deployed where the data is available.

Once you have access to data, you will need “rules” as to when the machine learning model needs to be executed: is it once a day, on request, … Again, there are many possibilities (although when you start thinking about it, a lot are the same), but expressing those “rules” should be facilitated by the deployment environment so that you don’t have to write a new “data dispatcher” for every use case, but simply configure a generic one.

Now we have data and we are ready to call a model, right? Not so fast. Although some think of data preparation as part of the model, I would like to consider it as an intermediary step. Why, you may ask? Simply because data preparation is a deterministic step where there should be no learning involved, and because in many cases you will significantly reduce the size of the data in that step, data that you might want to store to monitor the model behavior. But I’ll come to this later. For now, just consider there might be a need for “data reduction”, and this one cannot be generic. You can think of it as a pre-model which formats the data in a way your model is ready to use. The deployment environment should facilitate the packaging of such a component and provide ways to easily deploy it (again, anywhere it needs to be).

We are now ready for the machine learning execution! You already produced a model from your data science activities and this model needs to be called. As with “data reduction”, the deployment environment should facilitate the packaging and the deployment of the “model execution” component.

For those who have been through the loops of creating models, you certainly have the question: But how have you trained that model? So yes, we might need a “model training” component which is also dependent on the model itself. A deployment environment should also facilitate the use/deployment of a training component. However, this raises another important question. Where does the data used for training come from? And what if the model drifts, is no longer accurate and needs re-training? You will need data… So, another required component is a “data sampling” component. I say data sampling because you may not need all the data; maybe some sample of it is sufficient. This can be something provided by the model execution environment and configured per use case. You remember the discussion about data reduction earlier? Well, it might be wise to store only samples coming from reduced data… You may also want to store the associated prediction made by the model.

At any rate, you will need a “sample database” which will need to be configured with proper retention policies on a use case basis (unless you want to keep that data for eternity).

As we said, models can drift, so data ops teams will have to monitor that model/use case. To facilitate that, a “model monitoring” component should be available which will take cues from the execution environment itself, but also from the sample database, which means that you will need a way to configure which values are to be watched.
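To make the chain of components a bit more concrete, here is a toy sketch of how a data connector, data reduction, model execution, data sampling, sample database and monitoring step could be wired together. Everything in it is hypothetical (the UseCase class, its field names, the fake latency data and the monitored value); it is not the architecture of any existing product, and the “data dispatcher” rules and “model training” component are deliberately left out to keep it short.

```python
import random
from dataclasses import dataclass, field
from typing import Any, Callable, Iterable, List, Tuple


@dataclass
class UseCase:
    """Hypothetical wiring of the components discussed above."""
    data_connector: Callable[[], Iterable[Any]]  # where the raw data comes from
    data_reduction: Callable[[Any], Any]         # deterministic pre-model step
    model: Callable[[Any], Any]                  # execution of the trained model
    sample_rate: float = 0.1                     # share of reduced records to keep
    sample_database: List[Tuple[Any, Any]] = field(default_factory=list)
    rng: random.Random = field(default_factory=lambda: random.Random(0))

    def run(self) -> List[Any]:
        answers = []
        for raw in self.data_connector():
            reduced = self.data_reduction(raw)
            answer = self.model(reduced)
            answers.append(answer)
            # Data sampling: keep the reduced input together with its prediction.
            if self.rng.random() < self.sample_rate:
                self.sample_database.append((reduced, answer))
        return answers

    def monitor(self) -> float:
        # Toy watched value: share of positive answers among the stored samples.
        if not self.sample_database:
            return 0.0
        positives = sum(1 for _, answer in self.sample_database if answer)
        return positives / len(self.sample_database)


use_case = UseCase(
    data_connector=lambda: [{"latency_ms": v, "noise": "x"} for v in (5, 50, 500)],
    data_reduction=lambda raw: raw["latency_ms"],
    model=lambda latency: latency > 100,
    sample_rate=1.0,  # keep every record so the example is easy to follow
)
answers = use_case.run()
print(answers)                   # [False, False, True]
print(use_case.sample_database)  # [(5, False), (50, False), (500, True)]
```

Each field is a plain callable, which is one way to express that the generic parts (connector, sampling, monitoring) are configured while the use case specific parts (reduction, model) are packaged and plugged in.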

Those cover the most basic components required, but more may be needed. If you are to deploy this environment in a distributed and heterogeneous fashion, you will need some “information transfer” mechanism or component to exchange information in a secured and easy fashion between different domains.

Machine Learning Execution Environment Overview.

You will also need a model orchestrator which will take care of scaling all those parts in or out as needed. And what about model life-cycle management, canary deployment or A/B testing… you see, there is even more to consider there.

One thing to notice is that even at this stage, you only have the model “answer” … you still need to use it in a way which is useful for your use case. Maybe it is a dashboard, maybe it is used to actuate some process… the story simply does not end here.

For my friends at Ericsson, you can find way more information in the memorandum and architecture document I wrote on the subject: “Toward a Machine Learning Deployment Environment”. For the rest of you folks, if you are in the process of establishing such an environment, I hope those few thoughts can help you out.

Cover photo by Frans Van Heerden at Pexels.

Goodbye and Thank You!

The goodbye is not intended for you my blogging crowd! Rather to some other dear friends I will leave behind.

If there is a constant in life, it is change. In a few weeks it will be time to change my place of work. With such a change I need to say goodbye to a lot of nice people and friends I have worked with over the last 21 years. I want to thank you all for the fantastic environment you surrounded me with. I want to thank you all for the great challenges you gave me to undertake and solve. I want to thank you all for the help you provided through all those years for small and big questions, the mentoring and the learning and above all the animated and insightful discussions we had. All of this was enabled by a fantastic workplace. I will never forget Ericsson.

Above all I want to thank my current manager at Ericsson, Steven Rochefort, with whom I have been closely collaborating for 11 of those years. He is a fantastic guy and I will miss him dearly.

Now it is time to say hello to a new workplace. I have already met some wonderful and bright people at Shopify and I’m looking forward to the new challenges opening in front of me! #LifeAtShopify

This kind of message is traditionally expressed via email to a selected crowd on the last day of work. I’m not traditional 🙂 and I’m quite transparent, so you know it all now.

I’ll continue blogging, do not worry. Feel free to continue to contact me on any of my channels, here or elsewhere.

Cover photo by Claudia Beer at Pixabay (bonus points for those who catch the reference).

AI market place is not what you are looking for (in the telecommunication industry).

In a far away land was the kingdom of Kadana. Kadana was a vast country with few inhabitants. The fact that in the warmest days of summer, temperature was seldom above -273°C was probably a reason for it. The land was cold, but people were warm.

In Kadana there were 3 major telecom operators: B311, Steven’s and Telkad. There were also 3 regional ones: Northlink, Southlink and Audiotron. Many neighboring kingdoms also had telecom operators, some a lot bigger than the ones in Kadana. Dollartel, Southtel and Purpletel were all big players, and many more competed in that environment.

It was a time of excitement. A new technology called AI was becoming popular in other fields and the telecommunications operators wanted to get the benefits as well. Before going further in our story, it can be of interest to understand a little bit what this AI technology is all about. Without going into too much detail, let’s just say that traditionally if you wanted a computer to do something for you, you had to feed it a program handcrafted with passion by a software developer. The AI promise was that from now on, you could feed a computer with a ton of data about what you want to be done and it would figure out the specific conditions and provide the proper output without (much) programming. For those aware of AI this looks like an overly simplistic (if not outright false) summary of the technology, but let’s keep it that way for now…

Going back to the telecommunication world, somebody with nice ideas decided to create Akut05. Akut05 was a new product combining the idea of a marketplace with the technology of AI. Cool! The benefit of a market place as demonstrated by the Apple App Store or Google Play, combined with the power of AI.

This is so interesting, I too want to get into that party, and I immediately create my company, TheLoneNut.ai. So now I need to create a nice AI model that I could sell on the Akut05 marketplace platform.

Well, let’s not be so fast… You see, AI models are built from data as I said before. What data will I use? That’s just a small hurdle for TheLoneNut.ai company… we go out and talk with operators. Nobody knows TheLoneNut.ai, it’s a new company, so let’s start with local operators. B311, Steven’s and Telkad all think we are too small a player to give us access to their data. After all, their data is a treasure trove they should benefit from, so why would they give us access to it? We then go to smaller regional players and Northlink shows some interest. They are small and cannot invest massively in a data science team to build nice models, so with a proper NDA, they agree to give us access to their data. In return, they will have access to our model on Akut05 with a substantial rebate.

Good! We need to start somewhere. I’ll skip all the adventures along the way of getting the data, preparing it and building a model… but let me tell you that was full of adventures. We deploy a nice model in an Akut05 store and it works wonderfully… for a while. After some time, the subscribers from Northlink change their behavior a bit, and Northlink sees that our model does not respond properly anymore. How do they figure that out? I have no idea, since Akut05 does not provide any real model monitoring capabilities besides the regular “cloud” monitoring metrics. More alarming, we see 1-star reviews pouring in from B311, Steven’s and Telkad, who tried our model and got poor results from the get-go. And there is nothing we can do about it because, after all, we never got deals with those big names to access their data. A few weeks later, having discounted the model to Northlink and getting only bad press from all other operators, TheLoneNut.ai goes bankrupt and we never hear from it again. The same happens to a lot of other small model developers who tried their hand at it, and in no time the Akut05 store is empty of any valuable model.

So contrary to an App Store, a Model Store is generally a bad idea. To get a model right (assuming you can) you need data. This data needs to come from representative examples of what you want the model to apply to. But it’s easy, we just need all the operators to agree to share their data! Well, if you don’t see the irony, then good luck. But this is a nice story, so let’s put aside the irony. All the operators in our story decide to make their data available to any model developer on the Akut05 platform. What else could go wrong?

Let us think about a model that uses the monthly payment a subscriber pays to the operator. In Kadana this amount is provided in the data pool as $KAD, and it works fine for all Kadanian operators. Dollartel tries it out and (not) surprisingly it fails miserably. You see, in the market of Dollartel, the money in use is not the $KAD, but some other currency… The model builder, even if he has data from Dollartel, may have to do “local” adjustments. Can a model still provide good money to the model builder if the market is small and fractured i.e. needs special care? Otherwise you’ll get 1-star reviews and again disappear after a short while.

Ok, so Akut05 is not a good idea for independent model builders. Maybe it can still be used by Purpletel, which is a big telecom operator that can hire a great number of data scientists. But in that case, if it’s their data scientists who will do the job, why would they share their data? If they don’t share their data and hire their own data scientists, why would they need a market place in the first place?

Independent model builders can’t find their worth from a model market place, operators can’t either… can the telecom manufacturer make money there? Well, why would it be more valuable than for an independent model builder? Maybe it could get easier access to data, but the constraints are basically the same and it wouldn’t be a winning market either, I bet.

Well, therefore a market place for AI is not what you are looking for… In a future post I’ll try to say a little bit about what you should be looking for in the telecom sector when it comes to AI.

For sure this story is an oversimplification of the issue; still, I think we can get the point. You have a different view? Please feel free to share it in the comments below so we can all learn from a nice discussion!

Cover photo by Ed Gregory at Pexels.

How to become a good data scientist

After being so vocal about how to be a bad data scientist, I thought I should level the playing field by giving some hints on how to become a good data scientist. The other side of the coin.

My strong feeling is that if you enter the field just for employment or salary reasons, you start on the wrong foot. You should first look at your passions. Here it is interesting to take a few seconds to look up the word passion as defined on Dictionary.com:


passion [pash-uh n]


  1. any powerful or compelling emotion or feeling, as love or hate.
  2. strong amorous feeling or desire; love; ardor.
  3. strong sexual desire; lust.
  4. an instance or experience of strong love or sexual desire.
  5. a person toward whom one feels strong love or sexual desire.
  6. strong or extravagant fondness, enthusiasm, or desire for anything: a passion for music.
  7. the object of such a fondness or desire: Accuracy became a passion with him.

Hopefully the scope of your passion for data science does not involve definitions 2, 3, 4 or 5, but is driven by a strong fondness and enthusiasm for data science! If so, you are on the right track and my first advice would be: do not try to swallow the ocean in one sip. Zoom in on one aspect of that passion, the one that piqued your interest first. See how you could apply it to a real-world problem and learn along the way. For example, in my case, I got passionate about artificial life a long time ago. That evolved into a fondness for a form of reinforcement learning, namely genetic algorithms and genetic programming, around 2012. As time passed, my interest in machine learning and deep learning grew, and I learned about them by reading books, taking online courses and taking a graduate course while studying for my master’s degree. At that time, I hoped to apply them to the project I had for my master’s thesis, but sometimes plans change. So, in short, you need to follow your heart here.

If you go with such an approach, you will avoid many of the pitfalls I mentioned in the first post. You won’t come to expect a “clean” data set as your input, since you’ll have applied what you learned to a few real-world cases along the way. You will learn how to gather data, how to clean it, how to interpret it… This will benefit you in two ways. First, you will learn one of the essential skills: data cleaning. But most importantly, it will grow your inquisitive mind, something I have never seen a single course manage to do. Again, I do not think this is a skill you can get in a few weeks; it requires a mind shift that you will acquire through repeated practice.

Another benefit of following your passion is that if you don’t already have the necessary mathematical background, you will pick it up along the way. If you find maths hard, it is probably easier to learn it on an as-needed basis as you expand your knowledge through your own passionate experiments! I will also reiterate that, regardless of what you might think or have been told, mathematics is not so hard. Moreover, it is way easier to get if you start with a positive attitude, telling yourself that you can do it.

The next benefit of such an approach is that you will have to define and refine your problem. You will decide what is important to you, what your “research” question is and how it relates to the activities you are doing along the way. When I was doing my master’s degree, I saw two types of students. The first type already had a research agenda, a question they wanted to explore, or at least sat down early with their advisor and set up such a research question in line with their interests and passions. Those students usually gave high-quality presentations, took courses highly relevant to their research questions and became highly proficient in their field of research. The second type waited for their advisors to give them a research project, were never really involved in it, gave average or poor presentations and took courses without really seeing how they related to their research topic (well, in most cases they did not)… In the end they probably still graduated, but with a subject to forget about… You want to be like the first type of student: even if you do it on your own, you want to take control of it and reap the benefits.

Lastly, it is good for you to write or talk about your findings and learnings. I have found that it helps crystallize my thoughts and (sometimes) gets me some feedback from like-minded peers. All this to say that academic papers are not the only way to communicate your findings; blogs, videos and reports can all help you if you have the passion. Sure, one advantage of an academic paper is the peer review system, which provides you with feedback on your research, but you should not limit yourself to that single medium of communication if it is not suited to your reality. State plainly what you found, and do not claim you are something you are not, or not yet. When the time comes, others will recognize you as a data scientist, and that day you will know you are one for sure!

Along the same lines as my previous post, learn hard: it is easier when you are following a personal research or interest goal. Work hard: again, something easier (not necessarily easy) when you follow a passion. And at all times be honest with yourself (but also with others) about what you know or found out. If you think of yourself as a full-grown data scientist on day one, you might not put in the work necessary to ever become one. On the other hand, if you follow your interests and passions, you might become a data scientist before you even think of yourself as one.

Cover photo by Magda Ehlers at Pexels.