Better Software Writing Skills in Data Science: Dead Programs Tell No Lies

This week’s mentoring and post is based on the fourth chapter of “The Pragmatic Programmer: From Journeyman to Master”: Pragmatic Paranoia. The focus of this chapter is on making sure your programs do what they are intended to do, and in the event they don’t perform as expected, that you can figure out why in the fastest way possible.

Design by Contract can be supported in different ways depending on the programming language you use. At the very least, Python 3 lets you express contracts with type hints on function parameters and return values (keep in mind they are not enforced at runtime unless you use a checker). Another good way to enforce contracts is to make assertions (assert) early in your function, or simply raise an exception if something does not respect your contract. Note that a failing assert raises an exception anyway. Think about it: is it better to fail and stop execution, or to populate a table with garbage? When will you figure out it was garbage? Will you be able to correct it at that point in time? Will you still have the underlying data to do the correction?
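
As a minimal sketch of that idea (the function and its contract below are mine, not from the book), type hints state the contract and an early check enforces it:

from typing import List

def average_response_time(samples: List[float]) -> float:
    '''
    Return the mean response time in seconds.
    Contract: samples must be a non-empty list of non-negative numbers.
    '''
    # Fail fast if the caller breaks the contract, instead of silently
    # producing garbage that ends up in a table downstream.
    if not samples:
        raise ValueError("samples must not be empty")
    assert all(s >= 0 for s in samples), "response times cannot be negative"
    return sum(samples) / len(samples)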

Learn when to use assert and/or when to raise an exception, i.e. when you know a boundary, a contract, and you want to enforce it, or at least tell a higher-up program that something is wrong and you cannot do anything about it. When encountering something exceptional, you can either solve it yourself, or let a higher-up program handle it. In the end, if no higher-up program can handle the situation, the overall program will be halted.

I know for some the usage of asserts and exceptions can be obscure, so let me try to build a mental model for you. An exception is an exceptional thing that happens to a program. If you are responsible for the invocation of that program and catch that exception, you have to decide if you have enough context to act on that exceptional thing, or need to raise that exception higher, to the program calling you. In the end, if no one can handle the exception, the program will crash and you can debug what this exception is all about. If you know what to do in that exceptional case, then it’s up to you to handle it. Maybe it’s a retry, maybe you need to use default values, … just be certain of the handling decision, as this might well pollute your data.

As to how to throw an exception higher, there are two ways: assert or raise. Generally, raise will be used when you have an error condition you cannot directly act upon, but a higher-up program might be able to, e.g. a file is not present, or empty, … On the other hand, assert will be used when you can swear this should not happen. In both cases, an exception is thrown higher up.

When you figure out that something “bad” happened to your program, you know it should not happen and you know there is no way around it, the way to throw an exception higher can be via an assert. This will throw an exception to the higher-up program, which will have to decide what to do with it. An example: you expect an int as input and you get a string. This is contract breaking; the higher-up program asked you to handle improper data, so why would your program have to decide why the contract was broken? It should be the responsibility of the caller to handle that exception properly. That might warrant an assert right there.
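
To make the distinction concrete, here is a hedged sketch (the config loader and file name are made up): raise, or simply letting an exception propagate, signals a condition a caller might recover from, while assert flags something that should never happen if the contract is respected.

import json

def load_config(path: str) -> dict:
    # If the file is missing, open() raises FileNotFoundError: an error we
    # cannot act upon here, so we let it propagate to a higher-up program.
    with open(path) as f:
        config = json.load(f)
    # Contract: the config root must be a JSON object; anything else means
    # whoever produced the file broke the contract, which is assert territory.
    assert isinstance(config, dict), "config root must be a JSON object"
    return config

# The caller has the context to decide how to handle the exceptional case.
try:
    config = load_config("settings.json")
except FileNotFoundError:
    config = {}  # fall back to defaults: a deliberate, documented decision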

Depending on the organization you work in, when and how to use exceptions and asserts might get philosophical. On the other hand, it could also be subject to very specific rules. There might be really valid reasons why an organization prefers one approach over another. Learn the rules, and if there are no rules, have a discussion around it and apply your best judgement. In any case, dead programs tell no lies. Better to kill a program than to have to deal with polluted data a year in the future.


Before running, one should learn to walk. Often people come to data science without much knowledge of programming and suddenly are asked to take care of existing ETLs in Python, or to design new ones. The pragmatic approach is to learn programming! As I said earlier, Python is a multi-paradigm language. Procedural programming is probably the most well-known approach, especially if you have used Python in a notebook environment: linear code and the use of functions.

If you use pytest or some other libraries, you probably started wondering a little bit about Object-Oriented Programming. Maybe you just copied the recipe and haven’t looked too deeply into that paradigm. Here’s your chance. I found a well-written primer on Object-Oriented Programming in Python which also links to other resources. If you are to write solid ETLs, you’ll want to have some knowledge of OOP.

If you manipulate data, sooner or later you will want to use lambda functions, map, filter, reduce, … or if you use numpy / pandas, you’ll get interested in apply, etc. Again, you could just follow the recipe, but if you want to get stronger on Functional Programming, I found an interesting primer to Functional Programming in Python. It’s far from complete, but it links to other resources to fill the gaps, and once started, you’ll simply want to learn more!
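
As a small, hedged illustration (the word list is made up), the same filtering can be written with a lambda and filter, with a comprehension, or folded together with functools.reduce:

from functools import reduce

words = ["yoyo", "player", "yours", "puppet"]

# filter + lambda: keep only the words starting with 'y'
y_words = list(filter(lambda w: w.startswith("y"), words))

# the equivalent list comprehension, often considered more Pythonic
y_words_again = [w for w in words if w.startswith("y")]

# reduce: fold the selected words into a single comma-separated string
joined = reduce(lambda acc, w: acc + "," + w, y_words)

print(y_words, y_words_again, joined)  # ['yoyo', 'yours'] ['yoyo', 'yours'] yoyo,yours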

I said it multiple times: practice makes perfect! In the first two weeks, I proposed exercises from CheckiO. Feel free to register an account there and follow the path from island to island to gain knowledge in Python. There are other resources to practice programming as well. Each year since 2015, in the days before Christmas, Advent of Code proposes small problems the Elves and Santa face to deliver gifts! You could also see if there are coding dojos in your area, or virtual ones; I found them a good practice venue as well.

This concludes my complement to the fourth chapter reading of the “The Pragmatic Programmer: From Journeyman to Master” as a medium to improve software writing skills in data science. Next week we will take a look at Chapter 5: Bend or Break.


Cover image by William Adams on Pixabay.

Better Software Writing Skills in Data Science: Tools & Bugs

This week’s mentoring and post is based on the third chapter of “The Pragmatic Programmer: From Journeyman to Master”: The Basic Tools. One of the focuses of that chapter is the tools of the trade, and mastering those tools.

If the only tool you have gotten used to so far is Jupyter or another notebook-like environment, maybe it is time to try different approaches. Only using a notebook-like environment may give you bad programming habits which may be hard to change. Try coding some Python in an IDE with Python support (PyCharm is one), or even in a plain text editor, running the result from the command line. Learning different ways to code will help you understand the costs and benefits of those approaches.

As a data scientist, try different ways to query data; maybe your organization already supports different databases? Learn how to use pyspark, SQL, … Try different deep learning frameworks if this applies to your line of work. The tools are endless. You don’t need a deep knowledge of them all, but you should try them and see their advantages and drawbacks so you can make a conscious choice about which ones are best for you and the task at hand.

Another area where tools are important, and where I’ve seen a lot of junior programmers and data scientists struggle to figure out how to approach it, is debugging their code (or queries).

The focus this week for my article is on debugging. The Python exercise will consist of debugging a small Python function. If you find the bug early, you are probably not so junior in Python! Still, I encourage you to go through the steps as an exercise. Remember, practice makes perfect!

def append_y_words(y_words, base_list=[]):
    '''
    Purpose: Return a list of words from the base_list (if any) followed by
    words starting with 'y' from the y_words list.
    '''
    y_words = [word for word in y_words if word.startswith('y')]
    base_list += y_words
    return base_list

print(append_y_words(["yoyo", "player"]))              # should print ['yoyo']
print(append_y_words(["yours", "puppet"], ["truly"]))  # should print ['truly', 'yours']
print(append_y_words(["yesterday"]))                   # should print ['yesterday'], but does something else

Bring this code into your favorite Python execution environment and let’s follow the debugging procedure outlined in the book.

  1. Get the code cleanly compiled: In the case of Python that shouldn’t be an issue… Python is an interpreted language. But if you keep your code in a Jupyter notebook, for example, you might have forgotten what you ran before. So a good approach would be to restart the kernel and re-execute your code cell-by-cell. Are your Jupyter notebook cells out of order? Maybe you could take the time to put them in the order you want them to be executed.
  2. Understand the bug report: Thoroughly read the function intent and the three “test” statements. Can you tell what the expected behavior is? Maybe you can spot the bug right away!
  3. Reproduce the bug: Is the bug reproducible? In this specific example it should already be easily reproducible, but in real life, you may have to do some work to make it reproducible in one step. Are there any other tests you would like to add to better understand the behavior you see? Can you run those tests and reproduce the problem in one step?
  4. Visualize your data: here many approaches are possible. You could decide to go with pdb, the Python debugger (it’s probably a good idea to get to know a debugger if you don’t already). You may have to query data generated by your Python file if you are dealing, for example, with an ETL. If your ETL is preparing data for some Machine Learning stages, you may want to create a custom way to visualize that “faulty” data.
  5. Tracing: In addition to the use of a debugger, or in this case as an alternative, you could add some tracing to the code. Tracing can be done through log files (in that case I strongly encourage using a logging framework that supports trace levels e.g. INFO, ERROR, DEBUG, … so that you can keep your well-designed traces for future debugging), or for a simple problem like this one, you could simply add some print statements to follow the flow of data. A small logging sketch follows this list.
  6. Looking around corrupted variables: This is not something you’ll normally have to do with Python, but if you are using other programming languages, be aware this is something that might happen. For example, in C, you can write basically anything anywhere in memory, so a simple indexing error from another part of a program could corrupt “your” variables.
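
As hinted at in the tracing step, here is a minimal sketch using Python’s standard logging module on the function from the exercise; running the three test calls at DEBUG level makes the flow of data visible without a debugger:

import logging

logging.basicConfig(level=logging.DEBUG,
                    format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger(__name__)

def append_y_words(y_words, base_list=[]):
    log.debug("called with y_words=%s, base_list=%s", y_words, base_list)
    y_words = [word for word in y_words if word.startswith('y')]
    base_list += y_words
    log.debug("returning %s", base_list)
    return base_list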

You still haven’t spotted the bug? Don’t despair, there are still a few steps you can follow!

  1. Rubber Ducking: Explain the problem to someone else, or if you are in lockdown because of COVID like many of us, explain it to your cat, or if you are stuck alone on a deserted island, explain it to Wilson! Just saying it out loud may help you figure it out.
  2. Process of elimination: As the authors say in the book, it is possible the error is in the OS, the compiler or a third-party product. However, in my 25+ years of career in software, it happened only once that a bug I encountered was traced to the compiler, a long time ago, in a Fortran compiler. And only once could a bug be traced to a faulty hardware design for a computer board. Those bugs are usually extremely hard to figure out, so it’s a good thing they don’t happen often. More often than not, if your bug involves other pieces of software, it is because you don’t interpret the third-party documentation in the right way.
  3. The element of surprise: When you find yourself thinking this bug is impossible, it is time to re-evaluate what you hold for true. Subtle hint, this might be the case here. 

So what is happening here? In this case it is quite simple when you know how Python works. Default arguments in Python are not re-evaluated each time the function is called; they are evaluated only once, when the function definition is executed. Each time the function is called without a second argument, a reference to the same “empty” list is used as the default argument. Any change made to that “empty” list becomes the “new” default argument for subsequent calls that don’t supply a second argument.
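
One common fix (a sketch; there are other ways) is to use None as a sentinel and create a fresh list inside the function on every call:

def append_y_words(y_words, base_list=None):
    '''
    Return a list of words from base_list (if any) followed by the
    words starting with 'y' from y_words.
    '''
    # A new list is created on every call, so calls no longer share state.
    if base_list is None:
        base_list = []
    base_list += [word for word in y_words if word.startswith('y')]
    return base_list

print(append_y_words(["yoyo", "player"]))              # ['yoyo']
print(append_y_words(["yours", "puppet"], ["truly"]))  # ['truly', 'yours']
print(append_y_words(["yesterday"]))                   # ['yesterday'], as intended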

Knowing the tools of the trade includes knowing the programming language you use. There are many nice features in Python as it is a multi-paradigm programming language. You can use a procedural, an object-oriented or a functional programming approach in Python. Those, among others, give Python great flexibility, as well as a great potential to hurt yourself! Take the time to learn the programming language you use, what it makes easy, and where it can make things difficult.

This concludes my complement to the third chapter reading of the “The Pragmatic Programmer: From Journeyman to Master” as a medium to improve software writing skills in data science. Next week we will take a look at Chapter 4: Pragmatic Paranoia.


Cover picture by Nimrod Oren at Pixabay.

Better Software Writing Skills in Data Science: Be DRY

First I’d like to come back to last week’s post and ask you a question. I got a comment from one of my friends who read my blog, telling me something along the lines of: “Apparently, letting your mentee read a book is your way of mentoring”. This made me think… Given punctual mentorship sessions (half an hour each week for a number of weeks) with a mentee who is not working directly with me on anything, and given the goal of improving software development skills in general, and software writing skills in Python more specifically, would anyone have suggestions on better ways of proceeding? I’m genuinely open to suggestions.

For now, let’s continue with the current approach and base this week’s mentoring and post on the second chapter of “The Pragmatic Programmer: From Journeyman to Master”: A Pragmatic Approach. One of the ideas presented in that chapter is what is called the DRY principle. Don’t Repeat Yourself.

Not repeating yourself is certainly an important aspect of software development. Copy-pasting between multiple pieces of code might seem the best way forward, and it’s quick. However, when a bug is discovered in one of those multiple copies, will you make the extra effort of correcting it everywhere (including in others’ work)? Will you even be able to find all those copies? Sometimes the copy-pasted code changes a bit along the way: variable names change, a condition gets added here and there, …
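
As a toy sketch of the refactoring (the cleaning logic below is invented for illustration), the copy-pasted version versus the DRY version:

import pandas as pd

# Copy-pasted in two notebooks (hypothetical cleaning logic):
#   sales["amount"] = sales["amount"].fillna(0).clip(lower=0)
#   refunds["amount"] = refunds["amount"].fillna(0).clip(lower=0)

# DRY version: one function, fixed once, reused everywhere.
def clean_amount(df: pd.DataFrame, column: str = "amount") -> pd.DataFrame:
    cleaned = df.copy()
    cleaned[column] = cleaned[column].fillna(0).clip(lower=0)
    return cleaned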

A few years ago I led a project to build software which detected software clones: “Large scale multi-language clone analysis in a telecommunication industrial setting“. Even when we could identify software clones which had diverged from their original version, it was really hard, if not impossible, to get the software developers to refactor the code based on that insight. When a clone is created, it is hard to kill it! So the best approach is to not create them in the first place, and refactor from the start, creating functions or classes as necessary instead of copy-pasting.

Hopefully I convinced you to be DRY in your software development for data science. But this is true for other aspects of data science as well. Have you ever worked in a place where data definitions depend on a number of columns, maybe from different tables, which are not standardized and are repeated across many dashboards or queries? What is a visitor to your website? Is it any hit on any page of the site? Do you have some sort of sessions in place and count only once per session? Do you have a login and count only the users? Assuming session tracking or user logins, how do you compute the number of visitors? Is it a rolling average over a period of time? What is that period of time? It can easily become a substantial query to determine who is an active user at a specific time. Are you sure everyone in the organization uses the latest and greatest definition? Was the query copy-pasted from dashboard to dashboard? Was the definition slightly modified based on the purpose of a person or a team? You see, don’t repeat yourself also applies to data definitions. There are multiple ways to solve that problem, but none are easy, and the longer you allow code, queries or definitions to be copy-pasted, the harder it will be to kill those clones.

As this second week’s exercise, I propose to identify an area or a specific instance where a lot of clones are made in your team. Find someone (ideally in your team) to discuss it with. Is there any way to bring forward the piece of code, query or definition and start refactoring the related artifacts to eliminate that clone? This might be hard! What is an acceptable, incremental way forward? How can you make sure not to create new clones in the future? These might be the first few steps your team takes to limit and reduce clones in your artifacts.

As for last week, there is much more to that second chapter. Still, for the junior data scientist who wishes to improve his Python programming skills, I’d like to suggest that practice makes perfect. In that vein, this in-theme exercise might prove useful: ADFGVX Cypher

Once you complete this exercise, ask yourself all the questions I proposed for last week’s exercise. More specifically, you should make sure you don’t repeat yourself. A cipher problem is a good example of symmetric operations, and symmetric operations have a good potential for code reuse. If you have not defined functions that you use more than once between the encoding and the decoding, I would suggest you take a second look at your implementation. Also make sure to think about the other questions from last week: Do you break early? Do you respect the contract? Do you document your program? Do you have unit tests?

Again, if you have access to someone you feel comfortable with and who you think writes great code, ask him for his comments on your exercise solution. You might learn a thing or two! Alternatively, once you solve the problem, you may be able to look at some of the others’ best solutions. Reading code from others is also a great way to improve your programming skills.

I know I said one topic per week, but there is another aspect in this chapter worth an exception. I won’t add another exercise for it, just a potentially lifelong quest! The subject is estimation. The book will tell you to keep track of your estimations. I say: don’t simply keep track of those, keep track of everything you do and how long it takes. Early on in your career, keep a log of what you do and how long it takes. As time passes, try to generalize the data you collected, e.g. a data request where all data concepts are well defined takes me on average x hours; creating a dashboard with y graphs / data points using readily available types of graphs takes me on average x hours; … build an estimation library for yourself.

When you start to have a generalized mental model for your personal estimates, make a game out of it. Even if not asked for an estimate, make one and check how good it is afterward. Use gamification to your advantage to make you better at estimating for the future!

If or when you graduate to leading teams, you’ll see that your estimates are a good base as well. You can then apply equivalence factors for your team members as you get to know them. Maybe you’ll even figure out their strengths and weaknesses from that exercise.

This concludes my quick complement to this second week / second chapter of the “The Pragmatic Programmer: From Journeyman to Master” as a medium to improve software writing skills in data science. See you next week for Chapter 3: The Basic Tools.


Cover Image by Chillervirus at Pixabay.

Better Software Writing Skills in Data Science: Broken Window

I’ve recently started mentoring a junior data scientist in the ways of improving his programming skills. It’s often said that data scientists lack programming skills, whether because of their data science program or the diverse backgrounds which led them to data science. Sometimes, however, data scientists are willing to be vulnerable, admit it and improve upon it. I’m currently a data scientist, but given that in my past life I accumulated many years of professional experience as a software developer (among many other things), I felt this mentorship was a good match and thought we could see what we would both learn through it.

Through my software development experience, one influential book I read was “The Pragmatic Programmer: From Journeyman to Master” by Andrew Hunt and David Thomas. I think it is a good starting point for discussion and exercises. The book itself doesn’t have that many chapters, so my initial intention is to discuss one chapter each week with my mentee and complement the reading with some exercises. More specifically, my mentee wants to improve his Python programming skills, so we’ll target the exercises to that language.

You may ask me: Why is a programmer book good for a data scientist, especially a book which doesn’t discuss a language in depth? Well, I think there are many things which are important to programmers that are as important if not more to data scientists. For example, in the first chapter, there is a section about communication. If you haven’t been told yet, communication is key to data science!

I intend this to be a series of posts that will highlight the most important parts of the week’s chapter as well as some discussion about the weekly Python exercise I proposed to my mentee. So without further ado, for the first week I proposed to read the Preface as well as Chapter 1: A Pragmatic Philosophy.

In data science as in programming in general, you have to adjust your approach to fit the current circumstances. Those adjustments will come from your background and your experience. In the same way as a Pragmatic Programmer does, a Pragmatic Data Scientist gets the job done, and does it well.

Rather than going into the details of a chapter, I’ll focus on one aspect of it. In that sense, this blog post is complementary to the reading of the corresponding chapter in “The Pragmatic Programmer: From Journeyman to Master”. For this week I want to focus on the “Broken Window”. Building from the Broken Window Theory, when you leave a broken window on a building, it instills a sense of abandonment for that building. From that sense of abandonment, people will start to feel that no one cares about the building. Further windows will be broken, graffiti will be drawn and other damage will be done to that building. A parallel can be drawn to software: when you have a “broken window”, people will try to stay away from that piece of code and won’t have the reflex to try to improve it.

As this first week’s exercise, I propose to identify a broken window in the code owned by your team and find someone (ideally in your team) to discuss it with. How bad is the situation? How could we improve it? Better yet, what can be done to fix that broken window? Maybe taking those first few steps will be the beginning of a new attitude for your team around broken windows and technical debt.

There is much more to that first chapter. Still, for the junior data scientist who wishes to improve his Python programming skills, I’d like to suggest that practice makes perfect. In that vein, this in-theme exercise might prove useful: Broken Window.

Once you complete this exercise ask yourself those questions:

  • Do you make sure your program breaks early if the contract is not followed, e.g. with initial asserts? In this exercise we expect a list of lists of int; if you receive something else you should break right there (see the sketch after this list). There are also other preconditions you may want to check.
  • Conversely if inputs are allowed by the contract, do you make sure you can handle fringe/boundary values? Do you know how your program should react?
  • Have you written functions each handling specific tasks? If you see that some part of the program repeats itself, have you taken steps to re-factor those parts so that you don’t repeat yourself?
  • Do you document your program in any way? Here, team culture may vary, but still, it is good practice to leave breadcrumbs for people who will follow (even if it’s future you!). Do your variable names make sense and are they part of that documentation?
  • Have you gone the extra mile of creating unit tests? Many frameworks can be used for that purpose. Ideally you’ll make yourself knowledgeable about the one your team uses the most.
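
As promised above, a minimal sketch of breaking early on the contract (the helper name and checks are mine, not part of the exercise statement):

from typing import List

def validate_input(grid: List[List[int]]) -> None:
    '''Break early if the input does not respect the contract.'''
    assert isinstance(grid, list) and grid, "expected a non-empty list"
    assert all(isinstance(row, list) and row for row in grid), "expected non-empty rows"
    assert all(isinstance(x, int) for row in grid for x in row), "expected ints only"

validate_input([[1, 2], [3, 4]])   # passes silently
# validate_input([[1, "2"]])       # would raise AssertionError: expected ints only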

Maybe you’ll think some of those questions are overkill for such a small program, but again, practice makes perfect. If you don’t practice those on small problems, the chances are you won’t think of doing them on larger ones.

If you have access to someone you feel comfortable with and who you think writes great code, ask him for his comments on your exercise solution. You might learn a thing or two!

This concludes my quick complement to this first week / first chapter of the “The Pragmatic Programmer: From Journeyman to Master” as a medium to improve software writing skills in data science. See you next week for Chapter 2: A Pragmatic Approach.


Cover photo from Cottonbro on Pexels.

Who are those Strangers?

This post is a follow-up to Who am I connected to? As stated in the previous post, a problem that arises a lot is figuring out how things are connected. Is this server directly or indirectly connected to that pool? Or who am I connected to through a chain of friends? If you ever have to implement such an algorithm (and for that you can refer to my previous post), one thing you might encounter are superstars, false friends or black holes. Name them the way you want 😉 . Those are “nodes” which are connected to an abnormally high number of other nodes. Well, when someone has 50k friends, you should be suspicious that those are not all real friends! The problems those fake friends cause are manifold.

First, if you go through the process described last time, you get a number of very high density groups which normally wouldn’t be grouped together were it not for those black hole nodes. This may well make any conclusion pointless, so you should take care of removing (or not considering) those superstar nodes to start with.

Second, assuming you start with big data, joining a number of those superstars on themselves will lead to an exponential growth of your data set (at least temporarily) and it will take forever to complete the associated Spark tasks (if they succeed at all). OK, those might be legit friends; in that case you might not have a choice and maybe Fighting the Skew in Spark can help you solve that issue. But otherwise, if those are indeed false friends, you should remove those black hole nodes beforehand.

In an ever changing world of data, it may not be easy to spot those black holes, but a good first filter may be as simple as (using PySpark notation this time, just to keep you on your toes):

import pyspark.sql.functions as F

filter_out = (
    node_table
    .groupBy('node')
    .count()
    .filter(F.col('count') > black_hole_threshold)
)

The nodes captured by that filter-out “rule” can then be automatically removed from your node table, or examined and added to black lists if need be. To automatically remove the filter_out nodes from your node_table, the left anti join is your friend!

output = (
    node_table
    .join(
        filter_out.select('node'),
        on='node',
        how='left_anti')
)

You still need to perform the connection finding algorithm on this “output”, but at least you will have removed from your inputs all the nodes with an abnormal number of connections, above black_hole_threshold.

What else can go wrong? Again, if you have big data, this process as a whole (especially since it is iterative) can take some serious time to execute. Moreover, even with the black holes removed, the join-on-itself part may consume a lot of resources from your cluster. The interesting part is that if you keep your “node” definition constant, you could run the algorithm in an online, additive fashion, which would run faster because most of the data wouldn’t change and would already be reduced to who is whose friend, so only the additional delta would in fact “move”. I know it is not that simple and quick, but it is still quicker than running the process on the initial input data again and again…

Again, I hope this can be of help. If you apply this method or another equivalent one, let me know and let’s discuss our experiences!


Cover photo by Felix Mittermeier at Pixabay.

Who am I connected to?

A problem that arises a lot when you play with data is figuring out how things are connected. It could be, for example, to determine from all your friends, your friends’ connections, your friends’ friends’ connections, … to whom you are directly or indirectly connected, or how many degrees of separation you have with such and such a connection. Luckily there are some tools at your disposal to perform such analysis. Those tools come under the umbrella of Network Theory and I will cover some basic tricks in this post.

First let’s go with some terminology. Nodes are the things we are connecting, e.g. you, your friends, your friends’ friends. Edges are how those nodes are connected. For example, below, node 0 connects to nodes 1 and 2 using two edges to describe those connections. Node 1 connects to node 3 via one edge, etc. For this first example we use uni-directional edges, but nothing prevents us from using bi-directional ones. In general, if all edges are bi-directional we will talk of an undirected graph, which would be the case for friends (usually), since you know your friend, and he knows you as well!

A first important concept to introduce in network analysis is that of an adjacency matrix. The adjacency matrix is a matrix representing the edge connections between the nodes. The first row in the adjacency matrix represents the connections of node 0. Thus, node 0 connects to nodes 1 and 2, but not to itself or to node 3. So the first row is 0, 1, 1, 0. The second row represents the connections of node 1, which only connects to node 3. So the second row is 0, 0, 0, 1. Note that we could have bi-directional connections; in such a case the connection would appear in both the row and the column, but this is not the case in this example.

By inspecting the adjacency matrix, we can reconstruct the node/edge graph. It informs us of the first hop connections: who your friends are. But how can we know about the second hop connections, e.g. that node 0 is connected to node 3 via node 1 and node 2? A really simple way is to multiply the adjacency matrix by itself (A*A). The result of this multiplication is the second hop connections. Here, we see that node 0 connects through 2 hops to node 1 (via node 2), and connects through 2 hops to node 3. We can even see that there are 2 such connections in two hops from node 0 to node 3. Lastly we see that node 2 connects through 2 hops to node 3 (via node 1).

If we were to multiply again A*A by A itself, we would get the three hop connections, which in this case is limited to node 0 being connected to node 3.
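
As a quick sketch (the adjacency matrix below is reconstructed from the description of the 4-node example), numpy can do the hop counting for us:

import numpy as np

# Directed example: node 0 -> 1, 2; node 1 -> 3; node 2 -> 1, 3
A = np.array([[0, 1, 1, 0],
              [0, 0, 0, 1],
              [0, 1, 0, 1],
              [0, 0, 0, 0]])

print(A @ A)      # two-hop connections, e.g. row 0 is [0, 1, 0, 2]
print(A @ A @ A)  # three-hop connections: only node 0 -> node 3 remains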

In general, the networks that will interest us are way bigger than this simple 4-node diagram. Also, in general, not every node is connected to every other node. Well, they say everyone is connected to everyone by six degrees of separation (six hops), but for most other practical applications, not all nodes are connected to each other. Let’s take a look at a bigger example to see how the principles illustrated above can apply at scale. Let’s assume the following non-directional network graph. Since we have a non-directional network graph, you will see the connection values appear in both the rows and the columns. This special case shows a symmetry about the diagonal.

As before, if we compute A*A, we get the second hop connections. Notice that nodes become connected to themselves via a second hop. For example, node 1 is connected 3 times to itself through a second hop, via nodes 0, 7 and 8.

If you are interested in all the first hop connections and the second hop connections, you could add A*A and A together, leading to the following matrix. You could proceed forward to find the third hops onward, but in this example nothing else is connected, so although the numbers you see here would grow, the pattern of zeros would not change. We have found all connections of this graph. We found that node 0 is connected to nodes 1, 7 and 8. Nodes 2, 3 and 4 are connected. Nodes 5, 6 and 9 are connected. Finally we see that node 10 is not connected to any other node.

In practice the matrix multiplication works well to find the next hop neighbours. If it also happens that for your problem (as in the one above) most connections are non-existent, i.e. 0, then you could use sparse matrices to store (and potentially compute with) your adjacency matrix. However, those quickly become really huge matrices which require a lot of operations to compute. A nice trick if you are using SQL or Spark could be to use joins on tables.

To do so, you need to reverse the problem on its head. Instead of creating an Adjacency matrix of how the nodes are connected, you will create a table of the connections. So to keep with our second example, you could have something like the following network graph being turned into a node/connection table.

Node Connection
0 A
1 A
1 B
7 B
1 C
8 C

Now that we have that node/connection table, our goal will be to reduce the number of connections to the minimum possible and in the end get something like the following as a way to see everything connected (we won’t care about how many hops lead us there).

To get there we will iterate through a two step process. First we will perform connection reduction and then update the node/connection table. Then we rinse and repeat until we can no longer reduce the number of connections.

Assuming the above node/connection table (node_connections), we can reduce the number of connections via the following SQL query and store the result as the new_connections table:

SELECT A.connection, MIN(B.connection) AS new_connection
FROM node_connections AS A
JOIN node_connections AS B
ON A.node = B.node
GROUP BY A.connection

Then you can update the node_connections table with the following SQL query:

SELECT DISTINCT B.new_connection AS connection, A.node
FROM node_connections AS A
JOIN new_connections AS B
ON A.connection = B.connection

You iterate those two steps until the node_connections table changes no more, et voilà, you have a map of all nodes connected through distinct connections.

This is only one of the possible use cases, but for large scale applications it is probably easier and quicker to join tables than to create and multiply adjacency matrices. I showed the logic with SQL, but obviously you could achieve similar results using Spark (for my specific application, I use pySpark).
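
For completeness, here is a hedged pySpark sketch of the same two-step iteration (column names follow the SQL above; consider it a sketch, not production code):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# node_connections: a DataFrame with columns 'node' and 'connection'
node_connections = spark.createDataFrame(
    [(0, 'A'), (1, 'A'), (1, 'B'), (7, 'B'), (1, 'C'), (8, 'C')],
    ['node', 'connection'])

while True:
    # Step 1: for each connection, find the smallest connection sharing a node.
    new_connections = (
        node_connections.alias('A')
        .join(node_connections.alias('B'), on='node')
        .groupBy(F.col('A.connection'))
        .agg(F.min(F.col('B.connection')).alias('new_connection')))
    # Step 2: relabel the nodes with the reduced connection identifiers.
    updated = (
        node_connections
        .join(new_connections, on='connection')
        .select(F.col('new_connection').alias('connection'), 'node')
        .distinct())
    # Stop when the table no longer changes.
    if (updated.subtract(node_connections).count() == 0
            and node_connections.subtract(updated).count() == 0):
        break
    node_connections = updated

node_connections.show()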

If you have questions or interesting ideas of application of the network theory to some problem, feel free to jump in the conversation!


Cover photo by Michael Gaida at Pixabay.

Automation and Sampling

As I mentioned earlier, I have transitioned from Ericsson to Shopify. As part of this transition I am getting a taste of public transport (previously work was a 15-minute drive from home; now, working in the city center, I have to take the train to commute). This morning there was a woman just beside me who was working on her computer, apparently editing a document, or more probably writing comments in it. A few minutes into her edits, she made a phone call, asking someone over the phone to change some wording in the document and painfully dictating those changes (a few words). This game went on a few times: writing comments in the document, then calling someone to make the appropriate edits. What could have been a simple edit-then-send-the-document-via-email became an apparently painful exercise in dictation. The point is not to figure out why she was not sending the document via email; it could be as simple as not having a data plan and not being willing to wait for a wifi connection, who knows. But the usage of a non-automated “process” made something which is ordinarily quite simple (editing a couple of sentences in a document) a painful dictation experience. This also has the consequence of limited bandwidth, and thus only a few comments can make their way into corrections on that document.

This reminded me of a conversation I had with a friend some time ago. He mentioned the pride he took in having put in place a data pipeline at his organization for the extraction, transformation and storage of two data sources in a local database. Some data is generated by a system in his company. Close to that data source he has a server which collects and reduces / transforms the data and stores the results in the local file system as text files. Every day he looks at the extraction process on that server to make sure it is still running, and every few days he downloads the new text files from that server to a server farm and a database he usually uses to perform his analysis on the data. As you can see this is also a painful, non-automated process. As a consequence, the amount of data is most probably more limited than it could be with an automated process, as my friend has to cater to the needs of those pipelines manually.

At Shopify I have the pleasure of having access to an automated ETL (Extract, Transform and Load) process for the data I may want to analyze. If you want to get a feel of what is available at Shopify with respect to ETL, I invite you to watch the Data Science at Shopify video presentation from Françoise Provencher, who touches a bit on that as well as the other aspects of the job of a data scientist at Shopify. In short, we use pyspark with custom libraries developed by our data engineers to extract, transform and load data from our sources into a front room database which anyone in the company can use to get information and insight about our business. If you listen through Françoise’s video, you will understand that one of the benefits of that automated ETL scheme is that we transform the raw data (mostly unusable) into information that we store in the front room database. This information is then available for further processing to extract valuable insight for the company. You immediately see the benefit. Once such a pipeline is established, it performs its work autonomously and, as an added benefit, thanks to our data engineering team, monitors itself all the time. Obviously if something goes wrong somebody will have to act and correct the situation, but otherwise, you can forget about that pipeline and its always up-to-date data is available to all. No need for a single person to spend a sizable amount of time monitoring and manually importing data. A corollary to this is that the bandwidth for new information is quite high and we get a lot of information on which we can do our analysis.

Having that much information at our fingertips brings on new challenges mostly not encountered by those who have manual pipelines. It becomes increasingly difficult and inefficient to do analysis on the whole population. You need to start thinking in terms of samples. There are a couple of considerations to keep in mind when you do sampling: the sample size and what you are going to sample.

There are mathematical and analytical ways to determine your sample size, but a quick way to get it right is to start with a modest random sample, perform your analysis, look at your results and keep them. Then you redo the cycle a few times and watch whether you keep getting the same results. If your results vary wildly, you probably do not have a big enough sample. Otherwise you are good. If it is important for future repeatability to be as efficient as possible, you can try to reduce your sample size until your results start to vary (at which point you should revert to the previous sample size), but if not, good enough is good enough! Just remember that those samples must be random! If you redo your analysis using the same sample over and over again, you haven’t proven anything. In SQL terms, it is the difference between:

SELECT * FROM table
TABLESAMPLE BERNOULLI(10)

Which would produce a random sample of 10% of the table; on the other hand:

SELECT * FROM table
LIMIT 1000

Will most likely always produce the same first 1000 elements… this is not a random sample!
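
Coming back to the quick sample-size check described above, here is a small sketch (with made-up data) of re-running the same analysis on several independent random samples to see whether the results hold steady:

import numpy as np
import pandas as pd

# Hypothetical population: order values for 1,000,000 orders.
rng = np.random.default_rng(42)
population = pd.Series(rng.exponential(scale=50.0, size=1_000_000))

# Re-run the analysis (here, a simple mean) on several independent random
# samples; if the results barely move, the sample size is big enough.
for i in range(5):
    sample = population.sample(n=10_000, random_state=i)
    print(f"sample {i}: mean = {sample.mean():.2f}")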

The other consideration you should keep in mind is what you should sample, i.e. what is the population you want to observe. Let’s say, to use an example in the lingo of Shopify, I have a table of all my merchants’ customers which, amongst other things, contains a foreign key to an orders table. If I want to get a picture of how many orders a customer makes, the population under observation is the customers, not the orders. In other words, in that case I should randomly sample my customers, then look up how many orders they have. I should not sample the orders, aggregate those per customer and hope this will produce the expected results.

Visually, we can see that sampling from orders will lead us to wrongly think each customer makes on average two orders. Random resampling will lead to the same erroneous result.

 

Sampling orders leads to the wrong conclusion that each customer makes an average of 2 orders.

Whereas sampling from customers will lead to the correct answer that each customer performs on average four orders.

 

Sampling customers leads to the right conclusion that each customer makes an average of 4 orders.
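
A hedged pandas sketch of the same idea (the customer and order data below are made up):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical data: 10,000 customers, each with 1 to 7 orders (average 4).
customers = pd.DataFrame({'customer_id': range(10_000)})
orders = pd.DataFrame({
    'customer_id': np.repeat(customers['customer_id'].values,
                             rng.integers(1, 8, size=len(customers)))})

# Wrong: sample the orders, then aggregate per customer (biased low).
sampled_orders = orders.sample(frac=0.1, random_state=1)
print(sampled_orders.groupby('customer_id').size().mean())

# Right: sample the customers, then look up all of their orders (close to 4).
sampled_customers = customers.sample(frac=0.1, random_state=1)
their_orders = orders[orders['customer_id'].isin(sampled_customers['customer_id'])]
print(their_orders.groupby('customer_id').size().mean())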

To summarize, let’s just say that if you have manual (or even semi-manual) ETL pipelines, you need to automate them to give you consistency and throughput. Once this is done, you will eventually discover the joys (and need) of sampling. When sampling, you must make sure you select the proper population to sample from and that your sample is randomly selected. Finally, you could always analytically find the proper sample size, but with a few trials you will most probably be just fine if your findings stay consistent through a number of random samples.


Cover photo by Stefan Schweihofer at Pixabay.

Where the F**k do I execute my model?

or: Toward a Machine Learning Deployment Environment.

Nowadays, big names in machine learning have their own data science analysis environments and in-production machine learning execution environments. The others have a mishmash of custom-made parts, or are lucky enough that an existing commercially available machine learning environment fits their needs and they can use it. There are several data science environments commercially available; Gartner mentions the most known players (although new ones pop up every week) in its Magic Quadrant for Data Science and Machine-Learning Platforms. However, most (if not all) of those platforms suffer from a limitation which might prevent some industries from adopting them. Most of those platforms start with the premise that they will execute everything on a single cloud (whether public or private). Let’s see why this might not be the case for every use case.

Some machine learning models might need to be executed remotely. Think, for example, of the autonomous vehicle industry. Latency and security prevent execution in a cloud (unless that cloud is onboard the vehicle). Some industrial use cases might require models to be executed in an edge-computing or fog-computing fashion to satisfy latency requirements. Data sensitivity in some industries may require the execution of some algorithms on customer equipment. There are many more reasons why you may want to execute your model in some location other than the cloud where you did the data science analysis.

As said before, most commercially available offerings do not cater to that requirement. And it is not a trivial thing one can slap on top of an existing solution as a simple feature. In some cases there are profound implications to allowing such a distributed and heterogeneous analysis and deployment environment. Let’s just look at some of the considerations.

First, one must recognize there is a distinction between the machine learning model and the complete use case to be covered, or as some would like to call it, the AI. A machine learning model is simply provided a set of data and gives back an “answer”. It could be a classification task, a regression or prediction task, etc., but this is where a machine learning model stops. To get value from that model, one must wrap it in a complete use case; some call that an AI. How do you reliably acquire the data it requires? How do you present or act on the answer given by the model? Those, and many more questions, need to be answered by a machine learning deployment environment.

Recognizing that, one of the first things required to deploy a full use case is access to data. In most industries, the sources of data are limited (databases, web queries, csv files, log files, …) and the way to handle them is repetitive, i.e. once I have figured out a way to do database queries, the next time most of my code will look the same, except for the query itself. As such, data access should be facilitated by a machine learning deployment environment, which should provide “data connectors” that can be configured for the needs and deployed where the data is available.

Once you have access to data, you will need “rules” as to when the machine learning model needs to be executed: is it once a day, on request, … Again, there are many possibilities (although when you start thinking about it, a lot are the same), but expressing those “rules” should be facilitated by the deployment environment so that you don’t have to rewrite a new “data dispatcher” for every use case, but simply configure a generic one.

Now we have data and we are ready to call a model, right? Not so fast. Although some think of data preparation as part of the model, I would like to consider it an intermediary step. Why, would you say? Simply because data preparation is a deterministic step where there should be no learning involved, and because in many cases you will significantly reduce the size of the data in that step, data that you might want to store to monitor the model behavior. But I’ll come back to this later. For now, just consider there might be a need for “data reduction”, and this one cannot be generic. You can think of it as a pre-model which formats the data in a way your model is ready to use. The deployment environment should facilitate the packaging of such a component and provide ways to easily deploy it (again, anywhere it needs to be).

We are now ready for the machine learning execution! You have already produced a model from your data science activities and this model needs to be called. As with the “data reduction”, the packaging and deployment of the “model execution” component should be facilitated by the deployment environment.

For those who have been through the loops of creating models, you certainly have the question: but how have you trained that model? So yes, we might need a “model training” component, which is also dependent on the model itself. A deployment environment should also facilitate the use/deployment of a training component. However, this begs another important question: where does the data used for training come from? And what if the model drifts, is no longer accurate and needs re-training? You will need data… So, another required component is a “data sampling” component. I say data sampling because you may not need all the data; maybe a sample of it is sufficient. This can be something provided by the model execution environment and configured per use case. You remember the discussion about data reduction earlier? Well, it might be wise to store only samples coming from reduced data… You may also want to store the associated predictions made by the model.

At any rate, you will need a “sample database” which will need to be configured with proper retention policies on a use case basis (unless you want to keep that data for eternity).

As we said, models can drift, so data ops teams will have to monitor that model/use case. To facilitate that, a “model monitoring” component should be available which will take cues from the execution environment itself, but also from the sample database, which means that you will need a way to configure what are the values to be watched.

Those cover the most basic components required, but more may be needed. If you are to deploy this environment in a distributed and heterogeneous fashion, you will need some “information transfer” mechanism or component to exchange information in a secure and easy fashion between different domains.
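
To make that component list a bit more tangible, here is a purely hypothetical sketch of the interfaces such a deployment environment could expose (none of these names come from an existing product):

from abc import ABC, abstractmethod
from typing import Any, Dict, Iterable

class DataConnector(ABC):
    '''Configurable access to a data source, deployed where the data lives.'''
    @abstractmethod
    def fetch(self) -> Iterable[Dict[str, Any]]: ...

class DataReducer(ABC):
    '''Deterministic, use-case specific preparation of the raw data.'''
    @abstractmethod
    def reduce(self, record: Dict[str, Any]) -> Dict[str, Any]: ...

class ModelExecutor(ABC):
    '''Wraps the trained model and returns its "answer".'''
    @abstractmethod
    def predict(self, features: Dict[str, Any]) -> Any: ...

class SampleStore(ABC):
    '''Keeps samples of reduced data and predictions for monitoring and re-training.'''
    @abstractmethod
    def store(self, features: Dict[str, Any], prediction: Any) -> None: ...

A “data dispatcher”, “model training” and “model monitoring” component would complete the picture, each configured per use case rather than rewritten from scratch.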

Machine Learning Execution Environment Overview.

You will also need a model orchestrator which will take care of scaling in or out all those parts on a need basis. And what about the model life-cycle management, canary deployment or A/B testing… you see, there is even more to consider there.

One thing to notice is that even at this stage, you only have the model “answer” … you still need to use it in a way which is useful for your use case. Maybe it is a dashboard, maybe it is used to actuate some process… the story simply does not end here.

For my friends at Ericsson, you can find way more information in the memorandum and architecture document I wrote on the subject: “Toward a Machine Learning Deployment Environment”. For the rest of you folks, if you are in the process of establishing such an environment, I hope those few thoughts can help you out.


Cover photo by Frans Van Heerden at Pexels.

Goodbye and Thank You!

The goodbye is not intended for you my blogging crowd! Rather to some other dear friends I will leave behind.

If there is a constant in life, it is change. In a few weeks it will be time to change my place of work. With such a change I need to say goodbye to a lot of nice people and friends I have worked with over the last 21 years. I want to thank you all for the fantastic environment you surrounded me with. I want to thank you all for the great challenges you gave me to undertake and solve. I want to thank you all for the help you provided through all those years for small and big questions, the mentoring and the learning, and above all the animated and insightful discussions we had. All of this was enabled by a fantastic workplace. I will never forget Ericsson.

Above all I want to thank my current manager at Ericsson, Steven Rochefort, with whom I have been closely collaborating for 11 of those years. He is a fantastic guy and I will miss him dearly.

Now it is time to say hello to a new workplace. I have already met some wonderful and bright people at Shopify and I’m looking forward to the new challenges opening in front of me! #LifeAtShopify

This kind of message is traditionally expressed via email to a selected crowd on the last day of work. I’m not traditional 🙂 and I’m quite transparent, so you know it all now.

I’ll continue blogging, do not worry. Feel free to continue to contact me on any of my channels, here or elsewhere.


Cover photo by Claudia Beer at Pixabay (bonus points for those who catch the reference).

AI market place is not what you are looking for (in the telecommunication industry).

In a far away land was the kingdom of Kadana. Kadana was a vast country with few inhabitants. The fact that in the warmest days of summer, temperature was seldom above -273°C was probably a reason for it. The land was cold, but people were warm.

In Kadana there were 3 major telecom operators: B311, Steven’s and Telkad. There were also 3 regional ones: Northlink, Southlink and Audiotron. Many neighboring kingdoms also had telecom operators, some a lot bigger than the ones in Kadana. Dollartel, Southtel, Purpletel were all big players, and many more competed in that environment.

It was a time of excitement. A new technology called AI was becoming popular in other fields and the telecommunications operators wanted to get the benefits as well. Before going further in our story, it may be of interest to understand a little bit what this AI technology is all about. Without going into too much detail, let’s just say that traditionally, if you wanted a computer to do something for you, you had to feed it a program handcrafted with passion by a software developer. The AI promise was that from now on, you could feed a computer a ton of data about what you want done and it would figure out the specific conditions and provide the proper output without (much) programming. For those aware of AI this looks like an overly simplistic (if not outright false) summary of the technology, but let’s keep it that way for now…

Going back to the telecommunication world, somebody with nice ideas decided to create Akut05. Akut05 was a new product combining the idea of a marketplace with the technology of AI. Cool! The benefit of a market place as demonstrated by the Apple App Store or Google Play, combined with the power of AI.

This is so interesting, I too want to get into that party, and I immediately create my company, TheLoneNut.ai. So now I need to create a nice AI model that I could sell on the Akut05 marketplace platform.

Well, let’s not go so fast… You see, AI models are built from data, as I said before. What data will I use? That’s just a small hurdle for the TheLoneNut.ai company… we go out and talk with operators. Nobody knows TheLoneNut.ai, it’s a new company, so let’s start with the local operators. B311, Steven’s and Telkad all think we are too small a player to give us access to their data. After all, their data is a treasure trove they should benefit from, so why would they give us access to it? We then go to the smaller regional players, and Northlink has some interest. They are small and cannot invest massively in a data science team to build nice models, so with a proper NDA, they agree to give us access to their data; in return, they will have access to our model on Akut05 with a substantial rebate.

Good! We need to start somewhere. I’ll skip all the adventures along the way of getting the data, preparing it and building a model… but let me tell you, that was full of adventures. We deploy a nice model in an Akut05 store and it works wonderfully… for a while. After some time, the subscribers from Northlink change their behavior a bit, and Northlink sees that our model does not respond properly anymore. How do they figure? I have no idea, since Akut05 does not provide any real model monitoring capabilities besides the regular “cloud” monitoring metrics. More alarming, we see 1-star reviews pouring in from B311, Steven’s and Telkad, who tried our model and got poor results from the get-go. And there is nothing we can do about it, because after all we never got deals with those big names to access their data. A few weeks later, having discounted the model to Northlink and getting only bad press from all other operators, TheLoneNut.ai goes bankrupt and we never hear from it again. The same happens to a lot of other small model developers who tried their hand at it, and in no time the Akut05 store is empty of any valuable model.

So contrary to an App Store, a Model Store is generally a bad idea. To get a model right (assuming you can), you need data. This data needs to come from representative examples of what you want the model to apply to. But it’s easy, we just need all the operators to agree to share their data! Well, if you don’t see the irony, then good luck. But this is a nice story, so let’s put aside the irony. All the operators in our story decide to make their data available to any model developer on the Akut05 platform. What else could go wrong?

Let’s think about a model that uses the monthly payment a subscriber pays to the operator. In Kadana this amount is provided in the data pool as $KAD, and it works fine for all Kadanian operators. Dollartel tries it out and (not) surprisingly it fails miserably. You see, in the market of Dollartel, the money in use is not the $KAD, but some other currency… The model builder, even if he has data from Dollartel, may have to make “local” adjustments. Can a model still provide good money to the model builder if the market is small and fractured, i.e. needs special care to be taken? Otherwise you’ll get 1-star reviews and again disappear after a short while.

OK, so Akut05 is not a good idea for independent model builders. Maybe it can still be used by Purpletel, a big telecom operator which can hire a great number of data scientists. But in that case, if it’s their own data scientists who will do the job, why would they share their data? And if they don’t share their data and hire their own data scientists, why would they need a marketplace in the first place?

Independent model builders can’t find their worth in a model marketplace, and operators can’t either… can the telecom manufacturer make money there? Well, why would it be more valuable for them than for an independent model builder? Maybe they could get easier access to data, but the prerogatives are basically the same, and I bet it wouldn’t be a winning market either.

So, a marketplace for AI is not what you are looking for… In a later post I’ll try to say a little bit about what you should be looking for in the telecom sector when it comes to AI.

For sure this story is an oversimplification of the issue, still, I think we can get the point. You have a different view? Please feel free to share it in the comments below so we can all learn from a nice discussion!


Cover photo by Ed Gregory at Pexels.