Better Software Writing Skills in Data Science: Pragmatic Projects

Definition of pragmatic (Merriam-Webster)

1: relating to matters of fact or practical affairs often to the exclusion of intellectual or artistic matters: practical as opposed to idealistic.

But can you really exclude all matters of the heart? Can you remove all politics from technical decisions? Maybe thinking so is itself idealistic! Being pragmatic means putting factual information and practical aspects ahead of other considerations, but those other considerations can also be facts you have to deal with. Being truly pragmatic might not be so much about cutting the link between the mind and the heart as about finding the right balance, where you privilege facts while still accounting for other pursuits, so that in the end you get a practical solution.

This week’s mentoring and post is based on the eighth and last chapter of “The Pragmatic Programmer: From Journeyman to Master”: Pragmatic Projects. This chapter wraps up everything we have learned up to this point and provides a good summary of those lessons.

If you want to live the pragmatic data scientist (or programmer) life, you have to apply pragmatism to every aspect of how your team works. You ought to automate every repetitive task and make sure testing of your artifacts is automated as well. No manual steps that need to be repeated over time.

Pragmatic teams do not allow Broken Windows (see the first post of the series). They also keep monitoring their “environment” and do not let it degrade over time. Even a slow degradation, if left unchecked, will amount to large problems in the future. This can creep up in many ways: data growing linearly may become a challenge if left unnoticed. You may need to adjust your resources as time passes, so why not automate that? If you go down that path, make sure you put limits on the increase factor allowed over time if you don’t want to end up with a huge bill! Now what about exponential growth of data? If that growth follows your “customer” base, adjusting the resources might not be a financial problem, but what if it doesn’t? You should address the issue as soon as possible. Maybe there are other ways to aggregate the data while keeping it relevant.

As we have seen in previous chapters, communication is king: communication within the team as well as communication with stakeholders and other parts of the organization. I often found that the best advice this book provides on that subject is to create a brand. When you start a project, give it a name and always refer to it by that name. If the name can, by itself, create a mental image of what you are trying to achieve, all the better.

Then there are a bunch of reminders. Don’t repeat yourself! Refactor to eliminate those repetitions. Create orthogonality, not by organizing around job functions, but around functionality.

Value testing, and repeatable testing. Value good data-scientist-level documentation as well. Not documentation made for the sake of a process, or for other parties who won’t read it (if they are to read it, do write it, but make sure it answers their needs and nothing more). In a few months, when or if you need to get back to that piece of code or algorithm you designed, will you be able to remember all the small decisions you made? I bet not. And if you leave the organization, will the team still be able to maintain and enhance that functionality? Keep documentation as close to the code as possible. If you have concepts that span a lot of code, e.g. system-level information, make it meaningful for yourself and your fellow data scientists first. Make it accessible, and don’t repeat information found at levels closer to the code or algorithm. I especially like tip 67: “Treat English as just another programming language”. This means that everything that applies to your code should apply to your documentation…
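As a small illustration of documentation living next to the code, a docstring that records the small decisions is often all you need. The feature and thresholds below are hypothetical, purely for illustration:

def flag_churn_risk(days_since_last_visit, purchases_last_90d):
    """Flag a customer as a churn risk.

    Decision log: we use 30 days of inactivity rather than 60 because the
    weekly retention campaign needs earlier signals, and we require fewer
    than 2 purchases to avoid flagging seasonal buyers.
    """
    return days_since_last_visit > 30 and purchases_last_90d < 2

In a few months, the “why” behind those thresholds is right there, at the level closest to the code.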

And most importantly, remember that good enough is good enough. There is no perfection in this world and even if you could achieve it, the goal would move away the next day.

This concludes our series on reading through “The Pragmatic Programmer: From Journeyman to Master”. I hope this can help you become a better programmer for data science! Keep on programming, keep on practicing!


Cover image by Gordon Johnson at Pixabay.

Better Software Writing Skills in Data Science: Before the Project

This week’s mentoring and post is based on the seventh chapter of “The Pragmatic Programmer: From Journeyman to Master”: Before the Project. Going through this chapter you’ll notice we get one step away from coding. However, having good project preparation hygiene will ensure you have success in your data science projects and limit the surprises at the time of coding.

Reading through this chapter, some data scientists will find comfort in seeing that digging for requirements is not an activity asked only of data scientists but of software developers as well. In data science as in software engineering, requirements don’t simply lie on the floor, ready to be picked up and cherished; you’ll have to dig them out first. Requirements in data science take different forms depending on the type of project you are working on. Is a classification into a few groups sufficient, or do you need a finer regression? Would it be acceptable to allow more false positives if it improves recall, or would false positives be prohibitive to handle afterward? If you are building a dashboard, what would be the proper metrics to monitor the project? When building an experiment, what should the groups you define be?

The requirements from a data science point of view will be different from those from a software development point of view; however, if your project builds data as a product, they might well overlap a lot more than you expect. One thing stays true though: work with a user to think like a user. There’s nothing better than working with the intended user to understand what he’ll need from you.

Document those requirements. But do not over-document them! Always remember that some things are better done than described. Once written down (or drawn), you have a base for discussion and for figuring out the details which could otherwise be misunderstood by the different stakeholders. If formal methods, or parts of such methods, can help you, fine, go for it. However, don’t make the mistake of thinking the method is an end in itself. The concept of a use case is as applicable to data science as to software development. Figuring out who the actors are, what they expect from the system and what you will provide to them is essential to building strong results. Requirement gathering, design and implementation are all different facets of delivering a quality solution. Don’t skimp on any of those phases.

Solving impossible puzzles is also a hallmark of data science. Here, I really like the “find the box” expression the author proposes. It’s not always a question of thinking outside the box or inside the box. Rather, you should figure out what the box is: what the real, absolute constraints are, while getting rid of imagined, preconceived notions. This is true for small brain teasers as well as for real-life data science problems. Sometimes you just have to step back and ask yourself those few questions proposed in the book:

  • Is there an easier way?
  • Are you trying to solve the right problem?
  • Why is this a problem?
  • What makes it hard to solve?
  • Does it need to be done that way?
  • Does it need to be solved at all?

By asking yourself those questions and being honest with yourself, you might be surprised by the answer!

This concludes my complement to the seventh chapter reading of the “The Pragmatic Programmer: From Journeyman to Master” as a medium to improve software writing skills in data science. Next week, the last chapter of the book, and my last post in this series. We will then take a look at Chapter 8: Pragmatic Projects.


Cover image by tookapic from Pixabay.

Better Software Writing Skills in Data Science: While You Are Coding

This week’s mentoring and post is based on the sixth chapter of “The Pragmatic Programmer: From Journeyman to Master”: While You Are Coding. The focus of this chapter is on some of the little things you should keep in mind while programming. Coding is an iterative process and needs attention. If you don’t give it the deserved attention you might fall into coding by coincidence. Let’s take a look at some of the subjects presented in the book.

Coding by coincidence versus programming deliberately

If it ain’t broke, don’t fix it. That might be true for trusted pieces of code, but while you are coding something new, or adding new lines to existing code, you can’t be sure it ain’t broke until it gets its trial by fire. And by then it might be too late. Someone might get affected by the bug, and fixing it will take longer than if it had been detected earlier. So be deliberate about what you are coding and why you are coding it. Don’t be shy about refactoring on the go as you get a better understanding of what you need to code. Simplify things, make sure not to repeat yourself. It’s not because a piece of code was there before that it still needs to be there. Constantly re-evaluate what you are coding.

Algorithm speed

Especially in data science, you will encounter cases where figuring out an algorithm’s speed will help you understand why an analysis works in some cases and simply takes too much time as the data grows. If you rely a lot on Jupyter notebooks, realize that most computing resources nowadays have multiple cores. If you want to benefit from multi-core processing, you’ll have to adapt your code to make use of it. For example, in scikit-learn, many estimators and functions accept an n_jobs parameter to tell them how many processors they can use.
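As a minimal sketch, using the real n_jobs parameter of scikit-learn estimators and of cross_val_score (the dataset here is synthetic):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=10000, n_features=20, random_state=42)

# n_jobs=-1 asks scikit-learn to use all available cores, both to train
# the forest and to run the cross-validation folds in parallel.
model = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=42)
scores = cross_val_score(model, X, y, cv=5, n_jobs=-1)
print(scores.mean())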

If you rely on Python loops, processing can become quite long as your data grows. Maybe some of those looping algorithms could be converted to matrix operations using numpy? If not, getting used to multiprocessing in Python would be beneficial. This way you can split your processing over a number of processes which will benefit from your available cores. However, you’ll need to think about reducing the results; maybe it’s as simple as joining all the outputs together, or maybe it’s more complex… knowing the possibility exists can make the difference between a program that runs for hours and one that runs for days.
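Here is a small sketch of what the loop-to-numpy conversion can look like, computing distances to a reference point first with a Python loop and then with a single matrix operation:

import numpy as np

rng = np.random.default_rng(0)
points = rng.random((1000000, 3))
reference = np.array([0.5, 0.5, 0.5])

# Loop version: one Python-level iteration per row.
def distances_loop(points, reference):
    return [np.sqrt(((p - reference) ** 2).sum()) for p in points]

# Vectorized version: one matrix operation over the whole array.
def distances_numpy(points, reference):
    return np.sqrt(((points - reference) ** 2).sum(axis=1))

The vectorized version typically runs orders of magnitude faster, because the loop happens in compiled code rather than in the Python interpreter.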

To make multiprocessing easy, I made a couple of small helper classes some time ago. Feel free to use them if you need.

import multiprocessing as mp
import os
import sys
from functools import partial

import numpy as np


class simpleMultiprocessing:
    '''
    This class makes multiprocessing easy.
    :param elements: A list of elements that can be split in smaller chunks and processed in parallel.
    :param f_map: A function which takes a list of elements (normally a sublist of "elements") and processes it.
    :param f_reduce: [Optional] A callback function called each time f_map returns from processing a sublist of elements. The function takes the return value of f_map as input.
    :param nProcesses: [Optional] Number of processes to spawn, default is twice the number of available processors.
    :param verbose: [Optional] When set to True, displays the steps of multiprocessing.
    '''
    # If the CPU_LIMIT environment variable is not set, fall back to mp.cpu_count().
    def __init__(self, elements, f_map, f_reduce=None,
                 nProcesses=max(1, int(2. * float(os.getenv('CPU_LIMIT', mp.cpu_count())))),
                 verbose=True):
        n_elements = len(elements)
        pool = mp.Pool(processes=nProcesses)
        # Split the elements into nProcesses roughly equal chunks.
        boundaries = [int(i) for i in list(np.linspace(0, n_elements, nProcesses + 1))]
        for p in range(nProcesses):
            start = boundaries[p]
            stop = boundaries[p + 1]
            if verbose: print('simpleMultiprocessing::Creating a process for elements {}-{}'.format(start, stop - 1))
            if f_reduce:
                pool.apply_async(f_map, args=[elements[start:stop]], callback=f_reduce)
            else:
                pool.apply_async(f_map, args=[elements[start:stop]])
        pool.close()
        if verbose: print('simpleMultiprocessing::All jobs submitted')
        pool.join()
        if verbose: print('simpleMultiprocessing::All jobs ended')


def _f_map_capsule(f, elements):
    # Process elements one by one so a single failing element does not lose the whole chunk.
    results = []
    for item in elements:
        try:
            result = f([item])
        except Exception:
            tmp = {
                'return_status': 'exception',
                'return_error': sys.exc_info()[1],
                'processed_element': item
            }
        else:
            tmp = {
                'return_status': 'successful',
                'return_result': result,
                'processed_element': item
            }
        results.append(tmp)
    return results


def _f_reduce_capsule(f, verbose, results):
    # Unwrap the per-element results, report exceptions and forward successes to the user's reducer.
    if results:
        for result in results:
            if result and 'return_status' in result:
                if result['return_status'] == 'exception':
                    print('An error occurred…')
                    if 'processed_element' in result: print('  while processing item: {}'.format(result['processed_element']))
                    if 'return_error' in result: print('  exception: {}'.format(result['return_error']))
                elif result['return_status'] == 'successful':
                    if verbose == 'extra':
                        print('Received results…')
                        if 'processed_element' in result: print('  for item: {}'.format(result['processed_element']))
                        if 'return_result' in result: print('  result: {}'.format(result['return_result']))
                    if f is not None: f(result['return_result'])


class protectedMultiprocessing:
    '''
    This class makes multiprocessing easy and adds a protection layer which helps debug which element failed.
    :param elements: A list of elements that can be split in smaller chunks and processed in parallel.
    :param f_map: A function which takes a list of elements (normally a sublist of "elements") and processes it.
    :param f_reduce: [Optional] A callback function called each time f_map returns from processing a sublist of elements. The function takes the return value of f_map as input.
    :param nProcesses: [Optional] Number of processes to spawn, default is twice the number of available processors.
    :param verbose: [Optional] When set to True, displays the steps of multiprocessing.
    '''
    def __init__(self, elements, f_map, f_reduce=None,
                 nProcesses=max(1, int(2. * float(os.getenv('CPU_LIMIT', mp.cpu_count())))),
                 verbose=True):
        n_elements = len(elements)
        pool = mp.Pool(processes=nProcesses)
        boundaries = [int(i) for i in list(np.linspace(0, n_elements, nProcesses + 1))]
        # Wrap the user's map and reduce functions in capsules which catch and report exceptions.
        f_map_capsule = partial(_f_map_capsule, f_map)
        f_reduce_capsule = partial(_f_reduce_capsule, f_reduce, verbose)
        for p in range(nProcesses):
            start = boundaries[p]
            stop = boundaries[p + 1]
            if verbose: print('protectedMultiprocessing::Creating a process for elements {}-{}'.format(start, stop - 1))
            if f_reduce:
                pool.apply_async(f_map_capsule, args=[elements[start:stop]], callback=f_reduce_capsule)
            else:
                pool.apply_async(f_map_capsule, args=[elements[start:stop]])
        pool.close()
        if verbose: print('protectedMultiprocessing::All jobs submitted')
        pool.join()
        if verbose: print('protectedMultiprocessing::All jobs ended')

Usage is quite simple. Here is a first, simple example; the reducing function is embedded in the call itself (results.append).

def my_list_processing(l):
    return sum(l)

elements = [i for i in range(1000)]
results = []
simpleMultiprocessing(elements, my_list_processing, results.append, verbose=True)
print(results)
result = sum(results)
print(result)
simpleMultiprocessing::Creating a process for elements 0-124
simpleMultiprocessing::Creating a process for elements 125-249
simpleMultiprocessing::Creating a process for elements 250-374
simpleMultiprocessing::Creating a process for elements 375-499
simpleMultiprocessing::Creating a process for elements 500-624
simpleMultiprocessing::Creating a process for elements 625-749
simpleMultiprocessing::Creating a process for elements 750-874
simpleMultiprocessing::Creating a process for elements 875-999
simpleMultiprocessing::All jobs submitted
simpleMultiprocessing::All jobs ended
[7750, 23375, 39000, 54625, 70250, 85875, 101500, 117125]
499500

You can easily add a proper reducing function as in this next example.

def my_list_processing(l):
    return sum(l)

elements = [i for i in range(1000)]
results = []

def reducing(e):
    results.append(e)

simpleMultiprocessing(elements, my_list_processing, reducing, verbose=True)
print(results)
result = sum(results)
print(result)
simpleMultiprocessing::Creating a process for elements 0-124
simpleMultiprocessing::Creating a process for elements 125-249
simpleMultiprocessing::Creating a process for elements 250-374
simpleMultiprocessing::Creating a process for elements 375-499
simpleMultiprocessing::Creating a process for elements 500-624
simpleMultiprocessing::Creating a process for elements 625-749
simpleMultiprocessing::Creating a process for elements 750-874
simpleMultiprocessing::Creating a process for elements 875-999
simpleMultiprocessing::All jobs submitted
simpleMultiprocessing::All jobs ended
[7750, 23375, 39000, 54625, 70250, 101500, 117125]
413625

If you need to debug one of those multiprocessing functions, it might be helpful to switch from the simpleMultiprocessing class to protectedMultiprocessing. The next example shows how helpful that can be by voluntarily introducing an exception for one of the elements. In the simpleMultiprocessing case, if an exception occurs, we silently lose that job and all of its results. Here we become aware of the problem and still get the other results. The drawback is that the reducing function is called more often, but in normal cases the hard work is done by the mapping function anyway.

def my_list_processing(l):
    if 666 in l: return 666/0
    return sum(l)

elements = [i for i in range(1000)]
results = []

def reducing(e):
    results.append(e)

protectedMultiprocessing(elements, my_list_processing, reducing, verbose=True)
print(results)
result = sum(results)
print(result)
protectedMultiprocessing::Creating a process for elements 0-124
protectedMultiprocessing::Creating a process for elements 125-249
protectedMultiprocessing::Creating a process for elements 250-374
protectedMultiprocessing::Creating a process for elements 375-499
protectedMultiprocessing::Creating a process for elements 500-624
protectedMultiprocessing::Creating a process for elements 625-749
protectedMultiprocessing::Creating a process for elements 750-874
protectedMultiprocessing::Creating a process for elements 875-999
protectedMultiprocessing::All jobs submitted
An error occurred...
  while processing item: 666
  exception: integer division or modulo by zero
protectedMultiprocessing::All jobs ended
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, [...], 663, 664, 665, 667, 668, 669, 670, [...], 995, 996, 997, 998, 999]
498834

You can do even more interesting things with multiprocessing; it’s worth learning. And again, for simple multiprocessing use cases, feel free to use my classes above.

And more

There is more to this sixth chapter: refactoring, unit testing and evil wizards. I won’t go into all the details, as the book does a good job, but I will just say: don’t be shy about refactoring while you code. Many times, as I added complexity to my work, I saw there were ways to be DRY (Don’t Repeat Yourself). By refactoring, I was able to simplify the flow and even make the code more understandable. Parametrization can bring you a long way here.

Lastly, I beg you, write unit tests! And save them along with your code. There are good unit test frameworks in Python: unittest ships with Python, I often use pytest, and you can complement them with mock or hypothesis. Even without a framework you can do a good unit testing job using bare asserts, so there is no excuse. The book gives excellent pointers as to what the prime targets of unit tests should be. When you update the code later, and that probability is high, you’ll be happy to have existing unit tests to verify the base.
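As a minimal sketch of what that can look like with pytest (the cleaning module and normalize_country_code helper below are hypothetical stand-ins for your own code):

# test_cleaning.py -- run with `pytest` from the project root
import pytest

from cleaning import normalize_country_code  # hypothetical helper under test


def test_known_alias_is_normalized():
    assert normalize_country_code("U.S.A.") == "US"


def test_unknown_code_raises():
    with pytest.raises(ValueError):
        normalize_country_code("not-a-country")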

This concludes my complement to the sixth chapter reading of the “The Pragmatic Programmer: From Journeyman to Master” as a medium to improve software writing skills in data science. Next week we will take a look at Chapter 7: Before the Project.


Cover image by Gerd Altmann from Pixabay.

Better Software Writing Skills in Data Science: Bend or Break

This week’s mentoring and post is based on the fifth chapter of “The Pragmatic Programmer: From Journeyman to Master”: Bend or Break. The focus of this chapter is on making sure your programs, ETLs or models can be changed in the future when new realities arise. Aside from one-off prototypes that are meant to be thrown away (but often get turned into production code anyway), code has to evolve. Whether it is ETLs or models, your code will have to evolve, and if you want to be efficient about it, you had better take some hints from the start.

Let’s first agree that code will need to change: very seldom is useful code written once, never to change, and a lot of code writing is not the blank-page type but the editing of existing code (even legacy code). Some principles help in making sure future updates to the code are easier:

  • low coupling (does each part of the system depend on only a few components, or is the coupling of the code a spider web of dependencies?),
  • reversibility (if I want to change the database access, do I need to change code everywhere, or is the access abstracted so the required change can be localized to one file or component?),
  • metaprogramming (is there a way to minimize the code written via properly written reusable pieces, data-driven code, or parameterizable configurations?),
  • temporal coupling (does the code have a strong dependency on events coming in a certain order? Are there feedback loops?),
  • using views to separate presentation from the data model, and the blackboard technique: anonymous, asynchronous exchange of information between modules.

I won’t go into all the details of these principles, as I think the book does a really good job at explaining them, but maybe we can take a few data science related examples to complement what that book has to offer.

Low Coupling

Often in data science we create derived data assets from existing ones by creating ETLs. Then we derive further assets from those. We mix and match data and may even inadvertently create feedback loops through those data assets. You should think early about a way to map those data asset dependencies and decide which rules you can apply in your organization to ensure low coupling between them. In data science, data is the most valuable asset. Don’t turn it into spaghetti data! Yes, it will be as bad as spaghetti code.

Reversibility

If it hasn’t happened to you yet, feel lucky. But at some point, your organization which settled on database provider A or cloud infrastructure provider B will decide to switch to provider C. Why? Cost, reliability, characteristics, … the reasons are endless. If it is not the database side or the cloud infrastructure side, it might well be the deep learning framework or some other infrastructure your organization has procured from a third party. If such a transition has not been thought about before putting in place your software architecture, it might become a hell of a transition.

Metaprogramming

I remember that in one of my first jobs, we were doing geolocalization assuming a flat earth. It was not a problem, since the area covered was small enough not to show artifacts of the earth’s curvature. One day a customer required us to cover a large area. There was no way around it: we had to consider latitude, longitude and the curvature of the earth. It was hell to decouple things that were so deeply embedded in each component. What’s easier than computing distance or bearing on a flat earth… so each component was doing it on its own. The first step was to abstract those functions and create a single component to do those calculations (filling in a high-coupling hellhole).

Next, we decided that since some customers might want a flat earth, others a spherical model, an elliptical model, or even more complex earth models, we would make the earth model parameterizable. Not only were distance and bearing calculations now decoupled, the earth model itself was parameterized. However, it was a big undertaking to do that decoupling so late in development.

The same thing can happen with data. Maybe today you can process your ETL with a relatively small compute unit, but what about tomorrow? Better to think of parameterizing this early, so if you need more power, you can just adjust that parameter. Could the input and output dataset names and paths also be easily interchangeable parameters? Maybe the filtering conditions of an ETL depend on the use case, but all the processing is the same. Instead of having multiple copies of pretty much the same ETL, one that receives the filtering conditions as a parameter would make your life easier. There is probably a ton of things you can parameterize early on that will make your life easier in the future. Just take the time to think about it.
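A rough sketch of what that can look like with PySpark; the paths, columns and aggregation here are hypothetical, the point being that one function serves several use cases through its parameters:

from pyspark.sql import SparkSession, functions as F

def run_daily_aggregation(input_path, output_path, filter_condition, n_partitions=200):
    # One generic ETL: source, destination, filter and sizing are parameters
    # instead of copy-pasted variants of the same job.
    spark = SparkSession.builder.getOrCreate()
    (spark.read.parquet(input_path)
        .filter(filter_condition)
        .groupBy('customer_id')
        .agg(F.sum('amount').alias('total_amount'))
        .repartition(n_partitions)
        .write.mode('overwrite').parquet(output_path))

# The same code covers different cases just by changing the parameters:
# run_daily_aggregation('/data/sales', '/data/sales_ca', F.col('country') == 'CA')
# run_daily_aggregation('/data/sales', '/data/sales_big', F.col('amount') > 1000, n_partitions=400)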

Temporal Coupling

Are there parts of your ETLs which could be performed in parallel? Think about where parallel processing is possible and design your system accordingly. Maybe today it is fine to run it serially, but tomorrow, to meet your Service Level Objective (SLO), you may have to do parallel processing at the expense of using more resources. Future you will thank you for thinking of it early on! I know it first hand: a chain of ETLs I designed started taking too long, and by changing only a few parameters I was able to run the parallelizable parts in parallel and speed up the transformations.

Views

The Model View Controller design pattern might not seem so applicable to data science, but the principle behind it can be a good source of inspiration. Using the other techniques presented above, it may be a good idea to keep your sources configurable yet sharing similar characteristics. This way, you could develop models which are parameterizable (potentially including the training parameters or a training parameters file) and easily swap them or even use them in parallel. The serving of the ETLs or models could also be parameterizable and swappable. That could be extended to the evaluation of the models, or the health monitoring of the ETLs. The good decoupling principles of Model View Controller are still very welcome in a data science environment. Just think a bit differently!

This concludes my complement to the fifth chapter reading of the “The Pragmatic Programmer: From Journeyman to Master” as a medium to improve software writing skills in data science. Next week we will take a look at Chapter 6: While You Are Coding.


Cover picture from unknown author on Pixabay.

Better Software Writing Skills in Data Science: Dead Programs Tell No Lies

This week’s mentoring and post is based on the fourth chapter of “The Pragmatic Programmer: From Journeyman to Master”: Pragmatic Paranoia. The focus of this chapter is on making sure your programs do what they are intended to do, and in the event they don’t perform as expected, that you can figure out why in the fastest way possible.

Design by Contract can be supported in different ways depending on the programming language you use. In Python 3, type hints on function parameters and return values document the contract (and tools like mypy can check it statically), even though they are not enforced at runtime. Another good way to enforce contracts is to make assertions (assert) early in your function, or simply raise an exception if something is not as per your contract. Note that a failed assert raises an exception anyway. Think about it: is it better to fail and stop execution, or to populate a table with garbage? When will you figure out it was garbage? Will you be able to correct it at that point in time? Will you still have the underlying data to do the correction?

Learn when to use assert and when to raise an exception, i.e. when you know a boundary, a contract, and you want to enforce it, or at least tell a higher-up program that something is wrong and you cannot do anything about it. When encountering something exceptional, you can either handle it yourself or let a higher-up program handle it. In the end, if no higher-up program can handle the situation, the overall program will halt.

I know that for some people the usage of asserts and exceptions can be obscure, so let me try to build a mental model for you. An exception is an exceptional thing that happens to a program. If you are responsible for the invocation of that program and catch that exception, you have to decide whether you have enough context to act on that exceptional thing or need to raise the exception higher, to the program calling you. In the end, if no one can handle the exception, the program will crash and you can debug what the exception is all about. If you know what to do in that exceptional case, then it’s up to you to handle it. Maybe it’s a retry, maybe you need to use default values… just be certain of the handling decision, as this might well pollute your data.

As to how to throw an exception higher, there are two ways: assert or raise. Generally, raise is used when you have an error condition you cannot directly act upon but a higher-up program might, e.g. a file is not present, or is empty, … On the other hand, assert is used when you can swear this should not happen. In both cases, an exception is thrown higher up.

When you figure out that something “bad” happened to your program, you know it should not happen and there is no way around it, the way to throw an exception higher can be an assert. This throws an exception to the higher-up program, which will have to decide what to do with it. An example: you expect an int as input and you get a string. That breaks the contract; the higher-up program asked you to handle improper data, so why would your program have to figure out why the contract was broken? It should be the caller’s responsibility to handle that exception properly. That might warrant an assert right there.
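To make the distinction concrete, here is a minimal sketch (the file format and loader are hypothetical): raise for conditions a caller can plausibly recover from, assert for conditions that simply break the contract.

import os

def load_daily_counts(path):
    # A condition a higher-up program may be able to handle (retry, defaults, ...):
    # raise a specific exception and let the caller decide.
    if not os.path.exists(path):
        raise FileNotFoundError('daily counts file is missing: {}'.format(path))

    with open(path) as f:
        counts = [int(line) for line in f]

    # A condition that should never happen if the contract is respected:
    # assert, and let the program die loudly instead of pushing garbage downstream.
    assert all(c >= 0 for c in counts), 'negative counts break the contract'
    return counts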

Depending on the organization you work in, when and how to use exceptions and asserts might get philosophical. It could also be subject to very specific rules. There might be really valid reasons why an organization prefers one approach over another. Learn the rules, and if there are no rules, have discussions around it and apply your best judgement. In any case, dead programs tell no lies. Better to kill the program than to have to deal with polluted data a year down the road.


Before running, one should learn to walk. Often people come to data science without much programming knowledge and are suddenly asked to take care of existing ETLs in Python, or to design new ones. The pragmatic approach is to learn programming! As I said earlier, Python is a multi-paradigm language. Procedural programming is probably the best-known approach, especially if you have used Python in a notebook environment: linear, top-to-bottom scripts and the use of functions…

If you use pytest or some other libraries, you probably started wondering a little bit about Object Oriented Programming. Maybe you just copied the recipe and haven’t looked too deeply into that paradigm. Here’s your chance. I found this well-written primer on Object Oriented Programming in Python, which also links to other resources. If you are to write solid ETLs, you’ll want to have some knowledge of OOP.

If you manipulate data, sooner or later you will want to use lambda functions, map, filter, reduce, … and if you use numpy or pandas, you’ll get interested in apply, etc. You could just follow the recipe, but if you want to get stronger on Functional Programming, I also found an interesting primer on Functional Programming in Python. It’s far from complete, but it links to other resources to fill the gaps, and once started, you’ll simply want to learn more!
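Here is a tiny sketch of that style, first with plain map, filter and reduce, then with the pandas equivalent (the amounts and the 15% tax are made up for the example):

from functools import reduce
import pandas as pd

amounts = [12.5, 99.0, 7.25, 45.0]

# Plain Python: map, filter and reduce with lambda functions.
taxed = list(map(lambda a: a * 1.15, amounts))
large = list(filter(lambda a: a > 40, taxed))
total = reduce(lambda x, y: x + y, large, 0)

# The same flavour of thinking with pandas.
df = pd.DataFrame({'amount': amounts})
df['taxed'] = df['amount'].apply(lambda a: a * 1.15)
total_pd = df.loc[df['taxed'] > 40, 'taxed'].sum()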

I have said it multiple times: practice makes perfect! In the first two weeks, I proposed exercises from CheckiO. Feel free to register an account there and follow the path from island to island to gain knowledge in Python. There are other resources to practice programming as well. Each year since 2015, in the days before Christmas, Advent of Code proposes small problems the Elves and Santa face to deliver gifts! You could also see if there are coding dojos in your area, or virtual ones; I have found them a good practice venue as well.

This concludes my complement to the fourth chapter reading of the “The Pragmatic Programmer: From Journeyman to Master” as a medium to improve software writing skills in data science. Next week we will take a look at Chapter 5: Bend or Break.


Cover image by William Adams on Pixabay.

Better Software Writing Skills in Data Science: Tools & Bugs

This week’s mentoring and post is based on the third chapter of “The Pragmatic Programmer: From Journeyman to Master”: A Pragmatic Approach. One of the focuses in that chapter is about the tools of the trade, and mastering those tools.

If the only tool you have gotten used to so far is Jupyter or another notebook-like environment, maybe it is time to try different approaches. Only using a notebook-like environment may give you bad programming habits which can be hard to change. Try coding some Python in an IDE with Python support (PyCharm is one), or even in plain text files that you execute on the command line. Learn different ways to code; this will help you understand the costs and benefits of those approaches.

As a data scientist, try different ways to query data; maybe your organization already supports different databases? Learn how to use pyspark, SQL, … Try different deep learning frameworks if this applies to your line of work. The tools are endless. You don’t need deep knowledge of them all, but you should try them and weigh their advantages and drawbacks so you can make a conscious choice about which ones are best for you and the task at hand.

Another area where tools are important, and where I’ve seen a lot of junior programmers and data scientists struggle, is debugging their code (or queries).

The focus this week for my article is on debugging. The Python exercise consists of debugging a small Python function. If you find the bug early, you are probably not so junior in Python! Still, I encourage you to go through the steps as an exercise. Remember, practice makes perfect!

def append_y_words(y_words, base_list=[]):
    '''
    Purpose: Return a list of words from the base_list (if any) followed by
    words starting with 'y' from the words list.
    '''
    y_words = [word for word in y_words if word.startswith('y')]
    base_list += y_words
    return base_list

print(append_y_words(["yoyo", "player"]))              # should print ['yoyo']
print(append_y_words(["yours", "puppet"], ["truly"]))  # should print ['truly', 'yours']
print(append_y_words(["yesterday"]))                   # should print ['yesterday'], but does something else

Bring this code into your favorite Python execution environment and let’s follow the debugging procedure outlined in the book.

  1. Get the code cleanly compiled: In case of Python that shouldn’t be an issue… Python is an interpreted language. But, if you keep your code in a Jupyter notebook for example, you might have forgotten what you ran before. So a good approach would be to restart the kernel and re-execute your code cell-by-cell. Are your Jupyter notebook cells out of order? Maybe you could take time to put them in the order you want them to be executed.
  2. Understand the bug report: Thoroughly read the function’s intent and the three “test” statements. Do you understand what the expected behavior is? Maybe you can spot the bug right away!
  3. Reproduce the bug: Is the bug reproducible? In this specific example it should already be easily reproducible, but in real life, you may have to do some work to make it reproducible in one step. Are there any other tests you would like to add to better understand the behavior you see? Can you run those tests and reproduce the problem in one step?
  4. Visualize your data: here many approaches are possible. You could decide to go with pdb, the Python debugger (it’s probably a good idea to get to know a debugger if you don’t already). You may have to query data generated by your Python file if you are dealing, for example, with an ETL. If your ETL is preparing data for some Machine Learning stages, you may want to create a custom way to visualize that “faulty” data.
  5. Tracing: In addition to the use of a debugger, or in this case alternatively, you could add some tracing to the code. Tracing can be done through log files (and in that case I strongly encourage using a logging framework that supports trace levels e.g. INFO, ERROR, DEBUG, … so that you can keep your well designed traces for future debugging), or for a simple problem like this one, you could simply decide to add some print statements to follow the flow of data.
  6. Looking for corrupted variables: This is not something you’ll normally have to do with Python, but if you are using other programming languages, be aware this is something that can happen. For example, in C, you can write basically anything anywhere in memory, so a simple indexing error in another part of a program could corrupt “your” variables.

You still haven’t spotted the bug? Don’t despair, there are still a few steps you can follow!

  1. Rubber Ducking: Explain the problem to someone else, or if you are in lockdown because of COVID like many of us, explain it to your cat, or if you are stuck alone on a deserted island, explain it to Wilson! Just saying it out loud may help you figure it out.
  2. Process of elimination: As the author says in the book, it is possible the error is in the OS, the compiler or a third-party product. However, in my 25+ years in software, only once has a bug I encountered been traced to the compiler (a long time ago, in a Fortran compiler), and only once to a faulty hardware design on a computer board. Those bugs are usually extremely hard to figure out, so it’s a good thing they don’t happen often. More often than not, if your bug involves other pieces of software, it is because you are not interpreting the third-party documentation in the right way.
  3. The element of surprise: When you find yourself thinking this bug is impossible, it is time to re-evaluate what you hold for true. Subtle hint, this might be the case here. 

So what is happening here? In this case it is quite simple once you know how Python works. Default arguments in Python are not re-evaluated each time the function is called, but only once, when the function definition is executed. Each time the function is called without a second argument, a reference to the same “empty” list is used as the default argument. Any change made to that “empty” list becomes the “new” default for subsequent calls that don’t supply a second argument.
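The usual fix is the None sentinel pattern: keep an immutable default and create the fresh list inside the function.

def append_y_words(y_words, base_list=None):
    '''
    Purpose: Return a list of words from the base_list (if any) followed by
    words starting with 'y' from the words list.
    '''
    if base_list is None:
        base_list = []  # a fresh list on every call, not one shared between calls
    base_list += [word for word in y_words if word.startswith('y')]
    return base_list

print(append_y_words(["yoyo", "player"]))              # ['yoyo']
print(append_y_words(["yours", "puppet"], ["truly"]))  # ['truly', 'yours']
print(append_y_words(["yesterday"]))                   # ['yesterday'], as expected this time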

Knowing the tools of the trade includes knowing the programming language you use. There are many nice features in Python, as it is a multi-paradigm programming language. You can use a procedural approach, an object-oriented approach or a functional programming approach in Python. Among other things, this gives Python great flexibility, as well as a great potential to hurt yourself! Take the time to learn the programming language you use, what it makes easy, and where it can make things difficult.

This concludes my complement to the third chapter reading of the “The Pragmatic Programmer: From Journeyman to Master” as a medium to improve software writing skills in data science. Next week we will take a look at Chapter 4: Pragmatic Paranoia.


Cover picture by Nimrod Oren at Pixabay.

Better Software Writing Skills in Data Science: Be DRY

First I’d like to come back to last week’s post and ask you a question. I got a comment from one of my friends who read my blog, telling me something along the lines of: “Apparently, letting your mentee read a book is your way of mentoring”. This made me think… Given short weekly mentorship sessions (half an hour each week for a number of weeks) with a mentee who is not working directly with me on anything, and given the goal of improving software development skills in general, and software writing skills in Python more specifically, would anyone have suggestions on better ways of proceeding? I’m genuinely open to suggestions.

For now, let’s continue with the current approach and base this week’s mentoring and post on the second chapter of “The Pragmatic Programmer: From Journeyman to Master”: A Pragmatic Approach. One of the ideas presented in that chapter is what is called the DRY principle. Don’t Repeat Yourself.

Not repeating yourself is certainly an important aspect of software development. Copy-pasting between multiple pieces of code might seem the quickest way forward. However, when a bug is discovered in one of those multiple copies, will you make the extra effort of correcting it everywhere (including in others’ work)? Will you even be able to find all those copies? Sometimes the copy-pasted code changes a bit: variable names change, a condition is added here and there, …

A few years ago I led a project to build software that detected software clones: “Large scale multi-language clone analysis in a telecommunication industrial setting“. Even when we could identify software clones that had diverged from their original version, it was really hard, if not impossible, to get the software developers to refactor the code based on that insight. Once a clone is created, it is hard to kill! So the best approach is to not create them in the first place, and to refactor from the start, creating functions or classes as necessary instead of copy-pasting.

Hopefully I convinced you to be DRY in your software development for data science. But this is true for other aspects of data science as well. Have you ever worked in a place where data definitions depend on a number of columns, maybe from different tables, and are not standardized but repeated across many dashboards or queries? What is a visitor to your website? Is it any hit on any page of the site? Do you have some sort of sessions in place and count only once per session? Do you have a login and count only the users? Assuming session tracking or user logins, how do you compute the number of visitors? Is it a rolling average over a period of time? What is that period of time? It can easily become a substantial query to determine who is an active user at a specific time. Are you sure everyone in the organization uses the latest and greatest definition? Was the query copy-pasted from dashboard to dashboard? Was the definition slightly modified based on the purpose of a person or a team? You see, don’t repeat yourself also applies to data definitions. There are multiple ways to solve that problem, but none are easy, and the longer you allow code, queries or definitions to be copy-pasted, the harder it will be to kill those clones.
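One hedge against this, in a PySpark shop, is to centralize the definition in a single function that every dashboard and query calls, instead of copy-pasting the logic around; the table and column names below are hypothetical:

from pyspark.sql import functions as F

def active_visitors(sessions, as_of_date, window_days=28):
    # The one place where "active visitor" is defined: a logged-in user with
    # at least one session in the last `window_days` days.
    cutoff = F.date_sub(F.lit(as_of_date), window_days)
    return (sessions
            .filter(F.col('user_id').isNotNull())
            .filter(F.col('session_date') >= cutoff)
            .select('user_id')
            .distinct())

When the definition changes, it changes in exactly one place.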

As this second week’s exercise, I propose to identify an area or a specific instance where a lot of clones are made in your team. Find someone (ideally in your team) to discuss it with. Is there any way to bring forward the piece of code, query or definition and start refactoring related artifacts to eliminate that clone? This might be hard! What is an acceptable incremental way forward? How can we make sure not to create new clones in the future? These might be the first few steps your team takes to limit and reduce clones in your artifacts.

As for last week, there is much more to that second chapter. Still, for the junior data scientist who wishes to improve his Python programming skills, I’d like to suggest that practice makes perfect. In that vein, this in-theme exercise might prove useful: ADFGVX Cypher

Once you complete this exercise ask yourself all the questions I proposed for last week’s exercise. More specifically, you should make sure you don’t repeat yourself. A Cypher problem is a good example of symmetric operations and symmetric operations have a good potential for code reuse. If you have not defined functions that you use more than once between the encoding and decoding, I would suggest you take a second look at your implementation. Also make sure to think about the other questions from last week: Do you break early? Do you respect the contract? Do you document your program? Do you have unit tests?

Again if you have access to someone you feel comfortable with and who you think writes great code, ask him for his comments on your exercise solution. You might learn a thing or two! Alternatively, if you solve that problem, you may be able to look at some of the other’s best solutions. Reading code from others is also a great way to improve your programming skills.

I know I said one topic per week, but there is another aspect in this chapter worth an exception. I won’t add another exercise for it, just a potentially lifelong quest! The subject is estimation. The book will tell you to keep track of your estimates. I say: don’t stop there. Early in your career, keep a log of everything you do and how long it takes. As time passes, try to generalize the data you collected, e.g. a data request where all data concepts are well defined takes me on average x hours; creating a dashboard with y graphs / data points using readily available types of graphs takes me on average x hours; … Build an estimation library for yourself.

When you start to have a generalized mental model for your personal estimates, make a game out of it. Even if not asked for an estimate, make one and check how good it is afterward. Use gamification to your advantage to make you better at estimating for the future!

If or when you graduate to lead teams, you’ll see that your estimates are a good base as well. You can then apply equivalence factors for your team members as you learn to know them. Maybe even figure out their strengths and weaknesses from that exercise.

This concludes my quick complement to this second week / second chapter of the “The Pragmatic Programmer: From Journeyman to Master” as a medium to improve software writing skills in data science. See you next week for Chapter 3: The Basic Tools.


Cover Image by Chillervirus at Pixabay.

Better Software Writing Skills in Data Science: Broken Window

I’ve recently started mentoring a junior data scientist in the ways of improving his programming skills. It is often said that data scientists lack programming skills, whether because of their data science program or because of the diverse backgrounds that led them to data science. Sometimes, however, data scientists are willing to be vulnerable, admit it and improve upon it. I’m currently a data scientist, but given my past life, where I had many years of professional experience as a software developer (among many other things), I felt this mentorship was a good match and thought we could see how we would both learn through it.

Through my software development experience, one influential book I read was “The Pragmatic Programmer: From Journeyman to Master” by Andrew Hunt and David Thomas. I think this could be a good starting point for discussion and exercise. The book itself doesn’t have so many chapters, so my initial intention is to discuss with my mentee one chapter each week and complement the reading with some exercises. More specifically, my mentee wants to improve his Python programming skills, so we’ll target the exercises to that language.

You may ask me: Why is a programmer book good for a data scientist, especially a book which doesn’t discuss a language in depth? Well, I think there are many things which are important to programmers that are as important if not more to data scientists. For example, in the first chapter, there is a section about communication. If you haven’t been told yet, communication is key to data science!

I intend this to be a series of posts that will highlight the most important parts of the week’s chapter as well as some discussion about the weekly Python exercise I proposed to my mentee. So without further ado, for the first week I proposed to read the Preface as well as Chapter 1: A Pragmatic Philosophy.

In data science as in programming in general, you have to adjust your approach to fit the current circumstances. Those adjustments will come from your background and your experience. In the same way as a Pragmatic Programmer does, a Pragmatic Data Scientist gets the job done, and does it well.

Rather than going into the details of a chapter, I’ll focus on one aspect of it. In that sense, this blog post is complementary to reading the corresponding chapter in “The Pragmatic Programmer: From Journeyman to Master”. For this week I want to focus on the “Broken Window”. Building on the Broken Window Theory, when you leave a broken window on a building, it instills a sense of abandonment. From that sense of abandonment, people start to feel no one cares about the building. Further windows get broken, graffiti gets drawn and other damage is done to the building. A parallel can be drawn to software: when you have a “broken window”, people will try to stay away from that piece of code and won’t have the reflex to try to improve it.

As this first week’s exercise, I propose to identify a broken window in the code owned by your team and find someone (ideally in your team) to discuss it with. How bad is the situation? How could we improve it? Better, what can be done to fix that broken window? Maybe taking those first few steps might be the beginning of a new attitude for your team around broken windows and technical debt.

There is much more to that first chapter. Still, for the junior data scientist who wishes to improve his Python programming skills, I’d like to suggest that practice makes perfect. In that vein, this in-theme exercise might prove useful: Broken Window.

Once you complete this exercise ask yourself those questions:

  • Do you make sure your program breaks early if the contract is not followed, e.g. with initial asserts? In this exercise we expect a list of lists of int; if you receive something else you should break right there (see the sketch after this list). There are also other preconditions you may want to check.
  • Conversely if inputs are allowed by the contract, do you make sure you can handle fringe/boundary values? Do you know how your program should react?
  • Have you written functions each handling specific tasks? If you see that some part of the program repeats itself, have you taken steps to re-factor those parts so that you don’t repeat yourself?
  • Do you document your program in any way? Here, team culture may vary, but still, it is good practice to leave breadcrumbs for people who will follow (even if it’s future you!). Do your variable names make sense and are they part of that documentation?
  • Have you done the extra-mile of creating unit tests? Many frameworks can be used for that purpose. Ideally you’ll make yourself knowledgeable about the one your team uses the most.
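For the first point, a minimal sketch of such contract checks might look like this (the function name is hypothetical):

def process_grid(grid):
    # Contract: a non-empty list of lists of int.
    assert isinstance(grid, list) and len(grid) > 0, 'grid must be a non-empty list'
    assert all(isinstance(row, list) for row in grid), 'every row must be a list'
    assert all(isinstance(v, int) for row in grid for v in row), 'every value must be an int'
    # ... the actual processing goes here ...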

Maybe you’ll think some of those questions are overkill for such a small program, but again, practice makes perfect. If you don’t practice those on small problems, the chances are you won’t think of doing them on larger ones.

If you have access to someone you feel comfortable with and who you think writes great code, ask him for his comments on your exercise solution. You might learn a thing or two!

This concludes my quick complement to this first week / first chapter of the “The Pragmatic Programmer: From Journeyman to Master” as a medium to improve software writing skills in data science. See you next week for Chapter 2: A Pragmatic Approach.


Cover photo from Cottonbro on Pexels.

Who are those Strangers?

This post is a follow-up to Who am I connected to? As stated in the previous post, a problem that arises a lot is figuring out how things are connected. Is this server directly or indirectly connected to that pool? Or who am I connected to through a chain of friends? If you ever have to implement such an algorithm (refer to my previous post for that), one thing you might encounter is superstars, false friends or black holes. Name them the way you want 😉 . Those are “nodes” which are connected to an abnormally high number of other nodes. When someone has 50k friends, you should be suspicious that those are not all real friends! The problems those fake friends cause are manifold.

First, if you go through the process described last time, you get a number of very high density groups which normally wouldn’t be grouped together were it not for those black hole nodes. This may well make any conclusion pointless, so you should take care of removing (or not considering) those superstar nodes from the start.

Second, assuming you start with big data, joining a number of those superstars on themselves will lead to an exponential growth of your data set (at least temporarily) and it will take forever to complete the associated Spark tasks (if they succeed at all). Sure, those might be legitimate friends, in which case you might not have a choice, and maybe Fighting the Skew in Spark can help you solve that issue. But otherwise, if those are indeed false friends, you should remove those black hole nodes beforehand.

In an ever changing world of data, it may not be easy to spot those black holes, but a good first filter may be as simple as (using PySpark notation this time, just to keep you on your toes):

from pyspark.sql import functions as F

filter_out = (node_table
    .groupBy('node')
    .count()
    .filter(F.col('count') > black_hole_threshold))

The nodes captured by that filter-out “rule” can then be automatically removed from your node table, or examined and added to black lists if need be. To automatically remove the filter_out nodes from your node_table, the left anti join is your friend!

output = (node_table
    .join(
        filter_out.select('node'),
        on='node',
        how='left_anti'))

You still need to run the connection-finding algorithm on this “output”, but at least you will have removed from your inputs all the nodes with an abnormal number of connections, above black_hole_threshold.

What else can go wrong? Again, if you have big data, this process as a whole (especially since it is iterative) can take some serious time to execute. Moreover, even with the black holes removed, the self-join part may consume a lot of resources from your cluster. The interesting part is that if you keep your “node” definition constant, you could run the algorithm in an online, additive fashion, which would run faster because most of the data wouldn’t change and would already be reduced to who is whose friend, so only the additional delta would in fact “move”. I know it is not that simple and quick, but it is still quicker than running the process on the full initial input data again and again…

Again, I hope this can be of help. If you apply this method or an equivalent one, let me know and let’s discuss our experiences!


Cover photo by Felix Mittermeier at Pixabay.

Who am I connected to?

A problem that arises a lot when you play with data is figuring out how things are connected. It could be, for example, determining from all your friends, your friends’ connections, your friends’ friends’ connections, … to whom you are directly or indirectly connected, or how many degrees of separation you have with such and such connection. Luckily there are some tools at your disposal to perform such an analysis. Those tools come under the umbrella of Network Theory, and I will cover some basic tricks in this post.

First let’s go with some terminology. Nodes (also called vertices) are the things we are connecting, e.g. you, your friends, your friends’ friends. Edges are the connections between those nodes. For example, below, node 0 connects to nodes 1 and 2 using two edges to describe those connections. Node 1 connects to node 3 via one edge, etc. For this first example, we use directed edges, but nothing prevents us from using bi-directional edges. In general, if all edges are bi-directional we will talk of an undirected graph, which would be the case for friendships (usually) since you know your friend, and he knows you as well!

A first important concept to introduce in network analysis is that of an adjacency matrix. The adjacency matrix is a matrix representing the edge connections between the nodes. The first row in the adjacency matrix represents the connections of node 0. Thus, node 0 connects to nodes 1 and 2, but not to itself or to node 3, so the first row is 0, 1, 1, 0. The second row represents the connections of node 1, which only connects to node 3, so the second row is 0, 0, 0, 1. Note that we could have bi-directional connections, in which case the connection would appear in both the row and the column, but this is not the case in this example.

By inspecting the adjacency matrix, we can reconstruct the node/edge graph. It informs us about the first hop connections: who your friends are. But how can we know about the second hop connections, e.g. that node 0 is connected to node 3 via node 1 and node 2? A really simple way is to multiply the adjacency matrix by itself (A*A). The result of this multiplication is the second hop connections. Here, we see that node 0 connects through 2 hops to node 1 (via node 2), and connects through 2 hops to node 3. We can even see that there are 2 such 2-hop connections from node 0 to node 3. Lastly, we see that node 2 connects through 2 hops to node 3 (via node 1).

If we were to multiply A*A by A again, we would get the three hop connections, which in this case are limited to node 0 being connected to node 3.
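Reconstructing the 4-node example above with numpy (the adjacency matrix below is inferred from the description), the hop computations take only a couple of lines:

import numpy as np

# Row i, column j is 1 when node i connects to node j.
A = np.array([[0, 1, 1, 0],
              [0, 0, 0, 1],
              [0, 1, 0, 1],
              [0, 0, 0, 0]])

two_hops = A @ A        # two_hops[0, 3] == 2: two distinct 2-hop paths from node 0 to node 3
three_hops = A @ A @ A  # only three_hops[0, 3] is non-zero: node 0 reaches node 3 in 3 hops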

In general, the networks that will interest us are way bigger than this simple 4-node diagram. Also, in general, not every node is connected to every other node. Well, they say everyone is connected to everyone by six degrees of separation (six hops), but for most other practical applications, not all nodes are connected to each other. Let’s take a look at a bigger example to see how the principles illustrated above apply at scale. Let’s assume the following undirected network graph. Since we have an undirected graph, you will see the connection values appear in both the rows and the columns. This special case shows a symmetry about the diagonal.

As before, if we compute A*A, we get the second hop connections. Notice that nodes become connected to themselves via a second hop. For example, node 1 is connected 3 times to itself through a second hop, via nodes 0, 7 and 8.

If you are interested in all the first hop and second hop connections, you can add A*A and A together, leading to the following matrix. You could proceed further to find the third hops and onward, but in this example nothing else is connected, so although the numbers you see here would grow, the pattern of zeros would not change. We have found all the connections of this graph. We found that node 0 is connected to nodes 1, 7 and 8. Nodes 2, 3 and 4 are connected. Nodes 5, 6 and 9 are connected. Finally, we see that node 10 is not connected to any other node.

In practice, matrix multiplication works well to find the next hop neighbours. If it also happens that for your problem (as in the one above) most connections are non-existent, i.e. 0, then you could use sparse matrices to store (and potentially compute with) your adjacency matrix. However, these quickly become really huge matrices which require a lot of operations to compute. A nice trick if you are using SQL or Spark is to use joins on tables.

To do so, you need to turn the problem on its head. Instead of creating an adjacency matrix of how the nodes are connected, you will create a table of the connections. To keep with our second example, you could have something like the following network graph being turned into a node/connection table.

Node Connection
0 A
1 A
1 B
7 B
1 C
8 C

Now that we have that node/connection table, our goal will be to reduce the number of connections to the minimum possible and in the end get something like the following as a way to see everything that is connected (we won’t care about how many hops lead us there).

To get there we will iterate through a two step process. First we will perform connection reduction and then update the node/connection table. Then we rinse and repeat until we can no longer reduce the number of connections.

Assuming the above node/connection table (node_connections), we can reduce the number of connections via the following SQL query and store the result as the new_connections table:

SELECT A.connection, MIN(B.connection) AS new_connection
FROM node_connections AS A
JOIN node_connections AS B
ON A.node = B.node
GROUP BY A.connection

Then you can update the node_connections table with the following SQL query:

SELECT DISTINCT B.new_connection AS connection, A.node
FROM node_connections AS A
JOIN new_connections AS B
  ON A.connection = B.connection

You iterate these two steps until the node_connections table no longer changes, et voilà, you have a map of all nodes connected through distinct connections.

This is only one of the possible use cases, but for large scale applications it is probably easier and quicker to join tables than to create and multiply adjacency matrices. I showed the logic with SQL, but obviously you could achieve similar results using Spark (for my specific application, I use pySpark).
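For reference, here is a rough, untested sketch of the same two steps in PySpark, assuming a DataFrame with columns 'node' and 'connection'; it mirrors the SQL above rather than being an optimized implementation:

from pyspark.sql import functions as F

def reduce_connections(node_connections, max_iterations=50):
    node_connections = node_connections.distinct()
    for _ in range(max_iterations):
        # Step 1: for each connection label, the smallest label sharing a node with it.
        a = node_connections.alias('a')
        b = node_connections.alias('b')
        new_connections = (a.join(b, on='node')
                            .groupBy(F.col('a.connection').alias('connection'))
                            .agg(F.min(F.col('b.connection')).alias('new_connection')))
        # Step 2: relabel every node with its reduced connection label.
        updated = (node_connections.join(new_connections, on='connection')
                                   .select('node', F.col('new_connection').alias('connection'))
                                   .distinct())
        # Stop once an iteration no longer changes the table.
        if (updated.count() == node_connections.count()
                and updated.subtract(node_connections).count() == 0):
            return updated
        node_connections = updated
    return node_connections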

If you have questions or interesting ideas of application of the network theory to some problem, feel free to jump in the conversation!


Cover photo by Michael Gaida at Pixabay.