
A small pause in my series on the complexities of Deep Learning (or Machine Learning in general). Today I will give an honest review of the latest course I followed: “Convolutional Neural Networks” on Coursera by Andrew Ng.

I first followed the “Machine Learning” course from Andrew Ng in 2014. I really liked his pace of delivery and the way he builds up the knowledge to bring you up to speed. So, when his new nano degree on Deep Learning was released in August of this year, I was one of the first to jump on it and I swiftly completed the three available courses. A couple of weeks back, the fourth out of five courses was released, and I immediately started the “Convolutional Neural Networks” course to continue my journey.

The course is well structured, as expected, and delivered at a proper pace. The content is split over four weeks. The first week builds the foundations of Convolutional Neural Networks (CNNs) and explains the mechanics of how those convolutions are computed. It grounds them in computer vision, then details the convolution operation with padding and stride, and the pooling layers (max pooling, average pooling).
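To keep those mechanics fresh, here is a minimal Keras sketch (my own illustration, not the course's assignment code; the layer sizes are arbitrary) of the week-one building blocks: a convolution with padding and stride, followed by max pooling.

from tensorflow.keras import layers, models

model = models.Sequential([
    # 'same' padding keeps the spatial size; stride 1 slides the filter one pixel at a time
    layers.Conv2D(8, kernel_size=3, strides=1, padding='same',
                  activation='relu', input_shape=(64, 64, 3)),
    # max pooling halves the spatial dimensions: 64x64 -> 32x32
    layers.MaxPooling2D(pool_size=2),
    # 'valid' padding with stride 2: output size = floor((32 - 3) / 2) + 1 = 15
    layers.Conv2D(16, kernel_size=3, strides=2, padding='valid', activation='relu'),
    layers.Flatten(),
    layers.Dense(10, activation='softmax'),
])
model.summary()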

The second week looks at a few “classic” CNNs and explains how their architectures were built by adding new concepts on top of the previously existing ones. It then goes on to explain ResNets (a concept which can apply to networks other than CNNs) and builds toward Inception Networks (yes, Inception as in the movie).
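The key idea behind ResNets is the residual (skip) connection. A hedged Keras sketch of a basic residual block (again my own illustration, not the assignment code; it assumes the input already has the same number of channels as the block so the shapes match):

from tensorflow.keras import layers

def residual_block(x, filters):
    # two convolutions whose output is added back onto the block's input
    shortcut = x
    y = layers.Conv2D(filters, 3, padding='same', activation='relu')(x)
    y = layers.Conv2D(filters, 3, padding='same')(y)
    y = layers.Add()([shortcut, y])   # the skip connection
    return layers.Activation('relu')(y)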

The third week introduces two practical concepts: object localization and object detection, i.e. figuring out where in a picture an object is, and then how many objects can be detected in a picture and where each of them is. It nicely shows how bounding box predictions are made and evaluated, followed by the use of anchor boxes.
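Bounding box predictions are typically evaluated with Intersection over Union (IoU). A small sketch of that measure (my own, with boxes given as (x1, y1, x2, y2) corners, not the assignment's exact code):

def iou(box_a, box_b):
    # coordinates of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1 / 7, about 0.143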

Finally, in the fourth week you will learn some of the coolest and most fun things about CNNs: face recognition and neural style transfer. Some important concepts introduced there are one-shot learning (another of those things which can apply to networks other than CNNs) and Siamese networks.
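In a nutshell, one-shot face verification runs both images through the same embedding network (the Siamese idea) and compares the distance between the embeddings to a threshold. A hedged sketch, where embedding_model and the threshold value are placeholders rather than course code:

import numpy as np

def verify(img_a, img_b, embedding_model, threshold=0.7):
    emb_a = embedding_model.predict(img_a[np.newaxis])[0]   # embedding of the first image
    emb_b = embedding_model.predict(img_b[np.newaxis])[0]   # same network, second image
    distance = np.linalg.norm(emb_a - emb_b)                # L2 distance between embeddings
    return distance < threshold                             # same person if embeddings are close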

All in all, this is a good course, with some glitches along the way. Some of the videos are not as polished as they could be, i.e. some “bloopers” need removal and a few errors on the slides need correction, but those are very minor. My biggest gripe is about the programming assignments. First, they are not self-contained. Right from the start, they refer to the introduction to TensorFlow made in the third week of the second course of the nano degree, “Improving deep neural networks”. This makes things a little bit awkward when you do not use TensorFlow day to day (usually I’m fine with Keras) and there are 2–3 months in between the courses…

Second, the assignments give less guidance than those of the other courses in the nano degree, i.e. you will need to spend more time figuring out small bits of programming, as you are not taken by the hand as much as in previous courses. This is not so much a problem as a statement; however, it becomes a problem when combined with the problematic Coursera assignment submission engine (well, honestly, I do not know if it is Coursera’s fault or the course’s fault, but the result is the same). Sometimes it will refuse to correctly grade your assignment for things which are not even errors, or will introduce artificial boundaries without telling you they are there… I hope those issues get resolved soon; they will not deter early adopters like myself, but will most probably discourage more than one person in the future.

Lastly, the Jupyter kernels used on the site are troublesome. Server availability seems sketchy at times, and you will often lose your work even if you saved it (you should save/export to your local machine regularly to alleviate those issues). In short, it is still a long way from the way Kaggle handles its kernels. The same issues have been reported by a colleague of mine following another Coursera course, so this is not unique to the CNN course. Also, as access to Coursera is now subscription based, you will lose access to your kernels if you do not renew your subscription after the course. Hence, if you do not want to lose your work (as it is as much a reference as the course videos themselves), you will have to store the notebooks locally on your machine or on your preferred cloud storage.

All that being said, it is an excellent course which taught me many things I did not know before, so an overall 4/5! But be ready for the glitches, especially in the programming assignments. I especially enjoyed the one-shot learning, which I intend to apply to one of my deep neural network problems at work (not CNN related), and the neural style transfer. In that last programming exercise you complete the code of a neural style transfer algorithm. With a little push from my good friend Marc-Olivier, I went a little bit further and implemented a multi-style transfer algorithm. Here are my youngest kids with different levels of style transfer from Edvard Munch, Pablo Picasso, Vincent Van Gogh and Georges Braque.

My youngest kids with styles transferred, left to right, top to bottom: Edvard Munch, Pablo Picasso, Vincent Van Gogh and Georges Braque.
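For reference, the heart of the style cost you implement in that exercise is the Gram matrix of a layer's activations; a multi-style transfer essentially mixes several such style costs with different weights. A minimal numpy sketch of the Gram matrix (illustrative only, not my actual implementation):

import numpy as np

def gram_matrix(activations):
    # activations: (height, width, channels) feature map from one conv layer
    h, w, c = activations.shape
    flat = activations.reshape(h * w, c)   # one row per spatial position
    return flat.T @ flat                   # (channels x channels) style summary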

Originally published at medium.com/@TheLoneNut on November 22, 2017.


Virtual People…

This is my last day of work before the Christmas and New Year festivities. Time for one last update on the crazy stuff I am currently doing, and also time to wish you and your families all the best for the new year!

My group is currently working on a data science project and, as people in that field may have experienced, it is sometimes hard to get hold of the data… As others have probably done in the past, we decided, in a first phase, to generate our own data through simulation.

We have to simulate people, more precisely in this case, people from the Montreal area. We want the simulation to be representative of reality, so we start from Canadian census data and generate virtual denizens which are statistically correct for their home location. I’m also generating names from various sources (including a tombstone registry). I’m still in the first stages of that simulation, but here is a subset of virtual denizens I have generated for my neighborhood, followed by a small sketch of the sampling idea. (Any resemblance between the generated virtual denizens in this article and any persons, living or dead, is a miracle.)

Rita Desjardins a Female Adult of 22 year old, born in Europe.
– Who attended College, CEGEP or other non-university certificate or diploma.
– Is not in the work force.
– Has an income of $5,000 to $9,999.
– Usually move around by means of Car, truck or van – as a passenger.

Youssef Bedard a Male Kid of 3 year old, born in Quebec.
– Is currently attending daycare.
– Usually move around by means of Car, truck or van – as a passenger.

Louis Martel who is not a Canadian citizen.

Lyana Belanger a Female Kid of 16 year old, born in Quebec.
– Is currently attending school.
– Usually move around by means of Car, truck or van – as a passenger.

Sam Roy a Male Adult of 22 year old, born in Quebec.
– Who attended Bachelor’s degree.
– Is currently working full-time in: Business, finance and administration occupations
– For the Educational services industry and Worked at usual place.
– Has an income of $60,000 to $79,999.
– Usually move around by means of Car, truck or van – as a driver.

Matheo Gagnon a Male Adult of 22 year old, born in Asia.
– Who attended College, CEGEP or other non-university certificate or diploma.
– Is currently working full-time in: Trades, transport and equipment operators and related occupations
– For the Manufacturing industry and Worked at usual place.
– Has an income of $60,000 to $79,999.
– Usually move around by means of Car, truck or van – as a driver.

Marc Bouchard a Male Adult recent immigrant of 45 year old, born in Europe.
– Who attended Bachelor’s degree.
– Is not in the work force.
– Has an income of $50,000 to $59,999.
– Usually move around by means of Public transit.
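The gist of the generation is simple: each attribute of a virtual denizen is drawn from a categorical distribution derived from the census counts for the person's home area. A minimal sketch of that idea (the distributions and names below are made-up placeholders, not the actual notebook):

import numpy as np

# made-up category probabilities standing in for census-derived ones
census = {
    'age_group':  (['Kid', 'Adult', 'Senior'], [0.20, 0.65, 0.15]),
    'sex':        (['Female', 'Male'], [0.51, 0.49]),
    'birthplace': (['Quebec', 'Rest of Canada', 'Europe', 'Asia'], [0.70, 0.10, 0.10, 0.10]),
}

def generate_denizen(first_names, last_names):
    # draw each attribute independently from its categorical distribution
    person = {attr: np.random.choice(values, p=probs)
              for attr, (values, probs) in census.items()}
    person['name'] = '{} {}'.format(np.random.choice(first_names),
                                    np.random.choice(last_names))
    return person

print(generate_denizen(['Rita', 'Sam', 'Lyana'], ['Roy', 'Bedard', 'Desjardins']))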

You can expect I’ll publish some portion of the Python Notebook performing this generation somewhere in the new year, if this is of interest to you.

In the meantime, I wish you joyful end-of-year festivities!

More Adventures in Geolocation!

I’m in Sweden this week for an Ericsson internal event where I presented a demo, and the jet lag was really hitting me yesterday… I couldn’t sleep much after midnight. So I took some time to continue my adventures in geolocation.

As I stated last time, more could be done on the data with the approach I am taking, like clustering/analysis of when I’m at those locations. So, to go a little bit further, I’ve updated my Jupyter Notebook to perform such an analysis.

As expected, it shows that I’m at work on weekdays from around 6h to 15h, at my cottage on weekends, and at home otherwise. The system works!

[Image: locationTime.jpg]

I’ve decided to push that notebook to GitHub as this is strictly learning, so you can access it from there if you want details on the method I used.

Finally, I am still progressing through the Advent of Code 2016. I have now completed Day 8, and you can follow the code I produced on GitHub as well.

That’s it, tomorrow I’m heading back home.

Advent of Code 2016

Last year I mentioned the advent calendar made specifically for people who want some small exercises to train or retrain in coding. The Advent of Code is back this year with the 2016 edition. As I said last year, if you were born a coder, you should go back to coding from time to time, if only to remind yourself of the inherent complexity of the task, and Advent of Code is a good way to do it.

The concept is simple: every day until Christmas, the site proposes a two-part challenge. If you pass the first part, you can go on with the second part. Input for the problem is provided as text and you enter your solution in a text box on the site. Nothing fancy, which allows you to use pretty much any programming language you want.

Last year I did most of it using Python; this year I’m doing the same, but with the twist of using pandas as much as I can, since I am ramping up my Python data science skills. A colleague and friend of mine is trying it out with R for the same reasons.

Funny enough, if you look at the solutions we produced for the first day’s challenge, although they are in two different programming languages, they are still quite close to one another. You can take a look at my Python Notebook or my colleague’s R Notebook on GitHub. It seems that pandas and R both propose a similar way to approach the problems.

Do not pass up this opportunity to brush up your skills in one or many programming languages. And, as I do, nothing prevents you from giving it a twist to learn a new library as you go.

Happy Advent 2016!

Adventures in Geolocation!

As some of you might know, I’m undertaking a “new career” as a data scientist. One of the upcoming projects might require geolocation analysis. To prepare myself and warm up, I did some experiments, playing with my location data which I downloaded from Google. I wanted to see how I could cluster it and determine the locations where I spend the most time. As if I didn’t know!

As a first step, I downloaded my location history from Google as a JSON file. Then I loaded that data in a Jupyter Notebook (I’m using a Python 3 kernel); in my case I read 500547 location data points. Since that number of data points is too large to cluster all at once, I decided on an approach where I first cluster the data on a per-day basis and in a second step cluster all days together.

Obviously more could be done on the data with such an approach, like clustering/analysis of when I’m at those locations, or whether clusters move with time, etc. But let’s stick to the basics for now.

Once the per-day clustering is done, I find the centroids of those clusters and add them to a per-day cluster centroid list. I then apply the same approach to the list of per-day cluster centroids and get the centroids of those new clusters.

Finally, I can print those cluster centroids on a map and verify that the places I spend the most time at are home, work and my cottage. Because of the way location was estimated on older Android phones (and because I didn’t optimize the density parameter of the DBSCAN algorithm), the location of my house appears in two distinct spots, something I had noticed before while looking at my location history. But overall, the method correctly found my three main locations.

If you want the details of the method, the HTML version of my notebook is just below.

I’ll keep you posted as I do more experiments!

Location History

In this notebook I show how I used clustering techniques to get the locations I spend the most time at.

Loading Google Location History

First, I downloaded my location history as JSON from Google at https://takeout.google.com/settings/takeout. The next step is to load that data, in my case 500547 location data points.

In [2]:
'''
Loading the location data
'''

import json
import pandas as pd

filename = 'LocationHistory.json'

# Open the location history
with open(filename) as data_file:
    loc_data = json.load(data_file)

# Creating the data frame of locations
locations = []
for loc in loc_data['locations']:
    tmp = {}
    tmp['timestamp'] = pd.to_datetime(int(loc['timestampMs']), unit='ms')
    tmp['lat'] = loc['latitudeE7']/10000000   # E7 fields are degrees * 1e7
    tmp['lon'] = loc['longitudeE7']/10000000
    locations.append(tmp)
data = pd.DataFrame(locations)
data = data.set_index('timestamp')

print('{} locations loaded in "data".'.format(len(data)))
data.head(4)
500547 locations loaded in "data".
Out[2]:
lat lon
timestamp
2015-03-28 19:45:28.653 45.909298 -74.917595
2015-03-28 19:44:27.996 45.909298 -74.917595
2015-03-28 19:43:26.793 45.909298 -74.917595
2015-03-28 19:42:25.765 45.909298 -74.917595

Per day clustering and centroid determination

Since that number of data points is too large to cluster all at once, I decided on an approach where I first cluster the data on a per-day basis. In the next step we will cluster all days together. Obviously more could be done, like clustering/analysis of when I’m at those locations, or whether clusters move with time, etc. But let’s stick to the basics for now.

In [118]:
'''
As we have too much data for our VM to handle, we will do a per-day clustering, then
another clustering for all days on top of it.
'''

from datetime import timedelta
from sklearn.cluster import DBSCAN
from geopy.distance import great_circle
import numpy as np

kms_per_radian = 6371.0088
epsilon = .25 / kms_per_radian

tmin = data.index.min()
tmin = pd.to_datetime('{}-{}-{}'.format(tmin.year,tmin.month,tmin.day))
tmax = data.index.max()
tmax = pd.to_datetime('{}-{}-{}'.format(tmax.year,tmax.month,tmax.day))+timedelta(days=1)

rng = pd.date_range(tmin,tmax)

centroids = pd.DataFrame(columns=['lat','lon'])
for d in rng:
    ds = '{}-{}-{}'.format(d.year,d.month,d.day)
    tmp = data[ds]  # all location points recorded on that day
    if len(tmp) > 0:
        coordinates = tmp[['lat','lon']].values
        # DBSCAN expects an integer min_samples; require roughly 10% of the day's points per cluster
        minSamples = max(1, len(tmp) // 10)
        db = DBSCAN(eps=epsilon, min_samples=minSamples, algorithm='ball_tree', metric='haversine').fit(np.radians(coordinates))
        y = db.labels_
        X = pd.DataFrame(coordinates, columns=['lat','lon'])
        Y = pd.DataFrame(y, columns=['label'])
        res = pd.concat([X,Y], axis=1, join='inner')
        # label -1 marks noise, so it does not count as a cluster
        n_clusters = len(set(res['label'])) - (1 if -1 in set(res['label']) else 0)
        for i in range(n_clusters):
            # centroid of the day's cluster i: mean of its lat/lon
            el = res[res['label'] == i].mean(axis=0).drop('label')
            el['timestamp'] = ds
            centroids = centroids.append(el, ignore_index=True)
centroids = centroids.set_index('timestamp')

print('{} centroids of locations kept in "centroids".'.format(len(centroids)))
centroids.head(4)
1612 centroids of locations kept in "centroids".
Out[118]:
lat lon
timestamp
2011-12-14 45.503960 -73.664508
2011-12-14 45.500742 -73.661086
2011-12-15 45.507178 -73.831019
2011-12-15 45.528192 -73.838333

Clustering of all days

Now we cluster all the centroids of the per-day clusters determined in the previous step.

In [133]:
'''
DBSCAN clustering taking into account the spherical earth
source: http://geoffboeing.com/2014/08/clustering-to-reduce-spatial-data-set-size/
'''

from sklearn.cluster import DBSCAN
from geopy.distance import great_circle
import numpy as np

kms_per_radian = 6371.0088
epsilon = .5 / kms_per_radian
# Since this is the second pass, if we wanted to detect vacation spots or such,
# we might have to lower the minimum number of samples to 1-2.
minSamples = max(1, len(centroids) // 20)  # DBSCAN expects an integer min_samples

coordinates=centroids[['lat','lon']].values

db = DBSCAN(eps=epsilon, min_samples=minSamples, algorithm='ball_tree', metric='haversine').fit(np.radians(coordinates))
y = db.labels_
print ('List of generated labels for {} clusters: {}'.format(len(set(y)),set(y)))
List of generated labels for 5 clusters: {0, 1, 2, 3, -1}

Calculation of the overall centroids

And we calculate the centroids of those overall clusters.

In [136]:
# Same procedure as the per-day pass: drop the noise label (-1) and average
# the lat/lon of each remaining cluster to get its centroid.
overallCentroids = pd.DataFrame(columns=['lat','lon'])
X = pd.DataFrame(coordinates, columns=['lat','lon'])
Y = pd.DataFrame(y, columns=['label'])
res = pd.concat([X,Y], axis=1, join='inner')
overall_n_clusters = len(set(res['label'])) - (1 if -1 in set(res['label']) else 0)
for i in range(overall_n_clusters):
    el = res[res['label'] == i].mean(axis=0).drop('label')
    overallCentroids = overallCentroids.append(el, ignore_index=True)

print('{} overall centroids of locations kept in "overallCentroids".'.format(len(overallCentroids)))
overallCentroids
4 overall centroids of locations kept in "overallCentroids".
Out[136]:
lat lon
0 45.500931 -73.665334
1 45.506030 -73.832161
2 45.524857 -73.833701
3 45.926803 -74.913104

Display the cluster locations on the map

We can now print those cluster centroids on a map and verify that the places I spend the most time at are home, work and my cottage. Because of the way location was estimated on older Android phones, the location of my house appears in two distinct spots, something I had noticed before while looking at my location history. But overall, the method correctly found my three main locations.

In [139]:
import folium
import numpy as np

colorsList = ['red',
            'blue',
            'green',
            'orange',
            'purple',
            'pink',
            'gray',
            'cadetblue',
            'darkred',
            'darkblue',
            'darkgreen',
            'darkpurple',
            'lightgray',
            'lightred',
            'beige',
            'lightgreen',
            'lightblue',
            'white',
            'black']

centCoordinates=overallCentroids[['lat','lon']].values

m = folium.Map(location=[45.6, -73.8], zoom_start=9)

for i, r in enumerate(centCoordinates):
    # one marker per overall centroid, each with its own colour from colorsList
    folium.Marker(
        location=[r[0], r[1]],
        icon=folium.Icon(color=colorsList[i % len(colorsList)])
    ).add_to(m)

m
Out[139]:
[Image: Capture.JPG, map showing the overall centroids]

Ericsson Hackathon 2016

Hi all! Last week I participated in the Ericsson Hackathon from the Montréal location and I wanted to share with you our achievement.

First, let me thank the incredible members of my team: Marc-Olivier, Mahdy, Antonio and Marc.

The short version of our work is contained in the Hackathon video we made; a slightly more fleshed-out version follows in the lines below.

What can be done in 24 hours? We decided to start from the Montréal open data sets and create an interactive map of the parkable streets at a specific time. Specifically, we used a database of all the street signs in the city and a database of all the street side segments. The street sign database consists of an Excel file of 308 685 entries and the street side database consists of a GeoJSON file of 104 431 entries.

My main task during the Hackathon consisted in parsing the street sign database and associating the location of each sign with a single street segment from the street side database. Each street side segment usually covers from one corner of a street to the next and as such is represented by at least two points; curved streets, for instance, are represented by more points. So, to figure out which street segment a sign is located on, we need to find the street segment with the smallest distance to the sign. Below is the formula to calculate the distance from a point to a line according to Wikipedia. This does not even take into account that our lines are segments, and as such bounded rather than infinite.

distance(P1, P2, (x0, y0)) = |(y2 − y1)·x0 − (x2 − x1)·y0 + x2·y1 − y2·x1| / √((y2 − y1)² + (x2 − x1)²), for the line passing through P1 = (x1, y1) and P2 = (x2, y2).

So, in order to limit the processing required, we had to use some tricks. First, knowing that segments were of limited size (at most from one corner to the next), we extracted the first latitude and longitude of every street segment and kept only the segments close to a given street sign. Next, for the retained street segments, we computed the actual distance to all the points defining them and kept only those which were indeed the closest. Finally, for this small subset of about 10 street segments, we did the actual distance computation from the sign to the street segment, taking into account the bounds of the segment (a sketch of that last step is shown below). The result is the street segment we want to retain.
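As an illustration of that last step, here is a hedged sketch of a point-to-segment distance in a flat, planar approximation (the real computation should also account for longitude scaling with latitude; this is not the hackathon code itself):

import numpy as np

def point_to_segment_distance(p, a, b):
    # distance from point p to the bounded segment [a, b]; all are 2D numpy arrays
    ab = b - a
    ap = p - a
    denom = np.dot(ab, ab)
    if denom == 0:                       # degenerate segment: a and b coincide
        return np.linalg.norm(ap)
    # project p onto the infinite line, then clamp to the segment bounds
    t = np.clip(np.dot(ap, ab) / denom, 0.0, 1.0)
    closest = a + t * ab
    return np.linalg.norm(p - closest)

print(point_to_segment_distance(np.array([0., 1.]), np.array([0., 0.]), np.array([2., 0.])))  # 1.0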


We split the sign database into roughly four parts and let the process run on four independent VMs in our OpenStack cluster. It took each of them around four to five hours to complete the computation, around 17 hours of processing in total.

Other people on the team were concentrating on parsing the text of the parking signs in order to make them computer-understandable. And finally, some were working on assembling a web site with the Ericsson look and feel.

You can find some more information about the tools we used and some references in the video. All in all, it was an interesting experience. The geo parsing took longer than we anticipated. The sign text parsing was harder than we anticipated. The map generation with the trio of Leaflet, Mapbox and OpenStreetMap was really easy. In the end we got the results we were hoping for!

Grow your own moustache!

It is done! I finished my Master’s studies. The defence was held on April 26 and the thesis was published today. For those inclined to read this kind of literature, here is my thesis: A Scalable Heterogeneous Software Architecture for the Telecommunication Cloud. It describes an actor-model-based framework which can be deployed on any cloud and written in any programming language.

Let me tell you, it is a big weight off my shoulders. In retrospect, I would do it again; however, I would not take two courses in the same session. Also, I would have started writing the thesis, and verifying it with my thesis director, earlier. But now it is done. On the bright side, I think what I have learned in the process really helped improve the end result of the research done by my team.

I could not resist for long and had to learn something new. Something had been on my backlog for a while and I decided to give it a try. But let me give a little bit of context here. Four or five years ago I went to a course about innovation in Stockholm. One of the exercises went as follows: in teams of two, we had to point randomly at pictures and sentences in magazines and give sense to them. I don’t recall each individual element, but I think we came up with a sentence going like this: “You have to grow your own moustache”. I still recall that sentence because, out of the randomness of the sentences and images we picked, we ended up with such a profound revelation!

It might not look like it, but “growing your own moustache” is a really good metaphor for a lot of things in life. I will just show one of those things. Like following a course and learning, growing a moustache is a decision you have to make. Once that decision is made, it will take time; you cannot have it grown overnight. Two people won’t grow the same moustache, and it won’t grow at the same pace. When you learn, you might struggle more than someone else, but in the end, no matter the struggle, what you have learned is personal; what you retain depends on your background and on how the moustache grew… everyone will get their own. There are things you can do to shape it the way you want, but some things you cannot control or change.

That being said, this exercise and many more went a great deal toward starting a friendship. The course I am following now comes from the advice of that friend, Andreas S., who told me about it. Andreas told me about a book which guides you through the process of building a computer: you start from Nand gates and from them build a computer, an OS, a language and eventually the Tetris game. It happens that the guys who wrote the book made the course available on Coursera: Build a Modern Computer from First Principles: From Nand to Tetris. It is a two-part course and the first part is available now; I finished it. Last week I completed the assignment for week five, where you have to build a CPU and memory and assemble them into a computer. This week I completed the week-six assignment to write an assembler for that computer. It is all simulation, but you know it could work for real if you had the patience to build it physically, since in the previous weeks we built every element leading up to this, starting from the Nand gate. Two transistors and a resistor and you have a physical implementation of a Nand gate. You would need a s**t load of them to build an actual physical version of the computer, but you get the full understanding with the course. By the way, someone did build such a computer from individual transistors; you can get a view of it in this video.
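To give a feel for the “everything from one gate” idea, here is a tiny illustrative sketch (mine, in Python rather than the course’s HDL) deriving Not, And and Or from Nand alone:

def nand(a, b):
    return not (a and b)

def not_(a):
    return nand(a, a)

def and_(a, b):
    return not_(nand(a, b))

def or_(a, b):
    # De Morgan: a or b == not(not a and not b) == nand(not a, not b)
    return nand(not_(a), not_(b))

print(and_(True, True), or_(False, True), not_(True))  # True True False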

When I did my bachelor’s degree (in electrical engineering) 25 years ago, I covered a lot of what is shown in this course, but still some pieces were missing. We built/simulated logic gates and from there went to registers and an ALU, but we didn’t assemble them into a CPU and a computer. Other courses showed us assembly and compilers, but it was not linked into a coherent chain. This course brings you from the basic Nand gate up to writing a Tetris game, with all the steps in between. You can be a perfectly good software engineer without knowing how a computer is built, but there is a lot you can gain by understanding it, and making it yourself ensures you have a deep understanding of the whole process. I recommend that course to everyone. Thanks Andreas!