Adventures in Geolocation!

As some of you might know, I’m undertaking a “new career” as a data scientist. One of my upcoming projects might require geolocation analysis, so to prepare myself and warm up I did some experiments. I played with the location data I downloaded from Google, to see how I could cluster it and determine the locations where I spend the most time. As if I didn’t know!

As a first step, I downloaded my location history from Google as a JSON file. Then I loaded that data in a Jupyter Notebook (I’m using a Python 3 kernel); in my case I read 500547 location data points. Since that is too many data points to cluster all at once, I decided on an approach where I first cluster the data on a per-day basis and in a second step cluster all the days together.

Obviously more could be done on the data with such an approach, like clustering / analyzing when I’m at those locations, or whether the clusters move over time, etc. But let’s stick to the basics for now.

Once the per-day clustering is done, I find the centroids of those clusters and add them to a list of per-day cluster centroids. I then simply apply the same approach to that list of daily centroids and get the centroids of those new clusters.

Finally, I can print those cluster centroids on a map and verify that where I spend the most time is at home, at work and at my cottage. Because of the way location was estimated on older Android phones (and because I didn’t optimize the density parameter of the DBSCAN algorithm), my house appears in two distinct spots, something I had noticed before while looking at my location history. But overall, the method correctly found my three main locations.

If you want the details of the method, the HTML version of my notebook is just below.

I’ll keep you posted as I do more experiments!

Location History

In this notebook I show how I used clustering techniques to get the locations I spend the most time at.

Loading Google Location History

First I downloaded my location history as JSON from Google at: https://takeout.google.com/settings/takeout The next step is to load that data, in my case 500547 location data points.

In [2]:
'''
Loading the location data
'''

import json
import pandas as pd

filename = 'LocationHistory.json'

# Open the location history
with open(filename) as data_file:
    loc_data = json.load(data_file)

# Creating the data frame of locations
locations=[]
for loc in loc_data['locations']:
    tmp = {}
    tmp['timestamp'] = pd.to_datetime(int(loc['timestampMs']), unit='ms')
    # latitudeE7 / longitudeE7 are degrees multiplied by 1e7
    tmp['lat'] = loc['latitudeE7'] / 10000000
    tmp['lon'] = loc['longitudeE7'] / 10000000
    locations.append(tmp)
data = pd.DataFrame(locations)
data = data.set_index('timestamp')

print('{} locations loaded in "data".'.format(len(data)))
data.head(4)
500547 locations loaded in "data".
Out[2]:
                               lat        lon
timestamp
2015-03-28 19:45:28.653  45.909298 -74.917595
2015-03-28 19:44:27.996  45.909298 -74.917595
2015-03-28 19:43:26.793  45.909298 -74.917595
2015-03-28 19:42:25.765  45.909298 -74.917595

Per day clustering and centroid determination

Since that number of data points is too much to cluster all at once, I decided on an approach where I first cluster the data on a per-day basis. In a next step we will cluster all the days together. Obviously more could be done, like clustering / analyzing when I’m at those locations, or whether the clusters move over time, etc. But let’s stick to the basics for now.

In [118]:
'''
As we have too much data for our VM to handle it, we will do a per day clustering, then
another clustering for all days on top of it.
'''

from datetime import timedelta
from sklearn.cluster import DBSCAN
from geopy.distance import great_circle
import numpy as np

kms_per_radian = 6371.0088
epsilon = .25 / kms_per_radian  # 250 m neighborhood radius, expressed in radians for the haversine metric

tmin = data.index.min()
tmin = pd.to_datetime('{}-{}-{}'.format(tmin.year,tmin.month,tmin.day))
tmax = data.index.max()
tmax = pd.to_datetime('{}-{}-{}'.format(tmax.year,tmax.month,tmax.day))+timedelta(days=1)

rng = pd.date_range(tmin,tmax)

centroids = pd.DataFrame(columns=['lat','lon'])
for d in rng:
    ds = '{}-{}-{}'.format(d.year, d.month, d.day)
    tmp = data.loc[ds]  # all points recorded on that day (partial string indexing on the DatetimeIndex)
    if len(tmp) > 0:
        coordinates = tmp[['lat', 'lon']].values
        minSamples = max(1, len(tmp) // 10)  # a cluster must hold roughly 10% of that day's points
        db = DBSCAN(eps=epsilon, min_samples=minSamples, algorithm='ball_tree', metric='haversine').fit(np.radians(coordinates))
        y = db.labels_
        X = pd.DataFrame(coordinates, columns=['lat','lon'])
        Y = pd.DataFrame(y, columns=['label'])
        res = pd.concat([X,Y], axis=1, join='inner')
        n_clusters = len(set(res['label'])) - (1 if -1 in set(res['label']) else 0)
        for i in range(n_clusters):
            el = res[res['label'] == i].mean(axis=0).drop('label')
            el['timestamp'] = ds
            centroids = centroids.append(el, ignore_index=True)
centroids = centroids.set_index('timestamp')

print('{} centroids of locations kept in "centroids".'.format(len(centroids)))
centroids.head(4)
1612 centroids of locations kept in "centroids".
Out[118]:
                  lat        lon
timestamp
2011-12-14  45.503960 -73.664508
2011-12-14  45.500742 -73.661086
2011-12-15  45.507178 -73.831019
2011-12-15  45.528192 -73.838333

Clustering of all days

Now we cluster all the centroids of the per day clusters we determined at the previous step.

In [133]:
'''
DBSCAN clustering taking into account the spherical earth
source: http://geoffboeing.com/2014/08/clustering-to-reduce-spatial-data-set-size/
'''

from sklearn.cluster import DBSCAN
from geopy.distance import great_circle
import numpy as np

kms_per_radian = 6371.0088
epsilon = .5 / kms_per_radian  # 500 m neighborhood radius, expressed in radians
minSamples = len(centroids) // 20 # Since this is the second pass, we might want to detect vacation spots or such,
                                  # in which case we would have to lower the min number of samples to 1-2

coordinates=centroids[['lat','lon']].values

db = DBSCAN(eps=epsilon, min_samples=minSamples, algorithm='ball_tree', metric='haversine').fit(np.radians(coordinates))
y = db.labels_
print ('List of generated labels for {} clusters: {}'.format(len(set(y)),set(y)))
List of generated labels for 5 clusters: {0, 1, 2, 3, -1}

Calculation of the overall centroids

And we calculate the centroids of those overall clusters.

In [136]:
overallCentroids = pd.DataFrame(columns=['lat','lon'])
X = pd.DataFrame(coordinates, columns=['lat','lon'])
Y = pd.DataFrame(y, columns=['label'])
res = pd.concat([X,Y], axis=1, join='inner')
overall_n_clusters = len(set(res['label'])) - (1 if -1 in set(res['label']) else 0)
for i in range(overall_n_clusters):
    el = res[res['label'] == i].mean(axis=0).drop('label')
    overallCentroids = overallCentroids.append(el, ignore_index=True)

print('{} overall centroids of locations kept in "overallCentroids".'.format(len(overallCentroids)))
overallCentroids
4 overall centroids of locations kept in "overallCentroids".
Out[136]:
         lat        lon
0  45.500931 -73.665334
1  45.506030 -73.832161
2  45.524857 -73.833701
3  45.926803 -74.913104

Display the cluster locations on the map

We can now print those cluster centroids on a map and verify that where I spend the most time is at home, at work and at my cottage. Because of the way location was estimated on older Android phones, my house appears in two distinct spots, something I had noticed before while looking at my location history. But overall, the method correctly found my three main locations.

In [139]:
import folium
import numpy as np

colorsList = ['red',
            'blue',
            'green',
            'orange',
            'purple',
            'pink',
            'gray',
            'cadetblue',
            'darkred',
            'darkblue',
            'darkgreen',
            'darkpurple',
            'lightgray',
            'lightred',
            'beige',
            'lightgreen',
            'lightblue',
            'white',
            'black']

centCoordinates=overallCentroids[['lat','lon']].values

m = folium.Map(location=[45.6, -73.8], zoom_start=9)

for i, r in enumerate(centCoordinates):
    folium.Marker(
        location=[r[0], r[1]],
        # cycle through the color list in case there are more clusters than colors
        icon=folium.Icon(color=colorsList[i % len(colorsList)])
    ).add_to(m)

m
Out[139]:
[Folium map showing the four overall cluster centroids]
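
Not part of the notebook above, but a quick follow-up one could do: rerun the second-pass clustering with a wider radius and check whether the two near-duplicate house centroids merge into a single cluster. This reuses the centroids DataFrame built earlier; the 2.5 km radius is a guess, not a tuned value.

'''
Sanity check (not in the original analysis): widen eps in the second pass and
see if the duplicated "house" centroids collapse into one cluster.
'''
from sklearn.cluster import DBSCAN
import numpy as np

kms_per_radian = 6371.0088
wide_epsilon = 2.5 / kms_per_radian  # much wider than the 0.5 km used above

db_wide = DBSCAN(eps=wide_epsilon, min_samples=int(len(centroids) / 20),
                 algorithm='ball_tree', metric='haversine').fit(
                     np.radians(centroids[['lat', 'lon']].values))
print('Labels with a 2.5 km radius:', set(db_wide.labels_))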

Ericsson Hackathon 2016

Hi all! Last week I participated in the Ericsson Hackathon from the Montréal location, and I wanted to share our achievement with you.

First, let me thank the incredible members of my team: Marc-Olivier, Mahdy, Antonio and Marc.

The short version of our work is contained in the Hackathon video we made; a slightly more fleshed-out version follows in the lines below.

What can be done in 24 hours? We decided to start from the Montréal open data sets and create an interactive map of the parkable streets at a specific time. Specifically, we used a database of all the street signs in the city and a database of all the street side segments. The street sign database is an Excel file of 308 685 entries and the street side database is a GeoJSON file of 104 431 entries.
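
For illustration, loading those two data sets could look roughly like this; the file names below are placeholders, not the actual Montréal open data file names.

'''
Illustrative sketch only: the file names are placeholders for the actual
Montréal open data downloads.
'''
import json
import pandas as pd

# Street sign inventory (Excel, ~308 685 rows)
signs = pd.read_excel('street_signs.xlsx')

# Street side segments (GeoJSON, ~104 431 features)
with open('street_sides.geojson') as f:
    street_sides = json.load(f)['features']

print('{} signs and {} street side segments loaded.'.format(len(signs), len(street_sides)))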

My main task during the Hackathon was to parse the street sign database and associate the location of each sign with a single street segment from the street side database. Each street side segment usually covers one street corner to the next and as such is represented by at least two points; curved streets, for instance, are represented by more points. So to figure out which street segment a sign is located on, we need to find the street segment with the smallest distance to the sign. Below is the formula for the distance from a point to a line through two points, according to Wikipedia. And this doesn’t even take into account that our lines are segments, and as such bounded rather than infinite.

\operatorname{distance}(P_1, P_2, (x_0, y_0)) = \frac{\lvert (y_2 - y_1)x_0 - (x_2 - x_1)y_0 + x_2 y_1 - y_2 x_1 \rvert}{\sqrt{(y_2 - y_1)^2 + (x_2 - x_1)^2}}
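
In code, the bounded version of that distance (from the point to a finite segment rather than to an infinite line) can be sketched as follows. This is a planar approximation, so the lat/lon pairs would first have to be projected to metres, and it is not the exact code we wrote during the hackathon.

'''
Sketch of a point-to-segment distance on planar coordinates.
'''
import math

def point_to_segment_distance(px, py, ax, ay, bx, by):
    # Vectors from segment end A to end B, and from A to the point P
    abx, aby = bx - ax, by - ay
    apx, apy = px - ax, py - ay
    ab_len_sq = abx * abx + aby * aby
    if ab_len_sq == 0:
        # Degenerate segment: both end points coincide
        return math.hypot(apx, apy)
    # Projection of AP onto AB, clamped so the closest point stays on the segment
    t = max(0.0, min(1.0, (apx * abx + apy * aby) / ab_len_sq))
    cx, cy = ax + t * abx, ay + t * aby
    return math.hypot(px - cx, py - cy)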

So in order to limit the processing required, we used a few tricks. First, knowing that segments are of limited size (at most from one corner to the next), we extracted the first latitude and longitude of each street segment and kept only the segments close to the street sign. Next, for the retained street segments, we computed the actual distance from the sign to every point defining them and kept only the closest ones. Finally, for this small subset of about 10 street segments we did the full distance computation from the sign to the street segment, taking the bounds of the segment into account. The result is the street segment we want to retain. A rough sketch of the first two filtering steps follows the figure below.

[Figure: street side segments and the signs associated with them]
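
As mentioned above, here is a rough sketch of those first two filtering steps; the 100 m radius, the dictionary layout of the segments and the cut to 10 candidates are illustrative values, not the exact ones we used.

'''
Rough sketch of the coarse-to-fine candidate filtering; thresholds and data
layout are illustrative.
'''
import math

def approx_meters(lat1, lon1, lat2, lon2):
    # Equirectangular approximation, good enough for a coarse filter
    m_per_deg = 111320.0  # metres per degree of latitude
    dx = (lon2 - lon1) * m_per_deg * math.cos(math.radians((lat1 + lat2) / 2))
    dy = (lat2 - lat1) * m_per_deg
    return math.hypot(dx, dy)

def candidate_segments(sign_lat, sign_lon, segments, radius_m=100.0, keep=10):
    # Step 1: keep only segments whose first vertex is near the sign
    coarse = [s for s in segments
              if approx_meters(sign_lat, sign_lon, s['points'][0][0], s['points'][0][1]) < radius_m]
    # Step 2: rank the survivors by the distance of their closest vertex to the sign
    def closest_vertex(seg):
        return min(approx_meters(sign_lat, sign_lon, lat, lon) for lat, lon in seg['points'])
    return sorted(coarse, key=closest_vertex)[:keep]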

We split the sign database into roughly four parts and let the process run on four independent VMs in our OpenStack cluster. It took each of them around four to five hours to complete the computation, so around 17 hours of processing in total.
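
Splitting the sign table into four chunks for the VMs is straightforward; something along these lines, assuming the signs DataFrame from the loading sketch above.

'''
Illustrative only: split the sign table into four chunks, one per VM.
'''
import numpy as np

for i, chunk in enumerate(np.array_split(signs, 4)):
    chunk.to_csv('signs_part_{}.csv'.format(i), index=False)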

Other people in the team concentrated on parsing the text of the parking signs in order to make them computer understandable. And finally, some were working on assembling a website with the Ericsson look and feel.

You can find some more information about the tools we used and some references in the video. All in all, it was an interesting experience. The geo parsing took longer than we anticipated. The sign text parsing was harder than we anticipated. The map generation with the trio of Leaflet, Mapbox and OpenStreetMap was really easy. In the end we got the results we were hoping for!