As some of you might know, I’m undertaking a “new career” as a data scientist. One of my upcoming projects might require geolocation analysis. To prepare myself and to warm up, I did some experiments, playing with the location data I downloaded from Google. I wanted to see how I could cluster it and determine the locations where I spend the most time. As if I didn’t know!

As a first step, I downloaded my location history from Google as a JSON file. Then I loaded that data in a Jupyter Notebook (I’m using a Python 3 kernel); in my case I read 500547 location data points. Since that is too many points to cluster all at once, I decided on an approach where I first cluster the data on a per-day basis and in a second step cluster all the days together.

Obviously more could be done on the data with such an approach, like an analysis of when I’m at those locations, or whether the clusters move over time, etc. But let’s stay with the basics for now.

Once the per-day clustering is done, I find the centroids of those clusters and add them to a list of per-day cluster centroids. I then simply apply the same approach to that list and get the centroids of those new, overall clusters.

Finally, I can print those cluster centroids on a map and verify that the places where I spend the most time are home, work and my cottage. Because of the way location was estimated on older Android phones (and because I didn’t optimize the density parameter of the DBSCAN algorithm), my house appears in two distinct spots, something I had noticed before while looking at my location history. But overall, the method correctly found my three main locations.

If you want the details of the method, the HTML version of my notebook is just below.

I’ll keep you posted as I do more experiments!

# Location History

In this notebook I show how I used clustering techniques to find the locations where I spend the most time.

## Loading Google Location History

First I downloaded my location history as JSON from Google Takeout: https://takeout.google.com/settings/takeout. The next step is to load that data, in my case 500547 location data points.

```
'''
Loading the location data
'''
import json
import pandas as pd

filename = 'LocationHistory.json'

# Open the location history
with open(filename) as data_file:
    loc_data = json.load(data_file)

# Build the data frame of locations; lat/lon are stored as integers scaled by 1e7
locations = []
for loc in loc_data['locations']:
    tmp = {}
    tmp['timestamp'] = pd.to_datetime(int(loc['timestampMs']), unit='ms')
    tmp['lat'] = loc['latitudeE7'] / 10000000
    tmp['lon'] = loc['longitudeE7'] / 10000000
    locations.append(tmp)

data = pd.DataFrame(locations)
data = data.set_index('timestamp')
print('{} locations loaded in "data".'.format(len(data)))
data.head(4)
```
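Before clustering, it can be handy to sanity-check the frame, for instance by slicing a single day out of the `DatetimeIndex`. A minimal, self-contained sketch (the sample points below are made up for illustration, not from my actual history):

```python
import pandas as pd

# Hypothetical mini-frame with the same shape as "data" above
sample = pd.DataFrame(
    {'lat': [45.50, 45.51, 45.52], 'lon': [-73.60, -73.61, -73.62]},
    index=pd.to_datetime(['2017-01-01 08:00', '2017-01-01 18:00', '2017-01-02 09:00']),
)
sample.index.name = 'timestamp'

# Partial string indexing on a DatetimeIndex selects a whole day at once
one_day = sample.loc['2017-01-01']
print(len(one_day))  # 2 points on that day
```

The same partial string indexing is what the per-day loop below relies on.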

## Per day clustering and centroid determination

Since that number of data points is too large to cluster all at once, I decided on an approach where I first cluster the data on a per-day basis. In a next step we will cluster all the days together. Obviously more could be done, like an analysis of when I’m at those locations, or whether the clusters move over time, etc. But let’s stay with the basics for now.

```
'''
As we have too much data for our VM to handle at once, we do a per-day clustering,
then another clustering on top of it for all days.
'''
from datetime import timedelta
from sklearn.cluster import DBSCAN
from geopy.distance import great_circle
import numpy as np

kms_per_radian = 6371.0088
epsilon = .25 / kms_per_radian  # 250 m neighborhood radius, in radians

tmin = data.index.min()
tmin = pd.to_datetime('{}-{}-{}'.format(tmin.year, tmin.month, tmin.day))
tmax = data.index.max()
tmax = pd.to_datetime('{}-{}-{}'.format(tmax.year, tmax.month, tmax.day)) + timedelta(days=1)
rng = pd.date_range(tmin, tmax)

rows = []
for d in rng:
    ds = '{}-{}-{}'.format(d.year, d.month, d.day)
    # String slicing on the DatetimeIndex returns an empty frame for days with no points
    tmp = data.loc[ds:ds]
    if len(tmp) > 0:
        coordinates = tmp[['lat', 'lon']].values
        # Require a cluster to hold at least a tenth of the day's points
        minSamples = max(1, len(tmp) // 10)
        db = DBSCAN(eps=epsilon, min_samples=minSamples, algorithm='ball_tree',
                    metric='haversine').fit(np.radians(coordinates))
        y = db.labels_
        X = pd.DataFrame(coordinates, columns=['lat', 'lon'])
        Y = pd.DataFrame(y, columns=['label'])
        res = pd.concat([X, Y], axis=1, join='inner')
        # Label -1 marks noise points, not a cluster
        n_clusters = len(set(res['label'])) - (1 if -1 in set(res['label']) else 0)
        for i in range(n_clusters):
            el = res[res['label'] == i].mean(axis=0).drop('label')
            el['timestamp'] = ds
            rows.append(el)

centroids = pd.DataFrame(rows).set_index('timestamp')
print('{} centroids of locations kept in "centroids".'.format(len(centroids)))
centroids.head(4)
```
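The per-day centroids can be sanity-checked by measuring the distances between them. geopy’s `great_circle` would do, but a plain haversine from the standard library works just as well. A self-contained sketch (the two points below are hypothetical, roughly 1 km apart):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2, r_km=6371.0088):
    """Great-circle distance in km between two (lat, lon) points."""
    p1, p2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(p1) * cos(p2) * sin(dlmb / 2) ** 2
    return 2 * r_km * asin(sqrt(a))

# Two made-up centroids: 0.01 degree of latitude apart (~1.1 km)
d = haversine_km(45.500, -73.600, 45.510, -73.600)
print(round(d, 2))  # 1.11
```

This is the same distance metric DBSCAN uses above (haversine on radians), so the epsilon values in kilometers translate directly.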

## Clustering of all days

Now we cluster all the centroids of the per day clusters we determined at the previous step.

```
'''
DBSCAN clustering taking into account the spherical earth
source: http://geoffboeing.com/2014/08/clustering-to-reduce-spatial-data-set-size/
'''
from sklearn.cluster import DBSCAN
from geopy.distance import great_circle
import numpy as np

kms_per_radian = 6371.0088
epsilon = .5 / kms_per_radian  # 500 m neighborhood radius, in radians
# Since this is the second pass, if we wanted to detect vacation spots or such,
# we might have to lower the minimum number of samples to 1-2
minSamples = max(1, len(centroids) // 20)
coordinates = centroids[['lat', 'lon']].values
db = DBSCAN(eps=epsilon, min_samples=minSamples, algorithm='ball_tree',
            metric='haversine').fit(np.radians(coordinates))
y = db.labels_
print('List of {} generated labels: {}'.format(len(set(y)), set(y)))
```
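As mentioned earlier, I didn’t optimize the density parameter. One simple way to pick `eps` is to sweep a few values and watch how the cluster count reacts. A self-contained sketch on synthetic points (the two blobs and the outlier are made up, not my data):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic example: two tight blobs of lat/lon points plus one far-away outlier
gen = np.random.default_rng(0)
blob1 = gen.normal([45.50, -73.60], 0.0005, size=(50, 2))
blob2 = gen.normal([45.90, -74.20], 0.0005, size=(50, 2))
coords = np.vstack([blob1, blob2, [[46.5, -75.0]]])

kms_per_radian = 6371.0088
for eps_km in (0.1, 0.25, 0.5):
    db = DBSCAN(eps=eps_km / kms_per_radian, min_samples=5,
                algorithm='ball_tree', metric='haversine').fit(np.radians(coords))
    labels = set(db.labels_)
    n_clusters = len(labels) - (1 if -1 in labels else 0)
    print('eps={} km -> {} clusters'.format(eps_km, n_clusters))
```

On real history data, an `eps` that is too small splits one place into several clusters (as happened with my house), while one that is too large merges nearby places.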

## Calculation of the overall centroids

And we calculate the centroids of those overall clusters.

```
X = pd.DataFrame(coordinates, columns=['lat', 'lon'])
Y = pd.DataFrame(y, columns=['label'])
res = pd.concat([X, Y], axis=1, join='inner')
# Label -1 marks noise points, not a cluster
overall_n_clusters = len(set(res['label'])) - (1 if -1 in set(res['label']) else 0)

rows = []
for i in range(overall_n_clusters):
    el = res[res['label'] == i].mean(axis=0).drop('label')
    rows.append(el)

overallCentroids = pd.DataFrame(rows, columns=['lat', 'lon'])
print('{} overall centroids of locations kept in "overallCentroids".'.format(len(overallCentroids)))
overallCentroids
```
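Since my house shows up as two nearby centroids, one post-processing option would be to merge any pair of overall centroids closer than some threshold. A minimal greedy sketch on hypothetical centroid values (this is not part of my original method, just an idea; `haversine_km` is a plain stdlib haversine):

```python
import pandas as pd
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2, r_km=6371.0088):
    """Great-circle distance in km between two (lat, lon) points."""
    p1, p2 = radians(lat1), radians(lat2)
    a = sin(radians(lat2 - lat1) / 2) ** 2 + \
        cos(p1) * cos(p2) * sin(radians(lon2 - lon1) / 2) ** 2
    return 2 * r_km * asin(sqrt(a))

# Hypothetical centroids: the first two stand in for the "duplicated house"
cents = pd.DataFrame({'lat': [45.500, 45.502, 45.70],
                      'lon': [-73.600, -73.601, -73.90]})

# Greedy merge: average any centroid into an existing one closer than max_km
max_km = 0.5
merged = []
for _, row in cents.iterrows():
    for m in merged:
        if haversine_km(row['lat'], row['lon'], m['lat'], m['lon']) < max_km:
            m['lat'] = (m['lat'] + row['lat']) / 2
            m['lon'] = (m['lon'] + row['lon']) / 2
            break
    else:
        merged.append({'lat': row['lat'], 'lon': row['lon']})

print('{} centroids after merging'.format(len(merged)))
```

With these made-up values the two close spots collapse into one, leaving two centroids. Raising `eps` in the second DBSCAN pass would achieve something similar, at the risk of merging genuinely distinct places.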

## Display the clusters locations on the map

We can now print those cluster centroids on a map and verify that the places where I spend the most time are home, work and my cottage. Because of the way location was estimated on older Android phones, my house appears in two distinct spots, something I had noticed before while looking at my location history. But overall, the method correctly found my three main locations.

```
import folium

colorsList = ['red', 'blue', 'green', 'orange', 'purple', 'pink', 'gray',
              'cadetblue', 'darkred', 'darkblue', 'darkgreen', 'darkpurple',
              'lightgray', 'lightred', 'beige', 'lightgreen', 'lightblue',
              'white', 'black']

centCoordinates = overallCentroids[['lat', 'lon']].values
m = folium.Map(location=[45.6, -73.8], zoom_start=9)
for i, r in enumerate(centCoordinates):
    # Cycle through the colors in case there are more centroids than colors
    color = colorsList[i % len(colorsList)]
    folium.Marker(
        location=[r[0], r[1]],
        icon=folium.Icon(color=color)
    ).add_to(m)
m
```