{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# The Story Told by Your Location Data\n", "\n", "**By: Corrine, Brad, and Ben**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction\n", "Modern applications routinely use location to enhance the user experience. While many of these uses are beneficial, they raise another question: how do we keep this data anonymous? Given how frequently online services record personal location data, is it possible to identify individuals from “anonymous” location data? Do these location-collecting features allow inference of personal information such as gender, name, home address, or even unique identity? This project explores these questions to determine whether location data constitutes a major privacy infringement and whether it should be publicly available.\n", "\n", "# Overview\n", "We start with a small dataset: the location history of an Android phone over one month in 2014. Without any prior knowledge, our goal is to find out as much as possible about the individual carrying this phone. We then move on to two distinct social-network datasets to see how location data can be exploited within a group of users.\n", "\n", "# Android Phone Data\n", "This dataset consists of the locations of an Android phone for one month. A plain-text example of an entry is shown below."
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Datasets/phone_history.json\n", "\n", " \"locations\" : [ {\n", " \"timestampMs\" : \"1415051512187\",\n", " \"latitudeE7\" : 404212794,\n", " \"longitudeE7\" : -36286372,\n", " \"accuracy\" : 34\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because this structure is not easy to work with, we wrote a Python script to convert it into a friendlier format. We also converted latitudeE7 and longitudeE7 into plain latitude/longitude, and the millisecond timestamps into proper datetime values for readability." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "def convert_json_history(filename):\n", " # Parse the raw Takeout JSON and write a tab-separated file\n", " f = open(filename)\n", " file_str = f.read()\n", " f.close()\n", " phone_data = json.loads(file_str)['locations']\n", "\n", " # process into a normal txt\n", " phone_data_new = open('Datasets/phone_data.txt', 'w+')\n", " phone_data_new.write('Dates\\tLat\\tLong\\tAccuracy\\n')\n", " for d in phone_data:\n", " time = str(pd.to_datetime(d['timestampMs'], unit='ms'))\n", " lat = int(d['latitudeE7'])/(10**7)\n", " lon = int(d['longitudeE7'])/(10**7)\n", " acc = d['accuracy']\n", " line = '{}\\t{}\\t{}\\t{}\\n'.format(time,lat,lon,acc)\n", " phone_data_new.write(line)\n", " phone_data_new.close()\n", " print('finished making file!')\n", "\n", "#convert_json_history('Datasets/phone_history.json') # Already ran, created phone_data.txt in Datasets" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note: the JSON dataset was obtained from [Google's location data takeout](https://takeout.google.com/settings/takeout/custom/location_history?pli=1), so this function works for any dataset generated there.\n", "\n", "# Data Analysis\n", "Now that we have processed the data into a more manageable form, we can start to look at the information."
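As a minimal illustration of the conversion the script performs (the sample values are copied from the entry above), the E7 integers are divided by 10^7 and the millisecond timestamp becomes a UTC datetime:

```python
from datetime import datetime, timezone

# Sample record from phone_history.json (values copied from the entry above)
record = {"timestampMs": "1415051512187", "latitudeE7": 404212794, "longitudeE7": -36286372}

lat = record["latitudeE7"] / 10**7   # 40.4212794
lon = record["longitudeE7"] / 10**7  # -3.6286372
ts = datetime.fromtimestamp(int(record["timestampMs"]) / 1000, tz=timezone.utc)
print(ts, lat, lon)
```

This is the same arithmetic `convert_json_history` applies per record; `pd.to_datetime(..., unit='ms')` does the timestamp step in bulk.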
] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import datetime\n", "import warnings\n", "from urllib.request import urlopen\n", "import json\n", "import re\n", "\n", "warnings.filterwarnings('ignore')" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Dates Lat Long Accuracy\n", "0 2014-11-03 21:51:52.187 40.421279 -3.628637 34\n", "1 2014-11-03 21:50:50.228 40.421265 -3.628646 35\n", "2 2014-11-03 21:49:50.132 40.421271 -3.628650 34\n", "3 2014-11-03 21:48:50.127 40.421274 -3.628639 34\n", "4 2014-11-03 21:47:49.271 40.421286 -3.628635 33" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "phone_df = pd.read_csv('Datasets/phone_data.txt', parse_dates=['Dates'], sep='\\t')\n", "phone_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The dataset consists of a time, a latitude, and a longitude. First, we combine latitude and longitude into a single column; this makes the data easier to handle and prepares it for Bing's location API, which takes an ordered (latitude, longitude) pair." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Dates Accuracy Location\n", "0 2014-11-03 21:51:52.187 34 (40.421279399999996, -3.6286372000000005)\n", "1 2014-11-03 21:50:50.228 35 (40.4212652, -3.6286462999999998)\n", "2 2014-11-03 21:49:50.132 34 (40.421271000000004, -3.6286498999999997)\n", "3 2014-11-03 21:48:50.127 34 (40.4212744, -3.6286388)\n", "4 2014-11-03 21:47:49.271 33 (40.421286200000004, -3.6286354)" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "phone_df['Location'] = tuple(phone_df[['Lat','Long']].values)\n", "phone_df = phone_df.drop(columns = ['Lat', 'Long'])\n", "phone_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we want to look at the location data more closely. To do this, we sort the records by date and then group them by day." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Dates Accuracy \\\n", "43780 2014-09-30 21:54:03.688 27 \n", "43779 2014-09-30 21:55:03.956 21 \n", "43778 2014-09-30 21:56:03.888 26 \n", "43777 2014-09-30 21:57:03.784 35 \n", "43776 2014-09-30 21:58:03.933 37 \n", "\n", " Location \n", "43780 (40.4212446, -3.6286241) \n", "43779 (40.4212787, -3.6285733999999996) \n", "43778 (40.421249200000005, -3.6286188999999998) \n", "43777 (40.421282, -3.6286157) \n", "43776 (40.4212636, -3.6286042000000003) \n", " Dates Accuracy Location\n", "4 2014-11-03 21:47:49.271 33 (40.421286200000004, -3.6286354)\n", "3 2014-11-03 21:48:50.127 34 (40.4212744, -3.6286388)\n", "2 2014-11-03 21:49:50.132 34 (40.421271000000004, -3.6286498999999997)\n", "1 2014-11-03 21:50:50.228 35 (40.4212652, -3.6286462999999998)\n", "0 2014-11-03 21:51:52.187 34 (40.421279399999996, -3.6286372000000005)\n" ] } ], "source": [ "df = phone_df.sort_values('Dates')\n", "print(df.head())\n", "print(df.tail())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By this, we know that the data starts
off with 2014-09-30 and ends at 2014-11-03. Next, we will look more closely at the locations themselves.\n", "\n", "Specifically, we are interested in the following questions:\n", "1. How often does the Android phone record location data?\n", "2. What are some common places that the person goes to?\n", "3. What information can be inferred from our previous answers?\n", "4. How likely is it that we can identify this individual?\n", "\n", "To do this, we first need to separate the date and time." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "df['Time'] = df['Dates'].dt.time\n", "df['Dates'] = df['Dates'].dt.date\n", "df = df[['Dates', 'Time', 'Location', 'Accuracy']]" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Dates Time Location \\\n", "43780 2014-09-30 21:54:03.688000 (40.4212446, -3.6286241) \n", "43779 2014-09-30 21:55:03.956000 (40.4212787, -3.6285733999999996) \n", "43778 2014-09-30 21:56:03.888000 (40.421249200000005, -3.6286188999999998) \n", "43777 2014-09-30 21:57:03.784000 (40.421282, -3.6286157) \n", "43776 2014-09-30 21:58:03.933000 (40.4212636, -3.6286042000000003) \n", "\n", " Accuracy \n", "43780 27 \n", "43779 21 \n", "43778 26 \n", "43777 35 \n", "43776 37 " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Location\n", "Dates \n", "2014-09-30 126\n", "2014-10-01 1307\n", "2014-10-02 1349\n", "2014-10-03 1372\n", "2014-10-04 1413" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Now we check distinct places per day.\n", "df_day = df[['Dates', 'Location']]\n", "df_freq = df_day.groupby(['Dates']).count()\n", "df_freq.head()" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Location\n", "count 35.000000\n", "mean 1250.885714\n", "std 254.348044\n", "min 126.000000\n", "25% 1262.000000\n", "50% 1338.000000\n", "75% 1383.000000\n", "max 1429.000000" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_freq.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From this, we can already see that the Android phone records the location at least 126 times a day, and about 1,251 times a day on average. \n", "Next, we will look at specific locations. " ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Dates\n", "Location \n", "(39.847814, -5.6267378) 1\n", "(39.8596804, -5.613139) 1\n", "(39.871458399999995, -5.582563200000001) 1\n", "(39.871547, -5.5995402) 1\n", "(39.8723041, -5.5697503) 2" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_loc = df_day.groupby('Location').count()\n", "df_loc.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We grouped the locations and counted how many times each was visited over the month. Clearly, there are many incidental locations where the person happened to be only once or twice; these are not interesting to examine, since there is not enough information about them to support any logical interpretation. Hence, we decided to focus only on the most frequent locations.\n", "\n", "We generate the most frequent ones below." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Count\n", "Location \n", "(40.4202487, -3.6303093) 1013\n", "(40.115815399999995, -5.949015) 913\n", "(40.4207664, -3.6332066) 754\n", "(40.419880799999994, -3.630936) 675\n", "(40.4207763, -3.6332088) 655" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_locations = df_loc.sort_values('Dates', ascending=False)\n", "df_locations = df_locations.rename(index=str, columns= {'Dates':'Count'})\n", "df_locations.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The above are the most-visited locations. We will feed them into a reverse-geocoding API to get better insight into the locations themselves. To limit the number of queries, we only consider locations with more than 10 visits over the month, which filters out the uninformative ones." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "296\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " Location Count\n", "0 (40.4202487, -3.6303093) 1013\n", "1 (40.115815399999995, -5.949015) 913\n", "2 (40.4207664, -3.6332066) 754\n", "3 (40.419880799999994, -3.630936) 675\n", "4 (40.4207763, -3.6332088) 655" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_new = df_locations.reset_index()\n", "freq_locations = df_new[df_new['Count']>10]\n", "freq_locations = freq_locations.rename(index=str, columns={'Dates':'Count'})\n", "print(len(freq_locations))\n", "freq_locations.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This reduces the data to 296 entries, which is much more manageable. \n", "\n", "Below is the code that generates readable addresses with the Bing API and saves them to a text file to conserve API quota." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\"\\nimport geocoder\\nbing_key = 'AiEfap-qUoZalL1qK8ollM-SwVdoJFemh60tHo0EeraVYP8V4WPJXAVD2YjqzgA1'\\ncoordinates = freq_locations['Location']\\naddr_file = open('Datasets/phone_address_new.txt', 'w+', encoding='utf-8')\\nfor cord in coordinates:\\n cord_list = list(cord)\\n g = geocoder.bing(cord_list, method = 'reverse', key = bing_key)\\n for r in g:\\n line_str = r.address + ',' + r.city + ',' + r.country + '\\n'\\n addr_file.write(str(cord_list) + ' : ' + line_str)\\n print('{} : {}'.format(cord_list, line_str))\\naddr_file.close()\\nprint('finish address!')\\n\"" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "'''\n", "import geocoder\n", "bing_key = 'AiEfap-qUoZalL1qK8ollM-SwVdoJFemh60tHo0EeraVYP8V4WPJXAVD2YjqzgA1'\n", "coordinates = freq_locations['Location']\n", "addr_file = open('Datasets/phone_address_new.txt', 'w+', encoding='utf-8')\n", "for cord in coordinates:\n", " cord_list = list(cord)\n", " g = geocoder.bing(cord_list, method = 'reverse', key = bing_key)\n", " 
for r in g:\n", " line_str = r.address + ',' + r.city + ',' + r.country + '\\n'\n", " addr_file.write(str(cord_list) + ' : ' + line_str)\n", " print('{} : {}'.format(cord_list, line_str))\n", "addr_file.close()\n", "print('finish address!')\n", "'''\n", "# Already ran, created phone_address_new.txt in Datasets using Bing API" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To preserve our results and avoid exceeding the free query limit, we saved the output to a text file and load it from there afterwards; this is why the code above is commented out.\n", "\n", "After cross-checking coordinates against addresses, it turns out that because the location data is tracked so precisely, many distinct coordinates represent the same general area. We therefore did another pass to reduce this redundancy by summing the counts of any shared addresses, which reduced the number of places to 152." ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Most frequent visited places: 152\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " Count\n", "Address \n", " Calle de Nicolás Salmerón, 7, 28017 Madrid (Ma... 2971\n", " Autovía del Norte, 28108 Alcobendas (Madrid),A... 1661\n", " Calle de Villasilos, 8B, 28017 Madrid (Madrid)... 1451\n", " Calle de Nicolás Salmerón, 17, 28017 Madrid (M... 1429\n", " Avenida de Bruselas, 37, 28108 Alcobendas (Mad... 1147\n", " Calle de Matamorosa, 3, 28017 Madrid (Madrid),... 1013\n", " CC-51, 10617 El Torno (Cáceres),El Torno,Spain 913\n", " Avenida Bruselas, 31, 28108 Alcobendas (Madrid... 768\n", " Calle de la Caléndula, 87, 28109 Alcobendas (M... 574\n", " Avenida de Fuencarral, 18, 28108 Alcobendas (M... 567" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "address = pd.read_csv('Datasets/phone_address_new.txt', sep=':', header = None)\n", "freq_locations['Address'] = address[1].values\n", "freq = freq_locations.groupby(['Address']).sum().sort_values('Count', ascending=False)\n", "print('Most frequent visited places: {}'.format(len(freq)))\n", "freq[:10]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Below is a map of the phone's precise travel points plotted by coordinate. We use the full set of frequent coordinates rather than the reduced per-address version because we want the detailed points. The map shows the person's general area of operation throughout this month."
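As noted above, many precise fixes collapse into a single address. An API-free alternative (a sketch of the idea, not what we actually used) is to round coordinates to about three decimal places, which bins points into cells of roughly 110 m:

```python
# Coordinate-rounding sketch (assumption: ~3 decimal places ≈ 110 m cells);
# sample fixes copied from the tables above
points = [(40.4212794, -3.6286372), (40.4212652, -3.6286463), (40.4207664, -3.6332066)]
cells = {(round(lat, 3), round(lon, 3)) for lat, lon in points}
print(len(cells))  # the first two fixes fall into the same cell
```

This trades precision for deduplication without spending any geocoding queries.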
] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "def tuple_str_to_list(string):\n", " # parse a \"(lat, lon)\" string back into [lat, lon]\n", " string = string.replace('(', '')\n", " string = string.replace(')', '')\n", " return [float(s) for s in string.split(',')]" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "scrolled": true }, "outputs": [], "source": [ "import folium\n", "# build the location list\n", "coordinates = freq_locations['Location'].tolist()\n", "start = tuple_str_to_list(coordinates[0])\n", "m = folium.Map(location=start, zoom_start=10)\n", "for c in coordinates:\n", " c_list = tuple_str_to_list(c)\n", " folium.Marker(c_list, popup=str(c_list)).add_to(m)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "m" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With API limitations and time constraints, we did not manage to match every address to a notable location. However, by manually looking up the top 10 most visited locations, we can form some basic ideas about the person. Our results are as follows:\n", "1. Calle de Nicolás Salmerón - Sports Club\n", "2. Autovía A-1 - Route\n", "3. Shango - Bar \n", "4. Calle de Nicolás Salmerón - Residential Area\n", "5. Parking Lot close to Amusement Center\n", "6. Sala Bohemia - Music School\n", "7. CC-51 - Route close to Gas Station\n", "8. Different Parking Lot close to Amusement Center\n", "9. Car Rental Agency - Calle de la Caléndula\n", "10. Aist Madrid - Medical Clinic\n", "\n", "Based on these, it seems safe to infer that the person helps people travel to places and lives on Calle de Nicolás Salmerón. There are constant visits to a sports club, so we might assume the person is male. \n", "\n", "# Conclusion - Phone Data\n", "This is as much as we can do with this dataset for now. In the future, we could cross-reference external datasets, such as taxi histories or records of other means of transportation, and see whether there are correlations that could identify the person. Even so, this dataset alone reveals the following information, which could be considered a breach of privacy:\n", "1. Personal home address\n", "2. Occupation\n", "3. General area of mobility\n", "4. Habits\n", "5. Gender (inferred)\n", "6. Private information, such as visits to a medical clinic\n", "\n", "Given how frequently the Android phone records location each day, the data could reveal much more if collected for longer than this single month.\n", "\n", "Next, we will look at social-network location datasets and see whether we can find similar information for groups of users rather than a single individual." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Brightkite - Reading Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We start by reading the CSV files of our datasets into a pandas DataFrame using pd.read_csv." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "Brightkite_data = pd.read_csv('Datasets/Brightkite.txt', delim_whitespace = True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can see below that the Brightkite file contains 5 columns: UserID, Time, Latitude, Longitude, and PlaceID \n", "\n", "UserID: An ID number corresponding to an individual User. 
These are currently repeated (not linkably) across datasets, so we will have to find a way to ensure all IDs are unique in the future \n", "\n", "Time: The time(s) a user visited a location \n", "\n", "Latitude and Longitude: coordinates of actual location \n", "\n", "PlaceID: a place that corresponds to that location " ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "ename": "NameError", "evalue": "name 'Brightkite_data' is not defined", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mNameError\u001b[0m Traceback (most recent call last)", "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[0mBrightkite_data\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mcolumns\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[1;31mNameError\u001b[0m: name 'Brightkite_data' is not defined" ] } ], "source": [ "Brightkite_data.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Combine Latitude and Longitude into one column: Coordinates" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "Brightkite_data['Coordinates'] = tuple(Brightkite_data[['Latitude','Longitude']].values)\n", "Brightkite_data.drop(['Latitude','Longitude'], axis=1, inplace = True)\n", "Brightkite_data.columns" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "ename": "NameError", "evalue": "name 'Brightkite_data' is not defined", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mNameError\u001b[0m Traceback (most recent call last)", "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m 
\u001b[0mBrightkite_data\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;34m'UserID'\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0munique\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[1;31mNameError\u001b[0m: name 'Brightkite_data' is not defined" ] } ], "source": [ "Brightkite_data['UserID'].unique()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "Brightkite_data.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "Brightkite_data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Preprocessing Data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#this code is used to find places with unusually high rates of visitors\n", "place_groups = Brightkite_data.groupby(['Coordinates'], group_keys=True)\n", "location_counts = place_groups['Coordinates'].count()\n", "location_rankings = location_counts.sort_values().tail(40)\n", "location_rankings" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#remove NAs. There are over 256000 datapoints at (0.0,0.0). 
This is the middle of nowhere.\n", "#It is safe to assume these people didn't actually go there.\n", "Brightkite_data = Brightkite_data[Brightkite_data.Coordinates != (0.0,0.0)]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "group_by_user = Brightkite_data.groupby(['UserID'], group_keys = True)#group by person\n", "places_per_user = group_by_user['Coordinates'].unique() #find number of unique locations each person visited" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "places_per_user[0].shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "countin = 0\n", "countout = 0\n", "keepers = []\n", "minimum_places = 15\n", "for i in places_per_user:\n", " if i.size < minimum_places:\n", " countout += 1\n", " keepers.append(False)\n", " else:\n", " countin += 1\n", " keepers.append(True)\n", "print(countin, countout)\n", "print(places_per_user.index)\n", "print(type(places_per_user))\n", "#the only problem with keepers is that its index does not match the index of places per user.\n", "#if we can get index in keepers to match userID like it does in places per user it might help\n", "\n", "#now we have the list of all users who have more than minimum_places unique coordinates logged" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We were able to successfully separate users who provided sufficient data from those who didn't." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(countin, countout)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " Below you can see the location data of the users who were able to provide sufficient data, along with their user ID's." 
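As an aside on the index mismatch mentioned in the comments above, a boolean mask built with pandas keeps the UserID index automatically (a sketch using hypothetical stand-in data, not the project's actual series):

```python
import pandas as pd

# Hypothetical stand-in for places_per_user: unique-coordinate lists keyed by UserID
places_per_user = pd.Series({0: [(0.0, 0.0)] * 20, 3: [(1.0, 1.0)] * 5, 7: [(2.0, 2.0)] * 16})
minimum_places = 15

mask = places_per_user.apply(len) >= minimum_places  # boolean Series indexed by UserID
valid_users = places_per_user[mask].index
print(list(valid_users))  # only users with >= 15 unique places survive
```

Because `mask` shares the UserID index, it can be used directly to select users without maintaining a parallel `keepers` list.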
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(places_per_user[keepers].head(10))\n", "print(places_per_user[keepers].tail(10))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " At this point we can see all of the users that we want and the ones that we don't. Now it is time to filter them out." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Finalize the Processed Dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "invalid_users = []\n", "for i in places_per_user.index:\n", " if places_per_user[i].size < minimum_places:\n", " invalid_users.append(i)\n", "Brightkite_data = Brightkite_data[~Brightkite_data['UserID'].isin(invalid_users)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Convert Datetime into Date and Time" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The datasets we used present their \"Time\" information as strings that are hard to work with directly. To make use of this data, we had to convert it into something easier to utilize."
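The split performed below can be sketched on a single timestamp; pandas can also parse the whole string in one step with pd.to_datetime (the sample value is hypothetical, written in the ISO-8601 "date T time Z" shape the split code assumes):

```python
import pandas as pd

# Hypothetical check-in timestamp in the "date T time Z" shape the split code assumes
raw = pd.Series(["2010-10-17T01:48:53Z"])

parts = raw.str.split("T", n=1, expand=True)  # column 0: date, column 1: time
dt = pd.to_datetime(raw, utc=True)            # one-step alternative

print(parts[0][0], parts[1][0][:-4])  # date, then time trimmed to hours:minutes
print(dt.dt.strftime("%H:%M")[0])
```

Trimming the last four characters of the time string (`:SSZ`) is what the `str.slice(0,-4,1)` call below does.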
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Split the ISO timestamp string on 'T' into a date part and a time part\n", "time = Brightkite_data['Time'].str.split(\"T\", n=1, expand=True)\n", "\n", "# Replace the original combined column with separate Date and Time columns\n", "Brightkite_data.drop(columns=[\"Time\"], inplace=True)\n", "Brightkite_data['Date'] = time[0]\n", "Brightkite_data['Time'] = time[1]\n", "\n", "# Trim the trailing seconds and 'Z' suffix, keeping only hours and minutes\n", "Brightkite_data['Time'] = Brightkite_data['Time'].str.slice(0, -4, 1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "Brightkite_data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to tell which users have gone to the same locations together, we split the time strings into separate \"Date\" and \"Time\" columns and removed the old combined column, which was no longer necessary. We also trimmed the time down to just hours and minutes. This made the data much easier to work with." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Remove Duplicates and Save\n", "\n", "Users sometimes visit the same place multiple times a day, and they may also check in to the same place several times during one visit. We decided to eliminate these repeat visits and simply record whether or not a person visited a given place at all on a given day." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Keep only the first check-in per user, place, and day\n", "Brightkite_data = Brightkite_data[~Brightkite_data.duplicated(['UserID', 'Coordinates', 'Date'], keep='first')]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "Brightkite_data.to_csv('Datasets/Brightkite_light.txt')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Challenges\n", "Throughout this project, we encountered numerous challenges that we managed to overcome.\n", "1. Large datasets, which lead to long run times or simply cannot be processed on a laptop.\n", "Solution: Pre-process the dataset by removing low-value information, such as places that have only been visited once or twice, to reduce the number of entries.\n", "2. Usage limits on the Bing location API.\n", "Solution: As above, pre-process the dataset to make it smaller, so we neither risk exceeding the free quota nor suffer long run times.\n", "\n", "\n", "# Future Research\n", "There are various ways to take this concept to the next level and do much more. We discuss the phone dataset and the Brightkite dataset specifically.\n", "\n", "## Phone Dataset\n", "1. Use the accuracy column to get a more precise estimate of each location. The accuracy radius (in meters) could also be passed to Google's Places API to find points of interest within that circle.\n", "2. Resample the data at a smaller time interval, such as looking at location changes every 6 hours, to build a more precise routing map.\n", "\n", "## BrightKite Dataset\n", "1. Trace routes for multiple users at once to get a better sense of where people go.\n", "2. Pin down the likely home address of each user.\n", "\n", "Additionally, there is the possibility of learning more about the users by cross-referencing other datasets.\n", "\n", "\n", "# Conclusion\n", "Overall, we found that location data can be very dangerous for anyone to collect. With sufficient location history, it is rather easy to uncover users' private information using location data alone, as our analysis of both datasets clearly shows. We were able to find each user's most popular locations and general area of operations, and by knowing exactly where users normally go, we can draw further inferences about them. Users should therefore be very careful when sharing location data, because it can hand critical information to external parties who may use it against them. This is especially true on social media: since people commonly share their date of birth, knowing a user's precise home address as well makes it rather easy to uniquely identify them." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Appendix - Who Did What\n", "Ben - Ben did most of the planning and all of the code and analysis for the phone location dataset, which served as the foundation, and helped with the data analysis of the Brightkite dataset. He was also the person mainly in charge of researching tools and various APIs for reverse geocoding. He also contributed to the writing and the presentation slides.\n", "\n", "Brad - Brad contributed to the Brightkite dataset analysis, including writing code to analyze and process the Brightkite dataset. He also contributed to slide creation. Additionally, he brought food for everyone during meeting times.\n", "\n", "Corrine - Corrine did most of the reading and research for our topic. She also contributed to working with the Brightkite dataset, analyzing and processing data alongside Brad. She also did most of the proof-reading and editing for all our writing assignments, including all our checkpoints and slides." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.1" } }, "nbformat": 4, "nbformat_minor": 2 }