{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# The Story Told by Your Location Data\n",
    "\n",
    "**By: Corrine, Brad, and Ben**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Introduction\n",
    "Many modern applications use location data to enhance the user experience. While many of these uses are beneficial, they raise another question: how do we keep this data anonymous? Given how frequently online services track personal location data, is it possible to identify individuals from “anonymous” location data? Could these location-collecting features allow inference of personal information such as gender, name, location, or even unique identity? This project explores these questions to determine whether location data should be considered a major privacy infringement and whether it should be publicly available.\n",
    "\n",
    "# Overview\n",
    "We will start with a small dataset: the location history of an Android phone over a month in 2014. Without any prior knowledge, our goal is to find out as much as possible about the individual carrying this phone. We will then move on to two distinct social network datasets and see how location data can be exploited when it belongs to a group of users.\n",
    "\n",
    "# Android Phone Data\n",
    "We begin with some basic tasks to get an understanding of the dataset. Because the original dataset is in JSON format, we converted it into a tab-separated text file to make it easier to work with in Python. Below is some basic information about this dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 107,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Dates</th>\n",
       "      <th>Lat</th>\n",
       "      <th>Long</th>\n",
       "      <th>Accuracy</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2014-11-03 21:51:52.187</td>\n",
       "      <td>40.421279</td>\n",
       "      <td>-3.628637</td>\n",
       "      <td>34</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2014-11-03 21:50:50.228</td>\n",
       "      <td>40.421265</td>\n",
       "      <td>-3.628646</td>\n",
       "      <td>35</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2014-11-03 21:49:50.132</td>\n",
       "      <td>40.421271</td>\n",
       "      <td>-3.628650</td>\n",
       "      <td>34</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2014-11-03 21:48:50.127</td>\n",
       "      <td>40.421274</td>\n",
       "      <td>-3.628639</td>\n",
       "      <td>34</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2014-11-03 21:47:49.271</td>\n",
       "      <td>40.421286</td>\n",
       "      <td>-3.628635</td>\n",
       "      <td>33</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                    Dates        Lat      Long  Accuracy\n",
       "0 2014-11-03 21:51:52.187  40.421279 -3.628637        34\n",
       "1 2014-11-03 21:50:50.228  40.421265 -3.628646        35\n",
       "2 2014-11-03 21:49:50.132  40.421271 -3.628650        34\n",
       "3 2014-11-03 21:48:50.127  40.421274 -3.628639        34\n",
       "4 2014-11-03 21:47:49.271  40.421286 -3.628635        33"
      ]
     },
     "execution_count": 107,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import datetime\n",
    "\n",
    "phone_df = pd.read_csv('Datasets/phone_data.txt', parse_dates=['Dates'], sep='\\t')\n",
    "phone_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The dataset consists of a timestamp, latitude, longitude, and accuracy. First, we want to combine latitude and longitude into a single column."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 108,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Dates</th>\n",
       "      <th>Accuracy</th>\n",
       "      <th>Location</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2014-11-03 21:51:52.187</td>\n",
       "      <td>34</td>\n",
       "      <td>(40.421279399999996, -3.6286372000000005)</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2014-11-03 21:50:50.228</td>\n",
       "      <td>35</td>\n",
       "      <td>(40.4212652, -3.6286462999999998)</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2014-11-03 21:49:50.132</td>\n",
       "      <td>34</td>\n",
       "      <td>(40.421271000000004, -3.6286498999999997)</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2014-11-03 21:48:50.127</td>\n",
       "      <td>34</td>\n",
       "      <td>(40.4212744, -3.6286388)</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2014-11-03 21:47:49.271</td>\n",
       "      <td>33</td>\n",
       "      <td>(40.421286200000004, -3.6286354)</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                    Dates  Accuracy                                   Location\n",
       "0 2014-11-03 21:51:52.187        34  (40.421279399999996, -3.6286372000000005)\n",
       "1 2014-11-03 21:50:50.228        35          (40.4212652, -3.6286462999999998)\n",
       "2 2014-11-03 21:49:50.132        34  (40.421271000000004, -3.6286498999999997)\n",
       "3 2014-11-03 21:48:50.127        34                   (40.4212744, -3.6286388)\n",
       "4 2014-11-03 21:47:49.271        33           (40.421286200000004, -3.6286354)"
      ]
     },
     "execution_count": 108,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Pair each latitude with its longitude as a hashable (lat, long) tuple.\n",
    "phone_df['Location'] = list(zip(phone_df['Lat'], phone_df['Long']))\n",
    "phone_df = phone_df.drop(columns = ['Lat', 'Long'])\n",
    "phone_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we want to look at the location data more closely. To do this, we sort the records by date so that we can then group them by day."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 109,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Dates</th>\n",
       "      <th>Accuracy</th>\n",
       "      <th>Location</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>43780</th>\n",
       "      <td>2014-09-30 21:54:03.688</td>\n",
       "      <td>27</td>\n",
       "      <td>(40.4212446, -3.6286241)</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>43779</th>\n",
       "      <td>2014-09-30 21:55:03.956</td>\n",
       "      <td>21</td>\n",
       "      <td>(40.4212787, -3.6285733999999996)</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>43778</th>\n",
       "      <td>2014-09-30 21:56:03.888</td>\n",
       "      <td>26</td>\n",
       "      <td>(40.421249200000005, -3.6286188999999998)</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>43777</th>\n",
       "      <td>2014-09-30 21:57:03.784</td>\n",
       "      <td>35</td>\n",
       "      <td>(40.421282, -3.6286157)</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>43776</th>\n",
       "      <td>2014-09-30 21:58:03.933</td>\n",
       "      <td>37</td>\n",
       "      <td>(40.4212636, -3.6286042000000003)</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                        Dates  Accuracy  \\\n",
       "43780 2014-09-30 21:54:03.688        27   \n",
       "43779 2014-09-30 21:55:03.956        21   \n",
       "43778 2014-09-30 21:56:03.888        26   \n",
       "43777 2014-09-30 21:57:03.784        35   \n",
       "43776 2014-09-30 21:58:03.933        37   \n",
       "\n",
       "                                        Location  \n",
       "43780                   (40.4212446, -3.6286241)  \n",
       "43779          (40.4212787, -3.6285733999999996)  \n",
       "43778  (40.421249200000005, -3.6286188999999998)  \n",
       "43777                    (40.421282, -3.6286157)  \n",
       "43776          (40.4212636, -3.6286042000000003)  "
      ]
     },
     "execution_count": 109,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = phone_df.sort_values('Dates')\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Specifically, we are interested in the following questions:\n",
    "1. How often does the Android phone record location data?\n",
    "2. What are some common places that the person goes to?\n",
    "3. What information can be inferred from the previous answers?\n",
    "4. How likely is it that we can identify this individual?\n",
    "\n",
    "To answer these, we first need to separate the date and the time."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 110,
   "metadata": {},
   "outputs": [],
   "source": [
    "df['Time'] = df['Dates'].dt.time\n",
    "df['Dates'] = df['Dates'].dt.date\n",
    "df = df[['Dates', 'Time', 'Location', 'Accuracy']]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 111,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Dates</th>\n",
       "      <th>Time</th>\n",
       "      <th>Location</th>\n",
       "      <th>Accuracy</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>43780</th>\n",
       "      <td>2014-09-30</td>\n",
       "      <td>21:54:03.688000</td>\n",
       "      <td>(40.4212446, -3.6286241)</td>\n",
       "      <td>27</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>43779</th>\n",
       "      <td>2014-09-30</td>\n",
       "      <td>21:55:03.956000</td>\n",
       "      <td>(40.4212787, -3.6285733999999996)</td>\n",
       "      <td>21</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>43778</th>\n",
       "      <td>2014-09-30</td>\n",
       "      <td>21:56:03.888000</td>\n",
       "      <td>(40.421249200000005, -3.6286188999999998)</td>\n",
       "      <td>26</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>43777</th>\n",
       "      <td>2014-09-30</td>\n",
       "      <td>21:57:03.784000</td>\n",
       "      <td>(40.421282, -3.6286157)</td>\n",
       "      <td>35</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>43776</th>\n",
       "      <td>2014-09-30</td>\n",
       "      <td>21:58:03.933000</td>\n",
       "      <td>(40.4212636, -3.6286042000000003)</td>\n",
       "      <td>37</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            Dates             Time                                   Location  \\\n",
       "43780  2014-09-30  21:54:03.688000                   (40.4212446, -3.6286241)   \n",
       "43779  2014-09-30  21:55:03.956000          (40.4212787, -3.6285733999999996)   \n",
       "43778  2014-09-30  21:56:03.888000  (40.421249200000005, -3.6286188999999998)   \n",
       "43777  2014-09-30  21:57:03.784000                    (40.421282, -3.6286157)   \n",
       "43776  2014-09-30  21:58:03.933000          (40.4212636, -3.6286042000000003)   \n",
       "\n",
       "       Accuracy  \n",
       "43780        27  \n",
       "43779        21  \n",
       "43778        26  \n",
       "43777        35  \n",
       "43776        37  "
      ]
     },
     "execution_count": 111,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 370,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Location</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Dates</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2014-09-30</th>\n",
       "      <td>126</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2014-10-01</th>\n",
       "      <td>1307</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2014-10-02</th>\n",
       "      <td>1349</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2014-10-03</th>\n",
       "      <td>1372</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2014-10-04</th>\n",
       "      <td>1413</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            Location\n",
       "Dates               \n",
       "2014-09-30       126\n",
       "2014-10-01      1307\n",
       "2014-10-02      1349\n",
       "2014-10-03      1372\n",
       "2014-10-04      1413"
      ]
     },
     "execution_count": 370,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Count the location records logged on each day.\n",
    "df_day = df[['Dates', 'Location']]\n",
    "df_freq = df_day.groupby(['Dates']).count()\n",
    "df_freq.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 164,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Max Location count: 1429\n",
      "Min Location Count: 126\n"
     ]
    }
   ],
   "source": [
    "max_count = df_freq['Location'].max()\n",
    "min_count = df_freq['Location'].min()\n",
    "print('Max Location count: {}\\nMin Location Count: {}'.format(max_count, min_count))"
   ]
  },
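  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The daily counts show that the Android phone records its location between 126 and 1,429 times per day. A full day has 1,440 minutes, so the phone logs roughly one reading per minute; the low count of 126 falls on 2014-09-30, where the records only begin at 21:54.\n",
    "Next, we will look at specific locations. "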
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 172,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Dates</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Location</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>(39.847814, -5.6267378)</th>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>(39.8596804, -5.613139)</th>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>(39.871458399999995, -5.582563200000001)</th>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>(39.871547, -5.5995402)</th>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>(39.8723041, -5.5697503)</th>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                          Dates\n",
       "Location                                       \n",
       "(39.847814, -5.6267378)                       1\n",
       "(39.8596804, -5.613139)                       1\n",
       "(39.871458399999995, -5.582563200000001)      1\n",
       "(39.871547, -5.5995402)                       1\n",
       "(39.8723041, -5.5697503)                      2"
      ]
     },
     "execution_count": 172,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_loc = df_day.groupby('Location').count()\n",
    "df_loc.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We grouped the records by location and counted how many times each location was recorded over the month. The most frequent locations are shown below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 252,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Dates</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Location</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>(40.4202487, -3.6303093)</th>\n",
       "      <td>1013</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>(40.115815399999995, -5.949015)</th>\n",
       "      <td>913</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>(40.4207664, -3.6332066)</th>\n",
       "      <td>754</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>(40.419880799999994, -3.630936)</th>\n",
       "      <td>675</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>(40.4207763, -3.6332088)</th>\n",
       "      <td>655</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                 Dates\n",
       "Location                              \n",
       "(40.4202487, -3.6303093)          1013\n",
       "(40.115815399999995, -5.949015)    913\n",
       "(40.4207664, -3.6332066)           754\n",
       "(40.419880799999994, -3.630936)    675\n",
       "(40.4207763, -3.6332088)           655"
      ]
     },
     "execution_count": 252,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_locations = df_loc.sort_values('Dates', ascending=False)\n",
    "df_locations.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we can see, this phone travels to quite a few places. We will focus primarily on the most frequently recorded ones, since they provide the most information."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 261,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Location</th>\n",
       "      <th>Dates</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>(40.4202487, -3.6303093)</td>\n",
       "      <td>1013</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>(40.115815399999995, -5.949015)</td>\n",
       "      <td>913</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>(40.4207664, -3.6332066)</td>\n",
       "      <td>754</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>(40.419880799999994, -3.630936)</td>\n",
       "      <td>675</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>(40.4207763, -3.6332088)</td>\n",
       "      <td>655</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                          Location  Dates\n",
       "0         (40.4202487, -3.6303093)   1013\n",
       "1  (40.115815399999995, -5.949015)    913\n",
       "2         (40.4207664, -3.6332066)    754\n",
       "3  (40.419880799999994, -3.630936)    675\n",
       "4         (40.4207763, -3.6332088)    655"
      ]
     },
     "execution_count": 261,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_new = df_locations.reset_index()\n",
    "df_new[df_new['Dates'] > 10].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The table above lists locations that were recorded more than 10 times. We will feed these into a reverse geocoding API to get better insight into each location."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 324,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "296\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Location</th>\n",
       "      <th>Count</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>(40.4202487, -3.6303093)</td>\n",
       "      <td>1013</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>(40.115815399999995, -5.949015)</td>\n",
       "      <td>913</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>(40.4207664, -3.6332066)</td>\n",
       "      <td>754</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>(40.419880799999994, -3.630936)</td>\n",
       "      <td>675</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>(40.4207763, -3.6332088)</td>\n",
       "      <td>655</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                          Location  Count\n",
       "0         (40.4202487, -3.6303093)   1013\n",
       "1  (40.115815399999995, -5.949015)    913\n",
       "2         (40.4207664, -3.6332066)    754\n",
       "3  (40.419880799999994, -3.630936)    675\n",
       "4         (40.4207763, -3.6332088)    655"
      ]
     },
     "execution_count": 324,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "freq_locations = df_new[df_new['Dates']>10]\n",
    "freq_locations = freq_locations.rename(index=str, columns={'Dates':'Count'})\n",
    "print(len(freq_locations))\n",
    "freq_locations.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Below is the code that generates readable addresses using the Bing API and saves them to a text file, so that we do not spend API calls again on later runs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 334,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "\"\\nimport geocoder\\nbing_key = 'AiEfap-qUoZalL1qK8ollM-SwVdoJFemh60tHo0EeraVYP8V4WPJXAVD2YjqzgA1'\\ncoordinates = freq_locations['Location']\\naddr_file = open('Datasets/phone_address_new.txt', 'w+', encoding='utf-8')\\nfor cord in coordinates:\\n    cord_list = list(cord)\\n    g = geocoder.bing(cord_list, method = 'reverse', key = bing_key)\\n    for r in g:\\n        line_str = r.address + ',' + r.city + ',' + r.country + '\\n'\\n        addr_file.write(str(cord_list) + ' : ' + line_str)\\n        print('{} : {}'.format(cord_list, line_str))\\naddr_file.close()\\nprint('finish address!')\\n\""
      ]
     },
     "execution_count": 334,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "'''\n",
    "import geocoder\n",
    "bing_key = 'AiEfap-qUoZalL1qK8ollM-SwVdoJFemh60tHo0EeraVYP8V4WPJXAVD2YjqzgA1'\n",
    "coordinates = freq_locations['Location']\n",
    "addr_file = open('Datasets/phone_address_new.txt', 'w+', encoding='utf-8')\n",
    "for cord in coordinates:\n",
    "    cord_list = list(cord)\n",
    "    g = geocoder.bing(cord_list, method = 'reverse', key = bing_key)\n",
    "    for r in g:\n",
    "        line_str = r.address + ',' + r.city + ',' + r.country + '\\n'\n",
    "        addr_file.write(str(cord_list) + ' : ' + line_str)\n",
    "        print('{} : {}'.format(cord_list, line_str))\n",
    "addr_file.close()\n",
    "print('finish address!')\n",
    "'''"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After mapping coordinates to addresses, we found that because the location data is so precise, several distinct coordinates resolve to the same general area. We therefore group the results by address to reduce redundancy, which cuts the number of places from 296 to 152."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 409,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Most frequent visted places: 152\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Count</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Address</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>Calle de Nicolás Salmerón, 7, 28017 Madrid (Madrid),Madrid,Spain</th>\n",
       "      <td>2971</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Autovía del Norte, 28108 Alcobendas (Madrid),Alcobendas,Spain</th>\n",
       "      <td>1661</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Calle de Villasilos, 8B, 28017 Madrid (Madrid),Madrid,Spain</th>\n",
       "      <td>1451</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Calle de Nicolás Salmerón, 17, 28017 Madrid (Madrid),Madrid,Spain</th>\n",
       "      <td>1429</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Avenida de Bruselas, 37, 28108 Alcobendas (Madrid),Alcobendas,Spain</th>\n",
       "      <td>1147</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Calle de Matamorosa, 3, 28017 Madrid (Madrid),Madrid,Spain</th>\n",
       "      <td>1013</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>CC-51, 10617 El Torno (Cáceres),El Torno,Spain</th>\n",
       "      <td>913</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Avenida Bruselas, 31, 28108 Alcobendas (Madrid),Alcobendas,Spain</th>\n",
       "      <td>768</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Calle de la Caléndula, 87, 28109 Alcobendas (Madrid),Alcobendas,Spain</th>\n",
       "      <td>574</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Avenida de Fuencarral, 18, 28108 Alcobendas (Madrid),Alcobendas,Spain</th>\n",
       "      <td>567</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                    Count\n",
       "Address                                                  \n",
       " Calle de Nicolás Salmerón, 7, 28017 Madrid (Ma...   2971\n",
       " Autovía del Norte, 28108 Alcobendas (Madrid),A...   1661\n",
       " Calle de Villasilos, 8B, 28017 Madrid (Madrid)...   1451\n",
       " Calle de Nicolás Salmerón, 17, 28017 Madrid (M...   1429\n",
       " Avenida de Bruselas, 37, 28108 Alcobendas (Mad...   1147\n",
       " Calle de Matamorosa, 3, 28017 Madrid (Madrid),...   1013\n",
       " CC-51, 10617 El Torno (Cáceres),El Torno,Spain       913\n",
       " Avenida Bruselas, 31, 28108 Alcobendas (Madrid...    768\n",
       " Calle de la Caléndula, 87, 28109 Alcobendas (M...    574\n",
       " Avenida de Fuencarral, 18, 28108 Alcobendas (M...    567"
      ]
     },
     "execution_count": 409,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "address = pd.read_csv('Datasets\\phone_address_new.txt', sep=':', header = None)\n",
    "freq_locations['Address'] = address[1].values\n",
    "freq = freq_locations.groupby(['Address']).sum().sort_values('Count', ascending=False)\n",
    "print('Most frequent visted places: {}'.format(len(freq)))\n",
    "freq[:10]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Below is the map of the precise traveling locations of the phone on geo coordinates. This represents the general traveling locations of the person."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 411,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div style=\"width:100%;\"><div style=\"position:relative;width:100%;height:0;padding-bottom:60%;\"><iframe src=\"data:text/html;charset=utf-8;base64,\" style=\"position:absolute;width:100%;height:100%;left:0;top:0;border:none !important;\" allowfullscreen webkitallowfullscreen mozallowfullscreen></iframe></div></div>"
      ],
      "text/plain": [
       "<folium.folium.Map at 0x12b859b0>"
      ]
     },
     "execution_count": 411,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import folium\n",
    "# build the location list\n",
    "coordinates = freq_locations['Location'].tolist()\n",
    "m = folium.Map(location=list(coordinates[0]), zoom_start=10)\n",
    "for c in coordinates:\n",
    "    c_list = list(c)\n",
    "    folium.Marker(c_list, popup=str(c_list)).add_to(m)\n",
    "m"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With the limitations of API and time constriants, we did not manage to references all the address to notable locations. However, we can do some basic searches to get some ideas of the person by manually lookup the top 10 most visited locations. Our results are as followed:\n",
    "1. Calle de Nicolás Salmerón - Sports Club\n",
    "2. Autovía A-1 - Route\n",
    "3. Shango - Cafe\n",
    "4. Calle de Nicolás Salmerón - Residental Area\n",
    "5. Calle de Nicolás Salmerón - Sports Club\n",
    "6. Avenida de Bruselas - Amusement Center\n",
    "7. Sala Bohemia - Music School\n",
    "8. CC-51 - Route close to Gas Station\n",
    "9. Calle de la Caléndula - School Campus\n",
    "10. Avenida de Fuencarral - Holiday Inn Express Madrid\n",
    "\n",
    "Based on these, it should be safe to inference that the person helps people travel to places and live at Calle de Nicolás Salmerón. There is constant visit of a sports Club, so we can assume is a Male. \n",
    "\n",
    "# Future Invetisgation\n",
    "This is as much as we can do so far with this dataset. For the future, we could consider referencing external datasets, such as history taxi or other means of transportation and see if there are correlation to possibly identify the person. However, this dataset still indicate the following information, which could be consider a breach of privacy.\n",
    "1. Perosnal home address is exposed\n",
    "2. Occupation\n",
    "3. General area mobility\n",
    "4. Habits\n",
    "5. Gender\n",
    "With how frequent the android phone keep track of the location data on a daily basis, the data could have the potential to reavel much more info if it gets larger since this is just a month.\n",
    "\n",
    "Next, we will be looking at social network location datasets and try to see if we can find similar info for group of users rather than individuals."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}