Today I faced an interesting challenge: finding time zones for users whose locations are given as latitude and longitude. Thanks to R's extensive package library, the solution was not hard at all.

First, I found a list of cities with populations greater than 1,000 on the Geonames.org website. The list is quite big, almost 150K cities. Computing the distance from each of the 48K users that had location information to every one of those cities would have taken far too long.
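To make that cost concrete, here is a minimal sketch of the brute-force approach on toy data. The city and user coordinates below are approximate values I made up for illustration, not rows from the Geonames file, and the distance is plain Euclidean distance on degrees.

```r
# Toy reference cities (approximate coordinates, for illustration only)
cities <- data.frame(
  latitude  = c(51.51, 40.71, 35.68),
  longitude = c(-0.13, -74.01, 139.69),
  tz        = c("Europe/London", "America/New_York", "Asia/Tokyo"),
  stringsAsFactors = FALSE
)
# Toy users (roughly Paris and Los Angeles)
users <- data.frame(latitude = c(48.85, 34.05), longitude = c(2.35, -118.24))

# Brute force: for each user, compute the squared distance to every city
# and keep the index of the closest one
nearest <- sapply(seq_len(nrow(users)), function(i) {
  d2 <- (cities$latitude - users$latitude[i])^2 +
        (cities$longitude - users$longitude[i])^2
  which.min(d2)
})
users$tz <- cities$tz[nearest]

# Rough number of distance evaluations at the scale described above
n_pairs <- 150000 * 48000
```

With about 150K cities and 48K users, that loop would evaluate roughly 7.2 billion distances, which is why a tree-based search is attractive.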

Luckily, there are tree-based algorithms that make quick work of this kind of nearest-neighbor problem. I used the RANN package in R, which is very intuitive to use. Here is the code I used.

# Import library
library(RANN)

# Read in geolocation data (tab-separated; disable quoting and comment characters)
geoloc<-read.table("data/cities1000.txt", sep="\t", quote="", comment.char="")
# Column names (the Geonames dump has 19 columns, including alternatenames)
colnames(geoloc)<-c("id", "name", "asciiname", "alternatenames", "latitude", "longitude", "feature_class", "feature_code", "country_code", "cc2", "admin1_code", "admin2_code", "admin3_code", "admin4_code", "pop", "elevation", "dem", "tz", "ModDate")
# Subset the useful part
geoloc1<-unique(geoloc[,c("latitude","longitude","tz")])
# Create a list of locations from our own data (raw) to match the geolocation data,
# dropping rows with missing coordinates
test<-na.omit(unique(raw[,c("latitude", "longitude")]))
# Get the nearest neighbor (Euclidean distance on degrees, which is an approximation)
neighbor<-nn2(geoloc1[,1:2], test, k=1, treetype="bd")
# nn.idx indexes rows of geoloc1 (the data passed to nn2), not geoloc
test$tz<-geoloc1[neighbor$nn.idx[,1],"tz"]
# Attach the time zone to every row of raw; rows without coordinates get NA
raw1<-merge(raw, test, all.x=TRUE, by=c("latitude", "longitude"))

The code runs incredibly fast, and I had time zone data in no time.
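One caveat: nn2 measures Euclidean distance on the raw coordinates, so a match can be misleading when the nearest listed city is far away or when a point sits near longitude ±180. Below is a minimal sketch of a sanity check, assuming RANN is installed. The toy coordinates and the 5-degree threshold are my own illustrative choices, not values from the data above.

```r
library(RANN)

# Toy reference cities (approximate coordinates, for illustration only)
cities <- data.frame(
  latitude  = c(51.51, 40.71, 35.68),
  longitude = c(-0.13, -74.01, 139.69),
  tz        = c("Europe/London", "America/New_York", "Asia/Tokyo"),
  stringsAsFactors = FALSE
)
# Toy users: roughly Paris, Los Angeles, and a point in the mid-Pacific
users <- data.frame(
  latitude  = c(48.85, 34.05, 0),
  longitude = c(2.35, -118.24, -150)
)

# Nearest city for each user; nn.idx holds row indices into `cities`,
# nn.dists holds the matching Euclidean distances (in degrees here)
nn <- nn2(cities[, c("latitude", "longitude")],
          users[, c("latitude", "longitude")], k = 1)
users$tz       <- cities$tz[nn$nn.idx[, 1]]
users$dist_deg <- nn$nn.dists[, 1]

# Flag matches further away than a hypothetical threshold of 5 degrees
users$far <- users$dist_deg > 5
```

Rows flagged in users$far are worth a manual look before trusting the assigned time zone.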

 
