Ranking places with Google to create maps

Today we’re going to use the googleway R package, which allows their user to do requests to the GoogleMaps Places API. The goal is to create maps of specific places (restaurants, museums, etc.) with information from Google Maps rankings (number of stars given by other people). I already discussed this in french here to rank swimming pools in Paris. Let’s start by loading the three libraries I’m going to use : googleway, leaflet to create animated maps, and RColorBrewer for color ranges.

library(googleway)
library(leaflet)
library(RColorBrewer)

First things first. To do API request to Google, we need an API key ; you can ask for one here. We’ll use this key for the rest of the program, so let’s declare a global variable :

api_key <- "YourKeyHereIWontGiveYouMine"

Retrieving Google Data

We’re going to use the google_places function to get a list of places matching a description, called research in my program (for instance : “Restaurant, Paris, France”). The output are multiple, and I’m going to store the place ID and the rating. I’ll also store the research token ; that’ll be explained later.

gmaps_request <- google_places(search_string = research, language = language, key = api_key)
  gmaps_data <- gmaps_request$results
  
  place_id <- gmaps_data$place_id
  rating <- gmaps_data$rating
  
  token <- gmaps_request$next_page_token

This function returns up to 20 places associated to the research by Google. If you want more than 20, you need to use the token previously stored in order to ask the Google Places API to give you the next results, by tweaking the function this way :

gmaps_request <- google_places(search_string = research, language = language, key = api_key, page_token = token)

There are tow caveats to this function. Firstly, the token can be NULL. In this case, there isn’t any further research results you can get. This happens automatically as soon as you reach 60 results. Secondly, the API needs time to refresh the token research (see here) ; that’s why we’re going to make R wait a few seconds, using Sys.sleep(time) between our requests. Our complete function is therefore :

gmaps_request <- google_places(search_string = research, language = language, key = api_key)
  gmaps_data <- gmaps_request$results
  
  place_id <- gmaps_data$place_id
  rating <- gmaps_data$rating
  
token <- gmaps_request$next_page_token
Sys.sleep(5)
continue <- TRUE

  while (continue) {
    
    gmaps_request <- google_places(search_string = research, language = language, key = api_key, page_token = token)
    gmaps_data <- gmaps_request$results
    
    if (!is.null(gmaps_request$next_page_token)) {
      place_id = c(place_id,gmaps_data$place_id)
      rating = c(rating,gmaps_data$rating)
      token <- gmaps_request$next_page_token
      Sys.sleep(5)
    }
    else{continue <- FALSE}
  }

Now we’re going to search for the spatial coordinates of the places we found. To this extent, we’re going to use the google_place_details function of the packages, and retrieve latitude and longitude with these two functions :

get_lat <- function(id, key, language) {
  id <- as.character(id)
  details <- google_place_details(id, language = language, key=key)
  return(details$result$geometry$location$lat)
}

get_lng <- function(id, key, language) { 
  id <- as.character(id)
  details <- google_place_details(id, language = language, key=key)
  return(details$result$geometry$location$lng) 
}

All these blocks add up to build the complete function :

get_gmaps_data <- function(research, api_key, language) {
  
  gmaps_request <- google_places(search_string = research, language = language, key = api_key)
  gmaps_data <- gmaps_request$results
  
  place_id <- gmaps_data$place_id
  rating <- gmaps_data$rating
  
  token <- gmaps_request$next_page_token
  Sys.sleep(5)
  continue <- TRUE
  
  while (continue) {
    
    gmaps_request <- google_places(search_string = research, language = language, key = api_key, page_token = token)
    gmaps_data <- gmaps_request$results
    
    if (!is.null(gmaps_request$next_page_token)) {
      place_id <- c(place_id, gmaps_data$place_id)
      rating <- c(rating, gmaps_data$rating)
      token <- gmaps_request$next_page_token
      Sys.sleep(5)
    }
    else{continue <- FALSE}
  }
  
  lat = sapply(place_id, get_lat, key=api_key, language=language)
  lng = sapply(place_id, get_lng, key=api_key, language=language)
  
  return(data.frame(place_id, rating, lat, lng))
}

Map plot

The next part is more classical. We’re going to order the ratings of the data frame built by the previous function in order to arrange the places in differents groups. Each of the groups will be associated to a color on the data plot. If we want to make number_colors groups with the color scheme color (for instance, “Greens”), we are using the following instructions :

color_pal <- brewer.pal(number_colors, color)
pal <- colorFactor(color_pal, domain = seq(1,number_colors))

plot_data <- gmaps_data
plot_data$ranking <- ceiling(order(gmaps_data$rating)*number_colors/nrow(plot_data))

The definitive function just needs the addition of the leaflet call :

show_map <- function(number_colors, gmaps_data, color="Greens") {
  
color_pal <- brewer.pal(number_colors,color)
pal <- colorFactor(color_pal, domain = seq(1,number_colors))

plot_data <- gmaps_data
plot_data$ranking <- ceiling(order(gmaps_data$rating)*number_colors/nrow(plot_data)) leaflet(plot_data) %>% addTiles() %>%
  addCircleMarkers(
    radius = 6,
    fillColor = ~pal(ranking),
    stroke = FALSE, fillOpacity = 1
  ) %>% addProviderTiles(providers$CartoDB.Positron)
}

Examples

I just need to combine these two functions in one, and then to do some food-related examples !

maps_ranking_from_gmaps <- function(research, api_key, language, number_colors=5, color="Greens") {
  show_map(number_colors, get_gmaps_data(research, api_key, language), color)
}

maps_ranking_from_gmaps("Macaron, Paris, France", api_key, "fr")
maps_ranking_from_gmaps("Macaron, Montreal, Canada", api_key, "fr")

maps_ranking_from_gmaps("Poutine, Montreal, Canada", api_key, "fr", 5, "Blues")
maps_ranking_from_gmaps("Poutine, Paris, France", api_key, "fr", 5, "Blues")

which returns the following maps :

Macaron in Paris, France

Macaron in Montreal, Canada

Poutine in Montreal, Canada

Poutine in Paris, France (I guess French people are not ready for this)

Weighting tricks for machine learning with Icarus – Part 1

Calibration in survey sampling is a wonderful tool, and today I want to show you how we can use it in some Machine Learning applications, using the R package Icarus. And because ’tis the season, what better than a soccer dataset to illustrate this? The data and code are located on this gitlab repo: https://gitlab.com/haroine/weighting-ml

weighting ML gitlab
https://gitlab.com/haroine/weighting-ml

First, let’s start by installing and loading icarus and nnet, the two packages needed in this tutorial, from CRAN (if necessary):

install.packages(c("icarus","nnet"))
library(icarus)
library(nnet)

Then load the data:

load("data/weighting_ML_part1.RData")

The RData file contains two dataframes, one for the training set and one for the test set. They contain results of some international soccer games, from 01/2008 to 12/2016 for the training set, and from 01/2017 to 11/2017 for the test. Along with the team names and goals scored for each side, a few descriptive variables that we’re going to use as features of our ML models:

> head(train_soccer)
        Date                   team opponent_team home_field elo_team
1 2010-10-12                Belarus       Albania          1      554
2 2010-10-08 Bosnia and Herzegovina       Albania          0      544
3 2011-06-07 Bosnia and Herzegovina       Albania          0      594
4 2011-06-20              Argentina       Albania          1     1267
5 2011-08-10             Montenegro       Albania          0      915
6 2011-09-02                 France       Albania          0      918
  opponent_elo importance goals_for goals_against outcome year
1          502          1         2             0     WIN 2010
2          502          1         1             1    DRAW 2010
3          564          1         2             0     WIN 2011
4          564          1         4             0     WIN 2011
5          524          1         2             3    LOSS 2011
6          546          1         2             1     WIN 2011

elo_team and opponent_elo are quantitative variables indicative of the level of the team at the date of the game ; importance is a measure of high-profile the game played was (a friendly match rates 1 while a World Cup game rates 4). The other variables are imo self-descriptive.

Then we can train a multinomial logistic regression, with outcome being the predicted variable, and compute the predictions from the model:

outcome_model_unw <- multinom(outcome ~ elo_team + opponent_elo + home_field + importance,
data = train_soccer)

test_soccer$pred_outcome_unw <- predict(outcome_model_unw, newdata = test_soccer)

The sheer accuracy of this predictor is kinda good:

> ## Accuracy
> sum(test_soccer$pred_outcome_unw == test_soccer$outcome) / nrow(test_soccer)
[1] 0.5526316

but it has a problem: it never predicts draws!

> summary(test_soccer$pred_outcome_unw)
DRAW LOSS  WIN 
   0  208  210

And indeed, draws being less common than other results, it seems more profitable for the algorithm that optimizes accuracy never to predict them. As a consequence, the probabilities of the game being a draw is always lesser than the probability of one team winning it. We could show that the probabilities are not well calibrated.

A common solution to this problem is to use reweighting to correct the imbalances in the sample, which we’ll now tackle. It is important to note that the weighting trick has to happen in the training set to avoid “data leaks”. A very good piece on this subject has been written by Max Kuhn in the documentation of caret.

R package caret
https://topepo.github.io/caret/

Commonly, you would do:

train_soccer$weight <- 1
train_soccer[train_soccer$outcome == "DRAW",]$weight <- (nrow(train_soccer)/table(train_soccer$outcome)[1]) * 1/3
train_soccer[train_soccer$outcome == "LOSS",]$weight <- (nrow(train_soccer)/table(train_soccer$outcome)[2]) * 1/3
train_soccer[train_soccer$outcome == "WIN",]$weight <- (nrow(train_soccer)/table(train_soccer$outcome)[3]) * 1/3

> table(train_soccer$weight)

0.916067146282974  1.22435897435897 
             3336              1248

The draws are reweighted with a factor greater than 1 and the other games with a factor lesser than 1. This balances the predicted outcomes and thus improves the quality of the probabilities …

outcome_model <- multinom(outcome ~ elo_team + opponent_elo + home_field + importance,
data = train_soccer,
weights = train_soccer$weight)

test_soccer$pred_outcome <- predict(outcome_model, newdata = test_soccer)
> summary(test_soccer$pred_outcome)
DRAW LOSS  WIN 
  96  167  155

… though at a loss in accuracy:

> ## Accuracy
> sum(test_soccer$pred_outcome == test_soccer$outcome) / nrow(test_soccer)
[1] 0.5263158

Now let’s look at the balance of our training sample on other variables:

> round(table(test_soccer$importance) / nrow(test_soccer),2)

   1    2    3    4 
0.26 0.08 0.54 0.12 
> round(table(train_soccer$importance) / nrow(train_soccer),2)

   1    2    3    4 
0.56 0.08 0.23 0.12

It seems that the test set features a lot more important matches than the training set. Let’s look further, in particular at the dates the matches of the training set were played:

> round(table(train_soccer$year) / nrow(train_soccer),2)

2008 2009 2010 2011 2012 2013 2014 2015 2016 
0.10 0.11 0.11 0.10 0.11 0.13 0.11 0.11 0.12

Thus the matches of each year between 2008 and 2016 have the same influence on the final predictor. A better idea would be to give the most recent games a slightly higher influence, for example by increasing their weight and thus reducing the weights of the older games:

nyears <- length(unique(train_soccer$year))
year_tweak <- rep(1/nyears,nyears) * 1:nyears
year_tweak <- year_tweak * 1/sum(year_tweak) ## Normalization

> year_tweak
[1] 0.02222222 0.04444444 0.06666667 0.08888889 0.11111111 0.13333333
[7] 0.15555556 0.17777778 0.20000000

We determine it is thus a good idea to balance on these two additional variables (year and importance). Now how should we do this? A solution could be to create an indicator variable containing all the values of the cross product between the variables outcome, year and importance, and use the same reweighting technique as before. But this would not be very practical and more importantly, some of the sub-categories would be nearly empty, making the procedure not very robust. A better solution is to use survey sampling calibration and Icarus 🙂

train_soccer$weight_cal <- 1
importance_pct_test <- unname(
table(test_soccer$importance) / nrow(test_soccer),
)

marginMatrix <- matrix(, nrow = 0, ncol = 1) %>% ## Will be replaced by newMarginMatrix() in icarus 0.3.2
addMargin("outcome", c(0.333,0.333,0.333)) %>%
addMargin("importance", importance_pct_test) %>%
addMargin("year", year_tweak)

train_soccer$weight_cal <- calibration(data=train_soccer, marginMatrix=marginMatrix,
colWeights="weight_cal", pct=TRUE, description=TRUE,
popTotal = nrow(train_soccer), method="raking")

outcome_model_cal <- multinom(outcome ~ elo_team + opponent_elo + home_field + importance,
data = train_soccer,
weights = train_soccer$weight_cal)

test_soccer$pred_outcome_cal <- predict(outcome_model_cal, newdata = test_soccer)

icarus gives a summary of the calibration procedure in the log (too long to reproduce here). We then observe a slight improvement in accuracy compared to the previous reweighting technique:

> sum(test_soccer$pred_outcome_cal == test_soccer$outcome) / nrow(test_soccer)
[1] 0.5478469

But more importantly we have reason to believe that the we improved the quality of the probabilities assigned to each event (we could check this using metrics such as the Brier score or calibration plots) 🙂

It is also worth noting that some algorithms (especially those who rely on bagging, boosting, or more generally on ensemble methods) naturally do a good job at balancing samples. You could for example rerun the whole code and replace the logit regressions by boosted algorithms. You would then observe fewer differences between the unweighted algorithm and its weighted counterparts.

Stay tuned for the part 2, where we’ll show a trick to craft better probabilities (particularly for simulations) using external knowledge on probabilities.