## Have you checked your features distributions lately?

tl;dr
Trying to debug a poorly performing machine learning model, I discovered that the distribution of one of the features varied from one date to another. I used a simple and neat affine rescaling. This simple quality improvement brought down the model’s prediction error by a factor 8

Data quality trumps any algorithm

I was recently working on a cool dataset that looked unusually friendly. It was tidy, neat, interesting… the kind of things that you rarely encounter in the wild! My goal was to build a super simple predictor for one of the features. However, I kept getting poor results and at first couldn’t figure out what was happening.

Here is the dataset I was working with. I tweaked the ids and dates, and muddied the variables a little to make sure I’m not leaking any critical info, but the gist of the problem is absolutely identical to what I was working with (i.e. it’s a real-world datasetTM, not a toy example).

library(tidyverse)
data_magictransfo <- readRDS("data/nc233_rescaling_df_04_2021.rds")

This dataset features 211 distinct units and 3 main variables (X1, X2, Y) describing them. We have values for these variables on 7 consecutive dates.

> length(unique(data_magictransfo$unit_id)) [1] 211 > table(data_magictransfo$date_num)

1   2   3   4   5   6   7
204 211 211 211 211 211 211

Y is the variable we’re interested in predicting, and we’ll use X1 and X2 as predictors. Since this is basically a problem of machine learning on time series, we’ll use the last date as test set, and train our predictor on the first 6 dates. Usual stuff. To measure the accuracy, we will use the ratio between the square root of the MSE and the mean of the true variable for the 7th date:

df_train <- data_magictransfo %>%  filter(date_num <= 6)
df_test <- data_magictransfo %>%
filter(date_num > 6)

pred_error_ratio <- function(predict_Y, true_Y) {

mse_rescale <- (predict_Y - true_Y)**2
mse_rescale <- mean(mse_rescale)
pred_error_ratio <- sqrt(mse_rescale) / mean(true_Y)

return(pred_error_ratio)
}

The first thing I tried was a super simple linear regression:

simple_model <- lm(Y ~ X1 + X2, data=df_train)
df_test$Y_predict_lm <- predict(simple_model, newdata = df_test) pred_error_ratio_lm <- pred_error_ratio(df_test$Y_predict_lm, df_test$Y) > pred_error_ratio_lm [1] 0.3352261 This extra simple model gives a 33% error rate. Not catastrophic, but not great. Literature on this dataset suggests that 5% should be easily obtained though. Maybe the poor performance comes from the simplicity of the model? Let’s fire up some neural networks and see: library(nnet) set.seed(1005192119) nn <- nnet(Y ~ X1+X2, data=df_train %>% select(Y,X1,X2) , size=11, decay=1.0e-5, maxit=5000, linout=T) df_test$Y_predict_nn <- predict(nn, newdata=df_test %>% select(Y,X1,X2), type="raw")
prediction_error_ratio_nn <- pred_error_ratio(df_test$Y_predict_nn, df_test$Y)
> prediction_error_ratio_nn
[1] 0.1481201

Stop cutting corners and start doing what should always be done first: explore the data

¯\_(ツ)_/¯ Definitely far from what I expected. At this point, I should probably stop cutting corners and start doing what should always be done first: explore the data. After a few summaries, I start suspecting that X1 hides a dirty little secret:

density_plot1 <- ggplot(df_train, aes(x=X1, color=date_num)) +
geom_density(fill="lightblue", alpha=0.25) +
NULL
print(density_plot1)

It seems like the distribution of X1 is different between dates 1, 2 and the rest of the dataset! 😱 Let’s zoom in on the first three dates and add vertical dashed lines for the average of X1 by date:

# Add averages to viz
moments_df <- df_train %>%
filter(date_num <= 3) %>%
group_by(date_num) %>%
summarize(mu = mean(X1), sd = sd(X1))
mu_vec <- moments_df$mu density_plot3 <- ggplot(df_train %>% filter(date_num <= 3), aes(x=X1, color=date_num)) + geom_density(fill="lightblue", alpha=0.25) + geom_vline(xintercept=mu_vec, linetype="dashed", color="black", alpha=0.5) + NULL print(density_plot3) Clearly, X1 for date 1 is a wildly different thing than other dates. At date 2, the distribution is closer to what it will be in the future: average is very close, but there is still a clear difference in the shape and the width of the tails. As a data scientist, it’s your job to investigate the data-generating process 🕵️ This problem of features having different distributions by date (or by any other grouping than time, which could very well be possible as well) is common and can arise for multiple reasons. As a data scientist, it’s your job to investigate the data-generating process 🕵️ In this case, X1 comes from a score a certain authority attributes to each unit. A quick investigation revealed that the scoring method used had changed between dates 1 and 2, and then another time between dates 2 and 3. The change only affected the score and not the rank of the units, which means a simple fix should do 🙌 Now that we’ve diagnosed the problem, let’s try and fix it! Let’s open the math toolbox and… No, don’t go away! I promise it will be simple! 😂 The simplest tool in the box will suffice at first. We’ll try a rescaling of the form: $$Z = a X_1 + b$$ The expected value and variance of this new variable are easy to compute using well-know formulae: $$E(Z) = a E(X_1) + b \\ Var(Z) = a^2 Var(X_1)$$ We want the rescaled variable Z to have the same average and variance as the distribution from date 3 onwards. Let’s denote them μ and σ2. We thus have: $$\mu = a \cdot E(X_1) + b \\ \sigma^2 = a^2 \cdot Var(X_1)$$ Since we know E(X1) and Var(X1) from the data, this is actually a very simple linear system with two variables (a and b) and two equations. The unique solution is: $$a = \frac{\sigma}{Var(X_1)}\\ b = \mu ~ – ~ \frac{\sigma}{Var(X_1)} \cdot E(X_1)$$ which we can now apply on our data: # Distribution parameters library(zeallot) # Only used to improve code readability moments2_df <- df_train %>% mutate(date_select = case_when( date_num == "1" ~ "1", date_num == "2" ~ "2", date_num >= "3" ~ "3+", TRUE ~ "NA") ) %>% select(date_select, X1) %>% group_by(date_select) %>% summarise(mu=mean(X1), sigma=sd(X1)) c(mu_1, mu_2, mu_Z) %<-% moments2_df$mu
c(sigma_1, sigma_2, sigma_Z) %<-% moments2_df$sigma df_train <- df_train %>% mutate(X1_rescale = case_when( date_num == 1 ~ a1 * X1 + b1, date_num == 2 ~ a2 * X1 + b2, TRUE ~ X1 )) The new density plots look much better! moments_df_rescale <- df_train %>% filter(date_num <= 3) %>% group_by(date_num) %>% summarize(mu = mean(X1_rescale), sd = sd(X1_rescale)) mu_vec_rescale <- moments_df_rescale$mu

density_plot4 <- ggplot(df_train %>% filter(date_num <= 3), aes(x=X1_rescale, color=date_num)) +
geom_density(fill="lightblue", alpha=0.25) +
geom_vline(xintercept=mu_vec_rescale, linetype="dashed", color="black", alpha=0.5) +
NULL
print(density_plot4)

Now we can train our models on the rescaled data:

# New predictions
simple_model_rescale <- lm(Y ~ X1_rescale + X2, data=df_train)

df_test$X1_rescale <- df_test$X1
df_test$Y_predict_rescale <- predict(simple_model_rescale, newdata = df_test) pred_error_ratio_rescale <- pred_error_ratio(df_test$Y_predict_rescale, df_test$Y) set.seed(1005192119) nn_check <- nnet(Y ~ X1_rescale+X2, data=df_train %>% select(Y,X1_rescale,X2) , size=13, decay=1.0e-5, maxit=5000, linout=T) Y_predict_nn_check <- predict(nn_check, newdata=df_test %>% select(Y,X1_rescale,X2), type="raw") prediction_error_ratio_nn_check <- pred_error_ratio(Y_predict_nn_check, df_test$Y)

The prediction errors go down by a lot! 🎉🎉🎉 8% for the linear model and 2.6% for the neural network:

> pred_error_ratio_rescale
[1] 0.08092733
> prediction_error_ratio_nn_check
[1] 0.02650212

As always, data quality trumps any algorithm, as powerful as it may be!
And lesson learned for me: never trust a cute-looking dataset. Data in the wild is always hard to tame.

To be continued…

Do you like data science and would be interested in building products that help entrepreneurs around the world start and grow their businesses? Shopify is hiring 2021 engineers and data scientists worldwide! Feel free to reach out if you’d like to know more 🙂

## Micromorts – how much risk of death would you accept?

tl;dr
– A 1% death risk is deceptively high
– Micromorts (1/10000 th of a percent) are a useful scale to model death risks
– Statistical models for human life use ligh-tailed distributions. High values are extremely rare

Today a short post that I had in my drafts for a long time. I didn’t expect that it would (unfortunately) be so relevant to today’s context.

Life is finite and all human activities are risky. Although we all face a certain (hopefully low) risk of dying each time we breathe, it’s not enough reason to prevent us from doing any activities and live isolated in bubbles. But exactly how much risk of dying is acceptable? How much risk on your own life would you be willing to accept?

Turns out most people, sometimes even trained scientists, are bad at estimating death risk probabilities. They often underestimate how bad even seemingly low probabilities of dying can turn out to be. During a dramatic time of my life where I cared for a child who became suddenly sick. The head surgeon told us she had an 85% chance of making it.
So maybe you’re thinking oh it wasn’t that bad! And I mean I understand, 85 is pretty close to 100, situation’s looking fairly good, right?

I was terrified.

To put this number into perspective, imagine if all patients admitted faced such a risk. Let’s say doctors see 15 patients per hour, work 10 hours a day and that the department has 10 doctors. This represents approximately 35’000 patients per year, which seems to fit this UK data. With a 15% death rate, this department would have to deal with more than 5000 deaths over the course of the year, which is the number of people who died in the entire city of San Francisco in 2018! This is one death every 1 hour 40 minutes!

An activity with a 99% chance of survival would certainly kill you in less than a year

In fact, routine surgical procedures with risks greater than 5% are classified high risk. Even a 99% chance of survival doesn’t look so good. If you were enough of a daredevil to engage every day in an activity that exposes you to “only” 1% death probability, then you’d be almost certainly dead within a year.
(You can see this easily by using this easy rule of thumb: consider a random event occurring with probability p. Then there is a 95% chance that the event will occur in less than 3/p tries. In this case, this would be 3 / 0.01 = 300 days, which is less than a year)

### Micromort – the right scale for death risks

As it turns out, percents are not the right risk scale to talk about death risks. Ronald A. Howard realized this in 1979 and created the notion of micromort. A micromort represents a one-in-a-million chance of dying. Wikipedia has a list of how much risk some activities expose you to:

A micromort is one in a million chance of dying – it is equivalent to tossing 20 coins and getting 20 heads

Wikipedia has a list of how much risk some activities expose you to:

One day alive at age 20 – 1 micromort
Skydiving (one jump) – 10 micromorts
One day alive at age 90 – 400 micromorts
Being infected by the Spanish flu – 30000 micromorts
An ascent to Mt Everest – 40000 micromorts

Using this scale, my child’s illness exposed her to 150000 micromorts… which suddenly looks much frightening, and a much more intuitive representation of the risk she was actually exposed to.
(Side note if you’re wondering: my kid is fine, and I am really grateful for this 🙏 Diane, if you ever read this you are the sunshine of my life! ❤️☀️)

If this starts feeling scary, don’t worry too much. A risk of one micromort is equivalent to tossing 20 coins and getting only heads. It’s pretty unlikely! The problem is that you’re playing this game every day, and that every once in a while you have to remove some coins. By the time you’re 50, you only have 17 coins left, by the time you’re 90, you only have 11 left.

### Statistical models of death

A neat thing about Micromorts is that they also make a good and intuitive statistical model for age at death for humans. Let’s consider this very simple model based on the “game” described in the previous paragraph. Every day you play this game, with a certain risk of dying (for sake of simplicity, let’s forget about modelling childhood and only concentrate on adult life):

• Between the ages of 20 and 80, the risk is Floor(age / 10) – 1 micromorts (for example all days of your 26th year, you face a 1 micromort risk, all days in you 63rd year, you face 5 micromorts)
• At age 80, the risk jumps to 100 micromorts, and then each year you have to add 50 additional micromorts (which means 150 micromorts at age 81, 200 micromorts at age 82, etc.)

We can run a few simulations in R to see what life expectancy looks like with this simple model. First let’s define a vector of the risks that match the model we just described:

max_age <- 300min_age <- 20age_cut <- 80risks1 <- rep(1:7, each=365*10)risks2 <- rep(((age_cut : max_age) - age_cut) * 50 + 100, each=365)risks <- c(risks1, risks2) / (1e6)

Then we can run a few simulations to get a vector of age at death for 10 000 people playing this “game”. Note that all simulated values use the vectorization capabilities of the function rbinom:

N <- length(risks)
N_sims <- 1e4

days_sims <- matrix(rbinom(N_sims*N, 1, risks), ncol = N_sims, byrow = F)

death_ages_days <- apply(days_sims, 2, function(x) {
day_death <- which.max(x)
if(day_death > 1) {
return(day_death / 365 + min_age)
} else {
return(max_age)
}
})

The mean and max age at death are:

> mean(death_ages_days)
[1] 84.9348
> max(death_ages_days)
[1] 103.0274

Not too far from what we observe in most Western countries! For example, life expectancy for Canadian women in 2018 was 84.3 years, and that same year, the oldest Canadian man alive was 109 years old.

We can plot the distribution of age at death:

library(ggplot2)

model_plot <-
ggplot(data.frame(age=death_ages_days)) +
geom_histogram(aes(x=age, y=..density..), fill="#4b86b4") +
geom_density(aes(x=age, y=..density..), colour="#2a4d69") +
xlab("Age at death") +
ylab("Frequency") +
theme_bw()
print(model_plot)

In addition to being close to the actual demographic values, the most interesting property of this model is that it was able to generate a light tailed distribution.

A light tail distribution is one that quickly falls down for the highest values. Contrary to statistical distributions like revenue where extreme values (i.e. values that are several standard deviations greater than the mean) are routinely observed, extreme values are very unlikely when it comes to human life. The oldest person we know of reached 122 years of age and people who make it past 100 years old are still a tiny minority. It is extremely unlikely that we would observe someone living to be 200 or even 150 years old.
Even a good old normal distribution would yield extreme values more often (a distribution is said to have “light” and “fat” tails if extreme values are less (resp. more) likely to happen than with a normal distribution). We can see this on this plot from a StackExchange post where life expectancy distribution is plotted against the best normal fit:

One of the best statistical distributions one can use to model human life is the Weibull distribution. It is a variant of the exponential distribution, that is often studied in high school and is typically used to model memory-less failures (i.e. a probability of failing that is independent of time). The Weibull distribution is very similar except the failure rate increases with time, mimicking an aging process. The statistical model I used for this article is in fact very close to how the Weibull distribution works.

Do you like data science and would be interested in building products that help entrepreneurs around the world start and grow their businesses? Shopify is hiring 🎉 for many locations in Canada 🇨🇦 and around the world 🌎! Feel free to reach out directly to me 🙂

Edit: A previous version of the article didn’t feature the tweet illustrating the simple death model, and had typos in two micromorts numbers that were corrected

## The Mrs. White probability puzzle

tl;dr -I don’t remember how many games of Clue I’ve played but I do remember being surprised by Mrs White being the murderer in only 2 of those games. Can you give an estimate and an upper bound for the number of games I have played?
We solve this problem by using Bayes theorem and discussing the data generation mechanism, and illustrate the solution with R.

#### Making use of external information with Bayes theorem

Having been raised a frequentist, I first tried to solve this using a max likelihood method, but quickly gave up when I realized how intractable it actually was, especially for the upper bound.
This is a problem where conditioning on external knowledge is key, so the most natural way to tackle it is to use Bayes theorem. This will directly yield an interpretable probability for what we’re looking for (most probable number of and uncertainty interval)

Denote an integer n>3 and:

What we want writes as a simple product of quantities that we can compute, thanks to Bayes:
Notice that there is an “proportional to” sign instead of an equal. This is because the denominator is just a normalization constant, which we can take care of easily after computing the likelihood and the prior.
###### Likelihood
The likelihood indicates the odds of us observing the data (in this case, that k_Mrs_White = 2) given the value of the unknown parameter (here the number of games played). Since at the beginning of each game the murderer is chosen at uniform random between 6 characters, the number of times Mrs White ends up being the culprit can be modeled as a binomial distribution with parameters n and 1/6.

This will be easily obtained using the dbinom function, which gives directly the exact value of P(x = k), for any x and a binomial distribution of parameters n and p. Let’s first import a few useful functions that I put in our GitHub repo to save some space on this post, and set a few useful parameters:
library(tidyverse)
source("clue/clue_functions.R")

## Parameters
k_mrs_white <- 2 # Number of times Mrs. White was the murderer
prob <- 1/6 # Probability of Mrs. White being the murderer for one game

Note that we can’t exactly obtain the distribution for any game from 1 to infinity, so we actually compute the distribution for 1 to 200 games (this doesn’t matter much in practice):
x <- 1:200 # Reduction of the problem to a finite number of games

## Likelihood
dlikelihood <- dbinom(k_mrs_white, x, prob)

easy enough 🙂

Side note: when I was a student, I kept forgetting that the distribution functions existed in R and whenever I needed them I used to re-generate them using the random generation function (rbinom in this case) 🤦‍♂️
###### Prior
There are a lot of possible choices for the prior but here I’m going to consider that I don’t have any real reason to believe and assume a uniform probability for any number of games between 3 and 200:
dprior1 <- dunifdisc(x,3,100)
plot_clue_prior(x, dprior1)

###### First posterior
Using the likelihood and the prior, we can easily compute the posterior, normalize it and plot it:
dposterior1 <- dlikelihood * dprior1
dposterior1 <- dposterior1 / sum(dposterior1)
plot_clue_posterior(x, dposterior1)

We can also compute directly the estimates we’re looking for. The most probable number of games played is 11:
> which.max(dposterior1)
[1] 11

And there is a 95% chance that the number of games is less than 40:
> threshold_val <- 0.975
> which(cumsum(dposterior1) > (threshold_val))[1]
[1] 40


#### A more realistic data generation mechanism

I find this result very unsatisfying. It doesn’t “feel” right to me that someone would be surprised by only 2 occurrences of Mrs White being guilty in such a small number of games! For example, I simulated 40 games, a number that was supposedly suspiciously high according to the model:
set.seed(9)
N_sim_games <- 40
sim_murderer <- runifdisc(N_sim_games, 6)

plot_murderer <- ggplot(tibble(x=1:N_sim_games, y=sim_murderer), aes(y, stat(count))) +
geom_histogram(aes(y =..count..),
bins=6, fill="white",colour="black") +
ylab("Frequency - murderer") +
xlab("Character #") +
scale_x_continuous(breaks = 1:6)
print(plot_murderer)

We observe that characters #4 and #5 are the murderers in respectively only 2 and 3 games!

In the end I think what really counts is not the likelihood that *Mrs White* was the murderer 2 times, but that the *minimum* number of times one of the characters was the culprit was 2 times!

I think it’s a cool example of a problem where just looking at the data available is not enough to make good inference – you also have to think about *how* the data was generated (in a way, it’s sort of a twist of the Monty Hall paradox, which is one of the most famous examples of problems where the data generation mechanism is critical to understand the result).

I wrote a quick and dirty function based on simulations to generate this likelihood, given a certain number of games. I saved the distribution directly in the GitHub (and called it Gumbel since it kinda looks like an extreme value problem) so that we can call it and do the same thing as above:
gumbelclue_2 <- readRDS("clue/dcluegumbel_2.rds")
gumbelclue_2 <- gumbelclue_2[x]

dposterior_gen <- gumbelclue_2 * dprior1
dposterior_gen <- dposterior_gen / sum(dposterior_gen)

plot_clue_posterior(x, dposterior_gen)

The new posterior has the same shape but appears shifted to the right. For example N_games = 50 seems much more likely now! The estimates are now 23 for the number of games:
> which.max(dposterior_gen)
[1] 23

And 51 for the max bound of the uncertainty interval
> threshold_val <- 0.975
> which(cumsum(dposterior_gen) > (threshold_val))[1]
[1] 51

Credits for title image: Yeonsang

## Est-ce que cette piscine est bien notée ?

J’ai pris la (mauvaise ?) habitude d’utiliser Google Maps et son système de notation (chaque utilisateur peut accorder une note de une à cinq étoiles) pour décider d’où je me rend : restaurants, lieux touristiques, etc. Récemment, j’ai déménagé et je me suis intéressé aux piscines environnantes, pour me rendre compte que leur note tournait autour de 3 étoiles. Je me suis alors fait la réflexion que je ne savais pas, si, pour une piscine, il s’agissait d’une bonne ou d’une mauvaise note ! Pour les restaurants et bars, il existe des filtres permettant de se limiter dans sa recherche aux établissements ayant au moins 4 étoiles ; est-ce que cela veut dire que cette piscine est très loin d’être un lieu de qualité ? Pourtant, dès lors qu’on s’intéresse à d’autres types de services comme les services publics, ou les hôpitaux, on se rend compte qu’il peut y avoir de nombreux avis négatifs, et des notes très basses, par exemple :

Pour répondre à cette question, il faudrait connaître les notes qu’ont les autres piscines pour savoir si 3 étoiles est un bon score ou non. Il serait possible de le faire manuellement, mais ce serait laborieux ! Pour éviter cela, nous allons utiliser l’API de GoogleMaps (Places, vu qu’on va s’intéresser à des lieux et non des trajets ou des cartes personnalisées).

API, késako? Une API est un système intégré à un site web permettant d’y accéder avec des requêtes spécifiques. J’avais déjà utilisé une telle API pour accéder aux données sur le nombre de vues, de like, etc. sur Youtube ; il existe aussi des API pour Twitter, pour Wikipedia

Pour utiliser une telle API, il faut souvent s’identifier ; ici, il faut disposer d’une clef API spécifique pour Google Maps qu’on peut demander ici. On peut ensuite utiliser l’API de plusieurs façons : par exemple, faire une recherche de lieux avec une chaîne de caractères, comme ici “Piscine in Paris, France” (avec cette fonction) ; ensuite, une fois que l’on dispose de l’identifiant du lieu, on peut chercher plus de détails, comme sa note, avec cette fonction. De façon pratique, j’ai utilisé le package googleway qui possède deux fonctions qui font ce que je décris juste avant : google_place et google_place_details.

En utilisant ces fonctions, j’ai réussi à récupérer de l’information sur une soixantaine de piscines à Paris et ses environs très proches (je ne sais pas s’il s’agit d’une limite de l’API, mais le nombre ne semblait pas aberrant !). J’ai récupéré les notes et je constate ainsi que la note moyenne est autour de 3.5, ce qui laisse à penser que les piscines à proximité de mon nouvel appartement ne sont pas vraiment les meilleures… De façon plus visuelle, je peux ensuite représenter leur note moyenne (en rouge quand on est en dessous de 2, en vert de plus en plus foncé au fur et à mesure qu’on se rapproche de 5) sur la carte suivante (faite avec Leaflet, en utilisant le très bon package leaflet)

## Comparaison avec d’autres lieux

En explorant Google Maps aux alentours, je me suis rendu compte que les agences bancaires du quartier étaient particulièrement mal notées, en particulier pour une banque spécifique (dont je ne citerai pas le nom – mais dont le logo est un petit animal roux). Je peux utiliser la même méthode pour récupérer par l’API des informations sur ces agences (et je constate qu’effectivement, la moyenne des notes est de 2 étoiles), puis les rajouter sur la même carte (les piscines correspondent toujours aux petits cercles ; les agences bancaires sont représentées par des cercles plus grands), en utilisant le même jeu de couleurs que précédemment :

La carte est difficile à lire : on remarque surtout que les petits cercles (les piscines) sont verts et que les grands (les agences bancaires) sont rouges. Or, il pourrait être intéressant de pouvoir comparer entre eux les lieux de même type. Pour cela, nous allons séparer au sein des piscines les 20% les moins bien notées, puis les 20% d’après, etc., et faire de même avec les agences bancaires. On applique ensuite un schéma de couleur qui associe du rouge aux 40% des lieux les pires – relativement (40% des piscines et 40% des agences bancaires), et du vert pour les autres. La carte obtenue est la suivante : elle permet de repérer les endroits de Paris où se trouvent, relativement, les meilleurs piscines et les meilleures agences bancaires en un seul coup d’œil !

Google introduit des modifications aux notes (en particulier quand il y a peu de notes, voir ici (en), mais pas seulement (en)) ; il pourrait être intéressant d’ajouter une fonctionnalité permettant de comparer les notes des différents lieux relativement aux autres de même catégorie !

## Analyse de pronostics pour le Mondial 2018

On est les champions ! Si nous n’avons pas eu le temps de faire un modèle de prédiction pour cette coupe du monde de football 2018 (mais FiveThirtyEight en a fait un très sympa, voir ici), cela ne nous a pas empêché de faire un concours de pronostics entre collègues et ex-collègues statisticiens, sur le site Scorecast. Les résultats obtenus sont les suivants :

JoueurScore
Nic102
Cle100
Ron100
Lud96
Tho90
Lio88
Lis87
Pap86
Mau84
Yan78
Ant78
Lau75
Thi71
Arn56
Oli28
Mar7

## Un autre système de points ?

Le système de points utilisé par Scorecast est le suivant : si on a le bon gagnant, on gagne un faible nombre de points ; si en plus du bon gagnant, on a bien prédit l’écart de buts, on gagne un peu plus de points ; et enfin, si on a le score exact, on gagne le nombre maximal de points. Ce nombre maximal de points augmente au fur et à mesure de la compétition : la finale vaut plus de points qu’un match de poules. Ce système ne tient pas compte de cotes préexistantes (comme le fait par exemple Mon petit prono), ou du fait que certains matchs sont bien prédits par tout le monde alors que pour d’autres seule une personne a bien trouvé, voire personne.

Je propose donc ici d’altérer légèrement l’attribution des points, de la façon suivante : on dispose d’un nombre de points équivalent pour chaque match d’une même manche (match de poule, de quart, etc.), qu’on répartit entre les joueurs qui ont bien prédit le score, avec un avantage pour ceux qui ont le bon écart de points ou le bon score exact. Le nombre de points à répartir augmente tout au long de la compétition, de sorte que les phases finales aient plus d’importance dans le classement final.

Pourquoi faire ça ? Pour favoriser les joueurs qui ont fait des paris plus originaux et potentiellement plus risqués, ou en tout cas qui étaient les seuls à avoir la bonne intuition. Voici les résultats :

JoueurScoreScore modifié
Mau84185
Lud96163
Nic102144
Tho90136
Ant78135
Cle100126
Ron100123
Lis87120
Lio88115
Pap86108
Yan78105
Lau75100
Thi7190
Arn5678
Oli2843
Mar710

On constate que le classement évolue sensiblement avec cette nouvelle méthode de points ! Mais peut-être que certains auraient fait d’autres paris si ces règles étaient décidées…

## Choix des scores

Une des principales difficultés du pronostic est qu’il ne suffit pas de savoir (ou de penser savoir) qui va gagner le match, mais il faut aussi indiquer le score attendu. Regardons si les prédictions de l’ensemble des parieurs de notre ligue ont été pertinentes par rapport aux vrais scores ! Pour cela, on détermine pour chaque score le pourcentage des matchs qui ont abouti à ce résultat d’une part, et le pourcentage des paris faits avec ce score. On regarde ensuite la différence entre les pourcentages, qu’on va illustrer par la heatmap ci-dessous. Les cases vertes correspondent aux scores des matchs trop rarement prédits ; les cases rouges aux scores très souvent prédits mais qui n’arrivent que peu ou pas.

On constate que l’on a surestimé largement le nombre de 2-1, de 3-0 et de 4-0 (score qui n’est jamais arrivé lors de cette coupe du monde) ; ce sont d’ailleurs les seuls “gros” scores qui ont été surestimés dans les prédictions : tous les autres ont été sous-évalués. Cela peut laisser penser que les paris ont été faits avec une logique conservative et en évitant de tenter des scores absurdes, comme 7-0 pour l’Arabie Saoudite contre la Russie !

## Analyse de données et classification

Enfin, une dernière utilisation possible de ce jeu de données est d’en faire l’analyse pour en extraire des classes de parieurs ayant un peu le même profil (ou en tout cas les mêmes réussites), et pour voir ce qui les sépare. Plusieurs méthodes sont possibles pour cela.

Commençons par un grand classique : la Classification Ascendante Hiérarchique (CAH pour les intimes), qui est une méthode qui part de groupes d’une personne, et qui, à chaque étape, regroupe deux groupes de telle façon à ce que l’inertie intra augmente au minimum. De façon moins barbare, cela veut dire qu’on regroupe les deux groupes qui se ressemblent le plus, étape par étape, jusqu’à arriver à la population totale. On représente souvent ce type de méthodes par un dendogramme, qui ressemble un peu à un arbre phylogénétique en biologie de l’évolution, et qui illustre la construction des classes, de bas en haut.

On remarque qu’il y a de nombreux binômes qui sont cohérents, et qui signalent des parieurs avec des profils comparables (par exemple, Mar et Oli, qui correspondent à deux joueurs ayant raté une bonne partie de la compétition, soit en arrêtant les paris, soit en arrivant en cours), et qu’il y a une séparation entre les quatre joueurs de gauche et les autres (eux-mêmes largement séparés entre les 3 les plus à gauche et les autres).

Une autre possibilité est d’utiliser l’Analyse en Composantes Principales, que nous avions déjà utilisé dans un contexte footballistique ici ou ici (en). La logique est ici de chercher à résumer une matrice avec beaucoup d’informations (pour chaque joueur, l’ensemble des points obtenus via ses paris pour chaque match) en un nombre minimal de dimensions, dits d’axes, qui suffisent pour avoir une bonne idée de la logique d’organisation du jeu de données.

Si l’on réalise cette méthode ici, voici ce que l’on obtient sur les premiers axes :

L’axe 1 est souvent victime de ce qu’on appelle l'”effet taille” : on entend par là le fait que les individus ayant de grandes valeurs de certaines variables en ont souvent aussi pour les autres variables, et symétriquement pour les individus qui ont des petites valeurs. En effet, on voit que la variable supplémentaire, le total de points obtenus (avec la méthode Scorecast), en bleu, est proche de l’axe 1. Cela veut dire que les individus à droite de l’axe ont tendance à avoir un score important, tandis que ceux à gauche n’ont pas très bien réussi leurs prédictions.

On constate également que les représentations sur les plans constitués des dimensions 1-2, et 2-3, ont tendance à rapprocher les individus que la classification effectuée plus haut associait en binôme. Cela montre une certaine cohérence, ce qui est toujours rassurant !

Plus dans le détail, on voit que les axes 2 et 3 semblent correspondre aux paris suivants, qui sont donc discriminants entre les différents joueurs :

• Pour l’axe 2, avoir réussi son pari sur les matchs Pérou-Danemark, Mexique-Suède, Brésil-Suisse, Espagne-Russie et Argentine-Croatie
• Pour l’axe 3, avoir réussi son pari sur les matchs Japon-Sénégal, Suisse-Costa Rica, Danemark-France ou encore Brésil-Mexique

Difficile de trouver une interprétation de ces axes…

## Weighting tricks for machine learning with Icarus – Part 1

Calibration in survey sampling is a wonderful tool, and today I want to show you how we can use it in some Machine Learning applications, using the R package Icarus. And because ’tis the season, what better than a soccer dataset to illustrate this? The data and code are located on this gitlab repo: https://gitlab.com/haroine/weighting-ml

First, let’s start by installing and loading icarus and nnet, the two packages needed in this tutorial, from CRAN (if necessary):

install.packages(c("icarus","nnet"))
library(icarus)
library(nnet)


load("data/weighting_ML_part1.RData")


The RData file contains two dataframes, one for the training set and one for the test set. They contain results of some international soccer games, from 01/2008 to 12/2016 for the training set, and from 01/2017 to 11/2017 for the test. Along with the team names and goals scored for each side, a few descriptive variables that we’re going to use as features of our ML models:

> head(train_soccer)
Date                   team opponent_team home_field elo_team
1 2010-10-12                Belarus       Albania          1      554
2 2010-10-08 Bosnia and Herzegovina       Albania          0      544
3 2011-06-07 Bosnia and Herzegovina       Albania          0      594
4 2011-06-20              Argentina       Albania          1     1267
5 2011-08-10             Montenegro       Albania          0      915
6 2011-09-02                 France       Albania          0      918
opponent_elo importance goals_for goals_against outcome year
1          502          1         2             0     WIN 2010
2          502          1         1             1    DRAW 2010
3          564          1         2             0     WIN 2011
4          564          1         4             0     WIN 2011
5          524          1         2             3    LOSS 2011
6          546          1         2             1     WIN 2011

elo_team and opponent_elo are quantitative variables indicative of the level of the team at the date of the game ; importance is a measure of high-profile the game played was (a friendly match rates 1 while a World Cup game rates 4). The other variables are imo self-descriptive.

Then we can train a multinomial logistic regression, with outcome being the predicted variable, and compute the predictions from the model:

outcome_model_unw <- multinom(outcome ~ elo_team + opponent_elo + home_field + importance,
data = train_soccer)

test_soccer$pred_outcome_unw <- predict(outcome_model_unw, newdata = test_soccer) The sheer accuracy of this predictor is kinda good: > ## Accuracy > sum(test_soccer$pred_outcome_unw == test_soccer$outcome) / nrow(test_soccer) [1] 0.5526316 but it has a problem: it never predicts draws! > summary(test_soccer$pred_outcome_unw)
DRAW LOSS  WIN
0  208  210

And indeed, draws being less common than other results, it seems more profitable for the algorithm that optimizes accuracy never to predict them. As a consequence, the probabilities of the game being a draw is always lesser than the probability of one team winning it. We could show that the probabilities are not well calibrated.

A common solution to this problem is to use reweighting to correct the imbalances in the sample, which we’ll now tackle. It is important to note that the weighting trick has to happen in the training set to avoid “data leaks”. A very good piece on this subject has been written by Max Kuhn in the documentation of caret.

Commonly, you would do:

train_soccer$weight <- 1 train_soccer[train_soccer$outcome == "DRAW",]$weight <- (nrow(train_soccer)/table(train_soccer$outcome)[1]) * 1/3
train_soccer[train_soccer$outcome == "LOSS",]$weight <- (nrow(train_soccer)/table(train_soccer$outcome)[2]) * 1/3 train_soccer[train_soccer$outcome == "WIN",]$weight <- (nrow(train_soccer)/table(train_soccer$outcome)[3]) * 1/3


> table(train_soccer$weight) 0.916067146282974 1.22435897435897 3336 1248 The draws are reweighted with a factor greater than 1 and the other games with a factor lesser than 1. This balances the predicted outcomes and thus improves the quality of the probabilities … outcome_model <- multinom(outcome ~ elo_team + opponent_elo + home_field + importance, data = train_soccer, weights = train_soccer$weight)

test_soccer$pred_outcome <- predict(outcome_model, newdata = test_soccer)  > summary(test_soccer$pred_outcome)
DRAW LOSS  WIN
96  167  155

… though at a loss in accuracy:

> ## Accuracy
> sum(test_soccer$pred_outcome == test_soccer$outcome) / nrow(test_soccer)
[1] 0.5263158

Now let’s look at the balance of our training sample on other variables:

> round(table(test_soccer$importance) / nrow(test_soccer),2) 1 2 3 4 0.26 0.08 0.54 0.12 > round(table(train_soccer$importance) / nrow(train_soccer),2)

1    2    3    4
0.56 0.08 0.23 0.12

It seems that the test set features a lot more important matches than the training set. Let’s look further, in particular at the dates the matches of the training set were played:

> round(table(train_soccer$year) / nrow(train_soccer),2) 2008 2009 2010 2011 2012 2013 2014 2015 2016 0.10 0.11 0.11 0.10 0.11 0.13 0.11 0.11 0.12 Thus the matches of each year between 2008 and 2016 have the same influence on the final predictor. A better idea would be to give the most recent games a slightly higher influence, for example by increasing their weight and thus reducing the weights of the older games: nyears <- length(unique(train_soccer$year))
year_tweak <- rep(1/nyears,nyears) * 1:nyears
year_tweak <- year_tweak * 1/sum(year_tweak) ## Normalization


> year_tweak
[1] 0.02222222 0.04444444 0.06666667 0.08888889 0.11111111 0.13333333
[7] 0.15555556 0.17777778 0.20000000

We determine it is thus a good idea to balance on these two additional variables (year and importance). Now how should we do this? A solution could be to create an indicator variable containing all the values of the cross product between the variables outcome, year and importance, and use the same reweighting technique as before. But this would not be very practical and more importantly, some of the sub-categories would be nearly empty, making the procedure not very robust. A better solution is to use survey sampling calibration and Icarus 🙂

train_soccer$weight_cal <- 1 importance_pct_test <- unname( table(test_soccer$importance) / nrow(test_soccer),
)

marginMatrix <- matrix(, nrow = 0, ncol = 1) %>% ## Will be replaced by newMarginMatrix() in icarus 0.3.2

train_soccer$weight_cal <- calibration(data=train_soccer, marginMatrix=marginMatrix, colWeights="weight_cal", pct=TRUE, description=TRUE, popTotal = nrow(train_soccer), method="raking") outcome_model_cal <- multinom(outcome ~ elo_team + opponent_elo + home_field + importance, data = train_soccer, weights = train_soccer$weight_cal)

frenchLeague$name <- NULL  Here’s what our object looks like (we only display the first few lines here): > head(frenchLeague) foot position league age height overall Florian Thauvin left RM Ligue1 1 3 4 Layvin Kurzawa left LB Ligue1 1 3 4 Anthony Martial right ST Ligue1 1 3 4 Clinton N'Jie right ST Ligue1 1 2 3 Marco Verratti right MC Ligue1 1 1 4 Alexandre Lacazette right ST Ligue1 2 2 4 ### Data analysis Our dataset contains categorical variables. The appropriate data analysis method is the Multiple Correspondance Analysis. This method is implemented in FactoMineR in the method MCA. We choose to treat the variables “position”, “league” and “age” as supplementary: > library(FactoMineR) > mca <- MCA(frenchLeague, quali.sup=c(2,3,4))  This produces three graphs: the projection on the factorial axes of categories and players, and the graph of the variables. Let’s just have a look at the second one of these graphs: Before trying to go any further into the analysis, something should alert us. There clearly are two clusters of players here! Yet the data analysis techniques like MCA suppose that the scatter plot is homogeneous. We’ll have to restrict the analysis to one of the two clusters in order to continue. On the previous graph, supplementary variables are shown in green. The only supplementary variable that appears to correspond to the cluster on the right is the goalkeeper position (“GK”). If we take a closer look to the players on this second cluster, we can easily confirm that they’re actually all goalkeeper. This absolutely makes a lot of sense: in football, the goalkeeper is a very different position, and we should expect these players to be really different from the others. From now on, we will only focus on the positions other than goalkeepers. We also remove from the analysis the abilities that are specific to goalkeepers, which are not important for other players and can only add noise to our analysis: > frenchLeague_no_gk <- frenchLeague[frenchLeague$position!="GK",-c(31:35)]
> mca_no_gk <- MCA(frenchLeague_no_gk, quali.sup=c(2,3,4))

And now our graph features only one cluster.

### Interpretation

Obviously, we have to start by reducing the analysis to a certain number of factorial axes. My favorite method to chose the number of axes is the elbow method. We plot the graph of the eigenvalues:

> barplot(mca_no_gk$eig$eigenvalue)

Around the third or fourth eigenvalue, we observe a drop of the values (which is the percentage of the variance explained par the MCA). This means that the marginal gain of retaining one more axis for our analysis is lower after the 3rd or 4th first ones. We thus choose to reduce our analysis to the first three factorial axes (we could also justify chosing 4 axes). Now let’s move on to the interpretation, starting with the first two axes:

> plot.MCA(mca_no_gk, invisible = c("ind","quali.sup"))

We could start the analysis by reading on the graph the name of the variables and modalities that seem most representative of the first two axes. But first we have to keep in mind that there may be some of the modalities whose coordinates are high that have a low contribution, making them less relevant for the interpretation. And second, there are a lot of variables on this graph, and reading directly from it is not that easy. For these reasons, we chose to use one of FactoMineR’s specific functions, dimdesc (we only show part of the output here):

> dimdesc(mca_no_gk)
$Dim 1$category
Estimate       p.value
finishing_1        0.700971584 1.479410e-130
volleys_1          0.732349045 8.416993e-125
long_shots_1       0.776647500 4.137268e-111
sliding_tackle_3   0.591937236 1.575750e-106
curve_1            0.740271243  1.731238e-87
[...]
finishing_4       -0.578170467  7.661923e-82
shot_power_4      -0.719591411  2.936483e-86
ball_control_4    -0.874377431 5.088935e-104
dribbling_4       -0.820552850 1.795628e-117

The most representative abilities of the first axis are, on the right side of the axis, a weak level in attacking abilities (finishing, volleys, long shots, etc.) and on the left side a very strong level in those abilities. Our interpretation is thus that axis 1 separates players according to their offensive abilities (better attacking abilities on the left side, weaker on the right side). We procede with the same analysis for axis 2 and conclude that it discriminates players according to their defensive abilities: better defenders will be found on top of the graph whereas weak defenders will be found on the bottom part of the graph.

Supplementary variables can also help confirm our interpretation, particularly the position variable:

> plot.MCA(mca_no_gk, invisible = c("ind","var"))

And indeed we find on the left part of the graph the attacking positions (LW, ST, RW) and on the top part of the graph the defensive positions (CB, LB, RB).

If our interpretation is correct, the projection on the second bissector of the graph will be a good proxy for the overall level of the player. The best players will be found on the top left area while the weaker ones will be found on the bottom right of the graph. There are many ways to check this, for example looking at the projection of the modalities of the variable “overall”. As expected, “overall_4” is found on the top-left corner and “overall_1” on the bottom-right corner. Also, on the graph of the supplementary variables, we observe that “Ligue 1” (first division of the french league) is on the top-left area while “Ligue 2” (second division) lies on the bottom-right area.

With only these two axes interpreted there are plenty of fun things to note:

• Left wingers seem to have a better overall level than right wingers (if someone has an explanation for this I’d be glad to hear it!)
• Age is irrelevant to explain the level of a player, except for the younger ones who are in general weaker.
• Older players tend to have more defensive roles

Let’s not forget to deal with axis 3:

> plot.MCA(mca_no_gk, invisible = c("ind","var"), axes=c(2,3))

Modalities that are most representative of the third axis are technical weaknesses: the players with the lower technical abilities (dribbling, ball control, etc.) are on the end of the axis while the players with the highest grades in these abilities tend to be found at the center of the axis:

We note with the help of the supplementary variables, that midfielders have the highest technical abilities on average, while strikers (ST) and defenders (CB, LB, RB) seem in general not to be known for their ball control skills.

Now we see why we chose to make the variables categorical instead of quantitative. If we had kept the orginal variables (quantitative) and performed a PCA on the data, the projections would have kept the orders for each variable, unlike what happens here for axis 3. And after all, isn’t it better like this? Ordering players according to their technical skills isn’t necessarily what you look for when analyzing the profiles of the players. Football is a very rich sport, and some positions don’t require Messi’s dribbling skills to be an amazing player!

### Mathieu Valbuena

Now we add the data for a new comer in the French League, Mathieu Valbuena (actually Mathieu Valbuena arrived in the French League in August of 2015, but I warned you that the data was a bit old ;)). We’re going to compare Mathieu’s profile (as a supplementary individual) to the other players, using our data analysis.

> columns_valbuena <- c("right","RW","Ligue1",3,1
,4,4,3,4,3,4,4,4,4,4,3,4,4,3,3,1,3,2,1,3,4,3,1,1,1)
> frenchLeague_no_gk["Mathieu Valbuena",] <- columns_valbuena

> mca_valbuena <- MCA(frenchLeague_no_gk, quali.sup=c(2,3,4), ind.sup=912)
> plot.MCA(mca_valbuena, invisible = c("var","ind"), col.quali.sup = "red", col.ind.sup="darkblue")
> plot.MCA(mca_valbuena, invisible = c("var","ind"), col.quali.sup = "red", col.ind.sup="darkblue", axes=c(2,3))

Last two lines produce the graphs with Mathieu Valbuena on axes 1 and 2, then 2 and 3:

So, Mathieu Valbuena seems to have good offensive skills (left part of the graph), but he also has a good overall level (his projection on the second bissector is rather high). He also lies at the center of axis 3, which indicates he has good technical skills. We should thus not be surprised to see that the positions that suit him most (statistically speaking of course!) are midfield positions (CAM, LM, RM). With a few more lines of code, we can also find the French league players that have the most similar profiles:

> mca_valbuena_distance <- MCA(frenchLeague_no_gk[,-c(3,4)], quali.sup=c(2), ind.sup=912, ncp = 79)
> distancesValbuena <- as.data.frame(mca_valbuena_distance$ind$coord)
> distancesValbuena[912, ] <- mca_valbuena_distance$ind.sup$coord

> euclidianDistance <- function(x,y) {

return( dist(rbind(x, y)) )

}

> distancesValbuena$distance_valbuena <- apply(distancesValbuena, 1, euclidianDistance, y=mca_valbuena_distance$ind.sup$coord) > distancesValbuena <- distancesValbuena[order(distancesValbuena$distance_valbuena),]

> names_close_valbuena <- c("Mathieu Valbuena", row.names(distancesValbuena[2:6,]))

And we get: Ladislas Douniama, Frédéric Sammaritano, Florian Thauvin, N’Golo Kanté and Wissam Ben Yedder.

There would be so many other things to say about this data set but I think it’s time to wrap this (already very long) article up 😉 Keep in mind that this analysis should not be taken too seriously! It just aimed at giving a fun tutorial for students to discover R, FactoMineR and data analysis.