statistics – NC233

Causal Inference cheat sheet for data scientists

Being able to make causal claims is a key business value for any data science team, no matter their size.
Quick analytics (in other words, descriptive statistics) are the bread and butter of any good data analyst working on quick cycles with their product team to understand their users. But sometimes some important questions arise that need more precise answers. Business value sometimes means distinguishing what is true insights from what is incidental noise. Insights that will hold up versus temporary marketing material. In other terms causation.

When answering these questions, absolute rigour is required. Failing to understand key mechanisms could mean missing out on important findings, rolling out the wrong version of a product, and eventually costing your business millions of dollars, or crucial opportunities.
Ron Kohavi, former director of the experimentation team at Microsoft, has a famous example: changing the place where credit card offers were displayed on amazon.com generated millions in revenue for the company.

The tech industry has picked up on this trend in the last 6 years, making Causal Inference a hot topic in data science. Netflix, Microsoft and Google all have entire teams built around some variations of causal methods. Causal analysis is also (finally!) gaining a lot of traction in pure AI fields. Having an idea of what causal inference methods can do for you and for your business is thus becoming more and more important.

The causal inference levels of evidence ladder

Hence the causal inference ladder cheat sheet! Beyond the value for data scientists themselves, I’ve also had success in the past showing this slide to internal clients to explain how we were processing the data and making conclusions.

The “ladder” classification explains the level of proof each method will give you. The higher, the easier it will be to make sure the results from your methods are true results and reproducible – the downside is that the set-up for the experiment will be more complex. For example, setting up an A/B test typically requires a dedicated framework and engineering resources.
Methods further down the ladder will require less effort on the set-up (think: observational data), but more effort on the rigour of the analysis. Making sure your analysis has true findings and is not just commenting some noise (or worse, is plain wrong) is a process called robustness checks. It’s arguably the most important part of any causal analysis method. The further down on the ladder your method is, the more robustness checks I’ll require if I’m your reviewer 🙂

I also want to stress that methods on lower rungs are not less valuable – it’s almost the contrary! They are brilliant methods that allow use of observational data to make conclusions, and I would not be surprised if people like Susan Athey and Guido Imbens, who have made significant contributions to these methods in the last 10 years, were awarded the Nobel prize one of these days!

causal_cheat_sheet — The causal inference levels of evidence ladder – click on the image to enlarge it

Rung 1 – Scientific experiments

On the first rung of the ladder sit typical scientific experiments. The kind you were probably taught in middle or even elementary school. To explain how a scientific experiment should be conducted, my biology teacher had us take seeds from a box, divide them into two groups and plant them in two jars. The teacher insisted that we made the conditions in the two jars completely identical: same number of seeds, same moistening of the ground, etc.
The goal was to measure the effect of light on plant growth, so we put one of our jars near a window and locked the other one in a closet. Two weeks later, all our jars close to the window had nice little buds, while the ones we left in the closet barely had grown at all.
The exposure to light being the only difference between the two jars, the teacher explained, we were allowed to conclude that light deprivation caused plants to not grow.

Sounds simple enough? Well, this is basically the most rigorous you can be when you want to attribute cause. The bad news is that this methodology only applies when you have a certain level of control on both your treatment group (the one who receives light) and your control group (the one in the cupboard). Enough control at least that all conditions are strictly identical but the one parameter you’re experimenting with (light in this case). Obviously, this doesn’t apply in social sciences nor in data science.

Then why do I include it in this article you might ask? Well, basically because this is the reference method. All causal inference methods are in a way hacks designed to reproduce this simple methodology in conditions where you shouldn’t be able to make conclusions if you followed strictly the rules explained by your middle school teacher.

Rung 2 – Statistical Experiments (aka A/B tests)

Probably the most well-known causal inference method in tech: A/B tests, a.k.a Randomized Controlled Trials for our Biostatistics friends. The idea behind statistical experiments is to rely on randomness and sample size to mitigate the inability to put your treatment and control groups in the exact same conditions. Fundamental statistical theorems like the law of large numbers, the Central Limit theorem or Bayesian inference gives guarantees that this will work and a way to deduce estimates and their precision from the data you collect.

Arguably, an Experiments platform should be one of the first projects any Data Science team should invest in (once all the foundational levels are in place, of course). The impact of setting up an experiments culture in tech companies has been very well documented and has earned companies like Google, Amazon, Microsoft, etc. billions of dollars.

Of course, despite being pretty reliable on paper, A/B tests come with their own sets of caveats. This white paper by Ron Kohavi and other founding members of the Experiments Platform at Microsoft is very useful.

Rung 3 – Quasi-Experiments

As awesome as A/B tests (or RCTs) can be, in some situations they just can’t be performed. This might happen because of lack of tooling (a common case in tech is when a specific framework lacks the proper tools to set up an experiment super quickly and the test becomes counter-productive), ethical concerns, or just simply because you want to study some data ex-post. Fortunately for you if you’re in one of those situations, some methods exist to still be able to get causal estimates of a factor. In rung 3 we talk about the fascinating world of quasi-experiments (also called natural experiments).

A quasi-experiment is the situation when your treatment and control group are divided by a natural process that is not truly random but can be considered close enough to compute estimates. In practice, this means that you will have different methods that will correspond to different assumptions about how “close” you are to the A/B test situation. Among famous examples of natural experiments: using the Vietnam war draft lottery to estimate the impact of being a veteran on your earnings, or the border between New Jersey and Pennsylvania to study the effect of minimum wages on the economy.

Now let me give you a fair warning: when you start looking for quasi-experiments, you can quickly become obsessed by it and start thinking about clever data collection in improbable places… Now you can’t say you haven’t been warned 😜 I have more than a few friends who were ~~lured into~~ attracted by a career in econometrics for the sheer love of natural experiments.

Most popular methods in the world of quasi-experiments are: differences-in-differences (the most common one, according to Scott Cunnigham, author of the Causal Inference Mixtape), Regression Discontinuity Design, Matching, or Instrumental variables (which is an absolutely brilliant construct, but rarely useful in practice). If you’re able to observe (i.e. gather data) on all factors that explain how treatment and control are separated, then a simple linear regression including all factors will give good results.

Rung 4 – The world of counterfactuals

Finally, you will sometimes want to try to detect causal factors from data that is purely observational. A classic example in tech is estimating the effect of a new feature when no A/B test was done and you don’t have any kind of group that isn’t receiving the feature that you could use as a control:

Slightly adapted from CausalImpact‘s documentation

Maybe right new you’re thinking: wait… are you saying we can simply look at the data before and after and be allowed to make conclusions? Well, the trick is that often it isn’t that simple to make a rigorous analysis or even compute an estimate. The idea here is to create a model that will allow to compute a counterfactual control group. Counterfactual means “what would have happened hadn’t this feature existed”. If you have a model of your number of users that you have enough confidence in to make some robust predictions, then you basically have everything

There is a catch though. When using counterfactual methods, the quality of your prediction is key. Without getting too much into the technical details, this means that your model not only has to be accurate enough, but also needs to “understand” what underlying factors are driving what you currently observe. If a confounding factor that is independent from your newest rollout varies (economic climate for example), you do not want to attribute this change to your feature. Your model needs to understand this as well if you want to be able to make causal claims.

This is why robustness checks are so important when using counterfactuals. Some cool Causal Inference libraries like Microsoft’s doWhy do these checks automagically for you 😲 Sensitivity methods like the one implemented in the R package tipr can be also very useful to check some assumptions. Finally, how could I write a full article on causal inference without mentioning DAGs? They are a widely used tool to state your assumptions, especially in the case of rung 4 methods.

(Quick side note: right now with the unprecedented Covid-19 crisis, it’s likely that most prediction models used in various applications are way off. Obviously, those cannot be used for counterfactual causal analysis)

Technically speaking, rung 4 methods look really much like methods from rung 3, with some small tweaks. For example, synthetic diff-in-diff is a combination of diff-in-diff and matching. For time series data, CausalImpact is a very cool and well-known R package. causalTree is another interesting approach worth looking at. More generally, models carefully crafted with domain expertise and rigorously tested are the best tools to do Causal Inference with only counterfactual control groups.

Hope this cheat sheet will help you find the right method for your causal analyses and be impactful for your business! Let us know about your best #causalwins on our Twitter, or in the comments!

Micromorts – how much risk of death would you accept?

tl;dr
– A 1% death risk is deceptively high
– Micromorts (1/10000 th of a percent) are a useful scale to model death risks
– Statistical models for human life use ligh-tailed distributions. High values are extremely rare

Today a short post that I had in my drafts for a long time. I didn’t expect that it would (unfortunately) be so relevant to today’s context.

Life is finite and all human activities are risky. Although we all face a certain (hopefully low) risk of dying each time we breathe, it’s not enough reason to prevent us from doing any activities and live isolated in bubbles. But exactly how much risk of dying is acceptable? How much risk on your own life would you be willing to accept?

Living in a bubble would be fun! – xkcd.com

Turns out most people, sometimes even trained scientists, are bad at estimating death risk probabilities. They often underestimate how bad even seemingly low probabilities of dying can turn out to be. During a dramatic time of my life where I cared for a child who became suddenly sick. The head surgeon told us she had an 85% chance of making it.
So maybe you’re thinking oh it wasn’t that bad! And I mean I understand, 85 is pretty close to 100, situation’s looking fairly good, right?

I was terrified.

To put this number into perspective, imagine if all patients admitted faced such a risk. Let’s say doctors see 15 patients per hour, work 10 hours a day and that the department has 10 doctors. This represents approximately 35’000 patients per year, which seems to fit this UK data. With a 15% death rate, this department would have to deal with more than 5000 deaths over the course of the year, which is the number of people who died in the entire city of San Francisco in 2018! This is one death every 1 hour 40 minutes!

An activity with a 99% chance of survival would certainly kill you in less than a year

In fact, routine surgical procedures with risks greater than 5% are classified high risk. Even a 99% chance of survival doesn’t look so good. If you were enough of a daredevil to engage every day in an activity that exposes you to “only” 1% death probability, then you’d be almost certainly dead within a year.
(You can see this easily by using this easy rule of thumb: consider a random event occurring with probability p. Then there is a 95% chance that the event will occur in less than 3/p tries. In this case, this would be 3 / 0.01 = 300 days, which is less than a year)

Micromort – the right scale for death risks

As it turns out, percents are not the right risk scale to talk about death risks. Ronald A. Howard realized this in 1979 and created the notion of micromort. A micromort represents a one-in-a-million chance of dying. Wikipedia has a list of how much risk some activities expose you to:

A micromort is one in a million chance of dying – it is equivalent to tossing 20 coins and getting 20 heads

Wikipedia has a list of how much risk some activities expose you to:

One day alive at age 20 – 1 micromort
Skydiving (one jump) – 10 micromorts
One day alive at age 90 – 400 micromorts
Being infected by the Spanish flu – 30000 micromorts
An ascent to Mt Everest – 40000 micromorts

Using this scale, my child’s illness exposed her to 150000 micromorts… which suddenly looks much frightening, and a much more intuitive representation of the risk she was actually exposed to.
(Side note if you’re wondering: my kid is fine, and I am really grateful for this 🙏 Diane, if you ever read this you are the sunshine of my life! ❤️☀️)

If this starts feeling scary, don’t worry too much. A risk of one micromort is equivalent to tossing 20 coins and getting only heads. It’s pretty unlikely! The problem is that you’re playing this game every day, and that every once in a while you have to remove some coins. By the time you’re 50, you only have 17 coins left, by the time you’re 90, you only have 11 left.

Statistical models of death

A neat thing about Micromorts is that they also make a good and intuitive statistical model for age at death for humans. Let’s consider this very simple model based on the “game” described in the previous paragraph. Every day you play this game, with a certain risk of dying (for sake of simplicity, let’s forget about modelling childhood and only concentrate on adult life):

Between the ages of 20 and 80, the risk is Floor(age / 10) – 1 micromorts (for example all days of your 26th year, you face a 1 micromort risk, all days in you 63rd year, you face 5 micromorts)
At age 80, the risk jumps to 100 micromorts, and then each year you have to add 50 additional micromorts (which means 150 micromorts at age 81, 200 micromorts at age 82, etc.)

Fun #Statistics fact: death age distribution (of adults) can be modeled simply with:

20-29: 1 micromort / day
30-39: 2
40-49: 3
50-59: 4
60-69: 5
70-79: 6
80: 100
81: 150
82: 200
…
100: 1200#DataScience
— NC233 (@nc233) December 27, 2020

We can run a few simulations in R to see what life expectancy looks like with this simple model. First let’s define a vector of the risks that match the model we just described:

max_age <- 300
min_age <- 20
age_cut <- 80
risks1 <- rep(1:7, each=365*10)
risks2 <- rep(((age_cut : max_age) - age_cut) * 50 + 100, each=365)
risks <- c(risks1, risks2) / (1e6)

Then we can run a few simulations to get a vector of age at death for 10 000 people playing this “game”. Note that all simulated values use the vectorization capabilities of the function rbinom:

N <- length(risks)
N_sims <- 1e4

days_sims <- matrix(rbinom(N_sims*N, 1, risks), ncol = N_sims, byrow = F)

death_ages_days <- apply(days_sims, 2, function(x) { 
    day_death <- which.max(x) 
    if(day_death > 1) {
      return(day_death / 365 + min_age)
    } else {
      return(max_age)
    } 
  })

The mean and max age at death are:

> mean(death_ages_days)
[1] 84.9348
> max(death_ages_days)
[1] 103.0274

Not too far from what we observe in most Western countries! For example, life expectancy for Canadian women in 2018 was 84.3 years, and that same year, the oldest Canadian man alive was 109 years old.

We can plot the distribution of age at death:

library(ggplot2)

model_plot <- 
   ggplot(data.frame(age=death_ages_days)) +   
     geom_histogram(aes(x=age, y=..density..), fill="#4b86b4") + 
     geom_density(aes(x=age, y=..density..), colour="#2a4d69") +
     xlab("Age at death") +
     ylab("Frequency") +
     theme_bw()
 print(model_plot)

In addition to being close to the actual demographic values, the most interesting property of this model is that it was able to generate a light tailed distribution.

A light tail distribution is one that quickly falls down for the highest values. Contrary to statistical distributions like revenue where extreme values (i.e. values that are several standard deviations greater than the mean) are routinely observed, extreme values are very unlikely when it comes to human life. The oldest person we know of reached 122 years of age and people who make it past 100 years old are still a tiny minority. It is extremely unlikely that we would observe someone living to be 200 or even 150 years old.
Even a good old normal distribution would yield extreme values more often (a distribution is said to have “light” and “fat” tails if extreme values are less (resp. more) likely to happen than with a normal distribution). We can see this on this plot from a StackExchange post where life expectancy distribution is plotted against the best normal fit:

Histogram of life expectancy vs. a normal distribution. We can see that extreme values of human life are far less more likely to happen than predicted by the normal distribution

One of the best statistical distributions one can use to model human life is the Weibull distribution. It is a variant of the exponential distribution, that is often studied in high school and is typically used to model memory-less failures (i.e. a probability of failing that is independent of time). The Weibull distribution is very similar except the failure rate increases with time, mimicking an aging process. The statistical model I used for this article is in fact very close to how the Weibull distribution works.

Do you like data science and would be interested in building products that help entrepreneurs around the world start and grow their businesses? Shopify is hiring 🎉 for many locations in Canada 🇨🇦 and around the world 🌎! Feel free to reach out directly to me 🙂

Edit: A previous version of the article didn’t feature the tweet illustrating the simple death model, and had typos in two micromorts numbers that were corrected

Est-ce que cette piscine est bien notée ?

J’ai pris la (mauvaise ?) habitude d’utiliser Google Maps et son système de notation (chaque utilisateur peut accorder une note de une à cinq étoiles) pour décider d’où je me rend : restaurants, lieux touristiques, etc. Récemment, j’ai déménagé et je me suis intéressé aux piscines environnantes, pour me rendre compte que leur note tournait autour de 3 étoiles. Je me suis alors fait la réflexion que je ne savais pas, si, pour une piscine, il s’agissait d’une bonne ou d’une mauvaise note ! Pour les restaurants et bars, il existe des filtres permettant de se limiter dans sa recherche aux établissements ayant au moins 4 étoiles ; est-ce que cela veut dire que cette piscine est très loin d’être un lieu de qualité ? Pourtant, dès lors qu’on s’intéresse à d’autres types de services comme les services publics, ou les hôpitaux, on se rend compte qu’il peut y avoir de nombreux avis négatifs, et des notes très basses, par exemple :

Pour répondre à cette question, il faudrait connaître les notes qu’ont les autres piscines pour savoir si 3 étoiles est un bon score ou non. Il serait possible de le faire manuellement, mais ce serait laborieux ! Pour éviter cela, nous allons utiliser l’API de GoogleMaps (Places, vu qu’on va s’intéresser à des lieux et non des trajets ou des cartes personnalisées).

API, késako? Une API est un système intégré à un site web permettant d’y accéder avec des requêtes spécifiques. J’avais déjà utilisé une telle API pour accéder aux données sur le nombre de vues, de like, etc. sur Youtube ; il existe aussi des API pour Twitter, pour Wikipedia…

Pour utiliser une telle API, il faut souvent s’identifier ; ici, il faut disposer d’une clef API spécifique pour Google Maps qu’on peut demander ici. On peut ensuite utiliser l’API de plusieurs façons : par exemple, faire une recherche de lieux avec une chaîne de caractères, comme ici “Piscine in Paris, France” (avec cette fonction) ; ensuite, une fois que l’on dispose de l’identifiant du lieu, on peut chercher plus de détails, comme sa note, avec cette fonction. De façon pratique, j’ai utilisé le package googleway qui possède deux fonctions qui font ce que je décris juste avant : google_place et google_place_details.

En utilisant ces fonctions, j’ai réussi à récupérer de l’information sur une soixantaine de piscines à Paris et ses environs très proches (je ne sais pas s’il s’agit d’une limite de l’API, mais le nombre ne semblait pas aberrant !). J’ai récupéré les notes et je constate ainsi que la note moyenne est autour de 3.5, ce qui laisse à penser que les piscines à proximité de mon nouvel appartement ne sont pas vraiment les meilleures… De façon plus visuelle, je peux ensuite représenter leur note moyenne (en rouge quand on est en dessous de 2, en vert de plus en plus foncé au fur et à mesure qu’on se rapproche de 5) sur la carte suivante (faite avec Leaflet, en utilisant le très bon package leaflet)

Comparaison avec d’autres lieux

En explorant Google Maps aux alentours, je me suis rendu compte que les agences bancaires du quartier étaient particulièrement mal notées, en particulier pour une banque spécifique (dont je ne citerai pas le nom – mais dont le logo est un petit animal roux). Je peux utiliser la même méthode pour récupérer par l’API des informations sur ces agences (et je constate qu’effectivement, la moyenne des notes est de 2 étoiles), puis les rajouter sur la même carte (les piscines correspondent toujours aux petits cercles ; les agences bancaires sont représentées par des cercles plus grands), en utilisant le même jeu de couleurs que précédemment :

La carte est difficile à lire : on remarque surtout que les petits cercles (les piscines) sont verts et que les grands (les agences bancaires) sont rouges. Or, il pourrait être intéressant de pouvoir comparer entre eux les lieux de même type. Pour cela, nous allons séparer au sein des piscines les 20% les moins bien notées, puis les 20% d’après, etc., et faire de même avec les agences bancaires. On applique ensuite un schéma de couleur qui associe du rouge aux 40% des lieux les pires – relativement (40% des piscines et 40% des agences bancaires), et du vert pour les autres. La carte obtenue est la suivante : elle permet de repérer les endroits de Paris où se trouvent, relativement, les meilleurs piscines et les meilleures agences bancaires en un seul coup d’œil !

Google introduit des modifications aux notes (en particulier quand il y a peu de notes, voir ici (en), mais pas seulement (en)) ; il pourrait être intéressant d’ajouter une fonctionnalité permettant de comparer les notes des différents lieux relativement aux autres de même catégorie !

Riddler and Voter Power Index

Oliver Roeder has a nice puzzle: the riddler. Just like last week, this week’s puzzle has an interesting application to the US Election and I enjoyed it really much, so I figured I might just write a blog post 🙂 In this article, we’ll solve this week’s riddler two different ways (just because :p) and discuss an indicator used on FiveThirtyEight’s prediction model for the election: the Voter Power Index.

Exact solution and Stirling approximation

I won’t write again the problem and notations, but you can find them here. We’ll also assume N is odd (as precised later by Ollie on Twitter). This assumption won’t matter much because we’ll only look at applications for large values of N. Let’s write:

$\mathbb{P} = \Pr(you~decide~the~election)$

Your vote is obviously going to be decisive if there is a tie between the N-1 other votes (convienently, N-1 is even). The votes are all independant with same probability p=1/2, so they are Bernoulli trials. Consequently, the probability we’re looking for is the probability that exactly half of these Bernoulli trial succeed, which is by definition the binomial distribution. Thus:

$\mathbb{P} = {{N-1}\choose{\frac{N-1}{2}}} p^{\frac{N-1}{2}} {(1-p)}^{\frac{N-1}{2}} $

As p=0.5, the exact value for the probability of your vote being decisive is thus:

$\fbox{$\mathbb{P} = \frac{{{N-1}\choose{\frac{N-1}{2}}}}{{2}^{N-1}}$}$

So, here is the exact solution, but it’s not super useful as is. Much more interesting is how this varies with N (with N sufficiently large). We can use Stirling’s approximation:

$\log \mathbb{P} = \log {{N-1}\choose{\frac{N-1}{2}}} – (N-1) \log 2 \\
~~~~\sim N \log N – \frac{N}{2} \log \frac{N}{2} – \frac{N}{2} \log \frac{N}{2} + \frac{1}{2} \left( \log N – \log \frac{N}{2} \\~~~~~~~- \log \frac{N}{2} – \log 2\pi \right) – N \log 2 \\~~~~\sim – \frac{1}{2} \log N + \log 2 – \frac{1}{2} \log 2\pi $

Thus for sufficiently large N, the probability your vote is the decisive vote varies like the inverse of the square root of N:

$\fbox{$\mathbb{P}\sim \sqrt{\frac{2}{N\pi}} \approx \frac{0.8}{\sqrt{N}}$}$

A very simple solution for large N

Actually, we could have obtained this result for large N much more simply. We know that asymptotically the binomial distribution is gonna converge to a normal distribution. The event that your vote is the decisive one is actually the most probable event, as probabilities that the other people vote for either candidates are equal to 1/2. So the solution to the riddler can be easily computed using the density of the normal distribution:

$\mathbb{P} = \phi(0) = \frac{1}{\sqrt{2\pi \sigma^2}}$

with:

$\sigma^2 = Np(1-p) = \frac{N}{4}$

(the variance of the binomial distribution), we get the same result as in the first paragraph:

$\fbox{$\mathbb{P}\sim \sqrt{\frac{2}{N\pi}} \approx \frac{0.8}{\sqrt{N}}$}$

Mode of normal distribution for various standard deviations. © W. R. Leo

Voter Power Index

Caption from FiveThirtyEight's model — Caption from FiveThirtyEight’s model

In the US Presidential election, voters don’t elect directly their preferred candidates, but “electors” who will eventually get to vote for the president. For example, California get 55 electors while Wyoming only get 3. But divided by the number of voters in each of these states, it appears that there are approximately 510 000 voters for each elector in California while only 150 000 voters get to decide an electoral vote in Wyoming. If we assumed that probabilities of voting for each candidate was equal in these states, we can use our formula to get the relative likelihood that one vote is going to change the outcome in the election in these two states:

$\sqrt{\frac{510000}{150000}} \approx 1.8$

So in a way, a vote by a Californian is nearly 2 times less important than a vote cast in Wyoming!

Of course, probabilities are far from being equal for this year’s 2 candidates in California and Wyoming. And as Michael Vartan noted, the value of this probability matters very much!

All parameters taken into account (also including the different configurations of the electoral college in other states), this is what Nate Silver call the Voter Power Index. For this year, the probabilities that one vote will change the outcome of the whole election is highest in New Hampshire and lowest in DC.

Featured image: Number of electoral votes per voter for each state. Made using the awesome tilegram app

Data analysis of the French football league players with R and FactoMineR

This year we’ve had a great summer for sporting events! Now autumn is back, and with it the Ligue 1 championship. Last year, we created this data analysis tutorial using R and the excellent package FactoMineR for a course at ENSAE (in French). The dataset contains the physical and technical abilities of French Ligue 1 and Ligue 2 players. The goal of the tutorial is to determine with our data analysis which position is best for Mathieu Valbuena 🙂

The dataset

A small precision that could prove useful: it is not required to have any advanced knowledge of football to understand this tutorial. Only a few notions about the positions of the players on the field are needed, and they are summed up in the following diagram:

The data come from the video game Fifa 15 (which is already 2 years old, so there may be some differences with the current Ligue 1 and Ligue 2 players!). The game features rates each players’ abilities in various aspects of the game. Originally, the grade are quantitative variables (between 0 and 100) but we transformed them into categorical variables (we will discuss why we chose to do so later on). All abilities are thus coded on 4 positions : 1. Low / 2. Average / 3. High / 4. Very High.

Loading and prepping the data

Let’s start by loading the dataset into a data.frame. The important thing to note is that FactoMineR requires factors. So for once, we’re going to let the (in)famous stringsAsFactors parameter be TRUE!

> frenchLeague <- read.csv2("french_league_2015.csv", stringsAsFactors=TRUE)
> frenchLeague <- as.data.frame(apply(frenchLeague, 2, factor))

The second line transforms the integer columns into factors also. FactoMineR uses the row.names of the dataframes on the graphs, so we’re going to set the players names as row names:

row.names(frenchLeague) <- frenchLeague$name
frenchLeague$name <- NULL

Here’s what our object looks like (we only display the first few lines here):

> head(frenchLeague)
                     foot position league age height overall
Florian Thauvin      left       RM Ligue1   1      3       4
Layvin Kurzawa       left       LB Ligue1   1      3       4
Anthony Martial     right       ST Ligue1   1      3       4
Clinton N'Jie       right       ST Ligue1   1      2       3
Marco Verratti      right       MC Ligue1   1      1       4
Alexandre Lacazette right       ST Ligue1   2      2       4

Data analysis

Our dataset contains categorical variables. The appropriate data analysis method is the Multiple Correspondance Analysis. This method is implemented in FactoMineR in the method MCA. We choose to treat the variables “position”, “league” and “age” as supplementary:

> library(FactoMineR)
> mca <- MCA(frenchLeague, quali.sup=c(2,3,4))

This produces three graphs: the projection on the factorial axes of categories and players, and the graph of the variables. Let’s just have a look at the second one of these graphs:

Projection of the players on the first two factorial axes (click to enlarge)

Before trying to go any further into the analysis, something should alert us. There clearly are two clusters of players here! Yet the data analysis techniques like MCA suppose that the scatter plot is homogeneous. We’ll have to restrict the analysis to one of the two clusters in order to continue.

On the previous graph, supplementary variables are shown in green. The only supplementary variable that appears to correspond to the cluster on the right is the goalkeeper position (“GK”). If we take a closer look to the players on this second cluster, we can easily confirm that they’re actually all goalkeeper. This absolutely makes a lot of sense: in football, the goalkeeper is a very different position, and we should expect these players to be really different from the others. From now on, we will only focus on the positions other than goalkeepers. We also remove from the analysis the abilities that are specific to goalkeepers, which are not important for other players and can only add noise to our analysis:

> frenchLeague_no_gk <- frenchLeague[frenchLeague$position!="GK",-c(31:35)]
> mca_no_gk <- MCA(frenchLeague_no_gk, quali.sup=c(2,3,4))

And now our graph features only one cluster.

Interpretation

Obviously, we have to start by reducing the analysis to a certain number of factorial axes. My favorite method to chose the number of axes is the elbow method. We plot the graph of the eigenvalues:

> barplot(mca_no_gk$eig$eigenvalue)

Around the third or fourth eigenvalue, we observe a drop of the values (which is the percentage of the variance explained par the MCA). This means that the marginal gain of retaining one more axis for our analysis is lower after the 3rd or 4th first ones. We thus choose to reduce our analysis to the first three factorial axes (we could also justify chosing 4 axes). Now let’s move on to the interpretation, starting with the first two axes:

> plot.MCA(mca_no_gk, invisible = c("ind","quali.sup"))

Projection of the abilities on the first two factorial axes

We could start the analysis by reading on the graph the name of the variables and modalities that seem most representative of the first two axes. But first we have to keep in mind that there may be some of the modalities whose coordinates are high that have a low contribution, making them less relevant for the interpretation. And second, there are a lot of variables on this graph, and reading directly from it is not that easy. For these reasons, we chose to use one of FactoMineR’s specific functions, dimdesc (we only show part of the output here):

> dimdesc(mca_no_gk)

$`Dim 1`$category
                      Estimate       p.value
finishing_1        0.700971584 1.479410e-130
volleys_1          0.732349045 8.416993e-125
long_shots_1       0.776647500 4.137268e-111
sliding_tackle_3   0.591937236 1.575750e-106
curve_1            0.740271243  1.731238e-87
[...]
finishing_4       -0.578170467  7.661923e-82
shot_power_4      -0.719591411  2.936483e-86
ball_control_4    -0.874377431 5.088935e-104
dribbling_4       -0.820552850 1.795628e-117

The most representative abilities of the first axis are, on the right side of the axis, a weak level in attacking abilities (finishing, volleys, long shots, etc.) and on the left side a very strong level in those abilities. Our interpretation is thus that axis 1 separates players according to their offensive abilities (better attacking abilities on the left side, weaker on the right side). We procede with the same analysis for axis 2 and conclude that it discriminates players according to their defensive abilities: better defenders will be found on top of the graph whereas weak defenders will be found on the bottom part of the graph.

Supplementary variables can also help confirm our interpretation, particularly the position variable:

> plot.MCA(mca_no_gk, invisible = c("ind","var"))

Projection of the supplementary variables on the first two factorial axis

And indeed we find on the left part of the graph the attacking positions (LW, ST, RW) and on the top part of the graph the defensive positions (CB, LB, RB).

If our interpretation is correct, the projection on the second bissector of the graph will be a good proxy for the overall level of the player. The best players will be found on the top left area while the weaker ones will be found on the bottom right of the graph. There are many ways to check this, for example looking at the projection of the modalities of the variable “overall”. As expected, “overall_4” is found on the top-left corner and “overall_1” on the bottom-right corner. Also, on the graph of the supplementary variables, we observe that “Ligue 1” (first division of the french league) is on the top-left area while “Ligue 2” (second division) lies on the bottom-right area.

With only these two axes interpreted there are plenty of fun things to note:

Left wingers seem to have a better overall level than right wingers (if someone has an explanation for this I’d be glad to hear it!)
Age is irrelevant to explain the level of a player, except for the younger ones who are in general weaker.
Older players tend to have more defensive roles

Let’s not forget to deal with axis 3:

> plot.MCA(mca_no_gk, invisible = c("ind","var"), axes=c(2,3))

Projection of the variables on the 2nd and 3rd factorial axes

Modalities that are most representative of the third axis are technical weaknesses: the players with the lower technical abilities (dribbling, ball control, etc.) are on the end of the axis while the players with the highest grades in these abilities tend to be found at the center of the axis:

Projection of the supplementary variables on the 2nd and 3rd factorial axes

We note with the help of the supplementary variables, that midfielders have the highest technical abilities on average, while strikers (ST) and defenders (CB, LB, RB) seem in general not to be known for their ball control skills.

Now we see why we chose to make the variables categorical instead of quantitative. If we had kept the orginal variables (quantitative) and performed a PCA on the data, the projections would have kept the orders for each variable, unlike what happens here for axis 3. And after all, isn’t it better like this? Ordering players according to their technical skills isn’t necessarily what you look for when analyzing the profiles of the players. Football is a very rich sport, and some positions don’t require Messi’s dribbling skills to be an amazing player!

Mathieu Valbuena

Now we add the data for a new comer in the French League, Mathieu Valbuena (actually Mathieu Valbuena arrived in the French League in August of 2015, but I warned you that the data was a bit old ;)). We’re going to compare Mathieu’s profile (as a supplementary individual) to the other players, using our data analysis.

> columns_valbuena <- c("right","RW","Ligue1",3,1
 ,4,4,3,4,3,4,4,4,4,4,3,4,4,3,3,1,3,2,1,3,4,3,1,1,1)
> frenchLeague_no_gk["Mathieu Valbuena",] <- columns_valbuena

> mca_valbuena <- MCA(frenchLeague_no_gk, quali.sup=c(2,3,4), ind.sup=912)
> plot.MCA(mca_valbuena, invisible = c("var","ind"), col.quali.sup = "red", col.ind.sup="darkblue")
> plot.MCA(mca_valbuena, invisible = c("var","ind"), col.quali.sup = "red", col.ind.sup="darkblue", axes=c(2,3))

Last two lines produce the graphs with Mathieu Valbuena on axes 1 and 2, then 2 and 3:

Axes 1 and 2 with Mathieu Valbuena as a supplementary individual (click to enlarge)

Axes 2 and 3 with Mathieu Valbuena as a supplementary individual (click to enlarge)

So, Mathieu Valbuena seems to have good offensive skills (left part of the graph), but he also has a good overall level (his projection on the second bissector is rather high). He also lies at the center of axis 3, which indicates he has good technical skills. We should thus not be surprised to see that the positions that suit him most (statistically speaking of course!) are midfield positions (CAM, LM, RM). With a few more lines of code, we can also find the French league players that have the most similar profiles:

> mca_valbuena_distance <- MCA(frenchLeague_no_gk[,-c(3,4)], quali.sup=c(2), ind.sup=912, ncp = 79)
> distancesValbuena <- as.data.frame(mca_valbuena_distance$ind$coord)
> distancesValbuena[912, ] <- mca_valbuena_distance$ind.sup$coord

> euclidianDistance <- function(x,y) {
 
 return( dist(rbind(x, y)) )
 
}

> distancesValbuena$distance_valbuena <- apply(distancesValbuena, 1, euclidianDistance, y=mca_valbuena_distance$ind.sup$coord)
> distancesValbuena <- distancesValbuena[order(distancesValbuena$distance_valbuena),]

> names_close_valbuena <- c("Mathieu Valbuena", row.names(distancesValbuena[2:6,]))

And we get: Ladislas Douniama, Frédéric Sammaritano, Florian Thauvin, N’Golo Kanté and Wissam Ben Yedder.

There would be so many other things to say about this data set but I think it’s time to wrap this (already very long) article up 😉 Keep in mind that this analysis should not be taken too seriously! It just aimed at giving a fun tutorial for students to discover R, FactoMineR and data analysis.

[Sampling] Talk at INSPS – Avignon

I’m in the beautiful city of Avignon for the 3rd ISNPS conference, which is held in the extroardinary Palace of the Popes Convention center. I’ve been invited by Ricardo Cao to give a talk wednesday morning during on sampling methods for big graphs.

Sampling methods for graphs from Antoine Rebecq