[Sports] Fifa and data analysis

After a summer full of sports, autumn and Ligue 1 are gradually reclaiming their rights. This is a good opportunity to walk through a data analysis exercise designed for a course at ENSAE. The idea is to analyze categorical data (physical characteristics, tactics, and abilities relating to certain technical aspects of the game) describing the players of the French football championship. The ultimate goal is to determine "statistically" at which position Mathieu Valbuena should play 🙂 We use the R language and the excellent data analysis package FactoMineR.

The data

As stated in the exercise handout, you don't need to know football well to follow this article. Only a notion of where players stand on the pitch depending on their position (corresponding to the "position" column of the dataset) is helpful. Here is a small diagram to help the less informed:

[Figure: typical positions of the players on the pitch]

The data come from the video game Fifa 15 (connoisseurs will have noticed that the data are therefore already two seasons old, so there may be a few differences with current rosters!), which provides many statistics for each player, including an evaluation of their abilities. The Fifa data are quantitative (for example, each ability is rated out of 100), but for this article we made them categorical with 4 levels: 1. Weak / 2. Average / 3. Strong / 4. Very strong. We will see later why this choice pays off!
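That discretization step isn't shown in the post; here is a minimal sketch of what it could look like in R (the cut points are arbitrary here, and champFrance_raw is a hypothetical name for the quantitative table; the actual thresholds may differ):

# Hypothetical binning of a 0-100 Fifa rating into 4 ordered levels
to_level <- function(x) {
  cut(x, breaks = c(0, 25, 50, 75, 100), labels = c(1, 2, 3, 4),
      include.lowest = TRUE, ordered_result = TRUE)
}
to_level(c(12, 47, 70, 95))  # returns levels 1, 2, 3, 4
# e.g. champFrance_raw$finition <- to_level(champFrance_raw$finition)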

Preparing the data

Let's start by loading the data. Note the use of the option stringsAsFactors=TRUE (more explanations about this famous stringsAsFactors parameter here). For once, factors are exactly what we want: FactoMineR relies on them to perform the analysis!

> champFrance <- read.csv2("td3_donnees.csv", stringsAsFactors=TRUE)
> champFrance <- as.data.frame(apply(champFrance, 2, factor))

The second line converts the integer columns created by read.csv2 into factors.

FactoMineR uses the "row.names" attribute of R data.frames to label the plots. We therefore set the "nom" column as row.names to make the plots easier to read:

> row.names(champFrance) <- champFrance$nom
> champFrance$nom <- NULL

Here is what our data.frame now looks like (only the first rows are shown):

> head(champFrance)
                      pied position championnat age taille general
Florian Thauvin     Gauche      MDR      Ligue1   1      3       4
Layvin Kurzawa      Gauche       AG      Ligue1   1      3       4
Anthony Martial      Droit       BU      Ligue1   1      3       4
Clinton N'Jie        Droit       BU      Ligue1   1      2       3
Marco Verratti       Droit       MC      Ligue1   1      1       4
Alexandre Lacazette  Droit       BU      Ligue1   2      2       4

Analyzing the data

We are dealing with a table of categorical variables: the appropriate method is Multiple Correspondence Analysis (MCA), implemented in FactoMineR by the MCA function. For now we exclude the variables "position", "championnat" and "age" from the analysis and treat them as supplementary variables:

> library(FactoMineR)
> acm <- MCA(champFrance, quali.sup=c(2,3,4))

Three plots appear in the output: the projections of the categories and of the individuals on the first two factorial axes, and the variables plot. At this stage, only the second one interests us:

[Figure: Projection of the individuals on the first two factorial axes]

Before even trying to go further, something should jump out at us: there are clearly two point clouds! Yet our data analysis methods assume that the cloud under study is homogeneous, so we will have to restrict the analysis to one of the two clouds visible on this plot.

To identify which individuals the right-hand cloud corresponds to, we can use the supplementary variables (green points). The projection of the goalkeeper position ("G") falls right onto that cloud, and a closer look at the names of the individuals concerned confirms that they are all goalkeepers.

For the rest of the article we will focus on outfield players. We also take this opportunity to remove the columns that only describe goalkeeping abilities, which are irrelevant for outfield players and can only add noise to our analysis:

> champFrance_nogoals <- champFrance[champFrance$position!="G",-c(31:35)]
> acm_nogoals <- MCA(champFrance_nogoals, quali.sup=c(2,3,4))

The graphical output confirms that the point cloud is now homogeneous.

Interpretation

We start by restricting the analysis to a certain number of factorial axes. My favorite method is the "elbow rule": on the plot of the eigenvalues, look for a break (the "elbow") followed by a regular decrease, then keep as many axes as there are eigenvalues before the break:

> barplot(acm_nogoals$eig$eigenvalue)

 

[Figure: Scree plot of the eigenvalues]

Here we can choose, for example, 3 axes (retaining 4 axes could also be justified). Let's now move on to the interpretation, starting with the projections on the first two axes retained for the study.

> plot.MCA(acm_nogoals, invisible = c("ind","quali.sup"))
[Figure: Projection of the categories on factorial axes 1 and 2]

We could, for instance, read off this plot the names of the categories with the largest coordinates on axes 1 and 2 and start the interpretation from there. But with this many categories, reading directly off the plot is not easy. We can instead use a FactoMineR-specific text output, dimdesc (only part of the output is shown here):

> dimdesc(acm_nogoals)
$`Dim 1`$category
                         Estimate       p.value
finition_1            0.700971584 1.479410e-130
volees_1              0.732349045 8.416993e-125
tirs_lointains_1      0.776647500 4.137268e-111
tacle_glisse_3        0.591937236 1.575750e-106
effets_1              0.740271243  1.731238e-87
[...]
finition_4           -0.578170467  7.661923e-82
puissance_tir_4      -0.719591411  2.936483e-86
controle_balle_4     -0.874377431 5.088935e-104
dribbles_4           -0.820552850 1.795628e-117

The categories most characteristic of axis 1 are, on the right, a weak level in offensive abilities (finishing, volleys, long shots) and, on the left, a very strong level in those same abilities. The natural interpretation is that axis 1 discriminates according to offensive abilities (best attackers on the left, weakest on the right). Proceeding the same way for axis 2, we observe the same phenomenon with defensive abilities: the best defenders at the top, the weakest at the bottom.

The supplementary variables can also help, and they confirm our interpretation, especially the position variable:

> plot.MCA(acm_nogoals, invisible = c("ind","var"))
[Figure: Projection of the supplementary variables on factorial axes 1 and 2]

We indeed find the offensive positions (BU, AIG, AID) on the left of the plot and the defensive positions (DC, AD, AG) at the top.

One consequence of this interpretation is that we should find the best players organized along the second diagonal (the line y = -x), with the best players in the top-left quadrant and the weakest in the bottom-right quadrant. There are many ways to check this, but we will simply look, on the categories plot, at the location of the levels of the "general" variable, which summarizes a player's overall level. As expected, "general_4" lies in the top-left quadrant and "general_1" in the bottom-right one. The placement of the supplementary categories "Ligue 1" and "Ligue 2" is just as convincing 🙂

At this point there are already plenty of interesting things to note! Among those that amuse me most:

  • Left wingers seem to have a better level than right wingers (if a football specialist could explain why, that would be great!)
  • Age does not explain a player's level, except for the youngest players, whose level is lower
  • The oldest players hold more defensive roles.

Let's not forget about axis 3:

> plot.MCA(acm_nogoals, invisible = c("ind","var"), axes=c(2,3))
[Figure: Categories projected on factorial axes 2 and 3]

The categories most characteristic of this third axis are the technical weaknesses: the least technical players sit at the extremities of the axis, and the most technical players at its center. The supplementary variables plot confirms this: strikers and center backs are indeed less renowned for their technical abilities, while all the midfield positions lie at the center of the axis:

[Figure: Supplementary variables on factorial axes 2 and 3]

This is the benefit of having made these variables categorical. Had we kept the quantitative nature of the original Fifa data and run a PCA, the projections of each characteristic on each axis would have been ordered by level, unlike what happens on axis 3. And after all, discriminating players by technical level does not necessarily reflect all the richness of football: some positions call for technicians, while others favor physical qualities!

Mathieu Valbuena

We will now add the data of a newcomer to the French championship, Mathieu Valbuena (yes, I warned you, the data are starting to age a bit :p), and compare him to the other players using our analysis.

> columns_valbuena <- c("Droit","AID","Ligue1",3,1,
+   4,4,3,4,3,4,4,4,4,4,3,4,4,3,3,1,3,2,1,3,4,3,1,1,1)
> champFrance_nogoals["Mathieu Valbuena",] <- columns_valbuena

> acm_valbuena <- MCA(champFrance_nogoals, quali.sup=c(2,3,4), ind.sup=912)
> plot.MCA(acm_valbuena, invisible = c("var","ind"), col.quali.sup = "red", col.ind.sup="darkblue")
> plot.MCA(acm_valbuena, invisible = c("var","ind"), col.quali.sup = "red", col.ind.sup="darkblue", axes=c(2,3))

The last two lines plot Mathieu Valbuena on axes 1 and 2, then on axes 2 and 3:

[Figure: Factorial axes 1 and 2 with Mathieu Valbuena as a supplementary point]
[Figure: Factorial axes 2 and 3 with Mathieu Valbuena as a supplementary point]

The result of our analysis: Mathieu Valbuena has a rather offensive profile (left side of axis 1) and a good overall level (his projection on the second diagonal is quite high). He also has good technical abilities (center of axis 3). Finally, his qualities seem well suited to the attacking midfielder (MOC) or left midfielder (MG) positions. With a few lines of code, we can find the players in the championship whose profile is closest to Valbuena's:

> acm_valbuena_distance <- MCA(champFrance_nogoals[,-c(3,4)], quali.sup=c(2), ind.sup=912, ncp = 79)
> distancesValbuena <- as.data.frame(acm_valbuena_distance$ind$coord)
> distancesValbuena[912, ] <- acm_valbuena_distance$ind.sup$coord

> euclidianDistance <- function(x, y) {
    return(dist(rbind(x, y)))
  }

> distancesValbuena$distance_valbuena <- apply(distancesValbuena, 1, euclidianDistance, y=acm_valbuena_distance$ind.sup$coord)
> distancesValbuena <- distancesValbuena[order(distancesValbuena$distance_valbuena),]

# Look at the profiles of the 5 closest individuals
> nomsProchesValbuena <- c("Mathieu Valbuena", row.names(distancesValbuena[2:6,]))

And we get: Ladislas Douniama, Frédéric Sammaritano, Florian Thauvin, N'Golo Kanté and Wissam Ben Yedder.

There would be plenty of other things to say about this dataset, but I'd rather stop this already long article here 😉 To conclude, keep in mind that this analysis is not really meant to be serious; it mostly serves as a fun example for discovering FactoMineR and exploratory data analysis.

 

[Sports] You can miss an arrow at the Olympics

This Summer Olympics period is an opportunity to watch, on mainstream channels and at decent hours (time difference permitting!), sports little known to the general public. We have already talked about biathlon here (a Winter Olympics sport, which will return in 2018), but this post is about another one: archery. The goal of archery is to land your arrows on a target, often very far away, in concentric circles that are worth more and more points as you get closer to the center, from 1 to 10 (or even 0 if you miss the target entirely, which is quite rare at the Olympics!).

The rules seem simple, but a small subtlety appeared this year. Until now, the archers shot four volleys of three arrows each, alternately, and the points were summed: whoever had the better score qualified for the next round. In case of a tie, one arrow was shot by each archer, and the one closest to the center won the match.

The new rules emphasize the notion of "set": now, each volley of three arrows is scored independently. The archer with the better score at the end of a set earns 2 points; if the set is tied, both archers earn 1 point; and the match is played to 6 points. Up to five sets are played, and if nobody has reached 6 points by the end of those five sets, each archer shoots one arrow and the one closest to the center wins the match.

According to the sports journalists of France Télévisions, these new rules let a shooter recover from a bad shot (that is, a shot below 8, at this level of competition) more easily than when all points are summed, where one missed arrow penalizes the whole match. Using an example and a few simulations, we are going to check whether this claim is true.

Consider two archers, Arthur and Bastien. Both have an equivalent level, but different profiles: Arthur never shoots below 8, but often hits the 8. Bastien can miss a shot and hit a 5 or a 7, but more often manages to hit the yellow part of the target (9 or 10). More precisely, their chances on each shot are as follows:

Arrow    Arthur   Bastien
1 to 4   0%       0%
5        0%       2%
6        0%       0%
7        0%       1%
8        50%      40%
9        40%      47%
10       10%      10%

A quick calculation shows that each arrow scores an average of 8.6 points for both archers, so their levels are indeed comparable. We will now simulate tens of thousands of matches under each of the two rule sets, to determine who wins and whether Bastien is indeed favored by the new rules.
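The original simulation code isn't included in the post; here is a minimal R sketch of what it could look like (the tie-break arrow is simplified to a 50/50 coin flip, which ignores the archers' slightly different accuracy profiles):

points    <- c(5, 7, 8, 9, 10)
p_arthur  <- c(0.00, 0.00, 0.50, 0.40, 0.10)
p_bastien <- c(0.02, 0.01, 0.40, 0.47, 0.10)

# One volley = three arrows
shoot_volley <- function(p) sum(sample(points, 3, replace = TRUE, prob = p))

# Old rules: sum four volleys, highest total wins (tie-break simplified to 50/50)
play_total <- function() {
  a <- sum(replicate(4, shoot_volley(p_arthur)))
  b <- sum(replicate(4, shoot_volley(p_bastien)))
  if (a == b) sample(c("Arthur", "Bastien"), 1) else if (a > b) "Arthur" else "Bastien"
}

# New rules: each volley is a set worth 2 points (1 each if tied), match played to 6
play_sets <- function() {
  sa <- sb <- 0
  for (set in 1:5) {
    a <- shoot_volley(p_arthur); b <- shoot_volley(p_bastien)
    if (a > b) sa <- sa + 2 else if (b > a) sb <- sb + 2 else { sa <- sa + 1; sb <- sb + 1 }
    if (sa >= 6 || sb >= 6) break
  }
  if (sa == sb) sample(c("Arthur", "Bastien"), 1) else if (sa > sb) "Arthur" else "Bastien"
}

set.seed(42)
mean(replicate(50000, play_total()) == "Bastien")  # share of matches won by Bastien
mean(replicate(50000, play_sets())  == "Bastien")

The results obtained are the following: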

Rules         Arthur wins…        Bastien wins…
Total score   48.2% of matches    51.8% of matches
Set play      44.2% of matches    55.9% of matches

This confirms it: the new rules favor Bastien, who misses a shot from time to time, and thus make it easier to come back into the match after a missed arrow. They also make for greater suspense, since nothing is ever decided in advance!

[Sports] What the splines model for UEFA Euro 2016 got right and wrong

UEFA Euro 2016 is over! After France's heartbreaking loss to Portugal in the final, it's now time to assess the performance of our "splines model". On the main page of the project you can now find the initial predictions we made before the start of the competition. I also added a link to the archives of the odds we updated after each day (EDIT: I realized I made a mistake with a match that was played on Day 2; I'll correct this asap, though the results should not be altered much.)

[Figure: screenshot of the Euro 2016 predictions]

What went well (Portugal, Hungary, Sweden)

Let's begin with our new European champions: Portugal. They were our 5th favorite, with an estimated 8.3% chance of winning the title. To everyone's surprise (including ours, to be honest 😉), they finished 3rd in group F. However, the odds of this happening were estimated at 20%, so we can hardly say the splines model was completely stunned by this outcome! In fact, except for the initial draw against Iceland, we had all calls correct for Portugal's games!

Hungary were described by some as the weakest team of the tournament, and so by extension the weakest team of group F. But they won it! Our model didn't agree with those pundits, estimating the chances of Gábor Király's teammates advancing to the second round at almost 3 out of 4.

Sweden certainly had one of the best players in the world in Zlatan Ibrahimovic. But our model was never a fan of their squad, and they did end up in last place in group E. Similarly, Ukraine was often referred to as a potential second-rounder but ended up in last place (losing all their games), which was the most likely outcome according to the splines model.

What went wrong (Iceland, Austria, England)

Austria were seen by the splines model as outsiders for this competition (a 4.7% chance of becoming champs; for instance, Italy's chances were estimated at 4.2%). We evaluated their chances of advancing to the second round at greater than 70%. They ended up in last place in Group F with a single point.

On the contrary, Iceland were seen as one of the weakest teams of the competition and a clear favorite for last place in Group F. Eventually, they were astonishingly successful! On their way to the quarter-finals, they eliminated England. Our model gave England a good 85% probability of winning that match. But, surprising as it was, this upset alone does not prove our model was unreliable (more on upsets in the next paragraph). Yet we can't consider the projections for the Three Lions anything other than a failure, because they also ended up second in group B when we thought they would easily win the group.

Spain lost in the round of 16 to Italy and in the group phase to Croatia. The estimated probabilities of these events were 40% and 16% respectively.

Hard to say

We almost included Turkey in the previous paragraph: after all, we gave them the same chances as Italy of winning the tournament, and we estimated their odds of advancing to the round of 16 at more than 70%, yet they failed. In addition, their level was described by experts as rather poor. But paradoxically, the splines model had all calls correct for Turkey's games! What doomed them was the 3-0 loss against the defending champions, Spain. With a final goal difference of -2 and 3 points, they couldn't reach the second round as one of the four best third-placed teams.

Wales unexpectedly beat Belgium, one of our favorites, in the quarter-finals. But is this a sign of a bad model, or of bad luck? Upsets happen, and they're not necessarily a sign that a team's strength was incorrectly estimated.

Home field advantage

Our model stood out from others (examples here, here or here) in its predictions for France. As a matter of fact, it valued home field advantage much less than the other models did. But France didn't win the Euro! Similarly, nearly all models predicted a Brazil victory in the 2014 World Cup, mostly because of home field advantage… and we all know what happened!

To us, it is unclear whether home field advantage at the Euro or the World Cup is comparable to home field advantage in a friendly match or a qualifier. I hope someone studies this particular point in the future!

Conclusion

We had a lot of fun building this model and it helped us enjoy the competition! I hope you guys enjoyed it too!

[Sports] France's opponent in the round of 16

After the French team secured first place in its group, Baptiste Desprez of Sport24 wondered today who the most likely opponent for Les Bleus in the round of 16 would be.

Good thing we happen to have a model capable of computing probabilities for Euro matches. I'll let you read the Sport24 article if you want to understand all the subtleties concocted by UEFA for this first 24-team Euro. Here, we'll just run the model to get the probabilities. We obtain (rounded):

Northern Ireland: 72%; Republic of Ireland: 14%; Germany: 8%; Belgium: 4%; Poland: 2%

[Figure: probabilities of each possible round-of-16 opponent]

So, it is extremely likely that the next opponent of the French team will be named "Ireland" 🙂 . Curiously, the probability of facing Germany is much higher than that of facing Poland, even though the model gives Germany a high probability of finishing first in its group, ahead of Poland… The Euro bracket is a complex thing! We'll be crossing our fingers not to cross paths with Müller and co. that early in the bracket!

It is also amusing to note that, although possible, a round of 16 against a team from group D (Czech Republic, Turkey or Croatia) is highly improbable (<0.2% chance according to the simulations). It seems that the configurations allowing these teams to qualify as best thirds are incompatible with the configurations sending them against France in the round of 16. If a bookmaker offered you this bet, I would strongly advise you to avoid it 😉

[Sports] UEFA Euro 2016 Predictions – Model

Today we’re launching our own predictions for UEFA Euro 2016 that starts next week.

A model for football: state of the art

There are many ways to build a model to predict football results. The Elo rating system is commonly used. As its name indicates, it relies on a ranking of the international football teams, either the official FIFA ratings (which are widely known to be poorly predictive of a team's strength) or a custom-made Elo ranking. A few Elo rankings are available on the Internet, so one possibility was to use one of these to compute probabilities for each game (via a very simple analytical formula). But we wanted to do something different.

When FiveThirtyEight created a nice viz of their own showing odds for the men's and women's World Cups, their model was based on ESPN's Soccer Power Index (SPI). The principle of the SPI is quite simple: compute expected goals scored and conceded for each team under the assumption that it plays against an "average" football squad, then run a logistic regression to predict the outcome between any two teams, based on their expected performance against that "average" team. The SPI takes an impressive number of relevant parameters into account (including player performance), and has generally proven reliable (although FiveThirtyEight's predictions always seemed a tad overconfident to me!).

Our very own model

For our model, we liked the principle of the SPI very much, but we wanted to try our own little variation. So we kept the core feature of the SPI, computing the expected goals scored and conceded for each team, but we chose to plug these results directly into our simulations (i.e. without the logistic regression). Of course, due to lack of time and resources, our model is way less sophisticated than ESPN's (there was no way we could include player performance, for example), but the results might still be worth analyzing!

So, for each of the 24 teams competing, we're trying to predict a quantitative variable: the number of goals scored (and conceded) in each game. Of course, we're going to use all our knowledge of machine learning to achieve this 😉 Our training data consists of the 1795 international games played between 2008 and 2016 by the 24 teams that qualified for UEFA Euro 2016 (excluding the Olympics, which are too peculiar in football to be relevant).

For each of these 1795 observations, we know the location of the game, the teams that played, the final score, and the type of the game (friendly, World Cup qualifier, etc.). We matched each team to its FIFA ranking at the closest available date, and determined which team (if any) had home field advantage.

Then we ran the simplest of regression models: a linear regression on year, team (as a categorical variable), a dummy variable indicating whether the team plays at home or away, type of match, and the FIFA rankings of both teams. Before even thinking of using this model for simulations, we have to look at how it performs. And a lot of things indicate that it is too unsophisticated. The most telling example might be the prediction of large numbers of goals. Let's plot the number of goals scored against the FIFA ranking of the opposing team.

[Figure: number of goals scored with respect to the strength of the opponent; black points: observed, red points: modeled (linear regression)]

You can see on the right side of the plot that it's not that rare for a large number of goals to be scored, especially against a very weak team. However, the linear model is unable to predict more than 4 goals in a game. This can be a huge problem for simulations, as ties at the Euro are broken by the number of goals scored.
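For concreteness, the baseline fit could look something like the sketch below. The column names and the synthetic data are assumptions of mine; the original data preparation isn't shown in the post.

# Synthetic stand-in for the real training data (which isn't public)
set.seed(1)
n <- 500
games <- data.frame(
  goals = rpois(n, 1.4),
  year = sample(2008:2016, n, replace = TRUE),
  team = factor(sample(LETTERS[1:24], n, replace = TRUE)),
  home = rbinom(n, 1, 0.3),
  match_type = factor(sample(c("friendly", "qualifier"), n, replace = TRUE)),
  fifa_rank_team = sample(1:80, n, replace = TRUE),
  fifa_rank_opponent = sample(1:150, n, replace = TRUE)
)

# Baseline: plain linear regression on the available covariates
lm_fit <- lm(goals ~ year + team + home + match_type +
               fifa_rank_team + fifa_rank_opponent,
             data = games)
summary(lm_fit)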

The idea is thus to combine several linear models to get a more sensible prediction. This can be done using regression splines, whose parameters are chosen by cross-validation.
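A sketch of this step, reusing the synthetic games data frame from the sketch above (again, the variable names are assumptions, and the real model may place splines on other covariates too):

library(splines)
library(boot)

# Try several degrees of freedom for a natural cubic spline on the
# opponent's FIFA ranking, picking the best one by 10-fold cross-validation
cv_error <- sapply(2:8, function(df) {
  fit <- glm(goals ~ year + team + home + match_type + fifa_rank_team +
               ns(fifa_rank_opponent, df = df),
             data = games)
  cv.glm(games, fit, K = 10)$delta[1]  # CV estimate of prediction error
})
best_df <- (2:8)[which.min(cv_error)]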

[Figure: number of goals scored with respect to the strength of the opponent; black points: observed, green points: modeled (regression splines)]

In a number of ways, this model is much more satisfying than the first one. Regarding large numbers of goals scored, the above plot shows that our model is now able to predict them 🙂

Simulations and results

Our model gives us expected values for the number of goals scored and conceded, as well as a model variance. We then simulate the number of goals with normal error around the expected value. We do this 10,000 times for each match and finally get Monte Carlo probabilities for the outcome of each group phase match, as well as odds for each team to finish at each place in its group and to qualify for each round of the knockout phase.
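A minimal sketch of this simulation step for a single match. The post doesn't specify how continuous draws become goal counts, so the rounding and truncation at zero below are my own assumptions, as are the placeholder values:

# Placeholder expected goals and model standard deviation for one match
mu_a <- 1.8; mu_b <- 0.9; sigma <- 1.1

n_sim <- 10000
# Simulate goal counts with normal error around the expected values
goals_a <- pmax(0, round(rnorm(n_sim, mean = mu_a, sd = sigma)))
goals_b <- pmax(0, round(rnorm(n_sim, mean = mu_b, sd = sigma)))

# Monte Carlo probabilities of each outcome
c(win_a = mean(goals_a > goals_b),
  draw  = mean(goals_a == goals_b),
  win_b = mean(goals_a < goals_b))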

The results can be found here, and I will post another article later to comment on them (which really is the fun part, after all!).

[Sports] Don’t miss a shot in biathlon races

Today, I want to talk about my favorite sport to watch on TV: biathlon (the one that involves skiing and shooting at things; who doesn't love that?). I really enjoy following the races, not only because the best athlete at the moment is a French one (go Martin!), but because the shooting part seems so crucial and stressful. This led me to wonder how much missing a shot matters for the final ranking of a race. Let's find out!


Gathering some data

My idea is to do some basic analysis of the results of many biathlon races, in order to evaluate whether there is some form of correlation between the final ranking and the number of shots missed. This first requires gathering the data. The results are stored on multiple sites: obviously the Wikipedia pages of the championships, but this website is much more detailed. I'm going to use some of the results between 2007 and 2015. I won't use all of the races, because I want comparable results: I will only consider races where the number of competitors is between 50 and 60. This lets me interpret the final rankings on a similar scale across all races. Moreover, a specificity of biathlon is that the rules differ a lot from one format to another (see this Wikipedia article for more information), and I can't easily discriminate my data between them. Limiting the number of participants is a way to limit the breadth of the spectrum of formats considered. Well, let's forget these technicalities and analyse the data!

Don’t miss a shot!

So, the idea is to compare the number of shots missed with the final ranking. Fun fact: the number of shots during a race is 20, but the maximum number of shots missed in the races I analysed is only 9. That's not really a surprise if you watch biathlon frequently, because missing that many shots usually means finishing in last place. I'm going to use a heat map to show the correlation. A heat map is a form of 2D data visualization based on a spectrum of colors: the darker the color, the higher the value. The idea here is to put the final ranking in rows and the number of shots missed in columns (a sketch of the few lines of R needed to build such a map is given after the observations below). Here is what we obtain:

[Figure: heat map of final ranking (rows) vs. total shots missed (columns)]

These results directly show that:

  • There is a clear diagonal on the heat map. This isn't really surprising: it means that every time an athlete misses a shot, his final ranking drops. This is our first result: missed shots are penalties. What a surprise!
  • There is also a very dark blue area in the first column, at the top of the diagram. This means that most of the time, a clear round leads to a very good final ranking.
  • But it is clearly possible to win a race while missing some shots! The first row is filled with dark blue in the first few columns.
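As promised, here is the sketch of how such a heat map can be drawn in a few lines of base R. The data frame and its columns are synthetic stand-ins (my real data isn't attached to the post):

# Synthetic stand-in: one row per athlete and race
set.seed(1)
results <- data.frame(
  ranking = rep(1:55, times = 40),          # final ranking in each race
  missed  = pmin(9, rpois(55 * 40, 2))      # number of shots missed
)

# Cross-tabulate final ranking (rows) vs. shots missed (columns)
counts <- table(results$ranking, results$missed)

# Darker cells = more athletes in that (ranking, missed) combination
image(x = as.numeric(colnames(counts)), y = as.numeric(rownames(counts)),
      z = t(unclass(counts)), xlab = "Shots missed", ylab = "Final ranking",
      col = colorRampPalette(c("white", "darkblue"))(32))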

Don’t miss a prone shot!

As you may know, there are two different types of shooting during a biathlon race: the prone shot, where the athletes lie on the ground, a position that helps them stabilize their aim; and the standing shot, which is much more difficult. It might therefore be interesting to deal with the two shooting phases separately. Let's start with the prone shot, as it is usually the first shooting phase of a race.

[Figure: heat map of final ranking vs. prone shots missed]

We see the same pattern as for the total of shots. The top of the first column is much darker than before, for two reasons. First, it is common for many athletes to miss no shot at all during the prone phase, which means that missing a prone shot is much more a sign of bad shooting, and hence of a bad final ranking. This point is very important: we're not evaluating results in a vacuum; missing a shot usually means that the athlete is in worse shape than the others, and therefore ranks badly. But it also means that missing a shot during the first phase raises the odds of missing shots during the other phases.

Let’s have a look at the heat map for the standing shots.

[Figure: heat map of final ranking vs. standing shots missed]

As expected (since the first heat map is the combination of these two), we get a much more dispersed heat map. Missing a standing shot is something that happens to pretty much everyone, even the best athletes.

Is the starting order relevant?

I add one last factor to the analysis: the starting order, which is linked to the athlete's expected result (based on a global ranking, or on the results of another race). The heat map showing the correspondence between the final ranking (still in rows) and the starting order (in columns) shows a clear diagonal line: the expectation seems relevant.

[Figure: heat map of final ranking vs. starting order]

In order to do a more in-depth analysis, I'm going to fit a linear regression on these variables. I want to know whether the final ranking is explained by the starting order, the number of prone shots missed and the number of standing shots missed. This linear regression will also help me evaluate how big an impact these three variables have on the final outcome. Let's have a look at the results:

Call:
lm(formula = ranking ~ prone_shots + standing_shots + starting_order)

Residuals:
    Min      1Q  Median      3Q     Max 
-60.625  -8.483  -0.101   9.268  45.295 

Coefficients:
                  Estimate  Std. Error  t value   Pr(>|t|)    
(Intercept)       5.816571   0.212517   27.37
[...]

The three variables of the model are statistically significant, which means they do have a relation with the final ranking. Understanding the coefficient for the starting order is kind of tough, but the two other coefficients are much easier to analyse:

  • When you miss a prone shot, you lose about 5 places at the end of the race
  • When you miss a standing shot, you lose about 1 place at the end of the race

Obviously, these results are only valid on average. But this is kind of a fun way to comment on biathlon shooting! "Oh, you just lost 10 places!"

[Sports] Best/Worst NBA matchups ever

Earlier this month, the Philadelphia 76ers grabbed their first (and, for now, only) win of the season by beating the Lakers. That night, checking the menu of the NBA League Pass, I quickly elected not to watch this pretty unappealing matchup. Though I couldn't help thinking that this is a shame for franchises that have both won NBA titles and seen legends of the game wear their jerseys. For many seasons, a Lakers-Sixers game meant fun, excitement and, most important of all, excellent basketball. And I started wondering what the best and worst matchups throughout NBA history were.

Data and model

My criterion for evaluating these matchups will be the mean level of the two teams during each season. By a "good" matchup, I mean a game featuring two excellent ball clubs, the kind of evening every NBA fan awaits impatiently as soon as the season calendar is made available. On the contrary, a "bad" matchup is a game whose only stake is to determine draft pick order. This criterion does not predict the actual interest of watching these games: a confrontation between two top teams might very well be pretty boring if the star players are having a bad night (or if the coach decides to bench them), and a contest between two mediocre teams might turn out to be a three-OT thriller with players showing excellent basketball skills. In fact, my criterion only holds from a historical perspective (or in the very unlikely case that you have to choose between the replays of several games without knowing when they were played 🙂).

The level of each team is estimated using the excellent FiveThirtyEight NBA Elo rankings. Then, to rank the 435 possible matchups between the 30 NBA franchises, I average the mean level of each pair of teams over all years between 1977 and 2015 (I chose to limit the analysis to after the NBA-ABA merger of 1976, so as to avoid dealing with defunct franchises). You can find the R code I used on my GitHub page.
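The actual code is on the GitHub page; as an illustration, the core of the computation could look like this in base R (a sketch; elo_df and its columns are assumptions, not FiveThirtyEight's actual file layout):

# Synthetic stand-in for FiveThirtyEight's historical Elo data
set.seed(1)
elo_df <- data.frame(
  team = rep(paste0("Team", 1:30), each = 39),
  year = rep(1977:2015, times = 30),
  elo  = round(rnorm(30 * 39, mean = 1500, sd = 60))
)

# Mean Elo of each team for each season
season <- aggregate(elo ~ team + year, data = elo_df, FUN = mean)

# All 435 pairs of the 30 franchises
pairs <- t(combn(sort(unique(season$team)), 2))

# Matchup score = average of the two teams' season means over all years
score <- apply(pairs, 1, function(p) mean(season$elo[season$team %in% p]))

ranking <- data.frame(team1 = pairs[, 1], team2 = pairs[, 2], score)
head(ranking[order(-ranking$score), ], 10)  # best matchups first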

The best matchups

Of course, our method values regularity, so it's no surprise to find at the top matchups between teams that have been able to maintain a high level of competitiveness throughout the years. In fact, the best matchup ever is "Lakers – Spurs", two teams that have missed the playoffs only 5 and 4 times respectively since the 1976-1977 season! "Celtics – Lakers" comes in 6th: basketball fans won't be surprised to find this legendary rivalry ranked this high. It might even have been higher had I taken seasons prior to the merger into account. The top 10 matchups are:

1. “Lakers – Spurs”
2. “Lakers – Suns”
3. “Lakers – Trailblazers”
4. “Lakers – Thunder”
5. “Suns – Spurs”
6. “Celtics – Lakers”
7. “Spurs – Trailblazers”
8. “Spurs – Thunder”
9. “Rockets – Lakers”
10. “Spurs – Heat”

The worst matchups

The worst matchup ever is "Timberwolves vs. Hornets". Thinking about the last few years, I have to admit that these games were clearly not among my favorites. Poor Michael Jordan's Hornets monopolize the last 7 places of the ranking, thanks to the inglorious Bobcats run.

Among the most infamous matchups ever are:

425. “Clippers – Timberwolves”
426. “Clippers – Nets”
427. “Raptors – Hornets”
428. “Wizards – Timberwolves”

434. “Kings – Hornets”
435. “Timberwolves – Hornets”

I really hope the owners of these franchises are able to turn the tide and move their teams back up the rankings soon!

[Sports] Why favourite tennis players win even more than they should

Today, I'd like to write a short note to begin answering one simple question: which sport is the most conservative? By conservative, I mean a sport where the favourite team or athlete wins much more often than the other. As this is quite a large problem, which has already been addressed for the NFL, I'll start with some results about two racquet sports, tennis and badminton. I won't use real data but will rather rely on simulations, in order to evaluate the chances of the #1 tennis player winning a match against less talented opponents.

Game, Set, Match

I won't explain the scoring rules of tennis, but I'd like to remind you of some basic principles. To win a tennis match, you need to win 2 or 3 sets. Each set requires you to win at least 6 games, and also two more games than your opponent. And to win a game, you need to score 4 points AND 2 more points than your opponent. These rules mean that a tennis match can last almost forever, but as they are very restrictive, they also favor the most efficient player. Let's use a simple example to understand this.

Let's say Adam and Becky play a game where they have to score 2 points in 3 rounds. Becky plays way better and has a 90% chance of winning each point. The probability of Becky winning is 0.9*0.9 (she wins the first two rounds) + 0.9*0.1*0.9 (she wins round 1, loses round 2, but wins round 3) + 0.1*0.9*0.9 (she loses round 1, but wins rounds 2 and 3), which sums to 97.2%; therefore Adam has a 2.8% chance of winning this game. That is pretty low.

What happens if we add the 'tennis' rule of "2 more points than your opponent"? The game becomes "the first one to score 2 points in a row wins". The maths needed to get Becky's chances of winning are a little more complex in this example: one way to deal with the problem is to think about the first two rounds. After these two rounds, there is an 81% chance that Becky has won, a 1% chance that Adam has, and an 18% chance that no one has won yet. Repeating this process, we get a geometric series which sums to approximately 98.8%. Adam's chances of winning have dropped even more: only 1.2% now!
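Writing the reasoning out as a geometric series makes the number explicit: each pair of rounds ends with a Becky win with probability 0.81, an Adam win with probability 0.01, and no decision with probability 0.18, so

\begin{align*}
P(\text{Becky wins}) = 0.81 \sum_{k=0}^{\infty} 0.18^k = \frac{0.81}{1 - 0.18} \approx 98.8\%
\end{align*}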

Let’s go back to Tennis

Now Becky and Adam are tired of playing a game Adam always loses, and they decide to go outside to play some tennis. Suppose Adam is better at this sport than Becky: a conservative hypothesis is that he'll win each point 51% of the time, while she'll win only 49% of the time. The difference is fairly minor, so it shouldn't have a big impact, right?

I wrote a short program in R which simulates thousands of matches in order to determine who wins between the two players. The program is quite simple; here is the part that simulates a single game:

# Simulate one point: returns 1 if the favourite wins it, 2 otherwise
win_a_point <- function(advantage) {
  x <- runif(1, 0, 1)
  if (x > advantage) 2 else 1
}

# "gagne_jeu" = "wins the game": first to 4 points with a 2-point lead
gagne_jeu <- function(advantage) {
  score <- c(0, 0)
  while (max(score) < 4 | abs(score[1] - score[2]) < 2) {
    x <- win_a_point(advantage)
    score[x] <- score[x] + 1
  }
  if (score[1] > score[2]) 1 else 2
}
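The post only shows the game level; here is a sketch (not from the original code) of how the set and match layers could look, following the same simplified rules as described above (no tiebreak):

# Win a set: at least 6 games and a 2-game lead
gagne_set <- function(advantage) {
  games <- c(0, 0)
  while (max(games) < 6 | abs(games[1] - games[2]) < 2) {
    g <- gagne_jeu(advantage)
    games[g] <- games[g] + 1
  }
  if (games[1] > games[2]) 1 else 2
}

# Win the match: first player to take 3 sets
gagne_match <- function(advantage) {
  sets <- c(0, 0)
  while (max(sets) < 3) {
    s <- gagne_set(advantage)
    sets[s] <- sets[s] + 1
  }
  if (sets[1] > sets[2]) 1 else 2
}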

The results obtained over 100,000 simulated matches (played to 3 winning sets) are compiled in this table:

Level              Point   Game    Set     Match
Favourite wins     51%     52.5%   58.4%   63%

Another interesting result is the following graph:

[Figure: probability that the favourite wins a tennis match, as a function of his chance of winning each point]

Even if the edge for the better player is as low as 5 percentage points (55-45 on each point), the match is clearly one-sided. This means that as long as one player holds a dominant position, and thus has slightly better chances of winning each point than his opponent, he is almost assured of winning in the end. That makes tennis a very conservative sport!

What about Badminton?

To compare this result to a benchmark, I ran the same simulations for badminton. The rules are quite different, but not that much: you need 21 points to win a game, with at least a two-point lead, and winning a match requires winning two games before your opponent does. With a small adjustment, I used the same program as before; the results are compiled in this graph:

[Figure: probability that the favourite wins a badminton match, as a function of his chance of winning each point]

To conclude: compared to badminton, tennis is definitely a conservative game!

[Sports] Why 6% should be the golden number in professional cycling

Disclaimer: I don't pretend to be a pro rider or a sports doctor/physiologist, just a data scientist who sometimes likes to apply mathematical modeling to understand some real-world situations 🙂

Are all slopes only for lightweights?

I'm a bike enthusiast and I used to race (although I was far from being a champion!). During training sessions with teammates whose physical abilities were approximately equivalent to mine, I noticed that our ways of performing were very different: every time we climbed a long steep hill, there was no way I could follow their pace no matter how hard I tried; but on the contrary, when I rode at the front on a faux plat (especially with a lot of wind), I would sometimes find myself a few meters ahead of the pack without really noticing. In fact, I was physically quite different from the rest of the pack: my BMI is roughly 23.5, whereas most of my training mates were in the low 19s.

Watching the pros compete, I found a somewhat similar pattern: it's very unlikely that we ever see a powerful sprinter such as André Greipel win the Flèche Wallonne at the famous and feared Wall of Huy; yet the podium of the 2011 World Championships in Copenhagen was 100% "sprinters", even though the finish line sat at the end of a slight slope (with a reasonable gradient of ~3%).

The intuition : there exists a “limit gradient”

Based on these two examples, we can see that although very steep hills are obviously lightweight riders' turf, more powerful (and thus heavier) riders can still outperform them on slopes with low gradients. So my intuition was that there exists a "limit gradient" beyond which the benefits of being powerful (which goes together with being heavier, for riders with the same physical abilities) are overrun by the benefit of being lightweight, which gives an edge when you have to fight gravity. Can we quantify such a limit?

The power formula

The power needed to move at speed v on a bike is the sum of the power needed to push through the air, the power needed to overcome friction, and the power needed to fight gravity:

\begin{align*}
P &= P_{air} + P_{friction} + P_{gravity} \\
&= v \cdot \left( \dfrac{1}{2} \cdot \rho \cdot SC_X \cdot (v-v_{wind})^2 + Mg C_r \cos \phi + C_f (v-v_{wind})^2 + Mg \sin \phi \right) \\
\end{align*}

where:

\begin{align*}
\rho &= \text{air density} \\
SC_X &= \text{drag coefficient} \\
C_r &= \text{rolling resistance coefficient} \\
C_f &= \text{friction coefficient} \\
M &= \text{total weight rider + bike} \\
g &= 9.81\ ms^{-2} \\
\phi &= \text{slope angle} \\
v_{wind} &= \text{wind speed} \\
\end{align*}

The values chosen for the parameters can be found in the R code on my GitHub page.

This formula shows that the lighter you are, the less power you need to move at speed v. But let's not forget that my hypothesis is precisely that a heavier rider of the same level is also able to produce more power!

W/kg and Kleiber’s law

So now, I need an equation relating mass and power for riders with similar abilities. One way to do this is to consider that physical ability is measured by power divided by body mass (a measure that has been widely used and discussed recently). So this is going to be my first working hypothesis:

\begin{align*}
\text{H1: } \frac{P}{m} \text{ is constant among riders of the same level}
\end{align*}

 

Nevertheless, this doesn't feel satisfying: sure, it is very common to assume a linear relationship between two parameters, but most of the time we do this because we lack a better way to describe how things work. And in this case, it seems that power is linked to body mass by Kleiber's law, which is going to be my second working hypothesis:

\begin{align*}
\text{H2: } \frac{P}{m^{0.75}} \text{ is constant among riders of the same level}
\end{align*}

Plotting the power

Now, I need the values of the constants under hypotheses H1 and H2. For now, I'm only interested in top-level riders, so I chose to use Strava data for Robert Gesink on the Wall of Huy to compute the constants. It turns out Gesink, who weighs 70kg, was able to develop a mean power of 557W while climbing the hill, which gives us our two constants:

\begin{align*}
C_{H1} &= \frac{557}{70} \\
C_{H2} &= \frac{557}{70^{0.75}} \\
\end{align*}

We are now able to plot the speed advantage that a rider weighing m1 would have over a heavier rider weighing m2 > m1, given that the two riders have similar physical abilities (e.g. the same body fat percentage). We could plot this function for any m1, m2, but I'm basically looking for a gradient that would equalize the chances of almost all professional riders, so I select m1 = 57kg (Joaquim Rodriguez) and m2 = 80kg (André Greipel), the two most physically different riders I can think of. We get:

[Figure: speed advantage of a 57kg rider over an equal-level 80kg rider, in percentage points]

On this graph, we see that under hypothesis H1 (blue curve), the function giving the lighter rider's speed advantage remains negative everywhere, which means it is always better to be more powerful, even if you have to drag extra weight uphill (this is quite surprising, and I think we could use this result to dismiss H1… but a physiologist would be far more qualified than me to discuss this particular point!). The curve we get under hypothesis H2 (red curve) is far more interesting. It shows that when the road is flat or a faux plat, the more powerful rider has an edge, but on steep slopes (10% and more), the lighter the better. And in between, there exists a "limit gradient", which confirms what I initially suspected. Yay!

6%: the golden gradient

The "limit gradient" appears to lie between 5% and 10%. Now we can write a few lines of R code to determine its exact value. We get:

gradient_lim ~ 6.15 %

According to our model, this means that on a hill whose gradient is approximately 6%, Rodriguez and Greipel stand equal chances of being the fastest (and so, probably, do all the riders in between these two very different champions). In particular, when the finish line of a race is drawn at the top of a climb, it could be really interesting to choose a 6% gradient: below that, it would typically be a sprinters' finish, and above, a climbers' finish. Incidentally, this happens to be roughly the gradient of the final hill of the 2015 World Championships held in Richmond this Sunday! Well done, Virginia!
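As an illustration, here is a condensed sketch of how this computation could go. The physical constants below are placeholder values (the real ones are in the GitHub code), so this sketch won't reproduce the 6.15% figure exactly:

# Placeholder physical constants (see the GitHub code for the real values)
rho <- 1.225; SCx <- 0.3; Cr <- 0.004; Cf <- 0.1; g <- 9.81; v_wind <- 0

# Power needed to ride at speed v (m/s) with total mass M on a slope of angle phi
power_needed <- function(v, M, phi) {
  v * (0.5 * rho * SCx * (v - v_wind)^2 + M * g * Cr * cos(phi) +
         Cf * (v - v_wind)^2 + M * g * sin(phi))
}

# Speed attainable with power P: invert power_needed numerically
speed <- function(P, M, phi) {
  uniroot(function(v) power_needed(v, M, phi) - P, c(0.1, 30))$root
}

# Under H2, riders of the same level share the constant C = P / m^0.75
C_H2 <- 557 / 70^0.75
speed_gap <- function(grad, m1 = 57, m2 = 80, m_bike = 7) {
  phi <- atan(grad)
  speed(C_H2 * m1^0.75, m1 + m_bike, phi) -
    speed(C_H2 * m2^0.75, m2 + m_bike, phi)
}

# The limit gradient is where the lighter rider's advantage changes sign
uniroot(speed_gap, c(0.01, 0.12))$root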

[Figure: elevation map of the Richmond 2015 road circuit © UCI Road World Championships]

What about amateurs?

If I want to redo this analysis for riders of my level, all I need to do is compute the constants from my own Strava data. It turns out that on hills of ~1.5 km, I can develop approximately 250W, and I weigh 70kg. I get:

gradient_lim ~ 4.9 %

Although it's not very rigorous, I can confirm that of the (very) few KOMs/not-so-bad performances I hold on Strava, all occurred on slopes with gradients lower than 4%.

Of course, I don't pretend that the values of the "limit gradient" I find are exact measures that should always set the gradient of the final hill of every race. For starters, there are a lot of parameters I didn't take into account:
– the length of the hill (I voluntarily did not say a word about the fact that different riders can react very differently to hills with the same gradient but different lengths)
– the fact that top-level riders may not have exactly the same physical abilities (for example in terms of body fat percentage) when their riding styles are different
– the fact that I'm not sure Kleiber's law is really valid in this context: from what I understand, it was primarily designed for evolutionary biology, not for physiology, but I couldn't find better
– and of course, the fact that who wins a bike race depends more on the profile of the race before the final sprint, which after all is the very nature of this beautiful sport!

Still, I think we can reasonably assert that, in general, a 2-3% gradient is not enough to discourage powerful riders from sprinting, and that an 8-9% gradient largely favors the Rodriguez-like riders of the pack! And if anyone holds a dataset that could enable me to check this hypothesis, I'd be very happy!