This weekend I released version 0.3.0 of the Icarus package to CRAN.

Icarus provides tools to help perform calibration on margins, which is a very important method in sampling. One of these days I’ll write a blog post explaining calibration on margins! In the meantime if you want to learn more, you can read our course on calibration (in French) or the original paper of Deville and Sarndal (1992). Shortly said, calibration computes new sampling weights so that the sampling estimates match totals we already know thanks to another source (census, typically).

In the industry, one of the most widely used software for performing calibration on margins is the SAS macro Calmar developed at INSEE. Icarus is designed with the typical Calmar user in mind if s/he whishes to find a direct equivalent in R. The format expected by Icarus for the margins and the variables is directly inspired by Calmar’s (wiki and example here). Icarus also provides the same kind of graphs and stats aimed at helping statisticians understand the quality of their data and estimates (especially on domains), and in general be able to understand and explain the reweighting process.

I hope I find soon the time to finish a full well documented article to submit to a journal and set it as a vignette on CRAN. For now, here are the slides (in French, again) I presented at the “colloque francophone sondages” in Gatineau last october: http://nc233.com/icarus.

Aujourd’hui un petit post un peu plus “pratique”. On va réaliser le graphique du taux de chômage en France depuis 1975 en utilisant R. Les données sont disponibles sur le site de l’INSEE. En suivant ce lien on va pouvoir les télécharger au format csv. Mais il est beaucoup plus sympathique d’utiliser une méthode un peu plus automatique pour récupérer ces données. Ainsi, dès que l’INSEE les mettra à jour (le trimestre prochain par exemple), il suffira de relancer le script R et le graphe se mettra automatiquement à jour.

Pour ce faire, on va utiliser la compatibilité du nouveau site de l’Insee avec la norme SDMX. Le package R rsdmx va nous permettre de récupérer facilement les données en utilisant cette fonctionnalité :

install.packages("rsdmx")
library(rsdmx)

Sur la page correspondant aux données sur le site de l’Insee, on récupère l’identifiant associé : 001688526. Cela permet de construire l’adresse à laquelle on va pouvoir demander à rsdmx de récupérer les données : http://www.bdm.insee.fr/series/sdmx/data/SERIES_BDM/001688526. On peut également ajouter des paramètres supplémentaires en les ajoutant à la suite de l’url (en les séparant par un “?”). On utilise ensuite cette adresse avec la fonction readSDMX:

Les deux colonnes qui vont nous intéresser sont “TIME_PERIOD” (date de la mesure) et “OBS_VALUE” (taux de chômage estimé). Pour le moment, ces données sont au format “caractère”. On va les transformer respectivement en “date” et en “numérique”. Notez que l’intégration du trimestre nécessite de passer par le package “zoo” (pour la fonction as.yearqtr) :

library(zoo)
donnees_chomage$OBS_VALUE <- as.numeric(donnees_chomage$OBS_VALUE)
donnees_chomage$TIME_PERIOD <- as.yearqtr(donnees_chomage$TIME_PERIOD, format = "%Y-Q%q")

Il ne reste plus qu’à utiliser l’excellent package ggplot2 pour créer le graphe. On en profite pour ajouter des lignes verticales en 1993 et 2008 (dates des deux dernières crises économiques majeures) :

This year we’ve had a great summer for sporting events! Now autumn is back, and with it the Ligue 1 championship. Last year, we created this data analysis tutorial using R and the excellent package FactoMineR for a course at ENSAE (in French). The dataset contains the physical and technical abilities of French Ligue 1 and Ligue 2 players. The goal of the tutorial is to determine with our data analysis which position is best for Mathieu Valbuena 🙂

The dataset

A small precision that could prove useful: it is not required to have any advanced knowledge of football to understand this tutorial. Only a few notions about the positions of the players on the field are needed, and they are summed up in the following diagram:

The data come from the video game Fifa 15 (which is already 2 years old, so there may be some differences with the current Ligue 1 and Ligue 2 players!). The game features rates each players’ abilities in various aspects of the game. Originally, the grade are quantitative variables (between 0 and 100) but we transformed them into categorical variables (we will discuss why we chose to do so later on). All abilities are thus coded on 4 positions : 1. Low / 2. Average / 3. High / 4. Very High.

Loading and prepping the data

Let’s start by loading the dataset into a data.frame. The important thing to note is that FactoMineR requires factors. So for once, we’re going to let the (in)famous stringsAsFactors parameter be TRUE!

The second line transforms the integer columns into factors also. FactoMineR uses the row.names of the dataframes on the graphs, so we’re going to set the players names as row names:

Here’s what our object looks like (we only display the first few lines here):

> head(frenchLeague)
foot position league age height overall
Florian Thauvin left RM Ligue1 1 3 4
Layvin Kurzawa left LB Ligue1 1 3 4
Anthony Martial right ST Ligue1 1 3 4
Clinton N'Jie right ST Ligue1 1 2 3
Marco Verratti right MC Ligue1 1 1 4
Alexandre Lacazette right ST Ligue1 2 2 4

Data analysis

Our dataset contains categorical variables. The appropriate data analysis method is the Multiple Correspondance Analysis. This method is implemented in FactoMineR in the method MCA. We choose to treat the variables “position”, “league” and “age” as supplementary:

This produces three graphs: the projection on the factorial axes of categories and players, and the graph of the variables. Let’s just have a look at the second one of these graphs:

Before trying to go any further into the analysis, something should alert us. There clearly are two clusters of players here! Yet the data analysis techniques like MCA suppose that the scatter plot is homogeneous. We’ll have to restrict the analysis to one of the two clusters in order to continue.

On the previous graph, supplementary variables are shown in green. The only supplementary variable that appears to correspond to the cluster on the right is the goalkeeper position (“GK”). If we take a closer look to the players on this second cluster, we can easily confirm that they’re actually all goalkeeper. This absolutely makes a lot of sense: in football, the goalkeeper is a very different position, and we should expect these players to be really different from the others. From now on, we will only focus on the positions other than goalkeepers. We also remove from the analysis the abilities that are specific to goalkeepers, which are not important for other players and can only add noise to our analysis:

Obviously, we have to start by reducing the analysis to a certain number of factorial axes. My favorite method to chose the number of axes is the elbow method. We plot the graph of the eigenvalues:

> barplot(mca_no_gk$eig$eigenvalue)

Around the third or fourth eigenvalue, we observe a drop of the values (which is the percentage of the variance explained par the MCA). This means that the marginal gain of retaining one more axis for our analysis is lower after the 3rd or 4th first ones. We thus choose to reduce our analysis to the first three factorial axes (we could also justify chosing 4 axes). Now let’s move on to the interpretation, starting with the first two axes:

We could start the analysis by reading on the graph the name of the variables and modalities that seem most representative of the first two axes. But first we have to keep in mind that there may be some of the modalities whose coordinates are high that have a low contribution, making them less relevant for the interpretation. And second, there are a lot of variables on this graph, and reading directly from it is not that easy. For these reasons, we chose to use one of FactoMineR’s specific functions, dimdesc (we only show part of the output here):

The most representative abilities of the first axis are, on the right side of the axis, a weak level in attacking abilities (finishing, volleys, long shots, etc.) and on the left side a very strong level in those abilities. Our interpretation is thus that axis 1 separates players according to their offensive abilities (better attacking abilities on the left side, weaker on the right side). We procede with the same analysis for axis 2 and conclude that it discriminates players according to their defensive abilities: better defenders will be found on top of the graph whereas weak defenders will be found on the bottom part of the graph.

Supplementary variables can also help confirm our interpretation, particularly the position variable:

> plot.MCA(mca_no_gk, invisible = c("ind","var"))

And indeed we find on the left part of the graph the attacking positions (LW, ST, RW) and on the top part of the graph the defensive positions (CB, LB, RB).

If our interpretation is correct, the projection on the second bissector of the graph will be a good proxy for the overall level of the player. The best players will be found on the top left area while the weaker ones will be found on the bottom right of the graph. There are many ways to check this, for example looking at the projection of the modalities of the variable “overall”. As expected, “overall_4” is found on the top-left corner and “overall_1” on the bottom-right corner. Also, on the graph of the supplementary variables, we observe that “Ligue 1” (first division of the french league) is on the top-left area while “Ligue 2” (second division) lies on the bottom-right area.

With only these two axes interpreted there are plenty of fun things to note:

Left wingers seem to have a better overall level than right wingers (if someone has an explanation for this I’d be glad to hear it!)

Age is irrelevant to explain the level of a player, except for the younger ones who are in general weaker.

Modalities that are most representative of the third axis are technical weaknesses: the players with the lower technical abilities (dribbling, ball control, etc.) are on the end of the axis while the players with the highest grades in these abilities tend to be found at the center of the axis:

We note with the help of the supplementary variables, that midfielders have the highest technical abilities on average, while strikers (ST) and defenders (CB, LB, RB) seem in general not to be known for their ball control skills.

Now we see why we chose to make the variables categorical instead of quantitative. If we had kept the orginal variables (quantitative) and performed a PCA on the data, the projections would have kept the orders for each variable, unlike what happens here for axis 3. And after all, isn’t it better like this? Ordering players according to their technical skills isn’t necessarily what you look for when analyzing the profiles of the players. Football is a very rich sport, and some positions don’t require Messi’s dribbling skills to be an amazing player!

Mathieu Valbuena

Now we add the data for a new comer in the French League, Mathieu Valbuena (actually Mathieu Valbuena arrived in the French League in August of 2015, but I warned you that the data was a bit old ;)). We’re going to compare Mathieu’s profile (as a supplementary individual) to the other players, using our data analysis.

Last two lines produce the graphs with Mathieu Valbuena on axes 1 and 2, then 2 and 3:

So, Mathieu Valbuena seems to have good offensive skills (left part of the graph), but he also has a good overall level (his projection on the second bissector is rather high). He also lies at the center of axis 3, which indicates he has good technical skills. We should thus not be surprised to see that the positions that suit him most (statistically speaking of course!) are midfield positions (CAM, LM, RM). With a few more lines of code, we can also find the French league players that have the most similar profiles:

And we get: Ladislas Douniama, Frédéric Sammaritano, Florian Thauvin, N’Golo Kanté and Wissam Ben Yedder.

There would be so many other things to say about this data set but I think it’s time to wrap this (already very long) article up 😉 Keep in mind that this analysis should not be taken too seriously! It just aimed at giving a fun tutorial for students to discover R, FactoMineR and data analysis.