NC233 – Sampling and data tinkering

Rugby World Cup explainer using data

September 20, 2023 Antoine Rebecq

Last week, a stereotypically “French” ceremony opened the 10th Rugby World Cup at the Stade de France, in the suburbs of Paris. As a small boy growing up in the southern half of France, I developed a strong interest in the sport. Now an adult living and working in North America, where barely anyone has ever heard the word “Rugby”, I rarely have anyone to talk to about Antoine Dupont’s (captain of the French team and best…

Using R to build predictions for UEFA Euro 2020

June 15, 2021 Antoine Rebecq

Last Friday, Euro 2020, one of the biggest events in international soccer, kicked off with the opening match between Italy and Turkey (Italy won 3-0). Euros (short for European Championships) are usually held every 4 years, but because of he-who-must-not-be-named, last year’s edition was postponed to this summer while keeping the name “Euro 2020” (much like the Tokyo Olympics). Five years ago, for Euro 2016, I basically wanted to try some cool methods based on splines on…

Have you checked your feature distributions lately?

April 14, 2021 Antoine Rebecq

tl;dr: While trying to debug a poorly performing machine learning model, I discovered that the distribution of one of the features varied from one date to another. I fixed it with a simple and neat affine rescaling. This simple data quality improvement brought the model’s prediction error down by a factor of 8. Data quality trumps any algorithm. I was recently working on a cool dataset that looked unusually friendly. It was tidy, neat, interesting… the kind of thing that you rarely encounter in the wild!…
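To illustrate what an affine rescaling of a drifting feature can look like, here is a minimal sketch. The excerpt does not say how the rescaling was fitted, so this assumes one plausible choice: for each date, pick the transform `a * x + b` that maps that date's mean and standard deviation onto those of a reference date. The function name `affine_rescale_by_group` and its arguments are hypothetical.

```python
from statistics import mean, pstdev


def affine_rescale_by_group(values, groups, reference_group):
    """Rescale values so every group matches the reference group's mean and std.

    Assumption (not from the post): each group gets its own affine map
    x -> a * x + b, fitted on the group's mean and population std.
    """
    # Collect the values observed in each group (e.g. each date).
    by_group = {}
    for value, group in zip(values, groups):
        by_group.setdefault(group, []).append(value)

    ref_mean = mean(by_group[reference_group])
    ref_std = pstdev(by_group[reference_group])

    # Fit one (a, b) pair per group.
    params = {}
    for group, group_values in by_group.items():
        std = pstdev(group_values)
        a = ref_std / std if std > 0 else 1.0  # constant group: shift only
        b = ref_mean - a * mean(group_values)
        params[group] = (a, b)

    return [params[g][0] * v + params[g][1] for v, g in zip(values, groups)]
```

For example, if the feature is recorded on a ten times larger scale on one date than on the reference date, this maps it back onto the reference scale.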

Creating a hex map of France’s electricity consumption

June 2, 2020 Thomas M

The French Ministry for the Ecological and Inclusive Transition (for which I’m currently working) is in the process of opening up data related to energy consumption. Each year, we publish data for every neighborhood in France (at the IRIS statistical level, even at the address level in some cases), broken down by the nature of the final consumer (a household, an industry, a shop…). These data are available here (website in French – direct link to 2018 electricity consumption data). Making a map to have…

Mackerels and départements

May 24, 2020 Thomas M

This week, FiveThirtyEight’s “classic” riddle (which can be found here) asks for words that share no letters with one and only one US state. For example, “mackerel” shares letters with every state except Ohio. The problem can be adapted to the French case: which words share no letters with one and only one French département? Reusing the word list from our article on Motus and…
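The puzzle's rule is easy to check with set operations. The sketch below is only an illustration, not the code from the post: the helper names are hypothetical, and it uses the 50 US state names rather than the French départements, so that the “mackerel” example can be verified.

```python
US_STATES = [
    "Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado",
    "Connecticut", "Delaware", "Florida", "Georgia", "Hawaii", "Idaho",
    "Illinois", "Indiana", "Iowa", "Kansas", "Kentucky", "Louisiana",
    "Maine", "Maryland", "Massachusetts", "Michigan", "Minnesota",
    "Mississippi", "Missouri", "Montana", "Nebraska", "Nevada",
    "New Hampshire", "New Jersey", "New Mexico", "New York",
    "North Carolina", "North Dakota", "Ohio", "Oklahoma", "Oregon",
    "Pennsylvania", "Rhode Island", "South Carolina", "South Dakota",
    "Tennessee", "Texas", "Utah", "Vermont", "Virginia", "Washington",
    "West Virginia", "Wisconsin", "Wyoming",
]


def letters(text):
    """Set of lowercase letters in text, ignoring spaces and punctuation."""
    return {ch for ch in text.lower() if ch.isalpha()}


def names_sharing_no_letters(word, names):
    """Names that have no letter in common with word."""
    word_letters = letters(word)
    return [name for name in names if not (word_letters & letters(name))]


def words_matching_exactly_one(words, names):
    """Map each word to the single name it shares no letters with."""
    result = {}
    for word in words:
        matches = names_sharing_no_letters(word, names)
        if len(matches) == 1:
            result[word] = matches[0]
    return result


print(names_sharing_no_letters("mackerel", US_STATES))  # ['Ohio']
```

Swapping `US_STATES` for a list of département names gives the French variant. Accented letters would need to be normalized first, a step this sketch leaves out.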

Rolling some dice

May 17, 2020 Thomas M

Today, a quick post trying to provide an answer to this week’s Riddle Classic on FiveThirtyEight: The fifth edition of Dungeons & Dragons introduced a system of “advantage and disadvantage.” When you roll a die “with advantage,” you roll the die twice and keep the higher result. Rolling “with disadvantage” is similar, except you keep the lower result instead. The rules further specify that when a player rolls with both advantage and disadvantage, they cancel out, and the player rolls a single die….
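To make the advantage and disadvantage mechanics concrete, here is a small sketch that computes exact expected values by enumerating every pair of rolls. It only illustrates the rules quoted above and does not attempt the riddle itself. The 20-sided die is an assumption (the excerpt just says “a die”), and the function names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def expected_two_rolls(sides, keep):
    """Exact expected value when two dice are rolled and `keep` picks one.

    Pass keep=max for advantage and keep=min for disadvantage.
    """
    pairs = product(range(1, sides + 1), repeat=2)
    total = sum(keep(first, second) for first, second in pairs)
    return Fraction(total, sides * sides)


def expected_single_roll(sides):
    """Exact expected value of one fair die."""
    return Fraction(sides + 1, 2)


SIDES = 20  # assumption: a d20, the usual die for these rolls

advantage = expected_two_rolls(SIDES, max)
disadvantage = expected_two_rolls(SIDES, min)
single = expected_single_roll(SIDES)

print(advantage, single, disadvantage)  # 553/40 21/2 287/40
```

On a d20, advantage raises the average roll from 10.5 to 13.825 and disadvantage lowers it to 7.175.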

Eurovision 2020 – “predictions”

May 9, 2020 Thomas M

Eurovision 2020, like many cultural and sporting events, will not take place this year because of the pandemic. The songs submitted by the participating countries have nevertheless been put online: they can be found here. Even though there is no point in doing so (nobody will win a contest that is not being held), it is therefore possible to run our prediction model (as in previous years, 2018 and 2019), which uses the data associated with each video on…

Causal Inference cheat sheet for data scientists

April 29, 2020 Antoine Rebecq

Being able to make causal claims is a key source of business value for any data science team, no matter its size. Quick analytics (in other words, descriptive statistics) are the bread and butter of any good data analyst working in quick cycles with their product team to understand their users. But sometimes important questions arise that need more precise answers. Business value sometimes means distinguishing true insights from incidental noise. Insights that will hold up versus temporary marketing…

How can we explain the drop in turnout at the 2020 municipal elections?

March 20, 2020 Thomas M

Last Sunday, March 15, 2020, France held the first round of its municipal elections, after having announced the closure of schools and then of restaurants and non-essential shops. Turnout for this election came to 44.64%, down 20 points from 2014, the date of the previous municipal elections (see a very nice map by Le Monde here, which illustrates the situation well). This quick post will not dwell on the question of whether it was necessary or…

Micromorts – how much risk of death would you accept?

March 8, 2020 Antoine Rebecq

A micromort is a one-in-a-million chance of dying – it is roughly equivalent to tossing 20 coins and getting 20 heads