NC233 – Page 2 – Sampling and data tinkering

Graph sampling for machine learning at Montreal AI Symposium

September 6, 2019 Antoine Rebecq

I’ll be at the 2019 Montreal AI Symposium today, presenting a poster about network sampling and an application to machine learning:

Network sampling and applications to big data and machine learning, from Antoine Rebecq

Featured image: Montreal Skyline, by Taxiarchos228

Eurovision 2019 – predictions

May 1, 2019 Thomas M

Following the same approach as last year (and, we hope, with just as much success!), we are going to try to predict Eurovision 2019, again with a model based on the statistics of the videos published on YouTube (the list of this year’s competing videos is here).

The data

Reminder: we use the information available on the YouTube videos: number of views, number of “Likes” and number of “Dislikes”. We retrieve this information using the R package…

The Mrs. White probability puzzle

April 28, 2019 Antoine Rebecq

tl;dr: I don’t remember how many games of Clue I’ve played, but I do remember being surprised that Mrs. White was the murderer in only 2 of those games. Can you give an estimate and an upper bound for the number of games I have played? We solve this problem by using Bayes’ theorem and discussing the data generation mechanism, and illustrate the solution with R.

Making use of external information with Bayes’ theorem

Having been raised a frequentist, I first…
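As a rough illustration of the Bayesian approach mentioned in this excerpt (a sketch under stated assumptions, not necessarily the derivation used in the full post): write $N$ for the unknown number of games and $k = 2$ for the number of games in which Mrs. White was the murderer. Assume that each of the six suspects is equally likely in every game, so $p = 1/6$, that games are independent, and that the prior on $N$ is flat. The flat prior and the symbols $N$, $k$ and $p$ are assumptions introduced here for illustration.

```latex
\begin{align*}
P(k \mid N) &= \binom{N}{k}\, p^{k} (1-p)^{N-k}, \\
P(N \mid k) &= \frac{P(k \mid N)\, P(N)}{\sum_{n \ge k} P(k \mid n)\, P(n)}
             = \binom{N}{k}\, p^{k+1} (1-p)^{N-k}, \qquad N \ge k, \\
\text{since}\quad \sum_{n \ge k} \binom{n}{k}\, p^{k} (1-p)^{n-k} &= \frac{1}{p}.
\end{align*}
```

With $k = 2$ and $p = 1/6$ this gives $P(N \mid k = 2) = \binom{N}{2} (1/6)^3 (5/6)^{N-2}$, whose mode is at $N = 11$ and $N = 12$ (tied) and whose mean is $(k+1)/p - 1 = 17$. An upper bound would then be read off as an upper quantile of this distribution. A different prior, or a different account of how the observation was generated, would change these numbers.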

Ranking places with Google to create maps

March 11, 2019 Thomas M

Today we’re going to use the googleway R package, which lets its users send requests to the Google Maps Places API. The goal is to create maps of specific places (restaurants, museums, etc.) with information from Google Maps ratings (the number of stars given by other people). I already discussed this in French here, to rank swimming pools in Paris. Let’s start by loading the three libraries I’m going to use: googleway, leaflet to create interactive maps, and RColorBrewer for…

Is this swimming pool well rated?

March 3, 2019 Thomas M

I’ve gotten into the (bad?) habit of using Google Maps and its rating system (each user can give a rating from one to five stars) to decide where to go: restaurants, tourist spots, etc. I recently moved and looked into the nearby swimming pools, only to find that their ratings hovered around 3 stars. It then occurred to me that I didn’t know whether, for a swimming pool, that was a good or…

[Sampling] Presentation in Ottawa – a new sampling frame for INSEE surveys

November 8, 2018 Thomas M

Tomorrow (Thursday, November 8), I will give a presentation at the Statistics Canada Methodology Symposium on the rollout of INSEE’s new sampling system for household and individual surveys, built from tax sources. This change of frame brings new opportunities (new variables, new contact modes, better coordination between surveys) but also challenges (aligning concepts, managing the coverage of the administrative frame). The slides are below:

[Sampling] Big data and sampling in Ottawa

November 6, 2018 Antoine Rebecq

Tomorrow (November 7th), I’ll give a talk at the Statistics Canada Symposium on survey sampling and big data. I’ll show how techniques that were developed at official statistics institutes can now be used in the context of big data and machine learning, and add a lot of value. I’ll show some examples with A/B testing, tracking design, calibration in machine learning, network analysis, and user feedback.

Bring survey sampling techniques into big data, from Antoine Rebecq

And really glad to…

Bad recommendations, good algorithm

September 4, 2018 Antoine Rebecq

If you’ve ever shopped online (*cough* Amazon *cough*), you’ve probably experienced the “vacuum cleaner effect”. You carefully buy one expensive item (e.g. a vacuum cleaner) and then you receive dozens of recommendations for other vacuum cleaners to buy: by email, everywhere on the retailer’s website, or sometimes in the ads you see on other websites. In other words, Amazon is a 1-trillion-dollar company that employs hundreds of data scientists and is incapable of understanding that if you bought…

Analysis of predictions for the 2018 World Cup

July 18, 2018 Thomas M

We are the champions! We didn’t have time to build a prediction model for this 2018 football World Cup (FiveThirtyEight made a very nice one, see here), but that didn’t stop us from running a prediction contest among statistician colleagues and ex-colleagues, on the Scorecast site. The results are as follows:

Another points system?

The points system used by Scorecast is as follows…

Weighting tricks for machine learning with Icarus – Part 1

July 5, 2018 Antoine Rebecq

Calibration in survey sampling is a wonderful tool, and today I want to show you how we can use it in some machine learning applications, using the R package Icarus. And because ’tis the season, what better than a soccer dataset to illustrate this? The data and code are located in this GitLab repo: https://gitlab.com/haroine/weighting-ml

First, let’s start by installing and loading icarus and nnet, the two packages needed in this tutorial, from CRAN (if necessary):

    install.packages(c("icarus", "nnet"))
    library(icarus)
    library(nnet)

Then…
