If you’ve ever shopped online (*cough* Amazon *cough*), you’ve probably experienced the “vacuum cleaner effect”. You carefully buy one expensive item (e.g. a vacuum cleaner) and then you receive dozens of recommendations for other vacuum cleaners to buy: by email, everywhere on the retailer’s website, or sometimes in the ads you see on other websites.
In other words, Amazon is a 1 trillion dollar company that employs hundreds of data scientists and is incapable of understanding that if you just bought an expensive appliance, buying another one of the same category in the following weeks is what you’re *least* likely to do!
But let’s think about the problem for a second. Suggesting items that are similar to what you just bought is the core feature of recommendation algorithms! Detecting that this is inappropriate for certain categories of items is not an easy task: it would require a careful analysis of performance by category, which would be prone to many potential errors: sampling variance, categorization errors (some manual tagging might be required), temporal fluctuations, etc.
So fixing this little annoyance for the consumer might take a few weeks of research and a couple of months of integration, and still fail in some cases. It could end up costing several hundred thousand dollars, not to mention that it could also hurt the overall performance of the recommendation algorithm.
So the recommendations might be bad, but in the end the algorithm is valuable nonetheless. Remember, machine learning and artificial intelligence are pretty stupid, but they are very valuable!
Recommender systems are awesome! Very excited to say I’ll be at RecSys 2018 in Vancouver next month to learn more about them 🙂
— Featured image: View of downtown Vancouver from the Lookout Tower at Harbour Centre, by Magnus Larsson.
The RData file contains two dataframes, one for the training set and one for the test set. They contain the results of international soccer games, from 01/2008 to 12/2016 for the training set, and from 01/2017 to 11/2017 for the test set. Along with the team names and the goals scored by each side, they include a few descriptive variables that we’re going to use as features of our ML models:
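Loading and inspecting the data could look like the sketch below (the file name and the dataframe names train_soccer and test_soccer are assumptions made for illustration, not the actual names):

```r
# Hypothetical file and object names, for illustration only
load("soccer_games.RData")   # assumed to create train_soccer and test_soccer
head(train_soccer)
```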
```
        Date                   team opponent_team home_field elo_team
1 2010-10-12                Belarus       Albania          1      554
2 2010-10-08 Bosnia and Herzegovina       Albania          0      544
3 2011-06-07 Bosnia and Herzegovina       Albania          0      594
4 2011-06-20              Argentina       Albania          1     1267
5 2011-08-10             Montenegro       Albania          0      915
6 2011-09-02                 France       Albania          0      918
  opponent_elo importance goals_for goals_against outcome year
1          502          1         2             0     WIN 2010
2          502          1         1             1    DRAW 2010
3          564          1         2             0     WIN 2011
4          564          1         4             0     WIN 2011
5          524          1         2             3    LOSS 2011
6          546          1         2             1     WIN 2011
```
elo_team and opponent_elo are quantitative variables indicative of the level of each team at the date of the game; importance is a measure of how high-profile the game was (a friendly match rates 1 while a World Cup game rates 4). The other variables are, in my opinion, self-explanatory.
Then we can train a multinomial logistic regression, with outcome being the predicted variable, and compute the predictions from the model:
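A minimal sketch of this step, assuming the nnet package for the multinomial fit and the hypothetical dataframe names introduced above (the exact model formula isn’t shown in this post, so the one below is only a guess):

```r
library(nnet)

# Unweighted multinomial logit; the feature set is an assumption
fit_unweighted <- multinom(outcome ~ home_field + elo_team + opponent_elo,
                           data = train_soccer)

# Predicted classes on the test set, tabulated by outcome
pred_class <- predict(fit_unweighted, newdata = test_soccer, type = "class")
table(pred_class)

# Predicted probabilities, kept for the calibration checks discussed later
pred_prob <- predict(fit_unweighted, newdata = test_soccer, type = "probs")
```

The distribution of predicted outcomes on the test set turns out to be: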
```
DRAW LOSS  WIN
   0  208  210
```
And indeed, since draws are less common than the other results, it seems more profitable for an algorithm that optimizes accuracy to never predict them. As a consequence, the predicted probability of a draw is always lower than the probability of either team winning. We could also show that the probabilities are not well calibrated.
A common solution to this problem is to use reweighting to correct the imbalance in the sample, which we’ll now tackle. It is important to note that the weighting trick has to happen on the training set only, to avoid “data leaks”. A very good piece on this subject was written by Max Kuhn in the documentation of caret.
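A minimal sketch of this reweighting, assuming the standard inverse-class-frequency scheme (so that each outcome class carries the same total weight; the names follow the earlier sketches):

```r
# Inverse-class-frequency weights: each outcome class gets the same total weight.
# Computed on the training set only, so no information leaks from the test set.
class_freq <- table(train_soccer$outcome)
train_soccer$w_outcome <- as.numeric(
  nrow(train_soccer) /
    (length(class_freq) * class_freq[as.character(train_soccer$outcome)])
)

# Refit the multinomial logit with these case weights
fit_weighted <- multinom(outcome ~ home_field + elo_team + opponent_elo,
                         data = train_soccer, weights = w_outcome)
```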
With these weights, the matches of each year between 2008 and 2016 have the same influence on the final predictor. A better idea would be to give the most recent games a slightly higher influence, for example by increasing their weights and correspondingly reducing the weights of the older games:
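One possible way to implement such a recency weighting (this particular scheme is illustrative, not necessarily the one used in the post):

```r
# Weight increases linearly with the year of the game, then the weights are
# rescaled so that the total weight of the training set stays the same
recency <- train_soccer$year - min(train_soccer$year) + 1
train_soccer$w_recency <- train_soccer$w_outcome * recency
train_soccer$w_recency <- train_soccer$w_recency *
  sum(train_soccer$w_outcome) / sum(train_soccer$w_recency)

fit_recency <- multinom(outcome ~ home_field + elo_team + opponent_elo,
                        data = train_soccer, weights = w_recency)
```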
We thus determine that it is a good idea to balance on these two additional variables (year and importance). Now how should we do this? A solution could be to create an indicator variable containing all the values of the crossing of outcome, year, and importance, and use the same reweighting technique as before. But this would not be very practical and, more importantly, some of the sub-categories would be nearly empty, making the procedure not very robust. A better solution is to use survey sampling calibration and Icarus 🙂
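Here is a sketch of what the calibration step could look like with Icarus’s calibration() function. The recoding of the categorical variables, the choice of margins, and the margin totals below are placeholders I made up for illustration; they are not the values used in the original post:

```r
library(Icarus)

# Icarus expects categorical calibration variables coded as integers 1..K (assumption)
train_soccer$outcome_code    <- as.integer(factor(train_soccer$outcome))     # 1..3
train_soccer$importance_code <- as.integer(factor(train_soccer$importance))  # 1..4, assuming all levels occur

n <- nrow(train_soccer)

# One row per calibration margin: variable name, number of modalities, then the
# target totals for each modality (rows padded with 0 so the matrix is rectangular).
# year could be added as an extra row in the same way.
margins <- rbind(
  c("outcome_code",    3, rep(round(n / 3), 3), 0),
  c("importance_code", 4, as.numeric(table(train_soccer$importance_code)))
)

# Calibrated weights, starting from the class weights computed earlier
train_soccer$w_cal <- calibration(data = train_soccer, marginMatrix = margins,
                                  colWeights = "w_outcome", method = "raking",
                                  description = FALSE)
```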
But more importantly, we have reason to believe that we improved the quality of the probabilities assigned to each outcome (we could check this using metrics such as the Brier score or calibration plots) 🙂
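For instance, a multiclass Brier score can be computed from the matrix of predicted probabilities (this check is my addition, not part of the original code):

```r
# Multiclass Brier score: mean squared distance between the predicted probability
# matrix and the one-hot encoding of the observed outcomes (lower is better)
brier_score <- function(prob, observed) {
  onehot <- model.matrix(~ observed - 1)
  colnames(onehot) <- levels(observed)
  mean(rowSums((prob[, colnames(onehot)] - onehot)^2))
}

# pred_prob and test_soccer come from the earlier sketches
brier_score(pred_prob, factor(test_soccer$outcome))
```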
It is also worth noting that some algorithms (especially those that rely on bagging, boosting, or more generally on ensemble methods) naturally do a better job of handling imbalanced samples. You could, for example, rerun the whole code replacing the logit regressions with boosted models; you would then observe smaller differences between the unweighted algorithm and its weighted counterparts.
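As an illustration of that swap (again a sketch with assumed names, using xgboost as one possible boosting implementation):

```r
library(xgboost)

# Same assumed features as before, as a numeric matrix; xgboost wants 0-based class labels
features <- c("home_field", "elo_team", "opponent_elo")
X_train  <- as.matrix(train_soccer[, features])
X_test   <- as.matrix(test_soccer[, features])
y_train  <- as.integer(factor(train_soccer$outcome)) - 1

fit_boost <- xgboost(data = X_train, label = y_train, nrounds = 200,
                     objective = "multi:softprob", num_class = 3, verbose = 0)

# Predictions come back as one long vector; reshape to a matrix with one column per class
prob_boost <- matrix(predict(fit_boost, X_test), ncol = 3, byrow = TRUE)
colnames(prob_boost) <- levels(factor(train_soccer$outcome))
```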
Stay tuned for part 2, where we’ll show a trick to craft better probabilities (particularly for simulations) using external knowledge about the probabilities.