Contributed by: Tom Hofkamp
This article first appeared in Data Science Briefings, the DataMiningApps newsletter.
EURO 2016 takes place in France in June and July 2016. I developed a simulation model that predicts who will win the tournament. The model is based on: i) current form of the team, ii) mutual results between teams, iii) tournament performance of the team, iv) experience of the team, and v) market value of the team. I used the Poisson distribution for the simulation; this choice is justified by the goals scored per team per match during the previous five European Championships (1996-2012). The simulation selects France to beat Spain in the final in extra time after a 1-1 score in 90 minutes, and the Golden Boot winner will be Antoine Griezmann with a total of 5 goals. Compared with the Elo ratings and betting odds (e.g. Bwin and Unibet), the results seem to be quite accurate. The most evenly matched group is Group E, where the model selects Belgium, Italy and Ireland (as one of the best third-placed teams) to continue to the next round. The most decisive match of the tournament is the semifinal between France and Germany: according to the model, whoever wins it will win the trophy. Further improvements of the model should focus on optimizing the weights of the different features of the model.
As with every football tournament, fans all over the world take part in pools where small and big prizes are at stake. Everybody wants to be seen as a football expert, and these pools are often more about esteem than about the prizes. During the previous World Cup, companies such as Goldman Sachs and Deloitte tried to predict the winner and the results of the tournament. During EURO 2008 and the World Cup 2010 the oracle Paul the Octopus ("Der Tintenfisch Paul") was quite accurate in his predictions. Unfortunately, Paul died a few months after Spain had beaten the Netherlands in the World Cup 2010 final. By developing a simulation model I will attempt to fill the spot left by Oracle Paul.
This study is based on simulating all 51 matches that are played during EURO 2016. All these matches are simulated using the Poisson distribution, because the goals scored per team per match in the previous five European Championships (1996, 2000, 2004, 2008, 2012) follow this distribution closely. This means that I only need to estimate the expected number of goals that team x and team y score against each other in a particular match z. A favourable side effect is that the Poisson distribution is easy to simulate, so my computer needs relatively little computation power to draw the goals scored by team x and team y in match z a million times. This is exactly what I did!
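To make the idea concrete, here is a minimal sketch of such a match simulation in Python. It is not the original implementation: the function names, the assumption that the two teams score independently, and the example scoring rates are all illustrative assumptions. Each team's goals are drawn from a Poisson distribution with a given expected number of goals, and the outcomes are counted over many repetitions.

```python
import math
import random
from collections import Counter


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def simulate_match(lam_x, lam_y, n_sims=100_000, seed=0):
    """Simulate one match n_sims times.

    lam_x and lam_y are the expected goals of team x and team y
    (assumed inputs; the article does not state how they are scaled).
    Goals of the two teams are assumed to be independent.
    """
    rng = random.Random(seed)
    scores = Counter()
    for _ in range(n_sims):
        goals_x = sample_poisson(lam_x, rng)
        goals_y = sample_poisson(lam_y, rng)
        scores[(goals_x, goals_y)] += 1

    win_x = sum(c for (gx, gy), c in scores.items() if gx > gy) / n_sims
    draw = sum(c for (gx, gy), c in scores.items() if gx == gy) / n_sims
    win_y = sum(c for (gx, gy), c in scores.items() if gx < gy) / n_sims
    return {
        "win_x": win_x,
        "draw": draw,
        "win_y": win_y,
        "most_likely_score": scores.most_common(1)[0][0],
    }


if __name__ == "__main__":
    # Illustrative rates only, not estimates for any real team.
    print(simulate_match(1.5, 1.0))
```

The default of 100,000 repetitions keeps the sketch fast in pure Python; raising n_sims to a million, as in the article, only changes the running time.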
The model is based mainly on three features:
- Historical mutual results
I decided not to take all mutual results into account. I believe that a 2-1 victory of England against Wales in a friendly match played in England on January 18, 1879 is not very relevant for the current state of the England-Wales rivalry. All matches before January 1, 1990 are removed from the dataset.
- Current form of the team
All matches since January 2014 are taken into account, including World Cup 2014 matches and qualification matches for EURO 2016.
- Tournament performance of the team
The cut-off I applied to the mutual results applies to tournament performance as well: I only take tournaments into account starting with the World Cup 1990 in Italy.
Finally, I add team ratios, derived from the market value and experience features, to the model.
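The article does not spell out how the features are combined, so the following Python sketch is only one possible reading. It assumes that each feature yields an average number of goals per match, that these averages are blended with weights, and that the result is scaled by the team ratio. The data format, the function names and any weights passed in are hypothetical; only the two cut-off dates come from the text.

```python
from datetime import date

# Cut-off dates taken from the text above.
MUTUAL_CUTOFF = date(1990, 1, 1)  # mutual results before this date are dropped
FORM_CUTOFF = date(2014, 1, 1)    # "current form" uses matches since January 2014


def average_goals(matches, since):
    """Mean goals scored in matches played on or after `since`.

    `matches` is a list of (match_date, goals_scored) tuples
    (a hypothetical format). Returns None if no match qualifies.
    """
    goals = [g for d, g in matches if d >= since]
    return sum(goals) / len(goals) if goals else None


def expected_goals(feature_averages, weights, team_ratio=1.0):
    """Blend per-feature goal averages into one expected-goals value.

    Features without data (None) are skipped and the remaining weights
    are renormalised. The weighting scheme itself is an assumption,
    not the author's documented method.
    """
    available = {k: v for k, v in feature_averages.items() if v is not None}
    if not available:
        raise ValueError("no feature has any data")
    total_weight = sum(weights[k] for k in available)
    blended = sum(weights[k] * v for k, v in available.items()) / total_weight
    return blended * team_ratio
```

The value returned by expected_goals could then serve as the Poisson rate for one team in one match. Choosing the weights is exactly the open question the article lists as future work.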
Compared to the Elo ratings and Bwin betting odds, the simulation model I developed seems pretty accurate. The biggest difference between my model and the benchmarks is that my model selects France as the winner of the semifinal against Germany and as the eventual champion, while the benchmarks pick Germany to win the semifinal and ultimately bring the trophy to Berlin.
This also has an impact on the Golden Boot winner, Antoine Griezmann. He will need all the matches to score his goals; if Germany wins the semifinal, the chances of Thomas Müller will increase dramatically. Either way, it looks like a crucial match, and one I am very much looking forward to! Personally I would always expect the Germans to beat the French in a football match.
It is interesting to see that Spain and France both have a relatively easy route to the final: in the simulation Spain has to beat Poland and, in the quarterfinals, Ukraine, while France has to win against Ireland and Austria. France's group stage is relatively easy as well, with Switzerland, Romania and Albania as opponents. Spain is likely to score many goals (10 in the simulation), but does not have a striker like Raúl, David Villa or Fernando Torres (in form) at the moment. Álvaro Morata had a poor season scoring-wise at Juventus, so there is a place to be filled by another Spaniard: Aritz Aduriz. Who?! Exactly. A 35-year-old Basque playing for Athletic Bilbao.
Another remarkable observation is the young age of the French team; historically, inexperienced teams do not win trophies so easily. In the win ratio for the final against Spain you can see that France has a minimal lead of 0.02.
Finally, I would like to wish you a lot of joy watching EURO 2016. As for myself, I hope this simulation model will be as accurate as the predictions of Tintenfisch Paul, in which case I'd like to call the model Oracle Paul 2.0.