The quadrennial World Cup kicks off this week. If you are a fake football fan like me, here are the basics to know first:

  1. The World Cup is being held in Russia
  2. A total of 32 teams are divided into 8 groups, and the top 2 teams in each group advance to the knockout round
  3. The competition lasts a month
  4. A total of 64 games will be played in various Russian cities
  5. There is no Chinese team because it did not qualify
  6. No Italy and no Netherlands, because they didn’t qualify either
  7. There is no Barcelona, Real Madrid, Manchester United, Bayern…
  8. Regulation time is two halves of 45 minutes each
  9. In the knockout stage, a match still tied after 30 minutes of extra time goes to a penalty shootout
  10. Messi is from Argentina, Cristiano Ronaldo is from Portugal and Neymar is from Brazil; none of them is Spanish
  11. Ronaldo, Ronaldinho, Kaka and Beckham are not taking part
  12. The World Cup is a football tournament, so no Harden, Curry or Durant


Every World Cup, one of the regular features is predicting the winner: all kinds of experts, pundits, octopuses, cats and dogs take a turn. This time, let me make a prediction too. But what if I don’t understand football? It’s okay, I can write a program! (Or so I keep telling myself.)


The data source

The data, from Kaggle, covers 38,929 international matches from 1872 to 2018. We’ll use it as the basis for our prediction.


International football results from 1872 to 2021

How to get it is explained at the end of the article.

A quick word on Kaggle: it is a data science competition platform, and I strongly recommend it to anyone interested in data analysis or machine learning.
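For readers who want to follow along, here is a minimal sketch of reading match rows and dropping everything before a chosen starting year. The column names (`date`, `home_team`, `away_team`, `home_score`, `away_score`) and the `YYYY-MM-DD` date format are assumptions about the CSV layout, and `Match` and `load_matches` are names invented for this illustration, so check them against the file you actually download.

```python
import csv
import io
from typing import NamedTuple


class Match(NamedTuple):
    year: int
    home: str
    away: str
    home_goals: int
    away_goals: int


def load_matches(lines, start_year):
    """Read match rows and keep those played in start_year or later."""
    matches = []
    for row in csv.DictReader(lines):
        # Assumption: rows without a numeric final score are skipped.
        if not (row["home_score"].isdigit() and row["away_score"].isdigit()):
            continue
        year = int(row["date"][:4])  # assumes dates look like 2018-06-14
        if year >= start_year:
            matches.append(Match(year, row["home_team"], row["away_team"],
                                 int(row["home_score"]), int(row["away_score"])))
    return matches


# A few illustrative rows in the assumed format, standing in for the real file.
SAMPLE = """date,home_team,away_team,home_score,away_score
1872-11-30,Scotland,England,0,0
2013-06-02,Brazil,England,2,2
2017-11-14,England,Brazil,0,0
"""

recent = load_matches(io.StringIO(SAMPLE), start_year=2012)
print(len(recent))  # 2 of the 3 sample rows are from 2012 or later
```

With the real data you would pass an open file instead of the `io.StringIO` sample, for example `open("results.csv", newline="", encoding="utf-8")`, where the file name is again an assumption.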


Build a model

With all this historical match data, how do you make a prediction? I set up the following rules:

  1. Data that is too old says little about the current teams, so set a starting year
  2. Find the matches between the two sides from the starting year to now, and calculate the probability of victory = (wins + draws / 2) / total games
  3. In the group stage, a team wins if its probability of victory exceeds a certain threshold (say, 0.7); otherwise the match is a draw
  4. In the knockout stage, the team with the higher probability wins
  5. If the two teams have not played each other since the starting year, extend the window back by N more years (this mostly happens with teams that play fewer games)
  6. If they still haven’t played each other, calculate each team’s probability of victory from its record against all the other teams in the tournament. The team with the higher probability wins; in a group match, however, the difference in probability must exceed a certain threshold (say, 0.1), otherwise it is a draw
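Rules 2 to 4 can be sketched roughly as follows. This is an illustration rather than the code behind the article: the function names, the tuple layout of a match and the handling of an exact 50/50 knockout tie (the first team is picked) are all assumptions, and the fallbacks in rules 5 and 6 are left out.

```python
def win_probability(team, opponent, matches):
    """Rule 2: (wins + draws / 2) / total games for `team` against `opponent`.

    Each match is a (home, away, home_goals, away_goals) tuple.
    Returns None when the two sides have no games in `matches`.
    """
    wins = draws = total = 0
    for home, away, home_goals, away_goals in matches:
        if {home, away} != {team, opponent}:
            continue  # not a meeting between these two sides
        total += 1
        if home == team:
            team_goals, opp_goals = home_goals, away_goals
        else:
            team_goals, opp_goals = away_goals, home_goals
        if team_goals > opp_goals:
            wins += 1
        elif team_goals == opp_goals:
            draws += 1
    return (wins + draws / 2) / total if total else None


def predict(team_a, team_b, matches, knockout=False, threshold=0.7):
    """Rules 3 and 4: return the predicted winner, or None for a draw."""
    p_a = win_probability(team_a, team_b, matches)
    if p_a is None:
        raise ValueError("no head-to-head games; rules 5 and 6 are not sketched here")
    p_b = 1 - p_a  # the two probabilities always sum to 1
    if knockout:
        # Assumption: an exact tie goes to team_a; the rules do not cover it.
        return team_a if p_a >= p_b else team_b
    if p_a > threshold:
        return team_a
    if p_b > threshold:
        return team_b
    return None


# Placeholder scores chosen to give England 1 win, 2 draws and 0 losses.
games = [("England", "Brazil", 2, 1),
         ("Brazil", "England", 2, 2),
         ("England", "Brazil", 0, 0)]

print(win_probability("England", "Brazil", games))    # (1 + 2 / 2) / 3
print(predict("England", "Brazil", games))            # group stage: draw
print(predict("England", "Brazil", games, knockout=True))
```

With a record of 1 win and 2 draws the probability is 2/3, which is below the 0.7 threshold, so the group-stage call is a draw while the knockout call goes to England.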


Schedule simulation

Based on the rules above, I imported the data and simulated the 64 matches between the 32 teams of this World Cup with a Python program.


This “predicted” the outcome of the tournament.
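As an illustration of the group-stage part of such a simulation, the sketch below plays every pairing in a group once, awards 3 points for a win and 1 for a draw, and takes the top two. The `predict` callback, the team names and the toy strength numbers are invented for the example, and real tie-breakers such as goal difference are not modelled.

```python
from itertools import combinations


def group_standings(teams, predict):
    """Play every pairing once and rank teams by points (3 win, 1 draw).

    `predict(a, b)` is a hypothetical callback returning the winner's
    name, or None for a draw.
    """
    points = {team: 0 for team in teams}
    for a, b in combinations(teams, 2):
        winner = predict(a, b)
        if winner is None:
            points[a] += 1
            points[b] += 1
        else:
            points[winner] += 3
    # sorted() is stable, so teams level on points keep their input order;
    # real tie-breakers are not modelled here.
    return sorted(teams, key=lambda team: points[team], reverse=True)


# Made-up strengths, used only to drive the demo.
strength = {"A": 4, "B": 3, "C": 2, "D": 1}


def toy_predict(a, b):
    # Invented rule: a strength gap of 2 or more wins, otherwise a draw.
    if abs(strength[a] - strength[b]) < 2:
        return None
    return a if strength[a] > strength[b] else b


table = group_standings(["A", "B", "C", "D"], toy_predict)
print(table[:2])  # the top two advance: ['A', 'B']
```

Each group of four produces 6 matches, so the 8 groups account for 48 of the 64 games; the remaining 16 are knockout matches.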


Predicted results

So what does running the code actually produce?

Different starting years and thresholds give different results. I tried 11 starting years, from 2006 to 2016, combined with 4 settings of N, for a total of 44 simulated tournaments. The number of times each team won the title:

Brazil: 23 times

Spain: 12 times

Germany: 6 times

England: 3 times
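A sweep like this, 11 starting years times 4 settings for 44 runs, can be tallied with a counter. In the sketch below `simulate` is a hypothetical callback standing in for the full tournament simulation, the N values `(1, 2, 3, 4)` are placeholders because the article does not list the ones actually used, and the stub's results are invented purely to make the example run.

```python
from collections import Counter
from itertools import product


def tally_champions(simulate, start_years, extra_years_options):
    """Run simulate(start_year, extra_years) for every parameter pair
    and count how often each team comes out as champion."""
    return Counter(simulate(year, n)
                   for year, n in product(start_years, extra_years_options))


def fake_simulate(start_year, extra_years):
    # Stand-in for a full 64-match simulation; this rule is invented.
    return "Brazil" if (start_year + extra_years) % 2 == 0 else "Spain"


counts = tally_champions(fake_simulate, range(2006, 2017), (1, 2, 3, 4))
print(sum(counts.values()))  # 11 starting years x 4 settings = 44 runs
```

Calling `counts.most_common()` then lists the champions from most to least frequent.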

Brazil, it seems, is still the undisputed favorite. No wonder betting websites are offering them the lowest odds.


But Brazil aside, England performed exceptionally well in my results. This is mainly due to their good record against Brazil in recent years: 1 win, 2 draws and 0 losses. Argentina, by contrast, is probably out of the running.

In addition, Senegal and Iran are worth watching: they have good records against the other teams in recent years and could be dark horses. Since 2012:

Senegal: 4 wins, 3 draws, 1 loss

Iran: 5 wins, 6 draws, 3 losses

Historical record query tool

Of course, my model is very crude. But the ball is round, and predicting results from historical data is just a bit of fun. If you have rules of your own to try, you can modify my code. How to get the code and data is explained at the end of the article.

In addition, I exported some of the data to build an online query tool, so you can look up the head-to-head history of any two teams directly.

Click to enter: Online search tool for historical records



You can choose different starting years. I also added a set of “odds” calculations, for reference.

Home team’s combined win odds = total games / (home team’s wins + visiting team’s losses)
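As a small worked example of this formula, the sketch below plugs in the Senegal and Iran records quoted earlier. The function name is invented, and taking "total games" to mean the sum of both teams' games is an assumption, since the article does not spell it out.

```python
def combined_odds(total_games, home_wins, away_losses):
    """Total games / (home team's wins + visiting team's losses)."""
    favourable = home_wins + away_losses
    if favourable == 0:
        return float("inf")  # no favourable results at all
    return total_games / favourable


# Records since 2012 from the article: Senegal 4-3-1, Iran 5-6-3.
senegal_games = 4 + 3 + 1
iran_games = 5 + 6 + 3
total = senegal_games + iran_games  # assumption: both teams' games combined

# Senegal as the "home" side: its 4 wins plus Iran's 3 losses.
print(round(combined_odds(total, home_wins=4, away_losses=3), 2))  # 3.14
# Iran as the "home" side: its 5 wins plus Senegal's 1 loss.
print(round(combined_odds(total, home_wins=5, away_losses=1), 2))  # 3.67
```

A lower figure means the record favours that side more, in the same way that shorter betting odds mark the favorite.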

This odds model draws on a larger set of historical records, and strong teams mostly face other strong teams while weak teams mostly face weak ones. As a result, the gaps between the odds are smaller than in the betting market, but on the whole they still match who is expected to win and lose. If my figure for a match differs a lot from what others are offering, it could point to an upset.


The forecast results are for reference only. Any similarity is purely coincidental.


Finally, a thought suddenly struck me: what is our national team’s record against these 32 teams? What would happen if, in some parallel universe, it were lucky enough to take part? So…

Since 2014: 2 wins, 5 draws, 8 losses

Since 2002: 8 wins, 19 draws, 35 losses


Panama, which has never played before, seems to be the only team we could match.


Okay, never mind. Let’s just enjoy the World Cup!


The data and code used in this article can be downloaded from Crossin’s programming classroom: reply with the keyword “The World Cup”.



Welcome to search and follow: Crossin programming classroom