AI in Football — Biases in Expected Goals (xG) Models Confound Finishing Ability: Part 1
Author(s): Saankhya Mondal
Originally published on Towards AI.
This post will discuss the findings of the “Biases in Expected Goals Models Confound Finishing Ability” paper, published by researchers at the KU Leuven Institute for Artificial Intelligence, Belgium.
Part 1 discusses the different biases that arise in Expected Goals (xG) models when these models are used to judge the finishing prowess of a football player. The authors performed simulation studies and statistical analysis in their work. Part 2 will discuss multi-calibration methods that can be used to mitigate biases in xG models.
What are Expected Goals?
Expected Goals (xG) is a popular statistic that measures the quality of a goal-scoring chance. xG is the likelihood that a shot results in a goal, or equivalently, the probability that an average football player would score if they took that same shot. Intuitively, a penalty kick or a tap-in has a high xG, whereas a shot from outside the box with multiple defenders to beat has a low xG. In machine learning terms, xG is the output of a binary classification model: given the attributes of a shot, we predict whether or not it will result in a goal. One can train a logistic regression model on shot data containing features such as the coordinates of the location from which the shot was taken and the body part used, with a binary label indicating whether the shot was a goal.
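As a rough illustration (not the authors’ exact pipeline), a minimal xG model along these lines could look like the sketch below; the file name shots.csv, its column names, and the pitch dimensions are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical shot data: one row per shot, with pitch coordinates in metres,
# the body part used, and whether the shot resulted in a goal.
shots = pd.read_csv("shots.csv")  # assumed columns: x, y, body_part, is_goal

# Derive distance and (approximate) angle to the centre of the goal,
# assuming a 105m x 68m pitch with the goal centre at (105, 34).
GOAL_X, GOAL_Y = 105.0, 34.0
shots["distance"] = np.hypot(GOAL_X - shots["x"], GOAL_Y - shots["y"])
shots["angle"] = np.abs(np.arctan2(GOAL_Y - shots["y"], GOAL_X - shots["x"]))

# One-hot encode the body part and fit a logistic regression classifier.
X = pd.get_dummies(shots[["distance", "angle", "body_part"]], columns=["body_part"])
y = shots["is_goal"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
xg_model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# The predicted probability of the positive class is the shot's xG.
xg_values = xg_model.predict_proba(X_test)[:, 1]
```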
Over a period (a match or a season), cumulative xG may be a more insightful metric for evaluating the quality of a player’s chances than the raw number of goals scored. Fans often judge a player’s finishing by comparing goals scored against cumulative xG. For a single shot, GAX (Goals Above Expectation) is the difference between the binary outcome (goal/no goal) and the shot’s xG. Over a period, GAX is the difference between the player’s actual goal tally and the total xG accumulated from their shots.
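As a quick sanity check of the definition, here is a minimal helper for cumulative GAX (the numbers in the example are made up):

```python
import numpy as np

def cumulative_gax(goals, xg):
    """Goals Above Expectation: actual goals minus total expected goals.

    goals -- array of binary shot outcomes (1 = goal, 0 = no goal)
    xg    -- array of xG values for the same shots
    """
    goals = np.asarray(goals, dtype=float)
    xg = np.asarray(xg, dtype=float)
    return goals.sum() - xg.sum()

# Example: three shots, one of which was a goal
print(cumulative_gax([0, 1, 0], [0.05, 0.30, 0.12]))  # 1 - 0.47 = 0.53
```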
A player is termed a good finisher if he has overperformed his xG (positive GAX) over a given period. For example, Lionel Messi scored 375 goals in five seasons (from 2015/16 to 2020/21) from 1862 shots and accumulated an xG of 247 (according to the StatsBomb model) from these shots. Thus, he overperformed his xG by an incredible amount of 127 (GAX = 127)! No wonder he’s considered one of the best finishers in the world. xG or GAX can also be used to evaluate a team’s cumulative ability to generate goal-scoring chances in a match or over a season.
Limitations
However, the paper points out limitations of the xG statistic, especially when it is used to evaluate a player’s finishing ability. The authors provide three explanations for why xG models fail to capture finishing ability, using shot data from players in the top five European leagues. They performed simulation studies and statistical analysis of the results to arrive at the following hypotheses.
Hypothesis 1: Limited sample sizes, high variances, and small variations in skill between players make GAX a poor metric for measuring finishing skills.
GAX is a poor metric for measuring finishing skill. Player shot data has a limited sample size and high variance, and players have very different shooting profiles: defenders typically take far fewer shots than forwards, and even among forwards there is huge variation in quality.
The authors claim that finishing quality and shot volume together determine whether a player overperforms their xG over a season. Because of the limited sample sizes and high variance, players are unlikely to overperform their cumulative xG unless they can pair exceptional finishing with a high shot volume, and this is even more difficult to maintain over multiple seasons.
The authors ran a simulation experiment with two variables, α and n, to gauge the order of magnitude of the sample size required. They computed the probability P[G > xG] that a player who is an α% better finisher than average and takes n shots in a season would overperform their cumulative xG. This was done to estimate how many shots (n) a player with a certain finishing quality (α% better than average) has to take to achieve a positive GAX (G > xG). The process is as follows.
- They trained an xG model using logistic regression on the 2015/16 season data containing around 43K shots. For simplicity, the authors considered the location and body part as features. They derived features such as the coordinates of the location, the angle from the center of the goal, and the distance from the center of the goal.
- Then, using the same dataset, they divided the pitch into 1m × 1m grid cells and computed the proportion of shots taken from each cell.
- Thirdly, they sampled the locations of n shots (n ∈ {25, 50, 75, 100, 125, 150}) from this distribution. For each sampled shot, they extracted the features and obtained its xG from the trained model. For a player who is an α% better finisher than average (α ∈ {0, 5, 10, 15, 25}), the adjusted xG of the shot is (1 + α/100) × xG.
- For each combination of n and α, they sampled an observed outcome (goal/no goal) for every shot, using the adjusted xG as the probability of scoring, and computed the cumulative GAX against the model’s original xG. This simulates what would happen if the player’s finishing ability differed from the average. For robustness, the sampling was repeated 10,000 times.
Suppose n = 100 and α = 25, and we sample a shot with an xG of 0.7. The adjusted xG is (1 + 25/100) × 0.7 = 0.875, so we draw a uniform random number between 0 and 1 and count the shot as a goal if the draw is below 0.875. The shot’s GAX contribution is measured against the model’s xG: 1 − 0.7 = 0.3 if it is a goal, and 0 − 0.7 = −0.7 otherwise. Summing these contributions over all 100 shots gives the cumulative GAX.
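A rough Monte Carlo sketch of this procedure is shown below. It assumes an array of xG values for shots sampled from the empirical location distribution (for example, the predictions of the model sketched earlier); the clipping of the adjusted xG at 1 is an added assumption, not a detail from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def prob_overperform(shot_xg_pool, n_shots, alpha, n_sims=10_000):
    """Estimate P[G > xG] for a player who is alpha% better than average
    and takes n_shots in a season."""
    overperform = 0
    for _ in range(n_sims):
        xg = rng.choice(shot_xg_pool, size=n_shots, replace=True)  # sample shot xG values
        adj_xg = np.clip((1 + alpha / 100) * xg, 0, 1)             # better finisher converts more often
        goals = rng.random(n_shots) < adj_xg                       # sample goal / no-goal outcomes
        gax = goals.sum() - xg.sum()                               # GAX against the model's (average) xG
        overperform += gax > 0
    return overperform / n_sims

# Hypothetical usage with the xG values from the model sketched earlier:
# print(prob_overperform(xg_values, n_shots=100, alpha=25))
```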
The following heatmap shows the results.
The results show that players are unlikely to consistently overperform their cumulative xG unless they pair exceptional finishing with a high shot volume. As the heatmap on the right shows, P[G > xG] declines sharply when computed over multiple seasons for players with low or average shot volumes.
Hypothesis 2: Including all shots when computing GAX is incorrect and obscures finishing ability.
Including every shot in the cumulative xG total is not appropriate. Shots that result in goals via a deflection do not reflect the finishing skill of the player who took them, and neither do speculative efforts (such as long-range shots) that happen to go in.
The authors illustrate it by giving the example of Riyad Mahrez. Mahrez took 289 shots during five seasons between 2017 and 2022. 17 (5.9%) of them were deflected. If we include these deflected shots while computing GAX, he achieves a GAX of 14.61. If we discard the deflected shots, he overperforms his xG by only 9.03 goals.
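As a toy illustration (the shots and column names below are invented, not Mahrez’s actual data), excluding deflections before computing GAX is a simple filter:

```python
import pandas as pd

# Invented sample of one player's shots with a deflection flag.
player_shots = pd.DataFrame({
    "is_goal":      [1, 0, 1, 0, 1],
    "xg":           [0.12, 0.05, 0.43, 0.08, 0.30],
    "is_deflected": [True, False, False, False, True],
})

gax_all = player_shots["is_goal"].sum() - player_shots["xg"].sum()
clean = player_shots[~player_shots["is_deflected"]]          # drop deflected shots
gax_clean = clean["is_goal"].sum() - clean["xg"].sum()
print(gax_all, gax_clean)
```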
Given the xG values of a set of shots, the number of goals scored follows a Poisson binomial distribution. The following figure shows the probability distribution of the number of goals scored, given the xG values of Mahrez’s shots.
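The full Poisson binomial distribution over the goal count can be computed exactly with a small dynamic programme over the shots’ xG values; the sketch below is a generic implementation rather than code from the paper.

```python
import numpy as np

def goal_count_distribution(xg_values):
    """Exact Poisson binomial distribution of the number of goals,
    given independent shots with the supplied xG values."""
    dist = np.array([1.0])             # P(0 goals) = 1 before any shot
    for p in xg_values:
        new = np.zeros(len(dist) + 1)
        new[:-1] += dist * (1 - p)     # shot missed: goal count unchanged
        new[1:]  += dist * p           # shot scored: goal count + 1
        dist = new
    return dist                        # dist[k] = P(k goals)

# Example with four made-up shots:
print(goal_count_distribution([0.7, 0.1, 0.3, 0.05]))
```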
Hypothesis 3: Interdependencies in the data bias GAX.
The assumption that xG captures the likelihood of the “average player” finishing a shot is violated because of the overrepresentation of shots from good finishers. In addition, there might be biases because players may have shots that appear in both training and test data. Consider three cases —
- Data only contains shots from top-quality finishers like Messi and Cristiano Ronaldo.
- Data only contains shots from players who typically take a large number of attempts and tend to underperform their xG.
- Data contains equal representation of every type of finisher.
When we compute Messi’s GAX in the first case, we would expect a low value: the model’s xG estimates are calibrated to excellent finishers, so they are inflated relative to the average player. Messi’s GAX would be much higher in the second case, as the players in the training data usually underperform their xG and the model’s estimates are correspondingly deflated. In the third case, Messi’s GAX would lie somewhere in the middle. The authors ran an experiment to see how Messi’s GAX varies when an xG model is trained on different sets of finishers.
- They began with the original dataset containing shots from the 2015/16 season and excluded Messi’s shots. They trained an xG model and computed Messi’s GAX, which was 127.9.
- They performed data augmentation by adding 0 to 5000 shots for each cohort of players. In other words, they created a new dataset by sampling n shots and drawing their outcomes with probability (1 + α/100) × xG (following the same process as in Hypothesis 1), then adding them to the original data.
- For each n and α, they trained an xG model on the new dataset and recomputed Messi’s GAX. They repeated the experiment 100 times for each case.
Messi’s GAX decreased to 120.8 when shots taken by top finishers were more heavily represented in the training data. This experiment shows how the composition of the training data can affect a player’s GAX.
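Below is a hypothetical sketch of such an augmentation loop; X_pool, base_xg_pool, and the retraining details are my assumptions for illustration, not the authors’ implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

def augment_and_retrain(X_base, y_base, X_pool, base_xg_pool, n_extra, alpha):
    """Add n_extra synthetic shots from a cohort of finishers who are alpha%
    better than average, then retrain the xG model on the augmented data."""
    idx = rng.choice(len(X_pool), size=n_extra, replace=True)
    X_extra = X_pool[idx]
    adj_xg = np.clip((1 + alpha / 100) * base_xg_pool[idx], 0, 1)
    y_extra = (rng.random(n_extra) < adj_xg).astype(int)  # outcomes reflect the cohort's skill
    X_aug = np.vstack([X_base, X_extra])
    y_aug = np.concatenate([y_base, y_extra])
    return LogisticRegression(max_iter=1000).fit(X_aug, y_aug)

# Hypothetical usage: retrain with 5000 extra shots from 25%-better finishers and
# recompute Messi's GAX (messi_X and messi_goals are placeholder arrays).
# model = augment_and_retrain(X_base, y_base, X_pool, base_xg_pool, n_extra=5000, alpha=25)
# messi_gax = messi_goals.sum() - model.predict_proba(messi_X)[:, 1].sum()
```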
The authors performed another experiment to better understand how variations in the number of shots players take and in their finishing skill bias xG models.
- They generated a training set of 1000k shots by sampling shots and drawing their outcomes with probability (1 + α/100) × xG, with skill levels α ∈ {−5, 0, 10, 20}.
- They considered three different allocations of shots across these skill levels, forming three datasets, and trained an xG model on each.
- Next, they estimated the effect of the skill distribution in the training data on the GAX of a player who is α% better than average (α ∈ {−5, 0, 10, 20}) and takes n ∈ {75, 100, 125} shots in a season. They sampled n shots, drew their outcomes with probability (1 + α/100) × xG (following the same process as in Hypothesis 1), and computed each shot’s xG under all three models, repeating this 10,000 times for each combination of α and n.
- For each sample, they computed the observed GAX and compared it to the average-player GAX given by the ground-truth model.
When the training data contains more shots from excellent finishers, the GAX decreases for players who are excellent finishers. From these two experiments, it is clear that the xG model overestimates the finishing of a poor finisher and underestimates the finishing of an excellent finisher.
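To make that comparison concrete, here is a hedged sketch of how the bias could be measured: simulate a finisher of a given skill level and compute the average GAX each trained model would assign. The names models, X_pool, and true_xg_pool (the ground-truth average-player conversion probabilities) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

def mean_simulated_gax(model, X_pool, true_xg_pool, n_shots, alpha, n_sims=10_000):
    """Average GAX recorded by a simulated finisher who is alpha% better than
    average and takes n_shots per season, when judged by a given xG model."""
    gaxes = []
    for _ in range(n_sims):
        idx = rng.choice(len(X_pool), size=n_shots, replace=True)
        adj = np.clip((1 + alpha / 100) * true_xg_pool[idx], 0, 1)
        goals = (rng.random(n_shots) < adj).sum()                   # simulated goal tally
        gaxes.append(goals - model.predict_proba(X_pool[idx])[:, 1].sum())
    return float(np.mean(gaxes))

# Hypothetical usage: `models` maps each skill allocation to its trained xG model.
# for name, m in models.items():
#     print(name, mean_simulated_gax(m, X_pool, true_xg_pool, n_shots=100, alpha=20))
```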
Now the question is: how can the problem be solved? The authors draw an interesting parallel to work on fairness in AI and propose a better way to evaluate players with the xG metric: they learn an xG model that is calibrated on multiple subgroups of players. Part 2 will cover the multi-calibration techniques used.
References: “Biases in Expected Goals Models Confound Finishing Ability”, KU Leuven Institute for Artificial Intelligence, Belgium.
Thank you for reading!