Testing Predictive Metrics for the Premier League

In this post I look in detail at the predictive power of various metrics in the Premier League, to see which is best at predicting future performance at each stage of the season.

The metrics I am testing are as follows:

PTS = Points scored
TSR = Total Shot Ratio
SOTR = Shots on Target Ratio
GR = Goal Ratio
DGR = Deserved Goals Ratio
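
All of the ratio metrics are built the same way: a team's total for a statistic divided by the combined total for that team and its opponents. Here's a minimal sketch with made-up numbers; DGR follows the same pattern, just using Deserved Goals values in place of the raw counts.

```python
def ratio(for_total, against_total):
    """A team's share of the combined total for a given statistic."""
    return for_total / (for_total + against_total)

# Example: one team's totals after 10 games (made-up numbers)
tsr  = ratio(for_total=140, against_total=110)   # Total Shot Ratio
sotr = ratio(for_total=48,  against_total=40)    # Shots on Target Ratio
gr   = ratio(for_total=16,  against_total=12)    # Goal Ratio

print(round(tsr, 3), round(sotr, 3), round(gr, 3))   # 0.56 0.545 0.571
```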

Test 1

For the first test, I want to imagine that we know nothing about each team at the start of the season.

Using data from the last 15 seasons, I will see how well each metric predicts points scored in the remaining fixtures. For example, after 10 games played I will see how well the metrics for the first 10 games predict points scored in games 11 to 38.

Note: To be consistent with a test I will perform later in the post, I am just looking at the 17 teams in each season who played in the season before. That means that the sample is 17*15, so 255 full seasons.

There are 2 methods of testing predictive power: the first is looking at the correlation (R^2), and the second is looking at the mean absolute error per game (MAE). Let’s start with correlation, where higher numbers are better.
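
As a quick aside before the charts: for a single metric at a single cut-off point, both measures are computed along the lines of the sketch below, with a simple linear fit from the metric to future points per game generating the predictions.

```python
import numpy as np
from scipy import stats

def evaluate_metric(metric_values, future_ppg):
    """Compare a metric measured after N games with points per game in the
    remaining fixtures, returning R^2 and the mean absolute error per game.

    metric_values: one value per team-season (e.g. TSR after 10 games)
    future_ppg:    points per game actually scored in games N+1 to 38
    """
    metric_values = np.asarray(metric_values, dtype=float)
    future_ppg = np.asarray(future_ppg, dtype=float)

    # R^2 of a simple linear fit from the metric to future points per game
    slope, intercept, r_value, _, _ = stats.linregress(metric_values, future_ppg)
    r_squared = r_value ** 2

    # Average error per game when that fit is used as the prediction
    predictions = intercept + slope * metric_values
    mae = float(np.mean(np.abs(predictions - future_ppg)))

    return r_squared, mae
```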

[Chart: R^2 for each metric against games played]

So, if we don’t know anything about the teams at the start of the season, this test would indicate that we should use DGR for the majority of the season, although SOTR takes the lead briefly between 26 and 29 games played. TSR is slightly worse than SOTR pretty much all season, and PTS and GR start off badly and never catch up with the rest.

Whilst using R^2 to test predictive power is widespread, a better test is to look at the average error per game when using each metric to predict future points.

To make the graph easier to interpret, I have shown the results relative to the results for PTS. On this graph, low is good.

[Chart: average error per game for each metric, relative to PTS, against games played]

We can see from this that the average errors agree with the R^2 results: DGR still dominates for most of the season, and SOTR takes the lead for a brief period.

 

However, there is a problem with this approach. We don’t start the season knowing nothing about the teams.

Goal Ratio gets a bad result in the above test, but as examined in a previous post we know that GR in one season correlates more strongly with points in the following season than TSR does, so it must have decent predictive power given a larger sample.

Test 2

Let’s repeat the above test, but instead of assuming we know nothing at the start of the season, let’s be more realistic and start off with the previous season’s data, and then use a 38-game rolling score as the teams progress through the season.
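
To be concrete about the rolling window, here's a minimal sketch for the ratio metrics (made-up numbers; points per game can be rolled up in the same way):

```python
def rolling_38_ratio(prev_season_games, current_season_games):
    """Ratio over the most recent 38 games: the current season's games plus
    enough of the end of the previous season to fill the 38-game window.

    Each game is a (for_total, against_total) pair for the statistic in
    question (shots, shots on target, goals, ...).
    """
    window = (prev_season_games + current_season_games)[-38:]
    total_for = sum(f for f, a in window)
    total_against = sum(a for f, a in window)
    return total_for / (total_for + total_against)

# Example: 38 games of last season plus 10 games of this season (made-up data)
last_season = [(14, 11)] * 38
this_season = [(15, 9)] * 10
print(round(rolling_38_ratio(last_season, this_season), 3))   # 0.577
```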

Here are the R^2 results, with our previous results greyed out for comparison. High is good.

[Chart: R^2 for each metric using previous-season data, with the Test 1 results greyed out]

Obviously, our early season results are much better. In addition, the previously poor GR redeems itself, being a better early season predictor than both shot-based metrics, which fall behind PTS in the same period. DGR still dominates, and SOTR still takes a brief lead later on. Importantly, almost all these results are better than just using in-season data.

Again, I’ve done the average errors test relative to the results for PTS. Low is good.

[Chart: average error per game relative to PTS, using previous-season data]

Once more, this tells a similar story to the R^2 values. DGR takes an early lead over GR, and is overtaken by SOTR for a brief period. The later stages of the season are a mix of DGR and the shot-based metrics.

In summary then, DGR is the best early-season predictor, both when limited to in-season data and when using data from the previous season. SOTR is a good metric for the later stages, although DGR does well here too. GR is a strong metric, but only worth using with access to results from the previous season, as it takes too long to pick up a signal within a season.

Full results are available on Google Sheets here: Google Sheets Results.

To finish off, what does this mean for the current season?

Well, with 6 games played, a rolling 38-game Deserved Goals Ratio has a higher correlation and lower average error than any other metric tested here. If you can’t be bothered to calculate that, a rolling 38-game Goal Ratio is not far behind.

If you insist on only looking at this season, DGR is still the best after 6 games, followed by Shots on Target Ratio. However, these in-season metrics are significantly worse than longer-term approaches, so it’s worth taking the time to look a bit further back for information about the relative strength of the teams.

As an example, points scored in the first 6 games correlates with points scored in games 7-38 with an R^2 of 0.329, and an average error of 0.296 points per game.

38-game DGR after the first 6 games correlates with an R^2 of 0.670, and an average error of 0.202 points per game.

Follow me on Twitter @8Yards8Feet

The idea for these graphs comes from 11tegen11, whose blog is well worth a read.

The Natural Limits of Predictions (Part 2)

In my last post, I calculated a value of 6.5 points as the minimum average absolute error a model can reliably achieve when predicting a season’s worth of points for each team in the Premier League.

However, this was based on a simple simulation of a league full of evenly matched teams, with an adjustment for home advantage. It is therefore not particularly accurate, as not all teams are equally matched.

To simulate more realistically, we need a representative spread of the abilities of the teams. As a proxy for this, I have taken the average points scored by position over the last 16 seasons, on the assumption that this will produce a fairly accurate distribution of the teams’ abilities.

[Chart: average points scored by final league position over the last 16 seasons]

Previously, we used the average Home/Draw/Away percentages and applied them to an average team. This time, we need to adjust these percentages based on the relative strength of teams in each fixture.

Looking at historic results, we can see the relationship between final points in a season, and individual results within that season. Based on this, we can calculate formulas which convert the relative strength of 2 teams into Home/Draw/Away percentages for individual matches.

We can then run a simulation of a full 38-game season, with the probabilities for each match being calculated using the above formulas.

If we run this simulation many times, we get a large sample of simulated points, and we can take the average absolute deviation from the mean for each team.
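
The structure of the simulation looks something like the sketch below. The strength values and the strength-to-probability conversion here are illustrative stand-ins (a simple linear shift around the league-average percentages with a made-up sensitivity), not the fitted formulas, so its output will not exactly match the figure quoted next.

```python
import random

# Stand-ins for average points by final league position
# (the real simulation uses the averages from the last 16 seasons)
TEAM_STRENGTHS = [86, 77, 71, 67, 63, 60, 57, 54, 52, 50,
                  48, 46, 44, 42, 41, 39, 38, 36, 33, 28]

# League-average outcome probabilities (home win, draw, away win)
BASE_HOME, BASE_DRAW, BASE_AWAY = 0.463, 0.258, 0.279

def match_probs(home_strength, away_strength):
    """Illustrative conversion of relative strength (difference in expected
    season points) into home/draw/away probabilities: a linear shift around
    the league averages, with a made-up sensitivity of 0.006."""
    shift = (home_strength - away_strength) * 0.006
    p_home = min(max(BASE_HOME + shift, 0.05), 0.90)
    p_away = min(max(BASE_AWAY - shift, 0.05), 0.90)
    p_draw = 1.0 - p_home - p_away
    return p_home, p_draw, p_away

def simulate_season():
    """Simulate one season: every team plays every other team home and away."""
    points = [0] * len(TEAM_STRENGTHS)
    for h, home_strength in enumerate(TEAM_STRENGTHS):
        for a, away_strength in enumerate(TEAM_STRENGTHS):
            if h == a:
                continue
            p_home, p_draw, _ = match_probs(home_strength, away_strength)
            r = random.random()
            if r < p_home:
                points[h] += 3
            elif r < p_home + p_draw:
                points[h] += 1
                points[a] += 1
            else:
                points[a] += 3
    return points

# A few thousand simulated seasons is enough to see the pattern
# (the 5.7-point figure below comes from 25,000 runs of the fitted model)
n_sims = 2_000
results = [simulate_season() for _ in range(n_sims)]
n_teams = len(TEAM_STRENGTHS)
means = [sum(season[t] for season in results) / n_sims for t in range(n_teams)]
avg_dev = sum(abs(season[t] - means[t])
              for season in results for t in range(n_teams)) / (n_sims * n_teams)
print(round(avg_dev, 1))
```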

After 25,000 simulations, the average absolute deviation from the mean was 5.7 points.

This means we can adjust our figure from 6.5 points to 5.7 points. In the Premier League, 5.7 points is the lowest average absolute error in points we could consistently achieve with pre-season predictions.

Another interesting thing to note is how the average deviation differs depending on how good a team is.

[Chart: average absolute deviation from the mean, by team strength]

It looks like it’s easier to accurately predict the best and worst teams, but more difficult to predict the middle of the table.

In summary, an average absolute error of 5.7 points is the natural limit for Premier League predictions, and you should expect bigger errors in the middle of the table than at the top and bottom.

The Natural Limits of Predictions (Part 1)

Let’s say we want to predict the points scored for each team in a Premier League season.

You might think that a perfect model would predict the points exactly right. However, that’s impossible to do consistently because of a mixture of random and chaotic variation, which introduces an element of unpredictability into every model. Because we can never perfectly measure the current skill of the teams, we can never perfectly predict their skill over the season. Also, we don’t know which teams will get lucky or unlucky.

When we make predictions using statistics we assign values to each team which we think represent their current skill levels. This could be as simple as Goal Difference in the previous season, or as complicated as an Expected Goals model. In any case, the outcome is a set of "skill values".

These skill values are a product of past performance, intended to measure the current ability of the teams, and we make the assumption that a team’s performance in the future will be roughly the same as its performance in the past.

We know that football results are a mixture of skill and luck. Over a Premier League season of 38 games for each team, luck will mostly cancel out, meaning we can be fairly confident that accurately describing the skill of each team will produce decent predictions.

This produces nice charts like the one below, which shows a decent correlation between performance in one season and points in the next. This chart looked a lot nicer before last season, when Leicester and Chelsea added the 2 obvious outliers.

[Chart: performance in one season against points in the following season]

Some of the variation between seasons is down to genuine changes in ability, but even over a full season luck still plays a part. We can show this by simulating a Premier League season of 38 games for an average team.

There are 3 outcomes in a football match: a win (3 points), a draw (1 point) and a loss (0 points).

These outcomes are not all equally likely for an average team. Draws don’t happen as often as Home and Away wins, and there is a noticeable home advantage.

Using data from previous seasons, we can generate probabilities for each outcome. We can then run a Monte Carlo simulation to see how many points our average team is expected to achieve over the season.

For the 19 "Home games", we will use a 46.3% chance of a win and a 25.8% chance of a draw. For the 19 "Away games", we will use a 27.9% chance of a win and a 25.8% chance of a draw.
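
Concretely, the simulation looks something like this:

```python
import random

# Per-match outcome probabilities for an average team
P_HOME_WIN, P_HOME_DRAW = 0.463, 0.258   # 19 home games
P_AWAY_WIN, P_AWAY_DRAW = 0.279, 0.258   # 19 away games

def simulate_average_team():
    """Simulate one 38-game season for an average team and return its points."""
    points = 0
    fixtures = [(P_HOME_WIN, P_HOME_DRAW)] * 19 + [(P_AWAY_WIN, P_AWAY_DRAW)] * 19
    for p_win, p_draw in fixtures:
        r = random.random()
        if r < p_win:
            points += 3          # win
        elif r < p_win + p_draw:
            points += 1          # draw
    return points

totals = [simulate_average_team() for _ in range(5000)]
mean_points = sum(totals) / len(totals)
avg_deviation = sum(abs(t - mean_points) for t in totals) / len(totals)

# Average points and average absolute deviation from the mean
print(round(mean_points, 1), round(avg_deviation, 1))
```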

The results from 5,000 simulations are as follows:

[Chart: distribution of points totals from 5,000 simulated seasons]

These results return an average points score of 52.1, which encouragingly matches the actual average points scored in the last 16 Premier League seasons. However, there is plenty of variation around that central result.

The average absolute deviation from the mean was 6.5 points.

This means that even if we had a brilliant metric which described the relative skill of each team near-perfectly at the start of the season, we would still expect to see an average error of 6.5 points between our pre-season predictions and the actual results.

We could therefore conclude that 6.5 points is our “Holy Grail” for the Premier League. Theoretically, no predictive metric could ever consistently achieve a lower average error than this.

However, this is all based on an average team playing a season against other average teams. In reality, some teams are better than others, and it may be easier to predict the points for games between teams of varying abilities.

I develop the method further in Part 2.

 

As always, feedback is very welcome.

All data is from http://www.football-data.co.uk/

Follow me on Twitter @8Yards8Feet