FT Baseline Interactive: is the early season Premier League table a good predictor of future success?

By John Burn-Murdoch and Gavin Jackson

In the third instalment of The Baseline, our weekly feature on sports statistics, we looked at how much we can learn about the final outcome of the Premier League from the season so far.

For the story click here, or keep reading to find out how we worked it out.

The data on league positions comes from the Premier League. We used the positions of every team since 1995, when the number of teams was fixed at 20, which meant that the number of matches played was constant for each season.

If you look at the scatter plot in our interactive graphic, there is essentially no pattern at the beginning of the season, but as it progresses the points come to resemble an upward-sloping straight line ever more closely.

In other words, a team’s league position at any given time becomes more correlated with its final position as the season goes on, much as we would expect.

We decided to find out exactly how good each week’s league table is at predicting the final standings. Using the statistical programming language R, we correlated each team’s position after each round of matches with its position at the end of the season.

Then we squared this number, giving us the R-squared statistic. R-squared tells us how much of the spread of teams’ positions at the end of the season can be explained by their positions at each earlier stage. For example, the table after the first game explains only eight per cent of the final table, while the table after the 20th game explains 77 per cent.
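To make the R-squared idea concrete, here is a minimal sketch in R using made-up positions for five hypothetical teams (not real Premier League data). It suggests that squaring the correlation gives the same figure as the R-squared reported by a simple linear regression of final position on current position.

```r
# Made-up league positions for five hypothetical teams (illustration only)
current_position <- c(1, 2, 3, 4, 5)
final_position   <- c(2, 1, 3, 5, 4)

# Pearson correlation between current and final position
r <- cor(current_position, final_position)

# Squaring the correlation gives R-squared
r_squared <- r^2

# The same figure comes out of a simple linear regression
fit <- lm(final_position ~ current_position)
r_squared_lm <- summary(fit)$r.squared

print(r_squared)
print(r_squared_lm)
```

With these made-up numbers both lines print 0.64, meaning current position would account for 64 per cent of the variation in final position.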

Next, we calculated the average absolute difference between a team’s position after x matches and its final position. So, for example, between the sixth game and the final game teams move on average 3.4 places up or down. Based on that figure, Manchester United – currently lying in seventh place – could easily end up either tenth or fourth.

The key figure we were looking for was the number of matches after which this average fell below one place, at which point you would expect a team to be more likely than not to remain in its current position. This turned out to be 33 matches.

All of our R code for this step of the analysis is shown below:

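Data <- read.csv('Premier League.csv') # loads the data
library(reshape2) # loads the reshape2 package to better manage the data
League <- dcast(Data, Team + Season ~ Match, value.var = 'Position') # organises the data: one row per team and season, one column per match
Correlations <- vector() # creates an empty vector
for (i in 3:39) {Correlations <- c(Correlations, cor(League[[i]], League[[40]]))} # loops over the match columns, correlating each with the final position in column 40
R_squared <- Correlations^2 # squares each correlation to give R-squared
plot(R_squared, type = 'l') # draws R-squared against matches played as a line
Avg_differences <- vector() # creates an empty vector
for (i in 3:39) {Avg_differences <- c(Avg_differences, mean(abs(League[[40]] - League[[i]])))} # average absolute gap between the position after each match and the final position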
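The threshold described above, the first match after which the average movement falls below one place, can be read off a vector like Avg_differences with which(). The sketch below is one way to do it, not necessarily how we did it at the time. It uses a short made-up vector (the hypothetical name toy_avg_differences, not our real results) so that it runs on its own.

```r
# Made-up average position changes after matches 1 to 6 (illustration only)
toy_avg_differences <- c(4.0, 3.1, 2.2, 1.4, 0.9, 0.5)

# Indices of all matches where the average change is below one place
below_one <- which(toy_avg_differences < 1)

# The first such match, or NA if the threshold is never reached
first_below_one <- if (length(below_one) > 0) below_one[1] else NA

print(first_below_one)
```

Assuming the match columns start at column 3, as in the code above, element i of Avg_differences corresponds to the table after match i, so the index returned can be read directly as a number of matches played.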

Having established just how weak an indicator of future performance the early season table is, we then asked ourselves whether any other readily available data might prove a better guide to the final standings.

Fortunately, we aren’t the first to ask this question. The FT’s Simon Kuper co-wrote the book Soccernomics with professor Stefan Szymanski, of the University of Michigan, and in it explored Szymanski’s work analysing the relationship between clubs’ wage spend and their final league position.

We set out to repeat this analysis, but using more up-to-date data. Using the final Premier League standings from 2000/2001 through to last season and financial data from Deloitte, we plotted the average finishing position for every team that has played in England’s top flight against the average of its wage bills relative to the league average for a given season.

The result was an R-squared statistic of 0.67, having adjusted for the exponential nature of Premier League wage bills (i.e. wage spend increases at an ever faster rate as you move from the lowest- to the highest-paying clubs).

As a result, we know that wages explain 67 per cent of variation in final league standings, compared to just 47 per cent for the table as it looks tonight. Another way of putting this is to say that the league table only becomes a better predictor of final standings than a club’s wage bill after roughly 13 matches have been played.

See below for our R code:

wageposabs <- read.csv('wageposabs.csv') #loads the data
y <- log(wageposabs[1]) #sets the log of a club's average finishing league position (since the relationship here is exponential) as the y variable
x <- log(wageposabs[2]) #sets the log of a club's average relative wage bill as the x variable
r <- cor(x,y) #calculates the correlation coefficient 'r' for the two variables
r_squared <- r^2 #squares this to calculate R-squared
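To illustrate why taking logs of both variables can matter, here is a small sketch with entirely made-up numbers for six hypothetical clubs. It assumes, purely for illustration, that average finishing position follows an exact power law in relative wage spend. The real data are noisier and the variable names here are not from our files.

```r
# Made-up relative wage bills (1 = league average) for six hypothetical clubs
relative_wage <- c(0.4, 0.6, 0.8, 1.0, 1.5, 2.5)

# Made-up average finishing positions following an exact power law
# (an assumption chosen for illustration, not a fitted result)
avg_position <- 8 * relative_wage^(-1)

# R-squared on the raw values: the curved relationship weakens the linear fit
r_squared_raw <- cor(relative_wage, avg_position)^2

# R-squared after logging both variables: the relationship becomes a straight line
r_squared_log <- cor(log(relative_wage), log(avg_position))^2

print(r_squared_raw)
print(r_squared_log)
```

In this toy case the logged version comes out at 1, up to floating-point rounding, because the data were built to follow a power law exactly, while the raw version is lower.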

Throughout this series we’re keen for readers to join the debate, so if you want to ask further questions, offer ideas for who or what we should look at next, or point out flaws in our logic, please leave a comment below or email us at baseline@FT.com.

This article has been amended to reflect the fact that in an earlier version the log transformation was applied to only one of the variables for the wages regression, instead of both. As a result the number of matches after which league position becomes a better predictor than wages is roughly 13. Thanks to Roger Pielke Jr for bringing this to our attention.