I post from time to time about the model that I have built to simulate and project Premier League matches. Whenever I do, I get asked how it works. This post will go through how I do things.

My first step is establishing team ratings. For this I have landed on using data from the previous season, the current season, the last 10 weeks, plus salaries.

The data from the previous season and wages is combined into one rating, where I have arbitrarily weighted things at 70% towards goals and xG and 30% towards how much a team spent. Maybe that weighting is wrong, but it felt right: I think wages are a good signal, but it is hard to find reliable data on how much a team will spend for the upcoming season. As a short digression, I do wish there was more transparency in soccer about the salaries players earn, like there is in American sports. I wish that you could easily find stuff like this:

Cot’s Baseball Contracts

The second part of the previous season rating is looking at goals for and against, and expected goals for and against. I use the StatsBomb xG via FBRef because it is easy and very high quality. I also weight these 70/30, with xG getting the higher weighting. Basically, I thought the work done in this post was persuasive and thought, let's implement that:

Does xG really tell us everything about team performance?
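
The two 70/30 blends described above can be sketched in a few lines of Python. This is only an illustration, not the actual spreadsheet or code behind the model. The function names are made up, and the big assumption is that the wage figure has already been converted into a rating on the same goals-per-match scale as the performance numbers (the post does not say how that conversion is done).

```python
def blend(primary, secondary, primary_weight=0.7):
    """Weighted average of two values that are on the same scale."""
    return primary_weight * primary + (1 - primary_weight) * secondary


def previous_season_rating(xg_per_match, goals_per_match, wage_rating):
    # Step 1: performance is 70% xG and 30% actual goals.
    performance = blend(xg_per_match, goals_per_match)
    # Step 2: the final rating is 70% performance and 30% wages.
    # ASSUMPTION: wage_rating is already expressed in goals per match.
    return blend(performance, wage_rating)


# Hypothetical inputs, not real team data.
example = previous_season_rating(xg_per_match=1.6, goals_per_match=1.8, wage_rating=1.5)
```

The same function would be run once for the attacking numbers and once for the defensive numbers.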

For the teams that were just promoted, what I did this year was go back and adjust their goals and expected goals based on the Elo rating of each team they played compared to the average rating of a Premier League team. Is this good? I have no idea, but it was a way of adjusting team performances between leagues that was feasible for me to accomplish.

The data for the current season is the same, but I do specifically look at the last 10 weeks as information that I think is important. Why the last 10 weeks? Because 10 matches is generally a good sample for getting information about a team, and teams generally play one match a week. I do this because if teams are performing well or badly over these stretches, I want to capture that and bring that information in sooner.

The weighting for these is a bit dynamic: the more of the season that has been played, the more things weight towards the current season.

At this point in the season, with 7 matches played, the weighting is 65% towards last year and 35% towards this season (there is no difference between current season and last 10 weeks because they are the same matches). At 11 matches played the current season starts to have more weight, 45% last year vs 55% this year. At the halfway mark (19 matches) the weighting is 26% last year, 40% this season and 34% last 10 weeks. As more matches are played, last 10 weeks maxes out at this 34%, with more and more weight moving from last year to the current season.
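
As a rough sketch of how this sliding weight could work in Python: the three checkpoints below are the ones quoted above, but the straight-line interpolation between them is purely an assumption for illustration, as are the function names and the example ratings. The post does not spell out the actual schedule between or beyond those points, so the sketch refuses to guess outside matches 7 to 19.

```python
# (matches played, weight on last season) as quoted in the post.
CHECKPOINTS = [(7, 0.65), (11, 0.45), (19, 0.26)]


def last_season_weight(matches_played):
    """ASSUMPTION: linear interpolation between the quoted checkpoints."""
    if not CHECKPOINTS[0][0] <= matches_played <= CHECKPOINTS[-1][0]:
        raise ValueError("this sketch only covers matches 7 to 19")
    for (m0, w0), (m1, w1) in zip(CHECKPOINTS, CHECKPOINTS[1:]):
        if m0 <= matches_played <= m1:
            return w0 + (w1 - w0) * (matches_played - m0) / (m1 - m0)


def blend_ratings(last_season, current, last_10, weights):
    """weights = (last season, current season, last 10 weeks), summing to 1."""
    w_last, w_current, w_last_10 = weights
    return w_last * last_season + w_current * current + w_last_10 * last_10


# Halfway-mark weights from the post, applied to hypothetical ratings.
halfway = blend_ratings(1.40, 1.60, 1.80, weights=(0.26, 0.40, 0.34))
```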

Is this the perfect weighting? Again I have no idea but it feels like it is pointed in the direction that I want it to go.

From this I create an attack rating and defending rating. This will feed directly into creating match odds.

The next step is creating match odds. To do this I take the team ratings and combine them into a projection for the number of goals scored by each team. I wanted to keep things simple here: I take the attack rating for one team, add the defense rating for the other team, divide by 2, and then multiply by a home field advantage factor (right now a 5% bonus for the home team and a 5% penalty for the away team).

This combo gives the base expected goals for a team. Let's use the Arsenal vs Crystal Palace match on October 18th as an example. Arsenal have an attack rating of 1.31 and Crystal Palace have a defense rating of 1.51, so we get (1.31+1.51)/2*1.05 and end up with 1.5 goals for Arsenal. For Crystal Palace we take their attack rating of 1.1 and Arsenal's defense rating of 1.31, and get (1.1+1.31)/2*0.95, which comes to 1.1 goals for Crystal Palace.
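
The same arithmetic as a small Python sketch, using the ratings from the example above. The function and constant names are just illustrative choices.

```python
HOME_FACTOR = 1.05  # 5% bonus for the home team
AWAY_FACTOR = 0.95  # 5% penalty for the away team


def expected_goals(attack, opponent_defense, is_home):
    """Average the attack and opposing defense ratings, then apply home/away factor."""
    base = (attack + opponent_defense) / 2
    return base * (HOME_FACTOR if is_home else AWAY_FACTOR)


arsenal_xg = expected_goals(1.31, 1.51, is_home=True)
palace_xg = expected_goals(1.10, 1.31, is_home=False)
print(round(arsenal_xg, 2), round(palace_xg, 2))  # 1.48 1.14
```

Rounded to one decimal, those are the 1.5 and 1.1 quoted in the example.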

The next step is taking these expected goal figures and putting them into Poisson distributions to model the probability of certain amounts of goals being scored. Doing this for both teams then allows you to figure out the odds for each score line and thus the odds of a win/loss/draw. That is how this graphic is derived.
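
Here is a minimal Python sketch of that step. It assumes the two teams' goals are independent of each other (so a score line's probability is just the two Poisson probabilities multiplied together) and it cuts off at 10 goals per team, which leaves out only a negligible sliver of probability. Both of those are simplifications for the sketch, and the names are made up.

```python
from math import exp, factorial


def poisson_pmf(k, lam):
    """Probability of exactly k goals when lam goals are expected."""
    return exp(-lam) * lam**k / factorial(k)


def match_odds(home_xg, away_xg, max_goals=10):
    """Return (home win, draw, away win) by summing over every score line."""
    home_win = draw = away_win = 0.0
    for h in range(max_goals + 1):
        for a in range(max_goals + 1):
            p = poisson_pmf(h, home_xg) * poisson_pmf(a, away_xg)
            if h > a:
                home_win += p
            elif h == a:
                draw += p
            else:
                away_win += p
    return home_win, draw, away_win


# Expected goals from the Arsenal vs Crystal Palace example.
home_win, draw, away_win = match_odds(1.48, 1.14)
```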

I do this exercise for each match to get the match odds. From there we take these match odds and simulate a season.

To simulate a match I simply generate a random number between 0 and 1 for each team and use that team's cumulative goal probability distribution to determine the goals scored. Sticking with the Arsenal vs Crystal Palace match, Arsenal's random number was 0.0878, which corresponds to scoring 0 goals. Crystal Palace got 0.5006, which corresponds to them scoring 1 goal, meaning that in this simulation Crystal Palace wins 0-1 and Arsenal Twitter melts down.
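
In Python, turning a random number into a goal count might look like the sketch below: walk up the cumulative Poisson probabilities until they pass the random number. The function name and the cap on goals are illustrative choices, and the expected goals are the rounded figures from the example.

```python
from math import exp


def goals_from_random(u, lam, cap=20):
    """Smallest goal count whose cumulative Poisson probability exceeds u."""
    k = 0
    p = exp(-lam)  # probability of 0 goals
    cumulative = p
    while u >= cumulative and k < cap:
        k += 1
        p *= lam / k  # Poisson recurrence: P(k) = P(k-1) * lam / k
        cumulative += p
    return k


arsenal_goals = goals_from_random(0.0878, 1.48)
palace_goals = goals_from_random(0.5006, 1.14)
print(arsenal_goals, palace_goals)  # 0 1
```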

To simulate a season I do this same thing for each match, recording the goals for and against, the number of wins, losses and draws, the points, and where each team finished, and save that. Then I start again, simulating the season 10,000 times in total.
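
A stripped-down Python sketch of that loop is below. To keep it short it only tracks points and goal difference (not wins, losses and draws), breaks ties on goal difference alone, and uses a made-up three-team fixture list with made-up expected goals. All of the names are hypothetical.

```python
import random
from collections import defaultdict
from math import exp


def sample_goals(lam, rng, cap=20):
    """Draw a goal count from a Poisson distribution using one random number."""
    u, k, p = rng.random(), 0, exp(-lam)
    cumulative = p
    while u >= cumulative and k < cap:
        k += 1
        p *= lam / k
        cumulative += p
    return k


def simulate_season(fixtures, rng):
    """fixtures: (home, away, home_xg, away_xg) tuples. Returns teams, first to last."""
    teams = sorted({team for fixture in fixtures for team in fixture[:2]})
    points = dict.fromkeys(teams, 0)
    goal_diff = dict.fromkeys(teams, 0)
    for home, away, home_xg, away_xg in fixtures:
        hg, ag = sample_goals(home_xg, rng), sample_goals(away_xg, rng)
        goal_diff[home] += hg - ag
        goal_diff[away] += ag - hg
        if hg > ag:
            points[home] += 3
        elif hg < ag:
            points[away] += 3
        else:
            points[home] += 1
            points[away] += 1
    # SIMPLIFICATION: rank on points, then goal difference only.
    return sorted(teams, key=lambda t: (points[t], goal_diff[t]), reverse=True)


def finish_counts(fixtures, n_sims=10_000, seed=0):
    """Count how often each team finishes in each position."""
    rng = random.Random(seed)
    counts = defaultdict(lambda: defaultdict(int))
    for _ in range(n_sims):
        for position, team in enumerate(simulate_season(fixtures, rng), start=1):
            counts[team][position] += 1
    return counts


# Hypothetical double round-robin between three made-up teams.
toy_fixtures = [
    ("A", "B", 1.6, 1.0), ("B", "A", 1.2, 1.4),
    ("A", "C", 1.8, 0.8), ("C", "A", 1.0, 1.5),
    ("B", "C", 1.4, 1.1), ("C", "B", 1.2, 1.3),
]
toy_counts = finish_counts(toy_fixtures, n_sims=1000)
```

Dividing each count by the number of simulations gives the chance of finishing in each position.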

After all of that I am able to get information like this:

So this is how my simulation model works. Could you make a fancier one? Probably. Could you make a more accurate one? Yeah, I am certain that there are more advanced models out there, specifically ones built for gambling, that take into account more information and are more accurate. Does this mostly work and point me in the right direction without implicitly introducing bias? Yes, I think it does, and that is why I do it.

Generally I think that this model is in the ballpark of where the "true" odds of a match are. For things that I predicted to happen 0-15% of the time, they happened 11% of the time. For things from 15-30%, they happened 25% of the time. For things from 30-45%, they happened 31% of the time (I am a little light here). For things from 45-60%, they happened 61% of the time (I am a little heavy here). For things from 60-75%, they happened 71% of the time.
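
A check like this can be sketched in Python by bucketing the predicted probabilities and comparing each bucket with how often the outcome actually happened. The bucket edges follow the ranges above, with a 75-100% bucket added only so every probability lands somewhere. The function name and the tiny data set are invented for illustration and are not the model's real predictions.

```python
def calibration_table(predicted, happened,
                      edges=(0.0, 0.15, 0.30, 0.45, 0.60, 0.75, 1.0)):
    """Return (low, high, count, observed rate) for each non-empty bucket.

    predicted: probabilities between 0 and 1.
    happened: matching booleans, True if the predicted outcome occurred.
    """
    rows = []
    for low, high in zip(edges, edges[1:]):
        hits = [
            h for p, h in zip(predicted, happened)
            # The top bucket also includes a probability of exactly 1.0.
            if low <= p < high or (high == edges[-1] and p == high)
        ]
        if hits:
            rows.append((low, high, len(hits), sum(hits) / len(hits)))
    return rows


# Made-up predictions and results, purely to show the shape of the output.
toy_predicted = [0.10, 0.10, 0.50, 0.50, 0.50, 0.50]
toy_happened = [False, True, True, True, False, True]
toy_rows = calibration_table(toy_predicted, toy_happened)
```

A well-calibrated model would show an observed rate that falls inside, or close to, each bucket's range.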