Europe's Big 5 Football Data at a Glance 2022-2023
The Beautiful Game

There's a reason why soccer, or football, is the world's favorite sport. It's often called the "beautiful game". And if you can't see how beautiful the build-up play in the gif above is, then I feel sorry for you, haha. I have been playing since I learned to walk. I live, breathe, and eat football. I'm saying football because I have recently moved to the UK and it just feels better than saying "soccer".

And because I love football SO much, I wanted to step into the world of data and soccer: combine two of my favorite things and see what would happen. This self-started project has been an incredible opportunity to showcase all of my data skills in one great big project that I can publish to my portfolio. I hope you enjoy.

Groundwork

Being the football fan that I am, and moving to Europe, I decided to investigate Europe's Top 5 Competitions: the Premier League (England), La Liga (Spain), Ligue 1 (France), Serie A (Italy), and the Bundesliga (Germany). I do not doubt that there are other significant competitions around the world; it's just that, for the most part, these 5 leagues are considered to be the best of the best.

So, we have the competitions to compare and contrast; now it's about asking which questions are the most important. I know that football is a team sport and that it is hard to showcase that with individual variables like goals and assists, but I will do my best. My attempt here is to answer the following questions:

Important Questions

Which players are the top performing in Goals and Assists?

What is the expected Goals % and why is that important?

What is the peak age for scoring goals?

Which Competition has the most Yellow and Red Cards?

Does the Attendance in a Home game affect the outcome of the Match?

Key Findings

Here is what I found in answering the above questions, which will hopefully draw you in to look more into the details:

  • Erling Haaland had an impressive 36 goals in his debut season with Manchester City
  • Lionel Messi, Kevin De Bruyne and Antoine Griezmann led Europe's Top 5 Comps with 16 assists each
  • La Liga had significantly more yellow and red cards than any other competition
  • 30% of the variation in Wins can be explained by the number in Attendance

DATA

The data I found and used in this project came from fbref.com, specifically the 2022-2023 season tables for squads and individuals. I used the individual table to look at top performers in categories like goals, assists, expected goals (which I will go over later), minutes played, number of starts, yellow and red cards, etc. And I used the squad table to analyze wins, draws, losses, number of fans in attendance, etc.

I collected the data into an Excel file, ran some cleaning operations, edited some functions and verified the results with SQL queries. After satisfactory validations, I loaded the data into Tableau to look at trends, patterns and correlations. The following are my results.

Analysis

Things to keep in mind

For the individual table, the total number of rows is 2889. It is important to realize that this is not the number of distinct individuals, because some players are listed twice. They were either transferred or sent away on loan during the middle of the season, so they played for two different squads.

SQL Query: Counting Rows in Individual Stats Table
Repeated Names due to mid-season Transfers or Loans
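
Since the queries themselves only appear as screenshots, here is a minimal sketch of what checks like these could look like, run through Python's built-in sqlite3 module. The table name `player_stats`, its columns, and the three rows are assumptions made up for illustration; they are not the real fbref export or the queries actually used.

```python
import sqlite3

# Hypothetical schema and rows for illustration only; not the real fbref data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE player_stats (player TEXT, squad TEXT, goals INTEGER)")
conn.executemany(
    "INSERT INTO player_stats VALUES (?, ?, ?)",
    [
        ("Player A", "Squad 1", 3),
        ("Player A", "Squad 2", 1),  # same player after a mid-season move
        ("Player B", "Squad 1", 5),
    ],
)

# Total number of rows in the individual table.
total_rows = conn.execute("SELECT COUNT(*) FROM player_stats").fetchone()[0]

# Names that appear more than once (candidates for transfers or loans).
repeated = conn.execute(
    """
    SELECT player, COUNT(*) AS appearances
    FROM player_stats
    GROUP BY player
    HAVING COUNT(*) > 1
    ORDER BY player
    """
).fetchall()

print(total_rows)
print(repeated)
```

Grouping by name alone would also flag two different players who happen to share a name, so any hits are worth checking by hand.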

As for the squad table, there ought to be 98 rows indicating 98 teams: 20-England, 20-Spain, 20-Italy, 20-France, and 18-Germany. This was verified with a Google search. Not only is the number of teams per competition different, but the number of matches played is different too. The competitions with 20 teams play 38 games in a season, whereas Germany, the only competition with 18 teams, plays 34 games in a season.

SQL Query: Counting Competition number of teams
Result: Competition Teams
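
The 38 and 34 figures follow from the home-and-away format, where every team plays every other team twice. A small sketch of that arithmetic, using the team counts stated above (the function names are my own, chosen for illustration):

```python
# Team counts per competition as stated in the text above.
teams = {"England": 20, "Spain": 20, "Italy": 20, "France": 20, "Germany": 18}


def matches_per_team(n_teams: int) -> int:
    """Each team plays every other team twice: once at home, once away."""
    return 2 * (n_teams - 1)


def total_matches(n_teams: int) -> int:
    """Every ordered (home, away) pairing is one match in the league season."""
    return n_teams * (n_teams - 1)


total_teams = sum(teams.values())
per_team = {country: matches_per_team(n) for country, n in teams.items()}

print(total_teams)                           # 98
print(per_team["England"], per_team["Germany"])  # 38 34
print(total_matches(20), total_matches(18))  # 380 306
```

So a 20-team league has 380 matches in total while an 18-team league has 306, which matters whenever league totals are compared directly.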

Additionally, it is worth noting once again that we are just talking about the top 5 leagues in Europe. That means we lose out on other top performers outside of this scope, including Cristiano Ronaldo, who now plays in Saudi Arabia.

And a final note is that all matches are recorded twice: once for each team. Saying that a team played 38 matches in the season is accurate, but each of those matches also shows up in the opponent's row, so the two teams playing each other double record the data from that match. I'm not saying this is a bad thing, it's just important to keep in mind as we go through the rest of the data.

Quick Summary

Using some simple SQL queries, I was able to find out that the oldest players in the top 5 leagues were 41 (Joaquin and Gianluca Pegolo) and the youngest were 15 (Ethan Nwaneri and Lamine Yamal) as of the beginning of the season, when the leagues record those values.

SQL Query: Oldest/Youngest Players
Result: Oldest/Youngest Players
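
A query of this kind could be sketched as follows. Again, the `player_stats` table, the `age` column, and the rows are hypothetical stand-ins, not the real data or the exact query from the screenshot.

```python
import sqlite3

# Hypothetical rows for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE player_stats (player TEXT, squad TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO player_stats VALUES (?, ?, ?)",
    [
        ("Player A", "Squad 1", 41),
        ("Player A", "Squad 2", 41),  # listed twice after a mid-season move
        ("Player B", "Squad 1", 15),
        ("Player C", "Squad 3", 27),
        ("Player D", "Squad 4", 41),
        ("Player E", "Squad 5", 15),
    ],
)

# DISTINCT keeps a transferred player from being reported twice.
oldest = conn.execute(
    """
    SELECT DISTINCT player, age
    FROM player_stats
    WHERE age = (SELECT MAX(age) FROM player_stats)
    ORDER BY player
    """
).fetchall()

youngest = conn.execute(
    """
    SELECT DISTINCT player, age
    FROM player_stats
    WHERE age = (SELECT MIN(age) FROM player_stats)
    ORDER BY player
    """
).fetchall()

print(oldest)
print(youngest)
```

Comparing against the MAX and MIN in a subquery returns every player tied at that age, rather than just one row.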

Then looking at the best 10 teams from the top 5 competitions, we can see which team they are, the country their competition is in, and their Pts/MP (matches played) ratio. This is an important factor in football as a win = 3 pts, a draw = 1 pt, and a loss = 0 pts. So the top performing teams are getting at least 2 pts per game played. That's equivalent to winning 2 games and losing the third.

SQL Query: top 10 teams pts/mp
BarChart top 10 squads based on Pts/MP
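
The Pts/MP ratio itself is simple arithmetic. As a rough sketch (the function name is an assumption of mine, and the points rule is the 3/1/0 scheme described above):

```python
def points_per_match(wins: int, draws: int, losses: int) -> float:
    """Points per match under the 3 pts win / 1 pt draw / 0 pts loss rule."""
    matches = wins + draws + losses
    if matches == 0:
        raise ValueError("no matches played")
    return (3 * wins + draws) / matches


# Two wins and one loss works out to exactly 2 points per match.
two_wins_one_loss = points_per_match(wins=2, draws=0, losses=1)

# 90 points over a 38-match season rounds to 2.37 points per match.
ninety_over_38 = round(90 / 38, 2)

print(two_wins_one_loss)  # 2.0
print(ninety_over_38)     # 2.37
```

Dividing by matches played is what makes 38-match and 34-match seasons comparable on the same chart.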

The bar chart above shows the top 10 performing teams among the competitions with their Pts/MP ratio. The top team overall is the only team from the Italian League. The English Premier League, France and Germany each have 2 teams in the top 10, and the Spanish League has 3. Napoli led the way, averaging 2.37 points per game and ending their 38-match season with 90 points overall.

And ending the quick summary, I wanted to look at the competitions' total values for goals and assists as this is one of the top performance indicators in football.

Competitions' Goals and Assists

We can clearly see that the English Premier League leads all competitions in both Goals and Assists.

Which players are the top performing in Goals and Assists?

Now that we know which league has the most goals and assists, let's dive deeper into the actual players. The following shows the top performers in goal contributions.

Top 25 Goals & Assists Leaders

According to the above bar chart, we can see that Erling Haaland surpassed all other performers by 10 goal contributions. He outscored Harry Kane, the next highest goalscorer, by 6 goals. You will also note other top goal contributors, including Lionel Messi, who has an equal number of goals and assists, making him still one of the best in the world.

In addition to this comparison, I wanted to include other important measures like positions and expected goals (which I will cover shortly). The following bubble plot shows goals against assists, with positions in different colors and the size of each bubble scaled by the player's xG value.

Ast vs Gls by Pos & xGls

So again, we can see the top performers, as with the previous chart. But this plot helps us see that forwards are typically the ones grabbing more goals than assists, midfielders are half and half, and defenders are lower than all. Goalkeepers don't really play a role here because they are not the ones focusing on scoring. However, in recent years, some goalkeepers have been able to contribute to goals with their long through balls and get assists on the stat sheet. One example in particular is RB Leipzig goalkeeper Péter Gulácsi, who averaged .167 assists per 90 minutes throughout the season.

Goalkeeper % assist in 90 minutes
Assist % by Position and Team

This heatmap shows us the top performing teams by assists per 90 minutes for each position. Union Berlin midfielders are shown at 1.17 assists per 90 minutes and Bayern Munich forwards at .83 assists per 90 minutes.
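
For anyone unfamiliar with "per 90" figures, they scale a player's total to a full 90-minute match. A minimal sketch, where the function name and the 1 assist in 540 minutes are hypothetical numbers picked only because they land on .167, not any player's actual totals:

```python
def per_90(stat_total: float, minutes_played: float) -> float:
    """Scale a season total to a rate per 90 minutes played."""
    if minutes_played <= 0:
        raise ValueError("minutes_played must be positive")
    return stat_total / minutes_played * 90


# Hypothetical example: 1 assist across 540 minutes (six full matches).
example_rate = round(per_90(1, 540), 3)

print(example_rate)  # 0.167
```

Rates built on very few minutes can look inflated, so it helps to check minutes played before reading too much into a single per-90 value.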

What is the expected Goals % and why is that important?

Returning to the bubble plot from earlier, we used one variable I have not explained yet: expected goals. Now, what is that? According to the experts, the expected goals (xG) measure "is a metric that's intended to measure the probability of a shot resulting in a goal. The purpose is to show when a player should be expected to score from a particular opportunity, by basically rating how good of a goal-scoring opportunity it is." Based on the bubble plot above, it makes sense that forwards have the greater likelihood of scoring because of how close they are to goal and their ability to capitalize on their goal scoring chances. If you want to understand this metric better, follow this link for more information.
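
To make the idea concrete: a player's season xG is the sum of the scoring probabilities of the shots they took. The sketch below uses made-up shot probabilities purely for illustration; real xG values come from a provider's shot model, which is not reproduced here.

```python
# Hypothetical shots: (probability of scoring, whether it was actually scored).
shots = [
    (0.76, True),   # e.g. a very good chance
    (0.05, False),  # e.g. a long-range effort
    (0.30, False),
    (0.12, True),
]

# Expected goals: the sum of the per-shot probabilities.
expected_goals = round(sum(prob for prob, _ in shots), 2)

# Actual goals: how many of those shots went in.
actual_goals = sum(1 for _, scored in shots if scored)

# Positive means the player scored more than the chances "should" have produced.
over_performance = round(actual_goals - expected_goals, 2)

print(expected_goals, actual_goals, over_performance)  # 1.23 2 0.77
```

Comparing goals with xG is one way to tell a clinical finisher from a player who simply gets a lot of good chances.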

What is the peak age for scoring goals?

The next thing I was curious to look into was which age was the best for scoring goals. So, with the line chart below, I was able to plot the age of the players with the goals scored.

Age vs Sum Gls

From here, it looks like the most goals are scored at age 24, and then the totals start to go back down. Now that could mean that we just have a lot of 24 year olds playing and scoring. Regardless, 24 year olds did have the most goals out of all the ages.
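
One way to separate "lots of players that age" from "players that age score more" would be to compare total goals with average goals per player at each age. This is only a sketch with invented (age, goals) pairs, not the real table:

```python
from collections import defaultdict

# Hypothetical (age, goals) rows for illustration only.
players = [(24, 10), (24, 2), (24, 0), (24, 3), (28, 9), (28, 5)]

goal_totals = defaultdict(int)
player_counts = defaultdict(int)
for age, goals in players:
    goal_totals[age] += goals
    player_counts[age] += 1

# Average goals per player at each age.
goal_averages = {age: goal_totals[age] / player_counts[age] for age in goal_totals}

peak_by_total = max(goal_totals, key=goal_totals.get)
peak_by_average = max(goal_averages, key=goal_averages.get)

print(dict(goal_totals))   # {24: 15, 28: 14}
print(goal_averages)       # {24: 3.75, 28: 7.0}
print(peak_by_total, peak_by_average)  # 24 28
```

In this made-up example the age with the most total goals is not the age with the highest average, which is exactly the situation the caveat above is warning about.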

Which Competition has the most Yellow and Red Cards?

One significant variable that UEFA (Union of European Football Associations) could use to crack down on fees and penalties (no pun intended) is the number of yellow and red cards that are issued in each league. The following bar chart shows us which competition has the highest number of both yellow and red cards.

Yellow & Red Cards per League

From this visual we can easily see that La Liga in Spain accumulates the most yellow and red cards out of all of the competitions. Italy is next in line. And for me that is somewhat expected: both Spaniards and Italians are known for their hot-headedness and quick temper, which often results in lapses in discipline and yellow cards. It is unsurprising that the German League has the fewest, because it has the fewest matches, meaning the fewest opportunities to receive cards. UEFA may want to use this information to incentivize Spain and Italy to control their teams' discipline.
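
Because the leagues do not play the same number of matches, a fairer comparison would divide cards by matches played. The sketch below assumes the home-and-away format and uses invented card totals, so the league names and numbers are placeholders rather than real results:

```python
def league_matches(n_teams: int) -> int:
    """Total matches in a home-and-away season: one per (home, away) pairing."""
    return n_teams * (n_teams - 1)


# Hypothetical card totals for illustration only; not the real 2022-2023 figures.
leagues = {
    "League A": {"teams": 20, "cards": 1520},
    "League B": {"teams": 18, "cards": 1224},
}

cards_per_match = {
    name: info["cards"] / league_matches(info["teams"])
    for name, info in leagues.items()
}

print(cards_per_match)  # {'League A': 4.0, 'League B': 4.0}
```

Here League A has almost 300 more cards in total, yet both leagues average exactly the same number of cards per match.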

Does the Attendance in a Home game affect the outcome of the Match?

One final thought I was interested in pursuing was looking into the relationship between wins and match attendance for teams. The number in attendance was averaged only on games that were played at home, so half of the matches played throughout the season. The following graph illustrates that relationship.

Wins by Attendance

From this chart and the linear regression model I included, we can see that 30% of the variability in wins can be explained by the number in attendance. This is the R-Squared value. What's also noteworthy is that the p-value is very low, meaning that the relationship is statistically significant and unlikely to be the result of random chance alone.

Barcelona had 28 wins and the highest attendance, with nearly 80,000 people. Monaco, on the other hand, had 19 wins but only 7,000 people in attendance. The R-Squared number was high enough to notice, but not high enough to indicate a strong relationship between wins and attendance.
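
For readers curious where an R-Squared value comes from, here is a minimal sketch of a simple least-squares fit written from scratch. The attendance and wins numbers are invented for illustration and are not the real squad data; Tableau's trend line performs its own calculation, which this does not claim to replicate exactly.

```python
def linear_fit_r2(xs, ys):
    """Fit y = intercept + slope * x by least squares and return (slope, intercept, r2)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R-squared: share of the variation in y explained by the fitted line.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot


# Hypothetical data: average home attendance (thousands) and season wins.
attendance = [10, 20, 30, 40, 50]
wins = [8, 15, 10, 20, 14]

slope, intercept, r2 = linear_fit_r2(attendance, wins)
print(round(slope, 2), round(intercept, 1), round(r2, 2))
```

An R-Squared around 0.3 means most of the variation in wins is still left unexplained by attendance, and a correlation like this does not by itself show that bigger crowds cause more wins.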

Conclusion

In summary, we were able to look at the top performers in goals and assists for last season among Europe's Top 5 Football Leagues. Several players definitely stood out among the crowd, including leading goalscorer Erling Haaland who is only 23 years old and has a bright future ahead of him. He will be one to watch throughout his career.

In addition to looking at top performers, we were able to get a better understanding of what the expected goals calculation means. We also noticed that La Liga had significantly more yellow and red cards than any other competition and should be monitored to ensure good, clean games during the season. And surprisingly, we learned that 30% of the variation in Wins for a team can be explained by the number in Attendance. So make sure you go out and support your teams in person! You just may be the extra lift the team needs to win.

Thank you for reading all of this. If you have any questions, feel free to comment below or connect with me, Brock Johnson, here on LinkedIn.

I am looking for new opportunities and roles in the data world, so if you hear of any or are in the market please reach out, thanks!

Christy Ehlert-Mackie, MBA, MSBA

Data Analyst who 💗 Excel | SQL | Tableau | I analyze and interpret data so companies have the information and insights they need to make sound business decisions.

1y

Nice job, Brock! It looks like most goals are scored by players in their early to mid 20s. I'm guessing that most players are also in that age range. It would be interesting to know what the average number of goals is at each age based on the number of players that age.

Abdifatah Mohamed

R&D S&E Electronics Engineer

1y

It would be interesting to see the trends compared to the 2010s in terms of goalscorers and their age (Ronaldo, and Messi in their prime). Great stuff Brock!

Stuart Walker

Fraud Prevention Analyst @ M&G PLC | Data Analyst | Data Scientist | Python | SQL | Machine Learning | Data Analytics | Excel | Tableau | Power BI | R

1y

Seeing as I’m a huge football fan I love this analysis. Well done Brock 👏💪👏
