In 2003, Michael Lewis published "Moneyball", a book about Billy Beane, the Oakland Athletics general manager who applied statistical analysis to baseball in order to identify and recruit undervalued players. By relying on data, Beane won as many games as teams with more than double the payroll and reached the playoffs in four successive years, from 2000 to 2003.
In 2011, Moneyball was adapted into a movie, with Brad Pitt playing Billy Beane. Both the book and the movie were successful and popularized the idea of using data to improve the performance of sports teams. The use of data in sport is often referred to as sports analytics.
In baseball, the nature of the sport makes it easy to collect many data points about in-game action. From this link, you can download a database covering in-game data points and other statistics about players and teams going back to 1871. If you're interested in analyzing baseball data, you can find here a blog post on the topic that I wrote a few years back.
In football (soccer), data collection is more complex. Football is a dynamic sport with 22 players on the pitch and unlimited possibilities for ball movement and player positioning. Fortunately, advances in sensors and video analysis over the last few years have made it possible to collect high-quality football data that can be used to analyze games, teams and players. In this blog post, we will use an open collection of football logs to create a web app that analyzes Messi's and Ronaldo's games during the 2017-18 LaLiga season [1]. We will use Python and Streamlit to create an interactive web app that compares both players' stats and shows their positions on the pitch.
I would like to thank Luca Pappalardo and his colleagues for making this great dataset available to the public.
Below is an animated GIF of the application that we will build. You can find the source code in this GitHub repository.
1. Getting, Reading and Structuring the Data¶
Messi and Ronaldo dominated world football during the last decade with a combined 11 FIFA Ballon d'Or awards (six for Messi and five for Ronaldo). Both players are considered to be amongst the greatest players of all time and they're frequently compared to each other.
In this blog post, we will analyze the games of both players during LaLiga (Spanish League) season 2017-18. This was Ronaldo's last season in Spain before moving to Juventus.
1.1. Getting the Data¶
To start with, we need to download the datasets introduced in the paper "A public data set of spatio-temporal match events in soccer competitions" from this link. We need the following files:
- matches/matches_Spain.json: Information about LaLiga (Spanish football league) season 2017-18 matches.
- events/events_Spain.json: All the events that occur during each match of LaLiga season 2017-18.
- players.json: All players of the teams playing in seven national and international soccer competitions (Italian, Spanish, French, German, English first divisions, World Cup 2018, European Cup 2016).
- teams.json: All teams in seven prominent soccer competitions (Italian, Spanish, German, French and English first divisions, World Cup 2018, European Cup 2016).
- tags2name.csv: Mapping of tag identifiers to tag names.
1.2. Reading the Data¶
We start by importing the different Python libraries that we need.
import json
import unicodedata
import numpy as np
import pandas as pd
From players.json, we can find the player id "wyId" of both players:
- 3359 for Messi
- 3322 for Ronaldo
We can also find the team id "wyId" of both teams in teams.json:
- 676 for FC Barcelona
- 675 for Real Madrid
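The ids above can also be looked up programmatically. The sketch below assumes that each record in players.json carries a `wyId` and a `shortName` field, and it uses a tiny hand-written sample in place of the real file. `find_players` and `strip_accents` are hypothetical helpers, not part of the original code; the accent stripping is only there so that a search such as "suarez" would also match an accented name.

```python
import unicodedata


def strip_accents(text):
    """Return text with diacritics removed, e.g. 'Suárez' -> 'Suarez'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def find_players(players, fragment):
    """Return (wyId, shortName) pairs whose short name contains fragment."""
    needle = strip_accents(fragment).lower()
    return [
        (player["wyId"], player["shortName"])
        for player in players
        if needle in strip_accents(player["shortName"]).lower()
    ]


# Hand-written sample standing in for the list loaded from players.json.
sample_players = [
    {"wyId": 3359, "shortName": "L. Messi"},
    {"wyId": 3322, "shortName": "Cristiano Ronaldo"},
    {"wyId": 1, "shortName": "L. Suárez"},  # made-up id, for illustration only
]

print(find_players(sample_players, "messi"))
print(find_players(sample_players, "ronaldo"))
```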
Next, we read Spain matches and events datasets:
with open('../data/matches/matches_Spain.json') as json_file:
    matches_spain_data = json.load(json_file)
with open('../data/events/events_Spain.json') as json_file:
    events_spain_data = json.load(json_file)
1.3. Structuring the Data¶
Structuring Messi and Ronaldo's events data¶
Next, we will structure all Real Madrid and FC Barcelona matches information into 2 Pandas DataFrames.
barca_matches = [match for match in matches_spain_data if '676' in match['teamsData'].keys()]
real_matches = [match for match in matches_spain_data if '675' in match['teamsData'].keys()]
barca_matches_df = pd.DataFrame(barca_matches)
real_matches_df = pd.DataFrame(real_matches)
barca_matches_df.head(2)
 | status | roundId | gameweek | teamsData | seasonId | dateutc | winner | venue | wyId | label | date | referees | duration | competitionId |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Played | 4406122 | 38 | {'676': {'scoreET': 0, 'coachId': 92894, 'side... | 181144 | 2018-05-20 18:45:00 | 676 | Camp Nou | 2565922 | Barcelona - Real Sociedad, 1 - 0 | May 20, 2018 at 8:45:00 PM GMT+2 | [{'refereeId': 398931, 'role': 'referee'}, {'r... | Regular | 795 |
1 | Played | 4406122 | 37 | {'676': {'scoreET': 0, 'coachId': 92894, 'side... | 181144 | 2018-05-13 18:45:00 | 695 | Estadio Ciudad de Valencia | 2565917 | Levante - Barcelona, 5 - 4 | May 13, 2018 at 8:45:00 PM GMT+2 | [{'refereeId': 420995, 'role': 'referee'}, {'r... | Regular | 795 |
Next, we will structure all Messi and Ronaldo's events data into 2 Pandas DataFrames.
messi_events_data = []
for event in events_spain_data:
    if event['playerId'] == 3359:
        messi_events_data.append(event)
messi_events_data_df = pd.DataFrame(messi_events_data)
ronaldo_events_data = []
for event in events_spain_data:
    if event['playerId'] == 3322:
        ronaldo_events_data.append(event)
ronaldo_events_data_df = pd.DataFrame(ronaldo_events_data)
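The two loops differ only in the player id, so the filtering could be factored into a single helper. This is a minimal sketch on made-up events; `events_for_player` is a hypothetical name and the sample records only contain the fields needed for the illustration.

```python
def events_for_player(events, player_id):
    """Return the events whose 'playerId' matches player_id."""
    return [event for event in events if event["playerId"] == player_id]


# Made-up events standing in for events_spain_data.
sample_events = [
    {"playerId": 3359, "eventName": "Pass"},
    {"playerId": 3322, "eventName": "Shot"},
    {"playerId": 3359, "eventName": "Shot"},
]

messi_sample = events_for_player(sample_events, 3359)
ronaldo_sample = events_for_player(sample_events, 3322)
print(len(messi_sample), len(ronaldo_sample))
```

Each returned list can then be passed to pd.DataFrame in the same way as the lists built by the loops.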
Adding tags to events data¶
From tags2name.csv, we select the event tags that are interesting for our analysis.
- 101: Goal
- 301: Assist
- 302: Key Pass
- 401: Left Foot
- 402: Right Foot
We add these tags as new columns in the events DataFrames.
def add_tag(tags, tag_id):
    return tag_id in [tag['id'] for tag in tags]
messi_events_data_df.columns
Index(['eventId', 'subEventName', 'tags', 'playerId', 'positions', 'matchId', 'eventName', 'teamId', 'matchPeriod', 'eventSec', 'subEventId', 'id'], dtype='object')
messi_events_data_df['goal'] = messi_events_data_df['tags'].apply(lambda x: add_tag(x, 101))
messi_events_data_df['assist'] = messi_events_data_df['tags'].apply(lambda x: add_tag(x, 301))
messi_events_data_df['key_pass'] = messi_events_data_df['tags'].apply(lambda x: add_tag(x, 302))
messi_events_data_df['left_foot'] = messi_events_data_df['tags'].apply(lambda x: add_tag(x, 401))
messi_events_data_df['right_foot'] = messi_events_data_df['tags'].apply(lambda x: add_tag(x, 402))
ronaldo_events_data_df['goal'] = ronaldo_events_data_df['tags'].apply(lambda x: add_tag(x, 101))
ronaldo_events_data_df['assist'] = ronaldo_events_data_df['tags'].apply(lambda x: add_tag(x, 301))
ronaldo_events_data_df['key_pass'] = ronaldo_events_data_df['tags'].apply(lambda x: add_tag(x, 302))
ronaldo_events_data_df['left_foot'] = ronaldo_events_data_df['tags'].apply(lambda x: add_tag(x, 401))
ronaldo_events_data_df['right_foot'] = ronaldo_events_data_df['tags'].apply(lambda x: add_tag(x, 402))
messi_events_data_df.head(2)
 | eventId | subEventName | tags | playerId | positions | matchId | eventName | teamId | matchPeriod | eventSec | subEventId | id | goal | assist | key_pass | left_foot | right_foot |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 8 | Simple pass | [{'id': 1801}] | 3359 | [{'y': 50, 'x': 50}, {'y': 50, 'x': 40}] | 2565554 | Pass | 676 | 1H | 1.012047 | 85 | 180465950 | False | False | False | False | False |
1 | 8 | Simple pass | [{'id': 1801}] | 3359 | [{'y': 64, 'x': 71}, {'y': 67, 'x': 54}] | 2565554 | Pass | 676 | 1H | 51.068905 | 85 | 180465968 | False | False | False | False | False |
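The ten assignments above repeat the same pattern for each tag. As a sketch of a more compact alternative, the column names and tag ids can live in one mapping. `TAG_COLUMNS` and `tag_flags` are hypothetical names, and the sample tags are made up.

```python
# Column name -> tag id, taken from the list of tags selected above.
TAG_COLUMNS = {
    "goal": 101,
    "assist": 301,
    "key_pass": 302,
    "left_foot": 401,
    "right_foot": 402,
}


def tag_flags(tags, tag_columns=TAG_COLUMNS):
    """Map each column name to True if its tag id appears in tags."""
    present = {tag["id"] for tag in tags}
    return {column: tag_id in present for column, tag_id in tag_columns.items()}


# Tags in the shape they have on a single event; this sample is made up.
sample_tags = [{"id": 101}, {"id": 401}, {"id": 1801}]
print(tag_flags(sample_tags))
```

With the DataFrames, the same mapping could drive a loop over `TAG_COLUMNS.items()` that calls `add_tag` once per column for each player.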
Adding matches information to the events DataFrames¶
messi_events_data_df = pd.merge(messi_events_data_df, barca_matches_df, left_on='matchId', right_on='wyId', copy=False, how="left")
ronaldo_events_data_df = pd.merge(ronaldo_events_data_df, real_matches_df, left_on='matchId', right_on='wyId', copy=False, how="left")
messi_events_data_df.head(2)
 | eventId | subEventName | tags | playerId | positions | matchId | eventName | teamId | matchPeriod | eventSec | ... | seasonId | dateutc | winner | venue | wyId | label | date | referees | duration | competitionId |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 8 | Simple pass | [{'id': 1801}] | 3359 | [{'y': 50, 'x': 50}, {'y': 50, 'x': 40}] | 2565554 | Pass | 676 | 1H | 1.012047 | ... | 181144 | 2017-08-20 18:15:00 | 676 | Camp Nou | 2565554 | Barcelona - Real Betis, 2 - 0 | August 20, 2017 at 8:15:00 PM GMT+2 | [{'refereeId': 398919, 'role': 'referee'}, {'r... | Regular | 795 |
1 | 8 | Simple pass | [{'id': 1801}] | 3359 | [{'y': 64, 'x': 71}, {'y': 67, 'x': 54}] | 2565554 | Pass | 676 | 1H | 51.068905 | ... | 181144 | 2017-08-20 18:15:00 | 676 | Camp Nou | 2565554 | Barcelona - Real Betis, 2 - 0 | August 20, 2017 at 8:15:00 PM GMT+2 | [{'refereeId': 398919, 'role': 'referee'}, {'r... | Regular | 795 |
2 rows × 31 columns
Saving Data to Disk¶
messi_events_data_df.to_pickle('../data/messi_events_data_df.pkl')
ronaldo_events_data_df.to_pickle('../data/ronaldo_events_data_df.pkl')
Getting matches dates¶
Next, we will create 2 DataFrames with all Real Madrid and FC Barcelona LaLiga matches during season 2017-18 and the corresponding dates.
barca_matches_dates_df = barca_matches_df[['label', 'date']].copy()
real_matches_dates_df = real_matches_df[['label', 'date']].copy()
barca_matches_dates_df['date'] = pd.to_datetime(barca_matches_df['date'], utc=True).dt.date
real_matches_dates_df['date'] = pd.to_datetime(real_matches_df['date'], utc=True).dt.date
#Change date to string
barca_matches_dates_df['date'] = barca_matches_dates_df['date'].apply(lambda x: x.strftime('%Y-%m-%d'))
real_matches_dates_df['date'] = real_matches_dates_df['date'].apply(lambda x: x.strftime('%Y-%m-%d'))
barca_matches_dates_df = barca_matches_dates_df.rename(columns={"label": "match"})
real_matches_dates_df = real_matches_dates_df.rename(columns={"label": "match"})
barca_matches_dates_df.head(2)
 | match | date |
---|---|---|
0 | Barcelona - Real Sociedad, 1 - 0 | 2018-05-20 |
1 | Levante - Barcelona, 5 - 4 | 2018-05-13 |
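The matches data also carries a `dateutc` column in a fixed `YYYY-MM-DD HH:MM:SS` format, as seen in the matches table earlier. As an alternative sketch, the date string could be derived from that column with the standard library alone. `utc_date_string` is a hypothetical helper, and it returns the UTC calendar date, which could differ from the local date for a match kicking off around midnight; the assumption here is that this does not matter for the app.

```python
from datetime import datetime


def utc_date_string(dateutc):
    """Convert 'YYYY-MM-DD HH:MM:SS' into a 'YYYY-MM-DD' string."""
    parsed = datetime.strptime(dateutc, "%Y-%m-%d %H:%M:%S")
    return parsed.strftime("%Y-%m-%d")


print(utc_date_string("2018-05-20 18:45:00"))  # 2018-05-20
```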
Saving to Disk¶
barca_matches_dates_df.to_pickle('../data/barca_matches_dates_df.pkl')
real_matches_dates_df.to_pickle('../data/real_matches_dates_df.pkl')
2. Analyzing the Data¶
In this section, we will analyze Messi and Ronaldo's events DataFrames. We will compute a few statistics and we will see how we can plot the events on a football pitch.
Total number of events broken down by player and event type¶
goals = [messi_events_data_df['goal'].sum(), ronaldo_events_data_df['goal'].sum()]
assists = [messi_events_data_df['assist'].sum(), ronaldo_events_data_df['assist'].sum()]
shots = [messi_events_data_df[messi_events_data_df['eventName'] == 'Shot'].count()['eventName'],
         ronaldo_events_data_df[ronaldo_events_data_df['eventName'] == 'Shot'].count()['eventName']]
free_kicks = [messi_events_data_df[messi_events_data_df['subEventName'] == 'Free kick shot'].count()['subEventName'],
              ronaldo_events_data_df[ronaldo_events_data_df['subEventName'] == 'Free kick shot'].count()['subEventName']]
passes = [messi_events_data_df[messi_events_data_df['eventName'] == 'Pass'].count()['eventName'],
          ronaldo_events_data_df[ronaldo_events_data_df['eventName'] == 'Pass'].count()['eventName']]
stats_df = pd.DataFrame([goals, assists, shots, free_kicks, passes],
                        columns=['Messi', 'Ronaldo'],
                        index=['Goals', 'Assists', 'Shots', 'Free Kicks', 'Passes'])
print(stats_df)
            Messi  Ronaldo
Goals          34       26
Assists        13        5
Shots         142      151
Free Kicks     47       15
Passes       1787      727
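One way to sanity-check totals like these is to count event names on the raw event dicts, before they go into a DataFrame. The sketch below uses `collections.Counter` on made-up events; `count_event_names` is a hypothetical helper.

```python
from collections import Counter


def count_event_names(events):
    """Count events per 'eventName' for a list of event dicts."""
    return Counter(event["eventName"] for event in events)


# Made-up events; only the field needed for counting is included.
sample_events = [
    {"eventName": "Pass"},
    {"eventName": "Shot"},
    {"eventName": "Pass"},
    {"eventName": "Duel"},
]

counts = count_event_names(sample_events)
print(counts["Pass"], counts["Shot"])
```

On a DataFrame, calling `value_counts()` on the `eventName` column gives the same kind of breakdown.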
Goals by left foot vs. right foot¶
messi_lf_goals = messi_events_data_df[messi_events_data_df['left_foot'] == True]['goal'].sum()
messi_rf_goals = messi_events_data_df[messi_events_data_df['right_foot'] == True]['goal'].sum()
print("Messi's goals with left foot: ", messi_lf_goals)
print("Messi's goals with right foot: ", messi_rf_goals)
Messi's goals with left foot: 32
Messi's goals with right foot: 2
ronaldo_lf_goals = ronaldo_events_data_df[ronaldo_events_data_df['left_foot'] == True]['goal'].sum()
ronaldo_rf_goals = ronaldo_events_data_df[ronaldo_events_data_df['right_foot'] == True]['goal'].sum()
print("Ronaldo's goals with left foot: ", ronaldo_lf_goals)
print("Ronaldo's goals with right foot: ", ronaldo_rf_goals)
Ronaldo's goals with left foot: 7
Ronaldo's goals with right foot: 14
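Messi's left-foot and right-foot goals add up to his total of 34, but Ronaldo's add up to 21 rather than 26. The short check below makes that arithmetic explicit using only the figures printed above. The remaining goals presumably carry neither foot tag (for example headers), which is an assumption that the original analysis does not verify.

```python
# Figures copied from the outputs above.
totals = {"Messi": 34, "Ronaldo": 26}
left_foot = {"Messi": 32, "Ronaldo": 7}
right_foot = {"Messi": 2, "Ronaldo": 14}


def goals_without_foot_tag(player):
    """Goals carrying neither the left-foot nor the right-foot tag."""
    return totals[player] - left_foot[player] - right_foot[player]


for player in totals:
    print(player, goals_without_foot_tag(player))
```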
Plotting the events on a football pitch¶
For each event in messi_events_data_df and ronaldo_events_data_df, we have the origin and destination positions associated with the event. Each position is a pair of coordinates (x, y). The x and y coordinates are always in the range [0, 100] and indicate the percentage of the field from the perspective of the attacking team [2]. We will use these positions to plot the events on a football pitch.
messi_events_data_df['positions'].head()
0    [{'y': 50, 'x': 50}, {'y': 50, 'x': 40}]
1    [{'y': 64, 'x': 71}, {'y': 67, 'x': 54}]
2    [{'y': 62, 'x': 62}, {'y': 64, 'x': 69}]
3    [{'y': 64, 'x': 69}, {'y': 74, 'x': 83}]
4    [{'y': 74, 'x': 83}, {'y': 61, 'x': 77}]
Name: positions, dtype: object
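Because the coordinates are percentages, they need to be scaled before they can be drawn on a pitch with real dimensions. The sketch below assumes a 105 m by 68 m pitch, a common size that is not stated in the dataset, and treats the first position as the origin and the second as the destination. `to_metres` and `origin_and_destination` are hypothetical helpers and not necessarily how the plotting code used later does it.

```python
def to_metres(position, pitch_length=105.0, pitch_width=68.0):
    """Scale a {'x': %, 'y': %} position to metres on the pitch."""
    return (
        position["x"] * pitch_length / 100,
        position["y"] * pitch_width / 100,
    )


def origin_and_destination(positions):
    """Split an event's positions list into (origin, destination) in metres.

    If an event carries a single position, destination is None.
    """
    origin = to_metres(positions[0])
    destination = to_metres(positions[1]) if len(positions) > 1 else None
    return origin, destination


# First row of the output above: a pass starting at the centre of the pitch.
print(origin_and_destination([{"y": 50, "x": 50}, {"y": 50, "x": 40}]))
```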
from plots import *
from bokeh.io import output_notebook
from bokeh.plotting import figure, show
output_notebook()
We will use Bokeh to draw the football pitch and plot the events. I prepared two Python functions to simplify both tasks:
- draw_pitch(): Draws an empty pitch.
- plot_events(player_events, event_name, plot_color): Takes the events to plot, an event name and a color, and plots the events on a football pitch.
You can find the source code of both functions here.
messi_goals = messi_events_data_df[messi_events_data_df['goal'] == True]['positions']
p_messi = plot_events(messi_goals, 'Goals', 'red')
show(p_messi)
3. Building the Web App¶
Now that we understand how to read, structure and plot the data, we can start building the web app. The goal of the app is to compare the games of Messi and Ronaldo by focusing on goals, assists, shots, free kicks and passes.
The app will have one tab for each event type. In each tab, we will show the statistics and positions of the events, as well as the breakdown of event counts by game. The app will also have a filter that we can use to select the events by left or right foot.
We will use an open-source app framework called Streamlit. Streamlit is a Python library that can be installed with pip (pip install streamlit). It is easy to use and lets us create web applications in Python alone, without writing HTML, JavaScript or CSS.
You can download the source from this GitHub repo and start the application by running streamlit run app.py from your terminal, then opening http://localhost:8501 in your browser.
Breakdown of the code¶
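The first function, get_data(foot), reads the pickle files and returns the messi_events_data_df, ronaldo_events_data_df, barca_matches_dates_df and real_matches_dates_df DataFrames. It also filters the events by left or right foot if we pass Left or Right as the parameter.
The decorator @st.cache(allow_output_mutation=True) caches the return value of get_data(foot), so the pickle files are only read again when the function is called with a value of foot that it has not seen before. Setting allow_output_mutation=True tells Streamlit not to check whether the returned DataFrames were modified after being cached. Note that recent Streamlit releases deprecate st.cache in favor of st.cache_data.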
@st.cache(allow_output_mutation=True)
def get_data(foot):
    .
    .
    .
    return messi_events_data_df, ronaldo_events_data_df, barca_matches_dates_df, real_matches_dates_df
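To illustrate what caching on the foot argument means in practice, here is a minimal sketch that swaps in `functools.lru_cache` from the standard library as a stand-in for Streamlit's decorator. It is not how Streamlit implements caching, and the events and the `get_sample_data` function are made up for the example. The point is only that the function body runs once per distinct value of foot.

```python
from functools import lru_cache

# Made-up events standing in for the pickled DataFrames.
SAMPLE_EVENTS = (
    {"goal": True, "left_foot": True, "right_foot": False},
    {"goal": True, "left_foot": False, "right_foot": True},
    {"goal": False, "left_foot": True, "right_foot": False},
)


@lru_cache(maxsize=None)
def get_sample_data(foot):
    """Return the sample events, filtered when foot is 'Left' or 'Right'."""
    if foot == "Left":
        return tuple(e for e in SAMPLE_EVENTS if e["left_foot"])
    if foot == "Right":
        return tuple(e for e in SAMPLE_EVENTS if e["right_foot"])
    return SAMPLE_EVENTS


get_sample_data("Left")
get_sample_data("Left")   # served from the cache, the body is not re-run
get_sample_data("Right")

info = get_sample_data.cache_info()
print(info.misses, info.hits)
```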
Creating tabs for each event type¶
Streamlit is a powerful library for building web apps and user interfaces; however, at the time of writing, it doesn't support the creation of tabs natively. To add tabs to the application, we will use Bokeh and the method described here.
For each event type, we have a function that takes the four DataFrames as input and, for each player, draws the event positions on a football pitch along with a table showing the breakdown of events by game. The function combines the two plots and the two tables in a Bokeh grid and returns the grid as a Bokeh Panel.
def plot_goals(messi_events_data_df, ronaldo_events_data_df, barca_matches_dates_df, real_matches_dates_df):
    # Getting events data positions
    messi_goals = messi_events_data_df[messi_events_data_df['goal'] == True]['positions']
    ronaldo_goals = ronaldo_events_data_df[ronaldo_events_data_df['goal'] == True]['positions']
    # Pitch with events
    p_messi = plot_events(messi_goals, 'Goals', 'red')
    p_ronaldo = plot_events(ronaldo_goals, 'Goals', 'blue')
    ....
    grid = bokeh.layouts.grid(
        children=[
            [p_messi, p_ronaldo],
            [print_table(messi_stats_df), print_table(ronaldo_stats_df)],
        ],
        sizing_mode="stretch_width",
    )
    return bokeh.models.Panel(child=grid, title="Goals")
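The part of plot_goals that builds the per-game tables is elided above. As a rough idea of what a breakdown of events by game involves, the sketch below counts events per match label with `collections.Counter`. It is an illustration on made-up data, not the code from app.py, and `events_per_match` is a hypothetical helper.

```python
from collections import Counter


def events_per_match(events):
    """Count events per match label, keeping first-seen match order."""
    return Counter(event["label"] for event in events)


# Made-up goal events; 'label' mirrors the column merged in from the matches data.
sample_goals = [
    {"label": "Barcelona - Real Betis, 2 - 0"},
    {"label": "Levante - Barcelona, 5 - 4"},
    {"label": "Levante - Barcelona, 5 - 4"},
]

for match, count in events_per_match(sample_goals).items():
    print(match, count)
```

On the events DataFrames, grouping by the `label` column and taking the size of each group would produce the same counts.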
In the main function at the end of app.py, you can see how the five functions are used to create one tab per event type.
tabs = bokeh.models.Tabs(
    tabs=[
        plot_goals(messi_events_data_df, ronaldo_events_data_df,
                   barca_matches_dates_df, real_matches_dates_df),
        plot_assists(messi_events_data_df, ronaldo_events_data_df,
                     barca_matches_dates_df, real_matches_dates_df),
        plot_shots(messi_events_data_df, ronaldo_events_data_df,
                   barca_matches_dates_df, real_matches_dates_df),
        plot_free_kicks(messi_events_data_df, ronaldo_events_data_df,
                        barca_matches_dates_df, real_matches_dates_df),
        plot_passes(messi_events_data_df, ronaldo_events_data_df,
                    barca_matches_dates_df, real_matches_dates_df),
    ]
)
Left/Right Foot Filter¶
In the main function, we define a Streamlit radio button in the sidebar that we can use to filter the data by foot.
foot = st.sidebar.radio("Foot", ('Either Left or Right', 'Left', 'Right'))
messi_events_data_df, ronaldo_events_data_df, barca_matches_dates_df, real_matches_dates_df = get_data(foot)
Stats of both players as Table¶
In the main function, we calculate the stats of both Messi and Ronaldo and display them as a DataFrame in the sidebar using st.sidebar.dataframe.
goals = [messi_events_data_df['goal'].sum(), ronaldo_events_data_df['goal'].sum()]
assists = [messi_events_data_df['assist'].sum(), ronaldo_events_data_df['assist'].sum()]
shots = [messi_events_data_df[messi_events_data_df['eventName'] == 'Shot'].count()['eventName'],
         ronaldo_events_data_df[ronaldo_events_data_df['eventName'] == 'Shot'].count()['eventName']]
# The second entry must count Ronaldo's free kicks, not Messi's again
free_kicks = [messi_events_data_df[messi_events_data_df['subEventName'] == 'Free kick shot'].count()['subEventName'],
              ronaldo_events_data_df[ronaldo_events_data_df['subEventName'] == 'Free kick shot'].count()['subEventName']]
passes = [messi_events_data_df[messi_events_data_df['eventName'] == 'Pass'].count()['eventName'],
          ronaldo_events_data_df[ronaldo_events_data_df['eventName'] == 'Pass'].count()['eventName']]
stats_df = pd.DataFrame([goals, assists, shots, free_kicks, passes],
                        columns=['Messi', 'Ronaldo'],
                        index=['Goals', 'Assists', 'Shots', 'Free Kicks', 'Passes'])
st.sidebar.markdown(""" ### Stats """)
st.sidebar.dataframe(stats_df)
Conclusion¶
In this blog post, we saw how to build a web app that analyzes Messi's and Ronaldo's games during the 2017-18 LaLiga season using Python and Streamlit. The dataset and the source code from this post can be adapted to implement other use cases, for example comparisons between other players, teams and even championships.
References¶
[1] Pappalardo et al. (2019), A public data set of spatio-temporal match events in soccer competitions, Scientific Data 6:236, https://www.nature.com/articles/s41597-019-0247-7