Adil Moujahid

Published

Mon 15 June 2020


Analyzing Messi and Ronaldo's Games using Python and Streamlit



In 2003, Michael Lewis published "Moneyball", a book about Billy Beane, the Oakland Athletics general manager who applied statistical analysis to baseball in order to identify and recruit undervalued players. With the use of data, Billy Beane achieved as many wins as teams with more than double the payroll and reached the play-offs in 4 successive years, from 2000 to 2003.

In 2011, Moneyball was adapted into a movie, with Brad Pitt playing Billy Beane. Both the book and the movie were a success and popularized the idea of using data to improve sports teams' performance. The use of data in sport is often referred to as sports analytics.

In baseball, the nature of the sport makes it easy to collect a lot of data points about in-game action. A database covering in-game data points and other statistics about players and teams going back to 1871 can be downloaded from this link. If you're interested in analyzing baseball data, you can find here a blog post on the topic that I wrote a few years back.

In the case of football (soccer), data collection is more complex. Football is a dynamic sport with 22 players on the pitch and unlimited possibilities for ball movement and player positioning. Fortunately, thanks to advances in sensors and video analysis over the last few years, it is now possible to obtain high-quality football data that can be used to analyze games, teams and players. In this blog post, we will use an open collection of football logs to create a web app that analyzes Messi and Ronaldo's games during the 2017-18 LaLiga season [1]. We will use Python and Streamlit to create an interactive web app that compares both players' stats and shows their positions on the pitch.

I would like to thank Luca Pappalardo and his colleagues for making this great dataset available to the public.

Below is an animated GIF of the application that we will build. You can find the source code in this GitHub repository.

[Image: animated GIF of the application]

1. Getting, Reading and Structuring the Data

Messi and Ronaldo dominated world football during the last decade with a combined 11 FIFA Ballon d'Or awards (six for Messi and five for Ronaldo). Both players are considered to be amongst the greatest players of all time and they're frequently compared to each other.

In this blog post, we will analyze the games of both players during LaLiga (Spanish League) season 2017-18. This was Ronaldo's last season in Spain before moving to Juventus.

1.1. Getting the Data

To start with, we need to download, from this link, the datasets introduced in the paper "A public data set of spatio-temporal match events in soccer competitions". We need the following files:

  • matches/matches_Spain.json: Information about LaLiga (Spanish football league) season 2017-18 matches.
  • events/events_Spain.json: All the events that occur during each match of LaLiga season 2017-18.
  • players.json: All players of the teams playing in seven national and international soccer competitions (Italian, Spanish, French, German, English first divisions, World Cup 2018, European Cup 2016).
  • teams.json: All teams in seven prominent soccer competitions (Italian, Spanish, German, French and English first divisions, World Cup 2018, European Cup 2016).
  • tags2name.csv: Mapping of tag identifiers to tag names

1.2. Reading the Data

We start by importing the different Python libraries that we need.

In [1]:
import json
import unicodedata
import numpy as np
import pandas as pd

From players.json, we can find the player id "wyId" of both players:

  • 3359 for Messi
  • 3322 for Ronaldo

From teams.json, we can also find the team id "wyId" of both teams:

  • 676 for FC Barcelona
  • 675 for Real Madrid
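The ids above can also be looked up in code rather than by reading the JSON by hand. The following is a minimal sketch, not the method used in this post: find_ids is a hypothetical helper, the 'shortName' key is an assumption about the layout of players.json (check the keys in your copy), and the sample records stand in for the loaded file. It uses unicodedata to make the search accent-insensitive.

```python
import unicodedata

def normalize(text):
    """Lower-case a name and strip accents so 'Pérez' matches 'perez'."""
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch)).lower()

def find_ids(records, name_key, query):
    """Return (name, wyId) pairs whose name contains the query string."""
    needle = normalize(query)
    return [(rec[name_key], rec['wyId'])
            for rec in records
            if needle in normalize(rec[name_key])]

# Illustrative stand-in for the list loaded from players.json.
# The 'shortName' key is an assumption about the file layout.
sample_players = [
    {'shortName': 'L. Messi', 'wyId': 3359},
    {'shortName': 'Cristiano Ronaldo', 'wyId': 3322},
    {'shortName': 'Example Pérez', 'wyId': 0},  # placeholder record, not real data
]

print(find_ids(sample_players, 'shortName', 'messi'))
print(find_ids(sample_players, 'shortName', 'perez'))
```

The same helper should work for teams.json if you pass the key that holds the team name in that file.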

Next, we read Spain matches and events datasets:

In [2]:
with open('../data/matches/matches_Spain.json') as json_file:
    matches_spain_data = json.load(json_file)

with open('../data/events/events_Spain.json') as json_file:
    events_spain_data = json.load(json_file)

1.3. Structuring the Data

Structuring Messi and Ronaldo's events data

Next, we will structure all Real Madrid and FC Barcelona matches information into 2 Pandas DataFrames.

In [3]:
barca_matches = [match for match in matches_spain_data if '676' in match['teamsData']]
real_matches = [match for match in matches_spain_data if '675' in match['teamsData']]
In [4]:
barca_matches_df = pd.DataFrame(barca_matches)
real_matches_df = pd.DataFrame(real_matches)
In [5]:
barca_matches_df.head(2)
Out[5]:
status roundId gameweek teamsData seasonId dateutc winner venue wyId label date referees duration competitionId
0 Played 4406122 38 {'676': {'scoreET': 0, 'coachId': 92894, 'side... 181144 2018-05-20 18:45:00 676 Camp Nou 2565922 Barcelona - Real Sociedad, 1 - 0 May 20, 2018 at 8:45:00 PM GMT+2 [{'refereeId': 398931, 'role': 'referee'}, {'r... Regular 795
1 Played 4406122 37 {'676': {'scoreET': 0, 'coachId': 92894, 'side... 181144 2018-05-13 18:45:00 695 Estadio Ciudad de Valencia 2565917 Levante - Barcelona, 5 - 4 May 13, 2018 at 8:45:00 PM GMT+2 [{'refereeId': 420995, 'role': 'referee'}, {'r... Regular 795
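The team filter can be wrapped in a small function so that it works for any team id. This is a sketch under the assumption that every match record has a 'teamsData' mapping keyed by team id, as in the output above; matches_for_team is a hypothetical name, and the sample records are made up to keep the example self-contained.

```python
def matches_for_team(matches, team_id):
    """Return the matches whose 'teamsData' mapping has team_id as a key.

    JSON object keys are always strings, so the id is converted with str()
    and callers may pass either 676 or '676'.
    """
    key = str(team_id)
    return [match for match in matches if key in match['teamsData']]

# Tiny illustrative input with the same shape as the match records.
sample_matches = [
    {'wyId': 1, 'teamsData': {'676': {}, '675': {}}},
    {'wyId': 2, 'teamsData': {'676': {}, '695': {}}},
    {'wyId': 3, 'teamsData': {'675': {}, '695': {}}},
]

barca = matches_for_team(sample_matches, 676)
real = matches_for_team(sample_matches, '675')
print([m['wyId'] for m in barca])
print([m['wyId'] for m in real])
```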

Next, we will structure all Messi and Ronaldo's events data into 2 Pandas DataFrames.

In [6]:
messi_events_data = []
for event in events_spain_data:
    if event['playerId'] == 3359:
        messi_events_data.append(event)
        
messi_events_data_df = pd.DataFrame(messi_events_data)
In [7]:
ronaldo_events_data = []
for event in events_spain_data:
    if event['playerId'] == 3322:
        ronaldo_events_data.append(event)

ronaldo_events_data_df = pd.DataFrame(ronaldo_events_data)
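The two loops above each scan the full events list. If that becomes slow, both players can be collected in a single pass. The sketch below is one possible way to do it; group_events_by_player is a hypothetical helper and the sample events are made up, keeping only the 'playerId' and 'id' keys.

```python
from collections import defaultdict

def group_events_by_player(events, player_ids):
    """Collect the events of several players in one pass over the list."""
    wanted = set(player_ids)
    grouped = defaultdict(list)
    for event in events:
        if event['playerId'] in wanted:
            grouped[event['playerId']].append(event)
    return grouped

# Made-up events; real records carry many more keys.
sample_events = [
    {'id': 1, 'playerId': 3359},
    {'id': 2, 'playerId': 3322},
    {'id': 3, 'playerId': 1234},
    {'id': 4, 'playerId': 3359},
]

grouped = group_events_by_player(sample_events, [3359, 3322])
print(len(grouped[3359]), len(grouped[3322]))
```

Each list in the result can then be passed to pd.DataFrame exactly as in the cells above.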

Adding tags to events data

From tags2name.csv, we select the event tags that are interesting for our analysis.

  • 101: Goal
  • 301: Assist
  • 302: Key Pass
  • 401: Left Foot
  • 402: Right Foot

We add these tags as new columns in the events DataFrames.

In [8]:
def add_tag(tags, tag_id):
    return tag_id in [tag['id'] for tag in tags]
In [9]:
messi_events_data_df.columns
Out[9]:
Index(['eventId', 'subEventName', 'tags', 'playerId', 'positions', 'matchId',
       'eventName', 'teamId', 'matchPeriod', 'eventSec', 'subEventId', 'id'],
      dtype='object')
In [10]:
messi_events_data_df['goal'] = messi_events_data_df['tags'].apply(lambda x: add_tag(x, 101))
messi_events_data_df['assist'] = messi_events_data_df['tags'].apply(lambda x: add_tag(x, 301))
messi_events_data_df['key_pass'] = messi_events_data_df['tags'].apply(lambda x: add_tag(x, 302))
messi_events_data_df['left_foot'] = messi_events_data_df['tags'].apply(lambda x: add_tag(x, 401))
messi_events_data_df['right_foot'] = messi_events_data_df['tags'].apply(lambda x: add_tag(x, 402))

ronaldo_events_data_df['goal'] = ronaldo_events_data_df['tags'].apply(lambda x: add_tag(x, 101))
ronaldo_events_data_df['assist'] = ronaldo_events_data_df['tags'].apply(lambda x: add_tag(x, 301))
ronaldo_events_data_df['key_pass'] = ronaldo_events_data_df['tags'].apply(lambda x: add_tag(x, 302))
ronaldo_events_data_df['left_foot'] = ronaldo_events_data_df['tags'].apply(lambda x: add_tag(x, 401))
ronaldo_events_data_df['right_foot'] = ronaldo_events_data_df['tags'].apply(lambda x: add_tag(x, 402))
In [11]:
messi_events_data_df.head(2)
Out[11]:
eventId subEventName tags playerId positions matchId eventName teamId matchPeriod eventSec subEventId id goal assist key_pass left_foot right_foot
0 8 Simple pass [{'id': 1801}] 3359 [{'y': 50, 'x': 50}, {'y': 50, 'x': 40}] 2565554 Pass 676 1H 1.012047 85 180465950 False False False False False
1 8 Simple pass [{'id': 1801}] 3359 [{'y': 64, 'x': 71}, {'y': 67, 'x': 54}] 2565554 Pass 676 1H 51.068905 85 180465968 False False False False False
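The ten assignments above repeat the same pattern, so they can be driven by a mapping from column name to tag id. This is a minimal sketch of that idea, assuming the 'tags' column holds lists of {'id': ...} dictionaries as shown in the output; add_tag_columns and TAG_COLUMNS are names introduced here for illustration.

```python
import pandas as pd

# Tag ids taken from the list above (tags2name.csv).
TAG_COLUMNS = {'goal': 101, 'assist': 301, 'key_pass': 302,
               'left_foot': 401, 'right_foot': 402}

def add_tag_columns(events_df, tag_columns=TAG_COLUMNS):
    """Return a copy of events_df with one boolean column per tag."""
    out = events_df.copy()
    # Build the set of tag ids once per row instead of once per tag.
    tag_sets = out['tags'].apply(lambda tags: {tag['id'] for tag in tags})
    for column, tag_id in tag_columns.items():
        out[column] = tag_sets.apply(lambda ids, t=tag_id: t in ids)
    return out

# Made-up rows: a left-footed goal, a plain pass, and an event with no tags.
sample = pd.DataFrame({'tags': [[{'id': 101}, {'id': 401}], [{'id': 1801}], []]})
tagged = add_tag_columns(sample)
print(tagged[['goal', 'left_foot', 'right_foot']])
```

Calling add_tag_columns once per player DataFrame should give the same columns as the cell above, with the tag list kept in one place.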

Adding matches information to the events DataFrames

In [12]:
messi_events_data_df = pd.merge(messi_events_data_df, barca_matches_df, left_on='matchId', right_on='wyId', copy=False, how="left")
In [13]:
ronaldo_events_data_df = pd.merge(ronaldo_events_data_df, real_matches_df, left_on='matchId', right_on='wyId', copy=False, how="left")
In [14]:
messi_events_data_df.head(2)
Out[14]:
eventId subEventName tags playerId positions matchId eventName teamId matchPeriod eventSec ... seasonId dateutc winner venue wyId label date referees duration competitionId
0 8 Simple pass [{'id': 1801}] 3359 [{'y': 50, 'x': 50}, {'y': 50, 'x': 40}] 2565554 Pass 676 1H 1.012047 ... 181144 2017-08-20 18:15:00 676 Camp Nou 2565554 Barcelona - Real Betis, 2 - 0 August 20, 2017 at 8:15:00 PM GMT+2 [{'refereeId': 398919, 'role': 'referee'}, {'r... Regular 795
1 8 Simple pass [{'id': 1801}] 3359 [{'y': 64, 'x': 71}, {'y': 67, 'x': 54}] 2565554 Pass 676 1H 51.068905 ... 181144 2017-08-20 18:15:00 676 Camp Nou 2565554 Barcelona - Real Betis, 2 - 0 August 20, 2017 at 8:15:00 PM GMT+2 [{'refereeId': 398919, 'role': 'referee'}, {'r... Regular 795

2 rows × 31 columns
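A left merge like the one above silently duplicates event rows if a match id appears more than once on the right side. The sketch below shows two optional pandas safeguards, validate and indicator, on made-up data; it is an illustration of the options, not part of the original notebook.

```python
import pandas as pd

# Made-up stand-ins for the events and matches DataFrames.
events = pd.DataFrame({'id': [10, 11, 12], 'matchId': [1, 1, 2]})
matches = pd.DataFrame({'wyId': [1, 2],
                        'label': ['A - B, 1 - 0', 'C - A, 2 - 2']})

merged = pd.merge(
    events, matches,
    left_on='matchId', right_on='wyId',
    how='left',
    validate='many_to_one',  # raises MergeError if a wyId repeats in matches
    indicator=True,          # adds a '_merge' column: 'both' or 'left_only'
)

# Events without a matching row in matches would be flagged 'left_only'.
unmatched = merged[merged['_merge'] == 'left_only']
print(len(merged), len(unmatched))
```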

Saving Data to Disk

In [15]:
messi_events_data_df.to_pickle('../data/messi_events_data_df.pkl')
ronaldo_events_data_df.to_pickle('../data/ronaldo_events_data_df.pkl')

Getting matches dates

Next, we will create 2 DataFrames with all Real Madrid and FC Barcelona LaLiga matches during season 2017-18 and the corresponding dates.

In [16]:
barca_matches_dates_df = barca_matches_df[['label', 'date']].copy()
real_matches_dates_df = real_matches_df[['label', 'date']].copy()
In [17]:
barca_matches_dates_df['date'] = pd.to_datetime(barca_matches_df['date'], utc=True).dt.date
real_matches_dates_df['date'] = pd.to_datetime(real_matches_df['date'], utc=True).dt.date
In [18]:
#Change date to string 
barca_matches_dates_df['date'] = barca_matches_dates_df['date'].apply(lambda x: x.strftime('%Y-%m-%d'))
real_matches_dates_df['date'] = real_matches_dates_df['date'].apply(lambda x: x.strftime('%Y-%m-%d'))
In [19]:
barca_matches_dates_df = barca_matches_dates_df.rename(columns={"label": "match"})
real_matches_dates_df = real_matches_dates_df.rename(columns={"label": "match"})
In [20]:
barca_matches_dates_df.head(2)
Out[20]:
match date
0 Barcelona - Real Sociedad, 1 - 0 2018-05-20
1 Levante - Barcelona, 5 - 4 2018-05-13
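The steps above parse the free-text 'date' column. As an alternative sketch, the 'dateutc' column shown earlier already has a fixed 'YYYY-MM-DD HH:MM:SS' layout and can be parsed with an explicit format. Note the assumption: this yields the UTC date, which could differ from the local date for a match that kicks off around midnight. The sample rows reuse the two matches printed above.

```python
import pandas as pd

sample = pd.DataFrame({
    'label': ['Barcelona - Real Sociedad, 1 - 0', 'Levante - Barcelona, 5 - 4'],
    'dateutc': ['2018-05-20 18:45:00', '2018-05-13 18:45:00'],
})

# Select, rename and convert in one chain; the result holds plain strings.
dates_df = sample[['label', 'dateutc']].rename(
    columns={'label': 'match', 'dateutc': 'date'})
dates_df['date'] = (
    pd.to_datetime(dates_df['date'], format='%Y-%m-%d %H:%M:%S')
    .dt.strftime('%Y-%m-%d')
)
print(dates_df)
```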

Saving to Disk

In [21]:
barca_matches_dates_df.to_pickle('../data/barca_matches_dates_df.pkl')
real_matches_dates_df.to_pickle('../data/real_matches_dates_df.pkl')

2. Analyzing the Data

In this section, we will analyze Messi and Ronaldo's events DataFrames. We will compute a few statistics and we will see how we can plot the events on a football pitch.

Total number of events broken down by player and event type

In [22]:
goals = [messi_events_data_df['goal'].sum(), ronaldo_events_data_df['goal'].sum()]
assists = [messi_events_data_df['assist'].sum(), ronaldo_events_data_df['assist'].sum()]
shots = [messi_events_data_df[messi_events_data_df['eventName'] == 'Shot'].count()['eventName'],
         ronaldo_events_data_df[ronaldo_events_data_df['eventName'] == 'Shot'].count()['eventName']]
free_kicks = [messi_events_data_df[messi_events_data_df['subEventName'] == 'Free kick shot'].count()['subEventName'], 
              ronaldo_events_data_df[ronaldo_events_data_df['subEventName'] == 'Free kick shot'].count()['subEventName']]
passes = [messi_events_data_df[messi_events_data_df['eventName'] == 'Pass'].count()['eventName'],
          ronaldo_events_data_df[ronaldo_events_data_df['eventName'] == 'Pass'].count()['eventName']]

stats_df = pd.DataFrame([goals, assists, shots, free_kicks, passes], 
                        columns=['Messi', 'Ronaldo'], 
                        index=['Goals', 'Assists', 'Shots', 'Free Kicks', 'Passes'])

print(stats_df)
            Messi  Ronaldo
Goals          34       26
Assists        13        5
Shots         142      151
Free Kicks     47       15
Passes       1787      727
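Because the same five counts are computed for each player, they can be moved into one function, which avoids copy-paste slips between the two players. This is a sketch on made-up rows, assuming the 'goal' and 'assist' columns are boolean as built earlier; player_stats is a hypothetical helper.

```python
import pandas as pd

def player_stats(events_df):
    """Count goals, assists, shots, free kicks and passes for one player."""
    return {
        'Goals': int(events_df['goal'].sum()),
        'Assists': int(events_df['assist'].sum()),
        'Shots': int((events_df['eventName'] == 'Shot').sum()),
        'Free Kicks': int((events_df['subEventName'] == 'Free kick shot').sum()),
        'Passes': int((events_df['eventName'] == 'Pass').sum()),
    }

# Made-up events, only to exercise the function.
messi_sample = pd.DataFrame({
    'eventName': ['Pass', 'Shot', 'Free Kick', 'Pass'],
    'subEventName': ['Simple pass', 'Shot', 'Free kick shot', 'Simple pass'],
    'goal': [False, True, False, False],
    'assist': [True, False, False, False],
})
ronaldo_sample = pd.DataFrame({
    'eventName': ['Shot', 'Shot', 'Pass'],
    'subEventName': ['Shot', 'Shot', 'Simple pass'],
    'goal': [True, False, False],
    'assist': [False, False, False],
})

stats_df = pd.DataFrame({'Messi': player_stats(messi_sample),
                         'Ronaldo': player_stats(ronaldo_sample)})
print(stats_df)
```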

Goals by left foot vs. right foot

In [23]:
messi_lf_goals = messi_events_data_df[messi_events_data_df['left_foot'] == True]['goal'].sum()
messi_rf_goals = messi_events_data_df[messi_events_data_df['right_foot'] == True]['goal'].sum()

print("Messi's goals with left foot: ", messi_lf_goals)
print("Messi's goals with right foot: ", messi_rf_goals)
Messi's goals with left foot:  32
Messi's goals with right foot:  2
In [24]:
ronaldo_lf_goals = ronaldo_events_data_df[ronaldo_events_data_df['left_foot'] == True]['goal'].sum()
ronaldo_rf_goals = ronaldo_events_data_df[ronaldo_events_data_df['right_foot'] == True]['goal'].sum()
In [25]:
print("Ronaldo's goals with left foot: ", ronaldo_lf_goals)
print("Ronaldo's goals with right foot: ", ronaldo_rf_goals)
Ronaldo's goals with left foot:  7
Ronaldo's goals with right foot:  14
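Ronaldo's two counts add up to 21, while the stats table lists 26 goals, so some goal events carry neither foot tag (headers would be one plausible explanation, though the output above does not confirm it). The sketch below makes that remainder explicit. It assumes no goal event carries both foot tags; goals_by_body_part is a hypothetical helper and the sample rows are made up.

```python
import pandas as pd

def goals_by_body_part(events_df):
    """Split goal events into left foot, right foot and everything else."""
    goals = events_df[events_df['goal']]
    left = int(goals['left_foot'].sum())
    right = int(goals['right_foot'].sum())
    # Assumes a goal is never tagged with both feet at once.
    return {'left_foot': left, 'right_foot': right,
            'other': len(goals) - left - right}

# Made-up events: three goals (left, right, neither) and one non-goal.
sample = pd.DataFrame({
    'goal': [True, True, True, False],
    'left_foot': [True, False, False, True],
    'right_foot': [False, True, False, False],
})
print(goals_by_body_part(sample))
```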

Plotting the events on a football pitch

For each event in messi_events_data_df and ronaldo_events_data_df, we have the origin and destination positions associated with the event. Each position is a pair of coordinates (x, y). The x and y coordinates are always in the range [0, 100] and indicate the percentage of the field from the perspective of the attacking team. [2] We will use these positions to plot the events on a football pitch.

In [26]:
messi_events_data_df['positions'].head()
Out[26]:
0    [{'y': 50, 'x': 50}, {'y': 50, 'x': 40}]
1    [{'y': 64, 'x': 71}, {'y': 67, 'x': 54}]
2    [{'y': 62, 'x': 62}, {'y': 64, 'x': 69}]
3    [{'y': 64, 'x': 69}, {'y': 74, 'x': 83}]
4    [{'y': 74, 'x': 83}, {'y': 61, 'x': 77}]
Name: positions, dtype: object
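Since the coordinates are percentages, they have to be scaled before distances can be read off. The sketch below converts a 'positions' entry to metres, assuming a 105 m by 68 m pitch (an assumption; real pitch sizes vary and the dataset does not state one). The helper names are introduced here for illustration.

```python
PITCH_LENGTH_M = 105.0  # assumed pitch length in metres
PITCH_WIDTH_M = 68.0    # assumed pitch width in metres

def to_metres(position, length=PITCH_LENGTH_M, width=PITCH_WIDTH_M):
    """Scale a {'x': .., 'y': ..} percentage position to metres."""
    return (position['x'] * length / 100, position['y'] * width / 100)

def origin_and_destination(positions):
    """Return the first and last position of an event, both in metres."""
    return to_metres(positions[0]), to_metres(positions[-1])

# First row of the output above.
origin, destination = origin_and_destination(
    [{'y': 50, 'x': 50}, {'y': 50, 'x': 40}])
print(origin, destination)
```

Using the first and last entries means an event with a single position would get the same point for origin and destination.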
In [27]:
from plots import *
from bokeh.io import output_notebook
from bokeh.plotting import figure, show
output_notebook()
Loading BokehJS ...

We will use Bokeh to draw the football pitch and plot the events. I prepared 2 Python functions to simplify both tasks:

  • draw_pitch(): Function to draw an empty pitch
  • plot_events(player_events, event_name, plot_color): Function that takes as input the events DataFrame, an event name and a color, and plots the events on a football pitch

You can find the source code of both functions here.

In [28]:
messi_goals = messi_events_data_df[messi_events_data_df['goal'] == True]['positions']
In [29]:
p_messi = plot_events(messi_goals, 'Goals', 'red')
In [30]:
show(p_messi)

3. Building the Web App

Now that we understand how to read, structure and plot the data, we can start building the web app. The goal of the app is to compare the games of Messi and Ronaldo by focusing on Goals, Assists, Shots, Free Kicks and Passes.

The app will have one tab for each event type. In each tab, we will show statistics and positions of the events, as well as the breakdown of event counts by game. The app will also have a filter that we can use to select the events by left/right foot.

We will use an open-source app framework called Streamlit. Streamlit is a Python library that can be installed with pip. It is easy to use and allows us to create web applications in Python only, without writing HTML/JS/CSS code.

You can download the source from this GitHub repo. To start the application, run the command streamlit run app.py from your terminal and open http://localhost:8501 in your browser.


Breakdown of the code

The first function, get_data(foot), reads the pickle files and returns the messi_events_data_df, ronaldo_events_data_df, barca_matches_dates_df and real_matches_dates_df DataFrames. It also filters the events by left/right foot if we pass Left or Right as the parameter.

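The decorator @st.cache(allow_output_mutation=True) caches the return value of get_data(foot), so the pickle files are read again only when the function is called with a value of foot that has not been seen before. The option allow_output_mutation=True lets us modify the returned DataFrames without Streamlit warning that the cached value has changed. Note that newer Streamlit releases deprecate st.cache in favor of st.cache_data.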

@st.cache(allow_output_mutation=True)
def get_data(foot):
    .
    .
    .

    return messi_events_data_df, ronaldo_events_data_df, barca_matches_dates_df, real_matches_dates_df
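The body of get_data(foot) is elided above. As a hedged illustration of the foot filter it describes, the sketch below shows one way such a filter could look; it is not the code from app.py. filter_by_foot is a hypothetical helper that relies on the boolean left_foot and right_foot columns built earlier, and the sample rows are made up.

```python
import pandas as pd

def filter_by_foot(events_df, foot):
    """Keep only events tagged with the chosen foot.

    'Left' and 'Right' select the matching tag column; any other value
    (for example 'Either Left or Right') returns the events unfiltered.
    """
    if foot == 'Left':
        return events_df[events_df['left_foot']]
    if foot == 'Right':
        return events_df[events_df['right_foot']]
    return events_df

# Made-up events: one per foot and one with no foot tag.
sample = pd.DataFrame({
    'id': [1, 2, 3],
    'left_foot': [True, False, False],
    'right_foot': [False, True, False],
})
print(filter_by_foot(sample, 'Left')['id'].tolist())
print(filter_by_foot(sample, 'Either Left or Right')['id'].tolist())
```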

Creating tabs for each event type

Streamlit is a powerful library for building web apps and user interfaces; however, at the time of writing, the library doesn't support the creation of tabs natively. In order to add tabs to the application, we will use Bokeh and the method described here.

For each event type, we have a function that takes as input the 4 DataFrames and, for each player, draws the event positions on a football pitch and builds a table with the breakdown of events by game. The function combines the 2 plots and the 2 tables in a Bokeh grid and returns the grid as a Bokeh Panel.

def plot_goals(messi_events_data_df, ronaldo_events_data_df, barca_matches_dates_df, real_matches_dates_df):

    #Getting events data positions
    messi_goals = messi_events_data_df[messi_events_data_df['goal'] == True]['positions']
    ronaldo_goals = ronaldo_events_data_df[ronaldo_events_data_df['goal'] == True]['positions']

    #Pitch with events
    p_messi = plot_events(messi_goals, 'Goals', 'red')
    p_ronaldo = plot_events(ronaldo_goals, 'Goals', 'blue')

    ....

    grid = bokeh.layouts.grid(
        children=[
            [p_messi, p_ronaldo],
            [print_table(messi_stats_df), print_table(ronaldo_stats_df)],
        ],
        sizing_mode="stretch_width",
    )

    return bokeh.models.Panel(child=grid, title="Goals")
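The part of plot_goals that builds messi_stats_df and ronaldo_stats_df is elided above. The sketch below shows one way a per-game breakdown could be computed with pandas; it is an assumption, not the code from app.py. It relies on the 'label' column that the events DataFrames gained from the merge with the matches data, and on the 'match' and 'date' columns of the dates DataFrames. events_per_match is a hypothetical helper and the sample data is made up.

```python
import pandas as pd

def events_per_match(events_df, matches_dates_df, flag_column):
    """Count flagged events per match and list every match, newest first."""
    counts = (
        events_df[events_df[flag_column]]
        .groupby('label')
        .size()
        .rename('count')
        .reset_index()
        .rename(columns={'label': 'match'})
    )
    # Left merge keeps matches in which the player had no such event.
    table = matches_dates_df.merge(counts, on='match', how='left')
    table['count'] = table['count'].fillna(0).astype(int)
    return table.sort_values('date', ascending=False).reset_index(drop=True)

# Made-up data: two goals in match M1, none in match M2.
events = pd.DataFrame({'label': ['M1', 'M1', 'M2'],
                       'goal': [True, True, False]})
dates = pd.DataFrame({'match': ['M1', 'M2'],
                      'date': ['2018-05-13', '2018-05-20']})
print(events_per_match(events, dates, 'goal'))
```

Sorting works on the date strings because the 'YYYY-MM-DD' layout orders the same way alphabetically and chronologically.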

In the main function at the end of app.py, you can see how the 5 functions are used to create 5 tabs, one for each event type.

tabs = bokeh.models.Tabs(
    tabs=[
        plot_goals(messi_events_data_df, ronaldo_events_data_df, 
                   barca_matches_dates_df, real_matches_dates_df),
        plot_assists(messi_events_data_df, ronaldo_events_data_df, 
                     barca_matches_dates_df, real_matches_dates_df),
        plot_shots(messi_events_data_df, ronaldo_events_data_df, 
                   barca_matches_dates_df, real_matches_dates_df),
        plot_free_kicks(messi_events_data_df, ronaldo_events_data_df, 
                        barca_matches_dates_df, real_matches_dates_df),
        plot_passes(messi_events_data_df, ronaldo_events_data_df, 
                    barca_matches_dates_df, real_matches_dates_df),
    ]
)

Left/Right Foot Filter

In the main function, we define a Streamlit radio button in the sidebar that we can use to filter the data by foot.

foot = st.sidebar.radio("Foot", ('Either Left or Right', 'Left', 'Right'))
messi_events_data_df, ronaldo_events_data_df, barca_matches_dates_df, real_matches_dates_df = get_data(foot)

Stats of both players as Table

In the main function, we calculate the stats of both Messi and Ronaldo and display them as a DataFrame in the sidebar using st.sidebar.dataframe.

goals = [messi_events_data_df['goal'].sum(), ronaldo_events_data_df['goal'].sum()]
assists = [messi_events_data_df['assist'].sum(), ronaldo_events_data_df['assist'].sum()]
shots = [messi_events_data_df[messi_events_data_df['eventName'] == 'Shot'].count()['eventName'],
         ronaldo_events_data_df[ronaldo_events_data_df['eventName'] == 'Shot'].count()['eventName']]
free_kicks = [messi_events_data_df[messi_events_data_df['subEventName'] == 'Free kick shot'].count()['subEventName'],
              ronaldo_events_data_df[ronaldo_events_data_df['subEventName'] == 'Free kick shot'].count()['subEventName']]
passes = [messi_events_data_df[messi_events_data_df['eventName'] == 'Pass'].count()['eventName'],
        ronaldo_events_data_df[ronaldo_events_data_df['eventName'] == 'Pass'].count()['eventName']]

stats_df = pd.DataFrame([goals, assists, shots, free_kicks, passes],
                        columns=['Messi', 'Ronaldo'], 
                        index=['Goals', 'Assists', 'Shots', 'Free Kicks', 'Passes'])

st.sidebar.markdown(""" ### Stats """)
st.sidebar.dataframe(stats_df)

Conclusion

In this blog post, we saw how to build a web app that analyzes Messi and Ronaldo's games during the 2017-18 LaLiga season using Python and Streamlit. The dataset and the source code from this post can be adapted to other use cases, for example comparisons between other players, teams and even championships.

References

[1] Pappalardo et al., (2019) A public data set of spatio-temporal match events in soccer competitions, Nature Scientific Data 6:236, https://www.nature.com/articles/s41597-019-0247-7

[2] https://figshare.com/articles/Events/7770599
