Running totals with a window function 🦆⚽
- Check out the GitHub repo
- See the data in Google sheets format here.
As a season progresses, clubs accumulate points and comparisons between clubs at different points in the season are interesting. Some clubs “hit form” and surge ahead while others lose players through injury or experience “a dip in form” that is reflected in a reduced rate of points accumulation. Running totals are needed to see how clubs’ positions change throughout the season.
There is a general application for this approach to things like sales over time and it can be implemented in SQL using window functions which are fully supported by DuckDB. I am only going to explain one example in this post but I will return to the topic again and perhaps then write a up a more comprehensive review of the topic. Bottom line is: if you don’t know and use window functions, you’re missing out on one of the most powerful features of modern SQL.
In the meantime, as always check out the DuckDB documentation and I also strongly recommend this excellent tutorial.
Running totals
We are going to create a view that orders all the data from matches table by season, club and match date and adds a column that displays the club’s accrued points at that point in the season. The final view definition SQL will be quite long but it can be decomposed into manageable segments that we can run and test independently so we will build it up in steps and explain the steps as we go along.
Calculating points earned home and away
As a reminder of our matches table columns and data, here is a sample of 10 rows:
mid | season | mdate | mtime | hcc | acc | hcg | acg |
---|---|---|---|---|---|---|---|
4559 | 1993_1994 | 1993-08-14 | Â | ARS | COV | 0 | 3 |
4560 | 1993_1994 | 1993-08-14 | Â | AVL | QPR | 4 | 1 |
4561 | 1993_1994 | 1993-08-14 | Â | CHE | BLB | 1 | 2 |
4562 | 1993_1994 | 1993-08-14 | Â | LIV | SHW | 2 | 0 |
4563 | 1993_1994 | 1993-08-14 | Â | MCI | LEE | 1 | 1 |
4564 | 1993_1994 | 1993-08-14 | Â | NEW | TOT | 0 | 1 |
4565 | 1993_1994 | 1993-08-14 | Â | OLD | IPS | 0 | 3 |
4566 | 1993_1994 | 1993-08-14 | Â | SHU | SWI | 3 | 1 |
4567 | 1993_1994 | 1993-08-14 | Â | SOU | EVE | 0 | 2 |
4568 | 1993_1994 | 1993-08-14 | Â | WHU | WIM | 0 | 2 |
Clubs play each other club twice in a season: once at home and once away. Our matches table records the goals scored by the home and away club for each match in separate columns (hcg: home club goals, acg: away club goals). We need to implement logic to assign points based on the winner and collect the points for each club in each match of each season in one column. We can do this in three steps:
- Calculate the points earned for each club for each match in a season at home (first CTE).
- Calculate the points earned for each club for each match in a season away from home (second CTE).
- Bring the home and away results together using a UNION.
Home
The following SQL calculates the points each club gained when playing at home (hcc: home club code) for each match of each season using a CASE statement (3 points for a win, one for a draw, 0 for a loss). We need the match date later to calculate the running totals but this value is not available for season 1992_1993 so we filter it out in the WHERE clause.
SELECT
season,
mid,
mdate,
hcc ccode,
CASE
WHEN hcg > acg THEN 3
WHEN hcg = acg THEN 1
ELSE 0
END points
FROM
matches
WHERE mdate IS NOT NULL;
season | mid | mdate | ccode | points |
---|---|---|---|---|
1993_1994 | 4559 | 1993-08-14 | ARS | 0 |
1993_1994 | 4560 | 1993-08-14 | AVL | 3 |
1993_1994 | 4561 | 1993-08-14 | CHE | 0 |
1993_1994 | 4562 | 1993-08-14 | LIV | 3 |
1993_1994 | 4563 | 1993-08-14 | MCI | 1 |
1993_1994 | 4564 | 1993-08-14 | NEW | 0 |
1993_1994 | 4565 | 1993-08-14 | OLD | 0 |
1993_1994 | 4566 | 1993-08-14 | SHU | 3 |
1993_1994 | 4567 | 1993-08-14 | SOU | 0 |
1993_1994 | 4568 | 1993-08-14 | WHU | 0 |
Away
The points gained for each club away from home uses a very similar query but we are now selecting the away club (acc: away club code) and the order of the comparisons in the CASE statement is swapped around.
SELECT
season,
mid,
mdate,
acc ccode,
CASE
WHEN acg > hcg THEN 3
WHEN acg = hcg THEN 1
ELSE 0
END points
FROM
matches
WHERE mdate IS NOT NULL;
UNION
In order to combine the home and away datasets returned by the two SQL statements above we can make them CTEs and then combine the two CTEs with a UNION.
The window function
We can now add the window function to query our two UNIONed CTEs like so:
CREATE OR REPLACE VIEW vw_cum_points AS
WITH home AS(
SELECT
season,
mid,
mdate,
hcc ccode,
CASE
WHEN hcg > acg THEN 3
WHEN hcg = acg THEN 1
ELSE 0
END points
FROM
matches
WHERE mdate IS NOT NULL),
away AS(
SELECT
season,
mid,
mdate,
acc ccode,
CASE
WHEN acg > hcg THEN 3
WHEN acg = hcg THEN 1
ELSE 0
END points
FROM
matches
WHERE mdate IS NOT NULL),
home_and_away AS(
SELECT *
FROM home
UNION
SELECT *
FROM away)
SELECT
season,
ccode,
mdate,
SUM(points) OVER(PARTITION BY season, ccode
ORDER BY mdate) cum_points
FROM home_and_away
ORDER BY season, ccode, mdate;
The window function totals all the points (SUM(points)
) for each season and club (PARTITION BY season, ccode
) with the results ordered by the match date (ORDER BY mdate
).
Let’s query the view we just created for Manchester City, season 2017-2018:
SELECT *
FROM vw_cum_points
WHERE ccode = 'MCI'
AND season = '2017_2018';
The output shows us how they accumulated their very impressive 100 points in that season
season | ccode | mdate | cum_points |
---|---|---|---|
2017_2018 | MCI | 2017-08-12 | 3 |
2017_2018 | MCI | 2017-08-21 | 4 |
2017_2018 | MCI | 2017-08-26 | 7 |
2017_2018 | MCI | 2017-09-09 | 10 |
2017_2018 | MCI | 2017-09-16 | 13 |
2017_2018 | MCI | 2017-09-23 | 16 |
2017_2018 | MCI | 2017-09-30 | 19 |
2017_2018 | MCI | 2017-10-14 | 22 |
2017_2018 | MCI | 2017-10-21 | 25 |
2017_2018 | MCI | 2017-10-28 | 28 |
2017_2018 | MCI | 2017-11-05 | 31 |
2017_2018 | MCI | 2017-11-18 | 34 |
2017_2018 | MCI | 2017-11-26 | 37 |
2017_2018 | MCI | 2017-11-29 | 40 |
2017_2018 | MCI | 2017-12-03 | 43 |
2017_2018 | MCI | 2017-12-10 | 46 |
2017_2018 | MCI | 2017-12-13 | 49 |
2017_2018 | MCI | 2017-12-16 | 52 |
2017_2018 | MCI | 2017-12-23 | 55 |
2017_2018 | MCI | 2017-12-27 | 58 |
2017_2018 | MCI | 2017-12-31 | 59 |
2017_2018 | MCI | 2018-01-02 | 62 |
2017_2018 | MCI | 2018-01-14 | 62 |
2017_2018 | MCI | 2018-01-20 | 65 |
2017_2018 | MCI | 2018-01-31 | 68 |
2017_2018 | MCI | 2018-02-03 | 69 |
2017_2018 | MCI | 2018-02-10 | 72 |
2017_2018 | MCI | 2018-03-01 | 75 |
2017_2018 | MCI | 2018-03-04 | 78 |
2017_2018 | MCI | 2018-03-12 | 81 |
2017_2018 | MCI | 2018-03-31 | 84 |
2017_2018 | MCI | 2018-04-07 | 84 |
2017_2018 | MCI | 2018-04-14 | 87 |
2017_2018 | MCI | 2018-04-22 | 90 |
2017_2018 | MCI | 2018-04-29 | 93 |
2017_2018 | MCI | 2018-05-06 | 94 |
2017_2018 | MCI | 2018-05-09 | 97 |
2017_2018 | MCI | 2018-05-13 | 100 |
A macro to query the view
In the last post I described how macros can simplfy our lives by allowing us to write re-usable code. Generating the running total of points for a given club in a particular season is an ideal candidate for a macro that takes two parameters: the season and the club. Let’s create a macro to do just this in the macros schema from the last post:
USE macros;
CREATE MACRO get_club_season_cum_points(p_season, p_ccode) AS TABLE
SELECT *
FROM main.vw_cum_points
WHERE season = p_season
AND ccode = p_ccode
ORDER BY mdate;
We can test our macro as follows:
SELECT *
FROM macros.get_club_season_cum_points('2003_2004', 'ARS');
Another cool feature of DuckDB which we haven’t covered yet is how easy it is to generate files from our query output using the COPY
command. For example, if we want to send the output from the previous query to a CSV file with column headers we can do it as follows using the COPY
command
COPY (SELECT * FROM macros.get_club_season_cum_points('2003_2004', 'ARS'))
TO 'ars_2003_2004.csv' (HEADER, DELIMITER ',');
We could then use the output for generating plots, for example.
Wrap up
-
We have only touched on window functions, there’s lots more to learn and we will re-visit them as and when the need arises.
-
We have also seen another example of how useful macros are.
-
We used to
COPY
command to create a file from the output of our macro.