Getting and Prepping the data for loading
Note for following along:
- Check out the GitHub repo.
- See the data in Google Sheets format here.
Finding the data we need, checking it, cleaning it and preparing it for database upload can be thankless, onerous and tedious work. The data we need is often scattered across multiple source files and the formats of those files may be incompatible. Frequently, data is missing or unusable. All these factors combine to make data management and data analysis much more challenging than is often appreciated. Our objective here is to collect all the data for English Premier League (EPL) football (aka soccer in some varieties of English) games from the inaugural 1992-1993 EPL season to the 2023-2024 season. I gave some background on football in general and the EPL in the previous post, so I am assuming you know what I am talking about when I say things like “EPL” and “seasons”.
I use some reasonably advanced shell scripting to prepare the files for database loading. Anyone who knows basic programming and has at least some shell programming experience should be able to follow along. I use the bash shell in all my examples; it is a commonly used shell and is standard on Linux. There is a range of shells to choose from; the modern Mac uses zsh (Z shell) by default, but if you use the bash shebang line, your script will be executed by bash on that platform. In addition to the variety of shells, another complication of shell scripting is that the utilities the shell calls might differ between platforms; for example, the text match and filter utility grep on one platform might be implemented as egrep on another. That said, a few tweaks or a Google search can usually fix things.
If you are not interested in this topic and just want to get to the DuckDB stuff, then you can wait for the later blog posts when we will have our data in DuckDB tables. The DuckDB load files have already been generated and are available in the repo directory output_data. Also, if you are proficient in, say, Ruby or Python and prefer those tools to shell, by all means go ahead and use them instead. Windows users who are familiar with PowerShell might also prefer that tool for the data manipulations; I am not familiar with it but it is popular and widely used.
To repeat what I said in the introduction, I like and value shell programming but opinions differ. That said, shell scripting and general competence with the Unix/Linux command line are very valuable skills. I have changed programming languages and databases many times over the years but two skills I learned early on were Unix and SQL, and I still use both of them.
Task overview
I have downloaded 31 files covering the EPL seasons 1993-1994 to 2023-2024. These are comma-separated values (CSV) files and we want to extract only a subset of columns from each file and then concatenate all 31 files together into a tab-separated file that we will load into a DuckDB database. As the source files were built up over the seasons, new columns have been added over time. Most new columns have been added to the right and we are ignoring them. More recent season files have an additional column that records the match start time; it has been inserted between columns we need to retain, so we will need to account for it in our parsing code. I have a different file format entirely for the inaugural 1992-1993 season that we will process in the DuckDB database itself and then add to the final analysis-ready table.
Before we start
I am assuming the following:
- You have access to a Unix-type terminal; it could be macOS, Linux, Unix, or a Unix-like environment on Windows such as Windows Subsystem for Linux (WSL).
- You have checked out the GitHub repo.
The source data
Seasons 1993-1994 to 2023-2024
I found all the data for seasons 1993-1994 to 2023-2024 on the website Football-Data.co.uk. The site provides a guide to the data in a notes.txt file at this link. Each EPL season’s results page has a link to a downloadable CSV file. The season CSV files are all named E0.csv so after downloading each one, I re-named them using the format season start year, followed by an underscore, followed by the season end year, followed by .csv; an example file name I have assigned is 2003_2004.csv. Tip: Consistent file naming really helps to keep things organised and also makes processing of such files with shell scripts much easier, as we will see below.
All 31 such files were downloaded, renamed and stored in the repo directory named source_data.
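If you would rather script the downloads than click through the site season by season, a loop along the following lines can fetch and rename the files in one go. Treat it as a sketch only: the per-season URL pattern shown here is an assumption, so verify it against the actual download links on Football-Data.co.uk before running it.
# Hypothetical bulk download; the URL pattern is an assumption,
# check it against the links on Football-Data.co.uk first.
for START in {1993..2023}
do
    END=$((START + 1))
    # Build a two-digit year code, e.g. 9394 or 0304.
    CODE=$(printf '%02d%02d' $((START % 100)) $((END % 100)))
    curl -s -o "../source_data/${START}_${END}.csv" \
        "https://www.football-data.co.uk/mmz4281/${CODE}/E0.csv"
done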
Season 1992-1993
Unfortunately, data for the very first EPL season was not available on the website I referred to above. It seemed a shame not to have data for the very first EPL season, so I found it on Wikipedia: 1992–93 FA Premier League. I located the match results in a section called, unsurprisingly, “Results”. I created a CSV file version of this table in Google Sheets using its very useful IMPORTHTML function. The following formula entered in cell A1 of Sheet1 extracts the Wikipedia table into the spreadsheet.
=IMPORTHTML("https://en.wikipedia.org/wiki/1992%E2%80%9393_FA_Premier_League", "table", 6)
Note: you may be prompted by Google Sheets to authorise the call to an external resource.
I then downloaded the sheet as a tab-separated values (TSV) file using the menu action:
File->Download->Tab-separated values (.tsv)
I saved the file as season_1992_1993.tsv in the repo directory source_data.
The downloaded file is in what is often described as a “crosstab” format which, though convenient for viewing, is not convenient for querying. In order to extract the data from it, we will need to do some manipulation to create a “long” format with each club name and home and away scores in separate columns. We are also missing the match date so when we integrate data for this season with the later seasons from the previous step, we will have missing data. Missing data is a common occurrence in the real world and we will discuss how to deal with it when we have the data in DuckDB tables.
Building our shell script piece by piece
The final shell script is available in the shell_scripts directory of the project repo and is named parse_seasons_1993_2023.sh, link. You can take a look at it now to get an overview but I am going to break it down piece by piece and explain what each segment achieves so that, by the end, you will hopefully have a good understanding of its workings. Shell scripting is very amenable to this piece-by-piece approach because the shell is really glue for the wide range of programs and utilities available on modern Unix-type systems. If you use Unix/Linux utilities, then you can write shell scripts by just putting the commands in a script file that you then make executable.
Our script file starts with the shebang line #!/usr/bin/bash. When we execute the script by entering ./parse_seasons_1993_2023.sh on the command line, bash interprets the lines in the file either line by line or construct by construct when it encounters compound statements such as loops and if statements. Each section of the script is introduced with an explanatory comment that begins with # and extends to the end of the line.
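Before its first run, the script file needs to be made executable; from the shell_scripts directory:
# Make the script executable (needed only once), then run it.
chmod +x parse_seasons_1993_2023.sh
./parse_seasons_1993_2023.sh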
Declare some variables
The script does not take any arguments when it is run; everything it needs to “know” is contained within the script itself. The input and output directory paths are assigned to variables as relative paths; relative, that is, to the directory containing the script itself.
# Declare variables.
OUTPUT_DIR="../output_data/"
INPUT_DIR="../source_data/"
Using variables in place of hard-coded values allows us to change either path in just one place; this is good practice. Note also the double quotes around the values and that you cannot have any spaces around the assignment = sign; the requirement not to have spaces around the = is a shell script oddity and a frequent source of potentially nasty code bugs.
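A quick way to see the pitfall in action at the prompt:
# Correct: no spaces around the = sign.
OUTPUT_DIR="../output_data/"
# Bug: with spaces, the shell treats OUTPUT_DIR as a command name
# and complains "OUTPUT_DIR: command not found".
# OUTPUT_DIR = "../output_data/"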
Delete the output file
If the output file was generated previously, we want to delete it and start fresh. The following shell construct checks whether the file exists and, if it does, deletes it. The relevant bash code is as follows:
# Remove the final output file if it already exists.
if [[ -f ${OUTPUT_DIR}seasons_1993_2023.tsv ]]; then
rm ${OUTPUT_DIR}seasons_1993_2023.tsv
fi
The bash if syntax looks strange with its double square brackets but the -f is a simple file test that returns true if the given file exists. The dollar sign is required when accessing the value assigned to a shell variable. The braces ({}) are included to delimit the shell variable name. Without the braces, the shell would erroneously assume the variable name includes the text “seasons_1993_2023.tsv”; if the variable name is bounded by white space, then the curly braces are not needed. Forgetting the dollar sign when accessing a variable’s value and omitting the curly braces when they are needed are two frequent sources of shell script bugs!
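You can demonstrate the braces problem at the prompt with a throwaway variable (PREFIX here is purely for illustration):
PREFIX="seasons"
echo "${PREFIX}_1993_2023.tsv"   # prints: seasons_1993_2023.tsv
echo "$PREFIX_1993_2023.tsv"     # prints: .tsv because the shell looks up
                                 # an (unset) variable named PREFIX_1993_2023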
The body of the if statement contains a single command to remove (rm) the file. We could simply issue the rm command without enclosing it in the if construct, but an error would be written to standard error if the file did not exist, so wrapping it in an if is slightly cleaner.
Parse files that do not have the match time column
Let’s take a close look at this code segment because it is relatively complex and it demonstrates the real power of shell scripting. It uses two trusty old power tools of Unix scripting: sed and awk. I have broken up the commands using the shell line continuation backslash character to make the lines shorter and hopefully more readable.
# Parse the files which *do not have* the match time column.
for FILE in ${INPUT_DIR}{1993..2018}*.csv
do
sed '1d' $FILE | \
awk -v season=$(basename $FILE | sed 's/.csv//') -v null='NA' \
'BEGIN{FS=","; OFS="\t"}{print season,$2,null,$3,$4,$5,$6}' \
>>seasons_1993_2018.tsv
done
We already discussed the importance of consistent naming of files to make file processing easier and here we see an example of it. In the for loop above we are using brace expansion to create a list of all CSV files beginning with the numbers in the range 1993 up to and including 2018. Each iteration of the loop assigns the relative file path to the variable FILE.
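You can see what brace expansion generates by echoing a small range at the prompt:
# Brace expansion produces the sequence before any file matching happens.
echo {1993..1996}
# output: 1993 1994 1995 1996
# Appending *.csv turns each number into a glob such as 1993*.csv,
# which then matches a file named like 1993_1994.csv.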
The sed '1d' $FILE command deletes the first line from the file (the column names row) and sends all remaining lines to the awk command using the pipe symbol (|). In shell scripting parlance, we say a command pipes its output to another command.
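To see this pipe in action on a single season file, you can preview the header-stripped output; run from the shell_scripts directory, as with the script itself, and 1993_1994.csv is one of the renamed source files:
# Drop the column names row and show the first three data lines.
sed '1d' ../source_data/1993_1994.csv | head -3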
The awk code is the most complex so let’s break it down:
- -v season=$(basename $FILE | sed 's/.csv//'): This creates an awk variable called season which we want to include in the output file. For each file processed in the loop, season is assigned a new value representing the season years separated by an underscore. The command substitution strips the directory part with basename and pipes the result to sed, which removes the .csv extension; the resulting string is assigned to the awk variable season.
- -v null='NA': This creates an additional awk variable assigned the string value “NA”, which is inserted into the final output to represent the missing match start time in these files.
- 'BEGIN{FS=","; OFS="\t"}{print season,$2,null,$3,$4,$5,$6}': Here is the awk code that creates the final output lines. The BEGIN{} block sets two special awk variables: FS sets the column separator for the input files to comma (,) and OFS sets the output column separator to tab (\t). Line processing for output takes place in the print block. A dollar sign followed by a number refers to an input file column; $2 is the second column, for example. Five column values are extracted from each input file and two additional columns are inserted using the awk variables season for the season and null for the missing match start time. We end up with seven tab-separated columns in the output file (see the quick test after this list).
- >>seasons_1993_2018.tsv: The awk output is re-directed to a file. We need to use >> instead of > to append to the output file; if we used >, each iteration of the loop would overwrite the file created in the previous iteration.
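One way to convince yourself the awk program behaves as described is to feed it a single made-up line shaped like the source rows. The column layout follows from what the script extracts; the date, club names and scores below are invented for illustration:
# A fabricated input row: division, date, home club, away club, goals.
echo 'E0,14/08/1993,Arsenal,Coventry,0,3' | \
awk -v season='1993_1994' -v null='NA' \
'BEGIN{FS=","; OFS="\t"}{print season,$2,null,$3,$4,$5,$6}'
# Output (tab-separated):
# 1993_1994  14/08/1993  NA  Arsenal  Coventry  0  3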
Parse files that do have the match time column
# Parse the files which *have* the match time column.
for FILE in ${INPUT_DIR}{2019..2023}*.csv
do
sed '1d' $FILE | \
awk -v season=$(basename $FILE | sed 's/.csv//') \
'BEGIN{FS=","; OFS="\t"}{print season,$2,$3,$4,$5,$6,$7}' \
>>seasons_2019_2024.tsv
done
The logic is identical to that described for the previous for loop, except that this time the only value we insert with an awk variable is season; the match start time is already present in the third column ($3) of each input file, so no NA placeholder is needed.
Create the final file with the column names row
# Write the column names header row to the final output file.
echo "season,match_date,match_time,home_club_name,\
away_club_name,home_club_goals,away_club_goals" | \
awk 'BEGIN{FS="," ; OFS="\t"}{print $1,$2,$3,$4,$5,$6,$7}' \
>seasons_1993_2023.tsv
The echo command pipes a string of seven comma-separated column names to awk, which converts the commas to tabs and writes the result to our final output file. We use only a single > this time because we want to create a new file. This file now has just a single line: the column names row.
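As an aside on the design: piping echo through awk keeps the column names readable as a comma-separated list. A printf command with explicit tab escapes would achieve the same result in one step:
# Equivalent header row written directly with printf and tab escapes.
printf 'season\tmatch_date\tmatch_time\thome_club_name\taway_club_name\thome_club_goals\taway_club_goals\n' \
    >seasons_1993_2023.tsv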
Append the data from the two files created in the for loops
# Append the two files generated above to the output file.
cat seasons_1993_2018.tsv seasons_2019_2024.tsv \
>>seasons_1993_2023.tsv
The cat command, combined with the append redirection (>>), adds the rows from the two files created in the for loops to the final file.
Clean-up
# Clean up by moving the main output file and by removing the temporary files.
mv seasons_1993_2023.tsv ${OUTPUT_DIR}
rm seasons_1993_2018.tsv
rm seasons_2019_2024.tsv
echo "Script finished, see file '${OUTPUT_DIR}seasons_1993_2023.tsv'"
Our script has created three files in the same directory as the script itself. We only want one of these files, seasons_1993_2023.tsv, but we should not store it in the script directory so we move it (mv). The other two files are no longer needed so we can remove them (rm). Our task is now complete so we can print a message to that effect (echo).
General comment on shell scripting and its value
We will see more bash scripts throughout the book for moving data into and out of DuckDB databases but none of the code will be any more complicated than what we have covered here. The real power of shell scripting derives from the utilities that come as standard with Unix-type platforms, with many more available for installation. We can use shell scripts to call these utilities and pipe their output into other tools. These pipelines can be very powerful, allowing us to chain together various tools to achieve quite complex tasks. We have used sed and awk here but there are many other tools available. I frequently use curl in shell scripts, for example, to automate file downloads. The duckdb application itself can also be used in shell pipelines, as we will see in later examples.
Main points
- We used the Bash shell to build our final DuckDB upload file.
- Brace expansion using number prefixes in filenames is very useful for creating a list of files to process in a for loop.
- The sed command is very useful for deleting lines from input files and for doing text substitutions.
- The awk command is well suited to column extraction and creation of variables whose values can be inserted into output files.
- The output of one command can become the input of another command by using the pipe symbol (|).
- Multiple commands can be chained together using pipes to build pipelines.
- Unix-type systems provide a wide range of programs for a variety of tasks and these tools can be used together in shell scripts.
Going further
Other very useful Unix commands to consider include tr, sort, cut and uniq. There are many, many good books and on-line resources devoted to Unix/Linux shell scripting; one personal favourite is “Unix Shell Programming” by Kochan & Wood. Nothing much has changed with the basics so an old edition works fine. Another good one is “Classic Shell Scripting: Hidden Commands That Unlock the Power of UNIX” by Robbins & Beebe.
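As a small taste of these commands, here is a pipeline that counts the matches recorded per season in the file we generated (run from the repo root):
# Drop the header row, keep the season column, then count rows per season.
sed '1d' output_data/seasons_1993_2023.tsv | cut -f1 | sort | uniq -c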
Next up
We will:
- Upload the file we generated here into DuckDB.
- Manipulate the cross-tab file for season 1992-1993 in DuckDB itself.
- Generate the analysis-ready DuckDB tables.