Data Organisation in Spreadsheets
Updated: Aug 16, 2021
Last year an article was published in The American Statistician that is pertinent to those working with sports science data. The article, by Broman and Woo, is titled Data Organization in Spreadsheets. An open-access, preprint PDF copy can be found here. In this post we will explore how the practical recommendations offered in the editorial are relevant in Sports Science.
“Spreadsheets are widely used software tools for data entry, storage, analysis, and visualization.”
Firstly, I see a distinction between spreadsheets used purely for data storage and analysis, and those built for other purposes. This article outlines rules and recommendations for setting up pure data spreadsheets. There are, however, times when Excel is the tool itself rather than the data store; there is no better example of this, in my opinion, than Professor Will Hopkins’s invaluable spreadsheets available at sportsci.org.
Spreadsheets whose sole purpose is data analysis should be prepared for programming software (e.g., R, Python, SQL), whether the analysis is completed by yourself or by an analyst with more advanced data science skills.
Now let’s dive into these considerations for “pure” data spreadsheets using the topics brought up in the editorial.
Broman and Woo discuss a variety of examples for their cardinal rule: be consistent. These include categorical variables, missing values, variable names, subject identifiers, file names, and date formats. Let’s consider a couple of these in more detail.
There are many different categorical variables that we may use acronyms for in sports science. For instance, are different age groups titled as “Under 21”, “u21” or “U21”? In another example, different physical tests should have consistent test names or abbreviations, as well as the variable categories within each test.
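Using Python (one of the programming tools mentioned above) as an illustration, here is a minimal sketch of enforcing one standard label; the function name and age-group labels are hypothetical:

```python
# Hypothetical example: collapsing inconsistent age-group labels
# ("Under 21", "u21", "U21") onto a single standard code.
raw_groups = ["Under 21", "u21", "U21", "U18", "under 18"]

def standardise_group(label):
    """Normalise an age-group label to the form 'U<age>'."""
    digits = "".join(ch for ch in label if ch.isdigit())
    return f"U{digits}"

clean_groups = [standardise_group(g) for g in raw_groups]
print(clean_groups)  # ['U21', 'U21', 'U21', 'U18', 'U18']
```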
The variable names consideration is also in line with their next point – “choose good names for things”. Do you say Total Distance, Total Distance Covered, Distance, Metres/Yards Covered, Odometer, etc.? The authors urge that spaces be replaced with underscores or perhaps hyphens, as spaces and other special characters have meaning in data programming. Variable names should be short but meaningful. Consider whether the metric names carry enough information to distinguish their full meaning, e.g. absolute vs relative thresholds.
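A small Python sketch in the same spirit: collapsing free-text metric names into short, underscore-separated column names (the metric names here are made up for illustration):

```python
import re

# Sketch: converting free-text metric names into consistent,
# programming-friendly column names (lower case, underscores).
# The metric names below are hypothetical examples.
def to_snake_case(name):
    name = name.strip().lower()
    name = re.sub(r"[^\w]+", "_", name)  # any run of non-word characters -> "_"
    return name.strip("_")

columns = ["Total Distance", "HSR Distance (Abs)", "Metres/Min"]
print([to_snake_case(c) for c in columns])
# ['total_distance', 'hsr_distance_abs', 'metres_min']
```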
Along with this, you should be consistent with units of measurement. For those working abroad, I urge you (from personal experience) to use the system domestic to the country you are in (regardless of your discomfort with it).
The authors also cite the PhD comic on using “final” in a file name. They advise not to do so because “you will invariably end up with ‘final_ver2’”. This is something I can certainly relate to with versions of a spreadsheet called “x FINAL” and then “x FINAL EVER”!
The Problem with Dates
Dates are a nightmare. Firstly, you may have to deal with the differences between the British-English format (dd/mm/yy) and the American-English format (mm/dd/yy). Then Excel goes and throws its own wrench into the mix by converting dates to a sequential serial number: the number of days since January 1, 1900.
The authors of this paper recommend using the global “ISO 8601” standard, YYYY-MM-DD, such as 2019-01-20. They suggest converting date columns to plain text format, either by changing the format of the cells or by using an apostrophe before the text in the cell, although this relies on human precision in data entry.
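As a sketch in Python, and assuming the raw column was entered consistently in the British dd/mm/yy format, the conversion to ISO 8601 is straightforward:

```python
# Sketch: converting date strings (assumed dd/mm/yy entry) to the
# ISO 8601 (YYYY-MM-DD) standard the authors recommend.
from datetime import datetime

raw_dates = ["20/01/19", "03/02/19"]  # hypothetical raw entries
iso_dates = [datetime.strptime(d, "%d/%m/%y").date().isoformat()
             for d in raw_dates]
print(iso_dates)  # ['2019-01-20', '2019-02-03']
```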
For those also battling with dates in R, the lubridate package is extremely worthwhile.
The next few sections of the paper deal with the structure of the spreadsheet, namely:
no empty cells*
put just one thing in a cell
make it a rectangle
*The no empty cells point is up for debate, as an empty cell may itself mean something: should empty cells be left truly empty, or should a code be used for missing data? Broman and Woo put forward their case for filling all cells. Ultimately, like everything else, it comes down to the cardinal rule… be consistent.
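Whichever side of that debate you take, declaring the missing-data codes in one place keeps them consistent. A minimal Python sketch, with hypothetical data and codes:

```python
# Sketch: treating a fixed set of codes ("NA", "-", empty) consistently
# as missing when reading a plain-text CSV with the standard library.
import csv
import io

MISSING = {"NA", "-", ""}  # hypothetical missing-data codes

raw = "athlete,jump_cm\nA,34.2\nB,NA\nC,-\n"
rows = list(csv.DictReader(io.StringIO(raw)))
values = [None if r["jump_cm"] in MISSING else float(r["jump_cm"])
          for r in rows]
print(values)  # [34.2, None, None]
```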
Anyone who has at least dabbled with RStudio has probably come across the name Hadley Wickham. Wickham is a statistician credited with creating many useful open-source software packages, including ggplot2, dplyr, and reshape2; much of his work now forms part of the collection of packages known as the ‘tidyverse’. Indeed, the authors here cite Wickham’s 2014 paper entitled ‘Tidy data’, available to download on ResearchGate.
Tidy data stipulates that each variable should be stored as a column, each observation should be stored as a row, and each type of observational unit should form a table. Within Broman and Woo’s paper, they present multiple examples of datasets and how their structure can limit the analysis. The examples below in figure 1 (untidy) and figure 2 (tidy) help to demonstrate what this might look like with a simple, hypothetical sample of jump testing data.
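A minimal Python sketch of the same idea, reshaping hypothetical wide jump-test data (one column per trial) into tidy form (one row per observation):

```python
# Sketch: "untidy" wide data with one column per trial, reshaped so
# each row is one observation: athlete, trial, jump height.
import csv
import io

wide = "athlete,trial_1,trial_2\nA,34.2,35.1\nB,31.0,30.4\n"
tidy = []
for row in csv.DictReader(io.StringIO(wide)):
    for trial in ("trial_1", "trial_2"):
        tidy.append({"athlete": row["athlete"],
                     "trial": trial,
                     "jump_cm": float(row[trial])})
print(tidy[0])  # {'athlete': 'A', 'trial': 'trial_1', 'jump_cm': 34.2}
```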
Often we are dealing with exports from third parties, and the “tidiness” of the formats can range drastically! Utilising an API to automatically bring data into your system can help you avoid dealing with these CSV exports. If, however, you are stuck with messy CSV exports, using R really can save time on cleaning and merging such data. (Thanks to Jose Fernandez from the Houston Astros, who recently helped me with R code for a messy Excel export format!)
You can find more from Hadley Wickham on tidy data for R here.
Earlier we talked about the challenge of keeping variable names short but also meaningful. This meaning may need to include units of measurement as well as further information about the metric such as the type of threshold used. A way to efficiently account for this could be to set up a data dictionary. This is a separate file that contains information about the data.
The data dictionary should also be stored in a tidy format as it may be incorporated into the analysis. While the variable names in the data file may now follow certain rules that make them look unattractive e.g. all lower case and separated by underscores, the data dictionary provides an opportunity to align a more suitable name to use in data visualisations. Here we may also store further information such as the units of measurement, the source of the metric, the calculation for it, and a description of it.
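One hypothetical way to represent such a dictionary in Python, mapping the short column names to display labels and units for use in visualisations (all names and units below are made up):

```python
# Sketch of a tidy data dictionary: one entry per variable, mapping the
# short column name to a display label and unit of measurement.
data_dictionary = {
    "total_distance": {"label": "Total Distance", "unit": "m"},
    "hsr_distance":   {"label": "High-Speed Running Distance", "unit": "m"},
}

def axis_label(variable):
    """Build a plot-ready label from the dictionary entry."""
    entry = data_dictionary[variable]
    return f"{entry['label']} ({entry['unit']})"

print(axis_label("total_distance"))  # Total Distance (m)
```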
No Calculations, Font Colours or Highlighting
Now we are getting into aspects of the article that I suspect might be different to how many Sports Scientists use Excel. To quote the authors themselves; “We feel strongly that your primary data file should contain just the data and nothing else: no calculations, no graphs.”
Some may well have adopted this approach, with Excel purely used for data entry and other tools used for transformation, analysis and visualisation. For myself, this is a goal I am striving to work towards but is certainly a work in progress. So for now, I have “pure” datasheets as well as other Excel files with plenty of calculations, fonts, colours, and conditional formatting incorporated within. Certainly, Excel Tricks for Sports (a phenomenal resource for sports scientists) would argue there is nothing wrong with that!
However, to expand further on Broman and Woo’s philosophy:
“We believe that spreadsheets are best suited to data entry and storage, and that analysis and visualization should happen separately. Analyzing and visualizing data in a separate program, or at least in a separate copy of the data file, reduces the risk of contaminating or destroying the raw data in the spreadsheet.” Broman and Woo (2018)
Whether you agree or not, I think this discussion raises an important point: reflect upon the purpose of a document and try to plan from the outset which programmes will best suit what you are trying to achieve. Clearly, it is up to you, the reader, to reflect on your processes and skillsets, as well as the skills and resources of others within your organization, to determine how best to approach your data management.
Reducing Errors and Freezing
We’ve all been there… the dreaded non-responsive Excel after hours (and hours) of work. The authors’ final few recommendations aim to reduce errors and the heart-breaking loss of hard work, through:
make backups – regularly and in multiple locations
use data validation – this feature in Excel can help to reduce data-entry errors
save the data in plain text files
Following the recommendations above would mean your datasheets could be saved as CSV (comma-separated values) files rather than large, complicated, multi-coloured Excel workbooks.
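As a minimal Python sketch, writing the raw datasheet out as plain-text CSV keeps the data separate from any formatting or calculations (the rows here are hypothetical):

```python
# Sketch: writing the "pure" datasheet as a plain-text CSV.
import csv
import io

rows = [("athlete", "test_date", "jump_cm"),
        ("A", "2019-01-20", "34.2")]
buffer = io.StringIO()  # stands in for a file opened with open(..., "w")
csv.writer(buffer, lineterminator="\n").writerows(rows)
print(buffer.getvalue())
```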
I will finish with a cautionary tale from the authors, who recommend applying these suggestions to your new datasets going forward rather than trying to adapt your current datasets to these formats. Please also remember that this blog is only a synopsis of the article, and so I urge you to read the paper in full (especially as it is freely available).
There should of course be no judgement of how you choose to incorporate Microsoft Excel into your workflow. The effectiveness of a Sports Scientist is not determined by their Excel or data science skills, but by their interpersonal skills that enable conversations about the data to take place.
“Without a systematic way to start and keep data clean, bad data will happen.” Donato Diorio