  • Jo Clubb

Big Data? Assessing Sports Science Inputs for AI Modelling

When surveying the industry on Artificial Intelligence (AI) in sport, a number of queries focussed on the system inputs. For instance, practitioners asked:


Does elite sport currently have sufficient data available to train models? And how do we know if our data inputs are good enough?


These are important questions. However, as we will discuss, the word “enough” risks reducing the discussion to a ‘yes’ or ‘no’ answer when, in reality, the issue is not dichotomous but more nuanced. The question is therefore perhaps better posed as follows:


What is the minimum viable dataset in terms of size and quality to train models in elite sports?


This post aims to explore both data quantity and quality within sports performance. I’ll first be discussing these topics, and the team at Zone7 will then add further context of how they approach them, specifically in the realm of injury risk and performance modelling.


Are we dealing with ‘Big Data’?

A little over a decade ago, it was a novelty to have tracking technology used on the training field in professional sports. Now, athletes are often tracked every time they take the field, as well as in the weight room, during rehabilitation, and even in the swimming pool.


Many major leagues not only now permit in-game tracking, but use a league-wide provider that enables access to all games for all teams. As a result, there is increasing integration between physical and technical/tactical data. Within teams, tracking data is frequently supplemented with jump testing, subjective measures, strength tests, body composition assessments, and many more.

This seems like “Big Data”. But is it?


Perhaps surprisingly (to me at least), Big Data is not defined through quantification but by the characteristics of the data. Specifically, by the three V’s: Variety, Volume, Velocity. Put simply, Oracle describes big data as “larger, more complex data sets, especially from new data sources”. It is worth noting that some argue Value and Veracity should also be added to the V’s to better define Big Data. More on Veracity later…


Tracking data and, in particular, in-game data appears to meet the ‘Three-V tipping point’ for Big Data: the point at which traditional data management and analysis become inadequate (Gandomi and Haider, 2015). So, it is not surprising to witness collaboration between sports and computer science to research tactical behaviour, such as the framework by Goes et al. (2021) for tactical performance analysis.


What is Zone7’s approach to using “Big Data” to model injury risk and load management?


The first step is to define the initial data assets that are needed to model a problem, in this case injury risk forecasting and workload management. This is a multi-dimensional problem, so ideally we’d like to ingest descriptive data that covers workload and injury data from multiple teams and leagues. This typically touches on tracking technologies, game videos, strength assessments, etc. The larger the data lake, the better (as long as it’s uniformly “clean”).


Within football, the reality is that not all leagues and teams use the same sensors. For example, teams often employ Global Positioning System (GPS) and accelerometry-generated data during training, but camera-based optical tracking technologies during competitive matches. Yet, these between-system differences mean that such data is not easily interchangeable (Taberner et al., 2019) and special care is needed here, otherwise the inputs are corrupted.


A common approach for this is to rely on an Extract, Transform, Load (ETL) architecture. ETL acts as the scaffolding through which the data can be normalised. This process ensures disparate datasets are standardised and can therefore be used together in any longitudinal injury risk analysis.
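
As a rough illustration of the idea, the sketch below maps two differently shaped exports onto one common schema. Every field name, unit, and the `LoadRecord` structure are assumptions made up for this example, not the format of any real tracking system or of Zone7's pipeline. Note that the source system is kept on each record: harmonising units and column names does not remove the between-system measurement differences discussed above.

```python
from dataclasses import dataclass


# Hypothetical common schema; field names are assumptions for illustration.
@dataclass(frozen=True)
class LoadRecord:
    athlete_id: str
    session_date: str        # ISO date, e.g. "2023-08-01"
    total_distance_m: float  # metres
    max_speed_ms: float      # metres per second
    source_system: str       # kept so between-system differences stay visible


def transform_gps(row: dict) -> LoadRecord:
    """Hypothetical GPS export: distance in km, speed in km/h."""
    return LoadRecord(
        athlete_id=str(row["player"]),
        session_date=row["date"],
        total_distance_m=float(row["distance_km"]) * 1000.0,
        max_speed_ms=float(row["max_speed_kmh"]) / 3.6,
        source_system="gps",
    )


def transform_optical(row: dict) -> LoadRecord:
    """Hypothetical optical export: distance in metres, speed in m/s."""
    return LoadRecord(
        athlete_id=str(row["athlete_id"]),
        session_date=row["match_date"],
        total_distance_m=float(row["dist_m"]),
        max_speed_ms=float(row["peak_speed_ms"]),
        source_system="optical",
    )


def run_etl(gps_rows: list, optical_rows: list) -> list:
    """Extract rows from both sources, transform to one schema, load into one list."""
    records = [transform_gps(r) for r in gps_rows]
    records += [transform_optical(r) for r in optical_rows]
    return sorted(records, key=lambda r: (r.athlete_id, r.session_date))


# Made-up rows, purely to show the shape of the inputs.
gps_rows = [{"player": "A01", "date": "2023-08-01",
             "distance_km": "6.2", "max_speed_kmh": "28.8"}]
optical_rows = [{"athlete_id": "A01", "match_date": "2023-08-05",
                 "dist_m": 10450.0, "peak_speed_ms": 8.9}]
records = run_etl(gps_rows, optical_rows)
```

After this step, every record shares the same units and field names, so a training session and a match for the same athlete can sit side by side in one longitudinal table.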


Riding the ‘Data Quantity Wave’

In applied practice, we can ride peaks and troughs in data quantity. External load data can be abundant, thanks largely to the in-game tracking already mentioned, as well as the less invasive nature of certain devices (e.g., optical tracking or wearables integrated into equipment/clothing).

Meanwhile, internal load is sometimes neglected, perhaps due to the increased cooperation required, despite the added value it could provide. Often, subjective measures of internal load (namely, Rating of Perceived Exertion) and/or response (self-report questionnaires) are used, although these rely on the pillars of athlete self-consciousness, autonomy, and honesty (Montull et al., 2022).


During pre-season screening, we may be overwhelmed by data. This generally recedes to varying levels as the season gets underway. Personally, I think a one-off assessment can be used to guide individualised programmes in an effective manner. That said, clearly there can be benefits to increased data collection frequency, with the caveat being that the data is then utilised. I would love an objective way to quantify the greater value associated with the greater burden of repeat testing.


“Training as Testing” methodologies enable more consistent data collection. These include fitness-fatigue assessments during standardised drills, integrating regular jump testing to track neuromuscular fatigue, velocity-based training data collection, and measures of eccentric and/or isometric strength incorporated within a gym training programme.


Theoretically, this increased quantity pertaining to an athlete’s fitness-fatigue relationship and physical capacities should augment a practitioner’s decision making relating to health and performance. Is this the case? And if so, can AI enhance this process?


Can multiple input categories be used effectively in a model? Specifically, can external workload data combine with other datasets like strength assessments, fatigue, RPE, etc?


The Zone7 philosophy is that practitioners and organisations know best and they determine which datasets should be included in the analysis. That said, go/no-go criteria are helpful for assessing whether a specific dataset deemed relevant is “usable” in a predictive model aimed at load management and risk forecasting:

  • Cadence: daily outputs require daily (or weekly) inputs. This is a strong argument for adopting “training as testing”. So, while yearly strength assessments are hugely valuable to practitioners, they may be difficult to include in a model aiming to produce daily results.

  • Volume: a certain critical mass of data is needed to demonstrate the validity of the model and to establish a baseline per athlete. Hence, some datasets may require an initial ‘ramp up’ collection period before they can be included in a predictive model. That said, they may very well be useful to practitioners from day one of their inclusion for “human analysis” methods.

  • Quality: The other key parameter to consider is data quality, which we expand on below.
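
The cadence and volume criteria above could be expressed as a simple check, sketched below. The two thresholds (`MAX_GAP_DAYS`, `MIN_OBSERVATIONS`) and the function name `is_usable` are hypothetical values chosen for illustration; the criteria in the text do not specify numbers, and the quality criterion is not covered here.

```python
from datetime import date, timedelta

# Hypothetical thresholds; real values would depend on the model and the metric.
MAX_GAP_DAYS = 7        # cadence: longest tolerated gap between observations
MIN_OBSERVATIONS = 30   # volume: observations per athlete before inclusion


def is_usable(observation_dates):
    """Return (go, reason) for one athlete's history of a single dataset."""
    if len(observation_dates) < MIN_OBSERVATIONS:
        return False, "volume: still in ramp-up period"
    ordered = sorted(observation_dates)
    # Largest number of days between consecutive observations.
    largest_gap = max(
        ((later - earlier).days for earlier, later in zip(ordered, ordered[1:])),
        default=0,
    )
    if largest_gap > MAX_GAP_DAYS:
        return False, "cadence: gaps too long for daily outputs"
    return True, "go"


# Made-up collection histories for one athlete.
daily_jumps = [date(2023, 7, 1) + timedelta(days=i) for i in range(40)]
yearly_strength = [date(2021, 7, 1), date(2022, 7, 1), date(2023, 7, 1)]
# Same 40 jump tests, but with a three-week break in the middle.
gappy_jumps = daily_jumps[:20] + [d + timedelta(days=21) for d in daily_jumps[20:]]

for name, history in [("daily jumps", daily_jumps),
                      ("yearly strength", yearly_strength),
                      ("gappy jumps", gappy_jumps)]:
    print(name, is_usable(history))
```

A "no-go" here only concerns inclusion in the model; as noted above, the same dataset may still be valuable for human analysis.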


Taking pride in data quality

Of course, data quantity is not the only factor to consider with inputs. It is no good collecting a mountain of data if it is unusable. Garbage In Garbage Out (GIGO) is a data science concept that contends poor quality inputs will result in faulty outputs. Despite the common sense of this concept, IBM estimated the cost of poor data quality to the U.S. economy at over $3 trillion annually.


Data quality should always be a priority to sports science practitioners. We are “Data Stewards”, responsible for ensuring sufficient quality throughout data collection, storage, and analysis. Coaches, management, and the athletes themselves should be able to trust the information presented to them. If data is to be used for decision making, the outputs will only be as good as the inputs. Building a culture that respects and values clean data practices starts with us.


Data veracity, mentioned briefly earlier, has multiple definitions in the literature but is generally considered to relate to the accuracy and fidelity of data (Reimer and Madigan, 2018). A systematic review conducted a veracity analysis of publications on load monitoring in professional soccer (Claudino et al., 2021). The authors found that 73% of the studies did not report veracity metrics, such as the coefficient of variation (CV), intraclass correlation coefficient (ICC), and standard error of measurement (SEM).
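
For readers less familiar with these metrics, here is a minimal sketch of two of them using their common textbook forms: CV as the standard deviation divided by the mean, and SEM as the between-athlete standard deviation multiplied by the square root of (1 − ICC). The ICC itself is taken as an input rather than computed, since it has several variants. The jump heights and the ICC value are made-up numbers for illustration only.

```python
import math
import statistics


def coefficient_of_variation(trials):
    """CV (%) of repeated trials: sample standard deviation / mean * 100."""
    return statistics.stdev(trials) / statistics.mean(trials) * 100.0


def standard_error_of_measurement(group_scores, icc):
    """SEM = between-athlete SD * sqrt(1 - ICC).

    The ICC is assumed to come from a separate reliability analysis.
    """
    if not 0.0 <= icc <= 1.0:
        raise ValueError("icc must be between 0 and 1")
    return statistics.stdev(group_scores) * math.sqrt(1.0 - icc)


# Made-up jump heights (cm): three trials for one athlete,
# and one score each for three athletes.
athlete_trials = [40.0, 42.0, 41.0]
squad_scores = [38.0, 41.0, 44.0]

cv = coefficient_of_variation(athlete_trials)
sem = standard_error_of_measurement(squad_scores, icc=0.91)
print(f"CV = {cv:.2f}%, SEM = {sem:.2f} cm")
```

Reporting values like these alongside a metric gives others a sense of how much of an observed change could simply be measurement noise.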


We may consider such statistical measures to be innately linked to a particular metric or technology. Yet, often our own processes directly affect data quality. For instance, we previously published a ‘Clean Tracking Data Checklist’ to promote a systematic process in external load data collection (Torres-Ronda et al., 2021). This includes considerations for before (use suitably-sized garments), during (manage drills/periods and athletes correctly within the software), and after (remove errors and spikes) the training session.


In another example, ensuring an accurate body weight for a subject on a force plate, via a stationary weighing phase/silent period, is important to maintain data quality within and between trials and athletes (McMahon et al., 2018).
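
To make the weighing phase concrete, the sketch below estimates body weight as the mean vertical force over an initial quiet-standing window and rejects the trial if the athlete was moving. The window length (`weigh_seconds`) and the stability limit (`max_sd_n`) are illustrative assumptions, not thresholds taken from McMahon et al. (2018), and the force trace is synthetic.

```python
import statistics

G = 9.81  # m/s^2, gravitational acceleration (rounded)


def body_weight_from_weighing_phase(force_n, sample_rate_hz,
                                    weigh_seconds=1.0, max_sd_n=5.0):
    """Return (weight in N, mass in kg) from the start of a force trace.

    weigh_seconds and max_sd_n are hypothetical defaults for illustration.
    """
    n = int(sample_rate_hz * weigh_seconds)
    if n < 2 or len(force_n) < n:
        raise ValueError("trial too short for the requested weighing phase")
    window = force_n[:n]
    # A noisy window suggests the athlete was not standing still.
    if statistics.stdev(window) > max_sd_n:
        raise ValueError("athlete was not stationary during the weighing phase")
    weight_n = statistics.mean(window)
    return weight_n, weight_n / G


# Synthetic 1000 Hz trace: one second of quiet standing, then movement.
trace = [784.8] * 1000 + [900.0] * 200
weight_n, mass_kg = body_weight_from_weighing_phase(trace, sample_rate_hz=1000)
print(f"{weight_n:.1f} N, {mass_kg:.1f} kg")
```

Because later variables are typically derived relative to body weight, an error at this step carries through the whole trial, which is why a failed check is better handled by repeating the trial than by correcting it afterwards.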


There are also factors that, while not directly within our control, sit within our sphere of influence. For instance, earlier I mentioned how self-reported, subjective measures rely on athlete honesty (Montull et al., 2022). Athlete buy-in can be supported through education and communication. Explaining the purpose and use of such data is key to gaining their trust in the process. Similarly, coach support has been shown to influence compliance with self-reported measures (Saw et al., 2015).


Clean and repeated data collection and management practices do not seem exciting, but they lay the foundation for effective data analysis, whether that is by human or machine.


Can Zone7 share best practices on how to regulate the quality of data that is ingested? Do you provide insight to practitioners around their specific data quality and/or cleaning needs?


Datasets need to be validated for quality to ensure any outputs generated are contextual, meaningful and potentially impactful. The GIGO maxim is absolutely at the forefront of any data science quality control process in elite sports.


To be effective in a predictive model, it is vital that practitioners use consistent data collection and cleaning practices as described above. These include processes like checking for velocity spikes, being consistent with thresholds/bands, and managing the athletes in the correct sessions and drills.
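
As one example of what a velocity spike check might look like, the sketch below flags samples that exceed a plausible top speed or imply an implausible acceleration relative to the last accepted sample. This is a generic illustration, not Zone7's diagnostics or any vendor's filter, and both thresholds are assumptions that would need tuning to the sport, the athletes, and the tracking system.

```python
def flag_velocity_spikes(speeds_ms, sample_rate_hz,
                         max_speed_ms=12.0, max_accel_ms2=10.0):
    """Return indices of samples that look physically implausible.

    max_speed_ms and max_accel_ms2 are hypothetical limits for illustration.
    """
    dt = 1.0 / sample_rate_hz
    flagged = []
    last_good = None  # index of the most recent sample that passed both checks
    for i, speed in enumerate(speeds_ms):
        too_fast = speed > max_speed_ms
        too_sharp = False
        if last_good is not None:
            # Compare against the last accepted sample so that the return
            # from a spike is not itself flagged.
            elapsed = (i - last_good) * dt
            too_sharp = abs(speed - speeds_ms[last_good]) / elapsed > max_accel_ms2
        if too_fast or too_sharp:
            flagged.append(i)
        else:
            last_good = i
    return flagged


# Made-up 10 Hz speed trace (m/s) with one obvious glitch at index 3.
trace = [3.0, 3.2, 3.4, 15.0, 3.6, 3.8]
print(flag_velocity_spikes(trace, sample_rate_hz=10))
```

Flagged samples are best reviewed, or handled according to a documented protocol, rather than silently deleted, so that the cleaning step itself stays consistent across staff and seasons.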


It is not uncommon for Zone7’s automated “data diagnostics” to uncover glitches that have a negative impact on the data integrity in a specific environment. We have the tools to assist practitioners and their organisation to ensure data quality is sufficient. Measures are in place to ensure each new contributing dataset is not only an appropriately valid dataset, but also one where ad-hoc anomalies are identified and removed so that they do not pollute the reference data lake.


Considering our blind spots

We should also consider potential influential factors we are not currently capturing. These may be ‘known-unknowns’, exemplified perhaps by the gut microbiome and potential links with injury risk. Further, there are clearly things we do not yet know about human physiology, performance and injury risk: the ‘unknown-unknowns’.


As Nate Silver wrote in ‘The Signal and The Noise’:


“One of the most pernicious [biases] is to assume that if something cannot easily be quantified, it does not matter”.

Another potential limitation to our inputs is the instantaneous nature of our testing and, to some degree, monitoring assessments. Even a daily cadence of data collection does not reflect how physiology changes throughout a single day. Injury risk is dynamic. In a utopian sports science world, we would collect such measures in a continuous manner and subsequently, have access to real-time injury risk. Though that would open a whole other can of worms...


That said, I think we have a responsibility to extract more value from the data we already collect. While some argue for a 24-hour approach to athlete monitoring (Sperlich and Holmberg, 2017), others contend that it has already gone too far (Powles and Walsh, 2022). My Value-Burden Matrix has been well received, I believe, because there is consensus that we owe it to our athletes to optimise this relationship, maximising value from the data collection burden that we place on them.


How can we know or assess if AI is adding greater value to our processes? What defines success?


Whether AI is effective is ultimately a subjective statement, driven by the collision between the model’s mathematical accuracy and the specific needs of the environment.


The mathematical “success” of a predictive model is driven by the combination of the dataset size and the quality of the code/algorithms. An algorithm looking for patterns associated with (for example) hamstring injuries in women’s soccer will get better the more “examples” of data/injuries it can “learn” from. By “better” we typically mean fewer mistakes (fewer false positives, fewer false negatives, or both). Some algorithms will detect 90% of events with lots of false alarms, and others will detect 65% of events with very few false alarms. There’s always a trade-off.
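
The trade-off can be seen by taking one set of risk scores and moving the alert threshold. The sketch below does this with entirely synthetic scores and injury labels invented for the example; it does not represent any real model's output, and `alert_summary` is a hypothetical helper name.

```python
def alert_summary(risk_scores, injured, threshold):
    """Count detections and false alarms when alerting at or above a threshold."""
    alerts = [score >= threshold for score in risk_scores]
    true_pos = sum(a and y for a, y in zip(alerts, injured))
    false_pos = sum(a and not y for a, y in zip(alerts, injured))
    false_neg = sum((not a) and y for a, y in zip(alerts, injured))
    events = true_pos + false_neg
    return {
        # Share of injury events that were flagged (sensitivity / recall).
        "detected": true_pos / events if events else None,
        "false_alarms": false_pos,
        "missed": false_neg,
    }


# Synthetic example: ten athlete-days, four of which preceded an injury.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
injured = [True, False, True, False, True, False, False, True, False, False]

print("low threshold: ", alert_summary(scores, injured, threshold=0.15))
print("high threshold:", alert_summary(scores, injured, threshold=0.65))
```

In this toy case the low threshold catches every event at the cost of four false alarms, while the high threshold raises only one false alarm but misses half of the events. Neither setting is "correct" in isolation.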


On the other hand, usability is subjective. Some environments might be much less tolerant of false positives than others. In sports, a large staff of physios and sport scientists might be better equipped to deal with a large number of ‘alerts’ per day/week. Hence, their accuracy needs are different. This can also change per group of athletes (e.g., by position) or by time of season.


So our suggestion when answering “do we have enough data?” is to look at several factors in tandem: (1) the dataset available in-house, in terms of size and quality; (2) the dataset available to train the algorithms with; and (3) the algorithm’s usability in the specific context in which it is being deployed.


Improving value in the data we already collect

Whilst sports science datasets are not as big as those collected in some other fields, we are still trying to manage large, complex data sets coming from new sources, that require a high velocity to analyse and disseminate.


Whether we turn to man or machine to delve deeper, the human practitioner has an obligation to optimise the quantity and quality of the data collected from their athletes. Consider the following as good practices when it comes to this:


  1. Employ a regular data collection cadence

  2. Consider the Value-Burden Matrix when introducing a new data collection process

  3. Maintain data stability and document any changes to settings, in particular bands/thresholds with tracking technology

  4. Use systematic data collection processes and compile these as protocols to ensure consistency across staff and seasons

  5. Regularly conduct tests for data integrity (Zone7 can help with this)


Personally, my exploration of AI thus far has highlighted that the discussion is more nuanced than I expected. There is no magic number, no binary threshold of “enough”, when it comes to quantity and quality. Clearly, though, both influence the quality of the outputs. Do they determine success? To some degree yes, but like any type of analysis, it comes down to how it is contextualised, interpreted, and actioned within each specific environment. These will be topics that we will explore in future articles, starting with context.