An AI Dictionary for Sport Scientists
Although terms like “artificial intelligence” and “machine learning” have become familiar vernacular, that does not necessarily mean they are well understood. In my recent campaign for questions on this topic, there were numerous queries around the definitions and distinctions between such terminology.
With the help of Zone7, we have put together a dictionary for sport science practitioners, split into the following sections:
Types of Machine Learning Algorithms
Machine Learning Model Attributes
Types of Analysis
Machine Learning Metrics
This is designed to be a resource you can return to and one that can also expand. So, if you have a term you would like added, let me know!
Artificial Intelligence (AI) is the ability of a digital computer or computer-controlled robot to perform tasks commonly associated with intelligent beings. The term is frequently applied to the project of developing systems endowed with the intellectual processes characteristic of humans, such as the ability to reason, discover meaning, generalize, or learn from past experience.
Machine Learning (ML) is a subset of AI that is particularly focused on developing algorithms that will help machines to perform intellectual processes by learning from historical data and responding to new data.
Deep Learning (DL) is a subset of ML algorithms that is based on artificial neural networks. Artificial neural networks are able to learn complex tasks in a variety of domains such as computer vision, natural language understanding, speech recognition and many more.
Computer Vision (also called Visual Recognition)
A field of AI that enables computers and systems to derive meaningful information from digital images, videos and other visual inputs, and to take actions or make recommendations based on that information. For a recent review of computer vision in sports, see Naik et al. (2022).
Types of Machine Learning Algorithms
Supervised learning models are trained to detect underlying patterns between the input data and the labels associated with it, enabling them to yield accurate labels when presented with never-before-seen data.
Unsupervised algorithms discover different patterns and similarities in the data without specifically referencing their relationship with an outcome represented by a label.
Reinforcement learning (RL) is an area of ML concerned with how intelligent agents should take action in an environment to maximize the notion of cumulative reward.
RL differs from supervised learning in not needing labelled input/output pairs to be presented and in not needing sub-optimal actions to be explicitly corrected. Instead, the focus is on finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge). Partially supervised RL algorithms can combine the advantages of supervised and RL algorithms.
Machine Learning Model Attributes
Hyperparameters are parameters whose values control the learning process and determine the values of model parameters that a learning algorithm ends up learning. They are used by the learning algorithm when it is learning but they are not part of the resulting model.
Each algorithm type can use different types of hyperparameters. Examples of hyperparameters include the depth of a decision tree, the number of trees in a tree-based ensemble model, and the number of layers in a neural network.
Parameters are part of the model resulting from the training process. They are learned from the data as the algorithm attempts to map the input features to the labels or targets. Examples of such parameters are the coefficients of a regression model or the weights of a neural network.
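To make the distinction concrete, here is a minimal sketch in plain Python, assuming a one-feature linear regression fitted by gradient descent. The function name `fit_line`, the toy data and the chosen values are illustrative assumptions and are not taken from any particular system. The learning rate and the number of epochs are hyperparameters (set before training), while the slope and intercept are parameters (learned from the data).

```python
def fit_line(xs, ys, learning_rate=0.05, epochs=2000):
    """Fit y = slope * x + intercept by gradient descent on the mean squared error."""
    slope, intercept = 0.0, 0.0  # parameters: learned from the data
    n = len(xs)
    for _ in range(epochs):  # epochs: a hyperparameter
        errors = [slope * x + intercept - y for x, y in zip(xs, ys)]
        grad_slope = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_intercept = 2 * sum(errors) / n
        slope -= learning_rate * grad_slope  # learning_rate: a hyperparameter
        intercept -= learning_rate * grad_intercept
    return slope, intercept


# Toy data generated from y = 2x + 1 (hypothetical values).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
```

Only `slope` and `intercept` are returned, which mirrors the point above: the hyperparameters steer the learning process but are not part of the resulting model.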
Classification algorithms are a subset of supervised learning algorithms used to identify the category of an unseen observation. A classifier learns from the given dataset and classifies new observations into a number of classes/categories.
Examples of such categories include Yes or No, 0 or 1, Spam or Not Spam, cat or dog, or whether or not an athlete is likely to suffer an injury. Importantly, the output of such an algorithm can at times be presented as an estimated probability of each class.
Regression algorithms are supervised learning algorithms that learn a model based on a training dataset to estimate continuous response values.
A training data set is used to train an ML algorithm. For instance, in a supervised learning task, an algorithm looks at the training data (input data and labels) to learn combinations of parameters and values that produce optimal predictive performance. The goal is to produce a fitted model that learns the combinations mentioned above but still generalizes and performs well on unseen data.
A validation data set is a set of examples used to tune a model’s hyperparameters and thresholds, as well as to select between models. It should follow the same distribution as the training data set.
In order to avoid overfitting, when any of a model’s attributes needs to be adjusted, it is recommended to have a validation data set.
The validation data set is completely separate and does not overlap with the training data set.
In order to get more stable results and use all valuable data for training, a data set can be repeatedly split into training and validation datasets in multiple iterations. This process is known as cross validation.
The results of the validation process will be determined based on an aggregation of the results achieved in each iteration. It is common practice to hold out an additional dataset from the cross-validation process as a test set to fully understand a model’s expected performance in a real-life scenario.
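The repeated splitting described above can be sketched as a simple k-fold scheme. This is a minimal illustration in plain Python, assuming contiguous folds without shuffling and assuming that any test set has already been held out. The function name `k_fold_splits` is hypothetical.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


# Ten samples, five folds: every sample is used for validation exactly once.
splits = list(k_fold_splits(n_samples=10, k=5))
```

A model would be trained and scored once per fold, and the per-fold scores aggregated (for example, averaged) to give the validation result.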
A test data set is independent of the training and validation sets but follows the same distribution as the training data set. It is a set of examples used to assess the performance of a model for which all attributes have been selected. The final model, selected following training on the training set and tuning on the validation set, is evaluated on how closely its predictions on the test set agree with the actual response values.
A data lake is a central location that holds a large amount of data in its native, raw format. Compared to a hierarchical data warehouse, which stores data in files or folders, a data lake uses a flat architecture and object storage, with metadata tags, to store the data.
Specifically in the context of Zone7, when attempting to create injury risk forecasting models, the data lake represents the dataset used to “train” the models. This is strictly separated from the test data, which is typically an external dataset. In the case of Zone7’s approach, the dataset is divided into seasonal units, where the test data (e.g., the most recent season) is completely separate from the training data (previous seasons).
Types of Analysis
Prospective Data Analysis
A process that is designed to evaluate a predictive model trained on past data, based on its performance on future events.
For Zone7, this is using an algorithm trained on historical data to forecast the injury risk of players based on new incoming data day-by-day.
Retrospective Data Analysis
A process that is designed to train and evaluate a model’s performance based on historical data. In order to evaluate the model’s performance, the most recent part of the data is held out for evaluation and is treated as “future” relative to the data the model is trained on.
In the context of Zone7, evaluating the system’s performance at any given moment in time on a large sample requires using retrospective data in the evaluation data set. If, hypothetically, we have ten seasons of historical data, we can train the system on athlete data and injuries that occurred in seasons 1-9 and evaluate the system’s performance based on its ability to correctly flag injuries that occurred during season 10.
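The ten-season example can be sketched as a time-based split. This is only an illustration in plain Python: the record layout, the field names and the function `split_by_season` are hypothetical and are not a description of Zone7's actual pipeline.

```python
def split_by_season(records, test_season):
    """Train on every season before test_season; test on test_season only."""
    train = [r for r in records if r["season"] < test_season]
    test = [r for r in records if r["season"] == test_season]
    return train, test


# Hypothetical records: two athletes per season across seasons 1 to 10.
records = [{"season": s, "athlete": a} for s in range(1, 11) for a in ("A", "B")]
train, test = split_by_season(records, test_season=10)
```

Splitting by time rather than at random keeps "future" data out of training, which is what allows the held-out season to stand in for unseen events.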
Machine Learning Metrics
The way different ML models are evaluated and compared is by using different metrics that assign a numerical value quantifying the performance of a model. Optimising different metrics will affect the way the ML model performs for a desired goal and will affect the models selected to achieve the desired goal in the first place.
The accuracy of a machine learning system is measured as the percentage of correct predictions or classifications made by the model over a specific data set.
Accuracy is a very simple metric that doesn’t perform well when the condition we are trying to predict is very rare. For example, if we would like to train a model that predicts a disease that impacts a single person out of 10,000 people we can train our model to predict “not sick” for all people who get tested. The model will be correct 9,999 times out of 10,000 which will give it an accuracy of 99.99% but it will be completely useless.
In the same manner, given that an injury is a rare event, we (Zone7) could hypothetically train an algorithm to always forecast that an injury isn’t likely to happen and receive very high accuracy, though, as mentioned before, the model would be of no use.
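The disease example above can be reproduced in a few lines of plain Python. The helper name `accuracy` and the label encoding (1 = sick, 0 = not sick) are assumptions made for this sketch.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# One positive case in 10,000, as in the disease example.
y_true = [1] + [0] * 9999
y_pred = [0] * 10000  # a "model" that always predicts the negative class

acc = accuracy(y_true, y_pred)
# Number of truly sick people the model actually caught.
detected = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
```

Here `acc` comes out at 9,999 correct out of 10,000 while `detected` is zero, which is why accuracy on its own is a poor guide for rare events.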
False Negative (Rate)
A false negative represents a “positive” case that was incorrectly predicted as “negative”.
The False Negative Rate represents the proportion of “positive” cases that were incorrectly classified as “negative”.
In the case of injury risk forecasting with Zone7, this would be where a player sustains an injury whilst the system forecasted low injury risk levels preceding it. Another way to think of this is an unexpected injury, where no flag was provided ahead of the incident.
False Positive (Rate)
A false positive represents a “negative” case that was incorrectly predicted as “positive”.
The False Positive Rate represents the proportion of “negative” cases that were incorrectly classified as “positive”.
In the case of injury risk forecasting with Zone7, this would be where a player does not sustain an injury, whilst the system forecasted medium/high risk levels.
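As a rough sketch, both rates can be computed from the four confusion-matrix counts. The labels below are made-up illustrative data (1 = injury occurred, or flagged as medium/high risk; 0 = otherwise), and `confusion_counts` is a hypothetical helper name.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels where 1 = positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


# Hypothetical data: 4 injuries and 6 non-injuries.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
false_negative_rate = fn / (fn + tp)  # missed positives / all positives
false_positive_rate = fp / (fp + tn)  # false alarms / all negatives
```

Note that the two rates use different denominators: the False Negative Rate is relative to all actual positives, the False Positive Rate to all actual negatives.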
Sensitivity / Recall (True Positive Rate)
The proportion of the correctly classified “positive” cases out of all the “positive” cases. In the case of injury, it’s the proportion of injuries for which medium/high risk levels were forecasted prior to their occurrence.
Zone7: Sensitivity reflects the percentage of injuries in a specific dataset that were correctly flagged by the model. For example, 50 injuries were sustained in a season, and this season was used as test data. Results indicate that 41 injuries were correctly flagged. Hence, the sensitivity (also called recall) is 82%. Sensitivity is referred to at times as the injury detection rate.
Specificity (True Negative Rate)
The proportion of the “negative” class that is correctly classified. In the case of injury, it is the proportion of uninjured athletes that are correctly identified as such.
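Sensitivity and specificity can be sketched side by side. The sensitivity figures below reuse the worked example of 41 flagged injuries out of 50, whereas the counts for uninjured cases (900 correctly unflagged, 100 incorrectly flagged) are purely hypothetical and chosen only to illustrate the calculation.

```python
def sensitivity(tp, fn):
    """True Positive Rate: correctly flagged positives / all positives."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """True Negative Rate: correctly unflagged negatives / all negatives."""
    return tn / (tn + fp)


# 41 of 50 injuries were flagged, so 9 were missed.
sens = sensitivity(tp=41, fn=9)
# Hypothetical negatives: 900 correctly unflagged, 100 false alarms.
spec = specificity(tn=900, fp=100)
```

Sensitivity is the complement of the False Negative Rate and specificity is the complement of the False Positive Rate, so each pair always sums to 1.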
Overfitting is a symptom of ML training in which an algorithm tends to "memorize" its training data rather than learning significant patterns from it. This usually leads to inferior performance on new, previously unseen data.
Overfitting usually occurs when training data is too small or oversimplified to accurately represent the complexity of the real life situation we are trying to learn.
A few common methodologies to reduce overfitting include:
Enlarging the training dataset so it's too large to be "memorized":
Combining the dataset with additional similar datasets
Artificially augmenting the dataset
Using simpler models, which are less prone to overfitting but tend to have inferior performance overall
Precision is the proportion of correctly classified “positive” cases out of all the predicted “positive” cases.
In the case of injury risk forecasting with Zone7, precision would be the proportion of athletes that were injured out of all the athletes that were forecasted to be at medium/high risk.
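A minimal sketch of the calculation in plain Python follows. The counts are hypothetical (40 athletes flagged, of whom 10 were injured), and returning 0.0 when nothing is flagged is a convention chosen for this example rather than a universal rule.

```python
def precision(tp, fp):
    """Share of predicted positives that were truly positive."""
    predicted_positive = tp + fp
    if predicted_positive == 0:
        return 0.0  # convention assumed here when no case is flagged
    return tp / predicted_positive


# Hypothetical: 40 athletes flagged at medium/high risk, 10 of them injured.
prec = precision(tp=10, fp=30)
```

Unlike sensitivity, the denominator here is the number of flags raised, so precision falls as false alarms increase even if every injury is still caught.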
Mean Squared Error
The mean squared error (MSE) measures the average squared difference between the estimated values and the actual values. It is used when the task is to predict continuous numbers rather than discrete categories.
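The definition translates directly into code. The values below are made up for illustration, and `mean_squared_error` is simply a descriptive name for this sketch.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical continuous targets and model estimates.
actual = [3.0, 5.0, 2.0, 7.0]
predicted = [2.0, 5.0, 4.0, 8.0]
mse = mean_squared_error(actual, predicted)  # (1 + 0 + 4 + 1) / 4
```

Because the differences are squared, one large miss (the error of 2 above) contributes more than several small ones.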
Area Under the Curve (AUC) Analysis
A measure of the ability of a classifier to distinguish between binary classes, such as predicting the presence or absence of injury. It is used as a summary of the Receiver Operating Characteristic (ROC) curve.
The higher the AUC, the better the performance of the model at distinguishing between the positive and negative classes.
When AUC = 1, the classifier distinguishes perfectly between all the Positive and Negative class points.
When AUC is between 0.5 and 1, the classifier distinguishes the positive class from the negative class better than random guessing, and increasingly well as the value approaches 1.
When AUC = 0.5, then the classifier is not able to distinguish between Positive and Negative class points.
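One way to build intuition for these values is that AUC equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes it that way in plain Python, assuming binary labels (1 = positive) and at least one case of each class. The scores are made-up examples, and the pairwise loop is chosen for clarity rather than speed.

```python
def auc(y_true, scores):
    """AUC as the chance that a random positive outscores a random negative.

    Tied scores count as half a win.
    """
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


y_true = [0, 0, 1, 1]
perfect = auc(y_true, [0.1, 0.2, 0.8, 0.9])        # positives always rank higher
uninformative = auc(y_true, [0.5, 0.5, 0.5, 0.5])  # scores carry no information
partial = auc(y_true, [0.1, 0.4, 0.35, 0.8])       # 3 of 4 pairs ranked correctly
```

The three results correspond to the cases listed above: 1 for perfect separation, 0.5 for no separation, and a value in between for partial separation.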
Receiver Operating Characteristic (ROC) Curve
The Receiver Operating Characteristic (ROC) curve is an evaluation tool for binary classification problems. It plots the True Positive Rate (Sensitivity) against the False Positive Rate (1 - Specificity) at various threshold values. A higher X-axis value indicates that a larger share of negative cases is incorrectly classified as positive, while a higher Y-axis value indicates that a larger share of positive cases is correctly classified as positive.
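The points on a ROC curve can be sketched by sweeping the threshold over the observed scores. This is a minimal illustration in plain Python: it assumes a case is predicted positive when its score is at or above the threshold, and both the data and the function name `roc_points` are hypothetical.

```python
def roc_points(y_true, scores):
    """Return (false_positive_rate, true_positive_rate) pairs, one per threshold."""
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing is flagged
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
points = roc_points(y_true, scores)
```

Lowering the threshold moves along the curve from (0, 0), where nothing is flagged, to (1, 1), where everything is flagged, trading more true positives for more false positives.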
The F1-score combines the precision and recall of a classifier into a single metric by taking their harmonic mean. It is primarily used to compare the performance of two classifiers. The F1-score is dependent on the selected classification threshold as part of the tuning process.
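The threshold dependence can be illustrated with a short sketch in plain Python. The scores and labels are made up, the function name `f1_at_threshold` is hypothetical, and returning 0.0 when there are no true positives is a convention assumed for this example.

```python
def f1_at_threshold(y_true, scores, threshold):
    """F1-score when cases scoring at or above the threshold are predicted positive."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # convention assumed here: no true positives gives F1 = 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)  # harmonic mean


# Hypothetical risk scores for 3 injured (1) and 5 uninjured (0) athletes.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2, 0.2, 0.1]

f1_strict = f1_at_threshold(y_true, scores, threshold=0.5)
f1_lenient = f1_at_threshold(y_true, scores, threshold=0.35)
```

With the same scores, the two thresholds give different F1 values, which is why the threshold has to be fixed (for example, on the validation set) before two classifiers are compared on F1.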