Data preparation for modelling
Conceptualisation
The aim of the modelling process is to predict a performance metric at Integrated Care System geography from metrics that describe demand and capacity.
Data-driven approach
The approach taken in this project is to develop an adaptable process that can be repeated with minimal human intervention. The data collection is manual, but the input data provided to each model are selected only on the basis of data quality, the length of the time series available, and whether the published geography is at a granular enough scale. The metrics provided to the modelling are selected based on whether they provide information on system demand or capacity, and not based on pre-conceived hypotheses of relevance to the output metric being modelled. The final metrics used by the models are determined by the modelling method itself using metric selection methods (see Modelling approaches).
Data structure
A table of metrics is created from the prepared data. Each row in the table represents a unique combination of Integrated Care System and year. The columns in the table consist of the target variable (the variable being predicted), which takes a value between 0 and 1; the number of cases for that target variable (see Frequency weights); and the predictor variables (the demand and capacity metrics).
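To make this layout concrete, the sketch below shows the expected shape of the table; the column names and values are hypothetical placeholders, not real data:

```r
# Illustrative layout of the modelling table (hypothetical names and values).
library(tibble)

model_data <- tribble(
  ~ics_code, ~year, ~target, ~cases, ~predictor_1, ~predictor_2,
  "QE1",     2019,    0.82,   15400,         0.61,        103.2,
  "QE1",     2020,    0.74,   14950,         0.58,         98.7,
  "QF7",     2019,    0.79,   20110,         0.66,        110.4
)
# target      : proportion between 0 and 1 being predicted
# cases       : number of cases behind the target (see Frequency weights)
# predictor_* : demand and capacity metrics
```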
In order to identify the best model for predicting each metric, different sizes of input data are tested. We test different lengths of input data (e.g., whether including more years of data in the model development process improves its predictive capability), which results in a longer dataset. We also test whether including lagged values of the predictor variables improves the model's predictive capability, which results in a wider dataset.
Modelling approaches
Two modelling approaches were attempted.
Generalised linear model
This approach splits the target variable into two groups: “successful” and “unsuccessful”. The “successful” group is the count of cases for each ICS that achieved the performance metric within the specified criteria. The “unsuccessful” group is the remaining cases. This allows binomial regression to be performed with a generalised linear model using a logit link function.
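A minimal sketch of this formulation in base R, assuming a data frame like the hypothetical `model_data` table above, is:

```r
# Binomial regression with a logit link, weighting each ICS-year record by
# its counts of "successful" and "unsuccessful" cases.
successes <- round(model_data$target * model_data$cases)
failures  <- model_data$cases - successes

glm_fit <- glm(
  cbind(successes, failures) ~ predictor_1 + predictor_2,
  family = binomial(link = "logit"),
  data   = model_data
)
summary(glm_fit)
```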
Elastic net regularisation (described in the Hyperparameter tuning section) was performed to select the final variables to go into the model and to identify the important variables.
Random Forest
This non-parametric approach randomly sub-samples the full dataset (with replacement), along with taking random samples of the predictor variables, to build multiple decision trees. These decision trees are then aggregated to provide a model that can be applied to unseen data.
A variation of this approach was also tested, in which a random forest was used in the same way but to model the change in value of the target variable from the previous year, using the changes in the values of the predictor variables from the previous year.
Variable importance
Permutation importance was used to provide an indication of which variables influenced the models most. This is performed by shuffling the input data within one column at a time, repeated ten times per column, and calculating the average deterioration in the evaluation metric when the model's predictions are assessed against the observed values in the test dataset.
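A sketch of this procedure is shown below; the helper functions `predict_fn()` (to generate predictions) and `metric_fn()` (the evaluation metric) are hypothetical stand-ins for whatever prediction and scoring functions are used:

```r
# Permutation importance: shuffle one predictor column at a time, ten times,
# and average the resulting deterioration in the evaluation metric.
permutation_importance <- function(model, test_data, target_col,
                                   predict_fn, metric_fn, n_repeats = 10) {
  baseline   <- metric_fn(test_data[[target_col]], predict_fn(model, test_data))
  predictors <- setdiff(names(test_data), target_col)

  sapply(predictors, function(col) {
    deterioration <- replicate(n_repeats, {
      shuffled        <- test_data
      shuffled[[col]] <- sample(shuffled[[col]])   # permute this column only
      metric_fn(shuffled[[target_col]], predict_fn(model, shuffled)) - baseline
    })
    mean(deterioration)   # average deterioration over the repeats
  })
}
```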
Pre-processing
Frequency weights
A column for frequency weighting is used for the logistic regression only. This column is the total number of cases included in the target variable calculation. For example, if the target variable is the proportion of incomplete referral to treatment pathways that are greater than 18 weeks, the frequency weighting column is the total number of incomplete pathways. The purpose of this column is to ensure downstream processes consider the different weighting of each record in the dataset. Some records account for a much higher volume of cases than others, so they should carry more weight in processes like normalisation. These weights impact the pre-processing, the model estimation, and the performance estimation steps.
Lagging data
Models were tested with lagged predictor variables, with both one-year and two-year lags. Conceptually, this was done to determine whether the value of a predictor variable in a previous year has an influence on the target variable in the current year. Testing was also performed where a lagged version of the target variable was included as a predictor variable.
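A sketch of how the lagged columns could be constructed, assuming the hypothetical `model_data` table above and the dplyr package:

```r
# Create one- and two-year lags of each predictor within each ICS,
# plus a lagged copy of the target to test as an extra predictor.
library(dplyr)

pred_cols <- grep("^predictor_", names(model_data), value = TRUE)

lagged_data <- model_data |>
  group_by(ics_code) |>
  arrange(year, .by_group = TRUE) |>
  mutate(
    across(all_of(pred_cols), ~ lag(.x, 1), .names = "{.col}_lag1"),
    across(all_of(pred_cols), ~ lag(.x, 2), .names = "{.col}_lag2"),
    target_lag1 = lag(target, 1)
  ) |>
  ungroup()
```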
Pandemic impact
The impact of the pandemic was accounted for by introducing a variable that quantifies the proportion of occupied beds in which the patient has a COVID-19 diagnosis.
Splitting dataset
Both of the modelling approaches described have unknown hyperparameters, which can be tuned to optimise the performance of the model. The optimisation process was done by splitting the input dataset into two parts: 80% of the data were selected into a training set and 20% into a test set. This split was done at random, so all ICSs and all years had an equal chance of appearing in each set.
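A minimal sketch of this split, assuming the rsample package and an arbitrary seed for reproducibility:

```r
library(rsample)

set.seed(2024)                                  # hypothetical seed
data_split <- initial_split(model_data, prop = 0.8)
train_data <- training(data_split)              # 80% of ICS-year records
test_data  <- testing(data_split)               # remaining 20%
```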
Variable removal
Variables are removed manually in three circumstances (a sketch of these rules follows the list):
- where more than 40% of values in the training set are missing
- where more than 95% of values in the training set are 0
- where more than 95% of values in the test set are missing
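An illustrative application of these three rules in base R, assuming the column layouts of the training and test sets match:

```r
# Proportions of missing or zero values per column.
too_missing_train <- colMeans(is.na(train_data)) > 0.40
too_many_zeroes   <- colMeans(train_data == 0, na.rm = TRUE) > 0.95
too_missing_test  <- colMeans(is.na(test_data)) > 0.95

drop_cols  <- names(train_data)[too_missing_train | too_many_zeroes | too_missing_test]
train_data <- train_data[, setdiff(names(train_data), drop_cols)]
test_data  <- test_data[,  setdiff(names(test_data),  drop_cols)]
```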
Imputing missing variables
Missing values were imputed for the remaining variables by taking the mean of the values of the five nearest neighbours, identified using all of the other predictor variables in the training data.
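A sketch of this imputation step, assuming the recipes package, is shown below; the same prepared recipe is applied to both sets so the imputation is learned from the training data only:

```r
library(recipes)

impute_recipe <- recipe(target ~ ., data = train_data) |>
  step_impute_knn(all_predictors(), neighbors = 5) |>
  prep(training = train_data)

train_imputed <- bake(impute_recipe, new_data = NULL)      # prepped training data
test_imputed  <- bake(impute_recipe, new_data = test_data)
```

The centring and scaling described under Normalisation could be added to the same recipe (for example with `step_normalize()`) so that both steps are learned from the training data.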
Normalisation
For the logistic regression model, the data are centred and scaled.
Hyperparameter tuning
The training set was split into 7 folds to perform “leave-group-out” cross validation for the hyperparameter tuning. Each fold was defined by NHS region and contained only the ICSs in that region. Models were then trained on 6 of the 7 folds using a variety of hyperparameters, and validated on the remaining fold to identify the parameters that provided the best performance. The purpose of leave-group-out cross validation was to reduce the likelihood that the hyperparameters were tuned on training and validation data that are similar because they come from consecutive years for the same ICS (e.g., a specific ICS could have similar inputs in two adjacent years; if one year were used for training and the next for validation, this could lead to overfitting and promote a model that performs poorly on the test dataset).
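A sketch of constructing these folds, assuming a hypothetical `nhs_region` column in the training data and the rsample package:

```r
library(rsample)

# One fold per NHS region: each resample trains on 6 regions and
# validates on the ICSs of the held-out region.
region_folds <- group_vfold_cv(train_data, group = nhs_region, v = 7)
```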
Model specification
Generalised linear model with elastic net regularisation
Using the R package glmnet, a quasibinomial model was selected with a logit link transformation. The number of lambda values that determine the lambda path was selected as 150, which is higher than the default value of 100, to increase the likelihood of model convergence. Standardisation of the input data was performed as a separate pre-processing step, as described above, rather than within the glmnet function.
The two hyperparameters that were tuned were: the penalty parameter (lambda), which defines the amount of regularisation (i.e., the amount of explanatory variable penalisation that occurs), and the mixture parameter, where a value of 1 is a pure lasso model and 0 is a pure ridge regression model.
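A sketch of a single elastic net fit for one candidate value of the mixture parameter, assuming a predictor matrix `x_train`, target proportions `y_train`, case-count weights `w_train`, and a glmnet version (4.0 or later) that accepts stats family objects:

```r
library(glmnet)

enet_fit <- glmnet(
  x           = x_train,
  y           = y_train,
  weights     = w_train,
  family      = quasibinomial(link = "logit"),
  alpha       = 0.5,     # mixture: 1 = pure lasso, 0 = pure ridge (tuned in practice)
  nlambda     = 150,     # longer lambda path than the default of 100
  standardize = FALSE    # data already centred and scaled upstream
)
```

In practice the penalty (lambda) and mixture (alpha) values would be chosen by the leave-group-out cross validation described above.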
Random forest
The random forest model was specified using the randomForest R package. The three hyperparameters that were tuned were: the number of randomly selected predictors in each tree, the number of trees, and the minimum node size allowed before permitting another branch to grow.
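A sketch of a single random forest fit, assuming the randomForest package and a hypothetical `predictor_cols` vector of predictor names; the hyperparameter values shown are placeholders that would be tuned:

```r
library(randomForest)

rf_fit <- randomForest(
  x        = train_data[, predictor_cols],
  y        = train_data$target,
  mtry     = 8,      # number of randomly selected predictors tried at each split
  ntree    = 500,    # number of trees
  nodesize = 5       # minimum terminal node size
)
```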
Model evaluation
Regression models have several useful evaluation metrics for assessing model performance. The priority for this study was to select models and explain them using simple metrics that were comparable between models. This meant using an evaluation metric in the same units as the target variables and one that was unaffected by the scale of the target variable.
The Mean Absolute Percentage Error (MAPE) metric was selected to evaluate the models because it provides the simplest interpretation and consistency between all the models.
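For reference, MAPE follows the usual definition and can be written as a small helper function:

```r
# Mean Absolute Percentage Error, expressed as a percentage.
# Note: undefined where an observed value is exactly zero.
mape <- function(observed, predicted) {
  100 * mean(abs((observed - predicted) / observed))
}
```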
Having determined the optimal hyperparameters (as described in the hyperparameter tuning section), the final model was evaluated on the test dataset, from which the MAPE score was calculated.
MAPE scores were also calculated on two naïve methods of performance forecasting:
- future performance is the same as the previous year’s performance
- future performance is estimated by extrapolating a linear model through the last three years of observed performance data
The MAPE score for the naïve methods was calculated using the same test dataset as the more sophisticated modelling methods to create a like-for-like comparison.
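A sketch of the two naïve baselines for a single ICS, assuming a hypothetical data frame `history` with `year` and `target` columns ordered by year:

```r
# Naïve method 1: carry forward last year's observed performance.
naive_last_year <- tail(history$target, 1)

# Naïve method 2: extrapolate a linear trend through the last three years.
last_three  <- tail(history, 3)
trend_fit   <- lm(target ~ year, data = last_three)
naive_trend <- predict(trend_fit, newdata = data.frame(year = max(history$year) + 1))
```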