What is Covariate Shift?
Covariate shift, a subclass of dataset shift, is a common reason predictive models become obsolete when scoring unseen data. A model’s degraded performance can often be traced to test inputs drawn from a distribution the model never saw during training. This difference in inputs can be caused by changes over time or by selection bias.
For example, let’s say I created a classifier for health insurance coverage using census data. The model will likely deteriorate over time as the distribution of the inputs changes. A change in the input distribution violates the core assumption that the inputs are distributed the same way during training and testing.
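In more formal terms (a standard way to state the covariate shift assumption, using notation not introduced elsewhere in this post): the marginal distribution of the inputs changes between training and testing, while the relationship between inputs and outputs stays fixed,

$$P_{\text{train}}(x) \neq P_{\text{test}}(x), \qquad P_{\text{train}}(y \mid x) = P_{\text{test}}(y \mid x).$$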
Similarly, if my health insurance classifier was trained exclusively on New York’s census data and was then used to classify Alabama residents, it would likely perform worse on Alabama than it did during training on New York.
Most supervised models assume that the training and testing sets are drawn from the same population. To keep that assumption reasonable, the model should be refit regularly with the most recent data. When collecting additional data isn’t possible, there are simpler tricks such as importance reweighting. Before handling the shift, though, it’s important to first check whether a shift has actually occurred.
Identify Shift
The first step is to check whether the training input distributions differ from the testing inputs. If the model uses only one or two features, the differences can often be spotted visually with histograms, as in the sketch below.
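A minimal sketch of the visual check, assuming a single numeric feature; the synthetic "age" values below are placeholders standing in for the real train and test columns:

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder values for one feature; replace with the real train/test column.
rng = np.random.default_rng(0)
train_age = rng.normal(40, 10, size=5000)   # training distribution
test_age = rng.normal(48, 12, size=5000)    # shifted test distribution

# Overlay the two distributions; a visible offset suggests a covariate shift.
plt.hist(train_age, bins=30, alpha=0.5, density=True, label="train")
plt.hist(test_age, bins=30, alpha=0.5, density=True, label="test")
plt.xlabel("age")
plt.ylabel("density")
plt.legend()
plt.show()
```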
When the model handles multiple variables, I can create pseudo labels for the training and testing sets (1 for test and 0 for training) and fit a binary classifier on the combined data to see whether it can tell the two apart. If the training set and the testing set come from the same population, the pseudo classifier will struggle to separate them.
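One way to run this check with scikit-learn, assuming the training and testing inputs sit in two feature matrices with the same columns; the synthetic data, the random-forest detector, and the AUC scoring are illustrative choices rather than part of the original method:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Placeholder feature matrices; replace with the real training and testing inputs.
rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, size=(1000, 5))
X_test = rng.normal(0.5, 1.0, size=(500, 5))   # deliberately shifted for the demo

# Pseudo labels: 0 for training rows, 1 for testing rows.
X_combined = np.vstack([X_train, X_test])
pseudo_labels = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_test))])

# If this classifier separates the two sets well, the inputs have shifted.
detector = RandomForestClassifier(n_estimators=100, random_state=0)
auc = cross_val_score(detector, X_combined, pseudo_labels,
                      cv=5, scoring="roc_auc").mean()
print(f"shift-detection AUC: {auc:.2f}")  # ~0.5 = no detectable shift, near 1.0 = strong shift
```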
Handling Shift
It’s always best to collect additional data and refit the model periodically. However, if it’s not possible to refit or supplement the data, I can give more weight to the training instances that look similar to the test set.
I extract weights by fitting a model on the joined training and testing sets with a pseudo dependent variable: 1 for testing and 0 for training. After fitting, I take the confidence scores on the training set and use their exponentiated values as weights.
The higher the confidence score, the more confident the model is that the observation belongs to the positive class (i.e., looks like the test set). The confidence scores are then exponentiated to produce the weights. Since I don’t want individual observations to receive outrageously large weights, I clip the confidence scores at a threshold before exponentiating; clipping adds some bias but reduces variance significantly.
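A sketch of the weight extraction, assuming a logistic-regression domain classifier whose decision_function output serves as the confidence score; the synthetic inputs and the clipping threshold of 3 are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder feature matrices; replace with the real training and testing inputs.
rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, size=(1000, 5))
X_test = rng.normal(0.5, 1.0, size=(500, 5))

# Fit a domain classifier on the joined data with the pseudo label (0 = train, 1 = test).
X_combined = np.vstack([X_train, X_test])
pseudo_labels = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_test))])
domain_clf = LogisticRegression(max_iter=1000).fit(X_combined, pseudo_labels)

# Confidence scores (log-odds of "looks like the test set") for the training rows only.
scores = domain_clf.decision_function(X_train)

# Clip before exponentiating so no single observation gets an extreme weight.
# The threshold of 3 is an illustrative choice (weights capped at roughly e^3 ≈ 20).
weights = np.exp(np.clip(scores, -3, 3))
```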
The final classifier is then fit with these weights, so the training instances that look similar to the test set count more.
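Continuing the previous sketch, the weights can be passed to the final model through scikit-learn’s sample_weight argument; the target y_train and the gradient-boosting classifier below are placeholders standing in for the real labels and model:

```python
from sklearn.ensemble import GradientBoostingClassifier

# Placeholder target for the training set; replace with the real labels.
y_train = rng.integers(0, 2, size=len(X_train))

# Training instances that resemble the test set receive larger sample weights.
final_model = GradientBoostingClassifier(random_state=0)
final_model.fit(X_train, y_train, sample_weight=weights)
```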