Covariate shift in Machine Learning

Photo by Erda Estremera on Unsplash

What is Covariate shift?

Covariate shift, a subclass of dataset shift, is a common reason why predictive models become obsolete when predicting unseen data. The inefficiency of a model may be explained due to the different input distributions from the test set that the model didn’t get to see during training. The difference in test inputs can be caused by the change of time or selection bias.

For example, let’s say I created a classifier for health insurance coverage using census data. The model will susceptibly deteriorate over time as the distribution of the inputs changes. A change in the input probabilities violates a primary assumption that the probabilities are the same during training and testing.

In addition, if my health insurance classifier was trained using New York’s census data explicitly and was then used to classify Alabama residents, the model might perform worse on Alabama than during training(New York).

Most supervised models follow the assumption that the training and testing sets are pulled from the same population. In order to maintain this expensive assumption, the model should be refit regularly with an addition of the most recent data. However, when it’s not possible to collect additional data, there are some simple tricks such as importance reweighting. Before handling the shift, it’s important to first identify if a shift has occurred.

Identify Shift

The first step is to see if the training input distributions are different from the testing inputs. If the model utilizes one or two features, the differences might be visually explained with histograms.

When the model handles multiple variables, I can create pseudo labels for the training and testing sets(1 for test and 0 for training) and use the combined data sets to fit a binary classifier to see if the model can detect the difference. If the training set and the testing set come from the same population, then the pseudo classifier will have a hard time classifying.

Handling Shift

It’s always best to collect additional data and refit the model periodically. However, if it’s not possible to refit or supplement data, I can add more weight to the training instances that look similar to the test set.

I extract weights by fitting a model with the joined training and testing set with a pseudo-dependent variable, 1 for testing and 0 for training. The objective is to return the confidence scores on the training set after fitting and using the exponentiated confidence scores as weights.

The higher the confidence score, the more confident the model is that the observation belongs in the positive case(similar to the test). The confidence scores are then exponentiated to return the weights. Since I don’t want specific observations to have outrageous weight assignments, I clipped the confidence scores at a certain threshold. Clipping will increase bias but at the same time decrease variance significantly.

The final classifier is fit with the appropriate weights that look similar to the test set.

--

--

--

Hello! My name is Albert Um.

Love podcasts or audiobooks? Learn on the go with our new app.

Recommended from Medium

Talking about Statistics !

Introduction to Machine Learning

List of all data on my blog

10 Underrated Python Skills

Viral GSA Search Engine Ranker VPS

Download datasets into HDFS and Hive

Why physical storage of your database tables might matter

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Albert Um

Albert Um

Hello! My name is Albert Um.

More from Medium

Confidence interval and sampling error

Gas Turbine CO and NOx Emission

Chi-Square Test in Python: Analyzing Enabling Factors of Mental Health at Work

Adjust the phase offset when using data from ‘bode’