For this blog, I will demonstrate three techniques to handle class imbalance using NYS PUMS (Public Use Microdata Sample) Census data. (You can find the dataset here.) Training a classification model on imbalanced classes can bias it toward predicting the majority class.
A pseudo-objective is to classify whether New York residents have health insurance coverage. It’s a binary classification problem using the 2019 NYS PUMS Census data. I thinned out the features to seven independent variables and one dependent variable, ‘HICOV’ (Health Insurance Coverage).
According to the census, the majority of residents do have health insurance. It might be important for a business to identify residents who don’t have health insurance in order to target them for health coverage.
I will first hold out a portion of the dataset for evaluation and preprocess the training set by dummifying the discrete variables and scaling the continuous variables. I will then fit a logistic regression model on the training set and evaluate its predictions on the withheld evaluation (test) set.
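The hold-out, preprocessing, and fitting steps can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the original notebook: every column name except HICOV is hypothetical, the 1/0 coding of HICOV is an assumption, and the split size and random seeds are arbitrary choices.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the PUMS extract (hypothetical columns, made-up values).
rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "age": rng.integers(18, 90, size=n),                 # continuous (hypothetical)
    "income": rng.normal(50_000, 15_000, size=n),        # continuous (hypothetical)
    "marital": rng.choice(["single", "married", "other"], size=n),  # discrete (hypothetical)
    "HICOV": rng.choice([1, 0], size=n, p=[0.95, 0.05]),  # assumed coding: 1 = insured, 0 = uninsured
})

X = df.drop(columns="HICOV")
y = df["HICOV"]

# Hold out an evaluation set, keeping the class ratio similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Dummify the discrete column and scale the continuous ones.
# The transformers are fitted on the training set only, then reused on the test set.
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["marital"]),
    ("num", StandardScaler(), ["age", "income"]),
])
X_train_prep = preprocess.fit_transform(X_train)
X_test_prep = preprocess.transform(X_test)

model = LogisticRegression(max_iter=1000)
model.fit(X_train_prep, y_train)
y_pred = model.predict(X_test_prep)

# Rows are actual classes, columns are predicted classes, ordered [uninsured, insured].
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
print(cm)
```

Listing the categorical columns first in the `ColumnTransformer` places the dummy columns at the lowest column indices of the transformed matrix, which is convenient when a later step needs those indices.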
Unfortunately, the logistic model predicted the majority class (resident is insured) almost exclusively. If we were to use this model to target NYS residents without health insurance, it would have flagged only 9 residents as uninsured, 8 of whom are false negatives (insured residents predicted as uninsured). Among the 31,851 (30,256 + 1,595) residents classified as insured, 1,595 were false positives (uninsured residents predicted as insured). With "insured" as the positive class, false positives are more costly than false negatives in this case, because they are exactly the uninsured residents we do not want to miss. Therefore, I will use precision rather than accuracy as my metric.
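To make the arithmetic concrete, here is a small sketch that recomputes the metrics from the counts quoted above, treating "insured" as the positive class. The single true negative is inferred from the 9 residents predicted as uninsured minus the 8 false negatives; the variable names are mine.

```python
# Counts taken from the evaluation described above (positive class = insured).
tp = 30_256  # insured, predicted insured
fp = 1_595   # uninsured, predicted insured
fn = 8       # insured, predicted uninsured
tn = 1       # uninsured, predicted uninsured (inferred: 9 predicted uninsured - 8 FN)

total = tp + fp + fn + tn

accuracy = (tp + tn) / total
precision = tp / (tp + fp)       # share of "insured" predictions that are correct
recall = tp / (tp + fn)          # share of insured residents that were found
specificity = tn / (tn + fp)     # share of uninsured residents that were found

print(f"accuracy:    {accuracy:.4f}")
print(f"precision:   {precision:.4f}")
print(f"recall:      {recall:.4f}")
print(f"specificity: {specificity:.4f}")
```

Specificity is included because it reports directly how many of the uninsured residents the model managed to identify.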
One way to balance the classes is to randomly select observations from the majority class so that the number selected matches the size of the minority class. However, this discards data points, so it is best applied when the dataset is large enough to afford the loss.
The accompanying code is a continuation of the original fit from above. After encoding the categoricals, scaling the continuous variables, and reassigning X_train and y_train, I will undersample the majority class, ‘covered’.
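The undersampling step can be illustrated with a short standard-library sketch. It works on row indices, so the returned indices could be used to select rows from whatever container holds X_train and y_train. The helper name `undersample_indices`, the toy labels, and the 1/0 coding are assumptions for illustration, not the original code.

```python
import random
from collections import Counter

def undersample_indices(labels, seed=0):
    """Return row indices with every class cut down to the size of the smallest class."""
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    n_min = min(len(rows) for rows in by_class.values())

    kept = []
    for rows in by_class.values():
        # Sample without replacement; the minority class keeps all of its rows.
        kept.extend(rng.sample(rows, n_min))

    rng.shuffle(kept)  # avoid leaving the rows grouped by class
    return kept

# Toy training labels (assumed coding): 1 = covered (majority), 0 = not covered (minority).
y_train = [1] * 95 + [0] * 5

keep = undersample_indices(y_train)
y_under = [y_train[i] for i in keep]
print(Counter(y_under))
```

Only the training set is resampled here; the evaluation set keeps its original class ratio so that the metrics reflect the real distribution.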
The reverse of undersampling the majority class is to oversample the minority class. Oversampling is done by drawing samples (with replacement, so duplicates are included) from the minority class until its size matches that of the majority class. Oversampling can become problematic because the model may overfit to the specific characteristics of the duplicated observations.
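A minimal standard-library sketch of random oversampling is shown below. As with any index-based resampling, the returned indices can be used to select rows from the training features and labels. The helper name `oversample_indices`, the toy labels, and the 1/0 coding are assumptions for illustration.

```python
import random
from collections import Counter

def oversample_indices(labels, seed=0):
    """Return row indices with every class grown to the size of the largest class."""
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    n_max = max(len(rows) for rows in by_class.values())

    out = []
    for rows in by_class.values():
        out.extend(rows)  # keep every original row
        # Draw the shortfall with replacement, which creates exact duplicates.
        out.extend(rng.choices(rows, k=n_max - len(rows)))

    rng.shuffle(out)
    return out

# Toy training labels (assumed coding): 1 = covered (majority), 0 = not covered (minority).
y_train = [1] * 95 + [0] * 5

idx = oversample_indices(y_train)
y_over = [y_train[i] for i in idx]
print(Counter(y_over))
```

In this toy case the 5 minority rows are stretched to 95, so each one appears many times. That repetition is the source of the overfitting risk described above.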
Instead of simply oversampling the minority class, which adds exact replicas, SMOTE (Synthetic Minority Over-sampling Technique) generates ‘synthetic’ data points that resemble the minority class. SMOTENC is the variant used for datasets containing both continuous and discrete features. I must pass the column indices of the categorical features in the categorical_features parameter. The column indices of the exploded categoricals are 0–10 inclusive.
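To show the idea behind synthetic sampling on mixed data, here is a simplified standard-library sketch. It is not the imbalanced-learn implementation: the function name `smote_nc_sketch`, the toy rows, and the distance rule (a flat penalty of 1 per categorical mismatch) are assumptions made for illustration. Continuous columns are interpolated between a minority row and one of its nearest minority neighbours, while categorical columns take the most common value among those neighbours.

```python
import random
from collections import Counter

def smote_nc_sketch(minority, cat_idx, n_new, k=3, seed=0):
    """Create n_new synthetic rows from a list of minority-class rows."""
    rng = random.Random(seed)
    cat_idx = sorted(set(cat_idx))
    num_idx = [j for j in range(len(minority[0])) if j not in cat_idx]

    def distance(a, b):
        # Squared Euclidean distance on continuous columns plus a flat
        # penalty of 1 per categorical mismatch (a simplification).
        d = sum((a[j] - b[j]) ** 2 for j in num_idx)
        return d + sum(a[j] != b[j] for j in cat_idx)

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen row, excluding the row itself.
        others = [row for row in minority if row is not base]
        neighbours = sorted(others, key=lambda row: distance(base, row))[:k]
        partner = rng.choice(neighbours)
        gap = rng.random()  # position along the segment between base and partner

        new_row = list(base)
        for j in num_idx:
            new_row[j] = base[j] + gap * (partner[j] - base[j])
        for j in cat_idx:
            # Most frequent category among the neighbours.
            new_row[j] = Counter(row[j] for row in neighbours).most_common(1)[0][0]
        synthetic.append(new_row)
    return synthetic

# Toy minority rows: columns 0-1 are binary indicators, columns 2-3 are scaled continuous values.
minority = [
    [1, 0, -0.5, 0.2],
    [1, 0, -0.3, 0.1],
    [0, 1, 0.4, -0.6],
    [0, 1, 0.6, -0.4],
    [1, 0, -0.1, 0.3],
]

new_rows = smote_nc_sketch(minority, cat_idx=[0, 1], n_new=10)
for row in new_rows[:3]:
    print(row)
```

For a matrix whose categorical columns sit at indices 0–10, the equivalent argument to this sketch would be `cat_idx=range(11)`.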