As the title says, I want to explain the difference between OneHotEncoder and pandas.get_dummies.
In short, if I’m doing machine learning, I should use OneHotEncoder (OHE) over get_dummies. OHE does the same thing as get_dummies, but in addition it saves the exploded categories in its object.
Saving exploded categories is extremely useful when I want to apply the same data pre-processing to my test set. If the set of unique values in a categorical column is not the same for my train set and my test set, I’m going to have problems.
For example, when I apply get_dummies() to the ‘Call_Me?’ column of the training dataframe, I get 3 new columns because there are 3 unique values. However, the testing dataframe has only 2 unique values in the ‘Call_Me?’ column, so get_dummies() returns only 2 new columns.
As a result, I’ll get errors when I fit a model on the training set and then predict on test features of a different shape.
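The mismatch can be reproduced with a small sketch. The ‘Call_Me?’ column name comes from this post, but the values (Yes, No, Maybe) and the dataframes are made up for illustration.

```python
import pandas as pd

# Hypothetical data: only the column name 'Call_Me?' comes from the post.
train = pd.DataFrame({"Call_Me?": ["Yes", "No", "Maybe", "Yes"]})
test = pd.DataFrame({"Call_Me?": ["Yes", "No"]})

# get_dummies only knows about the values present in the frame it is given.
train_dummies = pd.get_dummies(train, columns=["Call_Me?"])
test_dummies = pd.get_dummies(test, columns=["Call_Me?"])

print(train_dummies.shape)  # (4, 3): one column per value seen in train
print(test_dummies.shape)   # (2, 2): 'Maybe' never appears in test
```

A model fit on the 3-column training features would not accept the 2-column test features as they are.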
OneHotEncoder can transform the test dataframe using the saved exploded categories from the fit on the training set, so the test output has the same columns as the training output.
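Here is a minimal sketch of fitting on train and transforming test. The data is again hypothetical, and handle_unknown="ignore" is my own choice so that a value seen only in the test set would not raise an error; it is not something this example strictly needs.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: only the column name 'Call_Me?' comes from the post.
train = pd.DataFrame({"Call_Me?": ["Yes", "No", "Maybe", "Yes"]})
test = pd.DataFrame({"Call_Me?": ["Yes", "No"]})

# Fit once on the training set; the categories are stored on the encoder.
ohe = OneHotEncoder(handle_unknown="ignore")
ohe.fit(train[["Call_Me?"]])

# transform() returns a sparse matrix by default; toarray() makes it dense.
train_encoded = ohe.transform(train[["Call_Me?"]]).toarray()
test_encoded = ohe.transform(test[["Call_Me?"]]).toarray()

print(train_encoded.shape)  # (4, 3)
print(test_encoded.shape)   # (2, 3): same 3 columns, 'Maybe' is all zeros
```

Both outputs have 3 columns because the encoder reuses the categories it learned from the training set.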
Moreover, the ohe object has attributes, such as categories_, for inspecting the saved exploded categories.
Conclusion
For quick data cleaning and EDA, it makes a lot of sense to use pandas get_dummies. However, if I plan to transform a categorical column into multiple binary columns for machine learning, it’s better to use OneHotEncoder().
Source:
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html