Preprocessing: OneHotEncoder() vs pandas.get_dummies

Albert Um
2 min read · Aug 10, 2020

As the title suggests, I want to explain the difference between OneHotEncoder and pandas.get_dummies.

In short, if I’m doing machine learning, I should use OneHotEncoder (OHE) over get_dummies. OHE does the same thing as get_dummies, but in addition it saves the exploded categories in the fitted encoder object.

Saving the exploded categories is extremely useful when I want to apply the same preprocessing to my test set. If the unique values in a categorical column are not the same in my train set and my test set, I’m going to have problems.

For example, when I apply get_dummies() to the ‘Call_Me?’ column of the training dataframe, I get 3 new columns because there are 3 unique values. However, the testing dataframe has only 2 unique values in the ‘Call_Me?’ column, so get_dummies() returns only 2 new columns.
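
A minimal sketch of this mismatch. Only the ‘Call_Me?’ column name comes from the example above; the values "Yes", "No" and "Maybe" are made up for illustration.

```python
import pandas as pd

# Hypothetical data: only the column name 'Call_Me?' is from the article.
train = pd.DataFrame({"Call_Me?": ["Yes", "No", "Maybe", "Yes"]})
test = pd.DataFrame({"Call_Me?": ["Yes", "No"]})

# get_dummies only looks at the dataframe it is given, so each call
# creates one column per unique value found in that dataframe.
train_dummies = pd.get_dummies(train, columns=["Call_Me?"])
test_dummies = pd.get_dummies(test, columns=["Call_Me?"])

print(train_dummies.shape)  # 3 dummy columns
print(test_dummies.shape)   # 2 dummy columns

# The column that exists in train but is missing from test.
missing = set(train_dummies.columns) - set(test_dummies.columns)
print(missing)
```

The test dataframe never sees "Maybe", so get_dummies has no way of knowing that a third column should exist.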

As a result, I’ll get errors when I fit a model on the training set and then try to predict on test features of a different shape.
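
To make the failure concrete, here is a rough sketch using LogisticRegression. The choice of model, the labels in y_train, and the category values are all assumptions for illustration; any scikit-learn estimator should complain in a similar way when the number of features changes between fit and predict.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical data and labels, made up for this sketch.
train = pd.DataFrame({"Call_Me?": ["Yes", "No", "Maybe", "Yes"]})
test = pd.DataFrame({"Call_Me?": ["Yes", "No"]})
y_train = [1, 0, 0, 1]

X_train = pd.get_dummies(train, columns=["Call_Me?"], dtype=int)  # 3 columns
X_test = pd.get_dummies(test, columns=["Call_Me?"], dtype=int)    # 2 columns

model = LogisticRegression().fit(X_train, y_train)

try:
    model.predict(X_test)
    predict_failed = False
except ValueError as err:
    # The model was fit on 3 features but is asked to predict on 2.
    predict_failed = True
    print("predict failed:", err)
```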

OneHotEncoder, in contrast, transforms the test dataframe using the saved categories it learned when I fit it on the training set, so the test output has the same columns as the training output.
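
A short sketch of that workflow, again with made-up values for the ‘Call_Me?’ column. The encoder is fit once on the training data and then reused, unchanged, on the test data.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: only the column name 'Call_Me?' is from the article.
train = pd.DataFrame({"Call_Me?": ["Yes", "No", "Maybe", "Yes"]})
test = pd.DataFrame({"Call_Me?": ["Yes", "No"]})

ohe = OneHotEncoder()
ohe.fit(train[["Call_Me?"]])  # learns the categories from train only

# transform returns a sparse matrix by default; toarray() makes it dense.
train_encoded = ohe.transform(train[["Call_Me?"]]).toarray()
test_encoded = ohe.transform(test[["Call_Me?"]]).toarray()

print(train_encoded.shape)  # 3 columns
print(test_encoded.shape)   # still 3 columns
print(test_encoded)
```

Even though the test set has no "Maybe" rows, the encoded test array keeps a column for it (filled with zeros), so the shapes line up.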

Moreover, the fitted ohe object exposes the saved categories through its categories_ attribute.
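
For instance, assuming the same made-up ‘Call_Me?’ values as before, the categories can be inspected after fitting. The column_names list below is just one way I might label the encoded columns, not something the encoder produces itself.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: only the column name 'Call_Me?' is from the article.
train = pd.DataFrame({"Call_Me?": ["Yes", "No", "Maybe", "Yes"]})

ohe = OneHotEncoder()
ohe.fit(train[["Call_Me?"]])

# categories_ is a list with one array per encoded column,
# in the same order as the output columns.
saved = list(ohe.categories_[0])
print(saved)

# Hand-built labels for the encoded columns (an illustrative convention).
column_names = [f"Call_Me?_{category}" for category in saved]
print(column_names)
```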

Conclusion

For quick data cleaning and EDA, it makes a lot of sense to use pandas.get_dummies. However, if I plan to transform a categorical column into multiple binary columns for machine learning, it’s better to use OneHotEncoder().

Source:

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
