Preprocessing: OneHotEncoder() vs pandas.get_dummies

Photo by Priscilla Du Preez on Unsplash

As the title says, I want to explain the difference between OneHotEncoder and pandas.get_dummies.

In short, if I’m doing machine learning, I should use OneHotEncoder (OHE) over get_dummies. OHE does the same thing as get_dummies, but in addition, it saves the exploded categories in its fitted object.

Saving the exploded categories is extremely useful when I want to apply the same data pre-processing to my test set. If a categorical column does not contain the same set of unique values in my train set and my test set, I’m going to have problems.

For example, when I apply get_dummies() to the ‘Call_Me?’ column of the training dataframe, I get 3 new columns because there are 3 unique values. However, the testing dataframe has only 2 unique values in the ‘Call_Me?’ column, so get_dummies() returns only 2 new columns.
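Here is a minimal sketch of that mismatch. The original dataframes are not shown in this post, so the ‘yes’, ‘no’, and ‘maybe’ values below are made up for illustration; only the ‘Call_Me?’ column name comes from the example above.

```python
import pandas as pd

# Hypothetical data: the values are assumptions, not the original dataset.
train = pd.DataFrame({"Call_Me?": ["yes", "no", "maybe", "yes"]})
test = pd.DataFrame({"Call_Me?": ["yes", "no"]})

# get_dummies only sees the values present in the frame it is given.
train_dummies = pd.get_dummies(train, columns=["Call_Me?"])
test_dummies = pd.get_dummies(test, columns=["Call_Me?"])

print(train_dummies.columns.tolist())  # 3 columns: maybe, no, yes
print(test_dummies.columns.tolist())   # 2 columns: no, yes
```

The two frames end up with different column counts, even though they describe the same feature.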

As a result, I’ll get errors when I fit a model on the training set and then predict on test features of a different shape.
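A rough sketch of how that failure can look in practice. The data, the target values, and the choice of LogisticRegression are all assumptions made for this illustration; any scikit-learn estimator should complain in a similar way when the number of features changes between fit and predict.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical data and target: made up for this sketch.
train = pd.DataFrame({"Call_Me?": ["yes", "no", "maybe", "yes"]})
test = pd.DataFrame({"Call_Me?": ["yes", "no"]})
y_train = [1, 0, 0, 1]

# Plain arrays, so the only mismatch is the number of columns.
X_train = pd.get_dummies(train, columns=["Call_Me?"]).to_numpy(dtype=float)
X_test = pd.get_dummies(test, columns=["Call_Me?"]).to_numpy(dtype=float)

model = LogisticRegression().fit(X_train, y_train)

try:
    model.predict(X_test)  # 2 features, but the model was fit on 3
    error = None
except ValueError as exc:
    error = exc
    print("Prediction failed:", exc)
```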

OneHotEncoder, on the other hand, can transform the test dataframe using the exploded categories it saved when I fit it on the training set.
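A minimal sketch of that workflow, again with made-up ‘Call_Me?’ values. The encoder is fit once on the training column and then reused on the test column, so both outputs have the same number of columns.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: the values are assumptions, not the original dataset.
train = pd.DataFrame({"Call_Me?": ["yes", "no", "maybe", "yes"]})
test = pd.DataFrame({"Call_Me?": ["yes", "no"]})

ohe = OneHotEncoder()
ohe.fit(train[["Call_Me?"]])  # learn the categories from the training set only

# transform() returns a sparse matrix by default; toarray() makes it dense.
train_encoded = ohe.transform(train[["Call_Me?"]]).toarray()
test_encoded = ohe.transform(test[["Call_Me?"]]).toarray()

print(train_encoded.shape)  # (4, 3)
print(test_encoded.shape)   # (2, 3)
```

The test set still has no ‘maybe’ rows, but it keeps a ‘maybe’ column filled with zeros, so its shape matches what a model fit on the training set expects.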

Moreover, the fitted ohe object has attributes, such as categories_, for inspecting the saved exploded categories.
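For instance, the categories_ attribute of a fitted encoder holds one array of categories per input column. A short sketch, using the same made-up training values as an assumption:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: the values are assumptions, not the original dataset.
train = pd.DataFrame({"Call_Me?": ["yes", "no", "maybe", "yes"]})

ohe = OneHotEncoder().fit(train[["Call_Me?"]])

# One array per encoded column; here there is only the 'Call_Me?' column.
saved_categories = ohe.categories_[0].tolist()
print(saved_categories)  # ['maybe', 'no', 'yes']
```

The order of this list is also the order of the output columns produced by transform().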

Conclusion

For quick data cleaning and EDA, it makes a lot of sense to use pandas.get_dummies. However, if I plan to transform a categorical column into multiple binary columns for machine learning, it’s better to use OneHotEncoder().

Source:

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

Hello! My name is Albert Um.
