In the previous tutorial, we learned how to handle numerical data in our datasets. In this tutorial, we will learn how to handle categorical values in our dataset and use these methods to build a more robust machine learning model.
“A machine learning model is only as good as the data it is fed.”
What are categorical values?
Categorical features take their values from a fixed, discrete set. For example, the days of the week (Monday, Tuesday, Wednesday, etc.) are categorical because every value comes from the same set of seven days.
Categorical Features can be divided into two categories:
Ordinal Categorical Features: These categorical values can be ordered. For example, the size of clothes (Small < Medium < Large).
Nominal Categorical Features: These categorical values do not have an implied order. For example, the color of clothes (Red, Green, White, etc.).
Consider a survey that asks how often you travel by car and provides four options: “Never”, “Rarely”, “Most days”, or “Every day”. In this case, the data is categorical, because responses fall into a fixed set of categories.
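As a minimal sketch of what "ordered" means in practice, pandas can store the survey answers as an ordered categorical. The series name car_use and the sample responses are made up for illustration and are not part of the tutorial's dataset.

```python
import pandas as pd

# Hypothetical survey responses (illustrative data only)
responses = pd.Series(
    ["Rarely", "Never", "Every day", "Most days", "Rarely"], name="car_use"
)

# Declare the fixed set of categories and their order (an ordinal feature)
levels = ["Never", "Rarely", "Most days", "Every day"]
car_use = responses.astype(pd.CategoricalDtype(categories=levels, ordered=True))

# Integer codes follow the declared order:
# Never=0, Rarely=1, Most days=2, Every day=3
codes = car_use.cat.codes.tolist()
print(codes)

# Order-aware comparisons are only meaningful for ordinal features
more_than_rarely = (car_use > "Rarely").tolist()
print(more_than_rarely)
```

For a nominal feature such as color, the same integer codes would imply an order that does not exist, which is one reason to choose the encoding per column.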
You will get an error if you try to plug these variables into most machine learning models in Python without preprocessing them first. In this tutorial, we’ll compare three approaches that you can use to prepare your categorical data:
- Drop Categorical Variables
- Label Encoding (also called Ordinal Encoding)
- One-Hot Encoding
The table above is used to make the house price predictions in our ML model. As you can see, the price of a house depends on multiple factors: the number of rooms, the area, the distance from the city center, the land size, the year it was built, and so on. All of these factors are correlated with the price. Columns such as Suburb, Address, and Type are categorical features. We can also see many inconsistent values in the column data. We will use data preparation methods to fix these inconsistencies and get a better model.
1. Drop Categorical Values
Dropping the categorical columns is the simplest way to remove the problematic data from our dataset. However, it might not be the best approach if those columns are strongly correlated with the target or otherwise contain useful information.
We start by importing the necessary modules, such as pandas and Matplotlib, for data manipulation and visualization. We will use the scikit-learn ML library to build the model and measure its error after each way of handling the categorical values.
In the complete code at the end of this tutorial, we first drop every column that has a missing value (the missing_columns list) from both the training and validation sets. Next, we build low_cardinality_cols, the categorical columns with fewer than 10 unique values. Because every unique category has to be represented after encoding, we keep only low-cardinality columns to avoid excessive memory usage and to reduce the computation time for our ML model.
Finally, we create new training and validation sets from my_cols, which keeps only the numerical columns and the low-cardinality categorical columns.
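The selection steps described above can be sketched on a tiny table. The values below and the cardinality threshold of 4 are made up for illustration; the tutorial's own code uses a threshold of 10 on the Melbourne data.

```python
import pandas as pd

# Toy data standing in for the housing table (values are made up)
df = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "Type": ["h", "u", "h", "t"],
    "Address": ["1 A St", "2 B Rd", "3 C Ave", "4 D Ct"],
})
# Force the classic 'object' dtype that the tutorial's code checks for
df = df.astype({"Type": object, "Address": object})

# Cardinality = number of unique values in each text column
cardinality = df.select_dtypes(include="object").nunique()
print(cardinality)

# Keep text columns below an (arbitrary) cardinality threshold
low_card_cols = [c for c in cardinality.index if cardinality[c] < 4]

# Approach 1: drop every text column and keep only the numerical ones
numeric_only = df.select_dtypes(exclude="object")
print(low_card_cols, list(numeric_only.columns))
```

Here Address has one unique value per row, so encoding it would add many columns with little signal, while Type has only three categories.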
You can download the dataset used in this tutorial from here:
2. Label Encoding
Label Encoding is a method that assigns a unique integer value to each unique category. Because ML models operate on numbers rather than text, we need to convert the categories to a numeric format so that the model can perform operations on them. The image below is a good example of how Label Encoding works.
In pandas, categorical columns stored as text have the datatype ‘object’, and we use this information to find the columns that hold categorical values. We then fit the label encoder on each of these columns in the training set and apply the same mapping to the validation set, which gives us the encoded train and validation sets to feed to our ML model.
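A small sketch of the mapping, using scikit-learn's LabelEncoder on made-up color values. It also shows one thing to watch for: a category that appears only in the validation data cannot be transformed.

```python
from sklearn.preprocessing import LabelEncoder

train_colors = ["Red", "Green", "White", "Red"]  # made-up training values
valid_colors = ["Green", "Blue"]                 # "Blue" never appears in training

encoder = LabelEncoder()
train_codes = encoder.fit_transform(train_colors).tolist()

# Classes are sorted alphabetically, then numbered from 0
classes = list(encoder.classes_)
print(classes)      # Green=0, Red=1, White=2
print(train_codes)

# A category not seen during fit raises ValueError at transform time
try:
    encoder.transform(valid_colors)
    unseen_failed = False
except ValueError:
    unseen_failed = True
print(unseen_failed)
```

The scikit-learn documentation describes LabelEncoder as intended for target labels; OrdinalEncoder is its counterpart for feature columns and may be worth considering if unseen categories are a concern.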
As we can see, the mean absolute error dropped significantly after performing the label encoding.
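For reference, the mean absolute error (MAE) is the average of the absolute differences between the actual and predicted values, so a lower number is better. A minimal sketch of the arithmetic, with made-up prices:

```python
# Made-up prices, purely to illustrate the arithmetic
actual = [200_000, 300_000, 500_000]
predicted = [210_000, 280_000, 530_000]

# Absolute error per house: 10,000, 20,000 and 30,000
abs_errors = [abs(a - p) for a, p in zip(actual, predicted)]

# MAE is the mean of those absolute errors
mae = sum(abs_errors) / len(abs_errors)
print(mae)
```

This is the same quantity that mean_absolute_error from scikit-learn returns in the complete code.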
3. One-Hot Encoding
Unlike label encoding, the one-hot encoding method creates a new column for each unique category in a column and marks it with 1 if that category is present in the row or 0 if it is absent.
As you can see, the one-hot encoder created three new columns, one for each unique color, and indicated 1 where that color is present.
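The same idea can be reproduced on a toy column. This sketch uses pandas' get_dummies rather than the OneHotEncoder from the complete code, and the Color values are made up.

```python
import pandas as pd

# Made-up color column with three unique categories
df = pd.DataFrame({"Color": ["Red", "Green", "White", "Red"]})

# One new 0/1 column per unique category (columns come out sorted by name)
one_hot = pd.get_dummies(df["Color"], dtype=int)
print(one_hot)

# Every row has exactly one column set to 1
row_sums = one_hot.sum(axis=1).tolist()
print(row_sums)
```

get_dummies is convenient for exploration, while an encoder fitted on the training set makes it easier to apply an identical set of columns to the validation set.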
As we can see in the image below, the one-hot encoder performed very similarly to the label encoder. However, the performance of each method will vary with factors such as the number of unique categories in the dataset, the time complexity, and the other preprocessing steps applied to the dataset.
Complete Code
# Dealing with the categorical Data in Melbourne House Price Dataset
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import mean_absolute_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
path = '../House Pricing/melb_data.csv'
data = pd.read_csv(path)
print(data.head())  # head is a method; call it to print the first rows
y = data.Price
# Use every column except the target (Price) as a feature
X = data.drop(['Price'],axis =1) # Drops Price from Feature list
# Divide data into Training and Validation set
X_train,X_valid,y_train,y_valid = train_test_split(X,y,train_size=0.8,test_size=0.2,random_state=0)
# Drop columns with missing values (simplest approach) / Can Impute Values too
missing_columns = [col for col in X_train.columns if X_train[col].isnull().any()]
X_train.drop(missing_columns, axis = 1,inplace = True,errors = 'ignore')
X_valid.drop(missing_columns, axis = 1,inplace = True,errors = 'ignore')
# "Cardinality" means the number of unique values in a column
# Select categorical columns with relatively low cardinality (convenient but arbitrary)
low_cardinality_cols = [cname for cname in X_train.columns if X_train[cname].nunique()<10 and
                        X_train[cname].dtype == "object"]
# Select numerical columns
numerical_cols = [cname for cname in X_train.columns if X_train[cname].dtype in ['int64', 'float64']]
# Make a new training/validation dataset consisting of the low-cardinality and numerical columns
my_cols = low_cardinality_cols + numerical_cols
X_train = X_train[my_cols].copy()
X_valid = X_valid[my_cols].copy()
# Function for comparing different approaches
def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)
''' Approach 1 (Drop categorical Values)'''
# We drop the object columns with the select_dtypes() method.
d_X_train = X_train.select_dtypes(exclude = ['object'])
d_X_valid = X_valid.select_dtypes(exclude = ['object'])
print(len(X_train),len(d_X_train))
score = score_dataset(d_X_train,d_X_valid,y_train,y_valid)
print('\nDrop column MAE:',score)
''' Approach 2 (Label Encoding)'''
# Categorical values have datatype = 'object'
# let's get a list of all the columns that are categorical
s = (X_train.dtypes =='object')
object_col = list(s[s].index)
print(object_col,'|||',low_cardinality_cols)
# Make copy to avoid changing original data
label_X_train = X_train.copy()
label_X_valid = X_valid.copy()
# Apply label encoder to each column with categorical data
label_encoder = LabelEncoder()
for col in object_col:
    label_X_train[col] = label_encoder.fit_transform(X_train[col])
    label_X_valid[col] = label_encoder.transform(X_valid[col])
score = score_dataset(label_X_train,label_X_valid,y_train,y_valid)
print('\nLabel Encoder MAE:',score)
''' Approach 3 (One-Hot Encoding)'''
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
# handle_unknown='ignore' avoids errors when the validation data contains classes that aren't represented in the training data
# sparse_output=False returns a NumPy array instead of a sparse matrix
# (scikit-learn versions before 1.2 name this parameter sparse instead)
# Apply one-hot encoder to each column with categorical data
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[object_col]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[object_col]))
# One-hot encoding removed index; put it back
OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index
# Remove categorical columns (will replace with one-hot encoding)
num_X_train = X_train.drop(object_col, axis=1)
num_X_valid = X_valid.drop(object_col, axis=1)
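# Add one-hot encoded columns to numerical features
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)
# The one-hot columns have integer names; recent scikit-learn versions require all column names to be strings
OH_X_train.columns = OH_X_train.columns.astype(str)
OH_X_valid.columns = OH_X_valid.columns.astype(str)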
score = score_dataset(OH_X_train,OH_X_valid,y_train,y_valid)
print('\nOne Hot Method MAE:',score)
What is next?
Throughout this ML concept series, we learned about the main families of ML methods and how each of them is further divided into categories. Then we learned about data preprocessing techniques such as handling missing values, numerical data, and categorical data.
The next step is to dive deeper into the ML algorithms, get familiar with various models, and see how they perform on our data.