Introduction to Machine Learning – Concepts II

In the previous tutorial, we learned the basic concepts of machine learning, such as the various ML learning methods and how each of them is further divided into categories. You may now be tempted to open your favorite code editor and start building ML models. Not so fast: before we dive into building models, we first have to understand our data.

Photo by Luke Chesser on Unsplash

“A machine learning model is only as good as the data it is fed.”

Machine learning automatically finds complex patterns in the data. These patterns are captured in an ML model, which is then used to make predictions on new data points. So, to improve the accuracy of the model, we must first learn some data preprocessing steps.

What is Data Preprocessing?

Preprocessing the data is one of the most important stages in ML applications. Data-gathering methods are often loosely controlled, resulting in out-of-range values (e.g., Income: −100), impossible data combinations (e.g., Sex: Male, Pregnant: Yes), missing values, and so on. Preprocessing is the technique of transforming raw data into an understandable format.
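
To make the three kinds of problems mentioned above concrete, here is a minimal sketch using Pandas. The records and the column names (Income, Sex, Pregnant) are made up for illustration and are not part of the housing dataset used later in this tutorial.

```python
import pandas as pd

# Hypothetical toy records that contain the three problems described above.
records = pd.DataFrame({
    "Income": [52000, -100, 61000, None],
    "Sex": ["Female", "Male", "Male", "Female"],
    "Pregnant": ["No", "No", "Yes", "Yes"],
})

# Out-of-range values: an income cannot be negative.
out_of_range = records[records["Income"] < 0]

# Impossible combinations: Sex is Male while Pregnant is Yes.
impossible = records[(records["Sex"] == "Male") & (records["Pregnant"] == "Yes")]

# Missing values: count the empty cells in each column.
missing_counts = records.isnull().sum()

print("Rows with out-of-range income:", out_of_range.index.tolist())
print("Rows with impossible combinations:", impossible.index.tolist())
print("Missing values per column:")
print(missing_counts)
```

Checks like these only find the problems. Deciding what to do with the affected rows or columns is the actual preprocessing step.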

Data preprocessing involves two stages: data engineering and feature engineering. Data engineering is the process of converting raw data into prepared data. Once the data is prepared, we use feature engineering to create the features that are fed to our ML model. The image below shows all of the data preprocessing steps:

Source: Google Cloud

What are Features?

The columns from the dataset that are fed to the ML model in order to train the model and later to make predictions are called features.

The table above is used to make the house price predictions in our ML model. As you can see, the price of a house depends on multiple factors: the number of rooms, area, distance from the city center, land size, year built, and so on. All of these factors are correlated with the price and affect it. In ML terminology, these factors (columns) are therefore known as features.

Now the features can be of two types:

1. Categorical Features:

Categorical features take values that are discrete in nature. For example, the days of the week (Monday, Tuesday, Wednesday, etc.) are categorical values because they are always taken from a fixed set of days.

Categorical Features can be divided into two categories:

  • Ordinal Categorical Features: These categorical values can be ordered. For example, the size of clothes (Small < Medium < Large).
  • Nominal Categorical Features: These categorical values do not have an implied order. For example, the color of clothes (Red, Green, White, etc.).

2. Numerical Features: Numerical features take values that are continuous or integer-valued and are represented by numbers. For example, the number of rooms in a house (2, 3, 4, ...) or the area of the house (300 sq. feet, 400 sq. feet, etc.).
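
The feature types described above can be sketched in Pandas as follows. This is only an illustration under assumed data: the frame and its columns (Size, Color, Rooms, Area) are hypothetical and are not taken from the housing dataset.

```python
import pandas as pd

# Hypothetical toy data: one ordinal, one nominal, and two numerical features.
samples = pd.DataFrame({
    "Size": ["Small", "Large", "Medium"],
    "Color": ["Red", "Green", "White"],
    "Rooms": [2, 3, 4],
    "Area": [300.0, 400.0, 350.0],
})

# Ordinal: declare the order explicitly so that sorting respects it.
samples["Size"] = pd.Categorical(
    samples["Size"], categories=["Small", "Medium", "Large"], ordered=True
)

# Nominal: a plain category with no implied order.
samples["Color"] = samples["Color"].astype("category")

# Sorting an ordered categorical follows Small < Medium < Large.
sorted_sizes = samples["Size"].sort_values().tolist()

# Numerical features can be picked out by their dtype.
numerical_columns = samples.select_dtypes(include="number").columns.tolist()

print("Sizes in order:", sorted_sizes)
print("Numerical columns:", numerical_columns)
```

Selecting columns by dtype is the same idea we use later in this tutorial to keep only the numerical features.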

There are many more data preprocessing stages, but they are out of scope for this tutorial. The preprocessing steps also depend on the kind of data we use to train the model. In this example, we focus only on handling numerical data.

1. Handling Missing Values

For this tutorial, we will use the housing price dataset and the Pandas library for data manipulation. We will use two different methods for handling the missing values.

  • Drop the columns that contain missing values.
  • Imputation: We will replace the missing values with the mean value of each column.

We start by importing the necessary modules, such as Pandas and Matplotlib, for data manipulation and visualization. We will use the scikit-learn ML library to build a model and measure its error after handling the missing values.

At this point do not worry about the ML model used. This tutorial is only to show the impact of preprocessing steps on the prediction accuracy of our model.

You can download the dataset used in this tutorial from here:

# Import Required modules 

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Load the data (Change path as per your file location)

path = '/home/anil-io/ML/Tutorial/House Pricing/housedata.csv'
data = pd.read_csv(path)

# Take Price column as our output
y = data.Price

# Let's See what our data contains
print(data.head())

Our dataframe has 13580 rows and 11 columns. The data.head() call prints the first 5 rows of the dataframe. As we can see, the columns BuildingArea and YearBuilt contain some missing values. Since we are working only with numerical data, let’s drop all categorical features from the dataframe.

# Using only Numerical values for features

melb_pred = data.drop(['Price'],axis =1)                  # Drops Price from Feature list
X = melb_pred.select_dtypes(exclude = ['object'])         # Excludes any non-numeric column
print(X.columns)

As we can see, the categorical column suburb has been dropped from the dataframe, so we can now move on to handling the missing values.

Next, we split the data into training and validation sets using the train_test_split() function. We use 80% of the data for training and 20% for validation. The training set is used to train our model: the model learns its internal parameters and the relationships between the features and the price. The validation set is used to validate, or test, the model on data it has not seen during training.

# Split the data into training and validation sets

X_train,X_valid,y_train,y_valid = train_test_split(X,y,train_size=0.8,test_size=0.2,random_state=0)
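
To see what the split does, here is a small sketch on made-up data. The names toy_X, toy_y, X_tr, X_va, y_tr and y_va are hypothetical and chosen so that they do not clash with the variables used in this tutorial.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy features and target with 10 rows.
toy_X = pd.DataFrame({"Rooms": np.arange(10), "Area": np.arange(10) * 50.0})
toy_y = pd.Series(np.arange(10) * 1000.0, name="Price")

# 80% of the rows go to training, 20% to validation.
X_tr, X_va, y_tr, y_va = train_test_split(
    toy_X, toy_y, train_size=0.8, test_size=0.2, random_state=0
)

print("Training rows:", len(X_tr))
print("Validation rows:", len(X_va))

# Features and target stay aligned: each row keeps its own price.
print("Indexes aligned:", (X_tr.index == y_tr.index).all())
```

Fixing random_state makes the shuffle reproducible, so the same rows end up in the same set every time the code is run.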

The next step is to define a function that takes the training and validation data as arguments, trains the model, and generates predictions for the validation set. The function returns the mean absolute error (MAE) of those predictions, which we use to measure the accuracy of our model.

# Define the RandomForestRegressor Model 

def mymodel(X_train,X_valid,y_train,y_valid):
    model = RandomForestRegressor(n_estimators=10, random_state=0)
    model.fit(X_train,y_train)
    predict = model.predict(X_valid)
    mae = mean_absolute_error(y_valid,predict)
    return mae
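
The mean absolute error returned by the function above is the average of the absolute differences between the actual and the predicted values, so a lower value means better predictions. As a rough sketch with made-up prices (the arrays actual and predicted are hypothetical), it can be computed by hand and compared with scikit-learn:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical actual and predicted house prices.
actual = np.array([500000.0, 750000.0, 1000000.0])
predicted = np.array([520000.0, 700000.0, 1010000.0])

# MAE by hand: the mean of the absolute errors.
manual_mae = np.mean(np.abs(actual - predicted))

# The same metric from scikit-learn.
sklearn_mae = mean_absolute_error(actual, predicted)

print("Manual MAE:", manual_mae)
print("scikit-learn MAE:", sklearn_mae)
```

Because the errors are not squared, the MAE is expressed in the same unit as the target, which here is the price.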

Once we have our model defined, we will process the data that we are going to feed it. For this, we will look for all the columns that contain at least one missing value.

Let’s Pre-process the data!

Photo by Alexander Sinn on Unsplash

# Find columns that contain missing values

missing_columns = [col for col in X_train.columns if X_train[col].isnull().any()]
print(f'Training dataset has {X_train.shape[0]} rows and {X_train.shape[1]} columns')
print('These columns have missing values:',missing_columns)

# print no of missing values in these columns 
no_of_missing_value = (X_train.isnull().sum())
print(no_of_missing_value[no_of_missing_value > 0])

As we can see, the columns ‘Car’, ‘BuildingArea’, and ‘YearBuilt’ contain at least one missing value, and we have stored them in the list missing_columns. So now let’s drop these columns from our dataset and feed the new training and validation data to the model that we created earlier.

1. Drop Columns

The simplest approach is to drop every column that contains any missing value. However, we might lose a lot of important information this way.
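
The loss of information is easy to see on a small example. The sketch below uses a made-up frame (toy, cols_with_missing and reduced are hypothetical names): a column is removed completely even if only one of its values is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: BuildingArea is missing only one value out of four.
toy = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "BuildingArea": [120.0, np.nan, 150.0, 98.0],
    "YearBuilt": [1970.0, 1995.0, np.nan, np.nan],
})

# Collect every column that has at least one missing value and drop it.
cols_with_missing = [col for col in toy.columns if toy[col].isnull().any()]
reduced = toy.drop(cols_with_missing, axis=1)

print("Dropped columns:", cols_with_missing)
print("Remaining columns:", reduced.columns.tolist())
```

In this toy case, three known BuildingArea values are thrown away because of a single gap. Pandas also offers toy.dropna(axis=1), which gives the same result in one call.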

# Let's drop these columns and generate prediction and Mean absolute error

new_X_train = X_train.drop(missing_columns, axis = 1)
new_X_valid = X_valid.drop(missing_columns, axis = 1)

print('Approach 1 (Drop missing value columns)')
ap1 = mymodel(new_X_train,new_X_valid,y_train,y_valid) 
print('MAE from Approach 1:',ap1)

2. Imputation

Dropping the columns is not a good option when the columns containing missing values are strongly correlated with the target. In such cases, we use imputation, where each missing value is replaced by the mean or median value of its column.

In Approach 2 we will use the SimpleImputer class, which computes the mean of each column and uses it to fill in the missing values in that column. We call fit_transform on our training data so that the imputer learns these column means.

So calling the imputer’s fit on the training data just calculates the mean of each column of the training data. Calling transform on the validation data then replaces its missing values with the means that were calculated from the training data.
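
The difference between fit and transform can be checked on a tiny example. This is a sketch on made-up numbers (toy_train, toy_valid and toy_imputer are hypothetical names), not the housing data itself.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy training and validation data with gaps.
toy_train = pd.DataFrame({
    "BuildingArea": [100.0, 200.0, np.nan],
    "YearBuilt": [1990.0, np.nan, 2010.0],
})
toy_valid = pd.DataFrame({
    "BuildingArea": [np.nan, 400.0],
    "YearBuilt": [2000.0, np.nan],
})

toy_imputer = SimpleImputer(strategy="mean")

# fit_transform learns the column means from the training data and fills its gaps.
train_filled = pd.DataFrame(
    toy_imputer.fit_transform(toy_train),
    columns=toy_train.columns, index=toy_train.index,
)

# transform reuses the training means; it does not look at validation statistics.
valid_filled = pd.DataFrame(
    toy_imputer.transform(toy_valid),
    columns=toy_valid.columns, index=toy_valid.index,
)

print("Means learned from training data:", toy_imputer.statistics_)
print(valid_filled)
```

The missing BuildingArea in the validation set is filled with the training mean, not with a value derived from the validation rows. This keeps information from the validation set from leaking into the preprocessing.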


imputer = SimpleImputer(strategy='mean')    # Change strategy to 'median' and compare the score
imp_X_train = pd.DataFrame(imputer.fit_transform(X_train))
imp_X_valid = pd.DataFrame(imputer.transform(X_valid))

# Imputation removes column names; lets put them back

imp_X_train.columns = X_train.columns
imp_X_valid.columns = X_valid.columns

ap2 = mymodel(imp_X_train,imp_X_valid,y_train,y_valid)
print('Approach 2 (Impute missing Value)')
print('MAE from Approach 2:',ap2)

As we can see, the mean absolute error was significantly reduced after performing the imputation. There are many other parameters that can be fine-tuned to reduce the error even further, but for now that is out of the scope of this article.

Complete Code

# Dealing with the missing Data in Melbourne House Price Dataset

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

path = '/home/anil-io/ML/Tutorial/House Pricing/housedata.csv'
data = pd.read_csv(path)
#print(data.columns)
y = data.Price

# Using only Numerical values for features
melb_pred = data.drop(['Price'],axis =1)   # Drops Price from Feature list
X = melb_pred.select_dtypes(exclude = ['object']) # Excludes any non-numeric column

# Divide data into Training and Validation set
X_train,X_valid,y_train,y_valid = train_test_split(X,y,train_size=0.8,test_size=0.2,random_state=0)


# Now Let's Build our model (Random Forest)
def melbmodel(X_train,X_valid,y_train,y_valid):
    model = RandomForestRegressor(n_estimators=10, random_state=0)
    model.fit(X_train,y_train)
    predict = model.predict(X_valid)
    mae = mean_absolute_error(y_valid,predict)
    return mae

# Now we will use different missing data approach and calc. Mean absolute Error for each case
# 1. Dropping the Column with NA values
# 2. Imputing the values 

missing_columns = [col for col in X_train.columns if X_train[col].isnull().any()]

print(f'\nTraining dataset has {X_train.shape[0]} rows and {X_train.shape[1]} columns')
print('\nThese columns have missing values:',missing_columns)

# print no of missing values in these columns 
no_of_missing_value = (X_train.isnull().sum())
print(no_of_missing_value[no_of_missing_value> 0])


''' Approach 1 '''
# Let's drop these columns and calc prediction and Mean absolute error
new_X_train = X_train.drop(missing_columns, axis = 1)
new_X_valid = X_valid.drop(missing_columns, axis = 1)

print('\nApproach 1 (Drop missing value columns)')
ap1 = melbmodel(new_X_train,new_X_valid,y_train,y_valid)
print('MAE from Approach 1:',ap1)

''' Approach 2 '''
imputer = SimpleImputer(strategy='mean')    # Change strategy to 'median' and compare the score
imp_X_train = pd.DataFrame(imputer.fit_transform(X_train))
imp_X_valid = pd.DataFrame(imputer.transform(X_valid))

# Imputation removes column names; put them back
imp_X_train.columns = X_train.columns
imp_X_valid.columns = X_valid.columns

ap2 = melbmodel(imp_X_train,imp_X_valid,y_train,y_valid)

print('\nApproach 2 (Impute missing Value)')
print('MAE from Approach 2:',ap2)

What is next?

In the next tutorial, we will learn how to handle categorical values in our dataset, and finally use these handling methods to build a good set of features to feed to our ML models.
