Introduction to Machine Learning – Concepts II

In the previous tutorial, we learned the basic concepts of machine learning, such as the various ML learning methods and how each of these methods is further divided into categories. Now you are probably itching to open your favorite code editor and start building ML models. Not so fast. Before we dive into building ML models, we first have to understand our data.


“A machine learning model is only as good as the data it is fed.”

Machine learning automatically finds complex patterns in the data, and these patterns are molded into an ML model that is used on new data points to make predictions. So, to improve the accuracy of the model, we must learn some data preprocessing steps.

What is Data Preprocessing?

Preprocessing the data is the most important stage in ML applications. Data-gathering methods are often loosely controlled, resulting in out-of-range values (e.g., Income: −100), impossible data combinations (e.g., Sex: Male, Pregnant: Yes), missing values, and so on. Preprocessing is the technique of transforming such raw data into an understandable format.

Data preprocessing involves two stages: data engineering and feature engineering. Data engineering is the process of converting raw data into prepared data. Once the data is prepared, we use feature engineering concepts to create features that are fed to our ML model. The image below shows the full data preprocessing pipeline:

Source: Google Cloud

What are Features?

The columns from the dataset that are fed to the ML model in order to train the model and later to make predictions are called features.

The table above is used to make house price predictions with our ML model. As you can see, the price of a house depends on multiple factors: it varies with the number of rooms, the area, the distance from the city center, the land size, the year built, and so on. All of these factors are correlated with and affect the price. In ML terminology, these factors/columns are known as features.

Now the features can be of two types:

1. Categorical Features:

Categorical features are values that are discrete in nature. For example, the days of the week (Monday, Tuesday, Wednesday, etc.) are categorical values because they are always taken from the fixed set of days.

Categorical Features can be divided into two categories:

  • Ordinal Categorical Features: These categorical values can be ordered. For example, clothing sizes (Small < Medium < Large).
  • Nominal Categorical Features: These categorical values do not have an implied order. For example, clothing colors (Red, Green, White, etc.).

2. Numerical Features: Numerical features are values that are continuous or integer-valued and are represented by numbers. For example, the number of rooms in a house (2, 3, 4, …) or the area of the house (300 sq. ft., 400 sq. ft., etc.).

There are many more data preprocessing stages, but they are out of scope for this tutorial. Which steps you need also depends on the kind of data used to train the model. In this example, we are only going to focus on handling numerical data.

1. Handling Missing Values

For this tutorial, we will be using a housing price dataset and the Pandas and NumPy libraries for data manipulation. We will use two different methods for handling the missing values:

  • Drop the columns that contain missing values.
  • Imputation: We will replace the missing values with the mean values along each column.

We start by importing the necessary modules, such as Pandas and Matplotlib, for data manipulation and visualization. We will use the scikit-learn ML library to build a model and test its accuracy after handling the missing values.
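Below is a minimal sketch of the imports used throughout this tutorial. The specific regressor (RandomForestRegressor) is an assumption on my part, since the article does not name the model; any scikit-learn regressor would work for comparing the two approaches.

```python
import pandas as pd
import matplotlib.pyplot as plt  # only needed if you want to visualize the data

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor  # assumed model choice
from sklearn.metrics import mean_absolute_error
from sklearn.impute import SimpleImputer
```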

At this point, do not worry about the ML model used. This tutorial only shows the impact of preprocessing steps on the prediction accuracy of our model.

You can download the dataset used in this tutorial from here:

Our dataframe has 13,580 rows and 11 columns. The data.head() command prints the first 5 rows of the dataframe. As we can see, the columns BuildingArea and YearBuilt contain some missing values. Since we are only covering numerical data handling, let's drop all categorical features from the dataframe.
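As a rough sketch, loading the data and dropping the non-numerical columns might look like this (the file name melb_data.csv is a placeholder for the dataset linked above):

```python
# Hypothetical file name; use the dataset downloaded from the link above.
data = pd.read_csv("melb_data.csv")

print(data.shape)   # (13580, 11)
print(data.head())  # first 5 rows; BuildingArea and YearBuilt contain NaN values

# Keep only numerical features; categorical (object-typed) columns such as Suburb are dropped.
data = data.select_dtypes(exclude=["object"])
```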

As we can see, the column Suburb has been dropped from the dataframe, and now we can handle the missing values.

Next, we will split the data into training and validation sets using the train_test_split() function, with 80% of the data for training and 20% for validation. The training set is used to train our model: the model learns its internal parameters and the correlations between the data points. The validation set is used to validate, or test, the model.
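A sketch of the split, assuming the target column is named Price:

```python
# Separate the target (Price) from the features, then split 80/20.
y = data["Price"]
X = data.drop(columns=["Price"])

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, train_size=0.8, test_size=0.2, random_state=0
)
```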

The next step is to define our model as a function that takes the training and validation data as arguments and makes predictions. We will test the accuracy of the model's predictions with this function.
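One way to write this as a helper function (the model and its parameters are assumptions, since the article leaves the model unspecified):

```python
def score_dataset(X_train, X_valid, y_train, y_valid):
    """Train a regressor and return the mean absolute error on the validation set."""
    model = RandomForestRegressor(n_estimators=100, random_state=0)  # assumed model
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)
```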

Once we have our model defined, we will process the data that we are going to feed it. For this, we will look for all the columns that contain at least one missing value.
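A small sketch of that check:

```python
# Columns in the training data with at least one missing value.
missing_columns = [col for col in X_train.columns if X_train[col].isnull().any()]
print(missing_columns)  # e.g. ['Car', 'BuildingArea', 'YearBuilt']
```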

Let’s preprocess the data!


As we can see, the columns Car, BuildingArea, and YearBuilt contain at least one missing value, and we have stored their names in the list missing_columns. Now let's drop these columns from our dataset and feed the new training and validation data to the model we created earlier.

1. Drop Columns

The simplest approach is to drop every column from the dataset that contains a missing value. However, we might lose a lot of important information this way.
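A sketch of this approach, reusing the missing_columns list and the score_dataset() helper defined above:

```python
# Approach 1: drop every column that contains a missing value.
reduced_X_train = X_train.drop(columns=missing_columns)
reduced_X_valid = X_valid.drop(columns=missing_columns)

print("MAE (drop columns):",
      score_dataset(reduced_X_train, reduced_X_valid, y_train, y_valid))
```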

2. Imputation

Dropping the columns is not a good option when the columns containing missing values are strongly correlated with the target. In such cases, we use imputation, where each missing value is replaced by the mean/median value of that column.

In approach 2 we will use scikit-learn's SimpleImputer, which computes the mean of each column and fills every missing value in that column with it. We will call fit_transform on our training data so that the imputer learns these column means.

Calling the imputer's fit on the training data simply calculates the mean of each training column. Calling transform on the validation data then replaces its missing values with the means that were calculated from the training data.
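A sketch of the imputation step:

```python
# Approach 2: replace missing values with the column means learned from the training data.
imputer = SimpleImputer(strategy="mean")

imputed_X_train = pd.DataFrame(imputer.fit_transform(X_train),
                               columns=X_train.columns, index=X_train.index)
imputed_X_valid = pd.DataFrame(imputer.transform(X_valid),
                               columns=X_valid.columns, index=X_valid.index)

print("MAE (imputation):",
      score_dataset(imputed_X_train, imputed_X_valid, y_train, y_valid))
```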

As we can see, the mean absolute error was significantly reduced after performing the imputation. There are many other parameters that can be fine-tuned to reduce the error even further, but that is out of the scope of this article.

Complete Code
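Since the original code listing is not reproduced here, below is a minimal end-to-end sketch that ties the steps above together. The file name, the RandomForestRegressor model, and the random_state values are assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor  # assumed model choice
from sklearn.metrics import mean_absolute_error
from sklearn.impute import SimpleImputer

# Load the housing dataset (hypothetical path) and keep only numerical columns.
data = pd.read_csv("melb_data.csv")
data = data.select_dtypes(exclude=["object"])
data = data.dropna(subset=["Price"])  # defensive: the target itself must have no NaNs

# Separate the target from the features and split 80/20.
y = data["Price"]
X = data.drop(columns=["Price"])
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, train_size=0.8, test_size=0.2, random_state=0
)

def score_dataset(X_train, X_valid, y_train, y_valid):
    """Train a regressor and return the validation mean absolute error."""
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    return mean_absolute_error(y_valid, model.predict(X_valid))

# Columns with at least one missing value in the training data.
missing_columns = [col for col in X_train.columns if X_train[col].isnull().any()]

# Approach 1: drop the columns that contain missing values.
mae_drop = score_dataset(X_train.drop(columns=missing_columns),
                         X_valid.drop(columns=missing_columns),
                         y_train, y_valid)

# Approach 2: impute missing values with the training-set column means.
imputer = SimpleImputer(strategy="mean")
imputed_X_train = pd.DataFrame(imputer.fit_transform(X_train),
                               columns=X_train.columns, index=X_train.index)
imputed_X_valid = pd.DataFrame(imputer.transform(X_valid),
                               columns=X_valid.columns, index=X_valid.index)
mae_impute = score_dataset(imputed_X_train, imputed_X_valid, y_train, y_valid)

print("MAE (drop columns):", mae_drop)
print("MAE (imputation):  ", mae_impute)
```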

What is next?

In the next tutorial, we will learn how to handle categorical values in our dataset and finally use these handling methods to have a good set of features to feed for our ML models.
