In this tutorial, we learn how to use the scikit-learn library to implement multiple linear regression. The carbon dioxide emissions dataset from the earlier post (Machine Learning [Python] – Linear Regression) is used again to build a model, evaluate it, and use it to predict an unknown value.
What is Multiple Linear Regression?
Multiple linear regression refers to a statistical technique that is used to predict the outcome of a variable based on the value of two or more variables. It is sometimes known simply as multiple regression, and it is an extension of linear regression. The variable that we want to predict is known as the dependent variable, while the variables we use to predict the value of the dependent variable are known as independent or explanatory variables.
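As a compact way to write the definition above, the model can be expressed as a weighted sum of the explanatory variables. The symbols below ($\theta_0$ for the intercept, $\theta_1, \dots, \theta_n$ for the coefficients, $x_1, \dots, x_n$ for the independent variables) are notation assumed here for illustration, not taken from the original post:

```latex
\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n
```

With a single independent variable ($n = 1$) this reduces to simple linear regression, which is why multiple regression is described as its extension.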
- A Python environment (Spyder, Jupyter, etc.).
Following are the steps required to perform this tutorial.
import matplotlib.pyplot as plt
import pandas as pd
import pylab as pl
import numpy as np
dataset = pd.read_csv(r'LinearRegression_Dataset.csv')
dataset.head()
As mentioned before, the fuel consumption dataset will be used, which contains model-specific fuel consumption ratings and estimated carbon dioxide emissions for new light-duty vehicles for retail sale in Canada. [Dataset source]
# Select some features to explore further
cdf = dataset[['ENGINESIZE','CYLINDERS','FUELCONSUMPTION_COMB','CO2EMISSIONS']]
cdf.head(10)
# Plot engine size against emissions to see how linear their relationship is
plt.scatter(cdf.ENGINESIZE, cdf.CO2EMISSIONS, color='blue')
plt.xlabel("Engine size")
plt.ylabel("Emission")
plt.show()
In the post about linear regression, some conclusions have already been presented that can be drawn from the graphical representations and the correlations between the different variables.
Creating Train and Test Dataset
A train/test split divides the dataset into a training set and a testing set that are mutually exclusive. You then train the model on the training set and evaluate it on the testing set. This gives a more accurate estimate of out-of-sample accuracy, because the testing set is not part of the data that has been used to train the model. It is also more realistic for real-world problems: since the model has never seen these data points, it has no knowledge of their outcomes. In essence, it is truly out-of-sample testing.
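The idea can be sketched on a small synthetic table. The DataFrame `demo`, its values, and the names `train_demo` and `test_demo` are made up for illustration; only the column names echo the emissions dataset. A random boolean mask sends roughly 80% of the rows to the training set and the rest to the testing set:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the emissions data (all values are synthetic)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "ENGINESIZE": rng.uniform(1.0, 6.0, size=100),
    "CO2EMISSIONS": rng.uniform(100.0, 400.0, size=100),
})

# Each row lands in the training set with probability of about 0.8
mask = rng.random(len(demo)) < 0.8
train_demo = demo[mask]
test_demo = demo[~mask]

# The two sets share no rows and together cover the whole table
print(len(train_demo), len(test_demo))
```

scikit-learn also provides `train_test_split` in `sklearn.model_selection`, which would be a reasonable alternative to the manual mask.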
Train Data Distribution
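# 80/20 split made before plotting; the fixed seed is an addition so the same rows are picked on every run
np.random.seed(0)
msk = np.random.rand(len(dataset)) < 0.8
train, test = dataset[msk], dataset[~msk]
plt.scatter(train.ENGINESIZE, train.CO2EMISSIONS, color='red')
plt.xlabel("Engine size")
plt.ylabel("Emission")
plt.show()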
Multiple Regression Model
Several variables can help predict CO2 emissions. When more than one independent variable is used, the process is called multiple linear regression. For example, we can predict CO2EMISSIONS from the FUELCONSUMPTION_COMB, ENGINESIZE and CYLINDERS of cars.
from sklearn import linear_model

regr = linear_model.LinearRegression()
x = np.asanyarray(train[['ENGINESIZE','CYLINDERS','FUELCONSUMPTION_COMB']])
y = np.asanyarray(train[['CO2EMISSIONS']])
regr.fit(x, y)
# The coefficients
print('Coefficients: ', regr.coef_)
The coefficients and the intercept are the parameters of the fitted hyperplane. Because this is a multiple linear regression with three features, there are four parameters to estimate: one coefficient per feature plus the intercept. Scikit-learn estimates them from our data using the plain Ordinary Least Squares method.
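To make the role of these parameters concrete, here is a minimal sketch on synthetic data. The feature values, the "true" coefficients, and the sample `x_new` are all invented for illustration. It fits a model with three features and then checks that a prediction is simply the intercept plus the coefficient-weighted sum of the feature values:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: three features with known (made-up) coefficients and intercept
rng = np.random.default_rng(1)
X = rng.uniform(0.0, 10.0, size=(200, 3))
true_coef = np.array([10.0, 7.0, 9.5])
true_intercept = 65.0
y = X @ true_coef + true_intercept + rng.normal(0.0, 1.0, size=200)

model = LinearRegression().fit(X, y)

# Predict an unseen (hypothetical) sample by hand and with predict()
x_new = np.array([[2.4, 4.0, 9.2]])
manual = model.intercept_ + x_new @ model.coef_
print(manual, model.predict(x_new))
```

Because `y` is one-dimensional here, `coef_` has one entry per feature and `intercept_` is a single number; with a two-dimensional target, as in the tutorial code, both gain an extra dimension.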
Ordinary Least Squares (OLS)
OLS is a method for estimating the unknown parameters in a linear regression model. OLS chooses the parameters of a linear function of a set of explanatory variables by minimizing the sum of the squares of the differences between the target-dependent variable and those predicted by the linear function. In other words, it tries to minimize the sum of squared errors (SSE) or mean squared error (MSE) between the target variable (y) and our predicted output (𝑦̂) over all samples in the dataset.
OLS can find the best parameters using one of the following methods:
- Solving the model parameters analytically using closed-form equations;
- Using an optimization algorithm (Gradient Descent, Stochastic Gradient Descent, Newton’s Method, etc.).
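Both approaches can be compared on a small synthetic problem. This is a sketch under assumptions: the data, the learning rate, and the iteration count are chosen for illustration only, and `np.linalg.lstsq` stands in for the closed-form solution. A column of ones is added so that the intercept is learned as the first parameter:

```python
import numpy as np

# Synthetic, roughly standardized features keep gradient descent well behaved
rng = np.random.default_rng(2)
n = 300
X = rng.normal(size=(n, 3))
y = X @ np.array([3.0, -2.0, 0.5]) + 4.0 + rng.normal(0.0, 0.1, size=n)

# Column of ones so theta[0] plays the role of the intercept
A = np.column_stack([np.ones(n), X])

# 1) Analytical solution of the least squares problem
theta_closed, *_ = np.linalg.lstsq(A, y, rcond=None)

# 2) Batch gradient descent on the mean squared error
theta_gd = np.zeros(A.shape[1])
learning_rate = 0.1
for _ in range(2000):
    gradient = (2.0 / n) * A.T @ (A @ theta_gd - y)
    theta_gd -= learning_rate * gradient

print(theta_closed)
print(theta_gd)
```

On a problem this small the two estimates agree closely. The analytical route is usually preferred when the data fits in memory, while iterative methods become attractive for very large datasets.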
# Seeded 80/20 split (the seed is an addition here): the same seed always selects
# the same rows, so the test rows stay separate from the rows used for training
np.random.seed(0)
msk = np.random.rand(len(dataset)) < 0.8
train = dataset[msk]
test = dataset[~msk]

x = np.asanyarray(test[['ENGINESIZE','CYLINDERS','FUELCONSUMPTION_COMB']])
y = np.asanyarray(test[['CO2EMISSIONS']])
y_hat = regr.predict(x)

# The mean of the squared residuals is the mean squared error (MSE)
print("Mean squared error: %.2f" % np.mean((y_hat - y) ** 2))
# Variance score: 1 is perfect prediction
print('Variance score: %.2f' % regr.score(x, y))
Variance Regression Score:
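If 𝑦̂ is the estimated target output, y the corresponding (correct) target output, and Var is the variance (the square of the standard deviation), then the explained variance is estimated as follows:

$$\text{explained variance}(y, \hat{y}) = 1 - \frac{\mathrm{Var}(y - \hat{y})}{\mathrm{Var}(y)}$$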
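The best possible score is 1.0; lower values are worse. Note that regr.score returns the coefficient of determination R², which coincides with the explained variance when the residuals average to zero.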
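The score can be checked by hand on a few numbers. The values of `y_true` and `y_pred` below are invented for illustration. The sketch computes the explained variance and R² manually and compares them with scikit-learn's `explained_variance_score` and `r2_score`:

```python
import numpy as np
from sklearn.metrics import explained_variance_score, r2_score

# Made-up targets and predictions; the residuals do not average to zero
y_true = np.array([3.0, 5.0, 7.5, 10.0, 12.0])
y_pred = np.array([2.5, 5.5, 7.0, 11.0, 13.0])

residuals = y_true - y_pred

# Explained variance: 1 - Var(y - y_hat) / Var(y)
ev_manual = 1.0 - np.var(residuals) / np.var(y_true)

# R^2: 1 - (sum of squared residuals) / (total sum of squares)
r2_manual = 1.0 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(ev_manual, explained_variance_score(y_true, y_pred))
print(r2_manual, r2_score(y_true, y_pred))
```

The two metrics differ here only because the predictions are biased on average; for a linear regression with an intercept, the training residuals average to zero, so the two scores match on the training data.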
Now let’s fit a multiple linear regression on the same dataset, but this time using FUELCONSUMPTION_CITY and FUELCONSUMPTION_HWY instead of FUELCONSUMPTION_COMB. This means that we now use four features, to see whether that results in better accuracy.
regr = linear_model.LinearRegression()
x = np.asanyarray(train[['ENGINESIZE','CYLINDERS','FUELCONSUMPTION_CITY','FUELCONSUMPTION_HWY']])
y = np.asanyarray(train[['CO2EMISSIONS']])
regr.fit(x, y)
print('Coefficients: ', regr.coef_)

x = np.asanyarray(test[['ENGINESIZE','CYLINDERS','FUELCONSUMPTION_CITY','FUELCONSUMPTION_HWY']])
y = np.asanyarray(test[['CO2EMISSIONS']])
y_ = regr.predict(x)
# The mean of the squared residuals is the mean squared error (MSE)
print("Mean squared error: %.2f" % np.mean((y_ - y) ** 2))
print('Variance score: %.2f' % regr.score(x, y))
In this case, although the variance score is the same, the mean of the squared residuals is lower, which means that this model is slightly more accurate on the test set.