In this tutorial, we will learn how to use the scikit-learn library to implement Simple Linear Regression.
- A Python environment (Spyder, Jupyter, etc.).
The following steps are required to complete this tutorial.
import matplotlib.pyplot as plt
import pandas as pd
import pylab as pl
import numpy as np
Linear Regression Dataset Example
dataset = pd.read_csv(r'LinearRegression_Dataset.csv')
dataset.head()
The imported dataset about fuel consumption contains model-specific fuel consumption ratings and estimated carbon dioxide emissions for new light-duty vehicles for retail sale in Canada. [Dataset Source]
# summarize the data
dataset.describe()

# some features to explore more
cdf = dataset[['ENGINESIZE', 'CYLINDERS', 'FUELCONSUMPTION_COMB', 'CO2EMISSIONS']]
cdf.head(10)

# plot the features
viz = cdf[['CYLINDERS', 'ENGINESIZE', 'CO2EMISSIONS', 'FUELCONSUMPTION_COMB']]
viz.hist()
plt.show()
These histograms give a general overview of the data we will be working with.
# plot fuel consumption vs. emissions to see how linear their relationship is
plt.scatter(cdf.FUELCONSUMPTION_COMB, cdf.CO2EMISSIONS, color='green')
plt.xlabel("FUELCONSUMPTION_COMB")
plt.ylabel("Emission")
plt.show()
This graph shows an increasing, roughly linear relationship between fuel consumption and emissions. The points appear to fall into three different “zones”, each with its own slope. Either way, we can conclude that when fuel consumption increases, emissions also increase.
# plot engine size vs. emissions to see how linear their relationship is
plt.scatter(cdf.ENGINESIZE, cdf.CO2EMISSIONS, color='blue')
plt.xlabel("Engine size")
plt.ylabel("Emission")
plt.show()
These are just two examples, but we can do the same for all features to understand the correlations between them!
For a more in-depth study of the correlations, the Pearson and Spearman correlation coefficients can be computed.
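As a minimal sketch of such a study, pandas can compute both coefficients through DataFrame.corr. The small table below is made-up stand-in data that only reuses the tutorial's column names; with the real data, the same calls would be made on cdf.

```python
import pandas as pd

# Hypothetical stand-in rows; only the column names come from the tutorial.
sample = pd.DataFrame({
    'ENGINESIZE': [1.5, 2.0, 2.4, 3.0, 3.5, 5.0],
    'CYLINDERS': [4, 4, 4, 6, 6, 8],
    'FUELCONSUMPTION_COMB': [7.0, 8.5, 9.0, 11.0, 11.5, 15.0],
    'CO2EMISSIONS': [160, 196, 210, 250, 265, 345],
})

# Pearson measures linear association between each pair of columns.
pearson = sample.corr(method='pearson')

# Spearman measures monotonic association, computed on ranks.
spearman = sample.corr(method='spearman')

# Correlation of every feature with the emissions column.
print(pearson['CO2EMISSIONS'])
print(spearman['CO2EMISSIONS'])
```

Values close to 1 or -1 indicate a strong relationship, while values near 0 indicate a weak one. Spearman is the more forgiving of the two when the relationship is monotonic but not strictly linear.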
Train and Test Dataset
A train/test split divides the dataset into separate training and testing sets. This provides a more accurate evaluation of out-of-sample accuracy because the testing set is not part of the data that was used to train the model.
This means that we know the outcome of each data point in this dataset, making it great to test with! And since this data has not been used to train the model, the model has no knowledge of the outcome of these data points. So, in essence, it is truly out-of-sample testing.
Let’s split our dataset into train and test sets, 80% of the entire data for training, and 20% for testing.
We create a mask to select random rows using the np.random.rand() function.
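A minimal sketch of this mask-based split is shown below. The DataFrame here is a made-up stand-in for cdf so that the example runs on its own, and the mask name msk and the fixed seed are assumptions; the names train and test match the ones used in the rest of the tutorial.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the tutorial's cdf DataFrame (100 made-up rows).
engine = np.linspace(1.0, 6.0, 100)
cdf = pd.DataFrame({
    'ENGINESIZE': engine,
    'CO2EMISSIONS': 125.0 + 39.0 * engine,  # made-up linear relation
})

np.random.seed(0)  # fixed seed so the example is repeatable (an assumption)

# One uniform random number per row; True for roughly 80% of the rows.
msk = np.random.rand(len(cdf)) < 0.8

train = cdf[msk]    # rows where the mask is True
test = cdf[~msk]    # the remaining rows

print(len(train), len(test))
```

Because each row is assigned independently, the split is only approximately 80/20, and the exact sizes change from run to run unless the seed is fixed.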
Simple Regression Model
Linear Regression fits a linear model with coefficients B = (B1, …, Bn) that minimizes the residual sum of squares between the actual value y in the dataset and the predicted value yhat given by the linear approximation.
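Written out, the quantity being minimized looks as follows. The notation is an assumption added for illustration: B_0 denotes the intercept, x_{ij} the j-th feature of the i-th sample, and m the number of samples.

```latex
\hat{y}_i = B_0 + \sum_{j=1}^{n} B_j x_{ij},
\qquad
\mathrm{RSS}(B) = \sum_{i=1}^{m} \left( y_i - \hat{y}_i \right)^2
```

In simple linear regression there is a single feature, so n = 1 and the model reduces to a line with intercept B_0 and slope B_1.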
Train Data Distribution
plt.scatter(train.ENGINESIZE, train.CO2EMISSIONS, color='blue')
plt.xlabel("Engine size")
plt.ylabel("Emission")
plt.show()
We use the sklearn package to model the data.
from sklearn import linear_model
regr = linear_model.LinearRegression()
train_x = np.asanyarray(train[['ENGINESIZE']])
train_y = np.asanyarray(train[['CO2EMISSIONS']])
regr.fit(train_x, train_y)
# The coefficients
print('Coefficients: ', regr.coef_)
print('Intercept: ', regr.intercept_)
The coefficient and intercept in simple linear regression are the parameters of the fitted line. Because simple linear regression has only two parameters, the slope and the intercept of the line, sklearn can estimate them directly from our data. Note that all of the training data must be available to compute these parameters.
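For a single feature, the least-squares slope and intercept have a closed form, which is one way to see why they can be estimated directly from the data. The sketch below uses made-up numbers and plain NumPy; it illustrates the formula and is not a description of how sklearn computes its solution internally.

```python
import numpy as np

# Made-up points lying close to a straight line (illustrative only).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.0])

# Closed-form least-squares estimates for simple linear regression:
#   slope     = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
#   intercept = mean_y - slope * mean_x
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

# np.polyfit with degree 1 solves the same least-squares problem,
# so it serves as a cross-check.
ref_slope, ref_intercept = np.polyfit(x, y, 1)

print(slope, intercept)
print(ref_slope, ref_intercept)
```

Both means are taken over every sample, which is why the full training set has to be available before the parameters can be computed.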
plt.scatter(train.ENGINESIZE, train.CO2EMISSIONS, color='blue')
plt.plot(train_x, regr.coef_*train_x + regr.intercept_, '-r')
plt.xlabel("Engine size")
plt.ylabel("Emission")
plt.show()
As we can see, the trend line (in red) slopes upward. The data (in blue) grows in discrete steps, since there are jumps between the engine size values, but in general emissions increase as the engine size increases.
To evaluate the accuracy of a regression model, we compare the actual values with the predicted values. Evaluation metrics play a key role in the development of a model, as they provide insight into areas that require improvement.
There are different model evaluation metrics. Let's use MSE here to calculate the accuracy of our model based on the test set. Common metrics include:
- Mean Absolute Error: It is the mean of the absolute value of the errors. This is the easiest of the metrics to understand since it’s just an average error.
- Mean Squared Error (MSE): MSE is the mean of the squared error. It’s more popular than Mean Absolute Error because it puts more weight on large errors: squaring an error makes large errors count disproportionately more than small ones.
- Root Mean Squared Error (RMSE): It is the square root of the MSE, which puts the error back in the same units as the target.
- R-squared is not an error but is a popular metric for the accuracy of the model. It represents how close the data are to the fitted regression line. The higher the R-squared, the better the model fits your data. The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse).
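To make the definitions above concrete, here is a small sketch that computes all four metrics with NumPy. The arrays y_true and y_pred are made-up values, unrelated to the fuel consumption data.

```python
import numpy as np

# Made-up actual and predicted values (illustrative only).
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])

errors = y_pred - y_true            # per-sample errors: [-1, 0, 1, 2]

mae = np.mean(np.abs(errors))       # Mean Absolute Error
mse = np.mean(errors ** 2)          # Mean Squared Error
rmse = np.sqrt(mse)                 # Root Mean Squared Error

# R-squared: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print("MAE: %.2f  MSE: %.2f  RMSE: %.2f  R2: %.2f" % (mae, mse, rmse, r2))
```

Note how the single error of 2 contributes 4 of the 6 total squared error, which is the sense in which MSE emphasizes large errors.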
from sklearn.metrics import r2_score

test_x = np.asanyarray(test[['ENGINESIZE']])
test_y = np.asanyarray(test[['CO2EMISSIONS']])
test_y_ = regr.predict(test_x)

print("Mean absolute error: %.2f" % np.mean(np.absolute(test_y_ - test_y)))
print("Mean squared error (MSE): %.2f" % np.mean((test_y_ - test_y) ** 2))
print("R2-score: %.2f" % r2_score(test_y, test_y_))
Advantages of Linear Regression
- Simple implementation;
- Performs well on datasets where the relationship is close to linear;
- Overfitting can be reduced by regularization.
Disadvantages of Linear Regression
- Prone to underfitting;
- Sensitive to outliers;
- Assumes that the observations are independent of each other.
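The sensitivity to outliers can be illustrated with a small sketch. The data is made up: ten points on the exact line y = 2x + 1, then the same points with one value replaced by an extreme one. np.polyfit with degree 1 is used here as the least-squares fit.

```python
import numpy as np

x = np.arange(10, dtype=float)
y_clean = 2.0 * x + 1.0             # points exactly on y = 2x + 1

y_outlier = y_clean.copy()
y_outlier[-1] = 100.0               # replace the last value (19) with an outlier

# Fit a straight line to each version of the data.
clean_slope, clean_intercept = np.polyfit(x, y_clean, 1)
out_slope, out_intercept = np.polyfit(x, y_outlier, 1)

print("clean fit:   slope=%.2f intercept=%.2f" % (clean_slope, clean_intercept))
print("outlier fit: slope=%.2f intercept=%.2f" % (out_slope, out_intercept))
```

A single changed point out of ten is enough to pull the fitted slope far away from 2, because the squared error gives that point a very large weight.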