In this tutorial, we will learn about Polynomial Regression: how to transform your feature set and then use Multiple Linear Regression to solve the problem.
- Python interpreter (Spyder, Jupyter, etc.).
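Please follow this tutorial up to this point, because we will use the same dataset:
# np (NumPy) and cdf (the selected columns) come from the earlier tutorial
# Random mask: roughly 80% of the rows go to train, the rest to test
msk = np.random.rand(len(cdf)) < 0.8
train = cdf[msk]
test = cdf[~msk]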
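The random-mask split can be tried in isolation. The sketch below is only an illustration: `toy` is a hypothetical stand-in for `cdf` with made-up values, and the seeded generator is an assumption added for reproducibility. It shows that every row lands in exactly one subset, and that the 80/20 ratio is only approximate.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the tutorial's `cdf` DataFrame (values are made up).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "ENGINESIZE": rng.uniform(1.0, 6.0, size=100),
    "CO2EMISSIONS": rng.uniform(100.0, 400.0, size=100),
})

# One uniform draw per row; a draw below 0.8 sends the row to the training set.
mask = rng.random(len(toy)) < 0.8
toy_train = toy[mask]
toy_test = toy[~mask]

# The two subsets partition the rows; their sizes are close to 80/20, not exact.
print(len(toy_train), len(toy_test))
```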
Sometimes the trend in the data is not linear but curved. In this case, we can use polynomial regression. Many forms of regression exist, such as quadratic and cubic, and the form can be chosen to match the shape of the dataset.
In essence, we can call all of these polynomial regression, where the relationship between the independent variable x and the dependent variable y is modeled as an nth degree polynomial in x. For example, a polynomial regression of degree 2 has the form:
𝑦 = 𝑏 + 𝜃1 𝑥 + 𝜃2 𝑥^2
where 𝑏 is the intercept and 𝜃1, 𝜃2 are the coefficients.
But the question is: how can we fit our data to this equation when we only have x values, such as Engine Size? We can create a few additional features: 1, 𝑥, and 𝑥^2.
The PolynomialFeatures() class in the Scikit-learn library derives a new feature set from the original one. That is, it generates a matrix consisting of all polynomial combinations of the features with degree less than or equal to the specified degree. For example, let's say the original feature set has only one feature, ENGINESIZE. If we set the degree of the polynomial to 2, it generates 3 features: degree=0, degree=1 and degree=2:
from sklearn.preprocessing import PolynomialFeatures
from sklearn import linear_model
train_x = np.asanyarray(train[['ENGINESIZE']])
train_y = np.asanyarray(train[['CO2EMISSIONS']])
test_x = np.asanyarray(test[['ENGINESIZE']])
test_y = np.asanyarray(test[['CO2EMISSIONS']])
poly = PolynomialFeatures(degree=2)
train_x_poly = poly.fit_transform(train_x)
train_x_poly
fit_transform takes our x values and outputs a matrix in which each row holds one value raised to the powers 0, 1 and 2 (since we set the degree of our polynomial to 2).
In other words, each input value 𝑣 is expanded into the row [1, 𝑣, 𝑣^2]. For example, an engine size of 2.0 becomes [1, 2.0, 4.0].
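To make the expansion concrete, here is a minimal sketch on three made-up engine sizes (the values are assumptions, not taken from the dataset). It compares the output of PolynomialFeatures with the same matrix built by hand.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Made-up engine sizes, one feature per row (shape: 3 x 1).
x = np.array([[2.0], [3.0], [4.0]])

expanded = PolynomialFeatures(degree=2).fit_transform(x)

# The same matrix built by hand: columns are x^0, x^1 and x^2.
manual = np.column_stack([np.ones(len(x)), x[:, 0], x[:, 0] ** 2])

print(expanded)  # rows are [1, 2, 4], [1, 3, 9] and [1, 4, 16]
print(np.allclose(expanded, manual))  # True
```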
This looks like a feature set for multiple linear regression. Indeed, polynomial regression is a special case of linear regression; the main idea lies in how you select your features. Just consider replacing 𝑥 with 𝑥1, 𝑥^2 with 𝑥2, and so on. Then the degree 2 equation turns into:
𝑦 = 𝑏 + 𝜃1 𝑥1 + 𝜃2 𝑥2
Now it can be treated as a linear regression problem: the model is still linear in its parameters, even though it is nonlinear in 𝑥. So, you can use the same mechanism as multiple linear regression to solve such problems.
So we can use the LinearRegression() class to solve it:
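A quick way to convince yourself of this is to fit a linear model on expanded features of data whose true curve is known. The sketch below uses synthetic, noise-free data generated from y = 1 + 2x + 3x^2 (an assumption chosen for illustration) and checks that ordinary linear regression recovers those numbers.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Synthetic, noise-free data from a known quadratic: y = 1 + 2x + 3x^2.
x = np.linspace(-3.0, 3.0, 50).reshape(-1, 1)
y = 1.0 + 2.0 * x[:, 0] + 3.0 * x[:, 0] ** 2

# include_bias=False drops the constant column; LinearRegression
# estimates the intercept on its own.
features = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x)
model = LinearRegression().fit(features, y)

print(model.intercept_)  # close to 1.0
print(model.coef_)       # close to [2.0, 3.0]
```

Nothing in the fitting step knows about polynomials; the curvature comes entirely from the extra x^2 column.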
clf = linear_model.LinearRegression()
# fit() returns the fitted estimator itself, so there is no need to store the result
clf.fit(train_x_poly, train_y)
# The coefficients
print('Coefficients: ', clf.coef_)
print('Intercept: ', clf.intercept_)
As mentioned in this tutorial, the coefficients and the intercept are the parameters of the fitted curve. Given that it is a typical multiple linear regression with 3 parameters, and knowing that the parameters are the intercept and coefficients of the hyperplane, sklearn has estimated them from our new feature set. So, let's plot it:
plt.scatter(train.ENGINESIZE, train.CO2EMISSIONS, color='black')
XX = np.arange(0.0, 10.0, 0.1)
# coef_ has shape (1, 3): index 1 is the x term, index 2 is the x^2 term
yy = clf.intercept_[0] + clf.coef_[0][1]*XX + clf.coef_[0][2]*np.power(XX, 2)
plt.plot(XX, yy, '-r')
plt.xlabel("Engine size")
plt.ylabel("Emission")
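When the target is passed as a column vector, as train_y is here, scikit-learn returns coef_ as a 2-D array and intercept_ as a 1-D array, so the individual parameters have to be indexed out before the curve can be evaluated by hand. A minimal sketch on synthetic data (the quadratic and its numbers are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data; the target is a column vector (n x 1), like train_y above.
x = np.linspace(1.0, 6.0, 40).reshape(-1, 1)
y = 100.0 + 50.0 * x - 2.0 * x ** 2

x_poly = PolynomialFeatures(degree=2).fit_transform(x)  # columns: 1, x, x^2
model = LinearRegression().fit(x_poly, y)

print(model.coef_.shape)       # (1, 3): one row per target, one column per feature
print(model.intercept_.shape)  # (1,)

# The constant column gets a coefficient of (numerically) zero here, because
# LinearRegression fits the intercept separately.
b = model.intercept_[0]
theta1, theta2 = model.coef_[0][1], model.coef_[0][2]

# Evaluate the fitted curve by hand on a grid of x values.
grid = np.arange(0.0, 10.0, 0.1)
curve = b + theta1 * grid + theta2 * grid ** 2
```

Multiplying the whole coef_ array by the grid would broadcast to the wrong shape (or fail outright), which is why each term is picked out by index.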
Next, we evaluate the model on the test set:
from sklearn.metrics import r2_score
# Reuse the transformer fitted on the training data; do not refit it on the test set
test_x_poly = poly.transform(test_x)
test_y_ = clf.predict(test_x_poly)
print("Mean absolute error: %.2f" % np.mean(np.absolute(test_y_ - test_y)))
print("Mean squared error (MSE): %.2f" % np.mean((test_y_ - test_y) ** 2))
print("R2-score: %.2f" % r2_score(test_y, test_y_))
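The three metrics are easy to mix up, so here is a small worked example on made-up numbers (the values are assumptions, not results from the dataset). Note that the mean of the squared residuals is the mean squared error (MSE), while the residual sum of squares (RSS) is the same quantity without the averaging.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up true values and predictions, for illustration only.
y_true = np.array([200.0, 250.0, 300.0, 350.0])
y_pred = np.array([210.0, 240.0, 310.0, 330.0])

residuals = y_true - y_pred

mae = np.mean(np.abs(residuals))  # mean absolute error
mse = np.mean(residuals ** 2)     # mean squared error
rss = np.sum(residuals ** 2)      # residual sum of squares
r2 = 1.0 - rss / np.sum((y_true - y_true.mean()) ** 2)

print(mae, mse, rss)  # 12.5 175.0 700.0
print(round(r2, 3))   # 0.944

# The scikit-learn helpers give the same numbers.
print(np.isclose(mae, mean_absolute_error(y_true, y_pred)))  # True
print(np.isclose(mse, mean_squared_error(y_true, y_pred)))   # True
print(np.isclose(r2, r2_score(y_true, y_pred)))              # True
```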
Here we will use a polynomial regression but this time with degree three (cubic). Does it result in better accuracy?
poly3 = PolynomialFeatures(degree=3)
train_x_poly3 = poly3.fit_transform(train_x)
clf3 = linear_model.LinearRegression()
clf3.fit(train_x_poly3, train_y)
# The coefficients
print('Coefficients: ', clf3.coef_)
print('Intercept: ', clf3.intercept_)
plt.scatter(train.ENGINESIZE, train.CO2EMISSIONS, color='blue')
XX = np.arange(0.0, 10.0, 0.1)
# coef_ has shape (1, 4): indices 1, 2 and 3 are the x, x^2 and x^3 terms
yy = (clf3.intercept_[0] + clf3.coef_[0][1]*XX
      + clf3.coef_[0][2]*np.power(XX, 2) + clf3.coef_[0][3]*np.power(XX, 3))
plt.plot(XX, yy, '-r')
plt.xlabel("Engine size")
plt.ylabel("Emission")
# Reuse the transformer fitted on the training data
test_x_poly3 = poly3.transform(test_x)
test_y3_ = clf3.predict(test_x_poly3)
print("Mean absolute error: %.2f" % np.mean(np.absolute(test_y3_ - test_y)))
print("Mean squared error (MSE): %.2f" % np.mean((test_y3_ - test_y) ** 2))
print("R2-score: %.2f" % r2_score(test_y, test_y3_))
Comparing the evaluation metrics, we can see that no significant improvement has occurred, which suggests that this may not yet be the best way to analyze this dataset. It would be interesting to carry out an analysis using ANN (artificial neural network) or SVM (support vector machine) regression, for example, to see whether the performance improves.
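The comparison between degrees can also be written as a loop. The sketch below is one possible way to do it, not the method used above: it runs on synthetic data (a made-up curved trend plus noise, standing in for the engine-size data) and uses scikit-learn's make_pipeline to chain the feature expansion and the regression.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for the engine-size data: a gently curved trend plus noise.
rng = np.random.default_rng(42)
x = rng.uniform(1.0, 8.0, size=(300, 1))
y = 120.0 + 55.0 * x[:, 0] - 2.5 * x[:, 0] ** 2 + rng.normal(0.0, 15.0, size=300)

# Same style of random-mask split as in the tutorial (roughly 80/20).
mask = rng.random(len(x)) < 0.8
x_train, y_train = x[mask], y[mask]
x_test, y_test = x[~mask], y[~mask]

scores = {}
for degree in (1, 2, 3):
    # The pipeline fits the feature expansion on the training data only
    # and reuses it unchanged when predicting on the test data.
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    model.fit(x_train, y_train)
    scores[degree] = r2_score(y_test, model.predict(x_test))

for degree, score in scores.items():
    print(f"degree {degree}: R2 = {score:.3f}")
```

Running the same loop on train_x, train_y, test_x and test_y from this tutorial would put the degree 2 and degree 3 scores side by side.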
 IBM – Machine Learning with Python – A Practical Introduction
 Udemy – The Data Science Course 2020: Complete Data Science Bootcamp – 365 Careers
 Udemy – Machine Learning and Data Science (Python)