In this tutorial, we will learn how to implement non-linear regression. If the data shows a curved trend, linear regression will not produce accurate results, because, as the name implies, linear regression presumes that the data's behavior is linear.
Parts Required
- Python environment (Spyder, Jupyter, etc.).
Procedure
The following steps are required to complete this tutorial.
Packages Needed
import numpy as np
import matplotlib.pyplot as plt
Though linear regression is very good at solving many problems, it cannot be used for all datasets. First, recall how linear regression models a dataset: it models a linear relation between a dependent variable y and an independent variable x, using a simple equation of degree 1, for example, y = 4𝑥 + 2.
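As a quick sketch of that linear case, here is a plot of y = 4x + 2 following the same NumPy/Matplotlib pattern as the later examples (the noise term is hypothetical, added only to mimic real measurements):
x = np.arange(-5.0, 5.0, 0.1)
# degree-1 (linear) relationship from the example above
y = 4*x + 2
# hypothetical noise, only to make the plot resemble real data
y_noise = 2 * np.random.normal(size=x.size)
ydata = y + y_noise
plt.plot(x, ydata, 'bo')
plt.plot(x, y, 'r')
plt.ylabel('Dependent Variable')
plt.xlabel('Independent Variable')
plt.show()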
Non-linear regression models a relationship between independent variables 𝑥 and a dependent variable 𝑦 using a non-linear function. Essentially, any relationship that is not linear can be termed non-linear, and it is usually represented by a polynomial of degree 𝑘 (the maximum power of 𝑥).
Non-linear functions can have elements like exponentials, logarithms, fractions, and others. For example, 𝑦 = log(𝑥) or 𝑦 = 𝑒^𝑥.
Let’s take a look at a cubic function’s graph:
x = np.arange(-5.0, 5.0, 0.1)
y = 1*(x**3) + 1*(x**2) + 1*x + 3
y_noise = 20 * np.random.normal(size=x.size)
ydata = y + y_noise
plt.plot(x, ydata, 'bo')
plt.plot(x,y, 'r')
plt.ylabel('Dependent Variable')
plt.xlabel('Independent Variable')
plt.show()
This function contains 𝑥³ and 𝑥² terms, and its graph is not a straight line over the 2D plane, so it is a non-linear function.
Some other types of non-linear functions are:
Quadratic
x = np.arange(-5.0, 5.0, 0.1)
y = np.power(x,2)
y_noise = 2 * np.random.normal(size=x.size)
ydata = y + y_noise
plt.plot(x, ydata, 'bo')
plt.plot(x,y, 'r')
plt.ylabel('Dependent Variable')
plt.xlabel('Independent Variable')
plt.show()
Exponential
An exponential function with base c is defined by

y = b·c^x

where b ≠ 0, c > 0, c ≠ 1, and x is any real number. The base, c, is constant, and the exponent, x, is a variable.
X = np.arange(-5.0, 5.0, 0.1)
Y= np.exp(X)
plt.plot(X,Y)
plt.ylabel('Dependent Variable')
plt.xlabel('Independent Variable')
plt.show()
Logarithmic
The response 𝑦 is the result of applying a logarithmic map from the input 𝑥 to the output variable 𝑦. Note that instead of 𝑥, we can use 𝑋, which can be a polynomial representation of the 𝑥's. In its general form, it would be written as

𝑦 = log(𝑥)
# log is undefined for x <= 0, so start the range at a small positive value
X = np.arange(0.1, 5.0, 0.1)
Y = np.log(X)
plt.plot(X,Y)
plt.ylabel('Dependent Variable')
plt.xlabel('Independent Variable')
plt.show()
Sigmoidal/Logistic
X = np.arange(-5.0, 5.0, 0.1)
Y = 1 - 4/(1 + np.power(3, X-2))
plt.plot(X, Y)
plt.ylabel('Dependent Variable')
plt.xlabel('Independent Variable')
plt.show()
Non-Linear Regression Example
In this example, we're going to try to fit a non-linear model to the data points corresponding to China's GDP from 1960 to 2014. The dataset has two columns: the first is a year between 1960 and 2014; the second is China's corresponding annual gross domestic product in US dollars for that year.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
dataset = pd.read_csv("china_gdp_1960.csv")
dataset.head(10)
Plotting the Dataset
This is what the data points look like. The shape resembles either a logistic or an exponential function: the growth starts off slow, then from 2005 onward it becomes very significant, and finally it decelerates slightly in the 2010s.
plt.figure(figsize=(8,5))
x_data, y_data = (dataset["Year"].values, dataset["Value"].values)
plt.plot(x_data, y_data, 'ro')
plt.ylabel('GDP')
plt.xlabel('Year')
plt.show()
Choosing a Model
From an initial look at the plot, we determine that the logistic function could be a good approximation, since it has the property of starting with slow growth, increasing growth in the middle, and then decreasing again at the end as illustrated below:
X = np.arange(-5.0, 5.0, 0.1)
Y = 1.0 / (1.0 + np.exp(-X))
plt.plot(X,Y)
plt.ylabel('Dependent Variable')
plt.xlabel('Independent Variable')
plt.show()
The formula for the logistic function is the following:

ŷ = 1 / (1 + 𝑒^(−𝛽1(𝑥 − 𝛽2)))

where:
𝛽1: Controls the curve’s steepness,
𝛽2: Slides the curve on the x-axis.
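To build intuition for these two parameters, here is a short sketch (the parameter values are arbitrary, chosen only for illustration) that plots the logistic function for a few combinations of 𝛽1 and 𝛽2:
X = np.arange(-5.0, 5.0, 0.1)
# larger Beta_1 -> steeper curve; Beta_2 shifts the midpoint along the x-axis
for b1, b2 in [(1.0, 0.0), (3.0, 0.0), (1.0, 2.0)]:
    Y = 1.0 / (1.0 + np.exp(-b1*(X - b2)))
    plt.plot(X, Y, label='Beta_1=%.1f, Beta_2=%.1f' % (b1, b2))
plt.legend(loc='best')
plt.ylabel('Dependent Variable')
plt.xlabel('Independent Variable')
plt.show()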
Building The Model
Now, let’s build our regression model and initialize its parameters.
def sigmoid(x, Beta_1, Beta_2):
    y = 1 / (1 + np.exp(-Beta_1*(x-Beta_2)))
    return y
Let's look at a sample sigmoid curve that might fit the data:
beta_1 = 0.10
beta_2 = 1990.0
# logistic function
Y_pred = sigmoid(x_data, beta_1, beta_2)
# plot the initial prediction against the data points
plt.plot(x_data, Y_pred*15000000000000.)
plt.plot(x_data, y_data, 'ro')
Our task here is to find the best parameters for our model. Let's first normalize our x and y:
# Let's normalize our data
xdata = x_data/max(x_data)
ydata = y_data/max(y_data)
How do we find the best parameters for our fit line?
We can use curve_fit, which uses non-linear least squares to fit our sigmoid function to the data. It finds optimal values for the parameters so that the sum of the squared residuals of sigmoid(xdata, *popt) - ydata is minimized.
Here, popt contains our optimized parameters.
from scipy.optimize import curve_fit
popt, pcov = curve_fit(sigmoid, xdata, ydata)
# print the final parameters
print(" beta_1 = %f, beta_2 = %f" % (popt[0], popt[1]))
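curve_fit also returns pcov, the estimated covariance matrix of the fitted parameters. A common use of it, described in the SciPy documentation, is to compute one-standard-deviation uncertainties for each parameter:
# one-standard-deviation uncertainty of each fitted parameter
perr = np.sqrt(np.diag(pcov))
print("std. dev. of beta_1 = %f, beta_2 = %f" % (perr[0], perr[1]))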
Now we plot our resulting regression model.
x = np.linspace(1960, 2015, 55)
x = x/max(x)
plt.figure(figsize=(8,5))
y = sigmoid(x, *popt)
plt.plot(xdata, ydata, 'ro', label='data')
plt.plot(x, y, linewidth=3.0, label='fit')
plt.legend(loc='best')
plt.ylabel('GDP')
plt.xlabel('Year')
plt.show()
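Keep in mind that the model was fitted on normalized inputs, so to predict the GDP for an actual year we must apply the same scaling in both directions. A minimal sketch, assuming the normalization step above (the year 2015 is just an example value):
# scale the year the same way the training data was scaled,
# then map the normalized prediction back to US dollars
year = 2015.0
gdp_norm = sigmoid(year/max(x_data), *popt)
print("Predicted GDP for %d: %.3e USD" % (year, gdp_norm*max(y_data)))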
Now, let’s find the accuracy of our model.
# split data into train/test
msk = np.random.rand(len(dataset)) < 0.8
train_x = xdata[msk]
test_x = xdata[~msk]
train_y = ydata[msk]
test_y = ydata[~msk]

# build the model using the train set
popt, pcov = curve_fit(sigmoid, train_x, train_y)

# predict using the test set
y_hat = sigmoid(test_x, *popt)

# evaluation
print("Mean absolute error: %.2f" % np.mean(np.absolute(y_hat - test_y)))
print("Residual sum of squares (MSE): %.2f" % np.mean((y_hat - test_y) ** 2))
from sklearn.metrics import r2_score
print("R2-score: %.2f" % r2_score(test_y, y_hat))