Linear Regression Introduction

Madhudeepa Jois
6 min read · Mar 10, 2021

Regression is a type of supervised machine learning model. There are two types of Linear Regression models:

  • Simple Linear Regression: one predictor variable (independent variable) and one output variable (target/dependent variable)
  • Multiple Linear Regression: more than one predictor variable and one target variable

We try to fit a straight line to a bunch of data points, and we use this straight line to predict values.

Let’s look at the equation of a straight line:

  • y = mx + c

As we know, y is the target variable, m is the slope, c is the intercept and x is the independent variable.

For more than one independent variable, we can write it as:

  • y = m1x1 + m2x2 + … + mnxn + c
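For example, with a slope m = 2 and an intercept c = 1, an input of x = 3 gives a predicted value y = 2(3) + 1 = 7.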

Best fit line:

  • Residual: it is the difference between the actual value y and the predicted value ŷ. The residual is basically the error.
  • A residual can be written as: ei = yi − ŷi
  • Ordinary Least Squares (OLS) is the method that fits the line by minimizing the sum of all squared residuals.
  • Choose c and m so as to minimize the Residual Sum of Squares (RSS).
  • RSS = (y1 − c − mx1)² + … + (yn − c − mxn)²
  • One way to minimize the cost function is to differentiate it and solve; another way is iteration: start with some b0 and b1 and update them after every iteration such that the cost function decreases.
  • Gradient Descent is an iterative method to minimize the cost function.
  • After determining the best b0 and b1, we need to assess how well the best fit line represents the scatter plot and how well it predicts new data.
  • Residual Standard Error (RSE): it helps to measure the lack of fit of a model. The closeness of the estimated b0 and b1 to the true values can be assessed with RSE.
  • RSS is an absolute quantity and depends on the scale of y, so it is better to use R², which normalizes RSS by TSS.
  • The metrics used to judge the best fit in relative terms are TSS and R².
  • Total Sum of Squares (TSS): the sum of squared differences between each data point of y and the mean of y.
  • R-squared measures how well the model explains the variance: R² = 1 − RSS/TSS. The higher the value of R², the better the model. R² lies between 0 and 1 (see the short sketch after this list).
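As a minimal sketch of these quantities (using NumPy, with small made-up data and an assumed slope m and intercept c), RSS, TSS and R² can be computed directly from the residuals:

import numpy as np

x = np.array([1, 2, 3, 4, 5])            # made-up example data
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
m, c = 1.95, 0.15                        # assumed (already estimated) slope and intercept

y_pred = m * x + c                       # predictions from the fitted line
residuals = y - y_pred                   # e_i = y_i - yhat_i

rss = np.sum(residuals ** 2)             # Residual Sum of Squares
tss = np.sum((y - y.mean()) ** 2)        # Total Sum of Squares
r_squared = 1 - rss / tss                # R² = 1 - RSS/TSS
print(rss, tss, r_squared)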

Simple Linear Regression:

Assumptions of Simple Linear Regression:

  • The target variable and the independent variables are linearly related.
  • We are making use of a sample from a population to draw inferences about the population, so this assumption alone is not enough to generalize the results; we also need assumptions about the error terms, listed below.

Important Assumptions:

  • There is a linear relationship between X and Y. We can check this by plotting a scatter plot (see the sketch after this list).
  • Error terms are normally distributed with mean 0.
  • Error terms are independent of each other.
  • Error terms have constant variance (Homoscedastic).
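A minimal sketch of the usual visual checks for these assumptions; df, residuals and y_pred are placeholder names for a pandas DataFrame with columns 'x' and 'y' and the residuals/predictions of an already fitted model:

import matplotlib.pyplot as plt
import seaborn as sns

# Linearity: X vs Y should look roughly like a straight-line pattern
sns.scatterplot(x='x', y='y', data=df)
plt.show()

# Normality of error terms: residuals should be roughly normal, centred at 0
sns.histplot(residuals, kde=True)
plt.show()

# Constant variance / independence: residuals vs predictions should show no pattern
plt.scatter(y_pred, residuals)
plt.axhline(0, color='red')
plt.show()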

Hypothesis testing in Linear Regression:

This is to test the significance of the derived beta coefficients. The fitted line is of no use if the data is scattered too much and R² is very low. Hence we need to test the significance of the slope every time we fit a line.

If there is no relationship between X and Y, then the slope will be 0, i.e. b1 is zero.

  • Null hypothesis(H0) : b1 = 0
  • Alternate Hypothesis(H1) : b1 not equal to 0

If we fail to reject the null hypothesis, then b1 is insignificant and we cannot conclude that there is a linear relationship between X and Y.

First we compute the t-score, given by

  • t = (x̄ − μ) / (s / √n)

where x̄ is the sample mean, μ is the population mean and s is the sample standard deviation; s/√n is also known as the standard error.

In the next step, calculate the p-value corresponding to this t-score.

Make the decision on the basis of the p-value for the estimated b1: if the p-value turns out to be less than 0.05, we can reject the null hypothesis and state that b1 is significant.
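A small sketch of this slope test (not the exact code from this post): in regression the t-score for b1 is typically b1 divided by its standard error, and the p-value comes from a t-distribution with n − 2 degrees of freedom. Here b1, se_b1 and n are assumed to have been computed already:

from scipy import stats

t_score = b1 / se_b1                                       # estimated slope / its standard error
p_value = 2 * (1 - stats.t.cdf(abs(t_score), df=n - 2))    # two-tailed p-value
if p_value < 0.05:
    print("Reject H0: b1 is significant")
else:
    print("Fail to reject H0: b1 is insignificant")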

  • F-statistic and Prob(F-statistic): instead of taking each beta and testing its significance separately, this tests the overall significance of the model. It is used to check whether the model fit is just by chance.
  • If prob(F-statistics) is less than 0.05 then the model fit is significant.

Building a linear model:

  • Split train-test data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.7, test_size = 0.3, random_state = 100)
  • Fit the model
#using Statsmodels
import statsmodels.api as sm
X_train_sm = sm.add_constant(X_train)   # add the intercept term
lr = sm.OLS(y_train, X_train_sm)        # fit on the data with the constant added
lr_model = lr.fit()
#using sklearn
from sklearn.linear_model import LinearRegression

lm = LinearRegression() # Create a linear regression object
lm.fit(X_train, y_train)
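Once fitted, the coefficients and the statistics discussed above (p-values, R², F-statistic) can be inspected, and predictions made on the test set. A short sketch continuing from the objects created above:

# Statsmodels: full summary with coefficients, p-values, R² and F-statistic
print(lr_model.summary())

# sklearn: intercept and coefficients only
print(lm.intercept_, lm.coef_)

# Predictions on the test data (add the constant for the statsmodels model)
X_test_sm = sm.add_constant(X_test)
y_test_pred = lr_model.predict(X_test_sm)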

Multiple Linear Regression

One independent variable is often not sufficient to build the best model.

Multiple linear regression can be written as:

  • Y=β0+β1X1+β2X2+…+βnXn+ϵ
  • β0 is the intercept; each βi is the change in the expected (mean) value of y per unit increase in Xi when the other predictors are held constant.

The model fits a hyperplane instead of a line. All the other assumptions of simple linear regression still hold.

Considerations:

  • The model may overfit as it becomes more complex: high train accuracy but low test accuracy, i.e. it can fail to generalize.
  • Multicollinearity: association between predictor variables.
  • Feature selection becomes really important.

Multicollinearity:

Multicollinearity refers to the phenomenon of having related predictor variables in the input data set. It affects interpretation and inference.

  • Interpretation: does “change in Y when all others are held constant” still apply?
  • Inference: coefficients swing widely, signs can invert, and p-values are no longer reliable.

Detecting multicollinearity:

  • Scatter plots
  • Heatmap
  • Pairplots

Two ways to deal with multicollinearity:

  • Look at pairwise correlations
  • Variance Inflation Factor (VIF): an independent variable may depend on a combination of two or more other variables, in which case looking at pairwise correlations alone would not catch it. VIF measures how well one independent variable is explained by all the other independent variables combined.

The VIF of the i-th variable is VIFi = 1 / (1 − Ri²), where Ri² comes from regressing the i-th independent variable on all the other independent variables (i.e. treating it as a linear combination of them).

If the VIF of a variable is high, that variable should be dropped. A common rule of thumb is that if the VIF is less than 5, there is no need to drop that variable.

Methods to eliminate multicollinearity:

  • Variable transformation - Principal Component Analysis
  • Creating new variable- Adding features derived from existing variables(Interaction features)
  • Dropping variable

Categorical variables:

For a categorical variable with n levels, create n − 1 dummy variables.
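A small sketch with pandas, assuming a hypothetical categorical column 'fuel_type' in a DataFrame df:

import pandas as pd

# drop_first=True keeps n-1 dummy columns for a column with n levels
dummies = pd.get_dummies(df['fuel_type'], drop_first=True)
df = pd.concat([df.drop('fuel_type', axis=1), dummies], axis=1)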

Feature Scaling:

Features are scaled for ease of interpretation and for faster convergence of gradient descent. Accuracy and p-values will not change, but the coefficients do change.

Some of the methods for scaling:

  • Standardizing: standardization rescales the data to have mean 0 and standard deviation 1: x = (x − mean(x)) / sd(x)
  • Min-Max scaling (normalizing): brings all the data into the range 0–1: x = (x − min(x)) / (max(x) − min(x)) (see the sketch after this list)
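A minimal sketch using sklearn's scalers; num_cols is an assumed list of numeric column names. Fit on the training data only, then apply the same transformation to the test data:

from sklearn.preprocessing import MinMaxScaler, StandardScaler

scaler = MinMaxScaler()                                      # or StandardScaler() for standardizing
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])  # learn min/max from training data
X_test[num_cols] = scaler.transform(X_test[num_cols])        # reuse the same scaling on test data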

Bias vs Variance tradeoff:

The question is whether to make the model more complex and more accurate, with more variance and a risk of overfitting, or to keep it simpler at the cost of some accuracy. We have to select an optimally balanced model, taking into account the number of variables the model uses. Penalizing models for using too many features works well.

Using Adjusted R² and AIC:

A lower AIC and a higher Adjusted R² indicate a better model.

Feature Selection:

Building all 2^p possible models is not feasible for p features. Instead: build the model with all the features, drop the features that are the least helpful in prediction (high p-value), drop the features that are redundant (using correlations and VIF), then rebuild the model and repeat.

This is difficult to do manually when there are many features. Automated approaches are available for this:

  • Select the top n features - RFE (Recursive Feature Elimination)
  • Backward/Forward/Step wise selection based on AIC
  • Regularization(Lasso)
#RFE
from sklearn.feature_selection import RFE
rfe = RFE(lm, n_features_to_select=10)   # keep the top 10 features
rfe = rfe.fit(X_train,y_train)
list(zip(X_train.columns,rfe.support_,rfe.ranking_))
#VIF
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.DataFrame()
X = X_train_3   # the predictor set being checked
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by='VIF', ascending=False)

Residual Analysis:

from sklearn.metrics import r2_score
r2_score(y_test,y_test_pred)
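Beyond R² on the test set, the error terms themselves should be checked against the assumptions listed earlier. A sketch assuming the statsmodels objects created above:

import seaborn as sns
import matplotlib.pyplot as plt

y_train_pred = lr_model.predict(X_train_sm)   # predictions on the training data
residuals = y_train - y_train_pred

# Error terms should be roughly normally distributed and centred around 0
sns.histplot(residuals, kde=True)
plt.xlabel('Residuals')
plt.show()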
