32 Linear Regression: Simple Linear Regression Model with the Least Squares Method
Dr Deependra Sharma
Learning Objectives
After completing this module, students will be able to:
1. Clearly define the meaning of simple linear regression.
2. Differentiate between correlation and regression, and state the advantages of simple linear regression.
3. Understand the types of regression models and their assumptions.
4. Establish the simple linear regression equation by the least squares method and determine the values of the regression coefficients.
1. Introduction
The correlation coefficient and covariance describe a linear relationship between variables, but they cannot say anything about the causal relationship; in other words, with the correlation coefficient and covariance we cannot say which variable is a cause and which is an effect. Through regression analysis we try to establish the causal relationship between the variables. The term regression literally means ‘moving backwards’.
Hence, with the help of linear regression analysis we establish a causal relationship between linearly related variables. In a causal relationship there is a response variable that is influenced by one or more other variables, called explanatory variables. Scholars also use different names for both: the response variable is also called the dependent, predicted or explained variable, while explanatory variables are also called independent, predictor or control variables.
In a regression model we try to establish the complete cause and effect relationship between the variables, but in most cases the response variable does not depend on only one explanatory variable; there may be several variables with a direct or indirect influence on the response variable. To understand this better, take the example of a fast-moving consumer goods (FMCG) manufacturing company that wants to research buying preferences for a new product it is about to launch. Suppose it collects data from housewives, on the assumption that they are the decision makers for all products used in the kitchen. In reality, the preferences of the children in a family may affect the purchase, and so may many other factors, such as the husband's preference, the family income, or the influence of opinion leaders. Hence there can be many explanatory variables that influence the response variable.
2. Difference between Correlation and Regression
In statistics, correlation and regression are both used to describe the relationship between variables. Correlation tells the degree of relationship between the variables, whereas regression goes one step further and describes the cause and effect relationship between them. In other words, correlation measures the degree to which two variables are related, whereas regression is a method of describing how one variable depends on the other. The basic differences between correlation and regression are listed below:
- A statistical measure that establishes the degree of relationship between two variables is called correlation, whereas regression estimates the value of one variable, called the dependent variable, for a given value of another variable, called the independent variable.
- Correlation represents the linear relationship between the two variables, whereas regression fits the best line and estimates one variable on the basis of the other.
- Correlation does not distinguish between dependent and independent variables and is symmetrical in nature: if x and y are the two variables, there is no difference between the correlation of x with y and of y with x. In contrast, the regression of y on x is different from the regression of x on y.
- Correlation indicates the strength of the relationship between the two variables, whereas regression measures the impact of a unit change in the independent variable on the dependent variable.
- Correlation produces a single numerical value expressing the relationship between the two variables, unlike regression, which predicts the value of the dependent variable for a given value of the independent variable.
3. Advantages of Regression Analysis
The following are the advantages of regression analysis:
- Establishes the relationship between variables – Regression analysis establishes the relationship between the response (dependent) variable and the explanatory (independent) variable.
- Determines the error – Regression analysis measures the standard error of estimate to quantify variability. Ideally the regression line would fit all the observed values of x and y exactly, in which case the standard error of estimate would be zero, but this hardly ever happens. When all the points fall on or very close to the line, the regression relationship is called a good fit.
- Suits a large sample size – For a large sample (n ≥ 30), interval estimation for predicting the value of the dependent variable based on the standard error of estimate is considered acceptable. The magnitude of r² remains the same regardless of which variable is treated as x and which as y.
- Predicting the future – Regression analysis predicts the future value of the response variable from a given value of the explanatory variable.
4. Types of Regression Models
There are two types of regression model: the simple regression model, in which the response variable depends on only one explanatory variable, and the multiple regression model, in which the response variable is influenced by more than one explanatory variable.
In those cases where the response variable depends exactly on one explanatory variable, with no error, the relationship between the variables is called deterministic. For example: y = 1.5x.
Here the response variable y is completely determined by the explanatory variable x and no error is allowed while predicting values of y. Such relationships are often seen in the physical sciences.
But usually the relationship between the explanatory and response variables is inexact, i.e. stochastic. This is due to the omission of relevant factors, sometimes immeasurable ones, that influence the response variable.
In the simple regression model we study the variability of a response variable that depends on only one explanatory variable. The fundamental assumption of the simple regression model is that the expected value of y lies on a straight line. If the response variable is denoted by y and the explanatory variable by x, then
E(y) = β0 + β1x
where β0 and β1 are the unknown intercept and slope parameters respectively.
In the simple linear regression model, β0 + β1x is the deterministic component of the model, which gives the expected value of y for a given value of x.
If the slope parameter β1 is positive (β1 > 0), the relationship between x and y is positive; if β1 is negative (β1 < 0), the relationship is negative; and if β1 = 0 there is no linear relationship between x and y.
Graphically, the positive, negative and no-relationship cases of the linear regression model correspond to regression lines with positive slope, negative slope and zero slope respectively.
As discussed before, the actual value of the response variable may differ from its expected value; hence we add ε (epsilon), a random error term, to the deterministic component. The simple linear regression model is thus defined as
y = β0 + β1x + ε
where y and x are the dependent and independent variables respectively and ε is the random error.
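As a quick illustration of this stochastic model, the following Python sketch (assuming NumPy is available) generates data from y = β0 + β1x + ε; the parameter values and the noise level are assumptions chosen only for the example, not taken from the module.

```python
import numpy as np

# Illustrative sketch: beta0, beta1 and the noise level are assumed values.
rng = np.random.default_rng(42)

beta0, beta1 = 2.0, 1.5                 # deterministic component: E(y) = beta0 + beta1*x
x = np.linspace(0, 10, 50)              # fixed values of the explanatory variable
epsilon = rng.normal(0.0, 1.0, x.size)  # random error term: mean 0, constant spread

y = beta0 + beta1 * x + epsilon         # stochastic model: y = beta0 + beta1*x + epsilon
```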
5. Assumptions of the Simple Linear Regression Model
1. There should be a linear relationship between the two variables x and y, where y is the dependent variable and x is the independent variable. This relationship is described by the linear regression equation
y = β0 + β1x + ε
where ε represents the difference between the actual value and the expected value of the response variable y for a given value of the explanatory variable x.
2. The values of the response variable y for a given value of the explanatory variable x are normally distributed, and the means of these normally distributed values fall on the line of regression.
3. The dependent variable y is a continuous random variable, whereas the values of the independent variable x are fixed and not random.
4. The sampling error associated with the expected value of the response variable is assumed to be an independent random variable, normally distributed with constant standard deviation. The amount of error in the value of the response variable may be different in successive observations.
5. The standard deviation and variance of the values of the response variable about the regression line are constant for all values of the explanatory variable.
6. Regression is not symmetrical in the two variables: the response variable y and the explanatory variable x cannot be interchanged in the same regression equation.
6. Estimation: The Least Squares Method
This method is also known as the ordinary least squares (OLS) method. The OLS method identifies the line that best fits the given data. This is called the ‘line of best fit’ and is determined by identifying, out of all possible lines, the one that results in the least difference between the observed data points and the line.
Fig. 1 indicates that whenever a straight line is drawn through the data, there will be some variation between the line values and the actual observed values. Here, one is interested in the vertical differences between this line and the actual data. The line is used to make predictions about values of Y (the dependent variable) from different values of X (the independent variable). In the context of regression, these differences are known as residuals rather than deviations (though both are essentially the same).
Fig. 1 represents a scatter plot of data in which a line shows the general tendency. The vertical arrows represent the gaps (differences, or residuals) between the actual data and the line.
As with the mean, values of the variables fall both above and below the line, resulting in positive as well as negative differences. Thus, if these positive and negative differences are simply added, they will cancel each other out. To overcome this, the differences are squared before summing them. The sum of squared differences offers an estimate of how well a particular line fits the data: if the squared differences are large, the line is not representative of the data, but if they are small, the line is a good representative.
Thus, to find the ‘line of best fit’ we could compute the sum of squared differences (SS) for all possible lines for the given data and compare them; the line with the least SS is the line of best fit. In practice this tedious process need not be followed, because the OLS method achieves the same result with the mathematical technique for finding maxima and minima (differential calculus). This procedure finds the line that minimizes the sum of squared differences, and this line of best fit is the regression line.
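To make the ‘line of best fit’ idea concrete, here is a minimal Python sketch (assuming NumPy) that compares the sum of squared differences for a few arbitrary candidate lines against the OLS line, using the five (X, Y) pairs from Illustration 2 later in this module.

```python
import numpy as np

# Data taken from Illustration 2 later in this module.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 5.0, 3.0, 8.0, 7.0])

def sum_of_squares(a, b):
    """Sum of squared vertical differences (residuals) for the line y = a + b*x."""
    residuals = y - (a + b * x)
    return float(np.sum(residuals ** 2))

# Compare two arbitrary candidate lines with the OLS line (a = 1.1, b = 1.3).
for a, b in [(0.0, 1.5), (2.0, 1.0), (1.1, 1.3)]:
    print(f"a = {a}, b = {b}, SS = {sum_of_squares(a, b):.2f}")
# The OLS line gives the smallest SS (9.10, versus 10.75 and 10.00 here).
```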
7. Mathematical Explanation
Suppose a sample of n pairs of observations (x1, y1), (x2, y2), …, (xn, yn) is taken from the population under study in order to estimate the values of the regression coefficients β0 and β1. The estimated values of β0 and β1 should result in a straight line such that most pairs of observations fall very close to it. Such a straight line is referred to as the ‘best fitted’ (least squares or estimated) regression line.
Rewriting the model equation for the i-th observation as follows:
yi= β0 + β1xi + ei
or ei = yi – (β0 + β1xi)
To minimize the sum of squared errors
L = ∑ ei² = ∑ {yi – (β0 + β1xi)}², where the sums run over i = 1, 2, …, n,
let b0 and b1 be the least squares estimators of β0 and β1 respectively. Differentiating L with respect to β0 and β1, setting the derivatives equal to zero, and simplifying, we get
∑yi = nb0 + b1∑xi
∑xiyi = b0∑xi + b1∑xi²
These equations are called the least squares normal equations. The values of the least squares estimators b0 and b1 can be obtained by solving them simultaneously. Hence the fitted or estimated regression line is given by
ŷi = b0 + b1xi
We use this equation to estimate the expected value of y for a given value of x. Since the expected value may differ from the actual value, we take the difference between the actual and expected values; this difference is generally represented by the residual e.
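The normal equations have the standard closed-form solution b1 = ∑(xi – x̄)(yi – ȳ) / ∑(xi – x̄)² and b0 = ȳ – b1x̄. The following is a minimal Python sketch of this calculation (assuming NumPy; least_squares_fit is a hypothetical helper name, not part of any library).

```python
import numpy as np

def least_squares_fit(x, y):
    """Least squares estimates (b0, b1) for the line y = b0 + b1*x.

    Uses the standard closed-form solution of the normal equations:
        b1 = sum((xi - x_bar) * (yi - y_bar)) / sum((xi - x_bar) ** 2)
        b0 = y_bar - b1 * x_bar
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    return b0, b1

# Example: recover the line of Illustration 2 (Y = 1.10 + 1.30X).
print(least_squares_fit([1, 2, 3, 4, 5], [2, 5, 3, 8, 7]))  # -> (1.1, 1.3)
```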
8. Regression Coefficients
To estimate the values of the population parameters β0 and β1, the estimated simple linear regression equation is
ŷ = b0 + b1x
where ŷ (read ‘y hat’) is the estimated average value of the response variable y for a given value of the explanatory variable x, b0 (also written a) is the y-intercept, i.e. the average value of y when x = 0, and b1 (also written b) is the slope of the regression line, which represents the expected change in the value of y for a unit change in the value of x.
a and b are also called the intercept and the regression coefficient respectively. To determine the value of y at any given value of x we must first calculate a and b. Once we have their values, we can easily determine the value of the response variable y for any given value of the explanatory variable x.
The regression coefficient b is also denoted as:
byx, the regression coefficient of y on x, for the regression line y = a + bx
bxy, the regression coefficient of x on y, for the regression line x = a + by
Calculation of a and b
For solving numerical problems in regression analysis we need to calculate the intercept a and the regression coefficient b.
With a little algebra and differential calculus it can be shown that the following two equations, when solved simultaneously, yield the values of the parameters a and b that fulfil the least squares requirement:
∑Y = Na + b∑X
∑XY = a∑X + b∑X²
These equations are usually called the normal equations; N is the number of observed pairs of values.
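Since the normal equations form a 2×2 linear system in a and b, they can also be solved directly with linear algebra. A minimal Python sketch (assuming NumPy; solve_normal_equations is a hypothetical helper name):

```python
import numpy as np

def solve_normal_equations(x, y):
    """Solve the normal equations
        sum(Y)  = N*a + b*sum(X)
        sum(XY) = a*sum(X) + b*sum(X^2)
    as a 2x2 linear system for the intercept a and slope b."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    coeffs = np.array([[x.size,  x.sum()],
                       [x.sum(), np.sum(x ** 2)]])
    rhs = np.array([y.sum(), np.sum(x * y)])
    a, b = np.linalg.solve(coeffs, rhs)
    return a, b

# Example: the normal equations of Illustration 2 give a = 1.1, b = 1.3.
print(solve_normal_equations([1, 2, 3, 4, 5], [2, 5, 3, 8, 7]))
```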
9. Properties of Regression Coefficients
1. The correlation coefficient is the geometric mean of the two regression coefficients byx and bxy, i.e. |r| = √(byx × bxy), with r taking the common sign of the two coefficients.
2. If one regression coefficient is greater than one, the other must be less than one, because their product equals r², and the value of the correlation coefficient cannot exceed one.
3. Both regression coefficients must have the same sign, either positive or negative; opposite signs are impossible, since their product equals r², which cannot be negative.
4. The correlation coefficient and the two regression coefficients all have the same sign.
5. When the regression coefficients byx and bxy are positive, their arithmetic mean is greater than or equal to the correlation coefficient, because the arithmetic mean of two positive numbers is at least their geometric mean.
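Properties 1, 3 and 5 can be checked numerically. A minimal Python sketch (assuming NumPy), using the data of Illustration 2 below as a sample:

```python
import numpy as np

# Sample data (from Illustration 2), used only to check the properties.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 5.0, 3.0, 8.0, 7.0])

sxy = np.sum((x - x.mean()) * (y - y.mean()))
b_yx = sxy / np.sum((x - x.mean()) ** 2)   # regression coefficient of y on x
b_xy = sxy / np.sum((y - y.mean()) ** 2)   # regression coefficient of x on y
r = np.corrcoef(x, y)[0, 1]                # correlation coefficient

print(abs(r), np.sqrt(b_yx * b_xy))        # property 1: |r| equals sqrt(byx * bxy)
print(np.sign(b_yx) == np.sign(b_xy))      # property 3: both have the same sign
print((b_yx + b_xy) / 2 >= abs(r))         # property 5: arithmetic mean >= |r|
```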
10. Limitations of the Simple Linear Regression Model
- With the simple linear regression model we can only examine the relationship between two variables.
- In many instances, we believe that more than one independent variable is correlated with the dependent variable.
- Multiple linear regression provides a tool that allows us to examine the relationship between two or more explanatory variables and a response variable.
11. Self-Check Questions
Illustration 1:
Five randomly selected students took a mathematics aptitude test, and their statistics grades were also recorded.
In the table below, the X column shows the scores on the aptitude test and the Y column shows the statistics grades.
Student | X | Y |
1 | 95 | 85 |
2 | 85 | 95 |
3 | 80 | 70 |
4 | 70 | 65 |
5 | 60 | 70 |
SUM | 390 | 385 |
- Find the best-fit regression line based on the math aptitude test.
- If a student scored 80 marks on the aptitude test, estimate his grade in statistics.
Answer:
i) The linear regression equation of Y on X is ŷ = a + bx.
Step 1: To find the best-fit regression line we need to solve for a and b, and the normal equations are:
∑Y = Na + b∑X
∑XY = a∑X + b∑X²
Step 2: Calculate the values of X², Y² and XY
Student | X | Y | X² | Y² | XY |
1 | 95 | 85 | 9025 | 7225 | 8075 |
2 | 85 | 95 | 7225 | 9025 | 8075 |
3 | 80 | 70 | 6400 | 4900 | 5600 |
4 | 70 | 65 | 4900 | 4225 | 4550 |
5 | 60 | 70 | 3600 | 4900 | 4200 |
SUM | 390 | 385 | 31150 | 30275 | 30500 |
Step 3: Substitute the values into the normal equations:
385 = 5a + 390b
30500 = 390a + 31150b
Step 4: After solving these equations we get:
a=26.768
b= 0.644
Therefore, the regression equation is: ŷ = 26.768 + 0.644x.
ii) Now that we have the linear regression equation relating the two variables, the aptitude test score and the statistics grade, we can predict the statistics grade for any given aptitude test score by substituting the value into the fitted regression equation:
ŷ = 26.768 + 0.644x = 26.768 + 0.644 × 80 = 26.768 + 51.52 = 78.288
Caution: it is not recommended to use values of the independent variable that lie outside the range of values used to create the equation. That is called extrapolation, and it can produce unreasonable estimates.
In this example, the aptitude test scores used to create the regression equation ranged from 60 to 95. Therefore, only values inside that range should be used to estimate statistics grades; using values outside that range (less than 60 or greater than 95) is problematic.
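As a cross-check of this worked answer, a minimal Python sketch (assuming NumPy) fits the same line with np.polyfit:

```python
import numpy as np

# Data from Illustration 1: aptitude test scores (X) and statistics grades (Y).
x = np.array([95.0, 85.0, 80.0, 70.0, 60.0])
y = np.array([85.0, 95.0, 70.0, 65.0, 70.0])

# np.polyfit with degree 1 performs an ordinary least squares straight-line fit
# and returns the coefficients highest power first: [slope, intercept].
b, a = np.polyfit(x, y, deg=1)
print(f"y_hat = {a:.3f} + {b:.3f}x")  # ~ y_hat = 26.78 + 0.644x (the module's
                                      # a = 26.768 comes from rounding b first)
print(a + b * 80)                     # predicted grade for X = 80, about 78.29
```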
Illustration 2:
Calculate the regression equations of Y on X and of X on Y from the following table:
X | 1 | 2 | 3 | 4 | 5 |
Y | 2 | 5 | 3 | 8 | 7 |
Answer: Calculate the values of X², Y², XY and the sums ∑X², ∑Y² and ∑XY
X | Y | X² | Y² | XY |
1 | 2 | 1 | 4 | 2 |
2 | 5 | 4 | 25 | 10 |
3 | 3 | 9 | 9 | 9 |
4 | 8 | 16 | 64 | 32 |
5 | 7 | 25 | 49 | 35 |
∑X = 15 | ∑Y = 25 | ∑X² = 55 | ∑Y² = 151 | ∑XY = 88 |
i) Step 1: For the regression equation of Y on X, Y = a + bX, the normal equations are:
∑Y = Na + b∑X
∑XY = a∑X + b∑X²
Step 2: Put the values from the table into the equations:
25 = 5a +15b
88 = 15a + 55b
Step 3: After solving both the equations we get-
a= 1.10 and b = 1.3
Step 4: Hence the required regression equation of Y on X is given by-
Y = 1.10 + 1.30X
ii) For the regression equation of X on Y, X = a + bY, the normal equations are:
∑X = Na + b∑Y
∑XY = a∑Y + b∑Y²
Step 1: Substituting the values we get –
15= 5a + 25b
88 = 25a + 151b
Step 2: After solving both the equations we get-
a= 0.5 and b = 0.5
Step 3: Hence the required equation of X on Y is –
X = 0.5 + 0.5Y
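Both regression lines of this illustration can be reproduced with the same least squares formula, simply swapping the roles of the variables for the regression of X on Y. A minimal Python sketch (assuming NumPy; fit_line is a hypothetical helper name):

```python
import numpy as np

# Data from Illustration 2.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 5.0, 3.0, 8.0, 7.0])

def fit_line(u, v):
    """Least squares line v = a + b*u; returns (a, b)."""
    b = np.sum((u - u.mean()) * (v - v.mean())) / np.sum((u - u.mean()) ** 2)
    return v.mean() - b * u.mean(), b

print(fit_line(x, y))  # regression of Y on X: (1.1, 1.3), i.e. Y = 1.10 + 1.30X
print(fit_line(y, x))  # regression of X on Y: (0.5, 0.5), i.e. X = 0.50 + 0.50Y
```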
Illustration 3:
Research has found that the demand for automobiles in a city depends mainly, if not entirely, on the number of families living in that city. The table below shows the sales of automobiles in five cities for the year 2003 and the number of families living in those cities.
City | No. of families in lakhs (X) | Sales of automobiles in 000s (Y) |
A | 6 | 9 |
B | 2 | 11 |
C | 10 | 5 |
D | 4 | 8 |
E | 8 | 7 |
Find the best-fit linear regression equation and calculate the sales for the year 2006 for city A, which is estimated to have 9 lakh families, assuming that the same relationship holds. Also find the errors and the standard error of estimate between the estimated and actual values of Y.
Answer:
Regression equation of Y on X is Y = a + bX
To determine the values of a and b, we shall solve the normal equation-
∑Y = Na + b∑X
∑XY = a∑X + b∑X²
X | Y | XY | X² |
6 | 9 | 54 | 36 |
2 | 11 | 22 | 4 |
10 | 5 | 50 | 100 |
4 | 8 | 32 | 16 |
8 | 7 | 56 | 64 |
∑X = 30 | ∑Y = 40 | ∑XY = 214 | ∑X² = 220 |
i) Step 1: Substituting the values into the normal equations:
40 = 5a + 30b
214 = 30a + 220b
Step 2: After solving both the equations we get
a = 11.9 and b = -0.65
and the regression equation is
Y = 11.9 - 0.65X
ii) If X = 9:
Y = 11.9 - 0.65 × 9 = 6.05
Hence the estimated sales for the year 2006 for city A, which is estimated to have 9 lakh families (assuming the same relationship holds), are 6.05 thousand, i.e. about 6050 automobiles.
iii) The errors (residuals) are the differences e = Y – Ŷ between the actual and the estimated values of Y:
X | Y | Ŷ = 11.9 - 0.65X | e = Y – Ŷ |
6 | 9 | 8.00 | 1.00 |
2 | 11 | 10.60 | 0.40 |
10 | 5 | 5.40 | -0.40 |
4 | 8 | 9.30 | -1.30 |
8 | 7 | 6.70 | 0.30 |
The residuals sum to zero, as they must for a least squares fit, and ∑e² = 3.10. The standard error of estimate is Se = √(∑e²/(n – 2)) = √(3.10/3) ≈ 1.017.
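The residuals and the standard error of estimate in part iii can be reproduced with a minimal Python sketch (assuming NumPy):

```python
import numpy as np

# Data from Illustration 3: families in lakhs (X) and sales in thousands (Y).
x = np.array([6.0, 2.0, 10.0, 4.0, 8.0])
y = np.array([9.0, 11.0, 5.0, 8.0, 7.0])

y_hat = 11.9 - 0.65 * x                      # fitted values from Y = 11.9 - 0.65X
e = y - y_hat                                # residuals (errors); they sum to zero
se = np.sqrt(np.sum(e ** 2) / (x.size - 2))  # standard error of estimate, sqrt(SSE/(n-2))

print(e)    # [ 1.   0.4 -0.4 -1.3  0.3]
print(se)   # about 1.017
```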
12. Summary
To summarize this module: through correlation we establish the degree of relationship between two variables, whereas through regression analysis we try to establish a cause and effect relationship between them. Regression analysis establishes a linear relationship between an independent variable whose value is known, normally called the explanatory variable, and a dependent variable whose value is unknown, normally called the response variable.
Through regression analysis we not only predict the future value of the response variable for any given value of the explanatory variable but also determine the error between the actual and the estimated values of the response variable.
A linear regression equation that depends on only one independent variable is called a simple regression equation and is represented by the model y = β0 + β1x + ε, estimated as ŷ = b0 + b1x; its extension to more than one independent variable is the multiple regression equation. An ideal best-fit regression line would pass through every observed point of the dependent and independent variables, in which case every residual would be zero.
Learn More:
- Sharma, J. K. (2014), Business Statistics, S. Chand & Company, New Delhi.
- Bajpai, N. (2010), Business Statistics, Pearson, New Delhi.
- Hastie, T., Tibshirani, R., and Friedman, J. (2009), The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd Edition, Springer.
- Huff, D. (2010), How to Lie with Statistics, W. W. Norton.
- Gupta, K. R. (2012), Practical Statistics, Atlantic Publishers & Distributors (P) Ltd., New Delhi.