17 Introduction to Bivariate Linear Regression Analysis
Prof. Aslam Mahmood
(1) E-Text:
Linear Regression Analysis
In any given system, some variables do not vary on their own; their behaviour depends on the outcome of other variables. For example, the agricultural production of an area will depend on its average annual rainfall along with some other factors, whereas the agricultural production will not affect the average annual rainfall. Understanding such cause-and-effect relationships in a system has been the basic concern of scientific enquiry. Understanding the cause-and-effect relationship of a phenomenon helps us understand its nature, control it, and make predictions about its future.
Empirical Relationship
The study of any relationship has two parts: theoretical and empirical. Theoretical forms of relationship are based on the logical form of the inter-relationships among different variables. Empirical relationships are based on the co-variations of the values of these variables. These co-variations can be verified from real-world data by plotting the values on a scatter plot. The basic objective of observing an empirical relationship is to validate or invalidate an existing form of a theoretical relationship. Any empirical analysis therefore requires a theoretical framework; an empirical analysis without a theory would be a futile exercise.
The variables which affect the outcome of the values of a variable are known as “independent variables”, and the variable which is being affected by the other variables is known as the “dependent variable”. In the above example, agricultural production is the dependent variable and the average annual rainfall of the area over different years is the independent variable. Note that in some other exercise, in meteorology for example, the average annual rainfall may become the dependent variable and atmospheric pressure, temperature, etc. may become the independent variables. Statistical methods of studying relationships through correlation and/or regression are empirical methods only and are not a substitute for the theory behind these relationships.
The relationship between the values of two variables can easily be shown in a scatter plot by plotting the values of the dependent variable on the Y-axis and the values of the independent variable on the X-axis of a graph paper. In a scatter plot we observe the degree and direction of co-variation between the values of the two variables, and we can measure its strength quantitatively through either Karl Pearson’s or Spearman’s coefficient of correlation.
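As a minimal sketch (Python code is not part of the original module), Karl Pearson’s coefficient of correlation can be computed directly from the sums of the values, using the rainfall and productivity data of the worked example later in this module:

```python
import math

# Average annual rainfall (cm) and wheat productivity (00 kg/ha)
# from the worked example in this module
X = [40, 45, 50, 50, 60, 60, 65, 70, 71, 75]
Y = [5, 13, 11, 15, 20, 13, 18, 20, 25, 25]

def pearson_r(x, y):
    """Karl Pearson's coefficient of correlation from raw sums."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    num = n * sxy - sx * sy
    den = math.sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))
    return num / den

r = pearson_r(X, Y)
print(round(r, 3))  # about 0.908: a strong positive correlation
```

The positive value of r close to 1 agrees with the upward-sloping scatter described in the worked example.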
Bivariate Linear Regression Analysis
Knowledge of the degree and direction of the relationship between the dependent and independent variables alone is not sufficient in a cause-and-effect analysis unless it also gives a mathematical form of the relationship between the two variables. Through the mathematical form we can evaluate the effect of different independent variables (also known as determinants) on the dependent variable with the help of their coefficients, and we can also make future projections of the dependent variable for projected future values of the independent variables.
A visible relationship between the values of the dependent and independent variables can be converted into the nearest line or curve of a prescribed mathematical form, which is the basis of regression analysis. The simplest of all these mathematical forms is the equation of a straight line, Y = a + bX. On a straight line all the points fall on a line with either an upward or a downward slope, indicated by the value of “b”, also known as the regression coefficient. When extended sufficiently, the line will cut the Y-axis and the X-axis. The value at which the line cuts the Y-axis is “a”, known as the intercept (the value of Y when X = 0).
The magnitude of the regression coefficient “b” indicates the change in Y for a unit change in X, or simply the rate of change of Y with respect to X. A negative sign of “b” shows a decline in Y with respect to X, while a positive value of “b” indicates a positive change in the values of Y with a change in the values of X. From a given set of bivariate data for several observations, linear regression analysis provides the basis for finding the mathematical form of relationship (Y = a + bX) closest to the related scatter plot, with the help of the principle of least squares. Since we use the equation of a line to approximate the form of relationship between the two variables, the analysis is known as linear regression analysis. If we have only two variables to analyse, i.e. one dependent and one independent variable, the regression is known as bivariate linear regression. If the number of independent variables is two or more, it is known as multiple linear regression analysis. The explanation of regression analysis is greatly facilitated if we start from bivariate linear regression and then extend these concepts to multiple linear regression.
Principle of Least Squares: Bivariate Case
From a scatter plot we can identify a straight line by fixing the values of a and b. However, most data from the social sciences will not fall exactly on a straight line. On a scatter plot of such data the points may cluster around a straight line, with some deviations (errors) on either side of the line, represented by ε. If the values of a and b are identified and we substitute an actually given value X of the independent variable, the equation gives an estimated value Ŷ of the corresponding dependent variable. The difference between the actual value of Y and the estimated value, (Y − Ŷ) = ε, is the error term. The total error in any exercise will be the sum of all such deviations. Simply summing these errors, however, has the problem that negative errors cancel out positive errors, which hides the inefficiency of the estimates. It is therefore suggested to take the sum of the squares of the errors, i.e. ∑(Y − Ŷ)². Theoretically, in the absence of any guideline, we could select any number of lines, and each one of them would give a sum of squared errors. Statistical theory has developed the “principle of least squares”, which suggests that out of all such lines we choose the one which gives the least value of the above sum of squares. Fitting a regression line from a given set of correlated bivariate data thus amounts to identifying the values of the regression coefficient “b” and the intercept “a” of the line of best fit. Using calculus, it can be shown that the straight line whose values of “b” and “a” are calculated in the following manner gives this least sum of squares. The regression coefficient will be:

b = ( n∑XY − ∑X ∑Y ) / ( n∑X² − (∑X)² )

and the intercept will be:

a = ȳ − b x̄
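A minimal Python sketch (not part of the original text) of these least-squares formulas, applied to the rainfall and productivity data of the worked example below:

```python
# Least-squares estimates from the normal-equation formulas:
#   b = (n*SumXY - SumX*SumY) / (n*SumX2 - (SumX)**2),   a = ybar - b*xbar
X = [40, 45, 50, 50, 60, 60, 65, 70, 71, 75]   # rainfall (cm)
Y = [5, 13, 11, 15, 20, 13, 18, 20, 25, 25]    # productivity (00 kg/ha)

n = len(X)
sx, sy = sum(X), sum(Y)
sxy = sum(x * y for x, y in zip(X, Y))
sxx = sum(x * x for x in X)

b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)  # regression coefficient
a = sy / n - b * sx / n                        # intercept
print(round(b, 4), round(a, 3))                # about 0.4826 and -11.781
```

These are exactly the values obtained by hand in the worked example that follows.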
A regression line with the above values of the regression coefficient “b” and the intercept “a” is known as the least-squares regression line, and this principle is known as the principle of least squares. The line is known as the regression line of Y on X. Similarly, if we choose X as the dependent variable (and Y as the independent variable) and minimise the sum of squares parallel to the X-axis, the line is known as the regression line of X on Y. The word “regression” here is used to negate progress on its own: it implies that the movement of Y does not progress on its own but is regressed on the movement of X.
The values of the regression coefficient “b” and the intercept “a” indicate the average position rather than the actual one, as there are positive and negative deviations associated with the individual points.
Example
Consider the following values of wheat productivity Y (in 00 kg/hectare) and average annual rainfall X (in cm) for ten areas of a region. We wish to test the hypothesis that wheat productivity in the region depends on the average annual rainfall of the area, and to find the rate of change in wheat production for a unit change in average annual rainfall.
Solution
To show the relationship between wheat productivity Y and average annual rainfall X graphically, we prepare a scatter plot of the data given in the table, as shown below.
It is clear from the above graph that there exists a positive relationship between the two variables. To fit a regression line, we use the principle of least squares and find the values of the regression coefficient “b” and the intercept “a” using the above formulas, which require the following table of calculations.
Table 1: Computations required for regression analysis.
X | Y | X² | Y² | XY
40 | 5 | 1600 | 25 | 200
45 | 13 | 2025 | 169 | 585
50 | 11 | 2500 | 121 | 550
50 | 15 | 2500 | 225 | 750
60 | 20 | 3600 | 400 | 1200
60 | 13 | 3600 | 169 | 780
65 | 18 | 4225 | 324 | 1170
70 | 20 | 4900 | 400 | 1400
71 | 25 | 5041 | 625 | 1775
75 | 25 | 5625 | 625 | 1875
Total | 586 | 165 | 35616 | 3083 | 10285
The value of the regression coefficient “b” is found to be 0.4826 (00 kg/ha per cm) and the intercept is −11.781 (00 kg/ha). To elaborate, b = 0.4826 means that there exists a positive relationship between wheat productivity and average annual rainfall in the area, and the data given above show a tendency of an average rise in wheat production equal to 0.4826 × 100 ≈ 48.3 kg per hectare for every one-centimetre rise in the average annual rainfall. The interpretation of a positive intercept relates to the value of the dependent variable Y when the independent variable is at its lowest level, i.e. X = 0. A negative value of the intercept at X = 0, however, does not directly mean anything. At best, we can see from the regression line at what value of X the value of Y will be zero:
0 = −11.781 + 0.4826 X
X = 24.41 cm
The above equation suggests that, on average, wheat production requires a threshold value of average annual rainfall of at least about 24.41 cm; in other words, on average, wheat production is likely to start only after an average annual rainfall of about 24.41 cm has already taken place.
The equation of the fitted regression line is therefore Y = −11.781 + 0.4826 X. Once the algebraic equation of the least-squares regression line has been fitted, we can estimate the value of wheat productivity (Y) for every given value of average annual rainfall (X). The difference between the given value of Y and the estimated value Ŷ is known as the residual or error ε. For example, for the first given value of average annual rainfall, X = 40, the given value of wheat production is 5 (00 kg/ha), whereas the estimated value given by the line is Ŷ = −11.781 + 0.4826 × 40 = −11.781 + 19.304 = 7.523 (00 kg/ha). The residual is 5 − 7.523 = −2.523 (00 kg/ha). In the second case (ignoring the 00 and the unit for the time being), for the given value X = 45 the given value of Y is 13 and the estimated value is 9.936, giving a residual of 13 − 9.936 = 3.064. Likewise, the residuals or errors for all the other observations can be calculated.
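The residual calculations just described can be sketched in Python (an illustration, not part of the original text), using the fitted values a = −11.781 and b = 0.4826:

```python
# Fitted values and residuals from the estimated line Y-hat = a + b*X
a, b = -11.781, 0.4826
X = [40, 45, 50, 50, 60, 60, 65, 70, 71, 75]
Y = [5, 13, 11, 15, 20, 13, 18, 20, 25, 25]

fitted = [a + b * x for x in X]                 # estimated values Y-hat
residuals = [y - f for y, f in zip(Y, fitted)]  # errors (Y - Y-hat)

print(round(fitted[0], 3))     # 7.523 for X = 40, as computed by hand
print(round(residuals[0], 3))  # -2.523, the first residual
```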
In any regression analysis, computation of the regression coefficient “b” and the intercept alone is not sufficient. We also have to evaluate the magnitude of “b” by testing the null hypothesis that its actual value (also known as the population value) may be considered zero. We proceed to interpret the values of the two parameters only when the null hypothesis is rejected, i.e. when the value of b cannot be considered to be zero. In such a case we call the value of “a” or “b” statistically significant. For the test of significance of the regression coefficient “b”, the following “t”-test is carried out under the following assumptions:
- The error ε is a random variable distributed normally.
- The mean value of ε is zero.
- The variance of ε is constant for all values of X (the condition of homoscedasticity).
- If there are more independent variables, all are independent of each other, i.e. there is no multicollinearity among the independent variables.
If the above assumptions are met, the computed value of b gives the statistic “t” below, which follows the “t” distribution with (n − 2) degrees of freedom:

t = b / S.E.(b), where

S.E.(b) = s / √∑(X − x̄)²  and  s² = ∑(Y − Ŷ)² / (n − 2)
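The whole t-test computation can be sketched in Python for the present data (an illustration, not part of the original text; it uses the standard formula S.E.(b) = s/√∑(X − x̄)² with s² the residual variance on n − 2 degrees of freedom):

```python
import math

# Rainfall (cm) and productivity (00 kg/ha) from Table 1
X = [40, 45, 50, 50, 60, 60, 65, 70, 71, 75]
Y = [5, 13, 11, 15, 20, 13, 18, 20, 25, 25]

n = len(X)
xbar, ybar = sum(X) / n, sum(Y) / n
sxx = sum((x - xbar) ** 2 for x in X)

# Least-squares estimates
b = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y)) / sxx
a = ybar - b * xbar

# Residual sum of squares and standard error of b
rss = sum((y - (a + b * x)) ** 2 for x, y in zip(X, Y))
s = math.sqrt(rss / (n - 2))      # residual standard deviation
se_b = s / math.sqrt(sxx)

t = b / se_b
print(round(t, 2))                # about 6.13, well above the 1% critical value
```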
As the estimated values of Y are values found on the line, their variation will always be less than that of the actual values of Y. The objective of any regression line is to get estimated values of Y whose variation is as close to that of the actual values as possible. The ratio of the explained sum of squares to the total variation in Y is therefore an indicator of the quality of a regression model; it is known as the coefficient of determination, denoted by R² and given by:
R² = Explained Sum of Squares / Total Sum of Squares.
The value of R² varies between zero and unity. A value of, say, 0.75 would mean that, of the total variation in the dependent variable Y, 75 per cent is being explained by the independent variable X chosen here.
Again, to test the statistical significance of R² we have the F-ratio test:

F = [R² / (k − 1)] / [(1 − R²) / (n − k)], with (k − 1, n − k) degrees of freedom,

where n is the number of observations and k the number of estimated parameters (here k = 2, for a and b).
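A short Python sketch (an illustration, not part of the original text) of R² and the F-ratio for the present data:

```python
X = [40, 45, 50, 50, 60, 60, 65, 70, 71, 75]
Y = [5, 13, 11, 15, 20, 13, 18, 20, 25, 25]

n, k = len(X), 2                  # k = number of estimated parameters (a and b)
xbar, ybar = sum(X) / n, sum(Y) / n
b = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y)) / \
    sum((x - xbar) ** 2 for x in X)
a = ybar - b * xbar

tss = sum((y - ybar) ** 2 for y in Y)                     # total sum of squares
rss = sum((y - (a + b * x)) ** 2 for x, y in zip(X, Y))   # unexplained part
r2 = (tss - rss) / tss            # coefficient of determination
F = (r2 / (k - 1)) / ((1 - r2) / (n - k))

print(round(r2, 3), round(F, 1))  # about 0.825 and 37.6
```

In the bivariate case F equals t², so this agrees with the t-test of the regression coefficient.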
The results of a regression analysis can validly be used in any comparative research only when the estimated parameters are found to be statistically significant. Regression analysis therefore also requires calculation of the standard error of b, to carry out the statistical test of significance and ensure that the regression coefficient “b” is not statistically insignificant, and that the explanatory power of the regression model is also statistically significant. For the present data these calculations give ∑(X − x̄)² = 1276.4 and residual sum of squares ∑(Y − Ŷ)² = 63.21, so that s² = 63.21/8 = 7.90, S.E.(b) = 2.81/35.73 = 0.0787, and t = 0.4826/0.0787 = 6.13.
The tabulated values of “t” for 8 degrees of freedom are 2.31 at the 5% level of significance and 3.36 at the 1% level. The calculated value (6.13) is significant even at the 1% level, i.e. the regression coefficient is statistically significant.
R² = 297.29/360.5 = 0.8247 ≈ 0.825, and
F = [R²/(k − 1)] / [(1 − R²)/(n − k)] = [0.8247 × (10 − 2)] / [(1 − 0.8247) × (2 − 1)] = 37.6
The tabulated F values for d.f. = (1, 8), i.e. (2 − 1, 10 − 2), are 11.26 at the 1% and 5.32 at the 5% level of significance. Our calculated value is significant even at the 1% level. Thus we can say that R² is statistically significantly different from zero; in other words, the independent variable explains the dependent variable in a substantial way (not in a random way).
After carrying out the tests of significance of b and R², one can verify the correspondence between the actual values of the dependent variable and the values estimated from the regression equation. The estimated values are obtained by putting the given values of X into the regression equation, as shown earlier while explaining the principle of least squares. It is interesting to observe that the estimated values show good correspondence with the given values of Y, thereby strengthening confidence in the regression model. Since the estimated values are found to be close to the actual values, the model can be used to forecast future values of Y for any future value of the independent variable X. The regression model can also be used to interpolate any missing values between the given values of the dependent variable Y.
X | Y | Ŷ | Y − Ŷ
40 | 5 | 7.523 | −2.523
45 | 13 | 9.936 | 3.064
50 | 11 | 12.349 | −1.349
50 | 15 | 12.349 | 2.651
60 | 20 | 17.175 | 2.825
60 | 13 | 17.175 | −4.175
65 | 18 | 19.588 | −1.588
70 | 20 | 22.001 | −2.001
71 | 25 | 22.484 | 2.516
75 | 25 | 24.414 | 0.586
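Forecasting and interpolation with the fitted line can be sketched as follows (a Python illustration, not part of the original text; the rainfall value 55 cm is a hypothetical input chosen to lie between observed values):

```python
# Using the fitted least-squares line for interpolation or forecasting
a, b = -11.781, 0.4826   # intercept and regression coefficient from the example

def predict(rainfall_cm):
    """Estimated wheat productivity (00 kg/ha) for a given rainfall (cm)."""
    return a + b * rainfall_cm

# Interpolating productivity for a hypothetical 55 cm of annual rainfall,
# which lies between the observed values 50 and 60
print(round(predict(55), 2))  # about 14.76 (00 kg/ha)
```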