19 Multiple Linear Regression Analysis (Simple and Stepwise Regression)

Prof. Aslam Mahmood


   

 

 

 


 

Multiple Linear Regression Analysis

 

Regression Analysis

 

The ultimate objective of any research is to understand the causes behind a process, with the aim of controlling it as per our requirements. Knowledge of the degree and direction of correlation between a dependent and an independent variable alone is not sufficient for this purpose unless it yields a mathematical equation relating the two. Through such a causal (cause-and-effect) relationship we can evaluate the effect of different independent variables, also known as determinants, on the dependent variable, and can also make projections of the dependent variable for projected future values of the independent variables.

 

Regression analysis helps us identify the above-mentioned functional form of relationship. If we have only two variables to analyse, i.e. one dependent and one independent variable, the regression is known as bivariate regression. If the number of independent variables is two or more, it is known as multiple regression analysis. The explanation of regression analysis becomes very complicated if we start directly from multiple regression; it is easier to explain bivariate regression first and then present multiple regression as its extension.

 

 

 

Bivariate Regression Analysis

 

The functional form of the relationship between two variables is best explained by taking a straight line on a graph with a Y-axis and an X-axis. If we take a large number of points on the line and note their Y and X coordinates, these coordinates will be related by a general functional form: Y = a + b X + ε. In any such equation of a straight line, a is known as the intercept, i.e. the value at which the line cuts the Y-axis (the value of Y when X = 0); b, the rate of change in Y with respect to X, is known as the slope of the line; and ε is known as the error. The relationship Y = a + b X is known as a linear relationship (as it describes a line). Thus any specific straight line can be specified by fixing the values of a and b. However, most bivariate data from the social sciences, though following a linear pattern of relationship, will not fall exactly on a straight line. On a scatter plot of such data the points may cluster around a straight line, with some deviations on either side of the line, as indicated by ε. In such cases the line around which the points cluster can be identified to give the relationship between the two variables X and Y. Values of Y and X falling exactly on the line satisfy the relationship exactly; for values falling around the line the relationship is only approximate. The value of the coefficient of correlation gives the direction and degree of such a linear relationship, while the equation of the line passing through the scatter plot gives its functional form.

 

 

Principle of Least Squares

 

Since any number of lines can be drawn through a given scatter plot of a set of bivariate data, we need a principle for selecting an optimal line. Statistical theory has developed the principle of least squares to govern this choice. Consider the following values of the X and Y variables and their scatter plot.

 

Y      X
20     2
45     4
65     7
45     9
70     12
110    23
89     15
102    13
110    18
100    17

 

As is clear from the scatter plot, a line is drawn through the middle of the scatter in such a way that roughly half of the points fall above it and half below it. Once the line is drawn, for every given value of X we have an estimated value of Y from the line in addition to its observed value. The difference between the observed value of Y and the estimated value of Y is known as the residual or error. For example, for the first observation, X = 2, the observed value is 20 whereas the estimate given by the line is 32; the difference is 32 − 20 = +12. In the second case, X = 4, the observed value of Y is 45 and the estimated value is 40, so the difference is 40 − 45 = −5. Likewise the residuals or errors for all other observations can be calculated. Simply summing these residuals to measure the total error would be misleading, as negative residuals cancel positive ones. The residuals are therefore squared and then added to give the total error. If we draw different lines, we will get a different value of the sum of squared residuals for each line. Ideally we choose as optimal the line which gives the minimum sum of squared residuals. This principle is known as the principle of least squares, and the resulting line is known as the regression line. In practice we do not try out several lines but take the help of differential calculus, which shows that the slope b, also known as the regression coefficient, of the optimal line is:

b = Σ (Xᵢ − X̄)(Yᵢ − Ȳ) / Σ (Xᵢ − X̄)²

and the intercept is given by a = Ȳ − b X̄.

 

 

After computing the regression coefficient, the analysis also focuses on its magnitude. We test whether the value of b is close enough to zero that its actual (population) value may be considered to be zero.

 

For testing such a hypothesis we use the "t" test, which rests on the following assumptions:

 

1. The error term ε is a random variable that is normally distributed.
2. The mean value of ε is zero.
3. The variance of ε is constant for all values of X.
4. If there are more independent variables, they are all independent of each other, i.e. there is no multicollinearity among the independent variables.

 

 

Under these assumptions, the computed value of b divided by its standard error follows the "t" distribution with (n − 2) degrees of freedom, i.e.

 

t = b / S.E.(b), where the standard error of b is given by

S.E.(b) = √[ (Σ eᵢ² / (n − 2)) / Σ (Xᵢ − X̄)² ]

with eᵢ = Yᵢ − a − b Xᵢ denoting the residuals.

 

 

As the estimated values of Y lie on the line, their variation will always be less than that of the actual values of Y. The objective of any regression line is to obtain estimated values of Y whose variation is as close to that of the actual values as possible. The ratio of the explained sum of squares to the total variation in Y is therefore an indicator of the quality of a regression model; it is known as the coefficient of determination, denoted by R², and given by:

 

R² = Explained Sum of Squares / Total Sum of Squares.

 

The value of R² varies between zero and unity. A value of, say, 0.75 means that 75 per cent of the total variation in the dependent variable Y is being explained by the chosen independent variable X.

Again, to test the statistical significance of R² we have the F-ratio test:

F = [R² / (k − 1)] / [(1 − R²) / (n − k)], with (k − 1, n − k) degrees of freedom,

where n is the number of observations and k the number of constants estimated (k = 2 in the bivariate case).

 

 

Example

 

From the data given below we can compute the regression coefficient and the intercept of the regression line as shown:

 

X      Y      X²     Y²      XY
2      20     4      400     40
4      45     16     2025    180
7      65     49     4225    455
9      45     81     2025    405
12     70     144    4900    840
23     110    529    12100   2530
15     89     225    7921    1335
13     102    169    10404   1326
18     110    324    12100   1980
17     100    289    10000   1700
Total  120    756    1830    66100   10791

 

 

Using the above formulas, the regression coefficient b and the intercept a can be computed from these column totals.
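As a check on the arithmetic, the same quantities can be computed with a few lines of code. The following is a minimal sketch in Python (not part of the original module) applying the least-squares formulas derived above to the table; it gives b ≈ 4.41, a ≈ 22.71, R² ≈ 0.85, t ≈ 6.65 and F ≈ 44.3.

```python
# Minimal sketch (not from the original module): bivariate least-squares
# fit for the example data in the table above.
X = [2, 4, 7, 9, 12, 23, 15, 13, 18, 17]
Y = [20, 45, 65, 45, 70, 110, 89, 102, 110, 100]
n = len(X)

sum_x, sum_y = sum(X), sum(Y)
sum_xx = sum(x * x for x in X)
sum_xy = sum(x * y for x, y in zip(X, Y))

# Regression coefficient (slope) and intercept
b = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
a = (sum_y - b * sum_x) / n

# Residual and total sums of squares, and the coefficient of determination
y_bar = sum_y / n
y_hat = [a + b * x for x in X]
ss_res = sum((y - yh) ** 2 for y, yh in zip(Y, y_hat))
ss_tot = sum((y - y_bar) ** 2 for y in Y)
r2 = 1 - ss_res / ss_tot

# Standard error of b, t statistic and F ratio (k = 2, so d.f. = 1, n - 2)
se_b = (ss_res / (n - 2) / (sum_xx - sum_x ** 2 / n)) ** 0.5
t = b / se_b
F = r2 * (n - 2) / (1 - r2)

print(f"b = {b:.3f}, a = {a:.3f}, R2 = {r2:.3f}, t = {t:.2f}, F = {F:.2f}")
# b = 4.408, a = 22.708, R2 = 0.847, t = 6.65, F = 44.26
```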

 

Regression analysis also requires the calculation of the standard error of b in order to carry out the statistical test of significance and confirm that the regression coefficient b is not effectively zero.

 

 

The critical F values for d.f. = 1, 8 (i.e. 2 − 1, 10 − 2) given in the table are 11.26 at the 1% and 5.32 at the 5% level of significance respectively. Our calculated value is found to be significant even at the 1% level of significance. Thus we can say that R² is statistically significantly different from zero; in other words, the independent variable explains the dependent variable in a substantial way, not in a random way.
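Critical values like these need not be read from printed tables; they can be computed directly, as in this small sketch assuming the SciPy library:

```python
from scipy import stats

# Critical F values for (1, 8) degrees of freedom
print(stats.f.ppf(0.99, 1, 8))  # ~11.26 -> 1% level of significance
print(stats.f.ppf(0.95, 1, 8))  # ~5.32  -> 5% level of significance
```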

 

 

Multiple Linear Regression Analysis through Matrix Algebra

 

Multiple regression analysis is an extension of bivariate regression analysis in which we have one dependent variable and two or more independent variables. For p constants and n observations, the multiple regression model is:

Y = X β + ε

 

 

where Y and ε are column vectors of dimension n × 1, β is a column vector of dimension p × 1 holding the constants, and X is an n × p matrix whose first column consists of ones (for the intercept) and whose remaining columns hold the values of the independent variables. Applying the principle of least squares to this set of equations gives the solution vector with the estimated values b₁, b₂, …, b_p of the constants of the regression line passing through the scatter of points:

β̂ = (XᵀX)⁻¹ XᵀY

 

 

 

We can arrive at the same results using the method of normal equations from ordinary algebra. However, when the number of variables and observations is large, as is usually the case in geographical research, the matrix method shown above is very convenient. A further advantage of the matrix approach is that it underlies statistical computer packages such as SPSS, STATA and SAS.
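As an illustration (a minimal sketch, not from the original text, assuming the NumPy library), the matrix solution can be computed directly; applied to the bivariate example data it reproduces the same intercept and slope as before.

```python
import numpy as np

# Design matrix X (n x p): a column of ones for the intercept plus the regressors.
x = np.array([2, 4, 7, 9, 12, 23, 15, 13, 18, 17], dtype=float)
y = np.array([20, 45, 65, 45, 70, 110, 89, 102, 110, 100], dtype=float)
X = np.column_stack([np.ones_like(x), x])

# Least-squares solution of the normal equations (X'X) beta = X'y.
# np.linalg.solve is preferred over forming the inverse explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

print(beta_hat)  # [a, b] -- approximately [22.71, 4.41]
```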

 

As in the bivariate case, we can also carry out the "t" test and work out the F-ratio for a multiple regression analysis, for which the values are given as:

 

 

In the worked example, only the regression coefficient b2, related to variable X1, is found to be significant at the 1% level of significance; it shows an effect on Y of 1.7329 for a unit change in X1. The intercept b1 and the regression coefficient b3, related to variable X2, are found to be insignificant even at the 5% level of significance.

 

 

The critical F values for d.f. = 2, 2 (3 − 1 = 2, 5 − 3 = 2) are 99.0 at the 1% and 19.0 at the 5% level of significance respectively. The test of significance also reveals that the two independent variables X1 and X2 explain the variation in the dependent variable quite substantially.

 

Stepwise Multiple Linear Regression Analysis

 

Stepwise regression analysis evaluates the relative efficiency of the independent variables in explaining the dependent variable as variables are added to or deleted from the model, one by one, in several steps. It has two approaches: forward and backward. In forward stepwise regression we start with the single most important variable, followed by the second, the third, and so on down to the least important.

 

In the backward approach we start with all the variables and keep excluding them one by one, proceeding in the reverse direction until we reach the optimal position.

 

Need for Stepwise Regression Analysis

 

One of the important assumptions behind the tests of significance in OLS regression analysis is that the independent (explanatory) variables are independent of each other. This is known as the assumption of absence of multicollinearity. Sometimes this assumption is violated and the independent variables show significant inter-correlation among themselves. In such cases there is some overlap in the explanatory power of the collinear variables, and the higher the correlation between them, the greater the overlap. For example, suppose one variable explains 40% of the variation in the dependent variable and another, related variable explains 30%. Taken together they would explain 70% only if the two were truly independent; if they are collinear, they will explain less than 70%. If the two together explain only 50%, it is because 20 percentage points of the second variable's contribution have already been accounted for by the first variable, leaving the second variable only 10 percentage points to add. In the absence of the first variable, however, the second variable would show a higher explanatory power. An ordinary regression equation gives us no idea about this complication; computer procedures have been developed to tackle it, and stepwise regression analysis is one such procedure.

In any regression model, adding more variables always increases R², whether a new variable is positively or negatively related to the dependent variable. In the stepwise (forward) approach, independent variables are added to the model sequentially, one by one, until the criterion for adding a variable is no longer met. In the first step the procedure fits a regression with the single independent variable that gives the maximum R². In the second step it adds the variable that adds the most to the existing R², and likewise in the third step one more variable is added, and so on. Each time, the addition to R² due to the new variable will be smaller than the previous addition, and the value of the F-ratio will also change. We can fix a criterion for adding a new variable, generally in terms of the probability level of the change in F due to the addition; in most cases, if the probability exceeds a fixed limit, say 0.05, the variable is not added to the model.

R² (adjusted) is designed in such a way that adding a new variable decreases it unless the variable causes a sufficiently large increase in R². In stepwise regression analysis we keep allowing the addition of new variables as long as R² (adjusted) increases. After a few steps, although R² continues to rise, R² (adjusted) starts decreasing, indicating that the addition to R² is not large enough for the variable to be retained in the analysis.
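For reference, the standard formula for R² (adjusted), which the discussion above relies on but does not reproduce, is:

R² (adjusted) = 1 − (1 − R²) (n − 1) / (n − k)

where n is the number of observations and k the number of estimated constants (including the intercept). The factor (n − 1)/(n − k) grows with every added variable, which is why R² (adjusted) falls unless the new variable raises R² sufficiently.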

 

   Example

 

The declining sex ratio in India is a big concern for society, and a large number of factors lie behind it. In the following example, for simplicity of explanation, we have taken a few of them and used stepwise regression analysis to explain the variation in the sex ratio across the 50 districts of Madhya Pradesh in 2011, with the help of the following variables.

 

1. Sex ratio (females per thousand males) (V1)

2. Growth rate of population, 2001–11 (in percentage) (V2)

3. Level of literacy (in percentage) (V3)

4. Population density per square kilometre (V4)

5. Female work participation rate (in percentage) (V5)

 

 

When the data (given in the annexure) were subjected to bivariate correlation and stepwise regression analysis using SPSS, the following results were obtained.
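Outside SPSS, the same inter-correlation matrix can be produced with a short script. The sketch below assumes the annexure data have been saved as a CSV file (the file name mp_districts_2011.csv is hypothetical) with columns V1 to V5, and uses NumPy.

```python
import numpy as np

# Hypothetical file holding the annexure data: one row per district,
# five columns in the order V1 (sex ratio) ... V5 (female work participation).
data = np.loadtxt("mp_districts_2011.csv", delimiter=",", skiprows=1)

# 5 x 5 inter-correlation matrix; diagonal elements are 1 by construction.
R = np.corrcoef(data, rowvar=False)
print(np.round(R, 3))
```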

 

First, SPSS gives the inter-correlation matrix of each variable with every other variable, shown in Table 1 below. It also identifies the level of significance of each coefficient of correlation.

 

The table shows that the dependent variable, sex ratio (V1), has an insignificant negative correlation coefficient with the growth rate of population, 2001–11 (V2), and a significant negative correlation with the level of literacy (V3), significant at the 5% level of significance. The inter-correlation matrix also shows that the dependent variable has a strong positive relationship with the female work participation rate (V5), significant at the 1% level of significance.

 

Note that :

  1. A two-tailed test means that the coefficient of correlation could be ≠ 0, i.e. either greater than or less than 0.
  2. A one-tailed test means that the coefficient of correlation is tested in one specified direction only, i.e. either that it is > 0 or that it is < 0.
  3. The diagonal elements of an inter-correlation matrix are always 1, since the correlation of a variable with itself is perfect; the coefficient of correlation is therefore 1.

    Table 1

 

Inter-correlation Matrix

 

 

     *. Correlation is significant at the 0.05 level (2-tailed).

**. Correlation is significant at the 0.01 level (2-tailed).

 

The inter-correlation matrix given above suggests a reasonable justification for the choice of the explanatory (independent) variables used to explain the variation in the values of the dependent variable, sex ratio.

 

The matrix of inter-correlations also reveals the overlap among the independent variables. The fourth variable, density of population (V4), has a strong positive relationship, significant at the 1% level of significance, with the second variable, growth rate of population (V2), and with the third variable, literacy (V3); literacy in turn has a strong significant positive relationship with the fifth variable, female work participation rate (V5). The female work participation rate (V5) also has a strong negative relationship with the density of population (V4).

 

The inter-correlation among the independent variables suggests that there is some multicollinearity among them, so an ordinary regression analysis will not give the optimal regression equation. A stepwise regression is likely to give better results by excluding the redundant variables and retaining only those which add a higher value to R², as explained above. It will also give the order of efficiency with which each independent variable explains the dependent variable, sex ratio.

 

Stepwise regression analysis gives different models by adding independent variables one by one, sequentially. The criterion for adding a new variable is that the probability of its F value is less than 0.05 (5%), as shown in Table 2 below. We can also relax the probability to 0.10 (10%) to allow more variables to enter the analysis.
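The selection rule can be made concrete with a short sketch. The function below is an illustration written for this text, not SPSS's actual code; it assumes NumPy and SciPy, and implements forward selection with a probability-of-F-to-enter threshold of 0.05, mirroring the criterion described above.

```python
import numpy as np
from scipy import stats

def forward_stepwise(X, y, p_enter=0.05):
    """Forward stepwise selection: at each step, add the candidate variable
    with the largest F-to-enter, provided its probability is below p_enter."""
    n, m = X.shape
    selected, remaining = [], list(range(m))

    def rss(cols):
        # Residual sum of squares of an OLS fit on the chosen columns + intercept
        A = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        e = y - A @ beta
        return e @ e

    while remaining:
        rss_old = rss(selected)
        best = None
        for c in remaining:
            rss_new = rss(selected + [c])
            df2 = n - len(selected) - 2        # residual d.f. after adding c
            F = (rss_old - rss_new) / (rss_new / df2)
            if best is None or F > best[1]:
                best = (c, F, stats.f.sf(F, 1, df2))
        c, F, p = best
        if p >= p_enter:                       # criterion not met: stop
            break
        selected.append(c)
        remaining.remove(c)
    return selected

# Usage with the annexure data: columns of X hold V2..V5, y holds V1.
# selected = forward_stepwise(X, y)   # e.g. V5 enters first, then V4
```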

 

In the present example, the results given in Table 2 below show that only two variables, female work participation rate (V5) and density of population (V4), are retained in the multiple regression analysis. The other variables, population growth rate (V2) and level of literacy (V3), are not found to add much to the explanation of the variation in the sex ratio of the districts of Madhya Pradesh: their part of the explanation is already covered by the variables entered first, chiefly the female work participation rate (V5).

 

 

Table 2

 

Criteria for Choosing the Independent Variables

 

Stepwise Regression Analysis

 

The summary results of the two steps of the regression model follow in the computer output, as given in Table 3 below:

 

 

Table 3

 

Model Summary

 

 

 

Model 1, given by the first step, shows that one variable alone, female work participation rate (V5), explains the sex ratio quite effectively: it explains 70.6% of the variation in the sex ratio across the 50 districts of Madhya Pradesh, as per the data provided by the Census of India. The next variable to be included in the model is population density (V4), which adds only 3.6% to the explanatory power of the model, the value of R² rising from 0.706 to 0.742. R² (adjusted) also rose only from 0.700 to 0.731. The other two variables could not qualify under the criterion for entry into the analysis due to multicollinearity.

 

Once the model is chosen, the main results follow. These include the regression coefficients of the selected variables, their standard errors, "t" statistics and levels of significance. These results are given in Table 4 below.

 

Table 4

 

Coefficients

Table 4 above shows the regression coefficients in unstandardized as well as standardized form. The unstandardized coefficients relate to the data as provided in the computer input, and the output also gives the value of the intercept as the constant. The computer also converts the given data into standard scores and gives the corresponding regression coefficients as standardized coefficients; the purpose is to bring all the data to a standard form with zero mean and unit standard deviation. In standardized form, since the mean of every variable is zero, no intercept is reported, as it too becomes zero. The "t" statistics of the regression coefficients and their levels of significance, however, remain the same in both cases.
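The two forms are linked by a standard conversion (a textbook identity, not shown in the original module): each standardized coefficient equals the unstandardized coefficient multiplied by the ratio of standard deviations,

Beta (standardized) = b × sX / sY

where sX is the standard deviation of the independent variable concerned and sY that of the dependent variable.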

 

The results in the table show that a unit change in female work participation is associated with a rise of 4.214 in the sex ratio, whereas the density of population shows much less impact: a change of one person per square kilometre brings a change of only 0.063 in the sex ratio.

 

 

It is important to note that the female work participation rate varies within a relatively narrow range, from 8.4 in Bhind to 52.9 in Dindori, while the density of population has quite a large range of variation, from 94 in Anuppur to 855 in Bhopal. These differences in scale are removed by converting all three variables into their standard scores. As a result, the gap of 3.832 between the regression coefficients in unstandardized form (4.214 − 0.382 = 3.832) is reduced to 0.734 between the standardized coefficients (0.954 − 0.220). The ratio of the two is likewise reduced from 11.03 (= 4.214/0.382) to 4.33 (= 0.954/0.220).

 

Annexure I: Data for Stepwise Regression Analysis

 

 


Source: Census of India 2011

  1. Sex ratio (females per thousand males) (V1)
  2. Growth rate of population, 2001–11 (in percentage) (V2)
  3. Level of literacy (in percentage) (V3)
  4. Population density per square kilometre (V4)
  5. Female work participation rate (in percentage) (V5)

 


 
