29 Technique of Computation of Regression Residuals and Mapping
Prof Bimal Kar
Introduction
Correlation and regression analysis is a popular and useful statistical technique in geography and many other disciplines for understanding the nature, extent and overall pattern of relationship among related attributes in different parts of the world. It helps in identifying the causes and consequences of various spatial and non-spatial phenomena. For instance, why does the intensity of cropping vary spatially within an area, why does the literacy rate vary spatially, or why do crop productivity, the fertility rate or the child-woman ratio vary spatially within a country? Such questions can be analysed statistically to find meaningful answers. The intensity of cropping in an area may vary with irrigation facilities, population density, landholding size, mechanization, the consumption rate of fertilizers, etc. Similarly, the fertility rate or child-woman ratio in an area may vary with income level, educational attainment, women’s status, family system (nuclear or joint), standard of living, female age at marriage, etc. Such phenomena can be studied through (i) the coefficient of correlation, (ii) regression analysis and (iii) regression residuals. The coefficient of correlation computed with Karl Pearson’s formula shows the nature or direction of the relationship (positive or negative) and its degree or extent (weak, moderate or strong) between logically associated variables. Its value ranges from -1.0 (perfect negative) through 0.0 (no linear relationship) to +1.0 (perfect positive). The closer the value is to -1 or +1, the stronger the relationship (strong negative or strong positive); the closer it is to zero, the weaker the relationship.
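As a minimal sketch of how Karl Pearson’s coefficient is obtained, the function below divides the sum of cross-products of deviations by the square root of the product of the two sums of squared deviations. The function name `pearson_r` and the five pairs of values are illustrative assumptions, not data from the Census.

```python
import math

def pearson_r(xs, ys):
    """Karl Pearson's product-moment correlation coefficient."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long series with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical values for five spatial units (not taken from the Census data).
urbanization = [5.0, 10.0, 15.0, 20.0, 25.0]
literacy = [66.0, 70.0, 71.0, 75.0, 78.0]
r = pearson_r(urbanization, literacy)
print(round(r, 3))
```

For these made-up values the coefficient comes out close to +1, i.e. a strong positive relationship; reversing the order of one series would give a value close to -1.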
The coefficient of correlation gives an idea only of the degree and direction of a relationship. On its own, it does not help in predicting the values of the dependent variable for changing values of the independent variable. Regression analysis helps in understanding the overall pattern of relationship between the independent and dependent variables, and in finding the expected or predicted values of the dependent variable for any given values of the independent variables under the average pattern of relationship. Such predictions are needed for making projections of a phenomenon, and they are reliable when the relationship between the dependent and independent variables is strong, as indicated by the coefficient of determination R². A high value of R² (say 0.85 or 0.90) indicates a close correspondence between the observed and predicted values, so the projected values of the dependent variable can also be relied upon.
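The sketch below illustrates, under stated assumptions, how a least-squares line Yc = a + bX is fitted and how the coefficient of determination R² is obtained from the observed and predicted values. The helper names `fit_line` and `r_squared` and the data are hypothetical and serve only to show the arithmetic.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b for Yc = a + b*X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def r_squared(xs, ys, a, b):
    """Share of the variation in Y accounted for by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

# Hypothetical values for five spatial units (not taken from the Census data).
xs = [5.0, 10.0, 15.0, 20.0, 25.0]
ys = [66.0, 70.0, 71.0, 75.0, 78.0]
a, b = fit_line(xs, ys)
r2 = r_squared(xs, ys, a, b)
predicted = a + b * 30.0  # expected Y for a new X value of 30
print(round(a, 2), round(b, 2), round(r2, 3), round(predicted, 2))
```

With these illustrative values the fitted line is roughly Yc = 63.3 + 0.58X and R² is about 0.98, so a prediction such as the one for X = 30 could be treated as fairly reliable.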
Regression Residual Analysis
In any regression model, the given independent variables explain the average variation in the dependent variable only to some extent. There are also independent variables that have been left out and about which we know nothing. Owing to the influence of these left-out variables, the expected values from a regression model may not exactly match the actual values of the dependent variable. The difference between the actual and the expected value of the dependent variable is known as the regression residual, i.e. the numerical deviation of the observed value from the expected one. A regression model fitted to geographic data gives the equation of the regression line, which shows the average form of relationship between the dependent and independent variables without any reference to space. The utility of regression residuals is that every residual is associated with a spatial unit, which shows where the expected values are close to the actual values and where the deviations are larger. If the pattern of deviations corresponds to some geographic features of the area, it is quite possible that adding those features to the regression model will improve the correspondence between the expected and actual values of the dependent variable and reduce the residuals. Thus the analysis of residuals is useful in identifying further contributory factors (independent variables) influencing the dependent variable in addition to the given set of independent variables. Generally, regression analysis for the calculation of regression residuals is meaningful only when the coefficient of correlation is found to be statistically significant at least at the 0.05 significance level. The calculated value of the correlation coefficient (r) is tested with the well-known Student’s t-test.
Example 1
Taking all 27 districts of Assam, the levels of literacy have been explained by a regression model with the levels of urbanization as the independent variable, using data from the Census of India, 2011. In Table 1, the expected values of the levels of literacy (Y) have been estimated from the regression equation, taking the levels of urbanization (X) as the independent variable.
From these data, Karl Pearson’s correlation coefficient (r) between the levels of literacy and the levels of urbanization across the districts of Assam is found to be +0.575, which is statistically significant even at the 0.01 significance level. This means that, on average, the higher the urbanization level, the higher the literacy rate. Since the correlation between the two variables was sufficient, a least-squares regression equation was worked out, which gave the intercept a = 68.49 and the regression coefficient or slope b = 0.27. The deviations (both positive and negative) of the observed values of the dependent variable from its expected values are worked out as regression residuals.
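The significance claim can be checked with a short sketch of Student’s t-test for a correlation coefficient, t = r√(n - 2) / √(1 - r²) with n - 2 degrees of freedom. The values r = 0.575 and n = 27 come from the text; the function name `t_statistic` is hypothetical, and the critical value is an assumption read from a standard two-tailed t-table.

```python
import math

def t_statistic(r, n):
    """Student's t for testing a correlation coefficient against zero."""
    if n < 3 or abs(r) >= 1:
        raise ValueError("need n >= 3 and |r| < 1")
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

r = 0.575   # reported correlation between literacy and urbanization
n = 27      # number of districts
t = t_statistic(r, n)

# Assumption: two-tailed critical value of t for 25 degrees of freedom
# at the 0.01 level, as read from a standard t-table.
T_CRITICAL_01 = 2.787
significant = abs(t) > T_CRITICAL_01
print(round(t, 2), significant)
```

The computed t of about 3.51 exceeds the tabulated value, which agrees with the statement that r is significant at the 0.01 level.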
Such regression residual values of the dependent variable (Y-Yc) can be very close to zero (very low positive or very low negative), moderately away from zero (moderately positive or moderately negative), or far from zero (high positive or high negative). All of these cases deserve examination.
For the regression residual mapping of literacy rate on urbanization level with the district-level data of Assam, the expected values of Y, i.e. Yc, have been found for the given value of X of each district. For instance, in the case of Kokrajhar district, for the given X-value of 6.19, the expected value (Yc) is found to be 70.17 (Table 1). The expected Yc values for all the districts have been computed in this way. Thereafter, the regression residual values (Y-Yc) have been calculated for all the districts, starting with -4.95 for Kokrajhar district (Table 1). Among the 27 regression residual values (Y-Yc), the lowest is -12.99 (high negative) and the highest is +9.32 (high positive). Accordingly, the regression residual values have been grouped into four categories for residual mapping (Fig. 1), viz. above +6.0 (high positive), 0.0 to +6.0 (moderate positive), -6.0 to 0.0 (moderate negative) and below -6.0 (high negative).
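The Kokrajhar calculation and the four-class grouping can be sketched as follows. The coefficients and class limits are those given in the text; the observed literacy Y = 65.22 is inferred from the reported Yc and residual (70.17 - 4.95). The names `expected_literacy` and `residual_class` are hypothetical, and the treatment of a residual that falls exactly on a class limit is an assumption. Because the published coefficients are rounded, the result differs from the Table 1 figures in the second decimal place.

```python
from bisect import bisect_right

A, B = 68.49, 0.27  # intercept and slope reported for Example 1

def expected_literacy(x):
    """Expected literacy Yc for a given urbanization level X."""
    return A + B * x

# Class limits from the text. Assumption: a residual exactly equal to a
# limit is placed in the class above it.
BREAKS = [-6.0, 0.0, 6.0]
LABELS = ["high negative", "moderate negative",
          "moderate positive", "high positive"]

def residual_class(residual):
    """Return the mapping category of a regression residual."""
    return LABELS[bisect_right(BREAKS, residual)]

x_kokrajhar, y_kokrajhar = 6.19, 65.22
yc = expected_literacy(x_kokrajhar)
residual = y_kokrajhar - yc
print(round(yc, 2), round(residual, 2), residual_class(residual))
```

Applied to the extreme residuals quoted in the text, `residual_class(-12.99)` falls in the high negative class and `residual_class(9.32)` in the high positive class.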
These residuals have also been mapped on the district-wise map of Assam (Fig. 1).
Table 1: Calculation of Regression Residuals of Literacy on Urbanisation for the Districts of Assam, 2011
Yc = a + bX, where Yc = expected value of Y, a = 68.49, b = 0.27, r = 0.575 (statistically significant at the 0.01 significance level)
Y - Yc = regression residual value
Source: Analysis based on Census of India, 2011, Primary Census Abstract, Assam.
Fig. 1
As regards the map of regression residuals of literacy on urbanization in Assam (Fig. 1), the districts of Lakhimpur, Sivasagar, Jorhat, Golaghat, Karimganj and Nalbari record considerably higher literacy rates than expected for their level of urbanization. This may be attributed to some positive factors in these districts, viz. availability of educational institutions, higher social awareness of the need for education, better economic conditions, etc. On the other hand, the districts of Dhubri, Barpeta, Chirang and Darrang record considerably lower literacy rates than expected for their level of urbanization. This may be due to negative factors such as a lack of adequate educational facilities, poor economic conditions, lower social awareness about education, etc. The map of regression residuals thus suggests including some more development-related variables in the regression model to improve its explanatory power. In fact, once such areas are identified, the actual factors operating in them and their geographical features can be explored in further detail.
Example 2
A second example works out the regression residuals of the proportion of population in the age group 0-6, taking literacy as the independent variable, with data from the Census of India, 2011. The data are given in Table 2. For the district-level data of Assam, Karl Pearson’s correlation coefficient (r) between the two variables is found to be -0.67, which is statistically significant at the 0.01 significance level. Following the principle of least squares, a linear regression analysis (equation Y = a + bX) was then carried out, giving the regression coefficient or slope b = -0.2048 and the intercept a = 29.56. Using this regression equation, the expected values of Y, i.e. Yc, have been found for the given value of X of each district. For instance, in the case of Kokrajhar district, for the given X-value of 65.22, the expected value (Yc) is found to be 16.19 (Table 2). The expected Yc values for all the districts have been computed in this way. Thereafter, the regression residual values (Y-Yc) have been calculated for all the districts, starting with -0.76 for Kokrajhar district (Table 2). Among the 27 regression residual values (Y-Yc), the lowest is -2.53 (high negative) and the highest is +3.72 (high positive). Accordingly, the regression residual values have been grouped into six categories for residual mapping (Fig. 2), viz. above +2.0 (high positive), +1.0 to +2.0 (moderate positive), 0.0 to +1.0 (low positive), -1.0 to 0.0 (low negative), -2.0 to -1.0 (moderate negative) and below -2.0 (high negative).
Table 2: Calculation of Regression Residuals of 0-6 Population on Literacy for the Districts of Assam, 2011
Yc = a + bX, where Yc = expected value of Y, a = 29.56, b = -0.2048, r = -0.6737 (statistically significant at the 0.01 significance level)
Y - Yc = regression residual value
It may be mentioned here that areas whose regression residual values are very close to zero correspond statistically to the average pattern of relationship reflected by the variables considered, whereas areas that deviate considerably from zero in either the positive or the negative direction do not. The latter cases indicate the influence of some other factors (independent variables), positive in some cases and negative in others, which need to be identified through regression residual analysis and mapping. The regression residuals in this case are shown on the district-wise map of Assam in Fig. 2. It may, however, be noted that in the case of a positive relationship between two variables, as with urbanization and literacy, the areas with positive regression residuals are prospect areas, while the areas with negative residuals are problem areas. On the other hand, in the case of a negative relationship, as with literacy rate and proportion of 0-6 population, the areas with positive regression residuals are problem areas and the areas with negative residuals are prospect areas. Regression residual analysis and mapping thus helps in identifying problem areas, prospect areas and neutral areas. These types of situation can be seen in the regression residual maps (Fig. 2).
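The rule just described can be sketched as a small function that labels a spatial unit from the sign of its residual and the sign of the regression slope. The function name `area_type` and the `tolerance` that decides how close to zero counts as neutral are assumptions made for illustration; the text does not fix a numerical threshold.

```python
def area_type(residual, slope, tolerance=0.5):
    """Label a spatial unit as neutral, prospect or problem area.

    tolerance is a hypothetical cut-off: residuals within +/- tolerance
    are treated as matching the average pattern of relationship.
    """
    if abs(residual) <= tolerance:
        return "neutral"
    # Rule as stated in the text: a residual with the same sign as the
    # slope marks a prospect area, one with the opposite sign a problem area.
    same_sign = (residual > 0) == (slope > 0)
    return "prospect" if same_sign else "problem"

# Extreme residuals and slopes quoted in the two examples.
print(area_type(9.32, 0.27))      # literacy on urbanization
print(area_type(-12.99, 0.27))
print(area_type(3.72, -0.2048))   # 0-6 population on literacy
print(area_type(-2.53, -0.2048))
```

In practice the tolerance would be chosen to match the class limits used on the map, for example the limits of the lowest positive and negative classes.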
Fig. 2
As regards the map of regression residuals of 0-6 population on literacy in Assam (Fig. 2), the districts of Karimganj and Hailakandi record a considerably higher proportion of 0-6 population than expected for their literacy rate. This may be attributed to the negative influence of some factors in these districts, viz. poor economic conditions, low social awareness, low educational level, etc. On the other hand, the districts of Baksa and Udalguri record a considerably lower proportion of 0-6 population than expected for their literacy level. This may be due to the influence of positive factors such as better economic conditions, high social awareness, high educational level, etc. In fact, once such areas are identified, the actual factors operating in them can be explored.
Regression Residual Values and Their Grouping for Mapping
As discussed above, a regression residual is generally calculated by subtracting the expected value of the dependent variable (Yc) from the corresponding observed value (Y), i.e. Y-Yc. Such values, which may be positive or negative, are known as absolute regression residuals. They carry the unit of measurement of the dependent variable. For regression residual mapping by the choropleth technique, the residual values obtained for the different spatial units are grouped with nearly equal intervals into four groups (two intervals above zero for positive residuals and two below zero for negative residuals) or six groups (three intervals above zero and three below zero).
Sometimes the absolute regression residual values (Y-Yc) are converted into relative regression residual values by dividing each absolute value by the corresponding observed value Y, i.e. $(Y - Y_c)/Y$. As the relative regression residual is dimensionless, it is generally expressed as a percentage for easy understanding. The resulting relative residuals in percentage would normally lie within the range of 0 to ±100. As required, these values can be grouped with a suitable, nearly uniform interval into 4 or 6 classes for residual mapping. If necessary, the grouping of regression residual values, whether absolute or relative, can also be done by applying the principle of the normal distribution, wherein the standard error of estimate of Yc is calculated using the formula $\sqrt{\sum (Y - Y_c)^2 / N}$. The value so obtained can be used as a fixed class interval for all the residual values.
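A minimal sketch of the two conversions described above follows. The relative residual is illustrated with the Kokrajhar figures of Example 1 (Yc = 70.17 and residual -4.95, hence Y = 65.22); the four observed and expected values used for the standard error of estimate are hypothetical, as are the function names.

```python
import math

def relative_residual_pct(y, yc):
    """Relative regression residual (Y - Yc) / Y, in per cent."""
    return (y - yc) / y * 100.0

def standard_error_of_estimate(observed, expected):
    """Square root of the mean squared residual, sqrt(sum((Y - Yc)^2) / N)."""
    n = len(observed)
    total = sum((y - yc) ** 2 for y, yc in zip(observed, expected))
    return math.sqrt(total / n)

def se_class(residual, se):
    """Class index k such that k*SE <= residual < (k + 1)*SE."""
    return math.floor(residual / se)

rel = relative_residual_pct(65.22, 70.17)

# Hypothetical observed and expected values for four spatial units.
observed = [10.0, 12.0, 9.0, 15.0]
expected = [11.0, 11.0, 10.0, 14.0]
se = standard_error_of_estimate(observed, expected)
print(round(rel, 2), se, se_class(1.5, se), se_class(-0.5, se))
```

With the standard error used as a fixed class interval, class 0 holds residuals from zero up to one standard error, class -1 those from minus one standard error up to zero, and so on outward in both directions.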
Conclusion
From the foregoing discussion it is clear that residual mapping based on linear regression analysis is useful in identifying problem and prospect areas with respect to a selected phenomenon in relation to its influencing factors. Such identification of problem and prospect areas ultimately helps in finding the causal factors that influence the phenomenon under consideration.