34 Correlation: Coefficient of Determination: Testing for Significance
1. Learning Outcomes:
After completing this module, students will be able to:
- Understand the meaning and usefulness of correlation analysis
- Understand different techniques of finding out correlation
- Find out the significance of correlation coefficient
- Calculate coefficient of correlation and its importance
- Develop an understanding about various types of correlation
- Understand the significance and limitations of every technique of correlation
- Evaluate the relationship between the variables
2. Introduction to the Concept of Correlation and Its Applications:
Most of the time, we come across problems that involve two or more variables. When two variables appear to move in the same direction (both increasing or both decreasing) or in opposite directions (one increasing while the other decreases), the variables are said to be associated or correlated with each other. When the variations in both variables take place in the same direction, they are said to be positively correlated. If the variations move in opposite directions (one increases while the other decreases), the variables are said to be negatively correlated. For example, in a class, when a student's homework grades increase, his final grades also increase; this indicates a positive relationship between homework grades and final grades. Similarly, when the price of a branded washing machine is reduced, its market demand shoots up; this signifies a negative relationship between the price and the demand of the washing machine.
For instance, consider two variables, X and Y. When we plot the values of X and Y on a graph and 'Y' increases at a similar rate as 'X', the two variables are said to be positively correlated; the figure above on the left is a case of positive correlation. Alternatively, if 'Y' decreases as 'X' increases, the variables are said to be negatively correlated; the figure above on the right provides an example of negative correlation.
Some important definitions of correlation are given hereunder:
Simpson and Kafka – Correlation analysis deals with the association between two or more variables.
L.R. Conner – If two or more quantities vary in sympathy, so that movements in one tend to be accompanied by corresponding movements in the other(s), then they are said to be correlated.
Ya Lun Chou – Correlation analysis attempts to determine the 'degree of relationship' between variables.
A.M. Tuttle – Correlation is an analysis of the covariation between two or more variables.
Thus, correlation may be considered as a technique that helps us in analyzing the covariation of two or more variables. To be more precise, it measures the extent of correspondence between the ordering of two random variables and reflects the degree to which two variables share a common relationship.
The problem in analyzing the relationship/association between the variables may be categorized in three stages as below:
- Firstly, we need to see whether the variables under study are related to each other or independent of each other;
- Secondly, if any relationship is found, then we move ahead to understand the nature and degree of this relationship;
- Lastly, having calculated the degree of relationship, we may be interested in searching for a cause-effect (causal) relationship between the variables, i.e. whether variations in one variable cause variations in another variable.
It is important to mention that nowadays correlation is the most widely used technique in problems pertaining to economics, business, social science, biology, psychology, etc. It has become so important mainly because of the following reasons:
- With the knowledge of degree of relationship between the variables under study, we can be very specific while appreciating this relationship. For example, we can easily appreciate the relationship between per capita income and per capita electricity consumption;
- In business organizations, it is very significant concept as it enables the managers to estimate the costs, sales, price, and many other important variables with the help of other variables which are closely related to these variables.
- If both variables show a specific and reliable relationship, we can predict the unknown value of one variable with the help of the given value of the other variable. This is usually done with the help of simple regression analysis.
Correlation and Cause-effect Relationship (Causal Relationship):
If two variables are related to each other, it does not mean that there is necessarily any cause-effect (causal) relationship. A causal relationship means that variations in one variable cause variations in another. In fact, two variables may be strongly correlated while a causal relationship is non-existent. Here, we need to understand that two variables may be correlated with each other because of the following reasons:
- In a small sample, two variables may show a strong relationship, while no relationship is observed between them in the large population. It means that both variables tend to be correlated only by chance; actually, they are not;
- Two variables may be related due to the influence of one or more other variables. For example, in Madhya Pradesh, the production of both wheat and rice has been increasing for the last five years. This shows a positive relationship between the two variables. But actually, it is happening because of rainfall: due to rainfall during that period, the production of rice and the production of wheat are both increasing. Thus, a third variable (rainfall) influences both the variables.
- Sometimes, both variables may be influencing each other, so that it becomes difficult to say which is the cause and which is the effect. For example, in the first case, when a company increases its promotion expenditure, its sales also increase; here promotion expenditure is the cause and sales are the effect. In the second case, a decrease in the company's sales may force the company to cut its promotion expenditure; here sales are the cause and promotion expenditure is the effect. Therefore, it is very difficult to determine which variable is the cause and which is the effect, as both influence each other.
From the above discussion, we may easily understand that the existence of correlation does not point towards any cause-effect relationship between the variables. But the existence of a cause-effect relationship between two variables implies that both variables are necessarily correlated with each other.
Suppose two variables are correlated with each other but have no causal relationship, and one interprets the relationship as causal; such a correlation is described as spurious or nonsense correlation. For example, the salaries of health-sector employees and the number of accidents in Delhi over a period of 10 years tend to be positively correlated (r = 0.9). But does that mean that these two variables are causally related? Certainly not! Not even through the longest and most complex cause-effect chain. That is what spurious correlation is all about: it occurs between two variables that are supposed to be mutually independent. Therefore, a high correlation between such variables indicates only a mathematical result. One must arrive at a conclusion based on logical reasoning and intelligent investigation of significantly related variables.
It may be noted here that in simple correlation analysis we have only two variables, and the terms 'dependent variable' and 'independent variable' refer to the mathematical or functional meaning of dependence; they do not imply that there is necessarily any cause-and-effect relationship between the variables. For instance, while estimating the demand for an FMCG product from figures on sales promotion expenditure, demand is generally considered the dependent variable. However, there may or may not be a causal relationship between these two variables in the sense that changes in sales promotion cause changes in demand. In fact, in a few cases, the cause-effect relationship may be just the opposite of what appears to be the obvious one.
Types of Correlation:
Some of the important types of correlation are as follows:
(a) Positive Correlation: When variations in two variables move in the same direction, i.e. both variables are either increasing or decreasing.
In above graph, both variables are increasing that signifies a positive correlation.
(b) Negative Correlation: When variations in the two variables move in opposite directions, i.e. one is increasing while the other is decreasing.
In the above graph, variable X increases and Y decreases. Thus, it is a negative correlation.
(c) Linear Correlation: When the amount of change in one variable tends to bear a constant ratio to the amount of change in the other variable, the correlation is said to be linear.
In the above figure, it is easy to understand that changes in Y are proportional to the changes in X. Therefore, it is a perfect linear relationship. A linear correlation may be positive or negative.
(d) Non-linear Correlation: When the amount of variation in the two variables does not bear a constant ratio, the correlation is said to be non-linear.
In the above graph, one can easily see that variations in study hours and variations in final score do not show a constant ratio. Beyond a level, a small increase in study hours brings a very large increase in final score.
(e) Simple, Multiple and Partial Correlation: In simple correlation, we study only two variables simultaneously. Multiple correlation is used when we try to find out the relationship among more than two variables simultaneously. Partial correlation is used when we try to find out the relationship between two variables while holding the effect of the other variables constant.
3. Methods of Correlation:
There are different methods of determining the association/relationship between variables, but none of them can tell us with certainty that a correlation is indicative of a causal relationship. Therefore, we have to answer two types of questions in a bivariate (only two variables) population, viz.
- Is there any association between the two variables?
- If yes, is there any cause-and-effect relationship between them?
The first question is answered by the use of the correlation technique and the second question by the technique of regression. In the case of a bivariate population (two variables), correlation can be studied through:
(a)Scatter Diagramme;
(b)Karl Pearson’s coefficient of correlation;
(c)Charles Spearman’s coefficient of correlation;
Whereas the cause-and-effect relationship can be studied through simple regression equations. Finding correlation among more than two variables is beyond the scope of this module.
Scatter Diagram Method:
A scatter diagram is a graph of observed plotted points where each point represents the values of X and Y as its coordinates. It gives us a rough graphical idea about the relationship between two variables.
The above figure depicts different scatter diagrams. One can easily get an idea about the direction and strength of the relationship. However, the exact degree of relationship cannot be ascertained from such diagrams.
Advantages:
- It is the first step in investigating the relationship between two variables
- It is a simple and non-mathematical method
- It is not influenced by the size of extreme items
Disadvantage:
- It cannot measure the exact degree of relationship.
Karl Pearson’s Method:
It is the most commonly used method for finding out the relationship between variables. It measures the direction as well as the degree of relationship. The degree of correlation between two variables (measured on an interval or ratio scale) can be measured through Pearson's correlation coefficient (r).
When deviations are taken from the actual means:
r = Σdxdy / √(Σdx² × Σdy²)
(also known as the covariance method)
When deviations are taken from an assumed mean:
r = [N Σdxdy − Σdx Σdy] / √{[N Σdx² − (Σdx)²] [N Σdy² − (Σdy)²]}
Value of the correlation coefficient for bivariate frequency data:
r = [N Σfdxdy − Σfdx Σfdy] / √{[N Σfdx² − (Σfdx)²] [N Σfdy² − (Σfdy)²]}
The value of r always lies between −1 and 1, i.e. −1 ≤ r ≤ 1. If both variables increase or decrease together (same direction), we say that there is positive correlation between them. However, if one decreases when the other increases (or vice versa), then we say that they are negatively or inversely correlated.
Procedure for Calculating Pearson’s coefficient of correlation:
• Calculate the means of the two series 'X' and 'Y'.
• Calculate the deviations 'dx' and 'dy' in the two series from their respective means.
• Square each deviation 'dx' and 'dy', then obtain the sums of the squared deviations, i.e. Σdx² and Σdy².
• Multiply each deviation under X by the corresponding deviation under Y and obtain the products 'dxdy'. Then obtain the sum of these products, i.e. Σdxdy.
• Substitute the values in the formula.
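The steps above can be sketched in Python; the paired observations below are hypothetical, used only to illustrate the deviation-from-actual-mean formula:

```python
import math

# Hypothetical paired observations (X, Y)
x = [10, 12, 14, 16, 18]
y = [20, 25, 28, 33, 39]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Deviations from the respective means (dx, dy)
dx = [xi - mean_x for xi in x]
dy = [yi - mean_y for yi in y]

sum_dxdy = sum(a * b for a, b in zip(dx, dy))  # Σdxdy
sum_dx2 = sum(a * a for a in dx)               # Σdx²
sum_dy2 = sum(b * b for b in dy)               # Σdy²

# r = Σdxdy / √(Σdx² × Σdy²)
r = sum_dxdy / math.sqrt(sum_dx2 * sum_dy2)
print(round(r, 4))  # → 0.9944, a strong positive correlation
```

The same value can be cross-checked with `scipy.stats.pearsonr` if SciPy is available.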
Interpretation of Correlation Coefficient (r):
The extreme values of r, i.e. r = ±1, show that there is perfect (positive or negative) correlation between X and Y. The remaining values of r, which lie in subintervals of (−1, 1), describe the association/relationship in terms of its strength. One may use the following figure as a guideline as to which adjective should be used for a calculated value of r to describe the relationship.
Note
When r = 0, we cannot say that there is no correlation at all between X and Y. Pearson's correlation coefficient is meant to measure linear relationships only. It should not be used in the case of non-linear relationships, since doing so will obviously lead to a wrong interpretation.
Advantages of Correlation Coefficient:
- It summarizes in one value both the degree and the direction of correlation;
- Since its value is unitless, it is independent of the change of origin and scale.
Limitations of Correlation Coefficient:
- It always assumes a linear relationship between the variables
- Interpreting the value of r is difficult; there are chances of wrong interpretation
- The value of the correlation coefficient is affected by extreme values
- It is a time-consuming method
Note: Usually, the coefficient of determination (r²) is used to interpret the value of the coefficient of correlation (r). The coefficient of determination (r²) measures the common variance.
Charles Spearman's Method:
This method is used when we are given rank data (ordinal data). It is based on the order (ranks) of the given observations.
Spearman's coefficient of correlation is represented by R and is calculated as:
R = 1 − (6 ΣD²) / [N (N² − 1)]
Where
R = Rank correlation coefficient
D = Difference of ranks between paired items in the two series.
N = Total number of observations.
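As a sketch, the formula can be applied directly to two hypothetical rank series (e.g. rankings of six candidates by two judges):

```python
# Hypothetical ranks given by two judges to six candidates
rank1 = [1, 2, 3, 4, 5, 6]
rank2 = [2, 1, 4, 3, 6, 5]

n = len(rank1)
d2 = sum((a - b) ** 2 for a, b in zip(rank1, rank2))  # ΣD²

# R = 1 − (6 ΣD²) / [N (N² − 1)]
R = 1 - (6 * d2) / (n * (n ** 2 - 1))
print(round(R, 4))  # → 0.8286, strong agreement between the judges
```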
Interpretation of Rank Correlation Coefficient (R)
•The value of rank correlation coefficient, R ranges from -1 to +1
•If R = +1, then there is complete agreement in the order of the ranks and the ranks are in the same direction
•If R = -1, then there is complete agreement in the order of the ranks and the ranks are in the opposite direction
•If R = 0, then there is no correlation
Problems in Rank Correlation Coefficient:
Problems where Ranks are not given: If the ranks are not given, we need to assign ranks to the data series. The ranking can be done in ascending or descending order, but we must follow the same scheme of ranking for both series.
Equal Ranks or tie in Ranks: In such cases average ranks should be assigned to each individual.
R = 1 − [6 (ΣD² + AF)] / [N (N² − 1)]
Where
AF = 1/12 (m1³ − m1) + 1/12 (m2³ − m2) + …, one term for each group of tied ranks
m = Number of times an item is repeated
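The averaged-rank and tie-correction steps can be sketched in Python; the helper names and scores below are hypothetical, not from the module:

```python
from collections import Counter

def average_ranks(values):
    # Rank in descending order; tied values share the average of their positions
    order = sorted(values, reverse=True)
    return [order.index(v) + (order.count(v) + 1) / 2 for v in values]

def tie_factor(values):
    # AF contribution: Σ (m³ − m)/12 over every group of m tied values
    return sum((m ** 3 - m) / 12 for m in Counter(values).values() if m > 1)

def spearman_with_ties(x, y):
    n = len(x)
    rx, ry = average_ranks(x), average_ranks(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))  # ΣD² on averaged ranks
    af = tie_factor(x) + tie_factor(y)
    # R = 1 − 6(ΣD² + AF) / [N(N² − 1)]
    return 1 - 6 * (d2 + af) / (n * (n ** 2 - 1))

# Hypothetical scores with a tie in the second series (two values of 15)
print(spearman_with_ties([10, 20, 30, 40], [15, 15, 35, 45]))  # → 0.9
```

Here the tie in the second series gives average rank 3.5 to both 15s, ΣD² = 0.5 and AF = 0.5, so R = 1 − 6(1.0)/60 = 0.9.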
Merits/ Demerits of Spearman’s Rank Correlation:
Merits:
•This method is simpler to understand and easier to apply compared to Karl Pearson's correlation method.
•This method is useful where we can give ranks but not the actual data (qualitative data).
•This method can be used where the initial data are in the form of ranks.
Demerits:
•It cannot be used for finding out correlation in a grouped frequency distribution.
•This method should not be applied where N exceeds 30, as the calculations become tedious.
4. Finding out the Significance of Pearson's Correlation Coefficient (r):
As we know, the correlation coefficient (r) is usually calculated to get an idea about the extent of the relationship between two variables. Most of the time, we take a sample from a large population to calculate the correlation coefficient (r), which is then used as a point estimate of the population correlation coefficient (ρ). It means that 'ρ' is estimated by 'r'.
Therefore, we can say that 'r' is used as an estimate of the population parameter (ρ).
Here, one may note that 'r' may be considered an estimate of 'ρ' only if the assumption of normal distribution of both variables holds true.
The most widely used test to investigate whether the two variables X and Y are correlated or not is the t-test. For using the t-test, we frame our hypotheses as follows:
H0: ρ = 0
Ha: ρ ≠ 0
The null hypothesis (H0) assumes that the variables X and Y are not correlated in the population. The alternative hypothesis (Ha) is just the opposite: it assumes that the variables are correlated in the population.
Now we can calculate the t-value using the following formula:
t = r √(n − 2) / √(1 − r²)
Where
r = Pearson's coefficient of correlation
n = number of observations in the sample
(n − 2) = degrees of freedom
Here, we have assumed that the population correlation (ρ) equals zero. We get the value of t after putting in the values of r and n. This is said to be the calculated value of t.
Now, we look up the table value of t for the level of significance (α) = 0.05 and degrees of freedom = (n − 2).
(a) If tcal > ttable, we reject H0; it means r is significantly different from zero and the two variables are correlated in the population.
(b) If tcal < ttable, we fail to reject H0; it means r is not significantly different from zero and the two variables are not correlated in the population.
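The test can be sketched numerically; the sample values of r and n below are hypothetical, and the critical value is taken from standard two-tailed t-tables:

```python
import math

# Hypothetical sample: r = 0.6 calculated from n = 27 paired observations
r, n = 0.6, 27

# t = r √(n − 2) / √(1 − r²), with (n − 2) degrees of freedom
t_cal = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
print(round(t_cal, 2))  # → 3.75

# Two-tailed critical value for α = 0.05 and df = 25, from t-tables
t_table = 2.06
print(t_cal > t_table)  # → True: reject H0; the correlation is significant
```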
Standard Error of Correlation Coefficient (r):
We can also calculate the standard error of r using the following formula:
S.E. = (1 − r²) / √n
Where
r = Correlation coefficient
n = Number of observations in sample
Probable Error of Correlation Coefficient (r):
We can also estimate the range within which the population correlation coefficient lies with the help of the 'probable error':
P.E. = 0.6745 × S.E. = 0.6745 (1 − r²) / √n
ρ is expected to lie in the interval r ± P.E.
Lower limit of the population correlation coefficient = r − P.E.
Upper limit of the population correlation coefficient = r + P.E.
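A numeric sketch with a hypothetical sample (r = 0.8, n = 64); 0.6745 is the standard constant used in the probable-error formula:

```python
import math

r, n = 0.8, 64  # hypothetical sample correlation and sample size

se = (1 - r ** 2) / math.sqrt(n)  # S.E. = (1 − r²)/√n
pe = 0.6745 * se                  # P.E. = 0.6745 × S.E.

print(round(se, 4))  # → 0.045
print(round(pe, 4))  # → 0.0304
print(round(r - pe, 4), round(r + pe, 4))  # likely limits for the population ρ
```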
5. Coefficient of Determination (r²):
The most widely used and convenient way to interpret the value of the correlation coefficient between two variables is the coefficient of determination. The square of r is called the coefficient of determination (r²); it measures the common variation in the two variables and is a very useful measure of the linear covariation of two variables.
Coefficient of Determination (r²) = Explained variance / Total variance
For example, if r = 0.50, then r² = 0.25, i.e. r² = 25/100.
It may be interpreted as: out of the total variation (100 percent), 25 percent of the variation in the dependent variable may be explained by variation in the independent variable. The value of r² ranges from 0 to 1.
The coefficient of determination is widely used in regression analysis to assess the goodness of fit of the regression model. The greater the value of r², the better the fit and the more useful the regression equation as a predictive instrument.
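The interpretation can be illustrated with a short loop over hypothetical values of r; note that doubling r quadruples the share of variation explained:

```python
# Hypothetical values of r and the share of variation they explain
for r in (0.4, 0.5, 0.6, 0.8):
    r2 = r ** 2  # coefficient of determination
    print(f"r = {r:.1f}  ->  r^2 = {r2:.2f}  ({r2:.0%} of total variation explained)")
```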
6. Limitations of Correlation Analysis:
From the discussion so far, one can easily understand that correlation analysis is a statistical tool used to find out the association/relationship between variables. It must be used very carefully to avoid misleading conclusions/interpretations. The most common mistakes are as hereunder:
- The correlation coefficient (r) gives us an idea about the linear relationship between the variables. As the absolute value of r increases from 0 to 1, the linear association between the variables also increases. A value of r = 0 does not show the absence of any relationship; in this case, the variables are not linearly related to one another, but they may be associated in some other manner.
- For values of r = 0.8 and r = 0.4, we cannot say that in the first case the degree of relationship between the variables is double that in the second case. The correlation coefficient does not follow the principle of proportionality.
- A value of r = 0.6 does not mean that the correlation explains 60% of the total variation. Rather, r = 0.6 means that only 36% of the variation in one variable is explained by the other (as r² = 0.36).
- One may mistake the presence of a strong correlation between the variables for a cause-effect relationship (causation). Actually, correlation says nothing about causation. In the case of two variables, causation may be established only with the help of simple regression analysis.
- A very common mistake in the interpretation of the correlation coefficient takes place when we conclude a strong relationship between variables that are actually unrelated. For example, sales of Hero bikes in New Delhi and the number of accidents in Mumbai may seem to be correlated because they show similar movements, but it is actually not possible to link them.
7. Summary:
Correlation analysis is a very good technique for finding out the association or relationship between two or more variables. In simple correlation, we consider only two variables. In multiple correlation, we try to find out the correlation among more than two variables. In partial correlation, we try to find out the correlation between two variables by partialling out the effect of the other variables. Pearson's coefficient of correlation is the most widely used technique for finding out the correlation between variables (recorded on an interval/ratio scale). It assumes a linear relationship between the variables and a normal distribution of the variables. The calculation of the correlation coefficient is a little complex and is affected by extreme values. For rank data, Spearman's correlation coefficient is used; it is a distribution-free coefficient. The significance of Pearson's coefficient of correlation can be tested using the t-test. The coefficient of determination (r²) is widely used in determining the common variation between the related variables. The existence of correlation between variables does not signify any cause-effect relationship; causation may be established by simple regression analysis.