25 Test of Goodness of Fit and Independence: Chi-Square Test as a Test of Independence
Dr Deependra Sharma
Learning objectives
After reading this module, students will be able to:
- Understand the concept of non-parametric tests.
- Apply the Chi-square test as a test of independence.
- Describe the procedure for conducting a Chi-square test.
Introduction
The choice of tool for analyzing a given set of data depends on the following factors:
- Size of the sample
- Size of the population
- Scale used for measuring the data
- Dependency of the measurements
Statistical tests fall into two broad categories: parametric and non-parametric. Three parametric tests, namely the t, z and F tests, are used to estimate and test population parameters. Their application requires that:
- The data are measured on an interval or ratio scale.
- The hypothesis concerns specific population parameters.
- The population is assumed to be normally distributed, and it is clear whether the population standard deviation is known.
When these conditions are not met, non-parametric (distribution-free) tests are applied. These tests:
- Do not require a specific population distribution, and can be used with nominal or ordinal data.
- Do not take population parameters into consideration.
- Do not require a normally distributed population.
Non-parametric tests are easy to apply and give broad-based conclusions with approximate solutions. The chi-square (χ²) test is one of the non-parametric tests used to test hypotheses.
χ² Test for Independence
The test is applicable when there are two categorical variables from a single population. Its purpose is to determine whether there is a significant association between the two variables. For example, in an election survey, voters may be categorized by gender (male or female) and by party inclination (Democrat, Republican, or Independent). The chi-square test for independence is conducted to determine whether gender is related to party inclination.
This test is suitable under the following conditions:
- The sample is selected through simple random sampling.
- The variables are categorical.
- The expected frequency count for each cell of the contingency table is at least 5.
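The expected-count condition can be checked before running the test. The following is a minimal sketch in Python, assuming the usual rule that an expected count is the row total times the column total divided by the grand total. The function names and both small tables are hypothetical and serve only as illustration.

```python
def expected_counts(observed):
    """Expected cell counts under independence:
    row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def meets_expected_count_rule(observed, minimum=5):
    """Return True when every expected cell count is at least `minimum`."""
    return all(e >= minimum for row in expected_counts(observed) for e in row)


# Hypothetical 2x3 tables, invented for illustration only.
large_sample = [[30, 20, 10], [20, 25, 15]]
small_sample = [[3, 2, 1], [2, 4, 3]]

print(meets_expected_count_rule(large_sample))  # smallest expected count is 12.5
print(meets_expected_count_rule(small_sample))  # smallest expected count is 1.6
```

Note that the rule applies to the expected counts, not to the observed counts.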
Procedure
The procedure for testing the association between two categorical variables, where the sample data are presented as a contingency table with r rows and c columns, is summarized below.
1. State the null and alternative hypotheses.
H0: No relationship or association exists between the variables.
Ha: A relationship or association exists between the variables, i.e., they are related.
2. Select a random sample, record the observed frequency (O) in each cell of the contingency table, and calculate the row totals, column totals and grand total.
3. Calculate the expected frequency (E) for each cell:
E = (Row total × Column total) / Grand total
4. Compute the value of the test statistic:
χ² = Σ [(O - E)² / E]
where O is the observed frequency count and E is the expected frequency count, and the sum runs over all cells of the table.
5. Calculate the degrees of freedom:
df = (r - 1) × (c - 1)
where r is the number of levels of the row variable and c is the number of levels of the column variable.
6. Use the level of significance α and df to find the table (critical) value of χ².
7. Compare the calculated value with the table value. If the calculated value of χ² is less than the table value, do not reject the null hypothesis; otherwise, reject it.
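Steps 2 to 7 can be sketched in a few lines of Python. This is an illustrative sketch, not a reference implementation: the function name and the 2x2 table are hypothetical, and the critical value 3.841 (df = 1, α = 0.05) is read from a standard chi-square table rather than computed.

```python
def chi_square_independence(observed):
    """Return (statistic, degrees of freedom, expected table)
    for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    # Step 3: E = row total * column total / grand total
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

    # Step 4: chi-square = sum of (O - E)^2 / E over all cells
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )

    # Step 5: df = (rows - 1) * (columns - 1)
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df, expected


# Hypothetical 2x2 table, invented for illustration only.
observed = [[20, 30], [30, 20]]
statistic, df, expected = chi_square_independence(observed)

# Steps 6 and 7: compare with the table value for df = 1, alpha = 0.05.
critical_value = 3.841
decision = "reject H0" if statistic > critical_value else "do not reject H0"
print(statistic, df, decision)
```

For this hypothetical table every expected count is 25, so each cell contributes (5)²/25 = 1 to the statistic.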
Example
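A simple random sample of 2000 prospective voters was taken. They were categorized by gender (M/F) and by party liking (Republican, Democrat, or Independent). The contingency table below shows the result, with the observed counts as used in the solution that follows.
Gender | Republican | Democrat | Independent | Total |
M | 400 | 300 | 100 | 800 |
F | 500 | 600 | 100 | 1200 |
Total | 900 | 900 | 200 | 2000 |
Does the party liking of M voters differ significantly from that of F voters? Use a 0.05 level of significance.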
Solution
Following the procedure described above:
The first step is to state the null hypothesis and the alternative hypothesis.
H0: Gender and party liking are independent.
Ha: Gender and party liking are not independent.
For this analysis, the significance level is 0.05, and the chi-square test for independence is used.
Next, the degrees of freedom, the expected frequency counts, and the chi-square test statistic are calculated.
df = (r - 1) × (c - 1) = (2 - 1) × (3 - 1) = 2
E1,1 = (800 × 900) / 2000 = 360
E1,2 = (800 × 900) / 2000 = 360
E1,3 = (800 × 200) / 2000 = 80
E2,1 = (1200 × 900) / 2000 = 540
E2,2 = (1200 × 900) / 2000 = 540
E2,3 = (1200 × 200) / 2000 = 120
χ² = Σ [(O - E)² / E], summed over all six cells
χ² = (400 - 360)²/360 + (300 - 360)²/360 + (100 - 80)²/80 + (500 - 540)²/540 + (600 - 540)²/540 + (100 - 120)²/120
χ² = 4.44 + 10.00 + 5.00 + 2.96 + 6.67 + 3.33 = 32.40
The calculated value of the chi-square statistic, 32.40 with 2 degrees of freedom, is greater than the table value of 5.99 at α = 0.05 (refer to the chi-square table). Hence, the null hypothesis is rejected, and we conclude that there is a relationship between gender and party liking.
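The arithmetic of this example can be checked with a short Python sketch. The observed counts are taken from the terms of the χ² sum above; the row and column labels in the comments are assumptions based on the order in which the categories are listed. For 2 degrees of freedom only, the upper-tail probability of the chi-square distribution is exp(-x/2), so the critical value at level α is -2 ln α. This shortcut does not hold for other degrees of freedom.

```python
import math

# Observed counts as used in the worked solution.
# Assumed layout: rows = gender (M, F);
# columns = Republican, Democrat, Independent.
observed = [[400, 300, 100],
            [500, 600, 100]]

row_totals = [sum(row) for row in observed]          # 800, 1200
col_totals = [sum(col) for col in zip(*observed)]    # 900, 900, 200
grand_total = sum(row_totals)                        # 2000

expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

statistic = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(row_totals) - 1) * (len(col_totals) - 1)

# Valid for df = 2 only: P(chi-square > x) = exp(-x / 2).
alpha = 0.05
critical_value = -2 * math.log(alpha)

print(round(statistic, 2), df, round(critical_value, 2))
print("reject H0" if statistic > critical_value else "do not reject H0")
```

The computed critical value agrees with the tabulated 5.99 used for df = 2 at the 0.05 level.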
Self-Check Questions:
Question 1: Two hundred randomly selected adults were asked whether TV shows as a whole are primarily entertaining, educational or a waste of time. The respondents were categorized by gender. Their opinions are given in the following table:
Gender | Entertaining | Educational | Waste of time | Total |
Female | 52 | 28 | 30 | 110 |
Male | 28 | 12 | 50 | 90 |
Total | 80 | 40 | 80 | 200 |
Does this evidence show convincingly that there is a relationship between gender and opinion in the population of interest?
Solution: Let us take the null hypothesis that the opinion of adults is independent of gender.
Since the contingency table is of size 2×3, the degrees of freedom are (2 - 1)(3 - 1) = 2. This implies that we need to calculate only two expected frequencies from the formula; the other four are then determined automatically from the row and column totals, as shown below:
E11 = (Row 1 total × Column 1 total) / Grand total = (110 × 80) / 200 = 44
E12 = (Row 1 total × Column 2 total) / Grand total = (110 × 40) / 200 = 22
E13 = 110 - (44 + 22) = 44
E21 = 80 - E11 = 80 - 44 = 36
E22 = 40 - E12 = 40 - 22 = 18
E23 = 80 - E13 = 80 - 44 = 36
The contingency table of expected frequencies is as follows:
Gender | Entertaining | Educational | Waste of time | Total |
Female | 44 | 22 | 44 | 110 |
Male | 36 | 18 | 36 | 90 |
Total | 80 | 40 | 80 | 200 |
The observed and expected frequencies are arranged as follows to calculate the value of the χ² test statistic:
Observed (O) | Expected (E) | O - E | (O - E)² | (O - E)²/E |
52 | 44 | 8 | 64 | 1.455 |
28 | 22 | 6 | 36 | 1.636 |
30 | 44 | -14 | 196 | 4.455 |
28 | 36 | -8 | 64 | 1.778 |
12 | 18 | -6 | 36 | 2.000 |
50 | 36 | 14 | 196 | 5.444 |
Total | | | | 16.768 |
Since the calculated value of χ² = 16.768 is greater than its critical value, χ² = 5.99 at α = 0.05 and df = 2, the null hypothesis is rejected. Hence, we conclude that the opinion of adults is not independent of gender.
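The degrees-of-freedom shortcut used in this solution can be illustrated in Python: only two expected counts come from the formula, and the remaining four follow by subtraction from the row and column totals. This is a sketch for this 2×3 table only; the variable names are illustrative.

```python
# Observed counts from the question (rows: Female, Male;
# columns: Entertaining, Educational, Waste of time).
observed = [[52, 28, 30],
            [28, 12, 50]]

row_totals = [sum(row) for row in observed]          # 110, 90
col_totals = [sum(col) for col in zip(*observed)]    # 80, 40, 80
grand_total = sum(row_totals)                        # 200

# df = 2, so only two expected counts need the formula.
e11 = row_totals[0] * col_totals[0] / grand_total
e12 = row_totals[0] * col_totals[1] / grand_total

# The other four are fixed by the row and column totals.
e13 = row_totals[0] - (e11 + e12)
e21 = col_totals[0] - e11
e22 = col_totals[1] - e12
e23 = col_totals[2] - e13

expected = [[e11, e12, e13], [e21, e22, e23]]

statistic = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

critical_value = 5.99  # table value for df = 2, alpha = 0.05
print(expected)
print(round(statistic, 3), statistic > critical_value)
```

The expected counts produced this way match the table of expected frequencies above, which is what "two degrees of freedom" means in practice: two cells are free, and the rest are determined.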
Question 2: A sample analysis of the examination results of 500 students was made. It was found that 220 students had failed, 170 had secured a third division, 90 were placed in the second division and 20 got a first division. Are these figures commensurate with the general examination result, which is in the ratio 4:3:2:1 for the respective categories?
Solution: Let us take the null hypothesis that the observed results are commensurate with the general examination result, which is in the ratio 4:3:2:1.
The expected numbers of students who failed, obtained a third division, a second division and a first division, respectively, are:
E1 = 500 × 4/10 = 200, E2 = 500 × 3/10 = 150, E3 = 500 × 2/10 = 100 and E4 = 500 × 1/10 = 50
The table of observed and expected frequencies is as follows:
Category | O | E | (O - E)² | (O - E)²/E |
Failed | 220 | 200 | 400 | 2 |
3rd division | 170 | 150 | 400 | 2.667 |
2nd division | 90 | 100 | 100 | 1 |
1st division | 20 | 50 | 900 | 18 |
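Total | | | | 23.667 |
Since the calculated value of χ² = 23.667 is greater than its table value, χ² = 7.81 at α = 0.05 level of significance and df = n - 1 = 4 - 1 = 3, the null hypothesis is rejected. Hence, we conclude that the observed results are not commensurate with the general examination result.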
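Unlike the other questions, this one is a goodness-of-fit test: the expected counts come from a stated ratio rather than from row and column totals. A minimal Python sketch of the calculation follows; the variable names are illustrative, and the critical value 7.81 is the table value quoted in the solution.

```python
# Observed counts: failed, third, second and first division.
observed = [220, 170, 90, 20]
ratio = [4, 3, 2, 1]  # hypothesized ratio for the same categories

n = sum(observed)              # 500 students
ratio_total = sum(ratio)       # 10 parts

# Expected count for each category = n * (its share of the ratio).
expected = [n * part / ratio_total for part in ratio]

statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# For goodness of fit, df = number of categories - 1.
df = len(observed) - 1

critical_value = 7.81  # table value for df = 3, alpha = 0.05
print(expected)
print(round(statistic, 3), df, statistic > critical_value)
```

Most of the statistic comes from the first-division category, where 20 students were observed against 50 expected.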
Question 3: Based on information on 1000 randomly selected fields about the tenancy status of their cultivation and the use of fertilizers, collected in an agro-economic survey, the following classification was noted:
Use of fertilizers | Owned | Rented | Total |
Using fertilizers | 416 | 184 | 600 |
Not using fertilizers | 64 | 336 | 400 |
Total | 480 | 520 | 1000 |
Solution: Let us take the null hypothesis that ownership of fields and the use of fertilizers are independent attributes. Since the contingency table is of size 2×2, the degrees of freedom are (2 - 1)(2 - 1) = 1. This implies that we need to calculate only one expected frequency from the formula; the others are then determined from the row and column totals as follows:
E11 = (600 × 480) / 1000 = 288
E12 = 600 - 288 = 312
E21 = 480 - 288 = 192
E22 = 400 - 192 = 208
The observed and expected frequencies are arranged as follows to calculate the value of the χ² test statistic:
Observed (O) | Expected (E) | (O - E)² | (O - E)²/E |
416 | 288 | 16,384 | 56.889 |
64 | 192 | 16,384 | 85.333 |
184 | 312 | 16,384 | 52.513 |
336 | 208 | 16,384 | 78.769 |
Total | | | 273.504 |
The calculated value of χ² = 273.504 is much greater than its table value, χ² = 3.84, at α = 0.05 level of significance and df = (r - 1)(c - 1) = (2 - 1)(2 - 1) = 1. The null hypothesis H0 is therefore rejected. Since 416 of the 480 owned fields (about 87%) use fertilizers, against 184 of the 520 rented fields (about 35%), it can be concluded that owner cultivators are more inclined towards the use of fertilizers.
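The χ² test shows only that an association exists; the direction of the association comes from comparing proportions. The Python sketch below recomputes the statistic for this 2×2 table and then compares the share of owned and rented fields that use fertilizers. The variable names are illustrative.

```python
# Observed counts (rows: using, not using fertilizers;
# columns: owned, rented).
observed = [[416, 184],
            [64, 336]]

row_totals = [sum(row) for row in observed]          # 600, 400
col_totals = [sum(col) for col in zip(*observed)]    # 480, 520
grand_total = sum(row_totals)                        # 1000

# df = 1: one expected count from the formula, the rest by subtraction.
e11 = row_totals[0] * col_totals[0] / grand_total
e12 = row_totals[0] - e11
e21 = col_totals[0] - e11
e22 = row_totals[1] - e21
expected = [[e11, e12], [e21, e22]]

statistic = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Direction of the association: share of fields using fertilizers.
share_owned = observed[0][0] / col_totals[0]
share_rented = observed[0][1] / col_totals[1]

critical_value = 3.84  # table value for df = 1, alpha = 0.05
print(round(statistic, 3), statistic > critical_value)
print(round(share_owned, 3), round(share_rented, 3))
```

All four cells have the same squared deviation, (O - E)² = 16,384, because in a 2×2 table the deviations can differ only in sign.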
Summary
Statistical tests fall into two broad categories: parametric and non-parametric. Three parametric tests, namely the t test, z test and F test, are used to estimate and test population parameters. Their application requires that the data are measured on an interval or ratio scale, that the hypothesis concerns specific population parameters, that the population is assumed to be normal, and that it is clear whether the population standard deviation is known.
When these conditions are not met, non-parametric (distribution-free) tests are applied. These tests are easy to apply and can use nominal or ordinal data. They give broad-based conclusions with approximate solutions and do not require a normally distributed population. The chi-square test is one of the non-parametric tests used to test hypotheses. The chi-square test for independence is applied when there are two categorical variables from a single population; it is used to determine whether there is a significant association between the two variables.
Learn More:
- Sharma, J K (2014), Business Statistics, S Chand & Company, New Delhi.
- Bajpai, N (2010), Business Statistics, Pearson, New Delhi.
- Trevor Hastie, Robert Tibshirani, Jerome Friedman (2009), The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd Edition, Springer.
- Darrell Huff (2010), How to Lie with Statistics, W. W. Norton, California.
- K.R. Gupta (2012), Practical Statistics, Atlantic Publishers & Distributors (P) Ltd., N. Delhi.