25 Test of Goodness of Fit and Independence: Chi-Square-test-as a test of independence

Dr Deependra Sharma

 

 

Learning objective

After reading this module, students will be able to:

  • Understand the concept of non-parametric tests.
  • Apply the chi-square test as a test of independence.
  • Gain knowledge about the procedure for conducting the chi-square test.

 

Introduction

 

A given set of data can be analyzed with a variety of tools. The choice of tool depends on the following:

  • Size of the sample
  • Size of the population
  • Scale used for measurement of the data
  • Dependency of the measurements

 

Tests may be classified into two categories: parametric and non-parametric. Three tests, namely the t, z and F tests, are used to estimate and test population parameters. The prerequisites for applying these tests are:

  • The data are measured on an interval or ratio scale.
  • The hypothesis concerns specific population parameters.
  • The population is assumed to be normal, and it is clear whether or not the population standard deviation is known.

 

When these conditions are not met, non-parametric (distribution-free) tests are applied. These tests:

  • Do not require a specific population distribution, and the data can be nominal or ordinal.
  • Do not take population parameters into consideration.
  • Do not require a normally distributed population.

 

These tests are easy to apply and can be computed from nominal or ordinal data. They give broad-based conclusions with approximate solutions and do not necessarily require a normally distributed population. The chi-square (χ²) test is one of the non-parametric tests used to test hypotheses.

 

χ² test for Independence

The test is applicable when there are two categorical variables from a single population. Its purpose is to find out whether there is a significant association between the two variables. For example, in an election survey, voters may be categorized by gender (male or female) and by party inclination (Democrat, Republican, or Independent). The chi-square test for independence is conducted to determine whether gender is related to party inclination.

 

This test is suitable under the following conditions:

 

  • The sample is selected through simple random sampling.
  • The variables are categorical in nature.
  • The expected frequency count for each cell of the contingency table is not less than 5.

 

Procedure

 

The procedure to test the association between two categorical variables, where the sample data are presented in the form of a contingency table with r rows and c columns, is summarized as follows:

 

1. State the null and alternative hypotheses

 

H0: No relationship or association exists between variables.

 

Ha:  A relationship or association exists between variables i.e., they are related.

 

2.  Select a random sample and record the observed frequencies (O) in each cell of the contingency table and calculate the row, column and grand total.

 

3. Calculate the expected frequencies (E) for each cell:

 

E = (Row total × Column total) / Grand total

4. Compute the value of the test statistic, χ² = Σ [(O – E)² / E],

 

where O is the observed frequency count and E is the expected frequency count.

 

5. Calculate the degrees of freedom

 

df = (c – 1) * (r – 1)

 

where c is the number of levels for one categorical variable, and r is the number of levels for the other categorical variable.

 

6. Use the level of significance α and df to find the table (critical) value of χ².

7. Compare the calculated and table values. If the calculated value of chi-square is less than the table value, do not reject the null hypothesis; otherwise, reject it.
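The steps above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name chi_square_independence, the dictionary CRITICAL_0_05 and the small 2 x 2 table are hypothetical and do not come from the module. The critical values are the usual table values of chi-square at α = 0.05.

```python
def chi_square_independence(observed):
    """Return (chi2, df, expected) for a contingency table given as a list of rows."""
    # Step 2: row totals, column totals and grand total.
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    # Step 3: E = row total * column total / grand total, for every cell.
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

    # Step 4: chi-square = sum over all cells of (O - E)^2 / E.
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )

    # Step 5: df = (rows - 1) * (columns - 1).
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df, expected


# Table values of chi-square at alpha = 0.05 for small df (standard table).
CRITICAL_0_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}

# Hypothetical 2 x 2 table whose rows are exactly proportional,
# so the observed and expected counts coincide.
table = [[10, 20], [20, 40]]
chi2, df, expected = chi_square_independence(table)
reject_h0 = chi2 > CRITICAL_0_05[df]  # Steps 6 and 7
print(chi2, df, reject_h0)  # 0.0 1 False
```

In this hypothetical table the observed counts equal the expected counts in every cell, so the statistic is zero and the null hypothesis of independence is not rejected.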

 

Example

 

A simple random sample of 2000 prospective voters was taken. They were categorized on the basis of gender (male or female) and on the basis of party preference (Republican, Democrat, or Independent). The contingency table given below shows the result:

Gender   Republican   Democrat   Independent   Total
Male        400          300         100         800
Female      500          600         100        1200
Total       900          900         200        2000

Do the party preferences of men differ significantly from those of women? Use a 0.05 level of significance.

 

Solution

 

The procedure discussed above is followed.

The first step is to state the null hypothesis and the alternative hypothesis.

H0: Gender and party preference are independent.

Ha: Gender and party preference are not independent.

For this analysis, the significance level is 0.05, and the chi-square test for independence is used.

 

Next, the degrees of freedom, the expected frequency counts, and the chi-square test statistic are calculated.

df = (r – 1) * (c – 1) = (2 – 1) * (3 – 1) = 2

E1,1 = (800 * 900) / 2000 = 720000/2000 = 360

E1,2 = (800 * 900) / 2000 = 360

E1,3 = (800 * 200) / 2000 = 80

E2,1 = (1200 * 900) / 2000 = 540

E2,2 = (1200 * 900) / 2000 = 540

E2,3 = (1200 * 200) / 2000 = 120

 

χ² = Σ [(Or,c – Er,c)² / Er,c]

χ² = (400 – 360)²/360 + (300 – 360)²/360 + (100 – 80)²/80 + (500 – 540)²/540 + (600 – 540)²/540 + (100 – 120)²/120

χ² = 4.44 + 10.00 + 5.00 + 2.96 + 6.67 + 3.33 = 32.4

 

This calculated value of the chi-square statistic, with 2 degrees of freedom, is more than the table value of 5.99 at α = 0.05 (refer to the chi-square table); hence, the null hypothesis is rejected. Thus, we conclude that there is a relationship between gender and party preference.
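The arithmetic of this example can be checked with a short Python sketch. The observed counts are those used in the solution above, taken in the same row and column order as E1,1 to E2,3. The variable names are illustrative only, and 5.991 is the table value of chi-square for df = 2 at α = 0.05.

```python
# Observed counts in the order used in the solution (rows 1-2, columns 1-3).
observed = [[400, 300, 100],
            [500, 600, 100]]

row_totals = [sum(row) for row in observed]        # [800, 1200]
col_totals = [sum(col) for col in zip(*observed)]  # [900, 900, 200]
grand_total = sum(row_totals)                      # 2000

chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        # Expected count for the cell, then its contribution (O - E)^2 / E.
        e = row_totals[i] * col_totals[j] / grand_total
        contribution = (o - e) ** 2 / e
        chi2 += contribution
        print(f"E{i + 1},{j + 1} = {e:.0f}, contribution = {contribution:.2f}")

critical_value = 5.991  # table value for df = 2 at alpha = 0.05
print(f"chi-square = {chi2:.2f}, reject H0: {chi2 > critical_value}")
# chi-square = 32.41, reject H0: True
```

The unrounded sum is about 32.41; adding the terms after rounding each to two decimals gives 32.4.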

 

Self-Check Questions:

 

Question 1: Two hundred randomly selected adults were asked whether TV shows as a whole are primarily entertaining, educational, or a waste of time. The respondents were categorized by gender. Their responses are given in the following table:

                          Opinion
Gender   Entertaining   Educational   Waste of time   Total
Female        52             28             30         110
Male          28             12             50          90
Total         80             40             80         200

 

Is this evidence convincing that there is a relationship between gender and opinion in the population of interest?

 

Solution: Let us take the null hypothesis that the opinion of adults is independent of gender.

Since the contingency table is of size 2×3, the degrees of freedom are (2 – 1)(3 – 1) = 2. This implies that we need to calculate only two expected frequencies from the formula; the other four can then be determined automatically from the row and column totals, as shown below:

E11 = (Row 1 total × Column 1 total) / Grand total = (110 × 80) / 200 = 44

E12 = (Row 1 total × Column 2 total) / Grand total = (110 × 40) / 200 = 22

E13 = 110 – (44 + 22) = 44

E21 = 80 – E11 = 80 – 44 = 36

E22 = 40 – E12 = 40 – 22 = 18

E23 = 80 – E13 = 80 – 44 = 36

 

The contingency table of expected frequencies is as follows:

                          Opinion
Gender   Entertaining   Educational   Waste of time   Total
Female        44             22             44         110
Male          36             18             36          90
Total         80             40             80         200

 

The observed and expected frequencies are arranged as follows to calculate the value of the χ² test statistic:

Observed (O)   Expected (E)   O – E   (O – E)²   (O – E)²/E
    52             44            8       64        1.455
    28             22            6       36        1.636
    30             44          -14      196        4.455
    28             36           -8       64        1.778
    12             18           -6       36        2.000
    50             36           14      196        5.444
Total                                             16.768

Since the calculated value of χ² = 16.768 is more than its critical value, χ² = 5.99 at α = 0.05 and df = 2, the null hypothesis is rejected. Hence, we conclude that the opinion of adults is not independent of gender.
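The idea that only two expected frequencies need the product formula, while the rest follow by subtraction from the fixed totals, can be illustrated with a small Python sketch. The variable names are illustrative only; the counts are those of Question 1.

```python
# Observed counts: rows = Female, Male;
# columns = Entertaining, Educational, Waste of time.
observed = [[52, 28, 30],
            [28, 12, 50]]
row_totals = [110, 90]
col_totals = [80, 40, 80]
grand_total = 200

# Only (2 - 1) * (3 - 1) = 2 cells need the product formula ...
e11 = row_totals[0] * col_totals[0] / grand_total  # 44.0
e12 = row_totals[0] * col_totals[1] / grand_total  # 22.0

# ... the remaining four follow from the fixed row and column totals.
e13 = row_totals[0] - (e11 + e12)  # 44.0
e21 = col_totals[0] - e11          # 36.0
e22 = col_totals[1] - e12          # 18.0
e23 = col_totals[2] - e13          # 36.0

expected = [[e11, e12, e13], [e21, e22, e23]]
chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
print(round(chi2, 3))  # 16.768
```

Small differences in the third decimal place can arise when the individual terms are rounded before they are added; they do not affect the conclusion, since the statistic is far above 5.99.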

 

Question 2: A sample analysis of the examination results of 500 students was made. It was found that 220 students had failed, 170 had secured a third division, 90 were placed in the second division, and 20 got a first division. Are these figures commensurate with the general examination result, which is in the ratio 4:3:2:1 for the respective categories?

Solution: Let us take the null hypothesis that the observed results are commensurate with the general examination result, which is in the ratio 4:3:2:1.

The expected numbers of students who have failed, obtained a third division, a second division and a first division, respectively, are:

E1 = 500 × 4/10 = 200, E2 = 500 × 3/10 = 150, E3 = 500 × 2/10 = 100 and E4 = 500 × 1/10 = 50

 

The table of observed and expected frequencies is as follows:

Category        O     E    (O – E)²   (O – E)²/E
Failed         220   200     400        2.000
3rd division   170   150     400        2.667
2nd division    90   100     100        1.000
1st division    20    50     900       18.000
Total                                  23.667

Since the calculated value of χ² = 23.667 is more than its table value, χ² = 7.81 at the α = 0.05 level of significance and df = n – 1 = 4 – 1 = 3, the null hypothesis is rejected. Hence, the observed results are not commensurate with the general examination result.
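The goodness-of-fit calculation for this question can be sketched in Python as follows. The variable names are illustrative only; 7.815 is the table value of chi-square for df = 3 at α = 0.05.

```python
observed = [220, 170, 90, 20]  # failed, third, second, first division
ratio = [4, 3, 2, 1]           # general examination result

n = sum(observed)  # 500
# Expected counts split the sample in the hypothesised ratio.
expected = [n * part / sum(ratio) for part in ratio]  # [200.0, 150.0, 100.0, 50.0]

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1  # 3

critical_value = 7.815  # table value for df = 3 at alpha = 0.05
print(round(chi2, 3), df, chi2 > critical_value)  # 23.667 3 True
```

Most of the statistic comes from the first-division category, where only 20 students were observed against 50 expected.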

 

Question 3: Based on information on 1000 randomly selected fields about the tenancy status of the cultivation of these fields and the use of fertilizers, collected in an agro-economic survey, the following classification was noted:

                        Owned   Rented   Total
Using fertilizers        416     184      600
Not using fertilizers     64     336      400
Total                    480     520     1000

Would you conclude that owner cultivators are more inclined towards the use of fertilizers at the 5% level of significance? Carry out a chi-square test as per the testing procedure.

 

Solution: Let us take the null hypothesis that ownership of fields and the use of fertilizers are independent attributes. Since the contingency table is of size 2×2, the degrees of freedom are (2 – 1)(2 – 1) = 1. This implies that we need to calculate only one expected frequency; the others can be determined automatically as follows:

E11 = (600 × 480) / 1000 = 288

E12 = 600 – 288 = 312

E21 = 480 – 288 = 192

E22 = 400 – 192 = 208

 

The observed and expected frequencies are arranged as follows to calculate the value of the χ² test statistic:

Observed (O)   Expected (E)   (O – E)²   (O – E)²/E
    416            288         16,384      56.889
     64            192         16,384      85.333
    184            312         16,384      52.513
    336            208         16,384      78.769
Total                                     273.504

The calculated value of χ² = 273.504 at the α = 0.05 level of significance and df = (r – 1)(c – 1) = (2 – 1)(2 – 1) = 1 is much more than its table value, χ² = 3.84. The null hypothesis H0 is therefore rejected. Since the observed number of owned fields using fertilizers (416) is well above the expected number (288), it can be concluded that owner cultivators are more inclined towards the use of fertilizers.
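As a cross-check, a 2 x 2 table also admits a well-known shortcut that gives the same statistic without computing the expected frequencies. The module does not use this shortcut; it is shown here only as an independent check, and the variable names a, b, c, d are illustrative. No continuity correction is applied.

```python
# Observed 2 x 2 table:
# rows = using / not using fertilizers, columns = owned / rented.
a, b = 416, 184
c, d = 64, 336
n = a + b + c + d  # 1000

# Shortcut for a 2 x 2 table (no continuity correction):
# chi-square = n * (a*d - b*c)^2 / ((a+b) * (c+d) * (a+c) * (b+d))
chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

critical_value = 3.841  # table value for df = 1 at alpha = 0.05
print(round(chi2, 3), chi2 > critical_value)  # 273.504 True
```

Adding the four cell contributions (O – E)²/E, computed from the expected frequencies 288, 312, 192 and 208, gives the same value of about 273.50.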

 

Summary

 

Tests may be classified into two categories: parametric and non-parametric. Three tests, namely the t test, z test and F test, are used to estimate and test population parameters. The prerequisites for applying these tests are that the data are measured on an interval or ratio scale, that the hypothesis concerns specific population parameters, that the population is assumed to be normal, and that it is clear whether or not the population standard deviation is known.

When these conditions are not met, non-parametric (distribution-free) tests are applied. These tests are easy to apply and can be computed from nominal or ordinal data. They give broad-based conclusions with approximate solutions and do not necessarily require a normally distributed population. The chi-square test is one of the non-parametric tests used to test hypotheses. The chi-square test for independence is applied when there are two categorical variables from a single population. It is used to determine whether there is a significant association between the two variables.

 
