25 Test of Goodness of Fit and Independence: Chi-Square Test as a Test of Independence

Dr Deependra Sharma

Learning objectives

After reading this module, students will be able to:
Understand the concept of non-parametric tests.
Apply the chi-square test as a test of independence.
Describe the procedure for conducting a chi-square test.

Introduction

A given set of data can be analyzed with a variety of tools. The choice of tool depends on the following factors:

Size of the sample

Size of the population

Scale used for measurement of the data

Dependency of measurement

Tests may be classified into two main categories: parametric and non-parametric. Three tests, namely the t, z and F tests, are used to estimate and test population parameters. The prerequisites for applying these tests are:

The data are measured on an interval or ratio scale

The hypothesis concerns specific population parameters

The population is assumed to be normally distributed, and it is clear whether the population standard deviation is known or not

When these conditions are not met, non-parametric (distribution-free) tests are applied. These tests:

Do not require a specific population distribution, and the data can be nominal or ordinal

Do not take population parameters into consideration

Do not require a normally distributed population

These tests are easy to apply and can be computed from nominal or ordinal data. They provide broad-based conclusions with approximate solutions and do not require a normally distributed population. The chi-square (χ²) test is one of the non-parametric tests used to test hypotheses.

χ² test for Independence

The test is applicable when there are two categorical variables from a single population. Its purpose is to find out whether there is a significant association between the two variables. For example, in an election survey, voters may be categorized by gender (male or female) and by party inclination (Democrat, Republican, or Independent). A chi-square test for independence is conducted to determine whether gender is related to party inclination.

This test is suitable under the following conditions:

§ The sample is selected through simple random sampling.

§ The variables are of categorical nature.

§ The expected frequency count for each cell of the contingency table should not be less than 5.
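
The expected-count condition can be checked before the test is run. The following is a minimal sketch in Python, assuming the observed counts are stored as a list of rows and using the expected-frequency formula from the procedure (row total × column total / grand total). The function names `expected_counts` and `meets_expected_count_rule` and the two small tables are illustrative assumptions, not part of this module.

```python
def expected_counts(observed):
    """Expected frequency for each cell: row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def meets_expected_count_rule(observed, minimum=5):
    """True when every expected frequency is at least `minimum`."""
    return all(e >= minimum for row in expected_counts(observed) for e in row)


# Hypothetical 2 x 2 tables, used only to illustrate the check.
large_table = [[30, 20], [25, 25]]
small_table = [[3, 2], [4, 1]]

print(meets_expected_count_rule(large_table))  # True
print(meets_expected_count_rule(small_table))  # False
```

In the second table every expected count is below 5, so the chi-square approximation would not be considered suitable.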

Procedure

The procedure for testing the association between two categorical variables, where the sample data are presented in a contingency table with r rows and c columns, is summarized as follows:

1. State the null and alternative hypotheses

H0: No relationship or association exists between variables.

Ha: A relationship or association exists between variables i.e., they are related.

2. Select a random sample, record the observed frequencies (O) in each cell of the contingency table, and calculate the row totals, column totals and grand total.

3. Calculate the expected frequency (E) for each cell:

E = (Row total × Column total) / Grand total

4. Compute the value of the test statistic,

χ² = Σ [(O − E)² / E],

where O is the observed frequency count and E is the expected frequency count, and the sum runs over all cells of the table.

5. Calculate the degrees of freedom:

df = (r − 1) × (c − 1)

where r is the number of rows (the levels of one categorical variable) and c is the number of columns (the levels of the other categorical variable).

6. Use the level of significance α and df to find the table (critical) value of χ².

7. Compare the calculated value with the table value. If the calculated value of χ² is less than the table value, do not reject the null hypothesis; otherwise, reject it.
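
The steps above can be sketched in a few lines of Python. This is an illustrative sketch rather than a definitive implementation: the function names `chi_square_independence` and `reject_null` are assumptions, the 2 × 2 table at the end is hypothetical, and the dictionary of table values holds only the α = 0.05 values quoted in this module (df = 1, 2 and 3).

```python
# Table values of chi-square at alpha = 0.05, as quoted in this module.
CRITICAL_005 = {1: 3.84, 2: 5.99, 3: 7.81}


def chi_square_independence(observed):
    """Return (statistic, df, expected) for an r x c table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    # Step 3: E = row total * column total / grand total
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

    # Step 4: sum of (O - E)^2 / E over all cells
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )

    # Step 5: df = (r - 1) * (c - 1)
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, df, expected


def reject_null(statistic, df, critical_values=CRITICAL_005):
    """Steps 6-7: reject H0 when the statistic exceeds the table value."""
    return statistic > critical_values[df]


# Hypothetical 2 x 2 table, used only to show the calls.
statistic, df, expected = chi_square_independence([[20, 30], [30, 20]])
print(round(statistic, 2), df, reject_null(statistic, df))  # 4.0 1 True
```

For other significance levels or larger tables, the dictionary would need to be extended from a chi-square table.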

Example

A simple random sample of 2000 prospective voters was taken. They were categorized by gender (M/F) and by party liking (Republican, Democrat, or Independent). The contingency table given below shows the result.

Gender	Republican	Democrat	Independent	Total
M	400	300	100	800
F	500	600	100	1200
Total	900	900	200	2000

Does the party liking of M differ significantly from that of F? Use a 0.05 level of significance.

Solution

Following the procedure discussed above:

The first step is to state the null hypothesis and the alternative hypothesis.

H0: Gender and party liking are independent.

Ha: Gender and party liking are not independent.

The significance level for this analysis is 0.05, and the chi-square test for independence is used.

The degrees of freedom, the expected frequency counts, and the chi-square test statistic are calculated next.

df = (r − 1) × (c − 1) = (2 − 1) × (3 − 1) = 2

E11 = (800 × 900) / 2000 = 720000 / 2000 = 360

E12 = (800 × 900) / 2000 = 360

E13 = (800 × 200) / 2000 = 80

E21 = (1200 × 900) / 2000 = 540

E22 = (1200 × 900) / 2000 = 540

E23 = (1200 × 200) / 2000 = 120

χ² = Σ [(Oij − Eij)² / Eij], summed over all rows i and columns j

χ² = (400 − 360)²/360 + (300 − 360)²/360 + (100 − 80)²/80 + (500 − 540)²/540 + (600 − 540)²/540 + (100 − 120)²/120

χ² = 4.44 + 10.00 + 5.00 + 2.96 + 6.67 + 3.33 = 32.40

The calculated value of the chi-square statistic, 32.4 with 2 degrees of freedom, is more than the table value of 5.99 at α = 0.05 (refer to the chi-square table); hence, the null hypothesis is rejected. Thus, we conclude that there is a relationship between gender and party liking.
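
The arithmetic of this example can be cross-checked with a short Python sketch that uses the six observed counts appearing in the computation above. One assumption goes beyond the text: for df = 2 only, the upper-tail probability of the chi-square distribution is exp(−x/2), so the table value at level α can be obtained as −2 ln α without a printed table.

```python
import math

# Observed counts, in the order used in the computation above.
observed = [[400, 300, 100],
            [500, 600, 100]]

row_totals = [sum(row) for row in observed]          # [800, 1200]
col_totals = [sum(col) for col in zip(*observed)]    # [900, 900, 200]
grand_total = sum(row_totals)                        # 2000

expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# For df = 2 only: P(chi-square > x) = exp(-x / 2),
# so the table value at level alpha is -2 * ln(alpha).
alpha = 0.05
critical_value = -2 * math.log(alpha)
p_value = math.exp(-chi_square / 2)

print(expected)                     # [[360.0, 360.0, 80.0], [540.0, 540.0, 120.0]]
print(round(chi_square, 2))         # 32.41
print(round(critical_value, 2))     # 5.99
print(chi_square > critical_value)  # True
```

The unrounded statistic is about 32.41; the small difference from 32.4 comes from rounding each term to two decimals before adding.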

Self-Check Questions:

Question 1: Two hundred randomly selected adults were asked whether TV shows as a whole are primarily entertaining, educational or a waste of time. The respondents were categorized by gender. Their responses are given in the following table:

Gender	Entertaining	Educational	Waste of time	Total
Female	52	28	30	110
Male	28	12	50	90
Total	80	40	80	200

Is this convincing evidence of a relationship between gender and opinion in the population of interest?

Solution: Let us take the null hypothesis that the opinion of adults is independent of gender.

Since the contingency table is of size 2×3, the degrees of freedom are (2 − 1)(3 − 1) = 2. This implies that only two expected frequencies need to be calculated directly; the other four are then determined automatically from the row and column totals, as shown below:

E11 = (Row 1 total × Column 1 total) / Grand total = (110 × 80) / 200 = 44

E12 = (Row 1 total × Column 2 total) / Grand total = (110 × 40) / 200 = 22

E13 = 110 − (44 + 22) = 44

E21 = 80 − E11 = 80 − 44 = 36

E22 = 40 − E12 = 40 − 22 = 18

E23 = 80 − E13 = 80 − 44 = 36

The contingency table of expected frequencies is as follows:

Gender	Entertaining	Educational	Waste of time	Total
Female	44	22	44	110
Male	36	18	36	90
Total	80	40	80	200
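
The link between the degrees of freedom and the number of expected frequencies that must be computed directly can be illustrated with a short Python sketch. It assumes only the row and column totals of the table in Question 1; the variable names are illustrative.

```python
row_totals = [110, 90]       # Female, Male
col_totals = [80, 40, 80]    # Entertaining, Educational, Waste of time
grand_total = 200

# Only (2 - 1) * (3 - 1) = 2 cells are computed from the formula.
e11 = row_totals[0] * col_totals[0] / grand_total   # 44.0
e12 = row_totals[0] * col_totals[1] / grand_total   # 22.0

# The remaining four cells follow from the fixed row and column totals.
e13 = row_totals[0] - (e11 + e12)
e21 = col_totals[0] - e11
e22 = col_totals[1] - e12
e23 = col_totals[2] - e13

expected = [[e11, e12, e13], [e21, e22, e23]]
print(expected)  # [[44.0, 22.0, 44.0], [36.0, 18.0, 36.0]]
```

Because the margins are fixed, the second row adds up to 90 without any further use of the formula.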

The observed and expected frequencies are arranged as follows to calculate the value of the χ² test statistic:

Observed (O)	Expected (E)	O − E	(O − E)²	(O − E)²/E
52	44	8	64	1.455
28	22	6	36	1.636
30	44	-14	196	4.455
28	36	-8	64	1.778
12	18	-6	36	2.000
50	36	14	196	5.444
Total				16.768

Since the calculated value of χ² = 16.768 is more than its critical value, χ² = 5.99 at α = 0.05 and df = 2, the null hypothesis is rejected. Hence, we conclude that the opinion of adults is not independent of gender.
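
To see which cells drive this result, the per-cell contributions (O − E)²/E can be listed separately. The Python sketch below uses the observed and expected frequencies from the tables above; the variable names and the "share" calculation are illustrative additions, not part of the original solution.

```python
observed = [52, 28, 30, 28, 12, 50]
expected = [44, 22, 44, 36, 18, 36]
cells = [
    ("Female", "Entertaining"),
    ("Female", "Educational"),
    ("Female", "Waste of time"),
    ("Male", "Entertaining"),
    ("Male", "Educational"),
    ("Male", "Waste of time"),
]

# Contribution of each cell to the statistic: (O - E)^2 / E
contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
chi_square = sum(contributions)

for (gender, opinion), value in zip(cells, contributions):
    print(f"{gender:<7}{opinion:<15}{value:.3f}")

# Share of the statistic that comes from the "Waste of time" column
waste_share = (contributions[2] + contributions[5]) / chi_square

print(round(chi_square, 2))   # 16.77
print(round(waste_share, 2))  # 0.59
```

More than half of the statistic comes from the "Waste of time" column, where women gave this answer less often and men more often than independence would predict.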

Question 2: A sample analysis of the examination results of 500 students was made. It was found that 220 students had failed, 170 had secured a third division, 90 were placed in the second division and 20 got a first division. Are these figures commensurate with the general examination result, which is in the ratio 4:3:2:1 for the respective categories?

Solution: Let us take the null hypothesis that the observed results are commensurate with the general examination result, which is in the ratio 4:3:2:1.

The expected numbers of students who failed or obtained a third division, second division and first division, respectively, are

E1 = 500 × 4/10 = 200, E2 = 500 × 3/10 = 150, E3 = 500 × 2/10 = 100 and E4 = 500 × 1/10 = 50

The table of observed and expected frequencies is as follows:

Category	O	E	(O − E)²	(O − E)²/E
Failed	220	200	400	2.000
3rd division	170	150	400	2.667
2nd division	90	100	100	1.000
1st division	20	50	900	18.000
Total				23.667

Since the calculated value of χ² = 23.667 is more than its table value, χ² = 7.81 at the α = 0.05 level of significance and df = n − 1 = 4 − 1 = 3, the null hypothesis is rejected. Hence, we conclude that the observed results are not commensurate with the general examination result.
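
Question 2 is a goodness-of-fit test: the expected frequencies come from a hypothesized ratio rather than from row and column totals. A minimal Python sketch of the same calculation follows; the dictionary layout and variable names are assumptions made for illustration, and the table value 7.81 is taken from the text above.

```python
observed = {
    "Failed": 220,
    "3rd division": 170,
    "2nd division": 90,
    "1st division": 20,
}
ratio = {
    "Failed": 4,
    "3rd division": 3,
    "2nd division": 2,
    "1st division": 1,
}

n = sum(observed.values())         # 500
ratio_total = sum(ratio.values())  # 10

# Expected frequency for each category: n * (ratio share)
expected = {k: n * ratio[k] / ratio_total for k in observed}

chi_square = sum((observed[k] - expected[k]) ** 2 / expected[k] for k in observed)
df = len(observed) - 1

critical_value = 7.81  # table value at alpha = 0.05 and df = 3, from the text

print(expected)
print(round(chi_square, 3), df, chi_square > critical_value)  # 23.667 3 True
```

Here the degrees of freedom are the number of categories minus one, not (r − 1)(c − 1), because there is only one categorical variable.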

Question 3: Based on information about the tenancy status and fertilizer use of 1000 randomly selected fields, collected in an AGRO ECONOMY survey, the following classification was noted:

	Owned	Rented	Total
Using fertilizers	416	184	600
Not using fertilizers	64	336	400
Total	480	520	1000

Would you conclude that owner cultivators are more inclined towards the use of fertilizers at the 5% level of significance? Carry out a chi-square test as per the testing procedure.

Solution: Let us take the null hypothesis that ownership of fields and the use of fertilizers are independent attributes. Since the contingency table is of size 2×2, the degrees of freedom are (2 − 1)(2 − 1) = 1. This implies that only one expected frequency needs to be calculated directly; the others are then determined automatically as follows:

E11 = (600 × 480) / 1000 = 288

E12 = 600 − 288 = 312

E21 = 480 − 288 = 192

E22 = 400 − 192 = 208

The observed and expected frequencies are arranged as follows to calculate the value of the χ² test statistic:

Observed (O)	Expected (E)	(O − E)²	(O − E)²/E
416	288	16,384	56.889
64	192	16,384	85.333
184	312	16,384	52.513
336	208	16,384	78.769
Total			273.504

The calculated value of χ² = 273.504 at the α = 0.05 level of significance and df = (r − 1)(c − 1) = (2 − 1)(2 − 1) = 1 is much more than its table value, χ² = 3.84. The null hypothesis H0 is therefore rejected. Since 416 of the 480 owned fields (about 87%) use fertilizers, compared with 184 of the 520 rented fields (about 35%), it can be concluded that owner cultivators are more inclined towards the use of fertilizers.
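
For a 2×2 table the statistic can also be obtained from a single shortcut formula, χ² = N(ad − bc)² / [(a + b)(c + d)(a + c)(b + d)], where a, b, c and d are the four observed counts. This identity is a standard result that is not derived in this module; the Python sketch below applies it to the data of Question 3 as a cross-check and also computes the share of fertilizer users in each tenancy group. The variable names are illustrative.

```python
a, b = 416, 184   # using fertilizers: owned, rented
c, d = 64, 336    # not using fertilizers: owned, rented
n = a + b + c + d # 1000

# Shortcut formula for a 2 x 2 table
chi_square = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Proportion of fields using fertilizers within each tenancy group
share_owned = a / (a + c)
share_rented = b / (b + d)

print(round(chi_square, 3))                           # 273.504
print(round(share_owned, 3), round(share_rented, 3))  # 0.867 0.354
```

The result, about 273.5, agrees with the sum of the four (O − E)²/E terms. The test itself only shows that the two attributes are associated; the direction of the association is read from the two proportions.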

Summary

Tests may be classified into two main categories: parametric and non-parametric. Three tests, namely the t test, z test and F test, are used to estimate and test population parameters. The prerequisites for applying these tests are that the data are measured on an interval or ratio scale, that the hypothesis concerns specific population parameters, that the population is assumed to be normally distributed, and that it is clear whether the population standard deviation is known or not.

When these conditions are not met, non-parametric (distribution-free) tests are applied. These tests are easy to apply and can be computed from nominal or ordinal data. They provide broad-based conclusions with approximate solutions and do not require a normally distributed population. The chi-square test is one of the non-parametric tests used to test hypotheses. The chi-square test for independence is applied when there are two categorical variables from a single population. It is used to determine whether there is a significant association between the two variables.

Learn More:

Sharma, J. K. (2014), Business Statistics, S. Chand & Company, New Delhi.
Bajpai, N. (2010), Business Statistics, Pearson, New Delhi.
Hastie, T., Tibshirani, R. and Friedman, J. (2009), The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd Edition, Springer.
Huff, D. (2010), How to Lie with Statistics, W. W. Norton, California.
Gupta, K. R. (2012), Practical Statistics, Atlantic Publishers & Distributors (P) Ltd., New Delhi.