37 Chi-Square Test

S. Gandhimathi

epgp books

 

 

 

 

1. Introduction

 

The chi-square (χ²) test is a non-parametric test. It is used to test goodness of fit, that is, to check whether observed data agree with an assumed theoretical distribution. It is therefore a measure of the divergence between actual and expected frequencies. It is widely used in statistics, especially in sampling studies, where exact agreement between actual and expected frequencies is unlikely and we need to judge how far the difference can be ignored as a result of fluctuations in sampling. If there is no difference between the actual and expected frequencies, χ² is zero. Thus, the chi-square test measures the discrepancy between theory and observation.

 

2. Objectives

 

In this module, we are going to discuss the following:

  • The characteristics of the χ² test
  • Assumptions of the χ² test
  • Uses of the chi-square test
  • How to apply the chi-square test

3.  Characteristics of χ² Test:

  • It is specifically used to test the association between variables when the data are qualitative.
  • The chi-square test is based on events or frequencies, whereas tests based on theoretical distributions use the mean and standard deviation.
  • It is applied to test a hypothesis and so to draw inferences.
  • The test compares the entire set of observed frequencies with the entire set of expected frequencies.
  • A new χ² distribution is formed for every increase in the number of degrees of freedom.

4.  Assumptions:

  • The observations must be independent.
  • The events must be mutually exclusive.
  • The number of observations must be large.
  • The data must be in original units for comparison purposes.
  • No expected frequency should be less than 5. If an expected frequency is less than 5, it should be pooled with the frequency of an adjacent class.
  • The sample must be drawn at random.

5.  Degrees of freedom:

 

To compare the computed value of χ² with the table value, we need to know the degrees of freedom. The degrees of freedom are the number of classes to which values can be assigned at will without violating the restrictions imposed. For example, suppose we must choose four numbers whose total is 50. We are free to choose any three numbers, say 10, 15 and 20, but the fourth is then fixed at 5: 50 − (10 + 15 + 20). The condition that the total be 50 reduces our freedom of choice by one. There is thus one restriction, and the degrees of freedom are three. As the number of restrictions increases, the degrees of freedom decrease.

 

Thus ν = n − k

ν (nu): degrees of freedom

n: number of frequency classes

k: number of independent constraints

For a contingency table with r rows and c columns, such as a 2 x 2 table, the degrees of freedom are

ν = (c − 1)(r − 1) = (2 − 1)(2 − 1) = 1.
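
The two rules for degrees of freedom can be written as a minimal Python sketch. The function names are hypothetical and chosen only for this illustration.

```python
def df_frequency_classes(n_classes: int, n_constraints: int = 1) -> int:
    """Degrees of freedom = number of classes - number of independent constraints."""
    if n_constraints >= n_classes:
        raise ValueError("constraints must be fewer than classes")
    return n_classes - n_constraints


def df_contingency(rows: int, cols: int) -> int:
    """Degrees of freedom for an r x c contingency table: (c - 1)(r - 1)."""
    return (cols - 1) * (rows - 1)


# Four numbers with a fixed total of 50: one constraint, three free choices.
print(df_frequency_classes(4, 1))
# A 2 x 2 contingency table has one degree of freedom.
print(df_contingency(2, 2))
```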

 

6.  Uses:

 

χ² test of goodness of fit: Through this test we can measure the deviations between the observed values and the expected values. Here we are concerned not with the parameters but with the form of the distribution. Karl Pearson developed this method of testing the difference between the theoretical values (hypothesis) and the observed values. The test is done by comparing the computed value of χ² with the table value of χ² for the relevant degrees of freedom. The Greek symbol χ² is used to describe the magnitude of the difference between fact and theory.

 

χ² may be defined as χ² = Σ [(O − E)² / E], where

  • O = Observed frequencies
  • E = Expected frequencies.

Steps:

  1. Establish a hypothesis and choose the level of significance.
  2. Compute the deviation between each observed value and its expected value, (O − E).
  3. Square each deviation, (O − E)².
  4. Divide each (O − E)² by its expected frequency E.
  5. Add all the values obtained in step 4; the sum is the calculated χ².
  6. Find the table value of χ² for the relevant degrees of freedom at the chosen level of significance, usually the 5% level.

 

If the calculated value of χ² is greater than the table value at the chosen level of significance, we reject the hypothesis. If the calculated value of χ² is zero, the observed and expected values coincide completely. If the calculated value of χ² is less than the table value at the chosen level of significance, it is said to be non-significant. This implies that the discrepancy between the observed and expected frequencies may be due to fluctuations of sampling.
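
The steps and the decision rule above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the die-rolling frequencies are invented, and the table value 11.070 (5% level, 5 degrees of freedom) is read from a standard χ² table rather than computed.

```python
def chi_square(observed, expected):
    """Steps 2 to 5: sum of (O - E)^2 / E over all classes."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def decide(calculated, table_value):
    """Step 6 and the decision rule: compare with the table value."""
    return "reject" if calculated > table_value else "do not reject"


# Invented data: a die rolled 60 times, hypothesis "the die is fair".
observed = [8, 12, 9, 11, 10, 10]
expected = [10] * 6
calculated = chi_square(observed, expected)
print(f"calculated chi-square = {calculated:.3f}")
print(decide(calculated, table_value=11.070))
```

Passing the table value in as an argument keeps the sketch free of any statistics library; in practice the value would come from a χ² table for the relevant degrees of freedom.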

 

Example: Four coins were tossed 160 times and the following results were obtained:

 

No. of heads:            0    1    2    3    4
Observed frequencies:   17   52   54   31    6

 

Under the assumption that coins are balanced, find the expected frequencies of getting 0, 1, 2, 3 or 4 heads and test the goodness of fit.

 

Solution:

 

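Hypothesis: the coins are unbiased.

 

With four unbiased coins, the probabilities of 0, 1, 2, 3 and 4 heads are 1/16, 4/16, 6/16, 4/16 and 1/16, so the expected frequencies in 160 tosses are 10, 40, 60, 40 and 10.

The calculated value of χ² is 12.725, which is greater than the table value of 9.488 (5% level, 4 degrees of freedom).

Therefore the hypothesis is rejected: the observed frequencies do not fit the distribution expected of unbiased coins.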
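
The coin example can be checked with a short Python sketch. It assumes four coins per toss (the table lists 0 to 4 heads) and takes the table value 9.488 from the text; the variable names are illustrative.

```python
from math import comb

observed = [17, 52, 54, 31, 6]   # frequencies of 0, 1, 2, 3, 4 heads
n_coins = 4                      # assumption: four coins per toss
n_tosses = sum(observed)         # 160

# Binomial expected frequencies for unbiased coins: N * C(4, k) * (1/2)^4
expected = [n_tosses * comb(n_coins, k) * 0.5 ** n_coins
            for k in range(n_coins + 1)]

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
table_value = 9.488              # 5% level, 4 degrees of freedom (from the text)

print(expected)                  # [10.0, 40.0, 60.0, 40.0, 10.0]
print(f"{chi2:.3f}")             # 12.725
print("reject" if chi2 > table_value else "do not reject")
```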

 

χ² as a test of independence: The χ² test can be used to find out whether two or more attributes are associated or not. For example, for coaching class and successful candidates, or marriage and failure, we can find out whether the attributes are related or independent. We take the hypothesis that the attributes are independent. If the calculated value of χ² is less than the table value at the chosen level of significance, the hypothesis is not rejected; if it is greater, the hypothesis is rejected and the attributes are taken to be associated.

 

Example: Out of a sample of 120 persons in a village, 80 were administered a new drug for preventing influenza, and 26 of them were attacked by influenza. Of those who were not administered the new drug, 6 persons were not affected by influenza. (a) Prepare 2 x 2 tables showing the actual and expected frequencies; (b) use the chi-square test to find out whether the new drug is effective or not.

 

(At 5% level for one degree of freedom, the value of Chi square is 3.84).

 

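Solution: 2 x 2 table of observed frequencies, with expected frequencies in brackets

                         Attacked    Not attacked    Total
Drug administered        26 (40)     54 (40)          80
Drug not administered    34 (20)      6 (20)          40
Total                    60          60              120

Each expected frequency is (row total × column total) / 120; for example, 80 × 60 / 120 = 40.

The calculated value of χ² is 29.4, which is much higher than the table value of 3.84. Therefore the hypothesis of independence is rejected, and we conclude that the drug is effective in controlling influenza.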
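
As a check on the arithmetic, the following Python sketch derives the expected frequencies from the row and column totals and computes χ² for the drug example. The observed table is reconstructed from the figures stated in the example (80 given the drug with 26 attacked; 40 not given it with 6 not attacked), and the function names are hypothetical. From these figures the statistic works out to about 29.4, far above the table value of 3.84.

```python
def expected_frequencies(table):
    """Expected cell frequency = row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_square_independence(table):
    """Sum of (O - E)^2 / E over every cell of the contingency table."""
    expected = expected_frequencies(table)
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))


# Rows: drug administered, drug not administered.
# Columns: attacked, not attacked.
observed = [[26, 54],
            [34, 6]]

print(expected_frequencies(observed))   # [[40.0, 40.0], [20.0, 20.0]]
chi2 = chi_square_independence(observed)
print(f"{chi2:.1f}")
print("reject" if chi2 > 3.84 else "do not reject")
```

The same two functions work for any r x c table, since the expected frequencies depend only on the marginal totals.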

 

YATES CORRECTION

 

If any cell frequency in a 2 x 2 table is less than 5, then to apply the χ² test it would have to be pooled with the preceding or succeeding frequency so that the total is greater than 5. This results in the loss of 1 d.f. In such a situation, i.e., when any cell frequency in a 2 x 2 table is less than 5, we instead apply the correction for continuity, popularly known as the Yates correction. This consists of adding 0.5 to the cell frequency which is less than 5, adjusting the remaining frequencies accordingly (since the row and column totals are fixed), and then applying the χ² test without pooling.

 

Illustration: Solving the illustration.

 

Solution: The Yates correction is applied. Under the Yates correction, 0.5 is added to the observed cell frequency which is less than 5. In this problem, one observed frequency is less than 5, so we add 0.5 to 2 and make it 2.5. The rest of the frequencies are adjusted keeping the subtotals unchanged. Thus, after the Yates correction, the observed frequencies would be:

Since the calculated value is less than the table value, χ² is not significant. The vaccine is therefore not shown to be effective in controlling the disease.
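
The effect of the correction can be illustrated with a minimal Python sketch. The 2 x 2 table used here is invented for illustration (it is not the data of the illustration above), and the function names are hypothetical. The sketch uses the usual cell-wise form of the correction, in which each |O − E| is reduced by 0.5, and checks that it gives the same value as adding 0.5 to the small cell and adjusting the other cells so that the row and column totals stay fixed. That equivalence assumes the small cell lies below its expected frequency.

```python
def expected_2x2(a, b, c, d):
    """Expected frequencies for the 2 x 2 table [[a, b], [c, d]], in cell order."""
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    return [rows[i] * cols[j] / n for i in range(2) for j in range(2)]


def chi_square_2x2(a, b, c, d, yates=False):
    """Chi-square for a 2 x 2 table; with yates=True each |O - E| is reduced by 0.5."""
    shift = 0.5 if yates else 0.0
    expected = expected_2x2(a, b, c, d)
    return sum((abs(o - e) - shift) ** 2 / e
               for o, e in zip((a, b, c, d), expected))


# Invented table with one cell below 5 (the cell containing 2).
corrected = chi_square_2x2(2, 10, 8, 6, yates=True)

# The same correction done by hand: 2 becomes 2.5 and the other
# cells move by 0.5 so that every row and column total is unchanged.
by_hand = chi_square_2x2(2.5, 9.5, 7.5, 6.5)

print(f"{corrected:.4f} {by_hand:.4f}")
print("significant" if corrected > 3.84 else "not significant")
```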

 

χ² as a test of homogeneity: This application of the χ² test can be regarded as an extension of the χ² test of independence. Such tests indicate whether two or more independent samples are drawn from the same population or from different populations. Instead of the one sample used in an independence problem, we now have two or more samples. A random sample is drawn from each of the populations, and the number in each sample falling into each category is determined. The sample data are displayed in a contingency table. The analytical procedure is the same as that discussed for the test of independence.

 

The main difference is that, in test of independence, we are concerned with the problem whether the two attributes are independent or not while in tests of homogeneity, we are concerned whether the different samples come from the same population. Another difference is that test of independence involves a single sample but test of homogeneity involves two or more samples, one from each population.

 

Additive Property: One of the merits of the χ² test as an instrument of research is that independently derived values of χ² from samples of similar data can be combined by the simple process of addition. This gives a better test than could be made using the data of any one sample by itself. The sum of the χ² values thus combined itself has a χ² distribution, with degrees of freedom equal to the sum of the degrees of freedom of the separate χ² values. However, two points must be remembered when adding χ² values:

  • The combined result in a single inclusive test is appropriate only when the samples are independent; and
  • When χ² values are to be added, Yates' correction should not be applied, because the addition theorem holds only for uncorrected constituent items.
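
The additive property can be sketched in Python as follows. The three (χ², degrees of freedom) pairs are invented values standing in for independent samples, each uncorrected; the function name is hypothetical; and 7.815 is the 5% table value for 3 degrees of freedom from a standard χ² table.

```python
def pool_chi_square(results):
    """Add independent, uncorrected chi-square values and their degrees of freedom."""
    total_chi2 = sum(chi2 for chi2, _ in results)
    total_df = sum(df for _, df in results)
    return total_chi2, total_df


# Invented results from three independent samples, each with 1 d.f.
# None exceeds 3.84 on its own.
samples = [(3.2, 1), (3.5, 1), (3.1, 1)]

total_chi2, total_df = pool_chi_square(samples)
table_value = 7.815   # 5% level, 3 degrees of freedom

print(f"pooled chi-square = {total_chi2:.1f} on {total_df} d.f.")
print("significant" if total_chi2 > table_value else "not significant")
```

In this invented case no single sample is significant at the 5% level, but the pooled value is, which is the sense in which the combined test is better than any one sample by itself.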

 

Example: 1000 families were selected at random in a city to test the belief that high income families usually send their children to public schools and the low income families often send their children to government schools. The following results were obtained:

The calculated value of χ² is more than the table value (for ν = 1, χ² at the 5% level = 3.84). The hypothesis of independence is rejected. Hence income and type of schooling are not independent.

Example: From the data given below about the treatment of 250 patients suffering from a disease, state whether the new treatment is superior to the conventional treatment.

(Given: for 1 degree of freedom, χ² at the 5% level = 3.84.)

 

Solution: Let us take the hypothesis that there is no significant difference between the new and conventional treatment.

 

Applying the χ² test:

 

For ν = 1, χ² at the 5% level = 3.84.

 

The calculated value of χ² is greater than the table value. The hypothesis is rejected. Hence, there is a significant difference between the new and conventional treatments.

 

CONCLUSION

 

To summarise, the chi-square test is a non-parametric test, which means that it is not based on distributional assumptions about the population. It is used to test the independence of attributes, the goodness of fit and the homogeneity of samples. The test is based on frequencies, not on the actual values. When the data are categorical and qualitative in nature, parametric tests cannot be applied, because they rest on distributional assumptions; in such cases a non-parametric test must be used. This does not mean that the chi-square test can be applied only to qualitative data. Data recorded as actual values can be grouped into categories and then analysed with the chi-square test. Hence the chi-square test can be used in place of a parametric test, but a parametric test cannot be used in place of the chi-square test.
