19 F-distribution and tests of significance based on F distribution

Felix Bast

    1. Introduction

 

The F-distribution is an important probability distribution used in a number of statistical tests of significance, the most famous among which is ANOVA, used to compare means of more than two groups. The main problem with multiple t-tests is that the significance level applies to each comparison, and as the number of comparisons increases, the chance of false positives greatly increases and the test results become unreliable. One-way ANOVA tests whether the means of all groups are equal. If ANOVA returns a P value <0.05 (or any other threshold significance level that we decide as part of the experimental design), it means that at least one group mean is different from the rest. In most cases we are more interested in knowing which group means are significantly different from the rest than in the ANOVA P value itself. This can be tested by multiple comparisons posttests, the most famous being Tukey's HSD. Tukey's HSD calculates a 95% CI of the difference between each pair of group means, adjusted for multiple comparisons using a critical value from the q distribution. Tukey's HSD depends not merely on the two groups that the test analyses, but also on all other groups and in fact on every single value. Interpretation of Tukey's HSD confidence intervals is straightforward: if the range includes zero (the null hypothesis of no difference), the difference in means of that pair of groups is not significant; if it excludes zero, the difference is significant. Other posttests include Dunnett's test, which is highly useful for comparing multiple groups with a group defined as a control group (each test group is compared with the control group, but there won't be any comparison between test groups), and Scheffé's test for comparing 'contrasting' sets of group means (for example, groups {A, B and C} versus groups {D and E}).

 

2. Learning Outcomes:

 

a. To learn about the properties of F-distribution and statistical tests of significance based upon F distribution.

b. To learn why uncorrected multiple comparisons like multiple t-tests to compare means of groups are problematic.

c. To learn principles and assumptions of one-way ANOVA and how to compute it by hand

d. To know more about other kinds of ANOVA including two-way ANOVA, Repeated measures ANOVA, Random Effects ANOVA, MANOVA and AMOVA

e. To learn how to perform multiple comparison post hoc tests including Tukey’s HSD and Bonferroni adjustment

 

3. F Distribution

 

The F-distribution (Fisher distribution) is a type of continuous probability distribution giving the exact probability of every F-ratio under the assumption of the null hypothesis. This enables us to calculate a P value from a given F-ratio. We saw how the F-ratio is calculated in module 18 while testing the assumption that two groups have the same variance, prior to performing an unpaired t-test. It is calculated by the formula:

 

F = (s1/s2)² = s1²/s2²

 

Where s1 and s2 are the standard deviations of group 1 and group 2 respectively. More generally, F is a ratio of two variances, each computed as a sum of squares divided by its respective degrees of freedom. The distribution of these F-ratios, plotted as a probability histogram, is called the F-distribution. Like the lognormal distribution and the Chi-square distribution, the F-distribution is right-skewed (with a long tail towards the right). The shape of the F-distribution depends only on the degrees of freedom (df) of the numerator (DFn) and the df of the denominator (DFd).

 

As one can see from the distribution's curves, when both df are small, the shape is distinctly skewed to the right. With high df, the distribution becomes bell-shaped, although it is still skewed to the right.

 

At each combination of these two df, the F-distribution can be calculated using calculus. For each such distribution, at each combination of DFn and DFd, we can find the F value beyond which the tail area equals the significance level (the threshold P value, alpha). These values are presented in tabular format in the so-called F-distribution tables, where one can look up a value called F-critical. One can compare this F-critical value from the table with the F-ratio calculated from the data to make inferences about statistical significance: if F-critical at the 0.05 significance level is less than the F-ratio, we can infer that the P value must be less than 0.05 and reject the null hypothesis.
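The same lookup can be done computationally instead of from printed tables. Below is a minimal sketch using SciPy's F-distribution functions, with DFn = 3, DFd = 11 and the F-ratio chosen purely for illustration:

```python
# Minimal sketch: F-critical lookup and P value from an F-ratio with SciPy.
# The df values and the F-ratio below are illustrative numbers only.
from scipy import stats

dfn, dfd = 3, 11                 # numerator and denominator df
alpha = 0.05                     # significance level

# F-critical: the value beyond which the upper-tail area equals alpha
f_crit = stats.f.ppf(1 - alpha, dfn, dfd)
print(f"F-critical = {f_crit:.3f}")          # ~3.587

# Exact P value for an observed F-ratio (upper-tail area beyond it)
f_ratio = 5.0
p = stats.f.sf(f_ratio, dfn, dfd)            # sf is the survival function
print(f"P = {p:.4f}")
```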

 

4. Tests of significance based on F distribution

 

The F distribution is used to test whether two groups have equal variances, one of the assumptions of the unpaired t-test, as we saw in module 18. The most important use of the F-distribution is the F-test better known as ANOVA (Analysis of Variance), for comparing means of three or more groups, as discussed in this module. Keep in mind that ANOVA merely compares group means and informs us whether all groups have identical means or not; the means are not identical even if only one group mean differs from the rest. To know which group means are significantly different, we should do a multiple comparisons posttest, as explained later in this module.

 

5. Problem with multiple t-tests

 

Consider three groups of patients (X, Y and Z), each group treated with a different drug. An intuitive approach to compare group means would be to perform multiple t-tests: one t-test to compare the means of X and Y, another between Y and Z, and yet another between X and Z. If any of these t-tests returned P<0.05, we would conclude that the group means are not identical; if all of them returned P>0.05, we would conclude that the group means are identical. Though this sounds appealing, multiple comparisons like these should not be performed, as they drastically increase the chance of false positives. In a single t-test at the 0.05 significance level, we still expect 5% "false positives" (like a blood test saying you are HIV+ when in reality you are HIV-); i.e., the t-test yields a significant P value by chance alone even when the two group means are identical. With two or more t-tests, the expected proportion of false positives is not merely 5% but well above it. Therefore uncorrected multiple comparisons, like multiple t-tests, should never be used.
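To see how quickly the false-positive risk grows, note that if each test is run at alpha = 0.05 and the tests were independent (an idealizing assumption made only for this back-of-the-envelope sketch), the chance of at least one false positive across m comparisons is 1 - (1 - alpha)^m:

```python
# Familywise error rate of uncorrected multiple t-tests, assuming
# independent tests at alpha = 0.05 (an idealizing assumption).
alpha = 0.05
for groups in (3, 4, 5):
    m = groups * (groups - 1) // 2      # number of pairwise comparisons
    fwer = 1 - (1 - alpha) ** m         # P(at least one false positive)
    print(f"{groups} groups: {m} t-tests, P(>=1 false positive) = {fwer:.3f}")
# 3 groups: ~0.143, 4 groups: ~0.265, 5 groups: ~0.401
```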

 

6. One-way ANOVA

 

One-way ANOVA, also called one-factor ANOVA, compares the means of three or more groups to determine whether the differences between these group means are statistically significant or not. ANOVA can also be used to compare the means of two groups, but in that case ANOVA is indistinguishable from the unpaired t-test explained previously and returns a P value identical to that of the t-test. ANOVA was developed by the population geneticist and founder of modern statistics, Ronald Fisher.

 

7. Assumptions for ANOVA

 

Like the t-test, one-way ANOVA is based on a set of familiar assumptions:

1.      Random samples

2.      Independent measurements

3.      Accurate Data

4.      Data are sampled from populations that are approximately Gaussian

5.      Variances, or standard deviations, of the populations from which the samples came are identical

 

To test the 4th assumption, a formal statistical test like the D'Agostino-Pearson omnibus K2 test can be performed. In case the populations deviate significantly from Gaussian, the first option should be to see whether any transformation normalizes the distributions (normalization). For example, a number of biological variables, like those in enzyme kinetics and wherever the term 'half-maximal' is used, are lognormal; these can be normalized by converting each value to its logarithm. If no transformation works, then a non-parametric test like the Kruskal-Wallis test, followed by Dunn's posttest, should be preferred. For paired (matched) data from non-Gaussian populations, Friedman's test followed by Dunn's posttest should be preferred. A statistical package like GraphPad Prism can be used for both of these non-parametric tests. The 5th assumption of equal variances is not usually tested prior to ANOVA, as testing for this assumption is built into ANOVA itself.
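As an illustration of this workflow (formal normality test first, then a log transform if the data look lognormal), here is a minimal sketch using SciPy, whose normaltest function implements the D'Agostino-Pearson omnibus test; the data values are made up for illustration:

```python
# Checking the Gaussian assumption with the D'Agostino-Pearson omnibus
# test, then re-testing after a log transform. Values are made up.
import numpy as np
from scipy import stats

values = np.array([1.2, 0.8, 3.5, 2.1, 9.7, 0.6, 4.4, 1.9, 7.3, 2.8,
                   0.9, 5.1, 1.4, 3.0, 12.6, 2.2, 0.7, 6.8, 1.1, 3.9])

k2, p = stats.normaltest(values)                  # D'Agostino-Pearson K2
print(f"raw data: K2 = {k2:.2f}, P = {p:.4f}")

k2_log, p_log = stats.normaltest(np.log(values))  # after log transform
print(f"log data: K2 = {k2_log:.2f}, P = {p_log:.4f}")
```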

 

8. One-way ANOVA by hand

 

Let us consider an example: the increase in mouse colorectal tumour mass caused by four different heavy metals, measured after feeding for a period of time. The data, with each group's mean, standard deviation (SD) and variance, are given below:

Group:      U        Pb       As       Hg
            60.8     78.7     92.6     86.9
            67.0     77.7     84.1     82.2
            54.6     76.3     90.5     83.7
            61.7     79.8              90.3
Mean        61.03    78.13    89.07    85.78
SD          5.08     1.49     4.43     3.60
Variance    25.83    2.22     19.60    12.94
Let us first define our null hypothesis and alternative hypotheses:

H0: All group means are equal

Ha: At least one group mean is different from the rest

 

As we can see, there are four groups (arranged by columns: U, Pb, As and Hg) with four technical replicate measurements for each group, arranged by rows (only three for As, though). The numbers (elements, measurements) are increases in tumour mass, the response variable; there is a single factor (the heavy metal treatment), so this ANOVA is called one-way or one-factor ANOVA. A factor is an independent treatment variable whose settings (values) are controlled and varied by the experimenter; the intensity setting of a factor is the level. Levels may be quantitative numbers or, in many cases, simply "present" or "not present" ("0" or "1"). Had there been another factor, like the age of the mice, we would need to do a two-factor ANOVA. Use one-way ANOVA for comparing means of more than two groups across one factor: for example, average heights of neem trees in three villages, or average groundwater As levels in 4 districts of Gujarat.

 

For one-way ANOVA, the first step is to calculate the mean, standard deviation and variance for each of these groups, as already computed in the table above. We can also compute the overall (total) mean, standard deviation and variance of our entire dataset (n = 15). ANOVA starts with the calculation of the "Sum of Squares," SS. SS is defined as the sum of squared deviations of values from the sample mean, ∑(x − x̄)². Remember that we used this while calculating the standard deviation from raw data.

 

First let us calculate SStotal, the sum of squares of the differences between each of these 15 values and the overall mean (x̄ = 77.79).

 

(60.8 − 77.79)² + (67 − 77.79)² + … + (90.3 − 77.79)²

#     x      x - x̄     (x - x̄)²
1     60.8   -16.99    288.6601
2     67.0   -10.79    116.4241
3     54.6   -23.19    537.7761
4     61.7   -16.09    258.8881
5     78.7     0.91      0.8281
6     77.7    -0.09      0.0081
7     76.3    -1.49      2.2201
8     79.8     2.01      4.0401
9     92.6    14.81    219.3361
10    84.1     6.31     39.8161
11    90.5    12.71    161.5441
12    86.9     9.11     82.9921
13    82.2     4.41     19.4481
14    83.7     5.91     34.9281
15    90.3    12.51    156.5001
Total (SStotal)        1923.41

 

Let us then consider SSbetween, the sum of squares between group means. We have 4 groups, and hence four group means. To compute SSbetween, we subtract the overall mean (x̄, 77.79) from each group mean, square this difference, and multiply it by the size of the group (to weight by sample size). The sum of the values in the last column is SSbetween.

Group   n   Group mean   Mean - x̄   (Mean - x̄)²   n × (Mean - x̄)²
U       4   61.03        -16.76     280.8976      1123.5904
Pb      4   78.13          0.34       0.1156         0.4624
As      3   89.07         11.28     127.2384       381.7152
Hg      4   85.78          7.99      63.8401       255.3604
Total (SSbetween)                                 1761.128

We need one more Sum of Squares, i.e., within groups SSwithin.

 

SStotal = SSbetween + SSwithin (the derivation of which is omitted for the sake of simplicity)

∴     SSwithin = SStotal − SSbetween

= 1923.41 − 1761.128

= 162.282

This SSwithin is also called the 'residual sum of squares' or 'error sum of squares'.
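These hand computations are easy to verify programmatically. A minimal sketch in Python (NumPy), using the data from the table above:

```python
# Verifying SStotal, SSbetween and SSwithin for the worked example.
import numpy as np

groups = {"U":  np.array([60.8, 67.0, 54.6, 61.7]),
          "Pb": np.array([78.7, 77.7, 76.3, 79.8]),
          "As": np.array([92.6, 84.1, 90.5]),
          "Hg": np.array([86.9, 82.2, 83.7, 90.3])}

all_values = np.concatenate(list(groups.values()))
grand_mean = all_values.mean()                       # ~77.79

ss_total = ((all_values - grand_mean) ** 2).sum()    # ~1923.41
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2
                 for g in groups.values())           # ~1761.13
ss_within = ss_total - ss_between                    # ~162.28
print(f"SStotal = {ss_total:.2f}, SSbetween = {ss_between:.2f}, "
      f"SSwithin = {ss_within:.2f}")
```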

 

The next step is to calculate the degrees of freedom for each of these three sums of squares. For SSbetween, we have four groups (K, the total number of groups); as df = K − 1, df is 4 − 1 = 3.

 

The degree of freedom for SSwithin is the overall sample size (N) minus the number of groups (K). Here the overall sample size is 15 and the number of groups is 4; therefore, df = 15 − 4 = 11.

 

The degree of freedom for SStotal is N − 1 = 15 − 1 = 14 (note that df for between groups + df for within groups = df for total: 3 + 11 = 14).

 

Let us put all these numbers in the summary table for one-way ANOVA:

Source of variation     SS         df   MS        F
Between groups          1761.128    3   587.043   39.7917
Within groups (error)    162.282   11    14.753
Total                   1923.41    14

The fourth column is the Mean Square (MS, also called variance), which is the Sum of Squares divided by the df for each source. Summing the MS values to calculate a total MS makes no sense, so this is not done. Please note that MSwithin is perhaps the most important statistic in an ANOVA test; it quantifies the variance due to error and is needed for calculating the 'margin of error' in the standardized error computations of posttests (like Tukey's) that tell us which means are significantly different, as explained later in this module. Finally, the fifth column is the F-ratio, the ratio of these two MS values (MSbetween/MSwithin).

 

As in the case of the F-test for equality of variances (the assumption of homoscedasticity) that we did prior to the unpaired t-test, here too we look up the F table for the critical F value. The df for the F-ratio are expressed as DFn = 3 and DFd = 11. The critical F value corresponding to these two df at the 0.05 significance level is:

F-critical (α = 0.05, DFn = 3, DFd = 11) = 3.587

As the table value (critical F, 3.587) is far less than our obtained F-ratio (39.7917), we can conclude that P<0.05; the null hypothesis of equal means is rejected and we conclude that at least one of the group means is significantly different from the rest. There are many F tables corresponding to different significance levels (accessible at http://www.socr.ucla.edu/applets.dir/f_table.html). Perhaps we are interested in whether this F-ratio is still higher than the critical F at lower significance levels, say 0.01 or 0.001. We can see that even at the 0.001 significance level, F-critical from the table (11.56) is still less than our obtained F-ratio, so our actual P value must be very small. One-way ANOVA for the above data done using Excel returns a P value of 3.36E-06 (read this value as 3.36 × 10⁻⁶, i.e., 0.00000336). We can also use an online calculator like the one at http://stattrek.com/online-calculator/f-distribution.aspx (keep in mind that this calculator returns a cumulative probability; we have to subtract this value from 1 to get the upper-tail P value). A low P value indicates that the group means are significantly different. It could be because one of the four means differs from the other three, or because two of the group means differ from the other two; it could also be that all four means differ from each other. None of these finer details are revealed by the one-way ANOVA P value. To find them out we have to perform an appropriate multiple comparison posttest, as explained below.
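Besides Excel and online calculators, the whole test can be run in a single call with SciPy; a minimal sketch for our example:

```python
# One-way ANOVA for the worked example using SciPy.
from scipy import stats

u   = [60.8, 67.0, 54.6, 61.7]    # Uranium
pb  = [78.7, 77.7, 76.3, 79.8]    # Lead
as_ = [92.6, 84.1, 90.5]          # Arsenic (only three replicates)
hg  = [86.9, 82.2, 83.7, 90.3]    # Mercury

f_ratio, p_value = stats.f_oneway(u, pb, as_, hg)
print(f"F = {f_ratio:.4f}, P = {p_value:.2e}")    # F ~ 39.79, P ~ 3.4e-06
```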

 

9. Other types of ANOVA

 

a) Repeated Measures ANOVA: Just as a paired t-test is used to compare two groups with paired (matched or dependent) measurements, we use repeated measures ANOVA for matched measurements across more than two groups (these are called matched sets or matched blocks). Whenever a value in any particular group is expected to be closer to a specific value in another group than to a random value from our whole experiment (because of the way we designed the experiment), we should use repeated measures ANOVA. To find out which groups are significantly different, we have to perform an appropriate multiple comparison posttest as explained below, just as for ordinary one-way ANOVA.

 

b) Two-way ANOVA and more: Two-way (or two-factor) ANOVA is used for comparisons involving two factors. As already explained, in addition to heavy metals, we might also be interested in the effect of the mice's age on tumour mass (the response variable remains only one). Use two-way ANOVA for comparing means of groups across two factors. Examples are average heights of neem trees grown using 4 different fertilizers in three villages, and average As levels in shallow/moderate/deep wells in 4 districts. If we add yet another factor, like sex, we will have to perform three-factor ANOVA, and so on. In the case of one-way ANOVA, the null hypothesis is that the means of the groups are the same. In the case of two-way ANOVA, the null hypothesis takes three forms: 1) there is no difference in the means of factor A across the groups, 2) there is no difference in the means of factor B across the groups, and 3) there is no interaction between factors A and B. An illustrative sketch follows.
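A two-way ANOVA can be run in Python with statsmodels; the dataset below is made up (our original tumour values arbitrarily split into 'young' and 'old' mice, with two replicates per metal-age combination) purely to show the mechanics:

```python
# A minimal sketch of two-way ANOVA with statsmodels on made-up data.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.DataFrame({
    "tumour": [60.8, 61.7, 67.0, 54.6, 78.7, 76.3, 77.7, 79.8,
               92.6, 90.5, 84.1, 88.0, 86.9, 83.7, 82.2, 90.3],
    "metal":  ["U"] * 4 + ["Pb"] * 4 + ["As"] * 4 + ["Hg"] * 4,
    "age":    ["young", "young", "old", "old"] * 4,
})

# Main effects plus interaction; each row of the ANOVA table tests one
# of the three null hypotheses listed above.
model = ols("tumour ~ C(metal) * C(age)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))
```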

 

c) Multivariate ANOVA (MANOVA): MANOVA is used for comparing the effect of a single categorical variable (e.g., exposure to tobacco) on the averages of two or more continuous variables (e.g., heart rate and blood pressure).

 

d) Random Effects ANOVA: Ordinary ANOVA is technically called fixed-effects or Model I ANOVA; in this case the test assesses the means of our selected groups only, without any further extrapolation. In Random Effects ANOVA, also called Model II ANOVA, the test assumes that the groups we selected are randomly chosen representatives (our sample groups) of an infinite number of groups (our unknown population), and it compares our sample group means to infer whether the population group means are significantly different or not.

 

e) Analysis of Molecular Variance (AMOVA): This is a method to detect population differentiation (a large contiguous population fragmenting into small isolated populations) using molecular markers. While AMOVA was developed along the lines of ANOVA, the two methods are quite different.

 

10. Multiple Comparisons Posttests

 

One-way ANOVA informs us whether the overall differences in the means of our groups are significant or not. As explained already, a low P value in one-way ANOVA could be due to only one of the means differing significantly from the rest, or groups of means differing from other groups of means (a so-called 'contrast'), or every mean differing significantly from every other mean. In most situations, the ANOVA P value is not what we want to know; we are more interested in which means are significantly different. To find that out, we should perform a multiple comparisons posttest (also called a post hoc test).

 

Multiple t-tests are one way to compare pairs of means after ANOVA; however, as already explained, they should not be used uncorrected, as the expected rate of false positives (Type I errors) would be far higher than the 5% level. This is because the significance level (for example, 0.05) applies to each comparison involved, not to the entire family of comparisons. Between four groups, 6 comparisons are possible (the number of possible comparisons can be calculated by the combination formula, nCr = n!/(r!(n−r)!)), so after performing all these comparisons, the familywise error rate would be far higher than 0.05 (which would mean a number of false positives in our results). In case multiple t-tests are used, each individual P value should be multiplied by the total number of such comparisons. For example, suppose an unadjusted P value of 0.01 is obtained for one pair among 4 groups. As the total number of possible comparisons among four groups is 6, the unadjusted P value needs to be multiplied by 6 to get the corrected P value: 0.01 × 6 = 0.06. This correction is known as the Bonferroni adjustment. The Bonferroni adjustment is fine in case you want to compare a specific pair after the ANOVA, provided that this pair was clearly specified as part of the experimental design. If instead you pick the highest and lowest means after seeing the data, you would effectively be comparing not merely those two means but all six pairs in a group of four.
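A minimal sketch of the Bonferroni adjustment just described (the unadjusted P value is hypothetical, and adjusted P values are capped at 1):

```python
# Bonferroni adjustment: multiply each unadjusted P value by the
# number of comparisons, capping the result at 1.0.
from math import comb

n_groups = 4
m = comb(n_groups, 2)              # 4C2 = 6 pairwise comparisons
p_unadjusted = 0.01                # hypothetical t-test P value
p_adjusted = min(1.0, p_unadjusted * m)
print(f"{m} comparisons: adjusted P = {p_adjusted:.2f}")   # 0.06
```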

 

In case we would like to compare all combinations of means, which is what most of us would like to do after ANOVA, the best option is Tukey's Honestly Significant Difference test (Tukey's HSD). The Bonferroni test should not be used for this purpose, as its statistical power is far less than that of Tukey's HSD. As already explained, Tukey's HSD depends on a key output of ANOVA, MSwithin. As with the confidence interval of a mean explained in module 16, or the CI of a difference between means explained in module 18, Tukey's test returns CIs for all pairwise differences between group means. For four groups, there are six pairwise comparisons (differences between means), and these are expressed as confidence intervals at our selected confidence level (for example 95%).

 

The formula for Tukey's HSD confidence interval is:

(x̄i − x̄j) ± q × √[(MSw/2) × (1/ni + 1/nj)]

Where x̄i and x̄j are the means of groups i and j, q is a critical value from the q distribution, MSw is MSwithin from the ANOVA output, and ni and nj are the sizes of groups i and j respectively. The above equation is very similar to other confidence-interval equations: the difference between sample means ± w (the width of the CI). The width of the CI is calculated from the standardized (instead of 'standard') error (the square-root term in the above equation) multiplied by a critical value from the q distribution (instead of t*). The critical value q depends upon three things: the chosen level of significance (α), the total number of groups (r) and the degrees of freedom for MSwithin (dfw). The q distribution is the distribution of the 'studentized range'. Also note that even though we are only comparing the means of two of our groups (i and j), all groups and every single value in our dataset matter, as the value of MSwithin depends upon the whole dataset. This is needed to account for the perils of multiple comparisons, as already explained.

 

Let us go back to our original example. To compare the means of Pb vs U (lead vs uranium, our first two groups), we first calculate the difference between their group means. The mean for Pb is 78.13 and the mean for U is 61.03, so the difference is 17.1.

 

Let us first define our null hypothesis and alternative hypotheses:

H0: Difference between two means is zero

Ha: Difference between two means is not zero

 

The next step is to calculate the width of the 95% CI (so α = 0.05). To calculate it we need the following information: MSwithin = 14.753, dfw = 11, nPb = 4, nU = 4, α = 0.05 and r = 4. Plugging the values in:

 

w = √[(14.753/2) × (1/4 + 1/4)] × q (α = 0.05, r = 4, dfw = 11)

 

A q table is not normally provided in the tables section of statistics texts, as it is hard to calculate. One version of an online calculator can be found at http://www.vassarstats.net/tabs.html#q

Using that calculator, q at α = 0.05 (r = 4, dfw = 11) is 4.27. Plugging this into the above equation:

w = √[(14.753/2) × (1/4 + 1/4)] × 4.27

= √(7.3765 × 0.5) × 4.27

= 1.920 × 4.27

= 8.2

 

The above value is the width; it extends on both sides of the difference between means (17.1). So, the 95% confidence interval for this difference is:

(17.1 – 8.2) to (17.1 + 8.2)

8.9 to 25.3

 

Does this range include zero, our null-hypothesis value of no difference in means? No, it doesn't. Therefore, P<0.05 and the difference is statistically significant.
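The same interval can be computed programmatically with SciPy's studentized range distribution (available from SciPy 1.7 onwards); a minimal sketch reproducing the Pb vs U comparison:

```python
# Tukey HSD 95% CI for Pb vs U via the studentized range distribution.
import math
from scipy.stats import studentized_range

ms_within, df_w, r = 14.753, 11, 4     # from the ANOVA summary table
n_pb = n_u = 4
diff = 78.13 - 61.03                   # difference between group means

q = studentized_range.ppf(1 - 0.05, r, df_w)    # ~4.26 (table gives 4.27)
w = q * math.sqrt((ms_within / 2) * (1 / n_pb + 1 / n_u))
print(f"q = {q:.2f}, width = {w:.2f}")
print(f"95% CI: {diff - w:.1f} to {diff + w:.1f}")   # ~8.9 to ~25.3
```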

 

We can perform the same pairwise computation of the 95% CI for all six combinations of group means to find how many of these combinations return statistically significant differences. The confidence interval of the difference between means tells us whether the P value is significant at our chosen significance level. Some software packages (like GraphPad Prism) also return an exact P value, called a 'multiplicity-adjusted P value' (one that has been adjusted for multiple comparisons).

 

It is possible that the one-way ANOVA returns P<0.05 yet none of the pairwise differences between means is significant. It is also possible that ANOVA returns P>0.05, yet some of the differences between group means are significant.

 

Post hoc multiple comparison methods other than the Bonferroni test and Tukey's HSD include Dunnett's test, which is highly useful for comparing multiple groups with a group defined as a control group (each test group is compared with the control group, but there won't be any comparison between test groups); Scheffé's test, for comparing 'contrasting' sets of group means (for example, groups {A, B and C} versus groups {D and E}); and Holm's test.

 

An online calculator for ANOVA with Tukey’s HSD is available at http://astatsa.com/OneWay_Anova_with_TukeyHSD/
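The same all-pairs analysis can also be run locally with statsmodels, as an alternative to the online calculator linked above:

```python
# All six pairwise Tukey HSD comparisons for our data with statsmodels.
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

values = np.array([60.8, 67.0, 54.6, 61.7,    # U
                   78.7, 77.7, 76.3, 79.8,    # Pb
                   92.6, 84.1, 90.5,          # As
                   86.9, 82.2, 83.7, 90.3])   # Hg
labels = ["U"] * 4 + ["Pb"] * 4 + ["As"] * 3 + ["Hg"] * 4

print(pairwise_tukeyhsd(values, labels, alpha=0.05))
```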

 

The results of Tukey's HSD for our example are as follows:

[Tukey HSD output: 95% CIs and significance for all six pairwise comparisons of U, Pb, As and Hg]

This calculator also includes results of other multiple comparison tests, including the Bonferroni test, the Scheffé test, and Holm's test.

 

Also note that even if ANOVA returns an overall P value >0.05 (differences not statistically significant), the results of multiple comparisons like Tukey's HSD, the Bonferroni test and Dunn's test are still valid.

 

11. Summary

  1. For one-way ANOVA, we first have to calculate SStotal, SSbetween and SSwithin. With the degrees of freedom for each of these sums of squares, we can calculate the Mean Square by dividing the Sum of Squares by its df. Finally, we calculate the F-ratio as MSbetween/MSwithin. From the two df values and the F-ratio, we can calculate the P value. One-way ANOVA can easily be calculated using software or many web-based calculators.
  2. For matched or paired data, the test analogous to the paired t-test for more than two groups is repeated measures ANOVA.
  3. Two-way ANOVA determines how a response is affected by two factors, and whether there is any interaction between the factors.
  4. A common mistake is to analyse ANOVA groups by multiple t-tests, but this is highly problematic as the chance of false positives greatly increases this way.
  5. A popular and powerful multiple comparison post hoc test is Tukey's HSD, which calculates 95% CIs of the differences between means of pairs of groups. In most cases, the results of post hoc tests like Tukey's are a lot more informative than the ANOVA P value itself, but Tukey's test depends on the ANOVA output (MSwithin).
  6. Even if ANOVA returns an overall P value >0.05 (differences not statistically significant), the results of multiple comparisons like Tukey's HSD, the Bonferroni test and Dunn's test are still valid.

    Quadrant-III: Learn More/ Web Resources / Supporting Materials:

  1. F tables arranged by significance levels http://www.socr.ucla.edu/applets.dir/f_table.html
  2. One-way ANOVA online calculator http://stattrek.com/online-calculator/f-distribution.aspx
  3. Online calculator to find q-value from df and k (number of groups) http://www.vassarstats.net/tabs.html#q
  4. Online calculator for ANOVA with Tukey's HSD http://astatsa.com/OneWay_Anova_with_TukeyHSD/
  5. Online calculator to calculate exact P from F ratio https://www.graphpad.com/quickcalcs/pValue1/