18 t-Distribution and tests of significance based on t-distribution

Felix Bast

 

1. Introduction

 

Student’s t-distribution is an important probability distribution in statistics and underlies a number of tests of statistical significance, t-tests being the most common and important among them. T-tests are used for comparing the means of two groups. If the objective is only to know whether two group means differ significantly (statistical hypothesis testing), a t-test is not strictly required; plotting the 95% CI of the difference between sample means would suffice. A manual t-test is no more robust or powerful than the 95% CI of the difference for hypothesis testing, but a t-test run on a computer returns an exact P value that lets us judge the strength of the evidence for a difference, if any. This module details the t-distribution, how to perform two kinds of t-tests (paired and unpaired) manually, and, most importantly, how to interpret the P values.

 

2. Learning Outcomes:

 

a)      Student’s t-distribution

b)      CI of mean difference

c)      F-test

d)      Homoscedasticity

e)      Matched t-test

f)       Paired t-test

g)      Unpaired t-test

 

3. t-Distribution

 

t-Distribution (Student’s t-distribution) is a continuous probability distribution for which the exact probability of every t-score is known under the assumption of the null hypothesis. This enables us to calculate a P value from a given t-score. We saw how the t-distribution is computed in module 16: first we calculate the t-score (t-ratio) by dividing the difference between the sample mean and the population mean by the standard error of the sample mean. The distribution of these t-ratios, plotted as a probability histogram, is called the t-distribution. The shape of the t-distribution depends only on the degrees of freedom (df). With small df, the distribution is heavy-tailed (leptokurtic), while with large df the distribution becomes almost indistinguishable from the thin-tailed normal Gaussian distribution. For each value of df, the t-distribution can be calculated using calculus. For each of these distributions, we can find the t value that cuts off a tail area equal to the significance level (the threshold P value, alpha). These values can be presented in tabular format, the so-called t-distribution tables, where one can look up a value called t-critical from a combination of df and significance level (e.g., df = 15 and significance level = 0.05). One can compare this t-critical value found from the table with the t-score calculated from the data to make inferences about statistical significance. If t-critical at a significance level of 0.05 is less than the t-score, we can infer that the P value must be less than 0.05 and can reject the null hypothesis.
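The table-lookup logic described above can be sketched in Python, assuming SciPy is available (the observed t-score of 3.53 here is purely illustrative):

```python
from scipy import stats

df = 15          # degrees of freedom
alpha = 0.05     # significance level (the threshold P value)

# Critical t for a two-tailed test: the value a t-table lookup would give
t_crit = stats.t.ppf(1 - alpha / 2, df)

# Exact two-tailed P value for an illustrative observed t-score
t_score = 3.53
p = 2 * stats.t.sf(abs(t_score), df)

# If the observed t exceeds t-critical, P must fall below alpha
print(f"t* = {t_crit:.4f}, P = {p:.4f}")
```

Software and web calculators compute exact P values this way, rather than bracketing P between table columns.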

 

The t-distribution was developed by W.S. Gosset, who was then working at the Guinness brewery in Dublin, Ireland. To protect the brewery’s trade secrets, Gosset had to publish his findings under the pseudonym ‘Student’ (hence the name of the distribution). Ronald Fisher, the famous population geneticist and father of modern statistics, was the first to refer to it as the t-distribution. The t-distribution is involved in a number of tests of statistical significance. In module 16 we saw how it is used in calculating the confidence interval of a sample mean. The t-distribution is also used in calculating confidence intervals of many other statistical measures, for example, the difference between two sample means, as we will shortly learn in this module. It is likewise used in t-tests, as explained in this module, and in linear regression analysis and Bayesian analyses, discussed elsewhere in this MOOC. Let us now consider the most important test of significance for comparing the means of two groups: the t-test.

 

4. Un-paired t-test

 

The unpaired t-test is used to compare the means of two independent, unpaired or unmatched groups; for example, the marks of two groups of students (females vs. males in a class). Let us consider an example with data from Frazier et al., 2006, who examined the concentrations of the neurotransmitter norepinephrine required to obtain maximal relaxation of urinary bladder muscles in old and young rats.

Old     Young
20.8    45.5
 2.8    55.0
50.0    60.7
33.3    61.5
29.4    61.1
38.9    65.5
29.4    42.9
52.6    37.5
14.3

 

We can also calculate the mean, standard deviation (remember, this is a sample, not a population, so we should use the sample SD equation) and standard error of the mean for both of these groups, exactly as we learned in descriptive statistics:

Statistics                      Old Rats   Young Rats
Mean                            30.17      53.71
Sample Standard Deviation (s)   16.09      10.36
SEM                             5.365      3.664
Sample Size (n)                 9          8
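These summary statistics can be reproduced from the raw data with the Python standard library alone (a sketch; note that `statistics.stdev` uses the sample, n − 1, formula):

```python
import math
from statistics import mean, stdev  # stdev = sample SD (n - 1 denominator)

old = [20.8, 2.8, 50.0, 33.3, 29.4, 38.9, 29.4, 52.6, 14.3]
young = [45.5, 55.0, 60.7, 61.5, 61.1, 65.5, 42.9, 37.5]

for name, group in [("Old", old), ("Young", young)]:
    s = stdev(group)
    sem = s / math.sqrt(len(group))   # standard error of the mean
    print(f"{name}: mean={mean(group):.2f}  s={s:.2f}  SEM={sem:.3f}  n={len(group)}")
```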

 

Let us first define our null hypothesis and alternative hypotheses:

 

H0: Group mean of old = group mean of young (the difference between the two group means is zero)

Ha: Group mean of old ≠ group mean of young (the difference between the two group means is not zero)

 

5.   Difference between sample means

 

The difference between sample means is easy to compute. Here, the sample mean of young rats is higher than that of old rats; the difference between them is 23.55.

 

6.    Confidence Interval of the difference between sample means

 

In module 16 we learned how to compute the 95% CI of a sample mean. Remember the formula used to calculate the width (w) of the 95% CI: w = t* × SEM, where t* is a constant from the t-distribution and SEM is the sample standard error of the mean.

 

To calculate the width of the 95% CI of the difference between sample means, a similar formula is used: (w = t* × standard error of the difference between sample means).

 

Therefore, we first need to calculate the standard error of the difference between the two means. The following formula is used:

SE(difference) = √(s1²/n1 + s2²/n2)

where s1 and s2 are the sample standard deviations of groups 1 and 2 respectively, and n1 and n2 are the sample sizes of groups 1 and 2 respectively.

 

Substituting values from the rat example into this formula, the standard error of the difference between sample means can be calculated as:

=  √ (16.09²/9 + 10.36²/8)

=  √ (28.77 + 13.42)

=  √42.19

=  6.50

 

(The exact value of this standard error of the difference between sample means uses a slightly different, more complicated pooled-variance formula, omitted here for brevity. Calculated with that formula, the exact value of the standard error of the difference between sample means is 6.67.)
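Both calculations can be checked in a few lines of Python (a sketch: the “exact” value comes from the pooled-variance formula, the standard error used by the ordinary equal-variance t-test):

```python
import math

s1, n1 = 16.09, 9    # old rats: sample SD and sample size
s2, n2 = 10.36, 8    # young rats

# Simple formula used in the text
se_simple = math.sqrt(s1**2 / n1 + s2**2 / n2)

# Pooled-variance formula (the "exact" value mentioned above)
df = (n1 - 1) + (n2 - 1)
sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df
se_pooled = math.sqrt(sp2 * (1 / n1 + 1 / n2))

print(f"simple: {se_simple:.2f}, pooled: {se_pooled:.2f}")
```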

 

To calculate the width of the 95% CI, we multiply this standard error by a constant from the t-distribution (t*). Remember that t* depends only on the df and the level of significance. The level of significance is 0.05 for a 95% confidence level, and the degrees of freedom for each group is (n − 1). The df for each group must be calculated, and the sum of these df is used here. Df for old rats is 9 − 1 = 8 and df for young rats is 8 − 1 = 7; the combined df is 8 + 7 = 15. For significance level 0.05 and df = 15, t* is 2.1314. Therefore, w is

 

W of 95% CI = t* x SE of differences

=  2.1314 x 6.67 =14.22

 

This width extends to both sides of the mean difference, i.e., 23.55. The 95% CI is (mean difference ± w) = (23.55 − 14.22) to (23.55 + 14.22)

=9.33 to 37.77
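The same interval can be computed programmatically, assuming SciPy is available (a sketch using the rounded values from the text):

```python
from scipy import stats

diff = 23.55      # difference between the two sample means
se_diff = 6.67    # exact SE of the difference (pooled formula)
df = 15           # (9 - 1) + (8 - 1)

t_star = stats.t.ppf(0.975, df)    # two-tailed t* for 95% CI, ~2.1314
w = t_star * se_diff               # width of the 95% CI
lo, hi = diff - w, diff + w
print(f"95% CI: {lo:.2f} to {hi:.2f}")  # ~ 9.33 to 37.77
```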

 

Remember that we learned about the null hypothesis in module 17. When comparing the means of two samples, the null hypothesis is that there is no difference between the sample means, i.e., the difference between mean 1 and mean 2 is zero. We also learned in module 17 a crucial connection between the 95% CI and P in statistical hypothesis testing: if the 95% CI does not include the value stated by the null hypothesis, the P value must be less than 0.05 and we can conclude that the result is ‘statistically significant’.

 

In our example, the null hypothesis is that the difference between the two sample means is zero. Does the range of the 95% CI include this null-hypothesis value, i.e., 0? As 9.33 to 37.77 does not include 0, we can conclude that the P value must be < 0.05 and the result is statistically significant. Note that we deduced this conclusion from the 95% CI alone, even before performing a t-test. If the objective is to compare two sample means to know whether the difference is statistically significant, calculating the 95% CI of the difference suffices and is robust. A manual t-test, which does not yield an exact P value and supports only conclusions such as “P < 0.05”, is completely optional. However, a t-test using a computer produces the exact P value (such as 0.0499), which is far more informative than a conclusion like “P < 0.05”, and is therefore preferable to a conclusion reached from the 95% CI alone. The width of the CI depends on the following three factors:

 

1)      Variability. Low SD (consistent data) would lead to narrower CI

2)      Sample size. Large n would lead to narrower CI

3)      Degree of confidence (confidence level). Lower confidence would lead to narrower CI

 

7.   Unpaired t-test: Assumptions

 

Before performing an unpaired t-test, we should make sure our data do not violate the assumptions of this test. The assumptions are:

1)      Subjects have been randomly assigned to one of the two groups.

2)      Each element (individual measurement or value) is independent of the other elements.

3)      Measurements are accurate.

4)      The samples came from a normal (or nearly Gaussian) distribution.

5)      The two samples follow ‘homoscedasticity’, i.e., they have nearly equal variances (or standard deviations). Sample sizes of the two groups do not have to be equal.

 

The first three assumptions (random samples, independent measurements, accurate data) are a familiar set used universally in many statistical tests. To check the fourth assumption, we should test whether our samples came from a normally distributed population. A rough check can be done by visual inspection of histograms, or by calculating kurtosis and skewness and deciding whether these values fall within the range expected for an approximately Gaussian distribution. For a more formal test, the D’Agostino-Pearson omnibus K2 test can be performed using a statistical package like GraphPad Prism. If the inference is that the populations deviate significantly from a Gaussian distribution, a non-parametric test such as the Mann-Whitney U test (also called the Wilcoxon rank-sum test) should be used instead (not available in Excel; use a statistical package like GraphPad Prism). For testing assumption 5, homoscedasticity, i.e., that the standard deviations of the two groups are nearly equal, an F-test is usually performed.

 

To do an F-test, first calculate the F-ratio:

F = (s1/s2)²

where s1 is the standard deviation of group 1 and s2 is the standard deviation of group 2 (by convention, the larger standard deviation is taken as s1, so that F ≥ 1). For our earlier example of rat bladders:

F = (s1/s2)²

=  (16.09/10.36)²

=  2.41

 

What are the degrees of freedom for this ratio? There are two: one for the numerator (DFn) and one for the denominator (DFd); each is the respective group size minus 1. As seen earlier, DFn = 8 and DFd = 7. From these three numbers one can calculate the P value. We can also look up the critical F value in an F-distribution table:

In the table, DFn is read across the top and DFd down the side. 8 vs 7 (3.7257), as in this example, is different from 7 vs 8 (3.5005); we should be meticulous not to make this common error in this step. As the table value (critical F, 3.7257) is higher than our calculated F-ratio (2.41), we conclude that P > 0.05 and the variances are not significantly different. Whenever looking at a t, F or chi-square table, remember: if the table value is higher than the calculated value, P > 0.05, and we conclude ‘ns’ (not significant); if the table value is less than the calculated value, P < 0.05. A more accurate computational method calculates the exact P value of the F-test. For our example, P = 0.2631, which is indeed > 0.05; the variances are not significantly different.
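The F-test steps above can be verified with SciPy (a sketch; the two-tailed P value doubles the upper-tail area and comes out close to the ≈0.26 reported in the text):

```python
from scipy import stats

s1, n1 = 16.09, 9    # group with the larger SD goes in the numerator
s2, n2 = 10.36, 8

F = (s1 / s2) ** 2            # ~ 2.41
dfn, dfd = n1 - 1, n2 - 1     # 8 and 7

F_crit = stats.f.ppf(0.95, dfn, dfd)   # table value, ~ 3.7257
p = 2 * stats.f.sf(F, dfn, dfd)        # two-tailed P value

print(f"F = {F:.2f}, critical F = {F_crit:.4f}, P = {p:.4f}")
```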

 

What if P < 0.05, i.e., a statistically significant difference between the variances of the two groups? A common practice is to perform a modified t-test that allows for unequal variances (such an option is available in Excel). However, Moser & Stevens, 1992 concluded that the modified t-test allowing unequal variances should not be used, as its results can be misleading. If the F-test returns a low P value (significant, i.e., unequal variances), perhaps the best practice is to ignore that result and go ahead with the t-test assuming equal variances. The t-test is fairly robust to violations of the equal-variance assumption as long as the sample sizes are not tiny and the two groups have approximately equal sample sizes. That means you really don’t need an F-test before a t-test; simply perform the ordinary t-test that assumes equal variances. As explained previously, even the t-test is not strictly necessary: the 95% CI of the difference between means is enough to know whether P < 0.05.

 

8.    Performing an unmatched t-test

 

Performing an unmatched t-test is straightforward. First, we calculate the t-ratio, which is identical to the t-score we calculated when deriving the 95% CI of the sample mean in module 16. To calculate the t-ratio in a t-test, the following formula is used:

 

t-Ratio = Difference between sample means/Standard Error of the difference between sample means

Note that to perform a t-test, all we need to know are the means and standard deviations of the two groups being compared; raw data are not required. If raw data are given, we first have to calculate the means and standard deviations of the two groups.

 

In our rat bladder example, difference between sample means = 23.55

Standard error of the difference between sample means=6.67

t-ratio = 23.55/6.67

= 3.53

 

As with the F-test, the next step is to look up the t-distribution table for the critical t value given a significance level and df. For 15 df at alpha = 0.05, let us look up the table for the critical t value:

 

t-critical is 2.131

 

As the t-critical from the table (2.131) is lower than the calculated t-ratio (3.53), the P value is inferred to be < 0.05; we reject the null hypothesis and conclude that the difference between the sample means is statistically significant. Had our significance level been 0.01, the t-critical from the table (2.947) would still be less than the calculated t-ratio (3.53), so we can further infer that the P value must be < 0.01 and the difference very significant. The actual P value, calculated in a slightly more complicated manner by software such as Excel, is 0.0030, which is indeed < 0.01 (you can also use a web-based calculator that computes P from t and df: https://www.graphpad.com/quickcalcs/pValue1). If the calculated t-ratio is negative, no problem: take the absolute value and interpret the result. The sign indicates only the direction of the difference, for example, old rats having a larger sample mean than young rats. Had our calculated value been −3.53, the result would have been the same: P < 0.05.
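The whole test can be run in one call with SciPy's `ttest_ind` (a sketch on the raw rat-bladder data; `equal_var=True` selects the ordinary equal-variance t-test):

```python
from scipy import stats

old = [20.8, 2.8, 50.0, 33.3, 29.4, 38.9, 29.4, 52.6, 14.3]
young = [45.5, 55.0, 60.7, 61.5, 61.1, 65.5, 42.9, 37.5]

# Unpaired t-test assuming equal variances
t, p = stats.ttest_ind(young, old, equal_var=True)
print(f"t = {t:.2f}, P = {p:.4f}")  # t ~ 3.53, P ~ 0.0030
```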

 

P value depends on the following three factors:

1) If the mean difference is far from zero (i.e., the two groups differ greatly), the P value will be smaller

2) If the data are very consistent, i.e., the standard deviations are low, the P value will be smaller

3) If the sample size is large, the P value will be smaller

 

t-ratio can also be calculated by another formula which some students find easier:

 

However, the method via the SE of the difference explained earlier is advantageous, as it enables us to calculate the 95% CI of the difference. Yet another method involves calculating a Z-score. That method requires knowing the exact population mean and population standard deviation beforehand. As already explained, in the vast majority of cases (except for simulations) the properties of the true population, including its mean and standard deviation, remain unknown to us. Therefore, the practical utility of the Z-score method with empirical scientific data is negligible.

 

9.   Overlapping error bars

 

Suppose you have plotted mean ± SD in a bar chart, with SD shown as error bars extending to both sides of the mean. What if the SD error bars of the two groups overlap? Does it mean the difference between the sample means is not significant? As SD captures only the scatter or variability of individual elements, SD overlap tells us nothing. Suppose instead you have plotted mean ± SEM and the SEM error bars overlap; the only conclusion we can make is that P > 0.05. However, if the SEM error bars do not overlap, you cannot conclude the reverse, that the difference is significant: it could be significant or not. Suppose you have plotted mean ± 95% CI and the 95% CI error bars overlap; this tells us nothing, as the difference could be significant or not. What if the 95% CI error bars do not overlap? That would mean P < 0.05 and the difference is statistically significant. Therefore, to know whether the difference between two group means is statistically significant, the easiest way is to plot mean ± 95% CI in a bar chart and see whether the 95% CI error bars overlap. If they do not overlap, a crisp conclusion that P < 0.05 and the difference is statistically significant can be made without performing any further tests. What if they do overlap? In that case you can calculate the standard error of the difference between the sample means, compute the 95% CI of the difference, and make inferences about the P value (if the range includes 0, then P > 0.05; if it does not include 0, P < 0.05).

 

A manual t-test is no more advantageous than the conclusion reached from the 95% CI. Computational t-tests, however, have an advantage: they give the exact P value, letting us know whether we have significant evidence of a difference (not evidence of a significant difference). The exact P value also enables us to spot cases of borderline significance (for example, P = 0.0499; I would be sceptical of statements like ‘differences were found to be significant with P < 0.05’).

 

10. Paired t-test

 

In situations where a corresponding measurement is made between individual elements of the two groups, a matched (or paired, i.e., dependent) t-test is done. Examples include before-and-after analyses, or matched control vs. treated subjects (for example, treatment on the left eye only and no treatment on the right eye). Another example is each student’s individual performance in mid-semester test 1 and mid-semester test 2.

 

Assumptions for paired t-test

1)   Random samples

2)   Accurate data

3)   Independent measurements, with each pair independent of the other pairs

4)   Differences between matched values follow a roughly Gaussian distribution. Note that the individual measurements need not be assumed to have come from Gaussian populations; this assumption is explicitly about the differences between the two measurements.

 

Note also that the assumption of equal variances needed for the unpaired t-test is not required for the paired t-test.

 

Let us consider an example from Darwin, 1876. He wanted to know whether self-fertilized or cross-fertilized seeds produce taller plants. In each pot he planted one self-fertilized seed and one cross-fertilized seed. By doing this, Darwin controlled for any differences in temperature, soil, light intensity, etc., as all would be the same within a pot.

Cross-fertilized   Self-fertilized   Difference
23.500             17.375             6.125
12.000             20.375            -8.375
21.000             20.000             1.000
22.000             20.000             2.000
19.125             18.375             0.750
21.500             18.625             2.875
22.125             18.625             3.500
20.375             15.250             5.125
18.250             16.500             1.750
21.625             18.000             3.625
23.250             16.250             7.000
21.000             18.000             3.000
22.125             12.750             9.375
23.000             15.500             7.500
12.000             18.000            -6.000

The above data can be effectively presented as a before-after plot, showing the change in each individual measurement (created using Excel):

[Before-after plot of the paired cross- vs. self-fertilized values]

Let us first define our null hypothesis and alternative hypotheses:

H0: Mean difference between paired observations is zero

Ha: Mean difference between paired observations is not zero

 

11. 95% Confidence Interval of the difference between sample means

First, the differences between each pair of values are calculated (third column of the earlier table). These differences are treated as raw data, and their mean and SEM are calculated exactly as we would for any other set of data. The 95% CI is calculated exactly as we would for a mean (t* × SEM). In our example,

 

Mean of differences=2.62 inches

SEM=1.22 inches

 

95% CI: (0.003639 inches to 5.230 inches). As the 95% CI does not include zero, the value under our null hypothesis, we can instantly conclude that the P value must be < 0.05. However, the lower limit of 0.003639 is very close to 0, so we can infer that this is only a ‘borderline significance.’
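The calculation can be reproduced as follows (a sketch; SciPy supplies the t* constant, the rest is standard-library Python):

```python
import math
from statistics import mean, stdev
from scipy import stats

cross = [23.500, 12.000, 21.000, 22.000, 19.125, 21.500, 22.125,
         20.375, 18.250, 21.625, 23.250, 21.000, 22.125, 23.000, 12.000]
selfed = [17.375, 20.375, 20.000, 20.000, 18.375, 18.625, 18.625,
          15.250, 16.500, 18.000, 16.250, 18.000, 12.750, 15.500, 18.000]

diffs = [c - s for c, s in zip(cross, selfed)]   # treated as raw data
n = len(diffs)
mean_d = mean(diffs)                   # mean of differences, ~ 2.62
sem_d = stdev(diffs) / math.sqrt(n)    # SEM of differences, ~ 1.22
t_star = stats.t.ppf(0.975, n - 1)     # ~ 2.145 for df = 14
lo, hi = mean_d - t_star * sem_d, mean_d + t_star * sem_d
print(f"95% CI: {lo:.6f} to {hi:.3f} inches")
```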

 

12. Paired t-test

 

For the paired t-test, we need to compute the t-score, which is very easy to calculate: t-score = mean of differences / SEM of the differences, exactly as in the unpaired t-test.

 

As calculated in last section, Mean differences=2.62 inches and SEM=1.22 inches. Therefore, t-score:

= 2.62/1.22

= 2.15

 

Next step is to look up t-distribution table for critical t-value given a significance level and df. As we have 15 matched measurements, n=15, and df=14. Significance level is 0.05 as usual.

 

 

The critical t-value is 2.145. As the critical t-value from the table (2.145) is less than the t-ratio calculated earlier from our data (2.15), we conclude that P < 0.05; we reject the null hypothesis and conclude that the difference between the sample means is statistically significant. But you should promptly note that it is only a “borderline significance.” Remember that exactly the same conclusion had already been inferred from the 95% CI in the last section, so the t-test is not really required for testing statistical significance. Had our significance level been 0.02, the t-value from the table would be 2.624 (one column to the right of our earlier value), which is far higher than the calculated t-ratio (2.15), so the exact P value must be > 0.02. That means our P value must lie somewhere between 0.02 and 0.05. The exact P value calculated by software (one can use a web-based calculator, https://www.graphpad.com/quickcalcs/pValue1/) is 0.0497, which is indeed less than 0.05 and indeed pretty borderline.
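The exact P value can be reproduced in one call with SciPy's `ttest_rel` (a sketch using Darwin's paired data from the table above):

```python
from scipy import stats

cross = [23.500, 12.000, 21.000, 22.000, 19.125, 21.500, 22.125,
         20.375, 18.250, 21.625, 23.250, 21.000, 22.125, 23.000, 12.000]
selfed = [17.375, 20.375, 20.000, 20.000, 18.375, 18.625, 18.625,
          15.250, 16.500, 18.000, 16.250, 18.000, 12.750, 15.500, 18.000]

# Paired t-test: equivalent to a one-sample t-test on the differences
t, p = stats.ttest_rel(cross, selfed)
print(f"t = {t:.2f}, P = {p:.4f}")  # t ~ 2.15, P ~ 0.0497
```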

 

13. Summary

 

a. For hypothesis testing involving the means of two groups, a 95% CI of the difference between sample means suffices; t-tests are completely optional.

b. The standard error of the difference between two sample means is computed using the equation SE = √(s1²/n1 + s2²/n2).

c. The width of the 95% CI is the standard error × t*.

d. If the 95% CI of the mean difference does not include zero (the null-hypothesis value), the difference can be concluded to be statistically significant and P < 0.05.

e. The t-ratio is the ratio of the difference between the two sample means to the standard error of that difference.

 

Quadrant-III: Learn More/ Web Resources / Supporting Materials:

  1. A web-based t-test calculator (for both paired and unpaired variants) is available at https://www.graphpad.com/quickcalcs/ttest1.cfm
  2. One- and two-tail t-distribution values: http://www.statisticshowto.com/tables/t-distribution-table/
  3. Bayesian t-tests: https://arxiv.org/abs/1704.02479
  4. An online calculator for the 95% CI of a mean: https://www.graphpad.com/quickcalcs/CImean1/?Format=SD
  5. Online calculator to compute an exact P from a t-score or F-score: https://www.graphpad.com/quickcalcs/pValue1/