17 Statistical Hypothesis Testing and P Values
Felix Bast
1. Introduction
Statistical hypothesis testing is the current gold standard of scientific methodology and a key concept in inferential statistics. It involves defining two contrasting hypotheses, calculating a P value from the sample data, and deciding the fate of the two hypotheses by comparing the P value with a threshold cut-off, the significance level. The most common forms of error in hypothesis testing are Type I and Type II errors, and these can be balanced by an appropriate choice of significance level. Most statisticians agree that P value-based significance testing is overrated and that reporting 95% confidence intervals is a better approach in most cases. The many ways in which P values can be artificially decreased, so-called P-hacking, are common across scientific disciplines and constitute scientific misconduct.
2. Learning Outcomes:
a) To learn the concepts of statistical hypothesis testing
b) To learn about significance levels and how to choose the right significance level
c) To learn about Type I, Type II and Type III errors
d) To learn about P values and discern a number of P value fallacies
e) To learn about statistical power and the False Discovery Rate
f) To learn the relationship between confidence intervals and P values
3. Statistical Hypothesis Testing
Scientific methodology was formalized with Francis Bacon's "Novum Organum", published in 1620. Subsequently, Karl Popper introduced the concept of falsifiability, that scientific claims must be testable and capable of being refuted, and Thomas Kuhn introduced the concept of the paradigm shift, that science progresses through unconventional 'revolutions' of ideas. How do we distinguish science from pseudoscience? The current gold standard for experimental scientific disciplines is statistical hypothesis testing, which is easy to comprehend with a simple analogy.
Let us design a very simple experiment that tests how 'lucky' you are at a particular moment by tossing a coin. By doing this coin toss experiment you are effectively testing two contrasting hypotheses: 1) you are not lucky and 2) you are lucky. There will be ten toss trials, and even before the first coin is flipped you decide that if you get 6 heads or more, you are deemed lucky. After flipping the coin 10 times, you get 6 heads and 4 tails. Conclusion? Hypothesis No. 1, that 'you are not lucky', is rejected (beware of double negatives!) and you conclude that you are lucky. Suppose you got only 5 heads; then your conclusion would be that you 'fail to reject hypothesis 1, that you are not lucky' (triple-negative jargon for saying you cannot claim to be lucky). Now suppose that, instead of accepting this, you take a correction pen and change your original design statement from '6 heads or more' to '5 heads or more' and conclude that you are lucky. Of course, this is not fair; this is cheating, and this form of misconduct is very common in scientific research. Another option is, instead of concluding from 5 heads in 10 trials that you are not lucky, to increase the number of trials: you keep tossing the coin an 11th time, a 12th and so on until you get a 6th head on the 13th toss, and you suddenly stop the experiment there to conclude that you are lucky. Again, you take the correction pen and change the original proposition from "There will be ten toss trials" to "There will be thirteen toss trials". Of course, this is cheating too. What is so special about 6 heads? Nothing, really. It is simply our threshold for crisply deciding success from failure.
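As a quick sketch, the chance of crossing the pre-declared 6-head threshold by luck alone can be worked out from the binomial distribution, for example in Python with SciPy; note how quietly lowering the threshold to 5 heads inflates that chance:

```python
from scipy.stats import binom

n_tosses = 10          # number of coin tosses fixed in advance
p_fair = 0.5           # probability of heads when you are "not lucky" (the null hypothesis)

# Probability of reaching the pre-declared threshold of 6 or more heads by chance alone.
p_6_or_more = binom.sf(5, n_tosses, p_fair)   # P(X >= 6), about 0.377

# Probability if the threshold is quietly lowered to 5 or more heads after the fact.
p_5_or_more = binom.sf(4, n_tosses, p_fair)   # P(X >= 5), about 0.623

print(f"P(>=6 heads | fair coin) = {p_6_or_more:.3f}")
print(f"P(>=5 heads | fair coin) = {p_5_or_more:.3f}")
```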
Statistical hypothesis testing follows the following four steps sequentially:
1. Define alpha (α), the threshold P value
2. Define null hypothesis (H0)
3. Define alternative hypothesis (Ha)
4. Calculate P value, and decide the fate of hypotheses
The first step is defining the threshold, alpha. This threshold is exactly like the threshold of 6 heads in our example: 6 heads out of 10 trials constitutes 60% 'confidence', so if you get 6 heads or more you have 60% confidence to say you are lucky. Experimental scientists, however, want far more confidence so that their conclusions are independently verifiable (Karl Popper's falsifiability). Typically they choose a confidence level of 95% (or 9.5 out of 10, in the coin analogy). This means a 5% tolerance for reaching a wrong conclusion. This tolerance for error is what is called the threshold P value, or alpha. The P in P value stands for probability, and it is always expressed as a number between 0 and 1; a 5% tolerance for error expressed as a percentage is a tolerance of 0.05 expressed as a probability. In practice, the threshold value alpha is almost always set to 0.05, an arbitrary value that has been widely adopted. Ideally, you should set this value based on the relative consequences of falsely finding a difference (a false positive) versus missing a true difference (a false negative); more on this follows.
The null hypothesis is usually the opposite of what you are trying to prove. In our first example, you were trying to test the proposition that you are lucky, so the null hypothesis is that you are not lucky. If you are testing the efficacy of a drug, the null hypothesis is that the drug is ineffective. If you are comparing the means of two groups, the null hypothesis is that the group means are equal. For a clinical test, the null hypothesis is that the result is negative. For spam detection in email, the null hypothesis is that the email is not spam.
The alternative hypothesis is the hypothesis you are usually trying to prove. In the examples above, the alternative hypotheses are that you are lucky, that the drug is effective, that the group means are not equal, that the clinical test result is positive, or that the email is spam.
After completing your experiment and performing the appropriate statistical test, you get a P value. It is important to note that P values and statistical tests only assess the null hypothesis defined in step 2 above (they do not test the alternative hypothesis directly); therefore, we can draw conclusions only about the null hypothesis. If you get a P value (for example, 0.04) that is less than the predefined tolerance for error (0.05) set in step 1, you may 'reject the null hypothesis' and state that the results are statistically significant. If your P value (for example, 0.06) is higher than the significance level alpha, the conclusion is 'do not reject the null hypothesis' and the results are not statistically significant. Remember that statistically significant does not mean scientifically or clinically significant. You also cannot conclude that the null hypothesis is true; all you can say is that you 'do not have sufficient evidence to reject the null hypothesis'. This sort of argument, often used in logic and philosophy, is called argument by contradiction.
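The four steps can be illustrated with a small Python sketch (using SciPy; the data here are purely hypothetical numbers made up for illustration):

```python
import numpy as np
from scipy.stats import ttest_ind

# Step 1: define the significance level before looking at the data.
alpha = 0.05

# Steps 2 and 3: H0 - the two group means are equal; Ha - they are not equal.

# Hypothetical measurements for two groups (illustrative numbers only).
rng = np.random.default_rng(1)
group_a = rng.normal(loc=5.0, scale=1.0, size=20)
group_b = rng.normal(loc=5.8, scale=1.0, size=20)

# Step 4: compute the two-sided P value and compare it with alpha.
t_stat, p_value = ttest_ind(group_a, group_b)
if p_value < alpha:
    print(f"P = {p_value:.4f} < {alpha}: reject H0 (statistically significant)")
else:
    print(f"P = {p_value:.4f} >= {alpha}: fail to reject H0 (not significant)")
```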
An intuitive analogy is a criminal trial. Here H0 is 'the accused is innocent' and Ha is 'the accused is guilty'. A judge or juror can pronounce the verdict as guilty (reject the null hypothesis) or not guilty (do not reject the null hypothesis). She can never pronounce that the accused is proven criminal (accept the alternative hypothesis) or proven innocent (accept the null hypothesis). Also, the verdict is crisp and clear, guilty or not guilty; she cannot say that the accused is 'probably guilty' or 'I don't know', as no one likes uncertainty in judgements. Similarly, the conclusion of a statistical hypothesis test is crisp and clear because it is based on the threshold significance level: a test that returns a P value of 0.0499 is deemed significant, while a P value of 0.0501 is not significant. Many statisticians criticise making this sort of crisp conclusion based on a threshold P value and over-emphasizing the term 'significant' in papers. Many investigators do not report the exact P value (like P=0.0499, which is in fact very much borderline) and instead make ambiguous statements like 'results are significant (P<0.05)'.
5. Type I and Type II Errors
As already stated, a significance level of 0.05 means a 95% confidence level and a 5% tolerance for error. Even though 5% seems small, it means that out of 100 statistical tests in which the null hypothesis is actually true, about 5 will lead to the erroneous conclusion that the result is significant. For example, you compare the means of two groups, obtain a P value <0.05, reject the null hypothesis and conclude that the two means are significantly different, but in reality the means are not different (the null hypothesis of no difference in means is wrongly rejected). Another example: you take an ELISA blood test to see whether you are HIV positive or negative. The report says positive (the null hypothesis that you have no HIV infection is rejected), but this result is erroneous. You will never know whether it is an error unless you repeat the test many times or you already know the reality (for example, in simulations). This type of error, in which the null hypothesis is incorrectly rejected, is called a Type I error.
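A rough simulation makes this concrete (Python with SciPy; all numbers are illustrative): if two samples are repeatedly drawn from the same population, so that the null hypothesis is true every time, roughly 5% of the tests still come out 'significant':

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
alpha = 0.05
n_experiments = 10_000

false_positives = 0
for _ in range(n_experiments):
    # Both samples come from the SAME population, so the null hypothesis is true.
    a = rng.normal(loc=100, scale=15, size=30)
    b = rng.normal(loc=100, scale=15, size=30)
    if ttest_ind(a, b).pvalue < alpha:
        false_positives += 1   # a Type I error: H0 wrongly rejected

print(f"Observed Type I error rate: {false_positives / n_experiments:.3f}")  # close to 0.05
```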
We can also have another kind of error. The test reports that there is no difference between the two group means, but in reality there is a difference. A person really is HIV positive, but the test says he is HIV negative (a 'false negative'). A person really is guilty, but the juror pronounces the verdict as not guilty. An email truly is spam, but the spam filter lets it through as not spam. This type of error is called a Type II error or false negative: incorrectly failing to reject the null hypothesis. You have made a Type II error when there really is a difference (or association, or correlation) overall, but random sampling caused your data not to show a statistically significant difference, so your conclusion that the two groups are not really different is incorrect.
It is important to note that we can never know, when analysing a sample, whether an error has occurred, unless the analysis is a simulation in which the true population values are known precisely, which is essentially never the case with real experimental data.
There is yet another type of error, Type III (or Type S). This happens when the direction of the effect is opposite to the predicted direction (a particular risk when one-tailed P values, explained shortly, are used). An example of such an error occurred in the Cardiac Arrhythmia Suppression Trial (CAST). The anti-arrhythmic drugs being tested were previously thought either to prolong patients' lives or to have no effect, so the investigators chose to report one-tailed P values. They got a high one-tailed P value and concluded the drugs had no effect on survival. In fact, patients given the drugs were four times more likely to die than those given placebo, so the effect went in the opposite direction!
6. Choosing the level of significance
The possible outcomes of a hypothesis test can be summarized in a two-by-two table:

                               Statistically significant    Not statistically significant
Null hypothesis is true        [A] False positive           [B] True negative
Null hypothesis is false       [C] True positive            [D] False negative

In this table, the quadrant denoted [A] signifies false positives. The sum of quadrants [A] and [B] is the total of the first row, 'null hypothesis is true'. A/(A+B) is α, the level of significance: the fraction of false positives out of all cases where the null hypothesis is true. If the null hypothesis is true, what is the probability of incorrectly rejecting it? That probability is the significance level.
As already explained, a significance level (α) of 0.05 (95% confidence) is chosen almost universally in experimental scientific disciplines, especially in the biological and environmental sciences. This traditional significance level is sometimes called the 'two sigma' level, because about 95% of values are expected to lie within two standard deviations of the mean in a normal distribution. Ideally, this level should be decided based upon the consequences of the two main kinds of errors, Type I and Type II, explained earlier.
Use a low α (like 0.01) to have less chance of making a Type I error (false positive), but beware: the chance of making a Type II error (false negative) increases. The consequence of setting α=0.01 is more false negatives. Examples of Type II errors include more spam reaching the inbox, a criminal being set free, a test reading negative for an HIV-positive person, a doped athlete being cleared, and an effective drug being declared ineffective so that its development is aborted. If this sort of false negative is tolerable, choosing a low α is justified. A low α (like 0.01) is therefore ideal for situations where false negatives are tolerated, such as criminal justice ("it is better to let many guilty persons go free than to falsely convict one innocent person"), spam filtering (it is better to have some spam in the inbox than to have that 'article acceptance' or 'job selection' email automatically flagged as spam and deleted), and a clinical trial of a 'me-too' drug (it is better to abort the development of an effective drug than to market an ineffective, useless drug when better alternatives are already available).
Use a high α (like 0.1) to have less chance of making a Type II error (false negative), but beware: the chance of making a Type I error (false positive) increases. The consequence of setting α=0.1 is more false positives. Examples of Type I errors include a good email being declared spam and deleted, an innocent person being punished, a test reading positive for an HIV-negative person, an innocent athlete being declared doped and banned for life, and an ineffective drug being declared effective and marketed. A high α (like 0.1) is therefore ideal for situations where false positives are tolerated, such as civil litigation (it is better to wrongly penalise some innocent parties than to let one hugely culpable corporation go free) and a clinical trial of a novel drug (it is better to risk marketing an ineffective, useless drug than to abort the development of an effective drug when no drugs are available in the market).
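The trade-off can be sketched numerically. The following approximation (Python with SciPy, using the normal approximation to a two-sample test; the effect size and sample size are assumed values) shows how lowering α reduces false positives but raises the Type II error rate β:

```python
import numpy as np
from scipy.stats import norm

def approx_power(effect_size, n_per_group, alpha):
    """Approximate power of a two-sided two-sample z-test (normal approximation)."""
    z_crit = norm.ppf(1 - alpha / 2)
    ncp = effect_size * np.sqrt(n_per_group / 2)     # standardized expected difference
    return norm.cdf(ncp - z_crit) + norm.cdf(-ncp - z_crit)

for alpha in (0.01, 0.05, 0.10):
    power = approx_power(effect_size=0.5, n_per_group=30, alpha=alpha)
    print(f"alpha = {alpha:>4}: power ~ {power:.2f}, "
          f"Type II error rate (beta) ~ {1 - power:.2f}")
```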
Note that the traditional two sigma level of 0.05 is not universally followed in other disciplines, particularly particle physics. To claim evidence of detecting a previously known particle, the accepted level of significance is 3 sigma (α≈0.003). To claim discovery of a new kind of particle (as in the paper describing the discovery of the Higgs boson), the accepted level of significance is 5 sigma (α≈0.0000003).
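For reference, these sigma levels can be converted to tail probabilities with the standard normal distribution (a Python/SciPy sketch; note that the 5-sigma figure quoted above corresponds to the one-sided tail area, the convention usually used in particle physics):

```python
from scipy.stats import norm

# Tail probabilities for the sigma conventions mentioned above.
for sigma in (2, 3, 5):
    one_sided = norm.sf(sigma)          # P(Z > sigma)
    two_sided = 2 * norm.sf(sigma)      # P(|Z| > sigma)
    print(f"{sigma} sigma: one-sided alpha = {one_sided:.2e}, two-sided alpha = {two_sided:.2e}")
```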
7. P-values: Correct interpretations and fallacies
The definition of the P value starts with the proposition 'If the null hypothesis is true'. The full definition is: "If the null hypothesis is true, what is the probability that random sampling would lead to a difference as large as, or larger than, that observed in this study?"
Let us consider an example. A study compared the means of two groups and obtained a P value of 0.03.
That means that if the two population means were identical (i.e., if the null hypothesis were true), there would be a 3% chance of observing a difference as large as, or larger than, the one observed in this study. As that chance is very low (only 3 out of 100), you reject the null hypothesis that the means are identical. Equivalently, random sampling from identical populations would lead to a difference smaller than the one you observed in 97% of experiments, and as large or larger in 3% of experiments. The P value answers the question: "In an experiment of this size, if the populations really have the same mean, what is the probability of observing at least as large a difference between sample means as was, in fact, observed?"
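A permutation test makes this definition concrete: if the null hypothesis is true, the group labels carry no information, so we can shuffle them many times and ask how often a difference at least as large as the observed one appears by chance. A small Python sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical measurements for two groups (illustrative numbers only).
group_x = np.array([5.1, 4.9, 5.6, 5.8, 5.3, 5.5, 5.0, 5.7])
group_y = np.array([6.0, 5.9, 6.3, 5.6, 6.1, 6.4, 5.8, 6.2])
observed_diff = abs(group_x.mean() - group_y.mean())

# Under H0 the labels are interchangeable, so shuffle them repeatedly and count
# how often a difference at least as large as the observed one arises by chance.
pooled = np.concatenate([group_x, group_y])
n_x = len(group_x)
n_shuffles = 10_000
count = 0
for _ in range(n_shuffles):
    rng.shuffle(pooled)
    diff = abs(pooled[:n_x].mean() - pooled[n_x:].mean())
    if diff >= observed_diff:
        count += 1

print(f"Permutation P value ~ {count / n_shuffles:.4f}")
```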
P values are confusing to many scientists, partly because the null hypothesis is almost always false in some strict sense, and partly because the logic runs backwards, from population to sample (a deductive, probabilistic argument), which is counterintuitive; hypothesis testing works on the principle of argument by contradiction.
There are two types of P values, two-tailed and one-tailed. A two-tailed P value means the critical region extends to both extremes of the distribution (significantly large or significantly small); a one-tailed P value extends to only one of the two tails. The difference between the two can be spotted by looking at your null hypothesis. If the null hypothesis contains the symbol =, explicitly or implicitly (for example, 'there are no differences between means' is equivalent to 'mean of group x = mean of group y'), the P value should be two-tailed. If the null hypothesis contains the symbol ≤ or ≥, explicitly or implicitly (for example, 'there is no fever' is equivalent to 'temperature ≤ 37°C', or 'there is no profit' is equivalent to 'revenue ≤ costs'), the P value should be one-tailed. A one-tailed P value makes an explicit prediction of the direction of the effect as part of the experimental design. That predicted direction may well be wrong, leading to the Type III error explained earlier. When in doubt, always stick with two-tailed P values. One-tailed P values are always smaller than the corresponding two-tailed P values, so investigators are tempted to use them to 'prove' their hypothesis, and some reviewers criticise any use of one-tailed P values, no matter how well justified.
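In recent versions of SciPy the choice of tail is made explicit through the `alternative` argument of the test function; the sketch below (with hypothetical data) shows that the one-tailed P value is half the two-tailed value when the observed effect lies in the predicted direction:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(7)
# Hypothetical data: group_b is drawn with a slightly higher mean.
group_a = rng.normal(5.0, 1.0, 25)
group_b = rng.normal(5.5, 1.0, 25)

p_two_sided = ttest_ind(group_a, group_b, alternative='two-sided').pvalue
# One-tailed test where the direction (mean of A below mean of B) was predicted in advance.
p_one_sided = ttest_ind(group_a, group_b, alternative='less').pvalue

print(f"two-sided P = {p_two_sided:.4f}")
print(f"one-sided P = {p_one_sided:.4f}   (half the two-sided value here)")
```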
The following is a list of P value fallacies (incorrect interpretations):
1. The P value is the probability of rejecting the null hypothesis
2. A high P value proves that the null hypothesis is true.
3. 1-P is the probability that the results will hold up when the experiment is repeated
4. 1-P is the probability that the alternative hypothesis is true
5. The P value is the probability that the null hypothesis is true
6. P value is the probability that the result was due to sampling error
Note that all six of the above propositions are incorrect interpretations of P values.
8. P-hacking
Remember the coin toss experiment described at the beginning to test how lucky a person is? Two ways the person can cheat are by changing the threshold value and by increasing the number of trials. Similarly, experimental scientists sometimes unethically resort to many practices to decrease the P value (a low P value is desirable, since a P value less than alpha (P<0.05) rejects the null hypothesis and leads to a conclusion of statistical significance); a simulation after the list below illustrates how one such trick inflates false positives. Some of these tricks include:
- Trying different tests, parametric and nonparametric, and picking the one with the lowest P value. The statistical test to be used (for example, an unpaired t-test) should be clearly specified as part of the experimental design; it is not fair to try different tests and go with whichever gives the lowest P value.
- Using a dynamic sample size and stopping when P drops below the threshold. The sample size should be decided as part of the experimental design; it is not fair to keep adding data until the P value falls below the threshold.
- Slicing and dicing the dataset to get the lowest P value. Obviously an unethical practice: we should analyse the whole dataset as planned, not subsets chosen after the fact.
- Cherry-picking the dataset to get the lowest P value. Obviously an unethical practice: we should use all elements of the dataset rather than picking values that affirm our preconceived conclusions.
- Playing with outliers, removing one or more selectively to get the desired P value. Obviously an unethical practice: outliers should not be removed manually; if outliers need to be removed, a formal statistical test such as Grubbs' test can be used.
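The simulation promised above illustrates the second trick, the dynamic sample size: even though the null hypothesis is true in every run, repeatedly peeking at the P value and stopping as soon as P < 0.05 pushes the false positive rate well above the nominal 5% (Python with SciPy; the stopping schedule is an arbitrary illustrative choice):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(3)
alpha = 0.05
n_experiments = 2_000
start_n, max_n = 10, 50   # keep adding 5 subjects per group until P < 0.05 or max_n is reached

false_positives = 0
for _ in range(n_experiments):
    a = list(rng.normal(0, 1, start_n))
    b = list(rng.normal(0, 1, start_n))   # same population: H0 is true
    while True:
        if ttest_ind(a, b).pvalue < alpha:
            false_positives += 1          # "significant" result obtained by peeking
            break
        if len(a) >= max_n:
            break
        a.extend(rng.normal(0, 1, 5))
        b.extend(rng.normal(0, 1, 5))

print(f"False positive rate with optional stopping: {false_positives / n_experiments:.3f}")
```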
All of the above amount to cheating and constitute scientific misconduct. As already mentioned, statistical hypothesis testing based on P values rests on the threshold significance level, which is nothing but an arbitrary value. Statistical significance is usually denoted by the symbol * in graphs, and over-emphasizing statistical significance and the * is sometimes referred to as 'stargazing'. P values depend strongly on sample size: it is easy to get a very low P value with a huge sample and a tiny effect, or with a tiny sample and a huge effect. The P value answers the question 'Is there significant evidence of a difference?', which is different from 'Is there evidence of a significant difference?'; the P value does not answer the latter. The difference between sample means might be tiny, or scientifically negligible, and yet the decision based on the P value could still be 'statistically significant'. To add to the woes, many studies have shown that P values are not very reproducible. Instead of P values alone, the current consensus among statisticians is to report 95% confidence intervals.
9. P-values and Confidence Intervals
Confidence intervals and P values are closely connected. A 95% confidence interval corresponds to the 0.05 significance level, as already explained. The question is: does the range of values defined by the 95% confidence interval include the value specified by the null hypothesis? For example, when comparing the means of two groups, the null hypothesis is that the group means are the same, i.e., that the difference between the two group means is zero. If the 95% confidence interval for the difference between sample means does not include zero (our null hypothesis value), then P<0.05, indicating a statistically significant difference; conversely, if it includes zero, then P>0.05. Let us consider another example. A random sample of 10 individuals volunteered to have their body temperatures measured to check whether they have normal body temperature. The mean body temperature was found to be 36.8°C. The half-width of the 95% confidence interval was found to be 0.3°C, so the 95% CI about the sample mean ranges from (36.8-0.3) to (36.8+0.3), that is, 36.5°C to 37.1°C. Our null hypothesis value (no fever, normal body temperature) is 37°C. Does the 95% CI include 37°C? Yes, it does. So P>0.05, and we conclude that there is no statistically significant difference from the null hypothesis value.
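The correspondence can be checked directly. The sketch below (Python with SciPy) uses hypothetical temperature readings chosen to reproduce the numbers in the example above (mean 36.8°C, 95% CI of roughly 36.5°C to 37.1°C) and shows that the CI contains 37°C exactly when the one-sample test gives P > 0.05:

```python
import numpy as np
from scipy import stats

# Hypothetical body-temperature readings for 10 volunteers (degrees Celsius).
temps = np.array([36.3, 37.3, 37.1, 36.2, 36.8, 37.4, 36.4, 37.0, 36.5, 37.0])
null_value = 37.0   # normal body temperature under the null hypothesis

mean = temps.mean()
sem = stats.sem(temps)
ci_low, ci_high = stats.t.interval(0.95, df=len(temps) - 1, loc=mean, scale=sem)

p_value = stats.ttest_1samp(temps, null_value).pvalue

print(f"mean = {mean:.2f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
print(f"CI includes {null_value}? {ci_low <= null_value <= ci_high}")
print(f"one-sample t-test P value = {p_value:.3f}")   # > 0.05 exactly when the CI includes 37
```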
10. FDR and Statistical Power
We have already defined A/(A+B) as α, the level of significance: the fraction of false positives out of all cases where the null hypothesis is true. A related quantity is the False Discovery Rate (FDR), defined as A/(A+C): the fraction of false positives out of all statistically significant results. If a result is statistically significant, what is the probability that the null hypothesis is really true? Note that for both the significance level and the FDR the numerator is the same, the false positives [A], but the denominator differs: the significance level is computed over all cases where the null hypothesis is true (i.e., where in reality there is no effect or difference), whereas the FDR is computed over all results with a 'statistically significant' conclusion. The FDR depends on the prior probability, the context of the experiment, akin to Bayesian statistics; FDR and prior probability will be elaborated in the module on Bayesian statistics. Another related term is statistical power, C/(C+D): the fraction of statistically significant results out of all cases where the null hypothesis is false. Statistical power is the probability of obtaining a statistically significant result assuming a certain effect size in the population. Power depends on sample size, variability, the choice of α and the hypothesized effect size. High power is obtained when: 1) a large sample size is used, 2) the effect looked for is large (a large difference or a strong correlation), and 3) the data have little scatter (small variance and SD).
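In the long run, quadrant A is proportional to α times the fraction of tested hypotheses for which H0 is true, and quadrant C to the power times the fraction for which H0 is false, so FDR = A/(A+C) can be computed from assumed values of these quantities. A small sketch with purely illustrative numbers:

```python
def false_discovery_rate(prior_null, alpha, power):
    """Fraction of 'statistically significant' results in which H0 is actually true."""
    false_pos = alpha * prior_null            # quadrant A, as a long-run fraction
    true_pos = power * (1 - prior_null)       # quadrant C, as a long-run fraction
    return false_pos / (false_pos + true_pos)

# Assumed scenario: only 10% of tested hypotheses are truly non-null,
# alpha = 0.05 and power = 0.80 (all numbers are illustrative).
fdr = false_discovery_rate(prior_null=0.90, alpha=0.05, power=0.80)
print(f"FDR ~ {fdr:.2f}")   # ~0.36: over a third of 'significant' findings are false positives
```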
11. Summary
a. Statistical hypothesis testing is a four-step sequential process. The significance level should be defined first, then the null and alternative hypotheses. Finally, based on the obtained P value, the fate of the two hypotheses is decided. It is cheating to change the chosen significance level, to change the sample size mid-experiment, or to hack the P value in the numerous other ways described.
b. The significance level should be chosen based on the tolerance for the two types of errors: Type I (false positives) and Type II (false negatives).
c. The P value is defined by the question: "If the null hypothesis is true, what is the probability that random sampling would lead to a difference as large as, or larger than, that observed in this study?" It is more important to know how to interpret P values correctly than to know how they are calculated.
d. A low P value does not mean the difference is scientifically significant, or that the finding is interesting or warrants further funding. The P value answers the question 'Is there significant evidence of a difference?', not 'Is there evidence of a significant difference?'
e. If the 95% CI does not include the null hypothesis value, then P<0.05; if it does include the null hypothesis value, then P>0.05. This connection is important.
Quadrant-III: Learn More/ Web Resources / Supporting Materials:
1. Brief review of concepts of Statistical Hypothesis testing at PennState:
https://onlinecourses.science.psu.edu/statprogram/node/138
2. A good overview of hypothesis testing in GraphPad guide:
https://www.graphpad.com/guides/prism/7/statistics/index.htm?statistical_hypothesis_testing.htm
3. P values- an introduction https://www.statsdirect.com/help/basics/p_values.htm