16 Concepts of Population, Sample and Confidence Interval

Felix Bast

1. Introduction

 

The concepts of population and sample are ubiquitous in statistical analysis, and a clear distinction between them is essential for understanding everything that follows. In this module, both concepts are explained in detail. The theory of Confidence Intervals (CI) is introduced, along with how a CI connects a sample to its population via Student’s t-distribution. We will also learn how to compute the 95% CI for a number of statistical measures, including the sample mean (via parametric and non-parametric approaches), standard deviation, binomial distributions (proportions) and Poisson distributions (count data).

 

2. Learning Outcome

 

a.       To learn the concepts of population and sample

b.       To learn the theory of Confidence Intervals

c.       To learn calculation of 95% Confidence Intervals for mean, SD, Poisson and Binomial data

d.      To learn calculation of CI of mean through resampling (bootstrapping) approach

 

3. Population and Sample

 

The term population carries different meanings in different contexts. For example, consider the birthweights of newborns in a particular city over a one-year period, during which there were 10,000 childbirths. Here the population consists of all 10,000 newborns. Instead of measuring the weights of all of them (which would not be challenging in this case), let’s decide to measure the birthweights of only 50 randomly chosen newborns. This subset is what is known in statistics as a sample. We can use the sample as a proxy for the entire population and extrapolate measurements made on the sample to draw conclusions about the population.

In biomedical research, the population is not only a much larger group of target people (for example, “males in India”) but also includes future generations, and we therefore assume that it is infinite. In experimental research, population means the ideal situation or underlying mechanism. For example, Gregor Mendel studied 929 pea plants in the F2 generation (his sample) to arrive at a generalized 3:1 phenotypic ratio (dominant:recessive) in the ideal population. His 3:1 phenotypic ratio is a model: a mathematical description of a simplified view of nature. He used his sample to infer an underlying phenomenon (his model) that is applicable to diploid organisms in general. Statistics goes from sample (specific observations) to population (general conclusion and model) and is therefore considered an example of logical induction. Probability, on the other hand, is an example of logical deduction, as it goes the other way around, from general to specific.

 

4. Population Vs. Sample Notations

 

In general, notations for statistical parameters use Greek letters for populations and Latin letters for samples.

Measure              Population (Greek)    Sample (Latin)
Mean                 µ                     x̄ (written m in the formulas of this module)
Variance             σ²                    s²
Standard Deviation   σ                     s
Median               ν                     x̃
Proportion           π                     p

 

5.   Theory of Confidence Intervals

 

Let’s return to our earlier example. We know precisely the birth weights of 10,000 newborns in a particular city over a year, and we consider these our population. We have also calculated the mean birth weight, 2.7 kg (the population mean). Out of the 10,000 newborns, we randomly select 50 (our sample) and calculate the sample mean. The sample mean could be more or less than the actual population mean. Let the sample mean be 2.6 kg with a standard deviation of 0.1 kg. We first calculate a statistic called the t-ratio:

 

t= (m-µ)/(s/√n)

 

where m is the sample mean, µ is the population mean and the denominator (s/√n) is the sample Standard Error of the Mean (SEM).

The t-ratio can be defined as the difference between the sample mean and the population mean, divided by the sample standard error of the mean.

 

t = (2.6-2.7)/(0.1/√50)

= -0.1/0.01414

= -7.07

With these particular numbers the t-ratio is well below zero, which marks this sample as an unusual one. For most random samples the sample mean will be close to the population mean, so the numerator of the t-ratio will be close to zero and the ratio will hover around zero. Now we draw another random sample of 50 newborns from the same population and calculate its t-ratio. We repeat this 100 times and plot the distribution of the t-ratios, just as we would plot a histogram. The distribution would look like this:

[Figure: distribution of t-ratios, centered on zero, with the central 95% of the area shaded between -t* and +t*]

Computing t-ratios in this way converts the approximately Gaussian distribution of sample means into a standardized, symmetric distribution, but it requires the true population mean (which usually remains unknown, except in simulation studies of this kind). As explained, the t-ratios are centered on zero (with a peak at zero), because most sample means are close to the population mean. The shape of the t-distribution depends on the degrees of freedom (df, equal to n-1). If 2.5% of the total area is chopped off at each tail (i.e., the most unusual t-ratios, whether too low or too high), the remaining area covers the range of t-ratios that includes 95% of samples. To retain exactly 95% of the area, we have to cut both tails at a particular t score; this score is called t* or t critical. In this case, with 49 degrees of freedom and significance level 0.05 (because we chop off the 5% most unusual values), t* is 2.01 (calculated using an online calculator as explained below). That means that if we cut the graph at -2.01 and +2.01 (shaded area in the figure), we get the range of t-ratios that includes 95% of samples. Because the shape of the t-distribution depends on the degrees of freedom, so does t*. For a particular significance level, we can look up t* in a t-distribution table or use an online calculator:

 

http://mathcracker.com/t_critical_values.php#results
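
The repeated-sampling idea described above can be tried out in a small simulation. The following is only a sketch under stated assumptions: the population is generated artificially as roughly Gaussian with a spread of 0.5 kg (the text does not give the population SD), the number of repeats is raised from 100 to 2000 to make the fraction more stable, and t* = 2.01 is taken from the text rather than computed.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical population: 10,000 birthweights (kg), roughly Gaussian.
# The mean of 2.7 kg follows the text; the spread of 0.5 kg is an assumption.
population = [random.gauss(2.7, 0.5) for _ in range(10_000)]
mu = statistics.mean(population)  # true population mean, known in a simulation

T_CRIT = 2.01   # t* for df = 49 at significance level 0.05 (value from the text)
N = 50          # sample size
REPEATS = 2000  # number of random samples drawn (assumed; the text uses 100)


def t_ratio(sample, mu):
    """t = (m - mu) / (s / sqrt(n)) for one sample."""
    m = statistics.mean(sample)
    s = statistics.stdev(sample)  # sample SD with n - 1 in the denominator
    return (m - mu) / (s / math.sqrt(len(sample)))


# Draw many random samples (without replacement) and collect their t-ratios
t_ratios = [t_ratio(random.sample(population, N), mu) for _ in range(REPEATS)]

# Fraction of t-ratios that fall between -t* and +t*
inside = sum(1 for t in t_ratios if -T_CRIT <= t <= T_CRIT)
fraction_inside = inside / REPEATS
print(f"Fraction of t-ratios within +/-{T_CRIT}: {fraction_inside:.3f}")
```

If the reasoning in the text holds, the printed fraction should come out close to 0.95.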

 

As stated earlier, t ratio = (m-µ)/(s/√n)

 

We can rearrange this equation to solve for µ, the true population mean. For the sake of brevity, derivation of following formula is omitted.

µ= m ± t* (s/√n)

where m is sample mean, t* is constant from t-distribution and s/√n is sample SEM
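
The omitted rearrangement can be sketched in a few lines. This is the standard algebraic manipulation rather than the author's own derivation; it starts from the observation that, for 95% of random samples, the t-ratio lies between -t* and +t*, and uses the same symbols as the formula above.

```latex
\begin{align*}
-t^{*} \;\le\; \frac{m-\mu}{s/\sqrt{n}} \;\le\; t^{*}
&\iff -t^{*}\,\frac{s}{\sqrt{n}} \;\le\; m-\mu \;\le\; t^{*}\,\frac{s}{\sqrt{n}} \\
&\iff m - t^{*}\,\frac{s}{\sqrt{n}} \;\le\; \mu \;\le\; m + t^{*}\,\frac{s}{\sqrt{n}}
\end{align*}
```

The last line is the range m ± t* (s/√n) written as a lower and an upper limit for µ.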

 

Given the sample mean m, the sample standard deviation s and the sample size n, we can define a range that includes the real population mean with a stated confidence. As already explained, t* depends on the desired confidence, which is conventionally set at 95%. If t* for 95% confidence is used, the ranges computed in this way from repeated samples would include the true population mean 95% of the time. This range is called the 95% Confidence Interval.

 

6. Confidence Limits, Levels and Intervals

 

The desired amount of statistical confidence is called the Confidence Level (CL). For example, if you want very high confidence in your results, you should choose a 99% Confidence Level as part of your experimental design. The Confidence Level is chosen before the experiment is conducted, and it is not ethical to change the CL after the data are generated. If you had chosen a 99% CL, you will generate a 99% Confidence Interval after the experiment. A Confidence Interval is a range (lower limit to upper limit) that expresses the precision of your sample measurement as an estimate of the true population value. The two values that bound this range, the lower limit and the upper limit, are called Confidence Limits.

 

7. Confidence Interval of Mean

 

The Confidence Interval of the mean tells you how precisely you have determined the sample mean as an estimate of the population mean. The width of the interval depends on the Confidence Level (CL): a CL of 99% gives more confidence than 95%, which in turn gives more confidence than 90%, but this confidence comes at the cost of a wider interval. As the Confidence Level increases, the Confidence Interval gets wider. For example, a 99% CI is a lot wider than a 90% CI.

 

Another statistic, the Standard Error of the Mean (SEM), is similar to the CI; the SEM also tells you how precisely you have determined the sample mean as an estimate of the population mean. However, the confidence level of the range mean ± SEM is low: only about 68% for large samples, and lower still for small samples. The SEM alone is therefore a weak way to quantify precision. The 95% CI is routinely used across the biological and environmental sciences.

 

In situations where the whole population is used to calculate a statistic, for example the mean of a population, a Confidence Interval makes no sense. Consider a class of 24 students with a mean mark of 11.8 out of 25. Here 11.8 is the population mean and we are 100% sure (confident) that the true population mean is this value, no question about it. However, if I randomly select 8 of the 24 students and calculate their mean mark, that would be a sample mean, and a CI makes sense in such situations.

 

The 95% CI of a sample mean can be calculated using two methods. The parametric (distribution-dependent) method assumes that our sample comes from a roughly Gaussian distribution. This method uses a constant from the t-distribution (t*) for calculating the 95% CI of the sample mean. We have already explained how this formula is derived in an earlier section. The formula is:

 

m ± t* (s/√n)

 

where m is sample mean, t* is a constant from t-distribution, s is sample standard deviation and n is sample size. Also note that s/√n = SEM (Standard Error of the Mean)

 

We can calculate 95% CI of sample mean given sample mean, sample standard deviation and sample size; raw data is not necessary.

 

A Confidence Interval is calculated as sample mean ± W, where W = t* (s/√n) is the width obtained from the formula above, and it is written as a range (Lower Limit to Upper Limit).

E.g., if W for the 95% CI is 6 and the sample mean m = 51, the 95% CI is (45 to 57). Notation like 51±6, which is commonly used to describe standard deviation, is not used to describe Confidence Intervals.

 

Let us consider an example of calculating the 95% CI of a sample mean: test marks data with mean = 12.81, s = 4.905 and n = 24. As n = 24, df is 23.

 

Let’s first look up t* for 95% Confidence Level with df=23:

 

t* is 2.0687

Let’s now calculate W, the width of the 95% CI (the distance from the mean to each limit)

W = t* (s/√n)

= 2.0687 * (4.905 / √24)

= 2.0687 * (4.905 / 4.899)

= 2.0687 * 1.0012

= 2.0712

Finally, 95% CI of mean is

Mean – 2.0712 to Mean + 2.0712

(10.7388 to 14.8812)
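
The same calculation can be sketched in a few lines of Python. The function name `ci_of_mean` is only illustrative, and t* is passed in as a number (the value 2.0687 quoted above for df = 23) rather than computed.

```python
import math


def ci_of_mean(mean, sd, n, t_crit):
    """Return (lower, upper) limits of the CI: mean +/- t* (s / sqrt(n))."""
    sem = sd / math.sqrt(n)  # Standard Error of the Mean
    w = t_crit * sem         # distance from the mean to each limit
    return mean - w, mean + w


# Test marks data from the text; t* = 2.0687 is the value quoted for df = 23
lower, upper = ci_of_mean(mean=12.81, sd=4.905, n=24, t_crit=2.0687)
print(f"95% CI of mean: {lower:.2f} to {upper:.2f}")
```

Only the summary statistics are needed, which illustrates why raw data is not required for this method.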

 

If the mean and standard deviation are not given and only raw data are available, we first have to calculate the mean and SD from the data and then calculate the 95% CI.

 

The Microsoft Excel formula for t* in the CI formula is

=TINV(alpha, n-1)

where alpha is the level of significance (0.05 to calculate the 95% CI)

 

An online calculator for calculating CI of mean is at http://www.sample-size.net/confidence-interval-mean/

Calculating the CI of the mean relies on a number of assumptions. A major assumption, which is often overlooked, is that the sample must come from a population that is Gaussian (or roughly Gaussian). If the distribution is far from Gaussian (lognormal, for example), the CI should not be calculated using the above method. The sample also has to be random: if subjects are deliberately chosen in a non-random way, the CI is not valid. The same applies if, for example, some cells in a suspension are clumped, so that the suspension is not homogeneous. Where true randomization is impossible, as with patients from a particular clinic, we use a ‘convenience sample’ instead of a random sample, and must assume that it is representative of the population.

 

An alternative approach for calculating the CI of the mean is nonparametric and rank-based, and therefore does not make explicit assumptions about the probability distribution.

 

For example, out of our 24 students, we randomly choose 5 students and their test marks are: {22, 14, 11, 2, 9}

  • Step 1. Rank order these values: rank 1 = 2, rank 2 = 9, rank 3 = 11, rank 4 = 14, rank 5 = 22.
  • Step 2. Make a new subset by picking five random integers from 1 to 5 and taking the value with each of those ranks; repeats are allowed. For example, mark five pieces of paper with 1 through 5 and place them in a box (as in a lottery). Shuffle the box, randomly pick a paper, record its number, put the paper back in the box, shuffle, pick once again, and so on. You might get the same number multiple times, and that is allowed. For example, suppose you got 1, 3, 3, 4, 5. Now record the values with those ranks to make a subset (2, 11, 11, 14, 22). This new subset is called a pseudosample.
  • Step 3. Do this many times (pseudoreplicates), say 500 pseudoreplicates to generate 500 pseudosamples. For each pseudosample, calculate the mean. Next, rank order those means (500 values in total, from minimum to maximum) and pick the 2.5th and 97.5th percentiles. As 97.5-2.5 = 95, this range is the 95% CI of the mean!

This method is variously known as the resampling method, bootstrapping or the computer-intensive method, and is extensively used in phylogenetics and genomics. A number of studies have reported that it performs better than the earlier method that uses a constant from the t-distribution. However, it is not suitable if you need to solve a question manually in a test paper.
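
The resampling steps can be sketched on a computer as follows. This is a minimal illustration using only Python's standard library; the fixed random seed and the simple rule for picking the percentile positions are assumptions made for the sketch, and other percentile conventions would give slightly different limits.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

marks = [22, 14, 11, 2, 9]  # sample of five test marks from the text
N_PSEUDO = 500              # number of pseudoreplicates, as in the text

boot_means = []
for _ in range(N_PSEUDO):
    # Draw with replacement: same size as the sample, repeats allowed
    pseudosample = random.choices(marks, k=len(marks))
    boot_means.append(statistics.mean(pseudosample))

# Rank order the 500 means from minimum to maximum
boot_means.sort()

# Simple percentile rule (an assumed convention): take the sorted means
# at the 2.5% and 97.5% positions
lower = boot_means[int(0.025 * N_PSEUDO)]
upper = boot_means[int(0.975 * N_PSEUDO) - 1]
print(f"Bootstrap 95% CI of mean: {lower} to {upper}")
```

With only five observations the limits vary noticeably from run to run, so the seed matters here more than it would for a larger sample.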

 

8.   Confidence Interval of Standard Deviation

 

Standard Deviation is an interval estimate. But how precisely does your sample SD (which quantifies the scatter or variability of the data) estimate the population SD? The 95% CI of the SD can be calculated using online calculators, for example:

 

https://www.graphpad.com/quickcalcs/CISD2/

 

Input values are the Standard Deviation and N, the sample size. For example, for our earlier marks data (SD = 4.905, N = 24):

95% CI turns out to be 3.81 to 6.88

 

As this CI is an interval estimate calculated for another interval estimate, the SD, each original limit of mean ± SD is now expressed as a range. For example, the upper limit 12.81+4.905 becomes a range from 12.81+3.81 to 12.81+6.88. Similarly, the lower limit 12.81-4.905 becomes a range from 12.81-6.88 to 12.81-3.81.

 

Microsoft Excel formulas for this are:

  • Lower limit: =SD*SQRT((n-1)/CHIINV((alpha/2), n-1))
  • Upper limit: =SD*SQRT((n-1)/CHIINV(1-(alpha/2), n-1))

E.g., for a marks dataset with SD = 4.67 and n = 34:

  • =4.67*SQRT((33)/CHIINV((0.05/2), 33)) = 3.77
  • =4.67*SQRT((33)/CHIINV(1-(0.05/2),33)) = 6.15
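
For readers without Excel, the same limits can be approximated in Python. Note the swapped-in technique: the standard library has no chi-square inverse, so this sketch uses the Wilson-Hilferty approximation for the chi-square quantile instead of an exact function. The names `chi2_quantile` and `ci_of_sd` are illustrative only.

```python
import math
from statistics import NormalDist


def chi2_quantile(p, df):
    """Approximate chi-square quantile (Wilson-Hilferty approximation)."""
    z = NormalDist().inv_cdf(p)  # standard normal quantile for probability p
    a = 2.0 / (9.0 * df)
    return df * (1.0 - a + z * math.sqrt(a)) ** 3


def ci_of_sd(sd, n, alpha=0.05):
    """Return approximate (lower, upper) confidence limits of an SD."""
    df = n - 1
    # Excel's CHIINV(q, df) takes a right-tail probability q, so
    # CHIINV(alpha/2, df) corresponds to the (1 - alpha/2) quantile here
    lower = sd * math.sqrt(df / chi2_quantile(1 - alpha / 2, df))
    upper = sd * math.sqrt(df / chi2_quantile(alpha / 2, df))
    return lower, upper


lower, upper = ci_of_sd(sd=4.67, n=34)
print(f"95% CI of SD: {lower:.2f} to {upper:.2f}")
```

Because the quantile is approximated, the result should land close to, but need not match exactly, the 3.77 and 6.15 returned by the Excel formulas.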

9. Confidence Interval of Binomial Distribution (proportion)

 

Experiments with two possible outcomes, like the gender of a child (male or female) or a coin toss (heads or tails), follow the Binomial Distribution. Binomial distributions are commonly used in biology and are often expressed as proportions. Some examples are:

  • Proportion of monoecious plants in a quadrat
  • Proportion of an allele or a genotype in a population

One can calculate the 95% CI of a proportion from the numerator (x or S, the number of successes) and the denominator (N, the total number of trials) alone. For example, 45 out of 100 people were females. Here x is 45 and N is 100. The sample proportion, x/N, is 45/100 = 0.45

 

To calculate the 95% CI of this proportion, head to http://statpages.info/confint.html and input the numerator and denominator there. The calculator will return the values 0.35 to 0.55. This means that there is a 95% chance that the true population proportion of females (which remains unknown to us) falls somewhere between 0.35 and 0.55. That also means that there is a 5% chance that the true population proportion is either less than 0.35 or greater than 0.55.

 

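A method for manually calculating the 95% CI of a proportion is the Modified Wald Method:

p’ = (x+2)/(N+4)

W = 1.96 x √[p’(1-p’)/(N+4)]

95% CI of proportion = (p’ – W) to (p’ + W)

where x is the number of successes, N is the total number of trials and W is the width of the 95% CI.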

For example, in our earlier example p’ (called p-prime) is (45+2)/(100+4) = 47/104 = 0.45

The width of 95% CI = 1.96 x √[0.45(1-0.45)/104] = 1.96 x √[0.45×0.55/104]

= 1.96 x √[0.248/104]

= 1.96 x √0.00238

= 1.96 x 0.0488

= 0.0956

95% CI of proportion = (0.45 – 0.096) to (0.45 + 0.096)

= 0.354 to 0.546
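
The manual steps translate directly into a short function. This is a sketch: the name `modified_wald_ci` is illustrative, and clipping the limits to the range 0 to 1 is an added assumption (a proportion cannot fall outside that range) that the worked example above does not need.

```python
import math


def modified_wald_ci(x, n, z=1.96):
    """Modified Wald CI of a proportion: add 2 successes and 4 trials."""
    p_prime = (x + 2) / (n + 4)
    w = z * math.sqrt(p_prime * (1 - p_prime) / (n + 4))
    # Assumption: clip the limits, since a proportion lies between 0 and 1
    return max(0.0, p_prime - w), min(1.0, p_prime + w)


lower, upper = modified_wald_ci(45, 100)
print(f"95% CI of proportion: {lower:.3f} to {upper:.3f}")
```

The limits differ slightly in the third decimal from the hand calculation because the code keeps p’ = 47/104 unrounded instead of using 0.45.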

 

Another example: Out of 10 diploid plants studied, 1 plant had the Aa genotype and the rest had the AA genotype. Calculate the frequency of the ‘a’ allele, with its 95% CI.

Here x (numerator) is 1 and N (denominator) is 20. The denominator is 20 because the plants are diploid, so ten plants together carry 20 alleles. One can write down all the alleles of these ten plants to make the problem easy to understand:

 

1.       Aa

2.       AA

3.       AA

4.       AA

5.       AA

6.       AA

7.       AA

8.       AA

9.       AA

10.   AA

 

As one can see, out of twenty letters, 19 are capital A, while only one is small a. So the proportion of the ‘a’ allele is 1 out of 20, or 0.05.

One can input these two numbers (1 and 20) either in the online calculator or in the Modified Wald method as explained earlier to calculate the 95% CI of the sample proportion (0.05). The online calculator returns 0.0013 to 0.2487; the Modified Wald method gives a slightly different range, about 0 to 0.257, because it is an approximation.

 

11. Confidence Interval of Poisson Distribution (count data)

 

Count data, such as the mortality on an island (number of persons dead per unit time), the number of bullets shot, the number of mutations in a stretch of a DNA molecule, or the number of raisins in a laddoo, follow the Poisson Distribution if the events occur randomly. From the count ‘C’ alone, we can calculate the 95% Confidence Interval of the count. A good approximation for the 95% CI, when C is 25 or more, is

 

C-1.96√C to C+1.96√C

 

For example, after dissecting 10 laddoos, you found 25 raisins. The mean number of raisins per laddoo is 2.5, but hold on, we will come to that later. First let’s calculate the 95% CI of the actual count:

 

C=25

Width of 95% CI= 1.96√C

= 1.96x √25 =1.96×5 = 9.8

95% CI = 25-9.8 to 25 + 9.8

= 15.2 to 34.8

 

This is for 10 laddoos. Now, let’s divide these limits by 10 to get CI per laddoo.

1.52 to 3.48

 

This means that there is a 95% chance that the real population mean number of raisins per laddoo (which remains unknown to us) falls somewhere between 1.52 and 3.48. The mean raisins per laddoo of our sample, 2.5, by itself says nothing about how precise the measurement is, or how close the sample mean is to the population mean.
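
The laddoo calculation can be sketched in code as follows. The function name `poisson_ci_approx` is illustrative; it implements only the approximation C ± 1.96√C given above, not the exact method used by the online calculator.

```python
import math


def poisson_ci_approx(count, z=1.96):
    """Approximate CI of a Poisson count: C +/- 1.96 * sqrt(C)."""
    w = z * math.sqrt(count)
    return count - w, count + w


total_raisins = 25  # total count C over all laddoos
n_laddoos = 10      # number of laddoos dissected

# Calculate the CI on the total count first, then scale to one laddoo
lower, upper = poisson_ci_approx(total_raisins)
per_laddoo = (lower / n_laddoos, upper / n_laddoos)
print(f"95% CI of count: {lower:.1f} to {upper:.1f}")
print(f"95% CI per laddoo: {per_laddoo[0]:.2f} to {per_laddoo[1]:.2f}")
```

The order of operations is the point of the sketch: the interval is computed from the total count and only afterwards divided by the number of laddoos.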

 

Remember that the formula given above is only an approximation, and a reasonable one only when C is 25 or more. An accurate calculator is available online at http://statpages.info/confint.html

 

Another example: There were a total of 10,000 deaths in an island over a period of 100 days.

  • What is the average number of deaths per day on this island?
  • What is the 95% CI of this average?

Calculation of the average is straightforward. Deaths per day = 10,000 / 100 = 100

In this example C is sufficiently large (much larger than 25), so we can use the formula

C-1.96√C to C+1.96√C

Width of 95% CI= 1.96√C

= 1.96x√10,000

=1.96×100

=196

∴   95% CI = (10000-196) to (10000+196)

=9804 to 10196

 

This is for 100 days. To calculate the 95% CI for mean deaths per day, each of these limits needs to be divided by 100: 98.04 to 101.96

That means there is a 95% chance that the true mean number of deaths per day on that island falls somewhere between 98 and 102.

 

12. Common Mistakes

  1. When calculating the 95% CI of a proportion (binomial distribution), it is important that no normalization is done prior to the calculation. For example, never express the measurement as a percentage and calculate the 95% CI with 100 as the denominator; instead, use the empirical data. This is because the CI depends on the sample size (denominator), which would not be 100 in most cases. For example, 2 out of 18 giraffes observed show albinism. That means 2 are white giraffes and the remaining 16 are normal. Here x is 2 and N is 18. You should calculate the 95% CI for 2/18 with sample size = 18 (not 11.11/100; the sample size would then be 100, which would result in a narrower CI).
  2. When calculating the 95% CI of count data, it is important to use the total count (not the mean). For example, in our earlier problem, don’t take the average raisins per laddoo, 2.5, as C (in that case, the CI would be 0.6 to 7.2, a wider interval). You should take C as 25, the total, and calculate the CI (16.1 to 36.9; these limits come from an exact calculation, so they differ slightly from the approximation used earlier). Finally, divide both limits by the number of observations (10) to get the CI per laddoo (1.6 to 3.6, which is narrower than had we used 2.5 as C; as the sample size increases, the 95% CI gets narrower).

13. Summary

a. A sample is a subset of a population. In most cases the properties of the population remain unknown, and we use samples as a proxy to make inferences about the population.

b. The 95% CI informs us how precisely we have determined the respective sample statistic as an estimate of the true population value. It is related to the SEM (mean ± SEM is approximately a 68% CI for large samples), and the width of the 95% CI is approximately twice the SEM. It is different from the SD, as the SD captures only the scatter of the dataset.

c. The 95% Confidence Interval of the mean can be calculated using the formula µ = m ± t* (s/√n)

d. It is possible to calculate 95% CI of mean without making any assumptions about the distribution of populations from which the samples came. The approach is through resampling (bootstrapping). Manual calculation is almost impossible, but can easily be done in a computer.

e. 95% CI of Standard deviation can be calculated in MS Excel or using a web-based calculator easily, though manual calculation is challenging.

f. 95% CI of Poisson Distribution (count data) can be calculated using the equation C-1.96√C to C+1.96√C

g. 95% CI of binomial distribution can be calculated using modified Wald’s equation, or easily using a web-based calculator

 

Quadrant-III: Learn More/ Web Resources / Supporting Materials:

1. Brief review of concepts of Confidence Intervals at Yale University:

http://www.stat.yale.edu/Courses/1997-98/101/confint.htm

2. A good overview of Confidence intervals at Boston University:

http://sphweb.bumc.bu.edu/otlt/mph-modules/bs/bs704_confidence_intervals/bs704_confidence_intervals_print.html

3. Confidence Intervals explained in GraphPad Guide

https://www.graphpad.com/guides/prism/7/statistics/index.htm?confidence_intervals.htm