19 Sampling and Sampling Distributions: Determining sample size

 

1.       Learning outcomes

 

2.       Introduction

 

3.       Data required for computing sample size

 

4.       Determining sample size for estimating population mean

 

5.       When to use the sample or population standard deviation

 

6.       Methods to estimate the  population standard deviation

 

7.       Sample size for estimating population proportion

 

8.       Determining the sample size with fpc

 

9.       Summary

 

 

1.     Learning outcomes:

 

§      To realize the importance of estimating optimal sample size.

 

§      To understand the relevance of the variables influencing sample size

 

§      To make a preliminary estimate of the appropriate sample size

 

 

2.     Introduction

 

One of the most important aspects of any research study is the size of the sample. If the sample is very large, it wastes the resources of the researcher and the organization; if it is too small, it may fail to represent the population under study correctly. It is therefore very important to estimate the right sample size for the research under study.

 

 

3. Data required for computation of sample size

 

a)  Magnitude of error (desired precision, ± e): It indicates how precise we want our estimate to be. The researcher must work out the largest acceptable difference between the sample mean and the population mean, that is, the acceptable degree of sampling error. This error is expressed in terms of the standard error (i.e. the standard deviation of the sample means). The larger the acceptable degree of sampling error, the smaller the sample size needs to be. For example, if we want to estimate the average weight of 60 students in a class to within ±4 kg, the precision level is ±4 kg.

 

b)    Value associated with the desired confidence level (Z): It states the degree of confidence that the interval contains the population mean. It is a percentage or decimal value that tells how confident a researcher can be in the result, namely the long-run percentage of confidence intervals that will include the true population mean. The greater the desired confidence, the larger the sample size. For example, if we want to be 95% confident that the estimated weight of the students in the class is within ±4 kg of the true mean, then the desired confidence level is 95%.

 

Calculating a Confidence Interval

 

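Approximate location (value) of the population mean:

$\mu = \bar{X} \pm \text{a small sampling error}$

Estimation of the sampling error:

$\text{sampling error} = Z \cdot \sigma_{\bar{X}}$

where $\bar{X}$ is the sample mean and $\sigma_{\bar{X}}$ is the standard error of the mean. Therefore

$\mu = \bar{X} \pm Z \cdot \sigma_{\bar{X}}$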
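The interval above can be sketched in a few lines of Python. This is only an illustration, assuming the population standard deviation is known; the function name `confidence_interval` and the figures used (mean 62 kg, σ = 10 kg, n = 60) are made up for the example and do not come from this module.

```python
from math import sqrt
from statistics import NormalDist


def confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """Return (lower, upper) for the population mean when sigma is known."""
    # Two-sided critical value: z such that P(-z < Z < z) = confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Standard error of the sample mean.
    standard_error = sigma / sqrt(n)
    sampling_error = z * standard_error
    return sample_mean - sampling_error, sample_mean + sampling_error


# Hypothetical data: 60 students, mean weight 62 kg, sigma assumed to be 10 kg.
lower, upper = confidence_interval(62.0, 10.0, 60)
print(f"95% confidence interval: {lower:.2f} kg to {upper:.2f} kg")
```

Raising `confidence` widens the interval, and raising `n` narrows it, which is the trade-off the rest of this module works through.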

 

c)    Variance (estimator of the population standard deviation, σ): It indicates how heterogeneous the population is. This variability is measured by estimating the population standard deviation. A heterogeneous population, having more variance, requires a larger sample, while a homogeneous population, having less variance, requires a smaller sample.

[Figure: Sample size compared to margin of error]

 

4.  Determining sample size for estimating population mean

 

When the sampling distribution of the sample mean $\bar{x}$ is normal, the standard normal variable z is given by

$z = \dfrac{\bar{x} - \mu}{\sigma / \sqrt{n}}$

where

z = standard normal variate

$\bar{x}$ = sample mean

µ = population mean

σ = population standard deviation

n = sample size

Confidence interval for µ (σ known)

 

Assumptions:

•   Population standard deviation is known

•   Population is normally distributed

•   If population is not normal, use large sample

 

Confidence interval estimate:

$\bar{x} \pm z_{\alpha/2} \dfrac{\sigma}{\sqrt{n}}$

The difference between $\bar{x}$ and µ is called the sampling error or margin of error, e:

$e = z_{\alpha/2} \dfrac{\sigma}{\sqrt{n}}$

 

Thus, for an acceptable margin of error e (the maximum tolerable difference between the unknown population mean µ and the sample estimate) at the chosen confidence level 1 − α, the required sample size is

$n = \dfrac{z_{\alpha/2}^{2} \sigma^{2}}{e^{2}}$

where

n = sample size

z = standardized value indicating the level of confidence

e = accepted magnitude of sampling error (i.e. precision)

σ = estimator of the population standard deviation

or

$n = \dfrac{z_{\alpha/2}^{2} s^{2}}{e^{2}}$

where

s = sample standard deviation

 

If population standard deviation σ is unknown, then sample standard deviation s can be used to find sample size n.
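The step from the margin of error to the sample size is a short rearrangement, sketched here with the symbols already defined in this section:

```latex
\begin{aligned}
e &= z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} \\
\sqrt{n} &= \frac{z_{\alpha/2}\,\sigma}{e} \\
n &= \frac{z_{\alpha/2}^{2}\,\sigma^{2}}{e^{2}}
\end{aligned}
```

Because e appears squared in the denominator, halving the margin of error quadruples the required sample size. When σ is replaced by s, the same rearrangement applies.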

 

 

While the first two inputs, the desired precision level and the desired confidence level, are at the discretion of the researcher, the standard deviation of the population is harder to estimate.

If the true population dispersion is unknown, the standard deviation of a sample is used as a proxy. This figure can be worked out by any one of the following methods:

 

Any previous research on this topic.

A pilot test or pre test of the data among a sample drawn from the population.

A rule of thumb: one-sixth of the range, since about 99.73 percent of a normally distributed population lies within six standard deviations (three on either side of the mean).
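The rule of thumb in the last method can be sketched as follows. The helper name `sigma_from_range` and the weight range of 40 kg to 88 kg are hypothetical, chosen only to show how a proxy for the standard deviation feeds into the sample size formula.

```python
from math import ceil


def sigma_from_range(minimum, maximum):
    """Rule-of-thumb estimate: the range spans about six standard deviations."""
    return (maximum - minimum) / 6


# Hypothetical figures: weights are believed to lie between 40 kg and 88 kg.
sigma_hat = sigma_from_range(40, 88)

# Plug the proxy into n = z^2 * s^2 / e^2 (z = 1.96 for 95% confidence, e = 4 kg).
z, e = 1.96, 4
n = ceil((z * sigma_hat / e) ** 2)
print(f"estimated sigma = {sigma_hat} kg, required sample size = {n}")
```

The estimate is only as good as the assumed range, so a pilot test or previous research is preferable when available.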

 

5.  When to use the sample or population standard deviation

 

We are normally interested in knowing the population standard deviation because our population contains all the values we are interested in. Therefore, you would normally calculate the population standard deviation if: (1) you have the entire population or (2) you have a sample of a larger population, but you are only interested in this sample and do not wish to generalize your findings to the population. However, in statistics, we are usually presented with a sample from which we wish to estimate (generalize to) a population, and the standard deviation is no exception to this.

 

Therefore, if all you have is a sample, but you wish to make a statement about the standard deviation of the population from which the sample is drawn, you need to use the sample standard deviation. The standard deviation is used in conjunction with the mean to summarise continuous data, not categorical data. In addition, the standard deviation, like the mean, is normally only appropriate when the continuous data is not significantly skewed and has no outliers.

 

For example, a teacher sets an exam for their pupils and wants to summarize the results the pupils attained as a mean and standard deviation. Which standard deviation should be used? The answer is the population standard deviation, because the teacher is only interested in the scores of this class of pupils and nobody else. On the other hand, suppose a researcher has interviewed females aged 18 to 30 years for their views on live-in relationships. Which standard deviation would most likely be used? The probable answer is the sample standard deviation. Although not explicitly stated, a researcher investigating live-in relationship issues will not be concerned with just the participants of the study; they will want to show how the sample results can be generalised to the whole population (in this case, females aged 18 to 30 years).

 

One of the questions on a national census asks for respondents’ age. Which standard deviation would be used to describe the variation in all ages received from the census? The probable answer is the population standard deviation. A national census is used to find out information about the nation’s citizens. By definition, it includes the whole population. Therefore, a population standard deviation would be used.
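The difference between the two calculations can be seen with Python's standard `statistics` module, where `pstdev` divides the sum of squared deviations by N and `stdev` divides by n − 1. The eight scores below are made up purely for illustration.

```python
from statistics import pstdev, stdev

# Hypothetical exam scores for a small class (made up for illustration).
scores = [2, 4, 4, 4, 5, 5, 7, 9]

# Population standard deviation: divide the sum of squared deviations by N.
population_sd = pstdev(scores)

# Sample standard deviation: divide by n - 1, used when generalising beyond the data.
sample_sd = stdev(scores)

print(f"population SD = {population_sd:.3f}, sample SD = {sample_sd:.3f}")
```

The sample figure is always slightly larger for the same data, and the gap shrinks as the number of observations grows.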

 

6.  Methods to estimate the population standard deviation

 

The standard deviation is a measure of the spread of scores within a set of data. Usually, we are interested in the standard deviation of a population. However, as we are often presented with data from a sample only, we can estimate the population standard deviation from a sample standard deviation. These two standard deviations, sample and population, are calculated differently. In practice we usually have to calculate sample standard deviations, but the population standard deviation can also be estimated using the following methods:

 

1.  Use information from a previous study. If the researchers have already conducted a study of this type, they can make use of that information for the current study as well.

 

 

2.  Use secondary data. Researchers can draw on the large amount of data available with the company, on its website or in a local library. The figure can also be estimated from other sources such as industry averages, competitors’ data, government documents and previous surveys. This existing pool of information may help them to work out an estimate of the population standard deviation.

 

 

3.  Conduct a small study of the population. The researcher may conduct a study of a relatively small number of target population members to better understand the group’s degree of dispersion from the average for the variable under study.

 

 

4. Talk to informed people. Finally, the researcher could also use the judgment of experienced managers who are knowledgeable about the variable under study.

 

For example, a hospital wants an estimate of the mean time that a doctor spends with each patient in the OPD. How large a sample should be taken if the desired margin of error is 2 minutes at a 95 percent level of confidence, assuming a population standard deviation of 8 minutes?

 

Given e = 2 minutes

$z_{\alpha/2}$ = 1.96 at 95 percent confidence level

σ = 8 minutes

$n = \dfrac{z_{\alpha/2}^{2} \sigma^{2}}{e^{2}} = \dfrac{(1.96)^{2} (8)^{2}}{(2)^{2}} = 61.47 \approx 62$
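The hospital calculation can be reproduced with a short Python sketch. The function name `sample_size_for_mean` is illustrative, and the z-value is taken from the standard normal distribution rather than typed in as 1.96; the result is rounded up so that the margin of error is not exceeded.

```python
from math import ceil
from statistics import NormalDist


def sample_size_for_mean(sigma, e, confidence=0.95):
    """Smallest n with z * sigma / sqrt(n) <= e, assuming sigma is known."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil((z * sigma / e) ** 2)


# Hospital example: sigma = 8 minutes, e = 2 minutes, 95% confidence.
n_two_minutes = sample_size_for_mean(sigma=8, e=2)

# Same study with a tighter margin of 1 minute.
n_one_minute = sample_size_for_mean(sigma=8, e=1)

print(n_two_minutes, n_one_minute)
```

Tightening the margin from 2 minutes to 1 minute raises the requirement roughly fourfold, since n varies with 1/e².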

 

 

7.  Sample size for estimating population proportion

 

The previous discussion focused on determining the sample size for estimating mean values. At times, however, the proportion of the population with a particular attribute is of more interest than the mean value. For example, one might be interested in knowing the proportion of households that dine out on weekends rather than eating at home.

 

7.1 Calculation of sample size for a proportion:

 

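We already know that, at the 95 percent confidence level, the margin of error is 1.96 times the standard error and that the standard error is

$\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}$

In general the formula is

$e = z \sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}$

or

$e = z_{\alpha/2} \sqrt{\dfrac{\hat{p}\,q}{n}}$

where $q = 1 - \hat{p}$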

 

•   e is the desired margin of error, the largest acceptable difference between the sample proportion, p̄, and the population proportion, p.

 

•    z is the z-score, e.g. 1.645 for a 90% confidence interval, 1.96 for a 95% confidence interval, 2.58 for a 99% confidence interval

 

•  p̂ is our prior judgment of the correct value of p.

 

•  n is the sample size (to be found)
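Solving the margin-of-error formula for n gives the sample size directly. The rearrangement below uses only the symbols defined in the list above:

```latex
\begin{aligned}
e &= z_{\alpha/2}\sqrt{\frac{\hat{p}\,(1-\hat{p})}{n}} \\
e^{2} &= z_{\alpha/2}^{2}\,\frac{\hat{p}\,(1-\hat{p})}{n} \\
n &= \frac{z_{\alpha/2}^{2}\,\hat{p}\,(1-\hat{p})}{e^{2}}
\end{aligned}
```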

 

 

For example, a professor in the department of management studies is trying to determine the proportion of students in the department who support late marriage (after 30 years of age). He asks, “How large a sample do I need?” To answer a question like this we need to ask the researcher certain questions: 1. How accurately do you need the answer? 2. What level of confidence do you intend to use? 3. What is your current (approximate) estimate of the proportion of students in the department who support late marriage?

 

Possible answers might be:

 

1. “We need a margin of error less than 2.5%”. Typical surveys have margins of error ranging from less than 1% to something of the order of 4% — we can choose any margin of error we like but need to specify it.

 

2.   95% confidence intervals are typical but not in any way mandatory — we could do 90%, 99% or something else entirely. For this example, we assume 95%.

 

3.    This may be guided by past surveys or general knowledge of public opinion. Let us suppose the answer is 30%.

So in this case we set e = 0.025, z = 1.96 and p̂ = 0.3, and the equation becomes

$0.025 = 1.96 \sqrt{\dfrac{0.3 \times 0.7}{n}}$

so that

$\dfrac{0.3 \times 0.7}{n} = \left(\dfrac{0.025}{1.96}\right)^{2} = 0.0001627$

Therefore

$n = \dfrac{0.3 \times 0.7}{0.0001627} \approx 1291$

Therefore a sample of at least 1291 students, or about 1300 in round figures, is required.

 

We could clearly try varying any of the elements of this. For example, the researcher may be satisfied with a 90% confidence interval, for which z = 1.645. In this case the equation becomes

$0.025 = 1.645 \sqrt{\dfrac{0.3 \times 0.7}{n}}$

for which we can quickly find n ≈ 909 (910 when rounded up). If we are willing to accept a lower confidence level, we can get away with a smaller sample size.

 

A different type of variation is “What if we have no initial estimate of p̂?” In this case, the convention is to assume p̂ = 0.5. The reason is that the standard error formula, $\sqrt{\hat{p}(1-\hat{p})/n}$, is largest when p̂ = 0.5, so this is a conservative assumption that allows for p̂ being unknown a priori. If we repeat the calculation with p̂ = 0.5 (but returning to z = 1.96), we find

$0.025 = 1.96 \sqrt{\dfrac{0.5 \times 0.5}{n}}$

which results in n = 1537.
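The three scenarios for the professor's survey can be checked with a small Python sketch. The function name `sample_size_for_proportion` is illustrative; z is computed from the standard normal distribution instead of being rounded to 1.96 or 1.645, and the result is rounded up to the next whole student.

```python
from math import ceil
from statistics import NormalDist


def sample_size_for_proportion(p_hat, e, confidence=0.95):
    """Smallest n with z * sqrt(p_hat * (1 - p_hat) / n) <= e."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil(z ** 2 * p_hat * (1 - p_hat) / e ** 2)


n_95 = sample_size_for_proportion(0.3, 0.025)                   # 95%, prior guess 30%
n_90 = sample_size_for_proportion(0.3, 0.025, confidence=0.90)  # 90%, prior guess 30%
n_conservative = sample_size_for_proportion(0.5, 0.025)         # 95%, no prior guess

print(n_95, n_90, n_conservative)
```

Because the function always rounds up, the 90% case comes out as 910 rather than 909; the other two cases give 1291 and 1537.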

 

 

Question: A survey estimated that 20% of all Indians aged 16 to 20 are quite concerned about their health. A similar survey is planned for the US. The researchers want a 95% confidence interval to have a margin of error of 0.04.

 

(a) Find the necessary sample size if they expect to find results similar to those in India

 

(b) Suppose instead they used the conservative formula based on p̂ = 0.5. What is now the required sample size?

Solution:

 

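(a) The general formula is

$e = z \sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}$

which translates to

$n = \dfrac{\hat{p}(1-\hat{p})\, z^{2}}{e^{2}}$

$n = \dfrac{0.2 \times 0.8 \times 1.96 \times 1.96}{0.04 \times 0.04} \approx 384.2$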

 

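(b) With e = 0.04, p̂ = 0.5 and z = 1.96 we get

$n = \dfrac{0.5 \times 0.5 \times 1.96 \times 1.96}{0.04 \times 0.04} = 600.25$

The sample size is about 384 for (a) and 600 for (b) (385 and 601 if rounded up so that the margin of error is not exceeded), showing the advantage of using the estimated p̂ (0.2) so long as we feel confident that this is roughly the right guess.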
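To see why p̂ = 0.5 is the conservative choice, the required sample size can be tabulated over a grid of prior guesses. This is a sketch using the question's figures (z = 1.96, e = 0.04); the helper name `required_n` and the grid of values are assumptions made for illustration.

```python
from math import ceil

Z_95 = 1.96  # z-score for 95% confidence, as used in the text
E = 0.04     # margin of error from the question


def required_n(p_hat, z=Z_95, e=E):
    """Unrounded sample size n = z^2 * p_hat * (1 - p_hat) / e^2."""
    return z ** 2 * p_hat * (1 - p_hat) / e ** 2


# Prior guesses from 0.1 to 0.9 in steps of 0.1.
sizes = {p / 10: required_n(p / 10) for p in range(1, 10)}
for p_hat, n in sizes.items():
    print(f"p_hat = {p_hat:.1f}  n = {n:7.2f}  rounded up: {ceil(n)}")
```

The table is symmetric around 0.5 and peaks there, so a sample sized for p̂ = 0.5 is large enough whatever the true proportion turns out to be.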

 

 

8.  Determining the Sample Size with fpc

 

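The finite population correction factor (fpc) is used in determining the sample size when sampling without replacement from a finite population. Using this factor reduces the standard error by multiplying it by

$\sqrt{\dfrac{N-n}{N-1}}$

where N is the population size and n is the sample size.

For example, in estimating the mean, the sampling error is given by

$e = z_{\alpha/2} \dfrac{\sigma}{\sqrt{n}} \sqrt{\dfrac{N-n}{N-1}}$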
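Solving the sampling-error expression with the fpc for n gives n = n₀N / (n₀ + N − 1), where n₀ = z²σ²/e² is the sample size without the correction. The sketch below applies this to the hospital figures used earlier (σ = 8, e = 2, z = 1.96); the population size N = 500 and the function name `sample_size_with_fpc` are hypothetical, introduced only for illustration.

```python
from math import ceil


def sample_size_with_fpc(sigma, e, population_size, z=1.96):
    """Sample size for a mean when sampling without replacement from N units."""
    # Sample size ignoring the finite population correction.
    n0 = (z * sigma / e) ** 2
    # Solve e = z * (sigma / sqrt(n)) * sqrt((N - n) / (N - 1)) for n.
    n = n0 * population_size / (n0 + population_size - 1)
    return ceil(n)


# Hospital figures from the earlier example, with a hypothetical N = 500 patients.
n_small_population = sample_size_with_fpc(sigma=8, e=2, population_size=500)

# With a very large population the correction has almost no effect.
n_large_population = sample_size_with_fpc(sigma=8, e=2, population_size=10**9)

print(n_small_population, n_large_population)
```

For the small population the requirement drops from 62 to 55, while for a very large population it stays at 62, which is why the correction matters mainly when the sample is a sizeable fraction of the population.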

 

 

9.  Summary

 

One of the most common requests that researchers get from investigating agencies is for sample size calculations or sample size justifications for the proposed study. The sample size is the number of experimental units or samples included in a study. Determining the sample size needed to answer the research question is one of the important requirements in designing a study. In order to calculate the sample size, it is necessary to have some idea of the results expected in the study. In general, the greater the variability in the outcome variable, the larger the sample size required to assess whether an observed effect is a true effect.

Learn More:

  1. Tulsian P.C. and Pandey V. (2002). Quantitative Techniques, Theory & Problems (1st edition). New Delhi: Pearson India.
  2. Black K. (2013). Business Statistics for Contemporary Decision Making (8th edition). New Delhi: Wiley.
  3. Cooper D.R., Schindler P.S. and Sharma J.K. (2012). Business Research Methods (11th edition). New Delhi: McGraw Hill Education.
  4. https://www.unc.edu/~rls/s151-2010/class23.pdf