18 Probability Sampling: Principles and Procedures
Soumyajit Patra
1. Objective
This module will teach you about the importance of probability sampling in social science research. At the end of this module you will find some digital resources and a bibliography for further study.
2. Introduction
In most empirical social research it is impossible to collect data from all potential informants, given the time, cost and labour that such a large-scale study would require. All three increase with the scale of the research. If we want to know, for example, the average monthly income of the adult males and females living in the Kolkata municipal area, it is impracticable to obtain data from each and every adult resident of that area. So we select a representative group from the population, or ‘universe’, to estimate the average monthly income of the people living there. This representative group is called a sample, and the aggregate of individuals or units from which a sample is drawn is known as the population or ‘universe’. The researcher wants to learn about the characteristics of the population from the findings of the study of the sample. No doubt the ideal way to gain knowledge of the population is to study each and every member of it; a sample is the shortcut to understanding the population characteristics. So, in most quantitative research, researchers draw a sample from a large population in order to examine the characteristics existing in that population or universe.
3. Learning Outcome
This Module will help you understand some basic concepts related to sampling and the principles and procedures of probability sampling.
4. Sampling
According to Payne and Payne (2005: 200), ‘sampling is the process of selecting a sub-set, of people or social phenomena to be studied, from the larger “universe” to which they belong.’ In the words of Bloor and Wood (2006: 153), ‘a sample is representative of the population from which it is selected if the characteristics of the sample approximate to the characteristics in the population’. This representativeness of the sample is very important because it is presumed that the results obtained from the sample can be used to describe the population or universe as such. The individuals selected for a sample are called sampling units. In other words, a sample consists of sampling units. When research is conducted on the entire universe, i.e. when information is collected from each and every individual of the population, it is called a census. According to Bryman (2012: 187), a census is
‘the enumeration of an entire population. Thus, if data are collected in relation to all units in a population, rather than in relation to a sample of units of that population, the data are treated as census data’.
Hagood and Price (1952) have pointed out three important features of a good sample. These are –
- The sample must represent the universe.
- It should be unbiased.
There are different types of sample, and the researcher has to pay careful attention to selecting the appropriate one. Otherwise, the results of the study will be misleading. Prior knowledge of the various characteristics of the population is essential for selecting the right sampling design.
In order to draw a sample, a source list is required. The source list, or sampling frame, is the complete list of the units of the population. A voters’ list, for example, can be used as the source list for many social science studies. In most cases, however, the researcher has to prepare the source list, and this is not always easy. Preparing one is especially difficult if, for example, the population is mobile.
Sometimes it is also impossible to identify the actual units of the population. If, for example, we want to collect a sample from the student communities who have tendencies to commit crime, it would not be very easy to identify the right persons and to collect a sample from them.
4.1 Statistic and Parameter
The value of a variable calculated from the sample is called ‘statistic’ and the value of a variable existing in the population or universe is called ‘parameter’. For obvious reasons in most of the cases the parameter is unknown. The researcher draws a sample from the population in order to know or guess the parameter. For example, suppose a researcher has selected a sample of 10 students following a standard sampling procedure from a batch of 100 students (population). She wants to know the average age of the students of the said batch. So she calculated the average age of the students from the sample and found it is 21.3 years (statistic).
Now suppose in this case the parameter is known and it is 21.6 years (the average age of 100 students). Here 21.3 years is ‘statistic’ and 21.6 years is ‘parameter’. The difference between statistic and parameter is due to sampling. As in most of the cases, the researchers do not know the parameter, they only try to guess it with the help of statistic.
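The distinction between statistic and parameter can be illustrated with a short Python sketch. The ages below are randomly generated and purely hypothetical (they are not the data behind the 21.3 and 21.6 figures above); the names `population_ages` and `sample_ages` are introduced only for this illustration.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical ages for a batch of 100 students (the population).
population_ages = [rng.randint(19, 24) for _ in range(100)]

# Parameter: the mean age of all 100 students.
parameter = statistics.mean(population_ages)

# Statistic: the mean age of a random sample of 10 students.
sample_ages = rng.sample(population_ages, 10)
statistic = statistics.mean(sample_ages)

print(f"parameter (population mean): {parameter:.2f}")
print(f"statistic (sample mean):     {statistic:.2f}")
print(f"difference due to sampling:  {statistic - parameter:+.2f}")
```

In real research only the last two lines of data would be available to the researcher: the sample and its statistic. The parameter is computed here only because the whole population is simulated.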
5. Why do researchers prefer sampling?
- Sampling saves time. Complete enumeration, or census, requires much time; selecting a representative part of the whole saves it. Many social science studies are time-bound, so sampling becomes unavoidable if the work is to be completed on time.
- Sampling saves labour and money.
- Sample study yields more precise results.
- From the administrative point of view also sampling is preferred.
- Concentration on a comparatively small group helps collect accurate data.
- The magnitude of error can be calculated in most of the cases (particularly in case of probability sampling).
- The units of study (respondents) are more accessible in a sample study than in a study of the whole population.
6. Probability sampling and non-probability sampling
Sampling can be of two types: probability and non-probability. According to Das (2004: 61), ‘the chance of being included in the sample is commonly known as probability.’ In probability sampling each item of the universe has a determinate (known and non-zero) chance of being selected. The idea behind probability sampling is that a sample will be representative of the universe from which it is selected if all members of the universe have an equal chance of being selected (Babbie 2004). A design with this equal-chance property is sometimes called EPSEM (Equal Probability of Selection Method). A good sample should adequately represent the population, and probability sampling enhances the degree of representativeness. A random method of selection, in which each item has a known (and, in the simplest designs, equal) probability of being included in the sample, is the key to probability sampling. A major advantage of probability sampling is that sampling error, i.e. the degree of expected error for a given sample design, can be calculated.
Non-probability sampling, on the contrary, does not follow the rules of probability. Bryman (2012) points out that a non-probability sample is a sample that is not selected using a random selection method. That means that in non-probability sampling some units in the population are more likely to be selected than others. In the words of Babbie (2004: 182), ‘any technique in which samples are selected in some way not suggested by probability theory’ may be called non-probability sampling. In some social research, probability sampling is not feasible, and non-probability sampling is preferred. For example, if we want to study homelessness, it is impossible to obtain a list of all homeless people, so non-probability sampling would be appropriate (ibid.). In non-probability sampling, there is no way to ensure that each item of the population has a chance of being included in the sample. The selection depends entirely on the researcher, and therefore the representativeness of the sample cannot be guaranteed in most cases.
6.1 Sampling in quantitative and qualitative research
Sampling techniques vary with the nature of research. In quantitative research, the researcher generally wants to focus on the numerical aspects of social life through the collection and analysis of statistical data such as the average age of the population, average income, level of education, dropout rate, etc. For this purpose, he or she wants to draw a truly representative sample from a large population and tries to understand the population parameter through the sample statistic. According to Neuman (2007), probability sampling is most appropriate for quantitative research because it produces more accurate results, expressed in numerical data, than non-probability sampling, and because its sampling error can be calculated.
Qualitative research, on the other hand, focuses on the peculiar features of social life, on the meanings created and transformed in the course of human interaction, or sometimes on inter-subjective feelings and emotions. These demand a proper, in-depth understanding of social reality that simple numbers cannot express. Neuman (2007: 141) thus writes:
Qualitative researchers’ concern is to find cases that will enhance what the researchers learn about the process of social life in a specific context. For this reason, qualitative researchers tend to collect a second type of sampling: non-probability sampling.
Self-check Exercise -1:
- What is population?
The aggregate of individuals or units from which a sample is drawn is known as population. It is also called universe.
- What is census?
Complete enumeration is called census. In census all the units of population are covered, that means data are collected from each and every member of the population.
- What is sample?
A sample is a representative part of the population. When a researcher selects some units from the population or universe, following standardized procedures, to represent the population, this group is called a sample. A sample reflects the characteristics of the universe.
- What is probability sampling?
Probability sampling is that type of sampling in which each item of the universe has a determinate or fixed chance of being selected. The idea behind probability sampling is that a sample will be representative of the universe from which it is selected if all the members of the universe have an equal chance of being selected.
- What are statistic and parameter?
The value of a variable calculated from the sample is called statistic and the value of a variable existing in the population or universe is called parameter. In most of the cases, the researcher cannot know directly the value of the parameter. So he or she tries to have an idea about it from the statistic.
- What is variable?
According to Bryman (2012: 48), “a variable is simply an attribute on which cases vary. ‘Cases’ can obviously be people, but they can also include things such as households, cities, organizations, schools, and nations. If an attribute does not vary, it is a constant”. In other words, variables are characteristics that can take different values, such as age, height and income. They can be qualitative or quantitative. Qualitative variables such as caste and sex are often called ‘attributes’; quantitative variables such as age, income and family size are simply called ‘variables’.
7. Types of probability sampling
You know that in probability sampling each item of the population has a chance of being selected in the sample, so the sample is unbiased. There is no question of preferring one unit over another in the selection procedure, and the researcher has no say in which items of the population enter the sample. The researcher’s objectivity and value neutrality are thus protected when he or she opts for probability sampling.
There are different types of probability sampling. Researchers adopt one of them according to the purpose and nature of their study. However, they should be very cautious in selecting the type of sampling, because a wrong decision may jeopardize the research project. Before taking the final decision on the type of sample to be adopted for a particular study, the researcher should gather detailed knowledge about the characteristics of the universe or population on which the research is to be conducted. The same is true of non-probability sampling.
7.1 Simple Random Sampling
When each unit or element in the population has an equal chance of being selected in the sample, it is called Simple Random Sampling (SRS). According to Young (1988), the term ‘random’ here ‘does not mean haphazard, careless, unplanned or hit-and-miss. Rather, according to accepted standard of statistical sampling, every effort should be made to control the choice of items so that every item in the universe shall have the same probability of being included in the sample.’ Generally each unit in the population is identified by a number, and these numbers are printed on metal or cardboard discs. The discs are placed in a container and, after thorough shuffling, the sample units are selected by a simple lottery method. Random number tables may be used instead of this procedure to select the sample.
Simple Random Sampling may be of two types –
- Simple Random Sampling with replacement – Here the unit selected at each draw is returned to the container before the next draw is made, so the size of the population remains the same at each draw. Simple Random Sampling with replacement is often termed unrestricted random sampling. Much of statistical theory is based on Simple Random Sampling with replacement.
- Simple Random Sampling without replacement – Here the units selected are not replaced or returned to the original population. So the size of the population or universe changes at each draw. It should be noted that in both types of sampling, each unit of the population has an equal probability to be selected in the sample if the units appear once in the population (Majumdar 2005).
When the population is homogeneous, Simple Random Sampling can be a very good option. Suppose you are conducting a study on the beliefs and practices of the adult tribal women of a particular area that have a bearing on their health. You have to prepare a source list first. This source list defines the universe of your study; a voters’ list can be helpful here. You can put a number before each name in the list and then conduct a lottery to draw the sample. This sample would be unbiased and would represent your population.
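The lottery procedure, with and without replacement, can be sketched in Python using the standard `random` module. This is only an illustration: the source list of 200 numbered units and the sample size of 20 are assumed values, not figures from the module.

```python
import random

rng = random.Random(7)  # fixed seed so the sketch is repeatable

# Hypothetical source list: 200 units, each identified by a number.
source_list = list(range(1, 201))

# Without replacement: a unit that has been drawn cannot be drawn again.
srs_without = rng.sample(source_list, k=20)

# With replacement: every draw is made from the full list, so repeats can occur.
srs_with = rng.choices(source_list, k=20)

print("without replacement:", sorted(srs_without))
print("with replacement:   ", sorted(srs_with))
```

`random.sample` plays the role of drawing discs and keeping them out of the container, while `random.choices` corresponds to returning each disc before the next draw.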
7.2 Stratified Random Sampling
Simple Random Sampling is said to be representative of the population because the researcher’s personal choice does not enter into the selection of the sample units. In reality, however, not all the characteristics of the population may be reflected in the sample. This is particularly true if the population is heterogeneous. Simple Random Sampling does not ensure the inclusion of every segment of the population, as it is based on a random (generally lottery) method over which no one has control.
To ensure that the characteristics of the population are truly reflected in the sample, the entire population is often divided into strata on the basis of criteria relevant to the study, and sub-samples are then drawn from each stratum by a random method. This type of sampling is known as Stratified Random Sampling. According to Babbie (2004: 206),
…the ultimate function of stratification, then, is to organize the population into homogeneous subsets (with heterogeneity between subsets) and to select the appropriate number of elements from each.
For example, suppose you want to conduct a study on the environmental awareness of the students of a university. If you prefer Stratified Random Sampling, you first have to divide the student population faculty-wise. Then you can proceed in the following manner: the Faculty of Science and the Faculty of Arts and Commerce may be divided into academic departments, departments into classes, and classes according to the sex of the students. Finally, from these last strata (males and females of each class) sub-samples can be drawn by Simple Random Sampling. All these sub-samples together constitute the total sample. In this way the various categories or strata present in the universe are represented in the sample.
There are two types of Stratified Random Sampling – Proportionate Stratified Random Sampling and Disproportionate Stratified Random Sampling. In case of proportionate Stratified Random Sampling the specified characteristics of the population are reflected in the sample in the same proportion in which they are distributed in the population.
Suppose the researcher wants to draw a proportionate stratified sample from among the students of an engineering college. The calculations for the proportionate stratified sample are given below.
Department | Sex | No. of students | Proportion in population | Size of the sub-sample* |
IT | M | 70 | 70/500 = 0.14 | 100 × 0.14 = 14 |
IT | F | 60 | 60/500 = 0.12 | 100 × 0.12 = 12 |
Civil | M | 85 | 85/500 = 0.17 | 100 × 0.17 = 17 |
Civil | F | 50 | 50/500 = 0.10 | 100 × 0.10 = 10 |
Electronics | M | 50 | 50/500 = 0.10 | 100 × 0.10 = 10 |
Electronics | F | 50 | 50/500 = 0.10 | 100 × 0.10 = 10 |
Mechanical | M | 75 | 75/500 = 0.15 | 100 × 0.15 = 15 |
Mechanical | F | 60 | 60/500 = 0.12 | 100 × 0.12 = 12 |
Total | | 500 (size of the population) | 1 | 100 (size of the sample) |
* Suppose the researcher has decided to collect a sample of 100 students.
In the words of Majumdar (2005: 175),
The number of units selected from each stratum may be proportional to the stratum size to the population. That is, if $N_i$ is the stratum size or the size of the $i$th sub-population and $N$ the size of the population ($\sum N_i = N$), and if $n_i$ is the size of the sample in the $i$th stratum and $n$ the total sample size ($\sum n_i = n$), then for a proportionate stratified sample the relation $N_i / N = n_i / n$ must hold.
In Disproportionate Stratified Random Sampling, the sizes of the sub-samples are not proportional to the sizes of the respective population strata. Here, generally, an equal number of units is selected from each stratum by Simple Random Sampling. As a result, the specified characteristics of the population may not be reflected in the sample in the same proportion in which they are distributed in the population, and some strata may be over-represented or under-represented.
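The proportionate rule can be rearranged as n_i = n × N_i / N and checked against the engineering college table. The Python sketch below does this; the function name `proportionate_allocation` is introduced only for this illustration. Note the assumption that each product is a whole number, as it is here; with other figures the rounded sub-samples may not add up exactly to n and would need adjusting.

```python
# Stratum sizes N_i from the engineering college table (department, sex).
strata = {
    ("IT", "M"): 70, ("IT", "F"): 60,
    ("Civil", "M"): 85, ("Civil", "F"): 50,
    ("Electronics", "M"): 50, ("Electronics", "F"): 50,
    ("Mechanical", "M"): 75, ("Mechanical", "F"): 60,
}

def proportionate_allocation(stratum_sizes, n):
    """Return the sub-sample size n_i = n * N_i / N for every stratum."""
    N = sum(stratum_sizes.values())  # population size
    return {s: round(n * N_i / N) for s, N_i in stratum_sizes.items()}

allocation = proportionate_allocation(strata, n=100)
for (department, sex), n_i in allocation.items():
    print(f"{department:<12} {sex}  n_i = {n_i}")
print("total sample size:", sum(allocation.values()))
```

A disproportionate design with equal sub-samples would instead assign the same n_i to every stratum, for example 100 / 8 rounded to 12 or 13 students each, regardless of N_i.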
7.3 Cluster sampling
According to Bryman (2012: 709), cluster sampling is a ‘procedure in which at an initial stage the researcher samples areas (i.e. clusters) and then samples units from these clusters, usually using a probability sampling method’. Neuman (2007) says that a cluster is a unit that contains the final sampling units, but as the cluster is chosen through Simple Random Sampling it is itself a sampling unit. For example, in order to select a sample of women voters from a town, we can divide the town ward-wise and then select a sample of wards by Simple Random Sampling. The final sample of women voters may then be selected from each selected ward, again by Simple Random Sampling or by Stratified Random Sampling. Cluster sampling is less expensive than sampling individuals across the whole town, because fieldwork is concentrated in the selected wards. Babbie (2004) writes that cluster sampling may be used when it is either impossible or impractical to compile an exhaustive list of the elements of the population.
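The ward example can be sketched in Python as follows. The town, its 12 wards, the 50 women voters per ward and the voter identifiers are all hypothetical values chosen for illustration.

```python
import random

rng = random.Random(11)  # fixed seed so the sketch is repeatable

# Hypothetical town: 12 wards, each with a list of 50 women voters.
wards = {
    f"ward_{w}": [f"w{w}_voter_{i}" for i in range(1, 51)]
    for w in range(1, 13)
}

# First, select 4 wards (clusters) by Simple Random Sampling.
selected_wards = rng.sample(sorted(wards), k=4)

# Then select 10 women voters from each selected ward, again by SRS.
cluster_sample = [
    voter
    for ward in selected_wards
    for voter in rng.sample(wards[ward], k=10)
]

print("selected wards:", selected_wards)
print("sample size:", len(cluster_sample))
```

Only the voter lists of the four selected wards are actually needed, which is why a complete list of the town's women voters is not required.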
7.4 Multi-stage sampling
In multi-stage sampling, the researcher proceeds through a number of stages (from a large macro unit to a small micro unit), selecting a sample of predetermined size at each stage by Simple Random Sampling (Majumdar 2005). For example, if we want to select a sample of the urban people of West Bengal, we first divide the state into districts and select a sample of districts by Simple Random Sampling. The selected districts may then be divided into urban areas, and in the second stage we select a few towns from them by the same random method. These randomly selected towns may in turn be divided into wards, and a sample of wards may be selected in the same way. The final sample may be selected from the list of the residents of these wards. In multi-stage sampling, a random method is applied at every stage. For a large population, this kind of sampling is very useful.
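A minimal Python sketch of the district, town, ward and resident stages is given below. The frame of 6 districts, 5 towns per district, 4 wards per town and 30 residents per ward is entirely hypothetical, as are the labels and the function name `multi_stage_sample`.

```python
import random

# Hypothetical frame: district -> town -> ward -> list of residents.
frame = {
    f"D{d}": {
        f"D{d}-T{t}": {
            f"D{d}-T{t}-W{w}": [f"D{d}-T{t}-W{w}-R{r}" for r in range(1, 31)]
            for w in range(1, 5)
        }
        for t in range(1, 6)
    }
    for d in range(1, 7)
}

def multi_stage_sample(frame, n_districts, n_towns, n_wards, n_residents, rng):
    """Apply Simple Random Sampling at each stage, from districts to residents."""
    sample = []
    for district in rng.sample(sorted(frame), n_districts):
        towns = frame[district]
        for town in rng.sample(sorted(towns), n_towns):
            wards = towns[town]
            for ward in rng.sample(sorted(wards), n_wards):
                sample.extend(rng.sample(wards[ward], n_residents))
    return sample

residents = multi_stage_sample(frame, 2, 2, 2, 5, random.Random(3))
print(len(residents), "residents selected, e.g.", residents[:3])
```

With 2 districts, 2 towns per district, 2 wards per town and 5 residents per ward, the final sample has 2 × 2 × 2 × 5 = 40 residents.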
7.5 Multi-phase sampling
When some general information is collected from all the units of the sample and some specific information from sub-samples of the original sample, it is called multi-phase sampling. This sampling technique is also based on random methods and it can be combined or used with other types of sampling techniques. For obvious reasons, multi-phase sampling saves time and money. It is time consuming and unnecessary to ask every question to everyone. Multi-phase sampling also reduces the burden on the informants.
7.6 Systematic sampling
According to Babbie (2004), systematic sampling is a kind of probability sampling in which every kth unit or person in the list of the population is selected for inclusion in the sample. Generally k is calculated by dividing the population size by the desired sample size, and k is called the sampling interval. So we can obtain the sampling interval in the following way:
Sampling interval (k) = population size ÷ sample size
The first unit is selected at random from among the first k units. Then every kth unit after it is selected from the list of the population. In order to select a systematic sample, a complete list of the population with proper numbering of the units is essential. But any purposeful arrangement or rearrangement of the units in the list may produce a biased sample.
For example, suppose you have decided to conduct a study on the reading habits of the students of a particular class, in which there are altogether 100 students. You have also decided to take a sample of 10 students. You can use the attendance register of the students as the source list. Every student in that register has a roll number: 1, 2, 3 and so on. The sampling interval k here is 10 (100/10). To start with, you prepare 10 discs numbered 1 to 10 and select the first number by lottery. Say the number drawn is 6. Roll no. 6 is then included in the sample. The other sample units are the students with Roll no. 16, 26, 36, 46, 56, 66, 76, 86 and 96 (notice that the sampling interval is 10). So the final sample consists of Roll no. 6, 16, 26, 36, 46, 56, 66, 76, 86 and 96.
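The roll number example can be reproduced with a short Python sketch. The function name `systematic_sample` and its `start` argument are introduced here for illustration; the sketch also assumes that the population size is an exact multiple of the sample size, as in the example.

```python
import random

def systematic_sample(units, sample_size, start=None, rng=random):
    """Select every k-th unit after a random start (positions count from 1)."""
    k = len(units) // sample_size      # sampling interval
    if start is None:
        start = rng.randint(1, k)      # random start between 1 and k, inclusive
    positions = range(start, start + k * sample_size, k)
    return [units[pos - 1] for pos in positions]

roll_numbers = list(range(1, 101))     # attendance register, roll no. 1 to 100

# Fix the start at 6 to mirror the lottery result in the example.
sample = systematic_sample(roll_numbers, 10, start=6)
print(sample)  # [6, 16, 26, 36, 46, 56, 66, 76, 86, 96]
```

Leaving out `start` lets the function draw the first unit at random, which is what the procedure requires in an actual study.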
7.7 Area sampling
P. V. Young (1988) has defined area sampling as a type of sampling in which small areas are designated as primary sampling units (PSUs), and the households interviewed include all or a specified fraction of those found in these areas. Area sampling is similar to multi-stage sampling; the only difference is that here the total area under study is divided into smaller areas and a sample of these areas is then selected by a random method. After the selection of areas, all the households in them may be studied (as in cluster sampling), or further sub-samples may be selected by classifying those areas again. This type of sampling design is used in agricultural and market surveys.
8. Sample Size
The most frequently asked question is ‘what should the size of the sample be?’ If it is a probability sample, the researcher can determine it with the help of a statistical method. But this statistical procedure is not easy, and it requires prior knowledge of the population, which researchers often do not have. Neuman (2007) describes a convention that can help researchers decide the sample size. It should be noted that a large sample does not always ensure representativeness if the population is heterogeneous and the sample is poorly designed. In qualitative research even a very small sample can produce accurate and fascinating information, but this cannot be said of quantitative research.
On the basis of the principle ‘smaller the population, bigger the sampling ratio’, Neuman (ibid.: 162) suggests the following (the sampling ratio is the ratio of the sample size to the population size):
For a small population (about 1,000), sample size would be 300 (i.e. 30%).
For a moderately large population (about 10,000), sample size would be 1,000 (i.e. 10%).
For a large population (about 1,50,000), sample size would be 1,500 (i.e. 1%).
For a very large population (about 10 million) sample size would be 2,500 (i.e. 0.025%).
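The sampling ratios in the list above can be verified with a few lines of Python. The population and sample sizes are taken directly from the list; the variable names are introduced only for this check.

```python
# (approximate population size, suggested sample size) pairs from the list above
convention = [
    (1_000, 300),
    (10_000, 1_000),
    (150_000, 1_500),
    (10_000_000, 2_500),
]

# Sampling ratio as a percentage: 100 * sample size / population size.
ratios = [(N, n, 100 * n / N) for N, n in convention]

for N, n, pct in ratios:
    print(f"population {N:>10,}  sample {n:>5,}  sampling ratio {pct:g}%")
```

The output shows the ratio falling from 30% to 0.025% while the sample size itself rises only from 300 to 2,500.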
9. Sampling Error
According to Bryman (2012: 187), sampling error is the error in the findings deriving from research due to the difference between a sample and the population from which it is selected. This may occur even when probability sampling is employed. The error creeps into the results of your research because you have conducted the study on a sample instead of the whole population. In a census, sampling error is zero. In probability sampling, the more representative the sample is, the smaller the sampling error.
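Sampling error can be observed in a small simulation. The sketch below assumes a hypothetical population of 10,000 monthly incomes generated at random; the sample sizes of 25 and 2,500 and the function name `mean_abs_sampling_error` are likewise chosen only for illustration.

```python
import random
import statistics

rng = random.Random(2024)  # fixed seed so the sketch is repeatable

# Hypothetical population of 10,000 monthly incomes (values are made up).
population = [rng.gauss(20_000, 5_000) for _ in range(10_000)]
parameter = statistics.fmean(population)

def mean_abs_sampling_error(n, repeats=200):
    """Average |statistic - parameter| over repeated simple random samples of size n."""
    errors = []
    for _ in range(repeats):
        sample = rng.sample(population, n)
        errors.append(abs(statistics.fmean(sample) - parameter))
    return statistics.fmean(errors)

error_small = mean_abs_sampling_error(25)
error_large = mean_abs_sampling_error(2_500)
census_error = abs(statistics.fmean(population) - parameter)  # every unit is studied

print(f"average error, n = 25:    {error_small:.1f}")
print(f"average error, n = 2,500: {error_large:.1f}")
print(f"error in a census:        {census_error:.1f}")
```

The average gap between statistic and parameter is much smaller for the larger sample, and it vanishes when the whole population is studied.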
Self-check Exercise – 2:
- What is Simple Random Sampling (SRS)?
When each unit or element in the population has an equal chance of being selected in the sample, the procedure is called Simple Random Sampling (SRS). It is usually done with the help of the lottery method. Sometimes a random number table is also used to select the sample units.
- Define Stratified Random Sampling.
Often the entire population is divided into strata on the basis of criteria relevant to the study, and sub-samples are then drawn from each stratum by a random method. This type of sampling is known as Stratified Random Sampling. If the population is heterogeneous, Simple Random Sampling does not ensure that each segment of the population is included in the sample. A Stratified Random Sample attempts to include all the segments of the population to make the sample truly representative.
- What does cluster sampling mean?
Cluster sampling is a two-stage sampling procedure. The researcher first selects areas (i.e. clusters) by Simple Random Sampling and then samples units from these clusters, again by Simple Random Sampling or Stratified Random Sampling.
- What is multistage sampling?
In multi-stage sampling the researcher proceeds through a number of stages (from a large macro unit to a small micro unit), selecting a predetermined size of sample from each stage by Simple Random Sampling (Majumdar 2005).
- What is sampling error?
Sampling error is the error in the findings deriving from research due to the difference between a sample and the population from which it is selected.
10. Summary
In quantitative sociological research, sociologists often have to understand population parameters by collecting and analyzing data. If the population is large, it becomes impossible to collect data from each and every member of it. In such cases, researchers draw a sample from the population using standard sampling procedures. There are two types of sampling: probability sampling and non-probability sampling. Probability sampling ensures that each unit of the population has a chance of being included in the sample. This prevents bias from creeping into the research, and thus a degree of objectivity is secured. In non-probability sampling, on the contrary, the selection of sample units depends to a large extent on the knowledge and expertise of the researcher. One of the advantages of probability sampling is that sampling error can be measured statistically. However, with the increasing popularity of qualitative research in delving into issues concerning social life, the use of probability sampling is on the decline.