14 Biostatistics & Epidemiologic Method

Dr. Huidrom Suraj Singh

epgp books

 

Contents:

 

1. Introduction

 

2. Biostatistics

 

2.1 Data

 

2.2 Variables

 

2.3 Methods of Analysis

 

2.4 Frequency Distribution

 

2.5 Measurement of Central Tendency

 

2.6 Measurement of Dispersion

 

2.7 Relationship Analysis

 

3. Epidemiology

 

3.1 Health Determinants

 

3.2 Epidemiological Methods

 

3.3 Disease Occurrence

 

3.4 Measures of Disease Frequency

 

4. Summary

 

 

Learning Objectives:

 

  •  To introduce the basic knowledge of biostatistics and epidemiological methods
  •  To explain biostatistical analysis and its applications
  •  To elucidate the types of epidemiological methods
  •  To understand the distribution, disease occurrence and determinants of disease
  1. Introduction

In modern research, biostatistics is used extensively in epidemiology. For example, biostatistics was used in research for the development of the polio vaccine in the 1950s. Epidemiology is the basic science of public health. It uses statistical tools and research methodologies to identify the causes and risks of particular diseases in particular population groups. The science of epidemiology began with the investigation of infectious disease outbreaks; however, it is now also concerned with complex diseases such as heart disease, cancer, stroke, diabetes and hypertension. Biostatistics is used to determine how diseases develop, progress and spread. It also helps to predict the behaviour of a disease, such as its symptoms, the individuals who are at risk, and the mortality and morbidity rates in the population. The combination of biostatistics and epidemiological research significantly strengthens methodology and study design, improves the understanding of disease etiology, and supports the control and prevention of disease outbreaks.

  2. Biostatistics

Biostatistics is a sub-discipline of applied statistics that focuses on the design of biological experiments, data collection, data analysis and the interpretation of findings. It is generally applied in public health research and other biomedical sciences such as clinical and translational studies. The field of biostatistics has developed rapidly in the past thirty years. Biostatistics is integral to the advance of knowledge in biology, clinical and medical sciences, public health and its policy, health economics, proteomics, genomics and other disciplines. Biostatisticians evaluate the variables that may influence the health-related events of a population, using different methods of statistical analysis to draw valid inferences about populations from known samples.

 

2.1 Data

Data are one of the most important aspects of any research study. Data are a collection of items of information; in other words, information in raw or unorganized form that refers to, or represents, conditions, ideas, or objects. Accurate and complete data are essential to maintain the integrity of research and to understand the research problem. Data are common to all fields of study, including the physical and social sciences, medical and clinical studies, humanities and business. The complete collection of data on the group under study is called the ‘population’, and the collection of sampling units selected from the population is known as the ‘sample’. A ‘sampling unit’ is the smallest unit of the data, represented by a single member of the population. In research, a small representative sample is used to study a population because it is difficult to collect information from each and every member of the population. Moreover, studying a whole population is expensive and time consuming.

Data can be classified into two types: qualitative and quantitative.

 

(i) Qualitative Data

 

Qualitative data are observations characterized by measurement on a categorical scale. They describe qualities of the studied subject that cannot be measured numerically. Examples include softness of skin, ethnicity, death, gender and nationality. Qualitative data are generally described in terms of percentages or proportions, and are mostly displayed using contingency tables, pie charts and bar charts.

 

(ii) Quantitative Data

 

Quantitative data are information about numerical quantities that can be measured and written down with numbers. They are observations for which the differences between numbers have meaning on a numerical scale; they measure the quantity of something. Quantitative data can be classified into two subgroups on the basis of the type of numerical scale, viz. continuous scale (e.g. height of an individual) and discrete scale (e.g. number of pregnancies). Quantitative data are described in terms of means and standard deviations, and are mostly displayed in frequency tables and histograms.

 

 

2.2 Variables

 

A variable is a characteristic that can take on different values for different members of the group under study. For example, a group of university students will be found to differ in gender, height, attitudes, intelligence and in many other ways. These characteristics are called variables because their value may vary between data units in a population and may change over time. A variable is any characteristic, number or quantity that can be measured or counted. Variables are categorised into different types based on their scale and on their role in the analysis. These types are described below.

1) Numerical variables

 

These have values that describe a measurable quantity as a number, answering ‘how many’ or ‘how much’. Numerical variables are therefore quantitative variables. They are further divided into continuous and discrete variables.

 

Continuous Variable

A continuous variable can take on any value between its minimum and maximum values; that is, observations can take any value within a certain range of real numbers. It does not fit into a finite number of categories and is referred to as measurement data. Examples of continuous variables include height, time, age, temperature and blood pressure.

 

Discrete Variable

A discrete variable can take only designated values, typically integers (i.e. 1, 2, 3…). Here, observations take a value based on a count from a set of distinct whole values. A discrete variable cannot take a fractional value between one value and the next closest value. Examples of discrete variables include the number of children in a family and the number of cars in a household, both of which are measured in whole units. Discrete variables fit into a limited number of categories and are referred to as count data. Counts may be recorded for dichotomous categories (e.g. male-female; yes-no) or multichotomous categories (e.g. Indian-Chinese-Malaysian).

 

2) Dependent Variable

A dependent variable is the variable being tested and measured in a scientific experiment. It is the variable that depends on one or more independent variables, and it is also known as the outcome variable. For example, consider time spent in study and test score. Here, time spent in study is the independent variable and test score is the dependent variable, because a change in the time spent in study is expected to change the test score.

 

3) Independent Variable

An independent variable is the variable that is changed or controlled in an experiment to test its effects on the dependent variable. In other words, it is the variable that the researcher manipulates, in accordance with the purpose of the study, to observe the effect on a dependent variable. It is also known as the predictor or explanatory variable. In an experiment, a change in the independent variable is expected to produce a change in the dependent variable.

 

4) Categorical variables

These have values that describe a ‘quality’ or ‘characteristic’ of a data unit, answering ‘what type’ or ‘which category’. The values fall into mutually exclusive and exhaustive categories; hence these variables are qualitative variables and are represented by non-numeric values. They are further divided into ordinal and nominal variables.

 

Ordinal Variable

An ordinal variable is a categorical variable whose values can be logically ordered or ranked. The categories can be ranked higher or lower than one another, but there is no consistent magnitude of difference between them. For example – academic grade (A, B, C), attitudes (strongly agree, agree, disagree, strongly disagree).

 

Nominal Variable

A nominal variable is a categorical variable whose values cannot be organised in a logical sequence. For example – sex, business type, eye colour, religion etc.

 

2.3 Methods of Analysis

 

(i) Inferential Statistics

Inferential statistical methods provide a confirmatory data analysis. They generalize conclusions about the population from the sample data. They are used to make an inference, on the basis of data, about the existence or non-existence of a relationship between the independent and dependent variables. They also assess the strength of the evidence and support predictions.

 

(ii) Descriptive Statistics

Descriptive statistics describe the frequency and distribution of data collected from a sample in order to characterize the population it represents. For example – the percentage of patients attending a diabetes clinic by gender, age group or education level. They summarise the important characteristics of a known set of data. Important descriptive statistical analyses include frequency distribution, central tendency, dispersion and association.

 

2.4 Frequency Distribution

 

A frequency distribution is one of the most important means of summarizing the data from a single variable. It is computed by tabulating the frequencies of the variable, and simply tells how often the variable takes on each of its possible values. For quantitative variables with many possible values, the values are typically grouped into intervals. The relative frequency, as a proportion, is calculated as –

 

Relative Frequency = Frequency ÷ Sample size ‘n’

 

When the relative frequency is multiplied by 100, it gives the relative frequency as a percentage. Frequency distributions can often be displayed effectively using graphical means such as the bar chart, pie chart or histogram.
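
The tabulation described above can be sketched in a few lines of Python. This is only an illustration: the function name `relative_frequencies` and the blood-group sample are hypothetical and are not part of the module.

```python
from collections import Counter


def relative_frequencies(values):
    """Return {value: (frequency, proportion, percent)} for a list of observations."""
    n = len(values)            # sample size 'n'
    counts = Counter(values)   # frequency of each distinct value
    return {
        value: (freq, freq / n, 100.0 * freq / n)
        for value, freq in counts.items()
    }


# Hypothetical sample: blood groups of n = 10 people (illustrative only).
blood_groups = ["A", "B", "O", "O", "A", "AB", "O", "B", "O", "A"]
table = relative_frequencies(blood_groups)

for group, (freq, prop, pct) in sorted(table.items()):
    print(f"{group:>2}  frequency={freq}  proportion={prop:.2f}  percent={pct:.0f}%")
```

The proportions in such a table always add up to 1, which is a quick check that no observation has been dropped during tabulation.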

 

2.5 Measurement of Central Tendency

 

(i) Mean

 

The sample mean measures the location or central tendency of the observations in the sample. Its value depends equally on all of the data, so it is sensitive to extreme values. The population mean is the same quantity computed on all the elements in the population. For ordinal or nominal variables, the mean is not an appropriate measure of central tendency. The mean of a sample is computed as –

 

$$\bar{X} = \frac{\sum X}{n}$$

[Where, X̄ – sample mean; n – total number of observations; X – value of each observation in the sample]

 

(ii) Median

 

Median is defined as the point on a scale above and below which 50 percent of the cases lie. In other words, it is the middle value, or 50th percentile, of an ordered data set. The median is appropriate for ordinal qualitative data as well as quantitative data. When the number of observations ‘n’ is odd, the median is the single middle value; when ‘n’ is even, it is the mean of the two middle values. The median is computed as follows –

Median (M) = value of the {(n + 1) ÷ 2}th observation in the ordered data set

[When the number of observations ‘n’ is odd]

[Where, n – number of items in the data set; ‘th’ – position of the observation in the ordered data set]

Median (M) = (Sum of the two middle values) ÷ 2

[When the number of observations ‘n’ is even]

 

 

(iii) Mode

 

The mode is the most frequently occurring value in a data set. It can be used for all data types, including nominal data. A data set can have more than one mode if two or more values are equally common. That is –

 

                 Mode = {most frequent value in a data set}
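
The three measures of central tendency can be compared on one small data set. The sketch below uses Python's standard `statistics` module; the ages are hypothetical values chosen for illustration, and `median_by_position` is a hypothetical helper that spells out the odd and even rules for the median.

```python
from statistics import mean, median, multimode


def median_by_position(values):
    """Median from the ordered data: middle value (odd n) or mean of the two middle values (even n)."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # the {(n + 1) / 2}th value; subtract 1 because Python lists start at 0
        return ordered[(n + 1) // 2 - 1]
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


# Hypothetical ages (years) of 8 patients; illustrative values only.
ages = [23, 25, 25, 29, 31, 31, 34, 42]

sample_mean = mean(ages)        # sum of the observations divided by n
sample_median = median(ages)    # n is even, so the mean of the two middle values
sample_modes = multimode(ages)  # every value tied for most frequent

print(sample_mean, sample_median, sample_modes)
```

Here two values are equally common, so the data set has two modes, which is why `multimode` is used rather than a function that returns a single value.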

 

2.6 Measurement of Dispersion

 

Measures of dispersion describe the variability in a given sample. In other words, they describe how widely the data are spread. They include the range, percentile, variance, standard deviation, standard error and interquartile range.

 

(i) Range

 

Range is the difference between the largest and smallest values in a set of observations. It uses only the two extreme values (the lowest and the highest); all other values in the data set are ignored.

 

Range = {the largest value – the smallest value}

 

(ii) Interquartile range (IQR)

 

Interquartile range (IQR) is the distance between the 1st and 3rd quartiles. In other words, it is the difference between the upper quartile and the lower quartile of a data set, and so covers the middle 50 percent of the observations. The IQR is not sensitive to extreme values. It helps in understanding the spread or dispersion of a set of numbers, and it is usually reported together with the median when the distribution of observations is skewed.

 

                                                         IQR = {Q3 – Q1}

[Where, Q1 – lower quartile (25th percentile); Q3 – upper quartile (75th percentile)]
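
As a rough illustration, the range and IQR can be computed with Python's standard `statistics.quantiles`. The data are hypothetical, and the choice of `method="inclusive"` is an assumption: several conventions for locating quartiles exist, and they can give slightly different values on small samples.

```python
from statistics import quantiles

# Hypothetical measurements; illustrative values only.
values = [2, 4, 4, 5, 7, 9, 10, 12, 15]

# Range uses only the two extreme values.
value_range = max(values) - min(values)

# quantiles(..., n=4) returns the three cut points Q1, Q2 (median) and Q3.
# method="inclusive" treats the data as containing the minimum and maximum
# of the population; the default "exclusive" method may differ slightly.
q1, q2, q3 = quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

print(f"range={value_range}  Q1={q1}  Q3={q3}  IQR={iqr}")
```

Replacing the largest value by a much bigger number changes the range but leaves the IQR untouched, which is the sense in which the IQR is not sensitive to extreme values.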

 

(iii) Percentile

 

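Percentile indicates the percentage (%) of individuals whose value is equal to or below a given value.

$$\text{Percentile of } X = \frac{\text{number of observations} \le X}{n} \times 100$$

[Where, ‘X’ is the value for which the percentile has to be calculated; n – total number of observations]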
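
A minimal sketch of this definition, assuming the "equal to or below" convention stated above (other conventions count only the values strictly below). The function name `percentile_rank` and the scores are hypothetical.

```python
def percentile_rank(values, x):
    """Percentage of observations that are equal to or below x."""
    at_or_below = sum(1 for v in values if v <= x)
    return 100.0 * at_or_below / len(values)


# Hypothetical test scores of 10 individuals; illustrative values only.
scores = [45, 52, 58, 60, 63, 67, 70, 74, 81, 90]

# Six of the ten scores are 67 or lower.
rank_of_67 = percentile_rank(scores, 67)
print(rank_of_67)
```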

 

(iv) Variance

 

The variance is defined as the average of the squared differences from the mean, that is, the sum of the squared distances of each term in the distribution from the mean divided by the number of terms in the distribution. It measures the spread or dispersion within a set of data and provides information about how individuals differ within the sample.

$$\sigma^2 = \frac{\sum (X - \mu)^2}{N}$$

[Where, X – the observed value; μ – mean; N – number of terms in the distribution]

 

(v) Standard Deviation

 

Standard deviation (SD) is a measure of the spread or dispersion of a set of data. It gives information about the variability of scores around the mean. The SD reflects the spread of the sample values and is most appropriate for normal or nearly normal data. It is calculated by taking the square root of the variance. The larger the SD, the more widely the data are spread out.

(vi) Standard Error

 

The standard error is the standard deviation of the sampling distribution of a statistic. The standard error of the mean (SEM) is the standard deviation of the distribution of sample means taken from the population, so it indicates how precisely the sample mean estimates the population mean. It is estimated from the SD. The standard error is inversely proportional to the square root of the sample size: the larger the sample, the smaller the standard error, because the sample means cluster more closely around the population mean.

$$SE = \frac{S}{\sqrt{n}}$$

[Where, S – sample-based estimate of the standard deviation; n – size of the data set]
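
The variance, standard deviation and standard error can be worked through on a small hypothetical data set. Two assumptions are made in this sketch: the variance divides by the number of terms, as in the definition above, while the sample-based SD used for the standard error divides by n − 1, which is the usual estimator but is not specified in this module.

```python
from math import sqrt


def population_variance(values):
    """Average squared distance from the mean (divides by the number of terms)."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


def sample_sd(values):
    """Sample-based SD; the n - 1 denominator is an assumption of this sketch."""
    n = len(values)
    mean_value = sum(values) / n
    return sqrt(sum((x - mean_value) ** 2 for x in values) / (n - 1))


def standard_error(values):
    """Standard error of the mean: sample SD divided by the square root of n."""
    return sample_sd(values) / sqrt(len(values))


# Hypothetical observations; illustrative values only (mean = 5).
data = [2, 4, 4, 4, 5, 5, 7, 9]

variance = population_variance(data)
sd = sqrt(variance)          # SD is the square root of the variance
se = standard_error(data)

print(f"variance={variance}  SD={sd}  SE={se:.3f}")
```

Quadrupling the sample size while keeping the same SD would halve the standard error, because the sample size enters through its square root.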

 

2.7 Relationship Analysis

 

(i) Chi square

 

The chi-square test is used to test the independence of two categorical variables. A significant chi-square value indicates an association between the two variables. The data are summarised in a two-way contingency table of observed counts. From this table, the expected count of each cell is calculated as follows –

$$E = \frac{\text{Row total} \times \text{Column total}}{\text{Sample size}}$$

From the observed and expected counts, chi-square (χ²) is computed from the following formula –

$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$$

[Where, Oi – observed count of cell ‘i’; Ei – expected count of cell ‘i’; degrees of freedom = (r − 1)(c − 1), where r and c are the numbers of rows and columns; significance is assessed at the 5% level]
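
The calculation can be followed step by step in code. The sketch below is a plain-Python illustration; the function name `chi_square` and the 2 × 2 table are hypothetical, and the value 3.841 is the tabulated 5% critical value of chi-square for 1 degree of freedom.

```python
def chi_square(observed):
    """Return (chi-square statistic, degrees of freedom) for a two-way table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)  # sample size

    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected count of the cell
            chi2 += (o - e) ** 2 / e

    df = (len(observed) - 1) * (len(observed[0]) - 1)  # (r - 1)(c - 1)
    return chi2, df


# Hypothetical table: rows = exposed / unexposed, columns = diseased / not diseased.
observed = [[30, 20],
            [10, 40]]

chi2, df = chi_square(observed)

CRITICAL_5_PERCENT_DF1 = 3.841  # tabulated critical value for df = 1
significant = chi2 > CRITICAL_5_PERCENT_DF1

print(f"chi-square={chi2:.2f}  df={df}  significant at 5%: {significant}")
```

In this table every expected count is 20 or 30, so each cell deviates from its expectation by 10; the statistic is far above the critical value, and independence would be rejected at the 5% level.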

(ii) Correlation

The correlation coefficient (r) measures the linear relationship between two variables. Its sign indicates whether the two variables are positively or negatively related. The value of ‘r’ ranges from −1 to +1. The closer ‘r’ is to −1 or +1, the stronger the linear relationship; the closer ‘r’ is to 0, the weaker the linear relationship.
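
The behaviour of ‘r’ can be checked numerically. The module does not state which coefficient it has in mind, so this sketch assumes Pearson's product-moment correlation; the function name `pearson_r` and the data are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical paired observations; illustrative values only.
dose = [1, 2, 3, 4, 5]
response_up = [2, 4, 6, 8, 10]    # rises exactly with dose
response_down = [10, 8, 6, 4, 2]  # falls exactly with dose

r_up = pearson_r(dose, response_up)
r_down = pearson_r(dose, response_down)

print(f"r (rising) = {r_up:.2f}, r (falling) = {r_down:.2f}")
```

A perfectly linear rising pattern gives a value at the upper end of the range and a perfectly linear falling pattern gives a value at the lower end; real data usually fall somewhere in between.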

 

(iii) Relative Risk

 

The relative risk (RR) for disease and exposure is defined as the ratio of the hazard of disease among exposed subjects to that among unexposed subjects.

$$RR(t) = \frac{\lambda(t \mid E_1)}{\lambda(t \mid E_0)}$$

[Where, λ(t) – incidence rate (hazard) of disease at a particular time or age ‘t’; E1 – exposed subject; E0 – unexposed subject]

 

(iv) Odds Ratio

 

Odds ratio (OR) is a measure of association between an exposure and an outcome. The OR represents the odds that an outcome will occur given a particular exposure, compared to the odds of the outcome occurring in the absence of that exposure. It compares the odds of disease among those exposed with the odds of disease among those unexposed. It is computed as follows –

$$OR = \frac{P(D = 1 \mid E = 1) \,/\, P(D = 0 \mid E = 1)}{P(D = 1 \mid E = 0) \,/\, P(D = 0 \mid E = 0)}$$

[Where, E – exposure status (1 – exposed, 0 – unexposed); D – disease status (1 – disease, 0 – non-disease)]
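
Both measures can be illustrated from a single 2 × 2 table of counts. The sketch below makes a simplifying assumption: the relative risk is computed as a ratio of cumulative risks (proportions who develop disease) rather than as a ratio of hazards at a time ‘t’. The cell labels a, b, c, d and the counts are hypothetical.

```python
def relative_risk(a, b, c, d):
    """Risk of disease in the exposed divided by risk in the unexposed.

    a: exposed with disease      b: exposed without disease
    c: unexposed with disease    d: unexposed without disease
    """
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    return risk_exposed / risk_unexposed


def odds_ratio(a, b, c, d):
    """Odds of disease in the exposed divided by odds in the unexposed."""
    odds_exposed = a / b
    odds_unexposed = c / d
    return odds_exposed / odds_unexposed


# Hypothetical counts; illustrative values only.
a, b, c, d = 30, 20, 10, 40

rr = relative_risk(a, b, c, d)  # risks are 30/50 and 10/50
or_ = odds_ratio(a, b, c, d)    # odds are 30/20 and 10/40

print(f"RR = {rr:.2f}, OR = {or_:.2f}")
```

In this example the disease is common, so the odds ratio is noticeably larger than the relative risk; the two measures come close to each other only when the disease is rare in both groups.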

 

(v) Regression

 

Regression analysis is a statistical method for estimating the relationship among variables. It is used to find out what relationship, if any, exists between sets of data, and to find equations that fit the data. Linear regression models the relationship between two variables as a straight line. When a single regression model has more than one outcome variable, multivariate regression analysis is performed. The linear regression line is written as follows –

$$Y = a + bX$$

[Where, X – independent variable; Y – dependent variable; a – y-intercept; b – slope of the line]

The intercept and slope are commonly estimated by least squares:

$$b = \frac{N\sum XY - \sum X \sum Y}{N\sum X^2 - \left(\sum X\right)^2}, \qquad a = \frac{\sum Y - b\sum X}{N}$$

[Where, N – number of observations; X – independent variable; Y – dependent variable]
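
A minimal sketch of fitting the line Y = a + bX, assuming ordinary least squares as the estimation method. The function name `fit_line` is hypothetical, and the data reuse the study-time and test-score example from section 2.2 with made-up values.

```python
def fit_line(x, y):
    """Least-squares intercept a and slope b for the line y = a + b * x."""
    n = len(x)
    sum_x = sum(x)
    sum_y = sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_xx = sum(xi * xi for xi in x)

    b = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)  # slope
    a = (sum_y - b * sum_x) / n                                   # y-intercept
    return a, b


# Hypothetical data: hours of study (independent) and test score (dependent).
hours = [1, 2, 3, 4, 5]
scores = [52, 54, 56, 58, 60]

a, b = fit_line(hours, scores)
predicted_for_6_hours = a + b * 6

print(f"score = {a:.1f} + {b:.1f} * hours; predicted score for 6 hours = {predicted_for_6_hours:.1f}")
```

The slope is read as the expected change in the dependent variable for a one-unit change in the independent variable. Predictions outside the observed range of X, such as the 6-hour value here, should be treated with caution.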

  3. Epidemiology

The word epidemiology is derived from three Greek words: ‘epi’, meaning on or upon; ‘demos’, meaning people; and ‘logos’, meaning the study of. In other words, epidemiology is the study of what befalls a population. The term ‘epidemiology’ appears to have first been used by the Spanish physician Villalba in 1802 to describe the study of epidemics. Epidemiology is the study of the distribution of disease frequency in human populations and the determinants of that distribution. According to the WHO, epidemiology is the study of the distribution and determinants of health-related states or events (including disease), and the application of this study to the control of diseases and other health problems. Many other definitions of epidemiology have been proposed, but the definition given by the Dictionary of Epidemiology captures the underlying principles and public health spirit of the discipline. According to the Dictionary of Epidemiology (1995), epidemiology is the study of the distribution and determinants of health-related states or events in specified populations, and the application of this study to the control of health problems. Health-related states or events encompass health status, disease, death and other implications of disease such as disability, residual dysfunction, complication and recurrence, as well as causes of death, behaviour, and the provision and use of health services. The determinants of health-related states or events include physical, biological, social, cultural and behavioural factors.

 

Epidemiology is concerned with the distribution of disease at the population level, not with disease in the individual as clinicians are. Epidemiologists focus on why and how often diseases occur in different groups of people. Epidemiological studies aim to reveal unbiased relationships between exposures and outcomes. Identifying causal relationships between exposures and their outcomes is one of the most important aspects of epidemiology. The major aims of epidemiologic research may be summarised as to –

  •  describe the health status distribution of a specified population
  •  assess the frequency and pattern of health events
  •  explain the etiology of disease
  •  predict the disease occurrence
  •  evaluate the prevention and control of disease
  •  control the disease distribution

Epidemiology is a scientific discipline with a systematic approach to the collection, analysis and interpretation of data. It is a quantitative discipline that relies on a knowledge of statistics, and it is also a method of causal reasoning based on developing and testing hypotheses to explain health events. Therefore, epidemiology is also described as the basic science of public health. Besides supporting research on the health events of a specified population, it provides the foundational knowledge for appropriate public health action plans.

 

3.1 Health Determinants

 

Epidemiology studies the causal factors that influence the occurrence of disease and other health-related events. Epidemiologists search for the determinants of disease using different analytical study methods, assessing the demographic characteristics, genetic make-up and environmental exposures of the population to identify potential risk factors for the disease. To understand health determinants, epidemiologists use the epidemiological triad, a model for studying health-related events.

The triad consists of an external agent, a host and an environment, whose interactions maintain a balance that keeps the system in equilibrium. The agent is the cause of the disease and is the ‘what’ of the triangle. Hosts are organisms, usually humans or animals, that are exposed to and harbour a disease; the host is the second part of the triad and is known as the ‘who’. The environment is the surroundings and conditions external to the host that cause or allow disease transmission; it is the remaining part of the triad and is known as the ‘where’. Disease prevails when the system is in disequilibrium, disturbed by a change in any of the factors. Epidemiologists assess the disturbed system and aim to break at least one side of the triangle, disrupting the connection and stopping the continuation of disease.

 

3.2 Epidemiological Methods

 

Epidemiology is concerned with the distribution of health events in a population. The distribution comprises the frequency and pattern of the health events. Here, frequency refers to the number of cases of a health event as well as its relationship to the size of the population, whereas pattern refers to the occurrence of health-related events by time (annually, seasonally, weekly, daily or hourly), place (geographic variation or rural/urban) and person (socio-economic status, behaviour, risk of illness and demographic factors such as age, sex and marital status). Various epidemiological study methods can be used to carry out epidemiological investigations. In general, the epidemiologic method takes two main approaches, descriptive and analytic, to understand the determinants of health events and to develop effective strategies to prevent them.

 

Descriptive epidemiology describes the occurrence of disease by characterizing health events by time, place and person. It pertains to the who, what, where and when of health events. Descriptive studies can therefore be used to study the distribution of the frequency and patterns of health events. Analytic epidemiology, on the other hand, is used to study the determinants of health events, that is, how they occur and which factors are responsible for them.

Analytic epidemiology can be further classified into two groups, viz. observational epidemiology and experimental epidemiology. Observational epidemiology observes and estimates the association between exposure and disease through different study methods, including cohort, case-control, cross-sectional and ecologic studies. Experimental epidemiology focuses on the intervention and treatment of health events through randomised controlled trials, field trials and community trials. Epidemiology also applies the methodological techniques of allied disciplines, including biostatistics and informatics, together with the biologic, economic, social and behavioural sciences.

 

3.3 Disease Occurrence

 

Occurrence of disease in a population can be classified into different levels on the basis of the extent of its prevalence. The levels of disease occurrence are –

  1. Sporadic level – disease that occurs occasionally (infrequently) and at irregular intervals
  2. Endemic level – the persistent occurrence of a disease at a low to moderate level
  3. Hyperendemic level – persistently high levels of disease occurrence
  4. Epidemic – a sudden increase in disease occurrence above the normally expected level for a given time period
  5. Outbreak – has the same meaning as epidemic but is often used for a more limited geographic area
  6. Pandemic – an epidemic that has spread over several countries or continents, usually affecting a large number of people

3.4 Measures of Disease Frequency

 

In epidemiological studies, there are several means by which the occurrence of disease may be measured. Some of the important measures of disease frequency and risk are shown in table 1.

  4. Summary

 

Biostatistics is a broad discipline encompassing the application of statistical theory to biology, medicine and public health. Biostatistical research is closely connected with real applications such as designing and conducting biomedical experiments and clinical trials. Biostatisticians apply different statistical and mathematical tools to advance biological science and bridge the gap between theory and practice.

Epidemiology is the systematic and scientific study of the frequency, distribution and determinants of health-related states and events in specified populations. Epidemiology has helped in developing the study designs and methods used in clinical research and public health studies. It is not just the study of health in a population; it also involves diagnosing health problems and proposing appropriate, practical and acceptable public health interventions to control and prevent disease in the community.
