12 Modern Tools in Studying Genetic Diversity

Dr. Siuli Mitra

epgp books

 

CONTENTS:

 

1.      Learning outcomes

 

2.      Modern tools

 

3.      Molecular tools

 

a.       DNA extraction

 

b.      Restriction enzymes

 

c.       DNA hybridization

 

d.      Gel electrophoresis

 

e.       Polymerase Chain Reaction

 

f.       Automated DNA sequencing

 

4.      Statistical tools

 

a.       Description of diversity

 

b.      Evaluating genetic relationships

 

c.       Reconstruction of evolutionary history (Phylogenetic) or population history

 

d.      Dating evolutionary history

 

e.       Detecting selection

 

f.       Assessing geographical variation

 

1.    Learning outcomes:

 

At the end of the module the reader will know:

 

a.       Types of modern tools used to study genetic diversity

 

b.      Categories of molecular tools and techniques

 

c.       Categories of statistical tools

 

 

2.    Modern tools

 

Knowledge about genetic diversity and genetic relatedness is gathered using a variety of tools and techniques. The available tools can be broadly classified by their nature as molecular or statistical. Molecular tools are used to collect genetic evidence, and statistical tools are used to analyze that evidence. In essence, a combination of both classes is required to understand genetic diversity. The sections below describe the tools in popular use for genetic diversity studies.

 

3.   Molecular tools

 

Superseding the past decades’ work with protein (classical) variants, new-age molecular tools and techniques made it possible to study variation in the DNA sequences that encode proteins. The discovery of restriction enzymes and agarose gel electrophoresis formed the foundation of molecular techniques, and these were followed by many other high-resolution techniques to purify, detect, amplify and study DNA sequences. This section enumerates some of the technological advances that led to the successful extraction of DNA sequence information.

 

a.    DNA extraction

 

The human genomic DNA is long (about 6.4 × 10^9 base pairs per diploid cell) and is composed of four nucleotides: Adenine, Thymine, Guanine and Cytosine. The large size makes isolating and studying DNA fragments a formidable procedure. The extraction of genomic DNA marks the start of laboratory analysis in molecular genetic studies. Commonly used sources of DNA include serum, plasma or whole blood, buccal cells, hair, skin and semen. The basic steps involved in DNA extraction are lysis of red and white blood cells (RBC and WBC), removal of protein and lipid, DNA isolation, and purification of the isolated DNA. Many standard protocols have been devised that differ in the reagents used for the various steps. Two such protocols are the salting-out method (Miller et al., 1988) and the phenol-chloroform-isoamyl alcohol (PCI) method (Sambrook and Russell, 2001). For samples where the volume of blood is very small, a DNA extraction kit can be used. The methods yield DNA that differs in its level of protein contamination, which is judged by the ratio of absorbance at 260 nm to absorbance at 280 nm. PCI extraction is known to reduce protein contamination more than the salting-out method, but it has the drawback of using a corrosive chemical (phenol).
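The absorbance check mentioned above can be sketched as a small calculation. This is a minimal illustration only: the function names and the 1.7 to 2.0 acceptance window are assumptions made for the example (a ratio near 1.8 is commonly taken to indicate DNA reasonably free of protein), not values given in this module.

```python
def purity_ratio(a260: float, a280: float) -> float:
    """Return the A260/A280 absorbance ratio of a DNA sample."""
    if a280 <= 0:
        raise ValueError("A280 must be positive")
    return a260 / a280


def looks_protein_free(a260: float, a280: float,
                       low: float = 1.7, high: float = 2.0) -> bool:
    """True if the ratio lies inside the assumed acceptance window.

    The window (low, high) is a hypothetical default chosen for illustration.
    """
    return low <= purity_ratio(a260, a280) <= high


# Hypothetical spectrophotometer readings for two extractions
print(purity_ratio(0.90, 0.50), looks_protein_free(0.90, 0.50))
print(purity_ratio(0.60, 0.50), looks_protein_free(0.60, 0.50))
```

A low ratio, as in the second reading, would suggest residual protein and might prompt a repeat of the purification step.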

 

b.   Restriction enzymes

 

These enzymes cleave DNA at specific sites. They were discovered through their ability to fragment the DNA of foreign organisms, which restricts infection of the host cell. There are three types of restriction enzymes (I, II and III), of which only the type II enzymes are used for cleaving DNA in genetic analysis. Examples of restriction enzymes are HindII, HindIII and EcoRI. These enzymes recognize short DNA sequences that are rotationally symmetrical, called palindromic sequences. For instance, EcoRI cleaves the site 5’-GAATTC-3’. Certain single nucleotide differences between individuals alter the recognition site of a restriction enzyme, so that the enzyme can no longer cleave the DNA at that position. The resulting difference in fragment sizes is used to identify the DNA polymorphism, which is called a restriction fragment length polymorphism (RFLP). RFLP was the first DNA typing methodology used for genetic mapping, localization of genes involved in genetic disorders, and paternity testing. Several polymorphic restriction sites are now known in the human genome and are applied in population genetics studies.

Figure 1. Cleavage of DNA fragments by restriction enzymes. These enzymes act on specific DNA sequences to cleave them, creating overhanging or blunt ends. The blue line shows the cleavage site on the DNA sequences. Cleavage breaks the hydrogen bonds and the phosphodiester linkages on the DNA fragment. Figure adapted from Analysis of Genes and Genomes by Richard J Reece.
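The way a single nucleotide difference produces an RFLP can be illustrated with a short in silico digest. This is a sketch under stated assumptions: the two allele sequences are invented for the example, the DNA is treated as linear, and only the top strand is searched, which is sufficient here because the EcoRI site is palindromic. EcoRI cuts between G and A within GAATTC.

```python
ECORI_SITE = "GAATTC"  # recognition sequence given in the text
CUT_OFFSET = 1         # EcoRI cuts after the first G (G^AATTC)


def fragment_lengths(seq, site=ECORI_SITE, offset=CUT_OFFSET):
    """Digest a linear DNA sequence and return the fragment lengths in order."""
    seq = seq.upper()
    cuts = []
    pos = seq.find(site)
    while pos != -1:
        cuts.append(pos + offset)
        pos = seq.find(site, pos + 1)
    bounds = [0] + cuts + [len(seq)]
    return [end - start for start, end in zip(bounds, bounds[1:])]


# Two hypothetical alleles that differ by one nucleotide inside the site
allele_1 = "ATGCCGAATTCGGTACCTTAG"  # site intact: the enzyme cuts
allele_2 = "ATGCCGAGTTCGGTACCTTAG"  # A -> G change destroys the site

print(fragment_lengths(allele_1))  # two fragments
print(fragment_lengths(allele_2))  # one uncut fragment
```

On a gel, the first allele would appear as two shorter bands and the second as a single longer band, which is the length polymorphism that gives RFLP its name.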

 

c.    DNA hybridization

 

Hybridization of complementary bases in two separate DNA molecules is used to identify similar DNA sequences. Inter-species comparison (for example of ape and human DNA) was first done by harnessing this property of DNA. The technique requires radio-labeled DNA probes, which are quite expensive, so cheaper methods of detecting similar DNA have largely replaced it. DNA-DNA hybridization has nevertheless been retained where its accuracy is needed.
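The base-pairing rule that hybridization relies on can be shown with a few lines of code. The sketch below is a simplification: it reports only perfect matches between a single-stranded probe and a target strand, whereas real hybridization tolerates some mismatch depending on the conditions. The sequences and function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the strand that pairs with `seq`, written 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def hybridization_sites(target, probe):
    """Start positions on `target` where `probe` can pair perfectly."""
    site = reverse_complement(probe)
    target = target.upper()
    return [i for i in range(len(target) - len(site) + 1)
            if target[i:i + len(site)] == site]


# Hypothetical target strand and probe
print(reverse_complement("GAATTC"))                   # palindromic site
print(hybridization_sites("TTGACCGTAAGC", "TTACGG"))  # probe pairs with CCGTAA
```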

 

d.   Gel electrophoresis

 

Electrophoresis is central to population genetics research because it is used for the separation and detection of genetic biomarkers. The first electrophoresis experiments used sucrose gradients to detect digested and ligated DNA molecules. The working principle of the technique is the differential migration of macromolecules (for example proteins, nucleic acids and carbohydrates) on the basis of size and net electric charge under the influence of an external electric field. DNA, which is negatively charged because of its sugar-phosphate backbone, is loaded into a gel that provides a porous medium through which it migrates towards the positive electrode (anode). The friction that the strands encounter as they move through the gel separates the fragments according to size, with smaller fragments migrating farther. Two kinds of gels are most widely used: polyacrylamide gels and agarose gels. Polyacrylamide gel electrophoresis (PAGE) was introduced to separate proteins but is also applied to nucleic acid separation. Agarose is extracted from seaweed and is available as a powder, which is suspended in a standard buffer to make a solution that solidifies at room temperature. Agarose gels are used for the separation of nucleic acids only. Electrophoresis has been used for detection ever since studies on classical genetic markers began and, like many other molecular techniques, has evolved for ease of use and wider applicability.

 

 

e.   Polymerase Chain Reaction (PCR)

 

PCR has had an extraordinary impact on varied fields of research, such as molecular biology, genetics and anthropological genetics, because of its rapid adoption and wide range of applications. The in vitro amplification of selected genomic regions by PCR is today the most essential technique in every laboratory whose research has a molecular biological aspect. The fundamental steps of a PCR cycle are strand denaturation, annealing and extension. An initial denaturation step is added before the first cycle to initiate the reaction, a final extension step at the end of the PCR completes any partially extended fragments, and a cooling phase (4°C, 10°C or 15°C) brings down the high reaction temperature. The chemical reagents, called components, are:

 

(i) DNA polymerase, normally isolated from the thermophilic micro-organism Thermus aquaticus and called Taq DNA polymerase,

 

(ii)   Enzyme buffer to aid the enzyme activity,

 

(iii)   Oligonucleotide primers to anneal with specific region(s) of interest,

 

(iv)   Deoxyribonucleotides (dNTPs),

(v)   A co-factor for the enzyme, which is Mg2+ for most reactions, and

(vi)   A DNA (or RNA) molecule used as a template.

 

The components (reagents and conditions) are modified to meet the specific requirements of the templates, and this has led to the creation of many different versions of PCR: reverse transcription (RT) PCR (for an mRNA template), multiplex PCR, inverse PCR, allele-specific PCR, hot-start PCR and nested PCR. The versatility of the technique finds application in archaeology, medicine and forensics, besides evolutionary studies, gene mapping and mutagenesis in molecular biology laboratories. Knowledge of the properties of the DNA molecule (such as complementary base pairing and its synthesis), the availability of primers and the discovery of polymerases laid the foundation for the invention of the technique in the 1980s by Kary Mullis and his co-workers. The heating and cooling cycles were initially maintained manually; automated thermal cyclers are now available.
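The role of the primers and the effect of repeated cycling can be sketched in code. The example below is illustrative only: the template and primer sequences are invented, primers are required to match perfectly, and the yield calculation assumes a constant per-cycle efficiency, which real reactions do not maintain in later cycles.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.upper().translate(COMPLEMENT)[::-1]


def amplicon(template, forward, reverse):
    """Return the region bounded by the two primers, or None.

    `template` is the top strand written 5' to 3'. The reverse primer anneals
    to the top strand, so its reverse complement is searched for on it.
    """
    template = template.upper()
    start = template.find(forward.upper())
    if start == -1:
        return None
    rc = reverse_complement(reverse)
    end = template.find(rc, start + len(forward))
    if end == -1:
        return None
    return template[start:end + len(rc)]


def ideal_copies(initial_copies, cycles, efficiency=1.0):
    """Copies after `cycles` rounds; efficiency 1.0 means perfect doubling."""
    return initial_copies * (1.0 + efficiency) ** cycles


# Hypothetical template and primer pair
template = "AAACCGGTTTACGATTGCAGGCTTT"
print(amplicon(template, "CCGGTT", "GCCTGC"))
print(ideal_copies(1, 30))  # theoretical upper bound after 30 cycles
```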

 

f.   Automated DNA sequencing

 

Chemical methods of DNA sequencing were introduced in 1977 by Allan Maxam and Walter Gilbert, who devised a chemical method to cleave the sugar-phosphate bonds in a DNA fragment at specific bases. These methods were limited, as only about 100 bases could be sequenced in a single reaction. An alternative approach, the chain termination method, was devised by Frederick Sanger. The synthesis of a new DNA strand is terminated by the incorporation of a 2’,3’-dideoxynucleotide at a specific base, which gives the method its name. This method had advantages over the chemical methods and was therefore automated to cut down on the labor involved. Sequencing reactions are now performed in a single tube, the products are separated in a single lane by gel electrophoresis, and fluorescence detectors are used to detect the DNA fragments. Base-calling software converts the fluorescence signals into a sequence of bases. The method can read about 1000 bases in a single reaction. Figure 2 shows electropherograms of a DNA sequence.

Figure 2. Electropherograms showing a sequence output
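The idea behind base calling can be conveyed with a deliberately simplified sketch: for each detected peak, the base whose fluorescence channel is strongest is reported. Production base callers also perform peak detection, signal correction and quality scoring, none of which is attempted here. The channel order, the intensity values and the `min_signal` threshold are assumptions made for the example.

```python
CHANNELS = "ACGT"  # assumed order of the four fluorescence channels


def call_bases(peaks, min_signal=0.0):
    """Convert per-peak channel intensities into a string of bases.

    `peaks` is a list of 4-tuples of intensities in CHANNELS order. A peak is
    called 'N' when its strongest signal does not exceed `min_signal` or when
    two channels tie for the strongest signal.
    """
    calls = []
    for intensities in peaks:
        values = list(intensities)
        top = max(values)
        if top <= min_signal or values.count(top) > 1:
            calls.append("N")
        else:
            calls.append(CHANNELS[values.index(top)])
    return "".join(calls)


# Hypothetical intensities for four consecutive peaks
peaks = [(900, 30, 20, 10), (15, 40, 880, 25), (10, 20, 30, 950), (5, 5, 5, 5)]
print(call_bases(peaks, min_signal=50))
```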

 

4.   Statistical tools

 

Biological data are characteristically vast. Understanding the patterns in such data and drawing meaningful inferences requires a careful choice of analytical tools. Different statistical tools are in use to obtain accurate and unbiased estimates of genetic diversity. They address certain basic aspects of the statistical analysis of genetic diversity:

 

a.    Description of diversity

 

Descriptive statistics in genetic diversity studies are used to assess genetic variation by comparing its patterns among individuals within groups and between different groups of populations. Any population genetics question is answered by first compiling population diversity data, which comprise observed and expected genotype frequencies. This requires finding the observed genotype frequencies and allele counts. Alleles can be counted by a simple gene counting method. After the observed allele and genotype frequencies have been calculated, the frequencies expected under Hardy-Weinberg equilibrium are calculated. The next step is to calculate observed and expected heterozygosity and the gene diversity indices, which measure the deviation from the heterozygote frequencies expected under Hardy-Weinberg equilibrium. Gene diversity indices are used to quantify the extent of genetic differentiation between populations. The coefficient of gene differentiation, GST (Nei, 1973), is the ratio of the inter-population gene diversity (DST) to the total gene diversity (HT) among the sub-populations.

 

GST = DST / HT

 

The gene diversity of the total population (HT) can be calculated by taking the average gene frequencies of all populations.

 

HT = HS + DST

 

The average gene diversity within populations (HS) and the average gene diversity between populations (DST) are added to obtain HT. Another way of assessing differentiation is the fixation index. Fixation-index-based measures of deviation from expected heterozygosity are used to define population structure. FIS compares the average observed heterozygosity of individuals in each sub-population with the average expected heterozygosity of the sub-populations. FST compares the average expected heterozygosity of the sub-populations with the expected heterozygosity of the total population. FIT is the overall fixation index and measures the reduction in heterozygosity of an individual relative to the total population. The three indices are related by:

 

(1-FIT) = (1-FIS) (1-FST)
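The quantities defined above can be computed for a single locus as follows. This is a minimal sketch under stated assumptions: sub-populations are weighted equally, no sample-size corrections are applied, the F-statistics are computed for a biallelic locus from genotype counts, and every sub-population is assumed to be polymorphic so that no division by zero occurs. Function names and example counts are hypothetical.

```python
def gene_diversity(freqs):
    """Expected heterozygosity, 1 - sum(p_i^2), for one locus."""
    return 1.0 - sum(p * p for p in freqs)


def nei_gst(subpop_freqs):
    """Return (HS, HT, DST, GST) from one allele-frequency list per sub-population."""
    k = len(subpop_freqs)
    hs = sum(gene_diversity(f) for f in subpop_freqs) / k
    mean_freqs = [sum(col) / k for col in zip(*subpop_freqs)]
    ht = gene_diversity(mean_freqs)      # diversity of the pooled frequencies
    dst = ht - hs                        # from HT = HS + DST
    return hs, ht, dst, (dst / ht if ht > 0 else 0.0)


def allele_freq_and_het(n_AA, n_Aa, n_aa):
    """Gene counting: return (frequency of A, observed heterozygosity)."""
    n = n_AA + n_Aa + n_aa
    return (2 * n_AA + n_Aa) / (2 * n), n_Aa / n


def f_statistics(genotype_counts):
    """Return (FIS, FST, FIT) from (n_AA, n_Aa, n_aa) tuples, one per sub-population."""
    stats = [allele_freq_and_het(*g) for g in genotype_counts]
    k = len(stats)
    h_i = sum(h for _, h in stats) / k                # mean observed heterozygosity
    h_s = sum(2 * p * (1 - p) for p, _ in stats) / k  # mean Hardy-Weinberg expectation
    p_bar = sum(p for p, _ in stats) / k
    h_t = 2 * p_bar * (1 - p_bar)                     # expectation for the total population
    return (h_s - h_i) / h_s, (h_t - h_s) / h_t, (h_t - h_i) / h_t


print(nei_gst([[0.9, 0.1], [0.1, 0.9]]))
fis, fst, fit = f_statistics([(25, 50, 25), (64, 32, 4)])
print(fis, fst, fit)
print((1 - fit), (1 - fis) * (1 - fst))  # the two sides of the identity
```

Both example sub-populations are in Hardy-Weinberg proportions, so FIS is close to zero and the total deficit of heterozygotes is attributable to differences in allele frequency between the sub-populations.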

 

Various indices analogous to FST have been introduced to accommodate different model parameters. For example, NST considers phylogenetic differences among haplotypes in addition to differences in haplotype frequencies. The Analysis of Molecular Variance (AMOVA) is used to quantify how diversity (the average distance between randomly chosen haplotypes or alleles) is apportioned by population stratification brought about by geographic, linguistic or ethnic barriers. It is also used as a measure of population differentiation. The mismatch distribution is another measure of genetic diversity; it is the distribution of pairwise differences between sequences and is used when allelic differences can be counted.
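A mismatch distribution can be tabulated directly from aligned haplotype sequences by counting the differences in every pair. The sketch below assumes sequences of equal length with no gaps or missing data; the haplotypes shown are invented for the example.

```python
from collections import Counter
from itertools import combinations


def pairwise_differences(a, b):
    """Number of aligned positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def mismatch_distribution(haplotypes):
    """Map each number of differences to the number of sequence pairs showing it."""
    return Counter(pairwise_differences(a, b)
                   for a, b in combinations(haplotypes, 2))


def mean_pairwise_difference(haplotypes):
    dist = mismatch_distribution(haplotypes)
    return sum(d * n for d, n in dist.items()) / sum(dist.values())


haplotypes = ["AAAA", "AAAT", "AATT", "TTTT"]  # hypothetical sample
print(sorted(mismatch_distribution(haplotypes).items()))
print(mean_pairwise_difference(haplotypes))
```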

 

b.    Evaluating genetic relationships

 

The relatedness of populations is compared using genetic distance statistics, where a higher value of the statistic between two molecules or populations implies a greater evolutionary distance between them. When a group of populations is being studied, pairwise comparisons are made to examine the population structure and molecular diversity among the populations under investigation. The genetic distance can also give an idea of the evolutionary divergence of the populations since their origin. The choice of genetic distance measure depends on the evolutionary force in question and the mutation rate of the markers chosen for study. Two commonly used indices of genetic distance are FST and Nei’s genetic distance, DA (Nei, 1987). To visualize, and hence better comprehend, population relatedness, genetic distances are represented graphically in multiple dimensions. Using multivariate analysis, the information available from a large number of loci is reduced to one or a few components that explain most of the variability. Jombart et al. (2009) used the phrase “ordination in reduced space” interchangeably with multivariate analysis, and quite correctly. The analysis addresses the two-fold challenge, basic to biological data, of large datasets and limited space for a comprehensive representation.
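As an illustration of a genetic distance, the sketch below computes Nei's DA in its commonly cited form, DA = 1 - (1/r) Σ_j Σ_i sqrt(x_ij y_ij), where r is the number of loci and x_ij and y_ij are the frequencies of allele i at locus j in the two populations. Treat the choice of this formula as an assumption about which distance the text intends; the allele frequencies are hypothetical.

```python
from math import sqrt


def nei_da(pop_x, pop_y):
    """Nei's D_A between two populations.

    Each argument is a list of loci; each locus is a list of allele
    frequencies, with alleles in the same order in both populations.
    """
    if len(pop_x) != len(pop_y):
        raise ValueError("both populations need the same loci")
    r = len(pop_x)
    shared = sum(sqrt(x * y)
                 for locus_x, locus_y in zip(pop_x, pop_y)
                 for x, y in zip(locus_x, locus_y))
    return 1.0 - shared / r


# Hypothetical frequencies at two loci in three populations
pops = {
    "P1": [[0.5, 0.5], [0.2, 0.8]],
    "P2": [[0.6, 0.4], [0.3, 0.7]],
    "P3": [[0.1, 0.9], [0.9, 0.1]],
}
for a in pops:
    print(a, [round(nei_da(pops[a], pops[b]), 4) for b in pops])
```

The printed rows form a pairwise distance matrix of the kind that is then passed to ordination or tree-building methods.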

 

c.   Reconstruction of evolutionary history (phylogenetic) or population history

 

The evolutionary history of a group of individuals or populations can be inferred by comparing a set of sequences representing either a portion of the genome or the complete genome. The most widely used and lucid way of doing this is with a phylogenetic tree. The four basic approaches used in phylogenetic analysis are distance methods, parsimony methods, maximum likelihood and Bayesian methods.

 

i. Distance methods are used to find the amount of divergence between two individuals, groups of individuals or populations by evaluating the number of differences in each pair of DNA or protein sequences.

ii. Parsimony methods systematically search the possible phylogenetic trees to find the one that requires the minimum number of fixed mutations to account for the data.

iii. A maximum likelihood approach uses a model of nucleotide or amino acid substitution and identifies the tree with the maximum probability of producing the observed data under this model.

iv. Bayesian methods infer the probability of a gene tree by combining the data with a priori assumptions about the distribution of possible trees.
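The parsimony criterion described above can be illustrated on four taxa, for which only three tree shapes exist. The sketch below scores each candidate tree with Fitch's algorithm, which counts the minimum number of changes a column of the alignment requires on a given binary tree. The alignment is invented for the example, trees are written as nested tuples, and an exhaustive search of this kind is practical only for small numbers of taxa.

```python
def fitch_site(tree, states):
    """Return (possible ancestral states, minimum changes) for one column."""
    if isinstance(tree, str):                  # leaf: its observed state
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_site(left, states)
    right_set, right_cost = fitch_site(right, states)
    common = left_set & right_set
    if common:                                 # children agree: no new change
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Total minimum number of changes over all columns of the alignment."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch_site(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(length)
    )


# Hypothetical aligned sequences for four taxa
alignment = {
    "human":   "ACGTACGTAC",
    "chimp":   "ACGTACGTAT",
    "gorilla": "ACGTTCGTAT",
    "orang":   "TCGTTCGAAT",
}
candidate_trees = [
    (("human", "chimp"), ("gorilla", "orang")),
    (("human", "gorilla"), ("chimp", "orang")),
    (("human", "orang"), ("chimp", "gorilla")),
]
for tree in candidate_trees:
    print(parsimony_score(tree, alignment), tree)
best = min(candidate_trees, key=lambda t: parsimony_score(t, alignment))
print("most parsimonious:", best)
```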

 

Bootstrapping in phylogenetic analysis is done to assign a level of confidence to the nodes obtained in a tree. It is a resampling procedure in which the columns (sites) of a given dataset are sampled at random with replacement to build new datasets of the same size, and the tree is rebuilt from each to check whether the branches remain unchanged. If the same node is obtained in 95 of a total of 100 replicates, it can be inferred that the node is well supported.
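The resampling step can be sketched as follows. To keep the example short, a full tree is not rebuilt for each replicate; instead the code asks whether a chosen pair of taxa remains the closest pair by p-distance, which is the first grouping that a simple clustering method would form. That shortcut, the seed and the toy alignment are assumptions of this sketch. Columns are drawn at random with replacement, as in the standard bootstrap.

```python
import random
from itertools import combinations


def p_distance(a, b):
    """Proportion of aligned sites at which two sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def closest_pair(alignment):
    """Pair of taxa with the smallest p-distance (ties go to the first pair)."""
    return min(combinations(sorted(alignment), 2),
               key=lambda pair: p_distance(alignment[pair[0]], alignment[pair[1]]))


def bootstrap_support(alignment, pair, replicates=100, seed=0):
    """Fraction of bootstrap replicates in which `pair` is the closest pair."""
    rng = random.Random(seed)
    length = len(next(iter(alignment.values())))
    target = tuple(sorted(pair))
    hits = 0
    for _ in range(replicates):
        # Draw column indices with replacement to build a pseudo-alignment
        cols = [rng.randrange(length) for _ in range(length)]
        resampled = {name: "".join(seq[c] for c in cols)
                     for name, seq in alignment.items()}
        if closest_pair(resampled) == target:
            hits += 1
    return hits / replicates


alignment = {  # hypothetical aligned sequences
    "human":   "ACGTACGTAC",
    "chimp":   "ACGTACGTAT",
    "gorilla": "ACGTTCGTAT",
    "orang":   "TCGTTCGAAT",
}
print(bootstrap_support(alignment, ("human", "chimp")))
```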

 

d.    Dating evolutionary history

 

Genetic data can be integrated with data from other areas, such as paleontology or archeology, to yield the sequence of events that form the evolutionary history of a gene. Population divergence events are dated by estimating FST and Nei’s D, assuming the absence of gene flow after divergence. Allele frequency differences between populations are used to calculate FST, which in turn gives an idea of the time that has passed since the divergence event. Nei’s D statistic, which has a linear relationship with time, is used for a similar purpose. Examples of what can be estimated include the age of a mutation such as CCR5-Δ32, population size at a point in time, and mutation rates.
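The conversion from a distance statistic to a time estimate can be sketched with two textbook approximations that are not spelled out in this module and are therefore assumptions of the example: the linear relation D = 2αt for Nei's D, where α is the substitution rate per locus per year, and FST ≈ 1 - exp(-t / (2Ne)) for divergence by genetic drift alone in populations of constant effective size Ne, with t in generations. The rate and population size used below are hypothetical.

```python
from math import exp, log


def years_from_nei_d(d, rate_per_locus_per_year):
    """Divergence time from D = 2 * alpha * t, i.e. t = D / (2 * alpha)."""
    return d / (2.0 * rate_per_locus_per_year)


def generations_from_fst(fst, effective_size):
    """Divergence time from FST = 1 - exp(-t / (2 Ne)), i.e. t = -2 Ne ln(1 - FST).

    Assumes drift only: no gene flow, no selection, constant population size.
    """
    if not 0.0 <= fst < 1.0:
        raise ValueError("FST must lie in [0, 1)")
    return -2.0 * effective_size * log(1.0 - fst)


# Hypothetical inputs
print(years_from_nei_d(0.01, 5e-7))
print(generations_from_fst(0.05, 1000))
```

Because both formulas ignore gene flow, any migration after the split would make the estimated time too recent.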

 

e.   Detecting selection

 

Detection of selection in the human genome is done through genome-wide scans and the screening of candidate genes. Methods of detection are chosen depending on the genomic region under study.

 

  • Ka/Ks ratio – This is the ratio of the rate of non-synonymous (amino acid changing) substitution to the rate of synonymous substitution in DNA sequences. If Ka/Ks = 1, the protein-coding gene is said to be undergoing neutral evolution. If Ka/Ks < 1, non-synonymous substitutions are less frequent than expected under neutral evolution, which indicates that the protein is subject to selective constraints: many amino acid substitutions are deleterious and are eliminated by selection. Most protein-coding genes show this pattern. If Ka/Ks > 1, amino acid substitutions have occurred faster than expected under neutrality, which implies that the substitutions are favorable.
  • Tests of neutrality – Statistics that compare the observed levels of diversity with those expected under neutral evolution constitute neutrality tests. The power of these tests depends on the type of selection, the time when selection occurred and whether the selective force is acting on the derived variant. For example, the Hudson-Kreitman-Aguadé (HKA) test compares polymorphism within species with divergence between species and is used for both coding and non-coding sequences. The allele frequency spectrum is also evaluated to detect selection, as haplotypes undergoing selection, or haplotypes near a region undergoing selection, have an altered frequency. Examples of such neutrality statistics are Tajima’s D and Fay and Wu’s H.
  • Extended haplotype tests – Haplotypes carrying an allele that has recently undergone positive selection extend over longer distances in the genome than haplotypes without such an allele. An increase in the frequency of the selected allele leads to an increase in the frequency of the other alleles on the haplotype. Detection of this signature of selection is called an extended haplotype test.
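Two of the statistics named above lend themselves to a short sketch. The first function applies the Ka/Ks thresholds from the text; its `tolerance` parameter is a hypothetical convenience, since estimated ratios are never exactly 1. The second computes Tajima's D from aligned sequences using the standard constants from Tajima's 1989 formulation; it assumes no gaps or missing data and at least four sequences. The sample sequences are invented.

```python
from itertools import combinations
from math import sqrt


def interpret_ka_ks(ka, ks, tolerance=0.05):
    """Classify a Ka/Ks ratio; `tolerance` around 1 is an assumed cut-off."""
    if ks <= 0:
        raise ValueError("Ks must be positive")
    ratio = ka / ks
    if abs(ratio - 1.0) <= tolerance:
        return "neutral"
    return "purifying" if ratio < 1.0 else "positive"


def tajimas_d(sequences):
    """Tajima's D for aligned sequences; None if no site is variable."""
    n = len(sequences)
    if n < 4:
        raise ValueError("at least four sequences are needed")
    # S: number of segregating (variable) sites
    s = sum(len(set(column)) > 1 for column in zip(*sequences))
    if s == 0:
        return None
    # pi: mean number of differences between pairs of sequences
    pairs = list(combinations(sequences, 2))
    pi = sum(sum(x != y for x, y in zip(a, b)) for a, b in pairs) / len(pairs)
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    return (pi - s / a1) / sqrt(e1 * s + e2 * s * (s - 1))


print(interpret_ka_ks(0.1, 0.5))
print(tajimas_d(["AAAA", "TAAA", "ATAA", "AATA"]))  # only rare variants
print(tajimas_d(["AA", "AA", "TT", "TT"]))          # intermediate-frequency variants
```

An excess of rare variants gives a negative D and an excess of intermediate-frequency variants gives a positive D; either can also arise from demographic history, so the sign alone does not establish selection.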

 

f.    Assessing geographical variation

 

Evaluating the geographical variation of genetic data helps to partition the contributions of the historical and geographical factors that have led to extant patterns of diversity. The geographical patterning of allele frequencies, observed as patches or as continuous gradients known as clines, is called a genetic landscape. These patterns result from different evolutionary processes that produce isolation by distance (IBD) and clinal gradients. Sampling for the study of geographic variation should be done from closely located sites. The distance between sampling sites is calculated by first estimating the great circle distance and then taking into account the recent migratory history of Homo sapiens (the method for calculating geographic distance is given in Ramachandran et al., 2005). The genetic evidence can then be represented on a map with isogenic lines joining points of equal allele frequency. Genetic boundaries, which signify genetic differentiation between neighboring populations due to limited gene flow, can be identified from these maps. Two popular methods are in use for assessing the correlation between genetic distance and geographic distance:

 

i. Spatial autocorrelation:

 

A statistical method called spatial autocorrelation has been used in several studies to compute the correlation between allele frequencies and geography. Moran’s I is a popular index for quantifying this correlation. The analysis yields a correlogram, which gives a quantitative (and pictorial) evaluation of the geographical pattern of genetic variation. A positive value of I indicates positive correlation, while a negative value indicates negative correlation.
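Moran's I for one distance class can be computed as I = (N / W) Σ_i Σ_j w_ij (x_i - x̄)(x_j - x̄) / Σ_i (x_i - x̄)², where x_i is the allele frequency at site i, w_ij is the weight linking sites i and j, and W is the sum of all weights. The sketch below assumes sampling sites arranged along a line and binary weights that link only adjacent sites; both are simplifications made for the example, as are the allele frequencies. Repeating the calculation for successive distance classes would give the points of a correlogram.

```python
def morans_i(values, weights):
    """Moran's I for a list of values and a square matrix of spatial weights."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    w_total = sum(sum(row) for row in weights)
    cross = sum(weights[i][j] * dev[i] * dev[j]
                for i in range(n) for j in range(n))
    return (n / w_total) * cross / sum(d * d for d in dev)


def neighbour_weights(n):
    """Binary weights for sites on a line: 1 for adjacent sites, 0 otherwise."""
    return [[1 if abs(i - j) == 1 else 0 for j in range(n)] for i in range(n)]


w = neighbour_weights(5)
cline = [0.1, 0.2, 0.3, 0.4, 0.5]        # hypothetical smooth gradient
alternating = [0.1, 0.5, 0.1, 0.5, 0.1]  # hypothetical patchy pattern
print(morans_i(cline, w))        # neighbours resemble each other
print(morans_i(alternating, w))  # neighbours differ
```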

 

ii.    Mantel test:

 

An alternative method is the Mantel test, which uses pairwise comparisons between populations to generate geographic distance and genetic distance matrices. The correlation between these matrices is then assessed to make inferences about the evolutionary history of the populations under study.
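A Mantel test can be sketched as a correlation between the off-diagonal entries of the two matrices, with significance judged by repeatedly permuting the population labels of one matrix. The version below is a minimal one-sided permutation test; the number of permutations, the seed and the toy matrices (four populations placed along a line) are assumptions made for the example.

```python
import random
from math import sqrt


def upper_triangle(matrix):
    """Entries above the diagonal of a square matrix, row by row."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) *
                      sum((b - my) ** 2 for b in y))


def mantel(genetic, geographic, permutations=999, seed=0):
    """Return (r, p): matrix correlation and one-sided permutation p-value."""
    rng = random.Random(seed)
    n = len(genetic)
    geo = upper_triangle(geographic)
    r_obs = pearson(upper_triangle(genetic), geo)
    order = list(range(n))
    hits = 0
    for _ in range(permutations):
        rng.shuffle(order)  # relabel populations in the genetic matrix
        permuted = [[genetic[order[i]][order[j]] for j in range(n)]
                    for i in range(n)]
        if pearson(upper_triangle(permuted), geo) >= r_obs:
            hits += 1
    return r_obs, (hits + 1) / (permutations + 1)


positions = [0, 1, 2, 4]  # hypothetical site positions along a transect
geographic = [[abs(a - b) for b in positions] for a in positions]
genetic = [[0.01 * d for d in row] for row in geographic]  # perfect isolation by distance
print(mantel(genetic, geographic))
```

Permuting whole rows and columns together, rather than individual entries, respects the fact that the pairwise distances in a matrix are not independent of one another.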

 

Summary

  • Several tools and techniques have been devised and used to gather information on molecular diversity in human populations; they differ in the type of data to be collected and the evolutionary model being tested.
  • Molecular tools are used to collect data on DNA and protein based polymorphisms while statistical tools are used to describe and analyze the data collected.
  • Extraction of DNA is the first step of molecular genetic analysis, which is followed by amplification of a selected region of the genome and genotyping by RFLP or sequencing techniques.
  • Molecular diversity is described by estimating allele frequencies, heterozygosity and indices of population differentiation.
  • Genetic affinities are measured using genetic distance, phylogenetic trees and clustering algorithms.
  • Bootstrapping is used in phylogenetic analysis to find out the level of confidence of the results obtained on genetic relationships.
  • Genetic data can be used to estimate events that have occurred during the evolutionary history of a gene or a population.
  • Genomic regions that have undergone selection can be detected by evaluating the Ka/Ks ratio, tests of neutrality or extended haplotype tests.
  • Geographical variation of genetic data helps to partition the contributions of the historical and geographical factors that have led to extant patterns of diversity.

 


References

 

  • Miller SA, Dykes DD, Polesky HF. 1988. A simple salting out procedure for extracting DNA from human nucleated cells. Nucleic Acids Res 16: 1215.
  • Nei M. 1973. Analysis of gene diversity in subdivided populations. Proc Natl Acad Sci USA 70: 3321-3323.
  • Nei M. 1987. Molecular evolutionary genetics. New York: Columbia University Press.
  • Ramachandran et al. 2005. Support from the relationship of genetic and geographic distance in human populations for a serial founder effect originating in Africa. Proc Natl Acad Sci USA 102: 15942-15947.
  • Sambrook J, Russell DW. 2001. Molecular cloning: a laboratory manual, 3rd ed. Cold Spring Harbor: Cold Spring Harbor Laboratory Press.