30 Human Genome Project

Dr. Siuli Mitra


 

CONTENTS:

 

1.      Learning outcomes

2.      Human genome project: The inception of the idea

3.      Role of bioinformatics in genome projects

4.      Genome sequencing

5.      Overview of the first human genome project

6.      Human genome diversity project

7.      The HapMap project

8. The Genographic Project

9.      The 1000Genomes Consortium

10.  Ethical, Legal and Social (ELS) Issues

11.  Summary

 

 

1.    Learning Outcomes

 

In this section the reader will get an overview of the following:

  • The events that led to the conception of the idea of sequencing the human genome
  • First Human Genome Project
  • Salient features of the Human genome diversity project, HapMap, Genographic and 1000Genomes projects

 

The description and interpretation of human variation in the light of evolution has occupied center stage in anthropological inquiry. Anthropological genetics seeks to understand human origins, population history, and the relationship between human biology, language and culture, and this pursuit has been made easier by human genome sequencing and the large-scale projects that use it. The human genome projects led to the development of new genome sequencing technologies, the determination of the physical map of the genome, and the discovery of millions of common and rare sequence variants, including variants in medically important genes, in different populations of the world. This module describes the events that led to the conception of human genome sequencing, and the human genome projects themselves, with their goals, achievements and pitfalls.

 

2. Human genome project: The inception of the idea

 

The double helical structure of the DNA molecule was proposed in 1953 (Watson and Crick, 1953), and with that molecular biology provided the physical basis for the inheritance of genetic information from one generation to the next. By this time DNA was known to be self-replicating machinery. Knowledge of the double helical structure helped explain the mechanics of replication. Each strand of the DNA molecule can serve as a template for synthesizing a complementary strand, and the paired strands are held together by hydrogen bonds to produce two daughter DNA molecules. In addition, genomic DNA in a cell was known to be of finite length. The nuclear DNA encodes all the molecular information required to build a cell or the whole organism. Deciphering the sequence of human genomic DNA was expected to reveal the coded information for structure and metabolism at the cellular and organismal level. These three factors sowed the seeds of the idea behind the first ever human genome project, which aimed at finding the sequence and composition of the deoxyribonucleotides of the DNA molecules that comprise the entire human genome. The technological prowess required for genome sequencing brought together public and private partnerships involving the best minds in molecular biology. Before delving into genome projects, let us go through the evolution of genome mapping.

 

3. Role of bioinformatics in genome projects:

 

The branch of biology dealing with acquiring, storing, managing, displaying and analyzing massive amounts of biological data is called bioinformatics. Biological data is available as nucleic acid and amino acid sequence data. The huge amount of nucleic acid data produced by genome projects in general, and the human DNA sequence data produced by human genome projects in particular, called for bioinformaticians to play a prominent role in the projects.

 

Bioinformatics tools first came into use after generation of sequences by the sequencers. The DNA sequences were assembled after identification of overlapping areas in a series of sequences. The computer programs used are called sequence assembly software.
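The idea of assembling sequences from their overlapping areas can be illustrated with a minimal sketch. It assumes error-free reads, a single strand, and a greedy merge strategy; the function names `overlap` and `greedy_assemble` and the toy reads are hypothetical, and real sequence assembly software is considerably more elaborate.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    max_len = min(len(a), len(b))
    for k in range(max_len, min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two sequences sharing the longest overlap.

    Assumption: reads are error-free and none is contained inside another.
    Returns the remaining contigs once no overlap of at least `min_len` is left.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            return contigs
        # Join the pair, writing the shared region only once.
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)


# Toy reads (hypothetical) that overlap by three bases each.
print(greedy_assemble(["ATGCGT", "CGTACC", "ACCTGA"]))  # ['ATGCGTACCTGA']
```

The greedy choice keeps the sketch short, but it can mis-join reads in repeat-rich regions, which is one reason such regions are hard to assemble.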

 

In addition, sequence data is viewed as essential by researchers studying genetic diseases in humans. Thus the annotation of gene sequences to establish their functions, and the development of preventive, diagnostic and therapeutic strategies for diseases, were important responsibilities of bioinformatics after genome sequencing.

 

After gene functions are established, appropriate programs are used to compare gene structure and function in different organisms and to examine genetic variation within and between species in order to explore phylogenetic relationships.

 

All these diverse roles have made bioinformatics an ever-expanding field with new software being developed almost every day.

 

4. Genome sequencing:

 

Genomic mapping:

 

The chromosome, comprising a single DNA molecule, is the starting unit for genome mapping. Individual chromosomes are treated with certain dyes to obtain distinct chromosome banding patterns. These patterns are used to generate a cytological map for each chromosome, and the maps are used to locate structural modifications in chromosomes. Isolated single-stranded DNA fragments are located on chromosomes by labeling the fragments with fluorescent probes and hybridizing them to the chromosome. Each fragment hybridizes to its complementary sequence on the chromosome. This method is called fluorescence in situ hybridization (FISH).

 

Genetic mapping:

 

A genetic map represents the distance between two DNA elements, estimated from the recombination frequency between them. The construction of the first genetic map, for Drosophila, by Alfred Sturtevant was in fact the conception of the idea of mapping genomes. Genetic maps can be constructed for each chromosome of an organism. The demerits of this method are that the gene to be mapped must code for an observable phenotype and that a large number of crosses is required to generate enough data for mapping.
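As a rough illustration of how recombination frequencies translate into a genetic map, the sketch below converts recombinant counts into map distances and orders three genes. It assumes the usual short-distance approximation in which 1% recombination corresponds to 1 centimorgan; the function names and the offspring counts are hypothetical.

```python
def map_distance_cm(recombinants: int, total: int) -> float:
    """Map distance in centimorgans, taking 1% recombination as 1 cM.

    Assumption: the approximation only holds for closely linked genes,
    where double crossovers are rare.
    """
    if total <= 0:
        raise ValueError("total offspring must be positive")
    if not 0 <= recombinants <= total:
        raise ValueError("recombinants must lie between 0 and total")
    return 100.0 * recombinants / total


def middle_gene(distances: dict[frozenset[str], float]) -> str:
    """Of three linked genes, the two farthest apart are the outer ones."""
    outer_pair = max(distances, key=distances.get)
    all_genes = set().union(*distances)
    (middle,) = all_genes - outer_pair
    return middle


# Hypothetical counts of recombinant offspring out of 1000 for each gene pair.
counts = {
    frozenset({"A", "B"}): 50,
    frozenset({"B", "C"}): 120,
    frozenset({"A", "C"}): 160,
}
distances = {pair: map_distance_cm(n, 1000) for pair, n in counts.items()}
print(middle_gene(distances))  # B
```

The example also shows why many crosses are needed: each distance is only as reliable as the number of offspring scored for that pair of genes.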

 

Physical mapping:

 

The physical map of a genome is a map of genetic markers made by analyzing a genomic sequence directly rather than obtaining recombination frequencies first. Restriction mapping, radiation hybrid maps and STS maps are some examples of physical maps. The physical maps have been useful in producing a definite order of cloned fragments of DNA but not useful for finding the DNA sequence. Finding the DNA sequence is the final step of a genome sequencing project.

 

Figure 2. The cytological, genetic and physical map of a chromosome

5. Overview of the first human genome project

 

The Human Genome Project (HGP) pioneered large-scale human genome sequencing by aiming to sequence the human genome completely and to make the data generated freely available to researchers. It was a coordinated effort by the National Institutes of Health (NIH) and the Department of Energy (DoE) in the USA to decode the sequence of the complete human genome. The main goals of the project were:

 

a) To generate comprehensive maps of the location of genes in the human genome and in those of other well studied model organisms in biology like bacteria, yeast, nematode, fruit fly, mouse and Arabidopsis thaliana

 

b) To determine the nucleotide composition and sequence of the DNA of the genomes of model organisms mentioned above and to identify and annotate functions of the 20,000 human genes estimated to be present in the human genome using the genetic information from the organisms. The genomic information is in the form of nitrogenous bases named A for Adenine, T for Thymine, G for Guanine and C for Cytosine.

 

c) To store the information generated in databases freely available for public viewing and research

 

d) To improve tools for sequencing data analysis through computational methods

 

 

The project targeted the identification of all the genes, the creation of databases cataloguing the sequences of the individual chromosomes, the development of faster, high-throughput sequencing methodologies, and the investigation of the ethical, legal and social complications that crop up in genome sequencing efforts. The first draft of the human genome sequence, covering 90% of the complete sequence, was published in 2001 by the International Human Genome Sequencing Consortium. The complete sequence for Drosophila (fruit fly) was also available. The final sequence was announced in April 2003.

 

The complete sequencing of the human genome was achieved in record time owing to the development of the whole genome shotgun sequencing technique (Figure1) by the biotechnology company Celera Genomics. In this technique the entire genome is fragmented randomly, after which each fragment is amplified by cloning in a vector. Each vector insert is then sequenced separately using an automated sequencing machine. The sequences are then examined for overlaps and assembled to obtain the complete genome. The achievements of the project were:

 

a) The shotgun sequencing technique developed was subsequently adopted for sequencing genomes of organisms to delineate their phylogenetic relationship with humans.

 

b) The HGP provided information on the structure, organization and function of the set of human genes comprising about 20,500 genes.

 

c)  Comparative genome sequencing complemented efforts to annotate gene functions which were already being done through studies on gene-knock out animal models and made the process faster and economical.

Figure1. Assembling genomic data using the whole genome shotgun sequencing approach. Modified from Robert H. Waterston et al. PNAS 2002;99:3712-3716

 

Large regions of eukaryotic genomes containing heterochromatic DNA could not be sequenced because they are rich in repeat sequences and contain few genes. So although the goal of the project was to obtain a complete sequence for each chromosome, a full sequence has not been obtained to date.

 

 

6. Human genome diversity project (1991)

 

The Human Genome Diversity Project was initiated in 1991 under the leadership of Dr. Luigi Luca Cavalli-Sforza, a geneticist at the Morrison Institute of Stanford University. Samples were collected and have been maintained as lymphoblastoid (LBL) cell lines at the Foundation Jean Dausset-CEPH in Paris (HGDP-CEPH cell lines), created for 1064 individuals belonging to 52 populations. Ethno-historical, linguistic and archaeological data were also collected to support investigations on these samples. Studies on these samples were conducted to examine the patterns of genetic diversity in these populations and to delineate the demographic and cultural factors that shaped the evolutionary forces fashioning these patterns. HGDP had many important firsts (based on a perspective on HGDP, its inception and progress till 2005 by Cavalli-Sforza):

 

  1. The importance of LBL cell lines for their accuracy and renewability for obtaining DNA,
  2. Apprehensions on ‘bio-piracy’, i.e. use of the DNA of indigenous people for commercial purposes, were addressed by the decision to pay careful attention to informed consent and other ethical issues,
  3. The inclusion of samples from diverse ethnic groups (and so the Diversity Project).

Studies on HGDP populations:

 

Rosenberg et al. (2002) typed each sample of the project for 377 autosomal microsatellite loci. Their findings reinforced the importance of geographic isolation in population divergence and established the small magnitude of genetic differences between populations. The same group also used microsatellite markers to evaluate the events of early divergence in human evolution (Rosenberg et al. 2005).

 

In biomedical research, the HGDP can be used for estimating the incidence of recessive diseases and for providing control samples for disease association studies. Anthropologists can use HGDP data and samples to decipher the influence of climatic and ecological factors and cultural practices at the population level, to study population structure and to reconstruct population history (Li et al., 2008; Herraez et al., 2009). Cavalli-Sforza also emphasized the importance of both historical and geographical information in the reconstruction of human evolutionary history. To summarize, the HGDP, in contrast to the human genome project, recognized that human populations are genetically divergent and cannot be represented by a single genome.

 

Although the project planned to collect samples from about 25 individuals each from around 500 populations, the target could not be achieved. The funds required could not be raised. The project also attracted severe criticism from the indigenous populations that were to be recruited. Some important areas of the world, like Australia, India and North America, were not represented.

 

7. The HapMap project (2002)

 

The HapMap Project is an effort by multiple countries to identify and record genetic variation in humans for use in medical genetics, population genetics and evolutionary genetics research. The International HapMap Consortium developed a map of these patterns of genetic variation across the genome by genotyping 1 million or more sequence variants and recording their allele frequencies and the associations between them. Two hundred and sixty-nine samples were screened from populations with African, Asian and European ancestries. The freely accessible information available in papers by the International HapMap Consortium (2005, 2007, and 2010) is helpful for understanding genetic predisposition to disease and response to drugs and other environmental factors.

 

The project occurred in three phases:

 

Phase I:

 

It was aimed at genotyping at least one common SNP (Minor Allele Frequency ≥ 0.05) per 5 kb in all samples. Data were published on 1 million SNPs. Phase I was carried out jointly by nine genomic centers, which worked with six different technologies.

 

Phase II:

 

The number of SNPs typed per sample was 3.1 million, and SNPs with a Minor Allele Frequency less than 0.05 were also included. A single genotyping methodology was adopted. The accuracy of genotyping was estimated at greater than or equal to 0.05.

 

Phase III:

 

Additional samples were included from a diverse set of populations. About 1.3-1.5 million SNPs were genotyped using two standardized methods.
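The Minor Allele Frequency threshold that separates the common SNPs targeted in Phase I from the rarer ones added in Phase II can be made concrete with a short sketch. It assumes a biallelic SNP with genotypes written as two-letter strings; the function names and the genotype lists are hypothetical and are not taken from HapMap data.

```python
from collections import Counter


def minor_allele_frequency(genotypes: list[str]) -> float:
    """Frequency of the rarer allele at a biallelic SNP.

    Each genotype is a two-letter string such as "AA" or "AG"
    (an assumed encoding chosen for this illustration).
    """
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    if len(counts) > 2:
        raise ValueError("expected at most two alleles at a biallelic SNP")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no genotypes supplied")
    if len(counts) < 2:
        return 0.0  # monomorphic in this sample
    return min(counts.values()) / total


def is_common(maf: float, threshold: float = 0.05) -> bool:
    """A SNP counts as common when its MAF reaches the threshold."""
    return maf >= threshold


# Hypothetical samples: 4 heterozygotes among 20 people, then 1 among 40.
print(minor_allele_frequency(["AA"] * 16 + ["AG"] * 4))   # 0.1
print(is_common(minor_allele_frequency(["AA"] * 39 + ["AG"])))  # False
```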

 

A dataset comprising results from SNP genotyping and deep sequencing of selected genomic regions in a subset of samples is available for use by researchers everywhere. The results were used to find patterns of recombination across the genome in different human populations, and they showed an extremely non-uniform distribution of recombination across the genome. Knowledge of recombination helped decipher the Linkage Disequilibrium (LD) patterns across the human genome. LD was found to be discontinuous, organised in block-like structures called haplotype blocks, each containing a number of SNPs that are associated with one another. The regions of the genome with a high occurrence of recombination events are called recombination hotspots. Once tag SNPs were discovered, they could be used as proxies for other SNPs, or to tag adjacent haplotypes in LD with them. The information was subsequently used in other studies to test hypotheses on population histories. Following the third phase of the project, imputation efforts were made which helped obtain information on loci that had not been typed (Marchini et al. 2007).
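The notion of two SNPs being associated, and of one SNP serving as a proxy for another, is commonly quantified with the LD coefficient D and the squared correlation r². The sketch below computes both from haplotype counts for two biallelic SNPs. The function name, the counts and the r² cut-off of 0.8 are assumptions made for illustration, not values taken from the HapMap papers.

```python
def ld_stats(n_AB: int, n_Ab: int, n_aB: int, n_ab: int) -> tuple[float, float]:
    """Return (D, r_squared) for two biallelic SNPs from haplotype counts.

    n_AB is the number of haplotypes carrying allele A at the first SNP
    and allele B at the second, and so on for the other three combinations.
    """
    n = n_AB + n_Ab + n_aB + n_ab
    if n == 0:
        raise ValueError("no haplotypes supplied")
    p_AB = n_AB / n
    p_A = (n_AB + n_Ab) / n
    p_B = (n_AB + n_aB) / n
    d = p_AB - p_A * p_B  # deviation from independent assortment
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    r_squared = d * d / denom if denom > 0 else 0.0
    return d, r_squared


def can_tag(r_squared: float, threshold: float = 0.8) -> bool:
    """Assumed rule of thumb: one SNP can stand in for the other at high r^2."""
    return r_squared >= threshold


# Hypothetical counts: complete association, then no association.
print(ld_stats(50, 0, 0, 50))    # (0.25, 1.0)
print(ld_stats(25, 25, 25, 25))  # (0.0, 0.0)
```

Within a haplotype block, pairs of SNPs would tend to give high r² values, while pairs separated by a recombination hotspot would tend to give low ones.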

 

Cavalli-Sforza (2005) comments that data from HGDP will be complementary to HapMap datasets as the former will be providing the historical information supporting the samples included in the latter for drawing inferences while making evolutionary comparisons.

 

Table1. Population samples for the HapMap project

 

8. The Genographic Project (2005)

 

The project’s sole aim was to elucidate the migration patterns of modern humans after their origin in and dispersal out of Africa. This has been done by studying patterns of diversity in mitochondrial DNA and Y chromosomal DNA. The project turned out to be gigantic, considering the participation it garnered: it is a joint initiative of National Geographic, IBM and the Waitt Family Foundation, with half a million participants who contributed samples and finance. The project also has regional centers at 11 locations around the world. In India the principal investigator of the project’s team is Dr. Pitchappan, whose work aims to understand the pattern of gene flow in northeast Indian populations and the effects on population structure of interbreeding and of marriage within specific groups or customs.

 

9. The 1000Genomes Consortium (2008)

 

The project was aimed at discovering and genotyping different DNA polymorphisms in populations of different ancestries.

 

The important goals of the 1000Genomes project were as follows:

 

a) To characterize more than 95% of variants with an allele frequency greater than or equal to 1% by using high throughput sequencing technologies

 

b) To catalogue both high and low frequency alleles in coding regions

 

The 1000Genomes Project is the first project to have undertaken sequencing of 95% of the complete human genomes of a large number of individuals (2500) of different ancestries, at 4X genomic coverage. The samples were to be collected from five regions of the world, representing populations with European, East Asian, South Asian, West African and American ancestry (The 1000Genomes Consortium, Nature, 2010). A 4X coverage allows detection of most variants with frequencies down to 1%. The data generated can be used to identify the underlying risk of diseases (Genovese et al, 2010), to investigate the processes shaping human genetic variation, to detect de novo mutations in individuals of a pedigree, and to study the effects of selection on local variation, population differentiation, positive selection and the effect of recombination on sequence variation.
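What 4X coverage means for a single site can be sketched with a simple probability model. The example assumes that the number of reads covering a site follows a Poisson distribution whose mean is the coverage, and that each read at a heterozygous site carries the variant allele with probability one half. These are textbook idealisations, not the statistical method used by the project, and the function names are hypothetical.

```python
import math


def prob_depth_at_least(k: int, mean_depth: float) -> float:
    """P(a site is covered by at least k reads) under a Poisson depth model."""
    if k < 0 or mean_depth < 0:
        raise ValueError("k and mean_depth must be non-negative")
    below = sum(
        math.exp(-mean_depth) * mean_depth**i / math.factorial(i)
        for i in range(k)
    )
    return 1.0 - below


def prob_variant_read_seen(mean_depth: float) -> float:
    """P(at least one read shows the variant allele at a heterozygous site).

    Assumption: reads sample the two alleles with equal probability, so
    variant-carrying reads are Poisson with half the mean depth.
    """
    return prob_depth_at_least(1, mean_depth / 2.0)


# At 4X, a site is almost always covered, but the variant allele of a single
# heterozygous individual is missed noticeably more often.
print(round(prob_depth_at_least(1, 4.0), 4))   # 0.9817
print(round(prob_variant_read_seen(4.0), 4))   # 0.8647
```

Under these assumptions, a variant carried by several sampled individuals gets several independent chances of being observed, which is why low coverage across many genomes can still recover most variants at a given frequency.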

 

The genotyping strategies used were assessed using samples collected in the HapMap project and the evaluation was carried out in three projects:

 

Trio project:

 

High coverage whole genome shotgun sequencing in two trios (one Yoruba from Ibadan, Nigeria and one of European ancestry) with each offspring sequenced using three sequencing platforms in multiple centres

 

Low-coverage project:

 

Low coverage whole genome shotgun sequencing of 59 individuals from YRI, 60 unrelated individuals from CEU, 30 unrelated Han Chinese individuals in Beijing (CHB) and 30 Japanese individuals in Tokyo (JPT).

 

Exon project:

 

Targeted sequencing of 8140 exons (protein coding part of a gene) from 906 randomly selected genes in 697 individuals from 7 populations: YRI, LWK, CEU, TSI, CHB, JPT and CHD

 

The phase I data has been used to compare various genomic features, effects of selection, allele frequency distribution and degree of differentiation between populations. The data generated from the project has been used for exome (all exons present in the genome) analysis of individuals with genetic disorders and cancer.

 

Certain features are common to both HapMap and 1000Genomes projects (Buchanan, 2012). They are listed here:

 

a)      High throughput genotyping platforms

 

b)      Development of computational pipelines

 

c)      Defining linkage disequilibrium patterns across the genome and the ability to discover tagSNPs

 

d)     Unbiased estimation of allele frequencies

 

e)      Estimation of ancestry

 

f)       Identification of population substructure

 

g)      Evaluation of genomic structure, rates of recombination and mutation rates

 

The essence of the motivation behind efforts on sequencing human genome is encompassed by a statement given by Dr. Francis Collins on the publication of a part of the genome in 2001:

 

It’s a history book – a narrative of the journey of our species through time. It’s a shop manual, with an incredibly detailed blueprint for building every human cell. And it’s a transformative textbook of medicine, with insights that will give health care providers immense new powers to treat, prevent and cure disease.

 

10. Ethical, Legal and Social (ELS) Issues:

 

The realization of possible misuse of genetic information is as old as the generation of human genetic information and hence the related issues are central to all the genome projects. It is believed that genetic information on a certain group of individuals can be used in ways that can create a bias against them. Additionally there are concerns on who should be allowed to access this information. Hence, the brains behind the human genome projects explored the Ethical, Legal and Social Implications (ELSI) of generation of human genetic information. Ethical issues arise from concerns related to what actions are considered right in the successful functioning of an organization which can be a group of individuals or an entire community. Legal issues concern protection of laws that govern ethical concerns. Social issues encompass concerns regarding effect on individuals and entire communities. The main issues concern privacy of genetic information, genetic testing and psychological issues.

 

The National Human Genome Research Institute (NHGRI) heads the ELSI research programs. It defines four categories within the ELSI research program, depending on the nature of the issues encountered.

 

i. Psychosocial and ethical issues in genomics research

 

Areas in genomics research that require interventions by the ELSI programs are related to procuring consent of participants, use of cell lines, community participation, data sharing and security, privacy rights, third party benefits from genomics research, use of tissue and health data from the deceased and ethnicity variables in genomics.

 

ii.  Psychosocial and ethical issues in genomic medicine

 

Areas in genomic medicine having ELSI relevance are genomic and genetic services for diseased subjects and the third parties involved, personalized genomic-based health care, informed consent for communicating genetic information, genomics in preventive care, and the shift in the roles of healthcare personnel as genomic medicine is implemented.

 

iii. Legal and public policy issues

 

ELSI research includes studies on topics having legal and public policy ramifications, like intellectual property rights issues, regulations governing genetic testing, pharmacogenomics and genome-based therapies, ownership and liability issues of biobanked samples, access to and use of genetic information by life, disability and long term care insurance companies, and use of genomic information for forensic investigation.

 

iv. Broader societal issues

 

ELSI research has profound implications on society. The areas affected are risk and benefits of genomics research availed by communities, genomics research on special populations like newborns, disabled persons, deceased individuals etc., comparative perception of the genetic information of health among individuals, health providers and health care industry, aftermath of comparative genomic research and evidence of natural selection among human populations.

 

The ELSI program ensures the privacy of an individual's genetic information and restricts misuse of genetic data, such as genetic enhancement to have offspring with desirable traits. The effects of genetic testing on individuals, families and communities as a whole are also monitored to prevent biases against them during the provision of medical services. The program also checks that the informed consent of participants is included in genetics research. It is the prerogative of a researcher to educate the participants about elementary genetics and its effect on health. Genetic research projects are hence under immense scrutiny due to the ELS issues related to them. As human genome projects progress and accumulate more genomic resources, the ELS issues are as challenging as ever. In the Indian context, research on ELS issues is all the more complex, with a large number of communities demarcated by social and cultural barriers. With the DNA bill drafted in the monsoon session of the Indian Parliament in 2015, the issues will now be talked about outside the restricted group of genetic researchers.

 

The different genome sequencing projects have been summarized in Table1.

 

Table1. Human genome projects, aims, sample sizes and populations sampled (Modified from Jobling et al., 2014)

 

The evolutionary anthropologist Mark Stoneking wrote about the various outcomes of genome projects that will be useful to molecular anthropologists:

  1. Mitochondrial DNA sequence helped establish the recent origin of the human mtDNA ancestor, and the information was subsequently used for the analysis of human mtDNA variation and evolution.
  2. Human genome projects will permit molecular anthropologists to study evolutionarily important regions of the genome.
  3. Knowledge of polymorphisms of disease genes will be facilitated by human genome projects.
  4. Development of new technology will help link genetic variation with morphological variation.
  5. Comparative genome sequencing of non-human primates will help identify genes involved in the morphological differences between humans and non-human primates.

 

Summary

  • The human genome project was one of the greatest accomplishments with far-reaching effects in the endeavour for understanding human biology. The project was successfully completed in April 2003.
  • The project aimed at mapping the human genome and annotating the functions of individual genes. The project revealed that there are about 20,500 human genes, along with their locations.
  • The technology developed in the project has been used to sequence other organisms’ genomes.
  • The underrepresentation of human diversity in HGP led to the Human Genome Diversity Project, which recruited 52 populations from all over the world that are now part of the HGDP-CEPH panel.
  • HGDP data and samples have been used to delineate the influence of climatic, ecological factors and cultural practices at the population level, for studying population structure and to reconstruct population history.
  • The International HapMap Consortium carried out SNP genotyping and deep sequencing of selected genomic regions in a subset of samples to identify tagSNPs in haplotype blocks across the human genome.
  • The Genographic Project aimed at elucidating the migratory history of modern humans after their origin in and dispersal out of Africa by studying patterns of diversity in mitochondrial DNA and Y chromosomal DNA.
  • The 1000Genomes Project undertook, for the first time, the sequencing of 95% of the complete human genomes of samples collected from populations with European, East Asian, South Asian, West African and American ancestry.
  • The samples collected, data generated and techniques developed in the genome projects continue to be used by researchers in anthropology to study the genetic basis of population structure.