2 Overview of Bioinformatics

Dr. Rajeshwari Sinha

1. Objectives:

The students will learn about

the emerging discipline of Bioinformatics in relation to Biochemistry and Molecular Biology
a classical example revealing the power of sequence alignment for searching similar sequences
the fusion of Biotechnology, Biochemistry and Molecular Biology with Mathematics and Information Technology to give Bioinformatics
the central paradigm of Life with key areas of Bioinformatics
the aim, goal and scope of Bioinformatics

2. Concept Map

3. Overview of Bioinformatics

Bioinformatics began with efforts to combine experimental data with computational algorithms to generate information and knowledge about biological systems. The ultimate goal is to model complete biological processes. A biological process is a recognized series of molecular functions, with a defined start and end, relevant to the integrated functioning of cells. Bioinformatics began on a small scale with sequence alignment algorithms for finding similar sequences conserved during evolution. It now encompasses new foundations for the collection, organization and mining of gene and protein sequences, three-dimensional structures and biochemical functions, in order to model the biological processes of functioning cells.

3.1. Classical example revealing the power of Bioinformatics

A classical example of bioinformatics is provided by the computational gene hunt for the fatal disease cystic fibrosis (CF). This disease is associated with recurrent respiratory infections and abnormal secretions (for details of cystic fibrosis, kindly consult the module Pumps, ionic channels and cystic fibrosis, in paper 3, Structure and Function of Biomolecules II). CF is diagnosed in about one out of every 2500 persons of Caucasian descent, and about one Caucasian in 25 is a carrier of the faulty gene. The gene is autosomal and recessive. Autosomal means that the gene lies on one of the 22 pairs of human chromosomes other than the X and Y chromosomes; recessive means that the disease appears only when both alleles, one on each homologous chromosome, are faulty. With the advancement of laboratory testing, CF is now also well documented in other countries such as India, where most of the disease phenotype is in accordance with that seen in Western populations [Prasad et al., Molecular Basis of Cystic Fibrosis Disease: An Indian Perspective, Indian J Clin Biochem. 2010 Oct; 25(4): 335–341].
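
The two frequencies quoted above (one carrier in 25, one affected person in 2500) are consistent with each other, as a short calculation shows. The sketch below assumes random mating and simple Mendelian inheritance, neither of which is stated in the text, and the function name is purely illustrative.

```python
from fractions import Fraction


def affected_birth_probability(carrier_frequency: Fraction) -> Fraction:
    """Chance that a child of two randomly chosen parents has an autosomal
    recessive disease. Assumes random mating and ignores affected parents."""
    both_parents_carriers = carrier_frequency * carrier_frequency
    # Two carrier parents each pass on the faulty allele with probability 1/2.
    child_gets_two_faulty_alleles = Fraction(1, 2) * Fraction(1, 2)
    return both_parents_carriers * child_gets_two_faulty_alleles


carrier_frequency = Fraction(1, 25)   # one carrier in 25, as quoted in the text
incidence = affected_birth_probability(carrier_frequency)
print(incidence)                      # 1/2500
```

Under these assumptions, (1/25) x (1/25) x (1/4) = 1/2500, which matches the quoted incidence.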

In the early 1980s, biologists knew nothing about this gene. The search for the CF gene began with genetic mapping, using recombination frequencies with known genes and RFLP probes as markers. In 1985, three groups of scientists independently announced that the gene lay within a region of approximately 1 million nucleotides on chromosome 7, located between the met gene (a gene involved in cancer) and D7S8 (an RFLP marker).

This was followed by physical mapping, using chromosome walking and chromosome jumping techniques, to find the actual nucleotide distances and base sequence. A clone hybridizing with the RFLP marker D7S8 was selected and sequenced, followed by selection of the next clones hybridizing with the ends of that clone. After several iterations of selection and sequencing, the gene of interest also gets sequenced, provided it is located near the starting marker. Knowing the approximate location of a gene, however, does not lead to the gene itself. For example, the Huntington’s disease gene was mapped in 1983, i.e. before the CF gene (1985), yet remained elusive until 1993. In addition, sequencing alone gives no idea about the function of the gene of interest.

However, a database search in 1989 showed that the protein encoded by a region of about 6500 nucleotides had sequence similarity to ATP-binding transport proteins. This protein was named the Cystic Fibrosis Transmembrane Conductance Regulator (CFTR). CFTR is an integral membrane protein consisting of a single polypeptide chain of 1480 amino acids with five domains: two transmembrane, two cytoplasmic and one regulatory. This suggests that each domain has a specific function in the transport of chloride ions through the cell membrane. Consequently, mutations in these domains affect chloride transport through the cell membrane, causing the abnormal secretions through which CF manifests.

The sequence database search, based on sequence alignment, not only identified the gene but also hinted at the possible structure and function through which the disease manifests. This showed the power of sequence alignment for finding similar sequences in a database. Sequence similarity searching is nowadays the most widely used bioinformatics tool for getting an idea about the function of a given protein sequence. This was among the first examples in which a predicted protein structure suggested a biochemical function and, in turn, a phenotype. This laid the foundation of Bioinformatics.
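
The idea of a similarity search based on sequence alignment can be illustrated with a minimal sketch. The code below scores a query against each entry of a toy database using the Smith-Waterman local alignment recurrence with a linear gap penalty. The scoring values, the sequences and the entry names are all hypothetical choices made for illustration; real database searches use substitution matrices and faster heuristics.

```python
def local_alignment_score(query: str, subject: str,
                          match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Best Smith-Waterman local alignment score (linear gap penalty)."""
    previous_row = [0] * (len(subject) + 1)
    best = 0
    for q in query:
        current_row = [0]
        for j, s in enumerate(subject, start=1):
            diagonal = previous_row[j - 1] + (match if q == s else mismatch)
            up = previous_row[j] + gap          # gap in the subject
            left = current_row[j - 1] + gap     # gap in the query
            cell = max(0, diagonal, up, left)   # 0 restarts the local alignment
            current_row.append(cell)
            best = max(best, cell)
        previous_row = current_row
    return best


# Toy "database" of hypothetical peptide fragments (not real database entries).
database = {
    "transporter_like": "MKLGSGKSTLLA",
    "unrelated": "PPWQYFHHNDCE",
}
query = "AAGSGKSTAA"

scores = {name: local_alignment_score(query, seq) for name, seq in database.items()}
best_hit = max(scores, key=scores.get)
print(scores)     # {'transporter_like': 12, 'unrelated': 0}
print(best_hit)   # transporter_like
```

The query shares the six-residue stretch GSGKST with the first entry (6 matches x 2 = 12) and no residues at all with the second, so the first entry is reported as the best hit. In a real search, the known function of the best hit is what hints at the function of the query.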

3.2. Biochemistry and Bioinformatics

There are two types of systems or machines: living and nonliving. Both types transform matter, energy and information. Biochemistry is the study of the chemistry of life, and chemistry is the making and breaking of chemical bonds. Life is defined as the capability of a system to reproduce itself, i.e. to produce a replica of the system by itself. Only living things can reproduce themselves, and this ability is made possible by the making and breaking of chemical bonds at very high rates. In addition, living cells/machines are capable of self-duplication with modifications, i.e. evolution through survival of the fittest. Living systems or machines are hierarchically organized: each component sits in a container, and a container may itself be a component of a higher-level container. The largest container is the biosphere, which contains ecosystems, which contain species, which contain individual organisms made up of various types of cells with several organelles, ultimately made up of a large number of macromolecules.

Biochemistry and Molecular Biology are concerned with the collection and interpretation of data about living components, under normal and abnormal situations, so that corrective measures can be taken if any living component behaves in an undesirable manner. The approach followed has been reductionist: a living component is isolated and then observed. With the advent of modern high-throughput technologies in biology, the quantity of data produced is so large that it is impossible to manage without automation using computing and robotic devices. Consequently, mathematics, computer science and information technology have increasingly been applied to collect, organize and interpret the large quantities of data acquired through automated high-throughput technologies, which are low cost and very fast.

High-throughput technologies are those able to acquire and analyse large quantities of data, such as nucleic acid sequences (DNA/RNA), protein expression in a cell or drug activities, at very high speed. They include next-generation sequencing for determining the sequences of complete chromosomes/genomes, an area called genomics; microarray technology to study RNA expression of a complete genome under a given set of conditions, an area called transcriptomics; 2D polyacrylamide gel electrophoresis to study expression of the complete set of proteins under a given set of conditions, an area called proteomics; screening of a large number of drug molecules, an area called pharmacogenomics; and large-scale modelling of biological information, an area called systems biology.

Bioinformatics has emerged as a result of the advancement of experimental technologies for collecting these vast amounts of data in biochemistry and molecular biology, and of the subsequent organization, interpretation and modelling of these data with mathematics using computers. Bioinformatics, therefore, fuses biology with mathematics and computer science. It follows a synthetic approach, i.e. it aims to assemble small molecules such as amino acids and nucleotides into proteins and nucleic acids; then macromolecules such as proteins, nucleic acids, polysaccharides and lipids into organelles and cellular structures such as ribosomes, cell membranes and chromosomes; and then these into cells and, finally, individuals.

This is a transformation from the basic reductionist approach of observing a single component to a synthetic approach of organizing individual components on a large scale to develop models capable of making predictions about living systems. In Biochemistry we are concerned with levels up to macromolecules, and this field is known as Molecular Bioinformatics. Here, we will be concerned with Molecular Bioinformatics, i.e. collections of molecules (genes, proteins, lipids, carbohydrates etc.) and their interactions.

3.3. Central Paradigm of Bioinformatics

We are well aware that biology at the molecular level is guided by the central dogma of molecular biology. The central dogma describes only the flow of information from DNA to RNA (transcription) to protein (translation), and never back from protein to RNA or from protein to DNA. However, information from RNA may be converted back to DNA, as in reverse transcription. The central dogma simply conveys that sequence information determines the phenotype of a living system. It cannot answer questions such as when in the life cycle, in which tissue, or why and how a gene/protein is expressed in an individual.

Therefore, the central dogma of molecular biology dictates the direction of synthesis of biological macromolecules, i.e. transcription of DNA to RNA and then translation of RNA to proteins. It forbids the synthesis of nucleic acids from proteins. This flow of information then decides the phenotype manifested by a given cell. However, the central dogma does not specify the way in which the phenotype of a cell is achieved.
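
The direction of information flow described above can be sketched in a few lines of code. The example below transcribes a short, made-up coding-strand DNA sequence into mRNA and translates it into a peptide. The codon table is deliberately partial (only the codons needed here, taken from the standard genetic code), and the function names are illustrative.

```python
# Partial codon table: only the codons used in this example (standard genetic code).
CODON_TABLE = {
    "AUG": "M",   # methionine, also the start codon
    "UUU": "F",   # phenylalanine
    "GGU": "G",   # glycine
    "AAA": "K",   # lysine
    "UAA": "*", "UAG": "*", "UGA": "*",   # stop codons
}


def transcribe(coding_strand_dna: str) -> str:
    """The mRNA carries the coding-strand sequence with U in place of T."""
    return coding_strand_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Read the mRNA codon by codon until a stop codon is reached."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


dna = "ATGTTTGGTAAATAA"      # hypothetical sequence, for illustration only
mrna = transcribe(dna)
print(mrna)                  # AUGUUUGGUAAAUAA
print(translate(mrna))       # MFGK
```

Note that the table maps three different stop codons to the same symbol. Because several codons can give the same result, a protein sequence does not determine a unique nucleotide sequence, which is in keeping with the one-way flow from nucleic acid to protein.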

On the other hand, the central paradigm of bioinformatics considers that the sequence information in DNA, RNA and proteins represents the same knowledge and determines the molecular structure of the actual transforming machines, i.e. proteins and RNAs, which interconvert metabolites and therefore control the phenotype of a biological system. The central paradigm thus specifies that sequence information (DNA/RNA/protein) determines protein structure, which in turn gives a specific biochemical function, which finally expresses the cellular phenotype through the interaction of metabolic, signalling and gene expression pathways. The systems/cells that survive then reproduce and pass on their genetic sequence information to next-generation cells, which repeat this life cycle.

Therefore, Bioinformatics has four sequential stages. The first is the determination of genetic sequence information, encompassing genes and proteins. The next is the transformation of this sequence information into three-dimensional structural information about proteins. The classical experiments of Christian Anfinsen on the denaturation and renaturation of RNase, which he summarized in 1973, established that the information for folding a protein into its three-dimensional structure is encoded within the amino acid sequence of the protein itself. Therefore, once a protein is synthesized, it folds into the three-dimensional structure specified by its sequence. Once folded into a defined three-dimensional structure, this protein and other such structures are the actual workers of biological processes, bringing about biochemical transformations organized into pathways (the third stage). These pathways then interact with each other to express the phenotype of the cell (the fourth stage). Those phenotypes which survive this life cycle are able to reproduce the next generation.

Bioinformatics encompasses computational molecular biology tools, i.e. mathematical and computational analysis for sequences and structures of biological macromolecules, in large scale genome research, targeting investigation of the complete set of chromosomes & genes of an organism and their RNA as well as protein products. On the basis of the central paradigm, bioinformatics has two key areas i.e. structural Bioinformatics and functional Bioinformatics.

Structural Bioinformatics refers to gene mapping and the determination of DNA/RNA/protein sequences and structures on a large scale. Genetic linkage map databases store distances between genes and other markers, based on meiotic recombination frequencies. Chromosome physical maps represent distances between these genetic markers in terms of the number of nucleotides between them. Sequence maps store the actual nucleotide and amino acid sequences of the genes and proteins. At the protein level, structural Bioinformatics also elucidates the representative protein folds.
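
To make the notion of a genetic (linkage) map distance concrete, the sketch below converts a recombination frequency into a map distance in centimorgans. The offspring counts are hypothetical, and the use of the Haldane mapping function (which assumes no crossover interference) is an illustrative choice rather than something prescribed by the text.

```python
import math


def recombination_frequency(recombinants: int, total_offspring: int) -> float:
    """Fraction of offspring showing recombination between two markers."""
    return recombinants / total_offspring


def haldane_map_distance_cm(r: float) -> float:
    """Haldane mapping function: genetic distance in centimorgans (cM),
    assuming crossovers occur independently (no interference)."""
    if not 0.0 <= r < 0.5:
        raise ValueError("recombination frequency must be in [0, 0.5)")
    return -50.0 * math.log(1.0 - 2.0 * r)


# Hypothetical cross: 40 recombinant offspring out of 400.
r = recombination_frequency(recombinants=40, total_offspring=400)
print(f"recombination frequency: {r:.2%}")                          # 10.00%
print(f"Haldane distance: {haldane_map_distance_cm(r):.2f} cM")     # 11.16 cM
```

For closely linked markers the map distance is close to the recombination percentage itself; the correction grows as markers lie further apart, because multiple crossovers between them go undetected. A physical map, by contrast, would report the same interval as a number of nucleotides.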

Functional Bioinformatics is concerned with annotating existing structural knowledge about DNA sequences, i.e. genes and their diffusible products, with functional knowledge. Consequently, there is a paradigm shift from static structural Bioinformatics to dynamic functional Bioinformatics. It has two successive levels: the study of expression of complete genomes, i.e. transcriptomics and proteomics, and the modelling of complete biological systems, i.e. systems biology. Transcriptomics is concerned with RNA-level measurements of gene expression. The aim is to measure transcription quantitatively under normal and experimental conditions and to identify differential gene expression between them, in order to discover novel targets for drug discovery. High-throughput analysis of differential gene expression is carried out using technologies such as DNA microarrays, serial analysis of gene expression (SAGE) and expressed sequence tags (ESTs).
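
Differential gene expression is commonly summarized as a log2 fold change between two conditions. The sketch below is a minimal illustration of that idea only: the gene names and expression values are made up, and the pseudocount and the two-fold cutoff are assumptions chosen for the example, not a recommended analysis procedure.

```python
import math


def log2_fold_change(control: float, treated: float, pseudocount: float = 1.0) -> float:
    """log2 ratio of treated to control expression.
    The pseudocount avoids division by zero for unexpressed genes."""
    return math.log2((treated + pseudocount) / (control + pseudocount))


# Hypothetical expression values (arbitrary units): gene -> (control, treated).
expression = {
    "geneA": (99.0, 399.0),
    "geneB": (50.0, 50.0),
    "geneC": (255.0, 63.0),
}

changes = {gene: log2_fold_change(c, t) for gene, (c, t) in expression.items()}
# Flag genes whose expression changes at least two-fold in either direction.
differential = [gene for gene, lfc in changes.items() if abs(lfc) >= 1.0]

print(changes)        # {'geneA': 2.0, 'geneB': 0.0, 'geneC': -2.0}
print(differential)   # ['geneA', 'geneC']
```

A positive value indicates higher expression under the experimental condition and a negative value lower expression. In practice, replicate measurements and statistical tests are needed before a gene is called differentially expressed.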

Similarly, proteomics is concerned with protein-level measurements of gene expression. The aim is to measure translation quantitatively under normal and experimental conditions and to identify differential protein expression between them, in order to discover novel protein targets for drug discovery. Proteomic analysis is carried out using two-dimensional polyacrylamide gel electrophoresis to resolve complete proteomes on the gel, followed by protein quantitation using scanning and imaging techniques. Finally, proteins are identified by mass spectrometry after the spot is picked from the gel. In this way, one is able to examine the complete proteome for differential activities and post-translational modifications. In future, specific protein-protein interactions are expected to be studied using protein microarrays, similar to DNA microarrays.

3.4. The goal, aim and scope of Bioinformatics

Bioinformatics gained significant importance with the Human Genome Project, which set out to create detailed genetic and physical maps and the DNA sequence of each of the 24 different human chromosomes. It encompasses the development of automated technologies for the generation, organization, mining and modelling of biological data, with the goal of revealing new insights, theories and principles about living things so as to enable predictions about them. Most of the organized activities of bioinformatics have focused on the collection, organization, interpretation and modelling of these data on a large scale, giving rise to sub-areas such as genomics, transcriptomics, proteomics and metabolomics. The aim of all this is better nutrition, health and environment for living machines, through predictions from modelled systems.

If we want to take a snapshot from the history of life, which extends back some 3.5 billion years, there is no experimental way to do so other than to obtain some information from the DNA of fossils. With bioinformatics, on the other hand, we can expect to model functional systems of that age. Similarly, we can expect to predict the future evolution of a deadly virus and therefore design strategies in advance to combat it. Imagine a future scenario in which a new virus creates an epidemic of some fatal disease. Laboratory biologists will provide the genetic material for automated sequencing, and computer programs will then take over to design an effective drug against the virus. Comparative genomics of the new virus with previously studied viral genomes will suggest some targets for antiviral drugs. Three-dimensional structure prediction of the target will provide the actual site of action for the drug. High-throughput virtual screening may provide leads for faster drug development, which may finally be tested on the bench for efficacy. Therefore, the scope of bioinformatics is to model living systems so as to predict their behaviour under perturbed conditions.

3.5. The emergence of the discipline of Bioinformatics

Bioinformatics is a young but fast-growing field for biological data collection, organization, interpretation and modelling. Its tools and techniques are derived from multidisciplinary combinations of varied disciplines from the natural and physical sciences. Previously, new disciplines were carved out as and when sufficient specialization was achieved. Bioinformatics, however, is born out of an alliance between existing disciplines from the life and non-life sciences.

Figure 5. Bioinformatics: Marriage of life and non-life

4. Summary

In this module, students learnt about:

the emerging discipline of Bioinformatics in relation to Biotechnology, Biochemistry and Molecular Biology
a classical example revealing the power of sequence alignment for searching similar sequences
the fusion of Biotechnology, Biochemistry and Molecular Biology with Mathematics and Information Technology to give Bioinformatics
the central paradigm of Life with key areas of Bioinformatics
the aim, goal and scope of Bioinformatics

