4 Molecular Sequence Databases

Dr. S. Sundar

1. Objectives

In the present module, the students will learn about

Encoding linear sequences of nucleic acids (DNA/RNA) and proteins using single letter codes
Creating sequence files in Notepad in the different sequence data formats used by different programs
International public domain sequence archives and databases
Retrieval systems used by different sequence databases
Browsing genomes for understanding the gene arrangement along chromosomes
Converting one sequence format into another for use in other sequence analysis programs

2. Concept Map

3. Molecular Sequence Databases

Molecular sequence data are the known linear sequences of deoxyribonucleic acid (DNA), ribonucleic acid (RNA) and proteins. Functional information can be derived from sequence data. Sequence records also carry additional useful information about these molecules, known as annotations. Each database stores the sequences and their annotations in its own specific format and provides a retrieval system for accessing them. In this module we will use the major online sequence gateways to retrieve and browse sequence data, and we will convert between various sequence formats.

3.1. Sequence Data Encoding

Bioinformatics tools enable biochemists to derive useful information from nucleotide or protein sequence data. Sequence data are therefore an important resource for understanding the biochemical function of genes and proteins. Each residue in a linear sequence is represented by a single letter code, and the sequence databases store nucleotide and protein sequences as linear strings of these codes.

Recommended single letter codes for residues in nucleic acids (DNA and RNA) are shown next

Recommended single letter codes for residues in proteins i.e. amino acids are shown next
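The standard IUPAC single letter codes can also be written down as lookup tables. The following Python sketch lists the nucleotide codes (including the ambiguity codes) and the 20 standard amino acids with their three letter abbreviations. The helper name `invalid_letters` is an assumption made for this illustration; it simply reports letters that do not belong to the chosen alphabet.

```python
# IUPAC single letter codes for nucleotides, including ambiguity codes.
NUCLEOTIDE_CODES = {
    "A": "adenine", "C": "cytosine", "G": "guanine",
    "T": "thymine (DNA)", "U": "uracil (RNA)",
    "R": "A or G (purine)", "Y": "C or T (pyrimidine)",
    "S": "G or C", "W": "A or T", "K": "G or T", "M": "A or C",
    "B": "C, G or T", "D": "A, G or T", "H": "A, C or T", "V": "A, C or G",
    "N": "any base",
}

# Single letter codes for the 20 standard amino acids.
AMINO_ACID_CODES = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}


def invalid_letters(sequence, alphabet):
    """Return the sorted letters of `sequence` that are not in `alphabet`."""
    return sorted(set(sequence.upper()) - set(alphabet))


print(invalid_letters("ATGCCGTA", NUCLEOTIDE_CODES))
print(invalid_letters("MEKKEFHIVA", AMINO_ACID_CODES))
```

An empty list means every letter of the sequence belongs to the alphabet. Extended protein codes such as U (selenocysteine) or X (unknown residue) are deliberately left out of this sketch.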

3.2. Formats for handling sequence data

Bioinformatics software packages and online tools read sequence data only in the standard formats they recognise. This is similar to a document saved in MS Word format, which is meant to be opened with MS Word and is not supported by Adobe Reader; likewise, MS Word is not the intended program for files in Adobe PDF format. A given program opens only the formats it supports. Note, however, that text saved in an MS Word file is not in any of the sequence formats supported by bioinformatics tools. Several specific sequence formats are available for saving and storing sequences.

To save a sequence in a file, we provide two values in the ‘Save As’ dialog box. The first is ‘File name’, which gives the primary name of the file; the second is ‘Save as type’, which gives the extension. The two are joined automatically by a dot (a full stop or period). For example, if we type the sequence of a nucleic acid or protein as plain text in Notepad (available with the Windows operating system), select ‘Save As’ from the ‘File’ menu, enter ‘mySequence’ as the file name and keep the default type ‘Text Documents (*.txt)’, the sequence is saved as ‘mySequence.txt’.

A file that contains only the bare sequence, such as ‘mySequence.txt’, is not in a standard sequence format. Programs that expect a standard format will generally not recognise its contents, even if the file can be opened with them, so it cannot be relied on as input for sequence analysis software. Some online sequence analysis programs do, however, allow a plain text sequence to be pasted into the input text box.

To understand what a sequence format is, let us look at the most commonly used standard format, the ‘FASTA’ format. A sequence in FASTA format can be written with Notepad or any other text editor. There are two steps: first enter the sequence and its related information in Notepad, then save the file with a FASTA extension such as ‘.fa’, so that it can be read by software packages that expect FASTA input. The information itself is also entered in two steps. The first line, known as the ‘comment line’, starts with the ‘greater than’ sign (‘>’) followed by an identifying name or comment for the sequence. For a sequence named ‘mySequence’, we enter the following in Notepad:

>mySequence

We can continue entering other information, such as annotation features, on this comment line. Keep typing words or text on the same line without pressing the ‘Enter’ key, as shown next:

The text continues on the same line, so the beginning of the line scrolls out of view. To see the whole line in one window, select the ‘Word Wrap’ command from the ‘Format’ menu, as shown:

The complete entry is now displayed as one paragraph wrapped over multiple rows, three rows in the present case.

The comment line is therefore a single line (one paragraph), even though it may occupy several rows on the screen. Here it contains three pieces of annotation separated by the delimiter character ‘\’: the name of the sequence, the source from which it was isolated, and the technique used to sequence the protein. After entering this information, press the ‘Enter’ key to move the cursor to the next line. On this next line (equivalent to the next paragraph), enter the sequence of the protein or nucleic acid, as shown below for the protein sequence “THISISTHESEQUENCEOFMYPROTEIN”:

Then save the file: open the ‘Save As’ dialog box, enter the file name ‘mySequence.fa’, select ‘All Files’ under ‘Save as type’ and click the ‘Save’ button, as shown below:
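The Notepad steps above can also be reproduced with a short script. The Python sketch below writes a one-record FASTA file named ‘mySequence.fa’: a comment line starting with ‘>’ followed by the sequence. The function name `write_fasta`, the two placeholder annotations and the choice of the system temporary folder are assumptions made for this illustration, not part of the module.

```python
import tempfile
from pathlib import Path


def write_fasta(path, header, sequence, width=60):
    """Write one FASTA record: a '>' comment line, then the sequence in rows."""
    rows = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    text = ">" + header + "\n" + "\n".join(rows) + "\n"
    Path(path).write_text(text, encoding="ascii")


# Placeholder annotations (assumptions), joined with the '\' delimiter
# used in the comment line described in the text.
header = "\\".join(["mySequence", "example source", "example technique"])

# The file is written to the temporary folder so nothing is overwritten.
out_file = Path(tempfile.gettempdir()) / "mySequence.fa"
write_fasta(out_file, header, "THISISTHESEQUENCEOFMYPROTEIN")
print(out_file.read_text(encoding="ascii"))
```

Because the file is plain text, the extension ‘.fa’ only tells the reader and the software what to expect; the FASTA layout of the contents is what makes it usable.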

To open the saved file ‘mySequence.fa’, select ‘All Files’ in the ‘Open’ dialog box, as shown by the arrow below, and click the ‘Open’ button:

What matters in the ‘Open’ dialog box is the file type dropdown list. Always select ‘All Files’ from this list if a FASTA option is not offered. This will open the saved file as shown below:

A FASTA file is not limited to a single sequence; any number of sequences can be stored in one file. Press the ‘Enter’ key after the first sequence to move to the next line, add a new comment line starting with ‘>’, press ‘Enter’ again, and type the second sequence without pressing ‘Enter’ within it, as shown below for the second sequence:

In this way one can concatenate as many sequences in one FASTA file as one wants to analyse.

This is useful for pairwise and multiple sequence alignment as well as phylogenetic analysis.
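To show how a program sees such a multi-sequence file, here is a minimal Python sketch of a FASTA reader. It treats every line starting with ‘>’ as the start of a new record and joins the following lines into one sequence. The function name `parse_fasta` and the second example record are hypothetical.

```python
def parse_fasta(text):
    """Return a list of (comment_line, sequence) pairs from FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines between records
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # a sequence may span several rows
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>mySequence
THISISTHESEQUENCEOFMYPROTEIN
>mySecondSequence
THISISTHESECOND
SEQUENCE
"""

for name, sequence in parse_fasta(example):
    print(name, len(sequence))
```

The sketch accepts sequences broken over several rows, since many databases wrap long sequences in that way.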

3.3. Molecular Sequence Archives

The International Nucleotide Sequence Database Collaboration (INSDC) is the main archive of nucleotide sequences. It has three collaborators: GenBank (http://www.ncbi.nlm.nih.gov/genbank/) at NCBI, the DNA Data Bank of Japan (DDBJ) and the European Molecular Biology Laboratory (EMBL). These three organizations exchange data on a daily basis.

NCBI (http://www.ncbi.nlm.nih.gov/) integrates the nucleotide sequence database GenBank with other gene information databases so that they can be searched together.

The GenBank sequence record format can be seen at http://www.ncbi.nlm.nih.gov/genbank/samplerecord/.

The NCBI nucleotide sequence gateway can also be reached directly at www.ncbi.nlm.nih.gov/nuccore/.

Similarly, the Universal Protein Resource (UniProt), with its knowledgebase UniProtKB, is a comprehensive archive of protein sequences. In addition, an independent protein sequence gateway at NCBI can be reached directly at http://www.ncbi.nlm.nih.gov/protein.

The ExPASy (Expert Protein Analysis System) server (www.expasy.org) at SIB integrates the UniProtKB database with other protein information databases so that they can be searched together.

In addition to the sequence gateways, each of which provides a separate access and retrieval system for nucleotide or protein sequences, there are integrated genome browsers for individual organisms, where both gene and protein sequences are available together with additional annotation. Both the sequence retrieval systems (nucleotide or protein sequences with annotations) and the integrated genome browsers (nucleotide and protein sequences with annotations) are discussed next.

3.3.1. Retrieval Systems

Each sequence archive has its own retrieval system. A partial list follows:

Entrez (pronounced ‘Aahntray’) at NCBI
Expert Protein Analysis System (ExPASy) at SIB
SRS at EMBL
DBGET at DDBJ

3.3.1.1. Entrez is NCBI’s primary text search and retrieval system (gateway); Entrez help can be reached at http://www.ncbi.nlm.nih.gov/books/NBK3837/. In the present example we will retrieve and download the nucleotide and protein sequences for “Hpr from Enterococcus faecalis”, a gene encoding an 88 amino acid phosphocarrier protein. For this search we have two key pieces of information: the organism, “Enterococcus faecalis”, and the name, “Hpr”.

Visit NCBI at http://www.ncbi.nlm.nih.gov/, select ‘Nucleotide’ in the dropdown list of databases on the left, enter “Hpr from Enterococcus faecalis” in the text box and click ‘Search’.

The search returns 20433 results, which is a long list to browse. Therefore use the ‘Advanced’ search feature available below the search text box:

In the builder section of the ensuing page, select the fields to search and the values to be matched, as shown next:

That is, select ‘Title’ and enter Hpr, then select ‘Organism’ and enter Enterococcus faecalis, and click the ‘Search’ button. The ensuing results page shows only one record, in GenBank format.
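The builder combines the chosen fields into a single fielded query. The Python sketch below assembles the same query text and, as an assumption that goes beyond this module, places it in a URL for NCBI’s E-utilities `esearch` service, which is the programmatic counterpart of the Entrez search box. The helper name `fielded_query` is hypothetical, and the script only builds the URL; it does not contact NCBI.

```python
from urllib.parse import urlencode

# Base address of the E-utilities search service (an addition to the text).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def fielded_query(conditions):
    """Join (value, field) pairs into an Entrez query such as 'Hpr[Title] AND ...'."""
    return " AND ".join(f"{value}[{field}]" for value, field in conditions)


term = fielded_query([("Hpr", "Title"), ("Enterococcus faecalis", "Organism")])

# 'nuccore' is the database name that appears in the nucleotide gateway address.
url = ESEARCH + "?" + urlencode({"db": "nuccore", "term": term})

print(term)
print(url)
```

Restricting each word to a field is what reduces the thousands of hits of the free text search to a single record.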

The GenBank format has three sections. The first, shown above, is the HEADER section with general information about the locus, source organism, literature references etc.

The second is the FEATURES section, with gene and coding sequence (CDS) information and external database (db_xref) links: CAA79533.1 for the NCBI protein database and P07515 for the UniProtKB/SwissProt protein database, as highlighted next:

One can click on these links to reach the protein sequences. The third and final section is the sequence section, as shown next:
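The three-part layout described above can be illustrated with a short script. In a GenBank flat file the FEATURES keyword opens the features section, the ORIGIN keyword opens the sequence section, and ‘//’ ends the record. The Python sketch below splits a record on these keywords. The record shown is a made-up placeholder, not the real Hpr entry; only the two cross-reference values are taken from the text, and the function name `split_genbank` is hypothetical.

```python
import re


def split_genbank(record_text):
    """Split one GenBank flat-file record into header, features and sequence lines."""
    sections = {"header": [], "features": [], "sequence": []}
    current = "header"
    for line in record_text.splitlines():
        if line.startswith("//"):  # end-of-record marker
            break
        if line.startswith("FEATURES"):
            current = "features"
        elif line.startswith("ORIGIN"):
            current = "sequence"
        sections[current].append(line)
    return sections


# Placeholder record for illustration only (not a real database entry).
record = """LOCUS       EXAMPLE                   12 bp    DNA     linear   BCT
DEFINITION  Placeholder record for illustration only.
FEATURES             Location/Qualifiers
     CDS             1..12
                     /protein_id="CAA79533.1"
                     /db_xref="UniProtKB/Swiss-Prot:P07515"
ORIGIN
        1 acgtacgtac gt
//
"""

parts = split_genbank(record)
xrefs = re.findall(r'/db_xref="([^"]+)"', "\n".join(parts["features"]))
# Keep only the letters of the sequence rows, skipping the ORIGIN line itself.
sequence = "".join(ch for line in parts["sequence"][1:] for ch in line if ch.isalpha())

print(len(parts["header"]), "header lines")
print(xrefs)
print(sequence)
```

The cross-references collected in `xrefs` correspond to the links one clicks in the browser view of the record.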

Selecting the desired format, ‘FASTA’, will display the following:

Click the ‘Create File’ button and, in the ‘Save As’ dialog box, enter a full file name (such as mysequence.FA) and select ‘All Files’ in the ‘Save as type’ dropdown list. If any other format had been selected, say GenBank, we would likewise have entered a full name (such as GenBankHprProteinSequence.gbk) and selected ‘All Files’ in the ‘Save as type’ dropdown list before clicking ‘Save’ in the ‘Save As’ dialog box.

Now click on ‘Graphics’ to change the display. In the window that appears, click the ‘Tools’ button to expand the list, as shown below:

This page provides tools for BLAST and primer search as well as for downloading the sequence.

Clicking the external database (db_xref) link CAA79533.1 for the NCBI protein database in the features section, as highlighted above, takes you to the protein sequence entry at NCBI. The features section of this record lists important sites by residue number, as shown next:

Clicking on the external database (db_xref) link will open the conserved protein domain family entry in the CDD database at NCBI, as shown next:

CDD (Conserved Domain Database) is a protein annotation resource consisting of conserved domains in protein sequences; it explicitly defines domain boundaries and provides insights into the relationships between sequence, structure and function.

Clicking the external database (db_xref) link P07515 for UniProtKB/SwissProt in the features section, as highlighted above, takes you to the protein sequence entry in the UniProtKB protein database. The features section of this record lists important sites by residue number, as shown next:

The most important element is the ‘Display’ menu, from which one can jump to any feature with a single click. The features include function, names and taxonomy, subcellular location, post-translational modifications and processing, interactions with other proteins, 3D structures, conserved families and domains, sequence and external links to other sequence databases, and publications and literature information.

3.3.1.2. ExPASy (Expert Protein Analysis System) is the gateway for the protein sequence information available at UniProtKB. Before 2002, PIR produced the Protein Sequence Database (PIR-PSD), SIB produced the manually curated SwissProt, and EMBL produced TrEMBL, a database of computationally translated coding sequences awaiting manual annotation for inclusion in SwissProt. In 2002 the three institutes pooled their resources to produce UniProtKB, which has two components. UniProtKB/SwissProt is the manually annotated component; it contains manually reviewed and annotated proteins, with information extracted from the literature and from curator-evaluated computational analysis. UniProtKB/TrEMBL, on the other hand, contains computationally analysed proteins that have not yet been manually reviewed; once they are reviewed and annotated with information extracted from the literature, they are transferred into the UniProtKB/SwissProt component.

Now let us download the “Hpr from Enterococcus faecalis” protein from the UniProtKB database gateway, www.uniprot.org.

Click on ‘Reviewed (5)’, as shown by the arrow above, to display only SwissProt sequences, as shown next:

To download the sequence in FASTA format, adjust the settings in the ‘Download’ tab as shown next and click ‘Go’.

The FASTA sequence retrieved in the browser window is displayed below:

3.3.2. Genome Browsers

Since in the present case we are specifically interested in Enterococcus faecalis, we will try to get its nucleic acid and protein sequences, together with the associated information, using a genome browser. Search for ‘Enterococcus faecalis genome browser’ on Google. The results will look like this:

Click on the first link to reach the Enterococcus faecalis genome browser page. This is a bacterial genome browser, where we can browse the complete genomes of various bacteria and archaea, and we can switch to other organisms.

However, without changing the group or genome, enter “Phosphocarrier protein Hpr” in the search text box and press the ‘Enter’ key. You will reach the gene EF0709, encoding “Phosphocarrier protein Hpr”, displayed in the genome browser window. Move the mouse over the gene number displayed on the left and then over the corresponding gene displayed next to it; the display is shown below.

Click on ‘predicted protein’ and your browser will show the following protein sequence in FASTA format. Copy the complete FASTA sequence and save it as ‘EfaecalisHpr.FA’ using Notepad.

>EF0709 length=88
MEKKEFHIVAETGIHARPATLLVQTASKFNSDINLEYKGKSVNLKSIMGV
MSLGVGQGSDVTITVDGADEAEGMAAIVETLQKEGLAE
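A quick check after copying a sequence from a browser is to confirm that nothing was lost. The Python sketch below joins the sequence rows of the record above, compares the residue count with the ‘length=’ value in the comment line, and confirms that only standard amino acid letters are present. The function name `check_record` is hypothetical.

```python
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")

fasta_text = """>EF0709 length=88
MEKKEFHIVAETGIHARPATLLVQTASKFNSDINLEYKGKSVNLKSIMGV
MSLGVGQGSDVTITVDGADEAEGMAAIVETLQKEGLAE
"""


def check_record(text):
    """Return (identifier, declared_length, actual_length, only_standard_letters)."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    header, sequence = lines[0][1:], "".join(lines[1:])
    identifier = header.split()[0]
    declared = int(header.split("length=")[1].split()[0])
    return identifier, declared, len(sequence), set(sequence) <= STANDARD_AMINO_ACIDS


print(check_record(fasta_text))
```

If the declared and actual lengths differ, part of the sequence was probably missed while copying.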

3.4. Interconverting sequence formats

Sequence formats were designed by specific database developers, groups and companies to hold sequence data and related information for use in their own programs and software packages. There are many sequence analysis packages and online tools, and a given package or tool supports only certain recognised standard formats. Of the many sequence formats in existence, a few internationally recognised standard formats are much more common than the others. Almost every sequence database, such as GenBank, EMBL, SwissProt or PIR, stores its data in its own format but also allows sequence data to be downloaded in additional formats. If the desired format is not offered, we can download the sequence data in the database’s own format and convert it to the format required by the sequence analysis package we want to use.
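To make the idea of format conversion concrete, here is a minimal Python sketch that rewrites a single FASTA record in the NBRF/PIR layout: a first line of the form ‘>P1;identifier’, a description line, and the sequence terminated by ‘*’. It is a hand-written illustration, not the method used by any particular converter, and it handles only one protein record; the function name `fasta_to_pir` and the example annotation are hypothetical.

```python
def fasta_to_pir(fasta_text, type_code="P1", width=60):
    """Convert one FASTA record to NBRF/PIR text (P1 marks a complete protein)."""
    lines = [line.strip() for line in fasta_text.strip().splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("input does not look like a single FASTA record")
    # The first word of the comment line becomes the identifier,
    # and the remainder becomes the description line.
    identifier, _, description = lines[0][1:].partition(" ")
    sequence = "".join(lines[1:]) + "*"  # PIR sequences end with an asterisk
    rows = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([f">{type_code};{identifier}", description] + rows) + "\n"


example = ">mySequence example annotation\nTHISISTHESEQUENCEOFMYPROTEIN\n"
print(fasta_to_pir(example))
```

The residues themselves are unchanged; only the lines around them differ, which is why the same sequence can be moved freely between formats. Richer formats such as GenBank or SwissProt carry many more annotation fields, so dedicated conversion tools are the safer choice for them.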

To convert a sequence from one format to another, go to the Sequence Format Converters at http://www.ebi.ac.uk/Tools/sfc/.

Now choose ‘Launch EMBOSS Seqret’ and follow the three steps in the browser window that appears. First, upload the previously saved GenBank format file “GenBankHprProteinSequence.gbk”, choose to convert it to the SwissProt entry format (swissnew) and click the ‘Submit’ button.

The resulting window will display the “histidine containing phosphocarrier protein Hpr from Enterococcus faecalis” sequence in the converted (SwissProt) format, which can be downloaded and saved.

This site also provides the ReadSeq program, which converts sequences between several input and output formats. In addition, it provides MView, a web interface that transforms a sequence similarity search result into a multiple sequence alignment or reformats a multiple sequence alignment.

Another implementation of EMBOSS Seqret is available at http://genome.nci.nih.gov/tools/reformat.html. Paste the FASTA sequence in the text box, select the input and output sequence formats from the dropdown lists and click the ‘Submit Request’ button.

The result will appear in the browser window, showing the “histidine containing phosphocarrier protein Hpr from Enterococcus faecalis” sequence in SwissProt format:

4. Summary

In this lecture we learnt about:

Encoding linear sequences of nucleic acids (DNA/RNA) and proteins using single letter codes
Creating sequence files in Notepad in the different sequence data formats used by different programs
International public domain sequence archives and databases
Retrieval systems used by different sequence databases
Browsing genomes for understanding the gene arrangement along chromosomes
Converting one sequence format into another for use in other sequence analysis programs

Reference Books

Introduction to Bioinformatics by Arthur M. Lesk, Oxford University Press, ISBN 978-0-521-70610-0, http://www.gettextbooks.com/search/?sa=4&isbn=Essential+Bioinformatics+Jin+Xiong, Chapter 3: Archives and Information Retrieval, pages 117-152