3 Molecular Database Concepts

1. Objectives

In this module, the students will understand

nature of molecular data
molecular database concepts
data retrieval
searching database for information retrieval
basic organization of molecular data for knowledge discovery

2. Concept Map

3. Molecular databases

Molecular data are known values about molecules or about their interactions with other molecules, as in pathways, i.e. networks of metabolites linked through enzymes. The molecules involved include deoxyribonucleic acid (DNA), ribonucleic acid (RNA), proteins, carbohydrates, lipids and vitamins. We therefore have genes, proteins, hormones, enzymes and so on, each with its associated data values.

3.1. Nature of molecular data and retrieval

All molecules have associated data values. These may be textual (qualitative), such as the accepted name of an enzyme, for example 6-phosphofructokinase (PFK), or numeric (quantitative), such as its assigned Enzyme Commission (EC) number, 2.7.1.11. This number is assigned under the systematic IUBMB Enzyme Nomenclature: 2 stands for ‘transferases’, 7 for the subclass ‘transferring phosphorus-containing groups’, 1 for ‘Phosphotransferases with an alcohol group as acceptor’, and 11 shows that this is the 11th enzyme in the list of 186 known ‘Phosphotransferases with an alcohol group as acceptor’. In addition, alternative name(s) may be available; in this case, Phosphofructokinase I and Phosphohexokinase.
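The four-level structure of an EC number can be sketched in a few lines of Python. This is only an illustration: the names ECNumber and parse_ec are hypothetical and do not belong to any database or library.

```python
from typing import NamedTuple


class ECNumber(NamedTuple):
    """The four levels of an EC number (names are illustrative)."""
    enzyme_class: int   # first number, e.g. 2 for transferases
    subclass: int       # second number
    sub_subclass: int   # third number
    serial: int         # fourth number: position in the sub-subclass list


def parse_ec(text: str) -> ECNumber:
    """Split a string such as '2.7.1.11' into its four integer levels."""
    parts = text.strip().split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated numbers, got {text!r}")
    return ECNumber(*(int(part) for part in parts))


pfk = parse_ec("2.7.1.11")    # 6-phosphofructokinase
pfk2 = parse_ec("2.7.1.105")  # 6-phosphofructo-2-kinase

# Both enzymes share the first three levels (2.7.1) and differ only in
# the serial number.
same_group = pfk[:3] == pfk2[:3]
print(pfk, same_group)
```

Comparing only the first three levels is a simple way to ask whether two enzymes belong to the same sub-subclass.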

The basic data about this enzyme is the reaction it catalyses:

ATP + D-fructose 6-phosphate = ADP + D-fructose 1,6-bisphosphate

In addition, associated information may be included as comments. This additional information is called ‘annotation’. In this case, the annotation states that D-tagatose 6-phosphate and sedoheptulose 7-phosphate can act as acceptors; that UTP, CTP and ITP can act as donors; and that this enzyme is not identical with EC 2.7.1.105. EC 2.7.1.105 is 6-phosphofructo-2-kinase (PFK2), with the alternative name Phosphofructokinase 2.

All of the above values are data about the molecule, which in this case is an enzyme. The enzyme entry is not specific to a particular species, breed, variety or strain: EC number 2.7.1.11 specifies the reaction catalysed, not the organism. Whenever EC 2.7.1 is mentioned, it means an enzyme that catalyses the ‘transfer of phosphate to an alcohol group’, and the number 11 specifies that it is the 11th enzyme of this type, with specific substrates and products, i.e. ATP + D-fructose 6-phosphate = ADP + D-fructose 1,6-bisphosphate. Therefore, wherever this reaction is catalysed, the enzyme involved is 6-phosphofructokinase, EC 2.7.1.11. The reaction may occur in E. coli, yeast or even human. The reaction is common to all these organisms, but the amino acid sequence of the enzyme catalysing it may or may not be the same in every organism.

Like PFK, every other enzyme has an accepted name, an assigned EC number, alternative name(s), a reaction catalysed and annotation, but the values of these differ from enzyme to enzyme. Placed below is the data for four enzymes from glycolysis:

Each data value is stored in a field with a unique label. In the present case we have five fields, labelled Assigned EC number, Accepted name, Alternative Name(s), Reaction catalysed and Comment (annotation). The data values in all the fields for one enzyme form one ‘record’ for that enzyme. In the present case we have four records. All the records in the list collectively form a database. A database of enzymes is implemented as the ENZYME database, which can be reached at http://enzyme.expasy.org/
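The relationship between fields, records and a database can be sketched with ordinary Python data structures. This is a minimal illustration, not the storage format of the ENZYME database; the variable names are hypothetical, and the ‘=’ between substrates and products is an assumed notation for the reaction.

```python
# The five field labels used in this module.
FIELDS = (
    "Assigned EC number",
    "Accepted name",
    "Alternative Name(s)",
    "Reaction catalysed",
    "Comment (annotation)",
)

# One record: one data value per field, all describing the same enzyme.
pfk_record = {
    "Assigned EC number": "2.7.1.11",
    "Accepted name": "6-phosphofructokinase",
    "Alternative Name(s)": "Phosphofructokinase I. Phosphohexokinase.",
    "Reaction catalysed": (
        "ATP + D-fructose 6-phosphate = ADP + D-fructose 1,6-bisphosphate"
    ),
    "Comment (annotation)": (
        "D-tagatose 6-phosphate and sedoheptulose 7-phosphate can act as "
        "acceptors; UTP, CTP and ITP can act as donors; and not identical "
        "with EC 2.7.1.105."
    ),
}

# A database is a collection of such records (only one is shown here).
database = [pfk_record]

for record in database:
    for label in FIELDS:
        print(f"{label}: {record[label]}")
```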

Alternatively, if you leave the last text box empty, the enzymes in class 2.7.1 are presented as a complete list, as shown next (partial list):

One can click on any number to retrieve that record.

3.2. Searching databases

Each molecular database provides access to its records through accession numbers assigned by that particular database. In addition, molecular data can be retrieved through keyword searches. More than one keyword can be combined using Boolean principles. Using more than one keyword also helps to search the separate fields of the organised records of the database. There are two systems for searching a database: the Boolean search system and the concept-based search system.

3.2.1. Boolean search

The keywords in a Boolean search may be a single word or more than one word; each is called a search term. Three Boolean operators can be used to combine keywords. They are case sensitive, i.e. they must be typed in upper case.

AND operator: This finds records that contain the keywords/search terms on both sides of the AND operator, i.e. the intersection of both keywords in a Venn diagram. In most search systems, several keywords entered in the search text box are automatically joined by the AND operator, so in those cases we need not type AND ourselves. For example, entering Alcohol dehydrogenase in the search text box will be treated as Alcohol AND dehydrogenase, and this will retrieve all records containing both the Alcohol and dehydrogenase keywords.
OR operator: This finds records that contain either or both of the keywords/search terms joined by the OR operator, i.e. the union of both keywords in a Venn diagram. For example, entering Alcohol OR dehydrogenase in the search text box will retrieve all records containing either Alcohol or dehydrogenase, as well as those containing both.
NOT operator: This finds records that contain the keyword/search term on the left of the NOT operator but excludes records that contain the keyword/search term on the right, i.e. the subtraction of the right-hand keyword from the left-hand one in a Venn diagram. For example, entering Alcohol NOT dehydrogenase in the search text box will retrieve all records containing Alcohol but will discard those that also contain dehydrogenase.
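The three operators correspond directly to set intersection, union and difference. The sketch below applies them to a small toy collection; the four records and the helper ids_with are made up for illustration and do not reflect how any particular database implements its search.

```python
# Toy records keyed by a hypothetical accession number.
records = {
    1: "Alcohol dehydrogenase",
    2: "Alcohol oxidase",
    3: "Lactate dehydrogenase",
    4: "Hexokinase",
}


def ids_with(keyword):
    """Return the set of record ids whose text contains the keyword as a word."""
    wanted = keyword.lower()
    return {rid for rid, text in records.items() if wanted in text.lower().split()}


alcohol = ids_with("Alcohol")
dehydrogenase = ids_with("dehydrogenase")

and_hits = alcohol & dehydrogenase  # Alcohol AND dehydrogenase: intersection
or_hits = alcohol | dehydrogenase   # Alcohol OR dehydrogenase: union
not_hits = alcohol - dehydrogenase  # Alcohol NOT dehydrogenase: difference

print(sorted(and_hits), sorted(or_hits), sorted(not_hits))
```

With these toy records, AND keeps only record 1, OR keeps records 1 to 3, and NOT keeps only record 2.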

The keywords/search terms may even be phrases, i.e. words placed between a pair of double quotes. In contrast to the AND operator, the words in a phrase are joined contiguously, so a record matches only where the words occur next to each other in that order; the phrase behaves as a single keyword. For example, “Alcohol dehydrogenase” placed between double quotes behaves as a single keyword and retrieves the records where the two words are adjacent. It does not retrieve records where alcohol and dehydrogenase are present separately.
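The difference between an AND search and a phrase search can be sketched as follows. The two sample texts and the function names are hypothetical, and real search engines tokenise text in more elaborate ways than splitting on spaces.

```python
def tokens(text):
    """Lower-case the text and split it into words."""
    return text.lower().split()


def matches_all(text, query):
    """AND search: every query word occurs somewhere in the text."""
    words = set(tokens(text))
    return all(word in words for word in tokens(query))


def matches_phrase(text, phrase):
    """Phrase search: the query words occur next to each other, in order."""
    words, wanted = tokens(text), tokens(phrase)
    n = len(wanted)
    return any(words[i:i + n] == wanted for i in range(len(words) - n + 1))


adjacent = "Alcohol dehydrogenase from yeast"
separate = "A dehydrogenase that acts on a secondary alcohol"

print(matches_all(adjacent, "Alcohol dehydrogenase"))     # both words present
print(matches_all(separate, "Alcohol dehydrogenase"))     # both words present
print(matches_phrase(adjacent, "Alcohol dehydrogenase"))  # words are adjacent
print(matches_phrase(separate, "Alcohol dehydrogenase"))  # words are apart
```

Both texts satisfy the AND search, but only the first satisfies the phrase search.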

3.2.2. Concept Search

Boolean search is not guaranteed to find the relevant records in large unstructured databases, such as bibliographic databases, which contain long texts. In these cases, keyword searches often return results that include many non-relevant records, called false positives, or that miss many relevant records, called false negatives. These arise from synonymy and polysemy among keywords. Synonymy means that two or more words in the same language have the same meaning, so a search for one word misses records that use the other. Polysemy means that an individual word has more than one meaning, so a search retrieves records that use the word in an unintended sense. In addition, keyword-based searches require the keywords to be spelled exactly; if a keyword is misspelled, the intended records will not be retrieved. To overcome these problems, concept search techniques were developed, especially for large, unstructured digital text databases. Concept search aims to retrieve all records that match the concept conveyed by the words joined in a phrase, rather than the individual words themselves.
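Two of the failure modes described above, synonyms and misspellings, can be demonstrated with a small sketch. It is not an implementation of concept search, which relies on far richer techniques; it only shows exact keyword matching missing relevant titles, and a simple synonym table plus approximate string matching recovering them. The three titles are made up, the synonym table reuses the alternative names of PFK, and the function names are hypothetical.

```python
import difflib

# Made-up bibliographic titles (toy data).
TITLES = [
    "Regulation of 6-phosphofructokinase",
    "Phosphohexokinase in muscle",
    "Hexokinase kinetics",
]

# Accepted name mapped to its alternative names (synonyms).
NAMES = {"6-phosphofructokinase": ["phosphofructokinase i", "phosphohexokinase"]}


def exact_search(term):
    """Plain keyword search: the term must appear exactly as typed."""
    term = term.lower()
    return [title for title in TITLES if term in title.lower()]


def tolerant_search(term):
    """Keyword search with spelling tolerance and synonym expansion."""
    term = term.lower()
    vocabulary = list(NAMES) + [alt for alts in NAMES.values() for alt in alts]
    # Step 1: replace a misspelled term by the closest known name, if any.
    close = difflib.get_close_matches(term, vocabulary, n=1, cutoff=0.8)
    if close:
        term = close[0]
    # Step 2: expand the term to all names of the same enzyme.
    group = {term}
    for accepted, alts in NAMES.items():
        if term == accepted or term in alts:
            group = {accepted, *alts}
    return [t for t in TITLES if any(name in t.lower() for name in group)]


print(exact_search("Phosphohexokinase"))    # misses the synonym title
print(exact_search("phosphohexokinse"))     # misspelled: finds nothing
print(tolerant_search("phosphohexokinse"))  # finds both PFK titles
```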

Let us take the example of a patient with a full set of biochemical blood test reports. These include estimations of vitamins, minerals, urea, uric acid, creatinine, SGOT, SGPT, lipid profile, protein profile etc. The patient's levels were all within normal limits, except for low levels of vitamin D, vitamin B12 and iron, and an increased level of uric acid. Searching Google with the individual abnormal values for vitamin D, vitamin B12, iron and uric acid retrieved no useful interpretation. Searching Google with the Boolean combination “low level of vitamin D” AND “low level of vitamin B12” AND “low level of iron” AND “high level of uric acid” retrieved no records, as shown next.

In the present example, we have four keywords/search terms: vitamin D, vitamin B12, iron and uric acid. How should we use the Google search engine with them? Entering them either individually or in combination using Boolean search principles does not retrieve the relevant information. The Google search engine implements concept search, so we need to turn the keywords/search terms into a concept and then use that concept as the search term. We know that the patient had all levels within the normal range except for low levels of vitamin D, vitamin B12 and iron, together with an increased level of uric acid. Searching with this statement as the concept, the following results could be retrieved:

As a tip, search using the whole concept: “the patient had all levels within normal range except low level of vitamin D, vitamin B12 and iron. In addition, patient had increased level of uric acid”.

3.3. Organization of Molecular Data for Information Retrieval and Knowledge Discovery

The following are the common ways to store records in a database in a computer file.

(a).Flat File in text

(b).Table as in a Relational database

(c).Object oriented database

3.3.1. FlatFile database

We can write one record on one line without the field names, and so avoid repeating the field names for each record. To separate the data values of the fields, an identifying mark is placed between two fields. This mark or character is known as the ‘delimiter’. The delimiter may be any unusual character not found in the data values, such as the backslash ‘\’ or the vertical bar ‘|’, which on most keyboards is typed by pressing the ‘\’ key with the shift key held down. Let us use ‘|’ as the delimiter character to separate the data values of each field in the records of the above enzymes. The following is the first record, for the enzyme PFK:

2.7.1.11|6-phosphofructokinase|Phosphofructokinase I. Phosphohexokinase.|ATP + D-fructose 6-phosphate = ADP + D-fructose 1,6-bisphosphate|D-tagatose 6-phosphate and sedoheptulose 7-phosphate can act as acceptors; UTP, CTP and ITP can act as donors; and not identical with EC 2.7.1.105.

While typing this entry, the ‘enter’ key is not pressed until the end of the record. This is equivalent to typing a single paragraph. In a text editor such as NotePad, a line continues as the same line until the ‘enter’ key on the keyboard is pressed. Therefore, although the above record for PFK spans several rows on this page, to the computer it is one line, because the ‘enter’ key is pressed only after entering the last word, i.e. 2.7.1.105. Once we press the ‘enter’ key, we move to the next line to enter the next record. What appears as a paragraph in a word-processing file is therefore a single line to the computer.

Writing each record on a separate line is illustrated next. The first line, holding the first record, for Phosphofructokinase, is shown above. The second line of the file holds the second record, for Hexokinase:

2.7.1.1|Hexokinase|Hexokinase type I. Hexokinase type II. Hexokinase type III. Hexokinase type IV (glucokinase).|ATP + D-hexose = ADP + D-hexose 6-phosphate|D-glucose, D-mannose, D-fructose, sorbitol and D-glucosamine can act as acceptors; ITP and dATP can act as donors

The third line of the file holds the third record, for Enolase:

4.2.1.11|Phosphopyruvate hydratase|2-phosphoglycerate dehydratase. Enolase.|2-phospho-D-glycerate = phosphoenolpyruvate + H2O|Also acts on 3-phospho-D-erythronate

The fourth line of the file holds the fourth record, for Pyruvate kinase:

2.7.1.40|Pyruvate kinase|Phosphoenol transphosphorylase. Phosphoenolpyruvate kinase.|ADP + phosphoenolpyruvate = ATP + pyruvate|UTP, GTP, CTP, ITP and dATP can also act as donors. Also phosphorylates hydroxylamine and fluoride in the presence of CO2.

When these four lines (which look like paragraphs above) are copied into NotePad, they appear as shown in the figure.

When the individual lines are wrapped to fit the window, they are displayed as shown in the next figure. Each field is separated by the delimiter character, in this case the vertical bar ‘|’. A comma ‘,’ may be used instead in other databases, but it would not suit the records above, whose values contain commas (for example ‘1,6-bisphosphate’). The only limitation is that the delimiter character must not occur in any field value of any record.

Similarly, we can create separate lines for all other enzymes. The file is then stored under a name. Storing the data values of all the records in one file in this way is known as the ‘flatfile’ system of storing a database.
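Reading such a flatfile back into records can be sketched with Python's standard csv module, using ‘|’ as the delimiter. The short field names and the function read_flatfile are hypothetical, the ‘=’ in the reactions is an assumed notation, and only the first two records are included to keep the example short.

```python
import csv
import io

# Short, hypothetical names for the five fields, in file order.
FIELDS = ["ec_number", "accepted_name", "alternative_names", "reaction", "comment"]

# Two records, one per line; each line ends only where 'enter' was pressed.
FLATFILE = (
    "2.7.1.11|6-phosphofructokinase|Phosphofructokinase I. Phosphohexokinase.|"
    "ATP + D-fructose 6-phosphate = ADP + D-fructose 1,6-bisphosphate|"
    "D-tagatose 6-phosphate and sedoheptulose 7-phosphate can act as acceptors; "
    "UTP, CTP and ITP can act as donors; and not identical with EC 2.7.1.105.\n"
    "2.7.1.1|Hexokinase|Hexokinase type I. Hexokinase type II. "
    "Hexokinase type III. Hexokinase type IV (glucokinase).|"
    "ATP + D-hexose = ADP + D-hexose 6-phosphate|"
    "D-glucose, D-mannose, D-fructose, sorbitol and D-glucosamine can act as "
    "acceptors; ITP and dATP can act as donors\n"
)


def read_flatfile(text, delimiter="|"):
    """Turn delimited lines into a list of records (dicts keyed by field name)."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter, quoting=csv.QUOTE_NONE)
    records = []
    for row in reader:
        if not row:  # skip blank lines
            continue
        if len(row) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} fields, got {len(row)}")
        records.append(dict(zip(FIELDS, (value.strip() for value in row))))
    return records


records = read_flatfile(FLATFILE)
for record in records:
    print(record["ec_number"], record["accepted_name"])
```

Calling read_flatfile(FLATFILE, delimiter=",") raises ValueError, because the commas inside the data values split the lines in the wrong places. This is the limitation on the choice of delimiter in practice.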

Now we would like to find the relevant or desired data values. In a text file, the data values can be searched for a desired keyword, say ‘phosphoenolpyruvate’. The search or find option of the text editor will highlight ‘phosphoenolpyruvate’ three times.

Finding ‘phosphoenolpyruvate’ three times with NotePad does not by itself reveal any information. The purpose of storing data values in a database is to organise them in separate fields so that, when we search a field with a keyword, we get not only data values but also the associated, otherwise hidden, connecting information. The information is the connection between different data values stored in the same field. For example, if we search the above file with ‘phosphoenolpyruvate’ as the keyword, but search only one field, say ‘Reaction catalysed’, then we retrieve two records, with the reaction catalysed field displayed as shown below:

Therefore, what we retrieve from a database is not isolated data values; the retrieved values also reveal the connections between them. Here, the retrieval connects two enzymes, Enolase and Pyruvate kinase, because phosphoenolpyruvate appears in the reactions of both. We thus retrieve not only data about the reactions involving ‘phosphoenolpyruvate’ but also information connecting the enzymes in the glycolytic pathway. This is the purpose of organising data in databases with different fields for each record.
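The contrast between searching the whole file and searching a single field can be reproduced with the four records above. In this sketch the whole-file search ignores case, as a text editor's find option usually does by default; the variable names are hypothetical and the ‘=’ in the reactions is an assumed notation.

```python
# Zero-based positions of two of the five fields within each line.
ACCEPTED_NAME = 1
REACTION = 3

LINES = [
    "2.7.1.11|6-phosphofructokinase|Phosphofructokinase I. Phosphohexokinase.|"
    "ATP + D-fructose 6-phosphate = ADP + D-fructose 1,6-bisphosphate|"
    "D-tagatose 6-phosphate and sedoheptulose 7-phosphate can act as acceptors; "
    "UTP, CTP and ITP can act as donors; and not identical with EC 2.7.1.105.",
    "2.7.1.1|Hexokinase|Hexokinase type I. Hexokinase type II. "
    "Hexokinase type III. Hexokinase type IV (glucokinase).|"
    "ATP + D-hexose = ADP + D-hexose 6-phosphate|"
    "D-glucose, D-mannose, D-fructose, sorbitol and D-glucosamine can act as "
    "acceptors; ITP and dATP can act as donors",
    "4.2.1.11|Phosphopyruvate hydratase|2-phosphoglycerate dehydratase. Enolase.|"
    "2-phospho-D-glycerate = phosphoenolpyruvate + H2O|"
    "Also acts on 3-phospho-D-erythronate",
    "2.7.1.40|Pyruvate kinase|"
    "Phosphoenol transphosphorylase. Phosphoenolpyruvate kinase.|"
    "ADP + phosphoenolpyruvate = ATP + pyruvate|"
    "UTP, GTP, CTP, ITP and dATP can also act as donors. Also phosphorylates "
    "hydroxylamine and fluoride in the presence of CO2.",
]

keyword = "phosphoenolpyruvate"

# Whole-file search: counts every occurrence, in whichever field it appears.
whole_file_hits = sum(line.lower().count(keyword) for line in LINES)

# Field-restricted search: look only inside the 'Reaction catalysed' field.
field_hits = [
    fields[ACCEPTED_NAME]
    for fields in (line.split("|") for line in LINES)
    if keyword in fields[REACTION].lower()
]

print(whole_file_hits)  # occurrences anywhere in the file
print(field_hits)       # accepted names of the enzymes whose reaction matches
```

The whole-file search finds three occurrences, one of which is only an alternative name of Pyruvate kinase. The field-restricted search returns exactly the two enzymes whose reactions involve phosphoenolpyruvate: Phosphopyruvate hydratase (Enolase) and Pyruvate kinase.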

3.3.2. Relational Database

The above data for enzymes can also be organized as a table, as shown below:

Storing data values in tabular form under different columns, where each column represents a unique field, is known as the relational database system. This helps to search each field separately and specifically across all the records. In addition, each record here has a unique identifier, in this case the EC number of the enzyme. Where no established standard provides a unique identification number, the database manager can add a unique identifier such as an accession number. An accession number is allotted to only one record, so each record has a unique identifier.

The enzyme 6-phosphofructokinase, EC 2.7.1.11, catalyses the reaction

ATP + D-fructose 6-phosphate = ADP + D-fructose 1,6-bisphosphate

Therefore, wherever this reaction is catalysed, the enzyme involved is always 6-phosphofructokinase, EC 2.7.1.11. The reaction may occur in E. coli, yeast or even human. The reaction is common to all these organisms, but the amino acid sequence of the enzyme catalysing it may or may not be the same in each. Whenever the amino acid sequence of EC 2.7.1.11 is determined, say in E. coli, the EC 2.7.1.11 record may be used to store it, in the same flatfile database or the same relational database table. However, when the amino acid sequence of 6-phosphofructokinase becomes available for another species, say yeast, it has to be stored as a separate record, and all the other field values must be repeated in the new record. The same happens again when the sequence becomes available for a further species, say human. This leads to redundant (duplicated) data. Moreover, if we need to edit a data value, say in the comment field, we have to edit every record for EC 2.7.1.11. This makes the data difficult to maintain.

A relational database management system (RDBMS) avoids this problem by storing the sequence information in a separate table in the same database, with the EC number as the connection between the two tables, as shown in the next table.

An RDBMS has the additional advantage of integrating databases. The sequence information may even be stored in a separate database. In fact, the UniProtKB database at http://www.uniprot.org/ houses the amino acid sequences of proteins, including enzymes, and a search for ‘2.7.1.11’ retrieves the amino acid sequences annotated with that EC number (7343 sequences of EC 2.7.1.11 as on 01 September 2015).

This allows a unique identifier to be used for identifying protein sequences, and the amino acid sequences of the same enzyme from any number of species to be included. Further, when other enzymes in any species become known, they may be included in the same database, so the database continues to grow without duplicating previously added data about the reaction catalysed. In this way the two tables are related to each other and help in retrieving all related information about a particular enzyme in a particular species without duplicating the data that is common to all species.
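The two-table arrangement can be sketched with Python's built-in sqlite3 module. The table names, column names and accession numbers are hypothetical, and the sequences are placeholders rather than real data. The reaction is stored once in the enzyme table, and each species contributes only one row to the sequence table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # temporary in-memory database
conn.executescript(
    """
    CREATE TABLE enzyme (
        ec_number     TEXT PRIMARY KEY,
        accepted_name TEXT NOT NULL,
        reaction      TEXT NOT NULL
    );
    CREATE TABLE enzyme_sequence (
        accession TEXT PRIMARY KEY,   -- hypothetical unique identifier
        ec_number TEXT NOT NULL REFERENCES enzyme (ec_number),
        organism  TEXT NOT NULL,
        residues  TEXT NOT NULL       -- placeholder, not a real sequence
    );
    """
)

# Species-independent data: stored exactly once.
conn.execute(
    "INSERT INTO enzyme VALUES (?, ?, ?)",
    (
        "2.7.1.11",
        "6-phosphofructokinase",
        "ATP + D-fructose 6-phosphate = ADP + D-fructose 1,6-bisphosphate",
    ),
)

# Species-specific data: one row per organism, linked by the EC number.
conn.executemany(
    "INSERT INTO enzyme_sequence VALUES (?, ?, ?, ?)",
    [
        ("SEQ0001", "2.7.1.11", "E. coli", "<placeholder sequence 1>"),
        ("SEQ0002", "2.7.1.11", "yeast", "<placeholder sequence 2>"),
        ("SEQ0003", "2.7.1.11", "human", "<placeholder sequence 3>"),
    ],
)

# Join the two tables on the EC number to retrieve everything about one enzyme.
rows = conn.execute(
    """
    SELECT s.organism, e.accepted_name, s.accession
    FROM enzyme_sequence AS s
    JOIN enzyme AS e ON e.ec_number = s.ec_number
    WHERE e.ec_number = ?
    ORDER BY s.accession
    """,
    ("2.7.1.11",),
).fetchall()

for organism, name, accession in rows:
    print(accession, organism, name)
```

Editing the reaction or a comment now means changing a single row of the enzyme table, however many sequences refer to it.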

3.3.3. Object Oriented Database

The development of RDBMS helped in retrieving not only data but also information. The development of object oriented databases helps further by adding the property of knowledge discovery. We know that the information about the folding of an enzyme or protein is encoded in its amino acid sequence. This can help in converting the relational database into an object oriented database. In an object oriented database, an individual object, for example ‘the enzyme’, knows what it can do, i.e. which reaction it can catalyse. It also knows how to fold into the three dimensional structure needed to catalyse that reaction. When an object knows what it can do and how, say how to fold in three dimensions, there is no need to store the data values of its three dimensional structure: the folding information is stored in the sequence, and the same may be used to fold the enzyme/protein in three dimensions. This will ultimately reduce the need to store three dimensional structure information in separate databases.

This, however, is a future expectation, to be realised when we completely understand the folding rules and patterns of enzyme/protein sequences. Object oriented databases should therefore ultimately be useful for knowledge discovery in bioinformatics simulations, where a given object such as an enzyme may simulate itself under changing conditions of pH, temperature, metabolites etc., and so enable the elucidation of cell and systems biology.
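The idea of an object that carries both its data and its behaviour can be sketched with a small Python class. This is only an illustration of the object oriented principle: the class Enzyme and its methods are hypothetical, the reaction is treated as a simple exchange of metabolite names in one direction, and folding is not modelled at all.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Enzyme:
    """An object that stores its data and knows which reaction it catalyses."""
    ec_number: str
    accepted_name: str
    substrates: frozenset
    products: frozenset

    def can_act_on(self, metabolites):
        """True if every substrate is present in the given metabolite pool."""
        return self.substrates <= set(metabolites)

    def catalyse(self, metabolites):
        """Return the pool after the reaction; unchanged if a substrate is missing."""
        pool = set(metabolites)
        if not self.can_act_on(pool):
            return pool
        return (pool - self.substrates) | self.products


pfk = Enzyme(
    ec_number="2.7.1.11",
    accepted_name="6-phosphofructokinase",
    substrates=frozenset({"ATP", "D-fructose 6-phosphate"}),
    products=frozenset({"ADP", "D-fructose 1,6-bisphosphate"}),
)

before = {"ATP", "D-fructose 6-phosphate"}
after = pfk.catalyse(before)
print(sorted(after))
```

A simulation could pass a pool of metabolites from one such object to the next, which is the kind of use the text anticipates.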

4. Summary

In this module, the students learnt

the nature of molecular data
the molecular database concepts
the retrieval of data
the searching of database for information retrieval
the basic organization of molecular data for knowledge discovery

