6 ISAR Systems: Functions and Design
Sunita Barve
I. Objectives
The objectives of this module are to:
• Outline the functions and design of Information Storage and Retrieval (ISAR) systems.
• Introduce the reader to the role of the user interface in an ISAR system.
• Introduce the basic concept of a data model, its types in information systems, and data modeling techniques.
• Familiarize the reader with the role of search engines in achieving high recall and precision.
• Introduce the reader to different databases and their functions for information storage in different fields.
• Introduce the reader to the information services offered by different service providers.
II. Learning Outcomes
After reading this module:
• The reader will gain knowledge of Information Storage and Retrieval systems and the various requirements for using them.
• The reader will understand the concept of the user interface and its functions (i.e. searching, saving, printing, etc.) in ISAR.
• The reader will also understand the guidelines for designing a user interface for ISAR.
• The reader will gain knowledge of various data models and their usefulness in the management of data.
• The reader will gain knowledge of different databases, such as OPAC and Dialog, and their functions.
• The reader will also gain knowledge of search engines such as Google and their advanced search features as examples of IR.
III. Structure
1. Introduction
2. User Interface System
3. Query Processing System
4. Database Modeling System
4.1. Linear Sequential Model
4.2. Hierarchical Model
4.3. Network Model
4.4. Relational Model
4.5. Entity-Relationship (E-R) Model
4.6. Object-Oriented Model
5. Examples of Information Retrieval Systems (OPAC, Dialog, GOOGLE, EBSCO, PubMed)
6. Summary
7. References
1. Introduction
An Information Storage and Retrieval (ISAR) system is one that end users communicate with directly in order to find information. Users interact with an ISAR system in different ways, and each user has different searching skills and a different level of familiarity with the system. The user interface therefore plays an important role, since it caters to the different requirements and proficiency levels of different users.
2. User Interface System
The user interface forms an important component of any information retrieval system since it connects users to the organized information resources. In ISAR systems, how the user communicates with the system determines how successful retrieval can be. The user interface usually includes features that help to connect the two entities, users and information. An ISAR system can be used effectively only if its user interface is easy for the end user to use. The user interface performs two major functions: it allows users to search or browse an information collection, and it displays the results of a search.
The user interface also allows users to perform further tasks, like:
• Searching
• Sorting
• Saving
• Printing the search results
• Modifying the search query, and so on.
The success of an information retrieval system depends significantly on the design, efficiency and usefulness of the user interface, which enhances interaction with the system. Well-designed user interfaces allow users to find the information and access it.
Some essential features are:
• Interactive
• Easy and intuitive navigation
• GUI based; use of extensive graphic images so that it is easy for end users to navigate
• Command mode, as expert users prefer to type commands
• Mnemonic: use familiar icons and also provide query saving options
• Modules for redefining the query
• Provide for basic keyword search and advanced options with combinations to choose any elements for searching
To provide the required facilities and to cater to different users’ requirements, the following guidelines are proposed for designing a user interface for an ISAR system (Shneiderman, 1987; Nielsen, 1995; Block, 2013):
• Strive for consistency in terminology, layout, instructions, fonts and color.
• Provide short-cuts for skilled users.
• Provide appropriate and informative feedback about the sources and what is being searched.
• Permit reversal of actions so that users can undo or modify actions; for example, they should be able to modify their queries or go back to the previous state in a search session.
• Support user control, allowing users to monitor the progress of a search and be able to specify the parameters to control a search.
• Reduce short-term memory load by providing mnemonics; the system should keep track of important actions performed by the users and allow them to jump easily to a formerly performed action.
• Provide simple error-handling facilities that allow users to rectify errors easily; all error messages should be clear and specific.
• Provide plenty of space for entering text in search boxes.
• Provide alternative interfaces for expert and novice users.
• Test the interface on different platforms and different versions of web browsers.
3. Query Processing System
The query processor has two main tasks: query optimization and query execution.
Query optimization is the process of choosing the best solution for query execution. Often there are several different ways to perform a query; all will eventually lead to the same correct result. The job of the query processor is to create one or more access plans for a given query. If several possible solutions exist that give the correct query result, the query processor must select the optimal access plan for the query.
The general tasks of the optimizer are maintaining statistics about the volume of data, choosing the indexes for the query, and (in the case of a query containing a join operation) deciding the appropriate table-join order.
The database user retrieves data by formulating a query in the data manipulation language provided within the database management system. The query processor is used to interpret the online user’s query and convert it into an efficient series of operations in a form capable of being sent to the data manager for execution.
The query processor uses the data dictionary to find the structure of the relevant portion of the database and uses this information in modifying the query and preparing an optimal plan to access the database.
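The optimizer's choice of access plan can be observed in a small experiment. The following is a minimal sketch using Python's built-in sqlite3 module, not a description of any particular ISAR product; the table books, the index idx_books_author and the sample rows are hypothetical names chosen for illustration. It asks SQLite to explain its plan for the same query before and after an index becomes available.

```python
import sqlite3

# In-memory database with a hypothetical catalogue table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (id INTEGER PRIMARY KEY, author TEXT, title TEXT)")
conn.executemany(
    "INSERT INTO books (author, title) VALUES (?, ?)",
    [("Author A", "Title 1"), ("Author B", "Title 2"), ("Author C", "Title 3")],
)

QUERY = "SELECT title FROM books WHERE author = ?"


def access_plan(connection, sql, params):
    """Return the optimizer's plan as a list of human-readable steps."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each row holds the textual description of the step.
    return [row[-1] for row in rows]


plan_without_index = access_plan(conn, QUERY, ("Author C",))

# Give the optimizer an index it can choose for this query.
conn.execute("CREATE INDEX idx_books_author ON books (author)")
plan_with_index = access_plan(conn, QUERY, ("Author C",))

print("Without index:", plan_without_index)
print("With index:   ", plan_with_index)
```

Both plans return the same rows; only the access path differs. Without the index the plan is a full scan of the table, and with the index the plan names idx_books_author. The exact wording of the plan text varies between SQLite versions.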
4. Database Modeling System
Data modeling is the process of creating a data model for an information system by applying formal data modeling techniques.
According to Steve Hoberman, data modeling is the process of learning about the data, and the data model is the end result of the data modeling process.
Data modeling defines not just data elements, but also their structures and the relationships between them.
We know that data exists as facts, figures and other bits of knowledge. When a large number of data items are to be harnessed, they have to be put together in a useful form from which one can get information. A data model is a schema that represents the real world using information concepts and structures.
A model of data is a particular type of structure or manner of visualizing a data structure. A data structure is a collection of data elements or objects and relationships among them. A data model is a plan for building a database. It is a description of the data about entities, events, activities and their associations within an organization.
A database model is a type of data model that determines the logical structure of a database and fundamentally determines in which manner data can be stored, organized, and manipulated. The most popular example of a database model is the relational model, which uses a table-based format.
A given database management system may provide one or more models. The optimal structure depends on the natural organization of the application’s data, and on the application’s requirements. Most database management systems are built around one particular data model, although it is possible for products to offer support for more than one model.
Before the 1980s, the two most commonly used database models were the hierarchical and network models. Various data models are now in use, such as:
• Linear Sequential Model
• Hierarchical Model
• Network Model
• Entity Relationship (E-R) Model
• Relational Model
• Object Oriented Model
4.1. Linear Sequential Model
It is a very common data structure, also called a flat structure. The flat (or table) model consists of a single, two-dimensional array of data elements. It is simply a list or table of elements, with no hierarchical structure. An example is the list of students registered for a given course, the class list. There may be no information except the STUDENT NUMBER and NAME in this type of sequential model.
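The class list example can be written down directly. The following is a minimal sketch in Python; the student numbers, names and the helper find_student are hypothetical and serve only to show that a flat structure is searched by scanning it in order.

```python
# A flat (linear sequential) class list: one table, no hierarchy.
# Each row holds only STUDENT NUMBER and NAME (hypothetical values).
class_list = [
    ("S001", "Asha"),
    ("S002", "Ravi"),
    ("S003", "Meera"),
]


def find_student(records, student_number):
    """Scan the records in order until the student number matches."""
    for number, name in records:
        if number == student_number:
            return name
    return None  # no such student in the list


print(find_student(class_list, "S002"))
```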
4.2. Hierarchical Model
This data model uses tree structures to represent relationships among records. For example, an institute has a number of programmes to offer, each programme has a number of courses, and each course has a number of students in it. A hierarchical model consists of a collection of records that are connected to each other through links. A tree-type data structure is used to represent the hierarchical data model and shows the relationships among the institute, programmes, courses and students. The highest level of the tree is known as the root.
Hierarchical structures were widely used in the early mainframe database management systems, such as the Information Management System (IMS) by IBM, and now describe the structure of XML documents. This structure allows a one-to-many relationship between two types of data. A hierarchical structure describes many real-world relationships efficiently: recipes, tables of contents, the ordering of paragraphs or verses, and any nested and sorted information.
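The institute example can be sketched as a tree in Python. This is an illustrative sketch only; the node names and the helper paths_from_root are hypothetical. Each node has exactly one parent, so every record is reached by a single path from the root.

```python
# A hypothetical institute modelled as a tree: every node has exactly one parent.
institute = {
    "name": "Institute",
    "children": [
        {
            "name": "Programme: Library Science",
            "children": [
                {
                    "name": "Course: Information Retrieval",
                    "children": [
                        {"name": "Student: Asha", "children": []},
                        {"name": "Student: Ravi", "children": []},
                    ],
                },
            ],
        },
    ],
}


def paths_from_root(node, prefix=()):
    """Yield the path from the root to every leaf of the tree."""
    here = prefix + (node["name"],)
    if not node["children"]:
        yield here
    for child in node["children"]:
        yield from paths_from_root(child, here)


for path in paths_from_root(institute):
    print(" > ".join(path))
```

A student who takes two courses would have to be stored twice in such a tree, which is the kind of redundancy that a one-parent structure forces.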
4.3. The Network Model
It was designed to solve some of the problems of the hierarchical model, to which it is very similar. In the network model, relationships are represented in terms of sets rather than a hierarchy. This allows the network model to support many-to-many relationships and also to solve the problem of data redundancy.
The network model organizes data using two fundamental concepts, called records and sets. Records contain fields (which may be organized hierarchically, as in the programming language COBOL). Sets define one-to-many relationships between records: one owner, many members. A record may be an owner in any number of sets and a member in any number of sets.
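Records and sets can be sketched with two small Python classes. This is a simplified illustration, not the data definition language of any real network DBMS; the class names Record and OwnedSet, the set names and the sample data are hypothetical. One student record is a member of two sets with different owners, so it is stored once and shared.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    kind: str
    fields: dict


@dataclass
class OwnedSet:
    """One set: a single owner record and any number of member records."""
    name: str
    owner: Record
    members: list = field(default_factory=list)


asha = Record("student", {"number": "S001", "name": "Asha"})
ir_course = Record("course", {"title": "Information Retrieval"})
db_course = Record("course", {"title": "Database Systems"})

# The same student record is a member of two sets owned by different courses.
all_sets = [
    OwnedSet("enrolled_in_ir", ir_course, [asha]),
    OwnedSet("enrolled_in_db", db_course, [asha]),
]


def owners_of(record, sets):
    """Return the owner of every set in which the record is a member."""
    return [s.owner for s in sets if any(m is record for m in s.members)]


print([owner.fields["title"] for owner in owners_of(asha, all_sets)])
```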
4.4. The Relational Model
The relational model was introduced in 1970 as a way to make database management systems more independent of any particular application. Three key terms are used extensively in relational database models: relations, attributes, and domains.
A relation is a table with columns and rows. The named columns of the relation are called attributes, and the domain is the set of values the attributes are allowed to take.
The basic data structure of the relational model is the table, where information about a particular entity (say, an employee) is represented in rows (also called tuples) and columns. Thus, the “relation” in “relational database” refers to the various tables in the database; a relation is a set of tuples. The columns enumerate the various attributes of the entity (the employee’s name, address or phone number, for example), and a row is an actual instance of the entity (a specific employee) that is represented by the relation. As a result, each tuple of the employee table represents various attributes of a single employee.
A relational database contains multiple tables, each similar to the one in the “flat” database model. One of the strengths of the relational model is that, in principle, any value occurring in two different records (belonging to the same table or to different tables) implies a relationship between those two records. The most common query language used with the relational model is the Structured Query Language (SQL).
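The idea that a shared value relates two records can be tried out with SQL. The following is a minimal sketch using Python's built-in sqlite3 module; the tables employee and department, their columns and the sample rows are hypothetical. The value of dept_id occurs in both tables, and the join uses it to relate each employee tuple to a department tuple.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE department (
        dept_id   INTEGER PRIMARY KEY,
        dept_name TEXT
    );
    CREATE TABLE employee (
        emp_id  INTEGER PRIMARY KEY,
        name    TEXT,
        phone   TEXT,
        dept_id INTEGER REFERENCES department (dept_id)
    );
    INSERT INTO department VALUES (1, 'Cataloguing'), (2, 'Circulation');
    INSERT INTO employee VALUES
        (10, 'Asha',  '555-0101', 1),
        (11, 'Ravi',  '555-0102', 2),
        (12, 'Meera', '555-0103', 1);
    """
)

# Each row (tuple) of the result combines attributes from both relations.
rows = conn.execute(
    """
    SELECT e.name, d.dept_name
    FROM employee AS e
    JOIN department AS d ON e.dept_id = d.dept_id
    WHERE d.dept_name = ?
    ORDER BY e.name
    """,
    ("Cataloguing",),
).fetchall()

print(rows)
```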
4.5. Entity-Relationship Model
The Entity-Relationship (ER) model was originally developed by Peter Chen in 1976. It unifies the network and relational database views. It is a conceptual data model that views the real world as entities and relationships. In this model an Entity-Relationship diagram is used to represent data concepts visually. Thus, it provides a graphic representation of entities, attributes and relationships. Today this model is commonly used for database design.
4.6. Object-Oriented Model
The object-oriented data model (OODM) is the basis for the object-oriented database management system (OODBMS). OODM is said to be a semantic data model. An object includes information about the relationships between facts within the object, and about its relationships with other objects.
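A small Python sketch can illustrate what it means for an object to carry both its facts and its relationships. This is only an illustration of the idea, not the interface of any OODBMS; the classes Author and Book and the sample values are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Author:
    name: str


@dataclass
class Book:
    title: str
    year: int
    # Relationship with other objects: a book refers to its author objects.
    authors: list = field(default_factory=list)

    def citation(self):
        """Combine facts held inside the object with facts from related objects."""
        names = ", ".join(author.name for author in self.authors)
        return f"{names} ({self.year}). {self.title}."


book = Book("Sample Title", 2015, [Author("A. Author")])
print(book.citation())
```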
5. Examples of Information Retrieval Systems (OPAC, Dialog, GOOGLE, EBSCO, PubMed)
In the recent past, a large number of information retrieval systems (IRS) have been made available to end users. These systems use different query languages, which are based either on Boolean algebra or on natural language queries. In most IRS, Boolean algebra is used to form a result set from an inverted index.
In any IRS, it is expected that more relevant than non-relevant information is retrieved. In practice, however, no IRS can guarantee that a good number of relevant records will be retrieved. Users expect high recall from an IRS. To achieve high recall, the Boolean OR operator is essential; the AND operator usually retrieves a smaller set of records than the OR operator. When searching with leading search engines such as Google, Bing or Yahoo, the AND operator is used by default. In search engines, “controlled vocabulary” searching is rare and, even where present, is not consistent across different search engines. Most search engines available today build their indexes by scanning the words on each page, hence searchers have to try different word orders to achieve high recall and precision.
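The effect of AND and OR on recall and precision can be worked through on a toy collection. The following is a minimal sketch in Python; the four documents, the query terms and the set of documents judged relevant are all hypothetical. Recall is the share of relevant documents that were retrieved, and precision is the share of retrieved documents that are relevant.

```python
# Hypothetical four-document collection.
documents = {
    1: "information retrieval systems",
    2: "library catalogue systems",
    3: "information storage",
    4: "online catalogue retrieval",
}


def build_inverted_index(docs):
    """Map every word to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index


def search_and(index, terms):
    """Documents containing all of the terms (set intersection)."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()


def search_or(index, terms):
    """Documents containing at least one of the terms (set union)."""
    return set().union(*(index.get(term, set()) for term in terms))


def recall(retrieved, relevant):
    return len(retrieved & relevant) / len(relevant)


def precision(retrieved, relevant):
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0


index = build_inverted_index(documents)
terms = ["information", "retrieval"]
relevant = {1, 4}  # assumed relevance judgement for this query

and_hits = search_and(index, terms)
or_hits = search_or(index, terms)

for label, hits in (("AND", and_hits), ("OR", or_hits)):
    print(label, sorted(hits), recall(hits, relevant), precision(hits, relevant))
```

In this toy collection AND retrieves only document 1, so recall is one half while precision is perfect. OR retrieves documents 1, 3 and 4, which raises recall to 1 but lowers precision because document 3 is not relevant.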
A database is a collection of similar data records stored in a common file (or collection of files). Databases are of various types, such as catalogues of books or other types of documents, computerized bibliographies, address directories, full-text newspapers, newsletters, magazines and journals, and WWW and Internet search engines. “Text retrieval” can be considered a part of the larger concept of “information management”.
The following are some of the databases that help to find information in large systems:
OPAC: Online Public Access Catalogues, previously termed online catalogues, first came into existence in libraries during the 1970s and are found in most libraries around the world today.
Online catalogues brought significant improvement in access to library resources. Although the content and structure of the records were different from card catalogues, online catalogues provided new searching capabilities such as keyword search, basic search and advanced search, usually based on Boolean logic. Online catalogues also brought search limits based on date, language and type of document.
In integrated library automation software, the OPAC can combine data from the circulation, acquisitions and cataloguing modules. It can show the bibliographic data of a book as well as who holds it, and whether the book is available on the shelf, issued out, or in any other status.
Currently OPACs are termed Web OPACs, that is, OPACs provided on the web. Web OPACs today offer a wide range of search and browse options. They may incorporate information retrieval techniques such as word stemming, truncation, weighted searching, fuzzy-match search logic and natural language processing. Other features include automatic spelling correction of common terms. They frequently allow a reader to save searches via email. They also provide self-service features such as self-reservation and renewal, document ordering, and writing reviews and comments. With the availability of open source software, library software packages provide many features to users through the Web OPAC.
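Two of the techniques named above, truncation and fuzzy matching for spelling correction, can be sketched in a few lines of Python. This is an illustration only and does not reproduce the behaviour of any particular OPAC; the index terms and function names are hypothetical. The fuzzy match uses get_close_matches from Python's standard difflib module.

```python
import difflib

# Hypothetical index terms taken from a small catalogue.
INDEX_TERMS = ["catalogue", "cataloguing", "category", "circulation", "retrieval"]


def truncation_search(pattern, terms):
    """Match a right-truncated pattern such as 'catalog*' against index terms."""
    stem = pattern.rstrip("*")
    return [term for term in terms if term.startswith(stem)]


def suggest_spelling(word, terms):
    """Return the index term closest to a possibly misspelled word, or None."""
    matches = difflib.get_close_matches(word, terms, n=1)
    return matches[0] if matches else None


print(truncation_search("catalog*", INDEX_TERMS))
print(suggest_spelling("retreival", INDEX_TERMS))
```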
We looked at the general information about library OPAC. Let us now look at some information about commercial databases providing IR services also. Note: The services discussed in the following sections are included to discuss functions of popular IR interfaces and not as an endorsement of any service or product.
Dialog: Knight-Ridder Information Inc’s Dialog is one of the oldest online retrieval systems. It was completed in 1966. According to its literature, it was “the world’s first online information retrieval system to be used globally with materially significant databases”. Dialog provides over 500 databases, ranging across most disciplines and across bibliographic, abstract and full-text formats. The provider describes it as having a comprehensive content collection and a powerful search language, and as delivering accurate, relevant results quickly for tasks ranging from concept testing and clinical trials to product launches and patent protection.
Dialog covers a vast array of information on virtually every subject. More than 30,000 separate serial publications are indexed on Dialog, and more than 8,000 of these are in full text. In addition, Dialog includes the full text of many reference works and special publications from around the world, such as market research and brokerage house reports, patents and trademark registrations, chemical directories and drug pipeline monitoring services. Archival data is also available on Dialog, for many sources back to the 1960s and 1950s, and for some even back to the 18th century. Information on Dialog is organized into separate databases, each like a “mini library” of specific information or publications. Learning to identify the best database for a search is a skill.
Although menu-based searching is available in Dialog, the primary search style is Boolean, offering truncation, field searching, and limits by categories such as language or document type. For many years the Dialog system offered searching based on only one query logic, Boolean searching. Its primary users were initially library community members and other professional searchers. Over the years Dialog has provided variations in its services for different classes of users with different skill sets. Dialog’s basic language allows the use of six commands: BEGIN, EXPAND, SELECT, DISPLAY, PRINT and LOGOFF. Dialog also allows users to enter a “natural language” query. The natural language query is converted into a Boolean expression, with common words omitted and the remaining words ORed together. The retrieved set is then ranked according to the number of query terms found.
Dialog has also introduced a web version of its service, DialogWeb, which offers Internet access to the regular Dialog search system. DialogWeb provides access to the full content (over 500 databases), power and precision of Dialog through a Web browser. Special features include:
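The natural language conversion described above (drop common words, OR the remaining words, rank by the number of query terms found) can be sketched as follows. This is an illustration of the general technique in Python and not Dialog's actual implementation; the stop-word list, the documents and the function names are hypothetical.

```python
# Hypothetical list of common words to omit from the query.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "for", "and", "to", "about"}

# Hypothetical document collection.
docs = {
    "D1": "patents in chemical engineering",
    "D2": "chemical patents and trademark law",
    "D3": "history of printing",
}


def to_or_terms(query):
    """Drop common words; the remaining words are treated as ORed together."""
    return [word for word in query.lower().split() if word not in STOP_WORDS]


def ranked_or_search(query, documents):
    """Retrieve documents matching any term, ranked by number of terms matched."""
    terms = set(to_or_terms(query))
    scored = []
    for doc_id, text in documents.items():
        matched = terms & set(text.lower().split())
        if matched:
            scored.append((len(matched), doc_id))
    # Most matched terms first; ties broken by document id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]


print(to_or_terms("the law of chemical patents"))
print(ranked_or_search("the law of chemical patents", docs))
```

Here D2 ranks above D1 because it contains three of the query terms while D1 contains two, and D3 is not retrieved at all.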
• A flexible and easy to use Guided Search mode that does not require knowledge of the Dialog command language.
• A robust Command Search mode that uses the powerful Dialog command language which experienced searchers can easily use.
• Database selection tools to help pinpoint the right database for your search.
• Integrated database descriptions, pricing information, and other search assistance.
• Easy to use forms to create and modify Alerts (current awareness updates).
• Search results available in HTML or text formats.
• A choice of displaying records or sending search results via email, fax or postal delivery.
Guided Search is designed for novice to intermediate searchers who want easy access to Dialog’s authoritative business, legal, scientific, intellectual property, and technical information. It is the default search option for all new DialogWeb customers.
Command Search is designed for experienced Dialog searchers. It provides complete command-based access to the extensive collection of Dialog databases.
Google: Google is the most popular of the web search engines available on the Internet today. It was developed in 1998. Google navigates the web by crawling and indexing web resources. The Google index contains over 60 trillion resources and over 100 million gigabytes of data. The websites that Google ranks on the first page of its search results for a given search term are the ones that it considers most relevant and useful. Google determines relevance and usefulness with a complex algorithm that takes into account more than 200 different factors. Google holds 70% of the search engine market share.
Google’s opening page is very simple, consisting of just a query box and a link to the advanced search section. It also allows users to set certain searching preferences, such as language and display size, and links to specialized databases for images, interest groups, news, product location and scholarship. Google also allows users to set search preferences for images, maps, videos and books, as well as to restrict searches by country, time and place.
While a user searches for information, Google offers spell checking and provides suggestions. Google also does stemming. Google supports basic keyword search as well as advanced search facilities using the AND, OR and NOT operators [http://www.google.com/advanced_search]. Its advanced search features also include phrase searching, wild cards, stemming, fuzzy searching, proximity searching, terms that must be included or excluded, range searching, etc.
Google also takes a snapshot of each page that it has indexed, called the “cache” page. If the end user clicks on the “Cached” link, Google shows the web page as it looked when it was indexed.
Google searches 12 non-HTML formats, including PDF documents, Microsoft Office, MacWrite, PostScript, WordPerfect and Lotus 1-2-3, and permits viewing of such pages converted to HTML, so that the particular software is not required to view the retrieved resources. The advanced search allows users to specify ‘must include’ and ‘must exclude’ terms, phrase searching, wild card searching, range searching and file type searching. In a similar manner, Google offers advanced search qualifiers such as site, url, in text, in title, in anchor, link and cached.
Clicking on advanced search on Google leads to the advanced search page at http://www.google.com/advanced_search. There one finds boxes that retrieve only pages with all of the words entered, with the exact phrase entered, with at least one of the words entered, and without the words entered. Several of these boxes can be filled in at once to combine their conditions in a single search. Also present are pull-down menus that limit a search by language, format, time since last crawl, a sparse set of HTML tags, and to within, or not within, a particular site or domain. It is also possible to enter a page URL and retrieve pages that link to it, or are similar to it.
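The boxes of the advanced search page correspond to operators that can also be typed into the ordinary query box. The following Python sketch assembles such a query string from the contents of the boxes. The function build_query and its parameter names are hypothetical; the sketch assumes the commonly used operator forms (quotation marks for an exact phrase, OR between alternatives, a minus sign for exclusion, and site: for a domain restriction), and the way a search engine interprets them may change over time.

```python
def build_query(all_words=(), exact_phrase="", any_words=(), none_words=(), site=""):
    """Combine the advanced search boxes into a single query string."""
    parts = list(all_words)                          # all these words
    if exact_phrase:
        parts.append(f'"{exact_phrase}"')            # this exact phrase
    if any_words:
        parts.append(" OR ".join(any_words))         # any of these words
    parts.extend("-" + word for word in none_words)  # none of these words
    if site:
        parts.append("site:" + site)                 # limit to a site or domain
    return " ".join(parts)


query = build_query(
    all_words=("information", "retrieval"),
    exact_phrase="user interface",
    any_words=("OPAC", "catalogue"),
    none_words=("advertising",),
    site="example.org",
)
print(query)
```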
EBSCO Information Services: [www.EBSCO.com] EBSCO Information Services is a division of EBSCO Industries Inc. EBSCO offers library resources to customers in academic, medical, K–12, public library, law, corporate, and government markets. Its products include EBSCONET, a complete e-resource management system, and EBSCOhost, which supplies a fee-based online research service with 375 full-text databases, a collection of 600,000-plus eBooks, subject indexes, point-of-care medical references, and an array of historical digital archives.
EBSCO has the following products:
Databases: EBSCO provides a range of library database services. Many of the databases, such as Medline and EconLit, are licensed from content vendors. Others, such as Academic Search, MasterFILE, and Environment Complete, are content licensed directly from publishers and compiled by EBSCO.
Discovery: This product is used to create a unified, customized index of an institution’s information resources, and a means of accessing all the content from a single search box. The system works by harvesting metadata from both internal and external sources, and then creating a pre-indexed service.
eBooks: EBSCO provides e-books and audiobooks across a wide range of subject matter.
DynaMed: A clinical reference tool for physicians and other health care professionals. DynaMed was ranked highest among 10 online clinical resources in a study in the Journal of Clinical Epidemiology, and also had the highest overall performance in the disease reference product category in the last two reports on clinical decision support resources by KLAS, a research firm that specializes in monitoring and reporting the performance of healthcare vendors.
EBSCOhost: A powerful online reference system accessible via the Internet. It offers a variety of proprietary full-text databases from leading information providers. It provides a research solution comprising research databases, e-books and e-journals, combined with a discovery service and management resources, to support the information and collection development needs of libraries and other institutions and to improve the search experience for researchers and other end users.
EBSCO serves the content needs of all researchers, whether they access EBSCO resources via academic institutions, schools, public libraries, hospitals and medical institutions, corporations, associations or government institutions. It has a powerful web-based retrieval system. As soon as a user logs in to the system, the user can select a database and then click on the search option. The initial search screen presents a toolbar with functions that are available at all times during a search session. These include buttons for “new search”, which returns to the initial default search screen; “view folder”, which allows the user to view a personal folder that is cleared at the end of the session unless the user signs in to the system to establish a permanent file; “preferences”, which permits a change in the result list format and the number of results per page; “help”, which opens an online help manual; and an “exit” or “home library graphic” button, which closes EBSCOhost and returns to the library’s home page. The basic search screen supplies a Find box, in which terms may be entered; the terms are automatically checked for common misspellings and alternate spellings are suggested. Keyword searching is the default. EBSCOhost allows the use of basic search as well as advanced search options.
PubMed: PubMed [https://www.ncbi.nlm.nih.gov/pubmed] is a free search engine accessing primarily the MEDLINE database of references and abstracts on life sciences and biomedical topics. The United States National Library of Medicine (NLM) at the National Institutes of Health maintains the database as part of the Entrez system of information retrieval. PubMed provides access to MEDLINE, an international bibliographic database covering over 4600 biomedical journals from 1966 to the present. PubMed also links to molecular biology databases (Nucleotide, Protein, Genome, PopSet).
In addition to MEDLINE, PubMed provides access to:
• older references from the print version of Index Medicus back to 1951 and earlier;
• references to some journals before they were indexed in Index Medicus and MEDLINE, for instance Science, BMJ, and Annals of Surgery;
• very recent entries to records for an article before it is indexed with Medical Subject Headings (MeSH) and added to MEDLINE;
• a collection of books available full-text and other subsets of NLM records; and
• PMC citations.
Many PubMed records contain links to full text articles, some of which are freely available, often in PubMed Central and local mirrors such as UK PubMed Central.
The PubMed database allows users to carry out basic and advanced searches. The first search page of PubMed is very simple and easy to use. PubMed is searched by entering one or more keywords or phrases into the query box.
PubMed searches multiple words as a phrase if it recognizes the terms. Otherwise, PubMed searches the words separately and combines them with AND. PubMed also automatically tries to map search terms to MeSH (Medical Subject Headings). Using MeSH helps to achieve specificity in results and may be the most precise way to do subject or domain searches. MeSH terminology provides a consistent way to retrieve information that may use different terminology for the same concepts.
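The phrase-or-AND behaviour described above can be pictured with a very simplified Python sketch. This is not PubMed's actual algorithm, which involves further steps such as mapping terms to MeSH; the phrase list KNOWN_PHRASES and the function interpret are hypothetical and only illustrate the decision between a phrase search and an AND combination.

```python
# Hypothetical list of phrases that the system recognizes.
KNOWN_PHRASES = {"heart attack", "breast cancer"}


def interpret(query):
    """Search a recognized phrase as a phrase; otherwise AND the words together."""
    normalized = " ".join(query.lower().split())
    if normalized in KNOWN_PHRASES:
        return f'"{normalized}"'
    return " AND ".join(normalized.split())


print(interpret("Heart Attack"))
print(interpret("aspirin headache children"))
```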
6. Summary
The function and design of an Information Storage and Retrieval system depend on the various requirements of the users who use the system. The importance of user interfaces and their design principles has been outlined here. The general tasks of the optimizer and the functions of the query processor in a database have been described. To highlight essential features of IR in ISAR systems, popular IR systems and search facilities, including Google, EBSCO services and PubMed, have been described.
7. References
- Ben Shneiderman (1987). Designing the User Interface. Addison Wesley.
- Dialog web service: information accessed at https://en.wikipedia.org/wiki/Knight_Ridder
- Google Search: https://developers.google.com/search/docs/guides/search-features
- Henri Block (2013). The Eight Golden Rules of Interface Design. Accessed at https://www.frantic.com/blog/The-golden-rules–back-to-the-basics-61.html (browsed on 03/07/2015)
- Jakob Nielsen (1995). Ten Usability Heuristics. Accessed at https://www.nngroup.com/articles/ten-usability-heuristics/ (browsed on 03/07/2015)
- PubMed [https://www.ncbi.nlm.nih.gov/pubmed]
- Steve Hoberman. Data Modeling Fundamentals. Available at www.irmuk.co.uk/events/72.cfm