17 Virtual Information Resources in the Thrust Areas of the Organisation

Prof I V Malhan

1. Introduction

Information usually has two key aspects — content (what the material is about); and format (is it a book, a newspaper article, a CD, a video). Some systems organize access to information based on content alone, irrespective of format. For example, books, videos and CDs might all be grouped by subject next to each other. But, as there are so many different formats, one may find many access systems which provide access to just one format (e.g., a database that only indexes newspapers).

2. Virtual information resources

Virtual information resources can be discussed under the following categories:

1. By Type: E-Books, E-Journals, Bibliographic Databases, Electronic Theses and Dissertations (ETD), Online Public Access Catalogues (OPACs), Web-OPACs, etc.

2. By Content: Primary, Secondary and Tertiary sources

3. By Format: Text, Audio, Video, Image, Multimedia objects

4. By Source: Websites/Portals, Digital Libraries, Networked Digital Libraries (NDLs), Institutional Repositories (IRs), etc.

2.1 By Type

2.1.1 E-books: An e-book (short for electronic book and also known as a digital book, ebook, or eBook) is an e-text that forms the digital media equivalent of a conventional printed book, sometimes restricted with a digital rights management system. An e-book, as defined by the Oxford Dictionary of English, is “an electronic version of a printed book which can be read on a personal computer or hand-held device designed specifically for this purpose”. E-books are usually read on dedicated hardware devices known as e-Readers or e-book devices (e.g. Amazon Kindle and Sony’s PRS-500). Personal computers and some cell phones can also be used to read e-books. E-books come in open formats such as plain text (.txt files) and Hypertext Markup Language (.htm/.html files), as well as in other formats such as Amazon Kindle (.azw files, a proprietary format), Open Electronic Package (.opf files; OPF is an XML-based e-book format created by E-Book Systems), PostScript (.ps files) and Portable Document Format (.pdf files).

2.1.2 E-Journals: Electronic journals, also known as ejournals, e-journals, or electronic serials, are scholarly journals or intellectual magazines that can be accessed via electronic transmission. They are usually published on the Web, though in earlier days they were published on CD-ROMs. They are a specialized form of electronic document: their purpose is to provide material for academic research and study, and they are formatted approximately like articles in traditional printed journals. Being in electronic form, articles usually carry metadata that can be entered into specialized databases, such as the Directory of Open Access Journals (DOAJ) [http://www.doaj.org/], as well as the databases and search engines for the academic discipline concerned. Some electronic journals are online-only journals; some are online versions of printed journals; and some consist of the online equivalent of a printed journal with additional online-only material (e.g. video and interactive media). Most commercial journals are subscription-based, or allow pay-per-view access. An increasing number of journals are now available as online open access journals, requiring no subscription and offering free full-text articles and reviews to all. Most electronic journals are published in HTML and/or PDF format, but some are available in only one of the two formats. A small minority publish in DOC format, and a few are starting to add MP3 audio. Some early electronic journals were first published in ASCII text, and some informally published ones continue in that format.

2.1.3 Bibliographic Database: A bibliographic database “contains references to published literature, including journal and newspaper articles, conference proceedings and papers, reports, government and legal publications, patents, books, etc. In contrast to library catalogue entries, a large proportion of the bibliographic records in online and CD-ROM databases describe analytics (articles, conference papers, etc.) rather than complete monographs, and they generally contain very rich subject descriptions in the form of subject-indexing terms and abstracts”.

The database consists of “electronic entries called records, each containing a uniform description of a specific document or bibliographic item, usually retrievable by author, title, subject heading (descriptor), or keyword(s). Some bibliographic databases are general in scope and coverage; others provide access to the literature of a specific discipline or group of disciplines. An increasing number of bibliographic databases provide the full-text of at least a portion of the sources indexed. Most bibliographic databases are proprietary, available by licensing agreement from vendors, or directly from the abstracting and indexing services that create them”. Many scholarly databases are bibliographic databases, but some are not. For example, there are many scholarly databases that do not include references to published literature, but instead include references to chemical structures (such as Chemical Abstracts), sequences (such as Entrez), or artistic images (such as ARTstor).
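The idea of records with a uniform description, retrievable by author, title, subject heading or keyword, can be illustrated with a minimal Python sketch. The field names, the `search` helper and the two sample records are invented for illustration; real bibliographic databases use far richer record structures and indexing.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """One bibliographic record: a uniform description of a single item."""
    author: str
    title: str
    descriptors: list = field(default_factory=list)  # subject headings
    abstract: str = ""


def search(records, field_name, term):
    """Return the records whose chosen field contains the term (case-insensitive)."""
    term = term.lower()
    hits = []
    for rec in records:
        value = getattr(rec, field_name)
        # descriptors is a list; join it so every field is searched as text
        text = " ".join(value) if isinstance(value, list) else value
        if term in text.lower():
            hits.append(rec)
    return hits


# Invented sample data, not taken from any real database.
records = [
    Record("Rao, A.", "Digital libraries in India", ["Digital libraries", "India"]),
    Record("Smith, J.", "Indexing newspaper articles", ["Indexing", "Newspapers"]),
]

for hit in search(records, "descriptors", "newspapers"):
    print(hit.author, "-", hit.title)
```

The same `search` call works for any access point (`"author"`, `"title"`, `"descriptors"`, `"abstract"`), which mirrors how a record with uniform fields can be retrieved in several ways.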

2.1.4 Electronic Theses and Dissertations (ETD): Nowadays universities and research institutes are making their research output in the form of theses and dissertations (at Master’s, M.Phil. and Ph.D. level) available through their institutional repositories. These theses and dissertations are stored on the university’s or institution’s server in electronic formats. In India, the Indian Institute of Science (IISc), Bangalore [http://etd.ncsi.iisc.ernet.in/] and Indian Electronic Theses and Dissertations (iETD@INFLIBNET Centre) [http://ietd.inflibnet.ac.in/] have made theses and dissertations available on the Internet. Whereas the IISc ETD is restricted to the publications of IISc, iETD@INFLIBNET encourages research scholars from any Indian university to submit their theses and dissertations.

2.1.5 Online Public Access Catalogue (OPAC): An Online Public Access Catalogue or OPAC (sometimes also referred to as iPAC, for Internet/Intranet Public Access Catalogue) is a computerized online catalogue of the materials held in a library or library system. The library staff and the public can usually access it at computers within the library, or from home. Since the mid-1990s, these systems have increasingly migrated to Web-based interfaces (Web-OPACs). OPACs are often part of an integrated library system (ILS) or library automation software, e.g. Koha, Libsys, Newgenlib, etc. In its simplest form, a library’s OPAC consists of an index of the bibliographic data catalogued in the system. More complex OPACs offer a variety of search capabilities on several indexes, integrate rich content (book covers, video clips, etc.), and offer interactive request and renewal functionality. In the past, libraries made their catalogues available to users outside the library by means of a Telnet interface, usually accessible through a direct dial-up connection or across the Internet. Today, most integrated library systems offer a browser-based OPAC (aka iPAC) module as a standard capability or optional feature. OPAC modules rely on pull-down menus, popup windows, dialog boxes, mouse operations, and other graphical user interface components to simplify the entry of search commands and the formatting of retrieved information. Examples of OPACs include those of the Library of Congress, U.S.A. [http://catalog.loc.gov/] and the British Library, U.K. [http://catalogue.bl.uk].

2.2 By Content

2.2.1 Primary information resources contain new or original ideas or new interpretations of known facts. Primary information is information in its original form, unchanged and unedited. Examples of primary works include: an original copy of the Declaration of Independence; a letter in Mahatma Gandhi’s handwriting; an oral history interview with an important pioneer settler; diaries; videos of current events as they happen; photographs; autobiographies; original public records.

2.2.2 Secondary information resources are those derived from primary sources. Secondary information is a second-hand version, representing someone else’s interpretation. It may be a restatement or summary of information taken from primary information sources, but it has been filtered through another person (or persons). Examples of secondary sources are encyclopaedias and handbooks, textbooks, biographies, periodical articles, and historical fiction.

2.2.3 Tertiary information resources are those that are based on primary and secondary sources of information. The information presented in tertiary sources is highly condensed, and the aim is to provide relevant information in as few words as possible. They serve primarily as aids for searching primary and secondary sources.

2.3 By Format

2.3.1 Text Documents: Textual documents can be categorized into two basic types:

Plain Text Documents: contain only plain, unformatted text. Plain text files are usually saved using the TXT file name extension, but many other extensions are used as well, depending on the plain text file’s intended purpose. The encoding has traditionally been either American Standard Code for Information Interchange (ASCII), one of its many derivatives such as ISO/IEC 646, or sometimes EBCDIC. Unicode is today gradually replacing the older ASCII derivatives limited to 7 or 8 bit codes. Plain text files can be opened, read, and edited with most text editors. Examples include Notepad (Windows); edit (DOS); ed, emacs, vi, vim, Gedit or nano (Unix, Linux); SimpleText (Mac OS); and TextEdit (Mac OS X).
Rich Text Documents: combine text with formatting information so that the text can use any mixture of fonts, font sizes, font styles (bold, italic, etc.), and paragraph styles (centered, bulleted, etc.). Rich text documents may also contain non-text content such as images. File formats used for rich text documents include: Microsoft Word (.doc), a structured binary format developed by Microsoft; Hypertext Markup Language (HTML) (.html, .htm); Office Open XML (.docx), an XML-based standard for office documents and an ISO standard from 2008; OpenDocument (.odt), an XML-based standard for office documents and an ISO standard from 2006; OpenOffice.org XML (.sxw), an open, XML-based format for office documents; PDF, an open standard for document exchange, with ISO standards from 2001, 2005 and 2008, readable on almost every platform with free or open source readers (open source PDF creators are also available); DjVu (pronounced like déjà vu), a file format designed primarily to store scanned documents; TeX, a popular open-source typesetting program and format, widely used for mathematical notation; Rich Text Format (RTF), a document format developed by Microsoft since 1987 for Microsoft products and cross-platform document interchange; and Text Encoding Initiative (TEI), an XML format for digital publication.
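The difference between 7-bit ASCII and Unicode encodings of plain text can be seen in a short Python sketch. The helper name `fits_in_ascii` and the sample strings are chosen here for illustration only.

```python
def fits_in_ascii(text):
    """Return True if every character of the text can be stored as 7-bit ASCII."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


plain = "deja vu"
accented = "d\u00e9j\u00e0 vu"  # the same phrase written with e-acute and a-grave

# For pure ASCII text, UTF-8 produces exactly the same bytes as ASCII.
print(plain.encode("ascii") == plain.encode("utf-8"))

# Accented characters fall outside ASCII; UTF-8 stores each of them in two bytes.
print(fits_in_ascii(accented))
print(len(accented), len(accented.encode("utf-8")))
```

This backward compatibility is one reason a plain text file saved as UTF-8 can still be read by older ASCII-only tools as long as it contains only unaccented Latin characters.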

2.3.2 Audio Files: An audio file format is a file format for storing audio data on a computer system. There are three major groups of audio file formats:

1. Uncompressed audio formats, such as WAV, AIFF, AU;

2. Formats with lossless compression, such as FLAC, Monkey’s Audio (filename extension APE), WavPack (filename extension WV), Shorten, TTA, ATRAC Advanced Lossless, Apple Lossless, MPEG-4 SLS, MPEG-4 ALS, MPEG-4 DST, Windows Media Audio Lossless (WMA Lossless).

3. Formats with lossy compression, such as MP3, Vorbis, Musepack, AAC, ATRAC and lossy Windows Media Audio (WMA).

2.3.3 Video Files: Video is the technology of electronically capturing, recording, processing, storing, transmitting, and reconstructing a sequence of still images representing scenes in motion. The most common digital video formats are as follows:

AVI video format – The AVI video format (Audio Video Interleave) was developed by Microsoft. Videos stored in the AVI video format have the extension .avi.
MPEG video format – The MPEG video format (Moving Picture Experts Group) is one of the most popular formats on the Internet. It is cross-platform, and supported by all the most popular web browsers. Videos stored in the MPEG video format have the extension .mpg or .mpeg.
MP4 video format – The MP4 video format (MPEG-4 video format) is a moving picture compression standard which is used for Internet, broadcast, and on storage media.
Windows Media video format – The Windows Media format was developed by Microsoft. Videos stored in the Windows Media video format have the extension .wmv.
QuickTime video format – The QuickTime format was developed by Apple. QuickTime video format is a common format on the Internet, but QuickTime movies cannot be played on a Windows computer without an extra (free) component installed.
RealVideo format – The RealVideo format was developed for the Internet by RealNetworks. The RealVideo format allows streaming of video (on-line video, Internet TV) at low bandwidths. Because low bandwidth is given priority, quality is often reduced.
Shockwave Flash video format – The Shockwave Flash format was developed by Macromedia. The SWF video format requires an extra component to play. This component comes preinstalled with the latest versions of Netscape and Internet Explorer.

2.3.4 Image Files: The most common digital image formats are:

RAW Format: RAW images come in different flavours depending on the manufacturer of the camera taking the pictures. RAW files are useful because they hold minimally processed sensor data and carry considerable metadata, so they can be manipulated further in a tool such as Photoshop, recent versions of which can import RAW files from cameras such as Canon and Nikon natively.
JPEG stands for Joint Photographic Experts Group. JPEG is the standard lossy compression format designed for digital photographic images. It is well suited to storing photographic images with full color or gray scales and continuous variations in color. JPEG can be read by virtually any browser or image reader, and it is a good choice when file size is a concern. However, JPEG is not the best choice for images with lettering, cartoons, line art, drawings, or black and white images, as it does a poor job with uniform color and sharp edges.
GIF stands for Graphics Interchange Format. GIF uses lossless compression but is limited to a palette of at most 256 colors. It is designed for flat images and works well for text and illustrations, and it is the usual choice for line drawings, simple cartoons, and computer generated images. The GIF file type has improved Web page interface design by allowing for sleek graphically designed pages that are quick to download.
PNG stands for Portable Network Graphics. PNG is a lossless format: the image is recovered exactly. It is superior to GIF because it supports 16 million colors with lossless compression based on the DEFLATE algorithm. PNG is well suited to images with large areas of exactly uniform color. PNG is readable by all current browsers, although some very old browsers did not support it fully.
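The distinction between lossless and lossy compression drawn above can be sketched in Python with the standard `zlib` module, which implements the DEFLATE method used by PNG. The pixel values are invented, and the `quantise` function is only a toy stand-in for a lossy step; it is not how JPEG actually works.

```python
import zlib

# A hypothetical 8-bit greyscale "image": large areas of uniform colour.
pixels = bytes([200] * 900 + [30] * 100)

# Lossless: DEFLATE shrinks the data and decompression restores every byte.
compressed = zlib.compress(pixels)
restored = zlib.decompress(compressed)
print(len(pixels), len(compressed), restored == pixels)


def quantise(data, step=16):
    """Toy lossy step: round each value down to a multiple of `step`."""
    return bytes((value // step) * step for value in data)


# Lossy: the original values cannot be recovered once detail is discarded.
lossy = quantise(pixels)
print(lossy == pixels)
```

Uniform areas compress very well under DEFLATE, which is consistent with the remark that PNG suits images with large areas of exactly uniform color.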

2.3.5 Multimedia Documents: Multimedia includes a combination of text, audio, still images, animation, video, and interactivity content forms. Multimedia is usually recorded and played, displayed or accessed by information content processing devices, such as computerized and electronic devices, but can also be part of a live performance. Multimedia data and information must be stored in a disk file using formats similar to image file formats. Multimedia formats, however, are much more complex than most other file formats because of the wide variety of data they must store. Such data includes text, image data, audio and video data, computer animations, and other forms of binary data, such as Musical Instrument Digital Interface (MIDI), control information, and graphical fonts. Typical multimedia formats do not define new methods for storing these types of data. Instead, they offer the ability to store data in one or more existing data formats that are already in general use. For example, a multimedia format may allow text to be stored as PostScript or Rich Text Format (RTF) data rather than in conventional ASCII plain-text format. Still-image bitmap data may be stored as BMP or TIFF files rather than as raw bitmaps. Similarly, audio, video, and animation data can be stored using industry-recognized formats specified as being supported by that multimedia file format.

2.4 By Source

2.4.1 Websites/Web Portals: A website is a collection of HTML and subordinate documents on the World Wide Web (WWW) that is typically accessible from the same URL, resides on the same server, and forms a coherent, usually interlinked whole. There are various types of websites on the WWW. Some of them are as follows:

Personal websites: A personal portal is a site on the World Wide Web that typically provides personalized capabilities to its visitors and a pathway to other content. Information, news, and updates are examples of content delivered through such a portal. A personal portal can relate to any specific topic, such as information about friends on a social network, or links to outside content that may help others with matters beyond the services you offer. Portals are not limited to providing links: any information or content placed on the Internet creates a portal, or a path to new knowledge and capabilities.
Regional websites: Regional portals contain local information such as weather forecasts, street maps and local business information. “Local content – global reach” portals have emerged in countries like Korea (Naver), India (Rediff), China (Sina.com), Romania, Greece (in.gr), etc. Such portals reach out to the widespread diaspora across the world.
News websites: Traditional media houses around the world are fast adapting to new technologies and have launched news portals. These new media channels give them the opportunity to reach viewers in a shorter span of time than their print media counterparts. E.g. New Delhi Television (NDTV) [http://www.ndtv.com/], The Times of India [http://timesofindia.indiatimes.com/], The Hindu [http://www.thehindu.com].
Government websites: Many governments have committed to creating portal sites for their citizens. In the United States the main portal is USA.gov, in English. The official web portal of the European Union is Europa [http://europa.eu/]. Europa links to all EU agencies and institutions in addition to press releases and audiovisual content from press conferences.
Corporate websites: Corporate intranets became common during the 1990s. Portal solutions can also include workflow management, collaboration between work groups, and policy-managed content publication. Most can allow internal and external access to specific corporate information using secure authentication or single sign-on. Corporate Portals also offer customers & employees self-service opportunities.
Wiki sites: sites which users collaboratively edit (such as Wikipedia [http://wikipedia.org]).
Blogs (web logs): sites generally used to post online diaries, which may include discussion forums (e.g., Blogger [http://blogger.com]).
Archive sites: sites used to preserve valuable electronic content threatened with extinction. Two examples are: Internet Archive [http://www.archive.org/], which since 1996 has preserved billions of old (and new) web pages; and Google Groups [http://groups.google.com/], which in early 2005 was archiving over 845,000,000 messages posted to Usenet news/discussion groups.

2.4.2 Digital Libraries: A digital library is a library in which collections are stored in digital formats (as opposed to print, microform, or other media) and accessible by computers. The digital content may be stored locally, or accessed remotely via computer networks. A digital library is a type of information retrieval system. The DELOS Digital Library Reference Model defines a digital library as: “An organization, which might be virtual, that comprehensively collects, manages and preserves for the long term rich digital content, and offers to its user communities specialized functionality on that content, of measurable quality and according to codified policies.”

2.4.2.1 Types of Digital Library

1. Academic repositories: Many academic libraries are actively involved in building institutional repositories of the institution’s books, papers, theses, and other works which can be digitized or were ‘born digital’. Many of these repositories are made available to the general public with few restrictions, in accordance with the goals of open access. Institutional, truly free, and corporate repositories are often referred to as digital libraries. Examples: Librarians’ Digital Library (LDL) [https://drtc.isibang.ac.in]; Digital Library of India [http://dli.iiit.ac.in/]

2. Digital archives: These differ from libraries in several ways. Traditionally, archives were defined as: 1. Containing primary sources of information (typically letters and papers directly produced by an individual or organization) rather than the secondary sources found in a library (books, etc.); 2. Having their contents organized in groups rather than as individual items. Whereas books in a library are catalogued individually, items in an archive are typically grouped by provenance (the individual or organization who created them) and original order (the order in which the materials were kept by the creator); 3. Having unique contents. Whereas a book may be found at many different libraries, depending on its rarity, the records in an archive are usually one-of-a-kind, and cannot be found or consulted at any location other than the archive that holds them. Example: Oxford Text Archive [http://www.ota.ox.ac.uk/]

2.4.3 Networked Digital Libraries: An extension of the digital library environment is the Networked Digital Library (NDL), in which digital libraries with similar content, clientele and services form a network to give integrated services to users at all nodes. This multiplies the benefits and impact of each individual digital library.

NDLs are basically the combined efforts of a few member institutions or projects that bring together their resources; hence the basic component is the collection resulting from such co-operative efforts. The functional component covers sharing the work involved in making such a network successful, and concerns the members, policy makers, administration and management. The core comprises the actual data, formats, hardware solutions, software solutions, interfaces, and the management of protocols and upgrades as and when required. Digital libraries are in themselves complex systems, and networked digital libraries face additional challenges such as maintaining standards in data representation, retrieval and transfer. In addition, they have to cope with technological disparities and work with changing network architectures and protocols. Since NDLs are complicated systems involving all sorts of physical and logical components, sufficient effort and time should be dedicated to their planning and implementation.

The networked digital library concept is intended for resource sharing among digital libraries of similar interests and content. The National Science Digital Library (NSDL) [http://nsdl.org/] and the Networked Digital Library of Theses and Dissertations (NDLTD) [http://www.ndltd.org/] are examples of NDLs which have integrated several smaller digital library projects to holistically serve communities across geographical and institutional boundaries.

2.4.4 Institutional Repositories: An Institutional Repository is an online locus for collecting, preserving, and disseminating — in digital form — the intellectual output of an institution, particularly a research institution. For a university, this would include materials such as research journal articles, before (preprints) and after (postprints) undergoing peer review, and digital versions of theses and dissertations, but it might also include other digital assets generated by normal academic life, such as administrative documents, course notes, or learning objects. There are currently almost 1300 repositories around the world. Over the past three years the number has been growing at an average rate of one per day. The statistics on numbers and where they are can be found in the Registry of Open Access Repositories (ROAR: http://roar.eprints.org/) and in the Directory of Open Access Repositories (OpenDOAR http://www.opendoar.org/). There is also a mapped representation at Repository66 (http://maps.repository66.org/).

A repository has the following purposes and benefits for an institution:

Opens up the outputs of the university to the world
Maximises the visibility and impact of these outputs as a result
Showcases the university to interested constituencies – prospective staff, prospective students and other stakeholders
Collects and curates digital outputs
Manages and measures research and teaching activities
Provides a workspace for work-in-progress, and for collaborative or large-scale projects
Enables and encourages interdisciplinary approaches to research
Facilitates the development and sharing of digital teaching materials and aids
Supports student endeavours, providing access to theses and dissertations and a location for the development of e-portfolios

Examples: Electronic Theses and Dissertations of Indian Institute of Science, Bangalore, India [http://etd.ncsi.iisc.ernet.in/]; Digital Knowledge Repository of Central Drug Research Institute (CDRI), Lucknow, India. [http://dkr.cdri.res.in:8080/dspace/index.jsp]

3. Discovery Tools for virtual information resources

3.1 Search Engines

A search engine is an information retrieval system designed to help find information stored on a computer system. The search results are usually presented in a list and are commonly called hits. The most common example of a search engine is a Web search engine which searches for information on the World Wide Web (discussed in the next section). A search engine is a set of programs which are used to search for information within a specific realm and collate that information in a database. People often use this term in reference to an Internet/Web search engine, a search engine which is specifically designed to search the Internet/Web, but search engines can also be devised for offline content, such as a library catalog, the contents of a personal hard drive, or a catalog of museum collections. Search engines help people to organize and display information in a way which makes it readily accessible. Search engine software is part of many applications like Digital Library Software, Integrated Library Management Systems (ILMS), etc.

3.2 Web Search Engines

Web search engines are an important tool for locating Internet based resources. They are useful for general Internet searching, particularly if one is not familiar with the subject area in which one is searching. The hits may consist of web pages, images and other types of files. Web search engines work by storing information about many web pages, which they retrieve from the HTML itself. These pages are retrieved by a Web crawler (also known as a spider), an automated Web browser which follows every link on the site. The contents of each page are then analyzed to determine how it should be indexed (for example, words are extracted from the titles, headings, or special fields called meta tags). Data about web pages are stored in an index database for use in later queries.
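The indexing step described above (extracting words from titles, headings and meta tags, then storing them in an index) can be sketched with Python's standard `html.parser` module. This is a simplified illustration, not the method of any particular search engine: the class and function names, the choice of indexed tags, and the two sample pages under example.org are all assumptions, and no real crawling over the network is performed.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser


class PageIndexer(HTMLParser):
    """Collect words from the title, h1-h3 headings and meta keywords of one page."""

    FIELDS = {"title", "h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.words = set()
        self._inside = None  # indexed tag currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag in self.FIELDS:
            self._inside = tag
        elif tag == "meta":
            attributes = dict(attrs)
            if (attributes.get("name") or "").lower() == "keywords":
                self._add(attributes.get("content") or "")

    def handle_endtag(self, tag):
        if tag == self._inside:
            self._inside = None

    def handle_data(self, data):
        if self._inside:
            self._add(data)

    def _add(self, text):
        self.words.update(re.findall(r"[a-z0-9]+", text.lower()))


def build_index(pages):
    """Map each word to the set of URLs whose indexed fields contain it."""
    index = defaultdict(set)
    for url, markup in pages.items():
        parser = PageIndexer()
        parser.feed(markup)
        parser.close()
        for word in parser.words:
            index[word].add(url)
    return index


# Hypothetical pages standing in for what a crawler would have fetched.
pages = {
    "http://example.org/a": (
        "<html><head><title>Digital Libraries</title>"
        "<meta name='keywords' content='repository, theses'></head>"
        "<body><h1>Open access</h1><p>Body text is ignored here.</p></body></html>"
    ),
    "http://example.org/b": (
        "<html><head><title>Library catalogues</title></head>"
        "<body><h2>OPAC search</h2></body></html>"
    ),
}

index = build_index(pages)
print(sorted(index["opac"]))
```

A later query for a word is then a lookup in `index` rather than a fresh scan of every page, which is why the index database is built ahead of time.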

There are thousands of search engines available on the Internet. Choosing one over the other is a matter of evaluation. The evaluation criteria for web search engines are:

The contents of the database created by the search engine’s spider or crawler are a crucial factor determining whether or not the user will succeed in finding relevant information.
Size is another important criterion. How many Web pages has the spider visited, scanned, and recorded in the database? Some of the larger search engines have databases covering over three billion Web pages, while the databases of smaller search engines cover half a billion or fewer.
Up-to-dateness of the search engine’s database: The Web is constantly changing and growing. New sites appear, old sites vanish, and existing sites modify their content. Unless the spider sent out by the search engine can keep up with these changes, the information recorded in the database will become out of date.
The ranking or relevance algorithm used by the search engine determines whether the most relevant hits appear towards the top of the results list.

Examples of Web Search Engines:

Google (http://www.google.com)
Yahoo! (http://www.yahoo.com)
Ask (http://www.ask.com)
AllTheWeb (http://www.alltheweb.com)
AltaVista (http://www.altavista.com)
Live Search (http://www.live.com/)

3.3 Meta-Search Engines

A meta-search engine is a search tool that sends user requests to several other search engines and/or databases and aggregates the results into a single list or displays them according to their source. Metasearch engines enable users to enter search criteria once and access several search engines simultaneously. Metasearch engines operate on the premise that the Web is too large for any one search engine to index it all and that more comprehensive search results can be obtained by combining the results from several search engines. This also may save the user from having to use multiple search engines separately.

Metasearch engines create a virtual database. They do not compile a physical database or catalogue of the web. Instead, they take a user’s request, pass it to several other heterogeneous databases and then compile the results in a homogeneous manner based on a specific algorithm.

No two metasearch engines are alike. Some search only the most popular search engines while others also search lesser-known engines, newsgroups, and other databases. They also differ in how the results are presented and the quantity of engines that are used. Some will list results according to search engine or database. Others return results according to relevance, often concealing which search engine returned which results. This benefits the user by eliminating duplicate hits and grouping the most relevant ones at the top of the list.
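The merging and de-duplication of results from several engines can be illustrated with a small Python sketch. The scoring rule used here (summing 1/rank across engines) is only one illustrative choice and is not claimed to be the algorithm of any actual metasearch engine; the engine names and URLs are made up.

```python
from collections import defaultdict


def merge_results(results_by_engine):
    """Merge ranked URL lists from several engines into one list without duplicates.

    Each URL scores the sum of 1/rank over the engines that returned it,
    so pages ranked highly by several engines rise to the top.
    """
    scores = defaultdict(float)
    sources = defaultdict(list)
    for engine, urls in results_by_engine.items():
        for rank, url in enumerate(urls, start=1):
            scores[url] += 1.0 / rank
            sources[url].append(engine)
    # Highest score first; ties are broken alphabetically by URL.
    ranked = sorted(scores, key=lambda url: (-scores[url], url))
    return [(url, sources[url]) for url in ranked]


# Hypothetical result lists, as two underlying engines might return them.
results = {
    "engine_a": ["http://example.org/1", "http://example.org/2"],
    "engine_b": ["http://example.org/2", "http://example.org/3"],
}

merged = merge_results(results)
for url, engines in merged:
    print(url, engines)
```

Keeping the `sources` list alongside each URL supports both presentation styles mentioned above: results can be shown by relevance alone, or with the contributing engines displayed.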

Examples of Meta Search Engines:

Yippy (http://yippy.com/)
Dogpile (www.dogpile.com)
SurfWax (www.surfwax.com)
Copernic Agent (www.copernic.com)

3.4 Web Directories

A web directory is a directory on the World Wide Web. It specializes in linking to other web sites and categorizing those links. A web directory is not a search engine and does not display lists of web pages based on keywords; instead, it lists web sites by category and subcategory. Most web directory entries are compiled by humans rather than found by web crawlers. The categorization is usually based on the whole web site rather than one page or a set of keywords, and sites are often limited to inclusion in only a few categories. Web directories often allow site owners to submit their site directly for inclusion, and have editors review submissions for fitness. Most directories are very general in scope and list websites across a wide range of categories, regions and languages. There are also specialized directories which focus on restricted regions, single languages, or specialist sectors. One common type of specialized directory is the shopping directory, which specializes in the listing of retail e-commerce sites. Examples of well known general web directories are:

Yahoo! Directory (http://dir.yahoo.com/)
Open Directory Project (ODP) (http://dmoz.org/). ODP is significant due to its extensive categorization, its large number of listings, and its free availability for use by other directories and search engines.
World Wide Web Virtual Library (VLIB) (http://vlib.org/) – The oldest directory of the Web.

Examples of specialist web directories are:

1. Biographicon (http://www.biographicon.com/) – a directory of biographical entries

2. Business.com (http://www.business.com/) – Business directory which charges a fee for review and operates as a Pay per click search engine.

3. VFunk (http://www.vfunk.com/) – online directory that specializes in listing and categorizing global dance music and urban lifestyle listings.

3.5 Subject Gateways

Subject gateways, also known as Information Gateways or Subject Portals, are online services and sites that provide searchable and browseable catalogues of Internet-based resources. Subject gateways typically focus on a related set of academic subject areas. A subject gateway can be defined as “a collection of directory entries of web resources which are carefully selected by human experts stating explicit quality criteria, with an abstract describing content and value, providing good subject access”. Subject gateways do for online information resources what librarians do for books, i.e., they help users locate relevant and high-quality resources on the Internet. They act as information intermediaries by serving specific user groups, identifying information needs and building targeted, quality collections.

Characteristics of Subject Gateways

Resource discovery guides that provide links to information resources (documents, collections, sites, or services)

Subject-based

Characterized by resource description and subject classification

Quality control

Draw on LIS expertise

Are developed cheaply and are well-suited to cooperation
Support academics – teaching, learning and research needs

Examples:

General

BUBL information service: BUBL is a UK-based interactive information service which provides links to over 12,000 internet resources in a wide range of subject areas. Initially designed as a resource for librarians, it includes a directory of UK organisations and institutions, job postings, user group links, surveys and comprehensive archives. BUBL provides links to current editions of all major UK newspapers, as well as abstracts and selected full text from over 200 journals.
Librarians’ Internet Index: A searchable, annotated subject directory of Internet resources selected and evaluated by librarians for their usefulness to users. The site features an extensive directory of clickable subject topics. Users can also locate relevant web sites by using a detailed subject heading index or by using the keyword search box. Also offers a free e-mail subscription to a current awareness service which features short annotations to the top 20 web resources added each week.
Pinakes: a subject launchpad: Provides access to Internet resources, by linking to the major subject gateways. Includes links to resources on art, chemistry, libraries, education, and other topics.
E-Print network: Provides access to electronic preprints available from diverse sites. It is a searchable gateway to preprint servers that deal with scientific and technical disciplines.
WWW Virtual Library: Catalog of Internet resources. Indexes cover a broad range of subject areas. Each entry within a subject area links to the appropriate Internet resource and is accompanied by a brief description and a relevancy ranking.
Infomine: A Web resource featuring well organized access to important university level research and educational tools on the Internet.
Renardus: Integrated search and browse access to records from individual subject gateway services across Europe. It provides a source of selected, high quality Internet resources for those teaching, learning and researching in higher education in Europe.

Social Sciences and Humanities

HUSSARR: The French HUSSARR site is a subject gateway to information resources in the study of French humanities and social studies (excluding law). Searching is by author, title, subject and keyword.
intute: social sciences : (from 1998-) A freely available Internet service which aims to provide a trusted source of selected, high quality Internet information for students, academics, researchers and practitioners in the social sciences, business and law.
Voice of the Shuttle: Made publicly accessible in 1995, this site began as an introduction to the Web for humanists at the University of California, Santa Barbara. Its mission has been to provide a structured and briefly annotated guide to online resources in the humanities and associated disciplines.

Biological Sciences and Medicine

BIOME: Web site providing access to a searchable catalogue of Internet sites and resources covering the health and life sciences. Options include using the search box for Internet resources in the whole of the health and life sciences, or using one of the five subject-specific gateways.
CancerNet: A comprehensive gateway to recent and accurate information from the National Cancer Institute (U.S.), covering: types of cancer, treatment options, clinical trials, genetics, causes and risk factors, support resources and a dictionary of terminology.
Genome gateway: This web site is a special section of the Nature Genome gateway. It provides free access to Nature Publishing Group genome material.
National Library of Medicine gateway: This gateway is a service that permits simultaneous searches in multiple retrieval systems at the U.S. National Library of Medicine (NLM). It is being developed by the Lister Hill National Center for Biomedical Communications (LHNCBC) at the National Library of Medicine (NLM), a part of the National Institutes of Health (NIH).
Netting the evidence: ScHarr web site intended to facilitate evidence-based healthcare by providing support and access to helpful organisations and useful learning resources, such as an evidence-based virtual library, software and journals.

Physical Sciences and Engineering

EEVL – The Internet Guide to Engineering, Mathematics and Computing. Provides access to quality networked engineering, mathematics and computing resources. Created and run by a team of information specialists from a number of universities and institutions in the UK.
Gesource: Web site providing access to internet resources for students, researchers and practitioners in geography and the environment through five distinct subject gateways: Environment, General Geography, Human Geography, Physical Geography, and Techniques and Approaches.

PSIgate – Physical Sciences Information Gateway. Provides access to internet resources for students, researchers and practitioners in the physical sciences, specifically in: astronomy, chemistry, earth sciences, physics, and science history and policy.

3.6 Metadata Harvesters

A metadata harvester is a type of software that accumulates and indexes metadata and provides a searchable, web-based interface to it. One popular example is the PKP Open Archives Harvester, now called the Open Harvester System (OHS). OHS is a free metadata indexing system developed by the Public Knowledge Project through its federally funded efforts to expand and improve access to research. It allows the creation of a searchable index of the metadata from Open Archives Initiative (OAI)-compliant archives, such as sites using Open Journal Systems (OJS) or Open Conference Systems (OCS), or digital libraries hosted on open source software such as DSpace, EPrints, etc. The latest OHS version 2.x includes the following features:

Ability to harvest OAI metadata in a variety of schemas (including unqualified Dublin Core (DC), the PKP (Open Journal Systems/Open Conference Systems) Dublin Core extension, MODS, and MARCXML). Additional schemas are supported via plugins.
Flexible search interface that allows simple searching and advanced searching using crosswalk fields from all harvested archives.
Advanced searching of archives that share the same schema will be possible using fields as defined in the schema.
When creating crosswalks for searching, admins can define elements like text, date, or HTML multiple select interface widgets.
User Interface with Cascading Style Sheets (CSS) and template-based HTML for easy customization.
Searching is highly scalable (creates an inverted index for searching).

The PKP OA Harvester allows any institution to create its own metadata harvester, which can be focused specifically on gathering information from or for its research community. Examples:

DRTC’s Search Digital Libraries (http://drtc.isibang.ac.in/sdl/)
University of Glasgow’s Harvester service for ePrints (http://daedalus.lib.gla.ac.uk:83/pkpharvester/harvester/index.php)
Canadian Association of Research Libraries Harvester (http://carl-abrc-oai.lib.sfu.ca/)
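
The harvesting and indexing idea described in this section can be sketched in a few lines. The example below parses a small OAI-PMH ListRecords response containing unqualified Dublin Core and builds an inverted index from title words to record identifiers. It does not reproduce how OHS itself is implemented; the sample XML, the identifiers and the function name are made up for illustration, and a real harvester would fetch the XML over HTTP from an archive's OAI base URL.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Namespaces defined by OAI-PMH 2.0 and Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical response, trimmed to the elements used below.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
 <ListRecords>
  <record>
   <header><identifier>oai:example.org:1</identifier></header>
   <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
     <dc:title>Digital Libraries in India</dc:title>
    </oai_dc:dc>
   </metadata>
  </record>
  <record>
   <header><identifier>oai:example.org:2</identifier></header>
   <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
     <dc:title>Metadata for Digital Archives</dc:title>
    </oai_dc:dc>
   </metadata>
  </record>
 </ListRecords>
</OAI-PMH>"""


def build_index(xml_text):
    """Return an inverted index: lowercase title word -> set of record ids."""
    index = defaultdict(set)
    root = ET.fromstring(xml_text)
    for record in root.iterfind(".//oai:record", NS):
        identifier = record.findtext("oai:header/oai:identifier", namespaces=NS)
        for title in record.iterfind(".//dc:title", NS):
            for word in re.findall(r"\w+", (title.text or "").lower()):
                index[word].add(identifier)
    return index


index = build_index(SAMPLE)
print(sorted(index["digital"]))
```

Looking up a word in the index returns the matching records directly, without scanning every record, which is why an inverted index helps searching scale.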

3.7 Federated Searching

Federated search is the simultaneous search of multiple online databases or web resources and is an emerging feature of automated, web-based library and information retrieval systems. It is also often referred to as a portal or a federated search engine. Federated search provides a single search interface to numerous underlying data sources. This reduces the burden on the searcher, who needs no knowledge of each individual search interface, or even of the existence of the individual data sources being searched.

As described by Peter Jacso (2004), federated searching consists of:

transforming a query and broadcasting it to a group of disparate databases or other web resources, with the appropriate syntax,
merging the results collected from the databases,
presenting them in a succinct and unified format with minimal duplication, and
providing a means, performed either automatically or by the portal user, to sort the merged result set.

Federated search portals, either commercial or open access, generally search public access bibliographic databases, public access Web-based library catalogues (OPACs), Web-based search engines like Google, and/or open-access, government-operated or corporate data collections. Each of these information sources sends back to the portal’s interface a list of results for the search query. The user can then review this results list.
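
The four steps listed above can be sketched as follows. This is only an illustration under stated assumptions: the two in-memory "sources", their query syntaxes (`ti=` and `title:`), and all function names are hypothetical stand-ins for real remote databases, and the duplicate rule (same title ignoring case, same year) is one simple choice among many.

```python
from concurrent.futures import ThreadPoolExecutor

# Two hypothetical sources, each a list of (title, year) records.
CATALOGUE = [("Digital Libraries", 2004), ("Web Searching", 1999)]
REPOSITORY = [("digital libraries", 2004), ("Metadata Basics", 2010)]


def search_catalogue(query):
    """Pretend catalogue whose syntax expects 'ti=<term>'."""
    term = query.split("=", 1)[1].lower()
    return [r for r in CATALOGUE if term in r[0].lower()]


def search_repository(query):
    """Pretend repository whose syntax expects 'title:<term>'."""
    term = query.split(":", 1)[1].lower()
    return [r for r in REPOSITORY if term in r[0].lower()]


# Step 1: each source gets the query rewritten in its own syntax.
SOURCES = [
    (search_catalogue, lambda term: f"ti={term}"),
    (search_repository, lambda term: f"title:{term}"),
]


def federated_search(term):
    # Step 1 (continued): broadcast the translated queries in parallel.
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(search, translate(term))
                   for search, translate in SOURCES]
        result_sets = [future.result() for future in futures]
    # Steps 2 and 3: merge, dropping duplicates that differ only in case.
    merged = {}
    for results in result_sets:
        for title, year in results:
            merged.setdefault((title.lower(), year), (title, year))
    # Step 4: sort the merged set, here newest first and then by title.
    return sorted(merged.values(), key=lambda r: (-r[1], r[0].lower()))


hits = federated_search("digital")
print(hits)
```

The user types one term and never sees the two source-specific syntaxes, which is the reduction in burden that federated search is meant to provide.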

Some examples of Federated Search Engines are:

Federated Search Engine for Library and Information Science hosted by Documentation Research and Training Centre (DRTC), Bangalore, India. It uses dbWiz (http://researcher.sfu.ca/dbwiz), an open source software.

WorldWideScience (http://worldwidescience.org/), hosted by the U.S. Department of Energy’s Office of Scientific and Technical Information. WorldWideScience is composed of more than 40 information sources, several of which are federated search portals themselves.

Science.gov (http://www.science.gov/) federates more than 30 information sources representing most of the R&D output of the U.S. Federal government. Science.gov returns its highest ranked results to WorldWideScience, which then merges and ranks these results with the search returned by the other information sources that comprise WorldWideScience.

Another application Sesam (http://www.sesam.se/, http://www.sesam.no/) running in both Norway and Sweden has been built on top of an open source platform specialised for federated search solutions. Sesat, an acronym for Sesam Search Application Toolkit, is a platform that provides much of the framework and functionality required for handling parallel and pipelined searches and displaying them elegantly in a user interface, allowing engineers to focus on the index/database configuration tuning.

4. Search Techniques

Searching for relevant information is not as easy a task as it seems. It is worthwhile to formulate an efficient search strategy before embarking upon this task. Starting a search without a strategy or plan can result in irrelevant results, wasted time, and frustration. A search strategy is a systematic plan for conducting a search. The process of planning a search strategy will help clarify the searcher’s thinking about his/her topic, and ensure that he/she is looking for information appropriate to his/her task. The steps involved in a search strategy are:

Make sure you fully understand the question/topic

Identify keywords and phrases
Identify alternative terms
synonyms (E.g. mobile telephones, cellular telephones)
plural/singular forms (E.g. women, woman)
spelling variations (E.g. behaviour, behavior)
variations of a root word (E.g. feminism, feminist, feminine)
acronyms (E.g. chief executive officer, CEO)
Create the search statement
Start searching
Evaluate the search results
Refine search (too many results: narrow down/restrict the search statement; too few results: broaden the search statement; no results: try using the index or dictionary of the retrieval system)
Save search statements
Take references
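
The step "Create the search statement" can be illustrated with a small sketch. It assumes a retrieval system that accepts AND, OR, parentheses and double-quoted phrases; the function name and the choice of concepts are illustrative only, and the alternative terms are taken from the examples in the list above.

```python
def build_search_statement(concepts):
    """Combine concept groups into one Boolean search statement.

    Alternative terms for the same concept are joined with OR; the
    concepts themselves are joined with AND. Multi-word terms are quoted
    so that they are treated as phrases.
    """
    groups = []
    for alternatives in concepts:
        quoted = [f'"{term}"' if " " in term else term for term in alternatives]
        groups.append("(" + " OR ".join(quoted) + ")")
    return " AND ".join(groups)


# Each inner list holds one keyword together with its alternative terms.
concepts = [
    ["mobile telephones", "cellular telephones"],
    ["behaviour", "behavior"],
]
statement = build_search_statement(concepts)
print(statement)
```

Saving the generated statement, as the steps recommend, makes it easy to rerun or refine the same search later.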

4.1 Information Search Process

The Information Search Process (ISP) is a six-stage process that information seekers go through when seeking information. ISP was first suggested by Carol Kuhlthau in 1991. The six stages of ISP are as follows: Stage 1: Initiation, Stage 2: Selection, Stage 3: Exploration, Stage 4: Formulation, Stage 5: Collection, Stage 6: Presentation.

During the first stage, initiation, the information seeker has a topic for which they need information. As they think more about the topic, they may discuss the topic with others and brainstorm the topic further.

In the second stage, selection, the individual begins to decide where to get the information needed. Some information retrieval may occur at this point.
In the third stage, exploration, information on the topic is gathered. During this stage, new personal knowledge is created.
During the fourth stage, formulation, the information seeker starts to evaluate the information that has been gathered. At this point, a focus begins to form and there is not as much confusion and uncertainty as in earlier stages. Formulation is considered to be the most important stage of the process.
During the fifth stage, collection, the information seeker knows what is needed to support the focus. At this point, the search is more effective because the focus is clear.
In the sixth and final stage, presentation, the individual has completed the information search. Now the information seeker will summarize and report on the information that was found through the process.

4.2 Search Techniques

Now we will look into the various search techniques that can be used while searching for information on various information sources like Digital Libraries, Institutional Repositories, Subject Gateways, etc. Basically, the search techniques one can use will depend on the search algorithms and facilities provided by the backend search engine software. Hence, it is always a good idea to first go through the online help facility, if available, of the information source one is searching on. Some of the search techniques are discussed below.

Keyword/Phrase Search:

In this search technique, a single keyword or a combination of more than one keyword (a phrase) is used. This retrieves all items in which the keyword(s) appear. In some search engines, phrases are enclosed in double quotes. For example,

Keyword/Exact Term: Information

Phrase Search: “information retrieval”
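
The difference between the two can be shown with a minimal sketch. It assumes a very simple matcher that splits text into lowercase words; real search engines use indexes rather than scanning each document, and the sample documents and function names are made up for this example.

```python
import re


def tokenize(text):
    """Split text into lowercase words, ignoring punctuation."""
    return re.findall(r"\w+", text.lower())


def keyword_match(document, keyword):
    """True if the keyword occurs anywhere in the document."""
    return keyword.lower() in tokenize(document)


def phrase_match(document, phrase):
    """True if the words of the phrase occur together, in the same order."""
    words, target = tokenize(document), tokenize(phrase)
    if not target:
        return False
    n = len(target)
    return any(words[i:i + n] == target for i in range(len(words) - n + 1))


docs = ["Information retrieval systems", "Retrieval of information"]
print([keyword_match(d, "Information") for d in docs])
print([phrase_match(d, "information retrieval") for d in docs])
```

Both documents contain the keyword, but only the first contains the two words side by side in the given order, so only it satisfies the phrase search.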

Boolean Searches:

Boolean searches are based on Boolean algebra, which George Boole developed in his 1854 book An Investigation of the Laws of Thought. There are three Boolean operators: AND, OR and NOT. ‘OR’ is the default operator in many search engines, i.e., it is the operator applied when terms are entered without an explicit operator.

AND operator: Retrieves those documents having all the keywords joined by AND operator. Sometimes leads to no results. For example,

internet AND libraries

“library science” AND “information science”

OR operator: Retrieves documents having either of the keywords joined by OR. Sometimes leads to too many hits.

internet OR libraries

“library science” OR “information science”

NOT operator: Eliminates those documents carrying the term specified by NOT. For example,

internet AND libraries OR archives NOT museums

“library science” NOT “information science”
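
The three operators correspond directly to set operations on the documents that contain each term. The sketch below assumes a tiny in-memory collection; the document texts and the helper name `postings` are illustrative.

```python
# A hypothetical collection: document id -> text.
DOCS = {
    1: "internet use in libraries",
    2: "internet archives",
    3: "libraries and museums",
}


def postings(term):
    """Return the set of document ids whose text contains the term."""
    return {doc_id for doc_id, text in DOCS.items() if term in text.split()}


internet = postings("internet")
libraries = postings("libraries")
museums = postings("museums")

and_hits = internet & libraries   # internet AND libraries
or_hits = internet | libraries    # internet OR libraries
not_hits = libraries - museums    # libraries NOT museums

print(and_hits, or_hits, not_hits)
```

AND narrows the result to the intersection, OR widens it to the union, and NOT removes every document containing the excluded term.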

Truncation/wildcards/stemming:

Truncation search makes it possible to retrieve documents carrying different variants or forms of the same root word. There are basically three kinds of truncation searches:

Suffix/Right truncation (retriev* = retrieve, retrieving, retrieval, retrieves)
Prefix/Left truncation (*rogen= hydrogen, nitrogen, etc)
Embedded/Middle truncation (colo*r= color, colour, analy*e= analyse, analyze)

Wildcard symbols vary from one system to another, for example:

pierc* pierce

pierc# piercing

pierc? pierced
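
One way to picture how truncation works is to translate the truncated term into a regular expression and test it against a vocabulary. The sketch assumes that `*` stands for zero or more word characters; as noted above, the actual symbol and its meaning differ between systems, and the vocabulary here is invented.

```python
import re


def truncation_to_regex(pattern):
    """Turn a truncated term such as 'retriev*' into a compiled regex.

    '*' is assumed to stand for zero or more word characters.
    """
    escaped = re.escape(pattern).replace(r"\*", r"\w*")
    return re.compile(escaped, re.IGNORECASE)


def expand(pattern, vocabulary):
    """Return the vocabulary words matched in full by the pattern."""
    regex = truncation_to_regex(pattern)
    return [word for word in vocabulary if regex.fullmatch(word)]


vocabulary = ["retrieve", "retrieval", "retrieving", "hydrogen",
              "nitrogen", "color", "colour", "collector"]

print(expand("retriev*", vocabulary))   # right truncation
print(expand("*rogen", vocabulary))     # left truncation
print(expand("colo*r", vocabulary))     # middle truncation
```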

Field Specific Search:

Field search allows the user to restrict the search term to a specific field, such as title, author or subject. This search facility is generally found in bibliographic and full-text databases, digital libraries and institutional repositories. For example,

author:ranganathan

title:web

keyword:ocr

abstract:digital

mimetype:msword (mimetype indicates the file format)

sponsor:ALA
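
A field-specific query can be pictured as a filter applied to one field of each record. The sketch below assumes records stored as dictionaries and a simple substring match; the sample records and the function name are illustrative, and real systems match against indexed field values instead.

```python
# Sample records; field names follow the examples above.
RECORDS = [
    {"author": "Ranganathan, S. R.", "title": "The Five Laws of Library Science"},
    {"author": "Berners-Lee, T.", "title": "Weaving the Web"},
]


def field_search(records, query):
    """Match a 'field:term' query against one field of each record.

    The term is compared case-insensitively as a substring of the field.
    Records that lack the field never match.
    """
    field, term = query.split(":", 1)
    term = term.strip().lower()
    return [r for r in records if term in r.get(field.strip(), "").lower()]


print(field_search(RECORDS, "author:ranganathan"))
print(field_search(RECORDS, "title:web"))
```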

Group search/Nesting:

In a group search, also called nesting, search terms can be grouped into sets using parentheses, and operators can then be applied to each whole set in the search query, e.g.

(interactive resources OR learning objects) AND (Geography)

Nesting also makes it possible to include different variants of a word, e.g.

Child AND (behaviour OR behavior)
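
Parentheses turn a query into a tree: the inner group is evaluated first, and its result is then combined with the outer terms. The sketch below assumes the query has already been parsed into nested tuples (the parsing step itself is omitted), and the function name and sample word sets are illustrative.

```python
def evaluate(node, words):
    """Evaluate a nested query against a set of lowercase words.

    A node is either a plain term (str) or a tuple
    (operator, operand, operand, ...) with operator AND, OR or NOT.
    """
    if isinstance(node, str):
        return node.lower() in words
    operator, *operands = node
    results = [evaluate(operand, words) for operand in operands]
    if operator == "AND":
        return all(results)
    if operator == "OR":
        return any(results)
    if operator == "NOT":
        return not results[0]
    raise ValueError(f"unknown operator: {operator}")


# Child AND (behaviour OR behavior)
query = ("AND", "child", ("OR", "behaviour", "behavior"))

print(evaluate(query, {"child", "behavior", "study"}))
print(evaluate(query, {"child", "psychology"}))
```

The first document matches because it contains "child" and one of the two spellings; the second fails because neither spelling is present.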

Field Grouping:

In some search engines, Group search can be combined with field search also. Parentheses can be used to group multiple clauses to a single field, for example,

title:(+“interactive resources” +Geography)

Range specific search:

Here the user can specify the range of certain field values. Example: 1956-1985 (date), A-S (alphabetic range), etc. For example,

Search query 1:

author:[twain to wilde]

Retrieves all documents from twain to wilde INCLUDING twain and wilde

Search query 2:

author:{twain to wilde}

Retrieves all documents from twain to wilde but EXCLUDING the ones containing the terms twain and wilde
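
The difference between the two queries comes down to whether the end points themselves are allowed. A minimal sketch, assuming a case-insensitive alphabetical comparison and an invented list of author names:

```python
def in_range(value, low, high, inclusive=True):
    """Check whether value falls alphabetically between low and high.

    inclusive=True corresponds to the [low to high] form and
    inclusive=False to the {low to high} form shown above.
    """
    value, low, high = value.lower(), low.lower(), high.lower()
    if inclusive:
        return low <= value <= high
    return low < value < high


authors = ["Shaw", "Twain", "Verne", "Wilde", "Yeats"]

inclusive_hits = [a for a in authors if in_range(a, "twain", "wilde")]
exclusive_hits = [a for a in authors
                  if in_range(a, "twain", "wilde", inclusive=False)]

print(inclusive_hits)
print(exclusive_hits)
```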

Proximity Search:

In text processing, a proximity search looks for documents where two or more separately matching term occurrences are within a specified distance, where distance is the number of intermediate words or characters. In addition to proximity, some implementations may also impose a constraint on the word order, in that the order in the searched text must be identical to the order of the search query. Proximity searching goes beyond the simple matching of words by adding the constraint of proximity and is generally regarded as a form of advanced search. For example, a search for “red brick house” could also match phrases such as “red house of brick” or “house made of red brick”. By limiting the proximity, these phrases can be matched while avoiding documents where the words are scattered across a page or in unrelated articles in an anthology. For example,

ADJ (for adjacent)

social ADJ welfare (social welfare activities)

WITHIN (the words are separated by n or fewer words, in any order)

computer W/3 market (computer as a growing market)

Proximity searching can be used with other search syntax and/or controls to allow more articulate search queries. Sometimes query operators like NEAR, NOT NEAR, FOLLOWED BY, NOT FOLLOWED BY, SENTENCE or FAR are used to indicate a proximity-search limit between specified keywords: for example, “brick NEAR house”
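
A proximity check can be sketched by comparing the word positions of the two terms. The function below is an illustration, not the implementation of any particular retrieval system; it assumes distance is counted in intermediate words, with an optional word-order constraint.

```python
import re


def within(text, first, second, max_gap, ordered=False):
    """True if the two terms occur with at most max_gap words between them.

    With ordered=True, 'first' must also appear before 'second'.
    """
    words = re.findall(r"\w+", text.lower())
    first_pos = [i for i, w in enumerate(words) if w == first.lower()]
    second_pos = [i for i, w in enumerate(words) if w == second.lower()]
    for i in first_pos:
        for j in second_pos:
            gap = abs(j - i) - 1  # number of words between the two terms
            if i != j and gap <= max_gap and (not ordered or i < j):
                return True
    return False


# computer W/3 market: up to three intermediate words, any order
print(within("computer as a growing market", "computer", "market", 3))
# social ADJ welfare: adjacent and in the given order
print(within("social welfare activities", "social", "welfare", 0, ordered=True))
```

Setting `max_gap` to 0 with `ordered=True` behaves like ADJ, while a larger gap without the order constraint behaves like the WITHIN example.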

Fuzzy Search:

Fuzzy search is based on the Levenshtein distance (LD) algorithm, also called the ‘edit distance’ algorithm. Levenshtein distance is a measure of the similarity between two strings. If the source string is s and the target string is t, the distance is the minimum number of deletions, insertions, or substitutions required to transform s into t. For example:

If s = “test” and t = “test”, then LD(s,t) = 0, because no transformations are needed. The strings are already identical.

If s = “test” and t = “tent”, then LD(s,t) = 1, because one substitution (change “s” to “n”) is sufficient to transform s into t.

The Lucene Search Engine software provides the option of Fuzzy Searching. A modified version of the Lucene Search Engine is used by DSpace digital library software. For example, in DSpace the following search strings can be given,

author: sanker~ can match shanker, shankar, sankar

keyword: libary~ can match library

abstract: digitl~ can match digital
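
The distance itself can be computed with a standard dynamic-programming routine. The sketch below is a textbook version written for illustration; it does not reproduce the code used inside Lucene or DSpace, and the function name is an assumption.

```python
def levenshtein(s, t):
    """Return the minimum number of single-character insertions,
    deletions or substitutions needed to transform s into t."""
    # previous[j] holds the distance between the processed prefix of s
    # and the first j characters of t.
    previous = list(range(len(t) + 1))
    for i, s_char in enumerate(s, start=1):
        current = [i]
        for j, t_char in enumerate(t, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


print(levenshtein("test", "test"))
print(levenshtein("test", "tent"))
print(levenshtein("libary", "library"))
```

A fuzzy search can then accept any indexed term whose distance from the query term is below some threshold, which is how a misspelling such as libary can still reach library.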