5 Technical Infrastructure of a Digital Library

Jagdish Arora and Yatrik Patel

 

I.     Objective

 

The objectives of this module are to discuss and impart knowledge on the broader aspects of the technical infrastructure of a digital library, i.e., computer and network infrastructure requirements, including server-side hardware components, server-side software components, and client-side hardware and software components, as well as the role of cloud computing in digital libraries.

 

II.     Learning Outcome 

 

After going through this lesson, learners would gain knowledge about network and computing infrastructure, including network and communication devices, server-side hardware (including input devices and storage devices), and the client-side hardware and software components that are required for setting up a digital library as well as for making it accessible to end users. Learners would also gain knowledge of cloud computing and its use for enhanced library services.

 

III.   Structure

 

1.      Introduction

2.      Networks and Computing Infrastructure

2.1.     Server-side Hardware Components

2.1.1.      Input Devices

2.1.2.      Storage Devices

2.1.3.      Communication Devices

2.2.            Server-side Software Components

2.2.1.      Software Required for Content Creation

2.2.2.      Software Required for Operations of Digital Library

2.2.3.      Digital Library Software

2.3.            Client-side Hardware & Software Components

3.      Digital Libraries and Cloud Computing

4.      Summary

 

 

 

 

 

 

1.  Introduction 

 

The Internet and web technology are the principal mechanisms deployed in a digital library to search, navigate and deliver electronic resources across the globe. The primary objective of a digital library is to meet the information needs of its users. Digital libraries have to become increasingly responsive by making the most of advances in information and communication technology, which have greatly changed the way information is handled. To establish a digital library, there must be an infrastructure for managing, indexing and disseminating multimedia content. A scalable technical infrastructure needs to be carefully planned to meet the functional requirements of digital libraries. This module discusses the core infrastructure elements that can handle voluminous content and the other complexities of digital libraries.

 

Hardware, server allocation, databases and distribution approaches, network infrastructure and bandwidth considerations are key to establishing the digital library as a resource that teachers, students, researchers and the general public regard as reliable.

 

2.  Networks and Computing Infrastructure 

 

Establishing a digital library requires a great many computer (both software and hardware) and network infrastructure components that are not available off the shelf as packaged solutions. There are no turn-key, monolithic systems available for digital libraries; instead, digital libraries are collections of disparate systems and resources connected through a network and integrated within one interface, currently the web interface. The use of open architecture, standards and protocols, however, makes it possible to gather the required pieces of infrastructure, be it hardware, software or accessories, from different vendors and integrate them into a working environment. While some of the components required for establishing a digital library would be internal to the institution, several others would be distributed across the Internet, owned and controlled by a large number of independent players. The task of building a digital library, therefore, requires a great deal of integration of various components (Flecker, D., 2000).

 

A digital library implementation requires an enterprise-level technology solution that is scalable in both size and functionality, with built-in reliability, availability and serviceability (RAS) features (Wright, 2002). The storage capacity of a digital library should be scalable to accommodate its ever-growing collection without requiring the system to be redesigned or re-engineered as requirements grow. The use of open systems architecture provides a robust platform, digital library management solutions and development tools. Current servers from multiple vendors are used by several digital library implementations for their scalability and RAS features. These servers also offer high-availability features such as full hardware redundancy, fault-isolated dynamic system domains, concurrent maintenance and clustering support, along with modular storage capacity that can be added incrementally.

 

A typical digital library in a distributed client-server environment consists of hardware and software components on the server side as well as on the client side. Clients are the machines that users employ to access the digital library, while the server hosts the databases, digital objects, and browse and search interfaces that facilitate access.
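The division of labour between server and client can be illustrated with a minimal sketch that uses only the Python standard library. It is an illustration under stated assumptions, not the design of any particular digital library package: the record contents, the `/records/<id>` path and the names `RECORDS`, `LibraryHandler`, `fetch_record` and `demo` are all hypothetical.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory "collection"; a real server would query a database.
RECORDS = {
    "1": {"title": "Annual Report 2001", "format": "application/pdf"},
    "2": {"title": "Campus Photograph", "format": "image/tiff"},
}


class LibraryHandler(BaseHTTPRequestHandler):
    """Server side: answers GET /records/<id> with a JSON metadata record."""

    def do_GET(self):
        record = RECORDS.get(self.path.rsplit("/", 1)[-1])
        payload = record if record else {"error": "not found"}
        body = json.dumps(payload).encode("utf-8")
        self.send_response(200 if record else 404)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # suppress per-request logging to keep the example quiet


def fetch_record(host, port, record_id):
    """Client side: request one record over HTTP and decode the JSON reply."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request("GET", f"/records/{record_id}")
        response = conn.getresponse()
        return response.status, json.loads(response.read().decode("utf-8"))
    finally:
        conn.close()


def demo():
    """Start the server on a free local port, query it once, then stop it."""
    server = HTTPServer(("127.0.0.1", 0), LibraryHandler)  # port 0 = any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return fetch_record("127.0.0.1", server.server_port, "1")
    finally:
        server.shutdown()
        server.server_close()
```

Calling `demo()` plays both roles on one machine. In a real deployment the handler would run on the library's server and `fetch_record` would be replaced by the user's web browser.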

 

2.1  Server-side Hardware Components 

 

Servers are the heart of a digital library. Servers for a digital library implementation need to be computationally powerful, have adequate main memory (RAM) to handle the expected workload, have a large amount of secure disk storage for the database(s) and digital objects, and have adequate network bandwidth to meet communication requirements. A digital library may need a number of specialized servers so as to distribute the workload across them. It would require one or more object servers to store digital objects and other multimedia objects, an index server that maintains indices and supports searching of data stored in a distributed system and, last but not least, a rights management system to guard against unauthorized usage and handle intellectual property rights issues. However, for a smaller library, many distinct activities can be performed on a single server. It is important that the server is scalable so that additional storage, processing power or networking capabilities can be added whenever required.

 

2.1.1  Input Devices 

 

Image-based digital library implementations require input devices like scanners, digital cameras, video cameras and touch screen systems. A wide range of choices is available for these image-capturing devices. Scanners are available in all shapes and sizes. Flatbed scanners or digital cameras mounted on a book cradle are more suitable for libraries. Details on such input devices are available in modules on digitization as well as in modules under the paper ICT Applications in Libraries.

 

2.1.2  Storage Devices 

 

Since digital libraries require large amounts of storage, particular attention needs to be given to the storage solution. A digital library would require one or more servers to store raw data (images, text, video, etc.) and indices of metadata so that information can be retrieved from the digital library in the desired fashion. Digital library collections that are too large to store entirely on disk use hierarchical storage management (HSM). In an HSM system, the most frequently used data is kept on fast disks while less frequently used data is kept on nearline storage such as an automated (robotic) tape library. An HSM system can automatically migrate data from tape to disk and vice versa as required. Intelligent storage area networks (SAN) and network attached storage (NAS) are now available, in which the physical storage devices are intelligently controlled and made available to a number of servers.
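The migration between disk and tape can be sketched as a toy two-tier store. This is a simplified illustration, not how any specific HSM product works: the class name `HierarchicalStore` and its methods are hypothetical, and a least-recently-used rule is assumed here as a simple stand-in for the usage tracking of a real system.

```python
from dataclasses import dataclass, field


@dataclass
class HierarchicalStore:
    """Toy two-tier store: a small fast 'disk' tier in front of a 'tape' tier."""

    disk_capacity: int                           # max number of objects on disk
    disk: dict = field(default_factory=dict)
    tape: dict = field(default_factory=dict)
    last_used: dict = field(default_factory=dict)
    clock: int = 0                               # logical time, one tick per access

    def store(self, object_id, data):
        """New objects land on disk; older ones move to tape if disk is full."""
        self._place_on_disk(object_id, data)

    def read(self, object_id):
        """Return the object, recalling it from tape to disk when necessary."""
        if object_id not in self.disk:
            data = self.tape.pop(object_id)      # KeyError if the object is unknown
            self._place_on_disk(object_id, data)
        self._touch(object_id)
        return self.disk[object_id]

    def tier_of(self, object_id):
        """Report which tier currently holds the object (None if unknown)."""
        if object_id in self.disk:
            return "disk"
        if object_id in self.tape:
            return "tape"
        return None

    def _touch(self, object_id):
        self.clock += 1
        self.last_used[object_id] = self.clock

    def _place_on_disk(self, object_id, data):
        # Assumed policy: migrate the least recently used object to tape
        # until the disk tier has room for the incoming object.
        while len(self.disk) >= self.disk_capacity:
            victim = min(self.disk, key=self.last_used.__getitem__)
            self.tape[victim] = self.disk.pop(victim)
        self.disk[object_id] = data
        self._touch(object_id)
```

With `disk_capacity=2`, storing a third object pushes the least recently used one to the tape tier, and a later `read` of that object recalls it to disk at the expense of another object.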

 

Redundancy is another important storage consideration. In a system that is completely dependent on the interaction of various kinds and levels of hardware and software, failure in any one of the subsystems could mean the loss or corruption of the information object. Effective storage management thus means providing for redundant copies of the archived objects to ensure availability of documents in case of loss. A number of RAID (Redundant Array of Inexpensive Disks) levels are now available for greater security and performance. RAID technology distributes data across a number of disks in such a way that, depending on the RAID level, the system continues to function even if one or more disks fail while the failed component is replaced. Digital archives may also choose to make backup copies on their own or to make arrangements for other sites to serve as backup.
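The idea behind parity-based RAID levels can be shown in a few lines. The sketch below assumes a single XOR parity block per stripe, which tolerates the loss of exactly one disk; other RAID levels use mirroring or additional parity and behave differently. The function names are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def write_stripe(data_blocks):
    """Return the stripe as stored: the data blocks plus one parity block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def rebuild(stripe, failed_index):
    """Recompute the block of one failed disk from all surviving blocks."""
    survivors = [block for i, block in enumerate(stripe) if i != failed_index]
    return xor_blocks(survivors)
```

For example, after `stripe = write_stripe([b"DIGI", b"TAL ", b"LIBR"])`, the call `rebuild(stripe, 1)` recovers `b"TAL "` without reading the failed block, because XOR-ing the parity with the remaining data blocks cancels everything except the missing one.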

 

Although hard disk (fixed and removable) solutions are increasingly available at an affordable cost, optical storage devices, including CD-ROM, DVD-ROM, Blu-ray or magneto-optical devices in standalone or networked mode, are attractive alternatives for long-term storage of digital information. Optical drives record information by writing data onto the disc with a laser beam. These media offer large storage capacities.

 

2.1.3  Communication Devices 

 

Setting up a digital library requires network and communication equipment like switches, routers, hubs, repeaters, modems and other items required in a Local Area Network or to connect to the Internet. These hardware and software items are required for setting up any network and are not specific to a digital library.

 

2.2  Server-side Software Components 

 

A typical digital library requires a number of software packages to handle its highly diversified resources, activities and services. Different software packages are required to handle different components and activities of a digital library. Software required for a digital library can broadly be grouped into the following categories:

 

2.2.1  Software Required for Content Creation 

 

Document capturing software is required for scanning legacy documents that are not available as computer-processable files. Most scanners and digital cameras come with basic image capturing software. The images captured in the process may need manipulation to enhance their quality. Software like Adobe’s Photoshop or the open source GIMP (GNU Image Manipulation Program) provides image enhancement features like filters, tonal reproduction, colour management, retouching, cropping, image sharpening, contrast adjustment, transparent backgrounds, etc. Software like ABBYY FineReader provides multiple functionalities such as image capturing, image enhancement and OCR.

 

Printed text, pictures and figures captured in the process of scanning are stored in a file as a bit-mapped page image, irrespective of whether the scanned page contains a photograph, a line drawing or text. A bit-mapped page image is a type of computer graphic, literally an electronic picture of the page, which can most easily be equated to a facsimile image of the page. As such, it can be read by humans but not by computers, and the “text” in a page image is not searchable on a computer. Bit-mapped pages are converted into text files using Optical Character Recognition (OCR) software. Most document imaging software has an OCR package built in. However, OCR packages, such as ScanSoft OmniPage Professional and ABBYY FineReader, are also available as separate utilities. Acrobat Capture also has OCR built into it. Converting material already available in digital format into PDF requires the Acrobat software suite (or other conversion software).
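Because a bit-mapped page is only a grid of pixels, its size depends on the page dimensions, the scanning resolution and the colour depth, not on how much text the page carries. The following sketch computes the uncompressed size; the page size and resolution in the usage note are illustrative assumptions, and row padding and file headers are ignored.

```python
def page_image_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size in bytes of a scanned page stored as a bitmap.

    Ignores file headers and any per-row padding a real image format adds.
    """
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8
```

For an 8.5 x 11 inch page scanned at 300 dpi, `page_image_bytes(8.5, 11, 300, 1)` gives 1,051,875 bytes for a black-and-white (1-bit) image and `page_image_bytes(8.5, 11, 300, 24)` gives 25,245,000 bytes for 24-bit colour. None of those bytes encodes a character, which is why OCR is needed before the page can be searched.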

 

2.2.2   Software Required for Operations of Digital Library 

 

Like any other server, a server for a digital library requires an operating system. The de facto operating system for most digital library implementations is Unix and its variants such as Linux. As digital libraries are built around web and Internet technology, the server for a digital library requires web server software like Apache’s httpd or Microsoft’s Internet Information Services (IIS).

 

Organization of digital objects with their associated metadata requires an RDBMS package such as Oracle, MySQL, MS SQL Server or PostgreSQL, or a NoSQL package like Cassandra or MongoDB. The database management software provides structured storage and retrieval facilities for the contents of a digital library. Further, a digital library requires a search engine connected to the DBMS to support searching of the digital objects stored in it. DSpace, for example, uses the Apache Lucene search engine. Moreover, the contents of a digital library may have to be offered only to authorized users. Rights management software, such as the InterTrust Systems Developer’s Kit or Active Directory by Microsoft, facilitates the control and monitoring of access to the contents of a digital library.
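How a database keeps metadata alongside a pointer to each digital object can be sketched with SQLite, which ships with Python. The table layout, column names, sample records and function names are all hypothetical, and the SQL `LIKE` query is only a simple stand-in for a full search engine such as Lucene.

```python
import sqlite3


def build_catalogue(records):
    """Create an in-memory metadata table and load (title, creator, year, path) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE item ("
        "id INTEGER PRIMARY KEY, title TEXT, creator TEXT, "
        "year INTEGER, file_path TEXT)"
    )
    conn.executemany(
        "INSERT INTO item (title, creator, year, file_path) VALUES (?, ?, ?, ?)",
        records,
    )
    conn.commit()
    return conn


def search_titles(conn, keyword):
    """Return (title, creator, year) for titles containing the keyword, oldest first."""
    rows = conn.execute(
        "SELECT title, creator, year FROM item WHERE title LIKE ? ORDER BY year",
        (f"%{keyword}%",),
    )
    return rows.fetchall()


# Hypothetical sample records: metadata plus the location of the stored object.
SAMPLE_RECORDS = [
    ("Digital Library Architecture", "A. Author", 2004, "objects/0001.pdf"),
    ("Metadata for Images", "B. Writer", 2001, "objects/0002.tif"),
    ("Building a Digital Archive", "C. Scholar", 1999, "objects/0003.pdf"),
]
```

Calling `search_titles(build_catalogue(SAMPLE_RECORDS), "digital")` returns the two records whose titles contain the keyword. The query parameter is passed separately from the SQL text so that user input is never spliced into the statement.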

 

Since a single integrated software package from a single vendor is not available, digital library software may be a system with components added onto an open architecture framework. For example, DSpace, a popular open source digital library software package, relies on a number of software components: a web server, a DBMS (PostgreSQL or Oracle), Apache Tomcat, Apache Ant, Java, the Handle System and the Lucene search engine. Some of the important digital library software packages are described briefly below.

 

2.2.3  Digital Library Software 

 

A number of digital libraries are being constructed at present utilizing a mixture of information retrieval, media management and web server packages. All these pieces of software need to be integrated so as to present a cohesive environment and to avoid problems with growth and expansion. However, there are a few software packages that attempt to provide a number of the functions of a digital library in an integrated fashion. Some of the important software packages used in setting up a digital library are:

 

DSpace (www.dspace.org) was developed in partnership between Hewlett-Packard (HP) and MIT (Massachusetts Institute of Technology) and is maintained by the DuraSpace Foundation. DSpace, as institutional repository software, is making its mark, with an increasing number of institutions around the globe installing, evaluating and using the package. At the time of writing, the latest stable version is 4.0, available for download at http://sourceforge.net/projects/dspace/.

 

DSpace captures, stores, indexes, preserves, and redistributes the intellectual output of an institution’s research faculty in digital formats. DSpace accepts all forms of digital materials including text, images, video, and audio files. Possible content includes: articles and preprints, technical reports, working papers, conference papers, e-theses, datasets (statistical, geospatial, MATLAB, etc.), images (visual, scientific, etc.), audio files, video files, learning objects and reformatted digital library collections.

 

Greenstone Digital Library (GSDL) is a suite of software which has the ability to serve digital library collections and build new collections. It provides a new way of organizing information and publishing it on the Internet or on CD-ROM. The Greenstone Digital Library Software is produced by the New Zealand Digital Library Project at the University of Waikato, and distributed in cooperation with UNESCO and the Humanities Library Project. It is open-source software, available from http://greenstone.org under the terms of the GNU General Public License. The New Zealand Digital Library Web site (http://nzdl.org) contains numerous example collections, all created with the Greenstone software. Greenstone runs on Windows and Linux platforms. The distribution includes ready-to-use binaries for all versions of Windows and for Linux. It also includes complete source code for the system, which can be compiled using an appropriate compiler. Greenstone works with associated software that is also freely available: the Apache web server and Perl.

 

GNU E-Prints is an open source digital library software package designed primarily to create institutional repositories (http://www.eprints.org/). With its origins in the scholarly communication movement, the default configuration of E-Prints creates a research papers archive, but it can be adapted for other purposes and content. It was developed at the Electronics and Computer Science Department of the University of Southampton. GNU E-Prints is freely distributable and subject to the GNU General Public License. At the time of writing, the latest version is 3.3.12, available for download at http://files.eprints.org/. Installing the E-Prints software is relatively easy. Knowledge of MySQL (used as the backend database), the Apache web server and the Perl programming language would be helpful. The mod_perl module for Apache significantly increases the performance of Perl scripts. Complete documentation for the installation of E-Prints is available on the web site (http://wiki.eprints.org).

 

CONTENTdm from OCLC is a multimedia software suite that provides easy loading, management and access to media archives in a library. The software provides tools to assist with every phase of collection development. One can start small with a few items, and CONTENTdm can also handle databases with millions of objects. The CONTENTdm technology is based on years of university research and testing that have resulted in a proven set of programs.

 

FEDORA (Flexible Extensible Digital Object Repository Architecture) repository system (http://www.fedora.info) is an open source digital object repository system developed jointly by the University of Virginia Library and Cornell University and now maintained by the DuraSpace Foundation. The Fedora project is devoted to the goal of providing open-source repository software that can serve as the foundation for many types of information management systems. The software demonstrates how distributed digital information management can be deployed using web-based technologies, including XML and web services. Some of the important features of FEDORA include:

 

•  XML submission and storage: Digital objects are stored as XML-encoded files that conform to an extension of the Metadata Encoding and Transmission Standard (METS) schema.

•   Parameterized disseminators: Behaviors defined for an object support user-supplied options that are handled at dissemination time.

•  Access Control and Authentication: Although Advanced Access Control and Authentication are not scheduled until Phase II of the project, a simple form of access control has been added in Phase I of the project to provide access restrictions based on IP address. IP range restriction is supported in both the Management and Access APIs. In addition, the Management API is protected by HTTP Basic Authentication.

•   Default Disseminator: The Default Disseminator is a built-in internal disseminator on every object that provides a system-defined behavior mechanism for disseminating the basic contents of an object.

 

•   Searching: Selected system metadata fields are indexed along with the primary Dublin Core record for each object. The Fedora repository system provides a search interface for both full text and field-specific queries across these metadata fields.

 

•    OAI Metadata Harvesting: The OAI Protocol for Metadata Harvesting is a standard for sharing metadata across repositories. Every Fedora digital object has a primary Dublin Core record that conforms to the schema. This metadata is accessible using the OAI Protocol for Metadata Harvesting, v2.0.

 

•   Batch Utility: The Fedora repository system includes a Batch Utility as part of the Management client that enables the mass creation and loading of data objects.
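The OAI metadata harvesting feature listed above can be illustrated from the harvester's side. The sketch below builds a `ListRecords` request for Dublin Core (`oai_dc`) metadata and extracts identifiers and titles from a response. The base URL and the sample response are made up for illustration, and the function names are hypothetical; the XML namespaces are those defined by OAI-PMH v2.0 and Dublin Core.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespaces used in OAI-PMH v2.0 responses carrying Dublin Core metadata.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build the URL of an OAI-PMH ListRecords request."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


def parse_titles(xml_text):
    """Return (identifier, title) pairs from a ListRecords response."""
    root = ET.fromstring(xml_text)
    results = []
    for record in root.iterfind(".//oai:record", NS):
        identifier = record.findtext(
            "oai:header/oai:identifier", default="", namespaces=NS
        )
        title = record.findtext(".//dc:title", default="", namespaces=NS)
        results.append((identifier, title))
    return results


# Made-up, heavily abbreviated response used only to exercise the parser.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample Thesis</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
""".strip()
```

A harvester would fetch the URL returned by `list_records_url` and pass the response body to `parse_titles`. A complete harvester would also follow resumption tokens for large result sets, which this sketch omits.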

 

2.3 Client-side Hardware & Software Components 

 

Clients are the machines that reside on users’ desks. Planners of a digital library, therefore, need to prescribe the minimum level of hardware and software that a user would require to interact efficiently and effectively with the digital library. Most digital libraries require an Internet-enabled multimedia PC (or Macintosh) or a tablet equipped with an Internet browser like Internet Explorer, Mozilla Firefox or Google Chrome as their clients. The client-side PCs may also require the following software packages (plug-ins) to download and use format-specific deliverables from a digital library:

 

Application | Software | URL
Internet Browser | Google Chrome | http://www.google.com/chrome
Internet Browser | Internet Explorer | http://www.microsoft.com/
Internet Browser | Mozilla Firefox | http://www.mozilla.org
Reading PDF Files | Acrobat Reader (Adobe) | http://www.adobe.com
Playing Audio and Video Files | Real Player | http://www.real.com
Playing Audio and Video Files | VLC Player | http://www.videolan.org
File Transfer Client | WS_FTP | http://www.ipswitchft.com/
Display and Printing of Word, PowerPoint and Access Documents | Microsoft Office | http://www.microsoft.com
Display and Printing of Word, PowerPoint and Access Documents | Open Office | http://www.openoffice.org
TIFF Images | TIFF Viewer | http://www.alternatiff.com/
Image Manipulation and Editing | GIMP (The GNU Image Manipulation Program) | http://www.gimp.org
Video Editing | Adobe Premiere | http://www.adobe.com

 

3.  Digital Libraries and Cloud Computing 

 

Cloud computing can be understood as the use of off-site computing power in place of the content-creation tools and servers that were traditionally hosted on site. In layman’s terms, this means “using Web services for our computing needs” (Kroski, 2009). In this model, content is created “when data and software applications reside on and are drawn from the network rather than locally on any one workstation”. By utilizing online applications, users can create and save their files online, share content, work collaboratively with others or create entire services that can all be accessed online without needing the programs on their own computers. These online services can reduce the need for expensive software, hardware and even advanced technical knowledge among library staff, since cloud computing services are often streamlined to be very user-friendly.

 

Cloud computing increases the ability of a library to try out new software without having to buy hardware, and to scale computing power to meet user demand. A library can increase the amount of cloud computing capacity it uses by contacting its vendor instead of physically acquiring new hardware. This approach can be cost-effective in terms of both money and manpower. The following are the general advantages of hosting a digital library on the cloud:

 

•   Compliant facilities and processes

•   Cost effective

•   Enterprise grade services and management

•   Faster provisioning of systems and applications

•   Flexible and innovative

•   Flexible and resilient in disaster recovery

•   Highly secured infrastructure

•   Reduces hardware and maintenance cost

•   Round the clock access

•   Simplicity of integration

•   Simplified cost and consumption model
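Scaling computing power to meet demand, as described above, comes down to simple arithmetic. The sketch below assumes identical server instances that each handle a known request rate; the function name and all figures in the usage note are hypothetical.

```python
import math


def instances_needed(requests_per_second, capacity_per_instance, minimum=1):
    """Smallest number of identical instances that covers the current load."""
    required = math.ceil(requests_per_second / capacity_per_instance)
    return max(minimum, required)
```

If each instance is assumed to serve 50 requests per second, a load of 120 requests per second needs `instances_needed(120, 50)`, that is 3 instances, while an idle service still keeps the minimum of one. In the cloud this number can be raised or lowered on request, whereas on-site hardware has to be purchased for the peak load.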

 

4.  Summary 

 

Digital libraries are built around the Internet and web technologies. A typical digital library implementation follows the client-server architecture, as do the Internet and the web. This module has broadly discussed the client-server architecture as applied to digital libraries, covering server-side hardware and software, client-side components and the role of cloud computing. Utmost care needs to be taken to create a sustainable digital library that balances cost against technical requirements while ensuring long-term preservation.

 
