11 Digitisation and computerisation
I. Objectives
The objectives of the unit/module are to:
- Describe the concept, scope, process, and need of digitization in libraries
- Discuss various types of file formats and media types, and scanning software
- Mention the salient features and prerequisites of various digital library software packages, i.e. DSpace, EPrints, and Greenstone Digital Library
- List important tips for planning and management of digitisation projects.
II. Learning Outcome
After going through this unit/module, you will be able to understand the concept, scope, process, and need of digitisation in libraries. You will learn about the approaches to digitisation, various types of file formats and media types, scanning software, the salient features and prerequisites of various digital library software packages, i.e. DSpace, EPrints, and Greenstone Digital Library, and important tips for planning and management of digitisation projects.
III. Structure
1. Introduction
2. Digitisation
2.1 Concept
2.2 Need
2.3 Digitisation Process
2.4 Basic approaches of digitization
3. Computerization
3.1 File Formats and Media types
3.2 Scanning Software
4. Planning and Management
4.1 Feasibility study
4.2 Planning the Project
4.3 Library Automation Hardware and Software Planning
4.4 Human Resources Planning
4.5 Financial planning
4.6 Purchase of Hardware and Software
4.7 Selection of Material for Digitisation and ‘Born Digital’
4.8 Placement and Training of Manpower
4.9 Content Creation
4.10 Execution of the Project
5. Challenges and Problems
6. Summary
7. References
1. Introduction
Digitisation is the process of converting the content of physical media (e.g., books, articles, manuscripts, photographs, etc.) into digital format. In most library applications, digitisation results in documents that are accessible from the website of a library. When analogue information is fed into a computer, it is broken down into 0s and 1s, changing its characteristics from analogue to digital. These bits of data can be re-combined for manipulation and compressed for storage. For example, a multi-volume encyclopaedia that takes up yards of shelf space in analogue form can fit into a small space on a computer drive or on a CD-ROM disc, where it can be searched, retrieved, manipulated and forwarded over the network. One of the most important characteristics of digital information is that it is not fixed in the way that text printed on paper is. Digital texts are neither final nor finite, and are not fixed either in essence or in form, except when they are printed out as hard copy. Flexibility is one of the main advantages of digital information. An unlimited number of identical copies can be created from a digital file, because a digital file is not degraded by copying. Moreover, digital information can be made accessible from remote locations simultaneously to a large number of users. Optical scanners and digital cameras are used to digitise images by translating them into bitmaps. Similarly, it is also possible to digitise sound, video, graphics, animations, etc. Thus, digitisation is the process that creates a digital image from an analogue image; it is a means and not an end in itself.
In the digitisation process, selection criteria, particularly those that reflect user needs, are of utmost importance. All the principles that apply in traditional collection development also apply when materials are being selected for digitisation. However, several other considerations related to legal, policy, technical, and other resources become important in a digitisation project. Digitisation is one of three important methods of building and maintaining digital collections. The other two are provision of access to electronic resources (whether licensed or free) and creation of library portals for important Internet resources. This module describes the concept, scope, need and process of digitisation, and the various aspects of computerisation required for it.
2. Digitisation
2.1 Concept
The word “digital” describes any system based on discontinuous data or events. Computers are digital machines because at their most basic level they can distinguish between just two values, 0 and 1, or off and on. All data that a computer processes must be encoded digitally as a series of zeroes and ones.
The opposite of digital is analogue. A typical analogue device is a clock in which the hands move continuously around the face. Such a clock is capable of indicating every possible time of the day. In contrast, a digital clock is capable of representing only a finite number of times (every tenth of a second, for example). As mentioned before, a printed book is an analogue form of information, and its contents need to be digitised to convert them into digital form. Digitisation refers to the process of translating a piece of information such as a book, journal article, sound recording, picture, audio tape or video recording into bits. Bits are the fundamental units of information in a computer system. Converting information into these binary digits is called digitisation, and it can be achieved through a variety of existing technologies. A digital image, in turn, is composed of a set of pixels (picture elements), arranged according to a pre-defined ratio of columns and rows. An image of the physical object is captured using a scanner or digital camera and converted into a digital format that can be stored electronically and accessed via computers. An image file can be managed as a regular computer file and can be retrieved, printed and modified using appropriate software. Further, textual images can be OCRed so as to make their contents searchable.
Scope
The process of digitisation, however, does not stop at the scanning of physical objects; a considerable amount of work is involved in optimising the usage of digitised documents. Sometimes these post-scanning processes are included in the meaning of digitisation. At other times the word “digitisation” is used in a restricted sense to include only the process of scanning.
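The move from a continuous (analogue) quantity to a finite set of bits can be illustrated with a minimal sketch. It assumes a made-up 1 Hz tone as the analogue source and hypothetical function names (`digitise`, `to_bits`); real capture hardware works on the same principle but with far higher sampling rates and bit depths.

```python
import math

def digitise(signal, duration_s, sample_rate_hz, bit_depth):
    """Sample a continuous signal and quantise each sample to an integer code."""
    levels = 2 ** bit_depth                    # number of values the bits can express
    n_samples = int(duration_s * sample_rate_hz)
    codes = []
    for i in range(n_samples):
        t = i / sample_rate_hz                 # only a finite number of instants is kept
        value = signal(t)                      # analogue value, assumed to lie in [-1, 1]
        scaled = (value + 1.0) / 2.0           # map to [0, 1]
        codes.append(min(levels - 1, int(scaled * levels)))
    return codes

def to_bits(codes, bit_depth):
    """Write each integer code as a fixed-width string of 0s and 1s."""
    return "".join(format(code, f"0{bit_depth}b") for code in codes)

def tone(t):
    """Hypothetical analogue source: a 1 Hz sine wave."""
    return math.sin(2 * math.pi * t)

codes = digitise(tone, duration_s=1.0, sample_rate_hz=8, bit_depth=3)
bits = to_bits(codes, bit_depth=3)
print(codes)
print(bits)
```

Raising `sample_rate_hz` or `bit_depth` gives a closer approximation of the original at the cost of more bits, which is the same trade-off that scanning resolution and bit depth impose on image files.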
2.2 Need
Digitisation makes a document more useful and more accessible. A user can conduct a full-text search on a document that has been digitised and OCRed, and can create hyperlinks to related items within the text itself as well as to external resources. Digitisation does not mean replacing traditional library collections and services; rather, it serves to enhance them. The decision to convert a document into digital format depends on various factors, for example the objective of digitisation, the availability of finances, and the end user. While the objectives of digitisation initiatives differ from organisation to organisation, the main objective is to improve access.
Other objectives may include preservation, cost saving, information sharing, and keeping up to date with technology. While new and emerging technologies allow digital information to be presented in innovative ways, the majority of potential users are unlikely to have access to sophisticated hardware and software. Further, sharing of information among institutions is often restricted by the use of incompatible software.
One of the main benefits of digitisation is that it helps to preserve rare and fragile objects while giving multiple users simultaneous access to them. Very often, when an object is rare and precious, access is allowed only to a certain category of users. Digitisation can allow more users to enjoy the benefit of access. Although digitisation offers great advantages for access, such as allowing users to find, retrieve, study and manipulate material, it cannot by itself be considered a good alternative for preservation, because the formats, protocols and software used for creating digital objects are ever changing.
There are several reasons for libraries to go for digitisation and there are as many ways to create the digitized images, depending on the needs and uses. The prime reason for digitisation is the need of the user for convenient access to high quality information. Other important reasons are:
Multiple Referencing: Digital information can be used by several users simultaneously.
Wide Area Usage: Digital information can be made accessible to distant users through computer networks and the Internet.
Qualitative Preservation: In digital preservation, images can be scanned at high resolution and bit depth for the best possible quality. The quality remains the same in spite of repeated use by several users. However, special care has to be taken when choosing digitisation for preservation of information.
Archival Storage: Digitisation is used for restoration of rare material. The rare books, images or archival material are kept in digitized format as a common practice.
Security Measure: Valuable documents and records are scanned and kept in digital format for safety and security.
2.3 Digitization Process
To begin the process of digitisation, we first need to select documents. The process of selecting material involves identification, selection and prioritisation of the documents that are to be digitised. Data can be captured in two ways. The first is to use data already available in digital form, known as “born digital”, which can easily be converted into other formats. The second is to convert physically available data, in the form of print matter, from external sources. In the second case, Intellectual Property Right (IPR) issues may have to be resolved. It may also be necessary to obtain permission from the authors, publishers and data suppliers for digitisation, particularly when the data is not in the public domain.
Selection
In this step, the IPR issues should be addressed. One may have to get permissions from publishers and individuals, which can be difficult and very time consuming. It may also involve negotiation and payment of copyright fees, if applicable. Documents selected for digitisation may already be available in digital form; it is always more economical and appropriate to buy e-media, if available, than to convert the originals. The type and nature of the document matter as well: material in bad shape, over-sized material, manuscripts, bound volumes of journals, etc. would need highly specialised equipment and skilled manpower. Further, the documents to be digitised may include simple text, line art, photographs, colour images, etc. The selection of documents needs to be reviewed very carefully, taking into consideration important factors such as quality, utility, cost, and security. In selecting material for digitisation, priority should be given to rare and much-in-demand documents and images.
Other factors which may be considered for selecting appropriate media for digitisation may include:
Photographs and slides: Selecting photographs is a very crucial process and requires high resolution. The quality, future needs, and copyright issues are also important and must be taken into account.
Audio: The sound quality is to be checked and the required corrections made jointly by the subject expert and the computer sound editor.
Video: Video clippings are normally edited on Betamax tapes, which can be used for transfer to digital format. While editing, colour tone and resolution are checked and corrected.
Documents: The objective of digitisation is to increase access to materials and to add value to them. The criteria for selecting documents are high demand, rarity, and difficulty in handling; documents should be reviewed against these criteria and selected for the process. If the correction of literary value demands much input, then documents may be considered for publication rather than digitisation. However, the main consideration for digitisation of documents should be the intellectual value and significance of the contents, i.e. authority, quality, timeliness, uniqueness, and demand. To sum up, the main considerations are the intellectual contents, the physical nature of the source materials, and the number of current and potential users.
Scanning
In the process of scanning, the image is “read”, i.e. scanned at a predefined dynamic range and resolution. The resulting file, called a “bit-map page image”, is formatted (see sec. 3.1.4) and tagged for storage and subsequent retrieval by the software package used for scanning. Electronic scanners are used for getting an electronic image into a computer from an original, which may be a text, a manuscript, a photograph, etc. Acquisition of an image through a camera, fax card, or other imaging device is also possible. However, image scanners are the most important and most commonly used component of an imaging system for the transfer of paper-based documents.
Steps in the Process of Scanning using a Scanner
– A picture is placed on the scanner’s glass
– Scanner software is started
– The area to be scanned is selected
– The image type is chosen
– The image is sharpened
– The image size is set, and
– The scanned image is saved using a desirable format
Indexing
Scanned images are a set of pictures that need to be related to a text database describing them and their contents. The indexing process involves linking the database of scanned images to a text database. An imaging system typically stores a large amount of unstructured data and uses a two-file system for storing and retrieving scanned images. The first is a traditional file that holds a text description of the image, i.e. keywords or descriptors, along with a key to a second file. The second file contains the document location. The user selects a record from the first file using a search algorithm. Once the user selects a record, the application program keys into the location index, finds the document and displays it.
Most document imaging software packages facilitate elaborate indexing of documents through a menu-driven or command-driven interface. While some document management systems (DMS) facilitate selection of indexing terms from the image file, others allow only manual input of indexing terms. Further, many DMS packages provide Optical Character Recognition (OCR) capabilities for transforming the images into standard ASCII files. The OCRed text then serves as a database for full-text search of the stored images.
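The two-file arrangement can be sketched in a few lines. This is only an illustration under stated assumptions: the records, keys and file paths below are invented, and a real document imaging system would keep both files in a database rather than in memory.

```python
# File 1 (hypothetical): text descriptions, each carrying a key into file 2.
description_file = [
    {"key": "IMG0001", "title": "Annual report 1952", "keywords": {"report", "annual", "1952"}},
    {"key": "IMG0002", "title": "Palm-leaf manuscript", "keywords": {"manuscript", "rare"}},
    {"key": "IMG0003", "title": "City map", "keywords": {"map", "rare"}},
]

# File 2 (hypothetical): key -> location of the scanned image.
location_file = {
    "IMG0001": "/images/vol01/0001.tif",
    "IMG0002": "/images/vol01/0002.tif",
    "IMG0003": "/images/vol02/0001.tif",
}

def search(keyword):
    """Select records from the first file whose descriptors contain the keyword."""
    keyword = keyword.lower()
    return [record for record in description_file if keyword in record["keywords"]]

def locate(record):
    """Use the record's key to look up the document location in the second file."""
    return location_file[record["key"]]

hits = search("rare")
locations = [locate(record) for record in hits]
for record, path in zip(hits, locations):
    print(record["title"], "->", path)
```

Keeping descriptions and locations apart means the images can be moved to another storage device by updating only the second file, without touching the descriptive records.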
Storing
The most persistent problem of a document image relates to its file size and, therefore, to its storage. Every part of an electronic page image is saved, regardless of the presence or absence of ink. The file size varies directly with the scanning resolution, the size of the area being digitised and the graphic file format used to save the image. The scanned images therefore need to be transferred from the hard disc of the scanning workstation to an external large-capacity storage device such as an optical disc, CD-ROM/DVD-ROM disc, snap server, etc. While smaller document imaging systems may use offline media, which need to be reloaded when required, or fixed hard disc drives allocated for image storage, larger document management systems use auto-changers such as optical jukeboxes and tape library systems. The storage required by the scanned images depends upon factors such as scanning resolution, page size, compression ratio and page content. Further, the image storage device may be either remote from or local to the retrieval workstation, depending upon the imaging system and document management system used.
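How file size depends on resolution, page area and bit depth can be worked out with simple arithmetic. The sketch below assumes an 8.5 x 11 inch page and a hypothetical 10:1 compression ratio purely for illustration; actual ratios depend on the page content and the compression method.

```python
def uncompressed_size_bytes(width_in, height_in, dpi, bit_depth):
    """Estimate the raw size of a scanned page: pixels x bits per pixel, in bytes."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * bit_depth
    return (total_bits + 7) // 8          # round up to whole bytes

def compressed_size_bytes(raw_bytes, compression_ratio):
    """Apply an assumed compression ratio (e.g. 10 means 10:1)."""
    return raw_bytes / compression_ratio

# An 8.5 x 11 inch page at 300 dpi in three common bit depths.
bitonal = uncompressed_size_bytes(8.5, 11, 300, 1)     # 1 bit per pixel
greyscale = uncompressed_size_bytes(8.5, 11, 300, 8)   # 8 bits per pixel
colour = uncompressed_size_bytes(8.5, 11, 300, 24)     # 24 bits per pixel

for label, size in [("bitonal", bitonal), ("greyscale", greyscale), ("colour", colour)]:
    packed = compressed_size_bytes(size, 10)           # hypothetical 10:1 ratio
    print(f"{label}: {size / 1_000_000:.1f} MB raw, {packed / 1_000_000:.2f} MB at 10:1")
```

Doubling the resolution from 300 to 600 dpi quadruples the pixel count, which is why resolution is usually the first parameter to settle when estimating storage for a project.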
Retrieving
Once scanned images and OCRed text documents have been saved as files, a database is needed for selective retrieval of data contained in one or more fields within each record. A document imaging system typically uses at least two files to store and retrieve documents. The first is a traditional file holding a text description of the image along with a key to the second file. The second file contains the document location. The user selects a record from the first file using a search algorithm. When the user selects a record, the application program keys into the location index, finds the document and displays it. Most document management systems provide elaborate search possibilities, including Boolean operators (AND, OR, NOT), proximity operators, and wildcards. Users are also allowed to refine their search strategy. Once the required records have been identified, the associated document images can quickly be retrieved from the image storage device for display or for printing.
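Boolean and wildcard searching can be sketched with an inverted index, in which each term points to the set of records containing it. The records below are invented, the function names are hypothetical, and proximity operators are left out to keep the sketch short.

```python
import fnmatch
from collections import defaultdict

# Hypothetical record descriptions keyed by record number.
records = {
    1: "history of indian railways",
    2: "railway maps and manuscripts",
    3: "rare manuscripts of indian history",
}

def build_index(records):
    """Map every term to the set of record numbers in which it occurs."""
    index = defaultdict(set)
    for record_id, text in records.items():
        for term in text.lower().split():
            index[term].add(record_id)
    return index

def lookup(index, pattern):
    """Return records containing a term that matches the pattern (* and ? are wildcards)."""
    hits = set()
    for term, ids in index.items():
        if fnmatch.fnmatchcase(term, pattern.lower()):
            hits |= ids
    return hits

index = build_index(records)
and_result = lookup(index, "indian") & lookup(index, "history")    # AND: intersection
or_result = lookup(index, "maps") | lookup(index, "rare")          # OR: union
not_result = lookup(index, "manuscripts") - lookup(index, "maps")  # NOT: difference
wildcard_result = lookup(index, "railway*")                        # railway, railways
print(and_result, or_result, not_result, wildcard_result)
```

Refining a search strategy then amounts to combining further sets with the result already obtained.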
2.4 Basic approaches of digitization
There are four basic approaches that can be adopted to convert from print to digital medium:
– Image Only scanning
– Retaining Page Layout and Optical character recognition
– Retaining Page Layout using Acrobat Capture; and
– Re-keying the Data
Image only Scanning
In terms of cost, the ‘Image only’ option is the lowest. In this option, each page is an exact replica of the original source document. Several digital library projects (see sec. 6) are concerned with providing digital access to materials that already exist in printed media in traditional libraries.
Some of the features of scanned page images are:
(i) It offers libraries a reasonable way of converting existing paper collections, for example heritage documents, without needing access to the original data in computer-readable form that could be converted into HTML/SGML or any other structured or unstructured text.
(ii) It is a natural choice for large-scale conversions for major digital library initiatives. Printed text, pictures and figures are transformed into computer-accessible forms using a digital scanner or a digital camera in a process called document imaging or scanning.
(iii) The digitally scanned images are stored in a file as bit-mapped page images, irrespective of whether a scanned page contains text, a photograph, or a line drawing.
(iv) A bit-mapped page image is a type of computer graphic, literally an electronic picture of the page, which can be equated to a facsimile image of the page. As such it can be read by human beings but not by computers: the “text” in a page image is not searchable on a computer. An image-based implementation also requires a large amount of space for data storage and transmission.
(v) Capturing the page image format is comparatively easy and inexpensive, and the result is a faithful reproduction of the original page, maintaining its integrity and originality.
(vi) Scanned textual images are not searchable unless they are OCRed, and OCR is itself a highly error-prone process, especially for scientific and technical (S&T) texts. There are various technology options for converting print to digital form.
If OCR is not done, the document is not searchable. Most scanning software generates Tagged Image File Format (TIFF) files by default, which can be converted into PDF using a number of software tools. Scanning to TIFF/PDF is recommended only when the requirement of the project is to make documents portable and accessible from any computing platform. The images can be browsed through a table of contents file composed in HTML, which provides links to the scanned image objects.
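A table of contents file of this kind can be generated rather than typed by hand. The following is a minimal sketch; the title, chapter labels and image paths are invented, and `build_toc` is a hypothetical helper rather than part of any scanning package.

```python
from html import escape

def build_toc(title, entries):
    """Return an HTML page whose list items link to scanned page images."""
    items = "\n".join(
        f'      <li><a href="{escape(href, quote=True)}">{escape(label)}</a></li>'
        for label, href in entries
    )
    return (
        "<html>\n"
        f"  <head><title>{escape(title)}</title></head>\n"
        "  <body>\n"
        f"    <h1>{escape(title)}</h1>\n"
        "    <ul>\n"
        f"{items}\n"
        "    </ul>\n"
        "  </body>\n"
        "</html>\n"
    )

# Hypothetical scanned volume: (label shown to the reader, path of the image file).
entries = [
    ("Title page", "scans/page001.tif"),
    ("Preface", "scans/page003.tif"),
    ("Chapter 1: Early years", "scans/page007.tif"),
]
toc_html = build_toc("Gazetteer, Volume 1", entries)
print(toc_html)
```

Because the page images themselves are not searchable, a table of contents like this is the reader's only route into an image-only collection, so the labels deserve as much care as the scans.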
Retaining page layout and Optical Character Recognition (OCR)
Some software packages, for example Xerox’s TextBridge and Caere’s OmniPage, include technology that offers the option of retaining text and graphics in their original layout, as well as output in word-processing and plain ASCII formats. Output can also include HTML in which attributes such as bold, underline, and italics are retained.
Retaining Layout after Optical Character Recognition (OCR)
OCR programs are software tools used to transform scanned textual page images into word-processing files. OCR, or text recognition, is the process of electronically identifying text in a bit-mapped page image or set of images and generating a file containing that text in ASCII code or in a specified word-processing format, leaving the image intact in the process. A scanned document is nothing more than a picture of a printed page: it is not editable and cannot be manipulated or managed on the basis of its contents. In other words, scanned documents are referred to by their labels rather than by the characters in the documents.
Retaining Page Layout using Acrobat Capture
Acrobat Capture provides various options for retaining not only the page layout but also the fonts, and for fitting text into the exact space it occupied in the original, so that the scanned and OCRed copy never under- or over-shoots the page. It treats unidentified or unrecognisable text as images that are pasted in its place. Such images are perfectly readable by anyone looking at the PDF file, but they are not available as editable and searchable text. In contrast, ordinary OCR programs represent unrecognised text with wildcard or other special characters in the ASCII output. Acrobat Capture can be used to scan pages in three ways, i.e. as Image only, as Image + Text, and as PDF Normal. All three options retain the page layout. The Image only option has already been described above; the other two options are described below:
(i) Image + Text: In this option, OCRed text is generated for each image, while each page remains an exact replica of the original and is left untouched. The OCRed text sits behind the image and is used only for searching; it is not corrected for errors. While the cost involved is much less than for PDF Normal, Image + Text PDF files are considerably larger than the corresponding PDF Normal files, and pages do not display as quickly or cleanly on screen. This happens because the entire page is a bitmap and neither fonts nor line drawings are vectorised.
(ii) PDF Normal: In this option, all graphics and formatting are preserved, and substitute fonts may be used where direct matches are not possible. It gives a clear on-screen display and is searchable, with a significantly smaller file size than Image + Text. The result is not, however, an exact replica of the scanned page. It is a good choice when files need to be posted on the web or otherwise delivered online. If, during the Capture and OCR process, a word cannot be recognised to the specified confidence level, Capture by default substitutes a small portion of the original bitmap image. Capture’s ‘best guess’ at the suspect word lies behind the bitmap, so that searching and indexing are still possible. There is, however, no guarantee that these bit-mapped words have been guessed correctly. In addition, the bitmap is somewhat obtrusive and detracts from the ‘look’ of the page. Capture provides the option either to correct suspected errors left as bit-mapped images or to leave them untouched.
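The confidence-threshold behaviour described for PDF Normal can be illustrated with a small sketch. It does not use or reproduce Acrobat Capture's actual interface: the function name, the `[bitmap]` placeholder, the threshold value and the word/confidence pairs are all assumptions made for illustration.

```python
def compose_page(ocr_words, threshold=0.95):
    """Split OCR output into what the reader sees and what the search engine sees.

    ocr_words is a list of (text, confidence) pairs from a hypothetical OCR engine.
    """
    visible_layer = []   # what is displayed on the page
    search_layer = []    # hidden best guesses used for searching and indexing
    for text, confidence in ocr_words:
        search_layer.append(text)                # the best guess is always kept
        if confidence >= threshold:
            visible_layer.append(text)           # confident: show recognised text
        else:
            visible_layer.append("[bitmap]")     # suspect: show a piece of the scan
    return visible_layer, search_layer

# Invented OCR result in which one word falls below the confidence level.
ocr_words = [("Digital", 0.99), ("libnary", 0.62), ("services", 0.98)]
visible, searchable = compose_page(ocr_words)
suspects = [text for text, confidence in ocr_words if confidence < 0.95]
print(visible)
print(searchable)
print(suspects)
```

The example also shows the weakness noted above: the hidden guess "libnary" is wrong, so a search for "library" would miss this page unless the suspect word is corrected.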
Re-keying
The best approach to re-keying combines keying-in of the data with its verification. This involves a complete keying of the text, followed by a full re-keying by a different operator; the two keying-in operations may take place simultaneously. The two keyed files are compared and any errors or inconsistencies are corrected. This procedure can guarantee at least 99.9% accuracy, but getting still closer to 100% would normally require full proof-reading of the keyed files, table lookups, and dictionary spell checks.
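The comparison of the two keyed files can be automated. The sketch below uses Python's standard `difflib` module to list the places where two operators disagree; the sample sentences are invented, and a production workflow would compare whole files and route each disagreement to a verifier.

```python
import difflib

def compare_keyings(first, second):
    """Return (first operator's words, second operator's words) wherever they differ."""
    a = first.split()
    b = second.split()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    differences = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            differences.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return differences

# Invented example: each operator makes one independent mistake.
operator_1 = "The library was founded in 1891 by the society"
operator_2 = "The libary was founded in 1881 by the society"
differences = compare_keyings(operator_1, operator_2)
for left, right in differences:
    print(f"operator 1: {left!r}  operator 2: {right!r}")
```

The method works because two operators rarely make the same mistake at the same place; an error that both happen to key identically passes unnoticed, which is why proof-reading is still needed to approach 100% accuracy.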
3. Computerization
3.1 File Formats and Media types
Every object in a digital library needs a name or identifier that distinctly identifies its type and format. This is achieved by assigning file extensions to the digital objects. The file extensions in a digital library typically denote formats, protocols and rights management that are appropriate for the type of material. File formats are used to store different media types such as text, graphics, pictures, images, musical works, video programmes, databases, and models, including any combination of these types. A file format is an arrangement of discrete sets of data that allows a computer and its software to interpret the data.
Examples of file formats applicable to digital libraries, with their file extensions, are given in Tables 1, 2 and 3 (see sec. 3.1.5).
Formats and Encoding Used for Text
Text and image-based contents of a digital library can be stored and presented as:
– simple text or ASCII (American Standard Code for Information Interchange);
– structured text (SGML or HTML or XML);
– unstructured text;
– page description language, and
– page image formats.
Simple Text or ASCII
Simple text or ASCII is the most commonly used encoding scheme for exchanging data from one software package to another or from one platform to another.
Nowadays, full-text papers from many journals are available electronically through online vendors like STN and DIALOG. Simple text or ASCII is compact, easy and economical to capture and store, searchable, inter-operable and compatible with other text-based services, but it cannot be used for displaying mathematical formulae or complex tables. Diagrams, photographs, graphics and special characters cannot be displayed in ASCII either. Further, ASCII does not store text formatting information, i.e., bold, italics, font type, font size, or paragraph justification. For these reasons, simple text or ASCII is in many ways insufficient to represent journal articles. Although it is very useful in searching and selection, its inability to capture the richness of the original makes it an interim step towards structured text formats.
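The limitation on special characters is easy to demonstrate. In the sketch below the sample strings are invented and `is_ascii_only` is a hypothetical helper; the encoding calls themselves are standard Python.

```python
def is_ascii_only(text):
    """Return True if every character of the text can be stored as ASCII."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False

plain_title = "Digital libraries"
formula = "E = mc²"                      # the superscript 2 is not an ASCII character

print(is_ascii_only(plain_title))
print(is_ascii_only(formula))

# Forcing the formula into ASCII loses information: the superscript becomes "?".
degraded = formula.encode("ascii", errors="replace")
print(degraded)
```

The degraded output remains searchable for the letters it kept, which matches the role of ASCII described above: adequate for searching and selection, inadequate for faithful display.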
Structured Text Format
Structured text formats capture the essence of a document by ‘marking up’ the text so that the original form can be recreated or other forms, such as ASCII, can be produced. Structured text formats have provision for embedding images, graphics and other multimedia formats in the text. Standard Generalized Markup Language (SGML) is one of the most important and popular structured text formats; Office Document Architecture (ODA) is a competing standard. SGML is an international standard around which several related standards are built. It is a flexible language from which Hyper-Text Markup Language (HTML) originated. HTML is considered the de facto standard markup language of the World Wide Web (WWW); it controls the display format of documents and even the appearance of the user interface for interacting with the documents. Like simple text or ASCII, structured text can be searched or manipulated. It is highly flexible and suitable for both electronic and paper production. Well-formatted text improves the visual presentation of textual, graphical and pictorial information. Structured formats can easily display equations and complex tables. Structured text is also compact in comparison with image-based formats, even after including embedded graphics and pictures.
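Producing a plain-text form from marked-up text can be sketched with Python's standard HTML parser. The marked-up fragment is invented and `PlainTextExtractor` is a hypothetical class name; the sketch simply discards the tags and keeps the character data.

```python
from html.parser import HTMLParser

class PlainTextExtractor(HTMLParser):
    """Collect the character data of a marked-up document and ignore the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def text(self):
        # Join the pieces and collapse runs of whitespace into single spaces.
        return " ".join(" ".join(self.parts).split())

marked_up = "<h1>Digitisation</h1><p>Scanning is <b>only</b> the first step.</p>"

extractor = PlainTextExtractor()
extractor.feed(marked_up)
extractor.close()
plain = extractor.text()
print(plain)
```

The conversion runs in one direction only: the heading and the bold emphasis recorded by the markup are lost in the plain form, so the structured version should be treated as the master copy.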
Creation of structured text, if rekeyed, is always too expensive on a production basis. However, creation of structured text is generally integrated with the production of printed artifacts. SGML is in fact, a format generated as a by-product of printed artifacts generated electronically.
Besides SGML and HTML, other formats are used in digital library implementations, for example TeX. TeX is used for formatting highly mathematical text and allows greater control over the resulting display of the document, including review of formatting errors.
Page Description Language (PDL)
Page Description Languages (PDLs), such as Adobe’s PostScript and PDF (Portable Document Format), resemble page images in appearance, but the formatted pages displayed to the user are text-based rather than image-based. PostScript and PDF can easily be captured during the typesetting process. PostScript is especially easy to capture since most systems generate it automatically, and a conversion program called Acrobat Distiller can be used to convert PostScript files into PDF files. Documents stored as PDF require Acrobat Reader at the user’s end to read or print the document. Acrobat Reader can be downloaded free of cost from Adobe’s web site.
Acrobat’s Portable Document Format (PDF) is a by-product of PostScript, Adobe’s page-description language, which has become the standard way to describe pages electronically in the graphics world. While PostScript is a programming language, PDF is a page-description format.
PDF can have two formats:
(i) Text-based PDF that uses outline font technology of PostScript PDL (Page Description Language) from Adobe to describe format of a page;
(ii) Raster-scanned image PDF without the text output of OCR (Optical Character Recognition). The image PDF is essentially equivalent to TIFF or CCITT G4 formats or to a photograph where text characters cannot be manipulated by the computer. Besides, an image-based PDF may be converted into text-based PDF once it goes through the process of OCR. In this process, scanned image is replaced by the text with fonts and layout matching with the scanned document.
Page Image Format
The digitally scanned images are stored in a file as bit-mapped page images, irrespective of whether a scanned page contains text, a line drawing or a photograph. A bit-mapped page image can be created in many different formats, depending upon the scanner and its software. There are national and international standards for image file formats and compression methods to ensure that data is interchangeable amongst systems. An image file stores discrete sets of data and information that allow a computing system to interpret, display and print the image in a pre-defined way. An image file format consists of three components, i.e.,
– header which stores information on file identifier and image specifications;
– image data consisting of look-up table and image raster; and
– footer that signals file termination information.
While bit-mapped portion of a raster image is standardized, it is the file header that differentiates one format from another.
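Because the header carries the file identifier, a format can often be recognised from the first few bytes of a file regardless of its extension. The sketch below checks a handful of well-known signatures; the table is deliberately short, and `identify_format` is a hypothetical helper.

```python
# Leading byte sequences (signatures) of some common formats.
SIGNATURES = [
    (b"II*\x00", "TIFF (little-endian)"),
    (b"MM\x00*", "TIFF (big-endian)"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"BM", "BMP"),
    (b"%PDF-", "PDF"),
]

def identify_format(header):
    """Return the name of the format whose signature starts the given bytes."""
    for signature, name in SIGNATURES:
        if header.startswith(signature):
            return name
    return "unknown"

def identify_file(path):
    """Read the first bytes of a file on disc and identify its format."""
    with open(path, "rb") as handle:
        return identify_format(handle.read(16))

# Checked here against literal byte strings instead of real files.
tiff_result = identify_format(b"II*\x00\x08\x00\x00\x00")
png_result = identify_format(b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR")
other_result = identify_format(b"plain text, no signature")
print(tiff_result, png_result, other_result)
```

A check of this kind is useful during ingest, when files received from a scanning vendor may carry a wrong or missing extension.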
Tagged Image File Format (TIFF)
TIFF is the most commonly used page image file format and is considered to be the de facto standard for bitonal images. Some image formats are proprietary, developed by commercial vendors, and require specific software or hardware for display and printing. Images can be coloured, grey-scale or black and white (called bitonal). They can be uncompressed (raw) or compressed using several different compression algorithms.
Abbreviation | Format | File extension |
File formats for unstructured text | | |
ASCII | American Standard Code for Information Interchange | .txt |
File formats for structured text | | |
HTML | Hypertext Markup Language | .html |
PDF | Portable Document Format (Adobe) | .pdf |
PostScript | PostScript (Adobe) | .ps |
SGML | Standard Generalized Markup Language | .sgml |
TeX | TeX typesetting format | .tex |
XML | Extensible Markup Language | .xml |
Table 1: File Formats Used in a Digital Library
Abbreviation | Format | File extension |
IMG | Ventura Publisher | .img |
JFIF | JPEG File Interchange Format | .jfif |
BMP | Bitmap (Windows) | .bmp |
JPEG | Joint Photographic Experts Group | .jpg |
PCD | Photo CD (Kodak) | .pcd |
PCP | PC Paint (Black & White) | .pcp |
PCX | PC Paintbrush (Colour and Black & White) | .pcx |
PDF | Portable Document Format | .pdf |
PNG | Portable Network Graphics | .png |
PSD | Photoshop | .psd |
SPIFF | Still Picture Interchange File Format | .spf |
TGA | Truevision Targa | .tga |
TIFF | Tagged Image File Format | .tif |
TIFF-G4 | Tagged Image File Format with Group 4 File Compression | .tif |
Table 2: File Formats for Images
Abbreviation | Format | File extension |
AIFF | Audio Interchange File Format | .aif |
AU | Audio (Sun Microsystems) | .au |
AVI | Audio Video Interleave | .avi |
FLA | Macromedia Flash Movie | .fla |
FLC | Autodesk Flic Animation | .flc |
MIDI | Musical Instrument Digital Interface | .midi |
MOV | QuickTime Movie | .mov |
MPEG | Moving Picture Experts Group | .mpg |
MP2 | MPEG Audio Layer 2 | .mp2 |
MP3 | MPEG Audio Layer 3 | .mp3 |
RAF | RealAudio Format (Progressive Networks) | .ra |
SND | Sound | .snd |
VOC | Creative Voice | .voc |
WAVE | Waveform Audio (Microsoft) | .wav |
Table 3: Audio and Video File Formats
3.2 Scanning Software
Scanning software is used to scan an image and capture it on the computer. This software is generally supplied to buyers by the manufacturer of the scanner. The following are some of the related applications commonly in use.
Digitisation of Audio and Video Analogue sound tracks, such as those heard on radio or tape recorders, can be digitized by attaching an audio player to a computer system. An audio capture card is used to record the sound to the system. The audio files are saved in formats such as mp3, midi and wav. MP3 offers good sound quality in a highly compact file because it uses lossy compression, whereas WAV files are usually uncompressed and therefore much larger. These audio files can be further processed using noise reduction software. Like audio, video capture requires a video capture card with input from a video cassette player (VCP/VCR), TV antenna, cable or movie camera. The digitized files can be saved in mpg, mov and avi file formats.
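The difference in compactness between uncompressed and compressed audio can be estimated with simple arithmetic. The sketch below assumes CD-style capture settings (44.1 kHz, 16-bit, stereo) and a 128 kbit/s compressed bitrate; these values and the function names are illustrative assumptions, not settings prescribed by the text.

```python
def pcm_bytes(seconds: int, sample_rate: int = 44_100,
              bit_depth: int = 16, channels: int = 2) -> int:
    """Size of uncompressed (PCM) audio, as typically stored in a WAV file."""
    return seconds * sample_rate * bit_depth // 8 * channels


def compressed_bytes(seconds: int, bitrate_kbps: int = 128) -> int:
    """Approximate size of constant-bitrate compressed audio (e.g. MP3)."""
    return seconds * bitrate_kbps * 1000 // 8


one_minute_wav = pcm_bytes(60)
one_minute_mp3 = compressed_bytes(60)
print(f"WAV: {one_minute_wav / 1e6:.1f} MB per minute")
print(f"MP3: {one_minute_mp3 / 1e6:.2f} MB per minute")
print(f"Ratio: about {one_minute_wav / one_minute_mp3:.0f} to 1")
```

Under these assumptions one minute of audio takes roughly ten megabytes uncompressed and under one megabyte compressed, which is why uncompressed files are usually kept as masters and compressed files used for access.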
Image Editing Applications Once an image has been scanned and is available on the computer, image editing applications can be used for further manipulation. Most image editing software offers features such as cropping, color adjustment, format conversion, resizing, sharpening and filtering. Most image editing software can also be used for capturing images.
Organizing Digital Images Scanned images need to be organized in order to be useful. A disc full of digital images without any organisation or browse and search options has little meaning to anyone except the person who created it. In addition, the images need to be linked to the associated metadata so as to facilitate browsing and searching. The following three steps describe the process of organizing the digital images:
– Organize the scanned image files into a disc hierarchy that logically maps the physical organisation of the documents. For example, in a project on scanning of journals, create a folder for each journal, which, in turn, may have a folder for each volume scanned. Each volume, in turn, may have a subfolder for each issue. The folder for each issue may contain the scanned articles that appeared in the issue along with a contents page, composed in HTML, providing links to the articles in that issue.
– Name the scanned image files in a strictly controlled manner that reflects their logical relationship. For example, each article may be named after the surname and initials of the first author followed by a volume number and an issue number. The file name “guptadkv6n2.pdf” conveys that the article is by “D. K. Gupta” and appeared in volume 6, issue no. 2. The file name for each article would, therefore, convey the logical and hierarchical organisation of the journal.
– Describe the scanned image files internally using the image header and externally using linked descriptive metadata files. The following three types of metadata are associated with digital objects:
i. Descriptive Metadata: Includes content or bibliographic description consisting of keywords and subject descriptors.
ii. Administrative or technical Metadata: Incorporates details on original date of creation, source, file format used, version of digital object, compression technology used, and object relationship, etc. Administrative data may reside within or outside the digital object and is required for long-term collection management to ensure longevity of digital collection.
iii. Structural Metadata: Elements within digital objects that facilitate navigation, for example, a table of contents, an index at issue level or volume level, page turning in an electronic book, etc.
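The folder hierarchy and file-naming steps described above can be sketched as follows. The journal folder name `library_journal` and the `v6`/`n2` folder labels are hypothetical; only the file name pattern "guptadkv6n2.pdf" comes from the text.

```python
from pathlib import PurePosixPath


def article_filename(surname: str, initials: str, volume: int, issue: int,
                     ext: str = "pdf") -> str:
    """Build a controlled file name: surname + initials + v<volume> + n<issue>."""
    stem = f"{surname}{initials}".lower().replace(" ", "").replace(".", "")
    return f"{stem}v{volume}n{issue}.{ext}"


def article_path(journal: str, surname: str, initials: str,
                 volume: int, issue: int) -> PurePosixPath:
    """Place the file in a journal / volume / issue folder hierarchy."""
    return (PurePosixPath(journal) / f"v{volume}" / f"n{issue}"
            / article_filename(surname, initials, volume, issue))


print(article_filename("Gupta", "D. K.", 6, 2))
print(article_path("library_journal", "Gupta", "D. K.", 6, 2))
```

A scheme like this would need an extra rule, such as a page number or sequence suffix, if the same author has two articles in one issue.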
The simplest but least effective method of providing access is through a table of contents that links each item to its respective object or image. Contents pages of journal issues composed in HTML would offer a browsing facility. Full-text search of HTML pages or OCRed pages can be achieved by installing one of the free Internet search engines such as Oingo Free Search (http://www.oingo.com/oingo_free_search/products.html); Swish-E (http://www.berkeley.edu/SWISH-E/); WhatyoUseek (http://intra.whatuseek.com/); Excite (http://excite.com/) and Google (http://www.google.com). Large scanning projects would, however, require a back-end database storing the images, or links to the images, together with the metadata (descriptive and administrative). The back-end databases used by most document management systems provide the functionality required by most web applications. Important document management systems like FileNet have integrated their databases with HTML conversion tools. Further, some document management systems have signed up with Adobe to incorporate Acrobat and Acrobat Capture into their web-based systems. These databases accept queries from users through “HTML forms” and generate search results on the fly. Several digital library packages are now available as “open source” or “free-ware” that can be used not only for organizing digital objects but also for their search and retrieval.
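A back-end database of the kind described above can be sketched with Python's built-in `sqlite3` module. The table layout, column names and sample rows are hypothetical and serve only to illustrate storing links to files alongside descriptive metadata and answering a query on the fly; a real project would more likely rely on a digital library package.

```python
import sqlite3

# In-memory database for illustration; a real project would use a file or server.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE objects (
           id INTEGER PRIMARY KEY,
           title TEXT,
           author TEXT,
           keywords TEXT,
           file_path TEXT
       )"""
)

# Hypothetical sample records: metadata plus a link to the scanned file.
rows = [
    ("Sample article on digitisation", "D. K. Gupta",
     "digitisation; scanning", "library_journal/v6/n2/guptadkv6n2.pdf"),
    ("Sample article on cataloguing", "Author B",
     "cataloguing; metadata", "library_journal/v6/n2/authorbv6n2.pdf"),
]
conn.executemany(
    "INSERT INTO objects (title, author, keywords, file_path) VALUES (?, ?, ?, ?)",
    rows,
)


def search(connection: sqlite3.Connection, term: str) -> list:
    """Return (title, file_path) pairs whose title or keywords contain the term."""
    pattern = f"%{term}%"
    cursor = connection.execute(
        "SELECT title, file_path FROM objects "
        "WHERE title LIKE ? OR keywords LIKE ? ORDER BY title",
        (pattern, pattern),
    )
    return cursor.fetchall()


for title, path in search(conn, "digitisation"):
    print(title, "->", path)
```

In a web setting, the search term would come from an HTML form and the returned rows would be rendered as a results page with links to the stored files.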
Digital Library Software Several digital library software packages, such as DSpace, Greenstone Digital Library (GSDL), EPrints and Fedora, are currently available for free download on the Internet. Some commercial digital library software is also available, but none has been used on as large a scale as the packages mentioned above. A brief description of the salient features of some of the common freely available digital library software is given below.
DSpace DSpace is making its impact, with an increasing number of institutions around the globe installing, evaluating, and using the package. DSpace (www.dspace.org) was developed in partnership between Hewlett Packard (HP) and the Massachusetts Institute of Technology (MIT). It is used as institutional repository software and its development is still in progress. At the time of writing, the latest version was 1.2, available for download at the DSpace web site. The original developers undertake most of the core development, but a growing technical user base is generating suggestions for future releases and is looking to produce add-on modules. In addition, the DSpace Federation is guiding the transition of this software to a more community-wide open-source development model.
DSpace captures, stores, indexes, preserves, and redistributes the intellectual output of a university’s research faculty in digital formats. DSpace accepts all forms of digital materials including text, images, video, and audio files. Possible content includes: papers and preprints; conference papers; technical reports; working papers; e-theses; datasets (statistical, geospatial, etc.); images (visual, scientific, etc.); audio files; video files; learning objects; and reformatted digital library collections. The back-end technologies used include: Apache, Tomcat, OpenSSL/mod_ssl; Java 1.3, JSP 1.2, Servlet 2.3; PostgreSQL 7, JDBC (RDBMS); CNRI Handle System 5 (persistent IDs); and Lucene 1.2 (index/search).
Specifications as Prerequisites DSpace depends upon the Java programming language and the PostgreSQL open-source database system. It also requires a number of additional Java-based elements to be installed: Tomcat, which is a Java-based web server; a number of Java code libraries; and Ant, a Java-based build tool. It is recommended that DSpace be installed on a Linux or Unix machine. An experienced system administrator is required to carry out the prerequisite installation.
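Before attempting an installation, a system administrator might run a quick check that the prerequisite tools are on the system path. The sketch below is an assumption-laden illustration: it presumes the executables are named `java`, `ant` and `psql`, which is common but not guaranteed on every system, and it does not verify versions.

```python
import shutil


def check_prerequisites(commands: list) -> dict:
    """Map each command name to True if an executable of that name is on PATH."""
    return {name: shutil.which(name) is not None for name in commands}


# Assumed executable names for Java, Ant and the PostgreSQL client.
DSPACE_COMMANDS = ["java", "ant", "psql"]

for name, found in check_prerequisites(DSPACE_COMMANDS).items():
    status = "found" if found else "missing"
    print(f"{name}: {status}")
```

The same function could be reused with a different list of command names for other packages.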
E-Prints The EPrints software was developed by the Intelligence, Agents, Multimedia Group at the Electronics and Computer Science Department of the University of Southampton. With its origin in the scholarly communication movement, the default EPrints configuration is geared to research papers and creates a research paper archive, but it can be adapted to other purposes and content. GNU EPrints 2.x is free software for creating online archives (http://software.eprints.org/) and is distributed under the GNU General Public License. At the time of writing, the latest version was 2.3, available for download at http://software.eprints.org/download.php
Specifications as Prerequisites
– Any computer capable of running GNU/Linux or a similar operating system. The faster the better, but even an Intel Pentium II processor will provide good performance.
– A GNU operating system. GNU/Linux (a very advanced and free UNIX-like operating system) works well and is, in fact, the development platform.
– Apache WWW server.
– Perl programming language, along with a number of additional modules.
– mod_perl module for Apache, which significantly increases the performance of Perl scripts.
– MySQL database.
Greenstone Digital Library (http://greenstone.org) The Greenstone Digital Library software is produced by the New Zealand Digital Library Project at the University of Waikato, and distributed in cooperation with UNESCO and the Humanities Library Project. Greenstone Digital Library is open-source software available under the terms of the GNU General Public License. It can serve existing digital library collections and build new ones. It provides a new way of organizing information and publishing it on the Internet or on CD-ROM. The New Zealand Digital Library web site (http://nzdl.org) contains numerous example collections, all created with the Greenstone software, which are publicly available for anyone to peruse.
Specifications as Prerequisites
– Greenstone runs on Windows and Unix platforms.
– The distribution includes ready-to-use binaries for all versions of Windows and for Linux.
– It also includes complete source code for the system, which can be compiled using Microsoft C++ or gcc.
– Greenstone works with associated software that is also freely available: the Apache Web server and Perl.
4. Planning and Management
Digitisation is a highly specialized and cost-intensive activity and is the first step towards building a digital library. It requires inputs from various branches of knowledge, so the purpose, objectives, and needs of digitisation must be determined clearly. The digitisation proposal should therefore define its goals, scope, benefits, costs, time span, implementation issues, deliverables and target users. Depending on the situation, it may be desirable to continue with traditional collections and at the same time to acquire collections in digital media. Other options may be to buy access to electronic resources, or to develop library portals or subject gateways instead of undertaking a digitisation project. Such a strategy would save the cost and effort of digitisation as well as other recurring administrative costs.
Once a decision for digitisation is finalized, due consideration and importance should be given to factors like reusability, sustenance, verification, interoperability, and documentation both for users as well as for the developer. Some of the pre-requisites for careful planning of digitisation in various steps are described below:
4.1 Feasibility study
The feasibility study should consider not only the availability of tools and expertise, but also factors such as the number or volume of documents to be digitized, the demand for the material to be digitized, the target audience, and users’ requirements. The study should also assess whether the library can take up the work as an in-house project or whether it should be out-sourced.
4.2 Planning the Project
Planning of the project needs to cover managerial planning, which involves sequencing the various tasks, managing their time and monitoring the project. Activities that require managerial planning may include manpower recruitment, conducting the feasibility study, digitisation (whether in-house or out-sourced), procurement of equipment, IPR management issues, organization and integration of content, finding a market, and launching and marketing of services. Additionally, management techniques like PERT, CPM, flow diagrams, and SWOT analysis may be deployed at this stage.
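As a rough illustration of how CPM sequences tasks, the sketch below computes the earliest finish time of each activity and traces the critical path. The task names echo the activities listed above, but the durations (in weeks) and the dependencies are entirely hypothetical.

```python
# Hypothetical tasks: name -> (duration in weeks, list of prerequisite tasks).
TASKS = {
    "feasibility": (4, []),
    "recruitment": (6, ["feasibility"]),
    "procurement": (8, ["feasibility"]),
    "digitisation": (20, ["recruitment", "procurement"]),
    "organisation": (6, ["digitisation"]),
    "launch": (2, ["organisation"]),
}


def critical_path(tasks: dict) -> tuple:
    """Return (total duration, critical path) for an acyclic task graph."""
    finish = {}    # earliest finish time of each task
    previous = {}  # prerequisite that determines each task's start

    def earliest_finish(name: str) -> int:
        if name in finish:
            return finish[name]
        duration, prerequisites = tasks[name]
        start, previous[name] = 0, None
        for dep in prerequisites:
            dep_finish = earliest_finish(dep)
            if dep_finish > start:
                start, previous[name] = dep_finish, dep
        finish[name] = start + duration
        return finish[name]

    for name in tasks:
        earliest_finish(name)

    # Walk back from the task that finishes last.
    current = max(finish, key=finish.get)
    total = finish[current]
    path = []
    while current is not None:
        path.append(current)
        current = previous[current]
    return total, list(reversed(path))


total_weeks, path = critical_path(TASKS)
print(f"Project length: {total_weeks} weeks")
print("Critical path:", " -> ".join(path))
```

With these made-up figures, recruitment has slack because procurement takes longer, so any delay in procurement delays the whole project.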
4.3 Library Automation Hardware and Software Planning
In this activity, the technical specifications should first be worked out before the actual process of digitisation starts, irrespective of whether digitisation is out-sourced or done in-house. Next, the software and hardware requirements for the servers and network components should be worked out, including their financial implications. Further, the connectivity and bandwidth needed for hosting the digitized collection should be planned. To do this, the existing types, formats, standards and practices should be reviewed first.
Then, draft specifications are to be prepared and tested with sample data, and necessary changes made to the specifications on the basis of this testing. Specifications should also be laid down for metadata creation for digital objects and for the digital collection. Digital objects and digital collections typically require descriptive metadata (keywords and descriptors), structural metadata (navigation, contents pages, etc.), and administrative metadata (formats, compression, standards, etc.).
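A draft metadata specification can be tested against sample data with a small script. The sketch below represents one record with the three metadata groups named above and reports any group that is missing; the field names and values are hypothetical, apart from the file name taken from the earlier example.

```python
import json

REQUIRED_GROUPS = ("descriptive", "administrative", "structural")


def missing_groups(record: dict) -> list:
    """List the required metadata groups that are absent or empty."""
    return [group for group in REQUIRED_GROUPS if not record.get(group)]


# Hypothetical sample record for one scanned article.
record = {
    "file": "guptadkv6n2.pdf",
    "descriptive": {
        "author": "D. K. Gupta",
        "keywords": ["digitisation", "libraries"],
    },
    "administrative": {
        "file_format": "PDF",
        "compression": "none",
        "source": "print journal issue",
    },
    "structural": {
        "volume": 6,
        "issue": 2,
        "contents_page": "index.html",
    },
}

print(json.dumps(record, indent=2))
print("Missing groups:", missing_groups(record))
```

Storing such a record as a separate file next to the image is one way of linking external metadata to a digital object.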
4.4 Human Resources Planning
Human resource planning depends on whether the library is outsourcing the process of digitisation or carrying it out in-house. The areas to be planned are: training of existing staff, recruitment of new staff with the desired skills, and the staff time involved. It should be noted that project management continues to be an important issue even if the digitisation work is outsourced. The project team may be further divided into groups with clearly defined responsibilities. Finally, to facilitate unambiguous communication among the groups and the staff, a well-defined reporting structure should be laid down.
4.5 Financial planning
Financial planning is very important and covers various types of costs, such as the cost of migration from one medium to another and from one computer to another, and the cost of hosting the services and maintaining them.
4.6 Purchase of Hardware and Software
The choice of tools, equipment, and technology should take into account aspects such as the software for search and access, storage and back-up devices, the network equipment required, and other related items.
The software may be acquired or developed in-house. After the software has been acquired or developed, the following operations are carried out:
– Installation of hardware and software;
– Installation of the network required for hosting the digitized collection.
– Consider bandwidth requirements that depend upon the media offered by the digital library. While simple text requires relatively low bandwidth to deliver content, images and video require large bandwidth; and
– Installation of other components
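The bandwidth consideration in the list above can be made concrete with a simple estimate of delivery time. The file sizes and the 2 Mbit/s link speed below are hypothetical, and the calculation ignores protocol overhead and concurrent users.

```python
def transfer_seconds(size_megabytes: float, bandwidth_mbps: float) -> float:
    """Ideal time to deliver a file: size in bits divided by link speed in bits/s."""
    bits = size_megabytes * 8_000_000            # 1 MB taken as 10**6 bytes
    return bits / (bandwidth_mbps * 1_000_000)   # Mbit/s converted to bit/s


# Hypothetical object sizes in megabytes.
SAMPLES = {"text page": 0.05, "scanned image": 5, "video clip": 500}

for label, size in SAMPLES.items():
    seconds = transfer_seconds(size, bandwidth_mbps=2)
    print(f"{label}: {seconds:.1f} s at 2 Mbit/s")
```

Multiplying such figures by the expected number of simultaneous users gives a first approximation of the bandwidth the hosting network must provide.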
4.7 Selection of Material for Digitisation and ‘Born Digital’
The first and foremost step in the execution of the project is to identify, select, and prioritise the documents to be digitized. If documents are already available in digital form, they can easily be converted to other formats, although it is generally more economical to buy e-media, where available, than to convert the material. Where the organisation is itself creating content, strategies are to be laid down to capture ‘born digital’ data. If the selected material is from external sources, IPR issues need to be resolved. If the material being digitized is not in the public domain, it becomes necessary to obtain permission for digitisation from the publishers or data suppliers. A decision should also be taken on whether to OCR the digitized images. Further, bound volumes of journals, manuscripts, deteriorating collections, oversized material, etc., would require highly specialized equipment and manpower.
4.8 Placement and Training of Manpower
Since developing and maintaining a digital library is a highly skilled job, there should be no compromise on the quality of the manpower selected for it. It should be noted that even well-qualified staff usually need training to upgrade and sharpen their skills, which implies that the necessary training should form an integral component of the execution of the project.
4.9 Content Creation
The steps involved in content creation include the following:
– Conversion of datasets which are ‘born digital’, e.g. converting an MS Word file into PDF;
– Conversion of the existing printed collections into digital format (digitisation); and
– Identification of vendors in case the digitisation work is to be outsourced.
4.10 Execution of the Project
Once the software, equipment and other infrastructure facilities are installed, and the priorities of the documents for digitisation are laid down, the execution of the project is initiated. The library may use digital library software such as DSpace or Greenstone Digital Library.
5. Challenges and Problems
The most significant challenges in planning and execution of a digitisation project relate to technical limitations, budgetary constraints, copyright considerations, lack of policy guidelines and lastly, the selection of materials for digitisation. Other important issues and problems relate to the selection process, preparation of the materials, cooperation with the publishing sector, the specifications for the digitisation itself, research into improvement of optical character recognition (OCR), research into several file formats to reduce the cost of storage, automatic quality control mechanisms, new language-based techniques for search and retrieval, the digital preservation of the files and the technical infrastructure to support all these aspects.
At the turn of the century a shift in emphasis occurred in digitisation activities. Libraries moved from digitizing highlights to digitizing complete collections. Digitisation projects became larger and therefore project management became a more important issue. Developments in methods and techniques were stabilizing and there was a growing awareness of the problem of long-term preservation of the digital files. Instead of visually attractive materials the libraries started digitizing text materials and audio and video collections. New possibilities for the use of the digitized collections were discovered, such as applications for specific target groups like scientists and students.
In the beginning, a library may have to buy its own scanners and hire its own staff, but experience has shown that scanning is not the core business of a library, so digitisation may have to be outsourced. Because libraries set high standards, it may not always be possible for outside vendors to meet the quality they require. High standards lead to very high quality images, but also to very high costs for scanning and for storage of the large master files.
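The storage cost of master files follows directly from page size, resolution and bit depth. The sketch below computes the uncompressed size of a single page image and scales it to a collection; the A4 page size, 300 dpi resolution and the collection of 100,000 pages are hypothetical planning figures.

```python
def uncompressed_bytes(width_in: float, height_in: float,
                       dpi: int, bits_per_pixel: int) -> int:
    """Raw image size: pixels across x pixels down x bits per pixel, in bytes."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


# A4 page (about 8.27 x 11.69 inches) scanned at 300 dpi.
colour_page = uncompressed_bytes(8.27, 11.69, 300, 24)   # 24-bit colour
bitonal_page = uncompressed_bytes(8.27, 11.69, 300, 1)   # black and white

PAGES = 100_000  # hypothetical collection size
print(f"Colour master:  {colour_page / 1e6:.1f} MB per page")
print(f"Bitonal master: {bitonal_page / 1e6:.2f} MB per page")
print(f"Colour collection: {colour_page * PAGES / 1e12:.2f} TB uncompressed")
```

The gap between the colour and bitonal figures shows why the choice of bit depth, resolution and compression has such a large effect on storage budgets.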
Because of the growing scale of digitisation projects, some of the old basic assumptions have to be reconsidered. There is now a preference for digitizing from microfilm, which mostly delivers lower quality images but is cheaper than digitizing from originals. Research is also being done into alternatives to TIFF, such as JPEG 2000, which reduce the storage capacity needed. For the same reason, the use of only one format for both access and preservation is being considered. For a library to be able to outsource its scanning activities, it is necessary to introduce quality standards that commercial companies can handle. Even when all scanning activities are outsourced, expertise in imaging techniques should still be retained within the library. Further, because of the high standards for digitisation, quality assurance of the digital files must also be at a high level. From the beginning, quality control managers have to check everything that is digitized. Sometimes, in the first small-scale projects, every file is checked and scanned again if the quality is not perfect.
It may be concluded that a better balance between quality, quantity and costs has to be struck if libraries wish to digitize on a large-scale. Digitisation processes have to become more efficient and the only way to do this is to limit expectations and not try to be perfect at all costs.
6. Summary
This module defined the concept and scope of digitisation as the process of converting the content of physical media into digital format, and described the basic approaches practised in libraries. Digitisation involves not only scanning equipment, processes, digital library software, file formats and media types, but also the most important aspect, i.e. the identification and selection of the material to be digitized. The salient features of various digital library software, e.g. DSpace, EPrints, and Greenstone Digital Library, have been discussed. Planning and management of digitisation is the area where the librarian has a major role to play. Some important tips in this area have been provided, which are useful in the digitisation projects initiated by many libraries. These include: feasibility study, library automation hardware and software planning, human resource planning, placement and training of manpower, financial planning, content creation, selection of material for digitisation and ‘born digital’ content, and purchase of hardware and software. Among the challenges and problems faced by librarians, technical limitations, budgetary constraints, copyright considerations, and lack of policy guidelines are worth mentioning. It is concluded that a better balance between quality, quantity and costs has to be struck if libraries wish to digitize on a large scale.
DID YOU KNOW?
- In the digitisation process, selection criteria, particularly those that reflect user needs, are of utmost importance. All the principles that apply to traditional collection development also apply when materials are being selected for digitisation. However, there are several other considerations related to legal, policy, technical, and other resources that become important in a digitisation project.
- Several digital library software packages, such as DSpace, Greenstone Digital Library (GSDL), EPrints and Fedora, are currently available for free download on the Internet.
- The digitisation proposal should define its goals, scope, benefits, costs, time span, implementation issues, deliverables and target users.