10 Digitization – Part I
H G Hosmani and Yatrik Patel
I. Objectives
This module helps users to know about the following concepts related to the technology of digitization:
• To know digitization: basics, concepts and needs
• To know the steps required in the process of digitization
• To understand the technology of digitization
• To know content compression tools and techniques
• To know the use of optical character recognition (OCR) to tune/filter the contents
• To identify the available tools for digitization
• To organize and integrate digital images with digital content
II. Learning Outcomes
After going through this lesson, learners would know the basic concepts, needs and process of digitization. Learners would have an understanding of the technology of digitization that is used for monitoring the quality of a digital image at the time of capture. Learners would gain knowledge of the tools and techniques of image compression that are instrumental in reducing the size of an image. Learners would attain knowledge of OCR (Optical Character Recognition) software tools that are used for transforming scanned textual page images into word processing files. Lastly, learners would gain knowledge about the file formats and media types used in digital libraries.
III. Structure
1. Introduction
2. Digitization: Basics
2.1 Definition
2.2 Needs of Digitization
3. Selection of Material for Digitization
4. Steps in the Process of Digitization
4.1 Scanning
4.2 Indexing
4.3 Store
4.4 Retrieve
5. Digitization: Input and Output Options
5.1 Scanned as Image Only
5.2 Optical Character Recognition (OCR) and Retaining Page Layout
5.2.1 Retaining Layout after OCR
5.2.2 Retaining Page Layout using Acrobat Capture
5.3 Re-keying
6. Technology of Digitization
6.1 Bit Depth or Dynamic Range
6.2 Resolution
6.3 Threshold
6.4 Image Enhancement
7. Compression
7.1 Lossless Compression
7.2 Lossy Compression
7.3 Compression Protocols
7.3.1 TIFF-G4
7.3.2 JPEG (Joint Photographic Experts Group)
7.3.3 LZW (Lempel-Ziv-Welch)
7.3.4 OCR (Optical Character Recognition)
8. File Formats and Media Types
9. Summary
1. Introduction
All recorded information in a traditional library is analogue in nature. The analogue information can include printed books, periodical articles, manuscripts, cards, photographs, vinyl disks, and video and audio tapes. However, when analogue information is fed into a computer, it is broken down into 0s and 1s, changing its character from analogue to digital. These bits of data can be re-combined for manipulation and compressed for storage. Voluminous encyclopedias that take up yards of shelf-space in analogue form can fit into a small space on a computer drive or an optical disc, and can be searched, retrieved, manipulated and sent over the network. One of the most important traits of digital information is that it is not fixed in the way texts printed on paper are. Digital texts are neither final nor finite, and are not fixed either in essence or in form, except when printed out as hard copy.
Flexibility is one of the chief assets of digital information. An endless number of identical copies can be created from a digital file, because a digital file does not decay on copying. Moreover, digital information can be made accessible from remote locations to a large number of users simultaneously.
Digitization is the process of converting the content of physical media (e.g., periodical articles, books, manuscripts, cards, photographs, vinyl disks, etc.) into digital format. In most library applications, Digitization normally results in documents that are accessible from the web site of a library and thus on the Internet. Optical scanners and digital cameras are used to digitize images by translating them into bitmaps. It is also possible to digitize sound, video, graphics, animations, etc.
Digitization is not an end in itself. It is the process that creates a digital image from an analogue original. Selection criteria, particularly those which reflect user needs, are of paramount importance. Therefore, the principles that apply in traditional collection development are applicable when materials are being selected for Digitization. However, several other technical, legal, policy and resource considerations become important in a Digitization project.
Digitization is one of the three important methods of building digitized collections. The other two methods include providing access to electronic resources (whether free or licensed) and creating library portals for important Internet resources.
2. Digitization: Basics
2.1 Definition
The word “digital” describes any system based on discontinuous data or events. Computers are digital machines because at their most basic level they can distinguish between just two values, 0 and 1, or off and on. All data that a computer processes must be encoded digitally as a series of zeroes and ones.
The opposite of digital is analogue. A typical analogue device is a clock in which the hands move continuously around the face. Such a clock is capable of indicating every possible time of the day. In contrast, a digital clock is capable of representing only a finite number of times (every tenth of a second, for example).
As mentioned before, a printed book is an analogue form of information, and its contents need to be converted into digital form. Digitization is the process of converting the content of physical media (e.g., periodical articles, books, manuscripts, cards, photographs, vinyl disks, etc.) to digital formats.
Digitization refers to the process of translating a piece of information such as a book, journal article, sound recording, picture, audio tape or video recording into bits. Bits are the fundamental units of information in a computer system. Converting information into these binary digits is called Digitization, which can be achieved through a variety of existing technologies. A digital image, in turn, is composed of a set of pixels (picture elements), arranged according to a pre-defined ratio of columns and rows. An image file can be managed as a regular computer file and can be retrieved, printed and modified using appropriate software. Further, textual images can be OCRed so as to make their contents searchable.
An image of the physical object is captured using a scanner or digital camera and converted into a digital format that can be stored electronically and accessed via computers. The process of Digitization, however, does not stop at the scanning of physical objects; a considerable amount of work is involved in optimizing the usage of digitized documents. These post-scanning processes are often subsumed in the meaning of Digitization. At other times the word “Digitization” is used in a restricted sense to include only the process of scanning.
2.2 Needs of Digitization
Digitizing a document in print or other physical media (e.g., sound recordings) makes the document more useful as well as more accessible. It is possible for a user to conduct a full-text search on a document that is digitized and OCRed. It is possible to create hyperlinks to lead a reader to related items within the text itself as well as to external resources. Ultimately, Digitization does not mean replacing the traditional library collections and services; rather, it serves to enhance them.
A document can be converted into digital format depending on the objective of Digitization, the end user, the availability of finances, etc. While the objectives of Digitization initiatives differ from organization to organization, the primary objective is to improve access. Other objectives include cost savings, preservation, keeping pace with technology and information sharing. The most significant challenges in planning and executing a Digitization project relate to technical limitations, budgetary constraints, copyright considerations, lack of policy guidelines and, lastly, the selection of materials for Digitization.
While new and emerging technologies allow digital information to be presented in innovative ways, the majority of potential users are unlikely to have access to sophisticated hardware and software. Sharing of information among various institutions is often restricted by the use of incompatible software.
One of the main benefits of Digitization is to preserve rare and fragile objects while providing access to a large number of users simultaneously. Very often, when an object is rare and precious, access is allowed only to a certain category of people. Going digital could allow more users to enjoy the benefit of access. Although Digitization offers great advantages for access, like allowing users to find, retrieve, study and manipulate material, it cannot be considered a good alternative for preservation because of the ever-changing formats, protocols and software used for creating digital objects.
There are several reasons for libraries to go for Digitization, and there are as many ways to create the digitized images, depending on the needs and uses. The prime reason for Digitization is the user's need for convenient access to high-quality information. Other important considerations are:
Quality Preservation: Digital information has the potential for qualitative preservation of information. Preservation-quality images can be scanned at high resolution and bit depth for the best possible quality, and the quality remains the same in spite of repeated use by several users. However, caution needs to be exercised while choosing digitized information as a preservation medium.
Multiple Referencing: Digital information can be used by several users simultaneously.
Wide Area Usage: Digital information can be made accessible to distant users through computer networks over the Internet.
Archival Storage: Digitization is used for the restoration of rare material. Keeping rare books, images and archival material in digitized format is a common practice.
Security Measure: Valuable documents and records are scanned and kept in digital format for safety.
3. Selection of Material for Digitization
To begin the process of Digitization, we first need to select documents for Digitization. The process of selection involves identification, selection and prioritization of the documents that are to be digitized. If an organization generates content, strategies may be adopted to capture data that is “born digital”. If documents are available in digital form, they can easily be converted into other formats. If the selected material is from external sources, IPR issues need to be resolved. If the material being digitized is not available in the public domain, it is important to obtain permission from the publishers and data suppliers for Digitization. The IPR issues must be addressed early in the selection process. Getting permission from publishers and individuals can be time consuming and difficult, and may involve negotiation and payment of copyright fees. Moreover, a decision may be taken on whether to OCR the digitized images. Documents selected for Digitization may already be available in digital format; it is always more economical to buy e-media, if available, than to convert. Moreover, over-sized material, deteriorating collections, bound volumes of journals, manuscripts, etc. would require highly specialized equipment and highly skilled manpower.
The documents to be digitized may include text, line art, photographs, colour images, etc. The selection of documents needs to be reviewed very carefully, considering all the factors of utility, quality, security and cost. Rare and much-in-demand documents and images are selected as the first priority, without considering the quality. Factors that may be considered before selecting different media for Digitization include:
Audio: The sound quality has to be checked and the required corrections made jointly by the subject expert and the computer sound editor.
Video: Video clippings are normally edited on Betamax tapes, from which they can be transferred to digital format. While editing, colour tone and resolution are checked and corrected.
Photographs: The selection of photographs is a very crucial process. High resolution is required for photographic images and slides. In particular, the quality, future need and copyright aspects have to be checked.
Documents: Documents which are much in demand, too fragile to handle, or rare are reviewed and selected for the process. If correcting the literary content demands much input, then the documents are considered for publication rather than Digitization. Moreover, the purpose of all Digitization is related to increased access to digitized materials and value addition. The first consideration for Digitization of documents should be the intellectual significance of the contents in terms of quality, authority, uniqueness, timeliness and demand. The intellectual contents, the physical nature of the source materials and the number of current and potential users are, therefore, major considerations.
4. Steps in the Process of Digitization
The following four steps are involved in the process of Digitization. Software, variously called Document Image Processing (DIP), Electronic Filing System (EFS) or Document Management System (DMS), provides all or most of these functions:
4.1 Scanning
Electronic scanners are used to capture an electronic image into a computer from an original that may be a photograph, printed text, manuscript, etc. An image is “read” or scanned at a predefined resolution and dynamic range. The resulting file, called a “bit-mapped page image”, is formatted (image formats are described elsewhere) and tagged for storage and subsequent retrieval by the software package used for scanning. Acquisition of images through a fax card, electronic camera or other imaging device is also feasible. However, image scanners are the most important and most commonly used components of an imaging system for the transfer of normal paper-based documents.
Fig.1: Scanning using a Flatbed Scanner
Steps in the Process of Scanning using a Flatbed Scanner
Step 1. Place picture on the scanner’s glass
Step 2. Start scanner software
Step 3. Select the area to be scanned
Step 4. Choose the image type
Step 5. Sharpen the image
Step 6. Set the image size
Step 7. Save the scanned image using a desirable format (GIF or JPEG)
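Steps 4 to 7 above can be reproduced in software once the scanner driver has delivered a raw image. The following is a minimal sketch using the Pillow imaging library in Python; the file names and target size are illustrative assumptions, not prescribed by this module.

    # A minimal sketch of steps 4-7, assuming the Pillow library and
    # illustrative file names/sizes.
    from PIL import Image, ImageFilter

    scan = Image.open("scan_raw.tif")                # raw output from the scanner
    scan = scan.convert("RGB")                       # step 4: choose the image type
    scan = scan.filter(ImageFilter.SHARPEN)          # step 5: sharpen the image
    scan = scan.resize((1200, 1600))                 # step 6: set the image size (pixels)
    scan.save("scan_final.jpg", "JPEG", quality=90)  # step 7: save in a desirable format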
4.2 Indexing
If converting a document into an image or text file is considered the first step in the process of imaging, indexing these files comprises the second step. The process of indexing scanned images involves linking a database of scanned images to a text database. Scanned images are just like a set of pictures that need to be related to a text database describing them and their contents. An imaging system typically stores a large amount of unstructured data in a two-file system for storing and retrieving scanned images. The first is a traditional file that has a text description of the image (keywords or descriptors) along with a key to a second file. The second file contains the document location. The user selects a record from the first file using a search algorithm. Once the user selects a record, the application keys into the location index, finds the document and displays it.
Fig. 2: Two-File System in an Image Retrieval System
Most document imaging software packages, through their menu-driven or command-driven interfaces, facilitate elaborate indexing of documents. While some document management systems facilitate selection of indexing terms from the image file, others allow only manual keying-in of indexing terms. Further, many DMS packages provide OCR capabilities for transforming the images into standard ASCII files. The OCRed text then serves as a database for full-text search of the stored images.
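The two-file arrangement described above can be sketched in a few lines of Python. The descriptors, keys and paths below are hypothetical examples, not drawn from any particular DMS:

    # A minimal sketch of the two-file indexing scheme: file 1 maps text
    # descriptors to record keys; file 2 maps keys to image locations.
    index_file = {                      # file 1: text description + key
        "annual report 1998": "DOC001",
        "budget estimate":    "DOC002",
    }
    location_file = {                   # file 2: key -> document location
        "DOC001": "/images/vol1/doc001.tif",
        "DOC002": "/images/vol1/doc002.tif",
    }

    def add_document(descriptor, key, path):
        """Index a newly scanned image under a text descriptor."""
        index_file[descriptor] = key
        location_file[key] = path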
4.3 Store
The most tenacious problem of a document image relates to its file size and, therefore, to its storage. Every part of an electronic page image is saved regardless of the presence or absence of ink. The file size varies directly with the scanning resolution, the size of the area being digitized and the style of graphic file format used to save the image. The scanned images, therefore, need to be transferred from the hard disc of the scanning workstation to external large-capacity storage devices such as optical discs, CD-ROM / DVD-ROM discs, snap servers, etc. Smaller document imaging systems may use offline media, which need to be reloaded when required, or fixed hard disc drives allocated for image storage, while larger document management systems use auto-changers such as optical jukeboxes and tape library systems. The storage required by a scanned image varies and depends upon factors such as scanning resolution, page size, compression ratio and page content. Further, the image storage device may be either remote or local to the retrieval workstation, depending upon the imaging and document management systems used.
4.4 Retrieve
Once scanned images and OCRed text documents have been saved as files, a database is needed for selective retrieval of data contained in one or more fields within each record in the database. Typically, a document imaging system uses at least two files to store and retrieve documents. The first is a traditional file that has a text description of the image along with a key to the second file. The second file contains the document location. The user selects a record from the first file using a search algorithm. Once the user selects a record, the application keys into the location index, finds the document and displays it. Most document management systems provide elaborate search possibilities, including the use of Boolean operators (AND, OR, NOT), proximity operators and wild cards. Users are also allowed to refine their search strategy. Once the required images have been identified, the associated document images can quickly be retrieved from the image storage device for display or printed output.
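A retrieval pass over the same two-file structure might look as follows. This sketch reuses the hypothetical index_file and location_file from the indexing example and supports only the Boolean operators named above:

    # Search the descriptor file, then key into the location file.
    def search(term_a, term_b=None, operator="AND"):
        hits = []
        for descriptor, key in index_file.items():
            in_a = term_a in descriptor
            in_b = term_b in descriptor if term_b else True
            if (operator == "AND" and in_a and in_b) or \
               (operator == "OR" and (in_a or in_b)):
                hits.append(location_file[key])   # resolve key to image path
        return hits

    print(search("report", "1998"))   # -> ['/images/vol1/doc001.tif']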
5. Digitization: Input and Output Options
A document can be converted into digital format depending on the objective of Digitization, the end user, the availability of finances, etc. There are four basic approaches that can be adopted to translate from print to digital:
i) Scanned as image only;
ii) OCR and retaining page layout;
iii) Retaining page layout using Acrobat Capture; and
iv) Re-keying the data.
5.1 Scanned as Image Only
Image only is the lowest-cost option, in which each page is an exact replica of the original source document. Several digital library projects are concerned with providing digital access to materials that already exist in traditional libraries in printed media. Scanned page images are practically the only reasonable solution for institutions such as libraries for converting existing paper collections (legacy documents) without access to the original data in computer-processable formats convertible into HTML / SGML or any other structured or unstructured text. Scanned page images are the natural choice for large-scale conversions in major digital library initiatives. Printed text, pictures and figures are transformed into computer-accessible form using a digital scanner or a digital camera in a process called document imaging or scanning. The digitally scanned page is stored in a file as a bit-mapped page image, irrespective of whether the page contains a photograph, a line drawing or text. A bit-mapped page image is a type of computer graphic, literally an electronic picture of the page, which can most easily be equated to a facsimile image of the page. As such, page images can be read by humans, but not by computers; understandably, “text” in a page image is not searchable on a computer using present-day technology. An image-based implementation also requires a large space for data storage and transmission.
Capturing page images is comparatively easy and inexpensive, and the result is a faithful reproduction of the original, maintaining page integrity and originality. The scanned textual images, however, are not searchable unless they are OCRed, which in itself is a highly error-prone process, especially when it involves scientific texts. Options and technologies for converting print to digital are given separately.
Since OCR is not carried out, the document is not searchable. Most scanning software generates TIFF format by default, which can be converted into PDF using a number of software tools. Scan-to-TIFF/PDF is recommended only when the requirement of the project is to make documents portable and accessible from any computing platform. The images can be browsed through a table-of-contents file composed in HTML that provides links to the scanned image objects.
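The TIFF-to-PDF conversion mentioned here can be done with many tools; the following is a minimal sketch using Pillow, which can write single- and multi-page PDFs. The file names are illustrative assumptions:

    # Convert scanned TIFF pages into one multi-page PDF.
    from PIL import Image

    pages = [Image.open(p).convert("RGB") for p in ("page1.tif", "page2.tif")]
    pages[0].save("document.pdf", "PDF",
                  save_all=True, append_images=pages[1:])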
5.2 Optical Character Recognition (OCR) and Retaining Page Layout
The latest versions of both Xerox's TextBridge and Caere's OmniPage incorporate technology that allows the option of maintaining text and graphics in their original layout, as well as output in plain ASCII and word-processing formats. Output can also include HTML, with attributes like bold, underline and italics retained.
5.2.1 Retaining Layout after OCR
A scanned document is nothing more than a picture of a printed page. It cannot be edited, manipulated or managed on the basis of its contents. In other words, scanned documents have to be referred to by their labels rather than by the characters in the documents. OCR (Optical Character Recognition) programs are software tools used to transform scanned textual page images into word processing files. OCR or text recognition is the process of electronically identifying text in a bit-mapped page image or set of images and generating a file containing that text in ASCII code or in a specified word processing format, leaving the image intact in the process.
5.2.2 Retaining Page Layout using Acrobat Capture
Acrobat Capture 3.0 provides several options for retaining not only the page layout but also the fonts, and for fitting the text into the exact space occupied in the original, so that the scanned and OCRed copy never over- or under-shoots the page. Accordingly, it treats unrecognizable text as images that are pasted in its place. Such images are perfectly readable by anyone looking at the PDF file, but will be absent from the editable and searchable text file. In contrast, ordinary OCR programs represent unrecognized text as a tilde or some other special character in the ASCII output. Acrobat Capture can be used to scan pages as image only, image + text, or normal PDF; all three options retain the page layout.
i) Image Only: The image-only option has already been described above.
ii) Image + Text: In image + text solutions, OCRed text is generated for each image. Each page is an exact replica of the original and is left untouched; the OCRed text sits behind the image and is used for searching. The OCRed text is generally not corrected for errors, since it is used only for searching. The cost involved is much less than for PDF Normal. However, the entire page is a bitmap and neither fonts nor line drawings are vectorised, so the file size of Image + Text PDFs is considerably larger than that of the corresponding PDF Normal files, and pages will not display as quickly or cleanly on screen.
iii) PDF Normal: PDF Normal gives a clear on-screen display. It is searchable, with a significantly smaller file size than Image + Text. The result is not, however, an exact replica of the scanned page. While all graphics and formatting are preserved, substitute fonts may be used where direct matches are not possible. It is a good choice when files need to be posted on the web or otherwise delivered online. If, during the Capture and OCR process, a word cannot be recognized to the specified confidence level, Capture, by default, substitutes a small portion of the original bitmap image. Capture's “best guess” of the suspect word lies behind the bitmap, so that searching and indexing are still possible. However, one cannot guarantee that these bitmapped words are correctly guessed. In addition, the bitmap is somewhat obtrusive and detracts from the “look” of the page. Further, Capture provides the option to correct suspected errors left as bit-mapped images or to leave them untouched.
5.3 Re-keying
A classic solution of this kind would comprise keying-in the data and verifying it. This involves a complete keying of the text, followed by a full rekeying by a different operator; the two keying-in operations might take place simultaneously. The two keyed files are compared and any errors or inconsistencies are corrected. This would guarantee at least 99.9% accuracy; to reach a 99.955% accuracy level, it would normally require full proof-reading of the keyed files, plus table lookups and dictionary spell checks.
Fig.3: Rekeying-in as an Option for Digitization
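The comparison step of double keying can be automated with a file-comparison tool. The following is a minimal sketch using Python's standard difflib module; the transcription file names are hypothetical:

    # Compare two independently keyed transcriptions; any differing
    # lines are flagged as keying inconsistencies to be resolved.
    import difflib

    keyed_a = open("operator_a.txt").read().splitlines()
    keyed_b = open("operator_b.txt").read().splitlines()

    for line in difflib.unified_diff(keyed_a, keyed_b,
                                     fromfile="operator A",
                                     tofile="operator B", lineterm=""):
        print(line)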
6. Technology of Digitization
Digital images, also called “bit-mapped page images”, are “electronic photographs” composed of a set of bits or pixels (picture elements) represented by “0” and “1”. A bit-mapped page image is a true representation of its original in terms of typefaces, illustrations, layout and presentation of the scanned document. As such, the information or contents of a bit-mapped page image cannot be searched or manipulated, unlike text (ASCII) documents. However, an ASCII file can be generated from a bit-mapped page image using optical character recognition (OCR) software such as Xerox's TextBridge or Caere's OmniPage. The quality of a digital image can be monitored at the time of capture through the following factors:
6.1 Bit depth / dynamic range
6.2 Resolution
6.3 Threshold
6.4 Image enhancement
The terminology associated with the technological aspects of Digitization described below is given in the keywords. Students are advised to understand the terminology, especially bit, byte and pixel, before going through the unit.
6.1 Bit Depth or Dynamic Range
The number of bits used to define each pixel determines the bit depth. The greater the bit depth, the greater the number of grey-scale or colour tones that can be represented. Dynamic range is the term used to express the full range of tonal variations, as measured by a densitometer, between the lightest and the darkest parts of a document. Digital images can be captured at varying bit depths per pixel depending upon i) the nature of the source material or document to be scanned; ii) the target audience or users; and iii) the capabilities of the display and print subsystems that are to be used. Bitonal, black & white or binary scanning is generally employed in libraries to scan pages containing text or line drawings. Bitonal or binary scanning represents one bit per pixel, either “0” (black) or “1” (white). Grey-scale scanning is used for reliable reproduction of the intermediate or continuous tones found in black & white photographs. Multiple bits, ranging from 2 to 8, are assigned to each pixel to represent shades of grey in this process. Although each bit is either black or white, as in the case of bitonal images, bits are combined to produce a level of grey in the pixel, that is, black, white or somewhere in between.
Fig. 4: Setting Bit Depth in HP PrecisionScan Pro Scanning Software
Table 1: No. of Bits used for Representing Shades in Colour and Grey-scale Scanning
Lastly, colour scanning can be employed to scan colour photographs. As in the case of grey-scale scanning, multiple bits per pixel, typically 2 (lowest quality) to 8 (highest quality) per primary colour, are used for representing colour. Colour images are evidently more complex than grey-scale images, because they involve encoding shades of each of the three primary colours, i.e. red, green and blue (RGB). If a coloured image is captured at 2 bits per primary colour, each primary colour can have 2² or 4 shades, and each pixel can have 4³ = 64 possible combinations of the three primary colours. Evidently, an increase in bit depth increases the quality of the image captured and the space required to store the resultant image. Generally speaking, 12 bits per pixel (4 bits per primary colour) is considered the minimum pixel depth for a good-quality colour image. Most of today's colour scanners can scan at 24-bit colour (8 bits per primary colour).
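The arithmetic above can be checked directly; this short Python snippet computes the shades per channel and colours per pixel for a few bit depths:

    # Shades per primary colour and colours per pixel for a given bit depth.
    for bits_per_channel in (2, 4, 8):
        shades = 2 ** bits_per_channel        # shades of one primary colour
        colours = shades ** 3                 # combinations over R, G and B
        print(bits_per_channel, "bits/channel:", shades, "shades,",
              colours, "colours per pixel")
    # 2 bits/channel -> 4 shades, 64 colours per pixel
    # 8 bits/channel -> 256 shades, 16,777,216 colours (24-bit colour)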
6.2 Resolution
The resolution of an image is defined in terms of the number of pixels (picture elements) in a given area. It is measured in dots per inch (dpi) in the case of an image file, and as the number of pixels in a horizontal line × the number of pixels in a vertical line in the case of the display resolution of a monitor. The higher the dpi set on the scanner, the better the resolution and quality of the image, and the larger the image file.
Regardless of the resolution, the quality of an image can be improved by capturing it in grey-scale. The additional grey-scale data can be processed electronically to sharpen edges, fill in characters, remove extraneous dirt, and remove unwanted page stains or discoloration, so as to create a much higher-quality image than is possible with binary scanning alone. A major drawback of grey-scale is the large amount of data captured. It may be noted that a continuing increase in resolution will not, beyond a point, result in any appreciable gain in image quality, except for an increase in file size. It is thus important to determine the point where sufficient resolution has been used to capture all significant detail present in the source document.
Black & white or bitonal (textual) images are most commonly scanned at 300 dpi, which preserves 99.9% of the information content of a page and can be considered adequate access resolution. Some preservation projects scan at 600 dpi for better quality. A standard SVGA/VGA monitor has a resolution of 640 × 480 pixels, while ultra-high-resolution monitors offer about 2048 × 1664 (about 150 dpi).
Fig. 5: Setting-up Resolution Manually
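The relationship between resolution, bit depth and file size can be made concrete with a small calculation; the sketch below estimates the uncompressed size of a letter-size (8.5 × 11 inch) page scan:

    # Uncompressed file size = pixels x bit depth / 8 bits per byte.
    def raw_size_bytes(width_in, height_in, dpi, bit_depth):
        pixels = (width_in * dpi) * (height_in * dpi)
        return pixels * bit_depth / 8

    for dpi, depth, label in [(300, 1, "bitonal"), (300, 8, "grey-scale"),
                              (300, 24, "colour"), (600, 1, "bitonal")]:
        mb = raw_size_bytes(8.5, 11, dpi, depth) / (1024 * 1024)
        print(dpi, "dpi,", label + ":", round(mb, 1), "MB uncompressed")
    # 300 dpi bitonal comes to about 1 MB, matching the figure cited
    # in the compression section below.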
6.3 Threshold
The threshold setting in bitonal scanning defines the point on a scale, usually ranging from 0 to 255, at which grey values will be interpreted as black or white pixels. In bitonal scanning, resolution and threshold are the key determinants of image quality. Bitonal scanning is best suited to high-contrast documents, such as text and line drawings. Grey-scale or colour scanning is required for continuous-tone or low-contrast documents such as photographs. In grey-scale/colour scanning, resolution and bit depth combine to play significant roles in image quality.
In line art mode, every pixel has only two possible values: black or white. The line art threshold control determines the brightness decision point that decides whether a sampled value will be a black dot or a white dot. The normal threshold default is 128 (the midpoint of the 8-bit 0-255 range). Image intensity values above the threshold become white pixels, and values below the threshold become black pixels. Adjusting the threshold is thus like a brightness setting that determines what is black and what is white.
The threshold for text printed on a coloured background or cheap-quality paper like newsprint has to be kept in a lower range. Reducing the threshold from 128 to about 85 would greatly improve the quality of the scan. Such adjustments would also improve the performance of OCR software.
Fig. 6: Threshold Setting in Bitonal Scanning
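Thresholding of this kind can be sketched in a few lines with Pillow; the file names follow the newsprint example above and the threshold of 85 is the value suggested there:

    # Bitonal thresholding: pixels brighter than THRESHOLD become white,
    # the rest black.
    from PIL import Image

    THRESHOLD = 85                                    # default is usually 128
    grey = Image.open("newsprint.tif").convert("L")   # 8-bit grey-scale
    bitonal = grey.point(lambda v: 255 if v > THRESHOLD else 0).convert("1")
    bitonal.save("newsprint_bw.tif")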
6.4 Image Enhancement
Image enhancement can be used to improve scanned images, at some cost to image authenticity and fidelity. The process of image enhancement is, however, time-consuming; it requires special skills and invariably increases the cost of conversion. Typical image enhancement features available in scanning or image-editing software include filters, tonal reproduction curves, colour management, touch-up, cropping, image sharpening, contrast adjustment, transparent backgrounds, etc. In a page scanned in grey-scale, the text/line art and halftone areas can be decomposed, and each area of the page can be filtered separately to maximize its quality. The text area on a page can be treated with edge-sharpening filters so as to clearly define the character edges; a second filter could be used to remove high-frequency noise; and finally another filter could fill in broken characters. The grey-scale areas of the page could be processed with different filters to maximize the quality of the halftone.
Fig. 7: Sharpening an Image using HP PrecisionScan Pro
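The filters named above have direct counterparts in common imaging libraries. A minimal Pillow sketch, with an assumed file name and contrast factor:

    # Edge sharpening, speckle (high-frequency noise) removal and a
    # contrast lift applied to a grey-scale page scan.
    from PIL import Image, ImageEnhance, ImageFilter

    img = Image.open("page_grey.tif").convert("L")
    img = img.filter(ImageFilter.SHARPEN)           # define character edges
    img = img.filter(ImageFilter.MedianFilter(3))   # remove speckle noise
    img = ImageEnhance.Contrast(img).enhance(1.4)   # lift faded text
    img.save("page_enhanced.tif")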
7. Compression
Image files are evidently larger than textual ASCII files. It is thus necessary to compress image files so as to achieve economic storage, processing and transmission over the network. A black & white image of a page of text scanned at 300 dpi is about 1 MB in size, whereas a text file containing the same information is about 2-3 KB. Image compression is the process of reducing the size of an image by abbreviating repetitive information, such as reducing one or more rows of white bits to a single code. Compression algorithms may be grouped into the following two categories:
7.1 Lossless Compression
Lossless compression encodes repeated information using a mathematical algorithm in such a way that the file can be decompressed into the original image with absolute fidelity. No information is “lost” or “sacrificed” in the process of compression. Lossless compression is primarily used for bitonal images.
7.2 Lossy Compression
Lossy compression discards or minimizes details that are least significant or that may not have an appreciable effect on the quality of the image. This kind of compression is called “lossy” because, when an image compressed using lossy techniques is decompressed, it will not be an exact replica of the original image. Lossy compression is used with grey-scale / colour scanning.
Compression is a necessity in digital imaging but more important is the ability to output or produce the uncompressed true replica of images. This is especially important when images are transferred from one platform to another or are handled by software packages under different operating system.
Uncompressed images often work better than compressed images for various reasons. It is thus suggested that scanned images be stored either uncompressed or, at most, with lossless compression. Further, it is preferable to use one of the standard and widely supported compression protocols rather than a proprietary one, even if the latter offers more efficient compression and better quality. The attributes of the original documents may also be considered while selecting compression techniques: for example, ITU G-4 is designed to compress text, whereas JPEG, GIF and ImagePAC are designed to compress pictures. It is important to ensure the migration of images from one platform to another and from one hardware medium to another. It may be noted that highly compressed files are more prone to corruption than uncompressed files.
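The trade-offs above can be seen by writing the same scan in different formats and comparing the file sizes. A minimal Pillow sketch, with assumed file names:

    # Write one scan as lossless TIFF-G4 (bitonal), lossless PNG and
    # lossy JPEG, then compare the resulting file sizes.
    import os
    from PIL import Image

    img = Image.open("page.tif")
    img.convert("1").save("page_g4.tif", compression="group4")  # lossless, bitonal
    img.convert("RGB").save("page.png")                         # lossless
    img.convert("RGB").save("page.jpg", "JPEG", quality=75)     # lossy

    for f in ("page.tif", "page_g4.tif", "page.png", "page.jpg"):
        print(f, os.path.getsize(f), "bytes")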
7.3 Compression Protocols
The following protocols are commonly used for bitonal, gray scale or colour compression:
7.3.1 TIFF-G4
International Telecommunication Union Group 4 (ITU-G4) is considered the de facto standard compression scheme for black & white bitonal images. An image created as a TIFF and compressed using the ITU-G4 compression technique is called a Group-4 TIFF or TIFF-G4 and is considered the de facto standard for storing bitonal images. TIFF-G4 is a lossless compression scheme. Joint Bi-level Image Group (JBIG) compression (ISO 11544) is another standard technique for bitonal images.
7.3.2 JPEG (Joint Photographic Experts Group)
JPEG is an ISO 10918-1 compression protocol that works by finding areas of the image that have the same tone, shade, colour or other characteristics and representing each such area by a code. Compression is achieved at the cost of some loss of data. Preliminary testing indicates that a compression of about 10 or 15 to one can be achieved without visible degradation of image quality.
7.3.3 LZW (Lempel-Ziv-Welch)
LZW compression uses a table-based lookup algorithm invented by Abraham Lempel, Jacob Ziv and Terry Welch. Two commonly used file formats in which LZW compression appears are the Graphics Interchange Format (GIF) and the Tag Image File Format (TIFF). LZW compression is also suitable for compressing text files. An LZW compression algorithm takes each input sequence of binary digits of a given length (for example, 12 bits) and creates an entry in a table (sometimes called a “dictionary” or “codebook”) for that particular bit pattern, consisting of the pattern itself and a shorter code. As input is read, any pattern that has been read before results in the substitution of the shorter code, effectively compressing the total amount of input to something smaller. The decoding program that uncompresses the file is able to rebuild the table itself by applying the same algorithm as it processes the encoded input.
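The dictionary-building behaviour described above can be demonstrated in a few lines. A minimal LZW encoder sketch in Python (character-based rather than 12-bit, for readability):

    # Seed the codebook with single characters, grow it with each new
    # pattern, and emit shorter codes for repeated patterns.
    def lzw_encode(data):
        codebook = {chr(i): i for i in range(256)}
        pattern, output = "", []
        for ch in data:
            if pattern + ch in codebook:          # known pattern: keep growing
                pattern += ch
            else:
                output.append(codebook[pattern])  # emit code for known prefix
                codebook[pattern + ch] = len(codebook)  # new table entry
                pattern = ch
        if pattern:
            output.append(codebook[pattern])
        return output

    print(lzw_encode("TOBEORNOTTOBEORTOBEORNOT"))  # repeats become short codes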
7.3.4 OCR (Optical Character Recognition)
OCR (Optical Character Recognition) programs are software tools used to transform scanned textual page images into word processing files. OCR or text recognition is the process of electronically identifying text in a bit-mapped page image or set of images and generating a file containing that text in ASCII code or in a specified word processing format, leaving the image intact in the process. OCR is performed in order to make every word in a scanned document readable and fully searchable without having to key in everything manually. Once a bit-mapped page image has gone through the process of OCR, the document can be manipulated and managed by its contents, i.e. using the words available in the text.
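In practice this takes only a few lines with an OCR engine. The sketch below uses the open-source Tesseract engine via the pytesseract wrapper (an assumption; the module itself names commercial tools such as TextBridge and OmniPage), with an illustrative file name:

    # Recognize the text in a scanned page image and write it to a
    # separate text file, leaving the image itself intact.
    from PIL import Image
    import pytesseract          # requires the Tesseract engine installed

    text = pytesseract.image_to_string(Image.open("page.tif"))
    with open("page.txt", "w") as out:
        out.write(text)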
OCR does not actually convert an image into text; rather, it creates a separate file containing the text while leaving the image intact. Four types of OCR technology are prevalent in the market: matrix matching, feature extraction, structural analysis and neural networks.
• Matrix / Template Matching: Compares each character with a template of the same character. Such a system is usually limited to a specific number of fonts, or must be “taught” to recognize a particular font.
• Feature Extraction: Can recognize a character from its structure and shape (angles, points, breaks, etc.) based on a set of rules. The process claims to recognize all fonts.
• Structural analysis: Determines characters on the basis of density gradations or character darkness.
• Neural Networking: Neural networking is a form of artificial intelligence that attempts to mimic the processes of the human mind. Combined with traditional OCR techniques plus pattern recognition, a neural network-based system can perform text recognition and “learn” from its success and failure. Referred to as “Intelligent Character Recognition”, a neural network-based system is being used to recognize hand-written text as well as other traditionally difficult source material. Neural network ICR can contemplate characters in the context of an entire word. Newer ICR combines neural networking with fuzzy logic.
Fig. 8: The image scanner optically captures text images to be recognized. Text images are processed with OCR software and hardware. The process involves three operations: document analysis (extracting individual character images); recognition of these images based on i) templates stored in the OCR database, ii) structure and shape (angles, points, breaks, etc.), iii) density gradations or character darkness, and iv) contextual processing; and output. The output interface is responsible for communicating the OCR system's results to the outside world.
Several software packages now offer the facility of retaining the page layout after a document has been OCRed. The process for retaining the page layout is software dependent. Caere's OmniPage Pro offers two ways of retaining page layout following OCR, called True Page Classic and True Page Easy. True Page Classic places each paragraph within a separate frame of the word processor into which the OCR output is saved. If one wishes to edit anything subsequently, the relevant paragraph box may need to be resized. Easy Edit, however, facilitates editing of pages without the necessity of resizing the boxes, although there are greater chances of spillage over the page. Xerox's TextBridge offers a similar feature called DocuRT, which is broadly equivalent to True Page Easy Edit. The process dismantles the page, OCRs it, and then reassembles it in such a way that all the component parts, such as tabs, columns, tables and graphics, can be used in a text-manipulation package such as a word processor.
There is little doubt that OCR is less accurate than rekeying the data. At an accuracy ratio of 98%, a page of 1800 characters will have 36 errors on average. It is, therefore, imperative to clean up after OCR, unless the original scanned image will be viewed as a page and OCR is being used purely to create a searchable index of words that will be searched via a fuzzy retrieval engine like Excalibur, which is highly tolerant of OCR errors.
Another possibility for cleaned-up OCR is the use of a specialist OCR system such as Prime Recognition. Designed with production OCR in mind, PrimeOCR licenses leading recognition engines and passes the data through several of them, using voting technology along with artificial-intelligence algorithms. Although it takes longer initially, it saves time in the long run, and Prime contends that it improves the results achieved by a single engine by 65-80%. The technology is available at a price that depends upon the number of recognition engines one would like to incorporate. The Michigan Digital Library Production Service used PrimeOCR for placing more than two million pages of SGML-encoded text, and the same number of page images, on the web.
8. File Formats and Media Types
A defined arrangement for discrete sets of data that allows a computer and software to interpret the data is called a file format. Different file formats are used to store different media types like text, images, graphics, pictures, musical works, computer programs, databases, models and designs, video programs, and compound works combining many types of information. Although almost every type of information can be represented in digital form, a few important file formats for text and images typically applicable to a library-based digital library are described in the module “Digital Library Standards”.
9. Summary
Digitization is the first step in the process of building digital libraries. Digitization is also used for preservation and archiving, although it is not considered a good option for these purposes. It is a highly labour-intensive and cost-intensive process that involves several complexities, including copyright and IPR issues. However, digital objects offer numerous benefits in terms of accessibility and search. The documents to be digitized may include text, line art, photographs, colour images, etc. The selection of documents needs to be reviewed very carefully, considering all the factors of utility, quality, security and cost. Rare and much-in-demand documents and images are selected as the first priority, without considering the quality.
The process of Digitization involves four steps, namely scanning, indexing, storage and retrieval. A scanned document is nothing more than a picture of a printed page. It cannot be edited, manipulated or managed on the basis of its contents. In other words, scanned documents have to be referred to by their labels rather than by the characters in the documents. OCR (Optical Character Recognition) programs are software tools used to transform scanned textual page images into word processing files. OCR or text recognition is the process of electronically identifying the text in a bit-mapped page image or set of images and generating a file containing that text in ASCII code or in a specified word processing format, leaving the image intact in the process.
The quality of a digital image can be monitored at the time of capture through four factors, namely i) bit depth / dynamic range; ii) resolution; iii) threshold; and iv) image enhancement. The unit describes these parameters in detail. Image files are evidently larger than textual files; it is thus necessary to compress image files. Image compression is the process of reducing the size of an image by abbreviating repetitive information, such as reducing one or more rows of white bits to a single code. Compression algorithms may be grouped into lossless compression and lossy compression. The unit describes compression technology and protocols.
References and Further Reading
- Arms, C., & Fleischhauer, C. (2005). Digital formats: Factors for sustainability, functionality, and quality. In Archiving Conference (Vol. 2005, No. 1, pp. 222-227). Society for Imaging Science and Technology.
- Arms, W. Y. (1995). Key concepts in the architecture of the digital library. D-lib Magazine, 1(1).
- Arms, W. Y., Blanchi, C., & Overly, E. A. (1997). An architecture for information in digital libraries. D-Lib Magazine, 3(2).
- Arthur, K., Byrne, S., Long, E., Montori, C. Q., & Nadler, J. (2004). Recognizing digitization as a preservation reformatting method. Microform & Imaging Review, 33(4), 171-180.
- Ayris, P. (1998). Guidance for selecting materials for Digitization.
- Battiato, S., Castorina, A., & Mancuso, M. (2003). High dynamic range imaging for digital still camera: an overview. Journal of Electronic Imaging, 12(3), 459-469.
- Bishop, A. P., Van House, N. A., & Buttenfield, B. P. (Eds.). (2003). Digital library use: Social practice in design and evaluation. MIT Press.
- Buchanan, G., Bainbridge, D., Don, K. J., & Witten, I. H. (2005, June). A new framework for building digital library collections. In Proceedings of the 5th ACM/IEEE-CS joint conference on Digital libraries (pp. 23-31). ACM.
- Cassidy, J. C., Tse, F., & Bai, Y. (2013). U.S. Patent No. 8,503,036. Washington, DC: U.S. Patent and Trademark Office.
- Chaudhuri, B. B., & Pal, U. (1997, August). An OCR system to read two Indian language scripts: Bangla and Devnagari (Hindi). In Document Analysis and Recognition, 1997., Proceedings of the Fourth International Conference on (Vol. 2, pp. 1011-1015). IEEE.
- Coyle, K. (2006). Mass digitization of books. The Journal of Academic Librarianship, 32(6), 641-645.
- Crane, G. (1996, April). Building a digital library: The Perseus Project as a case study in the humanities. In Proceedings of the first ACM international conference on Digital libraries (pp. 3-10). ACM.
- Dixon, S. (2007). Digital performance: a history of new media in theater, dance, performance art, and installation. MIT Press (MA).
- Gillesse, R., Rog, J., & Verheusen, A. (2008). Alternative file formats for storing master images of Digitization projects. Den Haag: Koninklijke Bibliotheek, 27.
- Holley, R. (2004). Developing a Digitization framework for your organisation. The Electronic Library, 22(6), 518-522.
- Holley, R. (2009). How good can it get? Analysing and improving OCR accuracy in large scale historic newspaper Digitization programs. D-Lib Magazine, 15(3/4).
- Hughes, L. M. (2004). Digitizing collections: strategic issues for the information manager.
- Janée, G., & Frew, J. (2002, July). The ADEPT digital library architecture. In Proceedings of the 2nd ACM/IEEE-CS joint conference on Digital libraries (pp. 342-350). ACM.
- Joint Photographic Experts Group. ISO 10918-1. JPEG, 1994.
- Kluzner, V., Tzadok, A., Shimony, Y., Walach, E., & Antonacopoulos, A. (2009, July). Word-based adaptive OCR for historical books. In Document Analysis and Recognition, 2009. ICDAR '09. 10th International Conference on (pp. 501-505). IEEE.
- Law, D. (2004). Digital libraries: policies, planning and practice. Ashgate.
- Levy, D. M. (1996). Documents and digital libraries. In 1st ACM International Conference on Digital Libraries (DL) (Vol. 96).
- Lopresti, D. P., & Sandberg, J. S. (1998). U.S. Patent No. 5,748,807. Washington, DC: U.S. Patent and Trademark Office.
- The NCSTRL approach to open architecture for the confederated digital library. (1998). D-Lib Magazine.
- Marchionini, G. (2001). Digital libraries. Communications of the ACM, 44(5), 31.
- McCray, A. T., & Gallagher, M. E. (2001). Principles for digital library development. Communications of the ACM, 44(5), 48-54.
- Mutula, S. M., & Ojedokun, A. A. (2008). Digital libraries. In Information and knowledge management in the digital age: concepts, technologies and African perspectives. Ibadan: Third World Information Services.
- Nagy, G., Nartker, T. A., & Rice, S. V. (1999, December). Optical character recognition: An illustrated guide to the frontier. In Electronic Imaging (pp. 58-69). International Society for Optics and Photonics.
- Parekh, Y. R., & Parekh, P. (2009). Planning for Digital Preservation of Special Collections in Gujarat University Library.
- Pennebaker, W. B., & Mitchell, J. L. (1993). JPEG: Still image data compression standard. Springer.
- Schatz, B. (1995). Building the Interspace: The Illinois digital library project. Communications of the ACM, 38(4), 62-63.
- Schatz, B., & Chen, H. (1996). Building large-scale digital libraries. Computer, 29(5), 22-26.
- Simske, S. J., & Lin, X. (2003, December). Creating digital libraries: content generation and re-mastering. In First International Workshop on Document Image Analysis for Libraries.
- Suleman, H., & Fox, E. A. (2001). A framework for building open digital libraries. D-Lib magazine, 7(12), 1082-9873.
- Terras, M. M. (2008). Digital images for the information professional. Ashgate Publishing, Ltd.
- Vohra, R., & Sharma, A. (2011). Preservation And Conservation of Manuscripts: A Case Study of AC Joshi Library, Panjab University, Chandigarh. Library Herald, 49(2), 158-170.
- Zhou, Y. (2013). Are Your Digital Documents Web Friendly?: Making Scanned Documents Web Accessible. Information technology and Libraries, 29(3), 151-160.