37 Mass Storage Management

Mary Anitha Rajam

 

37.1  Introduction

 

In this module, we learn different aspects related to mass storage management. The file system is the most visible part of the operating system. The design of a file system can be divided into three major components. One is the design of the application and operating-system interface that supports the implementation of the file system. The second is the design of the data structures that are part of the operating system and are used for the implementation of the file system. The third is the design of the data structures and algorithms related to the disk on which the file system is kept. In this module, we learn different aspects of disk management such as disk initialization, booting from disk and bad-block recovery. We also learn how swap space is managed on disks and about the different RAID levels.

 

37.2  Disk Management – Disk Formatting 

 

A new disk is just a platter of magnetic recording material; nothing is written on it. The disk has to be formatted first. This is called low-level formatting or physical formatting. Physical formatting divides a disk into sectors that the disk controller can read and write. This formatting is usually done by the manufacturer. When the disk is divided into sectors, a special data structure is written for each sector. Each sector has a header, a data area and a trailer. A sector number is written in the header and an error-correcting code (ECC) is written in the trailer.

 

When the controller writes a sector, it calculates the ECC from the bytes of data being written and stores the ECC in the trailer of the sector. When the sector is read again, the ECC is recalculated and compared with the value stored in the trailer. If the stored value and the recalculated ECC differ, the sector may be corrupted. If only a few bits of data are in error, the ECC can be used to correct them, and the controller reports a recoverable soft error. If the error cannot be corrected, the sector is declared to be a bad sector.
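
As a rough illustration of this write/read check, the sketch below uses a simple additive checksum in place of a real error-correcting code (a real ECC can also locate and repair a few corrupted bits); the sector layout, field names and sizes are assumptions, not those of any real controller:

```c
#include <stdint.h>
#include <stddef.h>

#define SECTOR_DATA_SIZE 512   /* assumed size of one sector's data area */

/* Hypothetical on-disk layout of one sector: header, data area, trailer. */
struct sector {
    uint32_t header_sector_number;   /* written during low-level formatting */
    uint8_t  data[SECTOR_DATA_SIZE];
    uint32_t trailer_check;          /* here: a simple checksum, not a real ECC */
};

/* Compute the check value over the data area. */
static uint32_t compute_check(const uint8_t *data, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += data[i];
    return sum;
}

/* On write: store the check value in the trailer. */
void sector_write_check(struct sector *s)
{
    s->trailer_check = compute_check(s->data, SECTOR_DATA_SIZE);
}

/* On read: recompute and compare; a mismatch signals a possibly bad sector. */
int sector_read_ok(const struct sector *s)
{
    return compute_check(s->data, SECTOR_DATA_SIZE) == s->trailer_check;
}
```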

 

Many hard disks are low-level formatted as a part of the manufacturing process. If the low-level formatting is not done by the manufacturer, the controller can do it. When the controller low-level formats the disk, the block size can be specified (512, 256, 1024 bytes and so on).

 

In addition to low-level formatting, before a disk can hold files, the operating system still needs to record its own data structures on it. For this, the disk is partitioned into one or more groups of cylinders. Each partition can then be treated as a separate disk, and each partition can have a separate file system. The "creation of a file system", or logical formatting, involves storing initial file-system data structures on the disk. The data structures stored may include maps of free and allocated space (such as a file allocation table, FAT) and an initial empty directory.
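
As a toy illustration of logical formatting, the structures below sketch what might be written at the start of a partition; the superblock fields, the FAT markers and the block counts are hypothetical, not those of any particular file system:

```c
#include <stdint.h>

#define N_BLOCKS   4096           /* assumed number of blocks in the partition */
#define FAT_FREE   0xFFFFFFFFu    /* marker for a free block in this toy FAT */
#define FAT_END    0xFFFFFFFEu    /* marker for the last block of a chain / in-use block */

/* A toy "superblock" describing the partition. */
struct superblock {
    uint32_t block_size;          /* e.g. 512 bytes */
    uint32_t n_blocks;
    uint32_t root_dir_block;      /* block holding the initially empty root directory */
};

/* A FAT-style map: entry i describes block i (free, or next block of a file). */
static uint32_t fat[N_BLOCKS];

/* "Logical formatting": mark every block free and reserve blocks for the
   superblock and the empty root directory. */
void logical_format(struct superblock *sb)
{
    for (uint32_t i = 0; i < N_BLOCKS; i++)
        fat[i] = FAT_FREE;

    sb->block_size     = 512;
    sb->n_blocks       = N_BLOCKS;
    sb->root_dir_block = 1;       /* block 0 assumed to hold the superblock itself */
    fat[0] = FAT_END;             /* superblock block is in use */
    fat[1] = FAT_END;             /* root-directory block is in use */
}
```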

 

37.3  Disk Management – Boot Block

 

For a computer to start running, it needs an initial program to run, called the bootstrap program. The bootstrap program initializes all aspects of the system, finds the operating-system kernel on the disk, loads the kernel into main memory and jumps to an initial address to start its execution. A small bootstrap loader is stored in ROM. This loader brings into memory the full bootstrap program, which is stored in the boot blocks of the disk. A disk that has a boot partition is called a boot disk (system disk).
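
The ROM stage of this process can be pictured roughly as follows; this is only a schematic sketch, and disk_read(), jump_to(), the boot-block location and the load address are hypothetical placeholders rather than real firmware interfaces:

```c
/* Schematic sketch of a ROM bootstrap loader, not real firmware code. */

#define BOOT_BLOCK_START   0        /* assumed location of the boot blocks on the boot disk */
#define BOOT_BLOCK_COUNT   4        /* assumed size of the full bootstrap program, in blocks */
#define BOOT_LOAD_ADDRESS  0x7C00   /* assumed address at which the bootstrap is loaded */

extern void disk_read(int first_block, int n_blocks, void *dest);  /* hypothetical helper */
extern void jump_to(void *address);                                /* hypothetical helper */

void rom_bootstrap(void)
{
    /* 1. Bring the full bootstrap program from the boot blocks into memory. */
    disk_read(BOOT_BLOCK_START, BOOT_BLOCK_COUNT, (void *)BOOT_LOAD_ADDRESS);

    /* 2. Transfer control to it; the full bootstrap then locates the kernel,
          loads it into main memory, and jumps to its entry point. */
    jump_to((void *)BOOT_LOAD_ADDRESS);
}
```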

 

Apart from booting, another disk-management task is the handling of bad blocks. Disks are prone to failure because of their moving parts. If the failure is complete, the failed disk has to be replaced and its data restored from backup. Sometimes only some sectors become defective; these are called bad blocks. One way to find bad blocks is to scan all the blocks manually and identify the defective ones. Once blocks are identified as bad, how are they handled?

 

One method to handle bad blocks is to have the controller maintain a list of bad blocks on the disk. This list is initialized during low-level formatting and is updated over the life of the disk. Low-level formatting also sets aside a few spare sectors, which are used only when bad blocks appear. The controller can be told to replace each bad sector logically with one of the spare sectors.
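
The controller's remapping can be pictured as a small table consulted on every request; the structure and sizes below are a hypothetical sketch, not an actual controller data structure:

```c
#define MAX_BAD 16   /* assumed size of the bad-block list */

/* One remapping entry: a bad logical sector and the spare that replaces it. */
struct spare_map_entry {
    int bad_sector;
    int spare_sector;
};

static struct spare_map_entry spare_map[MAX_BAD];
static int n_bad = 0;

/* Sector sparing: translate a requested sector number, substituting a spare
   if the requested sector is on the bad-block list. */
int remap_sparing(int sector)
{
    for (int i = 0; i < n_bad; i++)
        if (spare_map[i].bad_sector == sector)
            return spare_map[i].spare_sector;
    return sector;   /* not bad: use the sector as-is */
}
```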

 

Replacing a bad sector with one of the spare sectors can be done in two ways: sector sparing and sector slipping. In sector sparing, the contents of the bad block are copied directly to a spare sector. In sector slipping, the contents of the disk block adjacent to the spare sector are copied to the spare sector. The adjacent block (say, b) then becomes free, and the contents of the block adjacent to b are copied into b. This continues until the contents of the bad block are copied to its adjacent block.

 

Figure 37.1 shows an example of sector sparing and sector slipping. Figure 37.1 (a) shows two spare sectors (30, 31) and a defective sector (7). In sector sparing (Figure 37.1 (b)), the contents of sector 7 are copied to spare sector 30. In sector slipping (Figure 37.1 (c)), the contents of sector 29 are copied to sector 30, the contents of sector 28 are copied to sector 29, and so on, until finally the contents of sector 7 are copied to sector 8.

Fig. 37.1 Sector Sparing, Slipping (Source: [1])
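
Sector slipping can likewise be expressed as a simple shift of the affected range; the helper below is a hypothetical sketch that hard-codes the numbers from Figure 37.1 (bad sector 7, first spare sector 30):

```c
/* Sector slipping for the example of Figure 37.1: sector 7 is defective and
   sector 30 is the first spare, so sectors 7..29 all slip down by one. */
#define BAD_SECTOR    7
#define SPARE_SECTOR  30

int remap_slipping(int sector)
{
    if (sector >= BAD_SECTOR && sector < SPARE_SECTOR)
        return sector + 1;   /* logical sector 7 now lives in physical 8, 8 in 9, ... */
    return sector;           /* sectors outside the slipped range are unchanged */
}
```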

 

37.4  Swap-Space Management

 

Virtual memory uses disk space as an extension of main memory. The main memory is usually small compared to secondary storage. When there is not enough space in main memory, partially executed processes can be moved to the swap space kept on a secondary storage device. When there is enough space in main memory again, the processes can be moved back.

 

Swap space can be carved out of the normal file system, or it can be in a separate disk partition. When the disk is partitioned, space is set aside for the swap partition. Some operating systems allow the use of multiple swap spaces; that is, the swap space can be a file within a file system and/or a dedicated swap partition, as in the case of Linux.

 

37.4.1  Swap space – as a large file within the file system 

 

Since the swap space is treated as a file, the normal file-system routines can be used to create it, name it and allocate its space. However, navigating the directory structure and the disk-allocation data structures takes time. Therefore, measures similar to those used to improve file-access speed can be applied here: performance can be improved by caching disk blocks in main memory and by allocating physically contiguous blocks for the swap file.

 

37.4.2  Swap space – as a separate raw partition 

 

A fixed amount of swap space is set aside when the disk is partitioned. There is no file system or directory structure in this raw partition; a swap-space manager allocates and deallocates blocks from it. The swap-space manager uses allocation and deallocation algorithms that are optimized for speed rather than for storage efficiency. Since space for processes is allocated and deallocated from a single large swap area, this may result in internal fragmentation. However, this fragmentation lasts only until the system boots again, since the swap space is reinitialized at boot time.
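
As a rough sketch of such a speed-oriented allocator, the code below scans a block map in a circular (next-fit) fashion; the partition size and the data structure are assumptions for illustration only:

```c
#include <stdbool.h>

#define SWAP_BLOCKS 1024   /* assumed number of blocks in the raw swap partition */

static bool used[SWAP_BLOCKS];   /* true if the block currently holds swapped data */
static int  next_hint = 0;       /* where the last search stopped */

/* Allocate one swap block: a fast circular (next-fit) scan, optimized for
   speed rather than for keeping the partition compactly used. */
int swap_alloc(void)
{
    for (int i = 0; i < SWAP_BLOCKS; i++) {
        int b = (next_hint + i) % SWAP_BLOCKS;
        if (!used[b]) {
            used[b] = true;
            next_hint = (b + 1) % SWAP_BLOCKS;
            return b;
        }
    }
    return -1;   /* swap space exhausted */
}

void swap_free(int block)
{
    used[block] = false;
}
```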

 

The traditional UNIX kernel used swapping, in which entire processes were moved between the disk and main memory. Later, swapping was combined with paging. With demand paging, only the required pages are brought into main memory when needed; therefore, only these pages need to be swapped out and swapped in.

 

4.3 BSD UNIX allocates swap space when a process starts. The swap space holds the text segment (the program) and the data segment. The kernel uses swap maps to track swap-space use; there are separate swap maps for text segments and for data segments. Figure 37.2 shows the swap map used for text segments. The figure shows that equal-sized spaces are allocated for the text segment: allocation is done in fixed-size chunks of 512 KB, and the last chunk may be only partially full.

Fig. 37.2 BSD Text-Segment Swap Map

 

Figure 37.3 shows the swap map used for data segments. The first chunk allocated is of some minimum size (16 KB). Each subsequent chunk is twice the size of the previous chunk, until a predefined maximum size is reached. Allocation is done this way for data segments because the data segment can grow over time.

Fig. 37.3 BSD Data-Segment Swap Map
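
The doubling rule for data-segment chunks can be illustrated with a few lines of code; the 16 KB minimum is taken from the figure, while the 2 MB maximum used here is an assumed value for the predefined limit:

```c
#include <stdio.h>

#define MIN_CHUNK_KB 16      /* first data-segment chunk (from Figure 37.3) */
#define MAX_CHUNK_KB 2048    /* assumed predefined maximum chunk size */

/* Print the sequence of chunk sizes a growing data segment would receive:
   16 KB, 32 KB, 64 KB, ... doubling until the maximum is reached. */
int main(void)
{
    int chunk = MIN_CHUNK_KB;
    for (int i = 0; chunk <= MAX_CHUNK_KB; i++) {
        printf("chunk %d: %d KB\n", i, chunk);
        chunk *= 2;
    }
    return 0;
}
```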

 

Solaris 1 reads text-segment pages (containing code) from the file system, places them in main memory and uses them there. If such a page is selected for pageout, it is simply discarded rather than written to swap space. The next time the page is needed, it is reread from the file system. Swap space is used only for the heap, the stack and the uninitialized data of a process.

 

Solaris 2 allocates swap space only when a page is forced out of physical memory, not when the virtual memory page is first created. Modern computers have more physical memory and therefore, tend to page less.

 

Linux uses swap space only for anonymous memory (memory not backed by any file). The swap area may be kept in a swap file or in a dedicated swap partition. Each swap area consists of a series of 4-KB page slots, which hold swapped pages. A swap map, an array of integer counters, is maintained for each swap area. If the value of a counter is zero, the corresponding page slot is available. If the value is greater than zero, the slot is occupied by a swapped page, and the counter gives the number of processes that map that page.
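
In outline, the per-slot counters behave like the sketch below; this is a simplified illustration, not the kernel's actual swap-map implementation:

```c
#define SWAP_SLOTS 1024            /* assumed number of 4-KB page slots in the swap area */

static int swap_map[SWAP_SLOTS];   /* 0 = free; >0 = number of mappings to the swapped page */

/* Find a free 4-KB slot for a page being swapped out. */
int slot_alloc(void)
{
    for (int i = 0; i < SWAP_SLOTS; i++) {
        if (swap_map[i] == 0) {    /* counter zero: the slot is available */
            swap_map[i] = 1;       /* one process now maps this swapped page */
            return i;
        }
    }
    return -1;                     /* no free slot */
}

/* Another process starts sharing the swapped page stored in this slot. */
void slot_share(int slot)   { swap_map[slot]++; }

/* A mapping goes away; the slot becomes free when the counter drops to zero. */
void slot_release(int slot) { swap_map[slot]--; }
```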

 

37.5 RAID Structure 

 

Disk drives are getting cheaper. Therefore, it is possible to attach many disks to a computer system. Multiple disk drives can provide reliability via redundancy. Redundant arrays of independent disks (RAID) are used to address performance and reliability issues.

 

37.5.1 RAID – Improvement of Reliability via Redundancy 

 

The chance that at least one disk out of N will fail is much higher than the chance that one specific disk will fail. If only one copy of the data is stored, a disk failure results in loss of data. The solution is redundancy: the same data is stored multiple times in different places, so that the extra information can be used to rebuild lost information. The simplest approach to redundancy is to duplicate every disk. This is called mirroring.

 

Every logical disk consists of two physical disks, and every write is carried out on both physical disks. The result is a mirrored volume; the same data is now available on two disks. If one of the two disks fails, the data is read from the second disk while the failed disk is replaced. However, if the second disk fails before the failed first disk is replaced, the data is lost.

 

Therefore, the mean time to data loss of a mirrored volume depends on two quantities: the mean time to failure of the individual disks and the mean time to repair, because data is lost only if the second disk fails within the repair window of the first.
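
Assuming independent disk failures, a rough back-of-the-envelope estimate of the mean time to data loss (MTTDL) of a two-disk mirror is the following; the 100,000-hour mean time to failure and 10-hour repair time are illustrative figures, not measurements:

```latex
% Mean time to data loss of a two-disk mirror, assuming independent failures.
\[
  \mathrm{MTTDL} \approx \frac{\mathrm{MTTF}^{2}}{2 \times \mathrm{MTTR}}
\]
% With assumed figures MTTF = 100,000 hours per disk and MTTR = 10 hours:
\[
  \mathrm{MTTDL} \approx \frac{100{,}000^{2}}{2 \times 10}
                 = 500 \times 10^{6}\ \mbox{hours} \approx 57{,}000\ \mbox{years}
\]
```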

 

37.5.2  Improvement in Performance via Parallelism 

 

When disk mirroring is used, the speed of reading can be increased: a read request can be serviced by either disk, so the two copies can handle different reads in parallel and the number of reads per unit time is doubled.

 

Data striping is used to increase performance via parallelism. Striping can be done at the bit level or at the block level. Bit-level striping splits the bits of each byte across multiple disks: with 8 disks, one bit of each byte is written to each disk, so every disk participates in every access. To read a byte of data, one bit is read from each disk at the same time and the bits are combined into the byte. In block-level striping, the blocks of a file are striped across multiple disks: with n disks, block i of a file goes to disk (i mod n) + 1. Block-level striping is the most common striping mechanism.
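
The block-to-disk mapping used in block-level striping follows directly from the formula above; the small helpers below are a sketch assuming n disks numbered 1 to n:

```c
/* Block-level striping across n disks numbered 1..n:
   block i of a file is stored on disk (i mod n) + 1. */
int disk_for_block(int i, int n)
{
    return (i % n) + 1;
}

/* Position of block i within its disk (how many earlier blocks landed there). */
int offset_on_disk(int i, int n)
{
    return i / n;
}
```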

 

Thus, mirroring provides reliability but is expensive, while striping provides high data-transfer rates but does not improve reliability. Many schemes that combine disk striping with parity bits have been proposed; these schemes are called RAID levels.

 

37.5.3 RAID Level-0 

 

This scheme is often called striping. The file is broken into blocks of data, and the blocks are striped across the disks in the system; that is, not all blocks are kept on the same disk. This scheme provides no redundancy or error detection. Figure 37.4 shows an example of how striping is done in RAID level-0. The file shown in the figure has five blocks (block 0, block 1, block 2, block 3 and block 4). Block 0 is kept on disk 0, block 1 on disk 1, block 2 on disk 0, block 3 on disk 1 and block 4 on disk 0. In this scheme, the speed of access is increased because two blocks can be read from the two disks at the same time.

 

37.5.4 RAID Level-1

 

This scheme is often called mirroring. Each block of the file is also kept on a second disk. This provides complete redundancy of the data. Read performance can be improved by reading file data in parallel, but writing takes more time because the data must be written out twice. This is the most expensive RAID implementation, because it requires twice as much storage space as is needed for storing the file once. Figure 37.5 shows an example of how data is stored in RAID level-1: all five blocks of the file are stored on disk 0 as well as on disk 1. Therefore, the data is redundant.

Fig. 37.4 RAID Level-0

Fig. 37.5 RAID Level-1

 

37.5.5 RAID Level-2

 

In RAID level-2, data is striped across disks as in level-0, but the data is bit-interleaved instead of block-interleaved. To improve reliability, an error-correcting code (ECC) is added and used to monitor the correctness of the information on the disks. Multiple disks record the ECC information, which is used to determine which disk is at fault; a parity disk is then used to reconstruct the corrupted or lost data. Figure 37.6 shows that the data is bit-interleaved: the contents of disk block 0 are spread over two disks, and similarly the contents of the other disk blocks are spread over multiple disks. There are two disks for storing the ECC and a third disk used as a parity disk.

 

37.5.6 RAID Level-3

 

RAID level-3 improves on level-2. Modern disks can already determine whether there is an error, using the ECC stored with each sector. Therefore, the separate ECC disks are not needed; it is enough to include a single parity disk. If a sector is bad, the disk itself tells us, so the parity disk alone is needed to correct the error.

Fig. 37.6 RAID Level-2

37.5.7 RAID Level-4

 

The problem with level-2 and level-3 is bit interleaving: to access even a single file block of data, all the disks must be accessed. Level-4 interleaves file blocks instead, which allows multiple small I/Os to be done at once. It still uses a single disk for parity, but the parity is now calculated over data from multiple blocks, whereas level-2 and level-3 calculate it over a single block.

 

In level-4, since the parity is calculated over data from multiple blocks, if an error is detected, it is necessary to read the other blocks on other disks to reconstruct data.

 

Figure 37.7 shows the difference in the way parity is stored in level-4 compared with level-2 and level-3. From the figure, we see that block 'a' is stored across four different disks in the case of level-3 (as a0, a1, a2, a3); the parity is calculated over block 'a' from all the disks and is stored on the parity disk. In the case of level-4, each block is stored on a different disk: block 'a' is stored on one disk, block 'b' on another, and so on. The parity is calculated using the first bit of block 'a', the first bit of block 'b', the first bit of block 'c' and the first bit of block 'd' (and likewise for the remaining bits).

Fig. 37.7 Level-4 vs. Level-2, 3

 

Reads are therefore simple: to read block A, we read it from disk 0. But if there is an error, we need to read blocks B, C, D and the parity block to reconstruct the correct data for block A.
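
The parity used in RAID levels 3 and 4 (and in level-5 below) is a bitwise XOR over the data blocks of a stripe, which is what makes this reconstruction possible; the sketch below computes a parity block and rebuilds a lost block from the survivors (the block size and the number of data disks are assumed values):

```c
#include <stdint.h>
#include <stddef.h>

#define BLOCK_SIZE 512   /* assumed block size in bytes */
#define DATA_DISKS 4     /* assumed number of data disks, e.g. blocks A, B, C, D */

/* Parity block: byte-wise XOR of the corresponding bytes of all data blocks. */
void compute_parity(uint8_t blocks[DATA_DISKS][BLOCK_SIZE],
                    uint8_t parity[BLOCK_SIZE])
{
    for (size_t i = 0; i < BLOCK_SIZE; i++) {
        uint8_t p = 0;
        for (int d = 0; d < DATA_DISKS; d++)
            p ^= blocks[d][i];
        parity[i] = p;
    }
}

/* Rebuild the block of a failed disk: XOR the parity with the surviving blocks. */
void rebuild_block(uint8_t blocks[DATA_DISKS][BLOCK_SIZE],
                   const uint8_t parity[BLOCK_SIZE],
                   int failed_disk)
{
    for (size_t i = 0; i < BLOCK_SIZE; i++) {
        uint8_t p = parity[i];
        for (int d = 0; d < DATA_DISKS; d++)
            if (d != failed_disk)
                p ^= blocks[d][i];
        blocks[failed_disk][i] = p;   /* recovered data */
    }
}
```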

 

37.5.8 RAID Level-5

 

In level-5, file data is striped and the parity is spread over all the disks; there is no longer a single check (parity) disk. This greatly improves the performance of multiple writes, because they can now proceed in parallel, and it slightly improves reads, because one more disk is available for reading. Figure 37.8 shows the difference between RAID level-4 and level-5. In level-4 there is a single disk that stores all the parity information, whereas in level-5 the parity information is distributed across the disks that store the data, along with the data itself.

Fig. 37.8 RAID Level-5 vs. Level-4
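
One simple way to rotate the parity across the disks is sketched below; real RAID-5 implementations use several different layouts (left-symmetric, right-asymmetric and so on), so this formula is only one illustrative choice:

```c
/* For a stripe numbered s over n disks (numbered 0..n-1), place the parity
   block on a different disk for each stripe so that no single disk becomes
   a parity bottleneck. This rotation is one illustrative layout, not the
   only one used in practice. */
int parity_disk(int stripe, int n_disks)
{
    return stripe % n_disks;
}

/* The k-th data block of stripe s (k = 0..n_disks-2) goes on one of the
   remaining disks, skipping the parity disk. */
int data_disk(int stripe, int k, int n_disks)
{
    int p = parity_disk(stripe, n_disks);
    return (k < p) ? k : k + 1;
}
```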

37.5.9 RAID Level-10

 

This level combines level-0 and level-1. A file's data is striped across multiple disks, which gives good read/write performance, and in addition each stripe is mirrored onto a second disk. This gives the best redundancy and the highest performance, though it is also the most expensive scheme.

 

37.7 Summary

 

In this module, we learnt various aspects of disk management such as formatting, the boot block and bad blocks. We also learnt the methods used for swap-space management. We then learnt about the various RAID levels and their performance.

 

References

[1] Abraham Silberschatz, Peter B. Galvin, Greg Gagne, “Operating System Concepts”, Ninth Edition, John Wiley & Sons Inc., 2012.