39 Scheduling in Linux

Mary Anitha Rajam

 

39.1 Introduction

 

The job of allocating CPU time to different tasks within an operating system is called CPU scheduling. Linux supports preemptive multitasking. Since multitasking is supported, many tasks can be waiting to use the CPU, and since preemption is supported, a process can be removed from the CPU even before it relinquishes the CPU on its own. The process scheduler decides which process runs and when. Scheduling in Linux is not just the running and interrupting of user processes; it also includes the running of various kernel tasks. Kernel tasks are tasks that are requested by a running process as well as tasks that execute internally on behalf of the kernel, such as tasks spawned by Linux’s I/O subsystem. In this module, we learn how Linux schedules user processes, real-time tasks and kernel tasks.

 

39.2 Process Scheduling

 

Linux uses two process-scheduling algorithms for user-defined processes. For non-real-time tasks, a time-sharing algorithm is used for fair preemptive scheduling among multiple processes; this algorithm focuses on fairness. For real-time tasks, a scheduling algorithm in which absolute priorities are more important than fairness is used. We first see how the time-sharing fair algorithm works.

 

39.2.1 Time-Sharing Fair Algorithm (CFS)

 

The Completely Fair Scheduler (CFS) was introduced in the 2.6 series of the Linux kernel (it became the default scheduler in version 2.6.23). CFS is a preemptive, priority-based algorithm with two separate priority ranges: the priority given to real-time processes, and the priority assigned to a process based on its nice value. The real-time priority ranges from 0 to 99 and the nice value ranges from –20 to 19. Smaller nice values indicate higher priorities; a process can choose to be nice to other processes by choosing a larger nice value, thereby lowering its own priority.

 

In earlier versions of UNIX, the core variables used for scheduling were priority and time slice; fairness was not considered. Linux, in contrast, implements a fair scheduling algorithm: all processes are to be treated fairly, and each is allotted a proportion of the processor’s time. CFS calculates how long a process should run as a function of the total number of runnable processes. If there are N runnable processes, each process should be allotted 1/N of the processor’s time. This allotment is then adjusted by weighting each process’s allotment by its nice value.

 

Processes with the default nice value have a weight of 1, and their allotment is unchanged. Processes with a smaller nice value (higher priority) receive a higher weight, and processes with a larger nice value (lower priority) receive a lower weight. When all processes have the same nice value, each process is allotted 1/N of the processor’s time. When the nice values differ, CFS runs each process for a time slice proportional to the process’s weight divided by the total weight of all runnable processes. Therefore, processes with smaller weights get less CPU time and processes with larger weights get more CPU time.

 

To calculate the actual length of time a process runs, another configuration variable called the target latency is used. The target latency is the interval of time within which every runnable task should run at least once. Assume that the target latency is 10 milliseconds and that there are two runnable processes of the same priority. Both processes should run within the 10 milliseconds; since they have the same weight, each receives the same proportion of the processor’s time, so the first process runs for 5 milliseconds and the second runs for the next 5 milliseconds. If there are 10 runnable processes, each runs for 1 millisecond before the cycle repeats. If there are 1,000 processes, each process’s share shrinks to only 10 microseconds.

Running each process for only 10 microseconds is inefficient: at the end of each time slice a context switch takes place, and when slices are this short, the switching costs dominate. Therefore, another configuration variable called the minimum granularity is used. The minimum granularity is the minimum length of time for which any process is allotted the processor. Once a process is assigned the CPU, it runs for at least this minimum time and is not preempted until that time is over, regardless of the target latency. Thus, CFS ensures that switching costs do not grow unacceptably large when the number of runnable processes becomes very large. Fairness, however, is then violated, because not every process may get CPU time within the target latency.

 

39.2.2 Real-Time Scheduling 

 

Linux uses a different scheduling algorithm when real-time processes are present. Linux implements two real-time scheduling classes: first-come, first-served (FCFS) and round robin. In both cases, each process has a priority in addition to its scheduling class. The process with the highest priority is always run; among processes with the same priority, the one that has been waiting longest is scheduled. An FCFS process continues to run until it exits or blocks and is never time-sliced. A round-robin process, in addition, is preempted when its time slice expires and is moved to the end of the scheduling queue for its priority, so round-robin processes of equal priority automatically time-share among themselves. Linux provides soft real-time scheduling; if hard real-time scheduling is needed for special-purpose devices, variants of Linux are used to accommodate it.

 

39.2.3 Kernel Synchronization 

 

We now see how kernel tasks are scheduled. The way the kernel schedules its own operations differs from the way it schedules user processes. Kernel tasks are tasks that run in kernel mode. Kernel-mode execution can occur in two ways:

 

– A running program may request an operating system service, either explicitly via a system call or implicitly, for example, when a page fault occurs

– A device driver may deliver a hardware interrupt that causes the CPU to start executing a kernel-defined handler for that interrupt

 

Kernel synchronization requires a framework that allows kernel tasks to run without violating the integrity of shared data. There are many data structures that are maintained in the kernel.

 

These data structures are shared by all processes that need the kernel’s services, and therefore many tasks may try to access kernel data structures simultaneously. Suppose a task is accessing the file table maintained in the kernel when an interrupt occurs and an interrupt service routine is executed. The interrupt service routine cannot access the file table without risking data corruption. We now look at how this issue is handled. Until version 2.6, the Linux kernel was nonpreemptive: a process running in kernel mode could not be preempted even if a higher-priority process was waiting, and gave up the CPU only when it completed its kernel-mode execution and returned to user mode. With version 2.6, the Linux kernel became fully preemptive: a task can now be preempted while it is running in the kernel. In this case, when a process running in kernel mode is preempted, how are the data structures currently in use by that process protected? Two techniques are included in the Linux kernel to protect these data structures.

 

39.2.3.1 Kernel Synchronization – Technique 1

 

Linux provides spinlocks and semaphores for locking in the kernel. On symmetric multiprocessing (SMP) machines, spinlocks are used. On single-processor machines, spinlocks are replaced by calls that enable and disable kernel preemption: preemption is disabled before the data structure is accessed and re-enabled after the data structure has been used. The calls used for disabling and enabling kernel preemption are preempt_disable() and preempt_enable(), respectively. In addition, the kernel is not preemptible while a kernel-mode task holds a spinlock. Spinlocks, and the enabling and disabling of kernel preemption, are used in the kernel only when a lock is held for a short duration; when a lock must be held for longer, semaphores are used.

 

39.2.3.2 Kernel Synchronization – Technique 2

 

If critical sections occur in interrupt service routines, how are these sections of code protected? The basic tool is the processor’s interrupt-control hardware: by disabling interrupts (or using spinlocks) during a critical section, the kernel avoids the risk of concurrent access to shared data structures. But there is a penalty for disabling interrupts. On most architectures, interrupt enable and disable operations are not cheap, and as long as interrupts remain disabled, all I/O is suspended, all devices wait until interrupts are re-enabled, and performance degrades. Therefore, another mechanism is used.

 

Linux uses a synchronization architecture that allows long critical sections to run for their entire duration without interrupts being disabled. Linux implements this architecture by separating interrupt service routines into two sections: each interrupt service routine is divided into a top half and a bottom half. The top half is the standard interrupt service routine that runs with recursive interrupts disabled; that is, interrupts of the same number are disabled, but other interrupts may run. For example, when packets arrive from the network, the network interface card raises an interrupt; the top half of the interrupt service routine collects the packets, moves them to a buffer in the kernel, and then schedules the bottom half. The bottom half of a service routine runs with all interrupts enabled. Thus, interrupts are never kept disabled for a long duration.

 

A miniature scheduler ensures that bottom halves never interrupt themselves. The bottom-half scheduler is invoked automatically whenever an interrupt service routine exits. If another interrupt occurs while a bottom half is executing, then that interrupt can request that the same bottom half execute, but the execution will be deferred until the one currently running completes. Each execution of the bottom half can be interrupted by a top half but can never be interrupted by a similar bottom half. The top-half/bottom-half architecture is completed by a mechanism for disabling selected bottom halves while executing normal, foreground kernel code.

 

Interrupt handlers can code their critical sections as bottom halves. When the foreground wants to enter a critical section, it can disable any relevant bottom halves to prevent any other critical sections from interrupting it. At the end of the critical section, the kernel can re-enable the bottom halves and run any bottom-half tasks that have been queued up by top-half interrupt service routines during the critical section.

Fig. 39.1 Interrupt Protection Levels

 

Figure 39.1 shows the different protection levels. Top-half interrupt handlers have higher priority than bottom-half interrupt handlers. Each level may be interrupted by code running at a higher level, but will never be interrupted by code running at the same or a lower level.

 

39.3 Summary

 

In this module, we learnt how Linux schedules user processes using the Completely Fair Scheduler, how real-time processes are scheduled, and how Linux handles the scheduling and synchronization of kernel tasks.

 

 

References

  1. Abraham Silberschatz, Peter B. Galvin, Greg Gagne, “Operating System Concepts”, Ninth Edition, John Wiley & Sons Inc., 2012.