Table of Contents
-
What is a Process?
1.1 Process vs. Thread
1.2 Process States -
Process Creation and Termination
2.1 Process Creation:fork(),exec(), andclone()
2.2 Copy-on-Write (CoW) infork()
2.3 Process Termination:exit(),wait(), and Zombies -
The Process Control Block:
task_struct
3.1 Key Fields intask_struct
3.2 Task List and PID Management -
CPU Scheduling: Fundamentals
4.1 Goals of Scheduling
4.2 Scheduling Policies in Linux -
The Completely Fair Scheduler (CFS)
5.1 Virtual Runtime (vruntime)
5.2 The Red-Black Tree
5.3 Target Latency and Granularity -
Real-Time Scheduling
6.1 SCHED_FIFO and SCHED_RR
6.2 Priority Inversion and Mitigations -
Scheduling Classes and Kernel Preemption
7.1 Modular Scheduling withsched_class
7.2 Preemptive vs. Voluntary Scheduling -
Challenges and Optimizations
8.1 Handling Many Processes (O(1) vs. CFS)
8.2 NUMA Awareness and Energy Efficiency
1. What is a Process?
A process is an instance of a running program. It includes the program’s code, data, stack, and register state, along with operating system metadata (e.g., PID, priority). The kernel treats each process as an independent entity, but processes can share resources (e.g., memory, file handles) to collaborate.
1.1 Process vs. Thread
In Linux, threads (or “lightweight processes”) are a special case of processes. A thread shares the same address space, file descriptors, and signal handlers as its parent process but has its own stack and register state. Threads are created using clone() with shared resources, whereas full processes use fork().
1.2 Process States
A process transitions between states as it interacts with the kernel. Linux defines the following core states (stored in task_struct->state):
- TASK_RUNNING: The process is either executing on the CPU or waiting in the runqueue to be scheduled.
- TASK_INTERRUPTIBLE: The process is sleeping (blocked) and can be woken by a signal (e.g., waiting for user input).
- TASK_UNINTERRUPTIBLE: The process is sleeping and cannot be woken by a signal (e.g., waiting for disk I/O to complete).
- TASK_STOPPED: The process has been paused (e.g., via
SIGSTOP). - EXIT_ZOMBIE: The process has terminated, but its parent has not yet called
wait()to collect its exit status. - EXIT_DEAD: The final state before the kernel frees the process’s resources.
2. Process Creation and Termination
The kernel provides system calls to create and destroy processes, ensuring resources are managed safely.
2.1 Process Creation: fork(), exec(), and clone()
fork(): Creates a new process by duplicating the parent. The child inherits the parent’s address space, file descriptors, and environment. The only difference is the return value:fork()returns 0 to the child and the child’s PID to the parent.exec(): Replaces the child’s address space with a new program (e.g.,/bin/ls). Afterexec(), the child process runs the new program, discarding the parent’s code/data.clone(): A low-level system call used to create threads. It allows fine-grained control over shared resources (e.g., shared memory, signal handlers) via flags likeCLONE_VM(share address space) orCLONE_FILES(share file descriptors).
2.2 Copy-on-Write (CoW) in fork()
Early implementations of fork() copied the entire parent address space, which was inefficient for short-lived children (e.g., fork() followed by exec()). Linux uses copy-on-write (CoW) to optimize this:
- The parent and child share physical memory pages initially.
- Pages are only copied when either process writes to them (triggering a page fault).
- This reduces overhead, as
exec()often discards the parent’s address space immediately.
2.3 Process Termination: exit(), wait(), and Zombies
exit(status): A process terminates by callingexit(), which:- Closes open file descriptors.
- Releases resources (e.g., memory, semaphores).
- Updates its
task_structstate toEXIT_ZOMBIEand notifies the parent.
waitpid(pid, &status, options): The parent callswaitpid()to collect the child’s exit status and free itstask_struct. If the parent does not callwaitpid(), the child remains a zombie process (stateEXIT_ZOMBIE), leaking kernel resources.- Orphan Processes: If a parent exits before its children, the
initprocess (PID 1) adopts the orphans and callswait()to clean them up.
3. The Process Control Block: task_struct
The kernel tracks every process with a Process Control Block (PCB), implemented as a struct task_struct in Linux. This large data structure (≈1.5KB on 64-bit systems) contains all metadata needed to manage the process.
3.1 Key Fields in task_struct
| Field | Purpose |
|---|---|
pid | Unique Process ID (32-bit integer, 0–32767 by default). |
tgid | Thread Group ID (PID of the main thread in a thread group). |
state | Current process state (e.g., TASK_RUNNING, TASK_INTERRUPTIBLE). |
stack | Kernel stack pointer (each process has a 2-page kernel stack). |
mm | Pointer to struct mm_struct (manages the process’s address space). |
files | Pointer to struct files_struct (open file descriptors). |
sched_class | Scheduling class (e.g., CFS, real-time). |
se | sched_entity (scheduling metadata, e.g., vruntime for CFS). |
3.2 Task List and PID Management
- Task List: All
task_structinstances are linked in a circular doubly linked list called the task list. The kernel traverses this list with macros likefor_each_process(p). - PID Namespaces: Linux supports PID namespaces (e.g., in containers), where processes in a namespace see a isolated view of PIDs. The global PID (visible to the host) is stored in
task_struct->pid, while the namespace-local PID is intask_struct->tgid.
4. CPU Scheduling: Fundamentals
CPU scheduling is the kernel’s process of selecting which runnable process to execute next on the CPU. It is critical for system responsiveness and efficiency.
4.1 Goals of Scheduling
- Fairness: Each process gets a proportional share of CPU time.
- Throughput: Maximize the number of processes completed per unit time.
- Low Latency: Minimize response time for interactive processes (e.g., text editors).
- Real-Time Guarantees: Ensure real-time processes meet deadlines.
4.2 Scheduling Policies in Linux
Linux supports multiple scheduling policies, each optimized for different workloads:
| Policy | Use Case | Priority Range |
|---|---|---|
| SCHED_NORMAL | Default for most processes (CFS). | 100–139 (nice -20→19) |
| SCHED_FIFO | Real-time (first-in-first-out). | 1–99 |
| SCHED_RR | Real-time (round-robin). | 1–99 |
| SCHED_IDLE | Low-priority background tasks. | 0 (lowest) |
5. The Completely Fair Scheduler (CFS)
Introduced in Linux 2.6.23 (2007), the Completely Fair Scheduler (CFS) replaced the earlier O(1) scheduler. CFS is designed for fairness: it ensures each process gets a CPU share proportional to its weight (priority).
5.1 Virtual Runtime (vruntime)
CFS avoids tracking time slices directly. Instead, it uses virtual runtime (vruntime), a measure of how much CPU time a process has “earned” relative to others.
vruntime = actual_runtime * (NICE_0_LOAD / task_load)task_loaddepends on the process’s nice value (lower nice = higher priority = higher load).- Processes with lower
vruntime(less CPU time) are scheduled first.
5.2 The Red-Black Tree
CFS organizes runnable processes in a red-black tree (a self-balancing binary search tree), ordered by vruntime.
- Insertion: When a process becomes runnable, it is inserted into the tree by
vruntime. - Selecting the Next Task:
pick_next_task_cfs()selects the leftmost node (smallestvruntime), ensuring the “most deserving” process runs next. - Tree Rotations: The red-black tree ensures O(log n) insertion/deletion, critical for large numbers of processes.
5.3 Target Latency and Granularity
CFS balances fairness and responsiveness with two parameters:
- Target Latency: The maximum time to schedule all runnable processes once (e.g., 20ms).
- Minimum Granularity: The minimum time a process runs before preemption (e.g., 1ms).
- For
nrunnable processes, each getstarget_latency / nCPU time, but no less thanminimum_granularity.
6. Real-Time Scheduling
Real-time processes (e.g., industrial control, audio processing) require guaranteed response times. Linux provides two real-time policies via sched_setscheduler().
6.1 SCHED_FIFO and SCHED_RR
- SCHED_FIFO (First-In-First-Out):
- Runs until it blocks (e.g., I/O) or yields (via
sched_yield()). - Higher-priority SCHED_FIFO tasks preempt lower-priority ones.
- Runs until it blocks (e.g., I/O) or yields (via
- SCHED_RR (Round-Robin):
- Similar to SCHED_FIFO but with a time slice (e.g., 100ms).
- After the time slice, the task is moved to the end of its priority queue.
6.2 Priority Inversion and Mitigations
Priority inversion occurs when a low-priority real-time task holds a lock needed by a high-priority task, blocking the high-priority task. Linux mitigates this with:
- Priority Inheritance: The low-priority task temporarily inherits the high-priority task’s priority while holding the lock.
- PI Mutexes: Kernel mutexes (
struct mutex) support priority inheritance by default.
7. Scheduling Classes and Kernel Preemption
7.1 Modular Scheduling with sched_class
Linux uses a modular scheduling framework via the sched_class structure, which abstracts scheduling policies. Each policy (CFS, SCHED_FIFO, etc.) implements a sched_class with methods like:
enqueue_task: Add a task to the runqueue.pick_next_task: Select the next task to run.
The kernel iterates oversched_classinstances in priority order (real-time first, then CFS, then SCHED_IDLE) to select the next task.
7.2 Preemptive vs. Voluntary Scheduling
- User-Space Preemption: CFS preempts tasks when their
vruntimeexceeds their fair share, ensuring responsiveness. - Kernel Preemption: Linux 2.6 introduced preemptive kernels, allowing preemption even in kernel mode (e.g., during a system call), except in critical sections (protected by
preempt_disable()). - Voluntary Preemption: Older kernels relied on
schedule()calls in kernel code, leading to latency for real-time tasks.
8. Challenges and Optimizations
8.1 Handling Many Processes (O(1) vs. CFS)
- O(1) Scheduler (Linux 2.4): Used fixed-priority runqueues with bitmap-based selection (O(1) time). However, it struggled with fairness for interactive tasks.
- CFS: Solves fairness but has O(log n) overhead. For systems with 100k+ processes, optimizations like per-CPU runqueues and load balancing reduce contention.
8.2 NUMA Awareness and Energy Efficiency
- NUMA (Non-Uniform Memory Access): CFS balances tasks across CPUs but may need NUMA-aware scheduling to minimize memory access latency (e.g.,
sched_numa). - Energy Efficiency: The
SCHED_POWERSAVEpolicy prioritizes CPU frequency scaling over performance, reducing power consumption.
9. Conclusion
Process management and scheduling are the “brain” of the Linux kernel, ensuring efficient resource utilization and responsiveness. From the task_struct tracking processes to CFS’s fair scheduling and real-time policies, these mechanisms enable Linux to scale from embedded devices to supercomputers.
Understanding these internals helps developers optimize applications (e.g., setting thread priorities) and system administrators troubleshoot performance issues (e.g., zombie processes, priority inversion). As Linux evolves, scheduling continues to adapt—with ongoing work on AI-driven scheduling and better support for heterogeneous CPUs.
10. References
- Love, R. (2010). Linux Kernel Development (3rd ed.). Pearson.
- Bovet, D. P., & Cesati, M. (2005). Understanding the Linux Kernel (3rd ed.). O’Reilly.
- Linux Kernel Documentation: Scheduling
- Corbet, J. (2007). Inside the Linux 2.6 Completely Fair Scheduler. LWN.net.
- Wikipedia: Linux kernel, Completely Fair Scheduler.