Table of Contents
- Understanding Memory: Physical vs. Virtual
- Address Spaces: User vs. Kernel
- Paging: The Foundation of Virtual Memory
- Memory Allocation Mechanisms
- Kernel Memory Zones
- Page Cache: Optimizing Disk I/O
- Swapping: Extending Memory to Disk
- Advanced Topics: Huge Pages and NUMA
- Memory Management APIs for Developers
- Challenges and Optimizations
- Conclusion
- References
1. Understanding Memory: Physical vs. Virtual
At the hardware level, a computer’s memory is physical memory (RAM)—a finite array of bytes directly accessible by the CPU. However, modern operating systems like Linux abstract physical memory with virtual memory, a layer that gives each process the illusion of owning a contiguous, isolated address space.
Why Virtual Memory?
- Isolation: Processes cannot access each other’s memory, preventing crashes or malicious interference.
- Abstraction: Physical memory is managed as a pool, allowing the kernel to allocate and reallocate it dynamically.
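- Overcommitment: The total virtual memory promised to all processes can exceed physical RAM, because a page is backed by a physical frame only when it is first touched, and idle pages can be swapped out to disk.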
2. Address Spaces: User vs. Kernel
Every process in Linux has its own virtual address space, divided into two regions:
User Space
- The lower portion of the address space (e.g., 0 to 3GB with the default split on 32-bit x86, or 0 to 2^47-1 on x86-64 with 4-level page tables).
- Contains process-specific code, data, heap, and stack.
- Accessible only by the process itself (in user mode).
Kernel Space
- The upper portion of the address space (e.g., 3GB to 4GB on 32-bit systems).
- Shared across all processes and accessible only in kernel mode (via system calls or interrupts).
- Directly mapped to physical memory via a linear mapping (for most architectures), allowing the kernel to access physical RAM without complex translation.

Conceptual layout of a 32-bit process address space: user space (lower) and kernel space (upper).
3. Paging: The Foundation of Virtual Memory
Paging is the mechanism that maps virtual addresses to physical addresses. It breaks memory into fixed-size pages (virtual) and frames (physical), typically 4KB (but configurable to 2MB/1GB with huge pages).
How Paging Works
- Page Tables: The kernel maintains hierarchical page tables for each process to map virtual pages to physical frames. On 64-bit systems, this hierarchy includes:
  - PGD (Page Global Directory): Root of the table.
  - P4D (Page Level 4 Directory), PUD (Page Upper Directory), PMD (Page Middle Directory): Intermediate levels. Configurations with fewer hardware levels fold the unused ones away (e.g., P4D is folded on x86-64 with 4-level paging).
  - PTE (Page Table Entry): Leaf nodes containing the physical frame number and flags (e.g., read/write, present).
- Address Translation: When the CPU accesses a virtual address, the Memory Management Unit (MMU) first checks its Translation Lookaside Buffer (TLB). On a TLB miss, it walks the page tables to find the corresponding physical frame. If the PTE is marked “not present,” a page fault occurs, triggering the kernel to resolve the missing page (e.g., load from disk or allocate a new frame).
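The index arithmetic behind a page-table walk can be sketched in a few lines. The example below assumes x86-64 with 4-level paging, 4KB pages, and 9 index bits per level (48-bit virtual addresses); the function name `split_virtual_address` is made up for illustration and is not a kernel API.

```python
# Sketch: split an x86-64 virtual address into 4-level page-table indices.
# Assumptions: 4 KiB pages, 9 index bits per level, 48-bit virtual addresses.

PAGE_SHIFT = 12                         # 4 KiB pages -> 12 offset bits
BITS_PER_LEVEL = 9                      # 512 entries per table
LEVELS = ("pgd", "pud", "pmd", "pte")   # top level first


def split_virtual_address(vaddr: int) -> dict:
    """Return the table index used at each level plus the offset in the page."""
    parts = {"offset": vaddr & ((1 << PAGE_SHIFT) - 1)}
    # Walk from the lowest level (pte) upward; each level sits 9 bits higher.
    for depth, name in enumerate(reversed(LEVELS)):
        shift = PAGE_SHIFT + depth * BITS_PER_LEVEL
        parts[name] = (vaddr >> shift) & ((1 << BITS_PER_LEVEL) - 1)
    return parts


if __name__ == "__main__":
    print(split_virtual_address(0x00007F1234567ABC))
```

The MMU uses each index to pick one entry in the table at that level; the offset is appended unchanged to the physical frame address found in the PTE.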
Page Faults
Page faults are common and essential. They handle:
- Demand Paging: Loading pages into memory only when accessed (reduces initial memory usage).
- Copy-on-Write (CoW): Sharing pages between processes until modified (e.g., after fork()).
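Copy-on-write can be illustrated with a toy model. The classes below (`Frame`, `AddressSpace`) are hypothetical and exist only for this sketch; a reference count stands in for the kernel's page sharing state, and the `write` method plays the role of the write-fault handler.

```python
# Toy copy-on-write model. Not kernel code: all names are illustrative.

class Frame:
    """A simulated physical frame with a reference count."""

    def __init__(self, data: bytes):
        self.data = bytearray(data)
        self.refcount = 1


class AddressSpace:
    """Maps virtual page numbers to (possibly shared) frames."""

    def __init__(self):
        self.pages = {}

    def map_page(self, vpn: int, data: bytes = b""):
        self.pages[vpn] = Frame(data)

    def fork(self) -> "AddressSpace":
        # The child shares every frame with the parent; nothing is copied yet.
        child = AddressSpace()
        for vpn, frame in self.pages.items():
            frame.refcount += 1
            child.pages[vpn] = frame
        return child

    def read(self, vpn: int) -> bytes:
        return bytes(self.pages[vpn].data)

    def write(self, vpn: int, data: bytes):
        frame = self.pages[vpn]
        if frame.refcount > 1:
            # Simulated write fault on a shared page: take a private copy.
            frame.refcount -= 1
            frame = Frame(frame.data)
            self.pages[vpn] = frame
        frame.data[:] = data
```

A short usage example: after `child = parent.fork()`, both address spaces point at the same `Frame` objects; the first `child.write(...)` to a page copies only that page and leaves the parent's data untouched.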
4. Memory Allocation Mechanisms
The kernel provides multiple allocators to meet diverse memory needs, from large contiguous frames to tiny kernel objects.
4.1 The Buddy System: Managing Physical Pages
The buddy system allocates physical memory in blocks of sizes that are powers of two (e.g., 1, 2, 4, 8 pages). It minimizes external fragmentation by merging freed blocks.
How It Works
- Free Lists: The kernel maintains free lists for each block size (e.g., free_area[0] for 1-page blocks, free_area[1] for 2-page blocks).
- Allocation: A request for n pages is rounded up to the next power of two. If no free block of that size exists, the smallest larger free block is split in half repeatedly into “buddies” until a block of the required size is produced; the unused halves go back onto the free lists.
- Deallocation: A freed block is merged with its “buddy” (the adjacent block of the same size it was split from) whenever that buddy is also free, forming larger blocks and reducing fragmentation.
Example
- Request: 3 pages. The request is rounded up to 4 pages. If no 4-page block is free, an 8-page block is split into two 4-page buddies: one is allocated and the other is placed on the 4-page free list.
- Freeing: When the 4-page block is freed and its buddy is still free, the two merge back into an 8-page block.
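The split-and-merge behavior can be sketched as a small simulation. This is a simplified model, not the kernel's implementation: `BuddyAllocator`, `order_for`, and the 16-page arena are assumptions made for the example, and blocks are identified by their starting page number.

```python
# Toy buddy allocator over a 16-page arena. Names are illustrative only.

MAX_ORDER = 4  # largest block is 2**4 = 16 pages in this model


class BuddyAllocator:
    def __init__(self, max_order: int = MAX_ORDER):
        self.max_order = max_order
        # free_area[k] holds the start pages of free blocks of 2**k pages.
        self.free_area = [set() for _ in range(max_order + 1)]
        self.free_area[max_order].add(0)

    @staticmethod
    def order_for(pages: int) -> int:
        """Smallest order whose block (2**order pages) fits the request."""
        return max(pages - 1, 0).bit_length()

    def alloc(self, pages: int):
        order = self.order_for(pages)
        # Find the smallest free block that is large enough.
        current = order
        while current <= self.max_order and not self.free_area[current]:
            current += 1
        if current > self.max_order:
            raise MemoryError("no free block large enough")
        start = min(self.free_area[current])
        self.free_area[current].remove(start)
        # Split down to the requested order; the upper half of each split
        # becomes a free buddy on the next-smaller list.
        while current > order:
            current -= 1
            self.free_area[current].add(start + (1 << current))
        return start, order

    def free(self, start: int, order: int):
        # Merge with the buddy for as long as the buddy is also free.
        while order < self.max_order:
            buddy = start ^ (1 << order)
            if buddy not in self.free_area[order]:
                break
            self.free_area[order].remove(buddy)
            start = min(start, buddy)
            order += 1
        self.free_area[order].add(start)
```

In this model, `alloc(3)` on a fresh allocator returns a 4-page block at page 0 and leaves a free 4-page block at page 4 and a free 8-page block at page 8; freeing it merges everything back into the single 16-page block. The buddy of a block is found by flipping one bit of its start page (`start ^ (1 << order)`), which is why block sizes must be powers of two.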
Limitations
- Internal Fragmentation: Wasting space for requests not aligned to power-of-two sizes (e.g., a 3-page request uses a 4-page block, wasting 1 page).
4.2 The Slab Allocator: Efficient Small Object Allocation
The buddy system is inefficient for small allocations (e.g., 100 bytes). The slab allocator optimizes for frequently allocated kernel objects (e.g., inode, task_struct).
Key Concepts
- Slab Caches: Per-object-type caches (e.g., inode_cache for inodes, filp for struct file objects).
- Slabs: Contiguous memory blocks (allocated via the buddy system) divided into fixed-size objects.
- Reuse: Allocates objects from cached slabs, avoiding repeated initialization overhead.
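The reuse idea can be shown with a toy object cache. `SlabCache` below is a hypothetical class for illustration; plain dictionaries stand in for fixed-size kernel objects, and `_grow` stands in for requesting pages from the buddy system.

```python
# Toy slab cache: hands out pre-allocated object slots and recycles them.
# Not the kernel's data structures; names are illustrative.

class SlabCache:
    def __init__(self, name: str, objects_per_slab: int = 4):
        self.name = name
        self.objects_per_slab = objects_per_slab
        self.free_objects = []     # object slots ready for reuse
        self.slabs_allocated = 0   # how often we went to the "buddy system"

    def _grow(self):
        # In the kernel this step would take pages from the buddy system
        # and carve them into fixed-size objects.
        self.slabs_allocated += 1
        self.free_objects.extend({} for _ in range(self.objects_per_slab))

    def alloc(self) -> dict:
        if not self.free_objects:
            self._grow()
        return self.free_objects.pop()

    def free(self, obj: dict):
        obj.clear()                # reset the slot, keep the memory
        self.free_objects.append(obj)
```

Freeing an object and allocating again returns the same slot without growing the cache, which is the property that makes slab allocation cheap for short-lived objects.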
SLUB: The Modern Slab Allocator
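Linux uses SLUB (the “unqueued” slab allocator) as its slab allocator. It simplifies management, reduces metadata overhead, and scales better on many-core systems than the older SLAB design. SLOB, a compact allocator aimed at small embedded systems, and SLAB have both been removed from recent mainline kernels.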
5. Kernel Memory Zones
Physical memory is not uniform. Hardware constraints (e.g., DMA devices with limited addressability) require the kernel to partition memory into zones:
- ZONE_DMA: For DMA devices that can only address the first 16MB (x86).
- ZONE_DMA32: For 32-bit DMA devices needing access to up to 4GB (64-bit systems).
- ZONE_NORMAL: Directly mapped to kernel virtual address space (most critical allocations).
- ZONE_HIGHMEM: Physical memory beyond the kernel’s linear mapping (32-bit systems only). Accessed via temporary mappings.
The kernel uses GFP_* flags (e.g., GFP_DMA, GFP_HIGHUSER) to specify which zones to allocate from.
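As a rough illustration of the zone boundaries listed above, the sketch below classifies a physical address on a 64-bit x86 system and shows the usual fallback direction, where a request that may use a higher zone can also be satisfied from lower ones. The function names are hypothetical, and the boundaries are the x86 values from the list; other architectures differ.

```python
# Sketch of x86-64 zone boundaries. Illustrative only, not a kernel API.

MiB = 1 << 20
GiB = 1 << 30

ZONE_ORDER = ["ZONE_DMA", "ZONE_DMA32", "ZONE_NORMAL"]  # lowest first


def zone_for_physical_address(paddr: int) -> str:
    """Return the zone a physical address falls into (x86-64 assumption)."""
    if paddr < 16 * MiB:
        return "ZONE_DMA"
    if paddr < 4 * GiB:
        return "ZONE_DMA32"
    return "ZONE_NORMAL"


def candidate_zones(highest_allowed: str) -> list:
    """Zones tried for a request, from the highest allowed zone downward."""
    idx = ZONE_ORDER.index(highest_allowed)
    return list(reversed(ZONE_ORDER[: idx + 1]))
```

For example, `candidate_zones("ZONE_NORMAL")` lists all three zones, while a request restricted to `ZONE_DMA` can only ever be served from the first 16MB.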
6. Page Cache: Optimizing Disk I/O
The page cache is a critical optimization that caches disk data in memory, reducing slow disk accesses.
How It Works
- Caching Reads: When a file is read, data is stored in the page cache. Subsequent reads fetch from memory.
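- Write-Behind Caching: Writes update the cached page and mark it dirty; dirty pages are flushed to disk asynchronously by kernel writeback threads (pdflush in older kernels, per-device flusher threads since 2.6.32), improving write performance.
- Invalidation: Cached pages are dropped or refreshed when they no longer match the backing store (e.g., after a direct I/O write that bypasses the cache, or a truncation). An ordinary write() does not invalidate a page; it modifies the cached copy.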
Management
The kernel tracks cached pages using address_space objects (per-file) and reclaims memory with an approximate LRU (Least Recently Used) scheme: pages sit on active and inactive lists, and inactive pages are evicted first.
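The two-list idea can be sketched with a small cache model. This is a deliberate simplification under stated assumptions: `TwoListLRU` and its `loader` callback are hypothetical, a page is promoted to the active list on its second access, and real kernels add demotion, dirty-page handling, and much more.

```python
# Toy two-list LRU page cache. Illustrative only.

from collections import OrderedDict


class TwoListLRU:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.inactive = OrderedDict()  # oldest entry first
        self.active = OrderedDict()    # oldest entry first

    def access(self, key, loader):
        """Return cached data for key, calling loader(key) on a miss."""
        if key in self.active:
            self.active.move_to_end(key)
            return self.active[key]
        if key in self.inactive:
            # Second access: promote from the inactive to the active list.
            value = self.inactive.pop(key)
            self.active[key] = value
            return value
        value = loader(key)            # cache miss: simulated disk read
        self._make_room()
        self.inactive[key] = value
        return value

    def _make_room(self):
        # Evict inactive pages first; touch the active list only if needed.
        while len(self.inactive) + len(self.active) >= self.capacity:
            if self.inactive:
                self.inactive.popitem(last=False)
            else:
                self.active.popitem(last=False)
```

A page that is read once and never again stays on the inactive list and is evicted quickly, so a single large sequential read does not push frequently used pages out of the cache.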
7. Swapping: Extending Memory to Disk
When physical memory runs low, the kernel writes least recently used (LRU) anonymous pages, such as heap and stack pages, to swap space (a disk partition or file), freeing physical frames for active processes.
Swap Process
- Swap Cache: Tracks pages pending write to disk to avoid redundant I/O.
- Swappiness: A kernel parameter (vm.swappiness, 0–100 on older kernels and 0–200 since Linux 5.8) that controls swap aggressiveness (lower = prefer reclaiming page cache, higher = prefer swapping).
Risks
Excessive swapping (“thrashing”) degrades performance, as disk is orders of magnitude slower than RAM.
8. Advanced Topics: Huge Pages and NUMA
Huge Pages
Huge pages (2MB/1GB) reduce address-translation overhead: each TLB entry covers far more memory, and fewer page table entries are needed to map the same region. They benefit memory-intensive workloads (e.g., databases, VMs).
- Transparent Huge Pages (THP): The kernel automatically backs eligible memory regions with huge pages. Many distributions enable it by default, either for all mappings or only for regions that opt in via madvise().
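A quick calculation shows how much the page size changes the number of mappings needed. The helper below is illustrative; it simply counts how many pages of a given size are required to cover a region.

```python
# How many leaf mappings does it take to cover 1 GiB at each page size?

KiB, MiB, GiB = 1 << 10, 1 << 20, 1 << 30


def mappings_needed(region_bytes: int, page_bytes: int) -> int:
    """Number of pages (and leaf page-table entries) covering the region."""
    return -(-region_bytes // page_bytes)  # ceiling division


for label, size in (("4 KiB", 4 * KiB), ("2 MiB", 2 * MiB), ("1 GiB", GiB)):
    print(f"{label} pages: {mappings_needed(GiB, size)} mappings for 1 GiB")
```

Covering 1GB takes 262,144 mappings with 4KB pages, 512 with 2MB pages, and a single mapping with a 1GB page, which is why huge pages relieve TLB pressure for large working sets.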
NUMA (Non-Uniform Memory Access)
On multi-socket systems, memory near a CPU (local node) is faster to access than remote memory. The kernel is NUMA-aware:
- Node Allocation: Allocates memory from the local node first.
- NUMA Balancing: Migrates pages to local nodes if remote access is frequent.
9. Memory Management APIs for Developers
Kernel developers use specialized APIs to allocate memory:
- kmalloc(size, flags): Allocates small, physically contiguous memory (slab-based).
- __get_free_pages(gfp_mask, order): Allocates 2^order physical pages (buddy system).
- vmalloc(size): Allocates virtually contiguous memory (may be physically fragmented, used for large allocations).
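The order argument of __get_free_pages() is a power-of-two exponent, not a page count. The sketch below mirrors the idea behind the kernel's get_order() helper in Python, assuming 4KB pages; `wasted_bytes` is a made-up helper that shows the rounding cost.

```python
# Sketch: convert a byte size into a buddy-system order (4 KiB pages assumed).

PAGE_SIZE = 4096  # assumption: 4 KiB base pages


def get_order(size_bytes: int) -> int:
    """Smallest order such that 2**order pages hold size_bytes."""
    pages = max(1, -(-size_bytes // PAGE_SIZE))  # round up to whole pages
    return (pages - 1).bit_length()


def wasted_bytes(size_bytes: int) -> int:
    """Bytes allocated beyond the request because of power-of-two rounding."""
    return (PAGE_SIZE << get_order(size_bytes)) - size_bytes
```

For instance, a 12KB request (three pages) needs order 2, so a full 16KB block is allocated and 4KB goes unused.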
10. Challenges and Optimizations
- Fragmentation: Mitigated by the buddy system (merging) and slab allocators (reuse).
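- OOM Killer: When memory is exhausted, the kernel terminates the process with the highest oom_score, which is driven mainly by memory usage and can be biased per process through oom_score_adj.
- Memory Pressure: Monitored via /proc/vmstat counters (e.g., pgmajfault, pswpin, pswpout) or the si/so columns of the vmstat tool.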
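Victim selection can be approximated with a simplified scoring model. The sketch below is loosely modeled on the kernel's badness heuristic (resident, swapped, and page-table pages, plus an oom_score_adj bias scaled by total memory); it is not the kernel's exact code, and the process names and numbers in the usage note are invented.

```python
# Simplified OOM scoring model. Illustrative; real kernels apply more rules.

OOM_SCORE_ADJ_MIN = -1000  # a process with this adjustment is never chosen


def badness(rss_pages, swap_pages, pgtable_pages, oom_score_adj, total_pages):
    """Return a score in pages, or None if the process is exempt."""
    if oom_score_adj == OOM_SCORE_ADJ_MIN:
        return None
    points = rss_pages + swap_pages + pgtable_pages
    # The adjustment shifts the score by a fraction of total memory.
    points += oom_score_adj * total_pages // 1000
    return points


def select_victim(processes: dict, total_pages: int):
    """Pick the name of the process with the highest score, if any."""
    scored = {}
    for name, (rss, swap, pgtables, adj) in processes.items():
        score = badness(rss, swap, pgtables, adj, total_pages)
        if score is not None:
            scored[name] = score
    return max(scored, key=scored.get) if scored else None
```

With hypothetical processes `{"db": (400_000, 0, 1_000, 0), "browser": (300_000, 50_000, 2_000, 200), "sshd": (2_000, 0, 100, -1000)}` and one million total pages, the model picks "browser": its positive adjustment outweighs the database's larger resident set, and "sshd" is exempt.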
11. Conclusion
Linux memory management is a masterpiece of engineering, balancing efficiency, flexibility, and hardware constraints. From paging to the buddy system, and from the page cache to NUMA, each component works in harmony to keep systems responsive. Whether you’re tuning a server or debugging a kernel module, a deep understanding of these mechanisms is invaluable.
12. References
- Bovet, D. P., & Cesati, M. (2005). Understanding the Linux Kernel (3rd ed.). O’Reilly.
- Love, R. (2010). Linux Kernel Development (3rd ed.). Pearson.
- Linux Kernel Documentation: Documentation/mm/ (Documentation/vm/ in older kernels)
- LWN.net: Memory Management Series
This blog is a high-level overview. For implementation details, refer to the Linux kernel source code and official documentation.