Table of Contents
- Linux Kernel Architecture Overview
- Critical Internal Data Structures
- Essential Kernel Functions and Their Roles
- Synchronization Primitives in the Kernel
  - 4.1 Spinlocks
  - 4.2 Mutexes and Semaphores
  - 4.3 Read-Copy-Update (RCU)
- Conclusion
- References
1. Linux Kernel Architecture Overview
1.1 Monolithic Design vs. Microkernels
The Linux kernel follows a monolithic architecture, meaning all core functionality (process management, memory management, file systems, networking, etc.) runs in a single address space (kernel space). This differs from microkernels (e.g., Minix, QNX), where only essential services (scheduling, IPC) run in kernel space, and drivers/filesystems run in user space.
Advantages of monolithic design:
- Lower overhead: No context switches between kernel and user space for core operations.
- Faster communication: Subsystems interact directly via function calls.
Tradeoffs:
- Larger codebase: Harder to maintain, but modularized via loadable kernel modules (LKMs).
1.2 Key Kernel Subsystems
The kernel is organized into interconnected subsystems, each responsible for a specific task:
- Process Management: Handles process creation, scheduling, and termination.
- Memory Management: Manages physical and virtual memory, including allocation, paging, and swapping.
- Virtual File System (VFS): Abstracts different file systems (ext4, Btrfs, NFS) into a unified interface.
- Networking: Implements network protocols (TCP/IP, UDP) and packet processing.
- Device Drivers: Mediates between hardware and the kernel (e.g., storage, GPU, USB drivers).
- Security: Enforces access control (SELinux, AppArmor) and user/group permissions.
2. Critical Internal Data Structures
Data structures are the “backbone” of the kernel, storing and organizing state for subsystems. Below are the most fundamental ones.
2.1 Process Management: struct task_struct
Every process and thread in Linux (a “task” in kernel terms) is represented by a struct task_struct, often called a “process descriptor”. It contains all information about the task, from its ID to a pointer to its memory layout.
Defined in: <linux/sched.h>
Size: ~10KB (64-bit systems), varying with kernel configuration.
Key Fields:
- pid_t pid: Process ID (unique identifier).
- char comm[TASK_COMM_LEN]: Process name (e.g., “bash”).
- unsigned int __state (volatile long state before kernel 5.14): Current scheduling state (e.g., TASK_RUNNING, TASK_INTERRUPTIBLE, TASK_UNINTERRUPTIBLE). Zombie status is tracked separately in exit_state as EXIT_ZOMBIE.
- struct mm_struct *mm: Pointer to the memory management structure (address space); NULL for kernel threads.
- struct task_struct *parent: Pointer to the parent process (e.g., bash is the parent of ls).
- struct list_head children: List of child processes.
- struct files_struct *files: Open file table, which maps file descriptors to struct file * entries.
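Example: When you run ps aux, ps reads the per-process entries under /proc, and the kernel fills them in from each task’s task_struct (pid, comm, state, and so on). Kernel code that needs to visit every process walks the global task list with for_each_process().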
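The task list is an intrusive, circular, doubly linked list: the link node lives inside each descriptor, and the walker recovers the descriptor from the node. The following is a minimal user-space sketch of that idea, not kernel code. struct toy_task, list_add_tail() as written here, and find_pid_by_comm() are simplified stand-ins invented for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for the kernel's struct list_head. */
struct list_head {
    struct list_head *next, *prev;
};

/* Hypothetical, heavily reduced task descriptor (not the real task_struct). */
struct toy_task {
    int pid;
    char comm[16];
    struct list_head tasks;   /* links this task into the global list */
};

/* Recover the enclosing structure from a pointer to one of its members. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

static void list_init(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

/* Insert a node just before head, i.e. at the tail of the circular list. */
static void list_add_tail(struct list_head *node, struct list_head *head)
{
    node->prev = head->prev;
    node->next = head;
    head->prev->next = node;
    head->prev = node;
}

/* Walk the list and return the pid of the first task named comm, or -1. */
static int find_pid_by_comm(struct list_head *head, const char *comm)
{
    for (struct list_head *pos = head->next; pos != head; pos = pos->next) {
        struct toy_task *t = container_of(pos, struct toy_task, tasks);
        if (strcmp(t->comm, comm) == 0)
            return t->pid;
    }
    return -1;
}
```

Embedding the link in the object means one list implementation serves every structure in the kernel, at the cost of the container_of() pointer arithmetic.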
2.2 Memory Management: struct mm_struct and struct page
struct mm_struct
Represents a process’s address space (virtual memory layout). It tracks which memory regions (code, data, stack, mmap’d files) the process can access.
Defined in: <linux/mm_types.h>
Key Fields:
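- pgd_t *pgd: Page Global Directory (top level of the page table hierarchy).
- struct vm_area_struct *mmap: List of memory regions (VMAs), such as [heap] and [stack]. Kernels from 6.1 on keep the VMAs in a maple tree (mm_mt) instead of this list.
- unsigned long total_vm: Total pages mapped in the address space.
- struct rw_semaphore mmap_sem: Read-write semaphore protecting changes to the memory regions (renamed mmap_lock in kernel 5.8).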
struct page
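Represents a single physical page frame (e.g., 4KB on x86). The kernel keeps one struct page per frame; in the flat memory model these descriptors form a single array (mem_map) indexed by page frame number. This array is unrelated to the Page Global Directory, which is the top level of a process’s page tables.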
Defined in: <linux/mm_types.h>
Key Fields:
- unsigned long flags: Status flags (e.g., PG_locked = page is locked, PG_reserved = page is reserved and cannot be swapped).
- atomic_t _count: Reference count (how many users currently hold the page); renamed _refcount in kernel 4.6.
- struct address_space *mapping: Points to the address_space of the owning inode if the page holds file data (e.g., page cache contents).
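To make the relationship between frames and their descriptors concrete, here is a small user-space sketch that assumes 4KB pages and a flat array of descriptors indexed by page frame number (PFN). Every toy_ name is hypothetical, and the plain int reference count stands in for the kernel’s atomic counter.

```c
#include <stddef.h>

#define TOY_PAGE_SHIFT 12   /* assumes 4 KB pages */
#define TOY_NR_PAGES   8    /* tiny "machine" for illustration */

/* Hypothetical, reduced page descriptor (not the real struct page). */
struct toy_page {
    unsigned long flags;
    int refcount;           /* the kernel uses an atomic counter here */
};

/* One descriptor per physical frame, indexed by page frame number. */
static struct toy_page toy_mem_map[TOY_NR_PAGES];

static struct toy_page *toy_pfn_to_page(unsigned long pfn)
{
    return &toy_mem_map[pfn];
}

static unsigned long toy_page_to_pfn(const struct toy_page *page)
{
    return (unsigned long)(page - toy_mem_map);
}

/* Physical address of the first byte of the frame. */
static unsigned long toy_page_to_phys(const struct toy_page *page)
{
    return toy_page_to_pfn(page) << TOY_PAGE_SHIFT;
}

/* Take a reference on the frame. */
static void toy_get_page(struct toy_page *page)
{
    page->refcount++;
}

/* Drop a reference; returns 1 when the last reference is gone. */
static int toy_put_page(struct toy_page *page)
{
    return --page->refcount == 0;
}
```

Because the descriptor’s index is the frame number, converting between a struct page pointer and a physical address is plain pointer arithmetic plus a shift.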
2.3 Virtual File System (VFS): struct file and struct inode
The VFS abstracts file systems, allowing user-space to use open(), read(), etc., regardless of the underlying file system (ext4, NFS, etc.).
struct file
Represents an open file (not the file itself). Each file descriptor (fd) in a process points to a struct file.
Defined in: <linux/fs.h>
Key Fields:
- loff_t f_pos: Current file position (offset for the next read/write).
- const struct file_operations *f_op: Pointer to the file operations table (e.g., read, write, release; release is what runs when the last reference from close() goes away).
- unsigned int f_flags: Open flags (e.g., O_RDONLY, O_NONBLOCK).
- struct path f_path: Path to the file (contains dentry and mnt for the directory entry and mount point).
struct inode
Represents a file’s metadata (on-disk attributes). Unlike struct file, struct inode is unique per file (not per open instance).
Defined in: <linux/fs.h>
Key Fields:
- unsigned long i_ino: Inode number (unique per file system).
- umode_t i_mode: File type and permissions (e.g., S_IFREG for a regular file, 0644 for rw-r--r--).
- loff_t i_size: File size in bytes.
- const struct inode_operations *i_op: Inode operations (e.g., create(), unlink()).
- const struct file_operations *i_fop: Default file operations for the file.
2.4 Networking: struct sk_buff
Network packets are processed using struct sk_buff (socket buffer), the “workhorse” of Linux networking. It encapsulates packet data and metadata as it moves through network layers (L2/L3/L4).
Defined in: <linux/skbuff.h>
Key Fields:
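- unsigned char *head, *data: head = start of the allocated buffer; data = start of the packet data currently in view, which moves as headers are added or stripped.
- tail, end: End of the packet data and end of the allocated buffer (stored as offsets from head rather than pointers on 64-bit builds).
- struct sock *sk: Associated socket, if any (e.g., the TCP socket that generated the packet).
- __be16 protocol: Network protocol (e.g., htons(ETH_P_IP) for IPv4).
- unsigned int len: Total length of the packet data.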
Lifecycle: Packets are allocated with alloc_skb(), modified by network layers (e.g., adding/removing headers), and freed with kfree_skb() when dropped or consume_skb() when handled normally.
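The way the four pointers move is easier to see in code. The sketch below is a user-space model that mirrors what the kernel helpers skb_reserve(), skb_put(), skb_push(), and skb_pull() do to head, data, tail, and len. The toy_ names are hypothetical, and the bounds checks the real helpers perform are omitted for brevity.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical, reduced socket buffer (not the real struct sk_buff). */
struct toy_skb {
    unsigned char *head, *data, *tail, *end;
    unsigned int len;
};

static struct toy_skb *toy_alloc_skb(unsigned int size)
{
    struct toy_skb *skb = malloc(sizeof(*skb));
    if (!skb)
        return NULL;
    skb->head = malloc(size);
    if (!skb->head) {
        free(skb);
        return NULL;
    }
    skb->data = skb->tail = skb->head;   /* empty: no headroom, no data */
    skb->end = skb->head + size;
    skb->len = 0;
    return skb;
}

/* Leave headroom in an empty buffer for headers added later. */
static void toy_skb_reserve(struct toy_skb *skb, unsigned int n)
{
    skb->data += n;
    skb->tail += n;
}

/* Append n bytes at the tail; returns where the caller should write them. */
static unsigned char *toy_skb_put(struct toy_skb *skb, unsigned int n)
{
    unsigned char *old_tail = skb->tail;
    skb->tail += n;
    skb->len += n;
    return old_tail;
}

/* Prepend n bytes (a header) into the headroom. */
static unsigned char *toy_skb_push(struct toy_skb *skb, unsigned int n)
{
    skb->data -= n;
    skb->len += n;
    return skb->data;
}

/* Strip n bytes (a header) from the front, as on the receive path. */
static unsigned char *toy_skb_pull(struct toy_skb *skb, unsigned int n)
{
    skb->data += n;
    skb->len -= n;
    return skb->data;
}

static void toy_kfree_skb(struct toy_skb *skb)
{
    free(skb->head);
    free(skb);
}
```

On transmit, a protocol layer reserves headroom once, appends the payload with put, and each lower layer pushes its own header in front without copying the payload. On receive, each layer pulls its header off and hands the rest upward.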
3. Essential Kernel Functions and Their Roles
3.1 Process Creation: fork(), clone(), and execve()
fork()
Creates a new process by duplicating the calling process (parent). The new process (child) is nearly identical to the parent, sharing code but with a copy-on-write (CoW) address space.
Flow:
- User space calls fork(), which enters the kernel through the sys_fork() system call.
- sys_fork() calls kernel_clone() (do_fork() in older kernels), which in turn calls copy_process() to:
  - Duplicate the parent’s task_struct (with a new pid).
  - Copy the mm_struct (CoW: pages are shared until modified).
  - Add the child to the global task list.
- The new child is then woken and placed on a runqueue in the TASK_RUNNING state.
- fork() returns 0 to the child and the child’s pid to the parent.
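From user space, the two return values and the private (copy-on-write) address space can be observed directly. The following is a small sketch using the standard POSIX fork() and waitpid() calls; fork_cow_demo() and the exit code 7 are arbitrary choices made for illustration.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that modifies its private copy of a variable, then report
 * what the parent still sees. Returns the parent's value, or -1 on error.
 */
static int fork_cow_demo(int *child_exit_code)
{
    int value = 1;
    pid_t pid = fork();

    if (pid < 0)
        return -1;              /* fork failed */

    if (pid == 0) {             /* child: fork() returned 0 */
        value = 99;             /* touches only the child's copy of the page */
        _exit(value == 99 ? 7 : 1);
    }

    /* parent: fork() returned the child's pid */
    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    *child_exit_code = WIFEXITED(status) ? WEXITSTATUS(status) : -1;
    return value;               /* still 1: the child's write was not shared */
}
```

The parent’s value stays 1 because the child’s write triggered a private copy of the page instead of changing shared memory.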
clone()
A more flexible version of fork(), allowing the caller to share resources (e.g., address space, file descriptors) with the child. Used to create threads (via CLONE_VM flag: shares mm_struct).
execve()
Replaces the current process’s address space with a new executable (e.g., when running ls from bash).
Flow:
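- sys_execve() identifies the executable format (ELF binary, #! script), builds a fresh mm_struct, and maps the program’s code and data into it.
- The process’s task_struct (and therefore its pid) is reused, but the old address space is discarded and the registers are reset so execution starts at the new program’s entry point.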
3.2 Memory Allocation: kmalloc(), vmalloc(), and __get_free_page()
kmalloc(size_t size, gfp_t flags)
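Allocates small, physically contiguous chunks of kernel memory from the slab allocator. The upper limit (KMALLOC_MAX_SIZE) depends on kernel version and configuration: historically about 128KB, and typically a few megabytes on current kernels.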
Flags:
- GFP_KERNEL: May sleep while memory is reclaimed; the normal choice in process context.
- GFP_ATOMIC: Never sleeps; required in interrupt context or while holding a spinlock.
Example:
struct my_struct *ptr = kmalloc(sizeof(*ptr), GFP_KERNEL); /* may sleep */
if (!ptr)
        return -ENOMEM; /* allocation failed */
kfree(ptr); /* release once the object is no longer needed */
vmalloc(unsigned long size)
Allocates large regions that are contiguous in virtual address space but not necessarily in physical memory (good for >128KB). Uses page tables to map non-contiguous physical pages to contiguous virtual addresses. Higher overhead than kmalloc().
__get_free_page(gfp_t flags)
Allocates one whole physical page (e.g., 4KB) and returns its kernel virtual address. Use __get_free_pages(gfp_mask, order) for 2^order physically contiguous pages.
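Page-level allocation works in power-of-two blocks, so a byte count has to be converted into an order first (the kernel provides get_order() for this). The sketch below shows that arithmetic in user space, assuming 4KB pages; toy_get_order() and toy_alloc_span() are hypothetical names.

```c
#include <stddef.h>

#define TOY_PAGE_SIZE 4096UL   /* assumes 4 KB pages */

/* Smallest order such that (TOY_PAGE_SIZE << order) >= size. */
static unsigned int toy_get_order(size_t size)
{
    unsigned int order = 0;
    size_t span = TOY_PAGE_SIZE;

    while (span < size) {
        span <<= 1;
        order++;
    }
    return order;
}

/* Bytes actually reserved once the request is rounded up to a block. */
static size_t toy_alloc_span(size_t size)
{
    return TOY_PAGE_SIZE << toy_get_order(size);
}
```

A request for three pages rounds up to order 2, so four pages are reserved. This rounding is one reason large, oddly sized buffers are often better served by vmalloc().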
3.3 VFS Operations: sys_open(), sys_read(), and sys_write()
sys_open(const char *pathname, int flags, mode_t mode)
Opens a file and returns a file descriptor (fd).
Flow:
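- Resolves pathname to a dentry and its inode (via path_lookup() in older kernels, path_openat() in current ones).
- Checks permissions (e.g., against inode->i_mode).
- Creates a struct file and initializes its f_op (from inode->i_fop).
- Installs the struct file in the process’s files_struct (fd table) and returns the new fd.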
sys_read(unsigned int fd, char __user *buf, size_t count)
Reads data from an open file into user-space.
Flow:
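- Looks up the struct file via the fd.
- Calls file->f_op->read(file, buf, count, &pos), where pos starts as file->f_pos and is written back afterwards. File types that provide read_iter() are called through that interface instead.
- The read implementation copies data from kernel space to user space (using copy_to_user()).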
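The read path is essentially a table lookup followed by an indirect call. The following user-space model shows that dispatch under simplifying assumptions: a fixed-size global fd table, an in-memory “file”, and memcpy() standing in for copy_to_user(). All toy_ names are hypothetical.

```c
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

struct toy_file;

/* Per-file-type operation table, in the spirit of struct file_operations. */
struct toy_file_operations {
    ssize_t (*read)(struct toy_file *file, char *buf, size_t count, long *pos);
};

/* Hypothetical, reduced open-file object (not the real struct file). */
struct toy_file {
    long f_pos;
    const struct toy_file_operations *f_op;
    const char *data;   /* backing contents of this in-memory "file" */
    size_t size;
};

/* A read implementation for in-memory files. */
static ssize_t toy_mem_read(struct toy_file *file, char *buf, size_t count,
                            long *pos)
{
    if ((size_t)*pos >= file->size)
        return 0;                           /* end of file */
    if (count > file->size - (size_t)*pos)
        count = file->size - (size_t)*pos;  /* short read near the end */
    memcpy(buf, file->data + *pos, count);  /* stands in for copy_to_user() */
    *pos += (long)count;
    return (ssize_t)count;
}

static const struct toy_file_operations toy_mem_fops = {
    .read = toy_mem_read,
};

#define TOY_MAX_FDS 4
static struct toy_file *toy_fd_table[TOY_MAX_FDS];  /* per process in reality */

/* Model of the read path: fd -> struct file -> f_op->read. */
static ssize_t toy_sys_read(unsigned int fd, char *buf, size_t count)
{
    if (fd >= TOY_MAX_FDS || !toy_fd_table[fd])
        return -1;      /* the kernel would return -EBADF */
    struct toy_file *file = toy_fd_table[fd];
    if (!file->f_op || !file->f_op->read)
        return -1;      /* the kernel would return -EINVAL */
    return file->f_op->read(file, buf, count, &file->f_pos);
}
```

The caller never learns which file system or driver sits behind the fd; swapping the f_op table is all it takes to give the same read() call different behavior.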
3.4 Scheduling: schedule() and context_switch()
schedule()
The kernel’s main scheduler function, responsible for selecting the next process to run.
Logic:
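- Considers the runnable tasks (TASK_RUNNING state) on the current CPU’s runqueue.
- Asks each scheduling class in priority order for a task. Under CFS this is the task that has received the least weighted CPU time (smallest vruntime), found at the leftmost node of a red-black tree rather than by a linear scan.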
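The fairness rule can be illustrated with a toy simulation: always run the entity with the smallest virtual runtime, and advance that runtime more slowly for heavier entities. This is a sketch of the idea only. It uses a linear scan where the kernel uses a red-black tree, and the toy_ names and fixed time slice are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_NICE_0_WEIGHT 1024   /* load weight of a nice-0 task */

/* Hypothetical, reduced scheduling entity. */
struct toy_sched_entity {
    int pid;
    unsigned int weight;   /* larger weight = larger CPU share */
    uint64_t vruntime;     /* weighted runtime in nanoseconds */
};

/* Pick the entity with the smallest vruntime (first one wins a tie). */
static struct toy_sched_entity *toy_pick_next(struct toy_sched_entity *rq,
                                              size_t n)
{
    struct toy_sched_entity *best = NULL;

    for (size_t i = 0; i < n; i++)
        if (!best || rq[i].vruntime < best->vruntime)
            best = &rq[i];
    return best;
}

/* Charge delta_ns of real CPU time, scaled inversely by weight. */
static void toy_account(struct toy_sched_entity *se, uint64_t delta_ns)
{
    se->vruntime += delta_ns * TOY_NICE_0_WEIGHT / se->weight;
}

/* Run `ticks` slices of slice_ns and count how many each entity received. */
static void toy_simulate(struct toy_sched_entity *rq, size_t n,
                         unsigned int ticks, uint64_t slice_ns,
                         unsigned int *slices)
{
    for (unsigned int t = 0; t < ticks; t++) {
        struct toy_sched_entity *next = toy_pick_next(rq, n);
        if (!next)
            return;
        toy_account(next, slice_ns);
        slices[next - rq]++;
    }
}
```

With two entities whose weights are 1024 and 2048, the heavier one accumulates vruntime half as fast and therefore ends up with twice as many slices.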
context_switch(struct rq *rq, struct task_struct *prev, struct task_struct *next)
Switches the CPU from the current process (prev) to the next process (next).
Steps:
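- Switches the address space (page tables) via switch_mm(prev->mm, next->mm, next). Kernel threads have no mm of their own and borrow the previous task’s.
- Saves prev’s registers and stack pointer on its kernel stack (switch_to()).
- Loads next’s saved registers and stack pointer, so execution resumes where next last left off.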
4. Synchronization Primitives in the Kernel
Kernel code runs concurrently (multiple CPUs, interrupts), so synchronization is critical to avoid race conditions.
4.1 Spinlocks
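A busy-waiting lock for short critical sections. spin_lock() disables preemption on the local CPU; it does not disable interrupts. When the lock is also taken from an interrupt handler, use spin_lock_irqsave()/spin_unlock_irqrestore(), which additionally disable local interrupts.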
Usage:
spinlock_t my_lock;
spin_lock_init(&my_lock);
spin_lock(&my_lock); /* Acquire lock (spins until available) */
/* Critical section */
spin_unlock(&my_lock); /* Release lock */
Best for: Interrupt context or very short sections, where spinning is cheaper than putting the task to sleep. Code holding a spinlock must never sleep.
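The busy-waiting idea itself fits in a few lines of portable C11. The sketch below is a user-space test-and-set lock built on atomic_flag. It is not the kernel’s implementation, which uses fairer queued locks and also manages preemption, and the toy_ names are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space test-and-set lock (not the kernel's spinlock_t). */
struct toy_spinlock {
    atomic_flag locked;
};

static void toy_spin_init(struct toy_spinlock *l)
{
    atomic_flag_clear(&l->locked);
}

/* Try once; returns true if the lock was acquired. */
static bool toy_spin_trylock(struct toy_spinlock *l)
{
    /* test_and_set returns the previous value: false means it was free. */
    return !atomic_flag_test_and_set_explicit(&l->locked,
                                              memory_order_acquire);
}

/* Busy-wait until the lock is free. */
static void toy_spin_lock(struct toy_spinlock *l)
{
    while (!toy_spin_trylock(l))
        ;   /* spin: the CPU stays busy rather than sleeping */
}

static void toy_spin_unlock(struct toy_spinlock *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}
```

The acquire/release ordering ensures that writes made inside the critical section are visible to the next holder of the lock.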
4.2 Mutexes and Semaphores
Mutexes
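A sleeping lock for longer sections. Only one task can hold the mutex at a time, and only the holder may release it. Because waiters sleep, a mutex cannot be taken in interrupt context.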
Usage:
struct mutex my_mutex;
mutex_init(&my_mutex);
mutex_lock(&my_mutex); /* Sleeps if lock is held */
/* Critical section */
mutex_unlock(&my_mutex);
Semaphores
A counting lock allowing up to n tasks to access a resource at once. With n=1 it is a binary semaphore, which behaves much like a mutex but has no owner: any context may release it.
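The counting behavior can be sketched with a single atomic integer. The model below implements only the non-blocking operations in user space; the kernel’s down() additionally puts the caller to sleep when no slot is free. The toy_ names are hypothetical, and toy_sem_try_acquire() returns true on success, which is the opposite convention from the kernel’s down_trylock().

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical counting semaphore with only the non-blocking operations. */
struct toy_sem {
    atomic_int count;   /* number of holders that may still enter */
};

static void toy_sema_init(struct toy_sem *s, int n)
{
    atomic_init(&s->count, n);
}

/* Take one slot if available; returns false instead of sleeping. */
static bool toy_sem_try_acquire(struct toy_sem *s)
{
    int c = atomic_load(&s->count);

    while (c > 0) {
        /* On failure, c is refreshed with the current value and we retry. */
        if (atomic_compare_exchange_weak(&s->count, &c, c - 1))
            return true;
    }
    return false;
}

/* Release one slot. Unlike a mutex, any context may call this. */
static void toy_sem_release(struct toy_sem *s)
{
    atomic_fetch_add(&s->count, 1);
}
```

Initialized with n=2, the semaphore admits two holders and turns away a third until one of them releases.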
4.3 Read-Copy-Update (RCU)
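Optimized for read-mostly data structures (e.g., routing tables). Readers access data without taking a lock; a writer modifies a private copy, “publishes” it with a single pointer update, and frees the old version only after every reader that might still see it has finished.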
Readers: Use rcu_read_lock()/rcu_read_unlock() to mark read sections, and rcu_dereference() to fetch the protected pointer.
Writers: Use rcu_assign_pointer() to update pointers, and synchronize_rcu() to wait for all readers to finish before freeing old data.
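The publish, wait, then free sequence can be modeled in user space with C11 atomics. The sketch below is a toy that assumes a single writer and counts active readers in a shared variable. Real RCU deliberately avoids any shared write on the read side and detects grace periods per CPU, so treat this only as an illustration of the ordering. All toy_ names are hypothetical.

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical RCU-protected object. */
struct toy_config {
    int mtu;
};

static _Atomic(struct toy_config *) toy_current;  /* the published pointer */
static atomic_int toy_readers;                    /* readers in a read section */

static void toy_rcu_read_lock(void)
{
    atomic_fetch_add(&toy_readers, 1);
}

static void toy_rcu_read_unlock(void)
{
    atomic_fetch_sub(&toy_readers, 1);
}

/* Reader: dereference the published pointer inside a read section. */
static int toy_read_mtu(void)
{
    toy_rcu_read_lock();
    struct toy_config *c = atomic_load(&toy_current);
    int mtu = c ? c->mtu : -1;   /* -1 if nothing has been published yet */
    toy_rcu_read_unlock();
    return mtu;
}

/* Toy grace period: wait until no reader is inside a read section. */
static void toy_synchronize(void)
{
    while (atomic_load(&toy_readers) != 0)
        ;   /* the kernel tracks quiescent states per CPU instead */
}

/* Writer (single writer assumed): copy, publish, wait, then free. */
static int toy_update_mtu(int mtu)
{
    struct toy_config *fresh = malloc(sizeof(*fresh));
    if (!fresh)
        return -1;
    fresh->mtu = mtu;                                   /* fill in the copy */
    struct toy_config *old = atomic_exchange(&toy_current, fresh); /* publish */
    toy_synchronize();   /* readers that saw the old pointer have finished */
    free(old);           /* free(NULL) is a no-op on the first update */
    return 0;
}
```

Readers never block the writer’s publication; the writer only waits before freeing, which is the step that makes it safe to reuse the old memory.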
5. Conclusion
The Linux kernel’s internal structures and functions are the foundation of its power and flexibility. From task_struct managing processes to sk_buff handling network packets, these components work in harmony to deliver a robust OS.
Understanding them is key for kernel development, debugging, or optimizing system performance. This post only scratches the surface; studying the kernel source and experimenting with LKMs will reveal far more of the kernel’s complexity and ingenuity.
6. References
- Linux Kernel Documentation
- Love, R. (2010). Linux Kernel Development (3rd ed.). Pearson.
- Bovet, D. P., & Cesati, M. (2005). Understanding the Linux Kernel (3rd ed.). O’Reilly.
- LKML (Linux Kernel Mailing List)
- LWN.net Kernel Articles