Understanding Linux Memory Management: A Comprehensive Guide

Memory management is a fundamental task of any operating system, on par with CPU management. Memory itself stores system instructions, application data, caches, and more.

How does Linux handle memory management? Let's delve deeper into this subject.

Memory Mapping

Can you recall the amount of memory in your current computer? It's a critical factor when purchasing a machine. For instance, my laptop has 8GB of memory.

When we mention memory size, such as 8GB, we refer to physical memory. Also known as main memory, it predominantly consists of Dynamic Random Access Memory (DRAM) in most systems. Only the kernel has direct access to this physical memory. But how do processes access it?

The Linux kernel assigns each process a distinct virtual address space that is contiguous, allowing straightforward access to what is referred to as virtual memory.

This virtual address space is divided into two segments: kernel space and user space. The range of the address space depends on the processor's word length, that is, the maximum data size a single CPU instruction can handle. The common 32-bit and 64-bit systems lay out their virtual address spaces as follows:

In a 32-bit system, the kernel space uses 1GB at the top, with the remaining 3GB allocated to user space. In contrast, both kernel and user spaces in a 64-bit system span 128TB each, occupying the upper and lower segments of the entire memory space, while the central area remains undefined.

You may recall the difference between user mode and kernel mode in processes. A process in user mode can only access user space memory. It can access kernel space memory only upon entering kernel mode. Although the address space for each process includes kernel space, this kernel space ultimately maps to the same physical memory. Therefore, when a process switches to kernel mode, it can seamlessly access kernel space memory.

Given that each process has a vast address space, the cumulative virtual memory of all processes greatly surpasses the physical memory available. Consequently, not all virtual memory receives physical allocation; only that which is actively utilized is assigned physical memory. This allocation occurs through memory mapping.

Memory mapping essentially connects virtual memory addresses to physical memory addresses. To facilitate this mapping, the kernel keeps a page table for each process, detailing the relationship between virtual and physical addresses:

The page table is managed by the CPU's Memory Management Unit (MMU). The table itself resides in main memory, but the MMU holds the address of the current process's table and walks it in hardware, so under normal circumstances the processor can translate virtual addresses and access memory directly.

When a process tries to access a virtual address not found in the page table, a page fault exception is raised. The system then enters kernel space to allocate physical memory, updates the process's page table, and subsequently returns to user space to continue execution.
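
To see this in action, here is a minimal sketch (assuming Linux and glibc; the numbers are illustrative) that uses getrusage() to count the minor page faults raised while a freshly allocated buffer is touched for the first time:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/resource.h>

/* Read the process's cumulative minor (soft) page fault count. */
static long minor_faults(void) {
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_minflt;
}

int main(void) {
    size_t size = 64 * 1024 * 1024;   /* 64 MB */
    char *buf = malloc(size);
    if (!buf) return 1;

    long before = minor_faults();
    memset(buf, 1, size);             /* first touch: kernel maps pages in */
    long after = minor_faults();

    printf("minor faults while touching 64 MB: %ld\n", after - before);
    free(buf);
    return 0;
}

With 4 KB pages you should see on the order of 16,000 faults (64 MB / 4 KB), or fewer if transparent huge pages are in effect.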

The Translation Lookaside Buffer (TLB) impacts CPU memory access performance. Essentially, the TLB serves as a cache for the page table within the MMU. As the virtual address space of a process is isolated, optimizing TLB cache usage by minimizing context switches and TLB flushes can boost CPU memory access efficiency.

It is crucial to understand that the MMU does not manage memory in byte increments but rather defines a minimum unit for mapping, typically a page of 4 KB. Thus, every memory mapping operation relates to either 4 KB or a multiple of it.
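
You can verify the page size on your own machine with sysconf(), a standard POSIX call:

#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* The granularity at which the MMU maps memory, usually 4096 bytes. */
    printf("page size: %ld bytes\n", sysconf(_SC_PAGESIZE));
    return 0;
}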

One challenge stemming from the 4 KB page size is the potential size of the entire page table. For instance, in a 32-bit system, over a million page table entries (4GB/4KB) are required to cover the complete address space. To mitigate the issue of excessive page table entries, Linux employs two strategies: multi-level page tables and HugePages.

Multi-level page tables divide the pages into blocks, turning the mapping from a single flat index into a block index plus an offset within the block. Since a process typically uses only a small fraction of its virtual address space, a multi-level page table needs entries only for the blocks actually in use, which greatly reduces the number of page table entries.

Linux uses a four-level page table, which splits a virtual address into five parts: the first four are indexes into the successive levels of the page table and together select the page, while the last part is the offset within that page.
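
The illustrative snippet below splits a virtual address the way the hardware walk does, assuming the common x86-64 scheme of 48-bit addresses, 4 KB pages, and 9 index bits per level:

#include <stdio.h>
#include <stdint.h>

int main(void) {
    uintptr_t vaddr = 0x7f3a12345678;     /* an example user-space address */

    unsigned pgd = (vaddr >> 39) & 0x1ff; /* level 1: page global directory */
    unsigned pud = (vaddr >> 30) & 0x1ff; /* level 2: page upper directory  */
    unsigned pmd = (vaddr >> 21) & 0x1ff; /* level 3: page middle directory */
    unsigned pte = (vaddr >> 12) & 0x1ff; /* level 4: page table entry      */
    unsigned off = vaddr & 0xfff;         /* offset inside the 4 KB page    */

    printf("PGD=%u PUD=%u PMD=%u PTE=%u offset=0x%x\n",
           pgd, pud, pmd, pte, off);
    return 0;
}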

As the name suggests, HugePages are memory pages larger than the standard page, commonly 2 MB or 1 GB in size. They are typically used by memory-hungry applications such as Oracle databases and DPDK.
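
As a sketch of how a program requests huge pages explicitly, Linux accepts the MAP_HUGETLB flag to mmap(); note the call fails unless huge pages have been reserved beforehand (for example through /proc/sys/vm/nr_hugepages):

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

int main(void) {
    size_t len = 2 * 1024 * 1024;                 /* one 2 MB huge page */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");              /* likely: none reserved */
        return 1;
    }
    printf("huge page mapped at %p\n", p);
    munmap(p, len);
    return 0;
}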

Through these mechanisms, processes access physical memory via virtual addresses mapped through page tables. But how is this memory utilized in a Linux process?

Virtual Memory Space Layout

How is virtual memory distributed? The kernel space at the top is straightforward, but the user space below it is divided into several different sections. Taking a 32-bit system as an example, user space contains five memory segments, arranged from low to high addresses:

  • Read-Only Segment: Contains code and constants.
  • Data Segment: Holds global variables.
  • Heap: Includes dynamically allocated memory that grows upwards from lower addresses.
  • File Mapping Segment: Encompasses dynamic libraries, shared memory, etc., growing downwards from higher addresses.
  • Stack: Contains local variables and function call contexts; its size is limited, defaulting to 8 MB.

Among these five segments, the heap and the file mapping segment are allocated dynamically. For instance, malloc() from the C standard library and the mmap() system call allocate memory dynamically in the heap and the file mapping segment, respectively.
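
As a quick illustration (Linux and glibc assumed; exact addresses vary with ASLR), printing the address of a small malloc() next to that of an anonymous mmap() shows the two regions they come from:

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

int main(void) {
    void *heap_block = malloc(64);     /* small request: served from the heap */
    void *mapped = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    printf("malloc(64) -> %p (heap)\n", heap_block);
    printf("mmap(4096) -> %p (file mapping segment)\n", mapped);

    munmap(mapped, 4096);
    free(heap_block);
    return 0;
}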

The memory layout in a 64-bit system resembles this but encompasses a much larger memory space. This leads to the crucial question: how is memory allocated and deallocated?

Memory Allocation and Deallocation

The malloc() function in the C standard library is responsible for memory allocation. It primarily utilizes two system call implementations: brk() and mmap().

For smaller memory blocks (less than 128 KB by default), the C standard library uses brk() to allocate memory by raising the top of the heap. Memory freed from here is not returned to the system immediately; it is cached so it can be reused.

For larger memory blocks (greater than 128 KB), malloc() calls mmap() directly, allocating memory from the file mapping segment.

Each method has its own advantages and disadvantages:

  • `brk()`: The caching mechanism can reduce page faults and improve memory access efficiency. However, since this memory is not returned to the system, frequent allocation and deallocation can fragment the heap.
  • `mmap()`: Memory allocated through mmap() is returned to the system as soon as it is freed, preventing fragmentation. However, every mmap() allocation page-faults on first access, which increases kernel overhead in memory-intensive scenarios. This is why malloc() resorts to mmap() only for large blocks.

It's essential to note that memory is not actually allocated when these calls are made. Real allocation only occurs upon first access via page faults, prompting the kernel to allocate memory.
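
The following Linux-specific sketch makes this deferred allocation visible by reading VmRSS (resident set size) from /proc/self/status: it barely moves after malloc() and only grows once the pages are touched:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Print the VmRSS line from /proc/self/status with a label. */
static void print_rss(const char *tag) {
    FILE *f = fopen("/proc/self/status", "r");
    char line[256];
    while (f && fgets(line, sizeof line, f))
        if (strncmp(line, "VmRSS:", 6) == 0)
            printf("%s %s", tag, line);
    if (f) fclose(f);
}

int main(void) {
    size_t size = 256 * 1024 * 1024;     /* 256 MB */

    print_rss("before malloc:");
    char *buf = malloc(size);            /* virtual space only, no physical pages */
    if (!buf) return 1;
    print_rss("after malloc: ");
    memset(buf, 1, size);                /* page faults bring in real memory */
    print_rss("after touch:  ");

    free(buf);
    return 0;
}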

In general, Linux uses the buddy system to manage physical memory allocation. As mentioned earlier, the MMU maps memory in pages, and the buddy system likewise manages memory in units of pages, merging adjacent free pages into larger blocks to reduce fragmentation (such as that caused by brk()).

You may wonder how memory is allocated for objects smaller than a page, like those under 1K.

In practice, numerous objects fall below a page's size, and allocating a separate page for each would waste memory. Thus, in user space, memory allocated by malloc() through brk() is cached for reuse rather than returned to the system immediately. In kernel space, Linux utilizes a slab allocator to manage small memory objects. The slab functions as a cache built on the buddy system, primarily for allocating and freeing small kernel objects.
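
To peek at the buddy system on a live machine, /proc/buddyinfo lists how many free blocks of each order exist per memory zone (order n means 2^n contiguous pages); the small reader below simply prints it. For slab details, /proc/slabinfo (root only) or the slabtop tool play the same role:

#include <stdio.h>

int main(void) {
    FILE *f = fopen("/proc/buddyinfo", "r");
    if (!f) { perror("/proc/buddyinfo"); return 1; }

    int c;
    while ((c = fgetc(f)) != EOF)
        putchar(c);        /* one row per zone, one column per order */

    fclose(f);
    return 0;
}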

If memory is allocated but never deallocated, it leaks, eventually exhausting system memory. Applications must therefore call free() or munmap() to release memory once it is no longer needed.

Naturally, the system prevents a process from consuming all available memory. When memory runs low, various mechanisms are employed to reclaim it, including:

  1. Reclaiming Cache: Utilizing the LRU (Least Recently Used) algorithm to recover the least recently used memory pages.
  2. Reclaiming Infrequently Accessed Memory: Moving seldom-used memory to disk swap space (paging out). When accessed again, it is retrieved from the disk back into memory (paging in).
  3. Killing Processes: In critically low memory conditions, the system may employ the OOM (Out of Memory) killer to terminate processes consuming substantial memory to safeguard the system.

The method of reclaiming infrequently accessed memory involves swap space, a disk area serving as an extension of RAM. It temporarily retains data not currently in use and retrieves it when necessary.

Thus, swap space enhances the system's available memory. However, swapping only occurs under low memory conditions, and since disk read/write speeds are significantly slower than RAM, this can result in severe performance degradation.
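
A quick way to check swap capacity and usage programmatically is to parse SwapTotal and SwapFree from /proc/meminfo (values are reported in kB), as in this small sketch:

#include <stdio.h>
#include <string.h>

int main(void) {
    FILE *f = fopen("/proc/meminfo", "r");
    if (!f) { perror("/proc/meminfo"); return 1; }

    char line[256];
    while (fgets(line, sizeof line, f))
        if (strncmp(line, "SwapTotal:", 10) == 0 ||
            strncmp(line, "SwapFree:", 9) == 0)
            fputs(line, stdout);

    fclose(f);
    return 0;
}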

The OOM mechanism serves as a kernel protective measure. It monitors process memory utilization, assigning an oom_score to each based on their memory consumption:

  • Higher `oom_score`: Indicates that the process uses more memory and is more likely to be terminated by OOM.
  • Lower `oom_score`: Signifies less memory usage and a lower likelihood of being killed by OOM.

Administrators can manually set a process's oom_adj value through the /proc filesystem to adjust the oom_score.

The oom_adj value ranges from [-17, 15]; a higher value increases the likelihood of the process being killed, whereas a lower value decreases it, with -17 signifying "do not kill."

For instance, the following command adjusts the oom_adj for the sshd process to -16, making it less susceptible to termination by OOM:

echo -16 > /proc/$(pidof sshd)/oom_adj
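
On recent kernels, the preferred interface is oom_score_adj, which ranges from -1000 to 1000; the older oom_adj is deprecated and mapped onto it. As a sketch, a process can lower its own score from C like this (lowering the value requires root or CAP_SYS_RESOURCE):

#include <stdio.h>

int main(void) {
    FILE *f = fopen("/proc/self/oom_score_adj", "w");
    if (!f) { perror("open oom_score_adj"); return 1; }

    fputs("-500\n", f);                  /* less likely to be OOM-killed */
    if (fclose(f) != 0) {                /* write fails without privilege */
        perror("write oom_score_adj");
        return 1;
    }
    printf("oom_score_adj set to -500\n");
    return 0;
}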

Checking Memory Usage

With the memory layout and the allocation and reclamation process in hand, you should have a solid foundational picture of how memory works. Real systems are more complex and involve additional mechanisms, but the main principles are covered here, giving you a framework that goes beyond technical jargon.

So, how can we check the system's memory usage after learning these principles?

We already touched on some related tools in the earlier CPU discussions. The first that comes to mind is the free command. Here is an example of its output:

$ free
              total        used        free      shared  buff/cache   available
Mem:        8169348      263524     6875352         668     1030472     7611064
Swap:             0           0           0

The free command prints values in kibibytes (KiB, units of 1024 bytes) by default. The output is a table of two rows and six columns, covering physical memory (Mem) and swap space (Swap):

  1. total: Total memory amount.
  2. used: Memory in use, including shared memory.
  3. free: Unused memory amount.
  4. shared: Memory used as shared memory, mostly by tmpfs.
  5. buff/cache: Size of buffers and caches.
  6. available: Memory available for new processes.

Notably, the last column, available, includes not just unused memory but also reclaimable cache, typically making it larger than the unused memory. However, not all caches can be reclaimed, as some may still be in use.

While free provides a comprehensive overview of the system's memory usage, if you want to monitor individual process memory utilization, tools like top or ps can be employed. Below is a sample output from the top command:

# Press M to sort by memory usage
$ top
...
KiB Mem :  8169348 total,  6871440 free,   267096 used,  1030812 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  7607492 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  430 root      19  -1  122360  35588  23748 S   0.0  0.4   0:32.17 systemd-journal
 1075 root      20   0  771860  22744  11368 S   0.0  0.3   0:38.89 snapd
 1048 root      20   0  170904  17292   9488 S   0.0  0.2   0:00.24 networkd-dispat
    1 root      20   0   78020   9156   6644 S   0.0  0.1   0:22.92 systemd
12376 azure     20   0   76632   7456   6420 S   0.0  0.1   0:00.01 systemd
12374 root      20   0  107984   7312   6304 S   0.0  0.1   0:00.00 sshd
...

At the top of the top command output, overall system memory statistics are presented, similar to those of the free command, so we won't repeat those explanations here. Instead, let’s focus on memory-related columns like VIRT, RES, SHR, and %MEM.

Here's what these columns signify:

  • VIRT: Total virtual memory size for the process, encompassing all requested memory, regardless of physical allocation.
  • RES: Resident memory, the physical memory the process actually occupies, excluding any pages swapped out to disk.
  • SHR: The shared portion of resident memory, including true shared memory, loaded dynamic libraries, and the program's code segment.
  • %MEM: Percentage of the system's physical memory consumed by the process.

While reviewing the top output, be aware of two additional points:

  1. Virtual Memory vs. Physical Memory: Virtual memory does not always correlate directly with physical memory. Typically, a process's virtual memory size (VIRT) is significantly larger than its resident memory size (RES).
  2. Shared Memory (SHR): The SHR column does not consistently indicate genuinely shared memory. It includes program code segments and non-shared dynamic libraries, as well as memory shared with other processes. Thus, summing SHR values across processes does not yield accurate overall memory usage.
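
One way around this double counting is Pss (proportional set size), exposed in /proc/<pid>/smaps_rollup on kernels 4.14 and later: each shared page is divided evenly among the processes using it, so Pss values can be meaningfully summed. A minimal reader, shown for the current process:

#include <stdio.h>
#include <string.h>

int main(void) {
    FILE *f = fopen("/proc/self/smaps_rollup", "r");
    if (!f) { perror("smaps_rollup"); return 1; }

    char line[256];
    while (fgets(line, sizeof line, f))
        if (strncmp(line, "Rss:", 4) == 0 || strncmp(line, "Pss:", 4) == 0)
            fputs(line, stdout);         /* raw Rss vs proportional Pss, in kB */

    fclose(f);
    return 0;
}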

Conclusion

In this discussion, we explored the principles behind Linux memory management. A typical process interfaces exclusively with the virtual memory allocated by the kernel, which is mapped to physical memory through system page tables.

When a process requests memory through malloc(), allocation does not happen immediately. Physical memory is assigned only on first access, when a page fault exception triggers the kernel's allocation path.

Given that a process's virtual address space often exceeds physical memory, Linux employs mechanisms to address memory shortages, such as cache reclamation, swap space, and Out of Memory (OOM) handling.

To assess the memory usage of the system or specific processes, utilize performance tools like free, top, and ps. These are among the most commonly employed performance analysis tools, and mastering their usage and understanding their metrics is vital.