Last year, I took the graduate operating systems course at my university. For our final project, we were tasked with benchmarking a system of our choosing. I chose my own laptop, as it was the simplest and most convenient choice. In the process, I learned a lot about the various optimizations that compilers, the Linux operating system, and the hardware implement, many of which go unnoticed by most people.
Machine Description
The specs of my laptop (a Thinkpad X1 Carbon) are pretty standard for most modern laptops. It has an 11th Gen Intel Core i7 CPU with a 3.0 GHz clock rate; on-chip memory consists of the L1 instruction, L1 data, L2, and L3 caches, with sizes of 128 KB, 192 KB, 5 MB, and 12 MB respectively. This CPU appears to have eight cores (as reported by the OS), but in reality it has four physical cores with two hardware threads each (hyperthreading). It's also equipped with 32 GB of LPDDR4 RAM (with a 136.5 GB/s bandwidth) and a 1 TB Western Digital SSD (with an 8 GB/s bandwidth). At the time of running this experiment, my laptop was running Ubuntu 22.04.
Basic Methodology
All of the experiments discussed here were written in C, as this simplified many of the benchmarking techniques; it allowed for direct memory manipulation and access to an extensive system call interface. We used the rdtsc CPU instruction to measure our times. rdtsc measures time in CPU cycles and has very little overhead, making it ideal for measuring our lowest-latency operations, which complete in just a handful of clock cycles. The CPU occasionally reorders instructions (including rdtsc), so it was important to enforce serialized execution so that the code being benchmarked would actually execute between the two rdtsc reads. The mfence and lfence instructions helped with this. To ensure consistency between runs, all the tests were run with as light a load on the machine as possible, and the process priority of the benchmarking program was boosted to its maximum with the nice syscall. Because we're running on a multicore CPU, it's also important to make sure the process can only run on a single core.
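To make this concrete, here is a minimal sketch of that kind of timing harness: pin the process to one core, drop its nice value, and wrap the code under test in fenced rdtsc reads. It's a simplified illustration of the approach rather than the exact benchmarking code.

```c
/* Minimal sketch of the timing setup (not our exact harness): pin to one
 * core, boost priority, and read the TSC with fences so the measured code
 * can't be reordered around the rdtsc instructions. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <x86intrin.h>   /* __rdtsc, _mm_lfence, _mm_mfence */

static inline uint64_t tsc_now(void)
{
    _mm_mfence();   /* drain pending loads/stores */
    _mm_lfence();   /* keep rdtsc from executing early */
    return __rdtsc();
}

int main(void)
{
    /* Restrict the process to a single core. */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);
    sched_setaffinity(0, sizeof(set), &set);

    /* Lowest niceness = highest priority (needs sufficient privileges). */
    nice(-20);

    uint64_t start = tsc_now();
    /* ... code under test ... */
    uint64_t end = tsc_now();

    printf("elapsed cycles: %llu\n", (unsigned long long)(end - start));
    return 0;
}
```

An alternative to the explicit fences is the rdtscp instruction, which waits for all earlier instructions to finish before reading the counter.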
In terms of data collection, the most effective strategy was to run numerous trials for each run and trim outliers. We went with 1000 trials, as it struck a good balance between a large sample size and not taking too long to run (especially for the slower experiments). We occasionally got very high outliers when timing, most likely caused by a context switch to another process. While the priority boosting was meant to minimize the chances of this happening, the occasional switch is still inevitable; running only high-priority processes would ultimately starve everything else. We also discarded the first 10 trials to ensure that most of our program's instructions had been loaded into the L1 instruction cache, which once again helped keep the trials consistent.
CPU & Scheduling
Single read
Our first task was to measure the overhead incurred by our measurement method. Although the rdtsc instruction itself contributed some overhead, there would be more from the mov instructions and function calls needed to actually store that timing data in variables. In total, this ended up coming out to around 14 ns (or 42 clock cycles), which is a fairly low overhead and pretty consistent with the advertised times for the instructions used.
Function call
Next, we measured the time it takes to call a function and how that relates to the number of arguments. In the x86-64 calling convention, arguments are placed into registers (and, beyond the first six, onto the stack) via mov instructions, so we expect a function with more arguments to take longer since more mov instructions are needed. We do indeed see an increasing trend, but it isn't as uniform as we would expect. Calling a function with five arguments is actually faster than calling one with four, and a call with seven arguments takes roughly the same amount of time.
[Figure: function call latency vs. number of arguments]
Interestingly, this strange pattern wasn't just random noise; it showed up in every single trial in the run, as well as in other runs. Checking the disassembly of the program shows the expected number of mov instructions, which increases as we add more arguments. It's possible that some CPU-level optimization of chaining multiple mov instructions (some kind of pipelining?) results in these unexpected decreases.
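For reference, the core of the experiment looks roughly like the sketch below (only two argument counts are shown, and f4/f7 are just illustrative names). The noinline attribute keeps the compiler from inlining the calls away, and assigning into a volatile variable keeps it from dropping them entirely.

```c
/* Sketch: time calls to functions with different numbers of arguments. */
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

__attribute__((noinline)) static int f4(int a, int b, int c, int d)
{
    return a + b + c + d;
}

__attribute__((noinline)) static int f7(int a, int b, int c, int d,
                                        int e, int f, int g)
{
    return a + b + c + d + e + f + g;
}

int main(void)
{
    volatile int sink;   /* keeps the calls from being optimized out */
    uint64_t t0, t1;

    t0 = __rdtsc();
    sink = f4(1, 2, 3, 4);
    t1 = __rdtsc();
    printf("4 args: %llu cycles\n", (unsigned long long)(t1 - t0));

    t0 = __rdtsc();
    sink = f7(1, 2, 3, 4, 5, 6, 7);
    t1 = __rdtsc();
    printf("7 args: %llu cycles\n", (unsigned long long)(t1 - t0));

    (void)sink;
    return 0;
}
```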
System call
A system call (or “syscall”) is a special type of function call. It involves doing an operation that only the OS has privileges to do, such as performing disk I/O or creating new processes. As a result, if a program tries to execute a syscall, it must first “trap” into the OS. Then, the OS handles the syscall and returns to the user-level process. Therefore, we would expect a system call to take quite a bit longer than a standard function call.
We chose to measure the runtime of the getpid syscall. It’s a fairly minimal syscall and only returns the process ID of the currently running process — this way, the runtime of getpid more accurately reflects the time spent trapping into and out of the OS. Additionally, the results of getpid are never cached, so we can be sure that our program does in fact trap into the OS every time we call it.
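A bare-bones version of the measurement might look like the following sketch; calling syscall(SYS_getpid) directly is one way to be certain each request really enters the kernel rather than being answered from a value cached in libc.

```c
/* Sketch: time the getpid syscall with rdtsc. */
#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <x86intrin.h>

int main(void)
{
    const int trials = 1000;
    uint64_t total = 0;

    for (int i = 0; i < trials; i++) {
        uint64_t start = __rdtsc();
        syscall(SYS_getpid);              /* trap into the kernel and back */
        uint64_t end = __rdtsc();
        total += end - start;
    }

    printf("average cycles per getpid: %llu\n",
           (unsigned long long)(total / trials));
    return 0;
}
```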
We measured an average runtime of around 84 ns. As expected, this is significantly longer than a normal function call (which measured out at less than a nanosecond on average).
Task creation
Creating new processes is an essential job of the OS. We benchmarked the creation of two different types of tasks: processes and threads. Unlike a process, a thread shares its parent process's address space and other resources, so far less state needs to be set up when creating one, making threads lightweight and efficient to create and destroy. As a result, we would expect creating threads to be much faster than creating processes.
We used the fork() system call to create new processes and pthread_create() to create new threads. In both cases, the newly spawned task immediately exited in order to avoid any additional interference with the parent process conducting the measurements. Creating both threads and processes took a relatively long time (both were on the scale of microseconds), but creating processes was by far the more expensive: it took almost 40 µs, while creating a new thread only took around 4.5 µs.
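In simplified form, the two measurements look something like the sketch below (the real runs averaged over many trials and trimmed outliers as described earlier; build with -lpthread).

```c
/* Sketch: time process creation with fork() vs. thread creation with
 * pthread_create(). The child/thread exits immediately, as described above. */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>
#include <x86intrin.h>

static void *thread_fn(void *arg)
{
    (void)arg;
    return NULL;                 /* exit right away */
}

int main(void)
{
    uint64_t start, end;

    /* Process creation. */
    start = __rdtsc();
    pid_t pid = fork();
    if (pid == 0)
        _exit(0);                /* child exits immediately */
    end = __rdtsc();
    waitpid(pid, NULL, 0);       /* reap the child (not timed) */
    printf("fork: %llu cycles\n", (unsigned long long)(end - start));

    /* Thread creation. */
    pthread_t tid;
    start = __rdtsc();
    pthread_create(&tid, NULL, thread_fn, NULL);
    end = __rdtsc();
    pthread_join(tid, NULL);     /* clean up (not timed) */
    printf("pthread_create: %llu cycles\n", (unsigned long long)(end - start));

    return 0;
}
```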
Context switch
Context switches are how the OS makes sure a single process doesn't hog the CPU. Every so often, the OS switches the CPU to a different process so that all pending processes can make progress. For most of this benchmarking experiment, we've been trying to minimize context switches; getting accurate timing data means our process needs to stay on the CPU as much as possible. Now, we were tasked with actually measuring how long a context switch takes.
It turns out that making a context switch happen on demand isn't that simple either. After all, it seems like context switches just happen whenever the OS feels like it. However, the OS also context switches whenever a process makes a blocking call (such as a read that has to wait for data). This way, it doesn't have to busy-wait until the call returns and can instead make progress on some other process; once the operation completes, the kernel marks the blocked process as runnable and eventually switches back to it. We used this fact to our advantage and triggered context switches using pipes. We created another process with fork(); the child process tries to read from the pipe, and the parent process writes to it. This reliably forces a context switch. The exchange is repeated, so we end up timing two context switches for each trial.
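In outline, one way to set this up is sketched below: with two pipes, the parent can time a full round trip (parent wakes child, child wakes parent), which covers the two context switches per trial, and pinning both processes to the same core forces them to actually trade places on it.

```c
/* Sketch: force context switches with a pipe ping-pong between a parent and
 * a child pinned to the same core. One round trip involves two switches
 * (parent -> child -> parent). Details here are illustrative. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>
#include <x86intrin.h>

int main(void)
{
    int p2c[2], c2p[2];                  /* parent->child, child->parent */
    pipe(p2c);
    pipe(c2p);

    /* Pin to core 0 before forking so both processes share the core. */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);
    sched_setaffinity(0, sizeof(set), &set);

    char buf = 'x';
    if (fork() == 0) {
        /* Child: wait for the parent's byte, then reply. */
        read(p2c[0], &buf, 1);
        write(c2p[1], &buf, 1);
        _exit(0);
    }

    uint64_t start = __rdtsc();
    write(p2c[1], &buf, 1);              /* wakes the child (switch #1) */
    read(c2p[0], &buf, 1);               /* blocks until the reply (switch #2) */
    uint64_t end = __rdtsc();

    wait(NULL);
    printf("round trip (two switches): %llu cycles\n",
           (unsigned long long)(end - start));
    return 0;
}
```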
We also measured the context switch time between threads. Since threads share an address space and, as mentioned earlier, are more lightweight, far less state needs to be saved and restored when switching between them, so we would expect thread context switches to be much faster than process context switches. Indeed, we see exactly this: our average process context switch time was around 3145 ns, while switching between threads took around 280 ns (around 11 times faster).
Memory
Accessing RAM
The next phase of this benchmarking experiment was to collect some data about the system memory, and the first thing to do would be to measure how long accessing memory took. Memory access times can vary depending on whether the target data is stored in one of the CPU caches. There are three CPU caches: L1, L2, and L3, in order of increasing size. In general, the smaller the memory size, the faster the access time. Therefore, accessing the L1 cache is the fastest, followed by L2, L3, and finally main memory.
Our methodology for measuring memory access times was based on that of lmbench, a benchmark suite written back in 1996. We created arrays ranging in size from 1024 bytes to 1024 MB. The smallest arrays would therefore fit entirely within the L1 cache, while the largest would not fully fit in any of the caches (requiring accesses to main memory). The lmbench paper loops through the entire array and accesses elements with a stride (i.e. accessing every nth element). This is done to defeat unwanted caching. Because the hardware expects memory to be accessed sequentially, each memory access loads not just the requested bytes but a whole aligned block of consecutive bytes, known as a cache line, into L1 (if it isn't already there). For many machines, including my laptop, the cache line size is 64 bytes. Accessing the array with a stride ensures that each access hits new data that hasn't already been pulled into L1.
Looping through the entire array, especially for the larger arrays, ended up being extremely slow. We therefore took the approach of accessing 1000 random elements instead, which serves a similar purpose of avoiding already-cached data. We also had to be careful to actually initialize all of our data. The memory access times were unusually low when we didn't do this, most likely because untouched pages are all backed by the same zero-filled page until they're first written, so our "random" accesses kept hitting the same cached data.
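Condensed, the measurement for a single array size looks something like this sketch. The random indices are generated before the timed loop so that rand() itself isn't being measured, and the array is written first so every page is actually allocated.

```c
/* Sketch: average access latency for an array of a given size, measured by
 * touching random elements. Writing the array first forces real pages to be
 * allocated instead of the shared zero page. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <x86intrin.h>

#define ACCESSES 1000

static double measure_latency(size_t bytes)
{
    size_t n = bytes / sizeof(uint64_t);
    uint64_t *arr = malloc(bytes);
    if (!arr)
        return 0.0;

    for (size_t i = 0; i < n; i++)      /* initialize everything */
        arr[i] = i;

    size_t idx[ACCESSES];
    for (int i = 0; i < ACCESSES; i++)  /* pick indices before timing */
        idx[i] = rand() % n;

    volatile uint64_t sink = 0;         /* keep the reads from being removed */
    uint64_t start = __rdtsc();
    for (int i = 0; i < ACCESSES; i++)
        sink += arr[idx[i]];
    uint64_t end = __rdtsc();

    (void)sink;
    free(arr);
    return (double)(end - start) / ACCESSES;
}

int main(void)
{
    for (size_t kb = 1; kb <= 1024 * 1024; kb *= 4)   /* 1 KB up to 1 GB */
        printf("%8zu KB: %6.1f cycles/access\n",
               kb, measure_latency(kb * 1024));
    return 0;
}
```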
The memory access times reported in the lmbench paper show sharp increases once the array size grows beyond the size of one of the caches, with access times otherwise staying mostly constant within each cache level.
[Figure: memory access latencies from the lmbench paper]
Our data doesn't quite show these well-defined delineations between cache levels. In particular, there's very little observed difference in latency between the L1 and L2 regimes, with a slow increase moving toward the L3 regime and a much faster increase upon entering main memory (and continuing thereafter). This indicates that we were likely still hitting some cached data, even with our random array accesses. It would certainly be interesting to repeat the experiment and access the arrays with a stride instead (and on a smaller scale to improve runtimes).
[Figure: measured memory access latency vs. array size]
RAM Bandwidth
Next, we measured the bandwidth of the RAM. Based on my laptop’s RAM type (LPDDR4), the maximum theoretical bandwidth is around 136.5 GB/s. Of course, the theoretical bandwidth probably wouldn’t be sustained for long, so we expected to measure a somewhat lower bandwidth.
Initially, we planned to simply create a large array and see how many elements could be read or written per unit time. However, this once again ran into the issue of caching. As explained earlier, a read from memory pulls an entire 64-byte cache line into the cache, so after reading one byte, the following 63 bytes would be served from the cache rather than from RAM. Reading consecutive bytes would therefore mostly measure cache bandwidth. The strategy we settled on to combat this was to use a stride as described earlier, spacing memory reads 64 bytes apart so that each read pulls a fresh line from RAM.
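The read-bandwidth loop boils down to something like the sketch below: touch one 8-byte word per 64-byte cache line across a buffer far larger than the 12 MB L3, and count each access as a full line transferred from RAM. I use clock_gettime here rather than rdtsc purely for convenience, since the loop runs long enough that wall-clock timing is precise enough.

```c
/* Sketch: estimate read bandwidth with one 8-byte read per 64-byte cache
 * line over a 1 GB buffer, so (almost) every access pulls a fresh line from
 * RAM. Each access is counted as 64 bytes moved. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define BUF_BYTES  (1UL << 30)                 /* 1 GB buffer */
#define WORDS      (BUF_BYTES / sizeof(uint64_t))
#define LINE_WORDS (64 / sizeof(uint64_t))     /* 8 words per cache line */

int main(void)
{
    uint64_t *buf = malloc(BUF_BYTES);
    if (!buf)
        return 1;

    for (size_t i = 0; i < WORDS; i += 512)    /* touch every 4 KB page */
        buf[i] = i;

    volatile uint64_t sink = 0;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < WORDS; i += LINE_WORDS)
        sink += buf[i];                        /* one read per cache line */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("approx read bandwidth: %.1f GB/s\n", BUF_BYTES / secs / 1e9);

    (void)sink;
    free(buf);
    return 0;
}
```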
The read bandwidth we measured this way was 36.4 GB/s, quite a bit lower than the theoretical maximum. The write bandwidth was lower still, at around 13.3 GB/s. The discrepancy could be due to contention with other processes, though that probably wouldn't cause such a large difference given how light the load on the machine was. It's also possible that a single loop issuing one access per cache line can't keep enough requests in flight to make full use of the memory bandwidth. In a similar fashion, reading a file sequentially in small chunks is considerably slower than one read of a large chunk, although that may instead be due to the overhead of the read system call.
Page fault service time
From a programmer's point of view, it seems like each process has one big contiguous block of memory to work with. However, this is yet another abstraction provided by the OS. Behind the smoke and mirrors, each process is actually given virtual memory, which gets mapped onto chunks of physical memory (known as pages) as needed. This serves two purposes. First, it makes it appear as if a process has an essentially unlimited pool of memory and basically has the entire machine to itself. Second, it serves as a security measure: a process can only name addresses within its own virtual address space, which prevents it from accessing the physical pages of any other process. This provides an effective mechanism of isolation between processes, which is one of the OS's main jobs.
If a process tries to access virtual memory that isn’t mapped to a physical page, a page fault occurs. This can happen right at process startup when none of its pages are mapped. It can also happen if the system runs out of memory and must free a physical page, unmapping the associated virtual page. To avoid losing the data on that page, its contents are backed up to disk. The page fault is handled by trapping to the OS (much like a system call), which brings the corresponding physical page back from disk if necessary and remaps it to the process’s virtual memory.
Again, forcing a page fault to occur turned out to be not so simple. The best way we found to do this was to use mmap, which lets you map a file into process memory: reads and writes to the mapped region are mirrored to the file and vice versa. However, the mapping isn't populated with physical pages up front, which is especially helpful when mmap-ing a file larger than the available physical memory on the system. As a result, the first access to each page of the mapping causes a page fault, which maps that virtual page to the appropriate section of the file. To this end, we created a large file, mapped it, and wrote to the start of every virtual page; each of those writes triggers a page fault since the page hasn't been mapped yet. We also called fsync to make sure the writes didn't just get buffered.
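A stripped-down version of the experiment looks roughly like the sketch below, where bigfile.bin stands in for the large pre-created file.

```c
/* Sketch: map a large file and write to the first byte of each page. The
 * first touch of every page triggers a page fault that the kernel services
 * by mapping (and, for file-backed pages, reading) that part of the file. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <x86intrin.h>

#define PAGE_SIZE 4096

int main(void)
{
    int fd = open("bigfile.bin", O_RDWR);    /* placeholder file name */
    struct stat st;
    fstat(fd, &st);

    char *map = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);

    size_t pages = st.st_size / PAGE_SIZE;
    uint64_t total = 0;
    for (size_t i = 0; i < pages; i++) {
        uint64_t start = __rdtsc();
        map[i * PAGE_SIZE] = 1;              /* first touch => page fault */
        uint64_t end = __rdtsc();
        total += end - start;
    }

    fsync(fd);                               /* flush the dirtied pages */
    munmap(map, st.st_size);
    close(fd);
    printf("average cycles per page touch: %llu\n",
           (unsigned long long)(total / pages));
    return 0;
}
```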
Using this methodology, we measured an average page fault time of around 16.3 µs. At first, this seemed a bit low; according to the Wikipedia article on disk drive performance, the average seek time for SSDs is 80–160 µs, which would make our page fault time roughly five to ten times faster than a seek. However, the reference for that number is a 2011 paper, so it's very possible that seek times have improved significantly since then. I tried to find other estimates online, but most of the results I found seem to reference this same number, likely taken from the same Wikipedia article. Either way, a page fault is still much slower than reading from main memory, which is to be expected for something that touches disk.
File system
File cache size
Reading from disk is pretty slow. When doing operations with intensive I/O (such as reading from a file), we want to avoid accessing disk as much as possible. To this end, disk blocks that may be needed in the future are cached in main memory. Unlike the CPU caches, the file buffer cache in memory doesn’t have a fixed size and can grow and shrink as needed. The theoretical maximum size of the file cache in my system would be 32 GB, since this is the size of my main memory. I didn’t expect it to actually take up that much space, since there must be some space in RAM reserved for the OS and other miscellaneous processes that might still be running.
To measure the actual maximum size of the file cache, we simply created a very large file (64 GB), much larger than main memory. The file has to be filled with real (random) data; a large file created without actually writing data to it is stored sparsely, and reading it back never touches the disk. Then, we read progressively larger and larger sections of the file. We expect to see a significant jump in latencies once the section is larger than what fits in the file buffer cache, resulting in more cache misses and more disk reads. Additionally, we make sure to clear the file cache between runs by running echo 3 > /proc/sys/vm/drop_caches.
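A rough sketch of one way to collect a single data point is shown below: read the first N bytes of the file once to pull them into the file cache, then time a second pass over the same section. If the section fits in the cache, the second pass is fast; once it doesn't, the re-read has to go back to disk. (bigfile.bin is a placeholder name, and the cache is cleared between runs as described above.)

```c
/* Sketch: warm the file cache with one pass over the first `bytes` bytes,
 * then time a second pass over the same section. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define CHUNK (1 << 20)                      /* 1 MB read chunks */

static void read_section(int fd, char *buf, long long bytes)
{
    lseek(fd, 0, SEEK_SET);
    while (bytes > 0) {
        ssize_t n = read(fd, buf, CHUNK);
        if (n <= 0)
            break;
        bytes -= n;
    }
}

int main(int argc, char **argv)
{
    (void)argc;
    long long bytes = atoll(argv[1]);        /* section size in bytes */
    int fd = open("bigfile.bin", O_RDONLY);  /* placeholder file name */
    char *buf = malloc(CHUNK);

    read_section(fd, buf, bytes);            /* warm the file cache */

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    read_section(fd, buf, bytes);            /* timed pass */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%lld bytes re-read in %.3f s\n", bytes, secs);

    free(buf);
    close(fd);
    return 0;
}
```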
[Figure: file read latency vs. size of the file section read]
We indeed see such a sudden increase somewhere between approximately 22.6 GB and 26.9 GB. This matches up pretty well with our expectations; the maximum file cache size ends up being a significant portion (but not the entirety) of main memory. It would also be interesting to run this experiment at a finer granularity, especially near the range mentioned above; the use of a log scale for the file size leaves quite a bit of uncertainty in the actual measurement.
File read time
In this experiment, we compare file read times for sequential and random access patterns across different file sizes. Unlike in the previous experiment, this time we explicitly want to avoid file caching so that every read goes to disk, so we use the posix_fadvise system call to tell the kernel not to keep the file blocks we read. For the sequential reads, we simply read consecutive segments of the file equal to the block size (4096 bytes), looping back to the start of smaller files whenever we reached the end. For the random reads, we first moved the file pointer to a random location, read a block-sized chunk, and repeated for the remaining trials.
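A sketch of the two access patterns is shown below (testfile.bin is a placeholder name, and the file is assumed to be at least one block long). pread combines the seek and the read into a single call, and the POSIX_FADV_DONTNEED advice drops each block from the page cache right after it's read so that later reads still have to go to disk.

```c
/* Sketch: compare sequential and random block reads on the same file,
 * evicting each block from the page cache immediately after reading it. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>
#include <x86intrin.h>

#define BLOCK  4096
#define TRIALS 1000

int main(void)
{
    int fd = open("testfile.bin", O_RDONLY);   /* placeholder file name */
    struct stat st;
    fstat(fd, &st);
    off_t blocks = st.st_size / BLOCK;         /* assumed >= 1 */
    char buf[BLOCK];

    /* Sequential reads (wrapping around for small files). */
    uint64_t seq = 0;
    for (int i = 0; i < TRIALS; i++) {
        off_t off = (off_t)(i % blocks) * BLOCK;
        uint64_t t0 = __rdtsc();
        pread(fd, buf, BLOCK, off);
        uint64_t t1 = __rdtsc();
        seq += t1 - t0;
        posix_fadvise(fd, off, BLOCK, POSIX_FADV_DONTNEED);  /* evict it */
    }

    /* Random reads. */
    uint64_t rnd = 0;
    for (int i = 0; i < TRIALS; i++) {
        off_t off = (off_t)(rand() % blocks) * BLOCK;
        uint64_t t0 = __rdtsc();
        pread(fd, buf, BLOCK, off);
        uint64_t t1 = __rdtsc();
        rnd += t1 - t0;
        posix_fadvise(fd, off, BLOCK, POSIX_FADV_DONTNEED);
    }

    printf("sequential: %llu cycles/block, random: %llu cycles/block\n",
           (unsigned long long)(seq / TRIALS),
           (unsigned long long)(rnd / TRIALS));
    close(fd);
    return 0;
}
```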
[Figure: file read time vs. file size]
For both access patterns, we see an increasing trend as file size increases. This is due to the way files are stored on disk in Unix-based OSes. Because files need to be able to span multiple non-contiguous disk blocks (for more efficient space usage), a per-file structure known as an inode maintains pointers to all the blocks of the file. For very large files, a single inode doesn't have enough space to store pointers to every file block. In that case, a multi-level structure is used instead: some of the inode's pointers point to indirect blocks, which in turn store pointers to file blocks. Since the maximum representable file size grows by a large factor with every level of indirection, this structure can reference massive files. On the other hand, more levels of indirection mean more disk accesses to reach a given file block, as more time is spent traversing the pointer structure. This is why the average read time increased significantly for our larger files.
[Figure: sequential vs. random file read times]
It's also interesting that sequential access actually performs relatively poorly for file sizes in the range of 16 to 128 KB. This is probably because the file is still fairly small at that point, so we end up looping over it multiple times during our trials and the access pattern isn't really sequential anymore. Once the file size reaches 2 MB, we see a significant drop in latency (it consistently stays at around half of the random access latency), since each trial now performs a much longer run of truly sequential reads.
Remote file read time
With the prevalence of distributed computing models, reading files from remote machines is another common scenario. For this experiment, I used an on-campus lab machine and connected to it over SSH. Using the ping command on this network gave a round-trip time of around 20 ms, so I expected file read latencies of at least 20 ms.
The approach for this experiment was similar to the previous experiment where we measured local file read times, but this time I used the SFTP API to read files over SSH. Since I wasn’t using the standard C library for reading files, the posix_fadvise system call wouldn’t work anymore, and I had to run echo 3 > /proc/sys/vm/drop_caches.
[Figure: remote (SFTP) file read time vs. file size]
The strangest thing here is that random read times increase significantly for larger files, while sequential read times don't increase to nearly the same degree. This might be because sequential reads allow SFTP to prefetch and batch file blocks together when encrypting. There's otherwise some additional variation even for the smaller files; notably, sequential reads are noticeably slower for the smallest files. This is possibly for a reason similar to why sequential reads were slower on small files in our local read experiment. Alternatively, it could just be that network speeds were slightly slower during these particular trials; a couple of milliseconds is well within the range of normal network variation. Running more trials would help confirm whether this was a temporary blip or a consistent trend.
Final thoughts
Throughout this project, the hardest part was simply avoiding all the optimizations that could potentially interfere with our measurements. My code isn't very efficient even normally, but this was the first time I really had to try and unoptimize it. From when a program gets compiled to when it's running on the hardware, all sorts of intricate tricks are employed to extract every bit of speedup possible. I didn't even know many of these quirky behaviors existed until I ran my benchmarks, got seemingly nonsensical results, and had to find a workaround.
Although this time I was working against the optimizations, learning about them will help me write my programs to work alongside them in the future. Many of these behaviors are predicated on the expectation that a program is written in a certain way; the fact that I was able to work around the optimizations at all shows this. While it seems like magic at first, my trash heap of a codebase unfortunately won’t get transformed by the OS into a software powerhouse. However, perhaps with what I’ve learned from this project I might be able to start doing that myself.

