Computers are fascinating machines, and the vehicle for most of the innovation that is happening today. Most software engineers write application programs in high-level languages and thus don't appreciate the layers beneath.
This article delineates some details at the boundary of software and hardware. It was inspired by the discussions I had in the paper reading session where Vijay, Vinayak, Chirag, Himanshu and I read the paper A Comparison of Software and Hardware Techniques for x86 Virtualization. This paper was published by two VMware engineers while I was at VMware, and reading it again helped me brush up on the main ideas behind software virtualization, specifically binary translation, and on how processors work in general. I realized that a lot of the details of the hardware-software interface are not taught well in universities - most of what I write here I learned on the job.
High level languages are compiled to assembly language, which is then assembled into machine code. In this article we take a tour of how the processor interprets and executes those instructions. A computer has three prominent parts: (a) CPU (b) Memory (c) I/O. We will study the CPU and memory, focusing on the hardware-software boundary.
Warmup - a quick exercise
Perhaps a quick exercise will help us get warmed up for things to come. Consider the following straightforward C program, in a file called test.c.
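(The listing below is a stand-in - the original listing is not reproduced here, and any small program of this shape works for the exercise.)

```c
/* test.c - a stand-in listing; any small program works for this exercise */
#include <stdio.h>

int main(void) {
    int a = 34;
    int b = a + 8;
    printf("%d\n", b);
    return 0;
}
```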
You can first compile it to assembly by running gcc -S test.c; the generated test.s holds the assembly for main (the exact output varies with the system and compiler version).
Further, by running gcc -o test test.c and then running hexdump on the output (hexdump test), you can inspect the raw bytes of the executable: the first column is the byte offset, and the next columns are successive bytes, expressed in hexadecimal for clarity.
Our objective in the rest of the article is to get some understanding of how such bytes are interpreted as instructions and executed by the processor.
CPU - "Fetch and Execute"
A CPU, at its core, has a straightforward life cycle - fetch-decode-execute - instruction after instruction. In the next few paragraphs we spell out what this means.
In order to understand "fetch-decode-execute", we should visualize the processor as consisting of registers, plus some circuitry (ALUs - arithmetic logic units) that performs operations on the values in those registers. (Often the values to be operated upon come from main memory.)
An instruction is a sequence of bits. What else could it be! There is, however, an elaborate structure to the bits in an instruction, described by the processor's ISA (Instruction Set Architecture - see the Intel x86-64 manuals, for example). Typically, the first few bits encode the operation (load vs store vs add vs multiply etc). The next few bits may specify the first operand, and the subsequent few bits the second operand.
For instance, consider the following bit sequence:
01001000 10000011 11000000 00100010
On x86, the assembly corresponding to this machine code is: add rax, 34 i.e. add 34 to the contents of register rax. Here are some specific details:
- The first byte (0x48) is the REX prefix that enables 64-bit operation.
- The second byte (0x83) denotes that this is an add instruction with 8-bit immediate.
- The third byte (0xC0) contains ModR/M information specifying register-direct mode and the rax register.
- The fourth byte (0x22) is the immediate operand: 0x22 in hexadecimal is 34 in decimal.
So, the job of the processor is: fetch the next instruction from memory, decode it, and "execute" it - i.e. do what the instruction specifies.
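To make the cycle concrete, here is a toy fetch-decode-execute loop in C. It understands exactly one instruction form - the add rax, imm8 encoding dissected above - plus a made-up halt byte (treating 0x00 as "halt" is an invention of this toy, not real x86):

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    // "Memory" holding two copies of add rax, 34 and a toy halt byte.
    uint8_t mem[] = {0x48, 0x83, 0xC0, 0x22,
                     0x48, 0x83, 0xC0, 0x22,
                     0x00};
    uint64_t rax = 0, rip = 0;

    for (;;) {
        uint8_t first = mem[rip];                       // fetch
        if (first == 0x00) break;                       // toy halt marker
        if (first == 0x48 && mem[rip + 1] == 0x83 && mem[rip + 2] == 0xC0) {
            int8_t imm = (int8_t)mem[rip + 3];          // decode the imm8
            rax += (int64_t)imm;                        // execute: rax += imm
            rip += 4;                                   // advance past the 4-byte instruction
        } else {
            fprintf(stderr, "cannot decode at rip=%llu\n",
                    (unsigned long long)rip);
            return 1;
        }
    }
    printf("rax = %llu\n", (unsigned long long)rax);    // prints rax = 68
    return 0;
}
```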
Great. So now we know what an instruction is. Along the way we also came to know that there are "registers" in the processor. Registers are like scratch spaces which a programmer can use.
It is the right time to understand registers a bit better.
Registers
Each register is a 64-bit (8-byte) storage space. Registers are of two types: (a) general purpose and (b) special purpose. A general purpose register is really a scratch space the programmer uses to carry out a computation. A special purpose register has a specific role - it communicates some specific information to the rest of the processor.
There are 15 general purpose registers (RSP, the sixteenth, is covered below with the special purpose registers): RAX, RBX, RCX, RDX, RSI, RBP, RDI, and R8 through R15.
Then there are the special purpose registers. There are many of them, but in this article we will need to know the following:
- RIP: the instruction pointer - contains the address of the next instruction to be executed
- IR: the instruction register - contains the instruction currently being executed (it is internal to the processor, not directly visible to programmers). The value in IR is fetched from the address stored in RIP.
- RSP: the stack pointer - contains the address of the top of the stack. More on this later.
- RFLAGS: the flags register. It contains single-bit flags that record facts about the results of other instructions. E.g. if a previous operation overflowed, a certain bit gets set; another bit gets set based on the result of comparing two values in an instruction
- CR3: contains the page table base address. We will discuss more about this when we discuss MMU (memory management unit) later in this article
There are many more, but for the purpose of our discussion only the above registers will suffice.
The Common Flow of Instructions, and the Instruction Pointer RIP
Now that we have seen how a single instruction executes, it is easy to see that instructions can be stored one after the other in memory, fetched serially and executed. And that is, indeed, most of the execution. Once an instruction execution is done, RIP is incremented by the length of the current instruction and then the processor fetches-decodes-executes the next instruction.
Let's say the content of address 0x12345678 is 01001000 10000011 11000000 00100010. Instead of writing this human-unfriendly bit sequence, we will write it as "add rax, 34". Let's also say that at the next address, i.e. 0x1234567C, we have "add rbx, 22". We succinctly write this as:
```
A:      add rax, 34
A + 4:  add rbx, 22
```

where A is shorthand for 0x12345678. Initially the value of RIP is A, and thus "add rax, 34" gets executed. After this instruction executes, RIP changes to A + 4 (the instruction is 4 bytes long), where the next instruction resides. Thus the next instruction gets fetched and executed.
Function Calls and the Stack Pointer (RSP register)
The next concept we need in order to appreciate the fetch-decode-execute life cycle is how function calls are implemented in the processor, and the role of RSP.
If the concept of "function" did not exist in high level programming languages, then perhaps one could do without RSP. However, functions are integral to programming languages - they help us organize our code in logical groupings.
Now suppose that in your high level language you write a function func. At the time of writing func, you probably won't know which functions will call it. Maybe today func1 and func2 call it, and tomorrow someone might write a third function func3 that calls func. func can even call itself (recursion). And when func is executing, it does not know who called it.
So, func has to be compiled to assembly, and then to machine code, in a way that does not depend on who will call it. Say the compiler (or even you!) compiles the function using a lot of registers - rax, rbx, rcx etc. When func1 calls func, func1 too might be using the same registers, and once func finishes, func1 will need those values back. So, if func1's code is executing and a call to func is made, then before func's code starts executing, all those registers should be saved somewhere, and when func finishes, they should be restored. The only place we can save them is in memory. We refer to this set of saved registers as the "context" - so we say that the context is saved on the stack, or restored from the stack.
There can be a long chain of function calls - func1 calling func2 calling func3 and so on. All the context that will be needed later must be saved, and then restored as needed (as each return happens).
This whole problem is solved by having a "stack" in memory. For any given process (what a "process" is will be discussed more concretely in a later section of this article), the kernel allocates some memory, and its starting address is stored in the register RSP. Then, before any "call" instruction ("call" is the assembly instruction for a function call), the compiled code saves the relevant registers - the ones it was using as scratch space - on the stack (i.e. at the addresses pointed to by RSP). The "call" instruction automatically saves the current RIP at the top of the stack, changes RIP to the address given as call's argument, and updates RSP. The next instruction to execute will thus belong to the callee, since RIP has been updated. Similarly, on a "ret" instruction ("ret" is the assembly instruction for returning from a function call), the processor automatically restores RIP from the value at the top of the stack (the RIP value saved at the time of call). The assembly code in the caller then restores the previously saved registers from the stack.
Here is a quick runnable illustration (a sketch - exact addresses and frame sizes vary by compiler and platform): each call pushes a fresh frame, and printing a local variable's address shows the stack growing toward lower addresses on typical x86-64 systems.
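```c
#include <stdio.h>

// Each recursive call gets a fresh stack frame; the address of a
// local variable marks (roughly) where that frame lives.
void probe(int depth) {
    int local = depth;
    printf("depth %d: &local = %p\n", depth, (void *)&local);
    if (depth < 3)
        probe(depth + 1);
    // When probe returns, ret pops the saved RIP and the frame is gone.
}

int main(void) {
    probe(0);
    return 0;
}
```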
The above description should make clear that (1) a function call is not a software-only concept - the hardware supports it. Had the hardware not supported it, function calls could still have been implemented in software (assembly code could have done all the saving/restoring of the context, and a jump instruction could have been used instead of call and ret); however, it would have resulted in slower execution of programs. (2) RSP - the stack pointer - is involved in implementing the saving/restoring of the context on the stack, and stores the address of the top of the stack.
Details about the Instruction Pointer (RIP)
We had a brief look at RIP earlier, and saw that in the linear execution of instructions, RIP gets incremented by the length of the currently executing instruction. (Note that in some architectures, like x86, instruction length is variable; in others, like ARM, it is fixed.) Now let's look at RIP in detail and see the scenarios where it changes in a non-linear fashion.
The value of RIP changes as per the following logic:
- Unless the current instruction is a jump instruction, RIP gets incremented to RIP + instruction_length as the current instruction executes
- If the current instruction is a jump or conditional jump instruction, then the value of RIP changes as per the result of the jump
- At any time an "interrupt" may arrive. If that happens, RIP immediately changes to the address of the interrupt handler
- RIP changes in different ways when a call / ret instruction happens
The first of the cases has been studied already. Let's look at the next scenarios.
Branches
if and for statements of high level languages are translated to a variety of jump instructions.
Consider the following C code from the paper:
```c
int isPrime(int a) {
    for (int i = 2; i < a; i++) {
        if (a % i == 0) return 0;
    }
    return 1;
}
```

It translates to the following assembly:
```
isPrime:  mov  %edi, %ecx    ; %ecx = %edi (a)
          mov  $2, %esi      ; i = 2
          cmp  %ecx, %esi    ; is i >= a?
          jge  prime         ; jump if yes
nexti:    mov  %ecx, %eax    ; set %eax = a
          cdq                ; sign-extend
          idiv %esi          ; a % i
          test %edx, %edx    ; is remainder zero?
          jz   notPrime      ; jump if yes
          inc  %esi          ; i++
          cmp  %ecx, %esi    ; is i >= a?
          jl   nexti         ; jump if no
prime:    mov  $1, %eax      ; return value in %eax
          ret
notPrime: xor  %eax, %eax    ; %eax = 0
          ret
```

In the above example, isPrime, nexti etc. are all addresses (labels). When the processor executes "cmp %ecx, %esi", it compares the values in the registers %esi and %ecx, and records the outcome of the comparison in specific bits of the flags register RFLAGS. Next, when "jge prime" ("jump if greater than or equal") executes, RIP is set to the address labeled prime if the comparison held (that information is now in the flags register), or else RIP is incremented by the instruction length if it did not. So execution either jumps to some target or continues linearly, depending on the result of the previous comparison.
Interrupt Handling
Another phenomenon in which RIP changes to a value other than RIP + instruction_length is an interrupt. When an interrupt happens (e.g. when you type something on your keyboard, or when disk I/O completes), the current context is saved on the stack (just as in a function call) and RIP is assigned the address of the interrupt handler. How does the processor know the address of the interrupt handler? The kernel sets that address at a specific location in memory when the machine starts up.
Call/ret Instructions
We have already looked at call and ret instructions, they also lead to non-linear jumps in the value of RIP.
This ends the description of the fetch-decode-execute cycle of the processors. However, while we are at it, a few other concepts are in order.
Pipelining
Above "fetch-decode-execute" suggests that the order in which operations happen is: fetch instruction 1, decode instruction 1, execute instruction 1, fetch instruction 2, decode instruction 2, execute instruction 2, and so on.
However, the parts of the processor that do fetching, decoding and execution are different, so while one instruction is executing, the next instruction can be in the decode stage, and the one after that in the fetch stage. That way, overall IPC (instructions per cycle) increases. So the execution looks like this:
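```
cycle:    1    2    3    4    5
inst 1:   F    D    E
inst 2:        F    D    E
inst 3:             F    D    E
```

(F = fetch, D = decode, E = execute - a new instruction enters the pipeline every cycle.)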
In this article we have discussed a 3-stage pipeline. In real-life processors, the pipeline has many more stages - e.g. 12 to 20.
Note that when instructions execute one after the other, the above scheme works well. However, what if the pipeline encounters a conditional jump instruction? We don't know which instruction will execute after the jump until the jump itself has finished executing. In such cases processors do "branch prediction" - they make a good guess about which path the branch will take and do "speculative execution" of that path. Care is taken not to write anything to memory during speculation, since that would be irreversible. If the guess turns out to be wrong, the pipeline is "flushed" - the speculative work is discarded and the correct path is redone. Though branch mispredictions are costly due to these flushes, branch prediction is still worth it: for most workloads it is easy to get 99%+ accuracy, since most branches almost always go the same one of the two directions - e.g. while (i < 100) goes one direction 100 times and then the other direction once.
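Branch predictability is easy to observe from user space. The sketch below (unscientific, and assuming a modest optimization level like -O1 so the compiler keeps the branch) sums the same data twice: with random data the branch is hard to predict; after sorting, it is almost perfectly predictable and the loop runs noticeably faster.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static int cmp(const void *a, const void *b) {
    return *(const int *)a - *(const int *)b;
}

// Sum elements >= 128; the if is the branch whose predictability varies.
static double timed_sum(const int *data, int n, int reps, long long *sum) {
    clock_t t0 = clock();
    for (int r = 0; r < reps; r++)
        for (int i = 0; i < n; i++)
            if (data[i] >= 128)
                *sum += data[i];
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

int main(void) {
    enum { N = 1 << 20, REPS = 50 };
    static int data[N];
    long long s1 = 0, s2 = 0;
    for (int i = 0; i < N; i++)
        data[i] = rand() % 256;

    double t_random = timed_sum(data, N, REPS, &s1);  // unpredictable branch
    qsort(data, N, sizeof data[0], cmp);              // now the branch is predictable
    double t_sorted = timed_sum(data, N, REPS, &s2);

    printf("random: %.2fs  sorted: %.2fs (sums equal: %d)\n",
           t_random, t_sorted, s1 == s2);
    return 0;
}
```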
Superscalar and out of order executions
Modern processors are much more complex. In particular, they are "superscalar" i.e. multiple instructions can be in execute stage simultaneously as long as they don't have data dependence. E.g. if you have A = B + C and D = E * F then the machine instructions for them can execute simultaneously.
Also, instructions need not be executed in the same order as they appear in the program. E.g. suppose you have the following instructions:
```
1. load r1, [memory]   // takes 100+ cycles (cache miss)
2. add  r2, r3, r4     // takes 1 cycle
3. mul  r5, r6, r7     // takes 1 cycle
4. add  r8, r1, r9     // needs r1 from instruction 1
```

Instruction 4 needs r1 and therefore must wait for the slow load in instruction 1 to complete. Instructions 2 and 3, however, depend on neither the load nor each other, so the processor can execute them while the load is still outstanding, rather than stalling behind it.
The need for Memory Management Unit - MMU
In the last section we studied the basics of processors - fetch, decode and execute. If only one process ever needed to run at a time, that would be all we need. However, we want multiprocessing systems - where the same processor is time-shared among many processes, so that you can run a text editor, a web browser, a database server and a web server all simultaneously.
The key thing to note is that all the processes refer to memory addresses, and they are compiled independently. Thus, if two processes run together, they may read from or write to the same memory location. That is not what we want - a process may write value x to a memory location m, and moments later, when it reads the value back, it may read y because some other process overwrote it. To avoid this, we introduce the concept of "virtual memory".
Page table walk and address translation
In modern processors, when a program accesses a memory location V, as in load r1, V, RAM address V is not what gets accessed - rather, V is treated as a "virtual address" and is translated to a "physical address" P, which is then accessed in main memory.
This translation is done by the processor itself - kernel code is not involved in the mechanics of translation. However, the kernel sets up the "page tables" using which the translation is done. So let us now look at this process in detail.
First, we consider the anatomy of a virtual address V.
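With 4 KB pages on x86-64, only the low 48 bits of V participate, carved up as follows (V[x:y] denotes bits x through y of V, the notation used below):

| V[39:47] | index into the PML4 table (9 bits) |
| V[30:38] | index into the PDPT (9 bits) |
| V[21:29] | index into the PDT (9 bits) |
| V[12:20] | index into the PT (9 bits) |
| V[0:11] | byte offset within the 4 KB page (12 bits) |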
Given a virtual address V, here is the process to find the physical address P corresponding to it. We call this process "page table walk".
First, the processor reads the register CR3. This register is written by the kernel at the time of a context switch. We say that CR3 points to the root of the page table hierarchy. Imagine that, beginning at the address stored in CR3, there are 512 entries, each entry 8 bytes. We will see which of these 512 entries we want to read during page translation.
Then the processor accesses the memory address CR3 + 8 * V[39:47]. There are 9 bits in V[39:47], and since 2 ** 9 = 512, these 9 bits encode which of the 512 entries at the page table root we want to read. Thus the 8 bytes starting at CR3 + 8 * V[39:47] form a certain page table entry.
So now the processor reads these 8 bytes, called the PML4 entry (page map level 4 entry). Here is the structure of a page table entry:
| Bits 63-52 | Available for OS use |
| Bits 51-12 | Physical address (40 bits) - points to next level or physical page |
| Bits 11-9 | Available for OS use |
| Bit 8 | Global (G) |
| Bit 7 | Page size (PS) - 0 for 4KB, 1 for larger pages |
| Bit 6 | Dirty (D) |
| Bit 5 | Accessed (A) |
| Bit 4 | Cache disable (PCD) |
| Bit 3 | Write-through (PWT) |
| Bit 2 | User/Supervisor (U/S) |
| Bit 1 | Read/Write (R/W) |
| Bit 0 | Present (P) |
For now we will ignore all the bits except bits 51-12. These bits contain the start address of the next level of the page table hierarchy. If this address is A3, the processor finds the next-level page table entry by reading the 8 bytes starting at A3 + 8 * V[30:38]. We call these 8 bytes a PDPT entry (page directory pointer table entry).
Again, if A2 is the content of bits 51-12 of the PDPT entry, then the processor reads the 8 bytes starting at A2 + 8 * V[21:29]; we call this the PDT entry (page directory table entry).
If A1 is the content of bits 51-12 of the PDT entry, then the processor reads the 8 bytes starting at A1 + 8 * V[12:20]; we call this the PT entry (page table entry).
Finally, if A0 is the content of bits 51-12 of the PT entry, then the processor reads the byte (or however many bytes are needed) from the address A0 + V[0:11]. In other words, A0 + V[0:11] is the physical address corresponding to the original virtual address V.
Putting the four levels together, the whole walk can be captured in a few lines of C (a simulation, not real MMU code: "physical memory" is modeled as a flat byte array, and the entry layout follows the table above).
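```c
#include <stdint.h>

// A simulation of the 4-level walk: "physical memory" is a flat
// byte array, and cr3 is an offset into it. Real hardware reads
// DRAM directly; this is a sketch of the logic, not real MMU code.
static uint8_t phys_mem[1 << 24];

static uint64_t read_entry(uint64_t table_base, uint64_t index) {
    // Each table holds 512 entries of 8 bytes each.
    return *(uint64_t *)(phys_mem + table_base + 8 * index);
}

// Returns the physical address for virtual address v, or -1 when a
// present bit is 0 (where real hardware raises a page fault).
int64_t translate(uint64_t cr3, uint64_t v) {
    uint64_t base = cr3;
    int shift[] = {39, 30, 21, 12};            // PML4, PDPT, PDT, PT indices
    for (int level = 0; level < 4; level++) {
        uint64_t index = (v >> shift[level]) & 0x1FF;   // 9 bits per level
        uint64_t entry = read_entry(base, index);
        if (!(entry & 1))                      // bit 0: present (P)
            return -1;                         // page fault
        base = entry & 0x000FFFFFFFFFF000ULL; // bits 51-12: next table / page
    }
    return base + (v & 0xFFF);                 // add the 12-bit page offset
}
```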
Translation lookaside buffer (TLB)
As should be clear from the above description, a single virtual address access can lead to multiple actual accesses:
- Reading the address contained in CR3 (a register read, not a memory access)
- Reading PML4 entry
- Reading PDPT entry
- Reading PDT entry
- Reading PT entry
- Reading the content of actual physical address
That is five memory accesses (four page table reads plus the actual data access) for a single load / store assembly instruction! So, processors use caching - in a cache called the "translation lookaside buffer", they cache the translation of V[12:47], thus avoiding all the memory accesses except the last one. (Even the last access could be served by the L1/L2 caches, but we are not going there.)
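Here is a toy direct-mapped TLB in C to make the idea concrete (real TLBs are set-associative and carry more state, but the lookup logic is the same in spirit):

```c
#include <stdbool.h>
#include <stdint.h>

#define TLB_ENTRIES 64

struct tlb_entry {
    uint64_t vpn;    // virtual page number: V[12:47]
    uint64_t pfn;    // physical frame number the page maps to
    bool valid;
};
static struct tlb_entry tlb[TLB_ENTRIES];

// On a hit, the translation completes with zero page table accesses.
bool tlb_lookup(uint64_t vaddr, uint64_t *paddr) {
    uint64_t vpn = vaddr >> 12;
    struct tlb_entry *e = &tlb[vpn % TLB_ENTRIES];
    if (e->valid && e->vpn == vpn) {
        *paddr = (e->pfn << 12) | (vaddr & 0xFFF);
        return true;
    }
    return false;    // miss: fall back to the 4-level page walk
}

// What a write to CR3 effectively does to the TLB (see next section).
void tlb_flush(void) {
    for (int i = 0; i < TLB_ENTRIES; i++)
        tlb[i].valid = false;
}
```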
Context switch, CR3 changes and TLB flushes
From the above description it follows that CR3 - the address of the root of the page table hierarchy - is one per process. So, when a context switch from one process to another happens, the kernel's write to the CR3 register can be thought of as the precise point at which the context switch happens.
Another thing to note is that when the write to CR3 happens, the TLB is flushed, i.e. all the entries of the TLB are invalidated, since the new process would have a different set of translations.
It is the responsibility of the kernel to ensure that virtual addresses from two different processes don't translate to the same physical address. (Actually, in some cases two different processes do have the same translations; we will see those use cases in the next sections.)
Page faults
This is a good point to study the "page fault". To understand it, we need to look back at the structure of a page table entry. Bit 0 there was P, which stands for "present". This bit is 1 if the entry is valid and 0 if it is not. If, during a page walk, the processor encounters an entry which is not present, it means that the translation of the corresponding virtual address does not exist. The processor then generates a "page fault" - i.e. it saves the current state on the stack, looks up the page fault handler address in the IDT (interrupt descriptor table - which is set up by the kernel at boot time), switches execution to ring 0 (i.e. kernel mode) and executes the handler.
Why would a page table entry not be present? Well, by default entries are not present. The virtual address space is very large - since we use bits 0-47 of the virtual address, i.e. a total of 48 bits, for address translation, the total virtual space available to a process is 2 ** 48 = 256 TB. Most systems don't have that much physical memory, so most virtual addresses won't have corresponding physical addresses. The processor thus needs a way to know which translations are present and which are not.
Another phenomenon to consider is the following. When an application program allocates a large amount of memory (say 1 GB), the kernel need not allocate physical memory for the whole 1 GB. It can just note down in its data structures that it has promised 1 GB of memory, starting at some virtual address V. When the application code accesses any address between V and V + 1 GB, a page fault is generated, since the P bit in the page table entry is 0. The kernel then finds that it had promised this address to the application. So it finds an available physical page (one not allocated to another process), sets up the page table entry (including setting the P bit to 1), and jumps back to the same instruction that caused the page fault. This time, the instruction succeeds.
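You can watch this lazy allocation from user space on Linux. In the sketch below, mmap returns a 1 GB virtual range immediately; physical pages are mapped in only as bytes are first touched, each first touch taking a page fault behind the scenes:

```c
#include <stdio.h>
#include <sys/mman.h>

int main(void) {
    size_t len = 1UL << 30;   // ask for 1 GB of virtual address space
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    // No physical memory is committed yet. Each of these first
    // touches triggers a page fault that the kernel services by
    // mapping in one 4 KB physical page.
    p[0] = 1;
    p[len - 1] = 1;
    printf("touched 2 pages out of a %zu-byte mapping\n", len);
    munmap(p, len);
    return 0;
}
```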
Fork
This is also a good place to study how the fork system call is implemented, at least in the Linux kernel. Before we study fork, we need to know about bit 1 - the read/write bit - in the page table entry. If the bit is 0, address translation succeeds only if it is happening during a read instruction. If the translation is attempted during a write instruction, a page fault is generated.
Coming back to the fork system call: when a fork happens, both parent and child processes are supposed to have the view of memory that the parent had just before the fork. If address a has content c in the parent process, the same should hold in the child. After the fork, however, parent and child go their own ways - updates by one are not reflected in the other.
One way to implement fork would be to copy all the physical pages belonging to the parent process, create a fresh set of page tables containing the new translations, and have CR3 point to the root of the new page table hierarchy. However, that would mean copying large amounts of data - and what is worse, fork is often followed by exec in the child, whereupon the child completely discards the memory it received from the parent.
So the kernel instead follows a better scheme. A basic version of it: the kernel creates new page table entries for the child such that all the address translations in the new set of tables are the same as the old translations. However, in both the parent and the child, all entries are marked read-only. As long as the two processes only read memory, there is no problem. When either attempts to write to a location, a page fault is generated (since the PTE marks the translation read-only). In servicing the page fault, the kernel creates a fresh copy of the page being written to. The PTEs of parent and child are then updated so that the translations point to different physical pages, and both entries are marked read-write. The faulting instruction is then re-attempted - and this time it succeeds.
Note that the application code is oblivious to all this: it attempted a write, and the write succeeded. We call the above process copy-on-write, or COW.
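Here is a quick runnable demonstration of these semantics - the page copying is invisible to the program; we only observe that the child's write does not affect the parent:

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    int x = 42;
    pid_t pid = fork();
    if (pid == 0) {
        // Child: this write faults, the kernel copies the page (COW),
        // and only the child's copy of the page changes.
        x = 7;
        printf("child:  x = %d\n", x);   // 7
        exit(0);
    }
    wait(NULL);
    printf("parent: x = %d\n", x);       // still 42
    return 0;
}
```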
The kernel is mapped into every address space, accessible in kernel mode only
Finally, note that when an interrupt arrives, or a page fault happens, the application code (user mode code) stops executing and kernel code starts executing (from a prespecified address). On this switch, CR3 does not change - i.e. the address space remains the same. So there are virtual addresses in every process which map to kernel code and data. However, we don't want user mode code to access that code and data, despite some virtual addresses mapping to it. To achieve this, we use bit 2 - the user/supervisor bit - of the page table entry. When this bit is 0, the entry can be used only in kernel (supervisor) mode of execution. So, if user code tries to access the virtual addresses corresponding to such an entry, there is a page fault.
This is an elegant scheme: it avoids both copying the kernel code into every address space and changing CR3 on every user-kernel mode switch.
Final thoughts
I hope you enjoyed reading this article as much as I enjoyed writing it.