Branch instructions are the primary means by which a program running on a CPU makes a decision.
This post is part of a series of posts on CPU performance, as part of the Pointer Wars: Linked List Edition challenge. This challenge is great for undergraduates, graduate students, and new engineers who want feedback about writing high performance C or C++ code. Much more info here.
The Sequential Execution Model and Branch Instructions
Programs written to execute on a CPU follow something called the sequential execution model. Sequential execution means that each CPU instruction is executed one at a time, and unless specified otherwise, the next instruction to execute is at the memory location directly after the current instruction. Do note that the sequential execution model refers to the semantics that the CPU must follow, not the actual implementation: implementations violate this rule all the time, in ways the programmer can't observe. Instructions that might change the control flow of a program (control flow being the path the program takes through the code) are called branches or jumps, depending on the instruction set of the CPU. There are also dedicated call and return instructions to support function calls, as well as instructions used to implement privileged or debug execution, such as system calls. For the purposes of this blog post, I'm generally going to refer to all of these instructions as "branches".
Branches ultimately have a target, which is an instruction to which they may branch if some condition is met. When a branch redirects the CPU to that target instruction, the branch is said to be "taken". That's true even when the target instruction is the next sequential address (an odd corner case that likely never occurs in real code). When a branch instruction is not taken, it's called "not taken". Relatively straightforward terminology.
Conditional versus Unconditional Branches
One classification of branches is whether they are conditional or unconditional. An unconditional branch is one that is always taken, by definition. Unconditional branches tend to show up when the compiler absolutely can't avoid them, often because the target of the branch cannot be placed sequentially in memory following the instruction that precedes it, for whatever reason (a real constraint, or a compiler heuristic telling it not to). That tends to occur with function calls and returns, goto statements, switch statements, tail call recursive function implementations, and unoptimized C loop implementations (C semantics require that the condition of a while loop be evaluated before each execution of the loop body, and an unoptimized build implements that check-body-repeat structure with branches). There are likely others (compilers emit some beautifully optimized code), but those in my mind are the common cases.
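As a sketch of that last case, consider a simple while loop. One common unoptimized lowering (exact code generation varies by compiler and target) checks the condition at the top with a conditional branch and ends the body with an unconditional branch back to the check:

```c
/* A hypothetical example: with optimizations off, the loop below is
   commonly emitted as
     top:  conditional branch to 'done' if i >= n
           ...loop body...
           unconditional branch back to 'top'
     done: ...
   An optimizing compiler often rotates the loop so the unconditional
   branch disappears from the hot path. */
int sum_to(int n) {
    int sum = 0;
    int i = 0;
    while (i < n) {
        sum += i;
        i++;
    }
    return sum;
}
```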
Conditional branches, on the other hand, are only taken if a particular condition is met; otherwise they "fall through" to the next sequential instruction. The condition to be met is specified by the assembly language instruction. There tend to be two classes of specifying conditions, and an instruction set tends to gravitate towards one or the other (a sketch after the list illustrates both styles):
- Whether certain CPU flags are set. Certain CPU implementations have a flags register, a special purpose register which is populated by the ALU or other instructions to indicate that certain conditions have occurred. The four common flags are Zero (Z), Carry (C), Negative (N), and signed overflow (V), and a combination of these flags can be used to determine equality, overflow, and inequality (less than/greater than). These instruction sets often rely on an explicit compare instruction to update the flags (although the results of many other ALU instructions can update the flags as well) and may have specific instructions whose sole purpose is to set or clear individual flags.
- Comparison between two registers or other data. Other CPU implementations compare two register values, or a register and an immediate value, or a register and a value in a memory location, or even a memory value and an immediate value. An immediate value is a constant that is directly encoded in the instruction, or implied by the branch instruction itself (e.g., compare register against zero). Based on the comparison that the branch instruction performs, and the criteria specified by the instruction (e.g., equal, greater than), the branch is either taken or not taken. Some instruction sets have helper instructions, such as the MIPS SLT (Set Less Than), which are used in conjunction with equality testing to implement certain branch conditions across multiple instructions.
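To make the two styles concrete, here's the same C comparison annotated with how each class of instruction set might encode it. x86 and RISC-V serve purely as illustrative examples; exact instruction selection depends on the compiler and target:

```c
/* Returns the smaller of two ints. The 'a < b' test might compile to:
   - flags style (e.g., x86):      cmp eax, ebx        ; sets N, V, Z, C
                                   jl  take_a          ; branch if N != V
   - compare style (e.g., RISC-V): blt a0, a1, take_a  ; one instruction */
int min_int(int a, int b) {
    if (a < b)
        return a;
    return b;
}
```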
Conditional branches show up when the program has to make a choice and the compiler can't deduce at compile time that the choice remains unchanged throughout the lifetime of the program (or, even when it can, can't deduce the value of the condition). When you think of conditional branches, you should think of if statements and loop conditions.
As to the "can't deduce at compile time" verbiage, compilers can sometimes deduce that an if statement or loop will either always execute or never execute, for example in the case of pre-processor defines being used as the condition of an if statement. A common C idiom is for debug-only code to be placed in an if statement whose condition is a pre-processor define specified in debug mode. By the time the compiler sees the code, the pre-processor will have replaced the DEBUG (or similarly named) define with an integer, which allows the compiler to either include or eliminate the body of that if statement. Writing code so that values which do not change are visible as compile-time constants is a useful skill that aids your compiler in generating better code.
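Here's a minimal sketch of that idiom (the macro name and message are just for illustration):

```c
#include <stdio.h>

#ifndef DEBUG
#define DEBUG 0  /* a debug build would compile with, e.g., -DDEBUG=1 */
#endif

void do_work(int value) {
    /* After preprocessing, DEBUG is a plain integer constant, so the
       compiler can keep or delete this entire block at compile time;
       a release build needs no run-time branch here at all. */
    if (DEBUG) {
        fprintf(stderr, "do_work called with %d\n", value);
    }
    /* ... the actual work ... */
}
```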
Whether compiler optimizations are turned on or off makes a big difference in the number of branches the compiler emits into your program. Unoptimized code tends to be a lot more branchy, in a rather regular and beautiful way, but it is inherently inefficient in part because of that.
Direct versus Indirect Branches
Whether a branch is direct or indirect is the other common axis by which a branch instruction is described. A direct branch is one that has its target address directly encoded within the instruction, often as a relative offset as opposed to an absolute address. This means that the branch instruction specifies the number of instructions to go forwards or backwards in memory, expressed as a signed integer, relative to the branch instruction's memory address (or often the address of the instruction after the branch, for microarchitectural reasons that are beyond the scope of this post). This is very useful for if statements and loops, because the number of instructions to go forward or backward is a much, much, much smaller number than the absolute address to branch to (which may not even be known at compile time). That keeps instruction sizes and program executable sizes smaller than they would otherwise be. Direct branches can also encode the absolute address to branch to, as opposed to the "relative addressing mode" just discussed. Specifying the full address is common in the implementation of function calls, whose targets may be much farther away than the target of an if statement or loop. The address can be known at compile or link time, and may even be deferred until dynamic linking is performed on the executable, but the point is that at some point prior to the first execution of the branch, the address is fixed, never changes, and is directly encoded in the instruction.
Indirect branches, those that rely on the value in a register or memory location to determine their target, are a different beast. The most common form of indirect branch is the one used by return statements from functions. A function always has a single entry point in a program (at least, I don't know of any exceptions to this rule), but can have many places to which to return, making a traditional direct branch impossible to use when returning. This is why return addresses are stored on the stack, or in a specially designated general purpose register chosen by the compiler (or frankly, both).
There are other uses of indirect branches as well, namely the implementation of function pointers in the C and C++ programming languages, and the support of virtual functions in C++ classes.
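As a hedged sketch of the function pointer case (the function names are invented for illustration), the call through 'op' below is what typically compiles to an indirect branch, because the target address lives in a register rather than being encoded in the instruction:

```c
#include <stdio.h>

typedef int (*binop_fn)(int, int);

static int add(int a, int b) { return a + b; }
static int mul(int a, int b) { return a * b; }

/* The call through 'op' is an indirect branch: its target is whatever
   address the caller stored in the pointer, so it can't be encoded
   directly in the call instruction. */
static int apply(binop_fn op, int a, int b) {
    return op(a, b);
}

int main(void) {
    printf("%d\n", apply(add, 3, 4));  /* prints 7 */
    printf("%d\n", apply(mul, 3, 4));  /* prints 12 */
    return 0;
}
```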
Branch Prediction: A Fun Game to Play with The Sequential Execution Model
Forget CPU architecture entirely and focus on some first principles with me: if you're trying to execute instructions one by one and do it as fast as possible, you might make a realization or two. One realization is that most instructions simply pass control to the instruction that comes after them. If you drop the magical notion that "an instruction just appears out of nowhere" and accept that grabbing the next instruction has a time cost, you're likely motivated to determine what the next instruction is and fetch it while the current instruction is being executed, because you know with high confidence that it's the next one. This works perfectly (mostly anyway, some exceptional cases apply), except for those nasty branches, which may go to their target instruction or fall through to the next sequential instruction. A priori, your CPU can't tell, and that presents a problem for your plan to speed up execution.
In comes branch prediction: when your CPU encounters a branch, it predicts both whether the branch is taken or not taken, and what the next instruction will be. Once your CPU evaluates the branch condition and checks the prediction, it has either predicted correctly, in which case great job, or predicted incorrectly, which means that some (likely all) of the work your CPU did needs to be thrown out. The word "some" in the previous sentence only exists because there are publicly known techniques that seek to minimize the amount of work thrown out, but they are well within the realm of graduate school literature and not discussed here (feel free to research "Dynamic Instruction Reuse").
Branch prediction works because branches tend to be predictable. For example, in a loop that runs 100 times, if you merely predicted that the loop would continue running each time, you'd be correct 99 out of 100 times. The amount of time spent cleaning up the cases where the prediction was wrong must be lower than the amount of time saved by making the prediction. In practice this is rather easy to achieve with academically published methods, and certainly modern CPUs do an astonishingly great job at this. Even in the 1990s, three decades ago, branch predictors could achieve 90 to 95% accuracy depending on the workload being run, and branch prediction has remained a subject of research from then through today.
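As a sketch of why predictability depends on the data as much as the code (the function below is hypothetical), compare the two branches in this loop:

```c
/* The loop-back branch is taken on every iteration but the last, so
   "predict taken" is right n-1 times out of n. The if branch is only
   as predictable as the data: on sorted or patterned input it predicts
   well; on random input it's a coin flip and mispredicts often. */
int count_odds(const int *data, int n) {
    int odds = 0;
    for (int i = 0; i < n; i++) {
        if (data[i] & 1)
            odds++;
    }
    return odds;
}
```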
Branch prediction is rather important: a common rule of thumb is that one fourth to one fifth of a "typical" or "average" program's instructions are branches. Predicting correctly isn't the only problem with branches, however.
Opportunities For Improving Branching Related Performance Issues
Frequent branches are expensive. Not only can they be predicted incorrectly, requiring time to recover from the misprediction, but they may limit the number of instructions the CPU can execute per cycle. It's becoming more commonplace for a CPU to be able to execute multiple branches per clock cycle, but your code isn't guaranteed to be running on a CPU that can do that. Software optimization guides may warn compiler writers that exceeding a certain number of branches within every 16 bytes, or every cache line (64 bytes, or 16 ARM instructions), may carry a performance penalty. Branches don't just pose a problem because of their predictability, but also because of their frequency: if the CPU must stall fetching a branch, it stalls all the instructions following that branch as well.
I’m not sure how much these are “branch specific” as opposed to general good software practices, but here’s what I’ve got for you in terms of how to write better code, with branches in mind:
- Reduce the complexity of if statement and while loop conditions. A compiler may have to emit multiple branches per term in a logical expression found in an if or while loop condition. A common C idiom for a pointer to a struct foo is "if (foo != NULL && foo->bar == 7)". If the function makes use of foo multiple times and checks whether foo is NULL multiple times, consider checking once and returning an error condition, as opposed to performing a NULL pointer check each time, if your code is structured in such a way that that makes sense (see the first sketch after this list). A compiler generally can't prove a pointer is non-NULL in every case, even if you know it to be, which means it's going to check every time your code specifies it must.
- Ensure that your functions which can be inlined are actually being inlined. Inlining is a compiler optimization in which the code of one function is placed into its caller. This is incredibly useful, because it allows function specific optimizations such as constant propagation and dead code elimination to be performed within the calling function. It also kills off two branches, a function call and a function return, and perhaps others via constant propagation, especially if those constants control the behavior of if or switch statements. Functions in shared object files in particular may not be inlined, even when you think they should be, and compiler heuristics are by definition not optimal. At the same time, certain classes of functions, such as recursive ones, pose substantial problems for inlining, if it's possible to inline them at all!
- Avoid extreme depth of function calls, if possible. Certain CPU implementations, including academic ones, optimize function returns through the use of a "Return Address Stack", a small internal-to-the-CPU stack used to predict the target of each function return. If your call depth exceeds the size of this stack, your program may suffer for it. This can easily occur when your program calls library foo, which calls library bar, which calls library baz, which calls library blah. That sort of code is egregiously inefficient for a number of reasons, probably the least of which is the Return Address Stack, but it is something that can cause mispredictions and thus made the list.
- Avoid code patterns that generate indirect branches. Indirect branches are nearly always more expensive than direct branches because their target can be anywhere, making them harder to predict (even if in practice your program only ever uses two targets). C++ inheritance and C and C++ function pointers can cause these instructions to be emitted by the compiler. This is an inherent tradeoff; C++ inheritance support can be incredibly useful, but it sometimes comes at a cost if the C++ compiler cannot devirtualize your code (requiring a virtual table lookup to determine the function address to call based on what type a pointer is). The cost must be worth the benefit, and that’s a call you have to make in line with your project’s requirements.
- Write code that can make use of conditional move or conditional select instructions. Conditional move and select instructions are a common part of many instruction sets. They allow a general purpose register to be conditionally updated (conditional move), or one of two general purpose registers to be moved into a destination register based on a condition (conditional select). These instructions are commonly introduced by the ternary operator in C and C++, or by code the compiler can prove behaves like a ternary operator. General advice here is hard; it really involves looking at a ton of assembly language and seeing patterns in what the compiler generates. In general, though, if you only update a variable in an if statement, you may wish to avoid doing additional work in that statement so that the compiler can use a conditional move instruction (see the second sketch after this list).
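As a concrete illustration of the first point, here's a hedged sketch of hoisting a NULL check to the top of a function instead of repeating it in every condition (the struct and field names are made up for the example):

```c
#include <stddef.h>

struct foo { int bar; int baz; };

/* Repeated checks: each '&&' term may become its own branch, and the
   NULL test is re-evaluated in every condition. */
int process_repeated(struct foo *foo) {
    int result = 0;
    if (foo != NULL && foo->bar == 7)
        result += 1;
    if (foo != NULL && foo->baz == 9)
        result += 2;
    return result;
}

/* Hoisted check: one early return, after which the compiler knows foo
   is non-NULL for the remainder of the function. */
int process_hoisted(struct foo *foo) {
    if (foo == NULL)
        return -1;  /* error condition */
    int result = 0;
    if (foo->bar == 7)
        result += 1;
    if (foo->baz == 9)
        result += 2;
    return result;
}
```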
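And for the last point, a hedged sketch of the conditional move pattern: a simple conditional update that compilers commonly lower to a conditional move or select, next to a version whose extra work inside the if statement tends to force a real branch (whether a given compiler actually emits a conditional move depends on the target and the optimization level):

```c
/* Good candidates for a conditional move/select: the only effect of
   the condition is which value lands in the destination. */
int max_of(int a, int b) {
    return (a > b) ? a : b;  /* ternary: classic conditional select shape */
}

void update_max(int *max, int candidate) {
    if (candidate > *max)
        *max = candidate;    /* single conditional update */
}

/* Harder to convert: the if body has a side effect beyond the update,
   so the compiler usually must emit a branch. log_new_max() is a
   hypothetical logging hook, not a real API. */
extern void log_new_max(int value);

void update_max_logged(int *max, int candidate) {
    if (candidate > *max) {
        *max = candidate;
        log_new_max(candidate);  /* side effect forces a branch */
    }
}
```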