I’ve been overwhelmed by security work lately, so let me switch to something lighter to relax my mind.
Python 3.14 officially introduces a new mechanism called the Tail Call Interpreter (contributed by Ken Jin), which is undoubtedly another major milestone that lays the foundation for future work.
Main Content
Before discussing Python 3.14’s Tail Call Interpreter, we need to talk about one of the most basic constructs in C: the switch-case statement.
```c
switch (x) {
    case 1:
        break;
    case 2:
        break;
    default:
        break;
}
```
What does the final assembly for such code look like? Most people’s first reaction is probably a series of cmp instructions, each followed by a je, falling through to the next comparison on a mismatch. Let’s compile a version and see.
For this code:
```c
void small_switch(int x) {
    switch (x) {
        case 1: printf("One\n"); break;
        case 2: printf("Two\n"); break;
        case 3: printf("Three\n"); break;
        default: printf("Other\n"); break;
    }
}
```
The final assembly output would be:
```asm
00000000000011f0 <small_switch>:
    11f0: 83 ff 02                cmp    $0x2,%edi
    11f3: 74 2b                   je     1220 <small_switch+0x30>
    11f5: 83 ff 03                cmp    $0x3,%edi
    11f8: 74 16                   je     1210 <small_switch+0x20>
    11fa: 83 ff 01                cmp    $0x1,%edi
    11fd: 75 31                   jne    1230 <small_switch+0x40>
    11ff: 48 8d 3d fe 0d 00 00    lea    0xdfe(%rip),%rdi        # 2004 <_IO_stdin_used+0x4>
    1206: e9 25 fe ff ff          jmp    1030 <puts@plt>
    120b: 0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
    1210: 48 8d 3d f5 0d 00 00    lea    0xdf5(%rip),%rdi        # 200c <_IO_stdin_used+0xc>
    1217: e9 14 fe ff ff          jmp    1030 <puts@plt>
    121c: 0f 1f 40 00             nopl   0x0(%rax)
    1220: 48 8d 3d e1 0d 00 00    lea    0xde1(%rip),%rdi        # 2008 <_IO_stdin_used+0x8>
    1227: e9 04 fe ff ff          jmp    1030 <puts@plt>
    122c: 0f 1f 40 00             nopl   0x0(%rax)
    1230: 48 8d 3d db 0d 00 00    lea    0xddb(%rip),%rdi        # 2012 <_IO_stdin_used+0x12>
    1237: e9 f4 fd ff ff          jmp    1030 <puts@plt>
    123c: 0f 1f 40 00             nopl   0x0(%rax)
```
We can see that overall it’s just what we expected: one cmp after another, each followed by a je (or jne). So what’s the complexity here? Time complexity O(n), simple enough.
Damn, Python’s interpreter dispatches through a gigantic switch-case; wouldn’t that make every dispatch O(n)? Wouldn’t that be a performance disaster?
Actually, no. Compilers usually pick different strategies for a switch-case depending on how many cases there are and how their values are distributed.
Let’s prepare several examples:
```c
#include <stdio.h>

void small_switch(int x) {
    switch (x) {
        case 1: printf("One\n"); break;
        case 2: printf("Two\n"); break;
        case 3: printf("Three\n"); break;
        default: printf("Other\n"); break;
    }
}

void dense_switch(int x) {
    switch (x) {
        case 10: printf("Ten\n"); break;
        case 11: printf("Eleven\n"); break;
        case 12: printf("Twelve\n"); break;
        case 13: printf("Thirteen\n"); break;
        case 14: printf("Fourteen\n"); break;
        case 15: printf("Fifteen\n"); break;
        case 16: printf("Sixteen\n"); break;
        case 17: printf("Seventeen\n"); break;
        default: printf("Other\n"); break;
    }
}

void sparse_switch(int x) {
    switch (x) {
        case 1: printf("One\n"); break;
        case 100: printf("Hundred\n"); break;
        case 1000: printf("Thousand\n"); break;
        case 10000: printf("Ten thousand\n"); break;
        default: printf("Other\n"); break;
    }
}

void large_dense_switch(int x) {
    switch (x) {
        case 1: printf("Case 1\n"); break;
        case 2: printf("Case 2\n"); break;
        case 3: printf("Case 3\n"); break;
        case 4: printf("Case 4\n"); break;
        case 5: printf("Case 5\n"); break;
        case 6: printf("Case 6\n"); break;
        case 7: printf("Case 7\n"); break;
        case 8: printf("Case 8\n"); break;
        case 9: printf("Case 9\n"); break;
        case 10: printf("Case 10\n"); break;
        case 11: printf("Case 11\n"); break;
        case 12: printf("Case 12\n"); break;
        case 13: printf("Case 13\n"); break;
        case 14: printf("Case 14\n"); break;
        case 15: printf("Case 15\n"); break;
        case 16: printf("Case 16\n"); break;
        case 17: printf("Case 17\n"); break;
        case 18: printf("Case 18\n"); break;
        case 19: printf("Case 19\n"); break;
        case 20: printf("Case 20\n"); break;
        default: printf("Other\n"); break;
    }
}

void mixed_switch(int x) {
    switch (x) {
        case 1: printf("Case 1\n"); break;
        case 2: printf("Case 2\n"); break;
        case 3: printf("Case 3\n"); break;
        case 50: printf("Case 50\n"); break;
        case 100: printf("Case 100\n"); break;
        case 101: printf("Case 101\n"); break;
        case 102: printf("Case 102\n"); break;
        default: printf("Other\n"); break;
    }
}

void char_switch(char c) {
    switch (c) {
        case 'a': printf("Letter a\n"); break;
        case 'b': printf("Letter b\n"); break;
        case 'c': printf("Letter c\n"); break;
        case 'd': printf("Letter d\n"); break;
        case 'e': printf("Letter e\n"); break;
        default: printf("Other char\n"); break;
    }
}
```
Then we disassemble and look at the results (I’ll only paste the key parts here):
```asm
00000000000011f0 <small_switch>:
    11f0: 83 ff 02                cmp    $0x2,%edi          # Compare against 2
    11f3: 74 2b                   je     1220               # Jump to case 2
    11f5: 83 ff 03                cmp    $0x3,%edi          # Compare against 3
    11f8: 74 16                   je     1210               # Jump to case 3
    11fa: 83 ff 01                cmp    $0x1,%edi          # Compare against 1
    11fd: 75 31                   jne    1230               # If not, jump to default

0000000000001240 <dense_switch>:
    1240: 83 ef 0a                sub    $0xa,%edi          # Subtract 10 (smallest case value)
    1243: 83 ff 07                cmp    $0x7,%edi          # Range check (17-10=7)
    1246: 0f 87 90 00 00 00       ja     12dc               # Out of range, jump to default
    124c: 48 8d 15 15 0f 00 00    lea    0xf15(%rip),%rdx   # Load jump table address
    1253: 48 63 04 ba             movslq (%rdx,%rdi,4),%rax # Fetch offset
    1257: 48 01 d0                add    %rdx,%rax          # Compute target address
    125a: ff e0                   jmp    *%rax              # Indirect jump

00000000000012f0 <sparse_switch>:
    12f0: 81 ff e8 03 00 00       cmp    $0x3e8,%edi        # Compare against 1000
    12f6: 74 40                   je     1338               # If equal, jump
    12f8: 7f 16                   jg     1310               # If greater than 1000, keep checking
    12fa: 83 ff 01                cmp    $0x1,%edi          # Less than 1000, check 1
    12fd: 74 49                   je     1348
    12ff: 83 ff 64                cmp    $0x64,%edi         # Check 100
    1302: 75 24                   jne    1328               # Nothing matched, default
    ...
    1310: 81 ff 10 27 00 00       cmp    $0x2710,%edi       # Check 10000

0000000000001360 <large_dense_switch>:
    1360: 83 ff 14                cmp    $0x14,%edi         # Check if <= 20
    1363: 0f 87 53 01 00 00       ja     14bc               # Out of range
    1369: 48 8d 15 18 0e 00 00    lea    0xe18(%rip),%rdx   # Jump table address
    1372: 48 63 04 ba             movslq (%rdx,%rdi,4),%rax # Direct indexing
    1376: 48 01 d0                add    %rdx,%rax
    1379: ff e0                   jmp    *%rax

00000000000014d0 <mixed_switch>:
    14d0: 83 ff 32                cmp    $0x32,%edi         # Compare against 50
    14d3: 74 7b                   je     1550
    14d5: 7f 29                   jg     1500               # > 50 branch
    14d7: 83 ff 02                cmp    $0x2,%edi          # <= 50, check the small values
    14da: 74 64                   je     1540
    14dc: 83 ff 03                cmp    $0x3,%edi
    ...
    1500: 83 ff 65                cmp    $0x65,%edi         # > 50, check 100, 101, 102
    1503: 74 5b                   je     1560

0000000000001580 <char_switch>:
    1580: 83 ef 61                sub    $0x61,%edi         # Subtract ASCII value of 'a'
    1583: 40 80 ff 04             cmp    $0x4,%dil          # Check if <= 4 ('a'..'e')
    1587: 77 63                   ja     15ec
    1589: 48 8d 15 4c 0c 00 00    lea    0xc4c(%rip),%rdx
    1590: 40 0f b6 ff             movzbl %dil,%edi          # Zero-extend to 32 bits
    1594: 48 63 04 ba             movslq (%rdx,%rdi,4),%rax
```
Here we can see that the compiler lowers each switch differently depending on how many cases it has and how dense their values are. Let me summarize this with a table:
| Function | Number of cases | Case value distribution | Strategy | Complexity |
| --- | --- | --- | --- | --- |
| small_switch | 3 | Consecutive (1, 2, 3) | Linear comparison | O(n) |
| dense_switch | 8 | Consecutive (10-17) | Offset jump table | O(1) |
| sparse_switch | 4 | Sparse (1, 100, 1000, 10000) | Binary search | O(log n) |
| large_dense_switch | 20 | Consecutive (1-20) | Standard jump table | O(1) |
| mixed_switch | 7 | Partially consecutive | Nested comparisons | O(log n) |
| char_switch | 5 | Consecutive ('a'-'e') | Character offset table | O(1) |
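To make the O(1) rows a bit more concrete, here is roughly what the “offset jump table” strategy boils down to if we write it by hand in C. This is only a mental model, not compiler output: it covers just cases 10-12, and the helper functions are invented for illustration.

```c
#include <stdio.h>

static void ten(void)    { printf("Ten\n"); }
static void eleven(void) { printf("Eleven\n"); }
static void twelve(void) { printf("Twelve\n"); }
static void other(void)  { printf("Other\n"); }

/* Hand-written analogue of the dense_switch lowering:
 * subtract the smallest case value, bounds-check the result,
 * then index into a table of jump targets. */
static void dense_switch_by_hand(int x) {
    static void (*const table[])(void) = { ten, eleven, twelve };
    unsigned idx = (unsigned)(x - 10);   /* sub $0xa,%edi            */
    if (idx > 2) {                       /* cmp/ja -> default        */
        other();
        return;
    }
    table[idx]();                        /* the indirect jmp *%rax   */
}

int main(void) {
    dense_switch_by_hand(11);   /* prints "Eleven" */
    dense_switch_by_hand(42);   /* prints "Other"  */
    return 0;
}
```

The real compiler-generated table stores label offsets inside the same function, so there isn’t even a call/return involved; the point is just the constant-time “subtract, bounds-check, index, jump” shape.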
OK, so the final implementation of a switch-case varies with the shape of its case values, which makes the generated code hard to predict. Do we have a way to work around this? The answer is yes.
Let’s look at the following code:
```c
#include <stdio.h>

void basic_computed_goto(int operation) {
    static void* jump_table[] = {
        &&op_add, &&op_sub, &&op_mul, &&op_div, &&op_mod, &&op_default
    };

    int a = 10, b = 3;
    int result;

    if (operation < 0 || operation > 4) {
        operation = 5;
    }

    printf("Operation %d: a=%d, b=%d -> ", operation, a, b);
    goto *jump_table[operation];

op_add:
    result = a + b;
    printf("ADD: %d\n", result);
    return;

op_sub:
    result = a - b;
    printf("SUB: %d\n", result);
    return;

op_mul:
    result = a * b;
    printf("MUL: %d\n", result);
    return;

op_div:
    if (b != 0) {
        result = a / b;
        printf("DIV: %d\n", result);
    } else {
        printf("DIV: Error (division by zero)\n");
    }
    return;

op_mod:
    if (b != 0) {
        result = a % b;
        printf("MOD: %d\n", result);
    } else {
        printf("MOD: Error (division by zero)\n");
    }
    return;

op_default:
    printf("Unknown operation\n");
    return;
}
```
The core move here is to turn each case of the switch into a label, then jump straight to the right label through jump_table, an array of label addresses built with the GCC/Clang &&label extension. Let’s look at the assembly for the most critical part:
```asm
11d3: 48 8d 05 c6 2b 00 00    lea    0x2bc6(%rip),%rax    # 3da0 <jump_table.0>
11da: ff 24 d8                jmp    *(%rax,%rbx,8)
```
To summarize, compared with a traditional switch-case, Computed Goto gives us the following advantages:
- Lower branch misprediction cost
- Better instruction cache locality
- Fewer cmp instructions and less range-checking overhead
So how much faster can it get? You can refer to the results when Computed Goto was introduced into CPython: the reported overall improvement was about 15%.
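The reason this matters so much for interpreters specifically is that the dispatch can be replicated at the end of every handler, so each indirect jump gets its own branch-prediction history instead of funneling through one shared switch. Here is a minimal sketch of that pattern; the opcode set and program are invented for illustration and this is not CPython code:

```c
#include <stdio.h>

enum { OP_PUSH1, OP_ADD, OP_PRINT, OP_HALT };

/* A tiny computed-goto bytecode loop: every handler ends with its own
 * DISPATCH(), i.e. its own indirect jump, rather than looping back to a
 * single shared switch statement. */
static void run(const unsigned char *pc) {
    static void *dispatch_table[] = { &&op_push1, &&op_add, &&op_print, &&op_halt };
    int stack[16], *sp = stack;

#define DISPATCH() goto *dispatch_table[*pc++]
    DISPATCH();

op_push1:
    *sp++ = 1;
    DISPATCH();
op_add:
    sp--;
    sp[-1] += sp[0];
    DISPATCH();
op_print:
    printf("%d\n", sp[-1]);
    DISPATCH();
op_halt:
    return;
#undef DISPATCH
}

int main(void) {
    const unsigned char program[] = { OP_PUSH1, OP_PUSH1, OP_ADD, OP_PRINT, OP_HALT };
    run(program);   /* prints 2 */
    return 0;
}
```

With a plain switch, every iteration would go through the same comparison chain or the same single indirect jump; here each handler’s final `goto *` is a separate branch that the CPU can learn to predict on its own.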
So is Computed Goto perfect? Not quite. CPython’s interpreter loop in ceval.c, probably one of the largest switch-cases around, still has several typical problems:
- Computed Goto is a GCC/Clang extension, so compilers and platforms without it see little of the benefit
- The code generated around Computed Goto is not stable; different versions of the same compiler can have different issues
- An extremely large switch-case keeps the compiler from optimizing it well
- We cannot use perf to precisely quantify per-opcode overhead across the whole loop, which is a big problem in the context of making Python faster
Points 1, 3, and 4 are easy to understand. Let’s look at an example of point 2 (thanks to Ken Jin for providing the example).
With GCC 11, the dispatch code generated in certain places looks perfectly normal:
```asm
738f: movq    %r13, %r15
7392: movzbl  %ah, %ebx
7395: movzbl  %al, %eax
7398: movq    (,%rax,8), %rax
73a0: movl    %ebx, -0x248(%rbp)
73a6: jmpq    *%rax
```
While GCC 13 through 15 beta generate code like this:
```asm
747a: movzbl  %ah, %ebx
747d: movzbl  %al, %eax
7480: movl    %ebx, -0x248(%rbp)
7486: movq    (,%rax,8), %rax
748e: jmp     0x72a0 <_PyEval_EvalFrameDefault+0x970>

72a0: movq    %r15, %xmm0
72a5: movq    %r13, %xmm3
72aa: movq    %r15, %rbx
72ad: punpcklqdq %xmm3, %xmm0
72b1: movhlps %xmm0, %xmm2
72b4: movq    %xmm2, %r10
72b9: movq    %r10, %r11
72bc: jmpq    *%rax
```
We can see that extra register traffic was introduced, with values even bouncing through xmm registers. Computer Architecture 101: extra register shuffling means extra work, and this sits on the hottest path of the interpreter.
So, do we have a way to iterate on this extremely large switch-case? Some readers might be thinking: since the switch is so large, why not split it into many small functions? Then the compiler has a manageable unit to optimize, and perf can precisely attribute the overhead of each function. Wouldn’t that be great?
But others object: a function call means a call instruction, which brings the extra overhead of pushing a return address and saving and restoring registers. Won’t that make things slower again?
So can we have it both ways? The answer is yes, and many of you have probably already guessed it: tail calls.
Suppose we have this code:
```c
#include <stdio.h>

__attribute__((noinline)) void g(int x) {
    printf("Value: %d\n", x);
}

void f(int x) {
    return g(x);
}
```
We can see this assembly:
```asm
0000000000001140 <g>:
    1140: 55                      push   %rbp
    1141: 48 89 e5                mov    %rsp,%rbp
    1144: 48 83 ec 10             sub    $0x10,%rsp
    1148: 89 7d fc                mov    %edi,-0x4(%rbp)
    114b: 8b 75 fc                mov    -0x4(%rbp),%esi
    114e: 48 8d 3d af 0e 00 00    lea    0xeaf(%rip),%rdi        # 2004 <_IO_stdin_used+0x4>
    1155: b0 00                   mov    $0x0,%al
    1157: e8 d4 fe ff ff          call   1030 <printf@plt>
    115c: 48 83 c4 10             add    $0x10,%rsp
    1160: 5d                      pop    %rbp
    1161: c3                      ret
    1162: 66 66 66 66 66 2e 0f    data16 data16 data16 data16 cs nopw 0x0(%rax,%rax,1)
    1169: 1f 84 00 00 00 00 00

0000000000001170 <f>:
    1170: 55                      push   %rbp
    1171: 48 89 e5                mov    %rsp,%rbp
    1174: 48 83 ec 10             sub    $0x10,%rsp
    1178: 89 7d fc                mov    %edi,-0x4(%rbp)
    117b: 8b 7d fc                mov    -0x4(%rbp),%edi
    117e: e8 bd ff ff ff          call   1140 <g>
    1183: 48 83 c4 10             add    $0x10,%rsp
    1187: 5d                      pop    %rbp
    1188: c3                      ret
    1189: 0f 1f 80 00 00 00 00    nopl   0x0(%rax)
```
The call 1140 <g> instruction is very prominent. This is also an important source of function call overhead.
Modern compilers have a special optimization for this, tail call optimization (tail recursion being the special case where a function calls itself): when the last thing a function does is call another function, the compiler can replace the call with a jump and eliminate the overhead of the call.
Let’s test this:
```c
#include <stdio.h>

__attribute__((preserve_none)) void g(int x);

__attribute__((noinline, preserve_none)) void g(int x) {
    printf("Value: %d\n", x);
}

__attribute__((preserve_none)) void f(int x) {
    [[clang::musttail]] return g(x);
}

int main() {
    f(42);
    return 0;
}
```
Let’s look at the related assembly:
```asm
0000000000001140 <g>:
    1140: 55                      push   %rbp
    1141: 48 89 e5                mov    %rsp,%rbp
    1144: 48 83 ec 10             sub    $0x10,%rsp
    1148: 44 89 65 fc             mov    %r12d,-0x4(%rbp)
    114c: 8b 75 fc                mov    -0x4(%rbp),%esi
    114f: 48 8d 3d ae 0e 00 00    lea    0xeae(%rip),%rdi        # 2004 <_IO_stdin_used+0x4>
    1156: 31 c0                   xor    %eax,%eax
    1158: e8 d3 fe ff ff          call   1030 <printf@plt>
    115d: 48 83 c4 10             add    $0x10,%rsp
    1161: 5d                      pop    %rbp
    1162: c3                      ret
    1163: 66 66 66 66 2e 0f 1f    data16 data16 data16 cs nopw 0x0(%rax,%rax,1)
    116a: 84 00 00 00 00 00

0000000000001170 <f>:
    1170: 55                      push   %rbp
    1171: 48 89 e5                mov    %rsp,%rbp
    1174: 44 89 65 fc             mov    %r12d,-0x4(%rbp)
    1178: 44 8b 65 fc             mov    -0x4(%rbp),%r12d
    117c: 5d                      pop    %rbp
    117d: e9 be ff ff ff          jmp    1140 <g>
    1182: 66 66 66 66 66 2e 0f    data16 data16 data16 data16 cs nopw 0x0(%rax,%rax,1)
    1189: 1f 84 00 00 00 00 00
```
We can see that the last step of function f is jmp 1140 <g>, not call 1140 <g>. So when control transfers to g, we skip the overhead a traditional call instruction brings: no return address is pushed, and no extra stack frame has to be set up and torn down.
Some of you may have realized that after tail call optimization, this can essentially be viewed as a high-performance goto.
Bingo, the idea is exactly that. Guy Steele’s 1977 paper “Debunking the ‘Expensive Procedure Call’ Myth, or, Procedure Call Implementations Considered Harmful, or, Lambda: The Ultimate GOTO” already argued that well-implemented procedure calls can perform about as well as a GOTO while keeping the implementation cleaner.
In Python 3.14, the Tail Call Interpreter is built on exactly this idea.
Looking at the source, the macros that dispatch opcodes are now defined in terms of tail calls:
```c
# define Py_MUSTTAIL [[clang::musttail]]
# define Py_PRESERVE_NONE_CC __attribute__((preserve_none))

Py_PRESERVE_NONE_CC typedef PyObject* (*py_tail_call_funcptr)(TAIL_CALL_PARAMS);

# define TARGET(op) Py_PRESERVE_NONE_CC PyObject *_TAIL_CALL_##op(TAIL_CALL_PARAMS)

# define DISPATCH_GOTO() \
    do { \
        Py_MUSTTAIL return (INSTRUCTION_TABLE[opcode])(TAIL_CALL_ARGS); \
    } while (0)

# define JUMP_TO_LABEL(name) \
    do { \
        Py_MUSTTAIL return (_TAIL_CALL_##name)(TAIL_CALL_ARGS); \
    } while (0)

# ifdef Py_STATS
#   define JUMP_TO_PREDICTED(name) \
    do { \
        Py_MUSTTAIL return (_TAIL_CALL_##name)(frame, stack_pointer, tstate, this_instr, oparg, lastopcode); \
    } while (0)
# else
#   define JUMP_TO_PREDICTED(name) \
    do { \
        Py_MUSTTAIL return (_TAIL_CALL_##name)(frame, stack_pointer, tstate, this_instr, oparg); \
    } while (0)
# endif

# define LABEL(name) TARGET(name)
```
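To see the shape of this without all of CPython’s machinery, here is a toy tail-call-dispatched interpreter in the same spirit. It assumes a Clang recent enough to support both preserve_none and musttail, and every name, opcode and parameter below is invented for illustration; none of it is CPython’s actual code:

```c
#include <stdio.h>

enum { OP_PUSH1, OP_ADD, OP_HALT };

#define CC __attribute__((preserve_none))
#define TAIL_PARAMS const unsigned char *pc, int *sp

/* Each opcode handler is its own small function; dispatch is a guaranteed
 * tail call through a table of handlers, so it compiles to a jmp, not a call. */
CC typedef int (*handler)(TAIL_PARAMS);

static CC int op_push1(TAIL_PARAMS);
static CC int op_add(TAIL_PARAMS);
static CC int op_halt(TAIL_PARAMS);

static const handler table[] = { op_push1, op_add, op_halt };

/* Dispatch = "tail-call the handler for the next opcode". */
#define DISPATCH() [[clang::musttail]] return table[*pc](pc + 1, sp)

static CC int op_push1(TAIL_PARAMS) { *sp++ = 1; DISPATCH(); }
static CC int op_add(TAIL_PARAMS)   { sp--; sp[-1] += sp[0]; DISPATCH(); }
static CC int op_halt(TAIL_PARAMS)  { (void)pc; return sp[-1]; }

int main(void) {
    const unsigned char program[] = { OP_PUSH1, OP_PUSH1, OP_ADD, OP_HALT };  /* 1 + 1 */
    int stack[16];
    printf("%d\n", table[program[0]](program + 1, stack));   /* prints 2 */
    return 0;
}
```

Each handler is now an ordinary function that perf and eBPF can see and that the compiler can optimize in isolation, yet control still flows from opcode to opcode through plain jumps.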
So, while keeping baseline performance as good as (or even better than) Computed Goto, we gain the following benefits:
- Broader platform support
- With the giant switch split into per-opcode functions, compilers are much less likely to generate bad code, and overall performance becomes far more predictable
- perf is happy: each opcode handler now shows up as its own function
- And I can do more cool stuff with tools like eBPF
Summary
That’s pretty much it for this article. Although it nominally only introduces Python 3.14’s Tail Call Interpreter, it walks through the whole chain of reasoning behind it: how switch-case is lowered, why Computed Goto was adopted, and why tail calls are the next step.
It also leaves me with one takeaway: predictability is a genuinely important property in many situations.
This and remote debugging are the two features I like most in 3.14. Long live observability!
