I recently saw this tweet which I thought was funny:

How true is it?
I was surprised to learn that CPython represents each integer as a heap-allocated PyLongObject*.
Does that mean Python heap-allocates every integer you create? Integers are likely the most used data type in any program, so that would mean a lot of heap allocations.
Beyond the heap allocations themselves, this alone could make basic arithmetic operations like adding two numbers 200-500x slower[1].
Or maybe it uses the industry-standard pointer tagging technique to avoid allocating small integers?
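(If you haven’t seen pointer tagging before: since heap pointers are aligned, their low bit is always zero, so a runtime can smuggle a small integer into the word itself, marked with that bit, instead of allocating an object for it. Here’s a minimal sketch of the idea in C; the names and layout are my own illustration, not CPython’s.)

```c
#include <stdint.h>
#include <stdio.h>

/* A "value" is either a tagged small integer or a real heap pointer.
 * Heap allocations are at least 2-byte aligned, so the lowest bit of a
 * genuine pointer is always 0; we use 1 to mark an inline integer. */
typedef uintptr_t Value;

static Value make_small_int(intptr_t x) {
    return ((uintptr_t)x << 1) | 1;   /* shift left, set the tag bit */
}

static int is_small_int(Value v) {
    return v & 1;
}

static intptr_t small_int_value(Value v) {
    return (intptr_t)v >> 1;          /* arithmetic shift drops the tag */
}

int main(void) {
    Value a = make_small_int(41);
    Value b = make_small_int(1);

    /* Adding two tagged ints never touches the heap. */
    if (is_small_int(a) && is_small_int(b)) {
        Value sum = make_small_int(small_int_value(a) + small_int_value(b));
        printf("%ld\n", (long)small_int_value(sum));   /* prints 42 */
    }
    return 0;
}
```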
To find out, I modified CPython to print whenever it allocates a PyLongObject*:
```c
static PyLongObject *
long_alloc(Py_ssize_t size)
{
    assert(size >= 0);
    PyLongObject *result = NULL;
    /* ... */
    Py_ssize_t ndigits = size ? size : 1;
    if (ndigits == 1) {
        result = (PyLongObject *)_Py_FREELIST_POP(PyLongObject, ints);
    }
    if (result == NULL) {
        printf("ALLOCATING a new number object\n");
        result = PyObject_Malloc(offsetof(PyLongObject, long_value.ob_digit) +
                                 ndigits*sizeof(digit));
        if (!result) {
            PyErr_NoMemory();
            return NULL;
        }
        _PyObject_Init((PyObject*)result, &PyLong_Type);
    }
    /* ...omitted code... */
    return result;
}
```

Then I wrote a script to add some numbers 100k times:
```python
# lol.py
for i in range(0, 100_000):
    print(i + 1)
```

And then counted the number of times it allocated:
echo "Allocating number object $( ./python.exe lol.py \ | rg -o ALLOCATING \ | wc -l ) times" Allocating number object 100904 timesHuh, that seems like a lot!
It looks like it could be allocating once per iteration of the loop.
Let’s take out the print statement and see if it’s just the addition:
```python
# lol.py
for i in range(0, 100_000):
    a = i + 1
```

```bash
echo "Allocating number object $( ./python.exe lol.py \
  | rg -o ALLOCATING \
  | wc -l ) times"

Allocating number object 905 times
```

That’s almost 100k fewer allocations, so it does look like the printing was responsible, and CPython has some optimization in place to avoid allocations for addition.
Let’s look at the internal function to add two integers, long_add(a, b).
Just from its signature, the return type is PyLongObject*, which could mean that it allocates a new object every time you add two integers:
```c
static PyLongObject *
long_add(PyLongObject *a, PyLongObject *b)
{
    if (_PyLong_BothAreCompact(a, b)) {
        stwodigits z = medium_value(a) + medium_value(b);
        return _PyLong_FromSTwoDigits(z);
    }
    /* ... more code ... */
}
```

The _PyLong_BothAreCompact(...) function checks a bitfield on PyLongObject to see whether it’s “compact” (the integer value fits in a single 30-bit digit), which will always be the case for our script.
Let’s look at _PyLong_FromSTwoDigits:
```c
/* Create a new int object from a C word-sized int */
static inline PyLongObject *
_PyLong_FromSTwoDigits(stwodigits x)
{
    if (IS_SMALL_INT(x)) {
        return (PyLongObject*)get_small_int((sdigit)x);
    }
    assert(x != 0);
    /* check that it is <= 2^(32-1) */
    if (is_medium_int(x)) {
        return (PyLongObject*)_PyLong_FromMedium((sdigit)x);
    }
    return (PyLongObject*)_PyLong_FromLarge(x);
}
```

There are three separate code paths:
- when x is a “small int”: call get_small_int(x).
- when x is a “medium int”: call _PyLong_FromMedium(x).
- when x is a “large int”: call _PyLong_FromLarge(x).
The call to get_small_int() is catching my eye. Is it like the tagged pointer trick I mentioned earlier?
```c
static PyObject *
get_small_int(sdigit ival)
{
    assert(IS_SMALL_INT(ival));
    return (PyObject *)&_PyLong_SMALL_INTS[_PY_NSMALLNEGINTS + ival];
}
```

Nope, it seems there is a pre-allocated array of objects for integers in the range -5 to 1025[2]. This would account for 1025 iterations of our loop but not for the rest.
Let’s look at the medium case (meaning x is less than 2^32 - 1) and the _PyLong_FromMedium() function:
```c
static PyObject *
_PyLong_FromMedium(sdigit x)
{
    assert(!IS_SMALL_INT(x));
    assert(is_medium_int(x));
    PyLongObject *v = (PyLongObject *)_Py_FREELIST_POP(PyLongObject, ints);
    if (v == NULL) {
        v = PyObject_Malloc(sizeof(PyLongObject));
        if (v == NULL) {
            PyErr_NoMemory();
            return NULL;
        }
        _PyObject_Init((PyObject*)v, &PyLong_Type);
    }
    digit abs_x = x < 0 ? -x : x;
    _PyLong_SetSignAndDigitCount(v, x<0?-1:1, 1);
    v->long_value.ob_digit[0] = abs_x;
    return (PyObject*)v;
}
```

Okay, it isn’t using the long_alloc() function from earlier; instead it first tries to get a PyLongObject* from a freelist (using _Py_FREELIST_POP()), and only allocates the memory for it if that fails.
A “freelist” is a common technique used in memory allocation strategies to keep track of freed allocations so they can be reused later, avoiding allocating more memory if possible.
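As a rough illustration (this is a toy, not CPython’s _Py_FREELIST_* machinery), a freelist for fixed-size nodes can simply thread freed nodes into a singly linked list and pop from it before ever calling malloc() again:

```c
#include <stdlib.h>

/* Toy freelist: recycle fixed-size nodes instead of freeing and re-mallocing. */
typedef struct Node {
    struct Node *next;   /* reused as the freelist link while the node is "free" */
    int value;
} Node;

static Node *freelist = NULL;

static Node *node_alloc(void) {
    if (freelist != NULL) {        /* reuse a previously freed node */
        Node *n = freelist;
        freelist = n->next;
        return n;
    }
    return malloc(sizeof(Node));   /* fall back to the real allocator */
}

static void node_free(Node *n) {
    n->next = freelist;            /* push onto the freelist for later reuse */
    freelist = n;
}
```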
If we look at the long_dealloc() function, there’s a codepath specifically for pushing the allocation to the freelist and not calling free on it so it can be reused:
```c
static void
long_dealloc(PyObject *self)
{
    /* ... */
    if (PyLong_CheckExact(self) && _PyLong_IsCompact((PyLongObject *)self)) {
        _Py_FREELIST_FREE(ints, self, PyObject_Free);
        return;
    }
    Py_TYPE(self)->tp_free(self);
}
```

I deleted the printf() I added to long_alloc() earlier, and added two print statements to _PyLong_FromMedium() to print if it allocated or reused memory:
```bash
> ./python.exe lol.py | rg 'ALLOCATING|REUSING' | sort | uniq -c
    102 ALLOCATING
  99193 REUSING
```

Our script appears to actually be reusing most of the PyLongObject objects!
So where did the allocations come from when our script was print()’ing the integers? It turns out the implementation of print() allocates a scratch PyLongObject* for conversion purposes, even though this can be avoided in most cases[3].
It’s nice to see that simply adding two numbers in a loop does not allocate memory every time.

We just saw an optimization that reduces the frequency of allocations (by reusing them), but I also want to poke around and see if Python is doing anything to amortize the actual cost of an allocation.
Here’s the internal allocation function used for objects:
```c
static inline void*
pymalloc_alloc(OMState *state, void *Py_UNUSED(ctx), size_t nbytes)
{
    if (UNLIKELY(nbytes == 0)) {
        return NULL;
    }
    if (UNLIKELY(nbytes > SMALL_REQUEST_THRESHOLD)) {
        return NULL;
    }
    uint size = (uint)(nbytes - 1) >> ALIGNMENT_SHIFT;
    poolp pool = usedpools[size + size];
    pymem_block *bp;
    if (LIKELY(pool != pool->nextpool)) {
        /*
         * There is a used pool for this size class.
         * Pick up the head block of its free list.
         */
        ++pool->ref.count;
        bp = pool->freeblock;
        assert(bp != NULL);

        if (UNLIKELY((pool->freeblock = *(pymem_block **)bp) == NULL)) {
            // Reached the end of the free list, try to extend it.
            pymalloc_pool_extend(pool, size);
        }
    }
    else {
        /* There isn't a pool of the right size class immediately
         * available: use a free pool. */
        bp = allocate_from_new_pool(state, size);
    }
    return (void *)bp;
}
```

It looks like the memory allocator for objects is a pool allocator with multiple different pools based on the size of the object.
The pool allocator splits its backing buffer into fixed-size chunks and is typically implemented with a freelist. This comes with some pretty nice advantages.
First, because everything is a fixed size, allocating and freeing become very simple (and thus fast!), especially since the allocator doesn’t need to do any of the complicated bookkeeping malloc(...) needs to do to work for any allocation size.
There’s also less fragmentation (which arises from variable allocation sizes), meaning freed memory can be reused optimally since the pool doesn’t get fragmented.
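Here’s a rough sketch of what a pool allocator looks like; this is a simplified toy, not pymalloc itself (which layers size classes, arenas, and lazy pool setup on top of the same idea):

```c
#include <stdlib.h>

/* Toy pool allocator: one size class, blocks carved out of a single buffer. */
#define BLOCK_SIZE  32
#define POOL_BLOCKS 1024

typedef struct Block { struct Block *next; } Block;

typedef struct {
    unsigned char *buffer;   /* backing memory, allocated once */
    Block *freeblock;        /* head of the freelist of unused blocks */
} Pool;

static int pool_init(Pool *p) {
    p->buffer = malloc((size_t)BLOCK_SIZE * POOL_BLOCKS);
    if (!p->buffer) return -1;
    p->freeblock = NULL;
    /* Carve the buffer into fixed-size blocks and thread them onto the freelist. */
    for (int i = 0; i < POOL_BLOCKS; i++) {
        Block *b = (Block *)(p->buffer + (size_t)i * BLOCK_SIZE);
        b->next = p->freeblock;
        p->freeblock = b;
    }
    return 0;
}

static void *pool_alloc(Pool *p) {
    Block *b = p->freeblock;          /* O(1): pop the freelist head */
    if (b) p->freeblock = b->next;
    return b;
}

static void pool_free(Pool *p, void *ptr) {
    Block *b = ptr;                   /* O(1): push back onto the freelist */
    b->next = p->freeblock;
    p->freeblock = b;
}
```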
Similar to an arena, you pay most of the cost upfront by pre-allocating a backing buffer at initialization. In fact, in CPython’s case, the pool’s backing buffer itself is allocated by an arena whose capacity is 1 MiB or 256 KiB:
```c
#ifdef USE_LARGE_ARENAS
#define ARENA_BITS              20        /* 1 MiB */
#else
#define ARENA_BITS              18        /* 256 KiB */
#endif
#define ARENA_SIZE              (1 << ARENA_BITS)
#define ARENA_SIZE_MASK         (ARENA_SIZE - 1)
```

Also, CPython will use mmap() with MAP_ANONYMOUS, so it reserves 1 MiB of virtual memory upfront and the physical memory is only allocated on demand via page faults.
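As a rough illustration of that reserve-now, fault-in-later behavior (again, a POSIX-only toy, not CPython’s actual arena code), an anonymous mmap() reserves the address space up front while physical pages are only faulted in when first touched:

```c
#include <stdio.h>
#include <sys/mman.h>

#define ARENA_SIZE (1 << 20)   /* 1 MiB of virtual address space */

int main(void) {
    /* MAP_ANONYMOUS memory isn't backed by a file; the kernel hands out
     * zeroed pages lazily, on first write, via page faults. */
    unsigned char *arena = mmap(NULL, ARENA_SIZE,
                                PROT_READ | PROT_WRITE,
                                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (arena == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    arena[0] = 1;   /* first touch of this page triggers the page fault */

    munmap(arena, ARENA_SIZE);
    return 0;
}
```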
With all these tricks, it’s likely that our script (the one that prints and does seem to allocate memory in the loop) is reusing memory from the pool allocator more than it’s actually allocating new memory (at least for PyLongObject integers).
Despite these optimizations, there’s obviously still a lot of overhead in executing all this allocation code, when in theory all it should take the CPU to add two integers is a single ADD instruction.

In addition, the representation of PyLongObject is designed to handle both small and really big integers alike, meaning the code that handles the slow and unlikely path (using a really big integer in your Python script) pessimizes the fast path (using regularly sized integers).
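For comparison, here’s roughly what the fast path could look like without any object machinery; this isn’t CPython code, just a plain C addition that an optimizing compiler turns into essentially one add-style instruction plus a return:

```c
/* With optimizations enabled, a compiler emits roughly a single
 * add (or lea) instruction plus a ret for this function. */
long add(long a, long b) {
    return a + b;
}
```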
To be honest, I’m a little surprised there isn’t a fast-path optimization here. Even Smalltalk in the 80s used pointer tagging to avoid heap-allocating integers[4]!
But we do see a good example of how using specialized memory allocators lets us greatly reduce some of the costs of memory allocation.
This is why Zig is one of my favorite languages. The answer to the question of “I wonder if this is allocating?” is “does this function take an allocator?”[5]. The std.mem.Allocator interface lets you easily swap out allocators, and all of the standard library is designed to accept a std.mem.Allocator when it wants to allocate memory.