The future of Python web services looks GIL-free

2 weeks ago 1

Python 3.14 was released at the beginning of the month. This release was particularly interesting to me because of the improvements on the "free-threaded" variant of the interpreter.

Specifically, the two major changes when compared to the free-threaded variant of Python 3.13 are:

Free-threaded support now reached phase II, meaning it's no longer considered experimental
The implementation is now completed, meaning that the workarounds introduced in Python 3.13 to make code sound without the GIL are now gone, and the free-threaded implementation now uses the adaptive interpreter as the GIL enabled variant. These facts, plus additional optimizations make the performance penalty now way better, moving from a 35% penalty to a 5-10% difference.

Miguel Grinberg put together a nice article about Python 3.14 performance, with sections dedicated to the free-threaded variant compared to the GIL one. His results show a compelling improvement in performance for Python 3.14t compared to 3.13t, which is compelling!

While his benchmarks focus on CPU-bound work, calculating the Fibonacci sequence and using the bubble sort algorithm, a huge part of my experience with Python is centered on web development – the main OSS projects I maintain are a web framework and a web server for Python, after all – and thus I wanted to make a proper comparison of free-threaded and GIL Python interpreters on web applications: even if 99.9999% of web services out there are I/O bound – interacting with a database or making requests to other services – concurrency is a key factor here, and we spent decades doing weird stuff with the multiprocessing module to do more work in parallel. Is this the time where we can finally stop wasting gigabytes of memory just to serve more than 1 request at time?

Benchmarks are hard

Let's face it. We had hard times understanding benchmarks – especially when those benchmarks are around web technologies. The internet is full of discussions around them, with people arguing about every aspect of them: the methodology, the code, the environment. The most popular reactions to benchmarks are like "but why you didn't tested also X" or "my app using that library doesn't scale like this, you're lying". I hear already these kind of comments on this article.

But this is because we tend to generalise over benchmarks, and we really shouldn't. To my perspective, a good benchmarks is very self-contained, testing a very small thing out of the actual – and much wider – context. And why is that? Because a good benchmark should reduce noise as much as possible. I'm definitely not interested into the framework X is faster than Y war – also because those statements usually lack the on what part – nor I really care on having a very wide matrix of test cases.

I really just want to see if, taking one single ASGI web application and one single WSGI application, doing the same thing, we can spot differences between the standard 3.14 Python and its free-threaded variant, and make considerations based on those results. Please keep this in mind when looking at the numbers below.

The methodology

As mentioned before, the idea is to test the two main application protocols in Python – ASGI and WSGI – on Python 3.14 with the GIL enabled and disabled, keeping everything else fixed: the server, the code, the concurrency, the event loop.

Thus, I created an ASGI application using FastAPI and a WSGI application using Flask – why these frameworks? Just because they're the most popular – with two endpoints: a silly JSON response generator, and a fake I/O bound endpoint. Here is the code for the FastAPI version:

import asyncio from fastapi import FastAPI from fastapi.responses import PlainTextResponse, JSONResponse app = FastAPI() @app.get("/json") async def json_data(): return JSONResponse({"message": "Hello, world!"}) @app.get("/io") async def io_fake(): await asyncio.sleep(0.01) return PlainTextResponse(b"Hello, waited 10ms")

and here's the code for the Flask version:

import json import time import flask app = flask.Flask(__name__) app.config["JSONIFY_PRETTYPRINT_REGULAR"] = False @app.route("/json") def json_data(): return flask.jsonify(message="Hello, world!") @app.route("/io") def io_fake(): time.sleep(0.01) response = flask.make_response(b"Hello, waited 10ms") response.content_type = "text/plain" return response

As you can see, the fake I/O endpoint wait for 10ms, as the idea is to simulate something like waiting for the database to return a query result. Yes, I know, I'm ignoring all the serialization/deserialization part of communicating with the database here, and the JSON endpoint is not something you will have in an actual application, but – again – that's not the point here.

We then serve these applications using Granian and spawn a bunch of requests using rewrk. Why Granian? Well, first I maintain the project, but also – and more importantly – is the only server I'm aware of which uses threads in place of processes for running workers on free-threaded Python.

Everything is run on a single machine with the following specs:

Gentoo Linux 6.12.47
AMD Ryzen 7 5700X
CPython 3.14 and 3.14t installed through uv

ASGI benchmarks

We run the FastAPI application both with 1 worker and 2 workers, with a concurrency of 128 in the first case and 256 in the latter. Here are the Granian and rewrk commands:

granian --interface asgi --loop asyncio --workers {N} impl_fastapi:app
rewrk -d 30s -c {CONCURRENCY} --host http://127.0.0.1:8000/{ENDPOINT}

JSON endpoint

Python workers RPS Latency avg Latency max CPU RAM

3.14	1	30415	4.20ms	45.29ms	0.42	90MB
3.14t	1	24218	5.27ms	59.25ms	0.80	80MB
3.14	2	59219	4.32ms	70.71ms	1.47	147MB
3.14t	2	48446	5.28ms	68.17ms	1.73	90MB

As we can see from the numbers, the free-threaded implementation is ~20% slower, but with the advantage of reduced memory usage.

I/O endpoint

Python workers RPS Latency avg Latency max CPU RAM

3.14	1	11333	11.28ms	40.72ms	0.41	90MB
3.14t	1	11351	11.26ms	35.18ms	0.38	81MB
3.14	2	22775	11.22ms	114.82ms	0.69	148MB
3.14t	2	23473	10.89ms	60.29ms	1.10	91MB

Here the two implementation are very similar in terms of throughput, with the free-threaded one slightly better. Once again, the free-threaded implementation has the advantage of consuming less memory.

WSGI benchmarks

Running a WSGI application containing both a – very low – CPU bound endpoint and an I/O bound endpoint with the same configuration is much more complicated. Why? Because – on the GIL interpreter – while for CPU bound endpoints we want to avoid as much as possible GIL contention, and thus have as less threads as possible, for I/O bound workloads we want to have a decent amount of threads to do work while another request is waiting on I/O.

To clarify this point, let's see what happens to both endpoints on the GIL Python 3.14 when we use a single worker but different numbers of threads in Granian:

endpoint threads RPS Latency avg Latency max

JSON	1	19377	6.60ms	28.35ms
JSON	8	18704	6.76ms	25.82ms
JSON	32	18639	6.68ms	33.91ms
JSON	128	15547	8.17ms	3949.40ms
I/O	1	94	1263.59ms	1357.80ms
I/O	8	781	161.99ms	197.73ms
I/O	32	3115	40.82ms	120.61ms
I/O	128	11271	11.28ms	59.58ms

As you can see, the more threads we add, the more close we get to the expected result on the I/O endpoint, but, at the same time, the JSON endpoint gets more and more worse. When deploying WSGI applications, a lot of time is usually put into finding the balance between GIL contention and proper parallelism. This is also why for the last 20 years people discussed a lot about the numbers of threads to use with WSGI applications – with several instances where we actually use completely empirical values like 2*CPU+1 – and also why gevent was a thing before asyncio.

On the free-threaded side of things, we can stop worry about this, as each thread can actually run code in parallel without waiting for the GIL, but we have a different thing to consider: should we increase the workers or the threads? At the end of the day workers are also threads, right? Let's experiment a bit with the JSON endpoint:

workers threads RPS Latency avg Latency max

1	2	28898	4.42ms	86.96ms
2	1	28424	4.49ms	75.80ms
1	4	54669	2.33ms	112.06ms
4	1	53532	2.38ms	121.91ms
2	2	55426	2.30ms	124.16ms

It seems that increasing workers add some overhead – which makes sense – and the sweet spot is balancing the two depending on the workload. Given we still need a high thread count to support the I/O endpoint – in the same way Granian cannot understand when the application is waiting for I/O in the GIL implementation, it can't also in the free-threaded implementation – this doesn't really matter, but it was fun to observe.

Given all of the above, my approach is to run the Flask application both with a single worker and two of them, while keeping the amount of threads (per worker) fixed to 64. As for the ASGI benchmark, we use a concurrency of 128 in the first case and 256 in the second. Here are the commands:

granian --interface wsgi --workers {N} --blocking-threads 64 impl_flask:app
rewrk -d 30s -c {CONCURRENCY} --host http://127.0.0.1:8000/{ENDPOINT}

JSON endpoint

Python workers RPS Latency avg Latency max CPU RAM

3.14	1	18773	6.11ms	27446.19ms	0.53	101MB
3.14t	1	70626	1.81ms	311.76ms	6.50	356MB
3.14	2	36173	5.73ms	27692.21ms	1.31	188MB
3.14t	2	60138	4.25ms	294.55ms	6.56	413MB

For CPU-bound workloads, we cleary see the advantage of the free-threaded version in terms of throughput: it shouldn't surprise us given it can utilize a lot more cpu. The memory usage is way higher on the free-threaded version though; it's not clear from this benchmarks if it's just because of the higher concurrency, or the Python garbage collector operates less efficiently in this context.

I/O endpoint

Python workers RPS Latency avg Latency max CPU RAM

3.14	1	6282	20.34ms	62.28ms	0.40	105MB
3.14t	1	6244	20.47ms	164.59ms	0.42	216MB
3.14	2	12566	20.33ms	88.34ms	0.65	180MB
3.14t	2	12444	20.55ms	124.06ms	1.18	286MB

For I/O bound workloads, the two implementations are very similar. Once again, the memory usage on the free-threaded implementation is way higher than its counter-part.

Final thoughts

While pure-Python code execution appears to be up to 20% slower on free-threaded Python 3.14, we can spot several advantages on the free-threaded side of things.

On asynchronous protocols like ASGI, despite the fact the concurrency model doesn't change that much – we shift from one event loop per process, to one event loop per thread – just the fact we no longer need to scale memory allocations just to use more CPU is a massive improvement. Even considering memory is cheap compared to CPUs, with modern hardware this can make a huge difference in cost both on large deployments and projects running on a single VM. You can now allocate more stuff on a single machine and scale once the CPU is overwhelmed, instead of worrying about how much RAM you need.

On the throughput difference, it might be also worth noting that all the above benchmarks used the stdlib asyncio event loop implementation, and projects like uvloop or rloop might play a role in improving the throughput and latency down the road. But also: the latency in the above benchmarks is actually better for I/O bound workloads on the free-threaded implementation, and given, to quote DHH, we're all CRUD monkeys, and thus the vast majority of time our applications are just waiting for the database, it's also possible the free-threaded Python implementation is already better to use on ASGI applications, today.

On synchronous protocol like WSGI, you might have mixed feelings due to the memory usage, but it's absolutely possible at this stage that Granian just need some changes to improve the garbage collection on the Python side of things. If that's the case, WSGI is now fun again, and we can stop worrying about balancing threads, stop monkeypatching our applications to use gevent, stop planning that let's migrate to asyncio rewrite and just rely on the fact we don't need to worry anymore about blocking operations.

For people like me, managing thousands of ASGI and WSGI Python containers on a vast infrastructure – yes, I work at Sentry, in case you missed – free-threaded Python has the potential to be a huge quality of life improvement. But also for everybody out there coding a web application in Python: simplifying the concurrency paradigms and the deployment process of such applications is a good thing.

I'm pretty sure the whole gilectomy thing wasn't planned with web applications in mind, and it might also take a while to get there, but to me the future of Python web services looks GIL-free.

Read Entire Article

The future of Python web services looks GIL-free

Benchmarks are hard

The methodology

ASGI benchmarks

JSON endpoint

I/O endpoint

WSGI benchmarks

JSON endpoint

I/O endpoint

Final thoughts

Related

Hedy Lamarr: The Unseen Genius Behind Wi-Fi and Bluetooth

Doo: Programming Language Ideas and Suggestions?

Genetically Engineered Babies Are Banned. Tech Titans Are Tr...