# Should we apply old-school multi-core scheduling to GPUs?
by [@bwasti](https://x.com/bwasti)
---
Recently, the big usability and efficiency wins in ML have largely been
from researchers adopting time-tested engineering concepts.
- [PyTorch](https://arxiv.org/abs/1912.01703) brought huge usability improvements to the space
by leveraging the efficiency of interpreted execution (Python) and
modern dynamic memory allocation.
- Loop-fusion, a concept from the 70's, became highly popular
with the widespread adoption of compilers like [XLA](https://arxiv.org/abs/2301.13062) (in JAX)
and [Triton](https://www.eecs.harvard.edu/~htk/publication/2019-mapl-tillet-kung-cox.pdf).
- [Paged-attention](https://arxiv.org/abs/2309.06180) applied the idea of memory paging
to the space of KV-caches and is now present in the most popular
and efficient inference engines like vLLM.
What all of these systems have in common is that they
split out a conceptual overhead: frontend users
don't need to consider deep technical details, while backend engineers
are still able to (without getting in the way).
So, what else should we conceptually split up?
## Scheduling sucks
I think the most glaring problem right now
is the *tight coupling between training code and distributed scheduling*.
A common approach to handle this is to slap on
a piece of code that mutates the underlying modules.
[FSDP](https://pytorch.org/blog/introducing-pytorch-fully-sharded-data-parallel-api/) is a good example of this: it hacks into the modules
to shard whatever weights it can. If your code and weights are normal,
this is a perfectly good solution.
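For a sense of how hands-off that flow is, here's a minimal sketch of the
wrap-and-go pattern, assuming a torchrun-style launch and ordinary
nn.Modules (sizes are made up):
```
# Minimal sketch: wrapping a plain model with FSDP.
# Assumes the process is launched via torchrun so the usual
# rank/world-size environment variables are already set.
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.ReLU(),
    nn.Linear(4096, 1024),
).cuda()

# FSDP wraps (and mutates) the modules so each rank only holds a shard
# of the parameters, gradients, and optimizer state.
model = FSDP(model)
```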

Another approach is to just use modules that handle the parallelism natively.
Megatron defines things like [ColumnParallelLinear](https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/core/tensor_parallel/layers.py#L743)
that, although it looks and feels like a Linear module,
is actually a super optimized distributed version that can scale to many GPUs.
If you know what Column means and why you'd use that over the Row version,
this is a nice API that's quick to get running with.

([*source*](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/tensor_parallelism_overview.html))
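To make the Column/Row distinction concrete, here's a rough sketch of the
column-parallel idea, with the "GPUs" faked as weight chunks on a single
device (this is the concept, not Megatron's actual implementation):
```
# Column parallelism, sketched: split the weight along the output
# dimension, compute a partial output per "rank", and reassemble.
import torch

batch, d_in, d_out, world_size = 8, 64, 128, 4
x = torch.randn(batch, d_in)
w = torch.randn(d_out, d_in)            # full Linear weight

w_shards = w.chunk(world_size, dim=0)   # each rank owns a slice of the output features
partial_outs = [x @ w_i.t() for w_i in w_shards]

# A real implementation all-gathers these across GPUs; here we just cat.
y_parallel = torch.cat(partial_outs, dim=1)
y_reference = x @ w.t()
assert torch.allclose(y_parallel, y_reference, atol=1e-5)
```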
## Scale
These are compatible ideas and, as a result, both are heavily used.
But things start to break down when you start *really^(tm)* scaling.
The first step to *really^(tm)* scaling is to get a ton of compute per node.
That means amping up the batch size to insane levels.
GPUs have a ton of RAM these days, but if you want to *really^(tm)* scale
it's simply not enough because weights and optimizer states are massive.
The trick is to split the model's sequential elements (layers)
onto different GPUs. That way you free up a ton of RAM and
can batch up a lot more computation.
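In its most naive form (plain PyTorch, assuming two visible GPUs, made-up
sizes), the split looks something like this:
```
# Naive layer splitting across two GPUs, no pipelining: while stage 1
# runs, the GPU holding stage 0 sits idle, and vice versa.
import torch
import torch.nn as nn

stage0 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
stage1 = nn.Sequential(nn.Linear(4096, 1024)).to("cuda:1")

x = torch.randn(32, 1024, device="cuda:0")
h = stage0(x)                 # GPU 0 works, GPU 1 idles
y = stage1(h.to("cuda:1"))    # GPU 1 works, GPU 0 idles
```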
And while this works well in theory, it's super fucking annoying.
This is what happens:

And the obvious fix (moar batches!) still has issues:

([*source*](https://alband.github.io/doc_view/pipeline.html))
So what do people do these days?
Well the trick is to get really smart people to
think really hard and derive stuff like this:

([*source*](https://github.com/deepseek-ai/DualPipe))
Which basically has zero "bubbles" of idleness.
Now, I am certainly impressed with this Rubik's cube of a solution;
I don't think I'd personally be able to derive such a thing.
But wait, didn't our industry tackle problems like this decades ago?
## Ancient History
Multi-core machines have been around for a while,
since at least the 90's with the development of the [Stanford Hydra CMP](https://arsenalfc.stanford.edu/kunle/publications/hydra_MICRO00.pdf).
That means engineers have been scheduling workloads on these
types of architectures for at least three decades.
There are basically two multi-core scheduler paradigms:
either one task queue or multiple.
In 1998, [Cilk-5](https://dl.acm.org/doi/10.1145/277650.277725) was written and introduced the canonical "work-stealing"
implementation
(which had been around before, but not as elegantly).
This handles the multiple-task-queue case, where chips effectively
do their own thing until they're idle. When they're idle
they check on their neighbors to take some additional work.
This scales quite well but often creates a tension between
when to "peek" and how much work to take on. [(more discussion here)](https://pages.cs.wisc.edu/~remzi/OSTEP/cpu-sched-multi.pdf)
Single-queue systems are vastly simpler
but take on the burden of locking mechanisms, which slows them down.
Despite that, they have their place and you can find them
in one of the Linux kernel's schedulers (BFS, [citation](https://research.cs.wisc.edu/adsl/Publications/meehean-thesis11.pdf)).
## GPUs are not CPUs
Of course we can't just jump in and apply all these concepts
to GPUs, because they don't really work like that.
There are, as far as I can really tell, three key differences
that impact the application of CPU scheduler algorithms.
- GPU workloads stretch over multiple nodes with varying comm latencies.
- GPUs don't have a shared memory system.
- GPUs don't have interrupts.
Stretching over multiple nodes makes
this whole thing super annoying to deal with.
Luckily, Nvidia picked up on this a while ago and their new B series GPUs
can be interconnected as if they're on the same host,
to a crazy degree: 72 at once!
Older H series GPUs are stuck at 8 per node, with high latencies to
GPUs on other nodes.
Let's just consider the new stuff and scratch out that first issue.
- ~GPU workloads stretch over multiple nodes with varying comm latencies.~
- GPUs don't have a shared memory system.
- GPUs don't have interrupts.
Having a shared memory system is nice because it makes work-stealing
and migration a lot easier: every task has the same access latency
and switching things up is basically free.
Just kidding. Even before modern multi-core CPUs added NUMA
(non-uniform memory access), the L1 and L2 caches were thrashed
on workload changes, so scheduling algorithms typically
had very advanced affinity logic.
With the new stuff
(hitting some parts of memory is more expensive for certain cores),
the logic is even more robust.
With GPUs we have decent RDMA in the interconnects,
so instead of paging to a shared memory space with implicit hierarchical
latencies, we just have to inject operations to move things around.
Basically the same idea though: *switching tasks may require memory that
is not nearby and our scheduler should know that.*
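A hypothetical sketch of what "knowing that" could look like: track where
each tensor lives and charge a transfer cost when an op's inputs would have
to move (all names and constants here are made up):
```
# Hypothetical placement logic: pick a GPU for an op by penalizing
# inputs that would have to be moved over the interconnect first.
TRANSFER_COST_PER_BYTE = 1e-9   # made-up constant for illustration

tensor_location = {}            # tensor_id -> gpu_id
tensor_size = {}                # tensor_id -> bytes

def placement_cost(op_inputs, gpu_id):
    cost = 0.0
    for t in op_inputs:
        if tensor_location.get(t) != gpu_id:
            cost += tensor_size.get(t, 0) * TRANSFER_COST_PER_BYTE
    return cost

def choose_gpu(op_inputs, gpus):
    # Prefer the GPU that already holds the most input bytes; explicit
    # copy operations (e.g. RDMA moves) get injected for the rest.
    return min(gpus, key=lambda g: placement_cost(op_inputs, g))
```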
- ~GPU workloads stretch over multiple nodes with varying comm latencies.~
- ~GPUs don't have a shared memory system.~
- GPUs don't have interrupts.
This one is *particularly* obnoxious to deal with.
In CPU land you can pause execution and switch to different tasks.
This is a fundamental requirement for listening to Spotify while
reading blogs in your browser.
You can't just stop a GPU and even worse you can't really wait until
it's done executing (without injecting a stupid amount of latency).
That's because GPUs are asynchronous as hell and you're supposed
to load them up with a stream of kernels to keep them occupied.
The lag here can be on the order of 10 to 50 microseconds.
A lot of full operations run in that time (or less) so taking a hit
like that can be disastrous.
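Here's what that asynchrony looks like in plain PyTorch (real APIs, made-up
sizes, assuming a CUDA GPU is available):
```
# Kernels are enqueued asynchronously and the calls return immediately;
# the host only learns they finished by synchronizing, which stalls it.
import torch

a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")

for _ in range(100):
    c = a @ b             # enqueues a kernel, returns right away

torch.cuda.synchronize()  # host blocks here until the stream drains
```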
So what's the fix? I don't really know. My thinking right now
is that operations are pretty fast, so we can bundle them up (somewhat
arbitrarily) and it won't be too bad. If we want to get fancy with it,
we can leverage the fact that ML workloads
are obscenely repetitive and create simple cost models to predict how long
operations will take to run.
If we want to get dumber with it: expose the bundling to the user
so they can tell us what "forward_first_chunk" and "backward_last_chunk" are.
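As a sketch of that cost-model idea (everything below is hypothetical):
since training steps repeat the same (op, shape) pairs over and over, a
running average of observed runtimes is already a decent predictor for
bundling decisions.
```
# Hypothetical cost model: keep an exponential moving average of how
# long each (op, shapes) pair took, and predict from it next time.
class CostModel:
    def __init__(self, alpha=0.1):
        self.alpha = alpha
        self.estimates = {}   # (op_name, shapes) -> seconds

    def observe(self, op_name, shapes, seconds):
        key = (op_name, tuple(shapes))
        prev = self.estimates.get(key, seconds)
        self.estimates[key] = (1 - self.alpha) * prev + self.alpha * seconds

    def predict(self, op_name, shapes, default=1e-4):
        return self.estimates.get((op_name, tuple(shapes)), default)
```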
## [GT](https://github.com/bwasti/gt) - an experimental bit of code
To play with the ideas above I wrote/vibe-coded a [PyTorch wrapper](https://github.com/bwasti/gt)
that uses the single-queue technique to schedule
really simple code onto many GPUs.
It is currently *very* dumb, but I think the
design is in a solid state and won't change.
The key (and somewhat lame) bit is that I had to re-implement
autograd in the "client." PyTorch is leveraged at the worker
and if you leave the autograd that low-leve you can't
schedule backward passes in a clever way.
Everything else basically just passes user "instructions"
as a stream to a dispatcher that then rewrites the instructions
to be distributed PyTorch operations.
One could imagine disabling the dispatcher entirely and still
getting the same results, albeit slower.
This frees up the client to be *pure* math,
with little need to consider things like FSDP or ColumnParallelLinear.
Dispatch logic can then convert this pure math
into reasonable distributed execution by implementing algorithms
similar to the ones mentioned above.
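To give a flavor of that pattern (this is a toy, not GT's actual internals;
all names are hypothetical): the client streams opaque instructions, and the
dispatcher decides, per instruction, which workers run it.
```
# Toy client -> dispatcher flow: pure-math instructions go in, and the
# dispatcher rewrites each one into per-worker operations.
from dataclasses import dataclass

@dataclass
class Instruction:
    op: str                   # e.g. "matmul"
    inputs: list              # input tensor ids
    output: str               # output tensor id
    tag: str = ""             # optional signal, e.g. "forward_layer1"

def dispatch(instr, config):
    rule = config.get(instr.tag)
    if rule is None:
        # No config: run everything on worker 0 -- same results, slower.
        return [(0, instr)]
    # Shard the instruction across the configured workers.
    return [(w, instr) for w in rule["shard"]["workers"]]

stream = [Instruction("matmul", ["x", "w"], "y", tag="forward_layer1")]
config = {"forward_layer1": {"shard": {"axis": 0, "workers": [0, 1, 2, 3]}}}
for instr in stream:
    print(dispatch(instr, config))
```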
But... maybe we need an escape hatch?
This stuff is simply not super developed, and we may want to be
much more explicit about where stuff lives.
I didn't want to destroy the "pure math" aspect of the client
so I added this optional "signaling" feature:
```
import gt

w = gt.randn(64, 32)  # stand-in for a model weight defined elsewhere
with gt.signal.context('forward_layer1'):
    x = gt.randn(100, 64)
    y = x @ w
```
which does *nothing*. Unless you load a config into the dispatcher:
```
forward_layer1:
  shard:
    axis: 0                    # Shard batch dimension
    workers: [0, 1, 2, 3]      # Use workers 0-3
  backward: backward_layer1    # Different config for gradients

backward_layer1:
  shard:
    axis: 0
    workers: [0, 1, 2, 3]
```
This makes life way easier yet still thoroughly decouples the
code from the "schedule."
If you want to play with this (or contribute!!! 😁😁),
check out the largely AI-generated docs: https://bwasti.github.io/gt/.
I tried to set things up in a way that makes AI contribution pretty easy.
The main trick was creating an "instruction tape"
that allows for easy algorithm debugging.
For humans there's something that kind of looks like `top`:
```
python -m gt.scripts.top
```

which will show which GPUs are running which operations.
## Future
I'm an inference infrastructure engineer by trade, yet I recently led
the pre-training of a small/medium-sized production model.
This leaves me in a bit of a confusing spot professionally, but
I'll keep playing with the single-queue scheduler idea in my spare time.
Thanks for reading!