Context
A recent MICRO 2024 article titled “Over-synchronization in GPU Programs” describes how eliminating redundant synchronization, or replacing coarser-scoped (slower) synchronization with finer-scoped alternatives, can lead to significant performance gains in GPU programs, and introduces a tool called ScopeAdvice to find such optimization opportunities. This work spurred a discussion across several research groups that we thought would interest a wider community, so we came together to write this blog post.
We begin with an overview of research on GPU memory models, discussing work in both testing and formalization, and highlight accessible tools that help demystify these complex systems. We then examine the MICRO paper as a case study, using these tools to rigorously evaluate its proposed optimizations. We validate all but one, and for the remaining case, we identify potential issues that point to open challenges and opportunities for future work.
Introduction
Reducing synchronization overhead is a long-standing goal in high-performance computing, especially on GPUs, whose ever-increasing scale of deployment makes even small savings impactful. But reducing synchronization is risky: it is easy to introduce subtle bugs that manifest extremely rarely. The very scale that makes GPUs an attractive optimization platform also amplifies this risk; a bug that appears once in a billion executions can become a regular occurrence when code runs across vast numbers of GPUs.
Avoiding these bugs requires a deep understanding of the memory model: the rules governing how memory operations become visible across threads, and how fences or barriers enforce ordering. Unfortunately, these models are notoriously complex, even being referred to as the “rocket science of computer science”.
Progress in this area has come from a feedback loop between testing real systems and refining formal specifications. This has led to tools that make memory models more approachable. Broadly, these fall into two categories: testing tools that observe actual GPU behavior, and formal tools that model it. This post surveys both approaches and shows how they can be used together to evaluate tools like ScopeAdvice. But first, some background.
Background: GPUs and Memory Models
CUDA, introduced by NVIDIA in 2007, enabled general-purpose programming on GPUs and marked a major shift in GPU computing. Today, many high-level GPU languages exist (e.g., AMD’s HIP, Apple’s Metal, Microsoft’s HLSL), but since ScopeAdvice targets CUDA, we use CUDA terminology here.
Threads are organized into a hierarchy of scopes mapped to hardware units: the grid (all threads launched across the GPU) is divided into threadblocks (threads sharing a streaming multiprocessor, or SM, and its fast scratchpad memory), which are further divided into warps (threads executing in SIMD style). Each level supports different synchronization mechanisms, governed by CUDA’s memory model.
The CUDA memory model provides scoped atomic types, which define the set of threads allowed concurrent access, e.g., device-scoped for all threads in the grid, block-scoped for threads in the same threadblock. The memory model is relaxed, allowing memory reordering unless constrained by stronger memory-order qualifiers or scoped fences. These fences vary in performance: block-scoped fences can be up to 21x faster than device-scoped ones (measured with scripts here), creating optimization opportunities by minimizing unnecessary synchronization.
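For readers who want to reproduce this kind of measurement, the sketch below (our own illustration, not the linked measurement scripts) times kernels that issue many fences of each scope using CUDA events; the grid size and iteration count are arbitrary choices.

#include <cstdio>

__global__ void device_fence_kernel(int iters) {
  for (int i = 0; i < iters; ++i)
    __threadfence();        // device-scoped fence
}

__global__ void block_fence_kernel(int iters) {
  for (int i = 0; i < iters; ++i)
    __threadfence_block();  // block-scoped fence
}

int main() {
  const int iters = 1 << 16;
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  float ms_device = 0.f, ms_block = 0.f;

  cudaEventRecord(start);
  device_fence_kernel<<<128, 256>>>(iters);
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  cudaEventElapsedTime(&ms_device, start, stop);

  cudaEventRecord(start);
  block_fence_kernel<<<128, 256>>>(iters);
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  cudaEventElapsedTime(&ms_block, start, stop);

  printf("device-scoped: %.3f ms, block-scoped: %.3f ms\n", ms_device, ms_block);
  return 0;
}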
Thread 0                                 | Thread 1                                      |
A: data.store(1, memory_order_relaxed);  | D: uint r0 = flag.load(memory_order_relaxed); |
B: __threadfence(); // __block()         | E: __threadfence(); // __block()              |
C: flag.store(1, memory_order_relaxed);  | F: uint r1 = data.load(memory_order_relaxed); |
Figure 1. An example of a message passing pattern used to synchronize access to shared variables by multiple threads. Both data and flag are device-scoped atomic variables.
Figure 1 illustrates a message-passing pattern: thread 0 writes data = 1 and then sets flag = 1, while thread 1 loads flag and then data. Without fences, CUDA’s relaxed memory model allows reordering, making it possible for thread 1 to see flag == 1 but data == 0. A __threadfence() between the operations in both threads enforces the needed ordering: if thread 1 sees flag == 1, it must also see data == 1. Importantly, if both threads are in the same threadblock, a cheaper __threadfence_block() would suffice.
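To make the pattern concrete, here is a minimal sketch of Fig. 1 as a complete CUDA program, using libcu++’s cuda::atomic_ref for the device-scoped atomics. The two-block launch, variable names, and host-side check are our illustrative assumptions rather than code from any of the cited papers.

#include <cstdio>
#include <cuda/atomic>

__global__ void message_passing(int* data_mem, int* flag_mem, int* weak) {
  // Device-scoped atomic views over plain global memory, mirroring Fig. 1.
  cuda::atomic_ref<int, cuda::thread_scope_device> data(*data_mem);
  cuda::atomic_ref<int, cuda::thread_scope_device> flag(*flag_mem);

  if (blockIdx.x == 0) {                                  // "thread 0": producer
    data.store(1, cuda::std::memory_order_relaxed);       // A
    __threadfence();                                      // B: device-scoped fence
    flag.store(1, cuda::std::memory_order_relaxed);       // C
  } else {                                                // "thread 1": consumer
    int r0 = flag.load(cuda::std::memory_order_relaxed);  // D
    __threadfence();                                      // E
    int r1 = data.load(cuda::std::memory_order_relaxed);  // F
    if (r0 == 1 && r1 == 0) *weak = 1;                    // the outcome the fences forbid
  }
}

int main() {
  int *mem, *weak;
  cudaMallocManaged(&mem, 2 * sizeof(int));
  cudaMallocManaged(&weak, sizeof(int));
  mem[0] = mem[1] = *weak = 0;
  // One producer block and one consumer block, one thread each. A single run rarely
  // exposes interesting interleavings; real testing repeats this many times under stress.
  message_passing<<<2, 1>>>(&mem[0], &mem[1], weak);
  cudaDeviceSynchronize();
  printf("weak outcome observed: %d\n", *weak);
  cudaFree(mem);
  cudaFree(weak);
  return 0;
}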
Testing GPU Memory Models
In the early days of general-purpose GPU programming, memory ordering was poorly understood and under-documented. As developers ported more complex parallel programs to GPUs, the lack of a clear consistency model became increasingly problematic. Without official specifications, researchers turned to empirical testing to uncover actual hardware behavior.
A 2015 paper adapted CPU memory-model testing techniques to GPUs, analyzing a range of NVIDIA devices and highlighting two observations: (1) there was widespread misunderstanding about GPU memory orderings, and even some textbook algorithms were incorrect on real hardware (e.g., see here); and (2) actual memory reorderings on GPUs were extremely rare, often not observable even after prolonged testing. To reveal weak behaviors, researchers used stressing threads that generated chaotic memory activity during the tests, something not required in classic CPU testing approaches.
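The sketch below illustrates the stressing idea: two designated threads run the message-passing test while every other thread generates unrelated memory traffic. The thread roles, buffer size, and access pattern are our illustrative assumptions, not the cited paper’s actual methodology.

__global__ void mp_under_stress(volatile int* data, volatile int* flag,
                                int* r0, int* r1,
                                int* scratch, int scratch_len) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid == 0) {                    // producer
    *data = 1;
    __threadfence();
    *flag = 1;
  } else if (tid == blockDim.x) {    // consumer, placed in a different threadblock
    *r0 = *flag;
    __threadfence();
    *r1 = *data;
  } else {                           // stressing threads: chaotic memory traffic
    for (int i = 0; i < 2048; ++i) {
      int idx = (tid * 97 + i * 31) % scratch_len;  // scattered read-modify-writes
      atomicAdd(&scratch[idx], 1);
    }
  }
}

A weak behavior is flagged when r0 == 1 and r1 == 0 after the kernel returns; in practice such tests are repeated thousands of times while varying the stress parameters.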
Since then, GPU memory model testing tools have evolved, culminating in GPUHarbor, a browser-based platform using WebGPU to run memory tests across GPUs from all major vendors. With it, users can explore memory behaviors on their own hardware, including re-running the message-passing test from Fig. 1. This research has led to bug discoveries in multiple GPU implementations and informed official test suites for Vulkan and WebGPU.
Specifying GPU Memory Models
Memory models were historically described using natural language and litmus tests (e.g., the program in Fig. 1) that outlined which behaviors should be allowed or forbidden. In the late 2000s, formal CPU models began to emerge. These models proved useful for identifying bugs and inconsistencies in specifications, particularly when combined with testing methodologies that validate them against real hardware. Building on this success, the trend was later extended to GPUs, with the Khronos Group and NVIDIA releasing formal models for the Vulkan API and PTX (the low-level ISA used by CUDA) in 2018 and 2019, respectively.
Alongside the formal models, tools started to materialize. It is now possible to automatically answer questions such as “Is there any reordering leading to a bad state?” or “Does my code have data races?” with formal tools like herd7 and the Alloy model checker. Other formal tools, like Dartagnan, scale to complex pieces of code and have been used to verify concurrency libraries against several CPU memory models; support for GPU memory models was recently added. There are ongoing efforts to extend Dartagnan to other programming languages via the SPIR-V portable intermediate representation, which would enable reasoning about parallel GPU algorithms found in libraries such as NVIDIA’s CUB.
Case Study: ScopeAdvice
The MICRO’24 article identified three variants of over-synchronization in CUDA programs and proposed optimizations to eliminate them. For the first (Variant 1), the article proposed that a block-scoped fence (e.g., at E in Fig. 1) is sufficient even when the communicating threads are in different threadblocks, provided the shared variables are declared “volatile” or are accessed with device-scoped read-modify-writes. Since accesses to such variables bypass the (private) L1 cache and fetch data directly from the L2 cache (shared across threadblocks), the article argued that a fence of any scope (e.g., block) provides sufficient ordering. The article also demonstrated that redundant fences adjacent to barriers (Variant 2) and scope-agnostic lock/unlock routines (Variant 3) can leave significant performance (up to 54%) on the table.
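The sketch below shows our reading of the Variant 1 transformation on the producer side of Fig. 1; it is an illustration, not code from the ScopeAdvice paper.

// Because volatile accesses bypass the private L1 and are served by the shared L2,
// the article argues the device-scoped fence can be downgraded to a block-scoped
// one even when the consumer is in another threadblock.
__global__ void producer(volatile int* data, volatile int* flag) {
  *data = 1;
  __threadfence();           // original: device-scoped fence
  // __threadfence_block();  // Variant 1: proposed cheaper replacement
  *flag = 1;
}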
It was later discovered that the Variant 1 optimization is unsound under NVIDIA’s PTX memory model. A block-scoped fence only guarantees the order in which the issuing thread’s memory accesses become visible to threads within its own threadblock. Since the threads of a threadblock run on the same SM, a block-scoped fence may not guarantee ordering among memory accesses beyond the SM (e.g., those that reach the L2 cache). Thus, this optimization can cause unexpected outcomes in programs. Note that the optimizations for Variants 2 and 3 are sound under the PTX memory model.
Interestingly, the Variant 1 optimization does not cause unexpected outcomes in practice. Variant 1 typically appears in the presence of loops, which create control dependencies between instructions (e.g., between D and F in Fig. 1). Although the optimization remains unsound (PTX does not promise to preserve dependencies), we did not observe unexpected outcomes on current compilers and hardware under rigorous testing environments. This is due to microarchitecture-specific handling of dependencies in the presence of loops.
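The sketch below shows the consumer-side shape we are describing, with a spin loop on flag creating the control dependency; again, this is our illustration, not code from the paper.

__global__ void consumer(volatile int* data, volatile int* flag, int* out) {
  while (*flag == 0) { /* spin: each iteration re-loads flag */ }
  __threadfence_block();  // Variant 1: block-scoped fence across threadblocks
  *out = *data;           // PTX does not promise the dependency orders this load,
                          // yet no weak outcome was observed on current hardware
}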
These observations raise three pertinent questions: (1) how can we find such optimization opportunities? (2) how can we verify whether these optimizations are sound? and (3) how can we test whether these optimizations cause unexpected outcomes in practice?
While tools like ScopeAdvice can address the first question, tools like Dartagnan and GPUHarbor can address the last two. Dartagnan reported the Variant 1 optimization as unsound, even though GPUHarbor did not observe unexpected outcomes in the presence of a control dependency, i.e., a loop (as noted in the MICRO’24 article). However, removing the loop led GPUHarbor to observe unexpected outcomes, suggesting that the control dependency may provide the ordering required for the optimization.
These observations suggest that formal memory models may disallow performance optimization opportunities. Although dependencies have been challenging to formalize, this case study shows that they are worth revisiting, given the importance of efficient GPU programs. We also believe it is imperative to create a holistic set of tools that can find performance optimization opportunities, consult formal memory models, and explore the applicability of optimizations in practice.
Thanks to NVIDIA for their contributions to this article.
About the Authors:
A large group of researchers contributed to this post:
- The authors of the MICRO paper: Ajay Nayak (IISc Bangalore) and Arkaprava Basu (IISc Bangalore)
- Researchers in formally specifying (GPU) memory models: Hernan Ponce de Leon (Huawei), Natalia Gavrilenko (Huawei), Keijo Heljanko (University of Helsinki), Haining Tong (University of Helsinki)
- Researchers in empirically testing (GPU) memory models: Tyler Sorensen (Microsoft Research and UC Santa Cruz) and Reese Levine (UC Santa Cruz)
Disclaimer: These posts are written by individual contributors to share their thoughts on the Computer Architecture Today blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGARCH or its parent organization, ACM.