Show HN: Cuq – Formal Verification of Rust GPU Kernels

3 hours ago 1

Rust's rise as a systems language has extended into GPU programming through projects like Rust-CUDA and rust-gpu, which compile Rust kernels to NVIDIA's PTX or SPIR-V backends. Yet despite Rust's strong safety guarantees, there is currently no formal semantics for Rust's GPU subset, nor any verified mapping from Rust's compiler IR to PTX's formally defined execution model.

This project introduces the first framework for formally verifying the semantics of Rust GPU kernels by translating Rust's Mid-level Intermediate Representation (MIR) into Coq and connecting it to the existing Coq formalization of the PTX memory model (Lustig et al., ASPLOS 2019). Rather than modeling Rust's ownership and borrowing rules directly, this work focuses on defining a mechanized operational semantics for a realistic subset of MIR and establishing memory-model soundness: proving that MIR atomic and synchronization operations compile soundly to PTX instructions under the PTX memory model.

Cuq = CUDA + Coq.

No formal semantics for Rust GPU code: Although Rust compilers can emit GPU code via NVVM or SPIR-V, the semantics of such kernels are defined only informally through the compiler's behavior. There is no mechanized model of MIR execution for GPU targets.
Disconnect between high-level Rust and verified GPU models: NVIDIA's PTX memory model has a complete Coq specification, but that model has never been linked to a high-level language. Existing proofs connect only C++ atomics to PTX atomics.
MIR as a verification sweet spot: MIR is a well-typed SSA IR that preserves Rust's structured control flow and side-effect information while stripping away syntax. It provides a precise, implementation-independent level at which to define semantics and translate to Coq.

Define a mechanized semantics for MIR: Implement a Coq formalization of a simplified MIR subset sufficient to express GPU kernels: variable assignment, arithmetic, control flow, memory loads/stores, and synchronization intrinsics.
Translate MIR to Coq: Develop a translation tool that consumes rustc's -Z dump-mir output and produces corresponding Gallina definitions. The translation captures MIR basic blocks, terminators, and memory actions as Coq terms.
Connect to PTX semantics: Use the existing Coq formalization of PTX to define a memory-model correspondence between MIR and PTX traces. The initial goal is to prove soundness in the same sense as Lustig et al. (ASPLOS 2019):

If a MIR kernel is data-race-free under the MIR memory model, its compiled PTX program admits only executions consistent with the PTX memory model.
Property verification: Leverage this semantics to verify kernel-level properties such as:
- Absence of divergent barrier synchronization;
- Preservation of sequential equivalence (e.g., for reductions or scans);
- Conformance to the PTX consistency model under shared-memory interactions.
Prototype toolchain: Deliver a prototype that automatically translates Rust-CUDA kernels into Coq terms, evaluates their semantics within Coq, and interfaces with PTX proofs.

A Coq formalization of Rust MIR semantics for GPU kernels using Rust nightly-2025-03-02.
A MIR→PTX memory-model correspondence theorem, establishing soundness of atomic and synchronization operations for a well-defined kernel subset.
A prototype translator generating Coq verification artifacts from Rust code.
Case studies on standard CUDA benchmarks (e.g., SAXPY, reductions) verifying barrier correctness and dataflow soundness.

While this first phase omits Rust's ownership and lifetime reasoning, the framework is designed to incorporate it later. Future extensions can integrate ownership types or affine resource logics into the MIR semantics, enabling end-to-end proofs of data-race freedom and alias safety.

This project establishes the missing formal bridge between Rust's compiler infrastructure and the only existing mechanized model of GPU execution. By defining verified semantics for MIR and connecting it to PTX, it provides the foundation for future CompCert-style verified compilation of GPU code and opens the door to ownership-aware proofs of safety and correctness for massively parallel Rust programs.

Rebuild the MIR dumps, translate them into Coq, and check the traces/bridges with:

The target performs three steps:

rustc -Z dump-mir=all for examples/saxpy.rs and examples/atomic_flag.rs (writes into mir_dump/).
tools/mir2coq.py parses the PreCodegen.after dumps and regenerates coq/examples/{saxpy,atomic_flag}_gen.v.
make -C coq all type-checks the MIR semantics, the generated programs, and the MIR→PTX translation lemmas.

Afterwards you can inspect coq/examples/*_gen.v and re-run Eval compute queries found in coq/MIRTests.v to see the MIR event traces and their PTX images.

examples/*.rs --rustc -Z dump-mir--> mir_dump/*.mir --tools/mir2coq.py--> coq/examples/*_gen.v \ | \--> target/*.ptx (optional) v Coq build (MIRSyntax + MIRSemantics + Translate + Soundness) -> PTX event traces

Ensure the Rust nightly and Coq toolchain are available:
- rustup toolchain install nightly-2025-03-02
- rustup override set nightly-2025-03-02
- opam install coq (Coq ≥ 8.18)
In every new shell, activate the Coq switch so coq_makefile is on your PATH:
Run the end-to-end build:

Refer to docs/mapping-table.md for the full table. In short:

TyI32/TyU32/TyF32 loads and stores become EvLoad/EvStore in PTX with space_global, relaxed semantics, and the matching mem_ty (MemS32, MemU32, MemF32).
Acquire loads and release stores attach sem_acquire/sem_release and CTA scope, mirroring the observed ld.acquire.sys.<ty> and st.release.sys.<ty>.
Barriers translate to EvBarrier scope_cta.

The translator (coq/Translate.v) and the docs stay in sync via helper functions mem_ty_of_mir and z_of_val.

Global memory only; shared-memory scopes and bank conflicts are out of scope.
Non-atomic accesses are relaxed and scope-less; only one acquire/release pair with SYS scope is modelled.
Floating-point values are treated as raw IEEE-754 bit patterns (Z payloads); no reasoning about NaNs or rounding edge cases yet.
Translator handles a curated subset of MIR (no arbitrary control flow, panic paths, or complex intrinsics).

Extend the translator grammar to cover additional MIR statements (comparisons, guards, simple loops/barriers) while preserving determinism.
Enrich the PTX shim with reads-from / coherence relations from the PTX Coq model.
Prove the remaining per-event lemmas (Load_ok, Store_ok) and lift the translate_trace_shape property toward an end-to-end soundness theorem.
Integrate shared-memory scope tags and CTA-wide fences, then revisit atomics/fences beyond acquire-release.

Read Entire Article

Show HN: Cuq – Formal Verification of Rust GPU Kernels

Related

Best strategies to grow from 300 –> 1000 users?

GM will ditch Apple CarPlay and Android Auto on all its cars...

Vive la France long live the US