Continuous Nvidia CUDA Profiling in Production

tl;dr: Today, as part of the v0.43.0 release of the parca-agent, we're releasing what is (to our knowledge) the world's first open source NVIDIA CUDA profiler with overhead low enough for always-on profiling in production.

Understanding the performance characteristics of CUDA applications in production can be a challenge. NVIDIA Nsight, the standard profiling tool, is very informative but invasive, telling you about every syscall, CUDA API call, and memory transfer, and even capturing application-level stack traces! All that insight exacts a significant performance penalty. At Polar Signals the name of the game is continuous profiling in production, so we needed to come up with a way to get valuable information with very little overhead.

This post builds on our previous work with GPU metrics collection, where we demonstrated tracking GPU utilization and memory usage. Now we're taking it a step further by profiling individual CUDA kernel executions to understand exactly where GPU time is being spent.

The Problem

Traditional profiling approaches face a fundamental challenge: how do you get fine-grained timing information from the program back to your profiler without introducing significant overhead? The user's GPU application will typically be an unprivileged Python program running in some container, and our profiler in this case is a privileged agent (parca-agent). Common solutions (and the overhead they introduce) include:

  • Writing data to files (filesystem I/O overhead): this is what NVIDIA Nsight does, and it's tricky when everything is containerized.
  • Sending data over domain sockets or network connections (serialization and network overhead).

For continuous profiling in production environments, it would be nice to cut to the chase and avoid the overhead of these solutions.

The Solution

Our solution combines the CUPTI profiling API with USDT probes and eBPF into a simple pipeline, making what we believe to be the world's first open-source, low-overhead, always-on GPU profiler.

1. Dynamic CUDA Instrumentation

At the heart of our solution is parcagpu, a shim library that intercepts CUDA API calls. Using CUDA's CUDA_INJECTION64_PATH mechanism, we can inject this library into any CUDA application without modification:

export CUDA_INJECTION64_PATH=/path/to/libparcagpucupti.so
./your_cuda_application

Our goal is zero instrumentation, but this approach is probably as close as we can realistically get.
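
To make that concrete, here is a minimal sketch of what an injected entry point can look like, following the conventions of NVIDIA's CUPTI injection samples. The setup_cupti helper is our own illustrative name (sketched in the next section), not necessarily what parcagpu calls it.

int setup_cupti(void);  /* illustrative name; CUPTI wiring is sketched below */

/* When CUDA_INJECTION64_PATH points at this library, the CUDA driver
 * dlopen()s it and calls the exported InitializeInjection() before the
 * application launches any work. */
int InitializeInjection(void)
{
    if (setup_cupti() != 0) {
        /* Never break the application just because profiling failed. */
        return 0;
    }
    return 1;  /* the injection samples return non-zero on success */
}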

The library uses NVIDIA's CUPTI (CUDA Profiling Tools Interface) to:

  • Subscribe to CUDA runtime API callbacks for cudaLaunchKernel and cudaGraphLaunch
  • Register with CUPTI's activity API to receive kernel execution records
  • Collect timing data, device IDs, stream IDs, and kernel names from the activity stream

Then instead of writing this data to files or sockets, we expose it to eBPF through USDT probes.
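
Here's a rough sketch of that CUPTI wiring, with error handling trimmed and the callback bodies deferred to the next two sections. Treat the exact callback IDs as assumptions: their names are versioned per CUDA release.

#include <cupti.h>
#include <stddef.h>
#include <stdint.h>

static CUpti_SubscriberHandle subscriber;

/* Callback bodies are sketched in the following sections. */
static void CUPTIAPI on_runtime_api(void *userdata, CUpti_CallbackDomain domain,
                                    CUpti_CallbackId cbid, const void *cbdata);
static void CUPTIAPI on_buffer_requested(uint8_t **buffer, size_t *size,
                                         size_t *max_num_records);
static void CUPTIAPI on_buffer_completed(CUcontext ctx, uint32_t stream_id,
                                         uint8_t *buffer, size_t size,
                                         size_t valid_size);

int setup_cupti(void)
{
    /* 1. Subscribe to runtime API callbacks so we see every launch call. */
    if (cuptiSubscribe(&subscriber, on_runtime_api, NULL) != CUPTI_SUCCESS)
        return -1;
    cuptiEnableCallback(1, subscriber, CUPTI_CB_DOMAIN_RUNTIME_API,
                        CUPTI_RUNTIME_TRACE_CBID_cudaLaunchKernel_v7000);
    cuptiEnableCallback(1, subscriber, CUPTI_CB_DOMAIN_RUNTIME_API,
                        CUPTI_RUNTIME_TRACE_CBID_cudaGraphLaunch_v10000);

    /* 2. Turn on the activity stream that carries per-kernel timing records. */
    cuptiActivityEnable(CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL);
    cuptiActivityRegisterCallbacks(on_buffer_requested, on_buffer_completed);
    return 0;
}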

2. USDT Probes: Stable application tracepoints

USDT (User Statically-Defined Tracing) probes are lightweight tracepoints embedded in userspace applications. We define two probes in parcagpu:

parcagpu:cuda_correlation - Fired when a CUDA kernel launch is initiated:

DTRACE_PROBE3(parcagpu, cuda_correlation, correlationId, cbid, name);

This probe captures the correlation ID that links the kernel launch to its eventual execution, along with the callback ID that identifies whether it's a regular kernel launch or graph launch.
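
Continuing the sketch from above, the runtime API callback that fires this probe might look roughly like this; the fallback from symbolName to functionName is our assumption, not necessarily what parcagpu does.

#include <cupti.h>
#include <sys/sdt.h>

static void CUPTIAPI on_runtime_api(void *userdata, CUpti_CallbackDomain domain,
                                    CUpti_CallbackId cbid, const void *cbdata)
{
    const CUpti_CallbackData *cb = (const CUpti_CallbackData *)cbdata;
    (void)userdata; (void)domain;

    /* Only fire on the way into the launch call, not on the way out. */
    if (cb->callbackSite != CUPTI_API_ENTER)
        return;

    /* correlationId ties this launch to the activity record CUPTI will
     * deliver once the kernel has actually run on the GPU. */
    DTRACE_PROBE3(parcagpu, cuda_correlation,
                  cb->correlationId, cbid,
                  cb->symbolName ? cb->symbolName : cb->functionName);
}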

parcagpu:kernel_executed - Fired when CUPTI's activity API delivers kernel execution data:

DTRACE_PROBE8(parcagpu, kernel_executed, start, end, correlationId, deviceId, streamId, graphId, graphNodeId, name);

This probe exposes timing and context information including:

  • Start and end timestamps (from GPU hardware counters)
  • Correlation ID to match with the launch
  • Device and stream identifiers
  • Graph ID and node ID (for CUDA graph executions)
  • Kernel name

These probes compile down to just a NOP instruction when not actively traced, making them virtually free when profiling is disabled. In reality CUPTI itself adds a little overhead, but it's pretty close to being in the noise.
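
And the activity side of the sketch: the buffer-completed callback walks CUPTI's records and fires the probe once per kernel. The kernel activity struct is versioned (CUpti_ActivityKernel4 through 9 and beyond, depending on the CUPTI release), so take the struct name and the graphId/graphNodeId fields as assumptions for a recent CUPTI.

#include <cupti.h>
#include <stdlib.h>
#include <sys/sdt.h>

static void CUPTIAPI on_buffer_completed(CUcontext ctx, uint32_t stream_id,
                                         uint8_t *buffer, size_t size,
                                         size_t valid_size)
{
    CUpti_Activity *record = NULL;
    (void)ctx; (void)stream_id; (void)size;

    /* Walk every record CUPTI packed into this buffer. */
    while (cuptiActivityGetNextRecord(buffer, valid_size, &record) == CUPTI_SUCCESS) {
        if (record->kind != CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL)
            continue;

        /* Versioned struct; pick the one matching your CUPTI headers. */
        CUpti_ActivityKernel9 *k = (CUpti_ActivityKernel9 *)record;

        /* One probe firing per kernel execution; eBPF picks these up. */
        DTRACE_PROBE8(parcagpu, kernel_executed,
                      k->start, k->end, k->correlationId,
                      k->deviceId, k->streamId,
                      k->graphId, k->graphNodeId, k->name);
    }
    free(buffer);  /* allocated in the buffer-requested callback */
}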

3. eBPF: Capturing the Data

The parca-agent profiler attaches to these USDT probes using uprobes attached by address (as opposed to by symbol), where the address comes from the .note.stapsdt ELF section, so we know exactly what we're "tapping" into. But there's a chicken-and-egg problem: the parcagpu library is loaded dynamically when the CUDA application initializes CUPTI, so we need to know when it gets loaded. In practice CUDA applications can be short-lived, so to get a realistic performance picture we need to start profiling as soon as the application starts issuing GPU work.

We solve this with another uprobe on dlopen in the dynamic linker. This allows us to turn our profiler on immediately when dlopen is called on libparcagpucupti.so. Then we can detect when the library is loaded into the application process and attach our USDT probe handlers.
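
As an illustration (not parca-agent's exact program), the dlopen hook can be as simple as a uprobe that forwards the requested path to userspace, which then checks for libparcagpucupti.so and attaches the USDT handlers once the library is mapped. The struct and map names here are our own.

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

struct dlopen_event {
    u32 pid;
    char path[256];
};

struct {
    __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
    __uint(key_size, sizeof(u32));
    __uint(value_size, sizeof(u32));
} dlopen_events SEC(".maps");

SEC("uprobe")
int trace_dlopen(struct pt_regs *ctx)
{
    struct dlopen_event ev = {};
    const char *filename = (const char *)PT_REGS_PARM1(ctx);

    ev.pid = bpf_get_current_pid_tgid() >> 32;
    bpf_probe_read_user_str(ev.path, sizeof(ev.path), filename);

    /* Userspace filters on the library name; the kernel side stays dumb. */
    bpf_perf_event_output(ctx, &dlopen_events, BPF_F_CURRENT_CPU, &ev, sizeof(ev));
    return 0;
}

char LICENSE[] SEC("license") = "GPL";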

In practice we don't even have to attach two separate eBPF programs; we dynamically detect whether the kernel supports the multi-uprobe feature (kernels 6.6 and above, usually) and can then attach one eBPF program to both USDT tracepoints, dispatching to the right handler with a single probe:

SEC("usdt/cuda_probe") int cuda_probe(struct pt_regs *ctx) { u64 cookie = bpf_get_attach_cookie(ctx); switch (cookie) { case 'c': return cuda_correlation(ctx); case 'k': return cuda_kernel_exec(ctx); default: DEBUG_PRINT("cuda_probe: unknown cookie %llu", cookie); break; } return 0; }

In the future we envision the ability to dynamically attach to N USDT probes and have pre-built eBPF programs that can do various things with the probe information, but let's not get ahead of ourselves.

Instrumenting Regular and Graph Kernel Launches

CUDA supports two distinct execution models, and our profiler handles both:

Regular Kernel Launches

For standard kernel launches, the flow is straightforward:

  1. Application calls cudaLaunchKernel
  2. CUPTI callback fires with correlation ID
  3. cuda_correlation probe captures stack trace
  4. Kernel executes on GPU
  5. CUPTI activity API delivers timing data
  6. kernel_executed probe fires with timing
  7. Userspace matches correlation IDs and emits complete profile

CUDA Graph Launches

CUDA graphs allow pre-recording sequences of kernel launches for efficient replay. A single cudaGraphLaunch call can execute dozens or hundreds of kernels. Our profiler handles this by:

  1. Detecting graph launches via callback id == GRAPH_LAUNCH_CBID
  2. Keeping the trace open to match with multiple kernel executions
  3. Associating each kernel with its graph ID and node ID
  4. Creating a separate profile sample for each kernel in the graph with the kernel name as a pseudo-frame in the callstack

We never know when the stream of kernels for a particular graph ends, so we have to hold on to these traces for a bit and let them drain out over time.

The Perf Event Buffer

To move the information about N kernel launches efficiently from USDT to our agent, we use perf event buffers - a ring buffer mechanism built into the Linux kernel specifically for high-performance tracing. No serialization, no system calls, no copies beyond the single write into the ring buffer.

This is as close to zero-copy as you can get: the timing data is written once by the eBPF program directly into shared memory, then read once by the Go parca-agent program.
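
As a sketch of the eBPF side (using libbpf's USDT argument helpers for brevity; parca-agent decodes the arguments itself from the probe's argument spec), the kernel_executed handler just fills a struct and writes it into the perf event array. Struct layout and names here are illustrative.

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/usdt.bpf.h>

struct kernel_exec_event {
    u64 start_ns;
    u64 end_ns;
    u64 correlation_id;
    u64 device_id;
    u64 stream_id;
    /* graph/name fields elided for brevity */
};

struct {
    __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
    __uint(key_size, sizeof(u32));
    __uint(value_size, sizeof(u32));
} gpu_events SEC(".maps");

/* Called from the cuda_probe dispatcher shown earlier. */
static int cuda_kernel_exec(struct pt_regs *ctx)
{
    struct kernel_exec_event ev = {};
    long v;

    /* USDT arguments as laid out by DTRACE_PROBE8 in parcagpu. */
    bpf_usdt_arg(ctx, 0, &v); ev.start_ns       = v;
    bpf_usdt_arg(ctx, 1, &v); ev.end_ns         = v;
    bpf_usdt_arg(ctx, 2, &v); ev.correlation_id = v;
    bpf_usdt_arg(ctx, 3, &v); ev.device_id      = v;
    bpf_usdt_arg(ctx, 4, &v); ev.stream_id      = v;

    /* Single write into the shared ring; userspace reads it without a copy. */
    bpf_perf_event_output(ctx, &gpu_events, BPF_F_CURRENT_CPU, &ev, sizeof(ev));
    return 0;
}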

Actually, we could probably do even better: what if, instead of sticking a USDT probe on each kernel activity event, we just passed the CUPTI activity buffer to the uprobe and read all the activities there? Then hundreds of probe firings could collapse to one. That would open the door to subscribing to more CUPTI events for little additional overhead - dare to dream! Of course, all the activity buffer handling code provided by CUPTI would have to be recreated in eBPF, but there are no free lunches!

Matching Traces with Timing Data

So how do we match these application stack traces (captured when the kernel is launched) with the GPU timing data (delivered asynchronously after execution)? We use CUPTI's correlation IDs for this.

A gpuTraceFixer maintains two maps per process:

  • timesAwaitingTraces: Timing data that arrived before the stack trace
  • tracesAwaitingTimes: Stack traces waiting for their timing data

1. CPU: cudaLaunchKernel() called
   |
   +---> cuda_correlation probe fires
         - Captures: Stack Trace + Correlation ID (e.g., 12345)
         - Stored in: tracesAwaitingTimes[12345] = {stack, PID, ...}

2. GPU: Kernel executes asynchronously

3. CUPTI Activity API delivers timing data
   |
   +---> kernel_executed probe fires
         - Receives: {start, end, corrID: 12345, device, stream, ...}
         - Sent via: Perf Event Buffer to userspace

4. Correlation: Match timing with stack trace
   |
   +---> Lookup tracesAwaitingTimes[12345]
   |
   +---> Match found!
         - Combine: Stack Trace + GPU Timing
         - Output: Complete Profile Sample

Note: Works in either order - timing can arrive before or after stack trace
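
The gpuTraceFixer itself lives in the Go agent; as a rough illustration of the two-map bookkeeping, here's a toy C sketch keyed by correlation ID that handles arrival in either order.

#include <stdint.h>

#define SLOTS 4096

struct pending_trace  { uint64_t corr_id; int have; /* + stack trace */ };
struct pending_timing { uint64_t corr_id; int have; uint64_t start, end; };

static struct pending_trace  traces_awaiting_times[SLOTS];
static struct pending_timing times_awaiting_traces[SLOTS];

static void emit_sample(uint64_t corr_id, uint64_t start, uint64_t end)
{
    /* combine stack trace + GPU timing into one profile sample */
}

void on_stack_trace(uint64_t corr_id)
{
    struct pending_timing *t = &times_awaiting_traces[corr_id % SLOTS];
    if (t->have && t->corr_id == corr_id) {        /* timing arrived first */
        emit_sample(corr_id, t->start, t->end);
        t->have = 0;
        return;
    }
    struct pending_trace *p = &traces_awaiting_times[corr_id % SLOTS];
    p->corr_id = corr_id;                          /* wait for the timing */
    p->have = 1;
}

void on_gpu_timing(uint64_t corr_id, uint64_t start, uint64_t end)
{
    struct pending_trace *p = &traces_awaiting_times[corr_id % SLOTS];
    if (p->have && p->corr_id == corr_id) {        /* trace arrived first */
        emit_sample(corr_id, start, end);
        p->have = 0;
        return;
    }
    struct pending_timing *t = &times_awaiting_traces[corr_id % SLOTS];
    t->corr_id = corr_id; t->start = start; t->end = end;
    t->have = 1;
}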

When a match is found, we combine them into a complete profile sample with:

  • Full CPU stack trace showing what launched the kernel
  • GPU execution duration
  • Device, stream, and graph IDs
  • Demangled kernel name

Labels for Context

Beyond basic profiling, we attach labels to each sample providing additional context:

  • cuda_device: Which GPU device executed the kernel
  • cuda_stream: Which CUDA stream (for understanding parallelism)
  • cuda_graph: Graph ID (for graph executions)
  • cuda_id: Correlation ID (for graph kernels to distinguish multiple executions)

This allows filtering and grouping profiles by device, stream, or graph in the Polar Signals UI. Did we miss any? Let us know if there's some GPU context you'd like us to add!

Conclusion

By combining CUPTI for CUDA instrumentation, USDT probes as a bridge, and eBPF for kernel-space data collection, we've built a CUDA profiler that can run continuously in production. The architecture is elegant: timing data flows from GPU hardware counters through CUPTI's activity stream into USDT probes, is captured by eBPF, and is delivered to userspace via perf event buffers - all without touching the filesystem or the network.

This approach handles both regular kernel launches and complex CUDA graph executions, provides rich contextual metadata through custom labels, and works for both AMD64 and ARM64 architectures. The result is a profiler that gives you deep visibility into GPU workload performance suitable for continuous production profiling.

So what does it look like? Here's some real data from running vLLM examples on some real hardware (thanks Lambda.AI!).

Getting Started

To use this CUDA profiling capability, you'll need to run the Parca agent with CUDA instrumentation enabled:

parca-agent --instrument-cuda-launch

The --instrument-cuda-launch flag activates the eBPF programs that attach to the USDT probes in parcagpu. When enabled, the agent will automatically detect when CUDA applications load the parcagpu library and begin capturing GPU kernel execution profiles.

You can download the latest parca-agent from the parca-agent releases page.

The only other requirement is to make sure the CUDA_INJECTION64_PATH environment variable points to the libparcagpucupti.so for all your GPU application processes. Get that library here.

Let us know how it goes! Reach out on Discord or, better yet, schedule a demo!

Key Implementation Files

For those interested in diving deeper into the implementation:
