Live demo: uhop.dev
UHOP is an open hardware optimization platform that unifies GPU acceleration across CUDA, ROCm/HIP, Metal, OpenCL, and future architectures. It detects your machine, dispatches to the best backend, can generate kernels with AI and validate them, and caches the fastest path for reuse, so developers can write simple code that runs fast everywhere.
Key capabilities today:
- Automatic backend detection: Torch (CUDA/MPS/CPU), OpenCL (GPU/CPU), Triton (Linux), CPU fallback
- Drop‑in acceleration via the @uhop.optimize("op") decorator (e.g., matmul)
- AI kernel generation (OpenAI) for OpenCL/CUDA/Python/Triton with validation/smoke tests
- On‑disk caching of selected kernels/implementations per device
- Friendly CLI for hardware info, demos, AI codegen, and cache tools
- Optional Local Agent so the web portal can run operations on your hardware
Vision: a universal, community-driven runtime optimizer that makes high‑performance computing approachable, portable, and fun — across vendors and form factors.
Planned (see issues/): multi‑backend benchmarking/policies, correctness suites, distributed training loops for AI‑generated kernels, richer dashboard, and tighter framework integrations (PyTorch/JAX).
The platform has four layers working together:
- Frontend (Vite + React) — live controls, real‑time logs, and benchmarks
- Backend (Node/Express + ws) — routes jobs to your Local Agent or server runtime
- Local Agent (Python) — runs UHOP operations on your machine securely
- UHOP Core (Python) — backends, optimizer, AI codegen/validation, caching
See also: docs/architecture.svg (source image) for sharing in blogs/slides.
At a glance, the request flow prefers the Local Agent when connected, and falls back to server‑side execution when not.
Prereqs
- Python 3.10+
- OS: Windows, macOS, or Linux
- Drivers/toolchains as applicable: CUDA (NVIDIA), OpenCL runtime (AMD/Intel/NVIDIA), Apple MPS (macOS)
- Optional: OPENAI_API_KEY for AI codegen
Install
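The exact package name and distribution channel aren't pinned down here, so the sketch below assumes an editable install from a local checkout of the repository; adjust to the project's actual layout.

```bash
# Assumed steps from a local clone; package name/layout are not confirmed by this README.
python -m venv .venv && source .venv/bin/activate
pip install -e .                # editable install of the UHOP Python package
export OPENAI_API_KEY=...       # optional: enables AI kernel generation
```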
Verify your setup
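A quick sanity check via the CLI's hardware-info command (uhop info --json is referenced under good‑first‑issues below; the plain form is assumed to behave the same way):

```bash
uhop info          # detected backends/devices
uhop info --json   # machine-readable output
```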
Run a demo
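A hedged example: the subcommand name and flags below are assumptions based on the CLI description above and the /demo/matmul HTTP endpoint later in this README; check the CLI help for the real invocation.

```bash
# Hypothetical invocation; the sizes mirror the /demo/matmul example payload.
uhop demo matmul --size 256 --iters 3
```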
Try OpenCL elementwise add vs naive
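Again only a sketch: the subcommand is hypothetical, while the device override is the UHOP_OPENCL_DEVICE_INDEX knob documented under environment knobs below.

```bash
# Hypothetical subcommand comparing the OpenCL kernel against a naive CPU baseline.
UHOP_OPENCL_DEVICE_INDEX=0 uhop demo add-opencl
```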
Integrate in your code
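A minimal sketch of the decorator flow described under key capabilities; the import path and the convention of decorating a naive reference implementation are assumptions.

```python
import numpy as np
import uhop  # assumed top-level import

# Assumed usage: decorate a reference implementation and let UHOP dispatch the call
# to the fastest detected backend (Torch / OpenCL / Triton / CPU) and cache the choice.
@uhop.optimize("matmul")
def matmul(a, b):
    return a @ b  # naive fallback; UHOP may swap in a tuned kernel

a = np.random.rand(256, 256).astype(np.float32)
b = np.random.rand(256, 256).astype(np.float32)
print(matmul(a, b).shape)
```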
Environment knobs
- UHOP_OPENCL_DEVICE_INDEX=<idx> — default OpenCL device override
- UHOP_STRICT_VALIDATE=1 — tighten AI‑kernel validation during codegen
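For example (both variables are the knobs listed above; the values are placeholders):

```bash
export UHOP_OPENCL_DEVICE_INDEX=1   # pin the second OpenCL device
export UHOP_STRICT_VALIDATE=1       # stricter validation of AI-generated kernels
```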
Expose a local HTTP API for demos/automation:
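The launch command isn't shown here; a hypothetical sketch (subcommand and port are assumptions):

```bash
# Hypothetical entry point; substitute the project's actual server/agent command.
uhop serve --port 8000
```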
Endpoints
- GET /health
- GET /info
- POST /demo/matmul with { "size": 256, "iters": 3 }
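For instance, assuming the API listens on localhost:8000 (the port is an assumption; the paths and payload are the ones listed above):

```bash
curl http://localhost:8000/health
curl http://localhost:8000/info
curl -X POST http://localhost:8000/demo/matmul \
  -H "Content-Type: application/json" \
  -d '{ "size": 256, "iters": 3 }'
```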
Docker
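No image name or Dockerfile details are given here, so this is only a sketch with placeholder tags and ports; NVIDIA GPU passthrough additionally requires the NVIDIA Container Toolkit.

```bash
# Hypothetical build/run from a Dockerfile at the repo root.
docker build -t uhop:local .
docker run --rm -p 8000:8000 uhop:local
# docker run --rm --gpus all -p 8000:8000 uhop:local   # with NVIDIA GPU passthrough
```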
We’re building UHOP as a friendly, long‑term open platform. All experience levels welcome — and we especially invite:
- GPU engineers (CUDA/ROCm/Metal/OpenCL)
- Compiler/runtime developers (Triton/MLIR/TVM)
- ML engineers and researchers (kernels, validation, datasets)
- Frontend devs (Vite/React/Tailwind, data viz)
Start here:
- Read CONTRIBUTING.md for local setup, tests, and PR tips
- Run ./contributing.sh setup and ./contributing.sh test
- Explore issues/ for scoped design notes and milestones
Expectations:
- Keep public APIs stable; update docs/tests with behavior changes
- Aim for reproducible steps and minimal dependencies
- Small, focused PRs with clear titles (Conventional Commits encouraged)
| Milestone | Scope | Status |
| --- | --- | --- |
| Pre‑MVP | Runtime decorator, hardware detection, caching, CLI demo | In progress |
| MVP | Multi‑backend benchmarking and selection policies | Planned |
| AI Kernels v1 | Automated validation, correctness suites, smoke tests | Planned |
| Dashboard | Logging, benchmark viz, local agent UX | Planned |
| Frameworks | PyTorch/JAX wrappers, training loop integration | Planned |
| All‑systems support | CUDA, ROCm/HIP, Metal, OpenCL (explore Vulkan/oneAPI) | Vision |
| All‑ops coverage | Elementwise, reductions, convs, attention, norms, fused ops | Vision |
| Protocol Spec v1.0 | Stable spec: device negotiation, cache manifests, kernel metadata | Vision |
See the issues/ directory for detailed write‑ups:
- 01 Implement runtime decorator
- 02 Hardware detection refinement
- 03 Caching metadata schema
- 04 CLI demo
- 05 AI kernel validation
- 06 Logging & benchmark viz
- 07 Multi‑backend benchmarking
Jump in with these approachable starters:
- Improve OpenCL/kernel templates and add simple correctness tests
- Add a CUDA/HIP example at parity with the OpenCL elementwise add
- Enhance uhop info --json fields (driver versions, memory footprints)
- Add README snippets with Windows/macOS-specific setup tips
- Polish the frontend build or add a minimal dashboard card
- Optimize CI/CD workflow and docs for PRs and promotions (badges, faster CI, templates) — see issues/15-ci-cd-workflow-docs-promo.md
Or pick one of the tracked proposals in issues/ above and comment to claim it.
Run the test suite (GPU‑dependent tests skip automatically):
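./contributing.sh test is the wrapper mentioned under "Start here"; running pytest directly assumes a standard pytest layout.

```bash
./contributing.sh test
# or
pytest -q
```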
Targeted runs:
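For example, filtering by keyword or file (the keyword and path below are placeholders):

```bash
pytest -k opencl                  # only tests whose names match "opencl"
pytest tests/test_matmul.py -q    # hypothetical path; adjust to the actual test files
```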
MIT © UHOP Systems
Tags: gpu, compiler, rocm, cuda, opencl, metal, hpc, mlops, deep-learning, open-hardware