A multiplexing tensor framework.
The motivation for this project is a rejection of the clunky lock-step paradigm ML researchers tend to use. GT pulls in ideas from the decades of development behind multi-core operating systems: it fully embraces dynamic scheduling and heavily asynchronous execution while presenting a familiar eager frontend.
- Three components (diagram)
- N x clients (as many users as you want!)
- 1 x dispatcher (for coordinating)
- N x workers (1 per GPU)
- Everything communicates with a stream of instructions
- Clients deal with math. They emit (GPU-unaware) pure functional instructions
- The dispatcher rewrites these instructions on the fly to be GPU-aware and sends them to the workers
- Workers asynchronously process these instructions, optionally JIT-compiling them
- Instruction streams are annotated
- Clients can send "signals" that let the dispatcher shard tensors more appropriately
- Dispatchers annotate "hot" paths to give hints to workers about JIT compiling
- Annotations are supplemented with YAML configs that specify sharding and compilation information
- Every annotation can be safely ignored, so the same code can run anywhere (just remove the YAML)
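A minimal sketch of the eager frontend. The gt.tensor constructor and the synchronize-on-read behavior are assumptions; the operators themselves (@, relu(), sum()) come from the supported-operations list further down.

```python
import gt

# Tensors look like PyTorch tensors, but each operation is emitted as a
# pure-functional instruction and executed asynchronously by a worker.
a = gt.tensor([[1.0, 2.0], [3.0, 4.0]])
b = gt.tensor([[5.0, 6.0], [7.0, 8.0]])

c = (a @ b).relu()   # matmul + activation, scheduled by the dispatcher
print(c.sum())       # reading a value synchronizes with the workers
```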
It may not look like it, but in the background GT automatically spins up an asynchronous dispatching server and GPU worker.
- High-performance transport - ZeroMQ (ZMQ) with automatic message batching and efficient DEALER/ROUTER pattern
- Autograd support - Tape-based automatic differentiation exclusively at the client layer
- PyTorch-compatible API - Familiar syntax for tensor operations
- AI-assisted development - Optimized for collaboration with AI coding assistants. See AI Development
For distributed training, see Distributed Setup.
Control tensor placement across workers using named signals and YAML configs:
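A hedged sketch of the signal API, assuming gt.signal is the context manager exposed by gt/signal.py (the exact name may differ):

```python
import gt

# "batch_parallel" is only a label: it takes effect when a YAML config
# (pointed to by GT_CONFIG) maps it to a sharding strategy. With no config
# the signal is safely ignored and the same code runs on a single worker.
with gt.signal("batch_parallel"):
    x = gt.tensor([[1.0, 2.0], [3.0, 4.0]])
    h = (x @ x.T).relu()   # the dispatcher may shard this across workers
```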
Supported configurations:
- Data parallelism (batch sharding)
- Model parallelism (feature sharding)
- Pipeline parallelism (stage-wise worker assignment)
- Replicated parameters
- Per-layer sharding strategies
- Compilation directives (compile: true for torch.compile boundaries)
See examples/README_SIGNALS.md for a comprehensive guide.
Tape-based automatic differentiation:
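A hedged sketch, assuming the PyTorch-style spellings (requires_grad, backward(), .grad) implied by the PyTorch-compatible API:

```python
import gt

w = gt.tensor([[0.5, -0.5]], requires_grad=True)
x = gt.tensor([[1.0], [2.0]])

loss = (w @ x).sum()
loss.backward()    # walks the tape recorded on the client
print(w.grad)      # gradients, like the tape itself, live client-side
```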
Implemented:
- Tape-based autograd (PyTorch-style)
- Gradient computation with broadcasting
- In-place parameter updates
- SGD optimizer
Operations can be logged with timestamps:
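For example (a sketch that assumes GT reads GT_INSTRUCTION_LOG at import time, so the variable is set before gt is imported):

```python
import os
os.environ["GT_INSTRUCTION_LOG"] = "instructions.log"  # see the env-var table below

import gt
x = gt.tensor([1.0, 2.0, 3.0])
y = (x * x).sum()   # each emitted instruction is logged with a timestamp
```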
Each log entry records the operation and a timestamp. This is useful for:
- Debugging hangs (see last operation before timeout)
- Identifying slow operations (large timestamp gaps)
- Understanding distributed execution flow
Generate timeline visualizations showing operation flow through the system:
The visualizer is included with GT (matplotlib is a core dependency).
Output shows:
- Timeline lanes for Client, Dispatcher, and each Worker
- Color-coded operations (MatMul, BinaryOp, UnaryOp, etc.)
- Event types indicated by marker shapes (RECV, WORKER_SEND, WORKER_RECV)
- Data transfer sizes indicated by marker sizes
- Communication arrows showing instruction flow between components
- Instruction IDs annotated on key events
Use cases:
- Identify idle workers or unbalanced load
- Visualize distributed operation patterns (embarrassingly parallel, all-gather, all-reduce)
- Find communication bottlenecks
- Debug distributed execution issues
See gt/scripts/README.md for complete documentation.
Monitor running dispatchers with htop-style worker activity visualization:
The monitor is included with GT (pyzmq, rich, and psutil are core dependencies).
Features:
- Real-time EMA-smoothed activity bars showing operation breakdown per worker
- Color-coded operations (matmul, add, relu, etc.)
- Idle time tracking to identify underutilized workers
- Auto-detection of running dispatchers
- Non-intrusive - connects via ZMQ monitoring socket without affecting performance
Capture event streams for later analysis:
Options:
- -s, --seconds DURATION - Maximum capture duration (required)
- -n, --max-events N - Stop after N events (optional)
- --port PORT - Dispatcher port (auto-detected by default)
- --dir DIR - Output directory (default: current directory)
Workflow:
- Run your workload - Normal GT script execution
- Capture trace - Record events for specified duration or event count
- Visualize - Generate timeline diagrams from captured data
This complements the monitoring tools:
- gt.scripts.top - Real-time monitoring (htop-style)
- gt.scripts.trace - Capture events to file
- gt.scripts.visualize - Generate timeline diagrams
Inspect internal state:
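gt.debug.print_tape() (also mentioned under AI-assisted development below) dumps the autograd tape; the surrounding tensor code in this sketch assumes the same PyTorch-style API as above:

```python
import gt

x = gt.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x.exp().sum()

gt.debug.print_tape()   # print the gradient computation graph recorded so far
```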
- PyTorch (default): GPU support, compilation, distributed primitives
- NumPy: CPU-only reference implementation (for testing)
Workers use PyTorch by default for both GPU and CPU execution.
GT uses ZeroMQ (ZMQ) for client-dispatcher-worker communication:
Benefits:
- Automatic message batching - ZMQ queues and batches messages at the transport layer
- Higher throughput - More efficient than raw TCP for high-frequency small messages
- Built-in patterns - DEALER/ROUTER pattern handles multiple connections efficiently
- Scalability - Supports many concurrent clients and workers without manual connection management
- IPC optimization - Uses Unix domain sockets (IPC) for localhost connections, bypassing TCP/IP stack for lower latency
Architecture:
- Dispatcher - Single ZMQ ROUTER socket handles all connections
- Clients/Workers - DEALER sockets for async communication
- Worker Registration - Workers send registration message on startup
- Transport selection - Automatically uses IPC (ipc://) for localhost, TCP (tcp://) for remote hosts
This replaces the previous TCP implementation and provides better performance for the high message rate typical in distributed training workloads.
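For readers unfamiliar with the pattern, below is a minimal, GT-independent illustration of ROUTER/DEALER in pyzmq; the socket address is made up, and GT's actual transport lives in gt/transport/:

```python
import zmq

ctx = zmq.Context.instance()

# Dispatcher side: a single ROUTER socket accepts every client and worker.
router = ctx.socket(zmq.ROUTER)
router.bind("ipc:///tmp/gt-demo")            # IPC locally, tcp:// for remote hosts

# Worker side: a DEALER socket registers itself, then exchanges messages.
dealer = ctx.socket(zmq.DEALER)
dealer.connect("ipc:///tmp/gt-demo")
dealer.send(b"register")

identity, payload = router.recv_multipart()  # ROUTER prepends the sender identity
router.send_multipart([identity, b"ack"])    # replies are routed back by identity
print(dealer.recv())                         # b"ack"
```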
Included by default:
- PyTorch (GPU/CPU support, compilation)
- matplotlib (timeline visualizations)
- rich + psutil (real-time monitoring)
- NumPy, pytest, pyzmq, pyyaml
Configure multiple workers:
Terminal 1 - Start dispatcher:
Terminal 2-N - Start workers (1 per GPU):
Terminal N+1 - Run your code:
| Variable | Description | Default |
|----------|-------------|---------|
| GT_CONFIG | Path to sharding config YAML | None |
| GT_AUTO_COMPILE | Enable automatic hot path detection and compilation | 0 |
| GT_COMPILE | Force compile all operations | 0 |
| GT_WORKER_BATCH_SIZE | Number of operations to batch per worker | 1 |
| GT_VERBOSE | Enable framework status messages (startup, connections) | 0 |
| GT_DEBUG_CLIENT | Enable client-side debug messages | 0 |
| GT_DEBUG_DISPATCHER | Enable dispatcher debug messages | 0 |
| GT_DEBUG_WORKER | Enable worker debug messages | 0 |
| GT_DEBUG_COMPILE | Enable compilation debug messages | 0 |
| GT_INSTRUCTION_LOG | Path to instruction stream log file | None |
By default, GT produces no output except errors. Use GT_VERBOSE=1 to see startup messages.
Example:
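This sketch assumes the variables are read when gt is imported, so they are set first:

```python
import os

os.environ["GT_VERBOSE"] = "1"           # framework status messages
os.environ["GT_DEBUG_DISPATCHER"] = "1"  # dispatcher-side debug output

import gt
a = gt.tensor([[1.0, 2.0], [3.0, 4.0]])
print((a @ a).sum())
```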
Supported operations:
- Arithmetic: +, -, *, /, @ (matmul)
- Activations: relu(), sigmoid(), tanh()
- Reductions: sum(), mean()
- Math: exp(), log()
- Shape: transpose(), .T
- In-place: -=, zero_()
Tests auto-start a local GT system and verify numeric correctness.
Note: GT adds communication/serialization overhead. For small operations this overhead is significant; for large operations (training, large matmuls) it becomes negligible.
Components:
- gt/client/ - User-facing API with location-transparent tensors, tape-based autograd, signal tracking, and neural network modules
- gt/dispatcher/ - Coordinates clients and schedules operations using ZMQ ROUTER socket. Maps client tensors to physical locations, reads signal configs for sharding decisions, and logs instruction streams
- gt/worker/ - Executes operations using backends. Connects via ZMQ DEALER socket. Processes operations one at a time (stream processing). Supports multiple backends (PyTorch/NumPy). One worker per GPU.
- gt/transport/ - ZeroMQ-based communication layer with DEALER/ROUTER pattern for high-performance message passing
- gt/signal.py - Signal-based sharding API with context managers, thread-local signal stack, and backward signal support
- gt/config.py - YAML config loading that parses sharding strategies and maps signal names to worker assignments
- Signal-Based Sharding Guide - Complete guide to sharding API
- CLAUDE.md - Detailed architecture documentation
- Hot Path Detection - Automatic compilation for repeated patterns (future work)
The code prioritizes:
- Clarity over performance in initial implementation
- PyTorch-compatible API
- Declarative configuration via YAML
- Simple stream processing (one operation at a time)
See examples/ directory:
- signal_demo.py - Signal-based sharding demonstration
- compile_demo.py - Compilation directives demonstration
- debug_demo.py - Debug utilities demonstration
- visualize_demo.py - Instruction tape visualization demonstration
- config_sharding.yaml - Example sharding configuration
- config_compile.yaml - Example compilation configuration
- demo.py - Basic tensor operations
- simple_launch.py - Manual server/worker launch
- Signals: use them to control sharding strategies via configuration
- Multiple Workers: Scale across GPUs for data/model parallelism
- Logging: Use instruction logging to identify bottlenecks
- Transport: ZeroMQ provides efficient message batching at transport layer
Contributions welcome. This is a research prototype focused on simplicity and readability.
GT is designed to be understood, modified, and debugged with AI coding assistants:
- CLAUDE.md provides detailed architectural context optimized for Claude and other AI assistants
- Explicit codebase structure, design decisions, and implementation patterns
- Helps AI quickly understand the system and make consistent changes
- Sharding strategies defined in human-readable YAML configs
- Easy for AI to parse, understand, and generate configurations
- Clear mapping between signals and worker assignments
- See Signal-Based Sharding
- Tape-based autograd - Inspect gradient computation graph with gt.debug.print_tape()
- Instruction stream logging - Track every operation with timestamps via GT_INSTRUCTION_LOG
- Worker statistics - View operation counts and performance metrics
- Makes it easy to identify bugs and understand execution flow
- 50+ tests covering tensor operations, autograd, distributed execution
- Tests serve as executable documentation and specifications
- Easy for AI to understand intended behavior and verify changes
- See Running Tests
- PyTorch-compatible API that AI models are already trained on
- Familiar patterns like Module, Linear, SGD, backward()
- Extensive inline documentation and type hints
- Reduces cognitive load when making changes
MIT