Protocol Learning
Node0-7.5B is the first pretraining run open to the public where anyone can join and contribute with a consumer-grade (16 GB+) GPU and an Internet connection. This run is a proof of concept: because it uses completely novel networking and distributed training implementations, many components are being tested for the first time. The purpose of this run is to evaluate stability and convergence in an open multi-party setting. A detailed technical report will follow after the run concludes.
Model Parallelism
Model parallelism is the collective name for strategies that break up a large model and distribute it across multiple GPUs, enabling training at scales impossible on a single device. It is the standard approach used to train today's largest models. Traditionally, this requires extremely fast communication between devices, as activations and gradients must be transferred at every step — something only feasible in datacenter environments with high-speed connections (≥ 100 Gbps). Node0's unique contribution is that, for the first time, model-parallel training is being carried out over Internet connections rather than within a datacenter.
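To make the communication pattern concrete, here is a minimal, hedged sketch of pipeline-style model parallelism in PyTorch: a stack of Transformer blocks is split into stages placed on different devices, and activations (forward) and gradients (backward) cross every stage boundary on every step. The `Stage` and `run_step` names are illustrative only, not Node0's actual API; in Node0 the boundary crossing happens over the Internet rather than over an intra-node interconnect.

```python
import torch
import torch.nn as nn


class Stage(nn.Module):
    """A contiguous slice of Transformer blocks owned by one participant/device."""

    def __init__(self, d_model: int, n_layers: int, device: str):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_layers).to(device)
        self.device = device

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The activation tensor must be moved onto this stage's device;
        # in Node0 this hop is an Internet transfer, not NVLink/PCIe.
        return self.blocks(x.to(self.device))


def run_step(stages, batch, targets, optimizer):
    # Forward: activations flow stage -> stage, one transfer per boundary.
    h = batch
    for stage in stages:
        h = stage(h)
    loss = nn.functional.mse_loss(h, targets.to(h.device))
    # Backward: gradients flow back across the same boundaries.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    devices = ["cuda:0", "cuda:1"] if torch.cuda.device_count() >= 2 else ["cpu", "cpu"]
    stages = [Stage(d_model=256, n_layers=2, device=d) for d in devices]
    optimizer = torch.optim.AdamW([p for s in stages for p in s.parameters()], lr=1e-4)
    x = torch.randn(4, 32, 256)   # (batch, sequence, d_model)
    y = torch.randn(4, 32, 256)   # toy regression targets
    print("loss:", run_step(stages, x, y, optimizer))
```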
Compression
To split the model itself across participants, we use a novel compression algorithm that constrains the output projection weights of Transformer blocks to a shared, learned low-dimensional subspace. Leveraging these constrained weights alongside the recursive structure of Transformers, we achieve over 99% compression in both the forward and backward passes while preserving convergence. For further details, see our Protocol Models paper.
| | Original | Compressed (Principal Components) |
|---|---|---|
| Size | 100 MB | 1 MB |
| Transfer Time | 8 s | 0.08 s |
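As a rough illustration of the subspace idea above, the sketch below factors a block's output projection through a small shared basis, so only low-dimensional coefficients need to cross the network and the receiving stage reconstructs the full-width activation with its own copy of the basis. This is a hedged toy under assumed shapes, not the exact algorithm from the Protocol Models paper; the names `SubspaceOutputProjection` and `shared_basis` and the choice k = 32 are illustrative assumptions.

```python
import torch
import torch.nn as nn


class SubspaceOutputProjection(nn.Module):
    """Output projection constrained to a shared low-dimensional subspace.

    The effective weight is basis @ coeff.weight, so the block's output always
    lies in the k-dimensional column space of `basis`, of which every
    participant holds a copy.
    """

    def __init__(self, d_model: int, k: int, shared_basis: nn.Parameter):
        super().__init__()
        self.basis = shared_basis                        # (d_model, k), learned, shared
        self.coeff = nn.Linear(d_model, k, bias=False)   # per-block coefficients

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.coeff(x)        # (..., k): the only payload sent over the wire
        return z @ self.basis.T  # receiver reconstructs the (..., d_model) output


d_model, k = 4096, 32
shared_basis = nn.Parameter(torch.randn(d_model, k) / d_model ** 0.5)
proj = SubspaceOutputProjection(d_model, k, shared_basis)

x = torch.randn(2, 128, d_model)          # (batch, sequence, d_model)
z = proj.coeff(x)                         # what would actually be transmitted
print(f"payload reduced by {1 - z.numel() / x.numel():.1%}")  # ~99.2% for k = 32
```

At a ratio like this, the 100 MB boundary transfer in the table above shrinks to roughly 1 MB, cutting the transfer time from 8 s to 0.08 s on the same link.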