Galton Board Softmax O(N²) Replacement


What if probability didn't have to be calculated—what if it could flow?

Galton Lab is a research playground that reimagines how neural networks make predictions. Instead of computing probability distributions the traditional way (softmax over thousands of options), we let probability flow through learned geometric landscapes—like water finding its way downhill.

🎮 Try the Interactive Demos — Learn the concepts from physics to transformers and beyond in your browser!

The Big Idea (in plain English)

Think about how a Galton board works: you drop a ball, it bounces off pegs, and eventually lands in a bucket. The pattern of where balls land creates a probability distribution through physics, not arithmetic.
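For intuition, here is a minimal NumPy simulation of that physical process (illustrative only, not code from this repository): every ball takes a random left or right bounce at each row, and the bucket counts trace out a binomial distribution without any probabilities being computed explicitly.

```python
import numpy as np

# Simulate dropping balls through a classic Galton board:
# each peg deflects a ball left or right with equal probability,
# and the landing positions form a binomial (~ Gaussian) distribution.
rng = np.random.default_rng(0)
n_balls, n_rows = 10_000, 12

# Each row contributes a -1 (left) or +1 (right) bounce.
bounces = rng.choice([-1, 1], size=(n_balls, n_rows))
positions = bounces.sum(axis=1)      # final horizontal offset of each ball

buckets, counts = np.unique(positions, return_counts=True)
probs = counts / n_balls             # a probability distribution, produced by "physics"
for b, p in zip(buckets, probs):
    print(f"bucket {b:+3d}: {'#' * int(200 * p)}")
```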

We're applying this idea to machine learning:

Traditional approach:

Neural network → Calculate probabilities for ALL 50,000 words → Pick one (Expensive, rigid, happens the same way every time)

Our approach:

Neural network → Create a probability landscape → Drop "probes" that flow toward likely tokens (Adaptive, interpretable, uses less compute when confident)

1. Adaptive Compute

When a model is very confident ("The capital of France is ___"), probes converge quickly → fast prediction. When uncertain, probes spread out → the model automatically takes more time. No manual tuning required—it emerges from the physics.

2. Built-in Uncertainty Quantification

You can literally see confidence by watching how probes move. Tight convergence = confident. Spread out = uncertain.

3. Interpretability

Instead of opaque probability numbers, you get trajectories you can visualize. You can watch probability mass flow toward the winning token and understand why it won.

4. Sparse Evaluation

No need to compute probabilities for every token in your vocabulary. The geometry guides probes to likely regions automatically.
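A toy sketch of how the adaptive-compute and uncertainty behavior described above could fall out of the sampling loop (illustrative only; the landing distribution is hard-coded here rather than produced by a learned field): probes are dropped in batches, sampling stops as soon as one bucket dominates, and the spread of the final histogram doubles as an uncertainty estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

def adaptive_sample(bucket_probs, batch=64, max_probes=4096, threshold=0.6):
    """Drop probes in small batches and stop as soon as one bucket dominates.
    The number of probes actually used is itself a confidence signal."""
    counts = np.zeros_like(bucket_probs)
    used = 0
    while used < max_probes:
        landings = rng.choice(len(bucket_probs), size=batch, p=bucket_probs)
        counts += np.bincount(landings, minlength=len(bucket_probs))
        used += batch
        frac = counts / counts.sum()
        if frac.max() >= threshold:                      # tight convergence -> stop early
            break
    entropy = -(frac * np.log(frac + 1e-12)).sum()       # spread = uncertainty
    return frac.argmax(), used, entropy

confident = np.array([0.90, 0.05, 0.03, 0.02])   # "The capital of France is ___"
uncertain = np.array([0.30, 0.28, 0.22, 0.20])
print(adaptive_sample(confident))   # few probes, low entropy
print(adaptive_sample(uncertain))   # many probes, high entropy
```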

The core insight—probability as geometric flow—applies far beyond token prediction. Anywhere you use softmax to make categorical choices, you can replace it with learned flow fields.

Working Examples:

See examples/ for runnable code with visualizations, and docs/use-cases.md for 8+ application domains.

The pattern is universal:

```python
# Anywhere you see this...
logits = network(input)
probs = softmax(logits)

# You can do this instead...
context = network(input)
field = sdf(context)
probes = integrate(field)
probs = density(probes)
```

And get uncertainty quantification, interpretability, and adaptive compute for free.
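As a concrete, hypothetical illustration of the swap (none of these names are the repository's actual API), the sketch below gives a softmax head and a flow head the same signature: the flow head drifts probes toward class anchors on a ring, with the context-dependent scores shaping the drift, and reads the distribution off the final probe density.

```python
import torch

n_classes, dim = 5, 16
W = torch.randn(n_classes, dim) / dim ** 0.5                     # shared scoring weights
anchors = torch.linspace(0, 2 * torch.pi, n_classes + 1)[:-1]    # class positions on a ring

def softmax_head(context):
    # Traditional: score every class, then normalize with softmax.
    return torch.softmax(W @ context, dim=-1)

def flow_head(context, n_probes=512, steps=30, dt=0.1, sigma=0.4):
    # Flow-based: the same scores shape a drift field; probes flow toward
    # strongly scored anchors and their final density replaces softmax.
    scores = W @ context
    theta = torch.rand(n_probes) * 2 * torch.pi
    for _ in range(steps):
        d = torch.remainder(theta[:, None] - anchors + torch.pi, 2 * torch.pi) - torch.pi
        drift = (scores * (-d) * torch.exp(-0.5 * (d / sigma) ** 2)).sum(-1)
        theta = torch.remainder(theta + dt * drift, 2 * torch.pi)
    d = torch.remainder(theta[:, None] - anchors + torch.pi, 2 * torch.pi) - torch.pi
    w = torch.exp(-0.5 * (d / sigma) ** 2)                        # soft bucket assignment
    p = w.sum(0)
    return p / p.sum()

ctx = torch.randn(dim)
print(softmax_head(ctx))
print(flow_head(ctx))   # same interface, distribution produced by flowing probes
```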

What's in This Repository

1. Discrete Galton Boards

The intuitive starting point

Digital versions of physical Galton boards with learnable "pegs" that guide probes left or right. Simple to understand, easy to visualize, and surprisingly effective for small vocabularies.

  • Probes drop through a grid of learned biases
  • Each row nudges probes toward likely tokens
  • Adaptive: stops when one bucket gets enough mass
  • Hierarchical variants for scaling to larger vocabularies

Files: src/galton_lab/board.py, experiments/hierarchical_compare.py
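A minimal sketch of the discrete mechanism described above (hypothetical names and shapes; the real implementation lives in src/galton_lab/board.py): a grid of peg biases, probes that read the bias under them at each row and take a noisy left/right step, and a bucket histogram at the bottom.

```python
import torch

rows, n_buckets, n_probes = 6, 8, 2048
# Peg biases; zeros here for the sketch, learned by gradient descent in the real board.
biases = torch.zeros(rows, n_buckets)

def drop_probes(noise=0.5):
    pos = torch.full((n_probes,), n_buckets // 2)            # all probes start in the middle
    for r in range(rows):
        b = biases[r, pos]                                    # peg bias under each probe
        p_right = torch.sigmoid(b + noise * torch.randn(n_probes))
        step = (torch.rand(n_probes) < p_right).long() * 2 - 1   # -1 = left, +1 = right
        pos = (pos + step).clamp(0, n_buckets - 1)
    # Bucket counts over final positions give the output distribution.
    return torch.bincount(pos, minlength=n_buckets).float() / n_probes

print(drop_probes())   # probability over token buckets, shaped by the pegs
```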

2. Continuous ODE Sampler

The scalable evolution

When discrete boards hit their limits, we move to continuous flow. Probes now follow smooth trajectories on a ring (torus topology), guided by a learned velocity field.

  • Represents probability as a flow through continuous space
  • Uses ODEs (Ordinary Differential Equations) integrated with RK2
  • Learned using neural SDFs (Signed Distance Fields)
  • Scales to real vocabularies while staying differentiable

Files: src/galton_lab/ode/, galton/train.py
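A self-contained sketch of the continuous idea, under assumptions that differ from the repository's actual code (a tiny MLP stands in for the SDF-derived field, and all sizes are arbitrary): probes on a ring are integrated with an RK2 midpoint step, then soft Gaussian windows around K token centers convert final positions into a normalized, differentiable distribution.

```python
import torch

K, n_probes, steps, dt, sigma = 8, 512, 16, 0.1, 0.5

# Stand-in for the learned field: maps a point on the ring to a velocity.
velocity_net = torch.nn.Sequential(torch.nn.Linear(2, 32), torch.nn.Tanh(),
                                    torch.nn.Linear(32, 1))

def v(theta):
    feats = torch.stack([torch.sin(theta), torch.cos(theta)], dim=-1)
    return velocity_net(feats).squeeze(-1)

def sample(theta0):
    theta = theta0
    for _ in range(steps):                        # RK2 (midpoint) integration
        k1 = v(theta)
        k2 = v(theta + 0.5 * dt * k1)
        theta = torch.remainder(theta + dt * k2, 2 * torch.pi)
    centers = torch.arange(K) * 2 * torch.pi / K
    d = torch.remainder(theta[:, None] - centers[None, :] + torch.pi,
                        2 * torch.pi) - torch.pi  # wrapped distance to each token center
    w = torch.exp(-0.5 * (d / sigma) ** 2)        # soft Gaussian bucket assignment
    probs = w.sum(dim=0)
    return probs / probs.sum()                    # differentiable distribution over K tokens

print(sample(torch.rand(n_probes) * 2 * torch.pi))
```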

Supporting Infrastructure

  • Context composers (src/galton_lab/composers.py): Map input context → probability landscapes
  • Training tools (galton/train.py, tests/): GPU-ready training loops with warm-start presets
  • Visualizations (src/galton_lab/visualize.py): Watch probability flow in real-time
  • Documentation (Galton.md, docs/char32_ode_warmstart.md): Deep dives into the theory and practice
Quick Start

```bash
git clone https://github.com/Foundation42/galton.git
cd galton
python -m pip install -e ".[dev]"
pytest   # Run tests to verify everything works
```

See it in action (discrete boards):

```bash
# Visual demo comparing different board architectures
python experiments/hierarchical_compare.py

# Adaptive compute experiment with visualizations
python experiments/adaptive_eval.py --save-plots --write-csv
```

Train a model (continuous ODE sampler):

```bash
# Simple toy task (ABCD pattern prediction)
python galton/train.py --task abcd --device auto --amp \
    --per-example-fields --batch 8192

# Character-level language model
python galton/train.py --task char32 --device auto --amp \
    --sampler ode --batch 4096 --warm-start-preset char32
```
| Flag | What it does |
| --- | --- |
| `--device auto` | Use GPU if available, else CPU |
| `--amp` | Mixed precision training (faster on GPU) |
| `--sampler ode` | Use continuous flow instead of discrete board |
| `--warm-start-preset char32` | Use proven initialization for character models |
| `--auto-handoff` | Automatically transition from warm-start to sharpening phase |
| `--compile` | JIT compile with PyTorch 2.0+ (even faster) |

How It Works (For the Curious)

From Discrete to Continuous

Discrete Galton Board:

Input: "The cat sat on the" ↓ [Configure peg biases based on context] ↓ [Drop N probe particles] ↓ Each probe bounces through rows: - Read peg bias at current position - Add noise - Move left or right - Repeat for each row ↓ [Count probes in each token bucket] ↓ Output: "mat" (bucket with most probes wins)

Continuous ODE Sampler:

Input: "The cat sat on the" ↓ [Neural network creates a velocity field on a ring] ↓ [Integrate probe trajectories using ODEs] - Probes follow smooth curves - Velocity field guides them toward likely tokens - Integration uses RK2 (Runge-Kutta 2nd order) ↓ [Soft bucket assignment using Gaussian windows] ↓ Output: "mat" (highest probability mass)
Training proceeds in three phases (a rough code sketch follows the list):

  1. Warm Start: Begin with soft, spread-out probability landscapes

    • High sigma (σ=0.9) for wide Gaussian windows
    • Directional bias to break symmetry
    • Knowledge distillation from a simple "teacher" model
  2. Auto Handoff: System detects when model is confident

    • Monitors margin (gap between top choices)
    • Checks if target token probability is sufficient
  3. Sharpening: Tighten the focus

    • Reduce sigma (σ=0.5) for narrower peaks
    • Remove training wheels (bias, distillation)
    • Pure cross-entropy optimization
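The schedule above could be expressed roughly like this (thresholds and names are illustrative, not the repository's actual configuration):

```python
# Hypothetical sketch of the three-phase schedule: warm start, auto handoff, sharpening.
def training_phase(margin, target_prob, handed_off):
    """Return the hyperparameters to use at the current step."""
    if not handed_off:
        # Phase 1: warm start — wide Gaussian windows, symmetry-breaking bias,
        # and distillation from a simple teacher.
        cfg = dict(sigma=0.9, directional_bias=True, distill_weight=1.0)
        # Phase 2: auto handoff — switch once the model looks confident
        # (gap between top choices is large enough and the target is likely).
        if margin > 0.2 and target_prob > 0.5:
            handed_off = True
    if handed_off:
        # Phase 3: sharpening — narrow windows, training wheels off, pure cross-entropy.
        cfg = dict(sigma=0.5, directional_bias=False, distill_weight=0.0)
    return cfg, handed_off

print(training_phase(margin=0.05, target_prob=0.3, handed_off=False))  # still warming up
print(training_phase(margin=0.40, target_prob=0.7, handed_off=False))  # hands off, sharpens
```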

Near-Term Research Directions

  • Scale to production vocabularies (10k-50k tokens) using hierarchical routing
  • Integrate with real transformers as a drop-in softmax replacement
  • Stochastic variants (SDEs) for better exploration during training
  • Comparative benchmarks against standard sampling methods
  • Language models with built-in uncertainty quantification
  • Reinforcement learning with interpretable policy flows
  • Structured generation where grammar rules shape the probability landscape
  • Any domain with periodic structure (audio, time series, molecular conformations)
Open Questions

  • Can geometric flow matching replace all categorical distributions?
  • Does this connect to diffusion models, optimal transport, or energy-based learning?
  • Can we prove convergence guarantees for the adaptive compute property?
Learn More

  • Interactive Demos — Two interactive journeys:
  • examples/ — Runnable Python examples: image classification, attention, RL policies
  • docs/use-cases.md — 8+ application domains beyond language models
  • Galton.md — The complete origin story: from 4am idea to working prototype
  • docs/char32_ode_warmstart.md — Deep technical dive on the continuous ODE sampler
  • experiments/ — Additional experiments with visualizations
  • tests/ — Unit tests that double as usage examples
Repository Structure

```text
galton/
├── src/galton_lab/          # Core library
│   ├── board.py             # Discrete Galton board logic
│   ├── ode/                 # Continuous flow sampler (SDFs, integration)
│   ├── composers.py         # Context → probability landscape mapping
│   ├── torch_modules.py     # PyTorch-wrapped samplers
│   └── visualize.py         # Plotting and diagnostics
├── galton/train.py          # Main training script with presets
├── experiments/             # Standalone demos and analyses
├── tests/                   # Geometry invariants and regression tests
├── docs/                    # Technical documentation
└── sketches/                # Early prototypes and explorations
```

This is an active research project. We welcome:

  • Experiments — Try it on new tasks and share results
  • Visualizations — Make the flow more intuitive
  • Theory — Connect to related mathematical frameworks
  • Critique — Tell us where this breaks or why it won't scale

Open an issue to discuss ideas or submit a PR with improvements.

If you build on this work, please cite:

```bibtex
@software{galton_lab_2025,
  author = {Christian Beaumont and Anthropic Claude (Chat and Code) and DeepSeek and OpenAI (GPT-4o and Codex)},
  title = {Galton Lab: Probability Sampling Through Learned Flow Fields},
  year = {2025},
  url = {https://github.com/Foundation42/galton}
}
```

MIT — See LICENSE file for details.


"In a world full of edges, be a torus."
