Galton Board Softmax O(N²) Replacement


What if probability didn't have to be calculated—what if it could flow?

Galton Lab is a research playground that reimagines how neural networks make predictions. Instead of computing probability distributions the traditional way (softmax over thousands of options), we let probability flow through learned geometric landscapes—like water finding its way downhill.

🎮 Try the Interactive Demos — Learn the concepts from physics to transformers and beyond in your browser!

The Big Idea (in plain English)

Think about how a Galton board works: you drop a ball, it bounces off pegs, and eventually lands in a bucket. The pattern of where balls land creates a probability distribution through physics, not arithmetic.
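For intuition, here is a minimal NumPy simulation of that physical process (illustrative only, not code from this repository): every ball takes a random left or right bounce at each row, and the bucket counts trace out a binomial distribution without any probabilities being computed explicitly.

```python
import numpy as np

# Simulate dropping balls through a classic Galton board:
# each peg deflects a ball left or right with equal probability,
# and the landing positions form a binomial (~ Gaussian) distribution.
rng = np.random.default_rng(0)
n_balls, n_rows = 10_000, 12

# Each row contributes a -1 (left) or +1 (right) bounce.
bounces = rng.choice([-1, 1], size=(n_balls, n_rows))
positions = bounces.sum(axis=1)      # final horizontal offset of each ball

buckets, counts = np.unique(positions, return_counts=True)
probs = counts / n_balls             # a probability distribution, produced by "physics"
for b, p in zip(buckets, probs):
    print(f"bucket {b:+3d}: {'#' * int(200 * p)}")
```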

We're applying this idea to machine learning:

Traditional approach:

Neural network → Calculate probabilities for ALL 50,000 words → Pick one (Expensive, rigid, happens the same way every time)

Our approach:

Neural network → Create a probability landscape → Drop "probes" that flow toward likely tokens (Adaptive, interpretable, uses less compute when confident)

1. Adaptive Compute

When a model is very confident ("The capital of France is ___"), probes converge quickly → fast prediction. When uncertain, probes spread out → the model automatically takes more time. No manual tuning required—it emerges from the physics.

2. Built-in Uncertainty Quantification

You can literally see confidence by watching how probes move. Tight convergence = confident. Spread out = uncertain.

3. Interpretability

Instead of opaque probability numbers, you get trajectories you can visualize. You can watch probability mass flow toward the winning token and understand why it won.

4. Sparse Evaluation

No need to compute probabilities for every token in your vocabulary. The geometry guides probes to likely regions automatically.
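A toy sketch of how the adaptive-compute and uncertainty behavior described above could fall out of the sampling loop (illustrative only; the landing distribution is hard-coded here rather than produced by a learned field): probes are dropped in batches, sampling stops as soon as one bucket dominates, and the spread of the final histogram doubles as an uncertainty estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

def adaptive_sample(bucket_probs, batch=64, max_probes=4096, threshold=0.6):
    """Drop probes in small batches and stop as soon as one bucket dominates.
    The number of probes actually used is itself a confidence signal."""
    counts = np.zeros_like(bucket_probs)
    used = 0
    while used < max_probes:
        landings = rng.choice(len(bucket_probs), size=batch, p=bucket_probs)
        counts += np.bincount(landings, minlength=len(bucket_probs))
        used += batch
        frac = counts / counts.sum()
        if frac.max() >= threshold:                      # tight convergence -> stop early
            break
    entropy = -(frac * np.log(frac + 1e-12)).sum()       # spread = uncertainty
    return frac.argmax(), used, entropy

confident = np.array([0.90, 0.05, 0.03, 0.02])   # "The capital of France is ___"
uncertain = np.array([0.30, 0.28, 0.22, 0.20])
print(adaptive_sample(confident))   # few probes, low entropy
print(adaptive_sample(uncertain))   # many probes, high entropy
```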

The core insight—probability as geometric flow—applies far beyond token prediction. Anywhere you use softmax to make categorical choices, you can replace it with learned flow fields.

Working Examples:

See examples/ for runnable code with visualizations, and docs/use-cases.md for 8+ application domains.

The pattern is universal:

```python
# Anywhere you see this...
logits = network(input)
probs = softmax(logits)

# You can do this instead...
context = network(input)
field = sdf(context)
probes = integrate(field)
probs = density(probes)
```

And get uncertainty quantification, interpretability, and adaptive compute for free.
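As a concrete, hypothetical illustration of the swap (none of these names are the repository's actual API), the sketch below gives a softmax head and a flow head the same signature: the flow head drifts probes toward class anchors on a ring, with the context-dependent scores shaping the drift, and reads the distribution off the final probe density.

```python
import torch

n_classes, dim = 5, 16
W = torch.randn(n_classes, dim) / dim ** 0.5                     # shared scoring weights
anchors = torch.linspace(0, 2 * torch.pi, n_classes + 1)[:-1]    # class positions on a ring

def softmax_head(context):
    # Traditional: score every class, then normalize with softmax.
    return torch.softmax(W @ context, dim=-1)

def flow_head(context, n_probes=512, steps=30, dt=0.1, sigma=0.4):
    # Flow-based: the same scores shape a drift field; probes flow toward
    # strongly scored anchors and their final density replaces softmax.
    scores = W @ context
    theta = torch.rand(n_probes) * 2 * torch.pi
    for _ in range(steps):
        d = torch.remainder(theta[:, None] - anchors + torch.pi, 2 * torch.pi) - torch.pi
        drift = (scores * (-d) * torch.exp(-0.5 * (d / sigma) ** 2)).sum(-1)
        theta = torch.remainder(theta + dt * drift, 2 * torch.pi)
    d = torch.remainder(theta[:, None] - anchors + torch.pi, 2 * torch.pi) - torch.pi
    w = torch.exp(-0.5 * (d / sigma) ** 2)                        # soft bucket assignment
    p = w.sum(0)
    return p / p.sum()

ctx = torch.randn(dim)
print(softmax_head(ctx))
print(flow_head(ctx))   # same interface, distribution produced by flowing probes
```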

What's in This Repository

1. Discrete Galton Boards

The intuitive starting point

Digital versions of physical Galton boards with learnable "pegs" that guide probes left or right. Simple to understand, easy to visualize, and surprisingly effective for small vocabularies.

  • Probes drop through a grid of learned biases
  • Each row nudges probes toward likely tokens
  • Adaptive: stops when one bucket gets enough mass
  • Hierarchical variants for scaling to larger vocabularies

Files: src/galton_lab/board.py, experiments/hierarchical_compare.py
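A minimal sketch of the discrete mechanism described above (hypothetical names and shapes; the real implementation lives in src/galton_lab/board.py): a grid of peg biases, probes that read the bias under them at each row and take a noisy left/right step, and a bucket histogram at the bottom.

```python
import torch

rows, n_buckets, n_probes = 6, 8, 2048
# Peg biases; zeros here for the sketch, learned by gradient descent in the real board.
biases = torch.zeros(rows, n_buckets)

def drop_probes(noise=0.5):
    pos = torch.full((n_probes,), n_buckets // 2)            # all probes start in the middle
    for r in range(rows):
        b = biases[r, pos]                                    # peg bias under each probe
        p_right = torch.sigmoid(b + noise * torch.randn(n_probes))
        step = (torch.rand(n_probes) < p_right).long() * 2 - 1   # -1 = left, +1 = right
        pos = (pos + step).clamp(0, n_buckets - 1)
    # Bucket counts over final positions give the output distribution.
    return torch.bincount(pos, minlength=n_buckets).float() / n_probes

print(drop_probes())   # probability over token buckets, shaped by the pegs
```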

2. Continuous ODE Sampler

The scalable evolution

When discrete boards hit their limits, we move to continuous flow. Probes now follow smooth trajectories on a ring (torus topology), guided by a learned velocity field.

  • Represents probability as a flow through continuous space
  • Uses ODEs (Ordinary Differential Equations) integrated with RK2
  • Learned using neural SDFs (Signed Distance Fields)
  • Scales to real vocabularies while staying differentiable

Files: src/galton_lab/ode/, galton/train.py
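A self-contained sketch of the continuous idea, under assumptions that differ from the repository's actual code (a tiny MLP stands in for the SDF-derived field, and all sizes are arbitrary): probes on a ring are integrated with an RK2 midpoint step, then soft Gaussian windows around K token centers convert final positions into a normalized, differentiable distribution.

```python
import torch

K, n_probes, steps, dt, sigma = 8, 512, 16, 0.1, 0.5

# Stand-in for the learned field: maps a point on the ring to a velocity.
velocity_net = torch.nn.Sequential(torch.nn.Linear(2, 32), torch.nn.Tanh(),
                                    torch.nn.Linear(32, 1))

def v(theta):
    feats = torch.stack([torch.sin(theta), torch.cos(theta)], dim=-1)
    return velocity_net(feats).squeeze(-1)

def sample(theta0):
    theta = theta0
    for _ in range(steps):                        # RK2 (midpoint) integration
        k1 = v(theta)
        k2 = v(theta + 0.5 * dt * k1)
        theta = torch.remainder(theta + dt * k2, 2 * torch.pi)
    centers = torch.arange(K) * 2 * torch.pi / K
    d = torch.remainder(theta[:, None] - centers[None, :] + torch.pi,
                        2 * torch.pi) - torch.pi  # wrapped distance to each token center
    w = torch.exp(-0.5 * (d / sigma) ** 2)        # soft Gaussian bucket assignment
    probs = w.sum(dim=0)
    return probs / probs.sum()                    # differentiable distribution over K tokens

print(sample(torch.rand(n_probes) * 2 * torch.pi))
```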

Supporting Infrastructure

  • Context composers (src/galton_lab/composers.py): Map input context → probability landscapes
  • Training tools (galton/train.py, tests/): GPU-ready training loops with warm-start presets
  • Visualizations (src/galton_lab/visualize.py): Watch probability flow in real-time
  • Documentation (Galton.md, docs/char32_ode_warmstart.md): Deep dives into the theory and practice
Quick Start

```bash
git clone https://github.com/Foundation42/galton.git
cd galton
python -m pip install -e ".[dev]"
pytest   # Run tests to verify everything works
```

See it in action (discrete boards):

```bash
# Visual demo comparing different board architectures
python experiments/hierarchical_compare.py

# Adaptive compute experiment with visualizations
python experiments/adaptive_eval.py --save-plots --write-csv
```

Train a model (continuous ODE sampler):

```bash
# Simple toy task (ABCD pattern prediction)
python galton/train.py --task abcd --device auto --amp \
    --per-example-fields --batch 8192

# Character-level language model
python galton/train.py --task char32 --device auto --amp \
    --sampler ode --batch 4096 --warm-start-preset char32
```
| Flag | What it does |
| --- | --- |
| `--device auto` | Use GPU if available, else CPU |
| `--amp` | Mixed precision training (faster on GPU) |
| `--sampler ode` | Use continuous flow instead of discrete board |
| `--warm-start-preset char32` | Use proven initialization for character models |
| `--auto-handoff` | Automatically transition from warm-start to sharpening phase |
| `--compile` | JIT compile with PyTorch 2.0+ (even faster) |

How It Works (For the Curious)

From Discrete to Continuous

Discrete Galton Board:

Input: "The cat sat on the" ↓ [Configure peg biases based on context] ↓ [Drop N probe particles] ↓ Each probe bounces through rows: - Read peg bias at current position - Add noise - Move left or right - Repeat for each row ↓ [Count probes in each token bucket] ↓ Output: "mat" (bucket with most probes wins)

Continuous ODE Sampler:

Input: "The cat sat on the" ↓ [Neural network creates a velocity field on a ring] ↓ [Integrate probe trajectories using ODEs] - Probes follow smooth curves - Velocity field guides them toward likely tokens - Integration uses RK2 (Runge-Kutta 2nd order) ↓ [Soft bucket assignment using Gaussian windows] ↓ Output: "mat" (highest probability mass)
Training proceeds in three phases (a rough code sketch follows the list):

  1. Warm Start: Begin with soft, spread-out probability landscapes

    • High sigma (σ=0.9) for wide Gaussian windows
    • Directional bias to break symmetry
    • Knowledge distillation from a simple "teacher" model
  2. Auto Handoff: System detects when model is confident

    • Monitors margin (gap between top choices)
    • Checks if target token probability is sufficient
  3. Sharpening: Tighten the focus

    • Reduce sigma (σ=0.5) for narrower peaks
    • Remove training wheels (bias, distillation)
    • Pure cross-entropy optimization
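The schedule above could be expressed roughly like this (thresholds and names are illustrative, not the repository's actual configuration):

```python
# Hypothetical sketch of the three-phase schedule: warm start, auto handoff, sharpening.
def training_phase(margin, target_prob, handed_off):
    """Return the hyperparameters to use at the current step."""
    if not handed_off:
        # Phase 1: warm start — wide Gaussian windows, symmetry-breaking bias,
        # and distillation from a simple teacher.
        cfg = dict(sigma=0.9, directional_bias=True, distill_weight=1.0)
        # Phase 2: auto handoff — switch once the model looks confident
        # (gap between top choices is large enough and the target is likely).
        if margin > 0.2 and target_prob > 0.5:
            handed_off = True
    if handed_off:
        # Phase 3: sharpening — narrow windows, training wheels off, pure cross-entropy.
        cfg = dict(sigma=0.5, directional_bias=False, distill_weight=0.0)
    return cfg, handed_off

print(training_phase(margin=0.05, target_prob=0.3, handed_off=False))  # still warming up
print(training_phase(margin=0.40, target_prob=0.7, handed_off=False))  # hands off, sharpens
```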

Near-Term Research Directions

  • Scale to production vocabularies (10k-50k tokens) using hierarchical routing
  • Integrate with real transformers as a drop-in softmax replacement
  • Stochastic variants (SDEs) for better exploration during training
  • Comparative benchmarks against standard sampling methods
  • Language models with built-in uncertainty quantification
  • Reinforcement learning with interpretable policy flows
  • Structured generation where grammar rules shape the probability landscape
  • Any domain with periodic structure (audio, time series, molecular conformations)
Open Questions

  • Can geometric flow matching replace all categorical distributions?
  • Does this connect to diffusion models, optimal transport, or energy-based learning?
  • Can we prove convergence guarantees for the adaptive compute property?
Learn More

  • Interactive Demos — Two interactive journeys:
  • examples/ — Runnable Python examples: image classification, attention, RL policies
  • docs/use-cases.md — 8+ application domains beyond language models
  • Galton.md — The complete origin story: from 4am idea to working prototype
  • docs/char32_ode_warmstart.md — Deep technical dive on the continuous ODE sampler
  • experiments/ — Additional experiments with visualizations
  • tests/ — Unit tests that double as usage examples
Repository Structure

```text
galton/
├── src/galton_lab/          # Core library
│   ├── board.py             # Discrete Galton board logic
│   ├── ode/                 # Continuous flow sampler (SDFs, integration)
│   ├── composers.py         # Context → probability landscape mapping
│   ├── torch_modules.py     # PyTorch-wrapped samplers
│   └── visualize.py         # Plotting and diagnostics
├── galton/train.py          # Main training script with presets
├── experiments/             # Standalone demos and analyses
├── tests/                   # Geometry invariants and regression tests
├── docs/                    # Technical documentation
└── sketches/                # Early prototypes and explorations
```

This is an active research project. We welcome:

  • Experiments — Try it on new tasks and share results
  • Visualizations — Make the flow more intuitive
  • Theory — Connect to related mathematical frameworks
  • Critique — Tell us where this breaks or why it won't scale

Open an issue to discuss ideas or submit a PR with improvements.

If you build on this work, please cite:

```bibtex
@software{galton_lab_2025,
  author = {Christian Beaumont and Anthropic Claude (Chat and Code) and DeepSeek and OpenAI (GPT-4o and Codex)},
  title = {Galton Lab: Probability Sampling Through Learned Flow Fields},
  year = {2025},
  url = {https://github.com/Foundation42/galton}
}
```

MIT — See LICENSE file for details.


"In a world full of edges, be a torus."
