Show HN: Run HF Transformers in pure Go (10 MB binary, no Python)
A high-performance GPU-accelerated neural network framework written in Go, featuring WebGPU compute shaders for parallel execution and WebAssembly export for browser deployment. Now with transformer inference support!
🎉 NEW: Full transformer inference in browser WASM! SmolLM2-135M-Instruct generates coherent text entirely in the browser with a pure Go implementation.
🤯 BREAKTHROUGH: LOOM's Softmax layer includes native Mixture of Experts (MoE) via Grid Softmax, the same routing architecture used in GPT-4, Switch Transformer, and Mixtral. It is mathematically equivalent to a conventional MoE layer, demonstrated with 97.1% loss reduction and perfect gradient matching; see examples/moe_proof_demo.go for the rigorous proof.
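For intuition, here is a minimal standalone sketch of the mechanism (an illustration only, not LOOM's API): Grid Softmax amounts to applying softmax independently to each row of a grid of router logits, so every row becomes a gating distribution over experts. The real layer and the equivalence proof live in the nn package and examples/moe_proof_demo.go.

package main

import (
    "fmt"
    "math"
)

// rowSoftmax applies softmax independently to each row of a grid.
// Each row then sums to 1 and can act as a gating distribution
// over that row's experts, which is the core of MoE routing.
func rowSoftmax(grid [][]float64) [][]float64 {
    out := make([][]float64, len(grid))
    for i, row := range grid {
        maxV := math.Inf(-1)
        for _, v := range row {
            if v > maxV {
                maxV = v
            }
        }
        exps := make([]float64, len(row))
        sum := 0.0
        for j, v := range row {
            exps[j] = math.Exp(v - maxV) // subtract max for numerical stability
            sum += exps[j]
        }
        for j := range exps {
            exps[j] /= sum
        }
        out[i] = exps
    }
    return out
}

func main() {
    // One row of router logits over 4 experts.
    gates := rowSoftmax([][]float64{{2.0, 0.5, -1.0, 0.1}})
    fmt.Println(gates[0]) // gating weights summing to 1
}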
Loom is a modern neural network framework that combines the simplicity of Go with the power of GPU acceleration via WebGPU. It supports multiple layer types, flexible grid-based architectures, and provides both CPU and GPU execution paths with automatic gradient computation. The framework can be compiled to WebAssembly for running neural networks and transformer inference directly in the browser.
Example transformer output (SmolLM2-135M in browser):
Prompt: "Once upon a time"
Output: "hi
I'm excited to see what you come up with! Let me know if you have any"
Performance: CPU implementations are production-ready and performant. GPU acceleration provides 10-100x speedup for Dense/Conv2D/Attention on large batches.
🎨 Softmax Layer - The Unique Feature
LOOM makes softmax a first-class layer (not just a function), with ten variants including Grid Softmax for native MoE routing.
🔍 Runtime Introspection
LOOM also exposes its full API at runtime (a reflection sketch follows this list), enabling:
Method Discovery: Query all available network methods at runtime
Signature Inspection: Get parameter types and return values for any method
JSON Metadata: Export complete API documentation as JSON
WASM Integration: Automatic exposure of Go methods to JavaScript
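The implementation behind this lives in nn/introspection.go. As a rough illustration of the underlying technique (the Network stub below is hypothetical, standing in for nn.Network), Go's reflect package is what makes this kind of discovery possible:

package main

import (
    "fmt"
    "reflect"
)

// Network is a hypothetical stub standing in for nn.Network.
type Network struct{}

func (n *Network) Forward(input []float32) []float32 { return input }
func (n *Network) TotalLayers() int                  { return 0 }

func main() {
    t := reflect.TypeOf(&Network{})
    // Enumerate every exported method with its full signature,
    // the same information LOOM can export as JSON metadata.
    for i := 0; i < t.NumMethod(); i++ {
        m := t.Method(i)
        fmt.Printf("%s: %v\n", m.Name, m.Type)
    }
}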
loom/
├── nn/ # Neural network package
│ ├── types.go # Core types and structures
│ ├── registry.go # Layer initialization function registry
│ ├── forward.go # Forward propagation (CPU/GPU)
│ ├── backward.go # Backward propagation (CPU/GPU)
│ ├── gpu.go # WebGPU initialization and shaders
│ ├── attention.go # Multi-Head Attention implementation
│ ├── attention_gpu.go # MHA GPU kernels
│ ├── cnn.go # Conv2D implementation
│ ├── conv2d_gpu.go # Conv2D GPU kernels
│ ├── rnn.go # RNN implementation
│ ├── lstm.go # LSTM implementation
│ ├── training.go # Training loop with evaluation support
│ ├── evaluation.go # DeviationMetrics evaluation system
│ ├── introspection.go # Runtime method discovery
│ ├── serialization.go # Model save/load
│ ├── transformer.go # Transformer model loading and inference
│ └── README.md # Detailed package documentation
│
├── tokenizer/ # Pure Go BPE tokenizer
│ ├── bpe.go # Byte Pair Encoding implementation
│ ├── tokenizer.go # HuggingFace tokenizer.json loader
│ └── README.md # Tokenizer documentation and examples
│
├── wasm/ # WebAssembly module
│ ├── main.go # WASM wrapper with type conversion
│ ├── inference.go # Transformer inference exports for WASM
│ ├── build.sh # Build script for WASM compilation
│ ├── example.html # Interactive browser demo
│ ├── inference.html # Transformer inference demo
│ └── README.md # WASM documentation and examples
│
├── cabi/ # C ABI for FFI
│ ├── main.go # C foreign function interface
│ ├── transformer.go # Transformer inference C exports
│ ├── simple_bench.c # C benchmark program
│ ├── build.sh # Build script for shared library
│ └── README.md # C API reference and examples
│
├── python/ # Python package (welvet)
│ ├── pyproject.toml # Python package configuration
│ ├── README.md # Python package documentation
│ ├── src/welvet/ # Python bindings via ctypes
│ │ ├── __init__.py # Package initialization
│ │ ├── utils.py # High-level Python API
│ │ └── */ # Multi-platform C libraries
│ └── examples/ # Python examples
│ ├── test_transformer.py # CLI inference example
│ └── transformer_web_interface.py # Web UI with streaming
│
├── model_conversion/ # Model import & pure Go inference
│ ├── README.md # Conversion documentation
│ ├── requirements.txt # Python dependencies
│ ├── convert_tiny.py # BERT/tiny model converter
│ ├── convert_model.py # General model converter
│ ├── serve_model_bytes.go # Pure Go model serving
│ ├── web_interface.go # Pure Go web interface
│ └── verify_bert_weights.py # Weight verification tool
│
├── typescript/ # TypeScript/WASM package
│ ├── package.json # npm package configuration
│ ├── README.md # TypeScript package documentation
│ ├── src/ # TypeScript bindings
│ │ ├── index.ts # Main WASM loader
│ │ ├── transformer.ts # Transformer API wrapper
│ │ └── types.ts # TypeScript type definitions
│ └── examples/ # TypeScript examples
│ ├── transformer.ts # Node.js inference example
│ └── transformer.html # Browser demo with streaming
│
├── csharp/ # C#/.NET package (Welvet)
│ ├── Welvet.csproj # NuGet package configuration
│ ├── NativeMethods.cs # P/Invoke declarations (C-ABI)
│ ├── Network.cs # High-level managed API
│ ├── Transformer.cs # Transformer inference API (NEW!)
│ ├── Activation.cs # Activation enum
│ ├── README.md # C# package documentation
│ ├── runtimes/ # Native libraries per platform
│ └── examples/ # C# example programs
│ ├── TransformerTest.cs # CLI inference example
│ └── TransformerWebInterface.cs # Web UI with streaming
│
├── fabric/ # Demo application
│ ├── main.go # Interactive demo menu
│ ├── demos/ # Individual layer demos
│ └── examples/ # Benchmarks and tests
│
├── pods/ # GPU compute pods (primitives)
│ ├── ml_gemm.go # Matrix multiplication
│ ├── ml_softmax_norm.go # Softmax and normalization
│ ├── primitives_scan.go # Parallel prefix scan
│ └── ...
│
└── detector/ # GPU device detection
├── detector.go # Hardware capability detection
└── detector_wasm.go # WASM stub (GPU N/A in browser)
# Clone the repository
git clone https://github.com/openfluke/loom.git
cd loom
# Install dependencies
go mod download
# Build the demo application
cd fabric
go build
Option A: Import Pre-trained Models
Convert and use pre-trained transformer models from HuggingFace:
# Install Python dependencies
cd model_conversion
pip install -r requirements.txt
# Convert BERT-Tiny (4MB, 2 layers)
python3 convert_tiny.py
# Select option 1 for BERT-Tiny

# Verify the conversion
python3 verify_bert_weights.py
# ✅ Expected: 54% similarity (weights working!)

# Test in Go
go run run_bert_tiny.go
✨ Model Serialization - Save & Load Complete Networks
The Easy Way - One Function Call:
// Save a trained model (includes all weights and configuration)
err := network.SaveModel("model.json", "my_model")

// Load it back - ONE LINE! Everything restored automatically
loadedNet, err := nn.LoadModel("model.json", "my_model")

// Done! All layers, weights, and configuration loaded.
// Or use strings (great for APIs/databases)
jsonString, err := network.SaveModelToString("my_model")
loadedNet, err = nn.LoadModelFromString(jsonString, "my_model")
Example Test: See examples/all_layers_validation.go for a complete demo with all 6 layer types + 10 softmax variants (16 layers total)
cd examples
go run all_layers_validation.go
# Creates: test.json, inputs.txt, outputs.txt
# Tests: save → load → verify → train
🤖 Transformer Inference - Run LLMs in Browser or Python
Run pretrained transformer models like SmolLM2-135M entirely client-side:
Python (Server or CLI):
import welvet

# Load tokenizer and model
tokenizer = welvet.load_tokenizer_from_bytes(open("tokenizer.json", "rb").read())
model = welvet.load_transformer_from_bytes(
    open("config.json", "rb").read(),
    open("model.safetensors", "rb").read(),
)

# Generate text with streaming
for token in welvet.generate_text_stream("The capital of France is", max_tokens=50):
    print(token, end="", flush=True)
TypeScript/Browser (100% Client-Side):
import { initLoom, createTransformerAPI } from "@openfluke/welvet";

await initLoom();
const transformer = await createTransformerAPI();

// Load from URLs (or File API)
await transformer.loadTokenizer(tokenizerData);
await transformer.loadModel(configData, weightsData);

// Stream tokens in real-time
for await (const token of transformer.generateStream(prompt, 50, 0.7)) {
  console.log(token); // Updates UI immediately
}
Load and use converted BERT models from HuggingFace:
package main

import (
    "fmt"

    "github.com/openfluke/loom/nn"
)

func main() {
    // Load converted BERT-Tiny model
    network, err := nn.LoadImportedModel("model_conversion/bert-tiny.json", "bert-tiny")
    if err != nil {
        panic(err)
    }

    fmt.Printf("Loaded BERT with %d layers\n", network.TotalLayers())
    // Output: Loaded BERT with 10 layers
    // 2 transformer blocks: [MHA, LayerNorm, Dense, Dense, LayerNorm] × 2

    // Create embeddings (from tokenizer + embedding layer)
    seqLength := 128
    hiddenSize := 128
    embeddings := make([]float32, seqLength*hiddenSize)
    // ... fill with word + position embeddings from the BERT tokenizer

    // Run forward pass through the transformer
    output, _ := network.ForwardCPU(embeddings)

    // Output: contextual embeddings for each token
    fmt.Printf("Output shape: %d values (%d tokens × %d hidden)\n",
        len(output), seqLength, hiddenSize)
}
Convert your own models:
cd model_conversion
python3 convert_tiny.py # Select BERT-Tiny, Mini, or custom
python3 verify_bert_weights.py # Verify 54% similarity
go run run_bert_tiny.go # Test in Go
From simple_bench.c (784→392→10 network, 100 iterations):
CPU Forward: 100 iterations in 36.93 ms (avg: 0.3693 ms/iter)
GPU Forward: 100 iterations in 296.38 ms (avg: 2.9638 ms/iter)
Result: CPU is 8.03x faster (GPU dispatch overhead dominates at small batch sizes)
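To run the same kind of comparison from Go instead of C, a plain timing loop is enough. A minimal sketch; the actual forward call is left as a comment since constructing the 784→392→10 network is omitted here:

package main

import (
    "fmt"
    "time"
)

// timeAvg returns the average wall-clock duration of f over n runs.
func timeAvg(n int, f func()) time.Duration {
    start := time.Now()
    for i := 0; i < n; i++ {
        f()
    }
    return time.Since(start) / time.Duration(n)
}

func main() {
    input := make([]float32, 784) // 784-dim input as in simple_bench.c
    avg := timeAvg(100, func() {
        // Replace with your network's forward pass, e.g.:
        // _, _ = network.ForwardCPU(input)
        _ = input
    })
    fmt.Printf("CPU forward: avg %v/iter\n", avg)
}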
✅ Multi-platform support - Linux, macOS, Windows, Android, iOS
✅ Cross-compilation - Build for multiple architectures from a single machine
✅ 17MB shared library - Includes full framework + CGO runtime
✅ Handle-based management - Safe object lifecycle with sync.Mutex
✅ JSON parameters - Language-agnostic API
✅ Dynamic method calling - Access all 24+ Network methods via reflection
✅ Introspection - List methods, get signatures, query object info
✅ GPU support - Enable/disable GPU acceleration at runtime
✅ Model persistence - Save/load as JSON strings
See cabi/README.md for complete API reference, multi-platform build instructions, and language bindings (Python, Rust, C++, etc.).
Wrappers for Embedding LOOM via the External (C-ABI) Toolchain
High-level Python bindings for LOOM with GPU acceleration support.
High-level C# bindings for LOOM with full P/Invoke support for .NET 9.0+.
dotnet add package Welvet
using Welvet;

// Create network with GPU acceleration
using var network = Network.Create(inputSize: 4, gridRows: 1, gridCols: 1, layersPerCell: 2, useGpu: true);

// Configure: 4 -> 8 -> 2
network.ConfigureSequential(
    layerSizes: new[] { 4, 8, 2 },
    activations: new[] { Activation.ScaledReLU, Activation.Sigmoid });

// Training data
var inputs = new float[][] { new[] { 0.1f, 0.2f, 0.3f, 0.4f }, new[] { 0.5f, 0.6f, 0.7f, 0.8f } };
var targets = new float[][] { new[] { 1.0f, 0.0f }, new[] { 0.0f, 1.0f } };

// Train
for (int epoch = 0; epoch < 10; epoch++)
{
    float loss = network.TrainEpoch(inputs, targets, learningRate: 0.1f);
    Console.WriteLine($"Epoch {epoch + 1}: loss = {loss:F4}");
}

// Predict
var output = network.Forward(new[] { 0.1f, 0.2f, 0.3f, 0.4f });
Console.WriteLine($"Output: [{string.Join(", ", output)}]");
// Load complete model from JSON string
using var network = Network.LoadFromString(modelJson, "my_model");

// Save model to JSON string
string json = network.SaveToString("my_model");
✅ Modern C# API - IDisposable, nullable reference types, async-ready
✅ GPU Support - WebGPU acceleration via P/Invoke to C-ABI
✅ Multi-platform - Linux, macOS, Windows with native library packaging
✅ Type Safe - Strong typing with proper exception handling
✅ .NET 9.0+ - Built for latest .NET runtime
✅ Zero Dependencies - Pure P/Invoke, no external packages
Results from Option 14 (CPU vs GPU Comprehensive Benchmark):
Dense (batch=4096, 80 layers):
Forward: 0.81x speedup (GPU: 4.8ms vs CPU: 3.9ms)
Backward: 0.19x speedup (GPU: 10.6ms vs CPU: 2.0ms)
Total: 0.38x
Status: Full GPU acceleration (overhead dominates at small batches)

Multi-Head Attention (batch=32, seq=256, dim=512):
Forward: 1.04x speedup (GPU: 693ms vs CPU: 721ms)
Backward: 1.08x speedup (GPU: 2.39s vs CPU: 2.58s)
Total: 1.07x speedup
Status: Hybrid GPU/CPU - Q/K/V projections on GPU, attention on CPU

Conv2D (batch=32, 64x64 images):
Total: 1.02x
Status: GPU implementation has bugs, falls back to CPU

RNN/LSTM:
Status: CPU only (sequential operations incompatible with GPU parallelism)
GPU: Intel Arc Graphics (MTL), Vulkan backend
Save and load trained models with both file-based and string-based methods:
// Save a single model
network.SaveModel("model.json", "my_model_v1")

// Load a single model
loadedNetwork, err := nn.LoadModel("model.json", "my_model_v1")

// Save multiple models in a bundle
models := map[string]*nn.Network{
    "model_a": networkA,
    "model_b": networkB,
}
nn.SaveBundle("models.json", models)

// Load a bundle
bundle, err := nn.LoadBundle("models.json")
String-Based Serialization (WASM/CABI)
Perfect for WebAssembly, FFI, network transfer, or embedded models:
// Serialize to a JSON string
jsonString, err := network.SaveModelToString("my_model_v1")

// Load from a JSON string (no file system needed!)
loadedNetwork, err := nn.LoadModelFromString(jsonString, "my_model_v1")

// Bundle to a string
bundle := &nn.ModelBundle{...}
jsonStr, err := bundle.SaveToString()

// Load a bundle from a string
bundle, err := nn.LoadBundleFromString(jsonString)
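Because no file system is required, a serialized model can even be compiled straight into the binary with go:embed. A minimal sketch, assuming a model.json produced by SaveModel sits next to the source file:

package main

import (
    _ "embed"
    "fmt"

    "github.com/openfluke/loom/nn"
)

// Embed the serialized model directly into the binary,
// so nothing needs to be read from disk at runtime.
//
//go:embed model.json
var modelJSON string

func main() {
    network, err := nn.LoadModelFromString(modelJSON, "my_model_v1")
    if err != nil {
        panic(err)
    }
    fmt.Printf("Loaded %d layers from embedded JSON\n", network.TotalLayers())
}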
# Build the library
go build ./nn
# Run tests
cd fabric/examples
go test -v

# Run benchmarks
cd fabric
go build
./fabric
# Select option 14 for comprehensive CPU vs GPU benchmark
Requirements:
Go: 1.24 or higher
GPU: WebGPU-compatible GPU (Vulkan, Metal, or D3D12)
OS: Linux, macOS, or Windows
Roadmap:
Fix Conv2D GPU shader bugs
Optimize Dense GPU for small batches
GPU softmax kernel for MHA
Multi-GPU support
FP16/FP32 mixed precision
Parallel RNN alternatives (QRNN, SRU)
Batch normalization
Dropout layers
Model visualization tools
Training Loop: Built-in Train() method with gradient clipping and loss tracking
DeviationMetrics Evaluation: 7-bucket accuracy tracking with sample-level analysis
Validation Integration: Automatic periodic evaluation during training
Metrics Persistence: JSON save/load for evaluation results
Multi-Head Attention: GPU-accelerated with hybrid CPU/GPU execution (1.07x speedup)
Model Serialization: File and string-based save/load (WASM/FFI compatible)
RNN/LSTM: Full CPU implementation with BPTT
Dense GPU: Forward/backward with WebGPU compute shaders
Optimizers: SGD with momentum, gradient clipping, learning rate scheduling
Loss Functions: MSE, Cross-Entropy with softmax
Contributions are welcome! Please feel free to submit a Pull Request.
Apache License 2.0 - see LICENSE file for details.
WebGPU compute shader architecture
Inspired by modern deep learning frameworks (PyTorch, TensorFlow)
Built with Go's simplicity and performance
For questions and support, please open an issue on GitHub.