Show HN: Rust BPE tokenizer for Qwen models that's 12x faster than HuggingFace


A blazing-fast BPE tokenizer for Qwen models, built with Rust and the rust-gems BPE crate. It achieves 6x faster tokenization by default and 12x faster with parallelization compared to HuggingFace tokenizers.

  • 🚀 Linear-time tokenization built on the rust-gems BPE crate
  • 🎯 Optimized pretokenization for Qwen's pattern, using a two-pass approach instead of the original lookahead regex (see the sketch after this list)
  • 🐍 Python bindings via PyO3 for seamless integration
  • 📦 Native BPE format support (vocab.json + merges.txt)
  • 6x faster encoding by default, 12x faster with parallelism, and 2x faster decoding compared to HuggingFace
  • 100% accuracy verified across a comprehensive test suite, including special tokens
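
The two-pass idea can be illustrated in Python (a minimal sketch, not the crate's actual Rust code; the pattern below is a GPT-2-style stand-in for Qwen's pretokenizer, whose `\s+(?!\S)` branch is the part that requires lookahead). Pass one matches with a lookahead-free pattern whose `\s+` branch over-matches at word boundaries; pass two gives the last whitespace character back so the following token keeps its leading space:

```python
# Minimal sketch of the two-pass pretokenization idea (illustrative
# Python, not the crate's Rust implementation). The pattern is a
# GPT-2-style stand-in with the `\s+(?!\S)` branch removed -- that
# branch needs lookahead, which linear-time regex engines like Rust's
# regex crate cannot express.
import regex  # third-party `regex` module, needed for \p{...} classes

LOOKAHEAD_FREE = regex.compile(
    r"(?i:'s|'t|'re|'ve|'m|'ll|'d)"   # contractions
    r"|[^\r\n\p{L}\p{N}]?\p{L}+"      # optional leading char + letters
    r"|\p{N}{1,3}"                    # short digit runs
    r"| ?[^\s\p{L}\p{N}]+[\r\n]*"     # punctuation, optional leading space
    r"|\s*[\r\n]+"                    # newline runs
    r"|\s+"                           # whitespace: over-matches at boundaries
)

def pretokenize_two_pass(text: str) -> list[str]:
    tokens, pos = [], 0
    while pos < len(text):
        m = LOOKAHEAD_FREE.match(text, pos)
        piece, end = m.group(), m.end()
        # Pass 2 fix-up: a multi-char whitespace run followed by a
        # non-space character gives its last space back, so the next
        # token keeps its leading space -- exactly what the
        # `\s+(?!\S)` lookahead achieves.
        if piece.isspace() and len(piece) > 1 and end < len(text) and not text[end].isspace():
            piece, end = piece[:-1], end - 1
        tokens.append(piece)
        pos = end
    return tokens

print(pretokenize_two_pass("hello   world"))  # ['hello', '  ', ' world']
```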

Use bpe-qwen as a drop-in replacement for HuggingFace tokenizers:

```python
# Patch transformers to use bpe-qwen for Qwen models
from bpe_qwen import AutoLinearTokenizer

# This automatically uses bpe-qwen under the hood
tokenizer = AutoLinearTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-7B-Instruct")

# Use it exactly like a HuggingFace tokenizer
outputs = tokenizer(
    "Hello, world!",
    return_tensors="pt",
    padding=True,
    truncation=True
)
print(outputs["input_ids"])

# Batch processing with native HuggingFace API
batch = tokenizer(
    ["Text 1", "Text 2", "Text 3"],
    padding=True,
    return_attention_mask=True
)
```
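
Decoding goes through the same HuggingFace-style surface. A quick round-trip check (a minimal sketch, assuming the default call returns plain Python lists as in HuggingFace, with `tokenizer` from the snippet above):

```python
# Round-trip sanity check; without return_tensors, input_ids is a
# plain Python list, matching HuggingFace behavior.
ids = tokenizer("Hello, world!")["input_ids"]
assert tokenizer.decode(ids) == "Hello, world!"
```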

Performance comparison with HuggingFace tokenizers on WikiText dataset (2,891 texts, 1.3M characters):

| Tokenizer   | Speed            | Speedup vs HF |
| ----------- | ---------------- | ------------- |
| bpe-qwen    | 6.40M tokens/sec | 6.28x         |
| HuggingFace | 1.02M tokens/sec | 1.00x         |

Parallel Performance (8 workers):

| Tokenizer   | Speed             | Speedup vs HF | Parallel Benefit    |
| ----------- | ----------------- | ------------- | ------------------- |
| bpe-qwen    | 33.08M tokens/sec | 12.52x        | 5.17x vs sequential |
| HuggingFace | 2.64M tokens/sec  | 1.00x         | 2.59x vs sequential |

Token consistency verified: all methods produce the same 298,938 tokens on this corpus.
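
For context, one way such a parallel run can be driven from Python (a hypothetical harness, not the repo's benchmark.py; the corpus line is a stand-in for the WikiText samples):

```python
# Hypothetical parallel-throughput harness (not the repo's benchmark.py).
import time
from multiprocessing import Pool

from bpe_qwen import AutoLinearTokenizer

MODEL = "Qwen/Qwen2.5-Coder-7B-Instruct"

def count_tokens(chunk: list[str]) -> int:
    # Each worker builds its own tokenizer; the construction cost is
    # amortized over the whole chunk.
    tok = AutoLinearTokenizer.from_pretrained(MODEL)
    return sum(len(ids) for ids in tok(chunk)["input_ids"])

if __name__ == "__main__":
    texts = ["Some sample text."] * 2891  # stand-in for the WikiText samples
    workers = 8
    chunks = [texts[i::workers] for i in range(workers)]
    start = time.perf_counter()
    with Pool(workers) as pool:
        total = sum(pool.map(count_tokens, chunks))
    elapsed = time.perf_counter() - start
    print(f"{total / elapsed / 1e6:.2f}M tokens/sec")
```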

To build from source:

```bash
# Install Rust toolchain
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Clone and build
git clone https://github.com/sweepai/bpe-qwen.git
cd bpe-qwen
maturin develop --release

# Run tests
python test_simple.py
python benchmark.py
```
Current limitations:

  • Requires vocab.json and merges.txt files (not tokenizer.json); see the fetch sketch after this list
  • Some multi-byte UTF-8 characters are not handled correctly
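
If a local copy of those files is needed, they can be pulled from the model repo on the Hub (assuming the repo ships them, as the Qwen2.5 repos do):

```python
# Fetch the raw BPE files from the Hub (the Qwen2.5 repos ship both).
from huggingface_hub import hf_hub_download

vocab_path = hf_hub_download("Qwen/Qwen2.5-Coder-7B-Instruct", "vocab.json")
merges_path = hf_hub_download("Qwen/Qwen2.5-Coder-7B-Instruct", "merges.txt")
```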
Planned improvements:

  • True SIMD intrinsics: explicit vector instructions for even faster ASCII detection and token processing
  • Custom allocators: specialized memory management for tokenization workloads
  • Early stopping once a target token count is reached
  • Support for more model architectures
  • Batch processing optimizations
Acknowledgments:

  • Built on top of the excellent rust-gems BPE crate
  • Inspired by the need for faster tokenization in production ML pipelines

This entire project was written by Sweep AI, an AI plugin for JetBrains IDEs.
