Show HN: TokenDagger – A tokenizer 2-4x faster than OpenAI's Tiktoken


MIT license. Python 3.8+. Available on PyPI.

A fast implementation of OpenAI's TikToken, designed for large-scale text processing. It delivers 2x the throughput of TikToken and is 4x faster at tokenizing code samples.

Benchmarks were performed on an AMD EPYC 4584PX (16 cores / 32 threads, 4.2 GHz).

Throughput Benchmark Results

  • Fast Regex Parsing: Optimized PCRE2 regex engine for efficient token pattern matching
  • Simplified BPE: Simplified algorithm that reduces the performance impact of a large special-token vocabulary.
  • OpenAI Compatible: Full compatibility with OpenAI's TikToken tokenizer
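
To make the first two bullets concrete, here is a minimal Python sketch of the two stages a TikToken-style tokenizer runs: a regex pass that splits text into pieces, then byte-pair merges inside each piece. It is an illustration only, not TokenDagger's code. It uses Python's built-in re module rather than PCRE2, a deliberately simplified split pattern, and a toy vocabulary; SPLIT_PATTERN, bpe_merge, encode and TOY_RANKS are all hypothetical names.

```python
import re
from typing import Dict, List

# Simplified split pattern (assumption): runs of ASCII letters, runs of digits,
# any single other non-space character, or runs of whitespace. Real TikToken
# patterns are more elaborate and rely on Unicode property classes.
SPLIT_PATTERN = re.compile(r"[A-Za-z]+|[0-9]+|[^A-Za-z0-9\s]|\s+")


def bpe_merge(piece: bytes, ranks: Dict[bytes, int]) -> List[int]:
    """Greedily merge adjacent byte sequences, lowest rank first."""
    parts = [bytes([b]) for b in piece]
    while len(parts) > 1:
        # Find the adjacent pair whose concatenation has the lowest rank.
        best_idx, best_rank = None, None
        for i in range(len(parts) - 1):
            rank = ranks.get(parts[i] + parts[i + 1])
            if rank is not None and (best_rank is None or rank < best_rank):
                best_idx, best_rank = i, rank
        if best_idx is None:
            break  # no mergeable pair is left
        parts[best_idx:best_idx + 2] = [parts[best_idx] + parts[best_idx + 1]]
    # In this scheme a token's rank doubles as its token id.
    return [ranks[p] for p in parts]


def encode(text: str, ranks: Dict[bytes, int]) -> List[int]:
    """Split text with the regex, then BPE-encode each piece independently."""
    tokens: List[int] = []
    for piece in SPLIT_PATTERN.findall(text):
        tokens.extend(bpe_merge(piece.encode("utf-8"), ranks))
    return tokens


# Toy vocabulary (hypothetical): every single byte, plus four merges.
TOY_RANKS: Dict[bytes, int] = {bytes([i]): i for i in range(256)}
TOY_RANKS.update({b"to": 256, b"ke": 257, b"ken": 258, b"token": 259})
```

With this toy vocabulary, encode("token", TOY_RANKS) collapses to the single id 259, while encode("to ken", TOY_RANKS) yields one id per regex piece. The inner loop rescans every pair on each merge, which is fine for a sketch but is exactly the kind of cost an optimized implementation tries to avoid.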
    make clean && make
    pip3 install tiktoken
    python3 tests/test_tokendagger_vs_tiktoken.py --tokenizer llama
    python3 tests/test_tokendagger_vs_tiktoken.py --tokenizer mistral
    python3 tests/performance_benchmark.py --tokenizer llama
    python3 tests/performance_benchmark.py --tokenizer mistral
    python3 tests/code_performance_benchmark.py --tokenizer llama
    ================================================================================
    🎉 CONCLUSION: TokenDagger is 4.02x faster on code tokenization!
    ================================================================================
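
A speedup figure of this kind is a ratio of wall-clock times. The sketch below shows one way such a comparison might be set up. It is an assumption for illustration, not the contents of tests/code_performance_benchmark.py, and best_time, speedup and the stand-in encoders are hypothetical names.

```python
import time
from typing import Callable, List, Sequence

Encoder = Callable[[str], List[int]]


def best_time(encode: Encoder, texts: Sequence[str], repeats: int = 3) -> float:
    """Return the fastest wall-clock time, in seconds, to encode all texts."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for text in texts:
            encode(text)
        best = min(best, time.perf_counter() - start)
    return best


def speedup(baseline: Encoder, candidate: Encoder,
            texts: Sequence[str], repeats: int = 3) -> float:
    """Return how many times faster `candidate` is than `baseline`."""
    base = best_time(baseline, texts, repeats)
    cand = best_time(candidate, texts, repeats)
    # Guard against a zero reading on very coarse timers.
    return base / cand if cand > 0 else float("inf")


if __name__ == "__main__":
    # Stand-in encoders (hypothetical); swap in real tokenizer callables.
    def slow_encode(text: str) -> List[int]:
        return [ord(ch) for ch in text for _ in range(1)]

    def fast_encode(text: str) -> List[int]:
        return list(text.encode("utf-8"))

    samples = ["def add(a, b):\n    return a + b\n"] * 1000
    print(f"{speedup(slow_encode, fast_encode, samples):.2f}x")
```

To compare real tokenizers, pass each library's encode callable, for example tiktoken.get_encoding("cl100k_base").encode as the baseline. Taking the best of several repeats reduces noise from other processes, though results will still vary with hardware and input text.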
    git clone [email protected]:M4THYOU/TokenDagger.git
    sudo apt install libpcre2-dev
    git submodule update --init --recursive
    sudo apt update && sudo apt install -y python3-dev

Optionally, to run the tests, also install tiktoken (pip3 install tiktoken).

  • PCRE2: Perl Compatible Regular Expressions - GitHub