A simple implementation of ModernBERT, in pure C. Inspired by llama2.c.
To keep it minimal, I hard-coded the ModernBERT architecture into one file, supporting inference only. You can load the base model weights, or fine-tuned weights. Currently, only support token classification as a downstream task.
Clone this repository:
Then, open the repository folder:
Intall Python dependencies:
Export the tokenizer binary, which downloads + outputs the vocabulary, merges, and metadata into a format that can be easily loaded by the C file:
Similarly, export the model weights:
This export will take some time, to download the model weights from huggingface and convert them. For ModernBERT-base, expect a ~570MB output file.
Compile the C code:
Run the C code:
You can load any huggingface model that uses the ModernBERT architecture. See the model/export.py script. (warning: this repo has not been tested with ModernBERT-large, or any of it's derivatives.)
The tokenizer implementation tokenizer/tokenizer.c is a crude approximation of huggingface's BPE-based PretrainedTokenizer. It works for ~80% of cases, but misses alot of edge cases.
The default make compile command currently applies the -O3 optimization, which includes optimizations that are expensive in terms of compile time and memory usage. You can expect token throughput > 1200 tokens/s. I include this rough figure only as a point of reference, because there are caveats:
- Since ModernBERT is an encoder model, it doesn't decode the output one token at a time. The model uses a single pass to produce outputs for all input tokens, no auto-regression. So token throughput here is not apples to apples with most llms out there, that are decoder-only.
- The time still scales with the number of input tokens, so with more tokens, time-to-first-token is larger.
The bulk of the performance comes from the OpenBLAS library, and the highly-optimized matrix multiplications.
You can try to compile with make compilefast. This turns on the -Ofast flag, which includes additional optimizations that may break compliance with the C/IEEE specifications, in addition to -O3. See the GCC docs for more information. But I didn't see much of a difference.
MIT
.png)

