NIFE compresses large embedding models into static, drop-in replacements with up to 1000x faster query embedding (see benchmarks).
- 400-900x faster CPU query embedding
- Fully aligned with their teacher models
- Re-use your existing vector index
Nearly Inference Free Embedding (NIFE) models are static embedding models that are fully aligned with a much larger model. Because static models are so small and fast, NIFE allows you to:
- Speed up queries immensely: a 200x query-embedding speed-up on CPU.
- Get by with a much smaller memory/compute footprint: create embeddings directly inside your DB service.
- Reuse your big model's index: switch dynamically between your big model and the NIFE model.
Some possible use cases for NIFE include search engines with slow and fast paths, RAG in agent loops, and on-the-fly document comparisons.
The snippet below loads stephantulkens/NIFE-mxbai-embed-large-v1, which is aligned with mixedbread-ai/mxbai-embed-large-v1. Use it in any spot where you would otherwise use mixedbread-ai/mxbai-embed-large-v1.
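A minimal sketch of what that can look like (the query text is just an example):

```python
from sentence_transformers import SentenceTransformer

# The NIFE model is fully static, so it loads and encodes almost instantly.
model = SentenceTransformer("stephantulkens/NIFE-mxbai-embed-large-v1")

# The resulting vector lives in the same space as mixedbread-ai/mxbai-embed-large-v1,
# so it can be searched against any index built with that teacher model.
query_embedding = model.encode("what is the capital of france?")
```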
This snippet is just one example of how you could use it; in reality, you should use it anywhere you currently encode a query with your teacher model. There's no need to keep the teacher in memory. This makes NIFE extremely flexible, because you can decouple the inference model from the indexing model. And because the models load extremely quickly, they can be used in edge environments and in one-off contexts such as lambda functions.
On PyPI:
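Based on the package name used throughout this README, installation should be:

```bash
pip install pynife
```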
A NIFE model is just a sentence transformer router model, so you don't need to install pynife to use NIFE models. Nevertheless, pynife contains some helper functions for loading a model trained with NIFE.
Note that with all NIFE models the teacher model is unchanged; so if you have a large set of documents indexed with the teacher model, you can use the NIFE model as a drop-in replacement.
Use it just like any other sentence transformer:
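A small usage sketch (the texts are illustrative):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("stephantulkens/NIFE-mxbai-embed-large-v1")

query_embeddings = model.encode(["what is the capital of france?"])
doc_embeddings = model.encode([
    "Paris is the capital of France.",
    "The Eiffel Tower was completed in 1889.",
])

# Cosine similarity between the query and each document.
print(model.similarity(query_embeddings, doc_embeddings))
```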
You can also use the small model and big model together as a single router using a helper function from pynife. This is useful for benchmarking; in production you should probably use the query model by itself.
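The exact pynife helper is not reproduced here; as a rough stand-in for benchmarking, the sketch below simply loads both models side by side and checks how closely the fast query embeddings track the teacher's (illustrative code, not the pynife API):

```python
from sentence_transformers import SentenceTransformer

nife = SentenceTransformer("stephantulkens/NIFE-mxbai-embed-large-v1")
teacher = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")

queries = ["what is the capital of france?", "how tall is the eiffel tower"]
nife_embeddings = nife.encode(queries)
teacher_embeddings = teacher.encode(queries)

# Full similarity matrix between the fast and the teacher embeddings; the
# diagonal shows how closely each NIFE query embedding tracks the teacher's.
print(nife.similarity(nife_embeddings, teacher_embeddings))
```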
I have two pretrained models:
- stephantulkens/NIFE-mxbai-embed-large-v1: aligned with mxbai-embed-large-v1.
- stephantulkens/NIFE-gte-modernbert-base: aligned with gte-modernbert-base.
For retrieval using dense models, the normal mode of operation is to embed your documents and put them in some index. Then, using that same model, you also embed your queries. In general, larger embedding models are better than smaller ones, so you're often better off making your embedder as large as possible. This, however, makes inference more difficult: you need to host a larger model, and embedding queries might take longer.
For sparse models like SPLADE, there is an interesting alternative, which the SPLADE authors call doc-SPLADE and which sentence transformers calls inference-free. In doc-SPLADE, you only use the full model to embed the documents in your index. When querying, you skip the model entirely and just look up the query's tokens in the sparse index, so the query-side encoder is effectively the tokenizer.
NIFE is the answer to the question: what would inference-free dense retrieval look like? It is called Nearly Inference Free because you still need some mapping from tokens to embeddings.
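Concretely, the query-side "model" is little more than a token-embedding lookup followed by pooling. A toy sketch with a made-up vocabulary and random vectors (real NIFE models use a trained subword tokenizer and trained embeddings):

```python
import numpy as np

# A static model is essentially a lookup table from tokens to vectors.
vocab = {"what": 0, "is": 1, "the": 2, "capital": 3, "of": 4, "france": 5}
embedding_table = np.random.rand(len(vocab), 4)  # shape: (vocab_size, dim)

def embed_query(query: str) -> np.ndarray:
    # "Inference" is just tokenization, a table lookup, and a mean.
    tokens = query.lower().rstrip("?").split()
    return embedding_table[[vocab[t] for t in tokens]].mean(axis=0)

print(embed_query("What is the capital of France?"))
```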
See this table:
| | Sparse | Dense |
|----------------|------------|----------------------|
| Full | SPLADE | Sentence transformer |
| Inference free | doc-SPLADE | NIFE |
As in doc-SPLADE, you lose performance. There's no real way around it, but as with other fast models, the gap is smaller than you might think.
I benchmark the two pretrained models listed above on NanoBEIR.
For all models, I report NDCG@10 and queries per second. I do this for both the student model and the teacher model, to show how much performance you lose when switching between them. Detailed benchmark results can be found in the benchmarks folder. The query timings were performed on the first 1000 queries of the MS MARCO dataset and averaged over 7 runs. The benchmarks were run on an Apple M3 Pro.
| Model | Queries per second | NDCG@10 |
|---------|---------------------------|---------|
| NIFE | 71400 (14ms/1k queries) | 59.2 |
| Teacher | 237 (4210ms/1k queries) | 66.34 |

| Model | Queries per second | NDCG@10 |
|---------|---------------------------|---------|
| NIFE | 65789 (15ms/1k queries) | 59.2 |
| Teacher | 108 (9190ms/1k queries) | 65.6 |
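A rough sketch of how the query-timing numbers could be reproduced (the repeated query is a stand-in for the first 1000 MS MARCO queries; this is not the actual benchmarking script):

```python
import time
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("stephantulkens/NIFE-mxbai-embed-large-v1")
queries = ["what is the capital of france?"] * 1000  # stand-in for 1k MS MARCO queries

timings = []
for _ in range(7):
    start = time.perf_counter()
    model.encode(queries)
    timings.append(time.perf_counter() - start)

mean = sum(timings) / len(timings)
print(f"{mean * 1000:.1f} ms per 1k queries, {len(queries) / mean:.0f} queries/s")
```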
It is interesting that both NIFE models get the same performance, even with different teacher models. This could point towards a ceiling effect, where a certain percentage of queries can be answered correctly by static models, while others require contextualization.
We use knowledge distillation to align an initialized static model with the teacher we want to emulate. Some notable details:
- The static model is initialized directly from the teacher by passing every token in the tokenizer's vocabulary through the full model. This is similar to how it was done in model2vec, except we skip the PCA and weighting steps (see the sketch after this list).
- The knowledge distillation is done in cosine space; we don't guarantee any alignment in Euclidean space. Using, e.g., MSE or KL divergence between the student and teacher did not work as well.
- We train a custom tokenizer on our pre-training corpus, which is MS MARCO. This custom tokenizer is based on bert-base-uncased, but with a lot of added vocabulary; the models used in NIFE all have a vocabulary size of around 100k.
- We perform two stages of training. Following LEAF, we also train on queries. This raises performance considerably, but training on interleaved queries and documents does not work very well, so we first train on a corpus of documents (MS MARCO) and then fine-tune with a lower learning rate on a large selection of queries from a variety of sources.
- Unlike LEAF, we leave out all instructions from the knowledge distillation process. Static models can't deal with instructions, because there is no interaction between the instruction and the other tokens in the document. Instructions can therefore at best act as a constant offset in your embedding space. This can be really useful, but not for this specific task.
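A condensed, illustrative sketch of the first two points above (teacher-based initialization and a cosine-space distillation objective); names and details are assumptions, not the actual training code:

```python
import torch
import torch.nn as nn
from sentence_transformers import SentenceTransformer

teacher = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")
tokenizer = teacher.tokenizer  # the real setup uses a custom ~100k-token tokenizer

# 1) Initialize the static table by pushing every vocabulary token through the teacher.
vocab = tokenizer.get_vocab()                     # token -> id
tokens = sorted(vocab, key=vocab.get)             # ordered by token id
init_vectors = teacher.encode(tokens, convert_to_tensor=True)   # (vocab_size, dim)
student = nn.EmbeddingBag(len(tokens), init_vectors.shape[1], mode="mean")
student.weight.data.copy_(init_vectors)

# 2) Distill in cosine space: pull the student's mean-pooled embedding
#    towards the teacher's embedding of the same text.
def distillation_loss(texts: list[str]) -> torch.Tensor:
    with torch.no_grad():
        teacher_emb = teacher.encode(texts, convert_to_tensor=True)
    batch = tokenizer(texts, padding=True, return_tensors="pt")
    # EmbeddingBag averages the token vectors per text (padding handling omitted).
    student_emb = student(batch["input_ids"])
    # 1 - cosine similarity: only the direction of the vectors is matched.
    return (1 - torch.cosine_similarity(student_emb, teacher_emb)).mean()
```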
NIFE can't do the following things:
- Ignore words based on context: in the query "What is the capital of France?", the word "France" will cause documents containing the term "France" to be retrieved. There is no way for the model to attenuate this vector and morph it into the answer ("Paris").
- Deal with negation: for the same reason as above, there is no interaction between tokens, so the similarity between "Cars that aren't red" and "Cars that are red" will be really high (see the small demonstration below).
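A small demonstration of the negation issue (expect a similarity score close to 1):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("stephantulkens/NIFE-mxbai-embed-large-v1")

embeddings = model.encode(["Cars that aren't red", "Cars that are red"])
# Without token interaction the two sentences share almost all of their
# static token vectors, so the similarity will be very high.
print(model.similarity(embeddings[:1], embeddings[1:]))
```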
If you think NIFE could be interesting for your business, let me know: I am open to consulting work on training models and fast inference. Just reach out to me via e-mail.
MIT
Stéphan Tulkens
If you use pynife or NIFE models in general, please cite this work as follows: