PyLate is a powerful tool for research and training with ColBERT, developed at LightOn. It carries a heavy set of dependencies. That's fine for most environments, especially for training state-of-the-art information retrieval models, but it can be a real headache when you just want to run inference in a live application and spawn your model in milliseconds.
That's why we built pylate-rs. The main difference is that we've completely removed the PyTorch and Transformers dependencies. Instead, we built it with Candle, the deep-learning crate written in Rust. The goal was to create a focused, lightweight tool that does one thing well: compute ColBERT embeddings.
Time to import pylate_rs, initialize the model on the target device, and start computing embeddings with Python:
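Below is a minimal sketch of that flow. pylate-rs mirrors PyLate's `models.ColBERT` interface, but treat the exact argument names, the `lightonai/GTE-ModernColBERT-v1` checkpoint, and the device strings as assumptions to double-check against the repository:

```python
from pylate_rs import models

# Load a PyLate-compatible ColBERT checkpoint on the target device
# ("cpu", "cuda", or "mps" on Apple silicon). Names assumed, see the README.
model = models.ColBERT(
    model_name_or_path="lightonai/GTE-ModernColBERT-v1",
    device="mps",
)

# Queries and documents are encoded differently, hence the is_query flag.
queries_embeddings = model.encode(["what is the capital of France?"], is_query=True)
documents_embeddings = model.encode(["Paris is the capital of France."], is_query=False)
```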
[Benchmark chart: MPS (Apple silicon), -96%.]
Along with the Python release, we published a dedicated crate that can be used in any Rust project and compiled to WebAssembly for use in the browser.
If you're not familiar with ColBERT, it's a model for computing sentence embeddings. As an encoder-based model, it generates an embedding for each token in a sentence. The output from the final Transformer layer is a matrix of shape (embedding_dimension, num_tokens), for instance, 768 x num_tokens.
A linear layer then reduces the embedding dimension, resulting in a 128 x num_tokens matrix. In contrast, sentence transformers don't output per-token embeddings. Instead, they aggregate all token embeddings—a process called pooling—using methods like mean or max pooling. This produces a single vector representation for the entire sentence, with a shape like 768 x 1.
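To make the shape difference concrete, here is a small NumPy illustration (not the pylate-rs API, just the tensor shapes involved):

```python
import numpy as np

num_tokens, hidden_dim, colbert_dim = 12, 768, 128

# Output of the final Transformer layer: one contextualized vector per token.
token_embeddings = np.random.randn(num_tokens, hidden_dim)

# Sentence-transformer style: mean pooling collapses everything into one vector.
pooled_sentence = token_embeddings.mean(axis=0)             # shape (768,)

# ColBERT style: a linear layer shrinks the dimension but keeps every token.
projection = np.random.randn(hidden_dim, colbert_dim)
colbert_embeddings = token_embeddings @ projection          # shape (12, 128)
```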
ColBERT often outperforms sentence transformers because it allows for more fine-grained weight updates during training. If the model generates a poor representation for a specific token, the weight adjustments can target that specific token's embedding. This is unlike standard models that update the entire sentence representation, even if only a small part was incorrect.
This per-token approach enables a powerful "late interaction" mechanism. Instead of comparing two fixed sentence vectors, ColBERT calculates the similarity between each query token and all document tokens. These fine-grained scores are then aggregated to determine the final relevance score.
Formally, given a query $Q$ with token embeddings $q_i$ and a document $D$ with token embeddings $d_j$, the MaxSim score is calculated by finding the maximum similarity for each query token across all document tokens, and then summing these maximums:

$$S(Q, D) = \sum_{i=1}^{|Q|} \max_{1 \le j \le |D|} \left( q_i \cdot d_j \right)$$
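In code, MaxSim is just a pairwise dot product, a max over document tokens, and a sum over query tokens. A NumPy sketch, assuming the embeddings are L2-normalized (as they are in our implementation):

```python
import numpy as np

def maxsim(queries_embeddings: np.ndarray, documents_embeddings: np.ndarray) -> float:
    """MaxSim score between one query and one document.

    queries_embeddings: (|Q|, dim) L2-normalized query token embeddings.
    documents_embeddings: (|D|, dim) L2-normalized document token embeddings.
    """
    # Pairwise similarities q_i . d_j, shape (|Q|, |D|).
    similarities = queries_embeddings @ documents_embeddings.T
    # Best document token for each query token, then sum over query tokens.
    return float(similarities.max(axis=1).sum())
```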
Every interactive chart below runs in the browser using WebAssembly, through the wasm bindings of pylate-rs.
[Interactive demo: select a model, then enter a query and documents (one per line).]
As you can see, the MaxSim operation is a sum of maximum similarities, not an average. Consequently, the final score is not bounded within a fixed range like [0, 1]. Its magnitude scales with the number of tokens in the query, making it difficult to apply a universal similarity threshold. The score's scale also depends on the specific query and document context: one domain might yield higher scores on average than another.
This token-centric design also allows for visualizing the interactions between a query and a document. In our implementation, the token embeddings are L2-normalized, so the similarity score for any token pair is bounded between -1 and 1. The overall relevance score is then computed by summing the maximum similarity score for each query token over all document tokens.
[Interactive visualization (Max-Sim Only): token-level matches between query and document.]
By summing the similarity scores of the document tokens that contribute to the MaxSim calculation, we can visualize the weight of each token in the final score. These are the specific document tokens that had the highest similarity with each corresponding query token. The visualization is loosely inspired by Jo Kristian Bergum's excellent demo.

However, it's crucial to remember that the token embeddings are contextualized. This means a token can indirectly influence the score even if it is not highlighted as a top-scoring match: it does so by altering the embeddings of its neighboring tokens, which may be the ones that are directly measured. The final score is the sum of the scores from the highlighted tokens, but their values are shaped by the entire context.
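The highlighting itself boils down to an argmax over the same similarity matrix: for each query token, find the winning document token and credit it with that score. An illustrative sketch, not the demo's actual code:

```python
import numpy as np

def token_contributions(queries_embeddings: np.ndarray, documents_embeddings: np.ndarray) -> np.ndarray:
    """Weight of each document token in the final MaxSim score (illustration only)."""
    similarities = queries_embeddings @ documents_embeddings.T
    winners = similarities.argmax(axis=1)        # winning document token per query token
    best_scores = similarities.max(axis=1)       # its contribution to the score
    weights = np.zeros(documents_embeddings.shape[0])
    np.add.at(weights, winners, best_scores)     # a token can win for several query tokens
    return weights                               # weights.sum() equals the MaxSim score
```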
It may seem paradoxical to praise ColBERT for its token-level granularity only to then find ways to reduce the number of token embeddings we use. In reality, this reflects a practical trade-off between computational cost and representational detail.
pylate-rs implements a token reduction strategy following the article by Benjamin Clavié and Antoine Chaffin. The core idea is to find a balance between using all tokens and using only the most salient ones.
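The sketch below mimics that idea with SciPy's hierarchical clustering: group similar document token embeddings, mean-pool each group, and keep roughly `num_tokens / pool_factor` vectors. It illustrates the technique, not pylate-rs's exact implementation:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def pool_embeddings(embeddings: np.ndarray, pool_factor: int) -> np.ndarray:
    """Reduce the number of document token embeddings by roughly `pool_factor`."""
    if pool_factor <= 1 or embeddings.shape[0] <= pool_factor:
        return embeddings
    num_clusters = max(1, embeddings.shape[0] // pool_factor)
    # Cluster similar token embeddings, then cut the tree into num_clusters groups.
    clusters = fcluster(linkage(embeddings, method="ward"), t=num_clusters, criterion="maxclust")
    pooled = np.stack([embeddings[clusters == c].mean(axis=0) for c in np.unique(clusters)])
    # Re-normalize so MaxSim scores remain comparable.
    return pooled / np.linalg.norm(pooled, axis=1, keepdims=True)
```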
Use the slider below to adjust this pooling factor. You will see how it simplifies the document representation and affects the final similarity score, illustrating the direct trade-off between performance and accuracy.
[Interactive demo: original embeddings vs. pooled embeddings, with token counts updated by the pooling-factor slider.]
In June 2025, we released fast-plaid at LightOn, a Rust implementation of the PLAID algorithm for efficient nearest-neighbor search. Paired with pylate-rs, fast-plaid offers a lightweight solution for running ColBERT as a retriever in Python. Currently, fast-plaid is immutable, meaning the index must be rebuilt to add new documents. For use cases requiring mutable indexes, we recommend exploring solutions like Weaviate's implementation of MUVERA. We plan to add mutability and filtering to fast-plaid in the future. Any contribution is welcome!
Here is sample code for running ColBERT with pylate-rs and fast-plaid. This is the fastest way to create a multi-vector index, short of calling pylate-rs from Rust ⚡️. It is compatible with both CUDA and CPU, and is batch-oriented.
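A hedged sketch of the indexing step. The `models.ColBERT` and `FastPlaid` names and their arguments follow the two projects' READMEs as we understand them; treat the exact signatures as assumptions and check the repositories:

```python
from fast_plaid import search
from pylate_rs import models

# Encode documents with pylate-rs (PyLate-style interface assumed).
model = models.ColBERT(
    model_name_or_path="lightonai/GTE-ModernColBERT-v1",
    device="cpu",  # also works on "cuda"
)

documents = [
    "Paris is the capital of France.",
    "Berlin is the capital of Germany.",
    "The Eiffel Tower is located in Paris.",
]
documents_embeddings = model.encode(documents, is_query=False)

# Build the PLAID index on disk (fast-plaid API assumed).
index = search.FastPlaid(index="./index")
index.create(documents_embeddings=documents_embeddings)
```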
We can then load the existing index and search for the most relevant documents:
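Again as a sketch, with the same caveat that the exact names are assumptions:

```python
from fast_plaid import search
from pylate_rs import models

model = models.ColBERT(model_name_or_path="lightonai/GTE-ModernColBERT-v1", device="cpu")
index = search.FastPlaid(index="./index")  # reload the index built above

queries_embeddings = model.encode(["what is the capital of France?"], is_query=True)
results = index.search(queries_embeddings=queries_embeddings, top_k=3)
print(results)  # document indices with their MaxSim scores
```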
At LightOn, we are developing generative models, encoders, ColBERT models, and state-of-the-art RAG pipelines. We released PyLate, an optimized solution for training ColBERT models on hardware ranging from a single CPU to a multi-GPU node.
In partnership with AnswerAI, LightOn released ModernBERT, a new state-of-the-art encoder. We later fine-tuned it to create a state-of-the-art ColBERT model: GTE-ModernColBERT.
LightOn also released Reason-ModernColBERT, which achieves state-of-the-art results on the BRIGHT benchmark, the gold standard for reasoning-intensive retrieval, where it outperforms models 45x larger. Both GTE-ModernColBERT and Reason-ModernColBERT were trained with PyLate and are compatible with pylate-rs.
You can find all compatible models on the Hugging Face Hub under the PyLate tag.
For more information, visit the PyLate and pylate-rs repositories on GitHub and leave a ⭐️ if you find them useful!
PyLate is being built with my amazing co-maintainer, Antoine Chaffin.