PyLate is a powerful tool for research and training with ColBERT, developed at LightOn. It carries a heavy set of dependencies. That's fine for most environments, especially for training state-of-the-art information retrieval models, but it can be a real headache when you just want to run inference in a live application and spawn your model in milliseconds.
That's why we built pylate-rs. The main difference is that we've completely removed the PyTorch and Transformers dependencies. Instead, we built it with Candle, the deep-learning crate written in Rust. The goal was to create a focused, lightweight tool that does one thing well: compute ColBERT embeddings.
Time to import pylate_rs, initialize the model on the target device, and start computing embeddings with Python:
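Below is a minimal sketch of that flow. pylate-rs mirrors PyLate's `models.ColBERT` interface, but treat the exact argument names, the `lightonai/GTE-ModernColBERT-v1` checkpoint, and the device strings as assumptions to double-check against the repository:

```python
from pylate_rs import models

# Load a PyLate-compatible ColBERT checkpoint on the target device
# ("cpu", "cuda", or "mps" on Apple silicon). Names assumed, see the README.
model = models.ColBERT(
    model_name_or_path="lightonai/GTE-ModernColBERT-v1",
    device="mps",
)

# Queries and documents are encoded differently, hence the is_query flag.
queries_embeddings = model.encode(["what is the capital of France?"], is_query=True)
documents_embeddings = model.encode(["Paris is the capital of France."], is_query=False)
```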
[Benchmark chart: MPS (Apple silicon), -96%.]
Along with the Python release, we published a dedicated crate that can be used in any Rust project and compiled to WebAssembly for use in the browser.
If you're not familiar with ColBERT, it's a model for computing sentence embeddings. As an encoder-based model, it generates an embedding for each token in a sentence. The output from the final Transformer layer is a matrix of shape (embedding_dimension, num_tokens), for instance, 768 x num_tokens.
A linear layer then reduces the embedding dimension, resulting in a 128 x num_tokens matrix. In contrast, sentence transformers don't output per-token embeddings. Instead, they aggregate all token embeddings—a process called pooling—using methods like mean or max pooling. This produces a single vector representation for the entire sentence, with a shape like 768 x 1.
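To make the shape difference concrete, here is a small NumPy illustration (not the pylate-rs API, just the tensor shapes involved):

```python
import numpy as np

num_tokens, hidden_dim, colbert_dim = 12, 768, 128

# Output of the final Transformer layer: one contextualized vector per token.
token_embeddings = np.random.randn(num_tokens, hidden_dim)

# Sentence-transformer style: mean pooling collapses everything into one vector.
pooled_sentence = token_embeddings.mean(axis=0)             # shape (768,)

# ColBERT style: a linear layer shrinks the dimension but keeps every token.
projection = np.random.randn(hidden_dim, colbert_dim)
colbert_embeddings = token_embeddings @ projection          # shape (12, 128)
```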
ColBERT often outperforms sentence transformers because it allows for more fine-grained weight updates during training. If the model generates a poor representation for a specific token, the weight adjustments can target that specific token's embedding. This is unlike standard models that update the entire sentence representation, even if only a small part was incorrect.
This per-token approach enables a powerful "late interaction" mechanism. Instead of comparing two fixed sentence vectors, ColBERT calculates the similarity between each query token and all document tokens. These fine-grained scores are then aggregated to determine the final relevance score.
Formally, given a query $Q$ with token embeddings $q_i$ and a document $D$ with token embeddings $d_j$, the MaxSim score is calculated by finding the maximum similarity for each query token across all document tokens, and then summing these maximums:

$$S(Q, D) = \sum_{i=1}^{|Q|} \max_{1 \le j \le |D|} \left( q_i \cdot d_j \right)$$
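In code, MaxSim is just a pairwise dot product, a max over document tokens, and a sum over query tokens. A NumPy sketch, assuming the embeddings are L2-normalized (as they are in our implementation):

```python
import numpy as np

def maxsim(queries_embeddings: np.ndarray, documents_embeddings: np.ndarray) -> float:
    """MaxSim score between one query and one document.

    queries_embeddings: (|Q|, dim) L2-normalized query token embeddings.
    documents_embeddings: (|D|, dim) L2-normalized document token embeddings.
    """
    # Pairwise similarities q_i . d_j, shape (|Q|, |D|).
    similarities = queries_embeddings @ documents_embeddings.T
    # Best document token for each query token, then sum over query tokens.
    return float(similarities.max(axis=1).sum())
```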
Every interactive chart below runs in the browser using WebAssembly, through the wasm bindings of pylate-rs.
[Interactive demo: select a model, then enter a query and documents (one per line).]
As you can see, the MaxSim operation is a sum of maximum similarities, not an average. Consequently, the final score is not bounded within a fixed range like [0, 1]. Its magnitude scales with the number of tokens in the query, making it difficult to apply a universal similarity threshold. The score's scale also depends on the specific query and document context: one domain might yield higher scores on average than another.
This token-centric design also allows for visualizing the interactions between a query and a document. In our implementation, the token embeddings are L2-normalized, so the similarity score for any token pair is bounded between -1 and 1. The overall relevance score is then computed by summing the maximum similarity score for each query token over all document tokens.
[Interactive visualization (Max-Sim Only): token-level matches between query and document.]
By summing the similarity scores of the document tokens that contribute to the MaxSim calculation, we can visualize the weight of each token in the final score. These are the specific document tokens that had the highest similarity with each corresponding query token. The visualization is loosely inspired by Jo Kristian Bergum's excellent demo.

However, it's crucial to remember that the token embeddings are contextualized. This means a token can indirectly influence the score even if it is not highlighted as a top-scoring match: it does so by altering the embeddings of its neighboring tokens, which may be the ones that are directly measured. The final score is the sum of the scores from the highlighted tokens, but their values are shaped by the entire context.
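The highlighting itself boils down to an argmax over the same similarity matrix: for each query token, find the winning document token and credit it with that score. An illustrative sketch, not the demo's actual code:

```python
import numpy as np

def token_contributions(queries_embeddings: np.ndarray, documents_embeddings: np.ndarray) -> np.ndarray:
    """Weight of each document token in the final MaxSim score (illustration only)."""
    similarities = queries_embeddings @ documents_embeddings.T
    winners = similarities.argmax(axis=1)        # winning document token per query token
    best_scores = similarities.max(axis=1)       # its contribution to the score
    weights = np.zeros(documents_embeddings.shape[0])
    np.add.at(weights, winners, best_scores)     # a token can win for several query tokens
    return weights                               # weights.sum() equals the MaxSim score
```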
It may seem paradoxical to praise ColBERT for its token-level granularity only to then find ways to reduce the number of token embeddings we use. In reality, this reflects a practical trade-off between computational cost and representational detail.
pylate-rs implements a token reduction strategy following the article by Benjamin Clavié and Antoine Chaffin. The core idea is to find a balance between using all tokens and using only the most salient ones.
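The sketch below mimics that idea with SciPy's hierarchical clustering: group similar document token embeddings, mean-pool each group, and keep roughly `num_tokens / pool_factor` vectors. It illustrates the technique, not pylate-rs's exact implementation:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def pool_embeddings(embeddings: np.ndarray, pool_factor: int) -> np.ndarray:
    """Reduce the number of document token embeddings by roughly `pool_factor`."""
    if pool_factor <= 1 or embeddings.shape[0] <= pool_factor:
        return embeddings
    num_clusters = max(1, embeddings.shape[0] // pool_factor)
    # Cluster similar token embeddings, then cut the tree into num_clusters groups.
    clusters = fcluster(linkage(embeddings, method="ward"), t=num_clusters, criterion="maxclust")
    pooled = np.stack([embeddings[clusters == c].mean(axis=0) for c in np.unique(clusters)])
    # Re-normalize so MaxSim scores remain comparable.
    return pooled / np.linalg.norm(pooled, axis=1, keepdims=True)
```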
Use the slider below to adjust this pooling factor. You will see how it simplifies the document representation and affects the final similarity score, illustrating the direct trade-off between performance and accuracy.
[Interactive demo: original embeddings vs. pooled embeddings, with token counts updated by the pooling-factor slider.]
In June 2025, we released fast-plaid at LightOn, a Rust implementation of the PLAID algorithm for efficient nearest-neighbor search. Paired with pylate-rs, fast-plaid offers a lightweight solution for running ColBERT as a retriever in Python. Currently, fast-plaid is immutable, meaning the index must be rebuilt to add new documents. For use cases requiring mutable indexes, we recommend exploring solutions like Weaviate's implementation of MUVERA. We plan to add mutability and filtering to fast-plaid in the future. Any contribution is welcome!
Here is sample code for running ColBERT with pylate-rs and fast-plaid. This is the fastest way to create a multi-vector index, short of calling pylate-rs from Rust ⚡️. It is compatible with both CUDA and CPU, and is batch-oriented.
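A hedged sketch of the indexing step. The `models.ColBERT` and `FastPlaid` names and their arguments follow the two projects' READMEs as we understand them; treat the exact signatures as assumptions and check the repositories:

```python
from fast_plaid import search
from pylate_rs import models

# Encode documents with pylate-rs (PyLate-style interface assumed).
model = models.ColBERT(
    model_name_or_path="lightonai/GTE-ModernColBERT-v1",
    device="cpu",  # also works on "cuda"
)

documents = [
    "Paris is the capital of France.",
    "Berlin is the capital of Germany.",
    "The Eiffel Tower is located in Paris.",
]
documents_embeddings = model.encode(documents, is_query=False)

# Build the PLAID index on disk (fast-plaid API assumed).
index = search.FastPlaid(index="./index")
index.create(documents_embeddings=documents_embeddings)
```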
We can then load the existing index and search for the most relevant documents:
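Again as a sketch, with the same caveat that the exact names are assumptions:

```python
from fast_plaid import search
from pylate_rs import models

model = models.ColBERT(model_name_or_path="lightonai/GTE-ModernColBERT-v1", device="cpu")
index = search.FastPlaid(index="./index")  # reload the index built above

queries_embeddings = model.encode(["what is the capital of France?"], is_query=True)
results = index.search(queries_embeddings=queries_embeddings, top_k=3)
print(results)  # document indices with their MaxSim scores
```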
At LightOn, we are developing generative models, encoders, ColBERT models, and state-of-the-art RAG pipelines. We released PyLate, an optimized solution for training ColBERT models on hardware ranging from a single CPU to a multi-GPU node.
In partnership with AnswerAI, LightOn released ModernBERT, a new state-of-the-art encoder. We later fine-tuned it to create a state-of-the-art ColBERT model: GTE-ModernColBERT.
LightOn also released Reason-ModernColBERT, which achieves state-of-the-art results on the BRIGHT benchmark, the gold standard for reasoning-intensive retrieval, where it outperforms models 45x larger. Both GTE-ModernColBERT and Reason-ModernColBERT were trained with PyLate and are compatible with pylate-rs.
You can find all compatible models on the Hugging Face Hub under the PyLate tag.
For more information, visit the PyLate and pylate-rs repositories on GitHub and leave a ⭐️ if you find them useful!
PyLate is being built with my amazing co-maintainer, Antoine Chaffin.