- Despite their name, decoder-only Transformers (M-LLMs, e.g. Llama, Mistral, Qwen) can be fine-tuned into encoders that produce embeddings for retrieval tasks (e.g. RAG, semantic search) and can achieve state-of-the-art (SoTA) results
- You can have your cake and eat it too: the same M-LLM fine-tuned for embeddings can still be used to generate text without a loss in performance. See GRIT for how this can be done.
Decoder-only transformer architectures have taken over. Encoder-based transformer architectures (encoder-only and encoder-decoder, e.g. BERT) are becoming a thing of the past: the latest encoder-based transformer was released ~2 years ago. Large research labs are no longer training encoder-based models, and for good reason: encoder-based architectures are harder to scale.
The lack of new encoder-based transformers leaves us with a problem: how should we obtain state-of-the-art (SoTA) embeddings? Will individuals or smaller labs/companies be forced to depend on proprietary models such as Gemini Embedding, OpenAI embeddings, or Voyage Code (all likely decoder-only transformers)? Or continue using old encoder models, such as all-mpnet-base-v2? Or will we need to train new encoder-based models to keep up with the continued advances decoder-only models are achieving?
The answer to the above questions is in the title of this post: no, we don't need encoder-based transformers. One can use an existing pre-trained decoder-only transformer as the base for an encoder. Many people and companies are already doing this; in fact, as I was writing this post, a new model from ByteDance (the company behind TikTok) appeared at the top of the MTEB leaderboard: Seed-1.5-Embedding, which is based on their own LLM (a decoder-only transformer).
You might still wonder: why bother? LLMs are typically large, and a forward pass is not cheap, whereas SBERT is computationally cheap enough to run on a cheap VPS (CPU-only). My counter-arguments:
- You can have your cake and eat it too:
- If you're already hosting a generative LLM, you could fine-tune it to also encode text without degrading generative performance. For RAG, you can then use k-v caching to speed up the pipeline.
- See GRIT for evidence that it is possible: "Notably, we find that GRIT matches training on only generative or embedding data, thus we can unify both at no performance loss".
- To prevent generative performance regression, you essentially need to jointly train an embedding objective (contrastive loss, e.g. InfoNCE) with a generative objective (next-token prediction).
- Smaller LLMs (<1B params) are continuing to get better due to advances in training data, training recipes/techniques, e.g. distillation. Qwen3 0.6B can run on mobile devices, and qualitatively it performs well (I can't find benchmark numbers for this variant of Qwen3).
Feel free to skip this section if you're familiar with embeddings/vector-based search.
Encoders have the nice property of taking some input data (e.g. text, images, audio) and spitting out an embedding vector associated with it. This embedding vector can then be used to solve various tasks efficiently, such as retrieval, clustering, and pair classification. Commonly, embedding vectors are used for semantic search, Retrieval Augmented Generation (RAG), and recommendation.
We can use embedding vectors for these tasks because the space these vectors occupy is a metric space defined by some function f(x_1, x_2), meaning one can perform a nearest-neighbour search (k-NN) where f is a distance function (L2) or a similarity function (cosine).
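For concreteness, the two most common choices for f are cosine similarity and L2 distance:

f_\text{cos}(x_1, x_2) = \frac{x_1 \cdot x_2}{\lVert x_1 \rVert \, \lVert x_2 \rVert}, \qquad f_\text{L2}(x_1, x_2) = \lVert x_1 - x_2 \rVert_2

If the embeddings are L2-normalized, ranking by cosine similarity and ranking by L2 distance return the same nearest neighbours, which is why you'll see normalization in the snippets below.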
Once you obtain embeddings for all the datapoints in your dataset, you can perform a semantic query by first encoding the query input into the embedding space (using the encoder model) and then performing a k-Nearest Neighbour search. Practically, if your dataset fits in memory, you can do this with a simple matrix multiply (for cosine similarity via torch.matmul(x, X.T) on normalized embeddings) or with database plugins (such as pgvector or sqlite-vec). If your dataset doesn't fit in memory, you can use more advanced techniques (e.g. approximate nearest-neighbour algorithms); existing solutions include FAISS, Milvus, and other vector DBs.
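Here's a minimal sketch of the in-memory case; encode() is a placeholder for whatever encoder you use and is assumed to return one raw (unnormalized) vector per input:

```python
import torch
import torch.nn.functional as F

def knn_cosine(query_emb: torch.Tensor, doc_embs: torch.Tensor, k: int = 5):
    """Return indices and similarities of the k most similar documents.

    Both inputs are assumed to be L2-normalized, so the dot product
    equals cosine similarity. query_emb: (d,), doc_embs: (N, d).
    """
    scores = doc_embs @ query_emb   # (N,) cosine similarities
    top = torch.topk(scores, k)
    return top.indices, top.values

# Usage with a hypothetical encode() function:
# doc_embs  = F.normalize(encode(documents), dim=-1)                        # (N, d)
# query_emb = F.normalize(encode(["how to read a file in Python"]), dim=-1)[0]
# idx, sims = knn_cosine(query_emb, doc_embs, k=10)
```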
Retrieval is typically a bit more involved than a simple k-NN. One common and simple extension is to combine it with exact-match text retrieval techniques (e.g. BM-25; see SQLite's documentation for a good introduction). To do this, you need to combine ("re-rank") the two sets of retrieval results in some manner, which can be done via a learned function or a simple heuristic such as Reciprocal Rank Fusion (RRF; see a practical example using sqlite-vec, and the sketch after the footnote below). Other extensions include filtering by metadata (for hard constraints): e.g. if the input query is "how to read a file in Python", then "in Python" is classified as a hard constraint, enforcing a filter that only includes results relevant to the Python programming language. Metadata can be predicted by another model (e.g. GLiNER); this task is referred to as "Entity Extraction" and could also be solved with a decoder-only LLM (with fine-tuning or prompt-engineering; there's probably a paper covering this*).
* Most low-hanging-fruit ideas I had in this space already had a paper: "you could fine-tune a decoder LLM to encode text" - oh, there's a series of papers covering this already; "you could use the same LLM for generation and embeddings without generative performance regressing" - oh, there's a paper for this too (GRIT).
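Reciprocal Rank Fusion itself is only a few lines; here's a minimal, library-agnostic sketch (k = 60 is the constant commonly used for RRF, and the input rankings could come from, say, BM-25 and the k-NN search above):

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranking.

    A document's fused score is the sum of 1 / (k + rank) over every ranking
    it appears in; documents absent from a ranking contribute nothing there.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: fuse a BM-25 ranking with a vector-search ranking
# fused_ids = reciprocal_rank_fusion([bm25_ids, vector_search_ids])
```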
A decoder can be turned into an encoder by fine-tuning it in a specific manner, and this fine-tuned/adapted model can deliver the same generative performance if you train a next-token objective jointly with an embedding loss (a contrastive loss). I'll summarize how to do so below. Without fine-tuning, the following approach doesn't work well (step 3), even for an instruction-tuned model; the ablations in E5-V (Table 6) show this.
The following are my summarized notes on how to fine-tune a decoder-only transformer into an encoder, drawn from the papers (text-only) GRIT and NV-Embed, and (multi-modal) GME and E5-V:
- Use a specific set of instructions to distinguish between generative and
embedding modes, different types of tasks, document sources (e.g. Wikipedia,
Arxiv), and different modalities (text, image, audio, video, etc.):
- For Generative and Embedding modes:
- Create a new token to enable the model to enter "embedding" mode (e.g. <|embed|>), and/or use a sequence of tokens e.g. "Represent" in the system prompt
- For different modalities, you can steer the model to map into the same semantic space, e.g. E5-V embeds text and images into the same embedding space via the following instructions:
- Text input: "<text> Summary of the above sentence in one word"
- Image input: "<image> Summary of the above image in one word"
- Different types of tasks and document sources (see Section P in GRIT's Appendix), examples:
- Clustering Reddit posts: "Identify the topic or theme of Reddit posts based on the titles"
- Retrieval on Wikipedia: "Represent the climate-based claim to find a Wikipedia abstract to support it"
- (Optional) alter the architecture of the (M-)LLM to use bi-directional attention
for "encoding" mode. This is shown to improve the resulting embedding
quality w.r.t retrieval and other task metrics.
- If in "generative" mode, you can disable bi-directional attention and use casual attention for inference and training.
- Note: GME's ablations show the reverse (i.e. bi-directional attention hurts performance), perhaps because they don't train jointly and/or don't perform full model training.
- To obtain an embedding for an input sample:
- Without bi-directional attention:
- Use the final-layer hidden state of the last output token, e.g. with HuggingFace: emb = model(**inputs, output_hidden_states=True, return_dict=True)["hidden_states"][-1][:, -1, :], followed by emb = F.normalize(emb, dim=-1)
- With bi-directional attention:
- Perform a mean pool across the final-layer hidden states of all tokens (rather than taking only the last token's). This is empirically better than last-token pooling when using bi-directional attention.
- Loss:
- Embedding loss: use a contrastive loss (InfoNCE is commonly used), denote this as L_\text{Rep}
- For the case where you also train the generative objective jointly, you can combine the losses in the typical manner, e.g. as done in GRIT: L_\text{GRIT} = \lambda_\text{Rep}L_\text{Rep} + \lambda_\text{Gen}L_\text{Gen} (a minimal end-to-end sketch combining this with the pooling above follows this list)
- To optimize the M-LLM: use LoRA, QLoRA, or perform full fine-tuning.
- Warning: if you are not training jointly, then full fine-tuning may not perform as well as LoRA/QLoRA. I suspect this is because generative performance regresses more when tuning all weights of the model, and hence language understanding regresses, which is correlated with embedding performance.
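To make the pooling and loss pieces concrete, here's a rough sketch of a single joint training step. It is not the exact recipe from any of the papers above: the model name, instruction string, temperature, and loss weights are illustrative placeholders; last-token pooling is used (the no-bi-directional-attention case); and the contrastive loss uses simple in-batch negatives:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # any small decoder-only LLM works for this sketch
tok = AutoTokenizer.from_pretrained(model_name, padding_side="left")  # left padding keeps the last position a real token
if tok.pad_token is None:
    tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

def embed(texts, instruction="Represent this sentence for retrieval:"):
    # Last-token pooling over the final hidden layer (causal attention, no architectural change).
    inputs = tok([f"{instruction} {t}" for t in texts], return_tensors="pt", padding=True)
    hidden = model(**inputs, output_hidden_states=True).hidden_states[-1]  # (B, T, d)
    return F.normalize(hidden[:, -1, :], dim=-1)                           # (B, d)

def info_nce(q, d, temperature=0.05):
    # In-batch negatives: the i-th query's positive is the i-th document, every other document is a negative.
    logits = (q @ d.T) / temperature
    labels = torch.arange(q.size(0))
    return F.cross_entropy(logits, labels)

# One joint step, GRIT-style: L = lambda_Rep * L_Rep + lambda_Gen * L_Gen
queries = ["how do I read a file in Python?", "what is the capital of France?"]
docs = ["Use open('path') together with .read() to load a file's contents.",
        "Paris is the capital and largest city of France."]
l_rep = info_nce(embed(queries), embed(docs))

gen_batch = tok(["def read_file(path):\n    return open(path).read()"], return_tensors="pt")
l_gen = model(**gen_batch, labels=gen_batch["input_ids"]).loss  # next-token prediction loss

loss = 1.0 * l_rep + 1.0 * l_gen  # lambda_Rep = lambda_Gen = 1.0, purely illustrative
loss.backward()
```

In practice you'd wrap the model with LoRA/QLoRA (or fine-tune fully), mine hard negatives rather than relying only on in-batch ones, and optionally switch to bi-directional attention with mean pooling for the embedding pass.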
MTEB serves as the standard benchmark for evaluating embedding quality across a wide variety of tasks and datasets. You can take a deeper look for yourself on the leaderboard.
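If you'd like to check a model on a single task yourself, the mteb Python package can run tasks locally; here's a minimal sketch (the task and model below are just illustrative, and the exact API may differ slightly between mteb versions):

```python
from mteb import MTEB
from sentence_transformers import SentenceTransformer

# Any model exposing an encode(list_of_texts) -> embeddings interface works;
# a SentenceTransformer model is the path of least resistance.
model = SentenceTransformer("all-mpnet-base-v2")

evaluation = MTEB(tasks=["SciFact"])  # a small retrieval task, relatively quick to run
results = evaluation.run(model, output_folder="mteb_results")
print(results)
```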
Here's a comparison of models for Retrieval in MTEB for Text-to-Text Retrieval:
1. GritLM is based on Mistral-7B from 2023 and is now deprecated and retired.
2. This model is gte-Qwen fine-tuned further, showing that dataset quality and quantity matter.
3. These results are not published on the MTEB leaderboard, but are shown on the corresponding model page.
Here's how different multi-modal model approaches compare for Text-to-Image (T<->I) Retrieval (in any direction):
Unfortunately, GME results are not available on the public leaderboard, but if you trust the results on their HF page, GME performs better than E5-V by roughly 20% for T->I and I->T retrieval (English).
Interestingly, the zero-shot classification performance of E5-V is much weaker than CLIP's, but its multi-lingual text-to-image retrieval performance is significantly higher than the CLIP alternative's, likely due to the stronger language model or simply the LLM's pre-training dataset.
To conclude, decoder-only transformers can be trained into strong SoTA encoders. You don't need to depend on proprietary models: you could train an encoder using the plethora of open-source M-LLMs released by research labs as a base. The model architecture and weights you end up using are dependent on your requirements and compute constraints, so evaluate it for yourself.
If you're still using SBERT (e.g. all-mpnet-base-v2), consider a decoder-only transformer as an encoder instead.