Authors: Hai Huang, Yann LeCun, Randall Balestriero
Paper: https://arxiv.org/abs/2509.14252
Code: https://github.com/rbalestr-lab/llm-jepa
WHAT was done? The authors introduce LLM-JEPA, a novel training objective that integrates Joint Embedding Predictive Architectures (JEPAs)—a highly successful paradigm from computer vision—into the training of Large Language Models (LLMs). This hybrid approach complements the standard next-token prediction loss with a JEPA objective that learns to predict the embedding of one "view" of data (e.g., a code snippet) from another related view (e.g., its natural language description). Being able to obtain non-trivial views like this is crucial to the success of JEPA objectives.
WHY it matters? This work successfully bridges a long-standing gap between training methodologies in vision and language. By moving beyond purely input-space reconstruction, LLM-JEPA enables models to learn richer, more structured, and abstract representations. The empirical results are compelling: LLM-JEPA significantly boosts performance across various models and tasks, improves robustness to overfitting, and accelerates convergence in parameter-efficient fine-tuning (PEFT). It represents a promising new direction for developing more capable and efficient LLMs.
Note: see also other posts on the JEPA family of architectures: JEPA for time series, and the video models V-JEPA and V-JEPA 2.
The landscape of self-supervised learning has long been marked by a curious divergence. In computer vision, Joint Embedding Predictive Architectures (JEPAs) have demonstrated remarkable success by learning abstract representations without needing to reconstruct raw pixels (https://arxiv.org/abs/2301.08243, https://arxiv.org/abs/2404.08471). In contrast, Large Language Models (LLMs) have been almost exclusively trained with input-space objectives like next-token prediction. This raises a critical question: can language models benefit from the same embedding-space learning principles that have propelled vision models forward?
This paper provides a resounding "yes" with the introduction of LLM-JEPA, a framework that successfully adapts the JEPA paradigm to LLMs. The work presents a compelling narrative of cross-domain inspiration, demonstrating that by combining generative and predictive objectives, we can build more powerful and robust language models.
The core innovation is a hybrid training objective that enhances an LLM's "abstraction capabilities" without sacrificing its "generative capabilities." The proposed loss function ℒ_LLM-JEPA is a weighted sum of two components:
Standard LLM Loss (ℒ_LLM): The first term is the familiar autoregressive cross-entropy loss for next-token prediction, ensuring the model remains a proficient text generator.
JEPA Loss (ℒ_JEPA): The second term is the novel JEPA objective, which operates entirely in the embedding space; the combined objective is written out just below.
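Putting the two terms together, the objective can be summarized as follows (using the notation introduced in the list below; λ is the weighting hyperparameter tuned later in the experiments):

ℒ_LLM-JEPA = ℒ_LLM + λ · d(Pred(Enc(Text)), Enc(Code))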
The JEPA component is built on a few key concepts:
Views: The method requires data that provides multiple perspectives, or "views," of the same underlying concept. The authors astutely identify datasets where this naturally occurs, such as pairs of natural language descriptions (Text) and their corresponding code implementations (Code), like SQL queries or regular expressions (Figure 2).
Encoder (Enc): The LLM itself serves as the encoder; the embedding of a sequence is the last-layer hidden state of its final token. To avoid complex, architecture-specific modifications, the embeddings of the two views are obtained in separate forward passes: feeding the concatenation [Text, Code] in a single pass would be more efficient, but it would require modifying the self-attention mask to prevent cross-view interaction, which is specific to each LLM architecture.
Predictor (Pred): A special [PRED] token is appended to the input to prompt the model to perform the prediction task in its embedding space. This is a particularly elegant design choice, as it avoids adding a separate, parameter-heavy predictor network and instead reuses the LLM itself. In practice, k ∈ {0, …, K} predictor tokens are appended to the input prompt, and the embedding of the last predictor token is taken as Pred(Enc(·)); when k = 0 the predictor is the identity, i.e., Pred(x) = x.
Metric (d): The distance between the predicted and target embeddings is measured with cosine similarity. A minimal sketch of how these pieces fit together follows this list.
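To make the mechanics concrete, here is a minimal PyTorch-style sketch of the JEPA term. It is not the authors' implementation: it assumes a Hugging Face causal LM, a hypothetical [PRED] token added to the tokenizer, a toy (Text, Code) pair, and 1 − cosine similarity as the distance.

```python
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-1B-Instruct"  # any causal LM works for this sketch
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical predictor token; the paper reuses the LLM itself as Pred.
tok.add_special_tokens({"additional_special_tokens": ["[PRED]"]})
model.resize_token_embeddings(len(tok))
k = 2  # number of [PRED] tokens appended (k = 0 would make Pred the identity)

def last_token_embedding(text: str):
    """Enc(.): last-layer hidden state of the final token of the sequence."""
    ids = tok(text, return_tensors="pt")
    out = model(**ids, output_hidden_states=True)
    return out.hidden_states[-1][0, -1]  # shape: (hidden_dim,)

# Two views of the same underlying concept (a toy NL-RX-style pair).
text_view = "lines containing the word 'dog' followed by a digit"
code_view = r".*dog[0-9].*"

# Pred(Enc(Text)): append k [PRED] tokens and read off the last one's embedding.
pred_emb = last_token_embedding(text_view + " " + "[PRED] " * k)
# Enc(Code): a separate forward pass through the same LLM for the other view.
code_emb = last_token_embedding(code_view)

# d(., .): one common cosine-similarity-based distance between the embeddings.
jepa_loss = 1.0 - F.cosine_similarity(pred_emb, code_emb, dim=0)

# Full objective: the usual next-token cross-entropy plus the weighted JEPA term,
# e.g. total_loss = llm_loss + lam * jepa_loss, with lam the weighting hyperparameter.
print(float(jepa_loss))
```

Both views are encoded by the same weights in two separate passes, matching the encoder described above; whether and how gradients flow through the target view is an implementation detail this sketch does not pin down.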
By forcing the model to predict an embedding rather than raw tokens, the JEPA objective encourages it to distill the core semantic essence of the input, filtering out irrelevant surface-level details and focusing on the abstract concepts that are invariant across both Text and Code views.
The authors conduct an extensive empirical evaluation across a wide range of models (Llama3, Gemma2, OpenELM, OLMo) and datasets (NL-RX, GSM8K, Spider). The results consistently validate the superiority of the LLM-JEPA approach.
During fine-tuning, for each (model, dataset) pair, they first search for the best learning rate lr ∈ {1e−5, 2e−5, 4e−5, 8e−5} based on the best accuracy achievable with ℒ_LLM alone after 4 epochs. They then tune the hyperparameters specific to ℒ_LLM-JEPA, k and λ, over the two-dimensional grid (k, λ) ∈ {0, 1, 2, 3, 4} × {0.5, 1, 2, 4}.
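As a rough illustration of this two-stage sweep (a sketch only; run_finetune is a hypothetical stand-in for whatever training and evaluation harness is actually used):

```python
from itertools import product

def run_finetune(lr, k=0, lam=0.0, epochs=4) -> float:
    """Hypothetical stand-in: fine-tune with the given settings, return val accuracy."""
    return 0.0  # placeholder

learning_rates = [1e-5, 2e-5, 4e-5, 8e-5]
ks, lambdas = [0, 1, 2, 3, 4], [0.5, 1, 2, 4]

# Stage 1: pick the learning rate by the best accuracy of the plain L_LLM objective.
best_lr = max(learning_rates, key=lambda lr: run_finetune(lr))

# Stage 2: with lr fixed, grid-search the LLM-JEPA-specific hyperparameters (k, lambda).
best_k, best_lam = max(product(ks, lambdas),
                       key=lambda kl: run_finetune(best_lr, k=kl[0], lam=kl[1]))
```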
For pretraining, they train Llama-3.2-1B-Instruct from randomly initialized weights on the NL-RX-SYNTH dataset.
Improved Performance: LLM-JEPA significantly outperforms the standard LLM training baseline across all tested models and datasets, in both fine-tuning and pre-training scenarios (Figure 1, Table 1, Table 8).
Fundamental Representation Improvement: The benefits are not just limited to the training task. Models pre-trained with LLM-JEPA show improved performance on downstream fine-tuning tasks even when using a standard objective, demonstrating that the JEPA objective instills a more robust and transferable foundation in the model's weights (Table 4).
Robustness to Overfitting: One of the most striking findings is LLM-JEPA's resistance to overfitting. During LoRA fine-tuning, the baseline model's performance plateaus and degrades, whereas LLM-JEPA continues to improve with more training epochs (Figure 5).
Enhanced Representation Structure: The JEPA objective acts as a powerful regularizer, inducing a more organized representation space. t-SNE visualizations show that LLM-JEPA arranges Text and Code embeddings into clear, corresponding clusters, a structure that is disrupted by standard fine-tuning (Figure 6).
Further analysis reveals that the mapping between Text and Code embeddings is constrained to a narrow, near-linear subspace, confirming that the model learns a highly structured relationship between the views (Figure 7, Table 10).
Faster and Better PEFT: LLM-JEPA accelerates convergence during LoRA fine-tuning. At a LoRA rank of 512, it achieves accuracy comparable to full fine-tuning, a level the baseline method fails to reach even with the same parameter budget (Table 3).
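For reference, the rank-512 LoRA regime mentioned above could be configured roughly as follows with the Hugging Face peft library (an illustrative sketch, not the paper's exact setup; the target modules and alpha value are assumptions):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

# Rank-512 adapters: far larger than the usual r = 8-64, approaching the capacity
# regime where LLM-JEPA reportedly matches full fine-tuning accuracy (Table 3).
lora_cfg = LoraConfig(
    r=512,
    lora_alpha=512,                        # illustrative scaling choice
    target_modules=["q_proj", "v_proj"],   # illustrative subset of projections
    lora_dropout=0.0,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()
```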
The primary hurdle for LLM-JEPA to overcome is the current computational overhead. The need for three forward passes—one for the generative loss and one for each view's embedding—results in a roughly 3-fold increase in training compute. This is the key challenge that researchers will need to address to make this promising architecture viable for training state-of-the-art models at scale. The authors' proposed future work on single-pass training using attention masking represents a critical next step in this journey.
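To illustrate what such a single-pass scheme could look like (purely a sketch of the masking idea, not the authors' planned implementation), one could build an attention mask over the concatenated [Text, Code] sequence that stays causal within each view but forbids attention across views, so that both view embeddings come out of one forward pass:

```python
import torch

def two_view_causal_mask(len_text: int, len_code: int) -> torch.Tensor:
    """Boolean attention mask (True = may attend) for the concatenation [Text, Code]:
    causal within each view, no attention across views."""
    n = len_text + len_code
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:len_text, :len_text] = torch.tril(
        torch.ones(len_text, len_text, dtype=torch.bool))
    mask[len_text:, len_text:] = torch.tril(
        torch.ones(len_code, len_code, dtype=torch.bool))
    return mask

print(two_view_causal_mask(3, 2).int())
```

Note that the generative term still needs the Code tokens to attend to the Text prompt, so reconciling all three loss terms in a single pass is presumably why the authors treat this as future work rather than a drop-in change.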
Looking ahead, the authors plan to scale their experiments and explore data augmentation techniques to generate non-trivial views for any dataset, which would unlock the application of LLM-JEPA beyond the specialized datasets used in this work.
This paper presents a well-executed and significant contribution to the field of language model training. By thoughtfully adapting the JEPA framework from vision, the authors have developed a method that demonstrably improves the performance, robustness, and representation quality of LLMs. LLM-JEPA is more than an incremental improvement; it opens up a new avenue for research that moves beyond the dominant paradigm of input-space reconstruction. This work offers a valuable and practical step towards building more abstractly capable and efficient AI systems.