Text-Conditional JEPA for Learning Semantically Rich Visual Representations
Research areas: Computer Vision; Methods and Algorithms
Conference: ICML
Content type: paper (published May 2026)
Authors: Chen Huang, Xianhang Li, Vimal Thilak, Etai Littwin, Josh Susskind
Image-based Joint-Embedding Predictive Architecture (I-JEPA) offers a promising approach to visual self-supervised learning through masked feature prediction. However, due to the inherent visual uncertainty at masked positions, feature prediction remains challenging and may fail to learn semantic representations. In this work, we propose Text-Conditional JEPA (TC-JEPA), which uses image captions to reduce prediction uncertainty. Specifically, we modulate the predicted patch features with a fine-grained text conditioner that computes sparse cross-attention over the input text tokens. With such conditioning, patch features become predictable as a function of text and are thus more semantically meaningful. We show that TC-JEPA improves downstream performance and training stability, with promising scaling properties. TC-JEPA also offers a new vision-language pretraining paradigm based on feature prediction alone, outperforming contrastive methods on diverse tasks, especially those requiring fine-grained visual understanding and reasoning.
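The abstract's key mechanism, sparse cross-attention from predicted patch features over text tokens, can be illustrated with a minimal numpy sketch. All names, shapes, the top-k sparsity rule, and the additive modulation below are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def text_conditioned_modulation(patch_feats, text_feats, top_k=2):
    """Illustrative sketch: each predicted patch feature attends sparsely
    (top-k text tokens only) over text token features, and the attended
    text context modulates the patch feature additively.
    Shapes: patch_feats (P, D), text_feats (T, D)."""
    d = patch_feats.shape[1]
    scores = patch_feats @ text_feats.T / np.sqrt(d)      # (P, T)
    # Sparsify: keep only the top-k text tokens per patch, mask the rest.
    kth = np.sort(scores, axis=1)[:, -top_k][:, None]     # k-th largest per row
    scores = np.where(scores >= kth, scores, -np.inf)
    attn = softmax(scores, axis=1)                        # rows sum to 1
    context = attn @ text_feats                           # (P, D) text context
    return patch_feats + context                          # additive modulation

rng = np.random.default_rng(0)
patches = rng.standard_normal((4, 8))   # 4 predicted patch features
tokens = rng.standard_normal((3, 8))    # 3 caption token features
out = text_conditioned_modulation(patches, tokens, top_k=2)
print(out.shape)  # (4, 8)
```

In this sketch the sparsity simply zeroes attention to all but the k most relevant text tokens per patch; a learned conditioner would also include projection weights and a gating or FiLM-style modulation rather than a plain additive update.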
Related readings and updates.
Rethinking JEPA: Compute-Efficient Video SSL with Frozen Teachers
October 8, 2025
Research areas: Computer Vision; Methods and Algorithms
Conference: ICLR
Video Joint Embedding Predictive Architectures (V-JEPA) learn generalizable off-the-shelf video representations by predicting masked regions in latent space with an exponential moving average (EMA)-updated teacher. While EMA prevents representation collapse, it complicates scalable model selection and couples teacher and student architectures. We revisit masked-latent prediction and show that a frozen teacher suffices. Concretely, we (i) train a…
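The EMA teacher update mentioned above is the standard slow-moving-average rule from self-distillation. A minimal sketch, with parameter names and the momentum value chosen for illustration only:

```python
import numpy as np

def ema_update(teacher, student, momentum=0.996):
    """EMA teacher update used in JEPA-style self-distillation (sketch):
    each teacher parameter drifts slowly toward the student's, so the
    teacher provides stable prediction targets without gradient updates."""
    return {name: momentum * teacher[name] + (1.0 - momentum) * student[name]
            for name in teacher}

# Toy parameters: teacher starts at 0, student at 1.
teacher = {"w": np.zeros(3)}
student = {"w": np.ones(3)}
teacher = ema_update(teacher, student, momentum=0.9)
print(teacher["w"])  # roughly [0.1 0.1 0.1]
```

Freezing the teacher, as the paper above proposes, amounts to skipping this update entirely after initialization.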
How JEPA Avoids Noisy Features: The Implicit Bias of Deep Linear Self Distillation Networks
November 20, 2024
Research areas: Computer Vision; Methods and Algorithms
Conference: NeurIPS
Two competing paradigms exist for self-supervised learning of data representations. Joint Embedding Predictive Architecture (JEPA) is a class of architectures in which semantically similar inputs are encoded into representations that are predictive of each other. A recent successful approach that falls under the JEPA framework is self-distillation, where an online encoder is trained to predict the output of the target encoder, sometimes using a…
Source: https://arxiv.org/abs/2605.03245
Computer Science > Machine Learning
arXiv:2605.03245 (cs)
[Submitted on 5 May 2026]
Title:Text-Conditional JEPA for Learning Semantically Rich Visual Representations
Authors:Chen Huang, Xianhang Li, Vimal Thilak, Etai Littwin, Josh Susskind
| Comments: | ICML 2026 |
|---|---|
| Subjects: | Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV) |
| Cite as: | arXiv:2605.03245 [cs.LG] (or arXiv:2605.03245v1 [cs.LG] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2605.03245 (arXiv-issued DOI via DataCite, pending registration) |
Submission history
From: Chen Huang
[v1] Tue, 5 May 2026 00:26:57 UTC (3,083 KB)