Text-Conditional JEPA for Learning Semantically Rich Visual Representations

05-07 08:00

Researchers propose the Text-Conditional Joint-Embedding Predictive Architecture (TC-JEPA), which introduces image captions as conditioning information to reduce the visual uncertainty inherent in masked feature prediction. The method uses a fine-grained text conditioner that computes sparse cross-attention over the input text tokens to modulate the predicted image patch features. Compared with I-JEPA, which relies on masked feature prediction alone, TC-JEPA learns semantically richer visual representations, addressing the weak semantic learning caused by visual uncertainty in the original approach.

Original content


Research area: Computer Vision; Methods and Algorithms · Conference: ICML

Content type: paper · Published: May 2026

Text-Conditional JEPA for Learning Semantically Rich Visual Representations

AuthorsChen Huang, Xianhang Li, Vimal Thilak, Etai Littwin, Josh Susskind


Image-based Joint-Embedding Predictive Architecture (I-JEPA) offers a promising approach to visual self-supervised learning through masked feature prediction. However, with the inherent visual uncertainty at masked positions, feature prediction remains challenging and may fail to learn semantic representations. In this work, we propose Text-Conditional JEPA (TC-JEPA), which uses image captions to reduce the prediction uncertainty. Specifically, we modulate the predicted patch features using a fine-grained text conditioner that computes sparse cross-attention over input text tokens. With such conditioning, patch features become predictable as a function of text and are thus more semantically meaningful. We show TC-JEPA improves downstream performance and training stability, with promising scaling properties. TC-JEPA also offers a new vision-language pretraining paradigm based on feature prediction only, outperforming contrastive methods on diverse tasks, especially those requiring fine-grained visual understanding and reasoning.
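The core mechanism the abstract describes — each predicted patch feature attends sparsely to caption tokens, and the resulting text context modulates the patch feature — can be sketched in plain NumPy. This is an illustrative top-k sparsification with FiLM-style modulation; the paper's actual sparsity scheme, learned projections, and architecture are not specified here, so all function names and choices below are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sparse_cross_attention(patch_feats, text_tokens, top_k=2):
    """Each predicted patch attends only to its top-k text tokens
    (a simple top-k sparsification; the paper's exact scheme may differ)."""
    scores = patch_feats @ text_tokens.T / np.sqrt(patch_feats.shape[-1])
    # keep only the top-k scores per patch, mask the rest to -inf
    kth = np.sort(scores, axis=-1)[:, -top_k][:, None]
    masked = np.where(scores >= kth, scores, -np.inf)
    attn = softmax(masked, axis=-1)
    return attn @ text_tokens  # per-patch text context

def modulate(patch_feats, text_ctx):
    """FiLM-style modulation of predicted patch features by text context.
    In a real model the scale/shift would come from learned projections."""
    gamma, beta = text_ctx, 0.1 * text_ctx  # illustrative stand-ins
    return patch_feats * (1.0 + gamma) + beta

rng = np.random.default_rng(0)
patches = rng.normal(size=(4, 8))   # 4 masked-patch predictions, dim 8
tokens = rng.normal(size=(6, 8))    # 6 caption token embeddings
ctx = sparse_cross_attention(patches, tokens, top_k=2)
out = modulate(patches, ctx)
print(out.shape)  # (4, 8)
```

The point of the sparsity is that a given patch should typically be explained by only a few caption words (e.g. "dog", "grass"), so restricting attention to the top-k tokens forces a fine-grained patch-to-word alignment rather than a diffuse sentence-level one.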

Related readings and updates.

Rethinking JEPA: Compute-Efficient Video SSL with Frozen Teachers

October 8, 2025 · Research area: Computer Vision; Methods and Algorithms · Conference: ICLR

Video Joint Embedding Predictive Architectures (V-JEPA) learn generalizable, off-the-shelf video representations by predicting masked regions in latent space with an exponential moving average (EMA)-updated teacher. While EMA prevents representation collapse, it complicates scalable model selection and couples teacher and student architectures. We revisit masked-latent prediction and show that a frozen teacher suffices. Concretely, we (i) train a…
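The two ingredients this blurb contrasts — an EMA-updated teacher versus a frozen one, both supervising a masked-latent prediction loss — can be written down in a few lines. This is a generic sketch of the training signal, not the paper's implementation; the momentum value and loss shape are illustrative assumptions.

```python
import numpy as np

def ema_update(teacher_w, student_w, momentum=0.996):
    """EMA teacher update (V-JEPA style): the teacher slowly tracks the
    student. A frozen teacher simply skips this step entirely."""
    return momentum * teacher_w + (1.0 - momentum) * student_w

def masked_latent_loss(student_pred, teacher_latents, mask):
    """L2 prediction loss computed on masked positions only.
    With a frozen teacher, teacher_latents is fixed for all of training."""
    diff = (student_pred - teacher_latents) ** 2
    return (diff * mask[:, None]).sum() / max(mask.sum(), 1)

rng = np.random.default_rng(1)
pred = rng.normal(size=(10, 16))    # student predictions for 10 patches
target = rng.normal(size=(10, 16))  # teacher latents (frozen or EMA)
mask = np.zeros(10); mask[:4] = 1   # 4 masked positions
loss = masked_latent_loss(pred, target, mask)
```

Dropping the EMA step is what decouples teacher and student: the teacher's outputs can be precomputed once, and any student architecture can be trained against them.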


How JEPA Avoids Noisy Features: The Implicit Bias of Deep Linear Self Distillation Networks

November 20, 2024 · Research area: Computer Vision; Methods and Algorithms · Conference: NeurIPS

Two competing paradigms exist for self-supervised learning of data representations. Joint Embedding Predictive Architecture (JEPA) is a class of architectures in which semantically similar inputs are encoded into representations that are predictive of each other. A recent successful approach that falls under the JEPA framework is self-distillation, where an online encoder is trained to predict the output of the target encoder, sometimes using a…
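The self-distillation setup described here — an online encoder trained to predict the output of a fixed target encoder — can be reduced to a toy deep-linear version that the paper's title alludes to. The sketch below is an assumption-laden miniature (single linear layer, gradient descent on squared error), meant only to show the online-predicts-target training signal.

```python
import numpy as np

# Toy linear self-distillation: the online encoder W is trained by
# gradient descent to predict the fixed target encoder V's outputs.
# No gradient flows into V (it plays the stop-gradient teacher role).
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))   # inputs
V = rng.normal(size=(5, 3))     # target encoder (fixed)
W = np.zeros((5, 3))            # online encoder, trained from zero

lr = 0.05
for _ in range(200):
    err = X @ W - X @ V                # prediction residual
    W -= lr * X.T @ err / len(X)       # gradient step on 0.5 * MSE

final_err = np.mean((X @ W - X @ V) ** 2)
```

In this linear toy the online encoder converges to match the target on the data; the paper's point (not reproduced here) is that *deep* linear self-distillation has an implicit bias that shapes which features are learned first.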




Source link: https://arxiv.org/abs/2605.03245

Computer Science > Machine Learning

arXiv:2605.03245 (cs)

[Submitted on 5 May 2026]


Comments: ICML 2026
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Cite as: arXiv:2605.03245 [cs.LG]
(or arXiv:2605.03245v1 [cs.LG] for this version)
DOI: https://doi.org/10.48550/arXiv.2605.03245

Submission history

From: Chen Huang
[v1] Tue, 5 May 2026 00:26:57 UTC (3,083 KB)
