Agentic AI for Visual Media

Zoom Live

Location

Mile High 1EF, Denver, Colorado

Date and time

Wednesday, June 3, 2026

Admission

In conjunction with CVPR 2026

Workshop on Agentic AI for Visual Media

About Workshop

Recent advances in image and video processing, editing, and generation have transformed how visual content is created and consumed. While individual models excel at specific tasks, real-world workflows often require flexible composition, adaptive reasoning, and multi-step decision-making—capabilities that remain largely human-driven.

The rise of large language models (LLMs), multimodal LLMs, and agentic AI systems has made it possible to develop autonomous, tool-using agents that can orchestrate perception, generation, editing, and evaluation in unified pipelines. This paradigm shift opens new opportunities for creative media, industrial vision applications, and interactive systems.

The Workshop on Agentic AI for Visual Media will bring together researchers and practitioners at the intersection of computer vision, generative modeling, multimodal AI, and agent systems. Our goal is to explore cutting-edge research, practical deployments, evaluation methods, and future challenges in building next-generation vision agents.

Timeline

2026-06-03

Workshop Date. Wednesday, June 3, 2026, in conjunction with CVPR 2026 in Denver, Colorado.

2026-05

Keynote Lineup Updated. Seven keynote speakers confirmed (details for one afternoon talk to be announced). See workshop schedule.

2026-03-31

Notification of Acceptance. Authors will be notified of paper decisions.

2026-03-20

Paper Submission Deadline. Last day to submit long papers and extended abstracts.

2026-03

Call for Papers Open. Submissions are open on OpenReview. See submission guidelines.

2026-03

Sponsorship. Our workshop received generous sponsorship from Snap Inc., Adobe, and Tencent.

2026-01

Workshop Website Launched. Official website for the Workshop on Agentic AI for Visual Media is now live.

Keynote Speakers (Tentative)

Location: Mile High 1EF

Prof. Zhengzhong Tu

Assistant Professor, Texas A&M University

Prof. Ranjay Krishna

University of Washington

Prof. Manling Li

Northwestern University

Christine Hu

Co-founder & CEO, Philo Labs

Dr. Jack Parker-Holder

Google DeepMind

Dr. Yeying Jin

Tencent

Prof. Xihui Liu

University of Hong Kong

Workshop Schedule

Full-day program on Wednesday, June 3, 2026, at CVPR 2026, with seven keynote talks and afternoon sessions.

Workshop Day — Wednesday, June 3, 2026

Morning

8:50 AM - 9:00 AM

Opening

Welcome remarks and overview of the workshop program.

9:00 AM - 9:40 AM

From Pixels to Systems: Agentic Visual Intelligence for Real-World Computer Vision

Keynote talk.

Speaker

Prof. Zhengzhong Tu

Assistant Professor, Texas A&M University

9:40 AM - 10:20 AM

From Vision to Action: Extracting Structure and Agency from Flat Pixels

Keynote talk.

Speaker

Prof. Ranjay Krishna

University of Washington

10:20 AM - 10:50 AM

Coffee Break

Refreshments and networking.

10:50 AM - 11:30 AM

Interactive and Multimodal Visual Generation towards World Models

Keynote talk.

Speaker

Prof. Xihui Liu

University of Hong Kong

11:30 AM - 12:10 PM

Agent and World, in One Model

Most of the field treats agent and world model research as two programs. We don't. A world model is an agent whose policy you called dynamics. An agent is a world model you queried for actions. The cut between them is something you draw.

This talk covers:

How we're improving VLMs' agentic capabilities in video AI.
How we're improving video generation models as environments.
What happens at the seam, and why we think this is also the right framing for safety.

Speaker

Christine Hu

Co-founder & CEO, Philo Labs

Afternoon

12:10 PM - 1:45 PM

Lunch

Lunch break on your own or with fellow attendees.

1:45 PM - 2:25 PM

Keynote

Talk details to be announced.

Speaker

Dr. Jack Parker-Holder

Google DeepMind

2:25 PM - 3:05 PM

Game World Model

Keynote talk.

Speaker

Dr. Yeying Jin

Tencent

3:05 PM - 3:45 PM

Failure Modes of VLM Agents: A Reinforcement Learning Perspective

The failures of a reinforcement-learned agent are not random bugs to be patched away; they are systematic signatures of how the agent was trained. This talk reads recent vision-language agents through their failure modes, organized along three axes. The first is reasoning collapse, where optimizing for reward erodes the very multi-step reasoning that made the agent capable. The second is multi-turn instability, which appears where action unfolds over many steps, whether exploring a scene or generating an image stroke by stroke; reinforcing world-model reasoning across turns helps, yet the agent stays fragile. The third is unsafe planning, where language-driven control introduces systematic safety risks in the physical world. I will close with embodied priors and online reinforcement learning as one route toward agents that fail less by design.