Agentic AI for Visual Media
Location
Mile High 1EF, Denver, Colorado
Date and time
Wednesday, June 3, 2026
Admission
In conjunction with CVPR 2026
About Workshop
Recent advances in image and video processing, editing, and generation have transformed how visual content is created and consumed. While individual models excel at specific tasks, real-world workflows often require flexible composition, adaptive reasoning, and multi-step decision-making—capabilities that remain largely human-driven.
The rise of large language models (LLMs), multimodal LLMs, and agentic AI systems has made it possible to develop autonomous, tool-using agents that can orchestrate perception, generation, editing, and evaluation in unified pipelines. This paradigm shift opens new opportunities for creative media, industrial vision applications, and interactive systems.
The Workshop on Agentic AI for Visual Media will bring together researchers and practitioners at the intersection of computer vision, generative modeling, multimodal AI, and agent systems. Our goal is to explore cutting-edge research, practical deployments, evaluation methods, and future challenges in building next-generation vision agents.
Timeline
Workshop Date. Wednesday, June 3, 2026, in conjunction with CVPR 2026 in Denver, Colorado.
Keynote Lineup Updated. Seven keynote speakers confirmed (details of one afternoon talk TBA). See workshop schedule.
Notification of Acceptance. Authors will be notified of paper decisions.
Paper Submission Deadline. Last day to submit long papers and extended abstracts.
Call for Papers Open. Submissions are open on OpenReview. See submission guidelines.
Sponsorship. Our workshop received generous sponsorship from Snap Inc., Adobe, and Tencent.
Workshop Website Launched. Official website for the Workshop on Agentic AI for Visual Media is now live.
Keynote Speakers (Tentative)
Location: Mile High 1EF
Prof. Zhengzhong Tu
Assistant Professor, Texas A&M University
Prof. Ranjay Krishna
University of Washington
Prof. Manling Li
Northwestern University
Christine Hu
Co-founder & CEO, Philo Labs
Dr. Jack Parker-Holder
Google DeepMind
Dr. Yeying Jin
Tencent
Prof. Xihui Liu
University of Hong Kong
Workshop Schedule
Full-day program on Wednesday, June 3, 2026, at CVPR 2026, with seven keynote talks and an afternoon poster session.
Workshop Day — Wednesday, June 3, 2026
Morning
8:50 AM - 9:00 AM
Opening
Welcome remarks and overview of the workshop program.
9:00 AM - 9:40 AM
From Pixels to Systems: Agentic Visual Intelligence for Real-World Computer Vision
Keynote talk.
Speaker
Prof. Zhengzhong Tu
Assistant Professor, Texas A&M University
9:40 AM - 10:20 AM
From Vision to Action: Extracting Structure and Agency from Flat Pixels
Keynote talk.
Speaker
Prof. Ranjay Krishna
University of Washington
10:20 AM - 10:50 AM
Coffee Break
Refreshments and networking.
10:50 AM - 11:30 AM
Interactive and Multimodal Visual Generation towards World Models
Keynote talk.
Speaker
Prof. Xihui Liu
University of Hong Kong
11:30 AM - 12:10 PM
Agent and World, in One Model
Most of the field treats agent and world model research as two programs. We don't. A world model is an agent whose policy you called dynamics. An agent is a world model you queried for actions. The cut between them is something you draw.
This talk covers:
- How we're improving VLMs' agentic capabilities in video AI.
- How we're improving video gen models as environments.
- What happens at the seam, and why we think this is also the right framing for safety.
Speaker
Christine Hu
Co-founder & CEO, Philo Labs
Afternoon
12:10 PM - 1:45 PM
Lunch
Lunch break on your own or with fellow attendees.
1:45 PM - 2:25 PM
Keynote
Talk details to be announced.
Speaker
Dr. Jack Parker-Holder
Google DeepMind
2:25 PM - 3:05 PM
Game World Model
Keynote talk.
Speaker
Dr. Yeying Jin
Tencent
3:05 PM - 3:45 PM
Failure Modes of VLM Agents: A Reinforcement Learning Perspective
The failures of a reinforcement-learned agent are not random bugs to be patched away; they are systematic signatures of how the agent was trained. This talk reads recent vision-language agents through their failure modes, organized along three axes. The first is reasoning collapse, where optimizing for reward erodes the very multi-step reasoning that made the agent capable. The second is multi-turn instability, where reinforcing world-model reasoning across turns helps yet stays fragile, and where action unfolds over many steps, whether exploring a scene or generating an image stroke by stroke. The third is unsafe planning, where language-driven control introduces systematic safety risks in the physical world. I will close with embodied priors and online reinforcement learning as one route toward agents that fail less by design.
Speaker
Prof. Manling Li
Northwestern University
3:45 PM - 4:15 PM
Coffee Break
Refreshments and networking.
4:15 PM - 5:40 PM
Poster Session
Poster viewing for accepted workshop papers.
5:40 PM - 5:50 PM
Closing
Closing remarks and thank you.
Contact
Get in touch with any questions about the workshop.