Generating Action-conditioned Prompts for Open-vocabulary Video Action Recognition

  • Chengyou Jia
  • Minnan Luo
  • Xiaojun Chang
  • Zhuohang Dang
  • Mingfei Han
  • Mengmeng Wang
  • Guang Dai
  • Sizhe Dang
  • Jingdong Wang
  • Xi'an Jiaotong University
  • University of Science and Technology of China
  • Mohamed Bin Zayed University of Artificial Intelligence
  • University of Technology Sydney
  • Zhejiang University of Technology
  • State Grid Corporation of China
  • Baidu Inc

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

14 citations (Scopus)

Abstract

Exploring open-vocabulary video action recognition is a promising venture, which aims to recognize previously unseen actions within any arbitrary set of categories. Existing methods typically adapt pretrained image-text models to the video domain, capitalizing on their inherent strengths in generalization. A common thread among such methods is the augmentation of visual embeddings with temporal information to improve the recognition of seen actions. Yet, they compromise with standard less-informative action descriptions, thus faltering when confronted with novel actions. Drawing inspiration from human cognitive processes, we argue that augmenting text embeddings with human prior knowledge is pivotal for open-vocabulary video action recognition. To realize this, we innovatively blend video models with Large Language Models (LLMs) to devise Action-conditioned Prompts. Specifically, we harness the knowledge in LLMs to produce a set of descriptive sentences that contain distinctive features for identifying given actions. Building upon this foundation, we further introduce a multi-modal action knowledge alignment mechanism to align concepts in video and textual knowledge encapsulated within the prompts. Extensive experiments on various video benchmarks, including zero-shot, few-shot, and base-to-novel generalization settings, demonstrate that our method not only sets new SOTA performance but also possesses excellent interpretability.
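
The abstract does not say how the LLM-generated descriptions are scored against a video, so the following is only a minimal sketch under assumed conventions, not the authors' alignment mechanism. It assumes a CLIP-style setup in which a video and each descriptive sentence have already been encoded into vectors of the same dimension, and it swaps in a plain mean of cosine similarities as the scoring rule. The function names (`cosine`, `score_actions`, `predict`) and the toy embeddings are hypothetical and chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def score_actions(video_emb, descriptor_embs):
    """Score each action by the mean cosine similarity between the video
    embedding and the embeddings of that action's descriptive sentences.

    Assumption: mean pooling stands in for whatever alignment the paper uses.
    """
    return {
        action: sum(cosine(video_emb, d) for d in descs) / len(descs)
        for action, descs in descriptor_embs.items()
    }


def predict(video_emb, descriptor_embs):
    """Return the action label with the highest score."""
    scores = score_actions(video_emb, descriptor_embs)
    return max(scores, key=scores.get)


# Hypothetical toy embeddings: two descriptive sentences per action.
toy_descriptors = {
    "archery": [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]],
    "juggling": [[0.0, 1.0, 0.0], [0.0, 0.8, 0.2]],
}
toy_video = [0.9, 0.1, 0.0]

print(predict(toy_video, toy_descriptors))
```

Because an action is represented only by its descriptor embeddings, a new category can be added at inference time by supplying descriptions for it, which is the sense in which such a scheme is open-vocabulary. The per-descriptor similarities can also be inspected individually to see which descriptions drove a prediction.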

Original language: English
Title of host publication: MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia
Publisher: Association for Computing Machinery, Inc
Pages: 4640-4649
Number of pages: 10
ISBN (Electronic): 9798400706868
DOI
Publication status: Published - 28 Oct 2024
Event: 32nd ACM International Conference on Multimedia, MM 2024 - Melbourne, Australia
Duration: 28 Oct 2024 to 1 Nov 2024

Publication series

Name: MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia

Conference

Conference: 32nd ACM International Conference on Multimedia, MM 2024
Country/Territory: Australia
City: Melbourne
Period: 28/10/24 to 1/11/24
