Hybrid Siamese Masked Autoencoders as Unsupervised Video Summarizer

  • Yifei Xu
  • Zaiqiang Wu
  • Li Li
  • Siqi Li
  • Wenlong Li
  • Mingqi Li
  • Yuan Rao
  • Shuiguang Deng

Research output: Contribution to journal › Article › peer-review

Abstract

Video summarization aims to extract the most important information from a source video while retaining its primary content. In practice, unsupervised video summarizers are valued for their flexibility and for requiring no annotated data. However, they still have to find definite rules for judging how essential each frame is to the summary. Unlike conventional frame-based scoring methods, we propose a shot-level unsupervised video summarizer, termed Hybrid Siamese Masked Autoencoders (H-SMAE), that works from a higher semantic perspective. Specifically, our method consists of Multi-view Siamese Masked Autoencoders (MV-SMAE) and a Shot Diversity Enhancer (SDE). MV-SMAE recovers the masked shots from the original frame features and three unmasked shot subsets using Siamese masked autoencoders. Inspired by the masking idea in MAE, MV-SMAE introduces a Siamese architecture to model prior references that guide the reconstruction of masked shots. In addition, SDE improves the diversity of the generated summary by minimizing the repelling loss among selected shots. Afterward, the two modules are fused, and a 0-1 knapsack algorithm is applied to produce a video summary. Experiments on two challenging and diverse datasets demonstrate that our approach outperforms other state-of-the-art unsupervised and weakly-supervised methods, and even achieves results comparable to several strong supervised methods.
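
The abstract names a 0-1 knapsack step for turning shot-level scores into a summary but gives no details. The following is a minimal sketch of how such a selection could look, not the authors' implementation: the function name `select_shots`, the per-shot `scores`, the shot `lengths` in frames, and the frame `budget` are all illustrative assumptions that do not appear in the source.

```python
from typing import List, Sequence


def select_shots(scores: Sequence[float],
                 lengths: Sequence[int],
                 budget: int) -> List[int]:
    """Pick shot indices maximizing total score within a length budget.

    Hypothetical helper: `scores` are per-shot importance values,
    `lengths` are shot durations in frames, and `budget` is the
    maximum summary length in frames (all assumed, not from the paper).
    """
    n = len(scores)
    # best[i][c] = best total score using only the first i shots
    # with at most c frames available.
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        weight, value = lengths[i - 1], scores[i - 1]
        for c in range(budget + 1):
            best[i][c] = best[i - 1][c]          # skip shot i-1
            if weight <= c:
                candidate = best[i - 1][c - weight] + value
                if candidate > best[i][c]:       # take shot i-1
                    best[i][c] = candidate

    # Walk the table backwards to recover which shots were taken.
    chosen: List[int] = []
    c = budget
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            chosen.append(i - 1)
            c -= lengths[i - 1]
    chosen.reverse()
    return chosen


if __name__ == "__main__":
    # Made-up scores and lengths, used only to exercise the function.
    picked = select_shots([0.9, 0.4, 0.7, 0.2], [40, 30, 50, 20], budget=90)
    print(picked)
```

Each shot is either kept whole or dropped, which is what makes this the 0-1 variant of the problem; the dynamic program runs in time proportional to the number of shots times the budget.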

Original language: English
Pages (from-to): 9487-9501
Number of pages: 15
Journal: IEEE Transactions on Circuits and Systems for Video Technology
Volume: 35
Issue number: 9
State: Published - 2025

Keywords

  • Siamese masked autoencoder
  • Unsupervised video summarizer
  • Repelling loss
  • Shot diversity enhancer
