
Disentangle to edit: instruction-guided latent manipulation for 3D facial video consistency

  • Peixu Zhang
  • Zhaoxi Mu
  • Shulei Ji
  • Wang Xi
  • Xinyu Yang
  • Xi'an Jiaotong University
  • Zhejiang University

Research output: Contribution to journal › Article › peer-review

Abstract

3D facial video editing has promising applications but is hindered by two challenges: a lack of spatio-temporal consistency and limited adaptability to natural language inputs. To address them, we propose the disentangle to edit (D-to-E) framework. First, facial codes disentanglement inversion separates identity codes from motion codes, so that identity attributes can be edited while facial motion is preserved, which improves the temporal consistency of the results. Second, a diffusion-based facial code editor extends 2D editing to 3D, enabling flexible editing of identity codes through natural language guidance. Furthermore, we introduce an identity-structure preservation mechanism to enhance the spatial consistency of the results. Extensive experiments demonstrate that D-to-E can effectively perform spatio-temporally consistent multi-view facial video editing from natural language instructions.

Original language: English
Article number: 221
Journal: Multimedia Systems
Volume: 32
Issue number: 3
State: Published - Jun 2026

Keywords

  • 3D facial video editing
  • Diffusion model
  • Feature disentanglement
  • Video generation
