Abstract
3D facial video editing has promising applications but is hindered by poor spatio-temporal consistency and limited adaptability to natural language inputs. To address these challenges, we propose the disentangle to edit (D-to-E) framework. First, facial code disentanglement inversion separates identity codes from motion codes, so that identity attributes can be edited while facial motion is preserved, which improves the temporal consistency of the results. Second, a diffusion-based facial code editor extends 2D editing to 3D, enabling flexible editing of identity codes through natural language guidance. Furthermore, we introduce an identity-structure preservation mechanism to enhance the spatial consistency of the results. Extensive experiments demonstrate that D-to-E can effectively perform spatio-temporally consistent multi-view facial video editing from natural language instructions.
| Original language | English |
|---|---|
| Article number | 221 |
| Journal | Multimedia Systems |
| Volume | 32 |
| Issue number | 3 |
| DOIs | |
| State | Published - Jun 2026 |
Keywords
- 3D facial video editing
- Diffusion model
- Feature disentanglement
- Video generation
Fingerprint
Dive into the research topics of 'Disentangle to edit: instruction-guided latent manipulation for 3D facial video consistency'. Together they form a unique fingerprint.