TDE-VC: Timbre Disentanglement and Extraction Via Consistency for Zero-Shot Voice Conversion

  • Xinjiang University
  • Public Security Department of Xinjiang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Voice conversion (VC) transforms certain characteristics of speech from a source to a target while preserving the original linguistic content. This paper focuses on timbre conversion, a key type of VC. Current VC methods face two challenges: retaining source speaker information in the extracted content and inadequately capturing timbre features, often leading to suboptimal speaker similarity in the converted speech. To address these issues, we propose the TDE-VC model, a zero-shot voice conversion framework that incorporates a phased-trained content extractor, combining the strengths of an adversarial speaker classifier and data perturbation to extract cleaner content. Critically, we introduce a timbre disentanglement and extraction strategy, based on a multi-level consistency constraint, which effectively disentangles timbre from content and guides the timbre encoder to focus solely on timbre extraction. Additionally, we present an effective multi-scale timbre encoder. Experimental results demonstrate that TDE-VC significantly improves speaker similarity, especially for unseen target speakers, while maintaining competitive naturalness compared to existing methods. The demo page is publicly available.

Original language: English
Title of host publication: 2025 IEEE International Conference on Multimedia and Expo
Subtitle of host publication: Journey to the Center of Machine Imagination, ICME 2025 - Conference Proceedings
Publisher: IEEE Computer Society
ISBN (Electronic): 9798331594954
DOIs
State: Published - 2025
Event: 2025 IEEE International Conference on Multimedia and Expo, ICME 2025 - Nantes, France
Duration: 30 Jun 2025 → 4 Jul 2025

Publication series

Name: Proceedings - IEEE International Conference on Multimedia and Expo
ISSN (Print): 1945-7871
ISSN (Electronic): 1945-788X

Conference

Conference: 2025 IEEE International Conference on Multimedia and Expo, ICME 2025
Country/Territory: France
City: Nantes
Period: 30/06/25 → 4/07/25

Keywords

  • consistency constraint
  • phased training
  • timbre disentanglement
  • voice conversion
  • zero-shot
