TY - GEN
T1 - TDE-VC
T2 - 2025 IEEE International Conference on Multimedia and Expo, ICME 2025
AU - Hu, Ying
AU - Tu, Shangkun
AU - Li, Fan
AU - He, Lijun
AU - Yan, Hai
AU - Li, Yan
N1 - Publisher Copyright: © 2025 IEEE.
PY - 2025
Y1 - 2025
AB - Voice conversion (VC) transforms certain characteristics of speech from a source to a target while preserving the original linguistic content. This paper focuses on timbre conversion, a key type of VC. Current VC methods face two challenges: retaining source speaker information in the extracted content and inadequately capturing timbre features, often leading to suboptimal speaker similarity in the converted speech. To address these issues, we propose the TDE-VC model, a zero-shot voice conversion framework that incorporates a phased-trained content extractor, combining the strengths of an adversarial speaker classifier and data perturbation to extract cleaner content. Critically, we introduce a timbre disentanglement and extraction strategy, based on a multi-level consistency constraint, which effectively disentangles timbre from content and guides the timbre encoder to focus solely on timbre extraction. Additionally, we present an effective multi-scale timbre encoder. Experimental results demonstrate that TDE-VC significantly improves speaker similarity, especially for unseen target speakers, while maintaining competitive naturalness compared to existing methods. The demo page is publicly available.
KW - consistency constraint
KW - phased training
KW - timbre disentanglement
KW - voice conversion
KW - zero-shot
UR - https://www.scopus.com/pages/publications/105022639224
U2 - 10.1109/ICME59968.2025.11209862
DO - 10.1109/ICME59968.2025.11209862
M3 - Conference contribution
AN - SCOPUS:105022639224
T3 - Proceedings - IEEE International Conference on Multimedia and Expo
BT - 2025 IEEE International Conference on Multimedia and Expo
PB - IEEE Computer Society
Y2 - 30 June 2025 through 4 July 2025
ER -