Audio Samples from "Recognition-Synthesis Based Non-Parallel Voice Conversion with Adversarial Learning"

Paper:

A preprint of the paper will be released soon.

Authors:

Jing-Xuan Zhang, Zhen-Hua Ling, Li-Rong Dai

Abstract:

This paper presents a recognition-synthesis method for non-parallel voice conversion. A recognizer transforms acoustic features into hidden representations, while a synthesizer recovers output features from the recognizer outputs and the speaker identity. Because the speaker characteristics are separated from the hidden representations, voice conversion can be achieved by replacing the speaker identity with that of the target speaker. The recognizer is trained with a text classification loss and jointly optimized with the synthesizer. Speaker adversarial training is further adopted to make the hidden representations speaker-independent. For model training, a strategy of pre-training on a multi-speaker dataset and then fine-tuning on the two-speaker conversion pair is used. During fine-tuning, discriminators are introduced and a generative adversarial network (GAN) loss is used to prevent the predicted features from being over-smoothed. Our method achieved higher similarity than the strong baseline that obtained first place in the Voice Conversion Challenge 2018. We also conducted ablation studies to confirm the effectiveness of several design choices in our proposed method.
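Below is a minimal PyTorch sketch of the recognition-synthesis structure described above: a recognizer maps acoustic features to hidden representations with an auxiliary text classifier, a synthesizer reconstructs features from those representations plus a speaker embedding, and a gradient-reversal speaker classifier implements the speaker adversarial training. All module names, layer choices, and dimensions are illustrative assumptions, not the exact architecture used in the paper.

    # Illustrative sketch only; layer types and sizes are assumptions.
    import torch
    import torch.nn as nn

    class GradReverse(torch.autograd.Function):
        # Gradient reversal: identity in the forward pass, negated (scaled)
        # gradient in the backward pass, used for speaker adversarial training.
        @staticmethod
        def forward(ctx, x, scale):
            ctx.scale = scale
            return x.view_as(x)

        @staticmethod
        def backward(ctx, grad):
            return -ctx.scale * grad, None

    class Recognizer(nn.Module):
        # Transforms acoustic features into hidden representations and predicts
        # text (phone) labels, providing the text classification loss.
        def __init__(self, feat_dim=80, hidden_dim=256, n_phones=72):
            super().__init__()
            self.rnn = nn.LSTM(feat_dim, hidden_dim, batch_first=True, bidirectional=True)
            self.proj = nn.Linear(2 * hidden_dim, hidden_dim)
            self.text_classifier = nn.Linear(hidden_dim, n_phones)

        def forward(self, feats):
            h, _ = self.rnn(feats)
            h = self.proj(h)
            return h, self.text_classifier(h)

    class Synthesizer(nn.Module):
        # Recovers acoustic features from the recognizer outputs and a speaker
        # identity; conversion replaces the source speaker id with the target one.
        def __init__(self, hidden_dim=256, n_speakers=10, spk_dim=64, feat_dim=80):
            super().__init__()
            self.spk_emb = nn.Embedding(n_speakers, spk_dim)
            self.rnn = nn.LSTM(hidden_dim + spk_dim, hidden_dim, batch_first=True)
            self.out = nn.Linear(hidden_dim, feat_dim)

        def forward(self, hidden, spk_id):
            spk = self.spk_emb(spk_id).unsqueeze(1).expand(-1, hidden.size(1), -1)
            y, _ = self.rnn(torch.cat([hidden, spk], dim=-1))
            return self.out(y)

    class SpeakerClassifier(nn.Module):
        # Adversarial branch that tries to identify the speaker from the hidden
        # representations; the reversed gradient pushes the recognizer to make
        # them speaker-independent.
        def __init__(self, hidden_dim=256, n_speakers=10):
            super().__init__()
            self.fc = nn.Linear(hidden_dim, n_speakers)

        def forward(self, hidden, scale=1.0):
            return self.fc(GradReverse.apply(hidden, scale))

At conversion time, the source utterance would be passed through the recognizer, and the synthesizer would be conditioned on the target speaker's identity instead of the source speaker's.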



0. Ground truth audio samples


             slt        rms
source       (audio)    (audio)
target       (audio)    (audio)


1. Comparison of the proposed method with the baseline method


Methods      rms-to-slt    slt-to-rms
VCC2018      (audio)       (audio)
Proposed     (audio)       (audio)


2. Reducing the size of training data for fine-tuning

Size of data    rms-to-slt    slt-to-rms
500             (audio)       (audio)
300             (audio)       (audio)
100             (audio)       (audio)


3. Ablation studies

Methods      rms-to-slt    slt-to-rms
Proposed     (audio)       (audio)
-adv         (audio)       (audio)
-text        (audio)       (audio)
-adv-text    (audio)       (audio)
-GAN         (audio)       (audio)
-joint       (audio)       (audio)
-pretrain    (audio)       (audio)
-tunerec     (audio)       (audio)
-all         (audio)       (audio)

"-adv", "-text" represent the proposed method without using adversarial training, text classification loss respectively. "-pretrain" represents the proposed method without using pre-training strategy. "-GAN" represents the proposed method finetuning without the GAN loss. "-joint" represents training the recognizer and synthesizer separately without joint optimization. "-tunerec" represents fix the recognizer after pre-training and only fine-tuning the synthesizer. "-all" represents the conventional recognition-synthesis method that uses the same structure and training data as the proposed method