Audio Samples from "Recognition-Synthesis Based Non-Parallel Voice Conversion with Adversarial Learning"
Paper:
A pre-print of the paper will be released soon.
Authors:
Jing-Xuan Zhang, Zhen-Hua Ling, Li-Rong Dai
Abstract:
This paper presents a recognition-synthesis method for non-parallel voice conversion. A recognizer is used to transform
acoustic features into hidden representations, while a synthesizer recovers output features from the recognizer outputs
and the speaker identity. By separating the speaker characteristics from the hidden representations, voice conversion
can be achieved by replacing the speaker identity with the target one. The recognizer is trained with a text
classification loss and jointly optimized with the synthesizer. Speaker adversarial training is further adopted to make
the hidden representations speaker-independent. To train the model, a strategy of pre-training on a multi-speaker
dataset and then fine-tuning on the specific two-speaker pair is used. During fine-tuning, discriminators are further
introduced and a generative adversarial network (GAN) loss is used to prevent the predicted features from being
over-smoothed. Our method achieved higher similarity than the strong baseline that took first place in the Voice
Conversion Challenge 2018. We also conducted ablation studies to validate several design choices in our proposed method.
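
As a reading aid for the pipeline described above, the sketch below shows how such a recognizer/synthesizer pair with a gradient-reversal speaker classifier could be wired up in PyTorch. It is a minimal sketch under assumed design choices, not the authors' implementation: all class names (Recognizer, Synthesizer, SpeakerClassifier, GradReverse), layer types, and dimensions (feat_dim=80, n_phones=72, etc.) are illustrative assumptions.

import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    # Identity in the forward pass; flips the gradient sign in the backward
    # pass, so the recognizer is pushed to remove speaker information
    # (an assumed realization of the speaker adversarial training above).
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None


class Recognizer(nn.Module):
    # Transforms acoustic features into hidden representations; the text
    # classification head provides the frame-level text classification loss.
    def __init__(self, feat_dim=80, hidden_dim=256, n_phones=72):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden_dim, num_layers=2,
                               batch_first=True, bidirectional=True)
        self.text_head = nn.Linear(2 * hidden_dim, n_phones)

    def forward(self, feats):                      # feats: (B, T, feat_dim)
        hidden, _ = self.encoder(feats)            # (B, T, 2 * hidden_dim)
        return hidden, self.text_head(hidden)      # hidden reps, phone logits


class Synthesizer(nn.Module):
    # Recovers output features from the recognizer outputs and a speaker
    # embedding; conversion swaps in the target speaker's identity.
    def __init__(self, hidden_dim=512, spk_dim=64, n_speakers=100, feat_dim=80):
        super().__init__()
        self.spk_emb = nn.Embedding(n_speakers, spk_dim)
        self.decoder = nn.LSTM(hidden_dim + spk_dim, 256,
                               num_layers=2, batch_first=True)
        self.out = nn.Linear(256, feat_dim)

    def forward(self, hidden, speaker_id):         # speaker_id: (B,)
        spk = self.spk_emb(speaker_id).unsqueeze(1).expand(-1, hidden.size(1), -1)
        dec, _ = self.decoder(torch.cat([hidden, spk], dim=-1))
        return self.out(dec)                       # predicted acoustic features


class SpeakerClassifier(nn.Module):
    # Adversarial branch: tries to identify the speaker from the hidden
    # representations through the gradient-reversal layer.
    def __init__(self, hidden_dim=512, n_speakers=100):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(hidden_dim, 256), nn.ReLU(),
                                 nn.Linear(256, n_speakers))

    def forward(self, hidden, lam=1.0):
        return self.net(GradReverse.apply(hidden, lam))

At conversion time, the source speaker's features would be passed through the recognizer and re-synthesized with the target speaker's embedding; during joint training, the reconstruction, text classification, and adversarial speaker losses are combined, as sketched after the ablation legend below.
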
0. Ground truth audio samples
[Audio samples: ground-truth recordings of the source and target speakers, slt and rms]
1. Comparison of the proposed method to the baseline method
[Audio samples: rms-to-slt and slt-to-rms conversions by the VCC2018 baseline and the proposed method]
2. Reducing the size of training data for fine-tuning
[Audio samples: rms-to-slt and slt-to-rms conversions with fine-tuning data sizes of 500, 300, and 100]
3. Ablation studies
[Audio samples: rms-to-slt and slt-to-rms conversions for Proposed, -adv, -text, -adv-text, -GAN, -joint, -pretrain, -tunerec, and -all]
"-adv", "-text" represent the proposed method without using adversarial training, text classification loss respectively. "-pretrain" represents the proposed method without using pre-training strategy. "-GAN" represents the proposed method finetuning without the GAN loss. "-joint" represents training the recognizer and synthesizer separately without joint optimization. "-tunerec" represents fix the recognizer after pre-training and only fine-tuning the synthesizer. "-all" represents the conventional recognition-synthesis method that uses the same structure and training data as the proposed method