Audio Samples from "TVQVC: Transformer based Vector Quantized Variational Autoencoder with CTC loss for Voice Conversion"

Paper:

Accepted at Interspeech 2021.

Authors:

Ziyi Chen, Pengyuan Zhang

Abstract:

Voice conversion (VC) techniques aim to modify the speaker identity and style of an utterance while preserving its linguistic content. Although many VC methods exist, the state of the art is still to cascade automatic speech recognition (ASR) and text-to-speech (TTS). This paper presents a new transformer-based vector-quantized autoencoder with CTC loss for non-parallel VC, inspired by the cascaded ASR-TTS approach. Our proposed method combines CTC loss and vector quantization to extract high-level linguistic information while discarding speaker information. Objective and subjective evaluations on Mandarin datasets show that the speech converted by our proposed model surpasses the baselines in naturalness, rhythm, and speaker similarity.
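
To make the idea concrete, below is a minimal PyTorch sketch of an encoder that combines a vector quantization bottleneck with a CTC loss, as the abstract describes. This is not the authors' implementation; the module names, dimensions, codebook size, and number of phone classes are all illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup with a straight-through gradient."""
    def __init__(self, num_codes=256, dim=128, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.beta = beta

    def forward(self, z):                                 # z: (batch, time, dim)
        flat = z.reshape(-1, z.size(-1))
        idx = torch.cdist(flat, self.codebook.weight).argmin(dim=-1)
        q = self.codebook(idx).view_as(z)
        # Codebook loss + commitment loss (van den Oord et al., 2017).
        vq_loss = F.mse_loss(q, z.detach()) + self.beta * F.mse_loss(z, q.detach())
        q = z + (q - z).detach()                          # straight-through estimator
        return q, vq_loss

class ContentEncoder(nn.Module):
    """Mel spectrogram -> transformer -> VQ codes -> CTC posteriors."""
    def __init__(self, n_mels=80, dim=128, num_phones=100):
        super().__init__()
        self.proj = nn.Linear(n_mels, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=4)
        self.vq = VectorQuantizer(dim=dim)
        self.ctc_head = nn.Linear(dim, num_phones + 1)    # index 0 = CTC blank

    def forward(self, mels, phones, mel_lens, phone_lens):
        q, vq_loss = self.vq(self.transformer(self.proj(mels)))
        log_probs = self.ctc_head(q).log_softmax(-1).transpose(0, 1)  # (T, B, C)
        ctc_loss = F.ctc_loss(log_probs, phones, mel_lens, phone_lens, blank=0)
        return q, ctc_loss + vq_loss

# Toy usage: phone labels run 1..num_phones because index 0 is the blank.
enc = ContentEncoder()
mels = torch.randn(2, 120, 80)                            # (batch, frames, mels)
phones = torch.randint(1, 101, (2, 30))
mel_lens = torch.full((2,), 120, dtype=torch.long)
phone_lens = torch.full((2,), 30, dtype=torch.long)
codes, loss = enc(mels, phones, mel_lens, phone_lens)
loss.backward()

A decoder (not shown) would reconstruct the mel spectrogram from the quantized codes together with speaker and prosody conditioning; the CTC supervision pushes the codebook toward phone-like content while the discrete bottleneck discards speaker detail.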


Note: "seen" utterances appeared in the training set, while "unseen" utterances come from the test set. MM denotes the male speaker of the M2VoC Dev set and MF denotes the female speaker of the M2VoC Dev set; their recordings are parallel. Baker denotes the female speaker of the open-source Databaker Mandarin dataset.


1. Convert between MM and MF.

Convert MF to MM:

Methods     Seen utterance    Converted results    Unseen utterance    Converted results
GT          [audio]           [audio]              [audio]             [audio]
VQVAE       [audio]           [audio]              [audio]             [audio]
PPGVC       [audio]           [audio]              [audio]             [audio]
Proposed    [audio]           [audio]              [audio]             [audio]

Convert MM to MF:

Methods     Seen utterance    Converted results    Unseen utterance    Converted results
GT          [audio]           [audio]              [audio]             [audio]
VQVAE       [audio]           [audio]              [audio]             [audio]
PPGVC       [audio]           [audio]              [audio]             [audio]
Proposed    [audio]           [audio]              [audio]             [audio]


2. Convert MM and MF to Baker.

Reference utterances of Baker:

Methods    Reference utterances
GT         [audio]

MM to Baker:

Methods     Seen utterance    Converted results    Unseen utterance    Converted results
VQVAE       [audio]           [audio]              [audio]             [audio]
PPGVC       [audio]           [audio]              [audio]             [audio]
Proposed    [audio]           [audio]              [audio]             [audio]

MF to Baker:

Methods     Seen utterance    Converted results    Unseen utterance    Converted results
VQVAE       [audio]           [audio]              [audio]             [audio]
PPGVC       [audio]           [audio]              [audio]             [audio]
Proposed    [audio]           [audio]              [audio]             [audio]


3. Ablation studies

Methods     MM-to-Baker    MF-to-Baker
Proposed    [audio]        [audio]
-f0         [audio]        [audio]
-vq         [audio]        [audio]

"-f0" denotes the proposed model without quantized fundamental frequency, and "-vq" denotes the proposed model without the vector quantization layer. We observe that "-f0" yields lower naturalness, while "-vq" yields lower speaker similarity.