Accepted by Interspeech 2021.
Note: "seen" utterances appeared in the training set, while "unseen" utterances come from the test set. MM and MF denote the male and female speakers of the M2VoC Dev set, respectively; their recordings are parallel data. Baker denotes the female speaker of the open-source DataBaker Mandarin dataset.
[Audio samples, two groups: ground-truth (GT) recordings and converted results from VQVAE, PPGVC, and the proposed model.]
[Audio samples: reference utterances (ground truth, GT).]
[Audio samples, two groups: for each method (VQVAE, PPGVC, Proposed), a seen utterance with its converted result and an unseen utterance with its converted result.]
[Audio samples, ablation study: Proposed, -f0, and -vq for the MM-to-Baker and MF-to-Baker directions.]
"-f0" represents the proposed model without quantization fundamental frequency. "-vq" represents the proposed model without vector quantization layer. We can observe that "-f0" has lower naturalness, and "-vq" has lower speaker similarity.