Improving conversion and attack capability for any-to-many voice conversion with post-processing

Authors:

Ziyi Chen, Hua Hua, Yuxiang Zhang, Ming Li, Pengyuan Zhang

Address:

Key Laboratory of Speech Acoustics & Content Understanding, Institute of Acoustics, Chinese Academy of Sciences, China

University of Chinese Academy of Sciences, Beijing, China

Data Science Research Center, Duke Kunshan University, Kunshan, China

Email:

chenziyi@hccl.ioa.ac.cn


1. Audio examples of the target speakers. Baker denotes audio from the DataBaker open-source Mandarin dataset; MM denotes audio from the male speaker in the M2VoC Dev set; MF denotes audio from the female speaker in the M2VoC Dev set.

baker:

Example 11, Example 12, Example 13, Example 14

MM:

Example 21, Example 22, Example 23, Example 24

MF:

Example 31, Example 32, Example 33, Example 34


2. Voice conversion results. GT denotes the ground truth; ppgfast denotes conformer PPG + modified FastSpeech + HiFi-GAN; ppgtaco denotes conformer PPG + modified Tacotron 2 + HiFi-GAN; proposed denotes the proposed method.

baker:

Utterances: Chat style utterance 1, Chat style utterance 2, Narrative style utterance 1, Narrative style utterance 2

Systems: GT, ppgfast, ppgtaco, proposed

MM:

Utterances: Chat style utterance 1, Chat style utterance 2, Narrative style utterance 1, Narrative style utterance 2

Systems: GT, ppgfast, ppgtaco, proposed

MF:

Utterances: Chat style utterance 1, Chat style utterance 2, Narrative style utterance 1, Narrative style utterance 2

Systems: GT, ppgfast, ppgtaco, proposed