Streaming non-autoregressive model for any-to-many voice conversion

Authors:

Ziyi Chen, Haoran Miao, Pengyuan Zhang

Address:

Key Laboratory of Speech Acoustics & Content Understanding, Institute of Acoustics, Chinese Academy of Sciences, China

University of Chinese Academy of Sciences, Beijing, China

Email:

chenziyi@hccl.ioa.ac.cn

 


Vocoder evaluations


RawCat denotes MB HiFi-GAN generating speech in 40 ms chunks without overlap and concatenating the chunks directly.

MBC4S1 denotes MB HiFi-GAN generating speech in 40 ms chunks with a 10 ms overlap that is smoothed with a Hann (Hanning) window.

MBC4S2 denotes MB HiFi-GAN generating speech in 40 ms chunks with a 20 ms overlap that is smoothed with a Hann (Hanning) window.

MBS HiFi-GAN denotes the proposed streaming vocoder.
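
The chunk-joining baselines above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `hann_crossfade_concat`, the 16 kHz sample rate, and the assumption that each 40 ms chunk shares its last 10 ms with the start of the next chunk are all hypothetical choices made for illustration. With an overlap of zero the function reduces to direct concatenation, as in RawCat.

```python
import numpy as np


def hann_crossfade_concat(chunks, overlap):
    """Join audio chunks, cross-fading each overlap region with Hann ramps.

    Assumes the last `overlap` samples of a chunk cover the same time span
    as the first `overlap` samples of the next chunk (an illustrative
    convention, not taken from the source).
    """
    if overlap == 0:
        # RawCat-style joining: no smoothing at the chunk boundaries.
        return np.concatenate(chunks)

    n = np.arange(overlap)
    # Rising half of a periodic Hann window; the falling half is its
    # complement, so the two ramps always sum to exactly one.
    fade_in = 0.5 - 0.5 * np.cos(np.pi * n / overlap)
    fade_out = 1.0 - fade_in

    out = np.asarray(chunks[0], dtype=np.float64).copy()
    for chunk in chunks[1:]:
        chunk = np.asarray(chunk, dtype=np.float64)
        # Blend the tail of the signal so far with the head of the new chunk.
        out[-overlap:] = out[-overlap:] * fade_out + chunk[:overlap] * fade_in
        out = np.concatenate([out, chunk[overlap:]])
    return out


SAMPLE_RATE = 16000                      # assumed; not stated in the source
chunk_len = 40 * SAMPLE_RATE // 1000     # 40 ms chunk
overlap = 10 * SAMPLE_RATE // 1000       # 10 ms overlap, as in MBC4S1

# Three constant chunks stand in for vocoder output.
chunks = [np.ones(chunk_len) for _ in range(3)]
joined = hann_crossfade_concat(chunks, overlap)
raw = hann_crossfade_concat(chunks, 0)
```

Because the two ramps are complementary, a signal that is identical in both overlapping chunks passes through the cross-fade unchanged. Audible artifacts arise only where consecutive chunks disagree in the overlap region.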


Results of different vocoders are shown below.

 

Example

Example

Example

RawCat
MBC4S1
MBC4S2
MBS HiFi-GAN

 


Voice conversion evaluations


Audio examples of the target speakers:

Baker denotes audio from the open-source Databaker Mandarin dataset.

MM denotes audio from the male speaker in the M2VoC Dev set.

MF denotes audio from the female speaker in the M2VoC Dev set.


Conversion results of the proposed method:

Target Speaker MM:

Reference:

 

Reference 1

Reference 2

Reference 3

Reference 4

MM

Conversion results:

 

Example 1

Example 2

Example 3

Example 4

Source
Proposed

Target Speaker Baker:

Reference:

 

Reference 5

Reference 6

Reference 7

Reference 8

Baker

Conversion results:

 

Example 5

Example 6

Example 7

Example 8

Source
Proposed

Target Speaker MF:

Reference:

 

Reference 9

Reference 10

Reference 11

Reference 12

MF

Conversion results:

 

Example 9

Example 10

Example 11

Example 12

Source
Proposed

 


Comparison of different methods:

Target Speaker MM:

Reference:

 

Reference 1

Reference 2

Reference 3

Reference 4

MM

Conversion results:

 

Example 13

Example 14

Example 15

Example 16

Source
Proposed
w/o antispk
w/o sd

Target Speaker Baker:

Reference:

 

Reference 5

Reference 6

Reference 7

Reference 8

Baker

Conversion results:

 

Example 17

Example 18

Example 19

Example 20

Source
Proposed
w/o antispk
w/o sd

Target Speaker MF:

Reference:

 

Reference 9

Reference 10

Reference 11

Reference 12

MF

Conversion results:

 

Example 21

Example 22

Example 23

Example 24

Source
Proposed
w/o antispk
w/o sd