Streaming non-autoregressive model for any-to-many voice conversion

Authors:

Ziyi Chen, Haoran Miao, Pengyuan Zhang

Address:

Key Laboratory of Speech Acoustics & Content Understanding, Institute of Acoustics, Chinese Academy of Sciences, China

University of Chinese Academy of Sciences, Beijing, China

Email:

chenziyi@hccl.ioa.ac.cn

Vocoder evaluations

RawCat means MB HiFi-GAN generates speech in 40 ms chunks without overlap and concatenates them directly.

MBC4S1 means MB HiFi-GAN generates speech in 40 ms chunks with a 10 ms overlap that is smoothed with a Hanning window.

MBC4S2 means MB HiFi-GAN generates speech in 40 ms chunks with a 20 ms overlap that is smoothed with a Hanning window.

MBS HiFi-GAN means the proposed streaming vocoder.
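
The difference between direct concatenation and windowed overlap can be illustrated with a minimal sketch. This is not the authors' implementation: the function name crossfade_concat, the 16 kHz sample rate, and the assumption that each 40 ms chunk already contains the overlapping samples are all hypothetical choices made for illustration. The fade curves are the two halves of a Hanning window and sum to one, so a steady signal passes through a seam unchanged.

```python
import numpy as np

def crossfade_concat(chunks, overlap):
    """Join 1-D audio chunks, cross-fading `overlap` samples at each seam.

    overlap == 0 reproduces plain concatenation (the RawCat setting).
    Assumption: every chunk is longer than `overlap` samples.
    """
    if overlap == 0:
        return np.concatenate(chunks)
    # Rising half of a Hanning window; fade_in + fade_out == 1 at every sample.
    n = np.arange(overlap)
    fade_in = 0.5 - 0.5 * np.cos(np.pi * (n + 0.5) / overlap)
    fade_out = 1.0 - fade_in
    out = np.asarray(chunks[0], dtype=np.float64).copy()
    for chunk in chunks[1:]:
        chunk = np.asarray(chunk, dtype=np.float64)
        # Blend the tail of the audio so far with the head of the new chunk.
        out[-overlap:] = out[-overlap:] * fade_out + chunk[:overlap] * fade_in
        out = np.concatenate([out, chunk[overlap:]])
    return out

# Hypothetical settings: 16 kHz audio, 40 ms chunks, 10 ms overlap (as in MBC4S1).
sample_rate = 16000
chunk_len = sample_rate * 40 // 1000
overlap_len = sample_rate * 10 // 1000
dummy_chunks = [np.ones(chunk_len) for _ in range(3)]
smoothed = crossfade_concat(dummy_chunks, overlap_len)
raw = crossfade_concat(dummy_chunks, 0)
```

Setting the overlap to 20 ms instead would correspond to the MBC4S2 configuration under the same assumptions.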

Results of the different vocoders are shown below.

	Example	Example	Example
RawCat
MBC4S1
MBC4S2
MBS HiFi-GAN

Voice conversion evaluations

Audio examples of the target speaker:

Baker represents audio from the databaker open-source Mandarin dataset.

MM represents audio of the male speaker in the M2VoC Dev set.

MF represents audio of the female speaker in the M2VoC Dev set.

Conversion results of the proposed method:

Target Speaker MM:

Reference:

	Reference 1	Reference 2	Reference 3	Reference 4
MM

Conversion results:

	Example 1	Example 2	Example 3	Example 4
Source
Proposed

Target Speaker Baker:

Reference:

	Reference 5	Reference 6	Reference 7	Reference 8
Baker

Conversion results:

	Example 5	Example 6	Example 7	Example 8
Source
Proposed

Target Speaker MF:

Reference:

	Reference 9	Reference 10	Reference 11	Reference 12
MF

Conversion results:

	Example 9	Example 10	Example 11	Example 12
Source
Proposed

Comparisons of different methods:

Target Speaker MM:

Reference:

	Reference 1	Reference 2	Reference 3	Reference 4
MM

Conversion results:

	Example 13	Example 14	Example 15	Example 16
Source
Proposed
w/o antispk
w/o sd

Target Speaker Baker:

Reference:

	Reference 5	Reference 6	Reference 7	Reference 8
Baker

Conversion results:

	Example 17	Example 18	Example 19	Example 20
Source
Proposed
w/o antispk
w/o sd

Target Speaker MF:

Reference:

	Reference 9	Reference 10	Reference 11	Reference 12
MF

Conversion results:

	Example 21	Example 22	Example 23	Example 24
Source
Proposed
w/o antispk
w/o sd