VoCo: supplemental

This section contains the sentences used in the MOS and Identification (ID) tests.
Click a button in the grid below to play the audio.
Colors (ranging from red=bad to green=good) encode the scores, which are also shown as text in the button labels.
A toggle button on the page switches between the results of the two tests (MOS/ID).
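
The red-to-green color encoding described above can be illustrated with a minimal sketch. This is not the page's actual code: the function names, the linear interpolation, and the default score range of 1 to 5 (the conventional MOS scale) are all assumptions made for illustration.

```python
def score_to_rgb(score, lo=1.0, hi=5.0):
    """Map a score in [lo, hi] to an (r, g, b) color from red (bad) to green (good).

    The range [lo, hi] and the linear red-to-green ramp are assumptions,
    not values taken from the page.
    """
    # Clamp so out-of-range scores still produce a valid color.
    clamped = min(max(score, lo), hi)
    t = (clamped - lo) / (hi - lo)  # 0.0 = worst, 1.0 = best
    red = round(255 * (1.0 - t))
    green = round(255 * t)
    return (red, green, 0)


def button_label(score):
    """Format the score as the text shown on a button (hypothetical format)."""
    return f"{score:.2f}"
```

For a test that reports a fraction rather than a 1 to 5 rating, the same function could be called with `lo=0.0, hi=1.0`.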

Each sentence (except for "Real") contains exactly one synthesized word, produced by one of the following methods:

Synth	We use the source TTS voice to synthesize the word and insert it in context.
CUTE	Based on our framework, we use CUTE (not VoCo) for the voice conversion.
Auto	The VoCo method using pre-defined α and β values in range selection.
Choose	We manually choose one synthesis from several alternatives (up to 16), if it improves on Auto above.
Edit	We use the editing interface to further refine the synthesis, if it improves on Auto/Choose.
Real	The actual human recording, without modification.

Male 1 (DBL)

	Synth	CUTE	Auto	Choose	Edit	Real
(Interactive grid of audio buttons: sentences 1 to 44, one row per sentence and one button per column.)

Male 2 (RMS)

	Synth	CUTE	Auto	Choose	Edit	Real
(Interactive grid of audio buttons: sentences 1 to 44, one row per sentence and one button per column.)

Female 1 (CLB)

	Synth	CUTE	Auto	Choose	Edit	Real
(Interactive grid of audio buttons: sentences 1 to 44, one row per sentence and one button per column.)

Female 2 (SLT)

	Synth	CUTE	Auto	Choose	Edit	Real
(Interactive grid of audio buttons: sentences 1 to 44, one row per sentence and one button per column.)

This section addresses whether audio experts could do as well as VoCo using conventional audio editing software.
It contains just four sentences, and results are compared using an MOS test.
In each sentence, a word is replaced with a different word by a generic TTS,
by two experts using audio editing software, and by VoCo.

word	Synth	Expert1	Expert2	VoCo
mentioned
director
benefit
television
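
For readers unfamiliar with the term, a mean opinion score is the arithmetic mean of the ratings that listeners give to a stimulus. The sketch below shows only that calculation. The function name is hypothetical, the ratings are made-up placeholder values rather than results from this study, and the 1 to 5 scale mentioned in the comment is the conventional one, assumed here.

```python
from statistics import mean


def mean_opinion_score(ratings):
    """Return the arithmetic mean of listener ratings for one stimulus."""
    if not ratings:
        raise ValueError("at least one rating is required")
    return mean(ratings)


# Placeholder ratings on an assumed 1-5 scale (illustrative only, not study data).
example_ratings = [4, 5, 3, 4]
example_mos = mean_opinion_score(example_ratings)
```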

VoCo: Text-based Insertion and Replacement in Audio Narration
