This page contains the results of experiments described in the paper:

VoCo: Text-based Insertion and Replacement in Audio Narration

The paper and other project materials (video, etc.) are at: http://gfx.cs.princeton.edu/pubs/Jin_2017_VTI/
See Section 5 of the paper for a complete description of these tests.


This section contains the sentences used in the MOS and Identification Tests.
Click a button in the grids below to play the corresponding audio.
Colors (ranging from red=bad to green=good) encode the scores, which are also shown as text in the button labels.
The grids can show the results of either test (MOS or ID); a toggle button switches between the two.
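
The red-to-green color encoding described above can be illustrated with a small sketch. It is only an assumption about how such a mapping might work, not the page's actual code: the function name `score_to_rgb`, the linear interpolation, and the default score range of 1 to 5 are all hypothetical. For a test whose scores lie in a different range (for example a fraction between 0 and 1), `lo` and `hi` would need to be changed accordingly.

```python
def score_to_rgb(score, lo=1.0, hi=5.0):
    """Linearly map a score in [lo, hi] to an (R, G, B) tuple.

    lo maps to pure red (bad), hi maps to pure green (good).
    The 1..5 default range is an assumption, not taken from the page.
    """
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    # Clamp the score into [lo, hi], then normalize to t in [0, 1].
    clamped = min(max(score, lo), hi)
    t = (clamped - lo) / (hi - lo)
    red = round(255 * (1.0 - t))
    green = round(255 * t)
    return (red, green, 0)


if __name__ == "__main__":
    for s in (1.0, 2.5, 4.0, 5.0):
        print(s, score_to_rgb(s))
```

Scores outside the range are clamped to its endpoints rather than rejected, so a stray value still produces a valid color.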

Each sentence has just one synthesized word (except for "Real") as follows:
Synth: We use the source TTS voice to synthesize the word and insert it in context.
CUTE: Based on our framework, we use CUTE (not VoCo) for the voice conversion.
Auto: The VoCo method using pre-defined α and β values in range selection.
Choose: We manually choose one synthesis from several alternatives (up to 16), if it improves on Auto above.
Edit: We use the editing interface to further refine the synthesis, if it improves on Auto/Choose.
Real: The actual human recording, without modification.

Male 1 (DBL)

[Interactive grid: sentences 1 to 44 (rows) by Synth, CUTE, Auto, Choose, Edit, Real (columns)]

Male 2 (RMS)

[Interactive grid: sentences 1 to 44 (rows) by Synth, CUTE, Auto, Choose, Edit, Real (columns)]

Female 1 (CLB)

[Interactive grid: sentences 1 to 44 (rows) by Synth, CUTE, Auto, Choose, Edit, Real (columns)]

Female 2 (SLT)

[Interactive grid: sentences 1 to 44 (rows) by Synth, CUTE, Auto, Choose, Edit, Real (columns)]


This section addresses the question of whether audio experts could do as well as VoCo using conventional audio editing software.
It contains just four sentences, and the results are compared by an MOS test.
In each sentence, one word is replaced with a different word in four ways: by a generic TTS (Synth),
by each of two experts using audio editing software (Expert1, Expert2), and by VoCo.

[Interactive grid: replaced words (rows: mentioned, director, benefit, television) by Synth, Expert1, Expert2, VoCo (columns)]
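
For readers unfamiliar with MOS tests: a mean opinion score is the average of the ratings that listeners give to one condition. The sketch below shows only that aggregation step. The function names and the input layout (a dictionary from condition name to a list of ratings) are hypothetical, and the numbers in the usage example are made-up placeholders, not results from the paper; see Section 5 of the paper for the actual test protocol.

```python
from statistics import mean


def mean_opinion_score(ratings):
    """Return the arithmetic mean of a non-empty list of listener ratings."""
    if not ratings:
        raise ValueError("at least one rating is required")
    return mean(ratings)


def mos_by_condition(ratings_by_condition):
    """Compute one MOS per condition.

    ratings_by_condition maps a condition name to its list of ratings
    (hypothetical layout, chosen here for illustration only).
    """
    return {
        condition: mean_opinion_score(ratings)
        for condition, ratings in ratings_by_condition.items()
    }


if __name__ == "__main__":
    # Placeholder ratings for two unnamed conditions; not real data.
    example = {"A": [1, 2, 3], "B": [4, 4, 5, 5]}
    print(mos_by_condition(example))
```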