VoCo: Text-based Insertion and Replacement in Audio Narration

ACM Transactions on Graphics, July 2017

Zeyu Jin, Gautham J. Mysore, Stephen DiVerdi,
Jingwan Lu, Adam Finkelstein

Text-based editing provides a natural interface for modifying audio narrations. Our approach allows the editor to replace an existing word (or insert a new word) by typing, and the system automatically synthesizes the new word by stitching together snippets of audio from elsewhere in the narration. Here we replace the word 'sixteen' by 'seventeen' in a text editor, and the new audio is automatically stitched together from parts of the words 'else', 'seventy', 'want' and 'seen'.

Editing audio narration using conventional software typically involves many painstaking low-level manipulations. Some state-of-the-art systems allow the editor to work in a text transcript of the narration, and perform select, cut, copy and paste operations directly in the transcript; these operations are then automatically applied to the waveform in a straightforward manner. However, an obvious gap in the text-based interface is the ability to type new words not appearing in the transcript, for example inserting a new word for emphasis or replacing a misspoken word. While high-quality voice synthesizers exist today, the challenge is to synthesize the new word in a voice that matches the rest of the narration. This paper presents a system that can synthesize a new word or short phrase such that it blends seamlessly in the context of the existing narration. Our approach is to use a text-to-speech synthesizer to say the word in a generic voice, and then use voice conversion to convert it into a voice that matches the narration. Offering a range of degrees of control to the editor, our interface supports fully automatic synthesis, selection among a candidate set of alternative pronunciations, fine control over edit placements and pitch profiles, and even guidance by the editor's own voice. The paper presents studies showing that the output of our method is preferred over baseline methods and often indistinguishable from the original voice.
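To make the idea of stitching a new word together from audio snippets more concrete, here is a minimal sketch of only the final joining step: concatenating already-chosen snippets with a short linear crossfade at each joint. It is an illustration under stated assumptions, not the method from the paper. The function name `crossfade_stitch`, the `overlap` parameter, and the linear blend are all hypothetical, and the sketch does not cover how snippets are selected (the text-to-speech and voice conversion stages described above).

```python
from typing import List, Sequence


def crossfade_stitch(snippets: Sequence[Sequence[float]], overlap: int) -> List[float]:
    """Concatenate audio snippets, blending up to `overlap` samples at each joint.

    Hypothetical helper for illustration only; snippets are plain sequences of
    samples that are assumed to share one sample rate.
    """
    if overlap < 0:
        raise ValueError("overlap must be non-negative")

    out: List[float] = []
    for snippet in snippets:
        samples = list(snippet)
        # Blend no more samples than both sides of the joint actually have.
        n = min(overlap, len(out), len(samples))
        for i in range(n):
            # Linear ramp: the weight shifts from the existing audio to the new snippet.
            w = (i + 1) / (n + 1)
            j = len(out) - n + i
            out[j] = (1.0 - w) * out[j] + w * samples[i]
        # Append whatever part of the snippet was not blended into the overlap.
        out.extend(samples[n:])
    return out


if __name__ == "__main__":
    # Two toy "snippets": a constant 1.0 segment followed by a constant 0.0 segment.
    stitched = crossfade_stitch([[1.0] * 4, [0.0] * 4], overlap=2)
    print(len(stitched), stitched)
```

Each joint shortens the result by the number of blended samples, so two 4-sample snippets with an overlap of 2 yield 6 samples. A real system would likely choose joint positions and blend lengths from the audio itself rather than use a fixed overlap.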

Paper Preprint (2MB)
Video (80MB)
Video on YouTube
Experimental Results
Related paper CUTE
Related paper FFTNet

Zeyu Jin, Gautham J. Mysore, Stephen DiVerdi, Jingwan Lu, and Adam Finkelstein.
"VoCo: Text-based Insertion and Replacement in Audio Narration."
ACM Transactions on Graphics 36(4), Article 96, July 2017.

@article{Jin:2017:VTI,
   author = "Zeyu Jin and Gautham J. Mysore and Stephen DiVerdi and Jingwan Lu and
      Adam Finkelstein",
   title = "{VoCo}: Text-based Insertion and Replacement in Audio Narration",
   journal = "ACM Transactions on Graphics",
   year = "2017",
   month = jul,
   volume = "36",
   number = "4",
   articleno = "96"
}