STORiCo: Storytelling TTS for Hindi with Character Voice Modulation

Pavan Kalyan Tankala, Preethi Jyothi, Preeti Rao, Pushpak Bhattacharyya

Main: Multimodality Oral Paper

Session 3: Multimodality (Oral)
Conference Room: Marie Louise 2
Conference Time: March 18, 14:00-15:30 (CET) (Europe/Malta)
Abstract: We present a new Hindi text-to-speech (TTS) dataset and demonstrate its utility for the expressive synthesis of children's audio stories. The dataset comprises narration by a single female speaker who modulates her voice to portray different story characters. Annotations for dialogue identification, character labelling, and character attribution are provided, all of which are expected to facilitate the learning of character voices and speaking styles. Experiments are conducted using different versions of the annotated dataset that enable training a multi-speaker TTS model on the single-speaker data. Subjective tests show that the multi-speaker model improves expressiveness and character voice consistency compared to the baseline single-speaker TTS. Objective evaluations of the multi-speaker model show comparable word error rates, better speaker voice consistency, and higher correlations with ground-truth emotion attributes. We release a new 16.8-hour storytelling speech dataset in Hindi and propose effective solutions for expressive TTS with narrator voice modulation and character voice consistency.
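
The abstract says that character annotations make it possible to train a multi-speaker TTS model on single-speaker data, but it does not say how the labels are turned into speaker identities. One plausible reading, sketched below purely as an illustration, is to treat the narrator and each annotated character as a separate pseudo-speaker. The `Segment` fields, the `narrator` label, the function name, and the example story lines are all hypothetical and are not taken from the released dataset or the authors' pipeline.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

NARRATOR = "narrator"  # assumed label for non-dialogue speech


@dataclass
class Segment:
    """One annotated utterance (hypothetical schema, not the dataset's)."""
    text: str
    is_dialogue: bool                # dialogue identification
    character: Optional[str] = None  # character attribution, None for narration


def assign_pseudo_speakers(
    segments: List[Segment],
) -> Tuple[List[Tuple[int, str, str]], Dict[str, int]]:
    """Give the narrator and each character its own integer speaker ID.

    Returns (rows, speaker_ids), where each row is (speaker_id, label, text)
    and speaker_ids maps label -> ID in order of first appearance.
    """
    speaker_ids: Dict[str, int] = {NARRATOR: 0}
    rows: List[Tuple[int, str, str]] = []
    for seg in segments:
        # Dialogue with a known character gets that character's ID;
        # everything else falls back to the narrator.
        label = seg.character if seg.is_dialogue and seg.character else NARRATOR
        sid = speaker_ids.setdefault(label, len(speaker_ids))
        rows.append((sid, label, seg.text))
    return rows, speaker_ids


if __name__ == "__main__":
    story = [
        Segment("A lion was asleep in the forest.", False),
        Segment("Who dares to wake me?", True, "lion"),
        Segment("Please forgive me!", True, "mouse"),
        Segment("The lion laughed.", False),
        Segment("Go, little one.", True, "lion"),
    ]
    rows, ids = assign_pseudo_speakers(story)
    for sid, label, text in rows:
        print(sid, label, text)
```

Under this assumption, the resulting rows could serve as metadata for any multi-speaker TTS recipe that conditions on a speaker ID. Whether the paper's dataset versions are built this way is not stated in the abstract.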