Natural Text-to-Speech Synthesis by Conditioning Spectrogram Predictions from Transformer Network on WaveGlow Vocoder

G. Sanjay; K.C. Sooraj; Deepak Mishra

doi:10.1109/ISCMI51676.2020.9311564

Published in Institute of Electrical and Electronics Engineers Inc.

2020

Pages: 255 - 259

Abstract

Text-to-speech (TTS) is a form of speech synthesis in which text is converted into human-like spoken output. State-of-the-art TTS methods use neural networks. This work examines some of the issues and limitations of current systems, specifically Tacotron-2, and attempts to improve its performance by modifying its architecture. The modified model uses a Transformer network as the Spectrogram Prediction Network (SPN) and WaveGlow as the Audio Generation Network (AGN). The modified model shows improvements in the speech output generated for the corresponding texts and in the inference time for audio generation, and it obtains a Mean Opinion Score (MOS) of 4.10 (out of 5). © 2020 IEEE.
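For readers unfamiliar with the metric, a Mean Opinion Score is conventionally the arithmetic mean of listener ratings on a 1 to 5 scale. The sketch below shows that calculation only; the function name, the normal-approximation confidence interval, and the ratings are assumptions made for illustration and are not taken from the paper or its evaluation data.

```python
from math import sqrt
from statistics import mean, stdev


def mean_opinion_score(ratings):
    """Return (MOS, 95% CI half-width) for listener ratings on a 1-5 scale.

    Hypothetical helper: the abstract reports only a final MOS, not how
    the ratings were collected or aggregated.
    """
    if not ratings:
        raise ValueError("need at least one rating")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie in [1, 5]")
    mos = mean(ratings)
    # Normal-approximation interval; 0.0 when only one rating is available.
    if len(ratings) > 1:
        half_width = 1.96 * stdev(ratings) / sqrt(len(ratings))
    else:
        half_width = 0.0
    return mos, half_width


# Invented ratings for illustration only (not the paper's data).
ratings = [5, 4, 4, 3, 4]
mos, ci = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

With only a handful of raters the interval is wide, which is why MOS figures are usually reported together with the number of listeners and utterances.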

Topics: Speech synthesis (60%), Mean opinion score (57%), Spectrogram (56%), and Transformer (machine learning model) (54%)

About the journal

Journal	2020 7th International Conference on Soft Computing and Machine Intelligence, ISCMI 2020
Publisher	Institute of Electrical and Electronics Engineers Inc.

Authors (1)

Deepak Mishra
- Department of Computer Science & Engineering
