VarianceFlow: Using Variance Information via Normalizing Flow
for High-quality and Controllable Text-to-Speech

Authors: Yoonhyung Lee, Jinhyeok Yang, Kyomin Jung


Abstract

There are two types of methods that enable non-autoregressive text-to-speech (TTS) models to learn the one-to-many relationship between text and speech effectively. The first is to use an advanced generative framework such as normalizing flow (NF). The second is to condition speech generation on variance information such as pitch or energy. With the second type, it is also possible to control the variance factors by adjusting the variance values provided to the model. In this paper, we propose a novel method called VarianceFlow that combines the advantages of both types. By modeling the variance with NF, VarianceFlow predicts the variance information more precisely, which improves speech quality. Moreover, the NF objective makes the model use the variance information and the text in a disentangled manner, resulting in more precise variance control. In our experiments, VarianceFlow shows superior performance over other state-of-the-art TTS models in terms of both speech quality and controllability.
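
To make the idea concrete, here is a minimal sketch, assuming a single-step elementwise affine flow whose statistics are predicted from the text encoder output. The function names, shapes, and flow depth are illustrative assumptions for exposition, not the paper's actual multi-step flow architecture:

```python
# Toy sketch of an NF-based variance predictor in plain numpy (illustrative
# only; the real VarianceFlow uses a deeper flow than this single affine step).
import numpy as np

def flow_forward(pitch, mu, log_s):
    """Map observed pitch to latent z, conditioned on text features.

    pitch, mu, log_s: arrays of shape (T,); mu and log_s would be predicted
    from the text encoder output. Returns z and the log-determinant of
    dz/dpitch for this elementwise affine map.
    """
    z = (pitch - mu) * np.exp(-log_s)
    log_det = -log_s.sum()
    return z, log_det

def nll(pitch, mu, log_s):
    """Flow negative log-likelihood: the training objective that pushes z
    toward a text-independent standard normal prior."""
    z, log_det = flow_forward(pitch, mu, log_s)
    log_prob = -0.5 * (z ** 2 + np.log(2 * np.pi)).sum() + log_det
    return -log_prob

def flow_inverse(z, mu, log_s):
    """Inference: sample z from the prior and invert the flow to get pitch."""
    return mu + np.exp(log_s) * z

# Toy usage with T = 5 positions.
rng = np.random.default_rng(0)
mu, log_s = rng.normal(size=5), 0.1 * rng.normal(size=5)
print("NLL:", nll(rng.normal(size=5), mu, log_s))
print("Sampled pitch:", flow_inverse(0.667 * rng.normal(size=5), mu, log_s))
```

Because maximizing this likelihood drives z toward a prior that carries no text information, the variance factor and the text end up disentangled, which is what the control and diversity demos below exploit.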


MOS results

Text 1: Even the Caslon type when enlarged shows great shortcomings in this respect
Text 2: and others who were present say that no agent was inebriated or acted improperly.
Text 3: Hill had both feet on the car and was climbing aboard to assist President and Mrs. Kennedy.
[Audio samples for each system: GT Waveform, GT Mel-spectrogram, Tacotron 2, Glow-TTS, FastSpeech2-phoneme, FastSpeech2-frame, VarianceFlow-phoneme, and VarianceFlow-frame (Ours).]

Pitch shift

[Audio samples: pitch shifted by λ ∈ {-6, -4, -2, +2, +4, +6} for each of FastSpeech 2, VarianceFlow-reversed, and VarianceFlow.]
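
For context, the sketch below contrasts the two control strategies compared above, reusing the same toy single-step flow as the earlier sketch. Per the abstract, FastSpeech 2 controls pitch by shifting the variance values provided to the model; for VarianceFlow we assume λ is applied to the flow latent z before the inverse pass. The function names are hypothetical:

```python
import numpy as np

def fastspeech2_shift(pred_pitch, lam):
    # FastSpeech 2-style control: shift the (normalized) pitch values
    # that are provided to the decoder.
    return pred_pitch + lam

def varianceflow_shift(mu, log_s, lam):
    # Assumed VarianceFlow-style control: place the latent at z = lam and
    # invert the toy single-step affine flow from the earlier sketch.
    z = np.full_like(mu, float(lam))
    return mu + np.exp(log_s) * z
```

Since z is trained to be disentangled from the text, shifting it is meant to move pitch without disturbing other speech attributes, which is the behavior the samples above demonstrate.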


Diversity

[Audio samples: four samples each from VarianceFlow with σ = 0.0, with σ = 0.667, and with σ = 0.667 plus encoder dropout.]
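
The diversity samples correspond to sampling the flow latent at different temperatures. A minimal sketch under the same toy flow (σ = 0.667 matches the demo setting above; the "+ Encoder Dropout" variant is only noted in a comment, as its exact mechanics are described in the paper):

```python
import numpy as np

def sample_pitch(mu, log_s, sigma=0.667, rng=None):
    """Draw one pitch contour: z ~ N(0, sigma^2 I), then invert the flow.

    sigma = 0.0 collapses to the deterministic mode, so all samples are
    identical; sigma = 0.667 yields diverse contours. The "+ Encoder
    Dropout" setting would additionally keep the text encoder's dropout
    enabled at inference (not modeled in this toy sketch).
    """
    rng = rng or np.random.default_rng()
    z = rng.normal(scale=sigma, size=mu.shape) if sigma > 0 else np.zeros_like(mu)
    return mu + np.exp(log_s) * z
```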