Ziqiang Shi, Rujie Liu.
This demo introduces LangWave, a vocoder that generates realistic speech from intermediate acoustic features given as input, e.g. a mel-spectrogram.
In recent years, methods based on diffusion generative models have achieved state-of-the-art performance in voice generation. Most of these approaches are based on first-order stochastic differential equations or their equivalent diffusion models. This paper attempts to upgrade these first-order methods and proposes LangWave, which uses a third-order **Lang**evin dynamical system to generate speech **wave**forms. LangWave simultaneously models the position, velocity and acceleration of the waveform during diffusion and sampling in the ambient Euclidean space. Our vocoder can therefore control the evolution from white noise to meaningful waveforms more precisely and smoothly. Our experiments on the public LJSpeech data set show that the improvement is significant in both objective and subjective evaluation, and LangWave achieves a new state-of-the-art MOS of 4.55.
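To make the idea of a third-order system concrete, the following is a minimal sketch of a generic linear third-order Langevin forward process, simulated with Euler-Maruyama. It is not the formulation used in the paper: the function name `third_order_forward`, the coefficients `gamma` and `xi`, and the specific drift terms are assumptions chosen for illustration. In this toy system the noise is injected only into the acceleration variable, so the position (the waveform) is perturbed indirectly through the velocity, and the joint stationary distribution of position, velocity and acceleration is a standard Gaussian.

```python
import numpy as np


def third_order_forward(x0, n_steps=1000, dt=0.01, gamma=1.0, xi=2.0, seed=0):
    """Euler-Maruyama simulation of a toy third-order Langevin forward process.

    Assumed (illustrative) dynamics, not taken from the LangWave paper:
        dx = v dt
        dv = (-x + gamma * a) dt
        da = (-gamma * v - xi * a) dt + sqrt(2 * xi) dW
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=np.float64).copy()  # position: the waveform samples
    v = np.zeros_like(x)                         # velocity, starts at zero
    a = np.zeros_like(x)                         # acceleration, starts at zero
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        # Every update uses the state from the start of the step.
        x_new = x + v * dt
        v_new = v + (-x + gamma * a) * dt
        a_new = a + (-gamma * v - xi * a) * dt + np.sqrt(2.0 * xi * dt) * noise
        x, v, a = x_new, v_new, a_new
    return x, v, a


# A synthetic 5 Hz sine stands in for a clean waveform (hypothetical input).
t = np.linspace(0.0, 1.0, 256)
clean = np.sin(2 * np.pi * 5 * t)
x_T, v_T, a_T = third_order_forward(clean)
```

After a single step the position is unchanged, because the noise has only reached the acceleration. This is the sense in which a higher-order system evolves the waveform more smoothly than a first-order diffusion that adds noise to the position directly.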
Please listen to the following voice samples generated by LangWave and the systems it is compared against. The corresponding sentences are as follows:
1. but they proceeded in all seriousness, and would have shrunk from no outrage or atrocity in furtherance of their foolhardy enterprise.
2. three cars for press photographers, an official party bus for white house staff members and others, and two press buses.
3. a base station at a fixed location in dallas operated a radio network which linked together the lead car,
4. the lifting had been so complete in this case that there was no trace of the print on the rifle itself when it was examined by latona.
5. with the active cooperation of the responsible agencies and with the understanding of the people of the united states in their demands upon their president,
Ground truth | WaveGlow | ItoWave | DiffWave | WaveNet | LangWave |
---|---|---|---|---|---|
[1]. Ziqiang Shi, Rujie Liu. LangWave: Realistic Voice Generation Based on High-order Langevin dynamics.