Text to speech is not a new technology. As early as 10 years ago there were all kinds of hardware and software products that could read text aloud automatically. But earlier text-to-speech technology had a problem: the computer-synthesized voice sounded too rigid.

So in recent years, cutting-edge research in this field has shifted toward making computer-synthesized voices more expressive and emotional.

Voice interaction involves speech synthesis, that is, turning text into sound, with sound acting as the carrier of the textual content. Voice is the most common, familiar, and comfortable way of exchanging information in daily life: people talk to each other, watch TV, listen to the radio, talk to a smart speaker, and so on. The quality of that experience therefore has a great impact on the user's perception.

If the synthesis quality is high, the voice sounds close to a real person, and the emotional expression is rich, then users will naturally be more willing to interact. They will feel that this is not a cold machine and will want to keep interacting with this kind of intelligent agent.

In 2016, the advent of WaveNet revolutionized sound generation by replacing frame-by-frame generation with point-by-point waveform generation, in which the model produces the audio one sample at a time. The benefit is that the synthesized voice comes very close to natural human speech. The drawback is the computational cost, since every sample has to wait for the ones before it. Over the last two years this cost has gradually become acceptable through a series of modifications such as Parallel WaveNet, while the advantages have become more and more fully realized.
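
To make the point-by-point idea concrete, here is a minimal sketch in Python. It is not WaveNet itself: the function toy_next_sample_probs is a hypothetical stand-in for a trained network, and the 16 amplitude levels are an arbitrary choice made for illustration. What it does show is the generation loop, where each new sample is drawn from a distribution that depends on the samples already generated.

```python
import math
import random

def toy_next_sample_probs(history, num_levels=16):
    """Hypothetical stand-in for a trained network (an assumption, not WaveNet).

    Returns a probability distribution over quantized amplitude levels for
    the next sample. It simply favours levels close to the previous sample.
    """
    prev = history[-1] if history else num_levels // 2
    weights = [math.exp(-abs(level - prev)) for level in range(num_levels)]
    total = sum(weights)
    return [w / total for w in weights]

def generate(num_samples, num_levels=16, seed=0):
    """Generate a waveform one sample at a time (autoregressively)."""
    rng = random.Random(seed)
    waveform = []
    for _ in range(num_samples):
        # One model call per output sample: this is the expensive part.
        probs = toy_next_sample_probs(waveform, num_levels)
        level = rng.choices(range(num_levels), weights=probs, k=1)[0]
        waveform.append(level)
    return waveform

samples = generate(100)
```

The loop also hints at why this style of generation is slow. The model is called once per sample, so at an assumed sampling rate of 16,000 samples per second, one second of audio needs 16,000 sequential calls, and none of them can start before the previous one finishes.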

In 2017, Tacotron and subsequently Tacotron 2 gave us an end-to-end approach to speech synthesis. At their core is an attention mechanism, which lets the model learn the correspondence between the input text and the output audio instead of relying on a hand-built alignment. Tacotron brought a great improvement in the rhythm and tempo of the synthesized speech.
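
The attention idea can be sketched in a few lines of Python. This is a simplified illustration, not the Tacotron implementation: it uses plain dot-product scoring rather than the attention variants used in the Tacotron papers, and the vectors encoder_states and query are made-up toy values. At each output step, the decoder scores every input position, turns the scores into weights that sum to 1, and takes a weighted average of the input representations.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, encoder_states):
    """Dot-product attention (a simplification chosen for this sketch).

    query:          decoder state for the current output step
    encoder_states: one vector per input position (e.g. per character)
    Returns the attention weights and the resulting context vector.
    """
    scores = [sum(q * h for q, h in zip(query, state)) for state in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [
        sum(w * state[i] for w, state in zip(weights, encoder_states))
        for i in range(dim)
    ]
    return weights, context

# Toy values, purely illustrative: three input positions, two dimensions each.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [0.0, 2.0]
weights, context = attend(query, encoder_states)
```

Here the query matches the second and third input positions far better than the first, so those two receive most of the weight. In a synthesis model the same kind of weights indicate which part of the text the model is "reading" while it produces each piece of audio.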

Lovo.ai is an intelligent voice synthesis tool based on Tacotron 2. The team took part in CES 2020 and drew a lot of attention, and I believe more and more people will use this product.

