What are the advantages of deep learning based speech synthesis/TTS systems compared to parametric/concatenative TTS?

  • xxce2AAb@feddit.dk · 3 days ago

    Much less manual work to implement and refine to achieve convincing results? On the flip side: Huge models, and comparatively much more computationally expensive to run.
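    To make the "manual work" contrast concrete: classic concatenative TTS stitches together prerecorded waveform units (diphones, half-phones) from a hand-built database, smoothing the joins; a neural system instead learns the mapping from text to audio end to end. Below is a toy Python sketch of the concatenative idea only, assuming NumPy; the phoneme labels, the sine-burst "units", and the `synthesize` helper are all invented for illustration, standing in for a real recorded unit database.

    ```python
    import numpy as np

    SR = 16_000  # sample rate in Hz

    def make_unit(freq_hz, dur_s=0.1):
        """Stand-in for a prerecorded unit: a short sine burst."""
        t = np.arange(int(SR * dur_s)) / SR
        return np.sin(2 * np.pi * freq_hz * t).astype(np.float32)

    # Toy "unit database" mapping phoneme labels to waveforms.
    # A real system has thousands of recorded, hand-labeled units.
    units = {"B": make_unit(140), "AH": make_unit(220), "T": make_unit(330)}

    def synthesize(phonemes, fade=160):
        """Concatenate units with a short linear crossfade at each joint."""
        out = units[phonemes[0]].copy()
        ramp = np.linspace(0.0, 1.0, fade, dtype=np.float32)
        for p in phonemes[1:]:
            nxt = units[p].copy()
            # Blend the tail of the running output into the head of the
            # next unit, then append the rest of that unit.
            out[-fade:] = out[-fade:] * (1 - ramp) + nxt[:fade] * ramp
            out = np.concatenate([out, nxt[fade:]])
        return out

    wave = synthesize(["B", "AH", "T"])
    ```

    The painful part in practice is everything this sketch waves away: recording and segmenting the unit inventory, picking the best unit per context, and tuning the join costs — exactly the hand-engineering that neural models trade for data and compute.
    
    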

  • m_‮f@discuss.online · 3 days ago

    The Bitter Lesson talks about speech recognition rather than synthesis, but I would guess a similar dynamic applies:

    In speech recognition, there was an early competition, sponsored by DARPA, in the 1970s. Entrants included a host of special methods that took advantage of human knowledge—knowledge of words, of phonemes, of the human vocal tract, etc. On the other side were newer methods that were more statistical in nature and did much more computation, based on hidden Markov models (HMMs). Again, the statistical methods won out over the human-knowledge-based methods. This led to a major change in all of natural language processing, gradually over decades, where statistics and computation came to dominate the field. The recent rise of deep learning in speech recognition is the most recent step in this consistent direction. Deep learning methods rely even less on human knowledge, and use even more computation, together with learning on huge training sets, to produce dramatically better speech recognition systems. As in the games, researchers always tried to make systems that worked the way the researchers thought their own minds worked—they tried to put that knowledge in their systems—but it proved ultimately counterproductive, and a colossal waste of researcher’s time, when, through Moore’s law, massive computation became available and a means was found to put it to good use.

    Also cross-posted to !discuss@discuss.online here, since I was reminded of the essay.