Audio Course



Supplemental reading and resources

This unit introduced the text-to-speech task and covered a lot of ground. Want to learn more? The resources below will help you deepen your understanding of the topics covered.

HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis: a paper introducing HiFi-GAN for speech synthesis.
X-Vectors: Robust DNN Embeddings For Speaker Recognition: a paper introducing the X-Vector method for speaker embeddings.
FastSpeech 2: Fast and High-Quality End-to-End Text to Speech: a paper introducing FastSpeech 2, another popular text-to-speech model that uses a non-autoregressive TTS method.
A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech: a paper introducing MQTTS, an autoregressive TTS system that replaces mel-spectrograms with a quantized discrete representation.
