ZeSTA: Zero-Shot TTS Augmentation with Domain-Conditioned Training for Data-Efficient Personalized Speech Synthesis

The authors investigate zero-shot text-to-speech (ZS-TTS) as a data augmentation source for low-resource personalized speech synthesis.

Hugging Face Daily Papers · ~4 min read
Research

Academic or research source. Check the methodology, sample size, and whether it's been replicated.

  • While synthetic augmentation can provide linguistically rich and phonetically diverse speech, naively mixing large amounts of synthetic speech with limited real recordings often leads to speaker…
  • To address this issue, the authors propose ZeSTA, a simple domain-conditioned training framework that distinguishes real and synthetic speech via a lightweight domain embedding,…
  • Experiments on LibriTTS and an in-house dataset with two ZS-TTS sources demonstrate that the proposed approach improves speaker similarity over naive synthetic augmentation while…

Context

Synthetic augmentation can provide linguistically rich and phonetically diverse speech, but naively mixing large amounts of synthetic speech with limited real recordings often degrades speaker similarity during fine-tuning. To address this issue, the authors propose ZeSTA, a simple domain-conditioned training framework. It distinguishes real from synthetic speech via a lightweight domain embedding and combines this with real-data oversampling to stabilize adaptation under extremely limited target data, all without modifying the base architecture. Experiments on LibriTTS and an in-house dataset with two ZS-TTS sources show that the approach improves speaker similarity over naive synthetic augmentation while preserving intelligibility and perceptual quality.
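The two ingredients named above, a domain embedding and real-data oversampling, can be illustrated with a minimal sketch. This is not the paper's implementation. The summary does not say how the embedding is injected, what oversampling ratio is used, or which domain is selected at inference. Every name here (`Utterance`, `oversample_real`, `DomainEmbedding`, `target_real_fraction`) is hypothetical, and adding the domain vector to a speaker-conditioning vector is an assumption made for illustration only.

```python
import math
import random
from dataclasses import dataclass

# Hypothetical domain ids; the paper's actual labelling scheme is not given.
REAL, SYNTHETIC = 0, 1


@dataclass(frozen=True)
class Utterance:
    text: str
    domain: int  # REAL or SYNTHETIC


def oversample_real(real, synthetic, target_real_fraction=0.5):
    """Repeat the real utterances until they form at least the target
    fraction of the training pool (assumed strategy, not the paper's)."""
    if not real:
        raise ValueError("need at least one real utterance")
    if not 0.0 < target_real_fraction < 1.0:
        raise ValueError("target_real_fraction must be in (0, 1)")
    # Smallest integer k with k*R / (k*R + S) >= t, i.e. k >= t*S / ((1-t)*R).
    needed = (target_real_fraction * len(synthetic)) / (
        (1.0 - target_real_fraction) * len(real)
    )
    repeats = max(1, math.ceil(needed))
    return list(real) * repeats + list(synthetic)


class DomainEmbedding:
    """A two-row lookup table: one small vector per domain."""

    def __init__(self, dim, seed=0):
        rng = random.Random(seed)
        self.table = [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(2)]

    def condition(self, speaker_vec, domain):
        # Assumed injection point: add the domain vector to the speaker vector.
        return [s + d for s, d in zip(speaker_vec, self.table[domain])]


if __name__ == "__main__":
    real = [Utterance("real-1", REAL), Utterance("real-2", REAL)]
    synthetic = [Utterance(f"syn-{i}", SYNTHETIC) for i in range(8)]
    pool = oversample_real(real, synthetic, target_real_fraction=0.5)
    n_real = sum(u.domain == REAL for u in pool)
    print(f"pool size={len(pool)}, real share={n_real / len(pool):.2f}")

    emb = DomainEmbedding(dim=4)
    # Assumption: at inference the model is conditioned on the REAL domain.
    print(emb.condition([0.0] * 4, REAL))
```

In a real system the lookup table would be a trainable layer inside the TTS model. The sketch only shows how a per-utterance domain label might flow through batching and conditioning.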

For builders

While synthetic augmentation can provide linguistically rich and phonetically diverse speech, naively mixing large amounts of synthetic speech with limited real recordings often leads to speaker…

