NetEase Youdao Releases Confucius4-TTS: Open-Source 14-Language Voice Cloning from Just 3 Seconds of Audio
Technology | June 23, 2026 | FreeReadText Team

Chinese edtech giant NetEase Youdao has open-sourced Confucius4-TTS under Apache 2.0. The 1.3B-parameter voice cloning model is reported to reach 85%+ voice similarity from 3 seconds of audio across 14 languages, with no reference text needed for cross-lingual cloning.

On June 23, 2026, NetEase Youdao, the edtech arm of Chinese internet company NetEase, released Confucius4-TTS, a 1.3-billion-parameter text-to-speech model that the company says can clone a voice from just 3 seconds of audio with over 85% similarity and 97% task accuracy. The model supports 14 languages: Chinese, English, Japanese, Korean, German, French, Spanish, Indonesian, Italian, Thai, Portuguese, Russian, Malay, and Vietnamese. It is fully open-sourced under the Apache 2.0 license, which permits commercial use subject to the license's attribution and notice terms.
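The announcement does not say how the 85% similarity figure is computed. One common convention in voice cloning evaluation is the cosine similarity between speaker embeddings of the reference clip and the generated clip. The sketch below illustrates only that convention, as an assumption; the function name and the toy vectors are hypothetical, and real embeddings would come from a speaker encoder rather than hand-written lists.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


# Toy stand-ins for speaker embeddings (hypothetical values, not model output).
reference_embedding = [0.2, 0.7, -0.1, 0.4]
generated_embedding = [0.25, 0.65, -0.05, 0.35]
similarity = cosine_similarity(reference_embedding, generated_embedding)
```

A score near 1.0 means the two vectors point in nearly the same direction. Whether NetEase Youdao's percentage maps onto this scale is not stated in the release.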

Confucius4-TTS is built on a GPT-style semantic language model backbone rather than a traditional neural vocoder architecture like HiFi-GAN, paired with an ECAPA-TDNN speaker encoder and a Flow Matching generation framework. This design enables what NetEase Youdao claims is an industry first: zero-shot cross-lingual voice cloning without any reference text. Users provide a short audio sample in any language, and the model can generate speech in all 14 supported languages while preserving the speaker's vocal identity, emotion, and prosody. According to the company, this effectively makes a single voice fluent across 14 languages with no accent transfer.
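For readers unfamiliar with Flow Matching, generation in such frameworks generally amounts to integrating a velocity field from random noise at t = 0 to a data sample at t = 1. The toy sketch below shows that idea with plain Euler steps. It is not Confucius4-TTS code: the velocity function is a hypothetical analytic stand-in for a trained network, and `target` stands in for the acoustic features that a real model would produce conditioned on text and a speaker embedding.

```python
import random


def toy_velocity(x, t, target):
    """Hypothetical stand-in for a learned velocity network.

    For the straight-line path x_t = (1 - t) * noise + t * target, the
    velocity that carries x_t to `target` by t = 1 is (target - x_t) / (1 - t).
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def euler_sample(target, steps=8, seed=0):
    """Integrate dx/dt = v(x, t) from t = 0 (noise) to t = 1 with Euler steps."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # start from Gaussian noise
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t never reaches 1, so the division in toy_velocity is safe
        v = toy_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


sample = euler_sample([0.5, -1.0, 2.0], steps=8)
```

In this toy setting the sampler lands on `target` up to floating-point error, because the velocity field is known in closed form. A trained model only approximates the field, so the step count typically trades speed against output quality.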

The full model weights, training recipes, and inference code are available on GitHub and Hugging Face, with a 54GB resource package supporting local offline deployment. A Gradio-based web demo allows instant testing without setup. NetEase Youdao is positioning the model for cross-border content creation, including short video dubbing, digital human voiceovers, multilingual AI tutoring, and brand localization. These use cases have traditionally required expensive human voice actors and weeks of production time per language. Early community response has been strong, with the release described as a landmark for open-source multilingual TTS.

The release arrives during a period of rapid expansion in open-source voice AI. Meta's Llama-Voice (April 2026) and Rumik's Silk Mulberry 1.5 (June 2026) both opened new fronts in open-source TTS, but Confucius4-TTS differentiates itself on zero-shot cross-lingual capability: cloning a voice from a sample in one language and generating speech in the 13 others without per-language fine-tuning. NetEase Youdao's choice of Apache 2.0 over a restrictive research license signals an intent to drive developer adoption, particularly in Asian language markets where commercial TTS quality has historically trailed English-language models.

Tags: NetEase Youdao, Confucius4-TTS, Open Source, Voice Cloning, Multilingual TTS, Flow Matching
