Meta Releases Llama-Voice: First Fully Open-Source TTS Model to Match Commercial Giants in 50+ Languages
Technology · April 12, 2026 · FreeReadText Team


Meta drops Llama-Voice under an Apache 2.0 license, delivering near state-of-the-art voice synthesis, zero-shot voice cloning from 10 seconds of audio, and 52-language coverage — all runnable on a single consumer GPU.

In April 2026, Meta released Llama-Voice, the first fully open-source text-to-speech foundation model to approach parity with commercial offerings from OpenAI, ElevenLabs, and Microsoft. Published under an Apache 2.0 license with full weights, training code, and data recipes on Hugging Face, the 7B-parameter model supports 52 languages and can perform zero-shot voice cloning from as little as 10 seconds of reference audio. Meta says the model runs in real time on a single NVIDIA RTX 4090 or an Apple M3 Max, dramatically lowering the barrier to entry for high-quality voice AI.

The release is the culmination of Meta FAIR's multi-year Massively Multilingual Speech program, which extended the Wav2Vec and SeamlessM4T lines of work in a generative direction. Llama-Voice uses a decoder-only architecture aligned with the Llama language model family, making it the first TTS model that can be fine-tuned with the same tooling developers already use for text LLMs. Early benchmarks from Hugging Face's AudioArena leaderboard place Llama-Voice within 4% of OpenAI Voice Engine on naturalness (MOS 4.41 vs 4.59) and ahead of every closed model on low-resource languages such as Swahili, Bengali, and Tagalog.
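The decoder-only framing is what makes LLM tooling reusable: audio is discretized by a neural codec into tokens that share a single sequence with text tokens, so speech generation reduces to ordinary next-token prediction. A minimal sketch of that idea follows; the vocabulary sizes, offsets, and function names are illustrative assumptions, not Llama-Voice's actual values.

```python
# Toy sketch of a decoder-only TTS token stream: codec codes are offset
# past the text vocabulary so text and audio live in one sequence, and
# standard LLM training/fine-tuning loops apply unchanged.
# All sizes below are made up for illustration.

TEXT_VOCAB = 32_000      # hypothetical text vocabulary size
AUDIO_CODEBOOK = 1_024   # hypothetical neural-codec codebook size

def to_token_stream(text_ids: list[int], audio_codes: list[int]) -> list[int]:
    """Concatenate text tokens with offset audio codes into one stream."""
    assert all(0 <= c < AUDIO_CODEBOOK for c in audio_codes)
    return list(text_ids) + [TEXT_VOCAB + c for c in audio_codes]

print(to_token_stream([17, 98], [0, 5, 1023]))
# → [17, 98, 32000, 32005, 33023]
```

At inference time the model autoregressively emits tokens past the text prompt, and any token ≥ `TEXT_VOCAB` is mapped back to a codec code and decoded to a waveform.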

The open release has triggered an immediate reaction across the industry. Within 48 hours of launch, the model had accumulated over 800,000 downloads and spawned dozens of community fine-tunes for podcasting, audiobooks, and game NPC dialogue. Independent developer Georgi Gerganov released a `llama-voice.cpp` port that runs quantized inference on Apple Silicon laptops at 1.8x real-time. Analysts at SemiAnalysis estimate that Llama-Voice will compress commercial TTS API pricing by 30–50% over the next 12 months, as enterprises gain a credible self-hosted alternative.
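The "1.8x real-time" figure is a real-time factor: seconds of audio generated per second of wall-clock compute, with values above 1.0 meaning the model synthesizes faster than playback. A quick helper (the example numbers are illustrative):

```python
def real_time_factor(audio_seconds: float, synthesis_seconds: float) -> float:
    """Ratio of generated audio duration to wall-clock synthesis time.
    Values above 1.0 mean synthesis is faster than real-time playback."""
    return audio_seconds / synthesis_seconds

# e.g. 60 s of speech generated in ~33.3 s of wall-clock time
print(round(real_time_factor(60.0, 33.3), 2))  # → 1.8
```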

For Meta, the strategic logic mirrors its Llama text-model playbook: commoditize the complement. By making high-quality voice synthesis freely available, Meta aims to accelerate voice-first applications across its own Ray-Ban Meta glasses, WhatsApp, and Instagram ecosystems while denying competitors a proprietary moat. Critics have raised familiar concerns about the misuse potential of a freely available voice cloning model, but Meta counters that Llama-Voice ships with built-in watermarking via the SeamlessWatermark system and that responsible disclosure is better served by transparent, auditable weights than by closed APIs.
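Audio watermarks of this general kind add a low-amplitude signal seeded by a secret key, which a detector recovers by correlating the audio against the same keyed signal. The sketch below illustrates only that spread-spectrum idea; it is not SeamlessWatermark's actual algorithm, and the amplitudes and threshold are toy values.

```python
import numpy as np

def embed_watermark(audio: np.ndarray, key: int, strength: float = 0.05) -> np.ndarray:
    """Add a low-amplitude pseudo-random mark derived from a secret key."""
    mark = np.random.default_rng(key).standard_normal(audio.shape)
    return audio + strength * mark

def detect_watermark(audio: np.ndarray, key: int, threshold: float = 3.0) -> bool:
    """Correlate against the keyed mark; a large score implies presence."""
    mark = np.random.default_rng(key).standard_normal(audio.shape)
    score = float(audio @ mark) / np.sqrt(audio.size)
    return score > threshold

rng = np.random.default_rng(0)
clean = 0.1 * rng.standard_normal(16_000)    # 1 s of fake 16 kHz audio
marked = embed_watermark(clean, key=42)
print(detect_watermark(marked, key=42), detect_watermark(clean, key=42))
# → True False
```

Production systems hide the mark perceptually and survive compression and resampling; the correlation test against a keyed signal is the common core.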

Meta · Llama-Voice · Open Source · Multilingual TTS · Voice Cloning · Hugging Face

