Technology📅 June 2, 2026👤 FreeReadText Team

Microsoft Launches MAI-Voice-2 at Build 2026: Expressive Speech and Zero-Shot Voice Cloning Across 15 Languages

Microsoft unveils MAI-Voice-2, calling it the most expressive and natural-sounding text-to-speech model it has built, expanding from English-only to 15 languages with granular emotion control, code-switching, and zero-shot voice prompting from a few seconds of audio.

On June 2, 2026, Microsoft announced MAI-Voice-2 at its Build 2026 keynote, describing it as 'the most expressive, natural-sounding text-to-speech model we've built to date.' The release is a major step up from last year's MAI-Voice-1, advancing across fidelity, language coverage, speaker consistency, and emotional range. In blind testing, listeners preferred MAI-Voice-2 over its predecessor 72% of the time, and across 11 languages the synthetic speech was preferred 45.5% of the time versus 44% for human recordings, with 10.5% rated a tie.

The headline change is multilingual breadth. Where MAI-Voice-1 was English-only, MAI-Voice-2 supports 15 languages and locales — including English (US and Australia), Italian, French, German, Hindi, Spanish (Spain and Mexico), Portuguese (Brazil and Portugal), Korean, Chinese (Simplified), Turkish, Russian, Thai, Dutch, Romanian, and Hungarian — with mid-sentence code-switching available for Hindi-English and Spanish-English pairs. The model adds granular emotion tags such as sad, whispered, and excited, and supports zero-shot voice prompting from a 5-to-60-second reference clip to control tone, accent, pacing, and speaking style while maintaining a stable speaker identity across long-form content.

MAI-Voice-2 is available through Microsoft Foundry and Azure Speech, with pricing starting at $22 per million characters — the same entry point as MAI-Voice-1 — and is being integrated into VS Code and the Dynamics 365 Contact Center. Microsoft pairs the model with a consent-enforced system designed to prevent unlicensed voice cloning, reflecting the heightened scrutiny on synthetic-voice technology now that the EU AI Act's transparency rules and US state statutes like Tennessee's ELVIS Act are in force.

The launch lands in an unusually crowded stretch for voice AI. Google shipped Gemini 3.1 Flash TTS in April, Meta open-sourced Llama-Voice the same month, and OpenAI rolled out its GPT-Realtime-2 voice models in early May. Rather than competing purely on price, Microsoft is leaning on distribution — bundling MAI-Voice-2 into Azure, Copilot, and developer tooling already used across its enterprise base — to position the model as the default choice for organizations building voice-first applications on its platform.

MicrosoftMAI-Voice-2Multilingual TTSVoice CloningBuild 2026Azure

Source

← Back to News