Google introduces Gemini 3.1 Flash TTS, a new text-to-speech model with audio tags for fine-grained vocal control, native multi-speaker dialogue, and 70+ language support — landing in the 'most attractive quadrant' of the Artificial Analysis TTS leaderboard with an Elo of 1,211.
On April 15, 2026, Google announced Gemini 3.1 Flash TTS, the newest text-to-speech model in the Gemini family, designed to deliver higher-quality, more controllable, and more expressive audio output. The model goes beyond traditional TTS by introducing audio tags that let developers control vocal style, pace, and delivery directly inside the prompt, alongside native support for multi-speaker dialogue, scene direction, and speaker-level control. The rollout was detailed on Google's official blog in a post co-authored by Senior Product Manager Vilobh Meshram and Principal Research Engineer Max Gubin.
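To make the idea of in-prompt direction concrete, here is a minimal Python sketch of how a multi-speaker script with per-line tags might be assembled into a single prompt string. The announcement as summarized here does not specify the tag syntax, so the bracketed form (`[whispering]`), the tag names, the `Scene:` header, and the `Line` and `render_prompt` helpers are all illustrative assumptions, not Google's documented format. The sketch only builds text and makes no API call.

```python
from dataclasses import dataclass, field


@dataclass
class Line:
    """One line of dialogue. The tag names are hypothetical placeholders."""
    speaker: str
    text: str
    tags: list[str] = field(default_factory=list)


def render_prompt(scene: str, lines: list[Line]) -> str:
    """Build a plain-text transcript with a scene note and bracketed tags.

    The "[tag]" syntax is an assumption made for illustration only.
    """
    out = [f"Scene: {scene}", ""]
    for line in lines:
        # Each tag becomes a bracketed prefix placed before the spoken text.
        tag_prefix = "".join(f"[{tag}] " for tag in line.tags)
        out.append(f"{line.speaker}: {tag_prefix}{line.text}")
    return "\n".join(out)


script = [
    Line("Narrator", "The storm had passed by morning.", ["calm", "slow"]),
    Line("Mara", "Did you hear that?", ["whispering"]),
    Line("Narrator", "Nobody answered."),
]
prompt = render_prompt("A quiet cabin at dawn", script)
print(prompt)
```

In a real integration, a string like this would be sent to the model together with a voice assignment for each speaker. The exact request shape should be taken from Google's current API documentation rather than from this sketch.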
Gemini 3.1 Flash TTS supports more than 70 languages and lands in what Artificial Analysis calls the 'most attractive quadrant' of its public TTS leaderboard, combining high speech quality with low cost. The model achieved an Elo score of 1,211 on the Artificial Analysis TTS Arena, which ranks models by blind human preference comparisons. Google highlighted the model's controllability as a key differentiator: rather than locking creators into a fixed delivery style, audio tags allow per-line direction, which is useful for audiobooks, game NPCs, podcast generation, and customer-facing voice agents.
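For readers unfamiliar with Elo-style rankings, the sketch below shows how a rating gap is conventionally read as an expected head-to-head preference rate. It uses the classic Elo formula with a 400-point scale. Whether Artificial Analysis uses exactly this scale is an assumption, and the rival rating of 1,150 is a made-up value for illustration, not a figure from the leaderboard.

```python
def expected_win_rate(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Classic Elo expected score of A against B.

    The 400-point scale is the chess convention. An arena leaderboard
    may fit its ratings differently, so treat the result as a rough guide.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


gemini_elo = 1211          # figure reported in the article
hypothetical_rival = 1150  # illustrative value only, not from the leaderboard

rate = expected_win_rate(gemini_elo, hypothetical_rival)
print(f"Expected preference rate vs. rival: {rate:.1%}")
```

Under these assumptions, a gap of about 60 points corresponds to listeners preferring the higher-rated model somewhat more often than not. It is an edge rather than a sweep.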
Availability spans Google's developer and enterprise stacks. The model is in preview through the Gemini API and Google AI Studio for developers and on Vertex AI for enterprise customers, and it is being integrated into Google Vids so Workspace users can generate narrated video content. All audio generated by Gemini 3.1 Flash TTS is automatically watermarked with SynthID, Google's invisible audio watermarking technology, so downstream tools and platforms can detect AI-generated speech. The watermarking is a direct response to the deepfake regulatory environment now in force under the EU AI Act and Tennessee's ELVIS Act.
The release intensifies competition in a market that has seen rapid moves over the past two months: Microsoft's MAI-Voice-1 in early April, Meta's open-source Llama-Voice mid-month, and xAI's Grok Voice API on April 17 with sharply lower pricing. Google's pitch is the combination of breadth and integration: 70+ languages out of the box, native multi-speaker support, and tight coupling with the rest of Gemini's multimodal stack. That positions Gemini 3.1 Flash TTS as the default choice for developers already building on Google's platform rather than as a price-led commodity play.