Technology | July 1, 2026 | FreeReadText Team

ViiTorVoice-NAR Goes Open Source: First TTS Model That Edits Single Words Inside Finished Audio

Chinese startup Yunshang Qulv releases ViiTorVoice-NAR under Apache 2.0, introducing word-level audio editing that replaces individual words without regenerating surrounding content — alongside sub-60ms latency and benchmark-leading accuracy on both English and Chinese.

On July 1, 2026, Chinese AI startup Yunshang Qulv released ViiTorVoice-NAR, an open-source text-to-speech model that introduces a capability no commercially deployed TTS system currently offers: word-level audio editing. Autoregressive systems such as CosyVoice3, Qwen3-TTS, and Fish Audio S2 must regenerate an entire sentence to fix a single mispronounced word. ViiTorVoice-NAR can instead replace individual words or short phrases within a finished recording while preserving the surrounding audio's timbre, rhythm, emotion, and background characteristics.

The model achieves this through a non-autoregressive (NAR) architecture built on a masked discrete language model inspired by BERT-style bidirectional context. It treats the targeted word region as a blank and reconstructs it using context from both directions, conceptually similar to a fill-in-the-blank operation applied to audio rather than text. On the Seed-TTS benchmark, ViiTorVoice-NAR achieved a word error rate (WER) of 1.32% on English and 0.99% on Chinese, making it the first model reported below 1.0% on Chinese. First-frame latency is under 60 milliseconds, well below the industry average of 150–200 ms. The model also supports reference-text-free voice cloning and fine-grained emotional control via inline tags for cues such as laughter and sighs.
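The fill-in-the-blank idea can be illustrated with a toy sketch. This is not ViiTorVoice-NAR's actual code or API: the names `MASK`, `mask_span`, `infill`, and `toy_predictor` are hypothetical, audio is stood in for by a list of integer token ids, and the edited region is assumed to keep its original length. A real system would replace `toy_predictor` with a trained bidirectional model.

```python
from typing import Callable, List, Sequence

MASK = -1  # hypothetical id marking a blanked-out audio token

# A predictor sees the full masked sequence (context on both sides)
# and returns one token id for each masked position.
Predictor = Callable[[Sequence[int]], List[int]]


def mask_span(tokens: Sequence[int], start: int, end: int) -> List[int]:
    """Return a copy of `tokens` with positions [start, end) set to MASK."""
    if not 0 <= start <= end <= len(tokens):
        raise ValueError("span out of range")
    return list(tokens[:start]) + [MASK] * (end - start) + list(tokens[end:])


def infill(tokens: Sequence[int], start: int, end: int,
           predict: Predictor) -> List[int]:
    """Regenerate only [start, end); every other token is copied unchanged."""
    masked = mask_span(tokens, start, end)
    new_tokens = predict(masked)  # all blanks filled in one pass
    if len(new_tokens) != end - start:
        raise ValueError("predictor must return one token per masked position")
    return list(tokens[:start]) + list(new_tokens) + list(tokens[end:])


def toy_predictor(masked: Sequence[int]) -> List[int]:
    """Stand-in for a trained model: averages the nearest unmasked neighbours.

    Assumes the masked positions form one contiguous run.
    """
    blanks = [i for i, t in enumerate(masked) if t == MASK]
    if not blanks:
        return []
    left = masked[blanks[0] - 1] if blanks[0] > 0 else 0
    right = masked[blanks[-1] + 1] if blanks[-1] + 1 < len(masked) else 0
    return [(left + right) // 2] * len(blanks)


original = [10, 12, 14, 99, 98, 20, 22]  # 99, 98 stand for the misread word
edited = infill(original, 3, 5, toy_predictor)
print(edited)  # [10, 12, 14, 17, 17, 20, 22]
```

Only positions 3 and 4 are rewritten, and the predictor reads tokens on both sides of the blank. The tokens outside the span are copied through untouched, which is the property that would let surrounding audio survive an edit.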

Released under the Apache 2.0 license on GitHub and Hugging Face, ViiTorVoice-NAR can be deployed locally, keeping audio data on the user's own hardware. However, the release has drawn attention for what it does not include: there is no built-in audio watermarking and no consent verification mechanism for voice cloning. These gaps take on regulatory urgency because the EU AI Act's mandatory synthetic audio labeling requirements take effect on August 2, 2026, just 32 days after the model's release. Developers integrating ViiTorVoice-NAR into production systems will need to implement their own compliance tooling to meet these obligations.

The release marks a milestone for Chinese AI in the global TTS landscape. While Chinese labs have produced strong speech models before, ViiTorVoice-NAR is the first to claim a capability, word-level audio editing, that no Western commercial or open-source model currently matches. For content creators, podcast producers, and audiobook studios, the ability to fix a single misread word without re-recording an entire passage represents a genuine workflow breakthrough. The Apache 2.0 license also means the technique will likely be studied, replicated, and improved upon rapidly, potentially making word-level editing a standard feature in next-generation TTS systems across the industry.

Tags: ViiTorVoice, Open Source TTS, Word-Level Editing, Chinese AI, NAR Architecture, Apache 2.0
