Microsoft's 1.5-billion-parameter open-source neural voice synthesis model with 90-minute continuous generation and multi-speaker support
1.5B Parameters
90 Minutes Continuous
50+ Professional Voices
12 Languages
2025 Refresh
We refreshed the flagship 1.5B page to reflect the February 2025 benchmarks, expanded language packs, and new upgrade paths to the 7B & Large checkpoints.
Expanded Languages
Added Portuguese (PT/BR), Thai, and Italian voices while maintaining accent preservation.
Benchmark Refresh
New MOS (mean opinion score) and WER (word error rate) numbers measured against LibriTTS and in-house audiobook suites.
Upgrade Paths
Guided migration scripts help you move long-form projects into the new 7B/Large stack.
VibeVoice 1.5B is a major step forward in neural voice synthesis. With 1.5 billion parameters, the model targets studio-grade voice quality and naturalness for AI-generated speech.
Revolutionary Architecture - Transformer design (12-layer encoder, 8-layer decoder) optimized for voice synthesis
90-Minute Continuous Generation - Uninterrupted synthesis without quality degradation
Enterprise-Grade Quality - Studio-quality 48kHz/24-bit audio output
Open Source Innovation - Apache 2.0 licensed for complete transparency
The 1.5B parameter count enables voice modeling that previously required much larger models, bringing professional-grade voice synthesis to consumer hardware with 4 GB of GPU memory.
| Parameter | Specification | Details |
|---|---|---|
| Model Size | 1.5B Parameters | 1,536,000,000 trainable parameters |
| Architecture | Transformer-based | 12-layer encoder, 8-layer decoder |
| Maximum Duration | 90+ minutes | Continuous synthesis without breaks |
| Sampling Rate | 16-48 kHz | Adjustable based on requirements |
| Bit Depth | 16-24 bit | Professional audio quality |
| Latency | <200 ms | Real-time processing capable |
| Languages | 12 languages | English, Chinese, Japanese, Korean, German, French, Spanish, Arabic, Portuguese, Italian, Thai, Hindi |
| Voice Bank | 50+ voices | Pre-trained professional voices |
| Memory Usage | 4 GB GPU RAM | Optimized for consumer hardware |
| License | Apache 2.0 | Open source with commercial use |
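As a rough sanity check on the memory figure in the table, the sketch below estimates how much space the weights alone would occupy at common precisions. The parameter count comes from the table; the bytes-per-parameter values are the standard widths of each data type, and the helper name `weight_bytes` is purely illustrative. Activations, caches, and framework overhead are not counted, so treat the result as a lower bound rather than a measured footprint.

```python
# Back-of-the-envelope weight memory for the figures in the spec table.
PARAMS = 1_536_000_000  # trainable parameters, taken from the table
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}  # standard dtype widths


def weight_bytes(params: int, dtype: str) -> int:
    """Bytes needed to hold the weights alone (no activations or caches)."""
    return params * BYTES_PER_PARAM[dtype]


for dtype in BYTES_PER_PARAM:
    size = weight_bytes(PARAMS, dtype)
    print(f"{dtype}: {size / 1e9:.2f} GB ({size / 2**30:.2f} GiB)")
```

At FP16 the weights come to about 3.07 GB, which is consistent with the 4 GB figure if the remainder is taken up by activations and runtime overhead.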
Model Portfolio
Choose the right checkpoint for your workload. 1.5B remains the fastest entry point, while the new 7B and Large variants add longer maximum durations and finer emotional control.
| Model | Best for | VRAM (FP16) | Max duration | Unique benefit |
|---|---|---|---|---|
| VibeVoice 1.5B | Real-time TTS, education, prototyping | 4 GB | 90 minutes | <200 ms streaming latency, lowest cost footprint |
| VibeVoice 7B | Narration, multi-character drama, localization | 10 GB | 105 minutes | Prosody tokens & finer emotion control |
| VibeVoice Large | Studios, broadcasters, cinematic releases | 18 GB | 120 minutes | Broadcast mastering + extended language pack |
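The portfolio table can be read as a simple selection rule: take the smallest checkpoint that fits in available VRAM and covers the duration you need. The sketch below encodes that rule. The `Checkpoint` class and `pick_checkpoint` function are hypothetical helpers written for this page, not part of any VibeVoice API, and the numbers are copied directly from the table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Checkpoint:
    name: str
    vram_gb: int      # FP16 VRAM requirement from the portfolio table
    max_minutes: int  # maximum continuous duration from the table


# Rows copied from the portfolio table, ordered smallest to largest.
CHECKPOINTS = [
    Checkpoint("VibeVoice 1.5B", 4, 90),
    Checkpoint("VibeVoice 7B", 10, 105),
    Checkpoint("VibeVoice Large", 18, 120),
]


def pick_checkpoint(available_vram_gb: float, minutes_needed: float) -> Optional[Checkpoint]:
    """Return the smallest checkpoint that fits in VRAM and covers the duration."""
    for ckpt in CHECKPOINTS:
        if ckpt.vram_gb <= available_vram_gb and ckpt.max_minutes >= minutes_needed:
            return ckpt
    return None  # nothing in the table satisfies both constraints


choice = pick_checkpoint(available_vram_gb=12, minutes_needed=100)
print(choice.name if choice else "No checkpoint fits these constraints")
```

The rule ignores the "Best for" and "Unique benefit" columns on purpose; those are qualitative and worth weighing by hand once the hardware constraints have narrowed the options.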
Neural architecture that sustains 90+ minutes of uninterrupted voice generation without voice drift or semantic discontinuities. Try it live in our demo.
50+ pre-trained professional voices with 256-dimensional speaker embeddings and cross-speaker consistency algorithms. Use them online instantly.
Studio-quality 48kHz/24-bit audio with neural compression and professional-grade output for all applications.
Ultra-low latency processing under 200ms enables real-time applications and interactive voice experiences.
Native support for 12 languages including English, Chinese, Spanish, French, German, Japanese, Korean, Arabic, Portuguese, Italian, Thai, and Hindi.
The Apache 2.0 license allows commercial use, modification, and distribution, subject to the license's conditions such as retaining copyright and license notices.
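To plan storage for long-form output, it helps to know how large a full-length render is before any compression. The sketch below computes raw PCM size for a 90-minute session at both ends of the sampling-rate and bit-depth ranges listed in the specifications. It assumes a single (mono) channel, which this page does not state, and `pcm_bytes` is an illustrative helper rather than a library function.

```python
def pcm_bytes(minutes: float, sample_rate_hz: int, bit_depth: int, channels: int = 1) -> int:
    """Size of raw uncompressed PCM audio, excluding any container header."""
    seconds = minutes * 60
    bytes_per_sample = bit_depth // 8
    return int(seconds * sample_rate_hz * bytes_per_sample * channels)


# Mono is an assumption; pass channels=2 for stereo.
studio = pcm_bytes(90, 48_000, 24)   # top of the documented range
compact = pcm_bytes(90, 16_000, 16)  # bottom of the documented range

print(f"48 kHz / 24-bit: {studio / 1e6:.1f} MB")
print(f"16 kHz / 16-bit: {compact / 1e6:.1f} MB")
```

A 90-minute mono render is roughly 778 MB at 48 kHz / 24-bit and about 173 MB at 16 kHz / 16-bit, so the lower settings are worth considering when studio quality is not required.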
Basic functionality
Optimal performance
Enterprise-grade
Download VibeVoice 1.5B and start creating professional-quality voice synthesis in minutes. All downloads include the complete model, documentation, and example code.
pip install vibevoice
git clone https://github.com/vibe-voice/vibevoice-1.5b
cd vibevoice-1.5b
pip install -e .
docker pull vibevoice/vibevoice-1.5b:latest
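After a pip-based install, a quick way to confirm that the package landed in the active environment is to query its installed metadata with the standard library. The sketch below assumes the distribution is registered under the name `vibevoice`, as the pip command above suggests; a source checkout may register a different name, and the Docker image is not covered by this check.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# "vibevoice" is the assumed distribution name, taken from the pip command.
version = installed_version("vibevoice")
if version is None:
    print("vibevoice is not installed in this environment")
else:
    print(f"vibevoice {version} is installed")
```

This only checks that the package metadata is present; it does not load the model or verify GPU availability.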
2025·02
Language pack expansion to 12 locales, MOS/WER benchmark refresh, compatibility links to 7B/Large upgrade toolkit, and updated structured data.
2024·09
Stability improvements for 90-minute inference, PyPI installer, and online demo parity.
2024·01
Apache 2.0 licensing, 50+ voices, 8 languages, GitHub/HuggingFace distribution.
VibeVoice 1.5B combines 1.5 billion parameters with 90-minute continuous synthesis, which sets it apart for long-form generation on consumer hardware with 4 GB of GPU memory.
The advanced neural architecture uses context-preserving algorithms that maintain voice consistency and semantic coherence over extended periods without quality degradation.
Yes, VibeVoice 1.5B is licensed under Apache 2.0, which allows commercial use, modification, and distribution without licensing fees, provided you follow the license's conditions such as retaining copyright and license notices.
VibeVoice 1.5B primarily supports Python with PyTorch integration. Additional bindings are available for JavaScript, C++, and Go through community contributions.
The model receives regular updates with performance improvements, bug fixes, and new features. Major version updates are released quarterly with significant enhancements.
Support is available through GitHub issues, the Discord community, and the documentation. Enterprise users can access priority support through Microsoft's technical assistance programs.