VibeVoice

Microsoft's open-source frontier voice AI for long-form multi-speaker audio

Open Source

VibeVoice is Microsoft's open-source voice AI family, covering both text-to-speech (TTS) and speech recognition (ASR). The TTS model generates up to 90 minutes of expressive multi-speaker audio with up to 4 distinct voices. VibeVoice-ASR transcribes 60-minute recordings in a single pass with speaker identification and timestamps. The family is built on continuous speech tokenizers running at 7.5 Hz and on next-token diffusion, and is reported to compress audio 80x more efficiently than Encodec while preserving fidelity.

VibeVoice departs from conventional speech-synthesis architectures. Traditional TTS systems struggle with long-form audio because high token rates create computational bottlenecks, and they typically handle only one or two speakers with limited emotional range. VibeVoice addresses both problems with continuous speech tokenizers operating at an ultra-low frame rate of approximately 7.5 Hz, which shortens the audio sequence dramatically while maintaining acoustic fidelity. A large language model based on Qwen 2.5 interprets textual context and dialogue flow, and a diffusion head generates the high-fidelity acoustic details. The result is natural conversational audio with proper turn-taking, emotional nuance, and consistent speaker identity across long sequences.
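To make the frame-rate argument concrete, here is a rough back-of-the-envelope sketch in Python. The 7.5 Hz rate and the 90-minute length come from the description above. The Encodec baseline (75 frames per second with 8 codebooks, so 600 tokens per second) is an assumption chosen for illustration, not a figure stated on this page, and `frames_for` is a hypothetical helper.

```python
# Back-of-the-envelope sequence lengths for long-form audio.
FRAME_RATE_HZ = 7.5   # tokenizer frame rate quoted in the description
MINUTES = 90          # maximum generation length quoted in the description


def frames_for(minutes: float, units_per_second: float) -> int:
    """Number of sequence positions needed to cover `minutes` of audio."""
    return round(minutes * 60 * units_per_second)


vibevoice_frames = frames_for(MINUTES, FRAME_RATE_HZ)

# ASSUMPTION: baseline codec at 75 frames/s with 8 codebooks per frame.
ASSUMED_ENCODEC_TOKENS_PER_SECOND = 75 * 8
encodec_tokens = frames_for(MINUTES, ASSUMED_ENCODEC_TOKENS_PER_SECOND)

ratio = encodec_tokens / vibevoice_frames
print(vibevoice_frames, encodec_tokens, ratio)
```

Under these assumptions, 90 minutes of audio needs 40,500 positions at 7.5 Hz versus 3,240,000 discrete tokens for the assumed baseline, a ratio of 80. That lines up with the 80x figure quoted for Encodec, though the page does not say which Encodec configuration the comparison uses.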

The family includes three main components. VibeVoice-1.5B is the flagship TTS model, whose underlying work was accepted as an Oral at ICLR 2026; it generates up to 90 minutes of multi-speaker conversational audio. VibeVoice-Realtime-0.5B is a lightweight variant for streaming TTS with approximately 200 ms latency, designed for real-time services and LLM voice output. VibeVoice-ASR handles speech-to-text for 60-minute recordings, producing structured transcriptions with speaker identification, timestamps, and custom hotword support. The ASR component was integrated into Hugging Face Transformers in March 2026. All models support English and Chinese natively, with experimental multilingual capabilities in nine additional languages including German, French, Japanese, and Korean.
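As an illustration of what a structured transcription with speaker identification and timestamps might look like to downstream code, here is a minimal sketch. The `Segment` class, its field names, and the sample data are all hypothetical; they are not the actual VibeVoice-ASR output schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Segment:
    """HYPOTHETICAL transcript segment; not the real VibeVoice-ASR schema."""
    speaker: str
    start_s: float
    end_s: float
    text: str


def talk_time_by_speaker(segments: list[Segment]) -> dict[str, float]:
    """Sum the duration of each speaker's segments, in seconds."""
    totals: dict[str, float] = defaultdict(float)
    for seg in segments:
        totals[seg.speaker] += seg.end_s - seg.start_s
    return dict(totals)


# Made-up sample data for illustration only.
segments = [
    Segment("Speaker 1", 0.0, 4.5, "Welcome back to the show."),
    Segment("Speaker 2", 4.5, 7.0, "Thanks for having me."),
    Segment("Speaker 1", 7.0, 12.0, "Let's get started."),
]
totals = talk_time_by_speaker(segments)
print(totals)
```

Keeping speaker labels and timestamps on each segment is what makes simple analyses like per-speaker talk time possible without re-processing the audio.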

VibeVoice is released under the MIT license and is available through GitHub documentation and Hugging Face model cards. Built-in safety measures include audible AI-generated disclaimers and imperceptible watermarks for provenance verification. Microsoft positions the project for research and development; the current repository notes that the TTS code was removed for responsible-use reasons, although public model cards remain available. With about 49K GitHub stars, VibeVoice is especially relevant for developers evaluating voice agents, podcast generation, and voice-enabled AI applications.

Pricing

Free and open-source (MIT license); self-hosted only

Platforms

Python/PyTorch, Hugging Face model cards, Colab/Transformers demos; GPU recommended

Alternatives

TimesFM

Google's pretrained foundation model for zero-shot time-series forecasting

TimesFM is a pretrained time-series foundation model from Google Research that performs zero-shot forecasting on diverse datasets without task-specific training. It handles univariate and multivariate time series across domains including finance, logistics, energy, and infrastructure monitoring with accuracy competitive against traditional statistical methods like ARIMA and Prophet.

Open Source

PrismML Bonsai

First commercially viable 1-bit LLMs that are 14x smaller and 8x faster

PrismML Bonsai delivers the first commercially viable 1-bit large language models with 8B, 4B, and 1.7B parameter variants. The 8B model runs in just 1GB of RAM versus 16GB for standard FP16 models, achieving 44 tokens per second on iPhone. Backed by $16.25M from Khosla Ventures and released under Apache 2.0, Bonsai makes capable LLMs practical for edge devices and resource-constrained environments.

Open Source

verl

Production-grade reinforcement learning framework for LLM training

verl is an open-source reinforcement learning framework designed specifically for training and aligning large language models. Built for production use with support for distributed training across multiple GPUs and nodes, it implements RLHF, DPO, and other alignment algorithms that make LLMs follow instructions, avoid harmful outputs, and generate higher quality responses. Over 580 contributors and 20,000 GitHub stars signal strong adoption.

Open Source

Chatterbox

State-of-the-art open-source text-to-speech with emotion control

Chatterbox is an open-source text-to-speech model by Resemble AI that delivers state-of-the-art voice synthesis with fine-grained emotion and style control. The model supports zero-shot voice cloning from short audio samples, produces natural-sounding speech across multiple speaking styles, and runs locally without cloud dependencies. With over 24,000 GitHub stars, it has become the leading open-source alternative to commercial TTS services for developers building voice-enabled AI applications.

Open Source

Related Tools

Deep Lake

AI data runtime for multimodal datasets and vector search

Deep Lake is an open-source AI data runtime from Activeloop for storing, versioning, and querying multimodal data and embeddings. It fits teams building RAG, training, evaluation, or dataset-heavy agent workflows that need a bridge between vector search, structured metadata, and large image, text, audio, or video collections.

Open Source

SeekDB

AI-native state store with hybrid vector and full-text search

SeekDB is an open-source AI-native state store from the OceanBase ecosystem that combines MySQL-compatible data access with hybrid vector and full-text retrieval. It targets agent and AI application teams that need embedded or server deployment, copy-on-write style sandboxes, and searchable state without gluing together several separate storage layers.

Open Source

Marqo

Embedding-first search and discovery engine for AI-powered product experiences

Marqo is an open-source tensor search engine that combines embedding generation and vector search in a single API, removing the need to manage separate embedding pipelines and vector databases. Built for product discovery and multi-modal search, it lets teams index text, images, and structured data together, returning ranked results based on semantic similarity rather than keyword overlap.

Freemium

Magika

AI-powered file-type detection at Google scale

Magika is an open-source, AI-powered file-type detection tool from Google. It uses a custom deep-learning model of only a few megabytes to identify more than 200 binary and textual content types in milliseconds, even on a single CPU. Magika ships as a CLI, Python package, JavaScript/TypeScript library, and an ONNX model, achieves around 99% accuracy on its test set, and is already used at Google scale across Gmail, Drive, and Safe Browsing as well as by VirusTotal and abuse.ch.

Free, Open Source

Zep

Context engineering platform for AI agents with temporal knowledge graphs

Zep is a context engineering platform that assembles relationship-aware context for AI agents from conversations, business data, documents, and events. It maintains a temporal knowledge graph that automatically extracts entities and relationships, tracking how context evolves over time. Zep delivers formatted context blocks optimized for LLMs with sub-200ms latency, integrating with LangChain, LlamaIndex, AutoGen, and Google ADK through Python, TypeScript, and Go SDKs.

Freemium

Hindsight

Agent memory system that learns, not just remembers

Hindsight is an agent memory system that enables AI agents to learn from experience rather than just store conversations. It organizes memories into three biomimetic categories: World knowledge for facts, Experiences for agent events, and Mental Models for learned understanding. The system provides retain, recall, and reflect operations backed by a temporal knowledge graph with parallel retrieval strategies including semantic, keyword, graph traversal, and temporal search.

Freemium, Open Source
