Fish Speech

Multilingual emotional text-to-speech with 80+ language support

Open Source

Fish Speech is an open-source text-to-speech system supporting 80+ languages with emotional expression, zero-shot voice cloning, and real-time streaming. It generates natural speech with controllable emotions, speaking styles, and prosody. It features a web interface, an API server, and integration with AI agent frameworks for voice-enabled applications. The project has over 29,000 GitHub stars.

Fish Speech is a multilingual text-to-speech system built on a dual-autoregressive (Dual-AR) architecture that combines neural codecs with large language models for natural, expressive voice synthesis. Trained on over 10 million hours of audio across 80+ languages, it uses an RVQ-based codec (10 codebooks, ~21 Hz frame rate) and replaces traditional grapheme-to-phoneme conversion with LLM-based linguistic feature extraction. The result is more fluid cross-lingual handling and state-of-the-art quality: on Seed-TTS benchmarks, Fish Speech achieves a lower word error rate (WER) than closed-source competitors.
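To make the codec figures concrete, the sketch below estimates how many codec frames and discrete codes a clip of a given length would produce. It assumes exactly 10 codebooks and a 21 Hz frame rate (the description says "~21 Hz", so real counts will differ slightly). The function name and constants are illustrative and are not part of the Fish Speech codebase.

```python
import math

NUM_CODEBOOKS = 10    # from the description above
FRAME_RATE_HZ = 21.0  # approximation; the description says "~21 Hz"


def codec_token_budget(duration_s: float,
                       frame_rate_hz: float = FRAME_RATE_HZ,
                       num_codebooks: int = NUM_CODEBOOKS) -> tuple[int, int]:
    """Return (frames, total_codes) for a clip lasting duration_s seconds.

    Each codec frame carries one code per codebook, so the total number
    of discrete codes is frames * num_codebooks.
    """
    if duration_s < 0:
        raise ValueError("duration_s must be non-negative")
    frames = math.ceil(duration_s * frame_rate_hz)
    return frames, frames * num_codebooks


if __name__ == "__main__":
    frames, codes = codec_token_budget(10.0)
    print(f"10 s of audio: about {frames} frames, {codes} codes")
```

Under these assumptions, one second of audio corresponds to roughly 21 frames and 210 codes, which gives a rough sense of the sequence lengths the Dual-AR model has to generate.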

Voice cloning in Fish Speech is fast and practical: a 10-30 second reference sample captures speaker timbre, prosody, and emotional nuance without fine-tuning. The Dual-AR design stabilizes codebook generation through Grouped Finite Scalar Quantization (GFSQ), improving inference speed and reducing artifacts. Fish S2 Pro, the latest version, applies reinforcement learning alignment to enhance naturalness further. Whether synthesizing content in Japanese, Cantonese, or English, the system adapts dynamically without retraining.
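Because cloning depends on a 10 to 30 second reference sample, a quick pre-check on clip length can catch unusable inputs early. The sketch below is a hypothetical helper, not a Fish Speech API: it reads a WAV file with Python's standard `wave` module and reports whether its duration falls inside that range. The bounds come from the text above; the function names are illustrative.

```python
import wave

MIN_REF_S = 10.0  # lower bound taken from the description above
MAX_REF_S = 30.0  # upper bound taken from the description above


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def is_usable_reference(path: str,
                        min_s: float = MIN_REF_S,
                        max_s: float = MAX_REF_S) -> bool:
    """Check that a reference clip's length lies within [min_s, max_s]."""
    return min_s <= wav_duration_seconds(path) <= max_s
```

This only validates length for uncompressed WAV input. Audio quality, background noise, and other formats would need separate handling.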

Fish Speech targets researchers, content creators, and developers building multilingual voice applications. The open-source codebase supports GPU acceleration (NVIDIA, AMD via ZLUDA, Apple Metal) and runs on modest hardware. The ecosystem includes preprocessing tools (VAD, speaker segmentation, ASR labeling) and a WebUI for training custom voices, lowering barriers for teams building voice cloning pipelines outside the proprietary API ecosystem.

Pricing

Free and open source; a hosted Fish Audio cloud API is also available

Platforms

Python, CUDA GPUs, API server, web UI

Alternatives

GPT-SoVITS

Open-source voice cloning and text-to-speech with few-shot learning

GPT-SoVITS is an open-source voice cloning and text-to-speech system that generates natural-sounding speech from just a few seconds of reference audio. It combines GPT-style language modeling with SoVITS voice synthesis for zero-shot and few-shot voice cloning across multiple languages. It supports Chinese, English, Japanese, Korean, and Cantonese, and has over 56,000 GitHub stars.

Open Source

Coqui TTS

Open-source deep learning text-to-speech toolkit

Coqui TTS is an open-source deep learning toolkit for text-to-speech synthesis, originally built by former Mozilla TTS engineers. It supports multi-speaker and multilingual synthesis, voice cloning from just six seconds of audio, and ships pre-trained models for 20+ languages. After Coqui shut down in 2023, the Idiap Research Institute forked the project and now actively maintains the fork. With 45K+ GitHub stars, it remains one of the most popular open-source TTS frameworks in Python.

Open Source

VoxCPM

Tokenizer-free multilingual TTS with voice cloning

VoxCPM is an open-source text-to-speech system from OpenBMB that generates continuous speech across 30 languages without traditional tokenization. Its 2B-parameter end-to-end diffusion architecture produces 48 kHz studio-quality audio with natural prosody and emotion. Key capabilities include voice design from text descriptions, few-shot voice cloning, and multilingual synthesis without language-specific modules. The Apache 2.0 project has 8,700 GitHub stars.

Open Source

Related Tools


Deep Lake

AI data runtime for multimodal datasets and vector search

Deep Lake is an open-source AI data runtime from Activeloop for storing, versioning, and querying multimodal data and embeddings. It fits teams building RAG, training, evaluation, or dataset-heavy agent workflows that need a bridge between vector search, structured metadata, and large image, text, audio, or video collections.

Open Source

SeekDB

AI-native state store with hybrid vector and full-text search

SeekDB is an open-source AI-native state store from the OceanBase ecosystem that combines MySQL-compatible data access with hybrid vector and full-text retrieval. It targets agent and AI application teams that need embedded or server deployment, copy-on-write style sandboxes, and searchable state without gluing together several separate storage layers.

Open Source

Marqo

Embedding-first search and discovery engine for AI-powered product experiences

Marqo is an open-source tensor search engine that combines embedding generation and vector search in a single API, removing the need to manage separate embedding pipelines and vector databases. Built for product discovery and multi-modal search, it lets teams index text, images, and structured data together, returning ranked results based on semantic similarity rather than keyword overlap.

freemium

Magika

AI-powered file-type detection at Google scale

Magika is an open-source, AI-powered file-type detection tool from Google. It uses a custom deep-learning model of only a few megabytes to identify more than 200 binary and textual content types in milliseconds, even on a single CPU. Magika ships as a CLI, Python package, JavaScript/TypeScript library, and an ONNX model, achieves around 99% accuracy on its test set, and is already used at Google scale across Gmail, Drive, and Safe Browsing as well as by VirusTotal and abuse.ch.

Free, Open Source

Zep

Context engineering platform for AI agents with temporal knowledge graphs

Zep is a context engineering platform that assembles relationship-aware context for AI agents from conversations, business data, documents, and events. It maintains a temporal knowledge graph that automatically extracts entities and relationships, tracking how context evolves over time. Zep delivers formatted context blocks optimized for LLMs with sub-200ms latency, integrating with LangChain, LlamaIndex, AutoGen, and Google ADK through Python, TypeScript, and Go SDKs.

freemium

Hindsight

Agent memory system that learns, not just remembers

Hindsight is an agent memory system that enables AI agents to learn from experience rather than just store conversations. It organizes memories into three biomimetic categories: World knowledge for facts, Experiences for agent events, and Mental Models for learned understanding. The system provides retain, recall, and reflect operations backed by a temporal knowledge graph with parallel retrieval strategies including semantic, keyword, graph traversal, and temporal search.

Freemium, Open Source