Fish Speech is a multilingual text-to-speech system built on a dual-autoregressive (Dual-AR) architecture that pairs a neural audio codec with a large language model to produce natural, expressive speech. Trained on over 10 million hours of audio across 80+ languages, it uses a codec based on residual vector quantization (RVQ) with 10 codebooks at a frame rate of about 21 Hz, and it replaces traditional grapheme-to-phoneme conversion with LLM-based linguistic feature extraction. The result is more fluid cross-lingual handling and strong measured quality: on the Seed-TTS benchmarks, Fish Speech achieves a lower word error rate (WER) than closed-source competitors.
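To make the codec figures concrete, the sketch below estimates how many codec frames and codebook entries a clip of a given length would occupy. It assumes exactly 21 frames per second and 10 codes per frame; since the stated frame rate is only approximate, the counts are estimates. The function name `codec_token_budget` is hypothetical and not part of the Fish Speech codebase.

```python
import math

NUM_CODEBOOKS = 10     # codebooks per frame, as stated in the text
FRAME_RATE_HZ = 21.0   # approximate frame rate, as stated in the text


def codec_token_budget(duration_s: float,
                       frame_rate_hz: float = FRAME_RATE_HZ,
                       num_codebooks: int = NUM_CODEBOOKS) -> dict:
    """Estimate codec frames and total codes for an audio clip.

    Hypothetical helper for illustration only.
    """
    if duration_s < 0:
        raise ValueError("duration_s must be non-negative")
    # Round up so that a partial trailing frame still counts as a frame.
    frames = math.ceil(duration_s * frame_rate_hz)
    return {"frames": frames, "codes": frames * num_codebooks}


print(codec_token_budget(10))  # {'frames': 210, 'codes': 2100}
print(codec_token_budget(30))  # {'frames': 630, 'codes': 6300}
```

Under these assumptions, each second of audio corresponds to roughly 210 codes, which is the sequence length the language model side has to account for.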
Voice cloning in Fish Speech is fast and practical: a 10 to 30 second reference sample captures speaker timbre, prosody, and emotional nuance without fine-tuning. The Dual-AR design stabilizes generation of the codes produced by Grouped Finite Scalar Quantization (GFSQ), which improves inference speed and reduces artifacts. Fish S2 Pro, the latest version, adds reinforcement learning alignment to further improve naturalness. Whether the target language is Japanese, Cantonese, or English, the system adapts without retraining.
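For readers unfamiliar with finite scalar quantization, the following is a minimal sketch of the general idea applied per group: each latent dimension is squashed into a bounded range, rounded to one of a few integer levels, and the digits of each group are packed into a single code index. The group count, level counts, and function names are illustrative assumptions; this is not Fish Speech's codec code or configuration.

```python
import math
from typing import List, Sequence, Tuple


def fsq_quantize(z: Sequence[float], levels: Sequence[int]) -> Tuple[List[int], int]:
    """Quantize one group of latent values with finite scalar quantization.

    Returns the per-dimension digits (each in 0..L-1) and a single index
    formed by reading the digits as a mixed-radix number.
    Odd level counts are assumed to keep the rounding grid symmetric.
    """
    if len(z) != len(levels):
        raise ValueError("z and levels must have the same length")
    digits = []
    for value, num_levels in zip(z, levels):
        if num_levels % 2 == 0:
            raise ValueError("this sketch only supports odd level counts")
        half = (num_levels - 1) // 2
        bounded = math.tanh(value) * half   # squash into (-half, half)
        digits.append(round(bounded) + half)  # shift to 0..L-1
    index = 0
    for digit, num_levels in zip(digits, levels):
        index = index * num_levels + digit
    return digits, index


def gfsq_encode(latent: Sequence[float], num_groups: int,
                levels: Sequence[int]) -> List[int]:
    """Split a latent vector into equal groups and quantize each one."""
    group_size = len(levels)
    if group_size * num_groups != len(latent):
        raise ValueError("latent length must equal num_groups * len(levels)")
    return [
        fsq_quantize(latent[g * group_size:(g + 1) * group_size], levels)[1]
        for g in range(num_groups)
    ]


# Two groups of three dimensions, five levels each (125 codes per group).
print(gfsq_encode([0.0] * 6, num_groups=2, levels=(5, 5, 5)))  # [62, 62]
```

Because each group yields its own index, one frame is represented by several small codes rather than one code from a very large codebook, which is the property the grouped design relies on.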
Fish Speech targets researchers, content creators, and developers building multilingual voice applications. The open-source codebase supports GPU acceleration (NVIDIA, AMD via ZLUDA, Apple Metal) and runs on modest hardware. The ecosystem includes preprocessing tools (VAD, speaker segmentation, ASR labeling) and a WebUI for training custom voices, lowering barriers for teams building voice cloning pipelines outside the proprietary API ecosystem.
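As a rough illustration of how an application might choose among the accelerators mentioned above, the sketch below picks a device string, assuming PyTorch is the underlying framework. The helper name `pick_device` is hypothetical, and Fish Speech's own device-selection options are not shown here. It is also an assumption that CUDA-compatible layers such as ZLUDA are reported through the regular CUDA check.

```python
def pick_device() -> str:
    """Return "cuda", "mps", or "cpu" depending on what is available.

    Hypothetical helper: falls back to "cpu" when PyTorch is not installed.
    """
    try:
        import torch
    except ImportError:
        return "cpu"
    # NVIDIA GPUs, and (by assumption) CUDA-compatible layers, report here.
    if torch.cuda.is_available():
        return "cuda"
    # Apple Metal is exposed through the MPS backend in recent PyTorch builds.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


print(pick_device())
```

Checking CUDA first and falling back to CPU keeps the same script usable on both a workstation GPU and a laptop without one.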
