This stack assembles the best open-source tools from China's AI ecosystem for teams building applications on Chinese model families. ms-swift by ModelScope provides the most comprehensive fine-tuning framework with support for 600+ models and native integration with both ModelScope Hub and Hugging Face, making it the ideal training tool for Qwen, ChatGLM, DeepSeek, and other Chinese model architectures.
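As a concrete starting point, supervised fine-tuning in ms-swift typically consumes chat-formatted JSONL data. The schema below is a common chat format, but the exact field names accepted by your ms-swift version should be checked against its documentation; the validation helper is purely illustrative:

```python
import json

# Sketch: chat-style JSONL training data for supervised fine-tuning.
# The "messages" schema is a widely used chat format; verify the exact
# schema ms-swift expects for your version (assumption).
samples = [
    {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is the capital of France?"},
            {"role": "assistant", "content": "The capital of France is Paris."},
        ]
    },
]

ROLES = {"system", "user", "assistant"}

def validate(sample: dict) -> bool:
    """Check that a sample is a well-formed chat ending in an assistant turn."""
    msgs = sample.get("messages", [])
    return (
        len(msgs) >= 2
        and all(m.get("role") in ROLES and isinstance(m.get("content"), str) for m in msgs)
        and msgs[-1]["role"] == "assistant"  # the final turn is the training target
    )

with open("train.jsonl", "w", encoding="utf-8") as f:
    for s in samples:
        assert validate(s)
        f.write(json.dumps(s, ensure_ascii=False) + "\n")
```

A file like this is then passed to the `swift sft` CLI; flag names vary between ms-swift releases, so consult the docs for the invocation matching your installed version.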
Xinference serves as the local inference engine, running LLMs, embedding models, and audio models through an OpenAI-compatible API with a web dashboard for model management. Its support for vLLM, llama.cpp, and transformers backends provides flexibility to optimize for throughput or compatibility depending on the deployment scenario.
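Because the API is OpenAI-compatible, any OpenAI client works against Xinference by changing the base URL. The sketch below builds the raw chat-completion request by hand to show the wire format; the default port (9997) is configurable, and the model name is whatever you launched from the Xinference dashboard:

```python
import json
from urllib.request import Request

# Sketch: build the OpenAI-style chat-completion request Xinference
# accepts. Default endpoint http://localhost:9997/v1 and the model name
# "qwen2.5-instruct" are assumptions for illustration.
BASE_URL = "http://localhost:9997/v1"

def chat_request(model: str, user_msg: str) -> Request:
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_msg}],
        "temperature": 0.7,
    }
    return Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = chat_request("qwen2.5-instruct", "Hello")
```

In practice most teams simply point the official `openai` Python client at the same base URL rather than constructing requests manually.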
Qwen-Agent provides the agent framework optimized specifically for Qwen models, leveraging native function calling, code interpretation, and multimodal capabilities that generic agent frameworks do not exploit as effectively. Built-in tools for web browsing, code execution, and RAG are tuned to Qwen's output format and Chinese-language strengths.
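Qwen's native function calling consumes OpenAI-style tool schemas; Qwen-Agent wires these up for you, but the underlying JSON and the dispatch loop look like this sketch. The tool name, its fields, and the handler are illustrative, not part of either library:

```python
import json

# Sketch: an OpenAI-style tool schema as consumed by Qwen's native
# function calling. The get_weather tool is a hypothetical example.
get_weather = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

def dispatch(tool_call: dict) -> str:
    """Route a model-emitted tool call to a local Python handler."""
    handlers = {"get_weather": lambda city: f"Sunny in {city}"}  # stub handler
    name = tool_call["function"]["name"]
    args = json.loads(tool_call["function"]["arguments"])  # model emits JSON text
    return handlers[name](**args)

# A tool call shaped like what the model emits in an assistant message:
call = {"function": {"name": "get_weather", "arguments": '{"city": "Beijing"}'}}
result = dispatch(call)  # "Sunny in Beijing"
```

Qwen-Agent's value is that it runs this schema-emit-dispatch loop (plus retries and tool result formatting) for you, tuned to how Qwen models serialize tool calls.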
FlashMLA from DeepSeek contributes the optimized attention kernels that make serving MLA-based models like DeepSeek-V2 and V3 efficient on NVIDIA GPUs. The latent attention compression reduces KV-cache memory requirements, directly increasing the number of concurrent requests each GPU can handle in production serving.
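A back-of-envelope calculation shows why the latent compression matters for serving. The configuration numbers below are illustrative (loosely DeepSeek-V2-like), not exact model configs:

```python
# Sketch: per-token KV-cache cost, standard multi-head attention vs MLA.
# All dimensions are illustrative assumptions, not published configs.
BYTES = 2          # fp16/bf16
LAYERS = 60
HEADS = 128
HEAD_DIM = 128
D_LATENT = 512     # MLA compressed KV latent dimension (assumption)
D_ROPE = 64        # decoupled RoPE key dimension (assumption)

def mha_kv_bytes_per_token() -> int:
    # Standard attention caches full K and V for every head in every layer.
    return 2 * HEADS * HEAD_DIM * LAYERS * BYTES

def mla_kv_bytes_per_token() -> int:
    # MLA caches one shared latent vector plus a small RoPE key per layer.
    return (D_LATENT + D_ROPE) * LAYERS * BYTES

ratio = mha_kv_bytes_per_token() / mla_kv_bytes_per_token()
# With these numbers the cache shrinks by roughly 57x per token, which is
# why each GPU can hold many more concurrent sequences.
```

FlashMLA's kernels make attention over this compressed cache fast on NVIDIA hardware; the memory saving itself comes from the MLA architecture.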
GPT-SoVITS adds voice capabilities with few-shot voice cloning from seconds of reference audio. The system generates natural Chinese speech with the speaker's characteristics preserved, enabling voice-enabled AI applications, content creation tools, and accessibility features for applications serving Chinese-speaking users.
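Few-shot cloning depends on a short, clean reference clip. A minimal pre-flight check might look like the sketch below; the 3-10 second window is a commonly cited guideline for GPT-SoVITS reference audio (verify against the project docs), and the synthetic tone exists only so the example runs without real recordings:

```python
import math
import struct
import wave

# Sketch: validate reference-audio length before attempting voice cloning.
# The 3-10 s bounds are a guideline assumption, not a hard project limit.
def ref_audio_seconds(path: str) -> float:
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()

def usable_reference(path: str, lo: float = 3.0, hi: float = 10.0) -> bool:
    return lo <= ref_audio_seconds(path) <= hi

# Demo: write a 5-second 440 Hz mono 16 kHz clip so the check is runnable.
rate, secs = 16000, 5
frames = b"".join(
    struct.pack("<h", int(8000 * math.sin(2 * math.pi * 440 * t / rate)))
    for t in range(rate * secs)
)
with wave.open("ref.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)   # 16-bit samples
    w.setframerate(rate)
    w.writeframes(frames)

ok = usable_reference("ref.wav")  # True for the 5 s demo clip
```

Checks like this are cheap insurance in a pipeline: rejecting an unusable reference clip up front is faster than diagnosing a poor-quality cloned voice afterward.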
The entire stack is open source, with all components available under permissive licenses. Teams typically start with inference through Xinference, add fine-tuning through ms-swift when customization is needed, build interactive applications with Qwen-Agent, and layer in speech through GPT-SoVITS for voice-enabled experiences.