Mooncake: A Serving Platform for Kimi, a Leading LLM Service
Overview
Mooncake is a serving platform developed by Moonshot AI for Kimi, a leading Large Language Model (LLM) service. The platform provides a KVCache-centric disaggregated architecture that separates the prefill and decoding clusters, so the KVCache produced during prefill is transferred to the decoding nodes over the network. The core component of Mooncake is its KVCache-centric scheduler, which aims to maximize overall throughput while meeting latency-related Service Level Objectives (SLOs).
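As a rough illustration of how a KVCache-centric scheduler might weigh cache reuse against queueing delay under a latency SLO, the sketch below picks a prefill instance for one request. The names (PrefillInstance, estimate_ttft, pick_prefill_instance), the linear cost model, and the default per-token cost are all assumptions made for this example; they are not Mooncake's actual scheduler or API.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class PrefillInstance:
    name: str
    queue_delay_s: float        # estimated wait before this request starts
    cached_prefix_tokens: int   # prompt tokens whose KVCache is already local


def estimate_ttft(inst: PrefillInstance, prompt_tokens: int,
                  prefill_s_per_token: float) -> float:
    """Estimated time to first token: queueing plus prefill of uncached tokens.

    Assumption: prefill cost grows linearly with the number of uncached tokens.
    """
    uncached = max(prompt_tokens - inst.cached_prefix_tokens, 0)
    return inst.queue_delay_s + uncached * prefill_s_per_token


def pick_prefill_instance(instances: Sequence[PrefillInstance],
                          prompt_tokens: int,
                          ttft_slo_s: float,
                          prefill_s_per_token: float = 0.0005,
                          ) -> Optional[PrefillInstance]:
    """Return the instance with the lowest estimated TTFT, or None when no
    instance can meet the SLO (the caller could then reject the request)."""
    if not instances:
        return None
    best = min(instances,
               key=lambda i: estimate_ttft(i, prompt_tokens, prefill_s_per_token))
    if estimate_ttft(best, prompt_tokens, prefill_s_per_token) > ttft_slo_s:
        return None
    return best
```

In this toy model, a busier instance that already holds the prompt's prefix can still win over an idle one, because skipping the prefill of cached tokens outweighs the extra queueing delay.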
Components
The Mooncake architecture consists of several key components:
- Transfer Engine: A high-performance data transfer framework that supports rapid, reliable, and flexible data transfer over various protocols, including TCP, RDMA, NVIDIA GPUDirect-based RDMA, and NVMe over Fabrics (NVMe-oF).
- P2P Store: A library that enables sharing temporary objects among nodes in a cluster, avoiding bandwidth saturation and reducing pressure on CPUs and RDMA NICs.
- Mooncake Store: A planned component that will support pooled KVCache for more flexible P/D disaggregation.
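To make the P2P Store idea more concrete, here is a minimal in-memory sketch of how peer-to-peer sharing can spread read load: each reader pulls from the holder that has served the fewest reads so far, and then becomes a holder itself. The ReplicaDirectory class and its register and get_replica methods are hypothetical names chosen for this example, not the library's real interface, and the actual data transfer is omitted.

```python
from collections import defaultdict


class ReplicaDirectory:
    """Tracks which nodes hold a copy of each object (hypothetical sketch)."""

    def __init__(self) -> None:
        self._holders = defaultdict(set)       # object key -> node names
        self._reads_served = defaultdict(int)  # node name -> reads served

    def register(self, key: str, node: str) -> None:
        """Record that `node` holds a copy of `key`."""
        self._holders[key].add(node)

    def get_replica(self, key: str, reader: str) -> str:
        """Pick the holder that has served the fewest reads, then record the
        reader as a new holder so later readers can pull from it as well."""
        holders = self._holders.get(key)
        if not holders:
            raise KeyError(f"no node holds {key!r}")
        # Sort first so ties are broken deterministically by node name.
        source = min(sorted(holders), key=lambda n: self._reads_served[n])
        self._reads_served[source] += 1  # the data transfer would happen here
        self._holders[key].add(reader)
        return source
```

With one initial holder, successive readers are served by different nodes, so no single NIC has to carry every copy.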
Showcases
- Transfer Engine
  - Highlights: Efficient use of multiple RDMA NIC devices, topology-aware path selection, and robustness against temporary network errors.
  - Performance: The Transfer Engine delivers up to 87 GB/s and 190 GB/s of bandwidth in 4×200 Gbps and 8×400 Gbps RoCE networks, respectively.
- P2P Store
  - Highlights: Decentralized architecture, efficient data distribution, and reduced pressure on CPUs and RDMA NICs.
  - Performance: The P2P Store can distribute objects with full utilization of the hardware's incoming bandwidth, achieving throughput of up to 3.1 GB/s.
- vLLM Integration
  - Highlights: Optimization of LLM inference through disaggregated prefilling, simpler interfaces, and more efficient use of RDMA devices.
  - Performance: The vLLM integration with Transfer Engine achieves a mean TTFT (time to first token) up to 25% lower than traditional TCP-based transports.
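To put the Transfer Engine bandwidth figures in context, the short calculation below converts each network's aggregate line rate from Gbps to GB/s and compares it with the reported peak. It assumes decimal units (8 Gbps = 1 GB/s) and that the reported figures are directly comparable to the summed one-way line rate of all NICs. The source does not describe the measurement setup, so the percentages are only a rough reading.

```python
def line_rate_gb_per_s(nics: int, gbps_per_nic: float) -> float:
    """Aggregate raw line rate in GB/s (8 bits per byte, decimal units)."""
    return nics * gbps_per_nic / 8


def utilization(measured_gb_per_s: float, nics: int, gbps_per_nic: float) -> float:
    """Fraction of the aggregate line rate reached by the measured bandwidth."""
    return measured_gb_per_s / line_rate_gb_per_s(nics, gbps_per_nic)


for label, measured, nics, gbps in [("4x200 Gbps", 87, 4, 200),
                                    ("8x400 Gbps", 190, 8, 400)]:
    print(f"{label}: {line_rate_gb_per_s(nics, gbps):.0f} GB/s line rate, "
          f"{utilization(measured, nics, gbps):.1%} utilized")
# 4x200 Gbps: 100 GB/s line rate, 87.0% utilized
# 8x400 Gbps: 400 GB/s line rate, 47.5% utilized
```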
Update
As of December 16, 2024, the latest version of the vLLM integration (Guide v0.2) is available and is based on vLLM's main branch. The update improves topology-aware path selection and multi-card bandwidth aggregation, resulting in lower TTFT values.
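The two ideas named in this update can be sketched together: prefer NICs that sit on the same NUMA node as the buffer, then split one transfer into slices spread across those NICs so their bandwidth adds up. This is a simplified illustration under stated assumptions. The Nic type, the NUMA-only notion of topology, the round-robin assignment, the 64 KiB default slice size, and the device names in the usage note are invented for the example and do not describe the Transfer Engine's real data structures.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Nic:
    name: str
    numa_node: int


def preferred_nics(nics: Sequence[Nic], buffer_numa_node: int) -> List[Nic]:
    """NICs on the buffer's NUMA node if any exist, otherwise all NICs."""
    local = [n for n in nics if n.numa_node == buffer_numa_node]
    return local or list(nics)


def plan_transfer(nics: Sequence[Nic], buffer_numa_node: int,
                  length: int, slice_bytes: int = 64 * 1024,
                  ) -> List[Tuple[str, int, int]]:
    """Split [0, length) into slices and assign them round-robin to the
    preferred NICs. Returns (nic_name, offset, size) tuples."""
    if slice_bytes <= 0:
        raise ValueError("slice_bytes must be positive")
    candidates = preferred_nics(nics, buffer_numa_node)
    if not candidates:
        raise ValueError("no NICs available")
    plan = []
    for i, offset in enumerate(range(0, length, slice_bytes)):
        nic = candidates[i % len(candidates)]
        plan.append((nic.name, offset, min(slice_bytes, length - offset)))
    return plan
```

For example, with two NICs on NUMA node 0 and one on node 1, a 150-byte buffer on node 0 split into 64-byte slices alternates between the two local NICs and never touches the remote one:

```python
nics = [Nic("mlx5_0", 0), Nic("mlx5_1", 0), Nic("mlx5_2", 1)]
print(plan_transfer(nics, buffer_numa_node=0, length=150, slice_bytes=64))
# [('mlx5_0', 0, 64), ('mlx5_1', 64, 64), ('mlx5_0', 128, 22)]
```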
Conclusion
Mooncake is a serving platform for Kimi, a leading LLM service, built around a KVCache-centric disaggregated architecture. Its reported results cover each component: the Transfer Engine reaches up to 190 GB/s on an 8×400 Gbps RoCE network, the P2P Store distributes objects at up to 3.1 GB/s, and the vLLM integration lowers mean TTFT by up to 25% compared with TCP-based transports.