Mooncake: A Deep Dive into Moonshot AI's LLM Serving Platform
Mooncake, developed by Moonshot AI, is the robust serving platform behind Kimi, a leading Large Language Model (LLM) service. This platform employs an innovative, KVCache-centric disaggregated architecture designed to optimize LLM serving, particularly in long-context scenarios. Recently, Moonshot AI open-sourced the core component of Mooncake, the Transfer Engine, along with its technical report, offering valuable insights into its architecture and performance. This article explores the key features, components, and capabilities of Mooncake, highlighting its significance in the evolution of LLM infrastructure.
Understanding the Mooncake Architecture
Mooncake's architecture is built around the concept of disaggregation, separating the prefill and decoding clusters to make effective use of resources.
- KVCache-Centric Design: The architecture revolves around KVCache, efficiently managing and utilizing the underutilized CPU, DRAM, and SSD resources within the GPU cluster to create a disaggregated cache.
- Disaggregated Prefill and Decoding: By separating the prefill and decoding processes, Mooncake optimizes resource allocation and overall throughput.
- Prediction-Based Early Rejection Policy: To handle overloaded scenarios, Mooncake employs a prediction-based early rejection policy, ensuring adherence to Service Level Objectives (SLOs) even under high demand.
This innovative architecture allows Mooncake to excel in long-context scenarios, significantly boosting throughput while maintaining latency requirements.
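To make the early rejection idea concrete, here is a minimal sketch of a prediction-based admission check: estimate a request's time-to-first-token (TTFT) from the work already queued and reject it up front if the prediction would miss the SLO. This is an illustration of the general technique only; the class, its fields, and the linear throughput model are assumptions, not Mooncake's actual scheduler.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt_tokens: int


class EarlyRejectionPolicy:
    """Toy prediction-based early rejection: estimate TTFT from queued
    prefill work and reject requests predicted to violate the SLO."""

    def __init__(self, prefill_tokens_per_s: float, ttft_slo_s: float):
        self.prefill_tokens_per_s = prefill_tokens_per_s
        self.ttft_slo_s = ttft_slo_s
        self.queued_tokens = 0  # prefill work already admitted

    def predict_ttft(self, request: Request) -> float:
        # Predicted TTFT = (queued work + this prompt) / prefill throughput.
        return (self.queued_tokens + request.prompt_tokens) / self.prefill_tokens_per_s

    def admit(self, request: Request) -> bool:
        if self.predict_ttft(request) > self.ttft_slo_s:
            return False  # reject early, before wasting prefill compute
        self.queued_tokens += request.prompt_tokens
        return True
```

The key point is that rejection happens before any GPU time is spent on the request, which is what keeps SLOs intact under overload.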
Key Components of Mooncake
Mooncake comprises several critical components working in harmony to deliver high-performance LLM serving:
- Transfer Engine: This is the heart of Mooncake, facilitating rapid and reliable data transfer across various protocols, including TCP, RDMA (InfiniBand/RoCEv2/eRDMA/NVIDIA GPUDirect), and NVMe over Fabrics (NVMe-oF). It abstracts away hardware complexities, providing a unified interface for transferring data between DRAM, VRAM, and NVMe.
- P2P Store: Built on top of the Transfer Engine, P2P Store enables the sharing of temporary objects between nodes in a cluster. This is particularly useful for scenarios like checkpoint transfer, where data needs to be efficiently distributed across the network. It employs a decentralized architecture, utilizing the etcd service for global metadata management.
- vLLM Integration: Mooncake integrates with vLLM, enhancing its prefill-decode disaggregation capabilities. By leveraging RDMA devices, this integration significantly improves the efficiency of LLM inference.
The Power of the Transfer Engine
The Transfer Engine is a standout feature of Mooncake, offering high-performance data transfer with several key advantages:
- Efficient Use of Multiple RDMA NIC Devices: It supports the aggregation of transfer bandwidth by utilizing multiple RDMA NIC devices.
- Topology-Aware Path Selection: The engine intelligently selects optimal devices based on the location (NUMA affinity, etc.) of both source and destination.
- Robustness to Network Errors: In case of transmission failures, the Transfer Engine automatically attempts to use alternative paths for data delivery, ensuring reliability.
- Performance: Benchmarks show that the Transfer Engine can deliver impressive bandwidth. In tests using 40 GB of data, it achieved up to 87 GB/s in a 4x200 Gbps RoCE network and 190 GB/s in an 8x400 Gbps RoCE network, significantly outperforming the TCP protocol.
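The topology-aware selection and failover behaviors above can be sketched as follows. This is a simplified illustration under assumed data structures (a list of NIC dicts with a NUMA field), not the Transfer Engine's real C++ implementation or API: prefer NUMA-local devices, and fall through to alternative paths on transmission failure.

```python
def pick_paths(nics, buffer_numa_node):
    """Order candidate NICs: devices on the buffer's NUMA node first,
    then all remaining devices as fallback paths."""
    local = [n for n in nics if n["numa"] == buffer_numa_node]
    remote = [n for n in nics if n["numa"] != buffer_numa_node]
    return local + remote


def transfer(nics, buffer_numa_node, send):
    """Try each candidate path in order; return the name of the NIC
    that completed the transfer, retrying on the next path if one fails."""
    for nic in pick_paths(nics, buffer_numa_node):
        try:
            send(nic)
            return nic["name"]
        except IOError:
            continue  # this path failed; fall back to the next device
    raise IOError("all transfer paths failed")
```

Bandwidth aggregation across multiple NICs would additionally stripe one transfer over several of these paths at once; the ordering logic stays the same.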
P2P Store: Efficient Data Distribution
P2P Store addresses a critical need in distributed systems: efficient data distribution. Its decentralized architecture and reliance on the Transfer Engine provide several benefits:
- Decentralized Architecture: By leveraging a client-side architecture with global metadata managed by etcd, P2P Store avoids the bottlenecks associated with centralized data distribution.
- Efficient Data Distribution: It enhances the efficiency of large-scale data distribution by allowing replicated nodes to share data directly, alleviating the pressure on data providers.
- High Performance: Utilizing the Transfer Engine, P2P Store achieves full utilization of hardware bandwidth.
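The replication pattern described above can be illustrated with a small in-memory sketch, where a plain dict stands in for etcd and method calls stand in for Transfer Engine reads. All class and method names here are hypothetical: the point is that a node which fetches an object registers itself as a new replica, so later readers pull from it instead of the original provider.

```python
class MetadataDirectory:
    """Stands in for etcd: maps object keys to nodes holding a replica."""

    def __init__(self):
        self.replicas = {}

    def register(self, key, node_name):
        self.replicas.setdefault(key, []).append(node_name)

    def locate(self, key):
        return self.replicas.get(key, [])


class PeerNode:
    def __init__(self, name, directory, cluster):
        self.name, self.directory, self.cluster = name, directory, cluster
        self.objects = {}
        cluster[name] = self

    def put(self, key, data):
        self.objects[key] = data
        self.directory.register(key, self.name)

    def get(self, key):
        if key in self.objects:
            return self.objects[key]
        # Fetch directly from any existing replica (stand-in for a
        # Transfer Engine read), then register as a replica ourselves.
        source = self.cluster[self.directory.locate(key)[0]]
        data = source.objects[key]
        self.put(key, data)  # future readers can now pull from this node
        return data
```

As replicas multiply, load on the original data provider drops, which is what makes this pattern effective for distributing large checkpoints.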
vLLM Integration for Optimized Inference
The integration of Mooncake with vLLM enhances the efficiency of Large Language Model (LLM) inference by enabling the prefill phase to run in a separate process from the decode phase. By default, vLLM uses NCCL and Gloo as the transport layer, but these are inefficient for decoupling the two phases across different machines. Mooncake's Transfer Engine is built precisely for this kind of cross-machine KVCache transfer.
Practical Applications and Showcases
Mooncake's components are not just theoretical; they are designed for practical implementation to solve real-world problems. Moonshot AI provides several practical examples and guides detailing the usage of Mooncake's components.
- Transfer Engine Standalone: Demonstrates how to transfer data between DRAM, VRAM, or NVMe while keeping hardware-specific details hidden behind the unified interface (Transfer Engine Standalone Guide).
- P2P Store Guide: Demonstrates how to use the P2P Store library for sharing temporary objects between nodes in a cluster (P2P Store Guide).
- vLLM Integration Guide: Details how to integrate Mooncake's Transfer Engine into vLLM, enhancing its prefill-decode disaggregation capabilities (vLLM Integration Guide).
Conclusion
Mooncake represents a significant advancement in the infrastructure supporting Large Language Models. Its KVCache-centric disaggregated architecture, combined with the powerful Transfer Engine and P2P Store, offers a robust and efficient platform for LLM serving. As Moonshot AI continues to develop and open-source components of Mooncake, it provides valuable tools and insights for the broader AI community, fostering further innovation in LLM infrastructure.