Mooncake: A Deep Dive into Moonshot AI's LLM Serving Platform
Mooncake, developed by Moonshot AI, is the robust serving platform behind Kimi, a leading Large Language Model (LLM) service. This platform employs an innovative, KVCache-centric disaggregated architecture designed to optimize LLM serving, particularly in long-context scenarios. Recently, Moonshot AI open-sourced the core component of Mooncake, the Transfer Engine, along with its technical report, offering valuable insights into its architecture and performance. This article explores the key features, components, and capabilities of Mooncake, highlighting its significance in the evolution of LLM infrastructure.
Understanding the Mooncake Architecture
Mooncake's architecture is built around the concept of disaggregation, separating the prefill and decoding clusters to make effective use of resources.
- KVCache-Centric Design: The architecture revolves around KVCache, efficiently managing and utilizing the underutilized CPU, DRAM, and SSD resources within the GPU cluster to create a disaggregated cache.
- Disaggregated Prefill and Decoding: By separating the prefill and decoding processes, Mooncake optimizes resource allocation and overall throughput.
- Prediction-Based Early Rejection Policy: To handle overloaded scenarios, Mooncake employs a prediction-based early rejection policy, ensuring adherence to Service Level Objectives (SLOs) even under high demand.
This innovative architecture allows Mooncake to excel in long-context scenarios, significantly boosting throughput while maintaining latency requirements.
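To make the early rejection idea concrete, here is a minimal sketch of a prediction-based admission check: estimate a request's time-to-first-token (TTFT) from the work already queued and reject it up front if the prediction would miss the SLO. This is an illustration of the general technique only; the class, its fields, and the linear throughput model are assumptions, not Mooncake's actual scheduler.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt_tokens: int


class EarlyRejectionPolicy:
    """Toy prediction-based early rejection: estimate TTFT from queued
    prefill work and reject requests predicted to violate the SLO."""

    def __init__(self, prefill_tokens_per_s: float, ttft_slo_s: float):
        self.prefill_tokens_per_s = prefill_tokens_per_s
        self.ttft_slo_s = ttft_slo_s
        self.queued_tokens = 0  # prefill work already admitted

    def predict_ttft(self, request: Request) -> float:
        # Predicted TTFT = (queued work + this prompt) / prefill throughput.
        return (self.queued_tokens + request.prompt_tokens) / self.prefill_tokens_per_s

    def admit(self, request: Request) -> bool:
        if self.predict_ttft(request) > self.ttft_slo_s:
            return False  # reject early, before wasting prefill compute
        self.queued_tokens += request.prompt_tokens
        return True
```

The key point is that rejection happens before any GPU time is spent on the request, which is what keeps SLOs intact under overload.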
Key Components of Mooncake
Mooncake comprises several critical components working in harmony to deliver high-performance LLM serving:
- Transfer Engine: This is the heart of Mooncake, facilitating rapid and reliable data transfer across various protocols, including TCP, RDMA (InfiniBand/RoCEv2/eRDMA/NVIDIA GPUDirect), and NVMe over Fabrics (NVMe-oF). It abstracts away hardware complexities, providing a unified interface for transferring data between DRAM, VRAM, and NVMe.
- P2P Store: Built on top of the Transfer Engine, P2P Store enables the sharing of temporary objects between nodes in a cluster. This is particularly useful for scenarios like checkpoint transfer, where data needs to be efficiently distributed across the network. It employs a decentralized architecture, utilizing the etcd service for global metadata management.
- vLLM Integration: Mooncake integrates with vLLM, enhancing its prefill-decode disaggregation capabilities. By leveraging RDMA devices, this integration significantly improves the efficiency of LLM inference.
The Power of the Transfer Engine
The Transfer Engine is a standout feature of Mooncake, offering high-performance data transfer with several key advantages:
- Efficient Use of Multiple RDMA NIC Devices: It supports the aggregation of transfer bandwidth by utilizing multiple RDMA NIC devices.
- Topology-Aware Path Selection: The engine intelligently selects optimal devices based on the location (NUMA affinity, etc.) of both source and destination.
- Robustness to Network Errors: In case of transmission failures, the Transfer Engine automatically attempts to use alternative paths for data delivery, ensuring reliability.
- Performance: Benchmarks show that the Transfer Engine can deliver impressive bandwidth. In tests using 40 GB of data, it achieved up to 87 GB/s in a 4x200 Gbps RoCE network and 190 GB/s in an 8x400 Gbps RoCE network, significantly outperforming the TCP protocol.
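The topology-aware selection and failover behaviors above can be sketched as follows. This is a simplified illustration under assumed data structures (a list of NIC dicts with a NUMA field), not the Transfer Engine's real C++ implementation or API: prefer NUMA-local devices, and fall through to alternative paths on transmission failure.

```python
def pick_paths(nics, buffer_numa_node):
    """Order candidate NICs: devices on the buffer's NUMA node first,
    then all remaining devices as fallback paths."""
    local = [n for n in nics if n["numa"] == buffer_numa_node]
    remote = [n for n in nics if n["numa"] != buffer_numa_node]
    return local + remote


def transfer(nics, buffer_numa_node, send):
    """Try each candidate path in order; return the name of the NIC
    that completed the transfer, retrying on the next path if one fails."""
    for nic in pick_paths(nics, buffer_numa_node):
        try:
            send(nic)
            return nic["name"]
        except IOError:
            continue  # this path failed; fall back to the next device
    raise IOError("all transfer paths failed")
```

Bandwidth aggregation across multiple NICs would additionally stripe one transfer over several of these paths at once; the ordering logic stays the same.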
P2P Store: Efficient Data Distribution
P2P Store addresses a critical need in distributed systems: efficient data distribution. Its decentralized architecture and reliance on the Transfer Engine provide several benefits:
- Decentralized Architecture: By leveraging a client-side architecture with global metadata managed by etcd, P2P Store avoids the bottlenecks associated with centralized data distribution.
- Efficient Data Distribution: It enhances the efficiency of large-scale data distribution by allowing replicated nodes to share data directly, alleviating the pressure on data providers.
- High Performance: Utilizing the Transfer Engine, P2P Store achieves full utilization of hardware bandwidth.
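The replication pattern described above can be illustrated with a small in-memory sketch, where a plain dict stands in for etcd and method calls stand in for Transfer Engine reads. All class and method names here are hypothetical: the point is that a node which fetches an object registers itself as a new replica, so later readers pull from it instead of the original provider.

```python
class MetadataDirectory:
    """Stands in for etcd: maps object keys to nodes holding a replica."""

    def __init__(self):
        self.replicas = {}

    def register(self, key, node_name):
        self.replicas.setdefault(key, []).append(node_name)

    def locate(self, key):
        return self.replicas.get(key, [])


class PeerNode:
    def __init__(self, name, directory, cluster):
        self.name, self.directory, self.cluster = name, directory, cluster
        self.objects = {}
        cluster[name] = self

    def put(self, key, data):
        self.objects[key] = data
        self.directory.register(key, self.name)

    def get(self, key):
        if key in self.objects:
            return self.objects[key]
        # Fetch directly from any existing replica (stand-in for a
        # Transfer Engine read), then register as a replica ourselves.
        source = self.cluster[self.directory.locate(key)[0]]
        data = source.objects[key]
        self.put(key, data)  # future readers can now pull from this node
        return data
```

As replicas multiply, load on the original data provider drops, which is what makes this pattern effective for distributing large checkpoints.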
vLLM Integration for Optimized Inference
The integration of Mooncake with vLLM enhances the efficiency of Large Language Model (LLM) inference by enabling the prefill phase to run in a separate process from the decode phase. By default, vLLM uses NCCL and Gloo as the transport layer, but these are inefficient for decoupling the two phases across different machines. Mooncake's Transfer Engine is built precisely for this kind of cross-machine KVCache transfer.
Practical Applications and Showcases
Mooncake's components are not just theoretical; they are designed for practical implementation to solve real-world problems. Moonshot AI provides several practical examples and guides detailing the usage of Mooncake's components.
- Transfer Engine Standalone: Demonstrates how to transfer data between DRAM, VRAM, or NVMe while keeping hardware-specific details hidden behind the unified interface (Transfer Engine Standalone Guide).
- P2P Store Guide: Demonstrates how to use the P2P Store library for sharing temporary objects between nodes in a cluster (P2P Store Guide).
- vLLM Integration Guide: Details how to integrate Mooncake's Transfer Engine into vLLM, enhancing its prefill-decode disaggregation capabilities (vLLM Integration Guide).
Conclusion
Mooncake represents a significant advancement in the infrastructure supporting Large Language Models. Its KVCache-centric disaggregated architecture, combined with the powerful Transfer Engine and P2P Store, offers a robust and efficient platform for LLM serving. As Moonshot AI continues to develop and open-source components of Mooncake, it provides valuable tools and insights for the broader AI community, fostering further innovation in LLM infrastructure.