DeepSeek has introduced a groundbreaking feature to its API: Context Caching on Disk. This innovative approach significantly reduces the cost and latency associated with using large language models (LLMs), making it a game-changer for developers and businesses alike. Let's delve into how this technology works and what benefits it offers.
A significant portion of the input sent to LLM APIs is repetitive. This repetition can stem from multi-turn conversations that resend earlier turns verbatim, prompts that share the same system prompt or few-shot examples, and repeated queries over the same documents or datasets.
This redundancy translates directly into wasted computation and increased costs. DeepSeek's Context Caching directly tackles this problem.
DeepSeek's Context Caching on Disk caches content that is expected to be reused on a distributed disk array.
When a duplicate input prefix (from the 0th token) is detected, DeepSeek retrieves the repeated parts from the cache, bypassing the need for recomputation. This intelligent caching mechanism offers remarkable advantages:
For cache hits, DeepSeek charges only $0.014 per million tokens, cutting API input costs by up to 90%. See Models & Pricing for up-to-date rates.
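As a rough back-of-the-envelope illustration of where that figure comes from, the sketch below blends the cache-hit rate with an assumed standard input price of $0.14 per million tokens (the deepseek-chat rate around the time this feature launched; check Models & Pricing for current numbers):

```python
# Rough cost illustration. CACHE_MISS_PRICE is an assumption based on the
# standard deepseek-chat input price when caching launched; see Models & Pricing.
CACHE_HIT_PRICE = 0.014   # USD per million input tokens served from cache
CACHE_MISS_PRICE = 0.14   # USD per million input tokens recomputed (assumed)

def blended_input_cost(total_tokens: int, hit_ratio: float) -> float:
    """Estimated input cost in USD for a prompt with the given cache-hit ratio."""
    hit_tokens = total_tokens * hit_ratio
    miss_tokens = total_tokens - hit_tokens
    return (hit_tokens * CACHE_HIT_PRICE + miss_tokens * CACHE_MISS_PRICE) / 1_000_000

# A 100k-token prompt with a 90% cache hit rate costs about $0.0027 in input
# tokens instead of about $0.014 -- roughly an 80% saving on that request.
print(blended_input_cost(100_000, 0.9))
```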
The best part? Using DeepSeek API's Caching Service requires no code or interface changes and is available for all users! The cache service runs automatically in the background, with billing based solely on actual cache hits.
Important Note: Only requests with identical prefixes, starting from the 0th token, will trigger a cache hit. Partial matches within the input will be ignored.
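As a minimal sketch of what this means in practice, the example below assumes the OpenAI-compatible Python SDK that the DeepSeek API supports and a hypothetical `DEEPSEEK_API_KEY` environment variable. The second request's message list repeats the first request verbatim from the 0th token, so the earlier turns are eligible for a cache hit:

```python
# Sketch of a multi-turn exchange where the second request repeats the first
# request's messages verbatim as its prefix, so those tokens can hit the cache.
# Assumes the OpenAI-compatible Python SDK and a DEEPSEEK_API_KEY env variable.
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["DEEPSEEK_API_KEY"],
                base_url="https://api.deepseek.com")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize the attached report."},
]
first = client.chat.completions.create(model="deepseek-chat", messages=messages)

# Append the assistant's reply and a new question; everything before the new
# question is byte-for-byte identical to the first request, starting at token 0.
messages.append({"role": "assistant", "content": first.choices[0].message.content})
messages.append({"role": "user", "content": "Now list the three key risks."})
second = client.chat.completions.create(model="deepseek-chat", messages=messages)
```

If the same text appeared only in the middle of an otherwise different prompt, none of it would be served from the cache.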
Context caching is particularly beneficial in scenarios such as multi-turn conversations, Q&A assistants with long shared system prompts, role-playing applications with fixed persona setups, and data analysis tools that repeatedly query the same documents.
For more detailed instructions, refer to the guide Use Context Caching.
DeepSeek provides two useful new fields in the usage object of the API response, making it easy to track and monitor your cache performance:

- prompt_cache_hit_tokens: the number of tokens served from the cache, billed at $0.014 per million tokens.
- prompt_cache_miss_tokens: the number of tokens not served from the cache, billed at the standard rate.

By monitoring these metrics, you can gain valuable insight into how effective context caching is for your specific use cases.
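A minimal way to read these fields, reusing the `second` response object from the earlier sketch (this assumes the OpenAI-compatible Python SDK; the exact attribute access may differ with other client libraries):

```python
# Inspect cache effectiveness from the usage block of a response.
# getattr is used defensively in case a client library omits the extra fields.
usage = second.usage
hit = getattr(usage, "prompt_cache_hit_tokens", 0)
miss = getattr(usage, "prompt_cache_miss_tokens", 0)
total = hit + miss
if total:
    print(f"cache hit rate: {hit / total:.1%} ({hit} of {total} prompt tokens)")
```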
The impact of context caching is significant: on cache hits the repeated prefix is not recomputed, so first-token latency drops sharply, and the cached portion of the input is billed at the reduced rate, cutting costs by up to 90%.
DeepSeek prioritizes security and data privacy. The caching system incorporates a robust security strategy: each user's cache is independent and invisible to other users, and unused cache entries are automatically cleared rather than retained long-term.
DeepSeek is the first LLM provider globally to implement extensive disk caching in its API services.
This disk caching implementation is made possible by the MLA architecture introduced in the DeepSeek V2 series, which improves model performance while dramatically shrinking the context KV cache. The compressed cache is small enough to store efficiently on low-cost disks, making widespread caching feasible.
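To make the idea concrete, here is a purely conceptual sketch (not DeepSeek's implementation; every name below is hypothetical) of keying compressed KV-cache blocks by a hash of the token prefix, so that an identical prefix starting from the 0th token can be loaded from disk instead of recomputed:

```python
# Conceptual sketch only -- not DeepSeek's implementation. Illustrates keying
# compressed KV-cache blocks by a hash of the full token prefix so that a
# repeated prefix (from token 0) can be loaded from disk instead of recomputed.
import hashlib
import pathlib
import pickle

CACHE_DIR = pathlib.Path("/tmp/kv_cache")  # hypothetical on-disk store
CACHE_DIR.mkdir(parents=True, exist_ok=True)

def prefix_key(token_ids: list[int]) -> str:
    """Hash the whole prefix so only exact matches from the 0th token collide."""
    return hashlib.sha256(repr(token_ids).encode()).hexdigest()

def load_cached_kv(token_ids: list[int]):
    """Return the cached KV blocks for this exact prefix, or None on a miss."""
    path = CACHE_DIR / prefix_key(token_ids)
    return pickle.loads(path.read_bytes()) if path.exists() else None

def store_kv(token_ids: list[int], compressed_kv) -> None:
    """Persist the compressed KV blocks for this prefix to disk."""
    (CACHE_DIR / prefix_key(token_ids)).write_bytes(pickle.dumps(compressed_kv))
```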
The DeepSeek API handles up to 1 trillion tokens per day, with no hard limits on concurrency or rate, ensuring high-quality service even under heavy load.
Some system design details:
DeepSeek's Context Caching on Disk represents a significant leap forward in LLM API efficiency. By intelligently caching and reusing repetitive input sequences, DeepSeek empowers developers to cut API costs, reduce response latency, and work with long contexts more freely.
Whether you're building Q&A assistants, immersive role-playing experiences, or sophisticated data analysis tools, DeepSeek's context caching can unlock new possibilities and make your LLM applications more efficient and cost-effective.