DeepSeek has introduced a groundbreaking feature to its API: Context Caching on Disk. This innovative approach significantly reduces the cost and latency associated with using large language models (LLMs), making it a game-changer for developers and businesses alike. Let's delve into how this technology works and what benefits it offers.
A significant portion of the input sent to LLM APIs is repetitive. This repetition can stem from multi-turn conversations that resend earlier turns verbatim, prompts that share the same system prompt or few-shot examples, and repeated queries over the same documents or datasets.
This redundancy translates directly into wasted computation and increased costs. DeepSeek's Context Caching directly tackles this problem.
DeepSeek's Context Caching on Disk caches content that is expected to be reused on a distributed disk array.
When a duplicate input prefix (from the 0th token) is detected, DeepSeek retrieves the repeated parts from the cache, bypassing the need for recomputation. This intelligent caching mechanism offers remarkable advantages:
For cache hits, DeepSeek charges only $0.014 per million tokens, cutting API input costs by up to 90%. See Models & Pricing for up-to-date rates.
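As a rough back-of-the-envelope illustration of where that figure comes from, the sketch below blends the cache-hit rate with an assumed standard input price of $0.14 per million tokens (the deepseek-chat rate around the time this feature launched; check Models & Pricing for current numbers):

```python
# Rough cost illustration. CACHE_MISS_PRICE is an assumption based on the
# standard deepseek-chat input price when caching launched; see Models & Pricing.
CACHE_HIT_PRICE = 0.014   # USD per million input tokens served from cache
CACHE_MISS_PRICE = 0.14   # USD per million input tokens recomputed (assumed)

def blended_input_cost(total_tokens: int, hit_ratio: float) -> float:
    """Estimated input cost in USD for a prompt with the given cache-hit ratio."""
    hit_tokens = total_tokens * hit_ratio
    miss_tokens = total_tokens - hit_tokens
    return (hit_tokens * CACHE_HIT_PRICE + miss_tokens * CACHE_MISS_PRICE) / 1_000_000

# A 100k-token prompt with a 90% cache hit rate costs about $0.0027 in input
# tokens instead of about $0.014 -- roughly an 80% saving on that request.
print(blended_input_cost(100_000, 0.9))
```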
The best part? Using DeepSeek API's Caching Service requires no code or interface changes and is available for all users! The cache service runs automatically in the background, with billing based solely on actual cache hits.
Important Note: Only requests with identical prefixes, starting from the 0th token, will trigger a cache hit. Partial matches within the input will be ignored.
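As a minimal sketch of what this means in practice, the example below assumes the OpenAI-compatible Python SDK that the DeepSeek API supports and a hypothetical `DEEPSEEK_API_KEY` environment variable. The second request's message list repeats the first request verbatim from the 0th token, so the earlier turns are eligible for a cache hit:

```python
# Sketch of a multi-turn exchange where the second request repeats the first
# request's messages verbatim as its prefix, so those tokens can hit the cache.
# Assumes the OpenAI-compatible Python SDK and a DEEPSEEK_API_KEY env variable.
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["DEEPSEEK_API_KEY"],
                base_url="https://api.deepseek.com")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize the attached report."},
]
first = client.chat.completions.create(model="deepseek-chat", messages=messages)

# Append the assistant's reply and a new question; everything before the new
# question is byte-for-byte identical to the first request, starting at token 0.
messages.append({"role": "assistant", "content": first.choices[0].message.content})
messages.append({"role": "user", "content": "Now list the three key risks."})
second = client.chat.completions.create(model="deepseek-chat", messages=messages)
```

If the same text appeared only in the middle of an otherwise different prompt, none of it would be served from the cache.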
Context caching is particularly beneficial in scenarios such as multi-turn conversations, Q&A assistants with long shared system prompts, role-playing applications with fixed persona setups, and data analysis tools that repeatedly query the same documents.
For more detailed instructions, refer to the guide Use Context Caching.
DeepSeek provides two useful new fields in the usage object of the API response, making it easy to track and monitor your cache performance:

- prompt_cache_hit_tokens: the number of tokens served from the cache, billed at $0.014 per million tokens.
- prompt_cache_miss_tokens: the number of tokens not served from the cache, billed at the standard rate.

By monitoring these metrics, you can gain valuable insight into how effective context caching is for your specific use cases.
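A minimal way to read these fields, reusing the `second` response object from the earlier sketch (this assumes the OpenAI-compatible Python SDK; the exact attribute access may differ with other client libraries):

```python
# Inspect cache effectiveness from the usage block of a response.
# getattr is used defensively in case a client library omits the extra fields.
usage = second.usage
hit = getattr(usage, "prompt_cache_hit_tokens", 0)
miss = getattr(usage, "prompt_cache_miss_tokens", 0)
total = hit + miss
if total:
    print(f"cache hit rate: {hit / total:.1%} ({hit} of {total} prompt tokens)")
```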
The impact of context caching is significant: on cache hits the repeated prefix is not recomputed, so first-token latency drops sharply, and the cached portion of the input is billed at the reduced rate, cutting costs by up to 90%.
DeepSeek prioritizes security and data privacy. The caching system incorporates a robust security strategy: each user's cache is independent and invisible to other users, and unused cache entries are automatically cleared rather than retained long-term.
DeepSeek is the first LLM provider globally to implement extensive disk caching in its API services.
This disk caching implementation is made possible by the MLA architecture introduced in the DeepSeek V2 series, which improves model performance while dramatically shrinking the context KV cache. The compressed cache is small enough to store efficiently on low-cost disks, making widespread caching feasible.
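To make the idea concrete, here is a purely conceptual sketch (not DeepSeek's implementation; every name below is hypothetical) of keying compressed KV-cache blocks by a hash of the token prefix, so that an identical prefix starting from the 0th token can be loaded from disk instead of recomputed:

```python
# Conceptual sketch only -- not DeepSeek's implementation. Illustrates keying
# compressed KV-cache blocks by a hash of the full token prefix so that a
# repeated prefix (from token 0) can be loaded from disk instead of recomputed.
import hashlib
import pathlib
import pickle

CACHE_DIR = pathlib.Path("/tmp/kv_cache")  # hypothetical on-disk store
CACHE_DIR.mkdir(parents=True, exist_ok=True)

def prefix_key(token_ids: list[int]) -> str:
    """Hash the whole prefix so only exact matches from the 0th token collide."""
    return hashlib.sha256(repr(token_ids).encode()).hexdigest()

def load_cached_kv(token_ids: list[int]):
    """Return the cached KV blocks for this exact prefix, or None on a miss."""
    path = CACHE_DIR / prefix_key(token_ids)
    return pickle.loads(path.read_bytes()) if path.exists() else None

def store_kv(token_ids: list[int], compressed_kv) -> None:
    """Persist the compressed KV blocks for this prefix to disk."""
    (CACHE_DIR / prefix_key(token_ids)).write_bytes(pickle.dumps(compressed_kv))
```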
The DeepSeek API handles up to 1 trillion tokens per day, with no hard limits on concurrency or rate, ensuring high-quality service even under heavy load.
Some system design details:
DeepSeek's Context Caching on Disk represents a significant leap forward in LLM API efficiency. By intelligently caching and reusing repetitive input sequences, DeepSeek empowers developers to cut API costs, reduce response latency, and work with long contexts more freely.
Whether you're building Q&A assistants, immersive role-playing experiences, or sophisticated data analysis tools, DeepSeek's context caching can unlock new possibilities and make your LLM applications more efficient and cost-effective.