Large Language Models (LLMs) are powerful tools, but their API costs can be prohibitive, especially when dealing with repetitive inputs. DeepSeek is tackling this challenge head-on with its innovative Context Caching on Disk technology. This feature dramatically reduces API costs and significantly improves latency, making LLMs more accessible and efficient for developers.
A significant portion of user input in LLM API interactions tends to be redundant. This is evident in scenarios such as multi-turn conversations, where the full chat history is re-sent with every request, and few-shot prompts that reuse the same fixed instructions and examples.
This repetition leads to unnecessary computational overhead and increased costs.
DeepSeek's Context Caching on Disk addresses this redundancy by caching content that is expected to be reused on low-cost disk storage; when a request arrives whose prefix has already been processed, that portion is served from the cache instead of being recomputed (sketched below).
This approach not only slashes latency but also results in substantial cost savings.
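As a rough mental model only (a toy in-memory sketch, not DeepSeek's actual implementation, which persists the precomputed context on disk), the idea is to find the longest previously processed prefix of a new request and compute only what comes after it:

```python
# Toy illustration of prefix caching. The real service stores precomputed context
# (KV caches) on disk; here we only track how many leading tokens of a new request
# were already seen in an earlier one.

class PrefixCache:
    def __init__(self):
        self._seen = []  # token sequences from previous requests

    def longest_hit(self, tokens):
        """Longest prefix (counted from token 0) shared with any earlier request."""
        best = 0
        for cached in self._seen:
            match = 0
            for a, b in zip(tokens, cached):
                if a != b:
                    break
                match += 1
            best = max(best, match)
        return best

    def remember(self, tokens):
        self._seen.append(list(tokens))


cache = PrefixCache()
first = "SYSTEM PROMPT with long instructions ... question A".split()
second = "SYSTEM PROMPT with long instructions ... question B".split()

cache.remember(first)
hit = cache.longest_hit(second)
# Only the tokens after the shared prefix would need fresh computation.
print(f"{hit} tokens reusable from cache, {len(second) - hit} computed fresh")
```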
With DeepSeek's Context Caching, cache hits are billed at just $0.014 per million tokens, representing up to a 90% reduction in API costs!
The best news? DeepSeek's disk caching service is available to all users right now, and it requires no code changes or complex setup.
For a cache hit to occur, requests must share an identical prefix, beginning from the 0th token. Partial matches deeper inside the input are not reused.
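For instance, the sketch below (assuming DeepSeek's OpenAI-compatible endpoint at https://api.deepseek.com and the deepseek-chat model; the API key and prompt text are placeholders) sends two requests that are token-for-token identical up to the final user message, so the long shared prefix from the first call is eligible to be served from the cache on the second:

```python
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DEEPSEEK_API_KEY",        # placeholder
    base_url="https://api.deepseek.com",    # DeepSeek's OpenAI-compatible endpoint
)

# The long, stable prefix: identical across requests, starting from the 0th token.
shared_prefix = [
    {"role": "system", "content": "You are a contract-review assistant. <long instructions...>"},
    {"role": "user", "content": "<example clause>"},
    {"role": "assistant", "content": "<example analysis>"},
]

for question in (
    "Review clause 7 for liability risks.",
    "Review clause 12 for termination terms.",
):
    response = client.chat.completions.create(
        model="deepseek-chat",
        messages=shared_prefix + [{"role": "user", "content": question}],
    )
    print(response.choices[0].message.content[:120])
```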
Context caching proves beneficial in numerous scenarios, including assistants with long preset system prompts, repeated analysis of the same documents or codebases, and multi-turn conversations that carry a long history forward.
DeepSeek provides insight into cache performance through two new fields in the API responses (a short usage sketch follows the pricing notes below):

prompt_cache_hit_tokens: The number of tokens served from the cache, billed at $0.014 per million tokens.
prompt_cache_miss_tokens: The number of tokens not served from the cache, billed at the standard rate of $0.14 per million tokens.

Latency Reduction: For requests that include long, repetitive content, the first-token latency is significantly lower.
Cost Saving: Achieve savings of up to 90% by leveraging the caching system's ability to reuse repeated prompts. Even without explicit prompt optimization, historical usage shows average savings of around 50%.
You pay only $0.014 per million tokens for cache hits, and there is no additional charge for storing the cached data.
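As a minimal sketch of how you might read these fields and price a request (assuming they are surfaced on the response's usage object, and ignoring output tokens, which are billed separately):

```python
CACHE_HIT_PRICE_PER_M = 0.014   # USD per million tokens served from the cache
CACHE_MISS_PRICE_PER_M = 0.14   # USD per million tokens at the standard input rate


def prompt_cost_usd(usage):
    """Input-side cost of a single request, split into cached and uncached tokens."""
    hits = getattr(usage, "prompt_cache_hit_tokens", 0) or 0
    misses = getattr(usage, "prompt_cache_miss_tokens", 0) or 0
    return (hits * CACHE_HIT_PRICE_PER_M + misses * CACHE_MISS_PRICE_PER_M) / 1_000_000


# After a chat.completions.create(...) call, e.g. with the client shown earlier:
# print(f"cached: {response.usage.prompt_cache_hit_tokens} tokens, "
#       f"input cost: ${prompt_cost_usd(response.usage):.6f}")
```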
DeepSeek prioritizes data privacy and security in its cache system: each user's cache is independent and not visible to other users, and cache entries that go unused are automatically cleared rather than retained long-term.
DeepSeek is among the first LLM service providers to use disk caching in its API services, taking advantage of the MLA architecture in DeepSeek V2. This architecture improves performance while shrinking the context KV caches, making it practical to store them on low-cost disks.
The DeepSeek API can process up to 1 trillion tokens per day and places no limits on concurrency or request rate.
There are some considerations to keep in mind when deciding how to get the most out of context caching: because hits require an identical prefix starting from the very first token, place stable content (system prompts, few-shot examples, reference documents) at the beginning of a request and variable content at the end; and note that the first request containing a given prefix is still billed at the standard rate, with savings arriving only on later requests that reuse it, as illustrated below.
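Because matching starts at token 0, even a single variable token at the front of the prompt defeats the cache. A small illustration (the instruction text is a made-up stand-in):

```python
from datetime import datetime, timezone

# Hypothetical long, reusable instruction block standing in for a real system prompt.
LONG_INSTRUCTIONS = "You are a meticulous data analyst. Follow the house style guide. " * 100

question = "Summarize today's sales figures."
stamp = datetime.now(timezone.utc).isoformat()

# Cache-unfriendly: the per-request timestamp sits at the very start, so the 0th token
# differs on every call and no request can reuse a previously cached prefix.
bad_prompt = f"[{stamp}]\n{LONG_INSTRUCTIONS}\n{question}"

# Cache-friendly: the stable instructions come first and the variable parts come last,
# so the long shared prefix can be served from the cache on repeat requests.
good_prompt = f"{LONG_INSTRUCTIONS}\n{question}\n[{stamp}]"
```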
DeepSeek's Context Caching on Disk is a game-changer for LLM API usage. It reduces costs, lowers latency, and enhances overall efficiency. By caching content that is predicted to be re-used, the system bypasses the need for recomputation when similar prompts are provided.
By implementing context caching, DeepSeek is making LLMs more accessible and practical for a wider range of applications.