Large Language Models (LLMs) are powerful tools, but their API costs can be prohibitive, especially when dealing with repetitive inputs. DeepSeek is tackling this challenge head-on with its innovative Context Caching on Disk technology. This feature dramatically reduces API costs and significantly improves latency, making LLMs more accessible and efficient for developers.
A significant portion of user input in LLM API interactions tends to be redundant. This is evident in scenarios such as multi-turn conversations, where the full chat history is re-sent with every request, and few-shot prompts that reuse the same fixed instructions and examples.
This repetition leads to unnecessary computational overhead and increased costs.
DeepSeek's Context Caching on Disk addresses this redundancy by caching content that is expected to be reused on low-cost disk storage; when a request arrives whose prefix has already been processed, that portion is served from the cache instead of being recomputed (sketched below).
This approach not only slashes latency but also results in substantial cost savings.
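As a rough mental model only (a toy in-memory sketch, not DeepSeek's actual implementation, which persists the precomputed context on disk), the idea is to find the longest previously processed prefix of a new request and compute only what comes after it:

```python
# Toy illustration of prefix caching. The real service stores precomputed context
# (KV caches) on disk; here we only track how many leading tokens of a new request
# were already seen in an earlier one.

class PrefixCache:
    def __init__(self):
        self._seen = []  # token sequences from previous requests

    def longest_hit(self, tokens):
        """Longest prefix (counted from token 0) shared with any earlier request."""
        best = 0
        for cached in self._seen:
            match = 0
            for a, b in zip(tokens, cached):
                if a != b:
                    break
                match += 1
            best = max(best, match)
        return best

    def remember(self, tokens):
        self._seen.append(list(tokens))


cache = PrefixCache()
first = "SYSTEM PROMPT with long instructions ... question A".split()
second = "SYSTEM PROMPT with long instructions ... question B".split()

cache.remember(first)
hit = cache.longest_hit(second)
# Only the tokens after the shared prefix would need fresh computation.
print(f"{hit} tokens reusable from cache, {len(second) - hit} computed fresh")
```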
With DeepSeek's Context Caching, cache hits are billed at just $0.014 per million tokens, representing up to a 90% reduction in API costs!
The best news? DeepSeek's disk caching service is available to all users right now, and it requires no code changes or complex setup.
For a cache hit to occur, requests must share an identical prefix, beginning from the 0th token. Partial matches deeper inside the input are not reused.
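For instance, the sketch below (assuming DeepSeek's OpenAI-compatible endpoint at https://api.deepseek.com and the deepseek-chat model; the API key and prompt text are placeholders) sends two requests that are token-for-token identical up to the final user message, so the long shared prefix from the first call is eligible to be served from the cache on the second:

```python
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DEEPSEEK_API_KEY",        # placeholder
    base_url="https://api.deepseek.com",    # DeepSeek's OpenAI-compatible endpoint
)

# The long, stable prefix: identical across requests, starting from the 0th token.
shared_prefix = [
    {"role": "system", "content": "You are a contract-review assistant. <long instructions...>"},
    {"role": "user", "content": "<example clause>"},
    {"role": "assistant", "content": "<example analysis>"},
]

for question in (
    "Review clause 7 for liability risks.",
    "Review clause 12 for termination terms.",
):
    response = client.chat.completions.create(
        model="deepseek-chat",
        messages=shared_prefix + [{"role": "user", "content": question}],
    )
    print(response.choices[0].message.content[:120])
```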
Context caching proves beneficial in numerous scenarios, including assistants with long preset system prompts, repeated analysis of the same documents or codebases, and multi-turn conversations that carry a long history forward.
DeepSeek provides insight into cache performance through two new fields in the API responses (a short usage sketch follows the pricing notes below):

prompt_cache_hit_tokens: The number of tokens served from the cache, billed at $0.014 per million tokens.
prompt_cache_miss_tokens: The number of tokens not served from the cache, billed at the standard rate of $0.14 per million tokens.

Latency Reduction: For requests that include long, repetitive content, the first-token latency is significantly lower.
Cost Saving: Achieve savings of up to 90% by leveraging the caching system's ability to reuse repeated prompts. Even without explicit prompt optimization, historical usage shows average savings of around 50%.
You pay only $0.014 per million tokens for cache hits, and there is no additional charge for storing the cached data.
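As a minimal sketch of how you might read these fields and price a request (assuming they are surfaced on the response's usage object, and ignoring output tokens, which are billed separately):

```python
CACHE_HIT_PRICE_PER_M = 0.014   # USD per million tokens served from the cache
CACHE_MISS_PRICE_PER_M = 0.14   # USD per million tokens at the standard input rate


def prompt_cost_usd(usage):
    """Input-side cost of a single request, split into cached and uncached tokens."""
    hits = getattr(usage, "prompt_cache_hit_tokens", 0) or 0
    misses = getattr(usage, "prompt_cache_miss_tokens", 0) or 0
    return (hits * CACHE_HIT_PRICE_PER_M + misses * CACHE_MISS_PRICE_PER_M) / 1_000_000


# After a chat.completions.create(...) call, e.g. with the client shown earlier:
# print(f"cached: {response.usage.prompt_cache_hit_tokens} tokens, "
#       f"input cost: ${prompt_cost_usd(response.usage):.6f}")
```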
DeepSeek prioritizes data privacy and security in its cache system: each user's cache is independent and not visible to other users, and cache entries that go unused are automatically cleared rather than retained long-term.
DeepSeek is among the first LLM service providers to use disk caching in its API services, taking advantage of the MLA architecture in DeepSeek V2. This architecture improves performance while shrinking the context KV caches, making it practical to store them on low-cost disks.
The DeepSeek API can process up to 1 trillion tokens per day and places no limits on concurrency or request rate.
There are some considerations to keep in mind when deciding how to get the most out of context caching: because hits require an identical prefix starting from the very first token, place stable content (system prompts, few-shot examples, reference documents) at the beginning of a request and variable content at the end; and note that the first request containing a given prefix is still billed at the standard rate, with savings arriving only on later requests that reuse it, as illustrated below.
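Because matching starts at token 0, even a single variable token at the front of the prompt defeats the cache. A small illustration (the instruction text is a made-up stand-in):

```python
from datetime import datetime, timezone

# Hypothetical long, reusable instruction block standing in for a real system prompt.
LONG_INSTRUCTIONS = "You are a meticulous data analyst. Follow the house style guide. " * 100

question = "Summarize today's sales figures."
stamp = datetime.now(timezone.utc).isoformat()

# Cache-unfriendly: the per-request timestamp sits at the very start, so the 0th token
# differs on every call and no request can reuse a previously cached prefix.
bad_prompt = f"[{stamp}]\n{LONG_INSTRUCTIONS}\n{question}"

# Cache-friendly: the stable instructions come first and the variable parts come last,
# so the long shared prefix can be served from the cache on repeat requests.
good_prompt = f"{LONG_INSTRUCTIONS}\n{question}\n[{stamp}]"
```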
DeepSeek's Context Caching on Disk is a game-changer for LLM API usage. It reduces costs, lowers latency, and enhances overall efficiency. By caching content that is predicted to be re-used, the system bypasses the need for recomputation when similar prompts are provided.
By implementing context caching, DeepSeek is making LLMs more accessible and practical for a wider range of applications.