The DeepSeek API is designed to empower developers and researchers with cutting-edge AI capabilities. One of the key features that enhances its performance and cost-effectiveness is Context Caching. This article dives deep into how Context Caching works within the DeepSeek API, providing you with the knowledge to leverage it for optimal results.
Context caching is a powerful optimization technique employed by the DeepSeek API that leverages hard disk caching to store and reuse parts of previous requests. Essentially, it enables the API to remember and quickly access repeated information, improving efficiency and reducing computational costs. Best of all, DeepSeek API's disk caching technology is enabled by default, so users benefit automatically without any code modification.
Each user request triggers the construction of a hard disk cache. Crucially, context caching identifies overlapping prefixes between subsequent requests. When such an overlap occurs, the API fetches the matching portion directly from the cache instead of recomputing it, resulting in a "cache hit" that reduces both latency and cost.
Key Concept: A "cache hit" is triggered only by the repeated prefix portion shared between requests; anything after the first point of divergence is processed from scratch.
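The sketches later in this article assume a client configured through the OpenAI-compatible Python SDK, roughly as below; DEEPSEEK_API_KEY is simply the environment variable name chosen for this sketch.

import os
from openai import OpenAI

# DeepSeek exposes an OpenAI-compatible endpoint, so the standard OpenAI SDK works.
client = OpenAI(
    api_key=os.environ["DEEPSEEK_API_KEY"],  # env var name assumed for this sketch
    base_url="https://api.deepseek.com",
)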
To illustrate the benefits of context caching, let's explore a few common use cases:
Imagine using the DeepSeek API for analyzing financial reports.
First Request:
messages: [
{"role": "system", "content": "You are an experienced financial report analyst..."},
{"role": "user", "content": "<financial report content>\n\nPlease summarize the key information of this financial report."}
]
Second Request:
messages: [
{"role": "system", "content": "You are an experienced financial report analyst..."},
{"role": "user", "content": "<financial report content>\n\nPlease analyze the profitability of this financial report."}
]
In this scenario, both requests share a common prefix: the system message and the content of the financial report. The DeepSeek API will recognize this overlap and retrieve the prefix from the cache during the second request, leading to a significant reduction in processing time and cost.
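As a minimal sketch of this flow (using the client configured above; financial_report stands in for the actual document text), keeping the system message and report text byte-for-byte identical is what makes the prefix cacheable:

financial_report = "<financial report content>"  # placeholder for the real document
system_msg = {"role": "system", "content": "You are an experienced financial report analyst..."}

# First request: the system message plus the report becomes the cacheable prefix.
first = client.chat.completions.create(
    model="deepseek-chat",
    messages=[
        system_msg,
        {"role": "user", "content": financial_report + "\n\nPlease summarize the key information of this financial report."},
    ],
)

# Second request: identical prefix, different question, so the prefix
# can be served from the cache.
second = client.chat.completions.create(
    model="deepseek-chat",
    messages=[
        system_msg,
        {"role": "user", "content": financial_report + "\n\nPlease analyze the profitability of this financial report."},
    ],
)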
Context caching is especially useful in multi-turn conversations:
First Request:
messages: [
{"role": "system", "content": "You are a helpful assistant"},
{"role": "user", "content": "What is the capital of China?"}
]
Second Request:
messages: [
{"role": "system", "content": "You are a helpful assistant"},
{"role": "user", "content": "What is the capital of China?"},
{"role": "assistant", "content": "The capital of China is Beijing."},
{"role": "user", "content": "What is the capital of the United States?"}
]
The second request reuses the initial system message and the first user message as its prefix. The DeepSeek API identifies these repeated elements and serves them from the cache, optimizing the conversation flow; a sketch follows below. For more information on managing multi-turn conversations with DeepSeek, see Multi-round Conversation.
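A sketch of this flow: carry the full history forward so that each new request extends, rather than rewrites, the cached prefix.

messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "What is the capital of China?"},
]
first = client.chat.completions.create(model="deepseek-chat", messages=messages)

# Append the assistant's reply and the next question. The earlier turns form
# an exact prefix of the second request, so they are eligible for a cache hit.
messages.append({"role": "assistant", "content": first.choices[0].message.content})
messages.append({"role": "user", "content": "What is the capital of the United States?"})
second = client.chat.completions.create(model="deepseek-chat", messages=messages)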
Few-shot learning provides the model with examples so it can learn a specific pattern; these examples generally form a common context prefix across requests.
First Request:
messages: [
{"role": "system", "content": "You are a history expert. The user will provide a series of questions, and your answers should be concise and start with `Answer:`"},
{"role": "user", "content": "In what year did Qin Shi Huang unify the six states?"},
{"role": "assistant", "content": "Answer: 221 BC"},
{"role": "user", "content": "Who was the founder of the Han Dynasty?"},
{"role": "assistant", "content": "Answer: Liu Bang"},
{"role": "user", "content": "Who was the last emperor of the Tang Dynasty?"},
{"role": "assistant", "content": "Answer: Li Zhu"},
{"role": "user", "content": "Who was the founding emperor of the Ming Dynasty?"},
{"role": "assistant", "content": "Answer: Zhu Yuanzhang"},
{"role": "user", "content": "Who was the founding emperor of the Qing Dynasty?"}
]
Second Request:
messages: [
{"role": "system", "content": "You are a history expert. The user will provide a series of questions, and your answers should be concise and start with `Answer:`"},
{"role": "user", "content": "In what year did Qin Shi Huang unify the six states?"},
{"role": "assistant", "content": "Answer: 221 BC"},
{"role": "user", "content": "Who was the founder of the Han Dynasty?"},
{"role": "assistant", "content": "Answer: Liu Bang"},
{"role": "user", "content": "Who was the last emperor of the Tang Dynasty?"},
{"role": "assistant", "content": "Answer: Li Zhu"},
{"role": "user", "content": "Who was the founding emperor of the Ming Dynasty?"},
{"role": "assistant", "content": "Answer: Zhu Yuanzhang"},
{"role": "user", "content": "When did the Shang Dynasty fall?"}
]
Since few-shot prompts generally share the same context prefix, context caching significantly reduces the cost of few-shot usage.
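In code, this pattern amounts to defining the few-shot prefix once and appending only the new question; a sketch, assuming the client configured earlier:

# Shared few-shot prefix: kept identical across requests so it stays cacheable.
FEW_SHOT_PREFIX = [
    {"role": "system", "content": "You are a history expert. The user will provide a series of questions, and your answers should be concise and start with `Answer:`"},
    {"role": "user", "content": "In what year did Qin Shi Huang unify the six states?"},
    {"role": "assistant", "content": "Answer: 221 BC"},
    # ... the remaining example pairs shown above ...
]

def ask(question):
    # Only the final user message differs between requests.
    return client.chat.completions.create(
        model="deepseek-chat",
        messages=FEW_SHOT_PREFIX + [{"role": "user", "content": question}],
    )

ask("Who was the founding emperor of the Qing Dynasty?")
ask("When did the Shang Dynasty fall?")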
The DeepSeek API provides insight into the effectiveness of context caching through the usage section of its responses, which reports token utilization. By checking these figures, you can see how many tokens were served as cache hits. Specifically, two fields are provided:
prompt_cache_hit_tokens: The number of tokens fetched from the cache (lower cost).
prompt_cache_miss_tokens: The number of tokens that were not found in the cache and required full processing (higher cost).
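A quick way to inspect these counters after any request (a sketch using the client from earlier; getattr is used because the two fields are DeepSeek-specific extensions of the standard usage object):

response = client.chat.completions.create(
    model="deepseek-chat",
    messages=[
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": "What is the capital of China?"},
    ],
)

usage = response.usage
# DeepSeek-specific counters; they may not be declared on the SDK's typed model.
print("cache hit tokens:", getattr(usage, "prompt_cache_hit_tokens", 0))
print("cache miss tokens:", getattr(usage, "prompt_cache_miss_tokens", 0))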
While context caching offers significant advantages, keep these points in mind:
The cache is built in 64-token units, so content shorter than 64 tokens will not be cached.
Cache hits work on a best-effort basis; a 100% hit rate is not guaranteed, even for identical prefixes.
Cache entries that are no longer in use are cleared automatically, typically within a few hours to a few days.
Context caching is an invaluable asset within the DeepSeek API, optimizing performance and reducing costs. By understanding how it works and structuring your requests strategically (for example, in multi-turn conversations or with consistent prefixes), you can maximize the benefits of this powerful feature. As you continue to explore the DeepSeek API, consider how context caching can be integrated into your workflows to achieve more efficient and cost-effective AI solutions.