Context Caching | DeepSeek API Docs

Unleashing Efficiency: Understanding Context Caching with the DeepSeek API

The DeepSeek API is designed to empower developers and researchers with cutting-edge AI capabilities. One of the key features that enhances its performance and cost-effectiveness is Context Caching. This article dives deep into how Context Caching works within the DeepSeek API, providing you with the knowledge to leverage it for optimal results.

What is Context Caching?

Context caching is a powerful optimization technique employed by the DeepSeek API that leverages hard disk caching to store and reuse parts of previous requests. Essentially, it enables the API to remember and quickly access repeated information, improving efficiency and reducing computational costs. Best of all, the DeepSeek API's disk caching is enabled by default, so users benefit automatically without any code modification.

How Context Caching Works

Each user request triggers the construction of a hard disk cache. Crucially, context caching identifies overlapping prefixes between subsequent requests. When such an overlap occurs, the API fetches the matching portion directly from the cache instead of recomputing it, resulting in a "cache hit." Tokens served from the cache are billed at a lower rate, so cache hits directly reduce cost.

Key Concept: The "cache hit" is triggered only by the repeated prefix part between requests.
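No special flags are needed to turn caching on. Below is a minimal sketch of an ordinary request, assuming the OpenAI-compatible Python SDK, the https://api.deepseek.com base URL, and a DEEPSEEK_API_KEY environment variable as a placeholder; any request whose prefix repeats a previous one can benefit from the cache without further changes.

# Minimal sketch: a standard chat completion call.
# Caching needs no extra parameters; repeated prefixes across calls
# are cached automatically on the server side.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DEEPSEEK_API_KEY"],  # placeholder environment variable
    base_url="https://api.deepseek.com",
)

response = client.chat.completions.create(
    model="deepseek-chat",
    messages=[
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": "What is the capital of China?"},
    ],
)
print(response.choices[0].message.content)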

Context Caching in Action: Real-World Examples

To illustrate the benefits of context caching, let's explore a few common use cases:

1. Long Text Question Answering (Q&A)

Imagine using the DeepSeek API for analyzing financial reports.

First Request:

messages: [
  {"role": "system", "content": "You are an experienced financial report analyst..."},
  {"role": "user", "content": "<financial report content>\n\nPlease summarize the key information of this financial report."}
]

Second Request:

messages: [
  {"role": "system", "content": "You are an experienced financial report analyst..."},
  {"role": "user", "content": "<financial report content>\n\nPlease analyze the profitability of this financial report."}
]

In this scenario, both requests share a common prefix: the system message and the content of the financial report. The DeepSeek API will recognize this overlap and retrieve the prefix from the cache during the second request, leading to a significant reduction in processing time and cost.
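A hedged sketch of how these two requests might be issued in practice, reusing the client from the earlier snippet; report_text and the ask helper are illustrative placeholders, not part of the API:

report_text = "<financial report content>"  # placeholder for the full report text

def ask(question):
    # Illustrative helper: the system message and report text are identical
    # across calls, forming the repeated prefix that can hit the cache.
    return client.chat.completions.create(
        model="deepseek-chat",
        messages=[
            {"role": "system", "content": "You are an experienced financial report analyst..."},
            {"role": "user", "content": report_text + "\n\n" + question},
        ],
    )

first = ask("Please summarize the key information of this financial report.")
second = ask("Please analyze the profitability of this financial report.")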

2. Multi-Round Conversations

Context caching is especially useful in multi-turn conversations:

First Request:

messages: [
  {"role": "system", "content": "You are a helpful assistant"},
  {"role": "user", "content": "What is the capital of China?"}
]

Second Request:

messages: [
  {"role": "system", "content": "You are a helpful assistant"},
  {"role": "user", "content": "What is the capital of China?"},
  {"role": "assistant", "content": "The capital of China is Beijing."},
  {"role": "user", "content": "What is the capital of the United States?"}
]

The second request reuses the initial system message and the first user message. The DeepSeek API identifies these repeated elements and leverages the cache, optimizing the conversation flow. For more information on managing multi-turn conversations with DeepSeek, see Multi-round Conversation.
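A sketch of this pattern in code, again reusing the client from above. Appending the assistant's reply and resending the full history is the standard multi-turn pattern; nothing caching-specific is required.

# The full history is resent on every turn, so earlier turns become a
# repeated prefix for the next request.
messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "What is the capital of China?"},
]
first = client.chat.completions.create(model="deepseek-chat", messages=messages)

# Append the assistant's reply and the next question, then send everything again.
messages.append({"role": "assistant", "content": first.choices[0].message.content})
messages.append({"role": "user", "content": "What is the capital of the United States?"})
second = client.chat.completions.create(model="deepseek-chat", messages=messages)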

3. Few-Shot Learning

Few-shot learning supplies the model with a handful of examples so that it can learn a specific pattern; such prompts generally share a common context prefix.

First Request:

messages: [
  {"role": "system", "content": "You are a history expert. The user will provide a series of questions, and your answers should be concise and start with `Answer:`"},
  {"role": "user", "content": "In what year did Qin Shi Huang unify the six states?"},
  {"role": "assistant", "content": "Answer: 221 BC"},
  {"role": "user", "content": "Who was the founder of the Han Dynasty?"},
  {"role": "assistant", "content": "Answer: Liu Bang"},
  {"role": "user", "content": "Who was the last emperor of the Tang Dynasty?"},
  {"role": "assistant", "content": "Answer: Li Zhu"},
  {"role": "user", "content": "Who was the founding emperor of the Ming Dynasty?"},
  {"role": "assistant", "content": "Answer: Zhu Yuanzhang"},
  {"role": "user", "content": "Who was the founding emperor of the Qing Dynasty?"}
]

Second Request:

messages: [
  {"role": "system", "content": "You are a history expert. The user will provide a series of questions, and your answers should be concise and start with `Answer:`"},
  {"role": "user", "content": "In what year did Qin Shi Huang unify the six states?"},
  {"role": "assistant", "content": "Answer: 221 BC"},
  {"role": "user", "content": "Who was the founder of the Han Dynasty?"},
  {"role": "assistant", "content": "Answer: Liu Bang"},
  {"role": "user", "content": "Who was the last emperor of the Tang Dynasty?"},
  {"role": "assistant", "content": "Answer: Li Zhu"},
  {"role": "user", "content": "Who was the founding emperor of the Ming Dynasty?"},
  {"role": "assistant", "content": "Answer: Zhu Yuanzhang"},
  {"role": "user", "content": "When did the Shang Dynasty fall?"}
]

Since few-shot prompts generally share the same context prefix, context caching significantly reduces the cost of few-shot learning.
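A hedged sketch of this pattern, with the few-shot examples kept in a constant prefix list; the ask_history helper is illustrative, not part of the API:

few_shot_prefix = [
    {"role": "system", "content": "You are a history expert. The user will provide a series of questions, and your answers should be concise and start with `Answer:`"},
    {"role": "user", "content": "In what year did Qin Shi Huang unify the six states?"},
    {"role": "assistant", "content": "Answer: 221 BC"},
    # ... the remaining example pairs, kept identical across requests ...
]

def ask_history(question):
    # Illustrative helper: only the final user message changes, so the
    # few-shot examples can be served from the cache on later calls.
    return client.chat.completions.create(
        model="deepseek-chat",
        messages=few_shot_prefix + [{"role": "user", "content": question}],
    )

ask_history("Who was the founding emperor of the Qing Dynasty?")
ask_history("When did the Shang Dynasty fall?")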

Monitoring Cache Hit Status

The DeepSeek API reports the effectiveness of context caching in the usage section of each response, which breaks down token utilization. By checking these fields, you can see how many prompt tokens resulted in a cache hit; a short sketch for reading them follows the list below.

Specifically, two fields are provided:

  • prompt_cache_hit_tokens: The number of tokens fetched from the cache (lower cost).
  • prompt_cache_miss_tokens: The number of tokens that were not found in the cache and required full processing (higher cost).
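A short sketch for reading these fields with the OpenAI-compatible Python SDK, reusing the client from the first snippet; getattr is used defensively, since the SDK may expose them only as extra attributes on the usage object.

response = client.chat.completions.create(
    model="deepseek-chat",
    messages=[
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": "What is the capital of China?"},
    ],
)
usage = response.usage
# Tokens served from the cache (billed at the lower rate).
print("cache hit tokens:", getattr(usage, "prompt_cache_hit_tokens", None))
# Tokens not found in the cache and processed from scratch.
print("cache miss tokens:", getattr(usage, "prompt_cache_miss_tokens", None))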

Important Considerations

While context caching offers significant advantages, keep these points in mind:

  • Output Randomness: The hard disk cache only covers the input (prompt). The output is still computed fresh and is influenced by parameters such as temperature, which introduces randomness. For ways to tune these parameters, see The Temperature Parameter.
  • Storage Unit: The cache operates in 64-token units. Content shorter than 64 tokens will not be cached (see the rough illustration after this list).
  • Best-Effort Basis: The cache system works on a "best-effort" basis and does not guarantee a 100% cache hit rate.
  • Cache Lifespan: Cache construction takes seconds. When the cache is no longer used, it gets cleared automatically, usually within hours or days.
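As a rough illustration of the 64-token storage unit, assuming (for the sake of this example only) that a matched prefix is counted in whole 64-token units and rounded down:

# Rough illustration only. Assumption: a matched prefix is counted in whole
# 64-token units, i.e. rounded down to the nearest multiple of 64.
matched_prefix_tokens = 1000
cacheable_tokens = (matched_prefix_tokens // 64) * 64  # 960
print(cacheable_tokens)

# Anything shorter than one unit cannot be cached at all.
print((50 // 64) * 64)  # 0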

Conclusion

Context caching is an invaluable asset within the DeepSeek API, optimizing performance and reducing costs. By understanding how it works and strategically structuring your requests (such as in multi-turn conversations or with consistent prefixes), you can maximize the benefits of this powerful feature. As you continue to explore the DeepSeek API, consider how context caching can be integrated into your workflows to achieve more efficient and cost-effective AI solutions.
