The AI world is buzzing about DeepSeek AI, a Chinese open-weights frontier AI laboratory, and their newly released reasoning model, R1. This model is significant not just for its impressive performance, but also for its open-source nature, offering a recipe for others to replicate and build upon. This article dives into the details of DeepSeek R1, its training methodology, and what it means for the future of reasoning language models (RLMs).
The release of DeepSeek R1 marks a turning point in RLM research, a field that previously lacked a clear foundational paper. Before R1, progress was largely confined to industrial research, with limited transparency. Now, with R1's MIT license, researchers and companies can freely use and build upon this technology, accelerating the development and deployment of RLMs.
Key highlights of the DeepSeek R1 release include:

- Open weights for DeepSeek-R1 and DeepSeek-R1-Zero, released under an MIT license.
- A set of smaller dense models distilled from R1, built on Qwen and Llama bases.
- A technical report describing the training recipe.
These models are accessible via DeepThink at chat.deepseek.com and in their new app.
One immediate impact of R1's release is the disruption of the pricing landscape for reasoning models. Previously, OpenAI's o1 commanded a premium price due to its long-context serving and market dominance. However, R1's pricing is significantly lower, signaling an impending price war reminiscent of the Mixtral inference price war of 2023.
The training of DeepSeek R1 involves four distinct stages, each contributing to the model's overall reasoning capabilities:

1. A "cold start" supervised fine-tuning pass on a few thousand filtered reasoning completions.
2. Large-scale reinforcement learning on verifiable reasoning problems.
3. Rejection sampling plus supervised fine-tuning to reintroduce general capabilities.
4. A final reinforcement learning phase for helpfulness, harmlessness, and further reasoning refinement.
Each phase plays a critical role in shaping R1's capabilities, from initial learning to generalization and user-friendliness.
DeepSeek R1-Zero stands out as the first open model trained with large-scale RL without supervised fine-tuning (SFT) as a preliminary step. R1-Zero avoids the rambling typical of non-instruction-tuned base models through a system prompt that instructs the model to wrap its output in HTML-style tags such as <answer>, making completions easy to parse and verify. RL on base models is a direction that deserves further study.
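To make that concrete, here is a minimal sketch of how such a format constraint could be checked and the answer extracted for verification. The <think>/<answer> tag pair and the 0/1 reward value are illustrative assumptions, not DeepSeek's exact implementation.

```python
import re

# Illustrative tag pattern: the system prompt asks the model to put its chain of
# thought in <think>...</think> and its final answer in <answer>...</answer>.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def format_reward(completion: str) -> float:
    """Assumed 0/1 reward: did the completion follow the expected tag structure?"""
    return 1.0 if THINK_RE.search(completion) and ANSWER_RE.search(completion) else 0.0

def extract_answer(completion: str) -> str | None:
    """Pull the contents of the <answer> block out for downstream verification."""
    match = ANSWER_RE.search(completion)
    return match.group(1).strip() if match else None

if __name__ == "__main__":
    sample = "<think>2 + 2 = 4, so the answer is 4.</think><answer>4</answer>"
    print(format_reward(sample))   # 1.0
    print(extract_answer(sample))  # 4
```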
DeepSeek performs a small amount of supervised fine-tuning on the original base model with "a few thousand" filtered completions from the R1-Zero model to improve the performance of the final reasoning model. For anyone replicating this today, generating that data with DeepSeek-R1 itself is likely the easiest route. This phase readies the loss landscape of the model so that "emergent" behaviors like "wait, let me check my work" or "that was wrong" surface more easily during RL training.
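As a rough illustration of this filtering step, the sketch below assembles a cold-start SFT set from completions that pass a correctness check and a crude readability filter. The `generate`, `verify_answer`, and `is_readable` helpers are hypothetical stand-ins, not DeepSeek's pipeline.

```python
from typing import Callable

def is_readable(completion: str) -> bool:
    """Crude readability filter: reject extremely long or language-mixed traces.
    (The isascii() check is a stand-in for a real language-consistency filter.)"""
    return len(completion) < 20_000 and completion.isascii()

def build_cold_start_data(
    prompts: list[str],
    answers: list[str],
    generate: Callable[[str], str],             # hypothetical: sample from an R1-Zero-style model
    verify_answer: Callable[[str, str], bool],  # hypothetical: check answer against ground truth
    max_examples: int = 5000,                   # "a few thousand" filtered completions
) -> list[dict[str, str]]:
    """Collect verifiably correct, readable reasoning traces as SFT examples."""
    dataset: list[dict[str, str]] = []
    for prompt, gold in zip(prompts, answers):
        completion = generate(prompt)
        if verify_answer(completion, gold) and is_readable(completion):
            dataset.append({"prompt": prompt, "completion": completion})
        if len(dataset) >= max_examples:
            break
    return dataset
```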
RL for reasoning models is built on a simple idea: reward the model for getting correct answers to problems where correctness can be checked. DeepSeek mentions three reward components during the reasoning phase of RL: accuracy rewards, format rewards, and language-consistency rewards. DeepSeek uses the RL algorithm they introduced, Group Relative Policy Optimization (GRPO), which keeps the PPO update rule but replaces the separate value model held in memory with Monte Carlo advantage estimates computed over a group of sampled completions. The nature of the reward setup (and the data) is the key to this sort of reasoning training, and many of the small RL details can be substituted for one another.
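The core of GRPO's value-free advantage estimate is easy to state in code: score a group of completions sampled for the same prompt, then normalize each completion's reward against the group's mean and standard deviation. The sketch below shows only that computation; the reward values in the usage example are made up.

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage estimate: judge each sampled completion relative to the
    other completions drawn for the same prompt, instead of querying a learned value model."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:  # a uniform group carries no learning signal
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

if __name__ == "__main__":
    # Made-up scores for a group of 4 completions sampled from one prompt, each the
    # sum of accuracy + format + language-consistency rewards.
    rewards = [2.0, 0.0, 3.0, 1.0]
    advantages = group_relative_advantages(rewards)
    print(advantages)  # correct, well-formatted completions get positive advantages
    # These advantages then plug into the standard PPO-style policy update.
```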
Rejection sampling is used to begin reintroducing general capabilities into the model. All in, we currently have very few details here, and there is a lot of open space to learn (and likely improve).
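Under the usual interpretation of rejection sampling, the loop looks something like the sketch below: sample several completions per prompt, keep the best-scoring one, and reuse the kept pairs as SFT data. The `generate` and `score` callables are hypothetical stand-ins for the RL checkpoint and a reward model or verifier.

```python
import random
from typing import Callable

def rejection_sample(
    prompt: str,
    generate: Callable[[str], str],       # stand-in for sampling from the RL checkpoint
    score: Callable[[str, str], float],   # stand-in for a reward model or verifier
    n_samples: int = 16,
) -> str:
    """Draw several completions for one prompt and keep the best-scoring one.
    The retained (prompt, completion) pairs become supervised fine-tuning data."""
    completions = [generate(prompt) for _ in range(n_samples)]
    return max(completions, key=lambda c: score(prompt, c))

if __name__ == "__main__":
    # Dummy generator/scorer just to show the control flow.
    dummy_generate = lambda p: f"answer-{random.randint(0, 9)}"
    dummy_score = lambda p, c: float(c[-1])
    print(rejection_sample("What is 2+2?", dummy_generate, dummy_score))
```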
DeepSeek R1 then goes back to reinforcement learning, this time aimed at improving the model's helpfulness and harmlessness while simultaneously refining its reasoning capabilities. Details on this phase are sparse as well; as this grows into a larger area of research and development, the open questions will slowly be answered.
DeepSeek R1's release marks a significant step forward in the field of reasoning language models. Its open-source nature encourages collaboration and accelerates innovation. However, key questions remain, particularly around the training data and the finer details of each stage.
DeepSeek R1 is not the only approach to training these models, but it is a recipe that others can immediately use as a foundation. As the community continues to develop datasets and infrastructure, and similar reports continue to emerge from other organizations, the future of reasoning language models looks promising.
For those interested in further exploration, check out the Inference & Reasoning tag.