The AI world is buzzing about DeepSeek AI, a Chinese open-weights frontier AI laboratory, and their newly released reasoning model, R1. This model is significant not just for its impressive performance, but also for its open-source nature, offering a recipe for others to replicate and build upon. This article dives into the details of DeepSeek R1, its training methodology, and what it means for the future of reasoning language models (RLMs).
The release of DeepSeek R1 marks a turning point in RLM research, a field that previously lacked a clear foundational paper. Before R1, progress was largely confined to industrial research, with limited transparency. Now, with R1's MIT license, researchers and companies can freely use and build upon this technology, accelerating the development and deployment of RLMs.
Key highlights of the DeepSeek R1 release include:

- Open weights for DeepSeek-R1 and DeepSeek-R1-Zero, released under an MIT license.
- A set of smaller dense models distilled from R1, built on Qwen and Llama bases.
- A technical report describing the training recipe.
These models are accessible via DeepThink at chat.deepseek.com and in their new app.
One immediate impact of R1's release is the disruption of the pricing landscape for reasoning models. Previously, OpenAI's o1 commanded a premium price due to its long-context serving and market dominance. However, R1's pricing is significantly lower, signaling an impending price war reminiscent of the Mixtral inference price war of 2023.
The training of DeepSeek R1 involves four distinct stages, each contributing to the model's overall reasoning capabilities:

1. A "cold start" supervised fine-tuning pass on a few thousand filtered reasoning completions.
2. Large-scale reinforcement learning on verifiable reasoning problems.
3. Rejection sampling plus supervised fine-tuning to reintroduce general capabilities.
4. A final reinforcement learning phase for helpfulness, harmlessness, and further reasoning refinement.
Each phase plays a critical role in shaping R1's capabilities, from initial learning to generalization and user-friendliness.
DeepSeek R1-Zero stands out as the first open model trained with large-scale RL without supervised fine-tuning (SFT) as a preliminary step. R1-Zero avoids the rambling typical of non-instruction-tuned base models through a system prompt that instructs the model to wrap its output in HTML-style tags such as <answer>, making completions easy to parse and verify. RL on base models is a direction that deserves further study.
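To make that concrete, here is a minimal sketch of how such a format constraint could be checked and the answer extracted for verification. The <think>/<answer> tag pair and the 0/1 reward value are illustrative assumptions, not DeepSeek's exact implementation.

```python
import re

# Illustrative tag pattern: the system prompt asks the model to put its chain of
# thought in <think>...</think> and its final answer in <answer>...</answer>.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def format_reward(completion: str) -> float:
    """Assumed 0/1 reward: did the completion follow the expected tag structure?"""
    return 1.0 if THINK_RE.search(completion) and ANSWER_RE.search(completion) else 0.0

def extract_answer(completion: str) -> str | None:
    """Pull the contents of the <answer> block out for downstream verification."""
    match = ANSWER_RE.search(completion)
    return match.group(1).strip() if match else None

if __name__ == "__main__":
    sample = "<think>2 + 2 = 4, so the answer is 4.</think><answer>4</answer>"
    print(format_reward(sample))   # 1.0
    print(extract_answer(sample))  # 4
```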
DeepSeek performs a small amount of supervised fine-tuning on the original base model with "a few thousand" filtered completions from the R1-Zero model to improve the performance of the final reasoning model. For anyone replicating this today, generating that data with DeepSeek-R1 itself is likely the easiest route. This phase readies the loss landscape of the model so that "emergent" behaviors like "wait, let me check my work" or "that was wrong" surface more easily during RL training.
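As a rough illustration of this filtering step, the sketch below assembles a cold-start SFT set from completions that pass a correctness check and a crude readability filter. The `generate`, `verify_answer`, and `is_readable` helpers are hypothetical stand-ins, not DeepSeek's pipeline.

```python
from typing import Callable

def is_readable(completion: str) -> bool:
    """Crude readability filter: reject extremely long or language-mixed traces.
    (The isascii() check is a stand-in for a real language-consistency filter.)"""
    return len(completion) < 20_000 and completion.isascii()

def build_cold_start_data(
    prompts: list[str],
    answers: list[str],
    generate: Callable[[str], str],             # hypothetical: sample from an R1-Zero-style model
    verify_answer: Callable[[str, str], bool],  # hypothetical: check answer against ground truth
    max_examples: int = 5000,                   # "a few thousand" filtered completions
) -> list[dict[str, str]]:
    """Collect verifiably correct, readable reasoning traces as SFT examples."""
    dataset: list[dict[str, str]] = []
    for prompt, gold in zip(prompts, answers):
        completion = generate(prompt)
        if verify_answer(completion, gold) and is_readable(completion):
            dataset.append({"prompt": prompt, "completion": completion})
        if len(dataset) >= max_examples:
            break
    return dataset
```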
RL for reasoning models is built on a simple idea: reward the model for getting correct answers to problems where correctness can be checked. DeepSeek mentions three reward components during the reasoning phase of RL: accuracy rewards, format rewards, and language-consistency rewards. DeepSeek uses the RL algorithm they introduced, Group Relative Policy Optimization (GRPO), which keeps the PPO update rule but replaces the separate value model held in memory with Monte Carlo advantage estimates computed over a group of sampled completions. The nature of the reward setup (and the data) is the key to this sort of reasoning training, and many of the small RL details can be substituted for one another.
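The core of GRPO's value-free advantage estimate is easy to state in code: score a group of completions sampled for the same prompt, then normalize each completion's reward against the group's mean and standard deviation. The sketch below shows only that computation; the reward values in the usage example are made up.

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage estimate: judge each sampled completion relative to the
    other completions drawn for the same prompt, instead of querying a learned value model."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:  # a uniform group carries no learning signal
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

if __name__ == "__main__":
    # Made-up scores for a group of 4 completions sampled from one prompt, each the
    # sum of accuracy + format + language-consistency rewards.
    rewards = [2.0, 0.0, 3.0, 1.0]
    advantages = group_relative_advantages(rewards)
    print(advantages)  # correct, well-formatted completions get positive advantages
    # These advantages then plug into the standard PPO-style policy update.
```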
Rejection sampling is used to begin reintroducing general capabilities into the model. All in, we currently have very few details here, and there is a lot of open space to learn (and likely improve).
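Under the usual interpretation of rejection sampling, the loop looks something like the sketch below: sample several completions per prompt, keep the best-scoring one, and reuse the kept pairs as SFT data. The `generate` and `score` callables are hypothetical stand-ins for the RL checkpoint and a reward model or verifier.

```python
import random
from typing import Callable

def rejection_sample(
    prompt: str,
    generate: Callable[[str], str],       # stand-in for sampling from the RL checkpoint
    score: Callable[[str, str], float],   # stand-in for a reward model or verifier
    n_samples: int = 16,
) -> str:
    """Draw several completions for one prompt and keep the best-scoring one.
    The retained (prompt, completion) pairs become supervised fine-tuning data."""
    completions = [generate(prompt) for _ in range(n_samples)]
    return max(completions, key=lambda c: score(prompt, c))

if __name__ == "__main__":
    # Dummy generator/scorer just to show the control flow.
    dummy_generate = lambda p: f"answer-{random.randint(0, 9)}"
    dummy_score = lambda p, c: float(c[-1])
    print(rejection_sample("What is 2+2?", dummy_generate, dummy_score))
```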
DeepSeek R1 then goes back to reinforcement learning, this time aimed at improving the model's helpfulness and harmlessness while simultaneously refining its reasoning capabilities. Details on this phase are sparse as well; as this grows into a larger area of research and development, the open questions will slowly be answered.
DeepSeek R1's release marks a significant step forward in the field of reasoning language models. Its open-source nature encourages collaboration and accelerates innovation. However, key questions remain, particularly around the training data and the finer details of each stage.
DeepSeek R1 is not the only approach to training these models, but it is a recipe that others can immediately use as a foundation. As the community continues to develop datasets and infrastructure, and similar reports continue to emerge from other organizations, the future of reasoning language models looks promising.
For those interested in further exploration, check out the Inference & Reasoning tag.