The AI landscape is constantly evolving, and recent announcements from DeepSeek have sent shockwaves through the industry, particularly concerning competition between the U.S. and China. This article delves into DeepSeek, its innovative models, and the implications for the future of AI.
Before diving into the details, it’s crucial to address the initial reactions to DeepSeek's announcements. Much like the surprise that greeted Huawei's 7nm chip in the Mate 60 Pro, the significance of DeepSeek's advancements wasn't immediately clear to many. The key takeaway is that the details of DeepSeek’s accomplishments matter less than the reaction, and what that reaction says about people’s pre-existing assumptions.
DeepSeek isn't just one model; it is a series of models, each building on the last with increasing sophistication. The key breakthroughs arrived with DeepSeek-V2, which introduced two important innovations: DeepSeekMoE, a more efficient mixture-of-experts architecture, and DeepSeekMLA, multi-head latent attention, which sharply reduces the memory consumed by the key-value cache during inference.
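To make the mixture-of-experts idea concrete, here is a minimal sketch of generic top-k expert routing in Python. All sizes are made up, and this is the vanilla technique rather than DeepSeek's specific DeepSeekMoE design:

```python
import numpy as np

# Toy top-k mixture-of-experts routing. A router scores each token
# against every expert, but only the top-k experts actually run, so
# most parameters stay idle on any given token. Sizes are made up;
# this is generic MoE, not DeepSeek's exact DeepSeekMoE design.
rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 8, 2

gate_w = rng.normal(size=(d_model, n_experts))             # router weights
expert_w = rng.normal(size=(n_experts, d_model, d_model))  # one simplified FFN per expert

def moe_forward(x):                        # x: one token's hidden state, shape (d_model,)
    logits = x @ gate_w                    # router scores, shape (n_experts,)
    chosen = np.argsort(logits)[-top_k:]   # indices of the top-k experts
    weights = np.exp(logits[chosen])
    weights /= weights.sum()               # softmax over the chosen experts only
    # Only top_k / n_experts of the expert parameters are ever touched,
    # which is how a huge total parameter count stays cheap per token.
    return sum(w * (x @ expert_w[i]) for w, i in zip(weights, chosen))

print(moe_forward(rng.normal(size=d_model)).shape)  # (16,)
```

DeepSeekMoE refines this picture with much finer-grained experts plus shared experts that every token uses, but the per-token economics are the same: total parameters can grow far faster than per-token compute.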
The true impact of V2's innovations became apparent with the release of V3. V3 added a new approach to load balancing that reduced communications overhead, and it used multi-token prediction in training, which increased efficiency again. DeepSeek claimed the final training run for V3 cost just $5.576 million, a figure that was met with widespread incredulity.
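As a rough illustration of the multi-token prediction idea, the toy sketch below adds a second head that predicts the token two positions ahead, so each forward pass yields two tokens of training signal. This is a deliberate simplification, not DeepSeek's actual, more elaborate MTP module:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy multi-token prediction: alongside the usual next-token head,
# a second head predicts the token two positions ahead, so each
# forward pass yields two tokens of training signal.
d, vocab = 32, 100
trunk = nn.Embedding(vocab, d)   # stand-in for the transformer trunk
head1 = nn.Linear(d, vocab)      # standard head: predict token t+1
head2 = nn.Linear(d, vocab)      # extra head: predict token t+2

tokens = torch.randint(0, vocab, (4, 16))   # a batch of token ids
h = trunk(tokens[:, :-2])                   # hidden states for positions 0..T-3
loss = (F.cross_entropy(head1(h).transpose(1, 2), tokens[:, 1:-1]) +
        F.cross_entropy(head2(h).transpose(1, 2), tokens[:, 2:]))
loss.backward()  # one forward pass, two tokens of supervision
```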
DeepSeek is clear that these costs cover only the final training run and exclude all other expenses, such as prior research and ablation experiments on architectures, algorithms, and data.
The company publicly stated that training DeepSeek-V3 on each trillion tokens requires only 180K H800 GPU hours, i.e., 3.7 days on their cluster with 2048 H800 GPUs. Consequently, their pre-training stage is completed in less than two months and costs 2664K GPU hours. Combined with 119K GPU hours for the context length extension and 5K GPU hours for post-training, DeepSeek-V3 costs only 2.788M GPU hours for its full training. This results in a cost of $5.576M, assuming the rental price of the H800 GPU is $2 per GPU hour.
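The arithmetic checks out. Every number below is DeepSeek's own, taken from the figures quoted above (plus the 14.8T training-token count from the V3 report); the $2-per-GPU-hour H800 rental price is their stated assumption, not a market quote:

```python
# Checking DeepSeek's arithmetic using only the figures quoted above.
hours_per_trillion = 180_000   # H800 GPU hours per trillion training tokens
tokens_trillions = 14.8        # V3's reported training corpus

pre_training = hours_per_trillion * tokens_trillions  # 2,664,000 GPU hours
context_ext = 119_000                                 # context length extension
post_training = 5_000

total = pre_training + context_ext + post_training
print(f"{total:,.0f} GPU hours")                      # 2,788,000
print(f"${total * 2 / 1e6:.3f}M")                     # $5.576M at $2/hour
print(f"{hours_per_trillion / 2048 / 24:.1f} days per trillion tokens")  # ~3.7
```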
This was achieved through their optimized co-design of algorithms, frameworks, and hardware. V3 has 671 billion parameters, but it uses Mixture of Experts so that only the 37 billion parameters in the active experts are computed for any given token; parameters are stored in BF16 or FP32 precision, but are reduced to FP8 precision for calculations.
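That precision split is the classic "store high, compute low" pattern. Real FP8 kernels are hardware-specific, so as a stand-in the sketch below simulates the idea with simple int8 affine quantization, which shows the same memory-for-accuracy tradeoff:

```python
import numpy as np

# "Store high precision, compute low precision," simulated with int8
# affine quantization as a stand-in for FP8 (real FP8 kernels are
# hardware-specific). The tradeoff is the same: 1 byte per parameter
# during the matmul at the cost of a small quantization error.
rng = np.random.default_rng(0)
w_master = rng.normal(size=(64, 64)).astype(np.float32)  # high-precision master weights

scale = np.abs(w_master).max() / 127.0
w_q8 = np.round(w_master / scale).astype(np.int8)        # 8-bit copy used for compute

x = rng.normal(size=(1, 64)).astype(np.float32)
y_full = x @ w_master
y_q8 = (x @ w_q8.astype(np.float32)) * scale             # dequantize the result

print(np.abs(y_full - y_q8).max())  # small error for 4x less memory than FP32
```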
DeepSeek engineers programmed 20 of the 132 processing units on each H800 specifically to manage cross-chip communications. This is impossible to do in CUDA; the engineers had to drop down to PTX, a low-level instruction set for Nvidia GPUs that is basically like assembly language. All of these design decisions only make sense if you are constrained to the H800, the export-compliant variant of the H100 with sharply reduced chip-to-chip interconnect bandwidth, which is exactly why dedicating compute to communications management pays off.
V3 is competitive with OpenAI’s GPT-4o and Anthropic’s Claude 3.5 Sonnet, and appears to be better than Meta's biggest Llama model. DeepSeek was able to distill those models to give V3 high-quality tokens to train on.
Distillation involves extracting understanding from another model by sending inputs to the teacher model, recording the outputs, and using them to train the student model. This lets companies optimize models for cheap inference, while letting everyone else free-ride on the leading edge's training investments. Distillation is probably the core economic factor causing the slow divorce of Microsoft and OpenAI.
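Here is a minimal distillation sketch in PyTorch. The teacher and student are stand-in linear layers; a real pipeline would send prompts to a hosted teacher model, record its completions (or, when exposed, its output distributions, as below), and train the student to match them:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal distillation sketch: collect the teacher's outputs, then
# train the student to reproduce them.
teacher = nn.Linear(32, 100)  # frozen "teacher" producing output distributions
student = nn.Linear(32, 100)  # cheaper "student" being trained
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

for _ in range(100):
    prompts = torch.randn(8, 32)  # stand-in for encoded inputs
    with torch.no_grad():
        targets = F.softmax(teacher(prompts), dim=-1)       # recorded teacher outputs
    log_probs = F.log_softmax(student(prompts), dim=-1)
    loss = F.kl_div(log_probs, targets, reduction="batchmean")  # match the teacher
    opt.zero_grad()
    loss.backward()
    opt.step()
```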
R1 is a reasoning model like OpenAI’s o1, but it has open weights, meaning that instead of paying OpenAI to get reasoning, you can run R1 on the server of your choice, or even locally, at dramatically lower cost.
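Because the weights are open, R1 can be served with off-the-shelf open-source tooling. As an illustration, the snippet below loads one of the smaller distilled R1 checkpoints DeepSeek published on Hugging Face (the full 671B model needs a multi-GPU server); the model identifier reflects the name used at release and may change:

```python
# Loads a small distilled R1 checkpoint; the 671B original needs
# a multi-GPU server, but the workflow is the same.
from transformers import pipeline

pipe = pipeline("text-generation",
                model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")
out = pipe("How many primes are between 1 and 20? Think step by step.",
           max_new_tokens=512)
print(out[0]["generated_text"])
```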
DeepSeek made two models: R1 and R1-Zero. R1-Zero is arguably the bigger deal, because it skips supervised fine-tuning entirely and uses pure reinforcement learning (RL).
Reinforcement learning is a technique where a machine learning model is given a bunch of data and a reward function. In this case, DeepSeek gave the model a set of math, code, and logic questions and set two reward functions: one rewarding the right answer, and one rewarding a response format that laid out its thinking process.
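A minimal sketch of what such rule-based rewards can look like is below. The <think>/<answer> tags follow the template described in the R1 paper, but the exact scoring here is an illustrative assumption:

```python
import re

# Sketch of the two rule-based rewards described above: one for a
# correct answer, one for using an explicit thinking block.
def accuracy_reward(completion: str, ground_truth: str) -> float:
    m = re.search(r"<answer>(.*?)</answer>", completion, re.S)
    return 1.0 if m and m.group(1).strip() == ground_truth else 0.0

def format_reward(completion: str) -> float:
    ok = re.fullmatch(r"<think>.+?</think>\s*<answer>.+?</answer>",
                      completion.strip(), re.S)
    return 1.0 if ok else 0.0

sample = "<think>2 + 2 = 4</think> <answer>4</answer>"
print(accuracy_reward(sample, "4"), format_reward(sample))  # 1.0 1.0
```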
What emerged is a model that developed reasoning and chains-of-thought on its own, including what DeepSeek called “Aha Moments”.
However, DeepSeek-R1-Zero encounters challenges such as poor readability and language mixing. To address these issues and further enhance reasoning performance, DeepSeek-R1 incorporates a small amount of cold-start data and a multi-stage training pipeline.
This is similar to what OpenAI did for o1: DeepSeek started the model out with a bunch of examples of chain-of-thought thinking so it could learn the proper format for human consumption, and then did the reinforcement learning to enhance its reasoning, along with a number of editing and refinement steps.
The core takeaway is that AI models are teaching AI models, and AI models are teaching themselves. We are watching the assembly of an AI takeoff scenario in real time.
DeepSeek’s efficiency and broad availability cast significant doubt on the most optimistic Nvidia growth story, at least in the near term. The payoffs from both model and infrastructure optimization also suggest there are significant gains to be had from exploring alternative approaches to inference in particular.
DeepSeek CEO Liang Wenfeng has stated that open source is key to attracting talent and creating a strong technical ecosystem. This approach contrasts with the strategies of some U.S. companies that advocate for regulation and closed-source models; on that score, OpenAI's greatest crime was its role in the push for regulation that culminated in the 2023 Executive Order on AI.
The key question remains: Will the U.S. double down on defensive measures, or realize that there is real competition and actually give ourselves permission to compete by cutting out all of the cruft in our companies that has nothing to do with winning?