The AI world was recently shaken by DeepSeek, a Chinese AI company that demonstrated it could train a high-performing foundation model, DeepSeek-V3, on a relatively small cluster of export-restricted ("crippled") Nvidia H800 GPUs. This achievement challenges the assumption that training state-of-the-art AI models requires massive infrastructure and vast financial resources.
Founded in May 2023 by Liang Wenfeng, DeepSeek-AI emerged from High-Flyer AI, a hedge fund that uses AI algorithms for financial trading. The company largely stayed under the radar until August 2024, when it published a paper describing a new kind of load balancer it had created. Then, over the holidays, it published the architectural details of its DeepSeek-V3 foundation model, which comprises 671 billion parameters (only 37 billion of which are active per token) and was trained on 14.8 trillion tokens.
Adding to the intrigue, DeepSeek released its DeepSeek-R1 model, which incorporates reinforcement learning and supervised fine-tuning to enhance reasoning capabilities; DeepSeek charges 6.5X more for the R1 model than for the base V3 model. What sets DeepSeek apart is its commitment to open source: the company has made the source code for its V3, R1, and V2 models available on GitHub.
The core question is how DeepSeek achieved performance comparable to the likes of OpenAI's GPT-4, Google's Gemini 1.5, and Anthropic's Claude 3.5 while using significantly less hardware. DeepSeek-V3 was trained using only 2,048 Nvidia H800 GPUs (specifically, the H800 SXM5 version, which has its FP64 performance capped), racking up 2.79 million GPU-hours at an estimated cost of $5.58 million.
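Those figures are internally consistent, and a quick back-of-the-envelope check makes the scale concrete. The short Python sketch below only divides the numbers quoted above; the implied ~$2 per GPU-hour rental rate and ~57-day wall-clock time are derived values, not figures stated by DeepSeek here.

```python
# Back-of-the-envelope check of the published DeepSeek-V3 training figures.
gpu_hours = 2.79e6     # total GPU-hours reported
num_gpus = 2048        # Nvidia H800 GPUs in the cluster
total_cost = 5.58e6    # estimated training cost in USD

cost_per_gpu_hour = total_cost / gpu_hours   # ~$2.00 per GPU-hour
wall_clock_days = gpu_hours / num_gpus / 24  # ~57 days if all GPUs run nonstop

print(f"Implied rate: ${cost_per_gpu_hour:.2f}/GPU-hour")
print(f"Implied duration: {wall_clock_days:.0f} days on {num_gpus} GPUs")
```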
DeepSeek's cluster consisted of 256 server nodes, each equipped with eight H800 GPUs interconnected via NVSwitch. While the exact speed of the InfiniBand adapters linking the nodes remains unconfirmed, the cluster's size underscores the relatively modest scale of DeepSeek's infrastructure.
Numbers like these invite skepticism, yet the DeepSeek-V3 paper details several key innovations that explain them:
DualPipe Communication Accelerator: DeepSeek repurposed 20 of the 132 streaming multiprocessors (SMs) on the Hopper GPU to function as a communication accelerator and scheduler. As the V3 paper puts it, the goal is to "overlap between computation and communication to hide the communication latency during computation" (see the overlap sketch after this list).
Pipeline Parallelism and Data Parallelism: The model utilizes these techniques with optimized memory management, negating the need for tensor parallelism.
Auxiliary-Loss-Free Load Balancing: Efficiently routes tokens to experts within the Mixture of Experts (MoE) architecture without the auxiliary balancing loss that MoE models typically rely on (see the routing sketch after this list).
FP8 Low-Precision Processing: Optimizes memory bandwidth and makes better use of the H800's 80 GB of memory. Accuracy is preserved through tile-wise and block-wise quantization, which adjusts the numerical range separately for each tile or block of values rather than for an entire tensor (see the quantization sketch after this list).
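To make the first item concrete, the sketch below shows the general pattern of hiding communication latency behind computation, here using an asynchronous all-to-all in PyTorch. This is only an illustration of the idea: DeepSeek's actual DualPipe schedule and its SM-level communication kernels are far more elaborate, and the `expert_ffn` callable and tensor shapes are placeholder assumptions.

```python
# Illustration of overlapping MoE all-to-all communication with computation.
# This is NOT DeepSeek's DualPipe implementation; it only shows the pattern
# of hiding communication latency behind useful compute.
import torch
import torch.distributed as dist  # assumes a process group is initialized

def overlapped_dispatch(tokens_to_send: torch.Tensor,
                        local_tokens: torch.Tensor,
                        expert_ffn) -> torch.Tensor:
    """tokens_to_send: tokens routed to experts on other ranks.
    local_tokens: work that can proceed while the network is busy.
    expert_ffn: any callable standing in for an expert's feed-forward block."""
    recv_buffer = torch.empty_like(tokens_to_send)

    # Launch the all-to-all without blocking.
    handle = dist.all_to_all_single(recv_buffer, tokens_to_send, async_op=True)

    # Do useful work while the dispatch is in flight.
    local_out = expert_ffn(local_tokens)

    # Only now wait for the communication to complete.
    handle.wait()
    remote_out = expert_ffn(recv_buffer)
    return torch.cat([local_out, remote_out], dim=0)
```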
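The auxiliary-loss-free idea can also be sketched in a few lines: add a per-expert bias to the scores used for top-k expert selection and nudge that bias up or down according to each expert's recent load, instead of adding a balancing term to the training loss. The gating function, update step `gamma`, and tensor shapes below are simplified assumptions, not DeepSeek's code.

```python
import torch

def biased_topk_routing(scores: torch.Tensor, bias: torch.Tensor,
                        k: int = 8, gamma: float = 1e-3):
    """scores: (num_tokens, num_experts) affinity scores from the gate.
    bias: (num_experts,) running bias used only for expert selection.
    Returns chosen expert ids, gate weights, and the updated bias."""
    # Select experts with the biased scores ...
    _, topk_idx = torch.topk(scores + bias, k, dim=-1)

    # ... but weight expert outputs with the original, unbiased scores,
    # so the bias steers load without distorting the model's output.
    gate_weights = torch.gather(torch.softmax(scores, dim=-1), 1, topk_idx)

    # Count how many tokens each expert received in this batch.
    load = torch.bincount(topk_idx.flatten(),
                          minlength=scores.shape[1]).float()

    # Overloaded experts become less attractive, underloaded ones more so;
    # no auxiliary loss term ever touches the gradients.
    bias = bias - gamma * torch.sign(load - load.mean())

    return topk_idx, gate_weights, bias
```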
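Finally, tile-wise and block-wise quantization means keeping one scaling factor per small tile or block of a tensor instead of one per tensor, so a single outlier cannot blow out the FP8 dynamic range for everything else. The sketch below simulates this for a weight matrix using 128x128 blocks (the block size follows the paper's description, but the helper functions are illustrative and require PyTorch 2.1+ for the float8 dtype).

```python
import torch

FP8_E4M3_MAX = 448.0  # largest magnitude representable in the e4m3 format

def blockwise_quantize(weights: torch.Tensor, block: int = 128):
    """Quantize a 2-D weight matrix with one scale per (block x block) tile."""
    rows, cols = weights.shape
    assert rows % block == 0 and cols % block == 0, "sketch assumes exact tiling"

    # View the matrix as a grid of (block x block) tiles.
    tiles = weights.reshape(rows // block, block, cols // block, block)
    tiles = tiles.permute(0, 2, 1, 3)  # (tile_row, tile_col, block, block)

    # One scale per tile: map the tile's largest magnitude onto the FP8 range.
    amax = tiles.abs().amax(dim=(-1, -2), keepdim=True).clamp(min=1e-12)
    scales = amax / FP8_E4M3_MAX

    # Cast the rescaled tiles to FP8 and keep the scales for dequantization.
    return (tiles / scales).to(torch.float8_e4m3fn), scales

def blockwise_dequantize(q: torch.Tensor, scales: torch.Tensor, block: int = 128):
    """Recover an approximate full-precision matrix where needed."""
    tiles = q.to(torch.float32) * scales
    n_tr, n_tc = tiles.shape[0], tiles.shape[1]
    return tiles.permute(0, 2, 1, 3).reshape(n_tr * block, n_tc * block)
```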
DeepSeek's approach holds significant implications for how much hardware, and how much money, training a frontier-scale model really requires.
However, the claim of achieving comparable performance with significantly less hardware remains contentious. It is important that other organizations with ample resources attempt to replicate DeepSeek's results and validate the findings.
The DeepSeek-R1 model refines the V3 model by feeding the outputs of other AI models into reinforcement learning and supervised fine-tuning processes to improve the "reasoning patterns" of V3. DeepSeek explains that they "distill the reasoning capability from the DeepSeek-R1 series of models, and meanwhile carefully maintain the balance between model accuracy and generation length."
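In practice, distillation of this kind amounts to generating reasoning traces with a stronger "teacher" model and fine-tuning the base model on those traces with an ordinary next-token loss. The sketch below is a generic illustration of that recipe using Hugging Face transformers; the model names are placeholders, and nothing here reflects DeepSeek's actual training pipeline.

```python
# Generic teacher -> student distillation via supervised fine-tuning.
# Model names are placeholders; this is not DeepSeek's pipeline.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

teacher_name = "placeholder/reasoning-teacher"  # an R1-style reasoning model
student_name = "placeholder/base-model"         # a V3-style base model

teacher_tok = AutoTokenizer.from_pretrained(teacher_name)
student_tok = AutoTokenizer.from_pretrained(student_name)
teacher = AutoModelForCausalLM.from_pretrained(teacher_name).eval()
student = AutoModelForCausalLM.from_pretrained(student_name)
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)

prompts = ["Prove that the square root of 2 is irrational."]  # toy data

for prompt in prompts:
    # 1. The teacher produces a full reasoning-plus-answer trace.
    inputs = teacher_tok(prompt, return_tensors="pt")
    with torch.no_grad():
        trace_ids = teacher.generate(**inputs, max_new_tokens=512)
    trace_text = teacher_tok.decode(trace_ids[0], skip_special_tokens=True)

    # 2. The student is fine-tuned to reproduce that trace with a standard
    #    next-token (cross-entropy) loss, i.e. plain supervised fine-tuning.
    student_ids = student_tok(trace_text, return_tensors="pt").input_ids
    loss = student(input_ids=student_ids, labels=student_ids).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```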
DeepSeek's approach to AI model training highlights the potential for algorithmic innovation to compensate for hardware limitations, even on a modest cluster of export-restricted Nvidia H800 GPUs. While skepticism remains about the extent of the performance gains, the techniques DeepSeek has developed could reshape the future of AI development, paving the way for more efficient and accessible training methodologies.