DeepSeek, a Chinese AI lab, has recently captured the attention of the global AI community. This article dives deep into the discussions surrounding DeepSeek, exploring its cost structure, training methodologies, and the ripple effects its advancements have on the broader AI landscape.
Over the past few weeks, DeepSeek has dominated conversations within the AI world. This surge in interest isn't entirely new; SemiAnalysis has been covering DeepSeek's progress for months. However, the recent explosion of hype seems disproportionate to the reality of the company's current standing.
One significant shift in the narrative is the focus on DeepSeek's efficiency. Initially, there was skepticism that DeepSeek had somehow broken scaling laws, which SemiAnalysis previously addressed in its Scaling Laws report. Now, the concern is that DeepSeek's algorithmic improvements are happening too rapidly, potentially creating overcapacity in the GPU market and negatively impacting companies like Nvidia. While this overstates the impact, the reality is that these models are inducing demand, as shown by observable effects on H100 and H200 pricing.
DeepSeek's origins trace back to High-Flyer, a Chinese hedge fund that recognized the potential of AI in trading algorithms. High-Flyer's early investment in GPUs, including 10,000 A100 GPUs in 2021, proved to be a strategic advantage, especially before export restrictions became a significant factor. In May 2023, High-Flyer spun off DeepSeek to focus exclusively on advancing AI capabilities. Today, the two entities share resources, both human and computational, indicating that DeepSeek is a serious undertaking backed by substantial investment. It's estimated that High-Flyer's and DeepSeek's combined GPU investments exceed $500 million USD.
DeepSeek's access to substantial GPU resources is critical to its success. It's estimated they have access to approximately 50,000 Hopper architecture GPUs, including about 10,000 H800s and 10,000 H100s. These GPUs are shared between High-Flyer and DeepSeek, distributed across various locations, and utilized for trading, inference, training, and research. For a more detailed breakdown, refer to SemiAnalysis's Accelerator Model.
SemiAnalysis estimates DeepSeek's total server Capital Expenditure (CapEx) to be around $1.6 billion, with operating costs adding another $944 million. They further indicate that DeepSeek, like other leading AI labs and hyperscalers, maintains a larger pool of GPUs than are used for individual training runs.
DeepSeek's talent acquisition strategy focuses on capability and curiosity rather than pedigree, sourcing talent from top Chinese universities such as Peking University (PKU) and Zhejiang University. They offer competitive salaries, reportedly exceeding $1.3 million USD for promising candidates, and attract individuals with the lure of access to thousands of GPUs without usage limitations. With a relatively small team of around 150 employees, DeepSeek operates with agility and focus, allowing for rapid innovation. The company also distinguishes itself by running its own datacenters, granting greater control and opportunities for infrastructure-level innovation.
The widely cited "$6 million" training cost for DeepSeek V3 is misleading. This figure only represents the GPU cost of the pre-training run and excludes significant expenses such as R&D, hardware Total Cost of Ownership (TCO), and the cost of experimentation. Developing innovations like Multi-Head Latent Attention (MLA) requires significant time, manpower, and GPU resources. In reality, DeepSeek's hardware expenditure has likely exceeded $500 million over its history. This is consistent with other industry players: Anthropic's Claude 3.5 Sonnet cost tens of millions of dollars to train, yet Anthropic still raises billions because staying on the bleeding edge requires far more resources than any single training run. The company needs funds to experiment, develop new architectures, gather and clean data, and pay employees, among other things.
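For context, the headline number can be reproduced from figures DeepSeek itself reported for V3: roughly 2.788 million H800 GPU-hours priced at an assumed $2 per GPU-hour rental rate. The sketch below walks through that arithmetic and contrasts it with a purely illustrative ownership-cost calculation; the cluster size matches the reported ~2,048 H800s, but the per-GPU CapEx and operating figures are assumptions for illustration, not SemiAnalysis estimates.

```python
# Back-of-the-envelope reproduction of the headline pre-training cost,
# using the GPU-hour figure DeepSeek reported for V3 and an assumed
# $2/hour H800 rental rate. The TCO comparison below is illustrative;
# its per-GPU inputs are assumptions, not SemiAnalysis's numbers.

gpu_hours = 2.788e6          # reported total H800 GPU-hours for the V3 run
rental_rate = 2.00           # assumed $ per GPU-hour rental price

headline_cost = gpu_hours * rental_rate
print(f"Headline pre-training cost: ${headline_cost / 1e6:.1f}M")  # ~ $5.6M

# What the headline excludes: owning and operating the cluster.
gpus = 2048                  # reported cluster size for the V3 run
capex_per_gpu = 35_000       # assumed all-in server cost per GPU ($)
annual_opex_per_gpu = 7_000  # assumed power, cooling, networking, staff ($)
years = 4                    # assumed amortization period

tco = gpus * (capex_per_gpu + annual_opex_per_gpu * years)
print(f"Illustrative 4-year cluster TCO: ${tco / 1e6:.0f}M")
```

Even under these rough assumptions, the ownership view lands well above $100 million for a single cluster, which is why the "$6 million" figure should not be read as DeepSeek's total spend.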
DeepSeek's V3 model is impressive, particularly when compared to older models like GPT-4o (released in May 2024). Algorithmic improvements mean that less compute is needed to train and run inference on models of a given capability, and this pattern plays out over and over again. Algorithmic progress in language models is estimated at roughly 4x per year, meaning that with each passing year only a quarter as much compute is needed to achieve the same capability; Dario Amodei, CEO of Anthropic, has argued that algorithmic advancements are even faster and can yield a 10x improvement. Inference costs for GPT-3-quality models have fallen by 1200x.
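The compounding arithmetic behind those rates is worth spelling out. The sketch below uses only the figures quoted above (4x per year, Amodei's ~10x, and the ~1200x drop in GPT-3-quality inference costs); the assumption that the 1200x drop played out over roughly three years is mine, for illustration only.

```python
# How quickly fixed-capability compute requirements shrink under
# different annual improvement rates, and what an observed cost drop
# implies about the effective annual rate.

def compute_fraction(rate_per_year: float, years: float) -> float:
    """Fraction of original compute needed after `years` of improvement."""
    return 1.0 / (rate_per_year ** years)

for rate in (4.0, 10.0):
    frac = compute_fraction(rate, 2)
    print(f"{rate:>4.0f}x/year -> after 2 years, {frac * 100:.2f}% of original compute")

# Working backwards (illustrative assumption: the drop took ~3 years):
implied_rate = 1200 ** (1 / 3)
print(f"1200x over ~3 years implies roughly {implied_rate:.1f}x per year")
```

Read this way, the 1200x figure is broadly consistent with the faster end of the estimates above.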
DeepSeek's R1 model has achieved results comparable to o1, highlighting the rapid progress in reasoning capabilities. This new paradigm, centered on reasoning capabilities developed through synthetic data generation and Reinforcement Learning (RL) in post-training, allows for quicker gains at a lower price. However, there is a degree of cat and mouse with benchmarks: DeepSeek tends not to cite the benchmarks on which it is not leading.
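In its published work on reasoning models, DeepSeek has described using Group Relative Policy Optimization (GRPO), which scores a group of sampled responses and normalizes rewards within the group rather than training a separate value model. The snippet below is a minimal sketch of just that advantage computation on toy, verifiable rewards; it is a simplification for illustration, not DeepSeek's training code.

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray) -> np.ndarray:
    """Group-relative advantages: normalize each sampled response's reward
    against the mean and std of its own group, instead of using a critic."""
    mean, std = rewards.mean(), rewards.std()
    return (rewards - mean) / (std + 1e-8)

# Toy example: six responses sampled for one prompt, scored by a
# verifiable reward (e.g., 1.0 if the final answer checks out, else 0.0).
rewards = np.array([1.0, 0.0, 1.0, 1.0, 0.0, 0.0])
print(grpo_advantages(rewards))
# Correct responses get positive advantages, incorrect ones negative;
# the policy update then upweights the tokens of the better responses.
```

Because the reward here can be checked mechanically (did the answer verify or not), the same loop also generates synthetic training data as a byproduct, which is what makes this post-training recipe comparatively cheap.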
DeepSeek's success can be attributed to several key technical innovations: Multi-Head Latent Attention (MLA), which sharply reduces the memory footprint of the KV cache during inference; a fine-grained Mixture-of-Experts (MoE) architecture that activates only a small fraction of total parameters for each token; multi-token prediction; and FP8 mixed-precision training. A generic sketch of MoE-style expert routing follows below.
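To make the MoE idea concrete, here is an illustrative top-k routing step of the kind used by MoE language models in general. The dimensions, gating scheme, and weights are placeholders; this is not DeepSeek's implementation.

```python
import numpy as np

def moe_route(x, gate_w, num_active=2):
    """Generic top-k MoE routing for a single token (illustrative only).

    x:       (d_model,) token hidden state
    gate_w:  (d_model, num_experts) router weights
    Returns the chosen expert indices and their normalized gate weights.
    """
    logits = x @ gate_w                          # score every expert
    top = np.argsort(logits)[-num_active:]       # keep the top-k experts
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()                     # softmax over the chosen experts
    return top, weights

rng = np.random.default_rng(0)
d_model, num_experts = 16, 8
token = rng.standard_normal(d_model)
gate = rng.standard_normal((d_model, num_experts))

experts, weights = moe_route(token, gate)
print(experts, weights)
# Only `num_active` expert FFNs run for this token, which is how MoE models
# keep per-token compute far below what their total parameter count implies.
```

This sparsity is the reason a very large MoE model can be served at a per-token cost closer to that of a much smaller dense model.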
DeepSeek's advancements have significant implications for the AI industry, particularly in terms of margins, commoditization of capabilities, and the ongoing pursuit of stronger AI models. DeepSeek's recent open-weights releases are impressive, and the coming months and years will show what comes of them.
The industry should, with eyes wide open, recognize both DeepSeek's potential dominance and the impact that export controls aimed at China will have on its trajectory.