DeepSeek has emerged as a significant player in the AI landscape, particularly with the release of DeepSeek v3. The model achieves state-of-the-art (SOTA) performance among open-weight models while using only about 2.8 million H800 GPU-hours of training compute, a remarkably modest budget. This article delves into the architectural innovations that underpin DeepSeek's success, highlighting the key improvements over traditional Transformer models.
DeepSeek v3 stands out not only for its performance but also for its training efficiency: it required roughly ten times less training compute than Llama 3.1 405B. The technical report accompanying the release provides valuable insights into the model's architecture and the engineering challenges involved in its training.
DeepSeek's advancements can be attributed to a handful of innovative architectural choices, primarily Multi-head Latent Attention (MLA), refinements to the Mixture-of-Experts (MoE) architecture, and multi-token prediction.
MLA is arguably the most significant architectural innovation, initially introduced in DeepSeek v2. It offers a superior method for KV cache reduction compared to traditional techniques such as grouped-query attention and multi-query attention.
The KV cache is crucial for efficient inference in Transformer models. During sequential token generation, the model needs context from all past tokens. Caching the key and value vectors avoids redundant computations, significantly speeding up the process.
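To make the cache's role concrete, here is a minimal sketch of incremental decoding with a KV cache in plain PyTorch; the single-head setup, the `attend` helper, and the dictionary-based cache are illustrative choices, not DeepSeek's implementation.

```python
import torch

def attend(q, K, V):
    # q: (batch, 1, d), K/V: (batch, t, d) -> standard scaled dot-product attention
    scores = q @ K.transpose(-2, -1) / K.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ V

def decode_step(x_t, W_q, W_k, W_v, cache):
    """One decoding step: project only the new token, append its key/value to the
    cache, and attend over everything cached so far instead of recomputing the past."""
    q = x_t @ W_q                       # query for the current token only
    k, v = x_t @ W_k, x_t @ W_v         # key/value for the current token only
    cache["K"] = k if cache["K"] is None else torch.cat([cache["K"], k], dim=1)
    cache["V"] = v if cache["V"] is None else torch.cat([cache["V"], v], dim=1)
    return attend(q, cache["K"], cache["V"])

# Toy usage: d_model = 16, generate 5 tokens one at a time.
d = 16
W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))
cache = {"K": None, "V": None}
for _ in range(5):
    x_t = torch.randn(1, 1, d)          # embedding of the newest token
    out = decode_step(x_t, W_q, W_k, W_v, cache)
print(cache["K"].shape)                  # (1, 5, 16): the cache grows with sequence length
```

The cache grows linearly with context length, which is exactly why reducing its per-token footprint matters for long-context inference.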
Unlike grouped-query attention, which shrinks the cache by sharing keys and values across query heads at some cost to quality, MLA aims to reduce KV cache size without compromising model quality. It splits the computation of key and value vectors into a two-step, low-rank factorization: each token is first compressed into a small latent vector, which is what gets cached, and the full keys and values are reconstructed from that latent when attention is computed.
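A simplified sketch of that two-step, low-rank computation is below, assuming a single head and omitting the separate rotary-embedding path that the real MLA uses; the weight names (`W_down`, `W_uk`, `W_uv`) and the dimensions are illustrative.

```python
import torch

d_model, d_latent, d_head = 1024, 64, 128   # d_latent << d_model is the whole point

W_down = torch.randn(d_model, d_latent)     # shared down-projection to the latent
W_uk   = torch.randn(d_latent, d_head)      # up-projection to keys
W_uv   = torch.randn(d_latent, d_head)      # up-projection to values

x = torch.randn(1, 10, d_model)             # hidden states for 10 tokens

# Step 1: compress each token into a small latent vector -- this is what gets cached.
c_kv = x @ W_down                           # (1, 10, 64) instead of two (1, 10, 128) tensors

# Step 2: reconstruct keys and values from the cached latent when attention needs them.
K = c_kv @ W_uk                             # (1, 10, 128)
V = c_kv @ W_uv                             # (1, 10, 128)
```

Because only the small latent is stored, the cache footprint scales with `d_latent` rather than with the full key/value dimensions across all heads.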
Mixture-of-Experts (MoE) models have become a popular way to improve vanilla Transformers. They divide the feedforward blocks into multiple experts and use a routing mechanism to send each token to a subset of these experts. This allows the model to have more parameters than it activates for each token, effectively decoupling the model's knowledge capacity from the individual token processing cost.
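The sketch below shows a toy top-k routed MoE feedforward layer to illustrate the mechanism; the expert architecture, the loop-based dispatch, and all sizes are illustrative rather than how a production MoE kernel is written.

```python
import torch
import torch.nn.functional as F

class MoELayer(torch.nn.Module):
    """Toy top-k routed MoE feedforward block (illustrative, not DeepSeek's code)."""
    def __init__(self, d_model=64, d_ff=128, n_experts=8, k=2):
        super().__init__()
        self.experts = torch.nn.ModuleList([
            torch.nn.Sequential(
                torch.nn.Linear(d_model, d_ff), torch.nn.GELU(),
                torch.nn.Linear(d_ff, d_model),
            ) for _ in range(n_experts)
        ])
        self.router = torch.nn.Linear(d_model, n_experts)
        self.k = k

    def forward(self, x):                        # x: (tokens, d_model)
        scores = self.router(x)                  # (tokens, n_experts)
        weights, idx = scores.topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # gate only over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):               # each token is processed by k experts only
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

layer = MoELayer()
print(layer(torch.randn(5, 64)).shape)           # torch.Size([5, 64])
```

Even in this toy version, only `k` of the `n_experts` feedforward blocks run for any given token, which is what decouples total parameter count from per-token compute.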
The routing mechanism in MoE models introduces a discontinuous choice into an otherwise smooth network, which makes training harder and can lead to "routing collapse," where the router gets stuck sending most tokens to only a few experts while the others remain undertrained.
DeepSeek v3 addresses these challenges with refinements carried over from DeepSeekMoE, most notably shared experts that every token passes through alongside the routed ones, and an auxiliary-loss-free load-balancing strategy that adds a per-expert bias to the routing scores and nudges it during training to keep expert utilization even.
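Below is a rough sketch of the bias-based balancing idea under simplifying assumptions: the function names, the update rule, and the step size are illustrative, and only the gist (the bias influences which experts are selected, not the gate weights, and is nudged toward even load) follows the report.

```python
import torch

def biased_topk_routing(scores, bias, k=2):
    """Select experts using scores + bias, but weight their outputs by the raw scores:
    the bias steers *which* experts are picked, not how much each contributes."""
    _, idx = (scores + bias).topk(k, dim=-1)            # selection uses biased scores
    gate = torch.softmax(scores.gather(-1, idx), dim=-1)  # gating uses unbiased scores
    return idx, gate

def update_bias(bias, idx, n_experts, step=0.001):
    """After a batch, nudge the bias down for overused experts and up for underused
    ones, so load evens out without adding an auxiliary loss term to the objective."""
    load = torch.bincount(idx.flatten(), minlength=n_experts).float()
    return bias - step * torch.sign(load - load.mean())

# Toy usage with 8 experts and a batch of 16 tokens.
n_experts = 8
bias = torch.zeros(n_experts)
scores = torch.randn(16, n_experts)
idx, gate = biased_topk_routing(scores, bias)
bias = update_bias(bias, idx, n_experts)
```

Keeping the balancing pressure out of the loss function avoids the tension between "balance the experts" and "minimize the language-modeling objective" that auxiliary-loss approaches have to trade off.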
DeepSeek v3 introduces the ability to predict multiple tokens per forward pass, utilizing a multi-token prediction objective during training.
The model performs a standard forward pass for next-token prediction and then feeds the resulting hidden state into an additional Transformer block to predict the token after that, iterating this process for a fixed number of future tokens.
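Below is a highly simplified sketch of that idea: the main model's hidden state is combined with the embedding of the following token and passed through one extra Transformer layer to predict the token after that. The module name, the way the two inputs are combined, and the omission of causal masking are simplifications for illustration, not the report's exact MTP module.

```python
import torch

class MTPHead(torch.nn.Module):
    """Illustrative extra prediction depth: fuse the main model's hidden state with the
    embedding of the next token, run one more Transformer layer, and predict the token
    after that. (Causal masking omitted for brevity.)"""
    def __init__(self, d_model=64, vocab=1000, n_heads=4):
        super().__init__()
        self.proj = torch.nn.Linear(2 * d_model, d_model)
        self.block = torch.nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.out = torch.nn.Linear(d_model, vocab)

    def forward(self, hidden, next_token_emb):
        h = self.proj(torch.cat([hidden, next_token_emb], dim=-1))
        h = self.block(h)
        return self.out(h), h        # logits for token t+2, plus state for a further head

d_model, vocab = 64, 1000
hidden = torch.randn(2, 16, d_model)    # main model's hidden states (batch, seq, d_model)
next_emb = torch.randn(2, 16, d_model)  # embeddings of the tokens shifted by one position
logits_t2, h = MTPHead(d_model, vocab)(hidden, next_emb)
print(logits_t2.shape)                   # torch.Size([2, 16, 1000])
```

Chaining such heads gives each position extra training signal about tokens further ahead, and the returned state `h` is what a subsequent head would consume to go one step deeper.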
DeepSeek's improvements to the Transformer architecture demonstrate a deep understanding of the model's inner workings and its limitations. These innovations, while seemingly obvious in retrospect, are the result of careful research and a clear vision for addressing architectural deficiencies. As the field progresses, future improvements may focus on prioritizing compute based on the difficulty of predictions, further optimizing Transformer models for efficiency and performance.