The world of Large Language Models (LLMs) is constantly evolving, with new models emerging regularly, each pushing the boundaries of what's possible. Among the most recent and impressive additions is DeepSeek-V3, developed by DeepSeek AI. This article provides a comprehensive overview of DeepSeek-V3, exploring its architecture, capabilities, performance, and how you can run it locally.
DeepSeek-V3 is a Mixture-of-Experts (MoE) language model boasting a staggering 671 billion total parameters. What's particularly impressive is that only 37 billion parameters are activated for each token, making it incredibly efficient for inference.
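To build intuition for why only a fraction of an MoE model's parameters run per token, here is a generic top-k routing sketch. This is an illustrative toy layer, not DeepSeek's actual DeepSeekMoE implementation; the expert count, hidden size, and k value below are made up for demonstration.

```python
import torch
import torch.nn as nn

class ToyMoELayer(nn.Module):
    """Generic top-k MoE routing sketch: each token only runs through k of the experts."""
    def __init__(self, d_model=64, n_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)  # gating network scores each expert
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model),
                          nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x):                              # x: (tokens, d_model)
        scores = self.router(x).softmax(dim=-1)        # routing probabilities per token
        topk_p, topk_idx = scores.topk(self.k, dim=-1) # keep only the k best experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e          # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += topk_p[mask, slot, None] * expert(x[mask])
        return out

tokens = torch.randn(10, 64)
print(ToyMoELayer()(tokens).shape)  # torch.Size([10, 64]); only 2 of 8 experts run per token
```

In DeepSeek-V3 the same principle is applied at far larger scale, which is how a 671B-parameter model can activate only 37B parameters per token.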
The development of DeepSeek-V3 centers on delivering strong performance with efficient training and inference.
Trained on a massive 14.8 trillion tokens and further refined through Supervised Fine-Tuning and Reinforcement Learning, DeepSeek-V3 rivals closed-source models in performance while maintaining training stability and reasonable computational cost.
DeepSeek-V3 builds upon the efficient architecture validated in DeepSeek-V2, including Multi-head Latent Attention (MLA) and the DeepSeekMoE architecture, and introduces several key innovations such as an auxiliary-loss-free load-balancing strategy and a Multi-Token Prediction (MTP) training objective.
These architectural enhancements, combined with an FP8 mixed-precision training framework, result in remarkable efficiency and cost-effectiveness: the pre-training of DeepSeek-V3 was completed in only 2.664M H800 GPU hours, delivering a powerful open-source base model.
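To put that GPU-hour figure in perspective, here is a back-of-the-envelope cost estimate. The ~$2 per H800 GPU-hour rental rate is an assumption (the figure used in DeepSeek's own technical report), not an actual invoice, so treat the result as an order-of-magnitude number.

```python
# Back-of-the-envelope pre-training cost estimate for DeepSeek-V3.
PRETRAIN_GPU_HOURS = 2.664e6   # H800 GPU hours reported for pre-training
RENTAL_RATE_USD = 2.0          # assumed $/GPU-hour (rental-price assumption)

estimated_cost = PRETRAIN_GPU_HOURS * RENTAL_RATE_USD
print(f"Estimated pre-training compute cost: ${estimated_cost / 1e6:.2f}M")
# -> Estimated pre-training compute cost: $5.33M
```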
Furthermore, DeepSeek employs knowledge distillation from its DeepSeek-R1 series, transferring reasoning capabilities into DeepSeek-V3. This innovative approach elegantly incorporates verification and reflection patterns, significantly boosting the model's reasoning performance while maintaining control over output style and length.
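As a rough illustration of what distillation means in practice, here is the standard soft-label knowledge-distillation loss from the textbook formulation. This is for intuition only; DeepSeek's actual R1-to-V3 distillation pipeline is more involved and is not reproduced here.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-label KD: KL divergence between temperature-softened output distributions."""
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2

student = torch.randn(4, 32000)  # (batch, vocab) logits from the model being trained
teacher = torch.randn(4, 32000)  # logits from the stronger reasoning model
print(distillation_loss(student, teacher))
```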
DeepSeek-V3 is available in two primary versions: the pre-trained DeepSeek-V3-Base and the chat-tuned DeepSeek-V3.
Both models feature 671 billion total parameters, with 37 billion activated per token and a context length of 128K. Note that the total parameter count shown on Hugging Face is 685B, which includes the 14B parameters of the Multi-Token Prediction (MTP) module.
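A quick sanity check on these numbers, restating the figures above in code (the 685B Hugging Face total is simply the 671B main model plus the 14B MTP module):

```python
# Headline DeepSeek-V3 figures, with a quick consistency check.
specs = {
    "total_params_B": 671,    # total parameters (billions)
    "active_params_B": 37,    # parameters activated per token
    "context_length_K": 128,  # nominal context window
    "mtp_module_B": 14,       # Multi-Token Prediction module weights
}

# 671B main model + 14B MTP module = the 685B total shown on Hugging Face.
assert specs["total_params_B"] + specs["mtp_module_B"] == 685
print(f"Sparse activation ratio: {specs['active_params_B'] / specs['total_params_B']:.1%}")
# -> Sparse activation ratio: 5.5%
```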
For those looking to delve deeper into the model weights, README_WEIGHTS.md provides comprehensive details on the Main Model weights and the MTP Modules. MTP support is still under active development in the community, so contributions and feedback are greatly appreciated.
The evaluation results for DeepSeek-V3 are compelling, showcasing its superior performance across a variety of benchmarks.
DeepSeek-V3 consistently outperforms other open-source models, particularly in math and code-related tasks. On many benchmarks, DeepSeek-V3 has achieved state-of-the-art performance.
DeepSeek-V3 maintains excellent performance across all context window lengths up to 128K, as demonstrated by Needle In A Haystack (NIAH) tests.
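For intuition, a Needle In A Haystack test hides a small fact (the "needle") at a chosen depth inside long filler text and then checks whether the model can retrieve it. Below is a minimal sketch of how such a prompt can be constructed; it is illustrative only and not DeepSeek's evaluation harness.

```python
def build_niah_prompt(needle, filler_sentence, depth_pct, target_tokens, tokens_per_sentence=12):
    """Bury `needle` at roughly `depth_pct` percent of a filler context of ~target_tokens tokens."""
    n_sentences = target_tokens // tokens_per_sentence
    haystack = [filler_sentence] * n_sentences
    insert_at = int(len(haystack) * depth_pct / 100)
    haystack.insert(insert_at, needle)            # hide the needle at the chosen depth
    context = " ".join(haystack)
    question = "What is the secret number mentioned in the text above?"
    return f"{context}\n\n{question}"

prompt = build_niah_prompt(
    needle="The secret number is 4217.",
    filler_sentence="The quick brown fox jumps over the lazy dog near the river bank.",
    depth_pct=50,        # place the needle halfway through the context
    target_tokens=2000,  # scaled down here; real tests sweep context lengths up to 128K
)
print(len(prompt.split()), "words in the test prompt")
```

Real evaluations repeat this over many context lengths and needle depths and score whether the model's answer contains the needle.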
DeepSeek-V3 can be deployed locally on a range of hardware, including NVIDIA GPUs, AMD GPUs, and Huawei Ascend NPUs, using open-source inference frameworks such as SGLang, LMDeploy, TensorRT-LLM, and vLLM, as shown in the sketch below.
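Most of these serving stacks expose an OpenAI-compatible HTTP endpoint, so once a server is running, querying the model can look roughly like this. The port 30000 and the model ID `deepseek-ai/DeepSeek-V3` are assumptions for illustration; the actual launch command, port, and model name depend on the framework you choose.

```python
# Querying a locally served DeepSeek-V3 through an OpenAI-compatible endpoint.
# Assumes a serving framework (e.g. SGLang or vLLM) is already running on localhost:30000.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")  # local servers usually ignore the key

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain Mixture-of-Experts in two sentences."},
    ],
    max_tokens=200,
    temperature=0.7,
)
print(response.choices[0].message.content)
```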
Because FP8 training is adopted natively in the DeepSeek framework, only FP8 weights are provided; the project's documentation details how to cast the weights to BF16 if required.
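Conceptually, that conversion dequantizes each block-quantized FP8 weight tensor back to a higher-precision format using its stored inverse scale. Below is a minimal sketch of the idea, assuming per-block scale factors and PyTorch's `float8_e4m3fn` dtype; the conversion script shipped in the official repository is the authoritative reference.

```python
import torch

def dequantize_fp8_block(weight_fp8, scale_inv, block_size=128):
    """Cast a block-quantized FP8 weight back to BF16: upcast, then multiply each
    block_size x block_size tile by its stored inverse scale."""
    w = weight_fp8.to(torch.float32)
    rows, cols = w.shape
    # Expand each per-block scale to cover its tile of the weight matrix.
    scale_full = scale_inv.repeat_interleave(block_size, dim=0)[:rows] \
                          .repeat_interleave(block_size, dim=1)[:, :cols]
    return (w * scale_full).to(torch.bfloat16)

# Toy example: a 256x256 weight with a 2x2 grid of block scales.
w_fp8 = torch.randn(256, 256).to(torch.float8_e4m3fn)
scales = torch.rand(2, 2) + 0.5      # stand-in for the stored inverse-scale tensors
w_bf16 = dequantize_fp8_block(w_fp8, scales)
print(w_bf16.dtype, w_bf16.shape)    # torch.bfloat16 torch.Size([256, 256])
```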
The code repository for DeepSeek-V3 is licensed under the MIT License. The use of the DeepSeek-V3 Base/Chat models is subject to the Model License. DeepSeek-V3 series (including Base and Chat) supports commercial use, with appropriate acknowledgement.
DeepSeek-V3 represents a significant advancement in open-source large language models. With its innovative architecture, efficient training, and impressive performance, it stands out as a competitive alternative to closed-source models. Whether you are a researcher, developer, or simply an AI enthusiast, DeepSeek-V3 offers a powerful platform for exploring the frontiers of natural language processing.