DeepSeek-V3: A Deep Dive into the Latest Open-Source Language Model

DeepSeek-V3 is making waves in the AI community. This article covers everything you need to know about it, from its architecture and training to performance benchmarks and how to run it locally.

Introduction to DeepSeek-V3

DeepSeek-V3 is a strong Mixture-of-Experts (MoE) language model developed by DeepSeek-AI. It boasts an impressive 671 billion total parameters, with 37 billion activated for each token.
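
To make the total-vs-activated distinction concrete, here is a minimal, generic top-k routing sketch in PyTorch. It is illustrative only, not DeepSeek's DeepSeekMoE implementation: all experts' weights exist in the model, but each token is processed by only the few experts its gate selects.

```python
# Generic top-k MoE routing sketch (illustrative, not DeepSeekMoE itself).
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    def __init__(self, dim=64, n_experts=8, top_k=2):
        super().__init__()
        self.gate = nn.Linear(dim, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):                       # x: (tokens, dim)
        scores = self.gate(x).softmax(dim=-1)   # routing probabilities
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for k in range(self.top_k):             # only the selected experts run
            for e in range(len(self.experts)):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k, None] * self.experts[e](x[mask])
        return out

moe = TinyMoE()
y = moe(torch.randn(16, 64))  # each token touches 2 of 8 experts' parameters
```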

Key features of DeepSeek-V3 include:

  • Adopts Multi-head Latent Attention (MLA) and the DeepSeekMoE architecture for efficient inference and training.
  • An auxiliary-loss-free strategy for load balancing.
  • A multi-token prediction (MTP) training objective for improved performance (see the sketch after this list).
  • Pre-trained on 14.8 trillion tokens of diverse, high-quality data.
  • Fine-tuned with Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL).
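
Multi-token prediction is the least familiar item on this list, so here is a minimal sketch of the general idea: instead of predicting only the next token, extra heads also predict tokens further ahead, and their losses are summed. The `heads` below are plain linear projections for illustration only; DeepSeek-V3's actual MTP module chains additional transformer layers.

```python
# Minimal multi-token prediction loss sketch (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

def mtp_loss(hidden, heads, targets):
    """Sum cross-entropy losses for predicting tokens at offsets +1..+K."""
    loss = 0.0
    for k, head in enumerate(heads, start=1):
        logits = head(hidden[:, :-k])               # (batch, seq-k, vocab)
        loss = loss + F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),    # flatten for cross_entropy
            targets[:, k:].reshape(-1),             # targets shifted by +k
        )
    return loss / len(heads)

# Toy usage: 2 heads -> predict the next token and the token after it.
hidden = torch.randn(4, 16, 32)                     # (batch, seq, dim)
targets = torch.randint(0, 100, (4, 16))            # token ids, vocab=100
heads = [nn.Linear(32, 100) for _ in range(2)]
print(mtp_loss(hidden, heads, targets))
```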

DeepSeek-V3 stands out for its ability to achieve performance comparable to leading closed-source models while maintaining training stability.

Model Summary

DeepSeek-V3 incorporates several innovative strategies to achieve ultimate training efficiency and performance:

  • Innovative Load Balancing Strategy: An auxiliary-loss-free approach that minimizes the performance degradation usually caused by pushing experts toward balanced load.
  • Multi-Token Prediction (MTP) Objective: MTP improves model performance and enables speculative decoding for faster inference.
  • FP8 Mixed Precision Training: DeepSeek-V3 validates the feasibility and effectiveness of FP8 training at very large scale (a simplified sketch follows this list).
  • Communication Bottleneck Optimization: DeepSeek-V3 achieves nearly full computation-communication overlap in cross-node MoE training.
  • Knowledge Distillation: Reasoning capabilities are distilled from DeepSeek R1 models into DeepSeek-V3.
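
The FP8 idea is straightforward to illustrate: values are scaled into the representable range of an 8-bit float format before casting, and the scale is kept alongside the tensor so it can be dequantized later. Here is a minimal per-tensor sketch in PyTorch; DeepSeek-V3's framework actually uses finer-grained (tile/block-wise) scaling, so treat this purely as a conceptual example.

```python
# Per-tensor FP8 quantization sketch (conceptual; DeepSeek-V3 uses
# finer-grained scaling in its training framework).
import torch

FP8_MAX = 448.0  # largest normal value of torch.float8_e4m3fn

def fp8_quantize(x: torch.Tensor):
    amax = x.abs().max().clamp(min=1e-12)
    scale = FP8_MAX / amax                   # map the tensor's range into FP8
    return (x * scale).to(torch.float8_e4m3fn), scale

def fp8_dequantize(x_fp8: torch.Tensor, scale: torch.Tensor):
    return x_fp8.to(torch.float32) / scale

w = torch.randn(128, 128)
w8, s = fp8_quantize(w)
print((fp8_dequantize(w8, s) - w).abs().max())  # small quantization error
```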

Model Downloads

You can download the model from Hugging Face. There are two versions available: DeepSeek-V3-Base (the pre-trained base model) and DeepSeek-V3 (the chat model).

Keep in mind that the total size of the DeepSeek-V3 models on Hugging Face is 685B parameters, which includes the 671B main model weights and the 14B MTP Module weights. MTP support is under active development in the community, offering opportunities for contribution and feedback.

Evaluation Results

DeepSeek-V3's performance has been evaluated on a variety of benchmarks, and the results are impressive.

Base Model Performance

DeepSeek-V3 outperforms other open-source models on most benchmarks, especially in math, multilingual, and code-related tasks. It achieves low bits-per-byte (BPB) on language-modeling evaluations and high accuracy and exact-match (EM) scores on benchmarks such as MMLU, DROP, HumanEval, GSM8K, MATH, and C-Eval.

Context Window

Results from the Needle In A Haystack (NIAH) tests show that DeepSeek-V3 maintains strong performance across context window lengths up to 128K.

Chat Model Performance

DeepSeek-V3 stands out as the best-performing open-source model and shows competitive results compared to frontier closed-source models, particularly excelling in code generation, mathematical problem-solving, and understanding Chinese language nuances.

Chat Website & API Platform

You can interact with DeepSeek-V3 directly through DeepSeek's official chat website (chat.deepseek.com). An OpenAI-compatible API is also offered through the DeepSeek Platform (platform.deepseek.com).
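
Because the API is OpenAI-compatible, the standard openai Python client works against it. A hedged example follows; the base URL and model name match DeepSeek's public documentation at the time of writing, but check platform.deepseek.com for current values.

```python
# Calling DeepSeek-V3 through the OpenAI-compatible API.
# Base URL and model name follow DeepSeek's docs; verify before use.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DEEPSEEK_API_KEY",      # issued on platform.deepseek.com
    base_url="https://api.deepseek.com",  # DeepSeek's OpenAI-compatible endpoint
)

response = client.chat.completions.create(
    model="deepseek-chat",                # the model id that serves DeepSeek-V3
    messages=[{"role": "user", "content": "Summarize what makes DeepSeek-V3 efficient."}],
)
print(response.choices[0].message.content)
```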

How to Run DeepSeek-V3 Locally

DeepSeek-V3 supports local deployment using various hardware and software options:

  • DeepSeek-Infer Demo: A lightweight demo for FP8 and BF16 inference.
  • SGLang: Supports DeepSeek-V3 with BF16 and FP8 inference modes.
  • LMDeploy: Enables efficient FP8 and BF16 inference.
  • TensorRT-LLM: Supports BF16 inference and INT4/8 quantization (FP8 support coming soon).
  • vLLM: Supports FP8 and BF16 modes with tensor and pipeline parallelism (see the sketch after this list).
  • AMD GPU: Compatible with AMD GPUs via SGLang in BF16 and FP8 modes.
  • Huawei Ascend NPU: Supports DeepSeek-V3 on Huawei Ascend devices.
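
As one concrete example of the options above, here is a hedged sketch of serving the model with vLLM's offline API. The Hugging Face repo id and parallelism settings are illustrative; a model of this size requires a multi-GPU (and in practice multi-node) setup.

```python
# Sketch: running DeepSeek-V3 with vLLM's offline inference API.
# Settings are illustrative; adapt tensor_parallel_size to your hardware.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V3",  # Hugging Face repo id
    tensor_parallel_size=8,           # split weights across 8 GPUs
    trust_remote_code=True,
)
params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Briefly explain Mixture-of-Experts."], params)
print(outputs[0].outputs[0].text)
```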

Remember that because FP8 training is used natively in the framework, DeepSeek-AI provides only FP8 weights. If you need BF16 weights, you can use the conversion script provided in the repository; check the GitHub repo for details.

License Information

The code repository is under the MIT License, and the use of the DeepSeek-V3 models is subject to the Model License. Both DeepSeek-V3 Base and Chat models support commercial use. Always refer to the license files for the most accurate and up-to-date information.

Conclusion

DeepSeek-V3 represents a significant advancement in open-source language models. With its innovative architecture, efficient training methods, and impressive performance, it offers a compelling alternative to closed-source models. Whether you're a researcher, developer, or AI enthusiast, DeepSeek-V3 provides a powerful tool for exploring the possibilities of large language models.
