The DeepSeek R1, a cutting-edge AI model developed by the DeepSeek team, is making waves in the AI community. Boasting a staggering 671 billion parameters and leveraging a Mixture-of-Experts (MoE) architecture, this model activates 37 billion parameters per token, enabling exceptional performance in complex tasks. This article explores the DeepSeek R1 model, its capabilities, and how it's optimized for the Huawei Ascend platform.
DeepSeek R1's architecture is designed to enhance deep thinking capabilities through a multi-stage cyclic training approach. This comprehensive training regimen allows DeepSeek R1 to excel in complex reasoning, mathematics, and coding tasks.
Notably, its performance rivals that of OpenAI's o1 model, positioning DeepSeek R1 as a competitive force in the large language model landscape.
To leverage the full potential of DeepSeek R1, Huawei's Ascend platform provides a robust ecosystem of hardware and software tools. Here's a breakdown of how to optimize DeepSeek R1 for Ascend:
1. Hardware Requirements:
2. Weight Conversion:
GPU-Side Conversion: A script from DeepSeek-V3 can be reused:
git clone https://github.com/deepseek-ai/DeepSeek-V3.git
cd DeepSeek-V3/inference/
python fp8_cast_bf16.py --input-fp8-hf-path /path/to/DeepSeek-R1 --output-bf16-hf-path /path/to/deepseek-R1-bf16
NPU-Side Conversion: Use the script from ModelZoo-PyTorch:
git clone https://gitee.com/ascend/ModelZoo-PyTorch.git
cd ModelZoo-PyTorch/MindIE/LLM/DeepSeek/DeepSeek-V2/NPU_inference
python fp8_cast_bf16.py --input-fp8-hf-path /path/to/DeepSeek-R1 --output-bf16-hf-path /path/to/deepseek-R1-bf16
Remember to manually copy tokenizer files after conversion.
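The manual copy step above can be sketched as a small shell function. The file names below (tokenizer.json, tokenizer_config.json) are typical Hugging Face tokenizer files and are an assumption on my part; check your checkpoint directory for the actual list before relying on it.

```shell
# Sketch: copy tokenizer files from the original checkpoint to the
# converted BF16 directory. The file list is an assumption -- verify it
# against the source checkpoint.
copy_tokenizer_files() {
    src="$1"
    dst="$2"
    for f in tokenizer.json tokenizer_config.json; do
        if [ -f "$src/$f" ]; then
            cp "$src/$f" "$dst/$f"
            echo "copied $f"
        else
            echo "missing $f in $src" >&2
        fi
    done
}

# Usage (paths are placeholders):
# copy_tokenizer_files /path/to/DeepSeek-R1 /path/to/deepseek-R1-bf16
```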
Important Considerations: Ensure sufficient disk space for the original and converted weights (approximately 640GB before conversion and 1.3TB after conversion for DeepSeek-R1).
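It is worth checking free space programmatically before kicking off a long conversion. A minimal sketch, assuming a GNU/Linux host (the `--output=avail` flag requires GNU df); the 1.3 TB figure comes from the paragraph above:

```shell
# Sketch: warn if the target filesystem has less free space than needed.
# Requires GNU df (for --output=avail), which is standard on Linux.
check_disk_space() {
    dir="$1"          # directory on the target filesystem
    required_gb="$2"  # required free space in GB
    avail_gb=$(df -BG --output=avail "$dir" | tail -n 1 | tr -dc '0-9')
    if [ "$avail_gb" -ge "$required_gb" ]; then
        echo "OK: ${avail_gb}G free (need ${required_gb}G)"
    else
        echo "WARNING: only ${avail_gb}G free, need ${required_gb}G" >&2
        return 1
    fi
}

# e.g. check_disk_space /path/to/weights 1300
```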
3. Quantization:
4. Loading the Image:
docker load -i <image_name>
docker images
5. Containerization:
Preparing the Model: Download or obtain the model weights and place them in the designated directory. Download scripts are available on the ModelZoo.
Launching the Container:
docker run -itd --privileged --name=<container_name> --net=host \
--shm-size 500g \
--device=/dev/davinci0 ... --device=/dev/davinci7 \
--device=/dev/davinci_manager --device=/dev/hisi_hdc \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/Ascend/firmware:/usr/local/Ascend/firmware \
-v /usr/local/sbin/npu-smi:/usr/local/sbin/npu-smi \
-v /usr/local/sbin:/usr/local/sbin \
-v /etc/hccn.conf:/etc/hccn.conf \
-v /path-to-weights:/path-to-weights \
<image_name> bash
Configure communication environment variables:
export ATB_LLM_HCCL_ENABLE=1
export ATB_LLM_COMM_BACKEND="hccl"
export HCCL_CONNECT_TIMEOUT=7200
export WORLD_SIZE=32
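These exports can be collected into one sourceable snippet with a consistency check. The assumption that WORLD_SIZE=32 corresponds to 4 nodes of 8 NPUs each is mine, not the article's; edit the node counts to match your cluster.

```shell
# Sketch: communication settings from the article, plus a sanity check.
# NODES and NPUS_PER_NODE are assumptions (4 x 8 = 32); adjust them to
# match your actual deployment.
export ATB_LLM_HCCL_ENABLE=1
export ATB_LLM_COMM_BACKEND="hccl"
export HCCL_CONNECT_TIMEOUT=7200
export WORLD_SIZE=32

NODES=4
NPUS_PER_NODE=8
if [ "$WORLD_SIZE" -ne $((NODES * NPUS_PER_NODE)) ]; then
    echo "WARNING: WORLD_SIZE=$WORLD_SIZE does not equal ${NODES}x${NPUS_PER_NODE}" >&2
fi
```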
6. Testing and Validation:
Pure Model Testing:
- Modify config.json to set model_type to deepseekv2.
- Use hccn_tool to configure and verify the NPU network interfaces.
- Prepare rank_table_file.json with correct IP addresses and device IDs.
- Run the run.sh script for accuracy and performance testing.
Service-Oriented Testing:
- Set the MIES_CONTAINER_IP environment variable and update config.json.
- Launch the service with ./bin/mindieservice_daemon.
- On out-of-memory errors, adjust NPU_MEMORY_FRACTION or reduce maxSeqLen, maxInputTokenLen, etc., in config.json.
- On communication timeouts, increase the HCCL_CONNECT_TIMEOUT and HCCL_EXEC_TIMEOUT environment variables.
- Further adjustments, if needed, can be made in tokenizer.py and model_runner.py.
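For orientation, a rank_table_file.json for a single node might look like the sketch below. This is illustrative only: the IP addresses are placeholders, and the exact schema can vary with the CANN version, so check it against the Ascend HCCL documentation for your installation.

```json
{
  "version": "1.0",
  "status": "completed",
  "server_count": "1",
  "server_list": [
    {
      "server_id": "10.0.0.1",
      "device": [
        { "device_id": "0", "device_ip": "192.168.100.101", "rank_id": "0" },
        { "device_id": "1", "device_ip": "192.168.101.101", "rank_id": "1" }
      ]
    }
  ]
}
```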
Huawei's Ascend platform provides a comprehensive suite of tools and resources for AI developers. From hardware acceleration with the Atlas series to software frameworks like MindSpore, Ascend empowers developers to build and deploy high-performance AI applications.
By optimizing DeepSeek R1 for Ascend, users can unlock the model's full potential and achieve exceptional results in a wide range of AI tasks.
For further exploration, refer to these valuable resources: