Deploying DeepSeek V3 and DeepSeek R1 Models on Alibaba Cloud PAI

DeepSeek-V3 and DeepSeek-R1 are cutting-edge large language models (LLMs) developed by DeepSeek. DeepSeek-V3, with its staggering 671 billion parameters, utilizes a Mixture-of-Experts (MoE) architecture. DeepSeek-R1 is a high-performance inference model trained using DeepSeek-V3-Base. This article guides you through the one-click deployment of these models on Alibaba Cloud's Platform of Artificial Intelligence (PAI), leveraging the Model Gallery for streamlined deployment and acceleration options.

Understanding DeepSeek Models

DeepSeek-V3: A massive 671B parameter MoE language model, ideal for complex tasks requiring extensive knowledge.
DeepSeek-R1: Optimized for inference, balancing performance and resource requirements.

Choosing a Model

Deploying the full-scale DeepSeek-R1 or DeepSeek-V3 model (671B parameters) demands significant resources (8 GPUs with at least 96GB VRAM each), leading to higher costs. For users with limited resources or cost-sensitive applications, consider the distilled versions of the models.

According to tests, the DeepSeek-R1-Distill-Qwen-32B model offers a good balance between performance and cost, making it suitable for cloud deployment as an alternative to the full DeepSeek-R1. Other distilled models (7B, 8B, 14B) are also available. The Model Gallery provides model evaluation tools to assess the actual performance of each model, helping you make an informed decision.

Minimum Configurations and Deployment Options

The following table outlines the minimum hardware configurations and supported token counts for each model, alongside the available deployment methods:

Model	Minimum Configuration	Max Token Count	BladeLLM	SGLang	vLLM	Standard
DeepSeek-R1	8 * GU120 (8 * 96 GB VRAM)	Not Supported	16384	0	4096	Not Supported
DeepSeek-V3	8 * GU120 (8 * 96 GB VRAM)	Not Supported	16384	0	4096	2000
DeepSeek-R1-Distill-Qwen-1.5B	1 * A10 (24 GB VRAM)	131072	Not Supported	131072	131072	131072
DeepSeek-R1-Distill-Qwen-7B	1 * A10 (24 GB VRAM)	131072	Not Supported	32768	131072	131072
DeepSeek-R1-Distill-Llama-8B	1 * A10 (24 GB VRAM)	131072	Not Supported	32768	131072	131072
DeepSeek-R1-Distill-Qwen-14B	1 * GPU L (48 GB VRAM)	131072	Not Supported	32768	131072	131072
DeepSeek-R1-Distill-Qwen-32B	2 * GPU L (2 * 48 GB VRAM)	131072	Not Supported	32768	131072	131072
DeepSeek-R1-Distill-Llama-70B	2 * GU120 (2 * 96 GB VRAM)	131072	Not Supported	32768	131072	131072

Deployment Methodologies:

BladeLLM Acceleration: Alibaba PAI's in-house high-performance inference framework. Refer to the BladeLLM documentation for more information.
SGLang Acceleration: A fast serving framework for large language and vision-language models (SGLang documentation).
vLLM Acceleration: A popular library for accelerating LLM inference (vLLM documentation).
Standard Deployment: Basic deployment without any inference acceleration.

For optimal performance and maximum token support, accelerated deployment (BladeLLM or SGLang) is recommended. Note that accelerated deployments only support API calls, while standard deployments support both API calls and WebUI chat interfaces.

Step-by-Step Deployment Guide

Access Model Gallery:
- Log in to the PAI console.
- Select the appropriate region.
- Navigate to Workspace List, then select your workspace.
- In the left-hand navigation, choose Quick Start > Model Gallery.
Select Your Model:
- Locate the desired model (e.g., DeepSeek-R1-Distill-Qwen-32B) in the Model Gallery.
- Click the model card to access the details page.
Deploy the Model:
- Click Deploy in the top-right corner.
- Choose your preferred deployment method and resource configuration.
- Confirm your settings to initiate the one-click deployment, which creates a PAI-EAS service.

Hardware Considerations

If you are deploying DeepSeek-R1 or DeepSeek-V3, you have the following instance options:

Single Instance Standard: ml.gu8v.c192m1024.8-gu120, ecs.gn8v-8x.48xlarge (public resources, availability may be limited), ecs.ebmgn8v.48xlarge (requires dedicated EAS resources – Purchase EAS Dedicated Resources)
Single Instance GP7V: ml.gp7vf.16.40xlarge (public resources, available only via auction; when standard resources are scarce, try searching for GP7V resources in China North 6 (Ulanqab) and ensure VPC is configured during deployment)
Distributed Deployment:
- GU7X Instances: 4 * ml.gu7xf.8xlarge-gu108 (public resources, available only via auction in China North 6 (Ulanqab); ensure VPC is configured during deployment).
- Lingjun Intelligent Computing Resources: Requires whitelist activation. Contact your sales manager or submit a work order for consultation. Ensure VPC is configured during deployment in China North 6 (Ulanqab). (PAI Lingjun provides high-performance, high-elasticity heterogeneous computing power, increasing resource utilization by up to 3x – see PAI Lingjun Overview).

Model Invocation

The method of calling the model varies depending on the deployment type:

Feature	BladeLLM	SGLang	vLLM	Standard
WebUI	Not Supported. Local WebUI required (BladeLLM GitHub, BladeLLM OSS)	Supported via online debugging. See "Online Debugging" section.	Supported via online debugging. See "Online Debugging" section.	Supported. Access via Model Gallery > Task Management > Deployment Tasks > Service Details > "View WEB Application". WebUI Usage
Online Debugging	Supported. See "Online Debugging" section.	Supported. See "Online Debugging" section.	Supported. See "Online Debugging" section.	N/A
API Call	Supported via `completions` and `chat/completions` endpoints.	Supported via `completions` and `chat/completions` endpoints.	Supported via `completions` and `chat/completions` endpoints.	Supported via `completions` and `chat/completions` endpoints. Standard deployments also support direct calls. API Call Details

API Endpoints

All deployment methods support HTTP POST requests to the following endpoints:

Completions Interface: <EAS_ENDPOINT>/v1/completions
Chat Interface: <EAS_ENDPOINT>/v1/chat/completions

Standard deployments also allow direct calls to the <EAS_ENDPOINT> without any suffixes.

DeepSeek-R1 Usage Recommendations

Set temperature between 0.5-0.7 (recommended: 0.6) to avoid repetition or incoherence.
Do not use system prompt; include all instructions within the user prompt.
For mathematical problems, include "Please reason step by step and put the final answer in \boxed{}."

Cost Considerations

Due to the size of DeepSeek-V3 and DeepSeek-R1, deployment can be expensive. Distilled, lighter models can significantly reduce costs. Evaluate your options carefully before deploying to a production environment. Consider the benefits of cloud cost management

Free Trial: If you haven't used EAS before, claim PAI-EAS trial resources from the Alibaba Cloud Free Trial Center.
Savings Plans: For long-term use, combine public resource groups with Savings Plans or purchase pre-paid EAS resource groups.
Spot Instances: For non-production environments, use spot instances for cost savings, but be aware of potential instability.

Troubleshooting (FAQ)

Long Deployment Times:
- Possible causes: Insufficient resources in the selected region, or lengthy model loading times (20-30 minutes for large models like DeepSeek-R1 and DeepSeek-V3).
- Solutions: Check deployment task details for instance status (Task Management > Deployment Tasks > More > More Information). If the service fails to start, switch regions and redeploy. Consider using a smaller distilled model.
API Returns 404: Verify that you've included the OpenAI API suffix (e.g., /v1/chat/completions) in the URL. Refer to the model's introduction page in the Model Gallery for API call examples.
EAS Gateway Timeout: The default timeout is 180 seconds. If you need a longer timeout (up to 600 seconds), configure an exclusive EAS gateway and submit a work order to adjust the timeout.
Missing "Internet Search" Functionality: Deploying a model service alone does not provide "Internet Search". You must build an AI Agent using a platform like PAI's LangStudio. See Chat With Web Search.

Integrating with AI Applications (e.g., Dify)

In Dify, edit "Model Provider" and add "OpenAI-API-compatible".
Enter model name (e.g., "DeepSeek-R1-Distill-Qwen-7B"), EAS service token as the API Key, and EAS service endpoint (with /v1 at the end) as the API endpoint URL.
Retrieve the EAS service token and endpoint from Model Gallery > Task Management > Deployment Tasks > View Call Information.

By following this guide, you can efficiently deploy and utilize the powerful DeepSeek-V3 and DeepSeek-R1 models on Alibaba Cloud PAI, choosing configurations and optimization techniques that best fit your specific needs and budget.

. . .

What is the weight equivalent conversion for a smith machine at ...

Jan 16, 2023 ... So, without a smith machine, it's bar + plates, bar typically being 44-45 pounds. If you read the smith machine, it will tell you the weight, ...

AI Reword Generator Tool

Rephrase text instantly with the AI Reword Generator. Improve clarity, avoid plagiarism, and create unique content effortlessly.

How AI is improving diagnostics and health outcomes | World ...

Sep 25, 2024 ... ... medical professionals. The healthcare sector is poised for a revolution, with artificial intelligence (AI) playing a pivotal role in ...

Opinion | Let AI remake the whole U.S. government (oh, and save ...

Mar 6, 2024 ... This new era of AI has presented a once-in-a-century chance to wipe away a lot of the damage and renew the mission.

App Store에서 제공하는 Google Chrome

... 다운로드하세요. 새 ... 제 휴대폰이 아이폰7인데 iOS 업데이트 이후 크롬에서 동영상을 키면 검은화면이 나오고 소리만 들립니다.