DeepSeekMoE: Revolutionizing Language Models with Mixture-of-Experts Architecture

In the rapidly evolving field of Natural Language Processing (NLP), Mixture-of-Experts (MoE) models are emerging as a promising approach to scaling up language models while maintaining computational efficiency. DeepSeek AI has recently introduced DeepSeekMoE, a 16.4 billion parameter MoE language model that showcases the potential of this architecture. This article delves into the key features, evaluation results, and practical applications of DeepSeekMoE.

Understanding Mixture-of-Experts (MoE)

Before diving into the specifics of DeepSeekMoE, let's briefly understand the concept of MoE. Traditional language models typically use a single, monolithic architecture. MoE models, on the other hand, consist of multiple "expert" sub-networks. For each input, a routing mechanism selects a subset of these experts to process the data, allowing the model to specialize in different aspects of the language. This approach enables MoE models to achieve higher capacity with lower computational cost compared to dense models of similar size.
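
To make the routing idea concrete, here is a minimal PyTorch-style sketch of top-k routing (names, shapes, and sizes are illustrative, not DeepSeekMoE's actual implementation): a router scores every expert for each token, only the k highest-scoring experts are run, and their outputs are blended by the router scores, so compute scales with k rather than with the total number of experts.

import torch
import torch.nn.functional as F

def route_top_k(token, experts, router_weights, k=2):
    # Score every expert for this token, then keep only the k best.
    scores = F.softmax(router_weights @ token, dim=-1)
    top_scores, top_ids = scores.topk(k)
    # Only the selected experts run, so compute grows with k, not with len(experts).
    return sum(w * experts[i](token) for w, i in zip(top_scores, top_ids))

# Toy usage: 8 experts, only 2 of them run for this token.
d = 16
experts = [torch.nn.Linear(d, d) for _ in range(8)]
router_weights = torch.randn(8, d)
token = torch.randn(d)
output = route_top_k(token, experts, router_weights, k=2)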

Introducing DeepSeekMoE 16B

DeepSeekMoE 16B is a Mixture-of-Experts language model developed by DeepSeek AI. It stands out due to its innovative MoE architecture, which incorporates two key strategies:

  • Fine-grained expert segmentation: This allows for more specialized and efficient learning.
  • Shared experts isolation: This enhances the model's ability to handle diverse tasks.
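
As a rough illustration of how these two strategies might fit together inside one Transformer layer, the sketch below runs a few always-active shared experts on every token and routes each token to its top-k choices among many small, fine-grained experts. The module name and all sizes are illustrative assumptions for exposition, not the actual DeepSeekMoE code.

import torch
import torch.nn as nn
import torch.nn.functional as F

def small_ffn(d_model, d_hidden):
    # A narrow feed-forward block standing in for one fine-grained expert.
    return nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))

class SharedPlusRoutedMoE(nn.Module):
    # Illustrative layer: shared experts always run; fine-grained experts are routed top-k.
    def __init__(self, d_model=1024, n_shared=2, n_routed=64, d_expert=512, k=6):
        super().__init__()
        self.k = k
        self.shared_experts = nn.ModuleList(small_ffn(d_model, d_expert) for _ in range(n_shared))
        self.routed_experts = nn.ModuleList(small_ffn(d_model, d_expert) for _ in range(n_routed))
        self.router = nn.Linear(d_model, n_routed, bias=False)

    def forward(self, x):  # x: (num_tokens, d_model)
        # Shared experts capture common knowledge and see every token.
        shared_out = sum(expert(x) for expert in self.shared_experts)
        # Fine-grained experts specialize: each token activates only its top-k of them.
        weights, indices = F.softmax(self.router(x), dim=-1).topk(self.k, dim=-1)
        routed_out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.routed_experts):
                mask = indices[:, slot] == e
                if mask.any():
                    routed_out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return x + shared_out + routed_out  # residual connection, as in a standard Transformer block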

Trained from scratch on a massive dataset of 2 trillion English and Chinese tokens, DeepSeekMoE 16B achieves performance comparable to DeepSeek 7B and LLaMA2 7B, while using only about 40% of the computations. This makes it an attractive option for researchers and developers seeking to leverage the power of large language models without the prohibitive computational costs.

Performance Evaluation

DeepSeek AI has rigorously evaluated DeepSeekMoE 16B on various benchmarks, demonstrating its competitive performance:

DeepSeekMoE 16B Base

  • Open LLM Leaderboard: DeepSeekMoE 16B consistently outperforms open-source models with a similar number of activated parameters.
  • Internal Benchmarks: Achieves comparable performance to DeepSeek 7B (a dense model trained on the same data) with only 40.5% of the computations.
  • Comparison with LLaMA2 7B: Outperforms LLaMA2 7B on most benchmarks while using only 39.6% of the computations.

DeepSeekMoE 16B Chat

  • Achieves comparable or better performance than DeepSeek 7B Chat and LLaMA2 7B SFT, while using around 40% of the computations.

These results highlight the efficiency and effectiveness of the DeepSeekMoE architecture, demonstrating its ability to deliver strong performance with reduced computational demands.

Getting Started with DeepSeekMoE

DeepSeek AI has made both the base and chat versions of DeepSeekMoE 16B publicly available, supporting both academic and commercial use. The models can be downloaded from the Hugging Face Hub as deepseek-ai/deepseek-moe-16b-base and deepseek-ai/deepseek-moe-16b-chat.

Installation

To get started, clone the DeepSeek-MoE repository, make sure you have a Python environment (version 3.8 or later), and install the necessary dependencies from the repository root:

pip install -r requirements.txt

Inference with Hugging Face Transformers

You can readily use Hugging Face's Transformers library for model inference; the 16B models can be deployed on a single GPU with 40GB of memory without quantization. Here are examples for both text completion and chat completion:

Text Completion

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

model_name = "deepseek-ai/deepseek-moe-16b-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Load the weights in bfloat16 and let device_map="auto" place them across available GPUs.
# trust_remote_code is needed because the MoE architecture ships as custom modeling code on the Hub.
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True)
model.generation_config = GenerationConfig.from_pretrained(model_name)
model.generation_config.pad_token_id = model.generation_config.eos_token_id

text = "An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is"
inputs = tokenizer(text, return_tensors="pt")
# Continue the prompt with up to 100 newly generated tokens.
outputs = model.generate(**inputs.to(model.device), max_new_tokens=100)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(result)

Chat Completion

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

model_name = "deepseek-ai/deepseek-moe-16b-chat"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# As with the base model, the custom modeling code on the Hub requires trust_remote_code.
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True)
model.generation_config = GenerationConfig.from_pretrained(model_name)
model.generation_config.pad_token_id = model.generation_config.eos_token_id

messages = [
    {"role": "user", "content": "Who are you?"}
]

# Apply the chat template so the conversation is formatted the way the model was trained.
input_tensor = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(input_tensor.to(model.device), max_new_tokens=100)
# Decode only the newly generated tokens, skipping the prompt portion.
result = tokenizer.decode(outputs[0][input_tensor.shape[1]:], skip_special_tokens=True)

print(result)

Fine-tuning DeepSeekMoE

DeepSeek AI provides a finetune/finetune.py script for fine-tuning the models on downstream tasks. The script supports training with DeepSpeed.

Preparing the Data

Prepare your training data as a JSON-lines file that follows the repository's sample dataset format: each line is one JSON object with an instruction field (the prompt) and an output field (the target response).
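
As an example, here is a minimal script that writes a training file in this format. The field names follow the repository's sample data; the records themselves are placeholders.

import json

# Two placeholder records; each line of the training file is one JSON object
# with an "instruction" field (the prompt) and an "output" field (the target response).
samples = [
    {"instruction": "Summarize the following sentence: The cat sat on the mat.", "output": "A cat sat on a mat."},
    {"instruction": "Translate 'good morning' into French.", "output": "Bonjour."},
]

with open("train_data.jsonl", "w", encoding="utf-8") as f:
    for sample in samples:
        f.write(json.dumps(sample, ensure_ascii=False) + "\n")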

Fine-tuning with DeepSpeed

Here's an example of how to fine-tune DeepSeekMoE using DeepSpeed:

DATA_PATH="<your_data_path>"
OUTPUT_PATH="<your_output_path>"
MODEL_PATH="<your_model_path>"
cd finetune
deepspeed finetune.py \
 --model_name_or_path $MODEL_PATH \
 --data_path $DATA_PATH \
 --output_dir $OUTPUT_PATH \
 --num_train_epochs 3 \
 --model_max_length 1024 \
 --per_device_train_batch_size 16 \
 --per_device_eval_batch_size 1 \
 --gradient_accumulation_steps 4 \
 --evaluation_strategy "no" \
 --save_strategy "steps" \
 --save_steps 100 \
 --save_total_limit 100 \
 --learning_rate 2e-5 \
 --warmup_steps 10 \
 --logging_steps 1 \
 --lr_scheduler_type "cosine" \
 --gradient_checkpointing True \
 --report_to "tensorboard" \
 --deepspeed configs/ds_config_zero3.json \
 --bf16 True \
 --use_lora False
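
After training, and assuming the DeepSpeed run leaves a consolidated Hugging Face-format checkpoint in OUTPUT_PATH (with ZeRO-3 this may require the usual weight-gathering step), the fine-tuned model can be loaded for inference just like the released checkpoints. The path below is a placeholder.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

checkpoint_dir = "<your_output_path>"  # the directory passed as OUTPUT_PATH above
tokenizer = AutoTokenizer.from_pretrained(checkpoint_dir)
model = AutoModelForCausalLM.from_pretrained(checkpoint_dir, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True)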

Licensing and Citation

The code in the DeepSeekMoE repository is released under the MIT License, while the model weights are covered by a separate Model License that permits commercial use.

If you use DeepSeekMoE in your research, please cite the following paper:

@article{dai2024deepseekmoe,
  author  = {Damai Dai and Chengqi Deng and Chenggang Zhao and R. X. Xu and Huazuo Gao and Deli Chen and Jiashi Li and Wangding Zeng and Xingkai Yu and Y. Wu and Zhenda Xie and Y. K. Li and Panpan Huang and Fuli Luo and Chong Ruan and Zhifang Sui and Wenfeng Liang},
  title   = {DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models},
  journal = {CoRR},
  volume  = {abs/2401.06066},
  year    = {2024},
  url     = {https://arxiv.org/abs/2401.06066}
}

Conclusion

DeepSeekMoE represents a significant advancement in Mixture-of-Experts language models. Its innovative architecture, strong performance, and permissive licensing make it a valuable resource for researchers and developers seeking to push the boundaries of NLP. By leveraging expert specialization, DeepSeekMoE demonstrates a path toward more efficient and scalable language models, paving the way for future innovations in the field.
