In the rapidly evolving field of Natural Language Processing (NLP), Mixture-of-Experts (MoE) models are emerging as a promising approach to scaling up language models while maintaining computational efficiency. DeepSeek AI has recently introduced DeepSeekMoE, a 16.4 billion parameter MoE language model that showcases the potential of this architecture. This article delves into the key features, evaluation results, and practical applications of DeepSeekMoE.
Before diving into the specifics of DeepSeekMoE, let's briefly understand the concept of MoE. Traditional language models typically use a single, dense architecture in which every parameter participates in processing every token. MoE models, on the other hand, consist of multiple "expert" sub-networks. For each input token, a routing mechanism selects a small subset of these experts to process it, allowing different experts to specialize in different aspects of the language. This lets MoE models grow their total parameter count (and thus capacity) while keeping the per-token computation close to that of a much smaller dense model.
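To make the routing idea concrete, here is a minimal, hypothetical sketch of top-k routing in PyTorch. The sizes, the random router weights, and the softmax gating are illustrative assumptions for the example, not DeepSeekMoE's actual implementation.

import torch
import torch.nn.functional as F

# Toy illustration of MoE routing: for each token, a router scores all experts
# and only the top-k highest-scoring experts are chosen to process that token.
num_tokens, d_model, num_experts, top_k = 4, 8, 6, 2

token_embeddings = torch.randn(num_tokens, d_model)
router_weights = torch.randn(d_model, num_experts)      # stand-in for a learned routing matrix

scores = F.softmax(token_embeddings @ router_weights, dim=-1)   # (num_tokens, num_experts)
gate_values, chosen_experts = scores.topk(top_k, dim=-1)        # keep only the top-k per token

for t in range(num_tokens):
    print(f"token {t} -> experts {chosen_experts[t].tolist()} with gates {gate_values[t].tolist()}")

Each token ends up with its own small set of experts and a weight for each, which is exactly the property that lets the full model hold many parameters while touching only a few of them per token.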
DeepSeekMoE 16B is a Mixture-of-Experts language model developed by DeepSeek AI. It stands out due to its innovative MoE architecture, which incorporates two key strategies (sketched in toy form below):

1. Fine-grained expert segmentation: each expert is split into several smaller experts, and correspondingly more experts are activated per token, giving the router far more flexible combinations of specialists.

2. Shared expert isolation: a small number of experts are always active for every token to capture common knowledge, freeing the routed experts to specialize more sharply.
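The following toy sketch shows how shared experts and routed fine-grained experts can be combined in one layer. All sizes, the top-k value, and the layer structure are made-up example values, not DeepSeekMoE's actual configuration or code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyDeepSeekStyleMoE(nn.Module):
    """Illustrative layer with shared + routed fine-grained experts (toy sizes)."""
    def __init__(self, d_model=64, d_hidden=32, num_shared=2, num_routed=16, top_k=4):
        super().__init__()
        def make_expert():
            # Fine-grained experts: small hidden size, but many of them.
            return nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
        self.shared_experts = nn.ModuleList(make_expert() for _ in range(num_shared))
        self.routed_experts = nn.ModuleList(make_expert() for _ in range(num_routed))
        self.router = nn.Linear(d_model, num_routed)
        self.top_k = top_k

    def forward(self, x):                                    # x: (num_tokens, d_model)
        # Shared experts process every token unconditionally.
        out = sum(expert(x) for expert in self.shared_experts)
        # Routed experts: each token is sent only to its top-k scoring experts.
        scores = F.softmax(self.router(x), dim=-1)
        weights, indices = scores.topk(self.top_k, dim=-1)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.routed_experts):
                mask = indices[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(5, 64)
print(ToyDeepSeekStyleMoE()(tokens).shape)  # torch.Size([5, 64])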
Trained from scratch on a massive dataset of 2 trillion English and Chinese tokens, DeepSeekMoE 16B achieves performance comparable to DeepSeek 7B and LLaMA2 7B, while using only about 40% of the computations. This makes it an attractive option for researchers and developers seeking to leverage the power of large language models without the prohibitive computational costs.
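To see roughly where the 40% figure comes from, note that DeepSeekMoE 16B is commonly reported to activate only about 2.8 billion of its 16.4 billion parameters for any given token. Comparing activated parameters against a dense 7B model gives a back-of-the-envelope estimate (not an exact FLOP count):

2.8B activated parameters / ~7B dense parameters ≈ 0.4, i.e. roughly 40% of the per-token computation.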
DeepSeek AI has rigorously evaluated DeepSeekMoE 16B on a wide range of standard benchmarks covering language understanding, reasoning, code, and both English and Chinese tasks, where it performs on par with dense 7B models such as DeepSeek 7B and LLaMA2 7B despite using far less computation per token.
These results highlight the efficiency and effectiveness of the DeepSeekMoE architecture, demonstrating its ability to deliver strong performance with reduced computational demands.
DeepSeek AI has made both the base and chat versions of DeepSeekMoE 16B available to the public, encouraging both academic and commercial research. The models can be downloaded directly from the Hugging Face Hub.
To get started, make sure you have Python 3.8 or later, clone the DeepSeek-MoE repository, and install its dependencies:
pip install -r requirements.txt
You can readily use Hugging Face's Transformers library for model inference. Here are examples for both text completion and chat completion:
Text completion:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

# Load the base model; trust_remote_code lets Transformers use the custom MoE
# modeling code that ships with the checkpoint on the Hugging Face Hub.
model_name = "deepseek-ai/deepseek-moe-16b-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True)
model.generation_config = GenerationConfig.from_pretrained(model_name)
model.generation_config.pad_token_id = model.generation_config.eos_token_id

# Plain text completion: the model continues the prompt.
text = "An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs.to(model.device), max_new_tokens=100)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
Chat completion:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

# Load the chat-tuned variant; as above, trust_remote_code pulls in the custom MoE architecture.
model_name = "deepseek-ai/deepseek-moe-16b-chat"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True)
model.generation_config = GenerationConfig.from_pretrained(model_name)
model.generation_config.pad_token_id = model.generation_config.eos_token_id

# Build the prompt with the model's chat template, then decode only the newly generated tokens.
messages = [
    {"role": "user", "content": "Who are you?"}
]
input_tensor = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(input_tensor.to(model.device), max_new_tokens=100)
result = tokenizer.decode(outputs[0][input_tensor.shape[1]:], skip_special_tokens=True)
print(result)
DeepSeek AI provides a finetune/finetune.py script for fine-tuning the models on downstream tasks. The script supports training with DeepSpeed.
Prepare your training data in the expected sample dataset format: each line of the training file is a JSON object with two fields, instruction and output.
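As an illustration, here is a minimal sketch that writes such a file, assuming the one-JSON-object-per-line layout described above; the file name train_data.jsonl and the sample contents are made up for the example.

import json

# Hypothetical examples; only the "instruction" and "output" fields matter.
samples = [
    {"instruction": "Explain what a mixture-of-experts layer does.",
     "output": "It routes each token to a small subset of expert sub-networks and combines their outputs."},
    {"instruction": "Translate 'good morning' into French.",
     "output": "Bonjour."},
]

# Write one JSON object per line.
with open("train_data.jsonl", "w", encoding="utf-8") as f:
    for sample in samples:
        f.write(json.dumps(sample, ensure_ascii=False) + "\n")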
Here's an example of how to fine-tune DeepSeekMoE using DeepSpeed:
DATA_PATH="<your_data_path>"
OUTPUT_PATH="<your_output_path>"
MODEL_PATH="<your_model_path>"
cd finetune
deepspeed finetune.py \
--model_name_or_path $MODEL_PATH \
--data_path $DATA_PATH \
--output_dir $OUTPUT_PATH \
--num_train_epochs 3 \
--model_max_length 1024 \
--per_device_train_batch_size 16 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 4 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 100 \
--save_total_limit 100 \
--learning_rate 2e-5 \
--warmup_steps 10 \
--logging_steps 1 \
--lr_scheduler_type "cosine" \
--gradient_checkpointing True \
--report_to "tensorboard" \
--deepspeed configs/ds_config_zero3.json \
--bf16 True \
--use_lora False
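After training finishes, the checkpoints written to OUTPUT_PATH can be loaded the same way as the published models. Here is a minimal sketch, assuming the run produced a consolidated Hugging Face-format checkpoint directly under that directory (DeepSpeed ZeRO-3 runs may require consolidating shards first):

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Placeholder path: point it at the directory your fine-tuning run wrote to.
output_path = "<your_output_path>"

tokenizer = AutoTokenizer.from_pretrained(output_path)
model = AutoModelForCausalLM.from_pretrained(output_path, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True)

inputs = tokenizer("Write a short note about mixture-of-experts models.", return_tensors="pt")
outputs = model.generate(**inputs.to(model.device), max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))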
The DeepSeekMoE code repository is released under the MIT License, while the models themselves are covered by a separate Model License that permits commercial use.
If you use DeepSeekMoE in your research, please cite the following paper:
@article{dai2024deepseekmoe,
  author  = {Damai Dai and Chengqi Deng and Chenggang Zhao and R. X. Xu and Huazuo Gao and Deli Chen and Jiashi Li and Wangding Zeng and Xingkai Yu and Y. Wu and Zhenda Xie and Y. K. Li and Panpan Huang and Fuli Luo and Chong Ruan and Zhifang Sui and Wenfeng Liang},
  title   = {DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models},
  journal = {CoRR},
  volume  = {abs/2401.06066},
  year    = {2024},
  url     = {https://arxiv.org/abs/2401.06066}
}
DeepSeekMoE represents a significant advancement in Mixture-of-Experts language models. Its innovative architecture, strong performance, and permissive licensing make it a valuable resource for researchers and developers seeking to push the boundaries of NLP. By leveraging expert specialization, DeepSeekMoE demonstrates a path toward more efficient and scalable language models, paving the way for future innovations in the field. The code, model weights, and fine-tuning scripts are all available through the DeepSeek-MoE repository and the Hugging Face Hub for further exploration.