The world of AI-powered code generation is rapidly evolving, and DeepSeek Coder is at the forefront of this technological shift. Developed by DeepSeek AI, this innovative tool promises to "Let the Code Write Itself," offering developers of all levels a powerful assistant capable of handling a wide range of coding tasks.
DeepSeek Coder is a series of code language models trained from the ground up on a massive dataset of 2 trillion tokens. This dataset consists of 87% code and 13% natural language (both English and Chinese), providing the model with a comprehensive understanding of both the syntax and context of code.
Key features of DeepSeek Coder include its massive 2-trillion-token training corpus, a range of model sizes (from 1.3B up to 33B parameters), a 16K context window, and a fill-in-the-middle training objective that enables code insertion as well as completion.
DeepSeek Coder supports a wide range of programming languages, among them Python, Java, C++, JavaScript, Go, and Rust, ensuring its utility across diverse development environments.
DeepSeek Coder shines in a variety of coding tasks, including code completion, code insertion (fill-in-the-middle), and instruction-following chat, each of which is demonstrated in the examples below.
DeepSeek Coder's capabilities are validated through rigorous evaluations. On benchmarks like HumanEval, MBPP, and DS-1000, the 33B parameter base model significantly outperforms existing open-source code LLMs like CodeLlama-34B. Impressively, the 7B parameter model often matches the performance of the larger CodeLlama-34B. The instruction-tuned model also rivals GPT-3.5 Turbo on HumanEval. These benchmarks showcase DeepSeek Coder's enhanced ability to understand and generate code.
DeepSeek Coder is designed for ease of use, with readily available resources and clear instructions for implementation.
To get started, install the necessary dependencies using pip:

pip install -r requirements.txt
Here are a few examples demonstrating DeepSeek Coder's capabilities.

Code completion: the base model continues a prompt, here generating a function from a natural-language comment.
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load the 6.7B base model in bfloat16 and move it to the GPU.
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-6.7b-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-6.7b-base", trust_remote_code=True, torch_dtype=torch.bfloat16).cuda()

# A comment describing the desired code serves as the prompt.
input_text = "#write a quick sort algorithm"
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
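Run as-is, this should print the prompt comment followed by a generated quick sort implementation in Python (the exact output varies between model versions and decoding settings).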
Code insertion: using the fill-in-the-middle (FIM) format, the model fills a marked gap in the middle of existing code.

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Same base model setup as in the completion example.
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-6.7b-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-6.7b-base", trust_remote_code=True, torch_dtype=torch.bfloat16).cuda()
input_text = """<|fim begin|>def quick_sort(arr):
if len(arr) <= 1:
return arr
pivot = arr[0]
left = []
right = []
<|fim hole|>
if arr[i] < pivot:
left.append(arr[i])
else:
right.append(arr[i])
return quick_sort(left) + [pivot] + quick_sort(right)<|fim end|>"""
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=128)
# Slice off the prompt so only the generated infill is printed.
print(tokenizer.decode(outputs[0], skip_special_tokens=True)[len(input_text):])
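If everything works, the model should fill the hole with the missing loop header, something like for i in range(1, len(arr)):, which completes the partition step of the algorithm.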
Chat model inference: the instruction-tuned variant answers conversational requests via the tokenizer's chat template.

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load the instruction-tuned 6.7B model.
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-6.7b-instruct", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-6.7b-instruct", trust_remote_code=True, torch_dtype=torch.bfloat16).cuda()
messages = [
    {'role': 'user', 'content': "write a quick sort algorithm in python."}
]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
# tokenizer.eos_token_id is the id of the <|EOT|> token.
# do_sample=False means greedy decoding, so top_k and top_p have no effect here.
outputs = model.generate(inputs, max_new_tokens=512, do_sample=False, top_k=50, top_p=0.95, num_return_sequences=1, eos_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0][len(inputs[0]):], skip_special_tokens=True))
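Because eos_token_id is set to the <|EOT|> token, generation stops cleanly at the end of the assistant's reply, and the printed text should contain a complete quick sort implementation.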
DeepSeek Coder can be further customized through fine-tuning on specific datasets. A dedicated script (finetune_deepseekcoder.py) and instructions are provided to guide users through this process. By fine-tuning the model, developers can optimize its performance for specialized tasks and domains.
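The repository's script handles this at scale; as a minimal conceptual sketch (not the script itself), a causal-LM fine-tune with the Hugging Face Trainer looks roughly like the following. The dataset file, sequence length, and hyperparameters below are placeholders, and a model of this size realistically requires multiple GPUs or parameter-efficient methods.

from transformers import (AutoTokenizer, AutoModelForCausalLM, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)
from datasets import load_dataset
import torch

model_name = "deepseek-ai/deepseek-coder-6.7b-base"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True,
                                             torch_dtype=torch.bfloat16)

# Placeholder corpus: any plain-text file of code works for this sketch.
dataset = load_dataset("text", data_files={"train": "my_code_corpus.txt"})["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Illustrative hyperparameters, not the repository's defaults.
args = TrainingArguments(
    output_dir="deepseek-coder-finetuned",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    num_train_epochs=1,
    learning_rate=2e-5,
    bf16=True,
    logging_steps=10,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    # mlm=False gives standard next-token (causal) language modeling labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()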
To achieve high-throughput inference, DeepSeek Coder supports integration with vLLM, an open-source engine for fast LLM inference and serving. Detailed examples are provided for both text completion and chat completion tasks.
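As an illustration of the text-completion path, a minimal vLLM script looks roughly like this; the sampling settings here are assumptions for the sketch, not the repository's exact example.

from vllm import LLM, SamplingParams

# Illustrative sampling settings; tune for your workload.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

# vLLM loads the model once and batches incoming prompts for throughput.
llm = LLM(model="deepseek-ai/deepseek-coder-6.7b-base")

prompts = ["#write a quick sort algorithm"]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)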
DeepSeek Coder represents a significant leap forward in AI-powered code generation. Its massive training dataset, flexible model sizes, and state-of-the-art performance make it an invaluable tool for developers seeking to enhance their productivity and code quality. As the tool continues to evolve, it is poised to transform the landscape of software development, making coding more efficient, accessible, and innovative.