The world of AI-powered code generation is rapidly evolving, and DeepSeek Coder is at the forefront of this technological shift. Developed by DeepSeek AI, this innovative tool promises to "Let the Code Write Itself," offering developers of all levels a powerful assistant capable of handling a wide range of coding tasks.
DeepSeek Coder is a series of code language models trained from the ground up on a massive dataset of 2 trillion tokens. This dataset consists of 87% code and 13% natural language (both English and Chinese), providing the model with a comprehensive understanding of both the syntax and context of code.
Key features of DeepSeek Coder include its massive 2-trillion-token training corpus, a range of model sizes (from 1.3B up to 33B parameters), a 16K context window, and a fill-in-the-middle training objective that enables code insertion as well as completion.
DeepSeek Coder supports a wide range of programming languages, among them Python, Java, C++, JavaScript, Go, and Rust, ensuring its utility across diverse development environments.
DeepSeek Coder shines in a variety of coding tasks, including code completion, code insertion (fill-in-the-middle), and instruction-following chat, each of which is demonstrated in the examples below.
DeepSeek Coder's capabilities are validated through rigorous evaluations. On benchmarks like HumanEval, MBPP, and DS-1000, the 33B parameter base model significantly outperforms existing open-source code LLMs like CodeLlama-34B. Impressively, the 7B parameter model often matches the performance of the larger CodeLlama-34B. The instruction-tuned model also rivals GPT-3.5 Turbo on HumanEval. These benchmarks showcase DeepSeek Coder's enhanced ability to understand and generate code.
DeepSeek Coder is designed for ease of use, with readily available resources and clear instructions for implementation.
To get started, install the necessary dependencies using pip:

pip install -r requirements.txt
Here are a few examples demonstrating DeepSeek Coder's capabilities.

Code completion: the base model continues a prompt, here generating a function from a natural-language comment.
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load the 6.7B base model in bfloat16 and move it to the GPU.
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-6.7b-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-6.7b-base", trust_remote_code=True, torch_dtype=torch.bfloat16).cuda()

# A comment describing the desired code serves as the prompt.
input_text = "#write a quick sort algorithm"
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
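Run as-is, this should print the prompt comment followed by a generated quick sort implementation in Python (the exact output varies between model versions and decoding settings).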
Code insertion: using the fill-in-the-middle (FIM) format, the model fills a marked gap in the middle of existing code.

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Same base model setup as in the completion example.
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-6.7b-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-6.7b-base", trust_remote_code=True, torch_dtype=torch.bfloat16).cuda()
input_text = """<|fim begin|>def quick_sort(arr):
if len(arr) <= 1:
return arr
pivot = arr[0]
left = []
right = []
<|fim hole|>
if arr[i] < pivot:
left.append(arr[i])
else:
right.append(arr[i])
return quick_sort(left) + [pivot] + quick_sort(right)<|fim end|>"""
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=128)
# Slice off the prompt so only the generated infill is printed.
print(tokenizer.decode(outputs[0], skip_special_tokens=True)[len(input_text):])
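If everything works, the model should fill the hole with the missing loop header, something like for i in range(1, len(arr)):, which completes the partition step of the algorithm.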
Chat model inference: the instruction-tuned variant answers conversational requests via the tokenizer's chat template.

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load the instruction-tuned 6.7B model.
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-6.7b-instruct", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-6.7b-instruct", trust_remote_code=True, torch_dtype=torch.bfloat16).cuda()
messages = [
    {'role': 'user', 'content': "write a quick sort algorithm in python."}
]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
# tokenizer.eos_token_id is the id of the <|EOT|> token.
# do_sample=False means greedy decoding, so top_k and top_p have no effect here.
outputs = model.generate(inputs, max_new_tokens=512, do_sample=False, top_k=50, top_p=0.95, num_return_sequences=1, eos_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0][len(inputs[0]):], skip_special_tokens=True))
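Because eos_token_id is set to the <|EOT|> token, generation stops cleanly at the end of the assistant's reply, and the printed text should contain a complete quick sort implementation.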
DeepSeek Coder can be further customized through fine-tuning on specific datasets. A dedicated script (finetune_deepseekcoder.py) and instructions are provided to guide users through this process. By fine-tuning the model, developers can optimize its performance for specialized tasks and domains.
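The repository's script handles this at scale; as a minimal conceptual sketch (not the script itself), a causal-LM fine-tune with the Hugging Face Trainer looks roughly like the following. The dataset file, sequence length, and hyperparameters below are placeholders, and a model of this size realistically requires multiple GPUs or parameter-efficient methods.

from transformers import (AutoTokenizer, AutoModelForCausalLM, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)
from datasets import load_dataset
import torch

model_name = "deepseek-ai/deepseek-coder-6.7b-base"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True,
                                             torch_dtype=torch.bfloat16)

# Placeholder corpus: any plain-text file of code works for this sketch.
dataset = load_dataset("text", data_files={"train": "my_code_corpus.txt"})["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Illustrative hyperparameters, not the repository's defaults.
args = TrainingArguments(
    output_dir="deepseek-coder-finetuned",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    num_train_epochs=1,
    learning_rate=2e-5,
    bf16=True,
    logging_steps=10,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    # mlm=False gives standard next-token (causal) language modeling labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()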
To achieve high-throughput inference, DeepSeek Coder supports integration with vLLM, an open-source engine for fast LLM inference and serving. Detailed examples are provided for both text completion and chat completion tasks.
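As an illustration of the text-completion path, a minimal vLLM script looks roughly like this; the sampling settings here are assumptions for the sketch, not the repository's exact example.

from vllm import LLM, SamplingParams

# Illustrative sampling settings; tune for your workload.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

# vLLM loads the model once and batches incoming prompts for throughput.
llm = LLM(model="deepseek-ai/deepseek-coder-6.7b-base")

prompts = ["#write a quick sort algorithm"]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)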
DeepSeek Coder represents a significant leap forward in AI-powered code generation. Its massive training dataset, flexible model sizes, and state-of-the-art performance make it an invaluable tool for developers seeking to enhance their productivity and code quality. As the tool continues to evolve, it is poised to transform the landscape of software development, making coding more efficient, accessible, and innovative.