In the rapidly evolving field of artificial intelligence, code generation models are becoming increasingly sophisticated. DeepSeek Coder, developed by DeepSeek AI, stands out as a powerful tool designed to automate and enhance the coding process. This article delves into the features, capabilities, and applications of DeepSeek Coder, providing a comprehensive overview for developers and AI enthusiasts alike.
DeepSeek Coder is a series of code language models trained from scratch on a massive dataset of 2 trillion tokens. These tokens consist of 87% code and 13% natural language, covering both English and Chinese. The models are available in several sizes, ranging from 1.3 billion to 33 billion parameters, making them scalable and adaptable to different user needs.
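For reference, the released checkpoints are published under the deepseek-ai organization on the Hugging Face Hub. The mapping below lists commonly used variants (additional sizes are also available) and can serve as a quick lookup when choosing a model:

# Hugging Face model IDs for common DeepSeek Coder variants,
# in base and instruction-tuned versions.
DEEPSEEK_CODER_MODELS = {
    "1.3b-base": "deepseek-ai/deepseek-coder-1.3b-base",
    "1.3b-instruct": "deepseek-ai/deepseek-coder-1.3b-instruct",
    "6.7b-base": "deepseek-ai/deepseek-coder-6.7b-base",
    "6.7b-instruct": "deepseek-ai/deepseek-coder-6.7b-instruct",
    "33b-base": "deepseek-ai/deepseek-coder-33b-base",
    "33b-instruct": "deepseek-ai/deepseek-coder-33b-instruct",
}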
DeepSeek Coder supports a wide range of programming languages (more than 80), including Python, Java, C++, C#, JavaScript, TypeScript, Go, Rust, and PHP.
DeepSeek Coder has been evaluated on several coding benchmarks, including HumanEval, MBPP, and DS-1000, where it achieves state-of-the-art results among openly available code models of comparable size.
These results establish DeepSeek Coder as a leading open solution for code generation.
To start using DeepSeek Coder, follow these steps:
Clone the DeepSeek-Coder repository from GitHub and install the required dependencies using pip:
pip install -r requirements.txt
Use the following code snippet for code completion:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load the base model and its tokenizer, running the model on the GPU in bfloat16.
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-6.7b-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-6.7b-base", trust_remote_code=True, torch_dtype=torch.bfloat16).cuda()

# Prompt with a comment describing the desired code; the model completes it.
input_text = "#write a quick sort algorithm"
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Use the following snippet for code insertion (fill-in-the-middle), where the model fills in the missing piece between a given prefix and suffix:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-6.7b-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-6.7b-base", trust_remote_code=True, torch_dtype=torch.bfloat16).cuda()
input_text = """<|fim begin|>def quick_sort(arr):
if len(arr) <= 1:
return arr
pivot = arr[0]
left = []
right = []
<|fim hole|>
if arr[i] < pivot:
left.append(arr[i])
else:
right.append(arr[i])
return quick_sort(left) + [pivot] + quick_sort(right)<|fim end|>"""
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True)[len(input_text):])
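All the fill-in-the-middle format requires is a prefix, a hole marker, and a suffix wrapped in begin/end tokens. A small helper like the one below (a hypothetical convenience function, not part of the DeepSeek Coder API) can assemble such prompts, assuming the same special tokens as the example above:

# Hypothetical helper that assembles a fill-in-the-middle prompt from a
# code prefix and suffix, using the special tokens shown above.
def build_fim_prompt(prefix: str, suffix: str) -> str:
    return f"<｜fim▁begin｜>{prefix}<｜fim▁hole｜>{suffix}<｜fim▁end｜>"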
Use the instruction-tuned chat model for conversational code generation:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-6.7b-instruct", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-6.7b-instruct", trust_remote_code=True, torch_dtype=torch.bfloat16).cuda()
messages = [{"role": "user", "content": "write a quick sort algorithm in python."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
# tokenizer.eos_token_id is the id of <|EOT|> token
outputs = model.generate(inputs, max_new_tokens=512, do_sample=False, top_k=50, top_p=0.95, num_return_sequences=1, eos_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0][len(inputs[0]):], skip_special_tokens=True))
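For interactive use, you may want the reply to stream token by token rather than arrive all at once. Here is a minimal sketch using Hugging Face's TextStreamer, reusing the model, tokenizer, and inputs from the chat example above (the streaming setup is an addition, not part of the original example):

from transformers import TextStreamer

# Print tokens as they are generated; skip_prompt hides the echoed prompt.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
outputs = model.generate(
    inputs,
    max_new_tokens=512,
    do_sample=False,
    streamer=streamer,
    eos_token_id=tokenizer.eos_token_id,
)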
Fine-tuning DeepSeek Coder allows users to adapt the model to specific tasks. Here’s how to fine-tune the model:
Install the necessary packages:
pip install -r finetune/requirements.txt
Prepare your training data as a JSON file in which each line is a JSON-serialized object with instruction and output fields (a minimal sketch of this format follows), then execute the finetune_deepseekcoder.py script with appropriate parameters.
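The snippet below writes such a data file; the sample rows are hypothetical illustrations rather than content from the DeepSeek Coder repository:

import json

# Hypothetical sample rows; each line of the training file holds one JSON object
# with "instruction" and "output" fields.
samples = [
    {
        "instruction": "Write a Python function that reverses a string.",
        "output": "def reverse_string(s):\n    return s[::-1]",
    },
]

with open("train_data.json", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")

With the data file in place, launch training with DeepSpeed: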
deepspeed finetune_deepseekcoder.py \
--model_name_or_path $MODEL_PATH \
--data_path $DATA_PATH \
--output_dir $OUTPUT_PATH \
--num_train_epochs 3 \
--model_max_length 1024 \
--per_device_train_batch_size 16 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 4 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 100 \
--save_total_limit 100 \
--learning_rate 2e-5 \
--warmup_steps 10 \
--logging_steps 1 \
--lr_scheduler_type "cosine" \
--gradient_checkpointing True \
--report_to "tensorboard" \
--deepspeed configs/ds_config_zero3.json \
--bf16 True
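Once training completes, the resulting checkpoint can be loaded like any other Hugging Face model. A minimal sketch, assuming the script saved a standard Hugging Face checkpoint to the --output_dir path used above:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Hypothetical path; replace with the --output_dir used during training.
OUTPUT_PATH = "./deepseek-coder-6.7b-finetuned"
tokenizer = AutoTokenizer.from_pretrained(OUTPUT_PATH)
model = AutoModelForCausalLM.from_pretrained(OUTPUT_PATH, torch_dtype=torch.bfloat16).cuda()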
For high-throughput inference, DeepSeek Coder can be served with vLLM. The first example below runs the base model for plain text completion:
from vllm import LLM, SamplingParams
tp_size = 4 # Tensor Parallelism
sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=100)
model_name = "deepseek-ai/deepseek-coder-6.7b-base"
llm = LLM(model=model_name, trust_remote_code=True, gpu_memory_utilization=0.9, tensor_parallel_size=tp_size)
prompts = [
"If everyone in a country loves one another,",
"The research should also focus on the technologies",
"To determine if the label is correct, we need to"
]
outputs = llm.generate(prompts, sampling_params)
generated_text = [output.outputs[0].text for output in outputs]
print(generated_text)
The instruction-tuned chat model can be served with vLLM as well, applying the chat template to each conversation before generation:
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
tp_size = 4 # Tensor Parallelism
sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=100)
model_name = "deepseek-ai/deepseek-coder-6.7b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
llm = LLM(model=model_name, trust_remote_code=True, gpu_memory_utilization=0.9, tensor_parallel_size=tp_size)
messages_list = [
[{"role": "user", "content": "Who are you?"}],
[{"role": "user", "content": "What can you do?"}],
[{"role": "user", "content": "Explain Transformer briefly."}],
]
prompts = [tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False) for messages in messages_list]
sampling_params.stop = [tokenizer.eos_token]
outputs = llm.generate(prompts, sampling_params)
generated_text = [output.outputs[0].text for output in outputs]
print(generated_text)
DeepSeek Coder represents a significant advancement in AI-driven code generation. Its large-scale training data, range of model sizes, and strong performance on standard coding benchmarks make it a valuable asset for developers. Whether you're looking to automate code completion, fine-tune models for specific tasks, or leverage high-throughput inference, DeepSeek Coder offers a comprehensive solution. As AI continues to reshape the landscape of software development, tools like DeepSeek Coder will play a crucial role in enhancing productivity and innovation.