The world of open-source large language models (LLMs) is constantly evolving, and DeepSeek R1 is a name you need to know. This model rivals OpenAI's o1 in reasoning capabilities while remaining fully open-source. But how can you run it without massive hardware investments? The answer lies in dynamic quantization, specifically the 1.58-bit version produced by Unsloth.
DeepSeek R1 is a 671-billion-parameter open-source model. Its reasoning and code generation abilities make it valuable for a wide range of applications.
The sheer size of DeepSeek R1 (originally 720GB) poses a significant hurdle for many users. Running such a massive model requires substantial VRAM and processing power, limiting accessibility.
Unsloth tackled this challenge head-on, quantizing DeepSeek R1 from 720GB down to 131GB, a reduction of roughly 80%. This was achieved with dynamic quantization.
Instead of naively quantizing all layers, Unsloth's dynamic quantization selectively applies different bit precisions to different layers based on their sensitivity. This maintains functionality while minimizing size.
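To make the idea concrete, here is a toy sketch of planning a mixed-precision quantization. It is not Unsloth's actual procedure: the tensor names, parameter counts, bit widths, and the keyword-based notion of "sensitive" are all assumptions chosen for illustration.

```python
# Toy illustration of mixed-precision ("dynamic") quantization planning.
# Tensor names, sizes, and bit widths below are hypothetical.

def assign_bits(layers, sensitive_keywords, high_bits=4.0, low_bits=1.58):
    """Give sensitive tensors more bits; everything else gets the low setting."""
    plan = {}
    for name in layers:
        is_sensitive = any(key in name for key in sensitive_keywords)
        plan[name] = high_bits if is_sensitive else low_bits
    return plan

def average_bits(plan, layers):
    """Parameter-weighted average bit width across the whole plan."""
    total_params = sum(layers.values())
    total_bits = sum(plan[name] * count for name, count in layers.items())
    return total_bits / total_params

# Hypothetical parameter counts (in millions) for four tensors.
layers = {
    "attn.q_proj": 50,
    "mlp.experts.up_proj": 400,
    "mlp.experts.gate_proj": 400,
    "mlp.experts.down_proj": 150,
}

plan = assign_bits(layers, sensitive_keywords=["attn", "down_proj"])
print(plan)
print(f"average bits per weight: {average_bits(plan, layers):.2f}")
```

Because most parameters sit in the tensors that receive the low setting, the average stays close to the low bit width even though the sensitive tensors keep more precision.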
Unsloth provides several dynamic quantized versions, each offering a trade-off between size and quality:
MoE Bits | Disk Size | Type | Quality | Down_proj |
---|---|---|---|---|
1.58-bit | 131GB | IQ1_S | Fair | 2.06/1.56bit |
1.73-bit | 158GB | IQ1_M | Good | 2.06bit |
2.22-bit | 183GB | IQ2_XXS | Better | 3.5/2.5bit |
2.51-bit | 212GB | Q2_K_XL | Best | 3.5/2.5bit |
The full collection of GGUFs, including 4-bit and distilled versions, is available from Unsloth on Hugging Face.
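As a rough aid for choosing between these versions, the sketch below picks the largest quant that fits a size budget. The sizes and quality labels are copied from the table; the function name and the idea of a single budget in GB are assumptions made for this example.

```python
# (type, disk size in GB, quality label), copied from the table above.
QUANTS = [
    ("IQ1_S", 131, "Fair"),
    ("IQ1_M", 158, "Good"),
    ("IQ2_XXS", 183, "Better"),
    ("Q2_K_XL", 212, "Best"),
]

def largest_quant_that_fits(budget_gb):
    """Return the biggest quant whose file size fits the budget, or None."""
    fitting = [quant for quant in QUANTS if quant[1] <= budget_gb]
    return max(fitting, key=lambda quant: quant[1]) if fitting else None

print(largest_quant_that_fits(200))  # a 200GB budget
print(largest_quant_that_fits(100))  # smaller than every listed quant
```

A budget below 131GB returns `None`, which reflects that even the smallest dynamic quant in the table will not fit.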
To rigorously test the quantized models, Unsloth asked each one to create a Flappy Bird game and scored the output against 10 criteria (Pass@3 score). The key finding:
Why dynamic quantization matters: it preserves the model's ability to generate coherent and functional code.
Unsloth's success stems from a deep understanding of DeepSeek R1's architecture. In particular, the down_proj matrices are extremely sensitive to quantization due to how SwiGLU operates.
Dynamic R1 quants are compatible with any system that runs GGUFs (Ollama, Open WebUI, Transformers).
To run the model with llama.cpp, first build it from source:
apt-get update
apt-get install build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggerganov/llama.cpp
cmake llama.cpp -B llama.cpp/build -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-quantize llama-cli llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp
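Before moving on, it can help to confirm the build produced the binaries used later. The following is a minimal sketch, assuming the commands above were run from the current directory so the binaries were copied into `llama.cpp/`; the helper name is hypothetical.

```python
from pathlib import Path

# Binaries built and copied by the commands above.
EXPECTED = ["llama-cli", "llama-quantize", "llama-gguf-split"]

def missing_binaries(root="llama.cpp", names=EXPECTED):
    """Return the expected binary names that are not present directly under root."""
    root = Path(root)
    return [name for name in names if not (root / name).is_file()]

missing = missing_binaries()
if missing:
    print("Missing binaries:", ", ".join(missing))
else:
    print("All llama.cpp binaries are in place.")
```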
Download the model with the Python snippet below. The allow_patterns filter fetches only the 1.58-bit (UD-IQ1_S) files.
# pip install huggingface_hub hf_transfer
# import os # Optional for faster downloading
# os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
from huggingface_hub import snapshot_download
snapshot_download(
repo_id = "unsloth/DeepSeek-R1-GGUF",
local_dir = "DeepSeek-R1-GGUF",
allow_patterns = ["*UD-IQ1_S*"],
)
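After the download finishes, a quick sanity check is to count the GGUF files and total their size. This is a sketch that assumes the `local_dir` used above; the function name is made up for the example. Judging by the shard names and table in this article, the 1.58-bit quant should show three files totalling roughly 131GB.

```python
from pathlib import Path

def summarize_gguf(local_dir="DeepSeek-R1-GGUF"):
    """Return (file_count, total_size_gb) for all .gguf files under local_dir."""
    files = sorted(Path(local_dir).rglob("*.gguf"))
    total_bytes = sum(f.stat().st_size for f in files)
    return len(files), total_bytes / 1e9

count, size_gb = summarize_gguf()
print(f"{count} GGUF file(s), {size_gb:.1f} GB on disk")
```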
Use the following formula to determine the number of layers to offload to your GPU:
n_offload = (VRAM (GB) / Filesize (GB)) * n_layers - 4
where n_layers for DeepSeek R1 is 61.
Quant | File Size | 24GB GPU | 80GB GPU | 2x80GB GPU |
---|---|---|---|---|
1.58bit | 131GB | 7 | 33 | All layers 61 |
1.73bit | 158GB | 5 | 26 | 57 |
2.22bit | 183GB | 4 | 22 | 49 |
2.51bit | 212GB | 2 | 19 | 32 |
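The offload formula above is easy to script. The sketch below assumes the result is rounded down and clamped to the range 0 to n_layers, which is how the 1.58bit row of the table appears to have been derived; the function name is hypothetical.

```python
import math

N_LAYERS = 61  # layer count for DeepSeek R1, as stated above

def layers_to_offload(vram_gb, file_size_gb, n_layers=N_LAYERS):
    """n_offload = (VRAM / file size) * n_layers - 4, floored and clamped."""
    raw = vram_gb / file_size_gb * n_layers - 4
    return max(0, min(n_layers, math.floor(raw)))

# 1.58-bit quant (131GB file) on 24GB, 80GB, and 2x80GB of VRAM.
for vram in (24, 80, 160):
    print(f"{vram}GB VRAM -> {layers_to_offload(vram, 131)} layers")
# -> 7, 33, and 61 layers respectively
```

The subtraction of 4 leaves headroom on the GPU, so treat the result as a starting point and lower it if you run out of memory.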
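Run the model with llama-cli. This example offloads 7 layers, the table's value for the 1.58bit quant on a 24GB GPU:
./llama.cpp/llama-cli \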
--model DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
--cache-type-k q4_0 \
--threads 16 \
--prio 2 \
--temp 0.6 \
--ctx-size 8192 \
--seed 3407 \
--n-gpu-layers 7 \
-no-cnv \
--prompt "<|User|>Create a Flappy Bird game in Python.<|Assistant|>"
With more VRAM, raise --n-gpu-layers. This variant offloads 59 of the 61 layers:
./llama.cpp/llama-cli \
--model DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
--cache-type-k q4_0 \
--threads 16 \
--prio 2 \
--temp 0.6 \
--ctx-size 8192 \
--seed 3407 \
--n-gpu-layers 59 \
-no-cnv \
--prompt "<|User|>Create a Flappy Bird game in Python.<|Assistant|>"
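Since the two commands above differ only in `--n-gpu-layers`, it can be convenient to assemble them from a script. This is a minimal sketch using only the flags shown above; the function name and its parameters are assumptions, and it presumes `llama-cli` sits at `./llama.cpp/llama-cli`.

```python
import shlex

def build_llama_cli_command(model_path, n_gpu_layers, prompt,
                            threads=16, ctx_size=8192):
    """Assemble the llama-cli argument list shown above."""
    return [
        "./llama.cpp/llama-cli",
        "--model", model_path,
        "--cache-type-k", "q4_0",
        "--threads", str(threads),
        "--prio", "2",
        "--temp", "0.6",
        "--ctx-size", str(ctx_size),
        "--seed", "3407",
        "--n-gpu-layers", str(n_gpu_layers),
        "-no-cnv",
        "--prompt", prompt,
    ]

cmd = build_llama_cli_command(
    "DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf",
    n_gpu_layers=7,
    prompt="<|User|>Create a Flappy Bird game in Python.<|Assistant|>",
)
print(shlex.join(cmd))  # shell-quoted form, safe to paste into a terminal
# To launch it from Python instead: subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell-quoting problems with the `<|User|>` and `<|Assistant|>` markers in the prompt.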
To run the model in Open WebUI, follow the official tutorial: docs.openwebui.com/tutorials/integrations/deepseekr1-dynamic/
For Ollama, merge the GGUF split files first:
./llama.cpp/llama-gguf-split --merge \
DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
merged_file.gguf
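The merge command takes the first split file as its input. The sketch below locates that file and prints the corresponding command; it assumes the download directory layout used above and a `llama-gguf-split` binary at `./llama.cpp/`, and both helper names are hypothetical.

```python
import re
import shlex
from pathlib import Path

# Matches the first file of a split GGUF, e.g. "...-00001-of-00003.gguf".
FIRST_SHARD = re.compile(r"-00001-of-\d{5}\.gguf$")

def find_first_shard(quant_dir):
    """Return the path of the first split file in quant_dir, or None."""
    for path in sorted(Path(quant_dir).glob("*.gguf")):
        if FIRST_SHARD.search(path.name):
            return path
    return None

def merge_command(first_shard, output="merged_file.gguf",
                  binary="./llama.cpp/llama-gguf-split"):
    """Build the llama-gguf-split --merge command shown above."""
    return [binary, "--merge", str(first_shard), output]

shard = find_first_shard("DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S")
if shard is None:
    print("No split GGUF found; download the model first.")
else:
    print(shlex.join(merge_command(shard)))
```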
Unsloth's dynamic quantization of DeepSeek R1 drastically lowers the barrier to entry for running this powerful reasoning model. By understanding the model's architecture and quantizing layers selectively, Unsloth keeps the model functional at a fraction of its original size and hardware requirements. Dive in and try DeepSeek R1 on your own hardware!