Unleashing DeepSeek R1: Run Powerful Reasoning Models with 1.58-bit Dynamic Quantization

The world of open-source large language models (LLMs) is constantly evolving, and DeepSeek R1 is a name you need to know. This powerhouse model rivals OpenAI's o1 in reasoning capabilities while remaining fully open-source. But how can you harness its power without massive hardware investments? The answer lies in dynamic quantization, specifically the 1.58-bit version produced by Unsloth.

What is DeepSeek R1 and Why Should You Care?

DeepSeek R1 is a 671 billion parameter model pushing the boundaries of what's possible with open-source AI. Its ability to reason and generate code is impressive, making it valuable for various applications.

The Challenge: Model Size and Computational Cost

The sheer size of DeepSeek R1 (originally 720GB) poses a significant hurdle for many users. Running such a massive model requires substantial VRAM and processing power, limiting accessibility.

The Solution: 1.58-bit Dynamic Quantization by Unsloth

Unsloth tackled this challenge head-on, quantizing DeepSeek R1 from 720GB down to 131GB, a reduction of roughly 80%. This was achieved by using dynamic quantization.
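As a quick sanity check on those figures, the arithmetic can be sketched in a few lines of Python. This assumes 1GB means 10^9 bytes and that every one of the 671 billion parameters is stored in the file; the function names are illustrative and not part of any library.

```python
ORIGINAL_GB = 720    # original model size quoted in this article
QUANTIZED_GB = 131   # size of the 1.58-bit dynamic quant quoted in this article
N_PARAMS = 671e9     # parameter count quoted in this article


def size_reduction(original_gb: float, quantized_gb: float) -> float:
    """Fraction of disk size saved by quantization."""
    return 1.0 - quantized_gb / original_gb


def average_bits_per_param(file_gb: float, n_params: float) -> float:
    """Average bits stored per parameter, assuming 1GB = 1e9 bytes."""
    return file_gb * 1e9 * 8 / n_params


print(f"size reduction: {size_reduction(ORIGINAL_GB, QUANTIZED_GB):.1%}")
print(f"average bits per parameter: {average_bits_per_param(QUANTIZED_GB, N_PARAMS):.2f}")
```

The bits-per-parameter figure is only a rough average: it shifts if the sizes are really in GiB, and it says nothing about how the bits are spread across layers.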

What is Dynamic Quantization?

Instead of naively quantizing all layers, Unsloth's dynamic quantization selectively applies different bit precisions to different layers based on their sensitivity. This maintains functionality while minimizing size.

Key Idea: Certain layers (like the first few dense layers and the down_proj matrices in the MoE layers) are more sensitive to quantization than others.
Selective Quantization: These sensitive layers are kept at higher precision (like 4-bit or 6-bit), while the bulk of the MoE weights is aggressively quantized to about 1.5-bit.
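The selection rule can be pictured as a small lookup that maps each weight tensor to a bit width. The sketch below is only an illustration of the idea: the tensor names, the thresholds, and the parameter counts are hypothetical, and it is not Unsloth's actual per-tensor recipe.

```python
def pick_bits(layer_index: int, tensor_name: str, n_dense_layers: int = 3) -> float:
    """Choose an illustrative bit width for one weight tensor.

    The name pattern and bit widths are assumptions made for this sketch.
    """
    if layer_index < n_dense_layers:
        return 4.0   # early dense layers: keep more precision
    if ".experts." in tensor_name:
        return 1.58  # bulk of the MoE weights: quantize aggressively
    return 4.0       # everything else (attention, router, and so on)


def average_bits(tensors) -> float:
    """Parameter-weighted average bit width over (layer, name, n_params) tuples."""
    total_params = sum(n_params for _, _, n_params in tensors)
    total_bits = sum(pick_bits(layer, name) * n_params for layer, name, n_params in tensors)
    return total_bits / total_params


# Hypothetical toy model: most parameters sit in the expert weights.
toy_tensors = [
    (0, "layers.0.mlp.up_proj", 100),
    (5, "layers.5.mlp.experts.0.up_proj", 800),
    (5, "layers.5.self_attn.q_proj", 100),
]
print(f"average bits: {average_bits(toy_tensors):.3f}")
```

Because the expert weights dominate the parameter count, the average stays close to the low bit width even though the sensitive tensors keep 4 bits.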

Benefits of Unsloth's Dynamic Quantization

Significant Size Reduction: Get the 671B parameter model down to 131GB.
Reduced VRAM Requirements: Fast inference with 160GB of VRAM (2x H100 80GB).
CPU Compatibility: Run with just 20GB of RAM (though slower). 80GB+ of combined VRAM + RAM is recommended for optimal performance.
Maintained Functionality: Even at 1.58-bit, the model still produces valid output.

Dynamic Quantized Versions Available

Unsloth provides several dynamic quantized versions, each offering a trade-off between size and quality:

MoE Bits	Disk Size	Type	Quality	Down_proj
1.58-bit	131GB	IQ1_S	Fair	2.06/1.56bit
1.73-bit	158GB	IQ1_M	Good	2.06bit
2.22-bit	183GB	IQ2_XXS	Better	2.5/2.06bit
2.51-bit	212GB	Q2_K_XL	Best	3.5/2.5bit

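The full collection of GGUFs, including 4-bit and distilled versions, is available on Unsloth's Hugging Face page; the dynamic quants used in this article live in the unsloth/DeepSeek-R1-GGUF repository.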

Benchmarking the Results: Dynamic vs. Basic Quantization

To test the quantized models, Unsloth asked DeepSeek R1 to create a Flappy Bird game, giving each model three attempts (pass@3) and scoring the output against 10 criteria. Here are the key findings:

Dynamic Quantization Wins: The 1.58-bit dynamic version scored 69.2%, and the 2-bit dynamic version scored 91.7%.
Naive Quantization Fails: Quantizing all layers naively, even at higher bitrates, resulted in significantly worse performance or completely broken outputs (infinite repetitions, black screens).

Why dynamic quantization matters: It preserves the model's ability to generate coherent and functional code.

Diving Deeper: Exploiting DeepSeek R1's Architecture

Unsloth's approach relies on a few specifics of DeepSeek R1's architecture.

Key Architectural Insights:

The first 3 layers are fully dense.
MoE (Mixture of Experts) layers increase the parameter count without increasing FLOPs: only a few experts run for each token, so the weights of the inactive experts are effectively masked out and skipped.
The down_proj matrices are extremely sensitive to quantization due to how SwiGLU operates.
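To make these points concrete, here is a minimal pure-Python sketch of a generic top-k MoE layer whose experts are SwiGLU feed-forward blocks. It is an assumption-laden toy, not DeepSeek R1's actual router or expert layout; all function names and the tiny weights are hypothetical.

```python
import math


def silu(v: float) -> float:
    """SiLU (swish) activation used inside SwiGLU."""
    return v / (1.0 + math.exp(-v))


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def swiglu_ffn(x, W_gate, W_up, W_down):
    """down_proj(silu(gate_proj(x)) * up_proj(x))."""
    gate = [silu(g) for g in matvec(W_gate, x)]
    up = matvec(W_up, x)
    hidden = [g * u for g, u in zip(gate, up)]  # product of two projections
    return matvec(W_down, hidden)               # down_proj consumes that product


def moe_forward(x, router_W, experts, top_k=2):
    """Run only the top_k highest-scoring experts and mix their outputs."""
    scores = matvec(router_W, x)
    chosen = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)[:top_k]
    peak = max(scores[i] for i in chosen)
    weights = [math.exp(scores[i] - peak) for i in chosen]  # softmax over chosen experts
    total = sum(weights)
    out = [0.0] * len(x)
    for w, i in zip(weights, chosen):
        y = swiglu_ffn(x, *experts[i])  # experts not in `chosen` are never evaluated
        out = [o + (w / total) * yi for o, yi in zip(out, y)]
    return out, chosen


# Four identical toy experts with 1x1 weights; only two of them run.
toy_experts = [([[1.0]], [[2.0]], [[3.0]])] * 4
toy_router = [[1.0], [3.0], [2.0], [0.0]]
output, used = moe_forward([1.0], toy_router, toy_experts, top_k=2)
print(used, output)
```

Adding more experts grows the parameter count, but the work per token depends only on top_k. The sketch also shows that down_proj receives the element-wise product of the gate and up projections, which is one way to picture why errors in that matrix are costly.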

How to Run the 1.58-bit Dynamic R1 Quant

Dynamic R1 quants are compatible with any system that runs GGUFs (Ollama, Open WebUI, Transformers).

Setting up llama.cpp (Optional):

apt-get update
apt-get install build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggerganov/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-quantize llama-cli llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp

Downloading the Model:

The snippet below downloads only the 1.58-bit (UD-IQ1_S) files from Hugging Face.

# pip install huggingface_hub hf_transfer
# import os # Optional for faster downloading
# os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="unsloth/DeepSeek-R1-GGUF",
    local_dir="DeepSeek-R1-GGUF",
    allow_patterns=["*UD-IQ1_S*"],  # fetch only the 1.58-bit dynamic quant
)
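Since the model arrives as several split files, it can be worth checking that every shard is present before launching anything. The helper below is a hypothetical sketch: it assumes the shards follow the "-00001-of-00003.gguf" naming seen in the run commands in this article, and it only inspects file names, not file contents.

```python
import re

# Matches names such as "DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf".
SHARD_PATTERN = re.compile(r"^(?P<stem>.+)-(?P<index>\d{5})-of-(?P<total>\d{5})\.gguf$")


def missing_shards(filenames):
    """Return the 1-based shard numbers that are absent from `filenames`."""
    seen = set()
    total = 0
    for name in filenames:
        match = SHARD_PATTERN.match(name)
        if match:
            seen.add(int(match.group("index")))
            total = max(total, int(match.group("total")))
    return [i for i in range(1, total + 1) if i not in seen]


# Example with a deliberately incomplete listing:
print(missing_shards([
    "DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf",
    "DeepSeek-R1-UD-IQ1_S-00003-of-00003.gguf",
]))
```

In practice the names could come from something like `Path("DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S").glob("*.gguf")`, passing each path's `.name` to the helper.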

Calculating Layers to Offload:

Use the following formula to determine the number of layers to offload to your GPU:

n_offload = (VRAM (GB) / Filesize (GB)) * n_layers - 4

Where n_layers for DeepSeek R1 is 61. Round the result down, and cap it at 61.

Quant	File Size	24GB GPU	80GB GPU	2x80GB GPU
1.58bit	131GB	7	33	61 (all layers)
1.73bit	158GB	5	26	57
2.22bit	183GB	4	22	49
2.51bit	212GB	2	19	32
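The formula is easy to wrap in a small helper. The sketch below assumes the result is rounded down and clamped to the range 0 to n_layers, which is consistent with the 1.58-bit row of the table above; the function name and the use of integer arithmetic (to avoid floating-point rounding surprises) are choices made for this example.

```python
def n_offload(vram_gb: int, file_size_gb: int, n_layers: int = 61) -> int:
    """Layers to offload to the GPU, following the formula above.

    Rounding down and clamping to [0, n_layers] are assumptions of this sketch.
    """
    estimate = (vram_gb * n_layers) // file_size_gb - 4
    return max(0, min(n_layers, estimate))


# 1.58-bit quant (131GB file) on the three GPU setups from the table:
for vram in (24, 80, 160):
    print(f"{vram}GB VRAM -> {n_offload(vram, 131)} layers")
```

If the model still runs out of memory at the computed value, lowering the number by a few layers is a reasonable first step.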

Running the Model with llama.cpp:

./llama.cpp/llama-cli \
    --model DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
    --cache-type-k q4_0 \
    --threads 16 \
    --prio 2 \
    --temp 0.6 \
    --ctx-size 8192 \
    --seed 3407 \
    --n-gpu-layers 7 \
    -no-cnv \
    --prompt "<｜User｜>Create a Flappy Bird game in Python.<｜Assistant｜>"
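If you would rather drive llama-cli from a script, the same invocation can be assembled in Python. This is a sketch that reuses only the flags and the prompt format shown in the command above; the function names are hypothetical, and the binary and model paths are assumed to match the setup steps in this article.

```python
import shlex

USER_TAG = "<｜User｜>"
ASSISTANT_TAG = "<｜Assistant｜>"


def build_prompt(user_message: str) -> str:
    """Wrap a message in the prompt format used in the command above."""
    return f"{USER_TAG}{user_message}{ASSISTANT_TAG}"


def build_llama_cli_args(model_path: str, user_message: str, n_gpu_layers: int,
                         binary: str = "./llama.cpp/llama-cli") -> list:
    """Assemble the argument list for llama-cli with the flags from this article."""
    return [
        binary,
        "--model", model_path,
        "--cache-type-k", "q4_0",
        "--threads", "16",
        "--prio", "2",
        "--temp", "0.6",
        "--ctx-size", "8192",
        "--seed", "3407",
        "--n-gpu-layers", str(n_gpu_layers),
        "-no-cnv",
        "--prompt", build_prompt(user_message),
    ]


args = build_llama_cli_args(
    "DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf",
    "Create a Flappy Bird game in Python.",
    n_gpu_layers=7,
)
print(shlex.join(args))  # shell-quoted command, for inspection
# To launch it: import subprocess; subprocess.run(args, check=True)
```

Passing the arguments as a list avoids shell-quoting problems with the special characters in the prompt tags.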

Running on Mac/Apple Devices:

On Apple Silicon the GPU shares unified memory with the CPU, so lower --n-gpu-layers if the machine runs out of memory.

./llama.cpp/llama-cli \
    --model DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
    --cache-type-k q4_0 \
    --threads 16 \
    --prio 2 \
    --temp 0.6 \
    --ctx-size 8192 \
    --seed 3407 \
    --n-gpu-layers 59 \
    -no-cnv \
    --prompt "<｜User｜>Create a Flappy Bird game in Python.<｜Assistant｜>"

Using Ollama/Open WebUI

Follow the official Open WebUI tutorial: docs.openwebui.com/tutorials/integrations/deepseekr1-dynamic/

For Ollama, merge the GGUF split files first:

./llama.cpp/llama-gguf-split --merge \
    DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
    merged_file.gguf

Conclusion

Unsloth's dynamic quantization of DeepSeek R1 drastically lowers the barrier to entry for running this powerful reasoning model. By understanding the model's architecture and selectively quantizing layers, Unsloth keeps the model producing coherent, functional output with significantly reduced hardware requirements. Dive in and experience the power of DeepSeek R1 today!

Further Exploration

Unsloth's Blog: unsloth.ai/blog - Stay up-to-date with the latest advancements in LLM optimization.
Unsloth's Discord: discord.com/invite/unsloth - Get help and share your experiences with the Unsloth community.
Full Results: docs.unsloth.ai/basics/deepseek-r1-dynamic-1.58-bit - Access complete tables and generated Python code.
