The AI landscape is constantly evolving, and recent announcements from DeepSeek have sent shockwaves through the industry, particularly concerning competition between the U.S. and China. This article delves into DeepSeek, its innovative models, and the implications for the future of AI.
Before diving into the details, it’s crucial to address the initial reactions to DeepSeek's announcements. Much like the surprise that greeted Huawei's 7nm chip in the Mate 60 Pro, the significance of DeepSeek's advancements wasn't immediately clear to many. The key takeaway is that the details of DeepSeek’s accomplishments matter less than the reaction, and what that reaction says about people’s pre-existing assumptions.
DeepSeek isn't just one model; it is a series of models, each building on the last with increasing sophistication. The key breakthroughs arrived with DeepSeek-V2, which introduced two important innovations: DeepSeekMoE, a more efficient mixture-of-experts architecture, and DeepSeekMLA, multi-head latent attention, which sharply reduces the memory consumed by the key-value cache during inference.
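To make the mixture-of-experts idea concrete, here is a minimal sketch of generic top-k expert routing in Python. All sizes are made up, and this is the vanilla technique rather than DeepSeek's specific DeepSeekMoE design:

```python
import numpy as np

# Toy top-k mixture-of-experts routing. A router scores each token
# against every expert, but only the top-k experts actually run, so
# most parameters stay idle on any given token. Sizes are made up;
# this is generic MoE, not DeepSeek's exact DeepSeekMoE design.
rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 8, 2

gate_w = rng.normal(size=(d_model, n_experts))             # router weights
expert_w = rng.normal(size=(n_experts, d_model, d_model))  # one simplified FFN per expert

def moe_forward(x):                        # x: one token's hidden state, shape (d_model,)
    logits = x @ gate_w                    # router scores, shape (n_experts,)
    chosen = np.argsort(logits)[-top_k:]   # indices of the top-k experts
    weights = np.exp(logits[chosen])
    weights /= weights.sum()               # softmax over the chosen experts only
    # Only top_k / n_experts of the expert parameters are ever touched,
    # which is how a huge total parameter count stays cheap per token.
    return sum(w * (x @ expert_w[i]) for w, i in zip(weights, chosen))

print(moe_forward(rng.normal(size=d_model)).shape)  # (16,)
```

DeepSeekMoE refines this picture with much finer-grained experts plus shared experts that every token uses, but the per-token economics are the same: total parameters can grow far faster than per-token compute.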
The true impact of V2's innovations became apparent with the release of V3. V3 added a new approach to load balancing that reduced communications overhead, and it used multi-token prediction in training, which increased efficiency again. DeepSeek claimed the final training run for V3 cost just $5.576 million, a figure that was met with widespread incredulity.
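As a rough illustration of the multi-token prediction idea, the toy sketch below adds a second head that predicts the token two positions ahead, so each forward pass yields two tokens of training signal. This is a deliberate simplification, not DeepSeek's actual, more elaborate MTP module:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy multi-token prediction: alongside the usual next-token head,
# a second head predicts the token two positions ahead, so each
# forward pass yields two tokens of training signal.
d, vocab = 32, 100
trunk = nn.Embedding(vocab, d)   # stand-in for the transformer trunk
head1 = nn.Linear(d, vocab)      # standard head: predict token t+1
head2 = nn.Linear(d, vocab)      # extra head: predict token t+2

tokens = torch.randint(0, vocab, (4, 16))   # a batch of token ids
h = trunk(tokens[:, :-2])                   # hidden states for positions 0..T-3
loss = (F.cross_entropy(head1(h).transpose(1, 2), tokens[:, 1:-1]) +
        F.cross_entropy(head2(h).transpose(1, 2), tokens[:, 2:]))
loss.backward()  # one forward pass, two tokens of supervision
```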
DeepSeek is clear that these costs cover only the final training run and exclude all other expenses, such as prior research and ablation experiments on architectures, algorithms, and data.
The company publicly stated that training DeepSeek-V3 on each trillion tokens requires only 180K H800 GPU hours, i.e., 3.7 days on their cluster with 2048 H800 GPUs. Consequently, their pre-training stage is completed in less than two months and costs 2664K GPU hours. Combined with 119K GPU hours for the context length extension and 5K GPU hours for post-training, DeepSeek-V3 costs only 2.788M GPU hours for its full training. This results in a cost of $5.576M, assuming the rental price of the H800 GPU is $2 per GPU hour.
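The arithmetic checks out. Every number below is DeepSeek's own, taken from the figures quoted above (plus the 14.8T training-token count from the V3 report); the $2-per-GPU-hour H800 rental price is their stated assumption, not a market quote:

```python
# Checking DeepSeek's arithmetic using only the figures quoted above.
hours_per_trillion = 180_000   # H800 GPU hours per trillion training tokens
tokens_trillions = 14.8        # V3's reported training corpus

pre_training = hours_per_trillion * tokens_trillions  # 2,664,000 GPU hours
context_ext = 119_000                                 # context length extension
post_training = 5_000

total = pre_training + context_ext + post_training
print(f"{total:,.0f} GPU hours")                      # 2,788,000
print(f"${total * 2 / 1e6:.3f}M")                     # $5.576M at $2/hour
print(f"{hours_per_trillion / 2048 / 24:.1f} days per trillion tokens")  # ~3.7
```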
This was achieved through their optimized co-design of algorithms, frameworks, and hardware. V3 has 671 billion parameters, but it uses Mixture of Experts so that only the 37 billion parameters in the active experts are computed for any given token; parameters are stored in BF16 or FP32 precision, but are reduced to FP8 precision for calculations.
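That precision split is the classic "store high, compute low" pattern. Real FP8 kernels are hardware-specific, so as a stand-in the sketch below simulates the idea with simple int8 affine quantization, which shows the same memory-for-accuracy tradeoff:

```python
import numpy as np

# "Store high precision, compute low precision," simulated with int8
# affine quantization as a stand-in for FP8 (real FP8 kernels are
# hardware-specific). The tradeoff is the same: 1 byte per parameter
# during the matmul at the cost of a small quantization error.
rng = np.random.default_rng(0)
w_master = rng.normal(size=(64, 64)).astype(np.float32)  # high-precision master weights

scale = np.abs(w_master).max() / 127.0
w_q8 = np.round(w_master / scale).astype(np.int8)        # 8-bit copy used for compute

x = rng.normal(size=(1, 64)).astype(np.float32)
y_full = x @ w_master
y_q8 = (x @ w_q8.astype(np.float32)) * scale             # dequantize the result

print(np.abs(y_full - y_q8).max())  # small error for 4x less memory than FP32
```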
DeepSeek engineers programmed 20 of the 132 processing units on each H800 specifically to manage cross-chip communications. This is impossible to do in CUDA; the engineers had to drop down to PTX, a low-level instruction set for Nvidia GPUs that is basically like assembly language. All of these design decisions only make sense if you are constrained to the H800, the export-compliant variant of the H100 with sharply reduced chip-to-chip interconnect bandwidth, which is exactly why dedicating compute to communications management pays off.
V3 is competitive with OpenAI’s GPT-4o and Anthropic’s Claude 3.5 Sonnet, and appears to be better than Meta's biggest Llama model. DeepSeek was able to distill those models to give V3 high-quality tokens to train on.
Distillation involves extracting understanding from another model by sending inputs to the teacher model, recording the outputs, and using them to train the student model. This lets companies optimize models for cheap inference, while letting everyone else free-ride on the leading edge's training investments. Distillation is probably the core economic factor causing the slow divorce of Microsoft and OpenAI.
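Here is a minimal distillation sketch in PyTorch. The teacher and student are stand-in linear layers; a real pipeline would send prompts to a hosted teacher model, record its completions (or, when exposed, its output distributions, as below), and train the student to match them:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal distillation sketch: collect the teacher's outputs, then
# train the student to reproduce them.
teacher = nn.Linear(32, 100)  # frozen "teacher" producing output distributions
student = nn.Linear(32, 100)  # cheaper "student" being trained
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

for _ in range(100):
    prompts = torch.randn(8, 32)  # stand-in for encoded inputs
    with torch.no_grad():
        targets = F.softmax(teacher(prompts), dim=-1)       # recorded teacher outputs
    log_probs = F.log_softmax(student(prompts), dim=-1)
    loss = F.kl_div(log_probs, targets, reduction="batchmean")  # match the teacher
    opt.zero_grad()
    loss.backward()
    opt.step()
```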
R1 is a reasoning model like OpenAI’s o1, but it has open weights, meaning that instead of paying OpenAI to get reasoning, you can run R1 on the server of your choice, or even locally, at dramatically lower cost.
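Because the weights are open, R1 can be served with off-the-shelf open-source tooling. As an illustration, the snippet below loads one of the smaller distilled R1 checkpoints DeepSeek published on Hugging Face (the full 671B model needs a multi-GPU server); the model identifier reflects the name used at release and may change:

```python
# Loads a small distilled R1 checkpoint; the 671B original needs
# a multi-GPU server, but the workflow is the same.
from transformers import pipeline

pipe = pipeline("text-generation",
                model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")
out = pipe("How many primes are between 1 and 20? Think step by step.",
           max_new_tokens=512)
print(out[0]["generated_text"])
```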
DeepSeek made two models: R1 and R1-Zero. R1-Zero is arguably the bigger deal, because it skips supervised fine-tuning entirely and uses pure reinforcement learning (RL).
Reinforcement learning is a technique where a machine learning model is given a bunch of data and a reward function. In this case, DeepSeek gave the model a set of math, code, and logic questions and set two reward functions: one rewarding the right answer, and one rewarding a response format that laid out its thinking process.
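A minimal sketch of what such rule-based rewards can look like is below. The <think>/<answer> tags follow the template described in the R1 paper, but the exact scoring here is an illustrative assumption:

```python
import re

# Sketch of the two rule-based rewards described above: one for a
# correct answer, one for using an explicit thinking block.
def accuracy_reward(completion: str, ground_truth: str) -> float:
    m = re.search(r"<answer>(.*?)</answer>", completion, re.S)
    return 1.0 if m and m.group(1).strip() == ground_truth else 0.0

def format_reward(completion: str) -> float:
    ok = re.fullmatch(r"<think>.+?</think>\s*<answer>.+?</answer>",
                      completion.strip(), re.S)
    return 1.0 if ok else 0.0

sample = "<think>2 + 2 = 4</think> <answer>4</answer>"
print(accuracy_reward(sample, "4"), format_reward(sample))  # 1.0 1.0
```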
What emerged is a model that developed reasoning and chains-of-thought on its own, including what DeepSeek called “Aha Moments”.
However, DeepSeek-R1-Zero encounters challenges such as poor readability and language mixing. To address these issues and further enhance reasoning performance, DeepSeek-R1 incorporates a small amount of cold-start data and a multi-stage training pipeline.
This is similar to what OpenAI did for o1: DeepSeek started the model out with a bunch of examples of chain-of-thought thinking so it could learn the proper format for human consumption, and then did the reinforcement learning to enhance its reasoning, along with a number of editing and refinement steps.
The core takeaway is that AI models are teaching AI models, and AI models are teaching themselves. We are watching the assembly of an AI takeoff scenario in real time.
DeepSeek’s efficiency and broad availability cast significant doubt on the most optimistic Nvidia growth story, at least in the near term. The payoffs from both model and infrastructure optimization also suggest there are significant gains to be had from exploring alternative approaches to inference in particular.
DeepSeek CEO Liang Wenfeng has stated that open source is key to attracting talent and creating a strong technical ecosystem. This approach contrasts with the strategies of some U.S. companies that advocate for regulation and closed-source models; on that score, OpenAI's greatest crime was its role in the push for regulation that culminated in the 2023 Executive Order on AI.
The key question remains: Will the U.S. double down on defensive measures, or realize that there is real competition and actually give ourselves permission to compete by cutting out all of the cruft in our companies that has nothing to do with winning?