In the rapidly evolving landscape of Artificial Intelligence, the development of large language models (LLMs) has often been associated with exorbitant costs and massive computational resources. However, a recent breakthrough by the Chinese AI firm, DeepSeek, is challenging this paradigm. DeepSeek's release of its DeepSeek-V3 model, accompanied by a comprehensive 53-page technical report, is making waves for its impressive capabilities achieved at a fraction of the cost incurred by industry giants like OpenAI and Anthropic.
Launched by Hangzhou DeepSeek Artificial Intelligence Basic Technology Research Co., Ltd. ("DeepSeek"), the DeepSeek-V3 model distinguishes itself through its open-source nature and detailed technical documentation. Unlike the more guarded disclosures common in the field, DeepSeek's report provides significant transparency into the model's key technologies and training details. What truly sets V3 apart is its dramatically upgraded performance achieved with a training cost of merely $5.576 million, using just 2,048 H800 GPUs in under two months. This contrasts sharply with the estimated $100 million training cost for GPT-4o, a figure cited by Anthropic CEO Dario Amodei.
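For readers who want to see where the headline figure comes from, the arithmetic below reconstructs it from the GPU-hour breakdown published in the technical report; the $2 per GPU hour is the report's own assumed H800 rental rate, not a measured cost.

```python
# Reconstructing DeepSeek-V3's reported training cost from the GPU-hour
# breakdown in its technical report (the $2/GPU-hour rate is the report's
# assumed H800 rental price).
pretraining_hours   = 2_664_000   # H800 GPU hours for pre-training
context_ext_hours   =   119_000   # long-context extension
post_training_hours =     5_000   # supervised fine-tuning + RL

total_hours = pretraining_hours + context_ext_hours + post_training_hours
cost_usd = total_hours * 2.0                      # assumed $2 per GPU hour

print(f"{total_hours:,} GPU hours")               # 2,788,000 GPU hours
print(f"${cost_usd / 1e6:.3f}M")                  # $5.576M
print(f"{total_hours / 2048 / 24:.0f} days")      # ~57 days on 2,048 GPUs
```

The last line also confirms the "under two months" claim: 2.788 million GPU hours spread across 2,048 GPUs works out to roughly 57 days of wall-clock training time.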
Andrej Karpathy, a founding member of OpenAI, lauded DeepSeek-V3 for making LLM pre-training accessible even on limited computational budgets. This raises a pivotal question: how did DeepSeek manage to "do more with less," and does this herald a new trajectory for LLM development?
DeepSeek has carved out a unique niche in the AI ecosystem, focusing solely on foundational models without venturing into consumer-facing (2C) applications. Committed to an open-source approach and operating without external funding, DeepSeek saw its prior release, DeepSeek-V2, gain immense popularity for its innovative architecture and exceptional cost-effectiveness.
The inference cost of DeepSeek-V2 was a mere ¥1 (approximately $0.14) per million tokens, roughly one-seventh the cost of Llama 3 70B and one-seventieth that of GPT-4 Turbo. This cost reduction was achieved primarily through two architectural innovations: Multi-head Latent Attention (MLA), which compresses the key-value cache that dominates inference memory, and the DeepSeekMoE sparse Mixture-of-Experts design, which activates only a fraction of the model's parameters for each token.
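To make the MLA idea concrete, here is a minimal, illustrative sketch of the key-value compression trick in PyTorch. The module names and dimensions are arbitrary, and RoPE and causal masking are omitted; this is not DeepSeek's implementation, only the general shape of the idea: cache a small latent per token instead of full per-head keys and values, and expand it at attention time.

```python
import torch
import torch.nn as nn

class LatentKVAttention(nn.Module):
    """Toy sketch of latent KV compression (the idea behind MLA).

    Instead of caching K and V (2 * d_model values per token), only a small
    latent of size d_latent is cached and expanded to K/V at attention time,
    shrinking the KV cache that drives inference cost. RoPE and causal
    masking are omitted for brevity.
    """

    def __init__(self, d_model: int = 512, n_heads: int = 8, d_latent: int = 64):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.kv_down = nn.Linear(d_model, d_latent)   # compress -> this is cached
        self.k_up = nn.Linear(d_latent, d_model)      # expand at attention time
        self.v_up = nn.Linear(d_latent, d_model)
        self.o_proj = nn.Linear(d_model, d_model)

    def forward(self, x, latent_cache=None):
        b, t, _ = x.shape
        latent = self.kv_down(x)                              # (b, t, d_latent)
        if latent_cache is not None:                          # incremental decoding
            latent = torch.cat([latent_cache, latent], dim=1)
        s = latent.shape[1]
        q = self.q_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_up(latent).view(b, s, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v_up(latent).view(b, s, self.n_heads, self.d_head).transpose(1, 2)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, t, -1)
        return self.o_proj(out), latent                       # cache latent, not K/V
```

With these toy sizes, the cached state per token shrinks from 1,024 values (keys plus values) to 64, and it is that kind of reduction in the KV cache that translates directly into cheaper serving.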
These innovations, coupled with model compression, expert parallel training, FP8 mixed-precision training, data distillation, and algorithm optimization, drastically reduced the overall cost of the V3 model. The integration of FP8, an emerging low-precision training methodology, reduces both memory footprint and computational demands by decreasing the number of bits required for data representation.
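As a rough illustration of the memory side of that trade-off (not of DeepSeek's actual training stack), a recent PyTorch build with float8 support shows the per-element footprint difference directly:

```python
import torch

# Per-element storage for BF16 vs FP8 (requires a PyTorch build with float8
# dtypes, e.g. 2.1+). FP8 halves the bytes per value relative to BF16.
x_bf16 = torch.randn(4096, 4096, dtype=torch.bfloat16)
x_fp8 = x_bf16.to(torch.float8_e4m3fn)

print(x_bf16.element_size(), "byte(s) per bf16 value")   # 2
print(x_fp8.element_size(), "byte(s) per fp8 value")     # 1

# In practical mixed-precision training, higher-precision master weights and
# per-tensor scaling factors are kept alongside the FP8 copies, and the cast
# happens around the matrix multiplies, which is where the memory and
# throughput savings actually show up.
```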
Zhang Xiaorong, Dean of the Deep Technology Research Institute, emphasized that DeepSeek's success is rooted in its breakthroughs and innovations in LLM technology. By striking a balance between high performance and low cost through algorithmic optimization and engineering practices, DeepSeek is injecting vitality into the industry and influencing the technological roadmap and engineering practices of LLMs.
While the approach of using massive parameters, vast computational resources, and substantial investment, as exemplified by ChatGPT, has proven effective, it remains unattainable for most startups. Estimated training costs for GPT-5 already run into the hundreds of millions of dollars, underscoring the prohibitive expense of scaling LLMs by traditional means.
The emergence of DeepSeek-V3 offers an alternative. Lin Yonghua, Vice President and Chief Engineer of the Zhiyuan Research Institute, believes that the Scaling Law should extend beyond pre-training into subsequent training phases, especially in areas like reasoning and reinforcement learning. DeepSeek-V3, for example, distills reasoning capabilities from its DeepSeek-R1 series of models during post-training, an approach that has proved highly effective. This is mirrored by advancements like Kimi's use of reinforcement learning in search scenarios and Ant Group's research into enhancing model capabilities through post-training and reinforcement learning.
Key takeaway: Instead of relying solely on increased computing power, parameter size, and data volume, DeepSeek's approach prioritizes algorithmic innovation to enhance fundamental model capabilities during the post-training phase.
It's important to note this doesn't diminish the requirement for serious computing power, but it does shift where that power is needed.
The DeepSeek-V3 model represents a paradigm shift in the development of AI. It demonstrates that groundbreaking AI can be achieved without unsustainable financial investment and opens the door for more players to innovate in LLMs.
DeepSeek's breakthrough underscores that while large-scale GPU clusters are essential, "burning money" should not be the sole strategy for progress. Zhou Hongyi, founder of 360 Group, praised DeepSeek for achieving results with 2,000 cards that typically require tens of thousands. This approach lowers the cost of LLMs and accelerates the popularization of AI across specialized, vertical, and industry-specific applications.
As the AI landscape evolves, expect to see a convergence in technologies and among companies. Improving computational efficiency and reducing inference costs will demand optimized computing architectures and efficient resource utilization. DeepSeek's success presents a compelling case for prioritizing innovation and efficiency in the race to advance AI capabilities.