NVIDIA's recent stock stumble amid the buzz around DeepSeek-R1 may signal deeper challenges ahead. Discussions sparked by the DeepSeek-V3 technical report point to an unconventional approach to AI hardware optimization, one that could challenge NVIDIA's stronghold in the AI accelerator market.
According to a recent analysis by Mirae Asset Securities Research, DeepSeek's V3 model achieves a staggering 10x increase in hardware efficiency compared to models from Meta and others. The key? DeepSeek seemingly rebuilt everything from scratch.
Instead of relying solely on NVIDIA's CUDA, DeepSeek repurposed part of the H800 GPU, dedicating 20 of its 132 streaming multiprocessors (SMs) to inter-server communication rather than computation. This ingenious workaround sidesteps the H800's restricted communication bandwidth and delivers a significant performance boost.
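DeepSeek's exact partitioning scheme has not been published, but the underlying pattern, carving execution resources into a communication group and a compute group within one launch, can be sketched in ordinary CUDA. Everything below (the block split, buffer names, the toy arithmetic) is illustrative only, not DeepSeek's code:

```cuda
#include <cuda_runtime.h>

// Illustrative only: the first COMM_BLOCKS thread blocks act as data
// movers while the rest do arithmetic. DeepSeek's real scheme dedicates
// 20 of the H800's 132 SMs to inter-server transfers via PTX-level
// tuning; none of that detail is reproduced here.
constexpr int COMM_BLOCKS = 2;

__global__ void specialized_kernel(const float* staged, float* comm_buf,
                                   float* out, int n) {
    if (blockIdx.x < COMM_BLOCKS) {
        // "Communication" blocks: stream staged data into a buffer,
        // standing in for the transfers a dedicated SM would handle.
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        for (int i = tid; i < n; i += COMM_BLOCKS * blockDim.x)
            comm_buf[i] = staged[i];
    } else {
        // Compute blocks: ordinary arithmetic that overlaps with the
        // data movement happening in the communication blocks.
        int tid = (blockIdx.x - COMM_BLOCKS) * blockDim.x + threadIdx.x;
        if (tid < n) out[tid] = out[tid] * 2.0f + 1.0f;
    }
}

int main() {
    const int n = 1 << 20;
    float *staged, *comm_buf, *out;
    cudaMalloc(&staged, n * sizeof(float));
    cudaMalloc(&comm_buf, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    int threads = 256;
    int blocks = COMM_BLOCKS + (n + threads - 1) / threads;
    specialized_kernel<<<blocks, threads>>>(staged, comm_buf, out, n);
    cudaDeviceSynchronize();
    cudaFree(staged); cudaFree(comm_buf); cudaFree(out);
    return 0;
}
```

The point of the pattern is overlap: data movement no longer has to wait for compute to finish, because different execution resources own each job.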
DeepSeek achieved this optimization using NVIDIA's PTX (Parallel Thread Execution), a low-level virtual instruction set that sits far closer to assembly than CUDA C++ does. Unlike CUDA, PTX allows fine-grained control over hardware resources such as register allocation and thread/warp-level scheduling.
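For a flavor of what PTX-level control looks like, CUDA C++ lets developers embed PTX directly through inline assembly. The kernel below is a minimal sketch, a single fused multiply-add issued by hand rather than left to the compiler; DeepSeek's actual optimizations go far deeper than this toy:

```cuda
__global__ void fma_ptx(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float r;
        // fma.rn.f32 d, a, b, c  computes d = a*b + c with
        // round-to-nearest, pinning exactly this instruction
        // instead of whatever the compiler would schedule.
        asm("fma.rn.f32 %0, %1, %2, %3;"
            : "=f"(r)
            : "f"(a[i]), "f"(b[i]), "f"(c[i]));
        c[i] = r;
    }
}
```

Real-world PTX work extends this idea to register budgeting and warp-scheduling decisions that CUDA C++ deliberately hides.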
While PTX offers unparalleled optimization potential, it comes at a cost: complexity. PTX programming is notoriously difficult to manage and maintain, which is precisely why most developers opt for easier-to-use high-level languages like CUDA.
However, DeepSeek's willingness to dive into the nitty-gritty of PTX demonstrates their commitment to squeezing every ounce of performance from their hardware. As one commenter noted, a team willing to use PTX over CUDA is likely composed of ex-quantitative traders who would do anything for that extra edge.
The implications of DeepSeek's approach are significant. As one Amazon engineer aptly asked: is NVIDIA's CUDA moat truly insurmountable?
The question extends beyond DeepSeek: can top-tier research labs use any GPU effectively, regardless of vendor lock-in?
Some even speculate that DeepSeek could open-source a CUDA alternative, becoming a "new source of power" in the ecosystem.
To temper the conjecture, a few facts are worth acknowledging. First, PTX remains integral to NVIDIA's own stack: it is the intermediate representation that sits between high-level CUDA code and the machine instructions the GPU actually executes. CUDA simplifies development with high-level APIs and toolchains; PTX is the bridge underneath.
In the standard compilation flow, CUDA code is first translated into PTX, which is then compiled into machine code (SASS, Streaming ASSembler).
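This pipeline is easy to observe with NVIDIA's standard tools. The snippet below pairs a trivial kernel with the stock commands (shown as comments) that dump each stage; the file names are arbitrary:

```cuda
// saxpy.cu -- a minimal kernel for inspecting each compilation stage.
__global__ void saxpy(float a, const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// CUDA C++ -> PTX (human-readable intermediate representation):
//   nvcc -ptx saxpy.cu -o saxpy.ptx
// PTX -> SASS machine code, packaged into a cubin:
//   nvcc -cubin saxpy.cu -o saxpy.cubin
// Disassemble the final SASS:
//   cuobjdump --dump-sass saxpy.cubin
```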
Furthermore, DeepSeek's PTX optimizations don't signal a departure from the CUDA ecosystem, but they do demonstrate the team's ability to optimize for other hardware. DeepSeek has been collaborating with AMD and Huawei, quickly adapting its models to those ecosystems.
Another interesting angle is AI writing low-level code itself: could a model be leveraged to generate optimized PTX?
DeepSeek-R1 has already demonstrated its ability to significantly improve the performance of large-model inference frameworks. In a recent pull request to the llama.cpp project, SIMD instructions (which process multiple data elements in parallel) were used to accelerate WebAssembly execution. What makes the PR special is that 99% of the code was written by DeepSeek-R1 through prompting and testing.
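The PR itself targets WebAssembly's 128-bit SIMD extension inside ggml's quantized kernels; the toy dot product below only illustrates the general technique, four floats processed per instruction, using clang's wasm_simd128.h intrinsics, and is not code from the PR:

```c
#include <wasm_simd128.h>  // clang/Emscripten; compile with -msimd128

// Illustrative SIMD dot product: 4 floats per loop iteration.
float dot_simd(const float* a, const float* b, int n) {
    v128_t acc = wasm_f32x4_splat(0.0f);
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        v128_t va = wasm_v128_load(a + i);
        v128_t vb = wasm_v128_load(b + i);
        // Four multiply-adds per iteration instead of one.
        acc = wasm_f32x4_add(acc, wasm_f32x4_mul(va, vb));
    }
    // Horizontal sum of the 4 lanes, then a scalar tail loop.
    float sum = wasm_f32x4_extract_lane(acc, 0) + wasm_f32x4_extract_lane(acc, 1)
              + wasm_f32x4_extract_lane(acc, 2) + wasm_f32x4_extract_lane(acc, 3);
    for (; i < n; ++i) sum += a[i] * b[i];
    return sum;
}
```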
The question of whether CUDA will remain a dominant force is now open for debate. DeepSeek's advancements, combined with the growing interest in alternative hardware solutions, suggest that the future of AI acceleration may be more diverse than previously imagined.