TAI #132: DeepSeek V3 – 10x+ Improvement in Both Training and Inference Cost for Frontier LLMs
Also, pay 6,000,000x more for o3? Qwen2-VL-72B, Devin 1.1, AIOpsLab & more!
What happened this week in AI by Louie
While last week was about closed AI and huge inference cost escalation with o3, this week we got a Christmas surprise from China with DeepSeek V3, a huge win for open source and LLM cost reduction. In another open-source Christmas treat, Alibaba released QVQ, an impressive visual reasoning model built on Qwen2-VL-72B that scores 70.3 on MMMU vs. 77.3 for o1 and 69.1 for 4o.
DeepSeek V3 is a high-performance Mixture-of-Experts (MoE) model that may meaningfully challenge the idea that only huge tech companies with unlimited GPU budgets can train top-tier large language models. DeepSeek is a Chinese AI group that many in the community had on their radar for its prior releases, but V3 takes another leap. The model has 671 billion total parameters, with 37 billion activated per token at inference (V2 had 236B total parameters, 21B active), and was trained on 14.8T tokens. It now matches state-of-the-art non-reasoning models such as 4o and Sonnet 3.5 on most benchmarks at a much lower price (~10x below 4o). It is also very fast, with an output speed of 60 tokens per second (3x the smaller V2). Perhaps even more significant than the inference cost reduction is the huge breakthrough in training cost. V3 sets a new bar for training efficiency: the entire training run consumed only about $5.576 million, or around 2.788 million H800 GPU hours on a cluster of just 2,048 GPUs. That is roughly one-eleventh of the H100 compute used for Llama 3.1 405B, despite H800 performance being constrained relative to the H100 by US export restrictions. DeepSeek also makes some interesting architectural choices, including a relatively shallow network for such a large model (61 layers vs. 126 for Llama 3.1 405B), with 3 layers shared between the experts.
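To make these training numbers more concrete, here is a quick back-of-the-envelope check in Python. It uses only the figures quoted above; the assumption that all 2,048 GPUs ran continuously for the whole run is ours, so treat the wall-clock estimate as a rough sketch rather than a reported figure.

```python
# Figures quoted in the paragraph above.
gpu_hours = 2.788e6        # total H800 GPU hours for the full training run
total_cost_usd = 5.576e6   # quoted training cost in USD
cluster_gpus = 2048        # size of the H800 cluster
total_params_b = 671       # total parameters, in billions
active_params_b = 37       # parameters activated per token, in billions

# Implied price per GPU hour behind the quoted dollar figure.
usd_per_gpu_hour = total_cost_usd / gpu_hours

# Wall-clock time, ASSUMING every GPU ran for the entire period.
wall_clock_days = gpu_hours / cluster_gpus / 24

# Share of the model's parameters that is active for any single token.
active_fraction = active_params_b / total_params_b

print(f"Implied price: ${usd_per_gpu_hour:.2f} per GPU hour")
print(f"Wall-clock time on {cluster_gpus} GPUs: ~{wall_clock_days:.0f} days")
print(f"Active parameters per token: {active_fraction:.1%}")
```

In other words, the quoted cost corresponds to about $2 per GPU hour and a run of under two months on this cluster, with only around 5.5% of the parameters doing work on each token.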
This model release underscores the power of MoE architectures and the huge potential for wins from engineering innovations, especially for those who want big-model performance without incurring a monstrous GPU bill. DeepSeek V3's impressive performance and efficiency are underpinned by a suite of innovative techniques. A novel auxiliary-loss-free load balancing strategy keeps experts evenly utilized within its MoE architecture while minimizing the performance degradation that balancing usually costs. Multi-Token Prediction (MTP) further improves model training and could also enable speculative decoding for faster inference. The successful use of FP8 mixed-precision training also marks a first for models of this capability and significantly improves training efficiency. For efficient inference, Multi-head Latent Attention (MLA) compresses the key-value cache, while the DeepSeekMoE architecture contributes to both economical training and rapid inference. Finally, knowledge distillation from the prior DeepSeek R1 series was used to refine and enhance V3's reasoning capabilities without the same inference token overhead. Altogether, these add up to training cost savings that are potentially as high as 10x or more relative to some comparable large models.
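For intuition on the load balancing point, here is a toy, pure-Python sketch of the general idea as we understand it: each expert carries a bias that is added to its routing score only when picking the top-k experts, and after every batch the bias is nudged down for overloaded experts and up for underloaded ones, with no extra loss term. The function names, the artificial score skew, and the step size are all hypothetical choices for illustration; this is not DeepSeek's implementation.

```python
import random

def route_top_k(scores, bias, k):
    """Pick k experts by (score + bias); the bias only affects selection."""
    ranked = sorted(range(len(scores)),
                    key=lambda e: scores[e] + bias[e], reverse=True)
    return ranked[:k]

def update_bias(bias, load, step_size):
    """Lower the bias of overloaded experts, raise it for underloaded ones."""
    mean_load = sum(load) / len(load)
    new_bias = []
    for b, n in zip(bias, load):
        if n > mean_load:
            new_bias.append(b - step_size)
        elif n < mean_load:
            new_bias.append(b + step_size)
        else:
            new_bias.append(b)
    return new_bias

def imbalance(load):
    """Ratio of the busiest expert's load to the least busy expert's load."""
    return max(load) / max(min(load), 1)

def simulate(num_experts=8, k=2, tokens_per_batch=512, batches=200,
             step_size=0.01, seed=0):
    rng = random.Random(seed)
    # Hypothetical skew: lower-indexed experts get systematically higher scores.
    skew = [0.3 * (num_experts - e) / num_experts for e in range(num_experts)]
    bias = [0.0] * num_experts
    history = []
    for _ in range(batches):
        load = [0] * num_experts
        for _ in range(tokens_per_batch):
            scores = [rng.random() + skew[e] for e in range(num_experts)]
            for e in route_top_k(scores, bias, k):
                load[e] += 1
        history.append(load)
        bias = update_bias(bias, load, step_size)
    return history

history = simulate()
first_ratio = imbalance(history[0])
last_ratio = imbalance(history[-1])
print("tokens per expert, first batch:", history[0])
print("tokens per expert, last batch: ", history[-1])
print(f"max/min load ratio: {first_ratio:.2f} -> {last_ratio:.2f}")
```

In this toy setup, the favored experts absorb most tokens in the first batch, and the bias updates gradually even out the load without touching the training objective.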
It’s worth noting that some conversations about V3 this week centered on whether the model was trained on ChatGPT outputs, given that it sometimes identifies itself as ChatGPT when prompted. We think this conversation is mostly a distraction from the very real and transparent technical breakthroughs that are primarily responsible for this model's performance. Most model families have likely used ChatGPT outputs for synthetic data at some point during training, but this behavior can also result from the prevalence of ChatGPT-generated text in the open web data used for training.
Why should you care?
As we’ve often said in this newsletter, the trifecta of “better, faster, cheaper” is what truly moves the needle for enterprise AI adoption. DeepSeek V3 seems to deliver on all three. The company’s API pricing ($0.27 per million input tokens and just over a dollar per million output tokens) lets it undercut GPT-4o by a factor of nearly 10. Combine that with an output speed of 60 tokens per second, and you have something that could push enterprise teams to experiment with a new open model that might outrank their current in-house or closed-lab solutions. If you’re already building on top of open-weight models or considering a migration away from expensive closed APIs, it might be a good time to try out V3. Just as advanced fine-tuning, agent, and retrieval-based pipelines flourish when inference costs dip, the ability to run V3 at a lower cost might open up more creative or higher-volume LLM use cases.
The continued open-source stance also remains a big story: DeepSeek V3, like its predecessors, is out there for full community scrutiny. The release includes the model weights, training frameworks, data stats, and cost breakdown. This level of transparency is much higher than that of most open-weight LLMs released in the US and Europe. It also means that the open-source community can try out new ideas, such as fine-tuning, domain-specific RAG, and knowledge distillation from other models, without worrying about legal or technical barriers. We often point to open source as a key driver in the LLM race toward affordability and specialized performance, and DeepSeek V3’s code release will likely accelerate that further. The willingness of a Chinese AI lab to release a thoroughly transparent, large-scale MoE model underlines the global nature of AI development. It’s no longer simply a race among the big tech giants in the U.S. Ultimately, that helps everyone by spurring new ideas for cost-effectiveness, new training algorithms, or improved load-balancing techniques. So far, we have seen far less development on, and far fewer Western deployments of, DeepSeek models compared to Meta’s popular Llama models. DeepSeek models have a complicated architecture, and while they are designed for affordable inference, serving them requires larger GPU clusters than the Llama models do. We think, however, that V3 is now a big enough leap in both cost and capability to incentivize a more active global open-source community to host and build on top of it.
A final point to make this week is the truly absurd divergence between this release and last week’s reveal of OpenAI o3 in terms of what they mean for inference cost. DeepSeek V3 now offers a ~10x saving for a frontier (non-reasoning) LLM. On the other hand, last week, OpenAI revealed a 600,000x increase relative to 4o in total final “useful” output token cost for some modes of o3 (low-efficiency mode with ~1,000 samples) on the ARC-AGI benchmark (4o is likely still the base model). Together, this means that in February (or whenever o3 is fully released), you may well have a choice to spend up to 6,000,000x more for OpenAI o3 in low-efficiency mode vs. using DeepSeek V3. Which of these trends will win out in terms of inference demand in the years ahead? Will GPU availability bottleneck AI adoption, or will rapid advances in inference efficiency also begin to translate to reasoning models? The very strong performance of o3-mini already suggests rapid cost progress for reasoning models, but 600,000x is a large hurdle to innovate away!
A brief note on our o3 inference cost scaling estimates:
Starting with 4o (likely the base model for both o1 and o3), o1 output tokens are priced 6x higher than 4o's. This higher cost per token arises because compute FLOPs scale partly quadratically and KV cache usage scales linearly with sequence length as more output tokens are generated; both effects force smaller batch sizes across concurrent user requests. On top of this, we have “series” scaling of output tokens: due to the hidden thinking tokens, the average o1 request uses ~5x more output tokens than 4o to generate a response (combined with the price difference, this means an average of 30x more cost per useful output). This series scaling can currently reach closer to 100x for some tasks (o1 has a 100k-token output context window), meaning series scaling of these reasoning models can lead to a price up to 600x higher than the base 4o model. OpenAI also appears to be scaling tokens in “parallel”: for the o3 ARC-AGI evaluation, it used six samples (high-efficiency mode) or 1,024 samples (low-efficiency mode) of model results. It is unclear how the best answer is chosen from these. It could be simple majority voting, but more likely there is extra complexity and secret sauce in how the best samples are automatically and rapidly searched, evaluated, and chosen. In any case, this could add another ~1,000x from “parallel” scaling on top of the 600x we found for series scaling. We expect that o1-Pro (only available within the $200/month ChatGPT Pro plan) uses parallel scaling and majority voting.
Beyond these, as discussed, there is also huge variability in the costs of the base models used. For example, DeepSeek’s V3 model (on par with or ahead of 4o on most benchmarks) will be priced at about 10x less than 4o per output token.
Overall, we could spend up to 6,000,000x more per useful output token using o3 with 1,000 samples relative to DeepSeek V3. Obviously, this isn’t an entirely fair comparison, given that some tasks are completely out of reach of non-reasoning models. But for others, you may soon have a very interesting engineering or business decision to make about whether to spend 600,000x more for marginally better reliability!
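For readers who want to check the arithmetic, the estimate above simply multiplies the individual factors together. The short Python sketch below uses only the rounded multipliers quoted in this note; the variable names are ours, and the calculation assumes (as the note does) that o3 tokens are priced like o1 tokens.

```python
# Rounded multipliers quoted in the note above, all relative to 4o.
price_per_token_o1_vs_4o = 6     # o1 output tokens priced ~6x higher than 4o
series_tokens_typical = 5        # ~5x more output tokens for an average request
series_tokens_extreme = 100      # up to ~100x more output tokens for some tasks
parallel_samples = 1_000         # ~1,000 samples in o3 low-efficiency mode
price_4o_vs_deepseek_v3 = 10     # V3 priced ~10x below 4o per output token

# Average cost per useful output for o1 vs. 4o.
typical_series_cost = price_per_token_o1_vs_4o * series_tokens_typical

# Upper end of "series" scaling for long reasoning chains.
max_series_cost = price_per_token_o1_vs_4o * series_tokens_extreme

# Adding "parallel" scaling (many samples per task) on top.
max_cost_vs_4o = max_series_cost * parallel_samples

# Switching the baseline from 4o to DeepSeek V3.
max_cost_vs_v3 = max_cost_vs_4o * price_4o_vs_deepseek_v3

print(f"Typical o1 vs. 4o:        {typical_series_cost:,}x")
print(f"Max series scaling:       {max_series_cost:,}x")
print(f"Plus parallel scaling:    {max_cost_vs_4o:,}x")
print(f"Relative to DeepSeek V3:  {max_cost_vs_v3:,}x")
```

Each factor is an order-of-magnitude estimate, so the final figure should be read as an upper bound on how far the two ends of the market have diverged rather than a precise price ratio.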
Hottest News
DeepSeek-V3 is a Mixture-of-Experts (MoE) language model featuring 671 billion parameters, with 37 billion activated per token. The model builds on architectures such as Multi-Head Latent Attention (MLA) and DeepSeekMoE, which were refined in earlier versions. DeepSeek-V3 has been trained on an extensive dataset of 14.8 trillion high-quality tokens, ensuring a broad and diverse knowledge base. Importantly, the model is fully open-source.
Qwen released QVQ, an open-weight model for multimodal reasoning, built upon Qwen2-VL-72B. QVQ represents a leap forward in AI’s capacity for visual understanding and complex problem-solving. QVQ achieves a score of 70.3 on MMMU and shows substantial improvements across math-related benchmarks compared to Qwen2-VL-72B-Instruct.
OpenAI says its corporate structure must evolve to advance its mission of ensuring that artificial general intelligence (AGI), meaning AI that can complete most tasks humans can, benefits all humanity. The company says it plans to begin transitioning its existing for-profit into a Delaware Public Benefit Corporation (PBC), with ordinary shares of stock and the OpenAI mission as its public benefit interest. This week, The Information also reported that OpenAI last year secretly agreed on a definition of AGI with Microsoft (achieving AGI triggers a reversion of profits to its charity structure), essentially defined as a model capable of achieving $100bn of profits.
Microsoft developed AIOpsLab, an evaluation framework designed to enable the systematic design, development, and enhancement of AIOps agents. AIOpsLab aims to address the need for reproducible, standardized, and scalable benchmarks. At its core, AIOpsLab integrates real-world workloads, fault injection capabilities, and interfaces between agents and cloud environments to simulate production-like scenarios. This open-source framework covers the entire lifecycle of cloud operations, from detecting faults to resolving them.
Cognition rolled out Devin 1.1, a faster, more cost-efficient, and more customizable version. Devin 1.1 is ~10% faster and ~10% more cost-efficient than Devin 1.0, especially for tasks that require Devin to make many code edits. It also supports full API functionality.
Meta introduced "large concept models" (LCMs) that move beyond token-level processing in LLMs to operate on higher-level semantic units called "concepts." Concepts are language- and modality-agnostic, representing ideas or actions at a broader level. Trained on up to 2.7T tokens with models scaling to 7B parameters, the LCM excels in tasks like summarization and summary expansion. It shows strong zero-shot performance across languages and outperforms similar-sized LLMs.
Five 5-minute reads/videos to keep you learning
This post from Anthropic shares their learning from working with customers and building agents and gives practical advice for developers on building effective agents. It answers essential questions such as when (and when not) to use agents and when and how to use frameworks, and it discusses the building blocks and workflows of agents.
This article provides an overview of agentic workflows, their key components, and common use cases. It also discusses recent software offerings that could help you implement them!
This article provides a comprehensive overview of fine-tuning LLMs. It discusses the types of data required and the steps involved in the process. The guide also explores various approaches, such as using open datasets like the Anthropic HH-RLHF dataset for model alignment, MIMIC-III for healthcare, and CodeSearchNet for coding. Additionally, it addresses the creation of custom datasets and generating synthetic data using large LLMs like ChatGPT.
This is a step-by-step tutorial on visualizing and understanding GPU memory usage in PyTorch during training. It also shows how to estimate memory requirements and optimize GPU memory usage.
This blog post introduces the ModernBERT models, a new state-of-the-art family of small and efficient encoder-only models that finally gives BERT a much-needed makeover. ModernBERT demonstrates that encoder-only models can be improved by modern methods. They continue to offer very strong performance on some tasks, providing a great size/performance ratio.
This comprehensive guide dives into the implementation of agentic workflows using the LlamaIndex library, exploring function calling, agent runners, Agentic Retrieval-Augmented Generation (RAG), and ReACT agents.
Repositories & Tools
1. Eliza is a simple autonomous agent framework.
2. ModernBERT brings BERT into modernity via architectural changes and scaling.
3. Lyra is a speech-centric framework for omni-cognition.
4. YuLan-Mini is a lightweight language model with 2.4 billion parameters.
5. APPL is a prompt programming language that extends Python to provide a natural and efficient way to use LLMs.
Top Papers of The Week
This work introduces a "Large Concept Model" (LCM) that moves beyond token-level processing in LLMs to operate on higher-level semantic units called "concepts." Concepts are language- and modality-agnostic, representing ideas or actions at a broader level. The LCM uses sentence embeddings (via SONAR) to predict sentences autoregressively, with techniques like MSE regression, diffusion-based generation, and quantized embedding models.
This work devises "precision-aware" scaling laws for training and inference in LLMs. Training in lower precision reduces the "effective parameter count," predicting performance loss from low-precision training or post-training quantization. For inference, quantization degrades performance more as models are trained on larger datasets, and beyond a point, extra training data can worsen results.
Artificial Life (ALife) has yet to integrate Foundation Models (FMs), creating a significant opportunity for the field to overcome the historical reliance on manual design and trial-and-error in discovering lifelike simulation configurations. This paper presents an Automated Search for Artificial Life (ASAL) for finding simulations that produce target phenomena, discovering simulations that generate temporally open-ended novelty, and illuminating an entire space of interestingly diverse simulations.
This paper introduces a professional domain knowledge service framework called Knowledge Augmented Generation (KAG). KAG addresses the limitations of RAG systems, including the gap between vector similarity and the relevance of knowledge reasoning, and insensitivity to knowledge logic such as numerical values, temporal relations, and expert rules. It improves generation and reasoning performance by bidirectionally enhancing large language models (LLMs) and knowledge graphs (KGs).
This paper introduces a privacy-preserving cloud RAG service that protects the user query. It proposes RemoteRAG as a solution for privacy (using DistanceDP to characterize privacy leakage), efficiency (by limiting the search range over documents), and accuracy (by ensuring that the small range includes the target documents). RemoteRAG can resist existing embedding inversion attack methods while achieving no loss in retrieval under various settings.
Quick Links
1. Nonprofit group Encode joins Elon Musk’s effort to block OpenAI’s for-profit transition. Facebook’s parent company and AI rival, Meta, also supports efforts to block OpenAI’s conversion. Musk, an early contributor to the original nonprofit entity, accused OpenAI of abandoning its original philanthropic mission of making the fruits of its AI research available to all.
2. Microsoft and OpenAI have agreed on a new, specific definition of AGI. According to a report, OpenAI will only be considered to have achieved AGI once it has built a system that can generate $100 billion in profits. The definition of AGI is highly important for Microsoft due to a clause in the agreement: once OpenAI achieves AGI, Microsoft will no longer be able to access the AI startup's most powerful models.
Who’s Hiring in AI
Interested in sharing a job opportunity here? Contact sponsors@towardsai.net.