Reiner Pope – The math behind how LLMs are trained and served
Podcast · 2 hr 13 min
Note: AI-generated summary based on third-party content. Not financial advice.
Quick Insights

  • Investors should maintain a high-conviction position in NVIDIA (NVDA), specifically focusing on the transition to the Blackwell NVL72 and upcoming Rubin architectures, which solve critical "scale-up" networking bottlenecks.
  • Beyond raw chips, look for opportunities in the "cabling and switching" sector, as the physical density of interconnects and NVLink technology is now the primary constraint on AI model scaling.
  • A significant portion of hyperscaler CapEx is being consumed by High Bandwidth Memory (HBM), making companies that specialize in CXL (Compute Express Link) and tiered memory management essential for reducing costs.
  • Efficiency-first architectures like DeepSeek demonstrate that "sparse" models (Mixture of Experts) will dominate the market by offering frontier-level performance with significantly higher profit margins.
  • Finally, monitor the shift from training-heavy to inference-heavy hardware, as bespoke chip startups focusing on memory bandwidth rather than just raw math speed are poised to capture the next wave of AI infrastructure spending.

Detailed Analysis

Based on the technical discussion between Dwarkesh Patel and Reiner Pope (CEO of MatX, former Google TPU architect), the following investment insights and structural themes for the AI infrastructure sector have been extracted.


NVIDIA (NVDA) / Blackwell Architecture

The discussion centers on the Blackwell NVL72 cluster as the current gold standard for frontier model training and inference. The "Blackwell Rack" is identified as the fundamental unit of compute because it defines the boundary of "scale-up" networking.

  • Scale-Up vs. Scale-Out: NVIDIA’s competitive advantage lies in the NVLink (scale-up) network within a rack, which is roughly 8x faster than the "scale-out" network (InfiniBand/Ethernet) used to connect different racks.
  • The Rack as the "Unit": Because Mixture of Experts (MoE) models require "all-to-all" communication, they are physically constrained by the size of a single rack (see the bandwidth sketch after this list).
  • Rubin Generation: The upcoming Rubin chips are expected to increase the scale-up domain from 72 GPUs to over 500, which will allow for significantly larger and more complex models to run without hitting networking bottlenecks.
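
To make the scale-up/scale-out gap concrete, here is a rough back-of-envelope sketch of one MoE all-to-all shuffle inside versus outside the rack. The bandwidths, token count, and hidden size are assumed round numbers for illustration, not official specs.

```python
# Back-of-envelope: time for one MoE all-to-all shuffle, inside vs. outside the rack.
# All figures are assumed round numbers for illustration, not official specs.
NVLINK_BW      = 900e9   # assumed per-GPU NVLink (scale-up) bandwidth, bytes/s
SCALEOUT_BW    = 100e9   # assumed per-GPU scale-out NIC bandwidth, bytes/s (~800 Gb/s)
TOKENS_PER_GPU = 4096    # assumed tokens resident on each GPU for this step
HIDDEN_DIM     = 7168    # assumed model hidden size
BYTES_PER_ACT  = 2       # bf16 activations

# With expert parallelism, each token's hidden vector is routed to experts that
# mostly live on other GPUs, so nearly the whole activation tensor crosses the network.
bytes_per_gpu = TOKENS_PER_GPU * HIDDEN_DIM * BYTES_PER_ACT   # ~59 MB leaving each GPU

t_nvlink   = bytes_per_gpu / NVLINK_BW     # ~65 microseconds
t_scaleout = bytes_per_gpu / SCALEOUT_BW   # ~590 microseconds

print(f"all-to-all over NVLink (in-rack):  {t_nvlink * 1e6:5.0f} us")
print(f"all-to-all over scale-out network: {t_scaleout * 1e6:5.0f} us  "
      f"({t_scaleout / t_nvlink:.0f}x slower)")
```

With two such shuffles per MoE layer (dispatch and combine) and dozens of layers per forward pass, that gap is why all-to-all traffic has to stay inside the scale-up domain, and why widening that domain (Rubin) directly enables larger expert-parallel layouts.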

Takeaways

  • Bullish on Interconnects: Investors should focus on the "cabling and switching" density. The physical constraint on AI progress is currently the "wire density" and the ability to pack more cables into a rack without snapping them or overheating.
  • Memory Bandwidth > Memory Capacity: While the market focuses on HBM (High Bandwidth Memory) capacity, the transcript suggests bandwidth (the speed of moving data) is the actual bottleneck for inference latency.

DeepSeek / Mixture of Experts (MoE)

The transcript highlights DeepSeek (specifically v3) as a pioneer in "sparse" model architecture. This is a critical theme for the economics of AI.

  • Sparsity Economics: DeepSeek V3 uses 37 billion active parameters out of 671 billion total. This allows them to achieve the quality of a massive model with the compute cost of a much smaller one.
  • The "Goldilocks Zone": There is a mathematical balance point where a model is equally memory-bound and compute-bound. DeepSeek’s architecture aims for this "Goldilocks" zone to maximize efficiency.
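
A minimal sketch of that balance point, assuming a Blackwell-class accelerator with roughly 8 TB/s of HBM bandwidth and on the order of 2 PFLOP/s of dense low-precision compute, plus FP8 weights (illustrative round numbers, not vendor specs). Because the weights are read from HBM once per decode step and shared by every sequence in the batch, the same math also underlies the batching amortization mentioned in the takeaways below.

```python
# Roofline balance for one MoE decode step: memory-bound vs. compute-bound.
# Hardware numbers are assumed round figures, not vendor specs.
PEAK_FLOPS      = 2.0e15   # dense low-precision FLOP/s
HBM_BW          = 8.0e12   # bytes/s
ACTIVE_PARAMS   = 37e9     # DeepSeek-V3-style active parameters per token
BYTES_PER_PARAM = 1        # FP8 weights

def step_times(batch):
    # Each decode step reads every active weight from HBM once (shared by the
    # whole batch) and spends ~2 FLOPs per active parameter per token.
    t_mem  = ACTIVE_PARAMS * BYTES_PER_PARAM / HBM_BW   # fixed: ~4.6 ms
    t_comp = 2 * batch * ACTIVE_PARAMS / PEAK_FLOPS     # grows with batch size
    return t_mem, t_comp

# "Goldilocks" batch size: where compute time catches up with weight-loading time.
critical_batch = PEAK_FLOPS * BYTES_PER_PARAM / (2 * HBM_BW)   # ~125 with these numbers

for b in (1, 32, round(critical_batch), 512):
    t_mem, t_comp = step_times(b)
    regime = "memory-bound" if t_mem > t_comp else "compute-bound"
    print(f"batch {b:4d}: weight load {t_mem*1e3:5.2f} ms, "
          f"compute {t_comp*1e3:5.2f} ms -> {regime}")
```

Below the critical batch the GPU is idle waiting on HBM; above it, extra batching no longer comes for free, which is why architectures aim near that point.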

Takeaways

  • Investment Theme: Efficiency-first architectures. Companies that can achieve "Frontier" performance (like GPT-4) using sparse methods (MoE) will have significantly better margins on API pricing.
  • Inference Advantage: Sparse models allow for larger batch sizes, which amortizes the cost of loading weights across thousands of users simultaneously.

The "Memory Wall" & Hyperscaler CapEx

A significant portion of Hyperscaler (Google, AWS, Meta) CapEx—potentially up to 50%—is now being spent on memory (HBM).

  • The KV Cache Problem: As users demand longer context lengths (e.g., 200k+ tokens), the memory required to store the "conversation history" (KV Cache) becomes the dominant cost, surpassing the cost of the model weights themselves (a rough sizing sketch follows this list).
  • Tiered Storage Opportunity: There is a growing need for a "memory hierarchy" in AI data centers.
    • HBM: For active processing (very expensive).
    • DDR/Flash: For "caching" conversations that are paused (cheaper).
    • Spinning Disks: Potentially used for long-term "context" storage (slowest/cheapest).
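
A rough sizing sketch of why the KV cache takes over at long context, using assumed model dimensions (grouped-query attention with placeholder layer and head counts, not any lab's published configuration):

```python
# How big does per-conversation state (the KV cache) get as context grows?
# Model dimensions below are assumed placeholders, not a published config.
LAYERS          = 61
KV_HEADS        = 8        # grouped-query attention: few KV heads
HEAD_DIM        = 128
BYTES_PER_VALUE = 2        # bf16 KV cache
ACTIVE_PARAMS   = 37e9
BYTES_PER_PARAM = 1        # FP8 weights

def kv_cache_gb(context_tokens):
    # Per context token, every layer stores one key and one value vector per KV head.
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_VALUE * context_tokens / 1e9

weights_gb = ACTIVE_PARAMS * BYTES_PER_PARAM / 1e9
for ctx in (8_000, 200_000, 1_000_000):
    print(f"{ctx:>9,} tokens -> ~{kv_cache_gb(ctx):6.1f} GB of KV cache per conversation "
          f"(vs. ~{weights_gb:.0f} GB of active weights, shared by everyone)")
```

Unlike weights, that state is per conversation, so it multiplies with the number of concurrent users; that is what pushes paused caches down into the DDR/flash tiers above.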

Takeaways

  • Actionable Insight: Look for companies specializing in CXL (Compute Express Link) and tiered memory management. As HBM becomes a CapEx sink, software or hardware that allows AI to "offload" memory to cheaper tiers (DDR5) will be highly valuable (see the offload sketch after these takeaways).
  • Context Length Limits: The transcript suggests we are hitting a "memory wall" for context. Don't expect context lengths to grow infinitely (e.g., to 100 million tokens) without a fundamental shift in memory hardware.
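
A small sketch of why offloading pays off: pushing a paused conversation's KV cache out of HBM over a PCIe Gen5 x16-class link and pulling it back later is sub-second work, while the conversation may sit idle for minutes. Cache size, link speed, and pause length below are assumed round numbers.

```python
# Is it worth evicting a paused conversation's KV cache from HBM to DDR/flash?
# All figures are assumed round numbers for illustration.
KV_CACHE_BYTES = 50e9    # e.g. a ~200k-token conversation, per the sizing sketch above
LINK_BW        = 64e9    # PCIe Gen5 x16-class path to the cheaper tier, bytes/s
PAUSE_SECONDS  = 120     # user reads the answer before sending the next message

t_evict  = KV_CACHE_BYTES / LINK_BW   # ~0.8 s to push out of HBM
t_reload = KV_CACHE_BYTES / LINK_BW   # ~0.8 s to pull back on the next turn

print(f"evict + reload: ~{t_evict + t_reload:.1f} s of link time")
print(f"HBM freed for other users during the pause: ~{PAUSE_SECONDS - (t_evict + t_reload):.0f} s")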

AI API Pricing Dynamics (Claude, Gemini, OpenAI)

The transcript decodes why AI companies price their services the way they do, providing a "peek under the hood" of their margins.

  • Input vs. Output Pricing: Output tokens are often 5x more expensive than input tokens because outputs are "memory-bandwidth limited" (slow and expensive), while inputs are "compute-limited" (fast and efficient); see the throughput sketch after this list.
  • Cache Hits: Providers are starting to offer 10x cheaper pricing for "cache hits." This indicates they have successfully moved the user's data from expensive HBM to cheaper Flash/DDR storage.
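
The input/output asymmetry falls out of a simple throughput comparison: prefill is compute-limited because one pass over the weights scores the whole prompt at once, while decode is bandwidth-limited because the weights must be re-read from HBM for every generated token. The sketch below reuses the earlier assumed hardware and model numbers; the decode batch is capped at 32 as a stand-in for latency and KV-capacity limits, and KV-cache reads are ignored (they would make decode even slower).

```python
# GPU throughput for reading a prompt (prefill) vs. writing an answer (decode).
# Same assumed, illustrative numbers as the earlier sketches.
PEAK_FLOPS      = 2.0e15   # dense low-precision FLOP/s
HBM_BW          = 8.0e12   # bytes/s
ACTIVE_PARAMS   = 37e9
BYTES_PER_PARAM = 1        # FP8 weights
DECODE_BATCH    = 32       # assumed cap from latency targets and KV-cache capacity

# Prefill: compute-limited. ~2 FLOPs per active parameter per prompt token,
# and a single read of the weights serves every token in the prompt.
prefill_tok_per_s = PEAK_FLOPS / (2 * ACTIVE_PARAMS)          # ~27,000 tokens/s

# Decode: bandwidth-limited. Every step re-reads all active weights from HBM
# and produces just one new token per sequence in the batch.
step_time        = ACTIVE_PARAMS * BYTES_PER_PARAM / HBM_BW   # ~4.6 ms per step
decode_tok_per_s = DECODE_BATCH / step_time                   # ~6,900 tokens/s

print(f"prefill: ~{prefill_tok_per_s:,.0f} input tokens/s per GPU")
print(f"decode:  ~{decode_tok_per_s:,.0f} output tokens/s per GPU")
print(f"-> one output token costs ~{prefill_tok_per_s / decode_tok_per_s:.1f}x "
      f"the GPU time of an input token")
```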

Takeaways

  • Margin Analysis: Companies with high "cache hit" rates (users asking questions about the same uploaded PDF) will have much higher profit margins than those doing "fresh" generations.
  • Pricing as a Signal: If an AI provider raises prices for long-context windows (e.g., Gemini's 50% bump at 200k tokens), it is a signal that they have hit a hardware "inflection point" where the model becomes inefficient.
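
A minimal sketch of the cache-hit economics, reusing the earlier assumed numbers: a miss means re-running prefill on the accelerator, while a hit only needs the stored KV cache streamed back from the cheaper tier (the restore bandwidth here is an assumed placeholder).

```python
# Serving a repeated 100k-token prompt: recompute it, or restore its KV cache?
# Rates below are assumed placeholders, not any provider's actual figures.
PROMPT_TOKENS     = 100_000
PREFILL_TOK_PER_S = 27_000    # compute-bound prefill rate from the sketch above
KV_BYTES_PER_TOK  = 0.25e6    # from the KV sizing sketch (~0.25 MB per context token)
RESTORE_BW        = 20e9      # assumed streaming bandwidth from the flash/DDR tier, bytes/s

t_recompute = PROMPT_TOKENS / PREFILL_TOK_PER_S               # ~3.7 s of accelerator prefill
t_restore   = PROMPT_TOKENS * KV_BYTES_PER_TOK / RESTORE_BW   # ~1.3 s of cheap I/O

print(f"cache miss: ~{t_recompute:.1f} s of accelerator compute to re-prefill the prompt")
print(f"cache hit:  ~{t_restore:.1f} s of I/O from the cheaper tier, no prefill compute")
```

The restore is inexpensive I/O that does not tie up the accelerator's compute, which is one reading of how a provider can hand back a ~10x cache-hit discount and still protect, or improve, its margin.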

MatX (Private/Startup)

The guest, Reiner Pope, is the CEO of MatX, a new chip startup.

  • Focus: MatX appears to be designing hardware specifically to solve the "memory and communication" bottlenecks discussed, rather than just chasing raw FLOPs (math speed).
  • Angel Investors: The host (Dwarkesh Patel) is an angel investor, indicating high-conviction interest from the tech-intellectual community.

Takeaways

  • Sector Watch: Keep an eye on "bespoke" AI chip startups that focus on Inference rather than Training. The transcript argues that as models become "over-trained," the world will shift from a training-heavy economy to an inference-heavy economy.
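
The training-versus-inference shift can be sketched with the standard approximations (training cost ≈ 6·N·D FLOPs, inference cost ≈ 2·N FLOPs per token, Chinchilla-optimal data ≈ 20 tokens per parameter), using active parameters in both formulas as a simplification. The model size and fleet-wide serving volume below are assumed for illustration.

```python
# When does lifetime inference compute overtake an over-trained training run?
# Standard approximations: train ~= 6*N*D FLOPs, inference ~= 2*N FLOPs per token,
# Chinchilla-optimal D ~= 20*N tokens. Sizes and volumes are assumed for illustration.
N_ACTIVE          = 37e9              # active parameters per token (sparse model)
CHINCHILLA_TOKENS = 20 * N_ACTIVE     # ~0.74 trillion tokens
OVERTRAIN         = 100               # "100x over-trained" regime discussed in the episode

train_flops = 6 * N_ACTIVE * CHINCHILLA_TOKENS * OVERTRAIN

SERVED_TOKENS_PER_DAY   = 1e12        # assumed fleet-wide serving volume
inference_flops_per_day = 2 * N_ACTIVE * SERVED_TOKENS_PER_DAY

print(f"training compute (100x over-trained): {train_flops:.1e} FLOPs")
print(f"inference compute per day:            {inference_flops_per_day:.1e} FLOPs")
print(f"inference overtakes training after ~{train_flops / inference_flops_per_day:.0f} days of serving")
```

Under these assumed volumes, the deployed fleet spends more compute answering queries within a year than went into training, which is the sense in which the economics tilt toward inference.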
Episode Description
Did a very different format with Reiner Pope - a blackboard lecture where he walks through how frontier LLMs are trained and served. It's shocking how much you can deduce about what the labs are doing from a handful of equations, public API prices, and some chalk. It's a bit technical, but I encourage you to hang in there – it's really worth it. There are fewer than a handful of people who understand the full stack of AI, from hardware design to model architecture, as well as Reiner. It was a real delight to learn from him. Recommend watching this one on YouTube so you can see the chalkboard.

Reiner is CEO of MatX, a new chip startup (full disclosure - I'm an angel investor). He was previously at Google, where he worked on software efficiency, compilers, and TPU architecture.

Download markdown of transcript here to chat with an LLM - working on some flashcards to help us all retain the content in this episode - come back here in a few hours!

Sponsors

  • Jane Street needs constant access to incredibly low-latency compute. I recently asked one of their engineers, Clark, to talk me through how they meet these demands. Our conversation, which touched on everything from FPGAs to liquid cooling, was extremely helpful as I prepped to interview Reiner. You can watch the full discussion and explore Jane Street's open roles at janestreet.com/dwarkesh
  • Google's Gemma 4 is the first open model that's let me shut off the internet and create a fully disconnected "focus machine". This is because Gemma is small enough to run on my laptop, but powerful enough to actually be useful. So, to prep for this interview, I downloaded Reiner's scaling book, disconnected from wifi, and used Gemma to help me break down the material. Check it out at goo.gle/Gemma4
  • Cursor helped me turn some notes I took on how gradients flow during large-scale pretraining into a great animation. At first, I wasn't sure the best way to visualize the concept, but Cursor's Composer 2 Fast model let me iterate on different ideas almost instantaneously. You can check out the animation in my recent blog post. And if you have something to visualize yourself, go to cursor.com/dwarkesh

Timestamps

(00:00:00) – How batch size affects token cost and speed
(00:32:09) – How MoE models are laid out across GPU racks
(00:47:12) – How pipeline parallelism moves model layers across racks
(01:03:37) – Why Ilya said, "As we now know, pipelining is not wise."
(01:18:59) – Because of RL, models may be 100x over-trained beyond Chinchilla-optimal
(01:33:02) – Deducing long context memory costs from API pricing
(02:04:02) – Convergent evolution between neural nets and cryptography

Get full access to Dwarkesh Podcast at www.dwarkesh.com/subscribe
About Dwarkesh Podcast
By Dwarkesh Patel

Deeply researched interviews. www.dwarkesh.com