Reiner Pope – Chip design from the bottom up
Podcast · 1 hr 20 min
Note: AI-generated summary based on third-party content. Not financial advice.
Quick Insights

The discussion supports a constructive view of NVIDIA (NVDA): its B100/B200 chips reportedly achieve about a 3x speedup when moving from FP8 to FP4, rather than the 2x that halving precision traditionally delivered, which points to more efficient use of die area at low precision. For exposure to specialized AI training at scale, Alphabet (GOOGL) stands out because its TPU architecture amortizes the "data movement tax" over very large matrix units, making it more efficient than general-purpose hardware on dense matrix workloads. In the private markets, watch MatX, a startup developing a "splittable systolic array" that aims to combine NVIDIA's flexibility with the efficiency of Google's large matrix units. A useful metric for evaluating any semiconductor investment is the ratio of compute area to data movement area, since hardware that minimizes overhead should lead in performance-per-watt. The industry-wide shift toward FP4 is the most time-sensitive trend; it favors companies that can maintain accuracy while capturing the quadratic area savings of narrower multipliers.

Detailed Analysis

The following investment insights are extracted from a technical discussion with Reiner Pope, CEO of MatX (an AI chip startup), on the fundamental design of AI hardware, the trade-offs between compute and communication, and the architectural differences between major industry players such as NVIDIA and Google.


MatX (Private)

• MatX is a new AI chip company focused on "bottom-up" chip design to optimize AI workloads.
• The company is developing a "splittable systolic array" architecture. This aims to combine the efficiency of large matrix units (like those in Google's TPU) with the flexibility of smaller, distributed units (like those in NVIDIA's GPUs).
• Investment Context: The podcast host, Dwarkesh Patel, disclosed that he is an angel investor in the company.

Takeaways

• Efficiency Focus: MatX is targeting the "data movement tax." By redesigning how logic gates and memory (registers) interact, the company aims to spend more chip area on actual computation and less on the overhead of moving data.
• Architectural Innovation: The "splittable" approach suggests a play for versatility: handling both massive matrix multiplications and the more granular, irregular AI tasks that typically favor GPUs.


NVIDIA (NVDA)

• The discussion highlights NVIDIA's transition from standard CUDA cores to Tensor Cores (starting with the Volta generation).
• Precision Scaling: NVIDIA's newer chips (B100/B200) show a better-than-linear performance boost when moving to lower precision. While halving bit precision (e.g., from FP8 to FP4) traditionally doubled performance, NVIDIA is now achieving 3x speedups in FP4, reflecting more efficient die area utilization.
• GPU Architecture: NVIDIA uses a "fine-grained" approach, tiling many small Streaming Multiprocessors (SMs) across a chip. This makes the chips highly flexible for varied software tasks but introduces more overhead for scheduling and synchronization.

Takeaways

• Dominance through Flexibility: NVIDIA's architecture is described as a collection of "tiny TPUs." This design is robust across a wide range of applications, which helps explain the company's stronghold on the developer market even though specialized "AI-only" chips can be more efficient for specific tasks.
• Quadratic Scaling: As AI models move toward lower precision (FP4), the hardware becomes disproportionately more efficient, because the physical area required for a multiplier scales quadratically with bit width. Halving the bit width cuts multiplier area by roughly 4x, not 2x.
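The quadratic-scaling claim can be illustrated with a rough sketch. It assumes a plain "schoolbook" unsigned integer multiplier, where the number of single-bit partial products stands in for area. Real FP4/FP8 units multiply only the mantissa and add exponent logic, so this is a simplified model and not a description of any NVIDIA design; the function names are hypothetical.

```python
def partial_product_count(bits: int) -> int:
    """Single-bit AND terms needed to multiply two unsigned `bits`-wide values."""
    return bits * bits


def schoolbook_multiply(a: int, b: int, bits: int) -> int:
    """Multiply by explicitly summing every single-bit partial product."""
    total = 0
    for i in range(bits):
        for j in range(bits):
            a_i = (a >> i) & 1          # bit i of a
            b_j = (b >> j) & 1          # bit j of b
            total += (a_i & b_j) << (i + j)
    return total


if __name__ == "__main__":
    for bits in (4, 8, 16):
        print(bits, "bits ->", partial_product_count(bits), "partial products")
```

Under this model an 8-bit multiplier needs 64 partial products and a 4-bit multiplier needs 16, so halving the width frees about three quarters of the multiplier area.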


Google / Alphabet (GOOGL) - TPU (Tensor Processing Unit)

• TPUs utilize a "coarse-grained" architecture with very large Systolic Arrays (Matrix Units).
• Deterministic Latency: Unlike standard CPUs, TPUs often use "scratchpad" memory instead of hardware-managed caches. This gives software developers direct control over data movement, leading to more predictable performance.
• Amortization of Cost: By using larger matrix units, TPUs "amortize" the cost of data movement better than GPUs for massive, dense matrix multiplications.
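The amortization argument above can be put into back-of-the-envelope arithmetic. The sketch assumes an idealized N x N systolic array in steady state that performs N*N multiply-accumulates per cycle while only N operand words enter and N result words leave across its edges. The array sizes are illustrative and are not taken from any product specification.

```python
def macs_per_edge_word(n: int) -> float:
    """Multiply-accumulates performed per word crossing the array boundary."""
    macs_per_cycle = n * n       # every cell does one MAC per cycle
    words_per_cycle = 2 * n      # n operands in, n results out (idealized)
    return macs_per_cycle / words_per_cycle


if __name__ == "__main__":
    for n in (8, 32, 128):
        print(f"{n}x{n} array: {macs_per_edge_word(n)} MACs per word moved")
```

In this model the work done per word moved grows linearly with the array dimension (N/2), which is one way to read the claim that larger matrix units amortize data movement better than many small ones.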

Takeaways

• Specialization vs. Generalization: The TPU is highly optimized for the specific math of neural networks (matrix multiplication). While it can be more efficient than a GPU for large-scale training, it is less "fungible" or flexible for irregular workloads.
• Supply Chain Verticalization: Google's continued success with TPUs demonstrates that for hyperscalers, custom silicon can significantly reduce the "tax" paid on data movement compared to general-purpose hardware.


Semiconductor Sector Themes

• The "Data Movement Tax": A recurring theme is that moving data from a register (storage) to the ALU (math unit) is often many times more expensive in chip area and power than the math itself.
• Systolic Arrays as the Standard: The "Systolic Array" is identified as the most efficient known mechanism for matrix multiplication. Any competitive AI chip must utilize this structure to minimize the bandwidth required from expensive on-chip memory.
• Clock Speed vs. Throughput: There is a critical trade-off in chip design. Increasing clock speed (GHz) can improve latency but often requires adding "pipeline registers" that take up space, potentially lowering the total "throughput" (total work done) of the chip.
• ASIC vs. FPGA:
  * ASICs (Application-Specific Integrated Circuits): Cheaper and more efficient at scale, but require ~$30M+ "tape-out" costs.
  * FPGAs (Field-Programmable Gate Arrays): Used by firms like Jane Street for high-frequency trading because they offer deterministic latency and can be reprogrammed "in the field," despite being ~10x less efficient than ASICs.
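To make the systolic-array point concrete, here is a minimal cycle-by-cycle simulation in pure Python. It assumes an output-stationary layout, chosen for brevity: each cell keeps one output accumulator, values of A flow to the right, and values of B flow downward. Production matrix units may use a different dataflow (for example, holding weights stationary), and the function name is hypothetical.

```python
def systolic_matmul(a, b):
    """Compute a @ b on a simulated grid of multiply-accumulate cells."""
    m, k, n = len(a), len(b), len(b[0])
    acc = [[0] * n for _ in range(m)]     # one accumulator per cell
    a_reg = [[0] * n for _ in range(m)]   # A value currently held in each cell
    b_reg = [[0] * n for _ in range(m)]   # B value currently held in each cell

    for t in range(k + m + n - 2):        # enough cycles to drain the skewed inputs
        # Visit cells from the far corner so each one reads its neighbour's
        # value from the previous cycle before that neighbour is overwritten.
        for i in reversed(range(m)):
            for j in reversed(range(n)):
                if j == 0:                # left edge: row i is fed with a delay of i cycles
                    s = t - i
                    a_in = a[i][s] if 0 <= s < k else 0
                else:                     # otherwise take the value from the left neighbour
                    a_in = a_reg[i][j - 1]
                if i == 0:                # top edge: column j is fed with a delay of j cycles
                    s = t - j
                    b_in = b[s][j] if 0 <= s < k else 0
                else:                     # otherwise take the value from the cell above
                    b_in = b_reg[i - 1][j]
                a_reg[i][j] = a_in
                b_reg[i][j] = b_in
                acc[i][j] += a_in * b_in  # local multiply-accumulate
    return acc


if __name__ == "__main__":
    print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

Each operand is read from memory once at the array edge and then reused by every cell it passes through. Only the edge cells touch the input matrices, which is the bandwidth saving the bullet above refers to.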

Takeaways

• Investment Metric: When evaluating new AI chip startups or incumbents, a key metric is the ratio of compute area to data movement area. Chips that can minimize "Mux" (selector) costs and maximize "Multiplier" area are likely to lead in performance-per-watt.
• The Shift to Low Precision: The industry is aggressively moving toward FP4 and lower. Companies that can successfully implement high-accuracy training/inference at these low precisions will have a massive cost advantage due to the quadratic area savings on the physical chip.
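The compute-to-data-movement ratio can be sketched with a crude gate count. The model below assumes that selecting one of R registers takes a tree of R - 1 two-input muxes per bit, that a multiplier costs bits * bits AND terms, and that one mux and one AND term occupy equal area. That last assumption is a deliberate oversimplification, and all names and sizes are hypothetical.

```python
def mux2_count(num_registers: int, bits: int) -> int:
    """Two-input muxes needed to select one `bits`-wide value out of `num_registers`."""
    return (num_registers - 1) * bits


def multiplier_and_count(bits: int) -> int:
    """Single-bit AND terms in a schoolbook multiplier (simplified compute cost)."""
    return bits * bits


def compute_fraction(num_registers: int, bits: int) -> float:
    """Share of the modelled area spent on the multiplier rather than operand selection."""
    compute = multiplier_and_count(bits)
    movement = 2 * mux2_count(num_registers, bits)   # two operands are selected
    return compute / (compute + movement)


if __name__ == "__main__":
    for regs in (1, 8, 32):
        print(f"{regs} registers, 8-bit: compute fraction = {compute_fraction(regs, 8):.3f}")
```

Even in this toy model, reading two 8-bit operands out of a 32-entry register file costs several times more than the multiply itself, which is the intuition behind treating the ratio as a screening metric.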

Episode Description
New blackboard lecture with Reiner Pope: how do chips actually work - starting with basic logic gates, and working up to why GPUs, TPUs, FPGAs, and the human brain each look the way they do.

Reiner is CEO of MatX, a new chip startup (full disclosure - I'm an angel investor). He was previously at Google, where he worked on software efficiency, compilers, and TPU architecture.

Watch this one on YouTube so you can see the chalkboard. Read the transcript.

Sponsors
* Crusoe was one of only five GPU clouds that made the gold tier in SemiAnalysis' most recent ClusterMAX report. Gold-tier providers like Crusoe delivered 5-15% lower TCO than silver-tier clouds, even with identical GPU pricing. This is because optimizations like early fault detection and rapid node replacement don't necessarily show up in the sticker price, but still matter a ton in the real world. Learn more at crusoe.ai/dwarkesh
* Cursor is where I do most of my work, from reading research papers to visualizing technical concepts to coding up internal tools for the podcast. Most recently, I used it to build two different review interfaces for my essay contest, one that anonymizes submissions for scoring and another that lets me see applicants' essays next to their resumes and websites. Whatever you're working on, you should try doing it in Cursor. Get started at cursor.com/dwarkesh
* Jane Street let me ask Ron Minsky and Dan Pontecorvo, two senior Jane Streeters, a bunch of questions about how they use AI. We discussed everything from the types of models they're training to how they think about the future of trading to why they're more bullish than ever on hiring technical talent. You can watch the full conversation and learn more about their open positions at janestreet.com/dwarkesh

Timestamps
00:00:00 – Building a multiply-accumulate from logic gates
00:16:31 – Muxes and the cost of data movement
00:26:10 – How systolic arrays work
00:39:11 – Clock cycles and pipeline registers
00:51:51 – FPGAs vs ASICs
01:03:25 – Cache vs scratchpad
01:07:27 – Why CPU cores are much bigger than GPU cores
01:12:00 – Brains vs chips
01:15:33 – A GPU is just a bunch of tiny TPUs

Get full access to Dwarkesh Podcast at www.dwarkesh.com/subscribe
About Dwarkesh Podcast
Dwarkesh Podcast

By Dwarkesh Patel

Deeply researched interviews

www.dwarkesh.com