
Investors should maintain high conviction in NVIDIA (NVDA): its new B100/B200 chips achieve roughly a 3x speedup at FP4 precision, where halving bit width traditionally delivered only 2x, a sign of more efficient die-area use at low precision. For exposure to specialized AI training at scale, Alphabet (GOOGL) remains a top pick, since its TPU architecture amortizes the "data movement tax" over very large matrix units more effectively than general-purpose hardware. Keep a close watch on the private markets for Maddox, a startup developing a "splittable systolic array" that aims to bridge the gap between NVIDIA’s flexibility and Google’s raw efficiency. A critical metric for evaluating any semiconductor investment is the ratio of compute area to data-movement area, since hardware that minimizes overhead should lead in performance-per-watt. The industry-wide shift toward FP4 precision is the most time-sensitive trend, favoring companies that can maintain accuracy while capturing the quadratic area savings of lower bit widths.
The following investment insights are extracted from a technical discussion with Reiner Pope, CEO of Maddox (an AI chip startup), regarding the fundamental design of AI hardware, the trade-offs between compute and communication, and the architectural differences between major industry players like NVIDIA and Google.
• Maddox is a new AI chip company focused on "bottom-up" chip design to optimize AI workloads.
• The company is developing a "splittable systolic array" architecture. This aims to combine the efficiency of large matrix units (like those in Google's TPU) with the flexibility of smaller, distributed units (like those in NVIDIA's GPUs).
• Investment Context: The podcast host, Dwarkesh Patel, disclosed he is an initial investor in the company.
• Efficiency Focus: Maddox is targeting the "data movement tax." By redesigning how logic gates and memory (registers) interact, they aim to spend more chip area on actual computation rather than the "overhead" of moving data.
• Architectural Innovation: Their "splittable" approach suggests a play for versatility: being able to handle both massive matrix multiplications and the more granular, irregular AI tasks that typically favor GPUs.
• The discussion highlights NVIDIA's transition from standard CUDA cores to Tensor Cores (starting with the Volta generation).
• Precision Scaling: NVIDIA's newer chips (B100/B200) show a non-linear performance boost when moving to lower precision. While halving bit precision (e.g., from FP8 to FP4) traditionally doubled performance, NVIDIA is now achieving 3x speedups in FP4, reflecting more efficient die area utilization.
• GPU Architecture: NVIDIA uses a "fine-grained" approach, tiling many small Streaming Multiprocessors (SMs) across a chip. This makes them highly flexible for various software tasks but introduces more overhead for scheduling and synchronization.
• Dominance through Flexibility: NVIDIA’s architecture is described as a collection of "tiny TPUs." This design is robust for a wide range of applications, which explains their stronghold on the developer market despite specialized "AI-only" chips being more efficient for specific tasks.
• Quadratic Scaling: Investors should note that as AI models move toward lower precision (FP4), NVIDIA's hardware becomes disproportionately more efficient, because the physical area required for a multiplier scales quadratically with bit width: halving the width cuts multiplier area to roughly a quarter.
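The quadratic-scaling claim can be checked with back-of-envelope arithmetic. The sketch below assumes an idealized array multiplier whose area is proportional to the square of the operand width (one cell per partial-product bit). Real floating-point units multiply only the mantissa and add exponent logic, so the function names and numbers here are illustrative assumptions, not NVIDIA's actual figures.

```python
def relative_multiplier_area(bits: int, baseline_bits: int = 16) -> float:
    """Area of an idealized n x n array multiplier relative to a baseline width.

    Assumption: area is proportional to bits**2 (one cell per partial-product bit).
    """
    return (bits / baseline_bits) ** 2


def multipliers_per_baseline_area(bits: int, baseline_bits: int = 16) -> float:
    """How many multipliers of this width fit in the area of one baseline multiplier."""
    return 1.0 / relative_multiplier_area(bits, baseline_bits)


if __name__ == "__main__":
    for bits in (16, 8, 4):
        area = relative_multiplier_area(bits)
        count = multipliers_per_baseline_area(bits)
        print(f"{bits:>2}-bit: relative area {area:.4f}, "
              f"{count:.0f} multipliers per baseline area")
```

Under this idealization, halving the width frees enough area for four multipliers where one used to fit. Data-movement structures such as registers and wires shrink closer to linearly with bit width, which is one plausible reading of why a reported speedup would land between 2x and 4x rather than at either extreme.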
• TPUs utilize a "coarse-grained" architecture with very large Systolic Arrays (Matrix Units).
• Deterministic Latency: Unlike standard CPUs, TPUs often use "scratchpad" memory instead of hardware-managed caches. This gives software developers direct control over data movement, leading to more predictable performance.
• Amortization of Cost: By using larger matrix units, TPUs "amortize" the cost of data movement better than GPUs for massive, dense matrix multiplications.
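The amortization argument for large matrix units can be sketched numerically. The model below is an assumption made for illustration, not a description of any specific TPU generation: an n x n weight-stationary systolic array in steady state, where each cycle n activations enter one edge, n partial sums leave another, and all n * n cells perform one multiply-accumulate (MAC). The array sizes in the loop are arbitrary examples.

```python
def macs_per_operand_moved(n: int) -> float:
    """MACs performed per operand crossing the array edge, per cycle.

    Assumption: weight-stationary n x n systolic array in steady state.
    Weights are preloaded and not counted as per-cycle traffic.
    """
    macs = n * n              # every cell does one multiply-accumulate
    operands_moved = 2 * n    # n activations in + n partial sums out
    return macs / operands_moved


if __name__ == "__main__":
    for n in (8, 32, 128):    # hypothetical array sizes
        print(f"{n:>3} x {n:<3} array: "
              f"{macs_per_operand_moved(n):.0f} MACs per operand moved")
```

In this model the work done per operand moved grows linearly with the array dimension, which is the sense in which a coarse-grained design spreads a fixed data-movement cost over more computation.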
• Specialization vs. Generalization: The TPU is highly optimized for the specific math of neural networks (matrix multiplication). While it can be more efficient than a GPU for large-scale training, it is less "fungible" or flexible for irregular workloads.
• Supply Chain Verticalization: Google's continued success with TPUs demonstrates that for hyperscalers, custom silicon can significantly reduce the "tax" paid on data movement compared to general-purpose hardware.
• The "Data Movement Tax": A recurring theme is that moving data from a register (storage) to the ALU (math unit) is often many times more expensive in terms of chip area and power than the math itself.
• Systolic Arrays as the Standard: The "Systolic Array" is identified as the most efficient known mechanism for matrix multiplication. Any competitive AI chip must utilize this structure to minimize the bandwidth required from expensive on-chip memory.
• Clock Speed vs. Throughput: There is a critical trade-off in chip design. Increasing clock speed (GHz) can improve latency but often requires adding "pipeline registers" that take up space, potentially lowering the total "throughput" (total work done) of the chip.
• ASIC vs. FPGA:
  * ASICs (Application-Specific Integrated Circuits): Cheaper and more efficient at scale, but require ~$30M+ "tape-out" costs.
  * FPGAs (Field-Programmable Gate Arrays): Used by firms like Jane Street for high-frequency trading because they offer deterministic latency and can be reprogrammed "in the field," despite being ~10x less efficient than ASICs.
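To make the systolic-array idea concrete, here is a minimal cycle-by-cycle simulation in plain Python. It assumes an output-stationary variant, chosen only because it is the simplest to simulate; the discussion does not say which dataflow any particular chip uses. Each processing element (PE) keeps one accumulator, values of `a` flow left to right, values of `b` flow top to bottom, and the function name `systolic_matmul` is hypothetical.

```python
def systolic_matmul(a, b):
    """Multiply a (n x k) by b (k x m) on a simulated n x m systolic array."""
    n, k_dim, m = len(a), len(b), len(b[0])
    acc = [[0] * m for _ in range(n)]    # one accumulator per PE
    a_reg = [[0] * m for _ in range(n)]  # operand registers moving right
    b_reg = [[0] * m for _ in range(n)]  # operand registers moving down

    # The last product (bottom-right PE, final k) occurs at cycle n + m + k - 3.
    for t in range(n + m + k_dim - 2):
        # Shift operands one PE along; walk backwards so old values are read.
        for i in range(n):
            for j in range(m - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
        for j in range(m):
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]

        # Feed the array edges, skewed by one cycle per row/column.
        for i in range(n):
            k = t - i
            a_reg[i][0] = a[i][k] if 0 <= k < k_dim else 0
        for j in range(m):
            k = t - j
            b_reg[0][j] = b[k][j] if 0 <= k < k_dim else 0

        # Every PE multiplies its two current operands and accumulates.
        for i in range(n):
            for j in range(m):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc


if __name__ == "__main__":
    print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

Each element of `a` is read from the edge once and then reused by every PE in its row as it passes through, and likewise for `b` down each column. That neighbor-to-neighbor reuse is what lets the structure keep demand on on-chip memory bandwidth low.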
• Investment Metric: When evaluating new AI chip startups or incumbents, a key metric is the ratio of compute area to data movement area. Chips that can minimize "Mux" (selector) costs and maximize "Multiplier" area are likely to lead in performance-per-watt.
• The Shift to Low Precision: The industry is aggressively moving toward FP4 and lower. Companies that can successfully implement high-accuracy training/inference at these low precisions will have a massive cost advantage due to the quadratic area savings on the physical chip.
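The compute-to-data-movement metric reduces to simple arithmetic once a die-area breakdown is available. The sketch below is hypothetical throughout: the `AreaBudget` categories are one assumed way to split the area, and the two example designs use invented numbers in arbitrary units, not measurements of any real chip.

```python
from dataclasses import dataclass


@dataclass
class AreaBudget:
    """Die-area breakdown in arbitrary units (hypothetical categories)."""
    multipliers: float
    adders: float
    muxes: float
    registers: float
    wires: float

    def compute_area(self) -> float:
        return self.multipliers + self.adders

    def data_movement_area(self) -> float:
        return self.muxes + self.registers + self.wires

    def compute_to_movement_ratio(self) -> float:
        return self.compute_area() / self.data_movement_area()


if __name__ == "__main__":
    # Invented example designs, for illustration only.
    design_a = AreaBudget(multipliers=40, adders=10, muxes=15, registers=20, wires=15)
    design_b = AreaBudget(multipliers=20, adders=5, muxes=25, registers=30, wires=20)
    for name, design in (("A", design_a), ("B", design_b)):
        print(f"design {name}: ratio {design.compute_to_movement_ratio():.2f}")
```

A higher ratio means more of the die is doing math rather than routing operands, which is the property the metric is meant to surface.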

By Dwarkesh Patel
Deeply researched interviews: https://www.dwarkesh.com?utm_medium=podcast