BUILD AI (with examples)

Hyperparameter Sweeps: Configuration vs. Trajectory

Part 7 of Training Loop Engineering (End of Unit 5)

Forest Mars
Apr 28, 2026

The first maps of the American West weren’t wrong because the cartographers were careless. They were wrong because the cartographers had never actually been there. They extrapolated from the coastline, from Indigenous accounts, from the reports of trappers who had crossed one valley and assumed the next one was the same. While the maps were internally consistent, the geography was not. When Lewis and Clark reached the Rockies in 1805, they expected a single ridge, the kind of thing you cross in a day, like the Appalachians. What they found was a 400-mile wall of interlocking ranges, each one requiring the kind of effort that the entire crossing was supposed to take. The map had been drawn by people who’d never even stood at the base of what they were mapping.

Hyperparameter search is cartography. Your loss landscape is the Rockies. The interlocking ranges are the interactions between parameters (learning rate and weight decay, batch size and warmup fraction), each one shifting the optimal value of the next, so that crossing one range puts you at the base of another you hadn’t accounted for. Phase boundaries in training are the ridgelines: the learning rate above which the model diverges, the weight decay below which regularization collapses, the batch size above which the linear scaling rule breaks. The cartographers who drew those early maps of the West had never stood at those ridgelines. Neither has your proxy sweep. Every configuration you evaluate is a survey expedition: expensive, slow, and informative only about the specific terrain you actually crossed. The map you’re drawing with your sweep results is drawn from the outside, from the coordinates of configurations you have evaluated, not from the interior of the space those configurations define. And the interior is, of course, where the optimum lives.
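To make the interlocking-ranges point concrete, here is a minimal sketch on a synthetic surface. Everything in it is an assumption made for illustration: `toy_loss` is a hand-written quadratic in log learning rate and log weight decay, not a measurement from any real training run, and the grids and the default weight decay are arbitrary. It compares a single one-at-a-time pass (tune the learning rate at a default weight decay, then tune weight decay at that learning rate) against a joint grid over both.

```python
import itertools

def toy_loss(log_lr: float, log_wd: float) -> float:
    """Synthetic stand-in for final validation loss (hypothetical, not real data).

    The best log_lr depends on log_wd, so the two parameters interact:
    moving along one axis shifts where the optimum sits on the other.
    """
    best_log_lr = -3.0 - 0.8 * (log_wd + 2.0)  # valley floor tilts with weight decay
    return (log_lr - best_log_lr) ** 2 + 0.1 * (log_wd + 4.0) ** 2

# Log10-spaced grids in steps of 0.2 (arbitrary ranges chosen for the sketch).
LR_GRID = [i / 5 for i in range(-25, 1)]    # -5.0 ... 0.0
WD_GRID = [i / 5 for i in range(-25, -4)]   # -5.0 ... -1.0

def one_at_a_time(default_log_wd: float = -2.0) -> tuple[float, float]:
    """One pass of per-parameter tuning: sweep lr first, then wd at that lr."""
    lr = min(LR_GRID, key=lambda x: toy_loss(x, default_log_wd))
    wd = min(WD_GRID, key=lambda y: toy_loss(lr, y))
    return lr, wd

def joint_grid() -> tuple[float, float]:
    """Evaluate every (lr, wd) pair and keep the best one."""
    return min(itertools.product(LR_GRID, WD_GRID), key=lambda p: toy_loss(*p))

if __name__ == "__main__":
    seq = one_at_a_time()
    joint = joint_grid()
    print("one-at-a-time:", seq, "loss:", round(toy_loss(*seq), 4))
    print("joint grid:   ", joint, "loss:", round(toy_loss(*joint), 4))
```

On this toy surface the one-at-a-time pass settles at (-3.0, -2.2) with a loss of roughly 0.35, while the joint grid finds (-1.4, -4.0) at essentially zero. The first sweep was accurate about the slice it crossed and said nothing about the next range over. The joint grid pays for that knowledge with 546 evaluations against 47, which is the cost-versus-quality tradeoff this post is concerned with.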

This post is about what the map actually looks like, how to draw it efficiently, and what the field has learned about the relationship between the cost of search and the quality of what you find. As the concluding installment of Unit 5, it is also about what changes when you stop treating hyperparameters as a pre-training decision and start treating them as something a running system can adapt. That is the insight population-based training is built on, and it points toward what Unit 6 will address directly: the possibility of not searching at all.

  • The Topology of the Loss Landscape (What You’re Actually Searching)

  • Random Search: Why It Works Better Than It Should

  • Bayesian Optimization: The Surrogate and Its Discontents

  • Population-Based Training: Hyperparameters as Adaptive State

  • PBT Failure Modes and How to Detect Them

  • BOHB: Bayesian Optimization Over the Fidelity Dimension

  • The μTransfer Bridge: When Search Becomes Derivation

  • Full Production Grade Implementation

  • Common Bugs and Their Solutions

© 2026 Forest Mars