Hyperparameter Sweeps: Configuration vs. Trajectory
Part 7 of Training Loop Engineering (End of Unit 5)
The first maps of the American West weren’t wrong because the cartographers were careless. They were wrong because the cartographers had never actually been there. They extrapolated from the coastline, from Indigenous accounts, from the reports of trappers who had crossed one valley and assumed the next one was the same. While the maps were internally consistent, the geography was not. When Lewis and Clark reached the Rockies in 1805, they expected a single ridge, the kind of thing you cross in a day, like the Appalachians. What they found was a 400-mile wall of interlocking ranges, each one requiring the kind of effort that the entire crossing was supposed to take. The map had been drawn by people who’d never even stood at the base of what they were mapping.
Hyperparameter search is cartography. Your loss landscape is the Rockies. The interlocking ranges are the interactions between parameters (learning rate and weight decay, batch size and warmup fraction), each one shifting the optimal value of the next, so that crossing one range puts you at the base of another you hadn’t accounted for. Phase boundaries in training are the ridgelines: the learning rate above which the model diverges, the weight decay below which regularization collapses, the batch size above which the linear scaling rule breaks. The cartographers who drew those early maps of the West had never stood at those ridgelines. Neither has your proxy sweep. Every configuration you evaluate is a survey expedition: expensive, slow, and informative only about the specific terrain you actually crossed. The map you’re drawing with your sweep results is drawn from the outside, from the coordinates of configurations you have evaluated, not from the interior of the space those configurations define. And the interior is, of course, where the optimum lives.
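To make one of those ridgelines concrete, here is a minimal sketch on a toy problem, not a real training run. It assumes plain gradient descent on a one-dimensional quadratic loss with an L2 penalty. The function names and the constants (curvature 10, learning rate 0.19) are illustrative assumptions, not taken from any library or from later parts of this series. In this toy setting the divergence boundary for the learning rate is 2 / (curvature + weight_decay), so changing weight decay moves the ridgeline.

```python
def final_weight(lr, curvature, weight_decay, steps=50, w0=1.0):
    """Plain gradient descent on 0.5 * (curvature + weight_decay) * w**2."""
    w = w0
    for _ in range(steps):
        # Gradient of the quadratic loss plus the gradient of the L2 penalty.
        grad = (curvature + weight_decay) * w
        w -= lr * grad
    return w


def diverges(lr, curvature, weight_decay):
    """True if the weight grew in magnitude instead of shrinking toward zero."""
    return abs(final_weight(lr, curvature, weight_decay)) > 1.0


# Same learning rate, two weight-decay settings (hypothetical values).
for wd in (0.0, 1.0):
    boundary = 2.0 / (10.0 + wd)  # analytic stability boundary for this toy problem
    print(f"wd={wd}: lr boundary ~ {boundary:.4f}, "
          f"lr=0.19 diverges: {diverges(0.19, 10.0, wd)}")
```

A learning rate that is safe at one weight-decay value sits on the wrong side of the boundary at another, which is the sense in which crossing one range puts you at the base of the next. Real loss landscapes are not quadratic, so treat this only as an illustration of the coupling, not as a rule for choosing values.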
This post is about what the map actually looks like, how to draw it efficiently, and what the field has learned about the relationship between the cost of search and the quality of what you find. As the concluding installment of Unit 5, it is also about what changes when you stop treating hyperparameters as a pre-training decision and start treating them as something a running system can adapt. That is the insight population-based training is built on, and it points toward what Unit 6 will address directly: the possibility of not searching at all.
The Topology of the Loss Landscape (What You’re Actually Searching)
Random Search: Why It Works Better Than It Should
Bayesian Optimization: The Surrogate and Its Discontents
Population-Based Training: Hyperparameters as Adaptive State
PBT Failure Modes and How to Detect Them
BOHB: Bayesian Optimization Over the Fidelity Dimension
The μTransfer Bridge: When Search Becomes Derivation
Full Production Grade Implementation
Common Bugs and Their Solutions



