Hyperparameter Tuning: Grid Search, Random Search, and Bayesian Optimization
Every architecture and training choice covered so far (learning rate, batch size, dropout rate, number of layers, regularization strength) is a hyperparameter: a value set before training begins rather than learned by the model. Choosing them well can be the difference between a mediocre model and an excellent one trained on exactly the same data, which is why systematic hyperparameter search is a distinct skill worth developing rather than an afterthought.
What Counts as a Hyperparameter
Anything you set before training, rather than something the model learns through gradient descent:
```python
hyperparameters = {
    "learning_rate": 0.001,
    "batch_size": 64,
    "num_hidden_layers": 3,
    "hidden_layer_size": 128,
    "dropout_rate": 0.3,
    "weight_decay": 0.01,
    "optimizer": "adam",
}
```

Compare this to weights and biases, which the model learns automatically via backpropagation: hyperparameters are the “settings” you configure, while weights are what training actually discovers.
Grid Search: Exhaustive but Expensive
Grid search tries every combination of a predefined set of values for each hyperparameter.
```python
# train_and_evaluate is assumed to train one model with the given settings
# and return a validation score where higher is better.
learning_rates = [0.1, 0.01, 0.001]
batch_sizes = [32, 64, 128]
dropout_rates = [0.2, 0.4]

best_score = 0
best_params = None

for lr in learning_rates:
    for bs in batch_sizes:
        for dr in dropout_rates:
            score = train_and_evaluate(lr=lr, batch_size=bs, dropout=dr)
            if score > best_score:
                best_score = score
                best_params = {"lr": lr, "batch_size": bs, "dropout": dr}
```

This example alone requires 18 separate training runs (3 × 3 × 2), and the number of combinations grows exponentially with each additional hyperparameter. Grid search becomes computationally infeasible very quickly once you’re tuning more than two or three hyperparameters simultaneously.
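To make the growth concrete, here is a small sketch that enumerates the same grid with `itertools.product` and then counts how many runs a grid would need as hyperparameters are added. The `search_space` dictionary and the `grid_size` helper are illustrative names, not part of any library, and the choice of three values per hyperparameter is an assumption for the sake of the arithmetic.

```python
import itertools

# Same values as the nested-loop example, expressed as one search space.
search_space = {
    "lr": [0.1, 0.01, 0.001],
    "batch_size": [32, 64, 128],
    "dropout": [0.2, 0.4],
}

# Every combination, as a list of keyword-argument dictionaries.
grid = [
    dict(zip(search_space, combo))
    for combo in itertools.product(*search_space.values())
]
print(len(grid))  # 18 combinations (3 * 3 * 2)


def grid_size(values_per_hyperparameter, num_hyperparameters):
    """Number of training runs for a full grid (hypothetical helper)."""
    return values_per_hyperparameter ** num_hyperparameters


# With just 3 candidate values each, the run count triples per hyperparameter.
for k in range(1, 7):
    print(k, grid_size(3, k))  # 3, 9, 27, 81, 243, 729
```

Six hyperparameters at three values each already mean 729 full training runs, which is why the nested-loop approach stops being practical so early.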
Random Search: Often More Efficient Than It Sounds
Random search samples random combinations from specified ranges rather than trying every possible combination. Counterintuitively, it is frequently more effective than grid search for the same computational budget.
```python
import random

def sample_hyperparameters():
    return {
        "lr": 10 ** random.uniform(-4, -1),  # log-uniform between 0.0001 and 0.1
        "batch_size": random.choice([32, 64, 128, 256]),
        "dropout": random.uniform(0.1, 0.5),
    }

best_score = 0
best_params = None  # stays None if no trial scores above 0
for trial in range(30):
    params = sample_hyperparameters()
    score = train_and_evaluate(**params)
    if score > best_score:
        best_score = score
        best_params = params
```

The reason random search often outperforms grid search at an equal compute budget is that not every hyperparameter matters equally for a given problem. A grid reuses the same few values of each hyperparameter across many trials, so it spends much of its budget exhaustively varying hyperparameters that barely affect performance. Random search draws a fresh value for every hyperparameter on every trial, so 30 trials test 30 different values of the hyperparameters that do matter.
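The coverage argument can be checked with a few lines of counting. The sketch below compares a 3 × 3 grid against 9 random draws over two hyperparameters and counts how many distinct learning rates each strategy actually tests. The ranges and the fixed seed are assumptions chosen for illustration, and no model is trained.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Grid: 3 learning rates x 3 dropout rates = 9 trials.
grid_trials = [
    (lr, dropout)
    for lr in (1e-1, 1e-2, 1e-3)
    for dropout in (0.2, 0.3, 0.4)
]

# Random: 9 trials, each drawing both values independently.
random_trials = [
    (10 ** rng.uniform(-3, -1), rng.uniform(0.2, 0.4))
    for _ in range(9)
]

# Count how many different learning rates each strategy evaluated.
distinct_grid_lrs = len({lr for lr, _ in grid_trials})
distinct_random_lrs = len({lr for lr, _ in random_trials})

print(f"grid:   {len(grid_trials)} trials, {distinct_grid_lrs} distinct learning rates")
print(f"random: {len(random_trials)} trials, {distinct_random_lrs} distinct learning rates")
```

Both strategies spend 9 trials, but the grid only ever sees 3 learning rates. If dropout turns out to barely matter, two thirds of the grid's runs told you nothing new about the hyperparameter that does.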
Bayesian Optimization: Learning From Previous Trials
Bayesian optimization goes a step further than random search. Instead of sampling blindly, it builds a probabilistic model of how hyperparameters relate to performance from the trials already run, and uses that model to choose which combination to try next.
```python
# Conceptual usage with a library like Optuna
import optuna

def objective(trial):
    lr = trial.suggest_float("lr", 1e-4, 1e-1, log=True)
    batch_size = trial.suggest_categorical("batch_size", [32, 64, 128, 256])
    dropout = trial.suggest_float("dropout", 0.1, 0.5)

    score = train_and_evaluate(lr=lr, batch_size=batch_size, dropout=dropout)
    return score

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=30)

print(study.best_params)
```

Because each new trial’s suggested hyperparameters are informed by all previous trials’ results, Bayesian optimization typically finds strong configurations in meaningfully fewer trials than pure random search, which is a real practical advantage when each individual training run is expensive.
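To show what "a probabilistic model of previous trials" can look like, here is a deliberately tiny, standard-library-only sketch of one flavour of Bayesian optimization: a Gaussian-process surrogate with an upper-confidence-bound rule for picking the next trial. It is not how Optuna works internally (Optuna's default sampler uses a different model, TPE), and everything in it is an assumption made for illustration: `toy_score` stands in for a real training run, the search is over a single value `log10(lr)`, and the kernel length scale and exploration weight are arbitrary.

```python
import math


def toy_score(log_lr):
    """Hypothetical validation score, peaking at log10(lr) = -3.2."""
    return math.exp(-((log_lr + 3.2) ** 2))


def rbf(a, b, length_scale=0.5):
    """Squared-exponential kernel: nearby inputs get similar scores."""
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def posterior(xs, ys, x_new, noise=1e-4):
    """Gaussian-process posterior mean and standard deviation at x_new."""
    gram = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
            for i, a in enumerate(xs)]
    k_new = [rbf(a, x_new) for a in xs]
    offset = sum(ys) / len(ys)  # constant prior mean: average score so far
    weights = solve(gram, [y - offset for y in ys])
    mean = offset + sum(k * w for k, w in zip(k_new, weights))
    variance = 1.0 - sum(k * v for k, v in zip(k_new, solve(gram, k_new)))
    return mean, math.sqrt(max(variance, 0.0))


# Candidate values of log10(lr) from -4 to -1 in steps of 0.05.
candidates = [-4.0 + i / 20 for i in range(61)]

# Two initial trials at the ends of the range.
tried_x = [-4.0, -1.0]
tried_y = [toy_score(x) for x in tried_x]


def upper_confidence_bound(x, kappa=2.0):
    """Predicted score plus a bonus for regions the model is unsure about."""
    mean, std = posterior(tried_x, tried_y, x)
    return mean + kappa * std


for _ in range(8):
    untried = [x for x in candidates if x not in tried_x]
    next_x = max(untried, key=upper_confidence_bound)
    tried_x.append(next_x)
    tried_y.append(toy_score(next_x))

best_y, best_x = max(zip(tried_y, tried_x))
print(f"best log10(lr) = {best_x:.2f}, score = {best_y:.3f}")
```

The key line is the `max(untried, key=upper_confidence_bound)` call: each new trial is chosen by consulting the model fitted to every earlier result, which is the step random search skips entirely.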
Choosing a Search Strategy Given Your Compute Budget
| Situation | Recommended approach |
|---|---|
| Very few hyperparameters (1-2), cheap to train | Grid search is fine and simple |
| More hyperparameters, moderate compute budget | Random search |
| Expensive training runs, need efficiency | Bayesian optimization (Optuna, Ray Tune, or similar) |
| Very large-scale models (LLM pretraining) | Often a mix of established defaults plus targeted, small-scale search on the most impactful hyperparameters |
Which Hyperparameters Usually Matter Most
Not every hyperparameter deserves equal search effort. Learning rate is consistently one of the most impactful, as covered in Learning Rate, typically followed by batch size, model capacity (layers/width), and regularization strength. A practical strategy for a limited search budget is to search learning rate and regularization strength most thoroughly, and to use well-established defaults for less impactful hyperparameters rather than spreading the budget too thin across everything at once.
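One way to express that strategy in code is to split the configuration into values you fix and values you sample. The sketch below is an assumption-laden illustration: the fixed values are borrowed from the example dictionary earlier in this section, while `sample_config` and the two log-uniform ranges are hypothetical choices rather than recommendations.

```python
import random

# Less impactful settings: held at reasonable defaults for every trial.
FIXED_DEFAULTS = {
    "batch_size": 64,
    "num_hidden_layers": 3,
    "hidden_layer_size": 128,
    "dropout_rate": 0.3,
    "optimizer": "adam",
}


def sample_config(rng):
    """Sample only learning rate and weight decay; keep everything else fixed."""
    searched = {
        "learning_rate": 10 ** rng.uniform(-4, -1),  # assumed range
        "weight_decay": 10 ** rng.uniform(-5, -2),   # assumed range
    }
    return {**FIXED_DEFAULTS, **searched}


rng = random.Random(0)
configs = [sample_config(rng) for _ in range(20)]
print(configs[0])
```

All 20 trials go toward a two-dimensional search instead of being diluted across seven dimensions.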
Avoiding a Subtle Leakage Trap During Tuning
Hyperparameter search has its own version of the data leakage problem covered in Dataset Preparation — if you tune hyperparameters against the same validation set repeatedly across dozens or hundreds of trials, you risk implicitly overfitting your hyperparameter choices to that specific validation set, even though no model weights were ever directly trained on it. The reported “best” validation score after extensive search can end up meaningfully optimistic relative to how the final model actually performs on genuinely new data. The standard defense is exactly the same discipline as before: keep a completely separate, untouched test set that’s used only once, at the very end, after hyperparameter search is finished, to report a final, honest performance estimate.
Keeping this discipline throughout the entire search process, not just for the final reported number, is what makes a hyperparameter tuning effort trustworthy rather than subtly self-flattering.
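The optimism is easy to reproduce with a toy simulation. In the sketch below, every configuration is assumed to have the same true accuracy, and each evaluation adds independent Gaussian noise to stand in for the randomness of a finite validation or test set. The numbers (`TRUE_ACCURACY`, `NOISE`, 100 trials) are invented for illustration and nothing is trained.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible

TRUE_ACCURACY = 0.80  # assumed: every configuration is equally good
NOISE = 0.02          # assumed: evaluation noise from a finite data split


def noisy_eval():
    """One evaluation on a finite data split: truth plus sampling noise."""
    return TRUE_ACCURACY + rng.gauss(0, NOISE)


# 100 search trials, each scored once on the same validation set.
val_scores = [noisy_eval() for _ in range(100)]
best_val = max(val_scores)

# The selected configuration, scored once on the untouched test set.
test_score = noisy_eval()

print(f"best validation score: {best_val:.3f}")
print(f"held-out test score:   {test_score:.3f}")
print(f"true accuracy:         {TRUE_ACCURACY:.3f}")
```

Picking the maximum of many noisy scores selects for lucky noise as much as for real quality, so the best validation score sits above the true accuracy even though no configuration is actually better than another. The single test evaluation has no such selection effect, which is why it is the number to report.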
Summary
| Method | How It Chooses the Next Trial | Best For |
|---|---|---|
| Grid search | Every combination, exhaustively | Very few hyperparameters |
| Random search | Random sampling from defined ranges | A practical, simple default for moderate search budgets |
| Bayesian optimization | Informed by results of previous trials | Expensive training runs where efficiency matters most |
Hyperparameter tuning is a systematic search problem, not guesswork or an afterthought, and choosing the right search strategy for your compute budget is itself a decision worth making deliberately.