Hyperparameter Tuning: Grid Search, Random Search, and Bayesian Optimization

Practical strategies for hyperparameter tuning — grid search, random search, and Bayesian optimization — and which to use given your budget.

Every architecture and training choice covered so far, such as learning rate, batch size, dropout rate, number of layers, and regularization strength, is a hyperparameter: a value set before training begins rather than learned by the model itself. Choosing hyperparameters well can be the difference between a mediocre model and an excellent one trained on exactly the same data, which is why systematic hyperparameter search is a distinct skill worth developing rather than an afterthought.


What Counts as a Hyperparameter

Anything you set before training, rather than something the model learns through gradient descent:

hyperparameters = {
    "learning_rate": 0.001,
    "batch_size": 64,
    "num_hidden_layers": 3,
    "hidden_layer_size": 128,
    "dropout_rate": 0.3,
    "weight_decay": 0.01,
    "optimizer": "adam",
}

Contrast this with weights and biases, which the model learns during training via backpropagation: hyperparameters are the “settings” you configure, while weights are what training actually discovers.
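To make the distinction concrete, here is a minimal sketch in plain Python, assuming a toy one-variable linear model fit by gradient descent. The data, the variable names, and the specific values are illustrative assumptions, not part of the original lesson. The learning rate and epoch count are fixed by hand, while w and b are updated by training.

```python
# Hyperparameters: chosen by hand before training starts (illustrative values).
learning_rate = 0.05
num_epochs = 500

# Toy data generated from y = 2x + 1, so the ideal learned values are w=2, b=1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]

# Learned parameters: start at arbitrary values and are updated by training.
w, b = 0.0, 0.0

for _ in range(num_epochs):
    # Gradients of the mean squared error with respect to w and b.
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / len(xs)
    # The hyperparameter learning_rate controls the step size; it is never updated.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(f"learned w={w:.3f}, b={b:.3f}")
```

Changing learning_rate or num_epochs changes how well (or whether) w and b converge, which is exactly what a hyperparameter search explores.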


Grid Search: Exhaustive but Expensive

Grid search tries every combination of a predefined set of values for each hyperparameter.

learning_rates = [0.1, 0.01, 0.001]
batch_sizes = [32, 64, 128]
dropout_rates = [0.2, 0.4]

best_score = float("-inf")
best_params = None

# train_and_evaluate is a placeholder: it trains one model and returns a score (higher is better).
for lr in learning_rates:
    for bs in batch_sizes:
        for dr in dropout_rates:
            score = train_and_evaluate(lr=lr, batch_size=bs, dropout=dr)
            if score > best_score:
                best_score = score
                best_params = {"lr": lr, "batch_size": bs, "dropout": dr}

This example alone requires 18 separate training runs (3 × 3 × 2). The total is the product of the number of values tried for each hyperparameter, so it grows exponentially with the number of hyperparameters, and grid search quickly becomes computationally infeasible once you’re tuning more than two or three of them simultaneously.
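The 3 × 3 × 2 count can be checked, and its growth made visible, with a short sketch. The helper name grid_size and the two extra hyperparameters added at the end are assumptions chosen for illustration.

```python
from itertools import product
from math import prod


def grid_size(search_space):
    """Number of training runs a full grid search over search_space requires."""
    return prod(len(values) for values in search_space.values())


search_space = {
    "lr": [0.1, 0.01, 0.001],
    "batch_size": [32, 64, 128],
    "dropout": [0.2, 0.4],
}
size_three_hparams = grid_size(search_space)
print(size_three_hparams)  # 18

# Two hypothetical extra hyperparameters with three values each.
search_space["weight_decay"] = [0.0, 0.01, 0.1]
search_space["hidden_layer_size"] = [64, 128, 256]
size_five_hparams = grid_size(search_space)
print(size_five_hparams)  # 162

# itertools.product enumerates the same combinations that nested loops would visit.
combos = list(product(*search_space.values()))
```

Adding just two more three-valued hyperparameters multiplies the cost by nine, which is why grids are usually reserved for very small search spaces.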


Random Search: Often More Efficient Than It Sounds

Random search samples random combinations from specified ranges rather than trying every possible combination. Counterintuitively, it’s frequently more effective than grid search for the same computational budget.

import random

def sample_hyperparameters():
    return {
        "lr": 10 ** random.uniform(-4, -1),  # log-uniform between 0.0001 and 0.1
        "batch_size": random.choice([32, 64, 128, 256]),
        "dropout": random.uniform(0.1, 0.5),
    }

best_score = float("-inf")
best_params = None  # initialized so it is defined even if no trial improves the score

for trial in range(30):
    params = sample_hyperparameters()
    score = train_and_evaluate(**params)
    if score > best_score:
        best_score = score
        best_params = params

Random search often outperforms grid search at an equal compute budget because hyperparameters rarely matter equally for a given problem. Grid search spends many trials varying a hyperparameter that barely affects performance while re-testing the same few values of the ones that do: the 18-run grid above tries only three distinct learning rates. Random search draws every hyperparameter independently on each trial, so the same 18 trials would cover 18 distinct learning rates, exploring the important hyperparameters far more thoroughly for the same number of total trials.
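A small sketch can illustrate this coverage argument by counting how many distinct learning rates each method tries in 18 trials. It reuses the grid values from the grid search example; the random ranges are assumptions chosen to span the same region.

```python
import random
from itertools import product

rng = random.Random(0)  # seeded only so the sketch is reproducible

# Grid search: 3 x 3 x 2 = 18 trials over fixed values.
grid_trials = list(product([0.1, 0.01, 0.001], [32, 64, 128], [0.2, 0.4]))

# Random search: the same number of trials, each hyperparameter drawn independently.
random_trials = [
    (10 ** rng.uniform(-3, -1), rng.choice([32, 64, 128]), rng.uniform(0.2, 0.4))
    for _ in range(len(grid_trials))
]

distinct_grid_lrs = len({lr for lr, _, _ in grid_trials})
distinct_random_lrs = len({lr for lr, _, _ in random_trials})

print(distinct_grid_lrs)    # 3
print(distinct_random_lrs)  # 18 (continuous draws, so repeats are practically impossible)
```

If learning rate turns out to be the hyperparameter that matters, the random trials have probed it at six times as many points for the same cost.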


Bayesian Optimization: Learning From Previous Trials

Bayesian optimization goes a step further than random search. Instead of sampling blindly, it builds a probabilistic model of how hyperparameters relate to performance from the trials already run, and uses that model to choose which combination to try next.

# Conceptual usage with a library like Optuna
import optuna

def objective(trial):
    lr = trial.suggest_float("lr", 1e-4, 1e-1, log=True)
    batch_size = trial.suggest_categorical("batch_size", [32, 64, 128, 256])
    dropout = trial.suggest_float("dropout", 0.1, 0.5)
    score = train_and_evaluate(lr=lr, batch_size=batch_size, dropout=dropout)
    return score

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=30)
print(study.best_params)

Because each new suggestion is informed by the results of all previous trials, Bayesian optimization typically finds strong configurations in fewer trials than pure random search, a practical advantage when each individual training run is expensive.
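To give a feel for how a model of past trials can pick the next one, the sketch below scores candidates with expected improvement, one common acquisition function. It assumes the surrogate model gives a Gaussian prediction (mean and standard deviation) for each candidate; the candidate names and numbers are hypothetical, and this is not a description of how Optuna works internally (its default sampler, TPE, uses a different model).

```python
import math


def expected_improvement(mean, std, best_so_far):
    """Expected improvement over best_so_far when maximizing a score,
    assuming the surrogate's prediction is Gaussian with this mean and std."""
    if std <= 0.0:
        return max(mean - best_so_far, 0.0)
    z = (mean - best_so_far) / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (mean - best_so_far) * cdf + std * pdf


# Hypothetical surrogate predictions: (predicted score, uncertainty).
candidates = {
    "A": (0.90, 0.01),  # close to the best so far, but the model is confident
    "B": (0.88, 0.05),  # lower prediction, but the model is very unsure
    "C": (0.80, 0.01),  # confidently poor
}
best_so_far = 0.91

scores = {
    name: expected_improvement(mean, std, best_so_far)
    for name, (mean, std) in candidates.items()
}
next_trial = max(scores, key=scores.get)
print(next_trial)  # B: its high uncertainty outweighs its lower predicted score
```

The point of the sketch is the trade-off: the next trial goes to a region that is either predicted to be good or still uncertain enough that it might be, rather than to a uniformly random point.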


Choosing a Search Strategy Given Your Compute Budget

| Situation | Recommended approach |
| --- | --- |
| Very few hyperparameters (1-2), cheap to train | Grid search is fine and simple |
| More hyperparameters, moderate compute budget | Random search |
| Expensive training runs, need efficiency | Bayesian optimization (Optuna, Ray Tune, or similar) |
| Very large-scale models (LLM pretraining) | Often a mix of established defaults plus targeted, small-scale search on the most impactful hyperparameters |

Which Hyperparameters Usually Matter Most

Not every hyperparameter deserves equal search effort. Learning rate is consistently one of the most impactful, as covered in Learning Rate, typically followed by batch size, model capacity (layers/width), and regularization strength. A practical strategy for a limited search budget: search learning rate and regularization strength most thoroughly, and use well-established defaults for less impactful hyperparameters rather than spreading the budget too thin across everything at once.
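One way to express that strategy in code is to hold most settings at defaults and sample only the two prioritized hyperparameters. In this sketch the defaults are taken from the example dictionary earlier in this lesson; the function name sample_focused_config and the weight decay range are assumptions for illustration.

```python
import random

# Held fixed at defaults (values from the earlier example dictionary).
DEFAULTS = {
    "batch_size": 64,
    "num_hidden_layers": 3,
    "hidden_layer_size": 128,
    "dropout_rate": 0.3,
    "optimizer": "adam",
}


def sample_focused_config(rng):
    """Sample only learning rate and weight decay; keep everything else at defaults."""
    config = dict(DEFAULTS)
    config["learning_rate"] = 10 ** rng.uniform(-4, -1)  # log-uniform, 1e-4 to 1e-1
    config["weight_decay"] = 10 ** rng.uniform(-5, -1)   # assumed range, log-uniform
    return config


rng = random.Random(0)
configs = [sample_focused_config(rng) for _ in range(20)]
print(configs[0])
```

All 20 trials then go toward two dimensions instead of seven, so each of those two is covered much more densely.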

Avoiding a Subtle Leakage Trap During Tuning

Hyperparameter search has its own version of the data leakage problem covered in Dataset Preparation. If you tune hyperparameters against the same validation set across dozens or hundreds of trials, you risk implicitly overfitting your hyperparameter choices to that specific validation set, even though no model weights were ever trained on it directly. The reported “best” validation score after an extensive search can therefore be optimistic relative to how the final model performs on new data. The standard defense is the same discipline as before: keep a completely separate, untouched test set that is used only once, at the very end, after the hyperparameter search is finished, to report a final, honest performance estimate.

Keeping this discipline throughout the entire search process, not just for the final reported number, is what makes a hyperparameter tuning effort trustworthy rather than subtly self-flattering.
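The optimism of a "best" validation score can be shown with a small simulation. This is a sketch under deliberately artificial assumptions: every configuration has the same true accuracy (0.80), and validation and test scores differ from it only by independent Gaussian noise. The function name and all numbers are hypothetical.

```python
import random


def simulate_search(num_trials, rng, true_accuracy=0.80, noise=0.02):
    """Pick the config with the best validation score when all configs are equally good."""
    best_val, test_of_best = float("-inf"), None
    for _ in range(num_trials):
        val_score = true_accuracy + rng.gauss(0.0, noise)
        # Drawn for every trial for convenience; only the selected config's value is used,
        # mirroring a test set that is consulted once after the search.
        test_score = true_accuracy + rng.gauss(0.0, noise)
        if val_score > best_val:
            best_val, test_of_best = val_score, test_score
    return best_val, test_of_best


rng = random.Random(0)
results = [simulate_search(200, rng) for _ in range(100)]
avg_best_val = sum(val for val, _ in results) / len(results)
avg_test = sum(test for _, test in results) / len(results)

print(f"average best validation score: {avg_best_val:.3f}")
print(f"average test score of the selected config: {avg_test:.3f}")
```

Even though no configuration is actually better than any other, the winning validation score sits well above 0.80, while the held-out test score of the same configuration stays close to it. That gap is the selection effect an untouched test set protects against.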

Summary

| Method | How It Chooses the Next Trial | Best For |
| --- | --- | --- |
| Grid search | Every combination, exhaustively | Very few hyperparameters |
| Random search | Random sampling from defined ranges | A practical, simple default for moderate search budgets |
| Bayesian optimization | Informed by results of previous trials | Expensive training runs where efficiency matters most |

Hyperparameter tuning isn’t guesswork or an afterthought. It’s a systematic search problem, and choosing the right search strategy for your compute budget is itself a decision worth making deliberately.