Hyperparameter Tuning: Grid Search, Random Search, and Bayesian Optimization

Every architecture and training choice covered so far (learning rate, batch size, dropout rate, number of layers, regularization strength) is a hyperparameter: a value set before training begins rather than learned by the model. Choosing these values well can separate a mediocre model from an excellent one trained on the same data and architecture, which is why systematic hyperparameter search is a distinct skill worth developing rather than an afterthought.

What Counts as a Hyperparameter

Anything you set before training, rather than something the model learns through gradient descent:

hyperparameters = {
    "learning_rate": 0.001,
    "batch_size": 64,
    "num_hidden_layers": 3,
    "hidden_layer_size": 128,
    "dropout_rate": 0.3,
    "weight_decay": 0.01,
    "optimizer": "adam",
}

Compare this to weights and biases, which the model learns automatically via backpropagation — hyperparameters are the “settings” you configure, weights are what training actually discovers.

Grid Search: Exhaustive but Expensive

Grid search tries every combination of a predefined set of values for each hyperparameter.

learning_rates = [0.1, 0.01, 0.001]
batch_sizes = [32, 64, 128]
dropout_rates = [0.2, 0.4]

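# Assumes train_and_evaluate returns a score where higher is better.
best_score = float("-inf")  # still works if every score is zero or negative
best_params = None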

for lr in learning_rates:
    for bs in batch_sizes:
        for dr in dropout_rates:
            score = train_and_evaluate(lr=lr, batch_size=bs, dropout=dr)
            if score > best_score:
                best_score = score
                best_params = {"lr": lr, "batch_size": bs, "dropout": dr}

This example alone requires 18 separate training runs (3 × 3 × 2). Each additional hyperparameter multiplies the total by its number of candidate values, so the grid grows exponentially with the number of hyperparameters. Grid search quickly becomes computationally infeasible once you are tuning more than two or three hyperparameters at once.
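The multiplication above can be sketched with the standard library. This is a minimal illustration, not part of the original example: the `search_space` dictionary, the helper names `grid_size` and `iter_grid`, and the extra `weight_decay` and `hidden_layer_size` value lists are all assumptions chosen for demonstration.

```python
import itertools
import math

# Same three hyperparameters and candidate values as the nested loops above.
search_space = {
    "lr": [0.1, 0.01, 0.001],
    "batch_size": [32, 64, 128],
    "dropout": [0.2, 0.4],
}

def grid_size(space):
    """Number of training runs a full grid over `space` would need."""
    return math.prod(len(values) for values in space.values())

def iter_grid(space):
    """Yield every combination as a dict, equivalent to nested for-loops."""
    names = list(space)
    for combo in itertools.product(*space.values()):
        yield dict(zip(names, combo))

print(grid_size(search_space))  # 18

# Hypothetical extension: two more hyperparameters with a few values each.
bigger_space = {
    **search_space,
    "weight_decay": [0.0, 1e-4, 1e-3, 1e-2],
    "hidden_layer_size": [64, 128, 256],
}
print(grid_size(bigger_space))  # 216 = 18 * 4 * 3
```

Using `itertools.product` also removes the need to add one more nested loop per hyperparameter, which keeps the search code unchanged as the space grows, even though the run count does not stay manageable.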

Random Search: Often More Efficient Than It Sounds

Random search samples random combinations from specified ranges rather than trying every possible combination — and counterintuitively, it’s frequently more effective than grid search for the same computational budget.

import random

def sample_hyperparameters():
    return {
        "lr": 10 ** random.uniform(-4, -1),        # log-uniform between 0.0001 and 0.1
        "batch_size": random.choice([32, 64, 128, 256]),
        "dropout": random.uniform(0.1, 0.5),
    }

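best_score = float("-inf")  # assumes higher scores are better
best_params = None          # defined up front so it exists even if no trial improves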
for trial in range(30):
    params = sample_hyperparameters()
    score = train_and_evaluate(**params)
    if score > best_score:
        best_score = score
        best_params = params

Random search often outperforms grid search at an equal compute budget because hyperparameters rarely matter equally for a given problem. A grid spends many trials varying a hyperparameter that barely affects performance while testing only a handful of distinct values of the ones that do: the 18-run grid above tries just three learning rates. Random search samples each hyperparameter independently, so 18 random trials try 18 different learning rates, covering the important dimensions far more densely for the same number of runs.
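A small sketch can make the coverage argument concrete by counting how many distinct learning rates each strategy tries within the same 18-trial budget. The sampling ranges below are assumptions chosen to span the same values as the grid, and no model is trained here; only the sampled configurations are compared.

```python
import itertools
import random

learning_rates = [0.1, 0.01, 0.001]
batch_sizes = [32, 64, 128]
dropout_rates = [0.2, 0.4]

# Grid search: every combination of the listed values.
grid_trials = list(itertools.product(learning_rates, batch_sizes, dropout_rates))

# Random search with the same budget, sampling each hyperparameter independently.
rng = random.Random(0)  # fixed seed so the sketch is reproducible
random_trials = [
    (
        10 ** rng.uniform(-3, -1),   # log-uniform learning rate in [0.001, 0.1]
        rng.choice(batch_sizes),
        rng.uniform(0.2, 0.4),
    )
    for _ in range(len(grid_trials))
]

grid_distinct_lrs = len({lr for lr, _, _ in grid_trials})
random_distinct_lrs = len({lr for lr, _, _ in random_trials})

print(len(grid_trials))      # 18 trials in both cases
print(grid_distinct_lrs)     # 3 distinct learning rates from the grid
print(random_distinct_lrs)   # continuous samples are almost surely all distinct
```

If learning rate turns out to be the hyperparameter that matters most, the grid has effectively run a three-point search over it, while the random trials have probed it at many more points.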

Bayesian Optimization: Learning From Previous Trials

Bayesian optimization goes a step further than random search — instead of sampling blindly, it builds a probabilistic model of how hyperparameters relate to performance based on trials already run, and uses that model to intelligently choose which combination to try next.

# Conceptual usage with a library like Optuna
import optuna

def objective(trial):
    lr = trial.suggest_float("lr", 1e-4, 1e-1, log=True)
    batch_size = trial.suggest_categorical("batch_size", [32, 64, 128, 256])
    dropout = trial.suggest_float("dropout", 0.1, 0.5)

    score = train_and_evaluate(lr=lr, batch_size=batch_size, dropout=dropout)
    return score

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=30)

print(study.best_params)

Because each new trial’s suggested hyperparameters are informed by all previous trials’ results, Bayesian optimization typically finds strong configurations in meaningfully fewer trials than pure random search — a genuine practical advantage when each individual training run is expensive.
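To show what "informed by previous trials" can look like internally, here is a deliberately simplified, standard-library-only sketch in the spirit of the tree-structured Parzen estimator (TPE), the family of sampler Optuna uses by default. It is an illustration under stated assumptions, not Optuna's implementation: the one-dimensional search over log10 of the learning rate, the `toy_objective` stand-in for a validation score, the fixed kernel bandwidth, and the "top quarter counts as good" split are all hypothetical choices.

```python
import math
import random

def toy_objective(log_lr):
    # Hypothetical stand-in for a validation score; peaks at log10(lr) = -2.5.
    return -(log_lr + 2.5) ** 2

def density(x, points, bandwidth=0.3):
    # Kernel density estimate up to a constant factor that is identical for
    # both point sets, so it cancels in the ratio used below.
    return sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points) / len(points)

def suggest(history, rng, low=-4.0, high=-1.0, n_startup=8, n_candidates=24):
    # Start with plain random search until there is something to model.
    if len(history) < n_startup:
        return rng.uniform(low, high)
    # Split past trials into "good" (top quarter by score) and "bad" (the rest).
    ranked = sorted(history, key=lambda trial: trial[1], reverse=True)
    n_good = max(1, len(ranked) // 4)
    good = [x for x, _ in ranked[:n_good]]
    bad = [x for x, _ in ranked[n_good:]]
    # Draw candidates near good trials, clipped to the search range.
    candidates = [
        min(high, max(low, rng.gauss(rng.choice(good), 0.3)))
        for _ in range(n_candidates)
    ]
    # Prefer the candidate that looks most like a good trial and least like a bad one.
    return max(candidates, key=lambda x: density(x, good) / (density(x, bad) + 1e-12))

def surrogate_search(n_trials=30, seed=0):
    rng = random.Random(seed)
    history = []  # list of (log_lr, score) pairs from completed trials
    for _ in range(n_trials):
        log_lr = suggest(history, rng)
        history.append((log_lr, toy_objective(log_lr)))
    return max(history, key=lambda trial: trial[1])

best_log_lr, best_score = surrogate_search()
print(f"best lr ~ {10 ** best_log_lr:.5f}, score {best_score:.4f}")
```

The key difference from random search is inside `suggest`: once a few trials exist, new suggestions concentrate where past results were strong. Production libraries add much more (multiple dimensions, categorical parameters, adaptive bandwidths, pruning), which is a good reason to use one rather than a hand-rolled loop.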

Choosing a Search Strategy Given Your Compute Budget

Situation	Recommended approach
Very few hyperparameters (1-2), cheap to train	Grid search is fine and simple
More hyperparameters, moderate compute budget	Random search
Expensive training runs, need efficiency	Bayesian optimization (Optuna, Ray Tune, or similar)
Very large-scale models (LLM pretraining)	Often a mix of established defaults plus targeted, small-scale search on the most impactful hyperparameters

Which Hyperparameters Usually Matter Most

Not every hyperparameter deserves equal search effort. Learning rate is consistently one of the most impactful, as covered in Learning Rate, typically followed by batch size, model capacity (layers/width), and regularization strength. A practical strategy for a limited budget: search learning rate and regularization strength most thoroughly, and keep reasonable, well-established defaults for the less impactful hyperparameters rather than spreading the budget thinly across everything at once.
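One way to apply this prioritization is to sample only the high-impact hyperparameters and merge them into a fixed set of defaults. The sketch below assumes the defaults from the example dictionary earlier in this guide (they are illustrative, not recommendations), and the function name `sample_focused` and the weight decay range are hypothetical.

```python
import random

# Fixed values for hyperparameters that are not being searched.
DEFAULTS = {
    "batch_size": 64,
    "num_hidden_layers": 3,
    "hidden_layer_size": 128,
    "dropout_rate": 0.3,
    "optimizer": "adam",
}

def sample_focused(rng):
    """Sample learning rate and weight decay; keep everything else at defaults."""
    params = dict(DEFAULTS)  # copy so the defaults are never mutated
    params["learning_rate"] = 10 ** rng.uniform(-4, -1)   # log-uniform, 1e-4 to 1e-1
    params["weight_decay"] = 10 ** rng.uniform(-5, -1)    # log-uniform, 1e-5 to 1e-1
    return params

rng = random.Random(0)
for _ in range(3):
    print(sample_focused(rng))
```

With a budget of 30 trials, every one of them now explores the two dimensions assumed to matter most, instead of diluting the same 30 trials across seven.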

Avoiding a Subtle Leakage Trap During Tuning

Hyperparameter search has its own version of the data leakage problem covered in Dataset Preparation — if you tune hyperparameters against the same validation set repeatedly across dozens or hundreds of trials, you risk implicitly overfitting your hyperparameter choices to that specific validation set, even though no model weights were ever directly trained on it. The reported “best” validation score after extensive search can end up meaningfully optimistic relative to how the final model actually performs on genuinely new data. The standard defense is exactly the same discipline as before: keep a completely separate, untouched test set that’s used only once, at the very end, after hyperparameter search is finished, to report a final, honest performance estimate.

Keeping this discipline throughout the entire search process, not just for the final reported number, is what makes a hyperparameter tuning effort trustworthy rather than subtly self-flattering.
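The optimism described above can be reproduced with a small simulation that involves no real model. In this hypothetical setup, every "configuration" has the same true accuracy of 50%, so any difference between validation scores is pure noise; the function names, trial counts, and dataset sizes are assumptions chosen for illustration.

```python
import random

def noisy_accuracy(rng, true_accuracy, n_examples):
    # Measured accuracy on a finite dataset: each example is correct
    # independently with probability `true_accuracy`.
    correct = sum(rng.random() < true_accuracy for _ in range(n_examples))
    return correct / n_examples

def selection_gap(rng, n_trials=100, n_examples=200, true_accuracy=0.5):
    # Score every trial on the same-sized validation set and keep the best.
    val_scores = [noisy_accuracy(rng, true_accuracy, n_examples) for _ in range(n_trials)]
    best_val = max(val_scores)
    # Score the selected configuration once on a fresh, untouched test set.
    test_score = noisy_accuracy(rng, true_accuracy, n_examples)
    return best_val, test_score

def mean_gap(n_repeats=50, seed=0):
    # Average (best validation score - test score) over repeated searches.
    rng = random.Random(seed)
    gaps = []
    for _ in range(n_repeats):
        best_val, test_score = selection_gap(rng)
        gaps.append(best_val - test_score)
    return sum(gaps) / len(gaps)

print(f"average optimism of the best validation score: {mean_gap():.3f}")
```

Because the winner is chosen for having the highest validation score, that score is biased upward even though no configuration is actually better than another. The single test evaluation is not part of the selection, so it stays an honest estimate.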

Summary

Method	How It Chooses the Next Trial	Best For
Grid search	Every combination, exhaustively	Very few hyperparameters
Random search	Random sampling from defined ranges	A practical, simple default for moderate search budgets
Bayesian optimization	Informed by results of previous trials	Expensive training runs where efficiency matters most

Hyperparameter tuning isn’t guesswork or an afterthought. It is a systematic search problem, and matching the search strategy to your compute budget is itself a decision worth making deliberately.

Written by NPBlue Engineering Team: practitioners who write every guide from hands-on production experience, not paraphrased documentation.

Reviewed for technical accuracy.