Generative Models Explained: Autoencoders, VAEs, GANs, and Diffusion Models
Every architecture covered so far in this series is fundamentally discriminative — given an input, predict something about it (a class, a value, the next token). Generative models flip the direction: given nothing, or given random noise, produce entirely new, realistic data. Four major families dominate this space, each with a genuinely different underlying mechanism and a distinct set of tradeoffs worth understanding before choosing between them.
Autoencoders: The Foundation
An autoencoder (introduced in Unsupervised Learning) compresses input data through a bottleneck and reconstructs it. That is useful for representation learning, but it is not naturally generative on its own: the bottleneck’s compressed space isn’t structured in a way that supports sampling meaningful new data from it.
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 32))
        self.decoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 784))

    def forward(self, x):
        latent = self.encoder(x)
        reconstruction = self.decoder(latent)
        return reconstruction

Sampling a random point in the 32-dimensional latent space and decoding it typically produces garbage. A plain autoencoder’s latent space has no guarantee of being smooth or well-structured, since it was only ever trained to reconstruct specific input examples, not to support meaningful interpolation or sampling.
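The point about reconstruction-only training can be sketched as follows. This is a minimal illustration, not a full training script: the random stand-in batch, the Adam optimizer, the learning rate, and the step count are all assumptions made for the example, and the encoder and decoder are written inline with the same layer sizes as the Autoencoder above so the block runs on its own.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Same layer sizes as the Autoencoder above, written inline for self-containment.
encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 32))
decoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 784))

params = list(encoder.parameters()) + list(decoder.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)
loss_fn = nn.MSELoss()

# Stand-in batch: random values in [0, 1) in place of real flattened images (assumption).
batch = torch.rand(64, 784)

losses = []
for _ in range(20):
    reconstruction = decoder(encoder(batch))
    loss = loss_fn(reconstruction, batch)  # the only training signal: reconstruction error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

# Decoding an arbitrary latent point runs without error, but nothing in the
# loss above ever constrained what such a point should decode to.
with torch.no_grad():
    random_latent = torch.randn(1, 32)
    decoded = decoder(random_latent)
```

Note that the loss only ever sees latents produced by the encoder from real inputs; points elsewhere in the 32-dimensional space are never visited during training.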
Variational Autoencoders (VAEs): Making the Latent Space Generative
A VAE fixes exactly this problem by forcing the latent space to follow a known, well-behaved distribution (typically Gaussian, covered in Probability Distributions). The encoder outputs a mean and a variance rather than a single point (in practice usually a log-variance, which is unconstrained and numerically easier to work with), and a point is sampled from that distribution before being decoded.
import torch

class VAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU())
        self.mean_layer = nn.Linear(128, 32)
        self.logvar_layer = nn.Linear(128, 32)
        self.decoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 784))

    def forward(self, x):
        h = self.encoder(x)
        mean, logvar = self.mean_layer(h), self.logvar_layer(h)
        std = torch.exp(0.5 * logvar)
        z = mean + std * torch.randn_like(std)  # the "reparameterization trick" for sampling
        return self.decoder(z), mean, logvar

Training a VAE combines a reconstruction loss (how well the decoded output matches the input) with a KL divergence term (covered in Information Theory) that measures how close the learned latent distribution is to a standard Gaussian. This second term is exactly what makes the latent space smooth and sample-able, since it explicitly penalizes the encoder for deviating too far from a well-behaved, known distribution.
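The two-part loss described above might be written as follows. This is a sketch under stated assumptions: the function name `vae_loss` is hypothetical, mean squared error is chosen for the reconstruction term (binary cross-entropy is another common choice for pixel data), and the KL term uses the standard closed form for a diagonal Gaussian measured against a standard Gaussian.

```python
import torch
import torch.nn.functional as F

def vae_loss(reconstruction, target, mean, logvar):
    """Reconstruction error plus KL(N(mean, exp(logvar)) || N(0, I)), summed over the batch."""
    # Reconstruction term: MSE is an assumption made for this sketch.
    recon = F.mse_loss(reconstruction, target, reduction="sum")
    # Closed-form KL divergence for a diagonal Gaussian against a standard
    # Gaussian, summed over latent dimensions and batch elements.
    kl = -0.5 * torch.sum(1 + logvar - mean.pow(2) - logvar.exp())
    return recon + kl, recon, kl

# If the encoder already outputs a standard Gaussian (mean 0, log-variance 0),
# the KL penalty vanishes; shifting the mean away from 0 makes it positive.
x = torch.rand(4, 784)
logvar = torch.zeros(4, 32)
total, recon, kl = vae_loss(x, x, torch.zeros(4, 32), logvar)
_, _, kl_shifted = vae_loss(x, x, torch.ones(4, 32), logvar)
```

With a loss of this form, generation after training amounts to drawing `z` from a standard Gaussian and passing it through the decoder, which is the region of latent space the KL term pushed the encoder toward.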
Generative Adversarial Networks (GANs): Two Networks Competing
A GAN takes a fundamentally different approach — a generator network learns to produce fake data, while a discriminator network learns to distinguish real data from the generator’s fakes, and the two are trained simultaneously in direct competition.
class Generator(nn.Module):
    def __init__(self, noise_dim, output_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(noise_dim, 128), nn.ReLU(),
            nn.Linear(128, output_dim), nn.Tanh()
        )

    def forward(self, noise):
        return self.net(noise)

class Discriminator(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, 1), nn.Sigmoid()
        )

    def forward(self, x):
        return self.net(x)  # probability that x is real

The generator improves by learning to fool the discriminator; the discriminator improves by learning to catch the generator’s mistakes. This adversarial dynamic, when it trains stably, can produce remarkably realistic outputs. GANs are notoriously difficult to train stably, however: a discriminator that becomes too good too quickly provides the generator with no useful gradient signal to improve from, a well-documented practical challenge specific to this architecture family.
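One round of the adversarial training described above could look like the sketch below. Everything numeric here is an assumption for illustration (noise dimension, batch size, Adam with a learning rate of 2e-4, random stand-in "real" data), and the two networks are written inline with the same layer layout as the Generator and Discriminator classes above. The generator step uses the commonly used non-saturating form, where the generator is trained to make the discriminator output "real" for its fakes.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
noise_dim, data_dim, batch_size = 16, 784, 32

# Inline stand-ins with the same layer layout as the classes above.
generator = nn.Sequential(nn.Linear(noise_dim, 128), nn.ReLU(),
                          nn.Linear(128, data_dim), nn.Tanh())
discriminator = nn.Sequential(nn.Linear(data_dim, 128), nn.ReLU(),
                              nn.Linear(128, 1), nn.Sigmoid())

g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCELoss()

real = torch.rand(batch_size, data_dim) * 2 - 1  # stand-in "real" data in [-1, 1]
ones = torch.ones(batch_size, 1)
zeros = torch.zeros(batch_size, 1)

# 1) Discriminator step: push D(real) toward 1 and D(fake) toward 0.
#    detach() keeps this step from updating the generator.
fake = generator(torch.randn(batch_size, noise_dim))
d_loss = bce(discriminator(real), ones) + bce(discriminator(fake.detach()), zeros)
d_opt.zero_grad()
d_loss.backward()
d_opt.step()

# 2) Generator step: push D(fake) toward 1, so the gradient flows
#    through the discriminator back into the generator's weights.
g_loss = bce(discriminator(fake), ones)
g_opt.zero_grad()
g_loss.backward()
g_opt.step()
```

In a real training loop these two steps alternate over many batches, and keeping the two networks roughly balanced is where most of the practical difficulty lies.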
Diffusion Models: Learning to Reverse Noise
Diffusion models take yet another approach — they learn to reverse a gradual noising process. During training, real data is progressively corrupted with noise over many steps; the model learns to predict and remove that noise, step by step. Generation then starts from pure random noise and iteratively denoises it, guided by what the model learned.
# Conceptual training objective: predict the noise that was added at a given step
noisy_image = original_image + noise_schedule[step] * random_noise
predicted_noise = model(noisy_image, step)
loss = mse(predicted_noise, random_noise)

# Conceptual generation: start from pure noise, iteratively denoise
image = random_noise
for step in reversed(range(num_steps)):
    predicted_noise = model(image, step)
    image = denoise_step(image, predicted_noise, step)

Diffusion models have become the dominant approach for high-quality image generation, largely because their training is considerably more stable than GANs’ adversarial dynamic: there’s no second competing network to destabilize training. The cost is that generation typically requires many iterative denoising steps, making it slower than a GAN’s single forward pass.
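The conceptual training objective above can be made concrete with a small runnable sketch. It follows one widely used formulation (the DDPM-style forward process), in which the original signal is scaled down as noise is scaled up, rather than noise simply being added on top. The linear schedule values, the 1000-step count, the random stand-in data, and the tiny `model` that receives the step as one extra input feature are all assumptions for illustration; real systems use much larger networks and richer step embeddings.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
num_steps = 1000

# Linear variance schedule (an assumed, commonly used choice).
betas = torch.linspace(1e-4, 0.02, num_steps)
# Cumulative product: how much of the original signal's variance survives at each step.
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def add_noise(x0, step, noise):
    """Jump directly to the noised sample at `step` using the closed-form forward process."""
    signal_scale = alphas_cumprod[step].sqrt().unsqueeze(-1)
    noise_scale = (1.0 - alphas_cumprod[step]).sqrt().unsqueeze(-1)
    return signal_scale * x0 + noise_scale * noise

# Hypothetical tiny noise predictor: the normalized step is appended as one input feature.
model = nn.Sequential(nn.Linear(784 + 1, 128), nn.ReLU(), nn.Linear(128, 784))

x0 = torch.rand(8, 784) * 2 - 1            # stand-in data in [-1, 1]
step = torch.randint(0, num_steps, (8,))   # one random step per example
noise = torch.randn_like(x0)
noisy = add_noise(x0, step, noise)

step_feature = (step.float() / num_steps).unsqueeze(-1)
predicted_noise = model(torch.cat([noisy, step_feature], dim=-1))
loss = nn.functional.mse_loss(predicted_noise, noise)  # predict the noise that was added
loss.backward()
```

Because each training example picks its own random step, a single batch teaches the model about many noise levels at once, and there is only one network and one loss to optimize.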
Comparing the Four Approaches
| Model | Training stability | Generation speed | Typical output quality |
|---|---|---|---|
| Autoencoder | Very stable | Fast | Not naturally generative |
| VAE | Stable | Fast | Good, sometimes slightly blurry |
| GAN | Notoriously unstable | Fast (single pass) | Can be very sharp, high quality |
| Diffusion | Stable | Slower (many iterative steps) | State-of-the-art for many domains |
Conditional Generation: Controlling What Gets Generated
All four architectures described above can be extended to conditional generation — producing output constrained by some additional input, rather than sampling freely from the entire learned distribution. A conditional GAN might generate an image matching a specific class label; a conditional diffusion model (the basis for most modern text-to-image systems) generates an image conditioned on a text description, using an embedding of the text — the kind covered in Large Language Models — to steer the generation process at every denoising step. This conditioning mechanism is what turns a generic “generate a plausible image” model into a genuinely useful, controllable tool, and it’s worth recognizing as an extension layered on top of any of these four base architectures, rather than a separate, fifth category of generative model.
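As a small illustration of conditioning layered on a base architecture, the sketch below extends a GAN-style generator to accept a class label. The class name `ConditionalGenerator`, the embedding size, and the choice to concatenate the label embedding with the noise vector are assumptions made for this example; concatenation is one simple way to inject a condition, and text-to-image systems use considerably more elaborate mechanisms.

```python
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    """Generator whose output depends on both random noise and a class label."""

    def __init__(self, noise_dim, num_classes, output_dim, embed_dim=16):
        super().__init__()
        # Learned vector per class label; this is the conditioning signal.
        self.label_embedding = nn.Embedding(num_classes, embed_dim)
        self.net = nn.Sequential(
            nn.Linear(noise_dim + embed_dim, 128), nn.ReLU(),
            nn.Linear(128, output_dim), nn.Tanh(),
        )

    def forward(self, noise, labels):
        condition = self.label_embedding(labels)
        # Concatenate noise and condition so every layer sees both.
        return self.net(torch.cat([noise, condition], dim=-1))

generator = ConditionalGenerator(noise_dim=16, num_classes=10, output_dim=784)
noise = torch.randn(4, 16)
labels = torch.tensor([3, 3, 7, 7])  # request two samples each of classes 3 and 7
samples = generator(noise, labels)
```

In a conditional GAN the discriminator would typically receive the same label, so that it judges whether a sample is both realistic and consistent with the requested class.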
Summary
| Model | Core Mechanism |
|---|---|
| Autoencoder | Compress and reconstruct; the starting point that VAEs build on |
| VAE | Structured, sampleable latent space via a Gaussian prior |
| GAN | Two networks competing — generator vs. discriminator |
| Diffusion | Learn to reverse a gradual noising process |
Each generative model family represents a genuinely different solution to the same underlying challenge — learning a data distribution well enough to sample new, realistic examples from it — and the right choice depends heavily on your specific priorities around training stability, generation speed, and output quality.