Generative Models Explained: Autoencoders, VAEs, GANs, and Diffusion Models

How autoencoders, VAEs, GANs, and diffusion models each generate new data differently, and the tradeoffs that determine which to use when.

Every architecture covered so far in this series is fundamentally discriminative — given an input, predict something about it (a class, a value, the next token). Generative models flip the direction: given nothing, or given random noise, produce entirely new, realistic data. Four major families dominate this space, each with a genuinely different underlying mechanism and a distinct set of tradeoffs worth understanding before choosing between them.


Autoencoders: The Foundation

An autoencoder (covered in Unsupervised Learning) compresses input data through a bottleneck and then reconstructs it. That makes it useful for representation learning, but not naturally generative on its own: the bottleneck’s compressed space isn’t structured in a way that supports sampling meaningful new data from it.

import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 32))
        self.decoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 784))

    def forward(self, x):
        latent = self.encoder(x)
        reconstruction = self.decoder(latent)
        return reconstruction

Sampling a random point in the 32-dimensional latent space and decoding it typically produces garbage — a plain autoencoder’s latent space has no guarantee of being smooth or well-structured, since it was only ever trained to reconstruct specific input examples, not to support meaningful interpolation or sampling.
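
As a rough sketch of what that sampling looks like in code: the snippet below rebuilds a decoder with the same layer sizes as the Autoencoder above so that it runs on its own. The decoder here is untrained (an assumption made purely for illustration), so only the mechanics and shapes are meaningful, not the output values.

```python
import torch
import torch.nn as nn

# Stand-in for a trained decoder with the same layer sizes as above
# (assumption: in practice this would be autoencoder.decoder after training).
decoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 784))

# Draw latent codes from a standard normal. Nothing in a reconstruction-only
# objective ties the encoder's codes to this (or any other) distribution.
random_latents = torch.randn(4, 32)

with torch.no_grad():
    decoded = decoder(random_latents)

print(decoded.shape)  # torch.Size([4, 784])
```

The decoder will happily map any 32-dimensional vector to a 784-dimensional output; the problem is only that nothing guarantees those random vectors resemble the codes the encoder actually produced during training.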


Variational Autoencoders (VAEs): Making the Latent Space Generative

A VAE fixes exactly this problem by pushing the latent space toward a known, well-behaved distribution (typically a standard Gaussian, covered in Probability Distributions). Instead of a single point, the encoder outputs the mean and variance of a Gaussian (in practice the log-variance, for numerical stability), and a point is sampled from that distribution before being decoded.

import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU())
        self.mean_layer = nn.Linear(128, 32)
        self.logvar_layer = nn.Linear(128, 32)
        self.decoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 784))

    def forward(self, x):
        h = self.encoder(x)
        mean, logvar = self.mean_layer(h), self.logvar_layer(h)
        std = torch.exp(0.5 * logvar)  # convert log-variance to standard deviation
        z = mean + std * torch.randn_like(std)  # the "reparameterization trick" for sampling
        return self.decoder(z), mean, logvar

Training a VAE combines a reconstruction loss (how well the decoded output matches the input) with a KL divergence term (covered in Information Theory) that measures how far the encoder’s latent distribution is from a standard Gaussian. This second term is what makes the latent space smooth and sampleable, since it explicitly penalizes the encoder for deviating too far from a well-behaved, known distribution.
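
The two-term loss can be sketched as follows. This is a minimal illustration rather than a definitive implementation: the function name `vae_loss` is hypothetical, summed squared error is assumed for the reconstruction term (binary cross-entropy is a common alternative for pixel data in [0, 1]), and the KL term uses the standard closed form for a diagonal Gaussian against a standard Gaussian prior.

```python
import torch
import torch.nn.functional as F

def vae_loss(reconstruction, target, mean, logvar):
    """Reconstruction error plus KL divergence to a standard Gaussian prior."""
    # Summed squared error over all pixels and batch items (an assumed choice).
    recon = F.mse_loss(reconstruction, target, reduction="sum")
    # Closed-form KL( N(mean, exp(logvar)) || N(0, I) ), summed over latent dims.
    kl = -0.5 * torch.sum(1 + logvar - mean.pow(2) - logvar.exp())
    return recon + kl

# Usage with dummy tensors shaped like the VAE above (batch of 8).
x = torch.rand(8, 784)
reconstruction = torch.rand(8, 784)
mean, logvar = torch.zeros(8, 32), torch.zeros(8, 32)
loss = vae_loss(reconstruction, x, mean, logvar)
```

When the encoder outputs exactly a standard Gaussian (zero mean, zero log-variance), the KL term vanishes and only the reconstruction error remains; any drift in the mean or variance adds a penalty.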


Generative Adversarial Networks (GANs): Two Networks Competing

A GAN takes a fundamentally different approach — a generator network learns to produce fake data, while a discriminator network learns to distinguish real data from the generator’s fakes, and the two are trained simultaneously in direct competition.

class Generator(nn.Module):
    def __init__(self, noise_dim, output_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(noise_dim, 128), nn.ReLU(),
            nn.Linear(128, output_dim), nn.Tanh()
        )

    def forward(self, noise):
        return self.net(noise)

class Discriminator(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, 1), nn.Sigmoid()
        )

    def forward(self, x):
        return self.net(x)  # probability that x is real

The generator improves by learning to fool the discriminator; the discriminator improves by learning to catch the generator’s mistakes. When this adversarial dynamic stays balanced, it can produce remarkably realistic outputs. GANs are notoriously difficult to train, however: a discriminator that becomes too good too quickly gives the generator almost no useful gradient signal to improve from, a well-documented practical challenge of this architecture family.
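
One round of that competition might look like the sketch below. It is illustrative only: the networks are compact `nn.Sequential` equivalents of the classes above, the "real" batch is random placeholder data, and the dimensions, Adam optimizers, and learning rates are assumptions rather than recommended settings.

```python
import torch
import torch.nn as nn

noise_dim, data_dim, batch_size = 16, 784, 32  # assumed sizes for illustration

# Compact equivalents of the Generator and Discriminator classes above.
generator = nn.Sequential(
    nn.Linear(noise_dim, 128), nn.ReLU(), nn.Linear(128, data_dim), nn.Tanh()
)
discriminator = nn.Sequential(
    nn.Linear(data_dim, 128), nn.ReLU(), nn.Linear(128, 1), nn.Sigmoid()
)

g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCELoss()

real = torch.rand(batch_size, data_dim) * 2 - 1  # placeholder "real" batch in [-1, 1]
ones, zeros = torch.ones(batch_size, 1), torch.zeros(batch_size, 1)

# Discriminator step: push real samples toward 1 and fakes toward 0.
# detach() stops this loss from sending gradients into the generator.
fake = generator(torch.randn(batch_size, noise_dim))
d_loss = bce(discriminator(real), ones) + bce(discriminator(fake.detach()), zeros)
d_opt.zero_grad()
d_loss.backward()
d_opt.step()

# Generator step: update the generator so the discriminator labels its fakes as real.
g_loss = bce(discriminator(fake), ones)
g_opt.zero_grad()
g_loss.backward()
g_opt.step()
```

In a real training loop these two steps alternate over many batches, and keeping the two losses roughly in balance is where most of the practical difficulty lies.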


Diffusion Models: Learning to Reverse Noise

Diffusion models take yet another approach — they learn to reverse a gradual noising process. During training, real data is progressively corrupted with noise over many steps; the model learns to predict and remove that noise, step by step. Generation then starts from pure random noise and iteratively denoises it, guided by what the model learned.

# Conceptual training objective: predict the noise that was added at a given step
noisy_image = original_image + noise_schedule[step] * random_noise
predicted_noise = model(noisy_image, step)
loss = mse(predicted_noise, random_noise)

# Conceptual generation: start from pure noise, iteratively denoise
image = random_noise
for step in reversed(range(num_steps)):
    predicted_noise = model(image, step)
    image = denoise_step(image, predicted_noise, step)

Diffusion models have become the dominant approach for high-quality image generation, largely because their training is considerably more stable than GANs’ adversarial dynamic — there’s no second competing network to destabilize training — at the cost of typically requiring many iterative denoising steps at generation time, making generation itself slower than a GAN’s single forward pass.
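
For a runnable version of the conceptual loop, the sketch below uses one common concrete formulation, the DDPM-style variance-preserving schedule, which scales down the clean image as noise is added (a detail the conceptual code leaves out). Everything named here is an assumption for illustration: the linear schedule values, the step count, and the `TinyDenoiser` network, which is a hypothetical stand-in for the much larger U-Net or transformer models used in practice. The model is untrained, so only the mechanics are shown.

```python
import torch
import torch.nn as nn

num_steps = 1000
betas = torch.linspace(1e-4, 0.02, num_steps)  # per-step noise variances (assumed linear schedule)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)      # fraction of original signal left at each step

class TinyDenoiser(nn.Module):
    """Hypothetical toy noise predictor, for illustration only."""
    def __init__(self, dim=784):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 128), nn.ReLU(), nn.Linear(128, dim))

    def forward(self, x, step):
        t = step.float().unsqueeze(1) / num_steps  # crude scalar timestep feature
        return self.net(torch.cat([x, t], dim=1))

model = TinyDenoiser()

# Training objective: noise a clean batch to a random step, then predict that noise.
x0 = torch.rand(8, 784) * 2 - 1                # placeholder data in [-1, 1]
step = torch.randint(0, num_steps, (8,))
noise = torch.randn_like(x0)
a_bar = alpha_bars[step].unsqueeze(1)
x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise
loss = nn.functional.mse_loss(model(x_t, step), noise)

# Generation: start from pure noise and apply the reverse update at every step.
with torch.no_grad():
    x = torch.randn(2, 784)
    for t in reversed(range(num_steps)):
        eps = model(x, torch.full((2,), t))
        mean = (x - betas[t] / (1 - alpha_bars[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:
            x = mean + betas[t].sqrt() * torch.randn_like(x)  # re-inject a little noise
        else:
            x = mean                                           # final step is deterministic
```

The generation loop makes the speed tradeoff concrete: producing one batch costs `num_steps` forward passes through the model, compared with a single pass for a GAN generator.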


Comparing the Four Approaches

| Model | Training stability | Generation speed | Typical output quality |
|---|---|---|---|
| Autoencoder | Very stable | Fast | Not naturally generative |
| VAE | Stable | Fast | Good, sometimes slightly blurry |
| GAN | Notoriously unstable | Fast (single pass) | Can be very sharp, high quality |
| Diffusion | Stable | Slower (many iterative steps) | State-of-the-art for many domains |

Conditional Generation: Controlling What Gets Generated

All four architectures described above can be extended to conditional generation — producing output constrained by some additional input, rather than sampling freely from the entire learned distribution. A conditional GAN might generate an image matching a specific class label; a conditional diffusion model (the basis for most modern text-to-image systems) generates an image conditioned on a text description, using an embedding of the text — the kind covered in Large Language Models — to steer the generation process at every denoising step. This conditioning mechanism is what turns a generic “generate a plausible image” model into a genuinely useful, controllable tool, and it’s worth recognizing as an extension layered on top of any of these four base architectures, rather than a separate, fifth category of generative model.
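
As a small illustration of the class-label case, one simple way to condition a GAN generator is to concatenate a learned label embedding onto the noise vector. The `ConditionalGenerator` class, its default sizes, and the concatenation approach are assumptions chosen for brevity; real systems often inject the condition in more elaborate ways.

```python
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    """Hypothetical class-conditional generator: noise plus a learned label embedding."""
    def __init__(self, noise_dim=16, num_classes=10, embed_dim=8, output_dim=784):
        super().__init__()
        self.label_embedding = nn.Embedding(num_classes, embed_dim)
        self.net = nn.Sequential(
            nn.Linear(noise_dim + embed_dim, 128), nn.ReLU(),
            nn.Linear(128, output_dim), nn.Tanh(),
        )

    def forward(self, noise, labels):
        # Concatenate the condition onto the noise so every output depends on both.
        conditioned = torch.cat([noise, self.label_embedding(labels)], dim=1)
        return self.net(conditioned)

generator = ConditionalGenerator()
labels = torch.tensor([3, 3, 7])  # request two samples of class 3 and one of class 7
samples = generator(torch.randn(3, 16), labels)
```

The noise still supplies variety within a class, while the label embedding selects which part of the learned distribution to sample from. A matching conditional discriminator would typically receive the label as well, so it can penalize samples that look realistic but belong to the wrong class.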

Summary

| Model | Core Mechanism |
|---|---|
| Autoencoder | Compress and reconstruct; foundation for the others |
| VAE | Structured, sampleable latent space via a Gaussian prior |
| GAN | Two networks competing: generator vs. discriminator |
| Diffusion | Learn to reverse a gradual noising process |

Each generative model family represents a genuinely different solution to the same underlying challenge — learning a data distribution well enough to sample new, realistic examples from it — and the right choice depends heavily on your specific priorities around training stability, generation speed, and output quality.