Generative Models for Images
Generative models learn to create new data samples that resemble the training distribution. In computer vision, this means generating realistic images.
Generative Adversarial Networks (GANs)
Introduced by Ian Goodfellow in 2014, GANs train two networks in a minimax game:
Architecture
Generator G: maps a random noise vector z (drawn from a simple prior) to an image. Discriminator D: takes an image and outputs the probability that it is real rather than generated.
Training Dynamics
The generator tries to fool the discriminator; the discriminator tries not to be fooled:

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$$
Training alternates between two steps:
1. Train D: show real images (label = 1) and fake images (label = 0); maximize classification accuracy.
2. Train G: generate fake images and try to make D classify them as real (label = 1).
Mode Collapse
The biggest challenge in GAN training: the generator finds a few outputs that reliably fool the discriminator and keeps producing only those, ignoring the full diversity of the data distribution. Mitigation techniques include minibatch discrimination, feature matching, unrolled GANs, and Wasserstein (WGAN) losses.
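One widely used mitigation, the minibatch standard-deviation trick (a simplified form of minibatch discrimination popularized by ProGAN/StyleGAN), fits in a few lines. This is an illustrative sketch, not code from any particular library; the function name and shapes are assumptions:

```python
import torch

def minibatch_std(features, eps=1e-8):
    """Append the average across-batch std as an extra feature.

    If the generator collapses to a few modes, variation within a
    batch shrinks, and the discriminator can use this extra feature
    to reject the whole batch, penalizing the collapse.
    """
    std = (features.var(dim=0, unbiased=False) + eps).sqrt().mean()
    stat = std.expand(features.size(0), 1)   # one extra column per sample
    return torch.cat([features, stat], dim=1)

feats = torch.randn(8, 16)   # e.g. penultimate discriminator features
out = minibatch_std(feats)
print(out.shape)             # torch.Size([8, 17])
```

The discriminator then operates on 17 features instead of 16, one of which directly measures batch diversity.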
Notable GAN Architectures
GANs vs VAEs vs Diffusion Models
In brief: GANs sample in a single forward pass and produce sharp images, but training is unstable and prone to mode collapse. VAEs train stably and yield structured latent spaces, but samples tend to be blurry. Diffusion models currently achieve the best sample quality and diversity, at the cost of slow, iterative sampling.
Variational Autoencoders (VAEs)
VAEs combine autoencoders with probabilistic inference:
Architecture
Encoder: maps an input x to the parameters (μ, σ) of a Gaussian distribution over the latent variable z. Decoder: reconstructs x from a sample z drawn from that distribution.
The Reparameterization Trick
Instead of sampling z ~ N(μ, σ) directly (which is not differentiable), compute:

$$z = \mu + \sigma \cdot \epsilon, \quad \epsilon \sim N(0, 1)$$

This makes the sampling operation differentiable, allowing backpropagation through the encoder.
Loss Function
$$L = L_{\text{reconstruction}} + \beta \cdot L_{\text{KL}}$$

Latent Space Properties
Because the KL term pulls the posterior toward N(0, I), the latent space is smooth and well-structured: interpolating between two latent codes yields a coherent sequence of intermediate images.
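The reparameterization trick and the β-weighted loss above can be sketched together in PyTorch. Function names are illustrative, and the sketch assumes a Gaussian posterior parameterized by (μ, log σ²) with an MSE reconstruction term:

```python
import torch
import torch.nn.functional as F

def reparameterize(mu, logvar):
    # z = mu + sigma * eps, eps ~ N(0, I). Sampling becomes a
    # deterministic function of (mu, logvar) plus external noise,
    # so gradients flow through mu and logvar into the encoder.
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)
    return mu + std * eps

def vae_loss(x_hat, x, mu, logvar, beta=1.0):
    # Reconstruction term + beta-weighted KL divergence to N(0, I),
    # using the closed form of the KL between two Gaussians.
    recon = F.mse_loss(x_hat, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl
```

Setting beta > 1 (a β-VAE) trades reconstruction fidelity for a more disentangled latent space.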
Diffusion Models
The current state of the art for image generation, originally inspired by ideas from non-equilibrium thermodynamics.
Forward Process (Adding Noise)
Gradually add Gaussian noise to an image over T timesteps:

$$q(x_t | x_{t-1}) = N(x_t; \sqrt{1-\beta_t}\, x_{t-1}, \beta_t I)$$

After enough steps, the image becomes pure noise. The noise schedule {β_t} controls how fast noise is added (linear, cosine, etc.).
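Because each step is Gaussian, the whole chain has a closed form: $q(x_t | x_0) = N(\sqrt{\bar\alpha_t}\, x_0, (1-\bar\alpha_t) I)$ with $\bar\alpha_t = \prod_{s \le t} (1 - \beta_s)$, so any $x_t$ can be sampled directly without simulating every step. A sketch with a linear schedule (the endpoint values are common DDPM defaults, used here illustratively):

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)      # linear noise schedule
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)   # cumulative products

def q_sample(x0, t, noise):
    # Jump straight to timestep t using the closed form:
    # x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps
    ab = alpha_bar[t].view(-1, 1, 1, 1)    # broadcast over (B, C, H, W)
    return ab.sqrt() * x0 + (1 - ab).sqrt() * noise
```

At t = 0 the sample is almost entirely signal; by t = T-1, alpha_bar is near zero and the sample is almost pure noise.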
Reverse Process (Denoising)
Learn a neural network ε_θ to predict the noise added at each step:

$$p_\theta(x_{t-1} | x_t) = N(x_{t-1}; \mu_\theta(x_t, t), \sigma_t^2 I)$$

The model is trained with a simple MSE loss:

$$L = \mathbb{E}_{t, x_0, \epsilon}[\|\epsilon - \epsilon_\theta(x_t, t)\|^2]$$
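A single training step under this objective can be sketched as follows. Names are illustrative, and `model(x_t, t)` stands in for ε_θ (in practice a U-Net):

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

def train_step(model, x0):
    # Draw a random timestep and noise for each image, build x_t in
    # closed form, then regress the predicted noise onto the true
    # noise with MSE -- the full training loss in one function.
    t = torch.randint(0, T, (x0.size(0),), device=x0.device)
    noise = torch.randn_like(x0)
    ab = alpha_bar[t].view(-1, 1, 1, 1)
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * noise
    return F.mse_loss(model(x_t, t), noise)
```

Training then just repeats `train_step(model, batch).backward()` with an optimizer step; all the complexity lives in the noise schedule and the denoiser architecture.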
The U-Net Denoiser
The noise predictor is typically a U-Net with skip connections between the downsampling and upsampling paths, a sinusoidal timestep embedding injected into each block, and attention layers at the lower resolutions.
Stable Diffusion Architecture
Stable Diffusion (Latent Diffusion Model) is the most widely used open-source image generation model:
Key Innovation: Latent Space Diffusion
Instead of diffusing in pixel space (512x512x3 = 786K dimensions), Stable Diffusion operates in a compressed latent space (64x64x4 = 16K dimensions):
1. VAE Encoder: compresses the image from pixel space to latent space (8x spatial compression).
2. U-Net: performs the denoising diffusion in latent space.
3. VAE Decoder: decompresses the denoised latent back to pixel space.
This makes training and inference ~50x more efficient than pixel-space diffusion.
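The compression factor behind that figure is easy to check:

```python
# Dimensionality of the two spaces from the pipeline description above.
pixel_dims = 512 * 512 * 3    # RGB image in pixel space
latent_dims = 64 * 64 * 4     # 4-channel latent at 8x spatial compression
print(pixel_dims, latent_dims, pixel_dims / latent_dims)
# 786432 16384 48.0
```

The denoiser therefore works on 48x fewer dimensions, which is where the roughly-50x efficiency gain comes from.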
Text Conditioning
A frozen CLIP text encoder converts the prompt into a sequence of embeddings, which the U-Net attends to through cross-attention layers at each resolution, steering the denoising toward the described content.
Classifier-Free Guidance (CFG)
At inference, compute both conditional and unconditional noise predictions and extrapolate:

$$\epsilon_{\text{guided}} = \epsilon_{\text{uncond}} + s \cdot (\epsilon_{\text{cond}} - \epsilon_{\text{uncond}})$$

The guidance scale s trades off prompt adherence against sample diversity; values around 7-8 are a common default.
ControlNet
Adds spatial control to diffusion models by conditioning generation on auxiliary inputs such as edge maps, depth maps, segmentation masks, or human poses, learned through a trainable copy of the U-Net encoder attached to the frozen base model.
Other Techniques
# ==============================================================
# Simple GAN for MNIST in PyTorch
# ==============================================================
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as T
from torch.utils.data import DataLoader
import matplotlib.pyplot as plt

# Hyperparameters
latent_dim = 100
img_dim = 28 * 28  # flattened MNIST
batch_size = 128
lr = 2e-4
epochs = 50

# Data: normalize to [-1, 1] to match the generator's Tanh output
transform = T.Compose([T.ToTensor(), T.Normalize([0.5], [0.5])])
dataset = torchvision.datasets.MNIST("./data", train=True,
                                     download=True, transform=transform)
loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

# Generator: maps a latent vector z to a 28x28 image
class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 256),
            nn.LeakyReLU(0.2),
            nn.BatchNorm1d(256),
            nn.Linear(256, 512),
            nn.LeakyReLU(0.2),
            nn.BatchNorm1d(512),
            nn.Linear(512, 1024),
            nn.LeakyReLU(0.2),
            nn.BatchNorm1d(1024),
            nn.Linear(1024, img_dim),
            nn.Tanh(),  # outputs in [-1, 1], matching the data normalization
        )

    def forward(self, z):
        return self.net(z).view(-1, 1, 28, 28)

# Discriminator: classifies images as real (1) or fake (0)
class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(img_dim, 512),
            nn.LeakyReLU(0.2),
            nn.Dropout(0.3),
            nn.Linear(512, 256),
            nn.LeakyReLU(0.2),
            nn.Dropout(0.3),
            nn.Linear(256, 1),
            nn.Sigmoid(),
        )

    def forward(self, img):
        return self.net(img.view(-1, img_dim))

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
G = Generator().to(device)
D = Discriminator().to(device)
opt_G = torch.optim.Adam(G.parameters(), lr=lr, betas=(0.5, 0.999))
opt_D = torch.optim.Adam(D.parameters(), lr=lr, betas=(0.5, 0.999))
criterion = nn.BCELoss()

# Training
for epoch in range(epochs):
    for real_imgs, _ in loader:
        real_imgs = real_imgs.to(device)
        batch = real_imgs.size(0)
        real_labels = torch.ones(batch, 1, device=device)
        fake_labels = torch.zeros(batch, 1, device=device)

        # ---- Train Discriminator ----
        z = torch.randn(batch, latent_dim, device=device)
        fake_imgs = G(z).detach()  # detach: no gradients flow into G here
        d_loss = (criterion(D(real_imgs), real_labels) +
                  criterion(D(fake_imgs), fake_labels)) / 2
        opt_D.zero_grad()
        d_loss.backward()
        opt_D.step()

        # ---- Train Generator ----
        z = torch.randn(batch, latent_dim, device=device)
        fake_imgs = G(z)
        g_loss = criterion(D(fake_imgs), real_labels)  # fool D
        opt_G.zero_grad()
        g_loss.backward()
        opt_G.step()

    if (epoch + 1) % 10 == 0:
        print(f"Epoch {epoch+1}/{epochs} | D Loss: {d_loss:.4f} | G Loss: {g_loss:.4f}")

# Generate samples
with torch.no_grad():
    z = torch.randn(16, latent_dim, device=device)
    samples = G(z).cpu()
    grid = torchvision.utils.make_grid(samples, nrow=4, normalize=True)
    plt.figure(figsize=(6, 6))
    plt.imshow(grid.permute(1, 2, 0).squeeze(), cmap="gray")
    plt.title("Generated MNIST Digits")
    plt.axis("off")
    plt.show()

# ==============================================================
# Stable Diffusion with the diffusers library
# pip install diffusers transformers accelerate
# ==============================================================
import torch
from PIL import Image
from diffusers import StableDiffusionPipeline, StableDiffusionImg2ImgPipeline
import matplotlib.pyplot as plt

device = "cuda" if torch.cuda.is_available() else "cpu"

# ---- Text-to-Image ----
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
)
pipe = pipe.to(device)

# Generate an image
prompt = "A serene mountain lake at sunset, photorealistic, 4k"
negative_prompt = "blurry, low quality, distorted"

image = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=50,
    guidance_scale=7.5,
    height=512,
    width=512,
).images[0]
image.save("mountain_lake.png")

# ---- Exploring guidance scale ----
fig, axes = plt.subplots(1, 4, figsize=(20, 5))
for i, scale in enumerate([1.0, 5.0, 7.5, 15.0]):
    img = pipe(
        prompt="A cat wearing a tiny top hat, oil painting",
        guidance_scale=scale,
        num_inference_steps=30,
    ).images[0]
    axes[i].imshow(img)
    axes[i].set_title(f"CFG Scale = {scale}")
    axes[i].axis("off")
plt.suptitle("Effect of Classifier-Free Guidance Scale")
plt.tight_layout()
plt.show()

# ---- Image-to-Image ----
img2img_pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
)
img2img_pipe = img2img_pipe.to(device)

init_image = Image.open("mountain_lake.png").resize((512, 512))
result = img2img_pipe(
    prompt="Same scene but in winter with snow, photorealistic",
    image=init_image,
    strength=0.75,  # 0 = no change, 1 = complete regeneration
    guidance_scale=7.5,
    num_inference_steps=50,
).images[0]
result.save("mountain_lake_winter.png")