
Convolutional Neural Networks (CNNs)

Learn why CNNs dominate image tasks, build one from scratch, and master transfer learning


Why Not Dense Layers for Images?

Consider a 224x224 color image. Flattened, that's 224 x 224 x 3 = 150,528 input values. A single dense layer with 256 neurons would need:

> 150,528 x 256 = 38.5 million parameters — in just ONE layer!

This has three critical problems:

1. Parameter explosion — too many weights to train efficiently
2. No spatial awareness — a dense layer treats pixel (0,0) and pixel (223,223) identically
3. No translation invariance — to a dense network, a cat in the top-left corner and the same cat in the bottom-right produce completely different inputs

CNNs solve all three problems by using local connectivity (small filters), weight sharing (same filter slides across the image), and pooling (spatial invariance).
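To make the contrast concrete, here is a back-of-the-envelope comparison (plain Python, no framework needed) between the dense layer above and a single 3x3 conv layer with 32 filters:

```python
def dense_params(inputs: int, units: int) -> int:
    # one weight per (input, unit) pair, plus one bias per unit
    return inputs * units + units

def conv2d_params(kernel_h: int, kernel_w: int, in_channels: int, filters: int) -> int:
    # each filter has kernel_h * kernel_w * in_channels weights plus one bias
    return (kernel_h * kernel_w * in_channels + 1) * filters

flat_inputs = 224 * 224 * 3                  # 150,528 values
print(dense_params(flat_inputs, 256))        # 38,535,424 (weights + biases) — the ~38.5M above
print(conv2d_params(3, 3, 3, 32))            # 896 — the same filters are reused at every position
```

Note that the conv layer's parameter count does not depend on the image size at all; that is weight sharing in action.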

CNN Architecture: The Building Blocks

Conv2D — The Convolutional Layer

A convolutional layer slides small filters (e.g., 3x3) across the image. Each filter detects a specific pattern (edge, texture, shape) at every spatial location.

| Parameter | Meaning |
|---|---|
| `filters` | Number of different patterns to detect (e.g., 32) |
| `kernel_size` | Size of each filter (e.g., 3x3) |
| `strides` | How far the filter moves each step (default: 1) |
| `padding` | `"same"` keeps spatial dims; `"valid"` shrinks them |
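A quick way to internalize these parameters is to pass a dummy tensor through `Conv2D` with different settings and inspect the output shapes (the filter count of 8 here is arbitrary):

```python
import tensorflow as tf
from tensorflow.keras import layers

x = tf.zeros((1, 32, 32, 3))  # dummy batch: one 32x32 RGB image

same = layers.Conv2D(8, (3, 3), padding="same")(x)
valid = layers.Conv2D(8, (3, 3), padding="valid")(x)
strided = layers.Conv2D(8, (3, 3), strides=2, padding="same")(x)

print(same.shape)     # (1, 32, 32, 8)  — "same" preserves spatial dims
print(valid.shape)    # (1, 30, 30, 8)  — "valid" shrinks by kernel_size - 1
print(strided.shape)  # (1, 16, 16, 8)  — stride 2 halves height and width
```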

MaxPooling2D — Spatial Down-sampling

Takes the maximum value in each window (e.g., 2x2), reducing spatial dimensions by half. This provides:

  • Translation invariance — small shifts in the input don't change the output
  • Dimension reduction — fewer parameters in subsequent layers

Flatten / GlobalAveragePooling2D — Bridge to Dense Layers

After the conv layers extract spatial features, we need to convert the 3D feature maps into a 1D vector for classification. GlobalAveragePooling2D is preferred over Flatten because it averages each feature map down to a single value, which dramatically reduces the parameter count of the following Dense layer.
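The difference is easy to verify by building two tiny classification heads on the same feature-map shape; the (8, 8, 128) shape here is purely illustrative:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Two heads on identical (8, 8, 128) feature maps
flat_head = keras.Sequential([
    keras.Input(shape=(8, 8, 128)),
    layers.Flatten(),                 # 8*8*128 = 8192 inputs to the Dense layer
    layers.Dense(128),
])
gap_head = keras.Sequential([
    keras.Input(shape=(8, 8, 128)),
    layers.GlobalAveragePooling2D(),  # one average per channel: 128 inputs
    layers.Dense(128),
])

print(flat_head.count_params())  # 1,048,704
print(gap_head.count_params())   # 16,512 — ~64x fewer
```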

Building a CNN from Scratch

Here's the classic CNN architecture pattern: stacks of (Conv -> Pool) followed by Dense layers for classification.

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# --- Load CIFAR-10 ---
(X_train, y_train), (X_test, y_test) = keras.datasets.cifar10.load_data()
X_train = X_train.astype("float32") / 255.0
X_test = X_test.astype("float32") / 255.0
print(f"Train: {X_train.shape}, Test: {X_test.shape}")
# Train: (50000, 32, 32, 3), Test: (10000, 32, 32, 3)

# --- Build CNN ---
model = keras.Sequential([
    # Block 1: low-level features (edges, colors)
    layers.Conv2D(32, (3, 3), activation="relu", padding="same",
                  input_shape=(32, 32, 3)),
    layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2)),
    layers.Dropout(0.25),

    # Block 2: mid-level features (textures, parts)
    layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
    layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2)),
    layers.Dropout(0.25),

    # Block 3: high-level features (object parts)
    layers.Conv2D(128, (3, 3), activation="relu", padding="same"),
    layers.Conv2D(128, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2)),
    layers.Dropout(0.25),

    # Classification head
    layers.GlobalAveragePooling2D(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(10, activation="softmax"),
])

model.summary()
# Far fewer parameters than a dense network on flattened images!

model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

history = model.fit(
    X_train, y_train,
    epochs=20,
    batch_size=64,
    validation_split=0.1,
    callbacks=[
        keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True),
    ],
)

test_loss, test_acc = model.evaluate(X_test, y_test, verbose=0)
print(f"Test accuracy: {test_acc:.4f}")
```

Transfer Learning with Pre-Trained Models

Training a CNN from scratch requires large datasets and many GPU hours. Transfer learning lets you reuse a model trained on millions of images (ImageNet) and adapt it to your specific task — often with only a few hundred images.

The Strategy

1. Load a pre-trained base model (e.g., EfficientNetV2B0) and freeze its weights
2. Add a custom classification head on top
3. Train only the head for a few epochs (feature extraction)
4. Optionally unfreeze some base layers and fine-tune with a very low learning rate

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# --- Step 1: Load pre-trained base model ---
base_model = keras.applications.EfficientNetV2B0(
    weights="imagenet",
    include_top=False,              # Remove original classification head
    input_shape=(224, 224, 3),
)

# Freeze ALL base model weights
base_model.trainable = False

# --- Step 2: Add custom classification head ---
inputs = keras.Input(shape=(224, 224, 3))
# EfficientNetV2 has its own preprocessing built in
x = base_model(inputs, training=False)
x = layers.GlobalAveragePooling2D()(x)
x = layers.Dropout(0.3)(x)
x = layers.Dense(256, activation="relu")(x)
x = layers.Dropout(0.3)(x)
outputs = layers.Dense(5, activation="softmax")(x)  # 5 classes

model = keras.Model(inputs, outputs)

# --- Step 3: Train the head (feature extraction) ---
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-3),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

# model.fit(train_data, epochs=10, validation_data=val_data)
print("Phase 1: Training head only")
print(f"Trainable parameters: {sum(p.numpy().size for p in model.trainable_variables):,}")

# --- Step 4: Fine-tune the base model ---
# Unfreeze the base model
base_model.trainable = True

# Re-compile with a MUCH lower learning rate
# This is critical — a high LR would destroy the pre-trained features
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-5),  # 100x smaller!
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

# model.fit(train_data, epochs=10, validation_data=val_data)
print("\nPhase 2: Fine-tuning entire model")
print(f"Trainable parameters: {sum(p.numpy().size for p in model.trainable_variables):,}")
```

Transfer Learning Strategy

1. Freeze the base model and train only the new head (high learning rate, ~1e-3)
2. Unfreeze the base model and fine-tune everything (very low learning rate, ~1e-5)

Why the low learning rate for fine-tuning? The pre-trained weights are already excellent; a large learning rate would destroy the carefully learned features. We want to gently nudge them to work better for our specific task. This two-phase approach consistently outperforms training from scratch, especially with small datasets (< 10,000 images).
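A common refinement of phase 2, not shown in the code above, is to unfreeze only the top of the base model and keep all BatchNormalization layers frozen so their running statistics survive small fine-tuning batches. A sketch, where the 30-layer cutoff is an arbitrary illustration (`weights=None` is used here only to skip the ImageNet download; use `"imagenet"` in practice):

```python
from tensorflow import keras

base_model = keras.applications.EfficientNetV2B0(
    weights=None,          # "imagenet" in practice
    include_top=False,
    input_shape=(224, 224, 3),
)

# Unfreeze everything, then re-freeze all but the last ~30 layers
base_model.trainable = True
for layer in base_model.layers[:-30]:
    layer.trainable = False

# Keep BatchNormalization frozen even in the unfrozen top
for layer in base_model.layers:
    if isinstance(layer, keras.layers.BatchNormalization):
        layer.trainable = False

trainable = sum(1 for layer in base_model.layers if layer.trainable)
print(f"Trainable layers: {trainable} / {len(base_model.layers)}")
```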

Choosing a Pre-Trained Model

| Model | Size | Top-1 Accuracy | Speed | Best For |
|---|---|---|---|---|
| MobileNetV2 | 14 MB | 71.8% | Very Fast | Mobile/edge deployment |
| EfficientNetV2B0 | 29 MB | 78.7% | Fast | Good balance of accuracy and speed |
| EfficientNetV2L | 478 MB | 85.7% | Slow | Maximum accuracy when resources allow |
| ResNet50 | 98 MB | 76.0% | Medium | Well-studied, reliable baseline |

For most projects, EfficientNetV2B0 is the sweet spot. Use MobileNetV2 when deploying to mobile devices, and larger EfficientNet variants when accuracy is critical and you have the compute budget.