
Convolutional Neural Networks (CNNs)

Learn why CNNs dominate image tasks, build one from scratch, and master transfer learning


Why Not Dense Layers for Images?

Consider a 224x224 color image. Flattened, that's 224 x 224 x 3 = 150,528 input values. A single dense layer with 256 neurons would need:

> 150,528 x 256 = 38.5 million parameters — in just ONE layer!

This has three critical problems:

1. Parameter explosion — too many weights to train efficiently
2. No spatial awareness — a dense layer treats pixel (0,0) and pixel (223,223) identically
3. No translation invariance — to a dense network, a cat in the top-left corner and the same cat in the bottom-right produce completely different inputs

CNNs solve all three problems by using local connectivity (small filters), weight sharing (same filter slides across the image), and pooling (spatial invariance).
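To make the contrast concrete, here is a back-of-the-envelope comparison (plain Python, no framework needed) between the dense layer above and a single 3x3 conv layer with 32 filters:

```python
def dense_params(inputs: int, units: int) -> int:
    # one weight per (input, unit) pair, plus one bias per unit
    return inputs * units + units

def conv2d_params(kernel_h: int, kernel_w: int, in_channels: int, filters: int) -> int:
    # each filter has kernel_h * kernel_w * in_channels weights plus one bias
    return (kernel_h * kernel_w * in_channels + 1) * filters

flat_inputs = 224 * 224 * 3                  # 150,528 values
print(dense_params(flat_inputs, 256))        # 38,535,424 (weights + biases) — the ~38.5M above
print(conv2d_params(3, 3, 3, 32))            # 896 — the same filters are reused at every position
```

Note that the conv layer's parameter count does not depend on the image size at all; that is weight sharing in action.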

CNN Architecture: The Building Blocks

Conv2D — The Convolutional Layer

A convolutional layer slides small filters (e.g., 3x3) across the image. Each filter detects a specific pattern (edge, texture, shape) at every spatial location.

| Parameter | Meaning |
|---|---|
| `filters` | Number of different patterns to detect (e.g., 32) |
| `kernel_size` | Size of each filter (e.g., 3x3) |
| `strides` | How far the filter moves each step (default: 1) |
| `padding` | `"same"` keeps spatial dims; `"valid"` shrinks them |
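A quick way to internalize these parameters is to pass a dummy tensor through `Conv2D` with different settings and inspect the output shapes (the filter count of 8 here is arbitrary):

```python
import tensorflow as tf
from tensorflow.keras import layers

x = tf.zeros((1, 32, 32, 3))  # dummy batch: one 32x32 RGB image

same = layers.Conv2D(8, (3, 3), padding="same")(x)
valid = layers.Conv2D(8, (3, 3), padding="valid")(x)
strided = layers.Conv2D(8, (3, 3), strides=2, padding="same")(x)

print(same.shape)     # (1, 32, 32, 8)  — "same" preserves spatial dims
print(valid.shape)    # (1, 30, 30, 8)  — "valid" shrinks by kernel_size - 1
print(strided.shape)  # (1, 16, 16, 8)  — stride 2 halves height and width
```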

MaxPooling2D — Spatial Down-sampling

Takes the maximum value in each window (e.g., 2x2), reducing spatial dimensions by half. This provides:

  • Translation invariance — small shifts in the input don't change the output
  • Dimension reduction — fewer parameters in subsequent layers

Flatten / GlobalAveragePooling2D — Bridge to Dense Layers

After the conv layers extract spatial features, we need to convert the 3D feature maps into a 1D vector for classification. GlobalAveragePooling2D is preferred over Flatten because it averages each feature map down to a single value, which dramatically reduces the parameter count of the following Dense layer.
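The difference is easy to verify by building two tiny classification heads on the same feature-map shape; the (8, 8, 128) shape here is purely illustrative:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Two heads on identical (8, 8, 128) feature maps
flat_head = keras.Sequential([
    keras.Input(shape=(8, 8, 128)),
    layers.Flatten(),                 # 8*8*128 = 8192 inputs to the Dense layer
    layers.Dense(128),
])
gap_head = keras.Sequential([
    keras.Input(shape=(8, 8, 128)),
    layers.GlobalAveragePooling2D(),  # one average per channel: 128 inputs
    layers.Dense(128),
])

print(flat_head.count_params())  # 1,048,704
print(gap_head.count_params())   # 16,512 — ~64x fewer
```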

Building a CNN from Scratch

Here's the classic CNN architecture pattern: stacks of (Conv -> Pool) followed by Dense layers for classification.

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# --- Load CIFAR-10 ---
(X_train, y_train), (X_test, y_test) = keras.datasets.cifar10.load_data()
X_train = X_train.astype("float32") / 255.0
X_test = X_test.astype("float32") / 255.0
print(f"Train: {X_train.shape}, Test: {X_test.shape}")
# Train: (50000, 32, 32, 3), Test: (10000, 32, 32, 3)

# --- Build CNN ---
model = keras.Sequential([
    # Block 1: low-level features (edges, colors)
    layers.Conv2D(32, (3, 3), activation="relu", padding="same",
                  input_shape=(32, 32, 3)),
    layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2)),
    layers.Dropout(0.25),

    # Block 2: mid-level features (textures, parts)
    layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
    layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2)),
    layers.Dropout(0.25),

    # Block 3: high-level features (object parts)
    layers.Conv2D(128, (3, 3), activation="relu", padding="same"),
    layers.Conv2D(128, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2)),
    layers.Dropout(0.25),

    # Classification head
    layers.GlobalAveragePooling2D(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(10, activation="softmax"),
])

model.summary()
# Far fewer parameters than a dense network on flattened images!

model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

history = model.fit(
    X_train, y_train,
    epochs=20,
    batch_size=64,
    validation_split=0.1,
    callbacks=[
        keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True),
    ],
)

test_loss, test_acc = model.evaluate(X_test, y_test, verbose=0)
print(f"Test accuracy: {test_acc:.4f}")
```

Transfer Learning with Pre-Trained Models

Training a CNN from scratch requires large datasets and many GPU hours. Transfer learning lets you reuse a model trained on millions of images (ImageNet) and adapt it to your specific task — often with only a few hundred images.

The Strategy

1. Load a pre-trained base model (e.g., EfficientNetV2B0) and freeze its weights
2. Add a custom classification head on top
3. Train only the head for a few epochs (feature extraction)
4. Optionally unfreeze some base layers and fine-tune with a very low learning rate

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# --- Step 1: Load pre-trained base model ---
base_model = keras.applications.EfficientNetV2B0(
    weights="imagenet",
    include_top=False,              # Remove original classification head
    input_shape=(224, 224, 3),
)

# Freeze ALL base model weights
base_model.trainable = False

# --- Step 2: Add custom classification head ---
inputs = keras.Input(shape=(224, 224, 3))
# EfficientNetV2 has its own preprocessing built in
x = base_model(inputs, training=False)
x = layers.GlobalAveragePooling2D()(x)
x = layers.Dropout(0.3)(x)
x = layers.Dense(256, activation="relu")(x)
x = layers.Dropout(0.3)(x)
outputs = layers.Dense(5, activation="softmax")(x)  # 5 classes

model = keras.Model(inputs, outputs)

# --- Step 3: Train the head (feature extraction) ---
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-3),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

# model.fit(train_data, epochs=10, validation_data=val_data)
print("Phase 1: Training head only")
print(f"Trainable parameters: {sum(p.numpy().size for p in model.trainable_variables):,}")

# --- Step 4: Fine-tune the base model ---
# Unfreeze the base model
base_model.trainable = True

# Re-compile with a MUCH lower learning rate
# This is critical — a high LR would destroy the pre-trained features
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-5),  # 100x smaller!
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

# model.fit(train_data, epochs=10, validation_data=val_data)
print("\nPhase 2: Fine-tuning entire model")
print(f"Trainable parameters: {sum(p.numpy().size for p in model.trainable_variables):,}")
```

Transfer Learning Strategy

1. Freeze the base model and train only the new head (high learning rate, ~1e-3)
2. Unfreeze the base model and fine-tune everything (very low learning rate, ~1e-5)

Why the low learning rate for fine-tuning? The pre-trained weights are already excellent; a large learning rate would destroy the carefully learned features. We want to gently nudge them to work better for our specific task. This two-phase approach consistently outperforms training from scratch, especially with small datasets (< 10,000 images).
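A common refinement of phase 2, not shown in the code above, is to unfreeze only the top of the base model and keep all BatchNormalization layers frozen so their running statistics survive small fine-tuning batches. A sketch, where the 30-layer cutoff is an arbitrary illustration (`weights=None` is used here only to skip the ImageNet download; use `"imagenet"` in practice):

```python
from tensorflow import keras

base_model = keras.applications.EfficientNetV2B0(
    weights=None,          # "imagenet" in practice
    include_top=False,
    input_shape=(224, 224, 3),
)

# Unfreeze everything, then re-freeze all but the last ~30 layers
base_model.trainable = True
for layer in base_model.layers[:-30]:
    layer.trainable = False

# Keep BatchNormalization frozen even in the unfrozen top
for layer in base_model.layers:
    if isinstance(layer, keras.layers.BatchNormalization):
        layer.trainable = False

trainable = sum(1 for layer in base_model.layers if layer.trainable)
print(f"Trainable layers: {trainable} / {len(base_model.layers)}")
```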

Choosing a Pre-Trained Model

| Model | Size | Top-1 Accuracy | Speed | Best For |
|---|---|---|---|---|
| MobileNetV2 | 14 MB | 71.8% | Very Fast | Mobile/edge deployment |
| EfficientNetV2B0 | 29 MB | 78.7% | Fast | Good balance of accuracy and speed |
| EfficientNetV2L | 478 MB | 85.7% | Slow | Maximum accuracy when resources allow |
| ResNet50 | 98 MB | 76.0% | Medium | Well-studied, reliable baseline |

For most projects, EfficientNetV2B0 is the sweet spot. Use MobileNetV2 when deploying to mobile devices, and larger EfficientNet variants when accuracy is critical and you have the compute budget.