
Neural Networks from Scratch

Understand the building blocks of neural networks by implementing them in pure NumPy



Neural networks are the foundation of deep learning. Before we use any frameworks, let's build one from scratch so you truly understand what's happening under the hood.

What Is a Neural Network?

A neural network is a series of layers, each of which transforms its input using a simple formula:

$$\text{output} = \text{activation}(\text{input} \times \text{weights} + \text{bias})$$

That's it. Every layer in every neural network — from a tiny classifier to GPT — follows this pattern. The magic comes from stacking many layers and learning the right weights through training.

Anatomy of a Single Neuron

A single neuron takes a vector of inputs, multiplies each by a learned weight, adds a bias term, and passes the result through an activation function:

| Component | Role |
|---|---|
| Inputs (x) | The data flowing in (features or outputs from a previous layer) |
| Weights (W) | Learned parameters that scale each input |
| Bias (b) | A learned offset that shifts the output |
| Activation (f) | A non-linear function applied to the weighted sum |

The Core Equation

`output = activation(input @ weights + bias)`

This single equation is the building block of ALL neural networks. A layer is just many neurons computed in parallel using matrix multiplication.
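As a quick illustration (not part of the lesson's code), here is that equation evaluated by hand in NumPy, first for a single neuron and then for a small layer, using ReLU as the activation. The numbers and sizes are arbitrary:

```python
import numpy as np

x = np.array([0.5, -1.0, 2.0])           # one sample with 3 features

# A single neuron: one weight per input plus a bias
w = np.array([0.2, 0.4, -0.1])
b = 0.3
neuron_out = max(0.0, x @ w + b)         # activation(input . weights + bias)
print(neuron_out)                        # 0.0 (the weighted sum is -0.2, clipped by ReLU)

# A layer of 4 such neurons: the same equation as one matrix multiplication
W = np.random.randn(3, 4)                # each column holds one neuron's weights
b_vec = np.zeros(4)
layer_out = np.maximum(0, x @ W + b_vec)
print(layer_out.shape)                   # (4,): one output per neuron
```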

Building Blocks in NumPy

Let's implement each piece from scratch. We'll start with the individual components and then assemble them into a working network.

Dense (Fully Connected) Layer

A dense layer connects every input to every output. It stores weights and biases, and computes the linear transformation.

```python
import numpy as np


class DenseLayer:
    """A fully connected layer: output = input @ weights + bias"""

    def __init__(self, input_size: int, output_size: int):
        # He initialization — good default for ReLU networks
        # Scale by sqrt(2/fan_in) to keep variance stable across layers
        self.weights = np.random.randn(input_size, output_size) * np.sqrt(
            2.0 / input_size
        )
        self.bias = np.zeros((1, output_size))

    def forward(self, inputs: np.ndarray) -> np.ndarray:
        """
        inputs shape: (batch_size, input_size)
        output shape: (batch_size, output_size)
        """
        self.inputs = inputs  # Cache for backprop
        return inputs @ self.weights + self.bias
```
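A quick usage sketch (the batch and layer sizes here are just for illustration) shows how a batch of samples flows through a single layer:

```python
layer = DenseLayer(input_size=3, output_size=8)
batch = np.random.randn(4, 3)     # 4 samples, 3 features each
out = layer.forward(batch)
print(out.shape)                  # (4, 8): one 8-dimensional output per sample
```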

Activation Functions

Activation functions introduce non-linearity. Without them, stacking layers would be useless — a chain of linear transformations is just one big linear transformation. Let's implement the two most important ones.

```python
class ReLU:
    """Rectified Linear Unit: max(0, x)

    The most popular activation for hidden layers.
    Simple, fast, and works well in practice.
    """

    def forward(self, inputs: np.ndarray) -> np.ndarray:
        self.inputs = inputs  # Cache for backprop
        return np.maximum(0, inputs)


class Softmax:
    """Converts raw scores (logits) into probabilities.

    Used as the final activation for classification tasks.
    Output values are in [0, 1] and sum to 1.
    """

    def forward(self, inputs: np.ndarray) -> np.ndarray:
        # Subtract max for numerical stability (prevents overflow in exp)
        shifted = inputs - np.max(inputs, axis=1, keepdims=True)
        exp_values = np.exp(shifted)
        return exp_values / np.sum(exp_values, axis=1, keepdims=True)
```
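To see why the non-linearity is essential, here is a small check (not from the lesson itself) of the claim above: two stacked linear layers with no activation in between collapse into a single linear layer with combined weights and bias.

```python
import numpy as np

np.random.seed(0)
x = np.random.randn(4, 3)
W1, b1 = np.random.randn(3, 8), np.random.randn(1, 8)
W2, b2 = np.random.randn(8, 5), np.random.randn(1, 5)

# Two linear layers, no activation in between
two_layers = (x @ W1 + b1) @ W2 + b2

# Algebraically the same map, written as one linear layer
W_combined = W1 @ W2
b_combined = b1 @ W2 + b2
one_layer = x @ W_combined + b_combined

print(np.allclose(two_layers, one_layer))  # True
```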

Putting It Together: A Simple Neural Network

Now let's combine these pieces into a complete network. This is a 3-layer classifier that takes in feature vectors and outputs class probabilities.

```python
class SimpleNN:
    """A simple feedforward neural network.

    Architecture: Input -> Dense -> ReLU -> Dense -> ReLU -> Dense -> Softmax
    """

    def __init__(self, input_size: int, hidden_size: int, output_size: int):
        self.layer1 = DenseLayer(input_size, hidden_size)
        self.activation1 = ReLU()
        self.layer2 = DenseLayer(hidden_size, hidden_size)
        self.activation2 = ReLU()
        self.layer3 = DenseLayer(hidden_size, output_size)
        self.softmax = Softmax()

    def forward(self, X: np.ndarray) -> np.ndarray:
        """Forward pass: push data through all layers."""
        out = self.layer1.forward(X)
        out = self.activation1.forward(out)
        out = self.layer2.forward(out)
        out = self.activation2.forward(out)
        out = self.layer3.forward(out)
        out = self.softmax.forward(out)
        return out


# --- Demo: classify random data ---
np.random.seed(42)
X = np.random.randn(4, 3)  # 4 samples, 3 features each
network = SimpleNN(input_size=3, hidden_size=8, output_size=3)

probabilities = network.forward(X)
print("Input shape: ", X.shape)              # (4, 3)
print("Output shape:", probabilities.shape)  # (4, 3)
print("\nPredicted probabilities:")
print(probabilities)
print("\nRow sums (should be ~1.0):", probabilities.sum(axis=1))
print("Predicted classes:", np.argmax(probabilities, axis=1))
```

What About Backpropagation?

The forward pass computes predictions. To actually **train** the network, we need backpropagation — an algorithm that computes how much each weight contributed to the error, then nudges weights in the direction that reduces the loss. Backprop uses the **chain rule** from calculus to efficiently compute gradients layer by layer, moving backwards from the output to the input. We'll use TensorFlow's automatic differentiation (GradientTape) in the next lesson instead of implementing backprop manually, but understanding that it's just repeated chain-rule application is the key insight.
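To make that concrete, here is a minimal sketch (not the implementation we will use later) of what those chain-rule steps look like for the layers built above. The class names are hypothetical, and each backward method receives d_out, the gradient of the loss with respect to that layer's output:

```python
class DenseLayerBackward(DenseLayer):
    def backward(self, d_out: np.ndarray) -> np.ndarray:
        # Chain rule for output = inputs @ weights + bias
        self.d_weights = self.inputs.T @ d_out           # dL/dW
        self.d_bias = d_out.sum(axis=0, keepdims=True)   # dL/db
        return d_out @ self.weights.T                    # dL/d(inputs), passed to the previous layer


class ReLUBackward(ReLU):
    def backward(self, d_out: np.ndarray) -> np.ndarray:
        # Gradient flows only where the forward input was positive
        return d_out * (self.inputs > 0)
```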

The Forward Pass Step by Step

Let's trace data through our network to make the flow concrete:

1. Input: A batch of feature vectors, shape (batch_size, input_size)
2. Layer 1: Linear transform X @ W1 + b1 — projects input to hidden dimension
3. ReLU: Zeros out negatives — introduces non-linearity
4. Layer 2: Another linear transform — learns more complex combinations
5. ReLU: More non-linearity
6. Layer 3: Final linear transform — projects to number of classes
7. Softmax: Converts raw scores to probabilities

Each layer's output becomes the next layer's input. The whole thing is just a chain of simple operations.
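One way to make the trace tangible (a throwaway snippet that reuses network and X from the demo above) is to print the shape after every step:

```python
out = X
print("input        ", out.shape)   # (4, 3)
out = network.layer1.forward(out)
print("after layer1 ", out.shape)   # (4, 8)
out = network.activation1.forward(out)
print("after relu1  ", out.shape)   # (4, 8), negatives zeroed, shape unchanged
out = network.layer2.forward(out)
print("after layer2 ", out.shape)   # (4, 8)
out = network.activation2.forward(out)
print("after relu2  ", out.shape)   # (4, 8)
out = network.layer3.forward(out)
print("after layer3 ", out.shape)   # (4, 3), one raw score per class
out = network.softmax.forward(out)
print("after softmax", out.shape)   # (4, 3), rows now sum to 1
```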

Why Depth Matters

A single layer (linear + activation) can only learn simple decision boundaries. By stacking layers:

  • Layer 1 learns basic features (edges, simple patterns)
  • Layer 2 combines basic features into more complex ones
  • Layer 3 combines complex features into final decisions

This hierarchical feature learning is what makes deep networks so powerful.