Reinforcement Learning Foundations
Reinforcement Learning (RL) is fundamentally different from supervised and unsupervised learning. Instead of learning from a fixed dataset, an agent learns by interacting with an environment, taking actions, and receiving rewards.
The RL Loop
Agent --action--> Environment
Agent <--state-- Environment
Agent <--reward-- Environment
At each time step *t*:

1. The agent observes state s_t
2. The agent chooses action a_t
3. The environment transitions to state s_{t+1}
4. The agent receives reward r_{t+1}
The goal: learn a policy (strategy) that maximizes the total cumulative reward over time.
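The loop above can be sketched in a few lines of Python. Everything here (`ToyEnv`, its `reset`/`step` methods, the reward values) is illustrative, not from any RL library: the state is an integer position on a line, and the agent follows a random policy.

```python
import random

class ToyEnv:
    """A toy environment: state is an integer 0..4; reaching 4 ends the episode."""

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        """action: +1 (right) or -1 (left). Reward -1 per step, +10 at the goal."""
        self.state = max(0, min(4, self.state + action))
        done = self.state == 4
        reward = 10 if done else -1
        return self.state, reward, done

env = ToyEnv()
state = env.reset()
total_reward = 0
done = False
while not done:
    action = random.choice([-1, 1])          # a random policy, for illustration
    state, reward, done = env.step(action)   # environment transitions, emits reward
    total_reward += reward
print("Return for this episode:", total_reward)
```

Even with a random policy, the structure is the same one every RL algorithm uses: observe, act, receive reward, repeat.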
Markov Decision Process (MDP)
States, Actions, and Rewards
States describe the current situation. In a chess game, the state is the board configuration. In a self-driving car, it's the sensor readings.
Actions are what the agent can do. In chess, moving a piece. In a car, steering angle and acceleration.
Rewards are scalar feedback signals. Positive rewards encourage behavior, negative rewards discourage it. Designing good reward functions is one of the hardest parts of RL.
| Component | Chess Example | CartPole Example |
|---|---|---|
| State | Board position | Cart position, pole angle, velocities |
| Action | Move a piece | Push left or right |
| Reward | +1 win, -1 lose, 0 otherwise | +1 for each step the pole stays up |
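Formally, an MDP bundles these components (plus a discount factor) into one object. A minimal sketch of the CartPole row of the table, using an illustrative `MDP` dataclass whose field names are our own invention:

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class MDP:
    """Hypothetical container for MDP components; field names are illustrative."""
    states: str                       # description of the state space
    actions: Sequence[str]            # available actions
    reward: Callable[..., float]      # reward function R(s, a, s')
    gamma: float                      # discount factor

cartpole = MDP(
    states="cart position, cart velocity, pole angle, pole angular velocity",
    actions=["push left", "push right"],
    reward=lambda s, a, s_next: 1.0,  # +1 for every step the pole stays up
    gamma=0.99,
)
print(cartpole.actions)
```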
Value Functions
The state value function V(s) tells us how good it is to be in a state:
V(s) = E[R_t + gamma * R_{t+1} + gamma^2 * R_{t+2} + ... | S_t = s]
The action-value function Q(s, a) tells us how good it is to take an action in a state:
Q(s, a) = E[R_t + gamma * R_{t+1} + ... | S_t = s, A_t = a]
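Both definitions are expectations over the *discounted return*. A small helper (illustrative, not from any library) makes the discounting concrete:

```python
def discounted_return(rewards, gamma=0.9):
    """Sum r_0 + gamma * r_1 + gamma^2 * r_2 + ..."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

# Three steps of -1 followed by +10 at the goal:
print(round(discounted_return([-1, -1, -1, 10], gamma=0.9), 2))
```

The value functions are exactly this quantity, averaged over all trajectories the policy and environment dynamics can produce.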
The Bellman Equation
The Bellman equation expresses the relationship between the value of a state and the values of successor states:
V(s) = max_a [R(s, a) + gamma * sum_{s'} P(s'|s,a) * V(s')]
This recursive relationship is the foundation of most RL algorithms. It says: the value of a state equals the best immediate reward plus the discounted value of where you end up.
```python
import numpy as np

# Simple GridWorld Value Iteration Example
# Grid: 4x4, goal at (3,3), reward = -1 per step, +10 at goal

def value_iteration(grid_size=4, gamma=0.9, theta=1e-6):
    """Compute optimal values using the Bellman equation."""
    V = np.zeros((grid_size, grid_size))
    actions = [(0, 1), (0, -1), (1, 0), (-1, 0)]  # right, left, down, up
    goal = (grid_size - 1, grid_size - 1)

    while True:
        delta = 0
        for i in range(grid_size):
            for j in range(grid_size):
                if (i, j) == goal:
                    continue  # Goal state has value 0

                v_old = V[i, j]
                values = []
                for di, dj in actions:
                    ni = max(0, min(i + di, grid_size - 1))
                    nj = max(0, min(j + dj, grid_size - 1))
                    reward = 10 if (ni, nj) == goal else -1
                    values.append(reward + gamma * V[ni, nj])

                V[i, j] = max(values)
                delta = max(delta, abs(v_old - V[i, j]))

        if delta < theta:
            break

    return V

V = value_iteration()
print("Optimal Value Function:")
print(np.round(V, 1))
```

Discount Factor (gamma)
The discount factor gamma (between 0 and 1) controls how much the agent values future rewards relative to immediate ones. A gamma near 0 makes the agent myopic, caring almost only about immediate reward; a gamma near 1 makes it far-sighted, weighting distant rewards nearly as heavily as immediate ones.
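A quick comparison makes the effect concrete. Assuming a constant reward of +1 per step over 100 steps (`discounted_sum` is an illustrative helper, not a library function):

```python
def discounted_sum(rewards, gamma):
    """Sum r_0 + gamma * r_1 + gamma^2 * r_2 + ..."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

rewards = [1.0] * 100  # +1 per step for 100 steps
for gamma in (0.0, 0.5, 0.9, 0.99):
    print(f"gamma={gamma}: discounted return = {discounted_sum(rewards, gamma):.2f}")
```

With gamma = 0 only the first reward counts; with gamma = 0.99 the agent effectively looks roughly 100 steps ahead (the geometric series sums to about 1/(1 - gamma)).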
Exploration vs Exploitation
This is the fundamental dilemma in RL: should the agent *exploit* the action it currently believes is best, or *explore* other actions that might turn out to be better?
If you only exploit, you might miss the best strategy. If you only explore, you never benefit from what you've learned.
Epsilon-Greedy Strategy
```python
import numpy as np

class EpsilonGreedy:
    """Epsilon-greedy action selection with decay."""

    def __init__(self, n_actions, epsilon=1.0, epsilon_min=0.01, decay=0.995):
        self.n_actions = n_actions
        self.epsilon = epsilon
        self.epsilon_min = epsilon_min
        self.decay = decay

    def select_action(self, q_values):
        """Select action using epsilon-greedy policy."""
        if np.random.random() < self.epsilon:
            return np.random.randint(self.n_actions)  # Explore
        return np.argmax(q_values)  # Exploit

    def decay_epsilon(self):
        """Decay epsilon after each episode."""
        self.epsilon = max(self.epsilon_min, self.epsilon * self.decay)

# Demo
strategy = EpsilonGreedy(n_actions=4)
fake_q_values = np.array([1.2, 3.5, 0.8, 2.1])

for episode in range(5):
    action = strategy.select_action(fake_q_values)
    print(f"Episode {episode}: epsilon={strategy.epsilon:.3f}, action={action}")
    strategy.decay_epsilon()
```