RL Applications & Tools
Now that you understand RL algorithms, let's explore the ecosystem and real-world applications.
OpenAI Gym / Gymnasium
Gymnasium (formerly OpenAI Gym) is the standard API for RL environments. Every environment follows the same interface:
```python
import gymnasium as gym

env = gym.make("CartPole-v1")
state, info = env.reset()

for _ in range(1000):
    action = env.action_space.sample()  # Random action
    next_state, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
    if done:
        state, info = env.reset()
    else:
        state = next_state

env.close()
```
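The reset/step contract is easy to mimic without any dependencies. As a minimal sketch (this `Countdown` class is invented for illustration and is not part of Gymnasium), an environment only needs `reset` returning `(observation, info)` and `step` returning the five-tuple:

```python
class Countdown:
    """A toy environment that mimics the Gymnasium reset/step contract."""

    def __init__(self, start=5):
        self.start = start

    def reset(self):
        self.state = self.start
        return self.state, {}  # (observation, info)

    def step(self, action):
        # action = 1 decrements the counter; the episode ends at zero
        self.state -= action
        terminated = self.state <= 0
        reward = 1.0 if terminated else 0.0
        truncated = False  # no time limit in this toy example
        return self.state, reward, terminated, truncated, {}

env = Countdown()
state, info = env.reset()
done = False
while not done:
    state, reward, terminated, truncated, info = env.step(1)
    done = terminated or truncated
print(state, reward)  # 0 1.0
```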
Popular Environments
| Environment | Type | State | Actions |
|---|---|---|---|
| CartPole-v1 | Classic Control | 4D continuous | 2 discrete |
| LunarLander-v2 | Box2D | 8D continuous | 4 discrete |
| Pendulum-v1 | Classic Control | 3D continuous | 1D continuous |
| MountainCar-v0 | Classic Control | 2D continuous | 3 discrete |
| Ant-v4 | MuJoCo | 111D continuous | 8D continuous |
Stable Baselines3
Stable Baselines3 (SB3) provides reliable, well-tested implementations of popular RL algorithms. It's the go-to library for applying RL without implementing algorithms from scratch.
Supported Algorithms
SB3 implements the main model-free algorithms: PPO, A2C, DQN, SAC, TD3, and DDPG. A typical workflow looks like this:

```python
from stable_baselines3 import PPO
import gymnasium as gym

# Create environment
env = gym.make("CartPole-v1")

# Create and train agent (just 2 lines!)
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=50_000)

# Evaluate with a deterministic policy
obs, _ = env.reset()
total_reward = 0.0
done = False

while not done:
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, _ = env.step(action)
    total_reward += reward
    done = terminated or truncated

print(f"Total reward: {total_reward}")

# Save and load
model.save("ppo_cartpole")
loaded_model = PPO.load("ppo_cartpole")
```

Key SB3 features demonstrated:

- Easy training: `model.learn(total_timesteps=50_000)`
- Built-in logging: `verbose=1` shows training progress
- Deterministic evaluation: `model.predict(obs, deterministic=True)`
- Save/load: `model.save()` and `PPO.load()`
- Custom policies: `PPO("MlpPolicy", ...)` or custom network architectures
- Callbacks: `EvalCallback` and `CheckpointCallback` for monitoring

RLHF - Reinforcement Learning from Human Feedback
RLHF for LLMs
The RLHF pipeline for training language models:
Step 1: Supervised Fine-Tuning (SFT) on human-written demonstrations.
Step 2: Reward Model Training on human preference comparisons between model responses.
Step 3: PPO Fine-Tuning against the reward model, with a KL penalty that keeps the policy close to the SFT model.
This is why PPO knowledge is directly relevant to understanding how modern AI assistants are trained!
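The training signals in steps 2 and 3 can be sketched numerically. This is a minimal, illustrative sketch (the function names and the `beta` value are ours, not from any specific library): the reward model is typically fit on pairwise preferences with a Bradley-Terry loss, and PPO then optimizes the reward-model score minus a KL penalty toward the reference (SFT) model.

```python
import math

def reward_model_loss(r_chosen, r_rejected):
    # Step 2: Bradley-Terry pairwise loss, -log sigmoid(r_chosen - r_rejected).
    # Minimizing it pushes the reward model to score preferred responses higher.
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

def rlhf_reward(rm_score, logp_policy, logp_ref, beta=0.1):
    # Step 3: the KL penalty discourages drifting far from the SFT reference.
    return rm_score - beta * (logp_policy - logp_ref)

# Scoring the preferred response higher lowers the loss:
print(reward_model_loss(2.0, 0.0) < reward_model_loss(0.0, 2.0))  # True

# A policy that assigns much higher log-probability than the reference pays a penalty:
print(rlhf_reward(1.0, -0.5, -2.5))  # 1.0 - 0.1 * 2.0 = 0.8
```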
Robotics
RL is increasingly used in robotics for tasks that are hard to program explicitly: legged locomotion, dexterous manipulation, and grasping. Because real-world samples are expensive, policies are usually trained in simulation and then transferred to hardware (sim-to-real).
Game AI
RL has achieved superhuman performance in many games: Atari from raw pixels (DQN), Go and chess (AlphaGo, AlphaZero), StarCraft II (AlphaStar), and Dota 2 (OpenAI Five).
Multi-Agent RL
When multiple agents learn and act in the same environment, new challenges appear: the environment becomes non-stationary from each agent's perspective, and settings may be cooperative, competitive, or mixed. In competitive games, self-play, where an agent trains against copies of itself, is a common recipe.
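As a toy illustration of multi-agent dynamics (a generic fictitious-play sketch, not any library's API): two agents repeatedly best-respond to each other's empirical move history in rock-paper-scissors, and their empirical play drifts toward the uniform Nash equilibrium.

```python
# Payoffs for rock-paper-scissors (0=rock, 1=paper, 2=scissors):
# PAYOFF[a][b] is the row player's payoff for playing a against b.
PAYOFF = [[ 0, -1,  1],
          [ 1,  0, -1],
          [-1,  1,  0]]

def best_response(opp_counts):
    """Best response to the opponent's empirical move distribution."""
    total = sum(opp_counts)
    values = [sum(PAYOFF[a][b] * opp_counts[b] / total for b in range(3))
              for a in range(3)]
    return max(range(3), key=lambda a: values[a])

# Fictitious play: each agent best-responds to the other's move history.
counts = [[1, 0, 0], [0, 1, 0]]  # seed each history with one arbitrary move
for _ in range(9000):
    a = best_response(counts[1])
    b = best_response(counts[0])
    counts[0][a] += 1
    counts[1][b] += 1

freqs = [c / sum(counts[0]) for c in counts[0]]
print([round(f, 2) for f in freqs])  # drifts toward uniform (1/3, 1/3, 1/3)
```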
Creating a Custom Gymnasium Environment
To apply RL to your own problem, subclass `gym.Env` and define the action space, observation space, `reset`, and `step`:

```python
import gymnasium as gym
from gymnasium import spaces
import numpy as np

class SimpleTrading(gym.Env):
    """Custom trading environment."""

    metadata = {"render_modes": ["human"]}

    def __init__(self, prices, initial_balance=10000):
        super().__init__()
        self.prices = prices
        self.initial_balance = initial_balance

        # Action: 0=hold, 1=buy, 2=sell
        self.action_space = spaces.Discrete(3)

        # Observation: [balance, shares_held, current_price, price_change]
        self.observation_space = spaces.Box(
            low=-np.inf, high=np.inf, shape=(4,), dtype=np.float32
        )

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.balance = self.initial_balance
        self.shares = 0
        self.current_step = 0
        self.portfolio = self.initial_balance
        return self._get_obs(), {}

    def _get_obs(self):
        price = self.prices[self.current_step]
        change = 0 if self.current_step == 0 else (
            price - self.prices[self.current_step - 1]
        ) / self.prices[self.current_step - 1]
        return np.array([
            self.balance, self.shares, price, change
        ], dtype=np.float32)

    def step(self, action):
        price = self.prices[self.current_step]

        if action == 1 and self.balance >= price:  # Buy
            self.shares += 1
            self.balance -= price
        elif action == 2 and self.shares > 0:  # Sell
            self.shares -= 1
            self.balance += price

        self.current_step += 1
        terminated = self.current_step >= len(self.prices) - 1
        truncated = False

        # Reward: change in portfolio value since the previous step
        portfolio = self.balance + self.shares * self.prices[self.current_step]
        reward = portfolio - self.portfolio
        self.portfolio = portfolio

        return self._get_obs(), reward, terminated, truncated, {}

# Usage with SB3:
# env = SimpleTrading(prices=stock_data)
# model = PPO("MlpPolicy", env, verbose=1)
# model.learn(total_timesteps=100_000)
```

The key components of any custom environment are `__init__` (which defines `observation_space` and `action_space`), `reset`, and `step`.