Machine Learning: Reinforcement Learning
Reinforcement Learning (RL) is a type of Machine Learning where an agent learns to make decisions by interacting with an environment. The agent does not learn from labeled examples. Instead, it receives rewards for good decisions and penalties for bad ones. Over many interactions, it discovers the strategy that maximizes total cumulative reward.
The Core Framework
Six Essential Components:
┌──────────────────┬──────────────────────────────────────────────────┐
│ Component        │ Description                                      │
├──────────────────┼──────────────────────────────────────────────────┤
│ Agent            │ The learner and decision-maker (the RL model)    │
│ Environment      │ The world the agent operates in                  │
│ State (S)        │ The current situation the agent observes         │
│ Action (A)       │ A move the agent can make in the current state   │
│ Reward (R)       │ Numeric feedback after an action (+ve or -ve)    │
│ Policy (π)       │ The agent's strategy: for each state, which      │
│                  │ action to take                                   │
└──────────────────┴──────────────────────────────────────────────────┘
The RL Loop
Reinforcement Learning Interaction Cycle:
┌─────────────────────────────┐
│ │
▼ │
Agent │
│ │
Observes State (St) │
│ │
Chooses Action (At) │
│ │
▼ │
Environment │
│ │
Returns: │
New State (St+1) │
Reward (Rt+1) │
│ │
└────► Agent updates policy ──┘
This loop repeats until the episode ends (goal reached or time limit).
Over many episodes, the agent's policy improves.
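The loop above can be sketched in a few lines of Python. This is a minimal illustration with a hypothetical five-cell corridor environment (the environment, state encoding, and reward scheme are invented for the example), driven by a purely random policy:

```python
import random

class CorridorEnv:
    """Toy environment: five cells in a row; cell 4 is the goal."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action: 0 = move left, 1 = move right (clamped to the corridor)
        move = 1 if action == 1 else -1
        self.state = max(0, min(4, self.state + move))
        done = self.state == 4
        reward = 1.0 if done else 0.0      # reward only on reaching the goal
        return self.state, reward, done

env = CorridorEnv()
state = env.reset()
total_reward, done = 0.0, False
while not done:
    action = random.choice([0, 1])         # a purely random policy
    state, reward, done = env.step(action)
    total_reward += reward                 # a learning agent would also
                                           # update its policy here
print(total_reward)  # 1.0 once the goal cell is finally reached
```

A real agent would replace `random.choice` with its policy and update that policy from each (state, action, reward, next state) transition.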
Real-Life Analogy: Learning to Play Chess
Agent:        The chess player (RL model)
Environment:  The chess board and rules
State:        Current positions of all pieces
Actions:      All legal moves from this position
Reward:       +1 for winning, -1 for losing, 0 for a draw
Policy:       The strategy the player builds over thousands of games

The agent does not start out knowing good moves. It plays random moves,
occasionally wins, and receives +1. Over time, it discovers which moves
lead to wins; eventually it learns opening theory, tactics, and endgames.
This is exactly how AlphaZero (DeepMind) learned chess: by playing
against itself millions of times, with no human games at all.
Key Concepts
Return (Cumulative Reward)
The agent does not just optimize for immediate reward. It optimizes for
the TOTAL reward across the entire episode.

Example: A robot navigating a maze.
  Step 1:  Move forward → Reward = 0
  Step 2:  Turn left    → Reward = 0
  Step 3:  Move forward → Reward = 0
  ...
  Step 47: Reach exit   → Reward = +100
  Total Return = 0 + 0 + 0 + ... + 100 = +100

Without considering future reward, the agent would be paralyzed at every
step, because every immediate reward is zero.

Discounted Return (G):
Rewards farther in the future are worth less (like compound interest in
reverse).
  G = R1 + γ×R2 + γ²×R3 + γ³×R4 + ...
  γ (gamma) = discount factor, typically 0.9–0.99
  γ = 0: Only cares about immediate reward (very short-sighted)
  γ = 1: Cares equally about all future rewards (no discounting)
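The discounted return is straightforward to compute in code. A minimal sketch in Python, applied to the 47-step maze episode above:

```python
def discounted_return(rewards, gamma=0.99):
    """G = R1 + γ·R2 + γ²·R3 + ..., computed backwards from the end."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# The 47-step maze episode: 46 rewards of 0, then +100 at the exit.
rewards = [0.0] * 46 + [100.0]
print(discounted_return(rewards, gamma=1.0))            # 100.0 (no discounting)
print(round(discounted_return(rewards, gamma=0.9), 2))  # 0.79: the +100 is
                                                        # 46 steps away, so it
                                                        # is scaled by 0.9^46
```

Computing backwards from the last reward avoids recomputing powers of γ and is the standard trick for this calculation.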
Exploration vs Exploitation
One of the fundamental dilemmas in RL:

Exploitation: Use what you already know works. Always pick the best
known action. Risk: you never discover better actions.

Exploration: Try new, unknown actions. You may find better strategies.
Risk: short-term performance suffers.

Analogy: finding a restaurant in a new city.
  Exploitation: go to the restaurant you already love.
  Exploration:  try a random new restaurant (might be better or worse).

ε-Greedy Strategy (the most common solution):
  With probability ε     → choose a random action (explore)
  With probability (1-ε) → choose the best known action (exploit)

ε starts high (e.g., 1.0) and decays over time:
  Early training: high ε → mostly random exploration
  Late training:  low ε  → mostly exploit learned knowledge
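An ε-greedy selector fits in a few lines. A minimal sketch in Python (the Q-values and the decay constants are invented for illustration):

```python
import random

def epsilon_greedy(q_values, epsilon):
    """With probability ε explore; otherwise exploit the best known action."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                  # explore
    return max(range(len(q_values)), key=q_values.__getitem__)  # exploit

q = [0.1, 0.8, 0.3]                    # invented Q-values for three actions
print(epsilon_greedy(q, epsilon=0.0))  # 1: pure exploitation picks action 1

# A typical decay schedule: multiply ε each episode, with a floor.
epsilon = 1.0
for episode in range(100):
    epsilon = max(0.05, epsilon * 0.99)
```

With ε = 1.0 every action is random; as ε decays toward the 0.05 floor the agent relies more and more on its learned Q-values.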
Q-Learning
Q-Learning is one of the foundational RL algorithms.
It learns a function Q(s, a) — the Q-value — which estimates
the total future reward from taking action a in state s.
Q-Table: A table with one row per state and one column per action.
Initialized to zero.
Updated after every action using the Bellman Equation.
Bellman Equation (Q-value update rule):
Q(s, a) ← Q(s, a) + α × [R + γ × max Q(s', a') - Q(s, a)]
Where:
α = learning rate (how quickly to update Q-values)
R = reward received after taking action a in state s
s' = new state after the action
max Q(s', a') = best Q-value available in the new state
Example — Simple Grid World (3×3):
Grid: S=Start, G=Goal, W=Wall
S . .
. W .
. . G
Q-Table (simplified, after training):
┌──────────┬──────┬──────┬────────┬────────┐
│ State │ Up │ Down │ Left │ Right │
├──────────┼──────┼──────┼────────┼────────┤
│ (0,0) S │ 0 │ 0.5 │ 0 │ 0.6 │ ← Go Right or Down
│ (0,1) │ 0 │ 0.4 │ 0.5 │ 0.8 │ ← Go Right
│ (0,2) │ 0 │ 0.9 │ 0.3 │ 0 │ ← Go Down
│ (1,2) │ 0 │ 1.0 │ 0 │ 0 │ ← Go Down (→ Goal!)
└──────────┴──────┴──────┴────────┴────────┘
The agent always picks the action with the highest Q-value.
This leads it from Start to Goal efficiently.
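The whole algorithm can be sketched on this same 3×3 grid. A minimal tabular Q-learning loop, assuming a reward of +1 on reaching the goal and 0 otherwise (the α, γ, and ε values are illustrative choices, not the only reasonable ones):

```python
import random

# 3x3 grid from the example: S=(0,0), wall at (1,1), goal G=(2,2).
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # Up, Down, Left, Right
WALL, GOAL = (1, 1), (2, 2)

def step(state, a):
    r, c = state[0] + ACTIONS[a][0], state[1] + ACTIONS[a][1]
    # Bumping into a wall or the grid edge leaves the state unchanged.
    nxt = (r, c) if 0 <= r <= 2 and 0 <= c <= 2 and (r, c) != WALL else state
    return nxt, (1.0 if nxt == GOAL else 0.0), nxt == GOAL

Q = {(r, c): [0.0] * 4 for r in range(3) for c in range(3)}  # Q-table
alpha, gamma, eps = 0.5, 0.9, 0.2

for _ in range(2000):                          # training episodes
    s, done = (0, 0), False
    while not done:
        if random.random() < eps:
            a = random.randrange(4)                   # explore
        else:
            a = max(range(4), key=Q[s].__getitem__)   # exploit
        s2, r, done = step(s, a)
        # Bellman update: Q(s,a) += α [R + γ max_a' Q(s',a') - Q(s,a)]
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2

print(max(range(4), key=Q[(1, 2)].__getitem__))  # 1: Down, straight to the goal
```

After training, reading the argmax of each row of `Q` recovers a table like the one above: Q((1,2), Down) converges to 1.0 because Down reaches the goal immediately.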
Deep Q-Network (DQN)
Problem with Q-tables:
Real environments have millions of states, or infinitely many.
(Chess alone has roughly 10^43 possible positions.)
A table cannot store all of these states.
DQN solution:
Replace the Q-table with a Neural Network.
Input: State (raw pixels, sensor readings, etc.)
Output: Q-value for every possible action
Neural Network: State → [Q(a1), Q(a2), Q(a3), ...]
Agent selects the action with the highest Q-value output.
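A minimal sketch of that input/output shape, with a single linear layer standing in for the deep network (the weights are random and untrained, and the state features are invented for illustration):

```python
import random

random.seed(0)
n_features, n_actions = 4, 3

# Stand-in for the Q-network: one linear layer with random weights.
# (A real DQN is a deep neural network trained by gradient descent.)
W = [[random.gauss(0, 1) for _ in range(n_actions)] for _ in range(n_features)]

def q_values(state):
    # One Q-value per action: Q[a] = sum_i state[i] * W[i][a]
    return [sum(state[i] * W[i][a] for i in range(n_features))
            for a in range(n_actions)]

state = [0.5, -1.0, 2.0, 0.1]        # e.g. preprocessed sensor readings
q = q_values(state)
action = q.index(max(q))             # select the highest-Q action
print(len(q), action)
```

The key point is the shape: one forward pass maps a state to a Q-value for every action at once, so action selection is a single argmax rather than a table lookup.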
DQN Key Improvements (DeepMind 2015):
1. Experience Replay:
Store past (s, a, r, s') transitions in a buffer.
Sample random mini-batches to train the network.
Breaks correlation between consecutive experiences.
2. Target Network:
A separate, slowly-updated network computes target Q-values.
Prevents unstable training from chasing a moving target.
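Experience replay is easy to sketch. A minimal buffer, assuming transitions are stored as plain tuples (a real DQN would hold preprocessed observations and feed each sampled mini-batch to the network):

```python
import random
from collections import deque

class ReplayBuffer:
    """Store (s, a, r, s', done) transitions; sample random mini-batches."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)   # old transitions fall out

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=100)
for t in range(50):                     # invented transitions for illustration
    buf.push(t, 0, 0.0, t + 1, False)
batch = buf.sample(8)
print(len(batch))  # 8: a random mini-batch, breaking temporal correlation
```

Because the eight sampled transitions come from scattered points in time, consecutive training batches are far less correlated than consecutive environment steps.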
DQN Achievement: Learned to play 49 Atari video games from
raw pixels, reaching human-level performance on most.
Policy Gradient Methods
Q-Learning learns value functions (how good is each action).
Policy Gradient directly learns the policy (what action to take).
The policy is modeled as a neural network:
Input: State
Output: Probability distribution over actions
Training:
If an action led to high return → increase its probability
If an action led to low return → decrease its probability
REINFORCE Algorithm (simplest policy gradient):
Step 1: Run a full episode using current policy
Step 2: Calculate total return for each step
Step 3: Update policy network weights to make high-return
actions more likely in those states
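Step 2 above can be sketched directly. A minimal per-step return (reward-to-go) computation, using an invented 3-step episode; in Step 3 these G values would scale the policy network's log-probability gradients:

```python
def returns_per_step(rewards, gamma=0.99):
    """Reward-to-go G_t for every step of one episode, from the end back."""
    g, out = 0.0, []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return out[::-1]           # restore chronological order

# Invented episode: no reward until the final step.
print(returns_per_step([0.0, 0.0, 1.0], gamma=0.5))  # [0.25, 0.5, 1.0]
```

Every action in the episode gets credit proportional to the discounted reward that followed it, which is exactly the weighting REINFORCE applies when making high-return actions more likely.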
Popular Advanced Methods:
PPO (Proximal Policy Optimization) → used by ChatGPT RLHF
A3C (Asynchronous Advantage Actor-Critic)
SAC (Soft Actor-Critic) → robotics and continuous control
Famous RL Applications
┌────────────────────────────┬───────────────────────────────────────┐
│ Application                │ Detail                                │
├────────────────────────────┼───────────────────────────────────────┤
│ AlphaGo / AlphaZero        │ Mastered Go, Chess, Shogi from zero;  │
│ (DeepMind, 2016–2017)      │ beat world champions                  │
│ OpenAI Five                │ Beat top teams in Dota 2              │
│ ChatGPT (RLHF)             │ Human feedback used as reward signal  │
│                            │ to make responses more helpful        │
│ Robotic Arms               │ Learn to grasp, assemble, navigate    │
│ Self-Driving Cars          │ Lane keeping, intersection decisions  │
│ Data Center Cooling        │ Google cut cooling energy by 40%      │
│ (DeepMind)                 │ using RL to control HVAC systems      │
│ Algorithmic Trading        │ Maximize portfolio returns over time  │
└────────────────────────────┴───────────────────────────────────────┘
RL vs Supervised vs Unsupervised Learning
┌───────────────────────┬─────────────────────┬──────────────────────────┐
│ Aspect                │ Supervised / Unsup. │ Reinforcement Learning   │
├───────────────────────┼─────────────────────┼──────────────────────────┤
│ Data required         │ Static dataset      │ Environment interaction  │
│ Labels needed?        │ Yes (supervised)    │ No (uses reward signal)  │
│ Learning signal       │ Loss function       │ Cumulative reward        │
│ Best for              │ Fixed prediction    │ Sequential decisions     │
│                       │ tasks               │ and control tasks        │
│ Time dimension        │ No                  │ Yes (actions over time)  │
└───────────────────────┴─────────────────────┴──────────────────────────┘
