
🎮 Q-Learning

Temporal Difference Learning for Optimal Control

💡 What is Q-Learning?

Q-Learning is a model-free, off-policy reinforcement learning algorithm that learns the optimal action-value function directly from experience. It is one of the most fundamental and widely used algorithms in RL, forming the foundation for modern deep reinforcement learning methods like DQN.

The Key Idea: Instead of learning a model of the environment, Q-Learning directly learns which action is best in each state by bootstrapping — updating estimates based on other estimates. It uses the Bellman optimality equation as an update rule to iteratively improve the Q-values until convergence.
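The core update can be sketched in a few lines of NumPy. This is a minimal tabular illustration, not the page's own implementation; the toy sizes, reward, and transition below are hypothetical.

```python
import numpy as np

# Hypothetical toy setup: 4 states, 2 actions, tabular Q-values.
n_states, n_actions = 4, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.5, 0.9  # learning rate and discount factor

def q_update(Q, s, a, r, s_next):
    """One Q-Learning step: move Q(s, a) toward the Bellman target
    r + gamma * max_a' Q(s', a')."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

# Single illustrative transition: take action 1 in state 0,
# receive reward 1.0, land in state 2.
Q = q_update(Q, s=0, a=1, r=1.0, s_next=2)
```

Repeating this update over many transitions drives the Q-table toward the fixed point of the Bellman optimality equation.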


🎯 Key Takeaways

1. Bellman Optimality

The optimal Q-function satisfies a recursive equation, Q*(s, a) = E[r + γ max over a' of Q*(s', a')], which Q-Learning uses directly as its update target

2. Temporal Difference

Learn from incomplete episodes by bootstrapping from current estimates of future value
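To make the bootstrapping idea concrete, here is a hedged sketch of forming a TD target mid-episode; the Q-values and transition are made up for illustration.

```python
import numpy as np

# Hypothetical Q-value estimates partway through training.
gamma = 0.9
Q = np.array([[0.2, 0.6],
              [0.1, 0.3]])

# Transition observed: (s=0, a=1, r=0.5, s_next=1). The episode has
# not ended, yet we can already form a learning target by
# bootstrapping from the current estimate max_a' Q(s', a').
r, s, a, s_next = 0.5, 0, 1, 1
td_target = r + gamma * np.max(Q[s_next])  # bootstrapped target
td_error = td_target - Q[s, a]             # signal used for the update
```

Unlike Monte Carlo methods, no full episode return is needed: the current estimate of the next state's value stands in for the rest of the trajectory.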

3. Off-Policy Learning

Learn the optimal policy while following an exploratory behavior policy (like ε-greedy)
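A minimal ε-greedy behavior policy might look like the following sketch (function name and toy values are assumptions, not part of the page). Note the off-policy aspect: the Q-Learning update always uses the max over next actions, regardless of which action this policy actually samples.

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(Q_row, epsilon, rng):
    """Behavior policy: explore uniformly with probability epsilon,
    otherwise exploit the current greedy action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(Q_row)))  # explore: random action
    return int(np.argmax(Q_row))              # exploit: greedy action

# With epsilon = 0 the choice is always greedy.
Q_row = np.array([0.1, 0.9, 0.4])
action = epsilon_greedy(Q_row, epsilon=0.0, rng=rng)
```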

4. Convergence Guarantees

Provably converges to the optimal Q-values provided every state-action pair is visited infinitely often and the learning rate decays appropriately (the Robbins-Monro conditions)
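One common schedule that meets the convergence conditions (the step sizes sum to infinity while their squares sum to a finite value) is α = 1 / visit-count per state-action pair. A small sketch, with hypothetical table sizes:

```python
import numpy as np

# Per-(state, action) visit counts; alpha_t = 1 / visit_count
# satisfies sum(alpha) = inf and sum(alpha^2) < inf.
visits = np.zeros((4, 2), dtype=int)

def step_size(s, a):
    """Decaying learning rate for state s, action a."""
    visits[s, a] += 1
    return 1.0 / visits[s, a]

# Visiting the same pair four times yields 1, 1/2, 1/3, 1/4.
alphas = [step_size(0, 1) for _ in range(4)]
```

In practice a small constant α is often used instead, trading the convergence guarantee for the ability to track non-stationary environments.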

5. Foundation for Deep RL

Core algorithm behind DQN, Double DQN, and many modern AI game-playing agents