
🎯 SARSA

State-Action-Reward-State-Action: On-Policy TD Control

💡 What is SARSA?

SARSA is an on-policy temporal difference control algorithm that learns the action-value function Q(s,a) while following the same policy it's evaluating. The name comes from the quintuple (S, A, R, S', A') used in each update: the current State, the Action taken, the Reward received, the next State, and the next Action.
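Written out, the update applied after each step of that quintuple is the standard SARSA rule (here α is the step size and γ the discount factor):

```latex
Q(S, A) \leftarrow Q(S, A) + \alpha \left[ R + \gamma \, Q(S', A') - Q(S, A) \right]
```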

The Key Difference from Q-Learning: While Q-Learning bootstraps from max_a Q(s', a), learning about the greedy policy regardless of how actions are actually chosen, SARSA bootstraps from Q(s', a'), where a' is the next action actually taken. SARSA therefore learns about the policy it is really following, exploration moves included. This can make SARSA safer during learning: in the classic cliff-walking example, it learns a path that keeps its ε-greedy policy away from the cliff edge.
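The difference is easiest to see in code. Below is a minimal sketch of one tabular SARSA update on a toy Q-table (the state/action sizes, transition values, and the `epsilon_greedy` helper are illustrative assumptions, not from the text), with the Q-Learning target shown in a comment for contrast:

```python
import numpy as np

# Hypothetical toy setup: 3 states, 2 actions, tabular Q initialized to zero.
rng = np.random.default_rng(0)
Q = np.zeros((3, 2))
alpha, gamma, epsilon = 0.1, 0.99, 0.1

def epsilon_greedy(Q, s):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))

# One observed transition (S, A, R, S') — values are made up for illustration.
s, a, r, s_next = 0, 1, -1.0, 2
a_next = epsilon_greedy(Q, s_next)  # A': the action actually taken next (on-policy)

# SARSA target: bootstrap from Q(S', A'), the action really taken.
td_target = r + gamma * Q[s_next, a_next]
Q[s, a] += alpha * (td_target - Q[s, a])

# Q-Learning would instead bootstrap from the greedy action:
#   td_target = r + gamma * np.max(Q[s_next])
```

With Q initialized to zero, the single update above moves Q[0, 1] toward the reward of −1.0 by a factor of α.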


🎯 Key Takeaways

1. On-Policy Learning

SARSA learns about the policy it actually follows, including exploratory actions

2. The SARSA Tuple

(S, A, R, S', A') — uses the actual next action, not the greedy action

3. Safer Learning

Accounts for exploration risk, leading to more conservative policies in dangerous environments

4. Expected SARSA

A lower-variance variant that replaces the sampled next action a' with an expectation over the policy's action probabilities at s'

5. Convergence

Converges to the optimal Q-values if the policy is GLIE (Greedy in the Limit with Infinite Exploration) and step sizes decay appropriately
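The Expected SARSA takeaway above can be sketched as follows. Instead of bootstrapping from the sampled Q(s', a'), the update averages Q(s', ·) under the ε-greedy action probabilities at s' (the Q-table values and transition here are illustrative assumptions):

```python
import numpy as np

# Hypothetical Q-table: 2 states, 2 actions.
Q = np.array([[0.0, 1.0],
              [0.5, 0.5]])
alpha, gamma, epsilon = 0.1, 0.99, 0.1

def expected_q(Q, s):
    """E[Q(s, A)] when A is drawn from an epsilon-greedy policy at s."""
    n = Q.shape[1]
    probs = np.full(n, epsilon / n)          # exploration mass, split evenly
    probs[np.argmax(Q[s])] += 1.0 - epsilon  # remaining mass on the greedy action
    return float(probs @ Q[s])

# One Expected SARSA update for an observed (S, A, R, S').
s, a, r, s_next = 1, 0, 0.0, 0
td_target = r + gamma * expected_q(Q, s_next)
Q[s, a] += alpha * (td_target - Q[s, a])
```

Because the expectation removes the randomness of sampling a', the TD target has lower variance than plain SARSA's, at the cost of a sum over actions per update.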