💡 What is SARSA?
SARSA is an on-policy temporal difference control algorithm that learns the action-value function Q(s,a) while following the same policy it's evaluating. The name comes from the quintuple (S, A, R, S', A') used in each update: the current State, the Action taken, the Reward received, the next State, and the next Action.
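The update this quintuple drives is Q(s,a) ← Q(s,a) + α[r + γ·Q(s',a') − Q(s,a)]. Here is a minimal sketch of that update with an ε-greedy behavior policy; the function names (`epsilon_greedy`, `sarsa_update`) and the tabular NumPy layout are illustrative assumptions, not a reference implementation:

```python
import numpy as np

def epsilon_greedy(Q, state, n_actions, epsilon, rng):
    """Behavior policy: random action with probability epsilon, else greedy."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[state]))

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """One SARSA step: Q(s,a) += alpha * (r + gamma*Q(s',a') - Q(s,a)).

    Note the target uses a_next, the action actually taken in s'
    (on-policy), not the greedy action as Q-learning would.
    """
    td_target = r + gamma * Q[s_next, a_next]  # bootstrap on the sampled (S', A')
    Q[s, a] += alpha * (td_target - Q[s, a])   # move Q toward the TD target
    return Q
```

Because `a_next` is supplied by the same ε-greedy policy being learned, exploratory moves feed directly into the targets, which is what makes SARSA on-policy.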
🎯 Key Takeaways
1. On-Policy Learning
SARSA learns about the policy it actually follows, including exploratory actions
2. The SARSA Tuple
(S, A, R, S', A') — uses the actual next action, not the greedy action
3. Safer Learning
Accounts for exploration risk, leading to more conservative policies in dangerous environments
4. Expected SARSA
A variance-reduction variant that averages over possible next actions
5. Convergence
Converges to optimal Q-values if exploration decays appropriately (GLIE: Greedy in the Limit with Infinite Exploration)
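The Expected SARSA variant from takeaway 4 can be sketched by replacing the sampled Q(s',a') in the target with an expectation over the ε-greedy action distribution; the function name `expected_sarsa_target` and the tabular setup are assumptions for illustration:

```python
import numpy as np

def expected_sarsa_target(Q, s_next, r, gamma, epsilon):
    """Expected SARSA target: r + gamma * E_pi[Q(s', A')].

    Averaging over the epsilon-greedy action probabilities removes the
    variance that plain SARSA incurs from sampling the single action A'.
    """
    n_actions = Q.shape[1]
    # epsilon-greedy probabilities: epsilon spread uniformly, rest on the greedy action
    probs = np.full(n_actions, epsilon / n_actions)
    probs[np.argmax(Q[s_next])] += 1.0 - epsilon
    return r + gamma * float(np.dot(probs, Q[s_next]))
```

With ε = 0 this expectation collapses to the greedy value (the Q-learning target); with ε > 0 it interpolates between greedy and uniform, matching SARSA in expectation but with lower update variance.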