Intro RL & RLHF

- Markov Processes as LLM Analogies - Frame token generation as a Markov Decision Process (MDP) with states, actions, and rewards (see the MDP sketch after this list)
- Monte Carlo vs Temporal Difference Learning - Compare Monte Carlo episode-based learning with Temporal Difference updates, and their relevance to token-level prediction (see the MC vs TD sketch below)
- Q-Learning & Policy Gradients - Explore conceptual foundations of Q-learning and policy gradients as the basis of RLHF and preference optimization (see the Q-learning and REINFORCE sketches below)
- RL in Decoding and Chain-of-Thought - Apply RL ideas during inference without retraining, including CoT prompting with reward feedback and speculative decoding verification (see the best-of-N decoding sketch below)
- Exercises: RL Foundations with Neural Networks
  - Implement token generation as an MDP with policy and value networks
  - Compare Monte Carlo vs Temporal Difference learning for value estimation
  - Build Q-Learning from tables to DQN with experience replay
  - Implement REINFORCE with baseline subtraction and entropy regularization
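
The sketch below (plain Python, no ML libraries) shows the MDP framing in its simplest form: the state is the token prefix generated so far, an action picks the next token, and the reward arrives when the episode ends. The toy vocabulary, the length cap, and the "contains cat" reward rule are assumptions made for illustration only, not part of the course materials.

```python
import random

# Toy vocabulary, length cap, and reward rule: assumptions for illustration only.
VOCAB = ["<eos>", "the", "cat", "sat", "mat"]
MAX_LEN = 6

def step(state, action):
    """One MDP transition: append the chosen token to the prefix.

    state  -- tuple of tokens generated so far (the prefix)
    action -- index into VOCAB (the next token)
    Returns (next_state, reward, done).
    """
    token = VOCAB[action]
    next_state = state + (token,)
    done = token == "<eos>" or len(next_state) >= MAX_LEN
    # Sparse terminal reward: +1 if the finished sequence mentions "cat".
    reward = 1.0 if done and "cat" in next_state else 0.0
    return next_state, reward, done

def rollout(policy):
    """Sample one episode (one generated sequence) under a policy."""
    state, done, total = (), False, 0.0
    while not done:
        state, reward, done = step(state, policy(state))
        total += reward
    return state, total

if __name__ == "__main__":
    uniform_policy = lambda prefix: random.randrange(len(VOCAB))  # random baseline policy
    print(rollout(uniform_policy))
```

In the exercise, the uniform policy is replaced by a policy network over the prefix and a value network that estimates the expected reward of that prefix.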
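
For Monte Carlo vs TD, here is a minimal sketch on the classic 5-state random walk (a standard textbook environment, not taken from these materials): Monte Carlo waits for the episode to finish and updates toward the full return, while TD(0) bootstraps from the current estimate of the next state. All hyperparameters are arbitrary choices for the example; the true state values for this walk are 1/6 through 5/6.

```python
import random

# Random walk over states 0..4, start in the middle; +1 only when exiting right.
N, ALPHA, GAMMA, EPISODES = 5, 0.1, 1.0, 5000

def run_episode():
    """Return a list of (state, reward, next_state); next_state is None at termination."""
    s, transitions = N // 2, []
    while True:
        s2 = s + random.choice((-1, 1))
        if s2 < 0:                                   # exit left: reward 0, terminate
            transitions.append((s, 0.0, None)); return transitions
        if s2 >= N:                                  # exit right: reward 1, terminate
            transitions.append((s, 1.0, None)); return transitions
        transitions.append((s, 0.0, s2)); s = s2

def monte_carlo(episodes):
    """Every-visit MC: update each visited state toward the full episode return."""
    V = [0.0] * N
    for _ in range(episodes):
        G = 0.0
        for s, r, _ in reversed(run_episode()):      # accumulate returns backwards
            G = r + GAMMA * G
            V[s] += ALPHA * (G - V[s])
    return V

def td_zero(episodes):
    """TD(0): update each state toward the bootstrapped target r + gamma * V(s')."""
    V = [0.0] * N
    for _ in range(episodes):
        for s, r, s2 in run_episode():
            target = r + (GAMMA * V[s2] if s2 is not None else 0.0)
            V[s] += ALPHA * (target - V[s])
    return V

if __name__ == "__main__":
    random.seed(0)
    print("MC :", [round(v, 2) for v in monte_carlo(EPISODES)])
    print("TD :", [round(v, 2) for v in td_zero(EPISODES)])
```

The token-level analogy: Monte Carlo corresponds to scoring only completed generations, while TD-style updates let value estimates propagate back to earlier prefixes without waiting for the full sequence.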
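
Tabular Q-learning in its simplest form, sketched on a toy corridor with a goal at the right end; the corridor, rewards, and hyperparameters are assumptions for illustration. The corresponding exercise extends this table to a DQN with an experience-replay buffer.

```python
import random

# Corridor of 6 states with the goal at state 5; actions 0 = left, 1 = right.
N_STATES, GOAL = 6, 5
ALPHA, GAMMA, EPSILON, EPISODES = 0.5, 0.9, 0.1, 500

Q = [[0.0, 0.0] for _ in range(N_STATES)]            # Q[state][action]

def env_step(s, a):
    """Deterministic move left/right; -1 per step, +10 on reaching the goal."""
    s2 = max(0, min(N_STATES - 1, s + (1 if a == 1 else -1)))
    return (s2, 10.0, True) if s2 == GOAL else (s2, -1.0, False)

def greedy(s):
    return 0 if Q[s][0] >= Q[s][1] else 1

for _ in range(EPISODES):
    s, done = 0, False
    while not done:
        # Epsilon-greedy exploration.
        a = random.randrange(2) if random.random() < EPSILON else greedy(s)
        s2, r, done = env_step(s, a)
        # Q-learning target: bootstrap off the best action value in the next state.
        target = r if done else r + GAMMA * max(Q[s2])
        Q[s][a] += ALPHA * (target - Q[s][a])
        s = s2

print("greedy policy (0=left, 1=right):", [greedy(s) for s in range(N_STATES)])
```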
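
A minimal REINFORCE sketch with the two ingredients named in the exercise, baseline subtraction and entropy regularization, on a 3-armed bandit with a softmax policy so the gradients can be written by hand in NumPy. The arm means, learning rate, and entropy coefficient are assumptions; the exercise applies the same update to a network-parameterised token policy.

```python
import numpy as np

rng = np.random.default_rng(0)
ARM_MEANS = np.array([0.1, 0.5, 0.9])                 # expected reward per action (assumed)
LR, ENT_COEF, STEPS = 0.1, 0.01, 2000

logits = np.zeros(3)                                   # policy parameters, one logit per arm
baseline = 0.0                                         # running-mean reward baseline

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for t in range(1, STEPS + 1):
    probs = softmax(logits)
    a = rng.choice(3, p=probs)                         # sample an action from the policy
    r = rng.normal(ARM_MEANS[a], 0.1)                  # noisy reward from the environment

    baseline += (r - baseline) / t                     # update running-mean baseline
    advantage = r - baseline                           # baseline subtraction

    # grad of log pi(a) w.r.t. the logits for a softmax policy: one_hot(a) - probs
    grad_logp = -probs.copy()
    grad_logp[a] += 1.0

    # grad of the entropy H = -sum p log p w.r.t. the logits: -p * (log p + H)
    grad_entropy = -probs * (np.log(probs) - np.sum(probs * np.log(probs)))

    # Gradient ascent on expected reward plus an entropy bonus for exploration.
    logits += LR * (advantage * grad_logp + ENT_COEF * grad_entropy)

print("final policy:", np.round(softmax(logits), 3))   # should concentrate on arm 2
```

The baseline reduces the variance of the gradient estimate without biasing it, and the entropy term keeps the policy from collapsing prematurely onto one action; both ideas carry over directly to RLHF-style policy-gradient training.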
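
For RL at decoding time, one of the simplest instances is best-of-N sampling: draw several chain-of-thought candidates, score each with a reward signal, and keep the best, with no weight updates at all. The `generate_cot` and `reward_model` functions below are hypothetical stand-ins for a real LLM call and a real verifier or reward model; speculative decoding uses a related propose-then-verify pattern, which is not sketched here.

```python
import random

def generate_cot(prompt: str, temperature: float = 1.0) -> str:
    """Stand-in for sampling one chain-of-thought completion from an LLM."""
    answer = random.choice(["42", "41", "43"])
    return f"Let's think step by step... therefore the answer is {answer}."

def reward_model(prompt: str, completion: str) -> float:
    """Stand-in for a learned reward model or a rule-based verifier."""
    return 1.0 if "42" in completion else 0.0

def best_of_n(prompt: str, n: int = 8) -> str:
    """Sample n candidates and return the highest-reward one (no retraining)."""
    candidates = [generate_cot(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: reward_model(prompt, c))

if __name__ == "__main__":
    print(best_of_n("What is 6 * 7?"))
```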