State-of-the-art decoders

- Explore decoding strategies that influence LLM output diversity and fluency (a minimal sketch of each strategy follows this list)
  - Top-k sampling
    - Learn how Top-k sampling truncates the output distribution to the k most likely tokens (e.g., k=16)
    - Understand how Top-k sampling balances creativity and control, and why it is especially effective with small vocabularies, such as those of byte-level models
  - Nucleus (Top-p) sampling
    - Learn how Nucleus (Top-p) sampling dynamically includes tokens up to a cumulative probability p (e.g., p=0.9)
    - Understand how Top-p sampling produces more adaptive and coherent completions than Top-k, especially in unpredictable generation tasks
  - Beam search
    - Learn how Beam search keeps multiple candidate completions in parallel and scores them to select the most likely overall path
    - Understand why Beam search is useful for deterministic outputs (e.g., code, structured data) and why it can lead to repetitive or bland completions in open-ended generation
  - Speculative decoding (OpenAI-style)
    - Learn how Speculative decoding speeds up inference by letting a small draft model propose several tokens ahead, which the larger model then verifies in a single parallel forward pass
    - Understand how speculative decoding works internally and why it is gaining popularity in production systems such as Groq and the OpenAI APIs
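
A minimal sketch of Top-k sampling, assuming PyTorch and a toy logits vector standing in for a real model's next-token output; `top_k_sample` is an illustrative helper name, not a library function:

```python
import torch

def top_k_sample(logits: torch.Tensor, k: int = 16) -> int:
    """Sample one token id from the k highest-probability tokens."""
    topk_logits, topk_ids = torch.topk(logits, k)     # truncate to the k most likely tokens
    probs = torch.softmax(topk_logits, dim=-1)        # renormalize over the survivors
    choice = torch.multinomial(probs, num_samples=1)  # sample within the truncated set
    return topk_ids[choice].item()

# Toy next-token logits over a 256-entry vocabulary (byte-level-sized).
logits = torch.randn(256)
print(top_k_sample(logits, k=16))
```

Everything outside the top k gets exactly zero probability, which is how Top-k trades off diversity (larger k) against control (smaller k).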
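
The same setup adapts to Nucleus (Top-p) sampling; the only change is that the cutoff is computed from cumulative probability rather than a fixed count. Again a sketch, with `top_p_sample` as a hypothetical name:

```python
import torch

def top_p_sample(logits: torch.Tensor, p: float = 0.9) -> int:
    """Sample one token id from the smallest set whose cumulative probability reaches p."""
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_ids = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Keep every token needed to push cumulative mass to p (always at least one).
    cutoff = int((cumulative < p).sum().item()) + 1
    nucleus = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()  # renormalize the nucleus
    choice = torch.multinomial(nucleus, num_samples=1)
    return sorted_ids[choice].item()

# Toy logits over a GPT-sized vocabulary. The nucleus shrinks when the model is
# confident and grows when the distribution is flat, which is what makes Top-p adaptive.
logits = torch.randn(50_000)
print(top_p_sample(logits, p=0.9))
```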
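
A compact Beam search sketch; `step_fn` is a hypothetical stand-in for a real model's next-token log-probabilities, and scoring is plain summed log-probability (production decoders usually add length normalization):

```python
import torch

def beam_search(step_fn, bos_id: int, eos_id: int, beam_width: int = 4, max_len: int = 20):
    """Keep beam_width partial sequences, extend each with its best continuations,
    and retain the candidates with the highest total log-probability."""
    beams = [([bos_id], 0.0)]                # (token ids, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == eos_id:         # finished beams pass through unchanged
                candidates.append((tokens, score))
                continue
            logprobs = step_fn(tokens)       # log P(next token | prefix)
            top_lp, top_ids = torch.topk(logprobs, beam_width)
            for lp, tid in zip(top_lp.tolist(), top_ids.tolist()):
                candidates.append((tokens + [tid], score + lp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(t[-1] == eos_id for t, _ in beams):
            break
    return beams[0]

# Hypothetical model: a deterministic toy distribution derived from the prefix.
def step_fn(tokens):
    torch.manual_seed(hash(tuple(tokens)) % (2**31))
    return torch.log_softmax(torch.randn(100), dim=-1)

best_tokens, best_score = beam_search(step_fn, bos_id=0, eos_id=1)
print(best_tokens, best_score)
```

Because every beam maximizes the same likelihood objective, high-probability but generic continuations tend to dominate, which is the repetitiveness the bullet above warns about.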
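
A sketch of the verification step at the heart of speculative decoding, using toy distributions in place of real draft and target models. It follows the standard accept-or-resample rule from the speculative decoding literature, under which accepted tokens are distributed exactly as if the target model had sampled them itself:

```python
import torch

def verify_draft(draft_probs, target_probs, draft_tokens):
    """Verify a block of draft tokens against the target model's distributions.

    draft_probs[i] / target_probs[i] are the two models' distributions at position i;
    in a real system the target distributions come from ONE parallel forward pass,
    which is where the speedup comes from.
    """
    accepted = []
    for i, x in enumerate(draft_tokens):
        # Accept token x with probability min(1, p_target(x) / p_draft(x)).
        if torch.rand(()) < torch.clamp(target_probs[i][x] / draft_probs[i][x], max=1.0):
            accepted.append(x)
        else:
            # On the first rejection, resample from the residual distribution
            # max(0, p_target - p_draft), renormalized, and stop verifying.
            residual = torch.clamp(target_probs[i] - draft_probs[i], min=0.0)
            accepted.append(torch.multinomial(residual / residual.sum(), 1).item())
            break
    return accepted

# Toy distributions over a 10-token vocabulary for a 4-token draft block.
gamma, vocab = 4, 10
draft = torch.softmax(torch.randn(gamma, vocab), dim=-1)
target = torch.softmax(torch.randn(gamma, vocab), dim=-1)
draft_tokens = [int(torch.multinomial(draft[i], 1)) for i in range(gamma)]
print(verify_draft(draft, target, draft_tokens))
```

If the whole block is accepted, the full algorithm also samples one bonus token from the target model's next-position distribution, so every verification pass yields at least one new token.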