Intro

ChatGPT is to LLM, as Kleenex is to tissue

What are LLMs?

Tokens

Demo - Manual LLM inference 

LLM generate text

What LLMs predict

Vectors, intuitively

Word embeddings and nearest neighbors

Demo - Semantic meaning of word embeddings 

The architecture for a Large Language Model

How LLMs predict

Self-attention adds context

Demo - Adding "context" to a vector

MLP transforms

Demo - The necessity of non-linearities

How Transformers predict

Absolute positional encoding

Demo Cons of absolute positional bias

Demo - skip connections

Batch norm

RMS norm

How LLMs use position

Workshop-feedback-qa

How LLMs attend

Modern day transformer architectures

Modern LLM connection to papers

You can take the workshop from anywhere in the world, as long as you
have a computer and an internet connection. You also have the opportunity to ask the instructors questions live.

Live and remote

Learn at your own pace, whenever it's convenient for you. With no
rigid schedule to worry about, you can take the course on your own
terms.

Recorded

Learn by building while you learn the concepts.

Build

Join a [vibrant community](/discord) of other students who
are also learning with Fundamentals of transformers. Ask questions, get feedback and
collaborate with others to take your skills to the next level.

Community

In this workshop, we dive deep into Large Language Models (LLMs) to help you understand, build, and optimize their architecture for real-world applications. LLMs are transforming industries—from customer support to content creation—but understanding how these models work, let alone optimizing them, can be challenging. 

In this comprehensive 9-module series, we cover:

The technical essentials of LLMs, including autoregressive decoding, positional encoding, and multi-head attention
The entire LLM lifecycle, from pretraining on massive datasets to fine-tuning and instruction tuning for specialized tasks
Best practices for evaluating LLMs, identifying bottlenecks, and leveraging state-of-the-art architectures for efficiency and scalability

this workshop includes hours of in-depth instruction, hands-on coding exercises, and access to a community forum for support and discussions. You'll also gain exclusive access to source code templates, an expansive reference library, and downloadable materials for continued learning.

It's taught by [Alvin Wan](https://www.linkedin.com/in/alvinwan/), a Senior Research Scientist at Apple and a PhD student at UC Berkeley with international recognition for his impactful contributions in efficient AI and design. With his practical industry experience and research insights, you’ll be guided from fundamentals to advanced concepts with clarity and precision.

By the end of this workshop, you’ll not only understand how to create and optimize LLMs but also how to apply this knowledge across various applications in tech and business.

Everyone knows chatgpt, but how do modern large language models fully work? The fundamentals start at the transformer. This workshop is a workshop to dymstify the transformer and be able to run through concept to code on how the transformer work. This workshops combines concept at an intutive level, to code, to math all with the intent at providing an end to end understanding at the fundamentals of large language models.

Fundamentals of transformers - Live Workshop

Miniaturized version of Huggingface’s LLM inference utility

Interactive demos to understand word embeddings and model representation capacity

Visualization utilities for real-world LLMs, to understand self-attention

Intro to LLM Basics: Learn foundational concepts, including terminology like models, data, algorithms, and optimization.

Autoregressive Decoding: Grasp how LLMs predict words through conditional generation, supported by manual inference demos.

LLM Prediction Mechanism: Explore LLM architecture, with an intuitive look at vectors and word embeddings.

Semantic Meaning in Embeddings: See how word embeddings represent semantic meaning through nearest neighbors and vector demos.

Transformer Core Mechanics: Unpack the inner workings of a transformer layer, including self-attention and context addition.

Non-Linear Transformations: Discover why non-linearities are essential, supported by hands-on matrix multiplication and MLP demos.

Positional Encoding: Learn absolute and relative positional encoding techniques, plus RMS Norm for positional bias management.

Differences between absolute and relative positional encoding

Attention Mechanisms: Delve into “forward-facing” and multi-head attention to understand attention values.

Advanced Attention: Study grouped-query attention and its importance in handling large data.

Current Transformer Models: Analyze academic and modern transformer diagrams, identifying bottlenecks in today’s LLMs.

Build a Mini LLM Inference Tool: Create a simplified version of Huggingface’s LLM utility to understand LLM operation.

Understand Word Embeddings: Develop interactive demos exploring word embeddings and how models represent words

Visualize Self-Attention: Use visualization tools to understand the role of self-attention in language models.

---
title: Demo - skip connections
privateVideoUrl: https://stream.mux.com/lrYc4aO00OOGyBu4tV8JogedkMCuVVDfsF7201pUVeVqg.m3u8
code: https://gitlab.com/fullstackio/books/newline-course-apps/fundamentals-of-transformers-live-workshop-app/-/tree/main/code/final?ref_type=heads
isJupyterNotebook: true
---

Alvin Wan

So here's an additional way that we can use position.

So as you mentioned before, weighted sums

Now, we have one more way to actually combat this.

was to actually add the positional encoding.

is actually add what's called a skip connection.

We added the attention, and then we added the MLP.

Now what we're going to do is add something called

So I'll explain what this looks like in code in just a second.

But in short, a skip connection has two main purposes.

Number one, it ensures that reordering doesn't occur.

And second, more importantly, it actually

It makes it possible to train much faster.

skip connection makes training converge much faster

is a little bit more complex and hard to explain.

Mostly because the original author is that proposed it,

maybe five or six years ago, didn't really

This is just what the empirically observed.

OK, so let me actually show you what this looks like in code.

let's actually now demo a manual skip connection.

Or it's not simple, but it's simpler than attention at least.

but that's pretty much it for a skip connection.

I have a skip connection that takes the input of my MLP,

And so I just added attention through back to the output.

And so the skip connection is this input being added to the output.

And this is what we call the residual branch.

The residual branch is the original set of outputs

and set of outputs and set of vectors that are being passed through.

Anytime I draw something along with this,

it just means it's that we're taking the input

And the reason why I'm talking about all these details,

by the way, is that maybe they don't make a whole lot

I'm going to walk through the original diagram

from the original transform paper that everybody copy and pastes

Yeah, there are also a lot of sources online

but I think our missing critical parts of the explanation.

So I'm hoping that you can get the actual explanation here.

Definitely let me know if these diagrams end up being confusing.

So in my code, I added attention 3 to the output.

So every time I do that, though, I'm basically

So let's say that attention 3 here is something like--

So here, let's say that attention 3 has the values 1, 2, 3.

And then maybe MLP3 has the values 3, 1, 2.

In essence, if I add the skip connection,