Building a Tiny Transformer From Scratch in PyTorch
(and understanding every piece along the way)
If you’ve ever felt that Transformers are a “black box,” today’s post is for you. Instead of just downloading a pretrained model, we’re going to build a tiny GPT-style transformer from scratch — small enough to run on your laptop, but powerful enough to generate Shakespeare-like text.
By the end, you’ll not only have a working model, but also a clear understanding of how the pieces fit together: embeddings, attention, feedforward networks, and residual connections.
But first, if you are new to Transformers, check out this blog to get the concepts down before you start coding.
Fig 1. Query, Key, Value vectors used in attention. Src: Illustrated Transformer
1. The Setup: Hyperparameters & Data
Transformers need a few key hyperparameters: embedding size, number of layers, number of heads, context window size, etc. We’ll keep everything small for speed.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
block_size = 128 # how many characters the model looks back
batch_size = 32 # sequences per batch
n_layer = 2 # number of transformer blocks
n_head = 2 # attention heads
n_embd = 128 # embedding size
dropout = 0.1
max_iters = 1200 # training steps
lr = 3e-3
We’ll use a small excerpt of Shakespeare for training. Since it’s tiny, we’ll tokenize at the character level — every character (including spaces and punctuation) is a token.
text = """From fairest creatures we desire increase,
That thereby beauty's rose might never die,
..."""
chars = sorted(list(set(text)))
vocab_size = len(chars)
stoi = {ch:i for i,ch in enumerate(chars)}
itos = {i:ch for ch,i in stoi.items()}
encode = lambda s: torch.tensor([stoi[c] for c in s], dtype=torch.long)
decode = lambda t: ''.join(itos[int(i)] for i in t)
Why characters instead of words? Simplicity.
Note: a character-level model predicts letters, not words or subwords. It can learn common sequences like “Thy ” but has no concept of words, so you’ll get non-words and odd spellings.
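To see the tokenizer in action, here is a standalone round trip on a toy string (plain Python lists instead of tensors, but the same stoi/itos idea as above):

```python
# Character-level tokenizer: every unique character becomes an integer ID
text = "to be or not to be"
chars = sorted(set(text))                      # unique characters in the corpus
stoi = {ch: i for i, ch in enumerate(chars)}   # char -> index
itos = {i: ch for ch, i in stoi.items()}       # index -> char

encode = lambda s: [stoi[c] for c in s]
decode = lambda ids: ''.join(itos[i] for i in ids)

ids = encode("to be")
print(ids)           # one small integer per character, including the space
print(decode(ids))   # prints "to be" -- a lossless round trip
```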
2. Attention: The Heart of the Transformer
If embeddings are how we represent words or characters, then attention is how the model decides what to focus on.
Think about reading Shakespeare:
“From fairest creatures we desire increase…”
To predict “increase,” you might need to remember “desire” a few tokens back. Attention gives the model this ability: each token can “look” at others in the sequence and decide how much weight to assign them.
Queries, Keys, and Values
In Transformers, each token embedding is transformed into three vectors:
Query (Q): What am I looking for?
Key (K): What do I contain?
Value (V): What information do I provide if selected?
The model compares Queries and Keys using a dot product. If a Query matches a Key strongly, the corresponding Value is attended to more.
Scaled Dot-Product Attention
Mathematically:
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
QK^T: Similarity between queries and keys.
Division by sqrt(d_k): Prevents overly large dot products that destabilize training.
Softmax: Turns similarities into probabilities.
Multiply by V: Aggregate information from the sequence.
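The four steps can be sketched directly on small random tensors (the shapes here are illustrative, not the model's):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T, d_k = 4, 8                        # 4 tokens, 8-dimensional vectors
q, k, v = (torch.randn(T, d_k) for _ in range(3))

scores = q @ k.T / (d_k ** 0.5)      # similarity, scaled by sqrt(d_k)
weights = F.softmax(scores, dim=-1)  # each row becomes a probability distribution
out = weights @ v                    # weighted sum of value vectors

print(weights.sum(dim=-1))           # each row sums to 1.0
print(out.shape)                     # torch.Size([4, 8])
```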
Causal Masking
In language modeling, we only want each token to see past tokens, not future ones. Otherwise, predicting the next word would be cheating.
That’s why we apply a causal mask: a triangular matrix that blocks access to future positions.
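Concretely, for a 4-token sequence the mask and its effect on attention weights look like this (a standalone illustration):

```python
import torch

T = 4
mask = torch.tril(torch.ones(T, T))   # 1 = may attend, 0 = future position (blocked)
print(mask)
# tensor([[1., 0., 0., 0.],
#         [1., 1., 0., 0.],
#         [1., 1., 1., 0.],
#         [1., 1., 1., 1.]])

scores = torch.zeros(T, T)            # pretend all similarities are equal
scores = scores.masked_fill(mask == 0, float('-inf'))
weights = torch.softmax(scores, dim=-1)
print(weights[0])   # token 0 can only attend to itself
print(weights[3])   # token 3 attends uniformly over all four positions
```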
class CausalSelfAttention(nn.Module):
def __init__(self, n_embd, n_head, dropout):
super().__init__()
self.n_head = n_head
self.key = nn.Linear(n_embd, n_embd, bias=False)
self.query = nn.Linear(n_embd, n_embd, bias=False)
self.value = nn.Linear(n_embd, n_embd, bias=False)
self.proj = nn.Linear(n_embd, n_embd, bias=False)
self.attn_drop = nn.Dropout(dropout)
self.resid_drop = nn.Dropout(dropout)
# causal mask: prevents looking ahead
self.register_buffer('mask', torch.tril(torch.ones(block_size, block_size))
.view(1,1,block_size,block_size))
def forward(self, x):
B, T, C = x.size()
k = self.key(x).view(B, T, self.n_head, C//self.n_head).transpose(1,2)
q = self.query(x).view(B, T, self.n_head, C//self.n_head).transpose(1,2)
v = self.value(x).view(B, T, self.n_head, C//self.n_head).transpose(1,2)
att = (q @ k.transpose(-2, -1)) / math.sqrt(k.size(-1))
att = att.masked_fill(self.mask[:,:,:T,:T]==0, float('-inf'))
att = F.softmax(att, dim=-1)
att = self.attn_drop(att)
y = att @ v
y = y.transpose(1,2).contiguous().view(B, T, C)
return self.resid_drop(self.proj(y))
3. Transformer Blocks: The Lego Bricks of LLMs
If attention is the engine, then a Transformer Block is the car around it — a self-contained unit that stacks with others to build deep models. Every modern LLM is basically “just” dozens (or hundreds) of these blocks piled on top of each other.
A Transformer Block has three key ingredients:
Layer Normalization (stabilizes training)
Multi-Head Self-Attention (captures dependencies between tokens)
Feedforward Network (MLP) (adds nonlinear mixing and depth)
And critically: Residual Connections — the skip-connections that let the signal flow across layers without vanishing.
Anatomy of a Block
Mathematically, a block looks like this:
x = x + Attention(LayerNorm(x))
x = x + MLP(LayerNorm(x))
LayerNorm before each sub-layer, residual addition after. This pattern is called Pre-Norm (used in GPT, LLaMA, etc.). It helps with stable gradients during deep training.
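The "stabilizes" claim is easy to verify: LayerNorm re-centers and re-scales each token's feature vector on its own, no matter how badly scaled the input is. A quick standalone check:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
ln = nn.LayerNorm(128)
x = torch.randn(2, 5, 128) * 50 + 10      # deliberately mis-scaled activations
y = ln(x)

print(y.mean(dim=-1).abs().max().item())  # ~0: every token vector is re-centered
print(y.std(dim=-1).mean().item())        # ~1: and re-scaled to unit variance
```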
class MLP(nn.Module):
def __init__(self, n_embd, dropout):
super().__init__()
self.net = nn.Sequential(
nn.Linear(n_embd, 4*n_embd),
nn.GELU(),
nn.Linear(4*n_embd, n_embd),
nn.Dropout(dropout),
)
def forward(self, x): return self.net(x)
class Block(nn.Module):
def __init__(self, n_embd, n_head, dropout):
super().__init__()
self.ln1 = nn.LayerNorm(n_embd)
self.attn = CausalSelfAttention(n_embd, n_head, dropout)
self.ln2 = nn.LayerNorm(n_embd)
self.mlp = MLP(n_embd, dropout)
def forward(self, x):
x = x + self.attn(self.ln1(x)) # residual connection
x = x + self.mlp(self.ln2(x))
return x
Residuals (skip connections) are critical — without them, deep networks would struggle to train.
4. The Tiny Transformer
We’ve covered the ingredients:
Embeddings to represent tokens and positions.
Attention to let tokens “talk” to each other.
MLPs to mix information.
Residual + LayerNorm to stabilize everything.
Now it’s time to assemble a complete GPT-style language model.
Think of it like building a house: we’ve poured the foundation (embeddings), installed the plumbing and wiring (attention), and added rooms (blocks). The Tiny Transformer is the full house, ready to be lived in.
Step 1. Token & Positional Embeddings
Transformers don’t know what words or characters are. They only see numbers. So we use an embedding matrix to map token IDs into dense vectors.
But embeddings alone don’t capture order. Without positions, “dog bites man” and “man bites dog” look the same. That’s why we add positional embeddings.
self.tok_emb = nn.Embedding(vocab_size, n_embd) # token embeddings
self.pos_emb = nn.Embedding(block_size, n_embd) # position embeddings
We then add them together:
tok = self.tok_emb(idx) # token lookup
pos = self.pos_emb(torch.arange(T, device=idx.device)) # position lookup
x = tok + pos # combined representation
This ensures the model knows both what the token is and where it is.
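The addition relies on broadcasting: tok has shape (B, T, C) while pos has shape (T, C), so the same positional vectors are added to every sequence in the batch. A quick shape check with made-up small sizes:

```python
import torch
import torch.nn as nn

vocab_size, block_size, n_embd = 65, 128, 16   # illustrative sizes
tok_emb = nn.Embedding(vocab_size, n_embd)
pos_emb = nn.Embedding(block_size, n_embd)

idx = torch.randint(0, vocab_size, (4, 10))    # batch of 4 sequences, 10 tokens each
tok = tok_emb(idx)                             # (4, 10, 16): what each token is
pos = pos_emb(torch.arange(10))                # (10, 16):    where it sits
x = tok + pos                                  # broadcasts to (4, 10, 16)
print(x.shape)                                 # torch.Size([4, 10, 16])
```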
Step 2. Stacking Transformer Blocks
Each block is a self-contained reasoning step: normalize → attend → normalize → feedforward → residual connections.
We stack multiple blocks to deepen the model’s capacity:
self.blocks = nn.Sequential(*[Block(n_embd, n_head, dropout) for _ in range(n_layer)])
This is where scaling happens: GPT-2 has 48 blocks, GPT-3 has 96, LLaMA-65B has 80+. In our case, we keep it tiny (2 blocks) so it runs on a laptop.
Step 3. Output Layer
After passing through blocks, we normalize one last time and project into the vocabulary size to predict the next token.
self.ln_f = nn.LayerNorm(n_embd)
self.head = nn.Linear(n_embd, vocab_size, bias=False)
Notice something subtle:
self.head.weight = self.tok_emb.weight
This is weight tying. It reuses the same matrix for input embeddings and output projection, reducing parameters and often improving performance.
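Because tying makes the two layers share one Parameter object, gradients flow into the same matrix from both ends, and the vocab_size × n_embd weights are only stored (and counted) once. A standalone check with illustrative sizes:

```python
import torch.nn as nn

vocab_size, n_embd = 65, 128                      # illustrative sizes
tok_emb = nn.Embedding(vocab_size, n_embd)        # weight: (65, 128)
head = nn.Linear(n_embd, vocab_size, bias=False)  # weight: (65, 128) as well

head.weight = tok_emb.weight                      # tie: now the same tensor object

print(head.weight is tok_emb.weight)              # True -- shared storage
model = nn.ModuleDict({'tok_emb': tok_emb, 'head': head})
n_params = sum(p.numel() for p in model.parameters())
print(n_params)                                   # 8320, not 16640: counted once
```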
class TinyTransformer(nn.Module):
def __init__(self):
super().__init__()
self.tok_emb = nn.Embedding(vocab_size, n_embd)
self.pos_emb = nn.Embedding(block_size, n_embd)
self.drop = nn.Dropout(dropout)
self.blocks = nn.Sequential(*[Block(n_embd, n_head, dropout) for _ in range(n_layer)])
self.ln_f = nn.LayerNorm(n_embd)
self.head = nn.Linear(n_embd, vocab_size, bias=False)
# tie weights: reuse embedding weights in output
self.head.weight = self.tok_emb.weight
def forward(self, idx, targets=None):
B, T = idx.shape
tok = self.tok_emb(idx)
pos = self.pos_emb(torch.arange(T, device=idx.device))
x = self.drop(tok + pos)
x = self.blocks(x)
x = self.ln_f(x)
logits = self.head(x)
loss = None
if targets is not None:
loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
return logits, loss
@torch.no_grad()
def generate(self, idx, max_new_tokens):
for _ in range(max_new_tokens):
idx_cond = idx[:, -block_size:]
logits, _ = self(idx_cond)
logits = logits[:, -1, :]
probs = F.softmax(logits, dim=-1)
next_id = torch.multinomial(probs, num_samples=1)
idx = torch.cat((idx, next_id), dim=1)
return idx
Notice the positional embeddings: since transformers don’t have recurrence or convolution, we need to tell them where each token is in the sequence.
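One common extension (not implemented above) is a temperature knob on sampling: divide the logits by a temperature before the softmax. Below 1.0 the distribution sharpens toward the top token; above 1.0 it flattens toward uniform, giving more varied but sloppier text. A standalone sketch:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.1])   # fake next-token logits for 3 tokens

probs = {t: F.softmax(logits / t, dim=-1) for t in (0.5, 1.0, 2.0)}
for t, p in probs.items():
    print(t, [round(x, 3) for x in p.tolist()])
# temperature 0.5 concentrates mass on the top token;
# temperature 2.0 spreads it out toward uniform
```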
5. Training & Sampling
Training is straightforward: predict the next token, compute cross-entropy loss, update weights.
model = TinyTransformer().to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
for it in range(max_iters+1):
xb, yb = get_batch('train')
logits, loss = model(xb, yb)
optimizer.zero_grad(set_to_none=True)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()
# Generate text
context = torch.zeros((1,1), dtype=torch.long, device=device)
sample = model.generate(context, max_new_tokens=400)[0].tolist()
print(decode(sample))
After training, the model starts spitting out Shakespearean-style gibberish. That's expected: the goal is to understand the architecture, not to build a model that actually works.
6. The Big Picture
Why does this matter? Because once you understand how to build a toy transformer, you can play around with its parameters to understand:
Why attention is powerful.
How scaling (layers, heads, embedding size) affects capacity.
Why modern LLMs are just giant versions of this code — trained on billions of tokens instead of a few hundred lines of Shakespeare.
Here is the full code for you to try out.
# Tiny Transformer (PyTorch) — Colab-ready
# Goal: Train a very small character-level language model (LM) on Shakespeare text.
# This lets you "see" how attention, embeddings, and transformer blocks work.
import math, torch, torch.nn as nn
from torch.nn import functional as F
# -------------------
# Config (small enough to run on CPU/GPU in minutes)
# -------------------
device = 'cuda' if torch.cuda.is_available() else 'cpu' # use GPU if available
block_size = 128 # context window size (how many characters model looks back)
batch_size = 32 # number of sequences per batch
n_layer = 2 # number of transformer blocks
n_head = 2 # number of attention heads
n_embd = 128 # embedding dimension
dropout = 0.1
max_iters = 1200 # training steps (short for demo)
eval_interval = 200
lr = 3e-3 # learning rate
eval_iters = 100
generate_tokens = 400 # how many characters to generate after training
torch.manual_seed(1337) # reproducibility
# -------------------
# Tiny dataset (Shakespeare excerpt for demo)
# -------------------
text = """
From fairest creatures we desire increase,
That thereby beauty's rose might never die,
But as the riper should by time decease,
His tender heir might bear his memory:
...
"""
# -------------------
# Tokenization (character-level for simplicity)
# -------------------
chars = sorted(list(set(text)))
vocab_size = len(chars) # number of unique characters
stoi = {ch:i for i,ch in enumerate(chars)} # char → index
itos = {i:ch for ch,i in stoi.items()} # index → char
encode = lambda s: torch.tensor([stoi[c] for c in s], dtype=torch.long)
decode = lambda t: ''.join(itos[int(i)] for i in t)
# Prepare train/validation split
data = encode(text)
n = int(0.9*len(data))
train_data, val_data = data[:n], data[n:]
# Function to sample mini-batches for training
def get_batch(split):
d = train_data if split == 'train' else val_data
# ensure we never sample beyond available length
cur_block = min(block_size, max(2, len(d) - 2))
hi = len(d) - cur_block - 1
if hi <= 0:
# fall back to using the whole sequence if extremely short
x = d[:cur_block].unsqueeze(0)
y = d[1:cur_block+1].unsqueeze(0)
return x.to(device), y.to(device)
ix = torch.randint(hi, (batch_size,))
x = torch.stack([d[i:i+cur_block] for i in ix])
y = torch.stack([d[i+1:i+cur_block+1] for i in ix])
return x.to(device), y.to(device)
# -------------------
# Core Transformer Components
# -------------------
class CausalSelfAttention(nn.Module):
"""
Multi-head self-attention with causal masking
(so the model can't look at future tokens).
"""
def __init__(self, n_embd, n_head, dropout):
super().__init__()
assert n_embd % n_head == 0
self.n_head = n_head
# projection matrices for Q, K, V
self.key = nn.Linear(n_embd, n_embd, bias=False)
self.query = nn.Linear(n_embd, n_embd, bias=False)
self.value = nn.Linear(n_embd, n_embd, bias=False)
self.proj = nn.Linear(n_embd, n_embd, bias=False)
self.attn_drop = nn.Dropout(dropout)
self.resid_drop = nn.Dropout(dropout)
# causal mask (ensures model only attends to previous tokens)
self.register_buffer('mask', torch.tril(torch.ones(block_size, block_size))
.view(1,1,block_size,block_size))
def forward(self, x):
B, T, C = x.size()
# Linear projections → split into heads
k = self.key(x).view(B, T, self.n_head, C//self.n_head).transpose(1,2)
q = self.query(x).view(B, T, self.n_head, C//self.n_head).transpose(1,2)
v = self.value(x).view(B, T, self.n_head, C//self.n_head).transpose(1,2)
# Attention scores (scaled dot product)
att = (q @ k.transpose(-2, -1)) / math.sqrt(k.size(-1))
att = att.masked_fill(self.mask[:,:,:T,:T]==0, float('-inf')) # apply causal mask
att = F.softmax(att, dim=-1) # normalize
att = self.attn_drop(att)
# Weighted sum of values
y = att @ v
y = y.transpose(1,2).contiguous().view(B, T, C)
y = self.resid_drop(self.proj(y))
return y
class MLP(nn.Module):
"""Feed-forward network inside each transformer block."""
def __init__(self, n_embd, dropout):
super().__init__()
self.net = nn.Sequential(
nn.Linear(n_embd, 4*n_embd), # expansion
nn.GELU(), # nonlinearity
nn.Linear(4*n_embd, n_embd), # projection back down
nn.Dropout(dropout),
)
def forward(self, x): return self.net(x)
class Block(nn.Module):
"""A Transformer block = LayerNorm + Attention + MLP with residuals."""
def __init__(self, n_embd, n_head, dropout):
super().__init__()
self.ln1 = nn.LayerNorm(n_embd)
self.attn = CausalSelfAttention(n_embd, n_head, dropout)
self.ln2 = nn.LayerNorm(n_embd)
self.mlp = MLP(n_embd, dropout)
def forward(self, x):
x = x + self.attn(self.ln1(x)) # residual connection
x = x + self.mlp(self.ln2(x)) # another residual
return x
class TinyTransformer(nn.Module):
"""A GPT-style tiny transformer model."""
def __init__(self):
super().__init__()
# Token + position embeddings
self.tok_emb = nn.Embedding(vocab_size, n_embd)
self.pos_emb = nn.Embedding(block_size, n_embd)
self.drop = nn.Dropout(dropout)
# Stack of transformer blocks
self.blocks = nn.Sequential(*[Block(n_embd, n_head, dropout) for _ in range(n_layer)])
self.ln_f = nn.LayerNorm(n_embd)
# Output projection to vocab size
self.head = nn.Linear(n_embd, vocab_size, bias=False)
# Weight tying: share weights between token embedding and output head
self.head.weight = self.tok_emb.weight
def forward(self, idx, targets=None):
B, T = idx.shape
tok = self.tok_emb(idx) # (B,T,C)
pos = self.pos_emb(torch.arange(T, device=idx.device)) # (T,C)
x = self.drop(tok + pos)
x = self.blocks(x)
x = self.ln_f(x)
logits = self.head(x) # (B,T,V)
# Compute loss if targets given
loss = None
if targets is not None:
loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
return logits, loss
@torch.no_grad()
def generate(self, idx, max_new_tokens):
"""Generate text by sampling the next token repeatedly."""
for _ in range(max_new_tokens):
idx_cond = idx[:, -block_size:] # crop to context window
logits, _ = self(idx_cond)
logits = logits[:, -1, :] # last time step
probs = F.softmax(logits, dim=-1) # convert to probabilities
next_id = torch.multinomial(probs, num_samples=1) # sample token
idx = torch.cat((idx, next_id), dim=1)
return idx
# Initialize model + optimizer
model = TinyTransformer().to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
# -------------------
# Evaluation helper
# -------------------
@torch.no_grad()
def estimate_loss():
model.eval()
out = {}
for split in ['train','val']:
losses = []
for _ in range(eval_iters):
xb, yb = get_batch(split)
_, loss = model(xb, yb)
losses.append(loss.item())
out[split] = sum(losses)/len(losses)
model.train()
return out
# -------------------
# Training loop
# -------------------
for it in range(max_iters+1):
if it % eval_interval == 0:
losses = estimate_loss()
print(f"iter {it:4d} | train loss {losses['train']:.3f} | val loss {losses['val']:.3f}")
xb, yb = get_batch('train')
logits, loss = model(xb, yb)
optimizer.zero_grad(set_to_none=True)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0) # avoid exploding grads
optimizer.step()
# -------------------
# Sampling: generate text from scratch
# -------------------
context = torch.zeros((1,1), dtype=torch.long, device=device) # start with token 0
sample = model.generate(context, max_new_tokens=generate_tokens)[0].tolist()
print("\n--- SAMPLE ---\n")
print(decode(sample))

