Transformer
Transformer Core Concepts
From Tokens to Embeddings
Raw tokens are first mapped to dense vectors through an embedding matrix so that the model can work in a continuous space. The embedding size (n_embd) defines the dimensionality of this space and controls both the model capacity and its memory footprint.
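A minimal sketch of the lookup, with an illustrative vocabulary size rather than one taken from a real tokenizer:

```python
import torch
import torch.nn as nn

vocab_size, n_embd = 1000, 192          # illustrative sizes; vocab_size normally comes from the tokenizer
token_embed = nn.Embedding(vocab_size, n_embd)

idx = torch.randint(0, vocab_size, (2, 8))   # (batch, sequence) of integer token ids
x = token_embed(idx)                         # (2, 8, n_embd) dense vectors
```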
Positional Information
Because self-attention is permutation-invariant, Transformers inject order information with positional encodings. Classical sinusoidal encodings let the model generalise to longer sequences, while learnable embeddings allow the model to adapt positions during training. Modern variants sometimes rely on relative position encodings or rotary embeddings to better capture long context interactions.
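One possible implementation of the classical sinusoidal encoding, written here as a standalone helper (the function name is our own):

```python
import math
import torch

def sinusoidal_positions(block_size: int, n_embd: int) -> torch.Tensor:
    """Sinusoidal encodings: sine on even channels, cosine on odd channels (n_embd must be even)."""
    pos = torch.arange(block_size).unsqueeze(1)                                   # (T, 1)
    div = torch.exp(torch.arange(0, n_embd, 2) * (-math.log(10000.0) / n_embd))   # (n_embd/2,)
    pe = torch.zeros(block_size, n_embd)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe   # (block_size, n_embd), added to the token embeddings before the first block
```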
Scaled Dot-Product Self-Attention
For each token, the model projects embeddings into queries (Q), keys (K), and values (V). Attention weights are computed as softmax(QKᵀ / sqrt(d_k)), where d_k is the head dimension; the scaling prevents large dot products from saturating the softmax. The output is a weighted sum of the value vectors, allowing every position to gather information from the entire context window (block_size).
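A direct translation of this formula into PyTorch might look as follows; the function name and the optional mask argument are illustrative:

```python
import math
import torch
import torch.nn.functional as F

def attention(q, k, v, mask=None):
    """Compute softmax(Q Kᵀ / sqrt(d_k)) V; an optional mask blocks disallowed positions."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)          # (..., T, T) similarity scores
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))  # masked positions get zero weight
    weights = F.softmax(scores, dim=-1)                        # attention weights per query
    return weights @ v                                         # weighted sum of the value vectors
```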
Multi-Head Attention
Multiple attention heads run in parallel on different learned projections of the same sequence. This design allows the model to capture heterogeneous relationships (syntax, long-range dependencies, coreference) in the same layer. The concatenated head outputs are linearly projected back into the model dimension.
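A sketch of the head-splitting bookkeeping, assuming PyTorch 2.x for F.scaled_dot_product_attention; the module and attribute names are our own and do not appear in the skeleton below:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    """Splits n_embd into n_head subspaces, attends in each, then re-projects."""

    def __init__(self, n_embd: int, n_head: int) -> None:
        super().__init__()
        assert n_embd % n_head == 0, "n_embd must be divisible by n_head"
        self.n_head, self.head_dim = n_head, n_embd // n_head
        self.qkv = nn.Linear(n_embd, 3 * n_embd)   # joint Q, K, V projection
        self.proj = nn.Linear(n_embd, n_embd)      # output projection back to the model dimension

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # (B, T, C) -> (B, n_head, T, head_dim) so each head attends independently
        q, k, v = (t.view(B, T, self.n_head, self.head_dim).transpose(1, 2) for t in (q, k, v))
        out = F.scaled_dot_product_attention(q, k, v)          # per-head attention
        out = out.transpose(1, 2).contiguous().view(B, T, C)   # concatenate heads
        return self.proj(out)

# Example: 2 sequences of 8 tokens with n_embd=192 and 6 heads.
mha = MultiHeadSelfAttention(192, 6)
y = mha(torch.randn(2, 8, 192))   # -> (2, 8, 192)
```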
Position-Wise Feed-Forward Network
Each Transformer block follows attention with a two-layer feed-forward network applied independently to every position. A typical configuration is Linear(n_embd → 4 × n_embd), an activation (GELU or ReLU), then Linear(4 × n_embd → n_embd). This component mixes features learned by attention and introduces non-linearity.
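For example, with the sample values used later in this document, the position-wise network could be written as:

```python
import torch.nn as nn

n_embd, dropout = 192, 0.2            # values mirror the sample configuration below
feed_forward = nn.Sequential(
    nn.Linear(n_embd, 4 * n_embd),    # expand
    nn.GELU(),                        # non-linearity
    nn.Linear(4 * n_embd, n_embd),    # project back to the model dimension
    nn.Dropout(dropout),
)
```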
Residual Connections and Normalisation
Skip connections wrap both the attention sublayer and the feed-forward sublayer so that gradients flow directly to earlier blocks. LayerNorm (or RMSNorm in some modern designs) keeps activations well-scaled during training. Variants such as Pre-LN place the normalisation before each sublayer, which improves stability for deeper models.
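A minimal Pre-LN block might look like this, assuming attn and ffn are shape-preserving sublayers such as the ones sketched above:

```python
import torch
import torch.nn as nn

class PreLNBlock(nn.Module):
    """Pre-LN wiring: normalise, run the sublayer, then add the skip connection."""

    def __init__(self, n_embd: int, attn: nn.Module, ffn: nn.Module) -> None:
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)
        self.attn = attn   # any shape-preserving attention module
        self.ffn = ffn     # any shape-preserving feed-forward module

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.attn(self.ln1(x))   # residual around the attention sublayer
        x = x + self.ffn(self.ln2(x))    # residual around the feed-forward sublayer
        return x
```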
Encoder-Decoder vs. Decoder-Only
The original Transformer pairs an encoder that builds contextualised representations with a decoder that performs autoregressive generation, both stacked with attention and feed-forward modules. Many language models today use only the decoder stack with causal masking, which enforces that each token can only attend to previous positions, enabling left-to-right generation.
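Causal masking can be expressed as a lower-triangular matrix applied to the attention scores before the softmax; the snippet below is illustrative:

```python
import torch

T = 6
# Boolean lower-triangular mask: position i may attend to positions 0..i only.
causal_mask = torch.tril(torch.ones(T, T, dtype=torch.bool))

# Applied to a (T, T) matrix of attention scores before the softmax:
scores = torch.randn(T, T)
scores = scores.masked_fill(~causal_mask, float("-inf"))
```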
Training and Scaling Considerations
- Optimiser choice: AdamW remains the default, but large models may benefit from learning rate warm-up, cosine decay, and parameter-specific weight decay (a minimal schedule sketch follows this list).
- Regularisation: Dropout complements attention masking, while techniques such as label smoothing or stochastic depth can help deep stacks converge.
- Precision and compilation: Training in mixed precision (bfloat16/fp16) and enabling compiler optimisations (torch.compile) significantly reduce memory use and speed up training.
- Scaling laws: Empirically, model performance improves predictably with more data, parameters, and compute, guiding decisions about n_layer, n_head, and dataset size.
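The optimiser bullet above can be made concrete with a small warm-up-plus-cosine schedule. The values mirror the sample configuration tables below, and the nn.Linear model is only a stand-in for the TransformerLM skeleton shown later:

```python
import math
import torch
import torch.nn as nn

# Stand-in parameters; in practice this would be the TransformerLM skeleton shown later.
model = nn.Linear(8, 8)
optimizer = torch.optim.AdamW(
    model.parameters(), lr=3e-4, betas=(0.9, 0.95), weight_decay=0.1
)

warmup_iters, max_iters = 100, 300
base_lr, min_lr = 3e-4, 1e-5

def lr_at(step: int) -> float:
    """Linear warm-up for warmup_iters steps, then cosine decay down to min_lr."""
    if step < warmup_iters:
        return base_lr * (step + 1) / warmup_iters
    progress = (step - warmup_iters) / max(1, max_iters - warmup_iters)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

for step in range(max_iters):
    for group in optimizer.param_groups:
        group["lr"] = lr_at(step)
    # ... run a training step here (see training_step in the skeleton below) ...
```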
Inference-Time Generation
During autoregressive generation, the model caches key-value pairs to avoid recomputing attention for past tokens. Sampling strategies such as temperature, top-k, and nucleus sampling trade off creativity against determinism. For instruction-following models, additional techniques such as contrastive decoding or alignment with human feedback further shape the output distribution.
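A compact temperature/top-k sampler over the last-position logits could look like this; the function name and defaults are illustrative, and nucleus sampling would filter by cumulative probability instead:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sample_next_token(logits: torch.Tensor, temperature: float = 0.8, top_k: int = 50) -> torch.Tensor:
    """Sample one token id from last-position logits of shape (batch, vocab_size)."""
    logits = logits / temperature                        # <1 sharpens, >1 flattens the distribution
    if top_k is not None:
        kth = torch.topk(logits, top_k, dim=-1).values[..., -1, None]
        logits = logits.masked_fill(logits < kth, float("-inf"))   # keep only the k largest logits
    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)       # (batch, 1) sampled token ids
```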
PyTorch Skeleton
import torch
import torch.nn as nn


class TransformerLM(nn.Module):
    """Decoder-only language model built from PyTorch encoder layers plus a causal mask."""

    def __init__(
        self,
        vocab_size: int,
        block_size: int = 192,
        n_embd: int = 192,
        n_layer: int = 3,
        n_head: int = 6,
        dropout: float = 0.2,
    ) -> None:
        super().__init__()
        self.block_size = block_size
        # Token embeddings plus learned positional embeddings.
        self.token_embed = nn.Embedding(vocab_size, n_embd)
        self.pos_embed = nn.Parameter(torch.zeros(1, block_size, n_embd))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=n_embd,
            nhead=n_head,
            dim_feedforward=4 * n_embd,
            dropout=dropout,
            activation="gelu",
            batch_first=True,
            # norm_first=True would switch to the Pre-LN variant discussed above.
        )
        self.layers = nn.TransformerEncoder(encoder_layer, num_layers=n_layer)
        self.norm = nn.LayerNorm(n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size, bias=False)

    def forward(self, idx: torch.Tensor) -> torch.Tensor:
        if idx.size(1) > self.block_size:
            raise ValueError("Sequence length exceeds block size.")
        T = idx.size(1)
        x = self.token_embed(idx) + self.pos_embed[:, :T]
        # Causal mask so each position attends only to itself and earlier positions.
        causal_mask = nn.Transformer.generate_square_subsequent_mask(T).to(idx.device)
        x = self.layers(x, mask=causal_mask)
        x = self.norm(x)
        return self.lm_head(x)  # (batch, T, vocab_size) next-token logits


def training_step(model, batch, optimizer, scaler=None):
    """One optimisation step with optional fp16 gradient scaling and gradient clipping."""
    model.train()
    inputs, targets = batch
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast(enabled=scaler is not None):
        logits = model(inputs)
        loss = nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            targets.reshape(-1),
        )
    if scaler:
        scaler.scale(loss).backward()
        scaler.unscale_(optimizer)  # unscale before clipping so the threshold is in real units
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        scaler.step(optimizer)
        scaler.update()
    else:
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
    return loss.item()
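A minimal usage sketch for the skeleton above; the toy batch, vocabulary size, and shifted targets are purely illustrative:

```python
# Toy end-to-end check: random token ids stand in for real training data.
vocab_size = 1000
model = TransformerLM(vocab_size)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, betas=(0.9, 0.95), weight_decay=0.1)

inputs = torch.randint(0, vocab_size, (48, 192))   # (batch_size, block_size) token ids
targets = torch.roll(inputs, shifts=-1, dims=1)    # shifted-by-one targets, purely illustrative
loss = training_step(model, (inputs, targets), optimizer)
print(f"loss: {loss:.4f}")
```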
Hyperparameters
Minimal Viable Training Config
| Parameter | Sample Value | Meaning |
|---|---|---|
| batch_size | 48 | Samples per optimisation step |
| block_size | 192 | Context window length |
| max_iters | 300 | Maximum number of optimisation steps |
| learning_rate | 3e-4 | Optimiser step size |
| n_embd | 192 | Transformer embedding dimension |
| n_head | 6 | Number of attention heads |
| n_layer | 3 | Number of Transformer layers |
| dropout | 0.2 | Regularisation probability |
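One way to wire this minimal config into the TransformerLM skeleton above, with an illustrative vocab_size standing in for the tokenizer-derived value:

```python
# Mapping the minimal config onto the TransformerLM constructor; vocab_size is illustrative.
config = dict(batch_size=48, block_size=192, max_iters=300, learning_rate=3e-4,
              n_embd=192, n_head=6, n_layer=3, dropout=0.2)

model = TransformerLM(
    vocab_size=1000,                 # normally derived from the tokenizer
    block_size=config["block_size"],
    n_embd=config["n_embd"],
    n_layer=config["n_layer"],
    n_head=config["n_head"],
    dropout=config["dropout"],
)
```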
Full Training Configuration
| Category | Parameter | Sample value | Meaning |
|---|---|---|---|
| Data | block_size | 192 | Context window length |
| | vocab_size | (auto from tokenizer) | Number of tokens in the vocabulary |
| Model | n_embd | 192 | Embedding dimension |
| | n_head | 6 | Number of attention heads |
| | n_layer | 3 | Transformer depth |
| | dropout | 0.2 | Dropout probability |
| | tie_weights | True | Share token embedding and output projection weights |
| Training Loop | batch_size | 48 | Number of samples per update |
| | max_iters | 300 | Total training iterations |
| | grad_clip | 1.0 | Gradient norm clipping |
| Optimiser | learning_rate | 3e-4 | Base learning rate |
| | weight_decay | 0.1 | AdamW weight decay |
| | betas | (0.9, 0.95) | AdamW momentum coefficients |
| | eps | 1e-8 | AdamW epsilon |
| LR Scheduler | lr_decay | True | Enable learning rate decay |
| | warmup_iters | 100 | Warm-up steps before decay |
| | min_lr | 1e-5 | Final learning rate after decay |
| | scheduler_type | "cosine" | Scheduler function |
| Precision / Hardware | device | "cuda" | Compute device |
| | dtype | "bfloat16" | Precision mode |
| | compile | True | Enable Torch 2.x compile optimisation |
| Validation / Early Stop | eval_interval | 100 | Evaluation frequency |
| | eval_iters | 20 | Mini-batches used for validation loss estimation |
| | patience | 6 | Early stopping patience |
| | min_delta | 1e-3 | Minimum improvement threshold |
| Checkpoint / Logging | save_interval | 100 | Model checkpoint interval |
| | log_interval | 50 | Logging interval |
| | wandb_project | "gpt-debug" | Optional logging project name |
| Generation | temperature | 0.8 | Softmax temperature for sampling |
| | top_k | 50 | Top-K sampling |
| | top_p | 0.95 | Nucleus sampling |
| | max_new_tokens | 200 | Maximum number of new tokens to generate |
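One possible container for this configuration is a plain dataclass; the field names and defaults mirror the table, while the validation, early-stopping, and logging entries are omitted here for brevity:

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    """Sketch of a config object; defaults follow the sample values in the table above."""
    # Data / model
    block_size: int = 192
    n_embd: int = 192
    n_head: int = 6
    n_layer: int = 3
    dropout: float = 0.2
    tie_weights: bool = True
    # Training loop / optimiser
    batch_size: int = 48
    max_iters: int = 300
    grad_clip: float = 1.0
    learning_rate: float = 3e-4
    weight_decay: float = 0.1
    betas: tuple = (0.9, 0.95)
    eps: float = 1e-8
    # LR schedule
    lr_decay: bool = True
    warmup_iters: int = 100
    min_lr: float = 1e-5
    scheduler_type: str = "cosine"
    # Precision / hardware
    device: str = "cuda"
    dtype: str = "bfloat16"
    compile: bool = True
    # Generation
    temperature: float = 0.8
    top_k: int = 50
    top_p: float = 0.95
    max_new_tokens: int = 200

cfg = TrainConfig()  # override fields as needed, e.g. TrainConfig(n_layer=6)
```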