CSE 447 (NLP)
Yejin Choi, Winter 2024
Lecture 1 - Jan 3
- No final exam
- Four assignments (plus one optional)
- Final project
- Either open-ended research project or replicating/extending
- Small participation scores
- Course materials have been revamped because of LLMs
- Went over common use cases of LLMs, failure modes
Lecture 2 - Jan 5
Classical NLP day!
- Language modeling task: estimate a probability function for all
possible strings
- Idea: use an empirical distribution over training sentences
- No generalization (literally cannot be out-of-distribution)
- BoW (Unigram) models:
- Append a STOP token to all sentences
- Multiply probability of all tokens
- Not conditional
- Word order doesn’t matter
- Bigram models:
- Append a STOP token and prepend START
- Conditioned on previous token
- Same sort of autoregressive generation
- N-gram models:
- Condition on the last n tokens when generating next token
- Practically cannot go beyond 3 or 4-gram, Google got ~9-gram
- Learned size is: \(|\text{dict}|^n\)
- This is the MLE
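A minimal sketch of the bigram MLE described above, assuming sentences are already split into tokens. The marker strings `<START>` and `<STOP>` and the function names are illustrative choices, not anything fixed by the lecture.

```python
from collections import Counter, defaultdict

START, STOP = "<START>", "<STOP>"  # hypothetical marker names


def train_bigram(sentences):
    """Return MLE estimates p(next | prev) from tokenized sentences."""
    pair_counts = defaultdict(Counter)
    for tokens in sentences:
        padded = [START] + list(tokens) + [STOP]
        for prev, nxt in zip(padded, padded[1:]):
            pair_counts[prev][nxt] += 1
    # MLE: count(prev, next) / count(prev, anything)
    return {
        prev: {tok: c / sum(nexts.values()) for tok, c in nexts.items()}
        for prev, nexts in pair_counts.items()
    }


def sentence_prob(model, tokens):
    """Multiply the conditional probabilities; unseen bigrams get 0."""
    padded = [START] + list(tokens) + [STOP]
    prob = 1.0
    for prev, nxt in zip(padded, padded[1:]):
        prob *= model.get(prev, {}).get(nxt, 0.0)
    return prob


corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
model = train_bigram(corpus)
```

Any sentence containing a bigram that never appeared in training gets probability zero here, which is the lack of generalization that smoothing is meant to address.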
Lecture 3 - Jan 8
Finish classical NLP day!
- While the data we have for N-gram models doesn’t change, the data
per situation will decrease as N increases (and thus cause sparsity)
- “The Shannon Game”
- How well can we predict the next word?
- Given a prefix, what’s the next word?
- We use typical ML/DL data splits for performance evaluation
- Measures of fit:
- Likelihood, (negative) log-likelihood
- Perplexity:
- Inverse probability of the data, normalized by number of tokens
- \(PP_M(D') = p_M(D')^{-\frac{1}{N}}\), where \(N\) is the number of tokens in \(D'\)
- “Geometric mean”
- The amount of “surprise” in the data
- Exponentiated cross-entropy! \(PP(X) = \exp\left(-\frac{1}{N} \sum_{i=1}^N \log p_M(x_i \mid x_0, \dots, x_{i-1})\right)\)
- Example: perplexity of Uniform with \(i\) outputs is \(i\) (kinda like how hard it is to guess among \(i\) equally likely options)
- Perplexity represents the branching factor of data
- Frontier LLMs have perplexity \(<10\)
- Cross-entropy:
- For two prob distributions \(P, Q\)
you want to compare, cross entropy is: \(H(P,
Q) = -\mathbb{E}_p[\log q] = - \sum_{x \in X} p(x) \log q(x)\)
- Non-symmetric
- If we want to model natural language distribution \(L\) and fit \(P_M\) to minimize cross-entropy: \(H(L, P_M)\)
- Use Monte-Carlo estimate for \(L\), assume \(l(x_i) \simeq \frac{1}{N}\) over some
dataset \(D\) drawn from \(L\)
- Basically just saying all phrases have an equal chance of occurring
- \(H(L, P_M) \simeq -\sum_{i=1}^N
\frac{1}{N} \log p_M(x_i \mid x_0, \dots, x_{i-1})\)
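A small sketch of how the Monte-Carlo cross-entropy estimate and perplexity relate, assuming we already have the model's probability for each token in a held-out sequence. The function names are illustrative.

```python
import math


def cross_entropy(token_probs):
    """Monte-Carlo estimate: average negative log-probability per token."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def perplexity(token_probs):
    """Exponentiated cross-entropy, i.e. the inverse geometric mean
    of the per-token probabilities."""
    return math.exp(cross_entropy(token_probs))


# A uniform model over 8 outcomes assigns 1/8 to every token.
uniform_probs = [1 / 8] * 20
uniform_ppl = perplexity(uniform_probs)
```

For the uniform case the perplexity equals the number of outcomes, which matches the "branching factor" reading of perplexity.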
- Intrinsic evaluation: more about measuring facts about the
underlying model
- Extrinsic evaluation: more application and context focused
- Token distribution follows Zipf’s distribution, kinda power-law
Lecture 4 - Jan 10
- We can never have a probability of zero for models:
- Laplace smoothing: (add one)
- Add-k: add some \(0 < k < 1\)
- Better but still not used
- Convex combinations (?)
- Take combination of multiple distributions
- Better than add-k typically
- Optimal hyperparameter search: human guessing
- E.g. linear combination of 1-gram, 2-gram, 3-gram distributions
- Unknown token <UNK>, known as “UNC!”
- Token representing out-of-vocab tokens
- Model can generate <UNK> as well
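The smoothing ideas above can be sketched in a few lines. This assumes raw counts are already available; the function names and the example weights are made up for illustration.

```python
def add_k_prob(count, context_count, vocab_size, k=1.0):
    """Add-k estimate of p(token | context); k=1 is Laplace smoothing."""
    return (count + k) / (context_count + k * vocab_size)


def interpolate(probs, weights):
    """Convex combination of estimates from several models,
    e.g. the 1-gram, 2-gram and 3-gram probabilities of one token."""
    if min(weights) < 0 or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    return sum(w * p for w, p in zip(weights, probs))


# Unseen trigram and bigram, but the unigram has been observed:
p = interpolate([0.1, 0.0, 0.0], [0.2, 0.3, 0.5])
```

Because the unigram term is non-zero, the interpolated probability stays above zero even when the higher-order counts are empty.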
Finally done with lecture 2!!!
- 470,000 words in Webster’s
- Character level tokenization:
- Each token is a character
- BPE:
- Tokenize subwords, similar to a compression algorithm
- We kinda did this in CSE 143
- Greedily concatenating most common token-pairs starting from characters
- Common way to tokenize natural text: perform the same merges, in the same order, as when
designing the BPE encoding
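A toy sketch of the BPE procedure described above, assuming a word list as training data and characters as the starting vocabulary. Real tokenizers add pretokenization, byte fallback, and frequency-weighted word counts; the function names here are illustrative.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def learn_bpe(words, num_merges):
    """Greedily learn merges, starting from single characters."""
    corpus = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols in corpus:
            pairs.update(zip(symbols, symbols[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties: first pair seen wins
        merges.append(best)
        corpus = [merge_pair(s, best) for s in corpus]
    return merges


def tokenize(word, merges):
    """Apply the learned merges in the order they were learned."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


merges = learn_bpe(["low", "low", "lower", "newest", "newest"], 2)
```

Tokenizing new text replays the merge list in order, so a word never seen in training still decomposes into known subwords or, at worst, single characters.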
Lecture 5 - Jan 12
- Tokenizers really fit data with weird matchings
- Whitespace and capitalization will typically cause different tokens
- Tokenization variants:
- Strip whitespace, add spaces during decoding. Requires a “continue
word” special character.
- No pretokenization:
- Sentencepiece tokenization
- Generally used for non-english, char-based languages
- Byte-based variations!
- WordPiece: add bigram tokens by normalized probability of
- Tokenizer-free modelling
- One-hot encoding: vector of all zeros, with a one representing the token’s index
- Embeddings: basically a differentiable lookup table for token
- Have relationships in latent space
- Basic NN review:
Lecture 6 - Jan 17
Guest Lecture - Peter West
- “Large” in LLMs: ~GPT-2 size or greater
- Commonsense knowledge:
- Is commonsense knowledge too hard for compact models?
- Can we distill commonsense knowledge to smaller models?
- Knowledge distillation:
- Minimize cross-entropy between teacher and student models
- Can’t do all strings, instead do over subset of important data
- We get a knowledge graph (DS) and student model as output
- Used GPT-3 to produce massive corpus of commonsense examples:
- Human baseline producing examples: ~86%
- Produced much more (10X) data, with 10% drop in example accuracy
- Attempted to use RoBERTa to filter data
- Increased acc to 10% more than humans, while dropping examples to
about half
- Got final student model by finetuning GPT-2 on synthetic data
- Can improve over baseline GPT-3 model
- Humans are good supervisors, not creators (P-NP)
- Spoke about how models can generate better than reason about outputs
- Kinda a reverse P-NP vs humans
Lecture 7 - Jan 19
More about neural networks
- NNs:
- Required:
- Training data
- Model family
- Differentiable loss function
- Learning problem:
- \(\hat{\theta} = \arg\min_\theta \frac{1}{N} \sum_{i=1}^N
L(y_i, \hat{y}_i = f_\theta(x_i))\)
- Common loss functions:
- MSE loss
- L1 loss
- Binary cross-entropy loss: \(-[y \log
(\hat{y}) + (1 - y) \log (1 - \hat{y})]\)
- Cross-entropy loss: \(-\sum_{i=1}^C y_i
\log (\hat{y}_i)\)
- Just use gradient descent (on subbatches)
- Backprop uses the chain rule
- Softmax: \(\text{softmax}(y)_i = \frac{\exp
(y_i)}{\sum_{j=1}^d \exp (y_j)}\)
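A short sketch tying the softmax and cross-entropy formulas together, assuming a single example with a one-hot label. Subtracting the max logit is a standard numerical-stability trick and is not something the notes mention.

```python
import math


def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_loss(logits, target_index):
    """With a one-hot y, -sum_i y_i log(y_hat_i) is -log(y_hat[target])."""
    return -math.log(softmax(logits)[target_index])


probs = softmax([2.0, 1.0, 0.1])
```

The loss is small when the target class gets most of the probability mass and grows without bound as that probability approaches zero.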
Lecture 8 - Jan 22
It is okay to struggle and ask questions!
- RNNs have a hidden internal state
- This hidden state is modified by the input tokens at each time step
- The hidden state is used to generate new tokens at each step
- Training procedure:
- Generate logits, take loss between ground truth and logits
- Backprop and update weights
- SGD:
Lecture 9 - Jan 24
- Vanishing gradients:
- No learning when gradients are zero
- Same weights run many times in a row for RNNs
- If weights are small, grads shrink repeatedly to zero
- If weights are big, grads explode
- LSTM: has a hidden representation, with many gates determining updates
- LSTM has a forgetting problem
- Not parallelizable
- Transformers/Attention:
- All you need
- Similar to fuzzy lookup for databases
- Generates key, query, value vectors
Lecture 10 - January 26
- Query, key, and value vectors for each token are generated
- Scores are: \(S = \text{softmax}(\frac{QK^T}{\sqrt{d}})\)
- Outputs are the score-weighted sums of the value vectors
- Total output is: \[\text{softmax}(\frac{QK^T}{\sqrt{d}})V\]
- Limitations of transformers:
- No notion of token order: add position embeddings
- At input, add a “positional embedding” to each input token
- Sinusoidal embeddings, not learnable but can support any length
- Learned embeddings: positions can be learned, fixed length
- No nonlinearities: add FF layers between attention blocks
- Looking into the future: mask scores before softmax
- Transformers also have residual connections
- Multi-head attention:
- Run multiple “attention” operations in parallel, concatenate
- Can all be done in parallel
- Layer norm: normalize the values by layer
- Transformer encoder: don’t use masking (the decoder keeps the causal mask)
- Transformer encoder-decoder: we use cross-attention
- Decoder keys and values come from the encoder, queries from the decoder
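A dependency-free sketch of single-head scaled dot-product attention with an optional causal mask, assuming `Q`, `K`, `V` are lists of row vectors (one row per token). It skips the learned projections, multiple heads, residuals, and layer norm.

```python
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]


def attention(Q, K, V, causal=False):
    """Compute softmax(Q K^T / sqrt(d)) V row by row."""
    d = len(Q[0])
    out = []
    for i, q in enumerate(Q):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        if causal:
            # Mask future positions before the softmax.
            scores = [s if j <= i else -math.inf for j, s in enumerate(scores)]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, V))
                    for c in range(len(V[0]))])
    return out


V = [[1.0, 2.0], [3.0, 4.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
masked = attention(K, K, V, causal=True)
```

With the causal mask, the first token can only attend to itself, so its output is exactly its own value vector.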
Lecture 11 - January 29
- Much of this class is about learning how to get better, vs just the
- Supervised learning: struggles out of domain
- Self-supervised pretraining can be used for transformers
- Catastrophic forgetting: when fine-tuning, original performance degrades
- Parameter-efficient fine-tuning: only update a fraction of the parameters
- Adapter layers:
- Add additional layers within the model which are trained
- Prefix-tuning:
- Finetuning based on task-prefix prepended to the LLM
- The task-prefix is trained as parameters
- Very clever! Made by an early-year student!
- Prompt-tuning:
- Similar to prefix-tuning, fewer parameters
- LoRA (Low-Rank Adaptation):
- Learn low-rank “delta” matrix between ideal and current weights
- 10,000x fewer fine-tuned parameters
- 3x GPU memory saving for GPT-3
- Given pretrained weights \(W_0 \in \mathbb{R}^{d
\times k}\)
- Represent the update as \(\Delta W = BA\), using \(A \in \mathbb{R}^{r
\times k}, A \sim \mathcal{N}(0, \sigma^2)\) and \(B \in \mathbb{R}^{d \times r}, B = 0\)
- Only train \(A\) and \(B\)
- QLoRA is quantized LoRA
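A small sketch of the LoRA parameterization, assuming the adapted weight is the frozen pretrained matrix plus the product `B A`. The helper names, `sigma`, and the seed are illustrative, and no training loop is shown.

```python
import random


def matmul(X, Y):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)]
            for row in X]


def lora_init(d, k, r, sigma=0.01, seed=0):
    """A ~ N(0, sigma^2) with shape (r, k); B = 0 with shape (d, r)."""
    rng = random.Random(seed)
    A = [[rng.gauss(0.0, sigma) for _ in range(k)] for _ in range(r)]
    B = [[0.0] * r for _ in range(d)]
    return A, B


def effective_weight(W0, A, B):
    """W0 + B A; W0 stays frozen and only A and B would be trained."""
    delta = matmul(B, A)
    return [[w + dw for w, dw in zip(w_row, d_row)]
            for w_row, d_row in zip(W0, delta)]


def trainable_params(d, k, r):
    return r * k + d * r


W0 = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
A, B = lora_init(d=2, k=3, r=1)
W = effective_weight(W0, A, B)
```

Because `B` starts at zero, the adapted model is identical to the pretrained one at step zero. For a 4096 x 4096 layer with rank 8, `trainable_params` gives 65,536 trainable values against roughly 16.8M in the full matrix.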
- Encoder-only models:
- Mask particular words, predict the masked words (~15%)
- <CLS> token prepended, <SEP> token between each text
- Segment embeddings are added to each of the token sequences
Lecture 12 - January 31
- Predicts random masked words in the middle
- Large empirical gain
- Base: 110M, Large: 340M
- RoBERTa:
- Can continue to train BERT much, much more
- Pretty much always better than BERT
- Is easier to fine-tune vs BERT
- T5:
- Masked variable-length subsequences
- Reconstruct the sentences
- Decoder (GPT models):
- Can use special tokens to do many types of tasks
- Certain types of applications need differing open-endedness
- Natural Language Generation (NLG):
- Train on NLL with implicit teacher forcing
- Decoding:
- Take in probability distribution, predict which token is next
- Greedy Decoding:
- Beam search:
- At each depth, keep the \(n\) routes with
highest prob
- Often go-to decoding algorithm
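A compact sketch of beam search over a hand-written toy model, assuming a function that maps a token prefix to a next-token distribution. `toy_model`, the `<STOP>` string, and the probabilities are all made up for illustration.

```python
import math


def beam_search(next_token_probs, beam_width, max_len, stop="<STOP>"):
    """Keep the `beam_width` highest log-probability prefixes at each step."""
    beams = [((), 0.0)]  # (tokens, log-probability)
    for _ in range(max_len):
        candidates = []
        for tokens, logp in beams:
            if tokens and tokens[-1] == stop:
                candidates.append((tokens, logp))  # finished hypothesis
                continue
            for tok, p in next_token_probs(tokens).items():
                if p > 0:
                    candidates.append((tokens + (tok,), logp + math.log(p)))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
        if all(t and t[-1] == stop for t, _ in beams):
            break
    return beams


def toy_model(prefix):
    """Hypothetical next-token distribution used only for this example."""
    if not prefix:
        return {"a": 0.6, "b": 0.4}
    if prefix[-1] == "a":
        return {"x": 0.5, "y": 0.5}
    if prefix[-1] == "b":
        return {"z": 0.9, "x": 0.1}
    return {"<STOP>": 1.0}


greedy = beam_search(toy_model, beam_width=1, max_len=3)
wide = beam_search(toy_model, beam_width=2, max_len=3)
```

A beam width of 1 is greedy decoding and commits to `a` first. A width of 2 keeps `b` alive long enough to find the higher-probability sequence `b z`.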
Lecture 13 - February 2
- One of the interesting problems is how to generate text once we have
a model
- Many variations of beam search
- Most likely sequences are repetitive
- When repeating a phrase, it becomes more likely to be generated in
the future
- Can avoid repeating N-grams as a simple fix
- Contrastive decoding: find sequence \(x\) that maximizes \(\log P_\text{large LM}(x) - \log P_\text{small LM}(x)\)
- Humans don’t say high-prob things…why say something that’s
- Top-k sampling:
- Truncate the prob distribution to \(k\) tokens, sample from those tokens
- \(k\) can often be too large or
small depending on context
- Top-p (nucleus) sampling:
- Instead, sample from the smallest set of tokens covering the top \(p\) probability mass
- Softmax temperature:
- Divide logits by \(t\) before the softmax
- Temperature \(t > 1\) is more
uniform, \(t < 1\) is more peaked
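The three truncation and reshaping tricks above can be sketched as filters over one probability vector. This is a plain-Python illustration with made-up function names; a real decoder would apply these to the model's logits at every step.

```python
import math


def apply_temperature(logits, t):
    """Divide logits by t, then take the softmax."""
    scaled = [z / t for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_filter(probs, k):
    """Zero out all but the k most likely tokens and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = [], 0.0
    for i in order:
        keep.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return [probs[i] / mass if i in keep else 0.0 for i in range(len(probs))]


probs = [0.5, 0.3, 0.2]
```

After filtering, a token can be drawn with `random.choices(range(len(probs)), weights=filtered)`. Unlike top-k, the nucleus set grows or shrinks with how flat the distribution is.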
- Generate then re-rank:
- Generate \(n\) sequences with same
- Re-rank those sequences by some score (e.g. perplexity)
- Speculative decoding:
- Idea: not all tokens are as hard to generate! (and scoring a whole sequence
with the LLM in one forward pass costs about as much as generating one token)
- Use small model to generate \(k\)-length draft
- Run draft through large LM to compute token distribution
Lecture 14 - February 5
- Speculative decoding:
- Cases:
- \(q_i \geq p_i\) accept the token
- \(q_i < p_i\) accept with
probability \(\frac{q_i}{p_i}\)
- When rejecting and re-sampling, only sample from the normalized distribution
\(\max(0, q_i - p_i)\) (prob of big model minus
small model, keeping only the non-negative part)
- 2~2.5x decoding speedup
- Models must have the same decoding scheme
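A sketch of the accept/reject rule for a single drafted token, using the notes' convention that `p` is the small (draft) model and `q` is the big model. The dictionaries and function name are illustrative, and the loop over a `k`-length draft is omitted.

```python
import random


def speculative_step(draft_token, p_small, q_big, rng=random):
    """Accept or replace one drafted token.

    p_small / q_big map token -> probability under the small / big model.
    The draft token is assumed to have been sampled from p_small.
    """
    p, q = p_small[draft_token], q_big.get(draft_token, 0.0)
    if q >= p or rng.random() < q / p:
        return draft_token, True
    # Rejected: resample from the normalized positive part of (q - p).
    residual = {t: max(0.0, q_big[t] - p_small.get(t, 0.0)) for t in q_big}
    total = sum(residual.values())
    tokens = list(residual)
    weights = [residual[t] / total for t in tokens]
    return rng.choices(tokens, weights=weights)[0], False


p_small = {"a": 0.9, "b": 0.1}
q_big = {"a": 0.3, "b": 0.7}
```

The point of this rule is that the final token is distributed exactly according to the big model. In the example, `a` is drafted 90% of the time but accepted only a third of the time, giving 0.3 overall.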
- Issues (training):
- Teacher-forcing influences models to “trust” previous tokens
- MLE discourages unique text generation
- Unlikelihood training:
- Add another loss function which penalizes generation of certain
undesirable sequences
- Exposure bias:
- LLMs get fed in “good” tokens, not bad or random ones, unlike during generation
- Scheduled Sampling: with prob \(p\), use decoded token as next input vs
true token
- REINFORCE algorithm:
- Sample from the model, score the sequence, use score as reward to update the model
- \(L_{RL} = - \sum_{t=1}^T r(\hat{y}_t)
\log P(\hat{y}_t \mid y^*; \hat{y}_{<t})\)
- \(r(\cdot)\) reward model
- \(y^*\) input sequence
- \(\hat{y}\) sequence from model
given \(y^*\)
Lecture 15 - February 7
- RL with powerful models can “reward hack” while not improving at the
underlying task
- RLHF - use PPO with human scores
- Summarization: try using autoencoders
- We typically need multiple loss terms to account for fluency,
reconstruction, length
- We can’t backprop through token generation
- Solution: Gumbel-Softmax
- \(y_t \sim \text{softmax}(S_t) \simeq e_t
= \sum_{i=1}^{|V|} e(w_i)\, \text{softmax}\left(\frac{S_t + g}{\tau}\right)_i\), with Gumbel noise \(g\) and temperature \(\tau\)
- Evaluation metrics:
- Content-based overlap: often not great because failure to capture
- Model-based metrics:
- BERTScore: use embeddings from BERT
- Use vector sims
- Mauve: … TBC
Lecture 16 - February 9
- Mauve: measure of difference between human (\(P\)) and machine (\(Q\)) text distribution
- KL-divergence: \(KL(P \| R_\lambda) = \sum_x
P(x) \log \frac{P(x)}{R_\lambda (x)}\)
- Not symmetric: \(KL(P \| R_\lambda) \neq KL(R_\lambda \| P)\)
- Interpolate between the two distributions to draw a curve
- Human Evals:
- Humans have been gold standard, but crowdworkers are now using
- Alignment datasets:
- Synthetic data: use seq2seq LLMs to generate data from templates
- Convert existing NLP datasets to instruction-datasets
- Allows supervised fine-tuning as well
- Human annotation:
- Other methods involving getting data from frontier models
- Showed RL is powerful in NLP
- Human feedback models perform much better than supervised
- Pipeline:
- Collect human feedback between samples
- Train reward model from those samples
- Run PPO on LLM using reward model
- Use RL and PPO
- Can be done online by continuing to get more data samples, updating
reward model with new data
Lecture 17 - February 12
- Human preference:
- Easier to do classification (even as humans) vs regression
- Why RLHF is preference based
- Bradley-Terry model: model of probability one output is better than
the other
- Reward model:
- Can be multi-input classification
- Backbone models are often large LLMs
- Any type of model can be used, just by adding a linear head
- Typically only trained one epoch
- (Somewhat) low accuracy, ~60-80%
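A minimal sketch of the Bradley-Terry preference probability and the pairwise loss it implies for a reward model, assuming each response has already been mapped to a scalar reward. The function names are illustrative.

```python
import math


def preference_prob(reward_a, reward_b):
    """Bradley-Terry: P(a preferred over b) = sigmoid(r_a - r_b)."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))


def reward_model_loss(reward_chosen, reward_rejected):
    """Negative log-likelihood of the human's choice under Bradley-Terry."""
    return -math.log(preference_prob(reward_chosen, reward_rejected))


tie = preference_prob(1.0, 1.0)
```

Only the difference between the two rewards matters, which is why pairwise comparisons are enough to train the reward model without asking annotators for absolute scores.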
- RL with LLMs:
- Basic REINFORCE algo is exactly what you’d expect with LLMs
- Policy Gradient with regularization:
- Learns policy maximizing reward minus KL reg term
- Reference model is fixed for KL-divergence
- PPO:
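- Surrogate objective (similar to TRPO): \(\text{maximize}_\theta\ \hat{\mathbb{E}}_t \left[ \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)}
\hat{A}_t \right]\) subject to a KL constraint: \(\hat{\mathbb{E}}_t \left[ KL[\pi_{\theta_\text{old}}(\cdot \mid s_t), \pi_\theta(\cdot \mid s_t)] \right] \leq \delta\)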
- Real (clip) objective:
- Kinda fucked up to understand, I’ll look more at it though lol
- Also trains value network
Lecture 18 - Feb 14
- PPO uses a clipped objective which has similar effect as the KL constraint
- PPO training is a combination of CLIP, Value, and exploration
- CLIP was talked about a little bit above
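A sketch of the clipped surrogate term, assuming `ratios` holds the new-to-old policy probability ratio for each step and `advantages` holds the advantage estimates. The default `epsilon=0.2` is an assumed value, and the value-function and exploration (entropy) terms are left out.

```python
def ppo_clip_objective(ratios, advantages, epsilon=0.2):
    """Mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A) over time steps."""
    total = 0.0
    for r, adv in zip(ratios, advantages):
        clipped = min(max(r, 1.0 - epsilon), 1.0 + epsilon)
        # Taking the min makes the objective a pessimistic bound.
        total += min(r * adv, clipped * adv)
    return total / len(ratios)


gain_capped = ppo_clip_objective([1.5], [1.0])
```

With a positive advantage, pushing the ratio past `1 + epsilon` earns nothing extra, so there is no incentive to move far from the old policy in a single update. With a negative advantage and a ratio that already moved the wrong way, the unclipped (worse) value is kept.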
- DPO: last year, similar to RLHF
- There is closed form optimal policy for the RLHF objective
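- Intractable to compute directly (it needs a normalizing sum over all possible outputs), but very good
- We can rewrite the reward in terms of the policy so that the unknown normalizing term cancels in the Bradley-Terry model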
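A sketch of the per-pair DPO loss that results from that rewrite, assuming we have summed log-probabilities of the chosen and rejected responses under both the policy and the frozen reference model. The argument names and `beta=0.1` are illustrative assumptions.

```python
import math


def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """-log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    chosen_ratio = logp_chosen - ref_logp_chosen
    rejected_ratio = logp_rejected - ref_logp_rejected
    margin = beta * (chosen_ratio - rejected_ratio)
    return math.log(1.0 + math.exp(-margin))  # equals -log(sigmoid(margin))


at_init = dpo_loss(-10.0, -12.0, -10.0, -12.0)
```

When the policy equals the reference, the margin is zero and the loss is `log 2`. Raising the chosen response's log-probability relative to the reference, or lowering the rejected one's, reduces the loss, with no reward model or sampling loop involved.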
Lecture 19 - Feb 16
- DPO can perform as well as PPO empirically
- RLHF still can cause out-of-distribution errors
- Many sorts of LLM evals, leaderboards, head-to-head,
- Small models often copy style of larger LLMs, but not info
- LIMA: knowledge and capabilities are learned during pretraining,
alignment shifts what distribution it outputs
- Large pretraining allows in-context learning:
- Zero-shot: no examples given
- Few-shot: few examples given
- Ability scales with num parameters
- Extremely dependent on prompt formatting
- Scaling laws:
- Just look up the equations
- Chinchilla (DeepMind) are newer equations for scaling laws
Lecture … March 4
HPC with DL, from EleutherAI
- Bottlenecks with big models:
- Iteration time, takes long for inference
- Processor memory, models too big for one device
- Data Parallelism (exactly what you’d expect)
- Bigger batch sizes, much faster
- Averaging gradients (“AllReduce”):
- Introduces communication overhead
- Model Parallelism: (individual forward pass, average global
gradients, individual backwards pass)
- Memory requirements:
- \(M_\text{tot} = M_m + M_o + M_g + M_a\)
- \(k\): bytes per param
- \(p\): num params
- \(M_m = k \cdot p\) (model memory),
can be low-prec
- \(M_o = 3k \cdot p\) (optimizer
memory), high-prec
- \(M_g = k \cdot p\) (memory
gradients), can be either prec
- \(M_a = \text{long formula}\)
(memory activation), low-prec
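A back-of-the-envelope sketch of the first three terms, assuming a single `bytes_per_param` value `k` for all of them (the notes point out that in practice each term can use a different precision). Activation memory is left out because its formula is not given here.

```python
def training_memory_bytes(num_params, bytes_per_param):
    """M_m + M_o + M_g from the notes; activation memory M_a is omitted."""
    model = bytes_per_param * num_params
    optimizer = 3 * bytes_per_param * num_params
    gradients = bytes_per_param * num_params
    return {
        "model": model,
        "optimizer": optimizer,
        "gradients": gradients,
        "total_without_activations": model + optimizer + gradients,
    }


estimate = training_memory_bytes(num_params=10**9, bytes_per_param=4)
```

For a 1B-parameter model at 4 bytes per value this already comes to 20 GB before any activations, which is why optimizer state is usually the first thing to shard.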
- We can use different precisions for different types of memory
- Use FP16 for calculations, store as FP32
- Straight FP16 doesn’t converge
- Explanation: model parameters need to be in high-precision, but the
diffs are noisy and can be in low-precision
- Loss scaling:
- As many gradients can’t be represented in the FP16 range, scale the loss by a factor \(s\)
- Could be static, in practice use dynamic value
- PyTorch AMP does this automatically for you
- Start with a large scaling factor \(s\)
- For each iteration, scale the loss by \(s\) before the backward pass, then unscale the gradients by \(s\)
- After unscaling, if grads are NaN or inf, decrease \(s\) and skip the current step
- If grads are not NaN or inf for \(N\) steps, increase \(s\)
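A toy sketch of the dynamic loss-scaling loop described above, operating on a plain list of gradient values. The class name, the default constants, and the halve/double policy are assumptions for illustration, not a description of any particular library's internals.

```python
import math


class DynamicLossScaler:
    """Toy version of the scale / unscale / skip loop from the notes."""

    def __init__(self, init_scale=2.0 ** 16, growth_interval=2000, factor=2.0):
        self.scale = init_scale
        self.growth_interval = growth_interval  # the N in the notes
        self.factor = factor
        self.good_steps = 0

    def unscale_and_check(self, scaled_grads):
        """Return (unscaled_grads, ok); ok=False means skip this step."""
        grads = [g / self.scale for g in scaled_grads]
        if any(math.isnan(g) or math.isinf(g) for g in grads):
            self.scale /= self.factor  # overflow: back off
            self.good_steps = 0
            return grads, False
        self.good_steps += 1
        if self.good_steps >= self.growth_interval:
            self.scale *= self.factor  # stable for N steps: grow again
            self.good_steps = 0
        return grads, True


scaler = DynamicLossScaler(init_scale=8.0, growth_interval=2)
_, ok_after_overflow = scaler.unscale_and_check([float("inf")])
```

In a training loop the loss would be multiplied by `scaler.scale` before the backward pass, and the optimizer step would run only when `ok` is true.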
- Nvidia Tensor Cores: allows really fast FP16
- Requires params to be multiples of 8 - increasing to mod 8 can be
- Up to 8x speedup
- Memory bandwidth increases 2x via FP16
- DeepSpeed ZeRO (PT FSDP):
- Splits up values across devices
Lecture ??, March 6
- DeepSpeed ZeRO (lowers DP memory requirements):
- ZeRO-1/2 have same communication volume as DP, but ZeRO-3 requires
much more volume
- Slice model parameters, optimizers, activations over many GPUs
- Divides much of the memory requirements by number of slices
- Curriculum learning: increase sequence length over time
- ZeRO levels (each adds to the previous):
- ZeRO-1: optimizer states
- ZeRO-2: gradients
- ZeRO-3: model params
- GPU communication:
- Intra-node communication is very fast
- Inter-node communication can be slower
- Hybrid sharding strats:
- ZeRO-3 within a node, ZeRO-1 across nodes
- Increased bandwidth requirements via ZeRO-3 a big problem within a
- Kernel Fusion:
- Combine sequential kernel operations into one operation
- Reduces read-write issues
- What FlashAttention uses
- AllReduce is \(2N\) communication
- Tensor Parallelism:
- Idea: parallelize parts of a matrix operation between
- 1-D:
- Reduces communication and memory by factor of \(d\) (num device)
- High bandwidth cost - typically requires it to stay within a
- Activation checkpointing: recompute activations to save memory
- We do this because we’re often memory-constrained, not
- “Selective” checkpointing: recompute things which are easy to
compute but expensive to store
- Layer Parallelism: split layers between devices, fully sequential
- “Each GPU is in charge of its own layer”
- Pipeline parallelism: GPipe (?)
- Do layer parallelism with subsets of batches (microbatches)
- Allows for much higher utilization time
- Good with large batch sizes and many GPUs, as you can have more
microbatches, which lowers “unutilized bubble”
- Tensor parallelism is often done within a node, pipeline is done
between different nodes
- Gradient accumulation: don’t always update, go forward and backwards
multiple times then average gradients+update
- \(12 \cdot \text{num layers} \cdot
\text{hidden size}^2\) is rough estimate of number of transformer parameters
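A one-line sketch of that estimate with a worked example. The rule ignores embeddings, biases, and layer norms, and the 96-layer, 12288-hidden configuration is chosen here only because it is the commonly cited shape of a roughly 175B-parameter model.

```python
def approx_transformer_params(num_layers, hidden_size):
    """Rough count: 12 * num_layers * hidden_size^2 (ignores embeddings)."""
    return 12 * num_layers * hidden_size ** 2


big = approx_transformer_params(num_layers=96, hidden_size=12288)
small = approx_transformer_params(num_layers=12, hidden_size=768)
```

The factor of 12 comes from about `4 * hidden^2` weights in the attention projections plus about `8 * hidden^2` in a feed-forward block with a 4x expansion.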
- Megatron Parallelism: framework for 3-D parallelism