CSE 493 - Deep Learning
Course Website
- This course is heavily based on CS231n
- This course will primarily focus on NLP and Computer Vision
- Course has multiple parts:
- Deep learning fundamentals
- Practical training skills
- Applications
- Vision has been one of the drivers of early DL; it is important to understand the history
- Books can be helpful, but not necessary
- Gradescope has automatic and private testing
- Three psets, one midterm, and a final group project
Lecture 1 - March 28
- 543 million years ago, animals started to develop sight
- Camera obscura developed in 1545 to study eclipses
- Inspired da Vinci to create the pinhole camera
- 1959 Hubel & Wiesel found that we visually react to “edges” and
“blobs”
- Think of this as a “lower layer”
- Larry Roberts is known as the “Father of Computer Vision” - wrote
the first CV thesis
- In the 1960’s, MIT attempted to solve vision in a summer - this didn’t
happen
- David Marr introduced the idea of stages of visual recognition in
1970’s
- Edge detection became the next big push in CV
- In the 1980’s expert systems became popular
- These had heuristic rules made by domain “experts”
- Unsuccessful and caused the second AI Winter
- Irving Biederman came up with rules on how we view the world
- 1: We must understand components (objects and relationships)
- 2: This is only possible because we learn from seeing so many objects
- A 6 year old child has seen 30,000 objects
- We can detect animals in 150 ms
- We detect predators and the color red even quicker!
- Later-stage neurons allow us to detect complex objects or themes
- In the 1990’s, research started to focus on real-world images
- Algorithms were developed for grouping (1990’s) and matching
(2000’s)
- In 2001 came the first commercial success in CV
- Face detection, using ML and facial features
- In the 2000’s feature development was all the rage
- Histogram of oriented gradients (HOG) - in which directions do the edges point?
- We need an incredible amount of data - led to ImageNet
- 2009, had 22K categories and 14M images
- In 2012 AlexNet had breakthrough performance on ImageNet
- By 2015 all entries were DL-based and surpassed human performance
- In 1957 the Mark I Perceptron was created for character recognition
- Manually twisted knobs to tune (adjusted weights)
- Cannot be trained practically
- Backpropagation was developed in 1986
- LeNet is the architecture used by the Postal Service - 1998
- AlexNet is essentially the same architecture, scaled up
- DL was used in the early 2000’s to compress images
- Everything is homogenized now
- Transformers and backprop are the norm
- Data and compute are the differentiators
- Domains change, but core is often the same
- Hinton, Bengio, and LeCun won the Turing award in 2018
- Deep learning is its own course because of incredible growth
Lecture 2 - March 30
- Image classification (IC) is a core task in CV
- There are many challenges related to computer vision:
- Viewpoint variation
- Illumination
- Background clutter
- Occlusion
- Deformation
- Intraclass variation
- It is very difficult to implement an image classifier as “normal”
software
- Data-driven paradigm is better
- Use datasets to train a model
- MNIST
- CIFAR-10(0)
- ImageNet
- MIT Places
- Omniglot
- Use ML to train a classifier
- Evaluate the classifier on new images
- Nearest Neighbor classifier
- Training: memorize all data and labels
- Inference: predict label of “nearest” image
- \(O(1)\) train time, \(O(n)\) inference time
- It is a universal approximator with n -> infinity data
points
- What is “nearest” (distance)?
- L1 norm is bad (Manhattan)
- L2 norm is good (Euclidean)
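- As a rough sketch (not from the slides), a minimal NumPy nearest-neighbor classifier using the L2 distance; class and variable names are illustrative:

```python
import numpy as np

class NearestNeighbor:
    def train(self, X, y):
        # O(1) "training": just memorize the data (N x D) and the labels (N,)
        self.X_train = X
        self.y_train = y

    def predict(self, X):
        # O(N) per query: compare each test point against every training point
        preds = np.zeros(X.shape[0], dtype=self.y_train.dtype)
        for i in range(X.shape[0]):
            # L2 (Euclidean) distance to every training point
            dists = np.sqrt(np.sum((self.X_train - X[i]) ** 2, axis=1))
            preds[i] = self.y_train[np.argmin(dists)]
        return preds
```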
- Hyperparameters are choices (configs) of the model
- Finding hyperparams
- We should use train, val, and test datasets
- Cross validation: split data into folds - no dedicated val
necessary
- Curse of dimensionality: the number of data points necessary is
related to the dimension of data
- Linear classifiers take on the form: \(y =
Wx + b\)
- \(b\) is often omitted, instead
appending data vector with a one
- Parametric approach
- Learns a template and decision boundary
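- A minimal sketch of the linear scoring function \(y = Wx + b\) for CIFAR-sized inputs (shapes are illustrative assumptions):

```python
import numpy as np

D, C = 32 * 32 * 3, 10               # flattened image size, number of classes
W = 0.01 * np.random.randn(C, D)     # weights: one template per class
b = np.zeros(C)                      # bias: one offset per class

x = np.random.randn(D)               # one flattened image
scores = W @ x + b                   # one score per class; predict the argmax
# Equivalently, append a 1 to x and fold b into an extra column of W.
```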
Lecture 3 - April 4
- It’s cool that we have a classifier, but how do we make it
good?
- Loss function: defines how good or bad our classifier (weights) is
- Loss over dataset is the average of loss over examples
- \(x_i\), \(y_i\), and \(L_i\) are example data, output label, and
example loss
- Multiclass SVM loss:
- Make sure that prediction of correct label is greater than all
predictions for other labels by a given margin
- The exact value of delta doesn’t matter because the weights can simply rescale to compensate
- Linear-ish loss
- Issues: it is a piecewise function and doesn’t approach zero asymptotically
- Squaring the loss penalizes predictions more heavily the more incorrect they are
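- A sketch of the multiclass SVM (hinge) loss for one example, with margin delta = 1 (function name is illustrative):

```python
import numpy as np

def svm_loss_single(scores, y, delta=1.0):
    # scores: (C,) class scores for one example; y: index of the correct class
    margins = np.maximum(0, scores - scores[y] + delta)
    margins[y] = 0  # the correct class should not count against itself
    return np.sum(margins)
```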
- Regularization is important, we always want the simplest model (no
overfitting!)
- Often done by adding a fraction of the L1 or L2 norm of the weights
- L2 encourages spread-out weights
- Softmax classifier:
- Pushes each value between 0 and 1
- Ensures the sum of the softmaxed outputs is 1
- \(S_i = \frac{e^{y_i}}{\sum_j e^{y_j}}\)
- Scaling will determine how “peaky” the softmaxed scores are
- NLL is the negative log of the prediction of the correct label
- Cross entropy loss is the NLL of the softmax of the prediction of
the correct label
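- A numerically stable sketch of softmax plus cross-entropy for one example (names are illustrative):

```python
import numpy as np

def softmax_cross_entropy(scores, y):
    # scores: (C,) raw class scores; y: index of the correct class
    shifted = scores - np.max(scores)                  # subtract the max for stability
    probs = np.exp(shifted) / np.sum(np.exp(shifted))  # softmax: values in (0, 1), sum to 1
    return -np.log(probs[y])                           # NLL of the correct class
```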
- Optimization is gradient descent
- Find the partial derivatives of the loss with respect to the
weights
- We update each weight by the negative of its partial derivative, scaled by the learning rate
- Use a numeric gradient to gradient check
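- A centered-difference numeric gradient sketch for gradient checking, assuming a scalar loss function f of a weight array W:

```python
import numpy as np

def numeric_gradient(f, W, h=1e-5):
    # df/dW_i ~ (f(W + h*e_i) - f(W - h*e_i)) / (2h), one weight at a time
    grad = np.zeros_like(W)
    it = np.nditer(W, flags=['multi_index'])
    while not it.finished:
        idx = it.multi_index
        old = W[idx]
        W[idx] = old + h; f_plus = f(W)
        W[idx] = old - h; f_minus = f(W)
        W[idx] = old  # restore the original weight
        grad[idx] = (f_plus - f_minus) / (2 * h)
        it.iternext()
    return grad
```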
Lecture 4 - April 6
- Stochastic gradient descent uses minibatches to estimate the
gradient
- Typically pick your GPU, then use the biggest minibatch size that fits
- Linear classifiers aren’t very powerful (for many problems)
- They only learn one template and linear decision boundaries
- We can extract features, then fit a linear classifier
- Non-linear transformations (features) are necessary
- We have to manually create features…if only we could learn
them…
- Simple two-layer NN: \(f = W_2 max(0,
W_1x)\)
- \(x \in \mathbb{R}^D, W_1 \in
\mathbb{R}^{H \times D}, W_2 \in \mathbb{R}^{C \times H}\)
- We can think of this as learning templates, then learning
combinations of templates
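- A forward-pass sketch of the two-layer network \(f = W_2 \max(0, W_1x)\), with dimensions as defined above:

```python
import numpy as np

D, H, C = 3072, 100, 10              # input dim, hidden size, number of classes
W1 = 0.01 * np.random.randn(H, D)    # first layer: learns H templates
W2 = 0.01 * np.random.randn(C, H)    # second layer: combines the templates

x = np.random.randn(D)
hidden = np.maximum(0, W1 @ x)       # ReLU activation
scores = W2 @ hidden                 # class scores
```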
- There are many different activation functions:
- ReLU, Sigmoid, Leaky ReLU are (historically) popular ones
- Activation functions are necessary because multiple consecutive
linear transformations can be represented as just one
transformation
- Biological neurons are quite different!
- We use computational graphs and backpropagation to differentiate
(backwards pass)
- Backprop is simply the chain rule - a lot
Lecture 5 - April 11
- Vector-vector backprop includes the Jacobian matrix for local
operations
- ImageNet helped people realize that data was super important!
- Convolutional NNs are a way of including spatial information
- CNNs are ubiquitous within vision
- While FC nets flatten images, CNNs preserve spatial structure
- Filters must be the same depth as the input image (i.e. 3 for
RGB)
- Slide over the entire image, flatten each part of the image it is
above, dot product with filter
- Stride is the offset between successive filter positions
- The number of filters is the number of activation maps (and the
number of output channels)
- ConvNet is simply many convolutional layers with activation
functions in between
- Earlier layers learn low-level features, later layers learn
higher-level features
- Ends with a simple FC classifier
- There are interspersed pooling layers to downsample
- Output size of an \(N \times N\)
image with filter size \(F \times F\)
is: \((N - F) / stride + 1\)
- Often images will be padded with zero pixels
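- A small helper for the output-size formula above, with a worked example (padding is counted on both sides):

```python
def conv_output_size(N, F, stride=1, pad=0):
    # (N + 2*pad - F) / stride + 1, assuming the division is exact
    return (N + 2 * pad - F) // stride + 1

# A 32x32 input with a 5x5 filter, stride 1, and pad 2 stays 32x32
assert conv_output_size(32, 5, stride=1, pad=2) == 32
```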
- 1x1 convolutional layers increase or reduce depth in channel
space
Lecture 6 - April 13
- Training loop:
- Sample a batch of data
- Forward prop through the graph to get loss
- Backprop through the graph to compute the gradients
- Update the params using the gradient
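- The same four steps as a minimal PyTorch sketch; the toy model, random data, and optimizer here are only illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# toy model, data, and optimizer just to make the loop runnable
model = nn.Linear(3072, 10)
loader = [(torch.randn(64, 3072), torch.randint(0, 10, (64,))) for _ in range(10)]
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

for x, y in loader:                    # 1. sample a batch of data
    scores = model(x)                  # 2. forward prop through the graph to get the loss
    loss = F.cross_entropy(scores, y)
    optimizer.zero_grad()
    loss.backward()                    # 3. backprop to compute the gradients
    optimizer.step()                   # 4. update the params using the gradients
```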
- Before you train:
- Activation functions
- Preprocessing
- Weight initialization
- Regularization
- Gradient checking
- Training dynamics:
- Babysitting the learning process
- Param updates
- Hyperparam updates
- Evaluation:
- Model ensembles
- Test-time augmentation
- Transfer learning
- Sigmoid function: \(\sigma(x) = \frac{1}{1
+ e^{-x}}\)
- Used to be popular
- Saturated gradient at high positive and negative values
- All gradients squashed by at least a factor of 4
- Outputs aren’t zero-centered which means local gradient is always
positive
- Weights will now all change in the same direction
- Computationally expensive!
- Tanh function:
- Zero centered
- Still has the problem of saturating gradients
- ReLU (rectified linear unit): \(f(x) =
max(0, x)\)
- Most common activation function
- No saturation in positive region
- Computationally efficient
- Converges (6x) faster than sigmoid or tanh in practice
- Not zero centered
- Can lead to “dying” ReLUs
- To prevent, often initialize biases with small positive value
(e.g. 0.01)
- Leaky ReLU: \(f(x) = max(0.01x,
x)\)
- Same as ReLU, but with a small gradient in the negative region
- Parametric Rectifier (PReLU) where scaling is a learnable
parameter
- Exponential Linear Units (ELU):
- “Better”, but more expensive
- Scaled exponential linear units (SELU):
- Works better for larger networks, has a normalizing property
- Can use without BatchNorm
- Holy Heck Math
- “Cool”
- Maxout:
- Max of multiple linear layers
- Multiplies the number of parameters :(
- Swish:
- RL-created activation function
- GeLU:
- Derived by adding randomness (random zeroing) to ReLU, then taking the expectation
- Main activation function used (esp. in transformers)
- Use ReLU, GeLU if using transformers, and try ReLU variants
- We often zero-mean and normalize our data as preprocessing
- Sometimes preprocessing involves PCA and whitening
- ResNet subtracted mean across channels and normalized with standard
deviation
- A constant weight initialization leads to all the values being the
same
- Initializations that are too large or too small lead to extreme
saturation or clustered outputs
- “Xavier” initialization: \(std =
\frac{1}{\sqrt{D_{in}}}\)
- Good with Tanh
- Attempts to keep output variance similar to input variance
- “Kaiming” initialization: \(std =
\sqrt{\frac{2}{D_{in}}}\)
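- A sketch of both initializations for a fully connected layer (shapes are illustrative):

```python
import numpy as np

D_in, D_out = 512, 256
# Xavier: std = 1/sqrt(D_in), keeps output variance close to input variance (good with tanh)
W_xavier = np.random.randn(D_in, D_out) / np.sqrt(D_in)
# Kaiming: std = sqrt(2/D_in), accounts for ReLU zeroing half the activations
W_kaiming = np.random.randn(D_in, D_out) * np.sqrt(2.0 / D_in)
```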
- Batch Normalization:
- “Things break when inputs don’t have zero mean and \(std = 1\)” - “Why not just force
that?”
- Subtract the batch mean, divide by the square root of the batch variance (plus a small epsilon)
- It is differentiable!
- We typically place these before each non-linearity
- We also have two learned parameters, gamma and beta corresponding to
scaling and shifting
- We keep running averages of the mean and variance across the training process (used at test time)
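- A training-time BatchNorm sketch over an (N, D) batch, with the learned gamma/beta and running statistics described above (the momentum value is an assumption):

```python
import numpy as np

def batchnorm_train(x, gamma, beta, running_mean, running_var, momentum=0.9, eps=1e-5):
    # x: (N, D) batch; gamma, beta: learned scale and shift of shape (D,)
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)   # force zero mean, unit variance per feature
    out = gamma * x_hat + beta              # learned scaling and shifting
    # running statistics are what gets used at test time
    running_mean = momentum * running_mean + (1 - momentum) * mu
    running_var = momentum * running_var + (1 - momentum) * var
    return out, running_mean, running_var
```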
Lecture 7 - April 18
- BatchNorm
- Becomes a linear layer during inference
- “Resets” the mean and standard deviation after linear layers shift them
- Allows higher learning rate and better gradient flow
- Acts as regularization during training
- Different behavior during training and testing! Can be a common
bug!
- For CNNs, batch norm across channels (output is: \(1 \times C \times 1 \times 1\))
- LayerNorm:
- Normalize across each example!
- Instance normalization is for CNN across height and width
- Good for segmentation and detection
- There are a ton of ways to normalize in similar ways
- Vanilla gradient descent: calculate gradient, move towards the
negative gradient
- Issues related to getting stuck in saddle points
- Stochastic, so descent can be extremely noisy
- Momentum keeps optimization moving in the same direction:
- \(v_{t+1} = \rho v_t + \nabla
f(x_t)\) then update using \(v_{t+1}\)
- Common values for \(\rho\) are
\(0.9\) and \(0.99\)
- Momentum will often overshoot minima
- Nesterov momentum: take the velocity update before taking
derivative
- Good, but the gradient is computed at a shifted point, so we use a different
formulation where \(\tilde{x}_t = x_t + \rho v_t\):
- SGD + momentum or Nesterov are often what is used in practice
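- A sketch of the SGD + momentum update on a toy quadratic loss (the loss, rho, and learning rate are illustrative assumptions):

```python
import numpy as np

x = np.array([5.0, -3.0])      # parameters
v = np.zeros_like(x)           # velocity
rho, lr = 0.9, 1e-1            # common momentum value, illustrative learning rate

for t in range(100):
    dx = x                      # gradient of the toy loss f(x) = 0.5 * ||x||^2
    v = rho * v + dx            # velocity accumulates a running direction
    x = x - lr * v              # step along the negative velocity
```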
- AdaGrad sums squared gradients, dividing each gradient by the square root of that running sum:
- This leads to weights with large gradients getting updated more slowly and weights with small gradients more quickly
- It also quickly decays all learning rates to zero - a problem for
neural networks
- RMSProp (Leaky AdaGrad) is AdaGrad but with a decay to previous
squared grads - similar to a running average
- Keeps step sizes relatively constant
- Common decay rate is \(0.99\)
- Doesn’t overshoot as much
- More computationally expensive
- Adam: combine the best of RMSProp and momentum
```python
import numpy as np

# Adam (simplified): first_moment ~ momentum, second_moment ~ RMSProp
first_moment = 0
second_moment = 0
for t in range(1, num_iterations):
    dx = compute_gradient(x)
    first_moment = beta1 * first_moment + (1 - beta1) * dx
    second_moment = beta2 * second_moment + (1 - beta2) * dx * dx
    first_unbias = first_moment / (1 - beta1 ** t)    # bias correction
    second_unbias = second_moment / (1 - beta2 ** t)
    x -= learning_rate * first_unbias / (np.sqrt(second_unbias) + 1e-7)
```
- There is a bias correction to prevent large early step sizes from
instantly destroying initialization
- Typical hyperparams are beta1 = 0.9 and beta2 = 0.999
- L2 regularization and weight decay are the same when using SGD (with
momentum), but different for Adam, AdaGrad, RMSProp
- AdamW: the go-to optimizer
- Adam with weight decay decoupled from the \(L_2\) regularization term
- Allows user to choose if they want weights to be a part of the
second moment or not
- There are second-order optimization techniques where you use the (inverse) Hessian to pick the step direction
- This is typically intractable due to the number of parameters
- AdaGrad is actually a special case of a second-order optimization
technique where we assume the Hessian is diagonal
- Second-order optimization is not typically used in practice
Lecture 8 - April 20
- Learning rate decay: scaling down the learning rate over time
- Necessary to decrease loss beyond a certain point
- Hyperparameter choice is extremely important
- There are many learning rate schedulers:
- “Step” down the learning rate after a fixed number of epochs
- Leads to massive drops in loss followed by plateaus repeatedly
- Cosine learning rate decay: \(\alpha_t =
\frac{1}{2}\alpha_0(1+cos(\frac{t\pi}{T}))\)
- \(\alpha_0\): initial learning
rate, \(\alpha_t\) learning rate at
step \(t\), \(T\) total number of steps
- Constant decrease in loss over time
- Linear learning rate decay
- Inverse square root decay
- Constant learning rate (no decay)
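- The cosine schedule from the list above as a one-liner (alpha_0 and T are illustrative):

```python
import math

def cosine_lr(t, alpha_0=1e-3, T=10000):
    # alpha_t = 0.5 * alpha_0 * (1 + cos(t * pi / T))
    return 0.5 * alpha_0 * (1 + math.cos(t * math.pi / T))
```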
- Learning rate warmup: spending first iterations increasing learning
rate
- A large initial learning rate can blow up our carefully chosen weight initialization
- Linearly increase learning rate over about 5 epochs (~5000
iterations)
- If you increase batch size by \(N\), also increase learning rate by \(N\)
- Validation data is a good way of monitoring how the model is doing
- Early stopping is ending model training when validation loss plateaus
- Training multiple models and then averaging results is a model
ensemble:
- Exhibits ~2% better performance in the real world
- Different models overfit to different parts of the dataset, it is
averaged out
- Regularization techniques previously covered: L2, L1, weight
decay
- Dropout: every forward pass, set each neuron’s activation to zero with probability
\(p\)
- Increases redundancy across the entire network, prevents some
overfitting
- Common value of \(p = 0.5\)
- An interpretation of dropout is an ensemble of models with shared
parameters
- At test time, multiply activations by \(p\); or instead multiply by \(\frac{1}{p}\) during training so test time needs no change (inverted dropout)
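- A sketch of inverted dropout with keep probability p, so nothing needs to change at test time (names are illustrative):

```python
import numpy as np

p = 0.5  # probability of keeping a unit

def dropout_train(x):
    # drop units at random, then rescale by 1/p so expected activations match test time
    mask = (np.random.rand(*x.shape) < p) / p
    return x * mask

def dropout_test(x):
    # with inverted dropout, test time is just the identity
    return x
```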
- A common pattern with regularization is adding randomness during
training and averaging out during testing
- Seen in batch norm and dropout for example
- Data augmentation is a common form of regularization:
- Transforming the input in such a way that the label is the same
- Some example image augmentation techniques:
- Flipping an image horizontally
- Random crops and scales of an image
- Testing: average a fixed set of crops of a test image
- Change contrast or color of images
- Stretching or contorting images
- We are now training models to learn good data augmentation
techniques
- DropConnect: randomly set weights (connections) to zero, instead of activations as in dropout
- Fractional pooling pools random regions of each image
- Testing: average predictions over multiple regions
- Stochastic depth: skip entire layers in a network using residual
connections
- Cutout: randomly set parts of image to average image color
- Good for small datasets, not often used on large ones
- Mixup: blend both training images and training labels by an amount
- Why does it work? Who cares!
- CutMix: replace random crops of one image with another while
combining labels
- Label smoothing: set target label to: \(1
- \frac{K-1}{K}\epsilon\) and other labels to: \(\frac{\epsilon}{K}\) for \(K\) classes
- In practice:
- Use dropout for FC layers
- Using batchnorm is always good
- Try Cutout, MixUp, CutMix, Stochastic Depth, Label Smoothing for (a
little) extra performance
- Grid search is okay for hyperparameter search
- Random search is better:
- Use log-uniform randomness in given range
- Likely because some hyperparameters matter more than others, so random search gives more opportunities to hit good values than grid search
- How to choose hyperparameters without Google-level compute:
- Check initial loss: sanity check, turn off weight decay
- Overfit on a small sample (5-10 minibatches)
- Make some architectural decisions
- Loss not decreasing? LR too low, bad weight init
- Loss NaN or Inf? LR too high, also bad weight init
- Search for learning rate for ~100 iterations
- Good LR to think about: 1e-1 to 1e-4
- Coarse search: add other hyperparams and train a few models for ~1.5
epochs
- Good weight decay to try: 1e-4, 1e-5, 0
- Pick the best models from the coarse search, train for ~10-20 epochs without learning rate decay
- Look at learning curves and adjust
- Might need early stopping, adjust regularization, larger model, or
keep going
- Flat start to loss graph means bad initialization
- When learning rate plateaus, add a scheduler (cosine!)
- Use a “command center” like Weights and Biases (or add your
own!)
- Return to step 5!
- Linear classifiers are easy to visualize because they have only one filter (template) per class
- Early layers in deep learning networks are also easy to visualize,
these are filters
- The last layers are an embedded representation of the inputs
- The embeddings are much better for KNN
- Google search:
- Embeds search into a 100 or 200 dimensional vector, then runs KNN on
a massive database
- Needs massive compute however
- We can use PCA or t-SNE to visualize the outputs of the last layer of a network
- Good to use KNN rather than simply labels so we can see what is
happening under the hood
- We can visualize activation maps for CNNs
- To find what activates a neuron the most, run image patches from the dataset through the model and see which patches produce the strongest activation for that neuron
- Occlusion using patches is another way of visualizing which pixels
matter, can be graphed for a cool image
- Shapley values are a way of using multiple patches to determine
important areas of the image
- Backprop to the image pixels to compute saliency maps
- (Accidentally) gives surprisingly good image segmentation
Lecture 9 - April 25
- Saliency maps allow us to view biases within misclassifications
- Clamp gradients to only negative values
- We can also “backprop” a gray image to make an example image
- Gradient ascent and a heck of a lot of regularization
- Adversarial examples:
- Start from an arbitrary image
- Pick an arbitrary class
- Modify the image to improve class scores
- Repeat!
- Black box attacks:
- Adding random noise to images and models get confused…
- Supervised learning is insanely expensive:
- Labeling ImageNet’s 1.4M images would cost more than $175,000
- Unsupervised learning: model isn’t told what to predict
- Self-Supervised learning: model predicts some naturally occurring
signal in the data
- The goal is to train on a cheap “pretext” task that forces the model to learn important features
- Target is something that is easy to compute
- For example, start with autoencoder, then fine-tune
- Three main types of pretext tasks:
- Generative
- Discriminative
- Multimodal
- Input video, output audio
- Pretext task performance is irrelevant
- Often just toss a linear classifier on the back of the encoder
- Generative self-supervised learning:
- Generate some data from an example
- Computers (NNs) are not rotation-invariant:
- Allows us to predict rotation as a pretext task
- The learned attention maps are similar to what supervised learning finds
- Predict relative patch locations:
- Break image into 3x3 grid, pass in center square and an outer square
and predict location
- CNNs are size invariant
- Jigsaw puzzle:
- Reorder patches according to correct permutation
- Inpainting:
- Pass in image with missing patch, predict the missing patch
- Adding adversarial loss increases image recreation quality
- Image coloring:
- Use “LAB” coloring, pass in L (grayscale) and predict AB
(color)
- Split-brain autoencoder:
- Predict color from light, predict light from color
- Video coloring:
- Given colored start frame and grayscale images from then on, predict
the colors
- Uses attention mechanism
- Contrastive learning:
- Create many different examples from an original example
- These examples are going to be in the same semantic space
- Learn which examples are closer or further away from each other
Lecture 10 - April 27
- Contrastive learning:
- Get a reference example \(x\)
- Create transformed or augmented examples from \(x\), called \(x^+\)
- Label all other examples \(x^-\)
- Maximize \(score(f(x), f(x^+))\)
and minimize \(score(f(x),
f(x^-))\)
- Loss function is derived from softmax
- “Every single instance is its own class”
- Uses cosine similarity
- \(\frac{u^T v}{||u|| ||v||}\)
- Generate positive samples through simple data augmentation
- SimCLR Framework:
- Make two simple transformations from the example
- Run each transformed example through a NN to get representation
- Run each representation through a simple linear classifier or
MLP
- Maximize the cosine similarity of those outputs
- Example transformations for images:
- Random cropping
- Random color distortion
- Gaussian blur
- Create a minibatch matrix (\(2N\)
by \(2N\)) of alternating
example-transformed images:
- Run it through the model
- Take the cosine similarity of the matrix with itself
- The \((2k, 2k+1)\) and \((2k+1, 2k)\) scores should be positive,
everything else should be negative
- Diagonal will always be 1
- See slides for illustration
- SimCLR works extremely well with very large batch sizes
(64,000+)
- We don’t want to directly expose the representation to the loss
function, so we use a MLP head
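- A simplified sketch of a SimCLR-style (NT-Xent) loss, assuming the batch is arranged so rows 2k and 2k+1 are the two augmented views of example k; the temperature is an illustrative choice:

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z, temperature=0.1):
    # z: (2N, d) projection-head outputs; rows 2k and 2k+1 are views of the same example
    z = F.normalize(z, dim=1)                 # unit vectors -> dot product = cosine similarity
    sim = z @ z.t() / temperature             # (2N, 2N) similarity matrix
    sim.fill_diagonal_(float('-inf'))         # an example is never its own positive
    # the positive for row 2k is row 2k+1 and vice versa; everything else is a negative
    targets = torch.arange(z.shape[0], device=z.device) ^ 1
    return F.cross_entropy(sim, targets)
```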
- MoCo: Momentum Contrastive Learning
- Keep a running queue of keys (negative examples) for all images in
the batch
- If we have 1000 examples and 2000 negatives, it is \(1000 \cdot 2000\)
- Update the encoder only through the queries (reference images)
- Makes the momentum encoder much less computationally expensive
- We update the momentum encoder via: \(\theta_k \leftarrow m \theta_k + (1 - m)
\theta_q\)
- Slowly aligns the two networks
- Uses cross-entropy loss
- Treats each negative as a class
- One correct label (this comes from the two parallel transformations)
and the incorrect labels (negatives) are from the queue
- These don’t need to be run in parallel or there could be a
concatenation
- MoCo V2: Add a non-linear projection head and better data
augmentation
- DINO: do we need negatives? (very recent)
- Reformulates contrastive learning as a distillation problem
- Teacher model is like the momentum encoder
- Running average of the student model
- Sees a global view augmentation of the image
- Student model only sees cropped augmentation of the image
- DINO training tricks for the teacher:
- Center the data by adding a mean
- Sharpen the distribution towards a certain class - like a
temperature
- Has the effect of making the teacher be a bit of a classifier
- DINO V2 released this week!
- Contrastive Predictive Coding (CPC)
- Contrastive: contrast between correct and incorrect sequences using
contrastive learning
- Predictive: model must predict future patterns based on current
- Coding: model must learn useful feature vectors
- We give the model context, then a correct continuation sequence and
many incorrect possible sequences
- First encode all images into vectors: \(z_t = g_{enc}(x_t)\)
- Summarize all context into a context code \(c_t\) using an autoregressive model \(g_{ar}\)
- Compute InfoNCE loss between context \(c_t\) and future code \(z_{t+k}\) using scoring function:
- \(s_k(z_{t+k}, c_t) = z^T_{t+k} W_k
c_t\)
- \(W_k\) is a trainable matrix
- CLIP is a contrastive learning model
- We can sequentially process non-sequential data - think about how we
observe an image
- Variable length sequences are tough to work with for basic NNs
- Recurrent Neural Networks contain an internal “state”, a summary or
context of what’s been seen before
- RNN formulation: \(h_t = f_W(h_{t-1},
x_t)\) (repeat as necessary!)
- Typical autoregressive loss function - run everything through, make
predictions, then sum/mean the losses from those predictions
Lecture 11 - May 2
- Vanilla RNN learns three weight matrices:
- Transforms input
- Transforms context state (sum these two transforms, then apply tanh)
- Transforms context state to output
- Hidden representation is commonly initialized to zero
- One could learn the initial hidden state, but it’s not necessary
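- A single-step sketch of this vanilla RNN (the bias terms are an extra assumption):

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy, b_h, b_y):
    # h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h): sum the two transforms, then tanh
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)
    y_t = W_hy @ h_t + b_y           # third matrix: transform the context state to an output
    return h_t, y_t
```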
- Sequence length is an assumption we often have to make during
training
- Encoder-decoder architecture for sequences encodes sequence into a
representation, then decodes into a sequence from that representation
- Think language translation, original attention paper
- One hot vector is a vector of all zeros and a one corresponding to a
single class
- We use an embedding layer - computationally inexpensive (indexing),
but keeps gradient flowing!
- Test time - autoregressive model (I wrote about this!), similar to a
phone’s text autocomplete
- We truncate sequences, these act as minibatches
- We get surprising emergent behaviors from sequence modeling
- RNN advantages:
- Can process sequential information
- Can use information from many steps back
- Symmetrical processes for each step
- RNN disadvantages:
- Recurrent computation is slow
- Information is lost over a sequence in practice
- Exploding gradients :(
- Image classification: combine RNNs and CNNs
- Take a CNN and remove the classification head
- Use the image representation as the first hidden representation in
an RNN
- Repeatedly sample tokens until the <END> token is produced
- Question answering: use RNN and CNN to generate representations,
learn a compression, then softmax across the language
- Agents can learn instructions and actions to take in a language or image based environment (same principles as before)
- Multilayer RNNs: stacking layer weights and adding depth
- Vanilla RNN gradient flow: look up the derivation!
- Gradients will vanish as the tanh squishes the gradient for each
step
- Even without tanh, this problem is repeated - gradients will explode
or vanish
- Note: normalization doesn’t work well with RNNs, active field of
research
- To combat exploding gradients we can use gradient “clipping”,
scaling the gradient if its norm is too large
- However, there is no good solution for vanishing gradients
- LSTMs (Long Short Term Memory) help solve the problems within RNNs
- LSTMs keep track of two values: a cell memory and the next hidden representation
- Long and short term memory!
- Usually both initialized to zero
- LSTMs produce four gate outputs (\(4h\) values) from the hidden state \(h\) and the input vector from below \(x\)
- This can be done in parallel through matrix multiplication
- Three of these outputs are passed through a sigmoid nonlinearity,
the fourth is passed through a tanh
- Cell memory: \(c_t = f \cdot c_{t - 1} + i
\cdot g\), \(h_t = o \cdot
tanh(c_t)\)
- The “info gate” (output of tanh) determines how much to write to
cell: \(g\)
- The “input gate” \(i\) determines
whether to write to cell
- The “forget gate” \(f\) determines
how much to forget or erase from cell
- The “output gate” \(o\) determines
how much to reveal cell at a timestep
- LSTMs create a “highway network” for gradients to flow back over time through the cell memory
- LSTMs preserve information better over time, however there’s no
guarantee
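- A single-timestep LSTM sketch following the gate definitions above; the combined weight matrix W (shape 4h x (h + d)) produces all four gate pre-activations in one multiply:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    h = h_prev.shape[0]
    a = W @ np.concatenate([h_prev, x_t]) + b   # all 4h pre-activations at once
    i = sigmoid(a[0:h])        # input gate: whether to write to the cell
    f = sigmoid(a[h:2*h])      # forget gate: how much of the old cell to erase
    o = sigmoid(a[2*h:3*h])    # output gate: how much of the cell to reveal
    g = np.tanh(a[3*h:4*h])    # info gate: what to write
    c_t = f * c_prev + i * g
    h_t = o * np.tanh(c_t)
    return h_t, c_t
```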
- Residual (type) connections are very popular and widespread
throughout deep learning!
- Neural architecture search for LSTM-like architecture was popular in
the past, but no more
- GRUs (Gated Recurrent Unit) are inspired by LSTMs
- Quite simple and common because of ease of training
- LSTMs were quite popular until this year, however transformers have risen in popularity
Lecture 12 - May 9
- Recurrent image captioning is constrained by the size of the image
representation vector
- Attention: use a different context vector at each timestep
- Use some function to get relationship scores between different elements in a vector
- Softmax to normalize the scores
- Use these scores to create output context vector (multiply by some
value vector)
- This is just scaled dot-product attention
- Also works for interpretability: allows for bias correction as
well
- Attention Layer:
- We matmul a query and key to create scores
- Then softmax the scores to normalize
- Multiply the scores by the values, then sum
- Query, key, and value come from linear transformations
- Notes:
- We need to use masked-attention for sequence problems
- We need to inject positional encodings as attention layers are
position-invariant
- Simply add positional vectors (often sinusoidal)
- We also scale the dot-product by dividing by \(\sqrt{d}\)
- Important for autoregressive problems
- We love torch.triu
- Self-attention doesn’t require separate inputs for queries, keys, and values; they all come from the same input
- Multi-head self attention:
- Split the dimensionality up, use attention for each part, then
concatenate
- One of the biggest issues: attention is an \(n^2\) memory requirement
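- A masked, scaled dot-product self-attention sketch in PyTorch covering the steps above (single head; multi-head would split d across heads); shapes and names are illustrative:

```python
import torch
import torch.nn.functional as F

def self_attention(x, W_q, W_k, W_v, causal=True):
    # x: (n, d) sequence; queries, keys, and values come from linear transforms of the same input
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = q @ k.t() / q.shape[-1] ** 0.5          # scaled dot products, (n, n)
    if causal:
        # mask out future positions (torch.triu) for autoregressive problems
        mask = torch.triu(torch.ones_like(scores, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float('-inf'))
    attn = F.softmax(scores, dim=-1)                 # normalize the scores
    return attn @ v                                  # weighted sum of the values
```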
- Transformers: sequence to sequence model (encoder-decoder)
- Encoder:
- Made up of multiple “encoder blocks”
- Each encoder block has a multi-head self-attention block (with
residual connection), layernorm, MLP (with residual connection), then a
second layer norm
- Decoder:
- Made up of multiple “decoder blocks”
- Masked multi-head self attention (residual), layernorm,
cross attention (with encoder output and residual), layernorm, MLP
(residual), second layernorm
- CNNs are often replaced by transformers which split data up into
patches
- Good if you have a lot of data; with small data you want to use CNNs
- Vision transformer (ViT)
- Image captioning can work now with purely transformer-based
architectures
- Transformer (size) history: (see slides)
- Started with 12 layers, 8 heads, 65M params
Lecture 13 - May 11
- LeNet, small network of convolutional layers
- We pool to downsample images
- AlexNet: bigger model
- Trained split across two GPUs
- First use of ReLU
- Data augmentation
- Dropout 0.5
- 7-model ensemble
- VGGNet: smaller filters, deeper network
- GoogLeNet: added multiple paths between layers (InceptionNet)
- In later versions, 1x1 convolutions were added to downscale filter dimensions (reducing computational cost)
- Also add auxiliary classification outputs earlier to keep gradient
flow
- ResNet: add residual connections between layers
- Keeps gradients flowing
- Also allows the model to learn just the residual (the difference)
- All current networks start with a conv layer
- Dropout, Kaiming init
- ViT: Vision Transformer
- Add a convolution in the first layer to create patches
- Then just run through transformer blocks
- ViT needs more data
- Trained on a dataset called JFT-300M
- ViT performs worse than ResNet on 10M images
- Final layer is finetuned on ImageNet-1.5M
- MLP-Mixer: all MLP architecture
- Full of Mixer Layers
- Mixer layer is layer norm, transpose to get patches, MLP, transpose
to get channels, layernorm, second MLP
- ResNet improvements:
- Change normalization ordering
- Wider (more filters) networks
- ResNeXt: multiple parallel paths, Inception-style
- DenseNet: add more residual pathways
- MobileNet: use channel downsampling 1x1 conv layers
- Neural Architecture Search: NAS
- EfficientNet: fast, accurate, small