- Course Website
- This course is heavily based off of CS231n
- This course will primarily focus on NLP and Computer Vision
- Course is multiple parts:
- Deep learning fundamentals
- Practical training skills
- Application
- Vision has been one of the drivers of early DL, important to understand the history
- Books can be helpful, but not necessary
- Gradescope has automatic and private testing
- Three psets, one midterm, and a final group project

- 543 million years ago animals started to develop sight
- Camera obscura documented in 1545 to study eclipses
- da Vinci had described the same pinhole camera principle earlier

- 1959 Hubel & Wiesel found that we visually react to “edges” and “blobs”
- Think of this as a “lower layer”

- Larry Roberts is known as the “Father of Computer Vision” - wrote the first CV thesis
- 1960’s MIT attempted to solve vision in a summer - this didn’t happen
- David Marr introduced the idea of stages of visual recognition in 1970’s
- Edge detection became the next big push in CV
- In the 1980’s expert systems became popular
- These had heuristic rules made by domain “experts”
- Unsuccessful and caused the second AI Winter

- Irving Biederman came up with rules on how we view the world
- 1: We must understand components (objects and relationships)
- 2: This is only possible because we see so many objects while learning
- A 6 year old child has seen 30,000 objects

- We can detect animals in 150 ms
- We detect predators and the color red even quicker!

- Later-stage neurons allowed us to detect complex objects or themes
- In the 1990’s research started on real-world images
- Algorithms were developed for grouping (1990’s) and matching (2000’s)

- In 2001 the first commercial success in CV
- Facial detection, used ML and facial features

- In the 2000’s feature development was all the rage
- Histogram of gradients - how do the edges in pixels move?

- We need an incredible amount of data - led to ImageNet
- 2009, had 22K categories and 14M images

- In 2012 AlexNet had breakthrough performance on ImageNet
- By 2015 all attempts were DL and better than humans

- In 1957 the Mark I Perceptron was created for character recognition
- Manually twisted knobs to tune (adjusted weights)
- Cannot be trained practically

- Backpropagation was developed in 1986
- LeNet is the architecture used in the Postal Service - 1998
- AlexNet is essentially the same architecture, scaled up

- DL was used in the early 2000’s to compress images
- Everything is homogenized now
- Transformers and backprop are the norm
- Data and compute are the differentiators
- Domains change, but core is often the same

- Hinton, Bengio, and LeCun won the Turing award in 2018
- Deep learning is its own course because of incredible growth

- Image classification (IC) is a core task in CV
- There are many challenges related to computer vision:
- Viewpoint variation
- Illumination
- Background clutter
- Occlusion
- Deformation
- Intraclass variation
- It is very difficult to implement an image classifier as “normal” software
- “old” AI

- Data-driven paradigm is better
- Use datasets to train a model
- MNIST
- CIFAR-10(0)
- ImageNet
- MIT Places
- Omniglot

- Use ML to train a classifier
- Evaluate the classifier on new images

- Nearest Neighbor classifier
- Training: memorize all data and labels
- Inference: predict label of “nearest” image
- \(O(1)\) train time, \(O(n)\) inference time
- It is a universal approximator with n -> infinity data points

- What is “nearest” (distance)?
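- L1 norm (Manhattan): sum of absolute differences, depends on the choice of coordinate axes, usually the worse choice
- L2 norm (Euclidean): square root of the sum of squared differences, usually the better choice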

- Hyperparameters are choices (configs) of the model
- e.g. k, distance type
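
A minimal sketch of the nearest-neighbor idea above, with `k` and the distance type exposed as hyperparameters. The function name `knn_predict` and the toy data are made up for illustration, not taken from the course.

```python
import numpy as np

def knn_predict(train_x, train_y, query, k=1, metric="l2"):
    """Predict a label for each query row by majority vote over its k nearest training rows."""
    diffs = query[:, None, :] - train_x[None, :, :]   # (Q, N, D) pairwise differences
    if metric == "l1":
        dists = np.abs(diffs).sum(axis=2)             # Manhattan distance
    else:
        dists = np.sqrt((diffs ** 2).sum(axis=2))     # Euclidean distance
    nearest = np.argsort(dists, axis=1)[:, :k]        # indices of the k closest training rows
    votes = train_y[nearest]                          # (Q, k) labels of those neighbors
    return np.array([np.bincount(v).argmax() for v in votes])

# "Training" is just storing the data: O(1). Inference compares against all N points: O(n).
train_x = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
train_y = np.array([0, 0, 1, 1])
query = np.array([[0.2, 0.4], [5.5, 4.0]])
preds = knn_predict(train_x, train_y, query, k=3, metric="l1")
```

Swapping `metric` or `k` changes the decision boundary without any retraining, which is why both count as hyperparameters.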

- Finding hyperparams
- We should use a train, val, and test ds
- Cross validation: split data into folds - no dedicated val necessary

- Curse of dimensionality: the number of data points necessary is related to the dimension of data
- Linear classifiers take on the form: \(y = Wx + b\)
- \(b\) is often omitted, instead appending data vector with a one
- Parametric approach
- Learns a template and decision boundary
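
A small sketch of the bias trick described above, assuming illustrative sizes (3 classes, 4 input features). It only checks that folding \(b\) into \(W\) gives the same scores.

```python
import numpy as np

rng = np.random.default_rng(0)
C, D = 3, 4                          # number of classes and input dimension (illustrative)
W = rng.normal(size=(C, D))          # one row (template) per class
b = rng.normal(size=C)
x = rng.normal(size=D)

scores = W @ x + b                   # y = Wx + b

# Bias trick: append b as an extra column of W and a constant one to x
W_aug = np.hstack([W, b[:, None]])   # shape (C, D + 1)
x_aug = np.append(x, 1.0)            # shape (D + 1,)
scores_aug = W_aug @ x_aug

predicted_class = int(np.argmax(scores))
```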

- It’s cool that we have a classifier, but how do we make it good?
- Loss function: defines how good or bad our classifier (weights) is
- Loss over dataset is the average of loss over examples
- \(x_i\), \(y_i\), and \(L_i\) are example data, output label, and example loss

- Multiclass SVM loss:
- Make sure that prediction of correct label is greater than all predictions for other labels by a given margin
- The exact value of the margin (delta) doesn’t matter because the weights can simply scale to match it
- Linear-ish loss
- Issues because piecewise function, doesn’t approach zero asymptotically
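
A hedged sketch of the multiclass SVM (hinge) loss for a single example. The scores are made-up numbers and `delta=1.0` is an assumed margin.

```python
import numpy as np

def svm_loss(scores, correct, delta=1.0):
    """Sum of margin violations: max(0, s_j - s_correct + delta) over wrong classes j."""
    margins = np.maximum(0.0, scores - scores[correct] + delta)
    margins[correct] = 0.0           # the correct class does not compete with itself
    return margins.sum()

scores = np.array([3.2, 5.1, -1.7])  # illustrative class scores
loss_wrong = svm_loss(scores, correct=0)   # class 1 outscores the correct class 0
loss_right = svm_loss(scores, correct=1)   # correct class wins by more than delta
```

Once the correct class wins by at least the margin, the loss is exactly zero, which is the piecewise behavior noted above.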

- Squaring the loss penalizes predictions more the further they are from correct (large errors count disproportionately)
- Regularization is important, we always want the simplest model (no overfitting!)
- Often adding a fraction of the L1 or L2 on the weights.
- Spread out the weights

- Softmax classifier:
- Pushes each value between 0 and 1
- Ensures the sum of the softmaxed outputs is 1
- \(S_i = \frac{e^{y_i}}{\sum_j e^{y_j}}\)
- Scaling will determine how “peaky” the softmaxed scores are
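
A short sketch of softmax and the effect of scaling. The `scale` argument is an assumption added for illustration, and subtracting the max is a standard numerical-stability trick that does not change the result.

```python
import numpy as np

def softmax(y, scale=1.0):
    z = scale * np.asarray(y, dtype=float)
    z = z - z.max()                  # shift for numerical stability; output is unchanged
    e = np.exp(z)
    return e / e.sum()

y = np.array([2.0, 1.0, 0.1])        # illustrative raw scores
probs = softmax(y)
peaky = softmax(y, scale=5.0)        # larger scale: mass concentrates on the top score
flat = softmax(y, scale=0.1)         # smaller scale: closer to uniform
```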

- NLL is the negative log of the prediction of the correct label
- Cross entropy loss is the NLL of the softmax of the prediction of the correct label
- Optimization is gradient descent
- Find the partial derivatives of the loss with respect to the weights
- We update each weight by the scaled negative of its partial derivative

- Use a numeric gradient to gradient check
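
A minimal sketch of a gradient check, assuming a linear softmax classifier with cross-entropy loss on one example. The helper names and sizes are illustrative. It compares the analytic gradient to a centered finite difference, then takes one gradient descent step.

```python
import numpy as np

def cross_entropy(W, x, label):
    scores = W @ x
    scores = scores - scores.max()
    log_probs = scores - np.log(np.exp(scores).sum())
    return -log_probs[label]                   # NLL of the softmax probability

def analytic_grad(W, x, label):
    scores = W @ x
    scores = scores - scores.max()
    probs = np.exp(scores) / np.exp(scores).sum()
    probs[label] -= 1.0                        # dL/dscores = softmax - one_hot
    return np.outer(probs, x)                  # dL/dW

def numeric_grad(f, W, h=1e-5):
    grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        old = W[idx]
        W[idx] = old + h
        plus = f(W)
        W[idx] = old - h
        minus = f(W)
        W[idx] = old                           # restore the weight
        grad[idx] = (plus - minus) / (2 * h)   # centered difference
    return grad

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
x = rng.normal(size=4)
label = 1

g_analytic = analytic_grad(W, x, label)
g_numeric = numeric_grad(lambda W_: cross_entropy(W_, x, label), W)
max_diff = np.abs(g_analytic - g_numeric).max()

learning_rate = 0.1
W_new = W - learning_rate * g_analytic         # one gradient descent step
```

The numeric gradient is slow (two loss evaluations per weight), so it is only useful as a check on the analytic one.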

- Stochastic gradient descent uses minibatches to estimate the gradient
- Pick your GPU, then biggest minibatch size typically

- Linear classifiers aren’t very powerful (for many problems)
- They only learn one template and linear decision boundaries

- We can extract features, *then* fit a linear classifier
- Non-linear transformations (features) are necessary
- We have to manually create features…if only we could learn them…

- Simple two-layer NN: \(f = W_2 \max(0, W_1 x)\)
- \(x \in \mathbb{R}^D, W_1 \in \mathbb{R}^{H \times D}, W_2 \in \mathbb{R}^{C \times H}\)
- We can think of this as learning templates, then learning combinations of templates

- There are many different activation functions:
- ReLU, Sigmoid, Leaky ReLU are (historically) popular ones

- Activation functions are necessary because multiple consecutive linear transformations can be represented as just one transformation
- Biological neurons are *quite* different!
- We use computational graphs and backpropagation to differentiate (backwards pass)
- Backprop is simply the chain rule - a lot
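
A hedged sketch of backprop as repeated chain rule on the two-layer network \(f = W_2 \max(0, W_1 x)\). The squared-error loss, the sizes, and the random target are assumptions chosen to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(1)
D, H, C = 5, 7, 3                    # input, hidden, output sizes (illustrative)
x = rng.normal(size=D)
W1 = rng.normal(size=(H, D))
W2 = rng.normal(size=(C, H))
target = rng.normal(size=C)

# Forward pass, keeping every intermediate node of the graph
a = W1 @ x                           # pre-activation
h = np.maximum(0.0, a)               # ReLU
f = W2 @ h                           # output scores
loss = 0.5 * np.sum((f - target) ** 2)

# Backward pass: multiply the upstream gradient by each local gradient
df = f - target                      # dloss/df
dW2 = np.outer(df, h)                # dloss/dW2
dh = W2.T @ df                       # dloss/dh
da = dh * (a > 0)                    # ReLU passes gradient only where a > 0
dW1 = np.outer(da, x)                # dloss/dW1

# Numeric check on a single weight
def loss_fn(W1_):
    return 0.5 * np.sum((W2 @ np.maximum(0.0, W1_ @ x) - target) ** 2)

eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 0] += eps
W1_minus[0, 0] -= eps
numeric = (loss_fn(W1_plus) - loss_fn(W1_minus)) / (2 * eps)
```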

- Vector-vector backprop includes the Jacobian matrix for local operations
- ImageNet helped people realize that data was super important!
- Convolutional NNs are a way of including spatial information
- CNNs are ubiquitous within vision
- While FC nets flatten images, CNNs preserve spatial structure
- Filters must be the same depth as the input image (i.e. 3 for RGB)
- Slide over the entire image, flatten each part of the image it is above, dot product with filter
- Stride is the offset between each filter-comparison
- The number of filters is the number of activation maps (and the number of output channels)
- ConvNet is simply many convolutional layers with activation functions in between
- Earlier layers learn low-level features, later layers learn higher-level features
- Ends with a simple FC classifier
- There are interspersed pooling layers to downsample
- Output size of an \(N \times N\) image with filter size \(F \times F\) is: \((N - F) / stride + 1\)
- Often images will be padded with zero pixels
- 1x1 convolutional layers increase or reduce depth in channel space
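
A minimal sketch of the output-size formula and a naive convolution loop. The `pad` argument extends the formula above to \((N - F + 2P) / stride + 1\), which is the standard padded form, and the function names are illustrative.

```python
import numpy as np

def conv_output_size(n, f, stride=1, pad=0):
    span = n + 2 * pad - f
    if span % stride != 0:
        raise ValueError("filter does not tile the input evenly")
    return span // stride + 1

def conv2d(image, filters, stride=1, pad=0):
    """image: (C, N, N); filters: (K, C, F, F) -> output: (K, out, out)."""
    c, n, _ = image.shape
    k, fc, f, _ = filters.shape
    assert fc == c, "filter depth must match input depth"
    padded = np.pad(image, ((0, 0), (pad, pad), (pad, pad)))   # zero padding
    out = conv_output_size(n, f, stride, pad)
    result = np.zeros((k, out, out))
    flat_filters = filters.reshape(k, -1)                      # one row per filter
    for i in range(out):
        for j in range(out):
            patch = padded[:, i * stride:i * stride + f, j * stride:j * stride + f]
            result[:, i, j] = flat_filters @ patch.reshape(-1) # dot product per filter
    return result

image = np.random.default_rng(0).normal(size=(3, 7, 7))        # RGB-like 7x7 input
filters = np.random.default_rng(1).normal(size=(6, 3, 3, 3))   # six 3x3 filters, depth 3
out_plain = conv2d(image, filters, stride=2)                   # (7 - 3) / 2 + 1 = 3
out_padded = conv2d(image, filters, stride=1, pad=1)           # padding keeps 7x7
```

Real frameworks vectorize this heavily. The double loop is only meant to mirror the "slide, flatten, dot product" description.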

- Training loop:
- Sample a batch of data
- Forward prop through the graph to get loss
- Backprop through the gradients
- Update the params using the gradient
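
The four steps above can be sketched on a toy problem. This assumes linear regression with a mean-squared-error loss so that the "backprop" step fits on one line. The data and hyperparameters are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([2.0, -3.0])
X = rng.normal(size=(256, 2))
y = X @ true_w + 0.01 * rng.normal(size=256)

w = np.zeros(2)
learning_rate, batch_size = 0.1, 32
losses = []
for step in range(200):
    idx = rng.choice(len(X), size=batch_size, replace=False)  # 1. sample a batch
    xb, yb = X[idx], y[idx]
    err = xb @ w - yb                                         # 2. forward pass
    loss = np.mean(err ** 2)
    grad = 2 * xb.T @ err / batch_size                        # 3. gradient of the loss
    w -= learning_rate * grad                                 # 4. parameter update
    losses.append(loss)
```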

- Before you train:
- Activation functions
- Preprocessing
- Weight initialization
- Regularization
- Gradient checking

- Training dynamics:
- Babysitting the learning process
- Param updates
- Hyperparam updates

- Evaluation:
- Model ensembles
- Test-time augmentation
- Transfer learning

- Sigmoid function: \(\sigma(x) = \frac{1}{1 + e^{-x}}\)
- Used to be popular
- Saturated gradient at high positive and negative values
- All gradients squashed by at least a factor of 4
- Outputs aren’t zero-centered which means local gradient is always positive
- Weights will now all change in the same direction

- Computationally expensive!
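
A quick numeric illustration of the sigmoid points above: the local gradient peaks at 0.25 and nearly vanishes at large positive or negative inputs, while the outputs stay positive.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)             # derivative of the sigmoid

xs = np.array([-10.0, 0.0, 10.0])
outputs = sigmoid(xs)                # always in (0, 1), so never zero-centered
grads = sigmoid_grad(xs)             # largest at x = 0, tiny in the saturated regions
```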

- Tanh function:
- Zero centered
- Still has the problem of saturating (vanishing) gradients

- ReLU (rectified linear unit): \(f(x) = max(0, x)\)
- Most common activation function
- No saturation in positive region
- Computationally efficient
- Converges (6x) faster than sigmoid or tanh in practice
- Not zero centered
- Can lead to “dying” ReLUs
- To prevent, often initialize biases with small positive value (e.g. 0.01)

- Leaky ReLU: \(f(x) = max(0.01x, x)\)
- Same as ReLU, but minor gradient in negative region
- Parametric Rectifier (PReLU) where scaling is a learnable parameter

- Exponential Linear Units (ELU):
- “Better”, but more expensive

- Scaled exponential linear units (SELU):
- Works better for larger networks, has a normalizing property
- Can use without BatchNorm
- Holy Heck Math **“Cool”**

- Maxout:
- Max of multiple linear layers
- Multiplies the number of parameters :(

- Swish:
- RL-created activation function

- GeLU:
- Add some randomness to ReLU, then take average to find this
- “Data dependent dropout”

- Main activation function used (esp. in transformers)

- Use ReLU, GeLU if transformers, and try ReLU derivatives
- We often zero-mean and normalize our data as preprocessing
- Sometimes preprocessing involves PCA and whitening
- ResNet subtracted mean across channels and normalized with standard deviation
- A constant weight initialization leads to all the values being the same
- Initializations that are too large or too small lead to extreme saturation or clustered outputs
- “Xavier” initialization: \(std = \frac{1}{\sqrt{D_{in}}}\)
- Good with Tanh
- Attempts to keep output variance similar to input variance

- “Kaiming” initialization: \(std = \sqrt{\frac{2}{D_{in}}}\)
- Works well with ReLU!
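
A hedged experiment sketch for the initialization claims above: push random data through a stack of layers and look at the spread of the final activations. Depth, width, and batch size are arbitrary choices for illustration.

```python
import numpy as np

def final_activation_std(init_std_fn, activation, depth=10, width=512, seed=0):
    """Std of activations after `depth` layers with weights ~ N(0, init_std_fn(width)^2)."""
    rng = np.random.default_rng(seed)
    h = rng.normal(size=(256, width))
    for _ in range(depth):
        W = rng.normal(size=(width, width)) * init_std_fn(width)
        h = activation(h @ W)
    return h.std()

def relu(z):
    return np.maximum(0.0, z)

def xavier(d_in):
    return 1.0 / np.sqrt(d_in)

def kaiming(d_in):
    return np.sqrt(2.0 / d_in)

def too_small(d_in):
    return 0.01

std_tanh_xavier = final_activation_std(xavier, np.tanh)   # stays in a usable range
std_tanh_small = final_activation_std(too_small, np.tanh) # collapses toward zero
std_relu_xavier = final_activation_std(xavier, relu)      # shrinks layer by layer
std_relu_kaiming = final_activation_std(kaiming, relu)    # roughly preserved
```

The factor of 2 in Kaiming compensates for ReLU zeroing out about half of the pre-activations.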

- Batch Normalization:
- “Things break when inputs don’t have zero mean and \(std = 1\)” - “Why not just force that?”
- Subtract the batch mean, divide by the batch standard deviation (square root of the batch variance)
- It is differentiable!
- We keep these *before* each non-linearity
- We also have two learned parameters, gamma and beta, corresponding to scaling and shifting
- We keep a running average of the batch mean and variance across our training process, used in place of batch statistics at inference

- BatchNorm
- Becomes a linear layer during inference
- “Resets” the standard deviation and mean after change from linear layers
- Allows higher learning rate and better gradient flow
- Acts as regularization during training
- Different behavior during training and testing! Can be a common bug!
- For CNNs, batch norm computes one mean and variance per channel, over the batch and spatial dimensions (statistics have shape \(1 \times C \times 1 \times 1\))

- LayerNorm:
- Normalize across each example!

- Instance normalization is for CNN across height and width
- Good for segmentation and detection
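
A minimal sketch of how the normalization variants above differ only in which axes they average over, for an input of shape \(N \times C \times H \times W\). The function names, `momentum=0.9`, and `eps` are assumptions for illustration. Learned scale and shift are shown only for batch norm.

```python
import numpy as np

def batchnorm_train(x, gamma, beta, running_mean, running_var, momentum=0.9, eps=1e-5):
    """Per-channel statistics over batch and spatial dims; also updates running averages."""
    mean = x.mean(axis=(0, 2, 3), keepdims=True)     # shape (1, C, 1, 1)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    running_mean = momentum * running_mean + (1 - momentum) * mean
    running_var = momentum * running_var + (1 - momentum) * var
    return gamma * x_hat + beta, running_mean, running_var

def batchnorm_test(x, gamma, beta, running_mean, running_var, eps=1e-5):
    """Inference uses the stored running statistics, so it is a fixed linear map."""
    return gamma * (x - running_mean) / np.sqrt(running_var + eps) + beta

def layernorm(x, eps=1e-5):
    """Statistics per example, over channels and spatial dims."""
    mean = x.mean(axis=(1, 2, 3), keepdims=True)
    var = x.var(axis=(1, 2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def instancenorm(x, eps=1e-5):
    """Statistics per example and per channel, over height and width only."""
    mean = x.mean(axis=(2, 3), keepdims=True)
    var = x.var(axis=(2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = 3.0 + 2.0 * rng.normal(size=(8, 4, 5, 5))
gamma, beta = np.ones((1, 4, 1, 1)), np.zeros((1, 4, 1, 1))
run_mean, run_var = np.zeros((1, 4, 1, 1)), np.ones((1, 4, 1, 1))
out, run_mean, run_var = batchnorm_train(x, gamma, beta, run_mean, run_var)
test_out = batchnorm_test(x, gamma, beta, run_mean, run_var)
```

The train and test functions give different outputs for the same input, which is the source of the common bug mentioned above.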

- There are a ton of ways to normalize in similar ways
- Vanilla gradient descent: calculate gradient, move towards the negative gradient
- Issues related to getting stuck in saddle points
- Stochastic, so descent can be extremely noisy

- Momentum keeps optimization moving in the same direction:
- \(v_{t+1} = \rho v_t + \nabla f(x_t)\) then update using \(v_{t+1}\)
- Common values for \(\rho\) are \(0.9\) and \(0.99\)
- Momentum will often overshoot minima
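
A small sketch of the momentum update above on an elongated quadratic bowl, where plain gradient descent crawls along the shallow direction. The test function and learning rate are assumptions picked for illustration.

```python
import numpy as np

def grad(x):
    # Gradient of f(x) = 0.5 * (x0^2 + 25 * x1^2): shallow in x0, steep in x1
    return np.array([1.0, 25.0]) * x

def run(rho, learning_rate=0.02, steps=100):
    x = np.array([10.0, 1.0])
    v = np.zeros_like(x)
    for _ in range(steps):
        v = rho * v + grad(x)        # v_{t+1} = rho * v_t + grad f(x_t)
        x = x - learning_rate * v    # update using v_{t+1}
    return x

x_sgd = run(rho=0.0)                 # rho = 0 recovers vanilla gradient descent
x_momentum = run(rho=0.9)            # ends much closer to the minimum at the origin
```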

- Nesterov momentum: take the velocity update *before* taking derivative
- Good, but we have to update weights twice, so we use a different formulation where \(\tilde{x}_t = x_t + \rho v_t\)

- SGD + momentum or Nesterov are often what is used in practice
- AdaGrad keeps a running sum of squared gradients, dividing each gradient by the square root of the sum of its squared derivatives
- This leads to weights with large gradients getting updated slower and weights with small gradients getting updated quicker
- It also quickly decays all learning rates to zero - a problem for neural networks

- RMSProp (Leaky AdaGrad) is AdaGrad but with a decay to previous squared grads - similar to a running average
- Keeps step sizes relatively constant
- Common decay rate is \(0.99\)
- Doesn’t overshoot as much
- More computationally expensive
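
A hedged sketch contrasting AdaGrad and RMSProp on a toy quadratic. The single `run` helper, the learning rates, and the test function are illustrative choices. The only difference between the two is whether old squared gradients are summed forever or decayed.

```python
import numpy as np

def f(x):
    return 0.5 * (x[0] ** 2 + 25.0 * x[1] ** 2)

def grad(x):
    return np.array([1.0, 25.0]) * x

def run(decay_rate, learning_rate, steps=500, eps=1e-7):
    """decay_rate=None gives AdaGrad; a float in (0, 1) gives RMSProp."""
    x = np.array([10.0, 1.0])
    grad_squared = np.zeros_like(x)
    scales = []
    for _ in range(steps):
        dx = grad(x)
        if decay_rate is None:
            grad_squared = grad_squared + dx * dx                        # sum forever
        else:
            grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dx * dx
        scale = learning_rate / (np.sqrt(grad_squared) + eps)            # per-weight step size
        x = x - scale * dx
        scales.append(scale)
    return x, np.array(scales)

start = np.array([10.0, 1.0])
x_adagrad, scales_adagrad = run(decay_rate=None, learning_rate=0.5)
x_rmsprop, scales_rmsprop = run(decay_rate=0.99, learning_rate=0.05)
```

In the AdaGrad run the per-weight step sizes can only shrink, which is the decay-to-zero problem noted above.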

- Adam: combine the best of momentum and RMSProp
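```python
import numpy as np

# Toy setup so the loop runs (assumed for illustration): minimize f(x) = sum(x^2)
x = np.array([1.0, -2.0])
compute_gradient = lambda x: 2 * x
learning_rate, beta1, beta2, num_iterations = 1e-2, 0.9, 0.999, 1000

first_moment = 0
second_moment = 0
for t in range(1, num_iterations):
    dx = compute_gradient(x)
    first_moment = beta1 * first_moment + (1 - beta1) * dx          # momentum
    second_moment = beta2 * second_moment + (1 - beta2) * dx * dx   # RMSProp-style scaling
    first_unbias = first_moment / (1 - beta1 ** t)                  # bias correction
    second_unbias = second_moment / (1 - beta2 ** t)
    x -= learning_rate * first_unbias / (np.sqrt(second_unbias) + 1e-7)
```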

- There is a bias correction to prevent large early step sizes from instantly destroying initialization
- Typical hyperparams are `beta1 = 0.9` and `beta2 = 0.999`

- L2 regularization and weight decay are the same when using SGD (with momentum), but different for Adam, AdaGrad, RMSProp
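
A minimal sketch of why the two coincide for SGD but not for Adam, assuming plain SGD without momentum and a single Adam step from zero-initialized moments. The constant "data gradient", the helper names, and the hyperparameters are made up for illustration.

```python
import numpy as np

def sgd_l2(w, g, learning_rate, lam):
    return w - learning_rate * (g + lam * w)             # L2 penalty folded into the gradient

def sgd_weight_decay(w, g, learning_rate, lam):
    return (1 - learning_rate * lam) * w - learning_rate * g   # shrink the weights directly

def adam_step(w, g, m, v, t, learning_rate, beta1=0.9, beta2=0.999, eps=1e-7):
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat, v_hat = m / (1 - beta1 ** t), v / (1 - beta2 ** t)
    return w - learning_rate * m_hat / (np.sqrt(v_hat) + eps), m, v

w0 = np.array([1.0, 2.0])
g = np.array([0.3, -4.0])            # illustrative gradient of the data loss
lr, lam = 0.1, 0.01

same_for_sgd = np.allclose(sgd_l2(w0, g, lr, lam), sgd_weight_decay(w0, g, lr, lam))

# Adam + L2: the penalty goes through the moment estimates and gets rescaled
w_l2, _, _ = adam_step(w0, g + lam * w0, 0.0, 0.0, 1, lr)
# Adam + decoupled decay: moments see only the data gradient, decay is applied separately
w_dec, _, _ = adam_step(w0, g, 0.0, 0.0, 1, lr)
w_dec = w_dec - lr * lam * w0
```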

- AdamW: the go-to optimizer
- Adam with weight decay decoupled from \(L_2\) regularization
- The decay is applied directly to the weights, so it stays out of the first and second moment estimates

- There are second-order optimization techniques where the step is scaled by the inverse of the Hessian (curvature), not just the gradient
- This is typically intractable due to the number of parameters
- AdaGrad is actually a special case of a second-order optimization technique where we assume the Hessian is diagonal
- Second-order optimization is not typically used in practice

- Learning rate decay: scaling down the learning rate over time
- Necessary to decrease loss beyond a certain point
- Hyperparameter choice is extremely important

- There are many learning rate schedulers:
- “Step” down the learning rate after a fixed number of epochs
- Leads to massive drops in loss followed by plateaus repeatedly

- Cosine learning rate decay: \(\alpha_t = \frac{1}{2}\alpha_0(1+cos(\frac{t\pi}{T}))\)
- \(\alpha_0\): initial learning rate, \(\alpha_t\) learning rate at step \(t\), \(T\) total number of steps
- Constant decrease in loss over time

- Linear learning rate decay
- Inverse square root decay
- Constant learning rate (no decay)

- Learning rate warmup: spending first iterations increasing learning rate
- A large initial learning rate will cause our weight initialization to blow up
- Linearly increase learning rate over about 5 epochs (~5000 iterations)
- If you increase batch size by \(N\), also increase learning rate by \(N\)
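
A short sketch combining linear warmup with cosine decay, plus the batch-size scaling rule. With `warmup_steps=0` it reduces to the cosine formula above. Shifting \(t\) and \(T\) by the warmup length is one common convention, assumed here rather than taken from the notes.

```python
import math

def learning_rate_at(step, total_steps, base_lr, warmup_steps=0):
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps    # linear ramp up to base_lr
    t = step - warmup_steps                           # steps since warmup ended
    T = total_steps - warmup_steps
    return 0.5 * base_lr * (1 + math.cos(math.pi * t / T))

def scaled_lr(base_lr, base_batch, new_batch):
    """Linear scaling rule: multiply the learning rate by the batch-size ratio."""
    return base_lr * new_batch / base_batch

schedule = [learning_rate_at(s, total_steps=1000, base_lr=0.1, warmup_steps=100)
            for s in range(1000)]
bigger_batch_lr = scaled_lr(0.1, base_batch=256, new_batch=1024)
```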

- Validation data is a good way of paying attention to the model
- Early stopping is ending model training when validation loss plateaus
- Often for ~5 epochs

- Training multiple models and then averaging results is a model ensemble:
- Exhibits about 2% better performance in the real world
- Different models overfit to different parts of the dataset, it is averaged out

- Regularization techniques previously covered: L2, L1, weight decay
- Dropout: every forward pass, set activations (neurons) to zero with probability \(p\)
- Increases redundancy across the entire network, prevents some overfitting
- Common value of \(p = 0.5\)
- An interpretation of dropout is an ensemble of models with shared parameters
- At test time, multiply by the keep probability \(1 - p\) **or** divide by \(1 - p\) during training so test time needs no scaling (inverted dropout)

- A common pattern with regularization is adding randomness during training and averaging out during testing
- Seen in batch norm and dropout for example
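
A minimal sketch of inverted dropout as an instance of the "random at train time, averaged at test time" pattern. It assumes `p_drop` is the probability of zeroing a value, and the function name is illustrative.

```python
import numpy as np

def dropout_forward(x, p_drop, training, rng):
    if not training:
        return x                                      # test time: identity, no rescaling
    keep = 1.0 - p_drop
    mask = (rng.random(x.shape) < keep) / keep        # zero with prob p_drop, rescale survivors
    return x * mask

rng = np.random.default_rng(0)
x = np.ones((1000, 100))
train_out = dropout_forward(x, p_drop=0.5, training=True, rng=rng)
test_out = dropout_forward(x, p_drop=0.5, training=False, rng=rng)
```

Dividing by the keep probability during training keeps the expected activation the same in both modes.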

- Data augmentation is a common form of regularization:
- Transforming the input in such a way that the label is the same

- Some example image augmentation techniques:
- Flipping an image horizontally
- Random crops and scales of an image
- Testing: average a fixed set of crops of a test image

- Change contrast or color of images
- Stretching or contorting images

- We are now training models to learn good data augmentation techniques
- DropConnect: set individual weights (connections) to zero (instead of activations in dropout)
- Fractional pooling pools random regions of each image
- Testing: average predictions over multiple regions

- Stochastic depth: skip entire layers in a network using residual connections
- Cutout: randomly set parts of image to average image color
- Good for small datasets, not often used on large ones

- Mixup: blend both training images and training labels by an amount
- Why does it work? Who cares!

- CutMix: replace random crops of one image with another while combining labels
- Label smoothing: set target label to: \(1 - \frac{K-1}{K}\epsilon\) and other labels to: \(\frac{\epsilon}{K}\) for \(K\) classes
- In practice:
- Use dropout for FC layers
- Using batchnorm is always good
- Try Cutout, MixUp, CutMix, Stochastic Depth, Label Smoothing for (a little) extra performance
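
A hedged sketch of two of the tricks above, label smoothing and Mixup. The fixed `lam=0.7` is for illustration only. In the Mixup paper the blend amount is sampled from a Beta distribution for each batch.

```python
import numpy as np

def smooth_labels(label, num_classes, epsilon=0.1):
    """Target gets 1 - (K-1)/K * eps, every other class gets eps / K."""
    target = np.full(num_classes, epsilon / num_classes)
    target[label] = 1.0 - (num_classes - 1) / num_classes * epsilon
    return target

def mixup(x_a, y_a, x_b, y_b, lam):
    """Blend both the inputs and the (one-hot) labels by the same amount."""
    return lam * x_a + (1 - lam) * x_b, lam * y_a + (1 - lam) * y_b

target = smooth_labels(label=2, num_classes=4, epsilon=0.1)

x_a, x_b = np.zeros((2, 2)), np.ones((2, 2))          # stand-ins for two images
y_a, y_b = np.eye(4)[0], np.eye(4)[3]                 # one-hot labels for classes 0 and 3
mixed_x, mixed_y = mixup(x_a, y_a, x_b, y_b, lam=0.7)
```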

- Grid search is okay for hyperparameter search
- Use log-linear values

- Random search is better:
- Use log-uniform randomness in given range
- Likely because some hyperparameters matter more than others, so more opportunities to get it exact than grid search
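
A small sketch of log-uniform random search. The ranges and the helper name `sample_log_uniform` are assumptions for illustration.

```python
import numpy as np

def sample_log_uniform(low, high, size, rng):
    """Sample uniformly in log10 space so every decade is equally likely."""
    return 10 ** rng.uniform(np.log10(low), np.log10(high), size=size)

rng = np.random.default_rng(0)
learning_rates = sample_log_uniform(1e-4, 1e-1, size=1000, rng=rng)
weight_decays = sample_log_uniform(1e-5, 1e-3, size=1000, rng=rng)
configs = list(zip(learning_rates, weight_decays))    # each pair is one trial to train
```

Every trial draws a fresh value for each hyperparameter, so the important ones get many distinct values instead of the handful a grid would give.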

- How to choose hyperparameters without Google-level compute:
- 1: Check initial loss: sanity check, turn off weight decay
- 2: Overfit on a small sample (5-10 minibatches)
- Make some architectural decisions
- Loss not decreasing? LR too low, bad weight init
- Loss NaN or Inf? LR too high, also bad weight init

- 3: Search for learning rate for ~100 iterations
- Good LR to think about: 1e-1 to 1e-4

- 4: Coarse search: add other hyperparams and train a few models for ~1.5 epochs
- Good weight decay to try: 1e-4, 1e-5, 0

- 5: Pick best models from step 4, train for ~10-20 epochs without learning rate decay
- 6: Look at learning curves and adjust
- Might need early stopping, adjust regularization, larger model, or **keep going**
- Flat start to loss graph means bad initialization
- When loss plateaus, add a scheduler (cosine!)
- Use a “command center” like Weights and Biases (or add your own!)

- 7: Return to step 5!

- Linear classifiers are easy to visualize because they learn only one template (filter) per class
- Early layers in deep learning networks are also easy to visualize, these are filters
- Often edge detection

- The last layers are an embedded representation of the inputs
- The embeddings are much better for KNN

- Google search:
- Embeds search into a 100 or 200 dimensional vector, then runs KNN on a massive database
- Needs massive compute however

- We can use PCA or t-SNE to visualize the outputs of the last layer of a network
- Good to use KNN rather than simply labels so we can see what is happening under the hood
- We can visualize activation maps for CNNs
- To find what activates neurons the best, run patches of images from the dataset through the model and see which patches make the model output the full image’s label
- Occlusion using patches is another way of visualizing which pixels matter, can be graphed for a cool image
- Shapley values are a way of using multiple patches to determine important areas of the image
- Backprop the class score to the image pixels to get saliency maps
- Super good image segmentation (accidentally)

- Saliency maps allow us to view biases within misclassifications
- Clamp gradients to only negative values

- We can also “backprop” a gray image to make an example image
- Gradient ascent and a heck of a lot of regularization

- Adversarial examples:
- Start from an arbitrary image
- Pick an arbitrary class
- Modify the image to improve class scores
- Repeat!

- Black box attacks:
- Adding random noise to images and models get confused…

- Supervised learning is insanely expensive:
- Labeling ImageNet’s 1.4M images would cost more than $175,000

- Unsupervised learning: model isn’t told what to predict
- Self-Supervised learning: model predicts some naturally occurring signal in the data
- The goal is to learn a cheap “pretext” task which learns important features
- Target is something that is easy to compute

- For example, start with autoencoder, then fine-tune
- Three main types of pretext tasks:
- Generative
- Autoencoders, GANs

- Discriminative
- Contrastive

- Multimodal
- Input video, output audio

- Pretext task performance is irrelevant
- Often just toss a linear classifier on the back of the encoder
- Generative supervised learning:
- Generate some data from an example

- Computers (NNs) are not rotation-invariant:
- Allows us to predict rotation as a pretext task

- The learned attentions are similar to what supervised learning finds
- Predict relative patch locations:
- Break image into 3x3 grid, pass in center square and an outer square and predict location
- CNNs are size invariant

- Jigsaw puzzle:
- Reorder patches according to correct permutation

- Inpainting:
- Pass in image with missing patch, predict the missing patch
- Adding adversarial loss increases image recreation quality

- Image coloring:
- Use “LAB” coloring, pass in L (grayscale) and predict AB (color)

- Split-brain autoencoder:
- Predict color from light, predict light from color

- Video coloring:
- Given colored start frame and grayscale images from then on, predict the colors
- Uses attention mechanism

- Contrastive learning:
- Create many different examples from an original example
- These examples are going to be in the same semantic space
- Learn which examples are closer or further away from each other

- Contrastive learning:
- Get a reference example \(x\)
- Create transformed or augmented examples from \(x\), called \(x^+\)
- Label all other examples \(x^-\)
- Maximize \(score(f(x), f(x^+))\) and minimize \(score(f(x), f(x^-))\)
- Loss function is derived from softmax
- “Every single instance is its own class”

- Uses cosine similarity
- \(\frac{u^T v}{||u|| ||v||}\)

- Generate positive samples through simple data augmentation

- SimCLR Framework:
- Make two simple transformations from the example
- Run each transformed example through a NN to get representation
- Run each representation through a simple linear layer or MLP (projection head)
- Maximize the cosine similarity of those outputs

- Example transformations for images:
- Random cropping
- Random color distortion
- Gaussian blur

- Create a minibatch of \(2N\) alternating example-transformed images:
- Run it through the model
- Take the cosine similarity of the outputs with themselves to get a \(2N\) by \(2N\) matrix
- The \((2k, 2k+1)\) and \((2k+1, 2k)\) scores should be positive, everything else should be negative
- Diagonal will always be 1
- See slides for illustration
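
A hedged sketch of the pairwise-similarity loss described above (called NT-Xent in the SimCLR paper). Rows \(2k\) and \(2k+1\) are assumed to be the two views of example \(k\). The temperature value and the random stand-in "projections" are assumptions for illustration.

```python
import numpy as np

def nt_xent_loss(z, temperature=0.5):
    """z: (2N, d) projections; rows 2k and 2k+1 are two views of the same example."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit norm: dot product = cosine similarity
    sim = z @ z.T                                      # (2N, 2N), diagonal is 1
    logits = sim / temperature
    np.fill_diagonal(logits, -np.inf)                  # a view never counts as its own pair
    positives = np.arange(len(z)) ^ 1                  # partner index: 0<->1, 2<->3, ...
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    loss = -log_prob[np.arange(len(z)), positives].mean()
    return loss, sim

rng = np.random.default_rng(0)
base = rng.normal(size=(4, 16))
# Two nearly identical "views" per example, interleaved as rows 2k and 2k+1
aligned = np.repeat(base, 2, axis=0) + 0.01 * rng.normal(size=(8, 16))
unrelated = rng.normal(size=(8, 16))                   # pairs with nothing in common

loss_aligned, sim = nt_xent_loss(aligned)
loss_unrelated, _ = nt_xent_loss(unrelated)
```

Each row is a softmax classification problem whose correct "class" is its partner view, which is where the cross-entropy form comes from.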

- SimCLR works extremely well with very large batch sizes (thousands of examples, the paper goes up to 8,192)
- We don’t want to directly expose the representation to the loss function, so we use an MLP head
- MoCo: Momentum Contrastive Learning
- Keep a running queue of keys (negative examples) for all images in the batch
- If we have 1000 examples and 2000 negatives, it is \(1000 \cdot 2000\)

- Update the encoder only through the queries (reference images)
- Makes the momentum encoder be much less computationally expensive
- We update the momentum encoder via: \(\theta_k \leftarrow m \theta_k + (1 - m) \theta_q\)
- Slowly aligns the two networks

- Uses cross-entropy loss
- Treats each negative as a class
- One correct label (this comes from the two parallel transformations) and the incorrect labels (negatives) are from the queue
- These don’t need to be run in parallel or there could be a concatenation
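
A minimal sketch of the two MoCo ingredients above: the momentum update of the key encoder and a fixed-length queue of negatives. Parameters are plain dicts of arrays here, and `m=0.999` and the queue length are assumed values for illustration.

```python
import numpy as np
from collections import deque

def momentum_update(theta_k, theta_q, m=0.999):
    """theta_k <- m * theta_k + (1 - m) * theta_q, applied to every parameter."""
    return {name: m * theta_k[name] + (1 - m) * theta_q[name] for name in theta_k}

theta_q = {"w": np.ones(3)}          # query encoder (trained by backprop)
theta_k = {"w": np.zeros(3)}         # key (momentum) encoder, never backpropagated through
for _ in range(1000):
    theta_k = momentum_update(theta_k, theta_q)

# Queue of negative keys: the newest batch pushes out the oldest one
queue = deque(maxlen=4)
for batch_id in range(6):
    keys = np.full((2, 3), float(batch_id))    # stand-in for encoded keys of one batch
    queue.append(keys)
negatives = np.concatenate(list(queue))        # batches 2..5 remain
```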

- MoCo V2: Add a non-linear projection head and better data augmentation
- DINO: do we need negatives? (very recent)
- Reformulates contrastive learning as a distillation problem
- Teacher model is like the momentum encoder
- Running average of the student model
- Sees a global view augmentation of the image

- Student model only sees cropped augmentation of the image

- DINO training tricks for the teacher:
- Center the teacher outputs using a running mean
- Sharpen the distribution towards a certain class - like a temperature
- Has the effect of making the teacher be a bit of a classifier

- DINO V2 released this week!
- Contrastive Predictive Coding (CPC)
- Contrastive: contrast between correct and incorrect sequences using contrastive learning
- Predictive: model must predict future patterns based on current
- Coding: model must learn useful feature vectors

- We give the model context, then a correct continuation sequence and many incorrect possible sequences
- First encode all images into vectors: \(z_t = g_{enc}(x_t)\)
- Summarize all context into a context code: \(c_t\) using an autoregressive model: \(g_{ar}\)
- Compute InfoNCE loss between context \(c_t\) and future code \(z_{t+k}\) using scoring function:
- \(s_k(z_{t+k}, c_t) = z^T_{t+k} W_k c_t\)
- \(W_k\) is a trainable matrix

- CLIP is a contrastive learning model
- We can sequentially process non-sequential data - think about how we observe an image
- Variable length sequences are tough to work with for basic NNs
- Recurrent Neural Networks contain an internal “state”, a summary or context of what’s been seen before
- RNN formulation: \(h_t = f_W(h_{t-1}, x_t)\) (repeat as necessary!)
- Typical autoregressive loss function - run everything through, make predictions, then sum/mean the losses from those predictions

- Vanilla RNN learns three weight matrices:
- Transforms input
- Transforms context state (sum these last two then use tanh)
- Transforms context state to output
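
A hedged sketch of the vanilla RNN with its three weight matrices. The names `W_xh`, `W_hh`, `W_hy` and the sizes are illustrative, and biases are left out to keep it short.

```python
import numpy as np

def rnn_forward(xs, h0, W_xh, W_hh, W_hy):
    """Apply the same weights at every step: h_t = tanh(W_hh h_{t-1} + W_xh x_t)."""
    h, outputs = h0, []
    for x_t in xs:
        h = np.tanh(W_hh @ h + W_xh @ x_t)   # sum the two transforms, then tanh
        outputs.append(W_hy @ h)             # transform the state into an output
    return np.array(outputs), h

rng = np.random.default_rng(0)
D, H, C, T = 4, 8, 3, 5                      # input, hidden, output sizes; sequence length
W_xh = rng.normal(size=(H, D)) * 0.1
W_hh = rng.normal(size=(H, H)) * 0.1
W_hy = rng.normal(size=(C, H)) * 0.1
xs = rng.normal(size=(T, D))

outputs, h_last = rnn_forward(xs, np.zeros(H), W_xh, W_hh, W_hy)
short_outputs, _ = rnn_forward(xs[:2], np.zeros(H), W_xh, W_hh, W_hy)
```

Because the loop reuses the same weights, the function handles any sequence length without changes.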

- Hidden representation is commonly initialized to zero
- One could learn the initial hidden state, but not necessary

- Sequence length is an assumption we often have to make during training
- Encoder-decoder architecture for sequences encodes sequence into a representation, then decodes into a sequence from that representation
- Think language translation, original attention paper

- One hot vector is a vector of all zeros and a one corresponding to a single class
- We use an embedding layer - computationally inexpensive (indexing), but keeps gradient flowing!
- Test time - autoregressive model (I wrote about this!), similar to a phone’s text autocomplete
- We truncate sequences, these act as minibatches
- We get surprising emergent behaviors from sequence modeling
- RNN advantages:
- Can process sequential information
- Can use information from many steps back
- Symmetrical processes for each step

- RNN disadvantages:
- Recurrent computation is slow
- Information is lost over a sequence in practice
- Exploding gradients :(

- Image captioning: combine RNNs and CNNs
- Take a CNN and remove the classification head
- Use the image representation as the first hidden representation in an RNN
- Repeatedly sample tokens until the `<END>` token is produced

- Question answering: use RNN and CNN to generate representations, learn a compression, then softmax across the language
- Agents can learn instructions and actions to take in a language or image based environment (same principles as before)
- Multilayer RNNs: stacking layer weights and adding depth
- Vanilla RNN gradient flow: look up the derivation!
- Gradients will vanish as the tanh squishes the gradient for each step
- Even without tanh, this problem is repeated - gradients will explode or vanish
- Note: normalization doesn’t work well with RNNs, active field of research

- To combat exploding gradients we can use gradient “clipping”, scaling the gradient if its norm is too large
- However, there is no good solution for vanishing gradients
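
A minimal sketch of gradient clipping by norm. The threshold is an assumed value picked for the example.

```python
import numpy as np

def clip_by_norm(grad, max_norm):
    """Rescale the gradient if its norm exceeds max_norm; direction is unchanged."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

big = clip_by_norm(np.array([3.0, 4.0]), max_norm=1.0)     # norm 5 is scaled down to 1
small = clip_by_norm(np.array([0.3, 0.4]), max_norm=1.0)   # norm 0.5 is left alone
```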

- LSTMs (Long Short Term Memory) help solve the problems within RNNs
- LSTMs keep track of two values: a cell memory and a next hidden representation
- Long and short term memory!

- Usually both initialized to zero

- LSTMs produce four outputs \(4h\) from the hidden state \(h\) and the vector from below \(x\)
- This can be done in parallel through matrix multiplication
- Three of these outputs are passed through a sigmoid nonlinearity, the fourth is passed through a tanh

- Cell memory: \(c_t = f \cdot c_{t - 1} + i \cdot g\), \(h_t = o \cdot tanh(c_t)\)
- The “info gate” (output of tanh) determines how much to write to cell: \(g\)
- The “input gate” \(i\) determines whether to write to cell
- The “forget gate” \(f\) determines how much to forget or erase from cell
- The “output gate” \(o\) determines how much to reveal cell at a timestep
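
A hedged sketch of a single LSTM step following the equations above. The gate ordering inside the stacked weight matrix (i, f, o, g) and the shapes are assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """W: (4H, H + D) maps the stacked [h_prev; x] to all four gate pre-activations."""
    H = h_prev.shape[0]
    a = W @ np.concatenate([h_prev, x]) + b   # one matmul produces 4H values
    i = sigmoid(a[0:H])                       # input gate: whether to write
    f = sigmoid(a[H:2 * H])                   # forget gate: how much to erase
    o = sigmoid(a[2 * H:3 * H])               # output gate: how much to reveal
    g = np.tanh(a[3 * H:4 * H])               # candidate values to write
    c = f * c_prev + i * g                    # c_t = f * c_{t-1} + i * g
    h = o * np.tanh(c)                        # h_t = o * tanh(c_t)
    return h, c

rng = np.random.default_rng(0)
D, H = 4, 6
W = rng.normal(size=(4 * H, H + D)) * 0.1
b = np.zeros(4 * H)
h, c = lstm_step(rng.normal(size=D), np.zeros(H), np.zeros(H), W, b)
```

When the forget gate is near one and the input gate near zero, the cell memory passes through almost unchanged, which is what lets gradients flow across many steps.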

- LSTMs create a “highway” for gradients to flow back over time through the cell memory
- LSTMs preserve information better over time, however there’s no guarantee
- Residual (type) connections are very popular and widespread throughout deep learning!
- Neural architecture search for LSTM-like architecture was popular in the past, but no more
- GRUs (Gated Recurrent Unit) are inspired by LSTMs
- Quite simple and common because of ease of training

- LSTMs were quite popular until this year, however transformers have risen in popularity
- Seemingly a new transformer model every day, check out: visual transformer history

- Recurrent image captioning is constrained by the size of the image representation vector
- Attention: use different context vector at different timesteps
- Use some function to get relationship scores within different elements in a vector
- Softmax to normalize the scores
- Use these scores to create output context vector (multiply by some value vector)
- This is just scaled dot-product attention

- Also works for interpretability: allows for bias correction as well
- Attention Layer:
- We matmul a query and key to create scores
- Softmax the scores then to normalize
- Multiply the scores by the values, then sum

- Query, key, and value come from linear transformations
- Notes:
- We need to use masked-attention for sequence problems
- Important for autoregressive problems
- We love `torch.triu` (handy for building the mask)
- We need to inject positional encodings as attention layers have no built-in notion of position (they are permutation-equivariant)
- Simply add positional vectors (often sinusoidal)

- We also scale the dot-product by dividing by \(\sqrt{d}\)

- Self attention doesn’t require separate queries and values, they come from the same input
- Multi-head self attention:
- Split the dimensionality up, use attention for each part, then concatenate
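
A minimal NumPy sketch of masked multi-head self-attention, tying together the pieces above: linear maps for query, key, and value, scaled dot-product scores, a causal mask, softmax, and concatenation of heads. The output projection `W_o` and all sizes are assumptions for illustration. `np.triu` plays the role that `torch.triu` would in PyTorch.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, W_q, W_k, W_v, W_o, num_heads, causal=True):
    """x: (T, d_model); every weight matrix: (d_model, d_model)."""
    T, d_model = x.shape
    d_head = d_model // num_heads

    def split(m):                                   # (T, d_model) -> (heads, T, d_head)
        return m.reshape(T, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ W_q), split(x @ W_k), split(x @ W_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)       # (heads, T, T)
    if causal:
        future = np.triu(np.ones((T, T), dtype=bool), k=1)    # True above the diagonal
        scores = np.where(future, -np.inf, scores)            # block attention to the future
    weights = softmax(scores, axis=-1)                        # each row sums to 1
    heads = weights @ v                                       # weighted sum of values
    concat = heads.transpose(1, 0, 2).reshape(T, d_model)     # concatenate the heads
    return concat @ W_o, weights

rng = np.random.default_rng(0)
T, d_model, num_heads = 5, 8, 2
x = rng.normal(size=(T, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) * 0.3 for _ in range(4))
out, weights = multi_head_self_attention(x, W_q, W_k, W_v, W_o, num_heads)
```

The `weights` array has shape `(heads, T, T)`, which is the \(n^2\) memory cost that makes long sequences expensive.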

- One of the biggest issues: attention is an \(n^2\) memory requirement
- Transformers: sequence to sequence model (encoder-decoder)
- Encoder:
- Made up of multiple “encoder blocks”
- Each encoder block has a multi-head self-attention block (with residual connection), layernorm, MLP (with residual connection), then a second layer norm

- Decoder:
- Made up of multiple “decoder blocks”
- *Masked* multi-head self attention (residual), layernorm, cross attention (with encoder output and residual), layernorm, MLP (residual), second layernorm

- CNNs are often replaced by transformers which split data up into patches
- Good if you have a lot of data, but with small data you want to use CNNs
- Vision transformer (ViT)

- Image captioning can work now with purely transformer-based architectures
- Transformer (size) history: (see slides)
- Started with 12 layers, 8 heads, 65M params

- LeNet, small network of convolutional layers
- We pool to downsample images

- AlexNet: bigger model
- Model split across multiple GPUs for training
- First use of ReLU
- Data augmentation
- Dropout 0.5
- 7 model ensemble

- VGGNet: smaller filters, deeper network
- GoogLeNet: added multiple paths between layers (InceptionNet)
- Later, 1x1 convolutions were added to downscale filter dimensions (reduce computational costs)
- Also add auxiliary classification outputs earlier to keep gradient flow

- ResNet: add residual connections between layers
- Keeps gradients flowing
- Also allows model to learn the difference (residual) from its input
- All current networks start with a conv layer
- No dropout, Kaiming init

- ViT: Vision Transformer
- Add a convolution in the first layer to create patches
- Then just run through transformer blocks
- ViT needs more data
- Trained on a dataset called JFT-300M
- ViT performs worse than ResNet on 10M images

- Final layer is finetuned on ImageNet-1.5M

- MLP-Mixer: all MLP architecture
- Full of Mixer Layers
- Mixer layer is layer norm, transpose to get patches, MLP, transpose to get channels, layernorm, second MLP

- ResNet improvements:
- Change normalization ordering
- Wider (more filters) networks
- ResNeXt: multiple paths inception-style
- DenseNet: add more residual pathways
- MobileNet: use channel downsampling 1x1 conv layers
- Neural Architecture Search: NAS
- EfficientNet: fast, accurate, small