CSE 455, Computer Vision
Spring 2024, Ranjay Krishna
Lecture 1, March 26
- Two big technologies make computer vision possible:
- Cameras: easy way to capture images
- Ability to scan and digitize images
- Goal of CV: convert light into meaning!
- Very multidisciplinary definition
- Tasks
- Extracting semantic information
- Extracting geometric information
- 1966: one undergrad at MIT was tasked with solving CV in a single summer!
- 80% of web traffic is images and videos - very interesting
fact!
- First big success of CV was face detection for cameras
- We’re going to focus on non-deep learning CV
- CV is not just DL! We have to find what’s important in the
future!
Lecture 2, March 28
- Color is essential to understand the world - computers need that capability too!
- Simple color can be used for clustering or object recognition
- Color formation:
- Light can be completely described by a plot/histogram of wavelength vs. quantity
- Light is subtractive or additive
- Radiance: amount of light reflected/emitted from a surface
- Irradiance: amount of light energy falling onto a surface
- BRDF: function relating irradiance (incoming) to radiance (outgoing)
- Diffuse surfaces: appear equally bright from all directions (light
is spread out equally)
- Albedo: fraction of light a surface reflects
- How do humans see light?
- Cones: less sensitive to brightness; give color vision
- Three different kinds of cones, each tuned to different
wavelengths
- Rods: more sensitive to brightness; grayscale only
- Color is very lossy: we lose a lot of spectral information when light is captured!
- Color spaces:
- The basis vectors are “basis colors” (e.g. RGB)
- Mixing two colors allows for a linear combination of colors between
them
- Mixing three colors can’t create all possible colors!
- XYZ colorspace:
- Y is the brightness of the color
- HSV colorspace:
- Hue
- Saturation
- Value (intensity)
Lecture 3, April 2
- Resolution (quantization) is a major loss of information
- Metric is DPI (dots per inch)
- Types of images:
- Binary (white and black)
- Greyscale
- Color: three channels
- Histograms are just counts over an image
- This can be done over an entire image, over patches, over
rows/columns, etc
- Images as functions:
- Converts the real world into pixel values
- Digital images are typically discrete
- These functions can convert positions into values, one for each
channel
- We assume positions outside the “image” are NaN
- \(f: \mathbb{R}^2 \rightarrow \mathbb{R}^C\), where \(C\) is the number of channels
- Domain: \((-\infty, \infty)\); range: \([0, 255]\)
- Filters transform images:
- \(\text{image} \rightarrow
\text{image}\) function
- Do some sort of transformation, e.g. denoising
- All layers in a neural network are simply filters
- There are linear filters! (Very similar to NNs)
- A system might use multiple filters to transform an image
- Possible properties of systems:
- Additivity
- Homogeneity
- Superposition: basically, does it work as a linear combination?
- Stability: if input is bounded, the output is also bounded
- Invertibility
- Causality: if all input values above and to the left of a point are zero, the output at that point is zero
- Shift invariance: shifted input and output are shifted by the same
amount
Lecture 4, April 4
- A filter is a linear system if superposition holds
- Linear shift-invariant (LSI) systems are linear systems that are also shift invariant
- Humans are (practically) scale and translation invariant (cool!)
- Impulse function: one at the origin, zero everywhere else
- \(\delta_2\) is a 2-D impulse
function
- \(h[m, n]\) is the output at \(m, n\) when the impulse function is passed through a filter (the impulse response)
- Any image can be written as a summation of shifted, scaled delta functions!
- Property of LSI systems:
- For an impulse function input into an LSI system, the LSI system is completely specified by the output (the impulse response)
- We can use that output instead of the LSI system (allows convolutions!)
- This means for \(f[n, m] \overset{S}{\rightarrow} g[n, m]\): \(f[n, m] = \sum_{k=-\infty}^\infty \sum_{l=-\infty}^\infty C_{k, l} \cdot \delta_2 [n-k, m-l]\), where \(C_{k, l} = f[k, l]\)
- Finally, \(g[n, m] = \sum_{k=-\infty}^\infty \sum_{l=-\infty}^\infty C_{k, l} \cdot S\{\delta_2 [n-k, m-l]\}\)
- This is the convolution operation!
- \(f[n,m]*h[n,m] = \sum_{k=-\infty}^\infty \sum_{l=-\infty}^\infty f[k,l] \cdot h[n-k, m-l]\)
- \(*\) typically represents the convolution operation
- Convolutions:
- We will typically “flip” or “fold” the kernel about the origin, so
that it works with our equation
- We can have complicated transformations (e.g. sharpening) just via
multiple kernels!
- For finite images on computers with non-infinite domain, we
typically pad the edges
- Can pad with zeros, mean, mirror, extension, etc
- Cross-correlation:
- Two-star symbol: \(**\)
- Convolution without the flip
- Similar to the search operation
- This is what is typically implemented as “convolution” in deep learning libraries
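A minimal NumPy sketch (not from the lecture) contrasting the convolution formula above with cross-correlation; flipping the kernel is the only difference. It assumes an odd-sized kernel and zero padding, and the function names are my own.

```python
import numpy as np

def cross_correlate(image, kernel):
    """Slide the kernel over the image without flipping it (the DL-style operation)."""
    kh, kw = kernel.shape                      # assumes an odd-sized kernel
    padded = np.pad(image.astype(float), ((kh // 2, kh // 2), (kw // 2, kw // 2)))  # zero padding
    out = np.zeros(image.shape, dtype=float)
    for n in range(image.shape[0]):
        for m in range(image.shape[1]):
            out[n, m] = np.sum(padded[n:n + kh, m:m + kw] * kernel)
    return out

def convolve(image, kernel):
    """Convolution = cross-correlation with the kernel flipped about the origin."""
    return cross_correlate(image, kernel[::-1, ::-1])
```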
Lecture 5, April 9
- Systems can be represented as a sum of impulse responses
- LSI systems are completely specified by their impulse responses
- We can create new systems by combining filters
- Edges are essential to the way we visually understand scenes
- Some edges are more important for specific image types
- Very cool example: researchers classified images from brain scans taken while people were looking at them. There was little difference between showing a human an RGB photo vs. black-and-white edges.
- Edges also allow much more efficient processing
- Edge detection:
- Identify sudden changes in an image
- We often want to find physical edges, not necessarily color
discontinuities
- We can calculate numeric derivatives in images and use them to find
edges
- Filters can perform differentiation
- Derivatives: backward, forward, and central derivatives
- Discrete derivation in 2D:
- Output two gradients from both dimensions and combine them
- Edge strength comes from: \(\sqrt{f_n^2 + f_m^2}\)
- We can get the direction via: \(\theta =
\tan^{-1}(\frac{f_m}{f_n})\)
- Gradient spikes will be edges! (assuming the image isn’t noisy)
- To combat noise, we can add a blur kernel to the image
- Edge detection example:
- Add Gaussian smoothing kernel
- Add derivative kernel afterwards
- This can be done in one step! (derivative of the Gaussian kernel, DoG)
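A small SciPy sketch (my own, not from the slides) of the smooth-then-differentiate pipeline above; `gaussian_filter` with a first-order derivative applies a derivative-of-Gaussian in a single step.

```python
import numpy as np
from scipy import ndimage

def gradient_magnitude(image, sigma=1.0):
    """Smooth and differentiate in one step with derivative-of-Gaussian filters."""
    img = image.astype(float)
    fx = ndimage.gaussian_filter(img, sigma, order=(0, 1))  # derivative along x (columns)
    fy = ndimage.gaussian_filter(img, sigma, order=(1, 0))  # derivative along y (rows)
    magnitude = np.sqrt(fx**2 + fy**2)   # edge strength
    direction = np.arctan2(fy, fx)       # edge direction theta
    return magnitude, direction
```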
Lecture 6, April 11
- Sobel Filters:
- Gaussian blur combined with central derivative filter
- Has poor localization
- Requires you to threshold the values - a messy problem
- Canny edge detector:
- From 1986!
- Most common edge detector today
- Theoretically optimal when pixels have Gaussian noise
- Steps:
- Suppress noise
- Compute gradient magnitude and direction
- Apply non-maximum suppression
- Hysteresis
- Non-maximum suppression
- Zero out non-maximum edge values
- For any point, compare it to neighbors. If it is smaller, zero it
out
- Get the two neighbors via the gradient angle
- As pixels are quantized, can use weighted average of neighboring
eight pixels
- Hysteresis:
- Two thresholds, low and high
- Pixels above the low threshold become an edge only if they connect to strong edges
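A quick usage sketch of the Canny pipeline above, assuming OpenCV is available (the image path and thresholds are placeholders, not from the notes):

```python
import cv2

# Read a grayscale image (path is a placeholder)
img = cv2.imread("example.png", cv2.IMREAD_GRAYSCALE)

# 1. Suppress noise with a Gaussian blur
blurred = cv2.GaussianBlur(img, (5, 5), 1.4)

# 2-4. Gradients, non-maximum suppression, and hysteresis all happen inside cv2.Canny;
# the two arguments are the low and high hysteresis thresholds.
edges = cv2.Canny(blurred, 50, 150)
```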
- Hough transform (“Huff”):
- 1962
- Transform edges into lines
- For a single point, we can define the lines that pass through it in \(a,b\) (slope-intercept) space
- Do this for all points, intersection (or close intersections)
represent lines going through the same points
- Divide the \(a,b\) space into
cells, count the number of intersections, draw those lines
- Can use one of many bases/parameterizations, e.g. polar
- Could be extended to multiple dimensions, for curves in images
- Computationally complex
- \(O(n^2)\) of edges
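A compact sketch (mine, not the lecture's) of the voting scheme described above, using the polar parameterization so slopes stay bounded:

```python
import numpy as np

def hough_lines(edge_points, img_diag, n_theta=180, n_rho=200):
    """Vote in (rho, theta) space; peaks in the accumulator correspond to lines."""
    thetas = np.linspace(-np.pi / 2, np.pi / 2, n_theta)
    rhos = np.linspace(-img_diag, img_diag, n_rho)
    accumulator = np.zeros((n_rho, n_theta), dtype=int)
    for x, y in edge_points:                            # each edge point votes for every line through it
        for t_idx, theta in enumerate(thetas):
            rho = x * np.cos(theta) + y * np.sin(theta)
            r_idx = int(np.argmin(np.abs(rhos - rho)))  # nearest rho cell
            accumulator[r_idx, t_idx] += 1
    return accumulator, rhos, thetas
```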
- RANSAC:
- Try to remove the votes of noisy edges, without \(O(n^2)\) iteration
- Performs poorly when there are many outliers
- Steps:
- Pick some points at random, draw a line between them
- Find the “inlier” points that are within a threshold of the line
- If the number of inliers is greater than the current greatest
number, compute line of best fit from the inliers
- Repeat until number of inliers doesn’t change
- Repeat all \(k\) times
- Hyperparameters:
- Number of points in the seed set
- How many times to repeat? (\(k\))
- Threshold
- Min number of inliers to claim a line exists
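A minimal sketch of the RANSAC loop above for 2D line fitting (my own simplification; the hyperparameter defaults are placeholders):

```python
import numpy as np

def ransac_line(points, k=100, threshold=1.0, min_inliers=10, rng=None):
    """Robustly fit a line y = a*x + b to an (N, 2) array of points."""
    rng = rng or np.random.default_rng()
    best_model, best_count = None, 0
    for _ in range(k):                                   # repeat k times
        i, j = rng.choice(len(points), size=2, replace=False)
        (x1, y1), (x2, y2) = points[i], points[j]
        if x1 == x2:                                     # skip degenerate (vertical) seed pairs
            continue
        a = (y2 - y1) / (x2 - x1)
        b = y1 - a * x1
        # distance from every point to the line a*x - y + b = 0
        dist = np.abs(a * points[:, 0] - points[:, 1] + b) / np.sqrt(a**2 + 1)
        inliers = points[dist < threshold]
        if len(inliers) >= min_inliers and len(inliers) > best_count:
            # refit with least squares on the inliers
            a, b = np.polyfit(inliers[:, 0], inliers[:, 1], deg=1)
            best_model, best_count = (a, b), len(inliers)
    return best_model
```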
Lecture 7, April 16
- Simple cross-correlation doesn’t work for some matching problems:
- Occlusion, lighting shifts, intra-category variation, articulation,
etc.
- Matching images, a formula:
- Take patches and match them to an image
- General idea: find key points and the regions surrounding them.
Describe and normalize them, then compute descriptors.
- Harris Corner Detector:
- Idea is multiplying gradients of \(x\) and \(y\) by each other
- “Find patches resulting in large pixel-value changes in any
direction”
- We create energy function: \(E(u,v) =
\sum_{x,y} w(x,y) [I(x+u,y+v)-I(x,y)]^2\)
- Window function might be Gaussian or discrete
- We can use the Taylor expansion, see slides for formula
- This represents an ellipse
- The eigenvalues/eigenvectors describe the ellipse, but in practice we approximate the eigenvalues instead of computing them directly
- Algo:
- Compute image derivatives
- Square of derivatives
- Gaussian filter
- Cornerness function (ensure two large eigenvalues)
- Non-maximum suppression (only one corner per location)
- Threshold
- Properties:
- Translation invariant
- Rotation invariant
- Not scale invariant
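A sketch of the Harris steps above (my own), using the standard det/trace approximation to the eigenvalues; the thresholding and non-maximum suppression steps are left out.

```python
import numpy as np
from scipy import ndimage

def harris_response(image, sigma=1.0, k=0.05):
    """Cornerness map: large where both eigenvalues of the second-moment matrix are large."""
    img = image.astype(float)
    Ix = ndimage.sobel(img, axis=1)                 # image derivatives
    Iy = ndimage.sobel(img, axis=0)
    Sxx = ndimage.gaussian_filter(Ix * Ix, sigma)   # squared derivatives, Gaussian-weighted
    Syy = ndimage.gaussian_filter(Iy * Iy, sigma)
    Sxy = ndimage.gaussian_filter(Ix * Iy, sigma)
    det = Sxx * Syy - Sxy**2                        # det(M) - k * trace(M)^2 avoids explicit
    trace = Sxx + Syy                               # eigenvalue computation
    return det - k * trace**2
```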
Lecture 8, April 18
- Scale-Invariant detection:
- Depending on scale of images, we need differing patch sizes
- For some metric function, our optimal scale is at the peak of the
function
- We want one peak on the function!
- Laplacian: (second derivative of a Gaussian)
- Convolving an image with a Laplacian, edges show up as zero crossings (the response goes from high to low)
- Laplacians give a peak at the optimal “blob” detection size
- SIFT and Harris-Laplacian are the two standards
- Harris is more accurate, but SIFT is faster (it can run in real time)
- Local descriptors:
- Convert key points into local descriptors, represented as
vectors
- How to get rotation-invariant descriptors?
- Calculate all the gradients in each patch, then rotate the patch so the dominant gradient direction is axis-aligned
- SIFT:
- Idea:
- Convolve a possible “blob” by a Laplacian
- Multiply that by a Gaussian (with some \(\sigma\))
- Good \(\sigma\) will result in a
very low peak
- For any sized blob, there’s a good \(\sigma\) (the characteristic scale)
- \(\nabla^2 n_\sigma\) is the Laplacian of Gaussian (LoG)
- Similarly, \(\sigma^2 \nabla^2
n_\sigma\) is Normalized LoG
- Try different values of \(\sigma\)
to find the best scale
- How to get sigmas: \(\sigma_k = \sigma_0
s^k\) for \(k=0,1,\dots\)
- \((x^*, y^*, \sigma^*) = \arg\max_{(x,y,\sigma)} \|\sigma^2 \nabla^2 n_\sigma * I(x,y)\|\)
- LoG is very similar to difference of Gaussians (DoG), so we can use
either
- Equivalent to blurring the image with many different Gaussians, then subtracting adjacent \(\sigma\) levels
- Then we take the max over local \(3 \times 3 \times 3\) neighborhoods (in \(x\), \(y\), and scale)
- Algo:
- Blur keypoint image patch
- Calculate image gradients over patch
- Rotate all gradients by the keypoint angle
- Generate descriptor
- Descriptor generation:
- Divide the blob into a \(4 \times 4\) array of cells
- Calculate an overall gradient orientation histogram (8 bins) in each cell
- The output is a 128-length vector (\(4 \times 4 \times 8\))
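A small sketch (not from the slides) of the characteristic-scale idea above: evaluate the scale-normalized LoG response at a keypoint across a ladder of \(\sigma\) values and keep the peak. The \(\sigma\) ladder values are placeholders.

```python
import numpy as np
from scipy import ndimage

def characteristic_scale(image, x, y, sigmas=(1.0, 1.6, 2.56, 4.1, 6.55)):
    """Return the sigma where the scale-normalized LoG response at (x, y) peaks in magnitude."""
    responses = []
    for sigma in sigmas:
        # sigma^2 * (Laplacian of Gaussian) = normalized LoG
        log = sigma**2 * ndimage.gaussian_laplace(image.astype(float), sigma)
        responses.append(abs(log[y, x]))
    return sigmas[int(np.argmax(responses))]
```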
Lecture 9, April 23
- Make SIFT robust:
- Threshold gradients (often 0.2)
- Normalize each histogram
- HoG: histogram of oriented gradients
- Detect shapes of objects, we can use that to classify!
- Normalization is the main difference from SIFT:
- We normalize local cells using the gradients of larger local blocks
- HoG has a lot of false positives!
- SIFT is used for point matching, HoG for image description
- Resizing:
- We need to keep the important parts of an image centered!
- Saliency: what’s the most important parts of an image?
- Framework:
- Create saliency map over image
- Resize image somehow according to map
- Methods:
- Cropping:
- Doesn’t work well when there are multiple targets
- Seam Carving:
- Every pixel’s energy is equal to the sum of absolute gradients
- Then remove the pixels with the lowest energy
- Seam: connected path of pixels from top to bottom
- We need to use dynamic programming to find the minimum-energy seam (see the sketch below)
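A minimal dynamic-programming sketch (mine) for finding the lowest-energy vertical seam, given a per-pixel energy map:

```python
import numpy as np

def min_vertical_seam(energy):
    """Dynamic program: accumulate minimum energy top-down, then backtrack one pixel per row."""
    h, w = energy.shape
    cost = energy.astype(float).copy()
    for i in range(1, h):
        for j in range(w):
            lo, hi = max(j - 1, 0), min(j + 2, w)       # reachable parents in the row above
            cost[i, j] += cost[i - 1, lo:hi].min()
    seam = [int(np.argmin(cost[-1]))]                   # cheapest pixel in the bottom row
    for i in range(h - 2, -1, -1):
        j = seam[-1]
        lo, hi = max(j - 1, 0), min(j + 2, w)
        seam.append(lo + int(np.argmin(cost[i, lo:hi])))
    return seam[::-1]                                   # seam column index for each row, top to bottom
```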
Lecture 10, April 25
- Image expansion: repeat the \(k\) lowest-energy seams with seam carving
- Possible content enhancement: use seam carving to remove, then
add
- User defined energy: allow user to modify energy function for
results
- Problem: seam carving increases energy and adds edges
- Instead, remove the seam which inserts the least energy
- Video processing:
- Very similar, but add energy between frames
- Seam carving is also used for sports!
- Image segmentation: find pixels which go together
- Grouping:
- Superpixels: little groups of pixels
- There’s an ideal level of segmentation
Lecture 11, April 30
- Gestalt theory:
- Whole groups of objects matter, not just individual elements
- Proximity matters a lot for grouping
- Think of images and pixels as a graph
- Some distance function defines the edge weights
- Possible pixel features:
- RGB
- Location
- RGB + location…
- Agglomerative clustering:
- Algo:
- Start with everything as its own cluster
- Merge most similar (closest) pair as its own cluster
- Repeat
- Distance measures:
- Single linkage: connect based on distance of closest pixels, long
skinny clusters
- Complete linkage: connect based on distance of the farthest pixels; compact clusters
- Average link: mean distance, good against outliers
- Inner-outer linkage: merge based on distance between inliers vs
outliers
- Num clusters is not needed at runtime
- Terrible runtime \(O(n^3)\)
- K-means clustering:
- Choose cluster centers via minimizing sum of square distances
- Finding the exact optimum is computationally infeasible, so we use an iterative algorithm
- Iteratively (see the sketch below):
- Calculate which points correspond to which clusters
- Calculate each cluster’s center from its points
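A sketch of the iterative loop above (my own; `features` could be per-pixel RGB or RGB+location vectors):

```python
import numpy as np

def kmeans(features, k, n_iters=50, rng=None):
    """Iterative k-means: alternate point assignment and center updates."""
    rng = rng or np.random.default_rng()
    centers = features[rng.choice(len(features), size=k, replace=False)].astype(float)
    for _ in range(n_iters):
        # assign each feature vector to its nearest center
        dists = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # recompute each center as the mean of its assigned points
        for c in range(k):
            if np.any(labels == c):
                centers[c] = features[labels == c].mean(axis=0)
    return labels, centers
```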
Lecture 12, May 2
- Mean-shift clustering:
- Algo:
- Initialize window
- Find mean of points within window
- Shift window to be centered at mean
- Repeat
- A cluster is all the points that fell inside the window at some point during its shifts
- Sometimes two radii are used, smaller for membership, larger for
center
- General and model-free
- Variable number of modes
- Robust to outliers
- Computationally expensive
- Single-view:
- Large ambiguity
- Lasers are often required
- Stereo vision:
- Using two views to get depth and understanding
- How to understand images:
- Recognizing familiar objects and their relative sizes
- Shading from light sources
- Perspective effects
- Geometry:
- We use homogeneous coordinates (append a 1 to the coordinates)
- To convert back to heterogeneous coordinates, divide the first two values by the third
- Any point can now be represented by an infinite number of vectors, \(x \sim c \cdot x\)
- Transformations:
- Scaling: diagonal matrix with scaling values
- Rotation: matrix of \(\sin, \cos\)
about the origin
- 2D translation: use homogeneous coordinates to “add”
- Similarity: rotation, uniform scale, and translation
- Affine:
- Lines remain lines
- Parallel lines remain parallel
- Projective transformation: all values in the matrix can change
- Transformation composition: transformations are applied right to
left
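A tiny NumPy sketch (mine, not from the lecture) of homogeneous 2D transformations and the right-to-left composition just described:

```python
import numpy as np

def translation(tx, ty):
    return np.array([[1, 0, tx],
                     [0, 1, ty],
                     [0, 0, 1]], dtype=float)

def rotation(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0],
                     [s,  c, 0],
                     [0,  0, 1]], dtype=float)

# Rotate about the origin, then translate: composed transforms apply right to left.
T = translation(5, 2) @ rotation(np.pi / 4)
p = np.array([1.0, 0.0, 1.0])            # the 2D point (1, 0) in homogeneous coordinates
p2 = T @ p
x, y = p2[0] / p2[2], p2[1] / p2[2]      # divide by the third value to go back to heterogeneous
```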
- Pinhole cameras:
- Images are inverted when hitting the sensor
Lecture 13, May 7
- We often use ideas from pinhole cameras to model more complex
ideas
- (Pinhole) transformations:
- We represent a point on the image plane (camera world) as \(x\)
- The real-world position is \(X\)
- We try to find some transformation \(P\) s.t. \(x =
PX\)
- This will be equivalent up to some scale factor \(\lambda\) due to the coordinates being homogeneous
- Focal length: distance from the camera center to the image plane
- Transformation is easy when camera is at the origin
- More complicated when camera is not, see slides
- We assume \(z=1\), only translate
the \(x,y\)
- Camera-to-image transformation:
- Simple version:
- \(x = \frac{fX}{Z}\)
- \(y = \frac{fY}{Z}\)
- Complex version: \(x^I \sim
K[I|0]X^C\)
- \(K = \begin{bmatrix} f & 0 & p_x
\\ 0 & f & p_y \\ 0 & 0 & 1 \end{bmatrix}\)
- Assumes \(Z=1\)
- Camera-to-world transformation:
- Subtract camera position from world position: \(X^W - C^W\)
- Account for rotation: \(X^C = R(X^W -
C^W)\)
- We have some changes due to using homogeneous coordinates
- World-to-camera transformation:
- \(\begin{bmatrix} R & -RC\\ 0 & 1
\end{bmatrix}\)
- \(R\) is rotation matrix
- All camera transformations can be composed of three matrices:
- Intrinsic parameters (camera internals, img->img transformation):
\(K\)
- Different for all cameras
- Perspective projection (3D->2D transformation): \([I|0]\)
- Can be combined with extrinsic parameters
- Extrinsic parameters: \(\begin{bmatrix} R
& -RC\\ 0 & 1 \end{bmatrix}\)
- Independent of the type of the camera
- Pinhole camera matrix: \(P =
K[R|t]\) where \(t = -RC\)
- To use, find \(f, p_x, p_y\) and
rotation/translation variables
- For many cameras, the focal lengths in \(x\) and \(y\) aren’t the same; use \(\alpha_x, \alpha_y\) to represent them
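A worked sketch (with made-up intrinsic and extrinsic values, not from the slides) projecting one world point with \(P = K[R|t]\):

```python
import numpy as np

# Intrinsics K (focal length and principal point are made-up values)
f, px, py = 800.0, 320.0, 240.0
K = np.array([[f, 0, px],
              [0, f, py],
              [0, 0,  1]])

# Extrinsics: no rotation, camera centered 2 units behind the world origin along z
R = np.eye(3)
C = np.array([0.0, 0.0, -2.0])
t = -R @ C
P = K @ np.hstack([R, t.reshape(3, 1)])      # P = K [R | t]

X_world = np.array([0.5, 0.25, 3.0, 1.0])    # homogeneous world point
x = P @ X_world
u, v = x[0] / x[2], x[1] / x[2]              # divide by the third coordinate to get pixel coordinates
```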
- 2D images don’t preserve length or angles
- Straight lines are preserved; parallel lines are only preserved when they are parallel to the image plane
- Vanishing point/lines:
- All parallel lines intersect at a vanishing point in a 2D image
- Vanishing line connects the vanishing points
- Allows us to understand 3D space
- (Linear) camera calibration:
- Take a photo of a known object
- Find a projection \(P\) such that
image can be mapped to real-world points
- Each point correspondence gives two equations: \(p_1^T X - x' p_3^T X = 0\) and \(p_2^T X - y' p_3^T X = 0\)
- This can be written in matrix form, used to solve as a system of
linear equations
- \(\begin{bmatrix} X^T & 0 & -x'X^T \\ 0 & X^T & -y'X^T \end{bmatrix} \begin{bmatrix} p_1 \\ p_2 \\ p_3 \end{bmatrix} = 0\)
- This can be extended by stacking multiple of the above matrices
- \(P\) has 12 entries (11 degrees of freedom up to scale); each point gives two equations, so at least 6 points are necessary to calibrate
- 30+ is often used due to noise!
- Also, the points should be not planar!
- We need to find the nullspace of the above matrix!
- Solve as a linear least-squares problem: \(\hat{x} = \arg\min_x \|Ax\|^2\) s.t. \(\|x\|^2 = 1\)
- SVD can be used to solve!
- \(\hat{x}\) is the column of \(V\) corresponding to the smallest singular value!
- Once we have \(P\), we can chop off
the translation at the end
- Use Cholesky decomposition to get \(K\) after squaring it to remove \(R\)
- Find scaling factor via simple formula \(\lambda |K^{-1}\hat{P}|^\frac{-1}{3}\)
- Finally, find \(C\) via
inverse
- Often not used in practice, favoring non-linear methods
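A sketch (mine) of the linear calibration procedure above: stack two equations per correspondence and take the nullspace direction from the SVD. Function and variable names are my own; it needs at least 6 non-planar points.

```python
import numpy as np

def calibrate_dlt(world_pts, image_pts):
    """Linear (DLT) calibration: two equations per correspondence, nullspace via SVD."""
    rows = []
    for (X, Y, Z), (u, v) in zip(world_pts, image_pts):
        Xh = np.array([X, Y, Z, 1.0])
        rows.append(np.concatenate([Xh, np.zeros(4), -u * Xh]))
        rows.append(np.concatenate([np.zeros(4), Xh, -v * Xh]))
    A = np.array(rows)                   # (2N, 12) system  A p = 0
    _, _, Vt = np.linalg.svd(A)
    p = Vt[-1]                           # right singular vector for the smallest singular value
    return p.reshape(3, 4)               # camera matrix P, up to scale
```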
- SVD recap:
- \(U \Sigma V^T = A\)
- \(U\):
- Rotation matrix
- Orthonormal
- \(\Sigma\):
- Diagonal, with singular values in descending order of magnitude
- \(V^T\)
- Rotation matrix
- Orthonormal
Lecture 14, May 9
- Can also do depth calibration using cameras: inverse of the
calibration using two cameras
- Simple classification pipeline:
- Extract features from images
- Combine features with labels
- Train to match labels and features
- Use the trained classifier
- See slides for table of features and their properties
- \(K\)-nearest neighbor:
- Assign label to a point via the \(k\) nearest neighbors
- Classifies feature space into labels
- Use a test set to represent new “real-life” examples
- We need to use cross-validation to verify KNN hyperparams
- Dimensionality reduction:
- Use SVD
- We can use only some of the principal components as features
- PCA: how to work with sparse data!
- “Find projection that maximizes variance”
- Algo:
- Find sample mean \(\mu\) of the
dataset \(X\)
- Duplicate \(\mu\) across each
column, subtract it from \(X\) to get
\(X_c\)
- Get sample covariance matrix: \(C =
\frac{1}{n}X_c X_c^T\)
- Do SVD decomposition: \(X_c = U \Sigma V^T\) where \(U^TU=I\), \(V^TV=I\)
- \(C = \frac{1}{n} U \Sigma^2 U^T\)
from SVD
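A compact sketch (my own, using NumPy's SVD) of the PCA steps above, with examples stored as columns of \(X\) as in the notes:

```python
import numpy as np

def pca(X, n_components):
    """PCA via SVD; X stores one example per column (d x n), as in the notes."""
    mu = X.mean(axis=1, keepdims=True)          # sample mean, broadcast across columns
    Xc = X - mu                                 # centered data X_c
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = U[:, :n_components]            # top principal directions
    projected = components.T @ Xc               # low-dimensional features
    return components, projected, mu
```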
Lecture 15, May 14
- Covariance is how values change with each other (correlation between
axes)
- Covariance matrix is symmetric with variance on the diagonal
- PCA can make sparse, high-dim data dense
- Eigenfaces (re)construct faces using a low-dimensional manifold
- PCA heavily requires spatial similarity between examples
- Linear Discriminant Analysis (LDA):
- Much better for classification
- Account for variation between classes, not just
reconstruction
- We get a covariance matrix between classes
- Therefore we want to minimize in-class variance, maximize
out-of-class variance
- See slides for equations
- Visual bag of words (BoW):
- Represent images as histograms of feature representations
- Common features are small patches of interest points (RANSAC,
SIFT)
- Algo:
- Use SIFT to find features
- Find nearest neighbors of those features
- Each NN increments the histogram by one
Lecture 16, May 16
- Pyramids: do multiple levels of image splitting, find histogram for
each level
- Object detection: detect objects in images, often by finding a bounding box for each object
- Object detection metrics: IoU, area of overlap divided by area of
union
- IoU > 0.5 is typically used as the threshold
- Precision: \(\frac{TP}{TP +
FP}\)
- Recall: \(\frac{TP}{TP + FN}\)
- We commonly need to make a tradeoff between precision and
recall
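A small sketch of the IoU metric above (my own; boxes given as corner coordinates). Under the usual threshold, a detection with `iou(...) > 0.5` would count as a true positive.

```python
def iou(box_a, box_b):
    """Intersection over union for boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```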
- Dalal-Triggs method: slide a window over the image, take the box with maximal activation
- Deformable parts method:
- Represent each object as a collection of parts
- Main box is the root, global, filter
- There are allowable locations for each part (filter)
- Quite human-labor intensive
- Usage:
- Run sliding windows across all pyramids for all parts
- Get part scores for possible person locations
- Weight global scores by the allowable locations for parts
Lecture 17, May 21
- Optical flow: apparent motion of brightness patterns in the
image
- Light focused, not object focused
- We can find \(u(x,y), v(x,y)\),
horizontal and vertical movement for a pixel between two images
- Assumptions:
- Small motions for points
- Spatial coherence: movement is similar locally
- Brightness constancy: points look similar in each frame
- Brightness constancy equation: \(I(x,y,t-1) = I(x+u(x,y),y+v(x,y),t)\)
- Linear Taylor expansion: \(I(x+u, y+v, t) \approx I(x,y,t-1) + I_x \cdot u(x,y) + I_y \cdot v(x,y) + I_t\)
- Goal: solve \(\nabla I \cdot
\begin{bmatrix}u & v \end{bmatrix}^T + I_t = 0\)
- Failure modes:
- Televisions: light moves, but set is stationary
- Motion with little change in pixels: (e.g. rotating sphere)
- Lighting changes
- Aperture problem: we can only measure movement in the direction
perpendicular to the edge
- Gradient-in-time filter: 2D, \(-1\) for the first image, \(+1\) for the second image
- Lucas-Kanade method:
- Assume all pixels in a 5x5 grid move in the same direction (spatial
coherence)
- In other words, all of the pixels have the same \(u, v\)
- Can solve using linear least squares (or other interesting
methods!)
- Solve: \(A^TA \begin{bmatrix}u \\ v
\end{bmatrix} = - A^Tb\)
- Solvable when both eigenvalues of \(A^TA\) are large and similar in magnitude (see the sketch below)
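A sketch (mine, not the lecture's) of the per-window least-squares solve described above:

```python
import numpy as np

def lucas_kanade_at(I1, I2, x, y, window=5):
    """Solve for (u, v) in one window, assuming every pixel there shares the same motion."""
    I1 = I1.astype(float); I2 = I2.astype(float)
    Iy, Ix = np.gradient(I1)                  # spatial gradients (rows, then columns)
    It = I2 - I1                              # temporal gradient between the two frames
    r = window // 2
    win = (slice(y - r, y + r + 1), slice(x - r, x + r + 1))
    A = np.stack([Ix[win].ravel(), Iy[win].ravel()], axis=1)   # (window^2, 2)
    b = It[win].ravel()
    uv, *_ = np.linalg.lstsq(A, -b, rcond=None)   # least squares: A^T A [u, v]^T = -A^T b
    return uv                                     # (u, v) for this window
```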
- Iterative Lucas-Kanade:
- Estimate direction of each pixel via Lucas-Kanade
- Warp image via estimated direction
- Repeat until convergence
- Spatial coherence enforcement: decrease resolution in pyramids until
true
- Iteratively run Lucas-Kanade starting from lowest resolution until
highest
- Horn-Schunck method: flow is a global energy function which should be minimized
- Both the Lucas-Kanade term and a smoothness term (movement should be consistent)
- \(E = \int \int [(I_x u + I_y v + I_t)^2 + \alpha^2 (\|\nabla u\|^2 + \|\nabla v\|^2)]\, dx\, dy\)
- \(u\) and \(v\) should be some small deviation from the
mean
- Because they are recursively defined, iteratively find them
- Use Lucas-Kanade to start, then repeat
- Segmentation:
- Groups have a “common fate” based on motion
- By this logic, groups have similar \(u,
v\)
- We can thus use K-means or clustering to do segmentation based on
\(u,v\)