- Two big technologies make computer vision possible:
- Cameras: easy way to capture images
- Ability to scan and digitize images

- Goal of CV: convert light into meaning!
- Very multidisciplinary definition

- Tasks:
- Extracting *semantic* information
- Extracting *geometric* information

- 1966: MIT's Summer Vision Project expected an undergrad to solve CV in a summer!
- 80% of web traffic is images and videos - very interesting fact!
- First big success of CV was face detection for cameras
- We’re going to focus on non-deep learning CV
- CV is not *just* DL! We have to find what's important in the future!


- Color is essential to understand the world - computers need that capability too!
- Simple color can be used for clustering or object recognition
- Color formation:
- Light can be completely described by a plot/histogram of wavelength vs. quantity
- Light is subtractive or additive

- Radiance: amount of light reflected/emitted from a surface
- Irradiance: amount of light energy falling onto a surface
- BRDF: function relating irradiance (incoming) to radiance (outgoing)
- Diffuse surfaces: appear equally bright from all directions (light is spread out equally)
- Albedo: fraction of light a surface reflects
- How do humans see light?
- Cones: less sensitive to brightness, color
- Three different kinds of cones, each tuned to different wavelengths

- Rods: more sensitive to brightness, grayscale

- Color is very lossy and we lose a lot of information when light hits things!
- Color spaces:
- The basis is a set of “basis colors” (e.g. RGB)
- Mixing two colors allows for a linear combination of colors between them
- Mixing three colors can’t create all possible colors!
- XYZ colorspace:
- Y is the *brightness* (luminance) of the color

- HSV colorspace:
- Hue
- Saturation
- Value (intensity)

- Resolution (quantization) is a major loss of information
- Metric is DPI (dots per inch)
- Types of images:
- Binary (white and black)
- Greyscale
- Color: three channels

- Histograms are just counts over an image
- This can be done over an entire image, over patches, over rows/columns, etc

- Images as functions:
- Converts the real world into pixel values
- Digital images are typically discrete
- These functions can convert positions into values, one for each channel
- We assume positions outside the “image” are NaN
- \(f: \mathbb{R}^2 \rightarrow \mathbb{R}^C\), where \(C\) is the number of channels
- Domain is \((-\infty, \infty)^2\), range is \([0, 255]\)

- Filters transform images:
- \(\text{image} \rightarrow \text{image}\) function
- Do some sort of transformation, e.g. denoising
- All layers in a neural network are simply filters
- There are linear filters! (Very similar to NNs)

- A system might use multiple filters to transform an image
- Possible properties of systems:
- Additivity
- Homogeneity
- Superposition: basically, does it act as a linear combination?
- Stability: if the input is bounded, the output is also bounded
- Invertibility
- Causality: if the input is zero above and to the left of a point, the output at that point is zero
- Shift invariance: shifting the input shifts the output by the same amount

- A filter is a linear system if superposition holds
- Linear shift-invariant (LSI) systems are both linear and shift invariant
- Humans are (practically) scale and translation invariant (cool!)
- Impulse function: one at the origin, zero everywhere else
- \(\delta_2\) is a 2-D impulse function
- \(h[m, n]\) is the output at \(m, n\) of the impulse function after a filter
- Any image can be written as a summation of shifted, scaled delta functions!

- Property of LSI systems:
- For an input impulse function into an LSI system, the LSI system is completely specified by the output (the impulse response)
- We can use that output instead of the LSI system (this allows convolutions!)
- This means for \(f[n, m] \overset{S}{\rightarrow} g[n, m]\): \(f[n, m] = \sum_{k=-\infty}^\infty \sum_{l=-\infty}^\infty f[k, l] \cdot \delta_2 [n-k, m-l]\)
- Finally, \(g[n, m] = \sum_{k=-\infty}^\infty \sum_{l=-\infty}^\infty f[k, l] \cdot S\{\delta_2 [n-k, m-l]\}\)
- This is the convolution operation!

- \(f[n,m]*h[n,m] = \sum_{k=-\infty}^\infty \sum_{l=-\infty}^\infty f[k,l] \cdot h[n-k, m-l]\)
- \(*\) typically represents the convolution operation

- Convolutions:
- We will typically “flip” or “fold” the kernel about the origin, so that it works with our equation

- We can have complicated transformations (e.g. sharpening) just via multiple kernels!
- For finite images on computers (non-infinite domain), we typically pad the edges
- Can pad with zeros, mean, mirror, extension, etc
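As a concrete sketch of the definition above (zero padding, kernel flipped about the origin), assuming NumPy; `conv2d` is an illustrative name:

```python
import numpy as np

def conv2d(f, h):
    """2-D convolution with zero padding; the kernel is flipped per the definition."""
    kh, kw = h.shape
    ph, pw = kh // 2, kw // 2
    fp = np.pad(f, ((ph, ph), (pw, pw)))      # zero padding at the edges
    hf = h[::-1, ::-1]                        # flip the kernel about the origin
    out = np.zeros(f.shape, dtype=float)
    for i in range(f.shape[0]):
        for j in range(f.shape[1]):
            out[i, j] = np.sum(fp[i:i + kh, j:j + kw] * hf)
    return out
```

Convolving with the 2-D impulse \(\delta_2\) returns the image unchanged, which is a handy sanity check.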

- Cross-correlation:
- Two-star symbol: \(**\)
- Convolution without the flip
- Similar to the search operation
- This is what deep learning libraries typically implement (and call “convolution”)

- Systems can be represented as a sum of impulse responses
- LSI systems are completely specified by their impulse responses
- We can create new systems by combining filters
- Edges are essential in the way we visually understand
- Some edges are more important for specific image types

- Very cool example: images were classified based on brain scans while people were looking at them. Little difference between showing a human an RGB photo vs. black-and-white edges.
- Edges also allow much more efficient processing
- Edge detection:
- Identify sudden changes in an image
- We often want to find physical edges, not necessarily color discontinuities
- We can calculate numeric derivatives in images and use them to find edges
- Filters can perform differentiation

- Derivatives: backward, forward, and central derivatives
- Discrete derivation in 2D:
- Output two gradients from both dimensions and combine them
- Edge strength comes from: \(\sqrt{f_n^2 + f_m^2}\)
- We can get the direction via: \(\theta = \tan^{-1}(\frac{f_m}{f_n})\)
- Gradient spikes will be edges! (assuming the image isn’t noisy)
- To combat noise, we can add a blur kernel to the image
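The central-difference derivatives, edge strength, and direction above can be sketched as (a minimal NumPy version; `image_gradients` is an illustrative name):

```python
import numpy as np

def image_gradients(img):
    """Central-difference derivatives, edge strength, and edge direction."""
    f = img.astype(float)
    fn = np.zeros_like(f)
    fm = np.zeros_like(f)
    fn[:, 1:-1] = (f[:, 2:] - f[:, :-2]) / 2.0   # horizontal central difference
    fm[1:-1, :] = (f[2:, :] - f[:-2, :]) / 2.0   # vertical central difference
    strength = np.sqrt(fn**2 + fm**2)            # edge strength
    theta = np.arctan2(fm, fn)                   # edge direction
    return strength, theta
```

On a vertical step edge the strength spikes at the columns adjacent to the step and is zero in flat regions.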

- Edge detection example:
- Apply a Gaussian smoothing kernel
- Apply a derivative kernel afterwards
- This can be done in one step! (derivative of the Gaussian kernel, DoG)

- Sobel Filters:
- Gaussian blur combined with central derivative filter
- Has poor localization
- Requires you to threshold the values - a messy problem

- Canny edge detector:
- From 1986!
- Most common edge detector today
- Theoretically optimal when pixels have Gaussian noise
- Steps:
- Suppress noise
- Compute gradient magnitude and direction
- Apply non-maximum suppression
- Hysteresis

- Non-maximum suppression
- Zero out non-maximum edge values
- For any point, compare it to its neighbors. If it is smaller, zero it out
- Get the two neighbors via the gradient angle
- As pixels are quantized, we can use a weighted average of the eight neighboring pixels

- Hysteresis:
- Two thresholds, low and high
- Pixels above the low threshold become an edge only if they connect to a strong (above-high) edge
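Hysteresis can be sketched as a flood fill from strong pixels through weak ones (a minimal NumPy/BFS sketch; the function name is illustrative):

```python
import numpy as np
from collections import deque

def hysteresis(mag, low, high):
    """Keep weak edges (>= low) only if 8-connected to a strong edge (>= high)."""
    strong = mag >= high
    weak = mag >= low
    keep = strong.copy()
    q = deque(zip(*np.nonzero(strong)))      # start BFS from all strong pixels
    H, W = mag.shape
    while q:
        i, j = q.popleft()
        for di in (-1, 0, 1):
            for dj in (-1, 0, 1):
                ni, nj = i + di, j + dj
                if 0 <= ni < H and 0 <= nj < W and weak[ni, nj] and not keep[ni, nj]:
                    keep[ni, nj] = True      # weak pixel connected to a strong chain
                    q.append((ni, nj))
    return keep
```

An isolated weak pixel is dropped; a weak pixel touching a strong one survives.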

- Hough transform (“Huff”):
- 1962
- Transform edges into lines
- A single image point maps to the set of lines through it: a line in \(a,b\) (slope-intercept) space
- Do this for all points, intersection (or close intersections) represent lines going through the same points
- Divide the \(a,b\) space into cells, count the number of intersections, draw those lines
- Can use one of many parameterizations, e.g. polar \((\rho, \theta)\), which avoids infinite slopes
- Could be extended to multiple dimensions, for curves in images
- Computationally complex
- \(O(n^2)\) in the number of edge points
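A minimal Hough accumulator in the polar \((\rho, \theta)\) parameterization might look like this (NumPy sketch; the bin counts are arbitrary choices):

```python
import numpy as np

def hough_lines(edge_points, shape, n_theta=180, n_rho=200):
    """Vote in (rho, theta) space; peaks in the accumulator are lines."""
    H, W = shape
    diag = np.hypot(H, W)                        # max possible |rho|
    thetas = np.linspace(-np.pi / 2, np.pi / 2, n_theta, endpoint=False)
    rhos = np.linspace(-diag, diag, n_rho)
    acc = np.zeros((n_rho, n_theta), dtype=int)
    for y, x in edge_points:
        r = x * np.cos(thetas) + y * np.sin(thetas)   # rho for every theta
        idx = np.round((r + diag) / (2 * diag) * (n_rho - 1)).astype(int)
        acc[idx, np.arange(n_theta)] += 1             # one vote per theta cell
    return acc, rhos, thetas
```

Ten collinear points produce an accumulator cell with ten votes, which is the peak you would threshold and draw.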

- RANSAC:
- Try to remove the votes of noisy edges, without \(O(n^2)\) iteration
- Struggles when the outlier fraction is very high (needs many iterations)
- Steps:
- Pick a few points at random, draw a line between them
- Find the “inlier” points within a threshold of the line
- If the number of inliers is greater than the current best, compute the line of best fit *from the inliers*
- Repeat until the number of inliers doesn’t change
- Repeat everything \(k\) times

- Hyperparameters:
- Number of points in the seed set
- How many times to repeat? (\(k\))
- Threshold
- Min number of inliers to claim a line exists
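The steps above can be sketched for line fitting (a simplified version: one least-squares refit at the end rather than repeated inner refits; names and defaults are illustrative):

```python
import numpy as np

def ransac_line(points, k=100, threshold=0.1, seed=0):
    """Fit a line y = a*x + b robustly by random sampling and inlier counting."""
    rng = np.random.default_rng(seed)
    pts = np.asarray(points, dtype=float)
    best_inliers = np.zeros(len(pts), dtype=bool)
    for _ in range(k):
        i, j = rng.choice(len(pts), size=2, replace=False)   # random seed pair
        (x1, y1), (x2, y2) = pts[i], pts[j]
        if np.isclose(x1, x2):
            continue                                         # skip vertical pairs
        a = (y2 - y1) / (x2 - x1)
        b = y1 - a * x1
        resid = np.abs(pts[:, 1] - (a * pts[:, 0] + b))
        inliers = resid < threshold                          # points near the line
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # final least-squares refit *from the inliers*
    a, b = np.polyfit(pts[best_inliers, 0], pts[best_inliers, 1], 1)
    return a, b, best_inliers
```

With ten points on \(y = 2x + 1\) and two far outliers, the outliers never enter the best inlier set, so the refit recovers the true line.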

- Simple cross-correlation doesn’t work for some matching problems:
- Occlusion, lighting shifts, intra-category variation, articulation, etc.

- Matching images, a recipe:
- Take patches and match them to an image
- General idea: find key points and the regions surrounding them. Describe and normalize them, then compute descriptors.

- Harris Corner Detector:
- Idea: use products of the \(x\) and \(y\) gradients (the second-moment matrix)
- “Find patches resulting in large pixel-value changes in *any* direction”
- We create an energy function: \(E(u,v) = \sum_{x,y} w(x,y) [I(x+u,y+v)-I(x,y)]^2\)
- Window function might be Gaussian or discrete
- We can use the Taylor expansion, see slides for formula
- This represents an ellipse
- We use the eigenvalues to characterize the ellipse’s axes/rotation, but approximate the eigenvalues (via determinant and trace)

- Algo:
- Compute image derivatives
- Square of derivatives
- Gaussian filter
- Cornerness function (ensure two large eigenvalues)
- Non-maximum suppression (only one corner per location)
- Threshold

- Properties:
- Translation invariant
- Rotation invariant
- *Not* scale invariant

- Scale-Invariant detection:
- Depending on scale of images, we need differing patch sizes
- For some metric function, our optimal scale is at the peak of the function
- We want one peak on the function!

- Laplacian: (second derivative of a Gaussian)
- Convolving a Laplacian with an image, edges appear as zero crossings of the response
- Laplacians give a peak at the optimal “blob” detection size

- SIFT and Harris-Laplacian are the two standards
- Harris is more accurate, but SIFT is faster (i.e. realtime)

- Local descriptors:
- Convert key points into local descriptors, represented as vectors
- How to get rotation-invariant descriptors?
- Calculate all the gradients in each patch, ensure the total gradient directions are axis-aligned

- SIFT:
- Idea:
- Convolve a possible “blob” by a Laplacian
- Multiply that by a Gaussian (with some \(\sigma\))
- A good \(\sigma\) will result in a pronounced (extremal) peak
- For any sized blob, there’s a good \(\sigma\) (the characteristic scale)

- \(\nabla^2 n_\sigma\) is the Laplacian of Gaussian (LoG)
- Similarly, \(\sigma^2 \nabla^2 n_\sigma\) is Normalized LoG
- Try different values of \(\sigma\) to find the best scale
- How to get sigmas: \(\sigma_k = \sigma_0 s^k\) for \(k=0,1,\dots\)

- \((x^*, y^*, \sigma^*) = \arg\max_{(x,y,\sigma)} \|\sigma^2 \nabla^2 n_\sigma * I(x,y)\|\)
- LoG is very similar to the difference of Gaussians (DoG), so we can use either
- Equivalent to blurring the image with several Gaussians, then subtracting adjacent \(\sigma\) levels
- Then we take max over local \(3 \times 3 \times 3\)

- Algo:
- Blur keypoint image patch
- Calculate image gradients over patch
- Rotate all gradients by the keypoint angle
- Generate descriptor

- Descriptor generation:
- Divide the blob into a \(4 \times 4\) array of cells
- Compute a gradient orientation histogram (8 bins) in each cell
- The output is a \(4 \times 4 \times 8 = 128\)-length vector


- Make SIFT robust:
- Threshold gradients (often 0.2)
- Normalize each histogram

- HoG: histogram of oriented gradients
- Detect shapes of objects, we can use that to classify!
- Normalization is the only difference from SIFT:
- We normalize local cells by larger local block gradients

- HoG has a lot of false positives!
- SIFT is used for point matching, HoG for image description

- Resizing:
- We need to keep the important parts of an image centered!
- Saliency: what’s the most important parts of an image?
- Framework:
- Create saliency map over image
- Resize image somehow according to map

- Methods:
- Cropping:
- Doesn’t work well when there are multiple targets

- Seam Carving:
- Every pixel’s energy is equal to the sum of absolute gradients
- Then remove the pixels with the lowest energy
- Seam: connected path of pixels from top to bottom
- We need to use dynamic programming to find the seam
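The dynamic program can be sketched as follows: accumulate the minimum seam energy row by row, then backtrack from the cheapest bottom cell (NumPy sketch, vertical seams only):

```python
import numpy as np

def find_vertical_seam(energy):
    """DP: cumulative minimum energy top-to-bottom, then backtrack."""
    E = energy.astype(float)
    H, W = E.shape
    M = E.copy()                         # M[i, j] = min energy of a seam ending at (i, j)
    for i in range(1, H):
        left = np.r_[np.inf, M[i - 1, :-1]]
        up = M[i - 1]
        right = np.r_[M[i - 1, 1:], np.inf]
        M[i] += np.minimum(np.minimum(left, up), right)
    seam = np.zeros(H, dtype=int)
    seam[-1] = np.argmin(M[-1])          # cheapest endpoint on the bottom row
    for i in range(H - 2, -1, -1):       # backtrack through connected neighbors
        j = seam[i + 1]
        lo, hi = max(j - 1, 0), min(j + 2, W)
        seam[i] = lo + np.argmin(M[i, lo:hi])
    return seam
```

On an energy map with one zero-cost column, the recovered seam runs straight down that column.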


- Image expansion: duplicate the \(k\) lowest-energy seams with seam carving
- Possible content enhancement: use seam carving to remove, then add
- User defined energy: allow user to modify energy function for results
- Problem: seam carving can increase energy and add edges
- Instead, remove the seam which *inserts the least energy*

- Video processing:
- Very similar, but add energy between frames
- Seam carving is also used for sports!

- Image segmentation: find pixels which go together
- Grouping:
- Superpixels: little groups of pixels
- There’s an ideal level of segmentation

- Gestalt theory:
- Whole groups of objects matter vs. individual elements
- Proximity matters a lot for grouping

- Think of images and pixels as a graph
- Some distance function defines the edge weights

- Possible pixel features:
- RGB
- Location
- RGB + location…

- Agglomerative clustering:
- Algo:
- Start with everything as its own cluster
- Merge the most similar (closest) pair into one cluster
- Repeat

- Distance measures:
- Single linkage: connect based on distance of closest pixels, long skinny clusters
- Complete linkage: connect based on distance of farthest points, compact clusters
- Average link: mean distance, good against outliers
- Inner-outer linkage: merge based on distance between inliers vs outliers

- The number of clusters is not needed in advance
- Terrible runtime: \(O(n^3)\)

- K-means clustering:
- Choose cluster centers to minimize the sum of squared distances
- Exact minimization is computationally infeasible, so we use an iterative algo

- Iteratively:
- Assign each point to its closest cluster center
- Recalculate each center as the mean of its assigned points
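The iterative (Lloyd's) algorithm can be sketched as (NumPy; initializing centers by sampling data points is one simple choice):

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Lloyd's algorithm: alternate assignment and mean updates."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # init from data points
    for _ in range(iters):
        # assignment step: each point to its nearest center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # update step: each center becomes the mean of its assigned points
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break                                            # converged
        centers = new
    return labels, centers
```

Two well-separated blobs end up in two distinct clusters after a few iterations.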


- Mean-shift clustering:
- Algo:
- Initialize window
- Find mean of points within window
- Shift window to be centered at mean
- Repeat
- A cluster is all points whose windows converge to the same mode

- Sometimes two radii are used, smaller for membership, larger for center
- General and model-free
- Variable number of modes
- Robust to outliers
- Computationally expensive

- Single-view:
- Large ambiguity
- Lasers are often required

- Stereo vision:
- Using two views to get depth and understanding

- How to understand images:
- Recognizing familiar objects and their relative sizes
- Shading from light sources
- Perspective effects

- Geometry:
- We use homogeneous coordinates (append a 1 to the coordinates)
- To convert back to heterogeneous, divide the first two values by the third
- Any point can now be represented by an infinite number of vectors, \(x \sim c \cdot x\)

- Transformations:
- Scaling: diagonal matrix with scaling values
- Rotation: matrix of \(\sin, \cos\) about the origin
- 2D translation: use homogeneous coordinates to “add”
- Similarity: rotation, translation, and uniform scaling
- Affine:
- Lines remain lines
- Parallel lines remain parallel

- Projective transformation: all values in the matrix can change
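A minimal NumPy sketch of these transforms as \(3 \times 3\) homogeneous matrices (function names are illustrative):

```python
import numpy as np

def translation(tx, ty):
    return np.array([[1, 0, tx], [0, 1, ty], [0, 0, 1]], dtype=float)

def rotation(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]], dtype=float)

def scaling(sx, sy):
    return np.diag([sx, sy, 1.0])

def apply(T, p):
    """Apply a 3x3 homogeneous transform to a 2-D point."""
    ph = T @ np.array([p[0], p[1], 1.0])   # append a 1 (homogeneous coordinates)
    return ph[:2] / ph[2]                  # divide by the third coordinate
```

Note that the product `translation(1, 0) @ rotation(np.pi / 2)` rotates first, then translates: matrices apply right to left.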

- Transformation composition: transformations are applied *right to left*
- Pinhole cameras:
- Images are inverted when hitting the sensor

- We often use ideas from pinhole cameras to model more complex ideas
- (Pinhole) transformations:
- We represent a point on the image plane (camera world) as \(x\)
- The real-world position is \(X\)
- We try to find some transformation \(P\) s.t. \(x = PX\)
- This will be equivalent up to some scale factor \(\lambda\), due to the coordinates being homogeneous
- Focal length: image plane distance from the center of camera

- The transformation is easy when the camera is at the origin
- More complicated when the camera is *not* (see slides)
- We assume \(z=1\), only translate the \(x,y\)
- Camera-to-image transformation:
- Simple version:
- \(x = \frac{fX}{Z}\)
- \(y = \frac{fY}{Z}\)

- Complex version: \(x^I \sim K[I|0]X^C\)
- \(K = \begin{bmatrix} f & 0 & p_x \\ 0 & f & p_y \\ 0 & 0 & 1 \end{bmatrix}\)
- Assumes \(Z=1\)

- Camera-to-world transformation:
- Subtract camera position from world position: \(X^W - C^W\)
- Account for rotation: \(X^C = R(X^W - C^W)\)
- We have some changes due to using homogeneous coordinates

- World-to-camera transformation:
- \(\begin{bmatrix} R & -RC\\ 0 & 1 \end{bmatrix}\)
- \(R\) is rotation matrix

- All camera transformations can be composed of three matrices:
- Intrinsic parameters (camera internals, img->img transformation): \(K\)
- Different for all cameras

- Perspective projection (3D->2D transformation): \([I|0]\)
- Can be combined with extrinsic parameters

- Extrinsic parameters: \(\begin{bmatrix} R & -RC\\ 0 & 1 \end{bmatrix}\)
- Independent of the type of the camera

- Pinhole camera matrix: \(P = K[R|t]\) where \(t = -RC\)
- To use, find \(f, p_x, p_y\) and rotation/translation variables

- For many cameras, the focal lengths along \(x\) and \(y\) aren’t the same. Use \(\alpha_x, \alpha_y\) to represent them
- 2D images don’t preserve length or angles
- Straight lines are preserved, parallel lines are only preserved when planar
- Vanishing point/lines:
- All parallel lines intersect at a vanishing point in a 2D image
- Vanishing line connects the vanishing points
- Allows us to understand 3D space

- (Linear) camera calibration:
- Take a photo of a known object
- Find a projection \(P\) such that image can be mapped to real-world points
- Each point correspondence gives two equations: \(p_1^T X - x' p_3^T X = 0\), \(p_2^T X - y' p_3^T X = 0\)
- This can be written in matrix form, used to solve as a system of linear equations
- \(\begin{bmatrix}X^T & 0 & -x'X^T \\ 0 & X^T & -y'X^T \end{bmatrix} \begin{bmatrix}p_1 \\ p_2 \\ p_3 \end{bmatrix} = 0\)
- This can be extended by stacking multiple of the above matrices

- \(P\) has 12 entries (11 degrees of freedom up to scale), so *at least* 6 points are necessary to calibrate
- 30+ points are often used due to noise!
- Also, the points should not be planar!

- We need to find the nullspace of the above matrix!
- Solve via linear least squares: \(\hat{x} = \arg\min_x \|Ax\|^2\) s.t. \(\|x\|^2 = 1\)
- SVD can be used to solve!
- \(x\) is the column of \(V\) corresponding to the smallest singular value!
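The SVD solution to \(\min \|Ax\|\) s.t. \(\|x\|=1\) is essentially one line in NumPy (a sketch; the function name is illustrative):

```python
import numpy as np

def nullspace_solution(A):
    """argmin ||Ax|| s.t. ||x|| = 1: the right singular vector of the smallest sigma."""
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1]   # last row of V^T = column of V for the smallest singular value
```

For a matrix whose nullspace is spanned by \(e_3\), this returns \(\pm e_3\) (the sign is arbitrary, as expected for a homogeneous solution).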

- Once we have \(P\), we can chop off the translation at the end
- Use Cholesky decomposition to get \(K\) after squaring it to remove \(R\)
- Find scaling factor via simple formula \(\lambda |K^{-1}\hat{P}|^\frac{-1}{3}\)
- Finally, find \(C\) via inverse
- Often not used in practice, favoring non-linear methods

- SVD recap:
- \(U \Sigma V^T = A\)
- \(U\):
- Rotation matrix
- Orthonormal

- \(\Sigma\):
- Diagonal, descending order of magnitude

- \(V^T\)
- Rotation matrix
- Orthonormal

- Can also estimate depth using two calibrated cameras: essentially the inverse of calibration
- Simple classification pipeline:
- Extract features from images
- Combine features with labels
- Train to match labels and features
- Use the trained classifier

- See slides for table of features and their properties
- \(K\)-nearest neighbor:
- Assign label to a point via the \(k\) nearest neighbors
- Classifies feature space into labels
- Use a test set to represent new “real-life” examples
- We need to use cross-validation to verify KNN hyperparams

- Dimensionality reduction:
- Use SVD
- We can use only some of the principal components as features

- PCA: how to work with sparse data!
- “Find projection that maximizes variance”
- Algo:
- Find the sample mean \(\mu\) of the dataset \(X\)
- Duplicate \(\mu\) across each column, subtract it from \(X\) to get \(X_c\)
- Get the sample covariance matrix: \(C = \frac{1}{n}X_c X_c^T\)
- Do the SVD decomposition: \(X_c = U \Sigma V^T\) where \(U^TU=I\), \(V^TV=I\)

- \(C = \frac{1}{n} U \Sigma^2 U^T\) from the SVD
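A sketch of the algorithm via SVD (note: this uses the rows-as-samples convention, whereas the notes above put samples in columns; names are illustrative):

```python
import numpy as np

def pca(X, n_components):
    """PCA via SVD of the centered data; rows of X are samples."""
    mu = X.mean(axis=0)
    Xc = X - mu                                   # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]                # principal directions
    projected = Xc @ components.T                 # low-dimensional representation
    return projected, components, mu
```

For data lying exactly on a line, one component reconstructs the data perfectly: `projected @ components + mu` recovers `X`.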

- Covariance is how values change with each other (correlation between axes)
- Covariance matrix is symmetric with variance on the diagonal
- PCA can make sparse, high-dim data dense
- Eigenfaces (re)construct faces using a low-dimensional manifold
- PCA heavily requires spatial similarity between examples
- Linear Discriminant Analysis (LDA):
- Much better for classification
- Account for variation between *classes*, not just reconstruction
- We get a covariance matrix between classes
- Therefore we want to minimize within-class variance, maximize between-class variance
- See slides for equations

- Visual bag of words (BoW):
- Represent images as histograms of feature representations
- Common features are small patches of interest points (RANSAC, SIFT)
- Algo:
- Use SIFT to find features
- Find the nearest visual word (cluster center) for each feature
- Each match increments that word’s histogram bin by one

- Pyramids: do multiple levels of image splitting, find histogram for each level
- Object detection: detect objects, often finding a bounding box for each
- Object detection metric: IoU, area of overlap divided by area of union
- IoU > 0.5 is typically used as the threshold
- Precision: \(\frac{TP}{TP + FP}\)
- Recall: \(\frac{TP}{TP + FN}\)
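IoU for axis-aligned boxes is a few lines (plain Python; the `(x1, y1, x2, y2)` box format is an assumption):

```python
def iou(box_a, box_b):
    """Boxes as (x1, y1, x2, y2); IoU = overlap area / union area."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # intersection rectangle (empty if the boxes don't overlap)
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    return inter / (area_a + area_b - inter)
```

Identical boxes give 1.0, disjoint boxes give 0.0, and a half-shifted box gives 1/3 — below the usual 0.5 detection threshold.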

- We commonly need to make a tradeoff between precision and recall
- Dalal-Triggs method: slide a window over the image, take the box with maximal activation
- Deformable parts method:
- Represent each object as a collection of parts
- Main box is the root, global, filter
- There are allowable locations for each part (filter)
- Quite human-labor intensive
- Usage:
- Run sliding windows across all pyramids for all parts
- Get part scores for possible person locations
- Weight global scores by the allowable locations for parts

- *Optical flow:* apparent motion of brightness patterns in the image
- Light focused, not object focused
- We can find \(u(x,y), v(x,y)\), horizontal and vertical movement for a pixel between two images
- Assumptions:
- Small motions for points
- Spatial coherence: movement is similar locally
- Brightness constancy: points look similar in each frame

- Brightness constancy equation: \(I(x,y,t-1) = I(x+u(x,y),y+v(x,y),t)\)
- Linear Taylor expansion: \(I(x+u,y+v,t) \approx I(x,y,t-1) + I_x \cdot u(x,y) + I_y \cdot v(x,y) + I_t\)

- Goal: solve \(\nabla I \cdot \begin{bmatrix}u & v \end{bmatrix}^T + I_t = 0\)
- Failure modes:
- Televisions: light moves, but set is stationary
- Motion with little change in pixels: (e.g. rotating sphere)
- Lighting changes

- Aperture problem: we can only measure movement in the direction perpendicular to the edge
- Gradient-in-time filter: 2D, -1 for the first image, +1 for the second image
- Lucas-Kanade method:
- Assume all pixels in a 5x5 grid move in the same direction (spatial coherence)
- In other words, all of the pixels have the same \(u, v\)
- Can solve using linear least squares (or other interesting methods!)
- Solve: \(A^TA \begin{bmatrix}u \\ v \end{bmatrix} = - A^Tb\)
- Solvable when the two eigenvalues of \(A^TA\) are large and similar
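A single-patch Lucas-Kanade solve, following \(A^TA \begin{bmatrix}u \\ v\end{bmatrix} = -A^Tb\) (NumPy sketch assuming the whole patch shares one \((u, v)\); the function name is illustrative):

```python
import numpy as np

def lucas_kanade_patch(I1, I2):
    """Solve A^T A [u, v]^T = -A^T b for one patch (all pixels share u, v)."""
    Ix = np.gradient(I1.astype(float), axis=1)   # spatial gradients
    Iy = np.gradient(I1.astype(float), axis=0)
    It = I2.astype(float) - I1.astype(float)     # temporal gradient
    A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)
    b = It.ravel()
    uv = np.linalg.solve(A.T @ A, -A.T @ b)      # least-squares flow estimate
    return uv   # estimated (u, v) for the patch
```

On a synthetic patch shifted by one pixel in \(x\), the solve recovers \((u, v) \approx (1, 0)\); note it would fail (singular \(A^TA\)) on a patch with constant gradients, which is the aperture problem in matrix form.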

- Iterative Lucas-Kanade:
- Estimate direction of each pixel via Lucas-Kanade
- Warp image via estimated direction
- Repeat until convergence

- Spatial coherence enforcement: decrease resolution in pyramids until the assumptions hold
- Iteratively run Lucas-Kanade starting from lowest resolution until highest

- Horn-Schunck method: flow is a global energy function which should be minimized
- Both the Lucas-Kanade term and a smoothness term (movement should be consistent)
- \(E = \int \int [(I_x u + I_y v + I_t)^2 + \alpha^2 (\|\nabla u\|^2 + \|\nabla v\|^2)] \,dx\, dy\)
- \(u\) and \(v\) should be some small deviation from the
mean
- Because they are recursively defined, iteratively find them
- Use Lucas-Kanade to start, then repeat

- Segmentation:
- Groups have a “common fate” based on motion
- By this logic, groups have similar \(u, v\)
- We can thus use K-means or clustering to do segmentation based on \(u,v\)