CSE 422 (Modern Algos)
James R. Lee, Winter 2024
Lecture 1 - Jan 4
- 9 topics + a bonus topic, one per week
- One mini-project per week, always due on Wed
- “Power of two choices”
- Related problem:
- Put n balls into n bins, each ball into a bin chosen i.i.d. uniformly at random
- “What is the expected number of balls in the most-full bin?”
- It grows as Θ(log n / log log n)
- Two-choice strategy:
- Choose two bins uniformly at random
- Throw the ball into the least loaded of the two!
- Max load now grows as Θ(log log n)
- d choices: max load grows as Θ(log log n / log d)
- The same exponential “boost” doesn’t happen again beyond d = 2; going from 2 to d choices only improves things by a constant factor (see the simulation sketch below)
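- A minimal simulation sketch of the comparison, in Python (the value of n, the trial count, and the function name are my own choices):
import random

def max_load(n, choices, trials=5):
    # Throw n balls into n bins; each ball picks `choices` random bins
    # and goes into the least loaded one. Return the average max load.
    total = 0
    for _ in range(trials):
        bins = [0] * n
        for _ in range(n):
            picks = [random.randrange(n) for _ in range(choices)]
            best = min(picks, key=lambda b: bins[b])
            bins[best] += 1
        total += max(bins)
    return total / trials

# One choice gives a noticeably larger max load (~ log n / log log n)
# than the best of two choices (~ log log n).
print(max_load(100_000, choices=1), max_load(100_000, choices=2))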
Lecture 2 - Jan 9
- Basic web caching:
- Akamai was one of the first companies to build a business around this (content delivery networks)
- Basic web use: use a URL to request data from a server
- Caching: use an intermediary cache to request the data first
- Hash functions: map arbitrary keys to (pseudo)random values/buckets
- Consistent hashing problem: given a cache distributed over n machines, how to quickly figure out which machine a string/key is associated with, while balancing load between the machines
- Motivated by web-caching: how can we cache data at URLs?
- The basic solution would be to use a typical hash function and modulo: server(x) = h(x) mod n
- However we need to agree on a fixed n, and in real, dynamic environments we cannot do so
- Changing n would force nearly every object to be remapped to a different server
- “Circular hashing solution”:
- (Assume the hash outputs are 32-bit for this problem)
- Hash each server, place them on a circle
- When caching data x, find its hash h(x), then go clockwise to the first hashed server; this is the cache server for x
- Expected load per server: m/n, where m is the number of objects and n the number of servers
- When we add another server, we only need to relocate about m/n objects in expectation
- How to do the “clockwise scan”?
- We want some data structure which, given a hash value z, finds the server whose hash is the first one clockwise from z (a successor query)
- Balanced binary search trees (e.g. red-black trees) can do this in O(log n) time
- If the number of servers is small, we can also use other, simpler data structures
- Variance reduction trick:
- Using the basic method, some servers can get overloaded
- Trick: compute k hashes for each server, placing k copies of it on the circle (filling out the circle)
- If k is around log n, there are theoretical bounds showing no server gets overloaded, with high probability
- This also helps when we have servers with differing memory capacities: a server with 2x the memory can get 2x the number of hash-keys!
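- A rough Python sketch of the ring with k virtual copies per server (the hash choice, server names, and k here are placeholders of mine, not from lecture):
import bisect, hashlib

def h(s):
    # 32-bit position on the circle (md5 truncated; any decent hash works)
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % (2**32)

class Ring:
    def __init__(self, servers, k=100):
        # k virtual copies per server reduces load variance
        self.points = sorted((h(f"{s}#{i}"), s) for s in servers for i in range(k))
        self.keys = [p for p, _ in self.points]

    def server_for(self, url):
        # clockwise scan = successor query over the sorted hash positions
        i = bisect.bisect_right(self.keys, h(url)) % len(self.keys)
        return self.points[i][1]

ring = Ring(["cache-a", "cache-b", "cache-c"])
print(ring.server_for("http://example.com/page"))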
Lecture 3 - Jan 11
- Majority element problem: given an array of length n that has a majority element (one appearing more than n/2 times), find that element!
- Simple linear-time algorithm: keep a candidate and a counter for the most-seen element, and increment/decrement the counter depending on what is seen (sketch below)
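- A runnable version of that counter idea, the Boyer-Moore majority vote (helper name is mine):
def majority(arr):
    # One pass: keep a candidate and a counter; matching elements increment,
    # non-matching decrement; a counter of 0 means adopt a new candidate.
    candidate, count = None, 0
    for x in arr:
        if count == 0:
            candidate, count = x, 1
        elif x == candidate:
            count += 1
        else:
            count -= 1
    return candidate  # correct whenever a true majority (> n/2) exists

print(majority([3, 1, 3, 3, 2, 3, 3]))  # -> 3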
- Heavy hitters problem: given an array A of large length n and a smaller parameter k, compute the values which appear at least n/k times.
- Many applications: given a large stream of data, compute the elements which appear a certain number of times
- If the array can fit in memory: trivial. Sort the array in O(n log n) time, then find which elements appear at least n/k times in a row
- There is no algorithm which solves Heavy Hitters in one pass with a sublinear amount of auxiliary space
- We cannot store all counts, therefore if the possible candidates change, we won’t know the counts of the items which could replace them
- ε-approximate heavy hitters problem:
- Input: same as HH, but with an additional parameter ε
- Output: a list where:
- All values in A which occur at least n/k times are in the list
- All values in the list occur at least n/k - εn times in A (provides margin of error)
- Auxiliary memory required grows as 1/ε
- We can’t have ε = 0, as the memory requirement would grow to infinity
- Count-min sketch: a data structure for ε-approximate heavy hitters
- Supports two operations:
- inc(x): increment the count of x
for all i = 1, 2, ..., L: arr[i][hash_i(x)] += 1
- O(L) time
- count(x): get the (approximate) count of x
return min_i arr[i][hash_i(x)]
- O(L) time
- Parameters:
- Number of buckets b (larger than k, much smaller than n)
- Number of hash functions L (small)
- Data structure: an L × b 2-D array of ints, init at 0
- The array counts can only overestimate the true counts (collisions only add to a bucket), so the error is one-sided; taking the min over the L rows gives the tightest estimate
- Given reasonable choices of parameters, the expected overcount in any single row is at most n/b, so b around 1/ε gives error on the order of εn; taking the min over L independent rows makes a much larger error very unlikely
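- A small runnable version of the pseudocode above (the values of b and L, and the salted md5 standing in for L independent hash functions, are my own choices):
import hashlib

class CountMinSketch:
    def __init__(self, b=1024, L=5):
        self.b, self.L = b, L
        self.arr = [[0] * b for _ in range(L)]

    def _hash(self, x, i):
        # i-th hash function, simulated by salting a single hash with i
        return int(hashlib.md5(f"{i}:{x}".encode()).hexdigest(), 16) % self.b

    def inc(self, x):
        for i in range(self.L):            # O(L) time
            self.arr[i][self._hash(x, i)] += 1

    def count(self, x):
        # min over rows; collisions only inflate counts, never shrink them
        return min(self.arr[i][self._hash(x, i)] for i in range(self.L))

cms = CountMinSketch()
for word in ["a", "b", "a", "c", "a"]:
    cms.inc(word)
print(cms.count("a"))  # >= 3, usually exactly 3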
- Role model: bloom filters
- Remembers which elements have been inserted
- Has false positives! Sometimes confirms elements which weren’t added
- No false negatives however
- 1 byte per element results in a false positive rate of less than 2%
- Represents accuracy-space tradeoff
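- A toy sketch of a bloom filter (the bit-array size and number of hashes are arbitrary picks, not tuned to the 2% figure):
import hashlib

class BloomFilter:
    def __init__(self, n_bits=8 * 10_000, n_hashes=5):
        self.n_bits, self.n_hashes = n_bits, n_hashes
        self.bits = bytearray(n_bits // 8 + 1)

    def _positions(self, x):
        for i in range(self.n_hashes):
            yield int(hashlib.md5(f"{i}:{x}".encode()).hexdigest(), 16) % self.n_bits

    def add(self, x):
        for p in self._positions(x):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, x):
        # May say True for something never added (false positive),
        # but never says False for something that was added.
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(x))

bf = BloomFilter()
bf.add("seattle")
print("seattle" in bf, "portland" in bf)  # True, almost surely False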
Lecture 4 - Jan 16
- Metric Space:
- X, a set of points in the space
- d, a distance function between two points: d : X × X → R
- Norms are a common example of a distance function
- Edit distance: between two strings, what’s the smallest number of
insertions/deletions to make them equal?
- Can either use an exact O(n^2) dynamic-programming algorithm or a fast approximate algorithm
- Jaccard Similarity: J(A, B) = |A ∩ B| / |A ∪ B|
- Higher means more similar
- Not a measure of distance
- Used for BoW (bag-of-words) models between documents to find similarity
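- For concreteness, Jaccard similarity on Python sets:
def jaccard(a, b):
    # |A ∩ B| / |A ∪ B|; 1 means identical sets, 0 means disjoint
    return len(a & b) / len(a | b)

print(jaccard({"the", "cat", "sat"}, {"the", "dog", "sat"}))  # 0.5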
- Nearest Neighbor search:
- Trivial time: O(n) per query (linear scan), exact algorithms
- Gold standard time: O(log n) per query, approximate algorithms
- c-NN: find a point within c times the actual min distance to the query point
- Classical, low-dimensional NN:
- Use space partitioning
- Make a decision tree on space, splitting up on median points
- Works with one dimension
- For more than one dimension, use K-D trees
- Split on the median point along one dimension, then repeat over the different dimensions
- No longer any guarantee of finding the nearest neighbor first
- Time: fast (roughly logarithmic per query) in low dimensions, but the work blows up exponentially as the dimension grows (see the scipy sketch below)
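- A quick check of the K-D tree idea, assuming scipy is available (in low dimensions queries are fast; in high dimensions they degrade toward a linear scan):
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
points = rng.normal(size=(10_000, 3))   # low-dimensional data
tree = cKDTree(points)

query = rng.normal(size=3)
dist, idx = tree.query(query, k=1)      # exact nearest neighbor
print(idx, dist)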
- Modern, high-dimensional NN: use locality-sensitive hashing
- Curse of Dimensionality: costs and space grow exponentially w.r.t.
dimension
Lecture 5 - Jan 18
- Volume of the Euclidean ball of radius r in d dimensions grows proportionally to r^d
- c^d disjoint balls of radius r/4 can fit inside a ball of radius r, for some constant c > 1
- Almost all of the volume of the ball is near the boundary: the fraction of volume within distance εr of the boundary is 1 - (1 - ε)^d, which goes to 1 as d grows
- This means that space partitioning is much worse in high dimensions
- Johnson-Lindenstrauss Lemma:
- Let A be a k × d random matrix of i.i.d. N(0, 1) random variables, normalized by 1/sqrt(k)
- For x in R^d, let f(x) = Ax
- For any n-point subset S of R^d and all vectors x, y in S, with high probability we have: (1 - ε)·||x - y|| ≤ ||f(x) - f(y)|| ≤ (1 + ε)·||x - y||, once k is on the order of log(n)/ε^2
- This means that a random matrix mapping d dimensions to about log(n)/ε^2 dimensions will with high probability preserve interpoint distances between the n points
- A matrix where all elements are positive or negative signs at random (scaled by 1/sqrt(k)) also has this property
- Idea is to have many one-dim maps which preserve distance in expectation, and use them in parallel
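- A numpy sketch of the random-projection map (the dimensions and ε below are arbitrary demo choices):
import numpy as np

rng = np.random.default_rng(0)
d, n, eps = 1000, 50, 0.2
k = int(8 * np.log(n) / eps**2)           # target dimension ~ log(n)/eps^2

A = rng.normal(size=(k, d)) / np.sqrt(k)  # entries N(0, 1), scaled by 1/sqrt(k)
X = rng.normal(size=(n, d))               # n points in R^d
Y = X @ A.T                               # f(x) = Ax for each point

# Compare one interpoint distance before and after projection
orig = np.linalg.norm(X[0] - X[1])
proj = np.linalg.norm(Y[0] - Y[1])
print(orig, proj, proj / orig)            # ratio close to 1 w.h.p.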
- MinHash:
- Let π be a random permutation of all elements (a random labeling of the universe)
- For a subset S, let h(S) = min over x in S of π(x)
- This basically returns the minimum of the random labels assigned
- The probability of two subsets colliding, h(A) = h(B), is equal to their Jaccard similarity!
- Independence of MinHash:
- Let h_1(S), ..., h_k(S) be the results of k independent MinHash functions
- Similarity estimate: the fraction of indices i on which h_i(A) = h_i(B)
- We have that the estimate’s expected value is J(A, B), and it concentrates around it
- Similar to JL Lemma
- As long as k is on the order of log(n)/ε^2 for a set of n users/subsets
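- A sketch of estimating Jaccard similarity with k independent MinHashes (the random permutations are simulated with seeded hashes; names and k are mine):
import hashlib

def minhash(s, seed):
    # simulate a random permutation by hashing each element with a seed
    return min(int(hashlib.md5(f"{seed}:{x}".encode()).hexdigest(), 16) for x in s)

def estimate_jaccard(a, b, k=200):
    # fraction of the k MinHashes on which the two sets collide
    return sum(minhash(a, i) == minhash(b, i) for i in range(k)) / k

A = {"the", "cat", "sat", "on", "mat"}
B = {"the", "dog", "sat", "on", "mat"}
print(estimate_jaccard(A, B))   # close to |A ∩ B| / |A ∪ B| = 4/6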
- NN search using MinHash:
- Preprocess all elements y by finding their MinHash h(y)
- When searching for x’s NN, compute h(x), find all elements y where h(y) = h(x), and search through them
- One-dimensional hash: dot product between the input and a random normal vector
- ⟨g, x⟩ with g ~ N(0, I) has mean 0 and variance ||x||^2
- Comes from the sum of independent Gaussian variables being another Gaussian
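- A tiny numpy check of that claim (x and the sample count are arbitrary):
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=20)
g = rng.normal(size=(100_000, 20))            # many independent Gaussian vectors
dots = g @ x                                  # each entry is <g_i, x>
print(dots.mean(), dots.var(), np.dot(x, x))  # mean ~ 0, variance ~ ||x||^2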
Lecture 6 - Jan 23
- Binary classification:
- Given a distribution D over (point, label) pairs, create a classifier function
- The classifier maps to labels 0 and 1 for this example
- Problem: given n vectors with ground-truth labels, create the mapping function
- Train error: error of the function on the training data
- Test error: error on held-out test data
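- As a concrete reading of “error” here, the empirical misclassification rate (the helper and toy classifier below are hypothetical, not from lecture):
import numpy as np

def error_rate(f, X, y):
    # fraction of points on which the classifier's prediction disagrees with the label
    return np.mean([f(x) != yi for x, yi in zip(X, y)])

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
f = lambda x: int(x[0] > 1.5)      # a toy threshold classifier
print(error_rate(f, X, y))         # train error would use training data, test error held-out data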
Lecture 7 - Jan 25
Missed
Lecture 8 - Jan 30
PCA
- How do we get feature vectors from data?
- Good to start with centering the data (de-meaning it)
- When our data is effectively lower-dimensional than its vector representation, we can use linear combinations of a few vectors to represent it
- PCA:
- Find features which “explain” the data the most
- Maximizes the spread (variance) of the points when projected onto the basis vectors
- Can use it to map DNA sequences of Europeans to locations
- Keep on adding PCA vectors until the representation doesn’t improve much
Lecture 9 - Feb 1
- Input to PCA is n d-dimensional data points x_1, ..., x_n, and a parameter k
- Output is k orthonormal vectors v_1, ..., v_k, the top k principal components
- This output maximizes the total squared projection length: the sum over points i and components j of ⟨x_i, v_j⟩^2
- We can express our problem via the SVD: X = U S V^T, where U and V are orthonormal matrices and S is diagonal
- Solutions are just diagonal matrices combined with rotations
- Top k principal components are the first k rows of V^T in X = U S V^T
- Those are the first k principal components of X
- PCA gets the eigenvectors of X^T X which have the largest eigenvalues
- SVD is cubic time, power iteration is quicker
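- A minimal numpy sketch of the above on synthetic data (np.linalg.svd stands in for the cubic-time SVD):
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))          # 200 synthetic points in R^10

Xc = X - X.mean(axis=0)                 # center (de-mean) the data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 2
components = Vt[:k]                     # top-k principal components = first k rows of V^T
scores = Xc @ components.T              # k-dimensional representation of each point
print(components.shape, scores.shape)   # (2, 10) (200, 2)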
- Power Iteration:
- Algo:
- Given matrix A = X^T X
- Select a random unit vector u_0
- For t = 1, 2, ..., set u_t = A u_{t-1} / ||A u_{t-1}||. If ||u_t - u_{t-1}|| is small, return u_t
- Finds the first eigenvector
- Stretches out the random vector in the direction of the eigenvector with the largest eigenvalue
- Can just repeatedly square A instead of multiplying by it over and over
- Number of iterations scales as: roughly log(d/ε) / log(λ_1/λ_2)
- λ_1/λ_2 is the spectral gap, the gap between how much the top two directions get stretched on the sphere
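- A Python sketch of power iteration for the top eigenvector of A = X^T X (the tolerance and iteration cap are my choices):
import numpy as np

def top_eigenvector(A, tol=1e-8, max_iter=1000):
    # Repeatedly apply A to a random unit vector and renormalize;
    # the iterate rotates toward the eigenvector with the largest eigenvalue.
    rng = np.random.default_rng(0)
    u = rng.normal(size=A.shape[0])
    u /= np.linalg.norm(u)
    for _ in range(max_iter):
        v = A @ u
        v /= np.linalg.norm(v)
        if np.linalg.norm(v - u) < tol:
            break
        u = v
    return v

X = np.random.default_rng(1).normal(size=(100, 5))
A = X.T @ X
print(top_eigenvector(A))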
- Continuing power iteration:
- Find the top component v_1 using power iteration
- Subtract the variance of the data explained by v_1 (project the data off of v_1)
- Recurse to find the components of the new data matrix
- Often best to just find many eigenvalues, plot them, then pick your k
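- Continuing the sketch above and reusing its top_eigenvector helper, one common deflation approach (whether it matches the lecture’s exact procedure is my assumption):
import numpy as np

def top_k_components(X, k):
    # Repeatedly: find the top component of the current data with power
    # iteration, record it, then subtract its contribution from the data.
    X = X - X.mean(axis=0)                 # work with centered data
    components = []
    for _ in range(k):
        v = top_eigenvector(X.T @ X)       # helper from the sketch above
        components.append(v)
        X = X - np.outer(X @ v, v)         # remove the variance explained by v
    return np.array(components)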