CSE 422 (Modern Algos)
James R. Lee, Winter 2024
Lecture 1 - Jan 4
- 9 topics + a bonus topic, one per week
- One mini-project per week, always due on Wed
- “Power of two choices”
- Related problem:
- Put \(n\) balls in \(n\) bins, i.i.d.
- “What is the expected number of balls in the most-full bin?”
- It grows as \(\sim \frac{\log n}{\log \log n}\)
- Two-choice strategy:
- Choose two bins \(i, j \in \{1, \dots, n\}\) uniformly at random
- Throw ball in the least loaded of the two!
- Grows as \(\sim \log \log n\)
- \(d\)-choices: \(\sim \frac{\log \log n}{\log d}\)
- The dramatic “boost” only happens going from \(1 \rightarrow 2\) choices; increasing \(d\) further only improves the bound by a \(\frac{1}{\log d}\) factor
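A quick simulation sketch of the two-choice effect (my own Python illustration, not course code; the choice of n and trial count is arbitrary):

    import random

    def max_load(n, d, trials=5):
        # average maximum bin load when each of n balls goes to the least loaded of d random bins
        total = 0
        for _ in range(trials):
            bins = [0] * n
            for _ in range(n):
                choices = [random.randrange(n) for _ in range(d)]
                best = min(choices, key=lambda i: bins[i])
                bins[best] += 1
            total += max(bins)
        return total / trials

    n = 100_000
    print("1 choice :", max_load(n, 1))   # grows like log n / log log n
    print("2 choices:", max_load(n, 2))   # grows like log log n, noticeably smaller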
Lecture 2 - Jan 9
- Basic web caching:
- Akamai was one of the first companies
- Basic web use: use a url to request data from a server
- Caching: use an intermediary cache to request data first
- Hash functions:
- Consistent hashing problem: given a cache distributed over \(N\) machines, how to quickly figure out
which machine a string \(x\) is
associated with, while balancing load between machines
- Motivated by web-caching: how can we cache data at URLs?
- The basic solution would be to use a typical hash function \(H(\cdot)\) and modulo: \(H(x) \mod N\)
- However, this requires everyone to agree on a fixed \(N\), and in real, dynamic environments (machines join and leave) we cannot do so
- Changing \(N\) forces roughly a \(1 - \frac{1}{N}\) fraction of the objects to move to a different machine
- “Circular hashing solution”:
- (Assume the hash outputs are 32-bit for this problem)
- Hash each server, place them on a circle with \(2^{32}\) positions
- When caching data \(x_0\), compute the hash \(H(x_0)\), then go clockwise to the first hashed server; that server is the cache for \(x_0\)
- Expected load per server: \(\frac{P}{N}\) where \(P\) is the number of objects
- When we add another server, we only need to relocate about a \(\frac{1}{N}\) fraction of the objects
- How to do the “clockwise scan”?
- We want a data structure which, given a hash value \(v\), can find the first server hash \(v'\) such that \(v' \geq v\)
- Balanced binary search trees (e.g. red-black trees) can do this in
\(O(\log N)\) time
- If \(N\) is small, we can also use
other data structures
- Variance reduction trick:
- Using basic method, some servers can get overloaded
- Trick: compute \(k\) hashes for
each server (filling out the circle)
- If \(k \simeq \log N\), there are
theoretical bounds that servers won’t be overloaded with high
probability
- This also helps when we have servers with differing memory capacities: a server with 2x the memory can get 2x the number of hash positions (see the sketch below)
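A minimal consistent-hashing sketch in Python (my own illustration; a sorted list with binary search stands in for the balanced BST, and the Ring class name and k=3 virtual positions are assumptions):

    import bisect
    import hashlib

    def h(key: str) -> int:
        # hash a string onto the 2**32 circle
        return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2 ** 32)

    class Ring:
        def __init__(self, servers, k=3):
            # k virtual positions per server: the variance-reduction trick
            self.points = sorted((h(f"{s}#{i}"), s) for s in servers for i in range(k))

        def server_for(self, key: str) -> str:
            v = h(key)
            # clockwise scan: first server hash >= v, wrapping around the circle
            idx = bisect.bisect_left(self.points, (v, "")) % len(self.points)
            return self.points[idx][1]

    ring = Ring(["cache-a", "cache-b", "cache-c"])
    print(ring.server_for("http://example.com/page"))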
Lecture 3 - Jan 11
- Majority element problem: given an array of length \(n\) that has a majority element (one appearing more than \(\frac{n}{2}\) times), find that element!
- Simple linear-time algorithm (Boyer-Moore voting, sketched below): keep a candidate and a counter; increment the counter when the candidate is seen again, decrement otherwise, and replace the candidate whenever the counter hits zero
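A sketch of that counter-based algorithm (my own Python code; it assumes, as the problem states, that a majority element exists):

    def majority(arr):
        candidate, count = None, 0
        for x in arr:
            if count == 0:
                candidate, count = x, 1
            elif x == candidate:
                count += 1
            else:
                count -= 1
        return candidate  # correct whenever a true majority (> n/2 occurrences) exists

    print(majority([3, 1, 3, 2, 3, 3, 2]))  # -> 3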
- Heavy hitters problem: given an array \(A\) of large length \(n\) and a smaller \(k\), compute the values which appear at
least \(\frac{n}{k}\) times.
- Many applications: given a large stream of data, compute elements
which appear a certain amount of times
- If \(A\) can fit in memory: trivial \(O(n \log n)\) solution, sort the array, then find which elements appear at least \(\frac{n}{k}\) times in a row
- There is no algorithm which solves Heavy Hitters in one pass with a sublinear amount of auxiliary space
- Intuition: we cannot store all counts, so when the set of possible candidates changes, we won’t know the counts of the items that could replace them
- \(\epsilon\)-approximate heavy
hitters problem:
- Input: same as HH, but with additional parameter \(0 < \epsilon < 1\)
- Output: a list where:
- All values in \(A\) which occur at
least \(\frac{n}{k}\) times are in the
list
- All values in the list occur at least \(\frac{n}{k} - \epsilon n\) times in \(A\) (provides margin of error)
- Auxiliary memory required grows as \(\frac{1}{\epsilon}\)
- We can’t have \(\epsilon = 0\) as
the memory requirement grows to infinity
- Count-min sketch: a data structure for \(\epsilon\)-approximate heavy hitters
- Supports two operations (sketch after this lecture’s notes):
- inc(x): increment the count of x
- for all i = 1, 2, ..., \(\ell\): arr[i][hash_i(x)] += 1
- \(O(\ell)\) time
- count(x): get the (approximate) count of x
- return min_i arr[i][hash_i(x)]
- \(O(\ell)\) time
- Parameters:
- Number of buckets \(b\) (larger
than \(\ell\), much smaller than \(n\))
- Number of hash functions \(\ell\)
(smaller)
- Data structure: \(\ell \times b\)
2-D array of ints, init at 0
- The array counts can only overestimate the true counts; the error is one-sided (hash collisions only add to a counter)
- Given reasonable choices of parameters, the probability of a large overestimate is about \(\left(\frac{1}{2}\right)^\ell\)
- Role model: bloom filters
- Remembers which elements have been inserted
- Has false positives! Sometimes confirms elements which weren’t
added
- No false negatives however
- 1 byte per element results in a false-positive rate of less than 2%
- Represents accuracy-space tradeoff
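A minimal count-min sketch in Python (my own sketch, not the course’s reference implementation; the bucket/row counts and the salted-md5 hashing are assumptions):

    import hashlib

    class CountMinSketch:
        def __init__(self, buckets=1024, rows=5):
            self.b, self.l = buckets, rows
            self.arr = [[0] * buckets for _ in range(rows)]

        def _hash(self, x, i):
            # row-salted hash standing in for l independent hash functions
            return int(hashlib.md5(f"{i}:{x}".encode()).hexdigest(), 16) % self.b

        def inc(self, x):
            for i in range(self.l):
                self.arr[i][self._hash(x, i)] += 1

        def count(self, x):
            # min over rows; can only overestimate the true count
            return min(self.arr[i][self._hash(x, i)] for i in range(self.l))

    cms = CountMinSketch()
    for token in ["a", "b", "a", "c", "a"]:
        cms.inc(token)
    print(cms.count("a"))  # >= 3, usually exactly 3 with these parameters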
Lecture 4 - Jan 16
- Metric Space:
- \(X\), a set of points in the
space
- \(d\), a distance function between
two points:
- \(d(x, y) = 0 \iff x = y\)
- \(d(x, y) = d(y, x)\)
- \(d(x, z) \leq d(x, y) + d(y, z)\) (triangle inequality)
- Norms give common examples of distance functions, e.g. \(d(x, y) = \|x - y\|\)
- Edit distance: between two strings, what’s the smallest number of
insertions/deletions to make them equal?
- Can either use an \(O(mn)\) exact algorithm or a fast approximate algorithm
- Jaccard Similarity:
- \(J(S, T) = \frac{|S \cap T|}{|S \cup T|}\)
- Higher means more similar
- Not a measure of distance
- Used with bag-of-words (BoW) representations of documents to measure similarity (small example below)
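A tiny illustration of Jaccard similarity on bag-of-words token sets (my own Python example):

    def jaccard(s, t):
        s, t = set(s), set(t)
        return len(s & t) / len(s | t)

    doc1 = "the quick brown fox".split()
    doc2 = "the lazy brown dog".split()
    print(jaccard(doc1, doc2))  # 2 shared tokens out of 6 total -> 0.333...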
- Nearest Neighbor search:
- Trivial time: \(O(n)\), exact algorithms
- Gold standard time: \(O(\log n)\),
approximate algorithms
- \(\epsilon\)-NN: return a point whose distance to the query is within \(1+\epsilon\) times the actual minimum distance
- Classical, low-dimensional NN:
- Use space partitioning
- Make a decision tree on space, splitting up on median points
- Works with one dimension
- For more than one dimension, use K-D trees (sketch after this lecture’s notes)
- Split on the median point along one dimension, then recurse, cycling through the dimensions
- No longer any guarantee of finding the nearest neighbor first
- Time: \(O(2^\text{dim} \log n)\)
- Modern, high-dimensional NN: use locality-sensitive hashing
- Curse of Dimensionality: costs and space grow exponentially w.r.t.
dimension
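A small low-dimensional NN example using a k-d tree from SciPy (assumed available; my own illustration, not course code):

    import numpy as np
    from scipy.spatial import KDTree

    rng = np.random.default_rng(0)
    points = rng.normal(size=(10_000, 3))  # n points in a low dimension, where k-d trees work well
    tree = KDTree(points)                  # median-split space partitioning

    query = rng.normal(size=3)
    dist, idx = tree.query(query)          # exact nearest neighbor
    print(idx, float(dist))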
Lecture 5 - Jan 18
- Volume of the Euclidean ball of radius \(r\) in \(k\) dimensions grows proportionally to
\(r^k\)
- \(c^k\) disjoint balls of radius
\(r\) can fit inside a ball of radius
\(2r\), for some \(c > 1\)
- Almost all of the volume of the ball is near the boundary
- This means that space partitioning is much worse in high
dimensions
- Johnson-Lindenstrauss Lemma:
- Let \(A\) be a \(d \times k\) random matrix of \(N(0, 1)\) random variables normalized by \(\sqrt{d}\)
- For some \(n \geq 1\) and \(\epsilon > 0\), let \(d = \lceil \frac{24 \ln n}{\epsilon^2} \rceil\)
- For any \(n\)-point subset \(X \subseteq \mathbb{R}^k\) and all vectors \(x, y \in X\), with high probability we have: \[(1 - \epsilon) \|x - y\|^2_2 \leq \|Ax - Ay\|^2_2 \leq (1 + \epsilon) \|x - y\|^2_2\]
- This means that a random matrix mapping \(k\) dimensions down to \(d = O(\log n)\) dimensions will with high probability preserve interpoint distances between \(n\) points (numerical sketch below)
- A matrix \(A\) whose entries are uniformly random \(\pm 1\) signs (suitably scaled) also has this property
- The idea is to have many one-dimensional maps which preserve distances in expectation, and to use them in parallel
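A quick numerical check of the lemma (my own Python/NumPy sketch; the particular n, k, and \(\epsilon\) are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    n, k, eps = 200, 5_000, 0.5
    d = int(np.ceil(24 * np.log(n) / eps**2))   # target dimension from the lemma

    X = rng.normal(size=(n, k))                 # n points in R^k
    A = rng.normal(size=(d, k)) / np.sqrt(d)    # random projection matrix
    Y = X @ A.T                                 # projected points in R^d

    # compare one pairwise distance before and after projection
    i, j = 3, 17
    orig = np.linalg.norm(X[i] - X[j]) ** 2
    proj = np.linalg.norm(Y[i] - Y[j]) ** 2
    print(d, proj / orig)                       # ratio should lie in [1 - eps, 1 + eps]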
- MinHash:
- Let \(\pi\) be a uniformly random permutation of the universe of elements \(U\)
- For some \(S \subseteq U\), let \(h_\pi(S) = \arg \min_{x \in S} \pi(x)\)
- This returns the element of \(S\) whose random label is smallest
- The probability of two subsets colliding is equal to their Jaccard similarity (see the sketch at the end of this lecture’s notes)! \[P[h_\pi(A) = h_\pi(B)] = \frac{|A \cap B|}{|A \cup B|} = J(A, B)\]
- Independence of MinHash:
- Let \(H(S)\) be the results of
\(\ell\) independent hash
functions
- Similarity estimate: \(J^H(A, B) = \frac{\text{number of coordinates where } H(A) \text{ and } H(B) \text{ agree}}{\ell}\)
- We have \((1 - \epsilon)J(A,B) \leq J^H(A,B) \leq (1+\epsilon)J(A,B)\)
- Similar to the JL Lemma
- It suffices to take \(\ell = O(\frac{\log n}{\epsilon^2})\) for a set \(X\) of \(n\) users/subsets
- NN search using MinHash:
- Preprocess all elements \(x\) by
finding their MinHash: \(H(x)\)
- When searching for \(y\)’s NN,
compute \(H(y)\), find all elements
\(z\) where \(H(y) = H(z)\), search through them
- One-dimensional hash: dot product between the input \(x\) and a random Gaussian vector \(r\)
- \(F_r(x) = \langle x, r \rangle\)
has mean \(0\) and variance \(\|x\|^2_2\)
- Follows because a sum of independent Gaussian random variables is again Gaussian
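A small MinHash demonstration (my own Python illustration; random permutations are simulated with shuffled labels, and the set sizes are arbitrary):

    import random

    def minhash_signature(s, labelings):
        # one coordinate per random labeling: the element of s with the smallest label
        return tuple(min(s, key=lab.get) for lab in labelings)

    universe = list(range(1000))
    ell = 200
    labelings = []
    for _ in range(ell):
        perm = universe[:]
        random.shuffle(perm)
        labelings.append({x: i for i, x in enumerate(perm)})

    A = set(range(0, 600))
    B = set(range(300, 900))
    HA, HB = minhash_signature(A, labelings), minhash_signature(B, labelings)
    estimate = sum(a == b for a, b in zip(HA, HB)) / ell
    print(estimate, len(A & B) / len(A | B))  # estimate should be close to the true J = 1/3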
Lecture 6 - Jan 23
- Binary classification:
- Given samples from a distribution, create a classifier function
- The classifier maps inputs to labels in \(\{0, 1\}\) for this example
- Problem: given \(n\) vectors with ground-truth labels, learn the mapping function
- Train error: error of the function on the training data
- Test error: error of the function on held-out test data
Lecture 7 - Jan 25
Missed
Lecture 8 - Jan 30
PCA
- How do we get feature vectors from data?
- Good to start by centering the data (subtracting the mean)
- When the data is intrinsically lower-dimensional than its vector representation, we can represent it with linear combinations of a few basis vectors
- PCA:
- Find features which “explain” the data the most
- Maximizes the spread (variance) of the points when projected onto the basis vectors
- Can use it to map DNA sequences of Europeans to locations
- Keep on adding PCA vectors until representation doesn’t improve
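A compact sketch of this pipeline (my own Python/NumPy illustration: center the data, then take the top-\(k\) components via SVD; the data and k are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 20))           # m = 500 data points in 20 dimensions
    X = X - X.mean(axis=0)                   # de-mean (center) the data first

    k = 2
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    components = Vt[:k]                      # top-k principal components (orthonormal rows)
    Z = X @ components.T                     # k-dimensional representation of each point
    print(components.shape, Z.shape)         # (2, 20) (500, 2)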
Lecture 9 - Feb 1
- Input to PCA is \(m\) \(n\)-dim data points, \(X\) and parameter \(k\)
- Output is \(k\) orthonormal vectors, the top \(k\) principal components
- This output maximizes: \(\frac{1}{m}\sum_{i=1}^m \sum_{j=1}^k \langle x_i, v_j \rangle^2\)
- For \(k = 1\) we can express the problem as \(\arg \max_{\|v\| = 1} v^T Av\), where \(v\) is a unit vector and \(A = X^T X\)
- Since \(A\) is symmetric, it is just a diagonal matrix up to a rotation: \(A = QDQ^T\)
- Top \(k\) principal components are the first \(k\) rows of \(Q^T\) in \(X^TX = QDQ^T\)
- Those are the first \(k\) principal components of \(X^TX\)
- PCA gets the \(k\) eigenvectors of \(X^TX\) which have the largest eigenvalues
- Computing the full SVD is cubic time; power iteration is quicker
- Power Iteration:
- Algo:
- Given matrix \(A = X^TX\)
- Select random unit vector \(u_0\)
- For \(i = 1, 2, \dots\), set \(u_i = A^iu_0\); if \(\frac{u_i}{\|u_i\|} \simeq \frac{u_{i-1}}{\|u_{i-1}\|}\), return \(\frac{u_i}{\|u_i\|}\)
- Finds the first eigenvector
- Stretches out the random vector in the direction of the largest
eigenvector
- Can repeatedly square \(A\) instead of iterating \(i = 1, 2, \dots\)
- Number of iterations scales as: \(\frac{\log n}{\log(\frac{\lambda_1}{\lambda_2})}\)
- \(\frac{\lambda_1}{\lambda_2}\) is the spectral gap: the ratio between how much the top two directions get stretched on the sphere
- Continuing power iteration:
- Find top component using power iteration
- Subtract the variance of the data explained by \(v_1\) (sketch below)
- e.g. \(\tilde{x}_1 = x_1 - \langle x_1, v_1 \rangle v_1\)
- Recurse to find \(k-1\) components
of new data matrix
- Often best to just find many eigenvalues, plot, then pick your \(k\)
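A sketch of power iteration with the deflation step described above (my own Python/NumPy illustration; the tolerance, iteration cap, and test data are arbitrary):

    import numpy as np

    def top_eigenvector(A, tol=1e-10, max_iter=1000):
        # power iteration: approximate top eigenvector of a symmetric PSD matrix A
        rng = np.random.default_rng(0)
        u = rng.normal(size=A.shape[0])
        u /= np.linalg.norm(u)
        for _ in range(max_iter):
            v = A @ u
            v /= np.linalg.norm(v)
            if np.linalg.norm(v - u) < tol:
                break
            u = v
        return v

    def top_k_components(X, k):
        # find k principal components via power iteration plus deflation of the data
        X = X - X.mean(axis=0)
        comps = []
        for _ in range(k):
            v = top_eigenvector(X.T @ X)
            comps.append(v)
            X = X - np.outer(X @ v, v)   # remove the variance explained by v
        return np.array(comps)

    X = np.random.default_rng(1).normal(size=(300, 8))
    print(top_k_components(X, 2).shape)  # (2, 8)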