- 9 topics + a bonus topic, one per week
- One mini-project per week, always due on Wed
- Can work with partners

- “Power of two choices”
- Related problem:
- Put \(n\) balls in \(n\) bins, i.i.d.
- “What is the expected number of balls in the most-full bin?”
- It grows via: \(\sim \frac{\log n}{\log \log n}\)

- Two-choice strategy:
- Chose two bins \(i, j \in {1, \dots, n}\) uniformly at random
- Throw ball in the least loaded of the two!
- Grows via: \(\sim \log \log n\)

- \(d\)-choices: \(\sim \frac{\log \log n}{\log d}\)
- The same “boost” doesn’t happen again outside of \(1 \rightarrow 2\)

- Related problem:

- Basic web caching:
- Akamai was one of the first companies
- Basic web use: use a url to request data from a server
- Caching: use a intermediary cache to request data first

- Hash functions:
- Consistent hashing problem: given a cache distributed over \(N\) machines, how to quickly figure out which machine a string \(x\) is associated with, while balancing load between machines
- Motivated by web-caching: how can we cache data at URLs?
- The basic solution would be to use a typical hash function \(H(\cdot)\) and modulo: \(H(x) \mod N\)
- However we need to agree to a fixed \(N\), and in real, dynamic environments we cannot do so
- Changing \(N\) would have a cost of \(1 - \frac{1}{N}\)

- “Circular hashing solution”:
- (Assume the hash outputs are 32-bit for this problem)
- Hash each server, place them on a \(2^32\) circle
- When caching data \(x_0\), find hash \(H(x_0)\), then go clockwise to first hashed server. This is the cache server
- Expected load per server: \(\frac{P}{N}\) where \(P\) is the number of objects
- When we add another server, we only need to relocate \(\frac{1}{N}\) objects

- How to do the “clockwise scan”?
- We want some datastructure which can find the server \(v'\) for a has \(v\), such that \(v' \geq v\)
- Balanced binary search trees (e.g. red-black trees) can do this in \(O(\log N)\) time
- If \(N\) is small, we can also use other data structures

- Variance reduction trick:
- Using basic method, some servers can get overloaded
- Trick: compute \(k\) hashes for each server (filling out the circle)
- If \(k \simeq \log N\), there are theoretical bounds that servers won’t be overloaded with high probability
- This also helps when we have servers with differing memory capacities, server with 2x memory can get 2x the number of hash-keys!

- Majority element problem: given an array of length \(n\) that has a majority element, (more than \(\frac{n}{2}\)), find that element!
- Simple algo to find linear-time algorithm, keep a counter of most-seen element, increment/decrement it depending on what is seen

- Heavy hitters problem: given an array \(A\) of
*large*length \(n\) and a smaller \(k\), compute the values which appear at least \(\frac{n}{k}\) times.- Many applications: given a large stream of data, compute elements which appear a certain amount of times
- If \(A\) can fit in memory: trivial \(O(n \log n)\), sort array, find which elements appear at least \(k\) times in a row
**There is no algo which solves Heavy Hitters in one pass with sublinear amount of auxillary space**- We cannot store all counts, therefore if possible candidates change, we won’t know the counts of items which could replace it

- \(\epsilon\)-approximate heavy hitters problem:
- Input: same as HH, but with additional parameter \(0 < \epsilon < 1\)
- Output: a list where:
- All values in \(A\) which occur at least \(\frac{n}{k}\) times are in the list
- All values in the list occur at least \(\frac{n}{k} - \epsilon n\) times in \(A\) (provides margin of error)

- Auxillary memory required grows as \(\frac{1}{\epsilon}\)
- We can’t have \(\epsilon = 0\) as the memory requirement grows to infinity

- Count-min sketch: a datastructure for \(\epsilon\)-approximate heavy hitters
- Supports two operations:
`inc(x)`

: increment the count of`x`

`for all i = 1, 2, ..., L: arr[i][hash_i(x)] += 1`

- \(O(\ell)\) time

`count(x)`

: get the count of`x`

`return min_i arr[i][hash_i(x)]`

- \(O(\ell)\) time

- Parameters:
- Number of buckets \(b\) (larger than \(\ell\), much smaller than \(n\))
- Number of hash functions \(\ell\) (smaller)

- Data structure: \(\ell \times b\) 2-D array of ints, init at 0
- The array counts can only
*underestimate*the true counts - the error is one-sided - Given reasonable choices of parameters, the expected error rate is \(\frac{1}{2}^\ell\)

- Supports two operations:
- Role model: bloom filters
- Remembers which elements have been inserted
- Has false positives! Sometimes confirms elements which weren’t added
- No false negatives however
- 1-byte per element results in false positive rate of less than 2%
- Represents accuracy-space tradeoff

- Metric Space:
- \(X\), a set of points in the space
- \(d\), a distance function between two points:
- \(d(x, y) = 0 \iff x = y\)
- \(d(x, y) = d(y, x)\)
- \(d(x, z) \leq d(x, y) + d(y, z)\)

- Norms are a common example of a distance function
- Edit distance: between two strings, what’s the smallest number of insertions/deletions to make them equal?
- Can either use a \(O(mn)\) exact algorithm or a fast approximate algorithm

- Jacard Similarity:
- \(J(S, T) = \frac{|S \cap T|}{|S \cup T|}\)
- Higher means more similar
- Not a measure of distance
- Used for BoW models between documents to find similarity

- Nearest Neighbor search:
- Trival time: \(O(n)\), exact algorithms
- Gold standard time: \(O(\log n)\), approximate algorithms
- \(\epsilon\)-NN: find the class of a point within \(1+\epsilon\) times the actual min distance to a point

- Classical, low-dimensional NN:
- Use space partitioning
- Make a decision tree on space, splitting up on median points
- Works with one dimension
- For more than one dimension, use K-D trees
- Split on median points for one dimension, repeat over different dimensions
- No longer any guarantee of finding nearest neighbor first
- Time: \(O(2^\text{dim} \log n)\)

- Modern, high-dimensional NN: use locality-sensitive hashing
- Curse of Dimensionality: costs and space grow exponentially w.r.t. dimension

- Volume of the Euclidean ball of radius \(r\) in \(k\) dimensions grows proportionally to \(r^k\)
- \(c^k\) disjoint balls of radius \(r\) can fit inside a ball of radius \(2r\), for some \(c > 1\)
- Almost all of the volume of the ball is near the boundary
- This means that space partitioning is much worse in high dimensions

- Johnson-Lindenstrauss Lemma:
- Let \(A\) be a \(d \times k\) random matrix of \(N(0, 1)\) random variables normalized by \(\sqrt{d}\)
- For some \(n \geq 1\) and \(\epsilon > 0\), let \(d = \lceil \frac{24 \ln n}{\epsilon^2} \rceil\)
- For any \(n\)-point subset \(X \subseteq \mathbb{R}^k\), all vectors \(x, y \in X\), with high probability we have: \[(1 - \epsilon) \|x - y\|^2_2 \leq \|Ax - Ay\|^2_2 \leq (1 + \epsilon) \| x - y\|^2_2\]
- This means that a random matrix mapping \(d\) dimensions to \(O(\log n)\) dimensions will with high probability preserve interpoint distances between \(n\) points
- A matrix \(A\) where all elements are positive or negative signs at random also has this property
- Idea is to have many one-dim maps which preserve distance in expectation, use them in parallel

- MinHash:
- Let \(\pi\) be a function with a random permutation of all elements \(U\)
- For some \(S \subseteq U\), let \(h_\pi(S) = \arg \min_{x \in S} \pi(x)\)
- This basically returns the minimum of the random labels assigned
- The probability of two subsets colliding is equal to their Jaccard similarity! \[P[h_\pi(A) = h_\pi(B)] = \frac{|A \cap B|}{|A \cup B|} = J(A, B)\]

- Independence of MinHash:
- Let \(H(S)\) be the results of \(\ell\) independent hash functions
- Similarity: \(J^H(A, B) = \frac{\text{Number of times equal}(H(A), H(B))}{\ell}\)
- We have \((1 - \epsilon)J(A,B) \leq J^H(A,B) \leq (1+\epsilon)J(A,B)\)
- Similar to JL Lemma
- As long as \(\ell \leq O(\frac{\log n}{\epsilon^2})\) for a set \(X\) of \(n\) users/subsets

- NN search using MinHash:
- Preprocess all elements \(x\) by finding their MinHash: \(H(x)\)
- When searching for \(y\)’s NN, compute \(H(y)\), find all elements \(z\) where \(H(y) = H(z)\), search through them

- One dimensional hash: dot product between input \(x\) and random normal \(r\)
- \(F_r(x) = \langle x, r \rangle\) has mean \(0\) and variance \(\|x\|^2_2\)
- Comes from addition of random Gaussian variables being another Gaussian

- \(F_r(x) = \langle x, r \rangle\) has mean \(0\) and variance \(\|x\|^2_2\)

- Binary classification:
- Given distribution and create classifier function
- Maps between 0 and 1 for this example
- Problem: given \(n\) vectors with ground truths, create mapping function

- Train error: error of func on training data
- Test error: error on test data

*Missed*

*PCA*

- How do we get feature vectors from data?
- Good to start with centering data (de-mean them)
- When our data is lower-dimensional vs the vector rep, we can use linear combinations of vectors to represent it

- PCA:
- Find features which “explain” the data the most
- Maximizes distances between points projected onto the basis vectors

- Can use it to map DNA sequences of Europeans to locations
- Keep on adding PCA vectors until representation doesn’t improve

- Input to PCA is \(m\) \(n\)-dim data points, \(X\) and parameter \(k\)
- Output is \(k\) orthonormal vectors, top \(k\) principle components
- This output maximizes: \(\frac{1}{m}\sum_{i=1}^m \sum_{j=1}^k \langle x_i, v_j \rangle^2\)
- We can express our problem as: \(\arg \max_{v = \|v\| = 1} v^T Av\), where \(v\) is a orthonormal matrix and \(A = X^T X\)
- Solutions are just diagonal matricies with a rotation
- Top \(k\) principle components are first \(k\) rows of \(Q^T\) in \(X^TX = QDQ^T\)
- Those are the first \(k\) principle components of \(X^TX\)

- PCA gets the \(k\) eigenvectors of \(X^TX\) which have the largest eignenvalues
- SVD is cubic time, power iteration is quicker
- Power Iteration:
- Algo:
- Given matrix \(A = X^TX\)
- Select random unit vector \(u_0\)
- For \(i = 1, 2, \dots\), set \(u_i = A^iu_0\) If \(\frac{u_i}{\|u_i\|} \simeq \frac{u_{i-1}}{\|u_{i-1}\|}\), return \(\frac{u_i}{\|u_i\|}\)

- Finds the first eigenvector
- Stretches out the random vector in the direction of the largest eigenvector
- Can just repeatedly square instead of \(i=1,2,\dots\)
- Number of iterations scales as: \(\frac{\log n}{\log (\frac{\lambda_1}{\lambda_2}}\)
- \(\frac{\lambda_1}{\lambda_2}\) is the spectral gap, the difference between how big the stretching is on the sphere

- Algo:
- Continuing power iteration:
- Find top component using power iteration
- Subtract variance of data explained by \(v_1\)
- e.g. \(\tilde{x}_1 = x_1 - \langle x_1, v_1 \rangle v_1\)

- Recurse to find \(k-1\) components of new data matrix

- Often best to just find many eigenvalues, plot, then pick your \(k\)