- We will cover probability, combinatorics, and some applications
- Counting is helpful because:
- It helps with algo analysis
- It is a building block for probability

- A set’s cardinality (size) is the number of distinct elements in it
*Sum Rule:*if we want to choose an element between two sets with no overlapping elements, we have the sum of the cardinalities number of options*Product Rule:*when we have a sequential process of choosing elements, the number of possibilities is the sum of the number of options*at each step*- This is equal to the size of the Cartesian Product of the sets
- Note that is works independent of which are the elements being chosen from, as long as the number of options is always the same at a step

- What is the size of the power set of \(S\)?
- \(2^{\mid S \mid}\)
- Like the product rule, as each element can be in or out of a subset

- When parsing HW problems, look out for “sequence” and “distinct”
- Remember \(0! = 1\)
**K Permutation:**- The number of \(k\) element sequences of distinct symbols from a universe of \(n\) symbols is: \(P(n, k) = n (n - 1) \dots (n - k + 1) = \frac{n!}{(n - k)!}\)
- Said as “P n K”, “n permute k”, or “n pick k”

- Permutation vs subsets: does order matter?
**K Combination:**- The number of \(k\) element subsets from a set of \(n\) symbols is: \(C(n, k) = \frac{P(n, k)}{k!} = \frac{n!}{k!(n - k)!}\)
- Said as “n choose k” typically
- Also written as: \(\binom{n}{k}\)
- We can think of this as finding the number where ordering matters, then dividing by the number of possible orders for each subset

- We can go back and forth from combinations to permutations by dividing and multiplying by the number of possible orders within a subset (typically \(\mid S \mid\))
- Path counting is really just combinations
- Overcounting is cool! (We just have to correct for it)
- e.g. anagrams of “SEATTLE”, we find all permutations, then divide by the product of the factorials of the number of each element
- divide by \(2! \cdot 2!\)

- e.g. anagrams of “SEATTLE”, we find all permutations, then divide by the product of the factorials of the number of each element
**Complementary Counting:**- We count all the
*complements*of a set (everything that*wouldn’t*be valid), then subtract that from the cardinality of the universe!

- We count all the

**Multinomial coefficient:**\(\binom{c}{a, b} \equiv \frac{c!}{a! \cdot b!}\)- Counting two ways can be a good way to learn more and understand!
**Symmetry of combinations:**\(\binom{n}{k} = \binom{n}{n - k}\)- Can prove using algebra
- Can prove using the “team-picking” idea: choosing who is on a team also indirectly chooses who is not on a team, and vice-versa

**Pascal’s Rule:**\(\binom{n}{k} = \binom{n - 1}{k - 1} + \binom{n - 1}{k}\)- Again, we can prove using algebra
- We can also prove using the idea of the summation of all teams that I’m on and teams that I’m not on
- “Focus on one element”

**Binomial Theorem:**\((x + y)^n = \sum_{i = 0}^{n} \binom{n}{i} x^i y^{n-i}\)- We have \(xy\)’s and \(\binom{n}{i}\) determines how many there are
- Useful when we need to plug in specific numbers for x and y

**Principle of Inclusion-Exclusion:**when we are trying to get the size of a set which is the OR of multiple conditions, we can take the size of the union of the sets which satisfy at least one condition- Sum rule for non-disjoint sets: \(\mid A \cup B \mid = \mid A \mid + \mid B \mid - \mid A \cap B \mid\)
- Sum rule for three non-disjoint sets follows similar logic, subtracting the intersection of each pair of sets, then adding the intersection of all sets
- Typically only use for two or three conditions (maybe four)

- Pigeonhole Principle
- If there are more items then spots, we know that one spot has more than one item
- Strong Pigeonhole: If we have \(n\) pigeons and \(k\) pigeonholes, there is at least one pigeonhole with \(\frac{k}{n}\) pigeons,
*rounded up*

- Placing dividers (stars and bars):
- If we need to find the number of different groups comprised of different types of elements, use dividers
- For n size groups of k types: \(\binom{n+k-1}{k-1}\)
- Can think about it as a binary string permutation problem

- Probability is a way of quantifying our uncertainty
- Sample space, \(\Omega\), is all the possible outcomes of an experiment
- Single coin flip: \(\Omega = \lbrace H, T \rbrace\)

- An event, \(E \subseteq \Omega\), is a subset of all possible outcomes
- A probability, \(\mathbb{P}: \Omega \rightarrow [0, 1]\), is a function that maps an element of \(\Omega\) to it’s likelihood of occurring
- Other notation: \(Pr[\omega]\) or \(P(\omega)\)
- All probabilities must be between 0 and 1 and the sum of the probability for each element in the sample space must sum to 1
- Probability of an event should be the sum of the probabilities of the elements in the event

- A probability space is a pair of sample space and probability function
- Uniform Probability Space: all events have the same likelihood
- Events are “mutually exclusive” if they cannot happen simultaneously
- Axioms:
- Non-negativity: \(\mathbb{P}(x) \geq 0\) for all \(x\)
- Normalization: \(\sum_{x \in \Omega} \mathbb{P}(x) = 1\)
- Countable additivity: If \(E\) and \(F\) are mutually exclusive, then: \(\mathbb{P}(E \cup F) = \mathbb{P}(E) + \mathbb{P}(F)\)

- Facts derived from axioms:
- Complementation: \(\mathbb{P}(\bar{E}) = 1 - \mathbb{P}(E)\)
- Monotonicity: if \(E \subseteq F\), then \(\mathbb{P}(E) \leq \mathbb{P}(F)\)
- Inclusion-exclusion: \(\mathbb{P}(E \cup F) = \mathbb{P}(E) + \mathbb{P}(F) - \mathbb{P}(E \cap F)\)

- Often our sample space will contain excess information. This won’t make our answer incorrect, but can lead to unnecessary work
- We use conditional probability when we have partial information
- It is a way to “restrict” the sample space
- \(\mathbb{P}(A \mid B) = \frac{\mathbb{P}(A \cap B)}{\mathbb{P}(B)}\)
- This allows us to update our probabilities
- “The probability of A given B is equal to the probability of A and B happening, divided by the probability of B”

- Bayes’ Rule allows us to use conditional probabilities
- Bayes Rule:
- \(\mathbb{P}(A \mid B) = \frac{\mathbb{P}(B \mid A) \mathbb{P}(A)}{\mathbb{P}(B)}\)

- The law of total probability:
- \(\mathbb{P}(S) = \mathbb{P}(S \mid G) \cdot \mathbb{P}(G) + \mathbb{P}(S \mid \bar{G}) \cdot \mathbb{P}(\bar{G})\)
- More generally: \(\sum_{\forall i}\mathbb{P}(E \mid A_i) \mathbb{P}(A_i)\)
- The probability of an event happening is equal to the sum of the probability of the event happening given another event happening, multiplied by the probability of the other event

- A partition of the sample space is a family of subsets where each partition is distinct and they combine to be the entire sample space
- Humans are very bad at understanding very large or small numbers - past a certain amount we ignore magnitudes
- It is good to think of tests as an “update” to our priors, not a revelation of truth
- When we condition on an event, we still have probability spaces: \(B\) and probability measures: \(\mathbb{P}(\omega \mid B)\)
- Do
*not*condition on multiple events, only the intersection of them

- (Statistical) independence is when the probabilities of two things don’t depend on each other:
- \(\mathbb{P}(A \cap B) = \mathbb{P}(A) \cdot \mathbb{P}(B)\)
- “Conditioning doesn’t make a difference”

- Chain Rule:
- \(\mathbb{P}(A_1 \cap A_2 \cap \dots \cap A_n) = \mathbb{P}(A_n \mid A_1 \cap \dots \cap A_{n-1}) \cdot \mathbb{P}(A_{n-1} \mid A_1 \cap \dots \cap A_{n-2}) \dots \mathbb{P}(A_2 \mid A_1) \cdot \mathbb{P}(A_1)\)
- We can find this from: \(\mathbb{P}(A \mid B) = \frac{\mathbb{P}(A \cap B)}{\mathbb{P}(B)} \rightarrow \mathbb{P}(A \cap B) = \mathbb{P}(A \mid B) \cdot \mathbb{P}(B)\)

- Conditional independence is independence of two events given another event

- Medical terms for tests:
- \(\mathbb{P}(D)\) is “prevalence”
- \(\mathbb{P}(T \mid D)\) is “sensitivity”
- \(\mathbb{P}(\bar{T} \mid \bar{D})\) is “specificity”

**ONE WEIRD TRICK:**- Think of these problems with a large population (where at least one for each chance)

- Intuition Trick: Bayes’ Factor:
- When you test positive, multiply prior by the Bayes’ Factor: \(\frac{\text{sensitivity}}{\text{false positive rate}} = \frac{1 - FNR}{FPR}\)
- Also called the “likelihood ratio”
- It is an estimate of how much I should update the prior by
- Better when the prior is quite low

- When there are overwhelming differences between groups, response error can drown out small groups

- We often implicitly define the sample space:
- This commonly occurs when the size of the sample space is infinite

- When working with sample spaces with infinite size:
- We can use infinite sums of probabilities (and closed forms)
- We can use complement events

- Random Variables:
- It is any function that has domain \(\Omega\) and outputs a real number
- \(X(\omega)\) is a summary of \(\omega\)
- It doesn’t
*change*the problem, but it can simplify it!

- One sample space can have many random variables
- We always use capital letters as random variables
- We commonly use lowercase variables as the values the random variable could take on
- Random variables are a function
- We do not use typical function notation, instead something like: \(X = 2\)

- The
**support**(the range) is the set of values \(X\) can take on - The event the random variable takes on a value: \(\lbrace \omega : X(\omega) = x \rbrace\)
- The probability of that event is: \(\mathbb{P}(X = x)\)

- The function that gives us \(\mathbb{P}(X = x)\) is the Probability Mass Function, or PMF
- This is written as: \(p_X(x)\) or \(f_X(x)\)
- Think of the PMF as a piecewise function where values outside the support are zero because it must take in all real numbers as an input

- The Cumulative Distribution Function (CDF) gives the probability \(X \leq x\)
- Written as: \(F_X(x) = \mathbb{P}(X \leq x)\)

- CDF will always be defined over all real numbers (we must support all of them)
- We will often use floor functions when we only want to “support” integers
- Typically has zero below the support and one above the support

- The “expectation” of a random variable \(X\) is: \(\mathbb{E}[X] = \sum_k k \cdot \mathbb{P}(X = k)\)
- The “weighted average” of values \(X\) could take on
- Weighted by the probability we actually see the value
- Think about it as the expectation of a drive in football:
- \(X\) is the possible scores \(0, 2, 3, 6, 7\)
- We multiply each score by the likelihood that it happens, sum all of these for our expected score

- Not the most common outcome

- The expectation of the sum of two random variables is equal to the sum of the expectations for each variable
- Pairwise independece: for each pair in the set, they are independent
- Mutual independence: for each subset of events, the probability of the intersection of all of them is equal to the product of all individual probabilities of the events
- Stronger than pairwise independence

- Two random variables \(X\) and \(Y\) are independent if for all \(k\) and \(l\) (in the supports), \(\mathbb{P}(X = k, Y = l) = \mathbb{P}(X = k) \cdot \mathbb{P}(Y = l)\)
- Note that commas are often used instead of \(\cap\) for random variables

- Pairwise independence is the intuition
- Mutual independence of random variables: \(X_1, X_2, \dots, X_n\) are mutually independent if for all \(x_1, x_2, \dots, x_n\) \(\mathbb{P}(X_1 = x_1, X_2 = x_2, \dots, X_n = x_n) = \mathbb{P}(X_1 = x_1) \cdot \mathbb{P}(X_2 = x_2) \cdot \dots \cdot \mathbb{P}(X_n = x_n)\)
- We
*don’t*need to check all subsets, but*do*need to check all possible values in the ranges

- We
- While many equations might want all values (outside) the range of a random variable to be checked or included, because the probability of it is zero they often don’t need to be

- Linearity of expectation: for any two random variables \(X\) and \(Y\), \(\mathbb{E}[X + Y] = \mathbb{E}[X] + \mathbb{E}[Y]\)
- This extends to more than two random variables
- Simple summation proof
- \(X\) and \(Y\) do
**not**need to be independent - Constants are also fine (works like algebra intuition!)

- Indicator random variables are 1 if the event occurs and zero if it does not
- \(\mathbf{1}[A]\)
- The expectation of the indicator variable is the probability the event occurs

- How to compute complicated expectations:
- Decompose the random variable into the sum of simple random variables
- Apply linearity of expecation
- Compute the simple expectations

- Variance is another one-number summary of a random variable
- We typically square values instead of using absolute values (think norms)
- Variance: \(Var(X) = \sum_{\omega} \mathbb{P}(\omega) \cdot (X(\omega) - \mathbb{E}[X])^2\)
- How “extreme” or “spread out” values are
- \(= \mathbb{E}[(X - \mathbb{E}[X])^2] = \mathbb{E}[X^2] - \mathbb{E}[X]^2\)
- We should first find \(\mathbb{E}[X]\)

- If \(X\) and \(Y\) are independent:
- \(Var(X + Y) = Var(X) + Var(Y)\)
- \(\mathbb{E}[X \cdot Y] = \mathbb{E}[X] \cdot \mathbb{E}[Y]\)

- Squaring an indicator variable does nothing to it
- Squaring random variables:
*we only square the value, not probability*

- Squaring random variables:
- \(Var(X + c) = Var(X)\)
- Think of this as just shifting the distribution

- \(Var(aX) = a^2 Var(X)\)
- Stretching or compressing random variable

- Introduces the random variable zoo: a list of facts about random variables (and distributions)
- Use a reference sheet or wikipedia!

- Memoryless - future outcomes aren’t influenced by past outcomes
- “Independent and identially distributed”: “iid”
- Bernoulli: one trial with a probability \(p\) of success
- Binomial: how many success in \(n\) independent trials with probability \(p\) of success?
- Geometric: how many independent trials until the first success?
- Uniform: integer in some range with each value equally likely
- We need smallest and largest possible value for the range

- Negative Binomial: how many independent trials with probability \(p\) until we have \(r\) successes?

- Poisson Distribution: we know the average over an interval of time, we can’t use each individual possible source
- (Kinda) requires each event to be independent
- We use Poisson because we don’t have good ideas of what the random variables are
- This is a model - not perfect (most real-world events are somewhat dependent), but it is useful!
- The PMF involves the Taylor Series for \(e^x\)
- It is a way of using the limit as the number of experiments approach infinity, but the mean stays constant

- Hypergeometric: drawing without probability from an urn
- Continuous random variables:
- We need continuous probability spaces and continuous random variables to represent uncountably-infinite sample spaces

- We use the probability density function for continuous random variables:
- It is a way of comparing probability of being near different events
- \(f_x(k) \geq 0\)
- \(\int_{-\infty}^{\infty} f_x(k) dk = 1\)
- We use this because we need the PDF to work for events

- Integrating is analogous to summation - continuous vs discrete values:
- \(\mathbb{P}(a \leq X \leq B) = c\)
- \(\int_{a}^{b} f_x(k) dk = c\)

- Impossible events still have probability 0, but some probability 0 events are still possible for continuous probability spaces

- Continuous random variables require Probability Density Function:
- Main difference is that we use events vs values
- Every single value is equal to 0 (typically)
- The PDF is the number that when integrated over gives the probability of an event
- Comparing \(f_X(k)\) and \(f_X(l)\) gives relative chances of \(X\) being near \(k\) vs \(l\)
- $(a X b) = _{a}^{b} f_X(z)dz $
- Sometimes the density will be greater than 1

- Cumulative density function is analogous when using continuous vs discrete random variables:
- \(F_X(k) = \mathbb{P}(X \leq k) = \int_{- \infty}^{k} f_X(z)dz\)

- CDF to PDF by taking the derivative of CDF
- Undos the integral
- Vice-versa works as well

- General pattern: summation for discrete random variables becomes integration for continuous
- \(\mathbb{E}[X] = \int_{- \infty}^{\infty} X(z) \cdot f_X(z) dz\)
- Expectation of a function of a random variable:
- \(\mathbb{E}[g(X)] = \int_{- \infty}^{\infty} g(X(z)) \cdot f_X(z)dz\)
- “Law of the Unconcious Statistician” ~math nerds

- Linearity of Expectations still works!
- Variance: \(Var(X) = \mathbb{E}[X^2] - \mathbb{E}[X]^2 = \int_{- \infty}^{\infty} f_x(k)(X(k) - \mathbb{E}[X])^2 dk\)
- Expectation of a uniform random varaible between \(a\) and \(b\) is the same, \(\frac{a+b}{2}\)
- Expectation of the uniform random variable squared is: \(\frac{a^2 + ab + b^2}{3}\)
- Variance of the variable is: \(\frac{(b - a)^2}{12}\)

- Exponential random variable: like geometric random variable, but continuous time
- “Waiting doesn’t make the event happen sooner” - meomoryless
- \(f_X(k) = \lambda e^{- \lambda k}\)
- \(\mathbb{E}[X] = \frac{1}{\lambda}\)
- Same as taking a Poisson random variable and asking “How long until the next event?”
- \(F_X(t) = \mathbb{P}(X \leq t) = 1 - e^{- \lambda t}\)
- Expectation is: \(\frac{1}{\lambda}\)
- Variance is: \(\frac{1}{\lambda^2}\)

- Gaussian (normal) random variable:
- Mean: \(\mu\)
- Variance: \(\sigma^2\)
- \(f_X(x) = \frac{1}{\sigma \sqrt{2 \pi}} \cdot e^{- \frac{(x - \mu)^2}{2 \sigma^2}}\)
- \(F_X(k)\) has no nice closed form: use the table
- \(\mathbb{E}[X] = \mu\)
- This is the bell curve
- Scaling or adding to a normal variable results in another normal variable

- To normalize normal variable \(X\): \(Y = \frac{X - \mu}{\sigma}\)
- We normalize because the CDF for normal random variables can be super rough
- We convert to a “standard normal”, round the “z-score” to the hundredths, look up on the table

- The sum of any independent random variables approaches the normal distribution
- Let \(X_1, X_2, \dots , X_n\) i.i.d. random variables with mean \(\mu\) and variance \(\sigma ^2\)
- Let \(Y_n = \frac{X_1 + X_2 + \dots + X_n - n \mu}{\sigma \sqrt{n}}\)
- Then as \(n \rightarrow \infty\), \(Y_n\) converges to the CDF of \(\mathbb{N}(0, 1)\)
- Only equal in the limit!

- Gaussians often occur in the real world because the random variable is a combination of many independent factors
- We can often use the Gaussian CDF in practice instead of complicated independent variables
- Note that \(\Phi\) represents the CDF of the Gaussian with mean 0 and variance 1
- The z-table only has positive values: we use \(\Phi(-x) = 1 - \Phi(x)\)
- When we have a discrete variable we’re approximating with a continuous random variable, we must do a continuity correction:
- We find all values that would
*round*to the correct discrete value - Other corrections are typically not worth the effort

- We find all values that would
- CLT Usage Outline:
- Write down the event you’re interested in, in terms of the sum of random variables
- Apply continiuty correction if RVs are discrete
- Normalize RB to have mean 0 and standard deviation 1
- Replace RV with \(\mathbb{N}(0,1)\)
- Write event in terms of \(\Phi\)
- Look up in table

- Polling: \(\mathbb{P}(\vert X - \mathbb{E}[X] \vert \geq) s) \leq \epsilon\)
- How often \(\epsilon\) is our polling outside an acceptable margin \(s\)

- If two RVs are independent, the probability they both equal some values is equal to their individual products
- We can use LTP to talk about only one variable (when they’re dependent)
- This is the marginal distribution, as we marginalized a RV
- The “marginal” for \(X\) is where we marginalize all other RVs

- Joint expectation is still the sum of the probabilities times the value
- This is often written as a function
- Conditional expectations are also intuitively defined
- LOE still work!

- Law of Total Expectation: basically the same as LTP, but with expectations
- \(\mathbb{E} = \sum_{i=1}^n \mathbb{E}[X \vert A_i] \mathbb{E}(A_i)\)

- Covariance: how much \(X\) and \(Y\) change
*together*- \(\text{Cov}(X, Y) = \mathbb{E}[(X - \mathbb{E})(Y - \mathbb{E}[Y])] = \mathbb{E}[XY] - \mathbb{E}[X] \mathbb{E}[Y]\)
- We often only care about if \(\text{Cov}\) is positive or negative

- \(\text{Variance}(X + Y) = \text{Variance}(X) + \text{Variance}(Y) + 2 \text{Cov}(X, Y)\)

- Two methods of polling: sampling with vs without replacement
- We will do the math for sampling with replacement even if this isn’t used in practice

- We don’t do continuity corrections when we’re polling
- Accuracy of polling isn’t determined by the total population amount, only the number of people polled
- Works for idealized polling scenarios with large-ish populations
- This is due to the way we use the two methods of polling

- The “margin of error” is an intuitive way of measuring variance
- “If I performed this poll repeatedly, 95% of the time the correct value will be within +/- the margin of error”

- Find the number of people necessary to guarantee a margin of error
- Handling \(\sqrt{p(1-p)}\), assume worst case scenario for \(p = 0.5\)
- We use the z-table in reverse, find the smallest input that gives the correct output

- We need a way to calculate the bounds with certainty
- Tail bound: “bounds the size of the tail”
- Markov Inequality:
- Let \(X\) be a random variable supported only on non-negative numbers:
- \(\mathbb{P}(X \geq t) \leq \frac{\mathbb{E}[X]}{t}\)
- \(\mathbb{P}(X \geq k \mathbb{E}[X]) \leq \frac{1}{k}\)

- Non-negative random variable and \(k\) or \(t\) is positive

- Let \(X\) be a random variable supported only on non-negative numbers: