CSE 312
Lecture 1 - March 27
- We will cover probability, combinatorics, and some applications
- Counting is helpful because:
- It helps with algo analysis
- It is a building block for probability
- A set’s cardinality (size) is the number of distinct elements in
it
- Sum Rule: if we want to choose an element between two sets
with no overlapping elements, we have the sum of the cardinalities
number of options
- Product Rule: when we have a sequential process of choosing
elements, the number of possibilities is the sum of the number of
options at each step
- This is equal to the size of the Cartesian Product of the sets
- Note that is works independent of which are the elements being
chosen from, as long as the number of options is always the same at a
step
- What is the size of the power set of \(S\)?
- \(2^{\mid S \mid}\)
- Like the product rule, as each element can be in or out of a
subset
Lecture 2 - March 29
- When parsing HW problems, look out for “sequence” and
“distinct”
- Remember \(0! = 1\)
- K Permutation:
- The number of \(k\) element
sequences of distinct symbols from a universe of \(n\) symbols is: \(P(n, k) = n (n - 1) \dots (n - k + 1) =
\frac{n!}{(n - k)!}\)
- Said as “P n K”, “n permute k”, or “n pick k”
- Permutation vs subsets: does order matter?
- K Combination:
- The number of \(k\) element subsets
from a set of \(n\) symbols is: \(C(n, k) = \frac{P(n, k)}{k!} = \frac{n!}{k!(n -
k)!}\)
- Said as “n choose k” typically
- Also written as: \(\binom{n}{k}\)
- We can think of this as finding the number where ordering matters,
then dividing by the number of possible orders for each subset
- We can go back and forth from combinations to permutations by
dividing and multiplying by the number of possible orders within a
subset (typically \(\mid S \mid\))
- Path counting is really just combinations
- Overcounting is cool! (We just have to correct for it)
- e.g. anagrams of “SEATTLE”, we find all permutations, then divide by
the product of the factorials of the number of each element
- divide by \(2! \cdot 2!\)
- Complementary Counting:
- We count all the complements of a set (everything that
wouldn’t be valid), then subtract that from the cardinality of
the universe!
Lecture 3 - April 1
- Multinomial coefficient: \(\binom{c}{a, b} \equiv \frac{c!}{a! \cdot
b!}\)
- Counting two ways can be a good way to learn more and
understand!
- Symmetry of combinations: \(\binom{n}{k} = \binom{n}{n - k}\)
- Can prove using algebra
- Can prove using the “team-picking” idea: choosing who is on a team
also indirectly chooses who is not on a team, and vice-versa
- Pascal’s Rule: \(\binom{n}{k} = \binom{n - 1}{k - 1} + \binom{n -
1}{k}\)
- Again, we can prove using algebra
- We can also prove using the idea of the summation of all teams that
I’m on and teams that I’m not on
- Binomial Theorem: \((x +
y)^n = \sum_{i = 0}^{n} \binom{n}{i} x^i y^{n-i}\)
- We have \(xy\)’s and \(\binom{n}{i}\) determines how many there
are
- Useful when we need to plug in specific numbers for x and y
- Principle of Inclusion-Exclusion: when we are
trying to get the size of a set which is the OR of multiple conditions,
we can take the size of the union of the sets which satisfy at least one
condition
- Sum rule for non-disjoint sets: \(\mid A
\cup B \mid = \mid A \mid + \mid B \mid - \mid A \cap B
\mid\)
- Sum rule for three non-disjoint sets follows similar logic,
subtracting the intersection of each pair of sets, then adding the
intersection of all sets
- Typically only use for two or three conditions (maybe four)
Lecture 4 - April 3
- Pigeonhole Principle
- If there are more items then spots, we know that one spot has more
than one item
- Strong Pigeonhole: If we have \(n\)
pigeons and \(k\) pigeonholes, there is
at least one pigeonhole with \(\frac{k}{n}\) pigeons, rounded
up
- Placing dividers (stars and bars):
- If we need to find the number of different groups comprised of
different types of elements, use dividers
- For n size groups of k types: \(\binom{n+k-1}{k-1}\)
- Can think about it as a binary string permutation problem
Lecture 5 - April 5
- Probability is a way of quantifying our uncertainty
- Sample space, \(\Omega\), is all
the possible outcomes of an experiment
- Single coin flip: \(\Omega = \lbrace H, T
\rbrace\)
- An event, \(E \subseteq \Omega\),
is a subset of all possible outcomes
- A probability, \(\mathbb{P}: \Omega
\rightarrow [0, 1]\), is a function that maps an element of \(\Omega\) to it’s likelihood of occurring
- Other notation: \(Pr[\omega]\) or
\(P(\omega)\)
- All probabilities must be between 0 and 1 and the sum of the
probability for each element in the sample space must sum to 1
- Probability of an event should be the sum of the probabilities of
the elements in the event
- A probability space is a pair of sample space and probability
function
- Uniform Probability Space: all events have the same likelihood
- Events are “mutually exclusive” if they cannot happen
simultaneously
- Axioms:
- Non-negativity: \(\mathbb{P}(x) \geq
0\) for all \(x\)
- Normalization: \(\sum_{x \in \Omega}
\mathbb{P}(x) = 1\)
- Countable additivity: If \(E\) and
\(F\) are mutually exclusive, then:
\(\mathbb{P}(E \cup F) = \mathbb{P}(E) +
\mathbb{P}(F)\)
- Facts derived from axioms:
- Complementation: \(\mathbb{P}(\bar{E}) = 1
- \mathbb{P}(E)\)
- Monotonicity: if \(E \subseteq F\),
then \(\mathbb{P}(E) \leq
\mathbb{P}(F)\)
- Inclusion-exclusion: \(\mathbb{P}(E \cup
F) = \mathbb{P}(E) + \mathbb{P}(F) - \mathbb{P}(E \cap F)\)
Lecture 6 - April 7
- Often our sample space will contain excess information. This won’t
make our answer incorrect, but can lead to unnecessary work
- We use conditional probability when we have partial information
- It is a way to “restrict” the sample space
- \(\mathbb{P}(A \mid B) =
\frac{\mathbb{P}(A \cap B)}{\mathbb{P}(B)}\)
- This allows us to update our probabilities
- “The probability of A given B is equal to the probability of A and B
happening, divided by the probability of B”
Lecture 7 - April 10
- Bayes’ Rule allows us to use conditional probabilities
- Bayes Rule:
- \(\mathbb{P}(A \mid B) =
\frac{\mathbb{P}(B \mid A) \mathbb{P}(A)}{\mathbb{P}(B)}\)
- The law of total probability:
- \(\mathbb{P}(S) = \mathbb{P}(S \mid G)
\cdot \mathbb{P}(G) + \mathbb{P}(S \mid \bar{G}) \cdot
\mathbb{P}(\bar{G})\)
- More generally: \(\sum_{\forall
i}\mathbb{P}(E \mid A_i) \mathbb{P}(A_i)\)
- The probability of an event happening is equal to the sum of the
probability of the event happening given another event happening,
multiplied by the probability of the other event
- A partition of the sample space is a family of subsets where each
partition is distinct and they combine to be the entire sample
space
- Humans are very bad at understanding very large or small numbers -
past a certain amount we ignore magnitudes
- It is good to think of tests as an “update” to our priors, not a
revelation of truth
- When we condition on an event, we still have probability spaces:
\(B\) and probability measures: \(\mathbb{P}(\omega \mid B)\)
- Do not condition on multiple events, only the intersection
of them
Lecture 8 - April 12
- (Statistical) independence is when the probabilities of two things
don’t depend on each other:
- \(\mathbb{P}(A \cap B) = \mathbb{P}(A)
\cdot \mathbb{P}(B)\)
- “Conditioning doesn’t make a difference”
- Chain Rule:
- \(\mathbb{P}(A_1 \cap A_2 \cap \dots \cap
A_n) = \mathbb{P}(A_n \mid A_1 \cap \dots \cap A_{n-1}) \cdot
\mathbb{P}(A_{n-1} \mid A_1 \cap \dots \cap A_{n-2}) \dots
\mathbb{P}(A_2 \mid A_1) \cdot \mathbb{P}(A_1)\)
- We can find this from: \(\mathbb{P}(A \mid
B) = \frac{\mathbb{P}(A \cap B)}{\mathbb{P}(B)} \rightarrow \mathbb{P}(A
\cap B) = \mathbb{P}(A \mid B) \cdot \mathbb{P}(B)\)
- Conditional independence is independence of two events given another
event
Lecture 9 - April 14
- Medical terms for tests:
- \(\mathbb{P}(D)\) is
“prevalence”
- \(\mathbb{P}(T \mid D)\) is
“sensitivity”
- \(\mathbb{P}(\bar{T} \mid
\bar{D})\) is “specificity”
- ONE WEIRD TRICK:
- Think of these problems with a large population (where at least one
for each chance)
- Intuition Trick: Bayes’ Factor:
- When you test positive, multiply prior by the Bayes’ Factor: \(\frac{\text{sensitivity}}{\text{false positive
rate}} = \frac{1 - FNR}{FPR}\)
- Also called the “likelihood ratio”
- It is an estimate of how much I should update the prior by
- Better when the prior is quite low
- When there are overwhelming differences between groups, response
error can drown out small groups
Lecture 10 - April 17
- We often implicitly define the sample space:
- This commonly occurs when the size of the sample space is
infinite
- When working with sample spaces with infinite size:
- We can use infinite sums of probabilities (and closed forms)
- We can use complement events
- Random Variables:
- It is any function that has domain \(\Omega\) and outputs a real number
- \(X(\omega)\) is a summary of \(\omega\)
- It doesn’t change the problem, but it can simplify it!
- One sample space can have many random variables
- We always use capital letters as random variables
- We commonly use lowercase variables as the values the random
variable could take on
- Random variables are a function
- We do not use typical function notation, instead something like:
\(X = 2\)
- The support (the range) is the set of values \(X\) can take on
- The event the random variable takes on a value: \(\lbrace \omega : X(\omega) = x \rbrace\)
- The probability of that event is: \(\mathbb{P}(X = x)\)
- The function that gives us \(\mathbb{P}(X
= x)\) is the Probability Mass Function, or PMF
- This is written as: \(p_X(x)\) or
\(f_X(x)\)
- Think of the PMF as a piecewise function where values outside the
support are zero because it must take in all real numbers as an
input
- The Cumulative Distribution Function (CDF) gives the probability
\(X \leq x\)
- Written as: \(F_X(x) = \mathbb{P}(X \leq
x)\)
Lecture 11 - April 19
- CDF will always be defined over all real numbers (we must support
all of them)
- We will often use floor functions when we only want to “support”
integers
- Typically has zero below the support and one above the support
- The “expectation” of a random variable \(X\) is: \(\mathbb{E}[X] = \sum_k k \cdot \mathbb{P}(X =
k)\)
- The “weighted average” of values \(X\) could take on
- Weighted by the probability we actually see the value
- Think about it as the expectation of a drive in football:
- \(X\) is the possible scores \(0, 2, 3, 6, 7\)
- We multiply each score by the likelihood that it happens, sum all of
these for our expected score
- Not the most common outcome
- The expectation of the sum of two random variables is equal to the
sum of the expectations for each variable
- Pairwise independece: for each pair in the set, they are
independent
- Mutual independence: for each subset of events, the probability of
the intersection of all of them is equal to the product of all
individual probabilities of the events
- Stronger than pairwise independence
- Two random variables \(X\) and
\(Y\) are independent if for all \(k\) and \(l\) (in the supports), \(\mathbb{P}(X = k, Y = l) = \mathbb{P}(X = k) \cdot
\mathbb{P}(Y = l)\)
- Note that commas are often used instead of \(\cap\) for random variables
- Pairwise independence is the intuition
- Mutual independence of random variables: \(X_1, X_2, \dots, X_n\) are mutually
independent if for all \(x_1, x_2, \dots,
x_n\) \(\mathbb{P}(X_1 = x_1, X_2 =
x_2, \dots, X_n = x_n) = \mathbb{P}(X_1 = x_1) \cdot \mathbb{P}(X_2 =
x_2) \cdot \dots \cdot \mathbb{P}(X_n = x_n)\)
- We don’t need to check all subsets, but do need to
check all possible values in the ranges
- While many equations might want all values (outside) the range of a
random variable to be checked or included, because the probability of it
is zero they often don’t need to be
Lecture 12 - April 21
- Linearity of expectation: for any two random variables \(X\) and \(Y\), \(\mathbb{E}[X + Y] = \mathbb{E}[X] +
\mathbb{E}[Y]\)
- This extends to more than two random variables
- Simple summation proof
- \(X\) and \(Y\) do not need to be
independent
- Constants are also fine (works like algebra intuition!)
- Indicator random variables are 1 if the event occurs and zero if it
does not
- \(\mathbf{1}[A]\)
- The expectation of the indicator variable is the probability the
event occurs
- How to compute complicated expectations:
- Decompose the random variable into the sum of simple random
variables
- Apply linearity of expecation
- Compute the simple expectations
Lecture 13 - April 24
- Variance is another one-number summary of a random variable
- We typically square values instead of using absolute values (think
norms)
- Variance: \(Var(X) = \sum_{\omega}
\mathbb{P}(\omega) \cdot (X(\omega) - \mathbb{E}[X])^2\)
- How “extreme” or “spread out” values are
- \(= \mathbb{E}[(X - \mathbb{E}[X])^2] =
\mathbb{E}[X^2] - \mathbb{E}[X]^2\)
- We should first find \(\mathbb{E}[X]\)
- If \(X\) and \(Y\) are independent:
- \(Var(X + Y) = Var(X) +
Var(Y)\)
- \(\mathbb{E}[X \cdot Y] = \mathbb{E}[X]
\cdot \mathbb{E}[Y]\)
- Squaring an indicator variable does nothing to it
- Squaring random variables: we only square the value, not
probability
- \(Var(X + c) = Var(X)\)
- Think of this as just shifting the distribution
- \(Var(aX) = a^2 Var(X)\)
- Stretching or compressing random variable
Lecture 14 - April 26
- Introduces the random variable zoo: a list of facts about random
variables (and distributions)
- Use a reference sheet or wikipedia!
- Memoryless - future outcomes aren’t influenced by past outcomes
- “Independent and identially distributed”: “iid”
- Bernoulli: one trial with a probability \(p\) of success
- Binomial: how many success in \(n\)
independent trials with probability \(p\) of success?
- Geometric: how many independent trials until the first success?
- Uniform: integer in some range with each value equally likely
- We need smallest and largest possible value for the range
- Negative Binomial: how many independent trials with probability
\(p\) until we have \(r\) successes?
Lecture 15 - April 28
- Poisson Distribution: we know the average over an interval of time,
we can’t use each individual possible source
- (Kinda) requires each event to be independent
- We use Poisson because we don’t have good ideas of what the random
variables are
- This is a model - not perfect (most real-world events are somewhat
dependent), but it is useful!
- The PMF involves the Taylor Series for \(e^x\)
- It is a way of using the limit as the number of experiments approach
infinity, but the mean stays constant
- Hypergeometric: drawing without probability from an urn
- Continuous random variables:
- We need continuous probability spaces and continuous random
variables to represent uncountably-infinite sample spaces
- We use the probability density function for continuous random
variables:
- It is a way of comparing probability of being near different
events
- \(f_x(k) \geq 0\)
- \(\int_{-\infty}^{\infty} f_x(k) dk =
1\)
- We use this because we need the PDF to work for events
- Integrating is analogous to summation - continuous vs discrete
values:
- \(\mathbb{P}(a \leq X \leq B) =
c\)
- \(\int_{a}^{b} f_x(k) dk = c\)
- Impossible events still have probability 0, but some probability 0
events are still possible for continuous probability spaces
Lecture 16 - May 1
- Continuous random variables require Probability Density Function:
- Main difference is that we use events vs values
- Every single value is equal to 0 (typically)
- The PDF is the number that when integrated over gives the
probability of an event
- Comparing \(f_X(k)\) and \(f_X(l)\) gives relative chances of \(X\) being near \(k\) vs \(l\)
- $(a X b) = _{a}^{b} f_X(z)dz $
- Sometimes the density will be greater than 1
- Cumulative density function is analogous when using continuous vs
discrete random variables:
- \(F_X(k) = \mathbb{P}(X \leq k) = \int_{-
\infty}^{k} f_X(z)dz\)
- CDF to PDF by taking the derivative of CDF
- Undos the integral
- Vice-versa works as well
- General pattern: summation for discrete random variables becomes
integration for continuous
- \(\mathbb{E}[X] = \int_{- \infty}^{\infty}
X(z) \cdot f_X(z) dz\)
- Expectation of a function of a random variable:
- \(\mathbb{E}[g(X)] = \int_{-
\infty}^{\infty} g(X(z)) \cdot f_X(z)dz\)
- “Law of the Unconcious Statistician” ~math nerds
- Linearity of Expectations still works!
- Variance: \(Var(X) = \mathbb{E}[X^2] -
\mathbb{E}[X]^2 = \int_{- \infty}^{\infty} f_x(k)(X(k) -
\mathbb{E}[X])^2 dk\)
- Expectation of a uniform random varaible between \(a\) and \(b\) is the same, \(\frac{a+b}{2}\)
- Expectation of the uniform random variable squared is: \(\frac{a^2 + ab + b^2}{3}\)
- Variance of the variable is: \(\frac{(b -
a)^2}{12}\)
Lecture 17 - May 5
- Exponential random variable: like geometric random variable, but
continuous time
- “Waiting doesn’t make the event happen sooner” - meomoryless
- \(f_X(k) = \lambda e^{- \lambda
k}\)
- \(\mathbb{E}[X] =
\frac{1}{\lambda}\)
- Same as taking a Poisson random variable and asking “How long until
the next event?”
- \(F_X(t) = \mathbb{P}(X \leq t) = 1 - e^{-
\lambda t}\)
- Expectation is: \(\frac{1}{\lambda}\)
- Variance is: \(\frac{1}{\lambda^2}\)
- Gaussian (normal) random variable:
- Mean: \(\mu\)
- Variance: \(\sigma^2\)
- \(f_X(x) = \frac{1}{\sigma \sqrt{2 \pi}}
\cdot e^{- \frac{(x - \mu)^2}{2 \sigma^2}}\)
- \(F_X(k)\) has no nice closed form:
use the table
- \(\mathbb{E}[X] = \mu\)
- This is the bell curve
- Scaling or adding to a normal variable results in another normal
variable
- To normalize normal variable \(X\):
\(Y = \frac{X - \mu}{\sigma}\)
- We normalize because the CDF for normal random variables can be
super rough
- We convert to a “standard normal”, round the “z-score” to the
hundredths, look up on the table
Lecture 18 - May 8
- The sum of any independent random variables approaches the normal
distribution
- Let \(X_1, X_2, \dots , X_n\)
i.i.d. random variables with mean \(\mu\) and variance \(\sigma ^2\)
- Let \(Y_n = \frac{X_1 + X_2 + \dots + X_n
- n \mu}{\sigma \sqrt{n}}\)
- Then as \(n \rightarrow \infty\),
\(Y_n\) converges to the CDF of \(\mathbb{N}(0, 1)\)
- Only equal in the limit!
- Gaussians often occur in the real world because the random variable
is a combination of many independent factors
- We can often use the Gaussian CDF in practice instead of complicated
independent variables
- Note that \(\Phi\) represents the
CDF of the Gaussian with mean 0 and variance 1
- The z-table only has positive values: we use \(\Phi(-x) = 1 - \Phi(x)\)
- When we have a discrete variable we’re approximating with a
continuous random variable, we must do a continuity correction:
- We find all values that would round to the correct discrete
value
- Other corrections are typically not worth the effort
- CLT Usage Outline:
- Write down the event you’re interested in, in terms of the sum of
random variables
- Apply continiuty correction if RVs are discrete
- Normalize RB to have mean 0 and standard deviation 1
- Replace RV with \(\mathbb{N}(0,1)\)
- Write event in terms of \(\Phi\)
- Look up in table
- Polling: \(\mathbb{P}(\vert X -
\mathbb{E}[X] \vert \geq) s) \leq \epsilon\)
- How often \(\epsilon\) is our
polling outside an acceptable margin \(s\)
Lecture 19 - May 10
- If two RVs are independent, the probability they both equal some
values is equal to their individual products
- We can use LTP to talk about only one variable (when they’re
dependent)
- This is the marginal distribution, as we marginalized a RV
- The “marginal” for \(X\) is where
we marginalize all other RVs
- Joint expectation is still the sum of the probabilities times the
value
- This is often written as a function
- Conditional expectations are also intuitively defined
- LOE still work!
- Law of Total Expectation: basically the same as LTP, but with
expectations
- \(\mathbb{E} = \sum_{i=1}^n \mathbb{E}[X
\vert A_i] \mathbb{E}(A_i)\)
- Covariance: how much \(X\) and
\(Y\) change together
- \(\text{Cov}(X, Y) = \mathbb{E}[(X -
\mathbb{E})(Y - \mathbb{E}[Y])] = \mathbb{E}[XY] - \mathbb{E}[X]
\mathbb{E}[Y]\)
- We often only care about if \(\text{Cov}\) is positive or negative
- \(\text{Variance}(X + Y) =
\text{Variance}(X) + \text{Variance}(Y) + 2 \text{Cov}(X,
Y)\)
Lecture 20 - May 12
- Two methods of polling: sampling with vs without replacement
- We will do the math for sampling with replacement even if this isn’t
used in practice
- We don’t do continuity corrections when we’re polling
- Accuracy of polling isn’t determined by the total population amount,
only the number of people polled
- Works for idealized polling scenarios with large-ish
populations
- This is due to the way we use the two methods of polling
- The “margin of error” is an intuitive way of measuring variance
- “If I performed this poll repeatedly, 95% of the time the correct
value will be within +/- the margin of error”
- Find the number of people necessary to guarantee a margin of error
- Handling \(\sqrt{p(1-p)}\), assume
worst case scenario for \(p =
0.5\)
- We use the z-table in reverse, find the smallest input that gives
the correct output
Lecture 21 - May 15
- We need a way to calculate the bounds with certainty
- Tail bound: “bounds the size of the tail”
- Markov Inequality:
- Let \(X\) be a random variable
supported only on non-negative numbers:
- \(\mathbb{P}(X \geq t) \leq
\frac{\mathbb{E}[X]}{t}\)
- \(\mathbb{P}(X \geq k \mathbb{E}[X]) \leq
\frac{1}{k}\)
- Non-negative random variable and \(k\) or \(t\) is positive