CSE 473 (AI)
Rob Minneker, Winter 2024
Lecture 1 - Jan 3
- Grade is 50% PacMan projects, 50% written projects
- AI is a superset of ML, NNs, and GenAI
- Went over all the fields of AI
Lecture 2 - Jan 5
- A search problem consists of:
- State space
- State transition function
- State space graph
- We make search trees from our state space graphs; each tree node corresponds to a path through the graph
- Graphs with cycles become infinite search trees
- “Complete” search algorithm: always finds a solution if it
exists
- “Optimal” search algorithm: always finds the lowest-cost path between start and goal
Search algos:
- Uninformed
- DFS, BFS
- Same search method, different “fringe” (stack for DFS vs queue for BFS)
- Uniform cost search (UCS):
- Uses a priority queue
- Complete and optimal
- Similar to Dijkstra’s
- Heuristics:
- A heuristic estimates how close a state is to a goal
- Admissibility:
- Admissible (optimistic): \(0 \leq h(s) \leq \text{true\_cost}(s)\)
- Inadmissible (pessimistic) heuristics
- Greedy search: expand whichever state has the lowest heuristic value \(h(s)\); fast, but not optimal in general
- A* search:
- Order the fringe by \(f(n) = g(n) + h(n)\): cost of the path so far plus the heuristic
- Optimal only with admissible heuristics
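A minimal sketch of A* graph search with a priority-queue fringe; the `problem` object (with `start`, `is_goal(s)`, and `successors(s)` yielding `(next_state, step_cost)` pairs) is a hypothetical stand-in, not the project's actual API. With `h = lambda s: 0` it behaves like UCS.

```python
import heapq
import itertools

def a_star(problem, h):
    """Best-first graph search ordered by f(n) = g(n) + h(n).
    With h(s) = 0 for all s, this is uniform cost search (UCS)."""
    counter = itertools.count()              # tie-breaker so states are never compared
    start = problem.start
    fringe = [(h(start), next(counter), 0, start, [start])]
    best_g = {}                              # cheapest known cost to reach each state
    while fringe:
        _, _, g, state, path = heapq.heappop(fringe)
        if problem.is_goal(state):
            return path, g                   # optimal with an admissible (and consistent) h
        if state in best_g and best_g[state] <= g:
            continue                         # already reached this state more cheaply
        best_g[state] = g
        for nxt, step_cost in problem.successors(state):
            g2 = g + step_cost
            heapq.heappush(fringe, (g2 + h(nxt), next(counter), g2, nxt, path + [nxt]))
    return None, float("inf")                # no solution exists
```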
Lecture 3 - Jan 8
Adversarial search!
- There are deterministic vs nondeterministic games, as well as
perfect vs imperfect information games
- Terminal-utility games only give rewards at terminal states
- General games:
- Agents have independent utilities
- Agents can have a range of behaviors: cooperative, adversarial, etc.
- Zero-sum games:
- Agents have opposite utilities
- Agents are inherently adversaries
- Nash Equilibrium:
- No agent can benefit by changing their strategy, assuming all other players' strategies are fixed
- Not necessarily optimal for all players, can be local optima
- MiniMax:
- For two-player zero-sum (2PZS) games
- Exhaustive search of the game tree (implemented as a recursive, DFS-style traversal)
- One player maximizes score, other minimizes, taking turns
- Assumes optimal play, is Nash
- Implementation is recursive value function
- Alpha-beta pruning:
- MiniMax with optimization
- Keep track of alpha (max’s best option on path to root) and beta
(min’s best option on path to root)
- If we see options which will never be accepted, we can stop
searching portions of the tree
- “If min can already force a value worse (for max) than something max can get elsewhere, max will never let play reach this node, so we can stop searching it!”
- Can double the solvable search depth
- In practice, depth-limited search with evaluation functions is used
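A minimal sketch of depth-limited minimax with alpha-beta pruning for a 2PZS game; the `game` interface (`is_terminal`, `utility`, `evaluate`, `legal_moves`, `result`) is a hypothetical stand-in, not the project's actual API.

```python
def alphabeta(game, state, depth, alpha=float("-inf"), beta=float("inf"), maximizing=True):
    """Depth-limited minimax with alpha-beta pruning for a two-player zero-sum game.
    alpha: best value max can already guarantee on the path to the root.
    beta:  best value min can already guarantee on the path to the root."""
    if game.is_terminal(state):
        return game.utility(state)
    if depth == 0:
        return game.evaluate(state)              # evaluation function at the depth limit
    if maximizing:
        value = float("-inf")
        for move in game.legal_moves(state):
            value = max(value, alphabeta(game, game.result(state, move),
                                         depth - 1, alpha, beta, False))
            alpha = max(alpha, value)
            if alpha >= beta:
                break                            # min would never let play reach here: prune
        return value
    else:
        value = float("inf")
        for move in game.legal_moves(state):
            value = min(value, alphabeta(game, game.result(state, move),
                                         depth - 1, alpha, beta, True))
            beta = min(beta, value)
            if beta <= alpha:
                break                            # max already has a better option: prune
        return value
```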
Lecture 4 - Jan 10
- Expectimax:
- Works with some randomness
- Values in tree should represent the average case under optimal play,
not worst case
- Cannot use alpha-beta pruning except for domain-specific
heuristics
- Magnitudes of the evaluation function matter: unlike minimax, expectimax is not invariant to monotonic transformations of the values (see the sketch after this list)
- Multi-agent games:
- Terminal states have tuple rewards
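A minimal expectimax sketch matching the bullets above: chance (opponent) nodes take an expectation instead of a min, assumed uniform here; the `game` interface is the same hypothetical one as in the minimax sketch.

```python
def expectimax(game, state, depth, maximizing=True):
    """Like minimax, but chance/opponent nodes average over outcomes instead of
    minimizing, so values represent the average case rather than the worst case."""
    if game.is_terminal(state):
        return game.utility(state)
    if depth == 0:
        return game.evaluate(state)
    moves = list(game.legal_moves(state))
    if maximizing:
        return max(expectimax(game, game.result(state, m), depth - 1, False) for m in moves)
    # Chance node: expectation under a (here: uniform) distribution over moves.
    # No alpha-beta pruning here; magnitudes of evaluate() matter, not just ordering.
    return sum(expectimax(game, game.result(state, m), depth - 1, True) for m in moves) / len(moves)
```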
Lecture 5 - Jan 12
MDPs - most of this lecture is covered during lecture 6
Lecture 6 - Jan 17
Lecture 5 content
- MDPs are defined by:
- Set of states
- Set of actions
- Start state (or a distribution over start states)
- Possibly a set of terminal states
- Reward functions
- Transition functions
- Agents attempt to maximize total future rewards
- Reward time scales:
- Finite (typically undiscounted)
- Infinite (typically discounted)
- Indefinite
- Markov property: the future is independent of the past given the
present
- How to solve MDPs:
- Find optimal policy \(\pi^*\)
- Bellman equation: recursive optimality condition, \(V^*(s) = \max_a \sum_{s'} T(s, a, s')\,[R(s, a, s') + \gamma V^*(s')]\)
- How to use Bellman to solve?
- Value iteration: repeatedly back up every state's value with the Bellman update until the values converge; the greedy (argmax) action at each state then gives the optimal policy
- Policy iteration: evaluate the current policy's values, then improve the policy by acting greedily with respect to those values; repeat until the policy stops changing
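A minimal value iteration sketch for a tabular MDP, using hypothetical names for the pieces listed above: `states`, `actions(s)` returning a list, `T(s, a)` returning `(next_state, prob)` pairs, and `R(s, a, s2)`.

```python
def value_iteration(states, actions, T, R, gamma=0.9, eps=1e-6):
    """Repeatedly apply the Bellman optimality backup until values converge,
    then read off the greedy (argmax) policy."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            if not actions(s):                       # terminal state: value stays 0
                continue
            best = max(sum(p * (R(s, a, s2) + gamma * V[s2]) for s2, p in T(s, a))
                       for a in actions(s))
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < eps:                              # values have (approximately) converged
            break
    # Policy extraction: one-step lookahead, pick the action with the best expected value.
    policy = {s: max(actions(s),
                     key=lambda a: sum(p * (R(s, a, s2) + gamma * V[s2]) for s2, p in T(s, a)))
              for s in states if actions(s)}
    return V, policy
```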
Lecture 6 content
- Finite-horizon MDPs make values and policies time-dependent
- “Stationary”: time-independent
- Proper Policies:
- Goal states are absorbing: once reached, they cannot be left
- No reward is collected for remaining in a goal state
- Every history generated by the policy must end in a goal state (with probability 1 over infinite time)
- Indefinite Horizon MDPs (SSP):
- Has cost instead of reward
- Take min instead of max (can just use negative)
- Finite-horizon and infinite-horizon MDPs are a subset of SSP MDPs
- Upper bound on infinite horizon discounted rewards: \(\frac{R_\text{max}}{1 - \gamma}\)
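- Quick check on that bound: if every step earns at most \(R_\text{max}\), the discounted return is at most the geometric series \(\sum_{t=0}^{\infty} \gamma^t R_\text{max} = \frac{R_\text{max}}{1 - \gamma}\) for \(0 \leq \gamma < 1\)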
Lecture 7 - Jan 24
- MDPs satisfy the Markov property: future states depend only on the current state (and action), not on the rest of the history
- Not much new
Lecture 8 - Jan 29
- Talking about RLHF and PPO
- Online learning: the agent has to actually take actions in order to learn from them
- RL uses an underlying MDP
- Model-based RL:
- Learns the underlying MDP
- Model-free learning:
- Passive RL: doesn’t change policy while learning
- Active RL: update policy while learning
- Direct Eval:
- Estimate each state's value as the total of the returns observed after visiting it, divided by the number of visits
- I.e., the mean return over trajectories passing through the state
- TD learning:
- Keep a running (exponentially weighted) average of sampled reward-plus-next-state-value targets
- Learn from every experience
- Must be used with some model-based approach to get the optimal
policy
- Steps:
- For all \(s\), init \(V^\pi (s) = 0\)
- Let \(\text{sample} = R(s, \pi(s), s') + \gamma V^\pi(s')\)
- \(V^\pi(s) \leftarrow (1 - \alpha) V^\pi(s) + \alpha \cdot \text{sample}\)
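A minimal sketch of the TD(0) steps above, assuming experience is provided as episodes of `(s, r, s')` transitions gathered while following \(\pi\).

```python
from collections import defaultdict

def td_policy_evaluation(episodes, gamma=0.9, alpha=0.1):
    """TD(0): estimate V^pi by nudging V(s) toward sample = r + gamma * V(s')
    after every observed transition, i.e. a running average of sampled targets."""
    V = defaultdict(float)                   # V^pi(s) initialized to 0 for all s
    for episode in episodes:                 # each episode: a list of (s, r, s') transitions
        for s, r, s_next in episode:
            sample = r + gamma * V[s_next]   # one-step bootstrapped target
            V[s] = (1 - alpha) * V[s] + alpha * sample
    return V
```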
Lecture 9 - Jan 31
- Q-learning:
- Update \(Q\)-values based on
sampled rewards (satisfy the Bellman equation)
- Similar to TD learning, but on \(Q\)-values (see the sketch after this list)
- Off-policy: learns even when taking actions different from the current policy
- “True” Q-learning requires keeping a table of all state-action pairs
- Memory inefficient and doesn’t generalize
- Could use feature representations
- Exploration vs Exploitation:
- When to try a new thing vs do the best known action?
- \(\epsilon\)-greedy:
- With probability \(\epsilon\), act randomly; otherwise take the best-known action
- Good to lower \(\epsilon\) over
time
- Exploration functions:
- Add a bonus such as \(\frac{k}{N(s, a)}\) to favor rarely-tried state-action pairs
- Regret: total cost of mistakes while learning
- Difference between optimal rewards and earned rewards
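A minimal tabular Q-learning sketch with \(\epsilon\)-greedy exploration (the sketch referenced above); the `env.reset()` / `env.step(a)` interface returning `(next_state, reward, done)` is a hypothetical stand-in, not a specific library's API.

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=1000, gamma=0.9, alpha=0.1, epsilon=0.1):
    """Tabular Q-learning with epsilon-greedy exploration.
    Off-policy: the target uses max_a' Q(s', a') regardless of which action was taken."""
    Q = defaultdict(float)                                     # Q[(s, a)], default 0
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # Epsilon-greedy: explore with probability epsilon, otherwise exploit.
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])
            s_next, r, done = env.step(a)
            target = r if done else r + gamma * max(Q[(s_next, act)] for act in actions)
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
            s = s_next
    return Q
```

Decaying `epsilon` across episodes, as the notes suggest, gradually shifts the agent from exploration toward exploitation; an exploration-function bonus like \(\frac{k}{N(s, a)}\) would instead be added to the Q-values used for action selection.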
Lecture 10 - Feb 2
- Approximate Q-learning:
- Instead of a table, approximate \(Q(s, a)\) with weighted features or a neural network
- The resulting update resembles online least squares (see the sketch below)
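A minimal sketch of one feature-based approximate Q-learning update, where \(Q(s, a) = \sum_i w_i f_i(s, a)\); `features(s, a)` returning a dict of feature values is an assumed interface. Deep learning replaces this linear model with a neural network trained on the same TD error.

```python
def approx_q_update(w, features, s, a, r, s_next, actions, gamma=0.9, alpha=0.01):
    """One approximate Q-learning step with a linear model Q(s, a) = sum_i w_i * f_i(s, a).
    Each weight moves in proportion to its feature value and the TD error,
    which is what makes this resemble one step of online least squares."""
    def q(state, action):
        return sum(w.get(k, 0.0) * v for k, v in features(state, action).items())

    target = r + gamma * max(q(s_next, a2) for a2 in actions)
    error = target - q(s, a)                      # TD error ("difference")
    for k, v in features(s, a).items():
        w[k] = w.get(k, 0.0) + alpha * error * v  # gradient-style weight update
    return w
```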
Went over HW problem
Lecture 11 - Feb 5
- Bayes Networks:
- DAG model with probabilistic relationships between states
- Encode joint distributions compactly as a product of local conditional distributions (see the sketch below)
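A tiny illustration of the joint-encoding idea with a hypothetical three-node network (Rain → Sprinkler, and both → WetGrass); all probability numbers are made-up placeholders, not lecture data.

```python
# Toy network: Rain -> Sprinkler, and (Rain, Sprinkler) -> WetGrass.
# The joint factors as P(R, S, W) = P(R) * P(S | R) * P(W | R, S),
# which is much smaller than storing the full joint table directly.
# All numbers below are illustrative placeholders, not lecture data.
P_R = {True: 0.2, False: 0.8}
P_S_given_R = {True: {True: 0.01, False: 0.99},   # sprinkler rarely on when raining
               False: {True: 0.4, False: 0.6}}
P_W_given_RS = {(True, True): 0.99, (True, False): 0.8,
                (False, True): 0.9, (False, False): 0.0}

def joint(r, s, w):
    """P(R=r, S=s, W=w) as a product of the local conditional tables."""
    p_w = P_W_given_RS[(r, s)]
    return P_R[r] * P_S_given_R[r][s] * (p_w if w else 1 - p_w)

# Example query: P(WetGrass=True) by summing out Rain and Sprinkler.
p_wet = sum(joint(r, s, True) for r in (True, False) for s in (True, False))
```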
Lecture 12(?) - Feb 12
- ML: subset of AI, very data-dependent
- Talked about MLOps