- Grade is 50% PacMan projects, 50% written projects
- AI is the superset of ML, NNs, GenAI
- Went over all the fields of AI

- A search problem cosists of:
- State space
- State transition function
- State space graph

- We make search trees from our state space graphs, just paths from the graph
- Graphs with cycles become infinite search trees
- “Complete” search algorithm: always finds a solution if it exists
- “Optimal” search algorithm: always finds the shortest path between start and stop

**Search algos:**

- Uninformed
- DFS, BFS
- Same search method, different “fringe” (stack vs queue) goal

- Uniform cost search (UCS):
- Uses a priority queue
- Complete and optimal
- Similar to Dijkstra’s

- DFS, BFS
- Heuristics:
- A
*heuristic*estimates how close a state is to a goal - Admissibility:
- Admissible (optimistic): \(h(s) \geq 0\) and \(h(s) \leq \text{true_cost}(s)\)
- Inadmissible (pessimistic) heuristics

- Greedy search"
- Use local heurstics

- A* search:
- Order by sum of cost of path to get there plus heuristic
- Optimal only with admissible heuristics

- A

**Adversarial search!**

- There are deterministic vs nondeterministic games, as well as perfect vs imperfect information games
- Terminal-utility games only give rewards at terminal states
- General games:
- Agents have independent utilities
- Agents can have range of behaviors, cooperative, adversarial, etc

- Zero-sum games:
- Agents have opposite utilites
- Agents are inherently adversaries

- Nash Equilibrium:
- No agent can benefit by changing their strategy, assuming all other players strategies are fixed
- Not necessarily optimal for all players, can be local optima

- MiniMax:
- For two-player zero-sum (2PZS) games
- Similar to BFS
- One player maximizes score, other minimizes, taking turns
- Assumes optimal play, is Nash
- Implementation is recursive value function

- Alpha-beta pruning:
- MiniMax with optimization
- Keep track of alpha (max’s best option on path to root) and beta (min’s best option on path to root)
- If we see options which will never be accepted, we can stop searching portions of the tree
- “If we know min will return a value which is at least less than a value we know max can pick, max will never pick what min chooses here!”
- Can double the solvable search depth

- In practice, depth-limited serach with evaluation functions is used

- Expectimax:
- Works with some randomness
- Values in tree should represent the average case under optimal play, not worst case
- Cannot use alpha-beta pruning except for domain-specific heuristics
- Magnitudes must always scale, rewards don’t depend just on being monotonic

- Multi-agent games:
- Terminal states have tuple rewards

*MDPs - most of this lecture is covered during lecture 6*

*Lecture 5 content*

- MDPs are defined by:
- Set of states
- Set of actions
- Start state (dist)
- Possible terminal state
- Reward functions
- Transition functions

- Agents attempt to maximize total future rewards
- Reward time scales:
- Finite (typically undiscounted)
- Infinite (typically discounted)
- Indefinite

- Markov property: the future is independent of the past given the present
- How to solve MDPs:
- Find optimal policy \(\pi^*\)

- Bellman equation: recursive equality equations
- I have these written in my spinningup notes

- How to use Bellman to solve?
- Value iteration: repeatedly find the optimal value by taking actions which are argmax value
- Policy iteration: create policies based on max of value function, then find value of that policy, repeat

*Lecture 6 content*

- Finite-horizion MDPs involve a time-dependence
- “Stationary”: time-independent
- Proper Policies:
- Enforce no-leaving goal state
- Enforce no reward for staying in a goal state
- All histories from a policy must end up in a goal state over infinite time

- Indefinite Horizon MDPs (SSP):
- Has cost instead of reward
- Take min instead of max (can just use negative)

- Finite-horizion and ininite-horizon MDPs are a subset of SSP MDPs
- Upper bound on infinite horizon discounted rewards: \(\frac{R_\text{max}}{1 - \gamma}\)

- MDPs satisfy the Markov property, future states only care about current state
- Not much new

- Talking about RLHF and PPO
- Online learning - has to take the actions to learn from it
- RL uses an underlying MDP
- Model-based RL:
- Learns the underlying MDP

- Model-free learning:
- Passive RL: doesn’t change policy while learning
- Active RL: update policy while learning
- Direct Eval:
- Values equal the sum of total rewards divided by the number of times the state was visited
- Similar to the mean of rewards over trajectories through the state

- TD learning:
- Rolling average of rewards + values
- Learn from every experience
- Must be used with some model-based approach to get the optimal policy
- Steps:
- For all \(s\), init \(V^\pi (s) = 0\)
- Let \(\text{sample} = R(s, \pi (s), s') + \gamma V^\pi(s')\)
- \(V^\pi(s) \leftarrow (1 - \alpha) V^\pi(s) + \alpha \cdot \text{sample}\)

- Q-learning:
- Update \(Q\)-values based on sampled rewards (satisfy the Bellman equation)
- Similar to TD-learning, but on \(Q\)-values
- Off-policy (learns even when taking actions different than policy)
- “True” Q-learning requires keeping a table of all state-action pairs
- Memory inefficient and doesn’t generalize
- Could use feature representations

- Exploration vs Exploitation:
- When to try a new thing vs do the best known action?
- \(\epsilon\)-greedy:
- With probablity \(\epsilon\), act randomly, else take best action
- Good to lower \(\epsilon\) over time

- Exploration functions:
- Add a \(\frac{k}{N(s, a)}\) bonus for unknown states

- Regret: total cost of mistakes while learning
- Difference between optimal rewards and earned rewards

- Approximate Q-learning:
- Instead, best to use deep learning to approximate:
- Similar to online least squares

*Went over HW problem*

- Bayes Networks:
- DAG model with probablistic relationships between states
- Encode joint representations