github posts notes

CSE 473 (AI)

Rob Minneker, Winter 2024

A search problem cosists of:
- State space
- State transition function
- State space graph
We make search trees from our state space graphs, just paths from the graph
Graphs with cycles become infinite search trees
“Complete” search algorithm: always finds a solution if it exists
“Optimal” search algorithm: always finds the shortest path between start and stop

Search algos:

Adversarial search!

There are deterministic vs nondeterministic games, as well as perfect vs imperfect information games
Terminal-utility games only give rewards at terminal states
General games:
- Agents have independent utilities
- Agents can have range of behaviors, cooperative, adversarial, etc
Zero-sum games:
- Agents have opposite utilites
- Agents are inherently adversaries
Nash Equilibrium:
- No agent can benefit by changing their strategy, assuming all other players strategies are fixed
- Not necessarily optimal for all players, can be local optima
MiniMax:
- For two-player zero-sum (2PZS) games
- Similar to BFS
- One player maximizes score, other minimizes, taking turns
- Assumes optimal play, is Nash
- Implementation is recursive value function
Alpha-beta pruning:
- MiniMax with optimization
- Keep track of alpha (max’s best option on path to root) and beta (min’s best option on path to root)
- If we see options which will never be accepted, we can stop searching portions of the tree
- “If we know min will return a value which is at least less than a value we know max can pick, max will never pick what min chooses here!”
- Can double the solvable search depth
In practice, depth-limited serach with evaluation functions is used

Expectimax:
- Works with some randomness
- Values in tree should represent the average case under optimal play, not worst case
- Cannot use alpha-beta pruning except for domain-specific heuristics
- Magnitudes must always scale, rewards don’t depend just on being monotonic
Multi-agent games:
- Terminal states have tuple rewards

MDPs - most of this lecture is covered during lecture 6

Lecture 5 content

MDPs are defined by:
- Set of states
- Set of actions
- Start state (dist)
- Possible terminal state
- Reward functions
- Transition functions
Agents attempt to maximize total future rewards
Reward time scales:
- Finite (typically undiscounted)
- Infinite (typically discounted)
- Indefinite
Markov property: the future is independent of the past given the present
How to solve MDPs:
- Find optimal policy \(\pi^*\)
Bellman equation: recursive equality equations
- I have these written in my spinningup notes
How to use Bellman to solve?
- Value iteration: repeatedly find the optimal value by taking actions which are argmax value
- Policy iteration: create policies based on max of value function, then find value of that policy, repeat

Lecture 6 content

Finite-horizion MDPs involve a time-dependence
“Stationary”: time-independent
Proper Policies:
- Enforce no-leaving goal state
- Enforce no reward for staying in a goal state
- All histories from a policy must end up in a goal state over infinite time
Indefinite Horizon MDPs (SSP):
- Has cost instead of reward
- Take min instead of max (can just use negative)
Finite-horizion and ininite-horizon MDPs are a subset of SSP MDPs
Upper bound on infinite horizon discounted rewards: \(\frac{R_\text{max}}{1 - \gamma}\)

Q-learning:
- Update \(Q\)-values based on sampled rewards (satisfy the Bellman equation)
- Similar to TD-learning, but on \(Q\)-values
- Off-policy (learns even when taking actions different than policy)
- “True” Q-learning requires keeping a table of all state-action pairs
  - Memory inefficient and doesn’t generalize
  - Could use feature representations
Exploration vs Exploitation:
- When to try a new thing vs do the best known action?
- \(\epsilon\)-greedy:
  - With probablity \(\epsilon\), act randomly, else take best action
  - Good to lower \(\epsilon\) over time
- Exploration functions:
  - Add a \(\frac{k}{N(s, a)}\) bonus for unknown states
Regret: total cost of mistakes while learning
- Difference between optimal rewards and earned rewards

Approximate Q-learning:
- Instead, best to use deep learning to approximate:
- Similar to online least squares

Went over HW problem

Bayes Networks:
- DAG model with probablistic relationships between states
- Encode joint representations