CS234 reinforcement learning
Markov Assumption
future is independent of the past given the present
Markov Process/Chain
- sequence of random states with Markov property
- no actions, no rewards; only a transition model \(P(s_{t+1} \mid s_t)\)
Markov Reward Process
- Markov chain + rewards
- no actions; transition model \(P(s_{t+1} \mid s_t)\) and reward function \(R(s_t)\)
- has well-defined return and value functions: the return is the discounted sum of rewards from time step t to the horizon, and the value function is the expected return starting from state s (see the sketch below)
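As a quick illustration of the return, here is a minimal sketch; the reward sequence and discount factor are made-up values, not from the notes.

```python
# Discounted return G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
def discounted_return(rewards, gamma=0.9):
    """Return of a finite reward sequence starting at time t."""
    g = 0.0
    for r in reversed(rewards):      # work backwards: G_t = r_t + gamma * G_{t+1}
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 2.0]))  # 1 + 0.9*0 + 0.81*2 = 2.62
```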
Markov Decision Process \((S, A, P, R, \gamma)\)
- Markov reward process + actions
- Definition: transition model \(P(s_{t+1} \mid s_t, a_t)\) and reward function \(R(s_t, a_t)\)
Policy:
- what actions to take at each state
- can be deterministic or stochastic
- like a conditional distribution, \(\pi(a \mid s) = P(a_t = a \mid s_t = s)\) (see the toy example below)
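A toy illustration of the two kinds of policy, assuming integer-indexed states and actions (all numbers here are made up for illustration):

```python
import random

# Deterministic policy: a direct mapping state -> action, pi(s) = a.
deterministic_pi = {0: 1, 1: 0, 2: 1}

# Stochastic policy: a conditional distribution pi(a | s) over actions.
stochastic_pi = {0: [0.8, 0.2], 1: [0.5, 0.5], 2: [0.1, 0.9]}

def sample_action(s):
    """Sample a ~ pi(. | s) from the stochastic policy."""
    probs = stochastic_pi[s]
    return random.choices(range(len(probs)), weights=probs)[0]

print(deterministic_pi[0], sample_action(0))
```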
MDP Policy Evaluation, Iterative Algorithm
- initialize \(V_0^\pi(s) = 0\) for all states s
- for k = 1 until convergence, for each state s compute \(V_k^\pi(s) = r(s, \pi(s)) + \gamma \sum_{s'} p(s' \mid s, \pi(s)) V_{k-1}^\pi(s')\). This is the Bellman backup for a policy (see the sketch after this list).
- the optimal policy is \(\pi^*(s) = \arg\max_\pi V^\pi(s)\)
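A minimal sketch of this iterative policy evaluation, assuming a tabular MDP stored as numpy arrays `P[s, a, s']` (transition probabilities) and `R[s, a]` (rewards) and a deterministic policy array `pi[s]`; the array layout is an assumption for illustration, not the lecture's notation.

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, tol=1e-6):
    """Iterate the Bellman backup V_k(s) = R(s, pi(s)) + gamma * sum_s' P(s'|s, pi(s)) V_{k-1}(s')."""
    n_states = P.shape[0]
    V = np.zeros(n_states)                      # V = 0 for all states
    while True:
        V_new = np.array([
            R[s, pi[s]] + gamma * P[s, pi[s]] @ V
            for s in range(n_states)
        ])
        if np.max(np.abs(V_new - V)) < tol:     # stop once the backup has converged
            return V_new
        V = V_new
```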
Policy iteration is an efficient way to compute the optimal policy.
- define the Q function to measure state-action value and to measure the improvement of a policy: \(Q^\pi(s, a) = R(s, a) + \gamma \sum_{s'} p(s' \mid s, a) V^\pi(s')\)
- for all s and a, compute the Q function; the new policy is then \(\pi_{i+1}(s) = \arg\max_a Q^{\pi_i}(s, a)\), i.e. pick the action that maximizes Q (see the sketch after this list)
- monotonic improvement in policy: the new policy is always at least as good as the old one (there is a proof at this link)
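A sketch of policy iteration under the same tabular assumptions (`P[s, a, s']`, `R[s, a]` as numpy arrays), reusing `policy_evaluation` from the sketch above:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Alternate policy evaluation and greedy improvement until the policy stops changing."""
    n_states, n_actions = R.shape
    pi = np.zeros(n_states, dtype=int)          # arbitrary initial policy
    while True:
        V = policy_evaluation(P, R, pi, gamma)  # evaluate the current policy
        # Q^pi(s, a) = R(s, a) + gamma * sum_s' P(s'|s, a) V^pi(s')
        Q = R + gamma * P @ V                   # shape (n_states, n_actions)
        pi_new = Q.argmax(axis=1)               # pi_{i+1}(s) = argmax_a Q^{pi_i}(s, a)
        if np.array_equal(pi_new, pi):          # no change => policy has converged
            return pi, V
        pi = pi_new
```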
Value iteration is another way to compute the optimal policy.
- maintain an estimate of the optimal value of starting in each state
- uses the Bellman equation and Bellman backup operators (see the sketch below)
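A corresponding value iteration sketch under the same tabular assumptions, repeatedly applying the Bellman optimality backup until the values stop changing:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    """Bellman optimality backup: V(s) <- max_a [ R(s, a) + gamma * sum_s' P(s'|s, a) V(s') ]."""
    V = np.zeros(P.shape[0])
    while True:
        Q = R + gamma * P @ V                   # state-action values under the current V
        V_new = Q.max(axis=1)                   # greedy backup over actions
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return V_new, Q.argmax(axis=1)              # optimal values and a greedy policy
```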
RL Algorithm Components
Model: representation of how the world changes in response to the agent's actions.
- Transition model: predicts the next agent state, \((s_t, a_t) \Rightarrow s_{t+1}\)
- Reward model: predicts the immediate reward, \(r(s_t, a_t)\)
Policy: function mapping the agent's states to actions, \(\pi: S \to A\)
- Deterministic policy: \(\pi(s) = a\)
- Stochastic policy: \(\pi(a \mid s) = \Pr(a_t = a \mid s_t = s)\)
Value Function: expected discounted future reward from being in a state and taking actions when following a policy
- \(V^\pi(s_t) = \mathbb{E}_\pi[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots \mid s_t]\); quantifies how good or bad states and actions are
RL agent:
Model-based/Model-free: the difference is whether the agent learns and uses a model of the environment.
Model-Free
Monte Carlo Policy Evaluation: policy evaluation when we do not have the dynamics and reward models.
- no bootstrapping
- does not assume state is Markov
- Can only be applied to episodic MDPs (see the sketch below).
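A sketch of first-visit Monte Carlo policy evaluation, assuming episodes are available as lists of `(state, reward)` pairs collected by following the policy; the `sample_episode` generator is hypothetical, not something from the notes.

```python
from collections import defaultdict

def mc_policy_evaluation(sample_episode, n_episodes=1000, gamma=0.9):
    """Estimate V^pi(s) as the average first-visit return; no bootstrapping."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for _ in range(n_episodes):
        episode = sample_episode()          # [(s_0, r_0), (s_1, r_1), ...], one full episode
        # compute the return G_t for every time step, working backwards
        G = [0.0] * len(episode)
        g = 0.0
        for t in reversed(range(len(episode))):
            g = episode[t][1] + gamma * g
            G[t] = g
        # credit only the first visit to each state
        first_visit = {}
        for t, (s, _) in enumerate(episode):
            first_visit.setdefault(s, t)
        for s, t in first_visit.items():
            returns_sum[s] += G[t]
            returns_count[s] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}
```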