Markov Chain
Markov Decision Process
States: $S$, the set of states.
Actions: $A$, the set of actions.
Probability distribution of transitions: $P(s' \mid s, a)$, the probability of reaching state $s'$ after taking action $a$ in state $s$.
Reward: $R(s, a, s')$, the immediate reward received for that transition (future rewards are discounted by a factor $\gamma$).
Value function: Bellman Equation
RL learns a policy $\pi(a \mid s)$ that maps each state to a distribution over actions.
- State Value Function
$V^\pi(s) = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s\right]$, which satisfies the Bellman Equation
$V^\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\,[R(s, a, s') + \gamma V^\pi(s')]$
- Action Value Function
$Q^\pi(s, a) = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s, a_t = a\right]$
- Connection
In this section, we have two different value functions, one for states and one for actions. Consider $Q^\pi$ as a specialized version of $V^\pi$ with prescribed actions for all states; once the actions are fixed, the value of a sequence of states and actions is recovered with $V^\pi(s) = \sum_{a} \pi(a \mid s)\, Q^\pi(s, a)$.
- Difference
$Q^\pi$ is defined on state-action pairs, but $V^\pi$ is defined on states.
- MDP Best Policy
$\pi^*(s) = \arg\max_{a} Q^*(s, a)$, the policy that maximizes the value of every state.
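To make the connection concrete, here is a tiny numeric sketch (the policy and Q-values below are made up purely for illustration) showing that the state value is the policy-weighted average of the action values.

```python
# Hypothetical stochastic policy pi(a|s) and action values Q(s, a) at one state s.
pi = {"a1": 0.7, "a2": 0.3}
Q = {"a1": 5.0, "a2": 1.0}

# State value is the policy-weighted average of action values: V(s) = sum_a pi(a|s) Q(s, a).
V = sum(pi[a] * Q[a] for a in pi)
print(V)  # 0.7 * 5.0 + 0.3 * 1.0 = 3.8
```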
Basic Solutions
Dynamic Programming (?)
Policy Iteration
- Policy Evaluation
For a given policy $\pi$, the Policy Evaluation algorithm calculates the values of states $V^\pi(s)$.

ALGORITHM: Policy_Evaluation
Input: $\pi$, the mixed (stochastic) policy to be evaluated.
Initialize $V(s) \leftarrow 0$ for all $s \in S$
Repeat
  $\Delta \leftarrow 0$
  For each $s \in S$:
    $v \leftarrow V(s)$
    $V(s) \leftarrow \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\,[R(s, a, s') + \gamma V(s')]$
    $\Delta \leftarrow \max(\Delta, |v - V(s)|)$
Until $\Delta < \theta$ (a small positive threshold)
Output: $V \approx V^\pi$, the approximate values of states.
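As an illustration, here is a minimal Python sketch of iterative Policy Evaluation. The tabular MDP layout `P[s][a] = [(prob, next_state, reward), ...]`, the function name, and the toy two-state MDP are assumptions made for this example, not part of the original notes.

```python
# A minimal sketch of iterative Policy Evaluation for a tabular MDP, assuming
# the layout P[s][a] = [(prob, next_state, reward), ...]; the toy MDP below is
# made up for illustration.

def policy_evaluation(P, policy, gamma=0.9, theta=1e-6):
    """Return approximate V(s) for a fixed stochastic policy pi(a|s)."""
    V = {s: 0.0 for s in P}                      # initialize V(s) = 0 for all s
    while True:
        delta = 0.0
        for s in P:
            v_new = 0.0
            for a, pi_sa in policy[s].items():   # expectation over actions
                for prob, s_next, r in P[s][a]:  # expectation over transitions
                    v_new += pi_sa * prob * (r + gamma * V[s_next])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:                        # stop once changes are tiny
            return V

# Tiny made-up example: "s0" can loop for reward 1 or move to absorbing "s1".
P = {
    "s0": {"stay": [(1.0, "s0", 1.0)], "go": [(1.0, "s1", 0.0)]},
    "s1": {"stay": [(1.0, "s1", 0.0)], "go": [(1.0, "s1", 0.0)]},
}
policy = {"s0": {"stay": 0.5, "go": 0.5}, "s1": {"stay": 1.0, "go": 0.0}}
print(policy_evaluation(P, policy))              # V(s0) ~ 0.909, V(s1) = 0
```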
- Policy Improvement
For a given policy $\pi$ and values of states $V^\pi(s)$, the Policy Improvement algorithm can achieve a better policy $\pi'$ with $V$ left untouched.

ALGORITHM: Policy_Improvement
Input: $\pi$, $V$.
Repeat
  policy_stable $\leftarrow$ true
  For each $s \in S$:
    old_action $\leftarrow \pi(s)$
    $\pi(s) \leftarrow \arg\max_{a} \sum_{s'} P(s' \mid s, a)\,[R(s, a, s') + \gamma V(s')]$
    If old_action $\neq \pi(s)$ Then policy_stable $\leftarrow$ false
Until policy_stable
Output: $\pi$, the improved policy.
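A minimal Python sketch of the greedy improvement step, under the same assumed `P[s][a] = [(prob, next_state, reward), ...]` layout; it computes $Q(s, a)$ from the given $V$ and picks the best action in each state.

```python
# A minimal sketch of Policy Improvement under the same assumed MDP layout
# P[s][a] = [(prob, next_state, reward), ...]; names are illustrative.

def policy_improvement(P, V, gamma=0.9):
    """Return a deterministic policy that is greedy with respect to V."""
    policy = {}
    for s in P:
        # Compute Q(s, a) from the given state values, then act greedily.
        q = {a: sum(prob * (r + gamma * V[s_next])
                    for prob, s_next, r in P[s][a])
             for a in P[s]}
        policy[s] = max(q, key=q.get)
    return policy
```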
- Policy Iteration
Combining Policy Evaluation and Policy Improvement, we have the Policy Iteration algorithm. The process is as follows:

ALGORITHM: Policy_Iteration
Initialize $V(s)$ and $\pi(s)$ randomly for all $s \in S$
Repeat
  $\pi_{old} \leftarrow \pi$
  $V \leftarrow$ Policy_Evaluation($\pi$)
  $\pi \leftarrow$ Policy_Improvement($\pi$, $V$)
  policy_stable $\leftarrow$ true
  If $\pi \neq \pi_{old}$ Then policy_stable $\leftarrow$ false
Until policy_stable
Output: $\pi \approx \pi^*$, $V \approx V^*$
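Putting the two steps together, here is a minimal, self-contained Python sketch of Policy Iteration for a deterministic policy; the MDP layout and names are the same illustrative assumptions as above.

```python
# A minimal, self-contained sketch of Policy Iteration for a deterministic
# policy, under the same assumed layout P[s][a] = [(prob, next_state, reward), ...].

def policy_iteration(P, gamma=0.9, theta=1e-6):
    actions = {s: list(P[s]) for s in P}
    policy = {s: actions[s][0] for s in P}       # arbitrary initial policy
    V = {s: 0.0 for s in P}
    while True:
        # Policy Evaluation for the current deterministic policy.
        while True:
            delta = 0.0
            for s in P:
                v_new = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][policy[s]])
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                break
        # Policy Improvement: act greedily with respect to the new V.
        policy_stable = True
        for s in P:
            q = {a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                 for a in actions[s]}
            best = max(q, key=q.get)
            if best != policy[s]:
                policy[s] = best
                policy_stable = False
        if policy_stable:
            return policy, V
```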
Value Iteration
Compared with the Policy Iteration algorithm, the Value Iteration algorithm keeps the policy implicit in the stored state values (the improvement step is folded into a max over actions), so in each iteration we only need to sweep all states $s \in S$ once.
ALGORITHM: Value_Iteration
Initialize $V(s)$ randomly for all $s \in S$
Repeat
  $\Delta \leftarrow 0$
  For each $s \in S$:
    $v \leftarrow V(s)$
    $V(s) \leftarrow \max_{a} \sum_{s'} P(s' \mid s, a)\,[R(s, a, s') + \gamma V(s')]$
    $\Delta \leftarrow \max(\Delta, |v - V(s)|)$
Until $\Delta < \theta$ (a small positive threshold)
For each $s \in S$: $\pi(s) \leftarrow \arg\max_{a} \sum_{s'} P(s' \mid s, a)\,[R(s, a, s') + \gamma V(s')]$
Output: $\pi \approx \pi^*$
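A minimal Python sketch of Value Iteration under the same assumed MDP layout: one Bellman-optimality sweep per iteration over $V$ only, with the greedy policy extracted at the end.

```python
# A minimal sketch of Value Iteration under the same assumed MDP layout:
# only V is stored, each iteration sweeps all states once, and the greedy
# policy is extracted at the end.

def value_iteration(P, gamma=0.9, theta=1e-6):
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # Bellman optimality backup: max over actions of the expected return.
            v_new = max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                        for a in P[s])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    # Derive the greedy policy from the converged values.
    policy = {}
    for s in P:
        q = {a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]) for a in P[s]}
        policy[s] = max(q, key=q.get)
    return policy, V
```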
Pros and Cons
- pros
  - interpretable
  - based on mathematical derivation
- cons
  - requires complete environmental information
Monte Carlo
The MC method is a sampling-based (randomized) counterpart of the DP method. It is defined only on episodic tasks (tasks that end within a finite number of steps). There are first-visit MC methods (which average returns over the first occurrence of a state $s$ in each episode) and every-visit MC methods (which average over every occurrence of $s$). In this section, we discuss first-visit MC methods only.
Similar to the DP method, the MC method has MC versions of the Policy Evaluation, Policy Improvement, and Policy Iteration processes as well.
Monte Carlo Policy Evaluation
- Input: the policy $\pi$ to be evaluated
- Step 1: generate some state sequences by following $\pi$ (each sequence is an episode)
- Step 2: for each state $s$, calculate the average return over all episodes in which $s$ appears (counting the first visit only)
- Step 3: set these average returns as the values of the states
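A minimal Python sketch of first-visit MC policy evaluation. Episodes are assumed to be lists of `(state, reward)` pairs, where the reward is the one received after leaving that state; this layout is an illustrative assumption.

```python
# A minimal sketch of first-visit Monte Carlo policy evaluation. An episode is
# assumed to be a list of (state, reward) pairs, where the reward is the one
# received after leaving that state; these conventions are illustrative.

from collections import defaultdict

def mc_policy_evaluation(episodes, gamma=0.9):
    returns = defaultdict(list)                  # state -> sampled returns
    for episode in episodes:
        # Compute the return G_t following every step, working backwards.
        G, returns_after = 0.0, [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            G = episode[t][1] + gamma * G
            returns_after[t] = G
        # Record a return only at the first visit of each state in this episode.
        seen = set()
        for t, (s, _) in enumerate(episode):
            if s not in seen:
                seen.add(s)
                returns[s].append(returns_after[t])
    # The value estimate is the average return over all recorded first visits.
    return {s: sum(g) / len(g) for s, g in returns.items()}
```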
Monte Carlo Estimation of Action Values
To improve the policy, we need the values of actions (Q-values) first. We can follow steps similar to Monte Carlo Policy Evaluation: generate sequences, calculate the average returns, and set them as Q-values. After that, we can improve the policy greedily as follows: $\pi(s) \leftarrow \arg\max_{a} Q(s, a)$.
Maintaining Exploration
There is a problem with the MC method. If we already have initial Q-values with $Q(s, a_1) > Q(s, a_2)$, then $Q(s, a_2)$ will never be updated, because the greedy MC method will never choose that action. This is similar to a Multi-armed Bandit problem. Maintaining Exploration replaces deterministic policies with soft policies, for example an $\epsilon$-greedy policy: execute the best action with a probability of $1 - \epsilon$, otherwise execute one of the other (worse) actions at random. Decrease $\epsilon$ over time and the algorithm will converge.
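A minimal Python sketch of $\epsilon$-greedy action selection; `Q_s` (a dict mapping actions to Q-values for the current state) is an assumed representation.

```python
# A minimal sketch of epsilon-greedy action selection (Maintaining Exploration):
# exploit the best-known action with probability 1 - epsilon, otherwise explore.
# Q_s maps each action to its estimated Q(s, a) for the current state s.

import random

def epsilon_greedy(Q_s, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(list(Q_s))          # explore a random action
    return max(Q_s, key=Q_s.get)                 # exploit the best-known action
```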
Monte Carlo Control
The process of Monte Carlo Control mirrors Policy Iteration: alternate Monte Carlo estimation of the Q-values with ($\epsilon$-greedy) policy improvement. We can also keep the policy implicit in the stored action values, which gives a value-iteration style of Monte Carlo Control; at the end of this algorithm, we generate the policy based on the Q-values.
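A minimal Python sketch of on-policy Monte Carlo control with $\epsilon$-greedy exploration. The gym-like environment interface (`reset()` returning a state, `step(action)` returning `(next_state, reward, done)`) and all names are assumptions for illustration.

```python
# A minimal sketch of on-policy Monte Carlo control with epsilon-greedy
# exploration. The gym-like env interface (reset/step) is an assumption.

import random
from collections import defaultdict

def mc_control(env, actions, episodes=1000, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)                        # (state, action) -> estimated value
    counts = defaultdict(int)                     # visit counters for running averages
    for _ in range(episodes):
        # Generate one episode under the current epsilon-greedy policy.
        trajectory, s, done = [], env.reset(), False
        while not done:
            if random.random() < epsilon:
                a = random.choice(actions)                        # explore
            else:
                a = max(actions, key=lambda a: Q[(s, a)])         # exploit
            s_next, r, done = env.step(a)
            trajectory.append((s, a, r))
            s = s_next
        # Compute returns backwards, then update the first visit of each (s, a).
        G, returns_after = 0.0, []
        for s, a, r in reversed(trajectory):
            G = r + gamma * G
            returns_after.append(G)
        returns_after.reverse()
        updated = set()
        for (s, a, _), G_t in zip(trajectory, returns_after):
            if (s, a) not in updated:
                updated.add((s, a))
                counts[(s, a)] += 1
                Q[(s, a)] += (G_t - Q[(s, a)]) / counts[(s, a)]   # running average
    return Q                                      # act greedily on Q for the final policy
```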
Pros and Cons
- pros
  - based on sampled experience rather than a complete model of the environment
- cons
  - works on episodic tasks only
Temporal-Difference
TD Prediction
Consider the Bellman Equation for the state value:
$V^\pi(s) = \mathbb{E}_\pi\left[ r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s \right]$
When following policy $\pi$, each observed transition $(s_t, r_{t+1}, s_{t+1})$ provides a sampled target $r_{t+1} + \gamma V(s_{t+1})$ for $V(s_t)$.
Then we have the td_error
$\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$
To optimize the model, all we need is to move the value estimate a small step along the td_error:
$V(s_t) \leftarrow V(s_t) + \alpha\, \delta_t$
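A minimal Python sketch of TD(0) prediction under an assumed gym-like environment interface; `policy` is any callable mapping a state to an action, and `alpha` is the learning rate.

```python
# A minimal sketch of TD(0) prediction: after every transition, move V(s_t)
# toward the bootstrapped target r + gamma * V(s_{t+1}) by alpha * td_error.
# The gym-like env interface (reset/step) is an assumption for illustration.

from collections import defaultdict

def td0_prediction(env, policy, episodes=1000, gamma=0.9, alpha=0.1):
    V = defaultdict(float)                       # state -> estimated value
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = policy(s)                        # sample an action from the policy
            s_next, r, done = env.step(a)
            target = r + gamma * (0.0 if done else V[s_next])
            td_error = target - V[s]             # delta_t
            V[s] += alpha * td_error             # online update, no need to wait
            s = s_next
    return dict(V)
```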
N-step TD
Consider the definition of the state value:
$V^\pi(s) = \mathbb{E}_\pi\left[ r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots \mid s_t = s \right]$
Similarly to the 1-step TD algorithm, we can truncate this sum after any number of steps $n$, bootstrap with the current value estimate, and obtain a new td_error:
$G_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^{n} V(s_{t+n}), \qquad \delta_t^{(n)} = G_t^{(n)} - V(s_t)$
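A minimal Python sketch of the n-step target; `rewards[t + k]` is assumed to play the role of $r_{t+k+1}$, and `V` is assumed to contain every non-terminal state.

```python
# A minimal sketch of the n-step TD target: sum n discounted rewards, then
# bootstrap from V at the n-th following state. rewards[t + k] plays the role
# of r_{t+k+1}; V is assumed to contain every non-terminal state.

def n_step_target(rewards, states, t, n, V, gamma=0.9):
    """G_t^(n) = r_{t+1} + ... + gamma^(n-1) r_{t+n} + gamma^n V(s_{t+n})."""
    G = 0.0
    steps = min(n, len(rewards) - t)             # truncate at the episode's end
    for k in range(steps):
        G += (gamma ** k) * rewards[t + k]
    if t + n < len(states):                      # bootstrap only if s_{t+n} exists
        G += (gamma ** n) * V[states[t + n]]
    return G

# The n-step td_error is then G_t^(n) - V(s_t), used exactly like the 1-step case.
```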
Pros and Cons
- pros
  - more flexible than MC
  - available in both on-policy (SARSA) and off-policy (Q-Learning) variants (see the sketch after this list)
  - TD methods tend to perform well in practice, so most SOTA algorithms are based on TD
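As referenced above, a minimal Python sketch contrasting the two update rules; `Q` is assumed to be a `defaultdict(float)` (or otherwise pre-initialized) keyed by `(state, action)` pairs.

```python
# A minimal sketch contrasting the SARSA (on-policy) and Q-Learning (off-policy)
# update rules. Both move Q(s, a) by alpha * td_error; they differ only in the
# bootstrap term. Q is assumed to be a defaultdict(float) keyed by (state, action).

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    # On-policy: bootstrap from the action actually taken next.
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    # Off-policy: bootstrap from the greedy action, regardless of what is taken next.
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```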