
A Brief Introduction to Reinforcement Learning I (Background Info)

Markov Chain

A Markov chain is a stochastic process with the Markov property: the next state depends only on the current state, $P(s_{t+1} \mid s_t, s_{t-1}, \ldots, s_0) = P(s_{t+1} \mid s_t)$.

Markov Decision Process

States: the set $S$ of states $s$
Actions: the set $A$ of actions $a$
Probability distribution of transitions: $P(s' \mid s, a)$, the probability of reaching state $s'$ after taking action $a$ in state $s$
Reward: $R(s, a)$, the expected immediate reward for taking action $a$ in state $s$
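
To make these ingredients concrete, here is a minimal sketch in Python of one way a tiny MDP could be written down; the two states, two actions, probabilities, and rewards are made up purely for illustration (they are re-declared in the later sketches so that each snippet runs on its own).

    # Toy MDP: two states, two actions (illustrative values only).
    # P[s][a] is a list of (next_state, probability); R[s][a] is the expected reward.
    states = ["s0", "s1"]
    actions = ["stay", "move"]

    P = {
        "s0": {"stay": [("s0", 1.0)], "move": [("s1", 0.9), ("s0", 0.1)]},
        "s1": {"stay": [("s1", 1.0)], "move": [("s0", 0.9), ("s1", 0.1)]},
    }
    R = {
        "s0": {"stay": 0.0, "move": 1.0},
        "s1": {"stay": 0.5, "move": 0.0},
    }
    gamma = 0.9  # discount factor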

Value function: Bellman Equation

RL learns a policy $\pi(a \mid s)$, which maps states to (distributions over) actions. The reward function reflects only the immediate reward; to capture long-term reward, we introduce the value function.

  • State Value Function
    $V^{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s\right]$, the expected discounted return starting from state $s$ and following policy $\pi$.
  • Action Value Function
    $Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s, a_t = a\right]$, the expected discounted return after taking action $a$ in state $s$ and then following $\pi$.
  • Connection
    In this section we have two different value functions, one for states and one for actions. Consider $Q^{\pi}(s, a)$ as a specialized version of $V^{\pi}(s)$ with a prescribed first action: $V^{\pi}(s) = \sum_a \pi(a \mid s)\, Q^{\pi}(s, a)$ and $Q^{\pi}(s, a) = \sum_{s'} P(s' \mid s, a)\,[R(s, a) + \gamma V^{\pi}(s')]$ (the Bellman equations). With these we can compute the expected return of a sequence of states and actions from either function.
  • Difference
    $Q^{\pi}$ is defined on state-action pairs, while $V^{\pi}$ is defined on states.
  • MDP Best Policy
    Solving an MDP means finding an optimal policy $\pi^*$ such that $V^{\pi^*}(s) \ge V^{\pi}(s)$ for every state $s$ and every policy $\pi$.

    Basic Solutions

    Dynamic Programming

    Policy Iteration
  • Policy Evaluation
    For a given policy $\pi$, the Policy Evaluation algorithm calculates the state values $V^{\pi}(s)$.

    ALGORITHM: Policy_Evaluation
    Input: $\pi$, the (possibly stochastic) policy to be evaluated.
    Initialize $V(s) \leftarrow 0$ for all $s \in S$
    Repeat
        $\Delta \leftarrow 0$
        For each $s \in S$
            $v \leftarrow V(s)$
            $V(s) \leftarrow \sum_a \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\,[R(s, a) + \gamma V(s')]$
            $\Delta \leftarrow \max(\Delta, |v - V(s)|)$
    Until $\Delta < \theta$ (a small positive threshold)
    Output: $V \approx V^{\pi}$, the approximate values of states.
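
    A minimal Python sketch of the algorithm above, run on a made-up two-state MDP (all names and numbers are purely illustrative, not part of the original notes):

    # Iterative policy evaluation on a tiny, made-up 2-state MDP.
    # P[s][a] is a list of (next_state, probability); R[s][a] is the expected reward.
    states = ["s0", "s1"]
    actions = ["stay", "move"]
    P = {
        "s0": {"stay": [("s0", 1.0)], "move": [("s1", 0.9), ("s0", 0.1)]},
        "s1": {"stay": [("s1", 1.0)], "move": [("s0", 0.9), ("s1", 0.1)]},
    }
    R = {
        "s0": {"stay": 0.0, "move": 1.0},
        "s1": {"stay": 0.5, "move": 0.0},
    }
    gamma, theta = 0.9, 1e-8

    def policy_evaluation(pi):
        """pi[s][a] is the probability of taking action a in state s."""
        V = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s in states:
                # Bellman expectation backup for the current policy.
                v_new = sum(
                    pi[s][a] * sum(p * (R[s][a] + gamma * V[s2]) for s2, p in P[s][a])
                    for a in actions
                )
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                return V

    uniform = {s: {a: 0.5 for a in actions} for s in states}
    print(policy_evaluation(uniform))  # approximate V for the uniform random policy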

  • Policy Improvement
    For a given policy $\pi$ and state values $V^{\pi}(s)$, the Policy Improvement algorithm produces a better policy $\pi'$ while leaving $V$ untouched.

    ALGORITHM: Policy_Improvement
    Input: $\pi$, $V$.
    Repeat
        policy_stable $\leftarrow$ true
        For each $s \in S$
            $a_{\text{old}} \leftarrow \pi(s)$
            $\pi(s) \leftarrow \arg\max_a \sum_{s'} P(s' \mid s, a)\,[R(s, a) + \gamma V(s')]$
            If $a_{\text{old}} \neq \pi(s)$ Then policy_stable $\leftarrow$ false
    Until policy_stable
    Output: $\pi$, the improved policy.
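
    A matching Python sketch of greedy improvement with respect to a fixed $V$, again on the made-up two-state MDP (the values in V below are arbitrary placeholders standing in for a previous evaluation step):

    # Greedy policy improvement with respect to fixed state values V
    # (same illustrative 2-state MDP, repeated so the snippet runs on its own).
    states = ["s0", "s1"]
    actions = ["stay", "move"]
    P = {
        "s0": {"stay": [("s0", 1.0)], "move": [("s1", 0.9), ("s0", 0.1)]},
        "s1": {"stay": [("s1", 1.0)], "move": [("s0", 0.9), ("s1", 0.1)]},
    }
    R = {"s0": {"stay": 0.0, "move": 1.0}, "s1": {"stay": 0.5, "move": 0.0}}
    gamma = 0.9

    def policy_improvement(pi, V):
        """Return (new_policy, policy_stable) for a deterministic policy pi[s] = a."""
        stable = True
        new_pi = dict(pi)
        for s in states:
            # Pick the action with the largest one-step lookahead value.
            best = max(
                actions,
                key=lambda a: sum(p * (R[s][a] + gamma * V[s2]) for s2, p in P[s][a]),
            )
            if best != pi[s]:
                stable = False
            new_pi[s] = best
        return new_pi, stable

    V = {"s0": 5.0, "s1": 4.0}         # placeholder values from an evaluation step
    pi = {"s0": "stay", "s1": "stay"}  # current deterministic policy
    print(policy_improvement(pi, V))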

  • Policy Iteration
    Combining Policy Evaluation and Policy Improvement, we obtain the Policy Iteration algorithm. The process is as follows:

    ALGORITHM: Policy_Iteration
    Initialize $V(s)$ and $\pi(s)$ randomly for all $s \in S$
    Repeat
        $V \leftarrow$ Policy_Evaluation($\pi$)
        $\pi' \leftarrow$ Policy_Improvement($\pi$, $V$)
        policy_stable $\leftarrow$ false
        If $\pi' = \pi$ Then policy_stable $\leftarrow$ true
        $\pi \leftarrow \pi'$
    Until policy_stable
    Output: $\pi \approx \pi^*$, the (near-)optimal policy.
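
    Putting the two steps together, here is a compact, self-contained Python sketch of policy iteration on the same made-up MDP (evaluation is done for a deterministic policy, which is what the improvement step produces):

    # Policy iteration: alternate evaluation and greedy improvement until the
    # policy stops changing (illustrative 2-state MDP, repeated for completeness).
    states = ["s0", "s1"]
    actions = ["stay", "move"]
    P = {
        "s0": {"stay": [("s0", 1.0)], "move": [("s1", 0.9), ("s0", 0.1)]},
        "s1": {"stay": [("s1", 1.0)], "move": [("s0", 0.9), ("s1", 0.1)]},
    }
    R = {"s0": {"stay": 0.0, "move": 1.0}, "s1": {"stay": 0.5, "move": 0.0}}
    gamma, theta = 0.9, 1e-8

    def q(s, a, V):  # one-step lookahead value of taking a in s
        return sum(p * (R[s][a] + gamma * V[s2]) for s2, p in P[s][a])

    def policy_iteration():
        pi = {s: actions[0] for s in states}          # arbitrary initial policy
        while True:
            # Policy evaluation for the current deterministic policy.
            V = {s: 0.0 for s in states}
            while True:
                delta = 0.0
                for s in states:
                    v_new = q(s, pi[s], V)
                    delta = max(delta, abs(v_new - V[s]))
                    V[s] = v_new
                if delta < theta:
                    break
            # Greedy policy improvement.
            new_pi = {s: max(actions, key=lambda a: q(s, a, V)) for s in states}
            if new_pi == pi:                          # policy_stable
                return pi, V
            pi = new_pi

    print(policy_iteration())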

    Value Iteration

    Compared with the Policy Iteration algorithm, the Value Iteration algorithm keeps the policy implicit and stores only the state values, so in each iteration we only need to sweep over all $s \in S$ once.
    ALGORITHM: Value_Iteration
    Initialize $V(s)$ randomly for all $s \in S$
    Repeat
        $\Delta \leftarrow 0$
        For each $s \in S$
            $v \leftarrow V(s)$
            $V(s) \leftarrow \max_a \sum_{s'} P(s' \mid s, a)\,[R(s, a) + \gamma V(s')]$
            $\Delta \leftarrow \max(\Delta, |v - V(s)|)$
    Until $\Delta < \theta$ (a small positive threshold)
    For each $s \in S$
        $\pi(s) \leftarrow \arg\max_a \sum_{s'} P(s' \mid s, a)\,[R(s, a) + \gamma V(s')]$
    Output: $\pi \approx \pi^*$, the (near-)optimal policy.
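
    A Python sketch of value iteration on the same made-up MDP; note that the policy appears only at the very end, extracted greedily from the converged values:

    # Value iteration: a single sweep per iteration, policy extracted at the end
    # (illustrative 2-state MDP, repeated so the snippet runs on its own).
    states = ["s0", "s1"]
    actions = ["stay", "move"]
    P = {
        "s0": {"stay": [("s0", 1.0)], "move": [("s1", 0.9), ("s0", 0.1)]},
        "s1": {"stay": [("s1", 1.0)], "move": [("s0", 0.9), ("s1", 0.1)]},
    }
    R = {"s0": {"stay": 0.0, "move": 1.0}, "s1": {"stay": 0.5, "move": 0.0}}
    gamma, theta = 0.9, 1e-8

    def q(s, a, V):
        return sum(p * (R[s][a] + gamma * V[s2]) for s2, p in P[s][a])

    def value_iteration():
        V = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s in states:
                v_new = max(q(s, a, V) for a in actions)  # Bellman optimality backup
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                break
        pi = {s: max(actions, key=lambda a: q(s, a, V)) for s in states}
        return pi, V

    print(value_iteration())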

    Pros and Cons
  • pros
    • interpretable
    • grounded in explicit mathematical derivation
  • cons
    • requires complete knowledge of the environment (transition probabilities and rewards)

Monte Carlo

The MC method is a sample-based, randomized counterpart of the DP method. It is defined only on episodic tasks (tasks that terminate within a finite number of steps). There are first-visit MC methods (a state's value is averaged over the episodes in which it appears, counting only its first visit in each episode) and every-visit MC methods (averaged over every visit to the state). In this section, we discuss first-visit MC methods only.

Similar to the DP method, the MC method has its own versions of the Policy Evaluation, Policy Improvement, and Policy Iteration processes.

Monte Carlo Policy Evaluation
  • Input: the policy $\pi$ to be evaluated
  • Step 1: generate some state sequences under $\pi$ (each sequence is an episode)
  • Step 2: for each state, calculate the average return over all episodes in which that state appears (counting only its first visit in each episode)
  • Step 3: set these average returns as the values of the states (a runnable sketch of this procedure follows at the end of this section)
    Monte Carlo Estimation of Action Values
    To improve the policy, we first need values of actions (Q-values). We can follow steps similar to Monte Carlo Policy Evaluation: generate sequences, calculate the average returns, and set them as the Q-values. After that, we improve the policy greedily: $\pi(s) \leftarrow \arg\max_a Q(s, a)$.
    Maintaining Exploration
    There is a problem with the MC method. If the initial Q-values already satisfy, say, $Q(s, a_1) > Q(s, a_2)$, then $Q(s, a_2)$ will never be updated, because the greedy MC method will never choose that action. This is similar to a multi-armed bandit problem. Maintaining exploration replaces deterministic policies with soft policies, for example an $\epsilon$-greedy policy: execute the best action with probability $1 - \epsilon$, otherwise execute one of the other actions. Decreasing $\epsilon$ over time lets the algorithm converge.
    Monte Carlo Control
    The process of Monte Carlo Control alternates MC policy evaluation with policy improvement, just like Policy Iteration. We can also keep the policy implicit and store only the action values, which gives a value-iteration style of Monte Carlo Control; at the end of this algorithm, we generate the policy from the Q-values.
    Pros and Cons
  • pros
    • based on experience (samples) rather than a complete model of the environment
  • cons
    • works on episodic tasks only
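
As a concrete illustration of Steps 1-3 above, here is a Python sketch of first-visit MC policy evaluation on a made-up episodic task: a 5-state random walk with terminal states at both ends and a reward of 1 only for reaching the right end (the environment, policy, and all numbers are invented for this sketch).

    import random

    # Toy episodic task: a 5-state random walk. States 0 and 4 are terminal;
    # reward 1 is given only for reaching state 4.
    TERMINALS = {0, 4}
    GAMMA = 1.0

    def step(s, a):
        """a is -1 (left) or +1 (right)."""
        s2 = s + a
        return s2, (1.0 if s2 == 4 else 0.0)

    def generate_episode(policy, start=2):
        """One episode as a list of (state, reward received after leaving that state)."""
        s, episode = start, []
        while s not in TERMINALS:
            s2, r = step(s, policy(s))
            episode.append((s, r))
            s = s2
        return episode

    def first_visit_mc(policy, n_episodes=5000):
        returns = {s: [] for s in range(5)}
        for _ in range(n_episodes):
            episode = generate_episode(policy)
            # Backward pass: return (sum of future rewards, discounted by GAMMA) at each step.
            g, returns_to_go = 0.0, [0.0] * len(episode)
            for i in range(len(episode) - 1, -1, -1):
                g = episode[i][1] + GAMMA * g
                returns_to_go[i] = g
            # First-visit: record a state's return only at its first occurrence.
            seen = set()
            for i, (s, _) in enumerate(episode):
                if s not in seen:
                    seen.add(s)
                    returns[s].append(returns_to_go[i])
        # The state value is the average return over episodes where the state appears.
        return {s: (sum(rs) / len(rs) if rs else 0.0) for s, rs in returns.items()}

    random_policy = lambda s: random.choice([-1, +1])  # the policy being evaluated
    print(first_visit_mc(random_policy))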

Temporal-Difference

TD Prediction

Consider the Bellman equation for the state value:

$V^{\pi}(s) = \mathbb{E}_{\pi}\left[r_{t+1} + \gamma V^{\pi}(s_{t+1}) \mid s_t = s\right]$

When the policy is fixed, a single sampled transition $(s_t, r_{t+1}, s_{t+1})$ gives the estimate $V(s_t) \approx r_{t+1} + \gamma V(s_{t+1})$.

Then we have the TD error

$\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$

To learn the value function, all we need is to update $V$ so that the TD error shrinks:

$V(s_t) \leftarrow V(s_t) + \alpha\,\delta_t$, where $\alpha$ is the learning rate.
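
A Python sketch of tabular TD(0) prediction using exactly this update, on the same made-up 5-state random walk as in the Monte Carlo sketch (all names and numbers are illustrative):

    import random

    # Tabular TD(0) prediction: after each transition,
    # V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s)).
    TERMINALS = {0, 4}
    GAMMA, ALPHA = 1.0, 0.1

    def step(s, a):
        s2 = s + a  # a is -1 (left) or +1 (right); reward 1 only at the right end
        return s2, (1.0 if s2 == 4 else 0.0)

    def td0_prediction(policy, n_episodes=5000):
        V = {s: 0.0 for s in range(5)}  # values of terminal states stay at 0
        for _ in range(n_episodes):
            s = 2
            while s not in TERMINALS:
                s2, r = step(s, policy(s))
                td_error = r + GAMMA * V[s2] - V[s]   # the td_error defined above
                V[s] += ALPHA * td_error              # move V(s) to reduce the error
                s = s2
        return V

    random_policy = lambda s: random.choice([-1, +1])
    print(td0_prediction(random_policy))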

N-step TD

Consider the definition of the state value:

$V^{\pi}(s) = \mathbb{E}_{\pi}\left[r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots \mid s_t = s\right]$

Similarly to the TD algorithm above, we can expand the state value for any step $n$ and obtain a new TD error, $\delta_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n V(s_{t+n}) - V(s_t)$. If we let $n \to \infty$ (up to the end of the episode), it degenerates to the MC algorithm. To achieve better performance, $n$ needs to be tuned. To reduce the effect of the step size $n$ on the results, we can weight the $n$-step returns, for example by $(1-\lambda)\lambda^{n-1}$ as in TD($\lambda$); the expected value then stays in the same order of magnitude for different values of the hyper-parameter $\lambda$.
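
As a small illustration, the $n$-step TD error can be computed from a stored trajectory fragment as follows; the function name and example values are made up for this sketch:

    # n-step TD error from a stored trajectory fragment (illustrative only;
    # rewards[k] is the reward received k+1 steps after time t).
    def n_step_td_error(V, s_t, rewards, s_t_plus_n, n, gamma):
        """delta = r_{t+1} + gamma*r_{t+2} + ... + gamma^(n-1)*r_{t+n}
                   + gamma^n * V(s_{t+n}) - V(s_t)"""
        assert len(rewards) == n
        g = sum((gamma ** k) * r for k, r in enumerate(rewards))
        return g + (gamma ** n) * V[s_t_plus_n] - V[s_t]

    # Example with made-up values: a 2-step TD error.
    V = {0: 0.0, 1: 0.2, 2: 0.5}
    print(n_step_td_error(V, s_t=0, rewards=[0.0, 1.0], s_t_plus_n=2, n=2, gamma=0.9))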

Pros and Cons
  • pros
    • more flexible than MC
    • applicable to both on-policy (SARSA) and off-policy (Q-Learning) settings
    • TD methods generally perform better in practice than the DP and MC methods above, so most state-of-the-art algorithms are based on TD