
A Brief Introduction to Reinforcement Learning I (Background Info)

Markov Chain

A Markov chain is a stochastic process with the Markov property: the next state depends only on the current state, $P(s_{t+1} \mid s_t, s_{t-1}, \ldots, s_0) = P(s_{t+1} \mid s_t)$.

Markov Decision Process

States: the set $S$ of states $s$
Actions: the set $A$ of actions $a$
Probability distribution of transitions: $P(s' \mid s, a)$, the probability of reaching state $s'$ after taking action $a$ in state $s$
Reward: $R(s, a)$, the expected immediate reward for taking action $a$ in state $s$
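
To make these ingredients concrete, here is a minimal sketch in Python of one way a tiny MDP could be written down; the two states, two actions, probabilities, and rewards are made up purely for illustration (they are re-declared in the later sketches so that each snippet runs on its own).

    # Toy MDP: two states, two actions (illustrative values only).
    # P[s][a] is a list of (next_state, probability); R[s][a] is the expected reward.
    states = ["s0", "s1"]
    actions = ["stay", "move"]

    P = {
        "s0": {"stay": [("s0", 1.0)], "move": [("s1", 0.9), ("s0", 0.1)]},
        "s1": {"stay": [("s1", 1.0)], "move": [("s0", 0.9), ("s1", 0.1)]},
    }
    R = {
        "s0": {"stay": 0.0, "move": 1.0},
        "s1": {"stay": 0.5, "move": 0.0},
    }
    gamma = 0.9  # discount factor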

Value function: Bellman Equation

RL learns a policy $\pi(a \mid s)$, which maps states to (distributions over) actions. The reward function reflects only the immediate reward; to capture long-term reward, we introduce the value function.

  • State Value Function
    $V^{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s\right]$, the expected discounted return starting from state $s$ and following policy $\pi$.
  • Action Value Function
    $Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s, a_t = a\right]$, the expected discounted return after taking action $a$ in state $s$ and then following $\pi$.
  • Connection
    In this section we have two different value functions, one for states and one for actions. Consider $Q^{\pi}(s, a)$ as a specialized version of $V^{\pi}(s)$ with a prescribed first action: $V^{\pi}(s) = \sum_a \pi(a \mid s)\, Q^{\pi}(s, a)$ and $Q^{\pi}(s, a) = \sum_{s'} P(s' \mid s, a)\,[R(s, a) + \gamma V^{\pi}(s')]$ (the Bellman equations). With these we can compute the expected return of a sequence of states and actions from either function.
  • Difference
    $Q^{\pi}$ is defined on state-action pairs, while $V^{\pi}$ is defined on states.
  • MDP Best Policy
    Solving an MDP means finding an optimal policy $\pi^*$ such that $V^{\pi^*}(s) \ge V^{\pi}(s)$ for every state $s$ and every policy $\pi$.

    Basic Solutions

    Dynamic Programming

    Policy Iteration
  • Policy Evaluation
    For a given policy $\pi$, the Policy Evaluation algorithm calculates the state values $V^{\pi}(s)$.

    ALGORITHM: Policy_Evaluation
    Input: $\pi$, the (possibly stochastic) policy to be evaluated.
    Initialize $V(s) \leftarrow 0$ for all $s \in S$
    Repeat
        $\Delta \leftarrow 0$
        For each $s \in S$
            $v \leftarrow V(s)$
            $V(s) \leftarrow \sum_a \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\,[R(s, a) + \gamma V(s')]$
            $\Delta \leftarrow \max(\Delta, |v - V(s)|)$
    Until $\Delta < \theta$ (a small positive threshold)
    Output: $V \approx V^{\pi}$, the approximate values of states.
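
    A minimal Python sketch of the algorithm above, run on a made-up two-state MDP (all names and numbers are purely illustrative, not part of the original notes):

    # Iterative policy evaluation on a tiny, made-up 2-state MDP.
    # P[s][a] is a list of (next_state, probability); R[s][a] is the expected reward.
    states = ["s0", "s1"]
    actions = ["stay", "move"]
    P = {
        "s0": {"stay": [("s0", 1.0)], "move": [("s1", 0.9), ("s0", 0.1)]},
        "s1": {"stay": [("s1", 1.0)], "move": [("s0", 0.9), ("s1", 0.1)]},
    }
    R = {
        "s0": {"stay": 0.0, "move": 1.0},
        "s1": {"stay": 0.5, "move": 0.0},
    }
    gamma, theta = 0.9, 1e-8

    def policy_evaluation(pi):
        """pi[s][a] is the probability of taking action a in state s."""
        V = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s in states:
                # Bellman expectation backup for the current policy.
                v_new = sum(
                    pi[s][a] * sum(p * (R[s][a] + gamma * V[s2]) for s2, p in P[s][a])
                    for a in actions
                )
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                return V

    uniform = {s: {a: 0.5 for a in actions} for s in states}
    print(policy_evaluation(uniform))  # approximate V for the uniform random policy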

  • Policy Improvement
    For a given policy $\pi$ and state values $V^{\pi}(s)$, the Policy Improvement algorithm produces a better policy $\pi'$ while leaving $V$ untouched.

    ALGORITHM: Policy_Improvement
    Input: $\pi$, $V$.
    Repeat
        policy_stable $\leftarrow$ true
        For each $s \in S$
            $a_{\text{old}} \leftarrow \pi(s)$
            $\pi(s) \leftarrow \arg\max_a \sum_{s'} P(s' \mid s, a)\,[R(s, a) + \gamma V(s')]$
            If $a_{\text{old}} \neq \pi(s)$ Then policy_stable $\leftarrow$ false
    Until policy_stable
    Output: $\pi$, the improved policy.
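
    A matching Python sketch of greedy improvement with respect to a fixed $V$, again on the made-up two-state MDP (the values in V below are arbitrary placeholders standing in for a previous evaluation step):

    # Greedy policy improvement with respect to fixed state values V
    # (same illustrative 2-state MDP, repeated so the snippet runs on its own).
    states = ["s0", "s1"]
    actions = ["stay", "move"]
    P = {
        "s0": {"stay": [("s0", 1.0)], "move": [("s1", 0.9), ("s0", 0.1)]},
        "s1": {"stay": [("s1", 1.0)], "move": [("s0", 0.9), ("s1", 0.1)]},
    }
    R = {"s0": {"stay": 0.0, "move": 1.0}, "s1": {"stay": 0.5, "move": 0.0}}
    gamma = 0.9

    def policy_improvement(pi, V):
        """Return (new_policy, policy_stable) for a deterministic policy pi[s] = a."""
        stable = True
        new_pi = dict(pi)
        for s in states:
            # Pick the action with the largest one-step lookahead value.
            best = max(
                actions,
                key=lambda a: sum(p * (R[s][a] + gamma * V[s2]) for s2, p in P[s][a]),
            )
            if best != pi[s]:
                stable = False
            new_pi[s] = best
        return new_pi, stable

    V = {"s0": 5.0, "s1": 4.0}         # placeholder values from an evaluation step
    pi = {"s0": "stay", "s1": "stay"}  # current deterministic policy
    print(policy_improvement(pi, V))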

  • Policy Iteration
    Combining Policy Evaluation and Policy Improvement, we obtain the Policy Iteration algorithm. The process is as follows:

    ALGORITHM: Policy_Iteration
    Initialize $V(s)$ and $\pi(s)$ randomly for all $s \in S$
    Repeat
        $V \leftarrow$ Policy_Evaluation($\pi$)
        $\pi' \leftarrow$ Policy_Improvement($\pi$, $V$)
        policy_stable $\leftarrow$ false
        If $\pi' = \pi$ Then policy_stable $\leftarrow$ true
        $\pi \leftarrow \pi'$
    Until policy_stable
    Output: $\pi \approx \pi^*$, the (near-)optimal policy.
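
    Putting the two steps together, here is a compact, self-contained Python sketch of policy iteration on the same made-up MDP (evaluation is done for a deterministic policy, which is what the improvement step produces):

    # Policy iteration: alternate evaluation and greedy improvement until the
    # policy stops changing (illustrative 2-state MDP, repeated for completeness).
    states = ["s0", "s1"]
    actions = ["stay", "move"]
    P = {
        "s0": {"stay": [("s0", 1.0)], "move": [("s1", 0.9), ("s0", 0.1)]},
        "s1": {"stay": [("s1", 1.0)], "move": [("s0", 0.9), ("s1", 0.1)]},
    }
    R = {"s0": {"stay": 0.0, "move": 1.0}, "s1": {"stay": 0.5, "move": 0.0}}
    gamma, theta = 0.9, 1e-8

    def q(s, a, V):  # one-step lookahead value of taking a in s
        return sum(p * (R[s][a] + gamma * V[s2]) for s2, p in P[s][a])

    def policy_iteration():
        pi = {s: actions[0] for s in states}          # arbitrary initial policy
        while True:
            # Policy evaluation for the current deterministic policy.
            V = {s: 0.0 for s in states}
            while True:
                delta = 0.0
                for s in states:
                    v_new = q(s, pi[s], V)
                    delta = max(delta, abs(v_new - V[s]))
                    V[s] = v_new
                if delta < theta:
                    break
            # Greedy policy improvement.
            new_pi = {s: max(actions, key=lambda a: q(s, a, V)) for s in states}
            if new_pi == pi:                          # policy_stable
                return pi, V
            pi = new_pi

    print(policy_iteration())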

    Value Iteration

    Compared with the Policy Iteration algorithm, the Value Iteration algorithm keeps the policy implicit and stores only the state values, so in each iteration we only need to sweep over all $s \in S$ once.
    ALGORITHM: Value_Iteration
    Initialize $V(s)$ randomly for all $s \in S$
    Repeat
        $\Delta \leftarrow 0$
        For each $s \in S$
            $v \leftarrow V(s)$
            $V(s) \leftarrow \max_a \sum_{s'} P(s' \mid s, a)\,[R(s, a) + \gamma V(s')]$
            $\Delta \leftarrow \max(\Delta, |v - V(s)|)$
    Until $\Delta < \theta$ (a small positive threshold)
    For each $s \in S$
        $\pi(s) \leftarrow \arg\max_a \sum_{s'} P(s' \mid s, a)\,[R(s, a) + \gamma V(s')]$
    Output: $\pi \approx \pi^*$, the (near-)optimal policy.
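
    A Python sketch of value iteration on the same made-up MDP; note that the policy appears only at the very end, extracted greedily from the converged values:

    # Value iteration: a single sweep per iteration, policy extracted at the end
    # (illustrative 2-state MDP, repeated so the snippet runs on its own).
    states = ["s0", "s1"]
    actions = ["stay", "move"]
    P = {
        "s0": {"stay": [("s0", 1.0)], "move": [("s1", 0.9), ("s0", 0.1)]},
        "s1": {"stay": [("s1", 1.0)], "move": [("s0", 0.9), ("s1", 0.1)]},
    }
    R = {"s0": {"stay": 0.0, "move": 1.0}, "s1": {"stay": 0.5, "move": 0.0}}
    gamma, theta = 0.9, 1e-8

    def q(s, a, V):
        return sum(p * (R[s][a] + gamma * V[s2]) for s2, p in P[s][a])

    def value_iteration():
        V = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s in states:
                v_new = max(q(s, a, V) for a in actions)  # Bellman optimality backup
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                break
        pi = {s: max(actions, key=lambda a: q(s, a, V)) for s in states}
        return pi, V

    print(value_iteration())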

    Pros and Cons
  • pros
    • interpretable
    • grounded in explicit mathematical derivation
  • cons
    • requires complete knowledge of the environment (transition probabilities and rewards)

Monte Carlo

The MC method is a sample-based, randomized counterpart of the DP method. It is defined only on episodic tasks (tasks that terminate within a finite number of steps). There are first-visit MC methods (a state's value is averaged over the episodes in which it appears, counting only its first visit in each episode) and every-visit MC methods (averaged over every visit to the state). In this section, we discuss first-visit MC methods only.

Similar to the DP method, the MC method has its own versions of the Policy Evaluation, Policy Improvement, and Policy Iteration processes.

Monte Carlo Policy Evaluation
  • Input: the policy $\pi$ to be evaluated
  • Step 1: generate some state sequences under $\pi$ (each sequence is an episode)
  • Step 2: for each state, calculate the average return over all episodes in which that state appears (counting only its first visit in each episode)
  • Step 3: set these average returns as the values of the states (a runnable sketch of this procedure follows at the end of this section)
    Monte Carlo Estimation of Action Values
    To improve the policy, we first need values of actions (Q-values). We can follow steps similar to Monte Carlo Policy Evaluation: generate sequences, calculate the average returns, and set them as the Q-values. After that, we improve the policy greedily: $\pi(s) \leftarrow \arg\max_a Q(s, a)$.
    Maintaining Exploration
    There is a problem with the MC method. If the initial Q-values already satisfy, say, $Q(s, a_1) > Q(s, a_2)$, then $Q(s, a_2)$ will never be updated, because the greedy MC method will never choose that action. This is similar to a multi-armed bandit problem. Maintaining exploration replaces deterministic policies with soft policies, for example an $\epsilon$-greedy policy: execute the best action with probability $1 - \epsilon$, otherwise execute one of the other actions. Decreasing $\epsilon$ over time lets the algorithm converge.
    Monte Carlo Control
    The process of Monte Carlo Control alternates MC policy evaluation with policy improvement, just like Policy Iteration. We can also keep the policy implicit and store only the action values, which gives a value-iteration style of Monte Carlo Control; at the end of this algorithm, we generate the policy from the Q-values.
    Pros and Cons
  • pros
    • based on experience (samples) rather than a complete model of the environment
  • cons
    • works on episodic tasks only
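
As a concrete illustration of Steps 1-3 above, here is a Python sketch of first-visit MC policy evaluation on a made-up episodic task: a 5-state random walk with terminal states at both ends and a reward of 1 only for reaching the right end (the environment, policy, and all numbers are invented for this sketch).

    import random

    # Toy episodic task: a 5-state random walk. States 0 and 4 are terminal;
    # reward 1 is given only for reaching state 4.
    TERMINALS = {0, 4}
    GAMMA = 1.0

    def step(s, a):
        """a is -1 (left) or +1 (right)."""
        s2 = s + a
        return s2, (1.0 if s2 == 4 else 0.0)

    def generate_episode(policy, start=2):
        """One episode as a list of (state, reward received after leaving that state)."""
        s, episode = start, []
        while s not in TERMINALS:
            s2, r = step(s, policy(s))
            episode.append((s, r))
            s = s2
        return episode

    def first_visit_mc(policy, n_episodes=5000):
        returns = {s: [] for s in range(5)}
        for _ in range(n_episodes):
            episode = generate_episode(policy)
            # Backward pass: return (sum of future rewards, discounted by GAMMA) at each step.
            g, returns_to_go = 0.0, [0.0] * len(episode)
            for i in range(len(episode) - 1, -1, -1):
                g = episode[i][1] + GAMMA * g
                returns_to_go[i] = g
            # First-visit: record a state's return only at its first occurrence.
            seen = set()
            for i, (s, _) in enumerate(episode):
                if s not in seen:
                    seen.add(s)
                    returns[s].append(returns_to_go[i])
        # The state value is the average return over episodes where the state appears.
        return {s: (sum(rs) / len(rs) if rs else 0.0) for s, rs in returns.items()}

    random_policy = lambda s: random.choice([-1, +1])  # the policy being evaluated
    print(first_visit_mc(random_policy))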

Temporal-Difference

TD Prediction

Consider the Bellman equation for the state value:

$V^{\pi}(s) = \mathbb{E}_{\pi}\left[r_{t+1} + \gamma V^{\pi}(s_{t+1}) \mid s_t = s\right]$

When the policy is fixed, a single sampled transition $(s_t, r_{t+1}, s_{t+1})$ gives the estimate $V(s_t) \approx r_{t+1} + \gamma V(s_{t+1})$.

Then we have the TD error

$\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$

To learn the value function, all we need is to update $V$ so that the TD error shrinks:

$V(s_t) \leftarrow V(s_t) + \alpha\,\delta_t$, where $\alpha$ is the learning rate.
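
A Python sketch of tabular TD(0) prediction using exactly this update, on the same made-up 5-state random walk as in the Monte Carlo sketch (all names and numbers are illustrative):

    import random

    # Tabular TD(0) prediction: after each transition,
    # V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s)).
    TERMINALS = {0, 4}
    GAMMA, ALPHA = 1.0, 0.1

    def step(s, a):
        s2 = s + a  # a is -1 (left) or +1 (right); reward 1 only at the right end
        return s2, (1.0 if s2 == 4 else 0.0)

    def td0_prediction(policy, n_episodes=5000):
        V = {s: 0.0 for s in range(5)}  # values of terminal states stay at 0
        for _ in range(n_episodes):
            s = 2
            while s not in TERMINALS:
                s2, r = step(s, policy(s))
                td_error = r + GAMMA * V[s2] - V[s]   # the td_error defined above
                V[s] += ALPHA * td_error              # move V(s) to reduce the error
                s = s2
        return V

    random_policy = lambda s: random.choice([-1, +1])
    print(td0_prediction(random_policy))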

N-step TD

Consider the definition of the state value:

$V^{\pi}(s) = \mathbb{E}_{\pi}\left[r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots \mid s_t = s\right]$

Similarly to the TD algorithm above, we can expand the state value for any step $n$ and obtain a new TD error, $\delta_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n V(s_{t+n}) - V(s_t)$. If we let $n \to \infty$ (up to the end of the episode), it degenerates to the MC algorithm. To achieve better performance, $n$ needs to be tuned. To reduce the effect of the step size $n$ on the results, we can weight the $n$-step returns, for example by $(1-\lambda)\lambda^{n-1}$ as in TD($\lambda$); the expected value then stays in the same order of magnitude for different values of the hyper-parameter $\lambda$.
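
As a small illustration, the $n$-step TD error can be computed from a stored trajectory fragment as follows; the function name and example values are made up for this sketch:

    # n-step TD error from a stored trajectory fragment (illustrative only;
    # rewards[k] is the reward received k+1 steps after time t).
    def n_step_td_error(V, s_t, rewards, s_t_plus_n, n, gamma):
        """delta = r_{t+1} + gamma*r_{t+2} + ... + gamma^(n-1)*r_{t+n}
                   + gamma^n * V(s_{t+n}) - V(s_t)"""
        assert len(rewards) == n
        g = sum((gamma ** k) * r for k, r in enumerate(rewards))
        return g + (gamma ** n) * V[s_t_plus_n] - V[s_t]

    # Example with made-up values: a 2-step TD error.
    V = {0: 0.0, 1: 0.2, 2: 0.5}
    print(n_step_td_error(V, s_t=0, rewards=[0.0, 1.0], s_t_plus_n=2, n=2, gamma=0.9))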

Pros and Cons
  • pros
    • more flexible than MC
    • applicable to both on-policy (SARSA) and off-policy (Q-Learning) settings
    • TD methods generally perform better in practice than the DP and MC methods above, so most state-of-the-art algorithms are based on TD