A tuple – (S, s1, A, P, R), where:
S – finite set of states.
s1 – initial state.
A – finite set of actions.
P – Given a state s and an action a, what is the probability of ending up in a particular next state s'? This information is provided by P, the state transition probability matrix, with one entry per (s, a, s') triple, i.e. size |S| × |A| × |S|.
R – What is the reward for taking action a in state s?
Pi – Policy: which action to take in which state? A mapping from the state space to the action space. The optimal policy is the policy with the maximum expected cumulative reward.
Value function – Expected cumulative reward obtained by starting from state s and following policy pi thereafter.
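As a concrete sketch (the sizes and random numbers here are only for illustration, not from the notes), these pieces can be stored as plain NumPy arrays:

```python
import numpy as np

n_states, n_actions = 4, 2          # |S| and |A| (arbitrary sizes for illustration)

# P[s, a, s2] = probability of moving to state s2 after taking action a in state s
P = np.random.rand(n_states, n_actions, n_states)
P /= P.sum(axis=2, keepdims=True)   # each (s, a) distribution must sum to 1

# R[s, a] = reward for taking action a in state s
R = np.random.randn(n_states, n_actions)

# A deterministic policy: a mapping from state space to action space
pi = np.zeros(n_states, dtype=int)  # e.g. "always take action 0"
```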
Bellman equations –
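One standard way to write these (a sketch that assumes a discount factor γ ∈ [0, 1), which the tuple above does not include):

```latex
% Value of policy \pi: expected discounted reward starting from s
V^{\pi}(s) = R(s, \pi(s)) + \gamma \sum_{s'} P(s' \mid s, \pi(s)) \, V^{\pi}(s')

% Bellman optimality equation for the optimal value function
V^{*}(s) = \max_{a} \Big[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \, V^{*}(s') \Big]
```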
Ways to find an optimal policy:
The optimal policy when starting in state s is the policy whose value function, evaluated at s, is maximum.
Think about the case where a particular policy is at least as good as another policy for all initial states. Then the first policy is greater than or equal to the second policy, and this comparison defines a partial ordering over policies.
Theorem:
There always exists a policy that is greater than or equal to every other policy under this partial ordering. Such a policy (there may be more than one) is called an optimal policy.
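In symbols (a sketch of the usual formalization in terms of the value function):

```latex
% Partial ordering over policies
\pi \geq \pi' \iff V^{\pi}(s) \geq V^{\pi'}(s) \quad \text{for all } s \in S

% Theorem: there exists an optimal policy \pi^{*} with \pi^{*} \geq \pi for every policy \pi,
% and V^{\pi^{*}}(s) = V^{*}(s) for all s.
```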
Iterative methods:
- Policy iteration (see the code sketch after this list):
- Evaluate the current policy, i.e. compute its value function.
- At each state, pick the action with the maximum value under this value function.
- You get the new policy.
- Again go to step 1 and loop until the new policy and the old policy are the same.
- Value iteration (also sketched in the code below):
- Find the optimal value function directly rather than an explicit policy.
- In every iteration, improve the value vector.
- Once you converge to a particular value vector, use it to read off the optimal policy.
- This is cheap compared to policy iteration, since there is no separate policy-evaluation step in each iteration.
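A minimal sketch of both loops, reusing the P and R arrays from the earlier snippet and an assumed discount factor gamma (the exact-solve policy evaluation and the convergence tolerance are illustrative choices, not from the notes):

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    n_states, n_actions, _ = P.shape
    pi = np.zeros(n_states, dtype=int)
    while True:
        # Step 1: policy evaluation -- solve (I - gamma * P_pi) V = R_pi exactly
        P_pi = P[np.arange(n_states), pi]       # transitions under the current policy
        R_pi = R[np.arange(n_states), pi]       # rewards under the current policy
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Step 2: policy improvement -- at each state pick the action with max value
        Q = R + gamma * P @ V                   # shape (n_states, n_actions)
        new_pi = Q.argmax(axis=1)
        if np.array_equal(new_pi, pi):          # loop until old and new policy match
            return pi, V
        pi = new_pi

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    V = np.zeros(P.shape[0])
    while True:
        # Improve the value vector with one Bellman optimality backup per state
        new_V = (R + gamma * P @ V).max(axis=1)
        if np.max(np.abs(new_V - V)) < tol:     # converged value vector
            V = new_V
            break
        V = new_V
    # Use the converged value vector to read off the optimal policy
    pi = (R + gamma * P @ V).argmax(axis=1)
    return pi, V
```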
Model-free methods:
Reinforcement learning is a Markov decision process with an unknown transition model and/or reward distribution.
The agent can observe samples and induce the transition model and reward distribution from them.
Q-learning:
Uses Q-values (state-action values) instead of the state value function.
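A minimal sketch of tabular Q-learning (the environment interface, learning rate alpha, and epsilon-greedy exploration are assumptions for illustration, not from the notes):

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning: learn Q-values from samples, with no transition model.

    `env` is assumed to expose reset() -> state and step(action) -> (state, reward, done).
    """
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy exploration
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = Q[s].argmax()
            s2, r, done = env.step(a)
            # Move Q(s, a) toward the sampled Bellman target
            Q[s, a] += alpha * (r + gamma * Q[s2].max() * (not done) - Q[s, a])
            s = s2
    return Q  # greedy policy: Q.argmax(axis=1)
```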
To be continued……….
Utility: The utility of a policy at a state is the expected long-term reward we get if we start running the policy from that state.
Reward gives us immediate feedback, but utility gives us long-term feedback. Utilities allow you to accept short-term negatives for long-term positives.
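A tiny numeric illustration (the rewards and discount factor are made up): the path with the negative first reward still ends up with the higher utility.

```python
gamma = 0.9

path_a = [1, 0, 0, 0]    # looks better on immediate feedback (+1 on the first step)
path_b = [-1, 5, 5, 5]   # short-term negative, long-term positive

def utility(rewards, gamma):
    # discounted sum of rewards = long-term feedback
    return sum(r * gamma**t for t, r in enumerate(rewards))

print(utility(path_a, gamma))  # 1.0
print(utility(path_b, gamma))  # -1 + 4.5 + 4.05 + 3.645 = 11.195
```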
Credit assignment problem: