The problem of credit assignment is a long-standing issue in reinforcement learning. In short, it is about figuring out which action actually “caused” a reward that arrives in the future. Here I am looking at two papers that address this problem.
The 1-step TD error approach to learning suffers when rewards are delayed over long periods, or when uncontrollable events that happen between the action and the reward also contribute to it.
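For reference (my notation, not tied to either paper), the 1-step TD error only looks one transition ahead,

$$\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t),$$

so a reward that arrives many steps after the responsible action has to be bootstrapped backwards one step at a time.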
Their proposed method (optimizing synthetic returns; the explanation will come later), combined with the IMPALA agent, is able to solve the Atari game Skiing, which is difficult precisely because of its delayed rewards, 25 times faster than the state of the art at the time.
State-associative learning. The goal is to learn associations between pairs of states, such that the state visited earlier is predictive of the reward at the future state (I think a better word here would be future timestep). This enables the algorithm to construct a synthetic return that connects the reward to the earlier state directly, skipping the timesteps in between.
Door-key example. If you have a door that you need to open to receive a reward, you first need to pick up the key. So the credit for the reward should be assigned to the earlier state (transition), the key pickup. Otherwise, with the one-step TD error we get no reward signal for the key-pickup transition, and the value function approximator has a hard time figuring this out; a lot of samples are needed.
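To convince myself of this, here is a tiny tabular sketch (my own toy, not from the paper): a chain where the “key pickup” is the first state, the only reward arrives at the very end, and 1-step TD(0) has to drag the credit back one state at a time.

```python
import numpy as np

# Toy chain MDP: state 0 is the "key pickup", states 1..N-1 are filler steps,
# and the only reward (+1, "the door opens") comes on leaving state N-1.
N = 20          # chain length
gamma = 1.0     # undiscounted episodic task
alpha = 0.1     # TD learning rate

V = np.zeros(N + 1)  # V[N] is terminal, value 0

def td0_episode(V):
    """One pass through the chain with tabular 1-step TD(0) updates."""
    for s in range(N):
        r = 1.0 if s == N - 1 else 0.0       # reward only at the very end
        V[s] += alpha * (r + gamma * V[s + 1] - V[s])

for episode in range(1, 501):
    td0_episode(V)
    if episode in (1, 20, 100, 500):
        print(f"episode {episode:3d}: V[key pickup] = {V[0]:.3f}")

# V[0] stays exactly zero for the first ~N episodes, because the reward
# signal can only travel back one bootstrap step per pass through the chain.
```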
Equation 1, the reward prediction loss. This is a bit weird to me, because of how the gating function enters it. Apparently there is a JAX implementation of this, yuppee!
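For reference, this is roughly the form of the loss as I remember it (my notation, so take it with a grain of salt): a bias $b(s_t)$ plus a gated sum of learned per-state contributions $c(s_k)$ is regressed onto the observed reward,

$$\mathcal{L} = \mathbb{E}\Big[\Big(r_t - b(s_t) - g(s_t)\sum_{k=0}^{t-1} c(s_k)\Big)^{2}\Big].$$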
Alright, a bit later after the equation they explain why they are summing over the past states. Synthetic return. This is what they call the learned contribution of a past state to future reward. Further, they assume that the contribution of past states to a future state's reward is sparse; that's why the gating mechanism is introduced before the sum. I think it probably wouldn't have worked without the gating mechanism, because there is no causal connection here between past action and reward; this might be a problem worth exploring.
Notice that the baseline in Eq. 1 depends on the current state only. Are there any nice properties of this in the tabular case, actually?
Experiments. I am pretty much skipping this section in the interest of time; long story short, the algorithm is very good in cases of long delayed rewards. The only thing that is peculiar to me is that the objective for the synthetic returns is modified.
Summary/Discussion. I don't get why they are calling it a synthetic return instead of a synthetic reward, which is what it is. They mention that they cannot offer convergence guarantees because of the use of multiplicative gates, and that because of the use of an additive regression model they cannot offer guarantees of optimal credit assignment :), but the cool thing is that they propose SA variants that overcome this in the appendix. Another problem is that this method has high variance across seeds, for some reason.
BONUS: Appendix. In the current formulation the reward contributions in the summation might be double counted, so there is no reward conservation. The Section 6.4 limitation of additive regression I am not getting exactly; now suddenly a new function shows up there.
Key idea: condition value functions on future events by learning to extract relevant information from a trajectory.
Let’s jump to the method section right away. I’m just gonna say it straight, I don’t like the notation style… who writes the score function like that? The REINFORCE estimator is basically a Monte-Carlo estimate of the policy gradient! So the only thing that we need to do is to be able to simulate trajectories (a lot of them, though!). Another feature here is the value function baseline in this formulation (I believe the original REINFORCE didn’t have a value function baseline?). A key feature of REINFORCE is that the likelihood of an action increases proportionally to the return from time-step t, not the return of the whole trajectory. The second feature is that subtracting the baseline from that return is kind of the motivator of this paper.
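For my own reference, the standard REINFORCE-with-baseline estimator looks like this (my notation, not necessarily the paper's):

$$\nabla_\theta J(\theta) \approx \sum_t \nabla_\theta \log \pi_\theta(a_t \mid x_t)\,\big(G_t - V(x_t)\big), \qquad G_t = \sum_{k \ge t} \gamma^{\,k-t} r_k.$$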
Weber et al. (2019) showed that including variables that are causally dependent on the action in the baseline leads to a biased estimator.
The REINFORCE estimator actually updates only the single action that was taken in the trajectory, but there is work from Sutton (2000) (would be interesting to read this) that shows that we can derive a policy gradient estimator that updates all of the actions simultaneously. It basically sums over all of the timesteps of the trajectory (discounted) and, at each timestep, sums over all of the actions, weighting each action's score function by its action value. And actually, this is true, yes: the REINFORCE gradient weights every timestep by the difference between the sampled return and the baseline.
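As a sanity check, this is the all-actions form of the policy gradient theorem as I know it (again, my notation):

$$\nabla_\theta J(\theta) = \mathbb{E}\Big[\sum_t \gamma^{t} \sum_a \nabla_\theta \pi_\theta(a \mid x_t)\, Q^{\pi}(x_t, a)\Big].$$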
Hindsight reasoning vs. luck. Imagine a team game where the agent is a weak player and it plays in a team with a strong player. The team wins, but what should the agent learn from this positive reward signal, given that none of the actions it took actually contributed to the victory?
Buesing et al. (2019) show that hindsight information quite frequently helps in understanding what happened in a trajectory. This work is connected to the work of Buesing et al., with the difference that this work is model-free while the other is model-based.
Here's a thought: I think that this future-conditioned baseline is predicting the returns that are not influenced by the policy, whatever the policy does. Or, in fact, maybe this can be formulated as predicting the worst-case return?
Future-conditional PG. The point here is that the baseline can be conditioned on future information, but then there needs to be an importance correction term, because otherwise future-conditioned policy gradients would be biased.
How bad is this bias?
Theorem 1. FC-PG. The assumption in the main text is that the ratio of the policy to the action density conditioned on the statistic and the state needs to be finite, i.e. 0 cannot be in the denominator. In fact, this importance ratio is used as a correction in front of the future-conditional baseline.
The FC-PG doesn’t necessarily have lower variance than classical PG, because of the importance weighting. The countermeasure to this is to study an estimator that makes the ratio equal to 1, meaning that the action is completely independent of the statistic (given the state).
Theorem 2. CCA-PG. The interesting theoretical result here is that the variance of CCA is at most the variance of the classical policy gradient and it has no bias; the only condition is that the hindsight statistic is conditionally independent of the action given the state.
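To keep the two estimators straight, here is my reconstruction of them from the notes above (notation approximate; see the paper for the precise statements). FC-PG corrects the future-conditional baseline with the hindsight importance ratio,

$$\nabla_\theta J = \mathbb{E}\Big[\sum_t \nabla_\theta \log \pi(a_t \mid x_t)\Big(G_t - \tfrac{\pi(a_t \mid x_t)}{p(a_t \mid x_t, \Phi_t)}\, V(x_t, \Phi_t)\Big)\Big],$$

while CCA-PG assumes $a_t \perp \Phi_t \mid x_t$, so the ratio equals 1 and the correction disappears:

$$\nabla_\theta J = \mathbb{E}\Big[\sum_t \nabla_\theta \log \pi(a_t \mid x_t)\big(G_t - V(x_t, \Phi_t)\big)\Big].$$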
Theorem 3. How do we estimate these quantities in practice? They consider a general variable here; note that we are allowed to do so because of the independence. Corollary 1. I don't really understand this one; it's basically taking a conditional expectation.
Learning CCA with Generative Models. Figures, we need a generative model, so the first idea that comes to mind for modelling the posterior is Variational Autoencoders.
Model-free approach to learning CCA. The method is related to domain-adversarial training techniques (whatever that is). Two objectives guide the learning signal (if I'm reading it right): (1) the hindsight statistic should be useful for predicting the return, and (2) it should be kept conditionally independent of the action; I try to sketch what I think this means below.
Practical implementation. They use an RNN so that the algorithm is applicable to POMDPs. A hindsight network is used to compute the hindsight statistic, together with a hindsight predictor that outputs a distribution over actions.
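Here is a minimal numpy paraphrase of how I understand the two objectives (my own sketch; the function names and exact loss forms are assumptions, not the paper's code):

```python
import numpy as np

def value_loss(v_hindsight, returns):
    """(1) The future-conditional value V(x_t, Phi_t) should predict the return,
    which is what makes the hindsight statistic Phi useful in the first place."""
    return np.mean((v_hindsight - returns) ** 2)

def independence_penalty(logp_policy, logp_hindsight_clf):
    """(2) A hindsight classifier tries to guess the taken action from (x_t, Phi_t).
    Penalize Phi by how much better that guess is than the policy's own log-probability;
    the mean of this gap is a sample estimate of the conditional mutual information
    I(A_t; Phi_t | X_t), which we want driven to ~0."""
    return np.mean(logp_hindsight_clf - logp_policy)

# Hypothetical usage with per-timestep arrays gathered from a trajectory:
# loss = value_loss(v, G) + lam * independence_penalty(logpi_a, logclf_a)
```

The point of (2), as far as I can tell, is exactly the condition in Theorem 2: if the classifier cannot beat the policy, the statistic carries no extra information about the taken action.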
Connections to causality. Assume that the MDP is generated by a structural causal model, and that the return is one of the variables of that model.
Notion of counterfactuals in SCMs. In simple terms, you need to keep in mind that counterfactual distributions are different from conditionals. Why? First of all, because of mutilation of the DAG: the incoming arrows of the variables that we intervene on are deleted. But more importantly, we are asking ourselves the question: “What would have happened, had I acted differently, under the same conditions?” This means that the exogenous variables are fixed; in other words, we are also conditioning on the exogenous variables. The authors here give a neat definition of counterfactuals with some fancy terminology.
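A throwaway toy (mine, not from the paper) that makes the conditional-vs-counterfactual distinction concrete:

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny structural causal model (my own toy, not the paper's):
#   U ~ Bernoulli(0.5)       exogenous noise, "the conditions"
#   X := the action taken (0/1), chosen at random here
#   Y := 1 if X == U else 0  (the outcome / "reward")
n = 100_000
U = rng.integers(0, 2, size=n)
X = rng.integers(0, 2, size=n)
Y = (X == U).astype(int)

# Interventional query P(Y=1 | do(X=1)): U is drawn fresh, so roughly 0.5.
p_do = np.mean(U == 1)

# Counterfactual query: among runs where we took X=0 and won (Y=1),
# what if we had taken X=1 under the SAME conditions, i.e. the same U?
idx = (X == 0) & (Y == 1)      # these runs all have U == 0
p_cf = np.mean(U[idx] == 1)    # ~0.0: switching the action would have lost

print(f"P(Y=1 | do(X=1))        ~ {p_do:.2f}")
print(f"P(Y_(X=1)=1 | X=0, Y=1) ~ {p_cf:.2f}")
```

In this toy, observing X=0 and Y=1 pins down the exogenous U exactly, which is why the counterfactual can be computed from data at all; as far as I understand, the identifiability question below is about when that is possible in general.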
Theorem 4. Counterfactual distribution identifiability. Now, the golden question is how we identify this counterfactual distribution, since we are not given the exogenous variables explicitly. It turns out that, given the assumptions of faithfulness (classic) and a conditional independence condition, the counterfactual distribution can be identified.