compute_MC_returns currently loops in reverse to compute the returns and discount each step.
With discount gamma=1.0, comparing its output against simply taking data["rewards"].sum(dim=0) shows a discrepancy of
(data["rewards"].sum(dim=0)-compute_MC_returns(data, 1.0, test_critic)[0, :]).abs().max()
out: tensor(1.9073e-06, device='cuda:0')
so not very big, but still there.
Describe the solution you'd like
Pre-compute the discounting vector, multiply it with the rewards, then call .sum().
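A minimal sketch of the idea, not the actual compute_MC_returns implementation. It assumes rewards has time on dim 0 (as in the data["rewards"].sum(dim=0) comparison above) and ignores episode terminations and any critic bootstrapping. The function names discounted_returns_loop and initial_return_vectorized are hypothetical and only there for illustration.

```python
import torch


def discounted_returns_loop(rewards: torch.Tensor, gamma: float) -> torch.Tensor:
    """Reference reverse loop: returns[t] = rewards[t] + gamma * returns[t + 1].

    Hypothetical stand-in for the looped approach; no dones, no bootstrapping.
    """
    returns = torch.zeros_like(rewards)
    running = torch.zeros_like(rewards[0])
    for t in reversed(range(rewards.shape[0])):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def initial_return_vectorized(rewards: torch.Tensor, gamma: float) -> torch.Tensor:
    """Return from step 0 only: sum over t of gamma**t * rewards[t]."""
    num_steps = rewards.shape[0]
    # Pre-computed discounting vector [1, gamma, gamma**2, ...].
    discounts = gamma ** torch.arange(
        num_steps, dtype=rewards.dtype, device=rewards.device
    )
    # Reshape to (T, 1, ..., 1) so it broadcasts over the remaining dims.
    discounts = discounts.view(num_steps, *([1] * (rewards.dim() - 1)))
    return (discounts * rewards).sum(dim=0)


if __name__ == "__main__":
    torch.manual_seed(0)
    rewards = torch.rand(50, 8, dtype=torch.float64)  # (time, num_envs), assumed layout
    looped = discounted_returns_loop(rewards, 0.99)[0]
    vectorized = initial_return_vectorized(rewards, 0.99)
    print((looped - vectorized).abs().max())
```

This only covers the return from the first step, which is what the [0, :] comparison checks. Returns for every timestep would need something more than a single .sum(), for example a reversed cumulative sum. The float32 discrepancy is likely due to the different accumulation order of the loop versus the .sum() reduction.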