Hello everyone,
I hope this message finds you well. I've been working on implementing the SDP (Part 1 of the paper), and I've come across a point of potential confusion regarding the cumulative rewards for transitions in batch B1.
The paper states that every transition within batch B1 should have the same cumulative reward (following the mathematical definition of B1), but upon reviewing the code, it seems that transitions are selected at random, so a single batch can contain transitions with different cumulative rewards.
Before jumping to any conclusions, I wanted to open up a discussion and seek clarification from the community and maintainers. Could someone please shed light on whether the intended behavior is for all transitions in B1 to share the same cumulative reward, or whether the random selection in the current code is in fact consistent with the paper's specification?