Is the difference that the single-turn trajectory comes from data, while the expanded single-turn trajectory comes from rollout?