I've currently set up a config.min_reward option that returns a fixed minimum reward for invalid molecules instead of running the ranked reward (RR) calculation. However, I think this has the side effect that the MCTS algorithm doesn't know how to distinguish high-performing molecules from actually invalid ones. Maybe the policy network training will solve this, but it might also be something we can adjust through the RR calculation.
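
To make the discussion concrete, here is a minimal sketch of the behavior described above. It assumes a binary ranked reward (1.0 at or above a percentile threshold of recent scores, 0.0 below it); the function name, arguments, and the 75th-percentile default are all hypothetical and not taken from the actual code.

```python
def ranked_reward(score, past_scores, is_valid, min_reward=0.0, percentile=75.0):
    """Hypothetical sketch of a ranked reward with a shortcut for invalid molecules."""
    # Invalid molecules skip the ranking entirely and get the configured minimum.
    if not is_valid:
        return min_reward

    # With no history there is nothing to rank against; treat the score as a win.
    # (This fallback is an assumption, not known project behavior.)
    if not past_scores:
        return 1.0

    # Simple percentile threshold over the recent scores (floor of the rank index).
    ordered = sorted(past_scores)
    index = int(percentile / 100.0 * (len(ordered) - 1))
    threshold = ordered[index]

    # Binary ranked reward: 1.0 at or above the threshold, 0.0 otherwise.
    return 1.0 if score >= threshold else 0.0
```

In this sketch, with min_reward=0.0 an invalid molecule and a valid molecule that ranks below the threshold both come back as 0.0, so the search sees no difference between them. Setting min_reward below the lowest ranked reward (for example -1.0) would keep the two cases distinct, which may be one way to handle this on the RR side.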