Open
Labels: enhancement (New feature or request)
Description
- Here, the reward at each step is computed as `reward = max(0., reward)`. I see the penalty code in actions.py and I understand why you cancel the penalties: when I tried removing this max() operation, the reward became highly negative and the agent learned nothing. However, clipping to zero makes the agent overfit because the reward is so sparse. I think adding some time cost, such as -1 or -0.5 per step, is necessary (see the first sketch after this list).
- Here, when granting NEW_SUCCESSFULL_ATTACK_REWARD, the code does not take into account whether the attacked node is already owned. It is meaningless to attack a node the attacker already owns; this makes the agent repeatedly launch attacks between owned nodes and ignore discovering new nodes (see the second sketch below).
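
To make the first point concrete, here is a minimal sketch of the shaping I have in mind. The clipping expression is quoted from the linked code; `STEP_COST` and `shape_reward` are hypothetical names I am introducing, and this is just one way to read "add a time cost":

```python
# Sketch: replace the hard clip `reward = max(0., reward)` with a small
# constant time cost on unproductive steps. STEP_COST is a value I made
# up; -1 or -0.5 both seem reasonable.
STEP_COST = -0.5

def shape_reward(raw_reward: float) -> float:
    """Keep positive rewards, but charge a small time cost otherwise.

    Clipping everything to 0 leaves a very sparse signal, while letting
    the full penalties through makes the return hugely negative; clipping
    the negative side at STEP_COST sits between the two.
    """
    return raw_reward if raw_reward > 0 else STEP_COST
```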
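
For the second point, the guard could look something like the sketch below. I am guessing at how ownership is tracked (an `agent_installed`-style flag on the node record) and at the constant's value; substitute whatever actions.py actually uses:

```python
from dataclasses import dataclass

# Minimal stand-in for the node record in actions.py; the real class
# carries much more state, and the flag name is my assumption.
@dataclass
class NodeInfo:
    agent_installed: bool  # True once the attacker owns the node

NEW_SUCCESSFULL_ATTACK_REWARD = 7.0  # placeholder value

def successful_attack_reward(node: NodeInfo) -> float:
    """Pay the attack reward only if the target is not already owned."""
    if node.agent_installed:
        # Re-attacking an owned node is useless, so give nothing;
        # otherwise the agent farms reward between owned nodes
        # instead of discovering new ones.
        return 0.0
    return NEW_SUCCESSFULL_ATTACK_REWARD
```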
In my experiment, I trained an agent with the original reward design in the chain env. The agent can perfectly take ownership of the network during training, but when I saved the model and evaluated it with epsilon-greedy, the success rate was only about 90%. When I patched the two points proposed above and trained an agent with the same parameters, the evaluation success rate was about 100%. I think the original reward design makes the agent overfit. (The evaluation loop I used is sketched below.)
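
By "evaluate with epsilon-greedy" I mean the usual noisy rollout, roughly the loop below; `model.best_action`, the `network_owned` flag, and the epsilon value are placeholders for my actual setup, assuming a gym-style env:

```python
import random

def evaluate(model, env, episodes: int = 100, epsilon: float = 0.05) -> float:
    """Fraction of episodes in which the agent owns the whole network."""
    successes = 0
    for _ in range(episodes):
        obs, done, info = env.reset(), False, {}
        while not done:
            if random.random() < epsilon:
                action = env.action_space.sample()   # random exploration
            else:
                action = model.best_action(obs)      # greedy w.r.t. the model
            obs, reward, done, info = env.step(action)
        successes += int(info.get("network_owned", False))
    return successes / episodes
```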
Could you please take a look at the two points and give some feedback? Anyway, thanks again for your code; it helps with my research, and I would even like to use it in my next research project about online learning :)
Originally posted by @sherdencooper in #46 (comment)