Hi bic4907, I really like your BicNet implementation! My goal is to run it on an environment where every agent gets a -1 reward for each time step it takes to finish the episode. But there seems to be a problem with your actor loss: because the actor loss is defined as the critic's prediction, doesn't the reward need to converge to zero when the agents perform perfectly?
Can you explain why you implemented it this way? Also, is it possible that the reward doesn't converge to 0 even when the agents perform well (like in the environment I mentioned above)?
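To illustrate what I mean, here is a quick sketch (the discount factor and step counts are just illustrative, not taken from your code): with a constant -1 reward per step, the discounted return is a geometric series that stays negative even for the shortest possible episode, so an actor loss built from the critic's Q prediction would converge to this nonzero value rather than to 0.

```python
def discounted_return(reward_per_step: float, gamma: float, steps: int) -> float:
    """Sum of gamma**t * reward_per_step over t = 0..steps-1 (geometric series)."""
    return sum(reward_per_step * gamma ** t for t in range(steps))

# Even an optimal policy that finishes in 10 steps has a negative return,
# so Q (and hence an actor loss defined as the critic's prediction) is
# bounded away from 0.
optimal_q = discounted_return(-1.0, 0.99, 10)
print(round(optimal_q, 4))  # ≈ -9.5618

# Without discounting, the return is simply -steps:
print(discounted_return(-1.0, 1.0, 5))  # -5.0
```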