About rl_loss #7

@xcfcode

Thank you for this excellent work. I still have a question about `rl_loss`. It is computed as `rl_loss = neg_reward * sample_out.loss`, where `neg_reward` is obtained as `greedy_rouge - sample_rouge` and `sample_out.loss` is the cross-entropy loss, i.e. it equals -log P(). However, the self-critical policy gradient training algorithm in the paper uses log P(), which confuses me. Could you please explain this?
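
To make my confusion concrete, here is a minimal self-contained sketch that only rewrites `neg_reward * sample_out.loss` in terms of log P. The logits, token ids, and ROUGE values are made-up placeholders, and the explicit cross-entropy is a stand-in for `sample_out.loss`; this is not the actual repo code.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Made-up placeholders: logits for 5 sampled tokens over a vocab of 10,
# and hypothetical ROUGE scores for the sampled and greedy outputs.
logits = torch.randn(5, 10, requires_grad=True)
sampled_ids = torch.randint(0, 10, (5,))
sample_rouge, greedy_rouge = 0.30, 0.45

# Stand-in for sample_out.loss: mean cross-entropy of the sampled tokens, i.e. -log P
ce_loss = F.cross_entropy(logits, sampled_ids)
neg_reward = greedy_rouge - sample_rouge
rl_loss = neg_reward * ce_loss

# The same expression rewritten with log P:
# (greedy_rouge - sample_rouge) * (-log P) == (sample_rouge - greedy_rouge) * log P
log_p = F.log_softmax(logits, dim=-1).gather(1, sampled_ids.unsqueeze(1)).mean()
print(torch.allclose(rl_loss, (sample_rouge - greedy_rouge) * log_p))  # True
```

So the repo's expression is algebraically the same quantity as `(sample_rouge - greedy_rouge) * log P`; my question is how this matches the sign convention written with log P() in the paper.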


Update:
I have read the SeqGAN code. Following the policy gradient, the loss there is computed as `loss += -out[j][target.data[i][j]] * reward[j]`, where `out` is the output of `log_softmax`, so the author adds the "-" sign in order to use gradient descent later.
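
For comparison, here is a minimal vectorized sketch of that REINFORCE-style loss. The shapes and reward values are made up, and the reward is taken per sequence as a simplification; this is not the actual SeqGAN code.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Made-up shapes: a batch of 2 sampled sequences of length 4 over a vocab of 8,
# with one reward per sequence (a simplification of SeqGAN's rollout rewards).
logits = torch.randn(2, 4, 8, requires_grad=True)
target = torch.randint(0, 8, (2, 4))                       # sampled token ids
reward = torch.tensor([0.7, 0.2])

out = F.log_softmax(logits, dim=-1)                        # "out" = log-probabilities
log_p = out.gather(-1, target.unsqueeze(-1)).squeeze(-1)   # log P of each sampled token

# REINFORCE-style loss: the leading "-" turns reward maximization into a
# quantity that can be minimized with gradient descent.
loss = -(log_p * reward.unsqueeze(-1)).sum()
loss.backward()
```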
