About rl_loss #7

@xcfcode

Thank you for this excellent work. I still have a question about `rl_loss`. It is computed as `rl_loss = neg_reward * sample_out.loss`, where `neg_reward` is obtained as `greedy_rouge - sample_rouge` and `sample_out.loss` is the cross-entropy loss, i.e. it equals -log P(). However, the self-critical policy gradient training algorithm in the paper uses log P(), which confuses me. Could you please explain this?
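
To make my confusion concrete, here is a minimal self-contained sketch that only rewrites `neg_reward * sample_out.loss` in terms of log P. The logits, token ids, and ROUGE values are made-up placeholders, and the explicit cross-entropy is a stand-in for `sample_out.loss`; this is not the actual repo code.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Made-up placeholders: logits for 5 sampled tokens over a vocab of 10,
# and hypothetical ROUGE scores for the sampled and greedy outputs.
logits = torch.randn(5, 10, requires_grad=True)
sampled_ids = torch.randint(0, 10, (5,))
sample_rouge, greedy_rouge = 0.30, 0.45

# Stand-in for sample_out.loss: mean cross-entropy of the sampled tokens, i.e. -log P
ce_loss = F.cross_entropy(logits, sampled_ids)
neg_reward = greedy_rouge - sample_rouge
rl_loss = neg_reward * ce_loss

# The same expression rewritten with log P:
# (greedy_rouge - sample_rouge) * (-log P) == (sample_rouge - greedy_rouge) * log P
log_p = F.log_softmax(logits, dim=-1).gather(1, sampled_ids.unsqueeze(1)).mean()
print(torch.allclose(rl_loss, (sample_rouge - greedy_rouge) * log_p))  # True
```

So the repo's expression is algebraically the same quantity as `(sample_rouge - greedy_rouge) * log P`; my question is how this matches the sign convention written with log P() in the paper.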


Update:
I have read the SeqGAN code. Following the policy gradient, the loss there is computed as `loss += -out[j][target.data[i][j]] * reward[j]`, where `out` is the output of `log_softmax`, so the author adds the "-" sign in order to use gradient descent later.
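
For comparison, here is a minimal vectorized sketch of that REINFORCE-style loss. The shapes and reward values are made up, and the reward is taken per sequence as a simplification; this is not the actual SeqGAN code.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Made-up shapes: a batch of 2 sampled sequences of length 4 over a vocab of 8,
# with one reward per sequence (a simplification of SeqGAN's rollout rewards).
logits = torch.randn(2, 4, 8, requires_grad=True)
target = torch.randint(0, 8, (2, 4))                       # sampled token ids
reward = torch.tensor([0.7, 0.2])

out = F.log_softmax(logits, dim=-1)                        # "out" = log-probabilities
log_p = out.gather(-1, target.unsqueeze(-1)).squeeze(-1)   # log P of each sampled token

# REINFORCE-style loss: the leading "-" turns reward maximization into a
# quantity that can be minimized with gradient descent.
loss = -(log_p * reward.unsqueeze(-1)).sum()
loss.backward()
```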
