
Conversation

@taha-yassine
Contributor

@taha-yassine taha-yassine commented Nov 13, 2025

What does this PR do?

Fixes #2897

As mentioned in the original issue, special tokens can be especially useful when computing reward functions.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@qgallouedec
Member

qgallouedec commented Nov 13, 2025

Thanks for the PR. I went through the issue. I understand that the motivation is to compute a reward that depends on the special tokens.
Would computing the reward based on the completion_ids (ints) instead of on the completion (strs) do the trick?

@taha-yassine
Contributor Author

Would computing the reward based on the completion_ids (ints) instead of on the completion (strs) do the trick?

I think completion_ids is useful when the reward depends on whether some specific, predetermined tokens are present. For example, it can be used to check that the policy respects a given answer format.
Now say we're training a policy to answer questions and to put the result between <answer> and </answer>, and say these are special tokens. completion_ids would let us detect whether the completion contains those tokens, but not precisely locate which part of the completion they surround. Also, we can't manually decode completion_ids because we don't have access to the tokenizer inside the reward functions.
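To make the span-location point concrete, here is a minimal sketch of a reward function that only works if the decoded completion still contains the (hypothetical) <answer> and </answer> special tokens; with skip_special_tokens=True the tags vanish and the reward is always zero:

```python
import re

def answer_span_reward(completions, **kwargs):
    """Reward 1.0 when the completion wraps exactly one non-empty answer
    in <answer>...</answer> tags; 0.0 otherwise. Assumes `completions`
    are decoded WITHOUT skipping special tokens."""
    rewards = []
    for text in completions:
        matches = re.findall(r"<answer>(.*?)</answer>", text, flags=re.DOTALL)
        rewards.append(1.0 if len(matches) == 1 and matches[0].strip() else 0.0)
    return rewards

print(answer_span_reward(["The result is <answer>42</answer>."]))  # [1.0]
# If the tags were stripped during decoding, the same completion scores 0:
print(answer_span_reward(["The result is 42."]))                   # [0.0]
```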

@taha-yassine
Contributor Author

Hi @qgallouedec , just checking in to see if you've had a chance to look at my last comment. Lmk what you think.

@qgallouedec
Member

qgallouedec commented Nov 21, 2025

Actually, I'm still wondering about this feature. In your example, in my opinion, <answer> shouldn't be a special token in the first place. In most models, these kinds of tokens are not special.

Another difficulty I foresee is that we will soon be relying on "response parsing" (see #4300) instead of simple decoding. And in that case, I don't even see how we could include special tokens; in fact, I don't think this question would even make sense.

I recommend the following:

  • re-consider the choice of making a token "special" (in the vast majority of cases, only BOS, EOS and PAD should be special), or
  • have a tokenizer in the reward function:
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(...)

    def reward_func(completion_ids, **kwargs):
        text = tokenizer.decode(completion_ids, skip_special_tokens=False)
        ...
        return rewards
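One way to structure that workaround so it stays testable is to close the reward function over any decode callable (e.g. functools.partial(tokenizer.decode, skip_special_tokens=False)). A sketch with a toy stand-in decoder, since the real tokenizer and tag names are assumptions here:

```python
import re

def make_special_token_reward(decode):
    """Build a GRPO-style reward function around any `decode(ids) -> str`
    callable that preserves special tokens. Rewards completions that
    close their <answer> tag."""
    def reward_func(completion_ids, **kwargs):
        rewards = []
        for ids in completion_ids:
            text = decode(ids)
            ok = re.search(r"<answer>.*?</answer>", text, flags=re.DOTALL)
            rewards.append(1.0 if ok else 0.0)
        return rewards
    return reward_func

# Toy decoder standing in for tokenizer.decode (vocab is made up):
vocab = {0: "<answer>", 1: "42", 2: "</answer>"}
decode = lambda ids: "".join(vocab[i] for i in ids)
reward_func = make_special_token_reward(decode)
print(reward_func([[0, 1, 2], [1]]))  # [1.0, 0.0]
```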

@taha-yassine
Contributor Author

I agree that it doesn't make sense to have these kinds of tokens marked as special. The issue is that some models still do that (example). And for some more context, the model I struggled with the most, and that led me to open this PR, was gpt-oss because of its harmony response format. Extracting the content of each channel requires access to the special tokens.

Another difficulty I foresee is that we will soon be relying on "response parsing" (see #4300) instead of simple decoding. And in that case, I don't even see how we could include special tokens; in fact, I don't think this question would even make sense.

I wasn't aware of this new response parsing feature. It seems useful, and it would indeed solve a lot of the headache around extracting the different parts of the response for reward design. My only concern is how long it would take to land in TRL and to be adopted across past and future models.

On the other hand, what issues do you see with this PR in its current state?

In the meantime, while having a globally available tokenizer that one could use for rewards isn't the most elegant solution, I guess it's an acceptable workaround.


Development

Successfully merging this pull request may close these issues.

GRPO completions skip special tokens?
