- 
                Notifications
    You must be signed in to change notification settings 
- Fork 182
filter out empty triplets when batching #59
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR fixes a runtime error that occurs when processing batches containing samples with empty prompt and response sequences, which causes tensor reshaping failures in the Qwen2 model. The fix filters out empty triplets (samples with both empty prompt_ids and response_ids) during batch preparation to prevent the error from propagating to the model forward pass.
- Filter out samples with both empty prompt_ids and response_ids during batch preparation
- Return early from training step when no valid samples remain after filtering
- Add metrics tracking for filtered samples
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description | 
|---|---|
| agentlightning/verl/daemon.py | Adds filtering logic to skip empty samples and return None when no valid samples exist | 
| agentlightning/verl/trainer.py | Adds early return handling when batch is None due to filtering | 
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
| 
 @microsoft-github-policy-service agree | 
| Have you tested locally whether the proposed fix will mitigate issue #50? Show the behavior before the fix and after. | 
| Before fixingthe server side reports the above error and exits. After fixingThe client side reports the same error to notice users. | 
| Have you checked wandb? I don't think the training makes any sense if all the data are thrown away. I think a design choice has to be made here, with pros and cons, to ask users to intervene, if that's the case happening. Simply silencing all errors might not be a good idea. | 
| ClarificationIn my debugging code, as all generated samples exceed the length limitation, there is no valid samples in each batch, and wandb log is empty. Potential improvementTo allow user intervention, is an option like  BehaviourBy enabling this option, when a bad request like this issue happens, the server should skip this rollout sample during batching since there is no valid data. Default value of the optionThis option should be enabled by default. Otherwise users may meet this issue in late training steps and waste most compute resources. For example, Figure 3 of Deepseek-R1 indicates that response will gradually increase as the training step increases. | 
| Well I think there are two cases. If it's the prompt length that is too long, it's a problem at the agent side and the server should clearly warn or point out that the user should take a look at their agent. If it's a response length problem, like the DSR1 given by you, we may need to further discuss what's happening here and should have a proper mechanism to handle it. | 
| To warn the prompt length of agent client, we need to intialize a tokenizer to precompute number of tokens during enqueuing a task (queue_task). Then we can add a warninng if the token count is out of limit. Is this a preferred design? To get the response length, the execution is wrapped by verl's  | 
| 
 I'm fixing the same issue right now. With changing the verl training bash, setting 'truncate' in data.truncation. this replacement with error can somehow omit the error and let the training process continue, while somehow we can also set a bigger data.max_prompt_length If a 4096 length is not enough. However, it‘s also important that if too much responses are dropped, the rl training is stakced and advantage estimation is incorrect, which may cause degrading performance. | 
| 
 I think it's a really good solution that truncating the input prompt to exceed the length of the agent when the LLM inference Max Token is set too small. However, when it comes to long agent responses, such as 10000+ input and 4000+ output, it can cause really verbose training time waste. | 


Reproduce bug
How to
insert
prompt = " ".join([prompt for _ in range(3000)])below this line.Full trace stack on server side
Solution
As shown in the trace stack, this issue is caused by a 0-length response. In addition, it does not affect inference mode. So, a straightforward idea is to filter out such kind of samples from training batches.
Not sure if this is too aggressive to impact other components.
Other solutions
As the bug is from transformers, using another model can avoid this issue.
fixing this issue in transformers's modeling_qwen2 model also works.
add anthoer field like
is_drop_maskto filter out 0-respoonse samples before enteringcompute_log_prob, and fill those droped samples with some default value.