[train] Support logprobs, fix generation config defaults and add more generation tests for the new HTTP inference pathway #1038

Open
SumanthRH wants to merge 13 commits into main from fix-tests-and-migrate-more

Conversation


@SumanthRH SumanthRH commented Feb 6, 2026

What does this PR do?

This PR migrates more tests to the new HTTP inference pathway and adds some missing features, like rollout logprobs support, along the way. It also fixes some test failures on main. The changes are as follows:

Test improvements

Introduces a new InferenceEngineState class to manage inference engine instantiation and state. With better state management, this fixes some cleanup issues for the existing test_policy_local_engines_e2e test in CI.
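For illustration, a minimal sketch of what a state-managing helper like this could look like as a context manager; the names, the engine factory, and the teardown() method are assumptions for the sketch, not the actual implementation in this PR.

from contextlib import AbstractContextManager

class InferenceEngineState(AbstractContextManager):
    """Hypothetical sketch: owns engine creation and guarantees teardown in tests."""

    def __init__(self, engine_factory):
        self._engine_factory = engine_factory  # callable that builds the inference engine
        self.engine = None

    def __enter__(self):
        self.engine = self._engine_factory()
        return self.engine

    def __exit__(self, exc_type, exc, tb):
        # Always release engine resources, even if the test body raised.
        if self.engine is not None and hasattr(self.engine, "teardown"):
            self.engine.teardown()
        self.engine = None
        return False

A test can then wrap engine usage in a with block so cleanup runs even when an assertion fails.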

Configuration fix for vLLM server actor

The vLLM server can have different generation quality from AsyncLLMEngine.generate. I noticed this while going over generations in the weight sync tests:

/v1/completions

"To determine how much Janet makes at the farmers' market each day, let's break down the earnings:\n\n1. **Total hours working:** \n   - It is given that Janet works 5 days a week.\n\n2. **Daily average earnings:**\n   - Janet earns $2 per fresh duck egg per day for sales.\n   \n3. **Number of eggs sold:**  \n   - In addition to selling leftover eggs from fresh Monday through Thursday (the remaining number), Janet also offers muffins for her friends\n     \\[\n     \\text{Total number of muffins} = 4 \\times (\\text{eigglees prepared for guests}) = 4 \\times (16 + 2) = 72 \n     \\]\n\nThe total quantity (eggs for muffins * Price per egg) can be calculated as:\n\\[\n\\text{Tickets baked} = 72 \\, (minutes \\, @$2) = 72 \\, \\text{tickets}\n\\]\nThus, the cost per ticket is:\n\\[\n\\text{Cost per ticket} = \\$2 / \\text{10mins} = \\$2 / 50sime)\n\\]\n\nSumming up all remaining lunch and recipe hours per day:\n\n\\[\n\\text{Total Earnings} = 5 \\,\\text{(days)} \\times \\$50/\\text{nightly salary Increase}_{Sunday-\\} = \\$3,000/circle\nTo sum it up:\n    Total daily income =\n\\[ \\boxed{\\text{\"1}}おり$\\] 現在、人々が食い doGetl@g\\u4ehnの多未にꀲ_mehng 同様に 別の ショルダー・モリ...\n\nIn conclusion:\")\nTherefore, it sees that-\\\nTotlar!:5天 <- コロールのお立場 -\n\nSpecial:Aricci 团体 少女と思ってss聚い虾肝\n\nHe also said he had no problem sharing iotisedmale amountㅣtsful今天 Landoré院に\"r множgli Острая症候群は血液病です。血液分をいっぽ透ゅと光曳行指 proclaimed Chinaの時电网運路~\u00020オ大へegる ≥ Tsusonoの楽誤をカインの血圧のz武通過ということで知らvensalでも高め values SSDgbnsu...."

AsyncLLMEngine.generate

"To determine how much Janet makes at the farmers' market every day, we will follow these steps:\n\n1. **Calculate the daily egg production for lunch:**\n   - She produces 16 eggs per day for meals.\n   - She eats 3 eggs for breakfast.\n   - Therefore, the remaining eggs at the end of the day are \\( 16 - 3 = 13 \\) eggs.\n\n2. **Calculate the daily egg production for dinner:**\n   - She processes 4 muffins without lunch.\n   - She bakes 4 muffins for dinner.\n   - Therefore, the consumed muffins per day are \\( 4 \\times 4 = 16 \\) muffins.\n\n3. **Calculate the daily net production without fattening losses:**\n   - Net production at the farmers' market is the total remaining articles minus the whole produced eggs.\n   - Thus, the net production is \\( 13 - 16 = -3 \\) eggs.\n\n4. **Determine per-purpose production:**\n   - Janet processes 6 eggs per day, which cost her $2.\n   - Hence, processed eggs produce each day \\( \\frac{2}{6} \\) dollars.\n\n5. **Compute the daily earnings:**\n   - Since the net production of egg is -3 eggs, we have:\n     \\[\n     \\text{Daily total earnings} = \\left(\\frac{2}{6}\\right) \\times (6 \\text{ eggs})\n     \\]\n     \\[\n     = \\frac{2 \\times 6}{6} \\text{ dollars}\n     \\]\n     \\[\n     = 2 \\text{ dollars}\n     \\]\n\n### Conclusion:\n- Janet makes 2 dollars daily at the farmers' market.\n- The answer is: **2** cents."

More details here: https://gist.github.com/SumanthRH/847a328c121c1463b8b8aca6d548224f

The reason is that the vLLM server's generation config defaults are different. Passing --generation-config vllm fixes the issue.
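As a hedged illustration of why the two pathways diverge: the server reads the model's generation_config.json and applies it as request defaults, while explicit SamplingParams passed to AsyncLLMEngine.generate start from vLLM's own defaults. The model name below is an arbitrary example.

from transformers import GenerationConfig

# Inspect the sampling defaults shipped with a model (example repo, not necessarily
# the one used in these tests).
gen_cfg = GenerationConfig.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

# If generation_config.json sets temperature/top_p/top_k, the vLLM server applies
# them to requests by default; launching the server with --generation-config vllm
# makes it ignore these and use vLLM defaults, matching AsyncLLMEngine.generate.
print(gen_cfg.temperature, gen_cfg.top_p, getattr(gen_cfg, "top_k", None))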

Switch to /inference/v1/generate for RemoteInferenceClient.generate

For RemoteInferenceClient.generate, I noticed that we were re-tokenizing intermediate tokens (on abort), which can cause small drifts since tokenization is not invertible. The solution is to not rely on /v1/completions and instead use the token-in-token-out endpoint /inference/v1/generate; this also makes it compatible with accumulating logprobs returned from the server. There can also be silent issues with the completions API, as shown above. For RL, it is best to use the /generate endpoint.
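A minimal sketch of the retokenization drift, assuming a Hugging Face tokenizer (the model and the partial text are arbitrary examples):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

# Token ids accumulated so far for an in-flight request (illustrative only).
partial_ids = tok.encode("<search>weather in SF</search", add_special_tokens=False)

# A text-based endpoint like /v1/completions forces a detokenize -> retokenize
# round trip when a request is aborted and resumed:
text = tok.decode(partial_ids)
reencoded = tok.encode(text, add_special_tokens=False)

# decode -> encode is not guaranteed to be the identity, so reencoded can differ
# from partial_ids; /inference/v1/generate stays in token space end to end.
print(reencoded == partial_ids)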

Support response logprobs for RemoteInferenceClient

Adds support for response_logprobs in RemoteInferenceClient. Note that there are some slight differences in sampling_params between /inference/v1/generate and AsyncLLMEngine.generate. As per the OpenAI completions API, logprobs=0 is meant to return the logprob of the chosen token (same as logprobs=1). However, /inference/v1/generate treats logprobs=0 as logprobs=null and doesn't return any logprobs. This is a vLLM issue; I have created a PR: vllm-project/vllm#34010. While we wait for it to land, I believe it is overall better to rely on logprobs=1 for getting the logprob of the chosen token. It also lends itself better to truthy checks like if logprobs:.
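A small sketch of why logprobs=1 composes better with truthy checks; the dict below is illustrative, not the exact request payload:

sampling_params = {
    "temperature": 1.0,
    "logprobs": 1,  # request the logprob of the chosen token
}

def wants_logprobs(params: dict) -> bool:
    # With logprobs=0, this truthy check would report False even though the
    # OpenAI completions API defines 0 as "chosen token only"; logprobs=1 avoids
    # the ambiguity and sidesteps the /inference/v1/generate behavior above.
    return bool(params.get("logprobs"))

assert wants_logprobs(sampling_params)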

Support test_skyrl_gym_generator for _SKYRL_USE_NEW_INFERENCE=1

  1. Switches SkyRLGymGenerator to provide input tokens rather than text to generate, because the new pathway only supports tokens.
  2. Fixes the search env's SearchEnv.validate_action to strip newlines. With /inference/v1/generate, only output tokens are returned (unlike AsyncLLMEngine.generate, where output text is also available). With output tokens, the LLM can generate a trailing newline - producing [<search, >\n] as opposed to [<search>] - so one would need to postprocess the detokenized output text to ensure that strings end exactly with the stop string. There are two possible fixes (a sketch of the second follows after this section):
     1. Have RemoteInferenceClient do custom postprocessing for generate based on stop strings.
     2. Make validation less strict in the SearchEnv (it is the only env with this strict parsing).

I prefer fix 2, because the RemoteInferenceClient layer should be pretty much a pass-through and operate in token space.
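For reference, a minimal sketch of the looser validation in fix 2; the function shape and the closing tags are assumptions about the search env, not the exact code:

def validate_action(action: str) -> bool:
    # Strip trailing whitespace/newlines introduced by token-level stop handling
    # before checking that the action ends with an expected stop string.
    stripped = action.rstrip()
    return stripped.endswith("</search>") or stripped.endswith("</answer>")

assert validate_action("<search>latest weather in SF</search>\n")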

Comment on lines 133 to -138

finally:
    ray.shutdown()

Member Author

Tests already use ray_init_fixture, which handles cleanup
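For context, a ray_init_fixture typically looks something like the sketch below; this is an assumption about the fixture's shape, not the repo's exact code.

import pytest
import ray

@pytest.fixture
def ray_init_fixture():
    # Start a fresh Ray instance for the test...
    ray.init(ignore_reinit_error=True)
    yield
    # ...and always shut it down afterwards, so individual tests don't need
    # their own try/finally around ray.shutdown().
    ray.shutdown()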

@SumanthRH SumanthRH marked this pull request as ready for review February 6, 2026 21:02

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request primarily refactors the inference engine interaction to consistently use token-based generation and properly handle log probabilities, especially for Token Importance Sampling (TIS).

Key changes include updating various configuration files and example scripts to set logprobs=1 instead of logprobs=0 for sampling parameters, and modifying the RemoteInferenceClient to use a new /inference/v1/generate endpoint that operates with token_ids and returns response_logprobs. The skyrl_gym_generator.py is updated to explicitly use token-in-token-out for consistency.

Additionally, the GPU CI script is updated to include a new test for the skyrl_gym_generator, and several test files (test_engine_generation.py, test_inference_engine_client_http_endpoint.py, test_lora.py, test_megatron_worker.py, test_pause_and_continue_generation.py, test_policy_local_engines_e2e.py, test_save_weights_for_sampler.py, test_skyrl_gym_generator.py, test_verifiers_generator.py) are refactored to use a new InferenceEngineState context manager for managing engine lifecycle and to align with the token-based generation approach. A minor change also strips trailing newlines from action strings in skyrl_gym/envs/search/env.py.

Review comments suggest simplifying a conditional expression for response_logprobs and addressing an inconsistency in tokenizer.apply_chat_template regarding add_special_tokens.

"stop_reason": stop_reason,
"response_ids": final_token_ids,
"response_ids": accum_token_ids,
"response_logprobs": response_logprobs if len(response_logprobs) > 0 else None,
Contributor

medium

The expression response_logprobs if len(response_logprobs) > 0 else None can be simplified. Since an empty list [] evaluates to False in a boolean context, you can use a more concise and Pythonic expression.

Suggested change:
- "response_logprobs": response_logprobs if len(response_logprobs) > 0 else None,
+ "response_logprobs": response_logprobs if response_logprobs else None,

@kouroshHakha kouroshHakha left a comment

Overall this is good. I wish you had broken the PR down into individual smaller PRs for each part. It's OK for now, but for the next PRs, let's make sure orthogonal features are kept in separate PRs.

prompt_token_ids = self.tokenizer.apply_chat_template(
    prompts,
    add_generation_prompt=True,
    add_special_tokens=False,
Collaborator

was this intentional?

# Run tests for new inference layer
_SKYRL_USE_NEW_INFERENCE=1 uv run --isolated --extra dev --extra vllm pytest -s tests/gpu/gpu_ci/test_policy_local_engines_e2e.py -m "vllm"
_SKYRL_USE_NEW_INFERENCE=1 uv run --isolated --extra dev --extra vllm pytest -s tests/gpu/gpu_ci/test_engine_generation.py -m "vllm"
_SKYRL_USE_NEW_INFERENCE=1 uv run --isolated --extra dev --extra vllm pytest -s tests/gpu/gpu_ci/test_skyrl_gym_generator.py
Collaborator

Can we create a list here of which tests would fail and need work if we switch to _SKYRL_USE_NEW_INFERENCE=1 uv run <options> pytest -s tests/gpu/gpu_ci/?

In the next PRs, as we fix those tests, we will add more lines here.

In the end, at some point we will just run the tests with the new inference flag on the gpu_ci folder (some tests will be skipped: those that test InferenceEngineClient).
