[train] Support logprobs, fix generation config defaults and add more generation tests for the new HTTP inference pathway #1038

Open
SumanthRH wants to merge 13 commits into main from fix-tests-and-migrate-more

Conversation


@SumanthRH SumanthRH commented Feb 6, 2026

What does this PR do?

This PR migrates more tests to the new HTTP inference pathway and adds some missing features, like rollout logprobs support, along the way. It also fixes some test failures on main. The changes are as follows:

Test improvements

Introduces a new InferenceEngineState class to manage inference engine instantiation and state. With better state management, this fixes some cleanup issues for the existing test_policy_local_engines_e2e test in CI.
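For illustration, a minimal sketch of what a state-managing helper like this could look like as a context manager; the names, the engine factory, and the teardown() method are assumptions for the sketch, not the actual implementation in this PR.

from contextlib import AbstractContextManager

class InferenceEngineState(AbstractContextManager):
    """Hypothetical sketch: owns engine creation and guarantees teardown in tests."""

    def __init__(self, engine_factory):
        self._engine_factory = engine_factory  # callable that builds the inference engine
        self.engine = None

    def __enter__(self):
        self.engine = self._engine_factory()
        return self.engine

    def __exit__(self, exc_type, exc, tb):
        # Always release engine resources, even if the test body raised.
        if self.engine is not None and hasattr(self.engine, "teardown"):
            self.engine.teardown()
        self.engine = None
        return False

A test can then wrap engine usage in a with block so cleanup runs even when an assertion fails.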

Configuration fix for vLLM server actor

The vLLM server can have different generation quality from AsyncLLMEngine.generate. I noticed this while going over generations in the weight sync tests:

/v1/completions

"To determine how much Janet makes at the farmers' market each day, let's break down the earnings:\n\n1. **Total hours working:** \n   - It is given that Janet works 5 days a week.\n\n2. **Daily average earnings:**\n   - Janet earns $2 per fresh duck egg per day for sales.\n   \n3. **Number of eggs sold:**  \n   - In addition to selling leftover eggs from fresh Monday through Thursday (the remaining number), Janet also offers muffins for her friends\n     \\[\n     \\text{Total number of muffins} = 4 \\times (\\text{eigglees prepared for guests}) = 4 \\times (16 + 2) = 72 \n     \\]\n\nThe total quantity (eggs for muffins * Price per egg) can be calculated as:\n\\[\n\\text{Tickets baked} = 72 \\, (minutes \\, @$2) = 72 \\, \\text{tickets}\n\\]\nThus, the cost per ticket is:\n\\[\n\\text{Cost per ticket} = \\$2 / \\text{10mins} = \\$2 / 50sime)\n\\]\n\nSumming up all remaining lunch and recipe hours per day:\n\n\\[\n\\text{Total Earnings} = 5 \\,\\text{(days)} \\times \\$50/\\text{nightly salary Increase}_{Sunday-\\} = \\$3,000/circle\nTo sum it up:\n    Total daily income =\n\\[ \\boxed{\\text{\"1}}おり$\\] 現在、人々が食い doGetl@g\\u4ehnの多未にꀲ_mehng 同様に 別の ショルダー・モリ...\n\nIn conclusion:\")\nTherefore, it sees that-\\\nTotlar!:5天 <- コロールのお立場 -\n\nSpecial:Aricci 团体 少女と思ってss聚い虾肝\n\nHe also said he had no problem sharing iotisedmale amountㅣtsful今天 Landoré院に\"r множgli Острая症候群は血液病です。血液分をいっぽ透ゅと光曳行指 proclaimed Chinaの時电网運路~\u00020オ大へegる ≥ Tsusonoの楽誤をカインの血圧のz武通過ということで知らvensalでも高め values SSDgbnsu...."

AsyncLLMEngine.generate

"To determine how much Janet makes at the farmers' market every day, we will follow these steps:\n\n1. **Calculate the daily egg production for lunch:**\n   - She produces 16 eggs per day for meals.\n   - She eats 3 eggs for breakfast.\n   - Therefore, the remaining eggs at the end of the day are \\( 16 - 3 = 13 \\) eggs.\n\n2. **Calculate the daily egg production for dinner:**\n   - She processes 4 muffins without lunch.\n   - She bakes 4 muffins for dinner.\n   - Therefore, the consumed muffins per day are \\( 4 \\times 4 = 16 \\) muffins.\n\n3. **Calculate the daily net production without fattening losses:**\n   - Net production at the farmers' market is the total remaining articles minus the whole produced eggs.\n   - Thus, the net production is \\( 13 - 16 = -3 \\) eggs.\n\n4. **Determine per-purpose production:**\n   - Janet processes 6 eggs per day, which cost her $2.\n   - Hence, processed eggs produce each day \\( \\frac{2}{6} \\) dollars.\n\n5. **Compute the daily earnings:**\n   - Since the net production of egg is -3 eggs, we have:\n     \\[\n     \\text{Daily total earnings} = \\left(\\frac{2}{6}\\right) \\times (6 \\text{ eggs})\n     \\]\n     \\[\n     = \\frac{2 \\times 6}{6} \\text{ dollars}\n     \\]\n     \\[\n     = 2 \\text{ dollars}\n     \\]\n\n### Conclusion:\n- Janet makes 2 dollars daily at the farmers' market.\n- The answer is: **2** cents."

More details here: https://gist.github.com/SumanthRH/847a328c121c1463b8b8aca6d548224f

The reason is that the vLLM server's generation config defaults are different. Passing --generation-config vllm fixes the issue.
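As a hedged illustration of why the two pathways diverge: the server reads the model's generation_config.json and applies it as request defaults, while explicit SamplingParams passed to AsyncLLMEngine.generate start from vLLM's own defaults. The model name below is an arbitrary example.

from transformers import GenerationConfig

# Inspect the sampling defaults shipped with a model (example repo, not necessarily
# the one used in these tests).
gen_cfg = GenerationConfig.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

# If generation_config.json sets temperature/top_p/top_k, the vLLM server applies
# them to requests by default; launching the server with --generation-config vllm
# makes it ignore these and use vLLM defaults, matching AsyncLLMEngine.generate.
print(gen_cfg.temperature, gen_cfg.top_p, getattr(gen_cfg, "top_k", None))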

Switch to /inference/v1/generate for RemoteInferenceClient.generate

For RemoteInferenceClient.generate, I noticed that we were re-tokenizing intermediate tokens (on abort), which can cause small drifts since tokenization is not invertible. The solution is to not rely on /v1/completions and instead use the token-in-token-out endpoint /inference/v1/generate; this also makes it compatible with accumulating logprobs returned from the server. There can also be silent issues with the completions API, as shown above. For RL, it is best to use the /generate endpoint.
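A minimal sketch of the retokenization drift, assuming a Hugging Face tokenizer (the model and the partial text are arbitrary examples):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

# Token ids accumulated so far for an in-flight request (illustrative only).
partial_ids = tok.encode("<search>weather in SF</search", add_special_tokens=False)

# A text-based endpoint like /v1/completions forces a detokenize -> retokenize
# round trip when a request is aborted and resumed:
text = tok.decode(partial_ids)
reencoded = tok.encode(text, add_special_tokens=False)

# decode -> encode is not guaranteed to be the identity, so reencoded can differ
# from partial_ids; /inference/v1/generate stays in token space end to end.
print(reencoded == partial_ids)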

Support response logprobs for RemoteInferenceClient

Adds support for response_logprobs in RemoteInferenceClient. Note that there are some slight differences in sampling_params between /inference/v1/generate and AsyncLLMEngine.generate. As per the OpenAI completions API, logprobs=0 is meant to return the logprob of the chosen token (same as logprobs=1). However, /inference/v1/generate treats logprobs=0 as logprobs=null and doesn't return any logprobs. This is a vLLM issue; I have created a PR: vllm-project/vllm#34010. While we wait for it to land, I believe it is overall better to rely on logprobs=1 for getting the logprob of the chosen token. It also lends itself better to truthy checks like if logprobs:.
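A small sketch of why logprobs=1 composes better with truthy checks; the dict below is illustrative, not the exact request payload:

sampling_params = {
    "temperature": 1.0,
    "logprobs": 1,  # request the logprob of the chosen token
}

def wants_logprobs(params: dict) -> bool:
    # With logprobs=0, this truthy check would report False even though the
    # OpenAI completions API defines 0 as "chosen token only"; logprobs=1 avoids
    # the ambiguity and sidesteps the /inference/v1/generate behavior above.
    return bool(params.get("logprobs"))

assert wants_logprobs(sampling_params)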

Support test_skyrl_gym_generator for _SKYRL_USE_NEW_INFERENCE=1

  1. Switches SkyRLGymGenerator to provide input tokens rather than text to generate, because the new pathway only supports tokens.
  2. Fixes the search env's SearchEnv.validate_action to strip newlines. With /inference/v1/generate, only output tokens are returned (unlike AsyncLLMEngine.generate, where output text is also available). With output tokens, the LLM can generate a trailing newline - producing [<search, >\n] as opposed to [<search>] - so one would need to postprocess the detokenized output text to ensure that strings end exactly with the stop string. There are two possible fixes (a sketch of the second follows after this section):
     1. Have RemoteInferenceClient do custom postprocessing for generate based on stop strings.
     2. Make validation less strict in the SearchEnv (it is the only env with this strict parsing).

I prefer fix 2, because the RemoteInferenceClient layer should be pretty much a pass-through and operate in token space.
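For reference, a minimal sketch of the looser validation in fix 2; the function shape and the closing tags are assumptions about the search env, not the exact code:

def validate_action(action: str) -> bool:
    # Strip trailing whitespace/newlines introduced by token-level stop handling
    # before checking that the action ends with an expected stop string.
    stripped = action.rstrip()
    return stripped.endswith("</search>") or stripped.endswith("</answer>")

assert validate_action("<search>latest weather in SF</search>\n")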

Comment on lines 133 to -138

finally:
    ray.shutdown()

Member Author

Tests already use ray_init_fixture, which handles cleanup
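For context, a ray_init_fixture typically looks something like the sketch below; this is an assumption about the fixture's shape, not the repo's exact code.

import pytest
import ray

@pytest.fixture
def ray_init_fixture():
    # Start a fresh Ray instance for the test...
    ray.init(ignore_reinit_error=True)
    yield
    # ...and always shut it down afterwards, so individual tests don't need
    # their own try/finally around ray.shutdown().
    ray.shutdown()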

@SumanthRH SumanthRH marked this pull request as ready for review February 6, 2026 21:02

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request primarily refactors the inference engine interaction to consistently use token-based generation and properly handle log probabilities, especially for Token Importance Sampling (TIS).

Key changes include updating various configuration files and example scripts to set logprobs=1 instead of logprobs=0 for sampling parameters, and modifying the RemoteInferenceClient to use a new /inference/v1/generate endpoint that operates with token_ids and returns response_logprobs. The skyrl_gym_generator.py is updated to explicitly use token-in-token-out for consistency.

Additionally, the GPU CI script is updated to include a new test for the skyrl_gym_generator, and several test files (test_engine_generation.py, test_inference_engine_client_http_endpoint.py, test_lora.py, test_megatron_worker.py, test_pause_and_continue_generation.py, test_policy_local_engines_e2e.py, test_save_weights_for_sampler.py, test_skyrl_gym_generator.py, test_verifiers_generator.py) are refactored to use a new InferenceEngineState context manager for managing engine lifecycle and to align with the token-based generation approach. A minor change also strips trailing newlines from action strings in skyrl_gym/envs/search/env.py.

Review comments suggest simplifying a conditional expression for response_logprobs and addressing an inconsistency in tokenizer.apply_chat_template regarding add_special_tokens.

"stop_reason": stop_reason,
"response_ids": final_token_ids,
"response_ids": accum_token_ids,
"response_logprobs": response_logprobs if len(response_logprobs) > 0 else None,
Contributor

medium

The expression response_logprobs if len(response_logprobs) > 0 else None can be simplified. Since an empty list [] evaluates to False in a boolean context, you can use a more concise and Pythonic expression.

Suggested change:
- "response_logprobs": response_logprobs if len(response_logprobs) > 0 else None,
+ "response_logprobs": response_logprobs if response_logprobs else None,

@kouroshHakha kouroshHakha left a comment

Overall this is good. I wish you had broken the PR down into individual smaller PRs for each part. It's OK for now, but for the next PRs, let's make sure orthogonal features are kept in separate PRs.

prompt_token_ids = self.tokenizer.apply_chat_template(
    prompts,
    add_generation_prompt=True,
    add_special_tokens=False,
Collaborator

was this intentional?

# Run tests for new inference layer
_SKYRL_USE_NEW_INFERENCE=1 uv run --isolated --extra dev --extra vllm pytest -s tests/gpu/gpu_ci/test_policy_local_engines_e2e.py -m "vllm"
_SKYRL_USE_NEW_INFERENCE=1 uv run --isolated --extra dev --extra vllm pytest -s tests/gpu/gpu_ci/test_engine_generation.py -m "vllm"
_SKYRL_USE_NEW_INFERENCE=1 uv run --isolated --extra dev --extra vllm pytest -s tests/gpu/gpu_ci/test_skyrl_gym_generator.py
Collaborator

Can we create a list here of which tests would fail and need work if we switch to _SKYRL_USE_NEW_INFERENCE=1 uv run <options> pytest -s tests/gpu/gpu_ci/?

In the next PRs, as we fix those tests, we will add more lines here.

In the end, at some point we will just run the tests with the new inference flag on the gpu_ci folder (some tests will be skipped: those that test InferenceEngineClient).
