Run vLLM inference using torchtitan model definition (single GPU) #2119
base: main
Conversation
tianyu-l left a comment
Left some comments; ignore if you still plan to make changes.
| return output |
| class VLLMPagedFlashAttention(torch.nn.Module): |
What's this for? I thought we only run inference.
By design, this class should be able to run both training and inference in the future, so we have one single attention class. For now it is only used for inference. I will clean up and remove the non-inference-related parts. cc @zhxchen17
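Roughly what I have in mind (a minimal sketch, not this PR's actual code; the class name, attribute names, and the SDPA fallback are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DualModeAttention(nn.Module):
    """One attention module for both modes: vLLM paged attention at inference
    time, plain SDPA at training time (no paged KV cache)."""

    def __init__(self, n_heads: int, head_dim: int, vllm_attn=None):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = head_dim
        # Populated only when the model is constructed under the vLLM engine.
        self.vllm_attn = vllm_attn

    def forward(self, q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        if not self.training and self.vllm_attn is not None:
            # Inference path: vLLM manages the paged KV cache internally.
            return self.vllm_attn(q, k, v)
        # Training path: standard causal scaled dot-product attention.
        return F.scaled_dot_product_attention(q, k, v, is_causal=True)
```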
| class FeedForwardVLLMCompat(nn.Module): |
| class TorchTitanQwen3ForCausalLM(nn.Module): |
we should aim for at most 1 general wrapper for all models -- we shouldn't have 1 wrapper for each model.
Yes, that's the goal! The current wrapper is specific to the Qwen3 model, since we access each model layer by name. Let me think about how to design the interface.
Suppose each model author defines the model in slightly different ways; then it seems to be up to the people doing RL to make the adaptation work, like we prototyped here.
So I guess we need some sort of contract baked in at authoring time, e.g. models should annotate/implement their attention layers in certain ways (having some sort of base class or special methods?).
Agreed with @zhxchen17.
I think we can start by adding a BaseTransformer class (in torchtitan/protocols/model.py) with the standard layers defined, so that other (text) models can inherit from it, e.g. https://github.com/pytorch/torchtitan/blob/main/torchtitan/models/llama4/model/model.py#L425
But this won't stop a model from not using some of the layers (e.g. tok_embeddings). We can make subclasses less error-prone by removing their forward functions, since they are identical.
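A rough sketch of what such a contract could look like (a sketch only; the attribute names follow this thread, but the forward signature and the ModuleDict assumption are mine):

```python
import torch
import torch.nn as nn


class BaseTransformer(nn.Module):
    """Fixes the standard layer names and the forward pass; per-model
    subclasses only construct the submodules."""

    # Standard layers every text model is expected to provide.
    tok_embeddings: nn.Embedding
    layers: nn.ModuleDict
    norm: nn.Module
    output: nn.Module
    rope_cache: torch.Tensor  # rope_cache vs. freqs_cis naming still to be aligned

    def forward(self, tokens, attention_masks=None, positions=None):
        h = self.tok_embeddings(tokens)
        for layer in self.layers.values():
            h = layer(h, self.rope_cache, attention_masks, positions)
        return self.output(self.norm(h))
```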
Cool! The refactor of adding a BaseTransformer class can be handled in a separate PR; it would also need the rope_cache and freqs_cis naming to be aligned.
Also, I was thinking about how to design the single general wrapper for the vLLM model.
There are 4 things that are model specific and need to be plugged into the general wrapper class before registering the model with vLLM:
- model class itself
- model_args class
- model's parallel plan (mainly TP; PP needs to be considered separately)
- state dict adapter
I refactored the code into a basic wrapper plus plug-in components, which get wired in by inheriting from the wrapper when registering the model. A sketch is below. Wdyt?
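Something along these lines (a sketch under my assumptions; `register_titan_model`, the class body, and the constructor signature are illustrative, only `ModelRegistry.register_model` is the real vLLM API):

```python
import torch.nn as nn
from vllm import ModelRegistry


class TorchTitanVLLMModel(nn.Module):
    """Generic vLLM-facing wrapper; concrete subclasses only supply the plug-ins."""

    model_cls = None            # torchtitan model class
    model_args_cls = None       # its model args class
    parallelize_fn = None       # parallel plan (TP for now; PP handled separately)
    state_dict_adapter = None   # HF <-> torchtitan state dict conversion

    def __init__(self, *, vllm_config, prefix: str = ""):
        super().__init__()
        # In practice model_args would be derived from vllm_config.
        model_args = self.model_args_cls()
        self.model = self.model_cls(model_args)


def register_titan_model(arch_name, model_cls, model_args_cls, parallelize_fn, state_dict_adapter):
    # Create a concrete subclass at registration time ("inheriting when
    # registering") and hand it to vLLM's model registry.
    wrapper = type(
        f"{arch_name}TitanWrapper",
        (TorchTitanVLLMModel,),
        dict(
            model_cls=model_cls,
            model_args_cls=model_args_cls,
            parallelize_fn=staticmethod(parallelize_fn),
            state_dict_adapter=state_dict_adapter,
        ),
    )
    ModelRegistry.register_model(arch_name, wrapper)
    return wrapper
```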
> this would need an alignment between rope_cache and freqs_cis naming as well
Why? Is it because we put freqs_cis generation outside model for CP? I think @fegin has a plan to move it back into the model.
No, it's not related to CP. I was thinking we'd need to access self.rope_cache / self.freqs_cis during the forward of BaseTransformer:
| h = layer(h, self.rope_cache, attention_masks, positions) |
| positions_2d = positions.unsqueeze(0)  # [total_tokens] -> [1, total_tokens] |
| # Get embeddings from 2D tokens |
| h = self.model.tok_embeddings(tokens_2d)  # [1, total_tokens, hidden_size] |
qq: curious, do the input tokens always have batch size 1?
I'm not 100% sure about this part, but I'd guess not. I've only tried single prompts for now; I'll leave a TODO here for batched inference. A shape sketch of my current understanding is below.
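(An assumption about how the vLLM v1 engine batches requests; the numbers are made up:)

```python
import torch

# The engine packs all in-flight sequences into one flat tensor of token ids
# rather than a [batch, seq] matrix; e.g. two prompts of lengths 3 and 2:
tokens = torch.tensor([11, 12, 13, 21, 22])   # [total_tokens] = [5]
positions = torch.tensor([0, 1, 2, 0, 1])     # per-token positions

# The wrapper adds a leading dim to match torchtitan's [batch, seq] convention:
tokens_2d = tokens.unsqueeze(0)               # [1, total_tokens]
positions_2d = positions.unsqueeze(0)         # [1, total_tokens]

# The attention metadata (block tables, sequence lengths) keeps the sequences
# separate, so this leading "batch" dim stays 1 even with many prompts in flight.
```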
| register() |
| def parse_args(): |
not urgent, but we should use "our" config system in the long term
Currently the entry point is the vLLM engine, so we are taking the config from whatever the vLLM engine passes to us. Let me check the vLLM engine to see if there's anything we could do.
Wait, how is this related to the vLLM config system? You are just using them as-is in args = parse_args().
These args are only for the infer.py script; the script passes them into the vLLM engine's LLM(), the engine creates a VllmConfig instance internally, and that config is what gets passed to our model wrapper. A sketch of the flow is below.
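(A minimal sketch of the flow, not this PR's infer.py; the flag names and model id are placeholders:)

```python
import argparse

from vllm import LLM, SamplingParams


def parse_args():
    # These flags only configure this script, not the model wrapper.
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", default="Qwen/Qwen3-0.6B")
    parser.add_argument("--prompt", default="Hello, world")
    parser.add_argument("--max-tokens", type=int, default=64)
    return parser.parse_args()


def main():
    args = parse_args()
    # LLM() builds a VllmConfig internally and hands it to the registered wrapper.
    llm = LLM(model=args.model, enforce_eager=True)
    outputs = llm.generate([args.prompt], SamplingParams(max_tokens=args.max_tokens))
    print(outputs[0].outputs[0].text)


if __name__ == "__main__":
    main()
```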
> the script passes them into the vLLM engine's LLM()
I don't think it's passing the args to LLM(). What would be different if we used our config manager to construct args?
| model_args.n_heads |
| if model_args.n_kv_heads is None |
| else model_args.n_kv_heads |
| def _replice_with_vllm_paged_attention(self, model_args): |
Can we expose this as a util function? It could also be reused for the trainer model, e.g. something like the sketch below.
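(One possible shape for such a util, under my assumptions about the attribute names in torchtitan's Attention module and the constructor of the paged-attention class; not this PR's actual code:)

```python
import torch.nn as nn


def replace_with_vllm_paged_attention(model: nn.Module, model_args, attn_cls) -> nn.Module:
    """Swap each transformer block's core attention op for `attn_cls`
    (e.g. VLLMPagedFlashAttention from this PR), keeping the projections."""
    n_kv_heads = (
        model_args.n_heads if model_args.n_kv_heads is None else model_args.n_kv_heads
    )
    head_dim = model_args.dim // model_args.n_heads
    for layer in model.layers.values():
        # 'inner_attention' is an assumed attribute name for the SDPA call site.
        layer.attention.inner_attention = attn_cls(
            num_heads=model_args.n_heads,
            num_kv_heads=n_kv_heads,
            head_dim=head_dim,
        )
    return model
```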
Force-pushed from d7b714b to 226150d (Compare)
torchtitan/experiments/deterministic_vllm_rl/models/base_wrapper.py (outdated; conversation resolved)
| return parallel_dims |
| def build_device_mesh_and_parallelize( |
Curious, how will this function be used? Won't the vLLM engine handle TP=2 for us?
> Won't the vLLM engine handle TP=2 for us?
vLLM applies TP by patching each module (https://docs.vllm.ai/en/latest/contributing/model/basic/#3-optional-implement-tensor-parallelism-and-quantization-support), which also changes the model definition and happens during model initialization.
This function calls the parallelize_qwen3 function from qwen3/infra/parallelize.py; depending on parallel_dims, it applies the different parallelisms. A sketch is below.
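(Roughly this, mirroring the training-side flow; the exact signatures are assumptions:)

```python
def build_device_mesh_and_parallelize(model, parallel_dims, parallelize_fn, job_config):
    """Apply torchtitan's own parallelism plan to the model during init.

    parallel_dims owns the DeviceMesh; the per-model parallelize function
    (e.g. parallelize_qwen3) reads the "tp" sub-mesh from it and applies the
    TP plan to the module in place.
    """
    if parallel_dims.tp_enabled:
        model = parallelize_fn(model, parallel_dims=parallel_dims, job_config=job_config)
    return model
```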
| logits = self.model.output(h) |
| if isinstance(logits, DTensor): |
| logits = logits.full_tensor() |
Have you verified this for TP, or does it only work on a single GPU? If it's the latter, add a TODO around such conversions.
Currently this PR only works on a single GPU; I will add TP in a follow-up PR.
| layer_idx = next(VLLMPagedFlashAttention._layer_counter) |
| prefix = f"layers.{layer_idx}" |
| self.vllm_attn = Attention( |
why can't we always use VLLMPagedFlashAttention?
| model_cls=train_spec.model_cls, |
| model_args_cls=model_args_cls, |
| state_dict_adapter=train_spec.state_dict_adapter, |
| parallelize_fn=train_spec.parallelize_fn, |
It seems we need these fields and the TorchTitanVLLMModel / TorchTitanVLLMModelFromSpec wrappers because we rely on vLLM's LLM() API to create the model.
This is hacky and makes things complicated, as we are dumping a lot of logic (originally in train.py and checkpoint.py) into the model code itself.
I feel this is unnecessary if our end goal is to use the engine part of vLLM, not the model-init part.
> dumping a lot of logic (originally in train.py and checkpoint.py) into the model code itself
Agreed. The main blocker is that we need control over how the Worker instantiates a model.
According to the vLLM design, this class is not only a model nn.Module but also acts as a model runner (https://github.com/vllm-project/vllm/blob/main/vllm/v1/worker/worker_base.py#L85), which is why it has a load_weights function. Roughly, the expected interface looks like the sketch below.
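(Approximate interface only; the from_hf call and the attributes built in __init__ are placeholders for what the wrapper actually does:)

```python
from typing import Iterable

import torch
import torch.nn as nn


class TorchTitanVLLMModel(nn.Module):
    """Shape of the interface vLLM's worker expects from a registered model."""

    def __init__(self, *, vllm_config, prefix: str = ""):
        super().__init__()
        # The torchtitan module and the state dict adapter are built here from
        # the plug-ins; construction details omitted in this sketch.
        self.model: nn.Module = nn.Identity()
        self.state_dict_adapter = None

    def forward(self, input_ids: torch.Tensor, positions: torch.Tensor, **kwargs) -> torch.Tensor:
        # Called by the engine with flattened token ids and positions.
        ...

    def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]) -> None:
        # The worker streams HF-format (name, tensor) pairs; convert them with
        # the state dict adapter and load into the torchtitan module.
        hf_state_dict = dict(weights)
        titan_state_dict = self.state_dict_adapter.from_hf(hf_state_dict)
        self.model.load_state_dict(titan_state_dict, strict=False)
```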
| if parallel_dims.tp_enabled: |
| self.world_mesh = parallel_dims.world_mesh |
| tp_mesh = self.world_mesh["tp"] |
| parallelize_fn( |
Wondering why we parallelize the model during init? This model is used during vLLM inference, and I thought vLLM has its own TP impl?
We want to apply our own TP instead of using vLLM's TP implementation. We won't have direct access to the model once LLM() is initialized, so we can't apply TP later.
As titled; put it in the deterministic RL folder.