Description
Reproduction
I started vLLM using the following command:
CUDA_VISIBLE_DEVICES=0,1 trl vllm-serve \
  --model /home/user/qwen3-4B-instruct \
  --tensor-parallel-size 2 \
  --data-parallel-size 1 \
  --dtype bfloat16 \
  --max-model-len 4608 \
  --gpu-memory-utilization 0.9
Then I started GRPO using the following command:
CUDA_VISIBLE_DEVICES=3 accelerate launch \
  examples/scripts/grpo_retriever.py \
  --model_name_or_path /home/user/qwen3-4B-instruct \
  --output_dir grpo_retrieval \
  --learning_rate 1e-5 \
  --dtype bfloat16 \
  --max_prompt_length 512 \
  --max_completion_length 300 \
  --temperature 0.2 \
  --top_p 0.1 \
  --use_peft \
  --lora_target_modules q_proj k_proj v_proj o_proj gate_proj up_proj down_proj \
  --log_completions \
  --per_device_train_batch_size 1 \
  --gradient_accumulation_steps 2 \
  --num_generations 2 \
  --save_steps 50 \
  --save_total_limit 3 \
  --use_vllm \
  --vllm_mode server \
  --vllm_server_host 127.0.0.1 \
  --vllm_server_port 8000 \
  --vllm_importance_sampling_correction \
  --vllm_importance_sampling_cap 2.0 \
  --vllm_guided_decoding_regex '^'
Output:
[E1118 17:30:24.226257071 socket.cpp:1019] [c10d] The client socket has timed out after 300000ms while trying to connect to (127.0.0.1, 51216).
[W1118 17:30:24.226757417 TCPStore.cpp:340] [c10d] TCP client failed to connect/validate to host 127.0.0.1:51216 - retrying (try=0, timeout=300000ms, delay=32101ms): The client socket has timed out after 300000ms while trying to connect to (127.0.0.1, 51216).
Exception raised from throwTimeoutError at /pytorch/torch/csrc/distributed/c10d/socket.cpp:1021 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x80 (0x70eb99af2eb0 in /home/user/.conda/envs/trl/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x5d694d1 (0x70eb7d3bb4d1 in /home/user/.conda/envs/trl/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x1369132 (0x70eb789bb132 in /home/user/.conda/envs/trl/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x5ddddeb (0x70eb7d42fdeb in /home/user/.conda/envs/trl/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #4: <unknown function> + 0x5dddf8b (0x70eb7d42ff8b in /home/user/.conda/envs/trl/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #5: <unknown function> + 0x5dde324 (0x70eb7d430324 in /home/user/.conda/envs/trl/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0x5d61dc3 (0x70eb7d3b3dc3 in /home/user/.conda/envs/trl/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #7: c10d::TCPStore::TCPStore(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, c10d::TCPStoreOptions const&) + 0x41d (0x70eb7d3ba7cd in /home/user/.conda/envs/trl/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #8: <unknown function> + 0xc11045 (0x70eb8c71b045 in /home/user/.conda/envs/trl/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #9: <unknown function> + 0xc4b5c6 (0x70eb8c7555c6 in /home/user/.conda/envs/trl/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x381cdf (0x70eb8be8bcdf in /home/user/.conda/envs/trl/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #11: <unknown function> + 0x13bf46 (0x5ed479ef0f46 in /home/user/.conda/envs/trl/bin/python3.10)
frame #12: _PyObject_MakeTpCall + 0x24b (0x5ed479eea0db in /home/user/.conda/envs/trl/bin/python3.10)
frame #13: <unknown function> + 0x1493e7 (0x5ed479efe3e7 in /home/user/.conda/envs/trl/bin/python3.10)
frame #14: PyVectorcall_Call + 0xc7 (0x5ed479efee07 in /home/user/.conda/envs/trl/bin/python3.10)
frame #15: <unknown function> + 0x147101 (0x5ed479efc101 in /home/user/.conda/envs/trl/bin/python3.10)
frame #16: <unknown function> + 0x13546d (0x5ed479eea46d in /home/user/.conda/envs/trl/bin/python3.10)
frame #17: <unknown function> + 0x380a3b (0x70eb8be8aa3b in /home/user/.conda/envs/trl/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #18: _PyObject_MakeTpCall + 0x24b (0x5ed479eea0db in /home/user/.conda/envs/trl/bin/python3.10)
frame #19: _PyEval_EvalFrameDefault + 0x54ea (0x5ed479ee651a in /home/user/.conda/envs/trl/bin/python3.10)
frame #20: _PyFunction_Vectorcall + 0x6c (0x5ed479ef141c in /home/user/.conda/envs/trl/bin/python3.10)
frame #21: _PyEval_EvalFrameDefault + 0x130f (0x5ed479ee233f in /home/user/.conda/envs/trl/bin/python3.10)
frame #22: <unknown function> + 0x149107 (0x5ed479efe107 in /home/user/.conda/envs/trl/bin/python3.10)
frame #23: _PyEval_EvalFrameDefault + 0x130f (0x5ed479ee233f in /home/user/.conda/envs/trl/bin/python3.10)
frame #24: _PyFunction_Vectorcall + 0x6c (0x5ed479ef141c in /home/user/.conda/envs/trl/bin/python3.10)
frame #25: _PyObject_FastCallDictTstate + 0x190 (0x5ed479ee95e0 in /home/user/.conda/envs/trl/bin/python3.10)
frame #26: <unknown function> + 0x146c67 (0x5ed479efbc67 in /home/user/.conda/envs/trl/bin/python3.10)
frame #27: _PyObject_MakeTpCall + 0x283 (0x5ed479eea113 in /home/user/.conda/envs/trl/bin/python3.10)
frame #28: _PyEval_EvalFrameDefault + 0x54ea (0x5ed479ee651a in /home/user/.conda/envs/trl/bin/python3.10)
frame #29: <unknown function> + 0x1d260c (0x5ed479f8760c in /home/user/.conda/envs/trl/bin/python3.10)
frame #30: PyEval_EvalCode + 0x85 (0x5ed479f87555 in /home/user/.conda/envs/trl/bin/python3.10)
frame #31: <unknown function> + 0x203f7a (0x5ed479fb8f7a in /home/user/.conda/envs/trl/bin/python3.10)
frame #32: <unknown function> + 0x1fe973 (0x5ed479fb3973 in /home/user/.conda/envs/trl/bin/python3.10)
frame #33: <unknown function> + 0x976b0 (0x5ed479e4c6b0 in /home/user/.conda/envs/trl/bin/python3.10)
frame #34: _PyRun_SimpleFileObject + 0x1bb (0x5ed479fae03b in /home/user/.conda/envs/trl/bin/python3.10)
frame #35: _PyRun_AnyFileObject + 0x44 (0x5ed479fadbd4 in /home/user/.conda/envs/trl/bin/python3.10)
frame #36: Py_RunMain + 0x371 (0x5ed479fab041 in /home/user/.conda/envs/trl/bin/python3.10)
frame #37: Py_BytesMain + 0x37 (0x5ed479f7a887 in /home/user/.conda/envs/trl/bin/python3.10)
frame #38: <unknown function> + 0x29d90 (0x70eb9a4a8d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #39: __libc_start_main + 0x80 (0x70eb9a4a8e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #40: <unknown function> + 0x1c576e (0x5ed479f7a76e in /home/user/.conda/envs/trl/bin/python3.10)
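The log above shows the client timing out while connecting to (127.0.0.1, 51216). As a rough way to narrow this down, the sketch below checks whether anything is accepting TCP connections on that address. It uses only the Python standard library; `can_connect` is a hypothetical helper written for this illustration and is not part of TRL, vLLM, or PyTorch. The port is taken from the log and may differ in other setups.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds."""
    try:
        # create_connection raises OSError (including timeouts and refusals) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example call, using the address from the log above (an assumption about this setup):
# can_connect("127.0.0.1", 51216)
```

If this returns False while the training script is waiting, nothing is listening on that port yet, which would be consistent with the timeout in the log. It does not by itself explain why the listener is missing.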
System Info
trl: 0.24.0.dev0
vllm: 0.10.2
Checklist
- I have checked that my issue isn't already filed (see open issues)
- I have included my system information
- Any code provided is minimal, complete, and reproducible (more on MREs)
- Any code provided is properly formatted in code blocks (no screenshots, more on code blocks)
- Any traceback provided is complete