@ishandhanani
Collaborator

Problem

When running multi-node inference with --nnodes > 1 and --node-rank >= 1, the dummy health check server logs that it's started but is not actually accessible:

# Launch command
python3 -m sglang.launch_server \
  --model-path /model/ \
  --host 0.0.0.0 \
  --nnodes 2 \
  --node-rank 1 \
  --enable-metrics \
  --dist-init-addr 10.30.1.187:29500 \
  --tp 8

# Log output
2025-10-28T21:32:44.763434Z  INFO common.launch_dummy_health_check_server: Dummy health check server scheduled on existing loop at 0.0.0.0:30000

# But curl fails
curl http://localhost:30000/health
# Connection refused or timeout

Inference works correctly across nodes, but the health check and metrics endpoints are unreachable on non-zero rank nodes.

Root Cause

The issue occurs in launch_dummy_health_check_server() when called from an async context (e.g., custom distributed runtime wrappers or when an event loop policy is set):

  1. asyncio.get_running_loop() succeeds and finds the parent event loop
  2. loop.create_task(server.serve()) schedules the server as a task
  3. The function returns immediately
  4. The main thread then blocks on proc.join() waiting for scheduler processes
  5. The scheduled task never executes because the loop that owns it is now blocked

# Old buggy code
try:
    loop = asyncio.get_running_loop()
    loop.create_task(server.serve())  # ← Scheduled but never runs
except RuntimeError:
    server.run()
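The starvation described in steps 1 to 5 can be reproduced without sglang. The sketch below is a stand-alone illustration, not the project's code: `fake_serve` and `launcher` are hypothetical names, and `time.sleep` stands in for the blocking `proc.join()` call.

```python
import asyncio
import time

ran = []  # records whether the scheduled coroutine ever executed


async def fake_serve():
    # Hypothetical stand-in for server.serve().
    ran.append("started")


async def launcher():
    loop = asyncio.get_running_loop()
    task = loop.create_task(fake_serve())  # scheduled, but not yet run

    # Stand-in for proc.join(): a synchronous call that blocks the loop's thread.
    time.sleep(0.2)

    # Nothing has been awaited yet, so the loop never had a chance to run the task.
    started_while_blocked = bool(ran)

    await task  # control returns to the loop only here
    return started_while_blocked


started_while_blocked = asyncio.run(launcher())
print("started while blocked:", started_while_blocked)
print("started after yielding:", bool(ran))
```

In the real failure the blocking call never returns while the schedulers are alive, so the equivalent of `await task` is never reached and the server never binds its port.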

Solution

Run the health check server in a dedicated daemon thread with its own event loop:

def run_server():
    asyncio.run(server.serve())

thread = threading.Thread(target=run_server, daemon=True, name="health-check-server")
thread.start()

Benefits:

  • Works in both async and sync contexts
  • Doesn't block the main thread
  • Daemon thread does not keep the process alive; it is stopped automatically when the process exits
  • Creates isolated event loop independent of parent context
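To make the pattern concrete, here is a minimal sketch that uses only the standard library. It is an assumption-laden stand-in rather than the sglang implementation: `start_health_server` is a hypothetical helper, and `asyncio.start_server` replaces the real HTTP server object so the example stays self-contained.

```python
import asyncio
import socket
import threading


def start_health_server(host="127.0.0.1", port=0):
    """Serve a minimal HTTP 200 from a daemon thread; return (thread, bound_port)."""
    ready = threading.Event()
    info = {}

    async def handle(reader, writer):
        await reader.readuntil(b"\r\n\r\n")  # consume the request headers
        writer.write(
            b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\nConnection: close\r\n\r\nOK"
        )
        await writer.drain()
        writer.close()
        await writer.wait_closed()

    async def serve():
        server = await asyncio.start_server(handle, host, port)
        info["port"] = server.sockets[0].getsockname()[1]
        ready.set()  # tell the caller the socket is bound
        async with server:
            await server.serve_forever()

    # asyncio.run() creates a fresh event loop owned by this thread only.
    thread = threading.Thread(
        target=lambda: asyncio.run(serve()),
        daemon=True,
        name="health-check-server",
    )
    thread.start()
    if not ready.wait(timeout=5):
        raise RuntimeError("health server did not start")
    return thread, info["port"]


thread, port = start_health_server()

# The main thread may now block (for example on proc.join()) without starving the server.
with socket.create_connection(("127.0.0.1", port), timeout=5) as conn:
    conn.sendall(b"GET /health HTTP/1.1\r\nHost: localhost\r\n\r\n")
    chunks = []
    while chunk := conn.recv(1024):
        chunks.append(chunk)
response = b"".join(chunks)
print(response.split(b"\r\n")[0])
```

Waiting on the `threading.Event` is an addition for the demo; it lets the caller know the port is bound before returning, which the snippet in the PR does not do.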

Testing

Multi-node setup:

# Node 0
python3 -m sglang.launch_server --model-path /model/ --nnodes 2 --node-rank 0 --enable-metrics --tp 8

# Node 1  
python3 -m sglang.launch_server --model-path /model/ --nnodes 2 --node-rank 1 --enable-metrics --tp 8

# Verify health checks work on both nodes
curl http://node0:30000/health  # ✓ Works
curl http://node1:30000/health  # ✓ Now works (previously failed)
curl http://node1:30000/metrics # ✓ Metrics accessible
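For scripted checks, the curl calls above could be replaced with a small probe like the following. This is only a sketch: `is_healthy` is a hypothetical helper, and it assumes the endpoint answers a plain GET with HTTP 200.

```python
import urllib.error
import urllib.request


def is_healthy(base_url, path="/health", timeout=2.0):
    """Return True if GET base_url + path answers with HTTP 200, else False."""
    try:
        with urllib.request.urlopen(base_url + path, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or a non-2xx response.
        return False
```

Calling `is_healthy("http://node1:30000")` from a monitoring script would then report `False` for the pre-fix behavior (connection refused or timeout) instead of raising.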

Custom async runtime:

import sglang as sgl

async def init():
    engine = sgl.Engine(...)  # Called from async context
    # Health check server now accessible even when Engine() is called in async context

Files Changed

  • python/sglang/srt/utils/common.py: Fixed launch_dummy_health_check_server() to use background thread

@gemini-code-assist
Contributor

Summary of Changes

Hello @ishandhanani, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request addresses a critical bug where the dummy health check server failed to become accessible on non-zero rank nodes in multi-node inference setups. By refactoring the server's launch mechanism to utilize a separate daemon thread with its own event loop, the health check and metrics endpoints are now reliably available across all nodes, improving the robustness and observability of distributed SGLang deployments.

Highlights

  • Health Check Accessibility Fix: Resolved an issue where the dummy health check server was inaccessible on non-zero rank nodes during multi-node inference, preventing health checks and metrics from functioning correctly.
  • Root Cause Identification: Identified the root cause as the health check server task being scheduled on the main event loop, which then blocked, preventing the server from ever starting.
  • Concurrency Model Change: Implemented a solution to run the health check server in a dedicated daemon thread with its own asyncio event loop, ensuring it operates independently and doesn't block the main process.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request effectively resolves the issue of the dummy health check server being inaccessible on non-zero rank nodes. The root cause analysis is spot-on, and the solution of running the server in a dedicated daemon thread with its own event loop is robust and well-implemented. The change is clear and correct. I have one suggestion to further improve the robustness by adding exception handling in the new thread.

Comment on lines 2364 to 2365
def run_server():
    asyncio.run(server.serve())

medium

The new implementation correctly runs the server in a background thread. However, if an exception occurs within server.serve() (e.g., the port is already in use), it will be raised in the background thread and terminate it silently. This could make debugging difficult as the main process would be unaware of the failure.

It's good practice to wrap the call in a try...except block to log any exceptions that occur in the thread. This will provide visibility into failures of the health check server.

Suggested change
- def run_server():
-     asyncio.run(server.serve())
+ def run_server():
+     try:
+         asyncio.run(server.serve())
+     except Exception:
+         logger.exception("Dummy health check server thread failed unexpectedly.")
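The effect of the suggested wrapper can be seen in isolation with a short sketch. Everything here is illustrative: `failing_serve` is a hypothetical stand-in for a `server.serve()` that fails at startup, and `ListHandler` plus the logger name exist only so the demo can inspect what was logged.

```python
import asyncio
import logging
import threading

logger = logging.getLogger("health_check_demo")  # hypothetical logger name


class ListHandler(logging.Handler):
    """Collects log records in memory so the demo can inspect them."""

    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append(record)


captured = ListHandler()
logger.addHandler(captured)
logger.propagate = False  # keep the demo's output limited to the print below


async def failing_serve():
    # Stand-in for server.serve() failing, e.g. because the port is taken.
    raise OSError("address already in use")


def run_server():
    try:
        asyncio.run(failing_serve())
    except Exception:
        # logger.exception records the message at ERROR level with the traceback.
        logger.exception("Dummy health check server thread failed unexpectedly.")


thread = threading.Thread(target=run_server, daemon=True, name="health-check-server")
thread.start()
thread.join(timeout=5)

print(captured.records[0].getMessage())
```

Without the wrapper, Python's default `threading.excepthook` still prints the traceback to stderr, but the failure bypasses the application's logger and its configured handlers, which is the visibility gap the review comment is pointing at.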
