Only rank=0 metrics endpoint should be fetched in autoscaler if podGroupSize is configured #1664

@Jeffwan

🐛 Describe the bug

When podGroupSize is used, the HTTP server is launched only on the rank=0 pod; the remaining pods are pure GPU workers. In that case, the autoscaler should not fetch the metrics endpoint on those worker pods. This check is currently missing in the autoscaler.

Steps to Reproduce

Deploy a workload with podGroupSize set to a value other than 1.

Expected behavior

The autoscaler should fetch metrics only from pods with rank=0.
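
A minimal sketch of the expected filtering, not the actual autoscaler code. The `Pod` struct is a simplified stand-in for a Kubernetes pod, and the label key `example.com/pod-group-rank`, the function name `metricsTargets`, and the rule that a podGroupSize of 1 or less means "no grouping" are all assumptions made for illustration.

```go
package main

// Pod is a simplified stand-in for a Kubernetes pod (assumption:
// only the name and labels matter for choosing metrics targets).
type Pod struct {
	Name   string
	Labels map[string]string
}

// rankLabelKey is a hypothetical label key that carries the pod's
// rank within its pod group. The real key may differ.
const rankLabelKey = "example.com/pod-group-rank"

// metricsTargets returns the pods whose metrics endpoint should be fetched.
// Without grouping (podGroupSize <= 1, an assumption) every pod serves
// metrics, so all pods are returned. With grouping, only rank=0 pods run
// the HTTP server, so pods with any other rank, or no rank label, are skipped.
func metricsTargets(pods []Pod, podGroupSize int) []Pod {
	if podGroupSize <= 1 {
		return pods
	}
	targets := make([]Pod, 0, len(pods))
	for _, p := range pods {
		// Looking up a key in a nil map yields "", so unlabeled pods are skipped.
		if p.Labels[rankLabelKey] == "0" {
			targets = append(targets, p)
		}
	}
	return targets
}
```

With podGroupSize set to 2 and two groups of two pods each, this returns only the two rank=0 pods, so the worker pods are never queried.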

Environment

nightly
