[Bug] Status.MaxWorkerReplicas overflow when numOfHosts > 1 #4153

@win5923

Description

Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

Applying the following YAML will cause an overflow in Status.MaxWorkerReplicas:

workerGroupSpecs:
  - replicas: 1
    minReplicas: 3
    numOfHosts: 4

By default, maxReplicas is set to 2147483647.

// MaxReplicas denotes the maximum number of desired Pods for this worker group, and the default value is maxInt32.
// +kubebuilder:default:=2147483647
MaxReplicas *int32 `json:"maxReplicas"`

However, CalculateMaxReplicas does not guard against overflow when multiplying by numOfHosts, so Status.MaxWorkerReplicas can end up with an incorrect value. With the default maxReplicas and numOfHosts: 4, the int32 multiplication 2147483647 * 4 wraps around to -4, which is the value reported in the status below.

// CalculateMaxReplicas calculates max worker replicas at the cluster level
func CalculateMaxReplicas(cluster *rayv1.RayCluster) int32 {
    count := int32(0)
    for _, nodeGroup := range cluster.Spec.WorkerGroupSpecs {
        if nodeGroup.Suspend != nil && *nodeGroup.Suspend {
            continue
        }
        count += (*nodeGroup.MaxReplicas * nodeGroup.NumOfHosts)
    }
    return count
}
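
A minimal sketch of one possible fix, assuming that clamping the result at math.MaxInt32 is acceptable (the clamping behavior is a suggestion, not something decided in this issue): accumulate in int64 so the multiplication cannot wrap, then clamp before writing the int32 status field. This would require importing "math".

// CalculateMaxReplicas calculates max worker replicas at the cluster level.
// Accumulating in int64 and clamping keeps the int32 status field from wrapping around.
func CalculateMaxReplicas(cluster *rayv1.RayCluster) int32 {
    count := int64(0)
    for _, nodeGroup := range cluster.Spec.WorkerGroupSpecs {
        if nodeGroup.Suspend != nil && *nodeGroup.Suspend {
            continue
        }
        count += int64(*nodeGroup.MaxReplicas) * int64(nodeGroup.NumOfHosts)
        if count >= int64(math.MaxInt32) {
            return math.MaxInt32 // clamp instead of overflowing (assumed desired behavior)
        }
    }
    return int32(count)
}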

Reproduction script

Apply the following YAML:

# For examples with more realistic resource configuration, see
# ray-cluster.complete.large.yaml and
# ray-cluster.autoscaler.large.yaml.
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-kuberay
spec:
  rayVersion: '2.46.0' # should match the Ray version in the image of the containers
  # Ray head pod template
  headGroupSpec:
    # rayStartParams is optional with RayCluster CRD from KubeRay 1.4.0 or later but required in earlier versions.
    rayStartParams: {}
    template:
      spec:
        schedulerName: default-scheduler
        containers:
        - name: ray-head
          image: rayproject/ray:2.46.0
          resources:
            limits:
              cpu: 1
              memory: 2G
            requests:
              cpu: 1
              memory: 2G
          ports:
          - containerPort: 6379
            name: gcs-server
          - containerPort: 8265 # Ray dashboard
            name: dashboard
          - containerPort: 10001
            name: client
  workerGroupSpecs:
  # the number of Pod replicas in this worker group
  - replicas: 1
    minReplicas: 3
    numOfHosts: 4
    # logical group name; here it is called workergroup
    groupName: workergroup
    # rayStartParams is optional with RayCluster CRD from KubeRay 1.4.0 or later but required in earlier versions.
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-worker # must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name' or '123-abc')
          image: rayproject/ray:2.46.0
          resources:
            limits:
              cpu: 1
              memory: 1G
            requests:
              cpu: 1
              memory: 1G

Then check the RayCluster status:

$ kubectl get raycluster raycluster-kuberay -o yaml
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"ray.io/v1","kind":"RayCluster","metadata":{"annotations":{},"name":"raycluster-kuberay","namespace":"default"},"spec":{"headGroupSpec":{"rayStartParams":{},"template":{"spec":{"containers":[{"image":"rayproject/ray:2.46.0","name":"ray-head","ports":[{"containerPort":6379,"name":"gcs-server"},{"containerPort":8265,"name":"dashboard"},{"containerPort":10001,"name":"client"}],"resources":{"limits":{"cpu":1,"memory":"2G"},"requests":{"cpu":1,"memory":"2G"}}}],"schedulerName":"default-scheduler"}}},"rayVersion":"2.46.0","workerGroupSpecs":[{"groupName":"workergroup","minReplicas":3,"numOfHosts":4,"rayStartParams":{},"replicas":1,"template":{"spec":{"containers":[{"image":"rayproject/ray:2.46.0","name":"ray-worker","resources":{"limits":{"cpu":1,"memory":"1G"},"requests":{"cpu":1,"memory":"1G"}}}]}}}]}}
  creationTimestamp: "2025-10-28T16:05:15Z"
  generation: 1
  name: raycluster-kuberay
  namespace: default
  resourceVersion: "13168"
  uid: c83e5bd6-c5ee-4a22-b615-341fee75725a
....
status:
  availableWorkerReplicas: 12
  conditions:
  - lastTransitionTime: "2025-10-28T16:05:36Z"
    message: ""
    reason: HeadPodRunningAndReady
    status: "True"
    type: HeadPodReady
  - lastTransitionTime: "2025-10-28T16:06:43Z"
    message: All Ray Pods are ready for the first time
    reason: AllPodRunningAndReadyFirstTime
    status: "True"
    type: RayClusterProvisioned
  - lastTransitionTime: "2025-10-28T16:05:36Z"
    message: ""
    reason: RayClusterSuspended
    status: "False"
    type: RayClusterSuspended
  - lastTransitionTime: "2025-10-28T16:05:36Z"
    message: ""
    reason: RayClusterSuspending
    status: "False"
    type: RayClusterSuspending
  desiredCPU: "5"
  desiredGPU: "0"
  desiredMemory: 6G
  desiredTPU: "0"
  desiredWorkerReplicas: 12
  endpoints:
    client: "10001"
    dashboard: "8265"
    gcs-server: "6379"
    metrics: "8080"
  head:
    podIP: 10.244.0.34
    podName: raycluster-kuberay-head-g5wzm
    serviceIP: 10.244.0.34
    serviceName: raycluster-kuberay-head-svc
  lastUpdateTime: "2025-10-28T16:06:43Z"
  maxWorkerReplicas: -4                <---- overflowed
  minWorkerReplicas: 12
  observedGeneration: 1
  readyWorkerReplicas: 12
  state: ready
  stateTransitionTimes:
    ready: "2025-10-28T16:06:43Z"
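
For reference, the reported maxWorkerReplicas of -4 is exactly what int32 wraparound produces for the default maxReplicas multiplied by numOfHosts: 4. A standalone illustration (not KubeRay code):

package main

import (
    "fmt"
    "math"
)

func main() {
    maxReplicas := int32(math.MaxInt32) // default maxReplicas: 2147483647
    numOfHosts := int32(4)

    // Signed int32 multiplication wraps around in Go: 2147483647 * 4 == -4
    fmt.Println(maxReplicas * numOfHosts) // prints -4
}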

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Labels

bug (Something isn't working), good-first-issue (Good for newcomers)
