Labels: bug, good-first-issue
Description
Search before asking
- I searched the issues and found no similar issues.
KubeRay Component
ray-operator
What happened + What you expected to happen
Applying the following YAML will cause an overflow in Status.MaxWorkerReplicas:
workerGroupSpecs:
  - replicas: 1
    minReplicas: 3
    numOfHosts: 4
By default, maxReplicas is set to 2147483647.
kuberay/ray-operator/apis/ray/v1/raycluster_types.go
Lines 112 to 114 in 3471f99
// MaxReplicas denotes the maximum number of desired Pods for this worker group, and the default value is maxInt32.
// +kubebuilder:default:=2147483647
MaxReplicas *int32 `json:"maxReplicas"`
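With numOfHosts: 4, the default value multiplied by the host count no longer fits in an int32: 2147483647 × 4 = 8589934588, whose low 32 bits are interpreted as -4, which is exactly what shows up in the status below. A minimal standalone Go sketch (not part of KubeRay) showing the wraparound:

package main

import "fmt"

func main() {
	maxReplicas := int32(2147483647) // default MaxReplicas (maxInt32)
	numOfHosts := int32(4)
	// 2147483647 * 4 = 8589934588 does not fit in 32 bits;
	// the product wraps around and is interpreted as -4.
	fmt.Println(maxReplicas * numOfHosts) // prints -4
}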
However, in CalculateMaxReplicas there is no check to prevent overflow when multiplying by numOfHosts, which can result in an incorrect value for Status.MaxWorkerReplicas.
kuberay/ray-operator/controllers/ray/utils/util.go
Lines 394 to 405 in 3471f99
// CalculateMaxReplicas calculates max worker replicas at the cluster level
func CalculateMaxReplicas(cluster *rayv1.RayCluster) int32 {
    count := int32(0)
    for _, nodeGroup := range cluster.Spec.WorkerGroupSpecs {
        if nodeGroup.Suspend != nil && *nodeGroup.Suspend {
            continue
        }
        count += (*nodeGroup.MaxReplicas * nodeGroup.NumOfHosts)
    }
    return count
}
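A possible direction for a fix (a hypothetical sketch, not existing KubeRay code) is to accumulate in int64 and clamp the result to math.MaxInt32 before it is stored in the status:

import (
    "math"

    rayv1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1"
)

// calculateMaxReplicasSafe is a hypothetical overflow-safe variant of
// CalculateMaxReplicas: per-group products and the running total are computed
// in int64, and the result is clamped to math.MaxInt32.
func calculateMaxReplicasSafe(cluster *rayv1.RayCluster) int32 {
    count := int64(0)
    for _, nodeGroup := range cluster.Spec.WorkerGroupSpecs {
        if nodeGroup.Suspend != nil && *nodeGroup.Suspend {
            continue
        }
        count += int64(*nodeGroup.MaxReplicas) * int64(nodeGroup.NumOfHosts)
        if count >= math.MaxInt32 {
            return math.MaxInt32
        }
    }
    return int32(count)
}

Clamping to math.MaxInt32 would preserve the "effectively unlimited" meaning of the default instead of reporting a negative capacity; if other helpers in util.go multiply replica counts by NumOfHosts in the same way, they would presumably need the same guard.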
Reproduction script
Applying the following YAML:
# For examples with more realistic resource configuration, see
# ray-cluster.complete.large.yaml and
# ray-cluster.autoscaler.large.yaml.
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-kuberay
spec:
  rayVersion: '2.46.0' # should match the Ray version in the image of the containers
  # Ray head pod template
  headGroupSpec:
    # rayStartParams is optional with RayCluster CRD from KubeRay 1.4.0 or later but required in earlier versions.
    rayStartParams: {}
    template:
      spec:
        schedulerName: default-scheduler
        containers:
          - name: ray-head
            image: rayproject/ray:2.46.0
            resources:
              limits:
                cpu: 1
                memory: 2G
              requests:
                cpu: 1
                memory: 2G
            ports:
              - containerPort: 6379
                name: gcs-server
              - containerPort: 8265 # Ray dashboard
                name: dashboard
              - containerPort: 10001
                name: client
  workerGroupSpecs:
    # the pod replicas in this group typed worker
    - replicas: 1
      minReplicas: 3
      numOfHosts: 4
      # logical group name, for this called small-group, also can be functional
      groupName: workergroup
      # rayStartParams is optional with RayCluster CRD from KubeRay 1.4.0 or later but required in earlier versions.
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-worker # must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name', or '123-abc'
              image: rayproject/ray:2.46.0
              resources:
                limits:
                  cpu: 1
                  memory: 1G
                requests:
                  cpu: 1
                  memory: 1G
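Then apply the manifest and inspect the resulting status; the file name below is just an example:

$ kubectl apply -f raycluster-kuberay.yaml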
$ kubectl get raycluster raycluster-kuberay -o yaml
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"ray.io/v1","kind":"RayCluster","metadata":{"annotations":{},"name":"raycluster-kuberay","namespace":"default"},"spec":{"headGroupSpec":{"rayStartParams":{},"template":{"spec":{"containers":[{"image":"rayproject/ray:2.46.0","name":"ray-head","ports":[{"containerPort":6379,"name":"gcs-server"},{"containerPort":8265,"name":"dashboard"},{"containerPort":10001,"name":"client"}],"resources":{"limits":{"cpu":1,"memory":"2G"},"requests":{"cpu":1,"memory":"2G"}}}],"schedulerName":"default-scheduler"}}},"rayVersion":"2.46.0","workerGroupSpecs":[{"groupName":"workergroup","minReplicas":3,"numOfHosts":4,"rayStartParams":{},"replicas":1,"template":{"spec":{"containers":[{"image":"rayproject/ray:2.46.0","name":"ray-worker","resources":{"limits":{"cpu":1,"memory":"1G"},"requests":{"cpu":1,"memory":"1G"}}}]}}}]}}
  creationTimestamp: "2025-10-28T16:05:15Z"
  generation: 1
  name: raycluster-kuberay
  namespace: default
  resourceVersion: "13168"
  uid: c83e5bd6-c5ee-4a22-b615-341fee75725a
....
status:
  availableWorkerReplicas: 12
  conditions:
  - lastTransitionTime: "2025-10-28T16:05:36Z"
    message: ""
    reason: HeadPodRunningAndReady
    status: "True"
    type: HeadPodReady
  - lastTransitionTime: "2025-10-28T16:06:43Z"
    message: All Ray Pods are ready for the first time
    reason: AllPodRunningAndReadyFirstTime
    status: "True"
    type: RayClusterProvisioned
  - lastTransitionTime: "2025-10-28T16:05:36Z"
    message: ""
    reason: RayClusterSuspended
    status: "False"
    type: RayClusterSuspended
  - lastTransitionTime: "2025-10-28T16:05:36Z"
    message: ""
    reason: RayClusterSuspending
    status: "False"
    type: RayClusterSuspending
  desiredCPU: "5"
  desiredGPU: "0"
  desiredMemory: 6G
  desiredTPU: "0"
  desiredWorkerReplicas: 12
  endpoints:
    client: "10001"
    dashboard: "8265"
    gcs-server: "6379"
    metrics: "8080"
  head:
    podIP: 10.244.0.34
    podName: raycluster-kuberay-head-g5wzm
    serviceIP: 10.244.0.34
    serviceName: raycluster-kuberay-head-svc
  lastUpdateTime: "2025-10-28T16:06:43Z"
  maxWorkerReplicas: -4   # <---- overflowed
  minWorkerReplicas: 12
  observedGeneration: 1
  readyWorkerReplicas: 12
  state: ready
  stateTransitionTimes:
    ready: "2025-10-28T16:06:43Z"
Anything else
No response
Are you willing to submit a PR?
- Yes I am willing to submit a PR!