Description
What version of Knative?
1.8.1
Expected Behavior
I want to execute a rolling restart on a deployment such that it starts the new pod(s), waits until the new pod(s) are running, and then terminates the old pod(s).
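For a plain Deployment, that flow can be triggered and verified like this (resource names are placeholders):
$ kubectl rollout restart deployment/<deployment-name> -n <namespace>
# should only return once the new pod is ready and the old one has been terminated
$ kubectl rollout status deployment/<deployment-name> -n <namespace>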
Actual Behavior
I noticed that when trying to do a rolling restart on a KServe or Knative deployment, the new pod is immediately terminated and the old pod remains running.
Steps to Reproduce the Problem
We are running KServe 0.10.0 on top of Knative 1.8.1 with Istio 1.13.3, on Kubernetes 1.27.
To narrow down where the issue might be, I did the following:
1. Run a basic KServe example: sklearn-iris
This shows the behavior described above. I won't go into much detail here, because the behavior also shows up with a basic Knative deployment, as described below.
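For reference, the manifest is roughly the stock sklearn-iris example from the KServe docs (shown as in the upstream example, with its sample model storageUri):
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"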
2. Run a basic Knative Serving example: helloworld-go
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: hello
spec:
  template:
    spec:
      containers:
        - image: ghcr.io/knative/helloworld-go:latest
          ports:
            - containerPort: 8080
          env:
            - name: TARGET
              value: "World"
This starts a pod with 2 containers:
- queue-proxy
- user-container
Then I do a rolling restart:
$ kubectl rollout restart deploy -n <namespace> hello-00001-deployment
and I witness the same behavior: the new pod gets terminated within seconds.
Looking at the event timeline we see:
$ kubectl get events -n <namespace> --sort-by='.lastTimestamp'
10m Normal Created configuration/hello Created Revision "hello-00001"
10m Normal ScalingReplicaSet deployment/hello-00001-deployment Scaled up replica set hello-00001-deployment-64ddfc4766 to 1
10m Normal Created service/hello Created Route "hello"
10m Normal Created service/hello Created Configuration "hello"
10m Normal FinalizerUpdate route/hello Updated "hello" finalizers
10m Warning FinalizerUpdateFailed route/hello Failed to update finalizers for "hello": Operation cannot be fulfilled on routes.serving.knative.dev "hello": the object has been modified; please apply your changes to the latest version and try again
10m Warning InternalError revision/hello-00001 failed to update deployment "hello-00001-deployment": Operation cannot be fulfilled on deployments.apps "hello-00001-deployment": the object has been modified; please apply your changes to the latest version and try again
10m Normal SuccessfulCreate replicaset/hello-00001-deployment-64ddfc4766 Created pod: hello-00001-deployment-64ddfc4766-zzrcb
10m Normal Pulled pod/hello-00001-deployment-64ddfc4766-zzrcb Successfully pulled image "artifactory-server.net/docker-local/u/user/kserve/qpext:latest" in 343.203487ms (343.211177ms including waiting)
10m Normal Created pod/hello-00001-deployment-64ddfc4766-zzrcb Created container queue-proxy
10m Normal Pulling pod/hello-00001-deployment-64ddfc4766-zzrcb Pulling image "artifactory-server.net/docker-local/u/user/kserve/qpext:latest"
10m Normal Started pod/hello-00001-deployment-64ddfc4766-zzrcb Started container user-container
10m Normal Created pod/hello-00001-deployment-64ddfc4766-zzrcb Created container user-container
10m Normal Pulled pod/hello-00001-deployment-64ddfc4766-zzrcb Container image "ghcr.io/knative/helloworld-go@sha256:d2af882a7ae7d49c61ae14a35dac42825ccf455ba616d8e5c1f4fa8fdcf09807" already present on machine
10m Normal Started pod/hello-00001-deployment-64ddfc4766-zzrcb Started container queue-proxy
32s Normal ScalingReplicaSet deployment/hello-00001-deployment Scaled up replica set hello-00001-deployment-7c46db99df to 1
32s Normal SuccessfulDelete replicaset/hello-00001-deployment-7c46db99df Deleted pod: hello-00001-deployment-7c46db99df-nxp9x
32s Normal SuccessfulCreate replicaset/hello-00001-deployment-7c46db99df Created pod: hello-00001-deployment-7c46db99df-nxp9x
32s Normal ScalingReplicaSet deployment/hello-00001-deployment Scaled down replica set hello-00001-deployment-7c46db99df to 0 from 1
31s Normal Created pod/hello-00001-deployment-7c46db99df-nxp9x Created container user-container
31s Normal Pulled pod/hello-00001-deployment-7c46db99df-nxp9x Container image "ghcr.io/knative/helloworld-go@sha256:d2af882a7ae7d49c61ae14a35dac42825ccf455ba616d8e5c1f4fa8fdcf09807" already present on machine
31s Normal Killing pod/hello-00001-deployment-7c46db99df-nxp9x Stopping container user-container
31s Normal Pulled pod/hello-00001-deployment-7c46db99df-nxp9x Successfully pulled image "artifactory-server.net/docker-local/u/user/kserve/qpext:latest" in 421.949534ms (421.969854ms including waiting)
31s Normal Started pod/hello-00001-deployment-7c46db99df-nxp9x Started container user-container
31s Normal Pulling pod/hello-00001-deployment-7c46db99df-nxp9x Pulling image "artifactory-server.net/docker-local/u/user/kserve/qpext:latest"
31s Normal Created pod/hello-00001-deployment-7c46db99df-nxp9x Created container queue-proxy
31s Normal Started pod/hello-00001-deployment-7c46db99df-nxp9x Started container queue-proxy
31s Normal Killing pod/hello-00001-deployment-7c46db99df-nxp9x Stopping container queue-proxy
2s Warning Unhealthy pod/hello-00001-deployment-7c46db99df-nxp9x Readiness probe failed: HTTP probe failed with statuscode: 503
2s Warning Unhealthy pod/hello-00001-deployment-7c46db99df-nxp9x Readiness probe failed: Get "http://10.245.142.234:8012/": dial tcp 10.245.142.234:8012: connect: connection refused
The first section (10m) is the startup of the Knative Service. What I do see there is this error:
10m Warning InternalError revision/hello-00001 failed to update deployment "hello-00001-deployment": Operation cannot be fulfilled on deployments.apps "hello-00001-deployment": the object has been modified; please apply your changes to the latest version and try again
However, the service starts up fine and can be queried without issue. I'm not sure whether this is relevant to the problem, but I also see this error in our KServe deployments.
The second part (32s and newer) happens during the rolling restart. There are no errors, but the events show that the deployment terminates the same pod it just created and keeps the old one running.
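To dig further into what scales the new ReplicaSet back down, the next things I can think of checking are the Knative controller logs and the replica counts enforced on the deployment (assuming a default knative-serving install; the revision label is what a stock install puts on the pods/ReplicaSets):
$ kubectl -n knative-serving logs deploy/controller --since=15m | grep hello-00001
$ kubectl -n <namespace> get deploy hello-00001-deployment -o jsonpath='{.spec.replicas}{"\n"}'
$ kubectl -n <namespace> get rs -l serving.knative.dev/revision=hello-00001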
3. Run a basic Kubernetes deployment
To rule out a Kubernetes-specific issue, I created a basic nginx deployment and did a rolling restart on it:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx:1.14.2
          ports:
            - containerPort: 80
This deployment actually does restart the pod as expected:
- Creates new pod
- Waits until ready
- Terminates old pod
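This can also be double-checked with (namespace is a placeholder):
$ kubectl rollout status deployment/nginx-deployment -n <namespace>
$ kubectl get rs -n <namespace> -l app=nginx
where rollout status returns only after the new pod is ready, and the ReplicaSet listing shows the old one scaled to 0.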
The event log for this deployment looks like:
2m4s Normal ScalingReplicaSet deployment/nginx-deployment Scaled up replica set nginx-deployment-7c68c5c8dc to 1
2m3s Normal Created pod/nginx-deployment-7c68c5c8dc-tbjcm Created container nginx
2m3s Normal SuccessfulCreate replicaset/nginx-deployment-7c68c5c8dc Created pod: nginx-deployment-7c68c5c8dc-tbjcm
2m3s Normal Started pod/nginx-deployment-7c68c5c8dc-tbjcm Started container nginx
2m3s Normal Pulled pod/nginx-deployment-7c68c5c8dc-tbjcm Container image "nginx:1.14.2" already present on machine
11s Normal ScalingReplicaSet deployment/nginx-deployment Scaled up replica set nginx-deployment-68fbb8c788 to 1
11s Normal Pulled pod/nginx-deployment-68fbb8c788-gw4cn Container image "nginx:1.14.2" already present on machine
11s Normal SuccessfulCreate replicaset/nginx-deployment-68fbb8c788 Created pod: nginx-deployment-68fbb8c788-gw4cn
11s Normal Created pod/nginx-deployment-68fbb8c788-gw4cn Created container nginx
10s Normal Started pod/nginx-deployment-68fbb8c788-gw4cn Started container nginx
9s Normal ScalingReplicaSet deployment/nginx-deployment Scaled down replica set nginx-deployment-7c68c5c8dc to 0 from 1
9s Normal SuccessfulDelete replicaset/nginx-deployment-7c68c5c8dc Deleted pod: nginx-deployment-7c68c5c8dc-tbjcm
9s Normal Killing pod/nginx-deployment-7c68c5c8dc-tbjcm Stopping container nginx
8s Normal SuccessfulDelete replicaset/nginx-deployment-7c68c5c8dc Deleted pod: nginx-deployment-7c68c5c8dc-tbjcm
This makes me think the issue is with our Knative Serving setup.
- How can we further debug this issue?
- Why is there an internal error when first starting a Knative Service?
- How can I figure out why the new pod gets terminated within seconds?