
During rollout restart of a deployment the new pod gets terminated within seconds #14705

@gerritvd

Description

What version of Knative?

1.8.1

Expected Behavior

I want to execute a rolling restart on a deployment such that it starts the new pod(s), waits until the pod(s) are running, and then terminates the old pod(s).

Actual Behavior

I noticed that when trying to do a rolling restart on a KServe or Knative deployment, the new pod is terminated within seconds while the old pod remains running.

Steps to Reproduce the Problem

We are running KServe 0.10.0 on top of Knative 1.8.1 with Istio 1.13.3, on Kubernetes 1.27.

To narrow down where the issue might be I did the following:

1. Run a basic KServe example: sklearn-iris

This shows the behavior I described above. I won't go into much detail here, because the behavior also shows up with a basic Knative deployment, as described below.
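
For reference, a minimal sketch of the InferenceService used, based on the sklearn-iris example from the KServe docs (the storageUri is assumed to be the documented sample model):

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      # sample model from the KServe docs (assumption: unmodified example)
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"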

2. Run a basic Knative Serving example: helloworld-go

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: hello
spec:
  template:
    spec:
      containers:
        - image: ghcr.io/knative/helloworld-go:latest
          ports:
            - containerPort: 8080
          env:
            - name: TARGET
              value: "World"

This starts a pod with 2 containers:

  1. queue-proxy
  2. user-container

Then I do a rolling restart:

$ kubectl rollout restart deploy -n <namespace> hello-00001-deployment

and I witness the same behavior: the new pod gets terminated within seconds.
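
The churn can also be watched live while the restart happens (a sketch, using the same Knative-managed label as above):

$ kubectl get pods -n <namespace> -w -l serving.knative.dev/service=hello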
Looking at the event timeline we see:

$ kubectl get events -n <namespace> --sort-by='.lastTimestamp'
10m         Normal    Created                        configuration/hello                                                   Created Revision "hello-00001"
10m         Normal    ScalingReplicaSet              deployment/hello-00001-deployment                                     Scaled up replica set hello-00001-deployment-64ddfc4766 to 1
10m         Normal    Created                        service/hello                                                         Created Route "hello"
10m         Normal    Created                        service/hello                                                         Created Configuration "hello"
10m         Normal    FinalizerUpdate                route/hello                                                           Updated "hello" finalizers
10m         Warning   FinalizerUpdateFailed          route/hello                                                           Failed to update finalizers for "hello": Operation cannot be fulfilled on routes.serving.knative.dev "hello": the object has been modified; please apply your changes to the latest version and try again
10m         Warning   InternalError                  revision/hello-00001                                                  failed to update deployment "hello-00001-deployment": Operation cannot be fulfilled on deployments.apps "hello-00001-deployment": the object has been modified; please apply your changes to the latest version and try again
10m         Normal    SuccessfulCreate               replicaset/hello-00001-deployment-64ddfc4766                          Created pod: hello-00001-deployment-64ddfc4766-zzrcb
10m         Normal    Pulled                         pod/hello-00001-deployment-64ddfc4766-zzrcb                           Successfully pulled image "artifactory-server.net/docker-local/u/user/kserve/qpext:latest" in 343.203487ms (343.211177ms including waiting)
10m         Normal    Created                        pod/hello-00001-deployment-64ddfc4766-zzrcb                           Created container queue-proxy
10m         Normal    Pulling                        pod/hello-00001-deployment-64ddfc4766-zzrcb                           Pulling image "artifactory-server.net/docker-local/u/user/kserve/qpext:latest"
10m         Normal    Started                        pod/hello-00001-deployment-64ddfc4766-zzrcb                           Started container user-container
10m         Normal    Created                        pod/hello-00001-deployment-64ddfc4766-zzrcb                           Created container user-container
10m         Normal    Pulled                         pod/hello-00001-deployment-64ddfc4766-zzrcb                           Container image "ghcr.io/knative/helloworld-go@sha256:d2af882a7ae7d49c61ae14a35dac42825ccf455ba616d8e5c1f4fa8fdcf09807" already present on machine
10m         Normal    Started                        pod/hello-00001-deployment-64ddfc4766-zzrcb                           Started container queue-proxy
32s         Normal    ScalingReplicaSet              deployment/hello-00001-deployment                                     Scaled up replica set hello-00001-deployment-7c46db99df to 1
32s         Normal    SuccessfulDelete               replicaset/hello-00001-deployment-7c46db99df                          Deleted pod: hello-00001-deployment-7c46db99df-nxp9x
32s         Normal    SuccessfulCreate               replicaset/hello-00001-deployment-7c46db99df                          Created pod: hello-00001-deployment-7c46db99df-nxp9x
32s         Normal    ScalingReplicaSet              deployment/hello-00001-deployment                                     Scaled down replica set hello-00001-deployment-7c46db99df to 0 from 1
31s         Normal    Created                        pod/hello-00001-deployment-7c46db99df-nxp9x                           Created container user-container
31s         Normal    Pulled                         pod/hello-00001-deployment-7c46db99df-nxp9x                           Container image "ghcr.io/knative/helloworld-go@sha256:d2af882a7ae7d49c61ae14a35dac42825ccf455ba616d8e5c1f4fa8fdcf09807" already present on machine
31s         Normal    Killing                        pod/hello-00001-deployment-7c46db99df-nxp9x                           Stopping container user-container
31s         Normal    Pulled                         pod/hello-00001-deployment-7c46db99df-nxp9x                           Successfully pulled image "artifactory-server.net/docker-local/u/user/kserve/qpext:latest" in 421.949534ms (421.969854ms including waiting)
31s         Normal    Started                        pod/hello-00001-deployment-7c46db99df-nxp9x                           Started container user-container
31s         Normal    Pulling                        pod/hello-00001-deployment-7c46db99df-nxp9x                           Pulling image "artifactory-server.net/docker-local/u/user/kserve/qpext:latest"
31s         Normal    Created                        pod/hello-00001-deployment-7c46db99df-nxp9x                           Created container queue-proxy
31s         Normal    Started                        pod/hello-00001-deployment-7c46db99df-nxp9x                           Started container queue-proxy
31s         Normal    Killing                        pod/hello-00001-deployment-7c46db99df-nxp9x                           Stopping container queue-proxy
2s          Warning   Unhealthy                      pod/hello-00001-deployment-7c46db99df-nxp9x                           Readiness probe failed: HTTP probe failed with statuscode: 503
2s          Warning   Unhealthy                      pod/hello-00001-deployment-7c46db99df-nxp9x                           Readiness probe failed: Get "http://10.245.142.234:8012/": dial tcp 10.245.142.234:8012: connect: connection refused

The first section is the startup of the Knative service (10m). What I do see here is this error:

10m         Warning   InternalError                  revision/hello-00001                                                  failed to update deployment "hello-00001-deployment": Operation cannot be fulfilled on deployments.apps "hello-00001-deployment": the object has been modified; please apply your changes to the latest version and try again

However, the service starts up fine and can be queried without issue. I am not sure whether this is relevant to this problem, but I also noticed this error in our KServe deployments.

The second part (32s and less) happens when running the rolling restart. We see no errors, but the events show the deployment scaling the new replica set up to 1 and immediately back down to 0: it terminates the very pod it just created and keeps the old one running.
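
To catch why the new pod is killed, its state can be inspected before it disappears (a sketch; the pod name is taken from the events above, and this only works while the pod object still exists):

$ kubectl describe pod -n <namespace> hello-00001-deployment-7c46db99df-nxp9x
$ kubectl logs -n <namespace> hello-00001-deployment-7c46db99df-nxp9x -c queue-proxy --previous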

3. Run a basic Kubernetes deployment

To rule out a Kubernetes-specific issue, I created a basic nginx deployment and did a rolling restart on it:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80
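
For completeness, the restart was triggered the same way as before (a sketch; same placeholder namespace):

$ kubectl rollout restart deploy -n <namespace> nginx-deployment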

This deployment actually does restart the pod as expected:

  1. Creates new pod
  2. Waits until ready
  3. Terminates old pod

The event log for this deployment looks like:

2m4s        Normal    ScalingReplicaSet              deployment/nginx-deployment                                           Scaled up replica set nginx-deployment-7c68c5c8dc to 1
2m3s        Normal    Created                        pod/nginx-deployment-7c68c5c8dc-tbjcm                                 Created container nginx
2m3s        Normal    SuccessfulCreate               replicaset/nginx-deployment-7c68c5c8dc                                Created pod: nginx-deployment-7c68c5c8dc-tbjcm
2m3s        Normal    Started                        pod/nginx-deployment-7c68c5c8dc-tbjcm                                 Started container nginx
2m3s        Normal    Pulled                         pod/nginx-deployment-7c68c5c8dc-tbjcm                                 Container image "nginx:1.14.2" already present on machine
11s         Normal    ScalingReplicaSet              deployment/nginx-deployment                                           Scaled up replica set nginx-deployment-68fbb8c788 to 1
11s         Normal    Pulled                         pod/nginx-deployment-68fbb8c788-gw4cn                                 Container image "nginx:1.14.2" already present on machine
11s         Normal    SuccessfulCreate               replicaset/nginx-deployment-68fbb8c788                                Created pod: nginx-deployment-68fbb8c788-gw4cn
11s         Normal    Created                        pod/nginx-deployment-68fbb8c788-gw4cn                                 Created container nginx
10s         Normal    Started                        pod/nginx-deployment-68fbb8c788-gw4cn                                 Started container nginx
9s          Normal    ScalingReplicaSet              deployment/nginx-deployment                                           Scaled down replica set nginx-deployment-7c68c5c8dc to 0 from 1
9s          Normal    SuccessfulDelete               replicaset/nginx-deployment-7c68c5c8dc                                Deleted pod: nginx-deployment-7c68c5c8dc-tbjcm
9s          Normal    Killing                        pod/nginx-deployment-7c68c5c8dc-tbjcm                                 Stopping container nginx
8s          Normal    SuccessfulDelete               replicaset/nginx-deployment-7c68c5c8dc                                Deleted pod: nginx-deployment-7c68c5c8dc-tbjcm

This makes me think the issue is with our Knative Serving setup.

  1. How can we further debug this issue? (One starting point is sketched below.)
  2. Why is there an internal error when first starting a Knative service?
  3. How can I figure out why the new pod gets terminated within seconds?
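
For (1) and (3), a first place to look might be the Knative Serving controller and autoscaler logs for decisions about this revision (a sketch; the deployment and namespace names assume a default Knative install):

$ kubectl logs -n knative-serving deploy/controller | grep hello-00001
$ kubectl logs -n knative-serving deploy/autoscaler | grep hello-00001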

Labels

kind/bug, triage/accepted
