Description
What version of Knative?
1.8.1
Expected Behavior
I want to execute a rolling restart on a deployment such that it starts the new pod(s), waits until the new pod(s) are running, and then terminates the old pod(s).
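For a plain Deployment, that flow can be triggered and verified like this (resource names are placeholders):
$ kubectl rollout restart deployment/<deployment-name> -n <namespace>
# should only return once the new pod is ready and the old one has been terminated
$ kubectl rollout status deployment/<deployment-name> -n <namespace>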
Actual Behavior
I noticed that when trying to do a rolling restart on a KServe or Knative deployment, the new pod is immediately terminated and the old pod remains running.
Steps to Reproduce the Problem
We are running KServe 0.10.0 on top of Knative 1.8.1 with Istio 1.13.3, on Kubernetes 1.27.
To narrow down where the issue might be, I did the following:
1. Run a basic KServe example: sklearn-iris
This shows the behavior described above. I won't go into much detail here, because the behavior also shows up with a basic Knative deployment, as described below.
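For reference, the manifest is roughly the stock sklearn-iris example from the KServe docs (shown as in the upstream example, with its sample model storageUri):
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"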
2. Run a basic Knative Serving example: helloworld-go
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: hello
spec:
  template:
    spec:
      containers:
        - image: ghcr.io/knative/helloworld-go:latest
          ports:
            - containerPort: 8080
          env:
            - name: TARGET
              value: "World"
This starts a pod with 2 containers:
- queue-proxy
- user-container
Then I do a rolling restart:
$ kubectl rollout restart deploy -n <namespace> hello-00001-deployment
and I witness the same behavior: the new pod gets terminated within seconds.
Looking at the event timeline we see:
$ kubectl get events -n <namespace> --sort-by='.lastTimestamp'
10m Normal Created configuration/hello Created Revision "hello-00001"
10m Normal ScalingReplicaSet deployment/hello-00001-deployment Scaled up replica set hello-00001-deployment-64ddfc4766 to 1
10m Normal Created service/hello Created Route "hello"
10m Normal Created service/hello Created Configuration "hello"
10m Normal FinalizerUpdate route/hello Updated "hello" finalizers
10m Warning FinalizerUpdateFailed route/hello Failed to update finalizers for "hello": Operation cannot be fulfilled on routes.serving.knative.dev "hello": the object has been modified; please apply your changes to the latest version and try again
10m Warning InternalError revision/hello-00001 failed to update deployment "hello-00001-deployment": Operation cannot be fulfilled on deployments.apps "hello-00001-deployment": the object has been modified; please apply your changes to the latest version and try again
10m Normal SuccessfulCreate replicaset/hello-00001-deployment-64ddfc4766 Created pod: hello-00001-deployment-64ddfc4766-zzrcb
10m Normal Pulled pod/hello-00001-deployment-64ddfc4766-zzrcb Successfully pulled image "artifactory-server.net/docker-local/u/user/kserve/qpext:latest" in 343.203487ms (343.211177ms including waiting)
10m Normal Created pod/hello-00001-deployment-64ddfc4766-zzrcb Created container queue-proxy
10m Normal Pulling pod/hello-00001-deployment-64ddfc4766-zzrcb Pulling image "artifactory-server.net/docker-local/u/user/kserve/qpext:latest"
10m Normal Started pod/hello-00001-deployment-64ddfc4766-zzrcb Started container user-container
10m Normal Created pod/hello-00001-deployment-64ddfc4766-zzrcb Created container user-container
10m Normal Pulled pod/hello-00001-deployment-64ddfc4766-zzrcb Container image "ghcr.io/knative/helloworld-go@sha256:d2af882a7ae7d49c61ae14a35dac42825ccf455ba616d8e5c1f4fa8fdcf09807" already present on machine
10m Normal Started pod/hello-00001-deployment-64ddfc4766-zzrcb Started container queue-proxy
32s Normal ScalingReplicaSet deployment/hello-00001-deployment Scaled up replica set hello-00001-deployment-7c46db99df to 1
32s Normal SuccessfulDelete replicaset/hello-00001-deployment-7c46db99df Deleted pod: hello-00001-deployment-7c46db99df-nxp9x
32s Normal SuccessfulCreate replicaset/hello-00001-deployment-7c46db99df Created pod: hello-00001-deployment-7c46db99df-nxp9x
32s Normal ScalingReplicaSet deployment/hello-00001-deployment Scaled down replica set hello-00001-deployment-7c46db99df to 0 from 1
31s Normal Created pod/hello-00001-deployment-7c46db99df-nxp9x Created container user-container
31s Normal Pulled pod/hello-00001-deployment-7c46db99df-nxp9x Container image "ghcr.io/knative/helloworld-go@sha256:d2af882a7ae7d49c61ae14a35dac42825ccf455ba616d8e5c1f4fa8fdcf09807" already present on machine
31s Normal Killing pod/hello-00001-deployment-7c46db99df-nxp9x Stopping container user-container
31s Normal Pulled pod/hello-00001-deployment-7c46db99df-nxp9x Successfully pulled image "artifactory-server.net/docker-local/u/user/kserve/qpext:latest" in 421.949534ms (421.969854ms including waiting)
31s Normal Started pod/hello-00001-deployment-7c46db99df-nxp9x Started container user-container
31s Normal Pulling pod/hello-00001-deployment-7c46db99df-nxp9x Pulling image "artifactory-server.net/docker-local/u/user/kserve/qpext:latest"
31s Normal Created pod/hello-00001-deployment-7c46db99df-nxp9x Created container queue-proxy
31s Normal Started pod/hello-00001-deployment-7c46db99df-nxp9x Started container queue-proxy
31s Normal Killing pod/hello-00001-deployment-7c46db99df-nxp9x Stopping container queue-proxy
2s Warning Unhealthy pod/hello-00001-deployment-7c46db99df-nxp9x Readiness probe failed: HTTP probe failed with statuscode: 503
2s Warning Unhealthy pod/hello-00001-deployment-7c46db99df-nxp9x Readiness probe failed: Get "http://10.245.142.234:8012/": dial tcp 10.245.142.234:8012: connect: connection refused
The first section (10m) is the startup of the Knative Service. What I do see there is this error:
10m Warning InternalError revision/hello-00001 failed to update deployment "hello-00001-deployment": Operation cannot be fulfilled on deployments.apps "hello-00001-deployment": the object has been modified; please apply your changes to the latest version and try again
However, the service starts up fine and can be queried without issue. I'm not sure whether this is relevant to the problem, but I also see this error in our KServe deployments.
The second part (32s and newer) happens during the rolling restart. There are no errors, but the events show that the deployment terminates the same pod it just created and keeps the old one running.
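To dig further into what scales the new ReplicaSet back down, the next things I can think of checking are the Knative controller logs and the replica counts enforced on the deployment (assuming a default knative-serving install; the revision label is what a stock install puts on the pods/ReplicaSets):
$ kubectl -n knative-serving logs deploy/controller --since=15m | grep hello-00001
$ kubectl -n <namespace> get deploy hello-00001-deployment -o jsonpath='{.spec.replicas}{"\n"}'
$ kubectl -n <namespace> get rs -l serving.knative.dev/revision=hello-00001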
3. Run a basic Kubernetes deployment
To rule out a Kubernetes-specific issue, I created a basic nginx deployment and did a rolling restart on it:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx:1.14.2
          ports:
            - containerPort: 80
This deployment actually does restart the pod as expected:
- Creates new pod
- Waits until ready
- Terminates old pod
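This can also be double-checked with (namespace is a placeholder):
$ kubectl rollout status deployment/nginx-deployment -n <namespace>
$ kubectl get rs -n <namespace> -l app=nginx
where rollout status returns only after the new pod is ready, and the ReplicaSet listing shows the old one scaled to 0.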
The event log for this deployment looks like:
2m4s Normal ScalingReplicaSet deployment/nginx-deployment Scaled up replica set nginx-deployment-7c68c5c8dc to 1
2m3s Normal Created pod/nginx-deployment-7c68c5c8dc-tbjcm Created container nginx
2m3s Normal SuccessfulCreate replicaset/nginx-deployment-7c68c5c8dc Created pod: nginx-deployment-7c68c5c8dc-tbjcm
2m3s Normal Started pod/nginx-deployment-7c68c5c8dc-tbjcm Started container nginx
2m3s Normal Pulled pod/nginx-deployment-7c68c5c8dc-tbjcm Container image "nginx:1.14.2" already present on machine
11s Normal ScalingReplicaSet deployment/nginx-deployment Scaled up replica set nginx-deployment-68fbb8c788 to 1
11s Normal Pulled pod/nginx-deployment-68fbb8c788-gw4cn Container image "nginx:1.14.2" already present on machine
11s Normal SuccessfulCreate replicaset/nginx-deployment-68fbb8c788 Created pod: nginx-deployment-68fbb8c788-gw4cn
11s Normal Created pod/nginx-deployment-68fbb8c788-gw4cn Created container nginx
10s Normal Started pod/nginx-deployment-68fbb8c788-gw4cn Started container nginx
9s Normal ScalingReplicaSet deployment/nginx-deployment Scaled down replica set nginx-deployment-7c68c5c8dc to 0 from 1
9s Normal SuccessfulDelete replicaset/nginx-deployment-7c68c5c8dc Deleted pod: nginx-deployment-7c68c5c8dc-tbjcm
9s Normal Killing pod/nginx-deployment-7c68c5c8dc-tbjcm Stopping container nginx
8s Normal SuccessfulDelete replicaset/nginx-deployment-7c68c5c8dc Deleted pod: nginx-deployment-7c68c5c8dc-tbjcm
This makes me think the issue is with our Knative Serving setup.
- How can we further debug this issue?
- Why is there an internal error when first starting a Knative Service?
- How can I figure out why the new pod gets terminated within seconds?