Description
Checklist:
- I've searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq.
- I've included steps to reproduce the bug.
- I've pasted the output of `argocd version`.
Describe the bug
In the HA deployment bundle (`manifests/ha/cluster-install?ref=v3.1.8`), the headless Service `argocd-redis-ha` does not set `publishNotReadyAddresses: true`. During a cold restart of the Redis HA StatefulSet, kube-dns removes the Service's endpoints while the pods are not ready, so `fix-split-brain.sh` cannot contact Sentinel. Every Redis pod keeps its stale `slaveof` entry and fails its startup probe (`role=slave`; `repl=connect`), leading to a permanent CrashLoopBackOff and broken Argo CD logins (`rpc error: code = Unauthenticated desc = no session information`).
To Reproduce
- Deploy the stock HA bundle: `kustomize build github.com/argoproj/argo-cd/manifests/ha/cluster-install?ref=v3.1.8 | kubectl apply -f -`
- Force a restart: `kubectl delete pod argocd-redis-ha-server-{0,1,2} -n argocd`
- Watch the pods restart: `kubectl logs argocd-redis-ha-server-1 -c split-brain-fix` shows `Could not connect to Redis at argocd-redis-ha:26379: Name does not resolve`, and `kubectl get pods -n argocd` shows all Redis pods in CrashLoopBackOff.
- Attempt UI/CLI login; the Argo CD server logs report "no session information".
Expected behavior
Sentinel should be reachable via the service name during bootstrap so a new master can be elected and replicas join without crashing.
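For context, this is roughly what the headless Service would look like with the flag set (a sketch only; the port name and selector labels here are assumptions, not copied from the bundle):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: argocd-redis-ha
  namespace: argocd
spec:
  clusterIP: None
  # Publish DNS records for pods even while they are not ready,
  # so Sentinel is resolvable during StatefulSet bootstrap.
  publishNotReadyAddresses: true
  ports:
    - name: tcp-sentinel    # assumed port name
      port: 26379
      targetPort: 26379
  selector:
    app.kubernetes.io/name: argocd-redis-ha   # assumed selector
```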
Screenshots
```
❯ k get pods
NAME                                                READY   STATUS             RESTARTS         AGE
argocd-application-controller-0                     1/1     Running            0                4h59m
argocd-application-controller-1                     1/1     Running            0                4h57m
argocd-applicationset-controller-5c6799bb45-pkn85   1/1     Running            0                4h58m
argocd-applicationset-controller-5c6799bb45-wdmgq   1/1     Running            0                35m
argocd-dex-server-57c896dbc6-jwl8x                  1/1     Running            0                4h59m
argocd-notifications-controller-84b4d4b674-jphfz    1/1     Running            0                4h59m
argocd-redis-ha-haproxy-58476dd6d7-69pdw            1/1     Running            0                36m
argocd-redis-ha-haproxy-58476dd6d7-cdbrs            1/1     Running            0                36m
argocd-redis-ha-haproxy-58476dd6d7-x69ht            1/1     Running            0                35m
argocd-redis-ha-server-0                            2/3     CrashLoopBackOff   98 (4m48s ago)   5h
argocd-redis-ha-server-1                            2/3     CrashLoopBackOff   100 (71s ago)    4h58m
argocd-redis-ha-server-2                            2/3     CrashLoopBackOff   101 (48s ago)    4h57m
argocd-repo-server-65669fd7cf-5gq4r                 1/1     Running            0                15m
argocd-repo-server-65669fd7cf-l22cl                 1/1     Running            0                15m
argocd-server-c67bd8d45-9xxzq                       1/1     Running            0                35m
argocd-server-c67bd8d45-f4jmx                       1/1     Running            0                4h57m
argocd-server-c67bd8d45-qcjsj                       1/1     Running            0                4h59m
```
Version
```
argocd: v3.1.9+8665140
  BuildDate: 2025-10-17T23:03:44Z
  GitCommit: 8665140f96f6b238a20e578dba7f9aef91ddac51
  GitTreeState: clean
  GoVersion: go1.25.3
  Compiler: gc
  Platform: darwin/arm64
argocd-server: v3.1.8+becb020
```
Logs
```
$ kubectl logs argocd-redis-ha-server-1 -c split-brain-fix -n argocd
Could not connect to Redis at argocd-redis-ha:26379: Try again
Could not connect to Redis at argocd-redis-ha:26379: Try again
Could not connect to Redis at argocd-redis-ha:26379: Try again
Fri Oct 24 05:21:53 UTC 2025 Did not find redis master ()
Identifying redis master (get-master-addr-by-name)..
using sentinel (argocd-redis-ha), sentinel group name (argocd)
Could not connect to Redis at argocd-redis-ha:26379: Try again
Could not connect to Redis at argocd-redis-ha:26379: Try again
Could not connect to Redis at argocd-redis-ha:26379: Try again
Fri Oct 24 05:24:08 UTC 2025 Did not find redis master ()
...
Attempting to force failover (sentinel failover)..
on sentinel (argocd-redis-ha:26379), sentinel grp (argocd)
Fri Oct 24 05:29:53 UTC 2025 Failover returned with 'NOGOODSLAVE'

$ kubectl logs argocd-redis-ha-server-1 -c redis -n argocd --previous
1:S 24 Oct 2025 04:50:55.688 * Connecting to MASTER 172.20.218.81:6379
1:S 24 Oct 2025 04:50:55.688 * MASTER <-> REPLICA sync started
1:S 24 Oct 2025 04:50:55.689 * Master replied to PING, replication can continue...
1:S 24 Oct 2025 04:50:55.690 * Trying a partial resynchronization (request 4306f2b885fb2bed9c6ea95e1e3069a81caf0b4d:429491749).
1:S 24 Oct 2025 04:50:55.690 * Master is currently unable to PSYNC but should be in the future: -NOMASTERLINK Can't SYNC while not connected with my master
1:S 24 Oct 2025 04:50:56.691 * Connecting to MASTER 172.20.218.81:6379
1:S 24 Oct 2025 04:50:56.692 * MASTER <-> REPLICA sync started
1:S 24 Oct 2025 04:50:56.692 * Master replied to PING, replication can continue...
1:S 24 Oct 2025 04:50:56.693 * Trying a partial resynchronization (request 4306f2b885fb2bed9c6ea95e1e3069a81caf0b4d:429491749).
1:S 24 Oct 2025 04:50:56.693 * Master is currently unable to PSYNC but should be in the future: -NOMASTERLINK Can't SYNC while not connected with my master

$ kubectl logs argocd-server-c67bd8d45-f4jmx -n argocd | rg "Unauthenticated"
{"grpc.code":"Unauthenticated","grpc.component":"server","grpc.error":"rpc error: code = Unauthenticated desc = invalid session: failed to verify the token","grpc.method":"List","grpc.method_type":"unary","grpc.service":"cluster.ClusterService","grpc.start_time":"2025-10-24T04:47:57Z","grpc.time_ms":"0.646","level":"info","msg":"finished call","peer.address":"[::1]:54768","protocol":"grpc","time":"2025-10-24T04:47:57Z"}
{"grpc.code":"Unauthenticated","grpc.component":"server","grpc.error":"rpc error: code = Unauthenticated desc = no session information","grpc.method":"List","grpc.method_type":"unary","grpc.service":"application.ApplicationService","grpc.start_time":"2025-10-24T04:49:22Z","grpc.time_ms":"0.187","level":"info","msg":"finished call","peer.address":"[::1]:54768","protocol":"grpc","time":"2025-10-24T04:49:22Z"}
```
Workaround
Apply a service patch:

```yaml
spec:
  publishNotReadyAddresses: true
```

then restart the StatefulSet. After applying this the cluster recovers.
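If the bundle is consumed through Kustomize, the same fix can be expressed declaratively as an overlay (a sketch, assuming the stock Service name `argocd-redis-ha` and the `argocd` namespace):

```yaml
# kustomization.yaml — overlay on the stock HA bundle (sketch)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: argocd
resources:
  - github.com/argoproj/argo-cd/manifests/ha/cluster-install?ref=v3.1.8
patches:
  - target:
      kind: Service
      name: argocd-redis-ha
    # JSON6902 patch adding the flag to the headless Sentinel Service
    patch: |-
      - op: add
        path: /spec/publishNotReadyAddresses
        value: true
```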
References
- Kubernetes docs: https://kubernetes.io/docs/concepts/services-networking/service/#publishing-notready-addresses
- Bitnami redis chart (ships the flag): https://github.com/bitnami/charts/blob/main/bitnami/redis/templates/headless-svc.yaml#L24