Redis HA Needs publishNotReadyAddresses #25060

@maheshrijal

Description

Checklist:

  • I've searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq.
  • I've included steps to reproduce the bug.
  • I've pasted the output of argocd version.

Describe the bug

In the HA deployment bundle (manifests/ha/cluster-install?ref=v3.1.8), the headless Service argocd-redis-ha does not set publishNotReadyAddresses: true. During a cold restart of the Redis HA StatefulSet, kube-dns removes the service endpoints while pods are not ready, so fix-split-brain.sh cannot contact Sentinel. Every redis pod keeps the stale slaveof entry and fails its startup probe (role=slave; repl=connect), leading to a permanent CrashLoopBackOff and broken Argo CD logins (rpc error: code = Unauthenticated desc = no session information).

To Reproduce

  1. Deploy the stock HA bundle (kustomize build github.com/argoproj/argo-cd/manifests/ha/cluster-install?ref=v3.1.8 | kubectl apply -f -).
  2. Force a restart: kubectl delete pod argocd-redis-ha-server-{0,1,2} -n argocd.
  3. Watch the pods restart:
    • kubectl logs argocd-redis-ha-server-1 -c split-brain-fix shows Could not connect to Redis at argocd-redis-ha:26379: Name does not resolve.
    • kubectl get pods -n argocd shows all redis pods in CrashLoopBackOff.
  4. Attempt UI/CLI login; Argo CD server logs report “no session information”.
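To confirm that the stock Service lacks the field, a standard jsonpath query works (an empty result means `publishNotReadyAddresses` is unset, which defaults to `false`):

```shell
# Prints "true" if the field is set; prints nothing on the stock HA bundle.
kubectl get svc argocd-redis-ha -n argocd \
  -o jsonpath='{.spec.publishNotReadyAddresses}'
```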

Expected behavior

Sentinel should be reachable via the service name during bootstrap so a new master can be elected and replicas join without crashing.
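Concretely, the headless Service should publish DNS records for not-ready pods so Sentinel is resolvable during bootstrap. A minimal sketch of the relevant part of the Service definition (field names per the Kubernetes Service API; the selector label and port list here are illustrative, the real HA bundle manifest carries more):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: argocd-redis-ha
  namespace: argocd
spec:
  clusterIP: None                  # headless Service backing the Redis HA StatefulSet
  publishNotReadyAddresses: true   # keep DNS records while pods are not yet ready
  selector:
    app.kubernetes.io/name: argocd-redis-ha   # illustrative selector
  ports:
    - name: tcp-sentinel
      port: 26379
      targetPort: 26379
```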

Screenshots

❯ k get pods
NAME                                                READY   STATUS             RESTARTS         AGE
argocd-application-controller-0                     1/1     Running            0                4h59m
argocd-application-controller-1                     1/1     Running            0                4h57m
argocd-applicationset-controller-5c6799bb45-pkn85   1/1     Running            0                4h58m
argocd-applicationset-controller-5c6799bb45-wdmgq   1/1     Running            0                35m
argocd-dex-server-57c896dbc6-jwl8x                  1/1     Running            0                4h59m
argocd-notifications-controller-84b4d4b674-jphfz    1/1     Running            0                4h59m
argocd-redis-ha-haproxy-58476dd6d7-69pdw            1/1     Running            0                36m
argocd-redis-ha-haproxy-58476dd6d7-cdbrs            1/1     Running            0                36m
argocd-redis-ha-haproxy-58476dd6d7-x69ht            1/1     Running            0                35m
argocd-redis-ha-server-0                            2/3     CrashLoopBackOff   98 (4m48s ago)   5h
argocd-redis-ha-server-1                            2/3     CrashLoopBackOff   100 (71s ago)    4h58m
argocd-redis-ha-server-2                            2/3     CrashLoopBackOff   101 (48s ago)    4h57m
argocd-repo-server-65669fd7cf-5gq4r                 1/1     Running            0                15m
argocd-repo-server-65669fd7cf-l22cl                 1/1     Running            0                15m
argocd-server-c67bd8d45-9xxzq                       1/1     Running            0                35m
argocd-server-c67bd8d45-f4jmx                       1/1     Running            0                4h57m
argocd-server-c67bd8d45-qcjsj                       1/1     Running            0                4h59m

Version

argocd: v3.1.9+8665140
  BuildDate: 2025-10-17T23:03:44Z
  GitCommit: 8665140f96f6b238a20e578dba7f9aef91ddac51
  GitTreeState: clean
  GoVersion: go1.25.3
  Compiler: gc
  Platform: darwin/arm64
argocd-server: v3.1.8+becb020

Logs

$ kubectl logs argocd-redis-ha-server-1 -c split-brain-fix -n argocd
Could not connect to Redis at argocd-redis-ha:26379: Try again
Could not connect to Redis at argocd-redis-ha:26379: Try again
Could not connect to Redis at argocd-redis-ha:26379: Try again
  Fri Oct 24 05:21:53 UTC 2025 Did not find redis master ()
Identifying redis master (get-master-addr-by-name)..
  using sentinel (argocd-redis-ha), sentinel group name (argocd)
Could not connect to Redis at argocd-redis-ha:26379: Try again
Could not connect to Redis at argocd-redis-ha:26379: Try again
Could not connect to Redis at argocd-redis-ha:26379: Try again
  Fri Oct 24 05:24:08 UTC 2025 Did not find redis master ()
...
Attempting to force failover (sentinel failover)..
  on sentinel (argocd-redis-ha:26379), sentinel grp (argocd)
  Fri Oct 24 05:29:53 UTC 2025 Failover returned with 'NOGOODSLAVE'
$ kubectl logs argocd-redis-ha-server-1 -c redis -n argocd --previous
1:S 24 Oct 2025 04:50:55.688 * Connecting to MASTER 172.20.218.81:6379
1:S 24 Oct 2025 04:50:55.688 * MASTER <-> REPLICA sync started
1:S 24 Oct 2025 04:50:55.689 * Master replied to PING, replication can continue...
1:S 24 Oct 2025 04:50:55.690 * Trying a partial resynchronization (request 4306f2b885fb2bed9c6ea95e1e3069a81caf0b4d:429491749).
1:S 24 Oct 2025 04:50:55.690 * Master is currently unable to PSYNC but should be in the future: -NOMASTERLINK Can't SYNC while not connected with my master
1:S 24 Oct 2025 04:50:56.691 * Connecting to MASTER 172.20.218.81:6379
1:S 24 Oct 2025 04:50:56.692 * MASTER <-> REPLICA sync started
1:S 24 Oct 2025 04:50:56.692 * Master replied to PING, replication can continue...
1:S 24 Oct 2025 04:50:56.693 * Trying a partial resynchronization (request 4306f2b885fb2bed9c6ea95e1e3069a81caf0b4d:429491749).
1:S 24 Oct 2025 04:50:56.693 * Master is currently unable to PSYNC but should be in the future: -NOMASTERLINK Can't SYNC while not connected with my master
$ kubectl logs argocd-server-c67bd8d45-f4jmx -n argocd | rg "Unauthenticated"
{"grpc.code":"Unauthenticated","grpc.component":"server","grpc.error":"rpc error: code = Unauthenticated desc = invalid session: failed to verify the token","grpc.method":"List","grpc.method_type":"unary","grpc.service":"cluster.ClusterService","grpc.start_time":"2025-10-24T04:47:57Z","grpc.time_ms":"0.646","level":"info","msg":"finished call","peer.address":"[::1]:54768","protocol":"grpc","time":"2025-10-24T04:47:57Z"}
{"grpc.code":"Unauthenticated","grpc.component":"server","grpc.error":"rpc error: code = Unauthenticated desc = no session information","grpc.method":"List","grpc.method_type":"unary","grpc.service":"application.ApplicationService","grpc.start_time":"2025-10-24T04:49:22Z","grpc.time_ms":"0.187","level":"info","msg":"finished call","peer.address":"[::1]:54768","protocol":"grpc","time":"2025-10-24T04:49:22Z"}

Workaround

Patch the argocd-redis-ha Service:

spec:
  publishNotReadyAddresses: true

then restart the StatefulSet. After applying the patch, the cluster recovers.
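For reference, the same workaround can be applied imperatively with standard kubectl commands (adjust the namespace if Argo CD is not installed in argocd):

```shell
# Publish not-ready addresses on the headless Service so Sentinel
# stays resolvable while the Redis pods bootstrap.
kubectl patch svc argocd-redis-ha -n argocd --type merge \
  -p '{"spec":{"publishNotReadyAddresses":true}}'

# Restart the Redis HA StatefulSet so the pods come back up
# with working service DNS.
kubectl rollout restart statefulset argocd-redis-ha-server -n argocd
```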
