Hello team,

First of all, thank you for the excellent work on the Autobase project — we're currently implementing it for a distributed PostgreSQL architecture across five datacenters, each running a Patroni node deployed via Autobase. We’ve conducted several resilience and disaster recovery tests, and we would like to share some findings and questions regarding certain failure conditions, especially in critical scenarios. We would appreciate your insights or any best practices.

1. Cluster behavior with multiple nodes down

From our testing, we’ve confirmed that the cluster can tolerate the loss of up to two nodes. However, if we lose three out of five nodes, the cluster goes into read-only mode due to lack of quorum (as expected). Now we’re considering worst-case scenarios such as:
In such cases, which of the following options would you recommend?
We understand that some of these actions break the HA model, but we're looking for a clean and supported way to restore operability under these rare but critical situations.

2. Failover not triggered when HAProxy or PgBouncer stop on the master

In our environment, each node runs the following services:
We noticed that if we stop HAProxy and PgBouncer on the current master, the node becomes unreachable for clients, but failover is not triggered — the node is still considered healthy by Patroni/etcd. This led to service downtime even though the master itself was partially degraded. Is there any way to:
3. Additional considerations

If you have suggestions or patterns to better handle partial or total failures, especially regarding:
We’d be happy to hear your thoughts and incorporate them into our project. Thanks again for your work and support!
Hi @lfq04
You’re absolutely right: in a 5-node DCS setup, the cluster can tolerate up to 2 nodes failing. This follows the standard Raft requirement of an N/2 + 1 quorum (the short sketch below spells out the arithmetic). For more on how this works, I recommend The Secret Lives of Data – Raft.

In worst-case scenarios where only a single node remains:
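Returning to the quorum point above, here is a minimal sketch of the N/2 + 1 arithmetic (plain Python, no Patroni or etcd dependency; the cluster sizes are just illustrative):

```python
# Minimal sketch of the Raft quorum arithmetic: a cluster of n DCS members
# stays writable only while at least floor(n/2) + 1 members are reachable.

def quorum(n: int) -> int:
    return n // 2 + 1

def tolerated_failures(n: int) -> int:
    return n - quorum(n)

for n in (3, 5, 7):
    print(f"{n}-node cluster: quorum = {quorum(n)}, tolerates {tolerated_failures(n)} failed node(s)")

# For n = 5: quorum = 3, tolerates 2 failed node(s).
# Losing a third node drops the cluster below quorum, which is why it
# switched to read-only mode in your test.
```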
HAProxy, even when installed on database nodes, is not directly tied to the PostgreSQL role (primary or replica). Patroni does not monitor HAProxy, so its failure does not trigger a switchover. High availability of HAProxy itself should be ensured via external means — typically using keepalived with a shared virtual IP (VIP). This way, the VIP is always assigned to a node where the HAProxy process is healthy, ensuring continuous availability for clients.

PgBouncer does not have native HA mechanisms, but that’s often not required. If needed, you can run multiple PgBouncer processes on the same host using the so_reuseport option, so that they share the same listening port and incoming connections are distributed across them.
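To make the keepalived + VIP pattern above concrete, here is a minimal configuration sketch; the interface name, VIP address, virtual_router_id, and priorities are assumptions you would adapt per node:

```conf
# /etc/keepalived/keepalived.conf (minimal sketch; all values are illustrative assumptions)

vrrp_script chk_haproxy {
    script "/usr/bin/killall -0 haproxy"   # exits 0 only while an haproxy process is running
    interval 2
    weight -20                             # lower this node's priority when the check fails
}

vrrp_instance VI_1 {
    state BACKUP
    interface eth0                         # assumption: replace with your NIC
    virtual_router_id 51
    priority 100                           # assumption: give each node a distinct base priority
    advert_int 1
    virtual_ipaddress {
        192.0.2.10/24                      # assumption: the shared client-facing VIP
    }
    track_script {
        chk_haproxy
    }
}
```

With a negative weight on the check and distinct base priorities per node, a node whose HAProxy check fails drops below its peers and the VIP moves to a node where HAProxy is still healthy, which gives clients the continuous availability described above.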
If you’re looking for guided support and architectural reviews tailored to your use case, I encourage you to explore our support packages.