Hello team,

First of all, thank you for the excellent work on the Autobase project — we're currently implementing it for a distributed PostgreSQL architecture across five datacenters, each running a Patroni node deployed via Autobase. We’ve conducted several resilience and disaster recovery tests, and we would like to share some findings and questions regarding certain failure conditions, especially in critical scenarios. We would appreciate your insights or any best practices.

1. Cluster behavior with multiple nodes down

From our testing, we’ve confirmed that the cluster can tolerate the loss of up to two nodes. However, if we lose three out of five nodes, the cluster goes into read-only mode due to lack of quorum (as expected). Now we’re considering worst-case scenarios such as:
In such cases, which of the following options would you recommend?
We understand that some of these actions break the HA model, but we're looking for a clean and supported way to restore operability under these rare but critical situations.

2. Failover not triggered when HAProxy or PgBouncer stop on the master

In our environment, each node runs the following services:
We noticed that if we stop HAProxy and PgBouncer on the current master, the node becomes unreachable for clients, but failover is not triggered — the node is still considered healthy by Patroni/etcd. This led to service downtime even though the master itself was partially degraded. Is there any way to:
3. Additional considerations

If you have suggestions or patterns to better handle partial or total failures, especially regarding:
We’d be happy to hear your thoughts and incorporate them into our project. Thanks again for your work and support!
Hi @lfq04
You’re absolutely right: in a 5-node DCS setup, the cluster can tolerate up to 2 nodes failing. This follows the standard Raft requirement of an N/2 + 1 quorum (the short sketch below spells out the arithmetic). For more on how this works, I recommend The Secret Lives of Data – Raft.

In worst-case scenarios where only a single node remains:
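Returning to the quorum point above, here is a minimal sketch of the N/2 + 1 arithmetic (plain Python, no Patroni or etcd dependency; the cluster sizes are just illustrative):

```python
# Minimal sketch of the Raft quorum arithmetic: a cluster of n DCS members
# stays writable only while at least floor(n/2) + 1 members are reachable.

def quorum(n: int) -> int:
    return n // 2 + 1

def tolerated_failures(n: int) -> int:
    return n - quorum(n)

for n in (3, 5, 7):
    print(f"{n}-node cluster: quorum = {quorum(n)}, tolerates {tolerated_failures(n)} failed node(s)")

# For n = 5: quorum = 3, tolerates 2 failed node(s).
# Losing a third node drops the cluster below quorum, which is why it
# switched to read-only mode in your test.
```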
HAProxy, even when installed on database nodes, is not directly tied to the PostgreSQL role (primary or replica). Patroni does not monitor HAProxy, so its failure does not trigger a switchover. High availability of HAProxy itself should be ensured via external means — typically using keepalived with a shared virtual IP (VIP). This way, the VIP is always assigned to a node where the HAProxy process is healthy, ensuring continuous availability for clients.

PgBouncer does not have native HA mechanisms, but that’s often not required. If needed, you can run multiple PgBouncer processes on the same host using the so_reuseport option, so that they share the same listening port and incoming connections are distributed across them.
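To make the keepalived + VIP pattern above concrete, here is a minimal configuration sketch; the interface name, VIP address, virtual_router_id, and priorities are assumptions you would adapt per node:

```conf
# /etc/keepalived/keepalived.conf (minimal sketch; all values are illustrative assumptions)

vrrp_script chk_haproxy {
    script "/usr/bin/killall -0 haproxy"   # exits 0 only while an haproxy process is running
    interval 2
    weight -20                             # lower this node's priority when the check fails
}

vrrp_instance VI_1 {
    state BACKUP
    interface eth0                         # assumption: replace with your NIC
    virtual_router_id 51
    priority 100                           # assumption: give each node a distinct base priority
    advert_int 1
    virtual_ipaddress {
        192.0.2.10/24                      # assumption: the shared client-facing VIP
    }
    track_script {
        chk_haproxy
    }
}
```

With a negative weight on the check and distinct base priorities per node, a node whose HAProxy check fails drops below its peers and the VIP moves to a node where HAProxy is still healthy, which gives clients the continuous availability described above.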
If you’re looking for guided support and architectural reviews tailored to your use case, I encourage you to explore our support packages.