From 1faefe1cd57daf261654973e8073901116ef7205 Mon Sep 17 00:00:00 2001 From: Ronald Ngounou Date: Sat, 11 Oct 2025 02:22:00 -0700 Subject: [PATCH] Lift the suggested etcd storage size limit from 8GB to 100GB https://www.cncf.io/blog/2019/05/09/performance-optimization-of-etcd-in-web-scale-data-scenario/ Signed-off-by: Ronald Ngounou --- content/en/docs/v3.4/dev-guide/limit.md | 2 +- content/en/docs/v3.4/faq.md | 1 - content/en/docs/v3.5/dev-guide/limit.md | 2 +- content/en/docs/v3.5/faq.md | 5 ++--- content/en/docs/v3.6/dev-guide/limit.md | 2 +- content/en/docs/v3.6/faq.md | 5 ++--- content/en/docs/v3.6/op-guide/hardware.md | 2 +- content/en/docs/v3.7/dev-guide/limit.md | 2 +- content/en/docs/v3.7/faq.md | 5 ++--- content/en/docs/v3.7/op-guide/hardware.md | 6 +----- 10 files changed, 12 insertions(+), 20 deletions(-) diff --git a/content/en/docs/v3.4/dev-guide/limit.md b/content/en/docs/v3.4/dev-guide/limit.md index bdc0d06f1..29159b869 100644 --- a/content/en/docs/v3.4/dev-guide/limit.md +++ b/content/en/docs/v3.4/dev-guide/limit.md @@ -10,4 +10,4 @@ etcd is designed to handle small key value pairs typical for metadata. Larger re ## Storage size limit -The default storage size limit is 2 GiB, configurable with `--quota-backend-bytes` flag. 8 GiB is a suggested maximum size for normal environments and etcd warns at startup if the configured value exceeds it. +The default storage size limit is 2 GiB, configurable with `--quota-backend-bytes` flag. 100 GB is a suggested maximum size for normal environments and etcd warns at startup if the configured value exceeds it. Read this [blog post](https://www.cncf.io/blog/2019/05/09/performance-optimization-of-etcd-in-web-scale-data-scenario/) to learn how the 100 GB figure was determined. diff --git a/content/en/docs/v3.4/faq.md b/content/en/docs/v3.4/faq.md index bdbe25420..e502fa7dc 100644 --- a/content/en/docs/v3.4/faq.md +++ b/content/en/docs/v3.4/faq.md @@ -146,7 +146,6 @@ If none of the above suggestions clear the warnings, please [open an issue][new_ etcd sends a snapshot of its complete key-value store to refresh slow followers and for [backups][backup]. Slow snapshot transfer times increase MTTR; if the cluster is ingesting data with high throughput, slow followers may livelock by needing a new snapshot before finishing receiving a snapshot. To catch slow snapshot performance, etcd warns when sending a snapshot takes more than thirty seconds and exceeds the expected transfer time for a 1Gbps connection. - [api-mvcc]: ../learning/api/#revisions [backend_commit_metrics]: ../metrics/#disk [backup]: /docs/v3.4/op-guide/recovery#snapshotting-the-keyspace diff --git a/content/en/docs/v3.5/dev-guide/limit.md b/content/en/docs/v3.5/dev-guide/limit.md index bdc0d06f1..1ea9e2728 100644 --- a/content/en/docs/v3.5/dev-guide/limit.md +++ b/content/en/docs/v3.5/dev-guide/limit.md @@ -10,4 +10,4 @@ etcd is designed to handle small key value pairs typical for metadata. Larger re ## Storage size limit -The default storage size limit is 2 GiB, configurable with `--quota-backend-bytes` flag. 8 GiB is a suggested maximum size for normal environments and etcd warns at startup if the configured value exceeds it. +The default storage size limit is 2 GiB, configurable with `--quota-backend-bytes` flag. 100 GB is a suggested maximum size for normal environments and etcd warns at startup if the configured value exceeds it. Read this [blog post](https://www.cncf.io/blog/2019/05/09/performance-optimization-of-etcd-in-web-scale-data-scenario/) to learn how the 100 GB figure was determined.
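The `--quota-backend-bytes` flag named in both `limit.md` hunks above also has a programmatic counterpart when etcd is embedded in a Go program. The following is a minimal, illustrative sketch only: it assumes the `go.etcd.io/etcd/server/v3/embed` package, and the data directory and the 16 GiB figure are placeholders, not recommendations.

```go
// Minimal sketch: start a single-member etcd with a 16 GiB backend quota,
// the programmatic counterpart of `etcd --quota-backend-bytes=17179869184`.
// The data directory and quota size below are illustrative assumptions.
package main

import (
	"log"
	"time"

	"go.etcd.io/etcd/server/v3/embed"
)

func main() {
	cfg := embed.NewConfig()
	cfg.Dir = "default.etcd"                        // illustrative data directory
	cfg.QuotaBackendBytes = 16 * 1024 * 1024 * 1024 // 16 GiB backend quota

	e, err := embed.StartEtcd(cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer e.Close()

	select {
	case <-e.Server.ReadyNotify():
		log.Println("etcd is ready; backend quota set to 16 GiB")
	case <-time.After(60 * time.Second):
		e.Server.Stop() // trigger a shutdown if startup hangs
		log.Fatal("etcd took too long to start")
	}
	log.Fatal(<-e.Err()) // block until the server stops or reports an error
}
```

The equivalent command-line invocation would be along the lines of `etcd --quota-backend-bytes=17179869184` (16 GiB expressed in bytes).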
diff --git a/content/en/docs/v3.5/faq.md b/content/en/docs/v3.5/faq.md index f6d5af416..e8d340713 100644 --- a/content/en/docs/v3.5/faq.md +++ b/content/en/docs/v3.5/faq.md @@ -34,7 +34,7 @@ A member's advertised peer URLs come from `--initial-advertise-peer-urls` on ini ### System requirements -Since etcd writes data to disk, its performance strongly depends on disk performance. For this reason, SSD is highly recommended. To assess whether a disk is fast enough for etcd, one possibility is using a disk benchmarking tool such as [fio][fio]. For an example on how to do that, read [here][fio-blog-post]. To prevent performance degradation or unintentionally overloading the key-value store, etcd enforces a configurable storage size quota set to 2GB by default. To avoid swapping or running out of memory, the machine should have at least as much RAM to cover the quota. 8GB is a suggested maximum size for normal environments and etcd warns at startup if the configured value exceeds it. At CoreOS, an etcd cluster is usually deployed on dedicated CoreOS Container Linux machines with dual-core processors, 2GB of RAM, and 80GB of SSD *at the very least*. **Note that performance is intrinsically workload dependent; please test before production deployment**. See [hardware][hardware-setup] for more recommendations. +Since etcd writes data to disk, its performance strongly depends on disk performance. For this reason, SSD is highly recommended. To assess whether a disk is fast enough for etcd, one possibility is using a disk benchmarking tool such as [fio][fio]. For an example on how to do that, read [this blog][fio-blog-post]. To prevent performance degradation or unintentionally overloading the key-value store, etcd enforces a configurable storage size quota set to 2GB by default. To avoid swapping or running out of memory, the machine should have at least as much RAM to cover the quota. 100GB is a suggested maximum size for normal environments and etcd warns at startup if the configured value exceeds it. At CoreOS, an etcd cluster is usually deployed on dedicated CoreOS Container Linux machines with dual-core processors, 2GB of RAM, and 80GB of SSD *at the very least*. **Note that performance is intrinsically workload dependent; please test before production deployment**. See [hardware][hardware-setup] for more recommendations. Most stable production environment is Linux operating system with amd64 architecture; see [supported platform][supported-platform] for more. @@ -142,7 +142,7 @@ If none of the above suggestions clear the warnings, please [open an issue][new_ etcd uses a leader-based consensus protocol for consistent data replication and log execution. Cluster members elect a single leader, all other members become followers. The elected leader must periodically send heartbeats to its followers to maintain its leadership. Followers infer leader failure if no heartbeats are received within an election interval and trigger an election. If a leader doesn’t send its heartbeats in time but is still running, the election is spurious and likely caused by insufficient resources. To catch these soft failures, if the leader skips two heartbeat intervals, etcd will warn it failed to send a heartbeat on time. -Usually this issue is caused by a slow disk. Before the leader sends heartbeats attached with metadata, it may need to persist the metadata to disk. The disk could be experiencing contention among etcd and other applications, or the disk is too simply slow (e.g., a shared virtualized disk). 
To rule out a slow disk from causing this warning, monitor [wal_fsync_duration_seconds][wal_fsync_duration_seconds] (p99 duration should be less than 10ms) to confirm the disk is reasonably fast. If the disk is too slow, assigning a dedicated disk to etcd or using faster disk will typically solve the problem. To tell whether a disk is fast enough for etcd, a benchmarking tool such as [fio][fio] can be used. Read [here][fio-blog-post] for an example. +Usually this issue is caused by a slow disk. Before the leader sends heartbeats attached with metadata, it may need to persist the metadata to disk. The disk could be experiencing contention among etcd and other applications, or the disk is simply too slow (e.g., a shared virtualized disk). To rule out a slow disk from causing this warning, monitor [wal_fsync_duration_seconds][wal_fsync_duration_seconds] (p99 duration should be less than 10ms) to confirm the disk is reasonably fast. If the disk is too slow, assigning a dedicated disk to etcd or using a faster disk will typically solve the problem. To tell whether a disk is fast enough for etcd, a benchmarking tool such as [fio][fio] can be used. Read [this blog][fio-blog-post] for an example. The second most common cause is CPU starvation. If monitoring of the machine’s CPU usage shows heavy utilization, there may not be enough compute capacity for etcd. Moving etcd to dedicated machine, increasing process resource isolation with cgroups, or renicing the etcd server process into a higher priority can usually solve the problem. @@ -154,7 +154,6 @@ If none of the above suggestions clear the warnings, please [open an issue][new_ etcd sends a snapshot of its complete key-value store to refresh slow followers and for [backups][backup]. Slow snapshot transfer times increase MTTR; if the cluster is ingesting data with high throughput, slow followers may livelock by needing a new snapshot before finishing receiving a snapshot. To catch slow snapshot performance, etcd warns when sending a snapshot takes more than thirty seconds and exceeds the expected transfer time for a 1Gbps connection. - [api-mvcc]: ../learning/api/#revisions [backend_commit_metrics]: ../metrics/#disk [backup]: ../op-guide/recovery/#snapshotting-the-keyspace diff --git a/content/en/docs/v3.6/dev-guide/limit.md b/content/en/docs/v3.6/dev-guide/limit.md index bdc0d06f1..e4666908f 100644 --- a/content/en/docs/v3.6/dev-guide/limit.md +++ b/content/en/docs/v3.6/dev-guide/limit.md @@ -10,4 +10,4 @@ etcd is designed to handle small key value pairs typical for metadata. Larger re ## Storage size limit -The default storage size limit is 2 GiB, configurable with `--quota-backend-bytes` flag. 8 GiB is a suggested maximum size for normal environments and etcd warns at startup if the configured value exceeds it. +The default storage size limit is 2 GiB, configurable with `--quota-backend-bytes` flag. 100 GB is a suggested maximum size for normal environments and etcd warns at startup if the configured value exceeds it. Read this [blog post](https://www.cncf.io/blog/2019/05/09/performance-optimization-of-etcd-in-web-scale-data-scenario/) to learn how the 100 GB figure was determined.
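Since the FAQ and `limit.md` text above frame the storage quota as something operators should keep an eye on, a small client-side sketch follows. It assumes the `go.etcd.io/etcd/client/v3` package and a plaintext endpoint on `127.0.0.1:2379`; the 100GB threshold below is only the suggested maximum quoted above, not a hard limit.

```go
// Minimal sketch: report each endpoint's backend database size so an operator
// can see how close it is to the configured quota / suggested maximum.
// The endpoint address and the 100GB threshold are illustrative assumptions.
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	const suggestedMax int64 = 100 * 1000 * 1000 * 1000 // 100GB, illustrative threshold

	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"}, // illustrative endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	for _, ep := range cli.Endpoints() {
		st, err := cli.Status(ctx, ep)
		if err != nil {
			log.Printf("status %s: %v", ep, err)
			continue
		}
		fmt.Printf("%s: db size %d bytes (%.1f%% of the 100GB suggested maximum)\n",
			ep, st.DbSize, float64(st.DbSize)/float64(suggestedMax)*100)
	}
}
```

`etcdctl endpoint status --write-out=table` reports the same DB SIZE figure without any code.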
diff --git a/content/en/docs/v3.6/faq.md b/content/en/docs/v3.6/faq.md index 64f12e3c3..f4ab0cd5e 100644 --- a/content/en/docs/v3.6/faq.md +++ b/content/en/docs/v3.6/faq.md @@ -34,7 +34,7 @@ A member's advertised peer URLs come from `--initial-advertise-peer-urls` on ini ### System requirements -Since etcd writes data to disk, its performance strongly depends on disk performance. For this reason, SSD is highly recommended. To assess whether a disk is fast enough for etcd, one possibility is using a disk benchmarking tool such as [fio][fio]. For an example on how to do that, read [here][fio-blog-post]. To prevent performance degradation or unintentionally overloading the key-value store, etcd enforces a configurable storage size quota set to 2GB by default. To avoid swapping or running out of memory, the machine should have at least as much RAM to cover the quota. 8GB is a suggested maximum size for normal environments and etcd warns at startup if the configured value exceeds it. At CoreOS, an etcd cluster is usually deployed on dedicated CoreOS Container Linux machines with dual-core processors, 2GB of RAM, and 80GB of SSD *at the very least*. **Note that performance is intrinsically workload dependent; please test before production deployment**. See [hardware][hardware-setup] for more recommendations. +Since etcd writes data to disk, its performance strongly depends on disk performance. For this reason, SSD is highly recommended. To assess whether a disk is fast enough for etcd, one possibility is using a disk benchmarking tool such as [fio][fio]. For an example on how to do that, read [this blog][fio-blog-post]. To prevent performance degradation or unintentionally overloading the key-value store, etcd enforces a configurable storage size quota set to 2GB by default. To avoid swapping or running out of memory, the machine should have at least as much RAM to cover the quota. 100GB is a suggested maximum size for normal environments and etcd warns at startup if the configured value exceeds it. At CoreOS, an etcd cluster is usually deployed on dedicated CoreOS Container Linux machines with dual-core processors, 2GB of RAM, and 80GB of SSD *at the very least*. **Note that performance is intrinsically workload dependent; please test before production deployment**. See [hardware][hardware-setup] for more recommendations. Most stable production environment is Linux operating system with amd64 architecture; see [supported platform][supported-platform] for more. @@ -142,7 +142,7 @@ If none of the above suggestions clear the warnings, please [open an issue][new_ etcd uses a leader-based consensus protocol for consistent data replication and log execution. Cluster members elect a single leader, all other members become followers. The elected leader must periodically send heartbeats to its followers to maintain its leadership. Followers infer leader failure if no heartbeats are received within an election interval and trigger an election. If a leader doesn’t send its heartbeats in time but is still running, the election is spurious and likely caused by insufficient resources. To catch these soft failures, if the leader skips two heartbeat intervals, etcd will warn it failed to send a heartbeat on time. -Usually this issue is caused by a slow disk. Before the leader sends heartbeats attached with metadata, it may need to persist the metadata to disk. The disk could be experiencing contention among etcd and other applications, or the disk is too simply slow (e.g., a shared virtualized disk). 
To rule out a slow disk from causing this warning, monitor [wal_fsync_duration_seconds][wal_fsync_duration_seconds] (p99 duration should be less than 10ms) to confirm the disk is reasonably fast. If the disk is too slow, assigning a dedicated disk to etcd or using faster disk will typically solve the problem. To tell whether a disk is fast enough for etcd, a benchmarking tool such as [fio][fio] can be used. Read [here][fio-blog-post] for an example. +Usually this issue is caused by a slow disk. Before the leader sends heartbeats attached with metadata, it may need to persist the metadata to disk. The disk could be experiencing contention among etcd and other applications, or the disk is simply too slow (e.g., a shared virtualized disk). To rule out a slow disk from causing this warning, monitor [wal_fsync_duration_seconds][wal_fsync_duration_seconds] (p99 duration should be less than 10ms) to confirm the disk is reasonably fast. If the disk is too slow, assigning a dedicated disk to etcd or using a faster disk will typically solve the problem. To tell whether a disk is fast enough for etcd, a benchmarking tool such as [fio][fio] can be used. Read [this blog][fio-blog-post] for an example. The second most common cause is CPU starvation. If monitoring of the machine’s CPU usage shows heavy utilization, there may not be enough compute capacity for etcd. Moving etcd to dedicated machine, increasing process resource isolation with cgroups, or renicing the etcd server process into a higher priority can usually solve the problem. @@ -154,7 +154,6 @@ If none of the above suggestions clear the warnings, please [open an issue][new_ etcd sends a snapshot of its complete key-value store to refresh slow followers and for [backups][backup]. Slow snapshot transfer times increase MTTR; if the cluster is ingesting data with high throughput, slow followers may livelock by needing a new snapshot before finishing receiving a snapshot. To catch slow snapshot performance, etcd warns when sending a snapshot takes more than thirty seconds and exceeds the expected transfer time for a 1Gbps connection. - [api-mvcc]: ../learning/api/#revisions [backend_commit_metrics]: ../metrics/#disk [backup]: ../op-guide/recovery/#snapshotting-the-keyspace diff --git a/content/en/docs/v3.6/op-guide/hardware.md b/content/en/docs/v3.6/op-guide/hardware.md index df67cfe98..e17fcd5a6 100644 --- a/content/en/docs/v3.6/op-guide/hardware.md +++ b/content/en/docs/v3.6/op-guide/hardware.md @@ -14,7 +14,7 @@ Heavily loaded etcd deployments, serving thousands of clients or tens of thousan ## Memory -etcd has a relatively small memory footprint but its performance still depends on having enough memory. An etcd server will aggressively cache key-value data and spends most of the rest of its memory tracking watchers. Typically 8GB is enough. For heavy deployments with thousands of watchers and millions of keys, allocate 16GB to 64GB memory accordingly. +etcd has a relatively small memory footprint but its performance still depends on having enough memory. An etcd server will aggressively cache key-value data and spends most of the rest of its memory tracking watchers. Typically 8GB is enough. For heavy deployments with thousands of watchers and millions of keys, allocate 16GB to 64GB memory accordingly.
## Disks diff --git a/content/en/docs/v3.7/dev-guide/limit.md b/content/en/docs/v3.7/dev-guide/limit.md index bdc0d06f1..e4666908f 100644 --- a/content/en/docs/v3.7/dev-guide/limit.md +++ b/content/en/docs/v3.7/dev-guide/limit.md @@ -10,4 +10,4 @@ etcd is designed to handle small key value pairs typical for metadata. Larger re ## Storage size limit -The default storage size limit is 2 GiB, configurable with `--quota-backend-bytes` flag. 8 GiB is a suggested maximum size for normal environments and etcd warns at startup if the configured value exceeds it. +The default storage size limit is 2 GiB, configurable with `--quota-backend-bytes` flag. 100 GB is a suggested maximum size for normal environments and etcd warns at startup if the configured value exceeds it. Read this [blog post](https://www.cncf.io/blog/2019/05/09/performance-optimization-of-etcd-in-web-scale-data-scenario/) to learn how the 100 GB figure was determined. diff --git a/content/en/docs/v3.7/faq.md b/content/en/docs/v3.7/faq.md index 64f12e3c3..f4ab0cd5e 100644 --- a/content/en/docs/v3.7/faq.md +++ b/content/en/docs/v3.7/faq.md @@ -34,7 +34,7 @@ A member's advertised peer URLs come from `--initial-advertise-peer-urls` on ini ### System requirements -Since etcd writes data to disk, its performance strongly depends on disk performance. For this reason, SSD is highly recommended. To assess whether a disk is fast enough for etcd, one possibility is using a disk benchmarking tool such as [fio][fio]. For an example on how to do that, read [here][fio-blog-post]. To prevent performance degradation or unintentionally overloading the key-value store, etcd enforces a configurable storage size quota set to 2GB by default. To avoid swapping or running out of memory, the machine should have at least as much RAM to cover the quota. 8GB is a suggested maximum size for normal environments and etcd warns at startup if the configured value exceeds it. At CoreOS, an etcd cluster is usually deployed on dedicated CoreOS Container Linux machines with dual-core processors, 2GB of RAM, and 80GB of SSD *at the very least*. **Note that performance is intrinsically workload dependent; please test before production deployment**. See [hardware][hardware-setup] for more recommendations. +Since etcd writes data to disk, its performance strongly depends on disk performance. For this reason, SSD is highly recommended. To assess whether a disk is fast enough for etcd, one possibility is using a disk benchmarking tool such as [fio][fio]. For an example on how to do that, read [this blog][fio-blog-post]. To prevent performance degradation or unintentionally overloading the key-value store, etcd enforces a configurable storage size quota set to 2GB by default. To avoid swapping or running out of memory, the machine should have at least as much RAM to cover the quota. 100GB is a suggested maximum size for normal environments and etcd warns at startup if the configured value exceeds it. At CoreOS, an etcd cluster is usually deployed on dedicated CoreOS Container Linux machines with dual-core processors, 2GB of RAM, and 80GB of SSD *at the very least*. **Note that performance is intrinsically workload dependent; please test before production deployment**. See [hardware][hardware-setup] for more recommendations. Most stable production environment is Linux operating system with amd64 architecture; see [supported platform][supported-platform] for more.
@@ -142,7 +142,7 @@ If none of the above suggestions clear the warnings, please [open an issue][new_ etcd uses a leader-based consensus protocol for consistent data replication and log execution. Cluster members elect a single leader, all other members become followers. The elected leader must periodically send heartbeats to its followers to maintain its leadership. Followers infer leader failure if no heartbeats are received within an election interval and trigger an election. If a leader doesn’t send its heartbeats in time but is still running, the election is spurious and likely caused by insufficient resources. To catch these soft failures, if the leader skips two heartbeat intervals, etcd will warn it failed to send a heartbeat on time. -Usually this issue is caused by a slow disk. Before the leader sends heartbeats attached with metadata, it may need to persist the metadata to disk. The disk could be experiencing contention among etcd and other applications, or the disk is too simply slow (e.g., a shared virtualized disk). To rule out a slow disk from causing this warning, monitor [wal_fsync_duration_seconds][wal_fsync_duration_seconds] (p99 duration should be less than 10ms) to confirm the disk is reasonably fast. If the disk is too slow, assigning a dedicated disk to etcd or using faster disk will typically solve the problem. To tell whether a disk is fast enough for etcd, a benchmarking tool such as [fio][fio] can be used. Read [here][fio-blog-post] for an example. +Usually this issue is caused by a slow disk. Before the leader sends heartbeats attached with metadata, it may need to persist the metadata to disk. The disk could be experiencing contention among etcd and other applications, or the disk is simply too slow (e.g., a shared virtualized disk). To rule out a slow disk from causing this warning, monitor [wal_fsync_duration_seconds][wal_fsync_duration_seconds] (p99 duration should be less than 10ms) to confirm the disk is reasonably fast. If the disk is too slow, assigning a dedicated disk to etcd or using a faster disk will typically solve the problem. To tell whether a disk is fast enough for etcd, a benchmarking tool such as [fio][fio] can be used. Read [this blog][fio-blog-post] for an example. The second most common cause is CPU starvation. If monitoring of the machine’s CPU usage shows heavy utilization, there may not be enough compute capacity for etcd. Moving etcd to dedicated machine, increasing process resource isolation with cgroups, or renicing the etcd server process into a higher priority can usually solve the problem. @@ -154,7 +154,6 @@ If none of the above suggestions clear the warnings, please [open an issue][new_ etcd sends a snapshot of its complete key-value store to refresh slow followers and for [backups][backup]. Slow snapshot transfer times increase MTTR; if the cluster is ingesting data with high throughput, slow followers may livelock by needing a new snapshot before finishing receiving a snapshot. To catch slow snapshot performance, etcd warns when sending a snapshot takes more than thirty seconds and exceeds the expected transfer time for a 1Gbps connection.
- [api-mvcc]: ../learning/api/#revisions [backend_commit_metrics]: ../metrics/#disk [backup]: ../op-guide/recovery/#snapshotting-the-keyspace diff --git a/content/en/docs/v3.7/op-guide/hardware.md b/content/en/docs/v3.7/op-guide/hardware.md index df67cfe98..9f0735af3 100644 --- a/content/en/docs/v3.7/op-guide/hardware.md +++ b/content/en/docs/v3.7/op-guide/hardware.md @@ -14,7 +14,7 @@ Heavily loaded etcd deployments, serving thousands of clients or tens of thousan ## Memory -etcd has a relatively small memory footprint but its performance still depends on having enough memory. An etcd server will aggressively cache key-value data and spends most of the rest of its memory tracking watchers. Typically 8GB is enough. For heavy deployments with thousands of watchers and millions of keys, allocate 16GB to 64GB memory accordingly. +etcd has a relatively small memory footprint but its performance still depends on having enough memory. An etcd server will aggressively cache key-value data and spends most of the rest of its memory tracking watchers. Typically 8GB is enough. For heavy deployments with thousands of watchers and millions of keys, allocate 16GB to 64GB memory accordingly. ## Disks @@ -54,7 +54,6 @@ Example application workload: A 50-node Kubernetes cluster | AWS | m4.large | 2 | 8 | 3600 | 56.25 | | GCE | n1-standard-2 + 50GB PD SSD | 2 | 7.5 | 1500 | 25 | - ### Medium cluster A medium cluster serves fewer than 500 clients, fewer than 1,000 of requests per second, and stores no more than 500MB of data. @@ -66,7 +65,6 @@ Example application workload: A 250-node Kubernetes cluster | AWS | m4.xlarge | 4 | 16 | 6000 | 93.75 | | GCE | n1-standard-4 + 150GB PD SSD | 4 | 15 | 4500 | 75 | - ### Large cluster A large cluster serves fewer than 1,500 clients, fewer than 10,000 of requests per second, and stores no more than 1GB of data. @@ -78,7 +76,6 @@ Example application workload: A 1,000-node Kubernetes cluster | AWS | m4.2xlarge | 8 | 32 | 8000 | 125 | | GCE | n1-standard-8 + 250GB PD SSD | 8 | 30 | 7500 | 125 | - ### xLarge cluster An xLarge cluster serves more than 1,500 clients, more than 10,000 of requests per second, and stores more than 1GB data. @@ -90,7 +87,6 @@ Example application workload: A 3,000 node Kubernetes cluster | AWS | m4.4xlarge | 16 | 64 | 16,000 | 250 | | GCE | n1-standard-16 + 500GB PD SSD | 16 | 60 | 15,000 | 250 | - [diskbench]: https://github.com/ongardie/diskbenchmark [fio]: https://github.com/axboe/fio [fio-blog-post]: https://web.archive.org/web/20240726111518/https://prog.world/is-storage-speed-suitable-for-etcd-ask-fio/
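The FAQ hunks above repeatedly point at the `etcd_disk_wal_fsync_duration_seconds` histogram (p99 should stay under 10ms) when diagnosing slow-disk and heartbeat warnings. As a rough illustration rather than a supported tool, and assuming an etcd member serving metrics on `http://127.0.0.1:2379/metrics`, the relevant histogram lines can be pulled like this:

```go
// Minimal sketch: dump the WAL fsync latency histogram lines from an etcd
// /metrics endpoint, the metric the FAQ suggests watching (p99 < 10ms).
// The endpoint URL is an assumption; adjust it for TLS or a dedicated metrics port.
package main

import (
	"bufio"
	"fmt"
	"log"
	"net/http"
	"strings"
)

func main() {
	resp, err := http.Get("http://127.0.0.1:2379/metrics")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	sc := bufio.NewScanner(resp.Body)
	sc.Buffer(make([]byte, 1024*1024), 1024*1024) // metrics pages can have long lines
	for sc.Scan() {
		line := sc.Text()
		// Keep only the WAL fsync histogram buckets, sum, and count.
		if strings.HasPrefix(line, "etcd_disk_wal_fsync_duration_seconds") {
			fmt.Println(line)
		}
	}
	if err := sc.Err(); err != nil {
		log.Fatal(err)
	}
}
```

In a Prometheus setup, the usual query for the same check is `histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))`.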