Commit 4285e30

Merged main branch
Signed-off-by: Wendy Ha <[email protected]>
1 parent 091033a commit 4285e30

42 files changed, +6884 -6248 lines changed

.github/workflows/update-release-version.yaml

Lines changed: 4 additions & 0 deletions
@@ -17,3 +17,7 @@ jobs:
     uses: ./.github/workflows/update-release-version-template.yaml
     with:
       release: v3.5
+  v3_6:
+    uses: ./.github/workflows/update-release-version-template.yaml
+    with:
+      release: v3.6

OWNERS

Lines changed: 2 additions & 0 deletions
@@ -10,3 +10,5 @@ approvers:
 - serathius # Marek Siarkowicz <[email protected]> <[email protected]>
 - spzala # Sahdev Zala <[email protected]>
 - wenjiaswe # Wenjia Zhang <[email protected]> <[email protected]>
+reviewers:
+- wendy-ha18 # Wendy Ha <[email protected]>
Lines changed: 125 additions & 0 deletions
@@ -0,0 +1,125 @@
---
title: Autonomous Testing of etcd's Robustness
author: "Marek Siarkowicz (Google)"
date: 2025-10-03
draft: false
---

*This is a post from the [CNCF blog](https://www.cncf.io/blog/2025/09/25/autonomous-testing-of-etcds-robustness/) which we are sharing with our community as well.*

As a critical component of many production systems, including Kubernetes, the etcd project's first priority is reliability.
Ensuring consistency and data safety requires our project contributors to continuously improve testing methodologies.
In this article, we describe how we used advanced simulation testing to uncover subtle bugs,
validate the robustness of our releases, and increase our confidence in etcd's stability.
We also share our key findings and how they have improved etcd.

## Enhancing etcd's Robustness Testing

Many critical software systems depend on etcd to be correct and consistent, most notably Kubernetes, which uses it as its primary datastore.
After some issues with the v3.5 release,
the etcd maintainers developed a new [robustness testing framework](https://github.com/etcd-io/etcd/issues/14045)
to better test for correctness under various failure scenarios. To further enhance our testing capabilities,
we integrated a deterministic simulation testing platform from [Antithesis](https://antithesis.com/) into our workflow.

The platform works by running the entire etcd cluster inside a deterministic hypervisor.
This specialized environment gives the testing software complete control over every source of non-determinism,
such as network behavior, thread scheduling, and system clocks.
This means any bug it discovers can be perfectly and reliably reproduced.
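
As a rough illustration of why controlling non-determinism makes failures reproducible, here is a minimal Python sketch. It is not how Antithesis or etcd work internally: the `simulate` function, the node names, and the toy fault model are all hypothetical. A single seeded random source stands in for the hypervisor's control over scheduling, network behavior, and clocks, so rerunning with the same seed replays the identical sequence of events.

```python
import random


def simulate(seed: int, steps: int = 20) -> list[str]:
    """Run a toy 'cluster' whose only source of randomness is one seeded RNG.

    Hypothetical stand-in for a deterministic environment: every scheduling
    and fault decision is drawn from `rng`, so the seed fully determines
    the resulting event trace.
    """
    rng = random.Random(seed)
    trace = []
    for step in range(steps):
        node = rng.choice(["node-a", "node-b", "node-c"])
        event = rng.choice(["write", "partition", "pause", "kill"])
        trace.append(f"{step}:{node}:{event}")
    return trace


# Replaying the same seed reproduces the exact same sequence of events.
first_run = simulate(seed=42)
replay = simulate(seed=42)
```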

Within this simulated environment, the testing methodology shifts away from traditional, scenario-based tests.
Instead of writing tests imperatively with strict assertions for one specific outcome, this approach uses declarative, property-based assertions about system behavior.
These properties are high-level invariants about the system that must always hold true, for example
"data consistency is never violated" or "a watch event is never dropped."

The platform then treats these properties not as passive checks, but as targets to break.
It combines automated exploration with targeted fault injection,
actively searching for the precise sequence of events and failures that will cause a property to be violated.
This active search for violations is what allows the platform to uncover subtle bugs that result from complex combinations of factors.
Antithesis refers to this approach as Autonomous Testing.
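
To make the idea of a property as a "target to break" concrete, the following Python sketch pairs a property with a naive search for a violation. Everything in it is an assumption made for illustration: `watch_never_drops_events` is a simplified version of the "a watch event is never dropped" invariant, `buggy_watch` is a toy watcher with a deliberately injected fault, and the search simply enumerates seeds, which is far cruder than the guided exploration described above. Because each run is driven by a seed, any violation found can be replayed exactly.

```python
import random


def watch_never_drops_events(committed, observed, start):
    """Property: the watcher has seen, in order and without gaps, a prefix of
    every committed revision at or after `start`."""
    expected = [rev for rev in committed if rev >= start]
    return observed == expected[: len(observed)]


def buggy_watch(committed, start, rng):
    """Toy watcher with a deliberate bug: an injected fault drops an event."""
    observed = []
    for rev in committed:
        if rev < start:
            continue
        if rng.random() < 0.2:  # injected fault: the event is silently lost
            continue
        observed.append(rev)
    return observed


def find_violation(max_seeds=1000):
    """Search seeds until one produces a history that breaks the property."""
    committed = list(range(1, 11))
    for seed in range(max_seeds):
        observed = buggy_watch(committed, start=3, rng=random.Random(seed))
        if not watch_never_drops_events(committed, observed, start=3):
            return seed, observed
    return None


result = find_violation()
```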

This builds upon etcd's existing robustness tests, which also use a property-based approach.
However, without a deterministic environment or automated exploration,
the original framework resembled throwing darts while blindfolded and hoping to hit the bullseye.
A bug might be found, but the process relied heavily on random chance, and failures were difficult to reproduce.
Antithesis's deterministic simulation and active exploration
remove the blindfold, enabling a systematic and reproducible search for bugs.

## How We Tested

Our goals for this testing effort were to:

1. **Validate the robustness of etcd v3.6.**
2. **Improve etcd's software quality by finding and fixing bugs.**
3. **Enhance our existing testing framework with autonomous testing.**

We ran our existing robustness tests on the Antithesis simulation platform, testing a 3-node and a 1-node etcd cluster against a variety of faults, including:

* **Network faults:** latency, congestion, and partitions.
* **Container-level faults:** thread pauses, process kills, clock jitter, and CPU throttling.

We tested older versions of etcd with known bugs to validate the testing methodology, as well as our stable releases (3.4, 3.5, 3.6) and the main development branch. In total, we ran 830 wall-clock hours of testing, which simulated 4.5 years of usage.
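
For a sense of scale, the two figures above imply that simulated time ran at roughly 47 to 48 times wall-clock time. The short calculation below works this out; the year length of 365.25 days is an assumption, and the post does not say how the ratio arises (for example, through parallel runs or time compression).

```python
HOURS_PER_YEAR = 365.25 * 24  # assumed average year length, in hours

wall_clock_hours = 830   # from the post
simulated_years = 4.5    # from the post

simulated_hours = simulated_years * HOURS_PER_YEAR
speedup = simulated_hours / wall_clock_hours

print(f"{simulated_hours:.0f} simulated hours, about {speedup:.1f}x wall-clock time")
```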

## What We Found

The results were impressive. The simulation testing not only found all the known bugs we tested for but also uncovered several new issues in our main development branch.

Here are some of the key findings:

* **A critical watch bug was discovered** that our existing tests had missed. This bug was present in all stable releases of etcd.
* **All known bugs were found**, giving us confidence in the ability of the combined testing approach to find regressions.
* **Our own testing was improved**: the runs revealed a flaw in our linearization checker model.

### Issues in the Main Development Branch

| Description | Report Link | Status | Impact | Details |
| :-- | :-- | :-- | :-- | :-- |
| [Watch on future revision might receive old events][bug-1-issue] | [Triage Report][bug-1-report] | Fixed in 3.6.2 ([\#20281][bug-1-fix]) | Medium | New bug discovered by Antithesis |
| [Watch on future revision might receive old notifications][bug-2-issue] | [Triage Report][bug-2-report] | Fixed in 3.6.2 ([\#20221][bug-2-fix]) | Medium | New bug discovered by both Antithesis and robustness tests |
| [Panic when two snapshots are received in short period][bug-3-issue] | [Triage Report][bug-3-report] | Open | Low | Previously discovered by robustness tests |
| [Panic from db page expected to be 5][bug-4-issue] | [Triage Report][bug-4-report] | Fixed in 3.6.5 ([\#20553][bug-4-fix]) | Low | New bug discovered by Antithesis |
| [Operation time based on watch response is incorrect][bug-5-issue] | [Triage Report][bug-5-report] | Fixed test on main branch ([\#19998][bug-5-fix]) | Low | Bug in robustness tests discovered by Antithesis |

[bug-1-issue]: https://github.com/etcd-io/etcd/issues/20221
[bug-1-report]: https://linuxfoundation.antithesis.com/report/LAbnx9WBHxp0BPeEDSFrTxl3/798H3lSB7pQb6x2LYB65zGlNhM_OmxZAza0PfRbjpQo.html?auth=v2.public.eyJzY29wZSI6eyJSZXBvcnRTY29wZVYxIjp7ImFzc2V0IjoiNzk4SDNsU0I3cFFiNngyTFlCNjV6R2xOaE1fT214WkF6YTBQZlJianBRby5odG1sIiwicmVwb3J0X2lkIjoiTEFibng5V0JIeHAwQlBlRURTRnJUeGwzIn19LCJuYmYiOiIyMDI1LTA3LTAyVDA2OjI2OjA5Ljc5MjM0NTQ2OVoifaf6ZskL_GQSGDCZ7ESxV5SbygmAq_NiZZ9Oj2wcMnFOZlEjL5QEfgxM1zjSkF20PrjCjrmKzr4U7fJVJOPT3Qo#/run/e3a65c762111a06ab412abbdec1e3a73-32-6/finding/984b7ce364030642155dcd71d492711c9f9f73a9
[bug-1-fix]: https://github.com/etcd-io/etcd/pull/20281
[bug-2-issue]: https://github.com/etcd-io/etcd/issues/20221
[bug-2-report]: https://linuxfoundation.antithesis.com/report/UZjUP_KGxboJepL7k1q_8pa4/ZqL0Vt9a7YESiiBmGecPMkBP8YgM1IwlTZJ4dcYjmZ8.html?auth=v2.public.eyJuYmYiOiIyMDI1LTA2LTI1VDAzOjE4OjIzLjM4MDU2MDQwMFoiLCJzY29wZSI6eyJSZXBvcnRTY29wZVYxIjp7ImFzc2V0IjoiWnFMMFZ0OWE3WUVTaWlCbUdlY1BNa0JQOFlnTTFJd2xUWko0ZGNZam1aOC5odG1sIiwicmVwb3J0X2lkIjoiVVpqVVBfS0d4Ym9KZXBMN2sxcV84cGE0In19feIAsYO4-UIigcL4eMu7QUqA6XFbCU3Hnw7BeyZW06o9x11mFqleHbSbRWdIcLdTH2Xzx42DXNB7dBqYq25Ujg4#/run/e35cadd61e2b01c494095b06141fcc8b-32-6/finding/984b7ce364030642155dcd71d492711c9f9f73a9
[bug-2-fix]: https://github.com/etcd-io/etcd/issues/20221
[bug-3-issue]: https://github.com/etcd-io/etcd/issues/18055
[bug-3-report]: https://linuxfoundation.antithesis.com/report/HqTiW-VhiXU25CCPP8vkSUPB/3Q73gnvlcEpEb6XVWcl4H3qTOnXZ7pFAdkpbpHr8mMI.html?auth=v2.public.eyJzY29wZSI6eyJSZXBvcnRTY29wZVYxIjp7ImFzc2V0IjoiM1E3M2dudmxjRXBFYjZYVldjbDRIM3FUT25YWjdwRkFka3BicEhyOG1NSS5odG1sIiwicmVwb3J0X2lkIjoiSHFUaVctVmhpWFUyNUNDUFA4dmtTVVBCIn19LCJuYmYiOiIyMDI1LTA2LTA2VDAxOjA5OjE5Ljc1MDg1NDI2NVoifW8RMYqVcS2V3idTzyvalEO2SnPqycds-Cn710lY-wlfqYPe1MAb2U0R2wEKVwPtSsr79WcnR8yYCyZyCQNqhAc#/run/31f74082d85b5ffdaf9f34ed37480bbd-32-6/finding/138fa550c81efa6efc7170191b75c4a22caea51f
[bug-4-report]: https://linuxfoundation.antithesis.com/report/G-9rIjiZJiwodTEN5avQ7wgK/u_uFsWOwZSxS5mOmbEprwMUijNhsWdV6mfde_CT-y4k.html?auth=v2.public.eyJuYmYiOiIyMDI1LTA3LTAzVDA3OjMxOjUyLjQzMzQ2ODk3NFoiLCJzY29wZSI6eyJSZXBvcnRTY29wZVYxIjp7ImFzc2V0IjoidV91RnNXT3daU3hTNW1PbWJFcHJ3TVVpak5oc1dkVjZtZmRlX0NULXk0ay5odG1sIiwicmVwb3J0X2lkIjoiRy05cklqaVpKaXdvZFRFTjVhdlE3d2dLIn19fUHU0wnVRoDtfilwOCROUiDTtcOlIZkrVaddCqjorH3utgcIEPIzlsrMAJGXFC6NTZMneLqAWWU_lq-9prD_tQc#/run/9088730ba7972869a3e2b68b66708b55-32-6/finding/b9cbdf1bc8bd74cab1388e30ebdbf0b37c6f1420
[bug-4-issue]: https://github.com/etcd-io/etcd/issues/20271
[bug-4-fix]: https://github.com/etcd-io/etcd/pull/20553
[bug-5-report]: https://linuxfoundation.antithesis.com/report/IVzVnBQKQ0aInboRbRdsVDIE/xlfYJ3eyHooIxRJqHjimRYPrnttrULyr8PqOfRD0pS8.html?auth=v2.public.eyJuYmYiOiIyMDI1LTA1LTIxVDA5OjMzOjUzLjcyODgwNTA4MVoiLCJzY29wZSI6eyJSZXBvcnRTY29wZVYxIjp7ImFzc2V0IjoieGxmWUozZXlIb29JeFJKcUhqaW1SWVBybnR0clVMeXI4UHFPZlJEMHBTOC5odG1sIiwicmVwb3J0X2lkIjoiSVZ6Vm5CUUtRMGFJbmJvUmJSZHNWRElFIn19fc8p5s8qWPm5KxSC8oqMFj8HzTze7dxXhyPVt3l-GLwxSHIsuAIk1-2W7tgrh9mNXpZkFRhedvGSYNyhZ272kAo#/run/2e0ec6758e3603c3e4f5fd43dd26ffab-31-8/finding/5a95b2983bca202814eaa6a3fe594910a72cd2c6
[bug-5-issue]: https://github.com/etcd-io/etcd/issues/19998
[bug-5-fix]: https://github.com/etcd-io/etcd/issues/19998

### Known Issues

Antithesis also found and reproduced the following known issues in older releases. These were the "[Brown M&Ms](https://www.safetydimensions.com.au/van-halen/)" set by the etcd maintainers: known bugs used to check that the testing methodology works.

| Description | Report Link |
| :-- | :-- |
| [Watch dropping an event when compacting on delete][known-1-issue] | [Triage Report][known-1-report] |
| [Revision decreasing caused by crash during compaction][known-2-issue] | [Triage Report][known-2-report] |
| [Watch progress notification not synced with stream][known-3-issue] | [Triage Report][known-3-report] |
| [Inconsistent revision caused by crash during defrag][known-4-issue] | [Triage Report][known-4-report] |
| [Watchable runlock bug][known-5-issue] | [Triage Report][known-5-report] |

[known-1-issue]: https://github.com/etcd-io/etcd/issues/18089
[known-1-report]: https://linuxfoundation.antithesis.com/report/eYAhUOXW751VmJwPvGPa6R52/SFgfiy4PFXUGW5JkKt-uOnLFUVk9ZDIxFNQDRIS-eLE.html?auth=v2.public.eyJuYmYiOiIyMDI1LTA2LTAyVDIxOjI5OjMwLjAxNjk5OTQ5NloiLCJzY29wZSI6eyJSZXBvcnRTY29wZVYxIjp7ImFzc2V0IjoiU0ZnZml5NFBGWFVHVzVKa0t0LXVPbkxGVVZrOVpESXhGTlFEUklTLWVMRS5odG1sIiwicmVwb3J0X2lkIjoiZVlBaFVPWFc3NTFWbUp3UHZHUGE2UjUyIn19feadk3puhf0lkOv5k8GN_uQ74jb64WhykomO8nUZVBbUqRC-dLOnb7ENYLEjLW_rConu9ADWMFK_WVX7_zpX-wE#/run/6f713ca33a385cfa6d1987f125cbd951-31-8/finding/984b7ce364030642155dcd71d492711c9f9f73a9
[known-2-issue]: https://github.com/etcd-io/etcd/issues/17780
[known-2-report]: https://linuxfoundation.antithesis.com/report/aRbi2JR9dqoXK2xvN-DfZi9S/r4GRi-BLXj6-kpaqz5fo8j8W-qUV1diKw6_x8vonLNk.html?auth=v2.public.eyJuYmYiOiIyMDI1LTA2LTAyVDIxOjMyOjQ5LjU1OTQ3Nzg2MFoiLCJzY29wZSI6eyJSZXBvcnRTY29wZVYxIjp7ImFzc2V0IjoicjRHUmktQkxYajYta3BhcXo1Zm84ajhXLXFVVjFkaUt3Nl94OHZvbkxOay5odG1sIiwicmVwb3J0X2lkIjoiYVJiaTJKUjlkcW9YSzJ4dk4tRGZaaTlTIn19fR5EmgiseZ02ngQHRC5uYXTIekPT7Z9Ta903abbN1xq-t2XYheG4YSlJFDdRIfyMpKKclB_uZGQOPd2kXKeutwM#/run/6a08b1cd0efe3d19b9bd89c6815e84e4-31-8/finding/5a95b2983bca202814eaa6a3fe594910a72cd2c6
[known-3-issue]: https://github.com/etcd-io/etcd/issues/15220
[known-3-report]: https://linuxfoundation.antithesis.com/report/ymTYOGwzB-UwlmrjT8VrC_Kn/Y-D2b7S_BKdIl67UqZtXafn0xPhbRulSZVQvPqsBZak.html?auth=v2.public.eyJuYmYiOiIyMDI1LTA1LTI5VDEyOjAyOjIzLjUzMTc3MTg2MloiLCJzY29wZSI6eyJSZXBvcnRTY29wZVYxIjp7ImFzc2V0IjoiWS1EMmI3U19CS2RJbDY3VXFadFhhZm4weFBoYlJ1bFNaVlF2UHFzQlphay5odG1sIiwicmVwb3J0X2lkIjoieW1UWU9Hd3pCLVV3bG1yalQ4VnJDX0tuIn19fRmIEwPnKRaq1qnN9tKlGw0m--zs7uFUMMi3AaZM_Kz6Uy0IzsO-af3D1DDBFzSyclF13rqyjI-3ki2d9ufDNQk#/run/fa475411ad6b37641065963bc37b5dd4-31-8/finding/984b7ce364030642155dcd71d492711c9f9f73a9
[known-4-issue]: https://github.com/etcd-io/etcd/pull/14685
[known-4-report]: https://linuxfoundation.antithesis.com/report/kuUVd-WEW4jkcEp7Uzsh-649/doX_RaZAkZxIOxxBn51bdhfjFzrV5ipnJYAQUAT2454.html?auth=v2.public.eyJuYmYiOiIyMDI1LTA2LTE5VDAxOjM3OjAxLjYyODQzMjM4OVoiLCJzY29wZSI6eyJSZXBvcnRTY29wZVYxIjp7ImFzc2V0IjoiZG9YX1JhWkFrWnhJT3h4Qm41MWJkaGZqRnpyVjVpcG5KWUFRVUFUMjQ1NC5odG1sIiwicmVwb3J0X2lkIjoia3VVVmQtV0VXNGprY0VwN1V6c2gtNjQ5In19fW1TcMbQfba6iffW4KX_yGOjmwg2qHbRsqzxhJOh8ywc6fxgJa8Lemw1ShkuhQs3caqWHlEEojyAEMVjlLPK4Ac#/run/8b12e3b98a5b206e30c5d0067746083e-32-6/finding/5a95b2983bca202814eaa6a3fe594910a72cd2c6
[known-5-issue]: https://github.com/etcd-io/etcd/pull/13505
[known-5-report]: https://linuxfoundation.antithesis.com/report/6zkMqkEjjJuinArwLZTkJehM/7h42qHA5soBgxYPsMiDa4dSbA-O_g2SL9vkJqxvGON8.html?auth=v2.public.eyJzY29wZSI6eyJSZXBvcnRTY29wZVYxIjp7ImFzc2V0IjoiN2g0MnFIQTVzb0JneFlQc01pRGE0ZFNiQS1PX2cyU0w5dmtKcXh2R09OOC5odG1sIiwicmVwb3J0X2lkIjoiNnprTXFrRWpqSnVpbkFyd0xaVGtKZWhNIn19LCJuYmYiOiIyMDI1LTA1LTMwVDAwOjUyOjM2LjUwMjc3MTMyNFoifSYHsw1ZLCSfxME2keN58uGgi2yHTLvlg5_mFLkmePovjDjan-8SH72WdrmeWc4OMoRR-F3Pmi9UkU546_rtkgI#/run/8b82751143d9baaf98535089301d7af4-31-8/finding/984b7ce364030642155dcd71d492711c9f9f73a9

## Conclusion

The integration of this advanced simulation testing into our development workflow has been a success.
It has allowed us to find and fix critical bugs, improve our existing testing framework,
and increase our confidence in the reliability of etcd. We will continue to leverage this technology
to ensure that etcd remains a stable and trusted distributed key-value store for the community.

content/en/blog/2025/upgrade_from_3.5_to_3.6_issue.md

Lines changed: 7 additions & 0 deletions
@@ -5,6 +5,13 @@ date: 2025-03-27
 draft: false
 ---
 
+{{% alert title="Update (October 21, 2025)" %}}
+We have identified and fixed an additional scenario related to this issue. Please see our
+new blog post
+[Follow Up - Preventing Upgrade Failures from etcd v3.5 to v3.6](/blog/2025/upgrade_from_3.5_to_3.6_issue_followup)
+for details.
+{{% /alert %}}
+
 There is a common issue [19557][] in the etcd v3.5 to v3.6 upgrade that may cause the upgrade
 process to fail. You can find detailed information and related discussions in the issue.

Lines changed: 50 additions & 0 deletions
@@ -0,0 +1,50 @@
---
title: Follow Up - Preventing Upgrade Failures from etcd v3.5 to v3.6
author: "[Benjamin Wang](https://github.com/ahrtr) VMware by Broadcom, [Josh Berkus](https://github.com/jberkus) Red Hat"
date: 2025-10-21
draft: false
---

We have identified and fixed an additional scenario that may cause upgrade failures when
moving from etcd v3.5 to v3.6. This post describes the scenario, the fix, and additional workarounds.
Please refer to issue [20793][] for detailed technical information.

## Issue

In a previous post, [How to Prevent a Common Failure when Upgrading etcd v3.5 to v3.6][], we
described an upgrade issue affecting etcd versions v3.5.1 through v3.5.19. That issue was addressed in
v3.5.20. However, a follow-up investigation revealed that the original fix did not cover all scenarios.

Specifically, during rolling replacement upgrades (such as those performed by Cluster API when upgrading
Kubernetes control planes), a new learner may receive a snapshot from an older member (≤ v3.5.19) containing
incorrect membership data. This inconsistency does not affect clusters still on v3.5, where v2store remains
authoritative, but it can cause upgrade failures when moving to v3.6, where v3store becomes the source of truth
and a new learner may again receive a snapshot with incorrect membership data.

## Solution

This additional scenario has been addressed in etcd v3.5.24 via [20797][]. If at all possible, upgrade
to etcd v3.5.24 (or a higher patch version) before upgrading to etcd v3.6;
otherwise, the upgrade may fail.

## Workarounds

Upgrading directly to v3.5.24 or later is the simplest and most reliable way to avoid the upgrade failure.
However, if you cannot upgrade to v3.5.24 (or a higher patch version) for some reason, please apply
one of the following workarounds before upgrading to v3.6:

- If you are already running v3.5.20 - v3.5.23, restart all etcd members before upgrading to v3.6.x.
  - A full restart triggers re-registration and corrects the incorrect membership information.
- If you are already running v3.5.20 - v3.5.23, alternatively perform an additional upgrade to any patch version in v3.5.20 - v3.5.23.
  - Each member will re-register its server information, automatically correcting the incorrect membership data in the additional upgrade.
- If you are running v3.5.19 or earlier, upgrade to any version between v3.5.20 and v3.5.23, and then apply one of the two workarounds above.
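
The guidance above can be summarized as a small decision helper. The Python sketch below is only an illustration of the rules stated in this post: the function name `upgrade_advice` and its return strings are hypothetical, it is not part of etcd or any etcd tooling, and it assumes plain `major.minor.patch` version strings (pre-release suffixes are not handled).

```python
def upgrade_advice(version: str) -> str:
    """Map a running etcd v3.5.x version to the guidance in this post.

    Hypothetical helper for illustration only.
    """
    major, minor, patch = (int(part) for part in version.lstrip("v").split("."))
    if (major, minor) != (3, 5):
        return "not covered: this guidance applies to v3.5.x only"
    if patch >= 24:
        return "safe to upgrade to v3.6"
    if patch >= 20:
        # v3.5.20 - v3.5.23: either workaround corrects the membership data.
        return (
            "upgrade to v3.5.24+, or restart all members "
            "(or perform one more v3.5.20-v3.5.23 upgrade), then upgrade to v3.6"
        )
    # v3.5.19 or earlier: first reach v3.5.20 - v3.5.23, then apply a workaround.
    return (
        "upgrade to v3.5.24+, or upgrade to v3.5.20-v3.5.23 "
        "and apply a workaround, then upgrade to v3.6"
    )
```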

## Acknowledgements

We would like to thank [Avinash Batukbhai][] from Broadcom for reporting the upgrade issue.
His report helped bring the issue to our attention so that we could investigate and resolve it upstream.

[20793]: https://github.com/etcd-io/etcd/issues/20793
[How to Prevent a Common Failure when Upgrading etcd v3.5 to v3.6]: https://etcd.io/blog/2025/upgrade_from_3.5_to_3.6_issue/
[20797]: https://github.com/etcd-io/etcd/pull/20797
[Avinash Batukbhai]: https://github.com/avinashsavaliya
