[e2e] Move/Pivot test intermittently fails with stuck CAPD Machine #691

@anmazzotti

Description

What happened:
The e2e Move/Pivot test fails due to an unknown containerd issue.

How to run the test:

  • GINKGO_SKIP="" GINKGO_FOCUS="Pivot" make test-e2e

What happens:

  • Very sporadically, when scaling up control plane nodes, the new Machine gets stuck in Provisioning. I can only reproduce this after the Cluster has been moved to manage itself.
  • The container associated with the Machine is Created, but never Started:
> docker ps -a
CONTAINER ID   IMAGE                                COMMAND                  CREATED          STATUS          PORTS                                              NAMES
f44792628f4e   kindest/node:v1.30.4                 "/usr/local/bin/entr…"   45 minutes ago   Created                                                            caprke2-e2e-fnk907-pivot-control-plane-qwzwm
0a0e80f73338   kindest/node:v1.30.4                 "/usr/local/bin/entr…"   48 minutes ago   Up 48 minutes                                                      caprke2-e2e-fnk907-pivot-md-0-2hxwq-s95cg
5911ad690f14   kindest/node:v1.30.4                 "/usr/local/bin/entr…"   58 minutes ago   Up 58 minutes   127.0.0.1:32851->6443/tcp                          caprke2-e2e-fnk907-pivot-control-plane-zrlrc
20643d6f3baf   kindest/haproxy:v20230606-42a2262b   "haproxy -W -db -f /…"   58 minutes ago   Up 58 minutes   0.0.0.0:32798->6443/tcp, 0.0.0.0:32799->8404/tcp   caprke2-e2e-fnk907-pivot-lb
409f2a2eff0d   moby/buildkit:buildx-stable-1        "buildkitd --allow-i…"   5 months ago     Up 4 days                                                          buildx_buildkit_rancher-turtles0
fdcbbc271ea7   moby/buildkit:buildx-stable-1        "buildkitd --allow-i…"   5 months ago     Up 4 days                                                          buildx_buildkit_cluster-api-provider-rke20
  • CAPD is unable to bootstrap this machine and fails in a loop:
I0701 08:02:30.134274       1 machine.go:392] "Failed running command" controller="dockermachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="DockerMachine" DockerMachine="bootstrap-pivot-cluster-ue8zsw/caprke2-e2e-fnk907-pivot-control-plane-9x22z" namespace="bootstrap-pivot-cluster-ue8zsw" name="caprke2-e2e-fnk907-pivot-control-plane-9x22z" reconcileID="44905eb4-bf56-4e89-b172-0168572bfa3d" Machine="bootstrap-pivot-cluster-ue8zsw/caprke2-e2e-fnk907-pivot-control-plane-qwzwm" Machine="bootstrap-pivot-cluster-ue8zsw/caprke2-e2e-fnk907-pivot-control-plane-qwzwm" Cluster="bootstrap-pivot-cluster-ue8zsw/caprke2-e2e-fnk907-pivot" instance="caprke2-e2e-fnk907-pivot-control-plane-qwzwm" command={"Cmd":"mkdir","Args":["-p","/etc/rancher/rke2"],"Stdin":""} stdout="" stderr="" bootstrap data="IyMgdGVtcGxhdGU6IGppbmphCiNjbG91ZC1jb25maWcKCndyaXRlX2ZpbGVzOgotICAgcGF0aDogL2V0Yy9yYW5jaGVyL3JrZTIvcmVnaXN0cmllcy55YW1sCiAgICBvd25lcjogcm9vdDpyb290CiAgICBwZXJtaXNzaW9uczogJzA2NDAnCiAgICBjb250ZW50OiB8CiAgICAgIGNvbmZpZ3M6IG51bGwKICAgICAgbWlycm9yczoge30KICAgICAgCi0gICBwYXRoOiAvZXRjL3JhbmNoZXIvcmtlMi9jb25maWcueWFtbAogICAgb3duZXI6IHJvb3Q6cm9vdAogICAgcGVybWlzc2lvbnM6ICcwNjQwJwogICAgY29udGVudDogfAogICAgICBkaXNhYmxlLWNsb3VkLWNvbnRyb2xsZXI6IHRydWUKICAgICAgZGlzYWJsZToKICAgICAgICAtIHJrZTItaW5ncmVzcy1uZ2lueAogICAgICBrdWJlLWFwaXNlcnZlci1hcmc6CiAgICAgICAgLSAtLWFub255bW91cy1hdXRoPXRydWUKICAgICAgdGxzLXNhbjoKICAgICAgICAtIDE3Mi4xOC4wLjMKICAgICAgY2x1c3Rlci1jaWRyOiAxMC40NS4wLjAvMTYKICAgICAgc2VydmljZS1jaWRyOiAxMC40Ni4wLjAvMTYKICAgICAga3ViZWxldC1hcmc6CiAgICAgICAgLSBhbm9ueW1vdXMtYXV0aD10cnVlCiAgICAgIHNlcnZlcjogaHR0cHM6Ly8xNzIuMTguMC4zOjkzNDUKICAgICAgdG9rZW46IDRhYmY0MWRiM2MyYjllMGRhZjkzMzhjOTdkNWFmNTgzCiAgICAgIAoKCnJ1bmNtZDoKICAtICdjdXJsIC1zZkwgaHR0cHM6Ly9nZXQucmtlMi5pbyB8IElOU1RBTExfUktFMl9WRVJTSU9OPXYxLjMxLjArcmtlMnIxIHNoIC1zIC0gc2VydmVyJwogIC0gJ3N5c3RlbWN0bCBlbmFibGUgcmtlMi1zZXJ2ZXIuc2VydmljZScKICAtICdzeXN0ZW1jdGwgc3RhcnQgcmtlMi1zZXJ2ZXIuc2VydmljZScKICAtICdta2RpciAtcCAvcnVuL2NsdXN0ZXIt
YXBpJwogIC0gJ2VjaG8gc3VjY2VzcyA+IC9ydW4vY2x1c3Rlci1hcGkvYm9vdHN0cmFwLXN1Y2Nlc3MuY29tcGxldGUnCg=="
I0701 08:02:30.135344       1 machine.go:573] "Got logs from the machine container" controller="dockermachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="DockerMachine" DockerMachine="bootstrap-pivot-cluster-ue8zsw/caprke2-e2e-fnk907-pivot-control-plane-9x22z" namespace="bootstrap-pivot-cluster-ue8zsw" name="caprke2-e2e-fnk907-pivot-control-plane-9x22z" reconcileID="44905eb4-bf56-4e89-b172-0168572bfa3d" Machine="bootstrap-pivot-cluster-ue8zsw/caprke2-e2e-fnk907-pivot-control-plane-qwzwm" Machine="bootstrap-pivot-cluster-ue8zsw/caprke2-e2e-fnk907-pivot-control-plane-qwzwm" Cluster="bootstrap-pivot-cluster-ue8zsw/caprke2-e2e-fnk907-pivot" output=<
	Inspected the container:
	{"Id":"f44792628f4e98a35e0eeff514cad90def39f18257b592a7bca1338d8ece0a13","Created":"2025-07-01T07:25:20.254689114Z","Path":"/usr/local/bin/entrypoint","Args":["/sbin/init"],"State":{"Status":"created","Running":false,"Paused":false,"Restarting":false,"OOMKilled":false,"Dead":false,"Pid":0,"ExitCode":0,"Error":"","StartedAt":"0001-01-01T00:00:00Z","FinishedAt":"0001-01-01T00:00:00Z"},"Image":"sha256:ea9c9420224022fe8410c8b9d96d80385ec9b0e444c0d1681fd71f118cb6377a","ResolvConfPath":"","HostnamePath":"","HostsPath":"","LogPath":"/var/lib/docker/containers/f44792628f4e98a35e0eeff514cad90def39f18257b592a7bca1338d8ece0a13/f44792628f4e98a35e0eeff514cad90def39f18257b592a7bca1338d8ece0a13-json.log","Name":"/caprke2-e2e-fnk907-pivot-control-plane-qwzwm","RestartCount":0,"Driver":"overlay2","Platform":"linux","MountLabel":"","ProcessLabel":"","AppArmorProfile":"unconfined","ExecIDs":null,"HostConfig":{"Binds":["/var/run/docker.sock:/var/run/docker.sock","/home/andrea/repos/cluster-api-provider-rke2/out/images:/var/lib/rancher/rke2/agent/images","/lib/modules:/lib/modules:ro"],"ContainerIDFile":"","LogConfig":{"Type":"json-file","Config":{"max-file":"5","max-size":"10m"}},"NetworkMode":"kind","PortBindings":{"6443/tcp":[{"HostIp":"127.0.0.1","HostPort":"0"}]},"RestartPolicy":{"Name":"on-failure","MaximumRetryCount":1},"AutoRemove":false,"VolumeDriver":"","VolumesFrom":null,"ConsoleSize":[0,0],"CapAdd":null,"CapDrop":null,"CgroupnsMode":"private","Dns":null,"DnsOptions":null,"DnsSearch":null,"ExtraHosts":null,"GroupAdd":null,"IpcMode":"private","Cgroup":"","Links":null,"OomScoreAdj":0,"PidMode":"","Privileged":true,"PublishAllPorts":false,"ReadonlyRootfs":false,"SecurityOpt":["seccomp=unconfined","apparmor=unconfined","label=disable"],"Tmpfs":{"/run":"","/tmp":""},"UTSMode":"","UsernsMode":"","ShmSize":67108864,"Runtime":"runc","Isolation":"","CpuShares":0,"Memory":0,"NanoCpus":0,"CgroupParent":"","BlkioWeight":0,"BlkioWeightDevice":null,"BlkioDeviceReadBps":null,"BlkioDeviceWr
iteBps":null,"BlkioDeviceReadIOps":null,"BlkioDeviceWriteIOps":null,"CpuPeriod":0,"CpuQuota":0,"CpuRealtimePeriod":0,"CpuRealtimeRuntime":0,"CpusetCpus":"","CpusetMems":"","Devices":null,"DeviceCgroupRules":null,"DeviceRequests":null,"MemoryReservation":0,"MemorySwap":0,"MemorySwappiness":null,"OomKillDisable":false,"PidsLimit":null,"Ulimits":null,"CpuCount":0,"CpuPercent":0,"IOMaximumIOps":0,"IOMaximumBandwidth":0,"MaskedPaths":null,"ReadonlyPaths":null,"Init":false},"GraphDriver":{"Data":{"ID":"f44792628f4e98a35e0eeff514cad90def39f18257b592a7bca1338d8ece0a13","LowerDir":"/var/lib/docker/overlay2/a1174e81ebbb1c950829b4eaffd79f2bb733e0ecccc77648fb0c3e36bd2e4f40-init/diff:/var/lib/docker/overlay2/0f3f34eab8ad0f0ee4c00fad856a13b327a017980e8c7a8fb4d5f3f6bb3f644f/diff:/var/lib/docker/overlay2/06501e4a9f1b13df72873b89dced1ba8d7b7ea87444e5b269771765a46ab9cc8/diff","MergedDir":"/var/lib/docker/overlay2/a1174e81ebbb1c950829b4eaffd79f2bb733e0ecccc77648fb0c3e36bd2e4f40/merged","UpperDir":"/var/lib/docker/overlay2/a1174e81ebbb1c950829b4eaffd79f2bb733e0ecccc77648fb0c3e36bd2e4f40/diff","WorkDir":"/var/lib/docker/overlay2/a1174e81ebbb1c950829b4eaffd79f2bb733e0ecccc77648fb0c3e36bd2e4f40/work"},"Name":"overlay2"},"Mounts":[{"Type":"bind","Source":"/var/run/docker.sock","Destination":"/var/run/docker.sock","Mode":"","RW":true,"Propagation":"rprivate"},{"Type":"bind","Source":"/home/andrea/repos/cluster-api-provider-rke2/out/images","Destination":"/var/lib/rancher/rke2/agent/images","Mode":"","RW":true,"Propagation":"rprivate"},{"Type":"bind","Source":"/lib/modules","Destination":"/lib/modules","Mode":"ro","RW":false,"Propagation":"rprivate"},{"Type":"volume","Name":"be5be41062dc3e50b55c3c56319cad560c142d1e919ec4d7e0d3dcb1780ebcac","Source":"/var/lib/docker/volumes/be5be41062dc3e50b55c3c56319cad560c142d1e919ec4d7e0d3dcb1780ebcac/_data","Destination":"/var","Driver":"local","Mode":"","RW":true,"Propagation":""}],"Config":{"Hostname":"caprke2-e2e-fnk907-pivot-control-plane-qwzwm","Doma
inname":"","User":"","AttachStdin":false,"AttachStdout":false,"AttachStderr":false,"ExposedPorts":{"6443/tcp":{}},"Tty":true,"OpenStdin":false,"StdinOnce":false,"Env":["KUBECONFIG=/etc/kubernetes/admin.conf","PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin","container=docker","HTTP_PROXY=","HTTPS_PROXY=","NO_PROXY="],"Cmd":null,"Image":"kindest/node:v1.30.4","Volumes":{"/var":{}},"WorkingDir":"/","Entrypoint":["/usr/local/bin/entrypoint","/sbin/init"],"OnBuild":null,"Labels":{"io.x-k8s.kind.cluster":"caprke2-e2e-fnk907-pivot","io.x-k8s.kind.role":"control-plane"},"StopSignal":"SIGRTMIN+3"},"NetworkSettings":{"Bridge":"","SandboxID":"","SandboxKey":"","Ports":{},"HairpinMode":false,"LinkLocalIPv6Address":"","LinkLocalIPv6PrefixLen":0,"SecondaryIPAddresses":null,"SecondaryIPv6Addresses":null,"EndpointID":"","Gateway":"","GlobalIPv6Address":"","GlobalIPv6PrefixLen":0,"IPAddress":"","IPPrefixLen":0,"IPv6Gateway":"","MacAddress":"","Networks":{"kind":{"IPAMConfig":null,"Links":null,"Aliases":null,"MacAddress":"","DriverOpts":null,"NetworkID":"","EndpointID":"","Gateway":"","IPAddress":"","IPPrefixLen":0,"IPv6Gateway":"","GlobalIPv6Address":"","GlobalIPv6PrefixLen":0,"DNSNames":null}}}}
	Got logs from the container:
 >
E0701 08:02:30.135741       1 controller.go:316] "Reconciler error" err="failed to exec DockerMachine bootstrap: failed to run cloud config: stdout:  stderr: : error creating container exec: Error response from daemon: container f44792628f4e98a35e0eeff514cad90def39f18257b592a7bca1338d8ece0a13 is not running" controller="dockermachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="DockerMachine" DockerMachine="bootstrap-pivot-cluster-ue8zsw/caprke2-e2e-fnk907-pivot-control-plane-9x22z" namespace="bootstrap-pivot-cluster-ue8zsw" name="caprke2-e2e-fnk907-pivot-control-plane-9x22z" reconcileID="44905eb4-bf56-4e89-b172-0168572bfa3d"
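
The `bootstrap data` field in the CAPD log above is base64-encoded cloud-config, so it can be decoded during triage to see what CAPD was trying to run. The following is a minimal sketch, assuming the value has been copied out of the log into a string. The helper name `decode_bootstrap_data` is made up for illustration, and the sample is only the first 44 characters of the logged value, not the full payload:

```python
import base64


def decode_bootstrap_data(encoded: str) -> str:
    """Decode the base64 `bootstrap data` value from a CAPD log line."""
    # Drop whitespace and line breaks that copying from a log or web page
    # may have introduced in the middle of the value.
    compact = "".join(encoded.split())
    return base64.b64decode(compact).decode("utf-8")


# First 44 characters of the value logged above (not the full payload).
sample = "IyMgdGVtcGxhdGU6IGppbmphCiNjbG91ZC1jb25maWcK"
print(decode_bootstrap_data(sample))
```

The sample decodes to the two header lines of the cloud-config (`## template: jinja` and `#cloud-config`). Decoding the full value the same way should show the `write_files` and `runcmd` sections.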

I could not find any relevant information in the CAPD logs or in the Docker events.
All the files I collected are attached at the end of this issue.

It's not clear to me whether this is a CAPD problem or whether it is due to RKE2 bootstrapping.
In cluster-api, a similar test also performs a rollout on the self-managed cluster, and that seems to work just fine.

Also note that starting the container manually works without issues.

docker start f44792628f4e
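
To spot containers in this state without reading the `docker ps -a` table by eye, something like the following may help. It is a minimal sketch, assuming the Docker CLI is on the PATH. It uses `docker ps --filter status=created`, the `io.x-k8s.kind.cluster` label seen in the inspect output above, and the `State` object from `docker inspect`, where a container that has never run reports the zero timestamp `0001-01-01T00:00:00Z` for `StartedAt`. The function names are hypothetical and are not part of CAPD or the e2e suite:

```python
import json
import subprocess
from typing import List

# StartedAt value Docker reports for a container that has never been started.
ZERO_TIME = "0001-01-01T00:00:00Z"


def never_started(inspect_entry: dict) -> bool:
    """True if a `docker inspect` entry is in Created state and never ran."""
    state = inspect_entry.get("State", {})
    return state.get("Status") == "created" and state.get("StartedAt") == ZERO_TIME


def find_never_started(cluster_name: str) -> List[str]:
    """Names of containers of the given cluster that were created but never started."""
    # List only the IDs of containers in Created state that carry the cluster label.
    ids = subprocess.run(
        ["docker", "ps", "-a", "-q",
         "--filter", "status=created",
         "--filter", f"label=io.x-k8s.kind.cluster={cluster_name}"],
        check=True, capture_output=True, text=True,
    ).stdout.split()
    if not ids:
        return []
    # `docker inspect` prints a JSON array with one entry per container.
    inspected = json.loads(
        subprocess.run(
            ["docker", "inspect", *ids],
            check=True, capture_output=True, text=True,
        ).stdout
    )
    return [entry["Name"].lstrip("/") for entry in inspected if never_started(entry)]
```

For the run above, `find_never_started("caprke2-e2e-fnk907-pivot")` would be expected to report the stuck control plane container. This only detects the state; it does not explain why the container was not started.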

What did you expect to happen:

How to reproduce it:

Anything else you would like to add:
capd.txt
docker-events.txt
docker-ps-all.txt
docker-version.txt
journal.txt

Metadata

Assignees

No one assigned

    Labels

    area/ci: Issues or PRs related to CI
    kind/bug: Something isn't working
    lifecycle/stale: Denotes an issue or PR has remained open with no activity and has become stale.
    needs-priority: Indicates an issue or PR needs a priority assigning to it
    needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.
