Description
Problem
Currently, all Rufio controllers use controller-runtime's default value for MaxConcurrentReconciles. Since that default is 1, each Rufio controller reconciles only one object at a time, so BMC operations are processed serially.
When multiple nodes are getting provisioned concurrently by the Tinkerbell stack, this creates a significant bottleneck for operations like netboot jobs to complete.
Proposed Solution
Make the number of concurrent reconciles configurable in the Rufio deployment to reduce latency. This could be implemented as:
- A CLI flag: `--max-concurrent-reconciles`
- An environment variable: `RUFIO_MAX_CONCURRENT_RECONCILES`
- Both options for flexibility
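To make the "both options" variant concrete, here is a minimal sketch using only the Go standard library. The helper name `resolveMaxConcurrent` and the precedence order (flag over environment variable over the default of 1) are assumptions for illustration, not existing Rufio code. The resolved value would then be passed to each controller's setup through controller-runtime's `controller.Options{MaxConcurrentReconciles: n}`.

```go
package main

import (
	"flag"
	"fmt"
	"os"
	"strconv"
)

// envMaxConcurrent is the environment variable name proposed in this issue.
const envMaxConcurrent = "RUFIO_MAX_CONCURRENT_RECONCILES"

// resolveMaxConcurrent is a hypothetical helper. Assumed precedence:
// CLI flag, then environment variable, then the controller-runtime default of 1.
func resolveMaxConcurrent(args []string, getenv func(string) string) (int, error) {
	def := 1
	if v := getenv(envMaxConcurrent); v != "" {
		n, err := strconv.Atoi(v)
		if err != nil {
			return 0, fmt.Errorf("invalid %s=%q: %w", envMaxConcurrent, v, err)
		}
		def = n // the environment value becomes the flag's default
	}

	fs := flag.NewFlagSet("rufio", flag.ContinueOnError)
	n := fs.Int("max-concurrent-reconciles", def, "maximum concurrent reconciles per controller")
	if err := fs.Parse(args); err != nil {
		return 0, err
	}
	if *n < 1 {
		return 0, fmt.Errorf("max concurrent reconciles must be >= 1, got %d", *n)
	}
	return *n, nil
}

func main() {
	n, err := resolveMaxConcurrent(os.Args[1:], os.Getenv)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// In Rufio, n would be handed to each controller's setup, e.g. via
	// controller.Options{MaxConcurrentReconciles: n} from controller-runtime.
	fmt.Printf("max concurrent reconciles: %d\n", n)
}
```

Using the environment value as the flag default keeps a single validation path and lets an explicit flag override a value set in the deployment manifest.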
Performance Impact
I ran a benchmark script with different levels of concurrency to test the time for netboot jobs to complete across 30 nodes concurrently. The results are striking:
| Concurrency | Total Jobs | Completed Jobs | Failed Jobs | Min Duration (s) | Max Duration (s) | Avg Duration (s) | Median Duration (s) | Total Duration (s) |
|---|---|---|---|---|---|---|---|---|
| 1 | 30 | 30 | 0 | 469 | 620 | 568.13 | 571.5 | 647 |
| 5 | 30 | 30 | 0 | 102 | 147 | 126.77 | 127.5 | 181 |
| 10 | 30 | 30 | 0 | 64 | 96 | 81.17 | 85 | 124 |
Improvement Analysis:
- Concurrency 5 vs 1: about 72% lower total duration (647 s to 181 s)
- Concurrency 10 vs 1: about 81% lower total duration (647 s to 124 s)
Test Script
The benchmark was performed using this script which creates netboot jobs for multiple machines and measures completion time with different concurrency settings.