Allow configurable MaxConcurrentReconciles in Rufio controllers #311

@rahulbabu95

Problem

Rufio currently relies on controller-runtime's default value for MaxConcurrentReconciles in all of its controllers. Because that default is 1, each Rufio controller reconciles only one object at a time, so BMC operations are processed one after another.

When the Tinkerbell stack provisions multiple nodes concurrently, this becomes a significant bottleneck for operations such as netboot jobs.

Proposed Solution

Make the number of concurrent reconciliations configurable in the Rufio deployment so that operators can raise it to reduce latency. This could be implemented as:

  1. A CLI flag: --max-concurrent-reconciles
  2. An environment variable: RUFIO_MAX_CONCURRENT_RECONCILES
  3. Both options for flexibility
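
As a rough illustration of option 3, the sketch below resolves the value from both sources using only the Go standard library. The precedence (flag over environment variable over the default of 1), the function name `resolveMaxConcurrentReconciles`, and the rejection of values below 1 are assumptions made for this example, not existing Rufio behavior.

```go
package main

import (
	"flag"
	"fmt"
	"os"
	"strconv"
)

// envVar is the environment variable name proposed in this issue.
const envVar = "RUFIO_MAX_CONCURRENT_RECONCILES"

// resolveMaxConcurrentReconciles is a hypothetical helper.
// Assumed precedence: CLI flag > environment variable > default of 1.
func resolveMaxConcurrentReconciles(args []string, getenv func(string) string) (int, error) {
	def := 1 // matches controller-runtime's default
	if v := getenv(envVar); v != "" {
		n, err := strconv.Atoi(v)
		if err != nil {
			return 0, fmt.Errorf("invalid %s=%q: %w", envVar, v, err)
		}
		def = n
	}

	// The environment value (if any) becomes the flag's default,
	// so an explicit flag on the command line overrides it.
	fs := flag.NewFlagSet("rufio", flag.ContinueOnError)
	n := fs.Int("max-concurrent-reconciles", def, "maximum concurrent reconciles per controller")
	if err := fs.Parse(args); err != nil {
		return 0, err
	}
	if *n < 1 {
		return 0, fmt.Errorf("max concurrent reconciles must be >= 1, got %d", *n)
	}
	return *n, nil
}

func main() {
	n, err := resolveMaxConcurrentReconciles(os.Args[1:], os.Getenv)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("max concurrent reconciles:", n)
}
```

The resolved value would then presumably be handed to each controller through controller-runtime's `controller.Options{MaxConcurrentReconciles: n}`, for example via `WithOptions` on the controller builder.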

Performance Impact

I ran a benchmark script with different levels of concurrency to test the time for netboot jobs to complete across 30 nodes concurrently. The results are striking:

| Concurrency | Total Jobs | Completed Jobs | Failed Jobs | Min Duration (s) | Max Duration (s) | Avg Duration (s) | Median Duration (s) | Total Duration (s) |
|---|---|---|---|---|---|---|---|---|
| 1 | 30 | 30 | 0 | 469 | 620 | 568.13 | 571.5 | 647 |
| 5 | 30 | 30 | 0 | 102 | 147 | 126.77 | 127.5 | 181 |
| 10 | 30 | 30 | 0 | 64 | 96 | 81.17 | 85 | 124 |

Improvement Analysis:

  • Concurrency 5 vs 1: about 72% lower total duration (647 s down to 181 s)
  • Concurrency 10 vs 1: about 81% lower total duration (647 s down to 124 s)
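
For reference, the improvement can be recomputed from the results table. The sketch below assumes the comparison is based on the Total Duration (s) column and expresses it as a percentage reduction relative to concurrency 1; `percentReduction` is a name introduced only for this example.

```go
package main

import "fmt"

// percentReduction returns how much smaller value is than baseline, in percent.
func percentReduction(baseline, value float64) float64 {
	return (baseline - value) / baseline * 100
}

func main() {
	// Total Duration (s) values taken from the results table.
	const baseline = 647.0 // concurrency 1
	runs := []struct {
		concurrency int
		total       float64
	}{
		{5, 181},
		{10, 124},
	}
	for _, r := range runs {
		fmt.Printf("concurrency %d vs 1: %.2f%% lower total duration\n",
			r.concurrency, percentReduction(baseline, r.total))
	}
	// Prints:
	// concurrency 5 vs 1: 72.02% lower total duration
	// concurrency 10 vs 1: 80.83% lower total duration
}
```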

Test Script

The benchmark was performed using this script, which creates netboot jobs for multiple machines and measures their completion time at each concurrency setting.
