Conversation

@tehbooom
Member

@tehbooom tehbooom commented Aug 18, 2025

This PR allows a cluster to have multiple StackConfigPolicies (SCPs). Conflicts over the same resources are avoided based on the weights of the policies, with the lowest weight having the highest priority. If two policies have the same weight and configure the same resources, reconciliation fails and the phase is changed to CONFLICT.

Here is a simple table showing what should pass and what should fail

Weight     Resource     Result
Same       Different    Pass
Same       Same         Fail (CONFLICT)
Different  Same         Pass
Different  Different    Pass
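
In pseudocode, the resolution rule boils down to roughly the following sketch (illustrative only, not the actual controller code): sort by weight, let the lowest weight win, and flag two policies as conflicting when they share both a weight and a resource key.

package main

import (
	"fmt"
	"sort"
)

// policy is a stand-in for a StackConfigPolicy: a weight plus the set of
// resource keys it configures (e.g. snapshot repository names).
type policy struct {
	Name      string
	Weight    int32
	Resources map[string]bool
}

// resolve returns the policies in priority order (lowest weight first) or an
// error if two policies with the same weight configure the same resource.
func resolve(policies []policy) ([]policy, error) {
	sort.SliceStable(policies, func(i, j int) bool { return policies[i].Weight < policies[j].Weight })
	for i := 0; i < len(policies); i++ {
		for j := i + 1; j < len(policies) && policies[j].Weight == policies[i].Weight; j++ {
			for key := range policies[i].Resources {
				if policies[j].Resources[key] {
					return nil, fmt.Errorf("conflict: %s and %s both configure %q with weight %d",
						policies[i].Name, policies[j].Name, key, policies[i].Weight)
				}
			}
		}
	}
	return policies, nil
}

func main() {
	// Same weight, same resource: the row of the table above that must fail.
	_, err := resolve([]policy{
		{Name: "cluster-policy-1", Weight: 10, Resources: map[string]bool{"policy-1-backups": true}},
		{Name: "cluster-policy-2", Weight: 10, Resources: map[string]bool{"policy-1-backups": true}},
	})
	fmt.Println(err)
}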

The following combinations were tested, applied in either order, and passed.

---
# different weights same policy should pass with policy-1 winning
apiVersion: stackconfigpolicy.k8s.elastic.co/v1alpha1
kind: StackConfigPolicy
metadata:
  name: cluster-policy-1
spec:
  weight: -1
  elasticsearch:
    snapshotRepositories:
      policy-1-backups:
        type: s3
        settings:
          bucket: "policy-1-backups"
          region: "us-west-2"
---
apiVersion: stackconfigpolicy.k8s.elastic.co/v1alpha1
kind: StackConfigPolicy
metadata:
  name: cluster-policy-2
spec:
  weight: 11
  elasticsearch:
    snapshotRepositories:
      policy-1-backups:
        type: s3
        settings:
          bucket: "policy-2-backups"
          region: "us-east-1"
---
## Different weights different policies should pass
apiVersion: stackconfigpolicy.k8s.elastic.co/v1alpha1
kind: StackConfigPolicy
metadata:
  name: cluster-policy-1
spec:
  weight: 10
  elasticsearch:
    snapshotRepositories:
      policy-1-backups:
        type: s3
        settings:
          bucket: "policy-1-backups"
          region: "us-west-2"
---
apiVersion: stackconfigpolicy.k8s.elastic.co/v1alpha1
kind: StackConfigPolicy
metadata:
  name: cluster-policy-2
spec:
  weight: 11
  elasticsearch:
    snapshotRepositories:
      policy-2-backups:
        type: s3
        settings:
          bucket: "policy-2-backups"
          region: "us-east-1"
---
## Same weight different repositories should pass
apiVersion: stackconfigpolicy.k8s.elastic.co/v1alpha1
kind: StackConfigPolicy
metadata:
  name: cluster-policy-1
spec:
  weight: 10
  elasticsearch:
    snapshotRepositories:
      policy-1-backups:
        type: s3
        settings:
          bucket: "policy-1-backups"
          region: "us-west-2"
---
apiVersion: stackconfigpolicy.k8s.elastic.co/v1alpha1
kind: StackConfigPolicy
metadata:
  name: cluster-policy-2
spec:
  weight: 10
  elasticsearch:
    snapshotRepositories:
      policy-2-backups:
        type: s3
        settings:
          bucket: "policy-3-backups"
          region: "us-east-1"
---
# Policy 2 wins with telemetry.optIn set to true
apiVersion: stackconfigpolicy.k8s.elastic.co/v1alpha1
kind: StackConfigPolicy
metadata:
  name: kibana-policy-1
spec:
  weight: 1
  kibana:
    config:
      telemetry.optIn: false
---
apiVersion: stackconfigpolicy.k8s.elastic.co/v1alpha1
kind: StackConfigPolicy
metadata:
  name: kibana-policy-2
spec:
  weight: 0
  kibana:
    config:
      telemetry.optIn: true

The following will fail

---
# Same weight same policy must fail
apiVersion: stackconfigpolicy.k8s.elastic.co/v1alpha1
kind: StackConfigPolicy
metadata:
  name: cluster-policy-1
spec:
  weight: 10
  elasticsearch:
    snapshotRepositories:
      policy-1-backups:
        type: s3
        settings:
          bucket: "policy-1-backups"
          region: "us-west-2"
---
apiVersion: stackconfigpolicy.k8s.elastic.co/v1alpha1
kind: StackConfigPolicy
metadata:
  name: cluster-policy-2
spec:
  weight: 10
  elasticsearch:
    snapshotRepositories:
      policy-1-backups:
        type: s3
        settings:
          bucket: "policy-2-backups"
          region: "us-east-1"

This was also tested by installing ECK 3.1.0 and then upgrading. Pre-existing SCPs get a default weight of 0.

@tehbooom tehbooom added the >enhancement Enhancement of existing functionality label Aug 18, 2025
@prodsecmachine
Collaborator

prodsecmachine commented Aug 18, 2025

Snyk checks have passed. No issues have been found so far.

Status  Scanner               Critical  High  Medium  Low  Total
✅      Licenses              0         0     0       0    0 issues
✅      Open Source Security  0         0     0       0    0 issues


@github-actions

github-actions bot commented Aug 18, 2025

🔍 Preview links for changed docs

@pebrc pebrc requested a review from Copilot August 22, 2025 08:11
Contributor

Copilot AI left a comment

Pull Request Overview

This PR implements support for multiple StackConfigPolicies (SCPs) per cluster by introducing a weight-based priority system. Policies with lower weights take precedence over those with higher weights, and conflicts are detected when policies with the same weight attempt to configure the same resources.

  • Adds a weight field to StackConfigPolicy specs with default value 0
  • Implements conflict detection for policies with same weights targeting overlapping resources
  • Merges configurations from multiple policies based on weight priorities

Reviewed Changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 6 comments.

Show a summary per file

pkg/apis/stackconfigpolicy/v1alpha1/stackconfigpolicy_types.go - Adds weight field to StackConfigPolicySpec
pkg/controller/stackconfigpolicy/weight_conflict_test.go - Comprehensive test suite for weight conflict detection logic
pkg/controller/stackconfigpolicy/controller.go - Core multi-policy support with conflict detection and policy merging
pkg/controller/stackconfigpolicy/elasticsearch_config_settings.go - Multi-policy versions of Elasticsearch configuration functions
pkg/controller/stackconfigpolicy/kibana_config_settings.go - Multi-policy versions of Kibana configuration functions
pkg/controller/elasticsearch/filesettings/secret.go - Multi-policy secret management with policy reference tracking
pkg/controller/elasticsearch/filesettings/file_settings.go - Policy merging logic for Elasticsearch file settings
pkg/controller/stackconfigpolicy/controller_test.go - Updates test expectations for multi-policy behavior
config/crds/v1/*.yaml - CRD updates to include weight field
docs/reference/api-reference/main.md - Documentation for new weight field

// For map-type values (like snapshot repositories), individual entries are merged rather than replaced
func (s *Settings) mergeConfig(target, source *commonv1.Config) {
	if source == nil || source.Data == nil {
		return
Copilot AI Aug 22, 2025

The mergeConfig function modifies the target parameter but then creates a new Config object that is not assigned back to the original parameter. The line target = &commonv1.Config{Data: make(map[string]interface{})} only modifies the local variable, not the original pointer. This will cause the merge to fail silently.

Suggested change:
-func (s *Settings) mergeConfig(target, source *commonv1.Config) {
-	if source == nil || source.Data == nil {
-		return
+func (s *Settings) mergeConfig(target, source *commonv1.Config) *commonv1.Config {
+	if source == nil || source.Data == nil {
+		return target
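
A minimal standalone illustration of the pointer semantics behind this observation (not the PR's code): reassigning a pointer parameter only changes the local copy, so the caller never sees a Config allocated inside the function unless it is returned.

package main

import "fmt"

type Config struct{ Data map[string]interface{} }

// broken reassigns the local pointer; the caller's nil config stays nil and the write is lost.
func broken(target *Config) {
	if target == nil {
		target = &Config{Data: map[string]interface{}{}}
	}
	target.Data["k"] = "v"
}

// fixed returns the (possibly newly allocated) config so the caller can keep it.
func fixed(target *Config) *Config {
	if target == nil {
		target = &Config{Data: map[string]interface{}{}}
	}
	target.Data["k"] = "v"
	return target
}

func main() {
	var c *Config
	broken(c)
	fmt.Println(c == nil) // true: the merge result is lost
	c = fixed(c)
	fmt.Println(c.Data) // map[k:v]
}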

Collaborator

I think this is a valid copilot observation. Also I would expect unit tests for this kind of functionality. We do have more advanced merging through the CanonicalConfig abstraction in the code base. But there might be some gotchas with the more complex configuration objects that stack config policies support that we might not be thinking of right now.

Contributor

@thbkrkr thbkrkr left a comment

Looks great, thank you! I like how this extends the original implementation to multiple SCPs without adding complexity. I didn’t do any testing on my side, but code-wise this looks good. 👍

@pebrc
Collaborator

pebrc commented Sep 19, 2025

buildkite test this -f p=gke,t=TestStackConfigPolicy*

Collaborator

@pebrc pebrc left a comment

I am generally positive about the approach but I feel like this needs more work, and test coverage. I did a quick smoke test and it does not work as advertised.

},
ClusterSettings: &commonv1.Config{
	Data: map[string]interface{}{
		"persistent": map[string]interface{}{
Collaborator

this leads to an illegal setting: "persistent"/"transient" are not supported via stack config policies


// mergeConfig merges source config into target config, with source taking precedence
// For map-type values (like snapshot repositories), individual entries are merged rather than replaced
func (s *Settings) mergeConfig(target, source *commonv1.Config) {
Collaborator

Why is this a receiver method of Settings?

Comment on lines 167 to 177
for key, value := range source.Data {
	if targetValue, exists := target.Data[key]; exists {
		if targetMap, targetIsMap := targetValue.(map[string]interface{}); targetIsMap {
			if sourceMap, sourceIsMap := value.(map[string]interface{}); sourceIsMap {
				for subKey, subValue := range sourceMap {
					targetMap[subKey] = subValue
				}
				continue
			}
		}
	}
Collaborator

This seems too simplistic given that, for example, cluster settings can be expressed in both nested and flat syntax. Also, it only merges one level deep?
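
For reference, a deep merge over plain maps is only a few lines (sketch below, illustrative only). The harder part, as noted above, is reconciling flat dotted keys (e.g. "indices.recovery.max_bytes_per_sec") with their nested equivalent, which is the kind of normalization the CanonicalConfig abstraction mentioned above is intended to handle.

// deepMerge recursively merges src into dst, with src taking precedence on
// scalar values; nested maps are merged key by key at every depth.
func deepMerge(dst, src map[string]interface{}) map[string]interface{} {
	if dst == nil {
		dst = map[string]interface{}{}
	}
	for k, v := range src {
		srcMap, srcIsMap := v.(map[string]interface{})
		dstMap, dstIsMap := dst[k].(map[string]interface{})
		if srcIsMap && dstIsMap {
			dst[k] = deepMerge(dstMap, srcMap)
			continue
		}
		dst[k] = v
	}
	return dst
}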

@tehbooom
Member Author

buildkite test this -f p=gke,t=TestStackConfigPolicy*

Collaborator

@pebrc pebrc left a comment

I started running tests. It is a fairly complex implementation, so I am not finished with my testing, but I wanted to leave feedback already.

I have two main concerns:

  1. Secret garbage collection is not working/not implemented and we are leaking secrets. This can be addressed.
  2. The reconciliation logic has potential concurrency issues: in controller-runtime the workqueue guarantees that the same resource is not reconciled concurrently by multiple goroutines. This implementation turns that logic upside down. We reconcile one policy based on an event (create/update/delete), but then in the same iteration we actually reconcile all policies that affect the same Elasticsearch/Kibana. That means that multiple goroutines can now potentially reconcile the configuration for the same Elasticsearch/Kibana clusters at the same time. I have not tested this in detail, and maybe K8s optimistic concurrency control helps us a bit here, but I don't have a good remedy for this problem at the moment.

}

// policyIsEmpty checks if a policy has no meaningful configuration
func (r *ReconcileStackConfigPolicy) policyIsEmpty(policy *policyv1alpha1.StackConfigPolicy) bool {
Collaborator

Could be a receiver method on the policy itself:

// HasNoSettings checks if the policy has no meaningful settings.
// SecureSettings are not considered in this check as they are composable with other policies.
func (s StackConfigPolicySpec) HasNoSettings() bool {
	return s.Elasticsearch.ClusterSettings.IsEmpty() &&
		s.Elasticsearch.SnapshotRepositories.IsEmpty() &&
		s.Elasticsearch.SnapshotLifecyclePolicies.IsEmpty() &&
		s.Elasticsearch.SecurityRoleMappings.IsEmpty() &&
		s.Elasticsearch.IndexLifecyclePolicies.IsEmpty() &&
		s.Elasticsearch.IngestPipelines.IsEmpty() &&
		s.Elasticsearch.IndexTemplates.ComponentTemplates.IsEmpty() &&
		s.Elasticsearch.IndexTemplates.ComposableIndexTemplates.IsEmpty() &&
		s.Elasticsearch.Config.IsEmpty() &&
		len(s.Elasticsearch.SecretMounts) == 0 &&
		s.Kibana.Config.IsEmpty()
}


// Check if there are any common keys
for key := range config1.Data {
	if _, exists := config2.Data[key]; exists {
Collaborator

I wonder if this first-level comparison catches all cases, especially if there is a mix of nested and flat syntax.
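
For example, one policy may set a cluster setting as a flat key ("indices.recovery.max_bytes_per_sec") while another nests it under "indices:", and a first-level key comparison would see no overlap. One way to catch that (an illustrative sketch, not the PR's code) is to flatten both configs to dotted paths before comparing:

// flattenKeys converts nested maps into dotted paths, e.g.
// {"indices": {"recovery": {"max_bytes_per_sec": "50mb"}}} becomes "indices.recovery.max_bytes_per_sec".
func flattenKeys(prefix string, data map[string]interface{}, out map[string]struct{}) {
	for k, v := range data {
		key := k
		if prefix != "" {
			key = prefix + "." + k
		}
		if nested, ok := v.(map[string]interface{}); ok {
			flattenKeys(key, nested, out)
			continue
		}
		out[key] = struct{}{}
	}
}

// configsOverlap reports whether two configs set the same dotted path, even if
// one uses flat syntax and the other nested syntax.
func configsOverlap(a, b map[string]interface{}) bool {
	keysA, keysB := map[string]struct{}{}, map[string]struct{}{}
	flattenKeys("", a, keysA)
	flattenKeys("", b, keysB)
	for k := range keysA {
		if _, ok := keysB[k]; ok {
			return true
		}
	}
	return false
}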

}

// configsConflict checks if two Config objects have overlapping keys
func (r *ReconcileStackConfigPolicy) configsConflict(config1, config2 *commonv1.Config) bool {
Collaborator

This method does not need to be a receiver method of ReconcileStackConfigPolicy

}

// policiesCouldOverlap checks if two policies could potentially target the same resources
func (r *ReconcileStackConfigPolicy) policiesCouldOverlap(policy1, policy2 *policyv1alpha1.StackConfigPolicy, policy1Selector labels.Selector) bool {
Collaborator

Bit awkward to have the policy selector passed in here. I think I would probably make these receiver methods of StackConfigPolicy instead.

}

// NewSettingsSecretFromPolicies returns a new SettingsSecret for a given Elasticsearch from multiple StackConfigPolicies.
func NewSettingsSecretFromPolicies(version int64, es types.NamespacedName, currentSecret *corev1.Secret, policies []policyv1alpha1.StackConfigPolicy, meta metadata.Metadata) (corev1.Secret, int64, error) {
Collaborator

This seems to be only called from NewSettingsSecretWithVersionFromPolicies. I totally get why we want a second function where we can pass the version for testing, but I don't see any tests for this, and it does not have to be exported.

}

// check for weight conflicts with other policies
if err := r.checkWeightConflicts(ctx, &policy); err != nil {
Collaborator

Should this be part of the admission webhook validation checks already to give immediate feedback to users?


func newElasticsearchConfigSecret(policy policyv1alpha1.StackConfigPolicy, es esv1.Elasticsearch) (corev1.Secret, error) {
// reconcileSecretMounts creates the secrets in SecretMounts to the respective Elasticsearch namespace where they should be mounted to.
func reconcileSecretMounts(ctx context.Context, c k8s.Client, es esv1.Elasticsearch, policy *policyv1alpha1.StackConfigPolicy, meta metadata.Metadata) error {
Collaborator

Unused code?

// reconcileSecretMountsFromPolicies creates secrets from all policies' SecretMounts
func (r *ReconcileStackConfigPolicy) reconcileSecretMountsFromPolicies(ctx context.Context, es esv1.Elasticsearch, policies []policyv1alpha1.StackConfigPolicy, meta metadata.Metadata) error {
	// Collect all unique secret mounts from all policies
	secretMountMap := make(map[string]policyv1alpha1.SecretMount) // key is secretName to avoid duplicates
Collaborator

Weights are ignored if I am not mistaken?

}
}

for _, secretMount := range secretMountMap {
Collaborator

We just iterated through all policies to find the mounts, so why do we now have to search through them again to find the policy? Why not keep the source policy around in the map of mount points to begin with?
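
Something along these lines would keep the originating policy next to each mount and avoid the second pass (a sketch only; field names are taken from the PR as I read it):

// secretMountWithSource pairs a secret mount with the policy that declared it,
// so the reconciliation loop does not need to search all policies a second time.
type secretMountWithSource struct {
	Mount  policyv1alpha1.SecretMount
	Policy policyv1alpha1.StackConfigPolicy
}

// collectSecretMounts de-duplicates mounts by secret name while remembering
// which policy each one came from.
func collectSecretMounts(policies []policyv1alpha1.StackConfigPolicy) map[string]secretMountWithSource {
	out := make(map[string]secretMountWithSource)
	for _, p := range policies {
		for _, m := range p.Spec.Elasticsearch.SecretMounts {
			out[m.SecretName] = secretMountWithSource{Mount: m, Policy: p}
		}
	}
	return out
}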

	continue
case 1:
	// Single policy - use the original approach for backward compatibility
	expectedSecret, expectedVersion, err = filesettings.NewSettingsSecretWithVersion(esNsn, &actualSettingsSecret, &allPolicies[0], meta)
Collaborator

I am not sure I see the argument for keeping the old approach around. Can you explain why we need to be backwards compatible here?

My expectation would be that we migrate existing secrets to the new scheme, so we might need some code that updates/removes soft owner references and applies the new tracking we need for garbage collection instead.

@pkoutsovasilis pkoutsovasilis self-assigned this Nov 10, 2025
@pkoutsovasilis
Contributor

@pebrc I share the same concurrency concerns as you. I will try to explain my point of view and suggest an alternative solution.

Why the Concurrency Issues Exist

This PR's behavior: Consider we have StackConfigPolicies A, B, C, and D all targeting the same Elasticsearch cluster. When StackConfigPolicy A reconciles, it lists ALL four policies, merges them all, and writes the shared file-settings secret. Simultaneously, when policies B, C, and D each reconcile (triggered by their own watches or events), they each independently list all four policies, merge them, and all try to update the same file-settings secret.

Result: Multiple goroutines (one per StackConfigPolicy) may end up writing to the same secret, relying only on Kubernetes' optimistic concurrency control to prevent corruption. This causes unnecessary reconciliation loops (conflict → requeue → retry), wasted work (each of the 4 policies doing identical merge operations), and potential configuration drift.

My Proposal

StackConfigPolicy controllers should write intent - Each policy creates its own secret in the namespace of each Elasticsearch cluster it targets (matched by resource selector). For example, if policy A targets ES clusters "prod-es" and "dev-es", it creates prod-es-scp-namespace-policy-a in the prod-es namespace and dev-es-scp-namespace-policy-a in the dev-es namespace. Each of these secrets contains just that policy's configuration for that specific cluster. These secrets are owned by their respective Elasticsearch resources (for automatic cleanup) and labeled so they can be discovered by label selectors. Each StackConfigPolicy controller uses dynamic watches to track its own secrets across all target namespaces. Additionally, it maintains the soft-owner mechanics so these are cleaned up when the respective StackConfigPolicy is deleted

Elasticsearch controller should aggregate and apply - The ES controller watches for policy secrets via label selectors in its own namespace. When reconciling "prod-es", it lists all policy secrets in the prod-es namespace (finds secrets from policies A, B, C, and D), sorts them by weight, merges the configurations (using proper CanonicalConfig to handle nested/flat syntax as you noted), and writes the final file-settings secret for that cluster. The ES controller is the single writer to the final configuration.

I believe such an approach:

  • Eliminates concurrent writes (policy A writes its own secrets, policy B writes its own secrets, etc. - no shared writes; only ES controller writes final merged secret)
  • Solves the garbage collection issue
  • Provides a single merge point for proper configuration handling
  • Scales better (no N² policy listing - each policy only manages its own secrets)
  • Makes debugging easier (can see policy A's intent for prod-es in its secret, policy B's intent in its secret, etc.)
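
To make the "intent secret" idea concrete, here is a rough sketch of what such a per-policy, per-cluster secret could look like (names, label keys and the serialization are purely illustrative):

import (
	"encoding/json"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// policyIntentSecret renders one policy's configuration for one Elasticsearch
// cluster as a Secret in that cluster's namespace, labeled so that the
// aggregating reconciliation can list all intents by label selector and merge
// them by weight. An owner reference to the Elasticsearch resource would be
// added for automatic cleanup when the cluster is deleted.
func policyIntentSecret(esName, esNamespace, policyNamespace, policyName string, weight int32, settings map[string]interface{}) (corev1.Secret, error) {
	data, err := json.Marshal(settings)
	if err != nil {
		return corev1.Secret{}, err
	}
	return corev1.Secret{
		ObjectMeta: metav1.ObjectMeta{
			Name:      fmt.Sprintf("%s-scp-%s-%s", esName, policyNamespace, policyName),
			Namespace: esNamespace,
			Labels: map[string]string{
				// hypothetical label keys used for discovery
				"eck.k8s.elastic.co/scp-target-es": esName,
				"eck.k8s.elastic.co/scp-policy":    policyNamespace + "-" + policyName,
			},
			Annotations: map[string]string{
				"eck.k8s.elastic.co/scp-weight": fmt.Sprint(weight),
			},
		},
		Data: map[string][]byte{"settings.json": data},
	}, nil
}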

Trade-offs

This is admittedly a more radical architectural change:

  1. Breaks the 1:1 assumption - The current codebase assumes one StackConfigPolicy maps to one file-settings secret per cluster. This changes it to N policies → N policy secrets → 1 merged file-settings secret per cluster (4 policies targeting 1 cluster → 4 policy secrets + 1 file-settings secret in that cluster's namespace).

  2. Elasticsearch controller complexity - The ES controller needs to gain merging logic that currently lives in the StackConfigPolicy controller. This adds reconciliation complexity to the ES controller.

  3. More secrets - Creates N × M secrets where N is the number of policies and M is the total number of ES clusters they target, plus M file-settings secrets. For example, if 10 policies each target 5 different clusters = 50 policy secrets + 5 file-settings secrets = 55 total instead of just 5. However, Kubernetes handles thousands of secrets without issue, so I believe this shouldn't be a scalability concern in practice.

I'm still figuring out the necessary code changes as I'm not fully familiar with the codebase yet, and making the ES controller be the writer of file-settings can be tricky. That said, I believe these trade-offs are worth it to eliminate the race conditions and provide proper multi-policy support. As always, I'm more than happy to hear your thoughts on this.

@barkbay
Contributor

barkbay commented Nov 12, 2025

I didn't have a look at the PR 🙃 , just a drive by comment:

This causes unnecessary reconciliation loops (conflict → requeue → retry), wasted work (each of the 4 policies doing identical merge operations),

I think this is pretty common with controllers: as soon as we have a lot of watches (for example here) this is very likely going to trigger "no-op" reconciliations. Fortunately, since most operations interact with the informer caches, which are converging toward the actual cluster state, running these reconciliation loops should have minimal impact.

potential configuration drift.

Can you elaborate on that?

However, Kubernetes handles thousands of secrets without issue, so I believe this shouldn't be a scalability concern in practice.

In my experience, the size and memory usage on both the API server and client are more important than the absolute number of Secrets. I think we already have a fair number of Secrets in ECK-managed namespaces. I feel like it would be useful to evaluate whether the added complexity and these new Secrets are really necessary.

@pkoutsovasilis
Contributor

Hey @barkbay 👋 , thanks for the feedback

I think this is pretty common with controllers, as soon as we have a lot watches (for example here) this is very likely going to trigger "no-op" reconciliations. Fortunately, since most operations interact with the informers caches, which are converging toward the actual cluster state, running these reconciliation loops should have minimal impact.

I see your point and I may be overstating this concern - Kubernetes' eventual consistency model means this kind of transient inconsistency is expected and will converge. However, the fact that we are listing all StackConfigPolicies in the reconcile loop of one, then merging them and then applying them is prone to race conditions, since a StackConfigPolicy can change after the list happens in a parallel reconcile loop. The bit that I'm not a super fan of is that although k8s optimistic concurrency control is there to prevent such race scenarios, we deliberately choose to rely on it. In the alternative proposal, everything flows in a single direction (from StackConfigPolicy to Elasticsearch) without requiring optimistic concurrency control to guarantee correctness (please don't get me wrong, it's still a good thing that it's there though 😆).

Can you elaborate on that?

I meant this more in the sense that since the re-queue time in such scenarios isn't something we control, there could be an uncontrolled time period where the configuration isn't what it's supposed to be, and another edit/creation/deletion of a StackConfigPolicy can easily make it longer.

In my experience, the size and memory usage on both the API server and client are more important than the absolute number of Secrets. I think we already have a fair number of Secrets in ECK-managed namespaces. I feel like it would be useful to evaluate whether the added complexity and these new Secrets are really necessary.

Makes sense. I am still exploring the codebase, and thus my previous comment was more in the context of flow than actual code changes. But if you have any ideas on what such an evaluation should look like, please do let me know 🙂


Now, assuming that k8s optimistic concurrency control is good enough to address the concurrency concerns, we still have to work out the multi-StackConfigPolicy soft-ownership of the merged secrets. When all StackConfigPolicies are removed (or have changed resource selectors), the respective secrets in the namespace of elasticsearch should be removed, otherwise we leak. This is again a racy scenario in my head that the alternative approach is not prone to. So @barkbay, any ideas on how this can be handled without disappearing and re-appearing secrets?

@barkbay
Contributor

barkbay commented Nov 12, 2025

Sorry, it's been a long time since I looked at this code, still trying to (re)process it.
Would a cleaner alternative be to use "subcontrollers" for Elasticsearch and Kibana?

For example:

Create a dedicated controller to process ES clusters in parallel

In a new file named pkg/controller/stackconfigpolicy/elasticsearch_controller.go, something along the lines of

var _ reconcile.Reconciler = &ReconcileStackConfigPolicyForElasticsearch{}

// ReconcileStackConfigPolicyForElasticsearch  reconciles StackConfigPolicies objects that apply to Elasticsearch resources.
type ReconcileStackConfigPolicyForElasticsearch struct {
	*ReconcileStackConfigPolicy
}

func (r *ReconcileStackConfigPolicyForElasticsearch) Reconcile(ctx context.Context, request reconcile.Request) (reconcile.Result, error) {
	defer tracing.Span(&ctx)()
	log := ulog.FromContext(ctx)

	log.V(1).Info("Reconcile StackConfigPolicy for Elasticsearch", "es_namespace", request.Namespace, "es_name", request.Name)
...
	// Find all policies that target this Elasticsearch cluster
	allPolicies, err := r.findPoliciesForElasticsearch(ctx, es)
	if err != nil {
		return reconcile.Result{}, err
	}
...
}

In the existing controller, detect which ES clusters must be reconciled (a bit as it is done today)

func Add(mgr manager.Manager, params operator.Parameters) error {

	esCtrlChan := make(chan event.GenericEvent)

	r := newReconciler(mgr, params, esCtrlChan, kbCtrlChan)
	c, err := common.NewController(mgr, controllerName, r, params)
	if err != nil {
		return err
	}

	if err := controllerruntime.NewControllerManagedBy(mgr).
		Named("stackconfigpolicy-es-controller").
		WatchesRawSource(
			source.Channel(esCtrlChan, &handler.EnqueueRequestForObject{}),
		).
		Complete(&ReconcileStackConfigPolicyForElasticsearch{ReconcileStackConfigPolicy: r}); err != nil {
		return err
	}

	return addWatches(mgr, c, r)
}

...

func (r *ReconcileStackConfigPolicy) reconcileElasticsearchResources(ctx context.Context, policy policyv1alpha1.StackConfigPolicy, status policyv1alpha1.StackConfigPolicyStatus) (*reconciler.Results, policyv1alpha1.StackConfigPolicyStatus) {
	defer tracing.Span(&ctx)()
	log := ulog.FromContext(ctx)
	log.V(1).Info("Reconcile Elasticsearch resources")

	results := reconciler.NewResult(ctx)

	// prepare the selector to find Elastic resources to configure
	selector, err := metav1.LabelSelectorAsSelector(&metav1.LabelSelector{
		MatchLabels:      policy.Spec.ResourceSelector.MatchLabels,
		MatchExpressions: policy.Spec.ResourceSelector.MatchExpressions,
	})
	if err != nil {
		return results.WithError(err), status
	}
	listOpts := client.ListOptions{LabelSelector: selector}

	// restrict the search to the policy namespace if it is different from the operator namespace
	if policy.Namespace != r.params.OperatorNamespace {
		listOpts.Namespace = policy.Namespace
	}

	// find the list of Elasticsearch to configure
	var esList esv1.ElasticsearchList
	if err := r.Client.List(ctx, &esList, &listOpts); err != nil {
		return results.WithError(err), status
	}

	for _, es := range esList.Items {
		r.esCtrlChan <- event.GenericEvent{
			Object: &esv1.Elasticsearch{
				ObjectMeta: metav1.ObjectMeta{
					Name:      es.Name,
					Namespace: es.Namespace,
				},
			},
		}
	}

I'm not a GenericEvent expert though, so maybe I'm missing something.

// Weight determines the priority of this policy when multiple policies target the same resource.
// Lower weight values take precedence. Defaults to 0.
// +kubebuilder:default=0
Weight int32 `json:"weight,omitempty"`
Contributor

Cosmetic nit, I found it useful to have the weight in a column, feel free to ignore if you think otherwise:

diff --git a/pkg/apis/stackconfigpolicy/v1alpha1/stackconfigpolicy_types.go b/pkg/apis/stackconfigpolicy/v1alpha1/stackconfigpolicy_types.go
index e7d17e200..ec0de8d60 100644
--- a/pkg/apis/stackconfigpolicy/v1alpha1/stackconfigpolicy_types.go
+++ b/pkg/apis/stackconfigpolicy/v1alpha1/stackconfigpolicy_types.go
@@ -34,6 +34,7 @@ func init() {
 // +kubebuilder:printcolumn:name="Ready",type="string",JSONPath=".status.readyCount",description="Resources configured"
 // +kubebuilder:printcolumn:name="Phase",type="string",JSONPath=".status.phase"
 // +kubebuilder:printcolumn:name="Age",type="date",JSONPath=".metadata.creationTimestamp"
+// +kubebuilder:printcolumn:name="Weight",type="integer",JSONPath=".spec.weight"
 // +kubebuilder:subresource:status
 // +kubebuilder:storageversion
 type StackConfigPolicy struct {

Result:

 k get stackconfigpolicies.stackconfigpolicy.k8s.elastic.co
NAME               READY   PHASE   AGE   WEIGHT
cluster-policy-1   1/1     Ready   20h   10
cluster-policy-2   1/1     Ready   20h   10
kibana-policy-1    1/1     Ready   20h   1
kibana-policy-2    1/1     Ready   20h   0

@pkoutsovasilis
Contributor

Following our team meeting, we've decided to proceed with the following implementation plan:

  1. Subcontroller Pattern (proposed by @barkbay above)

    • Create dedicated reconcilers (ReconcileStackConfigPolicyForElasticsearch and ReconcileStackConfigPolicyForKibana) that process target resources
    • The StackConfigPolicy controller will emit GenericEvent messages via channels for the affected Elasticsearch and Kibana resources to the reconcilers above
    • Each of the new reconcilers will be responsible for merging all the stackconfigpolicies based on their weight, updating the secrets in each resource namespace and reflecting back the status in the policy.
  2. Multi-Soft Owner Secret Cleanup

    • Implement multi-soft owner labels on shared secrets to track all policies contributing to a resource
    • This enables proper garbage collection when policies are deleted or change selectors

We chose this approach primarily to maintain the same number of secrets as the current implementation. Additionally, this design keeps all changes scoped to the stackconfigpolicy package without requiring modifications to the Elasticsearch or Kibana controllers.
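
For item 2, a rough sketch (with a hypothetical annotation key) of how multiple soft owners could be tracked on a shared secret, so that it can be garbage collected once no policy references it anymore:

import (
	"sort"
	"strings"

	corev1 "k8s.io/api/core/v1"
)

// Hypothetical annotation listing every StackConfigPolicy (namespace/name) that
// currently contributes to this shared secret.
const softOwnersAnnotation = "stackconfigpolicy.k8s.elastic.co/soft-owners"

// setSoftOwners records the policies that matched the target resource in this
// reconciliation; an empty list means the secret is no longer owned and can be deleted.
func setSoftOwners(secret *corev1.Secret, policyRefs []string) {
	if secret.Annotations == nil {
		secret.Annotations = map[string]string{}
	}
	sort.Strings(policyRefs)
	secret.Annotations[softOwnersAnnotation] = strings.Join(policyRefs, ",")
}

// softOwners returns the recorded owners so the cleanup logic can decide
// whether the secret is still referenced by any policy.
func softOwners(secret *corev1.Secret) []string {
	v := secret.Annotations[softOwnersAnnotation]
	if v == "" {
		return nil
	}
	return strings.Split(v, ",")
}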

@pkoutsovasilis
Contributor

Hey @tehbooom 👋 Thank you so much for opening this PR and taking the time to contribute! Your work here was really helpful and provided a great foundation for addressing this issue.

I've created a new PR (#8917) that builds on your work and includes some additional changes we needed to make. I'd love to invite you to review it if you're interested!

I'm going to close this PR since it's been superseded by #8917, but I really appreciate your contribution to the project. Thanks again for taking the initiative!
