|
| 1 | +## Why |
| 2 | +Currently there is no way to instruct the Kuadrant Operator to publish or unpublish a DNS Record when certain conditions are met. As a result, conditionally removing a DNS Record requires manual intervention and an understanding of the internal workings of the DNS Operator. |
| 3 | + |
| 4 | +### Some Example Use Cases |
| 5 | +- DNS Failover (rapidly switching to an alternative DNS configuration on a secondary site) |
| 6 | +- Workload migration (removing workload from one cluster in favour of a new cluster) |
| 7 | + |
| 8 | +## What |
| 9 | +Add an optional `unpublishRules` field to the DNS Policy CRD. It allows an administrator to define one or more rules which, when satisfied, instruct the Kuadrant Operator to unpublish the records from the zone. The result is reported in a condition in the status of the policy. |
| 10 | + |
| 11 | +## How |
| 12 | + |
| 13 | +### Dependencies |
| 14 | + |
| 15 | +The [soft_delete](https://github.com/Kuadrant/dns-operator/issues/356) feature is required. |
| 16 | + |
| 17 | +### Diagrams |
| 18 | + |
| 19 | +#### Unpublish diagram |
| 20 | + |
| 21 | + |
| 22 | +#### Health check changes diagram |
| 23 | +The health check probes will also change. The diagram below shows how they will function once the work is complete. |
| 24 | + |
| 25 | + |
| 26 | + |
| 27 | +https://miro.com/app/board/uXjVL32kOMY=/ |
| 28 | + |
| 29 | +### CRD Updates |
| 30 | +#### DNS Record |
| 31 | + |
| 32 | +The DNS Record will no longer have a health check section in the spec. |
| 33 | + |
| 34 | +It will also have a new `publish` field in the spec. If set to false, the DNS Operator will attempt to lazily (i.e. not greedily) remove the DNS Records from the zone without interfering with other controllers' records and without risking an NXDOMAIN response. If set to true, it will attempt to publish the endpoints to the zone as usual. |
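As a rough illustration, a DNS Record carrying the proposed field might look like the sketch below. Only `publish` comes from this proposal. The `apiVersion`, resource name, host name, and endpoint values are placeholders assumed for the example, other required fields are omitted, and the final CRD shape may differ.

```yaml
# Hypothetical DNSRecord; only `publish` is introduced by this proposal.
apiVersion: kuadrant.io/v1alpha1   # assumed API version
kind: DNSRecord
metadata:
  name: example-record             # placeholder name
spec:
  rootHost: app.example.com        # placeholder host
  publish: false                   # proposed field: ask the DNS Operator to lazily unpublish
  endpoints:
    - dnsName: app.example.com     # placeholder endpoint
      recordType: A
      recordTTL: 60
      targets:
        - 192.0.2.10               # documentation-range address
```

Leaving `publish` unset or setting it to `true` would keep today's behaviour of publishing the endpoints as usual; whether the field defaults to `true` is an assumption here.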
| 35 | + |
| 36 | +#### DNS Policy |
| 37 | +The DNS Policy will have a new field added to its spec: |
| 38 | + |
| 39 | +```yaml |
| 40 | +unpublishRules: |
| 41 | + <rule implemented using Common Expression Language> |
| 42 | + <rule implemented using Common Expression Language> |
| 43 | + <rule implemented using Common Expression Language> |
| 44 | +``` |
| 45 | +### Kuadrant Operator and DNS Operator changes |
| 46 | +
|
| 47 | +### Health check probe creation |
| 48 | +The health checks are currently created by the DNS Operator. This will instead move to the Kuadrant Operator, and the health checks will be owned by the DNS Policy that defined them. They will no longer have a `failureThreshold`, as this can be defined in an unpublish rule, and the health check section will no longer be copied to the DNS Record. The probes will continue to be reconciled by the DNS Operator. |
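A probe created this way might look roughly like the sketch below. The `apiVersion` values, all names, the UID, and the spec field names are assumptions made for illustration and are not confirmed by this proposal. The points it is meant to show are the owner reference to the DNS Policy and the absence of `failureThreshold`.

```yaml
# Hypothetical DNSHealthCheckProbe as the Kuadrant Operator might create it.
apiVersion: kuadrant.io/v1alpha1   # assumed API version
kind: DNSHealthCheckProbe
metadata:
  name: example-probe              # placeholder name
  ownerReferences:
    - apiVersion: kuadrant.io/v1   # assumed API version of DNSPolicy
      kind: DNSPolicy              # owned by the policy that defined the check
      name: example-policy         # placeholder
      uid: 00000000-0000-0000-0000-000000000000  # placeholder
spec:
  # Field names below are assumed for the sketch.
  hostname: app.example.com
  address: 192.0.2.10              # documentation-range address
  port: 443
  path: /healthz
  protocol: HTTPS
  interval: 60s
  # No failureThreshold: the threshold would be expressed in an unpublish rule.
```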
| 49 | +
|
| 50 | +### Unpublish process - Kuadrant operator |
| 51 | +
|
| 52 | +The Kuadrant Operator will have the option to set the `publish` field in the spec of the DNS Record to false. When it is set to false, the DNS Operator will attempt to remove that endpoint from the DNS Provider, unless doing so would result in an NXDOMAIN response from the provider. The DNS Operator will also add information to the status of the DNS Record regarding the result (e.g. unpublished, preserved to avoid NXDOMAIN, or preserved to avoid removing an entire GEO). |
| 53 | +
|
| 54 | +When any unpublish rule evaluates to true, the Kuadrant Operator will set `publish` to false on the relevant DNS Records and add a condition to the status of the DNS Policy stating that the records were unpublished due to a matched rule. If the DNS Record status reports the result of the unpublish attempt, this will also be propagated to the DNS Policy status. |
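The resulting DNS Policy status could be sketched as follows. The condition `type`, `reason`, and message wording are hypothetical, since the proposal does not name them; only the standard Kubernetes condition fields and the idea of surfacing the DNS Record result are taken as given.

```yaml
# Hypothetical DNSPolicy status after an unpublish rule matched.
status:
  conditions:
    - type: Unpublished              # hypothetical condition type
      status: "True"
      reason: UnpublishRuleMatched   # hypothetical reason
      # Message combines the matched rule with the result reported by the DNS Record.
      message: "unpublish rule matched; DNS Record result: preserved to avoid NXDOMAIN"
      lastTransitionTime: "2025-01-01T00:00:00Z"  # placeholder timestamp
```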
| 55 | +
|
| 56 | +### Unpublish process - DNS Operator |
| 57 | +
|
| 58 | +When a DNS Record is reconciled with the publish field set to false, the DNS Operator will apply the soft_delete label to the leaf records of that DNS Record (i.e. expect them to be removed if possible). |
| 59 | +
|
| 60 | +The soft_delete feature ([ticket here](https://github.com/Kuadrant/dns-operator/issues/356)) will compute the minimum set of records that must be deleted so that the labelled records are cleanly removed without leaving any dead ends (e.g. a weighted hostname whose targets don't have records) and without risking an NXDOMAIN response.
| 61 | +
|
| 62 | +### Available CEL Fields |
| 63 | +
|
| 64 | +To begin with, a few basic fields will be made available, although this may expand in the future.
| 65 | +- records: The set of related DNS Record CRs for this DNS Policy |
| 66 | +- healthchecks: The set of related DNSHealthCheckProbe CRs for this DNS Policy |
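Using these two fields, a policy with unpublish rules might be written as in the sketch below. The list-of-strings shape of `unpublishRules`, the `apiVersion`, the resource names, and the `status.healthy` and `status.consecutiveFailures` paths on the probe are all assumptions for illustration. The `all`, `exists`, and `size` calls are standard CEL.

```yaml
# Hypothetical DNSPolicy using the proposed `unpublishRules` field.
apiVersion: kuadrant.io/v1           # assumed API version
kind: DNSPolicy
metadata:
  name: example-policy               # placeholder name
spec:
  targetRef:
    group: gateway.networking.k8s.io
    kind: Gateway
    name: example-gateway            # placeholder gateway
  unpublishRules:
    # Unpublish when there is at least one probe and every probe reports unhealthy.
    - "healthchecks.size() > 0 && healthchecks.all(h, h.status.healthy == false)"
    # Unpublish when any probe has failed three or more times in a row;
    # this plays the role the removed failureThreshold used to play.
    - "healthchecks.exists(h, h.status.consecutiveFailures >= 3)"
```

Because any single rule evaluating to true triggers the unpublish, the rules in the list behave as alternatives rather than as a conjunction.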
| 67 | +
|
| 68 | +## Use cases expanded |
| 69 | +### DNS Failover |
| 70 | +To enact DNS Failover with this config, the rule for publishing could be set to "when all other records are marked as unhealthy". |
| 71 | +
|
| 72 | +#### Example |
| 73 | +Cluster 1 has no publishing rule (i.e. always) |
| 74 | +Cluster 2 publishing rule is: "unhealthy_record_count >= unowned_record_count || unowned_record_count == 0" |
| 75 | +
|
| 76 | +- Cluster 1 is currently published and healthy and cluster 2 has no published records. |
| 77 | +- An event occurs that causes the workload to begin malfunctioning on cluster 1. |
| 78 | +- All the records for cluster 1 are marked as unhealthy in the registry (but not removed as they are the only records available) |
| 79 | +- cluster 2 reconciles and sees that all the records currently in the zone are unhealthy; as this satisfies its publishing rule, it publishes its records
| 80 | +- cluster 1 reconciles and sees there are records other than its own, and so unpublishes its own records because they are unhealthy
| 81 | +- eventually cluster 1 is healthy again and publishes its records
| 82 | +- cluster 2 sees that its rule is no longer satisfied and so unpublishes its own records.
| 83 | +
|
| 84 | +### Workload migration |
| 85 | +Cluster 1 has a workload that needs to be migrated to cluster 2. |
| 86 | +- publishing rule on cluster 1 is set to: "unowned_record_count == 0"
| 87 | +- workload is created on cluster 2 |
| 88 | +- records created by cluster 2 |
| 89 | +- cluster 1 sees other records exist and unpublishes its records from the zone
| 90 | +- admin sees from the status on the DNS Policy in cluster 1 that all records were removed from the zone more than one TTL ago
| 91 | +- admin can safely remove the workload from cluster 1. |
| 92 | +
|
| 93 | +### Extra clusters during high load |
| 94 | +This would require exposing metrics to the CEL rules, which is not currently planned, but the example shows how such a rule might look.
| 95 | +
|
| 96 | +- Cluster 1 has the workload and publishes always |
| 97 | +- Cluster 2 has the workload and has a publishing rule: "requests_per_minute >= n" |