|
| 1 | +## Why |
| 2 | +Currently there is no way to instruct the DNS Operator to publish or unpublish a DNS Record when certain conditions are met. As a result, gracefully removing a DNS Record requires manual intervention and an understanding of the internal workings of the DNS Operator. |
| 3 | + |
| 4 | +### Some Example Use Cases |
| 5 | +- DNS Failover to a secondary site (rapidly switch to an alternative DNS configuration) |
| 6 | +- Workload migration (removing workload from one cluster in favour of a new cluster) |
| 7 | +- Extra clusters during periods of high load |
| 8 | + |
| 9 | +## What |
| 10 | +Add an optional publishStrategy to the DNS Policy CRD, which allows an administrator to define rules that, when met, instruct the DNS Operator to publish or unpublish the records from the zone and set a condition in the status. |
| 11 | + |
| 12 | +## How |
| 13 | + |
| 14 | +### Diagram |
| 15 | + |
| 16 | + |
| 17 | +https://miro.com/app/board/uXjVL32kOMY=/ |
| 18 | + |
| 19 | +### Kuadrant operator changes |
| 20 | +The DNS Policy and DNS Record CRDs will have a new field added to their spec: |
| 21 | + |
| 22 | +```yaml |
| 23 | +publishStrategy: |
| 24 | + rule: <implemented using Common Expression Language> |
| 25 | +``` |
| 26 | +
|
| 27 | +This is read by the kuadrant-operator and propagated into any relevant DNS Records. |
| 28 | +
|
| 29 | +When the DNS Operator acts on these instructions, it will set a condition in the DNS Record. This condition will be propagated back into the relevant DNS Policy. |
| 30 | +
|
| 31 | +### DNS Operator Changes |
| 32 | +The DNS Operator will read the publishStrategy from the DNS Record on reconcile. Based on its values, it will then interrogate the zone to see whether the publish rule is met. If so, it will publish the records; if not, it will ensure the records are unpublished. In both cases it will update the condition in the DNS Record status to reflect the decision. |
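The reconcile decision described above can be sketched in a few lines. This is a minimal illustration in plain Python, not operator code: the rule is modelled as a Python callable standing in for a compiled CEL expression, and the condition type and reason names (`Published`, `PublishRuleMet`, `PublishRuleNotMet`) are hypothetical, since the proposal does not name them.

```python
from typing import Callable, Mapping, Optional

# Hypothetical stand-in for a compiled CEL rule: maps the rule fields to a bool.
Rule = Callable[[Mapping[str, int]], bool]


def decide(rule: Optional[Rule], fields: Mapping[str, int]) -> dict:
    """Return the action to take and the condition to record in the status."""
    # No publishStrategy means the records are always published.
    should_publish = True if rule is None else bool(rule(fields))
    return {
        "action": "publish" if should_publish else "unpublish",
        # Condition type and reason are illustrative names, not part of the proposal.
        "condition": {
            "type": "Published",
            "status": "True" if should_publish else "False",
            "reason": "PublishRuleMet" if should_publish else "PublishRuleNotMet",
        },
    }


# Example: a rule that only publishes when no other owner has records in the zone,
# evaluated while another cluster's record exists.
only_owner_rule: Rule = lambda f: f["unowned_record_count"] == 0
result = decide(only_owner_rule, {"unowned_record_count": 1, "unhealthy_record_count": 0})
print(result["action"])  # unpublish
```

Returning the condition alongside the action keeps the status update tied to the same evaluation that drove the publish or unpublish step.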
| 33 | +
|
| 34 | +### Available CEL Fields |
| 35 | +
|
| 36 | +To begin with, a few basic fields will be made available. This set could be expanded in the future. |
| 37 | +- records: The set of related records in the zone |
| 38 | +- unowned_record_count: The number of related records in the zone with an owner ID that is not the local owner ID |
| 39 | +- unhealthy_record_count: The number of related records flagged as unhealthy |
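As a rough illustration of how these fields might be derived from the records in the zone, here is a small Python sketch. The `Record` shape and the `rule_fields` helper are assumptions made for this example only. It also assumes `unhealthy_record_count` counts every related record flagged unhealthy regardless of owner, which the definition above suggests but does not state outright.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Record:
    """Hypothetical shape of a related record found in the zone."""
    owner_id: str
    healthy: bool = True


def rule_fields(records: List[Record], local_owner_id: str) -> dict:
    """Build the variables that a publishing rule could reference."""
    return {
        "records": records,
        # Records whose owner ID differs from the local owner ID.
        "unowned_record_count": sum(r.owner_id != local_owner_id for r in records),
        # Assumption: counts every related record flagged unhealthy, whoever owns it.
        "unhealthy_record_count": sum(not r.healthy for r in records),
    }


zone = [Record("cluster-1", healthy=False), Record("cluster-2")]
fields = rule_fields(zone, local_owner_id="cluster-2")
print(fields["unowned_record_count"], fields["unhealthy_record_count"])  # 1 1
```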
| 40 | +
|
| 41 | +## Use cases expanded |
| 42 | +### DNS Failover |
| 43 | +To enact DNS Failover with this config, the rule for publishing could be set to "when all other records are marked as unhealthy". |
| 44 | +
|
| 45 | +#### Example |
| 46 | +Cluster 1 has no publishing rule (i.e. always) |
| 47 | +Cluster 2 publishing rule is: "unhealthy_record_count >= unowned_record_count || unowned_record_count == 0" |
| 48 | +
|
| 49 | +- Cluster 1 is currently published and healthy and cluster 2 has no published records. |
| 50 | +- An event occurs that causes the workload to begin malfunctioning on cluster 1. |
| 51 | +- All the records for cluster 1 are marked as unhealthy in the registry (but not removed as they are the only records available) |
| 52 | +- cluster 2 reconciles and sees that all the records currently in the zone are unhealthy; as this satisfies its publishing rule, it publishes its records |
| 53 | +- cluster 1 reconciles and sees there are records other than its own, and so unpublishes its own records because they are unhealthy |
| 54 | +- eventually cluster 1 is healthy again and publishes its records |
| 55 | +- cluster 2 sees that its rule is no longer satisfied and so unpublishes its own records. |
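The sequence above can be checked by evaluating cluster 2's rule at each stage. The sketch below is a plain-Python equivalent of the CEL expression, assuming for simplicity that each cluster owns a single record and that the counts are taken from cluster 2's point of view.

```python
def failover_rule(unhealthy_record_count: int, unowned_record_count: int) -> bool:
    """Plain-Python equivalent of cluster 2's publishing rule."""
    return unhealthy_record_count >= unowned_record_count or unowned_record_count == 0


# Each entry: (description, unhealthy_record_count, unowned_record_count),
# counted from cluster 2's point of view.
steps = [
    ("cluster 1 published and healthy", 0, 1),
    ("cluster 1 records marked unhealthy", 1, 1),
    ("cluster 1 records unpublished", 0, 0),
    ("cluster 1 healthy and published again", 0, 1),
]

decisions = [failover_rule(unhealthy, unowned) for _, unhealthy, unowned in steps]
for (description, _, _), publish in zip(steps, decisions):
    print(f"{description}: {'publish' if publish else 'unpublish'}")
```

The third step shows why the rule needs the `unowned_record_count == 0` clause: once cluster 1 has unpublished, it is what keeps cluster 2 published independently of the health counts.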
| 56 | +
|
| 57 | +### Workload migration |
| 58 | +Cluster 1 has a workload that needs to be migrated to cluster 2. |
| 59 | +- publishing rule on cluster 1 is set to: "unowned_record_count == 0" |
| 60 | +- workload is created on cluster 2 |
| 61 | +- records created by cluster 2 |
| 62 | +- cluster 1 sees other records exist and unpublishes its records from the zone |
| 63 | +- admin sees that the status update on the DNS Policy in cluster 1 (all records removed from zone) happened more than one TTL ago |
| 64 | +- admin can safely remove the workload from cluster 1. |
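The waiting step in this flow amounts to a simple time comparison. A minimal sketch, assuming the admin can read the time at which the records were unpublished from the DNS Policy status and knows the record TTL. The function name and the 300-second TTL are illustrative, not values from the proposal.

```python
from datetime import datetime, timedelta, timezone


def safe_to_remove(unpublished_at: datetime, ttl_seconds: int, now: datetime) -> bool:
    """True once more than one TTL has elapsed since the records were unpublished."""
    return now - unpublished_at > timedelta(seconds=ttl_seconds)


unpublished_at = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
ttl_seconds = 300  # illustrative TTL, not a value from the proposal

print(safe_to_remove(unpublished_at, ttl_seconds, unpublished_at + timedelta(seconds=120)))  # False
print(safe_to_remove(unpublished_at, ttl_seconds, unpublished_at + timedelta(seconds=301)))  # True
```

Waiting at least one TTL gives resolvers that cached the old records time to expire them before the workload disappears.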
| 65 | +
|
| 66 | +### Extra clusters during high load |
| 67 | +This would require exposing metrics to the CEL rules, which is not currently planned, but the example below shows how such a rule might look. |
| 68 | +
|
| 69 | +- Cluster 1 has the workload and publishes always |
| 70 | +- Cluster 2 has the workload and has a publishing rule: "requests_per_minute >= n" |