Commit 549c726

First draft of DNS Publishing RFC
Signed-off-by: Phil Brookes <[email protected]>
1 parent 3c07ad8 commit 549c726

3 files changed

+97
-0
lines changed

@@ -0,0 +1,97 @@
## Why
Currently there is no way to instruct the Kuadrant Operator to publish or unpublish a DNS Record when certain conditions are met. As a result, conditionally removing a DNS Record requires manual intervention and an understanding of the internal workings of the DNS Operator.

### Some Example Use Cases
- DNS failover to a secondary site (rapidly switching to an alternative DNS configuration)
- Workload migration (removing a workload from one cluster in favour of a new cluster)

## What
Add optional `unpublishRules` to the DNS Policy CRD, allowing an administrator to define one or more rules that, when satisfied, instruct the Kuadrant Operator to unpublish the records from the zone. The result is reported in a condition in the status of the policy.

## How

### Dependencies

The [soft_delete](https://github.com/Kuadrant/dns-operator/issues/356) feature is required.

### Diagrams

#### Unpublish diagram
![Diagram of DNS Publish strategy](./0013-dns-unpublishing-strategy-assets/unpublish-diagram.png)

#### Health check changes diagram
There are also some changes to the health check probes. The diagram below shows how they will function once the changes are complete.

![Health check updates](./0013-dns-unpublishing-strategy-assets/health-check-changes.png)

https://miro.com/app/board/uXjVL32kOMY=/

### CRD Updates
#### DNS Record

The DNS Record will no longer have a health check section in the spec.

It will also have a new `publish` field in the spec. If set to false, the DNS Operator will attempt to lazily (i.e. not greedily) remove the DNS records from the zone, without interfering with other controllers' records and without risking an NXDOMAIN response. If set to true, it will attempt to publish the endpoints to the zone as usual.

#### DNS Policy
The DNS Policy will have a new field added to its spec:

```yaml
unpublishRules:
<rule implemented using Common Expression Language>
<rule implemented using Common Expression Language>
<rule implemented using Common Expression Language>
```
### Kuadrant and DNS Operator changes

### Health check probe creation
The health checks are currently created by the DNS Operator. This will instead move to the Kuadrant Operator, and the health checks will be owned by the DNS Policy that defined them. They will no longer have a `failureThreshold`, as this can be defined in an unpublish rule, and the health check section will no longer be copied to the DNS Record. The probes will continue to be reconciled by the DNS Operator.

### Unpublish process - Kuadrant operator

The Kuadrant Operator will have the option to set the `publish` field in the spec of the DNS Record to false. When it is set to false, the DNS Operator will attempt to remove that endpoint from the DNS Provider, unless doing so would result in an NXDOMAIN response from the provider. The DNS Operator will also add information to the status of the DNS Record regarding the result (e.g. unpublished, preserved to avoid NXDOMAIN, or preserved to avoid removing an entire GEO).

When any unpublish rule evaluates to true, the Kuadrant Operator will set `publish` to false on the relevant DNS Records and add a condition to the status of the DNS Policy stating that the records were unpublished due to a matched rule. If the DNS Record status reports the result of the unpublish attempt, this will also be propagated to the DNS Policy status.
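
The flow described above can be sketched roughly as follows. This is a minimal illustration, not the operator's implementation: the `DNSRecord` and `DNSPolicy` classes, their field names, and the condition strings are all hypothetical, and plain Python callables stand in for CEL rules.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class DNSRecord:
    """Hypothetical stand-in for a DNS Record CR; only the fields used here."""
    name: str
    publish: bool = True
    status: str = ""  # result reported by the DNS Operator, if available


@dataclass
class DNSPolicy:
    """Hypothetical stand-in for a DNS Policy CR."""
    # Rule name -> predicate over the related records (stands in for a CEL rule).
    unpublish_rules: Dict[str, Callable[[List[DNSRecord]], bool]]
    conditions: List[str] = field(default_factory=list)


def reconcile(policy: DNSPolicy, records: List[DNSRecord]) -> bool:
    """Set publish=False on every related record if any rule matches.

    Returns True when at least one rule matched.
    """
    matched = [name for name, rule in policy.unpublish_rules.items() if rule(records)]
    if not matched:
        return False
    # Record why the records were unpublished.
    policy.conditions.append("Unpublished: matched rule(s) " + ", ".join(matched))
    for record in records:
        record.publish = False
        # Propagate the DNS Operator's result when the record reports one.
        if record.status:
            policy.conditions.append(f"{record.name}: {record.status}")
    return True
```

The sketch only covers the "any rule matches" path from the text; how `publish` is restored once no rule matches is left out because the text does not specify it.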

### Unpublish process - DNS Operator

When a DNS Record is reconciled with the `publish` field set to false, the DNS Operator will apply the soft_delete label to the leaf records of that DNS Record (i.e. mark them for removal where possible).

The soft_delete feature ([ticket here](https://github.com/Kuadrant/dns-operator/issues/356)) will compute the minimum set of records that must be deleted so that the labelled records are cleanly removed without leaving any dead ends (e.g. a weighted hostname whose targets don't have records) and without risking NXDOMAIN responses.
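
As a rough illustration of the "remove unless it would cause NXDOMAIN" behaviour, the sketch below uses a deliberately simplified, flat model of a zone. The `Zone` type, the `soft_delete` function, and the result strings are hypothetical. It does not model CNAME chains, weighted or GEO records, or dead-end detection, which the real feature would need to handle.

```python
from typing import Dict, List, Tuple

# Hypothetical flat zone model: hostname -> [(owner, target), ...]
Zone = Dict[str, List[Tuple[str, str]]]


def soft_delete(zone: Zone, owner: str) -> Tuple[Zone, Dict[str, str]]:
    """Remove `owner`'s entries from each hostname unless they are the last ones.

    Returns the resulting zone and a per-hostname result for the hostnames
    where `owner` had entries.
    """
    new_zone: Zone = {}
    results: Dict[str, str] = {}
    for host, entries in zone.items():
        others = [entry for entry in entries if entry[0] != owner]
        if len(others) == len(entries):
            # Nothing owned by `owner` here; leave the hostname untouched.
            new_zone[host] = list(entries)
        elif others:
            # Other owners still answer for this hostname, so removal is safe.
            new_zone[host] = others
            results[host] = "unpublished"
        else:
            # Removing these entries would leave the hostname empty.
            new_zone[host] = list(entries)
            results[host] = "preserved to avoid NXDOMAIN"
    return new_zone, results
```

For example, with two clusters sharing `app.example.com` and only one serving `solo.example.com`, unpublishing the first cluster removes its entry from the shared name but keeps the sole entry on the other.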

### Available CEL Fields

To begin with, a few basic fields will be made available, although this set could expand in the future.
- records: The set of related DNS Record CRs for this DNS Policy
- healthchecks: The set of related DNSHealthCheckProbe CRs for this DNS Policy
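
To make the idea of a rule over `healthchecks` concrete, here is a small Python predicate that plays the role a CEL expression would play. The `Probe` class and its `consecutive_failures` field are assumptions for illustration, not the actual DNSHealthCheckProbe schema. A comparable CEL rule might use the `all` macro over `healthchecks`, but the exact field names available to rules are not defined in this RFC.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Probe:
    """Hypothetical view of a DNSHealthCheckProbe; field names are assumptions."""
    address: str
    consecutive_failures: int


def all_probes_failing(healthchecks: List[Probe], threshold: int) -> bool:
    """True when at least one probe exists and every probe has failed
    `threshold` or more times in a row."""
    return bool(healthchecks) and all(
        probe.consecutive_failures >= threshold for probe in healthchecks
    )
```

Placing the threshold in the rule, rather than on the probe, matches the plan to drop `failureThreshold` from the health check definition.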

## Use cases expanded
### DNS Failover
To enact DNS failover with this config, the rule for publishing could be set to "when all other records are marked as unhealthy".

#### Example
Cluster 1 has no publishing rule (i.e. it always publishes).

Cluster 2 publishing rule is: "unhealthy_record_count >= unowned_record_count || unowned_record_count == 0"

- Cluster 1 is currently published and healthy, and cluster 2 has no published records.
- An event occurs that causes the workload on cluster 1 to begin malfunctioning.
- All the records for cluster 1 are marked as unhealthy in the registry (but not removed, as they are the only records available).
- Cluster 2 reconciles and sees that all the records currently in the zone are unhealthy. As this satisfies its publishing rule, it publishes its records.
- Cluster 1 reconciles, sees that there are records other than its own, and unpublishes its own records because they are unhealthy.
- Eventually cluster 1 is healthy again and publishes its records.
- Cluster 2 sees that its rule is no longer satisfied and unpublishes its own records.
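
The sequence above can be checked by evaluating cluster 2's rule at each step. The sketch below assumes that both counts refer to records not owned by cluster 2, and that cluster 1 contributes a single record. These interpretations are assumptions, as the RFC does not define the two variables.

```python
def cluster2_should_publish(unhealthy_record_count: int, unowned_record_count: int) -> bool:
    """Python rendering of the example rule for cluster 2."""
    return unhealthy_record_count >= unowned_record_count or unowned_record_count == 0


# (step, unhealthy records not owned by cluster 2, records not owned by cluster 2)
steps = [
    ("cluster 1 published and healthy", 0, 1),
    ("cluster 1 marked unhealthy", 1, 1),
    ("cluster 1 unpublished", 0, 0),
    ("cluster 1 healthy and republished", 0, 1),
]

decisions = [
    (label, cluster2_should_publish(unhealthy, unowned))
    for label, unhealthy, unowned in steps
]

for label, publish in decisions:
    print(f"{label}: cluster 2 publishes = {publish}")
```

Under these assumptions the rule is false while cluster 1 is healthy, true once cluster 1 is unhealthy or absent from the zone, and false again after cluster 1 recovers.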

### Workload migration
Cluster 1 has a workload that needs to be migrated to cluster 2.
- The publishing rule on cluster 1 is set to: "unowned_record_count == 0"
- The workload is created on cluster 2.
- Records are created by cluster 2.
- Cluster 1 sees that other records exist and unpublishes its records from the zone.
- The admin sees that the status update on the DNS Policy in cluster 1 (all records removed from zone) happened more than one TTL ago.
- The admin can safely remove the workload from cluster 1.
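
The final check in this list, waiting until the unpublish happened more than one TTL ago, could be expressed as below. The function and parameter names are hypothetical. The timestamp is assumed to come from the relevant condition in the DNS Policy status.

```python
from datetime import datetime, timedelta


def safe_to_remove_workload(unpublished_at: datetime, ttl_seconds: int, now: datetime) -> bool:
    """True once more than one TTL has elapsed since the records were unpublished,
    so resolvers should no longer hold cached answers pointing at cluster 1."""
    return now - unpublished_at > timedelta(seconds=ttl_seconds)
```

Passing `now` explicitly keeps the check deterministic and easy to test. A caller would typically supply the current UTC time.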

### Extra clusters during high load
This would require adding metrics to the CEL rules, which is not currently planned, but the example shows how such a rule might look.

- Cluster 1 has the workload and always publishes.
- Cluster 2 has the workload and has a publishing rule: "requests_per_minute >= n"