Description
Overview
Add drift detection support to the Pulumi Kubernetes Operator, similar to Pulumi Cloud's drift detection feature. This allows users to detect when cloud resources diverge from the desired state defined in their Stack, with optional automatic remediation.
Background
Related to #16 (preview support)
The operator currently applies changes immediately using pulumi up -y. To enhance safety and observability, users need:
- Scheduled drift detection without forced reconciliation
- Visibility when resources drift from desired state
- Optional automatic remediation of drift
User Experience
Users can configure drift detection schedules on Stack CRs:
    apiVersion: pulumi.com/v1
    kind: Stack
    metadata:
      name: my-stack
    spec:
      stack: org/project/stack
      projectRepo: https://github.com/example/repo
      driftDetection:
        schedules:
          - cron: "*/15 * * * *" # Check every 15 minutes
            autoRemediate: false

When drift is detected:
- Results visible in Pulumi Cloud's Drift tab on the Stack page
- Stack status conditions show drift: DriftDetected condition
- Kubernetes events emitted for integration with external notification systems
- Optional automatic remediation via autoRemediate: true
Status Condition Examples
When drift is detected:
    status:
      conditions:
        - type: DriftDetected
          status: "True"
          reason: Changes
          message: "2 additions, 0 deletions, and 1 change"
          lastTransitionTime: "2025-01-23T10:30:00Z"
        - type: Ready
          status: "True"
          reason: ProcessingCompleted
          message: "the stack has been processed and is up to date"
      driftDetection:
        lastCheck: "2025-01-23T10:30:00Z"
      lastUpdate:
        type: refresh
        state: succeeded
        permalink: "https://app.pulumi.com/org/project/stack/updates/42"

When no drift is detected:
    status:
      conditions:
        - type: DriftDetected
          status: "False"
          reason: NoChanges
          message: "No changes detected"
          lastTransitionTime: "2025-01-23T10:30:00Z"
        - type: Ready
          status: "True"
          reason: ProcessingCompleted
          message: "the stack has been processed and is up to date"
      driftDetection:
        lastCheck: "2025-01-23T10:30:00Z"
      lastUpdate:
        type: refresh
        state: succeeded
        permalink: "https://app.pulumi.com/org/project/stack/updates/42"

Kubernetes Events:
    LAST SEEN   TYPE     REASON               OBJECT           MESSAGE
    2m ago      Normal   StackDriftDetected   Stack/my-stack   2 additions, 0 deletions, and 1 change
Implementation Approach
Drift detection is based on pulumi refresh --preview-only, a non-destructive state check that reports differences without modifying the stack's state.
API Changes
operator/api/pulumi/shared/stack_types.go:
- Add DriftDetection spec with:
  - Schedules []DriftSchedule (cron string + autoRemediate bool)
- Add DriftDetectionStatus with:
  - LastCheck *metav1.Time
operator/api/pulumi/v1/stack_types.go:
- Add DriftDetected condition constant
Protocol Buffer Changes
agent/pkg/proto/agent.proto:
- Add preview_only field to RefreshRequest
- Regenerate proto code via cd agent && make protoc
Agent Changes
agent/pkg/server/server.go:
- Update Refresh() to handle preview_only flag
- Use Stack.PreviewRefresh() for non-destructive drift detection
Note: The Pulumi Automation API already supports preview-only refresh via Stack.PreviewRefresh(). The draft implementation currently uses a workaround (RunProgram(false)), but should be updated to use PreviewRefresh() properly.
Controller Changes
operator/internal/controller/pulumi/stack_controller.go:
- Add cron-based scheduling logic for drift checks
- When Stack is Ready and schedule triggers:
  - Create Update CR with type: refresh and preview_only: true
- After drift detection Update completes:
  - Set DriftDetected condition based on change summary
  - Update driftDetection.lastCheck timestamp
  - Emit Kubernetes events
  - If autoRemediate: true and drift detected, create Update with type: up
operator/internal/controller/auto/update_controller.go:
- Update Refresh() to pass preview_only flag to agent
Status Tracking
Drift detection runs will update the lastUpdate field with:
- Type: refresh
- Permalink to the Pulumi Cloud refresh operation
- Change summary in conditions
The DriftDetected condition shows:
- Status: True/False
- Message: Change summary (e.g., "2 additions, 0 deletions, and 1 change")
- Reason: "Changes" or "NoChanges"
Condition is cleared when:
- Auto-remediation succeeds, or
- Next drift detection shows no changes
Tasks
- Add CRD fields for drift detection configuration
- Update protobuf definitions for preview-only refresh
- Implement agent support for preview-only refresh (currently uses workaround)
- Update agent to use Stack.PreviewRefresh() instead of workaround
- Add scheduling logic to StackReconciler
- Add drift detection condition handling
- Implement auto-remediation workflow
- Add tests for drift detection scenarios
- Run code generation (make codegen)
- Update documentation
- Add examples
Related
- Requirements: See attached design document
- Draft PR: Add drift detection infrastructure #1041