
Implement Drift Detection with Scheduled Refresh #1037

@EronWright

Overview

Add drift detection support to the Pulumi Kubernetes Operator, similar to Pulumi Cloud's drift detection feature. This allows users to detect when cloud resources diverge from the desired state defined in their Stack, with optional automatic remediation.

Background

Related to #16 (preview support)

The operator currently applies changes immediately using pulumi up -y. To enhance safety and observability, users need:

  • Scheduled drift detection without forced reconciliation
  • Visibility when resources drift from desired state
  • Optional automatic remediation of drift

User Experience

Users can configure drift detection schedules on Stack CRs:

apiVersion: pulumi.com/v1
kind: Stack
metadata:
  name: my-stack
spec:
  stack: org/project/stack
  projectRepo: https://github.com/example/repo
  driftDetection:
    schedules:
      - cron: "*/15 * * * *"  # Check every 15 minutes
        autoRemediate: false

When drift is detected:

  1. Results are visible in Pulumi Cloud's Drift tab on the Stack page
  2. The Stack status reports the drift through a DriftDetected condition
  3. Kubernetes events are emitted for integration with external notification systems
  4. Drift is optionally remediated automatically when autoRemediate: true is set

Status Condition Examples

When drift is detected:

status:
  conditions:
    - type: DriftDetected
      status: "True"
      reason: Changes
      message: "2 additions, 0 deletions, and 1 change"
      lastTransitionTime: "2025-01-23T10:30:00Z"
    - type: Ready
      status: "True"
      reason: ProcessingCompleted
      message: "the stack has been processed and is up to date"
  driftDetection:
    lastCheck: "2025-01-23T10:30:00Z"
  lastUpdate:
    type: refresh
    state: succeeded
    permalink: "https://app.pulumi.com/org/project/stack/updates/42"

When no drift is detected:

status:
  conditions:
    - type: DriftDetected
      status: "False"
      reason: NoChanges
      message: "No changes detected"
      lastTransitionTime: "2025-01-23T10:30:00Z"
    - type: Ready
      status: "True"
      reason: ProcessingCompleted
      message: "the stack has been processed and is up to date"
  driftDetection:
    lastCheck: "2025-01-23T10:30:00Z"
  lastUpdate:
    type: refresh
    state: succeeded
    permalink: "https://app.pulumi.com/org/project/stack/updates/42"

Kubernetes Events:

LAST SEEN   TYPE     REASON               OBJECT           MESSAGE
2m ago      Normal   StackDriftDetected   Stack/my-stack   2 additions, 0 deletions, and 1 change

Implementation Approach

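Drift detection is based on pulumi refresh --preview-only, a non-destructive check that reports differences between the recorded state and the actual resources without performing the refresh.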

API Changes

operator/api/pulumi/shared/stack_types.go:

  • Add DriftDetection spec with:
    • Schedules []DriftSchedule (cron string + autoRemediate bool)
  • Add DriftDetectionStatus with:
    • LastCheck *metav1.Time

operator/api/pulumi/v1/stack_types.go:

  • Add DriftDetected condition constant

Protocol Buffer Changes

agent/pkg/proto/agent.proto:

  • Add preview_only field to RefreshRequest
  • Regenerate proto code via cd agent && make protoc

Agent Changes

agent/pkg/server/server.go:

  • Update Refresh() to handle preview_only flag
  • Use Stack.PreviewRefresh() for non-destructive drift detection

Note: The Pulumi Automation API already supports preview-only refresh via Stack.PreviewRefresh(). The draft implementation currently uses a workaround (RunProgram(false)), but should be updated to use PreviewRefresh() properly.

Controller Changes

operator/internal/controller/pulumi/stack_controller.go:

  • Add cron-based scheduling logic for drift checks
  • When Stack is Ready and schedule triggers:
    • Create Update CR with type: refresh and preview_only: true
  • After drift detection Update completes:
    • Set DriftDetected condition based on change summary
    • Update driftDetection.lastCheck timestamp
    • Emit Kubernetes events
    • If autoRemediate: true and drift detected, create Update with type: up

operator/internal/controller/auto/update_controller.go:

  • Update Refresh() to pass preview_only flag to agent

Status Tracking

Drift detection runs will update the lastUpdate field with:

  • Type: refresh
  • Permalink to the Pulumi Cloud refresh operation
  • Change summary in conditions

The DriftDetected condition shows:

  • Status: True/False
  • Message: Change summary (e.g., "2 additions, 0 deletions, and 1 change")
  • Reason: "Changes" or "NoChanges"

The condition is cleared when:

  • Auto-remediation succeeds, or
  • Next drift detection shows no changes
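
As an illustration of the condition semantics above, the sketch below derives the DriftDetected condition from a change summary. The Condition struct is a simplified stand-in for the Kubernetes condition type, and reducing the refresh result to three counts (additions, deletions, changes) is an assumption; the status, reason, and message strings match the examples in this issue.

```go
package main

import "fmt"

// Condition is a simplified stand-in for a Kubernetes status condition.
type Condition struct {
	Type    string
	Status  string
	Reason  string
	Message string
}

// plural formats a count with a singular or plural noun.
func plural(n int, noun string) string {
	if n == 1 {
		return fmt.Sprintf("%d %s", n, noun)
	}
	return fmt.Sprintf("%d %ss", n, noun)
}

// driftCondition builds the DriftDetected condition from the number of
// additions, deletions, and changes reported by a preview-only refresh.
func driftCondition(adds, deletes, changes int) Condition {
	if adds+deletes+changes == 0 {
		return Condition{
			Type:    "DriftDetected",
			Status:  "False",
			Reason:  "NoChanges",
			Message: "No changes detected",
		}
	}
	return Condition{
		Type:   "DriftDetected",
		Status: "True",
		Reason: "Changes",
		Message: fmt.Sprintf("%s, %s, and %s",
			plural(adds, "addition"),
			plural(deletes, "deletion"),
			plural(changes, "change")),
	}
}
```

The same message string could be reused for the Kubernetes event so the condition and the event stay consistent.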

Tasks

  • Add CRD fields for drift detection configuration
  • Update protobuf definitions for preview-only refresh
  • Implement agent support for preview-only refresh (currently uses workaround)
  • Update agent to use Stack.PreviewRefresh() instead of workaround
  • Add scheduling logic to StackReconciler
  • Add drift detection condition handling
  • Implement auto-remediation workflow
  • Add tests for drift detection scenarios
  • Run code generation (make codegen)
  • Update documentation
  • Add examples
