KSM in the CloudZero Agent

Evan Nemerson edited this page Nov 26, 2025 · 2 revisions

This document explains the purpose, functionality, and importance of Kube State Metrics (KSM) in the CloudZero Agent deployment, including why CloudZero bundles its own KSM instance.

What is Kube State Metrics?

Kube State Metrics (KSM) is a service that listens to the Kubernetes API server and generates Prometheus metrics about the state of Kubernetes objects. Unlike metrics that track resource consumption (CPU, memory usage), KSM exposes metrics about resource configuration and state - what resources exist, how they're configured, and their current status.

graph LR
    subgraph "Kubernetes Cluster"
        K8S[Kubernetes API Server]
        NODES[Nodes]
        PODS[Pods]
        DEPLOYS[Deployments]
        SVC[Services]
    end

    subgraph "CloudZero Agent"
        KSM[Kube State Metrics]
        PROM[Prometheus]
        COLLECTOR[Collector]
    end

    K8S -.->|"Watches Resources"| KSM
    NODES -.-> K8S
    PODS -.-> K8S
    DEPLOYS -.-> K8S
    SVC -.-> K8S

    KSM -->|"Exposes Metrics"| PROM
    PROM -->|"Scrapes & Stores"| COLLECTOR

Key distinction: KSM provides state information (labels, annotations, resource requests, node placement), not usage information (actual CPU/memory consumption).

Why CloudZero Needs Kube State Metrics

CloudZero uses KSM to gather essential metadata about your Kubernetes resources for accurate cost allocation. Specifically, the CloudZero Agent scrapes these metrics from KSM:

  • kube_node_info - Node identification and cloud provider metadata
  • kube_node_status_capacity - Node capacity for resource allocation calculations
  • kube_pod_container_resource_limits - Container resource limits
  • kube_pod_container_resource_requests - Container resource requests
  • kube_pod_labels - Pod labels for cost allocation dimensions
  • kube_pod_info - Pod placement and ownership information
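To make the list concrete, here is a minimal sketch of a Prometheus scrape job that keeps only these six metrics. It is an illustration, not the Agent's actual scrape configuration: the job name and target address are placeholders, and only standard Prometheus `scrape_configs` and `metric_relabel_configs` fields are used.

```yaml
scrape_configs:
  - job_name: kube-state-metrics   # hypothetical job name
    static_configs:
      - targets:
          # Placeholder address; substitute the KSM service for your cluster.
          - "your-ksm-service.your-namespace.svc.cluster.local:8080"
    metric_relabel_configs:
      # Keep only the metrics listed above; every other series is dropped
      # after the scrape and before storage.
      - source_labels: [__name__]
        regex: "kube_node_info|kube_node_status_capacity|kube_pod_container_resource_limits|kube_pod_container_resource_requests|kube_pod_labels|kube_pod_info"
        action: keep
```

Filtering at scrape time keeps stored cardinality low even if the KSM instance itself exposes many more metrics.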

These metrics enable CloudZero to:

  1. Attribute costs to resources: Know which pods are running on which nodes
  2. Calculate allocation percentages: Understand resource requests vs node capacity
  3. Apply cost allocation tags: Use Kubernetes labels to allocate costs to teams, projects, or applications
  4. Track resource efficiency: Compare resource requests (which KSM reports) to actual usage (which comes from usage metrics, not from KSM)
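As an illustration of item 2, a pod's share of a node can be written as the ratio of its request to the node's capacity. This is a simplified sketch for intuition, not CloudZero's documented allocation formula; the symbols are introduced here only for the example. The numerator would come from `kube_pod_container_resource_requests` (summed over the pod's containers) and the denominator from `kube_node_status_capacity`.

```latex
\text{share}_{\text{cpu}}(p, n)
  = \frac{\text{request}_{\text{cpu}}(p)}{\text{capacity}_{\text{cpu}}(n)},
\qquad \text{e.g.} \quad
\frac{0.5\ \text{cores}}{4\ \text{cores}} = 0.125 = 12.5\%
```

The same ratio applies to memory, using the memory request and memory capacity.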

Without KSM data, CloudZero cannot accurately allocate Kubernetes costs or provide insights into resource utilization and optimization opportunities.

Why CloudZero Bundles Its Own KSM Instance

CloudZero Agent includes a dedicated Kube State Metrics instance rather than relying on existing cluster monitoring infrastructure. This architectural decision addresses several operational and reliability challenges:

  1. Purpose-Built Configuration: By default, KSM produces a large volume of metrics, most of which are irrelevant to cost allocation. CloudZero's bundled instance enables only the specific collectors needed for cost allocation.

  2. Operational Independence: Just as CloudZero configures KSM for cost allocation needs, observability teams configure their KSM installations for their monitoring needs - and these needs can conflict. Teams may disable metrics to reduce cardinality, filter namespaces to reduce noise, or adjust label handling for their dashboards. These changes, made with good intentions, can unknowingly break cost allocation. By managing its own KSM instance, CloudZero maintains a stable configuration that's decoupled from cluster monitoring infrastructure and protected from operational changes in other systems.

  3. Complete Cluster Coverage: The bundled KSM makes CloudZero Agent self-contained, working consistently across production, development, test, and ephemeral clusters without depending on existing monitoring infrastructure.

  4. Tested Integration: Each CloudZero Agent release includes a specific KSM version that's tested as part of the integrated system. Metric formats, labels, and APIs are validated in the test suite, ensuring known behavior.

  5. Forward Compatibility: Bundling KSM lets CloudZero features evolve without separate coordination. New metrics, configuration changes, and label handling updates ship with Agent upgrades. If CloudZero later replaces KSM with a different data collection approach, that transition can also happen through a normal Agent upgrade.

  6. Simplified Support: The bundled KSM provides a standard, validated configuration that CloudZero Support can replicate and troubleshoot. The data pipeline from KSM through cost allocation is a single, well-understood system.
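To illustrate what a purpose-built configuration (item 1) can mean in practice, the sketch below restricts a KSM container to the resources and metrics named earlier in this document. The flags `--resources` and `--metric-allowlist` are upstream kube-state-metrics options; the surrounding container spec is a hypothetical fragment, and the bundled instance's actual arguments may differ.

```yaml
# Hypothetical container spec fragment for a KSM Deployment.
containers:
  - name: kube-state-metrics
    args:
      # Only watch nodes and pods; other resource collectors stay disabled.
      - "--resources=nodes,pods"
      # Expose only the metrics used for cost allocation.
      - "--metric-allowlist=kube_node_info,kube_node_status_capacity,kube_pod_container_resource_limits,kube_pod_container_resource_requests,kube_pod_labels,kube_pod_info"
```

A monitoring team tuning a shared KSM instance could reasonably choose different values for either flag, which is the conflict described in item 2.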

Configuration

Using the Bundled KSM (Recommended)

The default configuration uses the bundled KSM instance with no additional configuration required:

# KSM is enabled by default

Using an External KSM Instance (NOT Recommended)

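The bundled KSM can be disabled if an external instance is preferred. Set targetOverride to the address of the external KSM service. The external instance must expose every metric listed under "Why CloudZero Needs Kube State Metrics"; otherwise cost allocation data will be incomplete: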

kubeStateMetrics:
  enabled: false
  targetOverride: "your-ksm-service.your-namespace.svc.cluster.local:8080"
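One detail to check on an external instance is label exposure. In kube-state-metrics v2, Kubernetes labels are not attached to `kube_pod_labels` unless they are allowlisted with the upstream `--metric-labels-allowlist` flag. The fragment below is a sketch of how that might look on the external KSM container, under the assumption that all pod labels should be available; it is not a CloudZero-provided configuration.

```yaml
# Hypothetical args for the external KSM container (not the CloudZero Agent).
args:
  # Expose all pod labels on kube_pod_labels; narrow the list
  # (for example pods=[app,team]) if cardinality is a concern.
  - "--metric-labels-allowlist=pods=[*]"
```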

Adjusting KSM Resources

For large clusters, you may need to increase KSM resources (set values as needed):

kubeStateMetrics:
  resources:
    requests:
      memory: 512Mi
      cpu: 500m
    limits:
      memory: 1Gi
      cpu: 1000m

Conclusion

Kube State Metrics is an essential component of the CloudZero Agent, providing the Kubernetes resource state information necessary for accurate cost allocation. The bundled KSM instance is purpose-built for cost allocation, tested as part of CloudZero Agent releases, and configured to work reliably across diverse cluster environments.

The bundled KSM approach provides:

  • Reliability: Tested configuration that works consistently across all clusters
  • Consistency: Predictable behavior across CloudZero Agent versions
  • Operational simplicity: Self-contained deployment with no external dependencies
  • Easy support: Standard configuration that CloudZero Support understands

For most deployments, the bundled KSM adds little overhead (typically 100-200 MB of memory). It is designed to run alongside existing monitoring infrastructure while providing CloudZero with the specific metrics needed for accurate cost allocation.
