
Building Self-Healing Kubernetes Systems with Operators: A Complete Guide

MatterAI Agent

How to Build Self-Healing Systems with Kubernetes Operators and Custom Resources

Self-healing systems in Kubernetes rely on the Operator pattern, where Custom Resources (CRs) define desired state and Controllers reconcile actual state through continuous observation and remediation loops.

Architecture of Self-Healing

The Operator pattern extends Kubernetes by introducing domain-specific knowledge into the control plane. Custom Resource Definitions (CRDs) define the API schema, while the Controller implements the reconciliation logic that maintains system health.

Core Components

  • Custom Resource Definition (CRD): Schema defining the desired state structure
  • Custom Resource (CR): Instance of the CRD containing specific configuration
  • Controller: Infinite loop observing current state and taking corrective actions
  • Reconciler: Function implementing the Observe-Analyze-Act pattern (see the sketch below)
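
Concretely, a controller is anything with a Reconcile method that satisfies controller-runtime's reconcile.Reconciler contract. A minimal skeleton of the Observe-Analyze-Act loop (the names here are placeholders):

package controller

import (
    "context"

    ctrl "sigs.k8s.io/controller-runtime"
)

// SkeletonReconciler shows the shape of the reconciliation loop.
type SkeletonReconciler struct{}

func (r *SkeletonReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    // Observe: fetch the object named by req and the resources it owns.
    // Analyze: compare the observed state with the desired state in .spec.
    // Act:     create, patch, or delete resources to close the gap, then report in .status.
    return ctrl.Result{}, nil // an empty Result means "done until the next watch event"
}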

Defining the Custom Resource

The CRD schema must separate spec (user-defined desired state) from status (controller-observed actual state). The status subresource must be explicitly declared to enable status updates.

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: healingservices.example.com
spec:
  group: example.com
  versions:
    - name: v1
      served: true
      storage: true
      subresources:
        status: {}
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                replicaCount:
                  type: integer
                  minimum: 1
                healthThreshold:
                  type: integer
                  default: 3
                failureThreshold:
                  type: integer
                  default: 5
            status:
              type: object
              properties:
                observedGeneration:
                  type: integer
                healthyReplicas:
                  type: integer
                lastHealTime:
                  type: string
                  format: date-time
      additionalPrinterColumns:
        - name: Replicas
          type: integer
          jsonPath: .spec.replicaCount
        - name: Healthy
          type: integer
          jsonPath: .status.healthyReplicas
        - name: Age
          type: date
          jsonPath: .metadata.creationTimestamp
  scope: Namespaced
  names:
    plural: healingservices
    singular: healingservice
    kind: HealingService

The status subresource lets the controller report observed state without touching the user-owned spec, which remains the source of truth for desired state. Without the subresources: status: {} declaration, Status().Update() calls fail with a "not found" (404) error because the /status endpoint does not exist. The additionalPrinterColumns field enriches kubectl get output with meaningful status information.
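
On the Go side the same schema lives in api/v1/healingservice_types.go. A sketch of those types, using plain int fields so they line up with the reconciler arithmetic shown later (kubebuilder scaffolding typically generates int32):

package v1

import (
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// HealingServiceSpec mirrors .spec in the CRD schema above.
type HealingServiceSpec struct {
    // +kubebuilder:validation:Minimum=1
    ReplicaCount int `json:"replicaCount"`

    // +kubebuilder:default=3
    HealthThreshold int `json:"healthThreshold,omitempty"`

    // +kubebuilder:default=5
    FailureThreshold int `json:"failureThreshold,omitempty"`
}

// HealingServiceStatus mirrors .status and is written only by the controller.
type HealingServiceStatus struct {
    ObservedGeneration int64       `json:"observedGeneration,omitempty"`
    HealthyReplicas    int         `json:"healthyReplicas,omitempty"`
    LastHealTime       metav1.Time `json:"lastHealTime,omitempty"`
}

// +kubebuilder:object:root=true
// +kubebuilder:subresource:status

// HealingService is the Schema for the healingservices API.
type HealingService struct {
    metav1.TypeMeta   `json:",inline"`
    metav1.ObjectMeta `json:"metadata,omitempty"`

    Spec   HealingServiceSpec   `json:"spec,omitempty"`
    Status HealingServiceStatus `json:"status,omitempty"`
}

// +kubebuilder:object:root=true

// HealingServiceList contains a list of HealingService.
type HealingServiceList struct {
    metav1.TypeMeta `json:",inline"`
    metav1.ListMeta `json:"metadata,omitempty"`
    Items           []HealingService `json:"items"`
}

Running make generate and make manifests afterwards produces the DeepCopy methods and keeps the CRD YAML in sync with these types.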

Controller Manager Setup

Before implementing the reconciler, initialize the controller manager with production-ready configuration including leader election and metrics exposure.

package main

import (
    "flag"
    "os"

    "k8s.io/apimachinery/pkg/runtime"
    utilruntime "k8s.io/apimachinery/pkg/util/runtime"
    clientgoscheme "k8s.io/client-go/kubernetes/scheme"
    _ "k8s.io/client-go/plugin/pkg/client/auth/gcp"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/healthz"
    "sigs.k8s.io/controller-runtime/pkg/log/zap"
    "sigs.k8s.io/controller-runtime/pkg/metrics/server"

    examplev1 "example.com/healing-operator/api/v1"
    "example.com/healing-operator/internal/controller"
)

var (
    scheme   = runtime.NewScheme()
    setupLog = ctrl.Log.WithName("setup")
)

func init() {
    utilruntime.Must(clientgoscheme.AddToScheme(scheme))
    utilruntime.Must(examplev1.AddToScheme(scheme))
}

func main() {
    var metricsAddr string
    var enableLeaderElection bool
    var probeAddr string

    flag.StringVar(&metricsAddr, "metrics-bind-address", ":8080", "The address the metric endpoint binds to.")
    flag.StringVar(&probeAddr, "health-probe-bind-address", ":8081", "The address the probe endpoint binds to.")
    flag.BoolVar(&enableLeaderElection, "leader-elect", false,
        "Enable leader election for controller manager. "+
            "Enabling this will ensure there is only one active controller manager.")
    opts := zap.Options{
        Development: true,
    }
    opts.BindFlags(flag.CommandLine)
    flag.Parse()

    ctrl.SetLogger(zap.New(zap.UseFlagOptions(&opts)))

    mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
        Scheme:                 scheme,
        Metrics:                server.Options{BindAddress: metricsAddr},
        HealthProbeBindAddress: probeAddr,
        LeaderElection:         enableLeaderElection,
        LeaderElectionID:       "healing-operator.example.com",
    })
    if err != nil {
        setupLog.Error(err, "unable to start manager")
        os.Exit(1)
    }

    if err = (&controller.HealingServiceReconciler{
        Client:        mgr.GetClient(),
        Scheme:        mgr.GetScheme(),
        EventRecorder: mgr.GetEventRecorderFor("healingservice-controller"),
    }).SetupWithManager(mgr); err != nil {
        setupLog.Error(err, "unable to create controller", "controller", "HealingService")
        os.Exit(1)
    }

    if err := mgr.AddHealthzCheck("healthz", healthz.Ping); err != nil {
        setupLog.Error(err, "unable to set up health check")
        os.Exit(1)
    }
    if err := mgr.AddReadyzCheck("readyz", healthz.Ping); err != nil {
        setupLog.Error(err, "unable to set up ready check")
        os.Exit(1)
    }

    setupLog.Info("starting manager")
    if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
        setupLog.Error(err, "problem running manager")
        os.Exit(1)
    }
}

Leader election ensures only one controller instance runs in HA deployments, preventing duplicate reconciliation actions. The metrics server exposes Prometheus-compatible metrics at /metrics for observability.

The Reconcile Loop Logic

The reconciliation function implements the core self-healing mechanism: observe current state, compare with desired state, and take corrective actions.

package controller

import (
    "context"
    "fmt"
    "sort"
    "time"

    corev1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/api/errors"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/runtime"
    "k8s.io/client-go/tools/record"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/client"
    "sigs.k8s.io/controller-runtime/pkg/log"

    examplev1 "example.com/healing-operator/api/v1"
)

//+kubebuilder:rbac:groups=example.com,resources=healingservices,verbs=get;list;watch;create;update;patch;delete
//+kubebuilder:rbac:groups=example.com,resources=healingservices/status,verbs=get;update;patch
//+kubebuilder:rbac:groups=example.com,resources=healingservices/finalizers,verbs=update
//+kubebuilder:rbac:groups=core,resources=pods,verbs=get;list;watch;create;update;patch;delete
//+kubebuilder:rbac:groups=core,resources=events,verbs=create;patch

const requeueDelay = 5 * time.Second
const healDelay = 30 * time.Second
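
// HealingServiceReconciler reconciles HealingService objects. This is a minimal
// sketch; the embedded record.EventRecorder is an assumption that makes the
// r.Event(...) calls below work and is wired in main.go via mgr.GetEventRecorderFor.
type HealingServiceReconciler struct {
    client.Client
    Scheme *runtime.Scheme
    record.EventRecorder
}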

func (r *HealingServiceReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    log := log.FromContext(ctx)
    
    // 1. Fetch the HealingService instance
    service := &examplev1.HealingService{}
    if err := r.Get(ctx, req.NamespacedName, service); err != nil {
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }
    
    // 2. Observe current state - fetch managed pods
    podList := &corev1.PodList{}
    if err := r.List(ctx, podList, client.InNamespace(req.Namespace), 
        client.MatchingLabels{"app": service.Name}); err != nil {
        return ctrl.Result{}, err
    }
    
    // 3. Analyze health status
    healthyCount := 0
    for _, pod := range podList.Items {
        if isPodHealthy(&pod) {
            healthyCount++
        }
    }
    
    // 4. Scale to desired replica count (both up and down)
    desiredReplicas := service.Spec.ReplicaCount
    currentReplicas := len(podList.Items)
    
    if currentReplicas < desiredReplicas {
        podsNeeded := desiredReplicas - currentReplicas
        log.Info("Scaling up", "current", currentReplicas, "desired", desiredReplicas)
        
        for i := 0; i < podsNeeded; i++ {
            if err := r.createNewPod(ctx, service); err != nil {
                if !errors.IsAlreadyExists(err) {
                    log.Error(err, "Failed to create pod")
                    return ctrl.Result{RequeueAfter: requeueDelay}, err
                }
            }
        }
        return ctrl.Result{RequeueAfter: requeueDelay}, nil
    }
    
    if currentReplicas > desiredReplicas {
        podsToDelete := currentReplicas - desiredReplicas
        log.Info("Scaling down", "current", currentReplicas, "desired", desiredReplicas)
        
        // Separate unhealthy and healthy pods
        var unhealthyPods, healthyPods []corev1.Pod
        for _, pod := range podList.Items {
            if isPodHealthy(&pod) {
                healthyPods = append(healthyPods, pod)
            } else {
                unhealthyPods = append(unhealthyPods, pod)
            }
        }
        
        // Sort healthy pods newest first so long-running, stable pods survive the scale-down
        sort.Slice(healthyPods, func(i, j int) bool {
            return healthyPods[j].CreationTimestamp.Before(&healthyPods[i].CreationTimestamp)
        })
        
        // Delete unhealthy pods first, then the newest healthy pods if still needed
        deletedCount := 0
        for _, pod := range unhealthyPods {
            if deletedCount >= podsToDelete {
                break
            }
            if err := r.Delete(ctx, &pod); err != nil && !errors.IsNotFound(err) {
                log.Error(err, "Failed to delete pod", "pod", pod.Name)
                return ctrl.Result{RequeueAfter: requeueDelay}, err
            }
            deletedCount++
        }
        
        for _, pod := range healthyPods {
            if deletedCount >= podsToDelete {
                break
            }
            if err := r.Delete(ctx, &pod); err != nil && !errors.IsNotFound(err) {
                log.Error(err, "Failed to delete pod", "pod", pod.Name)
                return ctrl.Result{RequeueAfter: requeueDelay}, err
            }
            deletedCount++
        }
        
        return ctrl.Result{RequeueAfter: requeueDelay}, nil
    }
    
    // 5. Update status with conflict handling
    service.Status.HealthyReplicas = healthyCount
    service.Status.ObservedGeneration = service.Generation
    if err := r.Status().Update(ctx, service); err != nil {
        if errors.IsConflict(err) {
            log.Info("Conflict updating status, requeueing")
            return ctrl.Result{Requeue: true}, nil
        }
        return ctrl.Result{}, err
    }
    
    // 6. Act - trigger healing if below threshold
    if healthyCount < service.Spec.HealthThreshold {
        log.Info("Triggering self-healing", "healthy", healthyCount, "required", service.Spec.HealthThreshold)
        
        // Delete unhealthy pods for recreation
        deletedCount := 0
        for _, pod := range podList.Items {
            if !isPodHealthy(&pod) {
                if err := r.Delete(ctx, &pod); err != nil {
                    if !errors.IsNotFound(err) {
                        log.Error(err, "Failed to delete unhealthy pod", "pod", pod.Name)
                        return ctrl.Result{RequeueAfter: requeueDelay}, err
                    }
                } else {
                    deletedCount++
                }
            }
        }
        
        if deletedCount > 0 {
            r.Event(service, corev1.EventTypeNormal, "HealingTriggered", 
                fmt.Sprintf("Deleted %d unhealthy pods for recreation", deletedCount))
            
            service.Status.LastHealTime = metav1.Now()
            if err := r.Status().Update(ctx, service); err != nil {
                if errors.IsConflict(err) {
                    return ctrl.Result{Requeue: true}, nil
                }
                return ctrl.Result{}, err
            }
            
            return ctrl.Result{RequeueAfter: healDelay}, nil
        }
    }
    
    return ctrl.Result{}, nil
}

func isPodHealthy(pod *corev1.Pod) bool {
    if pod.Status.Phase != corev1.PodRunning {
        return false
    }
    
    // Check PodReady condition
    for _, condition := range pod.Status.Conditions {
        if condition.Type == corev1.PodReady && condition.Status != corev1.ConditionTrue {
            return false
        }
    }
    
    // Check container states for various failure conditions
    for _, containerStatus := range pod.Status.ContainerStatuses {
        state := containerStatus.State
        if state.Waiting != nil {
            reason := state.Waiting.Reason
            if reason == "CrashLoopBackOff" || 
               reason == "ImagePullBackOff" || 
               reason == "ErrImagePull" {
                return false
            }
        }
        if state.Terminated != nil && state.Terminated.ExitCode != 0 {
            return false
        }
    }
    
    return true
}

func (r *HealingServiceReconciler) createNewPod(ctx context.Context, service *examplev1.HealingService) error {
    pod := &corev1.Pod{
        ObjectMeta: metav1.ObjectMeta{
            GenerateName: fmt.Sprintf("%s-", service.Name),
            Namespace:    service.Namespace,
            Labels: map[string]string{
                "app": service.Name,
            },
        },
        Spec: corev1.PodSpec{
            Containers: []corev1.Container{
                {
                    Name:  "app",
                    Image: "nginx:latest", // placeholder image; a real operator would take this from the spec
                },
            },
        },
    }
    
    // Make the HealingService the controlling owner so pods are garbage-collected
    // with the CR and pod deletions trigger reconciliation via Owns(&corev1.Pod{}).
    if err := ctrl.SetControllerReference(service, pod, r.Scheme); err != nil {
        return err
    }
    
    return r.Create(ctx, pod)
}

func (r *HealingServiceReconciler) SetupWithManager(mgr ctrl.Manager) error {
    return ctrl.NewControllerManagedBy(mgr).
        For(&examplev1.HealingService{}).
        Owns(&corev1.Pod{}).
        Complete(r)
}

The idempotency requirement means the reconciler can run any number of times without side effects: given the same observed state, every execution converges on the same result. RBAC markers grant the controller the permissions it needs to manage pods, emit events, and update CR status. The isPodHealthy() function inspects pod phase, readiness conditions, and container states to detect CrashLoopBackOff, ImagePullBackOff, ErrImagePull, and failed terminations. Scale-down deletes unhealthy pods first, then the newest healthy pods, so long-running pods are disrupted last. Conflict handling prevents status updates from failing hard during concurrent modifications.
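
A table-driven unit test pins down the health semantics; a minimal sketch exercising only the conditions isPodHealthy() already checks:

package controller

import (
    "testing"

    corev1 "k8s.io/api/core/v1"
)

func TestIsPodHealthy(t *testing.T) {
    cases := []struct {
        name string
        pod  corev1.Pod
        want bool
    }{
        {
            name: "running and ready",
            pod: corev1.Pod{Status: corev1.PodStatus{
                Phase:      corev1.PodRunning,
                Conditions: []corev1.PodCondition{{Type: corev1.PodReady, Status: corev1.ConditionTrue}},
            }},
            want: true,
        },
        {
            name: "crash looping container",
            pod: corev1.Pod{Status: corev1.PodStatus{
                Phase:      corev1.PodRunning,
                Conditions: []corev1.PodCondition{{Type: corev1.PodReady, Status: corev1.ConditionFalse}},
                ContainerStatuses: []corev1.ContainerStatus{{
                    State: corev1.ContainerState{Waiting: &corev1.ContainerStateWaiting{Reason: "CrashLoopBackOff"}},
                }},
            }},
            want: false,
        },
        {
            name: "pending pod",
            pod:  corev1.Pod{Status: corev1.PodStatus{Phase: corev1.PodPending}},
            want: false,
        },
    }

    for _, tc := range cases {
        if got := isPodHealthy(&tc.pod); got != tc.want {
            t.Errorf("%s: isPodHealthy() = %v, want %v", tc.name, got, tc.want)
        }
    }
}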

Validation Webhook

Validation webhooks enforce constraints on CR creation and updates, preventing invalid configurations such as negative replica counts. Returning a non-nil error from the validator causes the API server to reject the request.

package webhook

import (
    "context"
    "fmt"

    apierrors "k8s.io/apimachinery/pkg/api/errors"
    "k8s.io/apimachinery/pkg/runtime"
    "k8s.io/apimachinery/pkg/runtime/schema"
    "k8s.io/apimachinery/pkg/util/validation/field"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/webhook/admission"

    examplev1 "example.com/healing-operator/api/v1"
)

// HealingServiceCustomValidator validates HealingService objects on admission.
type HealingServiceCustomValidator struct{}

var _ admission.CustomValidator = &HealingServiceCustomValidator{}

// SetupHealingServiceWebhookWithManager registers the validating webhook with the manager.
func SetupHealingServiceWebhookWithManager(mgr ctrl.Manager) error {
    return ctrl.NewWebhookManagedBy(mgr).
        For(&examplev1.HealingService{}).
        WithValidator(&HealingServiceCustomValidator{}).
        Complete()
}

func (v *HealingServiceCustomValidator) ValidateCreate(ctx context.Context, obj runtime.Object) (admission.Warnings, error) {
    service, ok := obj.(*examplev1.HealingService)
    if !ok {
        return nil, fmt.Errorf("expected a HealingService object but got %T", obj)
    }
    
    var allErrs field.ErrorList
    specPath := field.NewPath("spec")
    
    if service.Spec.ReplicaCount <= 0 {
        allErrs = append(allErrs, field.Invalid(
            specPath.Child("replicaCount"),
            service.Spec.ReplicaCount,
            "replicaCount must be greater than 0",
        ))
    }
    
    if service.Spec.HealthThreshold < 1 {
        allErrs = append(allErrs, field.Invalid(
            specPath.Child("healthThreshold"),
            service.Spec.HealthThreshold,
            "healthThreshold must be at least 1",
        ))
    }
    
    if service.Spec.HealthThreshold > service.Spec.ReplicaCount {
        allErrs = append(allErrs, field.Invalid(
            specPath.Child("healthThreshold"),
            service.Spec.HealthThreshold,
            "healthThreshold cannot exceed replicaCount",
        ))
    }
    
    if len(allErrs) > 0 {
        return nil, apierrors.NewInvalid(
            schema.GroupKind{Group: "example.com", Kind: "HealingService"},
            service.Name, allErrs)
    }
    
    return nil, nil
}

func (v *HealingServiceCustomValidator) ValidateUpdate(ctx context.Context, oldObj, newObj runtime.Object) (admission.Warnings, error) {
    return v.ValidateCreate(ctx, newObj)
}

func (v *HealingServiceCustomValidator) ValidateDelete(ctx context.Context, obj runtime.Object) (admission.Warnings, error) {
    return nil, nil
}

//+kubebuilder:webhook:path=/validate-example-com-v1-healingservice,mutating=false,failurePolicy=fail,sideEffects=None,groups=example.com,resources=healingservices,verbs=create;update,versions=v1,name=vhealingservice.kb.io,admissionReviewVersions=v1

The validator returns admission.Warnings and an error. A non-nil error causes the API server to deny the admission request, so invalid configurations are never persisted. The field.ErrorList provides structured, per-field error messages that surface directly in kubectl apply output.
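
The validator still needs to be registered with the manager. A sketch of the wiring in main.go, assuming the SetupHealingServiceWebhookWithManager helper above and the conventional ENABLE_WEBHOOKS escape hatch for local runs without certificates:

// In main.go, after the reconciler has been set up
// (requires importing "example.com/healing-operator/internal/webhook").
if os.Getenv("ENABLE_WEBHOOKS") != "false" {
    if err := webhook.SetupHealingServiceWebhookWithManager(mgr); err != nil {
        setupLog.Error(err, "unable to create webhook", "webhook", "HealingService")
        os.Exit(1)
    }
}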

Advanced Self-Healing Patterns

Beyond simple pod recreation, sophisticated healing strategies require additional Kubernetes primitives.

Finalizers for Clean-up

Finalizers block deletion until all associated resources are properly cleaned up, preventing orphaned resources during healing cycles.

metadata:
  finalizers:
    - example.com/cleanup-protection
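
In the controller, the finalizer is managed with the controllerutil helpers at the start of Reconcile. A minimal sketch, assuming a hypothetical cleanupExternalResources helper that tears down anything pod deletion alone would not reclaim:

// Inside Reconcile, after fetching the HealingService
// (requires "sigs.k8s.io/controller-runtime/pkg/controller/controllerutil").
const cleanupFinalizer = "example.com/cleanup-protection"

if service.DeletionTimestamp.IsZero() {
    // Object is live: make sure the finalizer is present.
    if !controllerutil.ContainsFinalizer(service, cleanupFinalizer) {
        controllerutil.AddFinalizer(service, cleanupFinalizer)
        if err := r.Update(ctx, service); err != nil {
            return ctrl.Result{}, err
        }
    }
} else {
    // Object is being deleted: clean up first, then release the finalizer.
    if controllerutil.ContainsFinalizer(service, cleanupFinalizer) {
        if err := r.cleanupExternalResources(ctx, service); err != nil { // hypothetical helper
            return ctrl.Result{}, err
        }
        controllerutil.RemoveFinalizer(service, cleanupFinalizer)
        if err := r.Update(ctx, service); err != nil {
            return ctrl.Result{}, err
        }
    }
    return ctrl.Result{}, nil // nothing more to do for a deleted object
}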

Event Recording

Publish Kubernetes events to provide audit trails for healing actions.

r.Event(service, corev1.EventTypeNormal, "HealingTriggered", 
    fmt.Sprintf("Recreated %d unhealthy pods", len(unhealthyPods)))

Rate Limiting and Requeue Strategies

controller-runtime rate-limits reconciliation through its workqueue. To tune the backoff, replace the earlier SetupWithManager with a version that passes a custom rate limiter through controller.Options:

import (
    "time"

    corev1 "k8s.io/api/core/v1"
    "k8s.io/client-go/util/workqueue"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/controller"

    examplev1 "example.com/healing-operator/api/v1"
)

func (r *HealingServiceReconciler) SetupWithManager(mgr ctrl.Manager) error {
    return ctrl.NewControllerManagedBy(mgr).
        For(&examplev1.HealingService{}).
        Owns(&corev1.Pod{}).
        WithOptions(controller.Options{
            // Failed items back off exponentially from 5s up to 5m; the fast/slow
            // limiter allows 5 quick retries before falling back to 60s intervals.
            RateLimiter: workqueue.NewMaxOfRateLimiter(
                workqueue.NewItemExponentialFailureRateLimiter(5*time.Second, 300*time.Second),
                workqueue.NewItemFastSlowRateLimiter(5*time.Second, 60*time.Second, 5),
            ),
        }).
        Complete(r)
}
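
On recent controller-runtime releases the workqueue rate limiters are generic, so the typed constructors (workqueue.NewTypedMaxOfRateLimiter, workqueue.NewTypedItemExponentialFailureRateLimiter[reconcile.Request], and so on) are required instead of the untyped ones shown above; the shape of the configuration is otherwise the same.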

Metrics and Observability

Controller-runtime exposes Prometheus metrics automatically. Key metrics include:

  • controller_runtime_reconcile_total: Total reconciliation attempts
  • controller_runtime_reconcile_errors_total: Failed reconciliations
  • workqueue_depth: Current depth of the reconcile queue (labelled per controller)
  • workqueue_retries_total: Retry attempts

Access metrics at http://<pod-ip>:8080/metrics when deployed.
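
Custom metrics can sit alongside the built-in ones by registering collectors with controller-runtime's shared registry. A minimal sketch, using a made-up healingservice_healing_actions_total counter in a small metrics.go next to the controller:

package controller

import (
    "github.com/prometheus/client_golang/prometheus"
    "sigs.k8s.io/controller-runtime/pkg/metrics"
)

// healingActionsTotal counts self-healing actions, labelled by namespace.
var healingActionsTotal = prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Name: "healingservice_healing_actions_total",
        Help: "Number of self-healing actions taken by the operator.",
    },
    []string{"namespace"},
)

func init() {
    // Registering with the controller-runtime registry exposes the counter on /metrics.
    metrics.Registry.MustRegister(healingActionsTotal)
}

// In Reconcile, after unhealthy pods have been deleted:
//     healingActionsTotal.WithLabelValues(req.Namespace).Add(float64(deletedCount))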

Getting Started

  1. Install Kubebuilder: curl -L -o kubebuilder "https://go.kubebuilder.io/dl/latest/$(go env GOOS)/$(go env GOARCH)" && chmod +x kubebuilder && sudo mv kubebuilder /usr/local/bin/
  2. Initialize project: kubebuilder init --domain example.com --repo example.com/healing-operator
  3. Create API: kubebuilder create api --group example --version v1 --kind HealingService (kubebuilder joins group and domain into <group>.<domain>, so adjust them to match the example.com group used in the manifests above)
  4. Create webhook: kubebuilder create webhook --group example --version v1 --kind HealingService --defaulting --programmatic-validation
  5. Implement Reconcile logic in internal/controller/healingservice_controller.go
  6. Implement webhook validation in internal/webhook/healingservice_webhook.go
  7. Add RBAC markers above the Reconcile function
  8. Run locally: make run or deploy with make install && make deploy
  9. Enable leader election in production by adding the --leader-elect=true flag to the manager container args (config/manager/manager.yaml)
  10. Create test instance: kubectl apply -f config/samples/example_v1_healingservice.yaml
