
Building Self-Healing Kubernetes Systems with Operators: A Complete Guide

MatterAI Agent

How to Build Self-Healing Systems with Kubernetes Operators and Custom Resources

Self-healing systems in Kubernetes rely on the Operator pattern, where Custom Resources (CRs) define desired state and Controllers reconcile actual state through continuous observation and remediation loops.

Architecture of Self-Healing

The Operator pattern extends Kubernetes by introducing domain-specific knowledge into the control plane. Custom Resource Definitions (CRDs) define the API schema, while the Controller implements the reconciliation logic that maintains system health.

Core Components

  • Custom Resource Definition (CRD): Schema defining the desired state structure
  • Custom Resource (CR): Instance of the CRD containing specific configuration
  • Controller: Infinite loop observing current state and taking corrective actions
  • Reconciler: Function implementing the Observe-Analyze-Act pattern (see the sketch below)
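
Concretely, a controller is anything with a Reconcile method that satisfies controller-runtime's reconcile.Reconciler contract. A minimal skeleton of the Observe-Analyze-Act loop (the names here are placeholders):

package controller

import (
    "context"

    ctrl "sigs.k8s.io/controller-runtime"
)

// SkeletonReconciler shows the shape of the reconciliation loop.
type SkeletonReconciler struct{}

func (r *SkeletonReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    // Observe: fetch the object named by req and the resources it owns.
    // Analyze: compare the observed state with the desired state in .spec.
    // Act:     create, patch, or delete resources to close the gap, then report in .status.
    return ctrl.Result{}, nil // an empty Result means "done until the next watch event"
}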

Defining the Custom Resource

The CRD schema must separate spec (user-defined desired state) from status (controller-observed actual state). The status subresource must be explicitly declared to enable status updates.

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: healingservices.example.com
spec:
  group: example.com
  versions:
    - name: v1
      served: true
      storage: true
      subresources:
        status: {}
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                replicaCount:
                  type: integer
                  minimum: 1
                healthThreshold:
                  type: integer
                  default: 3
                failureThreshold:
                  type: integer
                  default: 5
            status:
              type: object
              properties:
                observedGeneration:
                  type: integer
                healthyReplicas:
                  type: integer
                lastHealTime:
                  type: string
                  format: date-time
      additionalPrinterColumns:
        - name: Replicas
          type: integer
          jsonPath: .spec.replicaCount
        - name: Healthy
          type: integer
          jsonPath: .status.healthyReplicas
        - name: Age
          type: date
          jsonPath: .metadata.creationTimestamp
  scope: Namespaced
  names:
    plural: healingservices
    singular: healingservice
    kind: HealingService

The status subresource lets the controller report observed state without touching the user-owned spec, which remains the source of truth for desired state. Without the subresources: status: {} declaration, Status().Update() calls fail with a "not found" (404) error because the /status endpoint does not exist. The additionalPrinterColumns field enriches kubectl get output with meaningful status information.
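
On the Go side the same schema lives in api/v1/healingservice_types.go. A sketch of those types, using plain int fields so they line up with the reconciler arithmetic shown later (kubebuilder scaffolding typically generates int32):

package v1

import (
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// HealingServiceSpec mirrors .spec in the CRD schema above.
type HealingServiceSpec struct {
    // +kubebuilder:validation:Minimum=1
    ReplicaCount int `json:"replicaCount"`

    // +kubebuilder:default=3
    HealthThreshold int `json:"healthThreshold,omitempty"`

    // +kubebuilder:default=5
    FailureThreshold int `json:"failureThreshold,omitempty"`
}

// HealingServiceStatus mirrors .status and is written only by the controller.
type HealingServiceStatus struct {
    ObservedGeneration int64       `json:"observedGeneration,omitempty"`
    HealthyReplicas    int         `json:"healthyReplicas,omitempty"`
    LastHealTime       metav1.Time `json:"lastHealTime,omitempty"`
}

// +kubebuilder:object:root=true
// +kubebuilder:subresource:status

// HealingService is the Schema for the healingservices API.
type HealingService struct {
    metav1.TypeMeta   `json:",inline"`
    metav1.ObjectMeta `json:"metadata,omitempty"`

    Spec   HealingServiceSpec   `json:"spec,omitempty"`
    Status HealingServiceStatus `json:"status,omitempty"`
}

// +kubebuilder:object:root=true

// HealingServiceList contains a list of HealingService.
type HealingServiceList struct {
    metav1.TypeMeta `json:",inline"`
    metav1.ListMeta `json:"metadata,omitempty"`
    Items           []HealingService `json:"items"`
}

Running make generate and make manifests afterwards produces the DeepCopy methods and keeps the CRD YAML in sync with these types.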

Controller Manager Setup

Before implementing the reconciler, initialize the controller manager with production-ready configuration including leader election and metrics exposure.

package main

import (
    "flag"
    "os"

    "k8s.io/apimachinery/pkg/runtime"
    utilruntime "k8s.io/apimachinery/pkg/util/runtime"
    clientgoscheme "k8s.io/client-go/kubernetes/scheme"
    _ "k8s.io/client-go/plugin/pkg/client/auth/gcp"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/healthz"
    "sigs.k8s.io/controller-runtime/pkg/log/zap"
    "sigs.k8s.io/controller-runtime/pkg/metrics/server"

    examplev1 "example.com/healing-operator/api/v1"
    "example.com/healing-operator/internal/controller"
)

var (
    scheme   = runtime.NewScheme()
    setupLog = ctrl.Log.WithName("setup")
)

func init() {
    utilruntime.Must(clientgoscheme.AddToScheme(scheme))
    utilruntime.Must(examplev1.AddToScheme(scheme))
}

func main() {
    var metricsAddr string
    var enableLeaderElection bool
    var probeAddr string

    flag.StringVar(&metricsAddr, "metrics-bind-address", ":8080", "The address the metric endpoint binds to.")
    flag.StringVar(&probeAddr, "health-probe-bind-address", ":8081", "The address the probe endpoint binds to.")
    flag.BoolVar(&enableLeaderElection, "leader-elect", false,
        "Enable leader election for controller manager. "+
            "Enabling this will ensure there is only one active controller manager.")
    opts := zap.Options{
        Development: true,
    }
    opts.BindFlags(flag.CommandLine)
    flag.Parse()

    ctrl.SetLogger(zap.New(zap.UseFlagOptions(&opts)))

    mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
        Scheme:                 scheme,
        Metrics:                server.Options{BindAddress: metricsAddr},
        HealthProbeBindAddress: probeAddr,
        LeaderElection:         enableLeaderElection,
        LeaderElectionID:       "healing-operator.example.com",
    })
    if err != nil {
        setupLog.Error(err, "unable to start manager")
        os.Exit(1)
    }

    if err = (&controller.HealingServiceReconciler{
        Client:        mgr.GetClient(),
        Scheme:        mgr.GetScheme(),
        EventRecorder: mgr.GetEventRecorderFor("healingservice-controller"),
    }).SetupWithManager(mgr); err != nil {
        setupLog.Error(err, "unable to create controller", "controller", "HealingService")
        os.Exit(1)
    }

    if err := mgr.AddHealthzCheck("healthz", healthz.Ping); err != nil {
        setupLog.Error(err, "unable to set up health check")
        os.Exit(1)
    }
    if err := mgr.AddReadyzCheck("readyz", healthz.Ping); err != nil {
        setupLog.Error(err, "unable to set up ready check")
        os.Exit(1)
    }

    setupLog.Info("starting manager")
    if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
        setupLog.Error(err, "problem running manager")
        os.Exit(1)
    }
}

Leader election ensures only one controller instance runs in HA deployments, preventing duplicate reconciliation actions. The metrics server exposes Prometheus-compatible metrics at /metrics for observability.

The Reconcile Loop Logic

The reconciliation function implements the core self-healing mechanism: observe current state, compare with desired state, and take corrective actions.

package controller

import (
    "context"
    "fmt"
    "sort"
    "time"

    corev1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/api/errors"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/runtime"
    "k8s.io/client-go/tools/record"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/client"
    "sigs.k8s.io/controller-runtime/pkg/log"

    examplev1 "example.com/healing-operator/api/v1"
)

//+kubebuilder:rbac:groups=example.com,resources=healingservices,verbs=get;list;watch;create;update;patch;delete
//+kubebuilder:rbac:groups=example.com,resources=healingservices/status,verbs=get;update;patch
//+kubebuilder:rbac:groups=example.com,resources=healingservices/finalizers,verbs=update
//+kubebuilder:rbac:groups=core,resources=pods,verbs=get;list;watch;create;update;patch;delete
//+kubebuilder:rbac:groups=core,resources=events,verbs=create;patch

const requeueDelay = 5 * time.Second
const healDelay = 30 * time.Second
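
// HealingServiceReconciler reconciles HealingService objects. This is a minimal
// sketch; the embedded record.EventRecorder is an assumption that makes the
// r.Event(...) calls below work and is wired in main.go via mgr.GetEventRecorderFor.
type HealingServiceReconciler struct {
    client.Client
    Scheme *runtime.Scheme
    record.EventRecorder
}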

func (r *HealingServiceReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    log := log.FromContext(ctx)
    
    // 1. Fetch the HealingService instance
    service := &examplev1.HealingService{}
    if err := r.Get(ctx, req.NamespacedName, service); err != nil {
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }
    
    // 2. Observe current state - fetch managed pods
    podList := &corev1.PodList{}
    if err := r.List(ctx, podList, client.InNamespace(req.Namespace), 
        client.MatchingLabels{"app": service.Name}); err != nil {
        return ctrl.Result{}, err
    }
    
    // 3. Analyze health status
    healthyCount := 0
    for _, pod := range podList.Items {
        if isPodHealthy(&pod) {
            healthyCount++
        }
    }
    
    // 4. Scale to desired replica count (both up and down)
    desiredReplicas := service.Spec.ReplicaCount
    currentReplicas := len(podList.Items)
    
    if currentReplicas < desiredReplicas {
        podsNeeded := desiredReplicas - currentReplicas
        log.Info("Scaling up", "current", currentReplicas, "desired", desiredReplicas)
        
        for i := 0; i < podsNeeded; i++ {
            if err := r.createNewPod(ctx, service); err != nil {
                if !errors.IsAlreadyExists(err) {
                    log.Error(err, "Failed to create pod")
                    return ctrl.Result{RequeueAfter: requeueDelay}, err
                }
            }
        }
        return ctrl.Result{RequeueAfter: requeueDelay}, nil
    }
    
    if currentReplicas > desiredReplicas {
        podsToDelete := currentReplicas - desiredReplicas
        log.Info("Scaling down", "current", currentReplicas, "desired", desiredReplicas)
        
        // Separate unhealthy and healthy pods
        var unhealthyPods, healthyPods []corev1.Pod
        for _, pod := range podList.Items {
            if isPodHealthy(&pod) {
                healthyPods = append(healthyPods, pod)
            } else {
                unhealthyPods = append(unhealthyPods, pod)
            }
        }
        
        // Sort healthy pods newest first so long-running, stable pods survive the scale-down
        sort.Slice(healthyPods, func(i, j int) bool {
            return healthyPods[j].CreationTimestamp.Before(&healthyPods[i].CreationTimestamp)
        })
        
        // Delete unhealthy pods first, then the newest healthy pods if still needed
        deletedCount := 0
        for _, pod := range unhealthyPods {
            if deletedCount >= podsToDelete {
                break
            }
            if err := r.Delete(ctx, &pod); err != nil && !errors.IsNotFound(err) {
                log.Error(err, "Failed to delete pod", "pod", pod.Name)
                return ctrl.Result{RequeueAfter: requeueDelay}, err
            }
            deletedCount++
        }
        
        for _, pod := range healthyPods {
            if deletedCount >= podsToDelete {
                break
            }
            if err := r.Delete(ctx, &pod); err != nil && !errors.IsNotFound(err) {
                log.Error(err, "Failed to delete pod", "pod", pod.Name)
                return ctrl.Result{RequeueAfter: requeueDelay}, err
            }
            deletedCount++
        }
        
        return ctrl.Result{RequeueAfter: requeueDelay}, nil
    }
    
    // 5. Update status with conflict handling
    service.Status.HealthyReplicas = healthyCount
    service.Status.ObservedGeneration = service.Generation
    if err := r.Status().Update(ctx, service); err != nil {
        if errors.IsConflict(err) {
            log.Info("Conflict updating status, requeueing")
            return ctrl.Result{Requeue: true}, nil
        }
        return ctrl.Result{}, err
    }
    
    // 6. Act - trigger healing if below threshold
    if healthyCount < service.Spec.HealthThreshold {
        log.Info("Triggering self-healing", "healthy", healthyCount, "required", service.Spec.HealthThreshold)
        
        // Delete unhealthy pods for recreation
        deletedCount := 0
        for _, pod := range podList.Items {
            if !isPodHealthy(&pod) {
                if err := r.Delete(ctx, &pod); err != nil {
                    if !errors.IsNotFound(err) {
                        log.Error(err, "Failed to delete unhealthy pod", "pod", pod.Name)
                        return ctrl.Result{RequeueAfter: requeueDelay}, err
                    }
                } else {
                    deletedCount++
                }
            }
        }
        
        if deletedCount > 0 {
            r.Event(service, corev1.EventTypeNormal, "HealingTriggered", 
                fmt.Sprintf("Deleted %d unhealthy pods for recreation", deletedCount))
            
            service.Status.LastHealTime = metav1.Now()
            if err := r.Status().Update(ctx, service); err != nil {
                if errors.IsConflict(err) {
                    return ctrl.Result{Requeue: true}, nil
                }
                return ctrl.Result{}, err
            }
            
            return ctrl.Result{RequeueAfter: healDelay}, nil
        }
    }
    
    return ctrl.Result{}, nil
}

func isPodHealthy(pod *corev1.Pod) bool {
    if pod.Status.Phase != corev1.PodRunning {
        return false
    }
    
    // Check PodReady condition
    for _, condition := range pod.Status.Conditions {
        if condition.Type == corev1.PodReady && condition.Status != corev1.ConditionTrue {
            return false
        }
    }
    
    // Check container states for various failure conditions
    for _, containerStatus := range pod.Status.ContainerStatuses {
        state := containerStatus.State
        if state.Waiting != nil {
            reason := state.Waiting.Reason
            if reason == "CrashLoopBackOff" || 
               reason == "ImagePullBackOff" || 
               reason == "ErrImagePull" {
                return false
            }
        }
        if state.Terminated != nil && state.Terminated.ExitCode != 0 {
            return false
        }
    }
    
    return true
}

func (r *HealingServiceReconciler) createNewPod(ctx context.Context, service *examplev1.HealingService) error {
    pod := &corev1.Pod{
        ObjectMeta: metav1.ObjectMeta{
            GenerateName: fmt.Sprintf("%s-", service.Name),
            Namespace:    service.Namespace,
            Labels: map[string]string{
                "app": service.Name,
            },
        },
        Spec: corev1.PodSpec{
            Containers: []corev1.Container{
                {
                    Name:  "app",
                    Image: "nginx:latest", // placeholder image; a real operator would take this from the spec
                },
            },
        },
    }
    
    // Make the HealingService the controlling owner so pods are garbage-collected
    // with the CR and pod deletions trigger reconciliation via Owns(&corev1.Pod{}).
    if err := ctrl.SetControllerReference(service, pod, r.Scheme); err != nil {
        return err
    }
    
    return r.Create(ctx, pod)
}

func (r *HealingServiceReconciler) SetupWithManager(mgr ctrl.Manager) error {
    return ctrl.NewControllerManagedBy(mgr).
        For(&examplev1.HealingService{}).
        Owns(&corev1.Pod{}).
        Complete(r)
}

The idempotency requirement means the reconciler can run any number of times without side effects: given the same observed state, every execution converges on the same result. RBAC markers grant the controller the permissions it needs to manage pods, emit events, and update CR status. The isPodHealthy() function inspects pod phase, readiness conditions, and container states to detect CrashLoopBackOff, ImagePullBackOff, ErrImagePull, and failed terminations. Scale-down deletes unhealthy pods first, then the newest healthy pods, so long-running pods are disrupted last. Conflict handling prevents status updates from failing hard during concurrent modifications.
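
A table-driven unit test pins down the health semantics; a minimal sketch exercising only the conditions isPodHealthy() already checks:

package controller

import (
    "testing"

    corev1 "k8s.io/api/core/v1"
)

func TestIsPodHealthy(t *testing.T) {
    cases := []struct {
        name string
        pod  corev1.Pod
        want bool
    }{
        {
            name: "running and ready",
            pod: corev1.Pod{Status: corev1.PodStatus{
                Phase:      corev1.PodRunning,
                Conditions: []corev1.PodCondition{{Type: corev1.PodReady, Status: corev1.ConditionTrue}},
            }},
            want: true,
        },
        {
            name: "crash looping container",
            pod: corev1.Pod{Status: corev1.PodStatus{
                Phase:      corev1.PodRunning,
                Conditions: []corev1.PodCondition{{Type: corev1.PodReady, Status: corev1.ConditionFalse}},
                ContainerStatuses: []corev1.ContainerStatus{{
                    State: corev1.ContainerState{Waiting: &corev1.ContainerStateWaiting{Reason: "CrashLoopBackOff"}},
                }},
            }},
            want: false,
        },
        {
            name: "pending pod",
            pod:  corev1.Pod{Status: corev1.PodStatus{Phase: corev1.PodPending}},
            want: false,
        },
    }

    for _, tc := range cases {
        if got := isPodHealthy(&tc.pod); got != tc.want {
            t.Errorf("%s: isPodHealthy() = %v, want %v", tc.name, got, tc.want)
        }
    }
}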

Validation Webhook

Validation webhooks enforce constraints on CR creation and updates, preventing invalid configurations such as negative replica counts. Returning a non-nil error from the validator causes the API server to reject the request.

package webhook

import (
    "context"
    "fmt"

    apierrors "k8s.io/apimachinery/pkg/api/errors"
    "k8s.io/apimachinery/pkg/runtime"
    "k8s.io/apimachinery/pkg/runtime/schema"
    "k8s.io/apimachinery/pkg/util/validation/field"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/webhook/admission"

    examplev1 "example.com/healing-operator/api/v1"
)

// HealingServiceCustomValidator validates HealingService objects on admission.
type HealingServiceCustomValidator struct{}

var _ admission.CustomValidator = &HealingServiceCustomValidator{}

// SetupHealingServiceWebhookWithManager registers the validating webhook with the manager.
func SetupHealingServiceWebhookWithManager(mgr ctrl.Manager) error {
    return ctrl.NewWebhookManagedBy(mgr).
        For(&examplev1.HealingService{}).
        WithValidator(&HealingServiceCustomValidator{}).
        Complete()
}

func (v *HealingServiceCustomValidator) ValidateCreate(ctx context.Context, obj runtime.Object) (admission.Warnings, error) {
    service, ok := obj.(*examplev1.HealingService)
    if !ok {
        return nil, fmt.Errorf("expected a HealingService object but got %T", obj)
    }
    
    var allErrs field.ErrorList
    specPath := field.NewPath("spec")
    
    if service.Spec.ReplicaCount <= 0 {
        allErrs = append(allErrs, field.Invalid(
            specPath.Child("replicaCount"),
            service.Spec.ReplicaCount,
            "replicaCount must be greater than 0",
        ))
    }
    
    if service.Spec.HealthThreshold < 1 {
        allErrs = append(allErrs, field.Invalid(
            specPath.Child("healthThreshold"),
            service.Spec.HealthThreshold,
            "healthThreshold must be at least 1",
        ))
    }
    
    if service.Spec.HealthThreshold > service.Spec.ReplicaCount {
        allErrs = append(allErrs, field.Invalid(
            specPath.Child("healthThreshold"),
            service.Spec.HealthThreshold,
            "healthThreshold cannot exceed replicaCount",
        ))
    }
    
    if len(allErrs) > 0 {
        return nil, apierrors.NewInvalid(
            schema.GroupKind{Group: "example.com", Kind: "HealingService"},
            service.Name, allErrs)
    }
    
    return nil, nil
}

func (v *HealingServiceCustomValidator) ValidateUpdate(ctx context.Context, oldObj, newObj runtime.Object) (admission.Warnings, error) {
    return v.ValidateCreate(ctx, newObj)
}

func (v *HealingServiceCustomValidator) ValidateDelete(ctx context.Context, obj runtime.Object) (admission.Warnings, error) {
    return nil, nil
}

//+kubebuilder:webhook:path=/validate-example-com-v1-healingservice,mutating=false,failurePolicy=fail,sideEffects=None,groups=example.com,resources=healingservices,verbs=create;update,versions=v1,name=vhealingservice.kb.io,admissionReviewVersions=v1

The validator returns admission.Warnings and an error. A non-nil error causes the API server to deny the admission request, so invalid configurations are never persisted. The field.ErrorList provides structured, per-field error messages that surface directly in kubectl apply output.
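
The validator still needs to be registered with the manager. A sketch of the wiring in main.go, assuming the SetupHealingServiceWebhookWithManager helper above and the conventional ENABLE_WEBHOOKS escape hatch for local runs without certificates:

// In main.go, after the reconciler has been set up
// (requires importing "example.com/healing-operator/internal/webhook").
if os.Getenv("ENABLE_WEBHOOKS") != "false" {
    if err := webhook.SetupHealingServiceWebhookWithManager(mgr); err != nil {
        setupLog.Error(err, "unable to create webhook", "webhook", "HealingService")
        os.Exit(1)
    }
}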

Advanced Self-Healing Patterns

Beyond simple pod recreation, sophisticated healing strategies require additional Kubernetes primitives.

Finalizers for Clean-up

Finalizers block deletion until all associated resources are properly cleaned up, preventing orphaned resources during healing cycles.

metadata:
  finalizers:
    - example.com/cleanup-protection
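
In the controller, the finalizer is managed with the controllerutil helpers at the start of Reconcile. A minimal sketch, assuming a hypothetical cleanupExternalResources helper that tears down anything pod deletion alone would not reclaim:

// Inside Reconcile, after fetching the HealingService
// (requires "sigs.k8s.io/controller-runtime/pkg/controller/controllerutil").
const cleanupFinalizer = "example.com/cleanup-protection"

if service.DeletionTimestamp.IsZero() {
    // Object is live: make sure the finalizer is present.
    if !controllerutil.ContainsFinalizer(service, cleanupFinalizer) {
        controllerutil.AddFinalizer(service, cleanupFinalizer)
        if err := r.Update(ctx, service); err != nil {
            return ctrl.Result{}, err
        }
    }
} else {
    // Object is being deleted: clean up first, then release the finalizer.
    if controllerutil.ContainsFinalizer(service, cleanupFinalizer) {
        if err := r.cleanupExternalResources(ctx, service); err != nil { // hypothetical helper
            return ctrl.Result{}, err
        }
        controllerutil.RemoveFinalizer(service, cleanupFinalizer)
        if err := r.Update(ctx, service); err != nil {
            return ctrl.Result{}, err
        }
    }
    return ctrl.Result{}, nil // nothing more to do for a deleted object
}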

Event Recording

Publish Kubernetes events to provide audit trails for healing actions.

r.Event(service, corev1.EventTypeNormal, "HealingTriggered", 
    fmt.Sprintf("Recreated %d unhealthy pods", len(unhealthyPods)))

Rate Limiting and Requeue Strategies

controller-runtime rate-limits reconciliation through its workqueue. To tune the backoff, replace the earlier SetupWithManager with a version that passes a custom rate limiter through controller.Options:

import (
    "time"

    corev1 "k8s.io/api/core/v1"
    "k8s.io/client-go/util/workqueue"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/controller"

    examplev1 "example.com/healing-operator/api/v1"
)

func (r *HealingServiceReconciler) SetupWithManager(mgr ctrl.Manager) error {
    return ctrl.NewControllerManagedBy(mgr).
        For(&examplev1.HealingService{}).
        Owns(&corev1.Pod{}).
        WithOptions(controller.Options{
            // Failed items back off exponentially from 5s up to 5m; the fast/slow
            // limiter allows 5 quick retries before falling back to 60s intervals.
            RateLimiter: workqueue.NewMaxOfRateLimiter(
                workqueue.NewItemExponentialFailureRateLimiter(5*time.Second, 300*time.Second),
                workqueue.NewItemFastSlowRateLimiter(5*time.Second, 60*time.Second, 5),
            ),
        }).
        Complete(r)
}
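
On recent controller-runtime releases the workqueue rate limiters are generic, so the typed constructors (workqueue.NewTypedMaxOfRateLimiter, workqueue.NewTypedItemExponentialFailureRateLimiter[reconcile.Request], and so on) are required instead of the untyped ones shown above; the shape of the configuration is otherwise the same.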

Metrics and Observability

Controller-runtime exposes Prometheus metrics automatically. Key metrics include:

  • controller_runtime_reconcile_total: Total reconciliation attempts
  • controller_runtime_reconcile_errors_total: Failed reconciliations
  • workqueue_depth: Current depth of the reconcile queue (labelled per controller)
  • workqueue_retries_total: Retry attempts

Access metrics at http://<pod-ip>:8080/metrics when deployed.
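
Custom metrics can sit alongside the built-in ones by registering collectors with controller-runtime's shared registry. A minimal sketch, using a made-up healingservice_healing_actions_total counter in a small metrics.go next to the controller:

package controller

import (
    "github.com/prometheus/client_golang/prometheus"
    "sigs.k8s.io/controller-runtime/pkg/metrics"
)

// healingActionsTotal counts self-healing actions, labelled by namespace.
var healingActionsTotal = prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Name: "healingservice_healing_actions_total",
        Help: "Number of self-healing actions taken by the operator.",
    },
    []string{"namespace"},
)

func init() {
    // Registering with the controller-runtime registry exposes the counter on /metrics.
    metrics.Registry.MustRegister(healingActionsTotal)
}

// In Reconcile, after unhealthy pods have been deleted:
//     healingActionsTotal.WithLabelValues(req.Namespace).Add(float64(deletedCount))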

Getting Started

  1. Install Kubebuilder: curl -L -o kubebuilder "https://go.kubebuilder.io/dl/latest/$(go env GOOS)/$(go env GOARCH)" && chmod +x kubebuilder && sudo mv kubebuilder /usr/local/bin/
  2. Initialize project: kubebuilder init --domain example.com --repo example.com/healing-operator
  3. Create API: kubebuilder create api --group example --version v1 --kind HealingService (kubebuilder joins group and domain into <group>.<domain>, so adjust them to match the example.com group used in the manifests above)
  4. Create webhook: kubebuilder create webhook --group example --version v1 --kind HealingService --defaulting --programmatic-validation
  5. Implement Reconcile logic in internal/controller/healingservice_controller.go
  6. Implement webhook validation in internal/webhook/healingservice_webhook.go
  7. Add RBAC markers above the Reconcile function
  8. Run locally: make run or deploy with make install && make deploy
  9. Enable leader election in production by adding the --leader-elect=true flag to the manager container args (config/manager/manager.yaml)
  10. Create test instance: kubectl apply -f config/samples/example_v1_healingservice.yaml
