
Planning, Building, and Testing a Kubernetes Operator

Kubernetes Operators are one of those ideas that feel magical when they work: you declare intent in YAML, and software continuously makes the cluster match it—handling upgrades, failures, drift, and lifecycle cleanup, like a purpose-built SRE on autopilot.

Although in theory this sounds like science fiction, in practice Operators are just code written by someone who leverages Kubernetes’ extensibility features: they take the desired and current state of the cluster and run a control loop to reconcile the two.

What an operator really is

At a high level, an operator is:

  • A Custom Resource Definition (CRD): your API type (e.g., PostgresCluster, Cache, Tenant)
  • A controller: a reconciliation loop that
    • watches your Custom Resources
    • compares desired state (spec) vs actual state (cluster)
    • creates/updates/deletes Kubernetes objects to close the gap

This is “reconciliation”: controllers run repeatedly until the world matches the declared desired state. The key insight is that operators usually reconcile one level higher than built-in Kubernetes controllers (your operator might create a Deployment, and then Kubernetes’ own deployment controller creates ReplicaSets and Pods).

A mental model that helps:

  • spec: what the user wants
  • status: what’s actually happening (observed state)

If you get this separation right early, everything else gets easier: upgrades, debugging, user trust, and testing.

Tooling and language: what should you choose?

The default stack: Go + Kubebuilder (or Operator SDK)

If you’re building a “classic” Kubernetes operator, the most common and best-supported path is:

  • Go
  • Kubebuilder for scaffolding and project conventions
  • controller-runtime for controller primitives (manager, reconcile loop helpers, cache, client, etc.)

If you want extra packaging workflows—especially around Operator Lifecycle Manager bundles (or OLM bundles, a Kubernetes extension that helps install, upgrade, and manage operators in a cluster)—use Operator SDK. For Go projects, Operator SDK uses Kubebuilder under the hood and shares the same basic layout and controller-runtime foundation.

When other languages are reasonable

You can write operators in other ecosystems, and sometimes you should (team skillset, rapid prototyping, etc.). Still, Go is the safest bet: examples are abundant, and the libraries and tooling are by far the most mature.

Plan first: define your API and lifecycle before writing code

Before you scaffold anything, answer these questions:

1) What is the “source of truth” Custom Resource?

Pick a single primary CR kind as the main entrypoint:

  • Website
  • Database
  • Topic
  • BackupJob

A common beginner mistake is trying to make the operator “watch everything” and infer intent. Don’t. Make the CR the user contract.

2) What child resources will it own?

List the Kubernetes resources the operator will create/maintain, like:

  • Deployment / StatefulSet
  • ConfigMap / Secret
  • Ingress
  • PodDisruptionBudget
  • RBAC objects

This list informs your RBAC, your testing, and your reconcile design.

3) What does “ready” mean?

Define your Status Conditions strategy early:

  • When do you set Available=True?
  • What conditions represent progress vs failure?
  • What fields help users debug (URLs, observed generation, last reconcile time)?

4) How will deletion work?

Last but not least, decide your cleanup model:

  • Let Kubernetes garbage-collect via ownerReferences (ideal when possible)
  • Use a finalizer for external cleanup (cloud resources, DNS records, external DBs)

Deciding this early pays off later in test assertions and user experience.
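
As a rough sketch of the finalizer path (assuming a hypothetical Website CR, controller-runtime’s controllerutil helpers, and a cleanupExternal helper of your own; imports and surrounding scaffolding elided), the flow inside Reconcile usually looks like this:

    // Runs inside Reconcile, after fetching the CR into `site`.
    const websiteFinalizer = "web.example.com/finalizer" // illustrative name

    if site.DeletionTimestamp.IsZero() {
        // CR is alive: make sure our finalizer is present so deletion waits for us.
        if controllerutil.AddFinalizer(&site, websiteFinalizer) {
            if err := r.Update(ctx, &site); err != nil {
                return ctrl.Result{}, err
            }
        }
    } else if controllerutil.ContainsFinalizer(&site, websiteFinalizer) {
        // CR is being deleted: clean up external resources (DNS, cloud DBs, ...),
        // then remove the finalizer so Kubernetes can finish the deletion.
        if err := r.cleanupExternal(ctx, &site); err != nil { // hypothetical helper
            return ctrl.Result{}, err // requeue and retry the cleanup
        }
        controllerutil.RemoveFinalizer(&site, websiteFinalizer)
        if err := r.Update(ctx, &site); err != nil {
            return ctrl.Result{}, err
        }
        return ctrl.Result{}, nil
    }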

The operator’s main structure (project layout + moving parts)

Kubebuilder organizes your project in a fixed structure. The names may evolve across versions, but conceptually you’ll have:

  • API types (Go structs that become CRDs)
  • controllers (Reconcilers)
  • manager entrypoint (wires schemes + controllers + health/metrics)
  • config (generated manifests: CRDs, RBAC, manager deployment, samples)

The runtime wiring (how requests become reconciles)

Under the hood, controllers:

  • watch resources through informers and watchers (Kubernetes client components that efficiently watch resources, keep a local cache, and notify your code when something changes — without constantly polling the API server)
  • enqueue reconcile requests
  • run reconcile workers to converge state

You don’t need to implement that plumbing, but you do need to design your reconcile logic to be:

  • idempotent (safe to run repeatedly)
  • level-based (compute desired vs actual, don’t rely on “event meaning”)
  • retry-friendly (errors requeue)
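
In controller-runtime terms, those properties map directly onto what Reconcile returns. A sketch of the common return patterns (converge and waitingOnExternal are illustrative names, not library API):

    // Returning an error requeues the same request with exponential backoff.
    if err := r.converge(ctx, &site); err != nil {
        return ctrl.Result{}, err
    }
    // Requeue after a fixed delay, e.g. while polling an external system.
    if waitingOnExternal {
        return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
    }
    // Fully converged: stay idle until a watch event enqueues us again.
    return ctrl.Result{}, nil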

Build it: implementation workflow

Define Spec and Status

This is where you should slow down and design:

Spec (the user’s contract)

Fields that capture intent from users, e.g.:

  • image/version
  • replicas
  • config options
  • references to Secrets/ConfigMaps
  • lifecycle toggles (suspend, paused, etc.)

Status

  • conditions: Available, Progressing, Degraded
  • observed generation
  • endpoints / useful outputs

A good heuristic: if an SRE would ask for it at 3am, consider putting it in status.
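
To make that concrete, here is a minimal sketch of what those types might look like, assuming a hypothetical Website API scaffolded with Kubebuilder:

    package v1alpha1

    import (
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    // WebsiteSpec captures user intent: everything here is desired state.
    type WebsiteSpec struct {
        // Image is the container image to run.
        Image string `json:"image"`
        // Replicas is the desired number of pods.
        // +optional
        Replicas *int32 `json:"replicas,omitempty"`
        // Suspend pauses reconciliation when true.
        // +optional
        Suspend bool `json:"suspend,omitempty"`
    }

    // WebsiteStatus reports observed state: what the controller actually saw.
    type WebsiteStatus struct {
        // Conditions follow the standard Kubernetes conventions
        // (Available, Progressing, Degraded, ...).
        Conditions []metav1.Condition `json:"conditions,omitempty"`
        // ObservedGeneration is the spec generation the controller last acted on.
        ObservedGeneration int64 `json:"observedGeneration,omitempty"`
        // URL is a useful output for users (the serving endpoint).
        URL string `json:"url,omitempty"`
    }

    // +kubebuilder:object:root=true
    // +kubebuilder:subresource:status

    // Website is the Schema for the websites API.
    type Website struct {
        metav1.TypeMeta   `json:",inline"`
        metav1.ObjectMeta `json:"metadata,omitempty"`

        Spec   WebsiteSpec   `json:"spec,omitempty"`
        Status WebsiteStatus `json:"status,omitempty"`
    }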

Implement reconcile logic with a predictable recipe

Kubebuilder’s controller style is basically: load the CR, observe cluster state, take actions, update status.

A practical reconcile “recipe”:

1. Fetch the primary CR
2. Default/validate any runtime decisions
3. Gather the facts:
   • Compute desired child objects
   • Fetch actual children
4. Create/patch/delete to converge
5. Update status/conditions
6. Return result (or requeue on errors)

The important point isn’t the exact code—it’s the shape:

  • reconcile is a pure “converge state” function
  • it can run again at any time without breaking things
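
Here is a minimal sketch of that shape, continuing with the hypothetical Website type (desiredDeployment stands in for your own builder function; imports such as ctrl, client, apierrors, and appsv1 are elided):

    func (r *WebsiteReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
        // 1. Fetch the primary CR; it may already be gone if we raced a delete.
        var site webv1alpha1.Website
        if err := r.Get(ctx, req.NamespacedName, &site); err != nil {
            return ctrl.Result{}, client.IgnoreNotFound(err)
        }

        // 3. Gather the facts: desired child vs actual child.
        desired := r.desiredDeployment(&site) // pure builder, easy to unit-test
        var actual appsv1.Deployment
        err := r.Get(ctx, client.ObjectKeyFromObject(desired), &actual)
        switch {
        case apierrors.IsNotFound(err):
            // 4. Converge: the child doesn't exist yet, create it.
            if err := r.Create(ctx, desired); err != nil {
                return ctrl.Result{}, err
            }
        case err != nil:
            return ctrl.Result{}, err
        default:
            // 4. Converge: patch the existing child toward the desired spec.
            actual.Spec = desired.Spec
            if err := r.Update(ctx, &actual); err != nil {
                return ctrl.Result{}, err
            }
        }

        // 5. Update status so users can see what happened.
        site.Status.ObservedGeneration = site.Generation
        if err := r.Status().Update(ctx, &site); err != nil {
            return ctrl.Result{}, err
        }

        // 6. Converged; watches will trigger the next run.
        return ctrl.Result{}, nil
    }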

Testing and CI: the operator testing pyramid that actually works

Operator testing gets dramatically easier when you split it into three layers:

1) Unit tests: fast, no clusters required

Test pure functions and deterministic logic:

  • object builders (Deployment/Service constructors)
  • label/annotation conventions
  • config parsing/merging
  • condition transitions

These should run in milliseconds and cover most branches.
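
For instance, a builder test can pin down your label conventions (desiredDeployment and webv1alpha1 are the same hypothetical helpers as in the reconcile sketch; imports elided):

    func TestDesiredDeploymentConventions(t *testing.T) {
        site := &webv1alpha1.Website{
            ObjectMeta: metav1.ObjectMeta{Name: "demo", Namespace: "default"},
            Spec:       webv1alpha1.WebsiteSpec{Image: "nginx:1.27"},
        }
        r := &WebsiteReconciler{}

        dep := r.desiredDeployment(site)

        // Conventions other code relies on (selectors, watches) get pinned here.
        if got := dep.Labels["app.kubernetes.io/name"]; got != "demo" {
            t.Errorf("name label: want %q, got %q", "demo", got)
        }
        if got := dep.Spec.Template.Spec.Containers[0].Image; got != "nginx:1.27" {
            t.Errorf("image: want %q, got %q", "nginx:1.27", got)
        }
    }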

2) Integration tests with envtest: real API server semantics (but still no cluster)

envtest is another powerful tool up our sleeve: it starts a local control plane (API server + etcd) so your controller talks to a real Kubernetes API, without spinning up a full cluster. This is the sweet spot for most controller tests.

What envtest is great for:

  • “when I create a CR, the controller creates/patches expected children”
  • status updates and Conditions
  • finalizers and delete flows
  • webhook validation/defaulting behavior
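
A minimal setup sketch, assuming Kubebuilder’s generated layout and control-plane binaries installed via setup-envtest:

    import (
        "os"
        "path/filepath"
        "testing"

        "k8s.io/apimachinery/pkg/runtime"
        "sigs.k8s.io/controller-runtime/pkg/client"
        "sigs.k8s.io/controller-runtime/pkg/envtest"
    )

    var (
        testEnv   *envtest.Environment
        k8sClient client.Client
        scheme    = runtime.NewScheme() // register your API group here (AddToScheme)
    )

    func TestMain(m *testing.M) {
        // Point envtest at the generated CRDs; it boots a real API server + etcd.
        testEnv = &envtest.Environment{
            CRDDirectoryPaths: []string{filepath.Join("..", "config", "crd", "bases")},
        }
        cfg, err := testEnv.Start()
        if err != nil {
            panic(err)
        }
        k8sClient, err = client.New(cfg, client.Options{Scheme: scheme})
        if err != nil {
            panic(err)
        }
        code := m.Run()
        _ = testEnv.Stop()
        os.Exit(code)
    }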

3) End-to-end tests with kind: catches packaging + RBAC + deployment reality

kind runs a real Kubernetes cluster in Docker containers. It’s lightweight enough for CI and is widely used for “real cluster” validation.

Use kind E2E tests to catch issues envtest won’t:

  • wrong RBAC permissions
  • broken manager deployment YAML
  • image build/pull issues
  • leader election / webhook cert wiring mistakes

Continuous Delivery for Kubernetes Operators

CI gets your operator correct.
CD keeps it safe, upgradeable, and operable over time.

Think of operator CD as three deliverables, not one:

  • Controller image – the binary that runs reconcile logic
  • Manifests / APIs – CRDs, RBAC, webhooks
  • Distribution artifact – how users install/upgrade (plain YAML, Helm, or OLM bundle)

Your CD pipeline should explicitly manage all three.

1) Versioning strategy

Before automating CD, think deeply and decide how versions will work.

Controller version

  • Semantic version (v1.2.3)
  • Tied to Git tags
  • Embedded into:
    • container image tag
    • bundle metadata

CRD versioning (critical)

CRDs are APIs. Breaking them breaks users!

2) Upgrade safety

Operator CD must answer one question: “What happens to existing clusters when this version rolls out?”

Key things your CD process should validate:

a) CRD compatibility

  • New controller must handle old CRs
  • CRD changes must be backward compatible

b) Reconcile backward compatibility

Reconcile logic must tolerate:

  • missing fields
  • defaulted fields
  • older status shapes

c) Rolling upgrades

  • Controller restarts should be safe
  • Leader election prevents double reconcile
  • No reliance on in-memory-only state
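
Leader election in particular is mostly configuration on the controller-runtime manager. A sketch from a hypothetical cmd/main.go (the ID is arbitrary but must be stable and unique per operator; setupLog is the usual Kubebuilder scaffold logger):

    mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
        Scheme:           scheme, // your runtime.Scheme with API types registered
        LeaderElection:   true,
        LeaderElectionID: "website-operator.web.example.com",
    })
    if err != nil {
        setupLog.Error(err, "unable to start manager")
        os.Exit(1)
    }
    // During a rolling upgrade, old and new replicas briefly overlap;
    // leader election guarantees only one of them reconciles at a time.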

Testing what happens in the wild can be really tricky! A practical CD test outline could be something like:

  • install v1
  • create CRs
  • upgrade to v2
  • verify resources still converge

This is where a project graduates from a simple controller to a fully fledged operator.

Common operator design choices

Prefer “create/patch” over “delete/recreate”

Recreating child resources on every change causes unnecessary disruption. Patch existing resources where possible.
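
controller-runtime’s controllerutil.CreateOrUpdate captures this pattern nicely. A sketch from inside Reconcile, reusing the hypothetical Website CR (labels and names are illustrative; imports elided):

    dep := &appsv1.Deployment{
        ObjectMeta: metav1.ObjectMeta{Name: site.Name, Namespace: site.Namespace},
    }
    op, err := controllerutil.CreateOrUpdate(ctx, r.Client, dep, func() error {
        // Mutate dep in place toward the desired state; the helper then decides
        // whether it needs a create, an update, or nothing at all.
        labels := map[string]string{"app.kubernetes.io/name": site.Name}
        dep.Spec.Selector = &metav1.LabelSelector{MatchLabels: labels}
        dep.Spec.Template.Labels = labels
        dep.Spec.Replicas = site.Spec.Replicas
        dep.Spec.Template.Spec.Containers = []corev1.Container{{
            Name:  "web",
            Image: site.Spec.Image,
        }}
        // Owner reference: lets Kubernetes garbage-collect the Deployment
        // when the owning Website is deleted (see the next point).
        return controllerutil.SetControllerReference(&site, dep, r.Scheme)
    })
    if err != nil {
        return ctrl.Result{}, err
    }
    log.FromContext(ctx).Info("reconciled deployment", "operation", op)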

Own what you create

Use owner references so Kubernetes can garbage-collect dependents automatically when the CR is deleted.

Be intentional about watches

Start with:

  • watch the primary CR
  • watch the resources you own (Deployment/StatefulSet/etc.)

Then add more only when you have a reason.
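
With the controller-runtime builder, that starting point is one line per watch. A sketch:

    func (r *WebsiteReconciler) SetupWithManager(mgr ctrl.Manager) error {
        return ctrl.NewControllerManagedBy(mgr).
            For(&webv1alpha1.Website{}). // the primary CR
            Owns(&appsv1.Deployment{}).  // children we create, matched via owner refs
            Complete(r)
    }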

Make status useful (and test it)

Status is your UX:

  • set Conditions
  • include helpful messages
  • record observed generation so users can see whether reconciliation has caught up
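
A sketch of that status step, using the condition helpers from k8s.io/apimachinery/pkg/api/meta (the reason and message strings are illustrative):

    site.Status.ObservedGeneration = site.Generation
    meta.SetStatusCondition(&site.Status.Conditions, metav1.Condition{
        Type:    "Available",
        Status:  metav1.ConditionTrue,
        Reason:  "DeploymentReady",
        Message: "all replicas are serving traffic",
    })
    if err := r.Status().Update(ctx, &site); err != nil {
        return ctrl.Result{}, err
    }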

And remember, in the end an operator is just a “while true” loop with discipline, in a Pod.

Author

Luigi Miazzo

Software Developer - IT System & Service Management Solutions at Würth IT Italy