Kubernetes Operators are one of those ideas that feel magical when they work: you declare intent in YAML, and software continuously makes the cluster match it—handling upgrades, failures, drift, and lifecycle cleanup, like a purpose-built SRE on autopilot.
Although it can sound like science fiction, in practice Operators are just code that leverages Kubernetes’ extensibility features: they take the desired and current state of the cluster and run a control loop to reconcile the two.
At a high level, an operator is:

- a CustomResourceDefinition (CRD) that adds a new kind to the API (e.g. PostgresCluster, Cache, Tenant)
- a controller that watches those resources and drives the cluster toward the declared state

This is “reconciliation”: controllers run repeatedly until the world matches the declared desired state. The key insight is that operators usually reconcile one level higher than built-in Kubernetes controllers (your operator might create a Deployment, and then Kubernetes’ own deployment controller creates ReplicaSets and Pods).
Every custom resource is split in two:

- spec — what the user wants
- status — what’s actually happening (observed state)

If you get this separation right early, everything else gets easier: upgrades, debugging, user trust, and testing.
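As a concrete example, a CR for a hypothetical Website kind might look like this — the group, kind, and field names are illustrative. The user writes spec; only the controller writes status:

```yaml
apiVersion: example.com/v1alpha1   # hypothetical group/version
kind: Website
metadata:
  name: my-site
spec:                              # intent: written by the user
  image: nginx:1.27
  replicas: 3
status:                            # observation: written only by the controller
  readyReplicas: 3
  conditions:
    - type: Available
      status: "True"
      reason: DeploymentReady
```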
If you’re building a “classic” Kubernetes operator, the most common and best-supported path is Kubebuilder.
If you want extra packaging workflows—especially around bundles for the Operator Lifecycle Manager (OLM, a Kubernetes extension that helps install, upgrade, and manage operators in a cluster)—use Operator SDK. For Go projects, Operator SDK uses Kubebuilder under the hood and shares the same basic layout and controller-runtime foundation.
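With Kubebuilder installed, scaffolding looks roughly like this (the domain, module path, group, and kind below are placeholders):

```shell
# Scaffold a new operator project (module path is a placeholder)
kubebuilder init --domain example.com --repo github.com/example/website-operator

# Add an API: generates the CRD types and a controller skeleton
kubebuilder create api --group apps --version v1alpha1 --kind Website
```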
You can write operators in other ecosystems, and sometimes you should (team skill set, rapid prototyping, etc.). But keep in mind that Go is the safest bet: examples are abundant, and the libraries and tooling are very mature.
Before you scaffold anything, answer these questions:
Pick a single primary CR kind as the main entrypoint:

- Website
- Database
- Topic
- BackupJob

A common beginner mistake is trying to make the operator “watch everything” and infer intent. Don’t. Make the CR the user contract.
List the Kubernetes resources the operator will create/maintain, like:

- Deployment / StatefulSet
- ConfigMap / Secret
- Ingress
- PodDisruptionBudget

This list informs your RBAC, your testing, and your reconcile design.
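In a Kubebuilder project, that list typically turns into RBAC markers above the reconciler; `make manifests` generates the Role/ClusterRole from them. The resource list below mirrors the bullets above (a sketch, not a complete policy):

```go
package controllers

// Kubebuilder RBAC markers: `make manifests` turns these into RBAC rules.
//+kubebuilder:rbac:groups=apps,resources=deployments;statefulsets,verbs=get;list;watch;create;update;patch;delete
//+kubebuilder:rbac:groups="",resources=configmaps;secrets,verbs=get;list;watch;create;update;patch;delete
//+kubebuilder:rbac:groups=networking.k8s.io,resources=ingresses,verbs=get;list;watch;create;update;patch;delete
//+kubebuilder:rbac:groups=policy,resources=poddisruptionbudgets,verbs=get;list;watch;create;update;patch;delete
```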
Define your Status Conditions strategy early: what exactly does it mean for your CR to report Available=True?

And last but not least, decide your cleanup model:

- ownerReferences (ideal when possible)
- finalizers, for anything Kubernetes garbage collection can’t see (e.g. external resources)

This pays off in test assertions and user experience.
Kubebuilder organizes your project in a fixed structure. The names may evolve across versions, but conceptually you’ll have:

- API types: the Go structs behind your CRD (e.g. under api/)
- controllers: the reconcilers (e.g. under internal/controller/)
- config: generated manifests (CRDs, RBAC, samples)
- an entrypoint that wires everything into a manager (cmd/main.go)
Under the hood, controllers:

- watch the resources you care about
- enqueue keys into a rate-limited workqueue
- call your Reconcile function, retrying with backoff on errors

You don’t need to implement that plumbing, but you do need to design your reconcile logic to be:

- idempotent: safe to run many times for the same object
- convergent: each run moves the cluster closer to spec
- tolerant of stale caches and partial failures
This is where you should slow down and design:

- spec: fields that capture intent from users, e.g. operational switches (suspend, paused, etc.)
- status: observed state, typically conditions such as Available, Progressing, Degraded

A good heuristic: if an SRE would ask for it at 3am, consider putting it in status.
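For instance, a status block using standard condition types might look like this (field names follow Kubernetes’ metav1.Condition; the values are illustrative):

```yaml
status:
  observedGeneration: 4
  conditions:
    - type: Available
      status: "True"
      reason: WorkloadReady
      message: "3/3 replicas ready"
      lastTransitionTime: "2024-01-01T00:00:00Z"
    - type: Progressing
      status: "False"
      reason: ReconcileComplete
```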
Kubebuilder’s controller style is basically: load the CR, observe cluster state, take actions, update status.
A practical reconcile “recipe”:

1. Fetch the CR; if it no longer exists, return.
2. If it’s being deleted, run cleanup and return.
3. Observe the current state of the resources you own.
4. Compute the desired children from spec.
5. Create or patch children to close the gap.
6. Update status with what you observed.
7. Return, requeueing if more work remains.

The important point isn’t the exact code—it’s the shape: observe, compare, act, record.
Operator testing gets dramatically easier when you split it into three layers:
Test pure functions and deterministic logic: name and label derivation, spec-to-child-resource mapping, condition computation.
These should run in milliseconds and cover most branches.
envtest is another powerful tool up your sleeve: it starts a local control plane (API server + etcd) so your controller talks to a real Kubernetes API, without spinning up a full cluster. This is the sweet spot for most controller tests.
What envtest is great for:

- exercising reconcile logic against a real API server
- CRD schema validation and defaulting
- verifying status updates, finalizers, and ownerReferences
kind runs a real Kubernetes cluster in Docker containers. It’s lightweight enough for CI and is widely used for “real cluster” validation.
Use kind E2E tests to catch issues envtest won’t:

- RBAC: your controller runs with its real ServiceAccount, not admin credentials
- image build and deployment manifests
- interactions with built-in controllers (e.g. a Deployment actually producing Pods)
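A minimal CI sketch of such an E2E job, in GitHub Actions syntax — the action versions, image tag, resource names, and sample paths are assumptions; the `docker-build` and `deploy` targets are the ones Kubebuilder’s Makefile ships:

```yaml
# Sketch of a CI job running E2E tests on a kind cluster.
e2e:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - uses: helm/kind-action@v1            # spins up a kind cluster
    - run: make docker-build IMG=example.com/operator:e2e
    - run: kind load docker-image example.com/operator:e2e
    - run: make deploy IMG=example.com/operator:e2e
    - run: kubectl apply -f config/samples/
    - run: kubectl wait --for=condition=Available websites/my-site --timeout=120s
```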
CI gets your operator correct.
CD keeps it safe, upgradeable, and operable over time.
Think of operator CD as three deliverables, not one:

- the controller image
- the CRDs (your public API)
- the deployment packaging (manifests, Helm chart, or OLM bundle)
Your CD pipeline should explicitly manage all three.
Before automating CD, think deeply and decide how versions will work:

- the operator’s own release version (semver, e.g. v1.2.3)
- the CRD API versions (v1alpha1 → v1beta1 → v1)

CRDs are APIs. Breaking them breaks users!
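CRD versioning is expressed directly in the CRD manifest: several versions can be served at once, but exactly one is the storage version. A trimmed sketch (names are hypothetical; per-version schemas omitted for brevity):

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: websites.example.com     # hypothetical
spec:
  group: example.com
  names: {kind: Website, plural: websites, singular: website}
  scope: Namespaced
  versions:
    - name: v1alpha1
      served: true
      storage: false             # still served for existing clients
    - name: v1beta1
      served: true
      storage: true              # new objects are persisted at this version
```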
Operator CD must answer one question: “What happens to existing clusters when this version rolls out?”
Key things your CD process should validate:

- the new CRD schema is backward compatible with stored objects
- existing CRs still reconcile successfully after the upgrade
- child resources are not needlessly recreated by the new version

Reconcile logic must tolerate:

- objects created by older versions of the operator
- new spec/status fields that are unset on old objects
- CRs persisted at older stored API versions

Testing what happens in the wild can be really tricky! A practical CD test outline could be something like:

1. Install the previous released version on a fresh cluster.
2. Create representative CRs and let them converge.
3. Upgrade the operator (and CRDs) to the candidate version.
4. Assert the existing CRs stay healthy and keep reconciling.
5. Assert there is no unnecessary churn in child resources.
This is what distinguishes a fully fledged operator from a simple controller.
Recreating child resources on every change causes unnecessary disruption. Patch existing resources where possible.
Use owner references so Kubernetes can garbage-collect dependents automatically when the CR is deleted.
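On the child object, an owner reference looks like this (the uid is illustrative); in controller-runtime projects it is usually set via controllerutil.SetControllerReference before creating the child:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-site
  ownerReferences:
    - apiVersion: example.com/v1alpha1
      kind: Website
      name: my-site
      uid: 2f4c…                 # illustrative
      controller: true
      blockOwnerDeletion: true
```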
Start with:

- Available
- Progressing
- Degraded

Then add more only when you have a reason.
Status is your UX: keep it accurate, update it on every reconcile, and make conditions answer the questions users actually ask.
And remember: in the end, an operator is just a “while true” loop with discipline, running in a Pod.