Kubernetes Operators are one of those ideas that feel magical when they work: you declare intent in YAML, and software continuously makes the cluster match it—handling upgrades, failures, drift, and lifecycle cleanup, like a purpose-built SRE on autopilot.
Although it can sound like science fiction, in practice Operators are just code that leverages Kubernetes’ extensibility features: they take the desired and current state of the cluster and run a control loop to reconcile the two.
At a high level, an operator is:

- a CustomResourceDefinition (CRD) that adds a new kind to the API (e.g. PostgresCluster, Cache, Tenant)
- a controller that watches those resources and drives the cluster toward the declared state

This is “reconciliation”: controllers run repeatedly until the world matches the declared desired state. The key insight is that operators usually reconcile one level higher than built-in Kubernetes controllers (your operator might create a Deployment, and then Kubernetes’ own deployment controller creates ReplicaSets and Pods).
Every custom resource is split in two:

- spec — what the user wants
- status — what’s actually happening (observed state)

If you get this separation right early, everything else gets easier: upgrades, debugging, user trust, and testing.
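As a concrete example, a CR for a hypothetical Website kind might look like this — the group, kind, and field names are illustrative. The user writes spec; only the controller writes status:

```yaml
apiVersion: example.com/v1alpha1   # hypothetical group/version
kind: Website
metadata:
  name: my-site
spec:                              # intent: written by the user
  image: nginx:1.27
  replicas: 3
status:                            # observation: written only by the controller
  readyReplicas: 3
  conditions:
    - type: Available
      status: "True"
      reason: DeploymentReady
```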
If you’re building a “classic” Kubernetes operator, the most common and best-supported path is Kubebuilder.
If you want extra packaging workflows—especially around bundles for the Operator Lifecycle Manager (OLM, a Kubernetes extension that helps install, upgrade, and manage operators in a cluster)—use Operator SDK. For Go projects, Operator SDK uses Kubebuilder under the hood and shares the same basic layout and controller-runtime foundation.
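With Kubebuilder installed, scaffolding looks roughly like this (the domain, module path, group, and kind below are placeholders):

```shell
# Scaffold a new operator project (module path is a placeholder)
kubebuilder init --domain example.com --repo github.com/example/website-operator

# Add an API: generates the CRD types and a controller skeleton
kubebuilder create api --group apps --version v1alpha1 --kind Website
```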
You can write operators in other ecosystems, and sometimes you should (team skill set, rapid prototyping, etc.). But keep in mind that Go is the safest bet: examples are abundant, and the libraries and tooling are very mature.
Before you scaffold anything, answer these questions:
Pick a single primary CR kind as the main entrypoint:

- Website
- Database
- Topic
- BackupJob

A common beginner mistake is trying to make the operator “watch everything” and infer intent. Don’t. Make the CR the user contract.
List the Kubernetes resources the operator will create/maintain, like:

- Deployment / StatefulSet
- ConfigMap / Secret
- Ingress
- PodDisruptionBudget

This list informs your RBAC, your testing, and your reconcile design.
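In a Kubebuilder project, that list typically turns into RBAC markers above the reconciler; `make manifests` generates the Role/ClusterRole from them. The resource list below mirrors the bullets above (a sketch, not a complete policy):

```go
package controllers

// Kubebuilder RBAC markers: `make manifests` turns these into RBAC rules.
//+kubebuilder:rbac:groups=apps,resources=deployments;statefulsets,verbs=get;list;watch;create;update;patch;delete
//+kubebuilder:rbac:groups="",resources=configmaps;secrets,verbs=get;list;watch;create;update;patch;delete
//+kubebuilder:rbac:groups=networking.k8s.io,resources=ingresses,verbs=get;list;watch;create;update;patch;delete
//+kubebuilder:rbac:groups=policy,resources=poddisruptionbudgets,verbs=get;list;watch;create;update;patch;delete
```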
Define your Status Conditions strategy early: what exactly does it mean for your CR to report Available=True?

And last but not least, decide your cleanup model:

- ownerReferences (ideal when possible)
- finalizers, for anything Kubernetes garbage collection can’t see (e.g. external resources)

This pays off in test assertions and user experience.
Kubebuilder organizes your project in a fixed structure. The names may evolve across versions, but conceptually you’ll have:

- API types: the Go structs behind your CRD (e.g. under api/)
- controllers: the reconcilers (e.g. under internal/controller/)
- config: generated manifests (CRDs, RBAC, samples)
- an entrypoint that wires everything into a manager (cmd/main.go)
Under the hood, controllers:

- watch the resources you care about
- enqueue keys into a rate-limited workqueue
- call your Reconcile function, retrying with backoff on errors

You don’t need to implement that plumbing, but you do need to design your reconcile logic to be:

- idempotent: safe to run many times for the same object
- convergent: each run moves the cluster closer to spec
- tolerant of stale caches and partial failures
This is where you should slow down and design:

- spec: fields that capture intent from users, e.g. operational switches (suspend, paused, etc.)
- status: observed state, typically conditions such as Available, Progressing, Degraded

A good heuristic: if an SRE would ask for it at 3am, consider putting it in status.
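For instance, a status block using standard condition types might look like this (field names follow Kubernetes’ metav1.Condition; the values are illustrative):

```yaml
status:
  observedGeneration: 4
  conditions:
    - type: Available
      status: "True"
      reason: WorkloadReady
      message: "3/3 replicas ready"
      lastTransitionTime: "2024-01-01T00:00:00Z"
    - type: Progressing
      status: "False"
      reason: ReconcileComplete
```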
Kubebuilder’s controller style is basically: load the CR, observe cluster state, take actions, update status.
A practical reconcile “recipe”:

1. Fetch the CR; if it no longer exists, return.
2. If it’s being deleted, run cleanup and return.
3. Observe the current state of the resources you own.
4. Compute the desired children from spec.
5. Create or patch children to close the gap.
6. Update status with what you observed.
7. Return, requeueing if more work remains.

The important point isn’t the exact code—it’s the shape: observe, compare, act, record.
Operator testing gets dramatically easier when you split it into three layers:
Test pure functions and deterministic logic: name and label derivation, spec-to-child-resource mapping, condition computation.
These should run in milliseconds and cover most branches.
envtest is another powerful tool up your sleeve: it starts a local control plane (API server + etcd) so your controller talks to a real Kubernetes API, without spinning up a full cluster. This is the sweet spot for most controller tests.
What envtest is great for:

- exercising reconcile logic against a real API server
- CRD schema validation and defaulting
- verifying status updates, finalizers, and ownerReferences
kind runs a real Kubernetes cluster in Docker containers. It’s lightweight enough for CI and is widely used for “real cluster” validation.
Use kind E2E tests to catch issues envtest won’t:

- RBAC: your controller runs with its real ServiceAccount, not admin credentials
- image build and deployment manifests
- interactions with built-in controllers (e.g. a Deployment actually producing Pods)
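A minimal CI sketch of such an E2E job, in GitHub Actions syntax — the action versions, image tag, resource names, and sample paths are assumptions; the `docker-build` and `deploy` targets are the ones Kubebuilder’s Makefile ships:

```yaml
# Sketch of a CI job running E2E tests on a kind cluster.
e2e:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - uses: helm/kind-action@v1            # spins up a kind cluster
    - run: make docker-build IMG=example.com/operator:e2e
    - run: kind load docker-image example.com/operator:e2e
    - run: make deploy IMG=example.com/operator:e2e
    - run: kubectl apply -f config/samples/
    - run: kubectl wait --for=condition=Available websites/my-site --timeout=120s
```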
CI gets your operator correct.
CD keeps it safe, upgradeable, and operable over time.
Think of operator CD as three deliverables, not one:

- the controller image
- the CRDs (your public API)
- the deployment packaging (manifests, Helm chart, or OLM bundle)
Your CD pipeline should explicitly manage all three.
Before automating CD, think deeply and decide how versions will work:

- the operator’s own release version (semver, e.g. v1.2.3)
- the CRD API versions (v1alpha1 → v1beta1 → v1)

CRDs are APIs. Breaking them breaks users!
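CRD versioning is expressed directly in the CRD manifest: several versions can be served at once, but exactly one is the storage version. A trimmed sketch (names are hypothetical; per-version schemas omitted for brevity):

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: websites.example.com     # hypothetical
spec:
  group: example.com
  names: {kind: Website, plural: websites, singular: website}
  scope: Namespaced
  versions:
    - name: v1alpha1
      served: true
      storage: false             # still served for existing clients
    - name: v1beta1
      served: true
      storage: true              # new objects are persisted at this version
```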
Operator CD must answer one question: “What happens to existing clusters when this version rolls out?”
Key things your CD process should validate:

- the new CRD schema is backward compatible with stored objects
- existing CRs still reconcile successfully after the upgrade
- child resources are not needlessly recreated by the new version

Reconcile logic must tolerate:

- objects created by older versions of the operator
- new spec/status fields that are unset on old objects
- CRs persisted at older stored API versions

Testing what happens in the wild can be really tricky! A practical CD test outline could be something like:

1. Install the previous released version on a fresh cluster.
2. Create representative CRs and let them converge.
3. Upgrade the operator (and CRDs) to the candidate version.
4. Assert the existing CRs stay healthy and keep reconciling.
5. Assert there is no unnecessary churn in child resources.
This is what distinguishes a fully fledged operator from a simple controller.
Recreating child resources on every change causes unnecessary disruption. Patch existing resources where possible.
Use owner references so Kubernetes can garbage-collect dependents automatically when the CR is deleted.
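On the child object, an owner reference looks like this (the uid is illustrative); in controller-runtime projects it is usually set via controllerutil.SetControllerReference before creating the child:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-site
  ownerReferences:
    - apiVersion: example.com/v1alpha1
      kind: Website
      name: my-site
      uid: 2f4c…                 # illustrative
      controller: true
      blockOwnerDeletion: true
```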
Start with:

- Available
- Progressing
- Degraded

Then add more only when you have a reason.
Status is your UX: keep it accurate, update it on every reconcile, and make conditions answer the questions users actually ask.
And remember: in the end, an operator is just a “while true” loop with discipline, running in a Pod.