Looking at how several companies have managed their services, we discovered a relatively new (for us) technical role, the Site Reliability Engineer. As IBM reports:
The principle behind the SRE is that using software code to automate oversight of large software systems is a more scalable and sustainable strategy than manual intervention – especially as those systems extend or migrate to the cloud.
An SRE can also reduce or remove much of the natural friction between (a) development teams who want to continually release new or updated software into production, and (b) operations teams who don’t want to release any type of update or new software without being absolutely sure it won’t cause outages or other operational problems.
It appears that managing this kind of distributed system gave birth to the role, meaning that part of the magic behind such management comes from this technical figure. So here at Wuerth Phoenix we decided it was worth investigating further, and the whole thing quickly became really interesting. We then found out that SREs gather regularly at a dedicated conference, SREcon.
SREcon is a gathering of engineers who care deeply about site reliability, systems engineering, and working with complex distributed systems at scale.
Our interest rose so much that we decided to attend the first available edition, hoping to learn everything we could and, ultimately, to understand what an SRE really is. As you read this you might think we expected too much, but trust us, it really was an eye-opener. So let me explain where we went and what we did.
SREcon is a large conference that brings together hundreds of Site Reliability Engineers, companies that produce tools for managing incidents, and experts from all the FAANG companies.
Listening to engineers who manage clusters of thousands of servers distributed all over the world, and seeing such a big ecosystem revolving around incidents, pushed us “a bit” outside of our comfort zone.
SREs from some of the biggest IT companies (e.g. Meta, Google, and Spotify) explained how they approach reliability challenges, using real-life incidents as examples.
If you are curious about the individual talks, you can take a look at the Conference Program.
When we arrived in Amsterdam we were basically outsiders to the SRE culture, even though we were already applying some of its practices and managing production systems. After just three days we already felt a bit more like insiders. Just like agile practices, the SRE role must be adapted to your requirements and your resources: you cannot just take the “Site Reliability Engineering” book and apply it to your company.
Site Reliability Engineering is mostly about common sense applied to making a system as reliable as possible. You should start from the “quick wins” to improve reliability, and then move on to bigger issues, complex architectural improvements and expensive management tools.
You also cannot just “buy” SRE by hiring a bunch of engineers. SRE is a culture that must be well rooted inside your company. Again, just as with agile practices, an investment in SRE pays back at the company level and not just at the team level.
Reliability is not a new concept. Companies and DevOps teams already take product quality into account, and SRE is not some kind of “magic formula”. But by following SRE principles and focusing on SRE practices, most companies can find the proper path toward making reliability a Key Product Differentiator.
Remember that SRE starts with people, and remember to embrace your company’s uniqueness. To quote a sentence from the book of one of the main speakers:
May the queries flow and your pager be silent!