23. 11. 2022 Alessandro Valentini Events

Our Experience at SREcon22 Europe

Looking at how several companies have managed their services, we discovered a relatively new (for us) technical role, the Site Reliability Engineer. As IBM reports:

The principle behind the SRE is that using software code to automate oversight of large software systems is a more scalable and sustainable strategy than manual intervention – especially as those systems extend or migrate to the cloud.

An SRE can also reduce or remove much of the natural friction between (a) development teams who want to continually release new or updated software into production, and (b) operations teams who don’t want to release any type of update or new software without being absolutely sure it won’t cause outages or other operational problems.

Managing large distributed systems appears to have given rise to this role, and a good portion of the “magic” behind that kind of management is the work of these engineers. So here at Wuerth Phoenix we decided it was worth investigating further, and the more we looked into it, the more interesting it became. We then found out that SREs gather periodically at a dedicated conference, SREcon.

SREcon is a gathering of engineers who care deeply about site reliability, systems engineering, and working with complex distributed systems at scale.

Our interest grew so much that we decided to attend the first available edition, hoping to learn everything we could and, ultimately, to understand what an SRE really is. As you read this you might think we expected too much, but trust us, it really was an eye-opener. So let me explain where we went and what we did.

About the Conference

SREcon is a large conference that brings together hundreds of Site Reliability Engineers, companies that build tools for incident management, and experts from all the FAANG companies.

Listening to engineers who manage clusters of thousands of servers distributed all over the world, and seeing such a big ecosystem revolving around incidents, made us feel “a bit” outside of our comfort zone.

SREs from some of the biggest IT companies (Meta, Google, Spotify and others) explained how they approached reliability challenges and presented real-life incidents as examples.

“Going from 30 to 30 Million SLOs” by Alex Palcuie (Google) showed how the Service Level Objectives (SLOs) for the GCE Compute API have evolved inside Google in recent years, with the aim of providing better techniques and giving leadership better visibility into system reliability.
The talk “Is Our Team as Resilient as Our Systems?” highlighted the importance of a strong, shared knowledge base. Too often we rely on so-called superheroes: expert colleagues who are able to solve any issue thanks to their (unshared) knowledge. Seniority is valuable and appreciated, but it’s fundamental to train new engineers and prepare them to handle any potential situation.
From the talk “Diamonds with Flaws: Examining the Pressures, Realities, and Future of Site Reliability Engineering” we understood that it’s not worth chasing the latest and coolest tools and practices just for the sake of keeping up with others. You will just end up investing tons of resources in something that turns out to be useless. SREs should instead do what their company actually needs, based on an unbiased analysis.
In “A Case Study in Chaos Testing: Uncovering Kernel Scaling Issues”, Gary Liku from Bloomberg LP showed us how deep we can go in finding the root cause of an issue, as well as a new testing approach aimed at reproducing fault conditions by placing a specific resource under the spotlight.

If you are curious about the other talks, you can take a look at the Conference Program.

Lessons Learned

When we arrived in Amsterdam we were basically outside the SRE culture, even though we were already applying some of its practices and managing production systems. After just three days we already felt a bit more like insiders. Just like agile practices, the SRE role must be adapted to your requirements and your resources: you cannot just take the “Site Reliability Engineering” book and apply it to your company.

Site Reliability Engineering is mostly about common sense applied to making a system as reliable as possible. You should start with the “quick wins” that improve reliability, and only then move on to bigger issues, complex architectural improvements and expensive management tools.

You also cannot just “buy” SRE by hiring a bunch of engineers. SRE is a culture that must be well rooted inside your company. Again, just as with agile practices, an investment in SRE pays off at the company level and not just at the team level.

Conclusions

Reliability is not a new concept. Companies and DevOps teams already take product quality into account, and SRE is not some kind of “magic formula”. But by following SRE principles and focusing on SRE practices, most companies can find the proper path to making reliability a Key Product Differentiator.

Remember that SRE starts with people, and remember to embrace your company’s uniqueness. To quote a sentence from a book by one of the main speakers: