DevOps or SRE? We’re going over the two concepts, highlighting the differences between them and trying to understand how each one came to be.
DevOps and SRE seem like two sides of the same coin. Both titles aim to bridge the gap between development and operations teams, with a shared goal of speeding up the release cycle without compromising on quality.
And indeed, in most companies we can see that there’s a requirement for just one of these positions, with an overlap in responsibilities and abilities. Both titles co-exist in the same space, and both are an essential part of the development team; so how are they different, and what does each one mean? Let’s check it out.
Development, Operations and Reliability
Before DevOps was implemented, development and operations teams worked as two independent squads, each with its own goals and objectives. The differences and lack of communication between these teams often impacted the product, which in turn affected the end users and the company.
To help these teams communicate and build better products, DevOps became one of the most critical positions in many companies.
The official definition of DevOps is “a software engineering culture and practice, that aims at unifying software development and software operation.” The term was first coined by Andrew Shafer and Patrick Debois back in 2008, and while it took a few years for it to become a common concept, nowadays just about every company, from enterprises to startups, is hiring DevOps engineers.
The concept of Site Reliability Engineering (SRE) has been around since 2003, making it even older than DevOps. It was coined by Ben Treynor, who founded Google’s Site Reliability Team. According to Treynor, SRE is “what happens when a software engineer is tasked with what used to be called operations.”
Just like DevOps, SRE is also about combining development and operation teams, helping them see the other side of the process, while introducing visibility to the complete application lifecycle.
Both titles are advocates of automation and monitoring, with a similar goal to reduce the time from when a developer commits a change to when it’s deployed to production. DevOps and SREs both want to do so without compromising on the quality of the code or product along the way.
Google itself states that SRE and DevOps are not so different from one another: “they’re not two competing methods for software development and operations, but rather close friends designed to break down organizational barriers to deliver better software faster.”
So why did Google need to create its own definition?
The Differences Between DevOps and SREs
As we mentioned before, the concept of DevOps is all about combining development and operations, defining the behavior of the system and seeing what needs to be done to close the “gap” between the two teams. The theory behind this title talks about what needs to be done to make the two teams work as one.
And according to Google, that’s where the main difference between DevOps and SRE lies. While DevOps is all about “what” needs to be done, SRE talks about “how” it can be done. It’s about turning the theory into an efficient workflow, with the right work methods, tools and so on. It’s also about sharing the responsibility between everyone, and getting everyone in sync with the same goal and vision.
To help further explain the difference, Google released a series of videos and posts that talk about how the two titles differ. In one of these posts, two Google employees, Seth Vargo (Staff Developer Advocate) and Liz Fong-Jones (Site Reliability Engineer), explain that SREs “embody the philosophies of DevOps with a greater focus on measuring and achieving reliability through engineering and operations work.”
Seth and Liz illustrated the similarities and differences between the two through the five pillars of DevOps, explaining what each one means for SRE:
#1 Reduce Organizational Silos
Large enterprises usually have a complex organizational structure, with a lot of teams working in silos. Each team pulls the product in a different direction without communicating with the rest of the company, and as a result fails to see the big picture. This can lead to frustration, setbacks in deployment and high costs due to delays.
DevOps’ job is to reduce the silos, and to make sure there aren’t any teams within teams that are not aligned with the rest of the company. It brings the teams together into one group with a shared vision.
SRE doesn’t talk about how many silos are in the company, but rather about how to get everyone talking to each other. This is done by using the same tools and techniques across the company, which in turn helps share ownership across everyone.
#2 Accept Failure as Normal
Although DevOps aims to catch and handle issues before they cause failures, failure is something that we, unfortunately, can’t avoid entirely. DevOps embraces this by accepting failure as something that is bound to happen and that can help the team learn and grow.
In the world of SRE, this objective is delivered through a formula for balancing accidents and failures against new releases. In other words, SREs want to make sure that there aren’t too many errors or failures, even if they’re something we can learn from.
This balance is measured with two key indicators: Service Level Indicators (SLIs) and Service Level Objectives (SLOs).
SLIs are metrics measured over time, such as request latency, throughput in requests per second, or failures per request. SLOs are derived from them: each one is a threshold, percentage or number that defines the level an SLI is expected to meet over a certain period of time.
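To make the relationship between the two more concrete, here is a minimal Python sketch. It assumes a single availability SLI (successful requests divided by total requests) and a 99.9% SLO; the `Slo` class, the function names, the service name and the request counts are all hypothetical and not taken from Google's posts. The "error budget" it computes is the term commonly used in SRE practice for the amount of failure an SLO still leaves room for.

```python
from dataclasses import dataclass


@dataclass
class Slo:
    """A target for one SLI over a time window (hypothetical structure)."""
    name: str
    target: float  # e.g. 0.999 means 99.9% of requests must succeed


def availability_sli(total_requests: int, failed_requests: int) -> float:
    """SLI: fraction of requests that succeeded in the window."""
    if total_requests == 0:
        return 1.0  # no traffic, so nothing failed
    return (total_requests - failed_requests) / total_requests


def error_budget_remaining(slo: Slo, total_requests: int, failed_requests: int) -> float:
    """Fraction of the allowed failures still unspent (negative if overspent)."""
    allowed_failures = (1.0 - slo.target) * total_requests
    if allowed_failures == 0:
        return 0.0
    return 1.0 - failed_requests / allowed_failures


# Hypothetical numbers for one measurement window.
slo = Slo(name="checkout-availability", target=0.999)
total, failed = 1_000_000, 400

sli = availability_sli(total, failed)
print(f"SLI: {sli:.4%}")
print(f"SLO met: {sli >= slo.target}")
print(f"Error budget remaining: {error_budget_remaining(slo, total, failed):.0%}")
```

When the remaining budget approaches zero, a team following this approach would typically slow down releases and focus on stability. That is one way to balance failures against new releases.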
#3 Implement Gradual Change
Companies want to move faster than before. They want frequent releases, continually updating the product and keeping team members on their toes about new and relevant technology.
DevOps is all for this change, as long as it happens in a gradual and controlled way. Both DevOps and SRE want to move quickly, and Google points out that SRE emphasizes reducing the cost of failure along the way.
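One common way to keep the cost of failure low is a staged rollout, where a new release receives a growing share of traffic and is rolled back as soon as it looks unhealthy. The sketch below is only an illustration of that idea. The stage percentages, the error-rate threshold and the `observe_error_rate` callback are assumptions, and a real pipeline would read the error rate from its monitoring system.

```python
from typing import Callable, Sequence


def gradual_rollout(
    stages: Sequence[int],
    observe_error_rate: Callable[[int], float],
    max_error_rate: float,
) -> int:
    """Walk through traffic percentages and stop at the first unhealthy stage.

    Returns the traffic percentage that is safe to keep on the new release:
    the last stage that passed, or 0 if the very first stage failed.
    """
    safe_percent = 0
    for percent in stages:
        error_rate = observe_error_rate(percent)
        if error_rate > max_error_rate:
            # Roll back to the last healthy stage. Only `percent`% of traffic
            # ever reached the bad release, which bounds the cost of failure.
            return safe_percent
        safe_percent = percent
    return safe_percent


# Hypothetical measurements: the release degrades once it gets 50% of traffic.
fake_measurements = {1: 0.001, 10: 0.002, 50: 0.030, 100: 0.030}

result = gradual_rollout(
    stages=[1, 10, 50, 100],
    observe_error_rate=fake_measurements.__getitem__,
    max_error_rate=0.01,
)
print(f"Keep the new release at {result}% of traffic")
```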
#4 Leverage Tooling and Automation
As we mentioned before, one of the main focal points for both DevOps and SREs is automation. Both titles encourage adding as much automation and tools as possible, as long as they provide value to developers and operations by removing manual tasks.
#5 Measure Everything
An automated workflow that moves fast is something that needs constant monitoring. DevOps and SRE teams both need to make sure that they’re moving in the right direction, and they do so by measuring everything.
The main difference here is that SRE revolves around the concept that operations is a software problem, which led SREs to define prescriptive ways of measuring availability, uptime, outages, toil, etc.
SREs also ensure that everyone in the company agrees on how to measure reliability, and what to do when availability falls out of specification. This includes contributors at every level, from developers, through team managers and all the way up to VPs and executives.
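As a worked example of what "out of specification" can mean, the sketch below turns an availability target into an allowed amount of downtime and compares it with a measured outage total. The 99.9% target, the 30-day window and the 50 minutes of downtime are hypothetical values chosen for illustration.

```python
from datetime import timedelta


def allowed_downtime(slo_target: float, window: timedelta) -> timedelta:
    """Downtime that still satisfies an availability target over the window."""
    return window * (1.0 - slo_target)


def measured_availability(downtime: timedelta, window: timedelta) -> float:
    """Fraction of the window during which the service was up."""
    return 1.0 - downtime / window


window = timedelta(days=30)
target = 0.999
downtime = timedelta(minutes=50)  # hypothetical total of all outages

budget = allowed_downtime(target, window)
availability = measured_availability(downtime, window)

print(f"Allowed downtime: {budget.total_seconds() / 60:.1f} minutes")
print(f"Measured availability: {availability:.4%}")
if downtime > budget:
    print("Out of specification: apply the agreed policy")
```

With these numbers the allowance is roughly 43 minutes per 30 days, so 50 minutes of outages puts the service out of specification. What happens next is the part that everyone, from developers to executives, needs to have agreed on in advance.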
What Does It Mean To Be Reliable?
We talked about sharing responsibility, accepting failure and measuring everything. Now, we need a way to make sure everything is indeed working as it should, and is reliable. In other words, there should be a unified method to measure reliability at every level.
SREs measure SLIs and SLOs, while DevOps teams measure failure and success rates over time, and the two usually do so with different tools and methods. While these teams have an overview of what’s going on, it’s not complete. Reliability is not just about the infrastructure; it’s relevant every step of the way – from application quality, through performance and up to security.
Failures and issues can and will happen in different parts of the application, and when they do, we need reliable data to understand why the issue happened in the first place, what caused it, and how to fix it. If we break it down, this data should include:
- Execution stack and bytecode
- Complete variable state (overlaid on full source code)
- JVM State: Threads, environment variables
- Relevant log statements (including DEBUG and TRACE in production)
- Event analytics (Frequency, failure rate, deployment, application)
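The last item on this list, event analytics, can be sketched in a few lines. The example below assumes each recorded call is tagged with its application, its deployment and an optional error name; the `Event` class, the deployment labels and the sample data are hypothetical and only show how frequency and failure rate might be derived.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Event:
    application: str
    deployment: str
    error: Optional[str]  # None means the call succeeded


def failure_rate_by_deployment(events: List[Event]) -> Dict[str, float]:
    """Share of failed calls for each deployment."""
    calls = Counter(e.deployment for e in events)
    failures = Counter(e.deployment for e in events if e.error is not None)
    return {deployment: failures[deployment] / calls[deployment] for deployment in calls}


def error_frequency(events: List[Event]) -> Counter:
    """How often each error type occurred."""
    return Counter(e.error for e in events if e.error is not None)


# Hypothetical sample: deployment v42 fails far more often than v41.
events = [
    Event("billing", "v41", None),
    Event("billing", "v41", None),
    Event("billing", "v41", "NullPointerException"),
    Event("billing", "v42", "NullPointerException"),
    Event("billing", "v42", "TimeoutException"),
    Event("billing", "v42", None),
    Event("billing", "v42", "NullPointerException"),
]

rates = failure_rate_by_deployment(events)
frequency = error_frequency(events)
for deployment, rate in sorted(rates.items()):
    print(f"{deployment}: {rate:.0%} of calls failed")
print(frequency.most_common(1))
```

Grouping by deployment makes it easier to see whether a spike in failures lines up with a specific release.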
And since this is crucial information, we have to make sure it’s reliable and actionable. This can be done by setting up alerts for different scenarios, adopting peer code review, writing unit tests and so on.
While these methods help promote a shared responsibility between everyone, they might end up impacting the product’s performance. And the bigger the organization is, the higher the cost of failure, whether it’s lower customer satisfaction, employee churn, or decreased product value.
That’s why it’s important to minimize the manual system work, and automate the collection of information. And while you’re at it, you also need to stay on top of everything that’s happening in your product. In other words, you need the right data to measure the reliability of your software throughout the CI/CD workflow.
So, is there a difference between DevOps and SREs? Google, the “founder” of the SRE title, clearly defined it, along with a straightforward set of expectations. DevOps, as it seems, is more of a “free spirit”, with the definition and perspectives varying from organization to organization.
However, DevOps and SRE teams are not so different. Both help bring development and operations teams together, while sharing similar responsibilities and focusing on enabling automation and reliability.
The bottom line is that it’s all about the data. You need information in order to understand how to measure success and failure and how to gain continuous reliability across the application.