SRE recognizes that there is a very distinct skill set and expertise in making applications perform well and reliably.
Hi, this is your host, Swapnil Bhartiya, and welcome to another episode of T3M, our Topic of the Month. The topic of this month is SRE, and today we have with us once again Rob Hirschfeld, CEO and co-founder of RackN. Rob, it's great to see you again.
Today's topic is going to be really exciting. It is definitely one that I have been excited to see the industry adopting and making part of the general conversation on site reliability engineering. It's an important trend line.
It's one of the most important topics in the whole cloud and Kubernetes landscape, where you hear the big buzzwords: DevOps, SRE, platform engineering. We ran some series this year as well. One thing I do want to talk to you about: when I do these interview series, I talk to folks from different departments within a company, because getting a diverse perspective is really important. So when we look at SRE, how would you define it? And does it mean different things to different people? It's not just a matter of their perspective; in reality, that is how those teams look at it, so their definition is as right as anyone else's.
It's a very good point, because in the industry we do have a tendency to adopt terms to mean things that we want them to mean. I've definitely seen threads of people just retitling sysadmins as DevOps engineers or site reliability engineers. And I think it is actually important for the people doing the work to have reasonable titles and be respected for the work they do, especially from an engineering perspective and what engineering implies. That actually makes defining site reliability engineering a little bit trickier.
The simplest thing to do is go back to Google's work and the way Google defined it, which was very much as a cross-platform benefit: you had a team of very advanced engineers who were working at a system level to improve the reliability of their output, of their applications. They had a whole bunch of methodologies around that, including automation, error budgets, ownership of code, and things like that. So when people look at site reliability, they should definitely be looking at this cross-team expertise in tuning applications. But one of the things I've noticed in the last year or so is that site reliability has drifted towards, or back to, the application maintenance perspective. When people talk about SRE right now, they're much more likely to be talking about applications that their company is exposing to the public or running internally, and about making them resilient, robust, and monitored, and improving observability. They've really focused on this code-to-production pipeline from a site reliability engineering perspective, and on improving that pipeline experience. They focus less on what we had talked about briefly in the industry, which is more DevOps, with its much broader systems perspective on infrastructure, compliance, governance, and things like that. Those seem to have fallen away from the site reliability engineering discussions.
If I just remove all the jargon, all the labels, all the buzzwords, what is the actual goal of SRE teams and SREs?
SRE recognizes that there is a very distinct skill set and expertise in making applications perform well and reliably. Fundamentally, when you take everything else off, what we've recognized is that writing applications and running applications are different skills.
And so what we've identified with SRE is a team of people who have the skills to take the code that companies are trying to run and improve the performance and the stability of that code. Then there's a whole bunch of processes that Google defined to make that better. And it turns out that this skill set is different, yet again, from running infrastructure or all the other disciplines that have traditionally gotten lumped into DevOps. What people should always remember in this is what the skill sets are: what expertise you need to be hiring for, and what expertise people are going to be using in their daily jobs. There's this analogy that we use in the industry of a T-shaped individual, where somebody has a lot of broad expertise and then very deep expertise in a narrow set of topics. What we've found is that site reliability engineers do have relatively broad expertise, but then go very deep on observability, performance, and the type of questions that come into how you scale an organization or an application. That, I think, has really changed how people look at SREs over the last two years.
We just wrapped up our whole series on infrastructure as code. Since you're saying that SRE is more about applications, what does it mean for infrastructure, for infrastructure techniques, and for the whole infrastructure as code phenomenon that we talked about in the last episode?
Yeah, this is a place where I think we had expected SREs, infrastructure as code, and DevOps to have a lot more of a Venn intersection, where we would see SREs getting very involved in building infrastructure, maintaining infrastructure, and doing compliance and governance. And we haven't seen that as much, right?
Now, we do expect SREs to be good at automation and at building automation and processes. But at the same time, what we don't see as much is the SRE discussions and dialogues coming back to: How do I build and manage infrastructure? How do I deal with cost? How do I build immutability? How do I do operating system configuration and automation? There are a lot of pieces around building and running your infrastructure that SRE has not really had the same dialogue on. Picking an operating system, the adjacencies, some of the cloud economics: those typically are not coming up in SRE discussions, yet they are very relevant from an infrastructure as code or a platform engineering perspective. They're very different. Even platform engineering, which has gotten a lot of buzz lately, is different from site reliability engineering. Platform engineering tends to come down to, "I want to provide developers with standard operating environments." It ends up being very much a governance and conformance story, and not really worried about, "Am I giving them well-performing environments for that work?" So what we see is a little more partitioning of the skill sets here, where a platform engineering team might be working on standardization and compliance, infrastructure as code is a tool in the middle of all those things to make it go, and then SRE is coming in after the fact to help tune those environments.
What are the intersections where SRE teams meet with observability teams, DevOps teams, and platform engineering teams? Are there overlaps? Is it a superset? Is it a subset?
There are definitely overlaps, because ultimately the goal in any organization is to have improved collaboration across all these teams, right? That always comes from every team having visibility into, and a stake in, how things are done and how things are built.
But what we've seen is that SRE teams definitely get more involved in observability, performance monitoring, and some of the code generation and code review pieces. They're much more influential once a system is running: how well that system can be tuned and operated, and then what the errors are, because bugs, defects, and site issues are definitely part of it. Security ends up coming into site reliability engineering as well, but in a lot of cases the security work being done right now is different from what SREs do. So we do have this interesting split, where what an SRE does effectively is not as broad as what we expected. When I first started hearing about SRE, I imagined, as I think a lot of people did, this super infrastructure automation engineer who would come in and solve all sorts of problems and then also tune the applications. What we've seen, especially with Kubernetes, which distributes applications in a very broad way, is that SRE has been much more narrowly focused on the application side. Even getting down into how Kubernetes operates, building your cluster, and running your cluster: those skills have become much more segmented in the last several years, with specialized teams. That was frankly the point of introducing something like Kubernetes: to be able to have a team, like a platform team, that focuses on those parts of your infrastructure while other teams focus just on the application. So in some ways it's a very natural migration. But if you're dealing with SRE teams and expecting them to do a whole bunch of infrastructure work, you might find yourself out of step with the current industry definitions of SRE.
How do you look at what I call soft silos? They're not hard-core silos, because there will always be specialization in specific fields. We cannot really expect to have unicorn developers who know everything.
We talk about the whole shift-left movement, where a lot of things are moving to developers. How do you look at these things in general? We are once again going through this process, and in one or two years we may come up with a new term altogether as things mature. So I just want to get your perspective.
Sadly, I think IT infrastructure is more siloed now than it ever was in the past, and that is very frustrating to our customers. As much as they've built these silos and defend these silos, what they recognize is that it's very difficult for them to propagate shared resources and shared good through them. This is actually a consequence of moving into cloud, where we've really reduced the barriers to teams making their own decisions. The challenge listeners need to think about is this: if they've built SRE by putting an individual SRE into each team, rather than making it a shared resource, a shared infrastructure as code practice, or a shared platform, then those individual resources may help individual teams move faster, but in many ways they are directly undermining the organization's ability to collaborate and have shared resources. We have seen this symptom over and over again throughout the industry. Building a shared team of expertise, like an SRE team rather than an SRE position, or a platform team, an infrastructure team, or an operations team like we used to have, where we would consolidate and control those things: we've really lost that trend line in the industry. So you definitely need to be careful when you think about how these things work, because what we're trying to do is create shared expertise, reuse those resources, and consolidate practices.
In a lot of cases, the current trend is to position these skill sets in each team and skip the collaboration and sharing, just because it's hard, it takes time, and it slows things down. But slowing down is sometimes a really good strategy for companies that want to improve their velocity. If you're rowing together on how these things work, then you ultimately will do better. At the end of the day, that is what Google was writing about in their book. They had a standardized platform for deployment, they trained people on how to use it, and they had experts whom they would rotate through their teams to make those teams more effective at using the platform. That's why the original definition of site reliability engineering was so tied to infrastructure. Today we can assume that Kubernetes is that shared platform (it's not always a safe bet) and then layer skill sets on top of it. We still need to figure out how to have better sharing within organizations so that they can collaborate better.
We ran a whole series about DevOps and then platform engineering. How different is SRE from DevOps and platform engineering?
One of the things that the DevOps community really pushes back on, although I think they've lost the fight, is the idea of DevOps as a position; it's a process and a philosophy. One of the things I like about site reliability engineering is that there is actually a team and a person, and you can give somebody that title. Whereas "DevOps engineer" in some cases still raises eyebrows, because DevOps is about process, collaboration, and working together. That's always been the core framing idea of DevOps, and making a single person responsible for DevOps has always been counterproductive from that perspective. What we're describing with site reliability engineering can be a position. It can be an expertise.
It can be a skill set that you send in to help people get things done. I think you just have to be aware that when you define somebody as a site reliability engineer, you have to give them a very tightly constrained scope of work. If you give somebody the SRE title and then say, "By the way, you're doing automation and infrastructure and security," and you keep expanding their role, at that point what you've really got is an ops engineer, which a lot of people now would call a DevOps engineer. They're going to end up maintaining the systems and the infrastructure. That's not bad; a lot of times companies desperately need that work. But at that point they've given somebody a title without limiting their scope enough for them to be successful at site reliability engineering.
When you look at SRE, is it a process, a practice, or a set of tools?
For these categories it is always helpful to have a list of the tools that are in that person's bag. I think SRE is fundamentally a practice, but it is helpful to understand that when you are an SRE, you are fundamentally doing work related to application performance and observability. In a lot of ways you can spot SREs as the people who are very excited about observability, infrastructure pipelines, code metrics, uptime statistics, and monitoring. SREs carry those tools in their bag, and that is really a defining characteristic. You could pick vendors in the mix, but fundamentally they are using those kinds of tools. When you define these roles, the tools people use are helpful indicators of what their specialty is and what the mix of their job is. That doesn't mean an SRE wouldn't use Terraform or Ansible or some other cloud infrastructure tooling, but where their focus lies is what helps you determine what the role looks like.
Can you also talk about your customers: how they are already implementing some of these practices and processes, and what benefits they see? You interact with your customers, so you get a lot of feedback, insights, and input from them about how these practices benefit them.
Interestingly, because of the way we have been talking about site reliability engineering, the teams that we interact with at our enterprise customers are not so much doing site reliability engineering as helping the teams that do. They are a step removed, because RackN is fundamentally helping build infrastructure and new infrastructure automation. What we do see is that customers have built a lot of software, and they don't always understand whether it's running well or working right, what the ROI is, or how to run it. So our customers are looking for ways to ensure that the systems they have are running well, and site reliability engineering comes up from those perspectives. What we see our customers trying to do is work toward questions like: How do I make my infrastructure more API-driven? How do I create consistent results? They start looking at KPIs on the infrastructure side that align with the KPIs from the SRE performance metrics. So they want to be able to ask: Am I running an API for building infrastructure that is highly reliable? Are the results I get consistent and repeatable? Can a site reliability engineer tear down and then restore a system on a regular basis? That's a normal thing that we look at across the industry. The more that infrastructure work is predictable, reliable, repeatable, and API-driven, the more aligned it is with SRE priorities.
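As an illustration only, the tear-down-and-restore question can be turned into a simple infrastructure-side KPI. The sketch below is not RackN's method or any vendor's API; the `CycleResult` type, the `repeatability_kpis` function, and the sample numbers are all hypothetical. It assumes you record, for each rebuild cycle, whether it succeeded and how long it took, and it reports the success rate and how much successful runs vary in duration.

```python
from dataclasses import dataclass


@dataclass
class CycleResult:
    """Outcome of one hypothetical tear-down-and-restore run."""
    succeeded: bool
    duration_seconds: float


def repeatability_kpis(results):
    """Return the success rate and the duration spread of successful runs."""
    if not results:
        raise ValueError("need at least one cycle result")
    successes = [r for r in results if r.succeeded]
    success_rate = len(successes) / len(results)
    durations = [r.duration_seconds for r in successes]
    # Spread is max minus min over successful runs; None if nothing succeeded.
    spread = max(durations) - min(durations) if durations else None
    return {"success_rate": success_rate, "duration_spread_seconds": spread}


# Made-up sample data: four rebuild cycles, one of which failed.
runs = [
    CycleResult(True, 610.0),
    CycleResult(True, 598.0),
    CycleResult(False, 120.0),
    CycleResult(True, 640.0),
]
kpis = repeatability_kpis(runs)
print(kpis)  # {'success_rate': 0.75, 'duration_spread_seconds': 42.0}
```

A high success rate with a small spread is one rough way to express "predictable, reliable, repeatable" as numbers that can sit next to SRE performance metrics.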
And so we definitely see customers who are embracing these techniques coming back and asking not just for API-driven infrastructure, but for assurance that it delivers reliably, that their automation keeps working, and that they don't have to spend time babysitting it, given that the automation above them is making a lot of calls to rebuild, run, or change systems. We see this interplay throughout an organization, and I think it's worth noting that when your organization is collaborating well, the contracts between these teams are well understood and well exercised. The faster you can drive through system delivery, infrastructure changeovers, and things like that, the faster those turn rates are. You'll see those consumption statistics go up in your organization, and that's going to show that your SRE teams are more effective, your app dev teams are more effective, and your security teams are more effective. That turn rate is actually a core metric in what we see as the most successful organizations.
How do RackN and your teams help your customers with these practices and processes?
One of the things we talked about is that site reliability engineering and the app dev process are very aligned around code pipelines. This concept of a pipeline, where you start with code and end up with a working application, is critical to how SREs describe things, and DevOps too, to an extent. What we have found is that our ability to deliver infrastructure pipelines to customers (conceptually similar, but with different outputs) has really resonated with how they want to consume infrastructure. The languages have started to align very well.
So the idea that we can take a request for a system and drive it through a predictable series of steps, very much like a CI/CD pipeline but taking a piece of infrastructure and transforming it into its production state, has been really transformative for our customers, especially because when we build infrastructure pipelines, the goal is to have an incredibly high degree of off-the-shelf automation involved. So not only do we have a conceptually simple thing for these teams to understand, driving infrastructure through a repeatable, consistent process with standard injection points, but we've also done it in a standard way, so that our customers aren't reinventing all of the steps in the process and only add the things that are truly unique to them. Those two things combined have really made a difference in how effectively these techniques get adopted. It's really improved the velocity with which our customers get infrastructure into people's hands. It's also reduced the amount of custom work they have to do, which improves velocity as well, because there are two components to velocity worth mentioning here. We're talking about SREs, and SREs care about the performance of the system. The first component is: can you run the system end to end when it's fully built, and how effective, fast, and reliable is it? What slows people down even more, though, is making changes to that system: if changes are slow, expensive, and require a lot of customization, that slows people down even more. So the other aspect is not only whether you have a fast pipeline, but whether you have a standardized pipeline, because a standardized pipeline doesn't require the same maintenance.
It doesn't require you to keep customizing and changing things and putting DevOps or engineering time into building new pipelines. That is really where we see a lot of acceleration in velocity. And it's critically important when you're supporting app dev teams or SRE teams: when they make a request, you don't want to say, "Well, give me three months, put it on the backlog, and I'll build some custom automation for you." You want to be able to say, "All right, here's a pipeline that gets you 90% of the way there. Start using that. If you need something special, we'll work with you." Being able to service those requests faster is a big part of making the collaboration work.
Rob, thank you so much for taking the time today to talk about this interesting topic. As usual, I look forward to our next discussion. Thank you.
My pleasure. Thank you.