Welcome again to another OpenShift Commons briefing. Today we have Andrew Clay Shafer here; this is his second round on our Fridays with DTO, and he's going to talk a little bit about cloud-native operating models and what that means to him. I love the topic and the title, but I don't quite know exactly what he's going to come up with today. So Andrew, take it away, and we'll have live Q&A at the end.

Yeah, neither do I, and I never know what I'm talking about. If you want to jump in and have me explore or explain anything, it doesn't have to be just me monologuing. If you want clarity on any of this, jump in.

Will do.

Let me share my screen. So this is just a quick bit of framing. I've had a very privileged career, in a sense, to see and participate in the things that I have over the last decade, and that gave me a certain framing for talking about this stuff. I would say the pattern of my career has really been trying to take the things I've seen work, put them together, and share them in a way that other people can make work too. Going back over the last 10 years, it's really been focused on open source infrastructure: first Puppet, then OpenStack, and now more and more Kubernetes. In parallel to that, I was part of organizing DevOps Days globally and the communities of practice around the Velocity conference. Those conversations, and being part of those projects, gave me what I think is a pretty unique vantage point for helping the customers and communities around these different products. The overarching theme of a lot of this is: there's the code and there's the Git repos, and then there's what you actually do with that and how you make it work. So this is me trying to distill some of the thoughts and conversations I've had over the last few months, and try to at least remove some confusion.
I'm not sure I'm going to answer everyone's questions. In some ways, my hope is that you leave with the ability to ask better questions more than with all the answers. From there, the thing I want people to realize, and that I think people actually feel viscerally, is that there's a tension between the old ways of doing things, the old processes and technology, and the new ways of doing things. That's a bit of an oversimplification. But if you've tried to do this in a real sense, especially in a large organization, you know that if you've adopted something like a container platform (not that we know of any) and you're managing it with your old processes and your old mindsets, then the rate limit, the upper bound on how successful you can be with that new technology, is set by the fact that you're managing it in those old ways. The new tools don't necessarily require new thoughts and new behaviors, but they are optimized by them. In the old world we managed servers, and as sysadmins we would brag about the uptime of servers that had been up forever without a restart. Now we live in a world where the life of a container, by design, might be on the order of seconds or minutes; certainly not the days or years we used to be so proud of. The other thing that's happening is that more and more of this work is automation enforced by APIs rather than done by human toil, and that's a theme I'm going to come back to over and over.
The other thing worth pointing out, particularly as enterprises adopt these things, is that the old pattern of IT, framed as a cost center with all these processes, is really rooted in supporting business as usual, where IT is a secondary consideration to whatever the core business is. It's not that you necessarily want to forget who you are, but in the new world, IT and technology are much more central. If you look at the market dynamics and the performance of the top end of the market, it's dominated by the tech companies, or the cloud companies, whatever we want to call them, who are using technology not to support business as usual; the technology is the business. That's a fundamental mind shift as well. So here are some questions. Like I said, I'm not necessarily going to answer them, but I'm going to sort of answer them, and we'll have more questions by the end. This is relevant to the narrative of what we've seen over the last 10 years and what people are trying to do. What I see, and this is across the Kubernetes community in general and definitely in the OpenShift community, is that people are in very different places with respect to how they take this technology and put it into their workflows and behaviors, and whether they let it change those behaviors or not. In a lot of cases they're adopting a new technology, but they have a process that came from how they managed their VMs, which came from how they managed their bare metal. They haven't rethought from first principles what promises they're able to keep with those processes, versus the opportunity they're losing by slowing themselves down.
So I'm going to give you my definition. There's a whole subgenre of people debating the definitions of these words, but this is Andrew's version of what DevOps is; I've said this on stage multiple times over the last few years. To me, this is in the framing of "software is eating the world": software is going to optimize everything it can, over some timeline. And to me, DevOps is really about optimizing the human experience and performance of operating software, and doing it with software. That's the software-eating-the-world part. Then there's recognizing that these are sociotechnical systems: it's not just the technology by itself, it's for us, by us, and you're going to do this work with humans. So the last 10 years of my career have been going around the world talking about this and helping people try to adopt these processes and technologies. In parallel to this, there's another conversation, and I'm going to come back to SRE and DevOps a few times over the next 30 minutes or so. This is the quote-unquote beginning: Ben Treynor coined the term SRE at Google, and this is his quote: "SRE is what happens when a software engineer is tasked with what used to be called operations."

Let me pause you for just a second. You keep saying SRE, and some people might not know what it stands for.

So, site reliability engineering is, quote-unquote, the Google implementation of DevOps. There are free books you can read: if you search for "SRE book," you'll get to a link hosted by Google that has all of the text for free. If you like to buy dead trees or whatever, there are hard copies as well.
The original SRE book was 2016, and I'm going to come back to some of the points from that book. Then there's the SRE workbook, which came out in 2018, I think. I wrote one of the forewords for that book; the other one is by Mark Burgess. The first book is really SRE as Google envisions SRE itself, aspirationally, inside Google, and some of it is a little navel-gazey and Google-specific. The second book, the workbook, is an attempt, mostly by Google people but with some collaboration from outside Google, to share SRE in practice in a practical way. This talk isn't all about SRE, but there's going to be some more SRE content as we talk through these models. So why does this even matter? Okay, there are words; you have DevOps, you have SRE. What do they actually do? It's interesting watching, and this is a common theme with Agile, DevOps transformation, and SRE, that lots of people adopt the vocabulary and maybe change their titles, but they don't really change their process. So let's try to do better than that. The why, in my opinion, for both of these, and I would argue DevOps and SRE are essentially the same phenomenon (that's part of the theme in the foreword I wrote for that book), is that these new models are being created under evolutionary pressure to deliver systems that are flexible, adaptable, and changeable at scale, with high levels of reliability. Reliability is right in the name of the way Google framed it for themselves. These models, together with the technology, are what enable all the things we take for granted.
We all walk around with these little supercomputers in our pockets, and for the most part we assume Google is going to be available, Gmail is going to be available, Facebook is going to be available. Twitter has a few bad days now and then, but it's been pretty good lately, right? We take this as the baseline ambient experience we all share. Then we get into enterprise IT, and that's not always the case for what we can deliver from a reliability perspective. Part of my argument here is that there's certainly technology that can help you improve the promises you keep, but there's also a human workflow aspect that has a huge impact on building reliable, resilient systems. So I'm going to walk through a few things: first platform patterns, then a compare and contrast of this notion of "you build it, you run it," which is the famous Amazon way of doing this and is slightly different from the Google way, though I'll show they're not that different in some respects, and then back to the more specific SRE language and practices. So this is me drawing shapes on a slide, and the argument I'm trying to make, again an oversimplification for the sake of the point, is that in the traditional IT life cycle you have stovepiped infrastructure, purpose-built to deploy a particular app. Projects would start with the purchase order, then there's some life cycle to get the hardware into the data center, which might take months; at some point there's an operating system and you start putting things together, and eventually you get an app, and it's all long-lived and tied together.
So you have to refresh them, and we got a little better as we got some automation and virtualization. But what you see happening in the cloud-native organizations is that they collapse a lot of the complexity of the infrastructure layers, and then they want to spend more and more of their quote-unquote complexity budget delivering value at the application layer. If you go to a lot of enterprise data centers or colos, you walk around and every rack has a slightly different configuration of gear. If you go to a cloud-native, quote-unquote, data center (there's the open hardware stuff, and then there's the Google stuff, not all of it open), and you get an opportunity to see one of these football-field-sized data centers, it's filled with racks and racks of identical gear, because they're really collapsing the complexity of what they have to manage. It's also worth noting, when people cite the ratio of machines managed per sysadmin at one of these companies versus what you see in the enterprise, that a lot of that ratio comes from collapsing that complexity. It's easier to manage a thousand things that are identical than, in some cases, 10 things that are not. A lot of that efficiency comes from collapsing complexity at the bottom of the stack. This is a pattern you see over and over. Then look at what comes up the stack into what I'll call the platform services. Every single one of these organizations, and there's a longer list, built various aspects of the same thing: a self-service provisioning platform for their developers to be able to do work. Google has been very public about how and why they did that.
Amazon hasn't necessarily been as public, but they made some aspects of what they learned and built very public by launching EC2 in 2006, and now obviously everyone is trying to play the cloud provider game, taking all those lessons. Everyone built these things from first principles in slightly different ways, but they built them because they had to; there was no community project helping solve these problems. If you go look at what Netflix built, circa, I'd say, 2010-ish, it basically looks like Kubernetes, but on top of VMs, on top of Amazon. You push a thing; they baked images, their AMIs, just like we bake containers; they had these Java projects, and you can walk through the open source products from Netflix and see the routing and the log aggregation and all the pieces that were the Netflix-Java-specific way of doing it. Now you can map a lot of those same capabilities straight to CNCF projects, so we don't all need to rebuild these things from scratch. So this next one I'm just going to read.

Go ahead. Go back to that slide for a second, because it's interesting. Around 2010 is when they came out with their platform, but just prior to that was when a thousand platforms-as-a-service bloomed on top of their infrastructure. It's almost like they saw the need for a platform in the thousands of small platform-as-a-service offerings that were out there, and that drove them toward some uniformity.

I would frame it slightly differently: Netflix was doing those platform-as-a-service things before there were public platforms as a service, right?

Yeah. We saw, I don't know, a platform as a service for Perl, for everything; everybody had their own platform as a service.
And what Amazon and Google really did was unify that and make it available as a product on their platforms.

Well, there's a deeper philosophical point here, which is that a lot of the platforms as a service failed, right?

Yeah. Oh, absolutely.

Google's first foray into a cloud-as-a-service offering was App Engine, a platform as a service that didn't appeal to a lot of people because they had made it so Google-specific. You had to remap the concepts you were used to onto Bigtable or whatever the Google version was. They could keep all these promises about scale, but it wasn't the paradigm people wanted to use. There's another hour-long conversation about that evolution.

I get that. Someday we'll tease that out.

The insight Amazon had is that they actually went a step down, right? They said, here, you can basically run OSes. The first version of EC2 was: you can have any VM you want, as long as it's black; they only had three sizes. The other thing that came before that is also interesting: the first cloud service from Amazon was the queue, SQS, and then S3. Anyway, there's a whole other hour on the evolution of this stuff. So this I'm just going to read out loud, and there's a reveal, but all this stuff sounds great, right? Remove friction from product development. High trust, low process. No handoffs between teams. Don't do your own undifferentiated heavy lifting. Use simple patterns automated by tooling. Self-service cloud makes impossible things instant. These are great words. Sounds great to me. I did not write them.
These are actually stolen, word for word, from a conference presentation on the Netflix lessons learned, by Adrian Cockcroft, from when he worked for Netflix. So to me, when you're talking about something like OpenShift, the platform, and the goals you have as an organization, it should map to this more or less as much as it can, and obviously we're all at different points in that journey. I don't think a lot of organizations have this as a North Star, or at least their behaviors don't indicate it. The more we can help each other get there, the better off everyone is going to be, because I like nice things, and I want you to have nice things. I made this point earlier, but this cloud conversation evolved from the lessons learned building and operating these services, and that's key: these are services, software services, platform services, infrastructure services. The catch is that you actually have to operate them; they don't operate themselves. This is where the operating models come in, because you can make different choices about who's accountable for the operation of each of these things. So what is operations? We keep saying this word, and there's a whole body of "operations" that means something in a business context that is totally different from what I'm talking about today. For me, operations is about systems operations: building this kind of technical infrastructure and delivering things that way. This is DevOps Days, this is the Velocity conference, this is the SRE book; all these things are part of it. So: metrics are pretty much key to understanding what's going on. Once you have some data, you can hopefully determine whether things are good or bad, and when things are bad, you hopefully alert people.
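To make "determine good or bad, then alert someone" concrete, here is a hedged sketch. The 1% threshold, the window shape, and the function names are my illustrative assumptions, not anything from the talk; real systems would use a metrics pipeline rather than in-memory lists.

```python
# Toy version of the loop: collect request metrics, compute a signal,
# decide whether a human should be paged. Threshold and window are
# illustrative assumptions only.

def error_rate(requests: list[dict]) -> float:
    """Fraction of requests in the window that failed (5xx status)."""
    if not requests:
        return 0.0
    failed = sum(1 for r in requests if r["status"] >= 500)
    return failed / len(requests)

def should_alert(requests: list[dict], threshold: float = 0.01) -> bool:
    """Alert when the error rate over the window exceeds the threshold."""
    return error_rate(requests) > threshold

window = [{"status": 200}] * 195 + [{"status": 503}] * 5  # 2.5% errors
print(error_rate(window))    # 0.025
print(should_alert(window))  # True
```

The point is less the arithmetic than the shape: the good/bad judgment is encoded once, in software, instead of living in someone's head.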
When things need to change, you hopefully aren't doing everything manually; you've got some automation. A lot of what I consider operations really comes down to having a mental model of the system, getting into the middle of it when things go wrong, troubleshooting, and doing the right thing. And then hopefully you learn something from that and can make better changes, better automation, better monitoring for next time. Again, each of these topics could be its own 12-hour lecture series, but this has been the focus of the DevOps Days conversations for the last 10 years, and there's a lot of meaningful material available for you to dig into. These are a baseline set of capabilities. So this is a slide from 2007 that I use in a lot of conversations and presentations. It was made by one of my friends who at the time worked at Amazon; this was sort of the golden age of Puppet and the beginning of the DevOps conversations. You have traditional operations on one side and the new, quote-unquote, "secret sauce" operations on the other. The argument is that the colors on the graph represent quote-unquote toil, the humans doing work: the number of hours of work required to maintain a system. The axis represents the scale of the system as it grows, adding servers. Those numbers seem laughable now; 20 servers seemed like a big deal at the time. The argument Jesse is making, and you can go read the archive from 2007 (I wrote a follow-up revisiting it in 2010), is, in short, that there's a new way of doing things: if you put the effort into the design, you get a very different curve for the amount of human toil required to manage those systems as they scale.
And that was 2007; in 2020 we should be able to compress that curve even more, given the platforms and tools available to us today. But the thing I want to draw out, as we go into the rest of this, is the notion that operations is a secret-sauce competitive advantage. I would argue this is the defining advantage of the cloud natives: their operational excellence. So this is 2006, and this is a pretty famous interview that I'll also just read: "The traditional model is that you take your software to the wall that separates development and operations, and throw it over and then forget about it. Not at Amazon. You build it, you run it. This brings developers into contact with the day-to-day operation of their software. It also brings them into day-to-day contact with the customer. This customer feedback loop is essential for improving the quality of the service." This is an interesting quote in time: it's from the year EC2 launched, 2006, three years before DevOps was a word, but it sounds suspiciously like a lot of the conversations people later had in the DevOps community. And just to give a shout-out to what I mean when I talk about DevOps (if you search, there are hours of me saying things about these words), you basically have a community of practice that evolved around having these conversations, and these are the quote-unquote elements of the DevOps conversation. This is from a blog post John Willis and Damon Edwards wrote after the first DevOps Days in the U.S. They identified culture, automation, measurement, and sharing, and Jez Humble, very quickly after that, added lean, with its notion of continuous improvement. There's no shortage of DevOps content online, but this is a framing for some of the other stuff we're talking about with the capabilities.
Going back to this notion of "you build it, you run it": what Werner is saying is that the software teams are not building all of those other services; those already exist for them inside Amazon. In 2006, for a developer or development team there, getting access to provisioned infrastructure was an API call; provisioning a database was an API call. You've got all these built-in platform and infrastructure services available to the developer. The developers are not building those, and they're not running those; they just have insight into them. What Werner actually means by the quote-unquote two-pizza team running their software is that they run their software, and all these other things are taken care of for them by teams with those other responsibilities. That's something I think sometimes gets lost in translation: you see groups of people saying "you build it, you run it" and then handing developers, who may not have that kind of context and expertise, everything down to installing the OSes, a bunch of things they're not necessarily prepared to do well. So there is some value to that sort of stratification. So, Google: SRE is not really part of the public lexicon until 2016, 10 years after that Werner quote. I already said this earlier, but it's essentially Google's DevOps implementation, and one of the reasons I shared those elements a minute ago is that if you read this book, they pretty much check off all of those boxes. You can read it for free; search "SRE book." And this is my recommendation for everyone: good DevOps copies, great DevOps steals. Wherever you find good ideas, you should take full advantage of them. For the next section I'm going to talk about the specifics of SRE in practice, and this is straight from the book.
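The "provisioning was an API call" point can be sketched with a toy. Everything here is hypothetical: the `Platform` class, its `provision` method, and the team name are stand-ins for a real self-service platform API, not Amazon's actual interfaces.

```python
# Toy illustration of self-service provisioning: the calling team gets a
# running resource back from one call, with no ticket queue or handoff.
# All names here are invented for illustration.
import itertools

class Platform:
    """Hypothetical stand-in for a self-service provisioning API."""
    _ids = itertools.count(1)  # monotonically increasing resource ids

    def __init__(self):
        self.resources = {}

    def provision(self, kind: str, owner: str) -> dict:
        resource = {"id": next(self._ids), "kind": kind,
                    "owner": owner, "state": "running"}
        self.resources[resource["id"]] = resource
        return resource

platform = Platform()
db = platform.provision("database", owner="team-checkout")
print(db["state"])  # running
```

The design point is the contract, not the code: the developer consumes infrastructure through an interface and never owns the machinery behind it.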
So this is Dickerson's hierarchy of reliability from the SRE book. Here you can see that the foundation of reliability, from a Google perspective, is monitoring. Once you have monitoring, you can figure out a bit more about what's going on; now you know something's wrong, so you respond to incidents. Once you've responded to incidents, you learn some things; you do some analysis, and that feeds back in. At the very top you get to the notion of the product: why does this infrastructure even exist? I'm not going to go through all of it; the SRE book is a 500-page book, so for now we'll keep going, but I'm going to come back to this notion of monitoring as the central thing. If anyone hasn't read the Borg paper: in 2015 Google published a paper about Borg that talks about its evolution, about Kubernetes, and some of that story. I also think it's fun to point out and reflect on the fact that in the 2009-2010 time frame, if you had a conversation with someone who worked at Google and tried to get them to talk about Borg, they would stop talking to you. So there's been a shift in the understanding of what counts as a competitive advantage, and that's a fundamental shift; you can see Kubernetes and the ecosystem built around it as a reflection of Google reframing some of the things they used to think were secrets.
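As a concrete picture of monitoring as the foundation: the Borg paper describes tasks that serve their own health and metrics over HTTP. Here is a minimal sketch of that idea; the paths (`/healthz`, `/metrics`), metric names, and JSON format are my illustrative assumptions, not Google's implementation.

```python
# Minimal sketch of a task with a built-in HTTP server publishing its
# own health and metrics. Paths and metric names are illustrative.
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

METRICS = {"requests_total": 0, "healthy": True}

class TaskHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        METRICS["requests_total"] += 1
        if self.path == "/healthz":
            body = b"ok" if METRICS["healthy"] else b"unhealthy"
            self.send_response(200 if METRICS["healthy"] else 503)
        else:  # /metrics or anything else: dump the counters as JSON
            body = json.dumps(METRICS).encode()
            self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo quiet
        pass

server = HTTPServer(("127.0.0.1", 0), TaskHandler)  # port 0 = ephemeral
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

print(urllib.request.urlopen(f"http://127.0.0.1:{port}/healthz").read())  # b'ok'
server.shutdown()
```

The scheduler, the load balancer, and the humans all read the same endpoint: the task carries its own observability instead of relying on something external to infer its state.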
So the Borg paper has this gold nugget that I think everyone misses they get focused on container scheduling and like fancy algorithms so this is this is one this is straight from that paper almost every task run under Borg contains a built-in HTTP server that publishes information about the health of the task and thousands of performance metrics and I have a standing wager and I don't think I'll ever lose this that you will get more operational benefit from building instrumentation observability into your software into your applications into your services then you will navel gazing on optimizing your container scheduling infrastructure right so. Telemetry is everything. Monitoring is the foundation reliability from Google's perspective right so people miss this but what are you going to do monitor monitor monitor so this is all straight from the book and we don't necessarily need to go through it in depth but service level terminology you have to build up to that to get to the talk about SLOs and to me SLOs are the defining feature of SRE so it's worth building up so you have service level in indicators which are basically now you have some monitoring you can look at your stuff and say okay like here's the service level and then you set service level objectives which we'll talk about a bit more and then that is not to be confused with service level agreements which tend to be contractual and maybe you know imply things about money and that kind of stuff so service level objective is and I'll talk a bit more about it but it's basically this three-way contract so just for the sake of thinking about it and not necessarily to be exhaustive here every service is different and in the book there's a thoughtful discussion about what types of service level indicators and service level objectives might be appropriate for the types of things you're building right so user facing systems are slightly different than storage systems and so finding a way to kind of map what 
you're doing not like a cookie cutter paint by the dots but like be thoughtful about what a service level should mean for the particular service that's running and then you know what what kind of that that drives a bunch of decisions about monitoring and then this is also straight from the SRE book but this is the kind of the the golden signals the four golden signals according to google's SRE book our latency traffic errors and saturation and so if you are not monitoring those things right now this is maybe a good opportunity to steal a good idea from someone else and and then think very hard about you know if you're not monitoring this stuff now why not and and if you aren't then what would it take to to get this kind of information and start to think about the you know the meaning of each of these for your particular context uh now that you have SLIs and you can measure these things then you move on to this notion of an SLO so the service level objective and to me this is sort of the defining the defining quality of SRE is really centered on SLOs and this contract so you say you know we want this many nines or this many whatever for these particular indicators and that establishes an air budget you don't necessarily want to have a single dimension for an SLO on a service but you also don't want to make it too too complicated and you know this last point here is worth pointing out that the progress is more important than perfection right so so what do we have measured and what are we what are we kind of looking at today and what can we do to improve that system and make it so there's better tomorrow it's more important than getting it perfect so the this SLO is really a three-way contract between the developers the the business and the operations so the the business is saying reliability is important to us if if this thing is not available then it's not delivering value developers are pushing you know code and then operations is responsible for that reliability 
or the SRE are and so what that establishes is this notion of a service level objective which gives you an air budget and then in the context of the air budget the idea at least from the aspirational kind of perfected version at google is that you can do things with that air budget so I think it's worth pointing out 100% is not realistic I would argue it's actually impossible as you get to these nines so now you have air budgets established you can you can talk about this acceptable level of vulnerability right and for some things that might be minutes or seconds and and especially gets interesting when you talk about building these services that can operate with continuous partial failure right so you so you have some some isolation some concurrency so that some fraction of your system can be down and you know that dovetails into another kind of interesting subgenre of DevOps around chaos engineering and injecting failures and that's all good fun but now we're going to talk about today so now you have SLO you have these consequences so when you're below your air budget in like the perfected version of this you you you have this notion of the developers have self-service access they could do all this stuff when you when you blow your budget we'll go to this next slide then then it changes the dynamic of these things so in the quote unquote aspirational google sense of this when you're below your air budget for reliability the dev team the they deliver features if you blow your air budget then the dev team capacity for work is focused on creating more reliability and this is something lots of orgs welcome like they have a hard time and there's a bunch of like political and organizational reasons why this is hard you know one you don't have SLIs in first place in a lot of places but two like getting getting this idea that oh we're not going to work on features because stuff's unreliable just kind of like seems to blow people's minds but that's the this to me is like 
That, to me, is the defining feature of true SRE in practice, at least as it's aspirationally espoused by Google. Then, moving forward, you have this notion of SRE at Google, and when you look behind the covers, not every project at Google gets SRE. Projects actually start out very close to, very similar to, the model at Amazon, and then they earn the right to SRE support by demonstrating their value. And then the SREs take over the operational responsibility, the on-call, the troubleshooting, after the services have gone through the production readiness review and the application reliability review to kind of retool the architecture to match the promises of the SLO that the SREs are signing up to keep. And just to make this point, because this is something I think gets lost: people see SRE and they're like, oh, it's just like traditional operations, we'll just rename our team to SRE. The SREs are not there to take toil away from the software engineers; the SREs are there to drive toil out of the system. That's the whole point of this: one, the E, the engineering part of SRE, and two, this reliability assessment. And at least aspirationally, from the book, there's this notion of measuring the toil being created by a service, and according to the book, if you as a software engineering team exceed the SLO error budget too much, then the SREs have the right to push all of the operational burden back onto the software team. So it's like: you can't get your act together, you're causing me a lot of problems and work, so okay, now it's all your problem until you get back in line. You know, if you need to go outside to use the bathroom, you can't be a puppy, right? We've got to train you to do this right. So there's a little bit of a power dynamic where the SREs can push the operational burden back onto the software engineers. I think it's
also worth pointing out, and I've had a lot of philosophical conversations with people at Google about this, that SREs are effectively the architects of Google's platform, those platform services and those data services in particular, and in a sense they're also essentially product managers there. So this is straight from the book as well: "SRE builds framework modules to implement canonical solutions for the concerned production area. As a result, development teams can focus on the business logic, because the framework already takes care of correct infrastructure use." So when you're thinking about adopting a container platform and kind of building up these platform services for your own organization, maybe you're not going to adopt the SRE model wholesale, but think thoughtfully about what promises the services you're building can keep with respect to infrastructure use as you make them available to your developers. To go back to the lessons learned from Netflix, we really want to unlock that product development. So we're kind of coming to the end of this. My advice is: think thoughtfully about the different services that you have to operate, about who has the accountability to operate them, and who has the tools to operate them. I really like SLOs. If you think about the way Google has architected itself and built this up, you have these infrastructure services, each of which has SLOs and keeps promises to the platform services above them, right? At the bottom you have the container scheduling, you have Colossus, you have some thoughtful things about how you're going to schedule jobs and store data and the rest of it. Then you build higher-level services on top of that, which also have their SLOs, promises kept to the applications and the software on top of them. And then, last but not least, you have the customer-facing SLOs.
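As a rough illustration of why those stacked SLOs matter (the numbers here are made up for the example): when a request path depends on every layer in turn, the availabilities multiply, so the top of the stack can never credibly promise more than the product of the layers beneath it:

```python
from functools import reduce

def serial_availability(layers):
    """Best-case availability of a request path that depends on each layer in series."""
    return reduce(lambda acc, a: acc * a, layers, 1.0)

# Hypothetical stack: storage at 99.99%, scheduler at 99.95%, platform service at 99.9%
stack = [0.9999, 0.9995, 0.999]
print(f"{serial_availability(stack):.4%}")  # ~99.84%: the app's ceiling, before its own bugs
```

This is why each layer's SLO has to be stricter than the promise the layer above it wants to make.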
Because hopefully there's some business in there at some point. So the advice here is not necessarily, oh, you should adopt SRE practices, but: be explicit about these models. Develop your own understanding of what you're doing now, and make it explicit in a way that you can evolve into something quote-unquote better, right? Progress over perfection. And realize that everyone is in a different place on this continuum of adoption, whether you're talking about SRE, DevOps, or whatever. On the far end of the spectrum there's lots of manual work, there's not very much monitoring, and everything gets done through slow feedback loops and ticket systems. What you want to get to is the enabled state: you build the platform to keep promises, with enabling constraints, which gives you the confidence to allow your developers self-service access to these systems, because you can keep promises. That dovetails into a bunch of interesting conversations about ITIL and these weird misconceptions about segregation of duties, and those are fun conversations to have, but we probably don't have time for them right now. So this is sort of Andrew's simplified version of what you should think about as you're adopting and making things explicit. If you don't have monitoring, or don't have great monitoring, if you can't yet think about the four golden signals or what's appropriate for your services, that's a great place to start investing. As you build monitoring capabilities: okay, we no longer need customers to tell us something's wrong, we can detect that things are wrong, so how do we respond to that? Be thoughtful about incident response and the way you're going to manage incidents, and then build yourself up through this pyramid of reliability. So, what is DevOps? What is SRE? I promised I would give you more questions.
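The "four golden signals" mentioned here come from Google's SRE book: latency, traffic, errors, and saturation. A minimal sketch of checking them against alert thresholds; the thresholds, field names, and example values are invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class GoldenSignals:
    latency_p99_ms: float  # how long requests take
    traffic_rps: float     # demand placed on the service
    error_rate: float      # fraction of requests that fail
    saturation: float      # fraction of capacity in use, 0..1

def needs_attention(s: GoldenSignals,
                    max_latency_ms: float = 500.0,
                    max_error_rate: float = 0.001,
                    max_saturation: float = 0.8) -> bool:
    """True if any signal crosses its (illustrative) threshold."""
    return (s.latency_p99_ms > max_latency_ms
            or s.error_rate > max_error_rate
            or s.saturation > max_saturation)

healthy = GoldenSignals(latency_p99_ms=120, traffic_rps=350, error_rate=0.0004, saturation=0.55)
print(needs_attention(healthy))  # False
```

Even a crude check like this moves you from "customers tell us something is wrong" to "we detect that something is wrong," which is the first step up the pyramid described above.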
Not necessarily more answers, though. Honestly, who cares? I don't necessarily care; these are just words. "What works?" is the more interesting question, in my opinion, and in particular, what works for you: what works today, and what could work better for you tomorrow? And at the end of the day, if it doesn't go to prod, it doesn't matter. Production, or it didn't happen. Can you put code into production, and how fast? And once you get it there, can you keep it running? So those are some questions for you. Thank you for your time. That was a quick run through some thoughts and conversations I've been having recently about how to optimize the operational practice and process around your OpenShift investment. Yeah, it raises probably even more questions than it answers, but it gives us a model, which is great, and that was, I think, your plan for today, to give us some models. But some of the conversation that I can see, and the tensions that you see inside of organizations that are trying, and struggling, to adopt these models: there was one diagram you had in there, the three-way conversation between business, ops, and development, I think it was one giant visual, and the other component, which you added in later, was the customers. And what you often see is this tension inside of development groups and product management groups that are trying to deliver more features versus the stability, dealing with the tensions of accountability for maintaining and developing stable services. In order to get to that optimal goal of the SREs coming in and taking responsibility for the operation of the software services, that tension, I think, is one of the things that you have to tease out inside of your organization: how you're going to, I don't know whether it's Pavlovian, reward them for good behaviors. But that's, I
think, where we see most of the tension: product managers or developers have pressures on them to deliver more features. Exactly, and that's what it comes back down to. Really, where the organizational misalignment or conflict comes from is that you have executives with slightly different compensation models, right? And if you can't align that higher-level mission at the top of an org, the chance that your frontline developers and operators are going to be aligned is zero. So you have to really revisit fundamental assumptions about some of those organizational power dynamics to get to these models. Yeah, and people's bonuses and all kinds of things; if all the bonuses are tied to releasing features, then whether they're reliable or not, features are going to be released. Yeah, people don't like it when you mess with their paycheck, right? Absolutely. I think that the reliability of your service and your software is key, but the tension of delivering, I see it every day, inside of Red Hat and in the companies that we work with, this wanting to deliver more features, more functionality, at higher scale, and that pressure to deliver more while trying not to ignore stability. The other thing that was interesting to me, early on in the slide deck, was the artisanal-versus-industrial conversation: that traditional was artisanal and today is industrial. And then there was another one about delivering the impossible; I think it was Adrian's quote. And the hope that I have is that this industrial-strength infrastructure and these new practices will allow us to do the creative things that we want to, right? That they allow us to empower developers to deliver these new things, these new features, these new apps, these new
services. But then there's the complexity of having to understand what's under the hood. So when you see someone now come to the table with a new offering or a new service, they also have to understand what Kubernetes is, right? That's different from just having to build a web application or a service or a database offering. There's all this extra that you're asking developers to understand, and that's really, I think, another cultural shift we've seen. In some ways I think that's actually wrong. If your developers need to understand more and more infrastructure to do their work... it's not that you want them to be ignorant of it, but you want your developers to be focused on the creative aspect of their domain and the value they're creating, not sifting through YAML. So there's definitely an aspect where understanding more of the stack helps you make better or more globally optimal decisions, but at another level, abstractions that hide some of that complexity let you do things you never could if everyone had to worry about every layer. Yeah, definitely. So, Diane, we've got a question; let me check that here. Muhammad on Facebook asks: how does the nature of an application, monolithic versus service-oriented, impact this process of moving towards an SRE/DevOps culture? Any suggestions on moving a 20-year-old monolithic app towards this goal? I think this is a very interesting question, and a lot of organizations are being kind of forced to ask it. So I think you have to look at what you're trying to accomplish, and I'm not from the school of thought that microservices are always better than monoliths, right? Thinking about what kind of promises you can keep, and why you want to move mindfully to these architectures, is key. Now, when you think about the operating model and these tools and these other capabilities that I
talked about for operations, one thing to keep in mind from the very beginning: let's say you have an aspiration to build a microservice architecture. When you have microservices, you have more deployments, and you have more things that need monitoring, right? So if you have a high fixed cost of deployment, in terms of the work, the automation that isn't there, the testing, whatever it takes to have confidence in the quote-unquote release, or you have these unmonitored systems, and then you go to a microservice architecture without that platform support and without changing these operational models, you've actually made more work for yourself. Part of the microservice architecture is predicated on having these quote-unquote DevOps capabilities, having these platform services available to you, because if that fixed cost of deployment stays high for each new service you add, you just bury yourself in toil. So getting to where the fixed cost of deploying a new service is essentially negligible is where I would start when moving towards that architecture. And then, from there, go through the hierarchy of reliability. If you don't have a well-factored monolith that's monitored in a meaningful way, the chance that you're going to end up with a well-factored microservice architecture that's monitored in a meaningful way is quite low, right? So let's build up that organizational competency, the muscle memory around those things, with the monolith that we have, and then meaningfully take pieces of functionality out of the monolith over time. Because I also think the big-bang rewrite approach to going from monolith to microservices tends to lead to catastrophic failure. So my advice: don't do that. It's also true that with a monolith, you can probably take a part of it, deploy that first, and figure out which pieces you can break away.
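The fixed-cost-of-deployment point can be made concrete with some back-of-the-envelope arithmetic (every number below is invented for illustration):

```python
def monthly_deploy_toil_hours(services: int, deploys_per_service: int,
                              hours_per_deploy: float) -> float:
    """Total human hours spent on deployments per month."""
    return services * deploys_per_service * hours_per_deploy

# One monolith, deployed weekly, with 4 hours of manual work per release
monolith = monthly_deploy_toil_hours(1, 4, 4.0)       # 16 hours/month
# Split into 20 microservices while keeping the same manual process
unchanged = monthly_deploy_toil_hours(20, 4, 4.0)     # 320 hours/month: buried in toil
# Same 20 services once deployment takes only ~5 minutes of human attention
automated = monthly_deploy_toil_hours(20, 4, 5 / 60)  # under 7 hours/month
print(monolith, unchanged, round(automated, 1))
```

Toil scales with the number of deployable units times the per-deploy cost, which is why driving the per-deploy cost toward zero has to come before multiplying the number of services.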
With the pieces of the monolith that you can break away, you can try the new architecture out and deploy incrementally. That, I think, leads into a modernization conversation. But to me, I kind of break things into a few buckets. If something is not causing me operational burden, doesn't need to scale to keep the promises I need for my org, and doesn't need to change rapidly, right, so if something needs to stay the same, isn't a problem to scale, and isn't expensive to operate, I'll just leave it alone. Yeah, right. There's no reason to move stuff that's working just fine, unless you have some high need to do so for operational reasons. If I want to explore some new features that are going to take advantage of an agile product development life cycle, let's move those into architectures where we can have more rapid feedback cycles with that customer engagement, right? So that's a motivator. If I know I'm having problems keeping up the reliability or the scaling of a particular architecture, let's get to the new event-driven or whatever your vision is for the architecture that keeps those promises. Or if it's expensive for other reasons, expensive in terms of human costs or licensing costs or whatever, that could be a motivator. But if it's not one of those three things: monolith for life, baby. That's going to be the new t-shirt: monolith for life. Monolith for life, I don't know. On that note, not all monoliths... I know, it's interesting; it's all interesting. So this conversation, and more conversations like it, will keep happening on Fridays at this time. We'll bring on more folks from the Office of the CTO, as well as other talking heads and people from this space, to help you all with your transformations. We're really glad that Andrew could join us today to make this happen. And if you want to get a hold of us, it's really
easy: you can tweet at him at @littleidea, and we will post this video with his credentials and how to get a hold of him. We are also launching a Transformation SIG, so there'll be a landing page up soon on commons.openshift.org with links to this video and others, as well as a place to sign up to join us and get announcements about who's coming on deck next. And if you have a topic you want to hear about, let us know; we will try to find someone to talk about it, or make you talk about it, which is even more fun. Yeah, come on and talk with us; that'll be fun. So definitely do that. Thanks again, Andrew, for joining us today. Thanks for having me; it's been a pleasure. Lots of food for thought there. So take care, have a great day, and stay safe out there. All right, stay safe, everybody. Cheers. Cheers.