Please welcome to the stage Craig McLuckie, until recently product manager for Kubernetes at Google and now leading yet another Kubernetes-related startup.
Hey folks. So that last question was a brilliant segue to what I want to talk about. I'm going to talk about cloud native operations. I was asked to come here and talk to you about what I was geeking out on next, what I was thinking about. It's always a dangerous request, because you're likely to hear me geek out on config, and not everyone likes config as much as I do. But for today, I'd like to spend some time talking about the impact of these technologies on your operations story, and the value proposition of container technologies not just as a way to package up and deploy your applications, but as a way to live with them, as a way to create more agile organizations, as a way to move forward into a much more progressive way to build technologies and run them. So I'm going to go through a kind of flow-of-consciousness perspective on what I'm calling cloud native operations, or cluster operations: how it's likely to apply to the enterprise, how it will change your perspective on running these systems in production, and how it's actually going to accelerate your organization spectacularly. So just bear with me a little bit. And because I'm talking about the future, I figured I'd just use random science fiction quotes to illustrate the point. This one's a little bit better, but whatever.
Okay, so let's set the scene. This is where we are today, and it's a very interesting time for software companies. We're at a time when you have companies like Ford trying to figure out how to become a software company, at the same time as Tesla, which is effectively a software company, is trying to figure out how to become an automobile company. The world is increasingly competitive, and software is probably the most powerful business tool out there today. It's becoming an absolutely critical
differentiator for almost every enterprise out there. How effectively, how well you can engage with technologies to solve business problems: it's huge. And anything that creates friction in that path to getting your production systems out there, to actually being able to evolve your production systems to match the evolving needs of your customers and the evolving needs of your business, is going to slow you down.
So the next point of observation is that software is eating the world. Andreessen said that, and I think it's been quoted a gazillion times; now it's a gazillion and one times. But open source is eating the software world. When we look at the way that businesses want to engage and operate, they've become much more invested in controlling their own destiny. I speak to a lot of very big banks, very big insurance companies, very big health care companies, and they're invested in not just breaking the deadlock of a lot of these more traditional enterprise vendors. They're also highly invested in actually engaging and controlling their destiny, styling themselves as software companies, participating in developing the software that runs their business, and making sure that it actually achieves the outcomes they're looking for.
And inside the lens of that, cloud is happening. When I say cloud is happening, everyone looks at it like, okay, you've solved the infrastructure operations problem. That's really neat. I can go get a virtual machine, I can run whatever I want. It's this move away from the world where I used to have to think about buying infrastructure, racking and stacking it, dealing with it, and dealing with the depreciation of it. I now have a much more agile way to buy the infrastructure that I run. That's not what I mean. Cloud is about service, right?
And my own personal journey here is that I came from Microsoft. I spent 12 years in the big house, hard time by the way, building enterprise software. And I got dropped into a team at Google, and these are the kind of crazy cloud guys. I don't know who was more shocked. I was shocked arriving in this incredibly dynamic environment where people just had this bias to action, the ability to get technology out there. If a customer didn't like it, you'd roll it back. If you want to find out whether the customers like it, you can deploy it to 1% of your folks. It's a fundamentally and profoundly different way to think about building technology and delivering it. And here I was, the old enterprise guy, like: no, we need to get it just right, and we'll spend three years and we'll throw it over the wall, and then we'll figure out whether the customers like it or not. It was shocking.
And as I look at the cloud companies that are succeeding, they're either companies like Amazon, which just got there way before everybody else and had time to figure it out; or companies like Microsoft, who actually bought their way into the space with Bing and with the Xbox Live assets that they built. It taught them how to deliver service. It taught them how to think about technology as a service, and I actually think if Microsoft hadn't done that they would have really struggled. And then companies like Google, where this is just directly in the DNA. There's just no question about them thinking this way.
And when we think about this transition to cloud, it's about adopting this model, which is thinking about your technology more as a service. It's a living thing. It's an agile thing. It's something you can update and tune as you go. And this presentation is about how you operate it. It's not about how you build it. The building is a component of it.
It's about how you actually live with it.
So let's look at the history of operations. First of all, there were the dark ages of operations, where the developers had direct access to the machines and weird things happened. The response to that was to create this canon around system administrators. These are professional, serious people who are responsible for owning and managing and configuring your production systems. So it creates this natural tension. You have the developers, who want to go fast. They want to get stuff out there. The system administrator is like, whoa, not so fast, let's get it right. And the basic atom of work in this world of system administrators is the ticket. So hey, I want to get a production change out there: file a ticket. I want to get a new server: file a ticket. I want to update some kind of setting: file a ticket. And then an operator takes that ticket, and somewhere between a few hours, a day, a week, or three months in some cases, the final outcome happens. And it worked. It reined in the dark ages of developer-driven deployment, and it created a system whereby you could consistently, reliably get things into production, but slowly. And frankly it just doesn't scale that well: as you scale the infrastructure linearly, you need to scale the set of operators along with it.
So the next change was this idea that the heroic developer can do pretty much anything. Code is an amazing tool. And if you read this quote, it's a Robert Heinlein quote.
I mean, it's kind of crazy, this idea that you want to create these perfectly well-rounded heroes who can go in there, code the heck out of a problem in Java or whatever the development language is, and then code the heck out of how to get that into the production environment using one of these DevOps tools. And it actually is pretty neat in some ways, because you've systematized the process of getting something into production. You can create a recipe, and it becomes repeatable. There's much less toil. It's a lot easier to actually get things out there. And you get to this point where your atom of work, the atom of operation, is this integration. So I can write some code, I can run some tests, I get some CI/CD stuff running, I can get my workload into production, and it's great. It works really well.
Except when it doesn't. And when it doesn't, things get really weird, because what you're effectively doing is running a lot of imperative code in a production environment. A scaling event happens: the first thing I'm going to do is turn up a piece of infrastructure and get something running, and then I'm going to step into that and run a bunch of imperative code to get it configured just right. And if something goes wrong, heaven help me. So it's a neat framework, and I've seen a lot of success with it, but it is a sharp tool, and it requires your developers to be these extremely well-rounded generalists.
So there's a third way of running operations teams that I've seen a lot at Google, and this is what I call a cloud native operations model. It's a little different from the world of system administration, and a little different from the world of DevOps. In this world, you have a set of professional teams that are responsible for delivering common services at the application level to your developers, and the basic point of integration becomes an API. So if I want a new
cluster, I call an API; if I want a new service, I call a provisioning API. At the back end of that there's a professional set of teams that are automating like crazy to make sure that when that API gets called, I actually get a properly provisioned system. I mean, this model was always possible. It's become extremely relevant as clustering technologies like Kubernetes or Mesos or Cloud Foundry are starting to emerge. It creates a much more programmatic framework where people can start specializing and delivering these operational frameworks. So it's awesome. Of course, it is relatively new, and so I'm going to talk a little bit about what this means.
So let's dive into some of the attributes of this. What are the ingredients you need to assemble to get a cloud native operations environment working? Well, it starts with having this idea of logical infrastructure, this cluster environment, and we heard a lot about this earlier with technologies like Kubernetes. The idea is that instead of deploying your application and reasoning about your application being tied to a piece of physical infrastructure, you're handing your application off to an autonomous subsystem that will figure out how to map it onto the infrastructure in an optimal way. It has some nice advantages. It's incredibly efficient, but it also removes toil. The role that was previously filled by a human operator tediously going through the steps of configuring the system, or by a piece of random code that someone wrote in a bespoke, hand-crafted way, is now filled by an autonomous system. And it turns out there are some things that machines do better than people, and one of those things is deploying software.
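To make the idea of handing an application to an autonomous subsystem a little more concrete, here is a minimal sketch in Python. It is an illustration only, not how Kubernetes or any real scheduler works: the `Machine` class, the `schedule` function, the node names, and the CPU figures are all assumptions made up for this example. The developer states what each task needs, and the placement onto machines is decided for them.

```python
from dataclasses import dataclass, field


@dataclass
class Machine:
    """A hypothetical node with some free CPU capacity."""
    name: str
    cpu_free: float
    tasks: list = field(default_factory=list)


def schedule(tasks, machines):
    """Place each (task_name, cpu_request) on the first machine with room.

    Returns a dict of task name -> machine name, or None if nothing fits.
    """
    placement = {}
    # Place the largest requests first so big tasks are not squeezed out.
    for task_name, cpu in sorted(tasks, key=lambda t: t[1], reverse=True):
        for machine in machines:
            if machine.cpu_free >= cpu:
                machine.cpu_free -= cpu
                machine.tasks.append(task_name)
                placement[task_name] = machine.name
                break
        else:
            placement[task_name] = None  # no machine has enough free CPU
    return placement


# The developer only declares what is needed, never where it runs.
machines = [Machine("node-a", 4.0), Machine("node-b", 2.0)]
tasks = [("frontend", 1.0), ("cassandra", 3.0), ("indexer", 2.0)]
placement = schedule(tasks, machines)
print(placement)
```

The point of the sketch is the division of labor: the caller never names a machine, so the operations team is free to change the placement logic without touching any application.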
There are a lot of other things machines do better than people, but this is just a prime example. And so if you look at the way that Google runs, all of the infrastructure is logical infrastructure. You don't think about a physical machine. You think about a job or a task or a deployment.
The next observation and attribute of this is that you have to be relentlessly focused on automation. You have to be the laziest person from a toil perspective. You have to love automation and automating pieces. So this works if you have this constant and relentless focus on reducing toil in the operations environment. If anything can be done by an autonomous system, have an autonomous system do it. Spend the time to get it right, create a specialized function that's actually delivering professional services around this, and good things happen.
The other attribute here that is really important is this idea that you create specialized roles. Today, DevOps assumes that your developer is a relatively generic person who can do all of the things that Heinlein said. In this world, your developer is able to focus largely on solving business problems, and you have other people with operations expertise who deliver a set of common services to the developer. So you may have a team that deals with infrastructure operations. They will rack and stack and get you to a point where you have a cluster environment anywhere. It could be in your on-prem environment, it could be on Amazon, Microsoft, Google, wherever. Long story short:
It's one team that gets you to the cluster environment. The next team has a common cluster environment to start from, and they get you to a point where you have a set of common services. So instead of each engineering team having to worry about how to configure and install Cassandra, trying to find some template on the web and deploy that into the environment, you have professionals who spend their life doing this, and they get really good at it. So you get the specialization.
The next piece of this is that you have these shared services, which I've already talked about, where as a developer you can declaratively describe the set of pieces you want. You don't have to package them up with the application. You don't have to reuse them as part of the application. They just show up in your environment and are prepared for you by an expert.
And then the final piece of it is this autonomy, this autonomous automation: having expert systems that will get your environment configured just right, that can observe the state and the health of your system and make informed decisions.
So when that paging event happens that we were talking about, 95% of the time the system is going to recover it for you and deal with it, and you're only going to have to deal with situations that are broadly out of bounds of what these systems can do. So Kubernetes, with its control metaphors and a lot of the patterns that have been introduced there, provides you a very powerful framework to avoid having to deal with operations toil and with things like optimization.
So as a result of that, we start to see these new roles emerge, and these become specialized roles. In the old days you had a system administrator and a developer. Then you had just developers, or DevOps. I don't even know what a DevOps is exactly, but a poor developer who has to deal with both the operations and the development component. Now what you start to see happening is that these new roles emerge. These don't have to be different people; in small teams it still applies, you just wear different hats at different times. But it's really important to focus on the emergence of these roles. Someone deals with the infrastructure, someone deals with the cluster, someone deals with common services, someone deals with application operations. That could be the person who built the application, or if the application is big enough you may actually stand up a discrete application operations team. And then at the end of the day you have the developer, who is empowered. They're a generalist, and they don't need to worry about everything else.
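The automatic recovery described at the start of this passage, where the system deals with most paging events on its own, is commonly built as a control loop that compares desired state with observed state and emits corrective actions. The following Python sketch shows one pass of such a loop. It is a simplified illustration, not Kubernetes' actual controller code; the `reconcile` function, the action names, and the `replica-N` naming are assumptions invented for this example.

```python
def reconcile(desired, observed):
    """Compute the actions that move observed state toward desired state.

    desired:  number of replicas that should be running.
    observed: dict of replica name -> True if healthy, False otherwise.
    Returns a list of (action, replica_name) tuples.
    """
    names = list(observed)
    keep, surplus = names[:desired], names[desired:]

    # Too many replicas: stop the extras.
    actions = [("stop", name) for name in surplus]
    # Unhealthy replicas we intend to keep: restart them.
    actions += [("restart", name) for name in keep if not observed[name]]
    # Too few replicas: start the missing ones.
    actions += [("start", f"replica-{i}") for i in range(len(keep), desired)]
    return actions


# One replica has failed and one is missing; no human needs to be paged.
actions = reconcile(3, {"replica-0": True, "replica-1": False})
print(actions)
```

A real controller would run this comparison continuously and apply the actions through the cluster API; only the cases where the loop cannot converge would reach a human.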
They become much more efficient by nature.
And so as a result of this specialization, one of the things that happens is you get really good at something. You become a specialist, you become an expert, and it lets you take your game to the next level. And one of the things that cloud native technologies do, by creating this separation of operations roles, is let operators take their game to a new level. So look at the SRE folks at Google. Has anyone here read the Google SRE book that just came out? If you haven't read it, you should. It's a really interesting book. It captures a lot of tribal knowledge from Google, and it gives you a little peek into the mindset of the professional operator, the person who's responsible for and passionate about not just building something but actually running it as a service to an organization, to an IT team, to a set of customers. And it's very distinct. And again, I keep coming back to this: the SREs are both the geekiest and laziest people you'll ever meet. They will automate the heck out of everything. If they ever find a task that's not automated, they regard that as toil, and it's an insult to them, and they would rather have an autonomous system deal with it. And so they focus on creating a set of APIs that remove them personally from the flow of operations, so they can go back to drinking or whatever it is the SREs do when they're not actively dealing with issues.
And the other thing that's interesting when you start spending time with the SRE teams is that it turns out they're actually not all that lazy. They do some stuff which is kind of neat. They take service-level monitoring to a new level. You'll find a lot of people out there who'll say, look, I monitor errors, right? The SREs don't do that.
I mean, they'll obviously monitor errors, but they'll also monitor traffic, they'll monitor latencies. They'll observe all these things at the median, and then at the 99th percentile and the 99.9th percentile. And they'll also monitor things like saturation, how much resource is actually being used, and they'll start to create these observations around how saturation impacts error rates and how saturation impacts latency. They get to a point of maturity in how they think about the services they're managing that lets them become much more nuanced around capacity planning, and much more nuanced around understanding what happens when things go wrong.
The other thing is that they delight in planning. A lot of these specialized operations teams can actually take the time to do things like generate an incident response playbook. They can start to do disaster testing scenarios. If your sole job is to deliver a Cassandra cluster to an organization, and you don't have to worry about 99 other technologies, and you're managing these clusters over and over and over again, there's a chance that you will have time to actually think about what happens when a ring goes down, that you will be able to create a systematized playbook, and that you will have a better shot at getting it back up than an individual development team who's only been dabbling in this technology and has found a template online and got it running in their environment.
They also get really fancy with the way that they manage their applications. Matt and Chris were talking a little earlier about some of these more nuanced deployment approaches: how to think about taking a technology and doing blue-green deployments, or how do I actually run an experiment?
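As a small illustration of looking at latency at the median and at the 99th and 99.9th percentiles rather than only counting errors, here is a minimal Python sketch. The nearest-rank method used here is one common convention among several, and the `percentile` helper and the synthetic `latencies_ms` data are assumptions made up for the example, not anything taken from Google's tooling.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Synthetic request latencies in milliseconds: 1, 2, ..., 1000.
latencies_ms = list(range(1, 1001))

summary = {
    "p50": percentile(latencies_ms, 50),
    "p99": percentile(latencies_ms, 99),
    "p99.9": percentile(latencies_ms, 99.9),
}
print(summary)
```

The tail percentiles matter because a service can look healthy at the median while one request in a thousand is painfully slow; tracking them alongside saturation is what allows the kind of capacity planning described above.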
And so these teams create these operational frameworks, using the logical infrastructure, that let you do more interesting things, and as a result you get much more practical ways to actually deploy a technology. You can imagine a centralized team that's running a mission-critical data service deciding to do a company-wide update and bringing down the entire organization. That would be a horrible outcome. So the nice thing about these teams is that they start to be able to say: well, you know what, we understand the application portfolio. There's some stuff that no one cares about. Let's get two of these services running in production, and we'll slowly vector load across to that other service. Technologies like Kubernetes let you do that, and let you create these operational models where you can be much more nuanced about how you get technology out there. And they let you test things. You can vector a small portion of your load to a new framework and see if your users like it, see if it changes some of those core metrics. You have the data, you're able to run it better, and operations gets to this new level.
So it's not enough to just say, okay, we're going to do it, we're going to be all in on cloud native.
Here we go, we're going to stand up our operations team. It turns out you probably need to think a little bit about the applications as well, and invest some time in creating the right architecture that actually supports this new operational paradigm. And one of the things that I'm most excited about when I think about designing agile systems, and creating systems that are much more operationally viable, is this idea of an almost continuous spectrum of decomposition. You walk into an environment, and I've seen this a lot recently, where an organization will have a big old monolithic application, and the first thing they'll do is look at it and say: hey, I want to deploy this into a container environment. Could I jam it into a single container? Well, probably. Is it a good idea? I don't know. Maybe not.
The first part of decomposition is to start extracting some of those pieces that are not intrinsically deeply coupled into separate containers. So look at the way that Google would, for instance, deploy a front-end serving component. It might have an HTTP serving component, it might have a log roller and a data shard updater. The first thing is that those things will be put into different containers. They're not entirely decoupled. They're still deployed together, they're still tied together, they still have access to shared resources. But if I need to update one of them, I can do that without disrupting the rest. If one of them goes crazy, I can set reasonable bounds so it doesn't disrupt the behavior of the other components. So the first step en route to better operations is decomposition, and not treating a container like a VM. A container is not like a VM.
A container is actually much more awesome than a VM. Containers let you piece things together in more natural fashions.
And then, when you layer that on top, the next thing is to say: okay, I have this monolith and it's relatively tied together. I've pulled out the common pieces and I've put them in containers. Can I now start putting some of those things behind stable interfaces? Can I start identifying subsystems that I can put behind a stable interface and run as a discrete service? So you have this monolith, and it looks like people are taking chunks out of it. And it's actually amazing how quickly an IT team that's dedicated can turn a monolithic application that's relatively tightly coupled into a reasonably well-factored cloud native application, by focusing on the functional areas, by chewing them off, by putting them behind a stable interface, and then dropping them into something like Kubernetes. The application continues to operate just as you'd expect, but now you have a way to reason about the pieces, a way to evolve the pieces and operate the pieces, without having to focus on it.
The next step is then to start looking across these applications at what set of things I want to reuse. Can I take that component that's not a service, can I create a standard template or standard deployment framework, and then provide it to other people so they can stamp out their own versions, right?
So now you have a microservices reusability or reuse framework, and you can start to put an operations team in charge of managing each of these discrete pieces. And then the final step is to actually promote those pieces to, like, an heroic service, where there's an API that's used to provision them and they are operated by professionals. There's a common standard interface, and it becomes a standard asset for all the developers. So you can see a path, as an empowered enterprise, to go from these relatively monolithic, difficult to deal with, difficult to operate systems. Just over time, start loosening up the pieces: decompose it, get it into a more structured form where you actually have an intelligent subsystem that's going to operate it for you, define the health models for those pieces, create those stable interfaces, and you're off to the races. You're now in a position where you can start to treat it like a more progressive system and create the specialized operations around it. And at every point of aggregation, every time you aggregate and create a shared component, you reduce the number of different configurations that are deployed, and you have an organizational opportunity to specialize. You have a chance to create a team that is expert at dealing with that thing, and it means that other teams don't have to be. And so that pursuit of specialization is one of the key attributes here.
And then one of the final things I want to talk about is not just the operational model, not just the architecture, but the organizational structure that emerges around this. I don't know if people have heard about Conway's law. I bet if you haven't heard about it before, you're going to hear about it a lot in the next couple of years. Conway speculated that system architectures follow the lines of communication of the teams that design them, right?
So if you have a big monolithic user experience team that is communicating through a single point of contact with some of the other teams, you will tend to create a monolithic front-end component. He observed this over time. And one of the really neat things about this approach, this philosophy of cloud native systems, decomposition, and specialized operations, is that your teams can get a lot smaller. You no longer have to start each team with an expert on every subsystem you want to use, that you're necessarily building and then operating. You no longer have to start your team with the DevOps capabilities. You can create much smaller teams that have access to these robust services. They don't have to operate them. They don't have to deal with them. They can just consume them. And as a result they become smaller, they get much closer to the business, they become much more nimble, and you have created a strong value multiplier for your organization.
And so at the end of the day, this leads to a lot of good things. It gets you to a point where you're operating more efficiently. Specialization is powerful. Having people who are really good at operations makes your systems run far, far more efficiently, and you get better use out of the infrastructure. So there are just a lot of really neat things here.
Of course, it's early days. We've got a lot of work to do as a group, as a team, as a community, to get to this point. We have some really strong foundational technologies. We've made a lot of the down payments.
We need to get there, but it's not enough to just dream this. We need to really get together and get better tooling, tooling that provides specialized operations capabilities at the cluster level and at the application level. We need to deliver better playbooks on how to even approach this: how to start to reason about the functional decomposition of monolithic applications, how to deal with cluster operations, how to deal with incidents. There's just a lot of work to be done to realize this, and I'm really excited to look to the next couple of years as a community and see where this all goes. So I will pause there. I think that's 26 minutes. And see if there are any questions.
Any questions for him besides what's the name of his new stealth startup? If not, we'll say oops.
So a couple of examples. Debugging: one of the things that happens today is an application goes sideways, and you can SSH in and access the local logs and try to figure out what's happening. You can get into whatever debugger, so to speak, you want. In the world of cluster-based operations, first of all, the application may have been torn down because it went into a bad state. It may have been rescheduled somewhere else. You don't necessarily want your developer to have access to the physical machines where it's running. And so providing better diagnostics and analytics tools that let the developers actually understand what went wrong, without having to physically access the machine, is necessary. That's one tiny, small example. There are a lot of others, around things like: hey, I want to run a cluster in an organization. How do I do things like departmental chargeback? If it's becoming a common service, how do people think about that?
And then for the common services, I know the Deis guys are doing some really awesome work around templating, but we still have a long way to go before we can actually create clean, reusable, deployable components to stand up services. So I think the foundations are there. I think there's a whole ecosystem of capabilities that need to emerge.
Yeah, so I think there are a couple of things. Let's just break them into a couple of groups. There's a set of what I call distributed system services. A lot of people want to run something that, for instance, runs in a master-elected pattern. Standing up Raft or Paxos is really difficult. So initially, at least, you need some cluster services that let you say: hey, this is the master, this is the slave, et cetera. So there are a lot of basic distributed system services that need to exist. That could be storage, it could be quorum, it could be naming, discovery, et cetera.
The next layer up is what I call common application services. This could be things like Cassandra or Mongo or whatever your storage asset is. This could be an indexing framework. There are just a lot of free, open source packages that could be built and deployed as common services, so that when a developer says, hey, I'm deploying a three-tier application, this is the storage asset, I want to get MySQL running, there's a team that can actually provision it and run it. The cloud provider could do it for you, but on-prem it would be nice to have that same basic experience, and actually have a service that's semantically equivalent to whatever the one you're using in the cloud is.
Then the next thing you get to is where you start to create domain-specific services that are useful for your organization. You're a shipping company.
You might want to do that lat/long to zip code lookup. Deploying 480 libraries that contain that information everywhere just doesn't make sense. Having one team that actually just runs the zip code lookup service behind an endpoint makes a ton more sense. So you can start to create these domain-specific services that are relevant to you. And obviously every company in every domain is going to have a different set of services.
So that's a great question: how do you actually deal with the inherent tension between change and not changing these things? There's probably a dark organizational science to dealing with change and innovation. The first thing that I think is essential in any of these situations is that you have to have stable interfaces. To be able to change anything on either side of the divide, you need to be able to create stable interfaces. The second thing you need to encourage people to do is make sure you have the capability to run multiple versions of anything, ever, in production. And as a result, if you have a stable interface and the ability to run multiple versions, you create this natural tension where the target organization can always stand up their own rendition of what you're doing. They can fork, kind of like the open-source community. There's always that possibility of forking. Because what you're using is templatized and has a strong structured basis, there's that tension, which is that an organization that needs to be fast and has a legitimate business need can always stand up one of their own. And that creates a natural dynamic tension in these organizations.
So You know and how that plays out that there's an organizational science to that But I would say that the key thing is is reproducibility stable interfaces Make sure that everything you deploy is is templatized have the ability to run multiple versions of anything And then that gives you a good framework to create the right dynamic tension between the service provider and service consumer That's a great question. Yeah, there's there's some interesting stuff that that just cannot be delegated out So governance, risk management compliance security posture, etc One of the lovely things about these technologies clustering containers is that your level of introspectability and determinism becomes very high You can define policy and enforce policy autonomously through your stack if everything every deployment is being driven through a structured API You can apply policy at that API event if everything is declarative your configuration the bits you're running, etc You can define policy to make sure that it's enforced at the runtime level So you actually have a much more robust set of tools to define and control policy But one of the things that to you know the earlier question around what needs to be built The set of tools to actually define and enforce, you know policy and provide introspectability At the cluster level or cross-mobile clusters is essential for this this to work