 How's everyone doing today? Yeah? Cloud Foundry Summit. So it's time. It is time, and we are here. And I'm Andrew. And you're going to hear from my friend Luke. And this is a talk that was submitted and accepted in a week. And they asked me to reprise some of the material that I presented at Google Next. And they asked me to do it in less time. And I added more content. So in the next 15 minutes, I'm going to go through about 60 slides. And I'm going to waste some of them having fun at my own expense. So this is about bringing Google CRE to Cloud Foundry. And what I'm going to spend the next 15 minutes doing is explaining a little bit about myself, a little bit about these trends in the industry, a little bit about SRE. And then I'm going to finish with setting up what CRE is, talking about Google stuff. And then I'll let Google talk about some of the work they're doing with Pivotal and Cloud Foundry. So an alternative title to this talk, which is a talk that's recorded that was from an earlier CF summit. And it's one of the, I think it's a pretty good talk, I gave it, about platforms and how this trend is enabling us to do things in a more DevOps manner. And I actually argue that every DevOps project ever, I'm going to talk a little bit about that, was an attempt to build these types of platforms that we're finally seeing with some of the stuff in Cloud Foundry. An alternative title is also the freedom of constraint. And last but not least, the last alternative title is CRE all up in your business. So Pivotal, some of you might have heard of Pivotal here in the Cloud Foundry ecosystem. And we have an explicit mission statement to transform how the world builds software. And I'd also add to that my personal mission, which is related but slightly different, which I've been working and I'm going to talk a little bit about myself for a moment, to transform how the world operates software, because day two matters. Day one is adorable. Day two is forever. 
So I'm me, Andrew Clay Shafer. People know me a little bit from Puppet, from DevOps Days, from Cloud Foundry. There's other things I did. I wrote a book for O'Reilly about web operations, which is related to all this. And then last but not least, I'm the foremost advocate for the best methodology for delivering software, artisanal retrofuturism crossed with team-scale anarcho-syndicalism, which I know you're all familiar with. And if you're not, you should go familiarize yourself. Of course, Pivotal pays me to do this. And for your benefit, this is the most important slide, my Twitter handle and the various shapes and sizes of hair and beard that you might see me configured in in the future. So if you see any of these configurations, don't be afraid. It's the same guy. And if you're interested, I'm available for weddings, bar mitzvahs. "There Is No Talent Shortage" is one of my talks, it's on my Twitter. You should go spend 40 minutes watching that. Moving on with more jokes that only have meaning to me. Pareto-efficient Nash equilibria rule everything around me. I'll explain that in the hallway afterwards, if you're not already familiar. Who knows what that means? See, you haven't watched enough of my talks on YouTube, clearly. Who knows what game theory is? Who knows what a Pareto or a Nash equilibrium is? All right, we'll talk about it later. It'll be fun. 2007, this is the story. This is the story that we're all part of. If you're at Cloud Foundry Summit, you're doing things with platforms, this is the story. Operations is the secret sauce. This is a blog post from O'Reilly Radar in 2007. Who reads O'Reilly Radar? Does anyone know what Radar is? So if you're in Silicon Valley, which you are right now, there's this thing called O'Reilly Media. And they wrote all the animal books. Who's ever read any of the animal books? The tech books. And this is their blog that is sort of forward-leaning reflections on technology and trends in technology. 
And what this is saying in 2007 is that the traditional way that people have done IT is at a disadvantage to this new secret-sauce way that people are doing IT, especially with respect to scaling. So what this graph is showing on the y-axis is the amount of hours spent doing work, which I'm going to call toil from now on. Thank you, Google. And on the x-axis, it's supposed to be the scale of the system, the number of servers that you're managing. So what it's arguing here is that there's the old way to do things, which does some work in the beginning, maybe not as much work, and has one curve with respect to how many hours have to be spent managing those systems as you scale. And then there's a new way, which has a very different slope, a very different linear slope, to the growth of the toil that you have to spend managing those systems. This is 2007. I've been saying this and referencing this for 10 years now. And most people still don't get it, right? There's kind of a word for it now, people say DevOps. And we'll talk about that in a minute. But this is the thing. You either compete at this level or you lose. That's what I believe. That's what I see. That's what's reflected in the stock price of companies like Blockbuster. Does anyone remember renting VHS cassette tapes? So there's another part of this story that ties into Google. Does anyone know what this is? Do they even know who these people are? Anyone know who James Watters is? Derek Collison, Mark Lucovsky, Vadim. Has anyone ever used Cloud Foundry or BOSH? So this is a core team that was hired out of Google, specifically to build Cloud Foundry at VMware in 2009, circa 2009, 2008, whatever. That is the heritage of the project. It is born in those ideas. And at the time, and I remember the days, if you said the word Borg in a room with people that worked at Google, they would all look at each other and then they would stop talking to you. In 2008, that was true. The conversation was over. 
We'll talk about the Borg a little bit more. So this DevOps thing, everybody wants it. Everyone wants that second thing, the new secret sauce. Who wouldn't want the secret sauce? This is what most DevOps conversations sound like to me. It's like, OK, that sounds delightful. And pan is vomiting rainbows, of course. This is what DevOps means to me. I'm going to tell you what DevOps means to me. Basically, I don't know if I have the moral authority to do that, but I'm going to do it. I wrote a blog post in 2010, so it's been seven years since I wrote this blog post. And it was a blog post basically entitled "What DevOps Means to Andrew." And at the time, I wrote about developers and operations working together, which is controversial in some places. That system administration is evolving to look more like software development, which has only continued. So this is four years after EC2. And when I wrote this blog post, I could configure servers with APIs. I could provision servers with APIs. I could add servers to monitoring with APIs. And when I'm writing all this against APIs, when I'm doing all this system engineering, system administration against APIs, that looks suspiciously like software development. And I can use all these tools and processes that I leveraged building software in my system administration work. And then last but not least, and I think this is actually the most powerful point, is that this was all evolving as a global community sharing solutions. And that you could have conversations with people in the DevOps community and in the Velocity community, in the Puppet and the Chef communities. And these communities were building and running these systems together, solving these problems together. And I'm pretty smart, objectively, at least that's what the testing things say. But I'm not as smart as everyone put together. And I'm not as smart as all my friends that will answer me on IRC or on email or on text. 
And being able to leverage the fact that these people worked on these same problems in my solutions is a game changer. So there's another thing that I'm going to call the five pillars of DevOps that other people use. And it's culture, automation, lean, metrics, and sharing. I'm going to change lean to learning going forward. But lean was how it was originally put out there. And that's a nice framing for all this stuff. And I'm going to come back to that. The reason I put it there will be obvious at the end. But to me, this is my definition of DevOps. DevOps is optimizing human performance and experience operating software, with software and with humans. Whoops. Click. Okay, there we go. That's the definition. And this is a slide I've used over and over for almost 10 years. You can either manage these systems or you can't. I've been using it for a decade. And I used to believe this very strongly: you should automate all the things. I don't believe that now. What and how and why you automate is as important as that you do. Is this automation? It's not. That's not automation. It's manual toil with some tools. That, my friends, is automation. And I'm sure some of you live this, or you're about to, but the architecture matters. The choices that you make in what you automate matter. And maybe you're just going to take that robot and put it in a container. And then you're going to schedule it on some things. And it's all going to be fine. And I don't know about you, but if Tetris has taught me anything, it's that errors pile up and accomplishments disappear. And constraints decide the promises that your platform can keep. You make thoughtful decisions about what you're going to do. And it gives you these patterns with predictable scaling and failure characteristics. And people are familiar with this propaganda from Heroku around 12-factor apps. 
But what I don't think people have been very mindful of, and I should probably have written this blog post close to a year ago, is the other side of that equation, the other side of that contract: the 12-factor ops that make that possible, that make that available. And everyone's got these buzzwords. Continuous delivery, DevOps, microservices, of course. These are all one thing. This is the new emergent dominant paradigm that's going to be used to deliver IT for the next decade. Continuous delivery, DevOps, microservices are describing patterns, proven successful, for building and operating highly available systems with predictable scaling and failure characteristics. And you can either scale manageable systems with minimal complexity or you can't. You either do that or you don't. You can't scale complex systems. You can feed them with human blood, and it's not very fun, but. And no one set out to do DevOps. No one set out to do continuous delivery. No one set out to do microservices. These were natural Darwinian consequences. Google exists under the Darwinian pressure of serving things at the scale of Google. So don't fixate on any of those words. Fixate on those outcomes. The principles are greater than the practices are greater than the tools. If you see things clearly from the perspective of the principles, the practices and tools are obvious. The next step is obvious. And these problems that we're solving, the problems that we're facing, they're not technical, and they're not social, they're sociotechnical. We're building these sociotechnical systems that interact with each other, and you have to solve both of those together. So let's talk a little bit about this book. So this is the Site Reliability Engineering book, and you can read it for free. Google has given this book as a gift to the world. Because sharing is caring. And this is giving you DevOps as she is spoke at Google. And when I gave this talk at Google Next, Google SREs came up to me and they thanked me. 
And then they all sent me LinkedIn invites. And it was like, they were really happy with some of this stuff I'm about to say. So I say developers and operations can and should work together. If you read the book, this is made clear from many, many reference points in that book. And they're solving system administration with software development. SRE at Google is a software engineering role. And they are participating in the global community sharing solutions. The open source efforts from Google and this book stand as proof that that's true. So going back to the five pillars, all of these are represented in that book. And you should read it. And one of the things that'll become clear if you read that book is that the SREs are architects as much as they are operations. I'm just gonna read a little bit from the book. And this is gonna sound familiar if you're listening or paying attention to Cloud Foundry. "Another way, perhaps the best, is to short-circuit the process by which individual teams create systems with lots of individual variation and end up arriving at SRE's door: provide product development with a platform of SRE-validated infrastructure upon which they can build their systems. This platform will have the double benefit of being both reliable and scalable. SRE builds framework modules to implement canonical solutions for the concerned production area. As a result, development teams can focus on the business logic, because the framework already takes care of correct infrastructure use." Amazing. This is your homework. I want everyone in this room to go read the Embracing Risk, Service Level Objectives, and Eliminating Toil chapters from the SRE book. And if that's not enough, the bonus is to read the Communication and Collaboration chapter. And there's this language, this opportunity to borrow the jargon that Google uses and gives us in this book, to start talking about service level objectives. 
And I do a lot of work with Pivotal customers and I do a lot of work inside Pivotal. And I don't think we're very good at articulating this. You're either gonna meet service level objectives or you're not. And if you don't know what they are, you can't meet them. It's very simple. I love how mathematical and dispassionate this framing makes some of these decisions around deployment. And then I'd be remiss if I didn't add this, that in the book, and what I actually believe very, very strongly, is that monitoring is the foundation of all reliability. You can't have any of these conversations about service level objectives unless you have baseline monitoring, unless you have decent monitoring. You can't talk about service level indicators if you don't have monitoring. And this is the most important sentence from the Borg paper. So has anyone read the Borg paper? Borg is the spiritual lineage of Kubernetes, Cloud Foundry, Mesos, all this stuff. So this paper from Google has a paragraph, and I think it's lost on most people, that says almost every task run under Borg contains a built-in HTTP server that publishes information about the health of the task and thousands of performance metrics. And I guarantee you, if you build applications that tell you their health, you'll get more benefit from that than you will navel-gazing about container schedulers. So wait, you mean Google SREs will review my software with me? And make recommendations? Yes, please. CRE stands for customer reliability engineering. And that is something that Google is doing with their customers. And it's something Google is doing with Pivotal. So SRE are fixated on Google's outcomes, the reliability of Google's services, and CRE are fixated on your outcomes. And that, my friends, is the end of the beginning. That's my little idea, and I'm gonna hand this over to Luke and he's gonna tell you about some of the work Pivotal's doing with Google. Thank you. Thank you. 
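[Editor's note: the Borg-paper idea quoted above — every task publishing its own health and metrics over a built-in HTTP server — can be sketched minimally with the Python standard library. The paths, port, and metric names below are illustrative, not anything from Borg or PCF.]

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counters; a real task would update these on
# every request it serves and expose many more of them.
METRICS = {"requests_total": 0, "errors_total": 0}

class TaskHandler(BaseHTTPRequestHandler):
    """Serves /healthz for liveness and /metrics for performance data."""

    def do_GET(self):
        if self.path == "/healthz":
            body, status = b"ok", 200
        elif self.path == "/metrics":
            body, status = json.dumps(METRICS).encode(), 200
        else:
            body, status = b"not found", 404
        self.send_response(status)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the sketch quiet

# To serve: HTTPServer(("127.0.0.1", 8080), TaskHandler).serve_forever()
```

The point is not the framework; it's that the application itself answers "am I healthy?" and "how am I performing?" so the platform and operators can consume the same signals.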
Okay, so I'm Luke, I'm from Google, customer reliability engineering. And the idea is we do SRE with our cloud customers so that their apps can run reliably on our system. Duh. Right? Okay, so make this happen. Okay. All right, so it's pretty obvious what we're gonna do based on what he just talked about. This guy's really smart. He knows the space pretty well. We need to work on three different things. We need to work on your application. We need to work on Pivotal software, and we need to work on you, the operator, and your habits, right? So we set ourselves a goal of having a technology partnership between Google and Pivotal with this big, hairy, audacious goal, which is that we can run apps at four nines, which is pretty hard, actually. So the value prop is pretty obvious, right? Like you can go faster, the operators can go faster, and then Pivotal and Google have to spend less time working around the bugs from unintended consequences. So what are we gonna do? Three things is pretty obvious, I think. The first thing is we need to really harden PCF itself so that it works well on our cloud, right? The cloud is a pretty different environment than most people's on-prem data centers. Like it's kind of noisy, there's migration, things move around, you know, Google and the Borg and everything was really built around, okay, you have a lot of infrastructure, you can move things around, you're gonna build the software to be very resilient to the infrastructure, and that means you can just use a lot more infrastructure to get your job done, right? So you're kind of unleashed into this world of having more things to play with, but you have to be a little more resilient, which is the trade-off, okay? So the first thing we need to do is think about how are we gonna make PCF itself a little more resilient? And this is like what Andrew was talking about in terms of make the platform, right? Make a platform that's repeatable across all these different applications. 
So people have some guardrails and they have some architecture patterns that are good, they don't have to worry about them, and people won't shoot themselves in the foot very much. So how do we do this in CRE? We go and we look at your application in detail. It takes about six weeks to really go in and look at every little thing about your application. And so we did this with PCF, and we said, you know, well, how does the load balancing work? And what happens if the load balancer falls over, and what happens if the backup load balancer doesn't come up in time, and how much traffic is dropped, and how do you know how much traffic is dropped, and we just kept asking all those little questions that you know in the back of your head you should be asking and just kept going after it, right? And then came up with a list of, I don't know, about 30 recommendations to improve the architecture, and they're chewing through it. One of the big recommendations out there is better monitoring, right? So that it's really obvious how the thing is performing. And this is really tricky for PCF because you have the platform monitoring and then you also have the application monitoring, right? And it doesn't really help you to know that the platform is healthy if you don't know that the application is also healthy. It doesn't provide the user any benefit to say, oh, BOSH is doing great, but the application is just throwing 500s, right? So you need to really think about how are we gonna combine these together and actually have a good contract between the platform and the application, and then also the platform and the infrastructure, so that we can all operate this together using science. Okay, so we're gonna do that. We're actually gonna take another bite at that apple this year, that's a big project, right? Our second project is we wanna review real live applications that run on our cloud. 
So we're gonna choose a real live application that's like a production thing that you've heard of, and I'm not gonna tell you what it is right now. It'll be a surprise for later this year, I think. And we're gonna review it and make sure that the application's developers and operators can achieve their reliability goals based on the tooling, based on their practices, and based on the architecture of their application. So I'll tell you the dirty little secret of all these application reviews: we come out of them and people say, what's our architectural flaw? And we say, actually, the architecture is pretty good, but the problem is your operations are terrible. You push the thing globally, all at the same time. You don't have any monitoring for this and that. You don't have a decider who can hit a stop button if you're burning your error budget. So that's gonna be really interesting, to think about what the application developer's experience is, the operator's experience, and then how they escalate stuff to Pivotal and Google if bugs are encountered, right? So we wanna make this repeatable so that we can do this for multiple customers, because my team is in the job of making everything reliable that runs on Google's platform, right? So this is a great opportunity to make it really cheap. If we can get people to run their applications on PCF, on GCP, then we can review orders of magnitude more applications with the same amount of effort. So we're trying to eliminate toil. Our third project is, of course, to work with Pivotal Labs. So a lot of this is human-factors stuff, right? You have to get the people to start behaving in the right way and having the right habits, and what better place to do that than Pivotal Labs? So Labs is really good at giving people an experience, helping them do agile, and helping them develop some muscle around actually doing that. 
I see agile as really fundamentally a way to trade off product uncertainty versus development velocity. So it's kind of like the steering wheel in the car. You can write software really fast, but if you're not doing the right features, then you're probably not gonna get where you wanna go. So you need to have some agile so you can steer back. And I see SRE as kind of like the brakes on the car, right? You can go fast as long as you have brakes and you can slow down when you're in danger. So they're really pretty similar. And so what we often do when we go in as CRE to customers, we do a bunch of drills. We say, let's do a post-mortem together and just see how that goes. And let's do a Wheel of Misfortune exercise, which is an exercise where you basically do a hypothetical outage and you put somebody in the hot seat and you say, okay, you're on call and it's 2 a.m. and you get the following page, and then what do you do, right? And that just helps people to kind of build the muscle and to exercise all the ideas about SLOs, am I within SLO or am I outside of SLO? Do all the stuff in a low-stress kind of environment, right? So we want to do a similar thing. We want to basically have an SRE-like experience in a Labs-style environment. And it really comes in at the developer level, right? It's not like a developer versus operator thing. These SRE principles, like having an SLO and an error budget, they're really about having a contract between the operator and the developer. So it's important to get the developer to be thinking about SLOs when they're sitting down and they're just ready to type the first line of code, right? They need to be thinking, okay, I've already decided what my reliability goals are. So that will be very powerful in terms of just changing habits and getting people going. So I think that's pretty much what we're gonna do. That's our little preview of what we're gonna do in SRE with Pivotal. 
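[Editor's note: the SLO and error-budget arithmetic that these drills exercise fits in a few lines. The function names and numbers below are made up for illustration; they follow the usual definition that an SLO target like 99.9% leaves a 0.1% budget of allowed failures.]

```python
def error_budget(slo_target: float, total_requests: int) -> float:
    """Failures allowed in the window: a 99.9% SLO leaves a 0.1% budget."""
    return (1.0 - slo_target) * total_requests

def budget_remaining(slo_target: float, total_requests: int,
                     failed_requests: int) -> float:
    """Fraction of the error budget still unspent; negative means out of SLO."""
    budget = error_budget(slo_target, total_requests)
    return (budget - failed_requests) / budget

# A 99.9% SLO over a million requests allows roughly 1,000 failures;
# 750 failures would leave about a quarter of the budget, which is the
# kind of dispassionate number a decider can act on at 2 a.m.
```

This is the contract between developer and operator in miniature: pushes continue while `budget_remaining` is positive, and someone hits the stop button when it goes negative.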
In general, you know, in general about CRE, I'll give you a little update, because we haven't talked about it a lot since we started the program about a year ago. So we've worked with a handful of customers. We are getting pretty good at doing application reliability reviews, which are modeled after Google's internal production readiness reviews, which is like a pre-launch review. It's kind of like quality control. And we're also developing some workshops for how to develop SLOs. We find a lot of people don't really have SLOs, or they don't have the tools to really think about SLOs. At Google, we're pretty lucky because we have an org structure that kind of lends itself to having a decider who can actually name a number. And we have that HTTP server that spits out all these metrics, so there are lots of SLIs to look at. So we're pretty lucky, but we often need to kind of teach people how to get those fundamentals in place before they can even start thinking about making an SLO contract with their developers. And we're also gonna do design reviews, because we find a lot of people pretty much know what they're doing in terms of being SREs, and they have an SRE team that's up and running, but they're just writing something new and they wanna know how it's gonna work on cloud, and it has a new shape. So we wanna do some quicker design reviews that are less than the six-week in-depth thing, which is more typical practice at Google anyway, like a kind of half-day design review. So that's CRE and PCF. I'm pretty excited, because it'll be amazing value if we can actually get all the bugs out of PCF on GCP using one sample application, so that nobody else has to suffer through it. When? When, exactly. And we're also really focused on this concept of a shared monitoring console, where you can look at the same pane of glass as people at Pivotal or people at Google Cloud, and everybody can see that you're out of SLO or you're in SLO, all the same. 
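[Editor's note: the shared in-SLO/out-of-SLO signal on such a console could be computed as below. The threshold, field names, and categories are illustrative, not any real Pivotal or Google interface.]

```python
def slo_status(slo_target: float, good_events: int, total_events: int) -> str:
    """Classify measured availability against an SLO target like 0.999."""
    if total_events == 0:
        return "no_data"
    availability = good_events / total_events
    if availability >= slo_target:
        return "in_slo"
    # "Barely out" versus "way out, on the floor": an arbitrary
    # one-percentage-point split for the sketch.
    return "barely_out" if availability >= slo_target - 0.01 else "way_out"
```

The value of computing this one way, from shared metrics, is that the customer, Pivotal, and Google all see the same verdict instead of arguing from three different dashboards.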
And so I'm trying to refine this concept around just very basic issue triage. Is the issue, are you barely out of SLO, or are you way out of SLO and you're on the floor? And then what products are you using? What failure domain are you experiencing the problem in? What region or zone? If we can just get those three things, your product, your failure domain, and what metric you're looking at, communicated to our support team really quickly, then we can fix things in real time really nicely. That's gonna be our main trick. All right, thanks for coming. Thank you. Almost all of it. So the question was, does the CRE review for PCF on GCP benefit other deployments of Cloud Foundry, and will it have meaningful impact? I would say that there's probably a third that are maybe GCP-specific, and then the others are architectural benefits that will improve CF and PCF on every other deployment. No. It's pretty much your second-to-last slide there, which is like, go read the SRE book, go read the introduction by Ben Treynor, go read chapters three and four, and get everybody on board. And then I kind of sit down with people and say, okay, where's your SRE team? And when's the last time you wrote code? And make sure that they understand that's the expectation. But you're right, there's a lot of different engineering cultures out there, and I'm seeing some that perform really well too that are not SRE. I don't think SRE is the only way to get good performance, but it's just the one that we know how to teach. So what I would add there is that there's a lot of context and nuance from organization to organization. And all the things in the SRE book give us a nice framing, a shared vocabulary around things like SLOs and error budgets. 
But at the same time, I would add, and I believe this very strongly, that I've never seen anything fix problems in IT and in cultures except for smart people that cared about solving them, trying to solve them. So if you can get people aligned with trying to solve those problems, you'll be able to solve them. And if you've got some cultural incentive structures that are against you solving them, then you won't. Yeah, SRE actually doesn't solve all your problems, but it does give you some good guardrails so you don't backslide and you don't fall into some common anti-patterns. Mark Imbriaco. As Mark has rightly noted, not everyone is Google. Yeah, I think SRE is actually an instantiation of these principles in a specific environment. And there could be other SRE-shaped things, or there could be other things that look very different but kind of adhere to the same principles. But that's why we're creating this idea of an SLO workshop, to kind of get people past thinking, okay, it's gonna take like two years to transform my culture before we're ready to do this. Well, just like you wouldn't try to take a monolith and turn it all into microservices at once, you'd probably strangle it, you can refactor your culture in a similar manner. You don't have to change every part of your organization and your culture at once. In fact, you're probably irresponsible if you try. Yeah, my best advice to people who are just starting to make an SRE team is take your five best developers and put them in an SRE team. And make them responsible for a very small but broad piece of the infrastructure, like the database or something, right? Or whatever deployment. And then people go, but I can't waste my five developers. Cool, then that's not important to you. You're wasting them now is the thing. You're wasting them now. Exactly, if you think this is the most important thing, then put your best people on it. And give them room to do a really good job. 
And otherwise, it's cool to have different priorities. It's okay. Lunch? What's next? All right, next talk. Good deal. Thank you for coming. Thanks for coming.