Live from San Francisco, it's theCUBE, covering Google Cloud Next 2018. Brought to you by Google Cloud and its ecosystem partners.

Welcome back, everyone. It's theCUBE, live in San Francisco at Google Cloud's big event, Next 18. Google Next 18 is the hashtag. I'm John Furrier, with Jeff Frick. Our next guest is Dave Rensin, director of CRE and network capacity at Google. CRE stands for Customer Reliability Engineering, not to be confused with SRE, which is Google's heralded program, Site Reliability Engineering, a category changer in the industry. Dave, great to have you on. Thanks for coming on.

Thank you so much for having me.

So we had a meeting a couple of months ago and I was just so impressed by how much thought and engineering and business operations have been built around Google's infrastructure. It's a fascinating case study in the history of computing. You guys obviously power yourselves, and the cloud is just massive. The site reliability engineering concept is now, I won't say boilerplate, but it's certainly the guiding architecture for how enterprises are going to start to operate. Take a minute to explain the SRE and the CRE concepts within Google. I think it's super important that you guys, again, pioneered something pretty amazing with the SRE program.

Well, I mean, like everything, it was just born out of necessity for us. We did a calculation 12 or 13 years ago. I think we sat down with a piece of paper and we said, well, the number of people we need to run our systems scales linearly with the number of machines, which scales linearly with the number of users and the complexity of the stuff you're doing. All right, you know, carry the two, divide by six, plot a line. In 10 years, and this is 13 or 14 years ago, we're going to need 1 million humans to run Google. And that was at the growth and complexity of 10 years ago, 12 years ago.

Search.

Search, right?
We didn't have Android, we didn't have cloud, we didn't have Assistant, we didn't have any of these things. We're like, well, that's not going to work. We're going to have to do something different. And so that's kind of where SRE came from. It's like, how do we automate? The basic philosophy is simple: give to the machines all the things machines can do, and keep for the humans all the things that require human judgment. And that's how we get to a place where like 2,500 SREs run all of Google.

And that's massive. And then the billions and billions of users. Again, I think this is super important, because at that time it was a telltale sign for you guys to wake up and go, well, you can't get a million humans. But it's now becoming, in my opinion, what enterprises are going through in this digital transformation, or whatever we call it these days; consumerization of IT is now digital transformation. Whatever it is, the role of human-machine interaction is now changing. People need to do more. They can collect more data than ever before. It doesn't cost them that much to collect data. We just heard from the BigQuery guys, there's some amazing stuff happening. So now enterprises are almost going through the same changeover that you guys had to go through. And this is now super important, because now you have the tooling and the scale that Google has. And so it's almost like they have to level up fast. So how does an enterprise become SRE-like quickly to take advantage of the cloud?

So, you know, I would like to say this was all sort of a deliberate march of a multi-year plan. But it wasn't; it was a little accidental. Starting two or three years ago, companies were asking us, they were saying, we're getting mired in toil. We're not able to innovate because we're spending all of our budget and effort just running the things and turning the crank. How do you have billions of users and not have this problem?
We said, what we do is called SRE, and they're like, please use more words. And so we wrote a book, right? And we expected maybe 20 people would read the book, and that was fine. We didn't do it for any reason other than that it seemed like a very scalable way to tell people the words. And then it all just kind of exploded. We didn't expect that, and so a couple of years ago we said, well, maybe we should formalize our interactions. We should go out proactively and teach every enterprise we can how to do this, and really work with them and build that muscle memory. And that's where CRE comes from. That's my little corner of SRE. It's the part of SRE that, instead of being inward focused, we point out to companies, and our goal is that every firm from five to 50,000 can follow these principles. We know they can do it, and it's not as hard as they think. And the funny thing about enterprises is they have this inferiority complex. They've been told for years by Silicon Valley firms, in sort of this derogatory way, that you're just an enterprise, we're the innovators.

That's "buy our stuff, buy our software."

"Buy IT. We're smarter than you." And it's nonsense. There are hundreds and hundreds of thousands of really awesome engineers in these enterprises, right? And if you just give them a little latitude, we can walk these companies on this journey, and it's been, I mean, you've seen it. It's just been snowballing the last couple of years.

Well, developers have certainly changed the game. We've seen with cloud native the role of developers doing toil and/or, you know, specific longer-term projects, where app-related IT would support them. So we had this traditional model that's been changed with Agile, et cetera, and DevOps. So that's great. So, you know, golf clap for that. Now it's like, scale; it's been a high five. Now they've got to go to the next level.
Okay, the next level is, how do you scale it? How do I get more apps? How do I drive more revenue, not just reduce the cost? And now operators have to operate things. So I think the persona of what operating something means is what you guys have hit with SRE, and CRE is part of that program. And that's really, I think, the aha moment. So that's what I see. And so how does someone read the book and put it in practice? Is it a cultural shift? Is it a reorganization? What are you guys seeing? What are some of the successes that you guys have been involved in?

The biggest way to fail at doing SRE is to try to do all of it at once. Don't do that. There are a few basic principles that, if you adhere to them, kind of the rest of it just comes organically at a pace that makes sense for your business, right? The easiest thing to think of is simply, if I distill it down to a few simple things, it's just this. Any system involving people is going to have errors. So any goal you have that assumes perfection, 100% uptime, 100% customer satisfaction, zero error, that kind of thing, is a lie. You're lying to yourself. You're lying to your customers. It's not just unrealistic. It's in a way kind of immoral. So you've got to embrace that. And then that difference between perfection and the closeness to perfection that your customers really need, because they don't really need perfection, treat as a budget. We call it the error budget. Go spend the budget, because above that line your customers are indifferent. They don't care. And that unlocks innovation.

So this is important. I want to just make sure I slow down on this. Error budget is a concept that you're talking about. Explain that, because this is, I think, interesting. You're saying it's BS that there are no errors, because there are always errors, right? So you've just got to factor that in and decide how you deal with them. But explain this error budget. This is an operating philosophy of saying, deal with errors.
So explain this error budget concept.

It comes from this observation, which is really fascinating. If you plot, say, reliability and customer satisfaction on a graph, what you will find is, for a while, as your reliability goes up, your customer satisfaction goes up. Fantastic. And then there's a point, a magic line, after which you hit this really deep knee. And what you find is, if you are much under that line, your customers are angry. Like pitchforks, torches, flipping cars, angry. And if you operate much above that line, they are indifferent. Because the network they connect with is less reliable than you. Or the phone they're using is less reliable than you. Or they're doing other things in their day than using your system, right? And so there's a magic line, actually, which is termed an SLO, a Service Level Objective. And the difference between perfection, 100%, and the line you need, which is very business specific, we say treat as a budget. If you overspend your budget, your customers are unhappy because you're less reliable than they need. But if you consistently underspend your budget, because they're indifferent to the change and because it is exponentially more expensive for each increment of improvement, that's literally resources you're wasting. You're wasting the one resource you can never get back, which is time. Spend it on innovation. And just that mental shift, that we don't have to be perfect, lets people do open and honest, blameless postmortems. It lets them embrace risk in innovation. We go out of our way at Google to find people who accidentally broke something, took responsibility for it, and redesigned the system so that the next unlucky person couldn't break it the same way, and then we promote them and celebrate them.

So you push the error budget, but then it's basically a way to do some experimentation, to do some innovation.

Safely.
Safely, and what you're saying is, obviously, there's the line of unhappy customers. It's like Gmail: when Gmail breaks, people are like, damn, and the world freaks out, right? But I'm happy with Gmail right now, it's working.

But here's the thing, Gmail breaks very, very little, very, very often.

I never noticed it breaking, right?

Will you notice a difference of 10 milliseconds in delivery time? No, of course not. Now, would you notice an hour or whatever? There's a line where you would for sure notice.

That's the SLO line.

So that's exactly right.

You're also saying that if you try to push above that, it costs more and there's not an incremental benefit.

That's right.

It doesn't affect my satisfaction.

Yeah, you don't care.

I'm in Nirvana now, I'm happy.

Yeah, okay.

And so what does that mean now for putting things in practice? What's the ideal error budget? Is that an SLO? Is that part of the objective?

Well, that's part of the work to do as a business, and that's part of what my team does: help you figure out, what is the SLO? What is the error budget that makes sense for you for this application? And it's different, you know. A medical device manufacturer is going to have a different SLO than a bank or a retailer, right? The shapes are different.

And it's interesting, you know, we hear SLA, service level agreement, which is the old term.

Different things.

Different things. Here, objective, if I get this right, it's not just about speeds and feeds, it's also qualitative user experience objectives, right? So am I getting that right?

Very much so. So SLOs and SLAs get confused a lot because they share two letters, but they don't mean anywhere near the same thing. An SLA is a legal agreement. It's a contract with your user that describes a penalty if you don't meet a certain performance. Lawyers, and sometimes sales and marketing people, drive SLAs. SLOs are different things, driven by engineers.
They are quantitative measures of your users' happiness right now. And it's exactly to your point. It's always from the user's perspective. Like, your user does not care if the CPU in your fleet spiked or the memory usage went up X. They care: did my mail delivery slow down? Or is my load balancer not serving things? So focus from your user backwards into your systems, and then you get much saner things to track.

Dave, great conversation. I love the innovation. I love the operating philosophy, because you're really nailing it in terms of, you want to make people happy, but you're also pushing the envelope. You want to have these error budgets so you experiment and learn and don't repeat the same mistake. That sounds like automation to me. But I want you to take a minute to explain. SRE, that's an inward-facing thing for Google. You are called a CRE, Customer Reliability Engineer. Explain what that is, because I heard Diane Greene saying, we're taking a vertical focus. She mentioned healthcare. It seems like Google's starting to get in and apply a lot of resources to the field. Customers, what is a CRE? What does that mean? How is that part of SRE? Explain that.

You know, a couple of years ago, when I was first hired at Google, I was hired to build and run cloud support. And one of the things I noticed, which you notice when you talk to customers a lot, is that the industry's done a really fabulous job of telling people how to get to cloud. I used to work at Amazon. Amazon does a fantastic job of telling people, how do you get to cloud? How do you build a thing? But we're awful as an industry about telling them how to live there. How do you run it? Because it's different running a thing in a cloud than it is running it on prem. And you find that's a cause of a lot of friction for people. Not that they built it wrong, but they're just operating it in a way that's not quite compatible, right? So a few degrees off.
And so we had this notion of, well, we know how to operate these things at scale. That's what SRE is. What if we did a crazy thing? We took some of our SREs, and instead of pointing them in at our production systems, we pointed them out at customers. Like, what if we genetically screened our SREs for "can talk to human" instead of "can talk to machine"? Which is, you know, what you optimize for when you hire an engineer. And so we started CRE. It's the part of our SRE org that we point outwards to customers, and our job is to walk that path with you and really do it together. Like, sometimes we go so far as even to share a pager with you. And really get you to that place where your operations look a lot like ours, where we're talking that same language.

It's custom too. You're looking at their environment. You're looking at what's bespoke to them.

And then we also tried to scale things. You know, we did the first SRE book. At the show just two days ago, we launched the companion volume to the book, which is, like, cheap plug segment, where it's the implementation details, right? The first book is sort of a set of principles. These are the implementation details. Anything we can do to close that gap. I don't know if I told you the story, but when I was a little kid, when I was like six, like 1978, my dad, who always loved technology, decided he was going to buy a personal computer. So he went to the largest retailer of personal computers in North America, Macy's, in 1978. And he came home with two things. Came home with a huge box and a human named Fred. And Fred the human unpacked the big box and set up the monitor and the tape drive and the keyboard and told us about hardware and software and booting up, because who knew any of these things in 1978? And it's a funny story that you needed a human named Fred. My view is, I want to close the gap so that CREs are the Freds.
Like, in a few years, it will be funny that you would ever need humans from Google or anyone else to help you learn how to run.

It's really helping people operate their new environment as a whole. It's a new, first-generation problem.

Yeah, essentially.

Well, Dave, great stuff. Final question, I want to get your thoughts. Great to have this conversation. You should come to the studio and go deeper on this. I think it's a super important and new role with SREs and CREs. But the show here, if you zoom out and look at Google Cloud, look down on the stage at what's going on this week. What's the most important story that should be told that's coming out of Google Cloud? Across all the announcements, what's the most important thing that people should be aware of?

Wow, I have a definite set of biases, I won't lie. To me, the three most exciting announcements were these. GKE on-prem: the idea that managed Kubernetes is something you can actually run in your own environments. I mean, people have been saying for years that hybrid wasn't really a thing. Hybrid's a thing, and it's going to be a thing for a long time, especially in enterprises. That's one. I think the introduction of machine learning to BigQuery, like, anything we can do to bring those machine learning tools into these petabyte-scale datasets. I mean, you mentioned it earlier. We are now collecting so much data. Not only can we not, as companies, manage it; we can't even hire enough humans to figure out the right questions. So that's a big thing. And then, selfishly, in my own view, because of reliability, the idea that Stackdriver will let you set up SLO dashboards and SLO alerting, to me that's a big win too. Those are my top three.

Dave, great to have you on. And the new book. Our SLO at theCUBE is to bring the best content we possibly can, the most interviews at an event, and get the data and share that with you live. It's theCUBE here at Google Cloud Next 18. I'm John Furrier, with Jeff Frick.
Stay with us, we've got more great content coming. We'll be right back after this short break.
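
The error budget arithmetic described in the interview (the gap between 100% and the SLO, treated as something to spend) can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not a Google tool or API: the class name `SloWindow`, the 99.9% target, the request counts, and the 50% threshold used to call a budget "underspent" are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    """One measurement window for a service (hypothetical helper)."""
    slo_target: float   # fraction of events that must be good, e.g. 0.999
    total_events: int   # all user-facing events in the window
    bad_events: int     # events the user would have experienced as failures

    @property
    def budget(self) -> float:
        # The error budget is the gap between perfection (1.0) and the SLO,
        # expressed here as a number of events allowed to fail.
        return (1.0 - self.slo_target) * self.total_events

    @property
    def remaining(self) -> float:
        # Budget left to spend on releases, experiments, and other risk.
        return self.budget - self.bad_events

    def status(self) -> str:
        if self.remaining < 0:
            # Less reliable than users need: slow down and fix reliability.
            return "overspent"
        if self.bad_events < 0.5 * self.budget:
            # Assumed threshold: far above the SLO line, where extra
            # reliability is something users are indifferent to.
            return "underspent"
        return "on track"


if __name__ == "__main__":
    window = SloWindow(slo_target=0.999, total_events=1_000_000, bad_events=200)
    print(f"budget={window.budget:.0f} remaining={window.remaining:.0f} "
          f"status={window.status()}")
```

In this sketch an "overspent" window is the case where customers are unhappy, while a consistently "underspent" window suggests there is room to take more risk, such as shipping features faster.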