All right, so we're gonna talk a little bit about site reliability in the serverless age. I'm gonna take you through an opinionated point of view on serverless and reliability, and I don't expect everyone to agree. What I do hope is that afterwards we have a great conversation: you come by the CloudZero booth, we talk about your favorite beer or any of these other topics. So, who am I? You already know all these details. Probably the most interesting fun fact is that I've lived in six states and three different countries, which just randomly means I really don't feel like I'm home anywhere. But I live in Boston, I love Boston, I've been here for five years, and it's where we started CloudZero. So we're gonna talk about, and let me make sure I don't fall off the stage here, we're gonna talk about serverless, we're gonna talk about reliability, we're gonna talk about how those two things interplay with each other, and then we're gonna make some wild guesses about the future. Sounds good? All right, so what is serverless? How many times have we talked about the definition of serverless, right? Serverless still has servers, yes, we know. Well, there are five properties that I think apply to serverless, and when we think about serverless, these are really the attributes of what a serverless system or a serverless component is: it's event-driven, the infrastructure is invisible to those of us building systems on top of it, it automatically scales with usage, it's fault-tolerant with high availability built in, and we never pay for idle. That's how I define serverless, that's how CloudZero defines serverless, and it's how a lot of folks I've been talking to on the Amazon side and on the Google side also think about serverless. And it's gonna be very useful for the purpose of this conversation.
So, I wanted to share something new that nobody has ever seen before: a live image from the serverless data center. Not a lot going on here, I don't know. When I saw this tweet, I think I laughed for about five minutes. Of all the jokes that have been made about serverless, this is by far my favorite so far. If you come up with a better one, it'll appear in the next talk, so please do. Let's talk seriously, though, about what serverless means. Remember those five attributes I was just talking about? Well, serverless doesn't just mean functions as a service, Lambda, Google Cloud Functions, Azure Functions. That's not what serverless is all about. Whoop. Okay, sorry. It's a whole lot more than that. In fact, from my point of view, serverless is better thought of as a spectrum. I got this idea a long time ago from a conversation with Ben Kehoe at iRobot, where I think he first presented the idea that serverless is a spectrum. And so I started to think about what that really means if I were to go and take a lot of the services. This is obviously AWS-focused; a lot of my talk today is AWS-focused, apologies to the Google and Azure folks in the room. I would love to work with you on a broader cloud talk, but I'm gonna focus on AWS. So if I took all the major AWS services and popped them into a spreadsheet and assigned a score to each one, a one or a zero or a half, based on how well they fit those five attributes, I basically get a spectrum, a chart like this. Now the interesting thing is, serverless by my definition really starts about here. ECS Fargate satisfies a lot of those attributes, then Kinesis, and services get more serverless as I go farther and farther to the left side. One interesting thing is we've got services like SQS and S3. What were the first services ever announced by AWS?
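The spreadsheet exercise described above can be sketched in a few lines of Python: rate each service zero, a half, or one against the five attributes, then sum to place it on the spectrum. The scores below are my own illustrative guesses, not the actual numbers from the talk.

```python
# Sketch of the scoring exercise: rate each AWS service 0, 0.5, or 1
# against the five serverless attributes, then sum to place it on the
# spectrum. Scores here are illustrative guesses, not the talk's data.
ATTRIBUTES = ["event_driven", "invisible_infra", "auto_scales",
              "fault_tolerant", "no_pay_for_idle"]

SCORES = {
    "Lambda":      [1, 1, 1, 1, 1],
    "S3":          [1, 1, 1, 1, 1],
    "SQS":         [1, 1, 1, 1, 1],
    "DynamoDB":    [1, 1, 0.5, 1, 0.5],
    "Kinesis":     [1, 1, 0.5, 1, 0],
    "ECS Fargate": [0.5, 1, 0.5, 0.5, 0],
    "EC2":         [0, 0, 0, 0, 0],
}

def spectrum(scores):
    """Return services ordered from most to least serverless."""
    totals = {svc: sum(vals) for svc, vals in scores.items()}
    return sorted(totals.items(), key=lambda kv: -kv[1])

for svc, total in spectrum(SCORES):
    print(f"{svc:12s} {total:.1f} / {len(ATTRIBUTES)}")
```

Anything scoring near the top of the range sits on the serverless end of the chart; EC2 anchors the other end.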
I think back in 2004 they announced SQS, and then S3, and then EC2 came out, and then for a decade we were all confused and didn't think about this amazing serverless thing because nobody had really talked about it yet; we were busy focused on EC2. In 2014 Lambda comes out, and suddenly we're having this conversation again. But it's a spectrum. And there are a lot of components involved, because who would build a system on nothing but functions as a service? I can't do that. I need a database, I need storage, I need all these other elements in my environment in order to build a system. And the key tenet, or maybe a better framework to think about it, is that the cloud is really a computer, the operating system is the cloud provider, and serverless is the native code for that operating system and computer. So serverless is a spectrum. Okay, so let's talk about reliability. But first the setup: everything fails all the time. How many people would absolutely stand up and applaud if their CTO or CIO stood up and said everything that we're building is going to fail, right? It's a good understanding of how cloud changes everything. When we first started moving things into AWS, some people attempted the lift and shift and discovered how painful that was. They had to re-architect a lot of things. They had built a monolith and maybe didn't know it. They had built a snowflake and maybe didn't know it. But everything fails all the time, right? So if everything fails all the time, and I'm building serverless systems, and I'm building in AWS, then what exactly does reliability mean? So I sat down and spent some time thinking about it. Let's see what the dictionary says; it's kind of okay. Let's see what the industry says. And at the end of the day I thought, well, what are we really trying to do as practitioners, as people building systems?
What are we ultimately trying to do? We're trying to build systems that delight our customers. And so in my mind, reliability is the trustworthiness of a system and its ability to delight the customer. If I know that every time I go to that system it's going to deliver a delightful experience on a trustworthy basis, then it's a reliable system from the customer's perspective, and the customer's perspective is really all that matters. It might be complete and utter chaos behind the scenes; hopefully it isn't. But if the customer is happy, you're at least getting one thing right, probably the most important thing. Now, that's not sustainable if you're managing a dumpster fire, but we definitely want to delight the customer. Okay, so we have an idea of what serverless is. We have an idea of at least what I think reliability is. So how do those forces tie into things like DevOps and Site Reliability Engineering? There are a lot of different points of view on this. I've pulled out the pieces that I think line up with how we operate. A lot of the time I see DevOps as more of a culture, and Site Reliability Engineering as a practice. We've got some points of view on what DevOps culture means, and some points of view on what we should focus on in the Site Reliability Engineering practice. And by the way, DevOps wasn't a happy merger between development and operations. It was a hostile takeover, right? It wasn't, oh, everything is perfectly fine, let's glue together these two organizations that have been at war with each other since the dawn of humankind. And after that hostile takeover, the culture that drove the invading hordes, us engineers, into the operations space made us suddenly realize, wait a second, we have to develop some sort of process now that we're in charge of this whole thing. And that's where we get practices like what's written down in the SRE book.
Okay, so we've got those two pieces. How does serverless affect those forces? On the DevOps side, some things, in my point of view, are just absolutely required. We talked about eliminating organizational silos, DevOps taking over ops; the silos have to go away, it's absolutely required. If I'm gonna be building serverless, I have to accept that failure is normal. Everything fails all the time. I'm now operating on that operating system called AWS, and I have to embrace failure. Their CTO says it's gonna happen 100% of the time; embrace it. We need to adopt the religion that mean time to recovery is more important than mean time between failures. If I'm focused on trying to make something never fail, I will fail. But if I'm focused on being able to respond to an event, respond to an incident, I will ultimately do what's most important, which is delight my customer. We're always trying to optimize feature velocity, but in a serverless world things are happening almost faster than we can keep up. We used to try to measure everything, but now we really should be trying to measure what's important. And then there's something new that shows up, which is an alignment between well-architected systems and their cost. This has probably been coming for a long time, and we haven't really thought deeply about it. When we had servers in our basement, we didn't think much about the cost: the CFO purchased them every three years, it was a sunk cost, and we didn't think about it. Then we went to EC2 and had a bunch of things running and really didn't know how much they cost; sometimes we thought about it. With serverless, I'm aligning my usage directly with what my customers are doing and directly with how much money I'm paying, and so good architectures cost less.
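The point about mean time to recovery falls straight out of the standard steady-state availability formula, availability = MTBF / (MTBF + MTTR): cutting recovery time lifts availability just as surely as stretching the time between failures. A quick illustration, with made-up numbers:

```python
# Steady-state availability as a function of mean time between failures
# (MTBF) and mean time to recovery (MTTR). The hour figures below are
# illustrative, not from the talk.
def availability(mtbf_hours, mttr_hours):
    return mtbf_hours / (mtbf_hours + mttr_hours)

baseline = availability(720, 4)           # fails monthly, 4h to recover
faster_recovery = availability(720, 0.5)  # same failure rate, 30min recovery
fewer_failures = availability(5760, 4)    # fails 8x less often, same 4h

print(f"baseline:        {baseline:.5f}")
print(f"faster recovery: {faster_recovery:.5f}")
print(f"fewer failures:  {fewer_failures:.5f}")
```

Cutting MTTR by 8x and cutting the failure rate by 8x land at essentially the same availability, and in practice recovery time is the lever you actually control.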
So I would argue that there's a direct relationship between cost and a well-architected system, and that we need to consider cost a first-class metric, an operational metric that I monitor, not something I discover on a monthly basis. In an earlier talk I heard people describing finding surprises in their bill at the end of the month. I used to give a talk where I said the most popular IDS in AWS was the billing report at the end of the month: people were using their bills as an intrusion detection system. Please don't do that. So I think cost is a first-class metric, and it's important. A couple of these are blank because I think they just carry on, but a couple of things change here as well. Availability: what does availability mean in a world where my provider is really responsible for it? I need to think more deeply and thoughtfully about the SLAs and SLOs associated with the system I'm building. Efficiency becomes cost efficiency. Change management becomes change tracking: I just wanna understand what's happening. I've got more teams than I know what to do with, they're all building things, and it's all very interdependent. There's this tricky word called observability; I'm gonna talk about that in a second, but it gets at monitoring what's important. Instead of planning how much capacity I'm gonna need, I need to understand the limits of the environment I'm deploying into. Remember, scalability is supposedly automatic, but there are limits. And instead of provisioning systems, I'm really focused on automation and auto-scaling and configuring all those things. So there are some changes, but I'm living in a serverless world, and these are the changes I need to think about. Now I wanna dive deeper into some of the hard stuff, these four areas. All right, so what does SLA and SLO management mean? Let's ask ourselves.
Do our cloud providers' SLAs, which I've built my system, my house of cards, on top of, even support the SLO or SLA that I'm advertising to my customers? How many people knew that EC2 has an SLA but Lambda does not? One person, yeah. There's absolutely no promise that Lambda's gonna continue to work, and if it stops working, there are gonna be no ramifications other than you're upset. DynamoDB has an SLA. API Gateway, no SLA. CloudWatch Logs, which is tracking all this stuff, no SLA. So you need to take the approach of adding those things together, thinking them through, and starting to develop your own metrics and your own understanding of how those systems combine to create the availability of your environment. That's the first one. The next one is that we need to really think about cost as that first-class metric I was talking about. So, wow, Lambda is so unbelievably cheap: I only spent $1.79 to do all this compute that used to cost me hundreds of dollars. Wait a second, why is the bill $800? Well, somebody turned on debug logging, somebody was logging everything to CloudWatch, and now it's $800 per day. An architecture change or a configuration change can have a dramatic effect on the overall bill. If I only take a moment to look at the Lambda price, or at individual components, I'm forgetting that Lambda and all of these components are part of one serverless system. I have to think of it as a whole. Next, observability versus monitoring. This term has been all over the place. I've actually hated this term for a very long time, but then I spent some time thinking about the definition, and it's a combination of two things, and I think we've been forgetting that it's required to be a combination of two things. One is my ability to observe what the system is outputting. Is the system logging enough? Is it alerting enough? Is it presenting enough information?
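Before getting to the second half of observability, the earlier point about adding providers' SLAs together can be made concrete: for components in a serial request path, availabilities multiply, and components with no published SLA force you to plug in your own measured or assumed numbers. The figures below are placeholders, not published AWS commitments:

```python
# Composing component availabilities into an end-to-end figure. For a
# serial request path the system is only as available as the product of
# its parts. Where there is no published SLA, you must supply your own
# estimate; every number below is a placeholder for illustration.
components = {
    "API Gateway": 0.9995,  # assumed: no published SLA at the time
    "Lambda":      0.9995,  # assumed: no published SLA at the time
    "DynamoDB":    0.9999,  # had a published SLA
}

def composite_availability(parts):
    total = 1.0
    for part_availability in parts.values():
        total *= part_availability
    return total

a = composite_availability(components)
print(f"end-to-end availability: {a:.4%}")
downtime_min = (1 - a) * 30 * 24 * 60
print(f"implied downtime per 30-day month: {downtime_min:.0f} minutes")
```

The composite always sits below the weakest component, which is exactly why you can't advertise a customer-facing SLA without doing this arithmetic across every dependency.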
The other piece is actual analysis of that data, and that's the part I think a lot of us forget. So observability is a measure of how well the state of a system can be determined from the analysis of its outputs. That's observability. If we're not doing the analysis part, we're just doing monitoring, and that doesn't really scale in a serverless world. Serverless systems are easy to observe; they produce enormous amounts of data. But they're very hard to analyze because of all the dependencies, all the inner workings that are tied together, and thus they have a kind of out-of-the-box low observability. It's key to measuring what's important, though, so we have to figure this one out. By the way, that is an analysis. If you go looking for the most disturbing eyeball photo, there are worse ones than this, but that one is really troubling. So how many people here are operating as the analysis system in their environment? That doesn't scale. It definitely doesn't scale in a serverless world, because, remember, the system is changing faster than we can keep up. Next thing you know, you'll have teams that do nothing but maintain the dashboards that track the system that's constantly changing. And sooner or later, that team might start looking the same size as the team that's actually building the solution. It's not a scalable long-term solution. We have to get ahead of that problem and figure out how to build this into the system itself. Okay. And then the last piece I wanted to mention is capacity planning. When we think about capacity planning in a cloud-centric, serverless-centric world where auto-scaling is the norm and automation is the norm, what really stands in the way of scaling is limits, provider limits. How many things can I spin up? How many functions can I create? How many connections can I create?
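Coming back to observability for a second, the "analysis" half can be automated even crudely. A minimal sketch, with illustrative data and threshold: flag metric samples that land far outside a trailing window, instead of having a human eyeball dashboards.

```python
# The simplest possible automated analyzer: flag samples more than
# `threshold` standard deviations from the trailing-window mean.
# Entirely illustrative; real analysis uses far richer models.
from statistics import mean, stdev

def anomalies(samples, window=10, threshold=3.0):
    """Return (index, value) pairs far outside their trailing window."""
    flagged = []
    for i in range(window, len(samples)):
        trailing = samples[i - window:i]
        mu, sigma = mean(trailing), stdev(trailing)
        if sigma > 0 and abs(samples[i] - mu) / sigma > threshold:
            flagged.append((i, samples[i]))
    return flagged

# e.g. per-minute Lambda invocation counts with one sudden spike
invocations = [100, 98, 103, 101, 99, 102, 97, 100, 104, 101,
               99, 103, 100, 950, 102, 98]
print(anomalies(invocations))
```

Even this toy version makes the distinction concrete: collecting `invocations` is monitoring; `anomalies()` is the analysis step that turns outputs into a statement about system state.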
What type of calling pattern exists between Lambda and DynamoDB, or Lambda and S3? How do those things behave? So when we think about capacity planning, we think about the ability of our system to scale. Even though we've read the books saying scaling comes for free, we have to make sure we architect toward those limits, so that the patterns and the architecture support the ability to scale. Sometimes it's as simple as increasing a limit, sending an email to AWS and saying, hey, can you bump up how many concurrent executions I can get? That's a changeable limit. But sometimes it's based on things we can't control. I can only spin up 500 simultaneous concurrent invokes in US East; then I have to wait a minute for another 500. And that's different per region, and there are a lot of other little complexities like that. So you have to think about the limits. That is gonna drive your ability to scale and your capacity more than anything in a serverless world. Okay, some weird, crazy projections on the future. Let's think about costs and architectures, and come back to this chart for a second. That was already a big problem: a lot of the time we're not thinking about the necessary costs, we're just focused on the individual parts we built, and we forget that somebody can turn on debug logging and cause this to occur. So that's a problem. But could it be worse? Well, of course it could. So follow through this. Step one, three invokes. Step two, 100 files. Step three, a Lambda invoked 100 times from an S3 bucket. Step four, 1,000 records written to DynamoDB. The DynamoDB stream invokes a Lambda function; say I set my batch size to 100, and it's invoked again. That writes 1,000 files back to that S3 bucket. Problem. This is both costly and catastrophic. Now you might be asking, well, if I built this system I would never create that thing. Why on earth would anybody do that? Obviously this makes no sense, right?
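The arithmetic of that loop compounds quickly. Here's a sketch of the feedback cycle, with multipliers loosely following the talk's example (the specific values are illustrative, and `-(-x // y)` is just ceiling division):

```python
# Model of the accidental feedback loop: files landing in S3 trigger a
# Lambda that writes records to DynamoDB, whose stream triggers a second
# Lambda that writes files back to the same bucket. Each trip around the
# loop multiplies the work. Multipliers are illustrative.
def simulate_loop(initial_files, records_per_file=10, batch_size=100, cycles=4):
    """Return (total_invocations, total_files) after N trips around the loop."""
    files = initial_files
    invocations = 0
    total_files = files
    for _ in range(cycles):
        invocations += files                  # one invoke per S3 object event
        records = files * records_per_file    # records written to DynamoDB
        stream_invokes = -(-records // batch_size)  # ceil: batched stream events
        invocations += stream_invokes
        files = records                       # second Lambda writes a file per record
        total_files += files
    return invocations, total_files

invokes, files = simulate_loop(100)
print(f"after 4 cycles: {invokes} invocations, {files} files")
```

With a 10x multiplier per cycle, four trips around the loop turn 100 files into over a million, which is why "both costly and catastrophic" is not an exaggeration.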
Well, what if I'm trying to break up these services? What if there are two different teams working on this, or you're only responsible for a small part of that system? What if those parts of the system are coming from a third party? Think about how easy it is, if you're capturing events off an S3 bucket, for somebody else to go dump a million files into it. Have you configured the filters right? Have you configured all the details? What are the limits? What are the connections? How does data flow? That's a big problem. This is where people get bit all the time when they're trying to build serverless. They have recursive functions, they have cost overruns, they have architectures that just don't seem to scale, and they're wondering, wait a second, maybe this thing isn't for me. And I'll tell you, the thing is for you, but we have to change our thinking. We can't lift and shift into serverless; it's impossible. We could lift and shift into EC2, and it was mildly painful, but it's impossible to lift and shift into serverless. It gives us a lot of power, though, which is why we're doing it. But who's responsible? Who's gonna solve this challenge? We have to think about all those elements together. So that's my prediction: a new tribe among us is gonna emerge, people who understand the ramifications of architecture and connections and cost, and how those things tie together. Simon Wardley was talking a little bit about this at Serverlessconf and tweeted about it. He actually put out a survey to about 300 people with some real data, one data point, but real data. I highly recommend you go check it out. It basically says this tribe is already starting to emerge. So we kind of call this FinDevOps. It's a DevOps engineer, a developer, an operations person who cares about cost, who understands the ramifications of cost and architecture and how they're intertwined. They know intimately what the cost of a user or a transaction is.
They care about that. And they track dependencies and data flows like their lives depend on it. They're trying to solve that observability problem in a mountain of data. So I think those people are starting to emerge now; they're already out there. I fully expect to start seeing job descriptions and things like that emerging in 2019 where people start to ask these questions. So just by a show of hands, how many people here care deeply about their AWS bill? About a third of you. That's wonderful. When I come back here next year, I want 100% of you to raise your hands, because a well-architected system is a cost-effective system. So thank you very much. I really appreciate the time.

I really enjoyed that talk, and this is the first time I had heard FinDevOps as a term. I kind of feel like I'm doing this already, but I don't think a lot of other developers are, or are thinking about it. How do you sell this idea to developers who maybe haven't thought of it, or are actively disinterested in it?

So really, it's a really tough problem. When I was at Veracode, I brought AWS into Veracode; we didn't have AWS, we were all on-prem. And the first thing I did was ask the CFO if I could go spend a little money. And he said, sure, go, don't spend more than $3,000. And somehow, magically, that experience forced me to start thinking about cost. But obviously you can't create a movement with just one person; I had to involve other people, other teams, in the process. And it was really tough, because I think as engineers a lot of us have it ingrained in our minds that the infrastructure we're running our code on is a sunk cost, or we're just so far removed from it. And that's really tough. So what I think is missing, and what, just as a way of example, CloudZero is trying to do, is to actually make the cost visible to you in real time.
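That real-time-visibility idea can start very small: take daily cost totals (the kind of series you can pull from, for example, AWS Cost Explorer's GetCostAndUsage API) and flag any day whose spend jumped more than some threshold over the previous day. A minimal sketch, with made-up figures:

```python
# Flag day-over-day cost jumps above a percentage threshold. The dates
# and dollar amounts below are invented for illustration; in practice
# the series would come from your provider's billing/cost API.
def flag_cost_jumps(daily_costs, threshold_pct=25.0):
    """Return [(date, prev, curr)] where spend rose more than threshold_pct."""
    jumps = []
    for (d1, c1), (d2, c2) in zip(daily_costs, daily_costs[1:]):
        if c1 > 0 and (c2 - c1) / c1 * 100 > threshold_pct:
            jumps.append((d2, c1, c2))
    return jumps

daily = [("2018-11-01", 96.0), ("2018-11-02", 101.0),
         ("2018-11-03", 99.0), ("2018-11-04", 180.0),  # debug logging left on?
         ("2018-11-05", 184.0)]
for date, prev, curr in flag_cost_jumps(daily):
    print(f"{date}: ${prev:.2f} -> ${curr:.2f}")
```

The point isn't the sophistication of the check; it's that the alert arrives the day after the change, while "I did that" is still fresh, rather than in next month's bill.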
It's not very useful if I'm writing a bunch of code and at the end of the month the bill comes out and says it was $5,000 when last month it was $3,000. You go, well, how can I really effect change? But if I can have information that tells me, oh, I made a change and now suddenly I have an alert, I'm spending $30 more an hour. Oh, I did that. All right, let me go tweak that. So what you can do today is try to encourage people to get more interactive with the bill, to pay more attention to those things. And we're certainly trying to help solve that problem. So, great question. It's a tough one. I'm sorry, I don't think I've totally solved it; there isn't a silver bullet other than our platform, but that's not why I'm up here today.

Are there open source tools or third-party tools which can look at your model, how you're doing it, and predict the cost and compare it between different cloud providers?

Between different cloud providers? Yeah. So each cloud provider has cost management tools; in the early days the tools the cloud providers provided were horrible, and you really had to figure it out yourself, but now they're pretty robust, so you can get a lot of data out of them. There have been a ton of blog posts recently comparing Lambda invokes versus Cloud Functions versus Azure Functions, but I think that kind of misses the point: you need to think about total system cost. I haven't seen any open source emerge to do that sort of thing, other than the cloud APIs you can already call from your cloud provider to pull in that data. And I've seen customers trying to solve this problem today by basically writing a bunch of scripts, pulling a bunch of data, loading it into Excel, and doing some magic. Obviously that's not scaling very well for them. But in the open source world, I guess that would be the state of the art.

So it's actually a very similar question.
And it has to do with what's your perspective, once you have this analysis and can look at your costs, on how you prevent vendor lock-in. If you are in an AWS environment and you start using Lambda functions, all of a sudden you are deeply, deeply tied to that environment. And you want to be financially savvy. So where do you see that ROI and cost coming into play?

So, vendor lock-in. That's one of my favorite questions, thank you, and I've got two responses to it. The first is that vendor lock-in is a red herring. I'm sorry, if you're worried about vendor lock-in, you're worried about the wrong thing. It simply isn't a problem you should be thinking about. I know people go, wait a second, then I'm beholden to one cloud provider. Well, I made a choice a long time ago, when I started building, that I was gonna build on Linux or Windows or something else. The cloud providers are operating systems; just get over it and implement. At the end of the day, though, it's really not just about Lambda, it's about all the services. To put a bow on that: what stops me from building a multi-provider serverless system? Well, it's bandwidth and network connections and the ability to move data from point A to point B, and that isn't anywhere near where it needs to be to really enable that. So we see hybrid serverless apps emerging, where you have something like Auth0 as a service that handles my authentication, and I communicate with that because it doesn't require high bandwidth or high traffic. But it's a tough problem to solve, one that won't get solved until the cloud providers all have 10-gigabit Ethernet connections between them, right? And until then, it's just not worth thinking about.
But to dig into it a little bit more: just embrace it, I guess, is the best way I can say it. The reality is we're here to build great stuff in a short period of time. I see people building serverless systems, on any platform, any provider, in a period of time way shorter than I think anybody could have imagined: six months for things that normally take two years. So it's more about getting into the mindset of how quickly can I move, how quickly can I change and build, versus whether I'm gonna be locked into one provider or another over time. I'd love to dive more into that; that's probably an entirely different talk, though. But thank you.