So in the beginning, there were servers, physical servers; then we had virtual machines, containers, and now we have serverless functions. Serverless functions: the end of evolution, right? There will be no further technology after serverless functions, because they solve every problem we've ever had. They're small units of compute, they're easy to use, they're easy to manage. And it's not just Lambda that is serverless computing; there are all of these other services. Serverless computing is any of these services, any of these APIs where somebody else is managing everything for you. And you can mix and match them. You can have some of your stuff on Lambda, some of it on S3, some of it over at that image hosting service, some of it at that text hosting service and the comments service. The point is you don't own any of it. I'm mostly going to focus on the Amazon ecosystem today, but like I said, that's not the only thing there is. So what is serverless anyway? Basically, the point of serverless is that there are still servers, obviously, but you don't control them. You don't have to manage them, you don't have to optimize them, but you also don't get the choice. And really the whole point of serverless computing is this idea of speeding up development by getting out of the developer's way: removing the management overhead. But what does that mean for those of us who actually have to manage this stuff? That's what I'm going to be talking about today. So, back in the day you would have your single development pipeline: everybody would check their code in, and it would run the tests and deploy. This is, of course, if you had a CI/CD pipeline at all; maybe you weren't even that far along. And this was kind of a pain for the developers, but it was great for those of us who have to manage this stuff, because at least it was consistent and you knew what was going to happen.
Then came along the trend of multiple pipelines, microservices, all of that. So you've got a bunch of developers working on different code branches and different repositories, committing to them, deploying, but in the end you had one solid product, and at least everybody was still coming through the same pipeline. Maybe it was multiple pipelines, but probably at least the same systems, so you could at least figure out what was going on. But now we have the serverless environment. We've got developers who can deploy to Lambda, deploy to third-party services, set up their own pipelines. This is great for the developer, it's super easy for them to get moving, but it's a pain in the butt for the people who have to manage all of this stuff, because you don't know where they're deploying to, you don't know how they're doing it, how they've configured it, how they've set it up, or anything like that. And this is the standard microservices pattern, right? Everybody moves through this eventually, for the most part; there are still some monoliths out there. But there are a lot of advantages to microservices: the distributed workforce, the contracts or APIs between services, everything loosely coupled. You don't have to have everybody on the same page, and that's the joy of serverless and the pain of serverless. So we're all probably familiar with the typical microservices architecture and the standard microservices tools, right? We've got the servers and the frameworks and all of that stuff. Now, what do all of those things have in common? Servers. This idea that you have to manage these servers, which means you have to deal with all of these things like utilization and capacity planning. And in the serverless world, that all goes away.
You don't have to deal with any of that stuff; you don't get to deal with any of that stuff. In theory, it's beautiful and wonderful and fully managed. But that's really not true. There are still all of these other things you have to deal with regardless of whether there are servers or not: there are still queues, there's still monitoring, there's still deployment, there's alerting, there's the question of how we make a reliable system. As the DevOps engineer, your job is still to be the expert on building reliable systems. It's still your job to make sure that even if all of these people are deploying to all of these different places, they're doing it in a way that makes sense, that works well together, and so on. And in a serverless world, all of the problems you have with microservices are 10x worse. What do I mean by that? Basically, with all of these people deploying everywhere, with different configuration systems, in all the different ways, everything about microservices that sucks, that makes them hard to manage, is 10x worse. So let's quickly talk about some example serverless use cases. Raise your hands: how many people are using Lambda in their production environment? Okay. And how about anything else that would be considered serverless, so you basically have a third party in the line of production? Okay, so not a lot of you. So let me throw out a few use cases that might actually convince you that using Lambda or something like it is worth it. How many of you are on AWS? Oh, wow, that's a surprisingly low number, actually. So, okay, I guess the rest of you are self-hosting. Now I'm kind of curious; I'll ask you later. Some good use cases for serverless: one is the application backend, which I'm going to talk about a lot in a second, but also data processing, and pretty much any time you have one of those "tools" boxes. How many of you have a box called "tools"?
No? Everybody used to raise their hands. Well, I'm sure you have a box where you run all those stupid little scripts that do that one thing on a cron job, right? Those are all perfect use cases for something like Lambda or serverless. You don't have to have that one-off server, the cron-job box that runs all of the other things to make all of your non-one-off servers work well. And there are lots of other use cases for serverless like this one: live video stream processing, where you've got a bunch of different functions doing different things. The important thing to highlight here is that one of these services is not running on Amazon; it's a third-party image hosting service. Another example is static website hosting. There are services that will host your static website, and maybe you want to have some dynamic content too, so you can split it up that way. So what does Lambda do for you? Lambda is sort of the canonical serverless environment. What do they do? They manage server capacity, they make sure the functions are executed, and they handle some logging and some monitoring. This is their monitoring; it's pretty basic, it just shows how often things are running and how slow or fast they're moving. And how do you use Lambda? Not a lot of you have done it, so I'll explain it real quickly. Basically, you have to write your code. They give you an SDK; you can use Java or Python, and there are ways to use other languages as well with a shim. Then you have to choose the source that triggers the function, because the functions don't just run, you have to trigger them. You can trigger them with CloudWatch triggers that work like cron, or a bunch of other different things. You have to choose the network they're going to run under, whether it's a private network or the regular network, and how that ties into the rest of your systems.
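To make that "write your code" step concrete, here's a minimal sketch of what a Python Lambda handler looks like. The function name and payload fields here are my own invention, not from the talk; only the handler(event, context) calling convention is Lambda's.

```python
# Minimal sketch of a Python Lambda handler. Lambda calls
# handler(event, context): `event` is the payload from whatever
# triggered the function (a CloudWatch schedule, an API call, ...),
# and `context` carries runtime metadata like the time remaining.

def handler(event, context):
    # Pull an argument out of the trigger payload, with a default.
    name = event.get("name", "world")
    # The return value is the function's result for a synchronous invoke.
    return {"message": "hello, " + name}
```

Locally you can just call handler({"name": "ops"}, None) to sanity-check it, which, as we'll get to, is about as far as local testing currently goes.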
And then finally, you have to use Amazon's wonderful interface to deploy. If any of you have ever used the Amazon console, you know how "excellent" and "easy" it is to deploy things with it. So these are all the things you have to do. You can do them through the console, or you can use a library. Our company has built all of our stuff on Lambda; basically everything we do is serverless. And we started by building on an open source library, Kappa, to manage all of that. It was actually written by the same person who wrote boto, so if you're familiar with using Python on AWS, then you're probably already familiar with his work. There are other libraries out there too; I'll talk about them in just a second. But I want to give you a quick example of what that code does and how it works. So I wrote a quick little word generator. As a company founder, one of my jobs is to come up with product names and company names, and I hate doing both of those things. So, of course, being an engineer, I wrote a program to do it. You give it a prefix or a suffix and tell it to generate English-looking words, using a database of English n-grams from Google. This is what the code structure looks like. The important thing here is that YAML file. The YAML file lets you specify things like the profile and the permissions, and you can go pretty deep into the permissions and specifications, plus the timeouts, the memory, etc. Word Gen itself is actually very simple: it basically just calls another function that goes and grabs the data and pulls it out. There's a test file that's just JSON with some arguments. And then there's the actual database. The database file is put right in there; it's a SQLite database, and it's uploaded with the function, every time. This will be relevant in just a minute.
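For flavor, that YAML config looks roughly like this. I'm reconstructing the shape from memory, so treat the exact keys as approximate and check the Kappa README for the real schema; names like "wordgen" are made up.

```yaml
# Approximate sketch of a Kappa-style config (keys may differ
# between versions; "wordgen" and the profile name are made up).
name: wordgen
environments:
  dev:
    profile: my-dev-profile      # which AWS credentials profile to use
    region: us-east-1
    policy:                      # IAM permissions the function needs
      resources:
        - arn: arn:aws:logs:*:*:*
          actions:
            - "*"
    lambda:
      handler: wordgen.handler   # module.function to invoke
      runtime: python2.7
      memory_size: 128
      timeout: 3                 # the default timeout, in seconds
```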
So it's really easy to do this deployment. You just say kappa deploy, and it takes care of all of that crap I mentioned for you: it creates the roles, it zips everything up, it uploads it, and it does all of that other stuff. You can see there how it deployed and created all those roles. And then we can easily invoke the tests. We just say invoke with that JSON file; all it does is call the function with that JSON, and we get our results back. I really like that second one there, Opsender. That's a cool name. Free company name for anyone who wants it. Or product name. Now let's go edit our JSON file and ask it to generate 20 words instead of 10. We quickly change our file, we invoke our test again, and then we wait, and we wait, and we see that, oh no, it timed out. There was an error. The default timeout is three seconds. So we'll go in and edit the kappa config file, change the timeout to 30 seconds, and redeploy. What you'll notice during the redeployment is that it's smart about it. It says the function didn't change, so it doesn't need to re-upload the function; all it has to do is change the configuration. And you can see here that's all... that's weird... well, there it goes. All it did was change the configuration, and that's it. So it's pretty smart and does some cool stuff like that. Then you can see that after we run it again, it took four seconds, longer than the old timeout, and it generated some other cool ops names down there. That's all on my GitHub, so if you want, you can run it yourself. So there are a bunch of different ways to manage this; there are libraries for it, and they all work roughly the same way. There's Apex, which you may have heard of. Apex works almost exactly the same way as Kappa does; the command lines are slightly different. Apex does let you use a bunch of other languages, though.
They ship a JavaScript shim that lets you run pretty much any language that exists on the Lambda container, which is kind of nice. The Serverless framework uses CloudFormation, so if you have a lot of other AWS infrastructure, it's really nice because it ties in with the rest of your infrastructure. Or if you're using Terraform, Apex ties in directly to Terraform. So basically, if you're going to do just Python on AWS, Kappa is probably a good choice; if you want to do other languages or intermix with other infrastructure, Apex is really cool. I've used that one as well. So this is the architecture of our actual product. It's a little bit more complicated. It's a Slack bot, so we have to run an ECS container to maintain our connection to Slack, because they require a long-lived websocket connection. Basically, the way it works is: we maintain that connection, somebody types something in chat, it comes in to ECS, that fires off to a routing function in Lambda, which fires off to other Lambdas, which may or may not generate data and send that data back and forth. The response goes back, and then it generates an SNS message to send back and a Kinesis log line. That's the basic architecture of our system. So router.py, what does that look like? Router.py is fairly simple; it just shows how you call other functions. This is one of those nice parts of microservices that you don't have to worry about anymore: service discovery. How do I find the other services to call? It's just taken care of for you with Lambda, because you can call functions by name. You can alias the functions, you can call them by name, and you can do it right there in your code, which makes service discovery really easy. You don't have to worry about a library or anything like that. This also lets you do your red/black deployment, blue/green, whatever you call it: this idea of running two pieces of live code.
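To show what call-by-name looks like, here's a hedged sketch of the routing pattern. The function names and payload shape are hypothetical; the invoke parameters (FunctionName, Qualifier, InvocationType, Payload) are boto3's real ones.

```python
import json

# Sketch of Lambda-style service discovery: functions are addressed
# by name, and a Qualifier (an alias like "prod" or "dev", or a
# version number) picks which deployed version handles the call.

def build_invoke_kwargs(function_name, payload, alias="prod"):
    """Build the keyword arguments for lambda_client.invoke()."""
    return {
        "FunctionName": function_name,
        "Qualifier": alias,                    # alias or version
        "InvocationType": "RequestResponse",   # synchronous call
        "Payload": json.dumps(payload),
    }

def route(lambda_client, function_name, payload, alias="prod"):
    # lambda_client would be boto3.client("lambda") in real code.
    resp = lambda_client.invoke(
        **build_invoke_kwargs(function_name, payload, alias))
    return json.loads(resp["Payload"].read())
```

Flipping that alias argument between two deployed versions is the basis of the traffic shifting described next: you move traffic by changing one string.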
We have the current and we have the future, and we shift traffic to the future to see how it's working, so we can rapidly shift back. You basically just shim that right into your own code. Now, at this point, you might start to see the problem: every one of your developers might do this slightly differently. This is one of those problems you now have to solve that's multiplied 10x from microservices. In microservices, chances are you gave them a library to do this; now you're going to have to give them a chunk of code and hope that they include it, and so on. So here's an example of deploying functionality. Somebody typed something in and it doesn't work, so I quickly go in, edit my file, and run kappa deploy. It does its little thing there where it deploys, and then it's working in dev. Yay, super simple, right? I changed one file, hit deploy, and it was great. Then, as a developer, I can push up my changes. GitHub has that handy pull request thing that you've probably used. I'm an extreme developer, so I'm going to approve my own pull request. And now you start to see the problem again, right? In basically five minutes, I've deployed new functionality to production, and the only clue anyone has that I did this is the fact that I did a git check-in. That's the only clue there is. There's no log of the deployment or anything like that, because it all came out of my laptop. So, yay for the developer, production in five minutes; but it sucks for the person who has to manage that and make sure it works. So the deployment for Kappa looks like that. Now, how do we solve these problems? Well, we have the codebase structure here: we've got all of these YAML files, we've got some extra JSON configuration files, which are basically like environment variables, and we have our artifacts.
And the only way right now that I know of to manage this is to scan all of it: basically check it all back out, scan through it, and look for these problems. That is, unfortunately, the current state of the art in managing serverless infrastructure. So the bad news I have for you today is that there aren't a lot of solutions I can offer, because there just aren't a lot of solutions that exist. What I will ask, now and again, is that if you have solutions, please share them. I'm going to tell you all of the problems we have, and hopefully next year I can come back here and we can talk about solutions to these problems, because unfortunately there just aren't a lot right now. The good news, though, is that it means serverless isn't making our jobs go away today; maybe we'll have a good year or two left before we're completely useless to developers. So you'll notice here the file sizes of all of the functions. The way we've laid out our code, every Lambda function gets its own little repository, and you'll notice those first three are significantly larger. I'll come back to that in a second, because the other big debate a lot of people have is: how many repos should I even have? Some people want one repo for every single function, which is very nice and granular but very hard to maintain. Some people choose one mega-repo for all the functions. We've sort of compromised: we have a front-end repo, a back-end repo, and a website repo. Typically, when you're editing front-end functionality, you're going to end up modifying more than one function at a time, so it's handy to have those in one place for pull requests. But we still have to deal with the issue that a back-end change, or a front-end change that requires a back-end change, requires two pull requests. You can reference them to each other, but that's awkward.
So we're still trying to figure out what the best methodology around repos is; it's actually looking like it's probably a single massive repo. But the important thing to note here is that Lambda lets you manage your code and your infrastructure in one place. Or, if you're the person who has to actually manage that stuff, it means that your developers get to manage your infrastructure and do whatever they want to it, without you really having any control over it. So how do we at least regain some of the control? How do we at least have some semblance of knowing what's going on? Well, one way is through immutable data. This is standard distributed systems architecture: the more immutable data you have, the better. The nice thing about immutable data is that it doesn't matter what the cache state is, and you can see all the changes to it, because there's typically one after the next after the next, so you get a log. That at least helps a little in any distributed system, particularly in one of these where the data can be spread over lots of different places and maybe you can't even do cache invalidation, because some of that cache is stored on some third-party server. In a distributed system, moving data is going to be your biggest cost, greater than any other cost, but for reliability you need to make that trade-off. So my best suggestion is to use queues as often as possible. And I do want to go on a very quick side note about queues: if you store everything that moves through your queues, you can get some great insight out of it. But let's look at this graph, right? Here's a graph of queue depth. What happened? What went wrong? Did things come in too quickly, or did things not get processed fast enough?
So if you take nothing else away from this talk, I want you to go look up "cumulative flow diagram." If you have any queues, this is the best way you can monitor them. It tells you how many things have arrived and how many have left, and by looking at this one you can immediately tell that the problem was slowness in processing: the inbound rate stayed the same, and the outbound rate was poor. The reason it's cumulative is that if you don't do it cumulatively, it looks like this, which is effectively useless. Some other tips and tricks with serverless that we've run into. Limit your function size: every time you make a change, you have to re-upload your function to the service, which means the bigger it is, the longer it takes; and worse, if you're using Java in particular, JVM startup time is actually going to be a big concern. Remember that execution is asynchronous; again, queues will help you solve this. On the back end of these serverless systems, what they're typically doing is managing containers for you, so you want to know that those containers exist and try to take advantage of container reuse. You also have to remember that there is some temp storage, but you can't rely on it existing from one invocation to the next; you can only rely on it potentially existing. This is again where that immutable data comes in handy: if you throw some immutable data into temp and it happens to be available for the next invocation of the function, that will speed things up, but you don't have to worry about invalidating it or whether it will be there or not. It's like a nice little local cache that you can sometimes rely on. Function aliases, as I mentioned, are great ways to handle routing. You can use function aliases with prod and dev, for example, so you don't even need to have two separate environments.
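Going back to the cumulative flow diagram for a second: there's nothing magic about it, it's just running totals. A minimal sketch, with synthetic numbers rather than real queue data:

```python
from itertools import accumulate

# Cumulative flow: running totals of arrivals and departures per
# time bucket. Queue depth is the vertical gap between the curves,
# and the slopes tell you WHY the queue grew: a burst of arrivals
# steepens the inbound curve, slow processing flattens the outbound one.

def cumulative_flow(arrivals, departures):
    cum_in = list(accumulate(arrivals))
    cum_out = list(accumulate(departures))
    depth = [i - o for i, o in zip(cum_in, cum_out)]
    return cum_in, cum_out, depth

# Inbound rate steady, outbound rate drops halfway: processing slowness.
cum_in, cum_out, depth = cumulative_flow([10, 10, 10, 10], [10, 10, 5, 5])
# depth grows (0, 0, 5, 10) even though arrivals never changed.
```

Plot cum_in and cum_out on the same axes and you have the diagram; the raw depth series alone can't distinguish the two failure modes.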
You can use version one, version two, version three, so you can have your code pointed at a particular version of a function. There are lots of uses for these aliases; they exist, make sure you use them. Make sure you're using the included logger; it has a lot of good information, but you do have to process it. Make sure you set up all of your alarms in Lambda and CloudWatch; we got bitten by that, by not keeping track of failed executions, things like that. One trick to avoid some throttling, if you are using Amazon, is to use their queuing service, because data coming in from their queue has a much higher throttle than data coming in directly. If you try to send data directly from one service to another, it has a much lower limit than if you send it through a queue first. And be aware of infinite loops. We got bitten by this one, and it's again one of those microservices problems that gets 10x worse with serverless: we have multiple people writing tiny functions, those functions rely on each other and call each other, and we ended up stuck in an infinite loop. We couldn't figure out why our run times kept going up but nothing was coming back. We ended up solving it by deploying a new version of the code that didn't have the loop. Luckily we caught it quickly. I don't know what happens if you just let one keep running, because I don't have infinite money, so I don't want to find out; I would hope that Amazon would shut it down, but who knows. Anyway, the best way we found to avoid it is to pass the call stack, basically, which is a good practice anyway because it really helps you with distributed debugging: every function records "I was called by this one, and I called that one next," so at least you get a trace of all the functions the data passed through. And then, store your data properly: don't put anything on a local instance, obviously, that kind of thing, because those local instances only last for a few seconds. That's actually a nice security benefit, and it takes some workload off whoever is managing this infrastructure: you don't have to worry as much about security, because your exposure is lower. So, using Kappa, some of the difficulties we've solved are zipping everything up, deploying, and creating the roles. But we still have other problems. Efficient dependency usage is one of them. Local dev environments are another big problem; there's really no good solution for local dev yet. And making sure that we all have the same dependencies: we had problems where one developer wouldn't run their library update, so they would deploy a function that had an earlier library, and all of a sudden some code that I wrote stopped working and I couldn't figure out why. It was because somebody had deployed a lower-version library. That's another thing you have to watch out for. So again, this diagram: we've got a bunch of people deploying a bunch of stuff to a bunch of places. And I mentioned before the sizes of the libraries. If we dive deep into that, we'll see everything else is a lot smaller. So why is that? Here are all of the files involved in those first three functions, and what you'll notice is this library, botocore (you can talk to me later about why we have our own botocore), is taking up most of that space, and you'll see those big blue blobs happen again and again. Basically, there are three functions that all have to include the same very large library, taking up a lot of space. How do we solve this?
We don't actually have a solution for this right now. If it were any library besides something like botocore, one potential solution would be to run a separate function that does whatever that thing is and have everybody else call it. But then you have to get all of your developers onto the same page, to make sure they all know that service X exists and does the thing they're all importing. There's no tooling today that helps you discover these, but I think that's probably a path we'll go down: actually walk the source tree and say, hey, six people are importing the same library, let's set up a service that does whatever that library does, tell everybody to call it, and then put up a watch to make sure nobody else imports that library. Something like that. And that's part of the next step of our deployment pipeline. We're going to make it so that when you check in to git, it calls a Lambda function, and that Lambda function spins up an EC2 instance, pulls down the code, runs all of our analysis, runs Kappa, runs the deployment, and then sends everything back to the Lambda function to log a bunch of data into DynamoDB, maybe send out some alerts, and send back a message saying, okay, your function is deployed. We're still working out the details, but this is probably what we're looking at. Deployment is actually a little slower for the developer, but in exchange we get a lot more insight into what it is they're doing. We're essentially making it easier for them by deploying for them, but in reality we're checking the work, so to speak.
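A first cut at that check-the-work step could be a simple version-skew scan. This is a sketch under my own assumptions, not tooling that exists today; in reality you'd build the input dict by walking each repo and parsing its requirements file.

```python
from collections import defaultdict

# Flag libraries that different functions pin at different versions.
# Input maps function name -> {library: pinned version}.

def find_version_skew(deps_by_function):
    versions = defaultdict(dict)          # library -> {function: version}
    for func, deps in deps_by_function.items():
        for lib, ver in deps.items():
            versions[lib][func] = ver
    # Keep only libraries where more than one distinct version appears.
    return {lib: funcs for lib, funcs in versions.items()
            if len(set(funcs.values())) > 1}

skew = find_version_skew({
    "router":   {"botocore": "1.4.0", "requests": "2.9.1"},
    "word_gen": {"botocore": "1.2.3", "requests": "2.9.1"},
})
# "botocore" is flagged (two versions); "requests" agrees everywhere.
```

The same source-tree walk could also power the "who else imports this big shared library" discovery mentioned above.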
So at least that will solve the making-sure-we-have-the-same-dependencies problem, and the knowing-when-someone-deploys problem. But again, efficient dependency usage is a big problem we don't have a good solution for. Local dev environments are another really big one: you pretty much just have to deploy it, maybe to a dev account, and test it there to see if it works. And more on the subject of testing: in a microservices environment, if you're doing a robust job, then hopefully you're doing a lot of testing of things like slow networks and cutting off communication from one service to another, things of that nature. None of that can you do right now if you're using Lambda or these other services; none of them provide a way to say "make this randomly not work five percent of the time and tell me what happens." So as I mentioned before, and I want to mention again, a lot of these are problems that still exist and don't have a lot of solutions. Hopefully, as we work on solutions, we can share them with each other; we try to make everything we do around tooling open source. And then hopefully I can come back here next year and talk to you all about all of the problems we've solved with serverless deployments. I am done, and I think that's all of my time, so I'll be around if you have questions. Thank you. [Give it up for Jeremy!]