All right, I think it's time we get started. Thank you all for coming. I know it's just after lunch, and there are at least 14 other sessions you could have gone to, so thanks for choosing this one. My name is Mikko Ai, and the reason I'm here today is to share with you a small utility that we wrote in our team that will hopefully help you test your software running on Kubernetes. It's inspired by Chaos Monkey. And if, at the end of the 30 minutes we have together today, you end up using it and your clusters get 1% better, that would really make me feel warm and fuzzy inside. So let's try to do that.

Let's see what's on the menu today. The presentation is going to look a bit like this: I'm going to claim that distributed systems are hard. I'm not expecting many people to disagree, but let's see about that. Then we're going to talk briefly about the principles of chaos engineering and Chaos Monkey, probably the most famous example. And then we're going to talk about the Seal itself: how to set it up, how it works, why it's simple, how to use both the interactive and the autonomous modes, and a couple of examples of what the policies can do and what they look like. And then we're going to see where to find the Seal, how to add to it and where it welcomes contributions, and a couple of closing thoughts.

But before all of that, good manners require me to introduce myself briefly. I work for Bloomberg; I flew over here from London for this presentation. And I have the pleasure of working on the DTP team with the five gentlemen you can see here. DTP, very briefly, is an internal platform that we built for the rest of engineering to run microservices, with some quality-control superpowers. If you are interested in that, we actually gave a presentation at KubeCon 2016 in Seattle; just search YouTube for Paul McLaughlin, Bloomberg, or Sachin Kambhach, and the presentation is there.
All right. So let's start with the bold claim that distributed systems are hard. Before we go any further, let's do a quick show of hands. Who thinks that distributed systems are hard? That's interesting, that's not everybody. Let's do a quick reality check: who thinks that they are simple? OK, we have some brave souls there. In that case, who hasn't had an outage in the last couple of months? All right. Unsurprisingly, the hands that said it's easy are not going up the second time. I think that gives us a good feel.

I like this quote that a friend introduced me to, from Lamport. To paraphrase, it says that a distributed system is one in which a machine you didn't even know existed can mess up your own machine. And the way I feel we often deal with distributed systems is a little bit like walking in a jungle. You walk around, and there are all these rare animals and plants, and everything is exotic. But at the same time, you feel that in some of these bushes, something is going to try to snap at you and bite you and kill you. Or maybe it's like the Australian animal kingdom, the way I picture it. But the bottom line is that all of these things fail. Communications fail. Hardware fails. We fail. Environments diverge, and all of that. And most of the time the failure is going to hide from you, and it's only going to try to kill you when you're not paying attention.

Those who raised their hands saying that distributed systems are difficult are not alone. Last time I Googled it, there were about 49 million reasons to think that. So the problem really is that the happy path in a distributed system is just the tip of the iceberg. And because it's difficult to test, we often, probably way too often, just go with the hope-for-the-best approach. And that's not the best. So there must be a better way. And as it turns out, there are various attempts to deal with that.
One particular approach that I want to talk about today is the principles of chaos engineering. If you go to the website, principlesofchaos.org, you can read the manifesto. It's not particularly long, and rather fun. But if you don't want to read a page, it's really just three main points: in order to increase the confidence in your distributed system, and notice how I'm saying just increase the confidence, not have full confidence, we introduce failure on purpose, and then we hope to detect bugs and unpredictable outcomes quicker than our users can report them to us.

Probably the most notable, visible example of that was Chaos Monkey. It was published by Netflix, shared with the community, and got a lot of hits on Hacker News and other places. And Chaos Monkey was an inspiration for PowerfulSeal. It was initially doing a lot of things; now it's basically just terminating instances. And PowerfulSeal is trying to achieve a lot of what Chaos Monkey achieves, except in a slightly different way.

So probably the first question will be: OK, how is this different from Chaos Monkey, and why would you use the Seal instead of the Monkey? And the answer is this, really: PowerfulSeal was built from the ground up to speak Kubernetes. It understands things like namespaces, pods, and deployments. It also has a very simple and flexible YAML syntax to describe the policies for the autonomous mode. And it provides the interactive mode, in which you get a mix of what kubectl gives you and what something like Nova for OpenStack gives you, in one place, with some nice auto-completion. And it doesn't really expect much in terms of external dependencies. It doesn't even need a database. It tries to be really, really simple, nothing too complicated, and it doesn't even need cron.

So what does it use? It wouldn't be a technical presentation if I didn't have a diagram, so here's the diagram. That's roughly what it looks like.
You have the Seal, which sits kind of on the edge of the cluster; it depends on how you want to run it. And it just needs to talk to three things. It needs to talk to the cloud API, so that it can take machines up and down, or potentially delete them if you really don't like a cluster. It needs to be able to SSH into the machines to execute docker kill. And it also needs access to the Kubernetes API server; a standard kubeconfig file is going to suffice for that. And that's it, really. It tries to be really, really simple, and you are free to configure it the way that works for you.

OK, so I said that it has the interactive mode and the autonomous mode, which try to achieve two different objectives. In the interactive mode, what we found ourselves doing most of the time is: we run it, and then we do a first, very rough go-through of the software. We try to kill some things, and we see what happens. You execute commands as you go, and you get a good feel for where to apply pressure in order to do damage. So in the interactive mode, you basically pip install the Seal, point it at your cloud, point it at Kubernetes, and you can go and play around.

The two big things that you're going to do there are, first, manipulating the nodes. So for example here, this is a quick demonstration of the filtering that you have at your disposal. You start it, you sync with your cloud driver, and you get all the nodes. You can restrict that to an inventory from an inventory file, just to make sure that you don't accidentally touch other machines that might be in the same namespace in your cloud provider. Or you can discover the nodes from Kubernetes, that is, match the machines Kubernetes has in the cluster against your cloud driver. And then you have a whole lot of different things you can filter on, in case you want to take things up and down. For example, the Kubernetes labels on nodes become groups.
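As a rough sketch, a first interactive session looks something like the following. The flag names and interactive commands here are illustrative assumptions based on an early release of the tool, not a definitive reference, so check `powerfulseal --help` and the repository's README for the real ones:

```shell
# Install the Seal (the PyPI package is named powerfulseal)
pip install powerfulseal

# Start interactive mode, pointing it at the cluster and the cloud driver.
# Flag names are illustrative and may differ in current releases.
powerfulseal interactive \
  --kubeconfig ~/.kube/config \
  --inventory-kubernetes

# Inside the interactive shell, the kind of commands described above:
#   nodes                 # list nodes synced from the cloud driver
#   nodes group=worker    # filter nodes by group (node labels become groups)
#   stop <node-name>      # take a node down via the cloud API
#   start <node-name>     # bring it back up
```

The point of this mode is speed of exploration: auto-completion on names and IPs makes it cheap to poke at the cluster and find out where applying pressure does damage.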
So for example, the architecture property here: you can filter on that, and you can take the nodes up and down. So that's the first part, and that's probably where it's most similar to Chaos Monkey. The second thing is manipulating the Kubernetes abstractions, so pods, really. The way we use it most of the time is: you have some kind of deployment that you would like to punish and see that it keeps running. PowerfulSeal lets you select things like namespaces, pods in namespaces, pods for deployments, deployments in namespaces, and then execute a kill on some subset of those pods, with the same auto-completion and filtering as in the other commands.

So that's all nice and great, but it's very manual. It's basically a discovery mode. You see: OK, so with pressure applied here, my application breaks. You go and fix it, and it stops breaking. But how do you actually make sure that it continues working while you develop it over time? This is where the autonomous mode kicks in. In autonomous mode, you start the Seal with a policy file, which just happens to be a YAML file, fairly easy to write and read.

With the policies, we have a three-step approach, for both nodes and pods. They're dealt with separately, but the logic is the same. You match an initial set with some kind of criteria. Then you apply filters to remove the ones that you don't want to touch, including some probability filters that we're going to see in a second. And then you issue actions on the remaining items after the filtering.

So if we had another graph... yes, question? Sorry, the filters can be what? Metrics? Sorry, I didn't quite catch that. Ah, whether the filters could be based on metrics, so that you could make the kill depend on some metric you're filtering on. All right, on metrics, basically. Oh, that's a very good question.
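To make the three-step approach concrete, a pod policy in the version of the Seal described here looked roughly like this. The key names (`podScenarios`, `match`, `filters`, `actions`) are recalled from the early schema and may have changed, so verify against the examples in the repository before copying:

```yaml
# Illustrative policy sketch; key names may differ in current releases.
config:
  minSecondsBetweenRuns: 60     # pause between scenario runs
  maxSecondsBetweenRuns: 300

podScenarios:
  - name: "Kill a couple of pods behind my deployment"
    # Step 1: match an initial set of pods
    match:
      - deployment:
          name: "my-deployment"   # hypothetical deployment name
          namespace: "default"
    # Step 2: filter the matched set down
    filters:
      - randomSample:
          size: 2                 # keep only 2 randomly chosen pods
    # Step 3: issue actions on whatever is left
    actions:
      - kill:
          force: true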
And that's probably a prime example of a contribution that you could make. The truth is that we implemented the Seal to cover just the use cases that we had, and we open sourced it very recently. And when I say recently, I mean yesterday. So you are free to add to it, and metrics-based filters like that are a very good idea.

So, where was I? The three steps: it's basically matching, where you can have certain criteria, and then we deduplicate. Then the filters: for example, imagine that you have one filter that removes B and another that removes D, so you're left with A and C. And then the actions are applied in sequence to all of the remaining items.

If we take a closer look at what these policies really look like, this is the matching section. The one on the top is for pods. In this example, it's selecting either some kind of random namespace, meaning that all pods from that namespace will go in; or a deployment, in which case we're going to query Kubernetes and get the pods that match that particular deployment in that namespace; or some arbitrary combination of labels that you might be interested in. And for the nodes, you can match on properties. Currently it supports name, IP, group, availability zone, and state. So a particular example would be taking an availability zone and smashing it.

Then it goes through the filters, and the filters are probably where most of the creativity can go. Currently we support a couple of basic ones. Again, you can filter on properties, and they support regular expressions. And there are a couple of things you might be interested in, in terms of probability. For example, a random sample supports a percentage: these are the pods of my application, can you please kill a random 20% of them? Or a probability to pass at all: don't execute this at all if a biased coin doesn't flip the way I wanted it to.
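As a sketch of those matchers and probability filters, again with key names recalled from the early schema (in particular `ratio` and `probabilityPassAll` are assumptions to double-check against the repository):

```yaml
# Illustrative matcher and filter fragments; names may differ from the current schema.
podScenarios:
  - name: "Kill a random 20% of the backend pods, half of the time"
    match:
      # Any of these three matcher styles can be used:
      - namespace:
          name: "payments"          # all pods in a namespace (hypothetical name)
      - deployment:
          name: "api"               # pods belonging to one deployment
          namespace: "payments"
      - labels:
          namespace: "payments"
          selector: "app=api,tier=backend"
    filters:
      - randomSample:
          ratio: 0.2                # keep a random 20% of the matched pods
      - probability:
          probabilityPassAll: 0.5   # flip a biased coin; half the time, do nothing
    actions:
      - kill:
          force: false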
And it also has a couple of other things. For example, the time-and-day filter lets you take the actions only on certain days of the week, or during certain hours. So, for example, a weekday during working hours is a better time to destroy things in production than, say, a Sunday morning.

And then after that, it goes to the actions. For the pods, you can essentially just kill them; you can force them to die or not. And for the machines, the nodes, you can either start them, stop them, or execute arbitrary commands on them. And yeah, this is basically it.

If you want to go and grab a Seal, I encourage you to go to github.com/bloomberg/powerfulseal. Like I said, it's a baby seal, so please be gentle; we literally put it there yesterday. And we would really, really love some contributions. In particular, if there are any artists here: we don't have a logo, and Chaos Monkey does have an amazing logo, so if there's anyone who would love to do that. Also, the only cloud driver supported is OpenStack, because that's what we're using, but it would be fairly trivial to add another one; so if that's your use case, please contribute it in a PR. The same goes for filters: I rather hope that you have a lot of different use cases that you could cover. And stars on the repo, if you don't mind, would be great.

So this is really the Seal for you. If we take a step back now, the reason for all of this is that it is inevitable that your stuff will fail. And if you are just waiting for it to fail, you are, A, losing time, and B, probably risking that your users are going to point out that things don't work. So instead of waiting and worrying about the next outage, why don't you just have your own, every day, in production? Embrace the inevitable failure, and embrace the Seal. So yeah, the Seal is out there for you. And that is basically all I have for you today. Thank you very much. Happy to take questions. So the question is, OK, great.
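A node scenario combining the day-and-time filter with the start/stop actions might look like this sketch; as before, the key names follow the early schema from memory and should be verified against the repository's examples:

```yaml
# Illustrative node scenario; key names may differ from the current schema.
nodeScenarios:
  - name: "Stop one worker node, weekdays during working hours only"
    match:
      - property:
          name: "group"
          value: "worker"           # Kubernetes node labels become groups
    filters:
      - dayTime:
          onlyDays: ["monday", "tuesday", "wednesday", "thursday", "friday"]
          startTime: { hour: 10, minute: 0, second: 0 }
          endTime:   { hour: 16, minute: 0, second: 0 }
      - randomSample:
          size: 1                   # only touch one node per run
    actions:
      - stop:
          force: false
      # an `execute` action could instead run a command over SSH, e.g. docker kill
```

The day-and-time filter is what keeps the autonomous mode safe to leave running: failures land when people are at their desks, not on a Sunday morning.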
You're killing things, but how do you monitor that things are running properly at the same time? The answer is that the Seal is not really concerned with your system running properly; it just wants to destroy it. So we have other things in place that continuously monitor that, despite the pods going down and despite the machines going down, we still have things running. There will also be an issue on GitHub about monitoring the Seal itself. It would be really nice to add something like Prometheus metrics, to see which failures PowerfulSeal caused and which ones are completely independent of it, because there will be others like that. So that's probably going to be coming in the next couple of weeks, hopefully.

There was a question there. Can you script the console? Well, yes, you can script the console, but that kind of defeats the purpose. If you want to script and automate things, I'd rather you extend the policy executor, so that you can achieve the things that you might not be able to achieve right now. The interactive mode is mainly there to go faster, with the completion of the names and the IPs and all of that, just to play around. And for the long-running thing, the autonomous mode is what's supposed to be there for that.

There was a question there. No, unfortunately, it doesn't do that at the current stage. Any other questions? All right, well, in that case, thank you very much for coming. It was a pleasure, and hopefully you've got another tool in your toolbox, a powerful tool to break your clusters. Thank you.

Actually, maybe one last thing: I'll just submit this to Hacker News in case someone wants to share it. OK, expired link, won't do that then. That's mean. I'll submit it again then. OK, now I'm "posting too fast". I'll post it in a minute. All right, thank you very much.