Okay. Given that we are from Switzerland, we are going to try and start on time. So let's get going. Welcome everyone. My name is Diego Zamboni. I'm from Swisscom, and this is my friend and colleague Bill Chapman. We're going to give this talk on self-disruption and healing in a Cloud Foundry-based environment. Do you want to introduce yourself?

Sure. I'm Bill Chapman. I'm a cloud architect at Stark & Wayne. I've been attached to Swisscom for a couple of years now, and the two projects we're going to talk about are the culmination of a lot of the effort Diego and I have put in together.

Alright. First of all, I'd like to comment briefly on how interesting it is to hear my name at a Cloud Foundry conference. I mean, we all have instinctive reactions to our names, right? So it's weird to be hearing it everywhere. Anyway, just a brief overview of the agenda. Normally when we talk about this in Switzerland, we have to explain what Cloud Foundry is, whereas here we probably have to explain a bit what Swisscom is. So we're going to do that very briefly, and then we're going to talk about the two parts of this effort. One is the health management, monitoring, and self-healing platform that we've been building, and the other is a self-disruption platform that we've been building, which we've appropriately named Chaos Heidi as a tip of the hat to Chaos Monkey.

So very briefly, this is my one marketing slide from Swisscom. Basically, Swisscom is the biggest telecom provider in Switzerland, providing everything from fixed line, internet, cellular service, outsourcing services for big companies, connectivity, all sorts of things. And now, very recently, a PaaS platform built on top of Cloud Foundry, which we have named the Swisscom Application Cloud. There's a URL over there if you want to try it out. We have a publicly available version of the service, which is available to anyone in the world. There's our Twitter handle, so you can follow what gets announced.

And this is just a very brief architectural overview. It's built using Cloud Foundry on top of OpenStack, using a few different components for the infrastructure. For SDN, we use PLUMgrid. We use ScaleIO for storage. We have a few other things here and there. We have built a number of services that can be used together with the platform, so we have Redis, MongoDB, ELK, RabbitMQ, and a few other things that our users can connect to their applications.

So one of the basic ideas of the Application Cloud is that we have all of these components, but we don't have just one instance of this. Instead, our intention, and what we are actually doing already in production, is to have multiple instances of this construct called the Application Cloud for different uses. So we have a public instance, we have an internal Swisscom instance for internal applications, and we have virtual private instances for different customers. And of course, all of these need to be monitored, need to be managed, need to be operated at scale. One of the realizations is that Cloud Foundry itself has self-healing and all these nice features, but we're not just using Cloud Foundry. We're using all of these other components that together create this service and this application. So this is where all of this comes from.

So how do we monitor a Cloud Foundry production environment? From the beginning, we have been using this concept called the OODA loop. Sorry, wrong circle. This is the OODA loop.
Basically, it's a well-known idea that originated in the US military, in the US Air Force. It's one of these typical feedback loops, and it stands for observe, orient, decide, and act. The idea is that you collect information about what's going on, you analyze this information, you make decisions based on what you discover, and then you act in reaction to the current conditions. Then this repeats over and over, and this allows you to adjust your course of action in reaction to what's happening around you.

So one of our early realizations is that these OODA loops are already all over the place in our infrastructure. But they are mostly incomplete, and they are mostly disjoint. Each component has its own little version of a self-monitoring, self-healing component. Some of them only fix certain things. Some of them monitor certain parts and tell you about it, but they don't fix anything. Others just fix things, but they don't tell you what's going on inside. So it's a bit of mix and match, and we don't want to reinvent all of these components. Rather, we want to use what's already there in different forms. I mean, we have different self-healing, health-checking, and self-management scripts and bits and pieces all over the place. We want to aggregate the data that they are providing, and we want to aggregate their functionality and build a layer on top.

Now, ideally, at the end of the tunnel, we would like to automate humans out of the equation and have all of the decisions made automatically. This is clearly not realistically possible with our technology. So we want to keep humans in the loop, but delegate to humans only the really hard tasks and the really hard decisions. We want to, over time, learn to automate more and more of these solutions to problems, or at least the detection and notification of problems, so that our operations teams and our engineering teams can focus on solving the really hard problems that we don't know how to solve automatically. I should clarify, we're not talking about artificial intelligence or machine learning. This is really very simple rule-building and, iteratively, increasing our knowledge base to automate these things.

Also, in parallel, we're building this self-disruption platform that was inspired by Netflix's Chaos Monkey project, which we call Chaos Heidi. The idea is to automatically introduce disruption and destruction into the infrastructure so that we can test both our detection and our reaction capabilities.

So, overall, this is all called Orchard, of course, and we want to improve visibility into our systems; the ultimate goal is to make everyone's lives easier. Not only for the operations teams, but also for the engineering teams, the development teams, and even the customers, by giving them access to some of this data.

So, just to give you a very brief overview, Orchard is really a backend system, but we discovered very early on that you cannot properly demo something without having a visual component. So we have a very simple dashboard that basically gives us different views into the model, the data that is stored in the system. We can have views that give us information about the state of the infrastructure components, such as OpenStack or ScaleIO or PLUMgrid. We can also have views into our business integration and end-to-end service processes, such as end-to-end checks on our Atmos service or ELK service or MongoDB or whatnot, our billing runs or logging subsystems, and so on. We can also have views on the checks that are being run on the Cloud Foundry services infrastructure, that is, on the containers and the VMs on which the instances of the services are being executed. And this is just a matter of reconfiguring the view according to what it is that we want to look at, right?
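To make the OODA loop idea a bit more concrete before the implementation details, here is a minimal, purely illustrative Go sketch of such an observe-orient-decide-act cycle written as a simple control loop. The types, the rule, and the interval are invented for the example and are not Orchard's actual code.

```go
package main

import (
	"log"
	"time"
)

// CheckResult is a simplified stand-in for a single health-check observation.
type CheckResult struct {
	Node    string
	Check   string
	Healthy bool
}

// observe would gather check results from Consul, BOSH, logs, and metrics.
func observe() []CheckResult {
	return nil // placeholder: real data collection would go here
}

// decide applies simple rules to the observations and returns actions to take.
func decide(results []CheckResult) []string {
	var actions []string
	for _, r := range results {
		if !r.Healthy {
			// Simplest possible rule: escalate anything unhealthy to a human.
			actions = append(actions, "notify-ops: "+r.Node+"/"+r.Check)
		}
	}
	return actions
}

func main() {
	// The loop repeats forever, adjusting to whatever the observations show.
	for range time.Tick(30 * time.Second) {
		results := observe()        // observe
		actions := decide(results)  // orient + decide
		for _, a := range actions { // act
			log.Println("triggering:", a)
		}
	}
}
```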
In terms of the implementation, I'm not going to go through this in detail because we don't have enough time, but I just want to point out that we are basically collecting data from the infrastructure, and we have different types of data. Our main state-checking data comes from Consul, an open-source service discovery and health management platform. Consul allows us to distribute checks throughout all of the nodes in the infrastructure, we can customize the checks depending on each component, and all of the results are centralized in a Consul cluster. We're also collecting log messages and metrics from Cloud Foundry. We are using the stock logging and metrics mechanisms built into Cloud Foundry, and we're now building in Loggregator integration. We also developed a custom BOSH monitoring plugin that allows us to check the state of the BOSH components and post it directly to Consul.

All of this is fed into a message bus, and then we're using Riemann, an open-source event processing engine, to fetch these events and apply custom rules to them. We are building a set of rules that range from very generic aggregation and summarization rules for the overall state of a certain component or set of components, and over time we are starting to add more specific rules for specific services, specific customers, or specific purposes, for example for the computation of service levels and so on. From Riemann, we can feed data into some of our other systems. We have been using Splunk at Swisscom for a long time already, so we're feeding some data into Splunk for archival and notification. We are feeding data into an InfluxDB database for longer-term storage, and we're feeding the dashboard from there. And then we are closing the loop, and you can roughly map this to the OODA loop that I showed you earlier, by trying to trigger some automatic reactions to the failures. This is still a very early phase. We have some automatic reactions built in, but this is still very much work in progress.
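As a rough illustration of how a state check like the one from the BOSH monitoring plugin could be pushed into a local Consul agent, here is a minimal Go sketch that uses Consul's HTTP API to register a TTL check and then mark it as passing or failing. The check name, the probe, and the agent address are assumptions made for the example; this is not the actual plugin code.

```go
package main

import (
	"bytes"
	"fmt"
	"log"
	"net/http"
)

const consul = "http://127.0.0.1:8500" // assumed address of the local Consul agent

// put sends a PUT request to the local Consul agent's HTTP API.
func put(path string, body []byte) error {
	req, err := http.NewRequest(http.MethodPut, consul+path, bytes.NewReader(body))
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("consul returned %s for %s", resp.Status, path)
	}
	return nil
}

func main() {
	// Register a TTL check: if no update arrives within the TTL, Consul marks it critical.
	check := []byte(`{"ID": "bosh-director", "Name": "BOSH director reachable", "TTL": "60s"}`)
	if err := put("/v1/agent/check/register", check); err != nil {
		log.Fatal(err)
	}

	// After each probe, report the result for the check.
	directorHealthy := true // the outcome of the real probe would go here
	endpoint := "/v1/agent/check/fail/bosh-director"
	if directorHealthy {
		endpoint = "/v1/agent/check/pass/bosh-director"
	}
	if err := put(endpoint, nil); err != nil {
		log.Fatal(err)
	}
}
```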
So, I'm going to hand the voice over to Bill now, so he can tell us about Chaos Heidi.

Hello. So, as my team can attest, standing there silent for nine or ten minutes with all of you people in front of me was very difficult. So, what is Chaos Heidi? Chaos Heidi is a resiliency testing framework, kind of like Chaos Monkey, Chaos Loris, Chaos Lemur, Chaos Gorilla. These are all existing products; Chaos Lemur and Chaos Loris are actually BOSH- and Cloud Foundry-based. But this is a little different, because whereas those solutions tend to be targeted at a specific type of attack, like killing VMs, or at a specific platform, like messing with Cloud Foundry, this is a generic attack framework written in Go.

As I said before, I'm from Stark & Wayne, and normally at Stark & Wayne we create stuff. You know, projects like SHIELD, which some of my colleagues spoke about earlier; they're used for disaster recovery, protecting the data, and making sure that we can come in and help you fix things. But we've had a relationship with Swisscom for quite a while, and they said, you guys are pretty good at making stuff that keeps things running. Can you break stuff for us? We can do that, too.

So, why do we want to break stuff? Well, you know, people will say, well, Cloud Foundry is self-healing, or BOSH can run the Resurrector, or this technology does this, that technology does that. Well, it turns out that the modern cloud stack is a really complex ecosystem, with both open-source and vendor-driven components that are glued together by locally developed bespoke solutions. And I use this term, semper vigilans, and I always thought it meant always vigilant. It turns out, when I was putting the slide together, that it means always watching, but I left it here anyway. So you can think of this as really aggressively watching your stack, right? Because you're not just watching it, you're poking at it, like a little kid who can't look at anything without touching it, and that's what Chaos Heidi is, and that's where the name comes from.

So what might we want to do with a resiliency testing framework? Well, things like stack validation, or running random daily attack scenarios like you would do with Netflix's Chaos Monkey. What we use it for now, because we're still in the active development phase, is testing of Orchard components. Orchard has healing components, and it's a lot easier to test those healing components if we can keep taking things down. It can see that something's down and then try to fix it.

So how does it work? It's pretty simple. You've got a controller, which handles scheduling, orchestration, and monitoring. It holds all of the attack scripts, which are just attacks that can happen on an environment, and then you have an agent. An agent is a small process that sits anywhere that has both the technologies and the access to the victim. So you can think of BOSH as a victim, and your bastion, your BOSH jump host, would be where you might put your Heidi agent. In practice, the way our architecture is set up, we would put the Heidi agent anywhere we put a Consul agent, because it's convenient that way, and then the Puppet automation can just set that up for us. An attack runs on an agent, and it is a combination of a set of configuration and a script of some sort, and that way we can have multiple attacks that do the same thing, but with different parameters and different targets. Let's say we want to kill a BOSH job. We might have one attack that kills a runner and one that kills HAProxy, and we can run them randomly. It tracks holidays so that you don't kill stuff in production while your ops team is off. It's interesting. As I said, some example attacks: kill an OpenStack VM, kill a BOSH job. You can really do anything. Like I said, it's not specifically tied to any platform or target technology.
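As a rough sketch of the "attack is a combination of configuration and a script" model just described, here is what an attack definition and its execution on an agent could look like in Go. The struct, its field names, the script path, and the job name are hypothetical illustrations, not Heidi's real API.

```go
package main

import (
	"context"
	"log"
	"os/exec"
	"time"
)

// Attack pairs a script with the parameters and target it should be run against.
type Attack struct {
	Name    string            // e.g. "kill-bosh-job-haproxy"
	Script  string            // attack script shipped alongside the agent
	Params  map[string]string // target-specific parameters (deployment, job, ...)
	Timeout time.Duration
}

// run executes the attack locally and only reports what happened; the agent
// never heals anything, it just fires the script and returns the outcome.
func (a Attack) run() error {
	ctx, cancel := context.WithTimeout(context.Background(), a.Timeout)
	defer cancel()

	args := make([]string, 0, len(a.Params))
	for k, v := range a.Params {
		args = append(args, k+"="+v)
	}
	out, err := exec.CommandContext(ctx, a.Script, args...).CombinedOutput()
	log.Printf("attack %s finished: err=%v output=%s", a.Name, err, out)
	return err
}

func main() {
	kill := Attack{
		Name:    "kill-bosh-job-haproxy",
		Script:  "/var/vcap/heidi/attacks/kill-bosh-job.sh", // hypothetical path
		Params:  map[string]string{"job": "haproxy"},
		Timeout: 2 * time.Minute,
	}
	if err := kill.run(); err != nil {
		log.Println("attack reported failure:", err)
	}
}
```

With this shape, two attacks that "do the same thing with different parameters" are just two configurations pointing at the same script.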
So how do we create useful attacks? Number one here is probably where we get most of our attacks from: what is broken right now? We talk to the ops team and we say, you know, what's going on here? We talk to the Cloud Foundry services team and they tell us what is broken right now. So what we can do is create an attack for them, and then, when they fix the problem, we can implement that fix as a healing action in Orchard, and now we have a regression test for what that problem might have been.

What has broken in the past? Well, if we ever run out of things that are currently on fire, we can start looking into the past and say, okay, what used to be broken? And that, again, is another layer of regression. And then, what will break in the future? This isn't really an academic project, it's a practical project. It might get to the point where we have to come up with clever recipes based on patterns that we've seen, but I don't see "what will break in the future" being a part of development for a while, if anytime soon. I don't know, we'll see.

So, something that broke recently. Oh, wait a minute, sorry. I like getting paid. So we're going to blame this one on the virtual network. There's a problem we're dealing with right now where the network goes down and BOSH will lose connectivity with one of its jobs. The Resurrector will kick off, but it'll fail. Then we end up with this ghosted job, and somebody down the line might decide that they want to bring that job back up again because they need it. And now you end up with a new version of it and a ghosted version of it. This is a problem we're experiencing in production right now, so we've been trying to emulate it. And it turns out it's a really fascinating process, trying to figure out how to break something. You'd think, oh, I can do this, breaking things is easy. But we run into this problem of simulation versus emulation. We can simulate a failure, or we can actually cause that failure, but sometimes causing that failure isn't very practical. Actually taking a whole segment of the network down isn't necessarily as practical as monkeying with your hosts.deny or monkeying with iptables just to make a single node think that there's a problem. And this turns out to be a really interesting and difficult problem to solve.

So here's an example of what we were just talking about. We can kill a BOSH VM. This is an attack that we have: we say, Heidi, kill a job in this BOSH. And so we take out this HAProxy. Well, of course, great. We just tested that the Resurrector works. Excellent job. That's not very helpful. So what do we have to do? We have to kill the Resurrector. Well, we kill the BOSH job and the Resurrector, but this isn't really the problem. The problem started because of a network outage, and if you run a cck, it's going to fix the problem anyway. So now we simulate an outage between the BOSH job VM and the director. And now we've recreated the problem: when someone tries to resurrect that VM, we end up with a ghosted VM. Great. So we figured out how to simulate the error state. But what did we do? Well, we put a deny rule in iptables, right? And we end up with a situation where, how are we training Orchard to heal that? Because that's not what's going to happen in the real world, right? Unless someone maybe pushed a configuration change that they shouldn't have. More likely it's going to be related to a scaling problem or a problem with network traffic somewhere. So how does Orchard fix this? And now I'm confused, because they told me I just had to break stuff, right? But now we come around to the problem of how we train Orchard to properly fix things. And this is where we're at right now. We've got a suite of attacks that can do all kinds of really interesting things. We've got a way to orchestrate it. We've got a way to keep track of it. And now we've got to figure out how to make those attacks useful so that we can train Orchard to do what it needs to do.
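As a rough sketch of the simulation approach just described, here is what an attack that blocks traffic between a BOSH job VM and the director with iptables might look like in Go. It assumes the agent runs on the affected VM with root privileges; the director address, the wait time, and the cleanup behaviour are placeholders, not the actual attack used in production.

```go
package main

import (
	"log"
	"os/exec"
	"time"
)

// iptables runs a single iptables command and aborts the attack if it fails.
func iptables(args ...string) {
	out, err := exec.Command("iptables", args...).CombinedOutput()
	if err != nil {
		log.Fatalf("iptables %v failed: %v (%s)", args, err, out)
	}
}

func main() {
	director := "10.0.0.6" // placeholder for the BOSH director's address

	// Drop traffic to and from the director so this VM looks unreachable to it.
	iptables("-I", "OUTPUT", "-d", director, "-j", "DROP")
	iptables("-I", "INPUT", "-s", director, "-j", "DROP")
	log.Println("simulated outage towards the director; waiting...")

	// Wait long enough for the Resurrector (and Orchard) to notice and react.
	time.Sleep(5 * time.Minute)

	// Clean up so the attack leaves the VM the way it found it.
	iptables("-D", "OUTPUT", "-d", director, "-j", "DROP")
	iptables("-D", "INPUT", "-s", director, "-j", "DROP")
	log.Println("connectivity to the director restored")
}
```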
It's currently an internal project at Swisscom. I've been asked to gauge interest and see whether other people would be interested in seeing this and in open sourcing it. I don't have the go-ahead to do that at this point, but if there is enough interest, then we can probably make the case. And I think it's a really cool project that people would like to play with. So if you do have any questions about that, come talk to us so that we know. And Diego, do you have anything else to say?

Just to repeat what Bill said, come talk to us if you're interested. This is still very much work in progress. This is where we are right now, but there's still a lot to do. And if nothing else, we have Application Cloud stickers, so if you want one, come talk to us. Thank you. Thanks. And we got you out of here quickly.

Yeah, I was wondering, for the agents, what is that? Did you write that yourself?

Okay, so the agents: there's a small Go process that can run as either a controller or an agent node. They communicate via RabbitMQ, and I actually apologize, I went way too fast through the architecture slide. But the agent node has all of the configuration for the individual attacks, because we don't want to pass any credentials or anything over the wire. So an agent knows everything about what it needs to do, and all it does is run that script and then tell the controller what happened. It has no healing component. Its job is to fire off that kill command and that's it. It finds out what happened, tells the controller what happened, and waits for Orchard to try to fix the problem. No, no, this is an internally developed project at Swisscom, yes. Do you have any suggestions? Have you seen this paradigm before? Is there a different way you would do it? In our case, direct access is not allowed. I mean, we could probably use Puppet MCollective, we could probably use Puppet, but we can't just SSH in. In this environment, if we can't communicate via an API or some other way, we have to have an agent, and the agent communicates via the queue, but there's no direct access.

Yeah, the... Well, no, not right now, we're not. But it would be neat if we could be. In the lab environments, it's good for stack validation and testing new features and things like that, but the end goal would be to maybe run it in production, and that takes a much larger buy-in, I think.

Well, Netflix's Chaos Monkey and Chaos Gorilla and the rest of the Simian Army project do similar things to what Orchard as a whole does, but Chaos Monkey is really just dealing with killing VMs and Chaos Gorilla deals with network latency, I think, and that's such a small part of the attack ecosystem that we're trying to cover that it was either piggyback on top of those or see how far we could get on our own, because there were so many more attacks we needed to write.
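For illustration only, here is a small Go sketch of the kind of result message an agent might report back to the controller once an attack script has finished. The struct and field names are invented; in the real setup the payload would be published to the controller over RabbitMQ rather than printed.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// AttackReport is a hypothetical shape for what an agent sends back to the
// controller after running an attack: what ran, where, and how it ended.
type AttackReport struct {
	Attack     string    `json:"attack"`
	Agent      string    `json:"agent"`
	StartedAt  time.Time `json:"started_at"`
	FinishedAt time.Time `json:"finished_at"`
	ExitCode   int       `json:"exit_code"`
	Output     string    `json:"output"`
}

func main() {
	report := AttackReport{
		Attack:     "kill-bosh-job-haproxy",
		Agent:      "bosh-jumphost-01",
		StartedAt:  time.Now().Add(-30 * time.Second),
		FinishedAt: time.Now(),
		ExitCode:   0,
		Output:     "haproxy process killed",
	}
	// Print the JSON that would travel over the queue to the controller.
	body, _ := json.MarshalIndent(report, "", "  ")
	fmt.Println(string(body))
}
```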