Okay, welcome to our talk. Maybe you can sit down here, there are some free chairs left. Yeah, there's still some free space, come over. Especially everybody who is also running a public cloud is warmly welcome, because that is what we will talk about: how we operate a public cloud.

But maybe we start with a short introduction. My name is Jan, I'm working in the cloud team at SysEleven. SysEleven is a cloud provider; basically we offer infrastructure as a service and we do managed hosting, which is also where the company comes from. And what will we talk about in this talk? If you're running a public cloud, you have to face a few challenges. For example... well, I should also introduce myself first. I'm Stefan, I'm also working in the cloud team. We are a fairly small team of around 11 people running and developing our cloud offering. Some of us have a software development background, like myself, and some of us have an operations background, and that leads to some interesting solutions.

But let's start with the challenges. What happens if you're running a public cloud? Basically, you have your infrastructure set up: you have your hypervisors, you have your guest machines, you have your users starting their VMs. And then something nasty comes from the outside, something like Meltdown or any other influence from the outside. These threats are very severe for a public cloud. What happens if you don't act on them? The scenario is that one user starts a VM and an attacker inside it can read data from another customer. That is what is behind all of these threats.

So what do you usually do with threatened computers? You just apply a simple fix, right? You can just update and reboot. Basically, to handle such a threat you need to install a fresh piece of software, a new Linux kernel for example, or, as we had in the past, microcode updates are a nice example. And to make that take effect, you have to restart your computers. That's pretty easy: you shut the computer down, bring it back online, and that's it, usually. The important part is that for most of these updates to take effect, you need to reboot the machine. Without the reboot, the update doesn't help at all.

So what is the customer's point of view on this? Because our customers run critical businesses on top of our offerings. From a customer perspective, the world is also very simple: they want their VMs up all the time, and they want a secure environment. That's all they want. So it's a very simple world up to now, but it gets more and more complicated once you look at the details.

Some of you operate an OpenStack cloud, and that in itself is a complex system. And even if it weren't complex, the scale alone makes it a challenge. We run a medium-sized environment, I would say, but to react to the vulnerabilities of the past year, we had to do 2,500 reboots in that not-so-large environment. This sounds very painful at first, so why are we doing it? Basically, because we want our users to be on the safe side, we have to reboot; there is no other choice. There's a critical bug out there, we have to install updates, and we have to reboot the whole cloud multiple times a year. And that is what we do.
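To make the "without a reboot the update doesn't help" point a bit more concrete, here is a minimal sketch, not from the talk, of how a node could notice that a newer kernel is installed than the one it is running. The path layout and version comparison are assumptions; microcode or library updates would need their own checks.

```python
import glob
import os
import re


def _version_key(release):
    # Crude numeric sort key for strings like "6.4.0-150600.23.25-default".
    return [int(n) for n in re.findall(r"\d+", release)]


def kernel_reboot_needed():
    """Return True if a newer kernel is installed under /boot than the one running.

    Only covers the kernel case; microcode and library updates would need
    their own checks, e.g. a marker file written by the update tooling.
    """
    running = os.uname().release
    installed = [os.path.basename(p)[len("vmlinuz-"):]
                 for p in glob.glob("/boot/vmlinuz-*")]
    if not installed:
        return False
    return max(installed, key=_version_key) != running


if __name__ == "__main__":
    print("reboot needed" if kernel_reboot_needed() else "running the newest kernel")
```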
The other possibility would be to do it like others did in the past: never touch the system again. Then the requirement to be available would be met, but not the other requirement, to be on the safe side. That is why we have to go the painful way and reboot the cloud. But doing it often also has a big advantage: if you have to make changes, it's better to make a small set of changes, because if you break something you can see more easily what the breaking change was and react more quickly. This is one of our underlying principles: if we have to do it, we want to do it every week.

So we were at the point where we saw that it makes sense to reboot often, and then we started to reboot the cloud by hand. Some of you might remember how we started the project: we just sat there and rebooted machines. That was the first approach. Sitting in front of an SSH terminal and typing reboot was the beginning of the story. But we very quickly stopped doing that, because it was manual work, and that annoys humans. That is one reason.

But there's another drawback, let's be honest. Running an OpenStack cloud is really complicated. There are lots of distributed systems, many of them are stateful, and they are important: you need them to keep the APIs working, and you need them to keep the VMs working. For example, our virtual machines run on a Quobyte distributed storage, and every file is replicated three times in our data center. If you accidentally reboot two compute nodes at once, that might end up in a problem for the customer, and the customer's VM will just stop working. To make the problem really clear: from a customer perspective the cloud is down if more than one data node is lost while another one is in maintenance. That is a huge problem, that is what we have to avoid in production, and that is what we are going to talk about: how we did that.

So, how do you do that? We have to reboot so often, how do we do it? Any ideas? Anybody have an idea how to reboot 2,500 times without making any mistake? ClusterSSH, thanks Harald for the hint, I think we didn't try that. Frontdrops, also a very good example. A maintenance window, users usually love that. Or do they? I'm not so sure, let's see. We could also exchange the users if they are complicated, if they demand something; we could find better customers. Another solution.

Our solution is: we have to do it, we don't want to do it by hand, so we have to automate it. That's the thing. To automate this, we need a different view onto our cluster. What every one of us already has is the should-be state. We do configuration management, or we write YAML files in Kubernetes, to express what the should-be state of our cloud is: we want, for example, three Galera nodes, we want so many compute nodes, and so on. This is the should-be state; everyone has that. What you need in order to automate a reboot is also to gather the actual state. That's where we use Consul. Consul is a great product from HashiCorp. It is very powerful, but at the same time very simple. Let's go to the screenshot. This is a screenshot of our Consul web interface. As you can see, there is a service registry: you can register services in Consul.
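To make the service-registry idea a bit more concrete, here is a small sketch of how a service plus a health check could be registered with the local Consul agent over its HTTP API. The endpoint and payload shape are Consul's standard agent API; the service name, port and check script are illustrative assumptions, not necessarily what the setup described in the talk uses.

```python
import requests

CONSUL = "http://127.0.0.1:8500"  # the local Consul agent

# Register a service together with a health check. Consul's agent API takes
# a JSON payload like this on PUT /v1/agent/service/register. Script checks
# also require enable_script_checks to be set on the agent.
service = {
    "Name": "galera",
    "Port": 3306,
    "Check": {
        "Args": ["/usr/local/bin/check_galera.py"],  # hypothetical check script
        "Interval": "30s",
        "Timeout": "10s",
    },
}
requests.put(f"{CONSUL}/v1/agent/service/register", json=service).raise_for_status()

# The cluster-wide view the reboot decision relies on: which instances of the
# service are currently passing all of their checks?
healthy = requests.get(
    f"{CONSUL}/v1/health/service/galera", params={"passing": "true"}
).json()
print(f"{len(healthy)} healthy galera members")
```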
As you can already see, in a real installation something is down most of the time, maybe not always, but most of the time, because you need to do some maintenance, or you have a hardware failure, or anything else. This is what Consul can give you: you can register services, you can register checks. The good thing is that with a very simple API you can write code that has a complete cluster overview. Consul also provides a key-value store, where you can keep data or configuration or anything else, and it provides primitives to implement distributed locking and semaphores. This was a really useful tool for us when automating the reboots.

What we want to achieve is that no human is involved in rebooting the cloud anymore. Consul is the core component that makes that possible for us, because it replaces the person who sits in front of the monitoring and checks whether everything is green. That is automated with Consul. It's the backbone of our procedure, I would say. It doesn't look that important, but it is, because with Consul we have the actual cluster state, and with that we can make automated decisions about whether it is safe to reboot or not. That is what we use it for.

So now we can make the decision to reboot, but who actually makes this decision if we don't involve any humans? Our reboot manager works in a few simple steps, and it's basically a cron job. Every node decides on its own; there is no central place that decides, okay, you have to reboot now. Every node will check at a regular interval: hey, is it necessary to reboot? Did I install updates previously that need a reboot, is there a new kernel installed, and so on? If yes, it will go to the next step and try to get a lock in Consul. If that is successful, it will ask the Consul service registry whether all cluster services are okay right now. If we have three Galera nodes, are all three really green? Because only then can the cluster lose a member. The same goes for the storage system, for example. So when the reboot manager runs on a compute node, it will connect to the local Consul agent, ask which services are running on this machine right now, and check whether all services with those names are green across the whole cluster and it is okay to reboot.

It also has the possibility to run actions before or after rebooting. One example is to live-migrate the VMs away from a compute node, or to run any other pre- or post-task. So basically, rebooting is a predefined routine, and it happens the same way every time; that is what happens in the background here. And we try to keep it really simple and stupid, so it doesn't have a plug-in system or anything like that. On every machine you just have a directory with pre-boot tasks and one with post-boot tasks. Those are executables, they are run in alphabetical order, and if one of them fails, the machine will not reboot.

A quick summary: it is distributed, it runs on every node in the cluster, and it is very simple, 300 lines of code or something like that. The information backbone, as we already said, is Consul. Every machine in the cluster is aware of the whole cluster state, and that makes it very powerful. It is very small, very primitive, but it is powerful enough to reboot a cloud. This is from yesterday: our whole engineering team is at this conference, and still we had some servers rebooting yesterday.
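The reboot manager itself is its own small codebase, so the following is only a condensed sketch of the decision flow just described, one cron-triggered run on one node. The Consul HTTP endpoints used here (sessions, KV acquire/release, agent services, health checks) are Consul's documented API, but the lock key, the paths, the reboot-needed marker and the reboot command are assumptions for illustration, not the actual implementation.

```python
#!/usr/bin/env python3
"""Condensed sketch of one cron-triggered run of the decision flow above."""
import os
import subprocess
import sys

import requests

CONSUL = "http://127.0.0.1:8500"
LOCK_KEY = "reboot-manager/lock"                    # illustrative KV key
PRE_BOOT_DIR = "/etc/reboot-manager/pre-boot.d"     # illustrative path


def reboot_needed():
    # Placeholder: e.g. compare the running kernel to the newest installed one,
    # or look for a marker file left behind by the update run.
    return os.path.exists("/run/reboot-needed")


def acquire_lock():
    # Consul-style locking: create a session, then try to acquire the key with it.
    session = requests.put(
        f"{CONSUL}/v1/session/create",
        json={"Name": "reboot-manager", "TTL": "1h"},
    ).json()["ID"]
    got_it = requests.put(
        f"{CONSUL}/v1/kv/{LOCK_KEY}", params={"acquire": session}
    ).json()
    return session if got_it else None


def local_services():
    # Which services does the local Consul agent know about on this node?
    services = requests.get(f"{CONSUL}/v1/agent/services").json()
    return {s["Service"] for s in services.values()}


def cluster_is_green(service_names):
    # Every instance of every service running on this node must be passing
    # its checks cluster-wide before this node may be taken down.
    for name in service_names:
        checks = requests.get(f"{CONSUL}/v1/health/checks/{name}").json()
        if not checks or any(c["Status"] != "passing" for c in checks):
            return False
    return True


def run_tasks(directory):
    # Pre-boot tasks: executables run in alphabetical order; any failure
    # raises an exception and thereby aborts the reboot.
    for task in sorted(os.listdir(directory)):
        subprocess.run([os.path.join(directory, task)], check=True)


def main():
    if not reboot_needed():
        return
    session = acquire_lock()
    if session is None:
        return  # another node is rebooting right now
    if not cluster_is_green(local_services()):
        # Give the lock back; some cluster cannot afford to lose a member.
        requests.put(f"{CONSUL}/v1/kv/{LOCK_KEY}", params={"release": session})
        return
    run_tasks(PRE_BOOT_DIR)
    subprocess.run(["systemctl", "reboot"], check=True)
    # Post-boot tasks and releasing the lock happen after the node is back up
    # (omitted in this sketch).


if __name__ == "__main__":
    sys.exit(main())
```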
This is our production system. So while we attend some interesting talks at an interesting conference, we have enough trust in this simple mechanism to let it reboot the production system, and that is what is cool from our point of view.

And the best thing: it is now open source. Today we released it on GitHub, so if you want to use it, feel free to go to this page and have a look. We published it just a few minutes ago, and we have some unit tests. Thank you. Let's go to the next slide. This is only possible because of our awesome team, thank you everyone. Most of the code was actually written by Dennis, on the top left, and over the last few days Alexandre and I implemented some unit tests, or rather integration tests. Thank you. Come to our booth, we have a booth here and a lounge downstairs, and feel free to talk to us. Our whole engineering team is here, you can find us easily in these jackets with the SysEleven logo, and we are really happy to talk to you.

Do you have any questions? We have some minutes left. I think we have a microphone for questions, do we? Okay, do you have any questions? Please raise your hand. If everything is clear, thanks for your time, and we can talk at our booth. There's a question.

Hello, I have a question about monitoring the status of cluster databases. Are there any sophisticated mechanisms, or do you just check whether every node is up? Because that is not sufficient in most cases.

Consul has multiple ways of implementing checks. Built in, it has the ability to just check whether, for example, a port is reachable, but as you say, this is not enough. That's why you can also add script checks, which run a script. In our case we monitor Galera with a Python script that we wrote, which checks everything: is every Galera node really a master and in the primary state, is the replication working okay, and so on. We really check whether it is really working, whether it can really lose a member. For all important services, our checks are only green if the cluster can lose a member. And you have to keep one other Consul feature in mind, the locking mechanism: only one node can grab the lock, and with that we can really ensure that within certain groups only one node goes down at a time. The Galera cluster is the example where it's very clear that you can only lose one node, but it's also important for our storage nodes, so that we don't lose any data or slow down the cluster or anything like that.

We actually already have a feature idea. Can I show you the website? I think so, because we also published our design document on GitHub; it's syseleven/reboot-manager. You can put it in the browser, okay, so this is the live demo. We also have our design document on GitHub, and we were thinking about a feature we haven't implemented yet, where we can say, for example, this service needs at least this number of passing checks. If you have a stateless API, maybe it's okay to lose three API backends, but some other services can maybe only lose one or two members. We didn't implement that yet, but feel free to ask for that feature in the issues, and feel free to collaborate with us; we would really love to see your feedback on this.

Any other questions left? Does anyone of you face the same challenges? No, Harald, you're not doing that? Can you hear me?
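For context, a Galera script check in the spirit of the one described in the answer above could look roughly like this. It is a sketch under assumptions: the exact conditions, the credentials handling and the three-member threshold are illustrative; only the wsrep status variables and Consul's exit-code convention for script checks (0 passing, 1 warning, anything else critical) are standard.

```python
#!/usr/bin/env python3
"""Sketch of a Galera script check as Consul could run it."""
import subprocess
import sys


def wsrep_status():
    # Read the Galera status variables via the mysql client
    # (credentials are assumed to come from a client config, e.g. ~/.my.cnf).
    out = subprocess.run(
        ["mysql", "--batch", "--skip-column-names",
         "-e", "SHOW GLOBAL STATUS LIKE 'wsrep_%'"],
        capture_output=True, text=True, check=True,
    ).stdout
    return dict(line.split("\t", 1) for line in out.splitlines() if "\t" in line)


def main():
    status = wsrep_status()
    problems = []
    if status.get("wsrep_cluster_status") != "Primary":
        problems.append("node is not in the primary component")
    if status.get("wsrep_local_state_comment") != "Synced":
        problems.append("node is not synced")
    # Only report green if the cluster could lose this member and still keep
    # quorum; with a three-node cluster that means all three must be present.
    if int(status.get("wsrep_cluster_size", 0)) < 3:
        problems.append("cluster too small to lose a member")
    if problems:
        print("; ".join(problems))
        return 2   # critical in Consul's script-check convention
    print("galera healthy, cluster can lose a member")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```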
I just wanted to ask: when there's a flaw like Meltdown, have you used the reboot manager for that already, and how long does it take you to reboot the whole production environment, since you only have a single lock?

That's an interesting question. If everything goes right, all our nodes are rebooted after two weeks. And the thing is that we don't use the reboot manager for one special vulnerability or anything like that; we are not event-driven. It is something that runs all the time, and that is the major difference. We don't need to sit together as managers and do a risk analysis or anything like that; it's just a proven process, and it just runs.

For a regular compute or storage node, how long does one node take in the process, from "I'm getting the lock" until "I release the lock"? Is it like 20 minutes, or is there more involved?

It mainly depends on how many virtual machines are running on the compute node, because you have to migrate them, and we don't want to live-migrate every VM at once. We have pretty big compute nodes, 500 gigabytes of RAM or more, in the latest generation 700 gigabytes, so it can take quite a while to migrate all of these VMs. So maybe a maximum of two to three hours at the moment. But we are not yet at a scale where we have to do parallel reboots. That would be a feature we would add as soon as we need it, but right now our cluster size is not so large that we have to worry about it. The other thing is that we have two clusters operating in parallel, so we don't have any timing issues at the moment. The different regions are in different Consul clusters, and the clusters reboot independently of each other, so per Consul cluster there is one reboot at a time. If you have availability zones, you could implement that as well with separate Consul clusters.

Time is up now. If you have other questions, our booth is over there, and I think today we will also be at the SysEleven lounge downstairs. You can find us here, we wear these jackets, the whole team is here, you can talk to the engineers. So feel free, and thanks for your time. Thank you very much.