Okay, my name is Stefan Herrmann, I'm a software engineer at the Commonwealth Bank, and I'm here today to talk to you about cluster as code: how we provision our Mesos clusters.

To give you some background, the Commonwealth Bank is one of the largest financial institutions, and I work in the analytics and information area. Several years ago we started entering the big data space, and we stood up our own Hadoop cluster to do data analytics. That was quite successful, so we started providing more and more racks of servers to store and process all the data we're dealing with. The problem we ran into there is that we were still stuck in the era of treating servers as pets: the way we managed that hardware, and the way we managed large volumes of servers, didn't keep up. The other problem we encountered is that Hadoop is great at batch processing and analytics, but our developers also needed the ability to stand up microservices and do more flexible things, so they could come up with different ways to utilise and experiment with the data. So we had all these servers, but we couldn't really use them effectively, and we didn't get the utilisation we wanted out of them. So late last year we created a team to look at standing up a Mesos cluster and moving our Hadoop servers onto Mesos. As I mentioned, we're working with physical hardware; we can't just call an API and have machines appear.

So what did we want? We wanted to be safe and secure: we're a bank, so we need to protect our customers' data, and that data also needs to be available for them to use. We wanted it to be cheap: we had the problem that cost was growing linearly, and as we grow we need everything to scale. And we wanted to be agile, able to quickly adapt and respond to change.

What does that concretely mean? We want to be able to build and manage clusters predictably. As I've mentioned a couple of times, distributed systems are hard, so as much as possible, when we make a change to our infrastructure, that change has either been applied or it has failed; and ideally, if we make a change, it works on all servers, not just some of them. That leads us to immutable infrastructure. The next one is fairly standard by now: we want all our configuration in source control. We want the ability to know why we made changes and what the history of something is, to be able to track it, and to use things like GitHub pull requests for approvals, moving the approval flow as close to the change as possible. We want the ability to actually test our changes, so we can be confident that a change actually works before we roll it out; that also shortens our development cycle, because rather than discovering something is broken after you roll it out, you detect it as early as possible. And we want an abstraction to reason about clusters: if you make a change to a physical machine, and that physical machine is part of a larger distributed system, that change has an impact on the whole system.
Say you have a ZooKeeper cluster, for example: remove one of the ZooKeeper nodes and you could break the quorum. Or if one machine in that ZooKeeper cluster is broken and you replace it, that can have an impact on all the agents that talk to that ZooKeeper. So you want to be able to reason at the cluster level, with abstractions that help us deal with that and encapsulate it, rather than thinking only at the individual machine level. And since we're standing up a system to run containers, we don't want to run separate infrastructure to manage that system: as much as possible, containerize everything, so it all runs on the same infrastructure.

So what did we plan to do? This is the high-level approach. We create role-specific images and bake as much of the configuration into the image as possible: which versions of which packages we're installing, even the tuning, goes into the image. Then we deploy that image to each machine, different images to different machines based on the role they have. Where we do have machine-specific configuration, such as IP addresses or which cluster a machine should join, we use cloud-config, from the cloud-init framework, to provide that information (there's a small sketch of what that looks like just after this). I'm not sure if you're familiar with it; it's the same mechanism Amazon uses to provide information to machines. If we need to make a change, we go back to our configuration in source control, make the change, test it, and redeploy. That's the high-level approach we're taking.
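To make that concrete, here is a minimal sketch of what such a per-machine cloud-config might look like. The keys are standard cloud-init ones, but the hostnames, file paths and layout here are purely illustrative, not our actual files:

```yaml
#cloud-config
# Per-machine data injected at provision time; everything else is already
# baked into the role image. Names and paths are illustrative.
hostname: able
write_files:
  - path: /etc/cluster/zookeeper.env
    content: |
      ZK_MYID=1
      ZK_SERVERS=able:2888:3888,carton:2888:3888,fly:2888:3888
runcmd:
  - [systemctl, enable, --now, zookeeper]
```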
At that point you might ask yourself: there's already a whole bunch of configuration tools. There's Puppet, Chef, Ansible, Salt; they were the hotness several years ago, and there's a stack of them. So why don't we just use one of those? The reason we want to avoid them is that they're non-deterministic. With those tools you tend to push a change to your machines in parallel, and if you have enough machines, that change will fail at some stage, somewhere, and now one machine is slightly different. If you roll that change out over a long enough period of time, like a package install, there's a possibility that you install package version X today and the updated version tomorrow; now your machines have slightly different packages. The other problem: you might have started with some base image, and over the years you've rolled out continuous changes, and now you have a cluster with your machines in a certain state. Now you roll in a new rack of hardware. How do you bring that rack up to speed? Do you use a different base image? Do you use the original base image and roll out all the same changes? What if a package changed version during that time? You end up with a cluster that's slightly different. You might have had ops people logging into machines to make urgent fixes that never got cleaned up. So you have a cluster where all your machines are potentially slightly different, and it becomes much harder to reason about what the cluster actually looks like. Why did that failure occur? Is that machine even meant to be in that state, or did it just end up that way? We wanted to avoid all of that.

By baking everything into the image, you avoid it; that's how we get deterministic changes. So, to give you a quick overview of the stack we're looking at: it's Ubuntu on physical hardware, Mesos, Marathon for microservices, and we use Calico for networking. Calico allows us to give each of our workloads its own IP, and it does that without any sort of tunnelling or VXLAN or anything like that. It uses plain layer 3 routing: it turns each of the physical servers into a router and then uses BGP to propagate the routes. That makes it easy from our perspective, because we don't have to know different technologies; it's the same technology that makes up standard, boring data centre networks. It's easy to understand, easier to debug, and it just reduces the complexity, so it's quite handy for us, and I think it's a really nice solution in this space. We use Docker as the containerizer, Mesos-DNS for dynamic DNS, PowerDNS for static DNS, the Elastic Stack for logging plus agents for monitoring, HashiCorp Vault for secrets management, and OpenStack Ironic to actually deploy our physical machines. Can I just ask at this point: who's running a physical data centre, as opposed to using the cloud? Can you put your hand up if you're on physical hardware? Thank you.

So now I'm going to tell you how we did it, in two parts. The first part is how you build the OS image that we're going to deploy, and how you test it. The second part is, given that we have that OS image, how you deploy and orchestrate clusters.

First, the OS images. We've got two role-specific images. The master image runs all the master functions for us: it's got the ZooKeeper quorum, the Mesos masters, the Marathon master, and the etcd cluster that Calico needs; it's running Vault in HA; it's running Mesos-DNS and PowerDNS; and it has the Elastic and monitoring agents on it. The agent image is much simpler: it runs the Mesos agent, Calico for the agent networking, Docker, and again the monitoring tools.

At that point you might also ask yourself: if you keep re-provisioning machines, what about the workloads? If you're running big data software on this, you don't want to re-replicate the data every single time. We could, but from a networking perspective it would be pretty brutal to recreate all that data each time. So what we actually do is only re-provision the OS disks. For workloads that need to keep data, we assign disks to that workload where it can persist its state, and during re-provisioning cycles those disks don't get touched, unless we decide to repurpose the node.

This is the high-level workflow. Our OS images live in source control, and we actually use Docker as a configuration-language abstraction to define and express our OS image; I'll show you what I mean by that in a second. So our OS images are defined as Dockerfiles. The first step we do is a docker build, which gives us a very quick, easy way to verify that the instructions we have make sense; think of it as a quasi-compilation step. And then we can do some very simple testing to verify that the build came out correctly.
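A sketch of what that quasi-compilation step plus smoke test can look like; the image, package and file names are illustrative, not our actual scripts:

```bash
#!/usr/bin/env bash
# Build the role image, then run cheap checks against the result.
set -euo pipefail

docker build -t mesos-agent:candidate images/agent/

# Did the pinned packages make it into the image?
docker run --rm mesos-agent:candidate dpkg-query -W mesos docker-engine

# Did the configuration scripts produce the files we expect?
docker run --rm mesos-agent:candidate test -f /etc/mesos/zk
```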
So we test things like: are the right packages installed, did the configuration scripts run and produce the correct artifacts. But our Docker image isn't bootable, so we use an OpenStack tool called diskimage-builder to convert the Docker image into something that can be booted. Once we've done that, we can do more advanced testing: our cluster verification testing and system integration testing, to verify that the image we created spins up and produces the kind of cluster we want it to produce. Once all that passes, we publish the image to our artifact store.

So this is how we define a Mesos agent, and you can see it just looks like any standard Docker repository we might be used to: we have a Dockerfile, and then a bunch of files that we copy across into it. There's probably a lot more happening here than you'd see in a typical Docker repository, because we're defining an entire OS image, not a single file for a single service. This is our Dockerfile for the agent. We inherit from some base image; we do some setup steps for the package installs, so we add some repositories, some keys, things like that; we install packages, in particular Mesos, pinned to a version, and Docker, and we pull in some other binaries we need; and then we enable our systemd services. That's actually everything needed to define an operating system image, and for us it's a really nice way to express one. Dockerfiles are a really simple, easy-to-understand language; most software developers have come across Docker, even if they've never built an operating system image, so we find it easy to understand, easy to onboard people, easy to make changes, easy to participate. And I find it easy to reason about, compared with some of the other languages you might use to define operating systems. This is the configuration for the operating system; there's not much more to it.
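To give a feel for it, here's roughly the shape of such an agent Dockerfile. This is a sketch, not our actual file; the base image, package versions and unit names are illustrative:

```dockerfile
# Role-specific OS image for a Mesos agent (illustrative).
FROM internal-registry/base-os:16.04

# Setup: add package repositories and their signing keys.
COPY apt/mesosphere.list /etc/apt/sources.list.d/
COPY apt/mesosphere.gpg  /etc/apt/trusted.gpg.d/

# Install packages, pinned to exact versions for determinism.
RUN apt-get update && apt-get install -y \
      mesos=1.2.0-2.0.1 \
      docker-engine=1.13.1-0~ubuntu-xenial \
    && rm -rf /var/lib/apt/lists/*

# Bake configuration into the image and enable the services to start on boot.
COPY systemd/ /etc/systemd/system/
RUN systemctl enable mesos-slave docker calico-node
```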
So then we build it. Once we've built it as a Docker image, we can convert it to something we can actually boot a physical machine with. This is an excerpt of a much larger script: we use diskimage-builder, and that converts the image to qcow2 format, which you can boot a machine from. It does things like adding the partitions, adding a bootloader, things like that.

Now I'm going to show you a video of a build. It's heavily edited and fast-forwarded, because the whole thing takes about 20 minutes. The first thing here is building the master Dockerfile, running the docker build step. Then we run some verification afterwards, to check very quickly that it works. And then we also do the physical image build so we can run our cluster integration tests. With the three images that we build, the base, the master and the agent, it takes about 20 minutes end to end. Here it's running the actual cluster verification tests; I'll show you what kind of tests we run in a second. It spins up actual instances of our agent and master nodes and makes sure the image produces the right outcome.

The kind of test we run straight after the docker build step checks that the packages we need are installed, have the right versions, and things like that. That's the first level: quick, fairly high-level checks, so you get a fast turnaround if anything is wrong. The next level of testing: once we've produced the bootable image, we use a custom tool called Symian. Symian uses docker-compose and KVM to spin up what is essentially a virtual representation of the physical cluster: it uses KVM to spin up three instances of our master image, plus some supporting services, and it uses VDE for the networking. That allows us to simulate a physical deployment. And then we test things like: if we spin up our three master images, and given that we've fed in the right configuration, we should have a ZooKeeper cluster; we should be able to write to it and read back what we wrote; and the ensemble should have exactly one leader. The master images should just boot and assemble into a ZooKeeper cluster without any manual interference, and that's exactly what we test.
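A sketch of that kind of cluster verification check. `zkCli.sh` ships with ZooKeeper, and `stat` is one of its four-letter-word commands; the hostnames are illustrative:

```bash
#!/usr/bin/env bash
# The three masters should have assembled into a single ZooKeeper ensemble.
set -euo pipefail

# Write through one master, read it back through another.
zkCli.sh -server carton:2181 create /smoke hello
zkCli.sh -server fly:2181 get /smoke          # expect "hello"

# Exactly one node should report itself as the leader.
leaders=0
for host in carton newyork fly; do
  mode=$(echo stat | nc "$host" 2181 | grep Mode)
  if [[ "$mode" == *leader* ]]; then leaders=$((leaders + 1)); fi
done
[[ "$leaders" -eq 1 ]]
```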
So that's how we build our OS images. The things that work well for us: the ability to test, in particular the ability to simulate physical cluster deployments without having to do physical deployments, which also tightens up our workflow. Using a Dockerfile as our abstraction for defining the OS image: it's easy to understand and easy to participate in, and I think it's a genuinely nice way to define an OS image. The fact that there's no mutation, as I mentioned before, makes it easy to reason about the state of our cluster; it gives us that deterministic property, and it gives us the confidence that if you want to know what's in our cluster and what our machines look like, you just go to source control and read the definition. And using pull requests for changes makes approvals easier, which is important for us, because we have to go through a change-management process, and keeping that process as close to the source code as possible makes it a lot better.

Now the pain points. As I mentioned, we do everything inside Docker, so we use Docker to convert Docker images into OS images, and that exercises the kernel quite heavily. It ran us into a lot of edge cases that you would not normally hit, and we encountered a lot of problems trying to make this work. One of them: in order to create an OS image you tend to use a loopback device, and when you use a loopback device from inside Docker, it works perfectly the first time, but it doesn't get cleaned up quite correctly afterwards; it's left just dirty enough not to work, but not dirty enough for the kernel to notice. So the next time you run, you ask the kernel for a loopback device and it hands you the same one, because it thinks it's fine and unused; you try to use it, and it just fails. Run it again and the kernel now knows that device is dirty, so you get a different one. We ended up in a situation where about half of our builds would fail just because of that issue. The way we worked around it: I'm not sure if you guys know runv, but Docker can use runv as a backend, which gives the container a virtual machine and a virtual kernel of its own. So now we've got the isolation we need: every time we do a build, it's actually running inside a VM, while we keep all the Docker abstractions and everything still looks like Docker. With that, the loopback issue went away. The other benefit we got out of it: we're only doing builds, we don't need to persist anything, and it's much better if everything happens in memory. So we configured the VM kernel with fsync disabled, and we got a whole bunch of speed-up by never syncing to disk.
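As a sketch of that workaround, assuming runv and its hypervisor plus guest kernel are installed (the flags follow Docker's runtime configuration; treat the paths as illustrative):

```bash
# Register runv as a Docker runtime and make it the default, so that
# builds and test runs execute inside a lightweight VM instead of runc.
dockerd --add-runtime runv=/usr/local/bin/runv --default-runtime=runv &

# Each container now gets its own guest kernel: a dirty loopback device
# can't leak from one build into the next, and the guest kernel can be
# configured so fsync is a no-op, keeping the whole build in memory.
docker run --rm ubuntu:16.04 uname -r   # reports the guest VM's kernel
```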
The other pain point is that even with that speed-up, the build cycle is still relatively slow. I think that's just standard for building OS images; you'd run into it with other tools too. As I mentioned, it takes about 20 minutes to run through all three images and all the tests, so that's building plus testing.

So now, given that we have an OS image, I'll talk to you about how we orchestrate clusters and how we actually use it. Here's another high-level overview. We've got two main repos: an infrastructure repo and a cluster definition repo. In the infrastructure repo we just keep the physical facts about our servers: what the MAC addresses are, where the machines are located, how they're patched; things that don't really change, that you record when you first install them. And then we've got our cluster definition repo, where in a declarative way we declare what our clusters should be. So we say: I want a cluster that looks like X; it has five masters, 20 agents, and here's some configuration for it. What happens if we make a change? It's declarative, so we declare: I now want a cluster that looks like Y, say three masters instead of five. What happens then is that a CI/CD process gets triggered, which figures out "to get from X to Y, these are all the changes I need to make", simulates those changes, and when you raise a PR it updates that PR with the changes it's going to make, so it tells you exactly what it will do to go from X to Y. If you approve and merge the PR, it will then go and execute those changes. And in order to execute those changes, it needs to talk to a bunch of other systems; again, we're doing distributed systems here, so if you make a change to a machine, you probably have to update a bunch of masters and let them know what's happening. Right now the two systems we've integrated are OpenStack Ironic, which we use to do the physical deployment of hardware, and Vault, which we use to manage the secrets; in future there's a whole bunch more we'll need to integrate. Our tool, Mission Planner, talks to those services and orchestrates them. What it does with Vault: if you're standing up a new cluster, it creates a namespace for that cluster and it creates some policies, a policy for the master nodes and a policy for any node in the cluster, and it enrolls all the physical machines so they can access those namespaces and get the secrets they need out of them, so the agents and masters can authenticate to each other. So that's the high-level overview.

This is the physical cluster definition. As you can see, it's per rack: we've got some information on the size and location, some information on the servers, the serial numbers as installed, MAC addresses, things like that. From that we can derive things like a picture of what the rack actually looks like, how it's stacked up, and a whole bunch of extra information as we need it. In the future we want to integrate that with our cluster definition, but we haven't gotten to that stage yet.

And this is our cluster definition. It's a YAML file, and we define the clusters we want: a "green" cluster, with some basic configuration about it, and then the master nodes. We've got a master node called able here, with some node-specific information: we tell it what IP address it should have and which interface to bind it to, what VLAN it should be on, things like that; some information that Ironic needs to provision it; and here, which image we want to use, which is a YAML alias that provides the information about where to get that image. And the same thing for the agents. So now we can derive a whole bunch of information from that. If we have five masters, we know the quorum size needs to be three. We can tell all the agents: these are the masters you need to talk to, to get your information from, to enroll with. We can tell our master services, ZooKeeper and so on: these are the nodes you need to form a quorum with. All that information gets automatically derived, so you don't have to duplicate it: you just specify the high-level construct and everything falls out of that. It's a cluster-level abstraction, a cluster-level way to reason about things, and it prevents us from making silly mistakes like defining a ZooKeeper cluster and setting the quorum incorrectly. The rules are encoded once; the tooling knows them, so you don't have to.
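A sketch of the shape of that file; the schema and field names here are illustrative rather than our exact format:

```yaml
images:
  master: &master_image        # YAML anchor, referenced below via aliases
    name: mesos-master
    version: "20170512"
  agent: &agent_image
    name: mesos-agent
    version: "20170512"

clusters:
  green:
    masters:
      - { name: able,    ip: 10.1.0.11, interface: eth0, vlan: 100, image: *master_image }
      - { name: carton,  ip: 10.1.0.12, interface: eth0, vlan: 100, image: *master_image }
      - { name: newyork, ip: 10.1.0.13, interface: eth0, vlan: 100, image: *master_image }
    agents:
      count: 20
      image: *agent_image
```

From a file like this the tooling can derive the rest: three masters means a ZooKeeper quorum of two (five masters would mean three), the agents get the list of masters to enroll with, and the per-machine cloud-config shown earlier is generated rather than written by hand.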
So now I'm going to show you simulating a change. What I'm going to do, for able, is change the image: I've got another master image, and I'm going to tell able to use that one. I commit that change, push it to the source control server, and then I raise the PR. You can see the PR, and in the background the CI/CD process runs and works out what the changes are. It says it will update the green cluster: on the left-hand side is the old cluster, on the right-hand side is the new cluster, and it tells us what's changing; here it shows the nodes we're changing, and that nothing else has changed. (Can you pause here, please?) Now it shows the changes it's going to run, as a simulation. The first step is to blow able away and re-provision it: it deletes all of able's existing access to the cluster and re-creates it. That forces us to keep our secrets short-lived: if for some reason able's credentials were compromised, it doesn't matter, because every time we re-provision something, we automatically issue new credentials. Then we re-enroll able, delete the server, push the image to Glance, which is the OpenStack image store, and rebuild it. And when we rebuild it, a whole bunch of cloud-config files get laid down that have been derived from the cluster definition; you might not have seen much of that information in the cluster YAML file, but it can all be derived automatically.

And this is our cluster deployment. (Can you play it again, please?) So now we're standing up the three-node cluster. Again it's edited, because it takes about 10 minutes to stand up each node. On the left-hand side we've got an IPMI console of one of the servers, the last one, called fly, and we're standing up the three servers: carton, newyork and fly. Now fly is building, so we see some activity: the first step is the Ironic agent, which downloads our OS image and writes it to disk, which is what's happening here right now; once that's done, the machine reboots into the new OS image. Now it has rebooted into the new image, so we should have a three-node cluster. We go to the Mesos master UI, and it's up and running: there's our three-node cluster, with fly and carton in it. So that was provisioning a cluster.
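Under the hood, the steps the executor ran for able boil down to something like the following. Mission Planner drives the APIs directly; this sketch uses the equivalent OpenStack CLI calls, with illustrative names, and the Vault step reduced to a comment:

```bash
# 1. Revoke able's old cluster credentials in Vault and issue fresh ones
#    (done through the Vault API against the cluster's namespace).

# 2. Push the new image to Glance, the OpenStack image store.
openstack image create --disk-format qcow2 --container-format bare \
    --file master-v2.qcow2 green-master-v2

# 3. Ask Ironic to redeploy the node so it boots the new image.
openstack baremetal node rebuild able
```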
The things that worked well for us: updating the pull request with the actual simulated changes gives us the confidence that the changes we want to make are the right changes. It lets us see the impact, and it lets us go back if something went wrong and debug: what went wrong, what was the wrong change? When you're doing things at the cluster level, the potential impact is very, very high, and having the system spell out all the changes, rather than working them out manually and getting it wrong, is really useful; and if something is broken, we go back to the definition, fix it, and we know it will be applied consistently. Having a cluster-level abstraction is really helpful as well: making a change to an image has a large impact, so you want to reason about it at that higher level, at the cluster level, and automatically derive the required changes. And using one tool to manage all our changes: there are plenty of tools that are more specialised towards things like storing infrastructure information, but we chose not to use them. We chose to do as much as we can inside source control, because first, we want to do everything as code, and second, we only have to worry about maintaining one tool, or two tools in this case: source control and a CI/CD pipeline. From our image building to our cluster provisioning to building software, everything is managed through a pull-request flow and a CI/CD flow. There are only two tools you need to run, only two tools people need to be familiar with, and so you can heavily invest in building tooling around them and making things easier. I think that's a really nice win, the standardisation and the familiarity you get out of it.

And the last one, particularly for this kind of conference: the tool, Mission Planner, that we use to orchestrate the clusters actually uses a concept from functional programming, the interpreter pattern. What that allows us to do is write domain-specific languages for each of its functions. We have a domain-specific language for doing Vault operations; we've got a completely independent domain-specific language for talking to Ironic; and you can easily write more domain-specific languages to talk to everything else. The nice thing is that they're completely independent, completely abstract, and for each language you have several interpreters: one interpreter actually executes the change; another interpreter simulates it, which is what we use both for testing and for annotating the PR. They're easy to write, and combining them all is straightforward as well. There's only a constant cost to integrating one more tool: you don't have to worry about the combined complexity of managing all these tools together, just the complexity of managing each tool individually, and then you compose them. You get a cluster-level language that is just the sum of all the individual languages, and a cluster-level driver that can make the change, which is the sum of all the individual drivers. That's a really powerful idea, and it makes it so much easier for us to reason about the complexity of doing this kind of thing. As I said, there's a lot of tooling to build and a lot of other tools to integrate with, so having this approach, where you manage each tool individually but can still compose all of them, makes it much easier.
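To show the idea rather than our actual code, here's a hand-rolled sketch of that interpreter pattern in Haskell, without a free-monad library; the operation names are made up for illustration:

```haskell
-- One small language per external system.
data VaultOp  = IssueCreds String | RevokeCreds String  deriving Show
data IronicOp = UploadImage String | RebuildNode String deriving Show

-- The cluster-level language is just the sum of the individual languages.
type ClusterOp = Either VaultOp IronicOp

-- One interpreter only simulates; this is what annotates the pull request.
simulate :: [ClusterOp] -> [String]
simulate = map (either (("vault: would "  ++) . show)
                       (("ironic: would " ++) . show))

-- Another interpreter executes for real; stubbed here with prints.
execute :: [ClusterOp] -> IO ()
execute = mapM_ (either vault ironic)
  where
    vault  op = putStrLn ("calling Vault API:  " ++ show op) -- real call here
    ironic op = putStrLn ("calling Ironic API: " ++ show op) -- real call here

main :: IO ()
main = do
  let plan = [ Left  (RevokeCreds "able")
             , Left  (IssueCreds  "able")
             , Right (UploadImage "master-v2.qcow2")
             , Right (RebuildNode "able") ]
  mapM_ putStrLn (simulate plan) -- posted back to the PR
  execute plan                   -- run once the PR is merged
```

Adding another system means adding one more small language and its two interpreters; the composition stays mechanical.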
The other big pain point is the deployment cycle. Right now, making a change takes about 10 minutes per machine. With Puppet or Chef you can roll your changes out quite quickly; some changes have no impact on the running system, so you could roll them out in seconds or minutes. Instead, we roll out every change by re-provisioning the machine, so it takes longer, although it's a deterministic amount of time. And it gets a bit worse: we have to make those changes while the system stays online. You can't just go and reboot all your machines and say "we'll take a 10-minute outage and everything will come back up fine"; you have to figure out a way to do it in an online fashion, and that's where we currently are in our development. You need to consider things like: we're running Hadoop, and Hadoop keeps three copies of the data, so you want to be able to take down one machine and still ensure there are two copies of the data left in the system, so it can keep running and can still tolerate one more failure before you start having outages. So the kind of work we have to do is integrate with our workloads. We can go and talk to Hadoop and say "give me all the nodes I can remove while still keeping two copies of each piece of data", reboot and re-provision them, then come back and ask "did all those nodes come back up? Now give me the next set of nodes I can re-provision while still keeping two copies of the data". That's the kind of integration we need to do, and it also involves thinking about things like the placement of data. You might think that's a lot of effort, and it is; it's a lot of effort to pay for simple changes. But as soon as you want to make complex changes, this is exactly the kind of thing you have to do, and you need to have answers for it. So we're forcing ourselves to treat every change the same and to address this problem head-on. And if we can make it work, there's a whole bunch of benefits we get. One I haven't mentioned yet: we can do things like a read-only file system, or partially read-only and partially no-exec. From a security perspective, that avoids a whole bunch of hardening we would otherwise need to do; it makes it much harder for an adversary to persist on our file system, so it gives us a big security win essentially for free. There's still more work we need to do to get there, but that's what we're doing, and that's roughly where we're up to. Are there any questions? ... That's it then. Thank you, guys.