So welcome everyone, and thanks for being here right at lunchtime; I hope I won't let you fall asleep. I'm a Site Reliability Engineer trying to keep Wikipedia running. I work at the Wikimedia Foundation, the nonprofit organization behind Wikipedia and all its sister projects. But that's enough about me. Today we will be talking about Cumin: automation and orchestration made possible. Cumin is a Python framework and command-line interface that integrates with your existing infrastructure and allows you to execute multiple commands in parallel on a selected set of hosts. Automation isn't easy, but we are trying to make it possible. Cumin provides a flexible, reliable and scalable way to execute whatever commands you need to run on many hosts, and those hosts are dynamically selected based on your needs. It allows very fine-grained host selection, which is something that is usually not possible with other solutions. So let's see what we will cover today. First we will see what problems Cumin is trying to solve. Then we will go over its main components: the backends, which allow you to select hosts from different sources of truth; a grammar, which allows you to combine those results; and the transport, which is what is used to execute the commands on the target hosts. Then we will go over the execution strategies, another feature of Cumin that allows it to adapt to most use cases, and one of the main reasons we started writing this tool. Finally, we will go through a quick demo to show you Cumin in action. So, first of all, how did we get here? I will tell you our story, but this story is a very common one and I'm sure you will be familiar with it. At the start you have just a few services and a few hosts to configure, and you do everything manually.
There is no need for automation, no orchestration, no configuration management, just because everything is so easy that you don't worry about the complexity. Then your number of servers starts to grow, and you do some quick and easy automation here and there with some Bash or Perl scripts, and things get a bit better. But you continue to grow, so you start looking into configuration management solutions. You evaluate all of them and pick one; in our case Puppet was selected, in the fall of 2008. Then we continued to grow, we opened a second data center, and we basically doubled the number of hosts. At that point configuration management is still okay: it takes care of the configuration of your hosts, but it will not give you automation and orchestration, and the more hosts you have, especially across multiple data centers, the more you need that. So we started looking for a solution that would allow us to do some orchestration and automation across the fleet. In our case the search wasn't perfect, because we didn't find anything in open source software that would let us do exactly what we wanted, but we chose one anyway: Salt. We chose it not for its configuration management part, because we were already using Puppet, but basically for its distributed SSH capability. So we started using it, but because Puppet was our configuration management, using Salt for distributed SSH meant that we had to have Puppet configure Salt grains on the target hosts just to get some host selection capability. That was very limited and not integrated with Puppet in an easy way, so it didn't allow us to do all the things we wanted. In the end we decided to write a new tool, which became Cumin, and we wanted it to allow us to do all the things the other tools didn't.
So basically, what did we actually want? The first thing was to be able to target hosts in a very fine-grained, very dynamic way. We didn't want fixed lists of hosts; we didn't want anything tightly coupled with the current state, but something gathered dynamically. In this case, for example, we would like to select the hosts in the intersection of different sets: say, the hosts running Debian stretch that have the Puppet class nginx with the parameter cluster equal to foo, but that are not currently pooled in cluster foo. We want to be able to address, say, the hosts highlighted in red and blue but not the ones in black, and to change this selection dynamically and very quickly. So we wanted something able to gather hosts dynamically from different sources of truth, with different backends producing different lists of hosts, and then mix and combine them with standard boolean operators and parentheses, allowing you to write a simple query that mixes them in whatever way you need at that moment.
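To make the idea concrete, here is a minimal sketch using plain Python sets, not Cumin's actual grammar or API, and with invented host names, of the kind of boolean combination just described:

```python
# Illustrative only: combining host lists gathered from different
# sources of truth with boolean set algebra, the way Cumin's global
# grammar combines backend results. All host names are made up.
stretch_hosts = {"mw1001", "mw1002", "mw1003", "db1001"}  # e.g. hosts on Debian stretch
nginx_foo = {"mw1002", "mw1003", "mw1004"}                # puppet class nginx, cluster=foo
pooled_in_foo = {"mw1003"}                                # hosts currently pooled

# "stretch AND nginx(cluster=foo) AND NOT pooled"
targets = (stretch_hosts & nginx_foo) - pooled_in_foo
print(sorted(targets))  # ['mw1002']
```

The point is only that each source of truth yields a set, and sets compose freely with and, or, and not.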
The other main feature we wanted concerns how to execute things. Sometimes you are just checking some status, gathering some info, or doing some simple cleanup. You don't need a very safe way to do it, because those are harmless commands that you want to execute across the fleet as quickly as possible; you just want to run them and get the output. But in other cases you want to do more delicate things: for example, depool some hosts from a cluster, restart a service, and pool them back. You know you can't do that on every host at the same time, of course; you want to do just a few at a time and then move on to the next ones, so you want some batching capability. But you also want control over the batches: if you run a batched operation across the whole fleet, you don't want to just say go and forget about it; you want to check that the status is okay at all times, and you want to be able to stop. So basically we wanted something that could do batching but stop in the middle if at any point some success percentage, some success criterion, was not met. These were the two main features we actually wanted in Cumin. So let's see for a moment how we structured Cumin, and what its main components are.
So we have a global grammar, which is what allows us to combine results from different backends. On one side are the backends: right now we already have PuppetDB, the OpenStack API, a backend based on the SSH known-hosts file, and others that are coming and that we are writing. But it's also possible to plug in external backends that you write yourself: you just write a single class in Python, and you can plug in whatever backend is not yet part of Cumin itself, and combine results from all of them. Writing a new backend is very easy; it's just one file, depending on your API of course, but in general it's very simple. Then, given the global grammar, which basically provides parentheses and boolean operators and lets you combine the results of all the backends, you can select hosts from different sources of truth, from your cluster management, from your inventory, whatever you have and want to combine, and then have Cumin execute the commands on the resulting subset of hosts. The transport is what actually executes things on the hosts. Right now we have implemented only SSH. We chose SSH for two main reasons: implied security, and simplicity, because it doesn't require any dependency on the target hosts. The only thing you need is a host with the Cumin master, and nothing else on the target hosts but an SSH server and the ability to connect from the Cumin master. But we are thinking of adding additional transport layers. In particular, we are looking at HTTP/HTTPS for RESTful APIs: for example, say you have multiple Elasticsearch clusters and you want to connect to their APIs.
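Just to illustrate the shape of the "one class per backend" idea, here is a hypothetical backend. Cumin's real plugin interface has its own base class and grammar hooks, so the class name, method name, and query syntax here are all assumptions for illustration:

```python
# Hypothetical sketch: a backend is a single class that resolves a
# query into a set of hosts. NOT Cumin's real plugin API; every name
# here is invented for illustration.
class InventoryBackend:
    """Resolve 'role:<name>' queries against a static inventory,
    standing in for a real source of truth such as PuppetDB."""

    def __init__(self, inventory):
        self.inventory = inventory  # dict: role name -> list of hosts

    def execute(self, query):
        """Return the set of hosts matching the query."""
        _, _, role = query.partition(":")
        return set(self.inventory.get(role, []))

backend = InventoryBackend({"appserver": ["mw1001", "mw1002"], "db": ["db1001"]})
print(sorted(backend.execute("role:appserver")))  # ['mw1001', 'mw1002']
```

Because every backend ultimately returns a set of hosts, the global grammar can combine the results regardless of where they came from.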
We are also looking at a MySQL transport, to connect to clusters of MySQL servers. Those are things that will probably come in the near future of Cumin. Let's also check the other Cumin features that we took into account and that are fundamental to its capabilities. The execution strategies are what I showed you before, and I will now go into a little more detail. First, batching: you can select a batch size, and you can select a batch sleep. Batches are implemented as a sliding window, so if you start with a batch of 10, you will start with 10 hosts, and as soon as one host finishes, the execution of the commands will be scheduled on the next host, and so on. You can also select a batch sleep between hosts, so before scheduling the next host Cumin will sleep for that amount of time. This allows, for example, doing things completely sequentially: you can do one host at a time with a five-minute sleep in between, and perform a very slow rolling restart of your whole fleet. At the same time, when you use batches, we have the concept of a success percentage, or success ratio.
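The sliding-window behaviour just described can be sketched with the standard library alone. This is assumed behaviour for illustration, not Cumin's internals, and `run` is a stand-in for executing the command over SSH:

```python
# Illustrative sketch of sliding-window batching: at most batch_size
# hosts run at once; as soon as one finishes, the next is scheduled,
# optionally after a batch_sleep pause. Not Cumin's actual code.
import threading
import time

def sliding_window(hosts, run, batch_size, batch_sleep=0.0):
    slots = threading.Semaphore(batch_size)  # free slots in the window
    lock = threading.Lock()
    results, threads = {}, []

    def worker(host):
        try:
            status = run(host)  # stand-in for executing the command via SSH
            with lock:
                results[host] = status
        finally:
            slots.release()  # free the slot so the next host can start

    for host in hosts:
        slots.acquire()  # wait until the window has room
        if batch_sleep and threads:
            time.sleep(batch_sleep)  # pause before scheduling the next host
        thread = threading.Thread(target=worker, args=(host,))
        thread.start()
        threads.append(thread)
    for thread in threads:
        thread.join()
    return results

res = sliding_window([f"mw{i}" for i in range(1, 6)], lambda h: 0, batch_size=2)
print(sorted(res))  # ['mw1', 'mw2', 'mw3', 'mw4', 'mw5']
```

With `batch_size=1` and a large `batch_sleep`, this degenerates into the slow, fully sequential rolling restart mentioned above.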
What that means is that every time a host completes the execution of a command, Cumin calculates the success percentage over all the hosts that have already executed that command, and if that percentage is above your threshold it will go on and schedule the next host; otherwise it will stop, because it considers the run a failure. By default the success ratio is one hundred percent, which means that at the first failure it will stop, but you can change that and decide what you consider an acceptable success ratio. By default, Cumin considers every executed command that exits with a non-zero exit status a failure, but you can of course change that too, via a command-line option or, if you are using it as a Python library, by defining the whole list of exit codes that you consider successful. That, for example, is very useful in our case with Puppet, which usually exits with different exit codes even when it is successful, and in test mode has yet other exit codes. The other thing you can set is timeouts: you can set a per-host timeout for each command executed on each host, or you can set a global timeout for the whole execution, depending on your needs.
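The success-ratio logic just described boils down to something like this sketch. The semantics are assumed from the description above, with the threshold and acceptable exit codes mirroring the options just mentioned:

```python
# Assumed semantics of the success-ratio check: after each host runs,
# recompute the success percentage over everything executed so far and
# decide whether to keep scheduling more hosts. Exit statuses not in
# ok_codes count as failures.
def should_continue(exit_codes, success_threshold=100.0, ok_codes=(0,)):
    """exit_codes: exit statuses of the hosts that already ran."""
    ok = sum(1 for code in exit_codes if code in ok_codes)
    return 100.0 * ok / len(exit_codes) >= success_threshold

print(should_continue([0, 0, 1]))                        # False: 66.7% < 100%
print(should_continue([0, 0, 1], success_threshold=60))  # True: 66.7% >= 60%
print(should_continue([0, 2, 2], ok_codes=(0, 2)))       # True: extra OK exit code allowed
```

The last case shows why a configurable list of successful exit codes matters for tools like Puppet that use non-zero codes for successful runs.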
Other small features include, for example, the interactive mode: when you run Cumin with it, everything runs normally, but at the end, instead of just exiting, it drops you into a Python shell in which the results are already preloaded, so you can mangle the results directly in the shell, which is pretty useful in some cases. Another one is the different output formats: by default Cumin just prints the output, but you can select a JSON or TXT format, in which the results will be printed in a way that is easy to post-process with other tools. Another feature of Cumin, inherited from the library we chose for SSH, ClusterShell, a very powerful Python library that comes from the HPC world, is output aggregation. When you run the same command across the whole fleet, all the hosts that produce the same output are grouped together, and you see just one line with the list of matching hosts and the output. So if you are running a command over hundreds of hosts you don't get a very long output; it is much more compact and very useful for checking things, and for immediately understanding how many hosts produced one output, and how many, and which ones, produced a different one. So let's see a quick demo of Cumin in action. Let me see; I hope it is visible. First of all, let's take a quick look at Cumin's options. Of course, everything I'm showing here uses Cumin as a command-line tool, but it can equally well be used, and it's very simple, as a Python library: you just import cumin, and everything is covered in the documentation.
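The output aggregation described a moment ago can be mimicked in a few lines. This is a stdlib stand-in for what ClusterShell does, with invented host names and outputs:

```python
# Stdlib stand-in for ClusterShell-style output aggregation: hosts
# with identical output are grouped, so a run over hundreds of hosts
# prints one line per distinct output. Data is invented.
from collections import defaultdict

outputs = {
    "mw1001": "active",
    "mw1002": "active",
    "mw1003": "inactive",
    "mw1004": "active",
}

grouped = defaultdict(list)
for host, out in outputs.items():
    grouped[out].append(host)

for out, hosts in sorted(grouped.items()):
    print(f"{','.join(sorted(hosts))}: {out}")
# mw1001,mw1002,mw1004: active
# mw1003: inactive
```

Grouping by output instead of printing per host is what keeps the report readable at fleet scale.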
It's all very clearly explained there, with a lot of examples of how to use it. Basically, what you run is cumin, a host query, which is the query that selects the hosts, and the series of commands you want to execute, one or more. By default it reads a configuration file, usually in /etc/cumin, but you can change it and point it at a different one. The mode is something very specific that we decided to do: when you are running multiple commands, you basically have two ways of doing it. I mean, there are many, but the most common are these. The first is: I want to run this command on all the hosts, and then, if it succeeds, run the second command on all the hosts that were successful. So you run the first command on a hundred hosts, you have, say, ten failures, those ten failures are within your success ratio, and so Cumin will execute the next command only on the ninety hosts that were successful. If instead your success ratio was, say, 95%, it would not be met, so Cumin would run just the first command and stop. This is what we call sync mode. Then there is async mode, in which you don't care about synchronicity across the cluster, and you just run the first, second, third command, or however many there are, on each host independently. Within each host, if one command fails the others will not run, but the host will not wait for the other hosts: the hosts are independent of each other, basically vertically independent, and each one executes its own commands. Then we have the other options, which we will see: the batch size, the timeout, the batch sleep, the interactive mode, and basically all the options I talked about. Now let's see how we can do a simple query. In this case we are using PuppetDB as a backend, and this is a very simple syntax to say: I want all the hosts that match the class nginx. Given that I didn't put any command, it will assume a dry run.
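A toy model of the two modes, with semantics assumed from the description above; `run` is a stand-in for executing one command on one host and returns its exit status:

```python
# Toy model of sync vs async execution modes. Illustrative only, not
# Cumin's implementation.
def run_sync(hosts, commands, run):
    """Each command runs on all hosts; the next command runs only on
    the hosts where the previous one succeeded."""
    current = list(hosts)
    for cmd in commands:
        current = [h for h in current if run(h, cmd) == 0]
        if not current:
            break
    return current  # hosts that succeeded on every command

def run_async(hosts, commands, run):
    """Each host works through its own command list independently,
    stopping at its first failure, without waiting for other hosts."""
    return {h: all(run(h, cmd) == 0 for cmd in commands) for h in hosts}

fail_on = {("mw2", "restart")}  # pretend 'restart' fails on mw2
run = lambda host, cmd: 1 if (host, cmd) in fail_on else 0

print(run_sync(["mw1", "mw2"], ["check", "restart"], run))   # ['mw1']
print(run_async(["mw1", "mw2"], ["check", "restart"], run))  # {'mw1': True, 'mw2': False}
```

In both modes a failed command stops that host's remaining commands; the difference is whether hosts wait for each other between commands.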
It will just exit after listing the matching hosts, and the matching hosts are listed in a very compact way. This is also thanks to the ClusterShell library, which uses NodeSets: basically Python sets with additional features, one of which is automatically compressing the hosts into a very compact syntax, in particular if you have numeric ranges in your host names. So, for example, to select by a Puppet class applied according to PuppetDB, we do this; and in this case we are checking all the hosts that match a resource of type File whose title is that path. Basically we are checking whether that file is managed by Puppet, which is recorded in PuppetDB as a File resource, where the title of that File resource is usually the path where you are saving it. So you can very quickly check and target those hosts based on specific resources: not only classes, but very specific resources or facts. In this case we are using the fairly new paradigm of Puppet roles and profiles: this is a role called mediawiki appserver, and it will match the hosts that have this role applied to them. And this is a slightly more complex query using the global grammar: here I'm mixing things from different backends. It's just an example; they are actually both PuppetDB backends with different queries, so I can express queries that the PuppetDB API alone doesn't allow me to do. I'm asking for all the hosts that have the class nginx, and also all the hosts that have a fact saying that the distribution is jessie, that is, Debian jessie. So I'm getting a sub-selection of hosts, and I can use and, or, not, xor, and combine them with parentheses, so basically there is no limitation on the combinations of hosts. As you can see, the queries can become quite long.
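The compact host-list syntax mentioned above is produced by ClusterShell's NodeSet; its range folding can be approximated with a small stdlib sketch:

```python
# Stdlib approximation of NodeSet-style folding: consecutive numeric
# suffixes are collapsed into ranges like mw[1001-1003]. Illustrative
# only; the real implementation is ClusterShell's NodeSet.
import re
from itertools import groupby

def fold(hosts):
    by_prefix = {}
    for host in sorted(hosts):
        match = re.match(r"^(.*?)(\d+)$", host)
        if match:
            by_prefix.setdefault(match.group(1), []).append(int(match.group(2)))
        else:
            by_prefix.setdefault(host, [])
    parts = []
    for prefix, nums in sorted(by_prefix.items()):
        if not nums:
            parts.append(prefix)  # host name without a numeric suffix
            continue
        nums.sort()
        ranges = []
        # group consecutive numbers: n - index is constant within a run
        for _, run in groupby(enumerate(nums), lambda t: t[1] - t[0]):
            run = [n for _, n in run]
            ranges.append(str(run[0]) if len(run) == 1 else f"{run[0]}-{run[-1]}")
        parts.append(f"{prefix}[{','.join(ranges)}]")
    return ",".join(parts)

print(fold(["mw1001", "mw1002", "mw1003", "mw1005"]))  # mw[1001-1003,1005]
```

This folded form is what makes a result set of hundreds of hosts readable at a glance.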
So we developed aliases. Aliases are basically a configuration file in which we can say, for example, that mw is an alias for a set of hosts, and that can be a very complex query; it doesn't matter, it's just a configuration file. So all the common queries you might have, you can just save them as aliases, and it will be very easy to recall them when you want. Now, in this case, let's run a very simple thing: I'm checking whether the apache2 process is active, via systemctl is-active, on a series of target hosts. I can very quickly see that one host has it inactive and all the other hosts have it active, and Cumin tells me very clearly at the end that the one host, 1%, failed to execute it, because the exit code of that command was not zero, and it lists which hosts succeeded. From this output, if I want to run some next command based on it, I can just copy and paste the list of hosts; the subgroups of successful and failing hosts are right there, and I can use Cumin again with that list of hosts, because I can just paste them into the query. Now, a slightly more complex example with two commands, because that was just a single command. We can run the same thing, but then I also want to get the status, to know for how long the service has been active. In this case it runs with the same concept, but I can show you better here.
So, as you can see, I'm using async mode in this case, and what I'm telling Cumin is basically to run those commands independently, because I don't care about first running is-active on all the hosts and only then running status on all the hosts to get the time it has been active. I just run both commands on all the hosts at the same time, and as you can see it groups the output together, so I know that on 44 hosts apache2 was restarted three weeks and four days ago, most likely for a security upgrade. In this next case I'm using batches, so it will be a little bit slower: I'm using batches of five, and I set the success percentage to 80, which means I accept 80% success, and only 20% failure, for each batch of five. Here, given the same list of hosts, I'm using sync mode instead, so I'm first running one command on all the hosts and then the second command only on the ones that succeeded. This is going slowly because it runs in batches of five, and it will do the same. So this was a very quick example of how we use Cumin every day at Wikimedia; we have had really good results so far, and a lot of people are using it. So, just to recap: we have seen that Cumin has a powerful query syntax to select hosts, it allows multiple execution strategies, it aggregates the output in a smart way, and it reports all the failures reliably. It can be used as a command-line tool or as a Python framework/library in your other automation or orchestration tools. For more information and to find the source code, there are releases on PyPI, releases on GitHub, documentation, and details of how we use it at the Wikimedia Foundation; all the links are listed on the FOSDEM page for this talk. All my contacts are on my speaker profile page on FOSDEM, and you can find them just by
searching for Cumin and FOSDEM, and optionally my name, on any search engine. So thank you very much for being here, and let me know if you have any questions. [Question from the audience] For that you can set a very small timeout: if your normal command takes, say, 10 seconds, you can set the timeout to 20 just to be on the safe side, and it will fail quickly. We don't have exactly that other feature, starting the second command right after a success, because we want to be reliable: in sync mode you execute the first command on all hosts precisely because you don't know whether the next one will actually succeed or fail. So you can approximate that behavior with a short timeout. [Another question] When you use batches, the success percentage is calculated every time a host finishes executing a command: when a host finishes, Cumin recalculates the success percentage over everything executed so far, and if it is not met it will not schedule any more hosts and will stop there. Yes, it's a safety feature, absolutely. Say you run a batch of three and you accept one failure out of every three: Cumin calculates the ratio over the whole run, so it will keep going, but if, for example, the first two hosts fail, it will stop, because in the first batches you have very few hosts done. It's a safety feature.