So this talk is about how you can go about things like this in a Mesos context. The main message I want to bring across here is: for resilient systems, you shouldn't reinvent the wheel, because almost certainly you will come up with a square wheel. What you should do is use algorithms, methods and tools which are known to work, which are battle tested and which other people have designed for you to use. That is the easiest way to get up and running.

So, let me name a few problems you have to solve when you build a resilient system, and this talk will be about how to use existing tools to get a quicker start. For example, if you are running a distributed system, you have to somehow hold the state of the whole cluster. You have to know where your tasks are, how they can talk to each other, and how things change in case of failover. You want to organize your failover: you want to first detect that something is wrong, some part of your system has to react to this, you want to reconfigure your cluster, and you want to do all this without any service interruption. One part of your system needs to discover the other, even if that part changes due to some failover. And in a distributed system you have different actions happening in different places of your cluster, and you sometimes need to synchronize these things so that events happen in an orderly fashion. So, this talk is about tools and algorithms to achieve these things.

So, before I go into more detail, let me talk about consensus, because in many cases that is at the core of your problem in a distributed system. The different parts of your system somehow need to agree on things. They need to agree on which server has failed at this point, or on what the current configuration is. And once you start with distributed systems, you will quickly notice that one of the fundamental problems is this here.
Because, in particular in the presence of failures, the opinion as to what state the system is in, the opinion about the current state, differs from part to part of your system. It is really hard to get right. So, if all your servers are up and running and you have a network that works, then consensus isn't a problem. You just exchange enough information such that everybody is up to date and they all agree on some state. However, in practice this just doesn't work. Network failures do happen; they happen intermittently, but they also happen permanently. They happen due to human misconfiguration of a system, or due to failure of some network component or whatever. And the more you scale out, the more servers you have, the more hardware failures you get. If you run a system with 10,000 computers, then every night a few of them will fail. That is for sure. And your system needs to be resilient and be able to react to this.

And regardless of what the failures are, a fundamental problem is that the different parts of your system have to agree on what state your complete system is in at this point. So, here's a somewhat lax definition: consensus is the art of achieving this in software. And I intentionally say it's an art, because it can be very difficult to get right, and what "right" means is actually a moving target.

So what is out there in theory about consensus? Traditionally, consensus was achieved with something called Paxos. Paxos is a protocol which various servers speak to agree on a certain state of affairs. Later, in 2013, an alternative was proposed which is called Raft. And I just want to give you a little bit of an overview of the differences. Traditionally, Paxos is considered to be very difficult. Do read the papers, they are very enjoyable to read, but it's really hard to understand what the essence of this protocol is.
There are a lot of variants for different purposes, and most of the theoretical papers do not actually tell you how to implement Paxos efficiently. So, understanding Paxos is hard, and implementing it is hard. And there have been enough examples where people have tried to implement Paxos and have failed to do it right or to do it efficiently. One of the design goals of Raft is to be understandable, to be easier to implement, to make it easier to understand what's going on. That is a tremendous advantage.

So, my advice if you need to achieve consensus between certain servers: first try to understand Paxos. It's an interesting, challenging task and it's enjoyable. But only do it for some time, because your time is scarce. Do not implement Paxos. After you have seen how complicated Paxos is, you should have a look at Raft and enjoy the beauty of it, because it's really a beautiful piece of theory to understand, and you can appreciate the simplicity of it. But then, don't implement it either. It is much easier to implement Raft than to implement Paxos, but it's still hard. Very clever people have failed to get it right, and making it efficient is a considerable challenge. So, use some other implementation from people who have done this and tested it, and use it out of the box.

So, consensus is necessary. You will need it as an integral part of any kind of resilient distributed system. But don't try to roll your own; don't even try to invent your own. These two protocols have the advantage that somebody has given a theoretical proof that they are correct. It's then a different matter to implement them correctly, but at least you have a solid theoretical foundation. So, the most important lesson is: do not try to invent your own consensus algorithm. And the second lesson is: do not try to implement your own. Use something out of the box. And if you want to use something, you should probably have an understanding of what it is.
And so let me try the impossible: to explain what Raft does in one slide. I know this is impossible, but nevertheless I will try. A Raft setup consists of an odd number of servers; usually you take three or five. You need an odd number to achieve a majority. And this odd number of servers keeps a log of events, and the idea is to make it so that every server has exactly the same log of events. That's the consensus part. Everything that happens in this log of events is replicated to every server. And you will notice in a second that the purpose of Raft is not to be fast. The purpose is to be safe and resilient.

What this odd number of servers does is essentially democracy. What they do first is elect a leader, and then the leader is the one server that decides what is appended to the log. So only the leader may append, and it may not change the log in earlier positions; it may only append to the log. And then every change is replicated to the others, and an append to the log only counts once an absolute majority of the servers has agreed on this new log entry. That is what makes it fault tolerant, essentially.

Now, if one of these servers fails, and it's not the leader, then it's not too bad; it's just ignored. But if the leader fails, then the remaining two or four quickly elect a new one. But to elect a new leader, they have to have a majority of the votes. So imagine you have five servers. One fails, which used to be the leader, and the other four elect a new one very quickly. But they need at least three votes, because once there are three votes which agree on the new leader, the old leader, even if it can still talk to one of them, say, can make no more progress. Because making progress in the replicated log needs a majority of the servers, and if there's a majority for a new leader, then the old leader has to step down. The whole protocol has very smart logic, which I cannot explain here in detail, to always ensure that there is a unique leader and that when anything goes wrong, a new leader is automatically elected. It's really a lot of fun to get right, theoretically and practically. But we have a theoretical proof that it works.

And of course, regarding failures: if a majority of servers fails, or all fail, then there's nothing you can do, because you need a majority to go on. Therefore, if you run three servers, you can tolerate the failure of one. If you run five, you can tolerate the failure of two.

Usually what one does with this replicated log is put on top of it a key value store, which behaves like a database. And every change to the key value store is just written to the log as a change transaction. And the API for such a Raft system is made in a way such that you can talk to the leader and say: I want something changed. The leader appends the change to the replicated log, everything is replicated to at least a majority, and then it counts as a change to the database.

So for the rest of the talk, I want to present a specific implementation of this, and I want to show you how to use it and show a few examples of how you build certain parts of resilient systems with it. Now I of course have to admit that I work for a database manufacturer called ArangoDB. We make a distributed database, ArangoDB, and therefore we had to solve all these problems. We had to build a resilient system which nicely runs on DC/OS, say, and which recovers from failures. So in the middle of this ArangoDB cluster we needed such a resilient Raft store. We started out following my own advice and using other Raft implementations: we started with using etcd. And sooner or later we noticed that we need a very specific feature set which etcd didn't have. And so therefore we did not follow this advice; it is bad if you don't follow your own advice. But what we did is we implemented the Raft protocol in ArangoDB.
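The quorum arithmetic behind "three tolerates one, five tolerates two" can be sketched in a few lines of Go; this is just a toy calculation to make the majority rule concrete, not part of any Raft implementation:

```go
package main

import "fmt"

// quorum returns the number of votes needed for a strict majority
// among n servers.
func quorum(n int) int { return n/2 + 1 }

// tolerated returns how many of n servers may fail while a strict
// majority can still be formed among the survivors.
func tolerated(n int) int { return (n - 1) / 2 }

func main() {
	for _, n := range []int{3, 5, 7} {
		fmt.Printf("n=%d quorum=%d tolerated=%d\n", n, quorum(n), tolerated(n))
	}
	// prints:
	// n=3 quorum=2 tolerated=1
	// n=5 quorum=3 tolerated=2
	// n=7 quorum=4 tolerated=3
}
```

This is also why an even number of servers buys you nothing: four servers still need three votes for a majority, so they tolerate only one failure, the same as three servers.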
And so now you can use ArangoDB as a multi-model distributed database, but you can also use it just for the purpose of having a Raft server. And what I'm talking about today is the mode of ArangoDB where you just use it as a Raft triple or Raft quintuple. That is what I call the ArangoDB agency: the mode of ArangoDB where three or five or whatever odd number of servers together make a resilient key value store.

So let me say a few details about how this works. The philosophy is that everything is JSON and all communication is via HTTP. This makes it completely language-independent, and practically all languages you want to program in nowadays can talk JSON and can talk HTTP. Now, the key value store on top of this Raft log you can think of as basically storing a single big JSON document. So the complete state of this database could be something like this: for example, on the top level a key named a which has the value 12, or b, a second key on the top level, which has an object as its value. The idea is that the replicated log in the Raft protocol notices and stores the changes you apply to this single JSON document. And you have an API to do write transactions and read transactions, to change or read parts of it or the complete thing. So you can do projections, you can change parts in a transactional way, and you can prescribe preconditions for your transactions.

For what I'm explaining now, you can for the moment forget everything I said about Raft; just keep in mind that this key value store, this transactional store, is completely fault tolerant by using Raft. And just one more word: you shouldn't store gigabytes of data in there. This is not intended for high performance, high volume data. This is intended to be used as a key value store to save the state of your distributed system and some metadata, to do synchronization work, and to be a building block in your resilient system. So let's see how you work with this.
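Such a single-JSON-document state might look like the following sketch; the keys and values here are illustrative, chosen to match the read and write examples that come next:

```json
{
  "a": 12,
  "b": { "name": "some name" },
  "c": { "y": { "u": 1 } }
}
```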
Imagine this is the state of the database. You can, for example, send a transaction like this here, and you see the philosophy is that everything is JSON. So a transaction, a change to the above state, is itself encoded as a JSON document. This here, for example, is a read transaction which says: I want two projections of this state. The first: just give me everything stored under b/name, which in this case just answers with my name. And the second: give me a second projection of this, which is the smallest sub-object containing all these paths. The path notation with slashes just means to dig into the JSON by following attribute names in the objects. So b/name would first go, on the top level, to this b here, and then to the name, and end up with my name. The result is a substructure of the complete state, a projection containing the minimal number of things which covers all these paths. So to cut a long story short, reading from this key value store just means you give a list of lists of paths, and it comes back with a list of JSON objects with this state. And it always sees a consistent state of the database between two write transactions.

Similarly, if I want to change the state, I do this by sending a write transaction. Here you see a simple example of a write transaction: the first operation says take the key a on the top level and set it to the new value 13, and the second operation says take the value under the path c/y/u and increment it by one. The benefit of having a transaction is that this operation is done as a whole or not at all, and nobody can see a state of the system where, say, a is changed to 13 but the other value is not yet incremented. So this is a transactional database. You can do deletions, obviously, and you can set complete sub-values, for example take the sub-value under c/y and set it to a complete sub-object.
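In Go, the bodies of such read and write transactions could be built as in the sketch below. The endpoint paths I mention in the comments (`/_api/agency/read`, `/_api/agency/write`) are my recollection of the ArangoDB agency HTTP API, as is the `"op"`/`"new"` operation encoding; verify both against the current documentation before relying on them:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// buildRead builds the body of an agency read transaction: a list of
// lists of paths, where each inner list yields one projection of the
// state. It would be POSTed to something like /_api/agency/read.
func buildRead(projections [][]string) ([]byte, error) {
	return json.Marshal(projections)
}

// buildWrite builds the body of a simple agency write transaction:
// a list containing one transaction, which is a list containing one
// mutation object mapping paths to operations. It would be POSTed
// to something like /_api/agency/write.
func buildWrite(ops map[string]interface{}) ([]byte, error) {
	return json.Marshal([]interface{}{[]interface{}{ops}})
}

func main() {
	// Two projections: everything under /b/name, and the smallest
	// sub-object covering both /a and /c/y/u.
	read, _ := buildRead([][]string{{"/b/name"}, {"/a", "/c/y/u"}})
	fmt.Println(string(read)) // [["/b/name"],["/a","/c/y/u"]]

	// One atomic write: set /a to 13 and increment /c/y/u.
	write, _ := buildWrite(map[string]interface{}{
		"/a":     map[string]interface{}{"op": "set", "new": 13},
		"/c/y/u": map[string]interface{}{"op": "increment"},
	})
	fmt.Println(string(write))
}
```

Note that both operations travel in one transaction object, which is what gives you the all-or-nothing guarantee described above.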
So we use the recursive nature of JSON here: if you have a highly recursive structure, you can either exchange a complete subtree, or you can dig in and exchange something within the subtree. There's a shortcut to this "op": "set" and "new" notation: you can just give the new value directly; that's just to simplify things a bit. There are different operations; I haven't mentioned everything here. For arrays you can push something to the end or pop from the end, and with new values you can specify a time to live. That makes it possible to put something into the database knowing that it vanishes automatically after a given amount of time. This can be very useful, as we will see later.

One promise this key value store makes is that even if it is hit by multiple clients at the same time with write transactions, it will order all these write transactions in a linear fashion which is visible to the clients, and execute one transaction after another in an atomic way. That's a very important guarantee to make, because it's much easier to program against this.

Transactions have more power: you can specify preconditions. For example, here we see the change we saw before, the first object up here, but we can say it is only to be executed if the old value of the key a is 12, and if the value under c/y is already set ("oldEmpty": false means that it must already be set, so the transaction wouldn't be applied if it is not). You can have multiple different conditions: you can have the condition that a key is empty or that a key is non-empty, you can specify the condition that the value of a key is a certain value, and you can specify that the value needs to be an array. That's useful for push and pop operations.

Another feature this key value store has is that you can subscribe to changes. You can send something to the database for a certain key and say you want to observe this value or everything below it.
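A guarded write like the one just described could be encoded as follows. The precondition field names ("old", "oldEmpty") follow the ArangoDB agency API as I recall it; treat them as assumptions to check against the docs:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// buildGuardedWrite encodes one agency write transaction as
// [[mutation, precondition]]: the mutation is applied only if all
// preconditions hold; otherwise the whole transaction is rejected.
func buildGuardedWrite(mutation, precondition map[string]interface{}) ([]byte, error) {
	return json.Marshal([]interface{}{[]interface{}{mutation, precondition}})
}

func main() {
	// Set /a to 13 and increment /c/y/u, but only if /a still holds
	// 12 and /c/y is already set ("oldEmpty": false).
	body, _ := buildGuardedWrite(
		map[string]interface{}{
			"/a":     map[string]interface{}{"op": "set", "new": 13},
			"/c/y/u": map[string]interface{}{"op": "increment"},
		},
		map[string]interface{}{
			"/a":   map[string]interface{}{"old": 12},
			"/c/y": map[string]interface{}{"oldEmpty": false},
		},
	)
	fmt.Println(string(body))
}
```

This compare-and-swap pattern, read a value, then write under the precondition that it is unchanged, is the building block behind the discovery, initialization and leader-election recipes later in the talk.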
This means that whenever a write transaction is executed that touches anything below this key, an HTTP callback, actually a POST request, is issued to the URL you have given. That's very convenient; it avoids the necessity of long polling and lets you react quickly to changes. One caveat: one should never rely on this for correctness, because in a distributed system any kind of message can be dropped. It is entirely possible that such a callback doesn't get through. So if you use this, it is very good as a performance optimization, to make your system react quickly if all is well. But whatever value you observe, you should at the same time check it regularly, to be correct even without the callbacks. Therefore we don't make any strong guarantees on this. We say this is delivered on a best-effort basis, to speed things up if everything is well, but you have to check yourself.

One more important feature we needed, and I imagine many distributed systems will need this, is something we call supervision. Quite often you design a distributed system in the following way. Some central place, like such an agency, contains the current configuration of the system, for example the last heartbeat of every participating server, things like that. And then you need some place which observes the state and notices, for example, that some server hasn't sent a heartbeat in 30 seconds. Then you usually have to react: you have to reconfigure your system, you have to change responsibilities to take work away from this failed server and give it to another one. So you need some kind of supervision task. And the perfect place to run such a supervision is in such a Raft store, because there you can make sure that the supervision job runs regularly, exactly on the elected leader; it is very close to the actual key value data, and it can do changes very quickly.
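Registering such an observer could look roughly like this. The op name "observe" and the "url" field are my recollection of the ArangoDB agency API and should be checked against the documentation; the key and callback URL are invented for illustration:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// buildObserve encodes a write transaction that asks the agency to
// POST a notification to url whenever anything under key changes.
// Delivery is best effort only, so callers must still poll the key
// regularly for correctness.
func buildObserve(key, url string) ([]byte, error) {
	return json.Marshal([]interface{}{[]interface{}{map[string]interface{}{
		key: map[string]string{"op": "observe", "url": url},
	}}})
}

func main() {
	body, _ := buildObserve("/my/config", "http://myservice.example.com/notify")
	fmt.Println(string(body))
}
```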
So in our agency implementation we make sure that on the Raft leader of this triple or quintuple we have a supervision job running which keeps an eye on things by changing values. So you can use an ArangoDB agency as one of these integral parts of your distributed fault-tolerant system, and you can actually implement such a supervision process with a few lines of JavaScript which is executed on this key value store on the Raft leader. That is a very convenient way to implement such supervision strategies, and it's highly efficient, because it runs in the agency, has direct read and write access to the data, and can issue write transactions if needed. And usually any kind of reconfiguration will be done in there by the supervision process, and in a decentralized way the servers just react to this. They keep an eye on the configuration: suddenly I'm responsible for doing this and that, so I'd better get started doing it. And the reaction can be fast because we actually have the callbacks, and even if they fail you can fall back on regularly checking the configuration and noticing that you have a new responsibility.

But what are typical use cases for a system like this? Essentially this is the story: if you devise a distributed fault-tolerant system, I recommend that you use such a tool. And obviously there are others out there: I think people use ZooKeeper, people use etcd, people use Consul and other such services. What I present here is a new kind of such service, and the advantages I think it has are that it has transactions, it has transactions with preconditions, it has callbacks, and you can implement your supervision jobs right there on this key value store. We designed this to work in ArangoDB for our own needs, but we think that many distributed fault-tolerant systems will have the very same needs. Why is that? You can use this as central configuration management.
You can send heartbeats there and let the supervision discover that a heartbeat has stopped. You can organize supervision and automated failover right there in your key value store for the configuration. You can use it for synchronization between servers. You can do locking and resource management with it, thanks to the preconditions for transactions. You can organize a kind of leader election for your own services in there. You can do service discovery, you can do unique initialization, some data structure which needs to be initialized exactly once, and all this kind of stuff.

So, let me just check the time; in the rest of the talk I want to show you a few examples of how you would program against such a piece of software. This is a slide I've just put in: this is how you deploy an ArangoDB agency. You take a single JSON file, forget about the details, and you run the command `dcos marathon app add`, and this will fire up such a resilient key value store in your Mesos cluster, your DC/OS cluster, so you don't have any hassle with that.

Right, so that's the general idea I already explained. Let's look at discovery. Imagine you have a system consisting of a group of jobs or tasks and they have to talk to each other. You don't know beforehand whether it will be 3 or 17 or 100; they need to find each other and talk to each other, and in particular they need each other's addresses and ports. How do you do that? Well, you reserve some place in the key value store and make every service do a write operation in the beginning: namely, it sets the value under the key of its unique ID to its address and port. Every service that starts up knows its own address and port, knows how it can be reached, and so it informs all the others by dropping this information over there. And you do this with a precondition saying the old value must be empty, nobody must have been registered with this ID, and then you just retry if this doesn't work.
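The registration step can be sketched with a tiny in-memory stand-in for the agency. The store type and setIfEmpty helper here are invented for illustration; in the real system the same check-and-set runs as an HTTP write transaction with an "oldEmpty": true precondition:

```go
package main

import (
	"fmt"
	"sync"
)

// store is a minimal in-memory stand-in for the agency: a key value
// store with a compare-and-swap write, just enough to illustrate the
// registration logic.
type store struct {
	mu sync.Mutex
	m  map[string]string
}

// setIfEmpty writes value under key only if the key is not set yet,
// mirroring an agency write transaction with an oldEmpty precondition.
func (s *store) setIfEmpty(key, value string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.m[key]; ok {
		return false
	}
	s.m[key] = value
	return true
}

// register announces a service under /services/<id>. A clash means
// the ID is already taken and the caller must retry.
func register(s *store, id, endpoint string) bool {
	return s.setIfEmpty("/services/"+id, endpoint)
}

func main() {
	s := &store{m: make(map[string]string)}
	fmt.Println(register(s, "task-1", "10.0.0.5:8529")) // true: registered
	fmt.Println(register(s, "task-1", "10.0.0.6:8529")) // false: ID taken
}
```

The precondition is what makes this safe: two tasks racing to claim the same ID cannot both succeed, because the agency linearizes all write transactions.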
That's a very robust way to register your service: all you have to know is the address of this agency. Then, regularly, every task simply reads the value under the key services and gets a complete list of IDs and addresses that are currently running in the system, and then they can happily talk and exchange messages. And you see, all you have to have is an agency up and running, and in the programming language of your choice you need to be able to send one HTTP request in the beginning and then another one regularly, every 5 seconds. You can even improve this by having a callback to tell you quickly when a new server arrives.

The next example is one-time initialization. Imagine you want to have something in this key space organized basically when the first server starts; then it is there and doesn't have to be organized again. So of course you need one of your servers to do this. An easy way is to manually start up an initialization process first, but that's not resilient, because it might fail; things could go wrong. So here is how you do it: you just execute this here in every single task. What you do is you first read the value under the key init state. This init state is empty in the beginning, it is set to the ID of the server who is doing the initialization, and it is set to the string "done" if somebody has already finished the initialization. So what you do is you check the init state. If it's "done", then you just start the service and you know it's initialized. If it's non-empty, then you just say: OK, somebody else is doing the job; wait for 15 seconds and start anew. Why does it have to start anew? Well, the server doing the initialization might fail. So we have put a time to live on the entry, such that if the server doing the initialization fails, the key value store automatically deletes the value under init state, and then, after these 5 to 15 seconds, any of the servers which are starting will discover the empty value and
try to race to do the initialization itself. Now, if you see an empty value, you do a write transaction to put your own ID under this key, under the condition that it is still empty, because obviously this is now a data race on the key value store. So you have to make sure that if a change happened between your read and your attempted write, this doesn't clash. Here it comes into play that we have transactions with preconditions; otherwise we couldn't implement this in a robust way. Now, if the precondition failed, then somebody else is doing the initialization; well, just start from scratch. And finally, if we have somehow gotten our foot in the door and get to do the initialization, we just go on and do a write transaction that initializes whatever is needed and in the end sets the init state to "done". Either this works and it's all initialized and we are done, or we fail in between. Well, no worries: what happens is that the key value store will automatically remove the value under init state, and somebody else will then take over and start. So you see, this logic is extremely simple: you do a three step procedure in every process, it's extremely robust, and you get your job done. That is how you build up fault-tolerant services by using such technology.

I know my time is nearly up, so let's just quickly look at leader election. Imagine you have the problem that you have a distributed system and you want, at every given time, one of your servers to be responsible for something. What do you do? Every follower, that is, a server which is not the leader at this stage, regularly does the following: it reads the key leader. Either this is empty, or it contains an object with the ID of the leader and the term number. Now, if the leader entry is not empty, then this follower considers the server in there to be the leader for that term, so it will just follow the leader, keep the term, and after some time try again. You can have this as a background task: the follower just stays a follower and knows who the
leader is. If, however, the leader entry is currently empty, then the server tries to become leader by putting its own ID in the leader position, together with a term which is a number higher than every term already seen. It puts a time to live on it, because obviously, if the leader fails, the entry has to be cleaned out automatically, such that somebody else can become leader. And obviously you must only do this under the condition that there is currently no leader configured there, so you have a precondition for this transaction. If the precondition fails, somebody else was faster in the race for leadership, and you just start from the beginning. However, if you have made it here, then you are the leader and work under the term you have put there.

Now, what does the leader do? Obviously, with a time to live of 10, you don't want a leader change every 10 seconds. So the leader regularly does this: it puts the very same entry again, with a time to live of 10. So every second or so it renews its entry to make sure that this key is not automatically deleted by the agency. If the precondition fails, the one which checks that the old value is what we thought it was, then the leader obviously was offline for some time, or a message was lost, or whatever. It has lost the leadership, and then it resigns and becomes a follower again. Otherwise, if this works, it continues to be leader for this term. Again, it's a very simple implementation, but it is robust, without you implementing a Raft protocol yourself. Actually, you are running a Raft protocol, but in a prefabricated tool. And so this comes back to my original message: if you want to design and implement resilient systems, don't do all the hard work yourself. Get something which gives you a lot of such functionality, and stand on the shoulders of giants by using these tools. That's the advice. One thing is important for this leader business: look at the term. If the leader fails, somebody else might have elected another leader in the meantime, and that one will have a higher term. So everybody has to look at this term and
not follow a leader with an old term.

This is the end of my presentation. I have put up a few links: there are links to ArangoDB, there are links about Raft, there are links to DC/OS, and so on. And I want in particular to point you to the bottom link, where I have put together a little repository with an implementation of the discovery we had in the beginning. You can see that such a discovery, servers talking to each other, can literally be implemented in 100 lines of Go code without any external library, just the standard Go library. So Go is particularly quick for writing such things. What you find in this repository is everything you need to write such a service and deploy it in a Docker container to a DC/OS cluster. This whole Mesos environment makes it particularly easy to get started with these things. I've run out of time, so I can't really show this; also the screen doesn't work very well with my computer. But give it a go, have a look, and if there are any questions about it, let me know. I expect that over time we will collect more examples of usage of such systems in this repository. Thank you very much. If there are questions, just let me know, either now or later; I will be here all day. Thanks.