Hi, everyone. Thank you for coming. My name is Shlomi Noach. I'm with the Database Infrastructure team at GitHub. We're here to talk about Orchestrator on Raft: internals, benefits and considerations, and, if time allows, the best places in Brussels to eat croquettes. So I work at GitHub. I author a bunch of open source projects: Orchestrator, gh-ost, freno, ccql, and others. We're going to talk about Orchestrator and our choice to use Raft as the underlying consensus, or high availability, solution. I'm going to give a brief overview of what Raft is, what Orchestrator is, why we chose to run Orchestrator on Raft, and what are the good things, the benefits, that we got as a result. I'll illustrate a couple of interesting flows, and then, as time allows, considerations for anyone who's thinking of using Raft to implement their own application.

So what is Raft? Raft is a distributed consensus algorithm. Multiple distributed services, hopefully on different boxes, are trying to collaborate. How do they collaborate? First and foremost, by electing a leader. So within a Raft group there is always one single leader. That's the premise of Raft. There can never be two leaders. Maybe there's no leader because there's a network problem, but at most one. And Raft is quorum-based. The leader is guaranteed to see a quorum, or majority, of the servers. So if you have three servers, the leader must be in a group of at least two. If you have five servers, the leader is part of a group of at least three, and so on. And in a Raft group, all nodes communicate with each other. They're all saying: hi, what's up, I am in this position, I'm in that position. They all advertise themselves to each other.

And they each keep a replication log. This is the MySQL devroom, so I'll use some MySQL terminology: you can think of it like the binary log. The biggest difference is that it is guaranteed to be in order, without missing GTIDs, for example. It's an increasing index. The events that are being sent on top of Raft are indexed 1, 2, 3, 4, 5. It can jump to 10; it's just increasing. And the leader is always guaranteed to have the top index. Maybe others have that index, but it certainly can't have a lower index than anyone else. That's actually how the nodes choose a leader: if someone has a higher index, then I cannot be the leader, because I'm not as up-to-date as that server.

I don't know if eventually consistent is the right term here, because it's a bit overloaded. Raft basically guarantees delivery of events. It doesn't necessarily require us to apply those events. So if the leader advertises: hey, this is a new event, I want to do that, I'm doing that, what it gets is an acknowledgement from a quorum of the group. Maybe just the majority; maybe everyone is alive and happy. But all the leader knows is that everyone received the message. And you can hook into it, but the basic guarantee is just the delivery. It doesn't guarantee that this and this and these servers are happy to apply the message; that's a different concern.

Last is that the replication log grows forever. The nodes are assumed to be stateless, right? The application can go down; it is assumed to be stateless. When it comes back up, it reads this replication log, which is persisted, and then it reapplies all those events to come up as an up-to-date server. And the problem is that this log grows forever, infinitely. We don't like infinite things. So we take periodic snapshots, or full backups. Periodically, the node will run a full backup.
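To make the delivery-versus-apply distinction and the snapshot mechanism concrete, here is a minimal sketch in Go of the application side, using the FSM interface shape of the current hashicorp/raft library. This is not Orchestrator's actual implementation, and the talk used an earlier, pre-1.0 version of the library; the exampleFSM type and its fields are hypothetical stand-ins.

```go
package fsmexample

import (
	"io"
	"sync"

	"github.com/hashicorp/raft"
)

// exampleFSM is a hypothetical state machine. Raft guarantees ordered delivery
// of committed entries; applying them to local state is the application's job.
type exampleFSM struct {
	mu      sync.Mutex
	applied uint64 // last applied log index
	last    []byte // last applied payload, standing in for "my backend store"
}

// Apply is invoked once an entry has been acknowledged by a quorum.
func (f *exampleFSM) Apply(l *raft.Log) interface{} {
	f.mu.Lock()
	defer f.mu.Unlock()
	f.applied = l.Index
	f.last = append([]byte(nil), l.Data...)
	return nil
}

// Snapshot is the periodic "full backup": it lets the log be truncated
// instead of growing forever.
func (f *exampleFSM) Snapshot() (raft.FSMSnapshot, error) {
	f.mu.Lock()
	defer f.mu.Unlock()
	return &exampleSnapshot{state: append([]byte(nil), f.last...)}, nil
}

// Restore is the startup path: load a snapshot, after which the library
// replays any log entries that follow it.
func (f *exampleFSM) Restore(rc io.ReadCloser) error {
	defer rc.Close()
	state, err := io.ReadAll(rc)
	if err != nil {
		return err
	}
	f.mu.Lock()
	defer f.mu.Unlock()
	f.last = state
	return nil
}

type exampleSnapshot struct{ state []byte }

func (s *exampleSnapshot) Persist(sink raft.SnapshotSink) error {
	if _, err := sink.Write(s.state); err != nil {
		sink.Cancel()
		return err
	}
	return sink.Close()
}

func (s *exampleSnapshot) Release() {}

// Compile-time checks that the sketch satisfies the library's contracts.
var (
	_ raft.FSM         = (*exampleFSM)(nil)
	_ raft.FSMSnapshot = (*exampleSnapshot)(nil)
)
```

The point of the sketch is only the division of labor: Raft delivers and persists entries in order, while the application decides how to apply them and how to snapshot its own state.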
And when a node starts up, it checks: do I have a snapshot? Yes? Okay, I'll do a full load. And then: do I have any replication log continuing from that snapshot? If so, it does an incremental restore. If it doesn't have a snapshot or a log, it can get them from other nodes in the group. So it will try locally first, but the others can give it the snapshot and the replication log that it requires. I'm using HashiCorp Raft; that's the engine behind Consul. When I started using it, it was before what is now the 1.0 release. There's been a major design change to the HashiCorp Raft library; I'm not using that, so I'm on an earlier version. It's an open-source library, available on GitHub.

What's Orchestrator? Orchestrator is a high availability and replication topology management solution. It helps us observe and discover our MySQL replication topologies. It allows us to refactor stuff, like move replicas around. And, most importantly for this discussion, it supports failovers, so it maintains the high availability of our servers. If a master dies, or an intermediate master dies, Orchestrator will fail over and fix the topology and get us back live. And Orchestrator is open source. I started developing Orchestrator about four years ago at a company called Outbrain, then at Booking.com, and now at GitHub. It's been open source, Apache 2, and we continuously develop it.

I guess the premise of this discussion is that Orchestrator manages the high availability of our MySQL topologies, so we also want Orchestrator itself to be highly available, because if it's down, it can't fail over our topologies. And until recently, the high availability of Orchestrator was itself based on MySQL servers. How do I put it? Orchestrator is an app that uses a relational backend store, and it used to use MySQL as its own backend. Multiple Orchestrator services would coordinate between themselves to be highly available based on the high availability of the underlying MySQL setup. So for example, you would set up a Galera cluster, Matthias did so, and you would have an Orchestrator node for each MySQL node in the Galera cluster, and high availability would be maintained by Galera.

But we got two interesting incentives not to do that. One incentive is here: Durkan, over here, is to blame. We have a use case where using a MySQL backend is undesired, because of the overhead, because of the memory, because it's just too heavyweight for us to set up a MySQL backend for each Orchestrator node, or to even have two or three MySQL backends just for Orchestrator. That was the first thing. The second was a challenge. I had a discussion with our friend Kenny, who challenged Orchestrator and asked me: how does Orchestrator mitigate data center fencing? That is, a specific single DC goes dark, the network is down. How does Orchestrator know that this DC is down, and not the other one, the one that it cannot see? How does it mitigate fencing?

These two incentives drove us to develop orchestrator/raft, and a lot of other good things happened. I'll illustrate shortly how we deploy Orchestrator and how we run it at GitHub, but we get much better cross-DC availability, and, in two words, it's work in progress, but: Kubernetes friendly, which is also something that we're keen on. An orchestrator/raft setup looks like this. Say we have three Orchestrator nodes.
Each still runs its own dedicated database, but that database doesn't necessarily have to be MySQL. It can be, but it can also be SQLite. I was looking into various potential alternatives to MySQL, and SQLite just made a lot of sense, because it's stable, it's embeddable, it's lightweight, it's small; it really fit our needs. But of course, the SQLite databases cannot communicate with each other. There's no replication mechanism for them, so Orchestrator provides the communication layer, and each Orchestrator node has its own dedicated, private, completely isolated relational backend. Each of these Orchestrator nodes independently monitors our entire topology. They're all probing the MySQL servers: how are you, what's your status, who is your master, who are your replicas, et cetera. And any one of them is able to diagnose a problem, but only one of them, which is of course the raft leader, runs failovers. So only one takes active action upon failure, and I'll illustrate a failover flow shortly.

How do we deploy Orchestrator at GitHub? We have Orchestrator deployed in three different DCs. Right now we have a single Orchestrator node in each DC. We have two major DCs and one in the cloud, so that makes three; it could change, it's not specifically important. We have one Orchestrator node in each DC. We still use MySQL as a backend, just because it's already running and we're good at MySQL, so we're happy to do that at GitHub.com.

We patched the HashiCorp Raft library to include two interesting features. I spoke earlier about Raft in general, but different implementations support different feature sets. The HashiCorp Raft library really doesn't care about the identity of the nodes: all nodes are created equal, anyone can become a leader, and you can't really control who the leader is. Our case is a little bit different: I do want to be able to control the identity of the leader. For example, we have the notion of the active data center. At this very moment, most of our MySQL servers are all located in the same data center, and I would really like to have locality, such that if DC2 is the active one, I want that Orchestrator node to be the active one, because operations would be much faster: if it needs to fail over, it can fail over local to the DC.

So, number one: we implemented raft yield. Raft yield is, assuming the cluster is healthy, assuming raft is happy and up and running: hey, everyone, would you please all yield to this server, such that this Orchestrator node can become the leader? If it's possible. And if not, okay, keep the current leader, but I would really like that one, pretty please. The other thing is the ability for a leader to step down. That's not supported in the HashiCorp library, and I'll illustrate the use case for that shortly.

Another thing is that the HashiCorp library comes with one of two options, either LMDB or BoltDB, as the persistent store for the replication log. And Orchestrator already comes with SQLite embedded, so it's a pity to use yet another storage engine or database. So I re-implemented the replication log on top of relational SQLite. So whether Orchestrator uses SQLite or MySQL as its backend, there's an additional use of SQLite for persisting the replication log.
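To give an idea of what re-implementing the replication log on a relational store involves, here is a minimal sketch of a SQLite-backed log store against the LogStore interface of current hashicorp/raft. This is not Orchestrator's actual code; the sqliteLogStore type and the raft_log table are hypothetical, and it assumes the mattn/go-sqlite3 driver.

```go
package logstore

import (
	"database/sql"

	"github.com/hashicorp/raft"
	_ "github.com/mattn/go-sqlite3" // SQLite driver
)

// sqliteLogStore persists the raft replication log in a relational table.
type sqliteLogStore struct {
	db *sql.DB
}

func newSQLiteLogStore(path string) (*sqliteLogStore, error) {
	db, err := sql.Open("sqlite3", path)
	if err != nil {
		return nil, err
	}
	if _, err := db.Exec(`create table if not exists raft_log (
			log_index integer primary key,
			term      integer not null,
			log_type  integer not null,
			data      blob
		)`); err != nil {
		return nil, err
	}
	return &sqliteLogStore{db: db}, nil
}

func (s *sqliteLogStore) FirstIndex() (uint64, error) {
	var idx sql.NullInt64
	err := s.db.QueryRow(`select min(log_index) from raft_log`).Scan(&idx)
	return uint64(idx.Int64), err
}

func (s *sqliteLogStore) LastIndex() (uint64, error) {
	var idx sql.NullInt64
	err := s.db.QueryRow(`select max(log_index) from raft_log`).Scan(&idx)
	return uint64(idx.Int64), err
}

func (s *sqliteLogStore) GetLog(index uint64, log *raft.Log) error {
	var logType uint64
	err := s.db.QueryRow(
		`select log_index, term, log_type, data from raft_log where log_index = ?`, index,
	).Scan(&log.Index, &log.Term, &logType, &log.Data)
	if err == sql.ErrNoRows {
		return raft.ErrLogNotFound
	}
	log.Type = raft.LogType(logType)
	return err
}

func (s *sqliteLogStore) StoreLog(log *raft.Log) error {
	return s.StoreLogs([]*raft.Log{log})
}

func (s *sqliteLogStore) StoreLogs(logs []*raft.Log) error {
	for _, entry := range logs {
		if _, err := s.db.Exec(
			`replace into raft_log (log_index, term, log_type, data) values (?, ?, ?, ?)`,
			entry.Index, entry.Term, entry.Type, entry.Data,
		); err != nil {
			return err
		}
	}
	return nil
}

func (s *sqliteLogStore) DeleteRange(min, max uint64) error {
	_, err := s.db.Exec(`delete from raft_log where log_index between ? and ?`, min, max)
	return err
}

// Compile-time check against the library's LogStore contract.
var _ raft.LogStore = (*sqliteLogStore)(nil)
```

The library also needs a small StableStore for metadata such as the current term and last vote; the same SQLite file can serve that role, which is what makes it possible to avoid shipping BoltDB or LMDB alongside the embedded database Orchestrator already has.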
Okay, so far so good. I'd like to illustrate a high-availability scenario. This is a real scenario that I ran, well, in testing, while developing orchestrator/raft. We have a single replication tree. We have this master and, locality-wise, it turns out we have three Orchestrator nodes, O1, O2, O3, and O2 happens to be the raft leader, because I have this cron job that periodically says: yeah, pretty please, if possible, I want the leader to be local to the masters.

And the first thing I do is kill -9 the master. Now, Orchestrator is expected to run a failover. It knows how to do that; that's what it does. All the nodes are probing all the servers; they all know that there's a failure scenario. But only O2 is the leader, and only O2 is about to take action, because only the leader runs failovers. It's about to run the failover, it really wants to run the failover, but then I do a nasty thing: I go to the backend database of this Orchestrator node and I just drop the schema. I drop database orchestrator. It's a bit nasty, right? I just killed its ability to communicate with its own backend store.

What happens? Orchestrator immediately recognizes that something is going wrong. It's unable to run its self-health checks. Its self-health checks attempt to write, like, a heartbeat to its own backend database. It recognizes that it's unhealthy. Now, the raft setup is completely healthy. There's nothing wrong with raft, and that's where the step down is so important. It's the ability of the application to tell raft: hey, I don't want to be the leader anymore; even though raft itself is up and running, I'm feeling unwell, I want to step down. So it takes five seconds for Orchestrator to step down, and meanwhile the master is dead; we need to fail it over. Within five seconds, O2 freaks out and steps down, and the raft mechanism takes care of the rest and promotes one of the other Orchestrator nodes to be the leader. Let's say it's O1. O1 grabs leadership. It already knows the MySQL master is dead; it has known so for a few seconds, but it couldn't take action because it wasn't the leader. It is the leader now. It runs the failover. The failover takes place. Production is happy. At this point, we're basically happy.

But let's pursue this further. The failover took place. What happens to that node? Sixty seconds later, it completely freaks out. It says: I'm not just unhealthy, I don't know what to do with myself. It bails out, it panics, the process dies. A few minutes later, Puppet kicks in. Its job on this box is to make sure the orchestrator service is up and running, so it kicks the service up. The service starts. It says: oh, I don't have a backend database. Let's create a new database, let's create the tables. All right, now I'm up. Oh, hey, I'm configured to be a member of a raft group. Do I have a snapshot? Yes, I do. Let's apply it. Do I have some replication log? Yes, I do. Let's apply that. Then it speaks to the rest of the group: hi, everyone, could you please send me the rest of the replication log, because I'm at index one-two-three and I don't know what comes next. And those two servers, O1 and O3, give it the rest of the replication log, and then it joins the raft group. But a few minutes later, something else happens. My cron job kicks in: oh, the master is here; you know, I would really love this Orchestrator node to be the leader now. Are you healthy? Is everyone happy? Yes, we're all happy. Great. Please become the leader.
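The key mechanism in this flow is the self-health check driving a voluntary step-down while raft itself stays healthy. Below is a minimal sketch of that idea in Go; writeHealthHeartbeat, stepDown, and the node_health table are hypothetical stand-ins, not Orchestrator's actual names or API.

```go
package health

import (
	"database/sql"
	"log"
	"time"
)

// writeHealthHeartbeat stands in for Orchestrator's self-health check:
// a write against the node's own private backend (MySQL or SQLite).
func writeHealthHeartbeat(db *sql.DB) error {
	_, err := db.Exec(`replace into node_health (anchor, last_seen) values (1, ?)`, time.Now())
	return err
}

// stepDown stands in for the patched raft library's "leader step down" call.
func stepDown() {
	log.Println("stepping down from raft leadership")
}

// leaderHealthLoop runs only while this node is the raft leader. If the node
// cannot write to its own backend for long enough, it yields leadership so
// another node, whose backend is healthy, can run failovers.
func leaderHealthLoop(db *sql.DB, interval time.Duration, maxFailures int) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	failures := 0
	for range ticker.C {
		if err := writeHealthHeartbeat(db); err != nil {
			failures++
			log.Printf("self health check failed (%d/%d): %v", failures, maxFailures, err)
			if failures >= maxFailures {
				// raft is perfectly healthy; this node just can't operate.
				stepDown()
				return
			}
			continue
		}
		failures = 0
	}
}
```

In the scenario above, this is the piece that gets O2 out of the way within roughly five seconds so that O1 can run the failover.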
So we injected two errors, and we got a very nice automation which fixes the MySQL side, fixes the Orchestrator side, and everything is up and running again.

I'd like to illustrate another flow. We have three DCs: DC1, DC2, DC3. Let's agree that the master is in DC2 and the active Orchestrator node is in DC2. I'm going to network partition that data center. I didn't actually do that in production yet; maybe next time, next year. So DC2 is the active data center, where the Orchestrator leader is. What happens if we network-isolate the entire data center? Let's see how it looks to this Orchestrator node. In this node's view, the master is alive, there are a few replicas that are running, and a bunch of other replicas in other data centers that are dead. But there's no need to fail over; the master is alive. Thankfully, this Orchestrator node is similarly isolated from the rest of the Orchestrator nodes, just as the MySQL master is isolated from the rest of the replicas. It used to be the leader; it's no longer the leader. It's not part of the quorum, so the raft protocol will demote this server. Orchestrator will see that raft says: you're not the leader anymore? Oh, I'm not the leader. Then I'm not calling the shots. I'm not going to decide whether we do failovers or not. We're just going to sit by and do nothing. Meanwhile, what DC1 and DC3 see is that the master is dead, a bunch of replicas have died, these others are still alive, and the Orchestrator node in DC2 is gone, but we are two out of three. We have quorum. We are up and running. Either DC1 or DC3 grabs leadership; let's say it's DC1. It fixes the topology, it runs the failover, because we have correctly concluded that DC2 was the one to have been network-isolated, as opposed to any of the other DCs. Whatever happens later, when DC2 comes back live, is a different matter. Production is happy at this time. Thank you.

Another nice thing, and part of this is work in progress: our use case is a bit special. We have Consul running in the different DCs, and those Consul setups don't talk to each other. Every Consul is local to its own DC. Work in progress is Orchestrator/Consul/proxy-based failovers, or high availability, such that upon failover Orchestrator tells Consul: hey, this is the new master for this cluster. And Consul, through consul-template, will update our proxies and provide the service discovery for the clients. Now, the Consuls in the different DCs don't talk to each other, but the fact that Orchestrator is deployed in each DC enables us to locally update each Consul in its own DC, through the raft mechanism. So that's nice.

How am I doing with time? All right. I'll just leave you with a few considerations, and we can take this offline outside if anyone wants to know more. Just consider a few weird, crazy scenarios. Eventual consistency is not your best friend, really. What happens if we have a failover from master A to master B, and then master B also fails and we fail over to master C? Orchestrator is up and running and everyone is happy. Some time later, one of the Orchestrator nodes goes down, just because, and we bootstrap a new, empty one, and it reads the snapshot. But the snapshot was taken long before A died. Anyone see where I'm going with this? It starts reapplying the logs and it says: oh, you know what, we just failed over to B. But wait, the actual situation is that everyone knows that C is the master, and here comes this Orchestrator node which thinks we should fail over to B.
Let's update Consul, let's tell everyone B is the master. So there are a few considerations to look into, and there are a few ways to mitigate that. Orchestrator heuristically requests a fresh snapshot shortly after a failover, which is a lightweight operation, so why not? There could be other solutions to that. If anyone is further interested in understanding some of the considerations of using Raft, please talk to me outside.

On the roadmap is Kubernetes. One of the really nice things right now with Orchestrator is that you can run it on SQLite. And I know some people are running Orchestrator on SQLite with the in-memory mode, so SQLite doesn't even persist the data, such that in a Kubernetes setup, one of the Orchestrator nodes can go down, Kubernetes will figure this out and bootstrap a new pod, which kicks in a new Orchestrator node that is completely empty, because it's out of the blue, and that Orchestrator node is able to talk to the other nodes in the group, get a snapshot, get the replication log, and kick in. So that's very nice. And I'm happy to take questions. Thank you.

Do I have time for questions? Yes. Orchestrator guarantees the promotion of the best available replica out of the set of replicas. Replication guarantees the consistency, if you use semi-sync or GTID or binlog servers or whatever, depending on how far ahead the replicas were. That's a little bit out of the control of Orchestrator; it's more about MySQL replication. Yes, sir? Semi-sync, are we running semi-sync? We're looking into semi-sync right now, and I'll publish a blog post that illustrates that. Yeah, so ideally, ideally, we would promote, the ideal is to promote the replica that is most up to date. There are various considerations for not doing that, but Orchestrator is able to reparent replicas, even in two steps, to get the most out of the situation.

Okay, it's a configuration variable. The question was what happens if the master dies and all of the replicas are really lagging. There's a configuration variable to say whether this is okay and you prefer to promote, or you want to not promote and avoid losing data.

Last question, sir? No, okay. Who do the clients talk to? No, Orchestrator is in the back end. It's a big question. Either you would use DNS-based discovery, and Orchestrator would be the one to tell DNS: oh, hey, this has changed. But we're looking into everyone talking to a proxy, and Orchestrator would tell Consul: hey, the master has changed. Consul would update the configuration, or whatever the setup is for the proxy, and the proxy would kill the existing connections, if any, and redirect new connections to the new master. That's what we're hoping to get to. Thank you so much.
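As a small postscript to that last answer, here is roughly what pushing the new master's coordinates into the local DC's Consul KV store could look like, using the hashicorp/consul/api Go client. The key layout (mysql/master/...) and the publishMaster function are illustrative assumptions, not necessarily Orchestrator's exact keys; consul-template watching those keys would then rewrite the proxy configuration.

```go
package discovery

import (
	"fmt"

	"github.com/hashicorp/consul/api"
)

// publishMaster writes the promoted master's coordinates into the local DC's
// Consul KV store. Since each Orchestrator node runs local to its DC, each
// node can update its own DC's Consul independently.
func publishMaster(clusterAlias, hostname string, port int) error {
	// Talks to the local Consul agent (each DC has its own, isolated Consul).
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		return err
	}
	kv := client.KV()
	pairs := map[string]string{
		fmt.Sprintf("mysql/master/%s/hostname", clusterAlias): hostname,
		fmt.Sprintf("mysql/master/%s/port", clusterAlias):     fmt.Sprintf("%d", port),
	}
	for key, value := range pairs {
		if _, err := kv.Put(&api.KVPair{Key: key, Value: []byte(value)}, nil); err != nil {
			return err
		}
	}
	return nil
}
```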