 Thank you All right, so this talk is going to give kind of a user overview of at CD If we have time I can go more into the computer science these side of how it CD works and that sort of thing but this will probably primarily be about the applications of at CD and then the The basics of how it operates so that you as a system administrator wanting to use at CD How have a good idea of the concepts at play? I'm the CTO and co-founder of CoroS. I've been a systems engineer for a long time I spent a time at Rackspace and Suza and now most most recently at CoroS All right, so at CD is a very Has a very clever name the idea was that we needed to have a place to store configuration for clusters of machines That was resilient to individual machine failures so How it got its name was we wanted to make slash Etsy the place that we store configuration on a single host distributed over multiple hosts You notice the extremely clever combination of terms there So at CD has a few properties that are of interest It's open source software. It's failure tolerant so a number of machines within an at CD cluster can fail and the system Continues to operate normally without human intervention It's durable everything that goes into the at CD key store Gets written to disk in a right-hand log file that can be backed up and restored later It's watchable so clients of at CD can get Notifications of changes and it's exposed via an HTTP interface. So at CD is a key value database store But it can be Accessed from any programming language that can speak HTTP In fact people have written things like master election systems using bash and curl. I wouldn't suggest it But it is absolutely possible And then the the other important property is it's runtime reconfigurable There have been systems like at CD in the past, but one of the primary downfalls of those systems is that they Didn't have a underlying consensus protocol that allowed for Reconfiguration of the members of the cluster without taking a downtime of the of the database than the cluster itself And the entire point of this whole darn exercise is to make something that doesn't fail on the face of Machine failures. So it's important that we are able to remove and replace machines without making the cluster go down The data store API is very straightforward It's what you'd expect from a key value store So you can use the get verb in HTTP to get keys out and you can use the put verb to put keys in And you use the delete verb to delete keys from the key store I'm a couple of interesting things though about the key store is that there's compare and swap Or compare and set and compare and delete and we'll kind of cover that in a bit But the basic idea here is that with at CD you're able to safely take take locks and mutexes over In a distributed system and that's that's the core tenant is that I'm able to do something similar to a mutex or a file lock on disk Only it works across a host of machines and that's a safe operation All right, so this is how an xcd cluster looks in normal operation You have in general you have five to seven members of the cluster that actually actively participating in the consensus You can have more members of the cluster that are Not participating consensus, but the general idea is that you have a stable core Google has a system similar to this called Google chubby and when Google operates their chubby systems in each of their clusters. They generally allocate five to system five machines within a few racks that are the chubby cluster and those machines are carefully managed because Like I said, this this system is designed for storing really important configuration data And so what are sort of the applications? Why? What sort of configuration data are we talking about? One of the first reasons that we built at CD was we wanted to build a system called locksmith and the idea with locksmith is that core OS has a automated update system and We have a few things of how the update system works That makes it safe and atomic and has a rollback property, but we wanted to free the system administrator from having to manage the rollout of updates across their fleet of machines, so You can imagine. I think we've all written the for loop for For sequence one to 100 whatever the size of our cluster is SSH and app get update reboot and log back in check make sure it's healthy We've all done that at some point in our careers But with locksmith and with at CD the idea is that we set a simile for size so we say it's safe in my application of two machines rebuted at a given time and locksmith takes care of actually rebooting the machines as As updates are applied to the host so the update cycle is actually fully automated and it's something that is safe because you're relying on a consistent locking service to take care of ensuring that the Machines are safely acquiring locks releasing locks when they come back from their from their reboot So in core OS similar to really good hardware like like Cisco routers and stuff we have a and B partition and When you're running the a partition of core OS We're atomically updating or we're updating the B partition in the background and then we atomically switch Over a reboot and we check the health of the machine across the reboot and roll back to the old version if it's not healthy So this is good hardware. This isn't how your your routers at homework Because the manufacturer didn't want to spend the extra 15 cents on an additional flat ROM So essentially the idea is that you you get that update you reboot and you're on the next version And the algorithm that locksmith takes essentially is I need a reboot Decorate the semaphore that's held in at CD reboot the host and Then once you come back across the reboot and you're running the version of core OS unlock The semaphore from at CD and that's the basic operation another application for at CD is a Scheduling systems, so who's who's heard of kubernetes? Few people in the audience Another scheduler is fleet that we built There's other schedulers that don't use at CD such as mazes Who's who's worked at Google or worked with any scheduling system in the past just in general? Okay, so I'll give it just a really fast overview of what a scheduling system looks like So like all good computing systems it begins with you. His computer should be helping humans be better And you talk to this scheduler API so imagine that I have an HTTP workload some type of HTTP server I tell the scheduler API I want to have a hundred of these running inside my infrastructure I don't care where these things land They have these requirements for RAM and CPU and disk But I want a hundred of them running and I want those behind the load balancer So you describe this in some sort of document with fleet or kubernetes. It's over an HTTP interface You describe this in some sort of JSON document and you say Scheduler make this happen. Now. This is really important data, right? This is how you want your this is how you want your cluster configured And so that's why you want to sort in at CD if a single machine fails within your cluster You want the cluster to still be running a hundred of these things and so that's why you want to use a cluster data store like at CD to store it and There's essentially master elections that are happening too when you make this This decision because this this description of the work that you want to have done goes into the scheduler and the scheduler Essentially is making a master election decision on those zero through one hundred Jobs that you want running in your cluster because it's saying it's going to land on Machine a machine P machine C machine a gets for the jobs machine B It's 20 of the jobs machine C gets et cetera et cetera and you need to have an atomic way of saying This is the decision that's been made by the scheduler and this is how the cluster should be configured all right so right essentially how I've described it the the the cluster scheduling workflow is Right desired working to et CD Agents on each individual host pick up that work and then the agents report Whether that's running or not to et CD using a leader election There's other applications for services like for et CD. There's a HTTP load balancer called Vulkan that uses et CD and it's being used by a company called mail gun and mail gun uses it to load balance across all their API servers and It's you can think of it as alternative to something like something like ELBs Other things that can be used for et CD or configuration file writeouts, there's a DNS server backed by et CD called sky DNS and then also somebody's prototyped out using et CD is a data store for Get ref heads. So the the idea there is that you could safely have a globally replicated atomically updated get repository that uses et CD to store the refs. So You would have essentially a multi master Et CD or multi master get server, which is a really interesting use case of this stuff, too All right, so we're going to go through and explain how a leader election might work using et CD And I'll use two features of et CD the first is TTLs So the idea that a key can have a time to live of say 600 seconds And then et CD will delete that key and then the idea of an atomic operation. So I want this I want to set this key only if it's at a particular version or at a particular state So This is probably I probably should have chosen a better URL scheme But you can imagine that you have some type of cluster ID of 60 a and then you have a machine ID of f1d and you're wanting to put in and register the the the URL of this host Into the cluster and so you create a key entry and this key entry has a number of fields In that CD the first and most important is the idea of an index and this is a monotonically increasing version number And it's shared across the entire key space So every time you make a modification of the key space the the version number increases and each key has the Version number of the last time it was modified And then the key itself obviously and then the value of the key so in this example we have a Scheduler Like the cluster scheduler that we talked about earlier and this scheduler is master-electing itself And the reason you want to have a leader election on a scheduler is because you only want one entity within the cluster to be making decisions as much as like As much as we like to think about organizations as being meritocracies a lot of times It's just easier if you leader-elect one person to make the final decision on everything Particularly in computing systems. So in this case The machine 3 which is the value field has taken the leader Key which is scheduler and it started Taking that key at version 18 of the key store and it has an expiration time of September 18th at 2 o'clock Okay, so this process which is the yellow square It's talking to this three-member at CD cluster the blue and two red nodes the blue node being the master Are the leader of the of the at CD cluster and it's writing in and updating With a compare and swap. All right update the TTL on this thing from index 18 And register me as the thing and then again before the TTL times out It'll do this compare and swap again. It's saying I know the current index is 30. I want to Reassert myself as the leader So update this key again with a new TTL at for machine 3. So this again increments the version number And then at some point that the the machine or the VM or whatever that was running the scheduler exits power failure disk failure CPU failure, etc. And now we rely on this expiration countdown To Finally delete the key so at CDs making internal sync calls then The key gets deleted and removed from at CD at this point other machines who are Available to be the scheduling process. They're able to come in and make a leader election decision And so what they do is they do a create and a create only succeeds if that key doesn't exist yet and that key no longer exists because the Because it was deleted through at CD so this this other scheduler on machine 5 says all right create me as the leader and I'll I'll I'll take care of all the tasks of being the scheduler for this cluster moving forward All right, so the The basics of at CD as far as how it operates you have this leader and leader and follower architecture And when you first bring up an at CD cluster everyone is a candidate So the internal algorithm for at CD is called raft and you can think of it as democracy as a as a Algorithm and so everyone is a candidate and then everyone puts out proposals hey vote for me and then once somebody gets a majority of the People voting for them. They become the new leader And we have a service called discovery at CD IO that allows you to easily bootstrap these machines because you have this problem of I have five or seven machines and I may not know the IP addresses of these machines beforehand Imagine that I'm using AWS or something and I want this cluster to come up automatically I don't know the IPs of those machines when I start When I when I talk to the AWS API, so we give you this token Well six EA and this token is a URL that you place into at CD and say hey as the at CD Machines come up register self here, and then you'll use that metadata as The initial bootstrapping so this five machine cluster say spun up on AWS Each of the at CD members registers themselves to the discovery service They get all this information and then once they've hit say five machines in this case They will they'll do their initial leader election and then you get a fully bootstrapped server and The discovery service really isn't anything special. It's just another at CD cluster with a little bit of magic in front of it So the discovery service is using at CD so that when Inevitably AWS nukes one of our discovery machines The service remains available And then while I was making this diagram one of the things that I thought of after I made it was Hey guys So everything at CD goes through this right ahead log and so each of these is a log entry Similar to it's essentially a modification on a key And these are the indexes and so at CD has a few properties of being a sequentially consistent Store and so when you modify a single key with an at CD Every member of the cluster sees that modification in the same order as the the leader who accepted the right So you never see dog and then cat you always see cat and then dog with an at CD And one of the important things to remember is that at CD isn't Hasn't defeated physics. We're no faster than the speed of light. Unfortunately. I'll let you know I think that's coming in 3.0 of at CD so you have to always think of things in at CD as a behaving in in a system where there's always latency and so You could imagine that you've written at 10 o'clock the the state that One of the keys is set to dog But you read from another another member and it still says cat. This is a problem But if you always rely on the version numbers instead of real time You'll always get the correct answer because at version 2 of the data store. It's always going to be dog No matter what happens So what you can do is you can actually you can wait You can make a blocking request saying out. Don't give me the result of this Key until until the key store is at version 2 So it'll block on that request and then return dog once the request once that member of the cluster has gotten that state And then you can do interesting things like Do quorum gets so that you always get the most up-to-date Value based on the quorum of the cluster So in this case Even if that one member is out of date It doesn't have that second log entry a quorum get from that member will always return dog because it'll force it to ask a Simple majority of the cluster for the answer before returning to use the user And then the other property is that you can do these HTTP long poles to Wait until changes happen in the key store, which is really nice for Essentially doing I notify things but in a distributed system All right, but one of the things to be aware of is that everything has a end and The event history at some point will be truncated within at CD Because we can't store history forever discs aren't Unlimited memories and unlimited etc. So you have to design for that sort of failure and then the last bit is at CD is designed for machine failure as we talked about so in this five-member cluster if we lose a single Follower, that's fine. If we lose a second follower, that's fine But you should probably be paging somebody because you're about to have a bad time And once you lose a third member of the cluster at CD is no longer available for rights It can optionally be available for reads, but you're not able to make changes to the key store And it's also tolerant to leaders Having hardware failures are being turned off However, it will be temporary when available. So if this one member goes away, that's fine If the leader goes away, then the cluster halts until it's able to do a new master election Which is typically very fast and usually under a second depending on how your network and discs are configured but after the temporary temporary unavailability the cluster will make a new decision and Elect somebody and then they'll build the cluster will be available again Yeah, we've made a number of mistakes You always make mistakes from building software and so some of the things that we fixed in at CD2 Which is our upcoming release is adding check summing to protect against discs having problems We found that a lot of people accidentally misconfigure software Who knew that people would accidentally misconfigure software? And so we've added a bunch of protections to ensure that misconfigurations are much less fatal essentially putting a UUID on everything every member every piece of end of data within the cluster and we've also found that Doing an F-sync in on cloud On cloud disks is extremely slow and we've had to handle and document all the cases where that's the case All right, the final thing is a plug that I have a bunch of other talks Coming up at LCA. I have an introduction talk on Wednesday I'm doing a couple meet-ups and then have a tutorial on Friday. All right So I want to thank you. I want to say that we like pull requests and that the project came to ground at github.com slash coro s slash at CD Yeah, so I'm happy to take questions. I also have way more slides, but I'll leave it there So we have a break now into all quarter past if you want to ask some questions while there's a break go for it I was wondering how you might compare at CDD to some other distributed key value systems like Sambar's CDDB Is it doing aware of CDB or you know, how it compares? so The primary thing is that at CDD is a consensus backed key value system and the primary use cases for doing things where You need to have a consistent view of the data at all times so the sorts of applications where if If you ever had any consistent data you'd be unhappy so the DNS load balancers Scheduling decisions these sorts of things look using them as locks for reboots And so the only other systems that have these properties are things like zookeeper And then another project called console The zookeeper has a number of problems primarily the no runtime reconfiguration and and the consensus protocol has Hasn't really been that well proven from the computer science standpoint. Whereas the raft protocol that we use is Got a lot of Literature behind it at this point if you end up with a network partition and Two masters when the network comes back together. Is there any automated way of handling that or is it up to you? So yeah, you in the face of a network partition Imagine that you have a five-member cluster if there's the weak side of partition has two members Those members will continue will operate and the strong side of the partition They'll do a master election and continue operating. There's no play There's no way to do a split blank sprit Split brain within Etsy D because we require everything to go through a simple majority of the cluster so it's always the strong side wins and then if you have a In ways network partition then you're kind of out of luck and the system remains unavailable until the network partition resolves Is the order discovery documented such that I can run my own order discovery instance Yeah, so auto discovery essentially just uses a second at CD key store and we provide documentation If you want to run your own for discovery, it's mostly a convenience You can also do a manual bootstrap where you provide the IPs on the command line But the the cluster bootstrap we felt like it was a high enough barrier for most people that we wanted to provide a public Service just to make it easy. It's completely optional. Nothing actually references at CD.io internally in the code base Trivial question about locksmith. Can you increase the size of the semaphore runtime? Yeah, so there's a command line tool called locksmith CTL that allows gives you the ability to list out the machines That are currently holding locks and then allow you to bump or on force unlock The semaphore Is it always five Instances like it could be anything that you'd like you run a single member on your laptop I usually run five on my laptop just for testing of the internal communication, but generally you want to odd number so that you're actually Getting an additional fault tolerance because if you have an even number you it's always three out of four need to vote And it should be Generally, it's it's like five to nine is the safe thing The reason that Google uses five for example in their in their chubby system is because if that allows for one planned outage And one unplanned outage and then allows for the cluster to remain available People do choose like seven. I think it's kind of not necessarily to go much higher than five but for example the discovery cluster has been running on five and Today we got our first outage in almost six months And so it's it's been fine The outage was caused by it actually was a Only a reed site outage and it was caused by a bug in our rate limiting proxy so that we don't get abused by public users So we were still at CD was running fine. Everything was fine. It was a single VM failure it was just an unfortunate side effect of how the Discovery at CD IO proxy was written on which is like a 500 line go program And so had nothing to do with that CD is the rate limiting software in front of that CD. I Suck at coding. Sorry Okay, I think