So, good morning, everyone. Welcome to day two of Open Source Summit. You are currently in "What You Should Know About etcd v3." My name is Elsie Phillips, and this is my partner in crime, Paul Burt.

So, let's first address: what is etcd v3? etcd is a fast, reactive, modern JavaScript framework. I'm totally just kidding. As you probably know, etcd is central to a lot of what we work on at CoreOS, which is distributed systems. It arose out of the need to reliably coordinate configuration and state in that environment. etcd is a fully replicated, highly available, consistent key-value store.

Now, for those of you who aren't familiar with those terms, we'll quickly break them down. Fully replicated means that the entire store is available at every location in the network. Highly available means it's designed to avoid single points of failure, so that hardware and network failures don't cause interruption or degradation of service for the end user. Now, if you're familiar with the literature, you might contest us calling etcd highly available because of the CAP theorem. But what we mean by highly available is that people can always read from it, and availability of reads is important for things like load balancer configuration and service discovery. Finally, consistent means that every read returns the most recent write.

So, how does etcd work? It's based on the Raft algorithm, and it's centered around three key concepts: leaders, elections, and terms. Each cluster is a healthy little micro-democracy that elects a leader for a given term. That leader pushes updates to the followers. If the leader is killed, which would be very tragic, the cluster will elect a new leader and things will carry on as normal.

Since its introduction, etcd has become an integral part of cloud-native systems, systems for running Google-like infrastructure. It's part of projects like Kubernetes, where it's the primary data store, Cloud Foundry's Diego, networking solutions like Canal, and many others. It's pretty popular. It has a decent number of stars on GitHub, which, as we all know, is the most important metric for measuring open source projects. It's used in production by many Global 2000 companies, and many companies also contribute to it.

So, just to give you a quick recap of the project: it's about two years old. We released the first stable version, etcd 2.0, in early 2015, and last year we rolled out etcd v3.

So, why are we talking about it now? You may remember that last year our CTO, Brandon Philips, did a number of talks about etcd. At large volumes, etcd2 experienced issues handling large snapshots. To keep scaling and ensure etcd remained the community darling for reliable distributed key-value storage, we put in some work, and many of Brandon's talks showcased our progress. Late in 2016, we completed work on the initial release of etcd3. That increment in the major version number is significant, if you're a fan of semantic versioning. The slide above is a screenshot of Brandon highlighting some of these improvements: more efficient communication, a revamped API, and significantly improved storage performance.

So, our work paid off. The time we put into etcd3 helped Kubernetes scale from 1,000-node clusters to 5,000-node clusters. Kubernetes even adopted etcd3 as its default storage option this year. The community responded positively. Well, some of the community loved the work we were doing.
Other parts were less than enthusiastic about adopting new APIs and communication protocols. As Michael Hausenblas, the creator of the Kubernetes backup utility ReShifter, puts it: things are generally well documented, but there are some rough edges. So, the goal of this talk is to help folks where they might have stubbed their toe, so to speak. Help them find those rough edges and avoid them. So, I'm going to now pass this over to Paul.

All righty. Let's break some stuff. So, before we dive into the common mistakes most people seem to make upgrading to etcd version 3, I'll just preface this by, well, first I'll increase the font size on this. Most of these demos are going to run inside of a Docker container, like this. So, we go inside, and then we start the etcd process. That little pretzel at the end, the ampersand, is basically just telling bash that I want this in the background. We'll see a lot of this boilerplate, because we go through it pretty regularly. And now we're free to issue commands to etcd. So, this is the context we'll be operating in.

Cool. So, the first question everybody has when they leave their apartment is: where did I leave my keys? And that's the first question most people seem to have when they upgrade to etcd 3 as well. So, let's take a look at that. It's a little bit difficult for me to type with the projector behind me here, so I'm going to do this by video, just for convenience.

So, as you can see, we're starting containers, and we're using etcdctl, the CLI for etcd. We're getting an error now. (That's kind of something that seems to be going in and out. Can everybody hear me if I shout? Use the mic. Awesome. I'll do my best. Sorry about the sound.) Anyways, as you can see, what's happened here is that in etcd version 3 we use the put method instead of the set method to store our key. And this is actually something of a convenience. It's a clue to your developers, if you've moved over to version 3 of the etcd API, that they should put their data in using this new command. They will get an error if they try to use the old command, which is hopefully exactly what you want, if you've got this environment variable, ETCDCTL_API, set correctly.

Going back to our slides here, the key takeaway is that etcd has two main namespaces. With etcd version 2, you used to be able to curl the endpoints and access things in a RESTful manner. etcd version 3 still has that. But etcd version 3, for the sake of speed and performance, has adopted gRPC, Google's RPC protocol, which uses protobufs plus HTTP/2. It's a lot more efficient and allows us to multiplex. And that has a different namespace than the old etcd2 namespace. So if you stash your keys in using curl and you try to pull them out using gRPC, you're going to have headaches. So try to avoid that. gRPC is the new goodness.

Cool. And you may have noticed that the API looks like it changed a little bit. So let's explore that. Did it change? We're doing this dance again. This is a relatively quick check: we're going to take a look at the etcd API using the help command, first with the environment variable set to the old API version. You can see there are a couple of commands here; set is in there, as we experienced earlier. And if we execute that same command with the environment variable set to 3 and scroll up, you can see there are a lot of new commands available in this new etcd client. This is part of the work that went into etcd3, and it's why this is a major version increment.
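Putting the pieces of that demo together, here's a rough sketch. The image tag is an assumption (as is the image shipping a shell), and mykey is just an illustrative key:

docker run --rm -it --entrypoint /bin/sh quay.io/coreos/etcd:v3.1.8
etcd &                                   # start etcd, backgrounded by the "pretzel" (&)
ETCDCTL_API=2 etcdctl set mykey hello    # v2 API: set works
ETCDCTL_API=3 etcdctl set mykey hello    # v3 API: errors, because set is gone
ETCDCTL_API=3 etcdctl put mykey hello    # v3 API: put is the new verb
ETCDCTL_API=3 etcdctl --help             # lists the much larger v3 command set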
We've moved to a new MVCC style of working with objects in the database, which allows us to do cool things like transactions and a lot of other stuff. In the next section we're going to dive into the other things you might stub your toe on using this new API.

And a quick note before we jump to that section: if you actually want to proactively prevent developers from storing things in the old etcd2 namespace, when you're starting up your etcd process and you want to completely disable v2 because you have no need for it anymore, absolutely set the --enable-v2 flag to false. That will prevent people from storing anything in there, even if they're invoking the correct commands. This is, I don't know, the equivalent of enjoying static type checking in your programming languages: it keeps people honest when they're using your etcd cluster.

Cool. The next thing we'll dive into is that the world is now flat. etcd version 2 used to be hierarchical. That meant you could have a tree structure, with branches and leaves fanning out, and there was a handy command you could use to collect everything that rested underneath a node. Let's take a look at that and see how the new commands work for accessing data.

Cool. So we're firing up our command line here, and we're going to set some values in the old branching structure. So cartoons/rm, where rm stands for Rick and Morty. We'll add another cartoon into the directory, just to give it some companionship, so we'll set The Venture Brothers here. And we'll go ahead and retrieve those keys in the old etcd2 style. The ls command will reveal everything that's on your current level, and if we go one deeper and ls the cartoons directory, the same way it works at your bash prompt, you'll see all of the values that are in there. But when we switch our API to version 3, as you can see, we run into problems, because of that new flattened key space. So we have a solution. One thing to note before we move on to that solution: this directory-type structure is gone, but you can emulate it in the new flat key space if you so desire.

After thinking about these usage patterns, it became apparent to the team that the hierarchical structure, with all the stuff that comes along with it, was less efficient than a flat key space. The flat key space reduces overhead, increases the efficiency of our queries, and just makes search a lot easier. So let's take a look at that, using queries based on prefixes and ranges.

Alrighty. We're going to set some keys again as we fire up this demo. We want some keys that are somewhat related to each other, so we're going to choose the Star Wars trench run scene and grab some quotes from that. So we've got Red One, reporting in; Red Two, standing by; and finally, let's add something for Red Leader: lock your S-foils into attack position.

Cool. So we've got our keys, and now it's time to query them. Let's do a range first. We'll do a search from red1 to red3, and that returns values for us. This range is inclusive at the front, meaning the first key, red1, is included in the query, but exclusive at the end, meaning the end key, red3, is excluded. So if we had set the end of our range to red2, we would be excluding data that we want. This is pretty normal, but sometimes it can catch you off guard. It's what a lot of people call a half-open interval.
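Here's a rough sketch of that range demo; the exact key spellings are illustrative:

ETCDCTL_API=3 etcdctl put red1 "Red One, reporting in"
ETCDCTL_API=3 etcdctl put red2 "Red Two, standing by"
ETCDCTL_API=3 etcdctl put redleader "Lock S-foils in attack position"
ETCDCTL_API=3 etcdctl get red1 red3   # half-open range [red1, red3): returns red1 and red2 only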
You may notice that we're actually missing Red Leader here, and he's a pretty important part of the squad. So let's query the key namespace in another way, by searching for a certain prefix. Now we're getting the entire Red Squadron by searching for the prefix red, and that's exactly what we want. So ranges and prefixes are super cool; I highly recommend using them. This is the replacement for ls.

Another cool feature enabled by this new move to the transactional key space is that we can actually track history for keys that get entered into etcd. This is really handy for debugging, or even just looking into your programs and seeing how a key evolves over time, tracking behavior. So let's take a look at that. Once again, we'll need to set up our cluster. What we're going to do here is change one key a number of times. So we're going to meme it up and use silly pet names: doge is first, snek is second, birb is third. We've got our writes all under the changed key, and we can see that birb is the last change. Now, when we add this rev flag, we're actually searching over the revisions. We get some initialization at the start, where nothing has happened yet, and as we step through the revisions we start to see values: it was doge first, snek second, and birb last. And if we were to continue querying like this past the end, we would get an error from etcd saying you've reached the end of the list and don't need to go any further. So I personally am a big fan of this for just seeing what's happening, which can otherwise be hard to track. (There's a quick sketch of this after the sizing chart below.) Cool. So use the rev flag. Really handy. I highly recommend it. Definitely one of the things I like most about the new etcd.

So there are actually a number of other handy methods that we don't have time to go into here. But if you are interested, we have very nice documentation on this, and some of the stuff we've shared with you here is documented on CoreOS's website. So definitely check it out, follow one of the interaction guides like this one, and it will get you in the right mindset for working with etcd.

Okay. So the next stumbling block that we see people encounter is selecting the optimal cluster size. An etcd cluster needs a majority of nodes, a quorum, to agree on updates to the cluster state. A cluster is operational as long as this quorum is intact. In the event of a loss of quorum, say from a network partition or some other kind of hardware or network failure, etcd will automatically and safely resume after the network recovers and quorum is restored. Because etcd is an implementation of the Raft algorithm, consistency is maintained.

We recommend an odd number of members in a cluster, because this increases the cluster's fault tolerance: the cluster can survive the same number of failures as an even-sized cluster, but with fewer nodes. The difference can be seen by comparing even- and odd-sized clusters, as you can see on this handy-dandy little chart that we made. For any odd-sized cluster, adding one node will always increase the number of nodes necessary for a quorum. So although adding a node to an odd-sized cluster appears better, since there are more machines, the fault tolerance is actually worse: exactly the same number of nodes may fail without losing quorum, but there are more nodes that can fail.
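Here's a rough sketch of those prefix and revision demos. The key names are illustrative, and the revision numbers assume a fresh cluster where these are the first writes:

ETCDCTL_API=3 etcdctl get --prefix red      # picks up red1, red2, and redleader
ETCDCTL_API=3 etcdctl put changed doge      # on a fresh cluster, the first write lands at revision 2
ETCDCTL_API=3 etcdctl put changed snek      # revision 3
ETCDCTL_API=3 etcdctl put changed birb      # revision 4
ETCDCTL_API=3 etcdctl get --rev=2 changed   # reads the key as of revision 2: doge

And here's the sizing chart, roughly reconstructed. Quorum is a simple majority, floor(n/2) + 1:

Cluster size   Quorum   Failures tolerated
1              1        0
2              2        0
3              2        1
4              3        1
5              3        2
6              4        2
7              4        3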
If the cluster is in a state where it can't tolerate any more failures, adding a node before removing failed nodes is actually dangerous, because if the new node fails to register with the cluster, quorum will be permanently lost. So you might be looking at this and saying, hey, I should make a nine-node etcd cluster for maximum reliability. And I would tell you to hold up a second, because you should keep in mind that the more nodes there are in a cluster, the longer it takes for data to sync. More nodes is a speed trade-off.

Cool. So that raises the question: what the heck happens if I lose quorum in my cluster? And the answer is, you're dead. Your cluster can't do anything. So we need to recover from this somehow. This isn't anything new to etcd3 necessarily, but it's still a good topic to cover, just because it's so central to distributed systems in general.

Cool. So we're going to set up our cluster again and enter a key; I think it's just hello world. Then we're going to make use of a new command in etcd version 3 called the snapshot command. Now, you may remember that in etcd version 2, some of our documentation recommended just wholesale copying the data directory your data was in. You can still do that in etcd version 3, but hopefully through this presentation I can convince you that the snapshot command is rather more convenient.

So you can see here, we've entered some values, and we're now saving our database to /root/snap. We now viciously kill our etcd process. And now we're off to the races. We take a look at where the old etcd data is, which is in this default etcd directory up in the upper left. Sorry, that blue on black is a little hard to read. And we can see where our snapshot data is; it's under root, where we saved it before. etcd isn't running at all at this point. What we want to do first is restore the snapshot to a location, so we're going to specify that location with the --data-dir flag, and we'll just name it new.dir. And it looks like that succeeded: our snapshot has been restored into this new directory under root. And now we launch our cluster using that same flag, --data-dir=new.dir, with the pretzel to get it running in the background. And we can query it here and confirm that the data from our previous cluster is now in this new cluster, which is exactly what happened.

So this is very simple. One thing that's really nice about the snapshot command: when you take a snapshot using the version 3 API, it also saves a cryptographic hash. So if your data has been corrupted in transit, or something else has happened to it, etcd will check the hash and warn you about it before it actually does the restore. So that's pretty handy.
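A rough sketch of that save-and-restore flow; the paths are approximately the ones from the demo:

ETCDCTL_API=3 etcdctl put hello world            # seed a key
ETCDCTL_API=3 etcdctl snapshot save /root/snap   # write a snapshot (plus its hash) to disk
kill %1                                          # viciously kill the backgrounded etcd
ETCDCTL_API=3 etcdctl snapshot restore /root/snap --data-dir=new.dir
etcd --data-dir=new.dir &                        # relaunch against the restored data
ETCDCTL_API=3 etcdctl get hello                  # the old data survived the move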
Great. And we'll talk about snapshots and restoring things again in a second. But first, a quick detour into knowing your limits, which is always good in software. It's good personally, too, I think, although some Silicon Valley stuff tells me it's not. So: there's a limit of one megabyte per request. And by default, there's a two-gigabyte storage limit on etcd. etcd can scale up to around eight gigabytes of storage, and that limit is configurable by a flag. So you might be thinking, this is a little weird, what's going on here? On coreos.com, we have a page that sort of explains this: the things you would want to compare etcd to are in this handy comparison chart.

So there's etcd, ZooKeeper, Consul, and NewSQL databases like CockroachDB or Spanner. If you're looking to store things in the terabyte, or even the multi-gigabyte, size range, you may be searching for something more like Spanner or CockroachDB. If you're actually looking to store configuration data and state, and to pass messages around, etcd is wonderful. If you're looking for service discovery or something along those lines, Consul works very well; it has a lot more developer-experience niceties, but it tends to fall down when you get beyond, you know, several hundred megabytes in its storage. So etcd scales a lot further. And we see ZooKeeper as the real target we want to hit. To that point, we've even released an adapter, a wrapper that sits on top of etcd and can take commands from the ZooKeeper API, so etcd can act in place of ZooKeeper in your cluster. We have some folks running etcd in place of ZooKeeper on an Apache Kafka setup, and we've seen a lot of good performance benefits from that. You can check it out on our blog if you're interested in reading about all the fun statistics around that. Oh, yeah, sorry: the wrapper is called zetcd, if you're interested. etcd with a z at the front.

Cool. So, oh, I think I skipped a slide. Yeah. So, does etcd do Byzantine fault tolerance? And the quick answer is no. Some of you may be scratching your head saying, what is Byzantine fault tolerance? What that is, is when something crashes, it doesn't completely crash; it crashes in a partial way. So your health check might report that the node is working perfectly well, but the node is actually spewing garbage into your cluster. This is sometimes called the Byzantine generals problem, and it's well known in the distributed-systems space. Quoting from Wikipedia: Byzantine failures imply no restrictions, which means that the failed node can generate arbitrary data, pretending to be a correct one. And that makes fault tolerance difficult. So what does that mean for etcd? Well, the same thing it means for any other Raft-based system. This is something Raft just doesn't cover in general, and etcd, being one of the more well-known Raft implementations out there, is no different. We don't do anything fancy to dance around that. That is all to say: if you introduce garbage into your cluster, etcd may not know what to do with information that isn't supposed to be there.

Cool. What else is important? Well, one thing is upgrading from version two to version three. But before we jump into that, you should know there are other cool tricks you can do. There's a multi-key conditional transaction, which means your data is guaranteed to all get written at the same time. You should definitely use this instead of just comparing and swapping values: compare-and-swap is easy, but if something goes wrong midway through a sequence of them, you might be up a creek without a paddle. And it's also important to note that etcd has TTLs, time-to-live, on certain keys. In version two, you specify that as a flag. In version three, there's a new lease object that the TTL attaches to, so it's just a quick switch in the semantics there. There's a rough sketch of both below.
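The key names and the lease ID here are illustrative; for the transaction, etcdctl reads compares, then success requests, then failure requests from stdin, separated by blank lines:

ETCDCTL_API=3 etcdctl txn <<'EOF'
value("status") = "ready"

put result "launched"

put result "aborted"

EOF
ETCDCTL_API=3 etcdctl lease grant 60                                # prints a lease ID, e.g. 32695410dcc0ca06
ETCDCTL_API=3 etcdctl put --lease=32695410dcc0ca06 session active   # the key disappears when the lease expires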
Cool. So what does it look like when you upgrade from two to three? This happened recently when Kubernetes 1.6 adopted etcd3 as its main data store, and it's actually a relatively simple process. We've thoroughly tested the upgrade from 2.3 to 3.0, so that's the upgrade path you'll likely take. Say you have a three-node cluster: you stop one of the 2.3 processes, then just drop a 3.0 process in its place. These two versions can actually intermingle. So the new 3.0 process will sync and gather the data; then you stop one of the other 2.3 processes, drop in a 3.0 process, and rinse and repeat until your entire cluster is running etcd version 3.0. You'll also find a convenience command on etcd 3.0 to migrate your old version two keys into the new 3.0 namespace, if you want to take advantage of all the cool new performance stuff. But if not, you can keep running the old v2 namespace on the new, well-tested etcd 3.0 platform.

Another thing that's worth checking out, since you're all here and interested in watching etcd get kicked around (I like it; I think this is the funnest part of distributed systems): our CTO gave a talk in 2016 focused on running containers at scale, and he suggested fire drills as a really good way to test your knowledge of running a cluster. Basically, the most important things you need to know about distributed systems are the failure states. All of the theory around distributed systems is essentially about how things can go wrong, so you should have some good working knowledge of how things fail in etcd as well. So I think this is a great guide. Unfortunately, our CTO's talk was more focused on Kubernetes than etcd. So that raises the question: what should we use, then, to do our fire drill?

Thankfully, CoreOS subscribes to the Google-style SRE philosophy of codifying all of your procedures in actual code. So you can check out the etcd operator, which is our version of that. Let me zoom this in a little bit and scroll down on the README. You can see there's an overview and then a quick demo that basically shows you how to get the cluster created on Kubernetes. That's what the etcd operator is: it leverages Kubernetes' reconciliation loop to make things easy. There are all these drills laid out, for how to resize a cluster, or recover from failure modes, that sort of thing. So this is your fire drill template, if you want to follow along with that.

Great. So I was going to do a demo of the etcd operator in action, but unfortunately I had problems with hotel Wi-Fi. So let's see if that resolved itself during the talk. I had this all kind of, oh boy, which one of these is my actual terminal? That's the question. Well, I should be able to quit QuickTime. There we go. Cool. Nope. And our cluster is actually still booting; it got stunted by the hotel Wi-Fi.

So, in any event, what I was going to suggest is, if you are not on hotel Wi-Fi, if you're at home on your own connection, we have the Tectonic Sandbox, which lets you download a Kubernetes cluster and boot it up using Vagrant. And we eat our own dog food here, so the etcd operator is actually at the heart of that Kubernetes cluster. So what you can do is open up the console, visit the deployment for the etcd operator, kick over a node, just delete it, and watch the operator automatically recover from that failure without you doing anything. It will automatically reconcile that it needs three nodes and fix that for you, without you intervening at all.
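If the Wi-Fi cooperates for you at home, the fire drill looks roughly like this. The label and pod names here are illustrative, not necessarily the operator's exact ones:

kubectl get pods -l app=etcd                   # list the etcd members the operator manages
kubectl delete pod example-etcd-cluster-0000   # kick one over
kubectl get pods -l app=etcd -w                # watch the operator spin up a replacement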
Okay. So as Paul said, we invite you to try out the etcd operator and the Tectonic Sandbox. If you're interested in the Tectonic Sandbox, that's where you can find it. And that's where you can find us, Paul and Elsie. And if you would like a copy of these slides, that's where you can find them. And now we'll open it up for any questions people might have. Thank you.

That is an excellent question. I am unsure. It seems that an initialization kind of blank spot gets saved at the front for every revision. So it's definitely the sort of thing you'd want to tumble through programmatically, as opposed to doing it introspectively through the command line.

Not as far as I'm aware. That's a great question, though. I'll ask the etcd team; someone on there might know, and if I find something, I'll tweet about it. Yeah, I think you would have to write a loop, unfortunately, as ugly as bash scripting is; something like the sketch at the end of this transcript. Again, I'll check with the team. That sounds like a great feature request, if we don't already have it.

So, cool. Yeah, that's absolutely correct. So in the old versioning scheme, Kubernetes, for instance, would create a resources/member/whatever key for the thing it was controlling. Those slashes just kind of disappear now, and you have that kind of prefix name up front. So it works the same way in practice when you're storing a key inside of etcd; you're just not using slashes to delimit different folders. And then when you query it, you get exactly what you're looking for in that space. That directory concept is completely gone. So it's a flat key space, for performance reasons; we just saw a huge benefit to switching to that model. And that's also part of the MVCC transition; it enabled a lot of these other architectural changes. And I think a lot of that was at the request of the projects that etcd supports. We support a lot of Global 2000 companies that run at very, very large scale, and Kubernetes in particular drove a lot of the API changes. Yes, you can still include slashes in your keys if you would like to.

So, the entire request can be one megabyte. So if your key takes up, like, you know, all of that, you're kind of SOL in terms of storing your value. But yeah, those are the terms.

Cool. All right. Well, thank you, everyone. Cheers.
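For that revision-loop question, a rough sketch of the bash loop; the key name and revision range are illustrative, and it assumes etcdctl exits nonzero once you ask for a revision that doesn't exist yet:

for rev in $(seq 2 10); do
  ETCDCTL_API=3 etcdctl get --rev=$rev changed || break  # stop at the first out-of-range revision
done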