Welcome, everyone, to our maintainer talk about Vitess. This is the end of the day, so thank you for showing up, and welcome also to our online audience.

My name is Deepthi Sigireddi. I'm a software engineer at PlanetScale and the tech lead for Vitess. I will start off with a brief introduction to Vitess, and afterwards my co-speakers will talk about our Kubernetes operator and about VTAdmin, which is a replacement for our existing control plane, and there are some demos.

Vitess is a clustering system for horizontal scaling of MySQL or MariaDB; it's effectively a distributed database. It is a CNCF graduated project, the first storage project to graduate from the CNCF, and it's open source under Apache 2.0. We have contributors from many different companies and a vibrant community in our Slack workspace. It's mostly written in Go, and it is a cloud native database: it runs in Kubernetes. It was started at YouTube, where it used to run in Borg, so it was a natural evolution for Vitess to be able to run in Kubernetes.

It is scalable, highly available, all of those good things. Vitess provides certain durability guarantees that you don't get with vanilla MySQL. It also gives you the illusion of a single database: behind the scenes you might be running hundreds or thousands of individual database servers in order to achieve the scalability and the high availability, but to an application it looks like a single database. Applications can get single, dedicated connections to Vitess, which behind the scenes translate to many individual MySQL connections, and Vitess can present as MySQL 5.7 or 8.0. It's compatible with many frameworks and ORMs.

Vitess serves millions of QPS in production, and there are many adopters running it in production. Vitess was started about 11 years ago at YouTube, but over the last six years a lot of outside companies have adopted it. Notably, Slack runs all of their data on Vitess; Square's Cash App runs everything on Vitess; JD.com has a huge Kubernetes deployment of Vitess with thousands of nodes and tens of thousands of Vitess components; and PlanetScale runs a database service on Vitess, with thousands of individual Vitess clusters.

A few key concepts, and then we will do an architecture overview before we move on to the rest of the talk. The concepts that are important for understanding Vitess are keyspace, shard, and cell. A keyspace is a logical database: there may be thousands of physical databases behind it, but it presents as a single logical database. A shard is a subset of the logical database; the union of all the shards comprises your database as far as the application is concerned. A cell is a failure domain: depending on the availability guarantees you want, you will deploy Vitess components in different cells, and a cell could be a data center, an individual server rack, or an availability zone or a region in a cloud provider.

It is very common today to run databases in a replicated configuration with a primary and replicas. When you run Vitess in front of such a replicated MySQL configuration, a Vitess component called a vttablet lives alongside each of those MySQL servers. In production you typically run multiple clusters; they might be shards of the same keyspace, or they may be multiple keyspaces, that is, multiple logical databases.
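For a concrete feel of that "single database" illusion, here is a minimal sketch (hostname, port, and credentials are illustrative, not from the talk): the application connects to VTGate with a stock MySQL client, exactly as it would to a single MySQL server.

```sh
# Connect to VTGate as if it were one MySQL server
# (host/port/user here are made up for illustration).
mysql -h vtgate.example.internal -P 3306 -u app_user -p

# Ordinary SQL; VTGate routes it to the right shard(s) behind the scenes.
mysql> SELECT * FROM customer WHERE customer_id = 42;
```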
All traffic goes through a component called VTGate. This is a proxy that pretends to be MySQL: it speaks the MySQL protocol, it looks like a single MySQL server to an application, and it figures out which actual MySQL server each query needs to go to. To scale out with application traffic, VTGates can be scaled up as traffic grows, and each VTGate is able to route to any of the keyspaces and shards.

VTGate has to do this in a transparent way. As far as the application is concerned there is only one database, and VTGate has to route queries to the right databases, the right shards, the right clusters. How does it do this? By looking at the schema and the sharding scheme. The sharding scheme is what is also known as the VSchema, or Vitess schema. This gives VTGate the metadata it needs in order to figure out where to route queries.

The other component of interest in a Vitess deployment is the topo server. This is a distributed key-value store; people typically use etcd or ZooKeeper, or even Kubernetes itself, as the topo server. The topo server stores the configuration information that Vitess requires for the components to discover each other and for query routing to work. This is a pretty small data set, and it is cached by VTGate. One of the principles in Vitess is that the topo server does not need to be up for applications to be able to query the database.

The last thing I want to talk about is the control daemon. There is a component called vtctld which allows you to perform management operations on the cluster. It works with the topo: it gives you a view of the topo, and it allows operators to perform manual operations.

Just briefly, what is new in Vitess and what is coming up. We are doing our GA release in about two weeks; we did a release candidate last week. We have been improving query support: there is still a small set of subqueries that are not supported, especially in sharded mode, but that set keeps getting smaller as we keep working. There have been a lot of performance improvements recently, and we publish continuous benchmarks at benchmark.vitess.io. Coming up, we have VTAdmin, which you will hear more about in a few minutes, and we are also going to complete support for multi-column vindexes, automatic failure detection (also known as VTOrc), and distributed transactions. With that, I'll hand it off to Alkin.

All right. Thank you for the intro and the rundown of new and upcoming features. My name is Alkin, and I work with Deepthi at PlanetScale as part of the Vitess open source team, and I'm one of the maintainers of the Vitess project, among other things. We'll talk about the operator. As we heard from the other talks, operators are the recommended way of running stateful applications on Kubernetes, especially databases, and there is a Vitess operator for Kubernetes, which is open source. What does it do beyond a regular operator? It automates Vitess functions: it helps with cluster management and all the flags and command-line tools you would otherwise have to run by hand, it keeps things consistent, and it automates failovers, which helps a lot. I want to show you a little bit of how the operator does this. Because deployment is time consuming and won't fit in this part of the talk, some of it is already done; the first step is to deploy a cluster.
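Once a Kubernetes cluster exists, the Vitess side of that setup boils down to a couple of manifests; a hedged sketch following the vitess-operator example manifests (exact file names and paths are illustrative):

```sh
# Install the operator itself (a single, low-footprint pod).
kubectl apply -f operator.yaml

# Create a VitessCluster custom resource; the operator then brings up
# etcd, vtctld, VTGate, and vttablet pods to match the spec.
kubectl apply -f 101_initial_cluster.yaml

# Watch the components come up.
kubectl get pods
```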
In this case, it's going to be deployed on GKE. Then we deploy the operator, which has a very low footprint: it's fast, it's customizable, it's open source, and you can set up your own security, users, and the other settings you would normally configure. Then we will create a database, and with the resharding workflow (these links will be provided in the slides; we have operator examples that run through the scenario) we shard a commerce schema, like an e-commerce database, into a customer keyspace, which is sharded. I'm going to talk about that a little bit.

Then we're going to create some load on it. This is not for benchmarking; it's just to demonstrate that there's some activity on the database. As Deepthi said, we run continuous benchmarks on all our releases and deploys, separate from this talk. We also run an online DDL in this environment while the sharded cluster is running, and I want to demonstrate a failover while the load and the online DDL and everything else are going on. So this is basically a demonstration of stateful databases running in Kubernetes with the Vitess operator.

Okay, there's going to be a workflow; I'll talk through it, and I have the GitHub link for the demo. So first of all... not that one. Okay, this one. Hopefully you can see this. I'm going to start.

We have a Kubernetes cluster that's already deployed and running the Vitess operator. The operator has one pod, and there are three pods for etcd. Since we already ran the sharding example, we have commerce and customer keyspaces, each with vttablets running as pods, and we also have vtctld and VTGate pods in this cluster, running alongside the operator. What I want to do is deploy some load: two pods that run selects and inserts recursively against the customer keyspace. We have the customer table and the order table, which is called corder. We run these DB load pods (pre-built Docker images, all documented at the link I mentioned; if you want to try this yourself, it's very easy), they generate load on the sharded cluster, and the traffic starts coming in for the sake of this demo. The inserts are increasing, the number of rows is increasing, and the cluster is healthy, running under the operator.

The operator manages everything. In this Kubernetes cluster, if a pod were killed, the operator would reinstate it, recovering the pod from the existing image, and if a primary were lost it would fail over to a replica; it does all of that.

What we're going to do in this example: we have a table called corder, and it has multiple columns; sku is one of them, a varbinary(128) with a default. Now I set the DDL strategy (we mentioned that we do online DDL in the Vitess world) to 'online', and then invoke the DDL with an ALTER TABLE, and it becomes a migration. The migration gets executed behind the scenes without impacting the load; there's no locking or anything, it's online. And we can see the migration status: it's running on the customer keyspace, which is sharded. There are two shards in this example, and each shard has one primary and one replica running in this Kubernetes cluster.
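From a MySQL client connected to VTGate, the online DDL flow just described looks roughly like this (a sketch based on the demo's corder table; the particular column change is illustrative, and the strategy name has varied across Vitess releases):

```sh
# Pick the online DDL strategy for this session...
mysql> SET @@ddl_strategy = 'online';

# ...then issue the ALTER. It returns immediately with a migration UUID
# and runs behind the scenes, without locking the table or blocking load.
mysql> ALTER TABLE corder ADD COLUMN note VARCHAR(100);

# Track the migration as it executes across both shards.
mysql> SHOW VITESS_MIGRATIONS;
```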
And then we can see that the tablets are healthy, whether from the command line (here from MySQL, or with the Vitess commands) or, as another option, from Kubernetes. There are multiple points of access to this whole cluster, wherever you want to manage it, but the demonstration here uses the vtctlclient, the Vitess command line, plus a MySQL client that's actually accessing the cluster.

We also have the primary and replicas set up for this customer keyspace; you can see in the customer sharded replicas that we have two primaries. What I want to show here, while the load is running: there's a replica whose tablet alias ends in 4223, and I want to fail over to it. Say the primary on this shard is not doing well (just a scenario), and I want to fail over to the replica while the migration, the ALTER, is still running on that sharded cluster. I can tell Vitess: fail over to this tablet on this shard. This is called a PlannedReparentShard, because I know this is not an emergency; I'm planning it. I invoke it, and because it's a small cluster it happens quickly, and the roles flip over. When you tell Vitess that you want this replica to become the primary, it will take the old primary, if it's still available, detach it, and reattach it as a replica, so you always have a replica available. Of course, in production you would have multiple replicas, not just one (at least two), and you could place them in different regions.

So that's it; let's go back to the workflow. What we did: we ran the 101 example from vitess.io, created load with an example Docker image doing recursive selects and inserts, graphed the load coming in, executed an online DDL while all of this was running, failed over one of the shards from the primary to a replica, and the load was able to continue. Thank you.

All right, my friend Andrew will come up next; we're going to switch laptops real quick. Just doing a planned reparent of laptops. Okay. That gives you a little taste of the humor you can expect. Okay, my demo is done.

Hello, my name is Andrew Mason. I am a senior... oh yeah, I can take my mask off, you can probably hear me a lot better. Hi, my name is Andrew Mason. I'm a senior software engineer at Slack, and I am a maintainer on the Vitess project. I'm here to talk to you about something that I and a couple of other people have been working on, called VTAdmin. As Deepthi was saying earlier, VTAdmin is the next generation of cluster management and web UI tooling. It will eventually replace the vtctld UI, which we just saw in Alkin's demo. I'm going to do a demo and talk at the same time, so we'll see how that goes.

In the original case, you have a single Vitess cluster. Let me jump through here; this is a different version of what Deepthi showed you earlier. Basically, everything that talks to one topo I consider to be a single Vitess cluster: we have the application going to the VTGate, going to the tablets, and on the side we have the topo and the vtctld, just like we saw before. And VTAdmin is going to let us manage that. You know... oh boy: bigger, bigger, bigger.
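While the UI loads: VTAdmin was still pre-release at the time of this talk, but standing up the API next to existing clusters looks roughly like this (a sketch; the flag spellings follow later local examples and should be treated as illustrative):

```sh
# Serve the VTAdmin HTTP/gRPC API, told about one cluster via static
# service discovery (cluster id/name and the file path are made up here).
vtadmin-api --addr ":14200" \
  --cluster "id=local1,name=local1,discovery=staticfile,discovery-staticfile-path=./discovery.json"
```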
So the view looks like this: I have some shards, and I have some that are not serving. Don't worry about that; we'll come back to it later. What VTAdmin does is sit in front of these vtctlds and allow you to manage multiple, completely isolated deployments at the same time. That looks like this: we took that one chart, shrunk it down, and stamped out two more of them, and we have a single VTAdmin API and VTAdmin web that sit in front of all three clusters. Where this is useful, just as a trivial example: if you have prod, dev, and QA, you would like them to not overlap and not clobber each other. So you can have them completely isolated from each other, and instead go through VTAdmin, with admin traffic flowing only one way and application traffic kept completely separate.

In words, it's a single control plane for multiple Vitess deployments, for which we use the very overloaded but useful term "cluster". It provides both gRPC and HTTP API endpoints for you to program against; if you want to build any automation on top of the VTAdmin API, you can do that. There's also a UI for human-friendly use of the tool. The backend is written in Go, along with the rest of the Vitess project, and we're using React on the front end. The goal is to eventually replace the existing vtctld UI, which is for a single cluster; VTAdmin works just as well if you only have one cluster, but you get more power out of it from having a many-to-one relationship. It lives on a branch, which you can get from the slides.

Here I have a local deployment running on my laptop, so it'll be a little small and we'll use our imaginations a little bit, because my laptop is not very large. I have two clusters: one has a single keyspace called media, which has two serving shards, -80 and 80-, and then four non-serving shards, which I will get to in a moment; and I have a second cluster over here, which has a single unsharded commerce keyspace. From the VTAdmin view, I get to see everything in one place. I can go through: I see I have two clusters, each has a VTGate, they're called local1 and local2, keyspaces as before, schemas with sizes; I can look at the schema, and I can see how tables are sharded.

One of the other useful things in Vitess, which has not been talked about yet, is a thing called VTExplain. It's basically a MySQL EXPLAIN plan, Vitess-style: you can see how Vitess will route a query based on the sharding scheme. So I can go into VTAdmin, go here, and run "select * from books". Unsurprisingly, this will go to every shard in the keyspace, because in order to collect everything we have to go everywhere. But I can pin it down with a single sharding key, and this is actually my personal most common use of VTExplain: to see where a particular row lives, where in this logical database a row actually lives, so I can go to the tablets in that shard and do some diagnostics. So I can see that id 10 gets mapped to shard media/-80. And that is pretty much VTAdmin.

So now: as you grow over time, your sharding may end up not fitting in the space that you allotted anymore. Right now I have two shards, but say I have a whole bunch more users, and so a whole bunch more data, and now my data doesn't fit onto two shards anymore. So I'm going to use a feature of Vitess called resharding to take my data that's on two shards and split it across four shards, live, without taking downtime.
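Under the hood, the split he's about to run is driven by Reshard workflows; a hedged sketch with the demo's key ranges (the CLI spelling has changed across Vitess releases, and the workflow names here are illustrative):

```sh
# Split the left half: -80 becomes -40 and 40-80.
vtctlclient Reshard -- --source_shards='-80' --target_shards='-40,40-80' \
  Create media.reshard_left

# Split the right half: 80- becomes 80-c0 and c0-.
vtctlclient Reshard -- --source_shards='80-' --target_shards='80-c0,c0-' \
  Create media.reshard_right
```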
And we're going to see what that looks like inside of VTAdmin; that is what those other four shards are for. So I'm going to go over to Workflows, which is where things are going to show up, and the UI has made my font tiny... there, that looks sufficiently large. Okay.

First things first: I've done a bunch of prep work. I've inserted some test data, I've spun everything up, and I've created all the shards, because that takes forever. So I'm going to go right to it and create two workflows: one going from -80 to -40 and 40-80, and a second workflow going from 80- to 80-c0 and c0-, the end of the keyrange. You can reshard the entire keyspace with a single workflow, but we have found, in our personal experience, that when you are operating on sufficiently large keyspaces, with dozens to hundreds of shards, it becomes trickier to manage the retrying and restarting and doing cutovers piecemeal. So we like to create one workflow per source-to-destination split, which is why I'm doing it this way.

So this workflow is created, and I can hop over to VTAdmin and reload. Here's the one that's up and running, and here's the one that is coming up. If I click in here, I can see that the stream is lagged a little bit, because we just finished copying data over. Each workflow is actually composed of two streams: one going from the source to the left-hand destination and the other going from the source to the right-hand destination. And if I look at one of these streams... oh, that's the JSON view, we don't want that. I can go to a tablet: there's no real QPS, but there is the replication QPS, which is to be expected. And now, if I throw some traffic at it... there we go. Oh no, I'm in the wrong directory. So we send some rows over, and now we can see QPS skyrocket, as expected. I'm going to kill that before it ruins my demo.

At this point we have two workflows running, four streams total, copying data from our sources to our destinations. And now we are going to wait for things to catch up, and while we do that, Malcolm is going to fill us in on some of the work that's going to tie in here.
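While the copy catches up, the same state is also visible from the command line; a hedged sketch using the illustrative workflow names from the previous sketch:

```sh
# Print the workflow's streams, their state (copying, running, lagged),
# and copy progress.
vtctlclient Workflow media.reshard_left show
```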
Thank you for that wonderful introduction and demo of VTAdmin, Andrew. Hi, everybody. My name is Malcolm McKinjay, and I'm a software engineer at Slack on the datastores team, and a recent graduate of Tufts University, class of 2020. Go Jumbos. This past year I've had the fortunate opportunity to contribute to Vitess, mainly focusing on the resharding workflows and the entire resharding experience. Most recently I've been focusing on the VDiff step of that resharding experience. VDiff is the only optional step of a resharding workflow, but it is immensely important when you want to validate that the prior steps of your resharding workflow ran to completion and ran as intended. What VDiff does is take a diff between your target and source shards to validate that the data you've copied over is actually there, and that no extra data has been copied over, no data that you did not intend to be in your destination shards.

At Slack we currently manage one of the largest Vitess clusters in production, and because of our scale we end up in situations where we run into bottlenecks or unintended behavior that only shows up at that scale. Most recently, we noticed that with VDiff.

If we step into VDiff specifically: most of the work happens in the vtctld, so most of the load is generated on the vtctld, and you need to run one VDiff per source shard. If you're splitting one shard into two, or two into four, you'll need one VDiff or two VDiffs respectively. At Slack, when we run a VDiff in production, it averages around seven hours. So if you're running one VDiff, things are okay: you can run it and it finishes by the end of the workday. But as that number increases, you need to run these VDiffs either sequentially, back to back, or manually in parallel. Running them sequentially leads to the pattern of: you run one VDiff, you come back seven hours later, hopefully it's done, and then you run the next one. That obviously becomes more unmanageable as the number of VDiffs you need to run increases. Then you step into the world of: okay, let me try to run these VDiffs in parallel. Since they're blocking calls, what normally happens is that you end up running multiple screen or tmux sessions. A common scenario at Slack is that we'll be splitting more than one or two shards, and in that case you may be juggling six or more tmux and screen sessions at a time. And you're now generating load in parallel: one VDiff already generates load on a vtctld, so what happens when you run two, three, or four? You can generate so much load on a vtctld that you risk bringing the entire service down, which is exactly what you don't want. So then maybe you run more than one vtctld, and now you're juggling multiple screens or tmuxes on multiple different hosts, and that entire process is cumbersome.

So, over the past six months at Slack, I have worked on an in-house tool built on top of our existing vtops binary (which lets us interact with our cluster) that orchestrates VDiffs. It works within a concurrency limit, spreads the load across all of our vtctlds, and lets VDiffs run back to back: even if a VDiff finishes at 3 a.m. in the middle of the night, it will kick off the next one on the list.

This is roughly what the output looks like. You enqueue your VDiffs: you specify your source cell, your target cell, the tablet types you want to use for the VDiff, and the workflows you actually want to run. Our service discovery finds the vtctlds, and from there the VDiff workers start. Each VDiff worker is tied to a specific vtctld, and all it does is grab one workflow, send that request to its vtctld, and, when it's done, grab another one. You can specify the number of VDiff workers, and each worker takes one workflow at a time.

So I will actually spin up a couple of VDiffs right now to show you how the tool works. As Andrew said previously, we have done some of the prep work, so these commands are already here; in this example we have "vtops vdiff workflow". I'll copy this whole line and run it right here. And now the VDiffs have started on the vtctlds we spun up for this demo, and they're currently running.
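For reference, the stock invocation the tool wraps is a single blocking call per workflow, issued through a vtctld; roughly (again using the illustrative workflow names from the resharding sketch):

```sh
# One VDiff per workflow; each call blocks until the diff completes,
# which can take hours on large shards.
vtctlclient VDiff media.reshard_left
vtctlclient VDiff media.reshard_right
```

That blocking behavior is exactly what pushes operators toward screen/tmux juggling, and what the worker queue automates.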
While this is running, we can talk about what we are looking forward to for the future of VDiff. The future goal is to move VDiff from the vtctld layer down to the tablet layer, in order to relieve the pressure on those vtctlds and make VDiff a lot more scalable than it currently is, which will make the entire resharding experience better and more streamlined.

If we look back here, we can see in the output of this command that the VDiff for media/80- is completed and the VDiff for media/-80 is completed. "Diffs found: false." It shows you which vtctld was used, and it shows you the output: VDiff tells you how many rows were processed, how many rows were matching, and whether there were extra rows, as I mentioned before, grouped by workflow and then by table.

So now that VDiff has run and validated that our data copy succeeded, with no extra rows and everything matching as expected, we can cut over. The first command switches traffic for the rdonly and replica tablets; the switch was successful. If we take a look back in VTAdmin, stepping into the media keyspace, you should see... okay. Next we switch the primaries; that switch was successful too. The start state was "all reads switched, writes not switched", and the current state is "all reads switched and writes switched". So the destination shards are taking traffic right now. And if we head here and refresh the page, we can see that the original source shards, -80 and 80-, are no longer serving, and the destination shards are now the serving shards.

With all of that said, thank you for your time. I appreciate everyone who was able to make it here, and everyone who was able to join us online and watch our talk. Feel free to check out the Vitess docs at vitess.io/docs and the GitHub repository linked in the slides, and if you have any questions after this talk, feel free to message us in the Vitess Slack at vitess.slack.com. Thank you.
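For readers following along, the cutover Malcolm just ran corresponds roughly to commands like these (a hedged sketch; the exact subcommand and flag spelling varies across Vitess releases, and the commands would be repeated for each workflow):

```sh
# First switch read traffic (rdonly and replica tablets) to the new shards...
vtctlclient Reshard -- --tablet_types=rdonly,replica SwitchTraffic media.reshard_left

# ...then, once reads look healthy, switch writes (the primaries).
vtctlclient Reshard -- --tablet_types=primary SwitchTraffic media.reshard_left
```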
I guess we are out of time... or do we have time for questions? Cool. Before we take that question: we also have office hours tomorrow at 11:30 Pacific, for people who think of questions later.

I'll take that one. So the question was: does Vitess require a storage manager like Rook or Portworx, or does it manage local storage on its own? Vitess does not require a storage manager like Rook or Portworx. It can work with cloud storage, network-attached storage, or local storage. When running with the Kubernetes operator, most people use cloud storage, like Amazon EBS or Google Cloud storage, but people have also run Vitess with Ceph. So storage managers are optional.

Any questions from the room? What was the other question? You said there were two questions. Let's do Carson's question. So the question was: for VTAdmin, does it talk to the vtctld, or does it ever talk to the topo directly? It only ever talks to the vtctld. There are a couple of things that go through the tablets... sorry, through the VTGates, because VTGates have a different view of the health of the tablets. That's what's powering the serving/not-serving tablet states: if I go to this one, two, three... fourth column, that's coming from the gates, because we are interested in what the gates think about tablet health, since that affects query serving, versus what's in the topo, which is delayed, basically. But other than that, everything goes directly to the vtctld, which then proxies the topo.

So commands still have to be run from the CLI, and VTAdmin is mostly read-only; is the plan to make it more powerful than just a very good read-only UI? Yes, that's the plan; that's where we're going. That's part of vtctld parity: the vtctld UI lets you do all of those destructive administrative operations, and we're getting there.

From the stream: which MySQL versions do we support? We support 5.7 and 8.0, and the minor versions of those two major versions. We used to support 5.6, but we don't really support it anymore because it has reached end of life. Vitess can work with the Percona distribution of MySQL, Oracle's Community Edition and Enterprise Edition, and MariaDB: almost all the variants.

Okay, I think that's it. Thank you, everyone.