Hello, everyone. Welcome to this breakout session, where I will be speaking about horizontal scaling with Vitess. My name is Deepthi Sigireddi. I am a software engineer at PlanetScale, where I lead the Vitess engineering team. I'm also a maintainer and technical lead of Vitess, the open source project. You can find me on Twitter or on GitHub.

Let's talk a little bit about the problem that Vitess is designed to solve by taking a step back. What do modern applications need? They need performance, scalability, and uptime, which translates to high availability, and we don't want to lose data. Whether it is user data, transactional data, or system-level data, we do not want to tolerate data loss. This is a little different from how things used to work, say, 20 years ago, before everything moved to the web. Software applications only needed to be available during business hours, 9 to 5. They were located in specific locations. They did not have a worldwide audience. They did not have billions of users. All of that has changed, and that has changed the characteristics that we expect out of modern applications.

Vitess solves this problem at the data layer. Vitess is massively scalable. It's highly available. It provides durability guarantees with respect to the data, and it is MySQL-compatible. Vitess is a CNCF graduated project. It is licensed under the Apache 2.0 license. It's community-supported, and it is written in Go.

A little bit about the history of Vitess: it was originally started at YouTube over 10 years ago, to solve the scaling problem that YouTube was facing with their monolithic MySQL installation. Since then, it has grown to many, many production installs. Vitess provides the illusion of a single database. Even though, behind the scenes, it is not a single database, as far as the application or the user is concerned, it looks like a single database, and it looks as if the application has a single dedicated connection to the entire Vitess cluster.
And Vitess is compatible with both MySQL and its variants, like MariaDB and Percona Server. Given the place where Vitess fits in a modern application architecture, it needs to work with database frameworks, ORMs, legacy code, and third-party applications. Some of the third-party tools that Vitess integrates with in the MySQL ecosystem are gh-ost, pt-online-schema-change, Sqoop, and many others.

Vitess serves millions of queries per second in production. These are the logos of some of the companies that are running Vitess in production. Let's focus on a few key adopters. There is Slack, which is running 100% of their data storage on Vitess. Many of us use Slack on a daily basis; we have a Slack workspace for this conference, and all of that is running on Vitess. Square's Cash App also runs 100% on Vitess. And then there's JD.com, a huge Chinese online retailer, the biggest in China, which basically means it's the biggest in the world, and they are running 10,000-plus individual databases in a single Vitess cluster. There's also PlanetScale, which provides a database service built on Vitess. PlanetScale's own data store is on Vitess, and PlanetScale is also running many thousands of individual Vitess clusters. Some of these installs run on quote-unquote bare metal, meaning they are not running in a containerized environment, but many Vitess users also run Vitess on Kubernetes in the cloud.

Let's talk a little bit about the key features of Vitess. Vitess is compatible with MySQL. It's also compatible with popular development frameworks, like Ruby on Rails, Django, Laravel, and so on. Vitess comes with many database management features, for instance, connection pooling. The application can make a single connection to Vitess, and behind the scenes, Vitess manages a pool of connections to the backing database. Schema changes in Vitess are online.
They are safeguarded in such a way that database downtime due to schema changes is a thing of the past.

The other interesting feature that Vitess gives you is query consolidation. Let's say there is a very popular YouTube video, and everybody is trying to watch it. There is a single metadata row that corresponds to that video, and you might have millions of people trying to access that single row in the database. What Vitess will do is that, if there is a query in flight to fetch that particular row, and thousands of other identical queries come in at the same time, they will all be held, and the result from the first query will be returned to all of them, so that the identical query is executed once instead of tens of thousands or millions of times.

Vitess achieves scalability through sharding. In terms of scalability, you can go vertical or you can go horizontal: scale up versus scale out. Scaling up is typically done by running things on ever larger machines, which eventually hits cost or physical limitations. Scaling out is achieved by adding more and more instances of the same small or medium commodity hardware. The way Vitess achieves scalability is by adding more instances of small or medium commodity hardware and hosting different shards on the new hardware.

High availability in Vitess is achieved through failure detection and failover. Typically in Vitess, the databases, the various shards, are run in a primary-replica configuration, and the replicas are available to fail over to if the primary goes down for any reason.

Vitess also provides a whole bunch of data services: materialized views; rollups, or aggregations of data, which can be updated in real time; change data capture; notifications; and data migrations. All of these features are facilitated by a part of Vitess that we call VReplication, or Vitess replication, which is built on top of MySQL replication, but is a sort of filtered replication.
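The query consolidation behavior described a moment ago, holding duplicate in-flight queries and fanning one result out to all waiters, is essentially request coalescing. Here is a minimal sketch of the idea in Python; this illustrates the technique only, it is not Vitess's actual implementation, and `fetch` is a stand-in for the real database call:

```python
import threading
from concurrent.futures import Future

class QueryConsolidator:
    """Coalesce identical in-flight queries: the first caller runs the
    fetch, and concurrent callers with the same query share its result."""

    def __init__(self, fetch):
        self._fetch = fetch        # the expensive database call
        self._inflight = {}        # query text -> Future holding its result
        self._lock = threading.Lock()

    def get(self, query):
        with self._lock:
            fut = self._inflight.get(query)
            leader = fut is None
            if leader:
                fut = Future()
                self._inflight[query] = fut
        if not leader:
            return fut.result()    # wait for the in-flight identical query
        try:
            fut.set_result(self._fetch(query))
        except Exception as exc:
            fut.set_exception(exc)
        finally:
            with self._lock:
                del self._inflight[query]
        return fut.result()
```

Thousands of concurrent `get(...)` calls for the same hot row would trigger a single fetch; every other caller blocks on the shared future and receives the same result.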
So for instance, if we look at real-time rollups, the way they work is that what you are streaming from Vitess is an aggregate number. It could be something like the sum of column values that fit a particular filter condition. Now, if one of those values changes, Vitess knows how to adjust the sum: subtract the old value, add the new value, and that's your new sum. So whether a row is inserted, deleted, or updated, Vitess knows how to keep this value up to date to reflect those changes.

Data migration is another key feature of Vitess. People come from various previous versions of their data store when they adopt Vitess, and it's important that you be able to migrate the data into Vitess with as little downtime as possible.

Let's switch gears a little bit. Now that we have talked about the features of Vitess, we will look at what it is in the Vitess architecture that facilitates these features. How do we go from building blocks to where we want to end up? The question here is: how does the Vitess architecture enable transparent database operations?

Before we dive into that, let's look at a few concepts. The first one, and the one you'll hear the most, is keyspace. A Vitess keyspace is a logical database. This goes back to our illusion of a single database. Behind the scenes, there may be tens, hundreds, or thousands of physical databases, but as far as the application is concerned, Vitess presents a uniform view as a single database, and we call that a keyspace. That's the logical database.

Within the keyspace, each row in a table has something called a keyspace ID. This is a computed value. It is not a physical column, and it is not a value that has to be provided by the team that is setting up Vitess or anything of that sort. It is a hex value between 0 and FFFFFFFFFFFFFFFF (typically 8 bytes), and it can be provided to Vitess by using what is called a sharding scheme.
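To make the keyspace-ID idea concrete, here is a hedged sketch of a functional sharding scheme in Python. SHA-256 is a stand-in chosen only to show the shape of the computation; Vitess's built-in hash vindex uses a different algorithm internally, but like this one it is a deterministic function from a column value to an 8-byte keyspace ID:

```python
import hashlib

def keyspace_id(sharding_key) -> bytes:
    """Map a column value (e.g. a user ID) to an 8-byte keyspace ID.
    Illustrative stand-in: the real hash vindex uses its own algorithm,
    but the key property is the same -- the mapping is deterministic,
    so all rows with the same sharding key land in the same shard."""
    return hashlib.sha256(str(sharding_key).encode()).digest()[:8]
```

Because `keyspace_id(12345)` always yields the same 8 bytes, every row belonging to user 12345 lives in the same shard, which is what makes single-user queries routable.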
And the sharding scheme can be functional, meaning: take a column, apply a function to it, and use the result as the keyspace ID. Or it can be custom, or it can be numeric. There are many different sharding schemes. The example I just gave, applying a function to a column value, is the most common way people use Vitess, and the most popular function, and the simplest one, is hash. But there are other functions that can be applied to data values to generate keyspace IDs.

The next concept I want to talk about is the shard. A shard is a piece of the logical database. It may in turn correspond to multiple physical database instances, but logically it's a distinct piece of all of your data, and if you put all of the shards together, you end up with all of your data. It's a subset of your data, so to speak.

The other concept that people run into often in Vitess is the cell, and the idea of a cell is to facilitate high availability and failovers. A cell is any failure domain. Whichever part of your system architecture is a self-contained piece corresponds to a cell. That could be a data center, or one rack of servers, or a region or availability zone in a cloud provider. It is basically a failure point, and you define it as a cell in order to say: if this whole cell goes down, we can define how the system should behave and which cell we will fail over to.

A common configuration for databases is a cluster with a primary and replicas, and when you start putting Vitess on top of this, what we do is place a management module on each of these databases. This module is called a VTTablet. It's a sidecar process that runs right next to each of the databases. It controls the database process, the MySQL server, and interacts with it. Typically, it's on the same host as MySQL and communicates through sockets.
In production, there may be multiple such clusters, whether they correspond to different teams in a company or to different shards in the same keyspace. User and application traffic is routed to these clusters through a component called VTGate. VTGate is a smart, stateless proxy. It speaks the MySQL protocol, it gives the application the illusion of a single monolithic MySQL server, and it relays queries to VTTablets. In order to scale out for traffic, a typical Vitess deployment will run multiple VTGate servers. VTGates can be scaled up by giving them more CPU and memory, or scaled out by creating more VTGates. This also allows people to deploy VTGates in different parts of the world, depending on where their traffic comes from.

VTGate, we said, transparently routes queries to the correct databases. How does it do this? It uses the concept of the keyspace ID that we talked about. When VTGate gets a query from the app, it is able to compute which keyspace, and which shard within it, the query should be routed to. The routing is based on the schema and the sharding scheme: there is a table with a certain set of columns, and by knowing how the sharding scheme for the table is defined and where the table exists, VTGate is able to route queries correctly.

The next component we'll talk about is what we call the topo server, or topology server. This is a distributed key-value store, and it stores the state of Vitess: the schema, the keyspaces, the shards, the sharding scheme, tablets, users, roles, authorizations, all of those things. We support multiple types of distributed key-value stores. You can use etcd or ZooKeeper, or even Kubernetes itself, as the topology server for Vitess. The data stored in the topology server is fairly small, and it is mostly cached by VTGate, so VTGates don't have to communicate constantly with the topology server.
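The routing step VTGate performs can be pictured as a key-range lookup. Shard names such as `-80` and `80-` (they appear later in the demo) are hex bounds on the keyspace-ID space, and the query goes to the shard whose range contains the computed keyspace ID. A simplified sketch, not VTGate's actual code:

```python
def shard_range(name):
    """Parse a shard name like '-80', '40-80', or '80-' into
    (start, end) byte-string bounds; an empty side means an open bound."""
    lo, hi = name.split("-")
    start = bytes.fromhex(lo) if lo else b""
    end = bytes.fromhex(hi) if hi else b"\xff" * 9  # beyond any 8-byte ID
    return start, end

def route(keyspace_id: bytes, shards):
    """Return the shard whose half-open [start, end) range contains
    keyspace_id; lexicographic byte comparison matches hex prefixes."""
    for name in shards:
        start, end = shard_range(name)
        if start <= keyspace_id < end:
            return name
    raise ValueError("no shard covers keyspace id %s" % keyspace_id.hex())
```

With shards `['-80', '80-']`, any keyspace ID whose first byte is below 0x80 routes to `-80`, and everything else routes to `80-`.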
The expectation, and this is a core principle of Vitess, is that query serving, whatever applications need on an ongoing basis from Vitess, should not actually depend on the topology server being up. It's only cluster operations that need it. Let's say we are adding a new shard or a new database into the cluster, or we are doing some sort of management: we are going to take a node down for maintenance, so we want to fail over everything running on that node to a different node. These sorts of management operations require the topology server to be up, but query serving does not.

The last component we'll talk about is a control daemon, vtctld. This is used for running ad hoc operations, things like taking a node down for maintenance. How do we inform Vitess that we now want to fail over all traffic to a different node, and then remove everything that's on this node from the cluster before it goes down for maintenance? Or let's say our traffic is growing and we want to add more tablets, more databases, into the Vitess cluster. These are the types of operations that vtctld facilitates. It also provides an API server that can be used to inspect the state of the Vitess cluster, so that we don't have to run command-line operations against etcd or Kubernetes to inspect the state of the cluster.

In summary, Vitess provides the illusion of a single database to an application. All the application does is speak to Vitess as if it were a single MySQL database, a single database of some sort. But behind the scenes, behind the red line on this diagram, we have all of these components in Vitess that facilitate that illusion.

Let's now start talking about sharding. The title of the talk is horizontal scaling, so how do we, in fact, achieve horizontal scaling with Vitess? Before we talk about horizontal, we first want to talk about what sharding is in the first place.
If I were to define sharding, and I've said this once before, but I'll repeat myself, sharding is the process of splitting some sort of monolithic data into small subsets, which together form your full database. There are two types of sharding: vertical and horizontal.

Vertical sharding means that each database has a different set of tables, and an application talks to one of those databases, or keyspaces. In Vitess, you can create multiple keyspaces, put a subset of tables in each of them, and design your applications so that each speaks to only one of those keyspaces. That is vertical sharding.

Then you have horizontal sharding, where you say: even if I were to split my tables out into different keyspaces, things are growing so fast that at some point, the cost of continuously scaling those databases up is going to become prohibitive. It is not a sustainable strategy. So you end up with horizontal sharding, where data that belongs to the same tables actually gets split across multiple databases, and using Vitess's VTGate and the keyspace ID, queries can be routed to the right physical database.

Why do people shard? There are various reasons. The main ones are performance and manageability. As database instances become bigger, they become harder to manage, and performance suffers. There is only so much you can tune; there is only so much you can get by adding memory or moving to a larger physical machine. The other thing that happens when MySQL databases get into multi-terabyte sizes is that if there is a problem, the recovery time becomes much longer, which translates to longer downtimes when there is an incident. By sharding, first of all, you spread the risk, because you have many physical databases. Any one physical database going down doesn't impact your full application. Everything doesn't go down at the same time.
Maybe a small function is unavailable for a short period. On top of that, with Vitess and MySQL replication, running in a primary-replica configuration, Vitess will let you fail over very quickly, in, say, 15 to 20 seconds, to a replica and be back up. So incidents that used to cause hours, sometimes even 24 hours, of downtime can be managed down to 15 to 20 seconds of downtime at the database level, which is very tolerable at the application level just by using retries. Most web users have become used to the experience of "Oops, something went wrong. Please try again." No one minds; if they retry and it works, that's fine. That's the sort of experience you can give end users with 15 to 20 seconds of downtime at the database level, versus multiple hours.

The recommendation for Vitess is to keep shards at around 256 GB. What this means is that with fairly standard hardware, whether you're deploying in a cloud or in a dedicated environment, it doesn't take more than 15 minutes to add a new replica, or, even if everything is lost, to rebuild from a backup and come back into serving mode.

Now, a demo. Let me talk a little bit about how this demo is set up, and then we will actually get to see it. I started by creating an unsharded Vitess cluster using the Vitess operator. The Vitess operator is an open source Kubernetes operator that was built by PlanetScale; it's also available under the Apache 2.0 license. So I create a Vitess cluster, I load some data into it, and then I set up a monitoring stack, Prometheus and Grafana, so that we can see what the query load looks like. I run a query load, we look at some metrics, I double the query load, we look at the same metrics, and then we shard this into two shards. Now, if I were to run through this whole thing here, it would take about 30 minutes, including the time spent waiting for Kubernetes pods to come up.
So I've actually pre-created all of this, and we will focus the demo on the sharding part. The schema for this demo is very simple: I have one table called users, with a couple of columns, a user ID and user data. When I started, there were about 800 queries per second, and if we look at the query latency in the second graph, it is between five and six milliseconds. Then I doubled the query load, so the rate went to about 1,500 queries per second, and we can see that, corresponding to that doubling, the query latency went up to between seven and eight milliseconds. This cluster is a pretty lightweight cluster; I'm running it in Google Cloud using medium nodes, nothing powerful. So very quickly we can see that the traffic actually affects the performance of the system.

Now let's scale it. Let's look at what we have deployed in Kubernetes already. etcd is the topology server I chose for this demo, and it's running with three replicas. We have some VTTablets, which are the MySQL management components, and we have a vtctld and a VTGate. We also have two instances of a database load generator running. Now, if we look at the keyspaces, we have a single keyspace called customer, and it has just one shard. Like I said, this demo was created using the Vitess operator, and it's running in a Kubernetes cluster on Google Cloud.

OK, so we have one shard, and it's denoted by just a dash. The names of the shards have a start value and an end value, and these are hex values. When I talked about keyspace IDs, I talked about how they span the full hex range, typically 8 bytes, from 0 to FFFFFFFFFFFFFFFF. And the dash basically means that this shard spans the full key range.

The next thing we do is apply the sharding scheme. I used a vtctlclient command called ApplyVSchema, and on this table, we have a single vindex of type hash on the column user_id. This is just setting things up so that we can start the sharding process.
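For reference, the VSchema applied at this step, declaring the keyspace sharded with a hash vindex on the users table's user_id column, looks roughly like this. Field names follow the Vitess VSchema format as I understand it; check the current Vitess docs for the exact shape before using it:

```json
{
  "sharded": true,
  "vindexes": {
    "hash": { "type": "hash" }
  },
  "tables": {
    "users": {
      "column_vindexes": [
        { "column": "user_id", "name": "hash" }
      ]
    }
  }
}
```

This JSON is what gets passed to ApplyVSchema for the customer keyspace.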
This declares that this keyspace is now going to be sharded. Once that is done, we bring up our new shards. From one shard, we will actually go to three: one old, two new. We will copy all of the data from the old one to the new ones, and then we can bring down the old one. The way we deployed the first shard was by declaring a partitioning with one equal part, and the way we create the new shards is by declaring another partitioning with two equal parts.

It takes a while for these things to come up. The pods are initializing; everything is not ready yet. So we will skip ahead to a point where everything is up. At this point, all of the new processes are up in Kubernetes and everything is running, and we can refresh our keyspace view and see that we now have one serving shard, same as before, but we also have two non-serving shards, -80 and 80-. What these names are telling us is that the first of them spans keyspace IDs from 0 up to 0x80 followed by zeros. The keyspace ID is typically an 8-byte value, and for each row in the unsharded database, we will compute a keyspace ID, and depending on whether it falls below or above that 80 boundary, we will copy it into the appropriate shard.

Let's look at the steps we are going to run through to do the resharding. The first step is to create a resharding workflow, and we tell it what the source is and what the destinations are. This workflow has started, and it has created two streams to do the data copy that we talked about. One of them goes from - to -80, the other from - to 80-. Once the workflow is up, we can check on its progress. Here it says the copy is completed and the streams are running. What it means for the streams to be running is that initially we take all of the data and copy it over, but as we are doing this copy, new data is being created on the source, because traffic is still coming in.
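The "equal parts" in the operator's partitioning map directly onto hex key ranges: one equal part is the full range `-`, and two equal parts split at 0x80 into `-80` and `80-`. A small sketch of that naming rule follows; `equal_shard_names` is a hypothetical helper for illustration, not operator code, and it assumes the part count divides 256 evenly:

```python
def equal_shard_names(parts: int):
    """Split the keyspace-ID range into `parts` equal key ranges and
    return Vitess-style shard names with hex boundaries. Works for
    part counts that divide 256 evenly (1, 2, 4, ..., 256)."""
    boundaries = [format(i * 256 // parts, "02x") for i in range(1, parts)]
    names, prev = [], ""
    for bound in boundaries + [""]:
        names.append(f"{prev}-{bound}")
        prev = bound
    return names
```

One part gives `['-']`, two parts give `['-80', '80-']`, and four parts would give `['-40', '40-80', '80-c0', 'c0-']`.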
And these streams keep the destinations, the targets, in sync with the source, so that any new data coming into the source is also, on a live basis, copied to the target. At any point, we can say: OK, given that the copy has completed, I want to switch the traffic. I want to start sending all of the requests to the new shards instead of to the old one. This can be done all at once for reads and writes, or it can be done in a phased manner, where you switch the reads first, see how things are working, and then, whenever you're ready, switch the writes. What happens when you switch writes is that we stop accepting writes on the source shard for a very brief period and wait for the targets to catch up. Once the targets have caught up, we update the routing in the VTGates so that writes will now start going to the targets.

That brings us to the end of this demo, and then we look at what the metrics look like. Oh, but before we go, we should see that the serving and non-serving statuses have changed. The new shards are the ones that are serving, meaning they are the ones getting traffic from VTGate, and the old one, -, has gone into non-serving. Once the sharding process is complete, we can actually tear down the old shards. But one of the things people sometimes do is keep them around, so that if something has gone wrong, they are able to fail back to the original. That is also possible: when we switch the reads and writes, we set up reverse replication streams to make sure that the source remains in sync with any new data being created on the targets, which are now taking the traffic.

After we did the sharding, there was a brief blip in the traffic, because we stopped writes for a brief period while we were switching everything over. Then the queries per second picked back up to their previous level. But we can see that the latency has actually gone down.
It has fallen to the levels at which it was when we were running half the load. With 800 QPS, we were getting about six milliseconds of latency; with 1,500, about seven to eight milliseconds. Now, with 1,500 queries per second, we are getting about five to five and a half milliseconds of latency. Those three lines are the P50, P95, and P99, so the yellow line is actually the 99th-percentile latency, not the average latency. The average latency is below three milliseconds.

I wanted to share some of the resources we have for anyone who's interested in Vitess. We have a vibrant Slack community with almost 2,500 people now. It is big, but it's not so big that someone new would get lost. There are always lively discussions, and there are people who are very much willing to help anyone who's new to the community. We have tutorials in the form of getting-started guides, which will walk you through running a very simple Vitess setup, either locally or on Kubernetes. For a local install, it could be Linux or Mac, and we also have a tutorial that can run in Docker. The source code is available on GitHub, and we welcome everyone to come over and check it out.

Some further reading. We talked about how Slack, Square, and JD all run Vitess at scale, at massive scale, in fact. Square's Cash App team has blogged about how they did their resharding, how they went from one shard to eight, or 32. CNCF, the Cloud Native Computing Foundation, which Vitess is a part of, does case studies for its projects, and they have published multiple case studies on Vitess; JD.com, which runs something like 27 million queries per second against Vitess, is one of them. The Slack data store team has blogged about their Vitess migration, how, using Vitess's VReplication feature, they were able to take a migration project that was projected to take multiple years and finish it in less than one year. That is very interesting.
And just yesterday, GitHub Engineering, the GitHub database team, published a blog post on how they partitioned GitHub's databases. They used to run a monolithic MySQL installation; they now run Vitess. In the process, they first did vertical sharding and split their tables into different keyspaces, but they have also done horizontal sharding with Vitess.

At this point, I can take questions if there are any. No questions. I will be in the Slack channel for the open source databases track after we are done here, and at that time, I can take any questions from the online audience as well.