Welcome to the Continuous Performance Benchmarking for Vitess maintainer talk. My name is Alkin and I'll be your host today. On today's agenda, I'll give an introduction to Vitess, Manan will present benchmarking as a product feature, and Florent will walk you through the benchmarking internals from the Vitess project. A little bit about Vitess: Vitess is a database clustering system for the horizontal scaling of MySQL. It is a CNCF graduated project, open source licensed, and we have contributors around the world. More on Vitess: it presents the illusion of a single database over a dedicated connection, works with MySQL 5.7 or 8.0, is compatible with common frameworks and ORMs, and is known to be scalable while providing high availability and durability guarantees. It is very widely adopted and serves millions of QPS around the world; you might be using it day to day, since several very large deployments are publicly known and other implementations are in the market right now. So let's go into the concepts of Vitess. Because Vitess is a sharded system, it has the concept of a keyspace for a logical database, a shard as a slice of that logical database, and a concept called a cell for a failure domain. The Vitess architecture builds on MySQL's primary and replica concept, and its main component is the VTTablet: a sidecar to a mysqld process that usually sits next to mysqld and drives Vitess operations against MySQL. Here's an example of multiple clusters running under Vitess for large deployments, which is why we need a component like VTGate. VTGate is a stateless proxy that speaks the MySQL protocol and impersonates a monolithic MySQL server. This lets the application talk to the VTTablets without knowing the Vitess internals.
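To make VTGate's routing role concrete, here is a hedged sketch in Go: a hypothetical vindex-style hash maps a sharding key to a keyspace ID, and the shard whose key range covers that ID serves the query. This is a simplified illustration only; Vitess computes keyspace IDs with its own vindex functions, not the FNV hash used here, but the shape of the lookup is the same.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// keyspaceID maps a sharding key to a 64-bit keyspace ID.
// Illustrative stand-in for a real Vitess vindex function.
func keyspaceID(customerID uint64) uint64 {
	h := fnv.New64a()
	fmt.Fprintf(h, "%d", customerID)
	return h.Sum64()
}

// shardFor picks the shard whose key range covers the keyspace ID.
// Two shards split the range in half, like "-80" and "80-" in Vitess.
func shardFor(customerID uint64) string {
	if keyspaceID(customerID)>>56 < 0x80 { // top byte below 0x80
		return "-80"
	}
	return "80-"
}

func main() {
	for _, id := range []uint64{1, 2, 3, 4} {
		fmt.Printf("customer %d -> shard %s\n", id, shardFor(id))
	}
}
```

The point is that the application never sees this decision: it sends an ordinary MySQL query, and the proxy resolves the target shard.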
If you have an even larger deployment, you will deploy multiple VTGates to access these clusters, so access is also scalable from the proxy and connection standpoint. How does the application access the sharded system? In this example, VTGate routes traffic to the sharded clusters: we have two shards in the commerce database and one unsharded keyspace, and this is how the application gets access to the sharded environment. In our example we have a commerce database and we want to look up orders by customer ID. The application sends the query as it always does, over a regular database connection, in this case MySQL, but VTGate knows where the customer ID values fall and routes the traffic to the right shards. This way you can scale out essentially indefinitely. Another important component in the Vitess architecture is the topo. The topo stores the state of where the MySQL and database components are, as well as where the Vitess components sit. There are multiple topo implementations you can choose from; the favorites are etcd, ZooKeeper, and Kubernetes, and the Consul topo implementation was also used in the past. vtctld is another Vitess component: it's a control daemon that runs ad hoc operations, serves as an API server, and interacts with the topo. It all comes together in this architecture summary: the application server connects to a load balancer, the load balancer talks to VTGate, which interacts with vtctld and the topo server and serves the incoming queries from the sharded clusters behind the scenes. Vitess also has some very new and upcoming features.
As of today, we have the 12.0 release in GA, we support online DDL operations without locking, and Vitess ships an experimental version of the Gen4 planner, which is very new and much improved. We do continuous benchmarking, which is the subject of today's talk, we have ongoing performance improvements, and tools like VTAdmin are coming up in the next releases. With that, we will continue to the next section; thank you very much.

Thank you, Alkin. I am Manan Gupta and I will be walking you through the next part of our presentation. I am going to introduce arewefastyet, the nightly benchmarking tool for Vitess, and then show you the arewefastyet website, which I have linked over here, and the results it has produced. I have also linked the code; please go through it after the presentation. So let's dive in. First things first: what is benchmarking? Benchmarking is a way to measure and compare the performance of one software version against another. When we are building Vitess, we have the main branch, different releases of Vitess (release 12, release 11), different patch releases, and PRs changing some part of the code base, and we want to measure the performance of each of these and compare which one is better. This is what enables us as developers to know whether a code change we have made improves the performance of some part of the code or not. The benchmarking tool we have for this is arewefastyet, which is what we use to keep Vitess performant. Broadly, benchmarks can be broken into two kinds: microbenchmarks and macrobenchmarks. Let's go over both of these. Microbenchmarks, like the name suggests, measure a small part of the code base. This is done by isolating a single function call and running it repeatedly. What we mean is: let's say we have a benchmark and we run it for two seconds.
Within those two seconds, we keep calling the same function again and again with the same parameter values, and the number of times the function runs within those two seconds tells us how fast it is. We have a wide array of microbenchmarks in Vitess which test the planner, the parser, RPC calls (those are the ones at the top of my head), among others, so we have very extensive code coverage in microbenchmarks. If you want to take a look, go to the Vitess code base: within Go's unit testing files you'll find benchmark tests, which is what the microbenchmarks are. Macrobenchmarks, on the other hand, measure the performance of the code base as a whole. When a user uses Vitess, they don't see how much time the parser took, how long the query took to plan, or what the execution time was; what they generally see is just the latency of the result set they get back when they query the database. That is what we measure via macrobenchmarks: they measure the performance of the entire code base, and they run in an environment very similar to what end users experience. We're not going to go over how we bring up that environment or how this is all done, because Florent will cover that in the next part of the talk, when he dives into how arewefastyet is built and how we bring up the infrastructure and run the micro- and macrobenchmarks. The macrobenchmarks that we run fall into two categories, OLTP and TPCC; Florent will cover that later. Now, the next question is: what all do we benchmark? We have several cron jobs configured to run daily, and these cron jobs keep tabs on the main branch, the release branches, the tags, and the PRs that have the benchmark label.
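The timed loop described above is exactly what Go's testing framework provides: the framework picks the iteration count `b.N` so the loop runs long enough to produce a stable ns/op figure. Here is a minimal, self-contained sketch; `buildQuery` is a toy stand-in, not a real Vitess function:

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// buildQuery is the function under test; in Vitess this would be,
// for example, the SQL parser or the query planner.
func buildQuery(n int) string {
	var b strings.Builder
	for i := 0; i < n; i++ {
		b.WriteString("x")
	}
	return b.String()
}

// benchmarkBuildQuery repeatedly calls the same function with the same
// arguments; the framework chooses b.N so the loop runs long enough.
func benchmarkBuildQuery(b *testing.B) {
	for i := 0; i < b.N; i++ {
		buildQuery(64)
	}
}

func main() {
	// testing.Benchmark lets us drive a benchmark outside `go test`.
	res := testing.Benchmark(benchmarkBuildQuery)
	fmt.Printf("%d iterations, %d ns/op\n", res.N, res.NsPerOp())
}
```

In the Vitess repository itself, such functions live in `_test.go` files with names starting with `Benchmark` and are run with `go test -bench`.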
So every day, if any PR has been merged into the main branch of Vitess, we detect that the commit hash for the main branch has changed and we run the benchmarks for it, both micro and macro. We do the same for the release branches and the different tags, and we also do it for all the PRs that carry the benchmark label. So even before you merge a PR, you can know the performance impact it is going to have, both on the unit you're trying to change and on the whole Vitess code base. This makes the developer experience a lot better. Taking that a step further, we also have a Slack integration. Once arewefastyet has run the microbenchmarks and the macrobenchmarks and stored those results, we compare them against each other. For example, we'll take the main branch and compare it against the results of the previous day's run of the main branch; we do the same for the main branch against the releases, and for PRs we compare the head of the PR against its base. Once we have these comparisons, if there is any regression, we send a Slack notification to the benchmarking channel, and the developers can look at it, check what the regression was and in which benchmark it appeared, and then figure out which change caused the regression. I'm going to diverge here a little and talk about the Gen4 planner. Gen4 is a new VTGate planner, meant to succeed V3. This is something we've been working on in Vitess lately; currently it is an experimental feature, but eventually we're going to make it the default. Gen4 is a much more advanced planner than V3 and provides larger query support: a bunch of correlated subqueries will start working with Gen4.
There are newer primitives, semi-join has been introduced, and work on supporting filtering at the VTGate level is ongoing, so eventually Gen4 is going to become much more powerful than V3. Not only that, it also creates optimized plans which are much faster at execution time. But how can we say with certainty that the optimized plans are faster, or that the plans Gen4 produces are actually better? How do we gain confidence that Gen4 is indeed a worthy successor to V3 and that we can make it the default in Vitess? The way we do that is via the benchmarks. We run the macrobenchmarks for both V3 and Gen4 and then compare the results: we look at the queries served per second, the memory usage on the VTTablets, the CPU usage on the VTGates, and all these metrics, and then we check whether Gen4 actually improves the latency that the user is going to see. Enough talk. We've covered what benchmarks are, the different types of benchmarks, what Gen4 is, and how we use benchmarks for Gen4; I think it's time we look at the website, arewefastyet, and talk about what users can gain from it. We've already covered what the developers gain and how it improves the developer experience; it's time to look at what end users can gain from arewefastyet. This is the main homepage of arewefastyet, with a brief overview much like the one I just gave you, and here we also describe the server we actually use to run the macrobenchmarks: an Equinix Metal m2.xlarge.x86 server. For the enthusiasts among you, you can look at the server specifications and see exactly what hardware we use to run our macro- and microbenchmarks; more on this later. The next page I'm going to show you is the page with our daily cron runs.
As we discussed a little while ago, we track the main branch and run the macro- and microbenchmarks for it every day if the git commit has changed, and the results for the past 30 days are shown here. This tracks the OLTP runs: the transactions per second, the QPS, all of this is shown here over the past month, and we do the same for TPCC as well. The next page I'm going to show you is the search page. Here you can enter a specific git commit hash you're looking for: you take a hash from GitHub and look up the results for that specific commit. If results exist for it, they are displayed here and you can look at the macrobenchmark results for the OLTP runs. We also track the CPU time individually, for example what the CPU time for the VTTablet was, the number of bytes allocated on VTGate, the number of bytes allocated on VTTablet; all of this information is available here for both macrobenchmark categories, TPCC and OLTP, and we also have a bunch of microbenchmarks here showing their runs, the number of iterations, all the information you need. I can even go into one of these microbenchmarks and it will show its past runs. So these are the past ten or so runs for the microbenchmark we're looking at; it also shows the start time, exactly when these were benchmarked, and the GitHub references for them, so you can take a look at this as well. This page caters specifically to one git hash; what if you wanted to compare two hashes together? That's what the next page does. On the compare page of the website, you can enter two git hashes and you will see their comparisons for both macro- and microbenchmarks, along with some nice graphs which show you graphically what the differences are.
So this one is actually a much older commit, and aa79 is a later commit. For this later commit, you'll see that there has actually been some performance improvement (we'll talk about that a little later as well): the amount of CPU time we were using actually decreased. So this is where you can see the comparison between two commit hashes, along with the comparison of the microbenchmarks as well; all the information you would need for comparing two hashes is here. All of this is great, but there's actually something even more useful. On the dedicated microbenchmark page, you can compare the results for different tags of Vitess. Say, for example, you're a company running release 10 of Vitess, you're looking to upgrade to release 11, and you want to see what differences or what performance improvements you can expect when you go from release 10 to release 11. In the drop-down boxes here, you can simply select the version you're on and the version you want to compare it against; it's currently showing the comparison between release 10 and release 11, and you'll see the microbenchmark results for all of them. We have a very similar interface for the macrobenchmarks as well, where you have the results for two tags that you can compare; you don't need to search for specific GitHub hashes, you can directly compare tags. Here we're comparing release 10 versus release 11, and if you look here, there is actually a 5% increase in the QPS that release 11 was able to serve over release 10 for the OLTP workload.
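The improvement figures quoted on these pages are simple relative changes between two runs. A small sketch of that arithmetic, with made-up QPS numbers for illustration (these are not the actual measured values):

```go
package main

import "fmt"

// percentChange returns the relative change from oldVal to newVal, in percent.
func percentChange(oldVal, newVal float64) float64 {
	return (newVal - oldVal) / oldVal * 100.0
}

func main() {
	// Hypothetical OLTP QPS for two releases (illustrative values only).
	qpsRelease10 := 10000.0
	qpsRelease11 := 10530.0
	change := percentChange(qpsRelease10, qpsRelease11)
	fmt.Printf("QPS change: %+.1f%%\n", change) // prints "QPS change: +5.3%"

	// A comparison tool might flag a regression when the change falls
	// below some negative threshold:
	if change < -5.0 {
		fmt.Println("regression detected")
	}
}
```

A positive change means the newer version served more queries per second for the same workload; a sufficiently negative one is what would trigger the Slack notification mentioned earlier.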
It is because of this benchmarking tooling that we can say with confidence that users with a workload similar to OLTP can expect a performance improvement of 5.3% when they move from release 10 to release 11 of Vitess. Similarly, someone with a workload closer to TPCC can expect an improvement of 3.4%. The confidence with which we say there has actually been a performance improvement from release 10 to release 11 comes from these numbers we are seeing here. Finally, we have created one special tab. I mentioned that Gen4 is a new planner, and we also need to benchmark what that new planner is doing. So we have a specific page called V3 versus Gen4, which shows the results of the comparisons between V3 and Gen4. Here as well you can select which tag you want to check the comparison for. Say, for example, you are on release 12 and you want to see the comparison between V3 and Gen4 for release 12: you can take a look at that as well. Here you can see that Gen4 is slightly better than V3, but most of Gen4's benefit as of now actually comes from the fact that it can support a lot more queries than V3. Here you can compare the results, and if you click through to actually see the query plans (this is kind of a beta feature, still a work in progress), we list all the queries we run in the OLTP workload. These are generally point queries, and you can look at the plan that V3 and Gen4 each build for them. So this is the query we got: a SELECT, in normalized form, so all input arguments have been converted to bind variables, and then you can look at what VTGate is going to do with it.
You can see it routes to a single shard, because it's a "select equal unique" query, and this is also the place where you can look at the plans VTGate is going to create; Gen4 and V3 can produce different plans, which is where the optimizations come into play. This is the TPCC page for the same thing; if you look at it, you'll see the plans differ between V3 and Gen4. All of this is on the website, and it lets users know beforehand, even before they start testing in their own local environment, what differences they can expect: what performance improvement they can get, whether VTGate will actually start using more CPU time, whether it will start allocating more bytes, whether the Gen4 planner is actually worth it, whether it will give you better performance. All of that can be answered, with high confidence, because of the benchmarking tooling we have in arewefastyet. One more thing I would like to add: Gen4 is under active development right now, and we're looking for ways to improve its performance. So if any of you are running Vitess in production, we would love it if you could share your production queries with us, so we can optimize Gen4 to work as well as possible and produce plans as optimal as possible for your specific workload. If any of you are willing to take us up on that offer, please find us on the Vitess Slack. That is all from my side; Florent will now take over and talk about how the benchmarking is built from the ground up. Thank you very much.

Hello everyone, my name is Florent, and today I'll be talking to you about the internals of our benchmarking system. So let's get started, first with the implementation that we use for arewefastyet. We have two prominent servers. The first one is the website: its primary goal is to serve the web UI that Manan showed you earlier, and its secondary goal is to handle and manage the whole execution of the benchmarks.
When I say the whole execution of the benchmarks, I mean the cron schedules, and just the cron schedules. Secondly, we have the metrics server; this one is scraping the benchmark results and storing the data. Another key point of our implementation is that we spawn a new server for each benchmark we run. This gives us more reliable results and makes sure that a benchmark is not influenced by anything else while it runs. Each new server is provisioned on Equinix Metal; CNCF has a partnership with Equinix Metal for projects like Vitess. And the final key point of our implementation: we store all the benchmark results and the metadata in MySQL. In fact, it's not directly MySQL; we store it in a Vitess cluster. Let's talk about the execution pipeline. It's a very important part of the implementation of arewefastyet: its goal is to manage the whole execution workflow from start to finish (we'll get to what the whole execution workflow means in a bit). The pipeline is configurable through a YAML file, meaning we can have tons of different executions. For example, the microbenchmarks and macrobenchmarks that Manan just explained are two different types of benchmark, and they have different YAML files. So what is the whole execution workflow, and what are its responsibilities? First, the creation and configuration of a new server. This is the very first step: we want to make sure we create a server and configure it based on a very detailed and explicit configuration. Another responsibility is the setup of a Vitess cluster: depending on the type of benchmark, we may or may not want a Vitess cluster, so this is part of the execution pipeline's responsibilities. Another one is the actual execution of the benchmark, running it. Then we need to store the results and the metrics, and finally we upload and publish those results.
There are also intermediate steps between those responsibilities: aggregating the results, compiling them, and checking whether there is any regression. All of this is part of the execution pipeline. This is the architecture of our execution pipeline: we have seven steps, and I'm going to go through them quickly. The first one, like I said, is the configuration: we feed the execution pipeline a configuration and then everything gets created like magic. The second step is provisioning: we create the execution server using Terraform. This provisions and configures the very basic things about the server, and it provisions on Equinix Metal, like I just said. The third step is the configuration of the server. By configuration I mean the installation of packages, the installation of the tools we need, setting up the network, setting up the disks, everything. This is done using Ansible; for those who don't know Ansible, it's basically a configuration tool that lets us automate and easily apply a bunch of configuration to servers. The fourth step is the start of the benchmark: we actually want to start the benchmark and start recording the results. Then we have the benchmark running, which is the red rectangle here; the benchmark can be either a macrobenchmark or a microbenchmark. The fifth step is storing the results. This can be done either at the end of the benchmark or during the benchmark, throughout its whole execution. We store the results in different locations. The most basic one is MySQL: this holds the basic information, for example the type of the benchmark, the ID of the benchmark, things like that, like the git SHA for example. Then we also store information in Prometheus.
The information we store in Prometheus is, obviously, time series data: all the metrics, for example the CPU usage we had for a specific benchmark at a specific time. The data we have in Prometheus we later store in an InfluxDB server; this is done to keep the data longer: we keep as little as possible inside Prometheus and store it for the longer term inside InfluxDB. The sixth step is destroying, tearing down, the execution server; again, this is done using Terraform. The seventh step is publishing the results. Like I said before, right before publishing the results we have to compile them, aggregate them, and determine whether there is a regression or not. Depending on those outcomes, we may or may not notify people on Slack, the web UI will show up differently, and so on. So that's the architecture; I'm now going to move on to how we run a microbenchmark. Of the two types of benchmark we have, this is the simplest one. Why? Because Vitess is written in Golang, and Go has an amazing testing framework which includes a microbenchmarking tool: when you write tests in Go, you can either have an actual unit test or a benchmark test, and we have a bunch of those inside the Vitess code. The goal of the microbenchmark run is just to execute all of those benchmark tests. We get an output every time we execute one; we parse the output, keep the relevant information, aggregate the information from all the different microbenchmarks, and finally store it. Like Manan mentioned earlier, a microbenchmark gives us a multitude of relevant information, for example the time it takes for a function to execute, the memory usage of a function, and so on.
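To make the "parse the output, keep the relevant information" step concrete, here is a hedged sketch of parsing one line of `go test -bench` output. arewefastyet's real parser may work differently (for example, it could consume a machine-readable format instead), but this shows the idea:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseBenchLine extracts the name, iteration count, and ns/op from one
// line of `go test -bench` output, e.g.:
//   BenchmarkParse-8   50000   31415 ns/op   2048 B/op   12 allocs/op
func parseBenchLine(line string) (name string, n int, nsPerOp float64, err error) {
	fields := strings.Fields(line)
	if len(fields) < 4 || fields[3] != "ns/op" {
		return "", 0, 0, fmt.Errorf("not a benchmark line: %q", line)
	}
	name = fields[0]
	n, err = strconv.Atoi(fields[1])
	if err != nil {
		return "", 0, 0, err
	}
	nsPerOp, err = strconv.ParseFloat(fields[2], 64)
	return name, n, nsPerOp, err
}

func main() {
	name, n, ns, err := parseBenchLine(
		"BenchmarkParse-8   50000   31415 ns/op   2048 B/op   12 allocs/op")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s: %d iterations, %.0f ns/op\n", name, n, ns)
}
```

After parsing, each result can be aggregated with the other microbenchmark runs and written to storage, as the pipeline above describes.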
This is all part of the microbenchmarks, and we will come back to them during the demo. Macrobenchmarks are the second type of benchmark. That type has a longer execution time. Why? Well, first of all, we have to install a lot more packages, for example MySQL, and set up more things: the Vitess cluster needs to be created and has to be up and running, hence the longer execution time. Second, we use sysbench to actually do the benchmarking of the Vitess cluster; sysbench is a widely known tool which I'll get back to in a coming slide. Another important part of a macrobenchmark is distinguishing the different types of benchmark: we're going to get a lot of different results for all of the macrobenchmarks, and we have to make sure a given result came from this type of macrobenchmark or that other type. We can have tons of different types, and each type measures a specific thing about Vitess. A final big part of a macrobenchmark is the aggregation and storing of metrics. Measuring the result of a macrobenchmark is easily done using sysbench, but it's usually not enough; we have to get more information, for example the CPU usage of the host while we ran the benchmark. This is very important: we want to know, over the 20 minutes that the benchmark ran, how long the CPU was used for, which matters when comparing across different versions. As I said, the Vitess cluster for the macrobenchmarks uses a topology of 2 VTGates, 6 tablets, and etcd as the topology server. This is the configuration we've used so far, and we're aiming to change it: we want to run some tests to discover which topology might or might not be the best for the benchmarks, which one gives us the most reliable and accurate results. Sysbench, the tool I mentioned: like I said, it's widely known, it's also highly configurable, and it's based in Lua
and also in C, the C language, which means you can add different workloads, that is, different types of benchmarks, using Lua files. It has three steps; at least, we use three steps inside arewefastyet. The first is the prepare step: this creates the database, creates the tables, inserts the data into them, and so on. The second step we use is the warm-up: this runs like an actual benchmark, but its results don't count; the goal is just to warm up the system, get the caches and the network going, everything ready and looking almost like an actual end-to-end situation. The final step is the execution, also called the run step inside sysbench: this is where we actually send a lot of queries to Vitess and where we measure the performance. We have two custom forks of sysbench. The first one, linked here as planetscale/sysbench, includes a TPC-H benchmark and a different way of formatting the results: sysbench usually formats results as plain text, and what we do instead is output them as JSON. The second fork is just a bunch of Lua files for the TPC-C benchmark. TPC-C and TPC-H are two big benchmarks in the database world; they're part of the TPC family, and if you know a bit about database performance you might know them. Here is a sample of sysbench results. It's pretty simple and straightforward: we have the time the benchmark ran, in this case 10 seconds, the number of threads we used, the number of transactions per second (TPS), and then a block with QPS, queries per second, where we have the total, the number of reads, the number of writes, and then "other", which covers for example BEGIN, COMMIT, etc. We also have the latency and the number of errors per second, and the same for reconnects, the number of reconnections we had. But as we can see, this is not enough for us to tell if there is a regression
between version A and version B of Vitess. It's already good, a lot of data, but it's not enough, so we've added support for metrics. Inside each execution server we start a Prometheus server which scrapes data from the different components of Vitess and from the host, the different components being VTGate, the VTTablets, vtctld, etc. Some of the interesting metrics we want to look at are CPU and memory usage, the Golang metrics (for example goroutines), and finally disk I/O and the network. Those are all the different types of metrics we want to keep and that might be interesting if we ever see a regression. An important part of metrics is obviously producing them, and then we have to scrape and store them. To export the metrics from a benchmark server, we use Prometheus to gather them all; this runs inside the execution server. Then, because the execution server is eventually going to be killed and destroyed at the end of the benchmark, we want to make sure the metrics we collected survive for a long time, so we can use and analyze them in the long run. For that we use the metrics server, whose job is just to scrape and store the metrics. Like I said before, the metrics are duplicated to an InfluxDB server for longevity, and finally the metrics can be visualized on the web using Grafana: we have a couple of dashboards that let us say, show me, for example, the network usage for this benchmark at that time. I'm now going to move on to a demo, where I'll present the execution of a macrobenchmark and show you a bit of the Grafana UI and the metrics. All right, here we can see a configuration file. This is a YAML file, and we're going to use this file to feed the execution pipeline and configure the whole benchmark. This benchmark here is an OLTP benchmark, so it is a macrobenchmark, and as you can see, we have to define, for example, all the
Equinix configuration: the token, the project we want, and the instance type we want to use. Then we feed in, for example, the commit we want to benchmark, the type and the name of the benchmark, and then the database where we're going to store all the results, and the different time series databases we use: this one is Prometheus, this one is InfluxDB. Further down we have all the macrobenchmark configuration: this whole part here is what we're going to give to sysbench to configure it. For example, here we want 50 tables of a certain size, we want to use 100 threads, and then we configure the different steps of sysbench. Like I said before, we have the prepare step, the warm-up step, and then the run step, all of which are configured differently; for example, the durations here are 30 seconds, 10 seconds, and then 900 seconds. Then here we have an Ansible file that defines most of the variables and configuration we use to set up the server we create. Let's focus on this part here, which lists all the host children. If we look here we have VTGate, and here we define all the different VTGates we want: how many VTGates, the ports for each, and so on. Here we can see that we have six gateways, so six different VTGates, and here we have the same thing for the tablets, where we can see that we define two different VTTablets. Now I'm going to use that configuration file to actually start a benchmark. I'm just going to use the command line here manually and feed it the configuration file we have; usually all of this happens within the crons and we don't have to start benchmarks manually using the CLI, but just for the purpose of this demo I'm going to start one manually. So I type this command, press enter, and what happens is that it starts creating the server: it uses Terraform to build and provision the infrastructure on Equinix
Metal, and once that is ready, Ansible starts configuring the benchmark itself. As we can see now, Ansible is configuring the host, the server where we're going to run the benchmark. At the moment it's downloading all the dependencies needed to run Vitess, specifically MySQL right now; then it's going to build the actual Vitess binaries, then create and start the Vitess cluster with the topology we saw in the Ansible file, and once this is done it's going to start the actual benchmark using sysbench. All right, as we can see here, the benchmark has finished; it lasted 50 minutes, and now we can check out the results. So I'm logging into MySQL, and then I'm going to run a specific query to get the results for my benchmark. I can start with the query to get the latest execution, the latest benchmark. We can see the unique ID here, which is the one we're going to use to retrieve the different results, and here we can see the transactions per second, the latency, the number of errors, the time the benchmark took, the number of QPS, and the number of reads and writes per second. All of these results are displayed on the website that Manan showed you, and now I'm going to show you what it looks like on Grafana. On Grafana we can visualize the different metrics that we collected during the benchmark. As we can see here, we have all the host metrics, so CPU usage, memory, network, etc.; we can also take a look at the different queries, like the queries in MySQL, or Vitess-related metrics, for example QPS, the success rate, and the time each query takes; and we can also see a more general overview of the Vitess cluster we had for the benchmark. That's it for me, thank you very much.