The name of this project was VIFL, which stands for Vitess in front of the legacy shards. We will be talking about how we migrated more than a thousand MySQL shards to Vitess, and we will be sharing techniques that we believe could be applied at other companies in the industry.

First, a little bit of introduction about ourselves. My name is Rafael Chacón, and I'm a staff software engineer on the datastores team at Slack. I've been part of this team since Vitess was just a prototype; I was here when we developed that prototype and took it to production. As part of this journey, which is almost four years now, I've also been very involved with the open source community in the Vitess project, and I'm a maintainer there as well.

Hello everyone. My name is Guido Iaquinti, and I work as an engineer on the datastores team at Slack. I'm one of the founding members of the team, where I designed, built and deployed the first implementation of Vitess.

Here's a quick overview of today's agenda. We're going to start with an introduction to databases at Slack. We'll then go through the legacy shard architecture that we planned to deprecate, and then jump into the details of the VIFL migration project: challenges and strategy, validation, and automation. We'll close with a few final remarks and leave some time at the end for Q&A.

For those of you who are not familiar with our product, Slack is a new layer of the business technology stack that brings together people, applications and data. Slack is where work happens. Our mission is to make people's working lives simpler, more pleasant and more productive.

First, a little bit of context about the scale we're working at at Slack. We have around 12 million daily active users that generate 65 billion queries per day. At this point we have around nine petabytes of data stored, spread across thousands of database servers around the world.

To put the project we're going to be talking about in context, it's important to understand the architecture that we call the legacy shards, so let's talk about that. In the beginning, when Slack was a much simpler application, we had a monolith that we call the webapp, running on HHVM. For the most part this still exists, but now there are many other components around it. The way we architected our datastores was fairly simple, and it allowed us to scale pretty well for the first four or five years of the product. It looked mostly like this. On the left we have a set of databases that we call the aux shards, holding the kitchen-sink data that didn't belong to any team. We also have a set that we call the mains, which stores the mapping of where each team's data is stored. Every time we get a request (this mapping is of course heavily cached, but the source of truth is still the database), we go to the mains database for the team in the request, in this example team ID 1, and select which DB shard stores that team's data. With that information, the application can connect to that specific team shard and, from that moment on, talk only to that database.
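To make that routing flow concrete, here is a minimal sketch of the lookup just described. It is only illustrative: the table, column and helper names (team_shard, shard_host, connect_to_team_shard) are hypothetical stand-ins, not Slack's actual schema or library, and in production the lookup is heavily cached.

```python
# Illustrative sketch of legacy routing: look up a team's shard in the mains
# database, then connect to that shard for every subsequent query.
# Table, column and connection details are hypothetical, not Slack's schema.
import pymysql

def get_shard_for_team(mains_conn, team_id):
    # The mains database is the source of truth for the team -> shard mapping.
    with mains_conn.cursor() as cur:
        cur.execute("SELECT shard_host FROM team_shard WHERE team_id = %s", (team_id,))
        row = cur.fetchone()
        return row[0] if row else None

def connect_to_team_shard(mains_conn, team_id):
    shard_host = get_shard_for_team(mains_conn, team_id)
    # From this point on, the application talks only to this team shard.
    return pymysql.connect(host=shard_host, db="slack")  # credentials omitted
```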
This model worked quite well for many years, but with time it started to show some limitations. The main limitations were at the product level: basically, the product requirements evolved in a way that storing all of a team's data in a single shard no longer made sense. For instance, you can now connect channels between different companies; in which team's shard would you store that channel's data if it's shared between two different teams? So this model, although it was simple and flexible at the time, no longer met the needs of our application.

From the database side there were also other limitations; I'll mention just one here as an example: hotspots were a thing. For our biggest and most active customers you would see a distribution like the one in this graphic: a few shards with tons of load, and then a long tail of shards that are not that heavily used. This was a problem both from an efficiency perspective and from a scale perspective. At some point we literally could not get bigger instances in AWS to support our biggest customers.

It was with these challenges that, four years ago, we started to look at the next generation of datastores at Slack. At a high level, these were the requirements we had in mind. We needed to support MySQL, because the application relies heavily on it. Because of the product requirements I mentioned earlier, we needed flexible sharding strategies, so each table could be sharded by a different key. We also wanted to move the sharding abstraction out of the application and into the infrastructure. And of course, we wanted it to be horizontally scalable and highly available. This is how we landed on Vitess. For the purposes of this talk, I'm not going to do a deep dive on what Vitess is or how it works; we have covered that in multiple talks already, and you can refer to those if you're interested in the topic. For this presentation, I just want to highlight that Vitess provided the right abstractions that we needed to scale Slack's databases.
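As an aside on the flexible-sharding requirement above, this is roughly what "each table can use a different key to shard" looks like in a Vitess VSchema. The fragment below (written as a Python dict for readability) is only a schematic example with made-up table and column names; it is not Slack's configuration.

```python
# Schematic VSchema fragment showing per-table sharding keys in Vitess.
# Table and column names are made up; only the structure mirrors a VSchema.
vschema = {
    "sharded": True,
    "vindexes": {
        "hash": {"type": "hash"},
    },
    "tables": {
        # One table sharded by channel_id ...
        "channels": {"column_vindexes": [{"column": "channel_id", "name": "hash"}]},
        # ... another sharded by team_id.
        "team_settings": {"column_vindexes": [{"column": "team_id", "name": "hash"}]},
    },
}
```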
Now let's get into the core of this presentation: Vitess in front of the legacy shards. In the first two and a half years of this project, we focused on migrating the most complicated and highest-volume tables. It was not that many tables, somewhere between 10 and 20, but they represented 70% of all of Slack's database workload. Each of these tables was a multi-quarter project with many engineers involved. In reality it was not so much a migration as a re-architecture of each of these tables: we took the opportunity to do a bunch of cleanup, we resharded the data, and it was heavily involved. I'd actually like to show you how you could picture these migrations, in a short video: a bunch of application engineers watching very closely while the tablecloth is pulled away, making sure that everything is still in place. It required a lot of focus, a lot of understanding and a lot of watching to get it right. That is how we did these traditional migrations.

But this was not going to work for the remaining 30% of the traffic, because that 30% was spread across more than 200 tables, and all of our strategy, tooling and framework had been designed to be used on a table-by-table basis. In the best of cases, each table took about a month. So with the approach we had at the time, it was going to take 16 years to finish the migration. Of course that is a bit exaggerated, but it was obvious that the normal approach was not going to work. We needed to come up with a different strategy to migrate those 200 tables, and we wanted to do it in a year. We wanted to do it with minimal disruption to application developers, so they could keep working without even knowing we were moving these tables. And of course, we didn't want any downtime while we were performing the migration. What we wanted looked more like this: a lot of instrumentation, a lot of automation, and then migrating all the tables at once, with no one noticing, very fast and in a very reliable way.

So now let's talk about how we did this. First, we need some context about the topology at the MySQL level. Let's start with the legacy one. We were running our clusters on MySQL 5.6, using statement-based, asynchronous replication. We also had a very unique setup with pairs of databases in primary-primary mode, where each side was writable. This is quite unique to Slack; not many companies run MySQL in this mode, and it had the side effect that it was complicated to find a path forward to even upgrade MySQL with this topology. On the Vitess side, we had a more standard topology on MySQL 5.7: row-based replication, a single writable primary host, and multiple replicas that we can use for reads. Another interesting fact is that we use semi-sync replication there.

Basically, we needed to go from the topology on the left to the topology on the right. As I mentioned earlier, due to the primary-primary setup it was very complicated to do this migration; it was so dramatic that we hadn't upgraded MySQL in multiple years. Just thinking about it was very challenging. However, we were able to come up with a framework that gave us a path forward. We came up with a series of steps, where each step in isolation allowed us to move forward with the migration. First, we wanted to see whether we were able, in a vacuum, to restore and upgrade these shards. Then we wanted to find a way to synchronize them with the legacy topology, validate that everything was working, and finally find a way to migrate them.

So what did restore and upgrade entail? Basically, we set up an empty shard in a Vitess cluster. As you know, the way you route to shards in Vitess is through VTGate; this will be important later in the presentation, so I want to introduce it here so you have it in context. We bootstrap an empty shard that is going to represent the equivalent of its legacy version. Of course, we set it up from the get-go with the topology we use on Vitess: row-based replication and the standard primary-replica setup. The way we do this is that we start empty, we seed the shard with the latest available backup from our legacy infrastructure, we run the restore in place, and we upgrade MySQL, so the data set is migrated to the new version. Then we take a backup of this seed shard and cycle all the hosts, and we end up with a vanilla Vitess shard seeded from a backup of our legacy infrastructure.
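Sketched as code, the restore-and-upgrade step looks roughly like the sequence below. Every helper is a hypothetical stub standing in for internal tooling; only the order of operations comes from what we just described.

```python
# Hypothetical sketch of the "restore and upgrade" sequence. The helpers are
# stubs for internal tooling; only the order of operations is the real one.

def provision_empty_vitess_shard(shard: str) -> None: ...  # empty shard behind VTGate
def restore_legacy_backup(shard: str) -> None: ...         # seed from the latest 5.6 backup
def upgrade_mysql_in_place(shard: str) -> None: ...        # bring the data set to 5.7
def take_vitess_backup(shard: str) -> None: ...            # back up the seeded shard
def cycle_hosts(shard: str) -> None: ...                   # rebuild primary and replicas

def bootstrap_shard_from_legacy(shard: str) -> None:
    provision_empty_vitess_shard(shard)
    restore_legacy_backup(shard)
    upgrade_mysql_in_place(shard)
    take_vitess_backup(shard)
    cycle_hosts(shard)
```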
So now that we have this shard in Vitess seeded from a backup, how do we keep it in sync with our legacy infrastructure? This was a big challenge: because of the topology and the difference in MySQL versions, we couldn't just set up normal MySQL replication and make the new shard a replica of the legacy one. To solve this problem, we leveraged a core component of Vitess called VReplication, which implements the MySQL replication protocol, and we used it to shovel data from an external database, in this case our legacy shards, to a target shard in Vitess. We actually implemented the functionality that allows VReplication to connect to external databases that are not part of the Vitess ecosystem.

And this is what it looks like; it is as simple as this. It is a first-class API on the Vitess side, and you create the workflow much like a normal row in the Vitess configuration. You say: create this workflow, with a rule that matches basically everything on the source; you specify that the source is an external MySQL database; and you specify where in the replication stream you want to start replicating from. From that moment on, the tablet process takes over and makes sure the shard keeps replicating with the same level of guarantees that normal MySQL replication gives you. At this point, we have a Vitess shard seeded from the backup, replicating from the legacy source shard.

When we got to this point it felt very powerful. We were super excited, and we knew we had a path forward to finish this migration. However, we needed to build confidence that it was working as we expected. We needed to know that we didn't leave any data behind. We also needed to verify that the view from the application's perspective matched. And maybe just as important: was this process reliable? Could we trust that replication was not failing, or hitting edge cases that could cause data loss?

We divided this problem into two perspectives, the databases and the application, and we made sure that both correctness and scale were right from each perspective. If these systems had been isolated, in a vacuum, the problem would have been easy: take each table on the source, compare it against the same table on the destination, and if something doesn't match, something is failing. The challenge is that these systems were taking writes, so this was not an easy task. Still, we found a solution that is actually quite simple; the only complication is that the systems are taking traffic, as I mentioned before. The idea is to build a consistent snapshot between the two databases. The way we did this is that we start by stopping replication, so we know no data is flowing into the destination system. Then we issue a query to lock a table on the source database. We then select everything from that table in a streaming fashion, record the replication position at that point, and unlock the table. This operation is really fast, because we only hold the lock from the moment we issue the select until it starts streaming, we record the position and we unlock the table. At this point, we know we have a read streaming from the source at a point in time. Now we need to make sure we can read from that exact same point on the destination. The way we do that is that we start replication on the destination and stop it at the position we recorded earlier. Because we locked the table when we issued the select on the source, we know that if we issue a select on the destination at this time, the two will be at exactly the same position, with no writes in between those two moments. And that was pretty much it: we can start replication again and begin comparing the data from the two streams, source and destination. Once that comparison is done, we repeat the process for each table in the databases.
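Here is a hedged sketch of that per-table consistency check. The replication-control helpers (stop_replication, start_replication_until) are hypothetical stand-ins for the VReplication workflow controls, and the schema details (an id primary key, GTIDs enabled) are assumptions; the MySQL statements themselves (LOCK TABLES ... READ, reading gtid_executed, a streaming SELECT) are standard.

```python
# Sketch of the per-table consistency check. stop_replication() and
# start_replication_until() are hypothetical wrappers around the VReplication
# workflow; the rest is plain MySQL via PyMySQL.
from itertools import zip_longest
import pymysql
import pymysql.cursors

def snapshot_stream(conn, table):
    """Open a server-side (streaming) cursor over the whole table."""
    cur = conn.cursor(pymysql.cursors.SSCursor)
    cur.execute(f"SELECT * FROM `{table}` ORDER BY id")  # assumes an `id` primary key
    return cur

def table_in_sync(source_lock_conn, source_read_conn, dest_conn, table,
                  stop_replication, start_replication_until):
    # 1. Stop the VReplication stream so nothing flows into the destination.
    stop_replication()

    # 2. Briefly lock the table on the source, start a streaming read on a
    #    second connection, record the replication position, then unlock.
    with source_lock_conn.cursor() as admin:
        admin.execute(f"LOCK TABLES `{table}` READ")
        source_rows = snapshot_stream(source_read_conn, table)
        admin.execute("SELECT @@GLOBAL.gtid_executed")  # or SHOW MASTER STATUS
        position = admin.fetchone()[0]
        admin.execute("UNLOCK TABLES")

    # 3. Let the destination catch up to exactly that position and stop there,
    #    then stream the same table from it. (In the real procedure replication
    #    can be resumed while the two snapshots are being compared.)
    start_replication_until(position)
    dest_rows = snapshot_stream(dest_conn, table)

    # 4. Compare the two streams row by row; any difference means drift.
    return all(src == dst for src, dst in zip_longest(source_rows, dest_rows))
```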
The validation of the databases assures us that the datastores are in sync up to a specific point in time. But there is still a validation we need to do to know whether this also holds from the application's perspective. Is the data matching there too? And what about real-time traffic? We divided the application-side validation into two steps: scale and correctness.

In the first, we validated that the expected performance regressions fit within an acceptable bound. In our case this was one additional round trip during read operations, due to the extra network hop through VTGate, and two additional round trips during write operations, due to the extra network hop through VTGate plus the extra round trip for the semi-sync ack needed by MySQL replication. This slowdown was acceptable to the application and it wasn't noticeable in overall system performance.

In the correctness phase, we validated that there were no query regressions from a syntax perspective due to the query rewrite engine in Vitess, and also that query results matched between the legacy systems and Vitess. To perform the validation, we built a framework by extending our database library to execute queries against both systems and evaluate the results. As mentioned before, we were interested in both performance and result correctness. The framework supported two main operation modes: dark reads and dark writes. In the first, only read operations were sent to and evaluated against both systems, while for the latter we were sending only write workloads. We also turned on both modes at the same time, to evaluate scale when running with 100% of the expected workload.

Here is a visualization of a query routed via the dark-read path. Once our main client, the webapp, hits libdb, we leverage HHVM's asynchronous functionality to issue the same request to both datastores. We then wait to receive the results, evaluate the latencies, and discard the result from Vitess while serving back to our application the response from the legacy system.
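As a rough illustration of that dark-read path, here is a small Python analogue. The real implementation lived in our Hack/HHVM database library (libdb) and used HHVM's async primitives; the client objects and the report callback below are hypothetical.

```python
# Rough analogue of the dark-read path: issue the same query to both backends
# concurrently, record latencies and whether the results match, but always
# serve the legacy response. Client and callback names are hypothetical.
import asyncio
import time

async def timed(coro):
    start = time.monotonic()
    result = await coro
    return result, time.monotonic() - start

async def dark_read(legacy_client, vitess_client, query, params, report):
    (legacy_rows, legacy_latency), (vitess_rows, vitess_latency) = await asyncio.gather(
        timed(legacy_client.fetch(query, params)),
        timed(vitess_client.fetch(query, params)),
    )
    # Record the latency delta and whether the result sets matched ...
    report(query, legacy_latency, vitess_latency, legacy_rows == vitess_rows)
    # ... but the application only ever sees the legacy result.
    return legacy_rows
```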
The majority of query results were matching, but we had to manually investigate some outliers. As the result set wasn't huge, we used a simple spreadsheet to coordinate the investigation between team members. The validation phase ran for over two months, and we collected the performance regressions and result mismatches for all queries. We then analyzed every single issue; the majority of the errors were driven by changes in result ordering in MySQL and by places where the application expected read-after-write semantics. It's worth mentioning that we got to this phase within the first four months of the project. Prototyping fast and having results quickly helped us a lot to iterate on the overall design. Fortunately, we didn't have to change any of the core assumptions we made, but having early feedback, especially for a complex project like this one, was very important and assured us that we were going in the right direction. Once we had evaluated every single issue, we concluded it was safe to proceed.

Let's now take a look at how the actual migration procedure works. Here is the view of a single shard in its pre-migration state, where the destination shard in Vitess has already been provisioned, the VReplication stream is configured, and the validation step has verified that both datastores are in sync. From the application's perspective nothing has changed, and the only overhead we have introduced to the system is an additional replication stream from the legacy shard primary to Vitess.

Now that we know validation passed correctly, we are ready to migrate our shard. We start by pointing libdb to send all traffic to a single legacy shard host, and we then wait until all the transactions from the legacy shard host on the left have been replicated to the one on the right. We have now reached what is probably the most important moment of the whole migration, as in the next step we reach the only point of no return in the procedure. Because of its importance, before moving forward we needed to validate that all the metrics were nominal and that the two systems were still in sync. Once we validated that all metrics are okay, we toggle libdb to route traffic to the new system, making Vitess the only authoritative source for that specific shard.

As this operation is not atomic, a few of you might have already noticed that there could be a race between the moment we execute a query in legacy and the moment we execute another one in Vitess involving the same rows. We considered this serialization issue, but for the tests and simulations that we ran, we validated that it couldn't create problems in our workload and product, especially as the timeframe of this phase was on the order of milliseconds.

Now that Vitess is the authoritative source for our application traffic, we only need to make sure that all the transactions from the legacy shard made it to the new datastore. Once we verify that, we can stop the VReplication stream in preparation for decommissioning the old legacy shard. As the old system is no longer serving any traffic, we can take one final backup and then decommission it entirely. The legacy shard is now a thing of the past.

We validated that the core idea works. Now we only have to repeat this process more than a thousand times, with no errors and zero downtime. So we had another task: build the automation to make it happen. When we started this task, we decided to simply build automation that executes the VIFL migration procedure in a way that is repeatable and safe. Due to the constraints of the overall project, we didn't have much time to develop a new solution, so we tried to reuse as many components as we could. We built the automation in Python, leveraging our internal library SlackOps. SlackOps offers several modules and functionality for interacting with internal systems at Slack, like monitoring, service discovery, provisioning systems, and so on.

It's now time to build, but how can you make sure you are going to succeed? We tried to approach the problem with a very defensive attitude. We decided to build the automation as a finite state machine made up of several discrete steps. A nice property of a state machine is that each component is predictable: based on the current state and an input, the machine performs a state transition and produces an output. This helped us implement very strict boundaries and allow only the actions needed for each step, making the whole development faster, easier and safer.
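A hypothetical sketch of what such a migration state machine could look like is shown below. The state names follow the cutover steps described above, but the structure and helper signatures are illustrative, not the actual SlackOps-based tool.

```python
# Hypothetical sketch of the cutover automation as a small finite state machine.
# Each state handler does one narrowly scoped action and returns the next state;
# handler names and signatures are stand-ins for the real SlackOps-based tooling.
from enum import Enum, auto

class State(Enum):
    PIN_TRAFFIC_TO_SINGLE_LEGACY_HOST = auto()
    WAIT_FOR_REPLICATION_CATCH_UP = auto()
    VERIFY_METRICS_AND_SYNC = auto()
    CUT_OVER_TO_VITESS = auto()            # the only non-reversible step
    STOP_VREPLICATION = auto()
    FINAL_BACKUP_AND_DECOMMISSION = auto()
    DONE = auto()

def run_migration(shard, handlers):
    """Drive one shard through the migration; handlers maps State -> callable."""
    state = State.PIN_TRAFFIC_TO_SINGLE_LEGACY_HOST
    while state is not State.DONE:
        # Each handler is written to be safe to re-run, so a transient failure
        # can simply be retried from the current state.
        state = handlers[state](shard)
    return state
```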
We also made sure that every step of the state machine was safe to run multiple times. This was unfortunately a more time-consuming process, as achieving this property in a distributed system is usually not very straightforward. Fortunately, this investment paid off almost immediately, as it helped us build confidence in the tool and allowed both robots and humans to recover from transient issues during execution.

We briefly described state machine properties and idempotency as very valuable characteristics for our automation, but are those enough to achieve our strict requirements? Maybe. In our case, we also implemented some additional safety guardrails against other automation competing for shared resources, such as schema changes, backups and shard splits, as well as against human error. Actually, the majority of safety guardrails were triggered by human error, not by other automation.

Let's now close with some final remarks. This project was designed, built and executed by a team of four engineers in a timeframe of around a year. We calculated that, following the legacy migration path, it would have taken more than 70 engineers to deliver the same results in a similar timeframe. Equally important, this migration was completely transparent to our application engineers as well as to end users. We were able to leverage, and modify for our needs, some Vitess functionality to do the most delicate parts of this project: keeping the datastores in sync and verifying that no data was left behind. We moved hundreds of terabytes of data with zero downtime and not a single outage. Today, 99% of Slack traffic is on Vitess, and we expect to wrap up this migration by the end of the year.

Here is Vitess adoption at Slack before we started migrating shards via VIFL; the timeframe of this graph is 2.5 years. Here is the same view with six additional months: starting from May 2020, we were able to increase adoption by 30% in less than six months.

As things never fully go as expected, we had to face some unexpected challenges while preparing the migration of a few of our busiest shards. We hit a few MySQL 5.7 optimizer regressions that increased their latencies, and we also had to deal with the additional overhead of VTTablet and its Golang garbage collection. Breaking down the validation and migration steps was key for us, as was iterating over the initial design several times, balancing speed, execution time and safety.

Was this a success? We think it was. We accomplished everything we said we were going to do. Hitting some hiccups in a project of this size is normal, and we consider it pretty remarkable that the project's planning, building and execution was done within a year. The current state is that 99% of databases at Slack are running on Vitess today. Could you replicate this elsewhere? Yes, with a few caveats. If you want to know more about Vitess, here is a link to a suggested session from two of the core Vitess maintainers. Thank you very much for your time.