So our story started pretty early. Demonware has been around for over 20 years providing online services to games. Activision acquired the company, so Demonware was providing services to games even before that. Originally we just ran on bare metal, which was quite normal back then. As time progressed and the internet grew, we had to change how we did things, so we introduced sharding to our bare-metal databases, and that worked for a while. As time went on and our catalog grew, we had to adapt again, so we thought: why don't we use KVM and LXC to make things more flexible, build an orchestration layer on top, and make it more manageable. That was good until we hit thousands of databases and it became hard to manage. So we started building something on top of that which managed the connections, the failovers, the errors, and all of that. It worked, but we were still a little stressed by it. As we continued, we investigated different options, and along the way we figured out the challenges we were really trying to solve. Vlad is going to talk about that.

Yes, we started seeing multiple operational issues, and those became the challenges we set out to fix. We worked with multiple service teams and identified a few key problem areas those teams were seeing.

The first area was scalability. Activision launches several AAA game titles every year. A game launch is a very big event for us: many millions of players show up on the same date, around the same time. That may be a very pleasant problem for the business, but it's a very big technical challenge to solve. To give players a smooth launch experience, we prepare many months in advance and spend a lot of time load testing: how the new game client behaves, what new workloads we will see from the new title. We also try to predict how many players will show up on launch day. It's never possible to predict exactly, so we tend to over-build our database platform to handle any potential peak we might see during the launch. You would think that as the workload decreases after launch, we would shrink the databases to save resources. That sounds logical, but it's not that easy. We were able to scale out our databases in a certain hacky way, but we weren't able to scale back down without significant player impact. That's why we essentially never, or only very long ago, scaled a database down after a launch.

Another scalability issue worth mentioning is connections. Our application layer has no database connection pooling at all, so we were seeing databases running with more than 10,000 connections, which was extremely painful and caused MySQL performance issues around CPU and memory.

The next area is operational burden. At our scale we see hardware failures happening every day, and those affect databases. We had fairly thorough automation in place to automatically remediate a primary failure: if a MySQL primary went down, it would automatically fail over to one of the replicas. Still, there was a lot of operational overhead for on-call, who had to run manual Ansible automation to replace the missing MySQL node.
We run on-premise, in data centers where we have rarely seen large-scale issues, but I'm not saying we haven't had any. We had a few large-scale infrastructure failures that resulted in many engineering hours spent fixing broken MySQL replicas. What we saw is that MySQL replicas, at least in our case, broke in ways that meant they couldn't be used as new potential primaries. We just had to replace them, and we had to replace several hundred such broken replicas. When something like that happened, it was extremely painful. And as I mentioned earlier, we were lacking a robust scaling solution: we could scale out, but only by doubling the number of shards. That's not ideal, because scaling by doubling is not very resource efficient when you only want to add, say, 10%.

The last area we identified was setup complexity. Demonware is organized into multiple platform and service teams, and service teams own their own database operations. As a result, each team needs at least a certain level of database operations experience. That's not true for every team, and some teams were delayed because they had to wait for help from other teams. A large-scale database build-out, and we are talking about databases with over 100 shards, took a very long time: between two and three days. That was not acceptable for development agility, where you want to iterate on a database quickly, change configuration, try different things, build and rebuild the database. It was slowing us down. Schema migrations were also hard for certain teams to execute. They require real experience running databases in production, and very careful planning and execution, which often led to delays in development workflows. All of that led us to look for a solution that would address these key areas.

The next section is tech adoption. I ran the tech adoption for Vitess at Demonware. We started by doing a next-gen database evaluation. We set a bunch of criteria; there are six on the slide, but I think we had a hundred in the spreadsheet. We went through many different products, and all of them worked at some level, but we decided the major things we wanted were SQL query compatibility, so the application wouldn't really have to change anything, and we wanted it to run MySQL on the back end if possible. When we started this, we looked at our platform and saw that everything in the company was moving towards Kubernetes, so it would be nice to find something that was Kubernetes native, and a bonus feature would be that it provides an operator. Taking all of that into account, we evaluated the options and Vitess came out as a pretty clear winner.

When we started the adoption process, we discovered that Vitess was a near drop-in replacement for most of the services we had targeted. One of the great things was that we had a lot of MySQL expertise on staff. When we had problems, we could just ask those people, hey, we're seeing this thing, what's going on? And 99% of the time it wasn't a problem with Vitess; it was, hey, we can change this or do it this way, and it was fine. Another big thing was that there were a lot of success stories from other companies we heard about in the community. We joined the Vitess Slack, and people would see us asking questions and just reach out to ask what was going on.
I reached out to a bunch of people and said, I see you asking questions too. It was a pretty lively community.

We're going to talk about two main parts of the adoption: a small service, a social service that was not on the critical path, and a large service that was fairly business critical. I ran the first track, the small service, where we were doing a lot of platform work, and Vlad ran the large service.

Why did we choose that small service? The backstory is actually a little different. We had originally targeted a large service, thinking that if we could do our most complex service, we could do any service. As time went on, we realized that was not actually going to work. I was walking on the seawall in Vancouver with one of my coworkers, and he said, why don't you take our groups service? It's new, it doesn't have any data, just use that. So a few days later we had started working on it. We got to start with fresh data, which was quite nice, because we didn't have to do a migration. As I mentioned on the previous slide, it wasn't in the critical login path, so if something did happen to it during the proof of concept, it wasn't going to block players from logging in. It was also very easy to make architectural changes to it. We didn't end up making that many, but we made some, and it was much easier to do on a service that was not yet in production. What this really gave us was an opportunity to see what a service would look like running on our Kubernetes platform. We had to do a lot of work on how volumes were managed, how the pod disruption budgets were set up, all of those things. We couldn't just say, hey, we're going to run Vitess on our Kubernetes cluster, because out of the box it wasn't going to work. So we spent some time building this and integrating it into our platform, which became a good baseline for other services.

On the timeline: when we first started looking at this, for v0, we weren't very serious about adopting it. A few engineers said, we should build a proof of concept and see what it looks like running. The idea was: if we're going to build this and move it onto our Kubernetes platform, maybe we build it in parallel to what we have now, see how it runs, and then migrate that onto Kubernetes. We didn't get super far with that. We got it running, but there was no magic moment. A little time passed, we started taking it more seriously, and we moved to a v1 using the now-deprecated Helm charts on our Kubernetes platform. That's when we had the magic moment: okay, this works. We tied that setup into the small service I mentioned, ran queries against it, and even did, I believe, some minor load tests, and we were like, okay, this is real. As we continued, we needed to figure out the operational burdens and the other things we'd been talking about: how were we going to deal with those? The Helm chart was working well, but it was missing something. So we decided to be early adopters of the operator, and it worked out of the box. We could just drop our cluster definition in and we had a working system.
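For context, a cluster definition for the Vitess operator looks roughly like the abridged sketch below, shown here as a Python dict applied through the Kubernetes dynamic client. This is an illustration from memory of the operator's public example manifests, not our production configuration; the keyspace name, sizes, and exact field names are assumptions and vary by operator version.

```python
from kubernetes import config, dynamic
from kubernetes.client import api_client

# Abridged, illustrative VitessCluster definition (field names follow the
# operator's public examples; check your operator version's docs).
vitess_cluster = {
    "apiVersion": "planetscale.com/v2",
    "kind": "VitessCluster",
    "metadata": {"name": "example"},
    "spec": {
        "cells": [{"name": "zone1", "gateway": {"replicas": 2}}],
        "keyspaces": [{
            "name": "groups",              # hypothetical keyspace name
            "turndownPolicy": "Immediate",
            "partitionings": [{
                "equal": {
                    "parts": 2,            # number of shards
                    "shardTemplate": {
                        "databaseInitScriptSecret": {
                            "name": "example-cluster-config",
                            "key": "init_db.sql",
                        },
                        "tabletPools": [{
                            "cell": "zone1",
                            "type": "replica",
                            "replicas": 3,
                            "dataVolumeClaimTemplate": {
                                "accessModes": ["ReadWriteOnce"],
                                "resources": {"requests": {"storage": "100Gi"}},
                            },
                        }],
                    },
                },
            }],
        }],
    },
}

# Apply it the same way any other custom resource is applied.
config.load_kube_config()
dyn = dynamic.DynamicClient(api_client.ApiClient())
vc_api = dyn.resources.get(api_version="planetscale.com/v2", kind="VitessCluster")
vc_api.create(body=vitess_cluster, namespace="default")
```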
It was great, and we felt it was almost on par with what we had. But it still felt like a lateral move rather than an upgrade. So we decided to be, and may have been, one of the first adopters of VTOrc, and we were honestly confused at how well it worked. We were doing bad things to clusters and they just wouldn't die. We ended up shipping this v3 into production for both the small service and the large service.

So how the proof of concept actually went: as on the timeline slide, it was pretty slow to start, because there were a lot of different ways we could do things. It didn't feel like something you could just Google, because nobody was really running it the way we were going to run it, and we had a lot of platform requirements. It turned out our platform team was great; they helped quite a bit and it went pretty smoothly. Once we had it fully working, we started testing the small service. And it was ruthless, because this is when people come out of the woodwork and ask, what about this small edge case? What's going to happen when this happens? And you say, I think this, and they ask, have you tested it? And the answer is no. So there was a lot of edge-case testing, which turned out to be very helpful in the end. We tested shard expansion, up and down, during one of the larger betas, and it worked flawlessly for us. We did vertical scaling, adding CPUs and memory, with no issues. And we did a lot of recovery and error-handling testing: deleting shards, deleting parts of shards, deleting everything and bringing it back; a rough sketch of that kind of recovery check follows at the end of this part. The first time it happened I had actually done it accidentally. I went to the washroom, came back, and thought, why is this still running? I deleted it. I looked at my history and at the events, and I hadn't realized that the backups and restores were something that were triggered automatically. Anyway, it was unexpected, but quite nice. We started load testing and it went pretty well. During all of this we built good relationships with the open-source community, which was really helpful; they helped out a lot. Then we launched, and it was so successful that nobody talked about it until January, when we said, okay, we're going to look at a much bigger service and try to implement it for that. Vlad was mainly working on that, and he will take over.
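As a rough illustration of that recovery and error-handling testing, a chaos-style check can be as simple as deleting a tablet pod and waiting for the operator to bring everything back to Ready. This is a hedged sketch, not our actual test harness; the namespace, label selector, and timings are assumptions.

```python
import time
from kubernetes import client, config

# Hypothetical namespace and label selector for vttablet pods.
NAMESPACE = "vitess"
SELECTOR = "planetscale.com/component=vttablet"

def pods_ready(v1: client.CoreV1Api) -> bool:
    """Return True when every matching pod reports the Ready condition."""
    pods = v1.list_namespaced_pod(NAMESPACE, label_selector=SELECTOR).items
    if not pods:
        return False
    for pod in pods:
        conditions = pod.status.conditions or []
        if not any(c.type == "Ready" and c.status == "True" for c in conditions):
            return False
    return True

def kill_one_tablet_and_wait(timeout_s: int = 600) -> None:
    """Delete one vttablet pod, then wait for the cluster to heal itself."""
    config.load_kube_config()
    v1 = client.CoreV1Api()
    victim = v1.list_namespaced_pod(NAMESPACE, label_selector=SELECTOR).items[0]
    print(f"deleting {victim.metadata.name}")
    v1.delete_namespaced_pod(victim.metadata.name, NAMESPACE)

    deadline = time.time() + timeout_s
    while time.time() < deadline:
        if pods_ready(v1):
            print("all tablet pods are Ready again")
            return
        time.sleep(10)
    raise RuntimeError("cluster did not recover within the timeout")

if __name__ == "__main__":
    kill_one_tablet_and_wait()
```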
We had a very good experience adopting Vitess on that small service. As Greg mentioned, we were thinking, okay, what's next? What is our biggest pain, and which service has the biggest pain? We were absolutely clear that we were continuing with a large, business-critical service. You may think that's really crazy: you just finished your small-scale adoption, why are you picking the business-critical service?

Well, the main reason was that we had just recently finished a manual scaling operation that had been very stressful and painful and took several sleepless nights to finish. We had committed ourselves to not repeating those nights, and to fully automating the next scaling operation for this large service.

What did we pick? We picked what we call the inventory service. What does it do? It mainly stores player items, so you can think of it as the inventory of what a player owns in the game. Examples would be guns, attachments, various modifications, and even things you would not really consider items: anything in the game that can be related to a player and is granted by certain achievements. The database is huge. I'm not saying we reached these numbers in one year, it was over a few years, but at a recent launch this database saw a peak of half a million requests per second. The data size has been a pain too: we are seeing nearly a two-times increase in size every two years, which is an additional challenge on top of the request rate.

How did we tackle this? We identified a few high-level adoption stages. We knew it was not going to be easy. This service has been around for more than eight years; it's a big monolith application with very complex business logic, full of transactions, and it uses foreign keys in the database. So we knew it was going to take some time to get there, and we tried to define what an MVP would actually mean when adopting Vitess for an existing service. We came up with a definition of the core functionality of the service that needed to run on Vitess. That sounds easy: you identify what's important to your application. But how do you confirm that it actually runs without any disruption on your new database? Fortunately, we had very good unit and integration test coverage, so we decided to use those tests to confirm that our integration works on Vitess. When we ran our first attempt, it looked scary: about 50% of the tests were failing. I'm not saying we were ready to give up, but it felt overwhelming. I should mention that this service has about 4,000 unit tests, and when you have 2,000 tests failing, you don't go one by one. Instead, we looked for similarities in why the tests were failing, slowly found common patterns and common issues, and were able to group the failing tests into a small set of issues. Then we started tackling the adaptation problem in a more systematic way. The MVP took about two months to finish, and we resolved all the issues we had set for it. Along the way we found ourselves stuck a few times, and as Greg mentioned earlier, working with the community was a really great experience for us.

We want to share what we learned along the way. We are mentioning only a few SQL incompatibilities, or differences in behavior, which were not necessarily blockers for us but required some minor work to update our application code. The first one I'm showing here is MySQL named locks. The application was full of them.
We had multiple places where we needed to synchronize in order to protect the data and stay consistent. What we learned with Vitess, at least with version 14, which is what we still run in production, is that Vitess implements named locks using the first shard. That turned out to be not a blocker, but a potential performance bottleneck. What we did was look at those places and update the application code to reduce the number of locks we acquire per second.

Another one is foreign keys. I mentioned that the application is full of them; tables use foreign key constraints to implement certain business rules, and Vitess does not fully support foreign keys. That doesn't mean they don't work. They do for us, and they work well. There were just certain scenarios where things failed for us, actually only one: when we were scaling, adding more shards, the operation failed on foreign key checks while the database schema was being created on the new shards. We worked around it by temporarily disabling foreign key checks on the new target shards. This didn't put data at any risk, because the new shards were not receiving live data at that moment, so we felt absolutely safe about the workaround we had chosen.

These are just examples. REPLACE INTO was heavily used in the data-layer abstraction in our application. The problem is that a REPLACE INTO statement by definition requires all columns to be included, which means you have to include even the sharding key, and you just can't update the sharding key in a sharded setup, because that would mean data moving between shards. It made sense why this is not supported, but we had to update our application to work again. We simply switched to using INSERT ... ON DUPLICATE KEY UPDATE, and that worked; a small sketch of that rewrite follows below.

We continued with load testing. We needed to feel confident that the new database solution would handle the traffic we expect at a game launch. We already had a lot of experience from previous years load testing MySQL workloads, so we knew how MySQL behaves and didn't need to focus on that part. Instead, we focused on the Vitess components that were new to us. We were basically trying to confirm that the Vitess components on the query path scale linearly as you add more shards and that we would not hit any performance bottleneck, and we confirmed those assumptions. What we also learned is that the new solution requires some additional CPU resources, which was expected: any time you run additional proxies or sidecars next to your applications, it takes more resources, but the benefits we were about to get from the new platform more than justified the additional resources we were bringing into it. We learned that the component that runs as a sidecar to MySQL on the data nodes, at least in our case, requires about the same CPU resources as the MySQL process itself. That was a surprise to us, but we were positively surprised by the fact that the connection pooling provided by Vitess greatly offsets CPU utilization on the MySQL process itself. We no longer had over 10,000 connections on the MySQL process; we went from 10,000 or 15,000 down to something around 300 to 400 connections per database, which really helped MySQL start breathing again.
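Going back to the REPLACE INTO change mentioned above, the rewrite is roughly the one sketched below. The table and column names are made up for illustration; the point is simply that the upsert no longer has to rewrite the whole row, including the sharding key.

```python
# Hypothetical table: player_items(player_id, item_id, quantity),
# sharded by player_id. Names are illustrative, not our real schema.

# Before: REPLACE INTO deletes and re-inserts the full row, so every
# column, including the sharding key, must be supplied.
OLD_UPSERT = """
    REPLACE INTO player_items (player_id, item_id, quantity)
    VALUES (%s, %s, %s)
"""

# After: INSERT ... ON DUPLICATE KEY UPDATE only touches the columns that
# actually change, and never tries to rewrite the sharding key.
NEW_UPSERT = """
    INSERT INTO player_items (player_id, item_id, quantity)
    VALUES (%s, %s, %s)
    ON DUPLICATE KEY UPDATE quantity = VALUES(quantity)
"""

def upsert_item(conn, player_id: int, item_id: int, quantity: int) -> None:
    """Upsert one inventory row through any DB-API connection to vtgate."""
    with conn.cursor() as cur:
        cur.execute(NEW_UPSERT, (player_id, item_id, quantity))
    conn.commit()
```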
We were done with load testing and all our expectations were confirmed, so we went to the last stage, where we needed to prove that we would be able to operate the new database at scale and that it remains stable in certain failure scenarios. We had already done a lot of failure testing before and during the small-service adoption, but we continued further and went even more aggressive. Again, we did very bad things to the database and it was always able to recover, which was magical. We focused very much on configuration deployment testing, because we really wanted changing a database configuration to be seen as a normal change, the same as changing configuration on your application. That is a huge difference compared to the past, where the database felt like something precious that you don't touch unless you absolutely have to. We wanted to change that model and enable people outside database operations to run these configuration changes. As a result of this work, we ended up with a tuned configuration, a few of whose features are mentioned on the slide, and it was stable.

What I would like to highlight on this slide is what changed. We went from shards defined in the application configuration, where pretty much every shard was listed in the config, to a single database endpoint where the shards are abstracted away from the application. Previously, having shards defined in the config allowed us to make certain assumptions when designing database queries. Some queries were very shard-aware, with implicit assumptions that you were only working with data from a single shard. That kind of query did not run efficiently with Vitess at first, but then we learned that we just need to give Vitess a hint that it should target a specific shard, by including a WHERE condition on the sharding key that helps the Vitess query planner route the query to one shard, or a subset of shards, instead of all of them.

Where we spent a bit of time was figuring out how to expire data with Vitess. Previously, we hit the shards individually and ran DELETE with a LIMIT on each shard, serially or in parallel. When you have a single database endpoint and the shards are abstracted, I guess that's the reason DELETE with LIMIT is not supported by Vitess: if you think about it, you can't really identify which rows would be deleted across those shards, so it makes total sense that it can't be supported, for safety reasons. What we did instead: Vitess still supports a way to talk to the shards individually, even though it's definitely not best practice, and that's what we chose to address this issue. In our data layer we implemented what we call shard walking, where we do pretty much what we were doing before, but we do it through Vitess now.
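A minimal sketch of that shard-walking idea is below, assuming a MySQL-protocol connection to vtgate and Vitess's shard-targeting syntax (a USE statement with a keyspace:shard target). The keyspace, shard ranges, table, and batch size are illustrative assumptions, not our actual data layer.

```python
import pymysql  # any MySQL driver pointed at vtgate works the same way

# Illustrative keyspace and shard ranges; a real implementation would
# discover the shard list instead of hard-coding it.
KEYSPACE = "inventory"
SHARDS = ["-40", "40-80", "80-c0", "c0-"]
BATCH = 1000

EXPIRE_SQL = "DELETE FROM player_items WHERE expires_at < NOW() LIMIT %s"

def expire_shard(conn, shard: str) -> int:
    """Delete expired rows from one shard, batch by batch, via shard targeting."""
    deleted = 0
    with conn.cursor() as cur:
        # Target this one shard so DELETE ... LIMIT is unambiguous.
        cur.execute(f"USE `{KEYSPACE}:{shard}`")
        while True:
            rows = cur.execute(EXPIRE_SQL, (BATCH,))
            conn.commit()
            deleted += rows
            if rows < BATCH:
                return deleted

def shard_walk() -> None:
    """Walk every shard in turn, the way our data layer loops over them."""
    conn = pymysql.connect(host="vtgate.example.internal", user="app", password="...")
    try:
        for shard in SHARDS:
            print(f"{KEYSPACE}/{shard}: expired {expire_shard(conn, shard)} rows")
    finally:
        conn.close()

if __name__ == "__main__":
    shard_walk()
```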
Alright, so we can draw some conclusions from all of this. The benefits: we have now tested it a lot, including in production scenarios, and we have a proven method of sharding up and sharding down, with the resharding tools provided by the upstream Vitess team, which is pretty great. The on-call burden we talked about earlier has been greatly reduced. When we were creating these slides, I was thinking, okay, there's going to be at least one escalation before the talk. But to my knowledge, there have been no escalations for our production Vitess clusters; the operator and VTOrc have just taken care of everything. Take that with a grain of salt, though: if you do this and something happens, please don't blame me. The database setup uses our GitOps model, the same one the rest of our Kubernetes services use, so the rest of the company is quite happy with that. Overall, the benefits we were looking for are clearly there.

So, did it work? Yes, it worked to our expectations, so much so that we're building a team around supporting it. We're about one year into opening it up to multiple teams, and there are about 60 separate Vitess clusters running across our different environments. As of about two months ago, Vitess has become the default database solution for any new products, and a lot of our teams are a bit jealous of the people already running it, so they're starting to talk about migrating their data into it. Thank you for coming, thank you to everyone at Demonware who helped make this possible, and thank you to PlanetScale for helping us along the way. And with that, if anybody has questions.

How does this work? We will repeat the question. Yes, we started fresh on both services; we had an opportunity to start fresh. Right now, in our adoption process, we're figuring out the best ways to do data migrations, and it's a bit different for every service, so we're working on that right now and it looks quite promising.

I was wondering if you could speak a little about Ceph. I don't know if everything is using Ceph, but could you speak to the Ceph configuration and what the performance is like on Ceph?

Sure. So the question is what the performance is like on Ceph and what we use Ceph for. We initially just used Ceph for the etcd backing, because the way the operator set things up, it was kind of expecting network storage. Then we thought, why don't we move less performance-sensitive volumes onto Ceph storage? Right now I think we only have some test clusters running with the actual data backed by Ceph, but it seems to be fine. I don't think we will ever run anything with high performance requirements backed by Ceph; potentially just dev environments and some staging environments. Does that answer it?

Then I guess my follow-up question would be, what's the storage back end for prod?

We use local storage. We have our own disk provisioner, and this is actually the question a lot of people who've talked to me have asked, because they say, how do you actually do this? We're trying to solve this. We have an in-house tool that pre-provisions volumes; we just create a PVC and it binds to one of them. Each server has the PVs available, all pre-formatted, everything ready. It's not very flexible, but it actually ended up working quite well, because with the way we have it set up, I call it persistently ephemeral: we can treat the volumes as ephemeral volumes, but they're quite persistent as well.
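As a rough illustration of that pre-provisioned local-volume setup, the pair of objects below shows the general shape: a local PersistentVolume pinned to one node plus a claim that binds to it. This is a hedged sketch using standard Kubernetes fields, not Demonware's in-house provisioner; the storage class, node name, and path are assumptions.

```python
from kubernetes import client, config

# A local PV pinned to a specific node, as a pre-provisioner might create it.
local_pv = {
    "apiVersion": "v1",
    "kind": "PersistentVolume",
    "metadata": {"name": "db-node-01-disk-0"},            # hypothetical name
    "spec": {
        "capacity": {"storage": "1Ti"},
        "accessModes": ["ReadWriteOnce"],
        "persistentVolumeReclaimPolicy": "Retain",
        "storageClassName": "local-db",                    # hypothetical class
        "local": {"path": "/mnt/disks/disk-0"},
        "nodeAffinity": {
            "required": {
                "nodeSelectorTerms": [{
                    "matchExpressions": [{
                        "key": "kubernetes.io/hostname",
                        "operator": "In",
                        "values": ["db-node-01"],
                    }],
                }],
            },
        },
    },
}

# The claim a vttablet pod (or its volume claim template) would use.
local_pvc = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "vttablet-data", "namespace": "vitess"},
    "spec": {
        "storageClassName": "local-db",
        "accessModes": ["ReadWriteOnce"],
        "resources": {"requests": {"storage": "1Ti"}},
    },
}

if __name__ == "__main__":
    config.load_kube_config()
    v1 = client.CoreV1Api()
    v1.create_persistent_volume(body=local_pv)
    v1.create_namespaced_persistent_volume_claim(namespace="vitess", body=local_pvc)
```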
Okay, I have two, if you have time. The first question is: are you using any of the high-availability primitives that Vitess provides, and have you had any trouble with them? And the second is: when you scale the vttablets, do you see any trouble with the time it takes a vttablet to start and restore from a backup, so that the load spike is already gone by the time it's ready?

So the first question was whether we use any high-availability features like cells and multi-region setups. No for multi-region, but Vitess itself has great high-availability mechanisms built in. The proxy layer is fully redundant; you just keep adding replicas to the deployment. And each shard is highly available by using MySQL replicas, with VTOrc automatically failing over the primary to one of the chosen replicas.

The question about the tablet resizing, I think that was the question? Yes: we use vttablets and sometimes we have load spikes, so we want more databases, but by the time we've restored a terabyte of data the spike is gone, so we have to scale vertically and not horizontally. Do you see something like that? So I think what greatly helps in our case is that we use Ceph that is local in our network and extremely fast. I don't remember the exact numbers, but we're on roughly a hundred-gigabit backbone, so we can just pull the backup very quickly. And I totally get your question, because one of our implementation requirements was to make sure a replica can be recreated, including MySQL replication catch-up, in under ten minutes. So even with this large database, a new replica comes up in under ten minutes.

Thank you. I think we should cut it there. If anybody has questions, just find us and we will be happy to talk. Thank you.