Thank you. Good morning everyone. Welcome to, can everyone hear me? Okay, thank you very much. Welcome to ApacheCon North America 2017. Netflix. Who doesn't know Netflix here? Yes, good. So I'm sure all of you here are, one way or another, a Netflix subscriber. Maybe sharing your friend's account, or maybe you have your own. If you're not a customer, I could start a little sales pitch, but since everyone here is a customer, I can get into our topic.

Stranger Things, one of the Netflix original shows released last year. I heard about this show in a hallway conversation with a colleague, and then I started watching it late one evening, on a lazy day, on the couch, on my iPad. Of course I didn't start the show with the intention of binge-watching it overnight or completing it the next day; I just started casually. But all the Stranger Things fans out there know how tough it is to stop in the middle without watching the whole thing. So I completed about three episodes and then my iPad battery died. I had to move to our TV room and continue watching on the TV. The interesting thing is, when I stopped, when my battery died on my iPad, and then moved to the TV and opened Stranger Things, I didn't spend any time fast-forwarding or figuring out where I left off. It just resumed where I left off. A couple of hours into that I was hungry, I wanted some popcorn and drinks, so I took a quick drive to get some. Well, the interesting thing is I couldn't stop watching, so I started watching it on my iPhone at a traffic light. Same thing: I didn't spend any time figuring out where to continue. I just opened the Netflix app on my iPhone, opened Stranger Things, and it resumed where I left off.

So what makes this possible, what makes the viewing experience seamless while you are watching something and switching devices? That's Netflix. This is one of the great features Netflix has: it records all the heartbeats of your player in the background and resumes playback, saving you quite a bit of time. And what makes it possible to put right in front of your screen what you'll like most is Netflix recommendations: it records all your viewing history and rating history and figures out which shows and movies you might like best. We also store a lot of customer information: what you watched, what you rated, where you paused, how long you paused, how many shows you watched, everything. And almost all of the features Netflix has depend, directly or indirectly, on an amazing persistent store: Cassandra.

Well, that's what I'll talk about today: Cassandra serving Netflix at scale. Without further ado, let me introduce myself. I'm Vinay Chella, a Cassandra MVP and cloud data architect, part of the Cloud Database Engineering team at Netflix, the CDE team as we call it internally. Our team is responsible for providing several persistent stores as a service to the rest of the Netflix application teams. Apart from Cassandra, we provide several other data stores as services, including Elasticsearch, Dynomite, RDS/MySQL, and ZooKeeper, along with the client libraries that support these data stores.
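Before diving in, just to make that "resume where you left off" behavior concrete: conceptually it boils down to the player periodically reporting its position, keyed by profile and title, so any device can read back the last reported position. The sketch below is purely illustrative; the class, keys, and field names are hypothetical, not Netflix's actual data model, and the map stands in for a row in a persistent store such as Cassandra.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Illustrative only: a tiny stand-in for a "viewing bookmark" store. */
public class PlaybackBookmarks {

    // In a real system this would be a row in a persistent store, not an in-memory map.
    private final Map<String, Long> positionSeconds = new ConcurrentHashMap<>();

    private static String key(String profileId, String titleId) {
        return profileId + ":" + titleId;
    }

    /** Called on every player heartbeat, regardless of which device is playing. */
    public void recordHeartbeat(String profileId, String titleId, long positionSec) {
        positionSeconds.put(key(profileId, titleId), positionSec);
    }

    /** Called when a (possibly different) device opens the same title. */
    public long resumePosition(String profileId, String titleId) {
        return positionSeconds.getOrDefault(key(profileId, titleId), 0L);
    }

    public static void main(String[] args) {
        PlaybackBookmarks bookmarks = new PlaybackBookmarks();
        bookmarks.recordHeartbeat("profile-1", "stranger-things-s1e3", 1260); // from the iPad
        // Later, on the TV: playback resumes at 21 minutes without any user input.
        System.out.println(bookmarks.resumePosition("profile-1", "stranger-things-s1e3"));
    }
}
```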
We'll talk specifically about Apache Cassandra at Netflix: the challenges we face in providing Cassandra as a service to application teams, how we certify and benchmark it, and how we make it production ready.

Getting into the details of Cassandra at Netflix: almost everything is stored in Cassandra. Except the movies themselves, everything else, metadata about movies, customer information, viewing history, ratings, billing, and payments, is persisted in Cassandra. In terms of footprint, we have hundreds of clusters with tens of thousands of nodes, holding several petabytes of data and serving several millions of transactions per second out of Cassandra.

At a high level, I have categorized the challenges we have faced over the years in providing Apache Cassandra as a service at Netflix: monitoring, maintenance, benchmarking, and making it production ready for the Netflix ecosystem. In today's talk I'll get into the specifics of these challenges, how we solved them, and what systems we built to get around them.

Let's first tackle monitoring, the challenges in monitoring persistent or distributed stores generally, and Cassandra in particular. Before getting into that, let's see what we monitor. We monitor latencies: read latencies and write latencies at the 99th and 95th percentiles. We don't monitor average latencies; we monitor 99th and 95th percentile coordinator latencies. One critical point is that not every Cassandra cluster gives you the same performance, because with Cassandra it depends heavily on what kind of data you are storing, how you are accessing it, your usage pattern, access pattern, data model, and several other things. So the key thing here is that we don't blindly come up with one number and enforce it on every cluster; we define SLAs based on cluster configuration, and all of our tooling and monitoring goes off of that.

We also monitor the health of Cassandra, which includes gossip status (each node's perspective of all the other nodes), the client protocols that are running (Thrift and the native binary protocol), and any networking or hardware issues on the machine. Since Cassandra is a Java-based system, we also monitor the JVM and Cassandra's heap. Apart from that, we look at what maintenance has recently been running on Cassandra, whether user initiated or system initiated. Another critical thing we monitor is wide rows: with Cassandra, how large your partitions are effectively decides how your cluster behaves, so we keep an eye on wide-row (wide-partition) metrics. We also look at the warnings, informational messages, and errors being logged in Cassandra's system logs, so we monitor logs as well.

The first question that comes to anyone's mind when you have tens of thousands of instances is: how do we even monitor all of them? The common approach would be a cron-based system or job runner, which wakes up on a schedule, reaches out to your thousands of instances, and comes up with a state of the system, a health state at a given point in time. Let's call it a T0 snapshot: the entire Cassandra cluster of a thousand nodes looks healthy.
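Here is a minimal sketch of that naive poll-and-snapshot approach, before we get into its problems. The probe here is a stand-in for real checks such as gossip status, client-protocol status, or a JMX call; the important part is that each run produces an isolated point-in-time snapshot with no memory of previous runs.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Sketch of a cron-style health check: poll every node, build a point-in-time snapshot. */
public class CronHealthCheck {

    interface NodeProbe {
        boolean isHealthy(String node); // e.g. connect and check gossip / native protocol status
    }

    /** One scheduled run: a T0 snapshot of the whole cluster. */
    static Map<String, Boolean> snapshot(List<String> nodes, NodeProbe probe) {
        Map<String, Boolean> state = new ConcurrentHashMap<>();
        for (String node : nodes) {
            state.put(node, probe.isHealthy(node));   // one synchronous poll per node
        }
        return state;                                  // no memory of previous runs
    }

    public static void main(String[] args) {
        List<String> nodes = List.of("cass-001", "cass-002", "cass-003");
        Map<String, Boolean> t0 = snapshot(nodes, node -> true); // pretend all nodes are reachable
        System.out.println("cluster healthy at T0: " + !t0.containsValue(false));
    }
}
```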
And since it's a cron-based system, you would deploy something like this; I'll take Jenkins as an example. You have a Jenkins master or slave deployed in one data center, and you have several data centers or regions where your actual databases are deployed. Your Jenkins system reaches out to all of your Cassandra instances and figures out the health of Cassandra.

But there are typical problems with this common approach and architecture. Since these are cron-based systems or job runners, they do not persist any state. Yet the database we are dealing with is not a stateless system; it's a stateful system where the state, and the transitions every cluster goes through, matter. And being in a cloud-native system, as many of you know, you experience network glitches, bad hardware, bad network, maybe a transient issue, maybe a critical one. You need to understand the state of the system to assess its health.

Let me take a common scenario and explain the problems involved in building a cron-based health check system. Let's say at T0 your cron-based system wakes up, reaches out to thousands of instances, and figures out the cluster is healthy. Then it wakes up again at T1 and tries to figure out whether the cluster is healthy. But at T1, when it is reaching out to thousands of instances, maybe a couple of them are experiencing a transient network glitch, or some system is under pressure, whatever. Instead of giving up, you sleep for maybe 10 seconds and retry. And when you retry, maybe some other instances are experiencing a network hiccup, so you sleep again. That becomes a never-ending problem, and you end up with some hacky algorithm to figure out which nodes are actually having a real issue. But those are just workarounds on top of workarounds.

Another issue with cron-based systems: let's say a node is under pressure and you are trying to establish a connection to find out its health. You cannot establish the connection precisely because the system is already under pressure, so your health check fails exactly when you need it most. And since these cron-based systems hold no state about the cluster, what has happened, what is happening right now, what is going to happen, they are not resilient to temporary network glitches, temporary JVM pressure, and so on. These systems also do not scale when you go from 1,000 nodes to 10,000 nodes and beyond.

We initially had a system like this, and it did not serve our needs as we scaled out. That is when we took a step back and thought about the problem in a different way. What if we had a fine-grained snapshot of the health of every instance, pushed out over a persistent connection, instead of establishing a connection and figuring out the state at a given snapshot? Simply put, instead of a poll-based approach, a push-based approach, where all of your thousands and thousands of instances send heartbeats to a centralized or distributed system that figures out the health of your overall cluster. That is when we thought about a streaming system.
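The core inversion is from poll to push. Below is a minimal sketch of what each instance does under this model; this is not the Mantis API, just the idea of every instance emitting a small heartbeat on a fixed interval over a long-lived channel, with the aggregator treating missing heartbeats as a signal in itself. The queue stands in for the persistent connection.

```java
import java.time.Instant;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Sketch of push-based health reporting: nodes emit heartbeats, an aggregator consumes them. */
public class HeartbeatPublisher {

    record Heartbeat(String node, Instant sentAt, boolean gossipOk, boolean nativeOk, long heapUsedMb) {}

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Heartbeat> stream = new LinkedBlockingQueue<>(); // stand-in for the persistent connection

        // Each node runs a scheduler and pushes a heartbeat every 20 seconds.
        ScheduledExecutorService node = Executors.newSingleThreadScheduledExecutor();
        node.scheduleAtFixedRate(
                () -> stream.offer(new Heartbeat("cass-001", Instant.now(), true, true, 4096)),
                0, 20, TimeUnit.SECONDS);

        // The aggregator side simply consumes the stream; a missing heartbeat is itself a signal.
        Heartbeat hb = stream.take();
        System.out.println("received heartbeat from " + hb.node() + " at " + hb.sentAt());
        node.shutdownNow();
    }
}
```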
We looked around and found there was already a system built in-house that served exactly the needs we were looking for: a high-throughput, low-latency operational streaming system called Mantis, which is again an open source offering; I think it is slated to be open sourced this quarter. Mantis is built on top of Apache Mesos, provides a flexible programming model in ReactiveX, and models your computations as DAGs, which makes it well suited for operational insight.

By rethinking the traditional cron-based health check as a streaming system, this is what the topology of the health check architecture looks like. You have thousands and thousands of instances sending heartbeats to something called a source job. "Source job" is terminology I took from the Mantis framework; a job in Mantis is not a scheduled job, it is a microservice built on a reactive programming model. It collects heartbeats from every instance out there and builds a message that is sent to our own job, written on top of the Mantis framework, which we call the local ring aggregator.

The reason we have three different local ring aggregators here: I took an example of three data centers, and when you are sending heartbeats every 20 seconds from thousands of instances, we were talking about almost 60 MB per second of data being transmitted just for the health check. Transmitting that much data cross-region was expensive and unreliable, so we built an isolated local ring aggregator for every region. The main purpose of the local ring aggregator is to collect all the messages sent by all the nodes and come up with a score, a mathematical formula derived from every node's perspective.

Let's say you have a 300-node cluster. From Cassandra's point of view, a cluster is healthy not just when every node is healthy, but when every node sees and communicates with every other node in the cluster. So with a 300-node cluster you are talking about 300 times 300 perspectives being generated, a huge amount of data coming out of every node in every region. The local ring aggregator takes each node's perspective (gossip status, Thrift status, all the client protocols, heap, network issues, any hardware issues) and puts it into a score, which is a much more compact representation of the health of your Cassandra instances in that region.

These scores are sent to a centralized global ring aggregator, which collects the scores from every individual region, evaluates cluster health based on certain business rules (whatever your definition of healthy is), and runs a finite state machine. The reason we need a finite state machine is that in a cloud ecosystem, nodes die all the time. Unlike a stateless service, where a terminated node is simply replaced by a new one that installs its binaries and starts serving traffic, that is not how it works in a distributed database like Cassandra.
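To make the local ring aggregator's "score" concrete before going further, here is a minimal sketch of collapsing the N x N node perspectives into one number per region. The weighting is made up for illustration; the real business rules are whatever you define healthy to mean.

```java
import java.util.Map;

/** Sketch: collapse every node's view of every other node into one per-region score. */
public class LocalRingAggregator {

    /** perspectives.get(a).get(b) == true means node a currently sees node b as up. */
    static double regionScore(Map<String, Map<String, Boolean>> perspectives) {
        long total = 0, healthy = 0;
        for (Map<String, Boolean> view : perspectives.values()) {
            for (boolean up : view.values()) {
                total++;
                if (up) healthy++;
            }
        }
        return total == 0 ? 1.0 : (double) healthy / total; // 1.0 means every node sees every other node
    }

    public static void main(String[] args) {
        Map<String, Map<String, Boolean>> perspectives = Map.of(
                "node-a", Map.of("node-b", true, "node-c", true),
                "node-b", Map.of("node-a", true, "node-c", false),  // b cannot reach c
                "node-c", Map.of("node-a", true, "node-b", true));
        System.out.printf("region score: %.2f%n", regionScore(perspectives)); // prints 0.83
    }
}
```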
Coming back to why we need a state machine: a new node, apart from installing binaries, needs to participate in gossip, join the ring, slice up the token range, and stream the terabytes of data it is responsible for from its peers, and all of that can take anywhere from several hours to several days. We have instances where each node carries almost four terabytes of data, and streaming those four terabytes from its peers takes anywhere between 24 and 48 hours. For those 24 to 48 hours, your health check system should not alert you that the node is down; it should detect that there was a termination in the cloud, that this node is trying to join the ring, and avoid the false positive. That can only be handled when you know the state of the system, rather than relying on point-in-time snapshot health checks. That is why the global ring aggregator runs a finite state machine that is aware of all the issues going on in your cluster. The global ring aggregator evaluates cluster health and produces a stream of health check heartbeats, whether the cluster is healthy or not, every 10 seconds.

So we have a good health check system, and that is great: the health of the entire fleet is encapsulated in a single data stream that signals whether each cluster is healthy or not. But how do we visualize it, how do we make sense of it? That is when we built a UI that subscribes to the data stream emitted by the global ring aggregator. Here I'm showing the macro view, where hundreds of clusters are shown on hallway dashboards with green, red, or yellow indicators of whether the cluster is healthy. Each rectangular box represents a cluster, and the size of the box represents the size of the cluster. Clicking on a box drills you down to the cluster view; I'm showing an example of 36 nodes in one region. This cluster spans three different regions, and everything is green, which means the cluster is healthy. Clicking further gives you the perspective view of the cluster. Again, this is not a stateless system; it is stateful, and you need knowledge of the entire gossip state, every node's perspective of every other node in the cluster. So this view gives you a representation of any gossip issues and any network glitches we have had.

Before we had this, figuring out whether there was a gossip issue in the system used to take anywhere between 15 and 30 minutes, depending on cluster size. If it is a three-node cluster, you log into three instances and check whether every node sees everyone else; that's fine. But we had clusters with 300 nodes, and figuring out which node was missing which other node's view was a nightmare. We had on-call engineers logging into a hundred different instances with shell scripts to figure out which node was having a network issue talking to which other node.
With this system, detecting any gossip issue came all the way down from 30 minutes to less than a minute: opening this screen gives you an exact representation of the network glitch happening between your data centers, between your racks, or between individual nodes.

So what did we gain? Faster detection of issues and greater accuracy, because this is not a point-in-time snapshot health check; health is sent as a heartbeat every 10 or 20 seconds, so you detect issues faster and they are much more accurate. And a massive reduction in false positives: just to give you a number, we went from about 300 pages per day down to fewer than 100 pages per week, just because of the Mantis-based health check system we rolled out. We also got separation of concerns. With the old health check system, detecting the issue and remediating the issue were crammed into the same system, which became overcomplicated. With the streaming-based approach we split it into two systems: the streaming system only detects issues, and a separate remediation system auto-remediates whatever issues it can.

The next challenge I'll tackle is maintenance. We talked about monitoring; now, what are the known problems with maintenance when you are running Cassandra at scale? The first is that it is not a stateless system; it is stateful, so it needs to persist state, both for monitoring and for maintenance. If you are operating in a cloud ecosystem, your nodes become unresponsive for no apparent reason, because the hardware is not in your control; it is provided by your cloud vendor, and you have no visibility into the virtualization or physical machines you are using. Being in the cloud comes with its own costs and its own problems. Also, configuring, setting up, and tuning anything on 20,000 instances would be a nightmare with shell scripts or whatever scripting tools you come up with. In Cassandra, token distribution is key if you are not using vnodes, and in our case we are not: distributing tokens equally across your data center is important, otherwise you create hot spots, which affect your performance and latencies. And then there is resiliency: when we had the S3 outage, when we had region outages, as I said, almost every feature directly or indirectly depends on Cassandra, so providing resiliency at the data store layer is critical to keeping Netflix up and running all the time.

To solve all of these problems we built a system called Priam, which came to the rescue and helps us resolve all the issues we just talked about. So let's get into the details of running Cassandra in the cloud with Priam. Priam is a sidecar that runs on the same instance alongside the main data store, Cassandra. We use the same sidecar pattern for providing Elasticsearch as a service, Dynomite as a service, and ZooKeeper as a service.
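Here is a minimal sketch of that sidecar ("babysitter") pattern: a second process on the same instance that watches the main datastore process and handles the operational chores around it. This is a simplified stand-in, not Priam itself.

```java
/** Sketch of a datastore sidecar: supervise the main process and run operational chores. */
public class DatastoreSidecar {

    interface ManagedProcess {
        boolean isRunning();
        void start();   // in a real sidecar: write config, assign tokens, then launch the process
    }

    private final ManagedProcess datastore;

    DatastoreSidecar(ManagedProcess datastore) {
        this.datastore = datastore;
    }

    /** One pass of the supervision loop; a real sidecar schedules this every ~30 seconds, forever. */
    void superviseOnce() {
        if (!datastore.isRunning()) {
            datastore.start();                     // (re)bootstrap or restart the node
        }
        // also: publish a health heartbeat, trigger backups when due, refresh configuration ...
    }

    public static void main(String[] args) {
        ManagedProcess cassandra = new ManagedProcess() {
            private boolean running = false;
            public boolean isRunning() { return running; }
            public void start() { running = true; System.out.println("cassandra (re)started"); }
        };
        new DatastoreSidecar(cassandra).superviseOnce();
    }
}
```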
All of our CDE-provided services come with sidecars, and in the case of Cassandra that sidecar is Priam. Priam is mainly responsible for bootstrapping the cluster and the node, with automated token assignment so that we do not create hot spots, for backing up the data all the time, and for restoring and recovering the data in case of an emergency. Configuration management across your 10,000 or 20,000 nodes is also done with the help of Priam, which exposes a REST API for the nodetool commands and the management of the database. We collect metrics through Cassandra's JMX, and the heartbeats I talked about in the health check are sent from Priam as well.

This is the high-level layout of how we build a Cassandra ring in the cloud, specifically in AWS, because we are on AWS. We have A, B, C; let's call each of those a different availability zone. You have one node in one zone, the next token in a different zone, and the next token in yet another zone, so that we get resiliency. I'll get into how we achieve that resiliency across zone and region outages in the next slide. Each Cassandra instance in AWS has two processes running on it: the main process, Cassandra, and the sidecar, Priam. As I said, Priam is responsible for managing Cassandra; it basically babysits Cassandra.

This is how the ring is deployed. It is not nodetool ring output, but this is the layout: in our example you have us-east-1a, us-east-1e, us-east-1b, and so on, a layout of tokens designed to achieve resiliency. When I share these slides you can follow the same token distribution, or if you are using Priam this token distribution comes to you for free, and it gives you the resiliency at every level we're about to talk about (a small sketch of this layout follows below). With this distribution of tokens we get resiliency at the instance level, the availability zone level, the multiple availability zone level, and the region level. So there would be no incident because of an AWS S3 outage, an AWS region outage, or an AWS availability zone outage; regardless of any of those, Cassandra stays up and running because of how we deploy it with the help of Priam.

How do we sustain instance outages? Most of our data is replicated with replication factor 3, and we deploy Cassandra across three different availability zones within a region. The way we distribute tokens across the three availability zones means that when you insert a record, it is written to three different availability zones, no matter what. So even if one availability zone goes away, or one node goes away, you have two other availability zones and two other instances holding the same record. That is how you get instance-level resiliency, and we use Priam to bootstrap Cassandra when an instance is terminated or replaced: when a new node comes up, Priam figures out whether that Cassandra instance needs to be replaced or bootstrapped, and does it automatically.

And how do we provide resilience to the loss of one availability zone? Again, because the replication factor is three and we deploy across three availability zones, even if an entire availability zone goes away you still have two availability zones serving traffic. We run chaos exercises like Chaos Kong that take away instances, and even availability zones, all the time.
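Here is the small sketch of the zone-aware token layout mentioned above: interleave availability zones around the ring and space tokens evenly, so that at replication factor 3 the replicas of any token range land in three different zones. This is a toy version of the idea (Priam does the real assignment); the arithmetic is simplified to the Murmur3 token range, and the zone names and node counts are just examples.

```java
import java.math.BigInteger;
import java.util.List;

/** Sketch: evenly spaced tokens, walking the availability zones round-robin around the ring. */
public class ZoneAwareTokens {

    public static void main(String[] args) {
        List<String> zones = List.of("us-east-1a", "us-east-1e", "us-east-1b");
        int nodesPerZone = 2;                        // a 6-node ring in this toy example
        int ringSize = zones.size() * nodesPerZone;

        BigInteger range = BigInteger.TWO.pow(64);                    // size of the Murmur3 token space
        BigInteger min = BigInteger.valueOf(Long.MIN_VALUE);
        for (int i = 0; i < ringSize; i++) {
            BigInteger token = min.add(range.multiply(BigInteger.valueOf(i))
                                            .divide(BigInteger.valueOf(ringSize)));
            String zone = zones.get(i % zones.size());                // a, e, b, a, e, b, ...
            System.out.printf("token %22s -> node-%d in %s%n", token, i, zone);
        }
        // Because zones repeat a, e, b around the ring, any three consecutive tokens
        // (and therefore the three replicas of a row at RF=3) sit in three different zones.
    }
}
```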
Even during those chaos exercises Cassandra is not impacted; not even the 99th percentile latency is impacted while a Chaos Kong exercise is going on. Another key point for sustaining availability zone outages: if your cluster is sensitive to 99th percentile latency, and you don't want that latency impacted even when an entire availability zone goes away, you need to provision the cluster at two-thirds capacity. When an entire availability zone goes away, you have one less instance serving the same token range, and the traffic can be distributed only among the two instances that remain. So it is always good to provision with that headroom so the cluster can sustain a zone outage as well. And because we provision at two-thirds capacity, we do zone maintenance all the time: if there is a restart or an upgrade, we take down the entire zone, upgrade it, and bring it back up without impacting any live traffic.

How do we sustain multiple availability zone outages? It depends on how the cluster is used. If your application is using LOCAL_QUORUM, we cannot sustain the loss of multiple availability zones; that is when we fail the traffic over to a different region, because that region has the same data and can take the traffic. But if your application is using LOCAL_ONE, then we can sustain multiple availability zone outages, and that varies from application to application. And how do we sustain region failures? In case of any connectivity issues between regions, or an AWS region going away, the traffic team shifts the traffic to a different region, which has all the data at all times, because we run repair frequently and keep each data center consistent with the others, so the data exists across three different regions.

So we have talked about maintenance and monitoring. To get around the monitoring issues we built the Mantis-based health check system; to make our maintenance life easier we built the sidecar, Priam. Now, how do we solve the challenges of an open source product when you are running it in production at scale? Just to give you a snapshot, this is, I think, yesterday's list of open Cassandra issues across several recent releases, and as you can see, many issues come up every week. The best way to support an open source product in production is to have Apache committers on your team and to keep an eye on JIRA and on the product all the time. And when you are using an open source product, driving the product vision also becomes your responsibility: based on your needs and requirements, you discuss on JIRA, you have conversations, you propose new features, and you make the product better.

So how do we certify and benchmark an open source product like Cassandra and make it production ready? That is when we built NDBench, the Netflix Data Benchmark. NDBench is a pluggable, cloud-enabled benchmarking tool that can be used to benchmark Cassandra, Elasticsearch, or any other persistent store out there. We built NDBench because we looked at the open source products out there, the Yahoo! Cloud Serving Benchmark and many other testing tools.
But we were mainly looking for something that could be very close to production, something that simulates and imitates your production traffic, not by just replaying traffic, but by actually generating production-like traffic. One of the key things there is dynamic benchmark configuration. For example, if you run a benchmark for six hours, you might not be able to reproduce a memory leak that is happening in your system. Say you are trying to reproduce a memory leak from production: you run the benchmark for six hours, then it stops, and you can't reproduce the leak because it might only show up after running for two or three days; you never know until you reproduce it. Recovering the same problem that way can take several days of guessing when the leak actually happens. With something like NDBench, you can dynamically tune the configuration while the load is running, similar to your production environment, where traffic goes up during peak hours and down during off-peak hours, and maybe it is some traffic pattern that triggers the memory leak, which you would never reproduce with a constant load. You tune the configuration while the load is running, and you let it run for days, weeks, months; it doesn't matter. It runs as if production traffic is coming into your system, you adjust the load pattern just as it changes in production, and it becomes much easier to reproduce such issues.

Another requirement was the ability to integrate with the rest of our cloud systems. A typical problem with benchmarking tools is that they run on different machines: they do not share the same ecosystem your production system runs in, they do not co-host the other services that run alongside it, and as a result you often cannot reproduce the problem at all. With NDBench, when it is running on a cloud instance, that instance is as close as it can be to a production instance: it has the same sidecars and the same other services running, so it is much easier for you to reproduce issues. NDBench also provides pluggable load patterns, a random pattern, a Netflix traffic pattern, a Netflix offline batch-processing pattern, so you can generate load the way you want. And it supports different client APIs: because we have different persistent stores to benchmark, we made the client pluggable so it can support any client API. Out of the box we have Cassandra, Dynomite, Elasticsearch, and Alcandra plugins available, but you can pretty much write a plugin for any client you are trying and start using it.

We use NDBench at Netflix as a benchmarking tool, as part of integration tests, and as part of deployment validation. For example, when you use NDBench for benchmarking, here is a snapshot of Cassandra 2.0 versus a 2.1 release on Thrift. The blue line is 2.0 and I think the purple or red one is 2.1. This gives you a side-by-side comparison of different versions, different drivers, or different environments, maybe one Linux distribution versus Ubuntu, whatever.
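To make the pluggable-client idea mentioned above concrete, here is a minimal sketch of a benchmark driver generating load against a tiny client interface, where each datastore ships as a plugin implementing that interface. The interface and names here are hypothetical and are not the actual NDBench plugin API; the in-memory plugin simply stands in for a Cassandra or Elasticsearch plugin that would talk to the real store.

```java
import java.util.Map;
import java.util.Random;
import java.util.concurrent.ConcurrentHashMap;

/** Sketch of a pluggable benchmark: the driver only knows a tiny read/write interface. */
public class PluggableBench {

    interface BenchClient {
        void writeSingle(String key, String value);
        String readSingle(String key);
    }

    /** Trivial plugin; a real plugin would issue requests to the datastore under test. */
    static class InMemoryClient implements BenchClient {
        private final Map<String, String> store = new ConcurrentHashMap<>();
        public void writeSingle(String key, String value) { store.put(key, value); }
        public String readSingle(String key) { return store.get(key); }
    }

    public static void main(String[] args) {
        BenchClient client = new InMemoryClient();
        Random random = new Random(42);
        int readsPerWrite = 4;                         // knobs like this would be tunable while the load runs
        for (int op = 0; op < 1_000; op++) {
            String key = "key-" + random.nextInt(100);
            if (op % (readsPerWrite + 1) == 0) {
                client.writeSingle(key, "value-" + op);
            } else {
                client.readSingle(key);
            }
        }
        System.out.println("generated 1000 operations against the plugin");
    }
}
```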
Coming back to the comparison chart: with NDBench you can do this kind of side-by-side comparison, and right after running the performance benchmark it is easy to decide whether to go forward with the new release of Cassandra, the new release of the Cassandra Java driver, or whatever you are evaluating. As part of certification, since we are in the AWS ecosystem, every AMI we bake goes through the NDBench performance test suite, and only based on those results do we promote that AMI to production.

A typical problem people run into when building a benchmarking tool is that the tool itself adds so much overhead that you end up not measuring the numbers properly. But as you can see, with NDBench we were able to generate millions of operations per second at latencies of a few milliseconds; in this example Dynomite was serving at around 100 microseconds, which clearly shows that NDBench itself does not add meaningful overhead to the performance you are measuring. I won't talk more about NDBench; it is an Apache-licensed open source project, and you can find it on the Netflix GitHub.

Stitching it all together to provide Cassandra as a service at Netflix, this is what it looks like. Every Cassandra instance has three different processes running. You have the Mantis-based health check; Winston, our auto-remediation system, which is again open sourced (you can look at the Netflix tech blog post that talks about Winston); and Unomia, the advisor or governor that monitors our entire production fleet. As you can see, all these events are sent through our alerting system and Atlas, our metrics system, and are auto-remediated by Winston automatically. We use Jenkins for some of our maintenance, we have capacity prediction to predict the traffic and usage pattern of any Cassandra instance, we have outlier detection, we have forklifting tools, and we have log analysis, which is ELK based. A CDE person gets paged only after an issue has gone through several of these systems; only when none of them can solve the issue do we get paged.

With that, I'm opening it up for questions; I think we have five more minutes. And yes, Netflix is hiring: jobs.netflix.com, and these are the two open positions on our team. If you want to work on these kinds of systems and architectures, please reach out to me. Thank you.

Yes. So we used Mantis, our streaming-based system, and we encoded all the business logic we have gained over several years of running Cassandra as a service. That is how we decide what makes sense to page CDE about versus what to hand to the auto-remediation system to take action on Cassandra.

Okay. So the question is: in the health check system, where do the local ring aggregator and the global ring aggregator store their state? That is provided by the Mantis framework. Mantis holds the data in its ecosystem, and it also has a checkpointing system that checkpoints to HDFS, S3, and several other data stores. That comes from the streaming service offering we have. You can simulate this in a Spark Streaming environment as well: if you write a window-based or time-window-based job in Spark Streaming, you can back it with different persistent stores. This quarter we are open sourcing Mantis, and in that open source blog post we will talk more about how these jobs persist their state.

Yes. So, when we tried vnodes, and I'm talking about two years back when they were first released:
the issue was the resiliency we were talking about, instance and availability zone resiliency. The way vnodes distributed tokens was not aware of availability zones and regions, so we had cases where the same availability zone held the replicated tokens. As a result, when you take that availability zone out of the picture, a token range becomes unavailable and you have downtime in Cassandra. That is when we decided not to use vnodes. In the current state of vnodes, I know a lot of the token distribution algorithms have been improved; I don't know the exact current state, but we might revisit it in the future.

Yes. So Netflix is, I think, 100% on AWS; we don't have any bare metal or data centers of our own. We deploy on i2-based instances, about 95% of them, meaning i2.2xlarge, i2.4xlarge, and i2.8xlarge in AWS terminology, on Ubuntu. Yeah, we were on a different Linux distribution, but we migrated to Ubuntu.

Yes. So today we have automated bots keeping an eye on our usage in production. We have automated, I would say, 20% of the scaling, the cases where it is obvious that you need to extend the cluster and add nodes. But as you know, the problem with doubling a Cassandra cluster is that the process can take a long time, even months, depending on our data sizes. We haven't automated it 100%, but we are in the process of automating that as well.

Okay. Thank you very much. Thank you.