Hello everybody, I'm Jakub. This is Piotr. We're going to tell you about testing in production. Fortunately, the debate on when to deploy changes to production has finally been settled, and we know that the best time to do it is Friday afternoon. We'd like to go a step further and, in this presentation, argue in favor of using the production environment for testing. We're going to walk you through our journey of applying this testing-in-production paradigm in AstraDB: why we would want to do it, how to do it, what the challenges are along the way, and how it works in practice.

So why would we want to test in production? Cassandra has all these levels of testing, from simple unit tests, through node tests, to more complex integration tests that involve multiple nodes, and finally performance or stress tests. But all these tests share one trait: they work on synthetic workloads. For DataStax, as a vendor of Cassandra as a service, it is important to perform additional testing, because we have plenty of users, and every user brings a unique workload and unique use cases. So it is a real benefit if we can perform additional testing using these real-life workloads.

The more conventional approach would be to model the workloads and build synthetic workloads that resemble the real ones. The problem with this approach is that modeling real-life workloads is time consuming, the models are always approximate, and they are usually outdated: by the time you actually start using the modeled workload, the user may have already changed their original workload. And finally, with synthetic workloads it is pretty difficult to tell what the actual impact of a particular change is on a user's workload, for example its performance impact. You can't really go to the user and say, "we've fixed this and it improves your latency by 20%."

Now, testing in production brings new risks to the table. By tinkering with the production environment, you may impact the availability of your services, you can negatively impact performance, and obviously there is a security component to consider.

So we settled on the shadow deployment approach for testing in production. In the most abstract terms, you can think of it as having a clone of the primary system, which we call the shadow system, and some magic between the user and the primary system. This magic lets you send the user load to the shadow system as well, and then you can analyze the shadow system's behavior for bugs and performance regressions. The shadow system might run a different version of the software, or a different configuration, but it should reflect the primary system.

Okay, so what should this magic part look like? What should it do? There are basically two requirements. One is that it should work with CQL traffic, obviously. The second is that it must not impact the real production load, it must not impact the user, and this translates to reliability: it must be super reliable, it must have minimal performance impact, it should be secure, introducing no new security risks, and it should be robust, performing in a predictable way under adverse conditions.
Okay, so we reviewed the existing solutions for shadow deployment, and there are plenty of them, but typically they are aimed higher up in the stack. They usually focus on HTTP traffic, which means that they interpret HTTP requests: they can manipulate headers and cookies so that HTTP sessions work correctly, and this is something that doesn't interest us. On the other hand, they miss robustness features: most of them don't seem to care about what should happen if something goes wrong. It is not clear how reliable these solutions are; you would have to spend time looking into them, maybe finding some bugs, and the same goes for security.

Fortunately, CQL traffic is very mirroring-friendly. A CQL session carries very little state. There are USE statements and there are prepared statements, but prepared statements, for example, are implemented in such a way that the prepared statement ID doesn't depend on the place where it is computed. Both in the primary system and in the shadow system the prepared statement IDs will be the same, so subsequent EXECUTE requests will run without a problem in either of the systems. So when you think about it, there is actually nothing that prevents you from doing a very simple duplication of the traffic in binary form.

This meant that we settled on the following architecture: for every Cassandra node in the primary system, we have a corresponding node in the shadow system, and there is a simple TCP proxy in front of the primary Cassandra node. Its purpose is to pass the client traffic to the primary Cassandra node, but also to duplicate it and send it to the shadow Cassandra node. In the case of AstraDB, these are coordinator services; I'll talk more about that later. As already mentioned, we didn't find a suitable candidate for our TCP reverse proxy, so we decided to write it ourselves, and this is something that Piotr will tell you more about.

Yeah, so when Kuba and the rest of the team were investigating different off-the-shelf proxies and couldn't find one matching our requirements, I thought to myself: how hard would it actually be to write a proxy? You take bytes from one socket and you copy them to two other sockets, right? What more is there to it? This looked like a weekend evening project. Unfortunately, there are challenges. On the surface this looks like a simple project, but it has additional robustness and performance requirements, and those constitute the majority of its complexity. For example, shadow coordinators must be connected asynchronously, because the primary traffic must not wait while the shadow coordinator connects. If the shadow coordinator is broken or slow, we must not hold up the primary traffic, because the primary requirement is that the end user must not notice that there is something in the middle. Shadow coordinators may also misbehave: they may lock up, slow down, fail to respond in a timely manner, or disconnect suddenly in the middle of a session. So error handling is very, very important in this project. We cannot just drop the whole session because the shadow dropped its connection.
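To make that core idea concrete, here is a minimal sketch of one mirrored session, assuming tokio; the function and its structure are illustrative, not the actual proxy source. The primary path is awaited and authoritative, while the shadow write is attempted without blocking, and any shadow failure drops only the shadow, never the session (response relaying and the replay buffering described later are omitted):

```rust
use std::io::ErrorKind;
use tokio::io::{AsyncReadExt, AsyncWriteExt};
use tokio::net::TcpStream;

/// One proxied session: client -> primary (authoritative),
/// client -> shadow (best effort). Hypothetical names.
async fn mirror_session(
    mut client: TcpStream,
    mut primary: TcpStream,
    mut shadow: Option<TcpStream>, // may be absent or die at any time
) -> std::io::Result<()> {
    let mut buf = [0u8; 16 * 1024];
    loop {
        let n = client.read(&mut buf).await?;
        if n == 0 {
            return Ok(()); // client closed the connection; session over
        }
        // Primary path: an error here legitimately ends the session.
        primary.write_all(&buf[..n]).await?;
        // Shadow path: never await it, never let it fail the session.
        let mut shadow_broken = false;
        if let Some(s) = shadow.as_mut() {
            match s.try_write(&buf[..n]) {
                // In the real proxy, refused or partially written bytes
                // would go to a replay buffer (covered later in the talk).
                Ok(_) => {}
                Err(e) if e.kind() == ErrorKind::WouldBlock => {}
                Err(_) => shadow_broken = true,
            }
        }
        if shadow_broken {
            shadow = None; // drop the shadow, keep serving the primary
        }
    }
}
```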
Also, resource limits: asynchronous communication with the shadow requires some kind of buffering, but we must be aware that if we buffer too much data, the container we are running in might simply be killed for exceeding some quota or limit.

So, having considered all those hard requirements, and the fact that I've been writing high-performance Java code for Apache Cassandra and DSE at DataStax for more than 10 years now, the obvious choice for me was to write the shadow proxy in Rust. So we got a simple Rust application, which will be open-sourced, with the following features. It can of course mirror TCP traffic, and not just to one shadow; it can actually do multiple shadows. That was a feature that wasn't planned, but it was very easy to add and actually turned out to be useful later. It has data buffering, so if the shadow coordinator is slow, we can buffer some data and replay it later without slowing down the primary traffic. It also has online reconfiguration, so we can change the configuration without stopping the proxy. This is very important: if you kill the proxy, you kill all the sessions, and of course the user would notice that. The drivers are fortunately smart and reconnect automatically, but that introduces additional latency, and we don't want it to happen too frequently. That's why we can change the configuration dynamically. It has memory and connection limits so that it can protect itself from being overloaded; for example, a customer could open thousands of connections, and we must be ready to handle that without overloading the system. It also exports metrics, because when running in production we of course want to see what's happening.

As for performance, the design was really driven by reliability, performance, and predictability of performance: no GC running, and a thread-per-core architecture, so it's very, very efficient. We can also handle many, many thousands of sessions in very little memory, because Rust gives us the ability to control how much data we allocate and is extremely efficient at that.

Okay, as for usage: it's a simple app, a single binary. For testing, we have a simple syntax where we just give it a listen address, a primary address, and one or more shadow addresses. For production use, we have a config file; this is the recommended way to run it in production, because the config file is monitored for changes, so you can modify the file later or replace it entirely, like deleting the config file and creating a new one. This maps directly to ConfigMaps in Kubernetes; the proxy supports Kubernetes and works very well in that environment. You can see that on startup the proxy prints out its version, when it was built, and it echoes its configuration. This is a nice feature that Kuba added: when you're running a production system, it's very important to know what it is actually running. For example, after you change the config: did the proxy actually pick up the new config, and is the config correct? This gives us a lot of confidence in the software.
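The talk doesn't show how the config-file monitoring works; conceptually, though, it can be as simple as polling the file's modification time, which also catches the delete-and-recreate pattern that Kubernetes ConfigMap updates produce. A minimal sketch, with hypothetical names:

```rust
// Sketch only: poll the config file's mtime and fire a callback on change.
// The actual proxy's mechanism may differ (e.g. inotify-based watching).
use std::path::Path;
use std::time::{Duration, SystemTime};

fn watch_config(path: &Path, mut on_change: impl FnMut()) {
    let mut last: Option<SystemTime> = None;
    loop {
        // A deleted file yields no mtime; we keep the old config and
        // pick up the new file once it reappears.
        let mtime = std::fs::metadata(path).and_then(|m| m.modified()).ok();
        if mtime.is_some() && mtime != last {
            if last.is_some() {
                on_change(); // reload and apply without dropping sessions
            }
            last = mtime;
        }
        std::thread::sleep(Duration::from_secs(1));
    }
}
```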
As for the configuration file, most of the settings you probably don't want to touch; the defaults are good. The main things to set, besides of course the listen address and the primary and shadow addresses, are the memory limits. So there is a section for memory limits, with a maximum memory and a maximum number of connections, and then there are some other buffer sizes which are probably fine to leave as they are; we tested this and found configuration values that gave good performance.

One more thing: because Cassandra is distributed and we have multiple nodes, we wanted to be able to configure the proxies for all coordinators using a single config file. That's why the file can contain multiple named sections, like proxy-one and proxy-two, and you select which one a given proxy instance runs with an additional config-key flag. This is very handy in a Kubernetes environment, because we don't have to maintain a separate config file for each coordinator.

As for the communication with the shadows: as I said, it must not interfere with the primary traffic. For example, we connect to shadow coordinators in the background, and until the shadow coordinator confirms the connection, we simply buffer all the data destined for it. Of course, if the buffer ever fills up completely and we hit a limit, we drop the shadow connection, because we cannot keep buffering more data and we don't want to kill the proxy. Also, all reads and writes on the shadow stream are in non-blocking mode, so even if the socket doesn't accept any data, we immediately get a response from the system saying zero bytes were written. We never block the event loop, and we can still handle all the traffic on the primary stream.

As for limits, there are multiple of them. There is a local limit per session: how much data we are willing to buffer for each shadow coordinator. But there are also global limits for the whole proxy, across all sessions, to protect us when, say, 100,000 sessions arrive: even if each session uses only kilobytes of data, that can add up to several hundred megabytes, even gigabytes. That's why we have a global memory limit and a global connection limit. And each memory and connection limit comes in two flavors: a soft limit and a hard limit. When we exceed the soft limit, we already start disconnecting shadows, because we never want to actually touch the hard limit; the hard limit is always configured a bit higher so that we never reach it. But if the hard limit were ever reached, instead of going above it, we would simply stop accepting new connections, or even start dropping connections, including the primary ones. It's still better to refuse some new primary connections than to lose the existing ones. If we didn't have a hard limit, it would be possible that after opening 100,000 sessions everything dies and all the connections are dropped, which is of course a much worse situation.
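A minimal sketch of that soft/hard limit policy, with hypothetical names and semantics, might look like this:

```rust
// Illustrative only: thresholds and action names are assumptions,
// not the proxy's actual API.
struct MemoryLimits {
    soft: usize, // above this: start shedding shadow connections
    hard: usize, // at this point: refuse or drop connections, never allocate more
}

enum Pressure {
    Ok,
    SheddingShadows, // soft limit exceeded
    RefusingClients, // hard limit reached
}

fn check_pressure(used_bytes: usize, limits: &MemoryLimits) -> Pressure {
    if used_bytes >= limits.hard {
        // Last resort: stop accepting (or even drop) primary connections
        // rather than exceed the hard cap and risk the whole proxy.
        Pressure::RefusingClients
    } else if used_bytes >= limits.soft {
        // Shadows are expendable; primary traffic is not.
        Pressure::SheddingShadows
    } else {
        Pressure::Ok
    }
}
```

The design point is that the soft limit sacrifices the shadow traffic first, so in practice the hard limit, which would hurt real users, should never be reached.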
And the final feature: metrics. By the way, Kuba implemented all the metrics, which was really useful; it turned out, when we were doing those deployments in production, that metrics were invaluable in showing us what's going on in the system. We track, for example, how many connections we have and how much memory the proxy is using. We can see whether the shadow coordinators connected properly, that is, whether they accepted the connections, and how much data is flowing, both upload and download, on all of the connections. You can also see, if data is not going through to the shadow coordinators, where it is blocked: is it blocked on writes, for example, or maybe on reads? We also report those timings.

And the final thing: of course, we had to test the shadow proxy itself. There are several layers of testing. Unit tests, which simulate all those misbehavior scenarios: a shadow disconnecting, a slow shadow. This is all automated and runs in just a few seconds. Then, of course, some end-to-end testing, on a local laptop but also on real Astra clusters with bigger sets of data. We also stress tested it, with really many, many connections and clients, to see how many resources we actually need to handle the traffic. We had earlier checked how many connections are open on our production clusters, and we wanted a good margin above that: for example, if the biggest cluster was using 10K connections, we wanted to be sure we could handle 30K. Especially because we weren't even allowed to increase the memory or resource consumption of the production deployments; the proxy had to fit into the existing infrastructure. You couldn't just say: because we're now doing shadow deployments, we need servers twice as big. That was a requirement from above: yes, you have some spare memory on those machines, but don't use too much. We ended up using something like 150 megabytes per 10,000 sessions, and each session is three connections, so that's actually 30K connections. The added latency is typically less than 0.05 milliseconds, which is nothing compared to the typical latency of queries in Cassandra. So this is unnoticeable to users, and we can push gigabytes of data per second even on a single core, on a single connection; with multiple connections, going multi-core, it's a multiple of that. So that's basically it. Back to Kuba.

Okay, so with all these pieces together, we managed to apply this to AstraDB. If you attended Jake's presentation yesterday, you know that in AstraDB the monolithic Cassandra nodes were smashed into separate services, and we are concerned with the coordinator services. Each coordinator service got its own shadow proxy, and this shadow proxy lives very close to the coordinator service; they live in the very same pod. But the shadow database is isolated from the primary database: for example, it is physically not possible to access the shadow database from outside the production environment, like from the internet. Now, the proxy alone is enough if you want to shadow a new database, but if you want to enable shadow deployment for an existing database, you need an additional component, database cloning, and we have that too. So we are able to provision resources for the shadow deployment on demand, and all of this has been live since March this year. Finally, for extra security, we added a scram button: if you press it, all the proxies get bypassed, so the traffic goes directly from the client to the coordinators, just as if there were no proxy.
And we actually needed to use it once, when we misconfigured the proxy: the hard limits were too low, and we started noticing that some client connections couldn't make it. So, let's press the scram button, and we'll figure out what happened later. So yeah, it is helpful.

Okay, so we applied this in AstraDB, and we found two bugs. Now, this may sound a bit scary, but the shadow deployment contained a version that you can think of as a very pre-alpha Cassandra 5.0; these bugs were found six months ago, so they would probably have been found anyway. We also use shadow deployment for evaluating compaction changes. There's been a talk about UCS; UCS has some further improvements coming, and we're testing how these improvements perform under real-life conditions. This is very valuable, and I encourage you to talk to Branimir about it. Finally, we learned that using shadow deployment is so easy that some people started using it in the dev environment. In the dev environment we've got a database, and there is a suite of tests that runs against that database all the time. So if you want to do some ad hoc testing, with topology for example, it's easy to press a button, create a shadow deployment of that database, and then play and tinker with the shadow database. Now, one thing that is not so great is that this scram button takes the thrill of adventure away. We're making changes to production, and if the proxy misbehaves, we just press the button, no sweat. So there's no thrill.

Okay, what have we learned along the way? Cassandra closes connections using TCP RST, which means that if you use the proxy, you will start seeing "connection reset by peer" errors in the metrics. This is fine; it's the normal way, so don't stress about it. Another thing is that to make the proxy effective, you need new connections. There are essentially two options: you can either wait until new connections are made by the client, or you can sever the existing connections so that the client reconnects and the load starts going through the proxy. It's a trade-off, and you need to make that call. Another thing: stateless Kubernetes components are great, but they're at odds with the configuration of the shadow proxy, where we want a one-to-one correspondence. So there is some tension there, and you may run into race conditions: you configure the shadow, but it doesn't seem to be working, because some connections were created before mirroring was enabled. And, as I already mentioned, we needed to use the scram button once, so please check that the resource constraints are set to reasonable values. If your application creates an unusually high number of connections, you need to take that into account when configuring the proxy. Another thing we learned, which is maybe not that surprising: Rust really helped us write bug-free code. There is obviously the aspect of Rust being a performant language, but for us the biggest benefit was that it was really easy to write code that did not contain bugs.

Okay, and you can try the shadow proxy yourself; we've got a demo. It is a very simple demo with two Cassandra nodes, but it shows how to use the shadow proxy. The shadow proxy itself will be open source very, very soon, so you can take a look and hopefully contribute.
Finally, I would like to express my deepest thanks and kudos to the other people involved in the project: Jim Dickinson, Matt Fleming, Danny Atniex, Jeremiah Jordan, Sean McCarty, and Chris Mills. Okay, thank you very much. Are there any questions?

Who's going to take that one? Okay, I can take it. So the question is: was performance the only consideration behind choosing Rust? Well, no, performance is just a side effect. Java is a pretty performant language in its way; the JVM does magic and does a lot of great stuff. But the type system of Rust, which we could use without actually compromising on performance, was the really, really big thing for us. In Java, if I use all the high-level stuff, like streams, or wrapping every primitive type in a class, I get some level of type safety, but it all comes at a big price in performance: the GC suddenly has too much work to do, which may cause pauses, and here even millisecond pauses were considered a bit too much for us. With Rust, we could use all those high-level constructs, and they still translated to highly performant code without any interruptions: no GC, low latency. I think that is the biggest thing here. Also the tooling. And no data races; writing asynchronous, parallel code was simple, like it should be. Many people criticize async in Rust as a bit complex, and it is complex to learn, but once you learn it... I don't have the slide, but the main loop of this proxy is basically, as I said: take data from one socket, copy it to the other sockets, and that's it. The loop is sequential. We don't have to think about concurrent things happening to the buffer, like the buffer that keeps data for the shadow; it doesn't have to be protected by any mutexes or anything like that, and we don't use any atomics there either. It's just sequential code: read data from the socket, put it into the buffer, then take it from the buffer and put it into another socket. It's very easy to reason about, thanks to async and all the machinery in Rust that protects us from accidentally sharing things between two threads. We won't do that accidentally; it just won't compile.
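A rough reconstruction of such a sequential session loop, assuming tokio and hypothetical names (not the actual proxy source): a single task owns the shadow backlog, so select! interleaves client reads and shadow writes with no mutexes or atomics. Shadow-drop handling and response relaying are elided.

```rust
use std::collections::VecDeque;
use std::io::ErrorKind;
use tokio::io::{AsyncReadExt, AsyncWriteExt};
use tokio::net::TcpStream;

async fn session_loop(
    mut client: TcpStream,
    mut primary: TcpStream,
    mut shadow: TcpStream,
) -> std::io::Result<()> {
    let mut read_buf = [0u8; 8192];
    // Owned by this task alone: no Mutex, no atomics.
    let mut shadow_backlog: VecDeque<u8> = VecDeque::new();
    loop {
        tokio::select! {
            // Bytes from the client: forward to the primary, queue for the shadow.
            res = client.read(&mut read_buf) => {
                let n = res?;
                if n == 0 { return Ok(()); } // client hung up
                primary.write_all(&read_buf[..n]).await?;
                shadow_backlog.extend(&read_buf[..n]);
            }
            // Drain the backlog whenever the shadow socket can take more.
            res = shadow.writable(), if !shadow_backlog.is_empty() => {
                if res.is_err() {
                    shadow_backlog.clear(); // real code: drop the shadow entirely
                    continue;
                }
                // Write the contiguous front of the queue without copying.
                let write_res = {
                    let (front, _) = shadow_backlog.as_slices();
                    shadow.try_write(front)
                };
                match write_res {
                    Ok(n) => { shadow_backlog.drain(..n); }
                    Err(e) if e.kind() == ErrorKind::WouldBlock => {}
                    Err(_) => shadow_backlog.clear(), // real code: drop the shadow
                }
            }
        }
    }
}
```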
So the next question is: do we also want to do this for internode traffic? And I think: why would we want that? For us, the idea is that the systems need not be compatible at that level, internally. We don't need that; it's not like we want a heterogeneous system where two different versions talk to each other and we want to check that. It's more like: we're going to do an upgrade next month, will it work? Can we be sure, or at least have a higher level of confidence, that we're not going to break anything in production? Well, yeah, exactly.

And as for duplicating the traffic on the client side instead: I guess that depends on how the client does it. From our perspective, the proxy is very, very lightweight, and the duplication of the network traffic happens very close to the database itself. So, for example, you don't need to consider the potentially long distance between the client and two databases. Also, the client drivers typically need to serialize the data, translating from one representation to the representation required by the CQL protocol. So if you had two connections to two different clusters, I guess you would duplicate that work, maybe. There may be ways to avoid that with some drivers, but I don't recall any driver that can write to two different clusters like that. Here, we do it on a very low level, on byte streams, and that's why it's much more efficient. On the other hand, when you do it on the client side, then maybe you can think of getting some consistency between the two. This was not mentioned in the presentation, but we did not aim for any sort of consistency between the primary and shadow systems; don't think of this as any sort of migration tool. We're not interpreting the traffic yet. There is a long-term plan that maybe in the future we will do some shallow interpretation, but for now, this is just TCP. Okay, there are no more questions. Thank you.