Hello, I'm Stefan, a software developer at Adobe, here along with my colleague Sherban. Today we're going to share a few things about Adobe Audience Manager, what role Cassandra plays in it, and some stories on how we turned our Cassandra setup into something more efficient and easier to maintain.

First, a few words about Adobe Audience Manager. Adobe Audience Manager is part of Adobe Experience Cloud, a business-to-business range of products and services used for managing the entire customer experience. Audience Manager is a data management platform service. Its main goal is to answer the question: who is this user? To achieve this, billions of data points are collected, classified, and organized. The operating scale of this service is really big; I'll present some numbers shortly.

In order to minimize client request latency, data is collected through eight geographically distributed edge data centers, as shown here. All the data collected through these centers is then pushed to a core database, an HBase setup running in a single central location. Afterwards, the most recently used data is pushed back to the edge locations, so that each location stores a persistent cache of the entire core data. And this is what we use Cassandra for.

Some numbers here. We've got a peak of 51 billion daily client requests in Audience Manager, which translates to roughly 313 billion daily Cassandra requests. As already mentioned, we are deployed in eight AWS regions. We are running 34 Cassandra clusters on more than 800 instances, holding 260 terabytes of data.

Now that you have an idea of who we are, let me tell you what subjects we are going to discuss in this presentation. First, our architecture and its particularities. Then split brain: dividing an edge location without downtime. Then event-driven automation, a process that we have successfully implemented. And finally token awareness, a Cassandra feature that really challenged us.

So, starting with the architecture. We'll go from the core database: as previously mentioned, an HBase setup in a single central location. Its main purpose is to persistently store all the data that we collect, and it is also used for batch processing. Data arrives here from the data collection service, which we'll talk about, but also from other flows. The profile cache service, as we officially call the Cassandra setup, is the caching system that is present in all eight edge regions. It has two types of clusters in terms of usage patterns, and, on the other hand, two types of clusters in terms of data access: backend and real-time. We'll elaborate on this. The data collection service (DCS) is the Java application that serves the client requests, interacting with Cassandra and also pushing data to the core database. It only writes data to the real-time clusters, but it reads from both the backend and real-time clusters, merging the results before sending a final response. I'll tell you why.

But now back to our main focus, the Cassandra setup. There are four main Cassandra clusters in each of the edge regions, along with two special clusters, only in the US regions, that serve a different purpose. So in total we are managing 34 Cassandra clusters. Our smallest setup has four nodes, while our biggest one ever was scaled to 148 nodes.
Regarding the profile and ID mapping clusters: they have different schemas and have been separated due to a difference in the ratio of reads to writes, which means they need to be scaled separately, so it was more efficient to split them apart. In terms of data access, we have the real-time clusters, which support read and write operations from the data collection service. They contain fresh, new data that is only present in that particular edge region until, around 24 hours later, it is pushed from the core database to the backend clusters in one or more regions. The backend clusters are read-only; new data only arrives through bulk loading from the core database. And the thing is that the data collection service, when serving a read request, reads data from both the backend cluster, where it is historically aggregated and usually more complex, and from the real-time cluster, where it is recent, merging the responses before sending them.

An interesting feature of our setup is the replication factor of two, paired with a consistency level of one for both reads and writes. This basically means trading away some consistency: we only wait for the fastest of the nodes that hold the data for a request to respond, as shown in the illustration that I prepared. We have a DCS request that reaches a coordinator node. It is forwarded to the replicas that hold that particular data. The coordinator waits for only one response before forwarding it to the DCS, the other response being discarded.

Why are we okay with this? First of all, because it's cost-effective. For storing data we only use two-thirds of the infrastructure required for a replication factor of three, because obviously we are storing one-third less data. But we also assume that we get better latency, because we only wait for the fastest replica to respond, not for the fastest two replicas. Then, regarding consistency, we are well covered by our bulk loading mechanism. All the data eventually reaches, after at most 24 hours, the backend clusters, which are read-only and, due to this characteristic, always contain the same data on the main and the replica location. So no matter which one of them is the fastest to respond, the response is always the same. So inconsistencies might only appear in the real-time clusters with fresh data, and even then seldom, due to network issues or crashed nodes with faulty hinted handoff. But in these cases we are covered by our SLAs, because we still provide data that is accurate enough.

Let me illustrate this. Let's say we have a write request to the real-time cluster for profile one with color red that reaches both of the replica locations. Then we have a write request for profile one with color green that only reaches one of the replica locations. And then a read request for profile one. It reaches the outdated replica in the real-time cluster, and the backend cluster. When we get the responses, the data in the backend cluster is more complex, while the data in the real-time cluster is not the most up-to-date. So we merge the results, send them to the DCS, and then to the client. The idea is that we don't provide the most up-to-date data, but we do provide data that is updated relative to what we have in the backend cluster, and this is considered correct enough. But anyway, this happens rarely.
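To make the backend/real-time merge concrete, here is a minimal sketch of the read path, assuming the 3.x DataStax Java driver and placeholder contact points, keyspace, table, and merge logic (our actual schema and merge rules are more involved):

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.QueryOptions;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class MergedProfileRead {
    public static void main(String[] args) {
        // Both clusters are queried at consistency level ONE: the coordinator
        // forwards the first replica response and discards the other.
        QueryOptions one = new QueryOptions().setConsistencyLevel(ConsistencyLevel.ONE);
        Cluster backend = Cluster.builder()
                .addContactPoint("backend.cassandra.internal").withQueryOptions(one).build();
        Cluster realtime = Cluster.builder()
                .addContactPoint("realtime.cassandra.internal").withQueryOptions(one).build();
        try (Session backendSession = backend.connect("profiles");
             Session realtimeSession = realtime.connect("profiles")) {
            // Read the same profile from the read-only backend cluster and the
            // read-write real-time cluster...
            Row historical = backendSession
                    .execute("SELECT data FROM profile WHERE id = ?", "profile-1").one();
            Row recent = realtimeSession
                    .execute("SELECT data FROM profile WHERE id = ?", "profile-1").one();
            // ...and merge before answering: aggregated history overlaid with
            // whatever fresh, region-local data exists.
            System.out.println(merge(historical, recent));
        } finally {
            backend.close();
            realtime.close();
        }
    }

    static String merge(Row historical, Row recent) {
        // Placeholder merge rule: prefer fresh data, fall back to the aggregate.
        if (recent != null) return recent.getString("data");
        return historical == null ? null : historical.getString("data");
    }
}
```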
So now, about how we triggered the split-brain situation. For business-related reasons, we needed to add a new edge data center in India. It was intended to take over part of the traffic that was going to the Singapore region (ap-southeast-1) without any noticeable disruption. There were several other issues in doing this migration; I will only go through the Cassandra part. The goal was to replicate the ap-southeast-1 clusters in the India data center and keep them synchronized during the transition. Since the transition was triggered through a DNS change, it wasn't going to be instantaneous, so we needed to keep the clusters synchronized during the entire period. There were two very different problems: synchronizing the real-time clusters and the backend ones. Since the backend clusters are read-only as seen from the Java application, it was much easier to do there, and much more challenging on the real-time clusters.

So for real-time we had several options. First, it was possible to duplicate the incoming HTTP traffic from the ap-southeast-1 instances and send it to India using tools such as GoReplay; we were already using that for samples of traffic. The DCS servers in India would then write the data to their Cassandra clusters. The main risk here was that this would also duplicate several business application data flows, sending the same data twice, once from ap-southeast-1 and once from India. It was possible to prevent that, but it would be very, very difficult to manage during the DNS transition, and we would end up with duplicated data. There was also a serious risk of network or GoReplay errors, with no recovery mechanism, risking the loss of large amounts of data. So this was only a very theoretical option, very quick to dismiss.

Instead of duplicating the HTTP traffic, we could have duplicated the CQL writes. We already had support in the application for writing to multiple Cassandra clusters, and with VPC peering, the India cluster could be accessed from the ap-southeast-1 Java application without any problem. This would have solved the business data duplication issue, but we would still have network latency problems and no retry mechanism. Our application is built for a P95 Cassandra latency SLT of 10 milliseconds, so this was something we were not provisioned for. There was a serious risk of not getting the data into India due to timeouts under this SLT, or, the other way around, of impacting DCS performance in ap-southeast-1 due to waiting for an answer. I should add here that there was an earlier DataStax presentation today about using a TCP reverse proxy to duplicate the CQL traffic; we didn't do that, we did it at the application level, but it has been a very, very useful tool for us, allowing us to test lots of changes on production traffic.

Another option was to use Cassandra multi-data center replication. This would solve the network and latency issues, since the writes would go only to the ap-southeast-1 region, and Cassandra would take care of the synchronization. The main drawback here was the potential impact of changing the ap-southeast-1 production clusters, plus our lack of experience, since we had never used this before. And then, we had no idea how to split a multi-data center cluster into two regions.

So the last option was a combination of the second and the third. The idea was to create a duplicate cluster in ap-southeast-1, written to with the same multiple-cluster mechanism, and then have multi-data center replication between it and the new cluster in India. This way the ap-southeast-1 production cluster would not be touched, so there was no danger there, and there was no need to do the split required by the third option; we would just drop the duplicate ap-southeast-1 cluster. At first look, this was great.
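For reference, the write-to-multiple-clusters mechanism that options two and four rely on can be sketched roughly like this, assuming the 3.x Java driver (class and field names are illustrative, not our production code):

```java
import com.datastax.driver.core.Session;
import com.datastax.driver.core.Statement;

public class DualWriter {
    private final Session primary;   // ap-southeast-1 production cluster
    private final Session secondary; // duplicate cluster, reachable over VPC peering

    DualWriter(Session primary, Session secondary) {
        this.primary = primary;
        this.secondary = secondary;
    }

    void write(Statement statement) {
        // The production write stays on the request path and is awaited as usual.
        primary.execute(statement);
        // The duplicate write is fire-and-forget, so its latency or failure cannot
        // hurt the production path; the flip side is that there is no retry, which
        // is exactly the recovery weakness discussed above.
        secondary.executeAsync(statement);
    }
}
```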
We decided to try it, and we'll see why it didn't work. The first problem that we had was the Cassandra snitch. According to the documentation available then, the Ec2Snitch that we used was supposed to work only for single-region deployments, and multi-region deployments required Cassandra instances to use Ec2MultiRegionSnitch and rely on public IPs. But I also knew that AWS had somewhat recently added inter-region VPC peering, so it could have been possible to use only private IPs in a multi-data center cluster; the Ec2MultiRegionSnitch was probably written before this was available. So I decided to look into the source code of the snitches. Ec2MultiRegionSnitch inherits the single-region snitch, and it only does what is on this slide. It's not a lot. An important point here is that there is no rack or data center logic, so that is done in Ec2Snitch. It only does two additional things: it uses the public IP for broadcast, and it sends gossip traffic on both public and private IPs. That's all it's doing. So that meant that Ec2Snitch has multi-data center support; it's just not using it. There was no reason why this would not work with VPC peering, so this was worth testing. It would avoid a snitch change, not to mention the public IP issue, since we were running only with private IPs. In the end it did work, and I made a small open source contribution to update the documentation to reflect AWS VPC peering.

So this was the plan. First, create a secondary cluster in ap-southeast-1. Set DCS, the Java application, to write to this cluster and read from it, but without using the reads. Set up VPC peering. Add the India data center using the procedure documented by DataStax. Connect the India DCS to its Cassandra clusters. Do the DNS changes, and after these have propagated, destroy the ap-southeast-1 secondary cluster using the documented procedure. And thus we would have a single-region India cluster, which was the original goal.

There were some prerequisites to apply this. First, use LOCAL_ONE instead of ONE, so that the nodes used are in the local data center. Second, use DCAwareRoundRobinPolicy instead of RoundRobinPolicy; otherwise remote coordinators will be chosen, creating latency issues. We actually missed this initially, as you'll see later. And put NetworkTopologyStrategy on the system keyspaces, which is something an earlier presentation was actually advising to do anyway.

To bootstrap, we start by creating EC2 instances in the new data center, with seeds in both, without starting Cassandra on them. Then add these newly added India seeds to the ap-southeast-1 instances and do a rolling restart. Start Cassandra on each node in India, first on the seeds and then on the rest, alternating between AZs. At this point these nodes have no data and, furthermore, they don't see themselves as having any data ownership. This is basically the procedure documented by DataStax; it's not something that is done a lot, which is why I'm going through it. To change the ownership, an ALTER KEYSPACE is required, declaring the new data center and its replication factor. This will not trigger a data transfer; for that, nodetool rebuild is required. Data will then start to be streamed from ap-southeast-1, just like on adding or decommissioning nodes, with the difference that this can be run in parallel, and to monitor it, nodetool netstats is fine.
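A hedged sketch of those client-side prerequisites with the 3.x Java driver (contact point and data center name are placeholders):

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.QueryOptions;
import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;

public class MultiDcClientConfig {
    public static Cluster build(String contactPoint, String localDc) {
        return Cluster.builder()
                .addContactPoint(contactPoint)
                // Pin coordinators to the local data center; with a plain
                // RoundRobinPolicy, remote nodes can be picked as coordinators,
                // adding cross-region latency to every such request.
                .withLoadBalancingPolicy(
                        DCAwareRoundRobinPolicy.builder().withLocalDc(localDc).build())
                // LOCAL_ONE instead of ONE, so the single replica we wait for is
                // guaranteed to be in the local data center.
                .withQueryOptions(
                        new QueryOptions().setConsistencyLevel(ConsistencyLevel.LOCAL_ONE))
                .build();
    }
}
```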
Several unforeseen problems. The easiest: there were actually three system keyspaces, not two. More serious was a very high number of timeouts on the ap-southeast-1 cluster: 375,000 versus 57 on a normal production cluster. It started when the ap-southeast-1 servers were restarted with the new India seeds added, and it stopped very abruptly when the new data center was added with the ALTER KEYSPACE. As mentioned earlier, this was due to not using DCAwareRoundRobinPolicy: sometimes the new India instances were being selected as coordinators, triggering latency and timeouts. When we moved to DCAwareRoundRobinPolicy, the timeouts were back to normal and it was barely possible to see any change when we did the procedure.

Another problem was the very, very poor token distribution. We are using the token allocation algorithm, and in India we had one node with 70% of the total data load and another with 0.7%. It's quite normal if you think about it, since there is no data in that region for the token allocation algorithm to work on when Cassandra starts and the token ranges are allocated. To solve it, we redid this: we added only the two seeds and ran nodetool rebuild on them to get all the data on the seeds, and then added the rest of the nodes in the regular way, so the algorithm could run. Obviously this required larger instances and would not work on larger data sets. Another option in that situation would be to copy the token distribution from ap-southeast-1 and set it with the initial_token setting, as in the sketch below.
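That initial_token alternative could be scripted against the system tables. A minimal, hypothetical sketch with the 3.x Java driver: one run per source node, with the output pasted into the corresponding new node's cassandra.yaml before its first start.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.policies.RoundRobinPolicy;
import com.datastax.driver.core.policies.WhiteListPolicy;
import java.net.InetSocketAddress;
import java.util.Collections;
import java.util.Set;

public class TokenCopy {
    public static void main(String[] args) {
        String sourceNode = args[0]; // one node of the source (ap-southeast-1) cluster
        try (Cluster cluster = Cluster.builder()
                .addContactPoint(sourceNode)
                // Whitelist a single host so the query is coordinated by the node
                // we are actually asking about.
                .withLoadBalancingPolicy(new WhiteListPolicy(new RoundRobinPolicy(),
                        Collections.singletonList(new InetSocketAddress(sourceNode, 9042))))
                .build();
             Session session = cluster.connect()) {
            // system.local lists the tokens owned by the coordinator itself.
            Row row = session.execute("SELECT tokens FROM system.local").one();
            Set<String> tokens = row.getSet("tokens", String.class);
            System.out.println("initial_token: " + String.join(",", tokens));
        }
    }
}
```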
But the most serious problem was the cluster_name setting. It must be the same in the two data centers, and, as the comment in the configuration file indicates, it exists to prevent servers from joining the wrong cluster, which actually happened to us. At some point I put the wrong IPs in the configuration files, thus mixing the clusters, and this setting actually prevented the instances from being added to the wrong cluster. That was very nice, but it also meant that the plan was not viable, since we could not have the same cluster name as in the rest of the regions. Of course, it was possible to accept this and have different cluster names, but it was very quickly obvious that it would create lots of small issues in other places: in configuration, in observability, in automation. This was something we wanted to avoid, and we were sure it would generate a lot of problems and errors. It might have been possible to change the name to the normal one later, but this was not documented, it probably required a rolling restart, there was no guarantee it would work, and there was no fallback in case something went wrong.

So another option was to go back to our original variants and use option three: replicate between the production clusters. This would avoid any naming issues, by then we had started to gain some experience with multi-data center clusters, and it was easy to test using the duplicated clusters. So we decided to try it.

I'll go now through the plan that we implemented to split the clusters. Well, there were plenty of intermediate tests that I will not go through. This is based on the documentation for decommissioning a data center, with some very significant changes. Of course, we needed to monitor timeouts, response times, and read/write counts, and we did that when we tested. The first step was to cut the network between the data centers; that was very easy to do with an AWS subnet change. From then on, each cluster would see the other one as being down, and all operations would have to be done in each data center, since they would not propagate to the other.

So I was tricking each one of them into seeing the remote data center as being decommissioned. On a seed in each region, run the ALTER KEYSPACE that removes the other data center from the replication settings, the reverse of what we were doing earlier. There was an expected warning message that the schema cannot be propagated; that was normal, due to the cut network. From now on, running nodetool describecluster shows the remote IPs as unreachable and with unknown schema. Then all we had to do was to remove each remote node, one at a time, using nodetool assassinate from the local data center. So in ap-southeast-1 I would run nodetool assassinate with the addresses of the nodes in India, and the other way around. When all of them have been removed, nodetool describecluster no longer shows the unreachable nodes and no longer shows a multi-data center cluster. All that is left to be done is to remove the remote seeds from the configuration file and do a rolling restart. For fun, we did some testing: we restored the network connection, restarted a node, and checked that it does not join the remote cluster; they are completely separated. This was the intentional split brain that gave this talk its title. These were the timeouts during testing: basically no impact from the change. When we did it in production there were some significant timeouts, but they only briefly reached 1% of the total traffic, which was well within SLA, especially since it wasn't a sustained timeout.

On the backend side it was much, much easier, due to the previous experience. First, create a new empty Cassandra cluster in the India region, and create the AWS resources for the Hadoop data push, that is, SQS queues, S3 buckets, and the specialized instances that do the data push. Then duplicate all notifications for ap-southeast-1 into additional queues for India, so India will process twice as many queues. After this had been working for at least one day, we started to see the clusters filling with data. Take a snapshot from the ap-southeast-1 region, copy it to India, and then restore it there through streaming (actually, copy only half of the snapshot, from one AZ, since there was no need to duplicate the data, as we were using two AZs), and at the same time keep streaming from Hadoop. There will be duplicates in doing that, but Cassandra will handle the deduplication. There will also be a lot of compactions in doing this restore with streaming, but that's okay, since the cluster is not yet used and it was easy to configure it for maximum compaction performance. Then run a Python script that takes samples from each cluster and compares them. There were several problems here too, but nothing serious: things like instance availability, since India was back then a new region, or problems with AWS SDK versions that were too old and didn't have the India endpoints. But again, nothing serious. And after the DNS transition was complete, we stopped sending the ap-southeast-1 data. That meant that, for both real-time and backend, the clusters would still contain ap-southeast-1-specific data, which would disappear in time due to TTLs; the same is true in the other region.

Now let's discuss event-driven automation. A common situation with Cassandra is a node crashing due to an AWS hardware issue, and in our case the data cannot be recovered, because we are using ephemeral storage for cost efficiency reasons. For us this means a node replacement, and we do it as fast as we can. But many of these problems are detected by AWS prior to their happening, when they are expected to happen, and are signaled through health events. AWS also provides a self-healing mechanism for this, but it is only triggered in two weeks' time and still involves data loss, because it doesn't perform a Cassandra node replacement. So our goal is to prevent the nodes from crashing by being proactive and replacing them after receiving the health event, and thus doing it at a moment in time that the situation allows. Because we react to an event, we call this process event-driven.

The thing is, for such problems, in order to eliminate the engineer's intervention, there are tools such as the one that we used, StackStorm, an event-driven automation system that we have deployed from the beginning in Kubernetes, and which was very effective. I brought data from the last month, when, out of the 19 total node replacements, 13 were performed by StackStorm, or, better said, about 70% of them. I also brought the statistics for the entire last year: monthly, between 50 and 90% of the node replacements have been performed automatically through StackStorm.

So what architecture did we use? We basically redirected the health events as notifications to queues in AWS, and we used a single StackStorm sensor to poll each of the queues, in different AWS accounts and different regions, triggering a workflow for replacing the node. This workflow performs exactly the same actions as an engineer would have done, like stopping and starting the instance and running an Ansible playbook for the Cassandra replacement. Most important here is that this workflow is only triggered when the situation allows it, so that it does not cause an incident. For this reason, it checks that no neighboring node is down, that is, no other node that shares data with the one to be replaced. And look, this is all we are aware of when StackStorm does its job: a Slack message when it begins, and a Slack message when it finishes.
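StackStorm runs the same steps an engineer would, so the interesting part to show is the safety condition. Here is a hedged sketch, not our actual workflow code, of how such a neighbor check can be expressed with the 3.x Java driver's cluster metadata:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Host;
import com.datastax.driver.core.Metadata;
import com.datastax.driver.core.TokenRange;

public class ReplacementGuard {
    // True only if every node that shares a token range (and therefore data)
    // with the node about to be replaced is currently up.
    static boolean safeToReplace(Cluster cluster, String keyspace, Host target) {
        Metadata metadata = cluster.getMetadata();
        for (TokenRange range : metadata.getTokenRanges(keyspace, target)) {
            for (Host replica : metadata.getReplicas(keyspace, range)) {
                if (!replica.equals(target) && !replica.isUp()) {
                    // With a replication factor of two, taking this node down as
                    // well would make some data unavailable, so the workflow waits.
                    return false;
                }
            }
        }
        return true;
    }
}
```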
The last topic: a short story about how we didn't actually use token-aware load balancing, despite setting it. There will not be many details here about load balancing policies; I'm sure they are known. It's a setting on the Cassandra driver which determines which nodes will be used as coordinators. TokenAwarePolicy will attempt to send requests to the node that owns the data. Our policy setting is LatencyAwarePolicy over TokenAwarePolicy, but it didn't really work, without us realizing it. As an example of LatencyAwarePolicy behavior: on a separate occasion, I ran a test changing the JVM and garbage collection on one node; that increased its latency, and it then started to receive only about 80% of the traffic of the others.

For TokenAwarePolicy, there are a few requirements. It needs to know, somehow, what the partition key for a given query is, in order to find the servers where that data lives. It does this in the getRoutingKey method. If it cannot determine the partition key, it returns null, and the load balancing falls back to the next policy. There is no warning here, and that's what created the issue for us. That was exactly what was happening: we were using a blob field for the partition key, converted from a string, and the conversion was done with the textAsBlob Cassandra function, which is evaluated on the Cassandra coordinator. There was no way for the driver to determine the value of the partition key in the Java application, so getRoutingKey was always returning null, and everything went directly to LatencyAwarePolicy.

Fixing this was very simple: stop using this function and convert the value in our Java code with ByteBuffer.wrap, in all statements. Here is an insert statement, a select one, and the usage of the wrap function, reconstructed below in simplified form. Very simple and very effective. There was an immediate improvement in latency. We did see some hotspots on some small clusters, but it was still worth doing it; in most cases those hotspots actually indicated a pre-existing issue that needed to be addressed anyway.
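Here is a hedged reconstruction of that change (the profiles table with a blob id is a placeholder schema, not our actual one, and in real code the statements would be prepared once and cached):

```java
import com.datastax.driver.core.BoundStatement;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class RoutingKeyFix {
    // Before: INSERT INTO profiles (id, data) VALUES (textAsBlob(?), ?)
    // The conversion runs on the coordinator, the driver cannot compute the
    // partition key, getRoutingKey() returns null, and TokenAwarePolicy
    // silently falls through to the next policy.

    // After: convert client-side, so the bound blob IS the partition key and
    // the driver can route the request to a replica that owns it.
    static void insertProfile(Session session, String id, String data) {
        PreparedStatement ps = session.prepare(
                "INSERT INTO profiles (id, data) VALUES (?, ?)");
        ByteBuffer key = ByteBuffer.wrap(id.getBytes(StandardCharsets.UTF_8));
        BoundStatement bound = ps.bind(key, data);
        session.execute(bound);
    }

    static String selectProfile(Session session, String id) {
        PreparedStatement ps = session.prepare(
                "SELECT data FROM profiles WHERE id = ?");
        ByteBuffer key = ByteBuffer.wrap(id.getBytes(StandardCharsets.UTF_8));
        return session.execute(ps.bind(key)).one().getString("data");
    }
}
```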
That's all. Thank you. We'll take questions in the hallway or something, because we know it's the last presentation and it was quite long. Thank you again.