Hey everyone, good morning, and thanks for joining. My name is Neha and I'm an SDM on the Amazon Keyspaces team. And I'm Vinit, a senior software engineer on Amazon Keyspaces. Today we're going to talk about CDC streaming and how we could improve it.

In today's session we'll cover a general overview of what CDC is, why it matters, and some of the common use cases for it, followed by the common data processing steps taken by any application that uses CDC in its architecture. We'll follow that up with some real-world examples of how companies have used CDC with Cassandra, and as we go we'll talk about the challenges those organizations face. Then, if we reimagine CDC, what are the desirable traits, and finally a pitch for how we could support CDC streaming natively.

So let's start with what CDC is. CDC, or change data capture, is the process of identifying and tracking changes in a system or a database and delivering those changes to a downstream system or application. Why does it matter? In today's world, data is the lifeblood of any organization, and it's crucial that the decisions we make are based on data that is delivered in real time and is conflict-free. When we use CDC as an architectural solution, the goal is to keep systems in sync: with the sheer volume of data we receive today, datasets can easily drift apart, and CDC is the mechanism we use to keep them consistent.

Some of the most common use cases for CDC: first, loading data into a data warehouse or data lake. Organizations that want to run analytical applications on their data do not want to run them against their operational databases, because that could hurt the performance of those databases, so CDC is used to move the data into the warehouse or lake. Another reason to use CDC is syncing data to a cloud application: in a hybrid environment you want some way to get data out of on-premises clusters and move it to the cloud for durable storage. The next one is information dissemination: CDC gives you a continuous stream of changes, and with that stream you can trigger events or do asynchronous processing, which also covers event-based notification systems, for example firing an event whenever a particular change lands in the database. CDC is also one of the most popular choices customers use for data migration and data replication.

In Cassandra, CDC is enabled by setting a property at both the node level and the table level. Once the CDC properties are set, the change data becomes available in the CDC raw directory (cdc_raw), and that is where customers stream it from. A minimal example of those settings follows.
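As a quick illustration of those two properties, this is roughly what enabling CDC looks like in open-source Cassandra. The directory path, space cap, and table name are placeholders, and exact property names vary a little between Cassandra versions:

```yaml
# cassandra.yaml -- node-level settings
cdc_enabled: true                                # turn CDC on for this node
cdc_raw_directory: /var/lib/cassandra/cdc_raw    # where completed CDC segments are placed
cdc_total_space_in_mb: 4096                      # once this fills, writes to CDC-enabled tables error out
```

```cql
-- table-level setting
ALTER TABLE my_keyspace.orders WITH cdc = true;
```

Consumers then tail the files that appear in cdc_raw and are responsible for deleting segments they have processed; otherwise the directory fills up, which is one of the operational challenges we'll come back to.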
To solve for any CDC need, an organization has to do these four steps. First, read data from the commit log: they install some component on each node to get the data off the node and convert it into change-data events. Second, deduplicate. Cassandra relies on the replication factor to ensure data is replicated across multiple nodes, which also means the data is duplicated across multiple nodes, so organizations have to write fairly complex logic to dedupe it down to a single copy of every change. Third, transform the data so that you have a before and an after image. Cassandra doesn't provide the full image of a data change, the full row; it only provides the columns that changed, and any application consuming CDC generally needs the after image, the complete row, so you have to figure out how to build that. And the last step is to publish the data to whatever destination you choose: Kafka, another streaming pipeline, or a custom solution the organization wants to stream to.

Let's look at some examples of customers who have used CDC with Cassandra in the past. One example is Walmart, which followed exactly these four steps. They used a modified Debezium connector to read the data from the commit log. They have an internal tool called the data acquisition tool, also known as DAC, and they built a custom plugin for it to read from the commit log. They used a Redis bloom filter to keep a compact record of changes already seen, reducing each change to a unique value for deduplication (a minimal sketch of that dedup step follows the next example). They then applied a transformation step; their public documentation doesn't say exactly how they transform the data, but they do mention producing a before and after image, and finally they stream the data to Kafka.

The next example is Yelp, which is similar. They use CDC to get data out of multiple databases and publish it into the Yelp data pipeline. They wrote a custom CDC publisher to read the data from the commit log, used intermediate Kafka streams and Apache Flink to stream and dedupe the data, then used RocksDB column-family metadata merged with the commit-log data to transform it into an after image, and published that into the Yelp data pipeline.
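To make that dedup step concrete, here is a minimal sketch of the idea in Java. Each event is keyed on the partition key, the write timestamp, and a digest of the changed cells, so the same mutation observed on three replicas collapses to a single event. Guava's in-process BloomFilter is used only to keep the sketch self-contained; a pipeline like Walmart's shares the filter across readers via Redis instead, and a bloom filter's false positives mean a small chance of dropping a legitimate event:

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;

public class ChangeDeduper {

    // Probabilistic set of change-event keys we have already forwarded.
    private final BloomFilter<String> seen = BloomFilter.create(
            Funnels.stringFunnel(StandardCharsets.UTF_8),
            10_000_000,   // expected number of events
            0.0001);      // acceptable false-positive rate

    /** Returns true if this change has not been seen before and should be forwarded. */
    public synchronized boolean firstSighting(String partitionKey,
                                              long writetimeMicros,
                                              String cellDigest) {
        // The same mutation read from replica 1, 2, or 3 produces the same key.
        String key = partitionKey + '|' + writetimeMicros + '|' + cellDigest;
        if (seen.mightContain(key)) {
            return false;   // almost certainly a duplicate from another replica
        }
        seen.put(key);
        return true;
    }
}
```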
Considering everything we've talked about, there are many challenges organizations face when they build CDC this way. One of the biggest is deduplication, the duplicate data that comes from the replication factor. In Cassandra we use the replication factor for fault tolerance, which means that with a replication factor of three, replicated across three regions, we end up with nine replicas holding the same data; that is a lot of duplicate data, and we want to reduce that volume. Cassandra also does not guarantee ordering: because the data lives on multiple nodes, the change events can arrive at the streaming application out of order. Cassandra also exposes only a limited view of each change, just the columns that changed, which means you need additional, fairly complex logic to build before and after images. CDC logs are also stored locally: every CDC log stays on the node that wrote it, the connector has to run locally to read those logs and get the data out, the cdc_raw directory can fill up, and that can cause availability drops. You also have to think about connector failures, because the connector can go down or something else can happen, and you need to make sure the pipeline keeps working. On top of that, Cassandra CDC does not cover every feature today: it doesn't handle truncate, TTL, or range deletes, and it does not support LWTs and similar features.

Considering all of that, we have reimagined what the desirable traits for CDC should be, and I'd like my colleague to take over. Thank you, Neha. So Neha talked about the various processes Walmart, Yelp, and many others have used. What properties do we want out of these updates? In an ideal case, maybe we can't get them all, but we can get some. In an ideal world, I want the updates to arrive in real time: as soon as my update is made in the database, it shows up in the CDC pipeline. I want exactly-once delivery: I don't want to build extra infrastructure and services just to dedupe my updates, and if Cassandra can promise exactly-once delivery, that would be ideal. If Cassandra can also promise in-order delivery for a particular partition or clustering key, for each row, that would be a good guarantee to have. The other desirable pieces are that we don't want users of Cassandra to build post-processing infrastructure, because that is expensive to maintain and takes time; if we can natively support CDC in Cassandra with some of these guarantees, we will be much better off. That is basically the premise for what follows. We also want availability of old and new images: currently there is no easy mechanism to get the old image, because everything happens after the mutation has already reached the Cassandra cluster, unless you run another database alongside. And we want minimum impact on transaction performance: the mutations and reads going through the cluster should not be affected by the changes we propose. Maybe some of this we cannot achieve. Put together, a native change event might look something like the sketch that follows.
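Purely to illustrate those properties, here is one possible shape for such a change event in Java. The field names are assumptions made for this sketch, not an existing Cassandra or Keyspaces API:

```java
import java.util.Map;
import java.util.Optional;

/**
 * One event per mutation, delivered in order per partition, carrying both images.
 */
public record ChangeEvent(
        String keyspace,
        String table,
        Map<String, Object> partitionKey,
        long commitTimestampMicros,                 // when the mutation was applied
        long sequenceWithinPartition,               // monotonically increasing per partition, for ordering
        Optional<Map<String, Object>> beforeImage,  // full row before the change (empty for inserts)
        Map<String, Object> afterImage) {           // full row after the change
}
```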
So this is a common implementation of CDC today. In this example we have six nodes, node one through six. A client application connects to a coordinator node, node six in this case. Once the mutation reaches node six, it knows it has to modify a certain row in a table, and from the token range it figures out which nodes that mutation should go to; it reaches node two, node three, and node four, the ones that hold that data. A response goes back based on the consistency level the user has chosen. And the red circle running on each node is essentially an agent pushing the logs written on that node to a downstream process. But now you can see there are three of them for the same data, so somebody has to dedupe them, and that's the challenge. Can we eliminate that? That's our goal.

So what are the properties of this commonly used scheme? Is it real time? No, because the async pusher runs in the background, so you're not going to get real-time updates; there will be a delay. The higher the push frequency, the closer you get, but it's not real time. Do you guarantee in-order delivery? No: the events come from three sources, so you have to do some kind of merge to get them in order. Exactly once? Absolutely not. Is it resilient? Yes, it is; you've got three copies of the data, so you're not going to lose it. Does it affect the latency of normal transactions? Actually, it doesn't, and that's the good part of this implementation: it has no impact on cluster performance because everything happens asynchronously.

Okay, so let's put our thinking hats on. Can we elect some kind of leader in this process, a unique node that holds the data together and is the source of truth? Could the coordinator node do that? Let's see. The coordinator node receives the mutation from the application and sends it off to the three replicas, as before. But the replicas, rather than just responding with an ack, could also send their image of the row along with it. That obviously consumes more bandwidth. Once the images are received on the coordinator node and the consistency level the user has chosen is satisfied, whether that's LOCAL_ONE, LOCAL_QUORUM, or whatever it is, the coordinator responds to the client application, and it now holds a single copy of the change. It can hold that single copy because it has to wait for the transaction to complete anyway. So you've gotten rid of the duplication.

Now let's look at the properties again, with that red circle moved onto the coordinator node. If you make the pusher async, it's still not real time, because you collect data and send it off at some frequency. Is it in order? That one is actually a little subtle: it is not. The previous scheme had the nice property that a partition lives on a well-defined set of nodes, but the coordinator is not well-defined; any node can coordinate any partition, so node five could have been the coordinator for the same partition, and you can no longer guarantee in-order updates. Is it resilient? Questionable; we could do some things to make it resilient, but let's see if we can do better. And is it low latency? It is not; it does impact your transaction performance.

There are variations we could make. If we wanted, we could make the log pusher synchronous, part of the write path: every time you get an update you push it out, with some kind of TCP stream running between each node and that pusher. The other thing you could do is read the image before applying the mutation, still at the same consistency level the mutation uses, so the coordinator node obtains the before image ahead of the write (there is a small sketch of this below); combined with the after image, you can provide both. But there are side effects. The coordinator node becomes a single point of failure; we could push the data out to some other peer as well, but even then there would be some delay, and you still can't absolutely guarantee it. There is increased bandwidth consumption: on the wire between the coordinator and the other nodes you are moving more data, so if your cluster is network-bound you will see a latency impact. You are performing more operations in the write path, and a fast write path is one of the bread-and-butter properties of Cassandra, so you're deviating a little from that. And you are still not guaranteeing in-order capture of mutations.
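To make the "read the image before the mutation" idea concrete, here is a small client-side sketch with the Java driver. It is only an approximation: the ks.orders table and its columns are assumptions, the read and the write are not atomic here, and in the proposal above this read would happen on the coordinator at the same consistency level as the mutation rather than in the client:

```java
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.Row;

public class BeforeImageCapture {

    /** Reads the current row, applies the update, and reports both images. */
    public static void updateWithImages(CqlSession session, String orderId, String newStatus) {
        // 1. Capture the before image (in the proposal, the coordinator would do this).
        Row before = session.execute(
                "SELECT status FROM ks.orders WHERE order_id = ?", orderId).one();
        String beforeStatus = (before == null) ? null : before.getString("status");

        // 2. Apply the mutation.
        session.execute(
                "UPDATE ks.orders SET status = ? WHERE order_id = ?", newStatus, orderId);

        // 3. A CDC event could now carry both images.
        System.out.printf("before=%s, after=%s%n", beforeStatus, newStatus);
    }
}
```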
Okay, the next slide is going to be a little controversial: how about we have a leader in Cassandra? We have never had leaders in Cassandra; Cassandra is built on the premise that every node is a worker and every node is equal. What if we had a leader? Things could be simplified. By the way, many database products have come out of the original paper, and Amazon Keyspaces is based on that leader principle.

So if we have a leader, how does this change things? The mutation still arrives at the coordinator node, node six, but instead of sending it to three nodes, the coordinator sends it to the leader, node three in this case, and that is now the single source of truth. Node three sends the mutation out to node two and node four, and depending on the consistency level requirements it accepts acks from node four and node two, and then the response goes back to your client application. Again, you have the red circle, the CDC log pusher agent, running on each node and pushing CDC logs to a downstream application, but there are no more duplicates. I would say we are mostly exactly-once, and the reason I say mostly is that if your leader node goes down, there is a potential that you may have to pull data from node two and node four. So we can guarantee that in normal functioning you will only get one source of the data: by design, this approach in normal operation gives you exactly-once delivery, and it gives you in-order delivery too, because only one node is serving the mutation traffic for a given partition. There are obviously some shortfalls in choosing a leader replica, and it's questionable whether it is the right answer for other reasons, but CDC could definitely be solved with a leader-replica approach, and the per-partition ordering it gives you is easy to preserve when publishing downstream, as the sketch below shows.
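Once duplicates are gone and all changes for a partition come from a single node, the downstream publish step just has to preserve that order. A common pattern with Kafka is to key each record by the Cassandra partition key, so every change for a partition lands on the same Kafka partition in order. The broker address, topic name, and payload here are placeholders for the sketch:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ChangePublisher {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("acks", "all");
        props.put("enable.idempotence", "true");   // don't reintroduce duplicates on producer retries

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Keying by the Cassandra partition key keeps all changes for that
            // partition on one Kafka partition, so consumers see them in order.
            String cassandraPartitionKey = "order-1234";
            String changeEventJson = "{\"afterImage\":{\"order_id\":\"order-1234\",\"status\":\"SHIPPED\"}}";
            producer.send(new ProducerRecord<>("orders-cdc", cassandraPartitionKey, changeEventJson));
        }
    }
}
```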
So, the advantages of a leader replica: it definitely provides better CDC properties, and depending on your needs you can guarantee properties that are helpful for CDC. Your consistency levels still behave the same, there is no loss there. And outside of CDC, you can actually make lightweight transactions, compare-and-set operations, faster. One more thing we do: you don't want to keep one leader for eternity, so you can run some kind of leader election mechanism every hour or so to change leaders, so you're not forcing all the load onto a single node all the time. There are disadvantages too: leader election is a complex process and needs special code. There was a talk yesterday from Alex Petrov where he mentioned that the Cassandra community is moving toward being more adaptable, more elastic, that's the word he used, and to me, if we swap out the interaction protocol between the coordinator node and the other nodes, this could be made pluggable. There was another talk, by Branimir Lambov, about pluggable memtables, where he introduced a trie in place of the sorted set of keys used today. So I see real evidence that the Cassandra community is moving toward making Cassandra more adaptable to the needs of its users, and I think we should put some effort into choosing a scheme that is optional and that does improve things for certain use cases. That's it. Thank you; we're happy to take questions.

So, on leader election: for us it doesn't use the gossip protocol. It's based on an external service that runs outside the cluster and coordinates all of the leader election, and it can take time; it's not instantaneous. That external service is the arbitrator of this information, that there is a new leader, or that a leader has failed and a new one has been elected, and that has to be disseminated to every single node in the cluster.

Presumably the leader will be one of the three replicas, and the other replicas will be able to pull from it: they receive the data asynchronously, and there is an asynchronous process, like anti-entropy, that keeps them caught up with the data updates all along. It is absolutely going to add some latency, but the closer you do this to the node itself, the better off you are, because you have to read from disk and do the operation right there.

On Keyspaces: CDC is not supported in Keyspaces currently; we don't have that, but we do have a leader today. We use a leader-based mechanism as part of our write operations, so across all of the replicas that we have, there is a leader. We don't support CDC today and it's not on our roadmap. Just to repeat the question a little, it was more about the latency impact of choosing a leader, not really about CDC, right? Okay. In terms of latency with the leader-based mechanism, Keyspaces operations today are delivered at sub-millisecond latency.

If anyone has more questions, these are the places you can find the Keyspaces team. Some of these sessions are already in the past, but today we have a workshop happening at 11:50, so we can catch up with you there. Apart from that, we have a booth, so we can meet you there as well if you need more information. Thanks, everyone.