The Carnegie Mellon vaccination database talks are made possible by AutoTune. Learn how to automatically optimize your MySQL and Postgres configuration at AutoTune.com. And by the Stephen Moy Foundation for Keeping It Real; find out how best to keep it real at StephenMoyFoundation.org.

Hi guys, thanks for coming. It's another talk in our vaccination database talk series. We're super excited today to have Gunnar Morling. He is the project lead of Debezium at Red Hat, and he's in Germany, so we appreciate him staying up late with us. As always, if you have any questions for Gunnar as he gives the talk, please unmute yourself, say who you are, and ask your question, and feel free to do this at any time, because we want to make it feel like he's giving the talk with us on campus rather than doing it remotely. So Gunnar, with that, the floor is yours. Thank you so much for being here.

All right, Andy, thank you so much for having me. Thanks everybody for joining. Yes, for me it's 9:30 in the evening. I hope I can, you know, hold up, but it should be fine; I had plenty of coffee during the day. And yeah, thanks for having me for this talk about open source change data capture with Debezium. I want to talk to you about essentially three things today. First of all, I would like to tell you a little bit about what change data capture is, and in particular why many of our users are so excited about it — so what the use cases are, why you would want to use CDC. Secondly, I will talk a little bit about what Debezium is, as one particular open source implementation of change data capture, or CDC for short. And lastly, and I guess that might be the most interesting part for folks here, some of the challenges we encountered while implementing Debezium. I believe usually in this talk series you talk about databases and implementation specifics of those. Now, Debezium isn't a database by itself, but I thought maybe I can try to make some points which folks like you, if you create a database in the future, should consider to make the implementation of CDC easier. So what could database architects do to make the lives of people like me a bit easier in the future? I hope I can bring across those points a little bit.

A few words about myself. As Andy was mentioning, I am a software engineer. I work for Red Hat, which is the main sponsor of Debezium, so obviously I contribute quite a bit around that project. There are a few other efforts I've been involved in. I've been contributing to Hibernate — I'm not sure whether I can say that in this call or whether people will freak out. It's real, it's not like a fake thing or anything. No, no, no, it's real. It's still very popular. So yeah, that's a bit of my past. I've been doing quite a few other things. Maybe you've heard about Quarkus, which is a modern stack for implementing cloud native microservices. I've been the spec lead for Bean Validation 2.0, which is a spec in the Java space for putting constraints on your Java object model. And quite a few other things. And as Andy was asking, what is a Java Champion? It's a program set up by Oracle to recognize people who are very active in the Java community, and I happen to be one of those. If you would like to know what I'm up to, I suggest you follow me on Twitter. If you've got any questions, you can reach out to me and I will be very happy to answer them as well as I can.
Either here today or later on on Twitter. All right, so then let's get right into it and, first of all, discuss a little bit what change data capture is and how Debezium fits into the picture there. The idea is really quite simple. As you all know, in your database you have what's called a transaction log, right, which records the changes which happen to your data. So for each insert, for each update, for each delete, there will be an event within the transaction log, which usually is used for transaction recovery or for replication amongst different database nodes. But in our case, what we do is we capture the changes from that log and we propagate them as events to any kind of downstream consumer.

And why are we doing that? Well, this essentially allows you to take the data from your operational database over to other systems where you would like to have and process the data. For instance, very often people would like to take data from the operational system over to a search index in Elasticsearch. And of course this all should happen with a very low latency, right? Your full-text search query results from Elasticsearch or OpenSearch should be very fresh, they should be up to date. So you don't want to do this in a batch way which you maybe run once per day; instead, whenever something changes in your database, you update that search index as quickly as you can. And this is exactly what change data capture and Debezium allow you to do. To give you some realistic numbers: it depends a little bit on the particular database, but we can react to changes in tens or maybe hundreds of milliseconds, and you could have an end-to-end latency from your primary database over to some secondary system of maybe two seconds or even less, maybe one second. I cannot use the term real-time, right, because we have specific semantics associated with that, but I would say we have very low latency.

So we want to take the data from an operational database over to something like a search index, or a cache, maybe a data warehouse, some sort of OLAP system. And this is for instance one use case which I see coming up more and more lately: we have systems like Apache Pinot which try to do user-facing analytics. Unlike in the past, where people would only run occasional ad-hoc queries against the analytics systems, nowadays we actually can think about user-facing analytics, and maybe have dashboards which are shown to our end users and which are driven by some sort of OLAP system. And again, of course, the data in the OLAP system needs to be very fresh, very up to date, and such a CDC pipeline lets you do this. So that's why we would like to do this.

Now, on this picture you also see a little bit of how Debezium goes about implementing this. On the left-hand side we have our source databases, and there is what we call a connector for each of those databases. So there's a Debezium Postgres connector, there's a Debezium MySQL connector, and so on. They essentially tap into the transaction log using whatever APIs, formats and so on we get from a particular database, extract those changes, and send them over to those consumers. Now, we could think about directly hooking up something like Elasticsearch with Debezium; we could just take the data and send it straight to Elasticsearch or straight to Apache Pinot.
But most of the time, people like to put something in between, like Apache Kafka as a distributed commit log. The reason is that this just increases the optionality which we have here. For instance, with Apache Kafka in the middle — which by the way has very nice scale-out characteristics — we can set up multiple consumers, so we can react to one change multiple times: we can take the data and write it to Elasticsearch and to Pinot, or to our cache, for instance. So it's just more flexible. Also, depending on how long we keep the data in Kafka, we could think about setting up new consumers down the road. We can keep data in Kafka as long as we want — well, as long as disk space is there — and this means that once we have this log of changes in Kafka, we can enable more use cases down the road which you might not have had in mind when you set up that source connector initially. So that's why we typically have Kafka here.

And then there's this other thing which is called Kafka Connect. Kafka Connect is essentially a runtime and a framework for implementing such connectors. So essentially, Debezium is a set of Kafka Connect source connectors, which take data out of the database and put it into Kafka. And then you have sink connectors, which take data out of Kafka and write it into some sort of external sink system. The nice thing about this is that it is all set up by means of configuration. Typically, Debezium users are not necessarily programmers; they can be data analysts, or people who are maybe a bit tech-savvy, but they can set up such a pipeline just by configuring it — they don't have to program. So that's Debezium in a nutshell, and change data capture in a nutshell.

So what else can we do with that? The way I like to think about this is that it's essentially liberation for your data. Instead of you having to go and query for changed data, it comes to you with this very low latency, and this lets you do so many nice things. We have spoken about replication — I would say all this aspect of taking data from a database over to a search index, another database, a data warehouse, a cache, I would summarize as replication. But then there are other things. For instance, you can use change data capture for propagating data between microservices. Typically in microservices architectures you would like to have one database per service; you don't want to share a database between multiple services. But still, those services don't exist in isolation, right? They need to exchange data, and as we will see later on, CDC can be a means of doing that. Audit logs, that's another very interesting one. Oftentimes you have this kind of business requirement where you need to keep track of the history of your data as it changes, and well, those CDC events, if you keep all of them, form sort of an audit log, right? So really, the sky's the limit, and you can do all kinds of interesting things with CDC.

Debezium, as I mentioned, is one particular open source implementation of that. There are others, but naturally I'm most familiar with Debezium. And really, we try to be an entire CDC platform. So there are things like snapshotting, which is about taking the initial set of data and essentially sending a dump of it over to your system — we will talk about that in more depth later. There's a user interface, and there are things like filtering and all kinds of stuff.
So it's a complete, open source CDC platform, and it tries to address all those kinds of concerns you may have around CDC. It's open source, it's Apache licensed. And what I'm most happy about is that there's a very active community. As I mentioned before the talk, more than 350 people have contributed to Debezium at this point. That's quite a large number of people, and it's not only Red Hat engineers; it's also other folks who work on the connectors for Cassandra or for Vitess, for instance, and who drive the work there. So it's really a very diverse and active ecosystem.

In terms of connectors — and now this is actually getting a little bit into the database space — the situation nowadays is that there is no single API or single format which we could use to extract changes from all kinds of databases. So you cannot just implement one Debezium connector and be done. Instead, we need to have a bespoke, dedicated connector per database, and for each of those databases we did the research: okay, how do we get changes out of MySQL? How do we do it for MongoDB? And so on, for all those databases. But then what we try to do is expose those change events in a mostly uniform, abstract way, so that for you as a consumer of those change events, you don't really have to care: does this event come from Oracle, does it come from Postgres, and so on. MongoDB is a bit of an exception, because it's a document store, not a relational database, so just by the nature of how the database is designed, the event looks a bit different. But for all those other connectors, the event structure is quite uniform.

MongoDB is different because you can update within a document without knowing what the whole document is, and you need to know what the reference of the document is. So I understand that parsing the log and handling that is different. But the event that it emits, you're saying, is fundamentally different from what Postgres would put out?

Yes, exactly. The thing is, most notably, MongoDB is a schema-less store, right? So within a particular collection — I don't know, of customers, of purchase orders and so on — at least in theory, all the documents could be different. Typically they will share a common structure, right? Your customer documents will look largely the same, but maybe you have different versions of your application, so there may be differences. And this is why those events for MongoDB are not strongly typed. In Kafka Connect there's a notion of a type system — which we will actually see a little bit later on — which gives you a schema, and this schema then tells you, okay, this is the structure of that event. In the case of MongoDB, because of that schema-less nature, we essentially give you a stringified representation of that, and the schema is essentially part of the stream, so to say. So that's the main difference.

All right. Now, the thing which I found pretty interesting is this: we built Debezium and we built those connectors, and of course we tried not to reinvent the wheel for each of those connectors. So what we did is we created a connector framework, which tries to unify things for us and makes given functionality reusable — like, I don't know, this snapshotting stuff, or how we go about filtering, how we go about message format creation, and all those kinds of things. So we built a framework for that.
And if you add a connector to the Debezium project, it's not too much work, I would say; you really can focus on the part which gets changes out of that particular database. Now, what's very interesting is that other people outside of the Debezium community have also picked up that framework. ScyllaDB is one example. They essentially built a Debezium-based connector, but it's not part of the Debezium GitHub project or the Debezium website or anything; they do it completely on their own, but it's still based on the Debezium connector framework. And, well, A, it gives them lots of things to reuse, but B, it also means that if you use the ScyllaDB connector, the change event format will look very similar, or really the same, as the format of all the other Debezium connectors. And really, the thing is, people have multiple databases, right? There won't be many companies which just have Postgres; most have multiple databases. So giving them change events which look the same for different databases is a big plus, because then they can simplify how they process those change events; they don't have to care about different formats. And this is, I guess, why the ScyllaDB folks picked up the Debezium framework, and I think that's a pretty cool thing. They did a blog post on our blog, so you can read up more about it if you want.

All right. So let's dive a little bit deeper into the Debezium architecture then. I won't give you too many details, but there are a few things which I want to call out. One is that the Debezium architecture is centered around this notion of change event sources, or CES for short. Essentially, there are two kinds of change event sources. One is there for snapshots — we will come to that in a little bit more detail — and that essentially boils down to running a select star against your tables. And then there are the streaming change event sources, with the event handler here, and they react to data changes. Most of the databases, in their CDC APIs, give us some sort of callback API which invokes a callback whenever there's a change event. So we have those two kinds of change event sources, and there is some coordination between them; typically, once the initial snapshot is done, for instance, we want to continue with streaming.

Now, the thing is, all of this runs as part of the Kafka Connect framework. The reason for doing that is that Kafka Connect, for instance, allows us to scale out horizontally in a Kafka Connect cluster. Now you might say, okay, reading from a transaction log, that's by definition mostly a sequential thing. But where we can scale out is, A, when it comes to stores like MongoDB, which just give us multiple streams from the different replica sets, as they call them — so there we have a notion of being able to scale out and parallelize. And also, for instance, with this notion of snapshotting: we could of course run multiple snapshot selects against multiple tables concurrently — we don't do it yet, but we could — and again this Connect framework, with its notion of tasks which can be shuffled around in a Connect cluster, will be useful there. So that's essentially why we run on Kafka Connect. Now, the thing is, Kafka Connect itself is polling based.
So it will not use callbacks but it polls, which means — as most of our source APIs are callback based — we essentially need to go through this kind of blocking queue, which at least gives us some means of back pressure. We fill up this queue, and when it's full, well, we cannot proceed with any further change events from the database, so we block; and then Kafka Connect will poll and consume those events out of this blocking queue and send them off to Kafka. So that's a little bit of the internal architecture. I'm just mentioning it because there's this kind of mismatch which we have between those push-based sources and this polling-based Kafka Connect API, which creates a little bit of friction — we can work with it, but I thought it might be interesting.

So then let's zoom out a bit again and shortly talk about different ways of running and operating Debezium. I was mentioning Kafka Connect a lot — again, Kafka Connect is a framework which is part of the Apache Kafka umbrella — and I would say 90% of our users are using Debezium via Kafka Connect. That's the most common way, but we do have two more ways of using Debezium. The first one is what we call the embedded engine. Debezium is all written in Java, and this means you also can embed it into your JVM, into your Java-based application, as a library. Typically users do that who have very specific requirements, or who maybe want to react to change events at a rather low level. Things like the Apache Flink CDC connectors, for instance — they are based on the Debezium embedded engine, and they use Debezium programmatically to ingest change events into their Java-based application. So that's the embedded engine.

And then what we also support is what's called Debezium Server. Debezium Server is essentially a ready-made runtime which is based on this engine and which, in an overall architecture view, fulfills the same role as Kafka Connect. So it's a process environment for running the Debezium connectors, but unlike Kafka Connect, Debezium Server lets you take those change events and send them out to other kinds of messaging infrastructure. We have adapters for things like Apache Pulsar, Google Cloud Pub/Sub, AWS Kinesis, Redis Streams — by the way, the Redis folks are very active on that one — and we try to really bring the features and capabilities of Debezium to all kinds of users, whether they are on Kafka or on something else; we want to support users on whatever messaging infrastructure they use. So that's Debezium Server, and I would say this is ramping up over time; I've seen more and more users lately who are interested in this — as I mentioned, the Redis folks are very interested. So I would say there's definitely some uptake there.

Maybe a question: how much processing is the embedded engine actually doing per event? Like, are you just serializing whatever Postgres or MySQL generates, converting it to the expected format and shoving it out, or are you doing things like, you know, waiting to see the commit message from a transaction and then sending the events out?

Yeah, okay, so that's a very good question.
So there is a notion of batching at play, because most of those messaging infrastructures benefit if you send the messages in batches rather than one by one — it's just more efficient — so you can do that if it is supported by the particular sink. What currently is not happening is what you describe, that we would do some sort of buffering based on transaction boundaries. The reason is that a transaction can become arbitrarily large, right, and we would have to somehow buffer all the events from one transaction in memory. You could implement it yourself if you wanted to: we also give you events — if we get them from the database — about transaction boundaries, so there is the ability to tell, okay, a transaction has started and a transaction has ended. But then you would have to go and implement this kind of buffering yourself. Actually, that's very interesting, because sometimes people would like to do that: they would like to hold back events and only see all of them once one particular transaction has committed. So we provide the basic means for allowing you to do it yourself, but it's not something which Debezium does at the moment.

If there's a bunch of changes written to the write-ahead log and they get propagated through Debezium, and then the transaction aborts — do you get the abort message, and then whatever is on the other side of Kafka has already seen them?

Typically we only get changes from committed transactions. Really, most of those interfaces are designed in a way that they only give us committed transactions. Oracle, most notably, is an exception, so there we actually need to implement this kind of buffering already, because it could be that transactions are rolled back. For that we have the ability to spill a cache which we keep over to disk, in case transactions become very large.

That was my question, thank you.

All right, cool.
Yes, so now, I spoke a lot about change events, and I thought we should also discuss what a change event actually is. This is how it looks — and by the way, I should say, this is a JSON-based representation, but what really matters is the semantic structure, how it is structured. In terms of how this is serialized into Kafka, you can go about it in different ways, because in Kafka and those infrastructures messages typically are just byte arrays. So it's your responsibility — well, let's say it's a matter of configuration — to take that semantic type structure and serialize it either into JSON, as we see here, or into Avro as a compact binary representation, protocol buffers, and so on.

Focusing a bit on that semantic structure: the core parts really are those before and after parts, which are there in each change event, and they describe the old and the new state of a row. Typically, in the case of an update you would have both of them; if this is an insert event, you would just have the after part; and if it's a delete event, you would just have the before part. Now, within those two blocks, before and after — and this is coming back to this notion of a type structure — this resembles the structure of the tables you're capturing. For each column within your captured tables, you would have a field within those blocks. Again, MongoDB is the exception: there we essentially give you a stringified representation of the changed parts, or the entire document really, and it would not be typed at that level. So that's before and after. Then, in addition, there's some metadata, which gives you, I don't know, information about the version and name of the connector, the position in the log file, some unique identifiers — which you can create from, I guess, the transaction ID and in this case maybe the table. Sometimes you can get the query which caused a change, which can be very interesting. And then of course you have this op identifier, which tells you what kind of event it is, and lastly a timestamp. So that's how change events look in general.

Sometimes you can't get the query?

Yes — so for Oracle, sorry, for MySQL, there's an option where you need to enable that in the database, and then they will persist the query in the binlog and we can fetch it from there.

That's pretty cool.

All right, so then let's talk a little bit about this outbox pattern and this problem of microservices data exchange. I'm bringing this up because I see this is a very relevant problem. The problem is, people build microservices architectures, and what typically happens is that a service — say it has a REST API and it receives a request for placing a purchase order — now needs to update its own database, so we want to persist the purchase order; but at the same time we also would like to send a message to other downstream services. Let's say we have this order service: it should persist the purchase order in its own database, and then we also want to send a message to the shipment service, let's say, for fulfilling this order, and maybe you use Kafka for doing that. And now the thing is, those two things — updating the database and sending a message to another service via Kafka — cannot happen in an atomic fashion, because Kafka just cannot participate in distributed transactions. And even if it could, I would argue it's not desirable to do that. So really, those two things are independent, and
this means that doing it this way is prone to inconsistencies. We could end up in a situation where we have persisted the purchase order in the database but then lost the message which we wanted to send to the shipment service; or maybe even worse, we sent that message to the shipment service and told it, hey, please fulfill that order, but then we failed to persist that order in our own database — and obviously this would be very bad. So this is what you shouldn't do; as they always say, friends don't let friends do dual writes. That's what we call dual writes: trying to update those two resources without shared transaction guarantees.

So what can we do instead? Well, if we cannot update multiple resources, we always can update a single resource, and this would be the database of our order service. So what it does is it updates its own local tables — let's say we have a model with an order table and an order line table — and then, within the same transaction, within a local transaction boundary, it also inserts a message into a table which we call the outbox table. The purpose of this outbox table is just to act as a message relay, which we consume via Debezium to take those messages and send them to other consumers. And because all of this happens as one atomic transaction, we either do all of it — so we persist our purchase order and we persist this outbox message — or we roll back the entire thing and we do none of it. Now Debezium, CDC, enters the picture: we capture the changes, and in this case we only capture the inserts into this outbox table. We don't care about the other business tables; we just capture the inserts into this outbox table, and we take those events and send them to Kafka.

Why do we have the separate outbox table? Well, it's essentially a means of separating our internal model from the messages which we send to external consumers. You could capture the changes to those order and order line tables and send those messages to your consumers, but then you would really expose the internal schema of your table model, and depending on where you stand on those things, that's something you may or may not want to do. Most of the time people prefer some sort of abstraction, and with this outbox table in the picture they have a separate message contract which they consciously evolve, and which is independent of the internal model, which they then could refactor or rework as they like.

How could this outbox table look? I like to think about it in terms of domain-driven design, where you have concepts like an aggregate, which for instance describes the type of a particular data item — so our aggregates here would be purchase orders, customers, things like that. You have something like an aggregate ID, which tells you, okay, which particular purchase order this is about. You might have something like an event type, which allows you to differentiate different types of events which apply to one particular order. And lastly, you have this payload, and this payload could be really anything — in this case it's a JSON structure, but again, it could be anything — and you are in charge of creating that structure and evolving it independently from your internal table model.
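To make that concrete, a minimal sketch of such an outbox table and of a transaction writing to it could look like this — the table, column names and values here are purely illustrative, not a required schema:

    -- the outbox table: aggregate type, aggregate id, event type, payload
    CREATE TABLE outbox (
      id            uuid PRIMARY KEY,
      aggregatetype varchar(255) NOT NULL,   -- e.g. 'order'
      aggregateid   varchar(255) NOT NULL,   -- e.g. the purchase order id
      type          varchar(255) NOT NULL,   -- e.g. 'OrderCreated'
      payload       jsonb NOT NULL           -- the message contract for consumers
    );

    BEGIN;
    -- 1) update the service's own business tables
    INSERT INTO purchase_order (id, customer_id, status) VALUES (10042, 7, 'OPEN');
    INSERT INTO order_line (order_id, item, quantity) VALUES (10042, 'widget', 3);
    -- 2) insert the outgoing message into the outbox, in the very same transaction
    --    (gen_random_uuid() needs PostgreSQL 13+ or the pgcrypto extension)
    INSERT INTO outbox (id, aggregatetype, aggregateid, type, payload)
    VALUES (gen_random_uuid(), 'order', '10042', 'OrderCreated',
            '{"orderId": 10042, "customerId": 7, "lines": [{"item": "widget", "quantity": 3}]}');
    COMMIT;

Debezium then only captures the inserts into the outbox table, as described above, and can ignore the business tables entirely.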
And there's a bit of bespoke support for that pattern in Debezium: essentially, if you have such an outbox table and it adheres to a particular convention, then you can use that out of the box, and Debezium will for instance take care of routing the events from that outbox table into different Kafka topics per event type. The reason is that by default you would get one Kafka topic per table, so you would end up with all the outbox events in a single topic, whereas typically you would rather have your order events in one topic and your customer events in another.

Sorry — how do people typically implement that? Like, do you use triggers, or do you write application code? In practice, what do you see as the most common approach?

Right, that's a good question. So yes, it's a matter of your application design how you go about it. What I would recommend, and what I also see, is doing it in an event-driven way. For instance, for this Quarkus stack which I mentioned in the beginning — you can think of it as an alternative to Spring Boot, essentially — we have an extension for the Quarkus framework which allows you to just emit an event in your application code. So there's an eventing API, and you just say, okay, I would like to fire such an order outbox event, and then this extension takes care of persisting it in this table structure. So it's really quite lightweight, actually.

Now, the thing is, here we have this approach where we have this outbox table, but actually there's an interesting alternative for implementing that pattern, and this comes back a bit to this question as well. If you are on Postgres, you have the ability to write messages straight to the transaction log. There's a function called pg_logical_emit_message, and this function allows you to just emit an event to the transaction log — you won't see anything of it in any table, it goes straight into the transaction log. And this is very cool for this outbox pattern, for just using Postgres to send messages elsewhere, because in this case we also can capture those events. You see here how we do this function call: you say whether it should be transactional, true or false — in this case yes, we want it to be transactional, so we don't want it to be persisted if the transaction gets rolled back; then there's some sort of type, so in this case I just say it's an outbox message; and again I have my message payload. And now the cool thing is, for instance, this Quarkus extension could also be implemented based on that particular API, and we wouldn't need a separate outbox table, which would be nice for a reason: we don't need additional disk space for materializing this table, and we also don't need to think about housekeeping of the table. Because, well, once our outbox messages have been sent, we would need to delete them from the outbox table, right? Whereas here, if we only ever have those outbox messages in the transaction log, we don't need to think about that housekeeping ourselves. So that's why I think this API in Postgres is a very cool thing for implementing something like the outbox pattern.

Would it be possible to have like a callback mechanism to say, for the outbox pattern, these messages have been sent, so they could be cleaned up?

So — I will touch on that a little bit more later — there is a notion of offset management, which allows us to tell, for instance, Postgres, okay, this is how far we have processed the messages from the log. But you couldn't really tap into that, so it would have to be a sort of separate process, where I guess you would have to monitor the offsets which we have acknowledged, and then you would be able to delete everything up to that point. So it would require a bit of logic.
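For reference, that function takes three arguments — whether the message is transactional, a prefix identifying the kind of message, and the payload itself — so an outbox-style call might look roughly like this sketch (the prefix and JSON fields are just examples):

    BEGIN;
    INSERT INTO purchase_order (id, customer_id, status) VALUES (10043, 7, 'OPEN');
    -- emit the message straight into the write-ahead log; no table is written
    SELECT pg_logical_emit_message(
      true,        -- transactional: discarded if this transaction rolls back
      'outbox',    -- prefix, i.e. the kind of message
      '{"aggregatetype": "order", "aggregateid": "10043", "type": "OrderCreated", "payload": {"orderId": 10043, "customerId": 7}}'
    );
    COMMIT;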
The thing is, the outbox pattern is also sort of providing additional semantics about what the events are actually trying to capture. You don't necessarily have to do this — you can take Debezium, just point it at the database, it will get all the writes to every single table and it will propagate that.

Yes, so those are essentially the two alternatives you have, right? And I would say it's a philosophical question: are you fine with exposing your internal table structures via those raw change events, or are you not fine with that? I would say it depends a bit on the use case. If you think about something like caching, where I would argue you exactly want your raw table structure to be reflected in the cache, I would say that's fine. If you have this kind of inter-system message exchange, then I would probably say you would rather want to abstract it. Now, also, if you capture the raw events, you have some means in Kafka Connect to do things like rename fields or change their types and so on, so you can modify messages a bit and try to abstract away some schema changes. But I would say the outbox pattern is definitely more powerful. For instance, you also can materialize an entire aggregate into a single event, right? If you think about this purchase order, in a relational database you probably would have stored it in multiple tables — there's a one-to-n relationship between the purchase order header and the order lines — and if you have those raw events, you need to think about combining them and materializing, let's say, a single document which you can persist into Elasticsearch to represent an entire purchase order. Whereas you can do all of this more easily, I would say, in application code using the outbox pattern.

Cool. And yeah, just briefly, another use case for this pg_logical_emit_message API is providing auditing metadata. As I mentioned, a CDC log, if you keep it, is kind of like an audit log, but what you are typically missing is additional metadata which you often want to have — like, who is the business user? You don't care about the database user; you want to have some sort of business-level user who did the change, or maybe some sort of use case identifier, maybe some sort of client timestamp, this kind of thing. And typically you don't have that metadata in the actual tables themselves, whereas for the purposes of an audit trail you would like to have it. Now, again using that logical message API, your application could, at the beginning of a transaction, emit a message with that metadata, stored in whatever format you like, and then you could implement some small logic for processing the events in Debezium before you send them out to Kafka. What you would essentially do is wait for this kind of metadata event to show up in the event stream, take that metadata, keep it in memory somehow, and then enrich all the change events which come in until this transaction has committed — until you see a transaction commit event — with that metadata. And now those events would be fully self-contained as far as that metadata is concerned. So again, I think that's a pretty cool use case, and if I could ask the future database architects for something: I believe having this ability, this facility to emit events straight to the transaction log, is very useful and a very cool thing to have in the database.
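As a small sketch of that audit-metadata idea — the prefix and the JSON fields here are made up for illustration — the application's transaction could look like this, with a downstream enrichment step stamping the metadata onto every change event until it sees the corresponding commit:

    BEGIN;
    -- emit the business-level metadata once, at the start of the transaction
    SELECT pg_logical_emit_message(
      true,
      'audit',
      '{"business_user": "jdoe", "use_case": "cancel-order", "client_ts": "2021-11-29T20:15:00Z"}'
    );
    -- the actual data changes of this transaction follow it in the log
    UPDATE purchase_order SET status = 'CANCELLED' WHERE id = 10042;
    COMMIT;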
All right, I have to keep an eye on the time a bit — yeah, but I think we're good. So now, I hope I could give you a bit of an understanding of what Debezium is, what CDC is, and why I think it's very interesting and very powerful. Now I would like to talk a little bit about some of the challenges which we had while implementing this, and of course also how we overcame them.

The first one is: how are those messages structured, and how can we reason about their schema? The thing is, typically such a change event — and again, this is highly dependent on the database; here we see an event as it comes from MySQL — tends to not be self-descriptive. If you look at this, you cannot fully make sense of it, right? You see no table name, you see no column names. It gives you some hints: there's a table ID, it tells you which columns are contained, identified by indexes, and then you see those values. But just by looking at this event, you cannot reason about what it is, which means you need extra information to fully interpret the event. Now, one thought might be: okay, I could try to go to my database and just query for the schema of the table. Let's leave aside the fact that we just have this table ID — let's assume there was a table name and we would know which table this is about. We could try to go to our database and fetch the schema for it, and then try to make sense of that message. The thing is, a schema can evolve over time, right, and all of this is asynchronous, which means that at the point in time when we receive this change event, the schema of the table might look different than at the point in time when the event was created. So this event got produced, got persisted in the transaction log; a schema change happened — maybe they renamed a column, or they added a column, whatever; and now we get to process that event. If we go to the database and query for the schema of the table now, there would be a mismatch between the schema which was in place when this event was created and the schema as it looks now. So we cannot do that.

The question is, how can we go about obtaining that schema information? It depends a little bit on the different connectors. The worst case would be MySQL, because what we have to do there is parse the DDL statements as they happen. So for each create table, alter table and so on, we need to parse that — and those statements are part of the change event stream as we receive it — in order to know what those table structures look like. I have given you an example here, and you would say, okay, that's rather easy-ish to parse. We use the ANTLR parser generator, which I guess you have heard of, and which makes this relatively simple. The thing is, those grammars tend to be very complex. This is a very basic example of a create table statement; there can be all kinds of modifiers, flags and switches attached to it. And we cannot only parse the create table statement; instead, we have to support the entire MySQL SQL grammar — stored procedures and everything — because we need to implement it fully, right? You cannot parse something just partially. And in particular in the case of MySQL, they also tend to evolve the grammar, even in micro releases they add new features, and very often we get a bug report which says, okay, I have executed this particular DDL statement and Debezium couldn't parse it. So that's quite challenging.
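Just to illustrate that point, compare a deliberately simple statement with one that uses a handful of the legal extras — both made-up examples, but representative of what the parser has to accept:

    -- easy-ish to parse
    CREATE TABLE customers (
      id   BIGINT PRIMARY KEY,
      name VARCHAR(255)
    );

    -- still has to parse: character sets, column options, generated columns,
    -- index options, table options, ... (different name only so both can run)
    CREATE TABLE customers_v2 (
      id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
      name       VARCHAR(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci,
      email      VARCHAR(255) NOT NULL COMMENT 'business key',
      name_upper VARCHAR(255) GENERATED ALWAYS AS (UPPER(name)) STORED,
      created_at DATETIME(6) DEFAULT CURRENT_TIMESTAMP(6),
      PRIMARY KEY (id),
      UNIQUE KEY uq_email (email) USING BTREE
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 ROW_FORMAT=DYNAMIC;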
And I would say it's not the most ideal way of conveying that schema information; there are different ways, and we will see what makes life a little bit easier for us. So that's the one thing: we need to parse the DDL so we can build up some in-memory model of the schema, and then, when the change events come in, we know, okay, this is the structure and this is what this event looks like.

Now there's another challenge, which is that such a connector can be restarted at any point in time, right? Maybe you update the connector version, you update your Kafka Connect cluster, or maybe, I don't know, you just want to shut it down for whatever reason and not run it for a while. So this connector gets restarted, and now we still need to somehow re-establish this schema. We parsed those DDL statements the last time the connector was running, so we need to persist that information somehow, so that when we get restarted we know what the schema looks like. And this is why we have this notion of a schema history topic, or database history topic: essentially, the Debezium connector persists the schema history in a special Kafka topic, and if the connector gets restarted, we can re-read this entire topic from the beginning to reload that state, so we know what the schema of the database is. There are quite some subtleties around this, because what you need to keep in mind is that in Kafka and Kafka Connect you typically have what's called at-least-once semantics. For instance, those offsets are not stored after each and every record; instead this happens every now and then, in intervals, for the sake of efficiency. This means we may process some events a second time if the connector gets restarted — maybe it crashed and couldn't persist its offsets correctly — and this is why, when we reload this history, we need to be careful about how far we reload it, and so on. So it's quite tricky, and I'm mentioning it because I feel there should be a better way of doing all that.

I would say Postgres is actually doing a pretty good job here. The idea they have is: we want to make our messages efficient, so they should not be fully self-descriptive all the time — something like JSON, which carries the field names in each and every message, would be very inefficient. Instead, what they do is, essentially, when we start a session or a connection, they give us a message which describes the schema of our tables, and then again, when there's a schema change, they give us a message which describes the new structure of that schema. And this message is not DDL which we need to parse; it's a typed structure which we can easily and properly interpret. This allows us to build up an in-memory model of the table schemas, and then, whenever those change events come in, we can go to our in-memory model and interpret them accordingly. So I would say that's a good way of thinking about this: provide this information once, so we can consume it, and provide it in an easy-to-consume way, so we don't have to deal with all the parsing and keeping track of it in a separate topic and so on. That, I would say, is pretty useful.

Yeah, sure — I guess maybe a question, or just a thought: as an alternative to parsing, would you be able to just apply the DDL statements to, like, a fake or miniature MySQL instance and query the schema directly from that?

That's an interesting thought. I guess we could — I haven't thought about that, but that's actually a pretty cool idea, because you already need to query it by SQL
at some point anyway, to build up your in-memory idea of what it looks like. Yeah, so that could actually work. I would have to think about the details a bit, but this is an interesting approach — thanks for bringing it up. That's cool; I think that could actually work.

All right, cool, so then let's keep going — and I'm looking a bit at the time here — just on this notion of efficiency. Another thing which can be easy or hard for us is what those change events look like. For instance, in the case of Oracle, where we query this LogMiner view which they have, this is actually how we get change events. Let's say you execute the update statement at the top; this will result in an event in this LogMiner view which looks like the one at the bottom. It's a generated update statement which essentially tells us about the changed fields, and in the where clause it tells us about the old values of the entire row. And that's, I would say, a rather inconvenient way of telling us about this information, because again we need to parse it — and here we cannot use ANTLR, actually, because it's just too slow, so we have an optimized, hand-built parser just for parsing those DML events. It would be so much easier if we got a typed structure, as is the case for Postgres and all the other connectors, really. So again, the point I'm trying to make is: it's always better to give us properly typed structures instead of having to parse that sort of information; that makes life much easier.

All right, I will skip that part because I feel I'm running a little bit out of time, so let me continue here, which is about snapshotting. I should say a few words about that. Essentially, Debezium is in the business of capturing changes from the transaction log, and this is why people are excited about it. At the same time, there often is a requirement to do some sort of initial backfill of your data. Let's say you take this data into your search index or into your OLAP system — you want to have the complete data set there, right, so you can run your queries on the entirety of your data. The problem is, you typically don't have the transaction logs from the beginning of time on your operational database. Typically, once the database figures out those transaction logs are not needed any longer for transaction recovery and so on, they will be discarded. This means that if our database has been running for a while and we set up Debezium, we won't be able to get all the information just by looking at the transaction logs; instead, we also need to look at the current state of the tables and essentially query them to get this initial view of the data. So that's what this notion of snapshotting is about.

Here I'm just showing you, as pseudo-code really, how this works in the case of the Debezium Postgres connector. What we do there is we create a replication slot — and a replication slot is the means, or let's say the logical entity, for getting changes out of a Postgres database. It has a name, debezium in this case, and we specify this export snapshot option, which essentially, as it says, exports a snapshot ID. Now this means we can start another transaction, as we see in the next line, and there we tell it: okay, this snapshot ID, 3a1-1, should be used as the point in time at which we would like to look at the data. So this means we can take a snapshot at the very point in time — I should say at the very offset in the transaction log — where we created that replication slot. Which means we can now run our select queries for getting the initial data, and this could of course be parallelized, as I said. Then we can commit, or even roll back, that snapshot transaction, and then we continue to read from the replication slot which we created initially. And this allows us to have a seamless transition from doing the snapshot transaction to reading the log.
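In rough SQL terms, the whole flow might look like this sketch — the slot name and snapshot ID are placeholders, and the slot creation itself is a command of the replication protocol rather than ordinary SQL:

    -- 1) create the logical replication slot; this pins a position in the
    --    transaction log and exports a snapshot ID (replication protocol)
    CREATE_REPLICATION_SLOT debezium LOGICAL pgoutput EXPORT_SNAPSHOT;

    -- 2) read the initial data as of exactly that position
    BEGIN;
    SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
    SET TRANSACTION SNAPSHOT '00000003-000001A1-1';  -- placeholder snapshot ID
    SELECT * FROM customers;                          -- one select per captured table
    COMMIT;

    -- 3) afterwards, stream changes from the slot, starting at that same position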
This is how it works for Postgres, and it works similarly for Oracle; for other connectors, for other databases, it's not as easy, as we don't have a means of getting that exact transition point. For instance, for MySQL, what we need to do is take a lock for a very short period of time, during which we prevent any writes to the database, essentially just so we can identify the current offset in the transaction log and start our read transaction; then writes are allowed again. But although this is a very short lock which we need to hold, people just don't like it — people don't like write locks, no matter how short they are. So that's why having something which gives us this ability to start the snapshot transaction at the same point in time from which we will read the transaction log later on is very convenient. Oracle has the same notion with select as of SCN, so that's really cool, and I think it's a very good thing to have in a database.

Now, the thing is, with those snapshots — we had them in Debezium from the beginning, I would say — there are a few shortcomings. For instance, you cannot update what's called the filter list. Typically people would like to capture only some tables; they don't want to capture all the tables, just a subset. The thing is, once a connector has done a snapshot, it's hard to update that filter list of the tables you are interested in capturing, because if you think about it, you now need to take a snapshot of those new tables which have been added to your filter list while you somehow want to keep the transaction log reading running. That gets quite complicated; it never was supported in Debezium. Also, you couldn't pause and resume a snapshot: depending on the size of the data, it can run for quite some time, and if it was interrupted for whatever reason, you had to restart from the beginning. Also, you couldn't start streaming your changes until the snapshot had been completed. So, quite a few shortcomings.

Luckily, we came across a very nice alternative implementation of snapshotting, which I would like to sketch a little bit. This was described by — I hope I get the name right — Andreas Andreakis from Netflix. They built their own CDC implementation; they said they would open source it, but it hasn't happened so far — they mentioned this in 2020 or 2019, and I haven't seen it yet. But what they did, and I'm very thankful to them for that, is they described an innovative snapshotting algorithm in a paper which everybody can read — you have the URL here. The idea is that they interleave the events as they read them from the transaction log with events which they take from snapshot transactions, and I will tell you a little bit about how this works. Again, all the credit goes to those folks at Netflix who did this very useful work; we adapted it a little bit and implemented it in Debezium. So how does it work? The idea is that we run those two activities in parallel: we read from the transaction log, and we also read our
snapshot data. Now, instead of one single large select for an entire table, we do this in chunks. We do it in chunks because we want to portion this work and make it interruptible. We focus on one table here, customers. This needs to be ordered, so that's why I'm ordering it by ID, and I'm getting the snapshot rows in chunks of size 1,000 — that's configurable, of course. So I start with ID 1 and I get the first 1,000 rows; the next chunk query would then start at 1001, or, if some IDs were missing, at a larger offset, and I would get the next 1,000, and so on. This of course requires that our table has a primary key, and it must be orderable, so that we can do that chunked approach for selecting the data.

Now, if you were just doing that, there would be a problem, of course, and the problem is that those two things — getting those snapshot chunks and reading from the transaction log — can collide. What could happen is that we select something from this customers table, and at the same time an update comes in to one of the records which was returned by that select, and we might override that update event from the transaction log with the old state which we selected via the select statement. We need to prevent that, of course, and that's where this cool idea of watermarking comes into the picture. The idea is to tie together those chunk transactions for the snapshotting and the corresponding parts of the transaction log, and the way it works is by inserting records into a specific table — let's call it the signalling table. If we get to a new chunk of our snapshot, let's say the first 1,000 rows, first we insert a low watermark into the signalling table, which then of course will show up in the transaction log; then we run our chunk query, so we get 1,000 customer records; and then we insert another signal record, the high watermark, which essentially demarcates the end of this window. What we achieve by that is that we couple together one snapshot chunk, which gives me 1,000 customers, and a corresponding section of the transaction log. And now, as we have correlated those two things, we can go and resolve those conflicts which I mentioned before.
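Put together, one chunk roughly corresponds to this sequence of statements issued by the connector — the signalling table layout and the values here are simplified for illustration:

    -- open the window: low watermark, which shows up in the transaction log
    INSERT INTO debezium_signal (id, type, data)
    VALUES ('chunk-42-open', 'snapshot-window-open', '{}');

    -- the chunk query itself: ordered by primary key, continuing after the
    -- last key seen in the previous chunk
    SELECT * FROM customers WHERE id > 1000 ORDER BY id LIMIT 1000;

    -- close the window: high watermark
    INSERT INTO debezium_signal (id, type, data)
    VALUES ('chunk-42-close', 'snapshot-window-close', '{}');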
So let's look into one of those chunks; let's say this is what happens. I have my snapshot query here, and it returns me seven rows: key 1, key 2, key 3 and so on. I call those snapshot events read events — that's what the r stands for. At the same time, at the beginning, I inserted that low watermark; then other changes come in, because of course other transactions can be running, so I see an insert for key 8, an update to key 4, a delete of key 1, and so on; and then the high watermark gets captured, which demarcates the end of that window. Now those two things — the updates between the low watermark and the high watermark, and the records which I retrieved from my snapshot chunk query — correlate, or belong together, I should say. And now we need to resolve those conflicts, or essentially remove anything which has been selected in the snapshot chunk, because everything which came in from the transaction log should take precedence. By the way, of course, I need to read the chunk into some sort of buffer before I stream it out. So the way it looks is: key 1 got deleted in the transaction log, so that's why I discard that event from my snapshot chunk buffer. I discard key 2, because it got updated. I keep key 3, because there was no change to key 3 in that window, so I keep that in the snapshot buffer. I discard key 4, because it got updated. Keys 5 and 6 were not changed, so I keep them. And key 7 was also updated, so I discard it. Once I have reached that high watermark, I do the discarding which I described, and then I just emit all the remaining snapshot records — key 3, key 5, key 6.

If you take a step back, what all this achieves is that we get a backfill of all our data by stepping through those chunks while the transaction log is being consumed. And if you think about it — I know it's probably quite dense, but the paper describes it, and we have a blog post which describes it — it gives you some interesting semantics, a bit different from what we did in the classic snapshot. For instance, what is not guaranteed is that you get a snapshot event for each and every record, because this deduplication happens: if there is a chunk transaction running for a particular subset of our data, and in the same window we get a delete event, that takes precedence, so we won't get all the snapshot events, but we will get update events or a delete event instead. What also could happen is that you receive an update event without having seen an insert event before, which is fine, because those events are fully self-contained — they carry the full state. And lastly, you may also receive a read event after an update: let's say I had that state in the transaction log, and now, for whatever reason, I read key 4 in the next snapshot chunk — then I would first have gotten the update from that earlier window, and I would get the key 4 read event from the next window. Essentially, what all that means is that I need to think a little bit differently about those events. If I were to take the data and propagate it into some sink system, for instance, I would just make sure that when I get an event, I either update the state if I already have that key, or I insert it. What is guaranteed is that once this incremental snapshot has finished, I will have a complete set of my data — and that's the key requirement: we want the full set of data in our sink system, and this is what we get.

Now, if you compare that to the original shortcomings, all those things are not a problem anymore. If I want to update my filter list, I can just do it, and then I can take a snapshot of all the newly captured tables by running such an incremental snapshot. I can pause and resume those incremental snapshots, because for each chunk query — or let's say for the latest chunk query — we persist the last primary key which we have seen; we persist it in the connector offsets, and this means that if the connector gets restarted, we know from where to continue, what the next chunk query to execute is, and we can resume the snapshot. And also, by definition of the algorithm, we don't have to wait until the snapshot has been completed before the streaming starts, because those two things happen at the same time. What I should also mention is that all of this requires write access to our data, because we need to insert those low and high watermark marker events — that's something to keep in mind. Also, currently, the way this is triggered is by another insert: the way you talk to Debezium is essentially through the signalling table, and we will capture that command from the transaction log eventually. For MySQL we have a way of doing it with read-only transactions, but I just wanted to bring attention to that.
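To make that "update if the key is already there, insert otherwise" handling in the sink concrete: in a relational sink it essentially boils down to an upsert, for example in Postgres syntax (an illustrative sketch with made-up table and column names, assuming id is the primary key of the sink table):

    -- applying a read (r) or update (u) event for key 4 idempotently
    INSERT INTO customers_replica (id, first_name, email)
    VALUES (4, 'Anne', 'anne@example.com')
    ON CONFLICT (id) DO UPDATE
      SET first_name = EXCLUDED.first_name,
          email      = EXCLUDED.email;

    -- a delete (d) event simply removes the row
    DELETE FROM customers_replica WHERE id = 4;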
And that's pretty much all I had. I know we are a bit over already, but I want to quickly wrap up. I hope I could bring across that CDC is a very powerful tool; there are lots of use cases for it, and it's a really flexible tool in the toolbox. My personal wish would be that it becomes a part of all databases — just as we have select, insert, update and delete today, I would really love to see native, built-in support for this in SQL, in databases, maybe using a stream keyword or whatever. But it shouldn't be an afterthought. I guess you have seen a little bit, you kind of get the feeling, that CDC has always been an afterthought; it wasn't a first-class element of the design goals, I would say, and I would love this to change. And actually, this is changing: if you look at newer databases like YugabyteDB or CockroachDB or ScyllaDB, they have CDC interfaces from the get-go. For them it's not an afterthought, because they understand that having this low-latency capability of pushing changes out to users is super powerful, super flexible, and just very interesting to have. Debezium is an implementation of CDC for a variety of databases, as we have seen, and open source, as I mentioned. And lastly, I'm just going to mention this keyword: what also would be super interesting to see is incrementally updated materialized views — I believe you had somebody from Materialize here; if not, you should definitely invite them — and there are other projects in that space, which give you incremental updates not only to your raw data but also to materialized views, which of course takes it to the next level: then we get incremental insight into complex query results. I believe that's a very interesting topic where I would love to see more development.

And lastly, a few resources. If you would like to know more about Debezium, there's our website, debezium.io. This incremental snapshotting algorithm — I believe I did not do it full justice with this short description — you can read about it more either in the original Netflix paper or in the blog post which we have here. There's this article about the outbox pattern. And lastly, which I would really recommend, we have a large demo repository where you can find all kinds of examples and demos of using Debezium for different purposes — so definitely check out this demo repository and take a look at what's there. And with that, Andy — I'm not sure whether we still have time to take questions; I'm happy to take them, but that's all I had in any case.

Why don't we take one question from the audience?

Sure.

Okay, cool. We had a question — oh sorry, there's the question. Kaby, go for it.

Am I on?
Yes, you're on — go for it.

So I'm going to make this one two parts, because I understood we only get one question. One thing is that when the aggregate functions are not algebraic, then when anything in the group changes, we have to have access to all the rows of that group to recompute it, right? It's not incremental. So that is more than what the replication stream can give me, because I have to access the original rows.

Yes, that's a very good point. You would either have to keep that state somehow in your aggregation state, or you would have to go back. I believe for Materialize they keep everything in memory, currently.

Yeah, because some of these groups are very large, actually, if you group on a column with very low cardinality.

Absolutely, yeah.

So that's one. The other one is that when we do the replication and the target is also OLTP, and we have referential integrity there, we have to be very careful about the timing between the replication of the tables involved in such constraints — any kind of inter-table constraints.

Yes, that's right.

Because the child row may arrive before its parent, right?

Yes, that's perfectly right. So, you know, some people — typically, let's say, if you think about a data warehouse or an OLAP system, you would have some sort of data landing area where you keep the raw events, right, and you resolve referential integrity later on. What you also can do is some kind of stream processing, like Flink or Kafka Streams, which would give you some windowed notion of joining, which helps in that case. But yes, you're totally right: if you consume events one by one, you would not, in particular, have ordering guarantees between the events of different tables. That's definitely right.

Yeah, so that's a major problem with this, and we have to handle it correctly when we replicate between OLTP systems, which at IBM, of course, we have to do. That's a tricky thing to do.

Absolutely, yeah. Debezium does not do that at the moment, that's true, because we really focus on the source side of the story. We may evolve into that space, and then we would try to reason about it, absolutely.

Okay, thank you.

Thanks — good question, for sure.

I'm not talking about the constraints part, I'm just picking up on what Hamid said. Debezium, as far as I know, from my perspective, is the most popular CDC framework, but as I said, there's GoldenGate, there are others, and I don't know what the IBM one is called — I'm sure you guys have one. Is there one feature that these other commercial, proprietary CDC frameworks have that Debezium doesn't have, and that you wish you had?

So, are you asking me?

Yeah, I guess so.

I generally try not to focus too much on the competition, but yeah — let's say, for sure, having a native Oracle log mining implementation, that would be very interesting to have, absolutely.