This talk is about dimensional data modeling in a modern data stack; specifically, it's about being change-aware when you're doing data modeling. A bit about me: I currently serve as a senior data engineer at Graph. I'm a technical speaker and an occasional writer on data engineering topics, so you may have seen me around on some blog posts or at PyData talks. And I am 90% self-taught; the other 10% comes from my engineering degree, which actually comes in very useful here. Before we begin, some disclaimers. Everything in this talk is based on my own technical experience; it has nothing to do with my team or my employer. It's my personal experience, based on hard-won battles. And the scope is primarily focused on structured, tabular data. Data versioning covers both structured and unstructured data, and the concepts are similar, but the focus of this talk is on structured, tabular data.

Let's imagine this scenario in your data team. The finance department requests a regeneration of a rebate report dated 11 months ago for client A, because there was an error in the original run, so you need to regenerate the report. But client A was upgraded from a tier-two customer to a tier-one customer three months ago, and when you're promoted to a higher tier, your rebate rate increases. The problem is that you need to regenerate the report based on the tier-two rate, because client A was a tier-two customer at the time; you can't generate it based on the fact that he's a tier-one customer now, because he wasn't back then. This is what the data looks like in this scenario: at some point in time, client A was a tier-two customer, but by the time you're regenerating the report at the end of the year, he's already a tier-one customer. So what happens if we regenerate the rebate report from 11 months ago at the end of the year, now that the customer tier has been updated and overwritten? What do you do in this instance? You can't get the old data back.

This brings me to the concept of time in a dimensional data model. Time is an important dimension in your data, because data is usually not static: you can have state transitions during a business process, and attributes can also change over time, such as your age, your income, your status, and, in the previous case, even your customer tier. So don't assume that your data is static. And this brings us to the importance of data versioning. What is data versioning? Data versioning is the concept of capturing state changes while keeping track of successive versions of the data over time. What I mean is that for each data record you create, you create a unique reference for that version of the data, and when changes occur, you retain the previous versions. There are two general data versioning approaches. One is more for structured data, which is what is known as change data capture. There's also an extension of the change data capture approach, applied to unstructured data, which is data version control for unstructured data. For those who are not very familiar with data versioning, this is roughly what change data capture and data version control look like.
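To make the earlier scenario concrete, here is a minimal sketch of what a versioned customer dimension could look like. The table and column names (dim_customer, valid_from, valid_to, is_current) and the dates are my own illustrative choices, not from the actual slides:

```sql
-- Hypothetical versioned customer dimension: each change to a tracked
-- attribute produces a new row instead of overwriting the old one.
CREATE TABLE dim_customer (
    customer_sk   INTEGER,    -- surrogate key, unique per version
    customer_id   VARCHAR,    -- natural (business) key
    customer_tier VARCHAR,    -- the slowly changing attribute
    valid_from    TIMESTAMP,  -- when this version became effective
    valid_to      TIMESTAMP,  -- when it was superseded (NULL = still valid)
    is_current    BOOLEAN
);

-- Client A's tier upgrade, captured as two versions rather than an update:
INSERT INTO dim_customer VALUES
    (1, 'client_a', 'tier_2', TIMESTAMP '2023-01-01 00:00:00', TIMESTAMP '2023-09-30 00:00:00', FALSE),
    (2, 'client_a', 'tier_1', TIMESTAMP '2023-09-30 00:00:00', NULL, TRUE);
```

With both rows retained, the rebate report from 11 months ago can still be joined against the tier-two version of client A.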
Change data capture is what we're quite familiar with: the traditional approach of having timestamps and valid-from/valid-to columns. Data version control is more like version control, but for data; examples include table formats like Apache Iceberg, or tools like DVC, and so on. Those are effectively extensions of change data capture.

So why does data versioning matter? One of the main reasons is reproducibility, for data governance and audit purposes. Say at the end of the year we need to regenerate those reports: we need to be able to capture all the changes to our dimensions over time, because we need to be able to report to whatever authority it is, maybe your data governance team or your audit team, and you need an audit trail of the changes in your data. Second, it lets you build data history with backward and forward compatibility. This one is more for the data team: what if there's a change to transformation logic that only applies to a specific time range? Say marketing tells us there's a promotion for a specific time period, so we need to change our computation logic for rewards, but only for that time range. You need to be able to capture this type of change in logic. And the last reason is a little similar to the first point, but it's about using point-in-time values to track business metrics over time. This is more for the case where a business user needs a good profile of how, say, a customer's profile changes over time; they need a good record of that to monitor their business metrics.

Okay, so we've talked a bit about data versioning. Now, what is change data capture? Change data capture is simply data versioning for databases: design patterns for capturing and tracking changes in your data from your upstream source systems over time. In this case, the changes are captured in either your data warehouse or your data lake; honestly, the differences between the two are not very clear anymore, since data lakes are gaining warehouse capabilities and vice versa, so let's treat them together. One design pattern for change data capture implements data versioning based on a combination of version identifiers, timestamps, and status indicators, meaning whether the record is valid or not. Another very common pattern is tuple versioning, also known as type 2 slowly changing dimensions, which I will elaborate on later. And last but not least, there are transaction logs; transaction logs are specific to the database system, so let's not go into those.

Now that I've covered why data versioning matters and what change data capture is, let's move on to the context of this talk: what do we mean by the modern data stack? It sounds like a fancy term, but simply put, the modern data stack is cloud-based, built around a cloud data warehouse or data lake, and it's modular and customizable. Instead of the traditional approach of a single-vendor, one-size-fits-all platform, we choose the best tool for each job.
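To illustrate the reproducibility point, here is a hedged sketch of a point-in-time lookup against the versioned dimension from earlier; the fact table fct_rebates and the report date are assumptions for the example:

```sql
-- Regenerate the rebate report as of a past date by joining each fact
-- to the dimension version that was valid at that point in time.
SELECT
    f.customer_id,
    f.rebate_amount,
    d.customer_tier                      -- the tier as of the report date
FROM fct_rebates AS f
JOIN dim_customer AS d
  ON d.customer_id = f.customer_id
 AND d.valid_from <= DATE '2023-02-01'
 AND (d.valid_to > DATE '2023-02-01' OR d.valid_to IS NULL)
WHERE f.report_date = DATE '2023-02-01';
```

Because the tier-two row was never overwritten, this returns client A at tier two, matching the original report.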
If we look at the components of a modern data stack, you have your ingestion step, done by a data loader, which loads data from upstream source systems into your data warehouse. Then you go through data transformation, say through a staging layer and then bronze, silver, and gold layers, which is also what we call the medallion architecture. After the data is processed into the gold layer, it's consumed by downstream users, whether for machine learning or for visualization. And all of this is orchestrated by a workflow orchestration tool such as Airflow.

So what about data warehousing in the modern data stack? Because we're using cloud-based compute and storage, compute and storage are now far more scalable than in traditional data warehouses. Instead of the traditional ETL approach, we're moving towards an ELT approach, where the transformation is done within the data warehouse. And last but not least, the most important implication of this more scalable compute and storage is that it's now possible to store snapshots of data in a cloud data warehouse to capture all those historical changes. This is something that may not have been feasible in a traditional data warehouse, where compute and storage were the blockers.

Now that we've covered the difference between the modern data stack and the traditional data warehousing approach, we get to change data capture in the modern data stack. What implementations do we have for change data capture? We have the traditional Kimball dimensional modeling techniques, built mainly on the concept of slowly changing dimensions. And we now also have more modern approaches, such as using data snapshots and leveraging incremental models, which are a bit more functional in style. Let's go through all of these techniques.

First, traditional Kimball dimensional data modeling. For historical context, it was developed by Kimball in 1996 and has been kept reasonably up to date, with the latest update made in 2013, during the emergence of cloud data warehousing, when BigQuery and Redshift appeared. It's important because Kimball introduced the concept of facts versus dimensions, and in a time of limited compute and limited storage, this data modeling approach was designed for storage and compute efficiency.

Let's go into the fundamental concepts of facts versus dimensions, starting with fact tables. Fact tables contain metrics and facts about a business process; that could be the processing time or the transaction amount of the business process. One of the defining characteristics of fact tables is that they typically evolve quickly during a business process event, but eventually reach a final state at a point in time, upon completion. You know when the process will end; there is a definite end point. Dimension tables, however, describe the attributes of a business process, such as your customer details.
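Here's a minimal sketch of that split; the table names fct_orders and dim_customer are my own illustrative choices, and the versioning columns from earlier are omitted here for clarity:

```sql
-- Fact table: one row per business process event; it evolves quickly
-- but reaches a final state on completion (e.g. a fulfilled order).
CREATE TABLE fct_orders (
    order_id         VARCHAR,
    customer_sk      INTEGER,        -- foreign key into the dimension
    order_amount     DECIMAL(12, 2), -- a metric of the business process
    processing_hours INTEGER,        -- another metric
    order_date       DATE
);

-- Dimension table: descriptive attributes that change slowly and
-- unpredictably (tier upgrades, marital status, ...), with no final state.
CREATE TABLE dim_customer (
    customer_sk    INTEGER,
    customer_id    VARCHAR,
    customer_name  VARCHAR,
    customer_tier  VARCHAR,
    marital_status VARCHAR
);
```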
And dimension tables are typically slow-changing, updated over a longer period of time. Maybe you know that your customer is going to be age 30 this year and age 31 next year, but you don't know whether your customer is going to upgrade from tier two to tier one, or get married, or get divorced. You can't really predict when a dimension will change, and dimensions typically don't have a final state per se, compared with a fact table. This brings us to the problem of slowly changing dimensions: how do we capture changes that are slow and unpredictable?

So let's talk about the types of slowly changing dimensions. Slowly changing dimensions (SCDs) are change-tracking techniques for handling state changes in dimensions. There are type 0, type 1, type 2, type 3, type 4, type 6, many types, but let's just go through the important ones. Type 0 is very simple: it's a fixed dimension, something like an account opening date. Once you create it, you're not going to change it; any changes are ignored, because it assumes this attribute will never change. Type 1, on the other hand, reflects only the latest version of a dimension attribute: when a new record arrives and you see that it relates to an existing record, the previous version of the value is overwritten with the new value. Great, we've applied the change, but we've also destroyed history. What if it was an erroneous update? It's no longer possible to rewind to the previous version of the data; it's lost. Type 2, which is the focus of this talk, implements row versioning for dimension attributes. Each version of a record has a validity period: a row effective date, a row expiration date, and sometimes a current row indicator. When a change is detected in a data record, instead of immediately overwriting the record, you create a new dimension row with the updated attribute values and a new primary surrogate key, and the previous version of the record is updated with the row expiration timestamp. So instead of overwriting and destroying history, you are appending to history.

Just now I mentioned tuple versioning. What is tuple versioning? Tuple versioning is a change data capture mechanism that records changes to a mutable upstream table over time, implementing type 2 SCDs on mutable table sources. Typically it detects changes based on an updated-at timestamp; the column naming may vary, but the general idea is there. The components you need: where you're going to store the tuple-versioned history, which columns to track, what the unique identifier is, and whether to invalidate records that are no longer in the source, because sometimes records get deleted. Here's an illustration of how tuple versioning works: we track certain timestamps, and we use them to determine the validity period of each record.
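A minimal sketch of those type 2 mechanics in Postgres-style SQL, assuming the dim_customer layout from earlier and a hypothetical staging table stg_customer holding the latest source extract; in practice you'd often use a MERGE statement or a tool that generates this for you:

```sql
-- Step 1: expire the current row for any customer whose tracked
-- attribute changed in the latest extract (row expiration timestamp).
UPDATE dim_customer AS d
SET valid_to   = s.updated_at,
    is_current = FALSE
FROM stg_customer AS s
WHERE d.customer_id = s.customer_id
  AND d.is_current
  AND d.customer_tier <> s.customer_tier;

-- Step 2: insert a fresh version row (new surrogate key, open-ended
-- validity) for every customer that now has no current row, i.e.
-- the changed customers expired above plus brand-new customers.
INSERT INTO dim_customer
    (customer_sk, customer_id, customer_tier, valid_from, valid_to, is_current)
SELECT
    nextval('dim_customer_sk_seq'),  -- hypothetical surrogate key sequence
    s.customer_id,
    s.customer_tier,
    s.updated_at,
    NULL,
    TRUE
FROM stg_customer AS s
WHERE NOT EXISTS (
    SELECT 1
    FROM dim_customer AS d
    WHERE d.customer_id = s.customer_id
      AND d.is_current
);
```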
So, this is the initial state, and when you see that there's an update in the customer tier from tier two to tier one at a particular point in time, instead of overwriting it, you capture the change: we've updated the validity period of the previous record and added a new one. There are also type 3, type 4, and so on, but we're not going to go that deep into those, because they're less commonly used and a bit more complex; you can read up on them on your own.

Now let's get into the more modern topics, starting with data snapshots. Data snapshots are read-only, immutable copies of the state of a data source at a particular point in time, and usually you store those snapshots in the staging area of the data warehouse: you ingest from the source and land the snapshot in the staging area for further processing. You can think of it as taking timestamped images of your data sources, like taking photos of them at certain points in time. Instead of directly creating your SCD table, you create data snapshots first, and from those snapshots you can then build your SCD tables or whatever downstream data models you like. That way, in case you mess up your SCDs, you still have the data snapshots to fall back on.

To complete this discussion, we have incremental models. Because we need to capture changes in upstream data in our downstream models, the idea of an incremental model is that you limit the data transformation to a specified subset of the source data. Typically you want to capture only the data that has been newly created or updated since the last run, because you don't want to do extra work. And not only do you avoid extra work, you also significantly optimize the run time of transformations over large data. That's why you may want to use an incremental approach instead of a full refresh. It takes a bit of work: you need to determine what the incremental load should be, which is typically whatever has changed since your last scheduled run; you need to get the delta. In, say, your dbt models, you'll typically define how the increment is inserted, where it comes from, and, before anything else, the condition for the incremental load, because otherwise you're going to load the whole thing. Those are the things you need to think through when defining an incremental model. If we're looking at an SCD table, it's pretty straightforward: for an incremental load you just select the most current records and load them in. Some other things to think about when designing an incremental model: does the target table already exist? If it doesn't, you have to do a full load. Is something wrong with your transformation, so you want to refresh the whole thing? Or is everything fine, so you do an incremental run? You need to handle those cases; the dbt sketches below show how both snapshots and incremental models look in practice.
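In dbt, the data snapshot idea from a moment ago maps directly onto dbt snapshots. Here's a minimal sketch, assuming a raw.customers source with an updated_at column (both names are mine):

```sql
-- snapshots/customers_snapshot.sql: dbt maintains a type 2 history
-- table from successive runs against the mutable source.
{% snapshot customers_snapshot %}

{{
    config(
        target_schema='snapshots',
        unique_key='customer_id',
        strategy='timestamp',
        updated_at='updated_at'
    )
}}

select * from {{ source('raw', 'customers') }}

{% endsnapshot %}
```

dbt adds dbt_valid_from and dbt_valid_to columns to the result, which is exactly the tuple-versioning validity period described earlier.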
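And here's a minimal sketch of a dbt incremental model; the is_incremental() guard handles exactly the cases just mentioned, since it's false on the first run and on a full refresh. The source and column names are illustrative:

```sql
-- models/stg_orders.sql: only transform rows changed since the last run.
{{
    config(
        materialized='incremental',
        unique_key='order_id'
    )
}}

select
    order_id,
    customer_id,
    order_amount,
    updated_at
from {{ source('raw', 'orders') }}

{% if is_incremental() %}
  -- the delta condition: rows updated since this model's last run
  where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
```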
And then, what's the incremental strategy? That depends on whether you're on a data warehouse or a data lake. And you need some columns to track changes by.

So, is the Kimball approach still relevant, especially in the modern data stack? It turns out it still applies. Even though storage and compute are dirt cheap now, that stops holding for very large dimensions, where storage and compute are not going to be cheap. And cheap resources don't remove the importance of dimensional data modeling; if you want to find out more, there are blog posts and talks debating exactly this question. It doesn't mean we should throw everything from Kimball away: in some cases you will genuinely need dimensional data modeling. One case is when you need to aggregate fact tables; yes, you need to put some work into how you model your data. The second is metrics drill-down based on dimensions, which also requires some thought about how you model your data. And notably, Kimball is still used in financial reporting and audit, because that's where it comes in very useful: you need an audit trail of your data over time, and that is a hard requirement. In that case you can't say, "I don't want to do Kimball." You have to do Kimball.

To round up the talk, let me share some tips and tricks for ensuring that your data modeling is change-aware. First: snapshot all your upstream source data. You're in the modern data stack now; your compute and storage are scalable. Why snapshot everything? First, assumptions about your data can change. Say your upstream team tells you the data is append-only, but then rows turn out to get overwritten: if you built your SCD table directly on that assumption, I'm not sure how you're going to rewind those changes, so save your upstream source data as snapshots first. Secondly, schemas change: columns get renamed, columns get dropped, columns get added, and you still need a photo of the state of your upstream source. And last but not least, business logic changes: you can be told to apply some business logic retroactively, or not to apply it retroactively, and you need those snapshot copies to re-run the modified logic over history. As I mentioned previously, store your data snapshots in the staging area and build SCDs from there, instead of having your SCDs read directly from the upstream source; that way, if you mess up your SCDs, you still have the data snapshots to rely on. And speaking of schema changes, it's useful to detect schema drift between your upstream source and your data warehouse; something like the source-target schema reconciliation sketched below would help you detect changes. Then, use type 2 SCDs and incremental models. Type 2 SCDs are typically sufficient for tracking and capturing data changes; you don't need to go to SCD type 3 or type 4 except in special cases. And you may ask why you should bother with incremental models when they're so complex. But when your data grows 10 times, 20 times, 100 times, thinking about it incrementally will pay off in efficiency and cost.
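A hedged sketch of what such a source-target schema reconciliation could look like, comparing column metadata through information_schema; in practice the source and the warehouse usually aren't queryable from one place, so you'd compare extracts of the two metadata tables instead:

```sql
-- Compare column names and types between the source table and its
-- warehouse target; any rows returned indicate schema drift.
WITH source_cols AS (
    SELECT column_name, data_type
    FROM source_db.information_schema.columns   -- assumed reachable
    WHERE table_name = 'customers'
),
target_cols AS (
    SELECT column_name, data_type
    FROM warehouse.information_schema.columns   -- assumed reachable
    WHERE table_name = 'stg_customers'
)
SELECT
    COALESCE(s.column_name, t.column_name) AS column_name,
    s.data_type AS source_type,   -- NULL means the column exists only in target
    t.data_type AS target_type    -- NULL means the column is not yet in target
FROM source_cols AS s
FULL OUTER JOIN target_cols AS t
  ON s.column_name = t.column_name
WHERE s.column_name IS NULL
   OR t.column_name IS NULL
   OR s.data_type <> t.data_type;
```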
Unless you're a startup and you want to deliver fast, in which case you can go ahead and trade cost for speed: if you think of your data models in a more long-term manner, it pays off to develop them incrementally. To round out the talk, some strategies for designing your incremental models: design with your upstream data update mechanism in mind, and think about your incremental strategy, because it depends on whether you're using a data warehouse or a data lake. And a performance tip: filter rows early, before you do any computation, because if you do the computation first and then filter, it's going to be pretty expensive. So the key takeaways would be to adopt change-aware approaches: data snapshots, SCD-like approaches, and incremental models. That's about it. Thank you. You can get the slides via this QR code. Thank you.