Hello everyone, I'm Veena, and today I'm going to talk about building CI/CD for data: how you can use an existing set of open source tools to build a CI/CD pipeline for your data lake. Before we jump into the talk, a brief intro on who I am and what my background is. I started out as a software engineer, then moved into the data and ML engineering space, and currently I work as a developer advocate for lakeFS. lakeFS is an open source data versioning engine that offers a Git-like interface for your data lake; more on that later, of course. Let's dive right in.

I was a data engineer before, and this is a pretty accurate picture of what I expected my job to be versus what it turned out to be. Most of the data pipelines we write today do not behave in a predictable manner, or in the way we coded them. There are a lot of changes, on the infrastructure side or the business side, that can end up breaking your pipelines. What are those changes? It could be anything from an EMR or Spark version upgrade to a larger infrastructure change, say migrating from Snowflake to BigQuery. Or it comes from the business side: your marketing team updates the definition of a KPI they've been tracking, and now you suddenly have to rerun your entire pipeline to compute the new metric, or backfill it for the whole historic period. And then there's troubleshooting failed Spark jobs, nobody's favorite, but still part of the responsibilities of a data engineering team. And non-idempotent pipelines, of course: when a job run fails, we want pipelines to be idempotent so that you just click rerun and it reruns end to end, but that's rarely how it works out. Do I even need to mention the legacy DAGs none of us wants to touch, but at some point we have to change just to keep them running? The changes that can break your data pipeline go beyond these; this is only a limited list from my own experience.

So we have this host of changes that can break a data pipeline. What can we do about it? How can we make our pipelines robust? Pretty standard: when we think about making a software application robust, we think about testing it exhaustively. But for some reason, in the data world, we thought we could just run our pipelines without testing them enough. I have been guilty in the past of working directly with production data and not really testing the pipelines before pushing data into production. So we borrow the standard testing approaches from the software engineering world: unit testing, black-box testing, and integration testing. Integration testing matters even more here, because a data pipeline reads from multiple data sources, performs complex multi-step transformations, and on the other end has multiple consumers, sometimes reading concurrently from the transformed data.
Because of this complex ETL setup, we need to do integration or end-to-end testing of our pipelines, not just basic unit or black-box testing. But that's easier said than done, because testing in the data pipeline world is data heavy: you need production-like data to test your pipelines. And where do we get production-like data? There's a spectrum of what we do today. Some of us mock up sample data and use that to test our pipelines. It won't reflect the full scale, variation, and volume of production data, but it helps to some extent in finding loopholes or bugs. Some of us go one level further and copy part of the production data into a test or staging environment. But when you copy data, whether it's part of production or all of it, you suddenly have N copies of data lying around instead of one. Now every data protection practice and every bit of PII enforcement you apply in production needs to be applied to all the other environments as well. This makes ETL testing a lot more complex than software testing.

So when we want to test data pipelines, the idea is to take the best practices that are already solved and established in the software engineering world and extend them to data engineering. The standard setup is the build, test, and deploy phases, and at each phase we test our assets and make sure only what passes certain tests is promoted to the next stage. In the software world, in the build phase we keep all the source code under version control; in the test phase we create sandbox environments on demand to test our applications; and in the deploy phase we do code reviews so that only high-quality code gets into production. Similarly, on the data side, we want to version all of the data assets together in the build phase, and then create isolated data environments for testing. By that I don't mean copying production data into multiple test or dev environments, but the ability to create isolated environments that are not full copies of production data. And in the deploy phase, we want a CI suite with defined data quality tests, so that only data that passes those tests is promoted to production. And as always, even with all these checks and validations in place, you might still run into production errors, so we also need a way to automatically roll back to a consistent state of the data. These are roughly the top five best practices we take from the software engineering world and apply to the data pipeline side. How do we do this? Let's take it one phase at a time.
In the build phase, we want to place all our data assets under version control. And when I say data assets, I'm not talking about just file-level versioning, or even table-level versioning. Today we have open table formats, Apache Iceberg, Apache Hudi, or Delta Lake, that give you table-level versioning, but we want the entire data lake, the entire data repository, to be versioned; that's how you bring all your data assets under one version control system. How exactly can you do this? This is where lakeFS comes in. As I introduced earlier, it's an open source data versioning engine with a Git-like API. lakeFS sits as a metadata or versioning layer on top of your object store, whichever that is: S3, MinIO, Azure Blob Storage, GCS. On top of it, lakeFS provides Git-like APIs: create branch, commit, merge, even revert, exactly like Git. So all the operations you were doing on your source code with Git, you can do on your data with lakeFS. And if you have a host of applications already reading from your object store, they can now read through lakeFS with versioning enabled as well.

If you look at the right side of the slide, this is what it takes: with minimal changes to your existing code, you get data versioning. Look at the S3 path. If you already have your data repository under a specific path, then with lakeFS sitting between your applications and your object store, all you're adding is one extra prefix in the path, the name of the branch, which tells you which branch a specific file belongs to.

Earlier I talked about creating on-demand data environments without copying the data, right? You do that simply with a branch create. When you create a branch from your data repository, you get an isolated data environment exactly when you need it. How exactly are these branches created? When I create a new branch, am I copying all of the production data into a new staging or dev environment? Not exactly. Let's do a quick overview of how lakeFS works under the hood. lakeFS commits are just collections of pointers: you have the storage objects underneath, you hash them, and each commit references those hashes. So when you create a new branch, you're only copying those pointers, the hashes of the underlying objects. If you have a petabyte-scale data lake and you create a new branch, you're not copying petabytes of data into another bucket; you're only copying pointers to it. Of course, the moment you change the underlying data, say you delete an object and commit, the object reference is removed from that commit onward. And if you write something new, lakeFS does copy-on-write, so it's available for future commits. By sharing these pointers, branches don't copy data from one to another.
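To make the branch-as-prefix idea concrete, here is a minimal sketch of reading the same dataset from two branches through the lakeFS S3 gateway using boto3. The endpoint, credentials, repository name, and object paths are all hypothetical stand-ins.

```python
import boto3

# lakeFS exposes an S3-compatible gateway: the repository becomes the "bucket",
# and the branch name becomes the first component of the object key.
s3 = boto3.client(
    "s3",
    endpoint_url="https://lakefs.example.com",  # hypothetical lakeFS endpoint
    aws_access_key_id="AKIA-PLACEHOLDER",       # lakeFS access key (placeholder)
    aws_secret_access_key="PLACEHOLDER",        # lakeFS secret key (placeholder)
)

# Same dataset, two branches: only the branch prefix in the key changes.
prod = s3.get_object(Bucket="analytics", Key="main/events/2024-01/part-0000.parquet")
test = s3.get_object(Bucket="analytics", Key="test-env/events/2024-01/part-0000.parquet")
```

Until one of the branches diverges, both keys resolve to the same underlying object via shared pointers, so the second environment costs essentially no extra storage.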
All the branches refer to the same pointers underneath. So if you want a new test data environment, all you need to do is create a branch. Even on a petabyte-scale data lake, because this is a metadata-only operation, a pointer-only operation, it takes just a few milliseconds. Copying the production data instead would take hours at best; you'd take a coffee break, come back, and only then could you start working. And in times like these, when we're trying to optimize costs everywhere, the storage and performance savings become drastic the bigger your data lake gets.

So now we know how to version our data assets with a versioning engine. For the testing stage, what do we need? We want to create isolated data environments on demand, and isolated data environments are nothing but lakeFS branches: every branch is an isolated environment where you can run whatever experiments or tests you want. If you're a data engineer, you might create a new branch every time new data gets ingested. Instead of writing into main directly, you write into an ingestion branch, call it a staging branch. All new data goes into staging, you run whatever suite of quality tests you want, and only if they pass do you merge into production, which here is the main branch. And just like Git, lakeFS has branch protection rules, so you can define who can access main or merge into it directly. That gives you much more control over the quality of data being promoted to production; not just anybody can tinker with production data directly.

The experimentation side is usually for ML use cases. If you're a data scientist or ML engineer running multiple experiments to find out which algorithm gives you the best accuracy, you can branch off the training data sitting in production, create a new branch for every experiment, and run your ML experiments on that branch. At the end of the day, whichever branch, whichever model gives the highest accuracy, you push that model to production. This way you make sure only the winning model is deployed. These are just a couple of use cases for isolated data environments with lakeFS branches; for whatever other use cases you may be thinking of, you can leverage these Git-like APIs as well.

One step further: you have all these tests running, so how do you make sure the merge happens only when the tests succeed? lakeFS has a feature called lakeFS hooks, similar to Git hooks. You define rules, the conditions under which you want these operations to succeed. You can have a pre-merge hook that runs your constraints, and only if they pass does the merge go through; if not, the merge fails and production stays at its previous commit, avoiding low-quality data getting ingested into production. We'll dive a bit deeper into hooks later in the talk.
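As a sketch of the ingestion-branch workflow described above, using the high-level lakeFS Python SDK: the repository, branch, and file names here are made up, and the method names follow the SDK as I understand it, so check them against your SDK version.

```python
import lakefs  # high-level lakeFS Python SDK

repo = lakefs.repository("analytics")  # hypothetical repository name

# 1. Land new data on an ingestion branch instead of main.
#    Creating the branch only copies metadata pointers, so it's near-instant.
staging = repo.branch("ingest-2024-01-15").create(source_reference="main")

with open("events.parquet", "rb") as f:
    staging.object("raw/events/2024-01-15.parquet").upload(data=f.read())
staging.commit(message="Ingest events for 2024-01-15")

# 2. Run your quality checks against the isolated branch here...

# 3. Promote to production only after the checks pass.
staging.merge_into(repo.branch("main"))
```

Because the branch is isolated, a failed quality check simply means you never merge; main never sees the bad data.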
So we've covered the testing part, creating isolated data environments with lakeFS branches. Now let's move on to the deploy phase. The deploy phase, like I said, is enabled by lakeFS hooks: when you deploy, you want to run CI tests first, and only then continuously deploy the promoted data to production. Here lakeFS hooks can call out to a webhook, for example a Python web server you run. A hook, at the end of the day, is declared in a YAML file, and you can define whatever constraints you want for each branch. You can say the main branch can only have Parquet files, not CSV or JSON, or that only a certain person can merge into main; all those rules go in the YAML. And it's not just that: you can also define a set of tests to run. You can write custom suites, or integrate with Great Expectations, ODAI, or any other data quality tool you may be working with. lakeFS hooks are just a framework for defining your own conditions and tests, so if you have an existing suite of tests you want to migrate to lakeFS hooks, you can do that as well.

Here's an example of the YAML file you'd put together for a hook. I want to make sure only Parquet files land in production, so I define a pre-merge hook on the main branch. What am I actually testing? I want the format validator to run, and I have my own Flask web server running at a specific URL. You can even scope these restrictions to a specific prefix rather than the entire data lake. As you can see on the right side, with these rules defined, the pre-merge hook runs before every merge, and only data that follows the rules is allowed into production.

We've already touched on rollbacks. Most of the time, when there's an issue in production data, you have on-calls in place, and you need to troubleshoot and debug what happened: why is there a spike in some number on a dashboard, and so on. With data versioning enabled, the first thing you can do is what we always do in Git: revert back to the previous commit, and you have a consistent state of data. Your internal or external data consumers can keep consuming the data without interruption while you figure out what's going on in the production data. And it's just a one-line command, without copying whole datasets around or maintaining two versions of production, one serving your customers and one where you troubleshoot and debug.
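Going back to that hook example for a moment, here's roughly what such a YAML definition looks like, following the shape of the webhook examples in the lakeFS documentation; the URL and prefix are placeholders. Files like this live on the branch itself, under the _lakefs_actions/ prefix.

```yaml
name: ParquetOnlyInProduction
description: Allow only Parquet files to be merged into main
on:
  pre-merge:
    branches:
      - main
hooks:
  - id: production_format_validator
    type: webhook
    description: Validate file formats via our own Flask server
    properties:
      url: "http://<host:port>/webhooks/format"  # placeholder webhook URL
      query_params:
        allow: ["parquet"]
        prefix: analytics/  # optional: scope the rule to a prefix
```

And the one-line rollback mentioned above is along the lines of `lakectl branch revert lakefs://analytics/main <commit_id>`, which moves main back to a known-good commit while you debug.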
A quick recap of how your data lake looks with lakeFS. First, you have a time travel option: you can easily revert corrupted data, no matter which environment it's in; all you need is a simple lakectl revert, one line, and you're back to a consistent state of data. Second, you can safely test your data pipelines by creating an isolated environment from production. This way you have production-like data, in fact actual production data, but in an isolated environment, so you're not risking anything and not affecting the consumers of your production data; it's completely isolated for you to run your own tests. And third, on these production-identical branches you can also run your own set of tests to keep data quality in check, because data quality is one of the top challenges for every data team today, and you can use lakeFS hooks to enforce it.

That's all I had. Like I said, lakeFS is an open source project with a thriving community of users, contributors, and organizations running lakeFS. If you're interested, feel free to sign up on the Slack, and if you have any other questions, I'm here to take them.

Okay, so one thing I didn't cover: lakeFS also has garbage collection, meaning you can set a retention policy. Suppose you have data in your main branch, you delete it, commit, and move on. If the deletion was a mistake, you want to be able to access the data again when you revert. So is every delete a hard delete or a soft delete? You get to choose: if your retention is 30 days, within that window it's a soft delete, and beyond 30 days it's gone forever. And you do need hard deletes, because with PII and GDPR you really have to delete the data, not just soft delete it. So yes, you can set the retention as well. Thank you.
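For reference, a retention policy of the kind described is, as far as I know, defined as a JSON rules file; the repository name and day counts here are invented.

```json
{
  "default_retention_days": 30,
  "branches": [
    { "branch_id": "main", "retention_days": 30 },
    { "branch_id": "dev", "retention_days": 7 }
  ]
}
```

This would be applied with something like `lakectl gc set-config lakefs://analytics --from-file gc_rules.json`; objects deleted longer ago than their branch's retention window are then hard-deleted on the next garbage collection run.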