Thank you. I'm excited to present one of the Linux Foundation projects, Delta Lake. I'm a developer advocate at Databricks, where I help developers and data practitioners build on open source technologies like Delta Lake, Spark, and MLflow, and help implement data and AI use cases. I gave a keynote this morning where I talked about the lakehouse and how it emerged as the new modern data architecture: it adds reliability, performance, and quality features on top of existing data lakes so that you can make sense of your data and use it for critical decision making. So thank you to everyone who attended. One of the most important parts of that story is Delta Lake, which is the foundation of the lakehouse. Before diving into the project, let's look at why we care. One work morning I opened my email, which was surprisingly flooded with messages, because our system had errors: our data centers had issues running the applications. The first thing that came to mind was to look at what had happened in the previous 24 hours. Because I had access to the data center logs and the system data, I was able to pull all that information, and we saved a huge amount of time on resolution. Does that resonate with you? This is just one of the problems that data allows us to solve, and there are many other use cases where data can be very useful for critical decision making. Organizations today have a lot of data, whether it's customer data or web data coming from sensors and IoT devices. Because those growing volumes of data require scalable storage, they have already adopted a system called the data lake. The promise of a data lake is that you can take all your data, whether structured or unstructured, and dump it into a file system or object store like S3, Google Cloud Storage, or Azure Blob Storage. This is a really powerful concept when you compare it to traditional databases, because in a traditional database you have to come up with a schema and do a lot of pre-processing and cleaning. A data lake lets you forego that whole process and just start collecting everything, because sometimes you don't know why data is valuable until much later, and if you didn't store it, you've lost it. Think about the many powerful use cases that could have improved your business or brought innovation if you had access to that data. But unfortunately, what happens when you collect all that data in a data lake is that bad-quality data at the beginning of the pipeline flows through the entire pipeline, so the more advanced processes that rely on it inherit that bad quality. Machine learning and AI models built on top of that data become unreliable. As a consequence, the data scientists and business leaders who were trying to extract meaningful information out of the data were not able to do so. So why does this happen? Why is it so difficult to get quality and reliability on data lakes? To answer this, I will walk through some of the challenges that we have seen working with data practitioners over and over again. Through this talk, I will cover data engineering challenges, the solutions that modern data technologies and open source projects offer, and how and where Delta Lake fits in the picture. And most importantly, if you find the technology useful, I will share where and how you can become a part of it.
So let's talk about those data engineering challenges. The first challenge I mentioned is data reliability. For example, if a team upstream changes data or a schema without letting you know, that might break your pipeline or cause an entire production job to fail when writing that data into the data lake. And this is one of the many instances where my cluster failed. What happens then? Maybe the cluster failed because you relied on an EC2 spot instance and that spot instance is now gone. When a job fails halfway through, you have to think about the corrupted data it left behind, because it stopped in the middle: now you have half-written data that needs to be cleaned up. Data engineers are then responsible for deleting any corrupted data, checking the remaining data for correctness, and setting up a new write job, because the system is not capable of doing this itself. That is time consuming, both in terms of data engineers' time and cloud compute costs. Another reliability challenge is the lack of schema enforcement. Data validation is vital for any data engineering pipeline, because machine learning and AI applications depend on it. If there is no way to gauge whether something about the data is broken or inaccurate, then you cannot identify data errors at the beginning of the pipeline, and you can corrupt the whole pipeline. Data lakes don't offer any kind of schema enforcement or data quality checks. This is a screenshot from one of my projects, where I have a Parquet table with over 14,000 rows and four columns, and a streaming job that appends data to this table. Let's see what happens after my streaming query runs. It looks like my streaming job went through; however, my table now has 51 records and two extra columns that I didn't expect. So what really happened? When the streaming query started adding new data to the Parquet table, it did not properly account for the existing data in the table. Furthermore, the new data files that were written out accidentally had two extra columns in their schema. So when reading the table, the two different schemas from the two sets of files were merged together, and that unexpectedly modified the schema of my entire table. With the increasing amount of data collected in real time, companies also need ways to reliably perform updates, merges, and deletes so that data can remain up to date at all times. With traditional data lakes, it can be incredibly difficult to perform simple operations like these and to confirm that they occurred successfully. Here is how it happens in legacy data pipelines (sketched below). To insert into or update a table, a data engineer has to find the new rows to be inserted, identify which rows should be replaced or updated, identify all the rows that are not impacted by the change, create a new temp table from all of those, delete the original table with the wrong records, and then rename the temp table to take its place. So there is all this reprocessing, because entire tables or partitions need to be rewritten on each run for updates, inserts, and deletes. Those are some of the problems I have seen; I don't know if they resonate with you, but hopefully they do. So how are we solving those reliability and quality problems? We need something so that data can be reliable and used for production applications.
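To make that legacy upsert workflow concrete, here is a minimal sketch of the temp-table dance in Spark SQL, followed by the single atomic MERGE that a transactional format like Delta Lake (introduced next) supports. The table and column names (events, updates, event_id) are hypothetical, purely for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("legacy-upsert-sketch").getOrCreate()

# --- Legacy data-lake upsert: rebuild the whole table through a temp table ---
spark.sql("""
    CREATE TABLE events_tmp AS
    SELECT e.* FROM events e                      -- rows not impacted by the update
    LEFT ANTI JOIN updates u ON e.event_id = u.event_id
""")
spark.sql("INSERT INTO events_tmp SELECT * FROM updates")   # new and replacement rows
spark.sql("DROP TABLE events")                               # delete the original table
spark.sql("ALTER TABLE events_tmp RENAME TO events")         # swap the temp table in

# --- With Delta Lake, the equivalent upsert is one atomic statement ---
spark.sql("""
    MERGE INTO events t
    USING updates s
    ON t.event_id = s.event_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```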
This is why I'm excited to tell you that some very talented engineers built Delta Lake to tackle these problems. Delta Lake is an open source storage format that solves the data reliability problems data lakes have historically presented, and it does that through a number of properties it offers, most importantly ACID transactions. Delta supports major features like ACID transactions, schema enforcement, schema evolution, unified batch and streaming, an open format, scalable metadata, DML operations, data versioning, and more. Through these features it can provide all the reliability and data management needed for high-quality data pipelines. So let's dive into Delta's architecture and how it achieves reliability. A Delta Lake table actually consists of two main components. The first component is the data objects, stored as Parquet files in scalable storage like S3, Azure, or Google Cloud Storage. The second component is the scalable transaction log. One of the really cool properties of Delta Lake is that it is as highly available as the cloud data lake itself, so you automatically get the same availability, scalability, and flexibility that cloud providers have already baked into their flagship storage services. All the metadata for a Delta table is stored in a separate folder under the table's root directory. What I'm showing here is the actual directory structure of a Delta table. There is essentially a write-ahead log that we maintain in S3, with an entry for each version of the table, for every transaction we commit. Each entry tells you all the files that are part of a specific table version. In practice it's not easy to maintain millions of files in one place, so Delta keeps this log in a _delta_log folder under the table's root directory. If we dive one step further into the _delta_log directory, you can see a bunch of JSON files, each named with a version number padded with leading zeros. These files contain meaningful transaction information, like indexes and statistics, so you have metadata about every commit and every change that happened in the table. Delta Lake also periodically writes checkpoints into the same log folder. You can see some checkpoints here; a checkpoint is a kind of shortcut for fully reproducing a table state, and it is useful because it allows a query engine to avoid reprocessing what could be thousands of tiny, inefficient JSON files. Let's further explore how these files are created and how you can actually use this metadata. Whenever a user performs an operation that modifies a table, such as an insert, update, or delete, Delta Lake breaks that operation down into a series of discrete steps composed of one or more actions. Those actions are then recorded in the transaction log we saw earlier as ordered, atomic units known as commits. Atomic means that a commit either lands in its entirety or not at all; if it cannot complete, it fails cleanly. And through these commits, you can access any historical version of the data; I will show how that is useful in a later slide. So now we have an understanding of how the log is stored. Delta Lake allows the same log to be accessed by multiple users, and also allows readers and writers to perform actions at the same time.
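For reference, here is a minimal sketch of what that layout looks like on storage and how a commit file can be inspected. The table path /data/events is hypothetical; the file naming and the newline-delimited JSON action format (add, remove, metaData, commitInfo) follow the Delta Lake transaction log protocol.

```python
# Hypothetical Delta table layout:
#
#   /data/events/
#     part-00000-....snappy.parquet               <- data objects (Parquet files)
#     part-00001-....snappy.parquet
#     _delta_log/
#       00000000000000000000.json                 <- one JSON file per commit,
#       00000000000000000001.json                    version numbers zero-padded
#       ...
#       00000000000000000010.checkpoint.parquet   <- periodic checkpoint of table state

import json

# Each commit file is newline-delimited JSON; every line is one action.
with open("/data/events/_delta_log/00000000000000000001.json") as f:
    for line in f:
        action = json.loads(line)
        if "add" in action:
            # The "add" action records the data file added in this commit,
            # along with per-file statistics used later for data skipping.
            print(action["add"]["path"], action["add"].get("stats"))
```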
And you might wonder how that can work. It works because no one sees any changes until a change has been successfully committed and the log says the table is now at that new version. This is made possible by the snapshot isolation property of Delta Lake. With snapshot isolation, readers always read a consistent snapshot of a Delta table at any given time, even in the face of concurrent writes. In this figure, for example, one query is doing an insert and another is updating the table with the file 003.parquet, but only one of those actions will succeed: a reader will see either 001.parquet plus 002.parquet, or only 003.parquet. Now, what happens when multiple writers want to update the same table? We talked about readers and writers at the same time; now let's talk about multiple writers. Because Delta Lake uses optimistic concurrency control, multiple writers can concurrently modify a Delta table by agreeing on the order of changes. For example, in a case like this one, where two writers both try to commit 002.json, only one of the changes will succeed and the other will fail; here the writer that wins the agreement on the order of changes commits 002.json, and the other has to retry. Now, let's talk about how to maintain the quality of data pipelines; we just saw how Delta handles reliability. What happens when somebody changes your source system? It is going to break a report or a downstream application, and whenever that happens we want to be alerted. Remember my earlier screenshot, where a Parquet table's schema was unexpectedly modified and my original records were completely wiped out; let's see how Delta solves this. I performed the same operation, but this time on a Delta table. After running the query, it fails. You might see this as a failure, but it is an expected failure: Delta Lake intentionally blocks the write because the schema of the new data did not match the schema of the original table. Schema enforcement, also known as schema validation, is a safeguard in Delta Lake that ensures data quality by rejecting writes to a table that do not match the table's existing schema. Like the front-desk manager at a busy restaurant who only seats people with a reservation and turns you away if you don't have one, Delta Lake checks whether each column in the data being inserted is on the list of expected columns and rejects any write whose columns don't exist in the table. Errors like this let me fix the schema and repair my pipeline without corrupting the entire dataset. Moreover, the schema evolution feature of Delta Lake allows users to intentionally change a table's schema so that it can accommodate data that changes over time, and it supports both append and overwrite modes. You just add the mergeSchema option to the write, and instead of replacing the whole table, Delta will append the new records and add the two new columns to the schema (a sketch follows below). That's how Delta Lake works behind the scenes. If you remember, I also talked about versioning and the use cases it solves, so let's look at those.
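Here is a minimal PySpark sketch of that behavior, assuming a hypothetical Delta table at /data/events and a DataFrame carrying two unexpected columns. The first write is rejected by schema enforcement; the second succeeds once mergeSchema is enabled.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-demo").getOrCreate()

# Hypothetical new data: the original columns plus two unexpected ones.
new_df = spark.createDataFrame(
    [(1, "click", "US", "mobile")],
    ["id", "event", "country", "device"],   # country/device are the extra columns
)

# 1) Schema enforcement: this append is rejected because the new columns
#    don't match the table's existing schema.
try:
    new_df.write.format("delta").mode("append").save("/data/events")
except Exception as e:
    print("Rejected by schema enforcement:", e)

# 2) Schema evolution: opt in explicitly, and Delta appends the rows
#    and adds the two new columns to the table schema.
(new_df.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .save("/data/events"))
```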
Delta's data versioning is one of the most helpful features in a lot of different ways we have seen, for regulation as well as auditing; if there is data, somebody eventually wants to audit it, right? As we saw earlier, Delta automatically versions the data you store in your data lake, so you can access any historical version of it. This simplifies your data pipelines by making it easy to audit changes, reproduce experiments, and roll back. First, auditing data changes, which is critical both for data compliance and for simply debugging how data has changed over time. Earlier I showed that Delta Lake keeps indexes and statistics; because Delta records every action that has been performed, the metadata also captures which files were impacted and so on. So you can run DESCRIBE HISTORY, see all the commits, and look at the history of table changes: that's how you solve the audit use case. The second use case is reproducing experiments. During model training, data scientists run various experiments with different parameters on different data, and when a scientist revisits those experiments after some time to reproduce a model, the source data has typically been modified by somebody else. A lot of times those changes catch you unaware, because upstream data teams can modify data without even telling you. Delta Lake's time travel capability works well in conjunction with another popular Linux Foundation project, MLflow. For reproducible machine learning training, you can simply log a timestamped path or version as an MLflow parameter, so you can track exactly which version of the data was used for training. That time travel capability lets you go back to the earlier state of those datasets and reproduce earlier models. To do that, you also have to make sure the historical data has been retained; if you vacuum it away, you lose the ability to time travel to it. The third important use case is rollbacks. Data pipelines can sometimes write bad data for downstream consumers; this can happen because of issues ranging from architectural incompatibilities to messy data and bugs in the pipeline. Time travel in Delta Lake makes rollbacks easy in case of bad writes. How? For example, if a GDPR job writes a bad record into your table and user data is accidentally modified, you can fix the pipeline by going back to the version before that change happened and restoring it (see the sketch below). A very powerful capability. So we talked about data reliability; how does Delta Lake address performance? When you are working with large files and massive datasets, you want to save some bucks and run data pipelines efficiently, right? One of the features in Delta Lake is data skipping. The way it saves you money is that when you write data into a Delta Lake table, it automatically collects those statistics I told you about, including the minimum and maximum values of each column. This is very useful when reading a Delta table, because you can skip reading the files that cannot match a specific condition. For example, if you have different groupings of data, it will skip the groups where the data cannot be present and only look at the groups where it could be.
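A minimal sketch of those versioning use cases, using the Delta Lake APIs on a hypothetical table at /data/events; the version numbers and the MLflow parameter name are illustrative, and the restore call assumes a Delta release that includes RESTORE (1.2 or later).

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable
import mlflow

spark = SparkSession.builder.appName("time-travel-demo").getOrCreate()

# Audit: list every commit made to the table.
spark.sql("DESCRIBE HISTORY delta.`/data/events`").show(truncate=False)

# Reproduce an experiment: read the table exactly as it was at version 5
# and record that version alongside the MLflow run.
training_df = (spark.read.format("delta")
               .option("versionAsOf", 5)
               .load("/data/events"))
with mlflow.start_run():
    mlflow.log_param("data_version", 5)   # hypothetical parameter name
    # ... train the model on training_df ...

# Roll back: restore the table to the version before a bad write.
DeltaTable.forPath(spark, "/data/events").restoreToVersion(4)
```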
In the example here, our query is looking for events triggered by user ID 24,000, and you can see the groupings: file1.parquet, file2.parquet, and file3.parquet. From the min/max statistics, user ID 24,000 cannot appear in the first two files, so Delta skips them and only reads file3.parquet. It's easy for Delta to skip the files that can't match the query condition, and that saves you a lot of money. Another cool feature is generated columns. Say you have a table with a timestamp column; we deal with a lot of timestamps, because we have to find records from specific dates and years. You don't want to partition by the raw timestamp, because that would result in far too many partitions; instead, you want to partition by date, which means adding a column that takes the timestamp and converts it to a date. Normally you'd have to do that manually. I remember my SQL days when I had to deal with a lot of timestamps, and different tools didn't agree on the format, so you had to play around with conversions. Delta Lake's generated columns feature computes those derived values, like the date from a timestamp, automatically, so users don't have to provide them when writing to the table. Another feature the community is working on is Z-ordering. Z-ordering is a way to co-locate related data in the same files under a partition or directory. This co-locality is automatically used by the Delta Lake data skipping algorithm we just saw to dramatically reduce the amount of data that needs to be scanned. To Z-order data, you simply specify the columns to Z-order by, and queries that filter on those common columns can find the data in a few files rather than having to look at many. Together, data skipping and Z-ordering let Delta touch only a subset of files and prevent you from having to scan the entire table, which is crucial when you are dealing with datasets of terabytes and more (a sketch follows below). There are many more useful features that I didn't get to cover, but all the information is available in the blogs on the delta.io website. And with all the features we just walked through, Delta is now available everywhere you want to use it, because we released Delta Standalone a few months ago, so it integrates well with other ecosystem projects. It's also available from a wide variety of languages and services, and there are popular connector tools for data engineers, so you can query it from many different engines. Delta Lake is also multi-cloud: it runs on AWS, Google Cloud, and Azure. There are many more integrations, as you can see on this slide. Here is a summary slide showing how the community has brought innovation to Delta Lake; we are now at Delta Lake 1.2. It's been a long, exciting, and thrilling journey to get here since Delta Lake was open sourced in April 2019 at Spark Summit, which is now the Data + AI Summit. You can go to the delta.io website, check out our blogs and release notes, and find each of these features and how to use them. And we are not done: the community is always working to bring more innovation to data engineering.
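A minimal sketch of those two layout features on a hypothetical events table. The generated column uses the DeltaTable builder API; the OPTIMIZE ... ZORDER BY command is shown as it appears in later Delta releases, since at the time of this talk Z-ordering was still in progress in the open source project.

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.appName("layout-demo").getOrCreate()

# Generated column: event_date is computed from event_time automatically on write,
# so writers never have to supply it, and the table can be partitioned by it.
(DeltaTable.createIfNotExists(spark)
    .tableName("events")
    .addColumn("user_id", "BIGINT")
    .addColumn("event_time", "TIMESTAMP")
    .addColumn("event_date", "DATE", generatedAlwaysAs="CAST(event_time AS DATE)")
    .partitionedBy("event_date")
    .execute())

# Z-ordering (later Delta releases): co-locate rows with similar user_id values
# in the same files so data skipping can prune more aggressively.
spark.sql("OPTIMIZE events ZORDER BY (user_id)")
```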
And these are some of the features the community is working on. You can stay tuned with updates on our roadmap through GitHub, where we discuss all the features and track progress. Open source projects cannot exist without a thriving community of users and developers, and lots of organizations have adopted and are contributing to Delta Lake. This is just a subset of the many organizations running workloads on Delta Lake, and the work they are doing is very transformational: collectively, more than an exabyte of data gets processed per day on Delta Lake. We have an engaged, very active Slack community of more than 6,000 users, which is very exciting, and more than 50 companies have contributed to Delta Lake. So while we have exciting momentum going in the community, I want to encourage you to get involved. There are a bunch of channels: check us out on Slack; check out our YouTube channel, where you will find tech talks, live Q&As, and demos; there's a mailing list; and if you want to get involved with the actual code, you can always join us on GitHub. There is a contributing guide there, and you can start with the good-first-issue label if you are new to the project, participate in the roadmap discussions or any other issue discussions (some threads have 30-plus comments), or create a pull request directly if you have a solution or code you want to add to the project. Our committers and contributors are always happy to discuss and give examples for the code. We also host community office hours every two weeks, where people can join live and ask questions about Delta Lake or about what is coming up; the sessions are live and recorded for those who cannot attend. And one last call-out: we have the Data + AI Summit next week, with over 17 sessions just on Delta Lake. And of course, since it's the Data + AI Summit, you get to hear a lot of use cases from different companies as well as from other projects the community is working on. On the Delta Lake side specifically, Michael Armbrust is going to give a keynote, there will be some cool AMAs with the committers, and we are celebrating Delta Lake's third birthday. You can join in person in San Francisco or join online for free from the comfort of your couch. Thanks a lot for joining me today, and I can take any questions you may have. Yes? [Audience] Yeah, I'm curious. I think the way the Parquet files manage the data, and the ability to go back in time and gradually branch off, makes sense. When you update the schema of your table, is that handled in a similar format, something comparable? Is it part of one of the next Parquet files? Or is it such a simple process that if you revert, you just make another call to update your schema? [Speaker] Yeah, that's a good point. Every change you make, whether it's a schema change or a data change, adds a new commit version to the Delta log folder. Remember, I showed you the _delta_log directory as well as the data files; all of those changes are kept in the _delta_log directory. So when you go back to an older version, it will present that older version as the current version, but the log of what was changed within the retention period is also kept.
So in case you want to go back again, you can do that. [Audience] Awesome, thank you. [Speaker] Any other questions? Hi, virtual audience, do you have any questions? Awesome. Oh yes, please. [Audience] First off, thank you for presenting. My question: if I heard you correctly, there is a retention policy on the Delta log? [Speaker] Yes. Typically there is a retention policy as well as vacuum. The log retention policy defaults to 30 days, so any time you make changes to the data, the log of those changes is kept for 30 days. Of course, you can play around with the retention: there's a setting you can use to retain, say, only seven days, if your organization doesn't want to keep historical data. And there is another retention setting: if you delete data, Delta doesn't remove it right away; it keeps the deleted files for seven days by default, so you can still recover the data in case somebody deleted it accidentally. After seven days, the deleted data is removed automatically. And if you don't want to keep that seven-day window, you can always vacuum with zero retention, and the data is removed completely. [Audience] So when you say vacuum, that's like the hard delete? [Speaker] Yes, exactly. [Audience] Okay, thank you. [Speaker] Good question. Any other questions? Awesome. Well, hopefully I'll see you in some of the Delta sessions, or maybe you can get involved in the community. Thank you all for attending. Appreciate it.
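As a reference for the retention discussion above, here is a minimal sketch of how those knobs are typically set on a Delta table. The table name and the chosen durations are illustrative; the 30-day log retention and 7-day deleted-file retention mentioned in the answer are the documented defaults.

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.appName("retention-demo").getOrCreate()

# Shorten both retention windows on a hypothetical `events` table.
spark.sql("""
    ALTER TABLE events SET TBLPROPERTIES (
      'delta.logRetentionDuration' = 'interval 7 days',
      'delta.deletedFileRetentionDuration' = 'interval 7 days'
    )
""")

# Vacuum is the "hard delete": physically remove files that are no longer
# referenced by the table and are older than the retention window.
DeltaTable.forName(spark, "events").vacuum()    # default: 7 days of deleted files

# Vacuuming below the default threshold requires disabling the safety check.
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
DeltaTable.forName(spark, "events").vacuum(0)   # zero retention: remove everything unreferenced
```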