I'm going to go ahead and get started. Thank you, everyone. I think this is the last session of the day. I could be wrong on that, but I think after this is the booth crawl, which is the fun part. I am going to talk about Delta Lake today. It is an open source project hosted by the Linux Foundation. I personally think it's a really cool project, so I'm going to get into what's going on under the hood, use cases, how it works, all that good stuff. I also want to take any questions. We don't have a super large group, so please let me know if there is anything; there is a lovely person with the microphone who can make you all heard.

With that, that's me. I'm the human, not the dog. I put in my speaker profile that I have what I think is the world's cutest dog, so I had to prove it by putting it up on the slide. I also put my email address on the slide. If you have any questions about Delta or anything related to what I'm talking about, even in the middle of the night, you can go ahead and email me. I would be glad to talk about it at any time.

With that, let's go ahead and get into things. I have some slides and I have some code, and I like going back and forth just to spice things up a little and actually show you what the things I'm talking about look like in practice.

I want to start with some motivation. Delta Lake, if you're unaware, is a file protocol. Why am I going to spend a whole 45 minutes of your time talking about files? There is a very good reason for that, and it has to do with modern data architectures. You can see two ways of looking at the world today. Data warehouses are one side of things: we had on-premise data warehouses, and we have cloud-based data warehouses, provided either by the cloud vendors or by external vendors. Great for BI, great for reporting. That being said, they have some limitations. Let's say you want to do machine learning or deep learning. You don't do that on a data warehouse, as I'm sure most of you are aware. There's not much support for unstructured data, or even semi-structured data. Data warehouses are structured data with good performance for analytics. One of the main reasons they're so good for analytics is that most warehouses have a proprietary file format: your data maybe started off as CSV or JSON, and once the warehouse ingests it, it gets transformed into that format. The warehouse's engine is optimized for that file format, so it's really fast, but you can't do much else with the data once it's in there.

What that means, and I think there's a statistic out there that a huge chunk of the data in the world is unstructured (think of all the emails and tweets and images we're generating constantly, people taking pictures, that's unstructured data right there), is that data goes into data lakes. That's the other side of the world. Data lakes are great. You can put in unstructured data and structured data. You can put in your CSV files and your JSON files and your text and images, whatever you want to use, but you can't really do BI on those things. It's going to be slow, and you're going to have some angry analysts who might come and yell at you. You don't want to deal with that. Governance is limited in data lakes historically as well; data lakes can get called data swamps sometimes. Yes, they're great, they're massively scalable, and you can do data science and data engineering on top of data lakes, but they have their limitations as well. That's the motivation here.
We have use cases for data lakes and use cases for data warehouses. There are a lot of cons to both, but they exist for a reason; there are a lot of pros to both. So if we had a perfect world where we could do away with all of those cons and only have pros, what I would personally like to see, and from talking to a lot of the people I work with, what everyone else would like to see as well, is this. From the data warehouse side: BI and reporting run a business. That is what executives look at. They like their dashboards. We want to make sure we're making money, essentially. So we want fast performance for our BI and analytics on any amount of data. We also want the governance and organization that comes from a data warehouse, which is really important around security, cataloging, and lineage. From the data lake side: we want that massively scalable cloud storage where we can run any use case on it, move from cloud to cloud, throw it all over the place, do data science on top of it. We also want it to be open source. This is an open source summit, after all.

So we want something that will essentially squish together the best parts of a data warehouse and the best parts of a data lake. That leads to an architecture pattern we're now seeing all over the place called a lakehouse, which is a very uncreative name. It's pretty much just saying: take all the good stuff from a data warehouse, take all the good stuff from a data lake, throw out the bad, and call it a lakehouse. The lakehouse is an architecture pattern. The idea is a data lake based approach, so massively scalable cloud blob storage handling all sorts of data, but you can still use that data lake for all of your use cases. That means, yes, you can actually do BI and have governance on your data lake. That can be a little bit complicated; as you can imagine, there are reasons why data warehouses have existed in the past. They didn't come to exist because someone just liked them.

So let's talk about how Delta Lake can fit in for our data lake here to solve these problems and get the best of both worlds. Yes, there are compute engines involved in the middle of this lakehouse; you need those to do analytics and data science. We are focusing today on Delta Lake as a file format and how it can enable this lakehouse architecture, where we have massively, horizontally scalable data storage and can run all of our use cases against one central source of truth for our data. There are animations in this deck I do not like.

So let's talk about what Delta Lake is. We're going to do a couple of things. First, we're going to look a little under the hood: what do I mean by Delta as a file protocol, and how does that differ from something you might be used to? I like to compare things to Parquet, just because Parquet is essentially an industry standard big data file format. We're then going to look at the goodies Delta Lake provides: the use cases, what it adds on top of other file formats. And then we'll talk about performance. Performance is important; everyone wants things to go fast.

So what is Delta as a storage protocol? There are two major components I want to talk about. If you have a Delta table, you're going to see two things. The first is a whole bunch of Parquet files. Parquet is open source, and Delta is open source.
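For concreteness, here's a minimal PySpark sketch of writing out a Delta table before we get to the second component. The package version, table path, and column names are all illustrative, and you'd match the delta-spark package to your own Spark version.

```python
from pyspark.sql import SparkSession

# A SparkSession with the Delta Lake extensions enabled.
# (Package version and path are illustrative.)
spark = (SparkSession.builder
         .config("spark.jars.packages", "io.delta:delta-core_2.12:1.2.1")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

loans = spark.createDataFrame(
    [("IA", 1000), ("WA", 2500)], ["addr_state", "funded_amnt"])

# Writing in the "delta" format produces ordinary Parquet data files
# plus a _delta_log/ directory of JSON commits next to them.
loans.write.format("delta").mode("overwrite").save("/tmp/loans_delta")
```

List that directory afterwards and you'll see ordinary part-*.parquet data files plus a _delta_log folder, which is the second component.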
We, and by "we" I mean the project, not me personally unfortunately, have built on top of Parquet in order to make use of what's already a really good data storage format. Parquet is columnar storage, it compresses well, and it has a lot of good metadata tagging. Using Parquet as the base storage format lets us take something that's already good and make it a lot better by adding those warehousing use cases.

The way that's added is through something called the transaction log. So we have a table, and it's a bunch of Parquet files. Great. But that's not the only thing you're going to see. Within that table directory, maybe in S3 for example, you're going to see a folder called _delta_log, and it contains a massive amount of metadata. That metadata has information about every single transaction that has been done to a Delta table. You delete something, you merge something, every single thing is going to be logged in that transaction log. That's what allows us to maintain ACID compliance, and it's also what lets us handle the amount of metadata needed. Something I've heard often is that when you're dealing with big data, even the metadata itself is big data. The transaction log lets that metadata be stored alongside the Delta table itself to allow for fast processing of the data, and we're going to get into how that all works.

I do want to look into that transaction log a little, and we're going to go look at the S3 storage to see what it actually looks like. But essentially, every operation on a table, whether you write, you merge, or you delete, is logged as a JSON file. The files are just numbered, counting up, so every single transaction that happens adds a new record to your transaction log. I'm sure some of you can already see potential hiccups with this plan, and we'll talk through those as well. But that's the important part here: we have ordered commits that track what's happening to our table within this transaction log.

Potential issue number one: you might be familiar with how data warehouses handle concurrency. Let's say Bob is writing to a table and Alice wants to read from the table. Too bad, the table's locked, someone else is using it. Within Delta we need to figure out how to handle concurrency without locking tables to a single writer. So, let me click through all of the animations, there we go. Essentially, what happens when two people try to write at once? We want to make sure we have consistent data, and I said earlier that Delta is ACID compliant, so what does that actually mean in the world of Delta? Delta is optimistically concurrent, and that's one of the important parts of the transaction log: how the transaction log is ordered. Like I said, we have Parquet files, and we have everything that's happened to the table as JSON commits. Within this transaction log, we keep track of who is doing what to a table. It's pretty simple: a writer will attempt to write to a table or a partition (you can get things down to the partition level if you prefer) and check that someone else is not already writing to that partition. If someone is, that's fine, it just tries again. There are no failures except in very specific scenarios. It's pretty much just called optimistic concurrency: writer one writes, writer two writes, and if they conflict, writer two tries again.
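As a rough sketch of that retry behavior from the writer's side, something like the following works. Delta surfaces write conflicts as concurrent-modification exceptions (ConcurrentAppendException and friends); the broad `except` below is only a placeholder for those, and the path is the illustrative one from earlier.

```python
import time

def append_with_retry(df, path, max_attempts=3):
    """Optimistic concurrency from the writer's side: on a conflict, just try again.

    Delta raises concurrent-modification errors (e.g. ConcurrentAppendException)
    when two writers touch the same files or partitions; catching the broad
    Exception here is only a stand-in for catching those specific classes.
    """
    for attempt in range(max_attempts):
        try:
            df.write.format("delta").mode("append").save(path)
            return
        except Exception as conflict:   # narrow this to Delta's conflict exceptions
            print(f"attempt {attempt + 1} conflicted: {conflict}")
            time.sleep(1)
    raise RuntimeError("gave up after repeated write conflicts")

# Hypothetical usage: two jobs calling this against the same table path.
# append_with_retry(new_loans, "/tmp/loans_delta")
```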
So that's a pretty simple way of handling concurrency, and it works across compute engines too. I'm going to focus on Spark as a compute engine today, but if writer one is writing from Spark cluster A and writer two is writing from Spark cluster B, there are features in open source Delta that allow that to succeed as well. So, good news: you don't have to worry about data corruption with Delta.

The other potential hiccup I wanted to talk about with this transaction log is that, like I said, every single commit creates a new JSON file. As you can imagine, we're dealing with a lot of data a lot of the time. It's not just Bob writing to this table and that's the end of the day. There might be thousands of writes happening. There might be streaming writes, writes coming from other sources, or just merges and deletions if you're doing something with GDPR, for example. So we might end up with a lot of JSON commit files, and we don't want to have to process, I don't know, five million different JSON files to figure out the state of our Delta table every single time we want to do a transaction. That's where using Spark as a compute engine comes in, and this is built into the Delta format: every 10 transactions, Spark will automatically compact the Delta log. What this means is that instead of JSON one, JSON two, all the way up to five million or whatever arbitrary number you want to throw out there, every 10 commits we create a checkpoint.parquet file. So when you are resolving metadata to determine the state of your Delta table, you read the latest checkpoint plus at most a handful of small JSON files. That's not bad at all. It also means this massive scale of metadata is not limiting your performance.

So, the major things that come out of the transaction log. Three major things: one, we have ACID compliance; two, we resolve conflicts optimistically; and three, we actually use Spark to scale the metadata. Any questions so far? Nothing. All right.

I've been mentioning Spark a bunch of times. Something else I want to throw out there is this lovely slide right here that lists a lot of the other engines compatible with Delta as a format. Now, what does "compatible with Delta" actually mean? Delta is built off Parquet, but because of the ACID compliance and the way transactions are handled, not all of those Parquet files are necessarily valid parts of a table. That's where the transaction log comes in, to keep track of what is a valid part of this Delta table. So resolving the table during a read involves checking the transaction log and determining which Parquet files actually make up the table. There are ways you can just go back and forth between Delta and Parquet, but all of these engines here actually have native Delta readers, with more coming as well. I'm going to put out a call at the end: if you want to add something, this is an open source project, and we would absolutely welcome it.
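As one concrete example of a reader outside Spark, the delta-rs Python bindings (the deltalake package) can open the same table with no cluster at all. This is a sketch; the method names reflect recent deltalake releases, and the path is the illustrative one from earlier.

```python
# pip install deltalake  (the delta-rs Python bindings; no Spark required)
from deltalake import DeltaTable

dt = DeltaTable("/tmp/loans_delta")   # the same directory Spark wrote earlier
print(dt.version())   # current table version, resolved from the transaction log
print(dt.files())     # only the Parquet files that are valid for that version
df = dt.to_pandas()   # materialize the table as a pandas DataFrame
```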
I'm going to focus on Spark today, though. Let's get into some features of Delta. I'm going to talk about a bunch of features, and I know I've already mentioned a lot of these. What I'm going to do is talk through them and then tie them back to how they create that perfect lakehouse world.

Let's start off with batch and streaming, which I think I glossed over before: we want to handle not only all types of data, but all velocities of data. If we have data arriving in batch and data arriving as streams, we do not want to end up with a Lambda architecture. If you're unfamiliar, the Lambda architecture is a common pattern where you essentially have to maintain two different tool sets to handle the different velocities of data, just based on what existing tools are able to do.

So let's do something a little bit weird and actually take a look at Delta as a format, do some streaming, do some batch operations, and see what that looks like. I'm going to go through a couple of things quickly. First, let me make this bigger so you can all actually see it. All right, now it's too big. I have created a Delta table, and I'm going to be using SQL right now. That said, you can use a bunch of other languages. I prefer Python, but there's also SQL and Scala, and there's a Rust connector now, lots of different ways of interacting with Delta. So I have a very small Delta table here, which is essentially just some information about loans and which state they originate from. We can see this is a Delta table with some simple SQL metadata; I created it a couple of hours ago. We can also see that, like I said, it's a very small table: we have one Parquet file and that _delta_log directory. If we look into that _delta_log directory, we see we only have one transaction, which is from when I wrote the original table, plus some other optimizations as well.

Here's where we're going to do something a little hacky, because I am one person. I am going to create a stream that reads from my Delta table and updates this map right here, and I am also going to batch write into my table as I am streaming out of it. Let me just reiterate that: I have a stream that reads from my table and shows it in a map view, and then, as my stream is continually processing data, I am going to keep writing more chunks of data to that exact same table. What that should mean is that I don't get any sort of conflicts and I should be able to display my new data. Just so you're aware, keep an eye on Iowa: I'm going to continuously write data into Iowa as this stream is running. So if I start that, every 10 seconds I'll drop some more data into the state of Iowa, and because Delta has this isolation and serializability, we shouldn't have any issues with our stream failing while someone is writing to the exact same data set. Spoiler: it works, and Iowa is already being updated. This is a pretty simple way of showing we have a streaming reader with a batch writer, things are being updated in real time, and we don't have any failures or concurrency conflicts.

What this means from a use case perspective is that you can really easily have all of your batch and streaming pipelines in one place and not have to worry about setting up separate architectures. You're probably aware this is great, because maintaining separate architectures means maintaining separate tool sets, which is rough. You don't want to have to maintain separate code bases, separate DevOps teams, all of the above.
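The shape of that demo in PySpark is roughly the following. It assumes the Spark session from earlier; the table path and the addr_state / funded_amnt column names are stand-ins for the loan data set on screen.

```python
import time
from pyspark.sql import functions as F

path = "/tmp/loans_delta"   # illustrative; in the demo this is an S3 path

# Streaming reader: a Delta table is a valid streaming source, so new commits
# show up in this running aggregation as they land.
query = (spark.readStream.format("delta").load(path)
         .groupBy("addr_state").agg(F.sum("funded_amnt").alias("total"))
         .writeStream.outputMode("complete")
         .format("memory").queryName("loans_by_state")
         .start())

# Batch writer: appends to the very same table while the stream is running.
for _ in range(3):
    more_iowa = spark.createDataFrame([("IA", 100)], ["addr_state", "funded_amnt"])
    more_iowa.write.format("delta").mode("append").save(path)
    time.sleep(10)

spark.sql("SELECT * FROM loans_by_state ORDER BY total DESC").show()
query.stop()
```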
So once we've gotten one more update, which should get us up to, I believe, 1,800 in the state of Iowa, we can go ahead and kill this. What this really illustrates is the ability of Delta to handle that concurrency and to handle multiple readers and writers. Let's go ahead and kill that guy. I did put these slides online, and just in case the internet went down I took screenshots of what was going on for everything, too, so if this is incredibly exciting you can look back at it.

The other thing I wanted to talk about is: okay, so we can do streaming, we can do batch. That's more on the data engineering side of things, and we can already do data science on data lakes. What about BI? There are two components to BI. One component is performance; we'll get to that. The other component is support for all types of queries. ACID transactions are one big aspect we've already touched on. The other big aspect I've mentioned is full DML support. If you're familiar with Parquet or ORC or any of those other standard big data file formats, you're probably aware that handling, I don't know, merges and deletions is a bit of a painful operation. With Parquet, you have to overwrite entire tables or partitions just to delete one record, for example. So let's talk about how we can use Delta to avoid that.

If I just go ahead and skip through, yep, we are very aware of these things. Let's create a traditional Parquet table directly on top of that same Delta table, so we have duplicate data sets: one in the Delta format and one in the Parquet format. Once this completes, yep, this is our exact same map, our exact same data set. Now let's say, okay, I made a mistake with my loop. We should not have assigned these additional loans to Iowa; we should have assigned them to Washington. So let's delete them. Cool, we can just run a DELETE FROM, and of course it's not going to work, because Parquet does not support DELETE FROM. Delta does, good news, and we can just go ahead and run that and see how it impacts our Delta table. It will actually give us the number of rows affected. Something else you might have seen really briefly as that command was running is a little popup that said "computing snapshot." That's what I mentioned earlier about how the transaction log is resolved: we look back through those JSON files as we read and determine which version of our Delta table is the valid one.

I'm going to go pretty quickly through these next ones. Let's finish this off and do an update. Oh no, who could have predicted it would fail with Parquet? And we can run that same UPDATE command with Delta. This is a pretty simplistic thing that you can do in a data warehouse, but the data I'm operating on is actually sitting in S3, and we'd get the same behavior if it were sitting in GCS or Azure Blob or ADLS, anything like that. We are bringing these data warehouse type operations to a data lake.
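For reference, the delete and update we just ran, plus the merge coming up next, look roughly like this through the DeltaTable Python API. The predicates, values, and key column are illustrative; the SQL equivalents are what the demo actually ran.

```python
from delta.tables import DeltaTable

path = "/tmp/loans_delta"
table = DeltaTable.forPath(spark, path)

# DELETE: drop the rows that were appended by mistake.
table.delete("addr_state = 'IA' AND funded_amnt < 500")

# UPDATE: reassign the remaining mistaken loans to Washington.
table.update(condition="addr_state = 'IA'",
             set={"addr_state": "'WA'"})

# MERGE (the upsert covered next): update where a match exists, insert where
# it doesn't; Delta works out which case applies for each row.
changes = spark.createDataFrame(
    [("WA", 9999), ("NY", 123)], ["addr_state", "funded_amnt"])
(table.alias("t")
      .merge(changes.alias("c"), "t.addr_state = c.addr_state")
      .whenMatchedUpdateAll()
      .whenNotMatchedInsertAll()
      .execute())
```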
One final thing, let's keep going with the Parquet comparison. So cool, I did a delete and I did an update. That's a two step process, and no one likes to do more steps than necessary, so let's do it all in one with a merge. There's a lovely gif here showing that, yes, it is painful to do this in Parquet: like I said earlier, you have to overwrite entire partitions or tables. With Delta, all we actually have to do is gather our data set of changes. Is a given row an update or an insert? We don't need to know; Delta will figure that out. As you can imagine, this is very valuable for CDC type use cases. Then we just write a MERGE statement. All it's saying is: when we have a match, update our values; when we don't have a match, insert our values. It's a very standard data warehousing DML operation, and it's now, like I said, being done on a data lake. And it's lovely enough to actually give us some statistics: we inserted one row and updated two rows. We're quite happy about that, because I didn't have to overwrite the entire table.

All right, any questions so far? Yeah, that's a good question. The question was: didn't I say Delta uses Parquet in the background? That is correct. The actual files themselves are Parquet; the transaction log is what allows you to make those changes. We can actually look at this once we go through a couple more operations and take a look at what our directory looks like. We're going to see a number of Parquet files, and not all of those Parquet files are actually part of the Delta table, because as we make operations on the table, the transaction log keeps track of which Parquet files are valid parts of the table. The cloud blob storage doesn't know that. Awesome, anything else?

All right, the last category I wanted to talk about is a big one, and that's governance. This is an important part of any data pipeline, and any company out there needs data governance. I would break it down into a couple of things: one is data lineage and quality, the other is security. There's a lot outside of your file format that I certainly hope you use for security, but there are a couple of things built into Delta that make this easier, and I'm going to talk about two different categories and three different features.

The first thing I want to mention is schema enforcement and evolution. This is important. I've been enjoying contrasting things with Parquet, and we can choose a different format if you want, but with the more traditional big data file formats, you write data, great, and then you read the data, and if there's a schema mismatch, that's when you find out. You're not so happy about it, because that could be a week later, that could be an automated pipeline, and things break. You don't want that. Delta has a feature called schema enforcement that checks the schema at write time, so if there's a schema mismatch, it's detected immediately and throws an error. That being said, sometimes you do want to change your schema. Maybe you want to add a column; that makes sense, data changes over time. For that there's an option called schema evolution, which allows you to change the schema, and it's super simple to use.

All we really have to do here is take our same data set but add a column. So we should expect this: if we wrote just these first two columns, that would be fine, that's our data, we can append that. But we're also appending this third column. I don't want to write this and then come back three days later and realize, uh oh, I have a bunch of nulls and a dangling column. I've already run this, so it gave away the hint, but it gives us an immediate error if we try to do this write with a mismatching schema. It's actually quite a good error, too: we had a schema mismatch. What this tells us is one of two things. One, QA your data: what's going on, why do we have an additional column? Or two, maybe you actually meant to add this column. If so, that's fine, just add this mergeSchema option right here, and that additional column will be added.
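A sketch of what that looks like in code: the first append trips schema enforcement immediately, and the second opts into schema evolution with mergeSchema. The extra column here is made up for illustration.

```python
from pyspark.sql import functions as F

path = "/tmp/loans_delta"

# Same data as the table, plus one column the table's schema doesn't have yet.
with_extra = (spark.read.format("delta").load(path)
              .withColumn("loan_status", F.lit("approved")))

# Schema enforcement: this append fails at write time with a schema-mismatch
# error instead of quietly leaving a dangling column full of nulls.
# with_extra.write.format("delta").mode("append").save(path)

# Schema evolution: opt in explicitly, and the new column is added to the schema.
(with_extra.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .save(path))
```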
So now if we go ahead and look at our data set, we can see we've actually generated this new table that has our aggregated loan amount by state. This is great. This is one way of enforcing data quality, built right into the data format itself.

Let's talk about some other things, too. One of my favorite Delta features is called time travel, which is a really fancy way of saying data versioning. I've said over and over again that every transaction on a Delta table is versioned. What that means is we can actually use that versioning to roll back our tables. Not only can we roll back our tables, but we can look, in a really friendly way, at what has happened to our table, when it happened, and who did it. That's good for auditing purposes, and it's good for maybe, I don't know, yelling at your coworker, all of the above. So there are essentially two uses: both are good for audit and compliance purposes, and I've also seen people use it to recreate ML experiments. If you record the exact version of a table, or the timestamp, when you trained a model, you can use this time travel ability to retrain the model on the exact same data set, even if the table itself has changed over time. So there are lots of cool things you can actually do with this, and it's a cool name.

So let's look at it. Let's look at everything we've done so far. If I look here (okay, you go away), this is what's happened: I've done some writes, I did the deletion, the update, the merge. We can see everything that happened, who did it (thankfully, it was all me), and when it happened. We can also use this for time travel. If I use this VERSION AS OF, it takes me all the way back to before Iowa had any data, before anything happened, before I did those merges. I could use this to recreate my data set and do whatever I want with that point in time. One thing I want to call out is that I'm using VERSION AS OF because I want to be able to rerun this notebook; you can also time travel based on a timestamp, which probably has more utility. If, for example, someone overnight wrote some really weird data to your table and you just want the version from before midnight, that is something you can do with time travel very easily.

There are some more examples in here. I want to show you everything that happened to my Delta log and what my tables look like now that I've actually done a whole bunch of stuff. Let me hide this, because it's giving away the fun part. Any questions before I get there, since we're now moving off of the time travel and auditing a little? All right, either I'm confusing everyone terribly, or this is just the best thing ever, because there are no questions.

Oh, here's one: when you revert, do you delete the JSON files, or do you keep going forward? We keep going forward, so you don't actually delete the metadata. The reversion doesn't really go back; effectively, if three things happened afterwards, there will be three undo type operations on top. There's a lot of other metadata management that I'm not covering, because as you can imagine, the transaction log needs to be purged and the extra Parquet files need to be purged. There are ways of managing that which I'm not going to get into today; I'd be glad to talk about it afterwards if you want. If you're curious, look at the command called VACUUM. But yes, you're absolutely correct: you would end up with some very bloated directories after potentially years of this, so there are ways of handling it.
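In code, the history, time travel, and cleanup pieces look roughly like this; the timestamp and retention window are illustrative.

```python
from delta.tables import DeltaTable

path = "/tmp/loans_delta"

# Audit trail: one row per transaction, with the operation, user, timestamp, version.
DeltaTable.forPath(spark, path).history().show(truncate=False)

# Time travel by version: the table exactly as it was at version 0.
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)

# Time travel by timestamp: e.g. whatever version was current before midnight.
pre_midnight = (spark.read.format("delta")
                .option("timestampAsOf", "2022-06-21 00:00:00")
                .load(path))

# Cleanup: VACUUM removes data files no longer referenced by the log and older
# than the retention window, given in hours (168 hours = the default 7 days).
DeltaTable.forPath(spark, path).vacuum(168)
```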
All right, let's go quickly through what we have here, tying back into our transaction log. If you remember, at the beginning we had just that one all-zeros JSON file; that was our original write. These CRC files are important too. You're going to see them; they're not really human readable, they're just collected stats. I'll talk about stats collection in a second, but that's what allows for data skipping and performance. If we look now, we've done nine transactions (we started counting at zero). This is great because, apologies for the scrolling, it ties right back into the transactions in our transaction log right here. That is how the transaction log relates to the table's audit history: we have one JSON entry for every human readable line in our table history. We can also see what one of those JSON files looks like; this is more for the curious, it's just a whole bunch of metadata contained in these JSON files.

I also want to do something really silly. We had nine transactions, and I wanted to get us above ten to show you that checkpoint. So I ran the same update command three times in a row, just to give us three more transactions. What we can actually see now, if I scroll, good news, I was right: after ten transactions, we did get that checkpoint.parquet. So now, after our eleventh transaction, we're going to read from the checkpoint instead of resolving JSON files zero through nine.

With that, let me wrap things up by talking about performance. I've been talking a lot about all of the different features Delta enables to handle the use cases; now let's handle the use cases, but fast. A couple of things. Delta 1.2 was released a few months ago, and I want to talk about that mostly. Prior to the 1.2 release, though, there were already some really good performance features. Delta was fast then, and it's faster now. Before 1.2, we already had Spark for metadata management, which is one of the most important aspects of Delta: handling all the metadata that makes these use cases possible. And then, like I said, all the benefits of Parquet. Parquet is a great file format, so we just build off of it. This, though, is a screenshot of an issue from the Delta repo on GitHub that is part of the H1 Delta roadmap, and I want to talk through these performance features, specifically the OPTIMIZE ZORDER one, because it's really cool.

Let's talk about OPTIMIZE first. You might be familiar with this: in data lakes, a lot of the time, especially if you're streaming, you end up with a lot of small files. One thing leads to another and you have an I/O mess where you're spending more time scanning and opening files than actually doing computation, so manual compaction is needed. I have seen so many people write manual compaction jobs to take these small files and compact them into a more optimal size. The OPTIMIZE command does it all for you: you run OPTIMIZE, stick in your table name, done. You end up with files of roughly a gigabyte, although you can configure that target.
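Those commands look roughly like this, assuming a Delta release where OPTIMIZE (and the Z-ordering covered next) has landed; at the time of this talk these were partly roadmap items, so check your version. The path and columns are illustrative.

```python
# Compact many small files into large ones (roughly 1 GB by default; configurable).
spark.sql("OPTIMIZE delta.`/tmp/loans_delta`")

# Compact *and* co-locate values across multiple columns inside the rewritten
# files, so filters on those columns can skip far more data (explained next).
spark.sql("OPTIMIZE delta.`/tmp/loans_delta` ZORDER BY (addr_state, funded_amnt)")
```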
The other piece is data skipping. Something I tell people when they ask why I care so much about data skipping: take Spark or Presto, I don't care how fast your compute engine is, if I read zero data I'm going to finish the query faster than you. Oftentimes the amount of data you read matters more than how optimized your engine is, so data skipping is a really important aspect of any file format or file protocol out there. These under-the-hood operations just happen; you don't have to know they exist, it's all obscured from you, and they allow for file level and column level data skipping, so we simply read less data when filtering for a query.

The last thing I want to talk about, and I think I'm right on time, is my personal favorite, and that's the ZORDER attached to OPTIMIZE. It's a bit of a mouthful, but essentially what Z-ordering is, is multidimensional data clustering within files. To break that down a little: you can data skip at a partition level; you just define your partition and that puts files into directories. What about at the file level? You can sort things, sure, and that gives you data skipping along one dimension. What Z-ordering does is let you cluster along multiple columns. So you can data skip at the partition level with directories, and then you can data skip at the file level with Z-ordering. If you run OPTIMIZE with ZORDER and whatever columns you choose (there can be more than one), it will both compact your files and rewrite them to co-locate the information in those columns, so when you read, you skip a lot more data. It's super cool. If you're curious, the algorithm used is called a Hilbert curve; it's a space-filling curve, it's on Wikipedia, and there's a lot of math there. But it essentially allows you to skip a lot more data than plain partitioning. So we have partitioned tables, we have compacted tables, and then we data skip at the file level. Suddenly we're getting really fast, and we can actually get to those BI style queries.

All right, let me make this big. With that, I have one minute left. If you want to learn more, check out delta.io. It is the open source Delta website and has a whole bunch of information about Delta, the docs, and the link to the Slack channel. There's a very active Delta Slack community; people are wonderful to each other and will answer all of your questions. We also love contributors, so if you think this is cool and want to contribute, please just let anyone on the Slack channel know, anyone that's part of the project. You can also just ask me questions or email me; I'd love to answer anything. With that, I can take any more questions. I think we're up on time, but thank you so much for attending. Any last minute questions?

One more: is there a way to do ETL fast with no logging, like if you need to load up a data warehouse, or do you just use the normal tools? Normal tools, yeah. And everyone's running away, so that's it. Thank you.