Thank you for joining us today. I'm Sajith, a Solutions Architect at Databricks. Joining me is Gina, a Field Engineering Manager, also at Databricks. We're really excited to talk to you today about Delta Lake. We've divided the talk into two sections. In the first half, I'll talk a little bit about the history of Delta Lake, the problems it solves, and how it solves them. In the second half, Gina will walk you through a live tutorial on how to get started with Delta Lake.

We believe it's a really exciting time to be in the data space, for many reasons. The times are changing with generative models and transformers, but even before that, we've seen industries and enterprises of all sizes starting to work with large-scale data and getting value from it. And by large-scale, we mean petabytes of data. We also see these enterprises starting to use machine learning: most of the companies we talk to have dedicated data scientists who train models, move them to production, and then get business value out of them. With the momentum we see of compute moving to the cloud, this gives us an opportunity to rethink and redesign some of these data systems.

Naturally, it makes sense to start by looking at what matters to the users of these data platforms. One might think it's fancy features and functionality, but if you talk to them, it often comes down to how to get good data. The first challenge we often see is access: how do I even get data into the platform I have, in an efficient way? The second is reliability: can I really trust this data? As a downstream consumer of this data, did a job fail and corrupt it? Did the schema change? Can I rely on it? And the third is the timeliness of the data itself: if I have a job that refreshes my table daily, then real-time analytics is a non-starter to begin with. There are papers from Kaggle and from Google showing that even for data scientists, a lot of the problems they deal with come down to getting good data.

So if great data is key, and if getting high-quality, timely data is hard, how do we solve it? We believe that not all the problems in the data space are intrinsic in nature. A lot of them are a by-product of the accidental complexity introduced by the architecture of these systems. To explain that a little, let's look at the history. A lot of today's analytical systems grew out of data warehouses, which took off in the 80s. The idea is that you have operational data, you extract it, transform it, and load it into your warehouse. Then you build your data marts and so on on top of that, and it's ready for analytics. This worked well for many years, but in the past decade or so a bunch of challenges have come up. First, unstructured data: with a data warehouse, how do you train a model on, say, a transcript automatically generated from a video? You can't even store it there, so that's a non-starter to begin with. The second problem was the high cost of storing data: a lot of these analytical warehouses had compute and storage co-designed, and storage was costly. The third problem was that these systems were designed with SQL as the primary interface, and how do you express some of these advanced machine learning algorithms in such a system?
Then data lakes came along, starting with Hadoop, and then cloud object stores like S3, ADLS, and GCS. The idea was that you have decoupled compute and storage: you can store data as it arrives, and you get great things like eleven nines of durability, high availability, and really cheap storage. Formats like Parquet and ORC became popular. As a result, you have engines like Spark that can run streaming and batch against this data, and at the same time frameworks like PyTorch and TensorFlow working against the same data set. Often in such architectures you would still move a subset of the data into the data warehouse, for features like indexing that come as part of the warehouse.

So with the data lake we solved two problems: unstructured data and costly storage. But we're not fully done yet. This new system creates additional complexity, and the two-tier architecture means data reliability suffers. You can have a job that reads from different sources, writes out to S3, GCS, and whatnot, and then another job that reads a subset of that data and moves it to the warehouse. Say that job fails, and your object store doesn't provide transactional capabilities: how do you restart it? The second issue, again, is timeliness: because of the different hops in this architecture, by the time data reaches the warehouse it's already stale. And you're still duplicating a subset of the data, so the cost doesn't go away.

What if we had a way to combine the best of both worlds: take some of the properties of a data warehouse, like indexing and transactional ACID guarantees, and apply them on top of the data in your data lake? That's where lakehouse systems came along, and we've seen tremendous success with them in the last few years. One of the key foundational technologies that enables the lakehouse is Delta, and that's what our talk is about.

The idea is simple. A data lake is a collection of objects, and that object interface isn't great for the features we mentioned from databases, like transactional guarantees and fast access. With Delta, we give you a layer between your compute engines and your cloud storage, and it gives you the notion of tables with different versions. Version zero, the first version, tracks one set of files; another version tracks another set of files; and so on. At a high level, it's scalable storage: your data is still in an open format in your cloud storage, so you get all the advantages of object stores, like eleven nines of durability, but you also have co-located metadata in the same location.

Let's unpack this a little bit. Say you're starting out and your Delta Lake table doesn't have anything in it yet. If you look under the hood after inserting a bunch of data, you'll see a bunch of parquet files that correspond to the data, but you'll also see a path called _delta_log, which is where the metadata lives. This way, Delta Lake is self-contained.
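As a concrete sketch of that layout, assuming the deltalake Python package that Gina uses later in the tutorial (the path here is just illustrative), writing a tiny table and walking the directory shows the data files and the log sitting side by side:

    import os
    import pandas as pd
    from deltalake import write_deltalake

    path = "/tmp/layout_demo"  # illustrative path
    write_deltalake(path, pd.DataFrame({"id": [0, 1, 2]}))

    for root, _dirs, files in os.walk(path):
        for name in files:
            print(os.path.join(root, name))
    # Expect something like:
    #   /tmp/layout_demo/0-<uuid>-0.parquet                     <- the data
    #   /tmp/layout_demo/_delta_log/00000000000000000000.json   <- the metadata (commit log)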
We don't need any external dependencies like a metastore, and it allows you to scalably process data and metadata at the same time.

How does this all work together? Say we add a bunch of records, and those records translate to, let's say, two files, parquet one and parquet two. The process that inserts them writes those files and then commits, saying: this is version zero, the first version of the table. That commit to the table is atomic. Now say another process comes along and deletes some records and updates some records. That results in removing those first two files and adding a third file, and that becomes version one. So if you look at the metadata path, _delta_log, you'll see a file corresponding to each version of the table that says: these are the files I'm tracking.

One thing to point out is that we do not eagerly remove the files that are no longer referenced. For example, after the second operation, version one no longer refers to parquet files one and two. Because we keep those files around, we get some nice side effects: I can do time travel. I can go and see how my data looked at version zero and compare it. It allows me to do data quality checks. If I'm training a model, I can say this model was trained against this version of this table. It lets you look at the history of all of your data.

One of the things we said is that the promise of the lakehouse is consistency and reliability, and we need consistent reads on the data lake. Imagine a scenario where I'm writing to a table and somebody else is reading it at the same time. With traditional data lakes, that would often result in inconsistent states. With Delta Lake, if an update is in progress while I'm reading, I will read either the first version, version zero, or the second version, version one, and nothing in between. You won't end up reading some files from one version and some files from another.

But what about multiple writers? Say I have two writers writing to the same table at the same time. To provide ACID transactions, Delta Lake has a protocol to figure out how to order commits, what traditional databases call serializability. Let's try an example. User one and user two both read version zero of the table. User one appends a bunch of data, and user two appends a bunch of data. User one commits first and bumps the table to version one. When user two then tries to commit version one as well, mutual exclusion on the commit file tells us there's a conflict: you cannot overwrite it, so that commit gets rejected. But we handle conflicts optimistically. What do we mean by that? We check whether the operation can be reconciled automatically. In this case these are two appends, and they can be reconciled because you're not changing the data somebody else added. So without any data reprocessing, user two's changes are committed as version two.
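To see those per-version commit files concretely, here is a minimal sketch, again assuming the deltalake Python package and an illustrative path; each commit is a JSON-lines file of add and remove actions:

    import json
    import pandas as pd
    from deltalake import write_deltalake

    path = "/tmp/versions_demo"  # illustrative path
    write_deltalake(path, pd.DataFrame({"id": [1, 2]}))                    # version 0: adds files
    write_deltalake(path, pd.DataFrame({"id": [3, 4]}), mode="overwrite")  # version 1: removes old files, adds new ones

    for commit in ("00000000000000000000.json", "00000000000000000001.json"):
        with open(f"{path}/_delta_log/{commit}") as f:
            actions = [json.loads(line) for line in f]
        adds = [a["add"]["path"] for a in actions if "add" in a]
        removes = [a["remove"]["path"] for a in actions if "remove" in a]
        print(commit, "adds:", adds, "removes:", removes)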
Now, there are scenarios where this doesn't happen automatically. Say, for example, user one removes a file that user two is trying to update. That cannot be resolved automatically, so we throw an error, and at that point the application logic can catch it and retry, which does result in reprocessing some data.

So we've seen how a simple design, my data plus my metadata in a transaction log (a write-ahead log, to simplify it), gives you functionality like ACID transactions on top of a data lake. But it also gives you a lot of other functionality. For example, when we store a file, we can collect statistics for the different columns in that file. Say I'm querying a Delta Lake table and asking for all records after May 15th. The statistics in the metadata might tell me that for a particular file, the maximum value of that date column is before May 15th. So by looking at the metadata alone, the system knows it doesn't need to scan that file at all to serve a query with that filter. This gives you a lot of advantages in terms of file skipping, or data skipping. And that's only one of the properties. You also get reliability: schema enforcement, and simple commands like delete and update, the traditional data warehouse semantics, which simplify your pipelines. And all of this is open source, with a fully open protocol.

We've seen this battle-tested: exabytes of data processed every day, thousands of companies using it in production, and exponential growth for the format. If you look at the last couple of years, a lot of that can be attributed to the community coming along and adding support for different engines, frameworks, and libraries. If you look at the GitHub page for Delta, there are multiple projects. One that I'm really excited about is Delta Sharing: if I have a lakehouse and I need to share my data with any compute engine, in an open way and an open format, Delta Sharing allows me to do that. It's an open protocol, and there's an open implementation of it. There are several others; that's the one that's close to my heart. Again, just to paint the picture that the ecosystem is expanding. Gina will walk you through how to get started in the tutorial. We even have companies, for example, reading from Kafka and writing out to Delta using kafka-delta-ingest, written completely in Rust. There's a lot of momentum in that space, and we're really excited about it. And we couldn't do it without the community, so a huge shout-out and thank you to all the contributors. If you'd like to get engaged and be part of the community, these are the different ways of doing that: go to go.delta.io or delta.io and you'll find ways to get connected and start working with it. Over to you, Gina, for the tutorial.

Thank you, Sajith. All right. For this next part, the tutorial, we want to walk you through the interfaces that are currently available for Delta. If you want to follow along, you can.
The first requirement is that you have Docker installed and running on your laptop. If you follow the docs here, they link to how to get Docker and how to install it. The most important part after that is having the demo image that we're going to be using for the tutorial today. There are different ways of getting that image. The first is building it from scratch, and this is how you would do it. Or you can just download the pre-built image from Docker Hub, and that's going to be a lot faster. So, up to you folks.

One other thing that I wanted to show you: we're going to be working with Docker, and you might be wondering what is available in that container. Here you'll see the Dockerfile. You'll see the versions of the different packages we're going to be installing, and the different modifications we had to make to permissions; we're installing Rust, et cetera. I can zoom in, yes, of course I can zoom in, sorry about that. No one brought binoculars today for reading code? No? Okay. The full Dockerfile is available on the Delta docs repo, so take a read if you're interested.

Now we're going to move on to the console. Can everybody read the text on the terminal? Yeah? Awesome. For the first part of the tutorial, we're going to be talking about the different options that are available using the Rust implementation, and the first ones we're going to dive into are the Python bindings on top of delta-rs, the Rust implementation. So let's load them up. We go to Python and import a few libraries that we need. Now we create a very simple sample data frame of just five items with pandas, and we write that data frame out to this path. So we call the write_deltalake function and pass it the fully qualified path and the data frame. Let's take a look at what we wrote there: if we load the table and look at it as pandas, there are the five items we just added to our table.

But we can also append data to our table. If we have a new data set that we want to add, we can append it by just specifying the mode on the write function. Here you see we call the same function and we just specify that the mode is append; if we didn't, the write wouldn't append to the existing table. So if we load the table again and take a look, now we have all 10 items.

Now let's take a peek under the hood of this table. The first thing we can check is the files. Can you see the screen down there? My prompt is very low, so maybe I'll just use the top half of the screen. Is that better? Okay. These are the two parquet files that compose the Delta table. We can also take a look at the history of the table that Sajith was talking about before. When we print out the history, the first item in this array is the first version of the table, and we can see that this was when we created the table.
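Putting those steps together, here is a minimal, self-contained sketch of what was just shown, assuming the deltalake Python package; the path is illustrative:

    import pandas as pd
    from deltalake import DeltaTable, write_deltalake

    path = "/tmp/deltars_table"  # illustrative path

    # Create the table with five rows, then append five more.
    write_deltalake(path, pd.DataFrame({"id": range(0, 5)}))
    write_deltalake(path, pd.DataFrame({"id": range(5, 10)}), mode="append")

    dt = DeltaTable(path)
    print(dt.to_pandas())  # all ten rows
    print(dt.files())      # the parquet files backing the current version
    print(dt.history())    # one entry per version of the table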
Just because I've looked at this a lot, I know where that first item ends. And this is our second record, our second version of the table, and it's where we appended that second set of data. If we want to go back in history and load the original version of this table, we can simply ask for that version, and we're back to the original version of our table.

The other nice thing, another quick trick to have up your sleeve, is knowing whether you're working with a Delta table or not. The easiest way is to look for the _delta_log folder. So, let me exit out of here: if we list the directory of our table, you'll see the two parquet files we saw earlier when we looked at the files of the table, and you'll also see this _delta_log folder, and that's the Delta log we were talking about before.

Okay, that was a brief view into the Python bindings. Now, how about we go directly to the Rust API? We've prepared two examples for you. In the first one, we look at how to read the metadata of a Delta table with Rust. If you look at this function, we're going to be looking at this COVID-19 table that is included as part of the Rust examples and is located here. We load our table, print the metadata of the table, print the files of the table, and that's it. So let's run that. Something I should mention: all of these are part of the examples listed on the Delta docs page, and the very first time you run one, you'll notice it takes a little bit of time because cargo has to build the crate and so on. To expedite the demo, I ran that ahead of time so we could see this. In the metadata of the table, the first thing you'll see is the fully qualified path at the top. Then, under metadata, you'll see the version and important aspects of the table, for example the partition columns. If we had added a description to the table, you'd see it there under description, and you'll see how many files the table currently has. And down here we list the actual file names.

So that was looking at the metadata of the table. How about we query the table now? We have another example for that, and this one uses DataFusion to query our table. If we look at the code, you'll see that we're going to be using the session context from DataFusion. We load the same table we were talking about before, create a new session context, load our Delta table based on the path we pass in, register the table into that session context, give it the name we'll refer to it by, and pass in the table object we just created. Then we submit plain old SQL against that context, which I find very nice. So let's see what this looks like when we run it. The first part of the printed output is the schema of the data set we're returning, and then we print out the columns of the data set one after another.
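For comparison, the Python bindings expose much of the same information as that Rust metadata example; here is a small sketch, reusing the illustrative table path from earlier:

    from deltalake import DeltaTable

    path = "/tmp/deltars_table"  # illustrative path from earlier

    dt = DeltaTable(path)
    print(dt.version())   # current version number
    meta = dt.metadata()  # table id, name, description, partition columns, ...
    print(meta.partition_columns, meta.description)
    print(dt.schema())    # the table schema
    print(dt.files())     # the files backing this version

    # Time travel: load the table as of an earlier version.
    dt_v0 = DeltaTable(path, version=0)
    print(dt_v0.to_pandas())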
Of course, when you're actually using that Rust example, you're not going to print the output like that; you'd use it in your analysis or your pipeline, so you wouldn't see it in this format. We're just printing it for the demo. Okay, so that was a brief intro to the Rust API.

Another really cool thing we can do on top of the Rust implementation is build an API in front of your Delta tables. Let's say you have some tables in production that you want to make available for other systems to query, and generally available to your organization. You can build an API for that, it's that simple to do, and it's very fast. So let's see how we would do that. If we look at this, what we're doing is publishing our API on localhost port 8080 and then specifying which tables we want to publish. The first one is the delta-rs table we created in our first example, and we give it the full path and the format. We also expose the second table, the one we saw in the Rust example, again with the full path and the format. We run that, and now we go to a second terminal; I'll clear it and connect into the same container.

Now we want to take a look at the schema of the tables available in the API. Can you see that text? It's a bit smaller than the other one, so I'll make it a little bigger. All right. To see the schema of the tables, you call the schema endpoint, and once we do that you'll see that we have the delta-rs table with its fields, the full schema, and then the second table, the COVID-19 table, up here. But how do you actually query a table via this API? Well, if you're just testing, you can use curl and submit SQL with it: you post your SQL query against the SQL endpoint of the API, and we see here that it returns all the data specified by the query. But this is a very simple table, so how about we look at a more interesting one, the second table. We can query it with a select and set a limit. Why can't we see the limit? It was cut off, sorry about that. Okay, here we go: you see that we're only selecting the top five rows from our table, and here are all the columns we selected: cases, county, date. And you can also send more complex queries; for example, in this other case we do an aggregation, selecting the total number of cases per date, and that runs as well.

Okay, so that was a trip around Rust land and the interfaces available with that implementation. Now let's take a look at the Spark interface for Delta. The first thing I'm going to do is start an interactive shell; actually, first I'll clear my screen, and I'll start the shell on both terminals because I know I'm going to need them both. We create a sample data frame, again with just five items, and we write those five items out to a Delta table. We specify the data frame, write it in Delta format, and give it the path. Then, when we want to read the table, we specify that it's a Delta table and give it the path.
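As a minimal sketch of those Spark steps, assuming a Spark session that already has the Delta Lake package and extensions configured (the tutorial's Docker image takes care of that); the path is illustrative:

    from pyspark.sql import SparkSession

    # Assumes Delta Lake is already on the classpath and the Delta SQL extensions
    # and catalog are configured for this session.
    spark = SparkSession.builder.getOrCreate()

    path = "/tmp/spark_delta_table"  # illustrative path

    # Write five rows as a Delta table, then read them back ordered by id.
    spark.range(0, 5).write.format("delta").save(path)
    spark.read.format("delta").load(path).orderBy("id").show()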
When we read it back, we're ordering by ID to keep it nicer when we display it. So if we show it, you see that we now have the five items we just inserted into the table.

Now, let's say we made a mistake and we want to overwrite our whole table. We have a new set of five numbers that we want to write, and we essentially call the same function; we just specify the mode as overwrite. But in reality, you likely won't be overwriting your entire table. What you probably want to do is conditionally update certain rows, so how about we do that? In this example, I'm going to update all the rows that are even and add a hundred to them. Actually, I'm sorry, I didn't show you the table after the overwrite, so how about we take a look at that first? My bad. When we overwrote our table, these are the values that are currently sitting in it, so these are the numbers we're going to be looking at. So if we want to update only the even numbers in this table and add a hundred to them, that's the syntax. And then, if we take a look at our Delta table, we see that we added 100 to both even numbers.

Another function that is very useful is conditionally deleting certain rows, which comes in particularly handy for CCPA or GDPR compliance, where you need to be able to conditionally delete some rows. If we want to delete all the even numbers from the table, this is how you would do it, and if we take a look at the table again, you'll see that only odd numbers are left.

Another thing we might want to do is merge new data into our table. Say we have a new data set that we want to merge in, and we want to make sure our table doesn't end up with any duplicates. If we write the merge logic like this, it will only insert values that don't already exist in our original table. So we execute our merge, and when we look at our table, we see that we have all 20 elements, and there are no duplicates of the ones we already had in the table.

Okay, so we've made a few modifications to the table: we've added, deleted, updated, et cetera. Let's take a look at the history of the table. There's a lot to take in on this screen, and it's a bit hard to read, so let me make it smaller. The main thing I want to show you is this first column: the version number. And these are the operations that happened: the original version, version zero, is when we created the table; this is when we overwrote the table; and these are our conditional update, our conditional delete, and our merge statement. So if we wanted to go back in time, back to when we overwrote the data in that table, we can read the history and then specify the version that we want to go to. For traveling back in time you can use either a version or a timestamp; I'm just using the version number. And you can see here that we went back to when we overwrote the data.
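Here is a rough, self-contained sketch of those operations, assuming the delta-spark Python package is installed and configured; the path and values are illustrative:

    from delta.tables import DeltaTable
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import expr

    spark = SparkSession.builder.getOrCreate()
    path = "/tmp/spark_delta_table"  # illustrative path from the previous sketch

    dt = DeltaTable.forPath(spark, path)

    # Conditional update: add 100 to the even ids.
    dt.update(condition=expr("id % 2 == 0"), set={"id": expr("id + 100")})

    # Conditional delete: remove the even ids (handy for GDPR/CCPA-style deletes).
    dt.delete(condition=expr("id % 2 == 0"))

    # Merge: insert rows from the new data set only when the id is not already present.
    new_data = spark.range(0, 20)
    (dt.alias("old")
       .merge(new_data.alias("new"), "old.id = new.id")
       .whenNotMatchedInsertAll()
       .execute())

    dt.history().show()  # one row per version, with the operation that produced it

    # Time travel: read the table as of an earlier version (or use timestampAsOf).
    spark.read.format("delta").option("versionAsOf", 1).load(path).show()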
And the last example we want to walk you through today: in many cases you have production systems in your organization that are streaming data, and you want to bring that data into your Delta Lake in a streaming fashion. So let's see how we would do that. Let's generate a sample stream of monotonically increasing numbers and write those numbers into a Delta table. Here we have our stream, and we're going to do some modifications, scrubbing the columns a little bit. Then we write it as a stream into this Delta table, and we specify a checkpoint. You always want to specify a checkpoint, because it guarantees resiliency in your pipelines: if something breaks and you have to fix the pipeline, you don't want to start over from the beginning; you want it to pick up from the last known success. So this is how you would do it, and we're going to save it to a temp path for the streamed Delta table. We kick off our stream, come over to this other tab, and read that stream; let me sort it so it's a little easier to read. You'll see that when we read the data right now, the top ID we've consumed is number 20. If we run it again, the top ID is 32, and so on and so forth, because this is a stream.

And with that, that's all for today. We have a few minutes, so we can take any questions, for either me or Sajith.

Can you say that again? Yes, the question is about the transaction log: if it gets too long, is there some kind of compaction? Yes, we automatically compact it; we call it checkpointing. I was oversimplifying when I said it's just JSON; in reality we dynamically decide how to checkpoint the log into a Parquet file, and that already happens for you. And when does that happen? Initially it was a regular cadence, something like every 10 commits, but now it's a little more dynamic, in the sense that we figure out the best way to do it. Any other questions? All right then, thank you all for coming. It's been a pleasure.
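For reference, a minimal sketch of that streaming write, assuming the same Spark setup as above; the built-in rate source just generates monotonically increasing values, and the paths are illustrative:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    table_path = "/tmp/delta_stream_table"            # illustrative paths
    checkpoint_path = "/tmp/delta_stream_checkpoint"

    # The "rate" source emits a timestamp plus a monotonically increasing value.
    stream = (spark.readStream.format("rate").load()
                   .withColumnRenamed("value", "id")
                   .select("id"))

    # Write the stream to a Delta table. The checkpoint lets a restarted query pick up
    # from the last successful batch instead of reprocessing everything.
    query = (stream.writeStream
                   .format("delta")
                   .option("checkpointLocation", checkpoint_path)
                   .start(table_path))

    # In the other terminal / a separate session, watch the table grow:
    # spark.read.format("delta").load(table_path).orderBy("id", ascending=False).show(5)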