Thank you. I am super excited to be here today to talk about Delta Lake, and specifically how it can help bring data reliability to your existing data lakes. But before I get into that, I want to talk a little bit about data lakes and why everyone's so excited about them. I've heard lots of stories of people spending millions and millions of dollars on their data lake project, and sometimes it doesn't really work out, and I want to understand why that happens.

First of all, what is the promise of the data lake? As far as I understand it, it's something like this. Your organization has a lot of different types of data. It could be customer data, it could be metrics, it could be console logs, it could be unstructured things like images and video. The idea of a data lake is that rather than spending a bunch of time doing ETL, coming up with a really strict schema, and loading it into a traditional database, you're just going to store it all. A data lake is just a directory in a file system, and you can just put the data there. You don't even have to know whether it's valuable ahead of time. And I think that's actually a good thing, because you often don't know ahead of time what data is going to be valuable, and the ability to just collect it all and figure it out later is actually pretty powerful. The idea is that after we've collected all of this data, we're going to do some really cool things with it. We'll do data science and machine learning, we'll build predictive models, we'll maybe even cure cancer with genomics.

But I have some bad news for you: in most cases, your data is garbage. And the reasons for that are varied. It could be that someone upstream from you decided to change the format of a date without telling you about it. It could be that some machine is dropping records and they're showing up days late now. It could be that there are two different data sets and you need to join them together to be able to understand them. The problem with this is, if you have garbage coming in, you're going to store garbage. And if you put garbage into your algorithms, you're going to end up with garbage out as well.

So I want to talk about a pattern that I've seen play out over and over again in these data lake projects, where people try to build a lot of architecture around the lake to solve some of these shortcomings. Let's start with a pretty typical problem. This is, I bet, something that many of you have built. Your boss comes to you with a problem like this: you have events coming into some system like Kafka, but it could also be Kinesis or S3 or Azure Data Lake, and you want to do two things. You want to do streaming analytics, so you can understand what's happening in your business right now, at this very moment. And you also want to do AI and reporting, where you take a longitudinal view, look at historical data, and build trends and predictive models for the future.

So I'm a little biased, but if I was given this problem, I would start by using Apache Spark. We have great APIs that connect to Kafka, we support event-time aggregation, you can write your code in Scala, Java, Python, or even just pure SQL, and you can do streaming analytics from that. Great, right? We're done.
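To make that concrete, here's roughly what that first streaming job might look like with Spark Structured Streaming. This is just a sketch: the topic name, the event schema, and the console sink standing in for a real dashboard are all made up, and it assumes an existing SparkSession called `spark` with the Kafka connector available.

```python
# Sketch of a streaming-analytics job: read events from Kafka and do a simple
# event-time aggregation. Topic name and schema are hypothetical; assumes an
# existing SparkSession `spark` with the Kafka connector on the classpath.
from pyspark.sql.functions import from_json, col, window
from pyspark.sql.types import StructType, StringType, TimestampType

schema = (StructType()
          .add("user_id", StringType())
          .add("event_type", StringType())
          .add("event_time", TimestampType()))

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "events")
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

counts = (events
          .withWatermark("event_time", "10 minutes")
          .groupBy(window("event_time", "5 minutes"), "event_type")
          .count())

# The console sink is a stand-in for whatever powers the live dashboard.
query = counts.writeStream.outputMode("update").format("console").start()
```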
Unfortunately, this brings us to challenge number one, which is historical queries. Systems like Kafka are great for storing a day, a week, maybe a month of data, but they're not great for storing years and years' worth of data. So I spent a lot of time looking around on the internet, reading blog posts, and I came across this thing called the Lambda Architecture, which, as far as I can tell, means you're just going to do everything twice. We'll have one system for streaming and one system for batch, and Spark has unified APIs for streaming and batch, so it works out okay. Now we can collect all of the historical data in our data lake. That's one extra complication in the architecture, but it allows us to solve both of these problems, and with Spark we can do AI and reporting. All good, right?

Unfortunately, this brings us to challenge number two, which is that, as I said before, your data is messy. Some columns are null and you don't understand why. A pattern I see is that people start writing extra scripts that do validations. This is a whole other set of Spark programs that are just checking the quality of your data and emailing somebody if something goes wrong. And again, because we have the Lambda Architecture, we have to do this both for streaming and for batch, so it's another two things we need to worry about. But we can add this to our architecture, and now we've got validation, so we'll know when our data is messy.

Unfortunately, some of that messy data made it into your data lake, and that's really challenge number three: there are already mistakes and failures corrupting those files. This isn't a traditional database, so if I want to correct them, it's actually very difficult. I have to go and find them and fix them, and I have to be very careful not to crash in the middle, or I might corrupt the data in my data lake. So what do you do to handle this? Well, a pattern I've seen over and over again, and I bet something that many of you in the audience have actually done before, is you use partitioning. Rather than store everything in one gigantic directory, we're going to break it up into days, months, hours, minutes, whatever makes sense for your business. Now when something goes wrong, rather than recompute everything from scratch, I'll just build a job that can blow away one of those directories and recompute one day. So now I can correct those mistakes and failures in my data lake, and we're good to go. But again, this is one more complication we had to add to our architecture in order to enable that.

And then finally, this brings us to challenge number four: updates. Things like GDPR happen, and rather than change one day, you actually need to go through your entire data lake to remove people who want to be forgotten. Or you have problems like change data capture, where you're constantly getting a feed of updates and you want your data lake to reflect what's going on in reality. So again, we're good engineers, we know how to solve these problems. We'll write a whole other set of Spark programs that handle doing updates and merges. You've got to be really careful here, though, because if you modify data while people are querying it, they're going to get the wrong results. So we'll schedule it: we'll only run GDPR at one in the morning, we'll make sure our batch jobs don't start until three in the morning, and we'll hope the GDPR job definitely finishes before then.
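To give you a feel for what those hand-rolled repair jobs look like, here's a rough sketch of recomputing a single day's partition on a plain Parquet data lake. The paths and column names are made up, and it assumes an existing SparkSession called `spark`.

```python
# Hand-rolled repair of one bad day in a plain Parquet data lake.
# Paths and columns are hypothetical; assumes an existing SparkSession `spark`.
bad_day = "2019-10-21"

# Recompute just that day from the raw events.
fixed = (spark.read.json("s3://raw-bucket/events/")
         .where(f"date = '{bad_day}'")
         .dropDuplicates(["event_id"]))

# Blow away and rewrite only that partition. If this job dies halfway through,
# readers see a partially written directory: there is no atomicity here.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
(fixed.write
 .mode("overwrite")
 .partitionBy("date")
 .parquet("s3://lake-bucket/events/"))
```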
But again, this is something we can handle, and we add these update jobs to the architecture. What you'll notice, though, is that this is a very complicated architecture, and I think the theme here is that you're wasting a lot of time and money solving well-known systems problems rather than doing what you really want to do, which is extract value from your data.

I think the things distracting you from getting value out of your data come down to this. You don't have atomicity. Atomicity is the property that when something happens, it either happens completely or not at all, and you don't get that on a distributed file system. When something goes wrong in the middle of your Spark job, you have to worry about: did it already write something out? Do I have partial results out there? Have I duplicated my data? That can make things very difficult, and you have to do a lot of manual cleanup when something goes wrong. There's no quality enforcement. It's just a directory out on a file system, so anybody can drop anything they want in there, and it's very hard to reason about what you're going to get when you query it. And finally, there's no consistency or isolation. As I mentioned before, you have to be very careful not to schedule batch jobs over a table that's being modified, and similarly, it's almost impossible to mix streaming and batch on the same data set.

And so this is why we created Delta Lake. The idea of Delta Lake is to take this relatively complicated architecture, where you're spending all of your time thinking about systems problems, and change it to something that looks more like a simple data flow. The core enabling technology here is the well-known idea of ACID transactions. ACID means atomic, consistent, isolated, and durable. You can focus on your data flow rather than worrying about failures: if something goes wrong, it automatically rolls back completely and you don't have to do any extra cleanup, and multiple people can modify the same data at the same time and it will be as though they did it one at a time.

And of course, if I was going to be collecting petabytes of data that was very valuable to my organization, I wouldn't want to lock it into some vendor-specific, black-box, proprietary format. That's why Delta Lake is based on open standards and is fully open source. The data is stored in Apache Parquet, and the transaction protocol for Delta is also licensed under the Apache license. A couple of months ago, we announced that we've donated the project to the Linux Foundation, which is now the permanent vendor-neutral home for Delta. And we have a growing community, including Spark and Presto and others, where we're trying to make it possible for everybody to read from the Delta Lake. That said, today Delta is deeply powered by Apache Spark, and if you have existing Spark programs, it's almost trivial to convert them; I'll show you how in a couple of slides. Delta plugs into both the streaming and batch APIs of Spark, so your existing programs can run with confidence.
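As a rough illustration of what that buys you, here's a sketch of one job streaming into a Delta table while another reads a consistent snapshot of the same table. The path and the `events` streaming DataFrame are made up, and it assumes an existing SparkSession `spark` started with the Delta package.

```python
# Sketch: a streaming writer and a batch reader sharing one Delta table, with
# the transaction log keeping both consistent. The path and the `events`
# streaming DataFrame are hypothetical; assumes a SparkSession `spark`
# with the Delta package installed.
table_path = "s3://lake-bucket/delta/events"

# Streaming job appending to the table, one transactional micro-batch at a time.
(events.writeStream
 .format("delta")
 .option("checkpointLocation", table_path + "/_checkpoints/ingest")
 .outputMode("append")
 .start(table_path))

# Meanwhile, a batch job reads a consistent snapshot of the same table.
spark.read.format("delta").load(table_path).groupBy("event_type").count().show()
```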
So I want to simplify this picture now and talk about some of the patterns that emerge once you stop thinking about systems problems and start thinking about data quality. I'm going to talk about different data quality levels. Now, these are not specific features of Delta. There is nothing that says you have to have a bronze, silver, and gold table, or that you have to have all three of them, but I think they're really useful vocabulary that you can use in your organization to talk about quality. The idea is that rather than going directly from raw data to the finished product, you incrementally improve the quality of your data, and I actually think that's valuable, and I'll talk about why.

So starting at the beginning, we have our bronze, raw data. This is just whatever is coming out of Kafka, Kinesis, or S3; people will have raw JSON stored in a column. And you might say, wait a second, why am I storing raw data? Why am I not starting by doing ETL? One of the reasons is that there are no bugs in a parser you never wrote. If you keep the raw data, you can always go back to it and reprocess from the beginning of time. And Delta is designed to store years and years' worth of data. One of our customers was storing data in another database and was only able to keep two weeks; with Delta, they're now keeping three years' worth of data.

After you've collected all of this data in bronze, the next step is to move it to silver. In silver, you start to do some of the cleaning. You maybe filter out records you aren't interested in, you take that JSON and parse it into top-level columns, you do joins to augment the data so you have extra information that's useful in later processing. And you might say, wait a second, why am I materializing this intermediate data if it's not my final answer? There are actually a couple of reasons. The most obvious one is that this clean data may be useful to multiple people in your organization, and by creating the silver table, multiple people can build their final answers on it. Maybe you've created a really interesting, featurized data set and multiple people can train their models on it. But something that actually surprised me, which a lot of users said was very valuable, is that silver tables are also useful for debugging. When something goes wrong in your final analysis, you have this table that you can query with the full power of SQL. You can ask questions like how many nulls exist in this column, or how many distinct values are there, and things will jump out at you and you'll understand where the quality problems are coming from.

And then finally, we move on to gold. These are high-level business aggregates, machine learning models, things that actually solve a business problem that means something to somebody in your company. You can share these and do streaming analytics or AI and reporting on them, and you can read them with Spark or Presto and a growing community of tools.

Another pattern is that people often use streaming to move their data through their Delta Lake. And you might say, wait a second, I don't need streaming, I don't have low-latency requirements, I'm fine with batch jobs. I actually think that's the wrong way to look at it. Streaming can do low latency, and that's one thing it's very powerful for, but what streaming is really about is incremental computation on an ever-growing set of data. Pretty much everybody has a set of interesting queries they want to run, and pretty much everybody's data is constantly changing. The kinds of problems that streaming solves are the same problems you have to solve today by hand when you're running a batch job.
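To make that bronze-to-silver hop concrete, here's a sketch of it as a stream: read the raw bronze Delta table, parse the JSON payload into columns, and write a cleaned silver table. The paths, the payload schema, and the column names are all made up.

```python
# Sketch of the bronze -> silver hop as a stream. Paths, schema, and column
# names are hypothetical; assumes a SparkSession `spark` with Delta installed.
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StringType, TimestampType

payload_schema = (StructType()
                  .add("user_id", StringType())
                  .add("page", StringType())
                  .add("ts", TimestampType()))

silver = (spark.readStream
          .format("delta")
          .load("s3://lake/bronze/events")                  # raw JSON in a column
          .select(from_json(col("raw_json"), payload_schema).alias("j"))
          .select("j.*")
          .where(col("user_id").isNotNull()))               # drop records we can't use

(silver.writeStream
 .format("delta")
 .option("checkpointLocation", "s3://lake/silver/events/_checkpoint")
 .outputMode("append")
 .start("s3://lake/silver/events"))
```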
Run that same kind of job by hand as a batch, and you need to ask: what data is new and what's already been processed? How do I take that new data, process it exactly once with no duplicates and no dropped data, and commit it downstream transactionally? How do I checkpoint where I am so I can recover if the system crashes? These are all handled by the streaming engine.

And in Apache Spark, we have these really nice knobs you can tune to trade off cost for latency, depending on your particular application. If you have a hyper-low-latency application where you care about milliseconds, you can use continuous processing mode. We hold on to a core and continually stream data through it; there's no batching, so it runs as fast as possible. But it's very expensive: only your stream can use that core and nobody else can. In the middle, we have micro-batching. You can run many streams on a single cluster, and each stream runs in tiny increments, multiplexing the resources of the cluster. Here you get seconds-to-minutes latency, but you can run lots of streams on the same hardware. And at the far end, many people have data sets that only change once a day, once a week, or once a month. In these cases, you can use trigger once. With trigger once, the stream starts up, processes everything that's there, and then shuts down. If you're running in the cloud, this is great because you can take advantage of elasticity: when your stream is done, shut that cluster off and stop paying for it. So you can save orders of magnitude in processing costs by taking advantage of this if you don't have strict latency requirements. What you'll really see is that streaming lets you focus only on data flow rather than all of this control flow of processing, and it can dramatically simplify the work.

However, batch jobs are also very important, and Delta has full support for batch operations as well. When you want to do GDPR once every 30 days to remove all of the people who have asked to be forgotten, you can do that with an update or a delete. If you want to do change data capture once a month, we have the MERGE INTO command, where you can do upserts. If you want to delete data that's more than three months old every month, that's also supported. We support all of the standard DML from a traditional database: update, delete, and merge. And you can also do things like atomically overwriting multiple partitions. So pretty much any API in Spark can be used with Delta.
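Here's a sketch of what that batch side might look like: a GDPR delete and a change-data-capture merge on a Delta table, run as ordinary jobs. The paths, the table layout, the list of user IDs, and the `updates` DataFrame are all hypothetical.

```python
# Sketch of batch DML on a Delta table: a GDPR delete and a CDC merge.
# Paths, columns, ids, and the `updates` DataFrame are hypothetical;
# assumes a SparkSession `spark` with the Delta package installed.
from delta.tables import DeltaTable
from pyspark.sql.functions import col

customers = DeltaTable.forPath(spark, "s3://lake/silver/customers")

# GDPR: remove users who asked to be forgotten.
customers.delete(col("user_id").isin("u-123", "u-456"))

# Change data capture: upsert a feed of changes so the table reflects reality.
(customers.alias("t")
 .merge(updates.alias("s"), "t.user_id = s.user_id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())
```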
There's one final pattern I want to talk about, which is reprocessing. If you follow these patterns, keeping all of your raw data in bronze and using streaming to do your processing, it becomes very easy to handle the cases where you made a mistake, or where you come up with something new that you want to compute from the beginning of time. The reason is that when you start a stream with a fresh checkpoint, it computes the same answer as a batch job over the same data set. The way streams work is that when they first start, they take a snapshot of the Delta table at the moment the stream begins. They break that snapshot into a bunch of individual pieces and process them incrementally. When they get to the end of the snapshot, they start tailing the transaction log and only process new data that has arrived since the stream began. And they can take advantage of the elasticity of the cloud: when you do that initial backfill, start up a thousand-node cluster and get through it quickly. When you get to the incremental processing at the end, scale down and keep ten machines running, so you reduce costs significantly. Having that raw data means you can always go back to it and ask interesting questions as though you had thought of them from the beginning, and as though your code had no bugs in it.

Now that I've talked about what Delta is good at, I want to talk about some actual use cases. This is a relatively recent open source project; we just open sourced it in April. But it's been a product inside of Databricks for a couple of years, so it's actually used by thousands of organizations around the world, and last month alone it processed over two exabytes of data.

I want to talk about one particular use case that I think is really cool. This is from Comcast, which is a large cable provider in the United States. Basically, they have data on every time any Comcast subscriber clicks their remote, and they really want to understand what people's content journey looks like: you're watching ESPN, then you go to the Home Shopping Network, then you go back to ESPN, and they want to understand that journey. So they have this segmentation job where they're actually following people through time. As you can imagine, they have a lot of customers, and this job would max out their Spark cluster. So they did what any good engineer would do with a scaling problem: they hash partitioned it. Instead of one Spark cluster, they had ten. They took all of their customers, took the user ID mod ten, and spread them out across those clusters. And great, now it runs. However, now they have ten clusters to manage, ten schedules to work with, ten sets of logs to comb through, and ten sets of errors to deal with. It's a lot of extra maintenance. When they converted this to Delta, because of Delta's more efficient metadata handling and scheduling, they were able to go back to managing a single Spark cluster, and they were also able to reduce the number of machines they needed by 10x. That's a dramatic savings, not only in engineering time but also in dollars paid to AWS or Azure. So that's pretty cool.

So if you want to figure out how to use Delta Lake, it's actually really simple if you're already using Apache Spark. With the --packages flag, you can add Delta to any existing Spark cluster. If you're building Spark applications locally, you can add the Maven coordinates and it will automatically download Delta and include it in your project. And then in your code, it's trivial: you just take the current format that you're using, whether it's Parquet, JSON, CSV, ORC, whatever, and switch it to Delta. Everything will work the same, except now you'll have the power of ACID transactions and scalable metadata.
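In code, that conversion is essentially a one-word change. The snippet below is a sketch: `df` is a hypothetical DataFrame, the paths are made up, and the package coordinate shown in the comment should be whatever the current Delta release is.

```python
# Launch Spark with the Delta package, for example:
#   spark-submit --packages io.delta:delta-core_2.11:0.4.0 my_job.py
# (use whatever version is current), then change the format string.

# Before: writing plain Parquet. `df` is a hypothetical DataFrame.
df.write.format("parquet").save("/data/events")

# After: the same code, now with ACID transactions and scalable metadata.
df.write.format("delta").save("/data/events_delta")

# Reads change the same way.
spark.read.format("delta").load("/data/events_delta").show()
```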
I want to talk about something that's coming in the future, just to let you know where we're going. This is a new project we're currently prototyping at Databricks, and we hope to open source it in the next three to six months. The motivation is that once people start using Delta, they go from having one stream and one Delta table to 10 streams, 100 streams, 1,000 streams, and it becomes very complicated to reason about what's going on. So the idea of declarative pipelines is that rather than managing each table as its own individual entity, we want to give you a language where you can express your data flow as one holistic graph. Data comes in at the beginning, and you move it through the bronze, silver, and gold quality levels using the same APIs you already know and love.

In this example, I'm creating a new data set called warehouse, and I'm specifying it as a declarative query. For those of you familiar with materialized views, this is materialized views for Spark. The idea is that it depends on some other input, in this case data coming from Kafka, and I'm going to use standard DataFrame transformations on it. We plan to provide these APIs in Java, Scala, Python, and also just pure SQL.

In addition to the query, you can specify a lot of details about how the data set is stored. You can say where you want it to be located, if you have regulatory restrictions. You can specify whether or not you want strict schema checking. Delta actually has two different modes here. You can automatically upgrade the schema, so columns are added automatically any time something new arrives; Delta can handle that, and maybe that's what you want in your bronze tables. When you get to gold, though, maybe you want to be a little bit stricter, and you only want the schema to change when you plan on it, so you use standard DDL, ALTER TABLE ADD COLUMN, to modify it. Delta supports both of these modes. You might also want to register the table in a metastore so it's discoverable by other people in your organization, or add a human-readable description that says where the data is coming from, how it was processed, and what it's used for, so that other people can understand it.

And then finally, my favorite feature here is this thing called expectations. Expectations allow you to take your domain knowledge about what quality means and tell the system about it. In this case, I expect that this table always has a valid timestamp. And what is a valid timestamp? It doesn't just mean that there's a date there. I know that Databricks started in 2012, so if there's a record from, say, 1970, that's probably a parsing mistake, even though it's technically a valid date. You can express this as a predicate: the timestamp needs to be after January 1st, 2012. This sounds very similar to a database invariant, where any transaction that violates the invariant fails, and you can do that, too. If this is a table you're going to report to regulators from, you probably don't ever want any weird data to get into it. But earlier on, in your bronze and silver tables, if you stop everything every time something unexpected happens, you probably won't get anything done. So what expectations do is let you tune how severely an unexpected tuple is handled. I can say, for example, if more than 5% of the records are invalid, send me an email. Or, and this is the part I like best, there's this notion of data quarantining. With a data quarantine, when something unexpected is seen, rather than stopping processing, we let processing continue and just divert that unexpected record, routing it to another table where an engineer can come in at their convenience and understand what went wrong.
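Expectations and quarantining aren't released yet, but you can approximate the idea by hand with plain Spark today: split the records on the predicate you expect to hold, keep the good ones moving, and divert the rest to a quarantine table. The paths, the column names, and the `parsed` DataFrame in this sketch are all hypothetical.

```python
# Hand-rolled approximation of an expectation with a quarantine table.
# Paths, columns, and the `parsed` DataFrame are hypothetical.
from pyspark.sql.functions import col

expectation = col("timestamp") >= "2012-01-01"   # "valid timestamp" for this data set

good = parsed.where(expectation)
bad = parsed.where(~expectation)                 # (null timestamps would need their own check)

good.write.format("delta").mode("append").save("s3://lake/silver/events")
bad.write.format("delta").mode("append").save("s3://lake/quarantine/events")
```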
And we don't just give you the bad tuple. We actually plug deeply into Apache Spark, so we can give you more debugging information. We modify the query plan so we can tell you: this was the record that came in, this was the record that came out, here is the expectation that was violated, and here's the code that produced that bad record. It makes debugging significantly easier. I've actually been using this feature on some of my data sets inside of Databricks, and it's amazing the things you discover about your data when you just write down what you expect and see when those expectations are violated.

So finally, before I close, I want to talk a little bit about how Delta works under the covers. I want to get into the nitty-gritty technical details, because ACID transactions on top of a distributed system like Spark sound almost too good to be true, but it turns out there are a couple of simple tricks we can put together to make this possible.

Starting at the beginning: Delta on disk looks exactly like your data lake today. It's just a directory, but in addition to the data files, we also store a transaction log. Inside the transaction log, we keep track of different versions of the table as it changes through time. So the JSON files you see here are version zero of the table and then version one of the table. Alongside this, we have optional partition directories. I say optional because they're really just there for debugging; the actual information about partitioning is stored in the transaction log itself. This allows us to do really cool things. On S3, for example, if you want maximum performance, you can pick random directory names to spread the files out across S3's metadata tier. We've actually had cases where Delta overloaded S3 and we had to do this. And then finally there are the data files themselves, which are stored in standard Parquet, readable by many of the tools you're already using.

So what goes into each of those table versions? You can really think of it as a change log: we're recording what's different from the previous version. One kind of action in that change log changes the metadata of the table. For example, if I want to add a column, I would record a change-metadata action. If I want to add data to the table, I add a file, and along with that file I can record optional statistics about the data, for example min and max values for all of the columns, which tell you what's contained in that file. That enables really cool tricks like data skipping. And then finally there's remove file, which takes data out of the table. The result is that you can take the transaction log and play it through time, and what you end up with is the current metadata of the table and the set of files that are currently valid. We're using a trick here called multi-version concurrency control: on disk, there are actually multiple versions of the table existing simultaneously, and it's the transaction log that says which files are valid at any particular moment in time.
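Because the log is just line-delimited JSON under the table's `_delta_log` directory, you can poke at it with Spark itself. Here's a rough sketch that replays the adds and removes to list the files making up the latest version. The path is made up, and this ignores Delta's Parquet checkpoints and the possibility of a file being re-added, so treat it as an illustration of the idea rather than a faithful reimplementation.

```python
# Peek at the transaction log with Spark and replay adds/removes.
# Path is hypothetical; assumes the table has both add and remove actions.
log = spark.read.json("s3://lake/delta/events/_delta_log/*.json")

added = log.where("add is not null").selectExpr("add.path AS path")
removed = log.where("remove is not null").selectExpr("remove.path AS path")

# Files that were added and never removed make up the current version.
current_files = added.join(removed, "path", "left_anti")
current_files.show(truncate=False)
```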
So now that we have that, how do we get those nice ACID letters? Let's start at the beginning with atomicity. Say I add two files to the table, and because they're small, I later want to compact them into one bigger file. That compaction commit removes the two small files and adds the bigger one, and it's very important that it happens atomically. If I only do the removes and forget the add, I've lost data. If I only do the add and forget the removes, I've duplicated data. So it has to be atomic.

The trick we use is to lean on the atomic primitives of the underlying file system. On S3, you get atomicity more or less automatically, because a put starts by saying how many bytes to expect, and if it doesn't receive all of those bytes, the put doesn't finish; it automatically fails. On systems like HDFS or Azure Data Lake, we use an atomic rename: we create a temporary file and rename it to its final destination, and the file system guarantees that the rename is all or nothing. So now we have a nice, ordered, atomic history of the table.

Now, for consistency, we all need to agree on the order in which those changes happened, and for that we need a property called mutual exclusion. User one can create version zero, and that's great. User two can create version one, and we're all good. But if both users try to create version two at the same time, it's very important that one of them succeeds and the other gets a message back that says, hey, that version already exists. On S3, that's actually pretty difficult. S3 is specifically not a lock store: it will accept any number of writes to a file and will not tell you who won, and in fact it can sometimes take hours to figure out who won. So there we have a separate service that sits off to the side and mediates which writer wins. On systems like HDFS or Azure Data Lake, it's guaranteed by the file system: that rename operation we talked about before will only succeed if the destination doesn't already exist. And so we get this nice property.

But now you might be saying, wait a second. If the job just fails every time two people modify the table at the same time, am I going to waste a lot of time dealing with that? Fortunately, we have one more trick up our sleeve: we solve conflicts optimistically, using a technique called optimistic concurrency control. The idea is this. Say I have two streams writing into the same table. They both follow this protocol. They start by recording the start version, saying, okay, as of version zero, I am writing into this table. They record what they read from and write to the table; in this case they both read the schema, to make sure the data they're writing matches what's expected. Then they speculatively write out some Parquet files. Those files are on disk, but they're not part of the table until they've been added to the transaction log. They each go and attempt to commit. In this case, user one wins and user two loses. So what user two does is go and look at what happened in version one of the table and ask: does this change anything I care about? In this case, no. The schema has not changed, and the schema is the only thing I read, so it doesn't matter whether my commit is ordered before or after user one's operation. The system automatically retries and reorders the commits, so both writers succeed and neither even realizes that something went wrong. The only time you'll get a conflict in Delta is when two people are actually modifying the same data at the same time, and in that case you do want it to fail, because otherwise you might get the wrong answer. You try again and everything's good.
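Here's a hypothetical sketch of what that optimistic commit loop implies, just to make the protocol concrete. None of these names are Delta's actual internals; `table` stands for any object that can report the latest version, atomically create the next log file, and check whether a competing commit conflicts with what was read and written.

```python
# Hypothetical sketch of an optimistic-concurrency commit loop.
# `table`, its methods, and VersionAlreadyExists are illustrative, not Delta's real API.
class VersionAlreadyExists(Exception):
    """Raised when another writer already created the target log version."""

def commit_with_retry(table, prepare_actions, max_attempts=10):
    for _ in range(max_attempts):
        start_version = table.latest_version()       # 1. record where we started
        actions = prepare_actions(start_version)     # 2. read schema, write new Parquet files, etc.
        try:
            # 3. mutual exclusion: only one writer can create version N + 1
            table.write_log_file(start_version + 1, actions)
            return start_version + 1
        except VersionAlreadyExists:
            # Someone else committed first. If their commit doesn't touch anything
            # we read or wrote, loop and retry on top of it; otherwise it's a real conflict.
            if table.conflicts_with(start_version + 1, actions):
                raise
    raise RuntimeError("gave up after too many commit attempts")
```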
There's one final trick here, which is handling massive metadata. When you have a very large data lake, you might have tens of thousands or even tens of millions of files in it, and anyone who's tried to put thousands of partitions in the Hive metastore knows that this can be a problem. In Delta, we want to support these use cases without you having to do a bunch of extra scaling work, and the trick we use is to start thinking about metadata like data. Fortunately, we already have a big data processing system on hand. If you think of the transaction log as a big bunch of data, we can load it with Spark and write out what we call a checkpoint in Parquet. A checkpoint is a snapshot of the table's state at a moment in time, and you can read that rather than playing the whole log forward. And the really cool thing, since the checkpoint is stored in Parquet, is that you can query it very efficiently with Spark. Say you have a table with 100 million files in it and you want to query just yesterday. We'll do this in two parts: first we use Spark to query the checkpoint and identify only the files that are relevant to your query, and then we go read just those files. In many cases this trick eliminates a significant amount of data from your query. One of our customers, a large Fortune 10 company, wants to do intrusion detection over their network. They capture basically trillions of TCP connections per day into their Delta Lake, and they want to be able to ask questions like: what did these two computers say to each other during this window of time? By using the checkpoint and doing efficient data skipping and filtering, they're able to eliminate 97% of the data in the table before even reading it, and that took their queries from running in hours to running in seconds, which is pretty powerful.

So before I close, I want to talk just a little bit about the future of the project. This is a relatively recent open source project, and today we don't have full parity between what's available in Databricks and what's available in open source, but my commitment is that all APIs need to be open. My team has spent the last three months open sourcing everything as fast as they can, and I think by Spark 3.0 we should have full parity. In the last couple of months, we've added support for S3 and ADLS. We added the DML commands: update, delete, and merge. We added the ability to convert existing Parquet tables to Delta without reprocessing any of the data. We added support for Python. We added history, so you can do audit tracking of how a Delta table has changed over time. And I'm really excited about the 0.5.0 release, which should be coming out in December. It adds improved concurrency, so multiple transactions that don't actually overlap will no longer conflict with each other. It adds the ability to write out manifests in the format Presto expects, so you can query Delta tables with Presto directly. And finally, we're making the merge command even better for operations like deduplication. With Spark 3.0, we're going to plug into the Data Source V2 API, and the reason that's exciting is that it will let you do things like CREATE TABLE and ALTER TABLE and register those tables in the Hive metastore.
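One item from that list in code form: converting an existing Parquet table to Delta in place, without rewriting the data. This is a sketch; the path and partition column are made up, and the exact signature may vary slightly between releases.

```python
# Sketch: convert an existing Parquet table to Delta in place.
# Path and partition column are hypothetical; assumes a SparkSession `spark`
# with the Delta package installed.
from delta.tables import DeltaTable

# For a partitioned table, you also describe the partition columns.
DeltaTable.convertToDelta(spark, "parquet.`s3://lake/events`", "date DATE")

# From here on, the same data is read and written as a Delta table.
spark.read.format("delta").load("s3://lake/events").count()
```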
So I'd like to invite you to check out our website: it's just delta.io. It has a getting started guide, so you can see how to download Delta, how to use it, and examples for all of the cases I talked about here. There are also links to our Slack channel; my team spends all day hanging out in Slack answering technical questions about Delta. So please check it out and join the community. Thank you very much.