All right, let's get started with the second portion, talking about Apache Hudi and how it plays a role in a data lakehouse. I'm Nadine. I'm an Apache Hudi contributor, I'm coming in from Onehouse, and I'm working with Wen today on showing you what you can do with the lakehouse. So let's go ahead and get started.

Let's talk about the origins of Apache Hudi. Hudi started at Uber in 2016 to address inefficiencies across ingestion and ETL pipelines during Uber's hypergrowth stage. At that time, Hudi was dealing with petabytes of data. If we walk through an example of an Uber ride, there's a life cycle to it: a rider starts a ride, the trip happens, and then the rider ends the ride, so there are multiple updates along the way. Before Hudi, large Spark jobs were used to periodically rewrite entire datasets in HDFS to absorb upstream online inserts, updates, and deletes reflecting changes in the trip status. And remember, HDFS is an immutable store, right? So this process was not only inefficient but also very hard to scale.

So what does Hudi bring to the data lake? We touched on this a little earlier, but one of the main things it brings is transactional guarantees, and that's really important when you're working with big data. Another thing Hudi prides itself on, and how it was natively built from the get-go, is working effectively with streaming sources, handling bursty writes, late-arriving data, and so on. That was literally the inspiration behind Hudi, and we'll showcase it a bit more when we get into the connectors.

If we look at this high-level architecture diagram, Hudi can ingest data from practically any store, whether that's an OLTP database, other data lakes or data warehouses, or streaming sources. Once the data is ingested, Hudi manages your Hudi tables within the data lake (today we're going to be working with S3) and runs specific services to ingest and manage the data. For example, when you clean old data, or you want to perform clustering to keep up both your ingestion and your query performance, Hudi can do that within its platform. From there, you can connect any query engine of your choice that Hudi supports, get at the data, and perform analytics or build whatever data application you want. With Hudi you can build near-real-time applications, for example a personalization app or a customer-360 app, and so on.

We talked a lot about the different table formats, and it's worth mentioning that Hudi is not just a table format; it's also a platform in itself, with many services that give you write and read optimizations. Hudi has a set of table services for clustering, cleaning, and so on. For clustering, for example, you can schedule the clustering service, create the clustering plan, and then execute it, or let it run inline with your writes, as in the sketch below. We won't explore clustering in depth today, but these are examples of what Hudi brings. There was also a question about indexing, so let me touch on that next.
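To make the table-services idea a bit more concrete, here is a minimal PySpark sketch of enabling inline clustering when writing a Hudi table. The table name, S3 path, columns, and DataFrame are hypothetical, and the clustering option names should be checked against the Hudi version you're running; the workshop notebooks don't use clustering, so treat this purely as an illustration.

```python
# Assumes a SparkSession (spark) launched with the Hudi Spark bundle, as in the workshop notebook,
# and a DataFrame `trips_df` with trip_id / city / updated_at columns (hypothetical example data).
hudi_options = {
    "hoodie.table.name": "trips",
    "hoodie.datasource.write.recordkey.field": "trip_id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.operation": "upsert",
    # Table service: schedule and execute clustering inline, every 4 commits.
    "hoodie.clustering.inline": "true",
    "hoodie.clustering.inline.max.commits": "4",
    # Sort data files by this column so queries filtering on it touch fewer files.
    "hoodie.clustering.plan.strategy.sort.columns": "city",
}

(trips_df.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://my-bucket/lake/trips"))
```

Hudi can also schedule the clustering plan and execute it asynchronously as a separate job; inline is just the simplest way to see the schedule-plan-execute flow in one write.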
Hudi has indexes that support write optimizations as well as read optimizations, and there's a whole suite of indexes we can get into if you'd like. Using Hudi, you can perform record-level inserts, updates, and deletes on S3, which lets you comply with data privacy laws, for example; you can consume real-time streams; and you can work with change data capture (CDC). You can reinstate late-arriving data and track history and rollbacks. Hudi has the concept of a timeline, and on the timeline you can see everything that is happening to your table. You create datasets and tables, and Hudi manages the underlying format. Hudi uses Apache Parquet, which Wen touched on, and also Apache Avro for data storage, and it has built-in integrations with Spark, Hive, and Presto, so you can query any Hudi dataset using tools you're familiar with and love to use today.

In this workshop we're going to cover consistent snapshots, point-in-time travel, and incremental changes. We'll go over the table storage types Hudi offers, which are copy-on-write and merge-on-read, and one of the things we'll cover is how to do incremental updates on Hudi and how to query those updates to see what changed. If you're interested in learning more and want to get involved with the Apache Hudi community, you can follow us on Slack (I'm usually pretty active there), and there are blogs on the Hudi site as well as getting-started videos and much more.

So that's a quick overview of Hudi, and from here we can get started on the actual lab portion. Let me switch over to my other deck really quick. Okay, let's do this. Before we get into the notebooks I want to cover some concepts, and then we'll get into the lab itself. Let me put this on presenter view. We're going to talk about the different storage types Hudi has, walk through a bit of the workshop, and then look toward the future with the integration between Presto and Hudi.

This is a high-level architecture diagram of what we're going to go over. The dataset today is an e-commerce example where people are checking out their cart, viewing items, adding items to their cart, things like that. So we have a lot of upserts, plus people removing items from their cart, so deletes are happening too, and it all lands in the S3 data lake. On the Hudi side, we're going to write that data to a Hudi table, and from there we're going to use Presto to query it, with the Glue Hive metastore, and that's how we'll query the S3 data. In the first part, in the Jupyter notebook, we're actually going to query through Spark, and at the end we'll show you how to query through the Presto CLI, which is what Wen showed you when you SSH'd into the Presto EC2 instance. A minimal sketch of what those record-level writes look like appears just below. Okay, a couple of things I want to cover first are the different storage types that are available in Hudi.
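Before we dive into the storage types, here is a minimal PySpark sketch of the kind of record-level upsert and delete the notebook performs against S3. The table name, path, and columns are made up for illustration; the write options are standard Hudi Spark datasource configs, though you should double-check them against the Hudi version bundled in your environment.

```python
# Assumes a SparkSession (spark) launched with the Hudi Spark bundle, as in the workshop notebook.
base_path = "s3://my-bucket/lake/cart_events"   # hypothetical table location

hudi_options = {
    "hoodie.table.name": "cart_events",
    "hoodie.datasource.write.recordkey.field": "event_id",
    "hoodie.datasource.write.precombine.field": "event_ts",
    "hoodie.datasource.write.partitionpath.field": "event_date",
    "hoodie.datasource.write.operation": "upsert",
}

# Some illustrative cart events: each row is identified by event_id and deduplicated by event_ts.
updates_df = spark.createDataFrame(
    [("e1", "add_to_cart", "2023-05-01 10:00:00", "2023-05-01"),
     ("e2", "remove_from_cart", "2023-05-01 10:05:00", "2023-05-01")],
    ["event_id", "event_type", "event_ts", "event_date"],
)

# Record-level upsert: new keys are inserted, existing keys are updated in place.
(updates_df.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save(base_path))

# Record-level delete (e.g. a data-privacy erasure request): same keys, operation switched to delete.
deletes_df = updates_df.where("event_id = 'e2'")
(deletes_df.write.format("hudi")
    .options(**hudi_options)
    .option("hoodie.datasource.write.operation", "delete")
    .mode("append")
    .save(base_path))
```

Every one of these writes shows up as a commit on the Hudi timeline, which is what the snapshot and incremental queries discussed next are built on.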
Hudi has what we call a copy-on-write storage type, which, if you've worked with a data warehouse, behaves in essentially the same way, and the other storage type Hudi offers is what we call merge-on-read, or MOR; we'll talk a little bit about both. Copy-on-write storage is a good fit if you can accept a somewhat higher write amplification in exchange for very low read amplification, and it's operationally less complex on the Hudi side. What I mean is, if you're working with batch datasets, say you're ingesting data every 15 minutes or every hour, you'd probably want a copy-on-write table. A copy-on-write table stores data in Apache Parquet. Every time you ingest data, a merge operation happens where the new data is combined with the data that already exists in the Parquet file, and a new version of that Parquet file is created. Because of this merge operation it has a slightly higher write amplification, but on the query side it has lower read amplification because you can read the Parquet files directly. That's what copy-on-write is; we'll talk about the difference between a copy-on-write and a merge-on-read table later on.

Now let's talk about what this slide is showing. With Hudi, there are two different types of queries you can run on a copy-on-write (COW) table: a snapshot query and an incremental query. A snapshot query is basically a current view of your whole table as it exists at that moment. An incremental query shows you just the updates that happened between, say, commit 2 and commit 3, or commit 4 and commit 5, or commit 2 and commit 5. It doesn't give you the whole snapshot; it gives you just the changes that happened while you were ingesting data.

If we look at the blue boxes, we have an insertion of A, B, C, D, E at commit time 0, and the file names are file1_t0.parquet, file2_t0.parquet, and file3_t0.parquet. So A, B, C, D, E are being inserted, and those are the Parquet files they land in. If you ran a snapshot query, you'd see exactly that: A, B, C, D, E. And if you ran an incremental query, you'd also see all the blue boxes, because at commit time 0 there haven't been any changes yet; it's just the first insertion of data.

Now if we go to the orange boxes at commit time 1, you see A goes to A' and B doesn't change. So A got updated, and nothing happened to B. In the second file, file2, the C data didn't change, but D went to D', so there was an update on D; nothing happened to E, so there's no new version of that Parquet file. Notice that when there are updates, the Parquet file rolls to a new version: this is where the merge happens, where the new data plus the existing data go through some merge logic and a new version of the file gets created. With E nothing happened, so it's still file3_t0.parquet, because there was no update.
But if you ran a snapshot query at this point, you'd see the current state of the table: A', B, C, D', E, because A and D got updated while the rest stayed the same. And if you just wanted to run an incremental query and see what changed between commit time 0 and commit time 1, it's really just A' and D'. So you can query only the changes you want to see, and this is actually how you build an incremental ETL: you just pick up those changes. If we think about compute efficiency, having to rescan the whole dataset costs extra compute resources, but if you can query only the changes, that's how you save on compute. This is what Hudi offers: incremental updates, incremental queries, things like this. A minimal sketch of both query types follows at the end of this section.

Now if we go to the red boxes, or red-orange if you will, at commit time 2, we have an update where A' goes to A'', B didn't change, C and D' didn't change, E got updated to E', and we have a new insert, F. So if you ran a snapshot query at commit time 2, the current state of the table is A'', B, C (unchanged), D', E', and F, because F is a new insertion. And if you ran an incremental query, you'd get the changes between commit 1 and commit 2: A'', E', and the new insert F.

Yes, and I just want to make a comment to connect some of the thoughts on this great slide. There's a lot to follow here, but the main takeaway is that without Hudi, Presto is just working with files directly: it has no concept of what the changes are and no concept of how to lay out the files so reads are efficient. That's the layer Hudi is really adding here, and the way it does that is with the metadata it knows how to work with, which is the .hoodie directory I showed you inside the data lake. That's the high-level thing I want to make sure you take away. Nadine will go over the other table types, but notice these are all Parquet files; now the engine knows how to interpret these different Parquet files and can do a lot more advanced things because of this format. And these changes, these updates that keep happening on the data lake, are exactly where Hudi provides those transactional guarantees.
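To make the snapshot-versus-incremental distinction concrete, here is a minimal PySpark sketch against the hypothetical cart_events table from the earlier write example. The option names are Hudi's Spark datasource read configs; older Hudi releases spell some of them differently, so treat this as a sketch rather than the exact notebook code.

```python
# Assumes the same SparkSession and table as the write sketch above.
base_path = "s3://my-bucket/lake/cart_events"   # hypothetical table location

# Snapshot query: the latest version of every record, i.e. the current state of the table.
snapshot_df = spark.read.format("hudi").load(base_path)

# Pull the commit times off the Hudi timeline (exposed as the _hoodie_commit_time metadata column).
commits = (snapshot_df.select("_hoodie_commit_time")
    .distinct()
    .orderBy("_hoodie_commit_time")
    .collect())
begin_time = commits[0]["_hoodie_commit_time"]   # e.g. "commit time 0" in the slide's terms

# Incremental query: only the records that changed after begin_time, no full-table rescan.
incremental_df = (spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", begin_time)
    .load(base_path))

incremental_df.show()
```

This is the pattern behind incremental ETL: downstream jobs read only the rows touched since the last commit they processed, instead of rescanning the whole table.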