So now we're going to put a pause here on the copy on write table. I don't want to overload it, because the first Jupyter notebook we're going to work with is a copy on write table. I just wanted to cover the storage type, what it provides, and the different query types you can run, and now we're going to switch over to the actual hands-on lab.

So if you go back to the lab that Wen pointed out, you're going to see there's an introduction to part two and Hudi. If you go to the workshop ID check-in, there's some logistics we're going to take care of now, because we need to assign IDs to individuals. When you copy the notebook, you'll have your own ID so there aren't concurrent writes happening to the same notebook; that's why we add the ID as a suffix, to prevent concurrent writes. So if you go to this link, your workshop ID for the lab, there's a bunch of numbers here. Self-assign your number. I'm taking 50 because that's what I worked with. Let me make this bigger. Self-assign your ID, because we're going to need it later on in the workshop, in about two minutes. I'll give you a minute or two to self-assign your IDs, and remember them. Don't be troublemakers; I'm presuming everyone has goodwill here. Yeah, first tenet of our community: have goodwill and do goodwill to others.

If you have questions around Hudi or Presto, now is a good time to ask them, because we have some downtime while people are self-assigning their IDs. I don't have Slack open, but let me check Slack. Cool, I don't see any questions yet. Okay, I think everyone is pretty much good to go. I don't know why I put a D there. Okay, cool.

So, the next part of the workshop. Where is my thing? Oh, yeah, that's right, it doesn't create, it opens. Okay, we're going to access the Jupyter notebook. Similar to what Wen did earlier, depending on your birth date, please pick a specific EC2 instance. For me, my birthday is December 12th, so I'll go with number two. The password here is summer sun, so just copy the password and then click on the EC2 instance. Actually, let me open a new tab. There you go. And let me know if I need to make this bigger. Make it bigger. Actually, let me self-assign to 50. I'll start with a new one; I'll use two. We have a lot left, so I'll put this here just in case.

Okay. So if you also go to the workshop page: we're in the Jupyter notebook now, and we're going to show you what to do in this step. The first thing I want you to do: you'll see there's a lab template, the Hudi COW .ipynb. Let me make this a little bigger. I want you to duplicate it. But this looks a little different from what I'm used to seeing. Hold on, can I right-click it and get the duplicate option? Okay, let me go back then. The File menu, yeah, let me try that, and then duplicate it. Oh, there it is. Okay. So we're going to go ahead and duplicate this: you're going to right-click on lab 1, the one with the COW suffix, and click Duplicate. When you duplicate it, you'll get a copy suffix on the name. I want you to right-click again, rename it, and put your ID.
So, delete the copy suffix and put your ID right before the extension. Why is it not letting me re-copy? Let me duplicate again. Rename to 51. Okay, there it goes. All right, so you see I have 51 here. I want everyone to rename their lab notebook with their own suffix. I'll give it a minute or so. Oh, okay, Rohan says make it bigger. Rohan, do you want me to make it even bigger? Is this good? Can everyone see this okay? Okay, cool. Let me make sure I'm on the right notebook. Yeah, I'm on 51.

Okay, so in the first box, we're going to get our environment variables and keys set up. Where it says ID equals 3, or whatever it is, just change it to the ID you self-assigned; we're going to concatenate that ID onto specific parts throughout this workshop. Yes, looks good. Okay, great. After you have your ID, we have our access keys and secret keys. Wen, I believe this is the instance that we're running, right? So this is not your instance, this is the one we're already running. And then we're just getting our arguments. This is all environment variable and key setup, so go ahead and run that. After you run it, you shouldn't see anything.

For the next cell, there's actually nothing to do. We have the app name; this is the name for the Spark application we're creating, and you can see that we concatenated your ID at the end. So there's nothing you need to do here, you can just run it and create your Spark instance. I'll pause here. Did it load okay for everyone? Did we get any errors? If you get an error, please flag it on Slack or flag it here and we can help you troubleshoot.

Once we create our Spark instance, we're going to create our table name. The table name is just the COW table, which we talked about earlier, with the ID as a suffix. And then we have all the locations of where the data is stored in S3; we're just setting variables to those locations, and from there we set the Hudi S3 path. All of these are basically the locations of where the data lives. Let me see if there are any questions. Okay. So go ahead and click on that box and click Run, which is the little triangle button there.

From here, we're going to create our schema. We're dealing with a CSV file, so we need to define what our schema looks like. Then we're going to read the CSV data, display it, and count how many records there are. So I'm going to run this part. You can see when we run it, it shows the top five rows, which is what we want to display because of the argument we passed, and we can see there are 79 records. Basically, we loaded the data and now we're querying it to make sure the data is there; all this part is doing is confirming we have data in the bucket.
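As a reference for this step, here is a minimal PySpark sketch of defining a schema and reading the raw CSV. The column names and the S3 path are placeholders, not the exact values from the lab notebook.

```python
from pyspark.sql.types import StructType, StructField, StringType, BooleanType, LongType

# Hypothetical schema for the e-commerce activity CSV; adjust fields to the lab's dataset.
schema = StructType([
    StructField("user_id", StringType(), True),
    StructField("action_type", StringType(), True),
    StructField("cart_empty", BooleanType(), True),
    StructField("event_time_ts", LongType(), True),
])

# Read the raw CSV from S3 with the explicit schema, preview it, and count the rows.
df_raw = (spark.read
          .option("header", "true")
          .schema(schema)
          .csv("s3://your-bucket/raw/cart_activity/"))  # placeholder path
df_raw.show(5)
print(df_raw.count())
```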
So now what we're going to do is actually write to Hudi. We're going to specify the table name we want to write to, and because Hudi offers two different storage table types, we have to specify the storage type we want; in this case it's going to be copy on write. Then there's the operation mode, which is going to be upsert. Hudi offers different types of write operations like insert, upsert, delete, bulk insert and so on. In this case we're going to upsert, which is inserts plus updates. From there, we're going to specify the record key, which is the user ID, and then we have the precombine field. The precombine field is used for deduplication: if you're working with streaming data, you might get tons of duplicates, and in that case you'll want a precombine field so that when records are merged and written to Hudi, you're not writing duplicate records. The idea here is that even though we're mostly dealing with a bulk load in this case, when you're dealing with e-commerce data it's usually going to be streaming into your data lake, so you'll definitely want to define a precombine field. And then all we're doing is writing to the Hudi S3 path that we defined earlier. So we're going to go ahead and run that.

From here, we're going to read the data back. We're going to create a temp Spark view, read the data, and then run some queries on it: we're going to select the total records from cart_status, then select some fields and order them. I'm going to go ahead and run this. I think it's still running. There we go. So if we count the total records from cart_status, there are 10, and you can see all the user IDs that exist here.

And now we're going to do an update. We're going to read the update CSV data, specifying headers as true, along with the schema we defined earlier so it knows how to read the data. We're going to show the last 30 results and then count the records. So we're going to go ahead and run that. I don't know why... oh, am I running? Hold on, let me see. Oh, wait, you know what, I didn't double-click. I'm so used to clicking once. I didn't double-click, and that's why I'm getting that error. Okay, let me try this again with a new ID; I'll take 52. I hate when that happens. Let me get rid of this and rerun it; I didn't open the right tab because I didn't double-click. Bad habit. I'm just rerunning the set of exercises we had, redoing what I did earlier.

Okay, I believe we're here. We wrote to Hudi, and then we're going to verify the initial load, and we have the table type. I think we went through this already. Okay, so we'll start here. We're at the df_final spark.read.format cell: we're going to show five records, display the results and run the query. That's this part here; you can see the user ID and the event time. And then from here, we're going to read the incremental data. We just provide the schema and say that there are headers, so it knows how to read the data. We're going to show the last 30 records and then count them. So that's this part here. Everyone should be here.
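Here is a rough sketch of the write-to-Hudi step and the read-back described above, assuming Hudi's Spark DataSource option names; the record key, precombine field, and paths are placeholders rather than the exact notebook values.

```python
# Copy-on-write upsert into Hudi (option names from the Hudi Spark DataSource).
hudi_options = {
    "hoodie.table.name": table_name,
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    "hoodie.datasource.write.operation": "upsert",               # inserts + updates
    "hoodie.datasource.write.recordkey.field": "user_id",
    "hoodie.datasource.write.precombine.field": "event_time_ts", # dedupes records on merge
}

(df_raw.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save(s3_hudi_path))  # the Hudi S3 path set earlier in the notebook

# Read it back, register a temp view, and query it with Spark SQL.
spark.read.format("hudi").load(s3_hudi_path).createOrReplaceTempView("cart_status")
spark.sql("SELECT COUNT(*) FROM cart_status").show()
spark.sql("SELECT user_id, event_time_ts FROM cart_status ORDER BY user_id").show()
```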
Am I going too fast, or is this pretty good for everyone? Let me check Slack. Everyone here good so far? Okay, let me see everything. Okay, cool. All right, that question is definitely flagged.

Okay, so from here we can see the records; we can see the added-to-cart events. Let's look at the schema really quick: we have action type, cart empty, event time, date, last login, last purchase, user ID, and then the ID. These are all the different activities that users are doing, with the timestamp and the action they're performing. That's what this dataset is about.

So now, from here, we're going to load this update data into Hudi. What we have is: we're providing the table name, the type of table we want to use, the upsert operation, and the record key. Then we have the precombine field, which we talked about earlier. Now we also have this payload class. The payload class basically defines the merge logic: when a record comes in and a record already exists in Hudi, it defines how those two records get merged. There are different built-in merge logics you can pick from, or you can specify your own merge logic as well. That's what this is doing. Then the ordering field: in this lab we're using Hudi 0.10.0, while the current version is Hudi 0.13.0, and the ordering field is basically the same thing as the precombine field. In 0.13.0 it's been deprecated, so I wouldn't worry too much about this; if you're on the latest version of Hudi, most likely you're just going to specify the precombine field. And then the payload event time field is mostly used for metrics; we're not going to use it specifically today, but that's what it's for. And then we're going to write to the Hudi path, so we're writing the updates into our Hudi table. I'm going to go ahead and run that. It's still running.

Now what we're going to do is create a temp view so we can actually query the data. That's what we're doing over here: we're going to select the distinct Hudi commit times from cart_status, and then select the specific fields we want and order them. We have the commit time and then the different users and so on. We have 13, 12, 16 — these are all the updates that have happened. So that's what we're seeing here. If we go back to earlier, we had 16, 9, 5, 10; now we're seeing the updates that have happened. You can see that with the updates there are more records here, and these are all the updates and changes that have occurred, whether a user ID got an update or a new user with activity got inserted.
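A sketch of what the update write and the commit-time check described above can look like, building on the `hudi_options` from the earlier sketch. The payload class shown is one of Hudi's built-ins, the field and dataframe names are assumptions, and `_hoodie_commit_time` is the metadata column Hudi adds to every record.

```python
# Reuse the base options from the initial load and add the merge-logic settings.
update_options = dict(hudi_options)
update_options.update({
    "hoodie.datasource.write.payload.class":
        "org.apache.hudi.common.model.DefaultHoodieRecordPayload",  # merge logic on upsert
    "hoodie.payload.ordering.field": "event_time_ts",  # plays the same role as precombine
})

(df_updates.write.format("hudi")   # df_updates = the update CSV read earlier (placeholder name)
    .options(**update_options)
    .mode("append")
    .save(s3_hudi_path))

# Each write shows up as a new commit on Hudi's timeline.
spark.read.format("hudi").load(s3_hudi_path).createOrReplaceTempView("cart_status")
spark.sql("SELECT DISTINCT _hoodie_commit_time FROM cart_status ORDER BY 1").show()
```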
And then what we're going to do now is perform a delete operation. So let's say you want to run a delete to comply with privacy laws like GDPR; you can do that as well. What we have is a dataset of the raw data to delete, and again it's a CSV file. Same thing: we provide the header option and the schema we defined earlier, and then we show what the records look like and count them. This is basically just querying the CSV file, and you can see the deletes that are happening are 13, 14 and 10. These are the user IDs that we want to delete.

Now if we go over here, we're going to write to the same Hudi table and basically delete those records. Most of these configurations are the same; the only thing that's different is the parallelism. The parallelism is basically how many threads are used for the different operations involved in the delete, like shuffling the data. So that's what this part is, but the rest of the configurations are the same. And then if you query the table to see whether the deletes happened, you should see that 13, 14 and 10 have been deleted. So if you count, there are 15 records now, and all the records exist except for 13, 14 and 10. If you go here, 13 doesn't exist. But if you go back to the previous query that we ran — that was querying the earlier data — 13 existed somewhere. I think I pointed it out earlier. Let's see. Oh, here it is: 13. There we go. And now if we look back at the latest query, you can see that 13 doesn't exist. So this is how you perform upserts and deletes on Hudi, and this is how you'd do it through Spark. We're querying through Spark right now, but at the end we're going to show you how you can query through Presto, through the Presto CLI.

Yeah, so the question: that delete, was it a hard delete? Did it actually delete the file? It's a hard delete. Okay, so I can't actually time travel and get that record back? No, it's a hard delete. But with the new CDC RFC that Hudi provides in 0.13.0, you can do incremental queries on hard deletes now; that's for the later version of Hudi. You can also do a soft delete? Yes, you can. So, just for the other folks, the reason I bring this up is, again, you're separating what the table actually is from the underlying storage. If you do a soft delete, then when you query the data it's as if the record doesn't exist, but if that data is actually still in the lake, then technically you can time travel back and see that data. Yeah. And again, this is all managed, because Hudi keeps track of all this for you; in the end, to the engine, you're just looking at tables.

So Hudi has the notion of a timeline: every commit, every transaction that happens — whether it's a bulk insert or data being ingested — goes onto the timeline. The timeline is basically immutable, and it's how Hudi keeps track of what's going on with the data in the tables. You can specify a hard delete or a soft delete. If we go to the Hudi documentation — let me go, there we go — under Docs, the Spark guide, and then Delete data, there are different ways to specify whether you want a soft delete or a hard delete. It's a similar thing, you just specify it. I'm just repeating the question for the virtual audience. Oh, sorry — is the hard delete the default? I believe it is, yeah. If you wanted to do soft deletes, say in Python, you can specify that, but usually it's a hard delete.
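For reference, here is a sketch of the delete write described above, assuming the Hudi DataSource "delete" operation and building on the earlier `hudi_options`; the parallelism value and dataframe name are illustrative. The closing comment notes how a soft delete would differ.

```python
# Hard delete: write a dataframe containing the record keys to remove, with the
# operation switched to "delete". Other options stay the same as the upsert path.
delete_options = dict(hudi_options)
delete_options.update({
    "hoodie.datasource.write.operation": "delete",
    "hoodie.delete.shuffle.parallelism": "2",  # shuffle parallelism for the delete job
})

(df_to_delete.write.format("hudi")   # df_to_delete = the delete CSV read above (placeholder)
    .options(**delete_options)
    .mode("append")
    .save(s3_hudi_path))

# A soft delete would instead upsert the same record keys with all non-key fields set to
# null, so the rows disappear from queries but stay on the timeline for time travel.
```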
Cool. Let's go back to the Jupyter notebook. So this is basically the COW portion of the lab. Let me check Slack to see if there are any questions. No questions. Okay. Any other questions?

Okay. So now we're going to move over to the merge on read table. We're going to go back to our slides really quick and then start that, so let me go to the slideshow. Okay. We've talked about the copy on write storage; now we're going to talk about merge on read. A merge on read table is a little bit different from a copy on write table, and it's really effective if you're dealing with a lot of writes — you have a lot of writes happening, you're working with streaming data, depending on your workload. Basically, when writes happen, they get written to these log files, which are Avro files, and then you also have a base file, which is the Parquet file. When you do your first load, for example, it will be written into a Parquet file, but subsequently any other writes that happen get written into log files, which you can see over here. And because when you write to log files there's no merge — records aren't being merged at the time they're written — the write amplification for an MOR table is very low compared to a copy on write table.

Now, earlier I mentioned that a copy on write table has less operational complexity than a merge on read table. This is because with a merge on read table there's a service called compaction that Hudi offers, which combines all these log files according to a configuration you specify: how often you want to run compaction, how you want to merge these files — that's all defined in the compaction service. It takes all these log files and merges them into a new version of the Parquet file. So now all the log files get merged in and you have whole Parquet files again. Then, as more commits happen, they get written to log files, the compaction service at some point combines those log files into a new version of the Parquet file, and the cycle keeps repeating. This is why merge on read tables have a low write amplification but a slightly higher read amplification: the merging of files happens later, through the compaction service, and then you get the new Parquet files to query.

So that's what merge on read is. If you look at the blue boxes at commit time zero, we're doing our first initial load. Because it's the first load, everything gets written into a Parquet file — we have A, B, C, D, E. So if you're running a snapshot query, which is the current state of the table, you're going to get A, B, C, D, E back. If you're running an incremental query — again, this is the first load, there have been no updates — you're going to get everything back. And if you run a read optimized query, you're also going to get everything back. A read optimized query basically looks only at the Parquet files, so there's a little bit of lag: it's not what your table looks like at this exact point in time, it's what it looks like as of whatever has been merged into a Parquet file.

So now if we go to commit time one, we see that A has been updated to A prime, but you see it's written into a log file; there's a file1_t1.log. It just writes it straight to a log file. And if we look at the file group with C and D, D also got updated to D prime.
So this update also goes into a log file as well. Now, if you're running a snapshot query, you're going to read what the current state of the table looks like, and this includes all the data that has been updated. So you have A prime; B, which didn't get updated; C, which didn't get updated; D, which got updated to D prime, so you're going to read D prime; and E didn't change, so you're going to read E. This is the current state of the table, with any updates or insertions reflected.

Now if you run an incremental — oh, go ahead. So I do want to slow down here, because I think this is the key part: the big trade-off between the copy on write table and the merge on read table. Notice that on a merge on read table, when you're doing a write, you're just writing what you want to write; you're not doing anything fancy. So writing is very efficient. But where you pay the penalty is when you want to read: when you read, you have to merge these files to get the effective view of what the data looks like. So depending on what type of query you run — if you don't run the read optimized one, if you run the snapshot one — it has to merge in real time. That's why it's called merge on read. Versus copy on write, where all the heavy lifting happens when you write it: it's merged at write time, but when you read it, it's fast, because the data's already merged together. And this is the fundamental trade-off that you pick. If you want efficient writes — and correct me if I'm wrong — you do merge on read: writes are efficient, you just write the delta, but it's slower when you want to read. If you say, hey, I don't need to be as efficient on writing, but when I read it, it had better be fast, that's when you'd pick a copy on write table.

So to evaluate which table type you want, you have to look at your workloads. If you're doing batch workloads, like once every hour or so, maybe a copy on write table is more than enough, because you can afford that kind of latency. But if you're working with streaming data, you need to be very efficient on the writes, so you reduce the write amplification, at a slight cost on the read amplification.

So if we look at the incremental query, we just get those updates — whether they're in a Parquet file or they got written to a log file, we just get the updates. And if you run a read optimized query, it just queries the previous state of the table, so it's not fully up to date. That's the difference between a snapshot query and a read optimized query, depending on what you want.
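A sketch of how the three query types described here map to Hudi's Spark DataSource read options; the begin-instant value is a placeholder commit timestamp, not one from the lab.

```python
# Snapshot query (the default): merges base Parquet files with the log files at read time.
snapshot_df = spark.read.format("hudi").load(s3_hudi_path)

# Read optimized query: reads only the compacted base Parquet files, so it can lag behind.
read_optimized_df = (spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "read_optimized")
    .load(s3_hudi_path))

# Incremental query: returns only records changed after the given commit instant.
incremental_df = (spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", "20230101000000")  # placeholder
    .load(s3_hudi_path))
```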
Now if we go to commit time two, A got updated to A double prime and E got updated to E prime, so there are two updates. If we run the snapshot query, we get the current state of the table: it'll be A double prime — A prime has now been updated, so that's gone — B didn't change, C didn't change, D prime is still D prime because there's no update at commit time two, and then we have E prime, which got updated, so we're reading that. And then apparently F got inserted. I think this might actually be wrong: F should have been written to a log file, so the diagram isn't technically correct. But if you have an insert of F, which isn't really reflected here, it should be showing.

If you run an incremental query, you'll get all those updates. And if you run a read optimized query, it just queries the previous state of the table, from the first initial load.

And then finally, at commit time three, you see everything is in the Parquet file. At this point we have run the compaction service: it has gathered all the log files and compacted them into a new version of the Parquet file. Now if you're running a snapshot query, you're going to get the full current state: A double prime, B, C, D prime, E prime and F. Everything got merged — all the updates are now merged into the new Parquet file. So this was the original version and this is the updated version. If you're running an incremental query, you're just going to get the latest changes again; between commit time two and commit time three, only the compaction service ran — there wasn't actually any insertion of data or updates, it's literally just compaction. And if you run the read optimized query now, this is the biggest change: you're now reading the current state of the base file, which is A double prime, B, C, D prime, E prime and F. So now the read optimized query has a new snapshot to look at. Any questions here? Okay.

Now, if you look at the trade-offs for a Hudi data lake, there's a data latency cost. For copy on write, because the write amplification is slightly higher, you're going to have a little higher data latency; it's lower on merge on read, because you're literally just writing to log files, and as you saw earlier, if you want to run a snapshot query you can still get the current state of the data, so it has lower data latency. If we talk about updates, copy on write has a little higher update cost, because there's merge logic that happens to produce those Parquet files you're writing; for merge on read, because you don't have that, it's a lower update cost. Then there's the Parquet file size, and then you have the write amplification. What's not on here is the read amplification: that would be lower on copy on write and slightly higher on merge on read. These are what to consider when choosing copy on write or merge on read. Merge on read: you need quick ingestion, streaming data, your workloads can change — sometimes they're spiky, sometimes not, you don't know the patterns — which typically happens with streaming data. Copy on write: you can afford the data latency, you understand your workload, it's pretty consistent, and so on. And again, the difference between merge on read and copy on write is the compaction service: if you're using merge on read, you'll likely want to run a compaction service so you don't have all these log files taking up storage, and you can combine them into a Parquet file. Okay.
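The lab doesn't configure compaction explicitly, but as a rough illustration of the idea, Hudi's Spark writer can run compaction inline on a merge on read table. The option names below are from Hudi's configuration reference and the values are purely illustrative; they would be passed in the same `.options(**...)` call as the write sketches above.

```python
# Illustrative inline-compaction settings for a merge-on-read write: compaction folds the
# accumulated log files back into a new base Parquet file every few delta commits.
mor_compaction_options = {
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.compact.inline": "true",                  # run compaction as part of the write
    "hoodie.compact.inline.max.delta.commits": "5",   # compact after this many log commits
}
```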
So now what we're going to do is navigate back to the lab notebook. Let me see if there are any questions here. Nothing. Okay. So now we're going to navigate to the MOR lab template, the MOR .ipynb, and we're going to do the same thing that we did for the COW table: right-click and duplicate, and then once you have the duplicate, right-click again and rename it. I'm just going to rename it to 52.

And then hit Enter — and make sure you double-click it, don't be like me — and that should work. Let me... there's something I want to show really quick. In this part, on the MOR table, we're going to show incremental updates, and I had actually written a query to show that update; I'm wondering if it saved. It didn't. Oh, crap. Okay, we'll go without it; I think I remember it.

Okay, so the first thing we're going to do, after you rename it — I'll give everyone a minute to do that — is use the same ID that you identified earlier. I picked 52, so where it has the ID, I'm going to change it to 52. Then you can go ahead and run the first box. It's a really similar setup: basically we're getting our environment variables and keys and setting everything up here so we can actually access the data. Yeah, you can just run it — make sure you save it and run it. Then from here, we're going to create our app name, and we're just going to put the ID in the middle so there's no conflict, and then we create a Spark instance; go ahead and run that. And if you're following along in the workshop guide, we're actually in lab two now, and we're going through this step right now.

Then what we're going to do is create our table name. We have different batch data that's stored in S3, and then we have the Hudi path where we're going to write the data and do the updates and so on. Since we already have a variable with the ID, we don't need to do anything; we can just go ahead and run it. Now we're going to read the batch dataset, just to make sure it's good — it's always good practice to do that — and we're going to display five results. You'll notice we're working with JSON here, so we didn't need to define a schema; with the CSV we had to define a schema, but here we can just automatically infer the schema and read the JSON data. And if you look at this, we have updated cart, action type, app name, app version, cart empty, event time, ID, last login, and so on. These are all typical things you'd want to look at when you're working with some sort of e-commerce data application; basically all the activities a user has had. Yeah — you'd probably need to denormalize it; some databases work with nested objects. Yeah, that's a good question. And now we're going to count the records to see what the dataset looks like, and we have a thousand records.

So now we're going to do a full load into Hudi. We have the batch dataset, and now we're going to write it to the Hudi table. Basically, the only thing that's different is that we're specifying a merge on read table. The operation type is upsert, we have a record key, and then we have a precombine field, which prevents duplication. And then we write the data to the table. Still writing. And now we're going to verify the initial load before we apply any incremental changes: we're just going to create a temp view and run some queries. We're going to select the distinct user IDs, and then select user ID and event time TS, so you can see all of that here.
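A sketch of the MOR lab steps just described: reading the JSON batch with an inferred schema, writing it as a merge on read table, and verifying the load. The only material change from the COW write is the table type; the paths, table name, and dataframe/view names are placeholders.

```python
# Read the JSON batch; Spark infers the schema, so no StructType is needed this time.
df_batch = spark.read.json("s3://your-bucket/batch1/")  # placeholder path
df_batch.show(5)
print(df_batch.count())

# Same shape of options as the COW lab, with the table type switched to merge on read.
mor_options = {
    "hoodie.table.name": mor_table_name,                          # placeholder table name
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.recordkey.field": "user_id",
    "hoodie.datasource.write.precombine.field": "event_time_ts",
}

(df_batch.write.format("hudi")
    .options(**mor_options)
    .mode("append")
    .save(s3_hudi_mor_path))  # placeholder MOR table path

# Verify the initial load before applying incremental changes.
spark.read.format("hudi").load(s3_hudi_mor_path).createOrReplaceTempView("cart_status_mor")
spark.sql("SELECT DISTINCT user_id FROM cart_status_mor").show()
```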
Actually, I had written this earlier, but I'm going to write another query, because it's not very obvious from this one what updates are happening. Instead of ordering, I'm going to select user ID and event time, with a WHERE clause — hold on, I'm just going to do WHERE user ID equals... I think this is it, I had this done earlier. Let me see if this runs really quick. I'm just selecting the user ID right now. Oh no, it didn't do that. I'll come back to this later.

Okay, so now we're going to provide the updates. We're going to read the new updates that we're going to apply to the Hudi table. It's JSON data; we're going to show the results and then count the records. If you look at this table now, these are all the updates that are going to be written into the Hudi table: we have user IDs 1063, 1066, 1042, 1047, then 1059 and so on. These are all the updates that are happening, and we're only showing the top 20 rows.

I think this is the part that I wanted to write. Okay, so now if you go down to the bottom here, we're going to write to the Hudi table, and basically everything stays pretty much the same; we're just writing the updated batch of data to the Hudi table. So you can go ahead and run that. Basically, we had our initial batch load of raw data, then some updates happened — more users had actions. If we go back to the slide where we talk about updates, we're just running an update, and then when we run queries we can see the changes from the updates. So after we run that, we're going to verify that the updates and inserts have happened, and 51 records should have been updated.

Now, this output is kind of hard to read. I had written another query, so I'm going to do this on the fly really quick and show you what the updates look like. If I remember correctly — I believe I remember the user ID — and instead of event time date, I want to show event time TS. I'm just writing a quick SQL query in Spark SQL, hold on, so you can compare apples to apples. Let me go ahead and rerun this really quick. Okay, so 1047 had this. Now I want to show that at an earlier timestamp it did not exist, or that it was updated. This is the updated data, and these are the incremental changes. And if we look at before we did the incremental changes, when we ran the query on the initial batch dataset — let's run that and see if it works. Okay, so if you look at the earlier query where it says df cart status spark.read.json batch one data, I'm basically selecting user ID and event time TS, and filtering to a single user ID so you can see the changes. You can see that ID 1047 had this event time TS, 11 16 28 53. Now if we look later and query that same ID — oh, yeah, I didn't rerun it, that's right. Let me rerun that really quick, and then we have the count. I'm trying to find where I added the thing. Oh, yeah, there we go. Okay, so this is the updates and inserts. Okay, so this is that 28. Now let me go back to batch data one and copy that — this is the initial load. And if we run this... oops, where did I put it? I think it was higher up. This one is the initial load — no, batch one. Let me go ahead and rerun everything, and rerun this. Oh, there are a lot of boxes. There we go. Oh, yeah, I have it over here, this is where it's at. Oh, it did an update.
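For reference, the before-and-after comparison being reconstructed here would look roughly like this; the temp view names are assumptions, and 1047 is just the user ID mentioned above.

```python
# Event time for one user as stored in the Hudi MOR table after the incremental upsert.
spark.sql("""
    SELECT user_id, event_time_ts
    FROM cart_status_mor
    WHERE user_id = '1047'
""").show()

# The same user in the raw initial batch, for an apples-to-apples comparison of timestamps.
df_batch.createOrReplaceTempView("batch1_raw")
spark.sql("""
    SELECT user_id, event_time_ts
    FROM batch1_raw
    WHERE user_id = '1047'
""").show()
```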
There was an ID that I wanted to use to show that updates were happening to the records. I forgot which ID it happened on, but basically, if you were to query the IDs, you would see that there were different timestamps between the initial batch load and the incremental update. So if you query the different user IDs, you'll be able to find one. I don't remember the specific ID — I only did this earlier — but that's the point of the exercise: to show the incremental updates that were happening. I'll pause here. Are there any questions? Let me check Slack. Cool.

So this wraps up the Hudi portion on how you can write and query your data. We've been querying through Spark, and next we're going to show you how to query through the Presto CLI, where you'll query the Hudi tables. So I'll disconnect.