Hi everyone, my name is Alena, I'm a data engineer at Refinitiv. Today I will present two topics about Spark. As you know, about three weeks ago there was the Spark Summit in Europe, which unfortunately I could not attend, but the organizers shared almost all of the videos, so if you go to the website there are a lot of very interesting things. One of the things they covered is Spark 3.0, and while going through the videos I found some features that could be interesting; some of them we will cover today. The main thing, of course, is the Catalyst optimizer: they say they improved it a lot, but for that you may need to find a dedicated video. Also very interesting, at least for me, is the pluggable data catalog, though I have no details on that either. One cool thing is Spark Graph, a new module for working with graphs that lets you use Cypher, the query language from the Neo4j database, so that should be a really cool thing. One more is dynamic partition pruning, which we will cover today. And starting from this new version we can use a binary file format with DataFrames, and there are two new options that are useful when you read files, so we will cover those as well.

Now I don't know how to switch slides. The view... it is completely hung. Should I turn off the screen? Okay, we'll see whether it refreshes. We have a slight PowerPoint latency here, which is unrelated to any data processing, related to the quality of PowerPoint, I guess. Give us another second, I'm trying to pull up Alena's presentation as a backup from my laptop; something went really bananas on the front. While we wait, the idea: the major bit, the coolest bit that was added, apart from the Catalyst optimizer, is dynamic partition pruning. Alena will talk about DPP, a long-awaited feature that optimizes joins across large data sets, plus the bells and whistles around file sources at the end, so we can skip through and no one is delayed. You'll see, my PowerPoint is better than yours. Let's check. Okay, yes, it works.

So, about dynamic partition pruning. We start with a teaser slide that shows the performance of a query on a data set of around 10 terabytes. Without dynamic pruning, the query runs for around 20 to 25 minutes; with this feature, it runs in around 10 seconds. Really? Yes, it's a real test, made by Databricks. So now that everyone is impressed, we can see how it works.

First, let's remember how static partition pruning works. Suppose we have a simple query with a filter. In the naive data flow, you first scan the table and then apply the filter. With predicate pushdown it's the other way around: first we filter, and then we scan only what we need. And if the table is partitioned, we filter by partition first and simply skip the partitions we don't need.
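To make the static case concrete, here is a minimal sketch in Scala; the path and the column names (/data/sales, year, amount) are hypothetical, not from the talk:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("static-pruning").getOrCreate()
import spark.implicits._

// Hypothetical sales data, laid out on disk as /data/sales/year=2018/,
// /data/sales/year=2019/, ... because it was written partitioned by year.
val sales = spark.read.parquet("/data/sales")

// The filter compares the partition column to a literal, so Spark resolves
// it at planning time and lists only the year=2019 directory.
sales.filter($"year" === 2019)
  .select($"amount")
  .explain() // look for PartitionFilters: [..., (year = 2019)] in the scan
```

The key point is that the filter value is a literal known before execution, which is exactly what the join case below lacks.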
But in data engineering work, quite often what we need is not just a simple query: we have to join with some dimension table, if you work with a proper star schema. In that case, when you join with a dimension table, you don't actually know how to filter your fact table. So what happens? On one side you have your fact table, on the other side the dimension table. You can filter the dimension table, but that doesn't help you prune the fact table. So, as any data engineer, what we usually do is join both tables once into one big, wide table where we have everything. But that has the disadvantage that we cannot maintain it: if something changes, we have to reload all the data. That's why this talk shows how to do exactly this: how to push the filter from the dimension table onto the fact table.

Let's look in a nutshell at how Spark handles a query. We write it using any API (Python, Scala, SQL, anything). Spark first builds a logical plan, which goes through rule-based transformations such as predicate pushdown and column pruning. The logical plan is then transformed into a physical plan, the execution plan that actually tells Spark what to do on the cluster. Let's start at the logical level. Suppose the fact table is partitioned and the dimension table is not. First we scan the dimension table and filter it, say down to only two values, and then we join. In the ideal case, we should join with only the two matching partitions of the fact table. The simple approach on the logical level would be to apply the dimension-table filter as a subquery before scanning the big fact table. That sounds good and logical; the question is how to implement it, because done naively it's just double work: we would have to run the same subquery twice. So it's not easy to implement, and it could be quite expensive.

That's why it is implemented on the physical level instead. Remember, we join the fact table with the dimension table, and the dimension table is usually quite small, so Spark will use a broadcast hash join. It reads the dimension table, builds a small hash table, and distributes it as a broadcast variable across the whole cluster. You then have the table on all the nodes, which is why the join can be done locally, without a shuffle. And exactly here is the trick: once the broadcast side is built, we already have its result, and we can reuse it when we do the file scan of the fact table. That is exactly our dynamic filter. This is how it works, quite simple. So now we have this filter, and we can filter the fact table's partitions as we scan them.
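As a hedged illustration of the star-schema case (the tables sales and dates and their columns are invented for the example), this is roughly what such a query looks like in Spark 3.0, where the filter on the dimension table becomes a runtime partition filter on the fact table:

```scala
// Hypothetical star schema: `sales` is a large fact table partitioned by
// date_id; `dates` is a small, unpartitioned dimension table. DPP is on by
// default in Spark 3.0 (spark.sql.optimizer.dynamicPartitionPruning.enabled).
val q = spark.sql("""
  SELECT s.amount, d.day_of_week
  FROM sales s
  JOIN dates d ON s.date_id = d.date_id
  WHERE d.day_of_week = 'Mon'   -- the filter exists only on the dimension side
""")

// The broadcast side is built first, and its join keys are reused as a
// runtime partition filter on the fact-table scan: look for
// dynamicpruningexpression(...) among the scan's PartitionFilters.
q.explain()
```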
They also provided some benchmarks, run on the TPC-DS data set with 1 and 10 terabytes of data, on an AWS cluster of 10 machines (I think the same instance type that Arsen recommended), with Spark 3.0. On the one-terabyte data set, 60 out of about 100 queries got a speedup of between 2 and 18 times. Quite impressive, yes. And if you look at the top 10 queries, they sped up around 10 times. If you ask why, I think it's quite obvious: you skip most of the data. For this query, around 95% of the data is simply never read; that's why you get a result like this. You just skip the data. Got it? No? Okay, let me say it again. Go back to the beginning: say our table has six partitions. If you don't use dynamic filtering, we have to read all of them and then filter to find the two matching values, right? If we use the dynamic filter, we skip four partitions and read only two of them, so we read three times less data, and that's why every query runs much faster. In this benchmark, those queries simply don't read about 90% of the data. And on the 10-terabyte data set it's even better: some queries run up to 100 times faster. This is the query we mentioned at the beginning: it joins three tables, and through the date dimension we actually scan only one month, while the table itself covers five years. Of course, if we scan one month instead of five years, you get a result like this. Better? Sorry? From the audience: we do financial calculations; most of the time we end up scanning all the data and reading through all the transactions. Yes, of course, it depends on the workflow.

Just to finalize dynamic partition pruning: the idea is formulated on the logical level and then implemented via the broadcast join on the physical level. The main benefit is that for data engineers who work with a star schema, it now runs fast, so you don't need to avoid the star schema or build denormalized, very wide tables. I also want to mention a few limitations. First, exactly what you mentioned: your fact table has to be partitioned by the key you are going to join on; if it's not, of course it doesn't work. Second, it works only with a broadcast join, which means the dimension table should be quite small; by default Spark's broadcast threshold is 10 megabytes. You can change it, of course, but keep it in mind. And one more thing: it is limited to equi-joins, so you cannot join on conditions like "my value is greater than" or "less than" something.
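A minimal sketch of the two settings that gate this behavior; both are shown with their Spark 3.0 defaults, purely for visibility:

```scala
// Dynamic partition pruning itself is enabled by default in Spark 3.0.
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")

// The dimension table must fit under the broadcast threshold (10 MB by
// default) for the broadcast hash join, and therefore DPP, to apply.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10L * 1024 * 1024)
```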
From the audience: so in the example you showed, the very first one, the fact table is basically partitioned by date, right? So when you filter for Monday, what it's doing is taking all the dates that fall on a Monday and picking just those partitions from the fact table. Yes. So it's quite similar to first collecting all the Mondays and then doing a WHERE ... IN: the day of week is Monday, you take all the matching dates, and since the fact table is partitioned by date, you dynamically apply that filter and pick up exactly those partitions. Yes, exactly; I think the literal "Monday" is just for clarity on the slide. Usually, of course, you keep an actual date, and your fact table is partitioned by date: year, month, day. I'm just describing an equivalent idea: you pick all the dates that are Mondays and dynamically pick those partitions from the fact table. But you don't know them in advance; the point is that your sales table does not contain the definition of "Monday", only the dimension table does, so you have to join to get the list of dates that are Mondays. I accept that; that's what I'm saying: under the hood it works out which dates are Mondays and then, from the fact table, reads only the partitions in that list. But that is the expensive double processing Alena mentioned. And that is exactly the improvement in 3.0: you compress the pipeline from this long to this long by pushing the logic into the physical planning, into the level where the plan is executed. Catalyst is getting more and more capability. Thank you. Okay, good.

So let's switch to the other talk. It was a session about file sources in Spark. From my point of view it's quite beginner level, so if you already know a part, just say so and we can skip it and save time. We will cover file formats and their layout, and file readers and writers. Spark supports columnar formats and row formats. Who doesn't know the difference? Okay, we can skip that and speak directly, but briefly, about the columnar formats: that's Parquet and ORC files.

Who knows how Parquet works inside? No one? Okay, let's look at it in a bit of detail. This is the structure of a Parquet file. It's a columnar format, not row-oriented. Inside we have a list of row groups, and maybe the most important part of a Parquet file is the footer. The footer contains all the metadata for the file, which is why, when Spark reads a Parquet file, it actually starts from the footer, from the end of the file. The footer also holds the schema and per-column statistics, such as minimum and maximum values for each row group. That's why, when we read, we can filter at the level of row groups and skip whatever is not relevant to the query. This is how row-group skipping works, and again, as in the previous talk: the more data you can skip, the faster you run.
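As a small, hedged sketch of how that looks from the Spark side (eventsDf, the path, and the ts column are hypothetical):

```scala
import spark.implicits._

// Hypothetical events DataFrame written as Parquet; min/max statistics are
// collected per row group and stored in the footer at write time.
eventsDf.write.parquet("/data/events")

// On read, the comparison is pushed down to the Parquet reader: row groups
// whose min/max range for "ts" cannot match are skipped without being read.
spark.read
  .parquet("/data/events")
  .filter($"ts" >= "2019-11-01")
  .explain() // look for PushedFilters: [GreaterThanOrEqual(ts,...)]
```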
Another format: who knows ORC? One, two... not many. It's my favorite format, that's how I know it. ORC actually came a bit later than Parquet, but in my opinion it's simply better. It was developed by Hortonworks, mostly for Hadoop, and now it's a built-in format in Spark, which is very nice. The format is quite similar to Parquet, with some small differences. The file is divided into physical stripes, where a stripe is a set of rows, around 250 megabytes here, and again there is a footer. In the footer you have statistics about the columns and the number of rows per stripe. You also have an index with min/max information, and inside each stripe there is a stripe footer with information about the encoding of each column. For example, imagine a city column where every value is "Singapore": instead of storing "Singapore" one million times, you store just the value and the count, two values instead of a million. That kind of information lives in the stripe footer. When you read, it's the same story: very efficient data skipping, because you have all this information per stripe. If a stripe is 250 megabytes and you know your data is not inside that block, you skip 250 megabytes at once, so reads are very fast too.

Next, let's look at the row-oriented formats; I think you are familiar with these. We have the semi-structured formats, JSON and CSV. Both are very good for writing. If you've ever written Parquet or ORC, you've noticed it takes more time than just writing CSV, exactly because of collecting all the statistics, writing the footer, dividing into stripes and so on. With CSV and the like you don't need any of that, you just write, which is why it's very fast, and why CSV-style files are often used for logs. The disadvantage is the schema: you mostly have to define it manually. Spark can infer the schema from the file, but in production you usually prefer to define your own. The same goes for JSON: even though Spark can infer the schema from JSON, it is still recommended to supply your own.

Then there is Avro. Who knows it? Okay, just a few people. Avro is actually very similar to JSON, but the schema sits on top of the file: with JSON you repeat the field names with every value, while with Avro you define the schema only once, so the file is quite compact and quite fast. If you work with, say, different clients sending event data from different sources, and you have to keep the schema consistent, Avro is actually the best choice. JSON also works, and it's especially convenient during development, when things are changing quickly, so usually you just start with that. CSV I think everyone is familiar with: it's easy and human-readable, but you get all the problems with separators, escaping and quoting. Spark helps a lot here, because it has built-in tools to parse all of this, but still. You also know that Spark supports plain text files: in that case it reads the file into a single column, value, with one row per line.
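Since the advice here is to define the schema explicitly in production, here is a minimal sketch of what that looks like; the field names are invented:

```scala
import org.apache.spark.sql.types._

// Explicit schema, declared up front instead of inferred: no extra
// inference pass over the files, and column types stay stable.
val eventSchema = StructType(Seq(
  StructField("event_id", StringType,    nullable = false),
  StructField("ts",       TimestampType, nullable = true),
  StructField("city",     StringType,    nullable = true)
))

val events = spark.read
  .schema(eventSchema) // skip schema inference entirely
  .json("/data/events-json")
```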
A new format is supported starting from Spark 3.0: binary files, for example pictures. In this case Spark will read, say, all the pictures inside a folder and store them in a DataFrame, one row per file, with its own schema: you get the path, the modification time, the length, and the content itself as a binary object. It should be very interesting. And here they have an example of how to read such binary files, and you can see two new options that, I think, will also be available starting from Spark 3.0. One of them is pathGlobFilter. It allows you to use glob patterns to select only certain files. You know that currently you cannot do this in Spark: Spark will process everything you put in the folder, so if there is some garbage in there, the job just fails. The other option, recursiveFileLookup, lets you read the data while ignoring the partition directory structure. For example, if you don't need the partitions, you just read everything as if it all sat in one folder. Sometimes that is needed; when you read pictures, for instance, it could be useful.
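A hedged sketch of how these pieces combine in Spark 3.0 (the image folder and the glob pattern are made up):

```scala
// Read raw files as rows of (path, modificationTime, length, content).
val images = spark.read
  .format("binaryFile")
  .option("pathGlobFilter", "*.jpg")      // keep only JPEGs instead of failing on garbage
  .option("recursiveFileLookup", "true")  // descend into subfolders, ignore partition layout
  .load("/data/images")

images.select("path", "length").show(truncate = false)
```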
The next topic is partitioning and bucketing. Who knows everything about partitioning? Everyone? I think it's quite simple, so I'll go through it very quickly. With partitioning, you put your data into different folders, and you usually choose something with low cardinality. Say your data is partitioned by year, then by genre, and then come the files. If your query filters on this year and this genre, you read only that one folder and skip all the others, so you also skip many files. The things to remember about partitioning: try to avoid too many partitions. First of all, if you work with a metastore, having more than about a thousand partitions puts quite a lot of pressure on it. Also avoid partitioning on columns with high cardinality: it just creates an enormous number of partitions, so it makes no sense. Or, as they suggest, use Delta Lake, which we mentioned today. Delta Lake can actually handle billions of partitions. Why? Because the Delta format is basically Parquet plus a transaction log: when you add a new file to your folders, it's just a new record in the log, so all the information about all partitions lives in the transaction log. That's why it really can handle billions; it doesn't matter how many you have. That's one good thing about Delta.

One more topic: bucketing. Who is familiar with bucketing? Okay, let's have a closer look. Bucketing is a bit similar to partitioning, but works the other way around. For partitioning you usually choose a low-cardinality column, typically a date: you partition by year, by month, by day, maybe by hour. For bucketing you usually choose a column like an ID, with a lot of distinct values. First, why do we need bucketing at all? Bucketing helps you join two big tables, for example two fact tables. For bucketing you select a column, and it should be the same column in both tables, something like a client ID or customer ID. The slide shows that without bucketing you first need to shuffle, then sort, and only after that can you join the two tables, which is very slow. With bucketing, the data is effectively pre-shuffled and pre-sorted. How does it work? You have a column with a lot of IDs, and when you create the table you declare that it is clustered by this column into, say, five buckets. That means all your IDs are divided into five buckets. You do the same for the other table, and then the join happens bucket by bucket, so you don't need to shuffle and you don't need to sort: your data is already pre-sorted, and the join runs much faster. From the audience: is it using the same hashing function as Hive? The same what? Hashing function; Hive does exactly this. Yes, of course it uses a hash function, and it's exactly the same bucketing as in Hive; the idea is the same, and I think the implementation is the same. To give an intuition for the hash function: say the IDs are integers, and the hash function puts all IDs ending in 1 into one bucket, all IDs ending in 2 into another, and so on, giving ten buckets. Then when you join, an ID ending in 1 is matched against the other table's bucket of IDs ending in 1. I hope that's a bit clearer. Bucketing also enables efficient data skipping, because again you have all this information, all these statistics. A sketch of the API follows below.
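As a minimal sketch (ordersDf, paymentsDf, and customer_id are invented names), this is roughly how both sides would be written so the join can proceed bucket to bucket:

```scala
// Both tables bucketed identically: same column, same number of buckets.
ordersDf.write
  .bucketBy(5, "customer_id")
  .sortBy("customer_id")        // pre-sort rows within each bucket
  .saveAsTable("orders_bucketed")

paymentsDf.write
  .bucketBy(5, "customer_id")
  .sortBy("customer_id")
  .saveAsTable("payments_bucketed")

// This join can match buckets pairwise, avoiding the shuffle and sort stages.
val joined = spark.table("orders_bucketed")
  .join(spark.table("payments_bucketed"), "customer_id")
```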
A slightly more interesting topic is how Spark reads the data. Suppose we have a simple query, and the main goal is to skip as much data as possible, so the query runs faster. Say the data sits in a folder partitioned by year. Our query filters on the year 2019, which means we skip all the other folders and read only that one. Then we filter on the city Amsterdam: when we read the Parquet files, we can skip every row group that doesn't contain Amsterdam, because the footer tells us what each row group holds. So we skip all of those as well. And finally, since we select only a timestamp and filter by city, we actually need to read just two columns. So out of all this data, in the end we open the file and read only two columns. The second goal is to do this in parallel. Files normally come in different sizes, so Spark splits the input into partitions of roughly the same size: the driver computes the splits and launches tasks, and each task reads the data of its partition. That way you get a uniform size across all your partitions.

And we can finish with how Spark writes the data. Suppose we've read the data and now we want to write it out. The Spark driver creates executors, and when an executor starts to write, it creates a temporary file in a temporary folder for each task. If you look while a job is running, you'll see a temporary folder; inside it there is another temporary folder with the job ID, because different jobs can write to the same destination, so each of them has its own ID; and every task also has its own ID, so at the bottom there are separate folders for the different tasks. Suppose we write a file and then send a task commit: okay, this file is done. You can then see it is committed, no longer in a temporary location, while the job as a whole is still in progress. Some task may fail; in that case we send a task abort, delete whatever files that task produced, and launch the task again, hoping that this time it writes successfully and commits. When all tasks have committed, we commit the job itself, and only then do you have your clean final result. And how is this transactional? Exactly like this: if everything commits, we move the files to their final locations; if something goes wrong, we delete what we created and declare the job failed.

The links below are quite important. This commit protocol quietly assumes an HDFS flavor of file system underneath: a file system that is lockable, where you can guarantee a commit on the subsystem below Spark. The links below, about transactional writes to cloud storage, are a bit different. Take S3 as the major example: who has data on S3? Everyone, right? Regardless of how AWS presents it, S3 spans multiple availability zones, and there is some replication latency between them, microseconds or milliseconds, but it's there, and there is no transactional guarantee in the S3 protocol. If your workers and driver are spread across a region, chances are they sit in different availability zones, and there is no guarantee on the commit side. So there is, for example, a patented layer at Databricks that sits above the storage level and guarantees transactional commits on the object store. And then there is the Delta Lake approach, which is actually really cool and is open source, where the commits are guaranteed at the level of the transaction log. Just keep in mind that underneath there are not only POSIX-style file systems but also object stores. That's the slide you took, sir. That's good, thank you.

Okay, just to recap: we covered column-oriented formats, some row-oriented ones that I hope everyone knows, what partitioning is, what bucketing is, and how Spark reads and how Spark writes. I finished before nine. Yeah, before nine, thank you. Questions? I have a question. Is there any case when I should use Parquet instead of ORC? It seems like ORC is a better Parquet. Yes, I think so: if you have nested data. If you have nesting inside nesting inside nesting, then Parquet is better than ORC.
There's also a good anecdote here. I don't have hard evidence for this, but because it's all open source, in the end it converges to how many other communities you manage to convince to support your format first. Why was Delta Lake built on top of Parquet and not ORC? It was a data-driven decision: something like 90% of the data we observed, everywhere we looked, was in Parquet, because the Apache Parquet team was basically running around more meetups, convincing people to converge in their direction. Parquet has also been around longer. Yeah, that's another factor. Good, thank you. And thank you once again for hosting it.