Carnegie Mellon University's Advanced Database Systems course is filmed in front of a live studio audience. So last class we were discussing BigQuery and Dremel, that system architecture, and as I said, the Dremel work served as a foundation or a blueprint for how people build a lot of modern OLAP systems. Not everyone is going to have every bit and piece that Dremel has, and certainly there have been extensions to the architecture in other systems, but at a high level, that's the approach when people say they're building a modern OLAP or lakehouse system. And when we see Snowflake next week, and certainly Spark today, it's going to look very similar in bits and pieces.

So to understand the Databricks Photon paper that you guys read, we first want to talk about the history of how Spark came about and what led them to building the accelerator the way that they did. Flash back to the late 2000s, when MapReduce was taking off. At the same time there was this project called Spark at UC Berkeley, where they were trying to build a better version of the MapReduce model. Some parts were very similar, like separating compute and storage, where Hadoop had HDFS and then the executor nodes, but they also had support for iterative algorithms in a way you couldn't do very easily in Hadoop, which allowed you to take multiple passes over the same dataset within your program.

When they wrote Spark, they wrote it in Scala because that was the hot thing, the language everyone was most excited about around 2010. Now it's obviously Rust; before that it was Go, and before Go it was Scala. Python's always been there. Because Spark is written in Scala, that means it's going to run on the JVM. Pretty much all the papers we've talked about this entire semester describe systems written in C++. Rust is becoming more common now, but C++ is the main thing we've been focused on. There are still a lot of database systems out there written in Java, mostly because they came around in the late 2000s and early 2010s. We've avoided implementation discussions about the programming languages used in these systems, but for today's paper, as you guys read, because Spark is written in Java or Scala running on the JVM, that matters, and it's going to constrain what they can do in their implementation.

So the first version of Spark only supported a really low-level API based on these RDDs, or resilient distributed datasets. Think of it as the output of some task's computation wrapped in this RDD abstraction. Later on, they added support for the DataFrame API, an even higher-level abstraction in the programming language, and this is when Spark really started taking off. So instead of running your Pandas programs and accessing DataFrames in Python on one machine, you could now run this in Spark and have it distributed. But at this point, the first version of Spark didn't support SQL; it was all these other programming languages. Once some computational framework that can process large amounts of data becomes popular, people are going to start asking for SQL, and that's what people wanted in Spark. So the stopgap for the first SQL support within Spark was this thing called Shark, where they forked Facebook's Hive middleware, which took SQL queries and converted them into MapReduce jobs.
They forked that and then took SQL queries and converted them into Spark programs using the Spark API. This approach was limited because you could only use SQL for the initial invocation of the program or query you wanted. You couldn't intermix Python code or Spark API code with SQL in different parts of the program. You couldn't take the output of a SQL query, feed that into a Scala program, and then feed that program's output back into SQL, all within the same program. You couldn't mix SQL and API calls at the time.

The other challenge they faced was that Shark was based on Hive, and the Hive query optimizer was really designed for picking the best query plan to generate MapReduce jobs, so they had to contort it to make it work in Spark. But Spark had a more expressive API: you could do more things in your program than you could in MapReduce, which only exposed two functions, Map and Reduce. So the queries they were generating through Shark were not as efficient as what one could build by hand, because Hive's optimizer just wasn't aware of the other things it could do with Spark programs. So again, this was a stopgap solution and the initial approach to adding SQL support in Spark.

I think I mentioned last class as well that there was this other system called Impala that came out of Cloudera. And Cloudera was actually the one shipping the most Spark in the early days, because people were asking for it; it was part of the standard Cloudera distribution you would get. But when people started asking for SQL support in Spark, Cloudera wouldn't provide them Shark, because they wanted people to use Impala instead; they made money on that. So even though Shark was available as a sort of add-on to Spark, it wasn't available in Cloudera's distribution.

Then in 2015, the Databricks team put out Spark SQL, and this was native integration of SQL directly in the Spark runtime. Now when you downloaded Spark, you got Spark SQL, and Cloudera couldn't excise it out: when they shipped Spark, they also shipped Spark SQL with it. That basically undercut any need to use Impala. It was like a Trojan horse that helped destroy, or not destroy, because Cloudera's still around, but helped undermine Cloudera, and Databricks became the big company. Cloudera went IPO, but then they got taken back private by private equity, and I think Databricks helped facilitate that in some ways. Another problem was that despite having "cloud" right there in the name, Cloudera clearly didn't have a cloud offering for Hadoop. Amazon was making more money on Hadoop than Cloudera was, even though Cloudera had the inventor of Hadoop and a bunch of other key contributors. But that's an aside.

All right, so the way Spark SQL worked was that they would still rely on the row-based architecture for feeding data in from the lower levels of the query plan, but when they passed data from one operator to the next, they would at least now start storing it in columnar buffers or vectors. They would support dictionary encoding, RLE, bit-packing, all the compression stuff we covered before. And they also introduced an in-memory shuffle phase to go between the query stages, between the different pipelines.
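To make that columnar-buffer idea concrete, here's a minimal sketch of a dictionary-encoded column in C++. The names here are mine for illustration; Spark SQL's actual buffers are Java classes, but the technique is the same:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Store each distinct string once in a dictionary and keep a small
// integer code per row. The codes can then be bit-packed or run-length
// encoded on top of this for further compression.
struct DictStringColumn {
  std::vector<std::string> dictionary;  // distinct values, one entry each
  std::vector<int32_t> codes;           // one code per row, indexes dictionary

  void Append(const std::string& value) {
    // Linear scan for brevity; a real system would use a hash map.
    for (size_t i = 0; i < dictionary.size(); i++) {
      if (dictionary[i] == value) {
        codes.push_back(static_cast<int32_t>(i));
        return;
      }
    }
    dictionary.push_back(value);
    codes.push_back(static_cast<int32_t>(dictionary.size() - 1));
  }

  const std::string& Get(size_t row) const { return dictionary[codes[row]]; }
};
```

The point is that the operator downstream only touches small integer codes until it actually needs the string values, and those codes scan and compress a lot better than the raw data.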
The way Spark SQL actually worked was that they weren't doing full query compilation the way we saw in Hyper; they would just compile the WHERE clauses. They did this by converting the WHERE clauses into Scala ASTs, and then there was an internal mechanism in Scala, quasiquotes, that would compile those down into JVM bytecode that you could link in and invoke. So they were doing partial query compilation to help speed things up. But there were other challenges they were going to face.

I'll also point out that if you read the Spark SQL paper, they have this little blurb where they talk about their in-memory shuffle. In the original version, they just relied on the OS page cache to keep things in memory and used it to spill to disk when necessary. But you'll see in Photon that they got rid of that, because the OS was just messing everything up and it became non-scalable. So there's another good example of why you don't want to let the operating system ruin your life or do things for you; you want the database system to do everything by itself, right? The overhead of the syscalls, the overhead of the journaling in the file system, all of that was causing problems.

The other challenge was that doing codegen for the WHERE clauses and then compiling it became a problem, because there's a limit on how big a program you can codegen and run directly inside the JVM. And they were finding that when they were running SQL queries, they were becoming CPU bound rather than disk bound; there was so much computational overhead in this approach that it became problematic. And again, this is roughly the 2015 timeframe. Disks are super fast now, but at the time, you would expect your data warehouse or OLAP system to always be disk bound, not CPU bound, right? Nowadays the pendulum has swung the other way, and the CPU is becoming the bottleneck, not the disk. But even back when they were building this, the disk was not what was slowing them down. It was the CPU.

So in the Photon paper, they point out a bunch of observations about this Spark SQL attempt. The first one, as I said, is that the workload was becoming CPU bound just because of all the computational overhead, and they spent a lot of time trying to work around the JVM and its memory-managed environment. I don't know how much you know about Java programming, but when you allocate something, like when you call new on an object, it goes on the heap, and the JVM's garbage collector keeps track of references to it. Every time you run a garbage collection pass, it checks whether anybody is still pointing to an object, and then it can free the memory. There's no direct malloc call in the JVM. The way you get around this is off-heap memory: within the JVM, in Java code, it's like unsafe in Rust. You can allocate memory in some outside region that's still owned by the process, but the garbage collector isn't going to look inside of it to see whether it can free anything. So they had to do a bunch of engineering to move stuff out to off-heap memory so the garbage collector wouldn't slow them down, and that becomes problematic. It's a lot of engineering effort just to get around the internals of the JVM.
The other challenge they faced, the JIT compilation piece, was particularly tricky, because in order to get the best performance out of it, you had to have people that really knew the internals of the JVM. And they found they had trouble scaling up the engineering team to optimize the Spark SQL engine, because it's a really niche skill; few people have experience with the internals of the JVM, versus C++ where a bunch of developers can jump in and start writing code. They had other limitations on how big the generated code itself could be. The Photon paper you guys read mentions that if a table had more than a couple hundred columns, which is actually pretty common, the JVM would just say this is too big to compile and fall back to running in the slow, interpreted mode.

So they were basically dealing with this remnant of the early 2010s: OK, let's build this on the JVM, because that's how people were building scalable big-data systems at the time. But as the hardware landscape changed, and as the demands of what people wanted to do with the database system changed, they were hitting the upper limits of what the JVM could support. So they had to come up with a way to overcome this. That's the background for the Photon paper you guys read.

As I said last class, Photon isn't a standalone system like Dremel, or like Snowflake which we'll read about next week. Rather, it's a library that's meant to embed inside an existing database system, in this case the Spark runtime, what they call the Databricks Runtime. And it's even lower level than the Velox paper we read before. Velox provides not a full execution engine, but pretty much all the pieces or components you need for one, like thread pools, all the operators, and so forth. Photon is even smaller than that. It's really individual pieces, what they call kernels, or tasks, or operators, that you can plug in at a really fine grain within the query plan itself. But all the threading, all the memory management, all of that is going to be handled by something else, in this case the Databricks Runtime. We'll see alternative approaches to accelerating Spark at the end that don't do it the way Photon did, injecting the Photon-accelerated operators within the query plan at specific locations. The alternatives we'll see just take the physical query plan, ignore all of Spark, and run it somewhere else using DataFusion or a whole other database system.

So the high-level design goal of Photon is that they wanted to integrate with the Databricks Runtime, the proprietary version of Spark, without throwing out all of the existing infrastructure. And as they expand the scope or capabilities of what Photon supports over time, they can incrementally add those pieces in to override the ones in the Databricks Runtime, again without disrupting any of the users. You don't want someone to run a query one week on an early version of Photon that maybe doesn't have certain tasks implemented and get one result, and then the next week run the exact same query on the exact same data, but now Photon has been upgraded and they get back different results.
They want to avoid all of that and make this completely transparent to their users, other than charging you more for it. So they want to support all of the semantics and capabilities of the earlier SQL engine and the DataFrame API, but without any changes in actual results or high-level behaviors. And they also want to handle the legacy artifact that the Spark runtime hands out individual rows one at a time, and convert that to a column-oriented, vector-oriented implementation to speed things up.

The basic way it works is that some portion of a query will run in Photon, but if we recognize that for what the query actually wants to do, we don't have a Photon-accelerated implementation of that task or operator, then we just fall back to the slow path of the original Spark runtime. We'll talk in a second about how they do this, but the way you do this in the JVM is using what's called JNI, the Java Native Interface. It basically allows Java code to invoke C++ code: the thread that's running in the JVM goes down into your C++ code and can do whatever it wants. In the paper they say the overhead of making a JNI call is roughly the same as a virtual function lookup in C++, I think around 23 to 25 nanoseconds per call. It's not free, right? But because we're just passing around memory buffers and pointers within the same process, it's not like we have to copy things from one process to the next.

Now, for the paper you guys read, what's super awesome about it, at least for me as a professor, is that when you look at the author list, there are a lot of CMU people here, right? Prashanth Menon was my number-one PhD student who actually worked on codegen and query compilation stuff; he worked on this system. Utkarsh and Arvind were master's students here; they took 721, the same class you guys are taking. And they went off to Databricks. Ryan Johnson did his PhD at CMU; I think he graduated in 2010, so before I showed up, but I'm pretty sure he took 721 around 2006, the early version of it. And there are a bunch of other people in the bibliography of the paper: Alison Wang is my former student, and they cite a blog article she wrote when she was at Databricks working on this stuff. So there's a lot of CMU DB alumni at Databricks working on this project, which is super cool. This is basically why you should take this class, but, you know, we're already like 12 weeks in, so you're here.

All right, so here's the high-level overview of what Photon's gonna have. Again, some of this is in the context of the overall Spark system, like the shared-disk, disaggregated storage, right, the separation of compute and storage. But the part we're gonna focus on is the pull-based vectorized execution and the pre-compiled primitives. They do shuffle-based distribution just like in Dremel, nothing really radical there. They do sort-merge and hash joins. And they also talk a little about how they're doing query optimization and adaptivity.

So the overall architecture looks a lot like Dremel, right? We have some distributed file system. A query shows up. It gets sent to what they call a driver, which Dremel called the coordinator. And in Snowflake next week,
they'll call this, I think, the compute service or something like that. At a high level, it's some node that's responsible for taking the query, generating the query plan, and handing it off to some scheduler to make sure the tasks get executed, right? In the first stage here, the executors fire up and start pulling data from the distributed file system. They'll run some kind of computation on it. Then they start writing it out for the shuffle phase; in the default version, like if you just download Spark, they do a local in-memory shuffle store. They basically write to a hash table that's local to them, and then other workers can pull from it in the next stage. But as we also saw last class, there's the ability to write out to a remote shuffle service instead, like Apache Uniffle or Apache Celeborn. For our purposes, it doesn't matter. Then in the second stage, the executors pull from the previous stage. And if it's not a remote service, the first-stage workers have to stick around until the next-stage workers fire up. Whether you can be clever and say, okay, this task is gonna land on the same worker as that one, so I can reuse my local memory, that's done outside of this, by higher-level scheduling. And then we fire up the next stage, produce the final result, and send it out. Again, at a high level this looks a lot like Dremel, and pretty much all the systems we'll see. They may not always have this exact shuffle implementation, but basically it works the same way: you've got to get data from the previous stage.

So the thing we're talking about today is the Photon piece, which runs inside here on every single executor, right? It's not a standalone service, not a standalone process. Everything runs inside the executor and gets invoked by the Java code.

Photon is a pull-based vectorized engine, and it leverages what the paper calls operator kernels. When we talked about Vectorwise, they called them primitives. They're basically the same thing: some pre-compiled code that can do a small computation or task on a vector of data. Since these are written in C++, they're templated based on the data type, and they have different variations to deal with whether I'm doing a less-than, a greater-than, or whatever the computation or task I want, and these things are all generated ahead of time and compiled into the system by the developers, right? We're not doing codegen on the fly for anything here. The way they avoid the cost of jumping through all these function pointers is either to amortize it across multiple tuples, operating on a whole vector at a time as we saw in Vectorwise, or to do something they call expression fusion, which we'll see in a second, where they can combine multiple primitives that they know are being used consecutively by a bunch of queries.

So part of the reason I had you guys read this paper, other than that Databricks is obviously a huge company and this is a pretty interesting system, is that it's actually a really good paper.
It's one of the better industry papers we'll read this semester, and one of the better ones overall, because most industry papers out there just say: hey, this is what we did, ta-da, look how great we are. Maybe they'll describe the details of what they did, but they don't really discuss what led them to those engineering decisions. Whereas the Photon paper, although it doesn't provide numbers for every "we tried this, we tried that," has these nice little anecdotes about the things they considered and looked into before they ended up making the final choices presented in the paper, right?

The really important one they talk about is why they decided to build a vectorized engine based on the vectorized model, the X100 paper, rather than a just-in-time compilation engine like Hyper, even though they had already done JIT compilation in Spark SQL. What they found was that their developers were spending most of their time writing the tooling you need to deal with a JIT engine: debugging it, profiling it, understanding what's going on. Because, again, when you crash in one of these compiled engines, you just land in the assembly; there's no lineage or provenance information to say what part of the code in your system generated the code that crashed, right? So they had to write a bunch of tools to figure these things out. Whereas if you go with the pre-compiled primitive approach based on Vectorwise, at first the engine may not be as fast as the codegen system, but you can just iterate on it over and over, more people can work on the project without specialized training, and you can get nearly the same performance, and sometimes beyond what you could get in a Hyper-style compilation engine.

That little tidbit in the paper I thought was really interesting, and at this point I agree with them. Having gone through the experience of building a Hyper-style JIT compilation engine here at CMU, I wouldn't do it again, because although you students here in 721 are super smart, only a small number of students could actually work on it. Even I struggled with it, because when you crash, you just land in assembly. We tried to build an interpreter to help us walk through the opcodes and debug programs, but what are we doing? We're building an interpreter, we're building debugging tools, rather than just writing the code to make the system go fast, right? So again, that piece of the paper I thought was really important. And if I were building a new system today, well, I wouldn't have to deal with the JVM, because you shouldn't use the JVM to build a new system, but everything else they talk about is, I think, still relevant and important.

The paper also mentions that auto-vectorization works pretty well because most of the primitives are pretty simple, but when necessary, they do write some intrinsics to force the use of SIMD in the right places. So all the operators support this GetNext function that produces a column batch.
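As a rough sketch of what that pull-based interface looks like (the class and member names here are my own for illustration, not Photon's actual API):

```cpp
#include <cstdint>
#include <memory>
#include <vector>

struct ColumnBatch {
  // One vector per column; just int64 columns for brevity.
  std::vector<std::vector<int64_t>> columns;
  // Position list: offsets of the rows that are still "active" after
  // upstream filters. Dead rows stay in place; we never compact.
  std::vector<uint32_t> positions;
};

class Operator {
 public:
  virtual ~Operator() = default;
  // Pull-based: each call produces the next batch, or nullptr when done.
  virtual std::unique_ptr<ColumnBatch> GetNext() = 0;
};

// A filter pulls a batch from its child and shrinks the position list;
// the underlying column data passes through untouched.
class FilterOperator : public Operator {
 public:
  explicit FilterOperator(Operator* child) : child_(child) {}

  std::unique_ptr<ColumnBatch> GetNext() override {
    auto batch = child_->GetNext();
    if (batch == nullptr) return nullptr;
    std::vector<uint32_t> survivors;
    for (uint32_t pos : batch->positions) {
      if (batch->columns[0][pos] > 0) {  // stand-in predicate
        survivors.push_back(pos);
      }
    }
    batch->positions = std::move(survivors);
    return batch;
  }

 private:
  Operator* child_;
};
```

Note the positions vector hanging off the batch; that's the position list I'm about to talk about.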
And they keep track of which tuples are relevant, or still active, as batches move up the query plan, using a position list. Remember, we talked about this before, I think when we covered Velox and the execution engines at the beginning of the semester. A position list is just a vector that keeps track of which offsets are still valid in the batch of vectors you're passing along, right? Tuples may get filtered out, and rather than compacting the vector, you just pass along the dead tuples too but keep track of which ones are still active. So this is literally just an offset within the batch of columns that you're looking at. And recall that the alternative approach is to maintain a bitmap of active rows, where a one or a zero says whether the row at a given offset is active or not. The paper talks about how they considered both approaches, but they found that except for the extreme cases where you're passing along all the tuples, the position list was preferable and performed better than the bitmap.

And if you follow this citation here in the paper, it says recent work confirms our conclusions. That's our paper, right, written here by Amadou, who was a 721 student. He's now a PhD student at MIT, and the paper is with Todd Mowry and some other people you know, Wan Shen Lim and Matt, right? This is Amadou's master's thesis. He basically did a brute-force search, trying out all the different combinations of position lists versus bitmaps, under different workloads, on different hardware, with SIMD and without SIMD, and he found that the position list is better overall. So this is an artifact of 721, which is cool.

As I said before, they're not doing the Hyper-style code generation and compilation. That also means they're not doing the Hyper-style operator fusion, where you have multiple operators in your pipeline and you combine them together. They avoid this for two reasons. One, obviously, they're not doing compilation because they want to use pre-compiled primitives, since that's easier for engineering. But the other interesting thing they talk about, which I had not considered before reading this, is that in order to give their users understandable debugging and profiling information, they have to show them why a query ran for a given amount of time. If you do the Hyper-style operator fusion, it's very hard to convey to the user why their query might be running slow or where it's spending most of its time. I'll call that vertical fusion: within the query plan, you're vertically combining the different operators. They instead choose something called expression fusion, or horizontal fusion as I'd call it: within a single operator, horizontally inside of it, I want to fuse things together. I'll go through examples of each of these one by one, but first let's see this again from Hyper, to see why they don't want to do it.

So this is the query we had before in Hyper. You can see all the different pipelines here; Hyper does a push-based model, and they want to fuse together all the operators within a single pipeline. If you focus on number four here, it's the scan on C, then you probe this hash table and probe this other hash table, so it's basically three for loops nested together.
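To make that concrete, here's roughly the shape that fused pipeline takes: scan C, probe the first hash table, probe the second, all compiled into one blob of nested loops. This is an illustrative sketch, not Hyper's actual generated code:

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

struct TupleC { int64_t key_a; int64_t key_b; int64_t val; };

// Pipeline #4 fused into a single function: a crash or profiler sample
// lands "somewhere in the loop nest," with no clean mapping back to the
// scan/probe/probe operators the user sees in the query plan.
void FusedPipeline(const std::vector<TupleC>& table_c,
                   const std::unordered_multimap<int64_t, int64_t>& ht_a,
                   const std::unordered_multimap<int64_t, int64_t>& ht_b,
                   std::vector<int64_t>& out) {
  for (const TupleC& t : table_c) {                    // scan C
    auto [a_lo, a_hi] = ht_a.equal_range(t.key_a);
    for (auto a = a_lo; a != a_hi; ++a) {              // probe join #1
      auto [b_lo, b_hi] = ht_b.equal_range(t.key_b);
      for (auto b = b_lo; b != b_hi; ++b) {            // probe join #2
        out.push_back(t.val + a->second + b->second);  // emit result
      }
    }
  }
}
```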
So one issue is that this would be difficult to do using primitives. Well, I guess it's not difficult exactly; it's just that you have to pass around batches, and you wouldn't go a single tuple at a time. But now if you have to tell a user, here's the query plan operator that took the most time in your query, you can't show them a bunch of these for loops, because they're not gonna know what any of it is. All they know is: I wrote some SQL query, here's a query plan that looks like the query plan in every other system. Now you need a way to excise out exactly how much time was spent in one for loop versus another and map that back to the different operators, and that's not really something you can easily do, right? So again, for usability reasons, to be able to convey to users what's actually going on, you can't take this approach.

So instead they do what they call expression fusion. The idea is that when you see two primitives being used over and over, one after the other within an operator, you just generate a single primitive that encompasses both of the computations. The example they have is a query doing a lookup on a date within some range. Well, you take the BETWEEN operator and rewrite it as a less-than-or-equal-to and a greater-than-or-equal-to, right, with a conjunction. Without expression fusion, you'd have two separate primitives, one for each of the two expressions, and then you'd take the intersection of the offset lists to figure out which tuples actually match both, right? Instead of making two different function calls over and over, expression fusion says: I'll codegen, or not codegen exactly, I'll create ahead of time a single function that does both. And it's templated based on the type, so these are for dates, but you'd generate ones for all the other possible data types. And then for the expression that I have, it's just a single function that does both of them.

Now, the Vectorwise paper you guys read is from like 2006. Of course that was an academic project, but even Vectorwise in the early days was meant to run on-prem; it wasn't a cloud system. In that world, you had to kind of guess what kinds of queries your users were running and which primitives should be combined together. But when you're a cloud company like Databricks, you see all the queries, so you just do analytics on top of that and figure out what things are being used together, and that guides you, using your own data, to decide how you want to optimize your own system. We'll see another example later on when we talk about Redshift. All these systems are really designed for read-only workloads, but the Redshift people looked at their own logs, their own telemetry, and found that a lot of people were unexpectedly running update queries, so they spent a lot of time making update queries go fast, and they wouldn't have known to do that unless they looked at their own data. These database-as-a-service providers are in a unique position, much different from how people built database systems over previous decades, because now they know exactly what queries people are running and can decide how to optimize for them.
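To make the BETWEEN example concrete, here's a hedged sketch of those primitives in C++. The function names are my own for illustration; the real ones are pre-compiled ahead of time and templated on the value type, as described above:

```cpp
#include <cstdint>
#include <vector>

// Unfused: two separate primitives, two function dispatches, two passes.
template <typename T>
void GreaterEq(const std::vector<T>& col, T lo,
               const std::vector<uint32_t>& in, std::vector<uint32_t>& out) {
  for (uint32_t pos : in)
    if (col[pos] >= lo) out.push_back(pos);
}

template <typename T>
void LessEq(const std::vector<T>& col, T hi,
            const std::vector<uint32_t>& in, std::vector<uint32_t>& out) {
  for (uint32_t pos : in)
    if (col[pos] <= hi) out.push_back(pos);
}

// Fused: one dispatch and one pass over the vector for the common
// `x BETWEEN lo AND hi` pattern.
template <typename T>
void BetweenFused(const std::vector<T>& col, T lo, T hi,
                  const std::vector<uint32_t>& in, std::vector<uint32_t>& out) {
  for (uint32_t pos : in)
    if (col[pos] >= lo && col[pos] <= hi) out.push_back(pos);
}
```

And which pairs of primitives are worth fusing like this is exactly the kind of thing you can mine out of your own query telemetry.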
Again, rather than guessing, or, worst-case scenario, having to talk to people, which is always the worst.

All right, so the next interesting thing they do with Photon is how they manage memory. Spark already has a memory manager, already has memory pools, and because Photon is not a standalone system, you've got to run it inside of Spark. You just don't want to malloc anywhere inside the C++ code, because, thinking about running this as a service for users, you need a way to keep track of who's using the memory, right? So rather than writing a C++ memory manager and having it run alongside the Java one, they just rely on the existing Java-based memory manager. It can allocate blocks of memory, keep track of them as if they were any other memory in the JVM, and then hand those pointers off to the C++ code to use. So they rely on the memory manager in Java, and they get the reporting, profiling, and all that infrastructure for free.

The interesting thing they do, though, and I like this paper because none of the other papers talk about this, is that because they're running in the lakehouse environment, on files they've potentially never seen before, they don't have accurate statistics, so they need a way to be more adaptive and dynamic about operators needing more memory than originally expected. The way they do this is that rather than having every operator decide on its own when to spill memory to disk, when an operator needs memory or is running out, it just goes and asks for more. And now it's up to the memory manager up in Java to figure out how to give it that memory, to keep the query from failing, and whom to take the memory from. What they do is basically get a list of all the currently running tasks and how much memory each has allocated, find the one with the least amount of memory that can still satisfy the request, and tell it to release that memory so it can be handed to the other operator. So all the operator implementations have to support spilling to disk, pausing, and coming back to fetch things from disk as needed. That way everything integrates seamlessly without worrying about running out of memory or tracking memory in different places. And they had to do this because we don't know how many tuples may come out of any one operator, since we don't know the selectivity of any of the filters running in it.

Okay, so the Photon paper itself didn't really talk about query optimization, other than maybe this last bit here. But the earlier Spark SQL paper talks about their query optimizer, this thing called Catalyst, and we had them come give a talk a few years ago and describe what they're actually doing. So I wanna talk a little bit about what Catalyst does, both in the context of Spark and originally for Spark SQL, and then what they specifically added to Catalyst to work with something like Photon.
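Before we get into Catalyst, just to make that reservation-and-spill protocol concrete, here's a minimal sketch. The interface names are my own assumptions; the real Spark and Photon memory managers are more involved than this:

```cpp
#include <cstddef>
#include <vector>

class SpillableOperator {
 public:
  virtual ~SpillableOperator() = default;
  virtual size_t ReservedBytes() const = 0;
  // Every operator implementation must know how to spill its state to
  // disk, pause, and fetch it back later.
  virtual void SpillToDisk() = 0;
};

class MemoryManager {
 public:
  explicit MemoryManager(size_t budget) : available_(budget) {}

  // An operator that needs memory just asks; the manager picks a victim
  // to spill rather than making every operator handle pressure itself.
  bool Reserve(SpillableOperator* who, size_t bytes) {
    while (available_ < bytes) {
      // Victim: the operator with the smallest reservation that still
      // frees enough memory to satisfy this request.
      size_t needed = bytes - available_;
      auto victim = operators_.end();
      for (auto it = operators_.begin(); it != operators_.end(); ++it) {
        if ((*it)->ReservedBytes() >= needed &&
            (victim == operators_.end() ||
             (*it)->ReservedBytes() < (*victim)->ReservedBytes())) {
          victim = it;
        }
      }
      if (victim == operators_.end()) return false;  // nothing to spill: fail
      available_ += (*victim)->ReservedBytes();
      (*victim)->SpillToDisk();
      operators_.erase(victim);
    }
    available_ -= bytes;
    operators_.push_back(who);
    return true;
  }

 private:
  size_t available_;
  std::vector<SpillableOperator*> operators_;
};
```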
So again, because they don't have a robust cost model, they have to be very adaptive and make the best of a situation where they don't have accurate statistics about what the query is actually gonna do. We'll see in a second how they try to improve their lot by using something like Delta Lake, so they can see the data when it actually arrives and collect statistics on it. But in the worst case, if you're accessing a bunch of files on S3 you've never seen before, you have to handle that.

So it's a Cascades optimizer where they have predefined stages that are handwritten or crafted by the engineers, similar to what we saw in SQL Server, where you run from one stage to the next and do the transformations as needed. As expected, they have logical-to-logical transformations and logical-to-physical transformations, but what's unique about Catalyst compared to the textbook definition of Cascades is that they also put in physical-to-physical transformations. And this is where they inject the Photon logic: they have a physical plan generated by the previous stage that could run in Spark SQL as is, but then they go back, do another pass, and say, okay, where can I inject the Photon operators as needed? SQL Server does something sort of similar, which we didn't really talk about: it runs through the regular Cascades optimizer and generates a physical plan, and then the engines beyond plain SQL Server, the variations of it that support distributed queries, do another pass that converts the physical plan to inject distributed operations. So they convert a single-node physical plan into a distributed physical plan. And they do that in this actual last step here.

So again, there's nothing really different in the first two stages here; this last one is the one that's relevant to Photon. The way they do it is an additional pass over the query plan, but rather than going top-down the normal Cascades way, they go bottom-up and convert it to a Photon-specific physical plan. The idea is to identify which operators in the original physical plan, which would normally run in Spark SQL, can be replaced with the Photon-specific operators. But we need to make sure we actually get a lower cost, and part of that cost calculation is reducing the number of times we switch back and forth between the JVM and C++. If you're going bottom-up, in most queries there's more data being passed from one operator to the next at the bottom of the plan. So you start down there, try to get into the Photon engine as early as possible, and avoid weird transitions like going back to the JVM halfway through and then coming back down to C++. You want the longest stride going up where you stay entirely in C++. And this happens during query execution, sorry, query optimization, before you do any execution, because at this point the system knows which operators have Photon implementations and can replace the Spark SQL ones with them. This is not something you need to figure out at runtime.

So this is an example from the paper, or another talk. Here's the original plan, something real simple: scan a file, filter, shuffle, and produce some output.
You start from the bottom to the top. Say the file scan, for whatever reason, is still in Java. Then you know you can replace the shuffle and the filter steps with the Photon implementations. At this point we're running in Java, so we use JNI to drop down into C++, and there's some landing area they call an adapter that's responsible for converting the row-oriented data coming in from the Java world into the columnar format. Then you do all the vectorized processing up through Photon. And when you've got to go back out to Java, they have this transition operator that's responsible for converting the data back into the form expected by the Java code. These adapters and transitions aren't horribly expensive, but you're basically pivoting, copying, and moving data around, so you want to avoid doing it as much as possible. So this side is all the Java stuff and this side is all the C++ stuff. Again, they take the physical plan, take a pass through it, and decide where they can inject the Photon-specific capabilities.

All right, so the paper also talks about the two styles of adaptivity they have; they refer to them as macro and micro, which are their terms. The macro, query-level adaptivity is the stuff we talked about before, where we take the statistics collected while running the query, at the shuffle stage, and decide how to modify the next stage accordingly: expanding or reducing the number of workers, potentially moving things around, switching from a shuffle join to a broadcast join, and so forth. That's basically all the stuff we talked about before with Dremel. But then they have this other style, batch-level adaptivity, where within an operator, as we're actually running the query plan itself, we can decide within a single tuple batch or vector whether we want to change which implementation of that operator we use, based on the data we're seeing, right? And this is something Photon just does for you automatically on the inside while it's running; it's not something the query optimizer has to be aware of. This is very similar to the Velox optimizations we saw way back in lecture five. I'll show examples, but it's things like: I thought my data was gonna look this way, but it actually looks another way; there are no nulls, or the strings are ASCII instead of Unicode, and I can switch which operator implementation I want to use. In Photon's case, you have to have that variation of the primitive already pre-compiled, right? Obviously, because they're not code-genning on the fly.

So this is basically what I already said: for the query-level optimizations, after each shuffle phase completes, you can decide what you want to change later on, and we already talked about shuffle versus broadcast joins with Dremel. For optimizing skew joins, the idea is that you recognize the data has a skewed distribution, and you can change the number of executors you give to one side versus the other. And then they can also handle overflowing partitions, but they do something slightly different from what we saw in the Dremel case.
Remember, in the Dremel case, you recognize that a partition is starting to fill up, so you stop, do another round of recursive partitioning, and start writing tuples out to new partitions to relieve the one that's overflowing. They do something slightly different, which I'll show on the next slide. And again, all of this is because in the lakehouse environment we don't have any real statistics.

The way they handle overflowing partitions is that they actually can't expand the number of partitions. I don't know whether there's something fundamental about Spark that prevents this or it's just a design choice, but to avoid the problem of a partition running out of memory or space, they just allocate way more partitions than you think you could possibly need, right? So let's say my worker is filling up the partitions, and I have these three partitions that are underutilized, because I've allocated so many of them. What they'll do is recognize, before the partitioning phase completes, that these three are underutilized, make a new partition, have those writers put their data into it, carry the others over, and then blow away the memory from the previous partitions, three and four, right? So it's achieving the same thing Dremel did, but from a different direction: Dremel adds partitions dynamically on the fly, and Spark takes partitions away. The end result is the same; they just do it in a slightly different manner. I don't know whether one is better than the other, right? Because at the end of the day, you still need an extra step to move data into the new partitions. In the Dremel case, you had that redistribute operator you had to run after the recursive partitioning. In this case, you've got to funnel the underutilized partitions into the new one afterwards. So there's no free lunch; it's not like one is entirely better than the other. I think there's some aspect of the Spark runtime that requires them to over-allocate and take partitions away, rather than add them later on.

For the batch-level adaptivity, again, a lot of these are gonna look like the stuff we talked about in Velox, right? The first one: if you recognize you're processing a string column but the data is all ASCII characters instead of Unicode, you can switch to a faster implementation of your string functions, because you're dealing with one-byte ASCII characters instead of up to four-byte Unicode characters. There's more variability in Unicode, and that means more branch mispredictions in the CPU. If you have the fast-path version, you just use that. If they recognize that the vectors being passed from one operator to the next within Photon are pretty sparse, they just compact them down to smaller memory sizes. This helps when you do something like probing a hash table, because now all the relevant data is co-located, all within a single cache line, and you can do the probes more quickly. They can also leverage the templated nature of the primitives: if they identify that a vector has no nulls, they avoid having to check the null vector, and likewise, if there are no inactive rows, they can avoid the position-list lookup as well.
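The code in the paper looks roughly like this. This is my reconstruction of the pattern, so treat the exact names and types as approximate:

```cpp
#include <cmath>
#include <cstdint>

// The two Booleans are compile-time template parameters, so the branches
// below fold away in each of the four instantiations.
template <bool kHasNulls, bool kAllRowsActive>
void SqrtKernel(const uint32_t* pos_list, int num_rows, const double* input,
                const uint8_t* nulls, double* out) {
  for (int i = 0; i < num_rows; i++) {
    // If every row is active, skip the position-list indirection entirely.
    int row = kAllRowsActive ? i : pos_list[i];
    // If the vector has no nulls, the null check compiles away too.
    if (!kHasNulls || !nulls[row]) {
      out[row] = std::sqrt(input[row]);
    }
  }
}

// All four variants are compiled ahead of time; at runtime the engine
// inspects the batch and dispatches to the cheapest one that applies.
void Dispatch(bool has_nulls, bool all_active, const uint32_t* pos_list,
              int n, const double* in, const uint8_t* nulls, double* out) {
  if (!has_nulls && all_active) SqrtKernel<false, true>(pos_list, n, in, nulls, out);
  else if (!has_nulls)          SqrtKernel<false, false>(pos_list, n, in, nulls, out);
  else if (all_active)          SqrtKernel<true, true>(pos_list, n, in, nulls, out);
  else                          SqrtKernel<true, false>(pos_list, n, in, nulls, out);
}
```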
So again, it's a templated C++ function where you have one Boolean that tells you whether there are any nulls and another that tells you whether all rows are active. You write the function body once, as is, but at compile time you pass true or false for each of these and generate different variations of the same function. Then at runtime you can say, okay, I know I don't have any inactive rows, so I invoke the all-active variant. The check just gets thrown away at compile time, because you know it's a compile-time constant that's always true. So you just use i, right? You set the row index to i, and you don't have to do the lookup in the position list. It's a low-level optimization, but if you're doing this for billions of tuples over and over, billions of batches, it definitely adds up to big savings. Likewise for the has-nulls flag: if you know it's always false, you don't have to do the lookup in the null vector.

I forget the exact numbers they gave for these, but for the compaction one, I think they do it for joins and it gets something like a 1.5x speedup on the probes. And that's just the probe phase; there's a lot of other stuff going on when you run a query, so it may not be the tallest pole in the tent, but that's a pretty significant speedup just from the compaction step. I forget what numbers they gave for this one here.

All right, I'm gonna show one graph. This is what they report as the improvement you get with Photon enabled in the Databricks Runtime versus plain Spark SQL, for TPC-H at a rather large data size, I believe scale factor 3000. Across the board, the red is just faster. In some cases the improvement is quite significant, like for Q1 here; for other ones, maybe less so. I think Q1 is like the simplest one; it's just ripping through a scan, right? So in that case, I'm assuming most of the operators in the query plan all run in Photon. And in Databricks, I think this is still true, they charge like 20% more: you specify whether you want your job to run as a Photon-accelerated query versus regular Spark SQL, and they charge you more for it because you're getting better performance. It's kind of a win-win for them. They had to spend the engineering time to actually build the thing, but your query runs faster, so it uses less compute, probably about the same amount of memory, and you pay them more for it. Smart move.

So this is TPC-H, whose precursor from the early '90s was the original OLAP benchmark. In the paper, they also talk about getting official TPC-DS results; TPC-DS is the successor to TPC-H, from 2006. They made a big deal when this paper came out in 2021 that they had official TPC results, right? So this is the webpage for the TPC, the Transaction Processing Performance Council. They're sort of the independent arbiter, the referee, of database benchmarks, and TPC-H and TPC-DS, those benchmark specs, are defined by this consortium.
And then if you want to be on the official leaderboard, you go get certified by the TPC people to say your implementation matches the spec requirements, and then you can be listed here. So Databricks got listed in 2021. They're still listed; they're still number one for TPC-DS at 100,000 gigabytes, right? The previous record holder was Alibaba from 2019. And it's not just the performance of the queries themselves, like how many queries per second you can do; they also check how much you'd be paying for this if you paid for it as a service, right?

So Databricks was super excited when this came out. They made a big deal about it. There were a bunch of articles at the time talking about how Databricks had official, audited TPC-DS results and how they were gunning for Snowflake, right? For this one here, they actually asked me for a quote, and I wrote something like: at the enterprise level, maybe some CIO is gonna care about what your official TPC ranking is, but they don't make sales that way. That's actually not the full quote. What I said after that was "and only old people care about TPC-DS results," and they cut that off.

So what do I mean by this? I mean, you've heard of TPC-H and TPC-DS because you're reading the papers and seeing these things, but you may not have known that there's actually a consortium defining them. It comes out of the 1980s. It was founded by Jim Gray and a bunch of other distinguished people, because at the time all these different database vendors were making up their own benchmarks and putting out marketing stuff that said, look how much faster we are than everyone else. But you couldn't reproduce the results, because the source code wasn't available and the environments weren't reproducible. So it was really hard, just in terms of performance, to say one system was actually better than another. The TPC consortium was meant to provide a level playing field for everyone to understand this. It mattered a lot, certainly in the 1980s and '90s. I would say it matters less these days, although we use TPC-H and TPC-DS all the time; they're good workloads for benchmarking systems. But if you're a new startup, going to get an official ranking and getting audited is a laborious effort that's probably not worth it. Like I said, no one's gonna care that you have officially audited TPC-DS results. Nobody cares, and it's a lot of work. Actually, they told me they had to drag somebody out of retirement to do the auditing because nobody's around to do it anymore, right?

There are other websites; ClickHouse has this thing called ClickBench, and I can show that next class. I'm not saying the benchmarks themselves are worthless. I'm saying going through the effort to get officially audited doesn't mean as much as it used to. And certainly now, with all these cloud database systems, as we'll see next class when we talk about Snowflake, being able to do an apples-to-apples comparison from one system to the next is hard, because there are so many different layers to these systems these days that it's hard to say whether I'm running on the exact same hardware from one vendor to the next, right?
Because, you know, in Snowflake, you don't say I want to run on this instance size on EC2; you just say I want this much compute capacity, and it's the same with Azure and the other cloud platforms. So what hardware you're actually getting is sort of hidden from you. Again, I'm not saying the benchmarks are useless. I'm just saying that going down the path Databricks did to get officially audited, well, they have a ton of money so they can do it, but if you're a new database startup, it's not the best use of your time. We'll cover benchmark wars more next class with Snowflake.

All right, so, Photon is proprietary. Spark is open source, but as I said, the Databricks Runtime itself, which includes Photon, is not. So you can only get the acceleration we're talking about here if you pay Databricks money for it. But they're not the only game in town for accelerating Spark. There are a bunch of other accelerators, open source and from various companies, that you can use to get Photon-like improvements. What's interesting about them is that rather than doing the fine-grained integration Photon does, replacing individual operators and deciding at runtime when to call into the JVM versus C++ code, as far as I know, all of these systems just co-opt the entire query plan, ignore the Spark runtime entirely, and run it on a separate engine.

Probably the most, well, popular is not the right word, the most well-known is Apache Gluten. Previously it was called Gazelle; I think it was Intel and Alibaba, but now it's an Apache project that Intel's running. The way it works is that you take the physical plan that Spark generates, convert it into Substrait, and then you have different mechanisms to take that query and run it somewhere else. You can run it on Arrow, on FPGAs, they have ClickHouse support, or on Velox, right? Again, they're not doing it on a per-operator basis; they're taking the entire query plan and running it somewhere else, but you're still accessing and running the query through the Spark front end, through that expected interface.

There's another one that actually just came out two months ago from Apple. So, as I said, this slide shows a bunch of different back ends that'll run these plans. This one's from NVIDIA: RAPIDS is their CUDA-based library that does data science stuff, and it's obviously meant to run on their GPUs. Blaze is out of a Chinese company; that one runs on DataFusion. And there's the new one called Comet, from Apple, that came out two months ago and is built entirely on DataFusion. There have been other attempts; Gluten supports multiple backend systems, including Velox. But DataFusion seems to be the one that everyone is leaning towards for accelerating Spark these days. And Databricks is obviously the biggest Spark vendor; I think Microsoft on Azure is promoting the RAPIDS one to speed things up.

All right, so the last thing I wanna talk about, as I've been saying multiple times and as we'll see from now to the end of the semester, is that we don't have statistics because we're running on data lakes. Adaptivity helps for certain things, but ideally, if we have adaptivity plus better statistics, then that's the best world to be in.
So, what has evolved and come out over the last several years, since the first Dremel paper came out, is this: instead of just assuming that everyone's gonna dump their files on S3 and your engines will go try to make sense of them later, what if you provide a service or mechanism that lets people ingest data into your lakehouse or data lake? As that data gets ingested, you can collect statistics about it and keep track of it in your catalog, beyond just "I have these files, here's where they are." Now you've looked at the data and kept some statistics, and you can use that in your query optimizer's cost model.

The Photon paper mentions the Delta Lake stuff, I think. Delta Lake is an open-source project, but it's specific to Databricks: it's a standalone service that sits in front of S3 and Spark, where you have a basic interface to send data into your lakehouse. All it's really doing is taking your updates and writing them to a log file, and then in the background there's this compaction process that runs, grabs the log file, converts it into Parquet, and writes that back to S3 or whatever your object store is. When it does this conversion to Parquet, that's when they compute statistics about what's actually in the file and store them in their catalog, so now the query optimizer can access them. So it's no longer some other service munging a bunch of data and plopping it down on S3 as a Parquet file; Delta Lake will produce the Parquet file for you, and because it has to scan the data to put it into Parquet's columnar format anyway, that's when it collects the statistics.

They're not the first people to do this. One of the first versions of this was from Cloudera, as an extension to Impala, this thing called Kudu. It was basically the same idea, although I don't think they were writing exactly to Parquet; I think they had their own proprietary format. But same thing: you had this system that could do fast updates into a distributed file system, and they collected statistics as the data was being ingested, stored them in the catalog, and fed them to the query optimizer, which you could then use through Impala. In Delta Lake, I think you can run SQL queries on it; I know in Kudu and Iceberg you can, using other systems. But Kudu was a really low-level interface for doing these things, and if you ever wanted to run SQL, they would tell you to go run it on Impala. So again, this was 2015, well before Delta Lake came along, and it was a way to help sell Impala for Cloudera.

In the same way that Databricks is the huge data company now, that was Cloudera ten years ago, but most of you had probably never heard of Cloudera before we started talking about it here, right? Actually, who here had heard of Cloudera before this class? Nobody. That was the hottest company, right? They were huge. It sort of shows you how quickly things can change in technology.

All right, there are other alternatives to this. Apache Hudi came out of, I think it was Uber, in 2016. And a lot of this text is basically the same thing I said for Delta Lake and Kudu, right? It's a service where you can ingest data.
All right, there are other alternatives to this. Apache Hudi came out of, I think it was Uber, in 2016. And a lot of this is basically the same thing I said for Delta Lake and Kudu, right? It's a service where you can ingest data. I think Hudi can generate both Parquet and ORC files, but either way it's gonna collect statistics about the data as it arrives, store them in a catalog, and then expose them to different systems. And this is actually the diagram they have on their homepage, and it shows you exactly what I've been saying. Here's all your front-end stuff: you have these updates coming in from your operational databases upstream for your application. They feed into Hudi, which writes the updates to a log. It'll write them out to what they call the cloud platforms, but this is basically HDFS, S3, Google Cloud Storage, and so forth, right to your object store. And then you update the catalog. I think that one's AWS Glue, and I forget what the other one is. That's Hive, the Hive metastore. You update the metastore, the catalog: here's what my data looks like, here are the statistics about it. And then you have the various query engines that we've been talking about being able to read from the catalog, figure out what's in there, use the statistics that were generated, and then process the data that the service has produced. Right, so that's how these things fit in. We've been talking about all the stuff on the query engine side; this is just a way to get the data into the system rather than writing it directly to the object store.

The next one, which we've been talking about and the catalog teams have been working with, is Apache Iceberg. This one came out of Netflix in 2017. And again, it's basically the same thing as I just said: you're ingesting data into the system, collecting stats, and exposing it to the query engines. And as we'll see next class when we talk about Snowflake, this is the only way Snowflake is actually able to support ingesting or querying data that isn't stored in their managed storage. They added support for this in 2021, because prior to that, the only way you could query data in Snowflake was to insert it into Snowflake. But now with Iceberg, they can actually feed external data into their own system. I forget whether they do wholesale copies or whether they can access the data where it resides. We'll see the same transition with Redshift, where Redshift only had managed storage and didn't support Iceberg at first, but then they added support to read arbitrary files off of S3 later on.
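To show what the catalog actually buys you, here's a small Iceberg sketch using Spark SQL. This assumes a Spark session with an Iceberg catalog named demo already configured; the catalog, database, and table names are made up. The idea is that every write creates a new snapshot, and Iceberg records per-file column statistics in its manifests, which is exactly the metadata that engines like Snowflake read when they query these tables.

```scala
import org.apache.spark.sql.SparkSession

// Assumes a SparkSession with an Iceberg catalog named "demo" configured;
// catalog, database, and table names are made up for illustration.
val spark = SparkSession.builder().getOrCreate()

spark.sql("""
  CREATE TABLE demo.db.events (
    event_time TIMESTAMP,
    user_id    BIGINT,
    action     STRING
  ) USING iceberg
""")

// Each write produces a new table snapshot, and Iceberg records per-file
// column statistics (value counts, min/max bounds) in its manifest files.
spark.sql("INSERT INTO demo.db.events VALUES (current_timestamp(), 42, 'click')")

// Iceberg exposes its metadata as queryable tables, so you can see the
// same file-level stats that a query optimizer would use for pruning.
spark.sql("SELECT file_path, record_count FROM demo.db.events.files").show()
```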
Okay, so again, I think Photon is a really interesting system, but you see the lengths they had to go to in dealing with the legacy code they had to support. And from a business perspective, I think it's the right move: don't throw away what's already working and making a lot of money for your existing customers. Improve it by being more fine-grained in how you integrate the new technologies and capabilities, based on all the things we've talked about throughout the entire semester, right? But certainly, if you were gonna build a system from scratch today, you wouldn't want to go down that path of taking some JVM monster and trying to work around it. It doesn't make sense to build a new database system in Java these days. Yes, Java is getting better, and the JVM is getting better; the new garbage collector that came out a year or two ago is way better than it used to be. There are proprietary JVMs you can get, like Azul's, that are faster than OpenJDK or the one from Oracle. That's what all the high-frequency traders use. But if you're trying to run something at scale, you don't wanna pay a license for every single node to use a proprietary JVM. So building something in Rust, C++, or Zig is probably gonna be the right choice, not the JVM. But again, there are a lot of systems already on this path, and we saw this with, as we mentioned, Presto and Trino, right? Presto, the one that stayed at Facebook, they're replacing their Java runtime with Velox. And again, that's a wholesale replacement of the entire query engine, rather than the fine-grained thing that Photon does. Whereas the Trino guys decided they're gonna stick with Java and deal with the legacy code.

All right, so again, next class on Monday, we're talking about Snowflake. And the Snowflake architecture is gonna look like Dremel. The paper you can read, from 2016, is gonna be more about the managed storage they provide, where the database system controls everything. But then we'll talk about how they integrate with Iceberg and how they handle updates in the newer versions of Snowflake. The high-level architecture is gonna look the same, and it's gonna smell a lot like Vectorwise and X100, because one of the co-founders of Snowflake was the guy who wrote X100 at CWI. So we'll see a lot of similarities between the two systems, except that the Vectorwise stuff we talked about was running on-prem, whereas the system in the paper you'll read runs in the cloud as a service. Okay? You know what I'm saying?