Don't have kids before you graduate — that's the most important thing with databases. So I want to make two corrections, things I misspoke about last class. The first one: someone asked how BigQuery does transactions, whether it cares about ACID, and I said they don't do transactions, which is not correct. BigQuery actually does support multi-statement transactions. They're doing OCC, and they give you snapshot isolation. I'm not trying to say this is a system built to do transactions; it's something they've had to add because people asked for it. One of the interesting things is that despite being an OLAP system, designed to run OLAP queries really fast, people still want to do some transactional things — they want to do updates and things like that. So BigQuery supports transactions, and I know that Redshift does as well. Actually, a funny thing, and we'll discuss this a little when we talk about Photon: because you're a cloud-based database system, you see all the queries. You know what your customers are doing, so you can use that to figure out what you need to optimize. In the case of Redshift, they told us they looked at the query logs and saw people were doing a lot of updates — and no one had spent any time thinking about how to optimize updates. Because they were seeing it so often, they went and optimized updates. So I would say these systems weren't really designed to support transactions, but they do support them now. And then I also said that with Hadoop, the way they did the shuffle was to write the output of a map task to HDFS. That's incorrect; it doesn't do that. He was right that they write all the shuffle data to the local disk on the data node, so you don't pay for the triple replication you would get with HDFS. So just be mindful of these things. But we don't care about transactions in this class. They exist, but we don't care about them. Sorry. So before we talk about today's paper, let's do a little history lesson on what led to Photon. Going back to the late 2000s, early 2010s — that's when Spark came about. Prior to this, the flavor of the day in large-scale data processing, the kind of thing we care about in this class, was Hadoop. Hadoop was a clone of the MapReduce framework that Google had. Google wrote a paper in 2004: here's this MapReduce thing, this is the way we're going to do all our large-scale data analysis. Yahoo took that and said, oh, we should do the same thing, and they wrote a clone of it as Hadoop and made it open source. A lot of people in Silicon Valley and other tech companies started working on it and using it. But it had deficiencies. There's a paper I wrote with Stonebraker and DeWitt in the late 2000s basically taking a piss on MapReduce. A bunch of the criticisms in that paper turned out to be correct, and MapReduce died out, Hadoop died out. But some of the ideas that were good ideas in MapReduce and Hadoop landed in things like Dremel. So Hadoop was the hot thing, but it was slow.
So at Berkeley — the University of California at Berkeley — they had a consortium similar to the PDL here at CMU, called the AMPLab. The idea is that it's a five-year program where they say, OK, we're going to build some software around some general theme. And obviously cloud computing was a big deal; it still is now, but this was the early days of it. So they looked at how to build a better version of Hadoop: the same high-level semantics, but faster and more efficient. That's what Spark came out of. Spark came out of Berkeley, they open-sourced it, a lot of people started using it, and it blew up into the juggernaut that Databricks is today. Databricks, the company, got forked out of Berkeley: a bunch of the PhD students who worked on Spark left the AMPLab, left Berkeley, and went down the street to the office where Databricks started. Spark was written in Scala, which was the hot language in 2010 — not Rust, Scala. Scala is basically a functional flavor of Java, so this means it gets compiled down to Java bytecode and runs in the JVM. And we haven't really talked this semester about how you actually implement one of these systems — most of them are in C++. Spark started off on the JVM, and Hadoop was written in Java as well. The reason this matters for today's discussion is that a lot of what they do in Photon is to deal with the fact that the original Spark SQL engine is in Java, which is slow and has a bunch of problems. So they're going to have a little engine that can fit in there and bypass the JVM as much as possible. The original version of Spark only supported a low-level query API based on RDDs — Resilient Distributed Datasets. Then, I think in the early 2010s, they added support for a Pandas-style DataFrame API as a nicer abstraction. As far as I know, most people don't write raw RDD code anymore, other than some legacy applications; everyone uses the DataFrame API, and SQL as well. So Spark started getting traction. And of course, when people have a lot of data and want to do analytics on it, and you're not providing SQL, what are they going to ask for? SQL. The first version of SQL support in Spark was this thing called Shark, which was a fork of Facebook's Hive middleware. Hive was designed to take SQL queries, convert them into Hadoop MapReduce jobs, and run them on the cluster. So the basic idea was: instead of producing Hadoop jobs, we'll take SQL and produce Spark jobs. There were some limitations to this. It was middleware — a standalone software layer that sat in front of Spark. You would submit your queries to it, and it would convert them into jobs; it was just doing this translation. And that meant you could only use SQL in Shark for data sitting in your data lake — in your object store — that was cataloged in the Hive metastore. You couldn't have a Spark program where, halfway through, you run a SQL query, take the output of that query, and run more Spark jobs on it or do additional things. It was pretty limited in what you could use it for.
The other challenge they faced was that the query optimizer in Hive was designed specifically for generating MapReduce jobs. It was rule-based — I don't remember whether it had a cost model — but it was really meant to generate these MapReduce jobs that were basically map, shuffle, reduce: only three phases. Whereas the Spark API was more expressive — you could do more complex transformations natively without tacking on additional MapReduce stages — but the optimizer wasn't set up to exploit that. So it was generating query plans that were not as good as they could have been, compared to ones written by hand. Shark was the first prototype, and then they came up with Spark SQL, which still exists today. As far as I know, Shark is deprecated; nobody's actually running it now. If you're running Spark without Photon on Databricks and you run SQL queries, you're running Spark SQL. So this is the second iteration of SQL on top of Spark, and this time it's a module that's native to the actual Spark runtime. Whereas Shark was standalone middleware, like Hive, Spark SQL is directly inside Spark itself. It uses an in-memory columnar representation for data, and everything still runs in the JVM — they're not doing JNI, they're not doing C++ yet. To avoid the overhead of the JVM, you want to keep things off-heap, using off-heap memory: you don't want to call new on an object, because then the garbage collector has to keep track of who references that object. Instead you allocate a byte buffer, put all your data in there, and the garbage collector doesn't look inside it. That's how you write high-performance Java code. So they do all the tricks we've talked about — dictionary encoding, RLE, bit-packing compression — and they do an in-memory shuffle between query stages. Just like we talked about with Dremel, except they write the data locally rather than having a separate shuffle service. They also do code gen, but only for the WHERE clauses. They take the predicates in the WHERE clause expression, convert them into Scala ASTs, and then use the Scala compiler to turn them into bytecode that runs in the JVM. So they're not doing the holistic query compilation we saw in HyPer; they're only doing it for the predicates themselves. The reason you want to use off-heap memory — and we'll see this in Photon — is that if you let the garbage collector manage things, it has to track references, so there's a bunch of metadata it maintains about who references what. And for every object you allocate, it puts something like 16 bytes in front of it with information about the Java class that allocated the memory and so forth. So if you have a column of data, you don't want to store the values as Java Integer objects; you want to store them as native primitive types, like the IEEE 754 standard stuff we talked about before.
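To make that overhead argument concrete, here's a tiny C++ sketch — hypothetical layouts, not Spark's or the JVM's actual internals — contrasting a boxed-object column (roughly what on-heap Java gives you) with a flat buffer of native primitives (what the off-heap approach buys you):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <memory>
#include <vector>

// Rough stand-in for a boxed Java value: the JVM adds an object header
// (~16 bytes of GC/class metadata) plus a reference per value.
struct BoxedLong {
  int64_t header[2];  // object header / GC metadata stand-in
  int64_t value;
};

int main() {
  constexpr std::size_t kNumRows = 1'000'000;

  // "On-heap" style: one heap object per value, pointer chasing per access,
  // and the GC has to track every one of them.
  std::vector<std::unique_ptr<BoxedLong>> boxed;
  boxed.reserve(kNumRows);
  boxed.push_back(std::make_unique<BoxedLong>());

  // "Off-heap" style: one contiguous buffer of native primitives -- no
  // per-value metadata, sequential access, SIMD-friendly.
  std::vector<int64_t> flat(kNumRows);

  std::printf("boxed >= %zu bytes/value, flat = %zu bytes/value\n",
              sizeof(BoxedLong) + sizeof(void*), sizeof(int64_t));
  return 0;
}
```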
There's this great little bit in the Spark SQL paper where they talk about how the original version of the memory-based shuffle would write everything to disk and then hope the OS would keep it all in its page cache, so that when you do the fetches from the next stage and pull things out of the shuffle, everything would still be in memory. Of course, relying on the OS, as we've said many times, is a terrible idea. And sure enough, they talk about how the syscall overhead of writing to disk and letting the OS decide what and when to evict turned out to be really terrible and made performance unreliable. So what did they do? Well, they're database people, though it took them a while to get there: they just maintain all the memory themselves. Don't let the OS do anything — which is not surprising; we've said this many times, in the intro class and in this one. Actually, another thing I'll say that's sort of interesting: Shark existed because people wanted SQL on Spark, and it was a stop-gap solution. But Cloudera — who here had heard of Cloudera before last class? Fewer than four of you, so very few. Cloudera was the big Hadoop company. They had the guy that invented Hadoop at Yahoo; it was a big deal, it had a lot of funding, and they eventually went IPO, I think in 2017. Ippokratis, who spoke to us about Redshift a few weeks ago, worked there on Impala. They were the Databricks of ten years ago. But they never did a good job of monetizing open-source software like Hadoop and making a cloud offering of it — there was Elastic MapReduce from Amazon, and Amazon was making more money on MapReduce than Cloudera was. Now, with Shark, Cloudera actually refused to ship it in their Spark distribution, even though people wanted it, because they had Impala. They wanted Impala to be the SQL-on-Hadoop system, so they wouldn't ship Shark. But everybody wanted Spark because it was better than Hadoop, and Cloudera was in the business of shipping these distributions of open-source data processing frameworks. So the way the Databricks guys got around this was to integrate SQL support directly into Spark itself. Now Cloudera has to ship, inside Spark, their competitor's software that's going to eat them alive. I'm not saying this is why Databricks was successful, but it helped. And that's why Databricks is a huge company, while Cloudera went IPO and then, as I said last class, got taken back private by private equity, because they lost their shine. I find that super interesting: everybody wanted SQL, Cloudera didn't want to give it to them because they wanted people to use Impala, so the Databricks guys said, OK, let's just put SQL directly inside Spark — and that destroyed Impala.
Okay, so as I said, we have not talked about implementation languages at all this semester, but for this paper you have to, because Spark is in Java, and in Photon they're dealing with an existing environment, an existing infrastructure, where they can't throw everything away and say, here's the new version. Spark is a moving target; it would be very hard to replace it wholesale. So they're trying to figure out what they can slide in to at least get a foothold, and then build out the scope of functionality the engine can support. So what are the problems with the JVM? Again, Spark was written in Scala, because that was the hot thing, and it runs in the JVM. What they were finding in their workloads is that they were becoming more and more CPU-bound, and this is surprising for a large-scale OLAP system that's processing a lot of data, because you would think reading data from disk would be the bottleneck. That's the conventional wisdom. But if your engine is slow enough, the disk no longer is the bottleneck — and disks are getting really fast, and you can start skipping things, throwing data out, because you're doing better filtering with indexes and zone maps and other stuff. So they were finding over time that queries were becoming CPU-bound in the JVM, because of all the advancements in hardware and in doing a better job of throwing data away without having to read it. And what they found with the Spark SQL engine is that it was really hard to make it go faster because it's in the JVM. The code-gen code was difficult to maintain, because only people with experience in the internals of the JVM could actually work on that piece of the system. If the heap size got larger than 64 gigabytes, the garbage collector became a huge bottleneck — even if you keep things off-heap as much as possible, there's still the metadata overhead. Now, I don't know how much you guys know about Java: there's OpenJDK and there's the Oracle JDK, but there are also high-performance proprietary JVMs that cost a lot of money — I think Azul is probably the most famous. If the high-frequency trading guys are going to use Java, they're going to use one of these very expensive, stripped-down JVMs that don't have the garbage-collection overhead of the open-source versions. But at their scale, they can't pay for that. So garbage collection was a problem. And then they also found, for the code-gen piece I mentioned for expressions in Spark SQL, that for really complex queries, or queries on tables with a lot of columns, the size of the code they had to generate to evaluate the predicates became too large for Scala or the JVM — it would throw errors, and they'd have to fall back to the slower Volcano-style interpreted engine. This problem came up so much that they realized they couldn't keep trying to make the JVM-based engine faster; they had to switch to something else, and that was Photon. So, all right: what actually is Photon?
It's not a standalone database system like the other systems we've talked about and will talk about going forward. You can think of it as a library that provides a single-threaded execution engine that executes portions of a query plan. And there's a bunch of work they have to do to get data back and forth between the Java land and the C++ side. I don't know if anybody here has ever written a C++ module — or something in C — that runs inside Java: you use JNI, the Java Native Interface. It exposes function calls that you implement to allow Java to call into C++, or C, or whatever you want it to be. The reason they have to do it this way is that they have an existing user base already running Spark SQL and Spark jobs in Java. And because Spark is an actively worked-on project, both within Databricks and outside it, it would be very difficult for them to say, OK guys, let's rewrite everything in C++. So the idea was to slide Photon in as a library that implements some of the things Spark does — the most CPU-intensive operations — replacing the Java implementations with C++, and then over time replace more and more pieces of Spark and have it call down through JNI into Photon. And you obviously want this to be seamless. You don't want people to notice: my query ran today and I got one result, it runs tomorrow and I get a completely different result, because someone added more features to Photon overnight. So you can think of Photon as an execution library, like Velox, which we'll cover in a few weeks. It's not a system where you can run a standalone daemon and run queries against it; it has to exist inside the Spark engine. The paper also talks about the Databricks Runtime, DBR. My understanding is this is basically the same thing as the open-source version of Spark, except they forked it and added a bunch of stuff to make it better that they don't put in the open-source version. That's their business model — I'm not saying it's right or wrong — so instead of distinguishing the Spark engine and DBR, treat them as basically the same thing. Now, using JNI isn't always easy. If you want to copy data back and forth, it can be expensive. But if you're just calling down to do something, letting it crunch, and coming back up, it's pretty straightforward. And it's interesting, because the conventional wisdom has always been that JNI is slow, but in the paper they say that in their measurements, the overhead of making a call from Java through JNI into C++ is equivalent to a C++ virtual function call. That was sort of surprising to me, but it kind of makes sense: if you don't have to marshal data or deserialize things back and forth, it's just a function call into C++. One more thing I want to point out about this paper — you guys read it for this class, right? In the author list there are a lot of CMU people, and a lot of people who have taken this class. Prashanth Menon was my PhD student; he took this class.
Utkarsh and Arvind were master's students here at CMU; they took this class with me. Ryan Johnson actually finished his PhD here at CMU before I started — I think in 2010 — so technically I think he took this class, but back in 2005, 2006. And there are more people: if you go look in the citation list, there's a blog article from Allison. She was my student; she took this class. There are a lot of people at Databricks now from CMU, so I was excited to see how many were on this paper. I would say this is why you should take this class — but we're, what, 12 weeks in? I've already sold that to you guys. Okay, so let's go high-level: what is Databricks Photon? I've already said it's not a standalone system; it's an execution engine. Some of the things I'll talk about are part of the overall Spark runtime, and some are specific to Photon. In particular, we're going to spend most of our time on how they do vectorized query processing, because that part is unique, and it's tied to how they do precompiled primitives and expression fusion. Just like Dremel, they do shuffle-based distributed execution, but unlike Dremel, which had a dedicated shuffle store, they store shuffle data locally and have the other workers pull from it. They support both sort-merge and hash joins. And then we'll also talk about how they do query optimization. Like Dremel, they also support some adaptive query execution methods — it's quite similar. And when we discuss Snowflake on Monday next week, it'll be almost exactly the same story, all the same stuff, except they don't always do a shuffle. Okay, so how does Spark actually execute queries? This is Spark the overall system, not Photon — Photon has no notion of network communication or workers; it's the little library of kernels that runs to execute queries. Just like in Dremel, we have a distributed file system with a bunch of files. In the first stage, the executors pull some data from the distributed file system and crunch on it. But unlike Dremel, with its dedicated in-memory shuffle store, they store all the shuffle data locally, in memory, spilling to disk if necessary. As I said, they control these memory allocations themselves; they're not letting the OS do any paging. Then in the second stage, the executors pull data from the previous stage, do whatever computation they want on it, store it locally, and push it out to the next executor. And there's all this machinery Spark provides for fault tolerance: if one executor goes down, how do you make sure the task gets rescheduled — all of that exists, but we don't care about it for this discussion. That's provided by Spark, not by Photon.
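Since Photon plugs into those executors through JNI, here's a minimal sketch of what the boundary mentioned earlier might look like on the C++ side — hypothetical class and method names, not Databricks' actual interface:

```cpp
#include <jni.h>
#include <cstdint>

// Hypothetical Photon-side types (illustration only).
struct ColumnBatch;
struct PhotonEngine {
  // Stub: the real engine would run the native plan fragment here.
  int64_t Execute(ColumnBatch* /*batch*/) { return 0; }
};

// The Java side would declare something like:
//   private native long executeBatch(long engineHandle, long batchAddr);
// and the JVM binds it to this symbol via the class + method name mangling.
extern "C" JNIEXPORT jlong JNICALL
Java_com_example_PhotonExec_executeBatch(JNIEnv* /*env*/, jobject /*self*/,
                                         jlong engine_handle, jlong batch_addr) {
  // No marshalling or copying here: both sides just exchange raw addresses
  // into off-heap memory, which is why crossing the boundary costs roughly
  // as much as a C++ virtual function call.
  auto* engine = reinterpret_cast<PhotonEngine*>(engine_handle);
  auto* batch = reinterpret_cast<ColumnBatch*>(batch_addr);
  return engine->Execute(batch);
}
```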
So Photon is basically a little piece inside each executor, where, as you're running the query, the system decides: I have a Photon version of this operator, so I'll call that instead of the original Java version. The Photon engine itself does pull-based vectorized execution using precompiled primitives at the operator level, like we talked about before with VectorWise. In VectorWise, the software developers — the people building the data system — write all these primitives, or what Databricks calls kernels, ahead of time, to do some small operation: scanning a table, probing a hash table, doing a comparison in some expression. They write them in templated C++, and the templates get expanded out into different compiled versions optimized for different data types, different data layouts, and so forth. That's how they get better performance without being a code-gen engine — things are still compiled. And then when you execute a query, it's a bunch of function-pointer calls through these primitives to do whatever processing on the data they're given. You might say, OK, a function call per piece of data would be expensive — but it's amortized, because they're dealing with column batches, chunks of data, at a time. Again, what I like about this paper is the commentary: we thought about this, we tried this other technique, and it turned out to be hard, or it sucked, and here's why we made the decisions we did. That's fantastic. Most industry papers just say: we do it this way, we can't tell you why, but trust us, it's the right thing. I like this paper because even where they don't have numbers to show, they say: we tried this, we looked at that, here's what worked, and here's what didn't. In particular, one of the key things they discuss is that they found it easier to build, maintain, and optimize a vectorized engine with precompiled primitives than to build a JIT engine. And they certainly had experience building a JIT engine — they did that in Spark SQL, compiling expressions. What they say is that when they were building Photon, they tried doing C++ code gen, the way we've talked about with HyPer. And what they found is that their engineers were spending more time writing the infrastructure — the tooling you need to debug and understand what the compiled program is actually doing — than writing the engine itself. If you take the HyPer approach and generate LLVM IR, compile it, and that's your query plan, then when it crashes you don't have symbols; you're sitting in x86 assembly, and it's not a pretty stack trace. You then have to write a bunch of tools to reverse that back and figure out: what was the code that generated the code that crashed? The HyPer guys can do it because they're German, right — but it's a lot of work. So what they said is: it's better to have your engineers just write additional code paths, primitives that are specialized for different scenarios. And we'll talk about one scenario that goes beyond just different data types.
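To make that concrete, here's a minimal sketch — my illustration, not Photon's actual code — of a precompiled primitive: a templated C++ filter kernel that the build stamps out once per data type and per scenario (here, nullability, which anticipates the batch-level adaptivity discussion later):

```cpp
#include <cstddef>
#include <cstdint>

// kCheckNulls is a compile-time flag: with 'if constexpr', the null check
// vanishes entirely from the non-nullable instantiation.
template <typename T, bool kCheckNulls>
std::size_t SelectGreaterThan(const T* col,
                              const uint8_t* null_bitmap,  // bit set = valid
                              T constant,
                              const uint32_t* sel_in, std::size_t n_in,
                              uint32_t* sel_out) {
  std::size_t n_out = 0;
  for (std::size_t i = 0; i < n_in; i++) {
    const uint32_t row = sel_in[i];  // position list of still-active rows
    bool valid = true;
    if constexpr (kCheckNulls) {
      valid = ((null_bitmap[row >> 3] >> (row & 7)) & 1) != 0;
    }
    if (valid && col[row] > constant) {
      sel_out[n_out++] = row;  // row survives; append to output position list
    }
  }
  return n_out;
}

// One compiled version exists per (type, scenario) pair, and at runtime the
// engine just picks the right function pointer, e.g.:
//   auto* f1 = &SelectGreaterThan<int64_t, true>;   // nullable BIGINT column
//   auto* f2 = &SelectGreaterThan<int64_t, false>;  // non-null BIGINT (faster)
//   auto* f3 = &SelectGreaterThan<double, false>;   // DOUBLE, and so on
```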
And since it's just C++ code, you get to use all the debugging tools, instrumentation, and profiling tools that exist today to make the system better. They talk about how, because you're not writing tooling infrastructure — just the engine itself — you can get close to the performance you would get from doing full code gen the way HyPer does it. And certainly, my student Prashanth built a code-gen engine for us — he built two versions. The second one was designed to make it easier for people to debug, for students to actually work on. And it's true: the second version, in NoisePage, was easier to debug, easier to maintain, and more students could work on it — but it was still only a small subset of all the students who worked with us. Because again, if it crashes, you have to dig through the generated code and figure out what's going on. So I agree with them: if you're going to build a system going forward, this is the way to do it. It's just not worth the engineering overhead to do code gen for entire queries. The other benefit — I guess with code gen you have complete control over what gets vectorized, without relying on the JVM and so on — but they talk about how auto-vectorization actually works pretty well here, because the primitives are small, and in some cases they use intrinsics to do explicit SIMD vectorization. All right. So every operator supports a GetNext function, and what it produces, because this is vectorized query processing, is a vector — a batch of tuples. In this paper they call them column batches. A column batch is one or more column vectors plus a position list that tells you which offsets in the vectors are considered active. Think of calling a bunch of filter primitives that progressively prune things out, like we discussed with vectorized execution: you need to keep track of which tuples passed each filter. So they use a position list that says: at these offsets in the arrays for the different columns, the rows are still valid. And obviously, if the list becomes empty — its size goes to zero — then no tuples passed, and you can short-circuit execution. In the paper they mention they also considered what we saw before when we talked about vectorization: a bitmap of the active rows. The size of that bit set is the number of tuples in your column vectors: if a bit is set to zero, the row is inactive; if it's set to one, the row is active. And they say that in their experiments, the position list wins, even though it seems like it would be more expensive, because you have to loop through it to decide whether to look at a given tuple. They have a little line here about why the bitmap turns out not to be faster: with the position list, you're iterating over a smaller array to figure out what's active. And then they have this little blurb down here — "recent work confirms our conclusions" — so I'm going to follow that citation.
It's our paper, right? It came out of this class. Amadou took 721; now he's a PhD student at MIT. Prashanth we know about, plus Matt Butrovich, Lin Ma, Wan Shen Lim, and Todd Mowry. In that paper, Amadou basically did an almost brute-force sweep over all the different ways to represent filters — the active-row set, the position list — and figured out under what conditions one is better than another. And in most cases the position list is better; it's only in really extreme cases that the bitmap wins. So, I've already said they're not doing HyPer-style holistic query compilation, and as a byproduct, they're also not doing the push-based operator fusion model that HyPer uses. Beyond the engineering effort, they make a good point that if you fuse operators, it's difficult to show the user useful information about why their query might be running slow, or which part of the query plan is slow — this operator versus that operator. Things get fused together, and there's no longer a clean mapping from the logical query plan, or even the physical query plan, to the actual running code. So — and this is my terminology — vertical versus horizontal fusion. They're not doing vertical fusion, where you take multiple operators vertically in a pipeline and collapse them down into one small kernel. Instead they do what I'll call horizontal fusion within a single operator: fuse together things that would otherwise be separate primitives into a single primitive, so there are fewer function calls. I'll show what I mean in a second. As a reminder, we saw this before with HyPer and code gen: you have some complex query, it gets converted into this query plan, and we have the boundaries for the pipelines. For something like the first pipeline — scanning a table and applying a filter — that's small enough that you could explain to somebody where the time is being spent, except that you're also materializing a hash table. But the one that clearly shows the problem with operator fusion, in terms of telling users where time goes, is the fourth pipeline: I'm scanning C, probing the hash table on B, and then probing the hash table on A. That's three nested for loops. How would you tell someone, yeah, your query's time is all being spent in this join, when it's all fused into a bunch of for loops? You can't show people for loops and say, here's where your time is going — they won't know what the hell you're talking about. So they're avoiding this not just for engineering reasons, but also for usability reasons for end customers, which I think is a fair point, and not something people in academia usually think about. All right, so what they do instead is horizontal fusion of predicate primitives. They have this example: SELECT star FROM a table WHERE some date is BETWEEN the beginning of the year AND the beginning of this month. Well, a BETWEEN is just two predicates ANDed together, right?
Greater-than-or-equal and less-than-or-equal. They comment in the paper that they see this pattern all the time. If you were doing precompiled primitives the way VectorWise describes, you would have one primitive to do the greater-than-or-equal against the first date and another primitive to do the less-than-or-equal against the second date. So for every single column batch you process, you'd make a call into the first primitive, get back your position list, then make another call into the second function, passing in the previous position list, and do that processing. Two function calls seems trivial, but think about scanning a terabyte of data. I forget what their column batch size is — actually, I don't think the paper says; say a thousand tuples. If I have a billion tuples broken into thousand-tuple chunks, that's a lot of function calls. So an easy optimization — and again, this is why they don't need to be a code-gen engine — is that when you see a pattern occur over and over, you build a specialized version of the predicate primitive. Now I can do the BETWEEN with one primitive, where I pass in the low and the high and do both comparisons. It seems sort of obvious, but you need the telemetry from the system — you need to know what queries people are running — to make these decisions. Because they're a cloud-based system, they have all of that. In the old days, if you were shipping on-prem software, you'd only find out what people cared about if you went and asked them, which is expensive and time-consuming, and people usually don't know what they want — or if something crashed and you got a debug report and figured out what was going on. Being in the cloud, you see everything, and you can make decisions based on it. So this is what I'm calling horizontal fusion: within an operator, I'm fusing multiple primitives together. And the idea applies to a bunch of other things. Okay. Memory management is a big deal for them too, because they have to deal with Java. My understanding is that all the memory allocations Photon makes in C++ go through the unified memory manager running up in the JVM. You ask the unified memory manager to give you a chunk of memory, and it's just a pointer that gets passed down to C++ to use. It doesn't matter whether it was malloc'd by C++ or allocated by Java — memory is memory. And as they talk about in the paper, you obviously want this, because you don't want a separate memory pool for Photon and a separate memory pool for the Java side. With a single memory pool, it's easier to do reporting — to show the operator or the customer how much memory this node is using, to track runtime usage. And Spark already had a memory manager, so why write another one? Just use that one.
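Going back to the BETWEEN example for a second, here's a little sketch of that horizontal fusion — illustrative C++, not Photon's actual kernels — showing the unfused two-call version versus the fused single-pass primitive:

```cpp
#include <cstddef>
#include <cstdint>

// Unfused: each comparison is its own precompiled primitive, so evaluating
// the BETWEEN costs two function calls per column batch plus an intermediate
// position list.
std::size_t SelectGE(const int32_t* col, int32_t lo, const uint32_t* sel_in,
                     std::size_t n, uint32_t* sel_out) {
  std::size_t m = 0;
  for (std::size_t i = 0; i < n; i++)
    if (col[sel_in[i]] >= lo) sel_out[m++] = sel_in[i];
  return m;
}

std::size_t SelectLE(const int32_t* col, int32_t hi, const uint32_t* sel_in,
                     std::size_t n, uint32_t* sel_out) {
  std::size_t m = 0;
  for (std::size_t i = 0; i < n; i++)
    if (col[sel_in[i]] <= hi) sel_out[m++] = sel_in[i];
  return m;
}

std::size_t BetweenUnfused(const int32_t* col, int32_t lo, int32_t hi,
                           const uint32_t* sel_in, std::size_t n,
                           uint32_t* tmp, uint32_t* sel_out) {
  std::size_t m = SelectGE(col, lo, sel_in, n, tmp);  // call #1 per batch
  return SelectLE(col, hi, tmp, m, sel_out);          // call #2 per batch
}

// Fused: one specialized primitive doing both comparisons in a single pass --
// half the calls per batch, no intermediate position list.
std::size_t BetweenFused(const int32_t* col, int32_t lo, int32_t hi,
                         const uint32_t* sel_in, std::size_t n,
                         uint32_t* sel_out) {
  std::size_t m = 0;
  for (std::size_t i = 0; i < n; i++) {
    const uint32_t row = sel_in[i];
    if (col[row] >= lo && col[row] <= hi) sel_out[m++] = row;
  }
  return m;
}
```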
What's also interesting about the memory manager — something other papers don't talk about — is how they spill to disk when they run out of memory, when there's memory pressure across these different operators. We didn't really talk in this class about when you actually spill to disk; it's basically the same techniques from the intro class: I have a buffer pool, or some fixed allocation for my hash table in a join, and if I exceed that size, things get spilled to disk. It runs slow, but it's better than crashing. They take a different approach: they separate asking for memory from releasing memory, as two separate concerns. That allows one component running in the system to free up memory to make space for another component. No longer is a component responsible for recognizing on its own, I'm about to run out of memory, let me start spilling to disk. When you run out of memory, you go ask the memory allocator for more, and it decides who should release memory. The reason they want this, as they discuss in the paper, is that they're dealing with data they've never seen before. They don't have statistics; they don't know how much work a query is actually going to do, or how much memory the different operators will need. So you need to be more dynamic in your allocations. In Postgres, by contrast, you define up front how much memory your hash table gets as its initial allocation, and if you get it wrong, you have to double the size and run it again. And a lot of systems, if you run out of memory, just throw an error and abort the query. That's not ideal either. So the way it works here is: if you need more memory, you go to the memory manager and ask for more, and it decides whom to "kill." They use a simple heuristic: sort all the concurrent operations holding memory from the least amount to the greatest, and whichever operator is holding the least amount of memory that is still enough to satisfy the request gets told to release its memory. Everybody has to be dynamic, which I think is kind of clever. All right, let's talk about query optimization. This paper doesn't talk about the query optimizer; that was in the Spark SQL paper from 2015, which describes the Catalyst query optimizer. So some of what I'll say here is what Spark does generally, but I'll mostly focus on what they added to Catalyst to make it work for Photon. Catalyst is a Cascades-style query optimizer, written in Scala, of course. It does its transformations in predefined stages, very similar to the SQL Server talk you guys watched before, and those stages are defined by the software engineers, who decide when each one actually gets applied. And because they're not going to have a fantastic cost model — they won't have statistics for random data sitting in a data lake — a lot of the optimizations are heuristic-based, at least enough to get close to a good plan. Then at runtime, they can adapt a little to fix things up.
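Before we move on to the optimizer, here's a minimal sketch of that memory-release heuristic — my reading of the paper's description, with hypothetical types, not Databricks' actual code:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// A memory consumer is anything currently holding memory (e.g., a hash join
// building its table). In the real system it would know how to spill.
struct Consumer {
  int64_t held_bytes;
  void ReleaseToDisk();  // e.g., flush part of a hash table to disk
};

// When a request can't be satisfied: sort the consumers by how much memory
// they hold, ascending, and pick the first one holding enough to satisfy the
// request -- i.e., the smallest consumer that still helps.
Consumer* PickVictim(std::vector<Consumer*>& consumers, int64_t requested) {
  std::sort(consumers.begin(), consumers.end(),
            [](const Consumer* a, const Consumer* b) {
              return a->held_bytes < b->held_bytes;
            });
  for (Consumer* c : consumers) {
    if (c->held_bytes >= requested) return c;
  }
  // No single consumer is big enough; a real implementation would spill
  // several of them, or fail the allocation.
  return nullptr;
}
```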
One big difference between Catalyst and the traditional way people think of Cascades is that, beyond logical-to-logical and logical-to-physical transformations, they also do physical-to-physical transformations. And this is how they integrate Photon: they generate a physical plan through the normal process, but then there's an extra stage that looks at what portions of that physical plan can be offloaded into Photon, and transforms the physical plan into another physical plan. These labels here — analysis, optimization rules, strategies, and so forth — are what Databricks calls these stages, but my understanding is that it boils down to logical-to-logical, logical-to-physical, and physical-to-physical. And I don't want to say SQL Server doesn't do this — they talk about how the different engines sitting in front of their Cascades query optimizer can do additional transformations, like converting a single-node query plan into a shuffle-based query plan. So I shouldn't say nobody else does this; it's just that Databricks is doing it specifically for Photon in a certain way, and it's roughly equivalent to what the SQL Server guys do. All right, so the physical-to-physical transformation. The idea is to traverse the query plan bottom-up and figure out what portions can be moved into a Photon plan. The thing they have to be mindful of is the amount of back and forth between the Java world and the C++ world: the Spark SQL engine is row-based, the Photon engine is columnar, so you have to pivot and convert the data every time you cross the boundary. You don't want to do that over and over, or you'll spend most of your time on conversion rather than actually processing the query. So they go bottom-up, so that they know what has already been converted below and they're not creating weird transitions in the middle. The idea is to figure out at what point you can start converting to Photon, and how far up the pipeline or query plan you can go. Here's a basic example: I scan some file, I do a filter, I do a shuffle, and I produce output. They figure out: the filter and the shuffle have Photon versions, so the physical plan converts those into the specific operators Photon supports. But then they have to add what they call an adapter, or transition operator, that makes the JNI calls into C++ and the return call back into Java with the result — I'll show a little sketch of this rewrite in a second. So one side of the plan is the Java stuff, the JVM, and the other side is all C++. And again, this happens before you start running the query: the overall system makes this decision on the physical plan before executing it. Then at runtime, they do two types of adaptivity. The first one I'll call query-level adaptivity, or macro-adaptivity.
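Here's that sketch of the bottom-up physical-to-physical rewrite — hypothetical node types, not Catalyst's actual Scala classes; just the shape of the traversal:

```cpp
#include <memory>
#include <vector>

// Hypothetical physical plan node for illustration.
struct PlanNode {
  bool photon_supported = false;  // does Photon have this operator?
  bool runs_in_photon = false;    // decided by the rewrite below
  std::vector<std::unique_ptr<PlanNode>> children;
};

// A node runs in Photon only if Photon has the operator AND everything below
// it already runs in Photon. That way each pipeline crosses the JVM/C++
// boundary at most once, instead of ping-ponging -- every crossing pays a
// row<->columnar conversion.
bool MarkPhotonSubtrees(PlanNode* node) {
  bool children_ok = true;
  for (auto& child : node->children) {
    children_ok = MarkPhotonSubtrees(child.get()) && children_ok;
  }
  node->runs_in_photon = node->photon_supported && children_ok;
  return node->runs_in_photon;
}

// A second pass would then wrap each maximal runs_in_photon subtree with the
// adapter/transition operator that performs the JNI calls and pivots the
// columnar results back into rows for the JVM side.
```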
Query-level adaptivity is very similar to the Dremel stuff we talked about before: the system recognizes that the shape of the data, or its size or distribution, is different from what it anticipated, and it can make high-level changes to the query plan itself. This is not done by Photon; it's done by the higher-level part of Spark — the Databricks Runtime, the Spark runtime. The thing that is specific to Photon is what they call batch-level adaptivity: while you're actually running the kernel loops, you recognize that the data looks different than expected, or has certain properties, and you choose different precompiled primitives to invoke for the data you're looking at. We'll go through examples of both. I'd say the query-level stuff existed before Photon, while the batch-level stuff is the Photon-specific part. All right, so the dynamic query optimization at the Spark level looks a lot like Dremel — basically the same thing, except they do a shuffle between stages. That's nice, because at the end of a stage, when everything is sitting in the shuffle buffers, it's the right time to look at the query plan: am I doing the right thing given the data I've now seen? They didn't have a good cost model at query optimization time, so they couldn't predict exactly what the data would look like — but at the shuffle phase, you have the data, so you can make better decisions. So just like Dremel, they can switch between a shuffle join and a broadcast join. They'll also coalesce partitions — I'll show that on the next slide; the way they do it is actually more primitive than what Dremel does, in my opinion. And they can dynamically handle skewed joins and potentially change the join order. So let's see how they do the coalescing. Unlike Dremel, which would recognize pressure on a partition and start spilling it or splitting it into multiple partitions, as far as I can tell Spark can't do that; it can only combine partitions that are underutilized. The way they do this is to over-allocate the number of partitions you expect to come out of a worker — large enough that things don't start spilling to disk — and then, after the shuffle completes, go back, figure out which ones are underutilized, and combine them (see the sketch below). Say my worker produced output into these five partitions, and I recognize that three of them are underutilized: I can combine their results — this is all done locally at the worker; there's no distributed service. And the two partitions that were full just carry over. Then you can adjust the number of workers on the other side pulling data from this stage. The reason this matters: fewer partitions means fewer tasks needed to consume them, which means less pressure on resources and on the scheduler. I forget what they said the default partition size was — it might have been something like 64 megabytes. Again, they do this because they don't know what the data coming out of a worker will look like; they can't predict the selectivities of filters, joins, and other operators.
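Here's that coalescing sketch — illustrative C++ with made-up names (the real logic lives in Spark's adaptive execution): group adjacent underutilized shuffle partitions up to a target size, so the next stage needs fewer tasks to consume them:

```cpp
#include <cstdint>
#include <vector>

struct Partition {
  int64_t bytes;  // how full this shuffle output partition actually is
};

// Greedily merge adjacent partitions until adding the next one would exceed
// the target (e.g., ~64 MB). Each returned group becomes one downstream
// task's input; already-full partitions just carry over as their own group.
std::vector<std::vector<int>> CoalescePartitions(
    const std::vector<Partition>& parts, int64_t target_bytes) {
  std::vector<std::vector<int>> groups;
  std::vector<int> current;
  int64_t current_bytes = 0;
  for (std::size_t i = 0; i < parts.size(); i++) {
    if (!current.empty() && current_bytes + parts[i].bytes > target_bytes) {
      groups.push_back(current);  // close out the current group
      current.clear();
      current_bytes = 0;
    }
    current.push_back(static_cast<int>(i));
    current_bytes += parts[i].bytes;
  }
  if (!current.empty()) groups.push_back(current);
  return groups;
}
```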
So you just allocate enough partitions up front, fill them, and combine them afterward if you realize you over-predicted. All right — the thing that's more interesting is how they do the batch-level adaptivity. We're going to highlight three ideas here. They have a little example of the templated C++ code in the paper; I didn't put it on the slide, but the explanation is easy enough. The idea is that because a running query plan is basically a list of pointers to these precompiled primitives, there's no reason you can't swap one pointer out for another, pointing to a different function, as long as it does the same high-level operation — at runtime, based on what the data actually looks like. One of the optimizations they talk about: people do all sorts of weird encodings for strings. Sometimes the data is ASCII, sometimes it's Unicode — UTF-8. ASCII is always one byte per character, always eight bits. UTF-8 is variable: one to four bytes per character, depending on the character. So if you assume everything is UTF-8, you have to use slower versions of the string functions, because they have to account for that variability in the data. But if you can recognize that, for this column, it's always ASCII data, you can switch to an optimized version. They have a basic way to detect, as they scan, that a column is all ASCII, and then they make sure to use the ASCII versions of the precompiled primitives, because those are faster. They also keep track, per column batch, of whether a column contains any NULL values — if there are no NULLs, you don't need to check for NULLs. In the example in the paper, it's a template function, and the compiler is smart enough that if you instantiate it with "don't check for NULLs," that line of code gets compiled away. So you have one version of a primitive that checks for NULLs and another that doesn't, and if you know there are no NULLs, you use the faster one. Similarly for the column batch as a whole: if all the rows are active — your position list contains every possible offset — then you don't need to check, for every single record in your loops, whether the row I'm looking at is even active. You can elide all of that. And it removes indirection, which, as we've said, is a bad thing on superscalar CPUs, so things run faster. So they can do this switching on the fly, as the query runs, based on the data they're seeing — and obviously you need fallback mechanisms for when you get it wrong. All right, this paper actually has benchmark results, in particular for TPC-H. (I don't know why that's there; I can't get rid of it.) It's nice because they're comparing against themselves, before and after: the gray bar is the old Spark SQL engine, and the red bar is the new Photon engine. The main takeaway is that it's obviously faster — for this query in particular, because they're relying on the C++ decimal implementation versus the Java one, and it's just so much faster. Yes? [Student asks about Q6.] I can't remember, I don't remember.
I think Q6 is just a scan over lineitem — it's pretty simple — but I don't know why it's so fast in this case. And again, the main takeaway here is that it's not surprising the C++ engine is faster than the Java one. Yes? [Student asks about the Java decimal implementation.] I forget the details of that — sorry, that's Oracle, this is Databricks. I don't remember whether it was fixed-point or not. But if it was fixed-point in Java, then it would have to be fixed-point in Photon too, so I can imagine that would make a difference. You would basically have to re-implement the same thing: if they're using a fixed-point decimal type in Java, like BigDecimal or something like that, you have to re-implement it in C++ with the same semantics. I don't know whether it's fixed-point or floating-point; it's in the paper, we can go look at it. All right, they also had this other blurb at the end, if you noticed: they talk about TPC-DS numbers. In 2021, as part of this paper, Databricks made a huge announcement about having officially audited TPC-DS results. TPC-H came out in the 90s; TPC-DS — the DS is for decision support — came out in the 2000s. TPC-H is considered outdated; TPC-DS is supposed to be the benchmark everyone uses. But it's more complicated — there are about a hundred queries, there are CTEs — so it's easier to get TPC-H running than TPC-DS, and that's why a lot of the papers you read only look at TPC-H. But if you're evaluating a modern system, you want to use TPC-DS. So this is the webpage — it's straight out of the 2000s, but it is what it is — and these are the official results. At the 100,000-gigabyte (100 TB) scale, Databricks is the champion; below them is Alibaba with their MapReduce offering, and Databricks is both faster and cheaper. They made a big deal about this announcement, and when it came out, there were a bunch of articles about how Databricks has official TPC-DS numbers and is going after Snowflake. We'll talk about the benchmark war next class, but for this particular article, they asked me for a quote about the significance of these TPC-DS results. This is actually not the full quote — they cut it off. At the enterprise level, maybe some CIO, a chief information officer, is going to care about your official TPC ranking, but they don't make sales that way. "And only old people care about TPC results" — that was the full quote. So does anybody know what the TPC is? We've talked about TPC-H and TPC-DS. TPC stands for the Transaction Processing Performance Council. It's a nonprofit consortium set up in the late 80s by Jim Gray and some other database people as a way to have an unbiased, nonpartisan, official referee for benchmark results. Prior to the TPC being set up, all the vendors had their own little benchmarks that, of course, showed their system was faster than everything else. There wasn't a standard workload, a benchmark everyone could use to do fair apples-to-apples comparisons. So the TPC was set up; I think the first benchmarks were TPC-A and TPC-B.
Think of early transactional workloads. And it's funny — if you go read the early papers from when they set up the TPC, they say things like: wouldn't it be great if we could run a hundred transactions a second? For them, that was wild; that was the vision. So there are a bunch of these benchmarks — TPC-C, TPC-E, TPC-DS, and so forth. How do I say this... nobody cares about this anymore. It's nice to have, but nobody is making purchasing decisions because your officially audited TPC-DS results are amazing. People put out TPC-H results all the time — you're not supposed to call it TPC-H unless it's been officially audited, but nobody actually checks that anymore. It's nice to have these standard workloads everyone can use, but I think we've moved past the need for a consortium like this to check things. Actually, they told me they didn't have anybody who could audit this — they had to pull somebody out of retirement to come help audit it, because nobody was around to check this stuff anymore. So I stand by my statement: only old people care about this. It's nice to have, but no startup is going to spend the time trying to get ranked here — except Databricks, because they have money. Yes? [Student: so what should you care about instead?] I mean, you should care about your workload and your application. There are standard benchmarks people use to evaluate these different systems, and at the end of the day, if you're a user making a decision, you care about your workload. You don't care about their TPC-DS results if your workload doesn't look like TPC-DS. Now, unofficial results — just running TPC-DS yourself, which you can do — will at least give you a starting point for understanding the pros and cons of one system versus another. But I would also argue that a lot of the things we've talked about this semester are basically being commoditized: every system does vectorized execution now. The things you're going to care about are how good the query optimizer is, how good the user interface is, and certainly how much it costs. Those are the things that should help you decide to pick one system over another. I'm just saying that for them to spend the time to get officially ranked and beat Alibaba — I don't care. Yes? [Student: but now they have official numbers.] Well, they asked me, and I said I don't care. I don't think anybody even knows what TPC-DS is. You can come out of the gate and say, we're faster than Snowflake, here are some results — and not do the official audit. We'll cover this next class: when this came out, I said, who cares; Snowflake then did a retort and said, you ran Snowflake wrong; Databricks said, no, we ran it right. They went back and forth on this. [Student question.] One of them did, yes — next class. Anyway, I enjoy this, because I love databases, and there was a lot of back and forth like this in the 80s and 90s.
Less so in the 2000s, because then it was mostly Hadoop versus the relational databases. It's good to see this kind of gang violence again, and I'm surprised there hasn't been more of it. I think what will happen, as money gets tight for these startups — well, not startups, they've raised hundreds of millions — is that you'll see more of this kind of back and forth, just to get an edge with customers. But in the end: performance matters, sure, but a 10% difference between one system and the next probably matters less than the overall experience of using the database — how much time you have to spend tuning and managing things yourself. That matters a lot. All right, cool. I want to finish up quickly with some additional things. In this lecture and the last one, we made a big deal about how the query optimizer doesn't know what the data looks like, because it's just a bunch of random files sitting in the data lake. Adaptivity helps with some of that, but obviously, the more you know up front, before you generate a query plan, the better position you're in to make good decisions. So the thought comes up: what if there were a storage service for data lakes that could support the kinds of operations people want to do on them, like bulk loading data or making incremental changes — and then, as the data arrived, the system could compute statistics to feed into your OLAP engine? That's what Delta Lake is. In 2019, Databricks announced Delta Lake. The way to think of it is as a transactional data store that provides basic insert, update, and delete operations, plus scans, over structured data stored in your data lake — on top of S3 or whatever object store you're using. The way it basically works: it's a transactional system, and when you make updates to your tables, it appends entries to a log. Then a background job occasionally wakes up, grabs everything in the log, converts it into Parquet files, computes statistics that get stored with the files, and registers them with the data catalog. So now, when Spark gets a query over data sitting in Delta Lake, it has the computed statistics to help it figure out what the data actually looks like. This is a natural progression for an OLAP engine. Snowflake has something similar — not exactly the same thing, but their own little OLTP system they put in front that can collect some of that information. And from the vendor's point of view, if you can charge people for ingesting their data and collect statistics that make your system look better, that's a clear win. But they weren't the first to think of this idea. Cloudera did it in 2015 — who here has ever heard of Kudu? You don't count, you don't count. Two. Kudu was basically the same idea: a transactional storage layer that could ingest data, let you do CRUD operations in transactions, and compute statistics that could then be fed into Impala for running queries. I'm not saying Kudu and Delta Lake are exactly the same, but the high-level idea, the goal, is very similar. And when we talk about Iceberg and Hudi next class — they work basically the same way as well.
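Here's a toy sketch of that flow — made-up structures, not the actual Delta Lake log format — just the "append to a log, compact into columnar files, compute statistics for the optimizer" loop described above:

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>
#include <vector>

// A pending write sitting in the transaction log.
struct LogEntry {
  int64_t key;
  int64_t value;
};

// Zone-map-style metadata computed per data file; this is what eventually
// gives the query optimizer something to work with (pruning, costing).
struct FileStats {
  int64_t min_key, max_key;
  std::size_t num_rows;
};

// Background job: drain the log into a columnar file and compute its stats.
FileStats CompactToColumnarFile(const std::vector<LogEntry>& log) {
  FileStats stats{std::numeric_limits<int64_t>::max(),
                  std::numeric_limits<int64_t>::min(), log.size()};
  for (const auto& e : log) {
    stats.min_key = std::min(stats.min_key, e.key);
    stats.max_key = std::max(stats.max_key, e.key);
    // ...write e out in a columnar (e.g., Parquet) layout...
  }
  // The stats would then be registered with the data catalog alongside the
  // new file, so the optimizer sees them at planning time.
  return stats;
}
```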
All right, so in my opinion, the two most interesting parts of Photon are, first, that it uses the precompiled-primitives idea from VectorWise, with what I'm calling horizontal operator fusion — along with the discussion of the pros and cons of the different trade-offs they made — and second, how they had to integrate with the existing Java runtime without disrupting users and without rewriting everything from scratch and switching over. So the obvious main takeaway from all of this: if you're going to build an OLAP engine today from scratch, do not use Java, do not use Scala, do not use the JVM. You need complete control over everything, and you want to implement it in C++. Or — yeah, thanks — Rust. We need a little button for you, just like every time: Rust, Rust, Rust. Okay, any questions? All right, next class: Snowflake. It'll be a lot of the same ideas, but we'll talk about which parts of Snowflake are unique. All right guys, enjoy Carnival. Have a good weekend. It's going to be nice out.