The Carnegie Mellon Vaccination Database Tech Talks are made possible by OtterTune. Learn how to automatically optimize your MySQL and Postgres configurations at ottertune.com. And by the Steven Moy Foundation for Keeping It Real; find out how best to keep it real at stevenmoyfoundation.org.

It's another exciting talk in the Vaccination Database Tech Talk series. We're excited today to have Deepak Majeti. He's a principal engineer at Ahana, which offers a commercial version of PrestoDB, and he's here to talk about Velox, the new engine that they're working on. Prior to Ahana, Deepak was a tech lead at Vertica, and he has a PhD in compilers, not databases, from Rice University. As always, if you have questions for Deepak while he's giving the talk, please unmute yourself, say who you are, and ask your question at any time. Feel free to do this whenever you want, because otherwise he's just talking to himself. And we appreciate him being here with us, even though he's in Pittsburgh, so he's, you know, five miles from my house, but we're doing this over Zoom. So Deepak, the floor is yours. Thank you for being here.

Thanks, Andy. All right, so yeah, we're going to talk about Velox today, and this is the outline of my talk. I'll start with a brief motivation for Velox. I'll then spend some time going over some of the highlights of Velox, which include the API, the components, and the design. Then we'll see how we implemented a Presto worker using Velox, and I'll finally conclude with some of the open source development experience around Velox.

All right, so why are we building Velox? If you look at the hardware and data trends, you'll see that data growth is significantly outpacing hardware growth, and I'm sure you've heard that before. You also must have heard about the implications of the approaching end of Moore's law and Dennard scaling. We already see some heterogeneity in processors today, and researchers are predicting dark silicon in the future. As a result, data processing has become quite complex. CPU speeds have stagnated, and memory has improved but is very expensive at the higher end. On the data processing side, we have various workloads such as batch, analytical, streaming, and machine learning, and all of them employ common features such as joins, filter pushdown, sorting, grouping, et cetera. And building these engines is generally not trivial. So it would be great to have a shared library that provides optimized implementations of this common functionality; if that library is also extensible, that's much desired, and if we can consolidate these data processing systems, that's also great. To answer all these needs, Velox is being developed by the community as an open source project.

All right, so what is Velox? It's an open source C++ data processing acceleration library. It has building blocks and components to build various data processing frameworks. The goals are to enable high performance, extensibility, and consolidation across these frameworks. It has a generic API that can be used across data processing frameworks, and the engine is primarily vectorized, very similar to the MonetDB/X100 work. And the other main benefit of Velox is that it's open source.
Being open source enables a lot of academic and industry R&D, a lot more opportunities. And it's designed to be adaptive: filters, conjuncts, and caching, as we'll see, are all designed to be adaptive and general purpose.

All right, so what is Velox not? It does not have a traditional SQL parser, there's no dataframe layer, there's no global optimizer, and it doesn't provide the control plane. That's all the responsibility of the client systems that integrate with Velox.

All right, so what are some of the use cases of Velox? We already see a couple of data processing tools adopting it. For instance, there's a Presto worker that has been implemented on top of Velox. Spark has a couple of projects: Spark's script transform has been extended to offload execution to Velox (script transform is an interface that allows users to execute an arbitrary binary in Spark), and then there's the Gazelle project, co-developed by Intel, that can also offload Spark SQL to Velox for execution. There's XStream, again from Meta, that is also using some of the Velox components; it uses the Velox vectors and expression evaluation. Then there's the PyTorch TorchArrow project that uses the library of functions we built in Velox for the Presto worker, so it's using some of those functions in PyTorch. And then there's the new Substrait project; I don't know if you've already heard of it. It's a cross-language serialization for relational algebra, essentially a framework to describe plans and other operators for data processing. That also has an extension for Velox where you can translate a Substrait plan and expressions and execute them using Velox. So there are a lot of use cases, given this general-purpose and open source scope for Velox.

A question: for Substrait, is what it supports a subset of everything that Velox can do, or could Substrait capture everything that Velox supports right now? Sorry, could you repeat that? For Substrait, for their grammar and what they support, is that a subset of what Velox can support now, or is it one-to-one at this point? I'm not really sure of the scope; it's a pretty recent addition, actually less than two weeks old. I know they added some support for plans and expressions, but I really don't know the breadth of that support. There was also a question in the chat: no, Rami, they are not using Velox; I think they have their own implementation. Pedro from Meta — Pedro, do you know what the scope of Substrait is? Yeah, hi everyone, I'm Pedro from the Velox team at Meta. I think my understanding is that those are kind of different efforts: Substrait is looking at consolidating query plans in general, while Velox is consolidating the execution engine. So there's one integration that lets Velox execute Substrait plans, but Substrait's scope is much larger, and it's also extensible, so it can be used with other engines. It's a different open source project. Thanks. Thanks, Pedro.

All right. So, looking at the use cases, we can already say that Velox is providing big wins in terms of consolidation.
Without Velox, different teams would have to work independently on each of these data processing frameworks. Each framework, because it's independent, would probably end up with varying semantics; there's overall code duplication; there are missed opportunities to share optimized algorithms and features; and functions often end up with different semantics across the different data platforms. Velox solves all these headaches by providing a shared platform, and it provides significant gains in engineering efficiency, consistency, and productivity. These are big advantages, apart from the high performance you get with Velox.

All right, so let's look at some highlights of Velox. Velox has most of the components required to build an execution layer. I tried my best to balance breadth versus depth; Velox has many more tricks up its sleeve than what I'll be covering today, so given it's a one-hour talk, I've tried to limit the scope.

The Velox components include types: types can be scalar as well as nested complex types like structs, arrays, maps, tensors, and more. Velox has a notion of vectors, and they are very similar to Apache Arrow; they're compatible in most cases, and I'll show you one case where they are not. It has functions, with APIs to build both custom scalar and vectorized functions. There's an expression evaluation layer that evaluates expressions over the query data in a fully vectorized way. It also has the notion of tasks, drivers, pipelines, and operators, which coordinate scheduling and processing. Then there's the I/O subsystem, which includes the connectors, file systems, file formats, caching, and network serializers. And finally, there's resource management, which handles memory, memory mapping, spilling, et cetera. Pretty much, these are the common modules that you see in most data processing execution engines.

All right, so let's first look at what kinds of vectors Velox currently supports. Velox supports various encodings: it has flat, constant, dictionary, biased, and sequence encodings. Fundamental to all of these is something called a base vector that tracks the type, the nulls, and the number of rows, and the encoded vectors extend this base vector. Flat vectors, for instance, have the base vector plus a values buffer; constant vectors are implemented over a single value; dictionary vectors contain a values vector and an indices buffer; a biased vector is a delta-encoded vector; and a sequence vector is a run-length encoded vector. On the right there's a sample picture that illustrates a simple flat integer vector. On the left there's an integer array with indices and logical values, and on the right it's represented as a buffer of values with a not-null buffer that indicates whether each element is null or not. That's very similar to what Arrow does, too. So here, indices two, four, and five are null, so we have a zero in the not-null buffer there, and the other entries correspond to real values with not-null set to one.

All right, so string vectors are where Velox deviates from Arrow. String vectors are based on the Umbra paper: they have a 16-byte value per string, called a string view, which encodes the byte size and a prefix of the string.
And there's a pointer to an external buffer that contains the rest of the payload. If the string is up to 12 bytes, it's inlined into the string view itself. So here you can see an array of strings: we have a flat vector of string views. The first value, "orange peel", fits in 12 bytes, so it's inlined. The next value is "tall mountain", a 13-byte string, so you have a size of 13, the four-byte prefix "tall", and then a pointer to the payload. Nulls are again represented in the nulls buffer. This deviates from the Arrow format because here we have a pointer instead of an offset. The main advantage of having a pointer is that you can have concurrent operations on the elements; you don't have a dependence on another element's offset, so you can perform operations on these vectors in parallel and out of order. And the size also helps: if you have operations like substring, you can just update the size; you don't have to change the payload at all. So one of the big advantages is you get zero copy, or you can avoid materialization in most cases, using this representation. Actually, this format has also been proposed for the Arrow spec — there's an RFC, a request for comments, on this format — but I don't think it's been finalized yet. But yeah, this is very helpful for the evaluation of certain operations.

Do you compress the strings, or do they just sit around as raw text? Sorry, Andy? Do you compress the string buffer? I don't think so, not today. Okay.
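Just to make that layout concrete, here's a rough sketch — not the actual Velox class, just an illustration under the assumptions described above (16 bytes total, a 12-byte inline threshold, a 4-byte size, and a 4-byte prefix) — of what such a string view could look like:

```cpp
// Illustrative only: a 16-byte string view with a 12-byte inline threshold.
// The real Velox StringView differs in details; names here are hypothetical.
#include <cstdint>
#include <cstring>
#include <string>

struct StringView {
  static constexpr uint32_t kInlineSize = 12;  // prefix (4) + inlined (8)

  uint32_t size;     // byte length of the string
  char prefix[4];    // first bytes, handy for cheap comparisons
  union {
    char inlined[8];       // remainder of a short (<= 12 byte) string
    const char* data;      // full payload of a long string, externally owned
  };

  static StringView make(const char* str, uint32_t len) {
    StringView v{};
    v.size = len;
    std::memcpy(v.prefix, str, len < 4 ? len : 4);
    if (len <= kInlineSize) {
      if (len > 4) {
        std::memcpy(v.inlined, str + 4, len - 4);  // short string lives inline
      }
    } else {
      v.data = str;  // long string: no copy, just point into the buffer
    }
    return v;
  }

  std::string str() const {
    if (size > kInlineSize) {
      return std::string(data, size);
    }
    std::string out(prefix, size < 4 ? size : 4);
    if (size > 4) {
      out.append(inlined, size - 4);
    }
    return out;
  }
};

static_assert(sizeof(StringView) == 16, "fits in two machine words");

int main() {
  const char* payload = "tall mountain";           // 13 bytes: out of line
  auto a = StringView::make("orange peel", 11);    // 11 bytes: inlined
  auto b = StringView::make(payload, 13);
  return a.str() == "orange peel" && b.str() == "tall mountain" ? 0 : 1;
}
```

The prefix lets many comparisons be decided without chasing the pointer, and the size field is why operations like truncation only touch the view, never the payload.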
So, back to vectors: there are a couple of optimizations with vectors as well. Dictionary vectors provide zero copy for most cardinality-changing operations, and they're heavily used. The whole idea of vectorized execution is to avoid materializing intermediate values, and Velox provides all the tricks to delay materialization as much as it can. Also, like I said, because of the way the values are represented, all elements can be written out of order, and as a result certain operations like conditionals can be executed faster. There's also the notion of a lazy vector that does not immediately materialize — basically a lazily materialized vector — which is only loaded when a runtime filter determines it's needed. Say you have a lazy vector loading some column from a file: if the runtime filters determine that no rows are needed, it won't load that column at all. So lazy vectors help in a lot of cases.

Like I said, dictionary encoding is pretty popular, and we want to avoid materializing intermediate values, so a couple of operators end up producing dictionary-encoded vectors. The table scan, for instance, can produce dictionary vectors for columns with a lot of repeated values. The filter operator uses dictionary encoding to represent the subset of the input rows that passes the filter. Similarly, the join operator, the unnest operator, and even functions can emit dictionary-encoded vectors. Again, the idea is to provide zero copy wherever you can during execution.

So, dictionary encoding again: it can be used to represent not just cardinality reduction but also cardinality increase. Traditionally, when you use dictionary vectors, you want to encode large values into a smaller set of values. But here you can see that you have a table with names and colors: you can encode the colors using indices, but you can also encode the names that are associated with the color red using dictionary encoding. So it's not just to reduce cardinality, but also to represent these other cases. And a filter applied to a dictionary column produced by an operator adds another layer of dictionary encoding: if you have a dictionary vector and you're applying a filter on top, it adds one more dictionary layer, and if you apply a projection on top of that filter, it adds another dictionary layer. So dictionaries can wrap around other dictionaries wherever possible, and again, the idea is to avoid materializing and to have zero copy.

All right, so that's the story anyway: if I start doing a filter, the output of the filter is another dictionary — you run the filter on the compressed data, and what you spit out is more compressed data. Got it, okay. And yeah, there are APIs to flatten: whenever you need to, there's an API to flatten out all those layers and get the values out. Didn't Vertica get rid of late materialization in their system at some point? I'm not sure. Okay, just curious. Yeah, I think the whole point of doing this is of course not for primitive data types, because for those it's probably cheaper to just copy; the assumption is that most of the data we're dealing with are vectors of arrays, maps, nested things that are pretty expensive to copy. Yeah, primitives are lightweight, but like Pedro said, complex types and strings can be expensive.

All right, so that's the summary of vectors. Now let's move on to functions. Functions come in various types, like scalar, aggregate, and lambda, and there's an API that allows users to add custom functions; it's a friendly API, and I have some slides coming up. Scalar functions have two APIs: one is simple, row by row, which can be template based and gets inlined; and then we have the vectorized API, which is batch by batch. There are optimizations you can provide in functions: you can specify fast paths. If you know the input is not null, you can have a fast path for not-null flat vectors, and if you know the input is already ASCII, there's a fast path for that too. Currently, Velox has most of the functions required for Presto SQL and Spark SQL implemented. You can see on the right a list of the various categories of functions that are currently available in Velox. And if you look at the Velox documentation, there's also this nice function coverage map that tells you which functions are currently implemented and which are not: the ones in green are currently implemented, and the highlighted ones are still to be done. Is that a function that Presto supports and you're adding, or what is this? These are functions that are currently available in Velox; some of these are used in Presto and some in Spark, so it covers both of those layers.

Yeah, so this is a sample function API for a row-by-row function. You simply create a class, a multiply function, you create a bool call() with the arguments, and then you implement the function.
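Roughly, that row-by-row style looks something like the sketch below. This is only an approximation: the exact Velox template plumbing and macros are omitted, and the names here just mirror the multiply example from the slide.

```cpp
// Approximate sketch of the row-by-row ("simple") scalar function style.
// The real Velox API uses additional macros and type aliases not shown here.
template <typename TExec>
struct MultiplyFunction {
  // Called once per row; returning true means the result is not null.
  bool call(double& result, const double& a, const double& b) {
    result = a * b;
    return true;
  }
};
// This struct would then be registered under a name such as "multiply",
// which is how expressions refer to it (registration is described next).
```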
And then the important piece is to register the function. There's a register-function API where you register it with a string to tag the function, so whenever you use it in your expressions you just say "multiply", and the framework automatically picks up your function. This is the templated style, where you create a templated function, and you can obviously specialize some of them; on the top is the generic template and on the bottom is the specialization for the double type for this simple multiply function. The actual instantiation happens when the registration happens, so you can register individual instantiations, or there's a helper that will register all the possible type instantiations of the function. So there are a lot of these helper modules that can be used for functions.

All right, so that's the summary of functions. Now, the next layer is expression evaluation. Like I said earlier, expressions are evaluated in a vectorized manner over an expression tree, and there are two phases involved in expression evaluation: the first is compilation, and the next is execution. They happen on an expression tree that is built using the Velox expression nodes (I'll come to plan nodes later), and that tree gets built in the compilation phase. The compilation phase does some optimizations: it does constant folding — for instance, if you have upper(a) where we know a is a constant, then you can do constant folding — and it also does some conjunct flattening, so if you have a large conjunction of ANDs, it flattens all of them. There's also this nifty common subexpression elimination optimization as well; usually the database optimizer would already do this, but in some cases it's helpful to have it here too. Compilation also populates some metadata: it identifies the null behavior, whether the operation is deterministic or not, whether the input fields are distinct, and whether there are conditionals in the expression, and all of these are used to optimize the execution of that expression.

The execution phase involves traversing the expression tree with a row mask, which is a selectivity vector; it identifies the active rows and performs the expression only on those active rows. Again, the idea here is to avoid materializing intermediate results during execution. And as it traverses, it avoids evaluation wherever it can. For instance, if you know the function returns null for any null input, then you don't need to evaluate the expression for those rows; you can just set the result to null. And if there's a common subexpression, you can reuse its results. It also works on encoded data directly wherever possible. For instance, if you have a dictionary vector — say an array of color strings encoded as a dictionary vector — and you want to apply upper, which converts a lowercase string to uppercase, the function gets pushed directly onto the values vector and just translates those values. It doesn't do any unwrapping, converting the dictionary vector to flat and then back.
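To illustrate that idea with a toy example — this is plain standard C++, not Velox code, and the DictionaryColumn type here is made up for the illustration — applying upper() to a dictionary-encoded column only has to touch the distinct values, while the row-to-value indices are reused untouched:

```cpp
// Toy illustration (not Velox code) of evaluating upper() directly on a
// dictionary-encoded string column: the function runs once per distinct
// value, and the indices that map rows to values are reused as-is.
#include <algorithm>
#include <cctype>
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

struct DictionaryColumn {
  std::vector<std::string> values;   // distinct values ("red", "green", ...)
  std::vector<int32_t> indices;      // one entry per row, pointing into values
};

DictionaryColumn upper(const DictionaryColumn& input) {
  DictionaryColumn out;
  out.indices = input.indices;                 // zero work on the row mapping
  out.values.reserve(input.values.size());
  for (const auto& v : input.values) {         // |distinct values| calls,
    std::string u = v;                         // not |rows| calls
    std::transform(u.begin(), u.end(), u.begin(),
                   [](unsigned char c) { return std::toupper(c); });
    out.values.push_back(std::move(u));
  }
  return out;
}

int main() {
  DictionaryColumn colors{{"red", "green", "blue"}, {0, 0, 2, 1, 0, 2}};
  auto result = upper(colors);
  for (int32_t i : result.indices) {
    std::cout << result.values[i] << "\n";     // RED RED BLUE GREEN RED BLUE
  }
}
```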
So it operates directly on the dictionary vector. Does my upper function need to know that it's actually operating on the dictionary? I guess you're feeding it one scalar at a time; does it care where it came from? Yeah, so basically it depends on the metadata. If it knows things like the distinct and null fields, it'll determine that it can push the function inside; it'll automatically determine that. Pedro, do you want to add anything here? No, I think just that the function doesn't need to know about the encoding in that case. You just provide that signature, and everything else is handled by the expression evaluation engine. Okay, yeah. So you can write your function once, not once for compressed data and once for the rest. That makes sense. Okay. Exactly.

All right, so that's the expression evaluation layer, and moving a level up, you have plan nodes and operators that you can compose to build your execution plan. Some of these are inspired by Presto, but there's pretty much full coverage for, I think, most database needs. We have a table scan node, a filter node, a project node, and you can see that each plan node has a corresponding operator or set of operators — a plan node can be mapped to a single operator or to multiple operators — and some of them are leaf, or source, operators. For instance, the table scan node has a corresponding table scan operator, and it's a leaf node. Similarly, if you look at the hash join in the fifth column, that has probe and build operators. Some of the other leaf nodes are exchange, merge exchange, and values. So yeah, I think there's coverage for most operators, at least for what Presto needs.

All right, so moving on to the other components: there are connectors, and there's support for file formats and file systems as well. Connectors, again, are a Presto-style specification used to specify a source. They have some nuances like splits, which are the fundamental unit of work that the coordinator assigns, and there's this notion of data sources, data sinks, column handles, table handles, and an expression evaluator. This is an API that the Presto specification provides, and Velox has corresponding implementation support. Currently, Velox has a specific implementation for the Hive connector: it has a Hive data source, a Hive data sink, Hive column handles, and it understands partitions and other Hive-related nuances. On the file format side, there's support for DWRF, which is an extension of ORC, and for Parquet for reads; for writes, we only support DWRF. On the file system side, there's support for S3, HDFS, and the local file system.

What is the main bottleneck in your Parquet implementation that requires you to rewrite it? So, we currently wrapped it around DuckDB, but we realized that pushing down some of the dynamic filters and other optimizations is not happening. So we're actually making that another disaggregated layer, where we have a generic reader API, and you can extend it to any file format — mostly columnar, like ORC and Parquet, and there are other custom formats that I think Meta is building. Pedro, do you have anything to add?
Yeah, just to make a quick plug on that as well: I think another thing we observed was that if you look at the main open source readers for Parquet and ORC, the APIs they have are not as efficient as they could be. They lack support for pushing down aggregates and some of the things we did internally. So one of the things we've been discussing is doing what Deepak mentioned and disaggregating that part as well, maybe providing a smaller library that has the file encoding and decoding logic and implements the full extent of the APIs we need, but also decouples that from the memory layout. Most of them are either coupled with Arrow or, like Deepak mentioned, with DuckDB's memory layout, but that could be decoupled. So there's also a separate project, kind of part of Velox but not necessarily, where the goal is decoupling the encoder and decoder logic. Right.

All right. So, the other piece I mentioned is the notion of tasks, pipelines, and drivers. Here we have a Velox application, and there's this task that coordinates the plan, or whatever the composed operators are. We have a pipeline of operators here, where operator zero is the source, and the Velox application calls addSplit — like I said, a split is a fundamental unit of a file, in some sense — so it keeps providing these splits, and the task framework controls the drivers and the pipeline. A driver is like a thread, in some sense: it controls your pipeline and the operators in it. So that's the setup: you have operators, pipelines encapsulate the operators, a driver controls a pipeline, and then the task coordinates everything — it coordinates the drivers and also handles some resource management and other things.

The way the driver works is that it coordinates the operator input and output. So here you have op and next-op: op gives its output to next-op, and next-op consumes it. The driver first invokes a getOutput call on op, and it either gets back a null pointer, or a notification that the operator is blocked, or that it's finishing, and then the driver informs next-op that the previous op is finished. That's how the protocol works: the driver coordinates the synchronization, it checks the getOutput call, and then it notifies the next op. The main advantage of this style is that the driver can go off-thread whenever an operator is blocked, and other drivers can execute instead. So it's an easy way to context switch without having to save the whole control stack when we context switch.
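Here's a toy sketch of that driver loop — plain C++, not the actual Velox classes — just to show the shape of the protocol: pull a batch from one operator, hand it to the next, and yield the thread when an operator reports that it's blocked:

```cpp
// Toy sketch (not the actual Velox classes) of the driver protocol described
// above: pull a batch from one operator, feed it to the next, and yield the
// thread when an operator is blocked so another driver can run.
#include <memory>
#include <optional>
#include <vector>

struct Batch {};  // stand-in for a vector of columns

enum class OperatorState { kRunning, kBlocked, kFinished };

struct Operator {
  virtual ~Operator() = default;
  virtual std::optional<Batch> getOutput() = 0;  // nullopt: nothing ready yet
  virtual void addInput(Batch batch) = 0;
  virtual void noMoreInput() = 0;                // upstream is done
  virtual OperatorState state() const = 0;
};

// One pass of the driver over its pipeline. Returns false when the driver
// should go off-thread because some operator is blocked; a scheduler would
// resume it once the operator is unblocked.
bool runOnce(std::vector<std::unique_ptr<Operator>>& pipeline) {
  for (size_t i = 0; i + 1 < pipeline.size(); ++i) {
    Operator& op = *pipeline[i];
    Operator& next = *pipeline[i + 1];
    if (op.state() == OperatorState::kBlocked) {
      return false;                              // yield instead of spinning
    }
    if (auto batch = op.getOutput()) {
      next.addInput(std::move(*batch));          // hand the batch downstream
    } else if (op.state() == OperatorState::kFinished) {
      next.noMoreInput();                        // propagate end of stream
    }
  }
  return true;
}
```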
Any questions here? Okay. So, the most fun part is buffer management, which is crucial for performance and I think has been well studied. In Velox we again employ the Umbra idea: small objects are allocated directly from the heap, and large objects use an mmap-based allocator inspired by the Umbra work, so we have these size classes, and each size class reserves the full virtual memory range up front. Behind the scenes, we use mmap and madvise to map to physical memory blocks, and since this is not backed by any physical file, there's no overhead. It's a nice approach, and it also allows for variable-size blocks, so you don't have the traditional limitations of serializing variable-size blocks into fixed-size blocks and the overhead that comes with it.

The mmap semantics on Windows are different, so I'm assuming this means you're only going to run on, or target, Linux? Which is fine — I'm not saying it's a bad thing — I'm just curious if you've thought about that. I don't think so, but Pedro can help: Pedro, do we have anybody who uses Velox on Windows? I don't think so; for now we only support macOS and Linux. We'll do the same thing.

So, once the buffers are allocated, there's memory management on top: there's a memory manager that controls the allocations, and allocations happen via memory pools in a hierarchical fashion. So if you have a nested complex type, the parent has a memory pool and the child has a memory pool, and you can have this hierarchy of pools. The API allows you to track usage at each level, so you can go to the top level and ask what the total usage is. Tasks can reserve memory a priori, or they have to provide a mechanism to spill — that's the responsibility of the operator to handle — and the memory manager monitors the usage and helps the task know what the usage and limits are, et cetera. Pedro, anything to add here? No, I think that's pretty good. Actually, just adding to the previous question: I think that's maybe one of the main ways Velox and DuckDB differentiate. DuckDB focuses a lot more on providing the full DBMS stack in a very portable way, so they support Windows, they support different adapters. For us, the goal is much more about providing reusable components for really high-performance systems, and in that case portability is a little less important, in a way.

Actually, I have a question — can you go back a couple of slides? Sure. Sorry, forward or back? Back — no, sorry, keep going. I have a question on the driver; you had a slide on the driver. Yeah, yeah. So, my name is Aruna. The question here is this: you have these various operators and next operators, and then a driver. Traditionally, such a system is implemented in an iterator fashion, and when you have iterators you don't need a driver — the operators are connected, every operator has getNext and close, so things happen automatically. So my question is, why did you need a driver? So that you can easily context switch; that's one of the main benefits. Say you're blocked on an operator: in the traditional way, you'd have to employ other means to offload that. How would you do that in the traditional iterator model? You'd be blocked, right? Yeah, and maybe another way to put it: in the previous model we relied a lot more on the operating system to do that, while now we control those things at the application layer, and the assumption is that we can be more efficient that way. Let me see — so I guess in your case a driver is a thread, and that thread kind of acts as an orchestrator.
So actually, I kind of missed that — in this case a driver is a thread, you said, right? It's analogous to a thread, but it's actually an object that handles the interaction between the operators. Okay, and is every operator a thread as well? No, no. Every pipeline has a driver. Again, these are just logical, but a driver can say, okay, I'm waiting, and go off-thread, and then other drivers can come in and execute. So it's sort of a logical way to handle threading. Oh, I see. So, correct me if I'm wrong, but this pipeline — operator 0, 1, 2, 3, all the way to N — is controlled by a driver? Yes. Yes. And the op and next-op from before are just two operators in this pipeline, like op 1 and op 2, maybe. Right. Okay. Thanks.

All right. So that was memory management, and the other interesting layer is caching and I/O — again, a performance-critical layer. In Velox we currently have support for both memory and SSD caching. The purpose of caching is to alleviate the impact of remote I/O latency, and the smart thing here is that the memory cache works with the memory manager, so it knows how much memory is available and tries to use whatever is free for caching purposes. When memory is full, it evicts the blocks that are not popular and flushes them to the SSD cache. There are some more tricks here as well: I/O reads can be prefetched, and they can also be coalesced. If you're reading from S3 and the gap between two reads is not too big, it'll coalesce those two reads into one and read the entire range from S3. This gap obviously depends on the file system: S3 has more latency, so the gap will be bigger; on SSD it will be different. The gap is set by benchmarking the storage. There's also a scan tracker that tracks the access patterns and recognizes, say, that two columns are always accessed together, so it'll try to always coalesce them. It also exploits temporal locality: once it loads one column, the other column will naturally be there, and it can coalesce those reads as well.

My question is regarding Raptor — is that part of Velox? How do you interact with that one? It's specialized for Presto, actually. Yeah, I don't think Raptor is involved here at all; this layer is native to Velox. Raptor is separate. Is it in addition, or do you not need it? We don't need it, because this layer handles the data as well as the metadata and everything. Pedro, you can correct me. Yeah, no, I think that's correct. I was just going to say that maybe a way to think about this is that it's the C++ implementation of Raptor, in a way. But Raptor is also doing multi-cluster caching — are you doing that here? With Raptor you can have N clusters accessing the same data underneath, and you can cache across them. No, not in the current implementation; we don't cache across clusters. And you don't do that because Raptor is doing it, right? Yeah, I'm not sure if we — at least at Meta, I'm not sure if we use that. It might be supported by the community. Yeah, okay. I'll take a look at Raptor versus this. Thanks. All right, so that's caching and I/O.
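Going back to the read coalescing for a second, here's a tiny illustration of the idea — plain C++ with made-up names, not Velox's actual API — where nearby byte ranges get merged into one request whenever the gap between them is under a per-storage threshold:

```cpp
// Toy illustration of read coalescing: nearby byte ranges are merged into one
// larger read when the gap between them is below a threshold, which would be
// tuned per storage system (larger for S3, smaller for local SSD).
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <vector>

struct ByteRange {
  uint64_t offset;
  uint64_t length;
};

std::vector<ByteRange> coalesce(std::vector<ByteRange> ranges, uint64_t maxGap) {
  std::sort(ranges.begin(), ranges.end(),
            [](const ByteRange& a, const ByteRange& b) { return a.offset < b.offset; });
  std::vector<ByteRange> merged;
  for (const auto& r : ranges) {
    if (!merged.empty() &&
        r.offset <= merged.back().offset + merged.back().length + maxGap) {
      uint64_t end = std::max(merged.back().offset + merged.back().length,
                              r.offset + r.length);
      merged.back().length = end - merged.back().offset;   // extend last read
    } else {
      merged.push_back(r);                                  // start a new read
    }
  }
  return merged;
}

int main() {
  // Two column chunks 100 KB apart: with a 1 MB gap threshold (e.g. S3) they
  // become a single request; with a zero threshold they stay separate.
  std::vector<ByteRange> ranges = {{0, 500'000}, {600'000, 400'000}};
  std::cout << coalesce(ranges, 1 << 20).size() << "\n";  // prints 1
  std::cout << coalesce(ranges, 0).size() << "\n";        // prints 2
}
```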
Extending on that, the SSD cache specifically is backed by files on SSD, where each file on SSD is essentially a shard. The data backed by each file is selected based on the storage file number, so there's a one-to-one correspondence between a file on remote storage and a file in the local SSD cache. Each SSD file consists of a number of regions, 64 MB in size, and each region has a pin count and a read count that are used to decide whether to keep it or not. Cache replacement takes place region by region, and regions with a smaller read count are evicted first. So that's the SSD layer story.

All right, so now let's summarize all the Velox components. The next point is the extensibility wins. As we saw, most of the components I described are extensible: Velox users can customize aggregate functions, connectors, operators, file systems, file formats, functions, types, and even caching policies. This extensibility is again a big win, and having an API that makes it easy to extend is also a good property of Velox. So we've seen the high performance, some of the features that provide that high performance, the extensibility, and also the consolidation that Velox provides.

All right, so that's pretty much the end of the Velox part. Before I move on to Prestissimo and talk about how we used Velox to implement a Presto worker, are there any questions on Velox? The question is whether the local SSD cache is resilient to failures of the SSD. Sorry, I'm not sure what you're asking. So, as you said, there's a big gap in performance — latency in particular — between S3 and local SSD. When you bring data to the local SSD, is it made resilient with erasure coding or something, so that if the SSD fails, which they do a lot, you don't need to go back to S3 and drag the data in again, which takes a while? I know there's some support that was added, like CRCs, for some redundancy, but I'm not very sure. Pedro, do you remember what was added for SSD? Yeah, I'm not sure if we have any type of checksum for the content we have in the cache. I think the assumption is that it's ephemeral, right? So in the worst case, if something happens, you just need to read things again from external storage. We do have some work on durability, of course, but I'm not sure if there's anything beyond that. Yeah, and of course you can rely on file systems that do this sort of thing, so you don't need to do it yourself. Yeah, and with that said, the whole cache work is partly ongoing, so some of those things are areas where we'd be really interested in collaborating and partnering with anyone interested — so, durability of the SSD cache.

All right. So, moving on to Prestissimo. The meaning of Prestissimo is, sort of, faster than Presto; it's a tempo in music that means extremely fast. So what is Prestissimo? It's a Presto worker implementation using Velox. Currently, Presto's coordinator and workers are written in Java. The idea here is to have a C++ worker written using Velox and make it a drop-in replacement for the Java workers. Customers will not have to do anything extra except spin up a Prestissimo process with the same configuration files they would use for the Java worker, and it'll run out of the box. That's the end goal.
And yeah, like I said, Prestissimo is built using the Velox library. This is how the big picture looks: on the left we have the coordinator, which is in Java. It does the parsing and optimization, and it also manages distributed execution. It ships the tasks, or plan fragments, to the workers, and that's where Prestissimo uses the Presto protocol to understand those pieces and translate them to the Velox library. So the query plan is split into fragments — this all happens in the coordinator — and the fragments are shipped over HTTP, encoded as JSON blobs. Connector splits, which as I mentioned earlier are the fundamental units of data processing, are also scheduled across the workers. And then the Prestissimo process reads from storage and does the execution. That's basically what works today. The worker has two pieces: a control plane that is used to coordinate the task and query fragment execution, and a data plane that returns the results once the execution is completed. We have the Presto SerializedPage wire format currently implemented in Velox, so we can translate the final result into the Presto wire format and ship it off to the Presto coordinator.

All right, so let's look at a simple query; I have some slides on how all of this works end to end. Say you have a simple lineitem query doing a filter and a project. The Presto optimizer analyzes this and generates a JSON fragment that has all the details of the plan that needs to be executed, and it gets shipped to the Prestissimo worker. There's also a JSON blob for the split: it has the file location, the offsets, and other metadata that is required for loading the data — basically it has the Hive connector and all that information. Once Prestissimo gets the fragment, it invokes the Velox API and converts the JSON plan fragment into a Velox plan — the tree of Velox plan nodes that we saw earlier, with table scan, filter, project, and so on. For the given query, the plan tree has a partitioned output, a projection, and a table scan. The Velox plan is then passed further into the execution layer; the inputs to this are the Velox plan and the number of drivers to use. There's also a codegen path that I didn't talk about, which is currently experimental and needs some more love. There's this whole debate on vectorization versus data-centric code generation; there was a recent paper showing it's not clear that one is better than the other — it actually depends on the use case. So if you have a use case where you feel codegen is going to be better, you can codegen the plan and use that for execution. After the optional codegen, there's some more local planning, which converts the plan into pipelines and operators — some nodes are composed of multiple operators — then it optimizes the operators, creates the driver factories, and the drivers start executing. That's pretty much the flow.
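For a sense of what a composed Velox plan looks like in code, here's a rough sketch using the PlanBuilder test utility for the lineitem filter-and-project example. The header path, method signatures, and type helpers are approximate, so treat this as an illustration rather than exact Velox API:

```cpp
// Rough sketch of composing a Velox plan programmatically: table scan ->
// filter -> project, in the spirit of the plan-node tree described above.
// PlanBuilder is a Velox test/utility helper; details here are approximate.
#include "velox/exec/tests/utils/PlanBuilder.h"   // path is approximate
#include "velox/type/Type.h"

using namespace facebook::velox;

core::PlanNodePtr buildLineitemPlan() {
  // Schema of the columns we want to scan; names follow the TPC-H lineitem
  // example from the talk.
  auto rowType = ROW({"l_quantity", "l_extendedprice", "l_discount"},
                     {DOUBLE(), DOUBLE(), DOUBLE()});

  // Each node below corresponds to a Velox plan node; at execution time the
  // local planner turns these into operators grouped into pipelines.
  return exec::test::PlanBuilder()
      .tableScan(rowType)                                // leaf/source node
      .filter("l_quantity > 24.0")                       // FilterNode
      .project({"l_extendedprice * l_discount"})         // ProjectNode
      .planNode();
}
// The resulting plan node tree, plus a driver count, is what the execution
// layer takes to create the task, pipelines, and drivers described above.
```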
And this is how multiple drivers work on a single pipeline: there are driver 0, driver 1, driver 2, and driver 3 that all work on the same pipeline 0, and they execute in parallel. Some pipelines can be parallel and some pipelines have to be sequential, and the Velox local planner automatically controls all of that; if it can execute something in parallel, it'll spread the work across drivers. All right, so that's the flow of how Velox runs inside the Prestissimo worker, and how Presto can now get high performance just by a drop-in replacement with the Prestissimo worker.

All right, so that's the end of the Prestissimo implementation, and I have some slides on the open source development experience as well. Yeah, I feel the force is strong with this one. Velox is really young: in terms of open source projects, it was open-sourced by Meta around August of last year, even though Meta internally has been working on it for about a year or a year and a half. Right now there are a bunch of partners, mainly ByteDance, Intel, and Meta, and the community is very active; PRs are promptly reviewed and addressed. The documentation is up to date and is very helpful for beginners — especially since building and running things is not trivial, the documentation is very helpful if you want to extend or add anything. There's a good amount of documentation on all the components I described today. The community also publishes a newsletter every month to talk about what progress has been made, and there's a Velox Slack channel that is used to collaborate and communicate.

This is a sample of the documentation: we have the functions that are currently available, and there's a developer's guide that is very useful — I learned a lot from the developer's guide when I was onboarding onto this project. It talks about vectors, how to add a scalar function, how to add an aggregate function, how expression evaluation works, and there are some debugging tools as well that you can use to debug Velox plans and the runtime statistics that are generated when you execute a plan. And this is a sample monthly update slide: it has a documentation update, what features got added to the core library, and what Presto functions got implemented; the March update actually had a feature of the month that made it much easier to write vector functions for complex types.

All right, so to conclude: Velox is an open source C++ data processing acceleration library. It provides high performance, extensibility, and consolidation, with APIs and modules that cater to a wide range of data processing needs, and it's also a great platform for research and development. There are some resources here if you want to refer to the Velox documentation and the GitHub project repo.

And I briefly have two slides on Ahana — I didn't really talk about what Ahana is, even though the slides all have Ahana on them. Like Andy mentioned, we offer Presto as a SaaS service to customers. There are two main pieces: there's the Ahana SaaS console, which is the control plane and sits inside the Ahana VPC, while the compute plane resides entirely in your account, the user's VPC.
That's a big differentiator between us and other SaaS providers, because since the compute plane is in the user's account, the data doesn't have to move out of the user's account, and there are no security or other headaches involved. So that's a big difference. And we're working on a bunch of projects currently: as you saw, we're working on the C++ worker and planning to extend it beyond Meta's needs to make it more general purpose. We have some interesting projects in the optimizer for Presto, and automatic tuning — which Andy is working on — and telemetry are of great interest to us too. The knobs just keep growing, so some way to automate tuning is always nice.

I've got a quick question — thank you for your presentation, first of all. I'm just wondering, what's the status of the Prestissimo you mentioned before? Is there somewhere we can get our hands on the source code? Yeah, so actually just last week the Prestissimo code moved into the PrestoDB project, so you can get the Prestissimo layer from the PrestoDB repo. There's a native folder there — I don't know the exact name, but if you search for "native" you'll see only one, and you'll find the Prestissimo code there. Great, thank you very much.

All right, to finish up — thank you for doing this. We have a lot of questions. Kareem, why don't you go first, and then we'll go to Hamid. Okay, so I have two questions. The first question is: the Velox in-memory format matches Arrow most of the time for many data types, but sometimes it deviates, like in the example of strings. Can you comment on the pros and cons of doing this? The pro is definitely better execution opportunities — like I said, conditionals can now execute out of order — and if transferring data over the wire using Arrow is the less frequent case, then you want to get as much performance as you can and have a thin conversion layer over it. And again, it's not too much of a deviation; I think the payload can be copied as-is in most cases, it's just that you have to update those offsets correctly when you convert a Velox vector to an Arrow vector. So yeah, it's a trade-off between high performance and portability. Thank you. My other question is, what is the best channel if I have general questions about Velox? If you go to the GitHub project page, there's a Slack link for the Velox project — I think you have to email somebody to get an invite — but all the reviews are there on the GitHub Velox project, and if you don't find it you can always email me as well. Thank you.

Hamid, go ahead. Yeah, so one thing here is: do you have any performance comparisons? And in particular, are you also using LLVM for codegen? No — like I said, codegen is still experimental, and we're not using LLVM yet; that's an area where we need to give Velox some more love. Now, answering the question about performance: we don't have end-to-end numbers yet. The thing is, Velox is just one piece, right? For end-to-end performance, a lot of the effort, or burden, is also on the optimizer; if the optimizer does not generate good plans, then Velox can only do so much. So we're still figuring some of that out. The current approach is to use the TPC-H benchmarks, and Presto, I think, still needs some more work there.
So for now we're doing pure Velox comparisons. I was actually working on benchmarking the last couple of weeks, and we compared DuckDB and Velox, just using Velox by itself. We have a TPC-H builder in Velox that has plans for a couple of the TPC-H queries, and you can benchmark those using just Velox. So we compared with DuckDB, and the performance was pretty comparable, or actually better. The next step is to use Prestissimo and do end-to-end TPC-H through Presto, but that means Presto should give you at least reasonable plans. That's where we are; we're actually looking into the plans that Presto generates and studying those, pretty much.

One more question for you here: what is the number of contributors, and roughly which companies are contributing? I'm sure Facebook is involved, right? Yeah. I don't know the exact numbers — you'd have to look at the repo; it's all open source, so all the contributors are there. Even Meta contributes directly to open source; there's no internal repo or anything, everything goes to open source directly.

So I actually have a question, Deepak. Throughout this presentation, when you said Presto, did you mean Presto or Trino? PrestoDB. So Trino is a separate thing. Yeah. So my question is — I'm actually new to this area — how different is Trino from Presto? That's another presentation; the short answer is... Can you maybe briefly say what the differences are? I think a year ago, or maybe a year and a half ago, they became separate projects. At a high level, at least from Ahana's perspective, we at Ahana, offering PrestoDB, are focused more on execution, whereas Trino, I think, is focused more on federation. That's the high-level difference I can give you, Ahana versus Starburst or, yeah, Trino; at the open source level there's no specific answer. I just posted in the chat the link to the Trino talk from this series; they get into a little bit about the forking from the Meta/Facebook version of Presto that Ahana is working on. Yeah, okay, thanks. And I think Deepak's comment is correct — that talk is very heavy on "hey, we're doing federated queries across a bunch of data stores."

So my last question would be sort of a meta question — not Meta the company, a meta question. What do you see as the future of OLAP systems if Velox succeeds? What is the differentiator? Is it just the UI, the UX stuff up above, the query optimizer? If everyone's using Velox, they're all getting the same performance; other than the query optimizer, what would differentiate one system from another? I mean, within Meta, the use case is to consolidate the different processing systems that they have — machine learning, Spark, and the analytical use cases are all different things — so there will be some modules that are specific to, say, machine learning or Spark, but they'll reuse the memory allocator and all of those pieces, and the functions, for instance, are reused in PyTorch. So from an engineering and development aspect, this saves a lot of engineering resources, because they can reuse some of these layers.
But at a high level, they're still all diverse — Spark is more ETL and Presto is more analytics — so some of the algorithms will vary; won't some of the operators and other modules be different? Okay, so Velox and these pieces are the backbone and the evaluation, and then people would implement their own custom functions on top. Correct. So right now we have Spark SQL functions that are different from the Presto SQL functions; they have separate sets of function implementations. And if somebody needs a Spark function in Presto — just some function that they like — they can now easily use it, because Velox has that implementation.