There's a black Ferrari parked outside in the fire lane and the school wants you to move it. License plate 2PAC 4LYF. If this is your car, please remove it.

So we're super happy to have Felipe here today from BlazingDB. BlazingDB is the only database startup based out of Peru, right? Felipe and his brother started BlazingDB in 2015. Most development is done in Peru, and they have an office in Texas because he went to UT Austin for his undergrad. Okay, all right, thanks.

Okay, so hey, I'm Felipe. I'm going to talk today a very small amount about the company and BlazingDB, but mostly, once I give you a quick introduction to what we've been building for a while, I want to get questions from you about what makes it fast, what's slow, what works, what doesn't work — however I can help increase the knowledge you may or may not have about GPU-accelerated databases. That's really my intention here today. So please stop me, ask questions, yell them out. I will not get offended.

The first thing I want to talk about is RAPIDS AI. RAPIDS AI is an open-source ecosystem that's been built by NVIDIA, us, Anaconda, the Gunrock folks, and — I'm forgetting someone — Quansight. What we've all basically done is take Arrow. Does anyone not know what Arrow is? Raise your hand if you don't. Great. Arrow is basically a representation of columnar information, and because it's a representation that has been defined and that has certain primitives on it, we can all share the same data representation. What makes Arrow appealing is the idea that it doesn't matter whether the data is in your process or in my process: we have the exact same data representation. We can build primitives together, we can share them, we can ferry data to each other via IPC, and there's no serialization or deserialization. I hope I don't have to explain why that's probably a good idea, unless someone is opposed. Someone offers: it's like in-memory Parquet without the compression. Yeah, exactly.

So the RAPIDS AI ecosystem is basically an implementation of Apache Arrow on the GPU with a bunch of computational primitives built on top of it. There's deep learning, visualization, machine learning, graph analytics — and we're over here in the normal analytics/ETL space. This is the part called cuDF; we'll call it cuDF for the rest of the talk. The CUDA DataFrame is basically the primitives built on top of Apache Arrow for processing: think of things like scatter, join, sort, and what have you. It's ways of operating on these data representations using these primitives.

One of the things we've been able to show with this — these are the execution speeds of a mortgage pipeline, which on its own doesn't mean anything. Basically, you have a bunch of information about California, about who is or isn't paying their mortgage, the rates of delinquency and so on, and a great deal of machine learning training on top of it. This is a project NVIDIA itself was working on, so these numbers are made by them, not by us. But what we see here is how fast we were able to do this using Spark.
This is a 20-node Spark CPU cluster. You see what happens when we go to 30, to 50, to 100: here we're using five times the resources, but we're definitely not getting a 5x improvement. And over here we're showing the exact same workflow using very similar code — it's all written in Python at this point, the interface the user used — but with something like a 4,000x performance improvement. This is really what got us excited about RAPIDS. Well, actually, this happened after, but it's what we're trying to do with RAPIDS: we're trying to show you that with the CPU you can't just throw more money at it — and you all know this because you're in a GPU class — but the GPU rocks for these kinds of operations. So working in a nice open-source ecosystem like RAPIDS that works on the GPU would be very beneficial for us.

What we ended up doing is building BlazingSQL, which is a group of different tools operating on top of this RAPIDS AI ecosystem. Before we could even get to the BlazingSQL part, we had to build a lot of the computational primitives in what used to be called libgdf and is now called cuDF.

So the first thing I want to show you is what Arrow on the GPU is. It's a very simple struct — and obviously, because you're smart, you'll abstract it with a nice C++ API like we did and do all kinds of exciting stuff with it — but the basic gist is that you've got two pointers to information residing in GPU memory. One is your data, and the other is a validity bitmask: one for "I am a valid value," zero for "I am a null value." Then there are the same kinds of fields you'd see in Arrow metadata: the size is the number of rows, the type of the column we're looking at, and a null count, which tells you whether you need to consult the validity mask at all. The remaining fields are junk for our purposes — some people use them, but we don't. (There's a sketch of this struct below.)

You end up with primitives that look like this: the inputs are columns, plus a few other things, and the outputs are always columns. Everyone operating on the GDF in general is using primitives shaped like this. We're all using columnar primitives, and we're all interoperating. The problem we were trying to solve here is ETL-ing information for all these different libraries: cuDNN, which is neural networks in CUDA; cuGraph, a graph library; and cuML. We were trying to make it so you could take information from the data lake — we'll get to that in a bit — bring it into Arrow GPU memory, and make it available to all these other libraries. Does that make sense? That is what we were interested in solving. Yes? No? Questions so far? Okay, all right.

Okay, so real quick, I'm going to show you a little bit about how it would be used.
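Before the usage walkthrough, here is roughly what that column struct looks like. The field names follow the libgdf headers of that era as best I recall, and the ctypes mirror below is just a way to show the layout in Python — it is not an official binding, so treat it as illustrative.

```python
import ctypes

# Illustrative mirror of the libgdf column struct described above.
# Field names follow the old libgdf headers as best I recall; this is
# a sketch of the layout, not an official binding.
class gdf_column(ctypes.Structure):
    _fields_ = [
        ("data", ctypes.c_void_p),     # device pointer to the column values
        ("valid", ctypes.c_void_p),    # device pointer to the validity bitmask
                                       # (1 bit per row: 1 = valid, 0 = null)
        ("size", ctypes.c_int),        # number of rows
        ("dtype", ctypes.c_int),       # enum tag: GDF_INT32, GDF_FLOAT64, ...
        ("null_count", ctypes.c_int),  # rows whose validity bit is 0
    ]
    # The real struct also carries extra dtype info -- the "junk" fields
    # he mentions that some projects use and Blazing doesn't.
```

A primitive like a filter or a join then takes pointers to structs like this as inputs and allocates new columns as outputs — the "columns in, columns out" shape he describes.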
After I go over how it's used, I'd really like to get some questions about why this could or could not work — a little bit of doubt.

Users interact with BlazingSQL via a Python connector. Most of these APIs are from PyGDF, which is the Python interface to this cuDF thing I've been talking about. You do things like read Parquet files — here we're skipping a bunch of options, like telling it which columns and which row groups, just to show you something quickly — but you basically load Parquet files into GDF columns, you open up connections, and then you start running queries on them. This is actually one of three ways you can get data into cuDF.

All user interactions for us go through an orchestrator. The orchestrator's only job is to interact with the rest of the cluster; it is the entry point for users. When you're running SQL, for example, the first thing the orchestrator interacts with is our parser and planner, which uses Calcite. Who here has used Calcite? Who here has heard of Calcite? Okay. Who dislikes Calcite because you've heard of it but never used it? Don't know yet? All right. Well, I like it a bit — slightly more than halfway. I don't dislike it, for sure. Calcite lets you do, as you all know, a couple of nice things that we used to manage ourselves, which was just a pain. Parsing and validating queries is nightmarish work, and it's normally not very interesting from the perspective of developing a database, so I encourage everyone to look for tools that do the regex-wrangling for you, because it's not very exciting stuff. Sorry — some people find parsing very exciting. And validating. Okay, all right. I, as a person trying to sell something, was in all honesty not so interested in the parser, and that is why we decided to use Calcite.

What Calcite basically does is take something like this — here's a very simple example query, selecting X and X + Y from a table joined with another table — and give us a non-optimized logical plan: scan table A and table B so both are available, do a logical join, then apply the filter (the first column, X in this case, is less than five), then do the projections (take column X, compute X + Y). Now, what sucks here? What's interesting about the order of these operations? Right — the filter is way up the tree. What should we do with it? Push it down. Yes, we said switch it. Obviously this is a very simple example and it's easy to see what to do, but that is one of the things Calcite can help you do: it can push the filter condition down. Ideally — I didn't show the optimizer output here — it would also push the projections down, because after you scan the table, you don't actually want to load every column when you only need a couple from the data set. That's predicate pushdown, and beyond that you can even replace entire relational algebra nodes: something like a logical join followed by a filter could become something else entirely.
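To make the rewrite concrete, here is a minimal, hypothetical sketch of a predicate-pushdown rule over a toy plan tree. None of this is Calcite's actual API (Calcite is Java, and its rule system is far richer); it's just the shape of the transformation he's describing, for the example query above.

```python
# Toy logical plan as nested tuples: (operator, *children/args).
# Plan for: SELECT x, x + y FROM a JOIN b ... WHERE a.x < 5
plan = ("project", ["x", "x + y"],
          ("filter", "a.x < 5",
             ("join", ("scan", "a"), ("scan", "b"))))

def push_filter_below_join(node):
    """If a filter sits on top of a join but its predicate only
    references one side, move it below the join onto that side."""
    if node[0] == "filter" and node[2][0] == "join":
        pred, (_, left, right) = node[1], node[2]
        side = pred.split(".")[0]          # crude: 'a.x < 5' -> table 'a'
        if side == left[1]:                # predicate only touches 'a'
            return ("join", ("filter", pred, left), right)
    return node

def rewrite(node):
    # Apply the rule top-down over the whole tree.
    if isinstance(node, tuple):
        node = push_filter_below_join(node)
        return tuple(rewrite(c) if isinstance(c, tuple) else c for c in node)
    return node

print(rewrite(plan))
# ('project', ['x', 'x + y'],
#    ('join', ('filter', 'a.x < 5', ('scan', 'a')), ('scan', 'b')))
```

After the rewrite, the scan of table `a` only ever produces rows with `x < 5`, so the join touches far less data — the same win Calcite's real pushdown rules buy you.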
Going further, you could have a special kind of scan that actually knows how to do predicate pushdown against metadata. You can make new operators that do much more exciting things and build them into Calcite. You can write your own rules: you can make it so your algorithm replaces parts of a subtree with what you consider equivalent but more efficient interpretations.

Any questions about Calcite — why we picked it, why it might be a bad idea? We picked Calcite because it was open source and free; that was one of the big impetuses. Someone points out that so is Orca. Honestly, I had never heard of Orca. So why Calcite? That's a good question, and I'm actually thinking about it. We picked it in part because it was the only one I knew about, and it was very fast to get up and started. You can get the parser, planner, and validator — going from a query string to a relational algebra plan — working in hours, and with not too much effort, in a few weeks you can make that persistent, build your own rules, and do some nice optimizations on top. The barrier to entry was just low. At that point we weren't thinking, "oh, it's in Java, and we normally don't like Java" — not that we dislike the language; we're just usually writing in C, because we have very tight loops in a columnar database. But Calcite was easy to use, didn't take much time to get running, and let us forget about all this nonsense, which was a very big problem for us the first time around, when I used to say you could never make a database unless you also built the parser and planner. That turned out to be only partly true.

In all honesty, we haven't stressed Calcite much, but because of the way it works — it's only handling very small amounts of information, and there are ways to scale that up — I'm never worried about Calcite being the bottleneck in the workflows I'm looking at.

In the end, we're building our own physical planner that takes Calcite's logical plan and implements it, as opposed to using Calcite for the physical plan. The reason is that we're trying to interact with some columnar primitives I'm not going to talk much about right now, just because it's still very early days; but we had not planned on using Calcite for the physical plan. Any questions about that? Because it does sound like a crazy, possibly bad idea. Would it be smarter to build that representation inside Calcite — should we teach it to build directed-graph logical and physical plans when it doesn't really have that capacity? I don't know. Someone brings up nested queries — whether you can express those purely in a logical plan. Of course you can express a nested query in a logical plan.
At least Calcite does it, but keep thinking about it.

Someone asks: when you say you're not using a physical plan, what do you mean? When you translate the logical plan into physical plans, the plan you choose affects cost. So — you get an optimized logical plan like this, right? Great question: you don't estimate cost when you've only been working on it for nine and a half months, like we have. This new version does not do that. It is not a cost-based optimizer; it's rule-based. But honestly, we're satisfied with that right now, because on the CPU this matters so much to you, whereas I have on the order of 700 gigabytes per second of bandwidth to my memory. I'm a little less worried about making the most efficient physical plan, and a lot more concerned about my I/O — about falling out of cache, having to jump to system memory, having to go to disk. In a GPU context, the physical plan is normally not where you're being cut off at the knees, if that makes sense. It's not what's slowing you down. Questions about that? It's a bold statement; I'm expecting some incredulity.

What's the largest join query we've run? In BlazingSQL — through Calcite — I want to say five tables. So, the kinds of joins and workloads we're doing for machine learning aren't the enormous star-schema kind, where you take a million tables, or you have a flag over here that maps to dimension tables — that's the word I was looking for. We're not doing those kinds of workloads yet; it hasn't been our focus. I'm sure as this gets used more, someone's going to try something like that, tell us it sucks and it's super slow, and we're going to have to make it better. As I've been saying, we've been working in the database space for about four and a half years, and three and a half of those are Blazing. This part is very new for us, and the vast majority of the effort that has gone into BlazingSQL is work on RAPIDS AI — on this open-source columnar primitive library that we've built this on top of. Any other questions about parsing and planning, where we're not the best anyway?

So then we get to how a worker node works. We've got the Apache Arrow GPU memory representation — the same representation that, like I said, all the machine learning and all the RAPIDS AI ecosystem projects are using. So you can shuttle this information around during execution, after execution; you can receive inputs and run queries on information output from a machine learning library, because we're all speaking the exact same language. And a very important thing: whenever we're on the same node, we use zero-copy IPC. There's no copy. I don't copy information to the machine learning library that wants to consume it; it consumes the data directly. You basically give the consuming process access to that region of memory, so it can write to it and read from it.
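Here's a minimal sketch of that zero-copy handoff using Numba's CUDA IPC support, which the PyGDF-era stack relied on. Hedged: this assumes Linux and a reasonably recent Numba; the `get_ipc_handle` / `open` / `close` calls are Numba's documented IPC API, and the point is that the consumer maps the producer's GPU allocation rather than copying it.

```python
import multiprocessing as mp
import numpy as np
from numba import cuda

def consumer(ipc_handle):
    # Map the producer's GPU allocation into this process: no copy is
    # made; both processes now address the same device memory.
    darr = ipc_handle.open()
    print("consumer sees:", darr.copy_to_host()[:4])
    ipc_handle.close()   # unmap here; the producer still owns (and frees) it

if __name__ == "__main__":
    mp.set_start_method("spawn")   # fresh CUDA context in the child
    darr = cuda.to_device(np.arange(1024, dtype=np.float32))
    handle = darr.get_ipc_handle() # picklable CUDA IPC handle
    p = mp.Process(target=consumer, args=(handle,))
    p.start()
    p.join()
```

This is exactly the "you can read and write it, but you can't free it" arrangement he describes next: ownership stays with the exporting process.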
You can't free it, obviously, but it greatly reduces the need to move information around. Do you all do a lot of IPC — is that something that gets you going? Not really? What do you do, copy it all around, move it around, double it on disk? Okay, fair — it's not the biggest concern for you. Makes sense.

So the worker has a relational algebra interpreter in it. It takes the logical plan, interprets it, and basically converts it to libgdf primitives. If you're doing something like a filter — well, actually, let's do an aggregation. If you've got multiple columns you're aggregating on, and you want to hash that information to put rows into buckets quickly, cuDF is the code that generates the hashes. There's a sort-based implementation and a hash-based one, and if you're using the hash-based one, cuDF is what builds the map for you and what knows how to probe it. These are all columnar operations that happen inside cuDF, not in the RAL — the relational algebra interpreter. And the really nice thing is that these primitives are shared by all the members of the ecosystem, so we're all working to make them faster. That's the basic anatomy: the only thing the RAL does is take the plan that came out of the planner and convert it to libgdf primitives, plus do the mathematical expression parsing that converts expressions into binary operators.

Yes — the binary operators are all code-gen; they're just-in-time compiled. Someone asks whether that's straight to PTX assembly. No — and that's an interesting question, because there are actually four versions of this. The one originally built basically templated every operation, which gives you an enormous binary — you've got 14 different data types, it grows with the number of operations, and it takes two hours to compile. The next iteration used this thing called Jitify, which takes C++, compiles it on the fly, and gives you a CUDA kernel — machine instructions you can run just like any other kernel. The problem with that one is that it takes around a hundred milliseconds to compile one of these the first time; once it's in the cache it's okay, but that first compile is slow. Then there's one being worked on as a joint collaboration between NVIDIA and us that does the same thing but based on PTX assembly. And then there's my favorite, the one I'm trying to push, and the one that will be very controversial: building an interpreter — a kernel whose purpose is to limit the amount of I/O.

My biggest interest here is that when you're reading from global memory, you're limited to about 660 gigabytes per second, which is obviously slower than what you get close to each core. Do you know about coalesced memory access? Do I need to go over that at all? Anyone? Might as well just do it.
So basically, when you're interacting with the GPU, whenever you make a request to global memory, a small request has the exact same overhead as a much larger request. That's a concept you're already very aware of. But another very important part of the request is alignment. If I have, say, a GPU with 10 threads — which doesn't exist, but pretend — laying out the information so that the distance between what thread one reads and what thread two reads is the same as the distance between what thread two reads and what thread three reads gives you orders of magnitude better performance for getting information onto and off of your GPU.

Because that, to a person like me, is the most important thing to manage, the last of those four approaches to binary operators — right now we're using the JIT one — could be implemented like this. Say you've got columns and you're trying to compute (a + b) / y. You can either compute a + b, materialize it, and then divide by y, which forces you to dump to global memory — you're writing it and reading it twice. Or you can set up a situation where you read all of these columns in a coalesced fashion into a buffer: read the two inputs and the operand you're dividing by into local memory once, never materialize a + b, interpret the expression and do all those transformations locally, and the only thing you dump back out to global memory is the final output of the interpretation.

You have a question? Yes — it's something a compiler could probably do, but it's not something the NVIDIA compiler does right now. Also, on the GPU it's very different — obviously, if they could do it, they would be doing it, but the synchronization mechanisms you have there are a little more complicated. You've got 4,000 threads that may or may not be stepping on each other, and the compiler doesn't necessarily know what you're going to do. Someone asks whether, to avoid stepping on each other, the threads just don't share. That's not exactly how a GPU works. Each thread has its own registers, but each thread can only run one instruction at a time, and — you know how an SM works — a bunch of cores on the same little mini-processor all have to execute the exact same instruction. It's not like some of them can manage "oh, I'm not stepping on you." For us, there's no issue with something like (a + b) / y, because there's no shared memory we're synchronizing; nothing is happening across warps.
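Here's the fusion idea as a minimal Numba CUDA sketch — Numba being the JIT layer PyGDF already used, though the kernel below is my illustration, not BlazingSQL's code. Each thread i reads element i of each input, so consecutive threads hit consecutive addresses (coalesced), and the intermediate a + b lives only in registers, never in global memory.

```python
import numpy as np
from numba import cuda

@cuda.jit
def fused_expr(a, b, y, out):
    i = cuda.grid(1)
    if i < out.size:
        # (a[i] + b[i]) stays in a register; only the final result
        # is written back out to global memory.
        out[i] = (a[i] + b[i]) / y[i]

n = 1 << 20
a = cuda.to_device(np.random.rand(n))
b = cuda.to_device(np.random.rand(n))
y = cuda.to_device(np.random.rand(n) + 1.0)  # avoid divide-by-zero
out = cuda.device_array(n)

threads = 256
blocks = (n + threads - 1) // threads
fused_expr[blocks, threads](a, b, y, out)
```

An unfused version needs two kernels, or one materialized temporary array, roughly doubling the global-memory traffic — exactly the cost he's describing.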
It's just The compiler doesn't know that and I am not smart enough to tell the compiler how to know that All right, so the Very last thing that I'm gonna mention before I'm hoping to get some more questions is So this I mentioned you all before we interact with the reds of the rapid deco system through the QDF and all the movement of data between this and that happens via zero IPC and an interesting aspect of this is that when you run a query on the engine the the relational algebra interpreter is not Going to return there's a response over to here and back to the user And and the reason for that is that every worker is going to keep a subset of this Result set right so if you've got four nodes in your cluster Each one of your nodes is going to have a small part of that result set in it And the idea behind this is that we can then use a tool like we're using desk You guys ever heard of desk? No desk is basically a task scheduler for Python and it's integrated well with the rapid AI Ecosystem so that you can do things like schedule machine learning jobs XG boost and And because all these processes are also running in this distributed context and desk is aware of How a distributed GDF works so our representation because it's part of the same ecosystem is the same as what desk uses You can run queries and have those outputs go directly into your machine learning libraries Without having to do any any copy, right? That's the zero copy IPC and That that in a gist is what with the blazing sequel is Right now and in its first iteration Except for I owe which I kind of glossed over but the like before I start going into this Does any of this make sense? Where's he like like does it suck? You can so the way that it works is that when you run a query Yeah, the the row has a result the this node has a result set repository, right? It has a place where results sets are stored and they have a token associated with them Whoever wants to retrieve that token can and they can retrieve it right now over zero copy IPC and TCP and we're working on UCX, but that's a little bit further down the line But the idea is that you you want to make sure that the person consuming your results set is not necessarily the originator of The query it can be somebody else and this was important for us so that we could do this right so that These people can request those results sets when they want them and expect them Very Hadoop-esque. Yes. Yes. What what are some alternatives that we could do? That's a standard way. Yeah that so I guess I don't see any downside to To doing it this way because you can replicate JDBC like functionality by just get my result. I'm done Yes, okay, and one of the one of the big ideas here is that We're not gonna be doing a lot of that coalescing. 
We're not expecting to; we're not looking for those kinds of workflows at the moment. In fact, if somebody else wanted to do that, that would be great — if you wanted to take a really big distributed result set and merge it in system memory somewhere, that's something we might consider later down the road. But right now, our idea is to get as fast as possible into these machine learning libraries, because people have spent a lot of money and time on them, and everyone kind of ignored the less sexy part of machine learning, which is: oh no, you need data to do machine learning.

Questions? Yes — that's what I'm about to cover. That first path is one way of loading data into a GDF: the user does it in Python. Another way is an API that receives file handles. A file handle isn't just a file path; it can tell you "I'm on S3," "I'm on HDFS over here." We have our I/O library, into which you can register file systems, so you can make an HDFS (Hadoop) file system available to your whole cluster, or an S3 bucket, or Azure file storage. Then the user, instead of submitting the actual data, submits file handles. These go to the orchestrator, and the orchestrator — which has capacities for predicate pushdown and data skipping — divides up the problem: it partitions the files, the row groups, the columns, and then each relational algebra interpreter uses the Blazing I/O library to pull information from that source. Each worker ends up doing its own I/O, and that's self-explanatory — why would you want anything else?

The first use case I showed you, connecting through PyGDF, came from a talk we gave at a conference that was a lot more marketing-type speak: "look how fast it is to connect to PyGDF." You can do that, but it's obviously not the most intelligent way — definitely not as smart as using something that can connect to your data lake and interact with Parquet, ORC, and CSV, the file formats handled right now. Hopefully more are coming — you can also just use Arrow itself, of course, and someone is working on Avro; I think that's Quansight.

So that makes sense: do all the I/O at the workers, interacting with the data lake. And I guess the last piece — something that's been pretty exciting for us, because we've been working on this problem for a while — is a distributed cache. The data lake, while enormous and great in that you can put everything you want into it, is still slow. If every time you launch a big query you have to hit that data lake, you're going to run into a lot of problems.
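Putting the pieces of that flow together — hedged, since the talk shows no exact API and every name below is hypothetical (loosely modeled on how later public BlazingSQL releases looked): register a file system once so the whole cluster can reach it, create a table from file handles rather than data, and let each worker pull its own partitions.

```python
# Hypothetical client-side flow; none of these names are a real API.
from blazing_client import BlazingCluster   # assumed connector module

bc = BlazingCluster("orchestrator-host:8889")

# Register a file system so every worker in the cluster can reach it.
bc.register_hdfs("warehouse", host="namenode", port=8020, user="etl")

# The user submits file handles, not data. The orchestrator partitions
# files / row groups / columns across workers, and each worker's
# relational algebra interpreter does its own I/O against the source.
bc.create_table("mortgages", "hdfs://warehouse/mortgages/*.parquet")

token = bc.sql("SELECT loan_id, balance FROM mortgages WHERE balance > 0")
result = bc.retrieve(token)   # each worker holds its partition until claimed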
So what we actually do is this: the orchestrator is aware of what I/O has been done in all these different places, so workers can pull from each other (a small sketch follows below). If you know, for example, that worker 3 recently pulled a certain piece of information from the data lake, worker 2 can get that information directly from worker 3, and it's up to the orchestrator to maintain those mappings. And obviously, if worker 2 goes to worker 3 expecting it from cache and worker 3 says "no, this isn't in my cache anymore," it just falls back to the data lake. Does that part make sense? All right.

Questions? So this, in a nutshell, is the thing we've been building. A lot of these pieces are things we've been building for a long time, but this relational algebra interpreter, the parser and planner, and getting involved with RAPIDS AI is something we've been doing for a little over nine months, maybe. Down here, this is just the idea that you've got multiple workers. These are nodes; this is a node; this is the user — and the user could be over here too. This is a coordinating node, these are worker nodes, and that label is a misnomer, because these actually run in here. It depends on what you're trying to do.

The mortgage pipeline, in the end, produced a model that can tell you how likely someone is to default on their mortgage in the state of California. And that performance, for me, was the 4,000x. Okay — someone objects that Spark doesn't count as the baseline for RAPIDS, because RAPIDS is not our code. Right: RAPIDS is not our closed-source code; RAPIDS is the open-source project we contribute to. We didn't write the DNN library, but we did write the cuDF library work that's in RAPIDS. For the part that is BlazingSQL, what performance benefit comes from the GPU? So in this graphic: this right here is data loading, and this is data conversion — this is ETL. The vast majority of the time is the ETL here. And — yes, very good, I like that, I'm going to say that from now on: we are an on-GPU SQL ETL engine.
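Returning to the distributed cache described above, here is a small sketch of the cache-or-fallback logic — hypothetical structure, not engine code. The orchestrator keeps a mapping from data chunk to the worker that last pulled it, and a miss at the peer falls through to the data lake.

```python
class DataLakeCache:
    """Hypothetical orchestrator-side bookkeeping for the distributed cache."""

    def __init__(self, data_lake):
        self.data_lake = data_lake   # object exposing .read(chunk_id)
        self.location = {}           # chunk_id -> worker that cached it
        self.worker_cache = {}       # (worker, chunk_id) -> bytes

    def record_io(self, worker, chunk_id, data):
        # Called whenever a worker pulls a chunk from the data lake.
        self.location[chunk_id] = worker
        self.worker_cache[(worker, chunk_id)] = data

    def fetch(self, requester, chunk_id):
        peer = self.location.get(chunk_id)
        if peer is not None:
            cached = self.worker_cache.get((peer, chunk_id))
            if cached is not None:   # peer still has it: worker-to-worker
                return cached
        # Cache miss (or evicted at the peer): fall back to the data lake.
        data = self.data_lake.read(chunk_id)
        self.record_io(requester, chunk_id, data)
        return data
```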
That is a much more succinct way of putting it than how I say it. That's really all we're trying to do: ETL information from files into machine learning. That makes it sound a little less exciting, but there are a lot of fun things going on there. This step right here was actually converting to the format XGBoost was expecting. So — what is Parquet's input format? Yep: binary, compressed columns, split up into a bunch of small chunks. XGBoost now actually takes Arrow on GPU, but I wasn't as involved in that part; I'm pretty sure it ultimately takes buffers — buffers of int16s and things like that. You pass XGBoost buffers.

And the follow-up is a good question — I'm not certain — but whatever those buffers look like, if you build them on the CPU you still have to move the data from the CPU to the GPU and manage that process, manage that I/O. That's significant. Moving data to the GPU can be not-so-expensive if you use pinned memory, but if you start using too much pinned memory, the rest of your system degrades. There's always that transfer-to-the-GPU bottleneck when you do the work on the CPU. Whereas in our pipeline, the only thing that comes into the GPU is the compressed representation; it gets decompressed on the GPU, you do all your transformations — the ETL, the SELECT, the WHERE, whatever — and then it goes straight into training.

You made a good point, though, so let me clarify: this is a marketing slide. Its intention is definitely to show "holy shit, it went from this to that." I will endeavor to answer that question more appropriately, so that when someone asks what it's really doing, I can tell them. Right now I don't have a good answer for you; sorry.

The broader point: you have to put the data into the form the GPU deep-net libraries want anyway, so rather than doing the transformation on the CPU, just do it on the GPU. You're saving PCIe bandwidth, and you're also able to do it faster. Do you have 700-plus gigabytes per second of bandwidth to your system memory? Are you decompressing at system-memory speed? No. We have decompressed at 200 gigabytes per second on the GPU, and I don't know very many CPU solutions that can do that. Sorting on the GPU — something as simple as sorting — is orders of magnitude faster. It's a couple of things: it's of course just the number of cores available to you, and it's the memory bandwidth. And when you try to do this on 20 CPU nodes, you've got a lot more coordination going on between the different nodes, whereas this is a single computer. You've got interconnects between your GPUs, so they don't have to go over PCIe; they can speak to each other directly using NVLink — that number is always changing, but I think it's around 160-180 gigabytes per second right now. And not only that: you can write algorithms where the GPUs access each other's memory.
You can treat the entire box as one memory space if you want to. We don't do that right now, because that feels lazy, but these are monster machines in terms of memory bandwidth. Of course, 20 CPU nodes take up a lot more room, too — and money: that cluster is a quarter of a million dollars, might be more, while a DGX-1 is probably around a hundred grand, maybe a little less now. For 40-50 grand you can get something with three or four GPUs in it; the big GPUs are like 10,000-11,000 bucks each.

Questions? What is our join algorithm? There are two: there's a sort-merge join and there's a hash join, both on the GPU. Nikolai — a guy at NVIDIA — built the actual probing and building of the hash table. The basic idea is that you do it in a very stupid way and synchronize afterward. Say you've got a thousand threads divided into groups of a hundred, and each group shares a shared-memory staging area. Shared memory is orders of magnitude faster than global memory, and it's a lot cheaper to synchronize. So you write into shared memory until it reaches the point you consider full, and then the first thread in each group of a hundred pushes it out to global memory. You build many hash tables and then merge them — that's the idea — and the coalescing is done on the GPU; all of it is on the GPU. (A CPU-side analogy of this appears below.)

Is the hash join faster? In most use cases, yes. It was a lot harder to figure out before; now you have more options for managing that synchronization. One thing they did, because they didn't know at the outset how big the hash map would grow, is allocate it in UVM rather than regular device memory. This is a point of contention among us: I love UVM, but I think there are specific places for it, and the building of a high-performance hash table may or may not be one of them. But that's how it's done now. Do you know what UVM is? Unified virtual memory basically means: I allocate some memory, and whether it's on the GPU or the CPU, you don't know and you don't care — if you address it, it works. It's coherency between system memory and the GPU, so you can access the data from both sides without things going "oh my god, I'm on the CPU and you're asking me for a GPU pointer," or vice versa.

I've said a lot of things that should be getting a "no way!" out of you. I feel like I would have gotten a lot more no-ways a year ago. Are we all just on the GPU bandwagon now? They rock; you want to use them. I'm going to play around.
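Here's the staged build-then-merge idea from the join discussion above, as a CPU-side Python analogy — emphatically not the CUDA kernel he describes (no shared memory, no warp synchronization), just the same shape: each "thread group" fills a small local table, the partials get flushed, and everything is merged before probing.

```python
def build_staged(keys, rows, num_groups=10):
    # Each group builds its own small table (the shared-memory analogy),
    # then the partials are merged into one global table (the flush).
    partials = [{} for _ in range(num_groups)]
    for i, (k, row) in enumerate(zip(keys, rows)):
        partials[i % num_groups].setdefault(k, []).append(row)
    table = {}
    for p in partials:
        for k, rs in p.items():
            table.setdefault(k, []).extend(rs)
    return table

def probe(table, keys, rows):
    # Stream the other relation through the merged hash table.
    for k, row in zip(keys, rows):
        for match in table.get(k, ()):
            yield (row, match)

build = build_staged([1, 2, 1, 3], ["a1", "b1", "a2", "c1"])
print(list(probe(build, [1, 3], ["L1", "L2"])))
# [('L1', 'a1'), ('L1', 'a2'), ('L2', 'c1')]
```

On the GPU, the payoff of the staging step is that the flush to global memory happens in big, coalesced writes instead of one scattered write per row.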
So, what's the biggest challenge we face now — business or technical? On the business side: this RAPIDS ecosystem was NVIDIA and a bunch of relatively small organizations, and now Oracle, IBM, and a couple of other fancy names have all agreed to come into the ecosystem, and they say they're going to pour tons of engineers and code into it. That's a little scary for me, especially because we've all been working on this with a high degree of control — I know everyone working on this project right now, and that's about to become a lot more people, and we are not an enormous entity. So the business worry is losing the edge we currently have because we built it: we're really aware of all the new stuff and how it's working, and our engine comes out a week or two behind RAPIDS, as opposed to months behind. That is a real business worry for us. We're 15 people — sorry, 14, and a part-time very smart guy — ten of them in Peru. And RAPIDS is about 60 engineers: roughly 35 from NVIDIA, ten on our side, another ten or fifteen from Anaconda, and a few from Gunrock and Quansight. They're saying they're going to bring in something like another hundred developers; who knows. When I see it, I will believe it, and then I will freak out. As of right now, I'm pretty chill.

Technically, my biggest concern is a risk in a decision we made. You try to build these query trees that you can distribute, and then there's some kind of shuffling process at the end of query execution that makes sure you didn't miss anything. In a case like aggregations, it's really easy: "okay, this is what I aggregated on and this is my value," and you just have to put the partials together and merge them (sketched below). But it gets more complicated with things like left outer joins — did this row ever actually get compared? — and you have to manage that. In the old version of the engine, we did all of that work at the end: we'd divide up the data set, run the query, and then we knew how to bring everything back together into the full result set. In this version, we've decided that every step along the way is distributed and resolved as it goes. So we are banking on NICs getting better than they are today — literally getting better: more throughput between nodes, and being able to use GPUDirect so we can bypass the kernel whenever we're sending information from the GPU to the network card to the other network card. Technically, I think that's one of our biggest risks: are we going to get burned? Until we see it, we've got no clue.
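The aggregation case he calls "really easy" looks roughly like this — a minimal sketch, not engine code. Each worker produces partial (count, sum) pairs per group key, and merging the partials at the end gives the exact global answer (here, for an AVG ... GROUP BY), which is why aggregations distribute so cleanly.

```python
from collections import defaultdict

def partial_aggregate(rows):
    # Per-worker partial state for AVG(val) GROUP BY key: key -> [count, sum].
    acc = defaultdict(lambda: [0, 0.0])
    for key, val in rows:
        acc[key][0] += 1
        acc[key][1] += val
    return dict(acc)

def merge_partials(partials):
    # Counts and sums combine associatively, so workers never need to
    # see each other's raw rows -- only these tiny partial states move.
    merged = defaultdict(lambda: [0, 0.0])
    for part in partials:
        for key, (cnt, sm) in part.items():
            merged[key][0] += cnt
            merged[key][1] += sm
    return {key: sm / cnt for key, (cnt, sm) in merged.items()}

worker1 = partial_aggregate([("a", 1.0), ("b", 4.0), ("a", 3.0)])
worker2 = partial_aggregate([("a", 2.0), ("b", 6.0)])
print(merge_partials([worker1, worker2]))   # {'a': 2.0, 'b': 5.0}
```

A left outer join has no such per-key, associative merge — you can't know a row matched nothing until every partition has been checked — which is exactly why he flags it as the hard case.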
We might always be moving too much data, or too little — and that's the real risk: when people start using this for real, am I going to be network-bound, having done all this for nothing, and then we have to go rework it? There are two things here. There's the OpenUCX project, which is being worked on — putting UCX into RAPIDS AI. And right now we're using FlatBuffers over TCP (or RoCE, or InfiniBand) for the non-GPU communication. When you IPC a column over to somebody, there's the GPU data, sure, but you also have the size, the type, and the other metadata — information that lives on the CPU but is relevant to the column — and all of that goes over with FlatBuffers right now.

Sorry, I was just clapping for myself; that felt awkward. Next week we have Richard Heyns from Brytlyt, and then around November 29th we have the Swarm64 guys coming — they do FPGAs. All right, guys.