Brightlight, here to talk about their GPU-accelerated database built on Postgres. He sort of has an engineering background and didn't really give me the full details of his bio, so I don't really know what else to say other than he was a rugby player, he has five kids, he lives in London, but he's not British. He's South African, so don't be confused by his posh accent. It's not British, okay? Go for it, thank you.

Thanks Andy. Right, hi everyone. Thanks for having me. What I'm going to be talking about today is Brightlight, obviously, and our GPU-accelerated database, and how we're using it to accelerate analytic workloads and deliver speed-of-thought analytics. A little bit about me. Thanks Andy for the intro. So, I'm the founder and CEO of Brightlight. I started the company around about four years ago. Originally my background is in engineering. That's a picture of UCT, so that's where I went to university in Cape Town, lovely university, and I had a bursary from a mining company, so straight after university I went and worked on the mines, which is this picture over here. Literally in the mines? Literally, I was an engineer on the mines, going underground, and every day I had to go past the sign which said, congratulations on getting through eight million shifts without killing anybody. So I thought that was maybe not the best place to spend my working career, and after spending two years on that I got into business intelligence, databases, and analytics, and that's pretty much been the story since then.

I always think it's interesting with these kinds of talks to understand what the genesis was, what the starting point was. In 2008 I was sort of between roles. I'm a very inquisitive person, and engineering was a great course for me, and I came across an article about how some hackers had worked out how to use GPUs to crack passwords, brute-force cracking, and I just thought that was really, really interesting. GPUs were just starting to emerge as something you could use for these sorts of non-graphics, generic data-processing workloads, and I was fascinated that you could use a GPU for something else. At the time, my first son had been born and we didn't have a lot of financial resources available to us, so I had to go to eBay to get a GeForce 8800. Got it, worked out how to use it. Fascinating, but cracking passwords and so on wasn't really a use case I was interested in. So the GPU sort of gathered dust for a couple of years, and then I joined a company called dunnhumby, and they were doing a lot of things with data analytics. They were responsible for the loyalty program of Tesco, which is the biggest retailer in the UK, and it's got a global footprint. They were making a lot of money from understanding what customers were doing through all the data they were collecting, but the tools behind that were really struggling to keep up with the kind of data loads, and that was the epiphany. That was where I thought, you know what, GPUs could be a solution to really accelerate the kind of queries that companies like dunnhumby were using to understand what their customers were doing. And that was the start of the journey. I started researching how one could run predominantly SQL operations on GPU. Sorting was pretty well understood, as were aggregations and filtering, but what was really tricky at the time was working out how to do joins in parallel, and GPUs are, of course, parallel devices.
And so any algorithm that you want to use needs to be parallelizable. Embarrassingly parallelizable is the term that people like to use. I spent about 18 months researching the space, trying to work out how to do joins efficiently on GPU. Found an algorithm. It's called Recursive Interaction Probability. It allows us to do these joins on GPU very efficiently. And I still remember the evening sitting with my wife, talking about what I had discovered and the use cases and the opportunities. At the time, the kid count had gone up to four, so there were some real challenges that we needed to think about. You can't just embark on these kinds of journeys when you've got a lot of other people depending on you. Today, the kid count is at five, like Andy was telling you. The journey has been fantastic. And what we have today is a GPU-accelerated database that's allowing us to really accelerate SQL operations. So just to run through some of those differentiators: being based on Postgres, having our patent-pending IP, and tapping into our fourth-generation GPU manager, which allows us to bring SQL workloads and AI workloads together. That's fundamental to how we've implemented our technology and to some of the technology decisions we've made. As I'm talking, obviously, if there are any questions you would like to ask, please jump in and I'll stop and answer them. So please feel free to do that.

What's it all about, then? Some of this is the marketing aspect of stuff, but it's really delivering time to value for analysts. Very interesting recent survey: 32% of analysts are still having to deal with slow query speeds. 64% of analysts' time is still spent using SQL-type tools to prepare data. And only three days a month are spent doing some of the clever stuff with machine learning and AI. And amazingly, 37% of insights take more than a week to generate. So you might come in with an idea on Monday morning and spend the entire week trying to find the answer to it. You go home on a Friday afternoon and you still haven't found an answer to that question. And a big element of that is actually being able to get the answers quickly from the data: to formulate those questions, hit the database, and get those answers out. Time to value. Google is a great example of what time to value means. I often speak to people and they're like, well, you know what, I can run a query in five minutes. That's fine, but is it really? Because with Google, there are two things that really generate value. One is that the answers are correct. The second is that time to value is really important. You get those answers immediately. You get your answer and start thinking about the next question immediately. If Google took just 30 seconds to answer a question, that would be really unacceptable; you just wouldn't be using Google, and it wouldn't be able to deliver that value to you. So that's why I think time to value is really important. What we want to be doing is giving analysts that Google experience, where they can just hit the database, get the answers to their questions immediately, and really think about the problem at hand instead of sitting around eating pizza waiting for their queries to come back.

So that's where GPUs come in, and it's the fantastic capabilities of GPUs. There are two characteristics. One is obviously the compute. Massively better than CPU-type architectures, and that gap is actually growing.
Year on year, GPUs are going to continue extending this advantage. And the other thing that's really important to databases is being able to feed that compute. So memory IO, getting data onto those GPUs, is very, very fast. A GPU today can transfer, or process, data at around about a terabyte a second, and if you look at CPU RAM, it's about 100 gig a second. You can get 8 GPUs into a machine; in fact, in Nvidia's latest machine, the DGX-2, you can put 16 GPUs in there. So you can actually process data at 16 terabytes a second, which is awesome. The big thing about databases is that the actual compute is not really a big factor. People might tell you otherwise, but invariably what you're doing is just comparing two values for the purposes of sorting or joining. So actually the real benefit from GPUs is very much the ability to deliver data very quickly out of memory.

Last of the marketing slides: how fast is ultra-fast? We've done benchmarks on a billion-row dataset. The fastest of those queries have come back in 5 milliseconds. The slowest of those queries have come back in 150 milliseconds. If you think about Usain Bolt, you know, 150 milliseconds, 5 milliseconds, what does that mean for us in real time? Usain Bolt, in the starting blocks, the gun goes off. It takes him about 155 milliseconds just to start moving, just to start twitching a muscle. If you moved quicker than that, you'd be disqualified, because human perception is in the realm of 120 milliseconds. So Brightlight, on billion-row datasets, is absolutely delivering, as far as we're concerned, real-time analytics. That was on five Minsky machines, five Power8 machines with four GPUs each. The interesting thing about GPUs, which I think is a good point to touch on there, is that you scale your hardware with the data. If you have more data, you merely have to bring on more GPU power. So the response times are actually linear: we would get those 5 to 150 millisecond response times regardless of whether it's a billion-row dataset or even bigger; you'd be scaling the hardware out as the dataset increased.

I'm assuming you're going to get to the architecture, but what about coalescing across all the machines, taking the results from the different GPUs and putting them together? The overheads, yeah, the overheads of managing everything are actually pretty small, particularly in analytic databases. For transactional databases, the overheads of managing the transaction and getting the data to where it needs to be are quite high, and obviously that would be a factor. But when you're looking at an analytic database, the overhead of managing all the hardware and bringing everything together is actually quite small, and definitely less than five milliseconds. So we can do everything, bring it all together, and deliver the result in five milliseconds.
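As a sanity check on those figures, here is a back-of-the-envelope sketch (not from the talk; it assumes a single 8-byte column and the bandwidth numbers quoted above):

```python
# Rough scan times for one 8-byte column of a billion-row table at the
# bandwidths quoted above. Assumes a purely bandwidth-bound scan.
ROWS = 1_000_000_000
BYTES_PER_VALUE = 8                       # assume a 64-bit column
column_bytes = ROWS * BYTES_PER_VALUE     # 8 GB of raw column data

bandwidths = {
    'CPU RAM (~100 GB/s)':           100e9,
    'one GPU (~1 TB/s)':             1e12,
    '16 GPUs in a DGX-2 (~16 TB/s)': 16e12,
}
for name, bw in bandwidths.items():
    print(f'{name}: {column_bytes / bw * 1e3:.1f} ms per full scan')
# Prints roughly 80 ms, 8 ms, and 0.5 ms: the same ballpark as the
# 5 to 150 ms query times quoted above.
```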
One of the things that allows us to do this is our patent-pending IP, Recursive Interaction Probability. And this really touches on the GPUs, right? The ability to get this great performance by doing stuff in parallel is what makes a GPU great. But on the other side of that coin, there's a real technical challenge, because you have to do everything in parallel to actually realize the benefits. And so what Recursive Interaction Probability allows us to do, particularly for joins, is tap into that parallel nature of the GPU, fully realize the potential of the GPUs, while still being able to do the SQL-type operations, particularly joins.

So just a little bit about Recursive Interaction Probability. Parallelizable, embarrassingly so. Very efficient when you look at the big O notation for joins: when you compare N log N to something as simple as binary search, which is log N, this kind of algorithm is very, very efficient. And just a simple description of what Recursive Interaction Probability is. Just to understand the audience: could everybody who knows what a join is stick up their hands? Okay, everybody, right, okay, excellent. So what you're doing in a join is this: you've got your two relations and you've got your two columns, and you're basically wanting to find matches between those two columns, okay? If you think about those two columns, let's say they are just numbered one to 12 each, and you want to join them, one way is a nested for loop, where you compare every element in the one column to every element in the other. So, 144 comparisons. But imagine if, because it's sorted, you chop each of those columns into four sub-partitions, look at the boundaries of each of those partitions, and compare those boundaries. The top sub-partition goes from one to four, so its boundary elements are one and four, and the ending partition goes from nine to 12. There's no way that you're going to get a join interaction between those two, right? So you can discard them. And so by looking at those sub-partitions, you know that the first two sub-partitions have got a non-zero likelihood of interacting. Your first pair of partitions will satisfy the criteria, and so will the second pair, the third pair, and the fourth pair. So straight away, if you then did the base case of four by four, you've got 16 comparisons per partition pair and four partition pairs to evaluate, so 64 comparisons. Now 144 down to 64 is not a great improvement, but if you think about a billion-row dataset, and if you think about doing this operation recursively, you start to discard huge numbers of comparisons that you'd never have needed to do anyway. Does that make sense? Do I need to do it on the board?

That's the Grace hash join from the 1990s. Is it? Yeah, you put things in buckets and then you look at everything in the buckets that have the same hash key. But you're not hashing, you're pre-sorting. How are you making sure things match up? So what we're doing is: you do it recursively, and you're looking for a non-zero probability that you'll get an interaction, and you'll know whether there's an interaction by looking at the boundary elements. So yeah, I think you're missing that everything's pre-sorted. And then you're basically doing divide and conquer: you're breaking things up into chunks, and then you know the upper bound and the lower bound, and then on the outer table you do the same. So you're recursively splitting them up into chunks, with an upper bound and a lower bound, and then you do the same splitting on the other side. You'd split both sides into four, say? Yeah. And compare those and look at which ones of those will interact.
And then you apply the same algorithm to the ones that have just formed an interaction. Split those into four. Find the ones of those that will be interacting. Split those into four, find those interacting. And you go all the way down from a billion-row dataset to a base case of, say, four. Or whatever, right?

Everything is pre-sorted? Absolutely. It does have to be pre-sorted, yeah. So there's a cost of sorting, but with indexing you can get around that. If you index those two columns, you can get around that. GPUs are very good at sorting as well. And the alternative is a hash join, and there are some limitations with hash joins: you need to have a big enough hash space for one of the relations. So I was talking to Andy about this, and basically we're just sorting a permutation of the data. So we're using pointers; yeah, it's a sorted projection. Sure. Yeah.

Yes, sir. So in your big O notation, what is your N, and what are you bounding? The number of elements that are going to be compared. So if you have two lists and they are N elements and M elements, basically N and M become the same thing. I mean, the performance improvement over a nested for loop, and I know that's the worst-case scenario, but we did a benchmark a couple of years ago on, I think it was 1.5 billion elements, so 750 million each. We were able to do that on a single GPU in about three minutes. If you had run it as a nested for loop, it would have been 30 years. So I was talking about 144 to 64, but when you look at it at scale, the saving is really big, because it grows exponentially. For very big joins, it's very, very efficient. And parallelizable, and very easy to code up. There are other algorithms out there that one can use that are parallelizable; index nested loop is another one. But this is a very good GPU-enabled join algorithm.

Parallel complexity? So is that N log N predicated on how many cores you have? I don't know if I can answer that exactly, but it's basically the non-parallel complexity. If it was parallel, you'd be dividing by a number which comes out anyway; you don't consider constants in big O. Yeah, so I think what you're describing is basically a Grace hash join with variable-sized buckets, except you're not hashing; everything's already pre-sorted, so you're getting that benefit. It's like you're not paying the cost of actually having to build a hash table. Everything's already pre-sorted, and then you look at large buckets, and the goal is: if something's pre-sorted, I know my lower bound and upper bound, and if nothing matches, then I just throw it away. You just discard it? Yeah. So you would choose... With indexing, you can choose which columns you want to pre-sort, and that's why we call it indexing. If there were no indexes, then you would just sort it; there'd be the overhead of sorting. So his big O is missing the sort time? Yeah, that's right. You've got to sort it. I think the good sorting algorithms are N log N, so it'd be N log N plus N log N. Would that be right? So it'd still be N log N. Everything's in memory, so yeah. Okay, so brilliant. We're very proud of that.
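To make the divide-and-discard idea concrete, here is a minimal CPU-only sketch of the partition-boundary join described above. This is an illustration, not Brightlight's implementation: the function name, the four-way split, and the base case of four are just the examples from the talk.

```python
# Toy sketch of the partition-boundary join idea on two PRE-SORTED lists.
BASE = 4  # base case: fall back to a nested loop below this size

def rip_join(left, right, lo_l=0, hi_l=None, lo_r=0, hi_r=None, out=None):
    """Find all (i, j) with left[i] == right[j]; both inputs must be sorted."""
    if hi_l is None:
        hi_l, hi_r, out = len(left), len(right), []
    n_l, n_r = hi_l - lo_l, hi_r - lo_r
    if n_l == 0 or n_r == 0:
        return out
    # Boundary test: if the value ranges cannot overlap, discard outright.
    if left[lo_l] > right[hi_r - 1] or right[lo_r] > left[hi_l - 1]:
        return out
    if n_l <= BASE and n_r <= BASE:
        for i in range(lo_l, hi_l):          # base case: tiny nested loop
            for j in range(lo_r, hi_r):
                if left[i] == right[j]:
                    out.append((i, j))
        return out
    # Split each side into (up to) four sub-partitions and recurse on every
    # pair; the boundary test above prunes pairs that cannot interact.
    def quarters(lo, hi):
        step = max(1, (hi - lo + 3) // 4)
        return [(s, min(s + step, hi)) for s in range(lo, hi, step)]
    for a_lo, a_hi in quarters(lo_l, hi_l):
        for b_lo, b_hi in quarters(lo_r, hi_r):
            rip_join(left, right, a_lo, a_hi, b_lo, b_hi, out)
    return out

left = [1, 3, 3, 7, 9, 12]
right = [2, 3, 7, 7, 10]
print(rip_join(left, right))  # [(1, 1), (2, 1), (3, 2), (3, 3)]
```

On a GPU, the pruned partition pairs at each level are independent, which is what makes the scheme embarrassingly parallel.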
What I also want to talk a little bit about now is what the other vendors are doing and the challenges they're facing. Our approach is very GPU-focused. We're using pointers. Everything is in GPU RAM. And that means we can really tap into the GPU's power, and one of the things we can tap into is the fantastic data bandwidth, right? So we can process data at 8 terabytes a second; in fact, with the DGX-2, it could be 16 terabytes a second. And with databases, when it comes down to it, one of the key challenges you're trying to solve is data IO. The operations are actually really trivial; it's getting the data to the cores to do the processing that really counts, right? So Brightlight is a GPU-focused database, and we can really tap into that. Kinetica: their approach is very much centered on using CPU RAM as the basic storage and then offloading to GPU where necessary. And what that means is, sure, they can look at bigger datasets, but it also means they're bound by the hardware constraints of CPU RAM. One of those constraints is that you can read data at 100 gig a second, and this is all basically per machine, right? SQream, another GPU-accelerated database: what they're doing, and you can see it from their performance results as well, and I've got a slide on that, is focusing on getting data off disk to GPU, and there the bottleneck is going to be the disk, right? So when you look at the performance characteristics of these different approaches, they match up with existing solutions. SQream being disk-based, or disk-focused, means the kind of database performance they're going to see is going to be very much in line with traditional disk-based solutions. Kinetica is going to be very similar to the kind of in-memory solutions you see today. Brightlight is obviously going to be quite different from that, and we are very focused, and that, I think, is one of the things that differentiates us from the other vendors.

Yeah? So wouldn't the 8 terabytes a second only be between the NVLink and the HBM2? Okay, so that question is all about where the data is actually located to start off with, and when we create a table, we allocate GPU RAM and the data comes onto the GPUs at the point of ingest. So it's already on the GPU by the time the query gets executed, so there's no data transfer. Only the results have to come off? Correct, exactly right. You just have the data there, do the processing, and get the results off. I've got a bit more on... There's a limit there, because it means you have to have enough GPU hardware to deal with the dataset. Exactly right, that's a good point, and I've got a slide on how we're going to extend that out and start to use a lot more CPU RAM for these kinds of workloads.

There are a bunch of challenges when you start building a GPU database. Obviously, the parallel nature of GPUs is one, and writing CUDA kernels and all that kind of thing. But at a higher level, one's also got to look at all the different moving parts from a parallel perspective, and we list these as four levels of parallelization. At the lowest level, the GPUs themselves are obviously parallel devices that need to be taken care of and managed in a specific way, with the specific algorithms that are appropriate. Level three: there's a way of streaming data onto GPUs while they are processing, and streaming data off GPUs, so there's a parallel element there, and level three is actually where we start making use of CPU resources, the streaming of CPU RAM onto GPU.
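Level three is essentially a producer-consumer pipeline. Here is a toy, pure-Python illustration of that overlap (no GPU involved; the queue hand-off stands in for the copy engine, and the sleeps stand in for copy and kernel time):

```python
# Toy sketch of level-three parallelism: stage the next chunk while the
# current chunk is "processing", so transfer and compute overlap.
import threading, queue, time

chunks = [f'chunk-{i}' for i in range(4)]
pipe = queue.Queue(maxsize=1)            # one chunk in flight: double buffering

def transfer():
    for c in chunks:
        time.sleep(0.1)                  # pretend host-to-GPU copy
        pipe.put(c)
    pipe.put(None)                       # end-of-stream marker

threading.Thread(target=transfer).start()
while (c := pipe.get()) is not None:
    time.sleep(0.1)                      # pretend GPU kernel on this chunk
    print(f'processed {c}')
# With the overlap, N chunks take roughly N+1 time steps instead of 2N.
```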
Level two is that now you've got a bunch of devices all on a single machine; they need to be coordinated and they need to execute in parallel. And very similar to level two is level one, which is: right, I've got a bunch of machines, how do I manage those? Level one is dealt with by a GPU manager hub, and that controls the entire cluster, knows where the data is, refactors queries, and understands how to take a basic query operation and instruct the nodes that are then going to do the processing. All of this is fully containerized. And that's really cool, because what it means is we can go from taking a single GPU and using it to co-locate various users, so you could have three or four users on a single 32-gigabyte card if they were doing small workloads, up to federating huge numbers of GPUs across multiple servers, all being used together for a single workload with a single dataset. And by containerizing it, it means we can actually, at runtime, when you're executing a query, decide how much resource you need, right? So you can say, right, I'm going to attach a container that's got a specific amount of GPU resource allocated to it, load my data, and then start running those queries. On the Postgres side of things, the way the GPU manager has been created, the GPU manager itself actually sits on its own port and IP address. So the GPUs can literally be physically located anywhere, and your database can be separate from that. And you can use standard Postgres functionality on that Postgres database, and then when you decide you've got a big job that you want to execute, or whatever it is, you can click a button, fire up a couple of containers or a single container, attach GPU resources to that container, and start using it as a GPU database. And when you're finished, you click a button, all those containers are shut down, and you're back to a standard Postgres database. And all of that can happen at runtime. This is really interesting for large enterprises that are going to be purchasing a big pool of GPU resources and might want to chop that up on a daily basis, depending on the kind of project or the workloads they want to run. They might even want to chop them up overnight, as they've got different workloads running overnight. And all of this can be done because everything's containerized. So you can take GPU resources from a pool, allocate them to a specific job, get that job done, and return those resources back to the pool. I think the answer to that question is: we use Docker, and it's virtualized, so you can have a number of containers all accessing the same GPU. What it does mean is that when you start allocating resources on that container's virtual GPU, you need to manage that, because you don't want to start allocating too many resources. So there is some intelligence that needs to happen there, which we manage.
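The pool-attach-release cycle described above could look something like the following. This is a hypothetical sketch: the image name and the one-worker-container-per-job layout are my assumptions, and Brightlight's actual orchestration isn't shown in the talk.

```python
# Hypothetical attach/release workflow for a pooled GPU, using Docker's
# standard NVIDIA GPU support. Image name and container layout are made up.
import subprocess

def attach_gpu_container(name: str, gpu: str, image: str) -> None:
    # Start a worker container pinned to one GPU taken from the pool.
    subprocess.run(['docker', 'run', '-d', '--name', name,
                    '--gpus', f'device={gpu}', image], check=True)

def release_gpu_container(name: str) -> None:
    # Returning the GPU to the pool is just tearing the container down.
    subprocess.run(['docker', 'rm', '-f', name], check=True)

attach_gpu_container('nightly-etl', gpu='0', image='brightlight/worker')
# ... point the GPU manager at the container and run the big job ...
release_gpu_container('nightly-etl')
```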
When I started, I was talking about, and it's on the blurb as well, how Brightlight is an AI-enabled database. So one of the things is we're Postgres, everything's containerized, we've got a GPU manager, all very cool stuff. But the other very, very cool thing is how we manage the GPU resources, the actual memory and the memory allocation. When we started looking at our second-generation GPU manager, our memory management at the time wasn't where it needed to be, and we were encountering a lot of overhead from allocating memory at runtime. So we needed to find a way to pre-allocate memory and then just load it with data before we started running the queries. And at the same time, we were looking at how we could start integrating Brightlight with AI-type tools, and we started looking at PyTorch, well, it was actually Torch at the time, and one of the reasons was that Torch uses Lua, and we use Lua as well. But also, Torch is a great tool, very easy to get up to speed with, and a great option compared to things like Google's TensorFlow and Caffe and Theano. So we looked at Torch's memory management, and it had pre-allocation and caching and a whole lot of sophisticated things that we were looking for. And if you think about a tensor, it's a list of data, which conceptually is very similar to a column in a database, a column in a table. And so what we've done is we've taken Torch's memory management, extended it and enhanced it, and that is what we are using to manage the columns in the tables in our database. So when you run SQL on the columns in our database, you're actually running GPU instructions on Torch memory-allocated tensors. And that means you can then do the full remit, end to end: from SQL, doing all your data preparation, this data ends up as Torch tensors, so you can immediately start to run Torch operations, and Torch thinks it's working with one of its own tensors. So when you declare a Torch tensor, you use the table name and column name in the definition, and now you've got a Torch tensor, a PyTorch tensor, because we've moved from Torch to PyTorch, that is essentially reading data directly out of the database on GPU as if it was its own tensor, which is really, really cool.

How many of you are doing stuff with PyTorch, say? Okay. One, two, anyone else? Brilliant. So how much of your time is spent doing stuff on a database, extracting data, and then loading it into a PyTorch module? That's an element of work, is it? Okay. Flat files, okay. Okay, fine. Okay, so right. So we've touched on Postgres, the GPU manager, the levels of parallelization, and the fact that we're using PyTorch as our memory management. Sure. So the data you store on GPU in your database is held in the exact same format that you're going to feed into PyTorch from the beginning? Absolutely. Literally. So you literally say, right, I want this column to now be a tensor. There's no copy: that column, as far as PyTorch is concerned, is a tensor. Sure. I mean, it's amazing, because you can now create a model and use SQL to inject data into your tensor. Train your model, and you can feed data back into the database using your tensor as well. So it really just short-circuits all that effort between database work and building models. And you can use it for deep learning. I know you said a lot of you are using flat files, but often that flat file data will actually come out of a database. On that slide earlier, a huge amount of time is spent running SQL queries preparing the data that ends up in those flat files.
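Brightlight's actual descriptor API isn't spelled out in the talk, so the following is only a stand-in sketch of the concept: it pulls a column over an ordinary Postgres connection and materializes it as a CUDA tensor. The whole point of their approach is that this copy step disappears, because the column already lives in GPU memory as a Torch-managed tensor.

```python
# Stand-in for the table-name/column-name tensor idea (hypothetical names;
# the real system wraps the GPU-resident column with no copy at all).
import psycopg2
import torch

def tensor_from_column(conn, table: str, column: str) -> torch.Tensor:
    cur = conn.cursor()
    cur.execute(f'SELECT {column} FROM {table}')    # identifiers assumed trusted
    values = [row[0] for row in cur.fetchall()]
    return torch.tensor(values, device='cuda')      # this copy is what they avoid

conn = psycopg2.connect('dbname=analytics')         # placeholder connection string
amounts = tensor_from_column(conn, 'transactions', 'amount')
print(amounts.mean())                               # ordinary PyTorch from here on
```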
How do you expose it to PyTorch, so that someone can come directly at your tensor from PyTorch, the one that's sitting on the GPU? I get that you update the Postgres catalog and say, I have these tables, right? Yeah. So we've made some changes in PyTorch, but when you create a PyTorch tensor, you use a descriptor, and in the function that you use to create that tensor, one of the parameters you pass in is the table name and column name, and everything else just happens. Basically, there's some clever stuff that happens with pointers, and the PyTorch code, as far as it's concerned, is then working with a tensor. That's one of the fundamental pieces of our IP and our domain expertise, and if I revealed it fully, it would really start to touch on that kind of stuff, but basically, yeah. Makes sense? Cool. Okay. Good.

So, your question about getting data on and off disk. Sorry, on and off GPU. IBM has a great product nowadays: NVLink on the motherboard. And that means you can start to transfer data from CPU RAM to the GPUs at 150 gig a second per GPU. When we started doing this on Power8, the first time we tried it, it didn't work. We were getting close to PCIe-type performance with data transfer, and we were wondering what was going on. The issue here is that you really do need to take NUMA into consideration. What was happening when we were getting those slow benchmarks was that data in one socket's memory was being transferred onto the other socket's GPUs, so it needed to go through the socket-to-socket bus. So if you do want to make full use of NVLink and really get the maximum bandwidth out of the hardware available, you actually have to construct two GPU managers and use NUMA to make sure that those GPUs are allocated to the correct sockets. So for a Power8 or Power9 system, we will actually have two GPU managers running, even though it's the same system. For us, that doesn't really matter, because each GPU manager presents itself as a port and IP address, so it's very easy for us to do that.

How common are Power9 deployments? Are there people running it right now, percentage-wise? Yeah, you know, it's got a lot to do with what's in-house already, so there will be decisions that companies have already made on hardware. This is fairly new technology, and I suppose there's a form of education needed, but it's very much dictated by existing relationships that are already in place. But the performance improvements are fantastic, because if you have to do this with PCIe, you're looking at 16 to 32 gig, whatever the latest is today, compared to benchmarks we got recently on a Power9 box of 250 gig a second data transfer. That was actually benchmarked, not theoretical; it was actually measured, and it's faster than the memory on the CPU, right?
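Here is a sketch of that two-manager layout, assuming a two-socket box with two GPUs attached to each socket. The GPU-to-socket mapping, the ports, and the gpu_manager binary are all illustrative; numactl is the standard Linux tool for the socket pinning described above.

```python
# One GPU manager per NUMA node, pinned so host-to-GPU copies ride NVLink
# instead of the socket-to-socket bus. All names and values here are assumed.
import os
import subprocess

SOCKET_GPUS = {0: '0,1', 1: '2,3'}     # assumed GPU-to-socket attachment

for sock, gpus in SOCKET_GPUS.items():
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpus)
    # numactl binds this manager's threads and memory to its own socket.
    subprocess.Popen(
        ['numactl', f'--cpunodebind={sock}', f'--membind={sock}',
         './gpu_manager', '--port', str(5433 + sock)],   # hypothetical binary
        env=env)
```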
So, to bring this together: I was talking a lot about how Brightlight is very GPU-centric, and a lot of our initial effort has gone into how to get the data onto the GPUs, keep it there, operate on it there, and really tap into that GPU power. But there are workloads with larger datasets, and we need to start thinking about how to get more of those CPU resources into play. So using NUMA, using level-three parallelization, which allows us to stream data onto GPU, we are now creating what we call the big f***ing column. This is one of the Falcon heavy-lift rockets, and I think the next iteration of this is the big f***ing rocket, right? So our BFC, big federated column, is all about introducing the ability to have a column that spans from GPU, through CPU resources, all the way down to disk, and to access and make use of those resources according to the use case. There are use cases where you have a huge amount of data, and really what one wants to be doing there is using disk-based resources, which are cheaper. You might have slightly smaller datasets and the ability to store those in CPU RAM; great, we can take care of that. And if you want very, very fast performance on slightly smaller datasets again, then we can do that as well, with the standard GPU platform that we've got at the moment.

Now a bit of benchmarks; I heard that you guys are very interested in actual benchmarks and hard-and-fast figures. Has anybody seen this website, Mark Litwintschik's? Okay, great. So he's done a lot of benchmarking. He's got a fairly standard benchmark: four queries, a billion-row dataset, and he's looked at all sorts of scenarios. At the top there we've got Brightlight. This was on Power8; there's our five-millisecond benchmark, and 188 milliseconds for the more complex query. Some of our competition are on here, but I'd like to draw your attention to Spark. Very common, well used, a popular in-memory-type system. This is on 11 AWS instances; ours was on five IBM nodes. This runtime is 10 seconds. That runtime over there is five milliseconds. Now obviously there's a cost trade-off; that IBM Minsky hardware is a lot more expensive. But if you scaled this up and spent the same on hardware, you would still be getting a massive improvement in performance. What we're talking about here is not only absolute speed but also the efficiency of the resources. Cost per query on Brightlight is actually cheaper than any of these solutions you see below, because GPUs are just so good at what they do. There was a bit of a Twitter conversation when this came out, and we did say to MapD that this second benchmark here is on AWS. It would be very easy for them to refute this and say, you know what, we're going to run it on AWS ourselves and show you that we are at least as fast; a day's worth of work, right?

Do you think MapD just hasn't prepared the hardware? Okay, so you're asking: is there something fundamental about what you're doing that's different from what they're doing, or is it just better engineering? And maybe another question: what percentage of your performance gain is because you're on the GPU, and what percentage is because of something about the implementation? So I think if you did those sorts of comparisons, we would still come out faster than MapD, probably about two times faster or something like that. Well, the only thing that's different is the software, right? The hardware is exactly the same, so the underlying software is better and faster, better engineered. But 2X, on a platform that is already hundreds of times faster than everybody else, doesn't really matter.
I think what's really important, actually, is to say that performance times with MapD are in the same ballpark, but Brightlight is using Postgres, and there's a whole lot of stuff that you get from Postgres for free: stored procedures, cursors, user-defined functions, native JSON, connectors, the whole environment around Postgres. Pretty much every visualization tool on the market today has got a Postgres connector, so you can take Tableau, install it, and use the Postgres connector in Tableau to connect to a Brightlight database and start accelerating your dashboards. Excel, anything, will have a Postgres connector; all your Python code has Postgres connectors. So it's the fact that we're using Postgres, and also the fact that we are using PyTorch memory management, so you can run some of these AI workloads directly on data that's being managed in the database.

This is a benchmark that we did on a Dell box: TPC-H query 6, on four GPUs, and we were seeing throughput of 2.1 billion rows per second, 360 gig a second; TPC-H queries were coming in at 150 milliseconds, TPC-H 6 at 250 milliseconds. So, a very fast database. The scale factor: I think it was 16 gig per GPU, and we were loading them at about 60%, so that works out to about 100 gigs, probably a scale factor of 100. And this is some benchmarking that SQream published. When you think about the throughputs there: the total size of data was about 100 gig for the Brightlight benchmark; this dataset is massively bigger, but the run times are in minutes, if not hours. So this is just highlighting the fact that different vendors are taking different approaches. These performance metrics are very similar to what you would see on a disk-based database, and the performance metrics that you'll see on Brightlight are very different from that, very different from what you'd expect from most databases, because it's GPU-accelerated.

So these are the actual queries; this is MapD. In these two queries we were using indexing, so we're using pre-sorted data in those projections, or permutations, that we talked about earlier. That gives us a massive performance improvement, but it's sensible: indexing is a very old, well-understood way of accelerating queries, and the original indexes were actually copies of data. So, well understood, and it gives us a significant performance improvement because a lot of the work has been done up front. When we look at a lot of our workloads, 90% to 95% of the time there's a sort involved, and 90% of the run time is actually doing the sort; everything after that is very quick and easy. And you use sorting for GROUP BYs; very easy to understand GROUP BYs, and once you've sorted, all the follow-on calculations are very quick. This is Brightlight: so 90% of the execution time of a query is spent on sorting, for these two here, for these kinds of queries, yeah. The question is, do you do any code generation? No. So for MapD, does that time include code generation as well? No, that would be unfair, because for MapD the first time you run a query it would be really, really slow; that run time would be 10 times that, sometimes a second or more. But it's not just indexing that allows us to be really, really fast. These are unindexed queries, so this is just raw compute, raw performance, and still fairly significantly faster. This is probably, if you're going to compare to MapD, the kind of performance characteristic that separates the two: MapD doesn't have indexing, so you never have that option to choose.
The good thing about Brightlight is that if you do want to accelerate, and indexing is an option, then you can use it, and you get a fantastic improvement in performance. So, sorting: really, really important to all of our database operations. If you're going to be using indexing, if you're going to be doing GROUP BYs, if you're going to be using Recursive Interaction Probability for your joins, sorting is massively important. So, bitonic sort: N log N, parallelizable. I imagine you're all very familiar with bitonic sort, right? Has anybody... excellent. Has anybody tried to implement this on a GPU? Excellent. How was it? I don't mean performance-wise. Did you do it in an afternoon? It was more than a day. More than a day, okay. And it was probably very specific to the hardware you had available and the datasets you were looking at and all those kinds of things; trying to generalize this... there you go, right, okay, brilliant, lead on. So, do you know what sorting algorithm Thrust uses? I think it's bitonic or radix. Radix is the other one, who does it? Okay, excellent. I'll repeat what he says, because the video won't pick it up. Okay, sorry, say it again? Bitonic sort comes as a sample for CUDA, right? So bitonic sort comes as a sample with CUDA, that's right, exactly. So, what we were doing to start off with, and this is leading up to Thrust, right: we don't actually write very many of our own CUDA kernels, because the level of effort to do that is massive, and there are some fantastic algorithms where somebody's already spent all that effort looking at the best options. So we use Thrust to do a lot of our coding; very, very little of our GPU stuff is actually hand-rolled CUDA. And the sample you're talking about: when we were initially looking at writing our own kernels, there are all those different samples in CUDA, I think they're numbered one to ten, right, and the first one is getting the stats out of your hardware. We looked at that, trying to fit it into our existing application, and the level of effort was massive. The thing with Thrust is that it abstracts away all that complexity. Fantastic tool; it gives you the best algorithms and allows you to run on all sorts of different hardware with different configurations, and all of that is taken care of. The purpose of these two slides was really to demonstrate how important sorting is to us, but there are fantastic algorithms out there that are parallelizable, with great big O complexity, and they're all there for you in things like Thrust and other libraries.

So, good. We've talked a little bit about indexing. Indexes are really arrays of pointers, and that means they're very light: whether you have a covering index or just a standard index on a single column, all you need to be doing is storing an array of pointers, so very, very lightweight. I was talking about this with Andy earlier: there are two sorts of ways of approaching indexing. You can either make a copy of the data, or you can use pointers, and in memory, obviously, you can use pointers. GPU resources are still relatively expensive, and so we're adopting what is effectively a much lighter-weight implementation, while still getting a lot of the benefits.
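As a small illustration of the "index as an array of pointers over pre-sorted data" idea, here is a sketch using PyTorch on GPU, in the spirit of their PyTorch-backed columns. The column, the values, and the use of argsort/searchsorted are my own illustration, not Brightlight's code; device='cuda' assumes a GPU is present.

```python
# An index as a permutation (array of row pointers) over a sorted projection.
import torch

amount = torch.tensor([7., 1., 9., 3., 3.], device='cuda')  # unsorted column
index = torch.argsort(amount)      # lightweight: row ids, not copied data
sorted_view = amount[index]        # the sorted projection joins/GROUP BYs use

# Probe the sorted projection, e.g. find all rows where amount == 3.
key = torch.tensor([3.], device='cuda')
lo = torch.searchsorted(sorted_view, key).item()
hi = torch.searchsorted(sorted_view, key, right=True).item()
print(index[lo:hi])                # row ids 3 and 4 hold the matching values
```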
So, a bit more marketing speak: Postgres, GPU manager, containerized, PyTorch. There are basically three products that we are taking to market. There's the Brightlight database, which is a Postgres clone with the patent-pending IP that allows us to do joins very efficiently, and, according to benchmarking, the fastest database in the world. There's Brightmind, which is what we're really excited about: SQL plus artificial intelligence, all running on GPU, directly connected to the database using PyTorch. And there's an analytics workbench, browser-based, so just a username and login, and from that you can access a full-blown SQL editor, so you can do all your DML, DDL, and data loading via the browser. You've got a full workbench of analytic tools, so you can do all your charts and geospatial mapping, and you can also link that up with Jupyter, so there's a Jupyter client too, and you can start to write PyTorch, all from a browser. Everything's there, all running on GPU: end-to-end SQL, AI, and a workbench to drive it all.

GPU resources: very, very fast; very, very expensive. So where does it actually fit into the kinds of use cases you might look at? Open source: cheap and cheerful; you can store huge amounts of data on disk and access it intelligently. Something like Pivotal Greenplum: maybe a bit more expensive, a bit more performance, but not able to deal with the kind of scale that Hadoop can deal with. Brightlight is very much focused on performance. And I think if one looks at how data is accessed, the value of data at a given age is a really important aspect, not necessarily in this context but in a business context: what customers are doing today is a lot more valuable than what your machine logs were from two years ago. So what Brightlight is all about is looking at high-value data, getting that data onto GPU, and then using very fast analytics to really get value out of it.

I'm going to talk a little bit about Postgres now: why are we using Postgres? What we get out of Postgres is basically the first two parts of the slide, the parser and the optimizer. We've talked about algorithms, we've talked about GPUs, but actually all of that is meaningless until a user can actually connect to it and use it. That means writing SQL code, and as soon as somebody starts writing SQL code, you need to parse it and generate the execution plans and so on. Now, you can obviously build your own; flex and bison and that book will tell you how to write your own SQL parser. But what you get from Postgres is an execution plan, and it's this that we then traverse and feed across to the GPU manager to execute each one of its components. So the GPU manager doesn't have to worry about parsing SQL statements, and it doesn't need to worry about creating an execution plan; all of that is done up front, and it means all that effort we might have had to incur is actually done in Postgres. All we need to worry about is each of these individual elements: as long as the GPU manager can process each one of these individual elements, then we've got a working database, a very, very fast database. A bit more about Postgres and how the interaction looks: we've got our GPUs, the GPU manager manages those, and what we've done is rewritten large parts of the Postgres database engine so that instead of processing code and data on CPU, it just hands that off to the GPU manager and gives it an instruction: right, I need to do a join, these are the tables, these are the relations, this is what needs to happen. Or there'll be a calculation or expression, and that's handed off to the GPU manager.
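Because the client-facing side stays vanilla Postgres, you can see this division of labor from any Postgres client. Here is a minimal sketch with psycopg2; the host, credentials, and tables are placeholders, and the exact plan text Brightlight emits isn't shown in the talk.

```python
import psycopg2

# Connect exactly as you would to a vanilla Postgres instance.
conn = psycopg2.connect(host='localhost', dbname='analytics',
                        user='analyst', password='...')
cur = conn.cursor()

# Postgres parses and plans the query; the resulting plan is what gets
# traversed and handed, element by element, to the GPU manager.
cur.execute("""
    EXPLAIN
    SELECT c.region, SUM(o.total)
    FROM customers AS c JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.region
""")
for (line,) in cur.fetchall():
    print(line)
```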
Any questions? Are there any Postgres queries that aren't supported? So, not everything has been ported over to GPU; for instance, windowing functions we don't yet support. I meant, will those just still run on CPU, as opposed to generating an error? So if you try to run a SQL operation on GPU data that hasn't been implemented on GPU, you will get some form of a message. Sometimes what we might do, for instance if you try to join a Postgres table with a GPU table, is use Postgres and just use the GPUs for data access. But if you try to do something a bit more complicated... there are stats functions we haven't yet implemented, there are Postgres functions we haven't yet implemented, but 99% of the kinds of workloads people are typically going to be running, you can run using standard Postgres on GPU.

Okay, so this slide is all about the tools and accessing the platform. In the middle you've got the Brightlight API, the Brightlight software running on GPU. On the top you have all the tools you can access the platform with: obviously Postgres, so you can use psql, pgAdmin, anything that would connect to a Postgres database. We've got connectors; because it's Postgres, all of these visualization tools will work. Tableau will work, Power BI; we've got our own visualization tool called Spotlight, which I mentioned earlier, a workbench; and then also Torch and Jupyter, those will also work. On the bottom, using foreign data wrappers, which is standard Postgres functionality, you can access data from any one of these data sources. I think there's something like 60 foreign data wrappers out there, yeah. So pretty much everything from flat files to Twitter to Oracle, Hadoop, OpenStack, MySQL, MariaDB: you can get data directly from that data source into the platform and start using it. Like, if you use a foreign data wrapper, are you going to take the data out and shove it down to the GPU and process on it?
So you can decide what you want to do, but what you would typically do, it's a way of intelligently getting data out of the third-party source, is run a SQL query and load the result onto GPU. More than that: if I have a foreign data wrapper that Postgres already supports, can I shove the data down to Brightlight, onto GPU? What it means is you can run a SQL statement that says INSERT INTO gpu_table SELECT..., so you can get data from any of these data sources. Instead of running a SQL statement on the database into a flat file, and then loading the flat file into the GPU, you just go straight from an existing data source onto GPU with a single SQL statement.

Just a little bit about joins. Ultra-fast joins: in a relational database, you know, the whole thing is there in the name, relations, and being able to join data is really, really important. A lot of the vendors are only now coming to grips with being able to join data on GPU; it's something we've had for a very long time with Recursive Interaction Probability. This slide is probably really familiar to you by now, so I'm not going to dive into it too much. And this is the wrap-up: we've got Brightlight, GPU-accelerated Postgres; Spotlight, which is the workbench and allows you to do it all end to end, SQL, AI, dashboards, visualizations, Jupyter, all in one space; and Brightmind, which is the ability, because we're fundamentally using PyTorch memory allocation for storing data, to run SQL workloads and also do AI-type workloads using PyTorch. And that is the end. We have time for one question. Go for it.

Since you're using PyTorch, do you think it would be feasible to run Brightlight on Google's TensorFlow? So, would it be possible to also use Google TensorFlow in the same way, is that what you're... Google's dedicated hardware? Good question. We would need to be using Nvidia GPUs, because of Thrust and everything that comes with Thrust. And I think a lot of what PyTorch actually does, I'm guessing, but I think it probably uses Thrust and those kinds of things; I don't think it uses OpenGL or OpenCL. So the answer to that is no. Good, thanks very much.

All right, the next talk is November 29th, and it'll be the last talk in the seminar series, and that'll be the Swarm64 guys. All right, have a good weekend, everyone.