So we're super excited to have Todd Mostak and Alex, sorry, last name, right, it says right there, sorry, from MapD. MapD is one of these newer startups that are designing data systems to run on GPUs. The background for Todd is actually quite interesting here. He was a master's student at Harvard doing research on the Arab Spring, and he wanted to visualize the tweets. He went down the street to MIT, took the database class there with Sam Madden, and ended up building his own in-memory GPU database engine to process the tweets and visualize them. He realized this was a good idea, left school, and that became a startup, right? Yolo. Yup. With that, Todd.

[Todd:] Thanks, Andy. Hey, everyone. Thanks for sitting through the technical difficulties here. Really excited to be here with Alex. So I'm the CEO and co-founder of MapD, and Alex is one of our awesome engineers, coming out of grad school at Johns Hopkins; he can tell you a little bit about that. He worked initially more on the graphics side, which I'll talk about, but basically across the whole stack of our system. I'm going to jump right in because I know time is relatively limited.

So MapD: we're not a mapping company. MapD originally stood for Massively Parallel Database. That's sometimes confusing for people, especially since we do do a lot of maps and geospatial, which you'll see. And we're not quite a conventional database. We focus on running SQL very, very fast by leveraging the massive parallelism of GPUs, but we do a lot more than that; we don't necessarily want to be a traditional data warehouse. We feel the advantage of having all that speed is the other interesting things you can do with GPUs: you can leverage the rendering pipeline to visualize data in real time, and you can leverage the massive compute capabilities to embed machine learning into the platform, which I'll talk about in a little bit.

So jumping right in, seems to have lost my, there we go, great. Why would you do something like this? CPUs are ubiquitous. They've evolved, they're everywhere, they work really well, and they still get faster every year. It's relatively easier to program CPUs than GPUs: CPUs are truly general-purpose processors, there's a whole ecosystem, not to mention a massive number of people who know how to program them. So why would you try to run a database or analytics platform on a GPU?

Starting at the 10,000-foot view: organizations are drowning in data. The amount of data being produced, and a lot of this is sensors and devices, IoT, clickstream data, social media, call data records, data coming off satellites, is growing at about a 40% year-over-year rate. But CPU processing power is only growing at about 20% year over year, and that's probably a generous, top-of-the-line number; in many respects it's lower, and with things like Meltdown and Spectre it may actually be going backwards. So there's this gap between the growth of data and the growth of CPU processing power, and it's causing people to adopt all sorts of awkward workarounds.
People have to downsample, they have to pre-index, they have to pre-aggregate, or they just scale out. They throw massive amounts of hardware at the problem, but a lot of platforms simply don't scale very well: you might double your servers and not get double the performance. Not only that, it costs a huge amount of money, and there's a lot of complexity in managing massive clusters.

So that's where GPUs come in. Hopefully a lot of you are familiar with graphics processing units. They were originally built to run video games, and people realized very early on that it was better to have many slow cores render pixels to a screen in parallel, rendering being a basically embarrassingly parallel problem, than to have a single fast core render them sequentially. Companies like NVIDIA and AMD (which I think bought ATI) built technology around this, and at some point people had the revelation that this parallel architecture is good not only for rendering but for a whole host of other tasks. And the cool thing about GPUs, I'm not going to get into all the reasons why, is that because they have this different, almost simpler architecture, kind of brute force, throw thousands of cores and thousands of ALUs at the problem, their performance has been able to scale in a way that CPU performance hasn't. Particularly over the last five or ten years, every generation of NVIDIA's GPUs has been roughly, give or take, 50% faster year over year. So as CPUs show signs of slowing down, and there are signs we're at the end of Moore's Law, the traditional doubling of transistors every 18 to 24 months, GPUs have an architecture where, as Moore's Law does deliver more transistors, they can throw them into more cores, and the performance gain has been pretty linear over time. I'm happy to talk about why that is, but yeah, okay.

[Audience:] So I know Bill Dally loves this graph, but large chunks of that upper line are because of die area growth. Look at what NVIDIA has done: with Volta they produced about the largest die humankind has made, and they scaled that back for Turing; it's just too ridiculous, too expensive. There's no indication they can keep doing that. They've taken their one last advance to TSMC 12nm, they'll hop to TSMC 7nm, and then they're in the same position as Intel. So your assumption here is just blatantly wrong. We've already got the 100x, right? The 100x is there, and it's probably not going to disappear. But the extrapolation is just extremely speculative.

Okay, that's a strong way to put the argument. Yes, on Volta they went to the reticle limit, the die size is massive, I agree. I don't know what tricks they have up their sleeve, and obviously we're all bound by the same transistor sizes; 7nm is the next step, I agree. I do think there are certain things that tend to scale better as you get more transistors. CPUs have a multi-pronged problem: they're not just trying to do one thing. You add more cache, you add more cores, you add other features. [Audience:] I'm not denying it at all. The gap is totally there; it's just the projection. So let's just block this out.
This is definitely, I see your point, and I don't have a crystal ball. Obviously they went to the max for Volta, and I can't speak to the roadmap or whatever they have up their sleeve. But there are some signs here, and I think a lot of it is that the singular purpose of a GPU is parallel compute at scale. With CPUs you have these very fast cores, and what are you going to give your transistors to? It's a choice, and they're not always just trying to optimize flops; they have other things in mind, there's a lot of cache synchronization that needs to happen, yada, yada, yada. So there are some scaling advantages for GPUs. Whether they will continue is a fair question, but I do think there's a significant advantage here right now in 2018. So thank you, keep me honest here.

So this is the most marketing slide we have, and I'm sorry, we hired a marketing team, this is what happens. What's cool about GPUs? I think there are three things. Everybody thinks of GPUs in terms of their compute, the teraflops: every one of NVIDIA's recent GPUs has over 10 single-precision teraflops, and the double-precision number depends on the architecture. That's exciting, but there are other very cool things about GPUs that distinguish them.

First, high memory bandwidth: the ability to scan massive amounts of data. The latest NVIDIA GPUs have almost a terabyte per second of memory bandwidth each, and you can combine that: go eight-way in a box and all of a sudden you can scan a ton of data per second. SQL and analytics workloads tend to be memory bound, frankly, IO and memory bound, so being able to scan more data per second is a huge win.

Second, the rendering pipeline. This doesn't matter to a lot of people, but as you'll see, we're a very vertically integrated stack. You can use this as a pure SQL database, and people do that, replacing data warehouses in certain circumstances. But where we're really focused, and where we shine, is an integrated stack where you can not only query billions of records in milliseconds, but also render the results in situ and visualize them at a granular scale.

And finally, and this is more forward-looking for us, GPUs have a massive amount of compute, teraflops per card, and the whole AI revolution, at least the deep net revolution, has largely been built on them. Being able to leverage that in your analytics workflow, predicting nulls by running neural nets, or even just running logistic regressions or XGBoost, which tend to run faster on GPUs, fits nicely into the analytics ecosystem we work in.

Cool. So, a quick high-level tour. We'll dive into some of this, and then Alex is going to get pretty deep into one particular aspect of the system: adding geospatial types and execution capabilities. The core of our system is a SQL engine that runs on GPUs. We have our own memory management, and we do persist to disk, although we try to cache as much in memory as possible.
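As an editorial aside, a quick back-of-envelope makes the bandwidth argument above concrete. This is illustrative arithmetic only, using the per-card figure quoted above and the dual-socket CPU figure quoted later in the talk; it is not a MapD benchmark:

```cpp
#include <cstdio>

int main() {
  const double rows = 10e9;           // a 10-billion-row table
  const double bytes_per_val = 4.0;   // one 32-bit encoded column
  const double gpu_bw = 0.9e12 * 8;   // ~0.9 TB/s per card, 8 cards per box
  const double cpu_bw = 0.2e12;       // ~200 GB/s across a dual socket

  const double col_bytes = rows * bytes_per_val;  // 40 GB of column data
  // A bandwidth-bound scan takes bytes / bandwidth.
  std::printf("GPU box scan: ~%.0f ms\n", col_bytes / gpu_bw * 1e3);  // ~6 ms
  std::printf("CPU box scan: ~%.0f ms\n", col_bytes / cpu_bw * 1e3);  // ~200 ms
}
```

That roughly 35x ratio is the "order of magnitude, if not one and a half" bandwidth gap referenced later in the talk.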
And then we have an LLVM compiler execution engine that goes on top. Like a few other good systems out there, such as MemSQL and Impala, we generate code on the fly. And that is all open source: we open-sourced about 18 months ago under an Apache license, so you can check it out, and I think Alex is going to show some code samples. You can run it on 8 GPUs, whatever, and scale to billions of records. That's been very cool for us in terms of engaging the community.

Then we have this render piece, which is actually one of the highlights of the system, and I think you'll see it in a second when I give a demo. This actually runs on the GPUs. So there is CUDA, which is the compute language for GPUs, and there's OpenGL, a graphics API (I think people are going to be moving to Vulkan). What we can do is keep the data in GPU memory, run a query, keep the result set in GPU memory, and then render on top of it in situ. So you can take billions of records and do scatter plots; a lot of geo stuff is one of the main use cases, but you could potentially do network graph layouts and that kind of thing. It's very nice because you don't have to ship data to the client; you don't even have to ship it over the PCI bus. It can all be done in situ.

And then you'll see here we have Immerse, which is a visual analytics system, a very lightweight BI thing, but it was built to work hand in glove with this whole stack, so it can leverage the rendering and very efficiently leverage the query engine. And then it's a platform: it's a database, so you have JDBC, and we have Kafka connectivity. I'll talk about Arrow in a little bit, which I think is really exciting, on the way out as well. So you can put it under Tableau, or you can use Arrow to feed, say, XGBoost working on the same GPU. H2O has some really cool open-source stuff we can leverage, and I'll talk about that in a little bit. Okay, this clicker's acting up a little.

So just quickly, I'm going to cut to a demo to give you a sense of the scale the system can achieve. I know we're time-bound, so I'll do this quickly, let's see. Okay. There's a boat here in the middle of Kansas or something; I don't know what that's about. So this is a demo we have, almost 12 billion records. It's an open dataset from the U.S. Coast Guard: ship trajectories, all the ship AIS data, which we just ingested into MapD. And because we're a scrappy startup, we're actually running this on, I think, GTX 1080s, not even 1080 Tis, 32 of them across four servers. You could take a modern Turing card, or the new 32-gig Volta, and probably run this in almost one server. So you can see here, this is Immerse. It's hitting our query engine, which I'll talk about in a second, and everything here is interactive. As I zoom into San Francisco, everything updates. And the thing you'll notice is that all these charts are cross-filtered: as I zoom in, it filters on all the other charts, the ship type, the ship length, the time.
And the cool thing here is that, obviously, you don't want to send 11 billion records from your server to your client; we'd be sitting here all day. So we render that in situ and send a compressed PNG back to the browser rather than all the actual data points. In some ways server-side rendering is not cool anymore; everybody wants to use deck.gl and kepler.gl. But frankly, when you're doing analysis at this scale, you can't just move all that data to the browser. So I can start here and cross-filter, and all of a sudden I'm pulling an entirely different dataset to the browser. Let's see here. As I zoom in, we're sitting here in the Bay Area; this system is a little sluggish right now, but if you try it on your own computers it's quite fast. Let's do this. There we go. Okay, now I zoom into San Francisco. Our office is sitting here, and we can actually see the boats. I think they're tankers; they sit out here and moor, and you can actually see them move around in the tide as they're mooring.

And if you look at what's happening behind the scenes, just to give you a sense, and this is not really a product feature, I'm just showing the SQL: as I brush here, you can see all these queries running. The reason we can be so fast and interactive is that we're basically table-scanning this data. There's no pre-indexing, pre-computation, or pre-aggregation. We're literally running these queries in, okay, 74 milliseconds, 33 milliseconds, 40, 66, 37. And we can also do the render; I'll talk a little bit about how we use Vega, an open-source declarative rendering API out of Jeff Heer's lab, to run these renderings on top of the data. So that's how we can be so fast and interactive. All of these demos are online, so feel free to go to our website and check them out. So there's SQL happening under the hood, and outside of demos there are interesting ways to use SQL to interact with the system, or something like Jupyter notebooks.

Cool, okay. I'm going to skip this demo because we're running a little short on time. Let's see. Oops. There we go. Okay, thanks. So I'm going to fly through this a little, but I think it's interesting to understand where this space came from: how GPUs evolved into compute engines, and when and why people started thinking about actually running databases on them. Back in the Wild West days there wasn't even something like OpenGL; every vendor had their own API, and nothing was portable between vendors. But even early on, and this is back in the 80s, people were already trying to map compute onto graphics hardware. The early pioneers, or cowboys, in this space would literally trick the GPU: hey, I have all this data and I want you to run these algorithms on it, but I'm going to make you think it's an image. I'll load the data as texture data and run it through the graphics pipeline, and something interesting will happen that I can use. And then OpenGL and DirectX hit, which were standard APIs for graphics.
Brook was Ian Buck's work out of Stanford; he's the father of CUDA. It was one of the first C-like general-purpose compute languages, which CUDA evolved out of, and then there's OpenCL, which is kind of the open-standard twin that AMD has really pushed. All right, I'm just going to stop using this. No worries.

Okay, so once we had these compute APIs, the space exploded, in the sense that people said, hey, there's something different here. We have this new hardware paradigm with a ton of compute we can leverage, and people started doing different things with it. The first wave was HPC and scientific compute workloads, whether that's protein folding, SETI@home, or simulating nuclear weapons; people realized that GPUs are very good at running Monte Carlo algorithms. The next big wave, and this is why NVIDIA is no longer a sleepy company but one whose market cap is approaching Intel's, is the deep learning revolution. In the last five years you've seen an explosion of people saying, hey, let's take neural nets, which are actually not a new idea, they've been around since the 50s almost, but all of a sudden everybody had massive compute at their disposal, so you could do deeper nets, bigger nets. And these deep nets turned out to be more effective than a lot of traditional machine learning algorithms, particularly at classification problems, but at other things as well. That's created this huge industry, and I'm sure you hear about it all the time. And then finally, I think what's exciting is the emergence of general-purpose analytics databases and visual analytics platforms on GPUs, which we're part of. You've heard from Kinetica, and I'm sure you'll hear from some of the other folks as well. I think there could potentially be a big market there.

So we're not the first to do this. Well before MapD existed, there were papers on using the graphics APIs to shoehorn database operations onto GPUs. Peter Bakkum, at the University of Virginia, did basically the first full SQL acceleration: he used SQLite and was able to take some of its opcodes and run them on the GPU. MapD started in 2012 out of Sam Madden and Mike Stonebraker's class, as Andy mentioned, and since then quite a few other academic platforms have emerged. And then you have commercial GPU databases: we were founded in 2014, and you also have Kinetica and some of the other companies you're going to hear about in this seminar series.

So I'm going to keep moving here. These are interesting questions about GPUs, but I don't want to run short on time; I want to make sure Alex has time. I'm going to give a quick architectural tour and show you some of the interesting parts of the system. As I said, the core engine is this very fast SQL engine, moving towards more of a general-purpose compute engine. It shouldn't just be SQL that we're running; imagine being able to run any kind of general-purpose algorithm across multiple GPUs and multiple nodes, and leverage SQL as well. You can think of the platform evolving towards something almost like Spark, but running on GPUs.
And it's very closely coupled with MapD Immerse, which is what you saw there: it hits the SQL engine, hits the render engine, and provides a very fast interactive experience for data exploration.

So, MapD Core. I'll talk about a few things I think are interesting. There are a lot of design considerations to think through when you're building a GPU database or SQL engine. The first is that memory on these GPUs is relatively limited, particularly when we started MapD: the GPUs then capped out at about six gigabytes each, and you could maybe get four in a server. The first server we bought at MIT back in 2012 had, I think, 24 gigabytes of GPU memory, which is pretty paltry. Fast-forward to today and the standard cards have 32 gigabytes each, and there's actually a new Turing card with 48, so there are signs it's going bigger; a lot of that is driven by wanting to fit bigger neural net models in memory. So you actually have quite a bit of memory to leverage.

The idea behind MapD is that we're a columnar database, a column store: you only pay for what you use. We cache the hot data, what's being actively queried, in GPU memory itself. That's our L1 cache, to use a metaphor. Not everything actively queried has to be on GPU, so the system uses CPU as well: what doesn't fit spills to CPU RAM, the equivalent of our L2 cache, often much bigger and much cheaper, but also significantly slower. On a GPU you can have almost a terabyte per second of bandwidth and eight cards in a box; on CPU, and this is probably a little outdated, you might see 200 gigabytes a second across a dual socket, higher on a quad socket. So there's at least an order-of-magnitude gap, if not one and a half, between GPU and CPU RAM bandwidth. And finally there's SSD; we persist on SSD. So you can think of the whole system as a wedding cake architecture, with dual buffer pools on GPU and CPU, where we try to keep the data we need in GPU RAM; there's a minimal sketch of this chained design just below. Yes?

[Audience:] What's the ratio in a typical deployment? Is this mostly static data that you're querying, or is there a lot of ingest? Yeah, that's a great question. So his question was, what's your ratio of reads to writes: are you mostly reading data that's static? We actually built the architecture with this in mind. The cool thing about MapD, particularly for append-heavy, fast-streaming workloads, is that we keep the data in GPU memory, basically, and as the stream comes in it lands on CPU and we move it over. We can stage it so that between one query and the next there's just a delta of what hasn't been put in GPU memory yet, and that delta is quite small, so it doesn't really affect query times in any significant way. Now, updates and deletes are a little more complicated; we haven't overly optimized for that use case yet.
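Here's that sketch: a minimal, illustrative version of the chained buffer pool design just described. It is not MapD's actual buffer manager; the class and key names are invented, and capacities are counted in chunks rather than bytes for brevity:

```cpp
#include <cstdio>
#include <list>
#include <string>
#include <unordered_map>
#include <vector>

using ChunkKey = std::string;  // hypothetical (table, column, fragment) id
using Chunk = std::vector<char>;

class BufferPool {
 public:
  BufferPool(const char* name, size_t capacity, BufferPool* parent)
      : name_(name), capacity_(capacity), parent_(parent) {}

  // Request a chunk; on a miss, lazily pull it from the level below.
  const Chunk& get(const ChunkKey& key) {
    auto it = cache_.find(key);
    if (it != cache_.end()) {  // hit: refresh LRU position
      lru_.remove(key);
      lru_.push_front(key);
      return it->second;
    }
    std::printf("%s miss on %s, fetching from parent\n", name_, key.c_str());
    const Chunk& below = parent_ ? parent_->get(key) : loadFromDisk(key);
    while (cache_.size() >= capacity_) {  // evict least recently used
      cache_.erase(lru_.back());
      lru_.pop_back();
    }
    lru_.push_front(key);
    return cache_[key] = below;
  }

 private:
  static const Chunk& loadFromDisk(const ChunkKey&) {
    static Chunk page(4096);  // stand-in for a read from SSD
    return page;
  }
  const char* name_;
  size_t capacity_;  // in chunks, for simplicity; real pools track bytes
  BufferPool* parent_;
  std::unordered_map<ChunkKey, Chunk> cache_;
  std::list<ChunkKey> lru_;
};

int main() {
  BufferPool cpu("CPU", 1024, nullptr);  // backed by disk
  BufferPool gpu("GPU", 64, &cpu);       // backed by the CPU pool
  gpu.get("taxi.pickup_lon.frag0");      // misses cascade down the chain
  gpu.get("taxi.pickup_lon.frag0");      // now a GPU-resident hit
}
```

The key idea is the chained lazy fetch: a GPU-pool miss falls through to the CPU pool, a CPU-pool miss falls through to storage, and each level evicts by LRU.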
On updates, we probably will do more, but right now that's something we do as a synchronous batch thing, though there's a lot you could do to make it faster as well. Streaming data, actually, is a frequent use case: for maybe 50% of our customers it's either streaming via Kafka or micro-batches every 10 seconds. And the nice thing is that because we don't have to pre-index the data, we see very fast ingest rates. We have an oil and gas customer that loaded around 500 million rows in about 40 or 50 seconds, and that's simply because we just take the data, encode it by column, and write it out; we normally persist to disk, but there's a full in-memory option as well.

[Audience:] A follow-up on that question: since you have this large table, are you basically just tracking which columns are currently being used? Exactly. The system is not terribly smart right now; it's basically an LRU system, whatever was touched last. We could obviously get smarter about predicting query patterns. We also do things like noting that a projection doesn't necessarily have to be in GPU memory: if you're doing a SELECT *, you're not doing any compute on most of those columns, so we'll run the filter on GPU and then join the result to the projected columns on CPU. And we have something called fragments, which are basically partitions, so we can do fragment or partition skipping; there's a technical term for it in the database world... yes, zone maps. We keep metadata on each fragment, so if you have a filter like WHERE the date is in the last month, and you have a year of data that was inserted in time order, or sorted on that column or whatever, we can skip fragments via the metadata. That can increase performance a lot, and you don't pay for memory you don't need.

[Audience:] So if I'm currently zoomed in to San Francisco, then clearly most of my data is irrelevant to the view. Do you take advantage of any of that? Right now, relatively little. It depends on whether you inserted, or explicitly sharded, your data on the geo dimension. If not, you're going to scan at least that filter column. You wouldn't have to access the other columns: if a row fails the filter, we skip it, so you get some memory bandwidth savings by saying, hey, this predicate is false, I don't need to touch the memory of the other columns in the query. But we'd still be holding it all in GPU memory. There are smarter things we'd like to do in the future; you can imagine watching access patterns and rearranging the table, or building an in-memory temporary or materialized view partitioned on the geo or time dimension on the fly. But currently the insert pattern has to be right.

[Audience question, partially inaudible, about whether queries can also run on CPUs.] Yeah, that's a great question, and it actually ties into this: we use LLVM as our compilation engine.
So with some tweaks we can take the same code templates and generate them for x86, or for GPU, or even for things like POWER and ARM. Obviously we do some GPU-specific optimizations, because not everything is the same, but a lot of the code can be reused. Right now, basically, if the data is too big, a query fails over and runs on CPU. What we're working on now, actually for a fairly major retail customer, is this idea that they have years of data and want a certain amount kept in GPU memory and ultra-fast, while the rest can be slower: those are reporting queries, and it's okay if they take 20 or 30 or 40x as long as the GPU queries. So you'd have a slow lane and a fast lane, where the fast lane is GPU, and you could execute on both in parallel.

Placement is done lazily. Initially data comes into CPU, and it moves to GPU depending on the query. When a query wants data on GPU, the executor requests it from the GPU buffer pool; if the GPU buffer pool doesn't have it, it requests it from the CPU buffer pool; if the CPU buffer pool doesn't have it, it requests it from disk or storage. It's chained. And once something is cached, each level can say, oh, I have it, so we can use all the hardware efficiently and execute across whatever hardware you have in parallel. That's another cool thing about GPUs: a lot of databases really suffer under heavy analytic read queries, because you're pegging all your cores and then you're also trying to do ingest, which may involve indexing, which we don't do, but we do things like dictionary encoding that are relatively compute-intensive. Because the GPUs soak up the read queries, the CPUs are relatively free for parsing CSV or Parquet and doing the string dictionary encoding, that kind of stuff. It's really nice because it doesn't bottleneck the system.

So anyway, we compile queries with LLVM. I won't go deep into this; it's not a MapD-specific idea, but I think it's the smart thing to do for an analytic database. It's essentially the difference between an interpreted language like Python and a compiled language like C or C++. There's a lot of overhead in an interpreter-based query engine: you do one operation on the data, say A times B, and you have to materialize the result somewhere (if you're really smart, maybe it stays in cache), then you add three to it and bring it back through memory again. So when we can, we generate one fused function using LLVM and just run that. Sometimes you do have to materialize, for joins and things like that, but generally it's much faster. Even on CPU, it gives us a good speed advantage over interpreter-based engines.

[Audience:] How do you do the codegen for the LLVM stuff? Does C++ code generate the IR directly, or do you have an intermediate language? So once we get our analyzer nodes, and Alex is probably closer to the metal on this than I am lately, the C++ walks the analyzer nodes and generates LLVM IR directly.
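For a flavor of what that looks like, here is a minimal, self-contained sketch using LLVM's C++ IRBuilder API. It is not MapD's codegen: it hand-emits the fused "A times B, plus three" example from above instead of walking real analyzer nodes, and all names are invented:

```cpp
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Module.h"
#include "llvm/IR/Verifier.h"

// Emit one fused row function:
//   int64 row_func(int64 a, int64 b) { return a * b + 3; }
llvm::Function* emitFusedRowFunc(llvm::Module* mod) {
  llvm::LLVMContext& ctx = mod->getContext();
  llvm::IRBuilder<> builder(ctx);

  llvm::Type* i64 = builder.getInt64Ty();
  llvm::FunctionType* fn_ty =
      llvm::FunctionType::get(i64, {i64, i64}, /*isVarArg=*/false);
  llvm::Function* fn = llvm::Function::Create(
      fn_ty, llvm::Function::ExternalLinkage, "row_func", mod);

  builder.SetInsertPoint(llvm::BasicBlock::Create(ctx, "entry", fn));

  auto arg = fn->arg_begin();
  llvm::Value* a = &*arg++;
  llvm::Value* b = &*arg;
  // In a real engine each Create* call is driven by an analyzer node; the
  // point is that a*b and +3 fuse into one pass with no materialization.
  llvm::Value* mul = builder.CreateMul(a, b, "a_times_b");
  llvm::Value* ret = builder.CreateAdd(mul, builder.getInt64(3), "plus_3");
  builder.CreateRet(ret);

  llvm::verifyFunction(*fn);
  return fn;  // ready for a CPU JIT or a GPU (PTX) backend
}
```

The win of the fusion is that the multiply and the add happen in registers within a single pass over the data, with no intermediate result written out in between.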
We've talked about an opcode scheme, because there could be some advantages for caching: we do cache compiled plans, but with opcodes we could be more sophisticated about that, do some optimizations that are otherwise tricky, and get a nicer separation between layers, but...

[Alex:] We also have a lot of pre-written C++ code that we compile ahead of time and inline wherever we need it, like all of our hash join infrastructure and a lot of runtime functions: we just write C++, drop it in, and compile it with NVCC. [Todd:] Yeah, when you have a known algorithm, there's not really a big win in generating the code via LLVM; it's just complexity. So for the things we always do, like hash joins, we just inline the C++, which is cool. The same goes for user-defined functions: they can be written in C++ and inlined into the LLVM-generated code.

Cool. How are we doing? Okay, wow. All right. If anybody has to go: you saw Immerse, it's cool, it's great. Just quickly on rendering: we use something called Vega, out of Jeff Heer's lab at UW. It's a declarative API for specifying how to visualize data, and we use it on top of our SQL to map, say, political party to color, things like that. We generate compressed PNGs rather than sending the raw data back to the browser. And you probably noticed in the ship demo that when I was hovering, it looked like the data was actually in the browser. That's another cool thing we do: we generate a hit map. When we render through OpenGL, we also keep the row ID of the data item that ended up on top of each pixel, so we can do quick hit-testing. You can also do linked-list referencing, so if you want all the items under a pixel, it's a little slower, but you can see all of them as well; that's being worked on, it's on the way.

[Audience:] Do you see resource conflicts between the rendering pipeline and the CUDA pipeline? Yeah, sometimes. Right now, when we render in situ, you might have 8 or 16 or 32 GPUs, you keep the data there, and you render on top. We lock around that, because it contends for resources with the CUDA pipeline. One of the things we're working on is doing the rendering out of band, where separate GPUs are responsible for rendering; particularly with something like NVLink, the fast interconnect between GPUs, you can imagine an asynchronous pipeline. But not right now, so there is some contention sometimes. Usually the query might take 20 milliseconds and the render anywhere from 10 to 30 depending on complexity, so it can be a bottleneck, but the system is fast enough that it's often unnoticeable unless there's a lot of load.

I'm going to skip this GDF stuff. The only thing I'll say is that we're really excited about Apache Arrow, and we've been integrating it into our system as a native way of getting data in and out. Apache Arrow is a common in-memory columnar data format and a set of APIs, driven by Wes McKinney, the Pandas guy, who's also behind Parquet. And I think it's really exciting because of this thing we did called the GPU Open Analytics Initiative.
We founded it last year with H2O, Anaconda, NVIDIA, BlazingDB, who you'll hear from as well, and GunRock. It basically lets all these systems talk to each other in GPU memory using this columnar format, so you don't even have to send the data back to the CPU. It's almost like externalized UDFs: you run a query, filter the data down in MapD, and hand it to XGBoost or H2O, all in situ. It's still early, but we're working on a multi-GPU version, and I think there are a lot of exciting things in store. Okay, great. And now I'll turn it over to Alex.

[Alex:] We need a mic. Oh yeah, I know, exactly. So, like Todd introduced earlier, my name's Alex, I'm a software engineer at MapD; I've been there about a year and a half now. I thought it might be interesting to go through how we added the geospatial features to MapD. About three months ago, at the beginning of the summer, we launched our 4.0 version, which had the geospatial capabilities, and I worked on this project pretty extensively. I might have a slightly ambitious agenda for the time we have, but we'll see. The nice thing about the geospatial work is that it gives you a quick tour through our whole system. I'm going to start in the middle, with our codegen and kernel execution, then talk a little about how we handle query parsing and query optimization, and at the end I'll cover some ongoing work that I was literally working on on the plane last night. Hopefully we'll have time for questions and get you out of here relatively on time.

[Audience:] All of this is in the open source too? Yeah, good point. I have some callouts at the bottom of the slides. We have an open-source repo with our core database, and it tracks our internal repo's master branch with usually only a couple of days, at most a week, of latency. I've tried to put links to the code on GitHub on the slides, and we can distribute the slides after the talk. I'm going to go super fast because we don't have much time, but please stop me with any questions.

So, the geospatial capability. We do a lot of mapping already, rendering on maps, and a lot of our clients have geodata. We did a survey, and of course the predominant system out there is PostGIS, so we basically followed the blueprint they laid out; there's also a standards group behind them that tries to set standards. Right now we support POINT, LINESTRING, POLYGON, and MULTIPOLYGON types. If you're not familiar with this kind of data: it's basically any kind of geometry, possibly with geographic context. What I mean is that for a point you might have an XY Cartesian coordinate, or an XY that's a longitude/latitude, either in Mercator space or in actual degrees as WGS84. And typically these are represented, as on the right side of this chart, as a WKT string, a standard string-literal format.

All right. In our system, geotypes are first-class types, but they're actually type containers; we don't really have a lower-level geotype. I think this is an interesting design decision, and it gets into some of the ways we can really accelerate these queries.
What I mean by that is that, for example, a point is really just a type container with an array underneath it. And we support compression, so we can take a pair of doubles and put them in basically a byte array; that's why the type shown is TINYINT. It's based on the longitude/latitude range, so it only applies if your coordinates are in degrees. For points and linestrings we have these underlying array containers, and that's nice because, after we teach a module of our system about the geotypes, we can reuse all the existing code for these fundamental, what we call physical, types. To give you a concrete example: if you pass a linestring into our codegen, we see that and say, okay, a linestring is actually two arrays, so we can just call the existing codegen functions on the arrays, and we don't have to mess with the fundamental parts of the system.

[Audience:] Can that help with other GIS work as well? Yeah, actually, I was thinking about that. And to be clear, I don't think this is a novel design, but I think it shows off that our system works really well with this design. I was thinking about it when I wrote these slides, because I joined this project after the design phase, when we were grappling with this decision at very low levels of the system, and there are some interesting aspects, especially when it comes to the in-situ rendering.

Todd alluded to this, but we're super, super careful with GPU memory. If you run a query like the first one here, selecting the geo column with no filters, we just pull that data on the CPU and send it right back to the user. We refer to this in the system as a lazy fetch: that geo column's underlying physical columns never hit the GPU. For the second query, where we actually have a filter on the geo column, we are hitting the GPU, so we pull in those physical columns.

Here's an example of the non-lazy fetch. You can see we have mapdql, a command-line utility for interacting with the system, and it'll dump memory info. You can see here that this is a multipolygon, so it's got three underlying arrays, and each array is represented there in GPU memory.

Where this is really nice is for rendering. Imagine you have 11 billion rows of multipolygons and you want to render all the multipolygons in a specific region. Whenever we ingest a polygon, we compute a bounding box and store it as an additional physical column. The reason we do that is so we can pull just that bounds column to GPU memory, do the filter on it, and then go pull the appropriate fragments that survive the filter as a second phase, potentially a non-in-situ phase, depending on the other parts of the query. So in this example we're projecting out the geo column, but we're only pulling the bounds column to GPU: if you follow the number in red, you'll see the column index is actually 8, whereas on the previous slide it was 5, 6, and 7. It's a different column as far as the system is concerned.
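A minimal CPU-side sketch of that two-phase bounds trick, for illustration only; the struct layout and function names here are invented, not MapD's:

```cpp
#include <vector>

struct Bounds { double xmin, ymin, xmax, ymax; };  // stored as its own column

// Phase 1: cheap scan over the small bounds column (32 bytes per row).
std::vector<int> candidateRows(const std::vector<Bounds>& bounds,
                               double px, double py) {
  std::vector<int> hits;
  for (int row = 0; row < static_cast<int>(bounds.size()); ++row) {
    const Bounds& b = bounds[row];
    if (px >= b.xmin && px <= b.xmax && py >= b.ymin && py <= b.ymax) {
      hits.push_back(row);  // possible containment; needs the exact test
    }
  }
  return hits;
}

// Phase 2: exact point-in-polygon (ray casting), run only on candidates,
// which is the only time the big coords arrays have to be fetched.
bool pointInRing(const std::vector<double>& xs, const std::vector<double>& ys,
                 double px, double py) {
  bool inside = false;
  for (size_t i = 0, j = xs.size() - 1; i < xs.size(); j = i++) {
    if ((ys[i] > py) != (ys[j] > py) &&
        px < (xs[j] - xs[i]) * (py - ys[i]) / (ys[j] - ys[i]) + xs[i]) {
      inside = !inside;  // each crossing toggles inside/outside
    }
  }
  return inside;
}
```

Phase one touches only the bounds column, which is what makes it cheap to scan on GPU; the full coords arrays are fetched only for rows whose bounding box survives, and pointInRing is the standard ray-casting test.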
So this is basically a summary. Where we'd like to take this is 3D and time series and that kind of thing. With 3D we're probably not going to have an additional column for Z, but for time series it might be interesting to have an additional column, because you might want to filter on time first and not have to bring all the coords to the GPU. We have a couple of clients with something like 500,000 coords in a single polygon, or a single multipolygon, which can actually be multiple polygons, but it's still pretty big.

This is sort of a repeat of what I said before, but we also leverage the existing support for things like data load and null checking, so we don't have to handle geo nulls separately from arrays. In a lot of places in our system, if you go look through the code, you'll see we do an is-geometry check and then recursively call the same function with synthesized arguments based on what columns make up that geometry.

Okay, so I want to talk a little bit about how we handle geotypes at the parser and translator level. Let me just skip to here, okay: our parser is Apache Calcite, and we actually use it as both a parser and an optimizer. We send basically all of our queries into Calcite; I think DDL is the only thing that doesn't go through Calcite right now. We get back a serialized relational algebra tree, which we deserialize and run through our interpreter and translator, and we end up with the analyzer nodes that tell our system what to work on.

Let's follow the example of ST_Contains, a common PostGIS operator and the one most of our clients want to use. When it goes through Calcite: Calcite actually added geospatial capability after we added ours, so we're not using any of theirs, though we've started looking into whether that would be feasible. Calcite just sees ST_Contains as an extension function, and it sees the poly and point arguments as being of type ANY, so it doesn't know any type information; it just makes sure ST_Contains has two arguments, knows it returns a BOOLEAN, and drops it into the relational algebra tree.

When we hit ST_Contains in our interpreter, we start to replace it with geo-specific information. The interpreter is really just going to say, okay, this point argument is referencing a certain column in the catalog that's a point column, and this polygon argument is referencing a polygon column. It's not until the translator, where we unify all this information and convert from relational algebra into our work units, that we actually convert ST_Contains from that Calcite-level point/polygon function into something that basically mimics the C++ code. At the bottom of the screen you'll see ST_Contains after we've gone through and translated it. This replaces the type containers with their actual physical types: for the polys you have the pointer and the size, plus a bunch of other arrays, and for the point you again have pointer and size, then compression information and some other metadata. So the interpreter and translator make this
actually really easy for us. It's a part of the system I only really got to know recently, but you can do some interesting stuff with it: we've been modifying our binary operator node to add an overlaps node, which we can compose out of a greater-than, a less-than, and an AND operation. The other thing we do in the interpreter that's worth mentioning is that we take certain operations, like a filter and a group-by, and coalesce them into a single relation, which ends up being a single work unit for us. The reason this matters is that we don't want to make two passes over the data if we can apply the filter and do the grouping at the same time; it's much better in our system.

Let me go back to actually using the geospatial types. As Todd mentioned, we have an extension functions framework; I wouldn't call it a UDF framework yet, we're working on getting it there. An SDF, yeah: software-defined functions. Basically, you can write C++ code that operates on these physical columns. This ST_NPoints is a very simple function that just tells you how many points are in your coords array, but you can see the definition matches that pointer-and-size layout, plus a compression parameter. You write these, they get compiled both with whatever compiler you're using for your C++ code and with NVCC, and then they get inlined into the kernel. We talk about user-defined functions and maybe supporting Julia and all these other languages, but it's already really nice that you can sit down and write C++ in these files, and Calcite will pick up a simple one for you: if you have a function that takes a double and outputs a double, or something like that, Calcite handles all the translation, you just drop the function in.

Am I doing okay on time? Yeah, I am. So the last thing, and I know that was super fast, is some ongoing work. When we launched our initial geo framework, ST_Contains and ST_Distance were basically the only functions we had that were useful for analytics, because it's a lot of work to get all the other parts of the system right. Now that it's been out in the wild and we've gotten feedback from customers, we're looking at doing more interesting stuff. ST_Contains, and I always have to check the PostGIS website to get the argument order right, is asking: is the point in the polygon, in this example. I have some of the schema here, how you'd write the query, and a diagram showing three polygons and three points. We just want to do what's basically a join: which points fall in which polygons? And you wouldn't believe how many companies, from retail to connected-car telematics to oil and gas, want this capability. Yeah, census blocks, zip codes.

Right now we're using a loop join, which is pretty rough, though since we have all these CUDA cores we get decent performance. A lot of clients have benchmarked us; one client benchmarked us against Spark and PostGIS, and we were able to finish some queries those systems couldn't. The flip side of that benchmark, when those systems were
indexed, is that, [Audience:] you're doing an n-squared loop join... Exactly. The flip side is that this takes hours, not milliseconds, which is our sweet spot and what our clients expect. So we're looking at adding ST_Contains to our hash join infrastructure. Right now we have an open-addressing scheme using a MurmurHash for our hash join: we have one-to-one and one-to-many variants of what we call the baseline hash, which is the open-addressing scheme, and then a perfect hash for the trivial cases; our dictionary-encoded strings, for example, are typically easy to perfect-hash because you get IDs from, like, zero to a thousand, pretty straightforward.

The idea here is to take the inner table, the polygon table, chop the space up into evenly sized grid cells, and give each grid cell basically an ID in the hash table. In the probe phase, for each point we compute which grid cell the point is in just by taking the floor, treating the cell by its lower-left corner, and hash that as well. So we're treating this as a generalized hash join, and the reason we can is that our hash infrastructure supports coalesced keys: we can have up to eight coalesced key components in a single hash. This is ongoing work we're still scoping out, but we're hoping to really accelerate ST_Contains performance and then extend this to a 3D quadtree or n-dimensional data with POINT Z/M. There's a rough sketch of the grid-hash idea at the end of these notes.

[Audience:] Would you use the GPU to do this? So, we actually do build hash tables on GPU, and the probe takes place on the GPU. [Audience:] You build the hash table on the GPU, ship it back to the CPU, and then do the probing up there? No, we probe on the GPU. We will conditionally build the hash table on GPU; I'd have to look at that code again to give you the exact conditions, but it's all in the open-source repo, which admittedly could use some more organization.

[Audience comment about sizing the hash table so no bins get too hot.] [Todd:] The very cool thing about a GPU here: a lot of people keep statistics in their database to know the cardinality estimates, what parameters should I use for my hash bins. We can simply table-scan 10 billion records, so we have very precise histograms and cardinality estimates to use for the hash bucket parameters. That's one advantage of having all that compute.

[Alex:] Yeah, so we're hoping this work lands shortly, and that we can improve on it for other types. The other thing is that the fundamental idea here is an overlaps operation: do two ranges overlap? You can imagine extending ST_Contains to "is this linestring in a polygon," where you'd have multiple bins to probe, but the overarching idea is overlaps, which we're seeing from time series customers, geo customers, all kinds of stuff like that.

So that's pretty much all I have. Happy to take questions on anything I covered, or general system questions. I don't know, did you have any more, Todd? I'll take any questions. Let's thank the speakers. [Applause] All right, two questions. Can MapD take advantage of the
smaller int types that are available on the Tensor Cores, or do those just not apply? So, we have int8; we have smaller types right now, but we end up having to align things. I was going to say we don't take the GPU model into account, just the compute capability, but that's not entirely true: we have some features that we enable, disable, or change based on whether you're on Maxwell or Kepler or whatnot. Right now, given our alignments, we just choose to align everything to eight bytes or four bytes or whatever is most appropriate for the query. We definitely have one-byte, two-byte, four-byte, and eight-byte types, and we don't do any sub-byte stuff; we may in the future. But I don't think we explicitly take advantage of features like the Tensor Cores: you have to be doing a particular type of matrix multiplication, under the moon on the third week of the month, and we don't do that yet. I'm not saying we couldn't in the future. We don't pad anything in storage, but in the output buffers, depending on what you're doing, we might pad, and I don't know that we'd take advantage of the int8 instructions yet.

We do do stuff with shared memory. After Kepler, which is now pretty old, NVIDIA did a lot of good work on shared memory, so we do all this stuff around group-by: if the group-by fits in shared memory, 64 kilobytes per SM, so around 2,000 groups, we can use shared memory, which is almost as fast as L1 cache, and those queries basically fly, which is pretty awesome. Bigger group-bys have to go to global memory.

All right, let's thank the speakers again. [Applause]
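For reference, here is a minimal CPU-side sketch of the grid-binned hash join idea described in the ongoing-work section. It's illustrative only: the cell sizing, the hash-combine constants, and all names are assumptions, and the real design would run the build and probe phases with MapD's coalesced-key baseline hash on the GPU:

```cpp
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <unordered_map>
#include <vector>

struct Bounds { double xmin, ymin, xmax, ymax; };

class GridHashJoin {
 public:
  explicit GridHashJoin(double cell_size) : cell_(cell_size) {}

  // Build phase: a polygon id lands in every cell its bounding box touches.
  void build(const std::vector<Bounds>& polys) {
    for (int id = 0; id < static_cast<int>(polys.size()); ++id) {
      const Bounds& b = polys[id];
      for (int64_t cx = cellOf(b.xmin); cx <= cellOf(b.xmax); ++cx) {
        for (int64_t cy = cellOf(b.ymin); cy <= cellOf(b.ymax); ++cy) {
          table_[key(cx, cy)].push_back(id);
        }
      }
    }
  }

  // Probe phase: one floor() plus one hash lookup per point, instead of an
  // n-squared loop join. Survivors still need the exact ST_Contains test.
  const std::vector<int>* probe(double px, double py) const {
    auto it = table_.find(key(cellOf(px), cellOf(py)));
    return it == table_.end() ? nullptr : &it->second;
  }

 private:
  int64_t cellOf(double v) const {
    return static_cast<int64_t>(std::floor(v / cell_));  // lower-left corner
  }
  static int64_t key(int64_t cx, int64_t cy) {
    return cx * 73856093 ^ cy * 19349663;  // classic spatial-hash combine
  }
  double cell_;
  std::unordered_map<int64_t, std::vector<int>> table_;
};

int main() {
  GridHashJoin join(/*cell_size=*/1.0);
  join.build({{0, 0, 2.5, 2.5}, {10, 10, 11, 11}});
  const auto* candidates = join.probe(1.3, 0.7);  // hits the first polygon
  std::printf("%zu candidate polygon(s)\n",
              candidates ? candidates->size() : 0);
}
```

The grid replaces the n-squared loop with one lookup per point; each candidate cell's polygons still have to pass the exact point-in-polygon test against the full coords arrays, as in the two-phase bounds sketch earlier.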