So again, this year, the theme is hardware-accelerated databases. So, databases that are designed to run not just on the CPU, but on GPUs, FPGAs, and ASICs. So real quick, before we introduce our speakers today, I want to thank the people who help make this seminar series possible. First of all, I want to thank Yahoo Labs for sponsoring us for our fourth year. I want to thank Karen for organizing the room and wrangling my squad, keeping everything in order. I want to thank JL from the Seattle streets, NUSLesh, [inaudible]. I want to thank KB and Lil T for always helping me keep one in the chamber. [inaudible] All right, so with that, we're super happy to have the Kinetica speakers today. So we have two speakers. We have Nima and Eli. Nima is the co-founder and CTO of Kinetica. He did his undergrad at the University of Maryland. It turns out we grew up five minutes from each other back in Maryland, on the mean streets of Ellicott City. We also have Eli Glaser here as well. He is the VP of engineering at Kinetica, and he is an alumnus of Johns Hopkins. With that, thank you. Yeah, thanks, everyone. And Andy, thanks so much for inviting us. It's a really cool series, and it's fun to be able to kick this year off. So as Andy mentioned, we're Kinetica. And one of the ways I like to describe Kinetica and the problem it's built to solve is to go through the origin story. So way back in 2010, we were a small government contractor on this unique program that was called the Brain Program. Later it was called Red Disc.
And basically, the remit of that program was: you need to be able to consume over 250 different real-time data feeds and give a common compute capability and analytic capability to a group of developers and analysts to quickly produce applications and analytics that you can deploy quickly into theater. So at that time, NoSQL was all the rage, the panacea for everything. So the pattern was: let's set up a huge Storm topology with a lot of boxes and executors. Let's set up an even bigger Cassandra or HBase cluster. And we're going to have a team of 30 to 40 engineers really spend a lot of time creating a very advanced indexing methodology. Some of those guys were database developers. Some of those guys weren't, but they had a data structures background. And they essentially started this process of: let's gather the query requirements. Let's do the sizing effort. And we're confident that we're going to be able to meet the SLAs. We're confident that the query inventory is going to be fixed. Invariably, what happened is analysts wanted to chop up the data in a number of different ways. And so that query inventory grew. That caused the hardware fan-out to explode. That caused the delta indexing time to creep higher and higher and higher, to a point where the indexing would never catch up to the data stream. And every time that would happen, after two or three times, they got to a point of saying, hey, what do we do here? We've gone through this cycle three times now with different underlying technologies and slight modifications to our indexing approach. And we were there, like I said, in a geospatial context. And I had a background in GPUs. And at that time, GPUs were really coming of age. It was the Fermi chipset. So it was really at its maturation inflection point. So I said, look at this device. It's insanely powerful from a compute perspective.
Instead of making a database that's trying to maintain these advanced data structures so that when it's query time, we minimize how much compute we use, let's make a database that uses compute as an abundant resource where we're going to try to feed this device as fast as possible across many nodes. And we're going to try to orchestrate it and synchronize its results quickly and easily for the developer. So one of the main goals was being able to scale linearly. One of the main goals was allowing the developer to not have to learn new data modeling and not have to change their data model when they're trying to achieve a new level of scale. So no denormalizing, no five-way hash maps, the classic relational developer interaction model, but with the new school level of scale and performance. So with that goal in mind, we started building out what was actually then called Gaia, where we really were focusing on just certain sets of OLAP operators, geospatial and temporal query capability. We were able to be that engine for that program and take that problem away from the rest of the program, which was focused on NLP. So as part of that program, there was a huge amount of location data streaming in. And location data has a lot of unique challenges to it. On the filter side, it's a complex filter. So it's not a simple range query. It's not a simple predicate lookup. It can be thousands of points that you need to check against to see whether or not certain points are in or whether two shapes cross. And then on the visualization side, it's got a lot of unique challenges, because it can't be easily summarized. So it's not like a bar chart or something where you can compress it down to a small data structure and feed it to a client. You have to be able to usually send some large structure to some third-party application, whether it's your browser or your Geo server in the middle to do your actual final rendering. 
And even today, if you take a strong laptop and go use something like Uber's Kepler and try to load up 500,000 shapes, you're going to hear your laptop fans start to spin up and feel your browser start to drag, because it's still not a fully solved problem. So seeing that and knowing how important location was for what we were doing back then, we really focused on the geospatial processing side, but also on the visualization side. So one of the main innovations that came out of that work was: let's make a combined data processing and rendering pipeline, where a visualization request would be rendered in situ in our process space and then flattened and then given out via a web service, in this case WMS. And that, with our heat map capability, which was also a very advanced visualization that used that same pipeline, really set us apart. And through that effort, we won an award from IDC, from their HPC group. And just by sheer luck, someone at the Postal Service saw from a press release that we won this award. And the Postal Service had just spent hundreds of millions of dollars putting a breadcrumb-emitting device on every postal truck and giving one to every carrier. They had a big legal fight because the carriers didn't want to be tracked. They went through all this effort and they created this really high-volume data stream. And at the end of the day, they realized: we can't query this. We don't have the infrastructure to be able to derive value from this thing we've already invested hundreds of millions of dollars into. So like I said, just by coincidence, they see this press release. They said, come in. It looks like you're trying to tackle something that's near and dear to our hearts. We learned about the problem. They said, let's see a prototype in a couple months. We were able to really deliver on that. And then nine months later, we were able to get into production. And that was our first major commercial license sale.
And from that point, we pivoted as a company, from a primarily services-focused, government-focused commercial entity to a product-oriented startup, where we were going to focus all of our capital resources on the development of the product. So that was around 2014. My co-founder, Amit, went to San Francisco to try to raise capital, while a team of myself, Eli, and about two other engineers continued to develop the product. Again, kind of by luck, we met a friend of a friend who said, hey, you guys have got to go show this to Ray Lane. He's been in the database industry for 20 years. He's starting a fund. He's going to get this. And we went that day and presented to him. And he believed and he got it. And he said, look, this makes sense to me. Let's do a seed round. And from that point, we took it from there, where we did a subsequent round and finally this last A round. So now we're a team of about 35 people. And we're constantly focused on developing Kinetica and pushing the envelope around this major challenge, which we're going to define here as the extreme data challenge. So if you think about the characteristics that define this problem, it's all around the data: the variety, the volume, the cardinality, the velocity. So it's not batch oriented. It's constantly streaming. And then the workload: the complexity of analysis. So not a simple lookup, not simple aggregates, not simple filters, but complicated filters, very advanced aggregates with billions of groups, location analytics, ML inferencing, and being able to feed that all into one another so that you get this multiplier effect. This is where we are now. And this is the problem that enterprises are trying to tackle now. Because they've invested so much in their data feeds. And they're saying, how do we get maximum ROI from our data? Our data has all this intelligence. Why are we doing a batch report every night?
Why is our entire hedge fund being informed about their risk profile only in the morning? It should be real time. It should be alerting them. They should be able to chop it up. They should be able to visualize it. There should be constant ML inferencing going on against it to do different alerts or scores about behavioral things they might be seeing from those OLAP workloads. So where we sit today is, I think, the precipice of a lot of enterprises saying, yes, we're going to tackle this. Because where we were 30 years ago was the operational space. Let's do our transactions. Let's be able to do some basic triggering. Let's make sure that we can retroactively verify things are operating correctly from a clerical perspective. About 10 years ago, maybe even 5 years ago, we were at the height of the big data era, with Hadoop leading the charge. Let's make a huge data lake. Let's have this ecosystem of components: Spark, Presto, Hive. And let's just give it all as one big platform to developers. And let's see what they can patch together and what they can explore. And inevitably, that did move the ball further. And it did allow for a bigger appetite for the complexity of the workload and for the amount of data being addressed. But where it fell down was the TCO around maintaining these solutions and the data lifecycle of the reporting, in that mostly it was really based on batch workloads. Because Hadoop fundamentally was made as a batch-style job system. So from that point, which is kind of now in its twilight, we're just starting to see enterprises understand that, look, we need to be able to take these feeds. And we need to be able to derive data and have it be real time so that there's no lag. We're able to do all the different variations of the work. And we're able to feed one insight into the other. And that's what we're trying to solve here at Kinetica. So I know I went kind of long on that.
But I'm just going to also touch on the anti-patterns we've seen in trying to solve this problem. Going from enterprise to enterprise, these are the three or four things that we see all the time. So this is a big one. You see this where a big retailer wants to do real-time inventory. And they have billions of transactions coming in. They want to be able to maintain a really complicated aggregate, billions of groups, not 100,000 groups, but a huge aggregate table. And they want to maintain possibly multiple versions of those. And then they want to have a microservices layer that's doing very fast key lookups against it so that they can inform up their stack, whether it's a report or an app or whatever it might be. So what you see in this approach is: OK, we've got a Cassandra. We've got an HBase. It's got this incredible mutation and read scale. Just like what we saw in 2010, let's spin up a streaming executor pipeline that's going to consume data off, let's say, a Kafka topic and maintain this advanced indexing system that we're going to either invent ourselves or use common patterns or use libraries that are out there. And just like we saw in 2010, you see the same problems. The ability to keep up with the rate of mutation is usually the first thing to go. But right after that, the desire to change anything about that aggregate, or add another aggregate, is extremely expensive from a TCO perspective, because basically you've taken an engineering group that is really supposed to be focusing on analytics and you've made them ad hoc database developers. And so you're asking a potentially young, or let's say not mature, data processing solution, which is what they built, to add on this whole new capability. And from a developer standpoint it's really just not ideal. And so this is still one of the most common things we see today.
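As a minimal sketch of that hand-rolled pattern, here is what the streaming-executor side of it boils down to: consume events (say, off a Kafka topic) and maintain the aggregate table yourself. The event fields and names here are made up for illustration, not from any particular customer system; the point is that every new rollup the analysts ask for means another hand-written pass like this one, which is where the TCO blows up.

```python
from collections import defaultdict

# Hand-rolled "real-time inventory" aggregate: one stream executor's
# share of the work, maintaining a (store, sku) -> quantity table.
# Field names are hypothetical.
def maintain_inventory(events):
    agg = defaultdict(int)
    for e in events:
        agg[(e["store"], e["sku"])] += e["qty"]
    return agg

events = [
    {"store": "A", "sku": 1, "qty": 5},
    {"store": "A", "sku": 1, "qty": -2},
    {"store": "B", "sku": 7, "qty": 3},
]
inventory = maintain_inventory(events)
```

Adding a second aggregate (say, by region instead of store) means another pipeline like this, with its own failure modes, which is exactly the trap being described.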
We just saw it a month ago with a 200-node Cassandra cluster with a retailer trying to do real-time inventory. Question? [Question: just so I have a little more context, can you give us absolute numbers for the scale you're talking about? Does the data fit in the memory of the cluster?] So with our current profile, our traditional-size cluster is in the 10 to 30 node range. It can scale higher, and we do have one or two customers that are higher than that. [Is that the aggregate memory across the cluster?] Yes. So your addressable query size is the memory of the cluster. We're in the 100 millisecond to three second window. With our product that's coming at the end of Q4, we're getting into the data exploration, data-lake-style workloads with our tiered storage capability. But up until this point, that's the space that we play in. And that space really makes sense if you've got streaming data coming in, if you need to be able to maintain that in an easy fashion and be able to instrument that up your stack, where you know that everything's up to date and this advanced aggregate's just being maintained for you. And you've got the hardware and the budget to back it up. And so it's an emerging problem. So right now it's the most aggressive enterprises, the ones with the biggest data streams or the biggest desire to do real-time reporting or real-time analytics or a real-time pipeline that can inform a number of different business units. But I think it's going to be something where you've got your first movers, and then we're going to see a lot of the stragglers come on board and realize the power of the data that they have and what they're just leaving on the table. So the second one that we see quite a bit is around the traditional data warehouse. I'm not going to mention any particulars, but you've got your traditional warehouse.
It was built for a really different set of problem scenarios, where it was more about being able to do OLTP and complex transactional workloads at scale. Inevitably, what they find is the data structures are not optimal for leveraging the compute capability of vectorized processors like the latest Skylakes or the GPU. And on top of all that, when you bring in mutations, having to maintain those complex data structures really makes everything fall apart. This is just not what these systems were built for. They were built for write integrity. They were built for being able to do really coordinated, complex transactions. And they do those things incredibly well. That's not what Kinetica is built for. We're not trying to target that, right? But you do see this a lot. And that's partially because of the initial investment, and because that's what they know. But you see it fall down over and over again when it's trying to tackle this problem. So from there, there's the newer generation of MPP data warehouse, where they're fully focused on it. They're in-memory column stores. They're able to at least partially bring in a higher rate of mutation. But what we've seen is they haven't really committed to vectorized processing. So when you need to do something like a really advanced, painful aggregate that's going to generate a million groups or more, the actual underlying logic in the kernels isn't fully vectorized. So basically, you don't have the fundamental compute kernels to leverage the hardware that's out there today, whether it be Skylake, AVX-512, or GPUs, right? As recently as six months ago, one of the leading cutting-edge distributed MPP databases, which I think is a great product, came out with their next-generation release, their commitment to AVX-512 and being able to leverage this new hardware.
But if you looked underneath, in the fine print, you could see it's actually limited to a small set of kernels and processing capability. And the reason for that is it takes a lot of work to create complex group-by kernels and complex join engines that can do things in a fully vectorized manner. It takes a certain amount of discipline, because there's always the easy way out. So that's another major thing that has taken us quite a while to get to, but it has been well worth it, in the sense that our kernels are a differentiator, whether it's on Intel leveraging the AVX-512 instruction set or leveraging GPUs. And this other common one is really around the location analytics stack, right? So we're going over the new workload characteristics that we're seeing, the new data types that we're seeing. And there's a tremendous amount of value in location data. I mean, if you think about it, it's a way to implicitly define relationships between people, between products, between intentions. It's really powerful. The government actually has a little bit of a more forward-looking view on deriving value from location that I think enterprises are now starting to get. You're seeing ad tech really embrace location, because there's just a tremendous amount you can learn about someone you're going to serve an ad to by doing location analytics, right? But with that, location analytics has a whole host of its own very complicated problems. And they're really on the filter side, right? So when you have to do a complex feature analytic, where maybe you want to join one set against another based on whether any of the shapes are within some buffer zone of each other, that's not something that you can build an index for easily without having misses or having hotspotting effects take place.
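The brute-force alternative to that spatial index is easy to picture: test every point against the shape, with no index at all. Here is a toy stand-in (Euclidean distance against a circular buffer; real geospatial work would use geodesic distance and arbitrary polygons). Each distance test is independent of the others, which is exactly why the whole scan maps onto one data-parallel GPU kernel launch.

```python
import math

# Brute-force geo filter: no spatial index, just scan every point
# and test it against the buffer.  On a GPU, every iteration of this
# comprehension would be one thread of a single vectorized kernel.
def within_buffer(points, center, radius):
    cx, cy = center
    return [p for p in points if math.hypot(p[0] - cx, p[1] - cy) <= radius]

pts = [(0.0, 0.0), (3.0, 4.0), (10.0, 10.0)]
hits = within_buffer(pts, (0.0, 0.0), 5.0)
```

The trade being described in the talk is that this scan has a perfectly predictable cost (no misses, no hotspots), so long as you have the compute to feed it.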
So what we spend a lot of time on is building those kernels out, building that capability out to do that kind of brute-force computation. Now, we do leverage indexes, and Eli is going to jump more into that. We generate the ancillary data structures that we feel are most complementary to solving the question being asked. But what we really try to focus on is having both worlds, right? We want to have the indexes, they'll help us to a point, and then when we have to switch over and feed that brute-force table scan problem, we have the vectorized kernels and the compute hardware available to do it. Geospatial also has that visualization problem that I mentioned, and we've spent a tremendous amount of time with our EGL pipeline being able to do complex feature rendering. It sounds easier than it is: why not just bring in vector tiles and render in the browser? But again, if you're looking out at a nation and you have, let's say, a road network, bringing all that feature data to a browser or to another solution to do the final rendering is going to crush that engine. So you have to have something where you don't have to pay the serialization penalty over again, where you have the compute, and you have the complexity and depth of capability to do the rendering: symbologies, all the different shape types. And that's something else for where things are going as far as deriving value out of data. I mean, if you think about autonomous driving and all the major automakers trying to get into that field, forget LIDAR, just 2D data analytics and visualization is a tremendous part of the pipeline that they're all trying to figure out how to conduct. So yes, we were called GPUdb. Our production systems ship on NVIDIA GPUs.
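To make the earlier point about fully vectorized group-by kernels concrete, here is a group-by aggregate phrased entirely as data-parallel primitives (sort, gather, boundary flags, segmented reduce) instead of a hash table. This is an illustrative sketch in plain Python, not Kinetica's implementation; each step below would be one vectorized kernel on the GPU, and the Python loops merely stand in for those kernels.

```python
# Group-by sum as a chain of parallel primitives rather than a hash map.
def groupby_sum(keys, vals):
    order = sorted(range(len(keys)), key=lambda i: keys[i])      # 1. sort by key
    k = [keys[i] for i in order]                                 # 2. gather keys
    v = [vals[i] for i in order]                                 # 3. gather values
    heads = [i == 0 or k[i] != k[i - 1] for i in range(len(k))]  # 4. flag segment heads
    groups, sums = [], []
    for i in range(len(k)):                                      # 5. segmented reduce
        if heads[i]:
            groups.append(k[i])
            sums.append(0)
        sums[-1] += v[i]
    return dict(zip(groups, sums))

result = groupby_sum(["b", "a", "b"], [1, 2, 3])
```

Every step touches all elements uniformly with no data-dependent branching across threads, which is what makes the formulation amenable to AVX-512 or SIMT hardware even when the aggregate produces millions of groups.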
We do have an Intel flavor, but all of our production deployments leverage the GPU, and the GPU was the device that inspired all of this. It's not the perfect device. It's not going to solve all your problems, right? But it does have some unique things that are really powerful. It's got thousands and thousands of very, I call them, dumb cores that are then managed by the SM cores. If you have a V100, it's much more capable; it's got a much higher number of SM cores than you had with a K80. And the fact that you can fit that in a small footprint is really a big wow factor, when you can bring in a 4U mini pod and take out a rack of Teradata, or a rack of whatever solution, right? For on-prem deployments, there's a lot of value there, not just from the performance aspect, but for your total TCO. The way we have brought our product through time is really from a heritage of doing table scans. It was initially a really great table scan engine, and that's what the GPU is great at, right? That brute-force, complex aggregate, complex filter. What we've done since then is build around that basic idea a whole lot of instrumentation, a lot of data orchestration that allows us to do it more intelligently, because obviously just doing that all the time has its limitations. On the downside of the GPU for analytics: for ML, it's great, right? Because most ML workloads are not that data intensive; they're more compute intensive. Think about Monte Carlo simulations. You bring a small amount of data over the PCI bus, and then you do all of this compute straight on the GPU. With analytics workloads, it's very different. You have to bring over a lot of data to drive the actual compute, right?
And so that PCI bus becomes a tremendous bottleneck, because if I have to send a terabyte of data to a two-node cluster of GPUs, and my PCI bus can only do 11 gigabytes a second, you can do the math right there: you're not gonna go any faster than that. So there's been movement there on alternative platforms, like with POWER, where they have their NVLink bus, and Eli's gonna touch on that. But even still, it is the number one bottleneck that we face for analytic workloads, because for most operations on the compute side, it's not really much of an effort for the GPU to blow through the compute; it's really just getting the data there, and then being able to orchestrate that. And on the topic of orchestration, the GPU as a compute device, compared to the CPU, is pretty primitive, in that you have basically no scheduling instrumentation, right? There's a heuristic: basically, based on how you size your kernel and what GPU you're using, the SM processors are gonna schedule it how they best see fit. So it's a combination of the NVIDIA driver and that particular card, whatever version it might be, K80 or V100, right? So if you compare that to what you can do with a CPU, I mean, that's crazy. No one would expect one of their primary compute devices to have so little scheduling instrumentation. And on the resource management front, again, it's very primitive, it's very early days. You can't do fractional allocations of a GPU right now. If any of you play with Kubernetes or Mesos, you can already see that the GPU is still nascent there, and you can't do fractional deployments of a GPU in your pod. You have to give entire GPUs, because you're doing a PCI passthrough, and there's just no instrumentation from the NVIDIA driver up into the OS that gives developers a way to instrument it in the ways that you can with a CPU, right?
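The "do the math" on the bus bottleneck can be written out explicitly. Using the numbers from the talk (1 TB of column data, two nodes pulling over independent PCIe Gen 3 links at roughly 11 GB/s each), the transfer time alone sets a floor on query latency, before a single kernel runs:

```python
# Lower bound on query time when the PCIe bus is the bottleneck:
# the GPUs cannot start computing until the columns have crossed the bus.
data_tb = 1.0        # data that must reach the GPUs, in terabytes
nodes = 2            # nodes, each with its own PCIe link
pcie_gb_s = 11.0     # ~achievable PCIe Gen 3 x16 bandwidth per node

seconds = (data_tb * 1024) / (nodes * pcie_gb_s)
# roughly 46-47 seconds of pure data movement, no compute included
```

This is why the later discussion focuses on moving less data (column pruning, VRAM caching, compression over the bus) rather than on making the kernels faster.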
I mean, a CPU has whatever, 50 years of scheduling and management instrumentation baked in, where the GPU is still just starting to get there, right? They're just starting to tackle these problems, and that's only if you can afford the latest cards. So it's not a perfect device, but it's extremely powerful. It gives you a lot of processing capability in a small footprint. And you can work around some of the barriers, like the PCI bus, which we spend a lot of time on, figuring out different methods of how to either send less data, with something like a cache eviction handshake where you keep stuff in VRAM, or compress stuff and send it compressed over the bus. These are workarounds, but sometimes the workarounds don't work, and you're gonna have to pay that penalty. All right, any questions so far? [Question about the data layout and whether data stays resident on the GPU.] Yeah, so we actually have a shared-nothing-style data layout, and Eli's gonna dive deep into that and how we organize it within each node. I don't wanna steal your thunder early. But just to answer your question, we try to keep the data on the GPU, but most of the time, with the customers and the amount of data we're dealing with, there's not enough GPU memory to satisfy a query. Your typical GPU has 12 gigs of VRAM, right? And your system memory obviously has an order of magnitude more capacity, and the kinds of workloads that we're trying to solve for just can't fit in VRAM, right? There are some solutions that try to focus on just fitting it all in VRAM, which is fine and good, but for the kinds of problems that we're tackling, it's not realistic from a financial standpoint. So, Nima gave an overview of Kinetica as a company, where we came from, and the general outline of the product and the problems. [Question: which bus is the main bottleneck?]
So the most prevalent one is the PCIe Gen 3 bus, and that is the most painful one. Compared to that, NVLink is much faster. It's like, what, 90 now? Yeah, we measure about 60 gigabytes a second. And I think NVLink 2 is another big step up, but you don't really see POWER9 with NVLink 2 out in the wild right now. It's also rare to see POWER at all; it's like 10% of deployments, right? The overwhelming majority is x86, and Intel is not moving on PCIe right now. Okay. [Question: to avoid the transfer bottleneck, have you looked into integrated GPUs, where you would not need to transfer the data?] I should repeat the question so it gets picked up. So you're asking about integrated GPUs. The problem there is that, yes, they're integrated and they look like they're integrated from that perspective, but those systems are very limited in memory at that point. And there's no free lunch, so it's still transferring data; it's just hidden from the developer. So I'm gonna go into some of the guiding principles behind the design of the Kinetica database, and then we'll dive a little deeper into the architecture as well. So the main principle that we have: we're all about performance here, and what we wanna do is use all the available computing resources. So we wanna focus on performance, and we definitely wanna focus on scalability, scale-out, and concurrency. We've got large workloads, but we also have lots of concurrent workloads. So what we wanna do is use all the computing resources that we have. We have CPUs and we have GPUs available. So we don't wanna run everything on GPUs or everything on CPUs; we wanna use the best mix for the job.
And we're trying to walk that fine line on the data structure side, where it's beneficial enough to make a difference, right, but not so punitive when a mutation comes in that we fall into that common pitfall where, in this new problem space where data is constantly coming in, just maintaining the data structures is gonna bring us to a halt. Right, so CPUs are great for general-purpose computing tasks: map lookups, indexes, anything that's branch heavy, complex data structures. GPUs, on the other hand, like Nima said, have kind of dumb cores. You might have 5,000 cores, but they're very simple. And it's SIMT: single instruction, multiple thread. So all those cores are basically working on the same problem, on a different piece of data, at the same time. So it's really great for things like sorting data, scanning data, just crunching through data, with very simple data structures. GPUs work on vectors, on matrices: very simple, nothing complex. There's no hash map that works on a GPU, or at least not directly. And any kind of path-dependent algorithm falls apart on the GPU, right? You need brute-force, no-path, no-branching algorithms that you can launch via vectorized kernels. So, going along with this, we're a distributed system. We're across multiple machines, an entire cluster. So we really need to encourage data locality. We wanna minimize data movement across the network. So we have a lot of ways that users can tune the way their tables are distributed across the database. We call that sharding. You can shard your data, and you can replicate small tables. We wanna basically not have to move data between nodes, because that's very, very expensive. So we wanna keep joins local whenever possible and basically minimize data movement across the network. Yeah, from a developer standpoint, it's a sharding system like any other.
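The locality argument can be sketched with a toy hash-sharding scheme. The node count, table names, and key choice here are all made up for illustration, and this is not Kinetica's actual placement logic; the point is only that if two tables are sharded on the same key, the rows that join to each other are guaranteed to live on the same node, so the join never crosses the network.

```python
# Toy hash sharding: rows with the same shard key land on the same node.
NODES = 4

def shard_of(key):
    return hash(key) % NODES

# Two hypothetical tables, both sharded on "cust".
orders     = [{"cust": 42, "amt": 10}, {"cust": 7, "amt": 99}]
line_items = [{"cust": 42, "item": "x"}]

# An equi-join on "cust" only ever matches rows on the same node.
same_node = shard_of(orders[0]["cust"]) == shard_of(line_items[0]["cust"])
```

Replicating a small dimension table to every node achieves the same effect for joins against keys the fact table is not sharded on, at the cost of storing extra copies.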
If you don't define a shard, we'll round-robin the data, and we'll do our best to do the query, including the extra data movement that may incur. But what we try to recommend is sharding setups that reduce that data movement as much as possible. So if you take your classic snowflake schema, traditionally you've got a massive fact table that's constantly growing, and your dimension tables are usually something we would recommend replicating. They're not tiny; they might be in the tens of millions of rows, 10 to 50 or 10 to 100, but they're small enough that you can still replicate them and not have to incur that IO penalty across your cluster. So yeah, we wanna minimize data movement both across the cluster and also from the CPU to the GPU, as Nima mentioned before. Typically this is the biggest overall bottleneck when we're actually going to process queries. For x86, it's the PCIe bus; you can get about 10 gigabytes a second of achievable bandwidth. On the IBM Power Systems with NVLink, you can get 60 gigabytes a second, and that will grow with Gen 2. But basically, the principle that we're trying to adhere to is: only move the data that you need to satisfy the query. So this of course implies column-oriented processing. If you're filtering on x, you should only be copying the x column to the GPU. We wanna minimize movement wherever possible. And like Nima mentioned before, we definitely wanna cache data on the GPU when possible. And when it's possible, we wanna copy data that's already been compressed, whether dictionary encoded or with other methods: copy the compressed data and decompress it on the GPU, or operate in the compressed domain when possible. [Question: what does caching the data on the GPU when possible really mean? When you're executing a query, you dispatch some of the data to the GPUs and then just leave it there as much as possible?] Correct.
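That column-caching behavior can be sketched as an LRU cache keyed by column, which is how the next answer describes it. This is an illustrative stand-in with made-up sizes, not Kinetica's actual VRAM manager: a hit means no PCIe copy is needed; a miss evicts the least-recently-used columns until the new one fits.

```python
from collections import OrderedDict

# Toy VRAM column cache: keep recently used columns on the GPU,
# evict least-recently-used columns when a new one doesn't fit.
class ColumnCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()            # column name -> size, oldest first

    def request(self, name, size):
        """Return True on a hit; on a miss, evict until the column fits."""
        if name in self.cache:
            self.cache.move_to_end(name)      # mark as most recently used
            return True
        while self.cache and sum(self.cache.values()) + size > self.capacity:
            self.cache.popitem(last=False)    # evict the LRU column
        self.cache[name] = size
        return False                          # miss: caller copies over the bus

c = ColumnCache(capacity=10)
c.request("t.x", 6)          # first touch: miss, column is loaded
hit = c.request("t.x", 6)    # second touch: hit, no PCIe transfer
```

A second large column would then push `t.x` out, mirroring the "I've got these two, but go grab the others" exchange described below.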
So yeah, the question was about caching data on the GPU. If someone did a query on table T and they looked at column x, we copy column x to the GPU and do whatever query they wanted to do. If we don't need that room on the GPU, we'll leave that column there. And so for the next query that comes in, if it's also going to be running on that column x, it's already there; we don't have to copy it again. So it's basically like an LRU cache, right? The subsequent query will say, do you have such-and-such chunks there? And you'll reply, I've got these two, but you've got to go grab the others and send them down. So, Nima mentioned a little bit about the challenge of programming GPUs. And it's a different way of thinking about programming. With GPUs, there are no for loops. You're never looping over your records. You have to map everything into parallel programming primitives. So when you're writing a kernel, you can think of the kernel as the thing that's running inside your loop, but you're never actually executing that loop. The loop is being scheduled across your GPU, and you're getting that behavior across the entire device. So when we're writing our processing kernels, we want to stick to parallel programming primitives: things like sort, transform, prefix sum or scan, reduction, scatter, gather. Whether a kernel is computing aggregates or joining data, we always have to think about these primitives and how we can use them to best process the data. And the other guiding principle we go by is to follow established standards. That means following SQL, specifically SQL-92. We don't support all the transaction mechanisms that SQL provides.
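That chunk-granular GPU cache behaves like a standard LRU. Here is a toy model of the idea; the names (`GpuColumnCache`, `load_from_host`) are assumptions for illustration, not the actual implementation:

```python
from collections import OrderedDict

class GpuColumnCache:
    """Toy LRU cache of column chunks resident in GPU memory (illustrative only)."""

    def __init__(self, capacity_chunks: int):
        self.capacity = capacity_chunks
        self.resident = OrderedDict()  # (table, column, chunk_id) -> chunk data

    def request(self, key, load_from_host):
        if key in self.resident:               # hit: chunk is already on the GPU
            self.resident.move_to_end(key)
            return self.resident[key]
        if len(self.resident) >= self.capacity:
            self.resident.popitem(last=False)  # evict the least recently used chunk
        self.resident[key] = load_from_host(key)  # "copy over PCIe/NVLink"
        return self.resident[key]

cache = GpuColumnCache(capacity_chunks=2)
copies = []
load = lambda k: copies.append(k) or f"data:{k}"
cache.request(("t", "x", 0), load)
cache.request(("t", "x", 0), load)   # second query reuses the cached chunk
assert copies == [("t", "x", 0)]     # only one host-to-device copy happened
```

The second request hits the cache, so the expensive host-to-device transfer happens only once, which is exactly the behavior described above.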
We do have somewhat of a transactional guarantee, in that when you insert data into the database, when that insert has completed, when you've gotten a response back, that data is in the database: it's ready to be queried and it's fully operational. There's no background processing or anything like that to get it ready. What we found was that for OLAP workloads, that's enough of a transactional guarantee to take away some of the frustrations we've seen with other solutions, where you do your mutation and you have no guarantee when that mutation is going to be represented in any follow-on query. With us, because our underlying mechanism is all HTTP RESTful endpoints, we use that request-response cycle as a transactional mechanism to say: look, you make a call, it's going to block until it's been put to the memory store and the disk. And when it comes back, any subsequent query is going to have it represented in its result set. So for a single-threaded kind of workflow, it's more than enough for OLAP workloads, right? What it's not is a multi-process, advanced transactional system with rollbacks, which is what the past 40 years of solutions have been focused on and already do very well. So, we're really focused on OLAP, and we've added support for some of the newer OLAP features that newer versions of SQL provide. Some of them are vendor-specific and some are pretty standard, but we support window functions, pivot, unpivot, cube, rollup, basically any of the OLAP aggregation functions that people are used to. Of course, we support ODBC and JDBC connectivity. Most of our users are connecting via ODBC or JDBC. We do have a native API that I'll go into a little bit more later, and for that we use Avro serialization, an Apache project. And for rendering, we follow the OWS standards of WMS, WFS, and the newer Vector Tile standard.
So we're following all the standards wherever possible. For geospatial processing, we support the ST functions that have been popularized by PostGIS and other databases like that. One more note on that: we're focused on OLAP, but we have spent time making sure that certain OLTP-like functionality works at a very high scale, to unlock, or show off, the OLAP capability, because you actually need some highly performing OLTP-style capability to work against this problem we've been talking about. Most notably, yes, we talked about insertions constantly coming in and being able to do that at scale, but also updates and upserts, right? If you think about data streaming in, a lot of it is around enrichment of a certain entity, right? You're constantly learning new things about a certain entity, and you want to upsert, inserting the record if it doesn't exist. Having that stream in, keeping up with it, and then having aggregates queried against that same table is a requirement for solving this new problem we've been talking about. You would more classically define those capabilities as an OLTP, or high-performance OLTP, problem rather than an OLAP problem. But it's our belief, where we're talking about inserts, updates, and a linearly scaling lookup capability, that you have to have those to really get full value out of the OLAP capability. Okay, before I get into the architecture, any questions on that? Okay, so going into the architecture. Sorry for the coloring here; hopefully it's not too hard to read. But the basic idea is that we're a distributed database. Each one of these tall boxes here is what we call a rank, or a process. These processes are running maybe on the same machine, or distributed across your cluster.
For instance, when we're doing development work, we'll typically run the head rank and one worker rank just on our laptop. And it's basically the full database; we can do everything that we do. But for actual deployments, we scale out, and so this might be 10 or 20 machines with maybe 40 to 100 ranks or processes. Each of the worker ranks is responsible for one GPU. That's the way we map processes to hardware. Your node may have four GPUs or eight GPUs or two GPUs, and we'll have one rank, one process, for each GPU. The head rank is kind of our head node. The main function there is that it has our HTTP server; everything comes in via a REST API. The head rank is responsible for input validation, error handling, metadata storage, and some global synchronization. But it doesn't store any actual data. No records are stored in the head node. It's really just an IO coordinator. The actual data is stored in columns in the worker ranks, and I'll go a little bit deeper into those. As far as how you interact with the database, everything comes in via a REST API. We have native language bindings for all the popular programming languages, well, not all, but many of them. So we have APIs for Java, Python, C++, C#, JavaScript and Node, and we're adding more as requests come in. The other way people interact with the database is via ODBC or JDBC. So we provide an ODBC server that basically translates into our native API calls. Users still install a classical ODBC client; that client talks to a server we've developed that speaks the ODBC wire protocol, takes that, and converts it to native driver calls. So the database itself continues to get those RESTful calls, even though the client is getting a fully native ODBC experience. And JDBC is the same pattern.
This allows us to integrate with the Tableaus of the world and a lot of other legacy and non-legacy applications. Even though I hadn't been using a lot of ODBC or JDBC prior to building this, it's still everywhere. When I first built Connecticut, I thought the native bindings were the way to go, and to me they were so much easier and so much more intuitive, but I was dead wrong. Basically SQL, ODBC, JDBC, these common pathways are very prevalent and probably getting more popular. Yeah, enterprises have just piles and piles of SQL code that they don't have any appetite to rewrite. One other thing to mention there is the host manager. The host manager is actually a separate process that is responsible for being the supervisor fabric across the cluster, and it's also distributed in that system. Rather than try to wrap around something like systemd, or other application-level monitoring services or tools that you find in a common Linux stack, we wanted to roll our own, because we could do more advanced supervision and initialization that makes more sense for Connecticut. This idea of the host manager is what we rolled out just last year, and it's actually the thing responsible for spinning up the actual database. Yeah, so basically this is talking more to what Nima just mentioned: it's supervising, it's managing the cluster, it's spinning up new instances. It actually spawns the database processes so that it can catch a signal and mitigate that.
So if it catches a signal that a process is going down, it's going to try to bring it back up, and it's going to tell all the other host managers about it. Essentially it allows us to have a clear picture, at the process level, of what's going on in the database, and the ability to have a very intricate supervisor pattern. I mentioned data distribution and data locality before. Tables are sharded and distributed, so making sure we distribute the data properly is a key concept. The typical path is that data comes into the head node and is then distributed out to the worker ranks. Now, we have a concept of what we call TOMs, or shards: basically partitions of data. You can have one or more of these TOMs per rank, and the sharding model basically lets you direct, in a deterministic manner, records with a certain value to a particular TOM. That's important for joins. For instance, if you have two large tables that you're going to be joining on one or two columns, then you'll need to shard on one or both of those columns so that records with the same data go to the same TOM, so they're living in the same TOM and the data is local there. What is a TOM? It's a data container. It was originally a class that was specific to type object management, and the name just stuck. It was a great nickname and it just stuck. Yeah, maybe we'll change it later. And we don't have any developers named Tom. As far as how you specify the shard key, here's a quick example of what the DDL looks like. It should be mostly familiar, but the things to point out are: you have your columns and the type of each column, whether they're nullable or not. You can specify a shard key. This is one or more columns, which determines the way the data is routed, the way the data is assigned to a particular TOM.
And notionally, the algorithm is: take the value of the shard key. So if you have a record that comes in with a test ID of five, basically what we do is hash that value and then mod it by the number of TOMs, and that tells you which TOM it goes to. That's not exactly how it works, because that doesn't work for an elastic system that scales up and down, but notionally that's what we do. You can specify a primary key as well. A primary key gives you a uniqueness constraint. What's that? What do you actually do? So the way we do it is kind of similar to some other databases: we basically have a number of hash slots, think some thousands of slots. You take the value, you hash it, and that gives you a slot, and then we have a number of slots assigned to each TOM. So as you scale out and you have more TOMs, you move certain of these slots over. So it's kind of like consistent hashing, but a little bit different. Yeah, it's similar. It's actually similar to how Mongo does it. But the general idea is that you're only moving a portion of the data as you scale. And the clients get all that logic too. All the client APIs, when they initially connect, get the map, and they have the logic built in so that when it's time to do a mutation, all the routing calculation is done on the client side, leveraging your client to do it. That way, the routing decision is made up front and the request goes direct to a node, rather than going to a single node and getting rerouted on our side. This is what allows us to get that linear scale on the ingestion front. And then a couple of other things to note on the DDL. There are ways to specify properties for your columns to tune the memory usage. For instance, this column is annotated as store-only. What that means is that we're going to get records that have this column in them, but we're not going to keep it in memory.
It's just going to be stored on disk. So you can't do any aggregates on those columns, you can't do any queries on them, but they're there; you can return them back. They're carried with the data, but you're not going to waste memory on columns that you don't care about. You can annotate text columns as text search, and then we have a full text search engine that works alongside the database to be able to do wildcard searching and that kind of stuff. And then you can annotate certain columns as dictionary encoded. This is for low to medium cardinality columns: if we don't have that many unique values, we'll use dictionary encoding and really reduce memory usage that way. Just a couple of words on the sharding constraints. This should be pretty common to other databases as well, but the shard key, whatever you're sharding on, should have sufficiently high cardinality so that you distribute to all the TOMs. If the cardinality is not sufficient, or if you have lots of records that happen to go to one TOM, you could get a poorly balanced cluster; you could have skew, basically. As we mentioned before, tables can be replicated to really help with joins. This makes sense for smaller dimension tables, where you can just have an entire copy at every TOM. What we've seen is that in your classic snowflake schema, the dimension tables can get big, but they're not in the billions. Your fact table can be huge and constantly growing, but the dimension tables are usually sub one billion. So with seven, we're allowing you to just do the join without having to do the best-case sharding or replication, and we'll move the data around for you. But sharding or replicating is still something we encourage; it makes the workload a lot more efficient. So I want to get back to the encoding a little bit. You mentioned that the user can specify: hey, I'll do dictionary encoding if the cardinality is low.
But I wonder if there's some default compression or encoding mechanism that you automatically apply to reduce the amount of data you transfer, or do you just specify whatever you want? You specify what you want. So you can specify columns as dictionary encoded. You can specify columns to be compressed as well; I didn't show that in the DDL. And that's an area we're continuously improving: GPU-friendly compression is an open area. Dictionary encoding is one of those, so that's the primary one. If you have a string column that's low cardinality, you can massively reduce your memory usage that way. Just a quick distribution example. Again, it might be a little hard to see with the coloring. But if this is one of your worker ranks and you've got two TOMs, two of our shards, on that rank, a shard of the fact table will be in each of those TOMs. For replicated tables, the entire table will be at each TOM. And we're column oriented, as we said before. What we do within the TOM, within each partition, is break the data into chunks. The chunk is the fundamental unit that we operate on when we go to do the actual processing. Here I'm showing a chunk size of three records. Typically our chunk size is eight million records, but that's a tunable parameter. And for each chunk we keep metadata for each column, right? That allows us to skip chunks when appropriate. Right. So I'll go into a little bit of the data processing. How do we actually execute queries? How do we use the GPUs? It'll often take several iterations to get the sharding of these TOMs right; all this stuff works for my application now, but we'll have to do it again. So what's involved when the administrator changes the sharding scheme?
Yeah, so if you change the sharding scheme, we can do it in place, where we'll take the table, basically make a copy of it, and then reinsert the records into the shards they belong to. So you can do that, or you can just create a new table and do an INSERT INTO, basically. And if you look at the architecture diagram, one thing that's not there is that each of those processes has a high-performance messaging path to every other node, so that they can do things like that resharding work. They're not going to the head node; they go direct to the appropriate TOMs on another node, in another process, and insert the resharded record. But basically you're shipping data around, correct. You're basically network limited at that point. Network and disk, really: if the table is persisted and you're resharding, it's going to have to get re-persisted. So as I mentioned before, we process data by chunk. That's our fundamental unit of processing. Yes sir. So if a rank becomes unavailable for some extended time, is the data automatically re-sharded, or is there some intervention required? Right now, with our current release, if a rank goes down during operation, we end that operation and we focus on bringing the rank back up and getting the cluster to a healthy state. That's what the host manager does today. We're looking at different options and kinds of remediation instrumentation to surface to the user. Today, if I'm doing a query and one of the ranks goes down, the query just fails; you get an exception, an error back. So I'm trying to finish up in five minutes, okay? Okay, well, I'll race through these. So, we process data by chunk. That's the fundamental unit, as I said before. It's eight million records by default; it can be tuned as necessary. And we do process multiple chunks in parallel.
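The per-column chunk metadata mentioned a moment ago is what makes chunk skipping possible. Here is a minimal sketch of the idea; the min/max zone-map layout is an assumption for illustration, not the actual on-disk format:

```python
# Each chunk keeps per-column min/max metadata; a filter can then skip whole
# chunks whose value range cannot possibly satisfy the predicate.
chunks = [
    {"min": 0,   "max": 99,  "values": list(range(0, 100))},
    {"min": 100, "max": 199, "values": list(range(100, 200))},
    {"min": 200, "max": 299, "values": list(range(200, 300))},
]

def scan_greater_than(chunks, threshold):
    scanned, hits = 0, []
    for c in chunks:
        if c["max"] <= threshold:   # the whole chunk fails the predicate...
            continue                # ...so we never touch its data
        scanned += 1
        hits.extend(v for v in c["values"] if v > threshold)
    return scanned, hits

scanned, hits = scan_greater_than(chunks, 250)
assert scanned == 1 and len(hits) == 49   # two of three chunks were skipped
```

With eight-million-record chunks, skipping even one chunk avoids a large amount of scanning and, more importantly, a large host-to-device copy.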
So we process multiple chunks concurrently, and chunks from other TOMs and ranks are all being processed concurrently as well. I just want to emphasize that we've got multiple levels of parallelism. We've got parallelism across the cluster: chunks going across the entire cluster. I keep hitting this microphone, sorry. And then we have parallelism within a chunk by leveraging vectorized execution, whether on the GPU or on the CPU; we're using the multiple threads available to process that chunk. Depending on the query and depending on the data, we may do the processing on the CPU or the GPU or both. For instance, if it's a small amount of data, we may just process it on the CPU, because it's not worth it to copy to the GPU, run a kernel on the GPU, and copy the data back when we can just do it right on the CPU. If there are many concurrent queries, we may want to do some on the CPU and some on the GPU, if the GPU is busy. And we have heuristics built in to tune that behavior. So what happens on the CPU versus what happens on the GPU? The CPU does query parsing, input validation, all the metadata maintenance, disk persistence, index lookups, index joins, so hash joins, and many geospatial functions. GPUs aren't great for variable-length data, and if you think of geospatial data, shapes or tracks or points or lines, that's variable-length data. So for a lot of those functions, we do them on the CPU. Full text search, that's all done on the CPU. So as far as the GPU, what do we do there? We do equi-joins. That's basically a sort-merge join. GPUs are fantastic at sorting, so that's one of the areas where the GPU really shines.
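A sort-merge equi-join in its simplest form looks like this. This is a sequential sketch for clarity only; on the GPU the sort and the merge are both expressed as vectorized primitives, and duplicate-key handling is omitted here:

```python
def sort_merge_join(left, right):
    """Equi-join two key lists via sort + linear merge. This shape is
    GPU-friendly because sorting vectorizes extremely well and the merge
    has no data-dependent work beyond advancing two cursors.
    (Duplicate keys are handled simplistically in this sketch.)"""
    left, right = sorted(left), sorted(right)
    i = j = 0
    out = []
    while i < len(left) and j < len(right):
        if left[i] < right[j]:
            i += 1
        elif left[i] > right[j]:
            j += 1
        else:
            out.append(left[i])   # matching key found on both sides
            i += 1
            j += 1
    return out

assert sort_merge_join([3, 1, 7, 5], [5, 2, 3, 9]) == [3, 5]
```

The cost is dominated by the two sorts, which is exactly the operation the talk says GPUs excel at.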
Then there's what we call predicate joins, or nested-loop joins, when you've got a general join condition. We do that looping not via a for loop, but using the vectorization available in the GPUs to do that nested loop. Fixed-length string processing we can do entirely on the GPU. You might have a char(8), an eight-character string, or a 16-character string, whatever it is; it's fixed length, and the GPUs can do that really, really well. All of our aggregations and window functions, analytic functions, those are all done on the GPU. That's really the bread and butter of where we get value from the GPU. And rendering, all the rendering, which I'll talk a little bit about next. And then, oh, just a quick note. We process data by chunk, and then we have to merge those chunk results hierarchically to get a final result. We merge data from the chunks within a TOM, we merge data from the TOMs within a rank, and from the ranks across the cluster. These results might be simple counts if you're doing a filter, they might be result tables if you're doing a group by, and they might be images if you're doing visualizations. But this hierarchical merging architecture is used by all of our operations. Any questions on that before I jump to visualization? And I know we're a little bit over time here, so I'll try to rush through. Just going over the basic visualization pipeline. The basic idea is, we copy the data to the GPU if it's not already there. This might be latitude and longitude data, it might be triangulated shape data. We apply whatever spatial projection and bounding box the user specifies to map the data into screen space. We pass these through an OpenGL or EGL shader pipeline. So we use a geometry shader to do things like convert points and lines into quads.
So we can do things like apply dotting and dashing textures to them. We run a fragment shader to apply various styling options. For points, we want to be able to set the shape, fill color, edge color, basically all the styling options. These are all run within OpenGL shaders. Then we get partial results, images from each chunk, and we merge those hierarchically as I showed in the previous slide. And finally we PNG-compress the final image and ship it back to the client, which is typically a JavaScript web application or something like that. So we include a full-fledged WMS server. WMS is the Web Map Service, a standard way to request images from a remote URL. We also support WFS and Vector Tile. But basically it'll return an image corresponding to a view of a particular table, or multiple tables, over a specific region with whatever styling options the user specifies. And under the hood, we're using CUDA and we're using OpenGL for performance. Just an example of what a request looks like: you can do this via curl, you can do this via your browser. You hit the Connecticut host, our standard port, and the WMS endpoint with all the various parameters to define what you actually want. And if you have styling options, they'll be listed here as other URL parameters. So what kinds of renderings do we support? We support what we call feature rendering, or raster rendering. This is an example of points or shapes, all OpenGL accelerated. We're showing the states here, but as Nima mentioned before, you may have millions and millions of geospatial footprints. If you're a utility company, it might be the location of every one of your transformer boxes, every one of your utility poles. Leveraging OpenGL means we can render millions and millions of points.
It might be your electrical lines, or a road network, or your cell tower network. These are all features that can get very expensive to render at scale. We support heat map rendering: these are basically density plots. I think this one is showing the location of tweets, and you can see that there are hot spots in city centers and things like that. Basically, it lets you summarize all your data in one image, and you can see where the activity is happening. We support class-break rendering, or choropleth. The idea here is that you render your shapes or your points and you style them based on some column or some other value. I think this one is showing states by population. But basically, this is all happening within the database, within a WMS call, using the standard pipeline I mentioned before and all the underlying data machinery. One of the things we've recently added is contour rendering. This is a little more complicated to do, because you can imagine you have points distributed across your cluster, so we've got to produce partial results at the TOM level and the chunk level and then merge those back to come up with a final contour plot. This is useful for things like oil and gas customers, who might want to see information about oil wells or whatever other kind of information they'd want to look at. Any questions on rendering? I think you have to finish up. All right, we'll keep going. Yeah, so we talked about multi-node ingestion, and we talked about the routing system and the mapping system that we use. It's the same thing on the key lookup side.
Again, the reason we spent time on this is that you need these high-performance, classic OLTP capabilities to drive and unlock the value of your OLAP capability. On the mutation side we've gotten to billions of mutations per minute on relatively small clusters, and on the lookup side we've done pretty well. Not top-of-the-charts YCSB well, but that's not necessarily because of the lookup capability; it's more the transport. So, yeah, thanks so much for having us. Andy, really appreciate the invite. And we'll be around for any questions or anything you guys want to talk about. All right, thank you. All right, we have time for two questions. We'll repeat the question, and we'll pass around the mic. Sure. Yes, sir. Do you compile queries to native code? So the question is, do we compile queries to native code? Currently, we do not. That's on our roadmap, but we don't. We have fundamental low-level operators that we map queries to. So we come up with an execution plan and map it to these low-level primitives, but they are not fully compiled. Any other questions? Do you do any scheduling on the GPU, or does the hardware not allow you to do it? Yeah, sorry, the question was about scheduling on the GPU. You don't really have a lot of instrumentation there. NVIDIA gives you a kernel sizing heuristic that can give a certain kernel priority over others, but it's not nearly what you have on the CPU. Now, you can do things like have multiple streams of execution on the GPU. So if you think about how we process by chunk, each chunk is processed in its own CUDA stream, and we can have multiple of these streams happening at the same time.
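The idea of limiting how many chunks are in flight on one GPU at a time can be modeled with a counting semaphore over plain threads. This is a toy sketch with assumed names (`GPU_SLOTS`, `process_chunk`); real code would launch each chunk's kernels on a CUDA stream rather than sleeping:

```python
import threading
import time

GPU_SLOTS = 2                          # max chunks in flight per device (tunable)
gpu_slots = threading.Semaphore(GPU_SLOTS)
state_lock = threading.Lock()
in_flight = 0
peak = 0

def process_chunk(chunk_id: int) -> None:
    """Pretend to run one chunk's kernels; the semaphore is the admission control."""
    global in_flight, peak
    with gpu_slots:                    # wait for a free slot on this GPU
        with state_lock:
            in_flight += 1
            peak = max(peak, in_flight)
        time.sleep(0.01)               # stand-in for kernel execution on a stream
        with state_lock:
            in_flight -= 1

threads = [threading.Thread(target=process_chunk, args=(c,)) for c in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert 1 <= peak <= GPU_SLOTS          # never more than GPU_SLOTS chunks at once
```

Eight chunks are offered at once, but the semaphore guarantees no more than `GPU_SLOTS` ever execute concurrently, which is the admission-control behavior a scheduling layer above the driver has to provide.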
But as far as explicit scheduling, we don't have much control over that. And the behavior changes from card generation to card generation. So we have to do our own scheduling layer on top of that. We don't want to have 40 chunks all hit the same GPU at once. So we have our own scheduling layer, but it has to work around the GPU limitations. All right, thanks. Please thank our speakers one more time.