The Carnegie Mellon Vaccination Database Talks are made possible by OtterTune. Learn how to automatically optimize your MySQL and Postgres configuration at ottertune.com. And by the Steven Moy Foundation for Keeping It Real; find out how best to keep it real at stevenmoyfoundation.org.

Welcome to another Vaccination Database Tech Talk. We're very excited today to have Mark Cusack from Yellowbrick. I'm excited because we've been trying to get Yellowbrick to come give talks for several years now, and my emails always went to /dev/null. So finally we sent a letter or something that got through to Mark, and we were super happy he agreed immediately. Mark has a PhD in computational physics from Newcastle University. He was at a startup called RainStor that got acquired by Teradata, and then he's been the CTO at Yellowbrick — that's not a Freudian slip, I'll leave that out — since 2020. If you have any questions for Mark as he's giving his talk, please unmute yourself, say who you are, and ask at any time. We want this to be a conversation and not just him talking to himself in Zoom for an hour. So with that, Mark, thank you so much for being here. The floor is yours.

My pleasure, Andy, and thank you very much. Very nice to meet you all; I'm looking forward to your questions. So what I'm going to talk about today is principally the Yellowbrick software stack, and how we've gone on this journey from being much more associated with database software — data warehousing running on our own hardware — to a pure software-only, elastic version of the warehouse that runs on Kubernetes and effectively runs in public cloud. That's the focus; I'll get into more detail about what I'm going to cover as we go forward. But just as a little bit of background, I want to discuss who and what is Yellowbrick. And I'd love to see — I don't know how many people have their cameras on — has anyone heard of Yellowbrick before, or done any research or looked at us at all? It's quite fine if the answer is no. Okay, so we've got one. They can raise their hand if they haven't heard of you. Yeah — so I saw one hand go up; I don't know how many people are on the call, I can't quite see that right now. Okay, so I think we're a fairly well-kept secret in the data warehousing industry to this point, and we're fixing our marketing as a result of that, so you're going to hear a lot more about us going forward. We were founded in 2014 and we've been on the market since 2017. We are a classic, orthodox SQL relational data warehouse: MPP-based, shared nothing. Up until now we've deployed on our own hardware, and you can see an example of some of the hardware running at one of our customers on the right-hand side there. There are certain things we do with our hardware: we have FPGA cards in with those compute nodes for accelerating certain types of jobs. We're also somewhat different in that we're a hybrid row-column store, and I'll get into more detail about that a little later on. But it's very much aimed at high-performance bulk loading of data as well as handling streaming ingestion use cases. We tend to find ourselves being used in industries that effectively want to do operational data warehousing — 24/7, always-on, business-critical applications — and our primary customers tend to be hedge funds, telcos, insurance companies and banks.
Just to talk a little bit about the hardware before I forget about it completely: if you look at that picture on the right-hand side, that customer actually has both our first- and second-generation hardware systems. The first generation is on the far right at the top, and the second generation is at the bottom. The system on the top right can probably store around three petabytes of user data. The system at the lower end is probably about four and a half petabytes, but it can expand up to eight of those yellow blocks, so it scales up to about 12 petabytes of user data, about 80 terabytes of memory, and about 10,000 virtual CPUs. So we're really focused on high-performance data warehousing. To give you a quick example of a couple of use cases: one of our customers has 4,000 registered users against a single system that looks like one of these, with very strict SLAs around query performance. Some of our customers run real-time fraud detection algorithms 24/7 — basically any online purchase that you make outside of AWS and Amazon is validated, from an identity perspective and a fraud detection perspective, on a Yellowbrick system. So we're there behind the scenes doing a lot of pretty critical infrastructure work from an analytics applications perspective.

Can you actually say what the order of magnitude is for the SLA in that fraud example — one millisecond, ten milliseconds? Oh, yeah — that's a less-than-two-second SLA on the queries for that one, and their previous solution couldn't handle a 30-second SLA, so that's one of the reasons we got selected there.

I'm not going to talk much more about the hardware; this talk is about the software and the journey to the cloud. Today I'm going to cover three broad areas, and I've taken the approach of giving a broader overview of what Yellowbrick is and does rather than drilling into one specific area — you could easily spend an hour or more on each of them. But depending on the questions you ask, we can go in any direction you want. I'll start off by giving you an overview of how we containerized our software, how it runs in the cloud, and how Kubernetes forms the orchestration layer — that platform-agnostic layer that allows us to run, and easily deploy and port to, other platforms. I'll then talk more about our query stack. Like many of the latest crop of data warehouses, and some of the older ones, we derive our front end from Postgres, so we're no different here; I'll talk about the planner and the cost-based optimizer and the changes we've made to Postgres to support that. Then I'll talk about our back-end MPP query execution engine, mention a little bit about the row-column storage, and talk about our workload management side of things. Again, our focus is on straight-and-narrow data warehousing: how can we get the best price-performance out of this thing at very high concurrency levels and at high data scale. And finally, I want to talk about something that I think we do rather differently in the industry: while there are many Postgres-derived data warehouses out there, we've also focused a lot of our engineering investment on operating-system-level optimizations.
And so I'll talk about the work we've done around custom device drivers, the threading and scheduling models we use within Linux, how we manage memory to some extent, and some of the custom network protocols we've put in place. All of these are aimed at squeezing the most efficiency out of the platform we run on, whether it's our own hardware or cloud platforms. And please, at any point, interrupt with questions.

So, on to the software architecture. As I mentioned, we've gone from this Andromeda hardware, which I showed you a picture of earlier and which is really built for extreme performance — a 200-gigabit InfiniBand network backbone, FPGA-accelerated, but very much an architecture where compute and storage are combined in those blades that you saw. What we're flipping to now is taking that software stack and deploying it in public cloud: containerizing it, layering it down into pods controlled by Kubernetes, with separated compute and storage. We followed the model that Snowflake and Databricks and Dremio and many others have followed, of persisting data for the long term in S3 and using — in the case of AWS — EC2 instances with local ephemeral NVMe storage as caching. More on that a little bit later.

As we drill down into the broad architecture — and again, this model will be very familiar from a lot of other vendors approaching data warehousing in the cloud today — we basically have a set of resources from our particular cloud, whether it's object storage or compute. What we provide is almost a multi-tenant architecture that allows one to spin up what we call data warehouse instances — these yellow blobs. In this particular example they're used in a telco, where you're provisioning data warehousing into different lines of business with different use cases: you might have a production instance in one organization, and development and test pieces elsewhere. And then we have a separation of the compute, where above each yellow blob — each data warehouse instance — you can have one or more virtual compute clusters. The role of the data warehouse instance is to provide a set of shared services from which the virtual compute clusters, which actually do the work of running queries and loading data, can benefit. We have the idea of a single pane of glass called the cloud data warehouse manager, which is the control plane from which you control and provision data warehousing across different lines of business or even different clouds.

To give you an example of the architecture we're using here: we have virtual compute clusters that can scale in and out from a single node, node by node, up to 128 nodes per cluster. For those yellow blobs — the instances — you can have up to 3,000 clusters, or 3,000 individual compute nodes, attached to them, and effectively every virtual compute cluster attached to an instance sees the same catalog and the same data. The purpose of the instance — again, that yellow blob — is to manage the databases, the catalogs, the metadata; it's the transaction manager. It also actually contains the row store component.
As I mentioned, we have a row-column store piece here, and the instance provides these instance-wide shared services along the lines of query compilation, bulk loading support, and things like that. I've included a little example of one of the screenshots from our cloud data warehouse manager on the right-hand side, which shows how you'd create a compute cluster: select the node type in AWS and the node count. What I also wanted to illustrate is that anything you want to do in terms of manipulating Yellowbrick, you can do through SQL. At any point you can bring up these little breadcrumbs, which show you how you'd create a compute cluster through SQL. We want SQL to be the main language with which you do any kind of cluster operation as well as any kind of DML operation, etc.

I think we have a question — do you want to unmute yourself? Yeah, can you hear me? Yeah. So the question is: you have your own hardware, right, so how do you get those things into hyperscalers like Azure, AWS, etc.? The answer is we don't. We're running natively on the stock instances in AWS. What I will get to is how some of the things we've done at the operating system level in our own hardware translate now to public cloud instances. Of course, there's one thing that won't translate, and that's things like the FPGA cards — the two FPGA cards per blade. The instance configurations you get won't allow us to support what we need there right now; maybe they will in the future. But this is all about taking that software and deploying it natively on cloud instances.

For your catalog here, is there anything special about it? Like, famously, some systems use FoundationDB for this — is it a bespoke thing you guys wrote? Is it transactional? What can you say about it? No, we actually just use a Postgres catalog for that. But we do things around caching — and I'll get into a little more detail about this — where we cache, at a cluster and at a node level, the mappings from tables to what we call shard files in S3, because it would be hopelessly inefficient to try and pull that out of the PG catalog every time. But the catalog remains the source of truth. And remember also that the scope of the metadata is limited to a single instance. Going back to this picture: those yellow blobs are the scope of everything — the databases are only visible within a yellow blob and to the virtual compute clusters that attach to it. It's not a single service that spans all our deployments, if that makes sense. So the scope is limited.

Okay, so a little bit about what we did with our software. Our software was fairly component-oriented in the first place, so what we've actually done is take the core components — the core pieces of functionality within the database software stack — and build containers out of them. But we have done quite a tight integration with Kubernetes, and for all of our cluster state we're effectively relying on Kubernetes to be the source of truth. As I mentioned, one of the things that's rather unique is that we've effectively built a SQL interface over Kubernetes — that's one way of looking at it, at least from a crude perspective. Now, if you look at this picture up here:
There are certain things that are stateful in nature — what we call StatefulSets in Kubernetes speak — and some things that are ReplicaSets, which are very horizontally scalable. Our compiler services — again, more about this a little later — compile queries down to native code; those are horizontally scalable based on the workload that comes in, and we load balance and expand and contract them dynamically. We do the same for bulk loading data: the more bulk loads that happen, the more we scale those out, independently of the virtual compute clusters running in the background. The pods that do the heavy-duty work are called workers — I'll talk in more detail about those in particular — and they are, effectively, the MPP nodes. They have persistent volume claims that attach to the storage, ultimately to the cache and, at the end of the day, through to S3. They're the things that do the heavy lifting in terms of workload. There are also pods for logging and observability — effectively monitoring and so on — as well as for the user interface pieces, our row store, and other components like the transaction manager.

Do you backfill stats on bulk load, or do you run ANALYZE separately? We run ANALYZE separately and automatically — again, I'll talk a little bit about what we do and when. At the point of bulk load, certain sets of statistics are passed back automatically, but there's an automated ANALYZE that uses HyperLogLog statistics gathering to populate the rest afterwards.

Okay, so going down to another level of detail, let me talk about what we do around the workers. We have these nodes one-to-one in our clusters: each virtual cluster node effectively runs a single pod. This is something of an anti-pattern in Kubernetes, where normally you would throw pods across a Kubernetes cluster and expect it to decide where best to place them based on the available resources. We don't want that: we want to place a single pod per worker node — a strong affinity of a single pod to a single worker node. That's very important because of the OS-level optimizations we want to do: we want to completely control access to the lower-level devices on that particular compute node, and we don't want two pods on a worker competing over that. More on that in a moment. We have done some things, particularly as we got to the cloud, where we didn't have the benefit of highly reliable 200-gigabit InfiniBand to play with like we have in our own hardware. We quickly realized through our tests, as we were porting this, that the standard stock TCP/IP stack that you get in something like AWS or the other clouds isn't going to cut it for us, so we spent quite a lot of time developing a reliable UDP-based protocol to do data distribution and communication between workers during the course of a query. We went through a whole set of selection criteria — do we use things like RDMA, zero-copy networking, and so on — and on the basis of what was the best, most widely supported standard that gives us the performance we want, we ended up using DPDK as the basic approach to networking here. Through our own performance testing we were getting consistently 50% higher throughput than using standard TCP/IP.
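To make the idea of a reliable UDP-based protocol concrete, here is a minimal, hypothetical stop-and-wait sketch in Python: sequence numbers, acknowledgements, and timeout-driven retransmission over an ordinary UDP socket. The function names and framing are invented for illustration only; the protocol described in the talk runs over DPDK in user space and is far more sophisticated.

```python
import socket
import struct

# Toy stop-and-wait reliability over UDP: each packet carries a sequence
# number, the receiver acks it, and the sender retransmits on timeout.
HDR = struct.Struct("!I")  # 4-byte sequence number header

def send_reliable(sock, addr, payloads, timeout=0.2, max_retries=10):
    """Send each payload with a sequence number, waiting for an ack."""
    sock.settimeout(timeout)
    for seq, payload in enumerate(payloads):
        packet = HDR.pack(seq) + payload
        for _ in range(max_retries):
            sock.sendto(packet, addr)
            try:
                ack, _ = sock.recvfrom(HDR.size)
                if HDR.unpack(ack)[0] == seq:
                    break                    # acknowledged, move on
            except socket.timeout:
                continue                     # lost packet or ack: retransmit
        else:
            raise IOError(f"packet {seq} was never acknowledged")

def recv_reliable(sock, count):
    """Receive `count` packets in order, acking each arrival."""
    received, expected = [], 0
    while expected < count:
        data, addr = sock.recvfrom(65535)
        seq = HDR.unpack(data[:HDR.size])[0]
        payload = data[HDR.size:]
        sock.sendto(HDR.pack(seq), addr)     # ack whatever arrived, even dups
        if seq == expected:                  # deliver strictly in order
            received.append(payload)
            expected += 1
    return received
```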
What's the performance — I mean, I understand with InfiniBand on-prem you control it — of the DPDK-plus-UDP protocol on AWS versus InfiniBand? Yeah, that's a great question, and it's quite hard to answer, because the network performance is somewhat tied to the type of instance you get, and the maximum you can get in an AWS setting is around 100 gigabits per second. I'm trying to think of the raw numbers off the top of my head and, to be honest, I can't right now — but the point is it's more complicated: you've got all of these different constraints compared to what we had on-prem. The main thing we were trying to avoid — and again, it comes out in the OS-level optimizations I'll talk about — is kernel overhead at any point. A key selection criterion for whatever we do, whether it's network drivers or NVMe drivers, is that they have to run in user space: we want to avoid any kernel overhead at all, and avoid copying through the kernel stack.

So, common to a lot of cloud-based, separated compute-and-storage data warehouses, we also use locally attached NVMe drives as caches. I'll talk about how we cache data in a moment, but we take a write-around approach to loading data: when you load data into Yellowbrick, it's persisted transactionally and immediately into the object storage, and then the caches rehydrate based on the query patterns they see. Again, when we looked at what you could do with the standard data retrieval and access libraries for S3 in particular, we ended up writing our own, because we couldn't get the queue depths we wanted out of the stock libraries, and what we built ourselves was about 3x better in terms of throughput.

Okay, so file caching. We have a very simple approach to how we cache the files that arrive on those workers, pulled off S3 in response to a query: a simple LRU-based eviction policy, plus a simple two-level scan-resistance policy. It basically says: if this shard file — the files we persist on S3 — has been used once, that's fine; if it's used twice, we move it to a used-more-than-once list. Then, if it doesn't get opened again and it's still on the used-once list, we evict it. That helps us a little with scan resistance. We do a mix of prefetching and single-block reads out of S3 into the cache. And then, following a cluster resizing, we assign shard files to the different compute nodes using a rendezvous hashing approach, to try to reduce the amount of shard reallocation we have to do around the cluster.

I wanted to ask: are your cache files made resilient using replication or erasure codes? Oh, wow, that's a really interesting question. No — there's no resilience in those cache files at all.
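A rough sketch, with invented names, of the two-level scan-resistant eviction policy just described: shards seen once sit on a probationary list and are evicted first, while shards opened a second time are promoted to a used-more-than-once list, so a one-off table scan cannot flush the hot working set. This only illustrates the policy, not Yellowbrick's cache code.

```python
from collections import OrderedDict

class ScanResistantCache:
    """Toy two-level LRU: evict from the 'seen once' list before touching
    shards that have proven useful more than once (scan resistance)."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.once = OrderedDict()    # shard_id -> size, opened exactly once
        self.multi = OrderedDict()   # shard_id -> size, opened 2+ times

    def open_shard(self, shard_id, size, fetch_from_s3):
        if shard_id in self.multi:            # proven hot: refresh LRU order
            self.multi.move_to_end(shard_id)
        elif shard_id in self.once:           # second touch: promote
            self.multi[shard_id] = self.once.pop(shard_id)
        else:                                 # cold miss: hydrate from S3
            data = fetch_from_s3(shard_id)
            self._make_room(size)
            self.once[shard_id] = size
            return data
        return None                           # hit: caller reads local NVMe

    def _make_room(self, size):
        used = sum(self.once.values()) + sum(self.multi.values())
        while used + size > self.capacity:
            # A pure scan only ever populates `once`, so hot shards survive.
            victims = self.once if self.once else self.multi
            _, freed = victims.popitem(last=False)   # evict the LRU entry
            used -= freed
```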
We hydrate those from S3 — I'm not sure I'm answering your question directly — but we have mechanisms where we exchange notifications when nodes in different clusters write data: they exchange which shards have been written through to those caches, so there's no need to go all the way back to S3 and do painfully long reads to get that data back. But there's no erasure coding, no Reed-Solomon stuff going on in the caches. Interestingly enough, going back to our on-premise hardware appliance, the blades there do use erasure coding for full data persistence — that's just how we do it there — but we don't use it in our cloud deployments.

How big are your shards on S3? They're about 100 megabytes in size, but we actually segment those into two-megabyte sub-blocks in S3, and that helps us with a smaller read size.

Okay, so: the write-around cache, which I think I mentioned, and the exchange of notifications between clusters; and all these clusters have in-memory mappings of the tables to the shards they own, with those definitions in place, again to avoid costly reads out to either S3 or the PG catalog.

Okay, on to the database stack. We were originally based on Postgres, and we have made some significant changes over time — I'll talk about what we've changed — but I'll be brutally honest with you: most of the standard optimizations and algorithms that pretty much everyone in the industry uses today, we're applying too. A lot of the join algorithms, the rewrites we're doing, the optimizations around predicate inference, pushdowns, join elimination — we do those. We have a cost-based optimizer that's really mostly aimed at costing joins, but we also have cost-based estimates for aggregates and scanning. There's actually a lot more work we want to do around building out our costing models, and there are things we don't do today that I think we will, such as costing distribution steps — we don't do that. But effectively, when a query enters the system, to a large degree the parsing and planning are handled by standard Postgres, and then we optimize in certain areas. We do things like result spilling — spilling join results, spilling intermediate results — and we use temporary disk for sorting and aggregation and the like. As I mentioned earlier, we update our statistics on a regular basis following data loads, using HyperLogLog approaches. I've got a picture on the right of a query plan from our web-based UI where you can see what's going on in the plan. In any plan, the lowest leaves of the tree are the table scanning nodes; we've got these distribution nodes in place between the table scanning nodes and the upper regions, the join parts, and you'll see our hash-based build-and-probe approaches if you look at that plan in any depth. What we do do is very aggressively push filters down to the table scanning nodes — min/max filters, and bloom filters: we build bloom filters on the basis of the build side of any hash join and push those down, so we're only scanning what we need to, as aggressively as possible.

Do you still have the genetic query optimizer in there, or did you do away with that?
That's actually a bit of a sore topic. Yes, we do use it in places, but we don't rely heavily on it — it is still there, yeah.

How much — I mean, I understand some of these are specific to what Yellowbrick does, and the HyperLogLog stuff is specific to you — but how much of the optimizer work you're applying could in theory be applied back to regular Postgres? I mean, predicate pushdown — Postgres is doing that, right? Like, how much of it is "Postgres is doing something stupid, here's the right way to do this," versus "hey, there's something Yellowbrick can support that Postgres can't"? We take a very pragmatic approach to what we change in Postgres, and it's always driven by what our customers are seeing: what use cases, when queries are failing, when we can't run certain workloads. Can we do a manual rewrite for that customer, or does it make sense to make that change in Postgres — or rather, in our planner? That's the approach we take. We have not taken the approach of trying to solve the obviously impossible problem of every single corner-case query out there. If it's a real pain point, we'll fix it and build it into the product; if they can get away with doing a rewrite and the customer's happy, that's great. It's usually the auto-generated queries from BI tools that are the real pain to deal with.

And is this a hard fork of Postgres? You're not following along their mainline and using extensions to put in changes the way, say, Timescale does? No — in fact it's based on Postgres 9.5, so we're a little bit behind. And in fact, let's get on to this, because this is where we really differ from Postgres, and in which places. The wire protocol we kept: we wanted that outer layer of Postgres compatibility. For us that was actually one of the most important things, because as a database — sorry, data warehouse — vendor, one of the hardest things for a customer to do is migrate to a different platform. Postgres is pretty much the de facto SQL dialect for many customers now; it's just so widely used, and a lot of platforms — Redshift, Vertica, you know, Greenplum — are all based, to one degree or another, on a fork of Postgres. So we use the same Postgres drivers, the same ODBC and JDBC drivers, the same wire protocol. We did put an entirely new bulk loading interface in place, which is directly linked to our back-end MPP nodes and bypasses Postgres completely, for performance reasons. The parser is largely unchanged: we wanted to keep the syntax, even down to the stored procedure side of things, the same. Obviously we had to teach it how to plan an MPP job, so we did that, and we did a few things like star join optimizations in there — getting it to join the three or more smaller tables first, things like that. No changes to the transaction log: we wanted to maintain the ACID characteristics that Postgres gave us. We also support a more limited set of data types than stock Postgres; we wanted to focus on the ones that were really important and made sense from a high-performance perspective.
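As a flavor of the star-join tweak mentioned above, here is a deliberately simplified, hypothetical heuristic: given estimated row counts from statistics, handle the smaller dimension tables first before bringing in the large fact table. It only illustrates the idea of reordering a star join, not the actual planner logic.

```python
def order_star_join(fact_table, dimensions, row_counts):
    """Return a join order that handles the smaller dimension tables first.

    `row_counts` maps table name -> estimated cardinality (from statistics).
    """
    dims_smallest_first = sorted(dimensions, key=lambda t: row_counts[t])
    # Combine the small dimensions first, then bring in the big fact table.
    return dims_smallest_first + [fact_table]

# Example: orders is the fact table; the dimensions are small.
plan = order_star_join(
    "orders",
    ["product", "region", "date_dim"],
    {"orders": 1_500_000_000, "product": 20_000, "region": 50, "date_dim": 7_300},
)
print(plan)   # ['region', 'date_dim', 'product', 'orders']
```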
Like I said, our company philosophy now is that we want to be almost an inch wide and a mile deep as far as classical data warehousing is concerned — really focus on where we need to place our investments to get the best price-performance, and ignore the rest. So you won't see us adding a whole bunch of machine learning data preparation functions to our platform; we're interested in staying, right now, in the classic, historical data warehousing space.

So, the execution engine: we ripped that out completely, and I'll talk more about that later on. It's really designed to let us stream data directly off NVMe, straight over the PCIe buses, straight into the L3 caches. That's our goal at every point, and that's what we try to achieve: the entire stack is aimed, to as great a degree as possible, at never having to go to main memory to get the next piece of data you need to answer that part of the query.

Sorry — do you keep the extension engine, supporting PL/pgSQL and so on? Yes, yes, we do. We preserve that.

The storage engine we ripped out and replaced: it's a columnar MPP back end. We also have this new row storage component. We actually started out using Postgres as our row store component, but we weren't seeing the performance characteristics we needed, so we ended up throwing away the Postgres row storage for data and built our own row store. And then, as I mentioned, there's the columnar storage at the back end.

Okay, so — again, common to a lot of other data warehouse vendors — we have a planner and optimizer which generates a serialized plan; it happens to be serialized as XML. That XML-serialized plan essentially gets sent out, turned into C++ code, compiled down to object code, and those object code fragments are distributed across the MPP nodes. The workers themselves are pretty dumb: they're basically flow-control engines, just made to run object code within the constraints provided by the workload management system in terms of the resources they can use. I know there are a lot of other vendors that have a similar approach to this problem of compilation and execution. Any questions on that?

Are you doing a push-based model or a pull-based model in your query plan? Yeah, it's more of a push-based model, from the ground up.

So, I've covered some of this already: data storage and layout. We have these shard files that are compressed and column-oriented — a completely proprietary format and layout. However, when we access that data, we do a lot of transposition into more row-oriented, cache-friendly blocks: when things come off persistent storage, they're transposed. That's actually one of the things we use the FPGA for on our own hardware — that row-column transposition, as well as decompression and compression. So we have these cache-friendly blocks that flow out into the system and then, hopefully, end up in a good place in the L2 and L3 caches. We also don't really have manual indexing, but you can specify a distribution strategy for your data, and you specify that in your DDL.
And so there's a good example on the right, where we've got a distribution key — an order key from the TPC-DS data set... sorry, the TPC-H data set, I should say. We can also specify sort keys, so you can order the data within shards, and hash distribute on a particular column or group of columns. We also support other distribution approaches, like random or replicated distribution: if you've got a bunch of small dimension tables, it makes sense to replicate them across all of the MPP nodes in your cluster; otherwise it likely makes sense to do a hash distribution if there's a suitable key on a large fact table.

Again, there's nothing new under the sun in how we do shard pruning — the pruning of shard files and the filtering of data. We try to wait until the last possible minute to decompress and read data to decide whether to filter it. If we can, we eliminate shards based on file names; we'll open up the headers of the shard files, read only the column min/maxes, and rule data in or out based on those if we need to. And then, as I mentioned, we can build bloom filters that we push down once we've actually decompressed and read the data, so we can easily decide which records are going to participate in a hash join, or can be eliminated based on min/max criteria.

On that bullet point about transposing data into a row-oriented format — what do you mean by that? Like, if I'm scanning a column, why wouldn't I want to process it in a vectorized, columnar format? Yeah — for certain operations we do apply those same vectorization and SIMD approaches at the column level. But when it comes to processing the data later down the line, it actually makes sense for us to transpose it into more of a row-oriented packet. Yep — correct. Yeah. Okay.

Okay, on the statistics side of things: we automatically create statistics, usually with a rule of thumb of more than 10% of the data having changed, but we run it anyway immediately after any inserts, insert-into-selects, CTAS operations, or bulk loads. And then periodically we'll update those statistics based on other operations. There's a range of statistics we capture to support our cost-based optimizer: the distribution of nulls, things like average column widths so we can build costings in terms of row counts and column widths, histograms, for example. And then there's additional shard-level metadata that we cache at load time — the min/max values, row counts, even the string dictionaries we use to compress data — and we use that for faster access and elimination based on string comparisons. So again, probably not a lot that's unusual or new, and I wouldn't say this is where our core differentiation is; these are a lot of standard techniques. What I would say — and I give kudos to our engineering team — is that a lot of this stuff is extremely well engineered, which is partly why we get the performance levels we get out of it.
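To make the shard-pruning flow concrete, here is a small hypothetical sketch: each shard carries per-column min/max metadata, so shards whose range cannot match the predicate are never opened, and a bloom filter built from a hash join's build side is applied at the scan to drop rows that cannot possibly join. The structures and names are invented; the real implementation works on compressed shard files and generated code.

```python
def prune_shards(shards, column, predicate_lo, predicate_hi):
    """Keep only shards whose min/max range for `column` can overlap the
    predicate range [predicate_lo, predicate_hi]."""
    survivors = []
    for shard in shards:
        lo, hi = shard["minmax"][column]      # per-shard zone-map metadata
        if hi >= predicate_lo and lo <= predicate_hi:
            survivors.append(shard)           # might contain matches: scan it
    return survivors

def scan_with_bloom(shard_rows, join_column, bloom_contains):
    """Apply a bloom filter built from the hash join's build side during the
    scan, so rows that cannot join are dropped before leaving the scan node."""
    return [row for row in shard_rows if bloom_contains(row[join_column])]

# Example: two shards of a lineitem-like table with min/max on an order key.
shards = [
    {"name": "shard_001", "minmax": {"o_orderkey": (1, 50_000)}},
    {"name": "shard_002", "minmax": {"o_orderkey": (50_001, 100_000)}},
]
print([s["name"] for s in prune_shards(shards, "o_orderkey", 60_000, 70_000)])
# ['shard_002'] -- shard_001 is never opened or decompressed
```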
It maintains ACID semantics for inserts, updates, and deletes, but — again, similarly to a lot of other models out there — we have the notion that all our shard files are immutable. When you delete data, we effectively place a side file alongside the shard file, which is a bitmap masking out the records that are no longer live in that shard, and we do things like automatic shard merging over time from a garbage collection perspective. We use the standard Postgres transaction log to keep all of this consistent, and use transactions throughout the system in the same way. It's the read-committed level of isolation; we lock at the table level, inserts don't block other inserts, readers don't block writers, and so on.

You follow basically the same methodology as Postgres does — MVCC-like, always appending the newest version to the chain? Yes, that's right; we tried to keep that part of things as similar as possible. And it works just as well when you're storing things on an immutable store like S3.

Okay, so the row store and the column store. We took a design decision pretty early on that we wanted to support more of the emerging streaming use cases. There aren't a lot of databases out there today that handle streaming — from, for example, Kafka topics — particularly well, and particularly not columnar databases, so we wanted to design a system that combined a row store component and a column store component. As far as queries are concerned, we logically union queries across the front-end row store and the column store. To give you an example of what happens: as streaming data comes in from a Kafka topic, it lands in our row store, which runs in one of those pods you saw earlier, and then over time, and under space constraints, it gets aged out, converted to columnar format, and stored as shard files at the back end. That common Postgres-based transaction log is the thing that keeps the ACID properties and our data consistent across those two stores. The design criterion was that it had to cope with more than a million records a second — in fact it goes higher than that; we've clocked it at three, four, five million records a second coming in as single inserts or micro-batches. But the majority of the data in Yellowbrick, for long-term persistence, ends up in the column store at one level or another. Bulk loads don't go via the row store first; they bypass it and go straight into the column store, so we're able to get very high wire-rate bulk loads — 10 terabytes an hour over 10-gigabit-plus networking is not unheard of — while at the same time supporting those streaming use cases, which are becoming more and more common, particularly as we find ourselves being applied in manufacturing use cases close to the factory floor, and IoT kinds of examples.

Okay, workload management is something that is incredibly important for us because, as I mentioned at the top of the call, we're really focused on how we support very large user communities — you know, 4,000 users against a single system.
In terms of the state of the art in data warehousing workload management, I would certainly put Teradata at the top of the tree — I used to work at Teradata, I used to own the data warehousing product line there, so I can speak to that. But we've done a lot of work at Yellowbrick to boost our workload management chops. The way we partition resources on a Yellowbrick system is that resources are split into what we call resource pools: we split compute up across different priorities, and we split memory across different pools, as well as things like temporary spill space. Then the user either uses the system-defined rules that are already in place, or crafts new rules, to decide where a query will run — which pool it goes to. The criteria for deciding which resource pool a query ends up in could be based on a tag you've put in the query, who submitted it, which database it's addressing, a fragment of the SQL itself, which tables it touches, which client application, and so on. So you can make these distinctions to place long-running bulk loads in one queue, high-complexity ETL tasks in another, short tactical queries in other queues, and divide and conquer the workload that way. That's the approach we take, but we've also got governance mechanisms in our workload management scheme: we can throttle and cancel queries if they exceed given limits, or move queries from one pool to another and restart them. So you can have the notion of a penalty pool: if a query is taking too much time in one pool, it can be moved to a lower-priority queue, and so on. It's a very flexible way of creating these rules, defining these pools, and letting you run at very different concurrency levels. The example on the right-hand side is this almost guitar-hero-style display that shows the different virtual compute clusters running in AWS. You've got three slots for the C1 pool at the top — that means it can support three concurrent queries in that small pool; it's only a small cluster — and then a much larger cluster, C3, at the bottom, that can support many more concurrent queries, and those lines indicate the concurrent queries that are running.

This is super important — but in your example here with this small instance, you have three slots. So if we only have one query running, does that mean the other two slots are completely idle? No — we have the notion of flex slots. If a query finds itself in one of these slots and it's unaccompanied, there's a rule that says it can effectively use a certain amount more memory than its slot's share. We basically have a minimum concurrency and a maximum concurrency per pool, and for that particular pool it might be a minimum concurrency of two and a maximum of three. So if one query finds itself in a slot on its own, it will actually use the resources that two queries would occupy. We can scale up and use more memory and more temp space, but only up to that limit, and you can define what those minimum and maximum concurrency values are.
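A minimal sketch of the rule-driven routing and flex slots just described, with invented rule and pool names: rules inspect query attributes (tag, application, estimated memory) to pick a resource pool, and a flex-slot calculation decides how much of the pool's memory a query may take when slots are unoccupied, bounded by the pool's minimum and maximum concurrency. It shows the shape of the mechanism, not Yellowbrick's actual workload manager.

```python
def route_to_pool(query, rules, default_pool="default"):
    """Walk user-defined rules in order; the first match decides the pool."""
    for predicate, pool in rules:
        if predicate(query):
            return pool
    return default_pool

def flex_memory_grant(pool_memory_mb, min_concurrency, max_concurrency, running):
    """Memory is divided as if at least `min_concurrency` and at most
    `max_concurrency` queries were present, so a lone query can flex up."""
    effective = min(max(running, min_concurrency), max_concurrency)
    return pool_memory_mb // effective

# Hypothetical rules: ETL by tag, a fraud app's tactical queries, big queries.
rules = [
    (lambda q: q.get("tag") == "etl",            "etl_pool"),
    (lambda q: q.get("app") == "fraud_service",  "tactical_pool"),
    (lambda q: q.get("est_memory_mb", 0) > 8192, "large_pool"),
]
print(route_to_pool({"app": "fraud_service", "est_memory_mb": 256}, rules))
# tactical_pool
print(flex_memory_grant(30_000, min_concurrency=2, max_concurrency=3, running=1))
# 15000 -- a lone query gets the share that two queries would occupy
print(flex_memory_grant(30_000, min_concurrency=1, max_concurrency=3, running=1))
# 30000 -- with a minimum of one, a lone query can take the whole pool
```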
Yeah — in principle I could set a minimum of one and say that one query would have all of the resources for that pool available to it, and anything else entering that pool would be queued behind it until it finished. And then three more queries that came in at the same time would each get a third, for example.

I think we have a question — do you want to unmute yourself? Yeah, I have a question: how do you predict how much resource a query will use before running it, and how accurate do those predictions get? Yeah — as part of our query planning we try to estimate the amount of memory a query is going to need and the amount of temp space it's going to use, and through our rule mechanism you can put rules in place that decide which pool it goes to based on the planned amount of memory. So it's very easy to set those kinds of rules up. Now, if I'm honest about how accurate they are: they could be a lot better. On our estimates of memory, I think we tend to overestimate compared to how much is actually used. So our models, as the basis for creating these kinds of guardrails, are not as good as they could be, and that's actually an area of roadmap development — even going as far as getting better estimates and converting most things from a cost to a time: trying to understand, from a query planning perspective, how much time a query is going to take, not just a unitless cost. That's somewhat different from the amount of resources and memory it's going to use, but it is an area of active development and research for us. Awesome, thanks for answering. A quick follow-up: is there a specific type of query that you think is hardest to predict, or are some harder than others? Well, I think the ones we do better on are the ones where our cost models are better, and we've spent most of our effort modeling the cost of joins, less on accurate scanning and aggregation — and, as I mentioned earlier, we've done nothing on distribution. So I think we do a pretty good job of estimating the cost of joins. But I'm not sure I'm really answering your question on a relative basis of how much better it is — it's pretty tough to answer. Yeah, that makes sense. Thanks, that was really helpful. Thanks for the question.

Okay, so it looks like we've got about 10 minutes, which is about right in terms of time, I think, because I'm almost at my penultimate slide. Okay: the OS-level optimizations. This is where I think we do things that I don't think a lot of other vendors have done before — though I'm prepared to be corrected. This is where we've applied most of our IP concentration, in the principal areas of device driver management at the OS level, memory management, and thread management. We implemented our own PCIe NVMe SSD drivers and our own network adapter drivers; our drivers run in user space, bypassing the kernel. And one of the reasons we created our own drivers is partly down to the way we've approached the coding of the entire stack: it's asynchronous and polling in nature, with big use of things like coroutines in the code.
And so, to get the scale we want, there's a lot more polling that we're responsible for doing, and avoiding kernel overhead is absolutely critical. To really turn up the contrast: when a Yellowbrick system boots up, we make about six to ten Linux OS-level calls, and we make no further system calls. Those OS calls grab all of the memory in the system completely — we have our own memory management, our own memory allocator — and load up our device drivers, and from there on in, the core Linux kernel is really relegated to doing a few logging tasks. To the fullest extent possible, all of the processing and the processes are running in user space. From a memory management perspective, what we've designed is NUMA-architecture aware. We make big use of CPU pinning: at startup, as I mentioned, we grab almost all the memory and pin it to particular CPUs. We've written our own memory allocator, which we've measured to be about 10 to 100 times faster than the standard Linux one by avoiding locks. We pin this memory because we want to avoid kernel swapping and any risk of that at any point. And then we very carefully manage execution over a query's lifetime: we want to avoid cache displacement and memory fragmentation — even the way we allocate through our memory allocator, using two-megabyte huge-page allocations, is designed to avoid that kind of fragmentation. On the thread management side, the way we synchronize work across all of the MPP nodes in the cluster is quite interesting as well, because we effectively have a universal clock — a metronome ticking at around a hundredth of a second. What we're trying to do in each of those hundredth-of-a-second slots is have every worker on every node processing the same part of the same query at the same time. This is really important because, when it comes to, for example, data redistribution across nodes during the course of a query — perhaps for a join or an aggregate or a group-by — we want to avoid the build-up of network queue depths and anything that would mean those packets end up in main memory rather than being in a position to go straight through to the L2/L3 cache. So it's really important for us to synchronize the work done at every stage on every node, so everything is working on the same part of the query plan, for the same query, at the same time. We built a scheduler and a context switcher that can switch between these queries in on the order of 100 nanoseconds — we even ripped out the standard built-in scheduler from Linux and put our own in there. So we've done a great deal of work at the lower levels of the system, I would say, so that we can benefit at the higher levels in terms of efficiency.

This is amazing. Sorry — so you have your own device drivers, and obviously you guys are a database company, not a device driver company. Do you carefully pick exactly what hardware you're going to support in Yellowbrick, then? Well, it's more about how we manage memory, how we allocate memory, and the asynchronous nature — we have to build drivers for that reason. But yeah, effectively, if we cannot support a type of hardware using our own driver,
as I mentioned on this slide, we gracefully fall back to Linux, and we do try to do that. If there's something that isn't going to work using our own driver, we fall back to what's available in the underlying kernel.

The device driver is communicating with the device — if everything speaks NVMe, then in theory, no matter what the vendor is, you shouldn't have to rewrite that. That's correct, that's correct. Like I say, it's mainly about how we allocate the memory and how we code that driver to get the highest performance out of it. You're quite right: if it conforms to the PCIe standards, that's fine. If it's not a PCIe NVMe drive, we'll fall back onto something the kernel provides us.

Okay, and then you say your memory allocator is faster than libc malloc — sure, that's a low bar. But what about tcmalloc and jemalloc, all those other ones out of Facebook — are you still faster than those, or is it roughly the same idea? I haven't benchmarked against those, so I'm not sure how they compare. These parts of the stack we developed probably around 2016, so perhaps the state of the art has come on beyond that now, but we don't really concentrate too much on this area anymore. We're moving further up the stack: I think there are probably diminishing returns from where we are right now in concentrating any more on the operating system level, so we're really focused more on the planner and optimizer.

So in 2016, yes, a bunch of this stuff didn't exist — you had to roll your own. In terms of the device drivers — I keep coming back to the PCIe NVMe one, and maybe the NIC — do you have to have all of them? If someone were building this from scratch now, which device driver should they target first, or do you need all of them? What we found was actually really interesting. We were concerned when we moved to the cloud that we wouldn't have full control — we wouldn't be sitting on InfiniBand, for example; we'd be at the mercy of whatever NIC was in the particular AWS instance. That's why we went through a whole journey looking at RDMA and which instance types had NICs that supported it; we had to rule that out because it narrowed the pool of available instance types too much. We ended up with DPDK, and it turns out we can do this kind of cut-through operation and directly address the NIC through a virtual interface that AWS provides above the NIC, so we actually haven't had that problem: we still get low-level access to the NIC, but we haven't had to deal with the idiosyncrasies of particular types of NIC.

So what remains in the kernel — what's the bottleneck? If you could pick one thing in Linux to fix to make your life so much better, is there anything left at this point that you care about, that gets in your way? As I mentioned, I think we've pretty much eliminated the use of it altogether. When we were looking at moving to the cloud, and looking at networking and how we could improve it, the biggest bottleneck for us always tends to be kernel copies.
That's the biggest overhead we typically find — even during the course of this DPDK work on the networking side, that was the thing we kept hitting our heads on. Actually, I'm going to revise my answer, because one of the other bottlenecks we've struggled with is the S3 connector. We're still stuck with kernel overhead there to a degree: something like 30% of the overhead is just doing kernel copies in our S3 connector. Even though we've done a great job of boosting the throughput, we're still spending probably a third of the CPU on kernel operations in that library. So if there's something we could do better, it's there — and I don't think that's necessarily a fix in the kernel; it's about how we could do something better with AWS — but that's probably my answer; that's our biggest pain point now, I would say. In fact, my last slide will illustrate what I mean by that, in our performance results.

io_uring might solve that problem, if you're rolling your own access API. You've basically built a unikernel, right? You've removed or turned off everything you can, and the hypervisor is down below you. Yeah — and another concern that we had, a lesson learned going into this, is that we thought containerization and Kubernetes would get in the way. That turns out not to be the case: it hasn't been an impediment at all to supporting any of these Yellowbrick-specific device drivers, or to the performance gains we get from them.

I mean, how much of this did Teradata do in their enterprise stuff? I'm not sure I'm allowed to talk about that — but I think this is probably in the public domain: if you look back at the history of Teradata, there are points at which they've had their own kernel, right, and certainly their own... and I think it's only more recently that they've moved to a stock version of SUSE Linux. I think that's the case, but I might be wrong about that. And that's the question — I'd love any feedback from your team, because I don't see a lot of emphasis on the operating-system-level side of things, particularly amongst the public cloud vendors, which are very much focused on the database stack itself.

You've got the different pieces here, right? Like the thread management — ScyllaDB has their Seastar coroutine scheduler, and they're using DPDK to do kernel bypass; a couple of systems I can think of are using kernel bypass. Actually, the coroutine stuff also sounds like SQLOS from SQL Server: they basically try to avoid the OS scheduling anything. And then there's the older stuff, the DB2 for z/OS stuff — who knows what's in there, right. Right.
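Before the performance numbers, a toy model of the metronome-style synchronization described a couple of slides back, with invented names: a global tick fires roughly every hundredth of a second, and on each tick every worker runs the same fragment of the same query, so exchange data keeps flowing rather than piling up in deep queues. This is purely illustrative of the lock-step idea; the real scheduler context-switches in roughly 100 nanoseconds in user space.

```python
import time

class MetronomeScheduler:
    """Toy lock-step execution: on every tick, all workers run the same
    fragment of the same query, so exchanges between them stay balanced."""

    def __init__(self, workers, tick_seconds=0.01):
        self.workers = workers          # callables: worker(query_id, fragment_id)
        self.tick_seconds = tick_seconds
        self.run_queue = []             # list of (query_id, [fragment_ids])

    def submit(self, query_id, fragment_ids):
        self.run_queue.append((query_id, list(fragment_ids)))

    def run(self):
        while self.run_queue:
            query_id, fragments = self.run_queue[0]
            fragment_id = fragments.pop(0)
            start = time.monotonic()
            # Every worker processes the *same* fragment within this tick.
            for worker in self.workers:
                worker(query_id, fragment_id)
            if not fragments:
                self.run_queue.pop(0)   # this query has finished all fragments
            # Sleep out the remainder of the tick to stay on the metronome.
            elapsed = time.monotonic() - start
            time.sleep(max(0.0, self.tick_seconds - elapsed))

# Example: two "workers" that just record what they ran and when.
log = []
sched = MetronomeScheduler([lambda q, f: log.append(("w0", q, f)),
                            lambda q, f: log.append(("w1", q, f))])
sched.submit("q1", ["scan", "join", "aggregate"])
sched.run()
print(log)  # both workers run scan, then join, then aggregate, in lock step
```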
I know we're over time, but I just wanted to talk a little bit about performance. As a vendor, I hate putting up performance slides around standard benchmarks like TPC-H and comparing them to the competition, because I think the last thing you should ever believe is what comes out of a vendor's mouth when they talk about performance benchmarks they've run themselves. But I thought it might be interesting to show you what a system running in AWS, software only, looks like, and I wanted to bring out the impact of caching — as you'd expect, it's pretty terrible when the caches are cold. What I actually ran was just the 22 queries in TPC-H at 10 terabytes of scale. I had a 10-node system running in AWS, pure software only, data persisted in S3, and I just ran a sequential run of the queries — so it doesn't tell you anything about concurrency handling, it just tells you a little about our overall performance on a single sequential run, and the impact of caching. There's a histogram of where the query times fall for hot runs versus cold runs, and you can see the impact that going directly to S3 has on the cold runs on the right-hand side — it's about two and a half times worse on that first run — and then when the caches get warmed up, the run times drastically reduce and the geometric mean time improves, and so on. This is a 10-node AWS system; they're pretty hefty nodes — 96 virtual CPUs per node, with eight NVMe SSD drives attached to each node — so there's some pretty heavyweight metal going on here. But of course what we're trying to do is translate what we do in our own on-prem hardware — the performance and compute density — and make that available to customers in the cloud for those highest-concurrency, highest-scale use cases. So it's a bit of an example: if you want to go off and compare these geometric mean times and run times to other published times, you can do that — no worries from me — and you can get an indication of how quick we are. And with that, that's my final slide, so: any other questions?

I don't want to hold everyone up — Mark, thank you, this is super, super fascinating. Alright, I'll go to the floor if anybody has questions for you.

Yeah — any comparison with the ClickHouse guys? You're pretty fast. Yeah — ClickHouse we don't come across in the market very often. There is one example I know of where we've come across them, and I think they do some interesting things to help speed things up with their materialized view approach. But nothing that has concerned us as we've put ourselves through a competitive bake-off with them — and, as I mentioned, we've only come across them once.

It's almost like a textbook modern data warehouse, right? Like, you do all the things we teach now in the advanced database class in grad school — except for the kernel bypass stuff, that's definitely different.
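For reference, the geometric mean quoted for runs like this is just the n-th root of the product of the per-query times; a quick sketch with made-up numbers:

```python
import math

def geomean(times_seconds):
    """Geometric mean of query times: exp of the mean of the logs."""
    return math.exp(sum(math.log(t) for t in times_seconds) / len(times_seconds))

# Hypothetical per-query times (seconds) for a cold run vs a warmed-cache run.
cold = [12.0, 45.0, 8.0, 30.0]
hot  = [ 5.0, 18.0, 3.0, 12.0]
print(round(geomean(cold), 1), round(geomean(hot), 1))   # 19.0 7.5
```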
My question is: do you guys face any long-term engineering challenges with supporting DPDK and the OS-level device drivers and all the memory stuff? Sure, you do it once and you're done, but over time — as you've migrated from on-prem to the cloud, or in supporting AWS and other clouds — are you facing challenges supporting it from an engineering perspective? You know, the cloud version of Yellowbrick is very new, so I don't think we've got the history to answer that question. But there's a learning curve: the approaches we take, what we do at the operating system level, the coding approaches — there's a steep learning curve to the way you have to write code at Yellowbrick. So I'd say that is a difficulty. There are certain rules and conventions you have to follow when allocating and releasing memory to do it right, avoiding locking at any point. There's a well-established set of criteria that any developer has to learn the ropes around before they can contribute in a meaningful way here. So I would say, if anything, it can be challenging for new developers coming in. But, like any other database company, we're hiring like crazy — we doubled our business last year, we want to double it again this year — and if you're interested in doing this kind of work and coming to work with us, we'd love to talk to you.

One quick question. You showed the DDL to create a compute cluster, which looked fairly static. How expensive is it to change it? Because you say you have elasticity in your system — so if I need to change the size from, say, a 64-node to a 128-node compute cluster, how long is that going to take? Okay, so there are two answers: it depends whether or not you've already provisioned the underlying cloud infrastructure. Otherwise it's going to take five-plus minutes just for the standard EC2 instances to spin up. But if they're already available and you've got a reserved capacity pool of them, the time comes from the time it takes to pull the container for that worker from the Elastic Container Registry, which is where the container lives, and the time it takes to boot up the worker. So I think you're talking a matter of a minute or two to expand that cluster.

How does that impact the mapping of the shards to the instances of your compute cluster? So — and in fact there's a longer, more detailed answer that I think you're getting at anyway — when data is loaded into a cluster with a particular number of nodes, the shards are distributed around those nodes, and each node owns its own set of shard files. When the cluster is resized, we use this rendezvous hashing to decide how we rebalance that shard file ownership across the new set of nodes. So after a reconfiguration — which happens very quickly, because it's just an exchange of file names between worker nodes — we've got a balance of shard files. But of course, if you loaded a table with a particular data distribution key, that doesn't mean the data within each shard file is still correctly distributed.
More likely than not, when you expand the cluster by one node, you've now got misdistributed data within a particular shard file as far as the distribution key is concerned. So what we do, by default, is put distribution nodes directly above the table scanning nodes in our plans: distribution nodes are built in, and we assume that data is badly distributed. Now, when we find ourselves in a cluster where we distributed on a key and we're still in that same cluster configuration, that's a special case where we remove those distribution nodes from the query plan, because we know everything is perfectly distributed, down to the individual record level in the shard. So that was a rather long answer — did that answer your question? Yes, I believe so; I'll have to think about it more, but I have maybe one quick one, if that's okay. Just quickly on the write-around cache: how does that work? It's actually remarkably simple. We decided not to implement a model where we'd hydrate the cache first as part of the commit to S3. So essentially the workers, as they process loaded data, simply — via a transaction controlled by the Postgres transaction manager — persist that data directly, in shard file form, into S3. It's as simple as that. Well, it's not quite as simple as that: the worker then broadcasts to workers in other clusters the fact that these new shards are available. There's obviously still a single source of truth in the Postgres catalog at the end of the day, but we do that for optimization purposes, so other clusters can be aware of new data without having to reference the catalog.
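As a footnote to the resizing discussion, here is a small self-contained sketch of rendezvous (highest-random-weight) hashing — the textbook technique named in the talk — with invented shard and node names: each shard is owned by whichever node scores highest for it, so adding a node only moves the shards whose top-scoring node changed, which is why the reconfiguration can amount to an exchange of file names.

```python
import hashlib

def owner(shard_id, nodes):
    """Rendezvous hashing: the node with the highest hash(shard, node) wins."""
    def score(node):
        digest = hashlib.sha256(f"{shard_id}:{node}".encode()).hexdigest()
        return int(digest, 16)
    return max(nodes, key=score)

shards = [f"shard_{i:03d}" for i in range(10)]
before = {s: owner(s, ["node1", "node2", "node3"]) for s in shards}
after  = {s: owner(s, ["node1", "node2", "node3", "node4"]) for s in shards}

moved = [s for s in shards if before[s] != after[s]]
# Only the shards whose highest-scoring node is now node4 change owners
# (roughly a quarter of them in expectation); everything else stays put.
print(moved)
```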