The Carnegie Mellon Vaccination Database Talks are made possible by OtterTune. Learn how to automatically optimize your MySQL and Postgres configurations at ottertune.com. And by the Steven Moy Foundation for Keeping It Real; find out how best to keep it real at stevenmoyfoundation.org. We're delighted today to have Stephen Walkauskas, the Director of Engineering at Vertica. He's been there for a long time, certainly in the early days before they were acquired by HPE, and then obviously when they moved on to Micro Focus. So as always, if you have any questions for Stephen, please unmute yourself, say who you are, and ask your question directly. And feel free to do this anytime and interrupt him, because we want him to feel like he's actually giving a talk in front of people and not just to his laptop on Zoom for an hour. So, Stephen, with that, thank you for being here. The floor is yours. All right. Thank you very much, Andy. And yes, please do interrupt and ask questions. I want to deliver the information as clearly as possible, and it'll help to clarify for yourself and for others if you interrupt me and ask me questions. I titled this talk High Performance Over Varying Terrain. You could look at the varying terrain in two different ways. One is that Vertica has customers who are deploying our system at very different scales. We have some that are running single-node Vertica instances on a couple terabytes of data. We have customers like The Trade Desk who are running in the cloud, and they have replicated databases, one on the east coast, one on the west coast. Each of those databases is over 600 nodes and 12 petabytes of data. So even within a single version release of Vertica, there's quite a variance in the way that the product is deployed. The other way you could look at the varying terrain is over time, since the inception of Vertica.
The C-Store paper was published in 2005, and obviously over the time since then to now, there have been substantial changes in hardware and in the way that the database is deployed. I chose these images carefully. The one on the left, the Jaguar, is their Formula E race car. Jaguar is a partner of Vertica. They're using our database to do analytics on data that they're collecting, metrics that they're collecting during the race, including weather and a bunch of metrics about the car and its performance. And then they're using that both to provide optimizations during the race and also to do comparisons after the race and figure out how they can race better. If you didn't see the Vertica logo the first time, I added this obnoxious red arrow to make it clear. So that's on the actual Formula E race car now, the Vertica logo, in the partnership. Do you guys pay them for this, or is it a partnership? It's a partnership. So they're using our products in doing analytics on racing data, and they wanted to recognize us as part of that partnership. So they actually approached us about putting the logo on their Formula E car. And of course, we thought it was awesome to have it appear there. So a little bit about what Vertica is. I'm sure many of you have already heard of C-Store or Vertica. Vertica is a distributed SQL analytics database. What's it mean to be an analytics database? We're answering questions that require doing computation over, oftentimes, billions of rows. A simple example might be: show me the stock ticker symbols that are most frequently traded over the past month. Vertica was commercialized from the C-Store paper, which was written by Mike Stonebraker and many others. At the time that I was interviewing at Vertica, Mike had already moved on; he had become bored with OLAP and moved on to OLTP. And it was actually the OLTP group, the VoltDB folks, which, if you're not aware, Volt stands for Vertica Online Transaction Processing Database.
It was actually those folks that I first interviewed with, and I missed an opportunity to work with Andrew Pavlo. One of my great career regrets. Fortunately, I joined an outstanding team at Vertica and it's been an awesome experience for the past 13 years. I started as an individual contributor and I've worked up through being a manager, opening an office in Pittsburgh, and director of engineering at Vertica. All of you know that enormous amounts of data are produced every minute. This is from 2019. One thing that I found interesting here is that the number of emails sent is 10 times the number of text messages sent or received. Another thing that's interesting to think about is that though the way this data is collected may have changed dramatically over the years (you think about emails and text messages and so forth), they don't need to be stored in transaction processing systems, because they're not updated. Something like a log-structured merge tree will do just fine, as long as it has some reliability guarantees applied to it. Though there was a movement away from consistent analytic systems to eventual consistency, there's been a large migration back to systems that have guarantees around consistency, because they're much easier to reason about. And as long as they can scale to handle the magnitude of the problem, then it's better to have consistent results than to have to worry about ones that will eventually be consistent. Vertica is heavily used in a number of different verticals: FinTech, AdTech, and so forth. One of my favorite examples is in the healthcare space. Cerner has developed a system that they call HealthFacts that they use to detect early onset of sepsis. Sepsis is a bacterial infection that, when detected early, is relatively easy to treat with antibiotics. However, when it progresses to later stages, it becomes very difficult and expensive to treat, with long hospital stays and a 50% mortality rate.
And so it's been really neat to see and work with the crew at Cerner, the advancements that they've made on early detection of sepsis, and the millions of lives that they've been able to save because of that. In general, Vertica's customers value performance and flexibility. It has been about performance since day one, when Vertica was outperforming row stores by several orders of magnitude. In today's marketplace, where you have options to deploy on the cloud, on premise, in a private cloud, on Kubernetes and so forth, Vertica supports all those options as a software-only product, now also with our Vertica Accelerator, which is our as-a-service offering for Vertica. And so we have a breadth of flexibility for our customers to be able to deploy the product, and we have some customers who are deploying both on premise and replicating into the cloud, where they can handle bursty workloads, especially for those who are in e-commerce and see a big rush during the holidays. All right, so that's more of what Vertica is and where it's positioned in the marketplace. If you've read the C-Store paper, then these five pillars are going to be very familiar to you. Vertica stores its data in projections, and one of the important attributes of projections is that you can specify the sort order. So if you're storing stock trades, for example, you can specify that you're going to order the entries first by symbol and then perhaps by date or other attributes of the table. One misconception that I find people have about projections is that they think each of the columns is sorted independently. When we're sorting rows of data, say you're doing an insert into a Vertica database, the sort key specifies the order in which we're going to sort those rows. Each column of a row is going to appear in the same ordinal position as the other columns. So we're sorting by a key, but it's an entire row that's positioned in the projection. Projections are sometimes confused with indices.
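The projection layout just described, whole rows sorted once by the sort key, with every column sharing ordinal positions, can be sketched in a few lines of Python. This is a hypothetical illustration, not Vertica code; the table and values are made up.

```python
# Hypothetical sketch (not Vertica code): rows are sorted as whole rows by
# the projection's sort key, then split into per-column arrays that share
# ordinal positions. No column is sorted independently.
trades = [
    ("MSFT", "2021-03-02", 231.60),
    ("AAPL", "2021-03-01", 127.79),
    ("AAPL", "2021-03-03", 122.06),
]

# Sort entire rows by (symbol, date), the projection's sort order.
trades.sort(key=lambda r: (r[0], r[1]))

# Columnar storage: position i in every list refers to the same row.
symbols = [r[0] for r in trades]
dates   = [r[1] for r in trades]
prices  = [r[2] for r in trades]

# Row 0 is still one coherent row across all three columns.
print(symbols[0], dates[0], prices[0])
```

Because position i is the same row in every column, a range found by binary search on the first sort column immediately locates the matching entries in every other column.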
They do some things that are very similar in improving performance; the optimizer can use the sort order of a projection to, say, efficiently apply a predicate. A significant difference between projections and indices is that projections are stored in sort of a log-structured merge tree, and so they can be created at the same time that data is being queried. There's no need to pause indexing during loads with projections, as there was with databases that support indexes. The other significant difference is that with an index, once you've done the binary search to find the range of values that match your predicate, you then have positions that you go and look up to find the actual values corresponding to that range. And those position lookups require seeks on disk, which can be quite expensive, whereas with projections you have the entire table sorted in the same order, and so when you go and look for the range of values that match that predicate, you're then doing scans over all the other columns. When storing data oriented in columns, the entropy going down a column tends to be much less than the entropy going across a row. And so we get compression levels that on average are 10x relative to the raw data size, and oftentimes significantly greater than that, especially for those columns that appear in the sort order. We can also do RLE encoding of the data. So if you have stock symbols and you're ordering by those symbols, there are only a few thousand symbols but billions of trades. So we get a sorted column that has only a few thousand entries, and the run length encoding tells you how many occurrences of each entry appear in that run. The data is columnar, which has been a significant advantage from the beginning, but these days even more so than in the early days of Vertica.
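The run-length encoding idea for a sorted symbol column can be sketched as follows. This is a generic, hypothetical RLE, not Vertica's on-disk format.

```python
# Hypothetical sketch of run-length encoding a sorted symbol column:
# long runs of repeated values collapse to (value, run_length) pairs.
def rle_encode(column):
    runs = []
    for v in column:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([v, 1])       # start a new run
    return runs

def rle_decode(runs):
    out = []
    for v, n in runs:
        out.extend([v] * n)
    return out

col = ["AAPL"] * 4 + ["MSFT"] * 3 + ["TSLA"] * 2
encoded = rle_encode(col)
print(encoded)  # [['AAPL', 4], ['MSFT', 3], ['TSLA', 2]]
assert rle_decode(encoded) == col
```

A few thousand distinct symbols over billions of trades means the encoded column holds only a few thousand pairs, which is where the large compression ratios on sort-order columns come from.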
With columns, you can do joins before loading the data into the database and avoid doing those more costly joins at query time. So we see a lot of customers who are denormalizing their data. Gaming companies, for example, when they have a new game, will add a new column to the database, and for the games that don't use that attribute, they're storing null values. Those null values compress very well. There's very little cost to adding columns that are infrequently used or that are sparse to a columnar database. The evidence that we've seen in our product is that we went from supporting 1,600 columns as the maximum limit to now supporting 10,000 columns in the database, based on the requests that we've had from customers. Another attribute of a projection is that you can specify... But 10,000 feels like an odd number. Why not a power of two or whatever it might be? Well, yes. You often find in the code base places where we've rounded, based on it being an easier number to provide to customers than a 2,000-something value. But you can tell them it was 10,000 versus 2,000 or whatever. Yes, and there's actually a knob that controls this. We set the upper bound at a particular limit. It really is workload dependent, but that was a safe limit that we felt comfortable with. It will depend on the types of the columns themselves. If you have very wide VARBINARY or VARCHAR columns, then you're going to get fewer than that, at least for performance that's acceptable. I'm thinking Oracle famously has 1,000 because it's hardcoded in the code. It's hardcoded in the source code. People there told me they already changed it once and it was such a big pain that to change it again would be a major undertaking. Nobody wants to do it. When we bumped the limit up, we put it behind a knob so that we wouldn't have that problem going forward. If you know which knob to turn, you can turn it up above 10,000.
As I said, depending on the types of the columns, if they're all integers, you can certainly turn it up without running out of resources, especially on these 1TB-memory machines that people have these days. You have to pick a number as the default, and 10,000 was what was chosen. Another attribute of projections is that you can specify the segmentation. Vertica is very careful about the way we place each of the rows within the cluster of nodes. We do this so that we can apply optimizations like co-segmented group bys and co-segmented joins. If you've segmented a projection on keys that you're going to use in the group by, then we can apply that group by without having to redistribute the data among the nodes in the database, and similarly with a co-segmented join. I don't know if we'll have time to get to that part of the presentation, but if we do, we'll talk about how elasticity is impacted by that choice of careful placement of the data. Drilling in on projections in a little more detail, using this trades example: suppose we decided to sort by symbol and secondarily by date. The trades table might have many other columns, like the bid and the ask price, the quantity that was traded, the broker that made the trade. It's common that if you're storing data in an analytic database, you want to store everything, just in case you later find out that some bit of information is useful. As I said, each column of a particular row is stored in the same position. That means that when we search the symbol column, we can do a binary search on it, since it's first in the sort order, and then any of the other columns will appear within the range that we find, in this case for the Apple ticker symbol. For the query that I've included in the bottom left there, we're selecting the minimum date of any trades made on this Apple symbol, and the date is second in the sort order.
So this is a matter of simply finding the first block within the positional index, which I'll talk about in just a second, and we don't need to scan beyond that, because we know that the dates are sorted and that the first one is going to contain the minimum once we've put a constant value on the symbol. Drilling into the structure of the content as it's stored in the database: each column, at least logically, is stored in two separate files. There is what we call a positional index, or the PIDX file. That contains the min and the max value, as well as a null count and a CRC so that we can check for file corruption in the FDB file, an offset into the FDB file, the uncompressed size of the block that's stored within there, and the compressed size as well. The FDB file is the file that contains the actual data. It's been encoded and compressed, though I've shown it here in plain text. Obviously, as stored on disk it would not be human readable, though if you could guess the encoding, or if you had the catalog and could look it up, then you could decode it, decompress it, and read the data. So it's not encrypted, at least not without additional functionality being applied to it. For integers, we take a block of 64K, a buffer of that size that's already been encoded. We run compression on it, which might be LZO or a delta compression, and then we store that data compressed in the block. So what's happening when we're doing a search for a range of values? Within the PIDX file, we can do a binary search, because we've ordered by that column. When we find the range of values, that gives us pointers into the FDB file where we can find the compressed blocks, and then we can uncompress and jump to the ordinal position within a block to find the corresponding cell entry. There's a trade-off, obviously, in the size of the block that we're compressing.
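The positional-index lookup described above can be sketched like this. The field names and block boundaries here are illustrative, not Vertica's actual PIDX layout.

```python
import bisect

# Hypothetical sketch of a positional index: one entry per compressed block
# of a sorted column, holding the block's min/max and its offset in the
# data file. Field names are illustrative, not Vertica's actual PIDX layout.
pidx = [
    {"min": "AAPL", "max": "CSCO", "offset": 0,    "comp_size": 1800},
    {"min": "CSCO", "max": "MSFT", "offset": 1800, "comp_size": 1750},
    {"min": "MSFT", "max": "ZM",   "offset": 3550, "comp_size": 1900},
]

def blocks_for_predicate(index, value):
    """Binary-search the block mins, then keep only blocks whose [min, max]
    range could contain `value`; only those need to be decompressed."""
    mins = [e["min"] for e in index]
    hi = bisect.bisect_right(mins, value)   # blocks past hi start above value
    return [e for e in index[:hi] if e["min"] <= value <= e["max"]]

hits = blocks_for_predicate(pidx, "IBM")
print([h["offset"] for h in hits])  # [1800] -- only the middle block qualifies
```

Because the column is sorted, the binary search plus the min/max check prunes every block that cannot contain the predicate value, so only the matching blocks are read and decompressed.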
Bigger, and we would get better compression, but too big, and we end up reading and uncompressing a lot of values that aren't going to contribute to the calculation that we're making in the query. So there's a trade-off between too big, with wasteful decompression, and too small, where we don't get good compression on the data. So let's look at a relatively simple plan to load data into a Vertica database. If you've read the C-Store paper, it makes mention of something called a buddy projection. This is a replica of the projection that can be used for what's called in the C-Store paper K-safety. So for K-safety of one, you need at least one replica. That way you can lose any node in the database and continue running. And in fact, you can lose more nodes than just that one, as long as you don't lose both buddies of any projection, so that you still have a consistent collection of the data. The C-Store paper talks about each of these buddy projections being able to have a different sort order from one another. The benefit of doing that would be that the optimizer can choose the projection that's best suited for the particular query. So in the previous example, where I was using symbol as the predicate, if there were a projection sorted by symbol, the optimizer would choose it. If I then queried and used timestamp as the predicate, then the database would look for a projection that had timestamp as the primary sort order and use that for the query results. So what's going on here, if you read from the bottom up: the load is the thing that's parsing the tuples, parsing the delimited file or Parquet or ORC and turning them into tuples. They then get copied and sent to this resegment send. The resegment send is then going to look at the segmentation key of each of those tuples and determine which node to send it on to. It gets received by that node.
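The resegment step can be sketched as a hash of the segmentation key that deterministically picks a target node. This is a hypothetical stand-in: the hash function, node count, and tuples are made up, not Vertica's actual segmentation scheme.

```python
import zlib

# Hypothetical sketch of the resegment send: each parsed tuple is routed to
# a node by hashing its segmentation key. Co-segmented projections that hash
# the same key the same way land matching rows on the same node, which is
# what makes local (no-network) joins and group bys possible.
NUM_NODES = 3

def target_node(seg_key: str) -> int:
    # Any stable hash works for the sketch; crc32 stands in for the real one.
    return zlib.crc32(seg_key.encode()) % NUM_NODES

tuples = [("AAPL", 127.79), ("MSFT", 231.60), ("AAPL", 122.06)]
routed = {}
for t in tuples:
    routed.setdefault(target_node(t[0]), []).append(t)

# Every AAPL row hashes to the same node, regardless of arrival order.
print(routed)
```

The key property is determinism: two tables segmented on the same key with the same hash place joinable rows together, so the join or group by never touches the network.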
And then the data target, the thing at the top, is what's doing the sorting, encoding, compressing, and writing out those PIDX and FDB files. So in this plan, the copy is being done on the in-memory representation of the tuples, and then those tuples are being sent over the network to the primary and to the buddy node. It wasn't long in Vertica's history until we discovered that having buddy projections with different sort orders is a bad idea. Does anyone want to wager a guess as to why that is? So if you think about... Sorry. It makes recovery harder? It makes recovery really, really difficult. If you have terabytes of data stored in, say, your largest fact table on each of the nodes, then those terabytes of data have to be re-sorted in order for the node to recover. And that's a relatively expensive operation to run. In some instances, for our customers, it was taking many days to recover. And DBAs are usually... their heart rate is elevated whenever recovery is going on. They're worried about the replica copy going down and losing the database, or having to go to backup to recover it. So it turned out that having buddy projections with different sort orders was not a good idea for a large-scale database. So we were quick to address that and change the original plan so that... well, we actually just put the requirement on the projections that you couldn't create a buddy with a different sort order. That was relatively easy to do. But then with the previous plan, where we're making copies of the tuples and re-segmenting them to each of the nodes, there's an inefficiency, because the copy of the tuple is significantly larger than the compressed and encoded ROS format in the PIDX and the FDB files. And so we rewrote some of the code and redesigned the load process so that the records were fed to a single data target for the segment of the data that was being written.
And then once the ROS file was created, the PIDX and the FDB files were streamed, pipelined, over to what we call the data target proxy. That data target proxy is a pretty simple operator that just receives the storage format and persists it to disk on the buddy node. So that's a high-level explanation of what's going on when data is loaded into a Vertica database. Digging a little deeper into the data target itself, you can see that there are multiple threads working on queues to do the sort in parallel. You might be loading massive amounts of data into the database, and we want to make use of all the processing power that we have in order to do the sort. The sort might have to be done in multiple merge phases. So if we've loaded large enough amounts of data, we load it into a large buffer, we do something like a quicksort in memory, we compress that data, and we write it out as a chunk. And then if we've done that with very many chunks of data, we might have to make multiple passes in order to do the merge efficiently. If you tried to do the merge all in one pass, you wouldn't have enough resources to keep the data buffered, and you'd do a bunch of disk seeks every time you tried to access the next record in a compressed chunk of sorted file. And that would be very expensive, more expensive than doing it in multiple passes. Once we've done those multiple passes over the sorted chunks, then we're writing the result into the PIDX and FDB files and streaming those over to the data target proxy as they're being written. Today, when we're writing to our persistent storage, it's shared storage: S3, or HDFS, or Google Cloud Storage in GCP, or Blob Storage in Azure. However, it still makes sense for us to also send another copy of the data to some of the other buddy nodes.
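The multi-pass merge just described can be sketched with in-memory lists standing in for spilled runs. This is a hypothetical illustration of the technique, not the data target's actual code; chunk size and fan-in are made-up parameters.

```python
import heapq

# Hypothetical sketch of the data target's external sort: sort fixed-size
# chunks in memory (stand-ins for spilled runs), then merge them with a
# limited fan-in per pass rather than all at once, which keeps per-run read
# buffers large and avoids seek-heavy single-pass merges.
def external_sort(records, chunk_size=4, fan_in=2):
    # Phase 1: quicksort-like in-memory sort of each chunk.
    runs = [sorted(records[i:i + chunk_size])
            for i in range(0, len(records), chunk_size)]
    # Phase 2: repeatedly merge at most `fan_in` runs until one remains.
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[i:i + fan_in]))
                for i in range(0, len(runs), fan_in)]
    return runs[0] if runs else []

data = [9, 1, 7, 3, 8, 2, 6, 4, 5, 0]
print(external_sort(data))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

With real spilled files, the fan-in limit is what bounds how many runs must be buffered simultaneously, which is the resource constraint the talk mentions.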
In this case, we call them subscribers, because they keep a local cache of the recently written data on their local disk, so that we can reduce the number of times we have to access shared storage. And so we have the file system cache that we call the depot on the local disk of each of these nodes, and then a buffer pool that's caching the data in memory as well. So that's a bit about how we changed our design after realizing that having different sort orders for the buddy projections didn't work well for our customers who are storing large fact tables. There have also been a lot of changes to the hardware that Vertica runs on. When Vertica first started, we were running on local disk. Now, as I said, we're running on shared storage and caching on local disk. All of our customers in the early days were running on bare metal; we didn't support anything else at that time. And now our new customers are more likely to run in a virtualized environment, either a public or private cloud or a VM of some sort, than they are on bare metal. CPUs: the cores, the frequencies, are no longer getting faster. In some instances, beefy machines have slower cores, but they have more of them on a die, so that the overall compute power is increasing, but the clock speed of an individual core isn't. So we've had to change the scheduling and the threading model that we use within our execution engine. And networking, either because of hardware changes or because of... Can you talk a little bit about the scheduling changes you guys made? Yes. We used to depend on the OS thread scheduler, and we would create a number of threads roughly equal to the number of cores on the box for each of the plans we were executing. And then we would depend on that to give us reasonably good parallelism for that query execution. At that time there would be 8, 16, maybe 32 cores on a box.
And now we're seeing machines with significantly more cores on the box, and so we're more careful about how we do our thread scheduling. We now handle it within the software instead of depending on the OS. Did you originally depend on the OS because Vertica started off as a fork of Postgres and that's what Postgres does? But at that point, before you did this major rewrite, you guys had already removed that worker-per-process approach that Postgres was doing. Yes. The worker-per-process approach that Postgres used, and its use of global memory across separate processes, was a significant headache for the early engineers of Vertica to work around. But this is different; our execution engine had already worked around that. We were, however, depending heavily on the operating system thread scheduler, and as the number of threads we were running increased, we created a process queue and our own thread scheduler, which also gives us more control over prioritization and some other things. We had played some games with nice values on threads, depending on the operating system scheduler to give us the behavior that we wanted, but it never really gave us that behavior. There are lots of different reasons for that. Some of the operations that you run are exclusive per process, memory mapping and so forth, and we just lacked fine-grained control over prioritization of threads. Thank you. The other significant change has been with the network layer. Networking has gotten faster relative to CPU than it was previously. We used to be able to depend on networking being the bottleneck, and now that's not always the case. There was a need to look at the way that we're doing sort, and in particular the way that we're doing key comparisons, in order to adjust to the changing landscape of the hardware and make our query execution faster. Sort is important, as I described: we use it in our data target when we're writing out these ROS files.
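The in-software scheduling idea, a task queue the engine drains itself rather than handing every task straight to the OS scheduler, can be sketched as follows. This is a minimal hypothetical illustration; the priorities, task names, and pool size are made up, and the real engine's scheduler is far more involved.

```python
import queue
import threading

# Hypothetical sketch of scheduling work in software: tasks go into a
# priority queue, and a fixed pool of worker threads drains it, so the
# engine, not the OS, decides what runs next and in what order.
def run_tasks(tasks, num_workers=4):
    """tasks: list of (priority, name); lower priority number runs first.
    Returns the order in which tasks were dequeued."""
    pq = queue.PriorityQueue()
    for t in tasks:
        pq.put(t)
    order, lock = [], threading.Lock()

    def worker():
        while True:
            try:
                prio, name = pq.get_nowait()
            except queue.Empty:
                return                      # no work left; worker exits
            with lock:
                order.append(name)
            pq.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return order

# With one worker the dequeue order is fully deterministic by priority.
print(run_tasks([(2, "analytics"), (0, "recovery"), (1, "load")], num_workers=1))
```

The point of the sketch is the control it gives: changing a task's priority number reorders execution directly, which is the fine-grained prioritization the OS-level games with nice values never reliably provided.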
It's also important when we're doing query optimizations like a pipelined group by or a pipelined merge join, where we can use the sort order of the projection to do the join or the group by efficiently, especially if you want to do those operations across multiple threads. I'm sure most of you are aware that an operation where you don't have to worry about sort order tends to be significantly easier to parallelize than one where you have to maintain sort order under parallelization. We end up doing a lot of things where we take sorted streams, we segment them, we do operations like partial group bys, and then we bring those streams back together, maintaining the sort order, doing something like a tournament sort, so that we can run them across multiple threads and get high throughput. We asked ourselves: could we do the sort faster, and in particular the comparison? We compared ourselves to Tencent's Indy and Daytona sort benchmark results, and those were faster than what Vertica was doing, so we decided that yes, we can do sort faster, and we set about achieving that. A little bit of background on how data is transferred between operators within our execution engine, operators being things like group by, join, resegment, and so forth. When we move data between those operators, it's stored in blocks that are either row or column major, but you can pick out individual cells within those blocks; they're self-describing. So you might have a block of timestamps that we can then vectorize and send through an expression that might apply a date difference to each of the cells within that block, and having that column oriented makes evaluating that expression very efficient, vectorizing it so that we get a bunch of operations within a single virtual method call.
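The pattern of partial group bys over sorted segments, then a merge that preserves the sort order while combining groups that straddle segment boundaries, can be sketched as follows. This is a hypothetical illustration using Python's heap-based merge as a stand-in for the tournament sort; the data is made up.

```python
import heapq

# Hypothetical sketch: each thread does a partial group by over its own
# sorted segment, then a tournament-style merge recombines the partial
# results in key order, summing groups that span segment boundaries.
def grouped_counts(sorted_stream):
    out = []
    for k in sorted_stream:
        if out and out[-1][0] == k:
            out[-1] = (k, out[-1][1] + 1)
        else:
            out.append((k, 1))
    return out

seg1 = ["AAPL", "AAPL", "IBM"]      # each segment is already sorted
seg2 = ["IBM", "MSFT", "MSFT"]
partials = [grouped_counts(seg1), grouped_counts(seg2)]

merged = []
for k, n in heapq.merge(*partials):  # preserves key order across streams
    if merged and merged[-1][0] == k:
        merged[-1] = (k, merged[-1][1] + n)
    else:
        merged.append((k, n))

print(merged)  # [('AAPL', 2), ('IBM', 2), ('MSFT', 2)]
```

Because the merge consumes each partial result in sort order, the combined output stays sorted too, so downstream operators can keep exploiting the projection's sort order without a re-sort.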
So in order to do a comparison between each of the values within the key that we're sorting by, we have to consider the type of each column, the descending or ascending specification, and the null ordering, whether nulls come first or last. And then there are some other pieces of content that are important for finding the values within those blocks, like the location of the first row, the stride between each of the columns, the number of rows, and so forth. But those first three, the ones in bold, are inspections that have to be done for every single comparison. And so with the fifth version of our execution engine, this being one of the design changes we were making, we looked at how we could do the encoding differently, so that we could do a word-by-word comparison with no access to metadata, in order to speed up the comparisons that the sort is doing. Could we push all the logic needed to do the comparisons correctly into an encoding on the keys? The goal was to have a comparison algorithm that looked like a very simple bit of code, and in fact this is the exact code, not a snippet or a summary, for the comparison that we're doing within our execution engine. That works fine for integers, which can be packed into words easily enough, but how do you make it work for variable-length values? For example, suppose you had these two tuples, where the first tuple has "close" as the first column in the sort order and "but not the same" as the second, and the second tuple has "close but" as the first column and "not the same" as the second. How do you differentiate between these two tuples if you're going to glom them together into contiguous words? Are you asking, like, did you look at fuzzy matching, or what are you asking? Sorry, it's going to be an exact comparison. So in this case, what's on
the left-hand side would come out first in the sort order, because the column that contains "close" would sort lower than "close but". Sorry, "close" and "but not the same" versus "close but" and "not the same", that's the example data. So in the first tuple, the first column contains "close" and the second contains "but not the same", and in the second tuple you have "close but", and then in the second column "not the same". Right, and you want to glom these into one contiguous array of words that you can do a fast comparison on. So, ignoring for the moment what we're doing to the first byte in the word, where we're applying this XOR in order to make it compare correctly relative to null values, which we'll talk about in just a second, the solution is that for a variable-length value, you store the bytes of the value within the word, and at the very end of the word you store the number of bytes it contains, plus one if the value continues on into the next word. So this allows us to do a word-by-word comparison of all the variable-length values, possibly spanning multiple columns, and compare them with that very simple algorithm I showed earlier: just a simple for loop with very little branching involved. The key here is that we're applying an encoding to the keys that are going to be compared, in order to come up with words where comparing each word in the array determines whether one value sorts before or after the other. As for null values: with nulls first, which is our default, the null value becomes the minimum integer value, and an empty string becomes one greater than that. Then we apply the logic I described before, where we XOR the first value and store each of the subsequent bytes in the word, the last byte being the number of bytes in the word, plus one if it continues on to the next word. By encoding the values prior to doing the comparisons (because each key is going to be compared many more times than it is encoded), we saw a significant performance improvement. In this chart, higher is better, because we're looking
at the number of gigabytes per second per socket that can be sorted. And as I said previously, before we undertook this redesign, Vertica was slower than both Tencent's Daytona and Indy results. After doing the encoding trick and applying the simple comparison algorithm that I showed, Vertica became significantly faster. This is a single-node comparison between an early implementation of this EE5 encoding and one done as recently as this year. That has an impact not just on sorting data to store it in the ROS format, as the data target does, but also when we're applying a group by algorithm, specifically when there is a significant number of distinct values in the group by. The X axis has the number of distinct values and the Y axis has the amount of time, so in this case lower, smaller bars, are better. As the number of distinct values increases, the performance benefit we get from using this encoding to do the comparisons increases as well. And that's particularly important, because if you have a small number of distinct values, then a group by hash will work just fine; you're not going to have to worry about spilling to disk. But when you have large numbers of distinct values, you'd prefer to run it through a group by pipe and avoid having to spill to disk. Any questions on the encoding and the sort comparison that we're doing? Can I ask you a quick question? Yes, absolutely. So how do you handle it when you have multiple count distincts, like on A, B, C? The data science guys do that quite a bit; they may have like 50 of them in there. Yes. So remember, you do partial group bys on different nodes, but then you also have to remember a bunch of stuff, because when you finalize the group by, you have to dedupe the columns again, because they're count distincts, not just counts. Yes. So you're describing a situation where you are grouping by one set of keys and then applying a count distinct across several other columns. So you might group by a symbol, but then
count the distinct bid and the distinct ask prices. Is that correct? Yeah, that is it: you have your group-by columns, and then you're distinct on other things. Right. And so what you're applying then, in this case with a group-by pipe, is a grouping based on the grouping keys, in the example the symbol, and then you apply a similar grouping to the secondary key, which would be the bid and the ask price. But there are many secondary keys, right? Yes, right. The simple thing to do is to separate those and treat them as separate subqueries, and that of course is expensive, because you've got to reapply the primary grouping based on the group-by key each time. The more efficient thing to do is to do the primary grouping once and then apply multiple secondary sorts to the keys that you're counting. Right, but the problem is that you may have a huge common subexpression across them, because you're creating your common subexpression before grouping, which is terrible. Yeah, because I do the join, then I have to do a group by, like the stock symbol with count distinct of A, and then I do another one for symbol and count distinct of B, and so on and so forth. So the result of the join has to be fed to all of these, and unless you have some push mechanism, you have to write it to a temp table and then read from the temp, which is a disaster, because it comes before the grouping anyway. We can, I can introduce you to Hamid afterward if you want. Sounds good, I'd like to dig into that more. Thanks, Hamid.

So, switching gears a little bit, I'm going to talk about the elasticity of Vertica running in the cloud. I'll give you a second to peruse the cartoon, and I'll mention that Vertica does now support running in Kubernetes as well as on all the major cloud vendors. Here are a couple of use cases that we've seen from our customers over the years for why elasticity, which is akin to running in the cloud, either public or
private, was important to them. Some of them have a bursty workload; one example of that is a follow-the-sun model, where your workers in Asia are busy at different hours than your workers on the west coast of the United States, so you want to move resources around in order to accommodate the data analysts in the places where the sun is up. Another is that customers don't want to throw data away; they want to keep it around forever, so they have a very long tail of historical data but they typically query the most recent data in the database. They want their compute to be independent of the amount of storage that they have: they might be querying this month's data, but they've kept data around for many decades. Another reason is very straightforward: if everything else is in the cloud, if you put your other applications there so that they can be scalable, then you want your database to be close to those applications, so you want your database in the cloud as well. And another very pragmatic reason is that justifying opex is often easier than justifying capex, especially when the difference between them can be significant. If you want to run an experiment, a skunkworks project, or a hackathon project in the cloud, it's easier to get approval for spending several thousand dollars over a couple of months than it is to get approval for spending hundreds of thousands or millions of dollars on hardware just for an experiment.

Keep in mind some of our design principles, especially those for MPP scale-out and distributed query. I mentioned before that Vertica is careful about how we place the data so that we can do co-segmented joins and co-segmented group bys: we don't have to utilize the network to re-segment data among the nodes in the database if we stored it in a way such that the grouping or the joining can be done independently on each node. Here's an example: if I create a projection for my customers table and I segment the projection by the hash of an ID on all nodes, and then I create another projection for my purchases table and I segment by the corresponding customer ID key, then when I do a join between those two tables there's no need to shuffle data between the nodes. Since I'm joining on the keys that I segmented the data by, I can do the join locally on each of the nodes and then process the results and send them back to the initiator.

That careful placement of the data makes elasticity challenging. In our Eon mode architecture, instead of calling them segments we call them shards, but the design is that nodes are assigned to shards: though the data is stored in shared storage, the shard that a node is responsible for processing is determined by its subscription, and the shard data that it stores in the local cache that we call the depot is also determined by that shard subscription. So if we started off with three nodes and we had three shards to begin with, each time a query is executed all three of those nodes participate in it. If we double the size of our cluster, then we have a challenge, because we've only got these three shards but now we have six nodes that can do compute on top of those shards. For new data the solution is relatively easy: we can just write it down so that we segment the data in a way that's optimized for the number of compute nodes that we have. If we were then to take away three of those nodes, we'd have an assignment where, for the new data, each node would be associated with two of the shards, one A and one B. The older data is a little bit trickier: it's already been written out, already contained in storage containers that group things together based on that segmentation. So we have two ways of dealing with what we call elastic crunch scaling, which is when we're going to apply multiple nodes to the same storage containers. One is that if we don't care about the segmentation, then we, sorry, this is the other case: when we do
care about the segmentation, then we can have those two nodes, in this case one and two, both reading from shard one, apply a filter so that each reads just half of the segment's worth of data. They're both doing the same amount of I/O, but they're applying a filter so that each does compute on only half of the data. This works well for compute-intensive tasks; obviously it doesn't provide a significant win for anything that's I/O-intensive. The other approach we can take is to split the data across multiple storage containers. If we don't care about the segmentation of the data, or if the workload is very I/O-intensive, then we can separate those storage containers and distribute them reasonably uniformly to the nodes, so that nodes one and two in this instance are reading from different storage containers. They're processing the same segment's data split between them, so a co-segmented group by doesn't come for free between the two of them, but this works well for workloads that are I/O-intensive. Any questions on those two different approaches?

Alright, I think that was all the material I had prepared for this talk. As you're probably already aware, you can go and learn more at vertica.com, and you can try it for free; that link takes you to our Vertica Accelerator, which is our as-a-service offering, and it's easy to spin up and get going with. And of course we're always hiring, so you can find job listings at vertica.com/careers. I'll pause there and open it up to the audience. Matt, would you want to go first?

Yeah, thanks Stephen for the talk. Just a quick question: when you talk about these cloud deployments, are these nodes typically homogeneous, or do you see a mix of instance sizes within a single deployment?
Yeah, I'm really glad you asked that; that's a great question. We have this notion of a cluster, which is the entire database, and within a cluster we have subclusters. You might have a subcluster for doing your ETL jobs, and maybe that has to be big and beefy, so you have 64 nodes in it. For your ad hoc queries maybe you don't care as much about performance, so you have 16 nodes in that subcluster. For your dashboarding queries you might have many subclusters, each with relatively few nodes, maybe four or eight nodes apiece. Within a single subcluster the nodes should be homogeneous, because they're doing work together and you're only going to be as fast as the slowest member of that subcluster. Between subclusters, though, they can be heterogeneous: for that big beefy ETL job, maybe those 64 nodes are big, expensive, powerful machines, while for the ad hoc queries, or for a machine learning workload where time might not be as important, you have less beefy machines. Does that make sense? Yeah, perfect, thank you. It's a great question; I'm glad you asked it.

I've got one. Okay, I actually have two questions. One is a follow-up to the previous question: when you have different subclusters, do you also cache the data independently? Yes, the data is cached independently, and that's a great question, because the query workload might look very different across each of those subclusters. Okay, good, that makes sense. The other question I have is: have you compared with the cloud-native engines, such as Rimeo? How close are they to you? Because they're all chasing you anyway, I mean, chasing the classic databases. So how close are they?

I would also say that the database guys announced last week that they hold the TPC-H record. Yeah, okay. Rimeo we haven't done comparisons with recently, mainly because we don't see them very often in the field when we're doing proofs of concept. We've done some comparisons with other cloud-native databases, Snowflake relatively recently, where we showed that we had a significant advantage in performance, especially when it came to throughput, and especially in cost for performance. Okay, so was that a price-performance comparison or just a pure throughput comparison? It had both elements in it. What we found is that when scaling up the number of concurrent queries, performance in Vertica was significantly better. How big a database did you have there, if you can talk about it? This was done by a third-party contractor. If you search for McKnight and Vertica you'll probably find the comparison I'm speaking of, or if you go to Vertica's website I believe there's a link for downloading it as well.

Okay, any other questions from the audience? I'll ask one more: can you talk a little bit about how the catalogs are stored? Like, how have things changed, obviously from Postgres, what, 15 years ago or something like that? Have you moved to a different database system entirely, as some systems do to store their catalogs, or is everything in-house?
Everything's in-house. The catalog has several really neat properties to it; it also has a couple of pain points. It's relatively bloated, but that's everyone's catalog. It does have some great consistency guarantees. It's really interesting, particularly with distributed queries: we have this mechanism for reference-counting the catalog and keeping versions around, because the query that I initiated might be running on a different version of the catalog than the query that you initiated just a fraction of a second later. Each node's catalog is advancing at a different rate than is perceived by the query, if that makes sense. So when I initiate my query I'm looking at one catalog version, but when the query gets distributed to the other nodes in the database they may have a different version of the catalog, and we need to make sure that everybody has at least the same version as my initiator. There's a lot of interesting stuff going on in terms of the catalog in a Vertica database.

My last question would be: have you noticed any major trends or changes in how people are using Vertica now that there's the cloud service versus when it was on-prem? It may be hard to know, because you don't have the visibility into how people are using on-prem Vertica as much as you do in the cloud. So I'm asking, are there any high-level things, like, oh, people's queries are shorter, queries are longer, any major trend like that? Or are people dumber on the cloud? To get through the process of running Vertica on-prem you had to go through procurement and talk to people who knew what they were doing; now you just give Vertica your credit card and you're up and running.

Well, it's definitely one of our challenges to offer excellent performance that's also easy to administer, and I don't know that that's unique to the cloud, because while DBA expertise is valuable, it's limited, and it's not always the DBA who's optimizing performance of the database; application developers are creating tables and projections and so forth. So that's been one of the challenges: both identifying instances where a projection could be rewritten to be optimized for the query workload being run against the database, and also making it easier for those optimizations to be applied. In the early days you had somebody working with sales engineers at Vertica, somebody who was an expert on the Vertica system; they'd learn how to optimize projections for the workload and curate their Vertica database. Now many of our customers have a lot of different users on the database who don't have that Vertica expertise.

Maybe the question might be: is there SQL functionality or a feature that you didn't think was a high priority on-prem, but now that you have observability into every query executed in the cloud, you realize it's something you need to optimize? So, our cloud deployment model is a little bit different: we deploy within the customer's own cloud account. We're able to administer the database, but we can't see some of the queries; we're not observing them.
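[Editor's note] To make the variable-length key encoding from earlier in the talk concrete, here is a minimal Python sketch of the idea. This is an illustration, not Vertica's implementation: the 7-payload-bytes-per-8-byte-word split, the function names, and the use of Python byte strings are all assumptions for this sketch, and the sign-bit XOR and null/empty-string handling described in the talk are omitted. Each value is chopped into fixed-size words holding up to 7 payload bytes plus a trailing length byte; the length byte is the number of payload bytes in that word, plus one more if the value continues into the next word, so a plain bytewise comparison of the words reproduces the correct sort order.

```python
PAYLOAD = 7  # payload bytes per 8-byte word (assumption for this sketch)

def encode_key(value: bytes) -> list[bytes]:
    """Encode a variable-length value into fixed-size comparable words."""
    chunks = [value[i:i + PAYLOAD] for i in range(0, len(value), PAYLOAD)] or [b""]
    words = []
    for idx, chunk in enumerate(chunks):
        continues = idx < len(chunks) - 1
        # length byte: bytes used in this word, +1 if the value continues on
        length_byte = len(chunk) + (1 if continues else 0)
        # pad the payload with zero bytes, then append the length byte
        words.append(chunk.ljust(PAYLOAD, b"\x00") + bytes([length_byte]))
    return words

def compare_keys(a: bytes, b: bytes) -> int:
    """Word-by-word comparison: a simple loop with very little branching."""
    for x, y in zip(encode_key(a), encode_key(b)):
        if x != y:
            return -1 if x < y else 1
    return 0  # equal words imply equal keys (the length byte disambiguates)
```

The trailing length byte is what makes prefixes sort correctly: "abcdefg" (length byte 7) sorts before "abcdefgh" (whose first word carries length byte 8 to signal continuation) even though their seven payload bytes are identical.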