Hello, everybody, and thank you for joining us today for the virtual Vertica BDC 2020. Today's breakout session is entitled "Vertica Database Designer: Today and Tomorrow." I'm Jeff Healey with Vertica Marketing, and I'll be your host for this breakout session. Joining me today is Yanse Bay, Senior Technical Manager from Vertica Engineering. Before we begin, I encourage you to submit questions or comments during the virtual session. You don't have to wait; just type your question or comment in the question box below the slides. As always, there will be a Q&A session at the end of the presentation, and we'll answer as many questions as we're able to during that time. Any questions we don't address, we'll do our best to answer offline. Alternatively, you can visit the Vertica forums at forum.vertica.com to post your questions there after the session. Our engineering team is planning to join the forums to keep the conversation going. Also, a reminder that you can maximize your screen by clicking the double-arrow button in the lower right corner of the slides. And yes, this virtual session is being recorded and will be available to view on demand this week. We will send you a notification as soon as it's ready. Now let's get started. Over to you, Yanse.

Thanks, Jeff. Hi, everyone. My name is Yanse Bay. I'm a Senior Technical Manager in the Vertica Server R&D group, where I run the Query Optimizer, Catalog, and Distributed Engine teams. I'm very glad to be here today to talk about Vertica Database Designer, today and tomorrow. This presentation is organized as follows. I will first refresh some knowledge about Vertica fundamentals, such as tables and projections, which leads to the question: what is Database Designer, and why do we need this tool? I will then take you through a deep dive into Database Designer, or as we call it, DBD, and see how DBD works internally. After that, I will show you some exciting DBD improvements we have planned for the 10.0 release.
And lastly, I will share with you some of the DBD future roadmap we have planned. As most of you already know, Vertica is built on a columnar architecture, meaning data is stored column-wise. Here we can see a very simple example of a table with four columns. Many of you may also know that a table in Vertica is a virtual concept; it's just a logical representation of the data, which means users can write SQL queries that reference table names and columns, just like in other relational database management systems. But the actual physical storage of the data is called a projection. A projection can reference a subset of the columns of its anchor table, and it must be sorted by at least one column. Each table needs at least one superprojection, which references all the columns of the table. If you load data into a table with no projections, an auto projection will be created automatically, arbitrarily sorted by the first couple of columns in the table. As you can imagine, even though such an auto projection can be used to answer any query, the performance is not optimal in most cases. A common practice in Vertica is to create multiple projections on the same table, containing different sets of columns and sorted in different ways. When a query is sent to the server, the optimizer picks the projection that can answer the query in the most efficient way. For example, let's say you have a query that selects columns B, D, and C, sorted by B and D. The third projection will be ideal, because the data is already sorted, so you save the sorting cost while executing the query. Basically, when you choose the design of a projection, you need to consider four things. First and foremost, of course, is the sort order. Data already sorted in the right way can benefit quite a lot of operations, such as ORDER BY, GROUP BY, analytics, merge joins, predicates, and so on.
The selected column set is also important, because the projection must contain all the columns referenced by your workload queries. If even one column is missing from the projection, that projection cannot be used for the query. In addition, Vertica is a distributed database and allows a projection to be segmented based on the hash of a set of columns, which is beneficial if the segmentation matches the join keys or group keys. And finally, the encoding of each column is also part of the design, because data sorted in different ways may have completely different optimal encodings for each column. This example only showed the benefit of the first two, but you can imagine the other two are also important. Even so, it doesn't sound that hard, right? Well, I hope you change your mind when you see this. At least I do. These machine-generated queries really beat me. It would probably take an experienced DBA hours to figure out which projections can benefit these queries, not even mentioning that there could be hundreds of such queries in a regular real-world workload. So what can we do? That's why we need DBD. DBD is a tool integrated into the Vertica server that helps DBAs perform analysis on their workload queries, table schemas, and data, and then automatically figures out the most optimized projection design for their workload. In addition, DBD is a sophisticated tool that can be customized by the user through a lot of parameters, objectives, and so on. And lastly, DBD has access to the optimizer, so DBD knows what kind of attributes a projection needs to have in order for the optimizer to benefit from it.
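To make the projection-choice idea concrete, here is a minimal Python sketch of how an optimizer-like rule could pick a projection for a query. The projection layouts, the prefix-matching score, and all function names are invented for illustration; this is not Vertica's actual implementation.

```python
def sort_prefix_len(proj_sort_order, query_sort_cols):
    """Count how many leading sort columns of the projection match the query."""
    n = 0
    for p, q in zip(proj_sort_order, query_sort_cols):
        if p != q:
            break
        n += 1
    return n

def pick_projection(projections, needed_cols, query_sort_cols):
    """Pick a projection that covers all referenced columns, preferring the
    one whose sort order best matches the ordering the query needs."""
    candidates = [p for p in projections if needed_cols <= set(p["columns"])]
    if not candidates:
        return None  # no projection covers the query; it cannot be used
    return max(candidates,
               key=lambda p: sort_prefix_len(p["sort_order"], query_sort_cols))

# A table with columns A, B, C, D and three projections sorted differently.
projections = [
    {"name": "p1", "columns": ["A", "B", "C", "D"], "sort_order": ["A"]},
    {"name": "p2", "columns": ["A", "C"],           "sort_order": ["C", "A"]},
    {"name": "p3", "columns": ["B", "C", "D"],      "sort_order": ["B", "D"]},
]

# SELECT B, D, C ... sorted by B, D: p3 covers the columns and is pre-sorted.
best = pick_projection(projections, {"B", "C", "D"}, ["B", "D"])
print(best["name"])  # p3
```

The two ingredients here, column coverage and sort-order match, mirror the first two design considerations from the slides; segmentation and encoding would add further scoring terms.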
DBD has been around for years, and I'm sure there are plenty of materials available online to show how DBD can be used in different scenarios: whether to produce a query-optimized or load-optimized design, whether it's a comprehensive or incremental design, whether to dump the deployment script for manual deployment later or let DBD do the auto deployment for you, and many other options. I'm not planning to talk about that today. Instead, I will take the opportunity to open up the DBD black box and show you what exactly hides inside. DBD is a complex tool, and I have tried my best to summarize the DBD design process into seven steps: extract, permute, prune, build, score, identify, and encode. What do they mean? Don't worry, I will show you step by step. The first step is extract: extract interesting columns. In this step, DBD parses the design queries, figures out the operations that can benefit from a potential projection design, and extracts the corresponding columns as interesting columns. So predicates, GROUP BYs, ORDER BYs, join conditions, and analytics all yield interesting columns for DBD. From these three simple sample queries, DBD extracts the interesting column sets shown on the right. Some of these column sets are unordered. For example, for the green one, GROUP BY A1 and B1, DBD extracts the interesting columns and puts them in an unordered set, because data sorted by either A1 first or B1 first can benefit this GROUP BY operation. Some of the other sets are ordered, and the best example is the ORDER BY clause on A2 and B2: obviously you cannot sort by B2 first and then A2. These interesting column sets will be used as seeds to be extended into actual projection sort-order candidates. The next step is permute. Once DBD has extracted all the seeds, it enumerates sort orders using those seeds. How does DBD do that? Let me start with a very simple example.
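The extract step just described can be sketched as a small Python function. The simplified query representation (a dict of clause lists) and the function name are invented for illustration; real DBD parses full SQL.

```python
def extract_seeds(query):
    """Return (unordered_sets, ordered_lists) of interesting column seeds."""
    unordered, ordered = [], []
    for col in query.get("predicates", []):
        unordered.append({col})                    # predicate columns
    if query.get("group_by"):
        unordered.append(set(query["group_by"]))   # any sort order helps GROUP BY
    if query.get("order_by"):
        ordered.append(list(query["order_by"]))    # column order matters here
    for left, right in query.get("join_conditions", []):
        unordered.append({left, right})            # join key columns
    return unordered, ordered

# The sample queries from the slides, in simplified form: a predicate on D1,
# a GROUP BY on A1 and B1, and an ORDER BY on A2 and B2.
unordered, ordered = extract_seeds(
    {"predicates": ["D1"], "group_by": ["A1", "B1"], "order_by": ["A2", "B2"]})
print(sorted(unordered[1]))  # ['A1', 'B1'] -- unordered GROUP BY seed
print(ordered)               # [['A2', 'B2']] -- ordered ORDER BY seed
```

Note how GROUP BY columns land in an unordered set while ORDER BY columns keep their order, exactly the distinction the green and ordered examples make above.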
Here you can see DBD enumerating two sort orders by extending D1 with the unordered set {A1, B1}, deriving two sort-order candidates: (D1, A1, B1) and (D1, B1, A1). These sort orders can benefit queries with a predicate on D1, and also queries with GROUP BY A1, B1 when D1 is constant. With the same idea, DBD will try to extend the seeds with each other and populate more sort-order permutations. You can imagine that there could be many of these candidates depending on how many queries you have in the design; there can be hundreds of sort-order candidates. And that brings us to the third step, which is pruning. This step limits the candidate sort orders so that the design won't run forever. DBD uses a very simple capping mechanism: sort-order candidates are ranked by length, and only a certain number of the longest ones move forward to the next step. Now we have all the sort-order candidates we want to try. But to find out whether a sort-order candidate will actually benefit the optimizer, DBD needs to ask the optimizer. Before that can happen, the build step has to create those projection candidates in the catalog: it generates the projection DDLs from the sort orders and creates the projections in the catalog. These projections won't be loaded with real data, because that takes a lot of time. Instead, DBD copies over the statistics from existing projections to these projection candidates so that the optimizer can use them. The next step is score: scoring with the optimizer. Now that the projection candidates are built in the catalog, DBD can send the workload queries to the optimizer to generate query plans. The optimizer returns each query plan, and DBD goes through the plan and checks whether certain benefits are achieved. The benefit list has been growing over time as the optimizer adds more optimizations.
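The permute and prune steps described above can be sketched as follows. The extension rule (append every permutation of an unordered seed to an existing prefix) and the length cap are simplified stand-ins for DBD's real heuristics.

```python
from itertools import permutations

def extend(prefix, unordered_seed):
    """Extend an ordered prefix with every permutation of an unordered seed."""
    return [list(prefix) + list(p)
            for p in permutations(sorted(unordered_seed))
            if not set(p) & set(prefix)]  # skip seeds overlapping the prefix

def prune(candidates, cap):
    """Keep only the `cap` longest sort-order candidates."""
    return sorted(candidates, key=len, reverse=True)[:cap]

# Extend the predicate seed D1 with the unordered GROUP BY seed {A1, B1}.
candidates = extend(["D1"], {"A1", "B1"})
print(candidates)  # [['D1', 'A1', 'B1'], ['D1', 'B1', 'A1']]

# With many queries there can be hundreds of candidates; cap by length.
pruned = prune(candidates + [["C1"]], cap=2)
print([len(c) for c in pruned])  # [3, 3] -- the short candidate is dropped
```

A real enumerator would also extend ordered seeds and chain multiple seeds together, but the shape of the search, grow then cap, is the same.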
Let's say in this case, because the projection candidate is sorted by B1 and A1, it is eligible for the GROUPBY PIPELINED benefit. Each benefit has a preset score. The benefit scores across all design queries are aggregated and recorded for each projection candidate. We are almost there. Now that we have the total benefit score for each projection candidate derived from the workload queries, the job is easy: just pick the sort order with the highest score as the winner. Here we have the winner (D1, B1, A1). Sometimes you need to find more winners, because the chosen winner may only benefit a subset of the workload queries you provided to DBD. In order for the rest of the queries to also benefit, you need more projections. In that case, DBD goes through the next iteration, and let's say it finds another winner, (D1, C1), to benefit the workload queries that cannot be benefited by (D1, B1, A1). The number of iterations, and thus the number of winners, depends on the design objective the user sets. It can be load-optimized, which means only one superprojection winner will be selected; query-optimized, where DBD tries to create as many projections as needed to cover most of the workload queries; or a balanced objective in the middle. The last step is to decide the encoding of each column for the projection winners. Because the data is sorted differently, the encoding benefits can be very different from those of the existing projections, so choosing the right encoding design can reduce the disk footprint by a significant factor, and it is worth the effort to find the best encodings. DBD picks encodings by actually sampling the data and measuring the storage footprint. For example, in this case the projection winner has three columns, and each column has a few encoding options. DBD writes the sample data sorted the way this projection is sorted.
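The score-and-identify loop just described is essentially a greedy cover: each candidate gets the sum of benefit scores over the queries it helps, the top scorer wins, the covered queries are removed, and the loop repeats. Here is a toy Python sketch; the flat 10-point score and the prefix-based "benefits" test are invented stand-ins for the optimizer's real plan inspection.

```python
def benefits(candidate, query_cols):
    """Toy benefit test: the query's columns form a prefix of the candidate."""
    return candidate[:len(query_cols)] == query_cols

def pick_winners(candidates, queries, max_iters=5):
    """Greedily pick winning sort orders until queries are covered."""
    winners, remaining = [], list(queries)
    for _ in range(max_iters):
        scored = [(sum(10 for q in remaining if benefits(c, q)), c)
                  for c in candidates]
        best_score, best = max(scored, key=lambda s: s[0])
        if best_score == 0:
            break  # no candidate benefits any remaining query
        winners.append(best)
        remaining = [q for q in remaining if not benefits(best, q)]
    return winners

candidates = [["D1", "B1", "A1"], ["D1", "C1"]]
queries = [["D1", "B1"], ["D1", "B1", "A1"], ["D1", "C1"]]
print(pick_winners(candidates, queries))
# [['D1', 'B1', 'A1'], ['D1', 'C1']]
```

The first iteration picks (D1, B1, A1) because it benefits two queries; the second iteration picks (D1, C1) for the remaining one, mirroring the two-winner example above. The `max_iters` cap plays the role of the design objective: a load-optimized design would stop after one winner.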
Then you can see that with different encodings, the disk footprint is different. DBD compares the disk footprint of the different options for each column and picks the encoding option with the smallest storage footprint. Nothing magical here, but it just works pretty well. And basically, that's how DBD works internally. Of course, I simplified quite a lot. For example, I didn't mention how DBD handles segmentation, but the idea is similar to how sort orders are analyzed. I hope this section gives you some basic idea about DBD as it is today. So now let's talk about tomorrow, and here comes the exciting part. In version 10.0, we significantly improved DBD in many ways. In this talk, I will highlight four issues in the old DBD and describe how the new DBD in 10.0 addresses them. The first issue is that the DBD API is too complex. In most situations, what the user really wants is very simple: "My queries were slow yesterday. Would a new or different projection help speed them up?" However, to answer a simple question like this using DBD, users will very likely have the documentation open on the side, because they have to go through a whole complex flow: creating a design, running the design, getting the output, and then deploying the design at the end. And that's not all. For each step, there are several functions the user needs to call in order. Adding these up, the user needs to write quite a long script with dozens of function calls. It's just too complicated, and most users may find it annoying. They either manually tune the projections themselves or simply live with the performance and come back when it gets really slow again. And of course, in most situations, they never come back to use DBD. In 10.0, Vertica supports a new simplified API to run DBD easily. There is just one function, DESIGNER_SINGLE_RUN, and one argument: the interval during which you think your queries were slow.
In this case, the user complained about it yesterday, so all the user needs to do is specify one day as the argument and run it. The user doesn't need to provide anything else, because DBD will look up the query history within that time window and automatically populate the design, run the design, export the projection design, and clean up. No user intervention needed. No need to have the documentation on the side and carefully write and debug a script. Just one function call. That's it. Very simple. That must be pretty impressive, right? So now here comes another issue. To utilize the single-run function, users are encouraged to run DBD on the production cluster. However, Vertica used to recommend against running a design on a production cluster. One of the reasons is that DBD takes massive locks, both table locks and catalog locks, which badly interfere with the workload running on the production cluster. As of 10.0, we eliminated all the table and catalog locks from DBD. Yes, we eliminated 100% of them. Simple improvement, clear win. The third issue, which users may not be aware of, is that DBD writes intermediate results into real Vertica tables. The reason DBD has to do that is that DBD runs as a background task, so users need a way to monitor the progress of the design from concurrent sessions. For a complex design, the intermediate results can be quite massive. As a result, many storage files are created and written to disk, which stresses both the catalog and the disk and can slow down the design. For Eon Mode, it's even worse, because the tables are stored on communal storage, so writing to a regular table means uploading the data to communal storage, which is even more expensive and disruptive. In 10.0, we significantly restructured the intermediate results into a shared in-memory data structure.
Monitoring queries now look up this in-memory data structure directly through a system table and return the results. No intermediate files are written anymore. Another expensive use of local disk by DBD is the encoding design. As I mentioned earlier in the deep dive, to determine which encoding works best for the new projection design, there is no magic way: DBD needs to actually write the sample data to disk using the different encoding options, find out which one has the smallest footprint, and pick it as the best choice. This written sample data is useless afterward and is wiped out right away, and you can imagine this is a huge waste of system resources. In 10.0, we improved this process: instead of writing the differently encoded data to disk and then reading the file sizes, DBD aggregates the data block sizes on the fly. The data blocks are never written to disk, so the overall encoding design is more efficient and non-disruptive. Of course, this is just the start. The reason we put a significant amount of resources into improving DBD in 10.0 is that Vertica sees DBD as an essential component of the out-of-the-box performance design campaign. To simply illustrate the timeline: we are now at the second step, where we significantly reduced the running overhead of DBD so that users will no longer fear running DBD on their production cluster. Please note that as of 10.0 we haven't really started changing how the DBD design algorithm works, so what we discussed in the deep dive today still holds. In the next phase of DBD, we will make the design process smarter. This will include a better enumeration mechanism, so that the pruning is more intelligent rather than brute-force, which will result in better design quality and also faster designs. The longer-term goal is for DBD to achieve automation.
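The in-memory encoding comparison can be sketched like this: estimate each encoding's footprint directly from the sample values and keep the smallest, with nothing written to disk. The two toy encodings below (plain fixed-width vs run-length) and their byte costs are illustrative stand-ins, not Vertica's real encoding types or sizes.

```python
def plain_size(values):
    """Toy footprint of storing every value as-is: 8 bytes per value."""
    return 8 * len(values)

def rle_size(values):
    """Toy run-length footprint: 8 bytes value + 8 bytes count per run."""
    runs = 1 + sum(1 for a, b in zip(values, values[1:]) if a != b)
    return 16 * runs

def best_encoding(column_sample):
    """Aggregate each encoding's size on the fly and pick the smallest."""
    sizes = {"PLAIN": plain_size(column_sample), "RLE": rle_size(column_sample)}
    return min(sizes, key=sizes.get)

# A column sorted with long runs compresses well with RLE;
# a high-cardinality column does not.
print(best_encoding([1] * 500 + [2] * 500))  # RLE
print(best_encoding(list(range(1000))))      # PLAIN
```

The key point of the 10.0 change is visible here: the size functions consume the sample in memory and return a number, so no encoded file ever has to be written and measured.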
By automation, what I really mean is that instead of having the user decide when to use DBD once queries get slow, Vertica should detect this event, run DBD automatically for the user, and suggest a better projection design if the existing projections are not good enough. Of course, a lot of work needs to be done before we can fully achieve this automation, but we are working on it. At the end of the day, what the user really wants is a faster database. Thank you for listening to my presentation. I hope you found it useful. Now, let's get ready for the Q&A.