Hello everybody, and thank you for joining us today for the Virtual Vertica BDC 2020. Today's breakout session is entitled "A Technical Overview of the Vertica Architecture." I'm Paige Roberts, Open Source Relations Manager at Vertica, and I'll be your host for this webinar. Now joining me is Ryan Roelke. Did I say that right? It's Roelke. Okay, I've got it: Ryan Roelke, a senior software engineer at Vertica.

Before we begin, I want to encourage you to submit your questions or comments during the virtual session, while Ryan is talking, as you think of them. You don't have to wait until the end. Just type your question or comment in the question box below the slides and click Submit. There will be a Q&A at the end of the presentation, and we'll answer as many questions as we're able to during that time. Any questions we don't address, we'll do our best to get back to you about offline. Alternatively, you can visit the Vertica forums to post your questions there after the session. Our engineering team is planning to join the forums to keep the conversation going, so you can chat with an engineer afterwards, just like at any other conference. Also a reminder: you can maximize your screen by clicking the double-arrow button in the lower-right corner of the slides. And before you ask: yes, this virtual session is being recorded, and it will be available to view on demand this week. We'll send you a notification as soon as it's ready. Now let's get started. Over to you, Ryan.

Thanks, Paige. Good afternoon, everybody. My name is Ryan, and I'm a senior software engineer on Vertica's development team. I primarily work on improving Vertica's query execution engine, so usually in the space of making things faster. Today I'm here to talk about something more general than that: a technical overview of the Vertica architecture.

The intent of this talk is essentially to explain some of the basic aspects of how Vertica works and what makes it such great database software. To explain what makes queries execute so fast in Vertica, we'll provide some background on why other databases can't keep up. We'll use that as a starting point to discuss an academic database that paved the way for Vertica, and then we'll explain how Vertica's design builds upon that academic database to become the software it is today.

I want to start by sharing somebody's approximation of an internet minute at some point in 2019. All of the data on this slide is generated by thousands or even millions of users, and that's a huge amount of activity. Most of the applications depicted here are backed by one or more databases, and most of this activity will eventually result in changes to those databases. For the most part, we can categorize the ways these databases are used into one of two paradigms.

First up we have online transaction processing, or OLTP. OLTP workloads usually operate on single entries in a database, so an update to a retail inventory or a change in a bank account balance are both great examples of OLTP operations. Updates to these datasets must be visible immediately, and there can be many transactions occurring concurrently from many different users. OLTP queries are usually key-value queries: the key uniquely identifies a single entry in a database for reading or writing. Early databases and applications were probably designed for OLTP workloads. The example on the slide is typical of an OLTP workload. We have an accounts table, such as for a bank, which tracks information for each of the bank's clients. An update query like the one depicted here might be run whenever a user deposits $10 into their bank account.
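Since the slide itself isn't reproduced in this transcript, here is a minimal sketch of that kind of update, against a hypothetical accounts table:

    -- Hypothetical OLTP-style point update: the key (account_id)
    -- identifies a single row, and the change must be visible immediately.
    UPDATE accounts
       SET balance = balance + 10
     WHERE account_id = 42;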
Our second category is online analytical processing, or OLAP, which is more about using your data for decision making. If you have a hardware device which periodically records how it's doing, you could analyze trends across all of your devices over time to observe which data patterns are likely to lead to failure. Or, if you're Google, you might log user search activity to identify which links help your users find the answer. Analytical processing has always been around, but with the advent of the Internet it now happens at scales that were unimaginable even just 20 years ago.

This SQL example is something you might see in an OLAP workload. We have a table, searches, logging user activity, and we will eventually see one row in this table for each query submitted by users. If we want to find out what time of day our users are most active, we could write a query like the one on the slide, which counts the number of unique users running searches for each hour of the day.

So now let's rewind to 2005. We don't have a picture of an internet minute in 2005; we don't have the data for that. We also don't have the data for a lot of other things: the term "big data" is not yet on anyone's radar, and the cloud is not quite there either, or is only just starting to be. So if you have a database serving your application, it's probably optimized for OLTP workloads. OLAP workloads just aren't mainstream yet, and database engineers probably don't have them in mind.

So let's innovate. It's still 2005, and we want to try something new with our database. Let's take a look at what happens when we run an analytic workload in 2005. As a motivating example, we'll use a table of stock prices over time. In our table, the symbol column identifies the stock that was traded, the price column records the new price, and the timestamp column indicates when the price changed. We have several other columns; we should know that they're there, but we're not going to use them in any example queries. This table is designed for analytic queries: we're probably not going to make any updates or look at individual rows, since we're logging historical data and want to analyze changes in stock price over time.

Our database system, though, is built to serve OLTP use cases, so it's probably going to store this table on disk in a single file like this one. Notice that each row contains all of the columns of our data, in row-major order. There's probably an index somewhere in the memory of this system which will help us do point lookups. Maybe our system expects that we will use the stock symbol and the trade time as lookup keys, so the index provides quick lookups from those columns to the position of the whole row in the file. If we did have an update to a single row, this representation would work great: we would seek to the row we're interested in, which would probably be very fast using the in-memory index, and then we would update the file in place with our new value. On the other hand, if we run an analytic query like the one we have in mind, the data access pattern is very different, and the index is not helpful, because we're looking up a whole range of rows, not just a single row.
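That query isn't shown in this transcript, but against a hypothetical version of the stocks schema it might look something like this:

    -- A hypothetical stand-in for the analytic query on the slides: the
    -- average Apple stock price over one day. Both predicates select whole
    -- ranges of rows rather than a single keyed row.
    SELECT AVG(price)
      FROM stocks
     WHERE symbol = 'AAPL'
       AND trade_time >= '2005-03-01'
       AND trade_time <  '2005-03-02';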
As a result, the only way to find the rows we actually need for this query is to scan the entire file. We're going to end up scanning a lot of data we don't need, and it won't just be the rows we don't need: there are many other columns in this table, maybe information about who made the transaction, and we'll be scanning through those columns too, for every single row. That could be a very serious problem once we consider the scale of this file. Stocks change a lot; we probably have thousands or millions or maybe even billions of rows stored in this file, and we're going to scan all of those extra columns for every single one of them. If we tried out our stocks use case behind the desk of a Fortune 500 company, we'd probably be pretty disappointed. Our queries would eventually finish, but they might take so long that we no longer care about the answer by the time they do. Our database is not built for the task we want to use it for.

Around the same time, a team of researchers in the Northeast became aware of this problem, and they decided to dedicate their time and research to it. These researchers weren't just anybody: the fruits of their labor, which we now call the C-Store paper, was published by eventual Turing Award winner Mike Stonebraker, along with several other researchers from elite universities. The paper presents "the design of a read-optimized relational DBMS that contrasts sharply with most current systems, which are write-optimized." That sounds exactly like what we want for our stocks use case.

Reasoning about what makes our query's execution so slow brought the researchers to the memory hierarchy, which is essentially a visualization of the relative speeds of the different parts of a computer. At the top of the hierarchy we have the fastest units, which are of course also the most expensive to produce. As we move down the hierarchy, components get slower but also much cheaper, so you can have more of them. Our OLTP database's data is stored in a file on the hard disk. We scanned the entirety of that file even though we didn't need most of the data, and it turns out that's just about the slowest thing our query could possibly be doing, by over two orders of magnitude. It should be clear, then, that the best thing we can do to optimize our query's execution is to avoid reading unnecessary data from the disk. And that's what the C-Store researchers decided to look at.

The key innovation of the C-Store paper does exactly that. Instead of storing data in row-major order in one large file on disk, they transposed the data and stored each column in its own file. Now if we run the same select query, we read only the relevant columns. The unnamed columns don't factor into the table scan at all, since we don't even open their files. Zooming out to an internet-scale dataset, we can appreciate the savings a lot more. But we still have to read a lot of data we don't need to answer this particular query. Remember, we had two predicates: one on the symbol column and one on the timestamp column. Our query is only interested in Apple stock, but we're still reading rows for all of the other stocks. So what can we do to optimize our disk reads even more?

Let's first partition our dataset into different files based on the date of the timestamp. This means we keep separate files for each date. When we query the stocks table, the database knows all of the files it may have to open.
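As a sketch, using Vertica-style syntax purely for illustration (the paper itself predates this DDL) and the hypothetical schema from before, date-based partitioning might be declared like this:

    -- Date-based partitioning: rows whose timestamps fall on different
    -- dates are stored in separate file sets.
    CREATE TABLE stocks (
        symbol     VARCHAR(10),
        price      NUMERIC(10,2),
        trade_time TIMESTAMP,
        trader     VARCHAR(50)   -- stands in for the other, unused columns
    )
    PARTITION BY trade_time::DATE;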
If we have a simple predicate on the timestamp column, as our sample query does, then the database can use it to figure out which files it doesn't have to look at at all. Now every disk read we do to answer our query will produce rows that pass the timestamp predicate.

This eliminates a lot of wasteful disk reads, but not all of them. We have another predicate on the symbol column: symbol equals Apple. We'd like to avoid disk reads of rows that don't satisfy that predicate either, and we can, by clustering all of the rows that match the symbol predicate together. If all of the Apple rows are adjacent, then as soon as we see something different we can stop reading the file; we won't see any more rows that can pass the predicate. Then we can use the positions of the rows we did find to identify which pieces of the other columns we need to read. One technique we can use to cluster the rows is sorting, so we'll use the symbol column as a sort key for all of the columns. That way, we can reconstruct a whole row by seeking to the same row position in each file.

It turns out that having sorted all of the rows, we can do a bit more. We don't have any more wasted disk reads, but we can still be more efficient with how we're using the disk. We've clustered all of the rows with the same symbol together, so we don't really need to bother repeating the symbol so many times in the same file. Let's just write the value once and record how many rows it covers. This run-length encoding technique can compress large numbers of rows into a small amount of space. In this example we de-duplicate just a few rows, but you can imagine de-duplicating many thousands of rows instead. This encoding is great for reducing the amount of disk we need to read at query time, and it has the additional benefit of reducing the total size of our stored data.

Now our query requires substantially fewer disk reads than it did when we started. Let's recap what the C-Store paper did to achieve that. First, we transposed our data to store each column in its own file, so queries only have to read the columns they use. Second, we partitioned the data into multiple file sets so that all rows in a file set have the same value for the partition column, so a predicate on the partition column can skip non-matching file sets entirely. Third, we selected a column of our data to use as a sort key, so rows with the same value for that column are clustered together, which allows a query to stop reading data once it finds non-matching rows. Finally, sorting the data this way enables high compression ratios using run-length encoding, which minimizes the size of the data stored on disk. The C-Store system combined each of these innovative ideas to produce an academically significant result, and if you had used it behind the desk of a Fortune 500 company in 2005, you probably would have been pretty pleased.

But it's not 2005 anymore, and the requirements of a modern database system are much stricter. So let's take a look at how C-Store fares in 2020. First of all, we have designed the storage layer of our database to optimize a single query in a single application. Our design optimizes the heck out of that query, and probably some similar ones, but if we want to do anything else with our data, we might be in a bit of trouble. What if we just decide we want to ask a different question? For example, in our stocks example, what if we want to plot all of the trades made by a single trader over a large window of time?
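Sticking with the hypothetical schema from earlier, that new question might look something like this:

    -- A hypothetical new query: one trader's activity over a long window.
    -- The sort order on symbol doesn't help here, and for a long enough
    -- window neither does the date partitioning.
    SELECT trade_time, symbol, price
      FROM stocks
     WHERE trader = 'jsmith'          -- 'jsmith' is a made-up example value
       AND trade_time >= '2004-01-01'
     ORDER BY trade_time;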
How do our optimizations for the previous query measure up here? Well, our data is partitioned on the trade date. That could still be useful, depending on the new query: if we want to look at a trader's activity over a long period of time, we'd have to open a lot of files, but if we're still interested in just a day's worth of data, then this optimization is still an optimization. Within each file, though, our data is ordered on the stock symbol. That's probably not too useful anymore. The rows for a single trader aren't going to be clustered together, so we'll have to scan all of the rows in order to figure out which ones match. You could imagine a worse design, but if it becomes crucial to optimize this new type of query, we might have to go as far as reconfiguring the whole database.

The next problem is one of scale. One server is probably not good enough to serve a database in 2020. C-Store, as described, runs on a single server and stores lots of files. What if the data overwhelms this small system? We could imagine exhausting the file system's inode limit with lots of small files due to our partitioning scheme, or we could imagine something simpler: just filling up the disk with huge volumes of data. But there's an even simpler problem than that. What if something goes wrong and C-Store crashes? Then our data is no longer available to us until the single server is brought back up.

A third concern, another one of scalability, is that one deployment does not suit all of the use cases we could imagine, and we haven't said anything about being flexible. A contemporary database system has to integrate with many other applications, which might themselves have pretty restrictive deployment options. Or the demands imposed by our workloads change, and the setup we had before no longer suits what we need now. C-Store doesn't do anything to address these concerns.

What the C-Store paper did do was lead very quickly to the founding of Vertica. Vertica's architecture and design are essentially all about bringing the C-Store designs into an enterprise software system. The C-Store paper was an academic exercise, so it didn't really need to address any of the hard problems we just talked about. But Vertica, the first commercial database built upon the ideas of the C-Store paper, definitely would have to.

This brings us back to the present, to look at how an analytic query runs in 2020 on the Vertica Analytic Database. Vertica takes the key idea from the paper, that we can significantly improve query performance by changing the way our data is stored, and gives its users the tools to customize their storage layer in order to heavily optimize really important or commonly run queries. On top of that, Vertica is a distributed system, which allows it to scale up to internet-sized datasets and to have better reliability and uptime.

We'll now take a brief look at what Vertica does to address the three inadequacies of the C-Store system that we just mentioned. To avoid locking users into a single database design, Vertica provides tools for the database user to customize the way their data is stored. To address the shortcomings of a single-node system, Vertica coordinates processing among multiple nodes. And to acknowledge the large variety of desirable deployments, Vertica does not require any specialized hardware and has many features which smoothly integrate it with a cloud computing environment.
First, we'll look at the database design problem. We're a SQL database, so our users are writing SQL and describing their data with a create table statement. Create table is a logical description of what your data looks like, but it doesn't specify the way that data has to be stored; for a single create table, there are a lot of possible storage layouts. Vertica adds some extensions to SQL so that users can go further than create table and describe the way they want the data to be stored. Using terminology from the C-Store paper, we provide the create projection statement. Create projection specifies how table data should be laid out, including column encoding and sort order. A table can have multiple projections, each of which can be ordered on different columns, and when you query a table, Vertica answers the query using the projection it determines to be the best match.

Referring back to our stocks example, here's a sample create table and create projection statement. Let's focus on our heavily optimized example query, which had predicates on the stock symbol and the date. We specify that the table data is to be partitioned by date. The create projection statement here is excellent for this query: we specify with the order by clause that the data should be ordered according to our predicates, using the timestamp as a secondary sort key. Each projection stores a copy of the table data, and if you don't expect to need a particular column in a projection, you can leave it out. Our average price query didn't care about who did the trading, so maybe the projection designed for that query can leave the trader column out entirely. If the question we want to ask ever does change, maybe we already have a suitable projection; if we don't, we can create another one. This example shows another projection, one which would be much better at identifying the trends of traders rather than the trends of a particular stock.

Next, a natural follow-up question: how should you decide what design is best for your queries? Well, you could spend a lot of time figuring it out on your own, or you could use Vertica's Database Designer tool, which helps you by automatically analyzing your queries and producing a design it thinks will work really well. If you want to learn more about the Database Designer, you should attend the Database Designer sessions today and tomorrow, which will tell you a lot about what it does and some recent improvements we've made.

Okay, now we'll move to our next problem: the challenge that one server does not fit all. In 2020, we have several orders of magnitude more data than we had in 2005, and you need a lot more hardware to crunch it. It's not tractable to keep multiple petabytes of data in a system with a single server, so Vertica doesn't try. Vertica is a distributed system: we deploy multiple servers which work together to maintain such a high data volume. In a traditional Vertica deployment, each node keeps some of the data in its own locally attached storage. Data is replicated so that there is a redundant copy somewhere else in the system; if any one node goes down, the data it served is still available on a different node. We also set things up so that there is no special node with extra duties. All nodes are created equal, which ensures that there is no single point of failure.

Rather than replicate all of your data on every node, Vertica divvies it up amongst all of the nodes in your system. We call this segmentation. The way data is segmented is another parameter of storage customization, and it can definitely have an impact on query performance. A common way to segment data is with a hash expression, which essentially randomizes the node that a row of data belongs to, but with a guarantee that the same data will always end up in the same place. The way data is segmented is another part of the create projection statement, as seen in this example. Here we segment on the hash of the symbol column, so all rows with the same symbol will end up on the same node.
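The exact DDL from the slides isn't captured in this transcript, but a sketch combining the pieces just described, building on the hypothetical stocks table from earlier, might look like this:

    -- Hypothetical projection: ordered on the two predicate columns,
    -- run-length encoding the sorted symbol column, the trader column
    -- left out, and hash-segmented with one redundant copy per segment.
    CREATE PROJECTION stocks_by_symbol (
        symbol ENCODING RLE,
        price,
        trade_time
    )
    AS SELECT symbol, price, trade_time
       FROM stocks
       ORDER BY symbol, trade_time
       SEGMENTED BY HASH(symbol) ALL NODES KSAFE 1;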
For each row we load into the system, we apply our segmentation expression. The result determines which segment the row belongs to, and then we send the row to each node which holds a copy of that segment. In this example, our projection is marked K-safe 1, so we keep one redundant copy of each segment. When we load a row, we might find that its segment has copies on node 1 and node 3, so we send a copy of the row to each of those nodes. If node 1 is temporarily disconnected from the network, then node 3 can serve the other copy of the segment so that the whole system remains available.

The last challenge we brought up from the C-Store design was that one deployment does not fit all. Vertica's design neatly addresses many of our concerns here. Our use of segmentation to distribute data means that a Vertica system can scale to any size of deployment, and since we require no special hardware or nodes with special purposes, Vertica servers can run anywhere, on premises or in the cloud.

But let's suppose you need to scale out your cluster to rise to the demands of a higher workload. Suppose you want to add another node. This changes the division of the segmentation space: we'd have to re-segment every row in the database to find its new home, and then move around any data that now belongs to a different segment. That's a very expensive operation, not something you want to be doing all that often. Traditional Vertica doesn't solve that problem especially well, but Vertica Eon Mode definitely does. Vertica's Eon Mode is a large set of features designed with a cloud computing environment in mind. One feature of this design is elastic throughput scaling, which is the idea that you can smoothly change your cluster size without having to pay the expense of shuffling your entire database. Vertica Eon Mode had an entire session dedicated to it this morning, so I won't say any more about it here. Maybe you already attended that session; if you haven't, I definitely encourage you to listen to the recording.

If you'd like to learn more about the Vertica architecture, you'll find on this slide links to several academic conference publications: the C-Store paper; the "Vertica 7 Years Later" paper, which describes some of Vertica's designs seven years after the founding; and the paper about the innovations of Eon Mode. And of course, the Vertica documentation is an excellent resource for learning more about what's going on in a Vertica system.

I hope you enjoyed learning about the Vertica architecture. I would be very happy to take all of your questions now.
Thank you for attending this session.