Hello everybody and thank you for joining us today for the virtual Vertica BDC 2020. Today's breakout session is entitled "Sizing and Configuring Vertica in Eon Mode for Different Use Cases." I'm Jeff Healy and I lead Vertica Marketing. I'll be your host for this breakout session. Joining me are Sumit Keswani and Shirang Kamat, Vertica Product Technology Engineers and key leads on the Vertica customer success team. But before we begin, I encourage you to submit questions or comments during the virtual session. You don't have to wait. Just type your question or comment in the question box below the slides and click submit. There will be a Q&A session at the end of the presentation. We'll answer as many questions as we're able to during that time. Any questions that we don't address, we'll do our best to answer offline. Alternatively, visit the Vertica forums at forum.vertica.com and post your question there after the session. Our engineering team is planning to join the forums to keep the conversation going. Also, as a reminder, you can maximize your screen by clicking the double-arrow button in the lower-right corner of the slide. And yes, this virtual session is being recorded and will be available to you on demand this week. We'll send you an invitation as soon as it's ready. Now let's get started. Over to you, Shirang. Thanks, Jeff. So for today's presentation, we have basic Eon Mode concepts. We are going to go over the sizing guidelines for Eon Mode and some of the use cases that can benefit from using Eon Mode. Lastly, we are going to talk about some tips and tricks that can help you configure and manage your cluster. So as you know, Vertica has two modes of operation: Eon Mode and Enterprise Mode. So the question that you may have is, which mode should I implement? Let's look at what's there in Enterprise Mode. In Enterprise Mode, you have a cluster with general-purpose compute nodes that have locally attached storage.
Because of this tight integration of compute and storage, you get fast and reliable performance all the time. Now, the amount of data that you can store in an Enterprise Mode cluster depends on the total disk capacity of the cluster. Enterprise Mode is suitable for both on-premise and cloud deployments. Now let's look at Eon Mode. To take advantage of cloud economics, Vertica implemented Eon Mode, which is getting very popular among our customers. In Eon Mode, compute and storage are separated by introducing an S3 bucket, or S3-compliant storage. Because of this separation of compute and storage, you get advantages like rapid and dynamic scale out and scale in, isolation of your workloads, and the ability to load data into your cluster without having to worry about the total disk capacity of your local nodes. As is obvious from what I have said, Eon Mode is suitable for cloud deployments. Some of our customers take advantage of the features of Eon Mode but deploy it on-premise by introducing S3-compliant storage, such as FlashBlade. Okay, so let's look at some of the terminology used in Eon Mode. The first thing that I want to talk about is communal storage. It's shared storage, an S3 bucket or S3-compliant shared storage, that is accessible from all the nodes in your cluster. A shard is a segment of data stored in communal storage. A subscription is a binding between nodes and shards. Last, the depot. The depot is a local copy, or local cache, that helps improve query performance. So, a shard is a segment of data stored in communal storage. When you create an Eon Mode cluster, you have to specify the shard count. The shard count decides the maximum number of nodes that will participate in a query. Vertica also introduces a shard called the replica shard that holds the data for replicated projections. Okay, subscriptions: as I said before, a subscription is a binding between nodes and shards.
Each node subscribes to one or more shards, and a shard has at least two nodes that subscribe to it for K-safety. Subscribing nodes are responsible for writing to and reading from the shard data. Subscriber nodes also hold up-to-date metadata, a catalog of the files that are present in the shard. So when you connect to a Vertica node, Vertica will automatically assign you a set of nodes and subscriptions that will process your query. There are two important system tables, NODE_SUBSCRIPTIONS and SESSION_SUBSCRIPTIONS, that can help you understand this a little bit more. So let's look at what's on the local disk of your Eon Mode cluster. On the local disk, you have the depot. The depot is a local file system cache that can hold a subset of the data, a copy of the data in communal storage. The other thing that is there is temp storage. Temp storage is used for storing data belonging to temporary tables and the data that spills to disk when you're processing queries. The last is the catalog. There is a persistent copy of the Vertica catalog that is written to disk; the writes happen at every commit. You only need this persistent copy at node startup. There is also a copy of the Vertica catalog stored in communal storage for durability. The local copy is synced to the copy in communal storage by a service that runs at five-minute intervals. So let's look at the depot. Now, as I said before, the depot is a local file system cache. It helps reduce network traffic and improve the performance of your queries. We make the assumption that the data you load into Vertica is the data you will most frequently query. So all data that is loaded into Vertica first enters the depot and then, as part of the same transaction, is also synced to communal storage for durability. When you run a query against Vertica, your query is also going to look for the files in the depot first. If the files are not found there, the query will access the files from communal storage.
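As a quick sketch of how you might look at those two tables (the column names here are from my reading of the Vertica documentation and may differ by version, so treat them as assumptions):

```sql
-- Which nodes subscribe to which shards, and which subscription is primary:
SELECT node_name, shard_name, is_primary, subscription_state
FROM node_subscriptions
ORDER BY shard_name, node_name;

-- Which shard subscriptions the current session was assigned for its query:
SELECT node_name, shard_name
FROM session_subscriptions
WHERE is_participating;
```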
Now, the behavior of whether new files should first enter the depot or skip the depot can be changed by configuration parameters that let you skip the depot when writing. When the files are not found in the depot, we make the assumption that you may need those files for future runs of your query, which means we will fetch them asynchronously into the depot so that you have those files for future runs. If that's not the behavior that you intend, you can change configuration parameters to tell Vertica not to fetch them when you run your query. This configuration parameter can be set at the database level, session level, and query level, and we are also introducing a user-level parameter where you can change this behavior. Because the depot is going to be limited in size compared to the amount of data that you may store in your Eon cluster, at some point in time your depot will be full, or hit its capacity. To make space for new data that is coming in, Vertica will evict some of the files that are least recently used. As the depot is going to be your query performance enhancer, you want to control the content of your depot, and so what you want to do is decide what should be in your depot. Now, Vertica provides policies, called pinning policies, that can help you pin a specific table, or a partition of a table, into the depot at the sub-cluster level or at the database level. We will talk about this a little bit more in the later slides. Now, look at some of the system tables that can help you understand the size of the depot, what's in your depot, what files were evicted, and what files were recently fetched into the depot. One of the important system tables that I have listed here is DC_FILE_READS. DC_FILE_READS can be used to figure out whether your transaction or query fetched its data from the depot, from communal storage, or from both. Okay? One of the important features of Eon Mode is the sub-cluster. Vertica lets you divide your cluster into smaller execution groups.
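To make the knobs above concrete, here is a hedged sketch of how these settings look in SQL. The parameter names `UseDepotForWrites`/`UseDepotForReads` and the `SET_DEPOT_PIN_POLICY_*` functions are as I recall them from the Vertica documentation, so verify the exact names and signatures for your version; the table name `public.sales` and the partition range are placeholders:

```sql
-- Database level: have loads skip the depot and write straight to communal storage
ALTER DATABASE DEFAULT SET PARAMETER UseDepotForWrites = 0;

-- Session level: don't pull missing files into the depot for this session's queries
ALTER SESSION SET UseDepotForReads = 0;

-- Pinning policies: keep a hot table (or a partition range of it) in the depot
SELECT set_depot_pin_policy_table('public.sales');
SELECT set_depot_pin_policy_partition('public.sales', '2020-01', '2020-03');
```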
Now, each of the execution groups has a set of nodes that, together, subscribe to all the shards and can process your queries independently. So when you connect to a node in a sub-cluster, that node, along with the other nodes in the sub-cluster, will alone process your query. Because of that, we can achieve isolation, as well as scale out and scale in, without impacting what's happening on the rest of the cluster. The good thing about sub-clusters is that all the sub-clusters have access to the common communal storage. Because of this, if you load data in one sub-cluster, it's accessible to the queries that are running in other sub-clusters. When we introduced sub-clusters, we knew that our customers would really love these features, and some of the things we were considering: we knew that our customers would dynamically scale out and in, that is, add and remove lots of sub-clusters on demand, and we had to provide the ability to add and remove sub-clusters in a fast and reliable way. We knew that during off-peak hours our customers would shut down many of their sub-clusters, which means more than half of the nodes could be down, and we had to make adjustments to our quorum policies, which require at least half of the nodes to be up for the database to stay up. We were also aware that customers would add hundreds of nodes to the cluster, which means we had to make adjustments to the catalog and commit policies. To take care of all three of these requirements, we introduced two types of sub-clusters: primary sub-clusters and secondary sub-clusters. The primary sub-cluster is the one that you get by default when you create your first Eon Mode cluster. The nodes in the primary sub-cluster are always up; that means they stay up and participate in the quorum. The nodes in the primary sub-cluster are responsible for processing commits and also maintain a persistent copy of the catalog on disk.
This is the sub-cluster that you would use to process all your ETL jobs, because the Tuple Mover also runs on the nodes in the primary sub-cluster. If, at this point, you want another sub-cluster where you would like to run queries, and also bring that sub-cluster up and down depending on the demand or the workload, you would create a new sub-cluster, and this sub-cluster will by default be secondary in nature. Now, secondary sub-clusters have nodes that don't participate in the quorum, so if these nodes are down, there is no impact on Vertica. These nodes are also not responsible for processing commits, though they maintain an up-to-date copy of the catalog in memory; they don't store the catalog on disk. These are the sub-clusters that you can add and remove very quickly without impacting what is running on the other sub-clusters. We have customers running clusters with hundreds of nodes, with sub-clusters of around 64 nodes, and they can bring a sub-cluster up and down, or add and remove one, within two minutes. So before I go into the sizing of Eon Mode, I just want to say one more thing here. We are working very closely with a group of customers who are running Eon Mode and getting feedback from them on a regular basis. Based on that feedback, we are making lots of improvements and fixes in every hotfix that we put out. So if you are running Eon Mode and want to be part of this group, I suggest that you keep your cluster current with the latest hotfixes and work with us to give us feedback and get the improvements that you need to be successful. So let's look at what we need to size an Eon Mode cluster. Sizing an Eon Mode cluster is very different from sizing an Enterprise Mode cluster.
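As a small illustrative query (the SUBCLUSTERS system table and its columns are my recollection of the Vertica catalog; check the reference for your version), you can see which sub-clusters exist and which of them are primary:

```sql
SELECT subcluster_name, node_name, is_primary
FROM subclusters
ORDER BY subcluster_name, node_name;
```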
When you are sizing a Vertica cluster running in Enterprise Mode, you need to take into account the amount of data that you want to store and the configuration of your nodes, based on which you decide how many nodes you will need, and then start the cluster. Whereas in Eon Mode, to size a cluster you will need to decide a few things: first, what your shard count should be, which determines the maximum number of nodes that will participate in your queries; we'll talk about this a little bit more in the next slide. You will also decide on the number of nodes that you will need within a sub-cluster, the instance type you will pick for running a specific sub-cluster, how many sub-clusters you will need, how many of them should be running all the time, and how many should be running in a dynamic mode. When it comes to shard count, you have to pick the shard count up front, and you can't change it once you have picked it and your database is up and running. So you need to pick the shard count based on the number of nodes, or the maximum number of nodes, that you need to process a query. Now, one thing that we want to remember here is that this is not the amount of data that you have in the database, but the amount of data your queries will process. So you may have data for six years, but if your queries process the last month of data on most occasions, or if your dashboards are processing up to, say, six weeks of data, based on whatever your needs are, you will pick the number of shards and nodes based on how much data your queries process. Looking at most of our customers, we think that 12 is a good number that should work for most of them, and that means the maximum number of nodes in the sub-cluster that will process a query is going to be 12. If you feel that you need more than 12 nodes to process your query, you can pick other numbers like 24 or 48.
If you pick a higher number like 48 and you go with three nodes in your sub-cluster, that means each node subscribes to 16 primary and 16 secondary shard subscriptions, which totals 32 subscriptions per node. That will leave your catalog in a bloated state. So pick your shard count appropriately, and don't pick prime numbers. We suggest that 12 should work for most of our customers. If you think you process more than the regular amount, or that your queries process terabytes of data, then pick a number like 24. We are also coming up with features in Vertica, like elastic crunch scaling, that will help you run queries on more nodes than the number of shards that you picked, and that feature will be coming out soon. So if you have picked a smaller shard count, it's not the end of the story. Now, the next thing you need to pick is how many nodes you need in your sub-cluster to process your queries. The ideal number would be a node count equal to the shard count, or, if you are going to pick a smaller number, pick a node count such that each of the nodes has a balanced distribution of subscriptions. So over here, you have an option where you can have 12 nodes and 12 shards, or you can have two sub-clusters with six nodes and 12 shards. Depending on your workload, you can pick either of the two options. The first option, where you have 12 nodes and 12 shards, is more suitable for batch applications, whereas two sub-clusters with six nodes each is more suitable for dashboard-type applications. Picking sub-clusters depends on your workload: you can add more sub-clusters to achieve isolation or elastic throughput scaling. Your sub-clusters can have nodes of different sizes, but you need to make sure that the nodes within a sub-cluster are homogeneous. So this is my last slide before I hand over to Sumit, and I think it's a very important slide that I want you to pay attention to.
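To make the subscription arithmetic in the example above concrete, here is a small illustrative Python helper (not part of Vertica, purely for the math) that computes the per-node subscription load for a given shard count, node count, and K-safety:

```python
def subscriptions_per_node(shard_count: int, node_count: int, k_safety: int = 1):
    """Per-node (primary, secondary, total) shard subscriptions, assuming an
    even distribution of shards across the nodes of one sub-cluster."""
    if shard_count % node_count != 0:
        raise ValueError("pick a node count that divides the shard count evenly")
    primary = shard_count // node_count    # primary subscriptions per node
    secondary = primary * k_safety         # replica subscriptions for K-safety
    return primary, secondary, primary + secondary

# 48 shards on a 3-node sub-cluster: 16 primary + 16 secondary = 32 per node (bloated catalog)
print(subscriptions_per_node(48, 3))    # (16, 16, 32)
# 12 shards on 12 nodes: 1 primary + 1 secondary = 2 per node
print(subscriptions_per_node(12, 12))   # (1, 1, 2)
```

The divisibility check is the "balanced distribution of subscriptions" rule in code form: a node count that doesn't divide the shard count leaves some nodes with more subscriptions than others.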
When you pick an instance, you're going to pick it based on your workload and query performance. I want to make it clear here that we want you to pay attention to the local disk, because you have the depot on your local disk, which is going to be your query performance enhancer for all kinds of deployments, in the cloud as well as on-premise. So irrespective of what you read or what you've heard, the depot still plays a very important role in every Eon deployment and acts as a performance enhancer. Most of our customers choose Vertica because they love the performance we offer, and we don't want you to compromise on that performance. So pick nodes with some amount of local disk; at least two terabytes is what we suggest. The i3 instances in Amazon come with a good local disk that is very helpful, and some of our customers are buying physical boxes. With that said, I want to pass it over to Sumit. So, hello everyone. My name is Sumit Keswani and I'm a Product Technology Engineer at Vertica. I will be discussing the various use cases that customers deploy in Eon Mode. After that, I will go into some technical details of how SQL works in Eon Mode, and then blend that into the best practices for Eon Mode. And finally, we'll go over some tips and tricks. So let's get started with the use cases. The very basic use case that users will encounter when they start Eon Mode for the first time is that they'll have two sub-clusters. The first sub-cluster will be the primary sub-cluster, used for ETL, like Shirang mentioned, and this sub-cluster will be mostly on, or always on. There will be another sub-cluster used entirely for queries. This sub-cluster is the secondary sub-cluster, and it will be on sometimes, depending on the use case: maybe from 9 to 5, or Monday to Friday, depending on what application is running on it or what users are doing on it. So this is the most basic use case, something that users get started with to get their feet wet.
Now, as the use of the Eon Mode cluster increases, users will graduate to the second use case, and this is the next level of deployment. In this situation, they still have the primary sub-cluster, which is used for ETL, typically a larger sub-cluster where heavier ETL is running pretty much nonstop. Then they have the usual query sub-cluster, which is used for queries, but they may add another secondary sub-cluster for ad hoc workloads. The motivation for this sub-cluster is to isolate the unpredictable workload from the predictable workload so as not to impact certain SLAs. So you may have ad hoc queries, users that are running larger queries, or batch workloads that occur once in a while running on a different secondary sub-cluster, so as not to impact the more predictable workload running on the first sub-cluster. Now, there is no reason why these two sub-clusters need to have the same instance type. They can have different numbers of nodes, different instance types, different depot configurations; everything can be different. Another benefit is that they can be metered differently. They can be costed differently, so that the appropriate user or tenant can be billed the cost of compute. Now, as usage increases even further, this is what we see as the final state of a very advanced Eon Mode deployment. As you'll see, there is the primary sub-cluster, of course, used for ETL, very heavy ETL, and it's always on. There are numerous secondary sub-clusters, some for predictable applications with a very fine-tuned workload that demands definite performance. There are other sub-clusters that have different usages: some for ad hoc queries, others for demanding tenants. There could be more sub-clusters for different departments, like ones that need it at the end of the quarter. So, very different applications.
This is the full and final promise of Eon Mode: there is workload isolation, there is different metering, and each app runs in its own compute space. Okay, so let's talk about a very interesting feature in Eon Mode, which we call Hibernate and Revive. So what is Hibernate? Hibernating the Vertica database is the act of dissociating all the compute from the database and shutting it down. At this point, you shut down all compute. You still pay for storage, because your data is in the S3 bucket, but all the compute has been shut down and you do not pay for compute anymore. If you have reserved instances or any other instances, you can use them for a different application, and your Vertica database is shut down. So this is very similar to Stop Database: you're stopping all compute. The benefit, of course, is that you pay nothing anymore for compute. So what is Revive, then? Revive is the opposite of Hibernate, where you now associate compute with your S3 bucket, or your storage, and start up the database. There is one limitation here that you should be aware of: you must revive the database at the same size it had when you hibernated it. So if you had a 12-node primary sub-cluster when hibernating, you need to provision 12 nodes in order to revive. So one best practice that comes out of this is that you should shrink your database to the smallest size possible before you hibernate, so that you can revive it at the same size and don't have to spin up a ton of compute in order to revive. Basically, what this means is that when you have decided to hibernate, we ask you to remove all your secondary sub-clusters and shrink your primary sub-cluster down to the bare minimum before you hibernate. The benefit is that when you do revive, you will be able to do so with a minimal number of nodes. And of course, before you hibernate, you must cleanly shut down the database so that all the data can be synced to S3.
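A hedged sketch of the clean-shutdown side of this workflow (SYNC_CATALOG and SHUTDOWN are meta-functions I recall from the Vertica documentation; check the exact argument conventions for your version, and note that removing sub-clusters itself is done through admintools and is not shown here):

```sql
-- Make sure the catalog is synced to communal storage,
-- then shut down cleanly before hibernating; revive later
-- with the same node count you had at shutdown.
SELECT sync_catalog();
SELECT shutdown();
```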
Finally, let's talk about backups and replication. Backups and replication are still supported in Eon Mode. We sometimes get the question: since the data is in S3 and S3 has eleven nines of durability, why do we need a backup? Yes, we highly recommend backups. You can back up by using the vbr script, so you can back up your database to another bucket. You can also copy the bucket and revive a different instance of your database. This is very useful, because many times people want staging or development databases and they need some of the production data, and this is a nice way to get that. It also makes sure that if you accidentally delete something, you will be able to get your data back. Okay, so let's go into best practices now. Let's talk about the depot first, which is the biggest performance enhancer that we see for queries. I want to say very clearly that reading from S3, or a remote object store like S3, is very slow, because data has to go over the network, and it's very expensive: you will pay access costs. This is where S3 is not very cheap; every time you access the data, there is an API or access cost levied. Now, the depot is a performance enhancement feature that will improve the performance of queries by keeping a local cache of the data that is most frequently used. It will also reduce the cost of accessing the data, because you no longer have to go to the remote object store to get the data, since it's available on a local volume. Hence, depot shaping is a very important aspect of performance tuning in an Eon database. What we ask you to do is, if you are going to use a specific table or partition frequently, you can choose to pin it in the depot, so that if your depot is under pressure or highly utilized, the objects that are most frequently used are kept in the depot. Depot shaping, therefore, is the act of setting eviction policies so that you prevent the eviction of files that you believe you need to keep.
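For reference, a minimal vbr configuration for backing an Eon database up to a second bucket might look roughly like this. The key names follow my recollection of vbr's cloud-storage settings, and all bucket names, paths, and the snapshot name are placeholders; consult the vbr documentation before use:

```ini
; Sketch of a vbr config for an Eon database backed up to a second S3 bucket
[CloudStorage]
cloud_storage_backup_path = s3://my-backup-bucket/backups/
cloud_storage_backup_file_system_path = []:/home/dbadmin/backup_locks

[Database]
dbName = verticadb

[Misc]
snapshotName = daily_backup
tempDir = /tmp/vbr
```

You would then run something like `vbr -t backup -c eon_backup.ini` on a schedule.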
For example, you may keep the most recent year's data, or the most recent partition, in the depot, and thereby all queries running on those partitions will be faster. At this time, we allow you to pin any table or partition in the depot, but it is not sub-cluster based. Future versions of Vertica will allow you to fine-tune the depot for each sub-cluster. Let's now go and understand a little bit of the internals of how a SQL query works in Eon Mode. Once I explain this, we will blend it into best practices, and it will become much clearer why we recommend certain things. Since S3 is our layer of durability, where data is persisted in an Eon database, when you run an insert query, like INSERT INTO table VALUES (1) or something similar, the data is synchronously written to S3. So before control returns to the client, a copy of the data is first stored in the depot and then uploaded to S3, and only then do we hand control back to the client. This ensures that, you know, if something bad were to happen, the data would be persisted. The second type of SQL transactions are what we call DDLs, which are catalog operations. For example, you created a table or you added a column. These operations actually work with metadata. Now, as you may know, S3 does not offer mutable storage; storage in S3 is immutable. You can never append to a file in S3. And the way transaction logs work is by append operations: when you modify the metadata, you are actually appending to a transaction log. So this poses an interesting challenge, which we resolved by appending to the transaction log locally in the catalog, and then a service sends the catalog to S3 every five minutes. This poses another interesting challenge, right? If you were to destroy or delete an instance abruptly, you could lose the commits that happened in the last five minutes. I'll speak to this more in the subsequent slides.
Now, finally, let's look at drops and truncates in Eon Mode. A drop or a truncate is really a combination of the first two things that we spoke about. When you drop a table, you are performing a catalog operation: you are making a metadata change. You are telling Vertica that this table no longer exists, so we go into the transaction log and append to it that this table has been removed. This log, of course, will be synced to S3 every five minutes, as we discussed. There is also the secondary operation of deleting all the files that were associated with the data in this table. Now, these files are on S3, and we could go about deleting them synchronously, but that would take a lot of time, and we do not want to hold up the client for that duration. So at this point we do not synchronously delete the files; we put the files that need to be removed in a reaper queue and return control to the client. This has a performance benefit, in that the drops appear to occur really fast. It also has a cost benefit: batching deletes into big batches is more performant and less costly. For example, on Amazon, you can delete 1,000 files at a time in a single call. So if you batch your deletes, you can delete them fairly quickly. The disadvantage of this is that if you were to terminate a Vertica cluster abruptly, you could leak files in S3, because the reaper queue would not have had the chance to delete these files. Okay, so let's go into best practices after understanding these technical details. As I said, reading from and writing to S3 is slow and costly. So the first thing you can do is avoid as many round trips to S3 as possible. The bigger the batches of data you load, the better performance you get per commit. The second thing is: don't read from and write to S3 if you can avoid it. A lot of our customers have intermediate data processing, like staging tables, where they transform the data before finally committing it.
There is no reason to use regular tables for this kind of intermediate data. We recommend using local temporary tables, and local temporary tables have the benefit of not having to upload data to S3. Finally, there is another optimization you can make. Vertica has the concept of active partitions and inactive partitions. Active partitions are the ones into which we have recently loaded data, and Vertica is lazy about merging these partitions into a single ROS container. Inactive partitions are historical partitions, like last year's data or the data from the year before; those partitions are aggressively merged into a single container. And how do we know how many partitions are active or inactive? Well, that's based on a configuration parameter. If you load into an inactive partition, Vertica is very aggressive about merging the containers, so we download the entire partition, merge the records that you loaded into it, and upload it back again. This creates a lot of network traffic, and, as I said, accessing data from S3 is slow and costly. So we recommend you not load into inactive partitions. You should load into the most recent, or active, partitions, and if you happen to load into inactive partitions, set your active partition count correctly. Okay, let's talk about the reaper queue. Depending on the velocity of your ETL, you can pile up a lot of files that need to be deleted asynchronously. If you were to terminate a Vertica cluster without allowing enough time for these files to get deleted, you could leak files in S3. Now, of course, if you use local temporary tables this problem does not occur, because the files were never created in S3. But if you are using regular tables, you must allow Vertica enough time to delete these files, and you can change the interval at which we delete, and how much time we allow for deletion at shutdown, by setting some configuration parameters that I have mentioned here. Okay, so let's talk a little bit about the catalog at this point.
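A hedged sketch of both of those recommendations in SQL (the table and column names are placeholders, and the ACTIVEPARTITIONCOUNT clause is as I recall it from the Vertica documentation, so verify it for your version):

```sql
-- Stage intermediate ETL data in a local temporary table: its files
-- live on local disk only and are never uploaded to S3
CREATE LOCAL TEMPORARY TABLE stage_sales (
    sale_date DATE,
    amount    NUMERIC(12, 2)
) ON COMMIT PRESERVE ROWS;

-- If your loads routinely touch the two most recent partitions, tell
-- Vertica both are active so they aren't aggressively re-merged
ALTER TABLE sales SET ACTIVEPARTITIONCOUNT 2;
```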
So the catalog is synced every five minutes to S3 for persistence, and the catalog truncation version is the minimal viable version of the catalog to which we can revive. For instance, if somebody destroyed your Vertica cluster, the entire Vertica cluster, the catalog truncation version is the minimum viable version to which you will be able to revive. Now, in order to make sure that the catalog truncation version is up to date, you must always shut down your Vertica cluster cleanly. This allows the catalog to be synced to S3. Here are some SQL commands that you can use to see what the catalog truncation version is on S3. For the most part, you don't have to worry about this if you are shutting down cleanly; this only matters in cases of disaster, or some event where all nodes were terminated without the user's permission. And finally, let's talk about backups one more time. We highly recommend you take backups. S3 is designed for 99.9% availability, so there could be occasional downtime, and making sure you have backups will help you. S3 will not protect you against data that was deleted by accident, such as an accidentally dropped table, so having a backup helps you there. And why not back up? Storage is cheap. You can replicate the entire bucket and have that as a backup, or have a DR cluster running in a different region, which also serves as a backup. So we highly recommend you make backups. With this, I would like to end my presentation, and we're ready for any questions you have. Thank you very much.