Hi, everybody. Thank you for joining us today for the Virtual Vertica BDC 2020. Today's breakout session is entitled Vertica in Eon Mode at the Trade Desk. My name is Sue LeClaire, Director of Marketing at Vertica, and I'll be your host for this webinar. Joining me is Ron Cormier, Senior Vertica Database Engineer at the Trade Desk.

Before we begin, I encourage you to submit questions or comments during the virtual session. You don't have to wait — just type your question or comment in the question box below the slides and click Submit. There will be a Q&A session at the end of the presentation, and we'll answer as many questions as we're able to during that time. Any questions that we don't address, we'll do our best to answer offline. Alternatively, you can visit the Vertica forums to post your questions there after the session; our engineering team is planning to join the forums to keep the conversation going. Also, a quick reminder that you can maximize your screen by clicking the double arrow button in the lower right corner of the slides. And yes, this virtual session is being recorded and will be available to view on demand this week. We'll send you a notification as soon as it's ready. So let's get started. Over to you, Ron.

Thanks, Sue. Before I get started, I'll just mention that my slide template was created before social distancing was a thing, so hopefully some of the images will harken us back to a time when we could actually all be in the same room. But with that, I want to get started.

Before we get into the technology, I just wanted to cover my background real quick, because I think it speaks to where we're coming from with Vertica at the Trade Desk. I'll start out by pointing out that prior to my time at the Trade Desk, I was a tech consultant at HP, at HP Vertica. I traveled the world working with Vertica customers, helping them configure, install, tune, and set up their Vertica databases and get them working properly. So I've seen the biggest and the smallest implementations and everything in between. Now I'm a principal database engineer at the Trade Desk. The reason I mention this is to let you know that I'm a practitioner — I'm working with the product every day, or most days. This isn't marketing material, so hopefully the technical details in this presentation are helpful. I work with Vertica, of course, and it is most relevant to our ETL and reporting stack: what we're doing is taking a bunch of data into Vertica and running reports for our customers.

We're an ad tech company, so I did want to briefly describe what that means and how it affects our implementation. I'm not going to cover all the details of this slide, but basically I want to point out that the Trade Desk is a DSP, a demand-side platform. We place ads on behalf of our customers: agencies, ad agencies, and their customers that are the advertisers, the brands themselves. The ads get placed onto websites and mobile applications — anywhere digital advertising happens. Publishers are who we think of there, like we see here: ESPN.com, MSN.com, and so on. Every time a user goes to one of these sites or one of these digital places, an auction takes place, and what people are bidding on is the privilege of showing one or more ads to users. This is really important because it helps fund the internet. Ads can be annoying sometimes, but they are incredibly helpful in how we get much of our content.
And this is happening in real time. On the open internet there are anywhere from 7 to 13 million auctions happening every second. Of those 7 to 13 million auctions per second, the Trade Desk bids on hundreds of thousands per second, and any time we bid, we have an event that ends up in Vertica. That's one of the main drivers of our data volume. Certainly other events make their way into Vertica as well, but I wanted to give you a sense of the scale of the data and how it is driven by real people in the world.

So, more on the workload. We have the three Vs in spades, like many people listening: massive volume, velocity, and variety. In terms of the data sizes, I've got some stats here on the raw data sizes that we deal with on a daily basis. We ingest 85 terabytes of raw data per day. Once we get it into Vertica, we do some transformations: we do matching, which is basically joins, and we do some aggregations, group bys, to reduce the data and clean it up, so it's more efficient to consume by our reporting layer. That matching and aggregation produces about 10 new terabytes of raw data per day. It all comes from the data that was ingested, but it's new data. So it is reduced quite a bit, but it's still pretty high volume.

We then have this aggregated data that we run reports on, on behalf of our customers: about 40,000 reports per day. That's actually a little bit of an older number; it's probably closer to 50 or 55,000 reports per day at this point. I think it's probably a pretty common use case for Vertica customers. It may be a little different in the sense that most of the reports themselves are batch reports, so it's not a user sitting at a keyboard waiting for the results. Basically, we have a workflow where we do the ingest, we do the transform, and then once all the data is available for a day, we run reports on behalf of our customers on that daily data. Then we send the reports out via email, or we drop them in a shared location, and they look at the reports at some later point in time.

Up until Eon, we were on enterprise Vertica. At our peak we had four production enterprise clusters, each of which held two petabytes of raw data. I'll give you some details on how those enterprise clusters were configured in terms of hardware. But before I do that, I want to talk about the reporting workload specifically. The reporting workload is particularly lumpy, and what I mean by that is there's a bunch of work, a bunch of queries that we need to run, that becomes available in a short period of time after the day's ingest and aggregation is completed. Then the clusters are relatively quiet for the remaining portion of the day. That's not to say they're not doing any read workload — they certainly are — but it's much less read activity after that big spike. What I'm showing here is our reporting queue, and the spike is when all those reports become available to be processed. We can't run the reports until we've done the full ingest and matching and aggregation for the day, so right around 1 or 2 a.m. UTC every day, that's when we get the spike. We affectionately call it the UTC hump. Basically it's a huge number of queries that need to be processed as soon as possible, and we have service levels that dictate what "as soon as possible" means.
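To make that daily transform step concrete, here's a minimal sketch of a match-and-aggregate job using the vertica-python client. The connection details and the table and column names (raw_bids, raw_impressions, agg_daily_campaign) are hypothetical stand-ins, not the Trade Desk's actual schema:

```python
import vertica_python

# Hypothetical connection details.
conn_info = {
    "host": "vertica-etl.example.com",
    "port": 5433,
    "user": "etl_user",
    "password": "...",
    "database": "addb",
}

# Hypothetical schema: raw bid and impression events land in wide "raw"
# tables; the match step joins them and the aggregate step rolls them up
# with a GROUP BY, shrinking raw input into much smaller aggregates that
# the reporting layer consumes.
MATCH_AND_AGGREGATE = """
    INSERT INTO agg_daily_campaign
    SELECT b.campaign_id,
           b.event_date,
           COUNT(*)               AS bids,
           SUM(i.impression_cost) AS spend
    FROM raw_bids b
    LEFT JOIN raw_impressions i USING (bid_id)
    WHERE b.event_date = '2020-03-30'
    GROUP BY b.campaign_id, b.event_date
"""

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()
    cur.execute(MATCH_AND_AGGREGATE)
    conn.commit()
```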
But I think the spike illustrates our use case pretty accurately, and as we'll see, it's really well suited for Vertica Eon.

So we had our enterprise clusters that I mentioned earlier. Just to give you some details on what they looked like: they were independent and mirrored. What that means is all four clusters held the same data, and we did this intentionally because we wanted to be able to run our reports anywhere. We've got this big queue of reports that need to be run. We started with one cluster, and then we found that it couldn't keep up, so we added a second; the number of reports we needed to run in that short period of time kept going up, and so on. We eventually ended up with four enterprise clusters, basically with the same data. We say they were mirrored — they all held the same data — but they weren't synchronized; they were independent. Basically, we would run the ETL pipeline, the ingest and the matching and the aggregation, on all the clusters in parallel. It wasn't as if each cluster proceeded to the next step in sync with the other clusters; they ran independently. So each cluster was, in a sense, eventually consistent. This worked pretty well, though it created some imbalances, and there were some cost concerns that we'll dig into.

Just to tell you about each of these clusters: they each had 50 nodes, with 72 logical CPU cores, half a terabyte of RAM, a bunch of RAIDed disk drives, and two petabytes of raw data, as I stated before. So pretty big physical nodes that we had in our data centers. We actually leased these nodes, so they sat in our data center provider's data centers. These were what we built our business on, basically.

But there were a number of challenges that we ran into as we continued to build our business and add data and add workload. The first one is something I'm sure many can relate to: capacity planning. We had to think about the future and try to predict the amount of work that was going to need to be done and how much hardware we were going to need to meet that demand. That's just generally a hard thing to do. It's very difficult to predict the future, as we can probably all attest to given how much the world has changed even in the last month. So it's a very difficult thing to look six, 12, 18 months into the future and get it right. What we tended to do is make our plans, our estimates, very conservative, so we overbought in a lot of cases. And not only that, we had to plan for the peak: that point in time, those number of hours in the early morning, when we had all those reports to run. So we ended up buying a lot of hardware, and we actually sort of overbought at times. Then, as the hardware aged, it would kind of grow into maturity, and our workload would gradually approach matching the capacity. So that was one of the big challenges.

The next challenge is that we were running out of disk. We wanted to add data in two dimensions, the dimensions everybody can think about: we wanted to add more columns to our big aggregates, and we wanted to keep our big aggregates for longer periods of time. So both horizontally and vertically, we wanted to expand the data sets.
But we basically were running out of disk. There was no more disk, and it's hard to add disk to Vertica in enterprise mode — not impossible, but certainly hard. And one cannot add disk without adding compute, because in enterprise mode the disk is all local to each of the nodes, for most people. You can do exotic things with SANs and other external arrays, but there are a number of other challenges with that. So in order to add disk, we had to add compute, and that basically kept us out of balance: we were adding more compute than we needed for the amount of disk. That was the problem. We had to add actual physical nodes — getting them ordered, delivered, racked, cabled — and before we even start to touch Vertica, there are lead times there. It's also a long commitment to compute. Like I mentioned, we lease our hardware, so we were committing to these nodes, these physical servers, for two or three years at a time, and as I mentioned, predicting that far out can be a hard thing to do. But we wanted, at the least, to keep our capex down, and we wanted to keep our aggregates for a long period of time.

We could have done more exotic things to help us with this if we had to in enterprise mode. We could have started to daisy-chain clusters together, and that would have been a non-trivial engineering effort, because we would need to figure out how to migrate data. First we would shard the data across all the clusters, and then we would have to migrate data from one cluster to another as it aged. And we would have to think about how to run queries across clusters: if your dataset spans two clusters, you would have had to aggregate it within each cluster, maybe, and then build something on top to aggregate the data from each of those clusters. Not impossible things, but certainly not easy things.

Luckily for us, we started talking to Vertica about separation of compute and storage, and I know other customers were talking to Vertica as we were, because lots of people had these problems. And so Vertica in Eon mode came to the rescue. I want to talk about Eon mode really briefly for those in the audience who aren't familiar. It's basically Vertica's answer to the separation of compute and storage: it allows one to scale compute and/or storage separately, and there are a number of advantages to doing that. In the old enterprise days, when you added compute, you added storage along with it; now we can add one or the other, or both, according to what we need.

Really briefly, here is how it works — this figure was taken directly from the Vertica documentation. It takes advantage of the cloud, in this case Amazon Web Services, and the elasticity in the cloud. Basically we've got EC2 instances — Elastic Compute Cloud servers — that access data in an S3 bucket; the three EC2 nodes and the bucket with its data objects are what you see in this diagram. There are a couple of big differences. One: the persistent storage of the data, where the data lives, is no longer on each of the nodes. The persistent storage of the data is in the S3 bucket. What that does is basically solve the first of our big problems, which is that we were running out of disk. S3 has, for all intents and purposes, infinite storage, so we could keep much more data there, and that mostly solved one of our big problems. So the persistent data lives on S3.
Now, what happens when a query runs? It runs on one of the three nodes that you see here. We'll talk about the depot in a second, but in a brand new cluster the query will run on those EC2 nodes and there will be no data on them. So those nodes will reach out to S3 and run the query on remote storage. The nodes are literally reaching out to the communal storage for the data and processing it entirely without using any data on the nodes themselves. That works pretty well. It's not as fast as if the data were local to the nodes, so what Vertica did is build a caching layer on each of the nodes, and that's what the depot represents. The depot is some amount of disk that is relatively local to the EC2 node. When a query runs on remote storage, on the S3 data, it queues that data up for download to the nodes. The data will then reside in the depot, so the next query, or subsequent queries, can run on local storage instead of remote storage, and that speeds things up quite a bit. So the depot is basically a caching layer, and we'll talk about the details of how we configured our depot.

The other thing I want to point out is that, since this is the cloud, another problem Eon helps us solve is the concurrency problem. You can imagine that these three nodes are one sort of cluster, and what we can do is spin up another three nodes and have them point to the same S3 communal storage bucket. Now we've got six nodes pointing to the same data, but we've isolated each set of three nodes so that they act as if they were their own cluster. Vertica calls them subclusters. So we've got two subclusters, each of which has three nodes. What this has essentially done is double the concurrency — doubled the number of queries that can run at any given time — because we've now got this new chunk of compute which can answer queries. That has given us the ability to add concurrency much faster. And I'll point out that since it's the cloud and there are on-demand pricing models, we can have significant savings, because when a subcluster is not needed, we can stop it, and we pay almost nothing for it. That's really important and really helpful, especially for our workload, which, as I pointed out before, is so lumpy. In those hours of the day when it's relatively quiet, I can go and stop a bunch of subclusters, and I won't pay for them, so that yields savings. So that's Eon. Obviously the engineers and the documentation can give you a lot more information, and I'm happy to field questions later on as well.

But I want to talk about how we implemented Eon at the Trade Desk. I'll start on the left-hand side at the top. What we're representing here is subclusters. There's subcluster zero, our ETL subcluster, and it is our primary subcluster. When you get into the world of Eon, there are primary subclusters and secondary subclusters, and it has to do with quorum. Primary subclusters are the subclusters that we always expect to be up and running, and they contribute to quorum: they decide whether there are enough nodes up for the database to stay up.
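A quick way to see the primary/secondary split Ron describes is the SUBCLUSTERS system table that Eon-capable Vertica versions expose (one row per node per subcluster); a minimal sketch, with hypothetical connection details:

```python
import vertica_python

# Hypothetical connection details.
with vertica_python.connect(host="vertica.example.com", port=5433,
                            user="dbadmin", password="...",
                            database="addb") as conn:
    cur = conn.cursor()
    # SUBCLUSTERS has one row per node, so grouping gives node counts.
    cur.execute("""
        SELECT subcluster_name,
               is_primary,
               COUNT(*) AS node_count
        FROM v_catalog.subclusters
        GROUP BY subcluster_name, is_primary
        ORDER BY is_primary DESC, subcluster_name
    """)
    for name, is_primary, nodes in cur.fetchall():
        role = "primary" if is_primary else "secondary"
        print(f"{name}: {nodes} nodes ({role})")
```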
This is where we run our ETL workload — the ingest, the match, and the aggregate parts of the work that I talked about earlier. These nodes are always up and running, because our ETL pipeline is always on; we're an internet ad tech company, like I mentioned. We're constantly running ads, there's always data flowing into the system, and the matching and the aggregation are always happening. That part happens 24/7, so we want those nodes to always be up and running, and we need those processes to be super efficient. That is reflected in our instance type.

Each of our subclusters is 64 nodes; we'll talk about how we came to that number. The instance type for the ETL subcluster, the primary subcluster, is i3.8xlarge. That is one of the instance types that has quite a bit of NVMe storage attached — 32 cores and 244 gigs of RAM on each node. I should have put the amount of NVMe on the slide, but I think it's seven terabytes of NVMe storage. What that allows us to do is basically ensure that everything this subcluster does is always in depot, which makes sure that it's always fast.

Then we get to the secondary subclusters. These are, as mentioned, secondary, so they can stop and start, and it won't affect the cluster going up or down; they are sort of independent. We've got four of what we call read subclusters. Technically they're not read-only — any subcluster can ingest and create new data within the database, and that'll all get pushed to the S3 bucket — but logically, for us, they're read-only. Most of the work they happen to do is read-only, which is nice, because if it's read-only it doesn't need to worry about commits. We let the primary subcluster, our ETL subcluster, worry about committing data, and we don't have to have all the nodes in the database participating in transaction commits.

So we've got four read subclusters and one ETL subcluster: a total of five subclusters, each running 64 nodes. That gives us a 320-node database, all things counted. Not all of those nodes are up at the same time, as I mentioned; for big chunks of the day most of the read nodes are down, but they do all spin up during our busy time.

For the read subclusters we've got i3.4xlarge instances. Again the i3 instance family, which has NVMe storage; these nodes have, I think, three and a half terabytes of NVMe per node — there are two NVMe drives, and we RAID 0 them together — with 16 cores and 122 gigs of RAM. These are smaller, as you'll notice, but that works out well for us, because the read workload is typically dealing with much smaller datasets than the ingest or aggregation workload. We can run these workloads on smaller instances, save a little bit of money, and get more granularity in how many subclusters are stopped and started at any given time. The data on the NVMe isn't persistent across stop and start. That's an important detail, but it's okay, because the depot does a pretty good job with its algorithm: it pulls in data that's recently used, and the data that gets pushed out, evicted, is the least recently used — it was used a long time ago, so it's probably not going to be used again.

And we've actually got two of these whole clusters: a 320-node cluster in US East and a 320-node cluster in US West. So we've got high availability and region diversity, and they're peers, like I talked about before: independent, but mirrors.
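Some back-of-the-envelope arithmetic on those NVMe and depot sizes, using the figures from the talk (and the roughly 512 GB data/temp carve-out Ron mentions in a moment); the numbers are rounded estimates, not measurements:

```python
# Back-of-the-envelope depot sizing using the figures from the talk.
TB = 1.0  # work in terabytes

nvme_per_read_node = 3.5 * TB    # i3.4xlarge, two NVMe drives in RAID 0
nvme_per_etl_node  = 7.0 * TB    # i3.8xlarge
temp_reservation   = 0.5 * TB    # ~512 GB carved out for data/temp storage
nodes_per_subcluster = 64

read_depot = (nvme_per_read_node - temp_reservation) * nodes_per_subcluster
etl_depot  = (nvme_per_etl_node  - temp_reservation) * nodes_per_subcluster

print(f"read subcluster depot: ~{read_depot:.0f} TB")   # ~192 TB
print(f"ETL subcluster depot:  ~{etl_depot:.0f} TB")    # ~416 TB

# The daily aggregate output is ~10 TB, so the ETL subcluster's depot
# holds weeks of fresh data -- which is why the hot ETL working set
# effectively always lives in depot.
daily_aggregate = 10 * TB
print(f"days of aggregate output per ETL depot: ~{etl_depot / daily_aggregate:.0f}")
```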
Each cluster runs 128 shards. Shards are basically similar to segmentation: you take the data set and divide it into chunks, and each subcluster can see the data set in its entirety, with each subcluster dealing with all 128 shards. We chose 128 because it gives us even distribution of the data on 64-node subclusters — 128 divides evenly by 64, so there's no data skew — and to sort of future-proof ourselves: in case we wanted to double the size of any of the clusters, we could double the number of nodes and still have no skew; the data would still be evenly distributed.

For disk, we've got an EBS-based array that the catalog uses — that's the catalog storage location. I think we take four EBS volumes, RAID 0 them together, and come up with a 128-gigabyte drive. We wanted EBS for the catalog because we can stop and start nodes and that data will persist. It comes back when the node comes up, so we don't have to run a bunch of configuration when the node starts; basically the node starts, automatically joins the cluster, and very shortly thereafter starts processing work. So that's the catalog on EBS. The NVMe is another RAID 0; as I mentioned, this data is ephemeral, so when we stop and start, it goes away. Basically, we take 512 gigabytes of the NVMe and give it to the data and temp storage locations, and then we take whatever is remaining and give it to the depot. Since the ETL and the read subclusters are different instance types, the depot is sized differently on each, but otherwise the layout is the same.

Where we are now: we stopped purging data for some of our big aggregates, we added a bunch more columns, and at this point we have eight petabytes of raw data in each Eon cluster — about four times what we could hold in our enterprise clusters. And we can continue to add to this. Maybe we need to add compute, maybe we don't, but the amount of data that can be held there can obviously grow much more.

We've also built an auto-scaling tool, a service that basically monitors the queue I showed you earlier, watches for those spikes, and when it sees them, goes and starts up instances in one or more of the subclusters. That's how we have compute capacity match the demand.

I'll also point out that we actually have one subcluster of specialized nodes that isn't strictly a customer-reports subcluster. We have this tool called Planner, which basically optimizes ad campaigns for our customers. We built it; it runs on Vertica, uses data in Vertica, runs Vertica queries, and it was wildly successful, so we wanted to have some dedicated compute for it. With Eon, it was really easy to spin up a new subcluster and say: here you go, Planner team, do what you want. You can completely maximize the resources on these nodes, and it won't affect any of the other operations we're doing — the ingest, the matching, the aggregation, or the reports. So Eon gave us a great deal of flexibility and agility, which is super helpful.
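The auto-scaling service Ron describes above is homegrown and not public, but a rough sketch of the shape of such a tool — poll a queue-depth metric, then start or stop the EC2 instances behind each read subcluster — might look like this. The tag name, subcluster names, thresholds, and the queue_depth() source are all hypothetical:

```python
import time
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

READ_SUBCLUSTERS = ["read_0", "read_1", "read_2", "read_3"]  # hypothetical names


def subcluster_instance_ids(name):
    """Find the EC2 instances tagged as belonging to one subcluster."""
    resp = ec2.describe_instances(
        Filters=[{"Name": "tag:vertica-subcluster", "Values": [name]}]
    )
    return [inst["InstanceId"]
            for res in resp["Reservations"]
            for inst in res["Instances"]]


def queue_depth():
    """Placeholder: report-queue depth from your workflow system."""
    return 0


def reconcile():
    # Crude policy: one running read subcluster per 10,000 queued reports.
    wanted = min(len(READ_SUBCLUSTERS), queue_depth() // 10_000)
    for idx, name in enumerate(READ_SUBCLUSTERS):
        ids = subcluster_instance_ids(name)
        if not ids:
            continue
        if idx < wanted:
            ec2.start_instances(InstanceIds=ids)  # no-op if already running
        else:
            # In practice you would stop Vertica on the nodes gracefully
            # before stopping the instances underneath it.
            ec2.stop_instances(InstanceIds=ids)


while True:
    reconcile()
    time.sleep(300)  # poll every five minutes
```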
So the question is: has it been worth it? For us the answer has been a resounding yes. We're doing things that we never could have done at reasonable cost before; we've got more data, we've got all these nodes, and we're much more agile.

How to quantify that? Well, it's not quite as simple and straightforward as you might hope. We still have enterprise clusters — we've got two of the four that we had at peak — and we've got our two Eon clusters, but they are running different workloads and they're comprised of entirely different hardware, as I've covered. The number of nodes is different per subcluster, so 64 versus 50 is going to have different performance. The workload itself is different: the aggregation is aggregating more columns on Eon, because that's where we have the space available. The queries themselves are different: we're running more queries, and more data-intensive queries, on Eon, because that's where the data is available. So in a sense, Eon is doing the heavy lifting for our workload.

In terms of query performance — still a little anecdotal — for the queries that run on Eon, the performance matches that of the enterprise cluster quite closely when the data is in the depot. When the data is not in the depot, and Vertica has to go out to S3 to get the data, performance degrades, as you might expect, but it depends on the query itself. Things like counts are really fast, but if you need to materialize lots of columns, that can run slower — not orders of magnitude slower, but certainly a multiple of the time.

In terms of cost, let's get a little bit more quantified. What I tried to do is multiply it out: if I wanted to run the entire workload on enterprise, and I wanted to run the entire workload on Eon — all the data we have today, all the queries, everything — to try to make it apples to apples. For enterprise, I estimate that we would need approximately 18,000 CPU cores all together. That's a big number, but it doesn't even cover all the non-trivial engineering work that would be required, which I referenced earlier: things like sharding the data among multiple clusters, migrating the data from one cluster to another, the daisy-chain type stuff. So that's one data point. Now, for Eon to run the entire workload, I estimate we'd need about 20,480 CPU cores. So, more CPU cores than enterprise — however, about half of those, approximately 10,000 CPU cores, would only run for about six hours per day. With the on-demand pricing and elasticity of the cloud, that is a huge advantage. So we are definitely moving as fast as we can to being all Eon. We have time left on our contracts with the enterprise clusters, so we're not able to get rid of them quite yet, but Eon is certainly the way of the future for us.

I also want to point out that we have found Eon to be the most efficient MPP database on the market, and what I mean by that is that for a given dollar of spend, we get the most out of Vertica compared to other cloud and MPP database platforms. So our business is really happy with what we've been able to deliver with Eon.
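Turning those two estimates into core-hours per day makes the elasticity advantage concrete. This ignores per-core price differences between leased on-prem hardware and EC2, so it's only a rough comparison:

```python
# Comparing the two estimates from the talk in core-hours per day.
enterprise_cores = 18_000
eon_cores_always_on = 20_480 - 10_000  # ETL plus always-on read capacity
eon_cores_bursty    = 10_000           # only needed ~6 hours per day

enterprise_core_hours = enterprise_cores * 24
eon_core_hours = eon_cores_always_on * 24 + eon_cores_bursty * 6

print(f"enterprise: {enterprise_core_hours:,} core-hours/day")  # 432,000
print(f"eon:        {eon_core_hours:,} core-hours/day")         # 311,520
print(f"eon uses ~{eon_core_hours / enterprise_core_hours:.0%} of the core-hours")
```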
Eon has also given us the ability to begin a new use case, which is probably pretty familiar to folks on the call: it's UI-based. We'll have a website that our customers can log into, and on that website they'll be able to run reports, run queries, through the website and have them run directly on a separate Vertica Eon cluster. This is much more latency-sensitive and concurrency-sensitive. The workload I've described up until this point has been pretty steady throughout the day — then we get our spike, and then it goes back to normal for the rest of the day. This new workload will be potentially much more variable: we don't know exactly when something is going to make a lot of people want to log into the website and check how their campaigns are doing. But Eon really helps us with this, because we can add capacity so easily. We can add compute and scale it up and down as needed, and it allows us to match the concurrency, which can be much more variable, without the big long lead time. So we're really excited about this.

On this last slide, I just want to leave you with some things to think about if you're about to embark on, or are getting started with, your journey with Vertica Eon. One of the things you'll have to think about is the node count and the shard count; they're kind of tightly coupled. The node count we determined by spinning up some instances in a single subcluster and finding acceptable performance, considering current workload and future workload, for the queries we had when we started. We went with 64. We didn't want them to be too big — of course, it costs money — and we like to do things in powers of two, so: 64 nodes. Then the shard count. The shards, again, are like data segmentation; it's a new type of segmentation on the data. We went with 128, and again the reason is so that we could have no skew — each node processes the same amount of data — and we wanted to future-proof it. So that's probably a nice general recommendation: double the shard count relative to the node count.

The instance type, and how much depot space, are certainly things you're going to have to consider. Like I was saying, we went with the i3.4xlarge and i3.8xlarge because they offer good depot storage, which gives us really consistent, good performance, with the data all in depot; the Vertica documentation has some information on this. I think we're going to use the r5 or r4 instance types for our UI cluster — the data is smaller, so there's much less emphasis on the depot, and we don't need all that NVMe storage.

You're probably going to want a mix of reserved and on-demand instances if you're a 24/7 shop like we are. Our ETL subclusters are reserved instances, because we know we're going to run those 24 hours a day, 365 days a year; there's no advantage to having them be on-demand, since on-demand costs more than reserved. So we get cost savings by figuring out what we're going to run and keep running. It's the read subclusters that are, for the most part, on-demand. One of our read subclusters is actually on 24/7, because we keep it up for ad hoc queries, analyst queries — we don't know exactly when they're going to hit, and the analysts want to be able to keep working whenever they want to.

In terms of the initial data load, the initial ingest: what we had to do — and I imagine it still works this way today — is basically load all your data from scratch. There isn't great tooling just yet for moving data from enterprise to Eon. What we did is export all the data in our enterprise cluster into Parquet files, put those out on S3, and then ingest them into our first Eon cluster. It was kind of a pain — we scripted out a bunch of stuff, obviously — but it worked. And the good news is that once you've done that, the second Eon cluster is just a bucket copy, and there are tools that can help with that.
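A minimal sketch of that per-table migration path, assuming a Vertica version that can write Parquet straight to an s3:// path (the talk describes exporting to local disk first and then pushing to S3). Hosts, bucket, and table names are hypothetical:

```python
import vertica_python

# Hypothetical connection details for the two clusters.
SRC = {"host": "enterprise.example.com", "port": 5433,
       "user": "dbadmin", "password": "...", "database": "addb"}
DST = {"host": "eon.example.com", "port": 5433,
       "user": "dbadmin", "password": "...", "database": "addb"}

table = "agg_daily_campaign"
stage = f"s3://ttd-migration-stage/{table}"  # hypothetical staging bucket

with vertica_python.connect(**SRC) as src:
    # EXPORT TO PARQUET writes files in parallel from all nodes.
    src.cursor().execute(
        f"EXPORT TO PARQUET(directory = '{stage}') AS SELECT * FROM {table}"
    )

with vertica_python.connect(**DST) as dst:
    cur = dst.cursor()
    cur.execute(f"COPY {table} FROM '{stage}/*' PARQUET")
    dst.commit()
```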
You're also going to want to manage your fetches and evictions — the data that's in the cache, in the depot, is what I'm referring to here. Like I talked about, we have our ETL subcluster, which has the most recent data, the data that has just been aggregated. We wouldn't want anybody logging into that ETL subcluster and running queries on big aggregates going back a year or two, because that would invalidate the cache: the depot would start pulling in that historical data and evicting the recent data, which would slow down the ETL pipeline. So you want to make sure that users, whether they're service accounts or human users, are connecting to the right subcluster. We just created DNS entries with IPs and target groups, so it was pretty easy, but it's definitely something to think about.

Lastly, if you're like us and you're going to want to stop and start nodes, you're going to have to have a service that does that for you. We built a very simple tool that watches the queue and stops and starts subclusters accordingly. We're hoping we can work with Vertica to have it be a little bit more driven by the cloud configuration itself — for us that's Amazon, and we'd love it if we could have it scale natively with AWS.

There are two things to watch out for when you're working with Eon. The first is system table queries on storage-layer metadata. The thing to be careful of is that the storage-layer metadata is replicated: there's a copy for each of the subclusters that are out there. We have our ETL subcluster and our read subclusters, so for each of the five subclusters there is a copy of the data in the STORAGE_CONTAINERS system table and the PARTITIONS system table. When you want to use these system tables for analyzing how much data you have, or for any other analysis, make sure you filter your query by node name. For us, that's node names less than or equal to node 64, because each of our subclusters is 64 nodes, so we limit the query to the 64-node ETL subcluster. If we didn't have this filter, we would get 5x the values for counts and sums and that sort of thing.
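Here's what that node-name filter might look like against the STORAGE_CONTAINERS system table. The database name (and therefore the node names) is hypothetical, and the comparison relies on Vertica's zero-padded node naming so that a lexicographic `<=` picks out the first 64 nodes:

```python
import vertica_python

QUERY = """
    SELECT schema_name,
           projection_name,
           SUM(used_bytes) / 1024.0 / 1024 / 1024 / 1024 AS used_tb
    FROM v_monitor.storage_containers
    WHERE node_name <= 'v_addb_node0064'   -- ETL subcluster only
    GROUP BY schema_name, projection_name
    ORDER BY used_tb DESC
"""

# Hypothetical connection details.
with vertica_python.connect(host="vertica.example.com", port=5433,
                            user="dbadmin", password="...",
                            database="addb") as conn:
    cur = conn.cursor()
    cur.execute(QUERY)
    for schema, proj, tb in cur.fetchall():
        print(f"{schema}.{proj}: {tb:.2f} TB")
```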
Lastly, there is a problem that we're working on and thinking about: DC table data for subclusters that are stopped. When an instance is stopped, the operating system is literally down and there's no way to access it — and that takes the DC table data with it. So after my subclusters scale up in the morning and then scale down, I can't run DC table queries on what performed well, and where, and that sort of thing, because that data is local to those nodes. So that's something to be aware of. We're working on a solution, an implementation, to continuously pull that data out of all the nodes — even the read-only nodes — and bring it into some other kind of repository, perhaps another Vertica cluster, so that we can run analysis and monitoring even when those nodes are down.

That's it. Thanks for taking the time to listen to my presentation; I really enjoyed it.

Thank you, Ron. That was a tremendous amount of information — thank you for sharing it with everyone. We have some questions that have come in that I would like to present to you, Ron, if you have a couple more minutes. Let's jump right in. The first one: loading 85 terabytes of data per day is a pretty significant amount. What format does that data come in, and what does the load process look like?

Yeah, great question. So the format is tab-separated files that are gzip-compressed. The reason for that is basically historical: we don't have many tabs in our data, and this is how the data gets compressed and moved off of our bidders, the things that generate most of this data. As for how we load it, I would say we actually have kind of the Cadillac of loaders, in a couple of different respects. One is that we've got this orchestration layer, which is homegrown, that manages the data that gets loaded into Vertica. We accumulate data, and then we take some files and push them — we distribute them among the ETL nodes in the cluster. So we're literally pushing the files to the nodes, and then we run a COPY statement to ingest the data into the database, and then we remove the files from the nodes themselves. It's a little bit of extra data movement, which we may think about changing in the future, especially as we move more and more to Eon. But the really nice thing about this, especially for the enterprise clusters, is that the COPY statements are really fast. COPY statements use memory like any other query, and the performance of a COPY statement is really sensitive to the amount of available memory. Since the data is local to the nodes — literally in the data directory that I referenced earlier — the COPY can access that data from the NVMe storage and run very fast, and then that memory is available to do something else. So we take a little bit of cost in terms of latency, in terms of downloading the data to the nodes first. As we move more and more to Eon, we might start ingesting directly from S3, not copying the data to the nodes first — we'll see about that. But that's how our data ingestion works.
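A minimal sketch of that push-then-COPY pattern, with per-node gzip'd TSV files. The staging paths, node names, and target table are hypothetical, and the exact COPY options (per-source GZIP, DIRECT) should be checked against your Vertica version's documentation:

```python
import vertica_python

# Files are assumed to have already been pushed to each node's local
# staging directory by the orchestration layer.
sources = ", ".join(
    f"'/staging/bids_{i:04d}.tsv.gz' ON v_addb_node{i:04d} GZIP"
    for i in range(1, 65)
)

# Each node parses only the gzip'd TSV files that were pushed to it.
copy_sql = f"""
    COPY raw_bids
    FROM {sources}
    DELIMITER E'\\t'
    DIRECT
"""

# Hypothetical connection details.
with vertica_python.connect(host="etl.example.com", port=5433,
                            user="etl_user", password="...",
                            database="addb") as conn:
    cur = conn.cursor()
    cur.execute(copy_sql)
    conn.commit()
```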
Great, thanks. Another question: what was the biggest challenge you found when migrating from on-prem to AWS?

Yeah, a couple of things come to mind. The first was the backfill, the initial data load. It was kind of a pain, like I referenced in that last slide, only because we didn't have tools built to do this. We had to script some stuff out — it wasn't overly complex, but it's just a lot of data to move, even starting with two petabytes: making sure that there's no missed data, no gaps, while moving it off the enterprise cluster. What we did is export it to local disk on the enterprise cluster, then push it to S3, and then ingest it into Eon, all as Parquet. So it's a lot of data to move around, and we had to cut over at some point and stop loading data while we did that. So that was a challenge — sort of a one-time challenge. The other thing — not so much something we're dealing with now, but it was a challenge — is that Eon is a relatively new product for Vertica. One of our big advantages with Eon is that it allows us to stop and start nodes, and recently Vertica has gotten quite good at stopping and starting nodes. For a while there, it took a really long time to start a node back up, and it could be invasive, but we worked with the engineering team, with Yonzi and others, to really reduce that, and now it's not really an issue that we think too much about.

Okay, thanks. Towards the end of the presentation you said that you've got 128 shards, but your subclusters are usually around 64 nodes, so you talked about a ratio of 2 to 1. Why is that, and if you were to do it again, would you use 128 shards?

Good question. As a refresher, the reason why is that we wanted to future-proof ourselves. Basically, we wanted to make sure that the number of shards was evenly divisible by the number of nodes. I could have done that with 64, or with 128, or any other multiple of 64, but we went with 128 to try to protect ourselves in the future, so that if we wanted to double the number of nodes in the ETL subcluster specifically, we could do that. It would have doubled from 64 to 128, and then each node would have just one shard to deal with — so, no skew. The second part of the question: if I had to do it over again, I think I would have stuck with 128. We've been running this cluster for more than 18 months now in the US, and we haven't needed to increase the number of nodes. In that sense it's been a little bit of extra overhead having more shards, but it gives us the peace of mind that we can easily double the node count and not have to worry about it. So I think 2 to 1 is a nice place to start, and you might even consider 3 to 1 or 4 to 1 if you're expecting really rapid growth — if you're just getting started with Eon and your data is small now, but you expect it to grow significantly.

Great, thank you, Ron. That's all the questions we have for today. If you do have others, please feel free to send them in and we will get back to you; we'll respond directly via email. And again, our engineers will be available on the Vertica forums, where you can continue the discussion with them. I want to thank Ron for the great presentation, and also the audience for your participation and questions. Please note that a replay of today's event and a copy of the slides will be available on demand shortly, and of course we invite you to share this information with your colleagues. Again, thank you — this concludes this webinar. Have a great day.