Hi everybody. Thank you for joining us today for the Virtual Vertica BDC 2020. This breakout session is entitled Vertica at Uber Scale. My name is Sue LeClair, Director of Marketing at Vertica, and I'll be your host for this webinar. Joining me is Girish Baliga, Engineering Manager of Big Data at Uber. Before we begin, I encourage you to submit questions or comments during the virtual session. You don't have to wait. Just type your question or comment in the question box below the slides and click Submit. There will be a Q&A session at the end of the presentation. We'll answer as many questions as we're able to during that time. Any questions that we don't address, we'll do our best to answer offline. Alternately, you can also visit the Vertica forums to post your questions there after the session. Our engineering team is planning to join the forums to keep the conversation going. And as a reminder, you can maximize your screen by clicking the double arrow button in the lower right corner of the slides. Also, this virtual session is being recorded and you'll be able to view it on demand this week. We'll send you a notification as soon as it's ready. So let's get started. Girish, over to you.

Thanks a lot, Sue. Good afternoon, everyone. Thanks a lot for joining this session. My name is Girish Baliga, and as Sue mentioned, I manage the interactive and real-time analytics teams at Uber. Vertica is one of the main platforms that we support, and Vertica powers a lot of our core business use cases. In today's talk, I wanted to cover two main things. First, how Vertica is powering critical business use cases across a variety of orgs in the company. And second, how we are able to do this at scale and with reliability, using some of the additional functionality and systems that we have built into the Vertica ecosystem at Uber. And towards the end, I also have a little extra bonus for all of you. I'll be sharing an easy way for you to take advantage of many of the ideas and solutions that I'm going to present today, which you can apply to your own Vertica deployments in your companies. So stick around, put on your seatbelts, and let's start the ride.

At Uber, our mission is to ignite opportunity by setting the world in motion. We are focused on solving mobility problems and enabling people all over the world to address their local needs and local issues in a manner that's efficient, fast, and reliable. As our CEO, Dara, has said, we want to become the mobile operating system of local cities and communities throughout the world. As of today, Uber is operational in over 10,000 cities around the world. Across our various business lines, we have over 110 million monthly users who use our rides services, our Eats services, and a whole bunch of other services that we provide at Uber. And just to give you a sense of the scale of our daily operations: in the rides business, we have over 20 million trips per day, and our Eats business is also catching up, particularly during the recent times we have been having. I hope these numbers give you a sense of the amount of data that we process each and every day to support our users in their analytical and business reporting needs. So who are these users at Uber? Let's take a quick look. Uber, to describe it very briefly, is a lot like Amazon. We are largely an operations and logistics company, and our employee base reflects that.
So over 70% of our employees work in teams which come under the umbrella of community operations and centers of excellence. These are all folks working in the various cities and towns that we operate in around the world, running the Uber businesses as somewhat local businesses, responding to local needs, local market conditions, local regulations, and so forth. And Vertica is one of the most important tools that these folks use in their day-to-day business activities. They use Vertica to get insights into how their businesses are doing, to drill deeply into any issues that they want to triage, to generate reports, to plan for the future, and a whole lot of other use cases.

The second big class of users are in our marketplace team. Marketplace is the engineering team that backs our ride share business. As part of running this business, a key problem they have to solve is how to determine what prices to set for particular rides so that we have a good match between supply and demand. Obviously, the real-time pricing decisions are made by serving systems with very detailed and well-crafted machine learning models. However, the training data that goes into these models, the historical trends, the insights that go into building these models, a lot of these things are powered by the data that we store and serve over Vertica. Similarly, in the Eats business, we have use cases spanning all the way from engineering and backend systems to support operations, incentives, growth, and a whole bunch of other domains.

A big class of applications that we support across a lot of these business lines is dashboards and reporting. We have a lot of dashboards which are built by core data analyst teams and shared with a whole bunch of our operations and other teams. These are dashboards and reports that run periodically, say once a week or even once a day, depending on the frequency of data that they need, and many of these are powered by the data and the analytics support that we provide on our Vertica platform. Another big category of use cases is growth marketing. This is to understand historical trends, figure out how various business lines, customer segments, and geographical areas are doing in terms of growth, and where it is necessary for us to reinvest or provide some additional incentives or marketing support, and so forth. The analysis that backs a lot of these decisions is powered by queries running on Vertica. And finally, the heart and soul of Uber is data science. Data science is how we provide best-in-class algorithms for pricing and matching, and a lot of the analysis that goes into figuring out how to build these systems, how to build the models, how to build the various coefficients and parameters that go into making real-time decisions, is based on analysis that data scientists run on Vertica systems.

So as you can see, Vertica usage spans a whole bunch of organizations and users all across the different Uber teams and ecosystems. Just to give you some quick numbers, we have over 5,000 weekly active users who run queries at least once a week to solve some critical business problem in their day-to-day operations. So next, let's see how Vertica fits into the Uber data ecosystem.
So when users open up their apps and request a ride or order a food delivery on the Eats platform, the apps talk to our serving systems, and the serving systems use online storage systems to store the data as the trips and Eats orders are getting processed in real time. For this, we primarily use an in-house key-value storage system called Schemaless and an open-source system called Cassandra. We also have other systems like MySQL and Redis, which we use for storing various bits of data to support our serving systems.

All of these operations generate a lot of data that we then want to process, analyze, and use for our operational improvements. So we have ingestion systems that periodically pull in data from our serving systems and land it in our data lake. At Uber, our data lake is powered by Hadoop, with files stored on HDFS clusters. Once the raw data lands in the data lake, we have ETL jobs that process these raw data sets and generate modeled and customized data sets, which we then use for further analysis. Once these modeled data sets are available, we load them into our data warehouse, which is entirely powered by Vertica. Then we have a business intelligence layer, with internal tools like Query Builder, which is a UI for writing queries, looking at results, and iterating over different insights, and Dash Builder, which is a dashboard building and report management tool. These are all tools that we have built within Uber, and they talk to Vertica and run SQL queries to power whatever dashboards and reports they are supporting. So this is what the data ecosystem looks like at Uber.

So why Vertica, and what does it really do for us? It powers the insights that we show on dashboards for our folks to use, and it also powers the reports that we run periodically. But more importantly, Vertica provides some core properties and core feature sets which allow us to support many of these use cases very well and at scale. So let me take a brief tour of what these are.

As I mentioned, Vertica powers Uber's data warehouse. What this means is that we load our core fact and dimension tables onto Vertica. The core fact tables are all the trips, all the Eats orders, and all the other line items for Uber's various businesses, stored as partitioned tables. So think of having one partition per day, as well as dimension tables like cities, users, riders, courier partners, and so forth. We have both these kinds of data sets, which we load into Vertica. And we have the full historical data, all the way from when we launched these businesses to today, so that folks can do deeper longitudinal analysis. They can look at patterns like how the business has grown from month to month, year to year, the same month over a year or multiple years, and so forth. And the really powerful thing about Vertica is that most of these queries, even the deep longitudinal queries, run very, very fast. That's really why we love Vertica: our query latency P90, that is, the 90th percentile of all queries that we run on our platform, typically finishes in under a minute. That's very important for us, because Vertica is used primarily for interactive analytics use cases, and providing SQL query execution times under a minute is critical for our users and our business owners to get the most out of our analytics and big data platforms. Vertica also provides a few advanced features that we use very heavily.
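Before getting into those advanced features, here is a minimal sketch of the kind of longitudinal query described above, written against the open-source vertica-python client that comes up later in this talk. The table and column names (trips, datestr, city_id) and the connection details are hypothetical stand-ins, not Uber's actual schema.

```python
# A minimal sketch of a longitudinal query on a daily-partitioned fact
# table, using the open-source vertica-python client. Table, column,
# and connection names are hypothetical.
import vertica_python

conn_info = {
    'host': 'vertica.example.internal',  # hypothetical endpoint
    'port': 5433,
    'user': 'analyst',
    'password': '...',
    'database': 'warehouse',
}

# Month-over-month trip counts for one city. The date predicate lets
# Vertica prune down to just the daily partitions it actually needs.
QUERY = """
    SELECT DATE_TRUNC('month', datestr) AS month,
           COUNT(*)                     AS trips
    FROM trips
    WHERE city_id = 42
      AND datestr >= '2019-01-01'
    GROUP BY 1
    ORDER BY 1;
"""

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()
    cur.execute(QUERY)
    for month, trip_count in cur.fetchall():
        print(month, trip_count)
```

On a column store with one partition per day, a query like this touches only the columns and partitions involved, which is a big part of why even deep longitudinal queries can stay fast.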
So as you might imagine, at Uber, one of the most important sets of use cases we have is around geospatial analytics. In particular, we have some critical internal dashboards that rely very heavily on being able to restrict data sets by geographic areas, cities, source-destination pairs, heat maps, and so forth. And Vertica has a rich array of geospatial functions that we use very heavily.

We also use Vertica's support for custom projections, and this really helps us get very good performance on critical data sets. For instance, on some of our core fact tables, we have done a lot of query analysis to figure out how users run their queries, what columns they use, what combinations of columns they use, and what joins they do in typical queries. We have then laid out our custom projections to maximize performance along these particular dimensions, and the ability to do that through Vertica is very valuable for us. I'll show a quick sketch of what such a projection can look like in a moment.

We've also had some very successful collaborations with the Vertica engineering team. About a year and a half back, we open sourced a Python client that we had built in-house to talk to Vertica. We were using this Python client in the business intelligence layer that I showed on the previous slide, and we open sourced it after working closely with the Vertica team. Now Vertica formally supports the Python client as an open source project, which you can download today and integrate into your systems. Another more recent example of collaboration is Vertica EON mode on GCP. As most, or at least some, of you know, Vertica EON mode is formally supported on AWS. At Uber, we were also looking to see if we could run our data infrastructure on GCP. So the Vertica team hustled on this and provided us an early preview version, which we have been testing to see how performance is impacted by running on the cloud and on GCP. So far, I think things are going pretty well, and we should have some numbers on this real soon.

Here I have a visualization of an internal dashboard that is powered solely by data and queries running on Vertica. This GIF has a sequence of different visualizations supported by this tool. For instance, here you see a time-varying heat map of demand for ride shares. Then you will see a bunch of arrows about source-destination pairs and trip lengths, and you can see how demand moves around. As the tool cycles through the various animations, you can see all the different kinds of insights and query shapes that we send to Vertica, which powers this critical business dashboard for our operations teams.

So now, how do we do all of this at scale? We started off with a single Vertica cluster a few years back. We had our data lake, and the data would land into Vertica; these are the core fact and dimension tables that I just spoke about. Then Vertica powers queries at our business intelligence layer. This is a very simple and effective architecture for most use cases. But at Uber scale, we ran into a few problems. The first issue is that Uber is a pretty big company at this point, with a lot of users sending on the order of millions of queries every week. At that scale, what we began to see was that a single cluster was not able to handle all the query traffic.
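Here is the promised sketch of a custom projection, with hypothetical table and column names; the ORDER BY and SEGMENTED BY choices are the knobs you tune based on the kind of query analysis described above.

```python
# A sketch (hypothetical names) of laying out a custom projection so the
# sort order matches the dominant filter and join columns found through
# query analysis, issued through the same vertica-python client.
import vertica_python

DDL = """
CREATE PROJECTION trips_by_city_day AS
SELECT city_id, datestr, driver_id, fare
FROM trips
ORDER BY city_id, datestr        -- the columns most queries filter on
SEGMENTED BY HASH(city_id) ALL NODES;
"""

conn_info = {'host': 'vertica.example.internal', 'port': 5433,
             'user': 'dba', 'password': '...', 'database': 'warehouse'}

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()
    cur.execute(DDL)
    # Backfill the new projection from existing data using Vertica's
    # built-in refresh function.
    cur.execute("SELECT START_REFRESH();")
```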
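And to build some intuition for the queuing argument that comes next, here is a back-of-the-envelope illustration. It assumes a simple M/M/1 model (a single serving system with Poisson arrivals), which is my simplification rather than anything Uber measured, and the capacity number is made up.

```python
# Back-of-the-envelope queuing illustration: mean time in system for an
# M/M/1 queue is W = 1 / (mu - lambda), where mu is the service rate
# and lambda is the arrival rate. Watch latency blow up near saturation.
mu = 100.0  # hypothetical cluster capacity: 100 queries per minute

for utilization in (0.5, 0.8, 0.9, 0.95, 0.99):
    lam = utilization * mu
    wait_minutes = 1.0 / (mu - lam)  # mean end-to-end time in system
    print(f"utilization {utilization:.0%}: "
          f"avg end-to-end latency {wait_minutes * 60:.1f} seconds")
```

The point is that end-to-end latency explodes as the cluster approaches saturation, even though each individual query still executes quickly once it gets to run.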
So for those of you who have done an introductory course on queuing theory, you will realize that even if all the queries could be processed by a single serving system, you will tend to see larger and larger queue wait times as the queries pile up. What this means in practice for our end users is that they see longer and longer query latencies. Even though the actual query execution time on Vertica itself is probably less than a minute, the query sits in the queue for a bunch of minutes, and that is the latency the end user perceives. So this was a huge problem for us.

The second problem was that the cluster becomes a single point of failure. Now, Vertica can handle single-node failures very gracefully, and it can probably also handle two or three node failures, depending on your cluster size and your application. But very soon you'll see that once you go beyond a certain number of failed nodes or nodes in maintenance, your cluster will probably need to be restarted, or you will start seeing downtime due to other issues. Another example of why you would have downtime is when you're upgrading the software in your clusters. Since we are a global company and we have users all around the world, we really cannot afford downtime, even for a one-hour slot. So that turned out to be a big problem for us. And as I mentioned, we could have hardware issues: we might need to upgrade our machines, or replace storage or memory, due to normal wear and tear or abnormal issues. Because of all of these things, having a single point of failure, having a single cluster, was not really practical for us.

So the next thing we did was set up multiple clusters: a bunch of identical clusters, all of which have the same data sets. We would load data using ingestion pipelines from our data lake onto each of these clusters, and the business intelligence layer would be able to query any of them. This actually solved most of the issues that I pointed out on the previous slide. We no longer had a single point of failure. Any time we had to do a version upgrade, we would just take one cluster offline and upgrade the software on it. If we had node failures, we could take one cluster out if we had to, or rotate spare nodes into our production clusters, and so forth.

However, having multiple clusters led to a new set of issues. The first problem was that we would end up with inconsistent schemas. One thing to understand about our platform is that we are an infrastructure team, so we don't actually own or manage any of the data that is served on the Vertica clusters. We have data set owners and publishers who manage their own data sets. Now, exposing multiple clusters to these data set owners turns out not to be a great idea, because they are not really aware of the importance of keeping schemas and data sets consistent across different clusters. So over time, what we saw was that the schemas for the same tables would drift out of sync, because updates were not consistently applied on all clusters.
Or maybe they were just experimenting with some new columns or new tables on one cluster and forgot to delete them, whatever the case might be. We basically ended up in a situation where we saw a lot of inconsistent schemas, even across some of our core tables, in our different clusters.

A second issue was that since the ingestion pipelines were ingesting data independently into all these clusters, these pipelines could fail independently as well. What this meant was that if, for instance, the ingestion pipeline into cluster B failed, then the data there would be older than in clusters A and C. So when a query came in from the BI layer, if it happened to hit B, you would probably see different results than you would if you went to A or C. This was obviously not an ideal situation for our end users, because they would end up seeing slightly different counts, and that led to a bad situation where they were not able to fully trust the data and the results and insights being returned by their SQL queries.

The third problem was that we had a lot of extra replication. The 80/20 rule, or maybe even the 90/10 rule, applies to the data sets on our clusters as well: less than 10% of our data sets, for instance, see 90% of the queries. So it doesn't really make sense for us to replicate all of our data on all the clusters, and having a setup where we had to do that was obviously very suboptimal for us.

So then what we did was build some additional systems to solve these problems. This brings us to the Vertica ecosystem that we have in production today. On the ingestion side, we built a system called Vertica Data Manager, which manages all the ingestion into the various clusters. At this point, the people who manage data sets, the data set owners and publishers, no longer have to be aware of individual clusters. They just set up their ingestion pipelines with an endpoint in Vertica Data Manager, and the Vertica Data Manager ensures that all the schemas and data are consistent across all our clusters. On the query side, we built a proxy layer. What this ensures is that when queries come in from the BI layer, they are forwarded smartly, with knowledge of which clusters are up, which clusters are down, which clusters are available, which clusters are loaded, and so forth. With these two layers of abstraction between our ingestion and our query paths, we were able to have a very consistent, almost single-system view of our entire Vertica deployment. The third piece we had to put in place was a data manifest, which is the communication mechanism between ingestion and the proxy. The data manifest is basically a listing of which tables are available on which clusters, which clusters are up to date, and so forth.

With this ecosystem in place, we were also able to solve the extra replication problem. So now we have some big clusters, where all the tables, in fact, are served, along with a set of smaller clusters. Any query that hits the less frequently queried 90% of tables goes to the big clusters, and most of the queries, which hit the heavily queried 10% of important tables, can also be served by many of the small clusters. This is a much more efficient use of resources. So this is the view that we have today of Vertica within Uber.
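To make the proxy and manifest ideas concrete, here is a highly simplified sketch. The cluster names, manifest contents, and load numbers are all hypothetical, and the real system at Uber is of course far more involved.

```python
# A toy version of manifest-based query routing: the manifest maps each
# table to the clusters that currently hold a fresh copy, and the proxy
# routes a query to a healthy, least-loaded cluster that has every table
# the query needs. All names and numbers are hypothetical.
MANIFEST = {
    'trips':     {'big-1', 'big-2', 'small-1'},  # hot table, widely replicated
    'cities':    {'big-1', 'big-2', 'small-1'},
    'audit_log': {'big-1', 'big-2'},             # cold table, big clusters only
}
HEALTHY = {'big-1', 'small-1'}           # from health checks
LOAD = {'big-1': 0.7, 'small-1': 0.2}    # from query response times

def route(tables):
    # Clusters that hold *all* required tables and are currently healthy.
    candidates = set.intersection(*(MANIFEST[t] for t in tables)) & HEALTHY
    if not candidates:
        raise RuntimeError('no healthy cluster has all required tables')
    # Simple load balancing: pick the least-loaded candidate.
    return min(candidates, key=lambda c: LOAD[c])

print(route({'trips', 'cities'}))     # -> small-1 (least loaded)
print(route({'trips', 'audit_log'}))  # -> big-1 (only healthy big cluster)
```

The design point is that the manifest plus health signals are all the proxy needs, so clusters can come and go behind the scenes without the BI layer ever knowing.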
So external to our team, folks just have one endpoint where they set up their ingestion jobs, and another endpoint where they forward their Vertica SQL queries, which are served by the proxy layer. So let's get into a little more detail about each of these layers.

On the data management side, as I mentioned, we have two kinds of tables. First, we have dimension tables: the list of cities, the list of drivers, the list of users, and so forth. These change relatively infrequently, maybe once a day or so, and since these data sets are not very big, we simply swap them out in full on every ingestion cycle. The fact tables, on the other hand, are the tables which have information about our trips, our Eats orders, and so forth. These are partitioned: we have roughly one partition per day for the last couple of years, and then a more hierarchical partition setup for older data. What we do is load the partitions for the last three days on every cycle. The reason we do that is that not all our data comes in at the same time. We have updates for trips going back over the past two or three days, for instance, where people add ratings to their trips or provide feedback for drivers, and we want to capture all of that in the row corresponding to that particular trip. So we reload partitions for the last three days to make sure we capture all those updates. We also update older partitions if, for instance, records were deleted for retention purposes, GDPR purposes, or other regulatory reasons. We do this less frequently, but these partitions are also updated when necessary, and there are endpoints which allow data set owners to specify which partitions they want to update. As I mentioned, data is typically managed using a hierarchical partitioning scheme. In this way, we take advantage of the data being clustered by day, so that we don't have to update all the data at once.

When we are recovering from a cluster event, like a version upgrade, a software upgrade, a hardware fix, or failure handling, or even when we are adding a new cluster to the system, the data manager takes care of updating the tables, copying over the new partitions, and making sure the schemas are all right. We then validate data and schema consistency and make sure everything is up to date before we add the cluster back to our serving pool and the proxy starts sending traffic to it.

The second thing the data manager provides is consistency. The main thing we do here is atomic updates of our tables and partitions for fact tables, using a two-phase commit scheme. What we do is load all the new data into temp tables on all the clusters in phase one. Then, when all the clusters give a success signal, we promote these tables to primary and set them as the main serving tables for incoming queries. We also optimize the load using Vertica data copy. What this means is that in our earlier parallel pipeline scheme, we had to ingest data individually from the HDFS clusters into each of the Vertica clusters, which took a lot of HDFS bandwidth. Using this nice feature that Vertica provides, we instead load data into one cluster and then copy it to the other clusters much more efficiently. This has significantly reduced our ingestion overheads and sped up our load process.
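Here is a minimal sketch of that two-phase load. The helpers are stubbed-out stand-ins for the real loading and promotion steps, which in practice issue SQL such as loading into temp tables and then swapping them in atomically; all names are hypothetical.

```python
# A minimal sketch of a two-phase commit load across replicated
# clusters. Phase 1 stages new partitions everywhere; phase 2 promotes
# them everywhere only if every cluster succeeded, so readers see either
# all-old or all-new data, never a mix. Names are hypothetical.
CLUSTERS = ['cluster-a', 'cluster-b', 'cluster-c']

def load_into_staging(cluster, table, partitions):
    """Phase 1: load new partitions into <table>__staging on one cluster."""
    print(f"{cluster}: loaded {partitions} into {table}__staging")
    return True  # in reality: run the load and checks, report success/failure

def promote_staging(cluster, table):
    """Phase 2: atomically swap the staging table into the serving table."""
    print(f"{cluster}: promoted {table}__staging -> {table}")

def two_phase_load(table, partitions):
    # Phase 1: stage everywhere; serving tables stay untouched.
    if not all(load_into_staging(c, table, partitions) for c in CLUSTERS):
        raise RuntimeError('phase 1 failed on some cluster; aborting load')
    # Phase 2: promote everywhere only after every cluster succeeded.
    for c in CLUSTERS:
        promote_staging(c, table)

two_phase_load('trips', ['2020-03-29', '2020-03-30', '2020-03-31'])
```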
And as I mentioned, at the second phase of the commit, all data is promoted at the same time. Finally, we make sure that all the data is up to date by doing some checks around the number of rows and various other key signals for freshness and correctness, which we compare with the data in the data lake.

In terms of schema changes, VDM automatically applies these consistently across all the clusters. First, we stage the changes to make sure that they are correct. This catches errors such as trying to make an incompatible update, like changing a column's type. So we make sure that schema changes are validated, and then we apply them to all clusters atomically, again for consistency, to provide an overall consistent view of our data to all our users.

On the proxy side, we have transparent support for replicated clusters for all our users. The way we handle that is, as I mentioned, the cluster-to-table mapping is maintained in the manifest database, and when a query comes in, the proxy is able to see which cluster has all the tables in that query and route the query to the appropriate cluster based on the manifest information. The proxy is also aware of the health of individual clusters. If for some reason a cluster is down for maintenance or upgrades, the proxy is aware of this information, and it also does monitoring based on query response and execution times. It uses this information to route queries to healthy clusters and to do some load balancing, to ensure that we avoid hotspots on the various clusters.

So the key takeaways I have from this talk are primarily these. We started off with a single-cluster mode on Vertica, and we ran into a bunch of issues around scaling and availability due to cluster downtimes. We then set up a bunch of replicated clusters to handle the scaling and availability issues, but ran into issues around schema consistency, data staleness, and extra data replication. So we built an entire ecosystem around Vertica, with abstraction layers for data management and ingestion and a query proxy, and with this setup we were able to enforce consistency and improve our storage utilization. Hopefully this gives you all a brief idea of how we have been able to scale Vertica usage at Uber and power some of our most business-critical and important use cases.

As I mentioned at the beginning, I have an interesting and simple extra bonus for you. An easy way in which you all can take advantage of many of the features that we have built into our ecosystem is to use the Vertica EON mode. EON mode allows you to set up multiple clusters with consistent data updates, size them differently to handle different query loads, and it automatically handles many of the issues that I mentioned in our ecosystem. So do check it out. We've also been trying it out on GCP, and our initial results look very, very promising.

So thank you all for joining me in this talk today. I hope you learned something new, and hopefully you took away something that you can apply to your own systems. We have some time for questions, so I'll pause for now and take any questions.