Hi, all. Thanks for joining me for this presentation today. Seven years ago, I attended a conference session where someone talked about exploring Docker for a week. Today it is a big privilege for me to present some of the work that we've done at Uber, where we have Dockerized our Hadoop infrastructure. Over the next 30 minutes or so, I'd like to share our containerization journey.

Before we go into the details, a quick introduction about myself. My name is Matt. I worked on Apache Ambari, a Hadoop cluster management system, in its early days. I joined Uber in 2016 and have pretty much worked in the same domain, on Hadoop clusters, and since then I've expanded my scope into the broader data infrastructure. Currently, I lead the deployment and automation domain for data infrastructure, which also includes containerization and container orchestration of the stack. Outside of this domain, I also follow new developments in the data architecture space, including data mesh and the products that enable it.

We have an action-packed agenda today. I'll cover how we used to manage the data infrastructure back in the day and what made us adopt containers. I'll walk through some of the challenges that we faced as we adopted containers. One of the biggest parts of adopting containers was the migration itself, so I'll talk about how we engineered the migration and go into the details of some of the specific strategies that we used to make it easier. I'll briefly discuss the cultural shift within the team as we went along with this migration, cover some of the key takeaways from this presentation, and also give a glimpse of what we are going to work on in the future.

With that said, jumping into the background: Hadoop today powers the entire batch analytics stack at Uber. Everything from mobile click events to online DB change events is ingested into the Hadoop infrastructure, and this powers a whole bunch of analytics for Uber, including ETA prediction, decision making for pricing and promotions, ML and whatnot. The Hadoop infrastructure itself comprises two major services: one of them is HDFS and the other one is YARN. HDFS is our distributed file system, and YARN is a resource management system for launching distributed applications.

Before going further into the details of the presentation, I'd like to do a quick refresher on the Hadoop architecture. On the left side you will see the HDFS architecture, where we have a central NameNode which stores the metadata, basically the file-to-block mapping. The HDFS architecture also has DataNodes, which store the blocks themselves. Together they form a distributed file system. On the YARN side, which is on the right side of this slide, we have a ResourceManager, which is the central system that controls resources across the entire cluster. Clients and users submit applications to the ResourceManager, and the ResourceManager is in charge of ensuring that these distributed applications get allocated containers to run across the entire fleet. The NodeManager is the node-level agent which is responsible for starting and stopping containers and basically ensuring that those containers keep running correctly. One common thing that you will see here is that both HDFS and YARN have control planes: the NameNode and the ResourceManager are the control planes within the Hadoop architecture. And we also have the DataNodes and NodeManagers, which are the data plane of the Hadoop stack.
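To make that refresher a bit more concrete, here is a tiny, purely illustrative Python sketch of the split between HDFS metadata and data. It is not Uber's code; the block size, class names and file path are only examples.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 128 * 1024 * 1024  # a typical HDFS block size, 128 MB

@dataclass
class NameNode:
    # Control plane: keeps only metadata, the file-to-block mapping.
    file_to_blocks: dict = field(default_factory=dict)

    def add_file(self, path: str, size_bytes: int) -> list:
        n_blocks = max(1, -(-size_bytes // BLOCK_SIZE))  # ceil(size / block size)
        blocks = [f"{path}#blk_{i}" for i in range(n_blocks)]
        self.file_to_blocks[path] = blocks
        return blocks

@dataclass
class DataNode:
    # Data plane: stores the block bytes themselves.
    host: str
    blocks: dict = field(default_factory=dict)

    def store(self, block_id: str, data: bytes) -> None:
        self.blocks[block_id] = data

# A 300 MB file becomes three blocks of metadata on the NameNode, while the
# block contents live on DataNodes (and would normally be replicated).
nn, dn = NameNode(), DataNode(host="dn-001")
for blk in nn.add_file("/data/trips/2021-05-01.parquet", 300 * 1024 * 1024):
    dn.store(blk, b"")
print(nn.file_to_blocks)
```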
One other point to mention here is that Hadoop is a very traditional system. It was built and designed for bare metal, and it has traditionally been run in the industry on bare metal or on VMs. When I mention containers in this architecture, these containers are typically JVM processes running on those hosts; they are not necessarily Docker containers.

This slide gives a good overview of how our team and its scope have evolved. I joined the team in 2016, when we had an SRE team with three or four members. In late 2017, early 2018, we augmented the team by adding an additional sub-team which worked on development on the core Hadoop side. By late 2019, we merged both sub-teams into a single DevOps team, after which we had a lot of innovation happening, including moving Hadoop to containers. Until early 2020 or so, we were using a lot of ad hoc scripts and tooling, written in a range of different languages, to maintain our Hadoop infrastructure. By mid 2020, we started adopting containers, and there's a reason why we started adopting them at that point, which I'll get to in a minute.

One of the other things to look at is how the fleet size has grown over the years. You'll see this blue line here, which shows that the fleet size grew exponentially from a few hundred nodes into the 10,000s and 20,000s range, whereas our engineering team grew only about 3x in that time. Once we scaled our clusters, we got to a point where we were managing 25,000-plus nodes, and the Hadoop infrastructure was supporting 350,000 distinct applications, or jobs, per day. I'll be using applications and jobs interchangeably in this presentation. These jobs have a very short lifetime: they run for a few hours or so, do an ETL job or some aggregation job, and basically come back the next day and run in a similar fashion. These are all distributed applications, so those 350,000 jobs translated to 20 million-plus JVM processes launched across the fleet on a daily basis.

As we scaled our Hadoop infrastructure over the years, it became a big effort to assign people to do zone turn-ups, zone decommissions, OS upgrades and whatnot. These fleet-wide operations became cumbersome for the team in general due to the lack of automation and operational excellence. And this was our major turning point: we were repeatedly assigning people every six months to do one of these fleet-wide operations, and this is when we made the big decision to move to Docker containers and rely on an internally built container orchestration system to simplify our operations.

Given that I've set the context, I'd like to move into some of the challenges that we faced along the containerization journey. As I was preparing this presentation, I created a list of the challenges we faced and ended up with a long list, so I'm going to present the top three that I felt were most meaningful for the audience, and I'll walk through them in the next few slides.

The first challenge that we faced was multi-cluster management. Hadoop is notorious for the sheer number of configs needed to stand up one single cluster. Back in 2016, we had two clusters to manage, and it was relatively easy to manage their configs, but by 2020 we had around 40 to 50 clusters to manage.
The early practice within the team for managing clusters was to create a new directory in a Git repository, copy-paste the default configs from open source, and customize them, and by the time you deployed the configs to the cluster, you might have already ended up with 1,000-plus lines of config. There's a snippet here which shows one of our internal clusters, Broman, and the number of lines in the config files that Hadoop has to manage, which is around 1,000-plus. If we multiply that 1,000 by 40 to 50 clusters, we'd be looking at something like 40,000 to 50,000 lines, and it became a big maintenance problem for us in general.

So we revamped the entire config and cluster management system in such a way that we pulled out all the common configs and put them into one specific part of the codebase. We defined different cluster types, defined cluster names tied to those cluster types, and added a bunch of template files which were used for generating the XML files that Hadoop understands. We have a config generator which takes all of these inputs and generates the config files at a cluster level. These generated files are baked into the Docker image, and our container orchestration system deploys it to each of the clusters. During deployment, a set of runtime variables is injected; these include host name, cluster name, ports and whatnot. Some of these variables, such as the cluster name, provide enough input to pick the correct config files for running the container.

The biggest benefit of this approach was that the configs and the cluster definitions were expressed in a language called Starlark, in roughly 4,000 lines of code, and from that we were able to generate 150,000 lines of config. This made it much easier for us to manage clusters in general. We also adopted a similar approach, following similar templates, for maintaining not only the Hadoop configs but also metrics, alerts and so on.
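To make the idea concrete, here is a minimal, hypothetical sketch of this kind of config generation. The real system is written in Starlark with our internal generator; the cluster types, names and property values below are made up for illustration.

```python
# Hypothetical sketch of the config-generation idea: cluster types carry the
# common config, cluster definitions override only what differs, and a
# generator renders the Hadoop XML per cluster.
CLUSTER_TYPES = {
    "analytics": {"dfs.replication": "3", "dfs.blocksize": "134217728"},
    "ml":        {"dfs.replication": "2", "dfs.blocksize": "268435456"},
}

CLUSTERS = {
    # cluster name -> (cluster type, per-cluster overrides)
    "clusterA": ("analytics", {"dfs.hosts.exclude": "/etc/hadoop/excludes"}),
    "clusterB": ("ml", {}),
}

def render_hdfs_site(cluster_name: str) -> str:
    """Merge type-level defaults with cluster overrides and emit hdfs-site.xml."""
    cluster_type, overrides = CLUSTERS[cluster_name]
    props = {**CLUSTER_TYPES[cluster_type], **overrides}
    body = "\n".join(
        f"  <property><name>{k}</name><value>{v}</value></property>"
        for k, v in sorted(props.items())
    )
    return f"<configuration>\n{body}\n</configuration>\n"

# The generated files are baked into the image; at deploy time, runtime
# variables such as the cluster name select the right set of configs.
print(render_hdfs_site("clusterA"))
```

The point of the design is that the per-cluster definition stays tiny: everything common lives in the cluster type, and the generator does the fan-out to full Hadoop XML.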
The second biggest challenge that we had back in the day was security, in particular how nodes were onboarded. The old process of adding nodes to the cluster looked something like this: someone goes and provisions hosts, then you talk to someone in a different team to generate secrets for those hosts, you SCP the secrets to the hosts, and then you start the processes, which later join the cluster. As part of our containerization and automation journey, we thought about infrastructure security from the start, because that was critical for the automation to just work. Within Uber, we use one of our proprietary container orchestration systems to do all of this deployment automation. What you see in this picture is a very simplified diagram of how we provision identities and secrets for each of the nodes that we add to the cluster. The very first process that we bring up on the host is something called the SPIRE agent. SPIRE is based on the SPIFFE authentication framework; it provides identities and a way to establish trust between software systems within a distributed system.

Once the SPIRE agent is up and the node is attested, the next process, the cluster manager agent, is able to come up and report to the cluster manager server saying, hey, this node is up and would like to join the cluster with this attested token. The cluster manager server, upon receiving this request, ensures that a secret is created for the worker. The worker is a custom sidecar container managed by the team, which again authenticates with the SPIRE agent, gets a token, reads the secret and provides it to the process, which is either the DataNode, the NameNode, the NodeManager and so on. With this, we folded all the security-related automation into the overall container orchestration automation.

The third big challenge was orchestrating the entire distributed system itself. This is a very simplified diagram of the stack and the number of containers that we run across both HDFS and YARN. HDFS is supported by around 30,000-plus containers and YARN by around 70,000-plus containers, so overall we run at least 100,000 containers. On top of that, we have customer applications which launch 20 million-plus new containers on a daily basis. Here I'm talking about the bottom part of the stack, maintaining those 100,000 containers at scale. Like I mentioned before, we use an internal container orchestration system, proprietary to Uber, for managing all of our Docker containers. This diagram showcases the interaction that happens between the user or automation and the system to add or decommission nodes in HDFS as a distributed system. One of the key things here is that HDFS, unlike other services that you might run on a system like Kubernetes, has different roles. Like I mentioned before, there's the HDFS control plane and the HDFS data plane, which is the DataNodes. These are different roles, and one of the important aspects is that both of these roles are stateful systems. So maintaining containers at scale without compromising the durability of data is a big challenge by itself, and automating on top of that to do large-scale fleet operations live is another step forward for us in terms of automation.

Going into the details, take adding a node to the cluster as an example. The user or automation input is given to the cluster manager control plane. The cluster manager control plane changes a state which is stored somewhere in the system, and that state is propagated to the worker sitting next to the HDFS control plane. The HDFS control plane gets the information that a new node is supposed to be added and registers the node with itself. The cluster manager system also provides the node-level information, such as container version, container type and so on, to the node where the container should be started; this is handed to the Hadoop worker, again a sidecar container, and it launches the HDFS DataNode, which connects to the HDFS control plane. So what you see here is a lot of coordination between these different systems and how they work together. One of the key things to notice is that all of this is based on the goal state, or desired state, and all of these operations happen in an asynchronous manner; there are reconciliation loops which actually drive all of these operations. This is a diagram that I took from one of our previous blog posts, which was published last year.
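Here is a rough, illustrative Python sketch of that asynchronous, goal-state-driven node-add flow. None of these class or field names are the real internal API; it only shows the shape of the interaction: the control plane records intent, and a worker later reconciles against it.

```python
from dataclasses import dataclass, field
from enum import Enum

class NodeGoal(Enum):
    IN_CLUSTER = "in_cluster"
    DECOMMISSIONED = "decommissioned"

@dataclass
class DesiredState:
    # Cluster-level goal state, written by users or automation.
    nodes: dict = field(default_factory=dict)  # host -> (NodeGoal, container_version)

@dataclass
class ActualState:
    running: dict = field(default_factory=dict)  # host -> container_version or None

def cluster_manager_add_node(desired: DesiredState, host: str, version: str) -> None:
    """Step 1: the control plane only records intent; nothing happens synchronously."""
    desired.nodes[host] = (NodeGoal.IN_CLUSTER, version)

def node_worker_reconcile(desired: DesiredState, actual: ActualState, host: str) -> None:
    """Step 2: the sidecar worker on the host later pulls the goal state and acts."""
    goal, version = desired.nodes[host]
    if goal is NodeGoal.IN_CLUSTER and actual.running.get(host) != version:
        # Register with the HDFS control plane, then launch the DataNode container.
        print(f"{host}: launching DataNode container {version}")
        actual.running[host] = version

desired, actual = DesiredState(), ActualState()
cluster_manager_add_node(desired, "host-101", "datanode:v12")
node_worker_reconcile(desired, actual, "host-101")  # runs asynchronously in practice
```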
This is a full overview of how the different components in our architecture support the Hadoop infrastructure in containers today. I'm not going to go into the details of everything here, but I would like to highlight a few key points. There are two aspects. One is that this entire system is based on a desired state, or goal state, and we have desired states at different levels of the system. For example, we have a desired state for an HDFS cluster, and we also have a desired state for each node within the cluster. On top of that, we have actual state being collected from different parts of the system, including each of the nodes, and from that we build the actual state of the entire cluster. We then have control loops, or reconciliation loops, which look at the difference between the desired and actual states and trigger actions to move the system towards the desired state. On the top right you'll see a circle which showcases the reconciliation that happens between the cluster goal state and the cluster actual state, and the bottom-left red circle shows that there's a reconciliation loop within the node itself to ensure that the containers are running or stopped as expected. All of these cluster-level reconciliations are triggered through Cadence, a workflow orchestration system, and that has scaled to support our requirements so far.

In the next few slides I'd like to talk about how we engineered the migration. One of the questions that I got when we were sharing some of our learnings with our peer companies was whether we took downtime for the migration. It is typical for some of the companies that run a similar stack to shut down the stack for a while, upgrade the cluster, and then turn all the pipelines back on. But due to the nature of Uber's business, we can't shut down any of the pipelines; it would affect some of our business use cases, so we never took downtime for the migration. That constraint actually pushed us towards looking at the migration in an engineering way, and in the next few slides I'll talk through some of the strategies that we adopted for engineering the migration.

The first strategy that we adopted was to optimize for ROI. As I mentioned before, this distributed system has both control planes and data planes, and the control planes are few in number, literally like 5 to 20 nodes to deal with, whereas the data plane is at the thousands scale. So we focused on the biggest part of the fleet, the data plane, making the migration easier there, and invested heavily in building automated workflows to do the migration, again based on Cadence. This graph showcases how we ran through the migration. You'll see that the graph goes up into the hundreds, to around 200, which was the peak of how many nodes were decommissioned and piling up in our pool, waiting to go through the containerization process. As the graph goes down, that's when the nodes were cleaned, set up with the Docker daemon and other system-level processes, and added back to the cluster. We ran this automation for several months, and within a year or so we were able to migrate more than 75% of the fleet through these automated workflows.
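The same desired-versus-actual reconciliation pattern is what drives these automated workflows. Below is a minimal, purely illustrative sketch of such a control loop; the action names and host names are made up, and in the real system the resulting actions are executed through Cadence workflows rather than printed.

```python
def reconcile(desired_nodes: set, actual_nodes: set) -> list:
    """Diff the desired state against the collected actual state and emit actions."""
    actions = []
    for host in sorted(desired_nodes - actual_nodes):
        actions.append(("containerize_and_add", host))    # should be in cluster but isn't
    for host in sorted(actual_nodes - desired_nodes):
        actions.append(("decommission_and_remove", host))  # running but shouldn't be
    return actions

# One iteration of the loop; in production this runs repeatedly, so the system
# keeps converging toward the goal state even after failures.
desired = {"host-101", "host-102", "host-103"}
actual = {"host-101", "host-104"}
for action, host in reconcile(desired, actual):
    print(action, host)
```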
The second strategy was making things compatible between the legacy bare-metal, JVM way of running things and the new Docker way of running things. We identified two different areas where compatibility really mattered, based on what was visible to the customers. The first area visible to customers was their jobs: their jobs used to run on bare-metal machines as JVM processes, but in the new system they would run as Docker containers. That is on the data plane. The second part we looked into was how customers interact with the Hadoop stack, and those interactions are typically with the control planes, whether they're submitting a job or listing something on the file system; that traffic goes to the Hadoop control planes.

Going into the details of how we made the legacy data plane and the Docker-based data plane compatible: the first thing we did was look into all the dependent libraries on the legacy bare-metal hosts, and we created a Docker image with all these base dependencies. Then, as customers submitted applications to the Hadoop infrastructure, we injected the jars they were submitting into the base Docker image that we were providing and used that in our architecture, ensuring that customers did not even know that their JVM processes were now running in containers. Because of the changes that we made, it was completely transparent; this was all handled underneath.

The second aspect of the data plane migration was more around how we validated it. There are hundreds of libraries that customers depend on, and it was very hard to figure out the full set required to make the migration work for all 350,000 applications and 20 million-plus containers launched per day. So, to come up with a good migration strategy, we partitioned the cluster into two sub-clusters: the one in red, which is the legacy side, and the one in purple, which is the Docker-based side. We built a transformation function within our control plane to move some of the workloads from the legacy bare-metal side to the Docker-based side. Shadowing some of these jobs onto the Docker-based fleet gave us a good amount of insight; it surfaced some issues early on, which we fixed forward in an agile manner. However, hand-holding all 350,000 jobs through such a move would have been an impossible task for us. So, as we built confidence from the 100 to 200 jobs that we had shadowed, we changed our approach: we knocked down the partition, in the sense that we randomly distributed some of the Docker-based hosts across the entire cluster and let applications run across both. It was an epic moment for the team when, for an application submitted by a customer, 20 containers of that application were running in Docker while the other 80 were running on bare metal, and they could communicate with each other and the application ran perfectly fine. We continued along by adding more Docker-based hosts into the system and eventually depleted the old legacy bare-metal hosts from the cluster.
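As a purely illustrative sketch, the shadowing idea can be thought of as a routing decision per container allocation. The fraction, host names and function below are hypothetical, not the actual scheduler logic.

```python
import random

DOCKER_SHADOW_FRACTION = 0.2  # start small, raise as confidence grows

def pick_host(bare_metal_hosts: list, docker_hosts: list) -> str:
    """Route a single container of an application to one of the two pools.

    Early on, only a curated set of jobs was shadowed onto the Docker pool;
    later the partition was removed, so containers of the same application
    could land in both pools and still talk to each other.
    """
    if docker_hosts and random.random() < DOCKER_SHADOW_FRACTION:
        return random.choice(docker_hosts)
    return random.choice(bare_metal_hosts)

# One application's ten containers end up as a mix of bare-metal and Docker hosts.
allocation = [pick_host(["bm-1", "bm-2", "bm-3"], ["dkr-1"]) for _ in range(10)]
print(allocation)
```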
The second area was making compatibility possible at the control plane level. Customers interact with the HDFS NameNodes and the YARN ResourceManagers, and the way they talk to these control planes is through a typical Hadoop Java binary, which is what we internally call a client. It consists of configs which point to the NameNodes and the ResourceManagers, and those basically mapped to IPs in our system. Due to the legacy nature of how the stack was built, going back to the pre-IPO days, we had a lot of systems with thousands of client usages where these IPs were hardcoded across the entire company. So one of the first things that we did here was to standardize the client and the configs: we moved to a DNS-name-based approach where we have one single name which points to the HDFS NameNodes and another single name which points to the YARN ResourceManagers. This change was orchestrated across the entire company through a fairly decentralized effort that we called Hadoop client standardization. A good amount of the work was helped by our monorepos; we have a single repository that is shared across the entire company, so we were able to upgrade the libraries through a single diff in the monorepo. However, we did require many of the teams to validate and redeploy some of their applications so that things wouldn't break. Once the customers had adopted these new clients, it became easier for us to move the control planes around, because they're all fronted by the same names: we were able to shut down one control plane node at a time, move it into Docker, and validate and test behind the scenes. So that was our strategy for moving the control planes of the Hadoop infrastructure into containers.
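To illustrate what that client standardization looks like in practice, here is a hedged sketch. The property keys are standard Hadoop settings, but the IPs, ports and DNS names shown are made up, and the helper function is only a toy.

```python
# Before: thousands of client configs hard-coding control-plane IPs.
LEGACY_CLIENT_CONF = {
    "fs.defaultFS": "hdfs://10.12.34.56:8020",            # hard-coded NameNode IP
    "yarn.resourcemanager.address": "10.12.34.57:8032",   # hard-coded ResourceManager IP
}

# After: one stable DNS name per control plane, so the nodes behind it can move freely.
STANDARDIZED_CLIENT_CONF = {
    "fs.defaultFS": "hdfs://clusterA-ns.example.internal",
    "yarn.resourcemanager.address": "clusterA-rm.example.internal:8032",
}

def migrate_client_conf(conf: dict) -> dict:
    """Toy helper: rewrite a legacy conf to the standardized DNS-based one."""
    return {**conf, **STANDARDIZED_CLIENT_CONF}

print(migrate_client_conf(LEGACY_CLIENT_CONF))
```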
As part of this big migration and containerization journey, I think one of the biggest things we learned about was the cultural shift that happened within the team. As of today, we have 99.4% of the Hadoop infrastructure running in Docker. The transition from a separate SRE team and dev team to a single DevOps team, and moving to the DevOps model, brought us a lot of benefits in terms of consolidating the work, making the team more efficient, and reducing the dev and release cycles. The shift was pretty much an eye-opening experience for us. We have also reduced the operational effort across the team for managing the clusters, and we are on a strong trajectory to reduce the operational effort further for some of the bigger fleet-wide operations. One of the key turning points for this cultural shift was back in 2020, when the pandemic started and there was a hardware shortage. To mitigate the impact on our side, we had to bin-pack YARN in such a way that we could reduce our cluster sizes and sustain growth for the long term. Because of the container orchestration system and the container-based way of operating clusters, we were able to do this bin-packing within six months or so through automated workflows, by just defining one policy and letting everything work together. This built trust among the people who had been skeptical about Docker containers, and they finally got into the mindset that this is the way we should operate.

Some of the key takeaways for us: containers made it really easy for us to adopt the DevOps model of working and basically improved our team productivity. And the key takeaway for myself, at least, is that having a well-engineered strategy is very important for making large-scale migrations like these easier.

We're almost at the end of this presentation, but I'd like to provide a glimpse of some of the work that we're doing for the future. We have gotten very far with containerizing and orchestrating the Hadoop stack. We are now optimizing it, basically tuning some of the concurrency limits and doing drills to ensure we can do larger fleet-wide operations, so that we can turn over thousands of hosts at the click of a button without any manual involvement, and we are still polishing some of those edge cases and so on. We have a big goal of reaching zero manual effort for zone turn-ups, decommissions and fleet-wide fixes, something that we aspire to achieve in the next 6 to 12 months. And because we have adopted containers as a foundation for our architecture, we have unlocked a lot of other opportunities, including leveraging the cloud. We are currently looking at modernizing our data infrastructure stack, and cloud plays a major part of that story.

That is the end of the presentation. Happy to take any questions, and feel free to reach out to me offline if you have more questions. Thank you all.