Good afternoon, everybody. Today, Alex and I are going to present distributed deep learning on Mesos with GPUs and gang scheduling. This is a project we delivered at Uber as a collaboration between the compute team and the Michelangelo machine learning team. My name is Min Cai, this is Alex, and several other teammates also contributed to this project. So first, let's talk about why... Alex, do you want to introduce yourself first? I'm Alex, I'm from the deep learning team, and Min is from the compute platform team.

So first, let's talk about why we need deep learning at Uber. As you probably know, Uber has grown very fast over the last few years from what was initially a ride-sharing company. Now we also have our self-driving business unit, and self-driving cars are a huge source of workload, especially for deep learning and machine learning. The next use case is that, as a ride-sharing company, we would like to predict how many trips we are going to see in the coming months or year. And the third one is fraud: because Uber is completely based on mobile payments, some people will try to take advantage of that and make money off of Uber's success.

For self-driving vehicles, right now we have both self-driving cars and self-driving trucks. The self-driving problem is very interesting, as you probably already know: it is largely computer vision, image recognition, and perception, plus building precise 3D maps. All of that needs a lot of deep learning computation, and you have to train those deep learning models quickly enough to get fast turnarounds.

Then there is trip forecasting. We have an engineering blog post about how we forecast trips using deep learning networks based on lots of time-series data as well as other signals, such as weather and events, for example the Super Bowl. As you can see here, deep learning is very powerful and the prediction is quite accurate. That gives us a nice lead time to notify drivers: hey, there is a lot of trip demand coming, you may want to turn on your app and start driving for Uber.

The third use case is fraud detection. A typical example is how people cash out money from Uber: especially in the early days, Uber had lots of incentives and referral codes, and some fraudsters would just spread referral codes to their friends, partner up with a friend acting as the driver, and cash out the Uber credits. This was a huge problem when we had the China business, because people did a lot of that. Deep learning is a very good approach here, using different signals to figure out who might be a potential fraudster.

So now let's talk about why distributed deep learning. As you probably know, the recent success of deep learning comes from two things. One is big data: we have lots and lots of data, hopefully labeled data. The other is huge clusters plus the advances in GPUs, so you have lots of computational power to go with all that data. Those are the main reasons deep learning has become such a success in recent years.
Distributed deep learning is critical for three reasons. One, you want to speed up model training, because some large deep learning models would take weeks or months to train on a huge amount of data; with distributed deep learning you can speed that up significantly. Two, you can scale out to hundreds of GPUs. Three, some large deep learning models cannot fit on a single machine, so with distributed deep learning you can partition those models and then do training as well as prediction that way.

This diagram shows, at a very high level, how distributed deep learning works in general. The idea is that you have lots of data in your data store, which could be HDFS or some other file system, and you have multiple training processes, normally running in containers, which ingest data from that store. Data scientists or deep learning experts come up with models with different layers and initial parameters, and those models are loaded into each training process. Each training process then computes the model updates, which are the gradients. That step is parallel and independent across processes. However, after computing the updates, all of the training processes need to talk to each other to compute the averaged gradients. That is a very unique property of this kind of workload: it is not like Spark or MapReduce or stream processing, where the different processes don't necessarily need to communicate with each other. This is a very tightly coupled distributed application, and the closest analogue is a traditional MPI application in the supercomputing world. That creates a lot of unique challenges for making this work efficiently on a cluster management system like Mesos.

So first, why did we choose Mesos? Of course, this is MesosCon, so that is why we are here. But beyond that, the first reason is that it is widely adopted: big companies like Apple, Netflix, Uber, and Twitter have been running Mesos in production for quite a long time. Second, Mesos has very good GPU support, especially with the NVIDIA GPU isolator, which is perfect for this use case. Third, the newly introduced nested containers enable us to do a lot of things, for example separating our management code from the user's own Docker image; that gives us very good isolation between the machine learning platform and the user's image. And of course, Mesos is highly customizable, reliable, and scalable. Those are the key reasons we use Mesos.

I'm not sure how many of you already run GPUs on Mesos. Anybody? Raise your hand. Cool, that's actually pretty good. And how many of you are using the Mesos containerizer, the unified containerizer, instead of the Docker containerizer? Good. As you probably know, Mesos has nice APIs for containerizers as well as for isolators. We use the unified Mesos containerizer because that is the only thing with GPU support that works upstream right now; with the Docker containerizer you have to take a patch and deal with it yourself. The key part here is that NVIDIA actually contributed their GPU allocator as well as the isolator.
So from the Mesos framework's perspective, we can manage GPU resources almost the same way as other resources like CPU or memory. Of course, you have to prepare your Mesos agents with the correct NVIDIA driver, and your Docker images have to have the correct CUDA versions and so on; GPUs are a little tricky in that respect.

The second thing I'd like to discuss is the Mesos nested container. Just to get an idea, how many of you are using nested containers? OK, that's far fewer people, but it is a great feature. The idea is this: say we build a general machine learning platform, which has some control code running inside a container, but that control code needs to be decoupled from the user's Docker image. In the machine learning and deep learning space, different teams and different data scientists want different tooling in their Docker images, for example different Python libraries, and you run into conflicts if you put everything into one image. With nested containers, we have a parent task that runs the management code, and the management code then goes back to the Mesos agent to pull and launch the nested container, which uses a different image. For us, for example, the TensorFlow model training runs in the nested container.
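To make that flow a bit more concrete, here is a rough sketch of how a parent task's management code might ask the local Mesos agent to launch a nested container through the v1 agent API. This is not our actual management code; the container IDs, image name, and command are made up for illustration, and you should double-check the exact message fields against the Mesos agent API documentation for your version.

```python
import json
import uuid

import requests  # assumed to be available in the management container

# Hypothetical values, for illustration only.
AGENT_API = "http://localhost:5051/api/v1"  # local Mesos agent endpoint
PARENT_CONTAINER_ID = "11111111-aaaa-bbbb-cccc-222222222222"  # our parent task's container

# LAUNCH_NESTED_CONTAINER call: the nested container's ID carries a reference
# to the parent container, and the ContainerInfo points at the user's own
# Docker image rather than the platform image.
call = {
    "type": "LAUNCH_NESTED_CONTAINER",
    "launch_nested_container": {
        "container_id": {
            "parent": {"value": PARENT_CONTAINER_ID},
            "value": str(uuid.uuid4()),
        },
        "command": {
            "shell": True,
            "value": "python /opt/train/train.py",  # user's training entry point
        },
        "container": {
            "type": "MESOS",
            "mesos": {
                "image": {
                    "type": "DOCKER",
                    "docker": {"name": "example/user-tensorflow:latest"},
                }
            },
        },
    },
}

resp = requests.post(AGENT_API, data=json.dumps(call),
                     headers={"Content-Type": "application/json"})
resp.raise_for_status()  # the agent acknowledges once the nested container is launched
```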
So yes, Mesos is cool, but the question is: what is missing? There are a few key items we needed to build ourselves on top of Mesos to make this production ready. The first and most important one is what we call elastic GPU resource management. GPUs are very expensive: as you probably know, an enterprise-grade GPU is about five grand a piece, and if you put four of them in one machine, that's twenty grand right there, plus the power supplies, the rack, and everything else. So you really don't want GPUs sitting there idle. At Uber we have different organizations that all want to use GPUs, but you cannot really give them static quotas, because if an organization isn't using its quota, those GPUs sit idle. The second part is that, as we talked about before, because of the gradient-averaging computation, the workers need a lot of network bandwidth to communicate with each other and compute the final model. That means we need locality-aware and network-aware placement: you would like to place all the containers of the same training job as close to each other as possible. The next one is gang scheduling. Typically, a distributed deep learning training job needs all of its workers to be up before it can make progress. If you only allocate resources for half of the workers, those workers just sit idle waiting for the rest, which can easily lead to a deadlock. For example, if you have ten GPUs and each job needs eight, then without gang scheduling of those eight-GPU worker groups, two jobs can each grab five GPUs and wait there forever. Also, there is task discovery: because this is a distributed training application, the tasks need to know which machines the other tasks are running on and which dynamic ports they are listening on. We have to solve that discovery problem, but it is a little different from the service discovery we talked about yesterday, because this discovery is scoped to a single job. Finally, of course, we have to handle failures.

To address these issues, we implemented a system called Peloton. A peloton is a group of cyclists, and the idea behind the name is that if you cluster together, you move faster together. Peloton tries to bridge the gap between a more general-purpose scheduler and Mesos. Mesos provides great task execution and resource allocation, but we were still missing preemption, placement, and job and task lifecycle management. Peloton is a scheduler that provides a common implementation for all of those things, and of course it supports GPUs as well as mixed workloads.

This is the high-level architecture of Peloton. It sits on top of the Mesos master and talks to it using the latest Mesos HTTP API. Peloton exposes its own API defined in Protocol Buffers. Right now we have our own home-grown RPC implementation, but it is fully compatible with any gRPC client, which gives us a lot of flexibility: our UI is in Node.js, and we have Python and Java clients that all talk to the backend using the same protocol. Peloton itself breaks down into four big components. One is the host manager, which aggregates all the resources from Mesos and presents them to the resource manager. The resource manager does the elastic resource sharing. Then we implement different flavors of placement engines; one of them is for deep learning, which supports locality awareness, though right now it is mostly constraint based. And the job manager manages the lifecycle of jobs and tasks. All of the metadata, such as job and task configurations as well as their runtime state, is saved in Cassandra, so you can do a lot of searching and indexing on it.

This slide shows our solution for elastic GPU resource management. The idea is very similar to what Mesos is also working on right now, but we could not wait, so we built it into our scheduler to begin with; once it is available in Mesos, we can offload it to Mesos. The high-level idea is that we have a notion of reservation: every organization can say how many GPUs it wants to reserve, but if it isn't using them, those resources can be used by other organizations on a best-effort, fair-share basis. We also have job priorities, which are local to each organization. And we have a concept called a resource pool, which is very similar to the role concept in Mesos, except that resource pools are hierarchical. For each pool you can define its reserved resources and its fair share of the idle resources, and then we dynamically compute what the pool is currently entitled to.
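As a rough illustration of that entitlement idea, here is a small sketch of how a scheduler might compute what each pool is currently entitled to from its reservation plus a fair share of whatever is left idle. This is not Peloton's actual algorithm, just a simplified, flat version of the concept; the pool names and numbers are made up.

```python
def compute_entitlements(total_gpus, pools):
    """pools: dict of pool name -> {"reservation": int, "demand": int, "share": float}.

    Each pool is first entitled to min(reservation, demand); unused reserved
    GPUs and any unreserved GPUs are then handed out to pools that still have
    demand, in proportion to their fair-share weight.
    """
    entitlements = {name: min(p["reservation"], p["demand"]) for name, p in pools.items()}
    leftover = total_gpus - sum(entitlements.values())

    # Hand leftover GPUs to pools with unmet demand, weighted by fair share.
    while leftover > 0:
        hungry = {n: p for n, p in pools.items() if p["demand"] > entitlements[n]}
        if not hungry:
            break
        total_share = sum(p["share"] for p in hungry.values())
        granted = 0
        for name, p in hungry.items():
            want = p["demand"] - entitlements[name]
            give = min(want, int(leftover * p["share"] / total_share) or 1)
            give = min(give, leftover - granted)
            entitlements[name] += give
            granted += give
            if granted >= leftover:
                break
        if granted == 0:
            break
        leftover -= granted
    return entitlements

# Example: 16 GPUs total; "maps" reserved 8 but only needs 2, so "self-driving"
# can elastically borrow the idle reserved GPUs beyond its own reservation.
pools = {
    "self-driving": {"reservation": 4, "demand": 12, "share": 0.5},
    "maps":         {"reservation": 8, "demand": 2,  "share": 0.3},
    "forecasting":  {"reservation": 4, "demand": 6,  "share": 0.2},
}
print(compute_entitlements(16, pools))  # {'self-driving': 9, 'maps': 2, 'forecasting': 5}
```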
Then for gang scheduling, we also solve this in the Peloton scheduler layer, by defining a subset of tasks in a job, the gang, as a single scheduling unit. That means the gang is admitted, placed, preempted, and killed as a group. Each gang task, though, is still an independent execution unit, because the tasks run in separate containers and may fail independently. That requires us to do a lot of lifecycle management around gangs, because if some task cannot be restarted, we have to kill the whole gang.

The next interesting problem is that we have to implement some special placement strategies for this particular use case. First, we would like to place as many containers of the same job as possible on the same host or rack, so we can minimize network traffic across the top-of-rack switch. Second, we would like to use a best-fit algorithm instead of a spreading algorithm, so GPU containers are tightly packed. Finally, we have to support constraint-based placement, so that all the containers of the same job land on the same generation of GPU. That is very important, because unlike CPUs, different generations of GPUs have huge performance differences, and we really don't want to mix GPU models within the same job.
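Here is a small sketch of what a best-fit, constraint-aware placement pass along those lines could look like. It is not Peloton's placement engine, just an illustration of the three ideas above: filter hosts by GPU generation, prefer racks the job is already on, and pack tightly by picking the host with the fewest free GPUs that still fits. The host and rack names are made up.

```python
from dataclasses import dataclass

@dataclass
class Host:
    name: str
    rack: str
    gpu_generation: str  # illustrative label, e.g. "pascal"
    free_gpus: int

def place_gang(num_tasks, gpus_per_task, gpu_generation, hosts):
    """Best-fit placement for a gang of identical GPU tasks.

    Returns a list of host names (one per task), or None if the whole gang
    cannot be placed -- in which case the gang should not start at all.
    """
    # Constraint: only consider hosts with the requested GPU generation.
    candidates = [h for h in hosts if h.gpu_generation == gpu_generation]
    placements, used_racks = [], set()

    for _ in range(num_tasks):
        fitting = [h for h in candidates if h.free_gpus >= gpus_per_task]
        if not fitting:
            return None  # gang scheduling: all or nothing
        # Prefer racks this job already uses (locality), then the host with the
        # fewest free GPUs that still fits (best fit / tight packing).
        fitting.sort(key=lambda h: (h.rack not in used_racks, h.free_gpus))
        chosen = fitting[0]
        chosen.free_gpus -= gpus_per_task
        used_racks.add(chosen.rack)
        placements.append(chosen.name)
    return placements

# Example: an 8-GPU job (4 tasks x 2 GPUs) on a small pool of hosts.
hosts = [
    Host("host-a", "rack-1", "pascal", 4),
    Host("host-b", "rack-1", "pascal", 2),
    Host("host-c", "rack-2", "pascal", 8),
    Host("host-d", "rack-3", "maxwell", 8),  # filtered out by the generation constraint
]
print(place_gang(num_tasks=4, gpus_per_task=2, gpu_generation="pascal", hosts=hosts))
```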
Now I'm going to hand it over to Alex to talk about how TensorFlow runs on top of this and what the performance numbers look like.

Thanks, Min. All right, so as we grew as a company and did more and more machine learning, we realized we needed to invest in our deep learning infrastructure, and the framework we chose for the majority of our models is TensorFlow. Why did we do that? TensorFlow is the most popular open-source framework for deep learning at this moment. When I was preparing these slides, I looked at the GitHub page and it was showing 69,000 stars and 64,000 forks, which is a huge number compared to other frameworks. That popularity lets us ramp people up much more easily, because there are many more examples of how to use TensorFlow, how to deploy it, and so on. TensorFlow also combines high performance with the ability to tinker with low-level model details. You can take the high-level abstractions and they work very well, and at the same time, if you want to write your own implementation of some new operation you came up with, you can do it in CUDA code and integrate it with TensorFlow without any issues; they have a nice plug-in model for that. And TensorFlow has end-to-end support from research to production: you can do research easily, and you can also export your model and deploy it, all within the same framework.

Min talked about how we do distributed deep learning. In general it looks like MapReduce, except the whole MapReduce cycle happens every second, and your model weights are one or multiple gigabytes. So every second, every worker exchanges multiple gigabytes of data, and they all need to arrive at an averaged result. How does distributed TensorFlow handle that? There are typically two patterns. Either you have one parameter server, which is a special kind of worker, and all the workers communicate with that parameter server to upload their weights. The other option is, instead of one parameter server, you have multiple, or in the extreme case the same number of parameter servers as workers. That lets you offload the load from the single parameter server so you no longer have a bottleneck, but at the same time it gives you an all-to-all communication pattern where every host has to talk to every other host, which is also not ideal. But it works.

So when we deployed traditional distributed TensorFlow on Mesos, that is how we did it. At Uber we have the Michelangelo machine learning platform, which talks to Peloton, and Peloton schedules our distributed TensorFlow jobs onto Mesos agents. We get containers provisioned with our management code, and then we start nested containers with the user code, which may use a variety of TensorFlow versions and native libraries that could otherwise conflict with our management container. Typically, in the first pattern, you would have one parameter server, and on every host one worker container, and each container would use as many GPUs as possible within the same container instance.

We deployed that and ran the standard TensorFlow benchmark, which is very popular now, and these are the numbers we got. They are actually pretty good: we can go through the ImageNet dataset, which is a little over one million images, in three to four minutes, because the processing rate is about 5,000 images per second. But is that the best we can do? If we look at the scaling factor, which is the number of images per second you get on many GPUs compared to one GPU, this is what we get. The numbers differ between networks, but the major theme is that we lose about 50% of our compute resources. On ResNet, one of the popular networks, we get basically 31x on 64 GPUs, and on Inception we get 35x. These are two very popular models. How can we do better? What is wrong with this model? We can do two things. First, we can change the communication algorithm: instead of all-to-all or all-to-one, maybe we can do something else. Second, we can try a different kind of networking, because we exchange gigabytes of data every second; maybe we can offload that work from the CPUs to the NICs themselves.

So we invested in this and came up with a framework called Horovod. What does Horovod do? Horovod is a distributed training framework for TensorFlow. It uses a different, bandwidth-optimal algorithm to exchange gradients, which I will dive into. It uses RoCE or InfiniBand if available, and otherwise it also works on plain TCP. And it installs on top of TensorFlow as a regular Python package, so you don't have to rebuild all of TensorFlow, which takes about half an hour and is not a pleasant experience; you can just pip install it and it installs in a few seconds.

So how does it work? It uses a very different method than your regular parameter server. It uses the so-called ring all-reduce, which comes from the HPC world, the high-performance computing world, where they do MPI, weather prediction, these kinds of things.
There is a paper showing that this is a bandwidth-optimal algorithm, and the high-level idea is this. Say you have three workers, each holding some array of numbers. Each worker splits its array into as many chunks as there are workers, in this example three. Then they send chunks to each other and add the numbers together: the first worker sends its first chunk to the second worker, the second worker sends its second chunk to the third worker, and so on. They repeat this operation as many times as there are workers minus one, and during that time each worker only communicates with its two adjacent workers, which is good because it saves bandwidth; you don't have the all-to-all communication pattern, so your network switches are pretty happy. After those N minus one iterations, you have the correct answer for every chunk, but it is scattered across the workers. So you just send the data around in the same circular pattern, and after another N minus one iterations, everybody has the complete answer. So this algorithm is, as I said, bandwidth optimal, and each worker only talks to its two neighbors.
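Here is a tiny, single-process sketch of that ring all-reduce idea: a scatter-reduce pass followed by an all-gather pass, where each simulated worker only ever "sends" a chunk to its right-hand neighbor. It is meant only to illustrate the data movement just described, not how Horovod or NCCL actually implement it.

```python
def ring_allreduce(worker_arrays):
    """Simulate ring all-reduce: each of the N 'workers' holds an equal-length
    list of numbers; afterwards every worker holds the element-wise sum
    (divide by N at the end if you want the average, as in gradient averaging).
    """
    n = len(worker_arrays)
    size = len(worker_arrays[0])
    chunk = size // n  # for simplicity, assume the length divides evenly by N

    # Split each worker's array into N chunks.
    chunks = [[list(arr[i * chunk:(i + 1) * chunk]) for i in range(n)]
              for arr in worker_arrays]

    # Pass 1 (scatter-reduce): in step s, worker w sends chunk (w - s) to its
    # right-hand neighbor, which adds it in. After N-1 steps, each chunk is
    # fully summed on exactly one worker, scattered around the ring.
    for s in range(n - 1):
        for w in range(n):
            dst, c = (w + 1) % n, (w - s) % n
            chunks[dst][c] = [a + b for a, b in zip(chunks[dst][c], chunks[w][c])]

    # Pass 2 (all-gather): circulate the finished chunks the same way, except
    # the receiver just takes a copy instead of adding.
    for s in range(n - 1):
        for w in range(n):
            dst, c = (w + 1) % n, (w + 1 - s) % n
            chunks[dst][c] = list(chunks[w][c])

    return [[x for part in worker_chunks for x in part] for worker_chunks in chunks]

# Three workers, each with its own "gradients"; every worker ends up with the sum.
print(ring_allreduce([[1, 1, 1], [2, 2, 2], [4, 4, 4]]))  # [[7, 7, 7], [7, 7, 7], [7, 7, 7]]
```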
So how do we run this on Mesos? It is very different, right? We leverage MPI to provide the infrastructure that kicks off all the processes and makes them aware of each other, and we use the NVIDIA library called NCCL to actually do the all-reduce on the GPUs. The way we deploy this is that we provision one container per server, regardless of how many GPUs it has, and that container launches N copies of the training script, where N is the number of GPUs on the host. In this example there are two GPUs per server, so it launches two processes. What that gives you is that you write your code as single-GPU code, and MPI launches a number of copies per container. We use MPI with SSH as the launch mechanism: when MPI launches the script, it uses SSH to start the program on every host.

How does this compare performance-wise? These are numbers for both distributed TensorFlow and Horovod, and the Horovod numbers are much better. With Horovod we get almost 8,000 images per second, which means that for the same ImageNet dataset we can do one epoch in about two minutes. The scaling factors are also very good: we get about 33x on Inception v3 and about 37x on ResNet-101. These numbers were obtained on a plain TCP network; we strongly believe that on a RoCE-capable network we would get even better numbers, but at the time of this test we did not have the right setup.

So that's all cool, right? Sounds good. But what about developer usability? Are our developers going to be happy to use this? Not everybody cares about squeezing out every last bit of performance; they just want to train their models faster. What we found is that this approach of developing a single-GPU TensorFlow script and then having it distributed is actually much easier than standard distributed TensorFlow. On this slide there is an example of a distributed TensorFlow script from the Google website, and it doesn't fit; you cannot read anything, because it is a pretty long example. The problem is that there are a lot of concepts you have to learn with distributed TensorFlow. You have to create so-called towers and average gradients among them within the worker, then you have to create parameter servers and workers, and all of this code bleeds into your model; you have to insert different pieces of code in different places.

With Horovod, the user experience is arguably much better. This is the equivalent program in Horovod, and the highlighted parts are the things you need to add to make your single-GPU program distributed. There are basically four components. First, you initialize it. Second, you tell it which GPU to use, which is the same as the process's local rank within your container. Third, you wrap the optimizer you use with the distributed optimizer, which takes care of collecting the gradients and averaging them. And last, when everybody does their variable initialization, you make sure they all start from the same values.
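For those who haven't seen it, the four additions look roughly like the following in a TensorFlow 1.x-style training script. This is a simplified sketch based on Horovod's published examples, not the exact code from the slide; the model-building helper is a placeholder standing in for your own model code.

```python
import tensorflow as tf
import horovod.tensorflow as hvd

# 1. Initialize Horovod (sets up the MPI communication).
hvd.init()

# 2. Pin this process to one GPU, based on its local rank within the host.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Placeholder: build your usual single-GPU loss here.
loss = build_model_loss()  # hypothetical helper, stands in for your model code

# 3. Wrap the optimizer so gradients are averaged across workers via ring all-reduce.
opt = tf.train.AdagradOptimizer(0.01 * hvd.size())  # scale the learning rate with worker count
opt = hvd.DistributedOptimizer(opt)
train_op = opt.minimize(loss)

# 4. Broadcast initial variable states from rank 0 so all workers start identically.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]

with tf.train.MonitoredTrainingSession(hooks=hooks, config=config) as sess:
    while not sess.should_stop():
        sess.run(train_op)
```

Launching it then just amounts to starting one copy of this script per GPU under MPI, as described above.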
We did a lot of internal deployment with this model, and we found that users are much, much happier to embrace it rather than convert their programs to standard distributed TensorFlow. And the best part is that Horovod is available today on GitHub. If you or your teams or your companies are doing deep learning, you may find it useful. You are very welcome to use it and leave us feedback; if you find any usability issues, installation problems, or anything else, give us a shout. If you do performance benchmarks, we are super interested in what numbers you get, and we are willing to help you get the best numbers. So thank you. Do you have any questions?

So you are asking whether we have applied this to real training and whether we have results on full datasets. Yes, we did apply this to actual deep learning problems that we have. I can give you one example: we had models that used to take two weeks to train, and we got them to train in seven hours. Facebook also recently published a paper on how to train ResNet on ImageNet in one hour on 256 GPUs, and they are using exactly the same approach, but with Caffe2. So if you use Caffe2, you can use Facebook's approach; if you use TensorFlow, you can use this.

This is actually a very interesting question. When you scale distributed training, you effectively increase how many examples you process per batch. With single-GPU training, say you look at 64 examples, compute a model update based on those 64 examples, and adjust your model accordingly. If you go to a super large batch, instead of many small updates you make bigger updates, and the tricky part is that the model's loss surface is not convex, so you may fall into a trap; if you process a huge set per step, it is easier to fall into a local trap. Facebook introduced an approach of changing the learning rate as you scale, and they demonstrated scaling to 256 GPUs. What they recommend is that you do need to increase your learning rate, but you shouldn't start right away with your original learning rate multiplied by the number of GPUs, because that introduces too much instability. Instead, they ramp the learning rate up during the first N epochs, where N is a hyperparameter you need to tweak, and after that you do your normal learning rate decay. This is still an open area of research, though. They found that it does not scale beyond about 8,000 examples per batch, so there is definitely a limit to how much higher you can go, and it is a very active area of research: how do you go higher? If you have a thousand GPUs, how do you train? I think the next paper about that should be coming from somebody.

So, MPI does not handle failure recovery very well; it basically crashes everything. What we do is make sure users write checkpoints on a fairly regular basis, and then we just recover from the last checkpoint. In practice we found that we seldom see errors in flight, assuming all the hardware is okay. Also, in the Peloton layer we are building automatic retries: since this is a batch job, you can set the maximum number of retries you want, and we will do that automatically. Any more questions? Last chance. OK, thank you.