Hello, hello, hello. How's everybody doing? Good, good. Friday afternoon, Friday afternoon. I just spoke next door, I had to run over, so I'm sorry for being a minute late. So let's get started. Come take a seat. All right. I'm talking about a topic here with Joy today that I'm really excited about. I know Joy is as well. So, democratizing machine learning on Kubernetes. I don't know what you expect when you see that title, but we're gonna talk about a journey that we've both had, speaking from the heart, and some different things that we've seen, and then give you some best practices for what we've learned and how we propose we might move forward together as a community. So with that, I will hand it over to my lovely colleague, Joy, to introduce herself.

Hi, I'm Joy. I'm a solution architect from the Microsoft AI and Research org. I work with a team of data scientists and engineers on a daily basis. We're all basically building models, training models every single day, and obviously doing it in a very efficient way is super important. That's where Kubernetes comes in, and Lachie has been tremendously helpful.

Thank you, Joy. Joy is absolutely a wonderful mind and it's been a pleasure working with her for the last few months. I'm Lachie, I'm an SRE over at Microsoft on Azure, and what I spend my day doing is helping folks with Kubernetes. So Joy came to me several months ago and said, hey, can you help me? I hear Kubernetes is a great place for me to run machine learning workloads. She'd already taken a great look around and had a lot of things running, but we took a look together, and what we found is what we're gonna present, and we'll see what you all think. But I represent the SRE. I build and maintain the infrastructure. I know how to build Kubernetes, I know how it should look, but I have no machine learning experience, right? What I would like to propose here is that this is quite a common occurrence in this space. You have highly talented people at the infrastructure level who know how to build Kubernetes with their eyes shut and run it, and then you have a team of more than capable, wonderful data scientists chomping at the bit to use this infrastructure that you've built. The problem is that in the middle of that Venn diagram lives no one, right? Nobody lives there. So what we're looking to do with this talk is start a conversation around how we get people into that middle point, so that the data scientists can get the best out of Kubernetes as a platform to run their machine learning and neural network workloads, and I can have confidence that I'm actually building the most efficient platform for them to utilize. And this is where the revelation is; I won't labor the point here. What I'm seeing in the environment, and working with Joy, is that the right tools are out there to run machine learning. The tooling is already there. And Kubernetes is the right platform, but we have two distinct sets of people who hold two different knowledge pillars, with nobody in the middle, right? So how do we actually start this conversation between the infrastructure engineers who can lay down Kubernetes and the data scientists who wanna put Kubernetes to the test? Which is what I was really excited about. So, what we found taking a look at the documentation: it's fairly spotty out there when you take a look in the wild. Is anybody running machine learning workloads on Kubernetes? Neural networks too, or traditional? Okay, okay, fantastic. So you can probably sympathize with some of this.
So the documentation is lacking in some places. And when you do get a chance to take a look, everything's moving at the speed of light, and typically it's out of date and/or does not work, because it's several weeks old, right? Kubernetes is moving at a different pace from the machine learning libraries, and they have to meet in the middle somewhere. So that was the challenge: me coming in with the infrastructure eye, Joy coming in with the ML eye, looking at the same set of documentation, and figuring out how we make this a successful outcome. So thinking about that, digging down into what we do here, how do we lower the barrier to entry? When we say democratizing, I believe, working with Joy, that there's a whole world of folks chomping at the bit to get at this. They have neural networks to build, and there are great minds out there to build them. But the barrier to entry is still very high for them to be able to get this knowledge out of their head and actually start putting it to the test. So how do we get some best practices and create some content out there so people don't stumble around as much as I did? I don't know if anybody else did, but, to get there. So let's get started. We're gonna take you on a tour of deep learning, because I think one of the things that trips me up, at least, is naming and nomenclature. A lot of these things are very deeply related, but they have different names in the ML world and different names in the infrastructure world. So Joy's gonna give us a great tour of how distributed deep learning works, and I'm gonna speak for the infrastructure engineers. So I'll hand it back to Joy.

Okay, thank you, Lachie. So before I dive into the nitty gritty details about doing deep learning on Kubernetes, especially distributed deep learning, which is actually much harder than single-node deep learning training, I would like to provide some quick context on the typical distributed training architecture across multiple nodes. In the deep learning space specifically, there are mainly two ways of doing distributed training if you wanna scale out to multiple nodes. The most popular way is data parallelism. Basically, if you have multiple GPUs sitting across multiple nodes, the way you do distributed training for deep learning is you place a copy of your deep neural net model in each and every GPU that you have, and then you feed a mini-batch of the entire training dataset, one batch at a time, into each and every GPU. Each GPU just works on the mini-batch of data it's being fed and trains the model one iteration at a time. At the end of each iteration, it sends up the delta, which we often call the gradient, which is basically the delta of the parameters of the model that need to be updated at the end of that iteration, and it sends that delta, that gradient, to a centralized parameter server. The parameter server talks to all the GPUs, aggregates all of the gradients that are sent across all those GPUs, does a central aggregation, updates the model globally, and then pushes the updated version of the model back down to each GPU again, and then the whole iteration starts over. So this is the way a typical data-parallel distributed training architecture works.
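To make that data-parallel loop concrete, here is a minimal, framework-free sketch of the flow Joy describes. The `ParameterServer` class, the `compute_gradients` stand-in, and the array sizes are all hypothetical illustrations of mine, not code from the talk or from any real library.

```python
import numpy as np

def compute_gradients(params, minibatch):
    # Stand-in for a real forward/backward pass on one GPU (hypothetical helper):
    # it just returns a fake gradient with the same shape as the parameters.
    data, labels = minibatch
    return np.random.randn(*params.shape) * 0.01

class ParameterServer:
    """Holds the single global copy of the model parameters."""
    def __init__(self, params):
        self.params = params

    def pull(self):
        # Each worker downloads the latest global model before its next mini-batch.
        return self.params.copy()

    def push(self, gradient, lr=0.1):
        # Each worker sends only its delta (the gradient); the server applies it
        # to the global model (async style: no waiting for the other workers).
        self.params -= lr * gradient

ps = ParameterServer(params=np.zeros(1000))

# One model replica per GPU; each replica sees a different mini-batch.
minibatches = [(np.random.randn(32, 10), np.random.randint(0, 2, size=32))
               for _ in range(4)]

for step in range(3):
    for worker_id, minibatch in enumerate(minibatches):
        local_params = ps.pull()                           # replica of the model
        grad = compute_gradients(local_params, minibatch)  # local training iteration
        ps.push(grad)                                      # ship only the gradient up
```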
The second way is called model parallelism, and this happens when you have a model that is too big to fit into one single GPU. This normally happens when you have a large neural net doing natural language processing, big LSTM models or those big neural machine translation models. Those are typically pretty big and cannot fit into one GPU. So in this case you divide the model into K sub-models, each GPU just works on a sub-model, and they all need to work with each other in order to do the global model training. So this picture: if you have used TensorFlow distributed training, you must have seen this architecture diagram a lot. This is a typical setting when you do TensorFlow distributed training. You can have one or more parameter servers sitting at the top, and then you have multiple GPUs across multiple nodes. One thing to call out is that TensorFlow has done some basic optimization in terms of the gradients being sent from each GPU. Each worker node that is doing the work with multiple GPUs does a local aggregation first, to optimize the network bandwidth, before it sends the per-node aggregated gradients up to the centralized parameter server.

Fantastic, thank you Joy. I think one of the things, just in hearing those three slides, that was a revelation to me as an engineer: knowing where the bits get pushed and crunched, I can already mentally map how that lays down on Kubernetes. So what Joy actually did for me, and open sourced, was a bunch of steps to help me build out and map what she was doing. She's created a very lightweight, kind of in-the-weeds setup for us so that we can see at a very low level how this thing hangs together without making it overly complex. So bring any Kubernetes cluster on your provider of choice; this will work. But what we're doing here with Kubernetes is making sure that we have access to the GPUs. If you've ever used GPUs in Kubernetes, they are schedulable as a resource, you see them there. But just because they show up doesn't mean you have your house in order on the nodes, right? You need the libraries to access the GPU hardware, whatever that may be; a common one is NVIDIA. So you need to make sure that your container runtimes actually have access to utilize that hardware, right? Just because you can see that you have four GPUs doesn't mean your workloads actually have access to those GPUs. So there's a script inside that repo that you can run on your nodes, and it'll make sure that your GPUs are in the right place and shape, and then you can run this YAML file. So I'll quickly just pop over to this one. Okay, fantastic. Thank you, and we'll have the links here at the end. But for me, Joy putting this together meant I could actually get in the weeds. I know lots of folks are like this, but knowing what a PS0 and a PS1 was, and a worker zero and a worker one, in the context of what we just learned, I was incredibly grateful to have that knowledge, because I could start to peel back how I would design an infrastructure for her workload. So you can work through this at your own pace, but what you end up with is a parameter server and a number of workers. We run a Python script manually inside them all and describe where the parameter servers and the workers are, and this kicks off the process that Joy just detailed.
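For reference, the script each pod runs typically looks something like the sketch below using the TensorFlow 1.x distributed API; the pod host names, the port, and the toy model are placeholders of mine, not the actual contents of the repo.

```python
import tensorflow as tf

# Hypothetical pod addresses; in a real cluster these would be the Kubernetes
# service or pod DNS names for ps-0, worker-0, worker-1, and so on.
cluster = tf.train.ClusterSpec({
    "ps":     ["ps-0:2222"],
    "worker": ["worker-0:2222", "worker-1:2222"],
})

# Every pod runs the same script; only job_name / task_index differ per pod.
job_name, task_index = "worker", 0
server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)

if job_name == "ps":
    server.join()  # the parameter server just hosts and updates the variables
else:
    # Variables land on the parameter server; compute stays on this worker.
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:%d" % task_index, cluster=cluster)):
        x = tf.placeholder(tf.float32, [None, 10])   # toy model for illustration
        w = tf.Variable(tf.zeros([10, 1]))
        loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))
        train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)
```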
But what you can see in there is the number of batches we wanna serve up and the different models, which is gonna be great. Joy's gonna take us into what we actually saw when we did a lot of benchmarking and testing. So this is available to you, and we'll have the links there at the end. Let me pop back over to the deck. And finally, if you do a describe on your nodes, you will see GPUs if they're available. That feature is still in alpha, I believe, but you can do that. And some providers, GKE and ACS, have access to GPU nodes, so you can stand this up and start playing with it today. Now I'm gonna hand it back over to Joy, and she's gonna take us through what we actually found together as we built this infrastructure.

Yeah, so I'd like to share with you some of the key learnings we observed when we were doing a lot of the training on Kubernetes, especially doing a lot of the benchmark testing, tuning various different knobs and options, and what we've learned so far. Just to let you know quickly the training environment we used for the benchmark testing we did a couple of months ago: I was using two worker nodes, and each worker has four Tesla K80 GPUs. And I was using just one CPU VM for my parameter server, since the parameter server, as you have seen, basically has the job of doing a lot of the gradient aggregation and updates. It doesn't do the model training itself, so the parameter server does not really need a GPU. So I was just using a CPU server for that. And we were using the real, whole ImageNet dataset and the benchmark scripts. (So we did use the upstream benchmark scripts from TensorFlow as well, right? So you can take this and run the same thing on your own test bed.) Yes, correct. So before we go into the distributed training, just to show you what we observed on a single pod with multiple GPUs: as you can see from the chart showing here, you can get pretty good linear scalability. And I'm showing ResNet here just as an example, but actually the same pretty predictable linear scalability applies to a lot of different models we ran and tested. VGG16, Inception V3, GoogLeNet, different versions of ResNet, we tried them all, and they all scale from one GPU to two to four on a single node pretty nicely, and all the GPUs are fully saturated, which is actually good. That means you're not wasting your GPU resources. And for the distributed training testing, just a quick note on settings: I was using one parameter server and two workers, and I was doing the async gradient update, because the communication between the parameter server and the worker nodes can happen in a synchronous fashion or in an asynchronous fashion. Async sometimes can help you, especially if you have slow network bandwidth; it means that if you're using async, each worker does not have to wait for the rest of the workers to all finish their batch of data before it can move on to the next batch. The communication, the synchronization, can happen asynchronously. But sometimes using async can also hurt your speed-to-accuracy, meaning the parameter server can sometimes get stale updates coming from a slower worker node which might be having some network issues. So the performance of the model you're training might go up and down a little bit if you're using async.
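In the TF 1.x API, that async-versus-sync choice is roughly a one-line difference; here is a small sketch under the same one-parameter-server, two-worker assumption (the toy loss and learning rate are illustrative only, not the benchmark configuration).

```python
import tensorflow as tf

# Toy model so the snippet is self-contained.
x = tf.placeholder(tf.float32, [None, 10])
w = tf.Variable(tf.zeros([10, 1]))
loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))

opt = tf.train.GradientDescentOptimizer(0.01)

# Async (what this benchmark used): each worker applies its gradients to the
# parameter server as soon as it finishes a batch. No waiting, but a slow
# worker can push stale updates and make accuracy wobble.
train_op_async = opt.minimize(loss)

# Sync: gradients from all replicas are aggregated before one global update,
# so training is steadier but every step waits for the slowest worker.
sync_opt = tf.train.SyncReplicasOptimizer(
    opt, replicas_to_aggregate=2, total_num_replicas=2)
train_op_sync = sync_opt.minimize(loss)
```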
And in this particular testing, each parameter server and each worker pod has their own dedicated host, so they're not competing for resources, and I'm not using any InfiniBand network, just pure Ethernet with the gRPC protocol. So here are the distributed testing results. As you can see, we tested different models here, and different models actually have very different behaviors when you scale out. So it's not necessarily the case that you will always get linear scalability. You can see GoogLeNet, which is at the very left here, basically has pretty much the best, nearly linear scalability, whereas Inception V3 and ResNet-50 have worse scalability. In those cases I was only getting about a 1.6x speedup when training on two nodes versus one.

When I first saw these results, this was a bit of a revelation, because I was under the impression that you could just throw more hardware at the problem, but if these models aren't built to take advantage of that distributed hardware, you're just wasting your time. So a lot of effort on my part was wasted building larger clusters with bigger GPUs and putting more of them out there when the model couldn't actually saturate them. You can see there in that first diagram that GoogLeNet actually scales linearly in a distributed fashion, but over to the right there's a model that's actually worse when you run it distributed. Right, and some are around 1.6x on average there; GoogLeNet's the outlier. The rest are not linearly scaling. So we're gonna talk a little bit more about that, but me as an infrastructure engineer saying give it more GPUs, give it more hardware, we'll talk about how that now causes other problems. So this is the kind of knowledge I was receiving, and I started to build clusters in a different way for these folks.

So just to dive a little deeper on why those models behave so differently when you scale out: basically it all comes down to, in one single sentence, a critical characteristic of your actual model architecture, which is the compute-over-communication ratio of the model. Basically those two factors: how much GPU compute your model takes to get trained, and how many parameters, how many layers, your neural net architecture has, which dictates how much network bandwidth it needs to consume. So the compute-over-communication ratio of the model really decides how well your model can scale when you scale out your training, and as you can see, GoogLeNet really has the highest ratio here, so that's why, obviously, we saw during our real training as well that it scales pretty well, whereas the other models have a lower ratio and they don't scale as well.

Also for me, I actually thought disk access was gonna be the largest factor in that ratio, so compute to disk access, and talking to Joy, she informed me of something that I didn't know. So under the ML banner, which covers a whole lot of things, there is traditional ML, things like Spark, which are disk intensive, and then there are distributed deep learning neural networks, the more modern approaches, which are actually moving all this data around the network: crunch, move, crunch, move. So it's actually the movement of that data after each round of training that causes the whole thing to slow down.
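A rough back-of-envelope makes that ratio tangible. The parameter counts below are approximate published figures and the 10 Gbit/s link is an assumed number, not the hardware from this benchmark.

```python
# Estimate how long one worker spends just shipping gradients per step.
models = {"GoogLeNet": 7e6, "ResNet-50": 25.5e6, "VGG-16": 138e6}  # approx. param counts
bytes_per_param = 4              # FP32 gradients
link_bytes_per_sec = 10e9 / 8    # assumed 10 Gbit/s Ethernet

for name, n_params in models.items():
    payload_mb = n_params * bytes_per_param / 1e6
    wire_ms = n_params * bytes_per_param / link_bytes_per_sec * 1000
    print(f"{name:10s} ~{payload_mb:5.0f} MB of gradients, ~{wire_ms:4.0f} ms on the wire per step")
```

A few tens of milliseconds of communication per step is easy to hide behind compute; several hundred milliseconds is not, which is roughly why a small model like GoogLeNet scales out while VGG-style models stall.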
How fast can I get the trained model, these layers, pushed around so everybody else knows about them? So again, a revelation to me: here we are with 40 gig NICs just to move this around. Also, I think in the parameter server model, as you scale out, more traffic goes to the parameter server, so your parameter server actually becomes your bottleneck, because everybody's trying to shoot it 40 gigabits per second and it's only got a 40 gigabit per second NIC. So again, I thought you could throw more hardware at the problem, and it actually made the problem worse, right? These are things we need to talk about and think about as we're building this infrastructure. That's true.

So, some other observations that we wanted to call out from when we were doing the training, in addition to the network bandwidth and obviously the compute-over-communication ratio of your actual model. As a Kubernetes admin, when you're supporting your data scientists with your Kubernetes cluster and they're doing their training on it, the one big important thing to monitor is whether they are saturating their GPUs or not, because if their GPUs are basically sitting there idle, not getting saturated, then you wanna think twice before you throw more hardware at the problem. Another thing where I had to try different options, and where it wasn't really very intuitive for me to come up with an optimal design at the beginning, was the correct ratio of workers to parameter servers. I was testing with one parameter server, with two parameter servers, with the parameter servers sitting on the same host as the worker pods, or should I be having the parameter servers sitting separately on their own hosts? Those are all not so easy or intuitive to design at the beginning of your training.

So we've shared a lot of pain points and a lot of learnings. How can we do better? This is where I wanted to do some quick call-outs on some of the latest technologies and frameworks that literally just came out over the past two months. One is Horovod, which is Uber's recently open-sourced distributed deep learning framework for TensorFlow. It is important to note that it is not replacing TensorFlow, nor is it just yet another fork of TensorFlow. It is actually a standalone Python package that works on top of TensorFlow. So you don't have to worry about getting distracted or detached from mainstream TensorFlow; this is just another nice standalone package that works with TensorFlow. And it solves a lot of the network bottleneck issues, and also the hard decision about the correct ratio between your parameter servers and workers. It solves a lot of those problems by using the ring all-reduce algorithm, so that all of those workers, instead of sending their parameters, those gradients, to a centralized parameter server, can all just talk directly among each other. So now with Horovod, you don't even need to specify how many parameter servers you need anymore. There is no parameter server. All the workers just talk directly with each other among themselves using this ring all-reduce algorithm, which is highly efficient for cross-node and intra-node GPU communication. And it uses the NCCL and NCCL 2 libraries from NVIDIA: NCCL 1 is for intra-node GPU communication, NCCL 2 adds inter-node GPU communication, and they all use this ring all-reduce to talk to each other directly.
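For context, this is roughly how Horovod gets wired into an existing TF 1.x training script; a minimal sketch rather than the exact benchmark code, with a toy model standing in for the real one.

```python
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()  # one process per GPU; ranks discover each other (e.g. launched via MPI)

# Pin this process to its own local GPU.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Toy model for illustration.
x = tf.placeholder(tf.float32, [None, 10])
w = tf.Variable(tf.zeros([10, 1]))
loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))

# Scale the learning rate by the number of workers, then wrap the optimizer:
# gradients are averaged across workers with ring all-reduce (NCCL);
# there is no parameter server anywhere in this setup.
opt = hvd.DistributedOptimizer(tf.train.GradientDescentOptimizer(0.01 * hvd.size()))
train_op = opt.minimize(loss)

# Broadcast rank 0's initial weights so every worker starts identical.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]

with tf.train.MonitoredTrainingSession(hooks=hooks, config=config) as sess:
    for _ in range(10):
        sess.run(train_op, feed_dict={x: [[0.0] * 10]})
```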
And also, because all the workers can now talk directly to each other, from the network bandwidth perspective you basically avoid a lot of the network bottleneck that we were running into earlier. So as you can see from the chart here, if you're running Horovod on top of TensorFlow, you get much better, much more predictable linear scalability when you scale out. And it doesn't matter so much what your specific model architecture is anymore. As you can see, whether you're using Inception V3, ResNet or VGG16, your models can all pretty much scale nicely. And in this way, since you get better predictability of the scalability, you can then design in a much easier way exactly how many workers you need and how many GPUs you need, instead of guessing and testing.

Another thing is this deep gradient compression technique; this is a paper that literally just came out about two weeks ago. It also directly targets the huge network communication bandwidth issue when you're doing deep learning training across multiple nodes. Basically, it turns out that 99.9% of the gradient exchange is actually too tiny to be significant enough to be worth transferring in the first place. So it does a lot of compression without losing accuracy. If you're interested in this, check out the paper. It can really democratize distributed training over commodity, just one-gigabit Ethernet. Yeah, I like the smaller NICs too. And I think even the Google Pixel 2 ships with a TPU, so they're obviously doing something very similar as well. Now you can actually process these models and not have to have large bandwidth between them. Thanks, Joy.

So I think in the spirit of making this easier, one of the things that the Microsoft Research ML lab did was actually create an open source repo where you can bring your own Kubernetes cluster, install just a couple of pods, and have a lightweight framework, and Joy is gonna take us through how you could run these models yourself in that workspace. Can you do that? Yeah, there we go.

So this is basically an open source toolkit we just released a few months ago. It's from Microsoft Research. This is the deep learning workspace that is being used by our researchers and data scientists from Microsoft Research on a daily basis when they do a lot of deep learning training. It is backed by Kubernetes, and you can install it on-prem or on any cloud provider. It is integrated with Active Directory, so I can just quickly log in there. And obviously a typical data scientist wouldn't really know anything about Kubernetes, and all they need to have is access to this very intuitive, simple-to-use GUI. So if, for example, I wanna just spin up a Jupyter notebook, I can go and select a TensorFlow IPython CPU template, click on the submit button, and I'll get a Jupyter notebook really quickly. But what I really wanna show you now is how I can do distributed TensorFlow training just from this simple GUI interface. So I can now just select the job type to be a distributed training job and give it a job name. I can call it JoyTF, test nine. And in this case, for example, I can have just one parameter server and two workers, and each worker should have just one GPU.
And for the Docker version, I wanna use, say for example, TensorFlow 1.2, and in the training command line I can just simply give it the Python command for my TensorFlow Python script here. I can also enable TensorBoard. Obviously people love TensorBoard for debugging purposes when you're doing the training, so you can just say enable TensorBoard and you will get an endpoint, and you can quickly specify a model path for saving your model checkpoints. And then you just click on submit. And now, as you can see, some jobs are being queued and they will get scheduled very shortly. While we are waiting for the jobs to get scheduled, I can show you a previous Jupyter notebook job that I spun up. So this is a Jupyter notebook job that I had opened. Here I was given an endpoint, and by just simply clicking on the endpoint, there's my Jupyter notebook. And then back to my previous distributed training job. As you can see, the job is already running. If I click into the job output, you can see there's actually real-time training going on here. I specified 5,000 steps across two worker nodes, and you can see there are two workers, each working on some steps here. And then just quickly, if I wanna see my TensorBoard, I can just go to my TensorBoard endpoint. And there's my TensorBoard. That is literally the graph for the model that I specified in my script. So, back to the slides.

Just very quickly, two quick things I wanna call out. There is a FreeFlow CNI plugin that comes with the deep learning workspace we've just shown you, and the plugin is from Microsoft Research. It automatically leverages shared memory if you happen to have pods that are sitting on the same host. And also, if you do have an RDMA network, whether it's RoCE or InfiniBand, it automatically detects that you have an RDMA network and it will use it automatically. And it's completely transparent to your pods and your apps. This is super critical when you're doing deep learning workloads on Kubernetes: network bandwidth, network communication speed, is always super important. And this is why we use the FreeFlow CNI plugin ourselves a lot to help with network performance. And also we're contributing back to the Kubernetes community: we're gonna be contributing a GPU resource scheduler, and that is from Sanjeev from our Microsoft Research team as well. With that scheduler, as a custom plugin that you can plug into your Kubernetes cluster, your data scientists or your DevOps can specify a pod with not only just how many GPUs you need for the pod, but also how many GPUs with how much memory, and also how many GPUs that are interconnected through NVLink. Those are all very important when you're running deep learning GPU-related workloads on Kubernetes.

Thanks, Joy. And just to round up, there's a bunch of other resources that we placed on here. There's a lot of community engagement already going on. I know that the folks at Google announced Kubeflow a couple of days ago, and there'll be folks rallying around that, so maybe go take a look at that as well. But everything is up here today, and what we're hoping for is that we can all work together and build a community around making ML workloads run great on Kubernetes, and close some of the knowledge gaps that exist here today between me knowing ML and the data scientists knowing Kubernetes. Feel free to engage with both Joy and myself.
We do this day in and day out, via Twitter, or you can find us somewhere or other. We'll take questions, but thank you very much for your time. Thank you.