All right, we'll get started next with Chris Hall to talk a little bit about deep learning at scale. So go ahead, the floor is yours, you've got the timer. Test, test, test. There we go. Awesome. I haven't done a university lecture for quite a while. I feel like I'm back at college. So my name's Chris and I originally trained as a lawyer, legit, and as a data scientist. And then about two years ago, I had a mid-life crisis and went back and retrained as a helicopter skiing guide. And can anybody tell me the difference between a helicopter skiing guide and a large pizza? Nobody. The difference is that a large pizza will feed a family of four. Ah, so you stop doing that. But in my spare time I do quite like to climb mountains. So that's looking down towards Vancouver, and we actually climbed up to the top of that with our skis. I ride bikes with my little daughter in tow, and kayak down waterfalls. That's about 10 minutes from my house in New Zealand. So I work for these guys. And I've only worked there a couple of years. And interestingly enough, these guys are now these guys. And you're saying to me, but Chris, why are you presenting off your Windows machine that has an Apple sticker? Because I like taking the piss out of all of my friends who tell me they need a Mac to do open source development. And the reason I'm presenting off my Windows machine is that I presented at the TensorFlow user group last night, and the guys from Google were kind enough to say thanks, Chris, and take me out for dinner, and I said, do not let me forget my laptop at the restaurant, because I have a habit of doing that here in Singapore. I'll be picking up my laptop from the restaurant at Mapletree Business Park on Monday morning. So we're presenting on Windows, remoting into my Linux workstation at home. Sorry about that. Originally, as you can probably tell from the accent, I'm from New Zealand, but now I live in this place. Although if you do it properly at scale, it's really this place. But that's kind of cool. I can ride the bike around the whole island and things like that. The project I'm going to talk about today we did in this place, which is Korea. Fantastic. So at Microsoft, I run a team called the Commercial Software Engineering team for the region. And we build and ship. So basically, we are a bunch of engineers who work with major customers to build interesting things on our platform. We built this out with a major customer in South Korea. They would be the biggest at what they do in South Korea, and they don't make cars or television sets or cell phones. So we built this project out with them, and I thought I'd come and talk to you a bit about it. This is not a deep learning session, so you don't really need to know about deep learning for this session. It's really a deep learning meets high performance computing session. So we're really talking about how we can take deep learning and apply high performance computing techniques to train things at massive scale. So the project was with this big firm in Korea. And they have built a Korean-language chatbot that uses a deep learning conversational model to provide the answers. That deep learning model (for those of you who are the deep learning nerds; a few, there's one at the back, you're talking next, right?) is an LSTM-based recurrent neural network with attention. And just to geek out, it has about 80 million trainable parameters. And to give you some idea of scale, Baidu's Deep Speech version 1 had about 100 million trainable parameters.
Deep Speech 2 had about 300 million parameters. So it is, by global standards, a very large neural network. They had this already built, and they said to us, we've got a problem, which is that we can't train it quickly enough. It was taking several days to train the model. And now that it's in production, they want to be able to run it every day, capture feedback from the mistakes that it makes, and then retrain it reasonably quickly, at the very least overnight, although ideally even intraday. So the model is built using a high-level machine learning framework called Keras. Who uses Keras? A few people, fantastic. And it uses TensorFlow as the back end for doing the actual training. And this is a bit of a story about how we went and allowed this to run distributed training using Keras. So Keras is from some folks who work at Google. TensorFlow is obviously from Google. There's a tool called Horovod I'm going to talk about, and we ran this all on top of Batch AI, which is a service in our cloud that allows you to run really big high performance compute workloads. It was originally written down in Wellington, New Zealand by a company called GreenButton, who were a bunch of folks who spun out of Weta Workshop, who made the Lord of the Rings. And you knew that, given I'm from New Zealand, at some point this presentation would talk about Lord of the Rings. So what are the reasons that you want to think about doing distributed training for your machine learning models? I'd put them into three buckets. The first is you want to run more models. This is typically at experimentation time. This is typically your data science folks wanting to do things like hyperparameter optimization, test new features. Blinking. It's on the blink. Is it going to work? Let's try that again. Okay, I won't touch it too much. So you want to run lots of models. You want to try different hyperparameters. You want to try different model architectures. You're typically running smallish amounts of training on each of those models. It's quite experimental, but you want to run a lot. So typically you'll run some sort of a cluster, distribute those models to be trained around the cluster, and capture the results. The next one is you need to run the model training faster. This is typically a combination of: I have too much data, because we have this explosion of data, and one of the things that's really empowered deep learning is the fact that we've got shit tons of data to work with now. Or you've got too many trainable parameters or too complex a model architecture. Some of these funky new model architectures, VGG, if you've heard of VGG, from the guys at Oxford, is a particularly nasty model to train in terms of computational complexity. So a common approach there is what we call data parallel. We train the same model on multiple machines, and then either share the weights or average the gradients as part of the training process. And we'll talk about that one shortly. And then the last one is you've got really, really big models. Typical candidates here are really large recurrent neural nets. Speech models and language models tend to be very, very large. And in some cases they're too large to fit into a single GPU, and therefore too large to be able to train the same model across multiple GPUs in a data parallel way. So what you have to do is break that model up, put portions of the model on multiple GPUs, and effectively flow data down through the GPUs during forward prop and back up through the GPUs during backprop. There's a tiny sketch of that idea just below.
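Just to make that concrete, here is roughly what manual model parallelism can look like with Keras on a TensorFlow backend. This is a minimal illustrative sketch, not the customer's model: it assumes two GPUs and tf.keras, the layer sizes are made up, and exactly where variables land depends on your TensorFlow version, so treat it as a sketch rather than a recipe.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

inputs = layers.Input(shape=(1024,))

# First half of the model is pinned to GPU 0.
with tf.device('/gpu:0'):
    x = layers.Dense(4096, activation='relu')(inputs)
    x = layers.Dense(4096, activation='relu')(x)

# Second half is pinned to GPU 1; activations flow across during forward prop
# and gradients flow back the other way during backprop.
with tf.device('/gpu:1'):
    x = layers.Dense(4096, activation='relu')(x)
    outputs = layers.Dense(10, activation='softmax')(x)

model = models.Model(inputs, outputs)
model.compile(optimizer='adam', loss='categorical_crossentropy')
```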
We're going to talk primarily about the middle one today. So of these three distribution approaches, we have a thing called model parallelism. That was the last one I talked about, where we chop the model up and train it across multiple GPUs. We have data parallelism, where we're training the same model at the same time on multiple GPUs, and then sharing the weights or sharing the gradients. And then we have hybrid architectures, where we might be doing data parallel across multiple machines, but on each machine we have multiple GPUs, and within that machine we're actually having to do model parallel, because again we've got too large a model. The whole model parallel thing ebbs and flows, particularly as we have this race between models getting larger and the amount of RAM in our GPUs getting larger. A quick show of hands: who's got a GPU in their machine? Who's got more than one GPU in their machine? Yeah, just a couple. So this is really talking about how do we enable training across machines with more than one GPU, and also how do we enable training across more than one machine with more than one GPU. And to give you some idea of scale, we're looking at training on anywhere up to 100, 150 GPUs at a time. So reasonably large scale training. So we came into the project, we were working with TensorFlow, that was kind of the default. We did try some other things, tried changing out the back end underneath Keras, but we stuck with TensorFlow. There are a couple of approaches to distributing TensorFlow. The first is to use a thing called TensorFlow Distributed, which is a native part of TensorFlow. It's really good at doing things like model parallel. It can do data parallel by using a parameter server to update the weights. There are some challenges with both of those approaches. The first is that with model parallel, you've got a significant amount of rework to actually map your model onto the physical topology of the GPUs that you've got. That's worth doing if you really need to have model parallel. We were quite lazy on this project. We really just wanted to make the thing run faster, as quickly and easily as possible. We didn't really want to go spelunking into a whole lot of code that had been written by our Korean customer and had lots of comments in Korean that only some of us could read. So we just wanted to do this with as few lines of code as possible. So just before we engaged, the folks at Uber announced a thing called Horovod. Horovod is a data parallel framework for TensorFlow that basically manages doing the weight updates or the gradient updates during the training process. So as you go through, you do your forward prop for a mini-batch and then your back prop for the mini-batch, get your gradients, and then we need to pass the gradients around all of the other GPUs, on the same machine and across the network, so that they can all average the gradients and then update the weights accordingly. And it implements a thing called the Baidu, or MPI, ring all-reduce pattern, which was originally pulled into deep learning by the folks at Baidu. So we'll talk a bit about that first. Once you start scaling across machines, you start with a deep learning model and very quickly you go from having a deep learning problem to I've now got a big compute problem.
So we start to solve the big compute problem, and very quickly you go from I've got a big compute problem, which I can now solve with a crap ton of GPUs in the cloud, to having a high performance networking problem, because ultimately I have to communicate between those GPUs. And so if you start with patterns like parameter servers, the problem you've got is that your GPUs go through, do forward prop, back prop, get their gradients or weights, and then they have to send those to some sort of parameter server, some sort of reducer, to do the weight or gradient averaging, and then the parameter server has to send them back. Who can tell me the problem with this approach as I start adding more GPUs? It's less about memory, it's more about bandwidth. Yeah, so my communication cost scales linearly with the number of GPUs that I have. So it starts to have real problems when I want to run 50, 60, 70, 80, 100 GPUs. So what the folks at Baidu suggested is that we use a thing they call ring all-reduce. It's really a ring scatter-reduce followed by a ring all-gather. It's been pretty common in high performance computing for a while, and basically it's a pattern for passing data around a ring architecture, which means that at every step each machine can send and receive at full bandwidth, just to its neighbours in the ring. So it's bandwidth optimal, taking full advantage of the bandwidth of all of the GPUs. The way this works is that it's effectively a way to sum up large arrays. And if we're trying to average gradients for a deep neural network, summing up large arrays is kind of useful, because that's really what the gradients are: we sum them up and that lets us get the average. So you take your array on each machine, so each machine's got its own copy of the array, and you chop it up into K chunks, where K is the number of GPUs that you've got. And then we start passing these chunks around the ring. At each step, the thing to note is that there's only one little grey arrow going to and from each GPU. So we pass the chunks around, and at each step we can add, or reduce, the data that's been passed around. Ultimately we end up with each GPU holding one chunk where it's managed to reduce all of that data and sum all of those arrays. So that's the scatter-reduce portion. That's one trip around the ring, but again we're using optimal bandwidth the whole way. Then we do an all-gather, which is basically passing those reduced chunks around the ring again, so we ultimately end up with a copy of every complete, reduced chunk on every single GPU. Again, it's another trip around the ring, but bandwidth optimal the whole time. I'll show a tiny toy simulation of the whole pattern in a moment. So this gives us a really efficient way of passing these gradients between GPUs, and the great thing about it is that the amount of data each GPU has to send doesn't really grow with the number of GPUs, so the communication cost doesn't scale with the size of the ring. That's really important, because we want to add lots of GPUs. So to do this we're using this thing called Horovod. Horovod is a distributed training framework for TensorFlow, open-sourced under the Apache 2.0 license by the folks at Uber. It implements that scatter-reduce, all-gather pattern, it does so using NVIDIA's NCCL 2 library, and it's able to take advantage of RDMA if you've got either InfiniBand cards or RDMA over Converged Ethernet.
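Here's that toy simulation: a single-process NumPy sketch of the ring all-reduce just described. Each "GPU" is just an array in a list, the arrays are split into K chunks, and after a scatter-reduce pass plus an all-gather pass every worker holds the element-wise sum. It only illustrates the data movement pattern; it is not how Horovod or NCCL actually implement it.

```python
import numpy as np

K = 4                                          # number of "GPUs" in the ring
grads = [np.random.rand(8) for _ in range(K)]  # one gradient array per worker
target = sum(grads)                            # what all-reduce should produce

# Each worker splits its own copy of the array into K chunks.
chunks = [list(np.split(g.copy(), K)) for g in grads]

# Scatter-reduce: K-1 steps. At each step, worker r sends chunk (r - step) mod K
# to its right-hand neighbour, which adds it into its own copy of that chunk.
for step in range(K - 1):
    for r in range(K):
        c = (r - step) % K
        chunks[(r + 1) % K][c] += chunks[r][c]
# Now worker r holds the fully reduced chunk (r + 1) mod K.

# All-gather: another K-1 steps. The reduced chunks are passed around the ring
# and simply copied, so every worker ends up with every reduced chunk.
for step in range(K - 1):
    for r in range(K):
        c = (r + 1 - step) % K
        chunks[(r + 1) % K][c] = chunks[r][c].copy()

# Every worker now holds the full sum of all the gradient arrays.
for r in range(K):
    assert np.allclose(np.concatenate(chunks[r]), target)
```

Note that each worker only ever talks to its neighbour, and the total data it sends over both passes is roughly twice the array size, regardless of how many workers are in the ring; that's where the "doesn't scale with the number of GPUs" claim comes from.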
So ideally what you want to do is have a cluster of GPU machines, and you want them connected by really high performance interconnects. Anybody lucky enough to have a DGX-1 from NVIDIA? They're really nice. Anybody want a DGX-1? You must want a DGX-1. I've got a customer who's looking at buying one at the moment. 150 grand. Very nice. They also have this thing called Tensor Fusion, and Tensor Fusion allows us to batch up gradient tensors before we send them across the network. You'll find that if you're training really skinny, deep models, things like ResNet, you can end up having to pass around quite small tensors, and you start getting hit by communications latency. So the latency as you do a lap around the ring starts to become a problem if you're only passing quite small tensors. So what they do with Tensor Fusion is they bin up 64 megs or so of gradient data before they pass it around, and then split it back out once it comes back around the ring. They have this cool thing called Timeline, which does distributed logging of the execution, and then they have a way of rendering that in Chrome's trace viewer. Can I just ask my friendly AV person? I do also have... Do I? I've got VGA as well, maybe. No. I thought I had a DisplayPort. I've got... It's just I don't... You guys are going to get quite annoyed with that flashing. Well, let's do this. Let's go Windows-P. Extend. Let's chuck that up there. So you want it to go to 720, yeah? 1280 by 720, yeah? Okay, let's try that. Really minimal changes to your TensorFlow program, I'll show you shortly, and you can just go pip install horovod, or conda install horovod, and off you go. So the way this works, really the most important thing to understand, is that it's just a distribution framework for the optimizer. It wraps your standard optimizer and manages all of the distribution of gradients to do updates across the network. It uses MPI, but really just for discovery and coordination of the processes, because it primarily uses NCCL for the scatter-reduce and all-gather. If you don't have NCCL, because you've got Windows or you just don't have that capability, then it'll fall back to using MPI. I spoke a bit about TensorFlow, and I'm just going to skip through that. In terms of scripting with it, pretty simple. Want to try that one? Oh, yeah, yeah, that works. So: import the packages; initialize Horovod, which does discovery of all of the local GPU processes and all of the remote GPU processes over MPI. Pin each process to a specific GPU, so you get the local rank of the process and then pin it to one of your GPUs. Specify your epochs in terms of the size of your Horovod network, so basically the number of GPUs you're working across. Build your model just the same as normal. Create a standard optimizer of your choice, adjusting the learning rate based on the size of the network. Wrap that with the Horovod DistributedOptimizer, compile your model, and fit your model, and this callback here is used to broadcast all the initial weights around the network. That's really important, because otherwise, if you've got different random weight initialization across the different nodes, you may end up with your averaged gradients not necessarily all pointing downhill; they could be pointing away from each other, and it doesn't work very well. The sketch below shows those changes end to end.
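Here's that sketch: a minimal Keras plus Horovod training script following the steps just described. It's illustrative only; the MNIST model and hyperparameters are placeholders rather than the customer's chatbot, it uses the TF 1.x-era horovod.keras API, and you'd launch it with something like `mpirun -np 4 python train.py`, one process per GPU.

```python
import math
import keras
import tensorflow as tf
import horovod.keras as hvd

hvd.init()  # discover all local and remote GPU processes over MPI

# Pin this process to a single GPU, chosen by its local rank on the machine.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())
keras.backend.set_session(tf.Session(config=config))

# Placeholder data and model, just to keep the example self-contained.
(x_train, y_train), _ = keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype('float32') / 255
y_train = keras.utils.to_categorical(y_train, 10)

model = keras.models.Sequential([
    keras.layers.Dense(512, activation='relu', input_shape=(784,)),
    keras.layers.Dense(10, activation='softmax'),
])

# Scale the learning rate by the number of workers, then wrap the optimizer
# so Horovod handles the gradient averaging (ring all-reduce) for us.
opt = keras.optimizers.Adam(lr=0.001 * hvd.size())
opt = hvd.DistributedOptimizer(opt)
model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])

callbacks = [
    # Broadcast rank 0's initial weights so every worker starts from the same point.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
]
# Checkpoint from one worker only, to avoid the race conditions mentioned below.
if hvd.rank() == 0:
    callbacks.append(keras.callbacks.ModelCheckpoint('checkpoint-{epoch:02d}.h5'))

# Divide the epochs across the workers so the total work stays roughly constant.
model.fit(x_train, y_train, batch_size=128,
          epochs=int(math.ceil(12.0 / hvd.size())),
          callbacks=callbacks,
          verbose=1 if hvd.rank() == 0 else 0)
```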
In terms of training, a few things to look out for. Pre-process your data, ideally before you spin up your major cluster; if you've got a large amount of data, you may want to do the data pre-processing on a cluster and stage it into a temporary, stateful store. This is important because your cluster is potentially going to have uptime and downtime. If you're running 100 GPUs, you will have downtime, so plan for that. Shuffle and sample the records. Broadcast the initial weights out to all the nodes. There's some guidance here in terms of mini-batch sizing and learning rate; there are papers on that from Facebook and the folks at Google. It will work with TensorBoard. It will work with ModelCheckpoint, and if you're going to do that, do it on one specific GPU, as in the sketch above, just so you don't get race conditions. I'm not going to do the demo, but I've got a repo that you can pull the code down from later. Super simple. In terms of the perf numbers the folks at Uber came up with, it has significant performance benefits when you start pushing out to really large numbers of GPUs, because the problem you've got otherwise is that your communications time starts to outweigh the time you spend actually training on mini-batches. In terms of how other frameworks support this, there are some other options in TensorFlow that provide this all-reduce pattern. There's one from Baidu. We found the Uber one a bit easier to use. CNTK supports it natively using NCCL. It also has a nice thing called one-bit stochastic gradient descent, which is an aggressively quantized way of sharing gradients. And an even nicer thing is that last Friday, the Friday before last, on the 15th, we changed the license for one-bit stochastic gradient descent from being a somewhat proprietary license to now being covered under the standard CNTK license, which is an MIT license. So that's a really nice thing to be able to share at an open-source conference. Chainer provides a thing called ChainerMN, Chainer Multi-Node, which also uses NCCL. One thing to note: it doesn't support 16-bit floating-point arithmetic yet, and you generally want to be using FP16 for training your models. MXNet does not provide ring all-reduce. It uses a parameter server, so you'll probably see similar scaling challenges to what you see with distributed TensorFlow. Torch provides a couple of options. One is a super simple multi-GPU, shared-memory, single-node approach. The other is a full multi-GPU, multi-node approach, once again using NCCL. And Theano is dead. A few other things to note. Who's thought of building a GPU cluster? Who's thought of using GTX cards in their GPU cluster? Because they're nice and cheap. Who's got a mining cluster at home? Bitcoin miners? No Bitcoin miners? Surely. No Bitcoin miners. So, if you are looking to build a cluster, you need to be aware that on the GTX cards Nvidia do not enable a thing called GPUDirect. GPUDirect allows high-performance communications directly between the GPU and IO devices, such as storage and, more importantly, InfiniBand network adapters. The reason this is really important is that in this ring all-reduce pattern, you are limited by the slowest connection point between two GPUs. Without GPUDirect, communications from a network card to the GPU need to come through the switch, through the CPU into the DMA space for the IO driver, then back through the CPU to get copied to the DMA space for the GPU driver, and then from the GPU driver back through the PCIe bus into GPU memory.
I know this is kind of super nerdy, but again, when you start training at massive scale, every piece of lost performance is costing you significant amounts of money. With GPUDirect, communications can go straight through from the network card into GPU memory. In terms of optimal GPUs, ideally, if you can get hold of them, you want to be using the Volta generation GPUs from Nvidia, which have extremely high-performance FP16 flops. I haven't played yet with Google's TPU. They want to give me some cloud credits to play with it. That'll be nice. Don't know if we're allowed to do that. We're allowed to do that. But, you know, the Volta is kind of the gold standard Nvidia card for training deep learning models. In our cloud, we have a SKU called the NC24rs_v3, which has four Volta V100s and FDR InfiniBand at 54 gigabits per second. So that works really well. We also have the Pascal generation P40 cards, which are worth thinking about for model inference, because they support really high-performance int8, and you can quantize your models down to int8. One of the things I'm a massive proponent of is using spare capacity in people's clouds. We have a thing called low-priority VMs. Google have preemptible VMs. AWS have spot instances. Basically, this gives you an enormous discount, well, a significant to enormous discount, on VMs, on the understanding that they can be preempted and turned off on you. It does take some work to make this OK for training large deep learning models, because in a data parallel situation, if you kill a node, you kill the entire training job. So you need to think about how you're going to go back to a checkpoint and restart it. Ideally, you'll use some sort of workload manager: BCOS, Azure Batch, AWS Batch, there's a whole bunch of different options. We're obviously using Azure Batch. You need to checkpoint your models, because if you kill one node, you're going to need to restart from that checkpoint; there's a rough sketch of that resume logic at the end of this section. If you're running really large clusters with lots of nodes, think about running spare nodes, so that when your job gets killed, you can immediately restart it on a hot machine, rather than waiting for another machine to boot up. Why? Because you've got 20 other machines sitting there doing nothing until that new machine starts up. For spare capacity, we run a fixed price for low-priority VMs. The fixed price on the NC24rs_v3 is about a buck 22 US per hour for four Volta GPUs. Not sure if you think you can get that cheaper anywhere else, but I'm pretty sure you can't. So frankly, there's no point in building your own big machine learning cluster when you can get them for a buck 22. That said, I do advocate having a powerful local workstation with at least one, ideally two, consumer-grade GPUs, so that you can test everything locally, muck around, and then kick it up to the cloud. We used Batch AI; the sample code is in the repo. Underneath it's an HPC job distribution engine, and we've taken that and built AI-specific stuff on top of it that makes it really easy to run any of the major AI frameworks, either natively on the VM or inside Docker containers. It's sweet. It works really well.
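Here's the rough resume-from-checkpoint sketch mentioned above, for running on preemptible or low-priority nodes. The model, data, and paths are toy placeholders; the idea is just that when the workload manager restarts a killed job, it picks up the newest checkpoint (kept on storage that outlives the VM) rather than starting from scratch.

```python
import glob
import os
import re
import numpy as np
import keras

CKPT_DIR = '/tmp/checkpoints'   # in practice: shared storage that survives preemption
TOTAL_EPOCHS = 10
os.makedirs(CKPT_DIR, exist_ok=True)

def latest_checkpoint(ckpt_dir):
    """Return (path, epoch) of the newest 'checkpoint-<epoch>.h5', or (None, 0)."""
    paths = glob.glob(os.path.join(ckpt_dir, 'checkpoint-*.h5'))
    if not paths:
        return None, 0
    epoch_of = lambda p: int(re.search(r'checkpoint-(\d+)\.h5', p).group(1))
    newest = max(paths, key=epoch_of)
    return newest, epoch_of(newest)

# Toy stand-in for the real model and data.
x = np.random.rand(256, 32).astype('float32')
y = keras.utils.to_categorical(np.random.randint(0, 4, 256), 4)

ckpt, start_epoch = latest_checkpoint(CKPT_DIR)
if ckpt is not None:
    model = keras.models.load_model(ckpt)   # resume where the preempted job stopped
else:
    model = keras.models.Sequential([
        keras.layers.Dense(64, activation='relu', input_shape=(32,)),
        keras.layers.Dense(4, activation='softmax')])
    model.compile(optimizer='adam', loss='categorical_crossentropy')

# Save a checkpoint every epoch; on restart, initial_epoch skips the work already done.
model.fit(x, y, batch_size=32,
          initial_epoch=start_epoch, epochs=TOTAL_EPOCHS,
          callbacks=[keras.callbacks.ModelCheckpoint(
              os.path.join(CKPT_DIR, 'checkpoint-{epoch:02d}.h5'))])
```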
Some final tips and tricks, just watching the clock here. For best price performance, use Linux. You should not be using anything other than Linux to train your neural networks, both from a price point of view and in terms of availability of the necessary drivers. The key thing you're going to miss on Windows is support for the NCCL 2 library, which provides NVIDIA-optimized all-reduce that can go across network boundaries. Remember that every second of less than 100% utilization of your network and your GPUs is costing you money, so really focus on optimization. Scale up, then scale out: first to multiple GPUs, then to multiple nodes. Don't do anything stupid like running four machines with one GPU in each. Start by going to the biggest GPUs you can buy, which at the moment is V100s. Then get as many of them as possible running in a single machine. Then go out to multiple machines containing as many GPUs as you possibly can. Use the largest batch size you can get away with without affecting accuracy. Why? Because it reduces the number of times you need to go from training to doing a ring all-reduce and back again. You are only as fast as your slowest GPU and your slowest interconnect, so don't try to do this with heterogeneous GPUs. If you've got, like, a 1080 and an old 940, it's a really bad idea to use those two together. Likewise, don't build a network that has different interconnects. Ideally, you want everything as fast as possible and an interconnect with the lowest latency; InfiniBand is great. Another thing to watch for is anything that can make mini-batches take different lengths of time. If you've got weird network architectures or weird approaches to mini-batch sizing that can make certain mini-batches take longer on some nodes than others, you are going to be wasting GPU clock cycles, because you're going to be waiting for every single node and every single GPU to finish its training before you can do the gradient update. There are some asynchronous approaches that you can look at, but it just starts making things hard. Use a job management and cluster management tool, and that way you can take advantage of these low-priority VMs. Again, we've got low-priority VMs, Google have them, AWS have them. Probably the only sales pitch I'll give you today is that you won't find anywhere remotely as cheap to buy Volta V100 capacity at the moment as what we've got. The GitHub repo is WACCA-ULDNZ, high-scale-dnn-training. If you want to really nerd out, there's a 200-slide tutorial from the Hot Interconnects conference last year from the folks at Ohio State, go Buckeyes. There's a good discussion of asynchronous stochastic gradient descent, which, as I said before, is a way of doing data parallel where you're a little less dependent on latency. And a good discussion from last year's GPU Tech Conf on ways of doing automated model parallel work distribution. And I note that next week, I think, is this year's GPU Tech Conf. Is anybody going? Maybe wish they were going. So folks, with that, I'll thank you very much for your time and attention. Hopefully that was interesting content. My team at Microsoft loves doing hard stuff, so if you're an interesting major customer or a high-potential startup and you want to build cool shit, I'd love to talk to you. Thanks very much. Thank you, Chris. In the interest of time, no questions, but if you have questions...