Thank you. Hello, everyone. Hi. So you have probably seen that there were two names at the beginning of this talk, and you have probably counted that there is only one body on the stage right now. So yeah, there were some last-minute changes, and I will be presenting by myself.

Let me first quickly tell you more about what I'm doing at Google. My full name is Vatislav; just don't try to pronounce it, Slavo is good enough. At Google, I'm leading a small team that is trying to optimize TensorFlow and other frameworks for the hardware we have on Google Compute Engine. So basically, if you're spinning up a VM on Google Compute Engine and you're choosing the special deep learning image that we have, you will probably be using our custom version of TensorFlow, precisely optimized for the GPUs and CPUs that we have on the platform.

Before we proceed to the talk, a quick question: how many of you are actually using TensorFlow? OK, that's impressive. How many of you are using PyTorch? So why have you come to this talk? But anyway. OK, so let me first say kudos to the authors of this presentation who are not here. Several of them are from the TensorFlow team, several of them are from my team. Huge thanks to all of them, because it wouldn't be possible without them.

Now let's jump to the agenda. As you might have guessed, we split the talk into two parts. First, we're going to cover the theory of distributed training. Oh, yep, were there any questions? OK. First, we're going to cover the theory of distributed training, and inside the theory I want to answer two simple questions: how can you do distributed training, and which type of distributed training should you pick, and why? Then we will see in practice how you can actually apply all of this with TensorFlow. And what is more important, we will see how, in practice, you can take some legacy model that was implemented, let's say, five years ago, or three, some old, old model, and use the modern practices of doing distributed training.

So let's first start with theory. Before we jump in, a few simple icons that I'm going to be using. These are the official icons of Google Cloud. The first one is a virtual machine on Google Compute Engine. The second one means that the VM I'm showing is a high-CPU instance without a GPU. And the third icon is a GPU. I know the GPU icon looks kind of the same as the VM on GCE, but these are the official icons, so just bear with me.

OK, so let's start by asking: why distribute? If you want to do distribution, you need to ask three questions. Question number one: if you have a model at hand, how fast do you need to train this model? Basically, how fast do you want to experiment? If you're trying one model per day, maybe training for one day is fine for you. Maybe you need to train in one hour, or two hours, or maybe one week. As soon as you realize how much time you can spend on training the model, you need to answer the question: what level of accuracy is good enough for your business needs? As soon as you have these two questions answered, say you're absolutely fine with training within two hours and the level of accuracy is, let's say, 89%, the third question is: what is the cheapest way to do the training that satisfies these business requirements?

And with distributed training, I'd say the rule of thumb is the same as with concurrent programming.
I used to tell my students: if you're thinking of doing concurrent programming, and if you can avoid it in any possible way, please don't do concurrent programming. The same is true here: if you can avoid distributed training in any possible way, please do avoid it. But OK, let's now look more precisely at why you might not want to distribute, and then we'll discuss how to distribute if you're still willing to do so.

This is a chart I'm going to show just to illustrate the compromise you will have to make when you think about distributing. On the X axis, I have configurations of different types of instances, and I'm going to show you the training speed of a ResNet with synthetic data on the different types of instances. So the Y axis is the speed of training; it's not going to be to scale.

OK, the first instance is a 96-vCPU Skylake. If you use the special deep learning VM that we have, with TensorFlow optimized for Skylake, you might get 200 images per second. This is like an ideal case. Let's say that this is not what you want; you need to go faster. So you might spin up an instance with one V100 (Volta) GPU, and you will get around 800 images per second. Again, this is the ideal case with TensorFlow 1.11. We have recently released 1.12, but these numbers will be approximately the same. These two numbers are what we will be calling the baseline; effectively, we call it the baseline because the second instance is using only one GPU.

Let's say that even this is not enough; you need to go faster. Then you spin up an 8-GPU machine. And on the right-hand side, these two configurations are what I like to call the fairy-tale zone. Why the fairy-tale zone? Because if you're spinning up a machine with 8 GPUs, you're probably expecting the training speed to be roughly 720 multiplied by 8, which is roughly 6,000. This is reasonable to expect: you now have 8 GPUs, there is no network between them, and they are all within one box. It's a reasonable expectation. Even more, let's say even this is not enough, so now I want to spin up two instances. Again, I would probably expect 6,000 multiplied by two, something around 12,000 images per second. That is my expectation. And again, this is the fairy-tale zone.

In reality, of course, you're paying the price of the overhead. If you're using 8 GPUs, you will probably get something roughly like 5,000 images per second. And again, this is close to the ideal case; you might get an even worse situation. So you're paying, effectively, around 10% of overhead just to go multi-GPU. This 10%, I'm trying to recall the actual price, might cost you around $1 per hour that you're just paying for this overhead. Things get even worse if you go over the network. With the network, you're effectively paying almost 25% of overhead just to go into distributed mode. So a reasonable question the business might ask: does it actually make sense to go into distributed mode?

Here is another chart that basically shows you how much speed-up you gain and how much more you pay. Effectively, you can see it's still better: it's still worth the money to keep growing the speed if you do need to speed up. At some point, these two lines will converge, and the price will probably go higher than the speed-up you're gaining from the training. But if you're still willing to wait, say you don't want to spin up two or three machines, you actually want to use one machine and wait longer, the price will actually be lower.
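Just to make the arithmetic explicit, here is a back-of-the-envelope version of that chart, using the rough numbers quoted in this talk. These are not official benchmarks or exact GCE prices, just the approximate figures from the slides:

```python
# Back-of-the-envelope version of the "fairy-tale zone" trade-off, using the
# rough numbers quoted in this talk (not official benchmarks or GCE prices).
one_gpu = 720                      # images/sec on a single V100, approximate
ideal_8gpu = one_gpu * 8           # 5760: the naive expectation for one box
actual_8gpu = 5000                 # what you roughly get in practice

overhead = 1 - actual_8gpu / ideal_8gpu
print("multi-GPU overhead: {:.0%}".format(overhead))  # ~13%, on the order of the ~10% quoted

ideal_2node = actual_8gpu * 2      # naive expectation for two 8-GPU machines
actual_2node = ideal_2node * 0.75  # the talk quotes almost 25% network overhead
print("two-node throughput: ~{:.0f} images/sec".format(actual_2node))
```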
That's because you no longer pay the overhead of the network. And again: if you can avoid doing distributed, you should avoid doing distributed. Whether the difference in the price you pay is worth the time you're getting back is up to you and the business to decide.

OK, so now let's jump to the question of which types of distribution exist in theory. We will start with multi-GPU distribution: effectively, how can you distribute the training if you have one physical instance with many GPUs in it? Let's say this is your VM. You spin up a brand-new VM, you have 8 V100s, and you want to train your model. Let's say this is your model. I know it's not practical; not many people in real life have a model with five elements in their computational graph. But for the sake of the example, let's say this is your model and you actually want to train this small model on V100s. OK, as long as you have money, you can do whatever you want.

One of the ways you can do distributed training for this super crazy graph is to actually copy-paste this graph to all of the V100s. Each of the V100s will be training the same graph: it will be executing the forward pass, the backward pass, applying the gradients, and doing it again and again. Now the question arises: if all of these V100s are training on the same computational graph, how are you going to synchronize the graph between them? One way of doing so is called all-reduce, a special algorithm that synchronizes elements of the graph between all the GPUs on your physical instance. And one of the most famous all-reduce algorithms is ring all-reduce.

Ring all-reduce looks something like this; this slide is from Uber's technical blog. Uber has done drastic work to implement Horovod, a high-level framework based on ring all-reduce that heavily speeds up distributed training in TensorFlow. We will speak about it later, in the practice part. The whole idea of ring all-reduce is that each worker, which could be one GPU or one instance in your cluster, sends a part of the gradient to the next worker, and at the same time receives a part of the gradient from the previous worker. Effectively, this means that each worker connects to the next neighbor and the previous neighbor and forms something that looks like a ring, or a circle, whatever you like more. The good thing about this is that each worker utilizes both channels, download and upload, at the same time, so you can squeeze all the possible performance out of the communication channel between your instances. (There is a toy simulation of this below.)

As you might guess, you need to build the communication between GPUs something like this: each of the GPUs in the instance should have precisely two connections, to the next neighbor and the previous neighbor, in order to give you a high-speed ring topology for your synchronization with ring all-reduce.

Now, if you spin up a deep learning VM on Google Compute Engine and run nvidia-smi topo --matrix, a special NVIDIA tool that shows you the connectivity between the GPUs you have on the hardware (you can just run this CLI and get the connectivity), you will see that the NVIDIA V100 has six NVLink links. NVLink is a special technology from NVIDIA that allows connecting two GPUs to each other, creating high-speed connections.
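Before looking closer at that topology matrix, here is a toy simulation of ring all-reduce itself, to make the two phases concrete. This is a sketch of the algorithm only; real implementations such as NCCL or Horovod do this with actual asynchronous sends and receives between devices, not numpy arrays:

```python
import numpy as np

def ring_allreduce(worker_grads):
    """Toy simulation of ring all-reduce: each of the n workers starts with
    its own gradient and ends with the sum over all workers, while only ever
    sending 1/n of the data to its next neighbor per step."""
    n = len(worker_grads)
    # Every worker splits its gradient into n chunks.
    chunks = [np.array_split(g.astype(float), n) for g in worker_grads]

    # Phase 1: scatter-reduce. After n-1 steps, worker i holds the fully
    # summed chunk (i + 1) % n.
    for step in range(n - 1):
        # Snapshot outgoing chunks first: all sends happen simultaneously.
        sends = [(i, (i - step) % n, chunks[i][(i - step) % n].copy())
                 for i in range(n)]
        for i, c, data in sends:
            chunks[(i + 1) % n][c] += data   # next neighbor accumulates

    # Phase 2: all-gather. The summed chunks travel around the ring until
    # every worker has every chunk.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, chunks[i][(i + 1 - step) % n].copy())
                 for i in range(n)]
        for i, c, data in sends:
            chunks[(i + 1) % n][c] = data    # next neighbor overwrites

    return [np.concatenate(c) for c in chunks]

# Four workers with toy gradients of 1s, 2s, 3s, 4s; everyone ends with 10s.
grads = [np.full(8, i + 1) for i in range(4)]
print(ring_allreduce(grads)[0])   # [10. 10. 10. 10. 10. 10. 10. 10.]
```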
In this particular diagram, in the matrix that you see, the important part is this: you can see that each GPU basically has three NVLink connections to one GPU and three NVLink connections to another GPU. So each GPU is connected precisely with two neighbors, allocating all of its connection bandwidth to the previous and the next neighbor. Actually, if we map this exact connectivity onto our previous picture, you will get something like this. These are six NVLink links, designed at the hardware level to speed up ring all-reduce in the best possible way. So this is one way you can, in theory, do distribution: you copy the graph onto all the GPUs (we will see how to do it in practice) and then apply some level of synchronization.

The second way, which sticks around due to historical reasons, is called model distribution. (The previous one is called data distribution.) The idea behind model distribution is that, again, you have the same graph with five nodes. Again, please don't train a five-node graph on a V100. And this time you're distributing the graph itself. You're no longer copying the whole graph to each particular GPU; you're copying a part of the graph. In this particular example, one GPU has one node, another GPU has another node, then one GPU has two nodes, and so on. Effectively, this type of distribution is a more complex one, because you, as the developer of the graph, actually need to decide how you're going to distribute the graph across the GPUs. You need to take into account the connectivity between GPUs: if you have a ring topology like the one we saw, it's not practical to have a diagonal connection like in this example, because there is no direct hop between those GPUs. So this is model distribution. (There is a short code sketch of this after the summary below.)

As soon as you, as the developer of the graph, have spread it across the different GPUs, you might say: OK, I have eight V100s and I distributed my graph across four of them. How am I going to use the other four? For the other four, you can now apply data distribution at a higher level. So now you have one logical graph that sits on four GPUs, and you have a copy, another logical graph, that sits on the other four GPUs. And now you can do data distribution: you have two copies of the graph, both of them can train separately and use a ring to synchronize between the two logical graphs.

So far, we have covered distribution within one instance. A small conclusion. We have data distribution: it's kind of simple to implement, because for you as a developer it's usually absolutely transparent. You say, here is my graph, please do all the magic, and usually the framework you're using knows how to copy the graph across all the GPUs and how to synchronize them. And usually, due to the popularity of this type, you have hardware optimization: hardware precisely designed to do ring all-reduce. The second type is model distribution, which is actually hard to implement. Please don't do it. But due to historical reasons it's still around, and we will cover why. And then you have the most complex one, hybrid distribution, where you do model distribution and then copy this model that sits on four GPUs and do a higher-level data distribution. This one is the hardest, but it's still used in legacy code.
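To make that concrete, here is a minimal sketch of what manual model distribution looks like in TensorFlow 1.x, matching the style of this talk. The tiny three-layer "model" and the device placement are hypothetical, purely to show the mechanism:

```python
import tensorflow as tf  # TF 1.x style, matching the rest of this talk

# Hypothetical model: layer sizes and placement are made up for illustration.
with tf.device('/gpu:0'):
    x = tf.placeholder(tf.float32, shape=[None, 1024])
    h1 = tf.layers.dense(x, 512, activation=tf.nn.relu)

with tf.device('/gpu:1'):
    # Activations must now travel from GPU 0 to GPU 1; this transfer is
    # exactly the cost you have to plan around when placing nodes by hand.
    h2 = tf.layers.dense(h1, 256, activation=tf.nn.relu)

with tf.device('/gpu:2'):
    logits = tf.layers.dense(h2, 10)
```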
OK, now, multi-node. Multi-node basically has the same ideas behind distribution, and the first one is, again, model distribution. Only in this case, you're spinning up not many GPUs, but many instances. This type of distribution was common around 2012, six years ago. This is actually a picture from the DistBelief white paper. DistBelief is the framework for deep learning that we used to have at Google before TensorFlow came along, and DistBelief, at that time, heavily utilized CPU training. So in this particular case we have four instances, each with 96 CPU cores, and therefore, in order to train a huge model, you did need model distribution. In this case, due to the nature of the model, researchers designing the topology of the model were usually precisely specifying how many machines it needed to be distributed on. And then, when you finish describing the topology of your model, you can do higher-level data distribution, where each group of four nodes is a logical copy of the graph. Now you have four clusters, and each of the clusters is doing data distribution; effectively, each is training the graph on a part of the data. And then you would have what is called a parameter server. Parameter servers are specially designated machines that are used just to store your model. So this is the model, and hybrid, type of distribution.

You can also do data distribution across nodes, as we discussed with the GPUs. Effectively, the idea is the same: just copy the graph across all the nodes. If your nodes have enough power, you have GPUs and you have enough memory, just copy it to all of them. And now you need to synchronize them. The idea of synchronization is basically the same: you can do a ring. And if you're doing a ring, you need to be sure that all of your workers are actually reliable and all of your workers have a high-bandwidth network connection between each other. In some cases, if you're not sure about this, you can split your ring in two: you can have two rings. In this particular case, let's say you have fast connectivity between workers two and three, so you create a ring for fast synchronization of the model between them. And then you have the green arrow, which is effectively a two-node ring. That second ring can start only when the first ring is done. So you can technically split your cluster into sub-parts that have high connectivity within each other.

Now, the next type of distribution that's still around is the parameter server. Here you're no longer sending a small chunk of your gradient step; you're sending the whole gradient step to the parameter server, and the parameter server is the source of truth for your model. This was one of the most widely used types of distribution at Google back then, and there were several reasons for that. First of all, this scheme copes well when one of the workers goes down. It's not a problem: the other workers will continue working, because they no longer rely on the ring. They are no longer waiting to receive all the parts of the gradient step from all the members of the ring. Therefore, you can technically use this model of distribution with a lot of preemptible machines, with instances that are not reliable, or when the connection is not reliable.

Now, people always ask which of these two models they should be using nowadays, and there is no silver bullet. Effectively, if you know that you have reliable connections between instances and the instances are fast, you should probably go with the all-reduce topology, without parameter servers.
That is actually why, with TPUs, we're using only all-reduce; you cannot use parameter servers with Google Cloud TPU instances. Now, if you want to use something like preemptible instances, or you have unreliable connections, or you have slow instances, CPU instances, or any of your instances can go down, you probably need to use a parameter server. But there is a bottom line: if you're using a parameter server, your workers might often be using stale data. They might not be using the latest version of the model, and therefore you're paying with convergence time.

So these are the four distribution models across instances: you have the ring, you have parameter servers, you have model distribution, and you have the hybrid mode. Out of these, if we're not speaking about parameter servers at all, data distribution is the simplest one, model distribution is hard, and hybrid is the hardest one.

Now, the question: if hybrid distribution is the hardest one, why do we still see official TensorFlow models, and other models, that use model distribution? They're still doing this, and you as a developer will probably have to work with it. In particular, this is an example from the documentation of the neural machine translation model. It basically says that it does model distribution, and it leads to really fascinating results: you start your training, and you might see GPU number one throwing an out-of-memory exception while GPU number two is utilizing only 5% of its memory, because your model is not distributed evenly.

The reason can be found in the history of how TensorFlow itself was developed. If you go back to the DistBelief white paper and look at these two pictures, you will see that model distribution was the dominant approach at that point in time. So the question is: why, in 2012, were people doing model distribution? The thing is, in 2012 GPUs were just at the beginning. A lot of people were starting to try to use them for research, but GPUs used to have a small memory footprint, just six gigabytes, and the CUDA compute capability of the time did not offer sophisticated enough APIs to work around that small memory footprint. The speed of training was not that fast either. And there is something even more interesting. This is an actual chart from the white paper, and it's really telling. On the X axis, you can see the number of CPU cores they used to train a huge model. On the Y axis, you see time to train. All of these are different CPU configurations, except that purple triangle, the purple triangle close to the Y axis. That is actually the GPU time to train. So basically, in 2012, the GPU was one of the slowest ways of training a huge model. This is one of the conclusions that the researchers drew, and I will just give you a moment to read it through, because this is so not how it is today.

OK, so basically, back in 2012, training on GPUs was not as big of a thing as it is today. Nowadays, no one uses model distribution, but due to historical reasons there are many models and examples around that still do this. And we're going to cover, in the practice part, how you can take a model that does this and actually use it with data distribution, which is way simpler. OK, yeah, this is basically what changed in 2013 and why we now train on GPUs.

So let's go to the practice part. Practice part: first, the good old days, how you used to do distributed training.
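As a reminder of what that looked like, here is a heavily condensed sketch in the spirit of the old tf.train.ClusterSpec / tf.train.Server APIs of that era. Host names, ports, and the model body are placeholders, not code from the talk's slide:

```python
import tensorflow as tf  # TF 1.x APIs, as in the documentation of that era

# Every process had to know the full cluster layout plus its own job name
# and task index. Host names and ports here are placeholders.
cluster = tf.train.ClusterSpec({
    'ps':     ['ps0.example.com:2222'],
    'worker': ['worker0.example.com:2222', 'worker1.example.com:2222'],
})
server = tf.train.Server(cluster, job_name='worker', task_index=0)

# Parameter-server processes would just block and serve variables:
#     if job_name == 'ps':
#         server.join()

# Worker processes would pin variables to the ps and ops to themselves:
with tf.device(tf.train.replica_device_setter(
        worker_device='/job:worker/task:0', cluster=cluster)):
    pass  # ... build the model and training op here ...
```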
Something like the sketch above, only with a lot, a lot, a lot more source code, is what the official TensorFlow website used to show to explain how to start a cluster with parameter servers. I basically just put it here to emphasize how horrible it was. I'm not even going to dive into what you're doing there, because you need to do a lot of stuff, and usually you would find yourself making a lot of bugs and mistakes just getting this to run.

The community has come up with two simplifications for doing distributed training. One was designed by Uber and is named Horovod. Horovod heavily utilizes ring all-reduce, and it's actually the name of a traditional Russian dance where people dance in a circle, holding hands and switching with each other, which kind of looks like ring all-reduce, so they decided to call it that. Distribution Strategy is a native, new concept in TensorFlow that we're going to discuss. Distribution Strategy is right now in the process of development, so you can think of it as what will be the default way of distributing in TensorFlow 2.0, but you can actually start using it right now. We'll discuss why you might want to wait for 2.0.

So let's take a small example that is not distributed, and let's discuss how we can make it distributed. Here's the example, a simple linear classifier. Let me walk you through the source code that we have. We're creating two numeric columns, x1 and x2. We're creating a small optimizer. We're creating the estimator. We have an input function that just produces random values. And we finally start the training.

Now let's see how you can do this in a distributed way with Horovod. When the people at Uber were designing the whole idea behind Horovod, they looked at the then-current approach to distribution. If you have a model that tries to utilize all four GPUs, say you have an instance with four GPUs and a model that utilizes all of them, usually you would start TensorFlow, and TensorFlow somehow, inside the model, knows how to split the training between the four GPUs. Horovod and the people at Uber decided to do this in a slightly different way. They thought: OK, now you're going to spin up one process per GPU. Each of the processes will have a TensorFlow that thinks it is the only process on the machine, and that it has only one GPU it can utilize. So effectively, you have four absolutely independent processes. Now, if you have many, many processes, how are you going to start them? In the Horovod way, you usually use OpenMPI to start the different processes, and then Horovod, under the hood, is going to synchronize the model between all the TensorFlow processes. So it is kind of transparent to TensorFlow. And most importantly, Horovod heavily utilizes NVIDIA's NCCL for synchronizing the model between GPUs and instances. The real beauty of this approach is that if you want to do multi-node distribution, you technically don't need to change the source code. You just need to change your OpenMPI launch script. That's it.

Now, coming back to the example. This is the original example, and this is the same example with Horovod. Let me give you some explanation of what has changed. First of all, we're importing Horovod and we're calling init. This is simply a must, according to the documentation. The second part: you wrap your optimizer in the Horovod optimizer. This is just a wrapper; it has the same API, and your model source code should not be changed.
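Put together, the Horovod version of the toy classifier looks roughly like this. This is a sketch in the spirit of the talk's slide, not the slide itself; the column names, learning rate, and checkpoint path are illustrative, and the remaining changes (rank-zero checkpointing, GPU pinning, the broadcast hook) are walked through next:

```python
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()  # must be called before any other Horovod API

feature_columns = [tf.feature_column.numeric_column('x1'),
                   tf.feature_column.numeric_column('x2')]

def train_input_fn():
    # Random toy data, in the spirit of the talk's example.
    features = {'x1': tf.random_uniform([64]), 'x2': tf.random_uniform([64])}
    labels = tf.cast(tf.random_uniform([64]) > 0.5, tf.int32)
    return features, labels

# Wrap the optimizer; same API, gradient averaging happens underneath.
# (Horovod also recommends scaling the learning rate by hvd.size().)
optimizer = hvd.DistributedOptimizer(
    tf.train.GradientDescentOptimizer(learning_rate=0.1))

# Pin each process to one GPU via its local rank (explained in a moment).
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

estimator = tf.estimator.LinearClassifier(
    feature_columns=feature_columns,
    optimizer=optimizer,
    # Only rank 0 writes checkpoints (explained in a moment).
    model_dir='/tmp/linear_model' if hvd.rank() == 0 else None,
    config=tf.estimator.RunConfig(session_config=config))

# The broadcast hook syncs the initial weights from rank 0 to everyone else.
estimator.train(input_fn=train_input_fn, steps=1000,
                hooks=[hvd.BroadcastGlobalVariablesHook(0)])

# Launched with OpenMPI, one process per GPU, e.g.:
#   mpirun -np 8 python train.py
```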
The wrapper is absolutely transparent. Then there is a small, if you want, hack: you're saying, if the rank of your particular process is zero, then save the model here; otherwise, None. The reason for this: since you spin up a lot of processes, all of these processes are going to train the same model, and therefore they will, at the end, have the same model. You don't want each of the processes trying to save it into the same folder. You technically want only one process to save it, so you're saying: OK, if your rank is zero (and the rank is the unique ID of the process), then you are the one who is going to save the model. Everyone else can just ignore it. Then you specify the config, just to point your TensorFlow process at the particular GPU it can utilize. Horovod's local rank gives you the rank local to the machine: if you have four processes, the local ranks will be from zero to three inclusive, and therefore each process will utilize its own GPU without any conflicts. And you pass this config to your estimator. That's it. That's it; almost, yeah, I forgot about the hook that you absolutely must apply, which broadcasts the initial variables from rank zero to all the other processes.

OK, so these are the changes. Usually they're really small changes; you don't need to change a lot of your stuff. And this is how you start it. In this particular example, you're starting training with eight GPUs, eight processes, on the local machine with mpirun. And as I said, the best part, the beauty of this model, is that if you want to do training on many machines, you just change the launch command. This is how you start the very same source code to do all-reduce distributed training across four different nodes. In this particular case, it will utilize 32 GPUs, with each instance having eight GPUs. That's it. You just scale it to as many nodes and machines as you want.

Another thing Horovod gives you: it's not only the simplicity of scaling horizontally, it has several more really nice additions. One of them is Tensor Fusion, a technique that combines small tensors together and sends them over the wire only when the buffer is full. This saves some amount of traffic. And they have a special timeline visualizer that lets you debug your distributed training a bit more easily.

And the second part: Distribution Strategy. Distribution Strategy is a brand new API. Right now, Distribution Strategy lives inside tf.contrib, and tf.contrib is something you can think of as TensorFlow's incubator. Some APIs never leave contrib and might actually be removed. Some APIs eventually graduate from contrib and move to another part of TensorFlow when they satisfy the requirements we want them to satisfy. In this particular case, Distribution Strategy is still in contrib, so we need to take that into account. This means the API might change in the future; effectively, this API will graduate out of contrib with TensorFlow 2.0. So you can have a sneak peek of how distribution will look in the future.

Distribution Strategy right now provides four different strategies. The first one, the one-device strategy, is the default: if you just run the same example we used, it will pick one device and utilize it. That's it. The second one is the mirrored strategy. The mirrored strategy is effectively the ring all-reduce that we just discussed.
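Here is roughly what the mirrored-strategy version of the same toy example looks like. A sketch under the same assumptions as before; in TF 1.11/1.12 the strategy still lives under tf.contrib, so the exact path may change as the API graduates:

```python
import tensorflow as tf

feature_columns = [tf.feature_column.numeric_column('x1'),
                   tf.feature_column.numeric_column('x2')]

def train_input_fn():
    # Same random toy data as before.
    features = {'x1': tf.random_uniform([64]), 'x2': tf.random_uniform([64])}
    labels = tf.cast(tf.random_uniform([64]) > 0.5, tf.int32)
    return features, labels

strategy = tf.contrib.distribute.MirroredStrategy()            # change one

estimator = tf.estimator.LinearClassifier(
    feature_columns=feature_columns,
    optimizer=tf.train.GradientDescentOptimizer(0.1),
    config=tf.estimator.RunConfig(train_distribute=strategy))  # change two

estimator.train(input_fn=train_input_fn, steps=1000)
```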
And let's quickly see how you change this example, the same example, to use the mirrored strategy. Again, same example; at first it's actually hard to see where exactly the changes were made, because it's literally two places. One: you create the MirroredStrategy distribution instance. Two: you pass it to your estimator. That's it. You can actually inline this, and then you have only one place where you say you want the mirrored strategy. And that's it. TensorFlow will, under the hood, do ring all-reduce and utilize all the GPUs that it has. You no longer need to use OpenMPI, and you literally don't need to make a lot of other changes.

Then you have the collective all-reduce strategy. It's the strategy for synchronizing between instances. Under the hood, collective all-reduce uses gRPC. This is the biggest difference between Horovod, which actually utilizes NCCL from NVIDIA, and the native TensorFlow implementation, which heavily utilizes gRPC. We will look at the difference later with benchmarks. Again, the same example: what needs to be changed? Unfortunately, a lot. First, you need to specify the config of your cluster. And the worst part: you do need to specify the index of your worker. This is interesting, because you need to somehow figure out the index of the current worker that you're starting; you need to somehow orchestrate starting the workers. Then you need to actually create the instance of the collective all-reduce strategy. An interesting part here: as you can see, num_gpus_per_worker means that all of your workers need to have the same number of GPUs. It's common that your workers will be identical, but from time to time I've seen people trying to train on strange configurations where different instances have different numbers of GPUs, or even different types of GPUs. With this strategy, for now, that's not supported. And then you pass the same config and you start train_and_evaluate. It doesn't support plain train; you have to use train_and_evaluate. So this is basically how you do multi-GPU, or sorry, multi-instance distribution with Distribution Strategy. (A condensed sketch of this multi-worker setup follows below.)

Now, a small conclusion. This is the comparison of the speed of the different distribution approaches: the red line is Horovod, the blue line is Distribution Strategy. As you can see, while you're speaking about one instance, up to eight GPUs, you probably do want to go with Distribution Strategy. It's slightly, slightly slower, but probably still within the deviation of our measurements, and it's way simpler: it's only one line that you need to change, that's it. Another important point is that Distribution Strategy is supported by the TensorFlow team, so you can file a bug and have expectations about the reaction time on that bug. If you're using Horovod, I don't know; I cannot speak on behalf of Uber. It's up to them to decide how soon they can actually fix a bug inside Horovod.

If we're speaking about multi-node distribution, for now at least, Horovod is faster. The hypothesis for why it's faster is that Horovod actually utilizes NCCL, while the native TensorFlow distribution utilizes gRPC under the hood, and it will take some time for TensorFlow to optimize the gRPC communication to the same level as NCCL is at right now. So yeah, for multi-node, I would suggest at least looking at Distribution Strategy, because it's the future. It's what you will have to use in 2.0, because even Horovod will effectively migrate to the Distribution Strategy APIs. It will be just another distribution strategy.
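For completeness, here is a condensed sketch of the multi-worker pieces just described. I'm reconstructing this from the talk's description; the exact way the strategy plugs into RunConfig varied across 1.x releases while the API lived in tf.contrib, so treat it as an approximation rather than the definitive wiring:

```python
import json
import os
import tensorflow as tf

# Every worker runs this same script; only the task index differs per worker.
# Host names are placeholders.
os.environ['TF_CONFIG'] = json.dumps({
    'cluster': {'worker': ['host1:5000', 'host2:5000']},
    'task': {'type': 'worker', 'index': 0},   # must be set per worker
})

strategy = tf.contrib.distribute.CollectiveAllReduceStrategy(
    num_gpus_per_worker=8)   # all workers must have the same GPU count

feature_columns = [tf.feature_column.numeric_column('x1'),
                   tf.feature_column.numeric_column('x2')]

def train_input_fn():
    features = {'x1': tf.random_uniform([64]), 'x2': tf.random_uniform([64])}
    labels = tf.cast(tf.random_uniform([64]) > 0.5, tf.int32)
    return features, labels

estimator = tf.estimator.LinearClassifier(
    feature_columns=feature_columns,
    config=tf.estimator.RunConfig(train_distribute=strategy))

# Plain train() is not supported here; you have to use train_and_evaluate.
tf.estimator.train_and_evaluate(
    estimator,
    tf.estimator.TrainSpec(input_fn=train_input_fn, max_steps=1000),
    tf.estimator.EvalSpec(input_fn=train_input_fn))
```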
That's it. So this was the comparison of the different distribution strategies that you can apply. A huge thank you, and questions, questions? Okay, if there are no questions, you can find me outside. Thank you, guys.