Cool, okay. All right, it's 4:55, so we'll go ahead and get started. Thanks, everyone, for coming out to our talk today. It's really exciting to be doing in-person conferences again, so I hope you all enjoy it as well. I'm Travis Addair, CTO of Predibase, a company spun out of Uber AI focused on providing a low-code, highly scalable machine learning platform, and I'm here with Nicholas. I'm Nicholas, a senior deep learning engineer at NVIDIA, focusing on deep learning training at scale. We've both been working on the Horovod project for a number of years now, and we're here to tell you a little bit about some of the applications of Horovod, how it's evolved, and how the entire distributed training ecosystem has evolved over the years, and hopefully to find a way to make it useful for you and the work you're doing as well.

To quickly start off with a little background: the problem we're talking about is that we want to train deep neural networks that have some pre-processing followed by a series of layers that make up what's called the forward pass. At the end you get a prediction, and based on the error in that prediction you back-propagate the gradients you compute to update the weights of the model. In a distributed setting, the gradients from this back-propagation step need to be aggregated in some way, and that's going to be at the crux of the problem we talk about.

Broadly, there are two ways to do this kind of distributed training. One is what we call model parallelism, where you subdivide different sections of the model across different hardware, usually GPUs. The other is what we call data parallelism, which is what you commonly use when your model is reasonably sized but your data set is too large to be handled by a single machine. Both are commonly used in practice today, and we'll talk a little bit about each of them.

Going back in time to about 2016, there was really only one way to do distributed training, and it was called parameter servers. This was pioneered by Google in a big way with the DistBelief paper, and they packaged it up into TensorFlow back in the early days, when TensorFlow was pretty much the only game in town for deep learning. At Uber, one of my colleagues, Alex Sergeev, was the tech lead for the deep learning infrastructure team supporting the self-driving car group, and the problem they had was that they had these very large perception models processing terabytes, even petabytes, of image data, and they just couldn't train those models fast enough.
And so Alex was looking at what was out there in the market, saw parameter servers, tried giving them to that team, and it wasn't working for them. Specifically, it wasn't working because it required making very substantive changes to their actual training scripts, it required a deep understanding of the infrastructure itself and how it worked, and it significantly complicated the operationalization story: not only did you have to think about the architecture for the training processes, you also had to think about the architecture for the parameter servers themselves. That led to the development of the Horovod framework, which started as a data-parallel framework for distributed training. We've developed it to be framework agnostic, so it supports TensorFlow, PyTorch, and MXNet. It has many high-performance features, including support for NVIDIA's NCCL library and MPI with RDMA, plus a special technique we call tensor fusion. It's very easy to use, requiring only about five lines of Python code, and of course it's open source as part of the Linux Foundation and very easy to install as well.

At the heart of Horovod, and where it gets its name (a dance in Russian tradition where you hold hands and dance in a circle), is an algorithm called allreduce, which dates back to the high-performance computing days. The very interesting insight discovered by some folks at Baidu was that if you use this ring-allreduce algorithm to aggregate the gradients needed to update the model, it ends up being the bandwidth-optimal way to do it, assuming bandwidth is the constraint in your computing environment. So the way Horovod started out, and the way it works in principle, is that you take the gradients across all your workers, pass them around in a certain way, and at the end they all have a single shared copy of the parameters they use to update their models, so none of them diverge as they train on different subsets of the data.

This approach turns out to be very fast. We benchmarked it on a number of different problems; the original ones we looked at were in the computer vision domain, and here you see benchmarks going up to 512 GPUs on a variety of benchmark architectures. But Nicholas will show you in a bit that this can scale quite a bit beyond that as well.
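To make the "five lines of Python" point concrete, here is roughly what that integration looks like with Horovod's PyTorch API. This is a minimal sketch: the model, learning rate, and data are placeholders, and the rest of your training loop stays as it was.

```python
# Minimal sketch of the Horovod additions to a PyTorch training script.
# The model and optimizer below are placeholders for your own.
import torch
import horovod.torch as hvd

hvd.init()                                          # 1. initialize Horovod
torch.cuda.set_device(hvd.local_rank())             # 2. pin each process to one GPU

model = torch.nn.Linear(10, 1).cuda()               # placeholder model
optimizer = torch.optim.SGD(model.parameters(),
                            lr=0.01 * hvd.size())   # 3. scale LR by worker count

# 4. wrap the optimizer so gradients are averaged with allreduce on every step
optimizer = hvd.DistributedOptimizer(optimizer,
                                     named_parameters=model.named_parameters())

# 5. make every worker start from the same weights and optimizer state
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)
```

You would then launch it across workers with something like `horovodrun -np 4 python train.py`.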
Today, when we look at the community and how this project has grown over the years: since we first open-sourced it in 2017, it has grown to over 12k stars on GitHub, and we see quite a variety of different use cases, which is going to be at the heart of what we talk about today. A lot of people use it in research as well as production, across academia and industry, and some people use it as a hobby, which I congratulate those people for. We also see a variety of workload sizes: most people tend to operate in the two-to-four-machine range, but some go up to very large clusters as well. Most people these days run it on Kubernetes, which is not a surprise; it's definitely very popular. Apache Spark and Ray are two other popular options for people in the MLOps world, but we also have a lot of people running on HPC systems with schedulers like Slurm, and Databricks and Amazon SageMaker have official integrations with Horovod as well. One of the most interesting things for me personally is the slide showing that we have a very good mix of people across both TensorFlow and PyTorch, as well as some people still on TensorFlow v1; for people using TensorFlow v1 in particular, I think Horovod is really your only viable option. It's definitely interesting to see the variety of people out there.

Coming back to the diverging use cases: while this project originally started out focused on big data sets and computer vision models, trying to speed up training, there's a lot of variety in what people are doing in the distributed training space today, and it's often difficult to cut through the noise and figure out when you should use Horovod versus another framework, and what all these different features people talk about actually are. A few of the big ones we want to cover today: the HPC case, which is where Horovod really got its start and where a lot of very interesting, groundbreaking work has been done, along with some of the more recent trends you may have heard about in the news, with people claiming machines have become sentient or something; that's very much in the wheelhouse of very large language models. We also have a lot of people who want to run not on dedicated HPC hardware but in the cloud, where the problem is much more about how to reduce costs and how to handle failover when using preemptible instances. And finally there's MLOps, which is very near and dear to my heart in the business I'm in, and which is very much about end-to-end integration with other data processing systems like Spark and Ray. To start things off, I'll hand it over to Nicholas to talk about HPC.
Thank you, Travis. So the first use case we'll look at is HPC, high-performance computing. You can see on the timeline that a lot of breakthroughs happened at the end of 2018 and into 2019. We'll focus on two of them: the exascale paper, which led to a lot of improvements in Horovod, and the MLPerf competition, which also drove a lot of improvement in deep learning at scale. I put the star history back up to show that it matches up a little bit: when a lot was happening in deep learning at scale at the end of 2018, the stars were going up quickly, and afterwards it starts smoothing out a little bit.

This publication, the exascale paper, won the Gordon Bell Prize in 2018. It was really the first exascale-class deep learning application (not at full precision, but at FP16, to be precise), so it was the first time deep learning was done at this scale, on around 27K GPUs. A lot of improvements came out of that, both in the communication library (NCCL, with NVIDIA) and in the Horovod project itself, and we'll talk about that on the next slide. We also like to point out, in the picture in the middle, Josh Romero, who has been a big contributor to Horovod.

One of the optimizations that came out of this publication is the response cache optimization. Horovod works in a really dynamic way: it checks across all the GPUs which tensors, which gradients, are ready to be reduced, and when some of them are ready it launches an allreduce to perform the reduction and keeps going. So there is a loop in the code checking whether any tensors are ready. Before the optimization, the way to do this was to gather the metadata of the ready tensors on GPU 0: everybody sends their messages, GPU 0 aggregates the data and sends back the list of tensors that are ready to be reduced. The insight is that during deep learning training, the set of tensors is the same from iteration to iteration, so the solution was to implement a cache, and then you don't have to funnel this exchange through GPU 0 as a single point of coordination and send the result back to everybody. By keeping a cache with one bit to represent each gradient tensor, a worker can flip a tensor's bit to one when it is ready on its GPU, and then a single allreduce on this bit vector tells you which tensors are ready on all the GPUs, after which you can launch the allreduce communication for those tensors. So coordination becomes a single collective operation over a small amount of data, just a bit vector. You can see it on the graph: before the optimization, past 200 to 400 GPUs we stop scaling as well, but with the optimization, which only sends this small amount of data, you can scale nearly linearly to at least 800 GPUs.
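To illustrate the coordination idea, here is a conceptual sketch; this is not Horovod's actual internal code, and the tensor count and readiness values are made up.

```python
# Conceptual sketch of bit-vector coordination: each worker marks which gradient
# tensors are locally ready, and one small allreduce reveals which tensors are
# ready on every worker. Not Horovod's internal implementation.
import torch
import horovod.torch as hvd

hvd.init()

NUM_TENSORS = 8                       # hypothetical number of gradient tensors
local_ready = torch.zeros(NUM_TENSORS)
local_ready[[0, 2, 3]] = 1.0          # placeholder: tensors 0, 2, 3 are ready here

# Summing the bit vectors across workers in one collective tells us, per tensor,
# how many workers have it ready; a count equal to world size means ready everywhere.
global_counts = hvd.allreduce(local_ready, op=hvd.Sum, name="readiness_bits")
ready_everywhere = (global_counts == hvd.size()).nonzero().flatten()
print(f"rank {hvd.rank()}: tensors ready on all workers: {ready_everywhere.tolist()}")
```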
The other interesting thing I want to talk about in the HPC use case is the MLPerf training competition. It's a benchmark and a competition at the same time: it covers a diverse set of DL models to evaluate a DL platform, both the software and the hardware. The competition is organized twice a year; competitors submit their results, the results are peer-reviewed by the other competitors to keep the competition fair, and of course they need to be reproducible. The competition started in 2017, I think, and over the years a lot of insight and a lot of optimization has come out of it, especially at scale.

The last MLPerf submission was in the fall of 2021, and four out of the eight benchmark records used Horovod for the scale-out submission, across different types of neural networks: MiniGo for reinforcement learning, 3D U-Net for image segmentation, SSD for object detection, and ResNet-50 for image classification. One of the improvements we made to Horovod for that submission was the async data dependency optimization, which I'll talk about on the next slide.

We saw that for data parallelism, the goal of Horovod is to average the gradient tensors at the end of each iteration. What we had in the code (and this is still the behavior if you don't enable the optimization) is that the CPU waits until the forward pass and the backward pass are done; after that, Horovod checks that the tensors are ready on all the GPUs and then launches the GPU kernels for the communication operation; and then the CPU waits again for the output of the communication before it can tell the framework to do other things, like sending the optimizer work to the GPU. What we did is push those dependencies down onto GPU streams, so the framework and Horovod can queue up GPU operations even before the preceding ones have finished on the GPU. You can see in the "after" diagram that we manage to queue up the forward pass, the backward pass, the allreduce, and even the optimizer before the backward pass is done. So now there is no gap in GPU utilization between the backward pass, the allreduce, and the optimizer, and you get faster training time and higher GPU utilization.

A few tips, especially for training at very large scale. The first thing you usually want to do to utilize your GPUs properly is to use the maximum batch size you can, to use all of your GPU memory. At some point, though, you will see that for very large batches the quality of your convergence degrades, and a lot of the time you want to stop there and maintain a good convergence curve. At that point you switch to strong scaling: instead of increasing the problem size, you keep the same problem size and add more resources to it. The downside is that the computation per GPU decreases while the communication stays the same.
So you cannot overlap as much of the communication behind the computation, and you need to make sure you don't have too much latency in your stack. The improvement I mentioned earlier, the async data dependency optimization, decreases Horovod's latency, which is really good for strong scaling. Another tip for strong scaling is to reduce the cycle time, the frequency at which Horovod checks for work to do. The downside of doing that, if you don't do anything else, is that you may end up sending very small messages over the network, just one or two tensors, instead of aggregating them. To prevent that, you can enable grouped allreduce with static grouping: you tell Horovod that even if only one tensor is ready now, you want a full bucket of tensors before sending anything over the network, so you don't pay the cost of sending lots of small messages.
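As a rough sketch of what tuning those two knobs can look like in practice, assuming the `HOROVOD_CYCLE_TIME` environment variable (cycle time, in milliseconds) and the `num_groups` argument to `DistributedOptimizer` that exposes grouped allreduce; the specific values here are illustrative, not recommendations.

```python
# Illustrative only: shorten the tensor-readiness cycle time and enable static
# grouping so each network message still carries a sizable payload.
import os
os.environ["HOROVOD_CYCLE_TIME"] = "0.1"   # check for ready tensors more often (placeholder value, ms)

import torch
import horovod.torch as hvd

hvd.init()
torch.cuda.set_device(hvd.local_rank())

model = torch.nn.Linear(512, 512).cuda()   # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# num_groups asks Horovod to fuse gradients into a fixed number of buckets before
# launching allreduce, counteracting the tendency toward tiny messages.
optimizer = hvd.DistributedOptimizer(optimizer,
                                     named_parameters=model.named_parameters(),
                                     num_groups=2)
```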
Another use case that has become more popular, as Travis mentioned, is model and hybrid parallelism. A lot of work has been done inside Horovod to make that work: Horovod was mainly doing data parallelism, and now, since it's a generic framework, you also have the option of doing model parallelism. If you look at the timeline, you'll see that a lot happened for model parallelism in the community around the beginning of 2019. We'll focus on two use cases. One is recommender systems, like DLRM; think of recommender models that predict which ads you're going to click on or which movie you're going to like or watch. The other use case is large NLP models, large language models; in 2019 we saw specialized frameworks like DeepSpeed and Megatron-LM come up that focus on exactly that.

In the case of DLRM, if we think about a model that predicts which movie you're going to like or watch, you have inputs that need to be represented in an embedding space. In the embedding layer you need to represent the users, their demographics, the different movies, and the more movies and users you want to put in your model, the bigger the embedding layer becomes. That becomes the memory-constrained part of your model. The way to solve it is to split the embedding layer across multiple GPUs, and once you're past those memory-constrained layers you can do data parallelism the normal way Travis talked about, where every GPU has the same model but works on different input data. To switch between model parallelism and data parallelism there is a communication phase, the all-to-all, where each GPU sends pieces of its embedding layer output to all the other GPUs.

The other use case is large NLP models. I wanted to put up this graph to show that over the years these models have become bigger and bigger, so they can do more language tasks in a more generic way and be more accurate, and we've gone to billions and now even trillions of parameters. The common way to scale large models is 3D parallelism, where you split things three ways: the first is the data parallelism we've talked about, and the other two are pipeline parallelism and tensor (model) parallelism. For tensor parallelism you split each layer horizontally, an intra-layer split, and for pipeline parallelism you partition the layers of your model into buckets and put them on different GPUs. For data parallelism the communication is allreduce, which we've covered. To support tensor parallelism, we added the concept of process sets, which are groupings of GPUs: instead of every GPU communicating with everybody, you can communicate within a subset. Say you're doing tensor parallelism only inside a server node: you can make your group the node, and you'll only communicate with your neighbors inside that node. For pipeline parallelism you can do point-to-point communication, and there is a way to do it with Horovod's all-to-all operation using a group size of two.

I just wanted to show you how easy it is to use. You import Horovod, of course, and you can use mpi4py to create a sub-communicator to tell Horovod that, for tensor parallelism for example, you only want to communicate inside your box. Then, when you call the allreduce function, you pass the process_set argument with the process set you created from the MPI sub-communicator, and you'll only communicate inside your box. I also wanted to include the case where you switch from model parallelism to data parallelism in DLRM: you just call all-to-all on your local embedding tables.
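Here is a minimal sketch of that pattern using Horovod's process-set API. It assumes four processes and builds the process sets from explicit rank lists rather than an mpi4py sub-communicator, just to keep it self-contained; the grouping itself (two GPUs per node) is only an example.

```python
# Sketch: restrict an allreduce to a subset of ranks (e.g. a tensor-parallel group
# inside one node) using process sets. Assumes this is launched with 4 processes.
import torch
import horovod.torch as hvd

# Two process sets, e.g. the GPUs inside each of two server nodes.
node0 = hvd.ProcessSet([0, 1])
node1 = hvd.ProcessSet([2, 3])
hvd.init(process_sets=[node0, node1])
torch.cuda.set_device(hvd.local_rank())

my_set = node0 if hvd.rank() in (0, 1) else node1

# This allreduce averages only across the ranks in my_set, not the whole world.
local_tensor = torch.ones(4).cuda() * hvd.rank()
group_avg = hvd.allreduce(local_tensor, process_set=my_set)
print(f"rank {hvd.rank()}: group average = {group_avg.tolist()}")
```

For the DLRM-style switch from model to data parallelism described above, the analogous call is `hvd.alltoall(...)` on each worker's local embedding output.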
And now I'll let Travis talk to you about the cloud use case. Thanks, Nicholas. So a couple of the things we just talked about are definitely on the bleeding edge, things that a lot of very wealthy tech companies have been pioneering. But another use case a lot of people come with is that they want to do distributed training because they have what you might call data sets of reasonable scale, as my friend Jacopo might say these days. You still need distributed training, but you're not quite at the scale of really, really big models. That's where the idea of running in the cloud comes in: you want to reduce costs and make things as reasonable as possible, and that's where something like Elastic Horovod comes in.

The basic idea behind Elastic Horovod is that you want to be able to accommodate failover when a node goes down, for example when you use a spot instance and that spot instance is reclaimed by the system, and also to take advantage of a new instance becoming available. Say you're running in a multi-tenant environment and want to optimize for high throughput: when one workload finishes, its GPU goes back into the pool, and you want to be able to use it for your job temporarily while it's available to you. The purpose of Elastic Horovod is to handle this handover very elegantly: a node goes down, we remove it from the job; a node comes up, we add it to the job; and all of this happens seamlessly, without interruption.

The way we achieve this is essentially by having a separate driver concept that sits on top, kind of like the Spark driver if you're familiar with Spark, or the Ray head if you're familiar with Ray. The driver coordinates the work of all the other nodes and has a global awareness of who is and isn't responding. When a failure occurs, it gets detected, and the workers that have not entered a failed state communicate back to the driver node and say, hey, I'm ready to continue, who's still available to do work? The driver is then responsible for adding any nodes that have joined the system and made themselves available, removing nodes that are gone, and recreating the topology of how the machines communicate.

We also support what we call graceful worker removal. This is for the case where your resource manager can tell you it is going to reclaim a node before it actually does so, which has the very nice property that if you know in advance a node will be removed, you can remove it at the moment that causes the least disruption to your process. A good example: if you're in the middle of a backward pass and updating the weights, that's not an ideal time to lose a worker, because you have to revert back to the previous batch. That's not the end of the world, but it does cost some time to reset the state, resume, and deserialize things you may have cached temporarily. Whereas if you can remove a worker at just the right time, at the end of a batch, there's no intermediate state to track and it's much more amenable to high throughput. That's exactly what we do here, facilitated by a notification system that propagates worker-removal notifications to the worker nodes so they can handle them at the time that's most beneficial to them.

A lot of people have questions about how well this performs. Certainly, if you care a lot about determinism, elasticity is maybe not the best property. But if you're just looking for good model results, I think we've demonstrated in a series of blog posts that you can get very good results with elasticity, in some cases even better results, due to a smoothing out of the variance that can come from having very consistent batches from one epoch to the next. The last thing I want to say on elastic is that it's also very nicely integrated into the Kubernetes stack as part of Kubeflow and the MPI Operator. This was some great work done by folks at Tencent, so if you're interested in running on Kubernetes, it's very easy to get started with Elastic Horovod.
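For a sense of what this looks like in user code, here is a minimal sketch of an elastic training loop with the PyTorch API; the model, optimizer, data, and epoch/batch counts are placeholders.

```python
# Minimal sketch of an Elastic Horovod training loop (PyTorch API). On a worker
# change, Horovod rolls back to the last committed state and re-enters train().
import torch
import horovod.torch as hvd

hvd.init()
torch.cuda.set_device(hvd.local_rank())

model = torch.nn.Linear(10, 1).cuda()                        # placeholder model
optimizer = hvd.DistributedOptimizer(
    torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size()),
    named_parameters=model.named_parameters())

@hvd.elastic.run
def train(state):
    for state.epoch in range(state.epoch, 10):
        for state.batch in range(state.batch, 100):
            optimizer.zero_grad()
            loss = model(torch.randn(32, 10).cuda()).mean()  # placeholder step
            loss.backward()
            optimizer.step()
            if state.batch % 10 == 0:
                state.commit()   # in-memory checkpoint to roll back to on failure
        state.batch = 0

state = hvd.elastic.TorchState(model, optimizer, epoch=0, batch=0)
train(state)
```

You would launch it with something along the lines of `horovodrun -np 2 --min-np 1 --max-np 4 --host-discovery-script ./discover_hosts.sh python train.py`, where the discovery script (a placeholder name here) reports which hosts are currently available.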
The last use case we'd like to talk about today is the MLOps use case: how you would run a system like Horovod on top of your existing data infrastructure, whether that's Spark or Ray or other systems. To motivate this, let's talk a little bit about how you run ML on your laptop. This is the setup for something I'll come back to a few slides later, another project I work on called Ludwig. The idea is that when you're doing end-to-end machine learning, you don't just want to do training; you also want to process the data in some way, to do feature transformation or sanitize the data or what have you, and you also need to do some evaluation of the model before finally putting it in a registry or model store. All of these steps need to be thought about at scale, not just the training part in isolation.

Contrast this with how you do an ML pipeline in production. This slide comes from the Michelangelo platform that I used to work on at Uber, and the basic idea in production is that you take all these same components but run them at very large scale: you might have a feature store in place of the simple processing step, you might do batch training, and your model store then feeds a real-time prediction service or maybe a data lake for batch prediction.

This is where something like Horovod on Spark comes in. It was one of our earliest MLOps integrations, and the way it basically works is that it runs the Horovod software on top of the existing Spark executors, but outside of the Spark graph execution itself. The way we achieve this is that we use an intermediate layer, some kind of object storage like HDFS or S3, where we write out the data that comes from the feature or data processing in Parquet format; that data is then consumed by the training workers, and evaluation can use the normal RDD representation that Spark provides. Here's a slightly zoomed-in view of what that process looks like: we have a framework that was written at Uber called Petastorm, which is commonly used to do the actual data loading from the Parquet files into the workers, and each worker maintains its own local shuffle buffer that handles the batching and feeding of data into training.
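As a minimal sketch of what launching that looks like from the user's side, here is roughly how a training function can be run on Spark executors with `horovod.spark.run`; the training function body, model, and Spark session setup are placeholders.

```python
# Sketch: run a Horovod training function on existing Spark executors.
# horovod.spark.run schedules one Horovod worker per Spark task and returns
# each worker's return value.
from pyspark.sql import SparkSession
import horovod.spark

def train_fn(lr):
    import torch
    import horovod.torch as hvd
    hvd.init()
    model = torch.nn.Linear(10, 1)                 # placeholder model
    optimizer = hvd.DistributedOptimizer(
        torch.optim.SGD(model.parameters(), lr=lr * hvd.size()),
        named_parameters=model.named_parameters())
    hvd.broadcast_parameters(model.state_dict(), root_rank=0)
    # ... training loop here, e.g. reading the preprocessed Parquet data via Petastorm ...
    return hvd.rank()

spark = SparkSession.builder.master("local[4]").getOrCreate()
results = horovod.spark.run(train_fn, args=(0.01,), num_proc=4)
print(results)
```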
Let's talk a little bit about a real-world example of this. Ludwig, like I said, is another framework I work on, also part of the Linux Foundation: a low-code deep learning framework that supports multi-task learning. The high-level idea behind what we wanted to do with Ludwig was exactly this: scale it up to support both the feature processing and the training parts simultaneously. But when we looked at how to actually operationalize that, particularly for an open-source project where people run it in all sorts of different environments, the challenge in front of us was that we were worried we would have to build a system like this, a pattern we built a lot at Uber while I was there: a separate component that handles data processing, a separate component that handles training, a separate component that handles evaluation, often all different systems (usually Spark for data processing, maybe Horovod for training), and then you have to orchestrate it all together with some kind of workflow orchestration system like Apache Airflow. We didn't want to force that on users of the open source, and we definitely wanted to make our own lives operationally easier as well. That's where we came up with the Ray integration for Horovod, which is now a standard part of the Ludwig framework.

The beautiful thing about Ray, in my opinion, is that it allows you to combine all of these disparate, heterogeneous steps into a unified layer. The pre-processing, training, and evaluation all happen within a single Ray cluster, a single abstraction layer, and the handoff between the different steps is likewise handled transparently for you by Ray itself. If you haven't checked out Ray, I highly recommend it; it's a really powerful tool for getting up and running with something like Horovod as one component in a larger pipeline. This also extends beyond training to hyperparameter optimization: another feature we have in Ludwig is that you can run the same distributed process for pre-processing, training, evaluation, and so on, and then fan it out into multiple trials that are all coordinated by a hyperparameter search scheduler. That kind of modularity is one of the things that makes Horovod on Ray particularly interesting.

And that's it, thank you very much for coming today. Nicholas, would you like to say any final words here? Oh, sure. So yeah, we saw that there are multiple use cases for deep learning, and we also saw that for some of them, like very large models, specialized frameworks came into being, like Megatron-LM and DeepSpeed. Horovod has tried to stay generic: that's why you can do data parallelism, you can do model parallelism, and it can support a rich DL ecosystem. And so we're happy to welcome you into the community if you want to help us keep improving the project. I think we have a link to the Slack on the next slide. So if you want to join our Slack, there's a short poll at the beginning that helps us see what your interests are, which environments you're doing DL in, and which platforms you're running on; that's where the stats we showed earlier come from.

Do we have any questions? I know we packed a lot into a relatively short amount of time, so I'm not surprised if there aren't too many immediate questions, but we're definitely happy to take questions later, and please feel free to reach out to us on GitHub or LinkedIn or Twitter. [Audience question about network bandwidth requirements.] Yeah, that's a good question.
So Nicholas can probably give a little bit more flavor, but the rule of thumb we've typically gone with in the past is that if you have 100 gigabit networking, that's usually considered very good; you're in a good spot. You can go lower in some cases if you want to use things like FP16 compression, or potentially if you have weaker GPUs, maybe older SKUs like T4s or other lower-powered ones, potentially P100s, things like that. But for V100s at full 32-bit precision, 100 gigabit is usually what we recommend. Yeah, I would say the same thing, but nobody prevents you from trying it on a smaller network, like a 10 gigabit interconnect, if you want; it will just perform better on a 100 gigabit interconnect, or with RDMA or InfiniBand. And definitely, if you have a small worker count, just a couple of nodes or something like that, you can generally get away with weaker networks. Another thing people often do in really bandwidth-constrained environments, like 10 gigabit, is use what's called local gradient aggregation: instead of exchanging gradients on every step, you save the gradients, do more forward and backward passes, and aggregate them locally inside a single worker, and then do the aggregation across workers after, say, 10 steps. Right, so you increase your effective global batch size before doing the optimizer step. Awesome, good question. Thank you for coming. Well, thank you very much.
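As a quick sketch of the two options mentioned in that answer, both exposed as arguments to `DistributedOptimizer` in the PyTorch API (the values here are illustrative):

```python
# Sketch: trade bandwidth for compute with FP16-compressed allreduce and local
# gradient aggregation (communicate only every N backward passes).
import torch
import horovod.torch as hvd

hvd.init()
torch.cuda.set_device(hvd.local_rank())

model = torch.nn.Linear(10, 1).cuda()              # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

optimizer = hvd.DistributedOptimizer(
    optimizer,
    named_parameters=model.named_parameters(),
    compression=hvd.Compression.fp16,   # send gradients over the wire in FP16
    backward_passes_per_step=10)        # aggregate locally for 10 steps, then allreduce
```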