Hi everyone. Thanks for coming out to my talk today. My name is Travis Addair. I'm an engineer working at Uber on our AI platform. Today I'm going to talk about Horovod and its role in our Michelangelo machine learning platform ecosystem, and how we use it to build end-to-end deep learning pipelines on top of Spark that are scalable both at training time and at inference time.

So to kick things off, I want to spend a little bit of time on the background of the Horovod project: how it came to be and what specific problem we're trying to solve with this technology. The early days of Horovod at Uber were centered around solving the problems of scaling machine learning training for the ATG group, which is the group that works on autonomous vehicle technology. The problems they were facing were mostly around their models taking a very long time to train on a single GPU. They had these big detection models, and they had many GPUs available in their cluster, but no way to tie them all together. Training a single model on a single GPU was taking on the order of days or weeks, and they wanted to bring that down to hours or minutes. Additionally, they had some requirements: whatever solution we came up with had to support both TensorFlow and PyTorch, which were both in use at the time, and it had to make it easy for ML engineers to add distributed training to their neural networks without needing to become experts in the infrastructure of how to scale things up.

So here's a quick background on how deep learning training works. In this very simplified diagram, we have some training data coming in. We do some preprocessing, some feature engineering, and then the data goes through several layers to output a prediction. After the prediction is made, we compare it with the true label and get a gradient that tells us essentially how off we were, and that gradient is used to adjust the model. Now, if you were to think about how to parallelize this in a very straightforward way using something like MapReduce or Spark, there are some problems, namely that there are very limited opportunities for parallelism, because these layers are typically applied sequentially. This ends up being a challenge when you want to think about how to scale up neural network training.

In practice, there are a few ways that people have approached distributed training in the past: model parallelism and data parallelism. In model parallelism, the common approach is to take your layers and divide them across different GPU devices in your training. In data parallelism, you instead take separate replicas of the model itself, use a separate portion of the data to train each replica independently, and then do a global update on all the models at once using the aggregated gradients from all of them.

In practice, data parallel training tends to be the preferred approach that most people use. One of the earliest examples of it was an approach known as the parameter server. We essentially have a single shared piece of state that all the separate replicas send updates to, and at the end of each step, the parameter server sends the combined updates back to each of the models, which use them to update their parameters. Now, parameter servers have some advantages. They are fault tolerant by nature.
They can be used to run asynchronous SGD, which allows different workers to update the model at different frequencies. But there are some downsides as well. One of the biggest has historically been usability: some of the earlier frameworks tightly coupled the model and the parameter server, so you had to write a lot of scaffolding code to get your model distributed this way. Scalability is the other big challenge: you have a many-to-one communication bottleneck, so suddenly your parameter server is adding an additional layer of hops in the network that slows everything down. And while async SGD may seem like an advantage at first glance, there are actually some problems if you choose to use it. Convergence is a big one: a stale update will come in that is no longer very useful, and as a result it actually does more harm than good to the model. So in practice, most people use synchronous stochastic gradient descent, not asynchronous.

So Horovod was brought in as an alternative to the parameter server approach. It was based on a research paper that was published by Baidu back in 2017, I believe. Its approach to the problem was to have a framework-agnostic layer that can be used with TensorFlow, Keras, PyTorch, or MXNet, and that uses various high-performance APIs and features: NVIDIA's NCCL library for fast communication between the different workers; GPUDirect, in order to speed up communication between GPUs; RDMA; as well as a special algorithm that we call Tensor Fusion that we use to reduce the number of independent network calls that we make in the system. And it's very easy to use: just add a few lines of Python to your existing training script and you're good to go. It's been open sourced as part of the Linux Foundation and can be installed very easily with just pip install horovod.

In contrast to the parameter server-based approach that I showed you before, Horovod's technique is based on an MPI concept called allreduce. In allreduce, the gradients are sent around in some sort of decentralized topology, usually a ring, where each of the workers forwards its gradients to the other workers. At the end, each replica ends up with the fully aggregated gradients and can update its own copy of the model parameters without having to go through a single shared piece of state. In practice, this scales very well. We've observed over 90% scaling efficiency on common perception networks, as well as many common natural language processing networks, up to hundreds of GPUs. In fact, Horovod has been used to win the Gordon Bell Prize for the first deep learning experiment to ever reach one exaflop of performance. That work was done by Oak Ridge National Lab with something like 27,000 GPUs on the world's largest supercomputer.

So now that I've hopefully convinced you that Horovod performs well in isolation, let's talk a little bit about the unique challenges of doing deep learning in production. One of the recent trends that we've observed at Uber is that deep learning is now achieving state-of-the-art performance not just on image classification and natural language processing tasks, but also on tabular data problems, where existing tree-based models now need to be migrated to deep learning.
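Before going on: earlier I said Horovod only needs a few lines of Python added to your training script. Here's a minimal sketch of what that looks like with the Keras frontend; build_model and train_dataset are placeholders standing in for whatever model and data pipeline you already have.

```python
import tensorflow as tf
import horovod.tensorflow.keras as hvd

# Initialize Horovod and pin each worker process to its own GPU.
hvd.init()
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

model = build_model()  # placeholder: your existing Keras model

# Scale the learning rate by the number of workers, then wrap the optimizer
# so that gradients are averaged across workers with allreduce each step.
opt = tf.keras.optimizers.Adam(0.001 * hvd.size())
opt = hvd.DistributedOptimizer(opt)
model.compile(loss='mse', optimizer=opt)

model.fit(
    train_dataset,  # placeholder: your existing input pipeline
    epochs=10,
    # Broadcast initial parameters from rank 0 so all replicas start in sync.
    callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)],
    verbose=1 if hvd.rank() == 0 else 0)
```

Every worker runs this same script; Horovod handles the ring allreduce underneath.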
In these sorts of tabular problems, what we find is that there are many features, but each feature has low average quality compared to the features in image classification or natural language processing. As a result, there tends to be a lot of iteration between the feature engineering steps and the model training steps. Going back to this diagram for a second, what this means is that the preprocessing step here is where we do our feature engineering. We would typically do this in something like Spark, and then we need to do our deep learning in the green boxes to the right using a different framework, like TensorFlow or PyTorch plus Horovod. So we need a way to move seamlessly back and forth between these two as we're doing our model development.

Now, in research, that's what the workflow looks like in isolation. In reality, at a company like Uber, the deep learning itself ends up being just one small part of a much larger picture. We have data coming in from a stream, residing in a data lake, that needs to be processed into features stored in a feature store. Those features are used to train models that are pushed to a model store, which are then used to serve online or batch predictions. Those predictions are either written back to a data lake somewhere or sent to a client who's maybe using their phone to get the latest updated ETA for their Uber ride, something like that. So in reality, we need a lot of infrastructure to support this, not just fast training using something like Horovod, and it all needs to fit together seamlessly.

The first part of this diagram that I want to talk about is the feature store itself. In Michelangelo, we have data residing in our data lake, so something like HDFS, and we run ETL jobs to process it into features. We have a feature store that we call Palette, which we've written about before, that indexes all of these features and makes them available with very low latency to the different models that need up-to-date features, like, say, what the seven-day average click-through rate was for a particular user. That way, individual data scientists don't need to constantly recalculate these features; they're just available as a service provided by our platform.

So once we have the features we want to use, how do we actually do the training on Spark? Historically, Spark has not integrated natively with systems like TensorFlow or PyTorch, and indeed there are a number of challenges that come up when you try to fit deep learning into a Spark environment. One of the first relates to the problem of preprocessing. In preprocessing, we have two different types of operations. One is very straightforward, what we call example-dependent preprocessing, like image resizing or color adjustments: you only need to look at a single example to do them. But others end up being dataset-dependent, like string indexing or normalization, so the preprocessing step is a kind of model itself that needs to be fit to the data before the data can be fed into the machine learning model. In Spark terms, what we call this is an estimator that needs to be fit to produce a transformer, and then we can chain these transformers and estimators together into a pipeline.
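Here's a minimal sketch of that estimator/transformer distinction in Spark ML. The column names (city, fare) and the train_df DataFrame are hypothetical.

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler, StandardScaler

# Dataset-dependent steps are estimators: they must first be fit to the data
# (e.g., learning the string-to-index mapping or the feature statistics).
indexer = StringIndexer(inputCol='city', outputCol='city_idx')      # string indexing
assembler = VectorAssembler(inputCols=['city_idx', 'fare'], outputCol='raw')
scaler = StandardScaler(inputCol='raw', outputCol='features')       # normalization

# Chaining them gives a single pipeline. fit() returns a PipelineModel,
# which is a chain of transformers that can be applied to train and test data.
pipeline = Pipeline(stages=[indexer, assembler, scaler])
pipeline_model = pipeline.fit(train_df)
train_features = pipeline_model.transform(train_df)
```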
So what we would ideally want to be able to do is take all of these steps of our Spark transformation and create estimators out of them, and then additionally have an estimator that represents our model, shown on the right. All of this can be wrapped into a single pipeline that we can use to do training, and then later, prediction.

Now, in practice, there are some limitations of Spark, as I alluded to before. One of the big ones is that the RDD (Resilient Distributed Dataset) format that it uses is very good for embarrassingly parallel problems, but not particularly great for deep learning. So at Uber, we have a framework that we've open sourced called Petastorm that attempts to solve this problem by taking your RDD, materializing it as a Parquet file, and then feeding the data into your TensorFlow or PyTorch code, making efficient, judicious use of sharding, streaming, shuffling, buffering, and caching so the data can be fed quickly to the GPU for training. In practice, what this looks like is something like this: your Spark job will have the data residing in an RDD somewhere, and it will write it out to your storage layer, maybe HDFS or S3, as Parquet row groups. Then Petastorm workers running on each of your training workers will read these row groups, break them apart into individual examples, and fill what's called a shuffle buffer with these examples, so the system can sample individual elements from it to form a batch that can be fed to the training process downstream.

Now, in a larger context, what this looks like on Spark 3.0, which now provides something called resource-aware scheduling, is that we can do our data transformation, ETL, and feature engineering on CPU resources, write out the intermediate data to HDFS, then transition over to GPU resources in the same Spark cluster to do our training using Horovod, and finally run our post-processing back on CPU. If your Spark cluster doesn't happen to have GPUs, which is actually the case for us internally at Uber, what we do is run a system we call Horovod Lambda, which we're in the process of open sourcing now, that allows you to run your data processing on CPU with Spark and then use GPUs residing in a remote cluster, like in the cloud somewhere, to do the actual deep learning itself. As an example here, you can see we have this AWS job context that we use to spin up some number of instances, and then we pass this into our Keras estimator here as a backend that we can use to fit our model. This will do the training automatically for us in our remote cluster. And here's a visualization of what this process looks like on premise: you have your Spark cluster, you do your ETL and feature engineering, then you write that data to a storage layer like S3 in the cloud, do your training there, and finally send the result back down to your on-prem Spark cluster to do your evaluation and post-processing.

So that's the training side. What about the prediction side? In practice, the way we serve predictions is that we have a front-end layer that handles the initial request coming in, which is then routed to a particular instance. We have different service groups that handle different models, so that each service group has its own isolated SLAs in terms of how many instance replicas it's going to have, et cetera.
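Let me pause here for a moment. Before going deeper into the serving path, here's a rough sketch of what the Petastorm hand-off I described a moment ago can look like on the training-worker side, using Petastorm's reader together with its TensorFlow integration. The Parquet path, column names, and batch parameters are all made up, and hvd is the Horovod module from the earlier sketch.

```python
import horovod.tensorflow.keras as hvd
from petastorm import make_batch_reader
from petastorm.tf_utils import make_petastorm_dataset

# Each training worker reads its own shard of the Parquet row groups that
# the Spark job materialized, then shuffles and re-batches the examples.
with make_batch_reader('hdfs://namenode/features/train.parquet',  # hypothetical path
                       num_epochs=None,        # loop over the data indefinitely
                       cur_shard=hvd.rank(),   # one shard per Horovod worker
                       shard_count=hvd.size()) as reader:
    dataset = (make_petastorm_dataset(reader)
               .unbatch()                      # row-group batches -> single examples
               .shuffle(10_000)                # the shuffle buffer described above
               .batch(64)
               # map rows into (features, label) pairs;
               # 'features' and 'label' are hypothetical column names
               .map(lambda row: (row.features, row.label)))
    model.fit(dataset, steps_per_epoch=500, epochs=10)
```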
Coming back to the serving path: that online prediction service backend spins up a Java interpreter that runs the Spark transformer steps very quickly. For the model prediction itself, we have a separate containerized environment that we use to execute the trained model's prediction, which uses shared memory plus gRPC over a Unix socket in order to minimize the amount of copying of data that goes on. With shared memory, this ends up being a very low overhead process. This is all facilitated by a framework called Neuropod that was recently open sourced by Uber ATG, which allows you to take a model written in any of a number of different frameworks, including Keras, TensorFlow, PyTorch, et cetera, and wrap it in a single framework-agnostic API for making predictions. Neuropod supports a technique called out-of-process execution. What this allows us to do is have a single server running with an arbitrary set of dependencies that can support inference on various versions of PyTorch and TensorFlow, by just having those versions available somewhere in the system and quickly spinning up a separate process to do the actual prediction, without having to load multiple versions of TensorFlow in the same process, for example, which would of course be problematic.

All right, so stitching this all together, let's talk a little bit about workflow authoring itself. In the framework that we provide at Uber on the Michelangelo platform, what this looks like for our customers is something like this. They first have their ETL step, where they query their initial set of features from the data lake, with something like this Hive query or Spark SQL query. Then we have some set of features that we want to join onto this raw data, so we make a request to Palette, our feature storage layer, to get the actual materialized features for that raw data. The next step is to process these features in some way. For example, maybe we want to convert strings to floats, or index strings so they're given a numerical value. We can do all of this using typical Spark pipelines and a standard train/test split, and then finally apply the transformations to the split data.

Next up comes the model construction itself. All of this is just normal TensorFlow/Keras code. You take every one of your features and create a separate input to your model for it, then create your standard, in this case feed-forward, model, with your output layer corresponding to the label column in your data, and finally create a Keras model from all of that. Then we can take that model, along with an optimizer and a loss function, and create what we call a Horovod Keras estimator from it, giving it a set of parameters: say, what batch size we want, how many epochs we're going to run for, what validation split we want, et cetera. Then we just call fit, and the Horovod Spark estimator will go ahead and do the work of using Petastorm as well as Horovod, the two frameworks I mentioned before, to do the actual training. And it spits out this model transformer, which we can use on our test data set to get our final set of predictions and run evaluation.
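Here's a minimal sketch of that last step using the open-source horovod.spark API. The store path, column names, and hyperparameters are placeholders, and model and opt stand for the Keras model and optimizer built above.

```python
from horovod.spark.common.store import Store
from horovod.spark.keras import KerasEstimator

# The store is where Horovod keeps intermediate Parquet data and checkpoints.
store = Store.create('hdfs://namenode/tmp/experiments')  # hypothetical path

keras_estimator = KerasEstimator(
    num_proc=4,                 # number of Horovod training processes
    store=store,
    model=model,                # the Keras model constructed above
    optimizer=opt,
    loss='mse',
    feature_cols=['features'],
    label_cols=['label'],
    batch_size=128,
    epochs=10,
    validation=0.1)             # fraction of rows held out for validation

# fit() runs distributed training with Petastorm + Horovod under the hood
# and returns a Spark Transformer wrapping the trained model.
keras_model = keras_estimator.fit(train_df)
pred_df = keras_model.transform(test_df)
```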
And then once we're happy with it, we have a final step, internally (this part has not been open sourced yet, but it's coming soon now that Neuropod is open source as well), where we call build_servable_model to create a Neuropod, which we can then stick into our serving pipeline along with our data transformation steps. Then finally, we upload this to our Michelangelo project with a description, et cetera, plus some test data for sanity checking, and do the actual deployment into production.

So that's the state of the world today in terms of how machine learning in production works at Uber. I want to quickly spend a little bit of time talking about what's next for us. One of the big things we're doing in the Horovod community is working on what we call elastic Horovod. This is a fault-tolerant Horovod that is capable of recovering from a host going down, or a new host being added to the system dynamically. In this example you see, for instance, host zero has an error of some kind, but host one is fine, whereas host two over there has just been discovered. Ideally, what we'd like to do is go from this scenario to something like this, where we remove host zero from the system, add host two, and then let training proceed as normal. We already have this committed to Horovod, so you can go try it out today if you'd like.

At a high level, the way this works is that we have a driver process, similar to a Spark driver, that starts up and creates the initial workers. The workers synchronize their state amongst themselves, and training proceeds as normal. Once an error occurs, the workers that are still alive send what we call a reset request back up to the driver, which will then look for new workers and decide whether to go ahead and relaunch, at which point training proceeds as normal until either a similar error occurs or training completes. So I encourage all of you to try this out.
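To give you a sense of what trying it out looks like in user code, here's a minimal sketch using the PyTorch frontend; model, optimizer, loader, loss_fn, and the epoch count are placeholders for your usual training objects.

```python
import horovod.torch as hvd

hvd.init()

# Wrap everything that must survive workers joining or leaving. On a reset,
# Horovod restores this state on the remaining workers before continuing.
state = hvd.elastic.TorchState(model, optimizer, epoch=0, batch=0)

@hvd.elastic.run
def train(state):
    # If a worker fails or a new one is discovered, Horovod triggers a reset
    # and re-enters this function from the last committed state.
    for state.epoch in range(state.epoch, 10):
        for state.batch, (data, target) in enumerate(loader):
            optimizer.zero_grad()
            loss = loss_fn(model(data), target)
            loss.backward()
            optimizer.step()
        state.commit()  # checkpoint progress so a reset can resume from here

train(state)
```

You'd then launch it with something like horovodrun -np 8 --min-np 4 --max-np 12 --host-discovery-script ./discover_hosts.sh python train.py, where the discovery script is your own script that reports the currently available hosts.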
As I mentioned, Horovod, Petastorm, Neuropod, all of these are open source, available on GitHub, and free for you to try out. Please let us know if you have any issues or concerns with anything, or if you'd like to contribute. What I'll do now is quickly answer some questions for those of you who have asked them. Thank you very much.

So the first question: are there any container orchestration features baked into Horovod? That's a good question. We don't have anything that's tightly coupled to a particular container orchestrator today, though a number of tools have been written to run Horovod on top of the different container orchestration frameworks that are out there. For instance, if you want to run on top of Kubernetes, there are some utilities that can do that for you. Internally at Uber, we primarily run on Kubernetes, as well as our internal orchestration layer called Peloton. But we will actually be contributing some more container orchestration-specific capabilities to Horovod in the near future, particularly for the elastic Horovod work. What I should say is that today it's fairly straightforward to run Horovod yourself in a bare-metal way, but running elastic Horovod is a little more complex and requires more integration with your container orchestration framework. So we will be adding support specifically for that, and that's coming soon.

Next question: what's the ETA for elastic Horovod? As I mentioned, there is a preliminary version that's already been committed to master up on GitHub, so you're welcome to go download master and try it out yourself. We have some documentation there on how to use it, and it's pretty straightforward to use elastic Horovod for basic fault tolerance today. I should say that running elastic Horovod on top of Spark is in a pretty good state as well. Running on a particular container orchestration framework, as I alluded to before, is one of the areas we're still working on, so people don't have to write a lot of code themselves, but that should be coming soon, and we actually have ongoing projects internally to do it. My estimate would be sometime later this summer, probably August or so, we'll have our 0.20 release, which will hopefully have full support for elastic Horovod. I encourage you to continue following along with our progress there, and definitely, if you try it out and anything doesn't work right, please let us know on GitHub. I'm always happy to fix any issues that come up.

All right, next question: what is the comparison between Horovod and Databricks MLflow? This is a good question. I would say that in general, Horovod itself solves a problem that's pretty different from what MLflow is solving. MLflow is more analogous to the Michelangelo machine learning platform at Uber as a whole. Some of the key features MLflow provides are experiment tracking and model management, and these are things our Michelangelo platform provides as well, model management being one of the bigger ones. But MLflow is pretty agnostic about how you do your training and how you productionize machine learning, so in that sense, I'd say there's actually very little overlap between the two. I think that's both a good thing and a bad thing about MLflow. MLflow can fit into just about any environment you can imagine: you could take everything I showed you here and add MLflow as an additional layer to track your model versions, your experiments, et cetera. The downside is that it doesn't give you as much expressive power, I guess, as some other ML platforms that are out there. But in general, you can definitely mix and match the two.

All right, I don't see any other questions coming in yet, so unless there are any more, we can stop there. Thanks, everyone, again, for coming out to the presentation. If you'd like to follow up on any of the topics discussed here, you can reach me on GitHub at tgaddair. And of course, I'm always happy to answer any questions about the project or any of the things we've worked on.