All right, so next up we have Frank talking about model selection for deep learning neural networks. He's going to tell us more about it.

Thank you. Thank you very much. Can you hear me okay? Yeah? Great. Well, it's really a pleasure to be here at FOSDEM again to talk about deep neural networks. The software in this project was written by my colleagues at Pivotal, now VMware, among other folks, including Nikhil, Ekta, Orhan, and Domino. We also worked with the University of California at San Diego, Professor Kumar, and his PhD student, Yuhao. So thanks to those folks up front. All of the software in this project is free and open source; there's no commercial or closed-source software here at all, in the spirit of FOSDEM.

This is the agenda, what I'd like to talk about today. First, how one could train deep neural nets in parallel. Second, an introduction to something called model hopper parallelism, which is based on the work that UC San Diego did in this area. Then an implementation of model hopper on a massively parallel processing database. And finally a couple of examples, one with grid search and one with automated machine learning using a method called Hyperband.

So let's get started. We've all heard about deep learning and deep nets in the popular press and the specialty press. It's a kind of machine learning originally inspired by the biology of the brain, although these days it has less to do with neuroscience. It is characterized by artificial neural networks, which are networks that have multiple layers: an input layer, for example an image; an output layer, which could be a classification; and in between, potentially many hidden layers with different components. An example is the convolutional neural network, a kind of network often used in image processing applications.

There are many problems, but one of them is that training neural nets is very, very hard. The accuracy of a model is a function of the architecture that you choose, like the kind and number of hidden layers and what their components are. It depends on what the hyperparameters for those models are. It depends on the data set. So it's kind of the hidden secret in data science that you have to do a lot of trial and error in order to come up with a model that is effective.

Just as an example, let's say that we have four model architectures we're interested in, four different batch sizes, four different learning rates, and four different methods of regularization or smoothing. Already we're up to 4 x 4 x 4 x 4 = 256 different combinations. So if you're training all of these models one at a time, it's going to be extremely expensive and take a long time. There's definitely a need for speed here, which means working in parallel, and working in parallel means we're talking about distributed compute systems.
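(To make that combinatorics concrete, here is a toy enumeration of such a grid. The specific architectures, batch sizes, learning rates, and regularizers below are made up for illustration; only the four-by-four-by-four-by-four shape comes from the talk.)

```python
from itertools import product

# Illustrative values only; the talk just says there are four of each.
architectures  = ["cnn_small", "cnn_deep", "resnet_like", "mlp"]
batch_sizes    = [32, 64, 128, 256]
learning_rates = [0.1, 0.01, 0.001, 0.0001]
regularizers   = ["none", "l1", "l2", "dropout"]

grid = list(product(architectures, batch_sizes, learning_rates, regularizers))
print(len(grid))  # 4 * 4 * 4 * 4 = 256 model configurations to train
```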
So what are the different ways in which we could train neural nets in parallel? In order to understand the various options, let's briefly remind ourselves what gradient descent is. It, or rather its variations, is the most common optimizer used in deep learning. We want to minimize some loss function, so we have a cost function on the y-axis and a variable on the x-axis, which is perhaps a weight for one of the hidden layers. We have an initial value, which is that dot on the right-hand side. We calculate the slope at that point, the gradient, and then we move in the negative direction of that gradient in order to find the bottom of the curve. The size of the steps we take is a function of what's called the learning rate. Hopefully we get to the bottom of this curve, our model has converged, and it gives us a reasonable answer.

The thing is, we often don't operate on single weights at a time. We operate on tons of weights, and we do things in what are called batches, or mini-batches. So here's a brief description of mini-batch stochastic gradient descent, which is much faster than working on one record at a time. In our data set we have two features, x1 and x2, and a y-value, which is our label, so we're talking about a supervised learning problem. And we have a model that we've initialized; think of the dot on the curve on the previous slide. We pass that model over a mini-batch subset of the data, and we get a new model. How do we get that model? We calculate the gradient, the slope, on that mini-batch, scale it by eta, our learning rate, and subtract it from the current model: w_new = w_old - eta * gradient. It's a subtraction because we want to move in the negative direction, down the curve. So we get an updated model. Then we take that model, pass it over the next mini-batch, and get a new model. We pass that over the next mini-batch and get a new model. And we continue until we've gone through all of the data, which is called one iteration, or one epoch as some people call it. We do this in a sequential manner. You can visit the data in any order, but you need to sequentially visit all of the data, because that's very important for convergence.
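(As a minimal sketch of that update loop, here is mini-batch SGD on a toy linear model with squared loss in NumPy. The model, loss, and data layout are assumptions for illustration; only the update rule and the full pass over all mini-batches mirror the description above.)

```python
import numpy as np

def minibatch_sgd(X, y, eta=0.01, batch_size=32, epochs=5):
    """Toy mini-batch SGD: linear model, squared loss."""
    w = np.zeros(X.shape[1])               # initialized model (the dot on the curve)
    n = len(X)
    for _ in range(epochs):
        order = np.random.permutation(n)   # visit order can vary between epochs...
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]  # ...but every mini-batch is visited
            Xb, yb = X[idx], y[idx]
            grad = 2 * Xb.T @ (Xb @ w - yb) / len(idx)  # slope on this mini-batch
            w = w - eta * grad             # step in the negative gradient direction
        # one full pass over the data = one iteration (one epoch)
    return w
```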
But notice that we did all of this on a single machine. So how would we do it on multiple machines? How would we run this in parallel? There are different ways to do this: task parallelism and data parallelism. Let's look at those briefly, one at a time.

Task parallelism. Now we want to train multiple models at a time, because we've got a big cluster to work on. We have a bunch of models we're interested in, say those 256 we looked at. We replicate our data on each machine, so you copy your data onto each machine. Once you've done that, we move one model to each machine, and then we do the steps we showed on the previous slide: we train it for multiple iterations and we get a model. And that's fine. The problem is that it's very inefficient in terms of memory and storage, because deep learning data sets are extremely large. If you're copying them around, it takes a long time. They may not fit in memory on a single machine, or they may not even fit on disk. You could put the data on a common file system as well, but then you end up with issues around network bandwidth, moving data from your file system to your workers. So we'd like to look at other options besides task parallelism.

So let's talk about data parallelism. Data parallelism is a bit different: now we're going to train one model at a time. We've got our 256 models that we want to train, but instead of replicating our data, we partition it. Some people call those partitions shards. You put some of your data on one machine, some of your data on another machine, et cetera, until you've sharded your data. Then you need to figure out how you can train a model when your data is split up like that, and there are different methods to do that; we heard about one earlier this morning with Spark.

So we have a model. We're going to train one at a time, so we put the rest of them in a queue. We broadcast that model to each of our worker machines. Then we train that model on the data that is on each machine. You could train it on all of the data on that machine, the whole partition, or you could train it on just a batch of it, and you get different convergence characteristics depending on how you do that. Once you have that model trained, you send the updates back to some kind of master node, which updates a global model. Then you've done one iteration, and you rebroadcast and repeat. Once you've trained that model, the next one in the queue comes in and you train that one. That's data parallelism.

If you update your model just once after you've gone through all of the data in each partition, you get what's called bulk synchronous parallelism, or model averaging. This can be very fast; however, it often suffers from poor convergence. The other thing you could do is update once per mini-batch: instead of going through all of your data, you go through a mini-batch and then send the updates to the master. You could do that in a synchronous way, or in an asynchronous way where whenever a worker is done it sends its results up, or in a kind of decentralized way. If you took the orange master out of the picture here, you'd end up with a kind of MPI all-reduce approach. These all have different pros and cons. Some of them work well, but they do suffer from a high communication cost.

So what could we do to get the best of both worlds, the goodness from task parallelism and the goodness from data parallelism? The approach we've taken is something called model hopper parallelism, based on work done at UC San Diego, which we've implemented in a distributed database. So what is model hopper parallelism? You start out the same way as in the data-parallel approach: you don't copy your data to each of your worker machines, you partition it, or shard it. So you've got data on these different machines. Then you send an initial model to each of those machines. We have 256 models, so we do this in a round-robin way. You train each model on that machine's entire partition, but then, instead of going back to the master to do an update, you move the model state over to another worker. So we've moved the model state, the blue one, from the left side of the screen to the center of the screen. The data is locked in place, because it's too expensive to move; you're just moving the model state, the model weights. Then you do another round of training: you train the model on the data in the new partition. And then you do the same thing again, so the model that was in the middle, the green one, has now moved to the far right side of the screen, and you train there. Once you've done that over all of the workers, you've actually completed one full pass over your data.
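(Here is a toy sketch of that hopping schedule, assuming one model per worker and a hypothetical `train_subepoch` function that trains a model on one shard. It is only meant to show the round-robin pattern; the real system schedules many more models than workers and runs the workers in parallel.)

```python
def model_hopper(models, shards, epochs, train_subepoch):
    """Toy model hopper schedule: shards never move, model state hops.

    models:         list of model states, assumed one per worker here
    shards:         list of data partitions, one per worker, locked in place
    train_subepoch: hypothetical function (model, shard) -> updated model
    """
    k = len(shards)
    for _ in range(epochs):
        for hop in range(k):
            # In hop t, model i sits on worker (i + t) % k. Each worker
            # trains a different model, so all workers stay busy, and only
            # the (small) model state moves between hops, never the data.
            models = [train_subepoch(models[i], shards[(i + hop) % k])
                      for i in range(k)]
        # After k hops, each model has visited every shard exactly once:
        # one full pass over the data, logically like sequential SGD.
    return models
```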
So what we're doing is locking the data in place, no data motion, but moving the model state, which is much smaller. How does that work out? Well, it performs pretty well. This is from the paper. The plot shows, like before, a loss curve: loss on the y-axis and epochs, the number of passes over the data, on the x-axis. Model hopper is one of the bottom curves, so it has very good convergence behavior. And if you look at runtime and GPU utilization compared to other methods, it's pretty good: it has high GPU utilization, and as a result it has very good runtime.

So we want to take this and implement it on a massively parallel processing database. There's an open source project out there called Greenplum Database. There are a bunch of other massively parallel processing databases out there, but most of them are commercial; this is, to my knowledge, the only open source one. This is what the architecture looks like. As expected, you have a master node and a bunch of workers; you can think of these as many, many Postgres databases running in parallel on these host machines. On the host machines, you've installed Keras and TensorFlow, which are the libraries we currently support. We also installed Apache MADlib there, which is an in-database machine learning library. One thing I want you to note is that, for cost control reasons, if you have a really big database you might not have GPUs on all of those machines; you may only have GPUs on certain hosts. So we need to be careful to distribute data only to the hosts that have GPUs, and when we do our model hopping, moving model state from machine to machine as I showed before, we need to make sure we only move it to machines that have GPUs.

This is what the API looks like. I won't go through the details, but there's an API for defining the models that you care about. So for those 256 models that we want to run, you define what they are, and those can be architectures or hyperparameters. And then you want to train those models, to change them from purple to orange on this slide, and there's a function that is used to do that. You specify the number of iterations and whether you want to use GPUs or not. So you create the models you want to run, and then you train them.

I'd like to go through a couple of examples of the results of this. We're going to use CIFAR-10, a well-known image dataset, and we're running on Google Cloud Platform. The approach we're going to take to training is to initially cast the net wide, and then, once we've cast the net wide, look at certain areas that are of interest and zoom in on those. Here are the model configurations that we want to run: three model architectures, a bunch of optimizers with different parameters, and four batch sizes to try. If we do that, we end up with 96 configurations. So we train those on this parallel database, and we get 96 different curves, with accuracy on the left side and loss on the right side.

We might say, why not just use the most accurate one? This is the most accurate curve, at about 82%. But there's a danger in doing that, because it may be overfitting the data. The top curve here on the left-hand side is actually on the training data, and the bottom curve is on the validation data, and you can see there's a big gap there. That's dangerous, because it means your model is not generalizing well. So what you can do for your next iteration is to plot those differences and then pick the ones that have the smallest difference. This is one technique.
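(A minimal sketch of that selection step, with invented accuracy numbers: rank configurations by the gap between training and validation accuracy and keep the ones that overfit least.)

```python
# Hypothetical per-configuration results; the numbers are made up.
results = {
    "config_01": {"train_acc": 0.97, "val_acc": 0.82},  # accurate but big gap
    "config_02": {"train_acc": 0.84, "val_acc": 0.80},  # less accurate, small gap
    "config_03": {"train_acc": 0.99, "val_acc": 0.74},  # badly overfit
}

def overfit_gap(r):
    return r["train_acc"] - r["val_acc"]

# Smallest train/validation gap first: these generalize best,
# even though they may not top the raw accuracy ranking.
survivors = sorted(results, key=lambda name: overfit_gap(results[name]))[:2]
print(survivors)  # ['config_02', 'config_01']
```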
On the left-hand side we have the deltas, and along the bottom are all of the model configurations that we ran; then we're going to pick the good ones. Here's an example of one that overfits less. So we refine the model configurations that we want to run based on the good ones we saw, and it's a smaller number here: we're going to run just 12 of them. We run those 12, and we can pick a good one from there, because these are ones that overfit less. Here's a sample good result: accuracy is on the order of 80%, with a modest amount of overfitting. We can then do inference on it, which is prediction. What does this look like? A dog. Yeah, and the model says it's a dog too. Okay.

So I'm going to finish up quickly by talking about automating that. You can add automated machine learning on top of this. One example that we've done here is with Hyperband, and Hyperband uses the idea of what's called successive halving. It automates some of the steps we did manually before, so we don't have to do as much analysis ourselves. You start with a big set of models and train them for a while. You get a subset of models from there and train again. You have survivors there that you continue with, and so on.

The way Hyperband works is that you specify a couple of parameters, including the maximum resources you want to spend and how aggressively you want to halve. Once you do that, it creates a schedule for you. You don't have to worry too much about how the schedule is created; there's a whole theory behind it. But if you run according to this schedule, for example, you start by picking 81 different models and training them for one iteration. Then you pick the best 27 and train them for three more. And you proceed through these brackets. This is a way of automating the model selection approach.
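(Here is a toy sketch of successive halving, the idea underneath Hyperband, for the 81-model bracket just described: train everything briefly, keep the best third, give the survivors a bigger budget, and repeat. `train_and_score` is a hypothetical stand-in for training a configuration for a given number of iterations and returning a validation score.)

```python
def successive_halving(configs, train_and_score, eta=3, min_iters=1):
    """Toy successive halving: start wide, keep the best 1/eta each round."""
    iters = min_iters
    while len(configs) > 1:
        # Train every surviving configuration for the current budget
        # and rank by validation score, best first.
        ranked = sorted(configs, key=lambda c: train_and_score(c, iters),
                        reverse=True)
        configs = ranked[:max(1, len(configs) // eta)]  # keep the best third
        iters *= eta                                    # survivors get more budget
    return configs[0]

# With 81 starting configurations and eta=3, the rounds are:
# 81 models x 1 iteration -> 27 x 3 -> 9 x 9 -> 3 x 27 -> 1 winner.
```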
So we do that here. Again, we start with a bunch of model configurations. Now we're not doing grid search; we actually set ranges for things, because we need to generate different combinations. Then we run it according to the schedule, and the same thing happens, right? You get a lot of information. You could stop here and say, I'm just going to go through one pass of my automated machine learning. Or you could do another round of the same thing, which is what I did here. It's the same steps we did before: we kept the ones that looked very good, we threw out the ones that weren't, and then we do another round. You might run that round for much longer, for example, because you now have your better models. Once you do that, you re-run and you get better models, and here you can see they're on the order of 80% or so. Here's an example of a model from this round with close to 80% accuracy but very little overfitting.

So to wrap up: model hopper is an interesting approach for doing distributed training of deep nets. You can implement it in a massively parallel processing database like Greenplum, and it works pretty well; it scales very well. And you can add automated machine learning methods on top of that, so that you have to do less work yourself in terms of selecting good models.

For future work, a short list of things we're working on: we want to improve GPU efficiency, because working in the database we're not quite as efficient as the results shown in the paper; we want to add more automated machine learning methods; and we want to add PyTorch support in addition to TensorFlow. There are some references here, including some Jupyter notebooks to get you started. Thank you for your attention.

Thank you, Frank. We have three minutes for questions. Please stay seated. Any questions?

Hi, thanks for the nice talk. With the partitioning and mini-batches, you said that it's very important that the data is processed sequentially. But model hopper distributes the partitions across the different worker nodes, and only one of the models is going to start from the beginning and follow the sequential read. Is that an issue, or is the partitioning in model hopper different from the mini-batch case?

So the question was: is there an issue with visiting the data in the right order, or visiting it incompletely, with model hopper? I think that was the question. And the answer is that it's logically the same as sequentially going through all of the data. When you hop from machine to machine, you're visiting a new set of data on that machine, so when you finish one iteration, each model has seen all of the data. A characteristic of gradient descent is that it's robust with respect to visit order, so you don't need to visit the data in the same order each time; in fact, the data is randomized up front before it's distributed. But you do need to visit all of the data, because that's very important for convergence. Thank you. You're welcome.

One more quick question maybe? But we're almost running out of time. All right, then let's wrap it up. Thank you, Frank.