Thank you, Roman. Well, good afternoon everyone, and thank you for sticking around late on the second day of the conference. It's my pleasure to talk to you about deep learning in the context of parallel databases. All of the software used in this project is open source, some of which is listed here. So in the spirit of FOSDEM, it's definitely something that you can try at home.

I thought I'd start by giving a very brief introduction to deep learning, perhaps for folks who are less familiar with this topic, although you may have heard a lot about it in the context of AI in the popular press. I favor this Russian-doll description of the AI landscape, with artificial intelligence, machine learning, and deep learning nested within one another. So deep learning is a kind of machine learning. It's inspired by the biology of the brain, and it uses a class of algorithms called artificial neural networks. Artificial neural networks came from this brain biology, from neuroscience, but I don't think neuroscience really drives the innovation of artificial neural networks today. And they've certainly been growing in size; there are networks with hundreds of millions of weights right now. But I read that, as of a year or two ago, the largest artificial neural networks were roughly the size of the nervous system of an insect. So although they're large in computer terms, in terms of biology and other organisms they're not actually that big.

These are some examples of deep learning algorithms. On the left of the slide we have a multi-layer perceptron. This is a classic neural net which is fully connected, and what that means is that every neuron in one layer is connected to every neuron in the next layer. These are used for, I would say, simple to moderately complex problems in areas like machine translation and fraud detection. But there are other, more specialized networks. For example, recurrent neural networks, in the middle here, are used on sequential data, like we heard in the machine translation presentation just a few moments ago.

I'm going to talk a bit more about convolutional neural networks. These are very effective for computer vision. On the right side of this slide is an image from an autonomous vehicle. What this vehicle is able to do, using these convolutional neural networks, is to identify objects in the scene and classify them with a probability: that's a vehicle, with a certain probability. These modern CNNs can identify hundreds of objects of different types in a scene: stop lights, pedestrians, people on bicycles, and so on. And they work really well because they use a lot fewer connections between the layers, by virtue of using convolutions, which are just filters that move across the image. Another interesting thing about these networks is that they have translational invariance, which means it's hard to fool them. If you move an object to a different place in the image, because the convolution filter moves across the whole image, the network can still say that's a cat there or there, that's a bicycle there or there, even though different pixels are activated.
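To make the convolution idea concrete, here is a minimal sketch of a small convolutional network in Keras (the library this project targets, as we'll see later). The layer sizes and counts are illustrative assumptions, not taken from the talk:

```python
# A minimal CNN sketch in Keras. Layer sizes are illustrative only.
from tensorflow.keras import layers, models

model = models.Sequential([
    # Conv2D filters slide across the image, which is what gives CNNs
    # their (approximate) translation invariance and fewer connections.
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation='relu'),
    layers.Dense(10, activation='softmax'),  # e.g. 10 classes, as in CIFAR-10
])
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])
```

The Conv2D layers hold the sliding filters; the fully connected Dense layers at the end do the final classification.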
And here are a couple of examples of classic networks that people use. We've heard about GPUs this afternoon. Neural network training often boils down to doing tons and tons of matrix operations, and graphics processing units are well suited to these algorithms because they do those matrix operations in parallel and very quickly. So this is what I would call standard infrastructure for running deep neural networks: often a single server and one or more GPUs. And this works very well; you can stream data through these machines and solve big problems. But keep this picture in mind as we talk about parallel databases.

There's a particular parallel database that I work on. It's an open source project called Greenplum Database, and it looks like this. A parallel database like Greenplum, and there are many of them out there, is characterized by a shared-nothing architecture. It's SQL based: SQL comes into the master node, and that SQL workload gets distributed to the worker nodes. In Greenplum lingo, the worker nodes are called segments. You can think of these as Postgres databases running on all of those worker nodes, with the workload distributed to them by a pretty advanced query planner; the results are then brought back to the master, and that's the result set.

What we would like to do is add deep learning to that architecture, and this is the approach that we've taken. We want to be able to attach GPUs to these segments, to take advantage of all the great things that GPUs can do, and we also want to run deep learning libraries on the worker nodes. Where we feel we can innovate is not necessarily in changing those libraries themselves, because there's so much rapid innovation happening in those libraries. So we want to run those as-is on the worker nodes, and marshal the data and return the results using Apache MADlib, which is an open source in-database machine learning project. If you compare that somewhat lonely single-server architecture to this kind of distributed system, you get the idea that there's actually the possibility to do more computation in parallel and be more efficient.

A very quick word about Apache MADlib: it's in-database machine learning. The idea is that you don't want to move the data out of the database to another place, operate on it, and move it back. What Apache MADlib does is bring machine learning, statistics, and computation to where the data lives.

So you might ask: what kind of distributed machine learning can you do in a cluster? There are different options, different ways you can go about it. I'm going to talk about the first one primarily in this talk, which is running a single big model across a cluster. What that means is you take a very large data set and distribute it to your worker nodes, so you have different shards of data on those worker nodes. Then you use a single model architecture, you run that in parallel, and you build a single large model. That's the work we've done primarily, and that's what I'm going to talk about today; there's a toy sketch of the idea below.
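Here is a toy, single-process sketch of that "single big model" pattern, assuming a simple weight-averaging merge (one of the merge strategies discussed shortly). The tiny model, the fake data, and the four in-process "workers" are all illustrative stand-ins, not MADlib's actual implementation:

```python
import numpy as np
from tensorflow.keras import layers, models

def make_model():
    # The same small architecture on every worker; only the data shard differs.
    m = models.Sequential([
        layers.Dense(16, activation='relu', input_shape=(8,)),
        layers.Dense(2, activation='softmax'),
    ])
    m.compile(optimizer='sgd', loss='sparse_categorical_crossentropy')
    return m

def merge_average(weight_sets):
    # The "merge function": element-wise average of each layer's
    # weights across all workers.
    return [np.mean(layer, axis=0) for layer in zip(*weight_sets)]

# Fake data, split into shards as if distributed across four segments.
x = np.random.rand(400, 8).astype('float32')
y = np.random.randint(0, 2, size=400)
x_shards, y_shards = np.split(x, 4), np.split(y, 4)

workers = [make_model() for _ in range(4)]
for epoch in range(3):
    weight_sets = []
    for m, xs, ys in zip(workers, x_shards, y_shards):
        m.fit(xs, ys, epochs=1, verbose=0)   # local pass over this shard
        weight_sets.append(m.get_weights())
    merged = merge_average(weight_sets)
    for m in workers:
        m.set_weights(merged)  # rebroadcast the merged weights, then iterate
```

In the real system the segments run their local passes in parallel and the database marshals the weights; the loop here just makes the shard-train-merge-rebroadcast cycle visible.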
I did want to touch on one other area, which is hyperparameter tuning. One interesting thing I've learned about working on neural nets is that it's a bit of a black art. Even experienced practitioners can't say, oh, this model architecture with these parameters is going to work on this data set; there's so much tuning involved. So another option for using a cluster is to replicate your data across your worker nodes: the same data on each worker node, the same model architecture but different parameters, and then run tens or hundreds of parameter combinations and collect the results. That's another aspect of this project that we're working on.

But let's talk about the workflow right now. You take raw images, say JPEG images; for each pixel of those images, in the case of a color image, you have an R, a G, and a B value. You create a numpy array from those, so you have width times height times 3 in this stacked array. What we do is we actually write it to CSV and then read it in a parallel fashion into the database. So there's tooling that we've developed that allows you to take these raw images and stream them into the database (there's a small sketch of this step at the end of this section). Once you have them in the database, that's when the deep learning can happen.

So how does distributed deep learning work? It works by distributing the workload to the segments and then collecting the results back. For example, step one: the model that you've written, say a TensorFlow model, chunks through the mini-batches of the data on that worker node. Once the workers have completed that, they do what's called a merge function. The merge involves collecting the weights and the gradients, the model state effectively, from each of the worker nodes and combining them together. Then, for a particular operation, there may be a final function, and then you iterate: you rebroadcast your weights, you go through another epoch of your data, and you do that either for a fixed number of iterations or until you converge. So it's this idea of iterative model execution for distributed systems.

The question we've been looking at is: what does that merge function look like? How do we combine this data in order to do this distributed computation? It turns out, not surprisingly, that this is an open area of research. There's a paper I've cited here which I think is very useful; it gives a survey of different approaches to distributed learning. We've tested three so far: simple averaging; an ensembling method, which is putting models together; and something called elastic averaging stochastic gradient descent.

So I'd like to share some results from the work that we've done so far. For infrastructure, we tested primarily on Google Cloud Platform. We used high-memory virtual machines with 208 GB of memory each, with P100 GPUs, and we tested database cluster sizes from a single node up to 20 nodes, with one GPU per segment. We used two data sets. The first is a classic one that people in machine learning use, called CIFAR-10: low-resolution color pictures, 32 by 32, and there are 60,000 of them. The second is a more challenging data set, a really great one from MIT, called Places. The version we used has 1.8 million examples; there's another version with 8 or 10 million examples. They're higher resolution, 256 by 256, and they comprise 365 classes of places: cafe, library, park, those kinds of things.
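Going back to the loading workflow for a moment, here is a minimal sketch of the image-to-CSV step, assuming Pillow for decoding. The file name, label, and CSV layout are illustrative; the actual tooling and the exact format MADlib's loader expects may differ:

```python
# Minimal sketch: read a JPEG, turn it into an (h, w, 3) RGB numpy array,
# and flatten it into one CSV row alongside its label.
import csv
import numpy as np
from PIL import Image

def image_to_row(path, label):
    arr = np.asarray(Image.open(path).convert('RGB'))  # shape: (h, w, 3)
    return [label] + arr.flatten().tolist()            # one row of pixel values

with open('images.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(image_to_row('cat.jpg', 'cat'))  # hypothetical file/label
# The CSV can then be loaded into the database in parallel, for example
# with Greenplum's COPY or gpload-style utilities.
```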
So what are the results that we've seen? I'm not sure how readable that is, but I'll call out the colors. What we started with is a simple one-layer convolutional neural net; the architecture is there, which you can look at later if you want. We ran it on a single node and on four, eight, and 16-node clusters. What this is plotting is accuracy on the y-axis and number of iterations on the x-axis. And what it shows is that this is a very well-behaved data set. For one node, it takes fewer iterations to reach an accuracy of 75% or 80%; that's the blue line. The green line is a 20-node cluster. In terms of number of iterations, it takes longer for that to converge, which makes sense, because each worker node is only seeing a portion of all the examples, and there are only 50,000, so there aren't that many.

So this was expected. The thing we're interested in, though, is the runtime, because the reason we want a distributed system is that we want our models to train much faster than on a single node. This plot shows runtime on the y-axis, for a fixed accuracy. The way to read it is: at 75% accuracy, one node takes 4,000 seconds or so (I can't quite read that). If you run it on the cluster, it runs much faster. However, you'll note that if you run it on four or eight or 16 nodes, it doesn't run linearly faster. You get to a point where, yes, distributing that computation helps you, but throwing more GPUs at it doesn't necessarily mean you're going to get faster convergence.

So let's look at a more complicated model. This is now a six-layer neural net, six layers stacked together. And this is the same kind of plot: accuracy on the y-axis and iterations on the x-axis. Looking at this model, it seems it doesn't generalize that well, because you see the accuracy going down after a while. But one interesting thing to me in this plot is that if you run it on just a single server, you see that it doesn't generalize that well, whereas if you run it on the cluster, that doesn't show up as much. If you cast that plot in terms of time, like we did for the other ones, then what you see, for this particular data set and this model architecture at least, is that to achieve 75% accuracy, one node takes the longest at 750 seconds, four nodes takes 600 seconds, eight takes 300 or so. So it does get faster. Here you see more of a case, for this model architecture, that the bigger your cluster is, the faster it runs, up to a point; you don't see a lot of difference between eight and 16 nodes.

Let's look at one or two more examples. This is on the Places data set that I talked about, which is a much larger data set, and we're using a well-known model called VGG, which has 130 million parameters, so this is a pretty big model. As expected, in terms of number of iterations, the single node takes fewer iterations to converge. We notice that on a cluster of 20, we ran it up to 100 iterations and it still hadn't reached the accuracy we wanted. And if you look at the time, then this is really less convincing. What it's showing, if you forget the 20-node cluster for a moment, which is the green one, is the curves for 1, 5, and 10 nodes.
And you don't really see any benefit from the distributed computation there. There are perhaps a number of reasons why this happens. It could be that this data set has so many classes in it that distributing the computation doesn't help you that much.

The last example I want to show is an ensemble. What we did is we took this convolutional neural net and ran it on each of the worker nodes, and instead of averaging the results, we fed them into another CNN, a very, very simple CNN, to see what the effect of ensembling these models is. So this is the result: accuracy on the y-axis, number of iterations on the x-axis. The red line shows the accuracy you get if you just average the worker nodes, and you see it's converging, albeit slowly. If you take the output at a given point and feed it through this ensemble, this simpler CNN at the end, then your accuracy actually jumps quite a lot, by two or three times. We only ran it for 40 iterations because it was very expensive, but at least we got a feeling that this idea of ensembling models in a distributed system may be something that is very useful in the future.

I'm going to skip this one, because I do want to talk about lessons learned, both on the modeling and on the infrastructure. On the modeling side: I would say, yes, distributed deep learning can potentially run faster than a single node; for a given accuracy, you can get benefits from distributing the computation. In our experience, we found it pretty challenging to do. It was really interesting work, but anybody who's worked on distributed systems knows that things don't come easily. The other thing is that we were working in the context of a distributed database, this parallel database. There are some things we could do differently if we were just on a Linux cluster, but since you need to work with the query processor, and with how the database marshals data around, that can handcuff you a little bit. So that was one of the limitations we had to deal with.

On the infrastructure side, my number one lesson is: beware of the cost of GPUs on public cloud. We had some sticker shock when we got our bill after the first month. The other thing is that memory management can be very finicky: GPU initialization settings and freeing TensorFlow memory. There's a lot of folklore there, and I've put a more detailed list at the back of this deck. So if anybody's having issues around GPU memory and such, we found a bunch of blog posts that were helpful; a lot of stuff we had to fail at and learn on our own. There are a lot of details there if you're running into challenges, and this isn't just database specific, so have a look; it may save you a bit of time.

Since we're getting to the end here, I'll talk about the future work that we're planning. Apache MADlib has a release coming up, probably within the next month, which is the initial release of the distributed deep learning support. We're initially going to support Keras with the TensorFlow backend, with GPU support. The work we have planned for the remainder of this year is very deep learning focused: we want to look at more distributed deep learning methods, and we want to implement the parallel hyperparameter tuning that I mentioned before (there's a toy sketch of that below). We're also in the process of adding a model versioning repository, which will allow you to run tons and tons of models in parallel, collect your results in one table, and then sort through them to pick your best parameters.
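Here is a toy sketch of that parallel hyperparameter tuning idea: every worker sees the same data but a different parameter combination, and the results land in one table. The grid values, the tiny model, and the fake data are all made up for illustration, and the "workers" here are just a loop rather than database segments:

```python
from itertools import product
import numpy as np
from tensorflow.keras import layers, models, optimizers

def make_model(learning_rate):
    # Same architecture every time; only the hyperparameters vary.
    m = models.Sequential([
        layers.Dense(16, activation='relu', input_shape=(8,)),
        layers.Dense(2, activation='softmax'),
    ])
    m.compile(optimizer=optimizers.SGD(learning_rate=learning_rate),
              loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    return m

# Fake data standing in for the full data set replicated on every worker.
x = np.random.rand(300, 8).astype('float32')
y = np.random.randint(0, 2, size=300)

results = []
for lr, bs in product([0.1, 0.01, 0.001], [16, 32, 64]):
    model = make_model(lr)
    model.fit(x, y, batch_size=bs, epochs=3, verbose=0)
    _, acc = model.evaluate(x, y, verbose=0)
    results.append({'lr': lr, 'batch_size': bs, 'acc': acc})

best = max(results, key=lambda r: r['acc'])  # pick the best parameters
```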
That was it. So thank you very much for your attention.

Yes? [Audience question, partly inaudible.] I'm just showing you what we were talking about. Yeah. OK. I think that's because that model doesn't generalize well. So the question is: why does accuracy drop if you run for more iterations? This can happen for certain kinds of models that start to see increasing gradients, so accuracy goes down. There's a blog post here where we got this; we didn't create this model, we got it from this researcher, and the results we see here are very similar to that person's results. I think the bottom line is that it's not a very good model for that data set.

Sorry? [Follow-up question.] This is it, yeah. I was using this as an example to show that if you run a certain model architecture on a data set on a single node, and then you run the same thing on a cluster, you can actually see different convergence behavior. That was the point of showing this. Yeah.