I think we're good. I think it's about time. So welcome, everyone, to the AI ThunderDome. We're going to be talking about parallel AI training. My name is Sean Pryor. I'm a Senior Cloud Consultant at Red Hat. I've been working on one of the largest telco OpenStack deployments in the US, and I have a passion for AI and other fun tech, so I thought I'd submit this and see how it goes.

Here's what we'll cover: a brief recap of the essential technologies in ML, Spark, and Sahara; some notes on preparing a deployment; and some of the fun issues we hit in our lab when we were testing this out and preparing for the talk. I know the agenda says demo, but we couldn't actually get the demo fully working, so I did have to drop that bit, and I'm sorry about that. I can tell you about it at the end, if you'd like. Then we'll look at how to do machine learning on Spark, cross-validation for model selection, another framework called SparkFlow, which may help implement this for some large training data sets, and then look through some example code to really see how you would do all of this in practice.

So without further ado: big data and OpenStack. One of the questions you may have is, why would you want to do this? Well, as of the very latest user survey, there's a lot of data residing on OpenStack. Over half of the sites have more than a terabyte, and almost a fifth of the sites have over 100 terabytes. A lot of the sites also have fewer than 1,000 objects, so those are big objects sitting out in Swift. And because of that, since you wouldn't want to transfer 100 terabytes over your local network (that would be prohibitive for performance), it makes sense to just analyze the data where it already is.

Talking a little bit about the architecture: Sahara, for those of you not already intimately familiar with it, is basically just a big wrapper around Heat that's specific to data processing. And it actually handles more than Spark. We'll be talking about Spark, but Hadoop, Storm, and some other popular frameworks are already supported in Sahara. It also helps by setting up all the credentials you would need to access Swift and other things like that. And it includes scaling, so you can have your clusters scale up and down depending on need. Also, you're not specifically required to access data in Swift. That's what the examples will be doing, but you could access data in any way that regular code would; you're not restricted to just the stuff inside your cloud.

The next piece of architecture to look at is a bit of a recap of Spark and what it's doing. It's a generic data manipulation framework. It has a master/slave architecture where data and tasks are distributed among a bunch of workers, and a cluster manager coordinates. Notably, the cluster manager doesn't have to be the one built into Spark; it could be Mesos, YARN, or Kubernetes if you're feeling fun. Spark has gained a lot of popularity lately because it's a much faster version of Hadoop in a lot of ways. It's built on the same framework, but it has other tools: it does a lot of in-memory caching, and it notably does lazy evaluation, which is a great performance gain in many cases. And you can run MapReduce and other things on top of Spark, not just AI. But since AI is the big, important big data workload, it makes sense that you'd want to use Spark for it.
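As a quick illustration of that lazy evaluation (a minimal sketch of my own, not from the talk; the file name and column are made up), transformations like filter just build up a plan, and nothing is computed until an action such as count forces it:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-demo").getOrCreate()

# Transformations are lazy: this builds a query plan, reads nothing yet.
df = spark.read.json("events.json")           # hypothetical input file
big_spenders = df.filter(df["amount"] > 100)  # still nothing computed

# cache() marks the result to be kept in executor memory once materialized.
big_spenders.cache()

# Actions force evaluation: only now does Spark read and filter the data.
print(big_spenders.count())
```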
One other thing that's really important to call out in the diagram there: you'll notice a cache. I don't have a laser pointer, so just imagine I'm pointing to the little part on the workers that says cache. That's actually really important to know about for TensorFlow, because TensorFlow will assume that all of the free memory on the system it's running on is usable. So when we get a little deeper into Spark in a couple of slides, just remember that if you force something into the cache in the middle of a TensorFlow execution, you could start swapping in the middle of AI training, which is going to be very painful.

So, on to a few notes about deploying Sahara. A lot of these are things we hit in our lab, so I figure it's better to share the pain with everyone else so they know in advance. When you use the sahara-image-elements repo to build your image, it will, by default, build an Ubuntu image. For right now, CentOS is unsupported; I hope we fix that. So, in order to actually enable TensorFlow and other machine learning, you'll want to guestmount the image and do some pip installs. You'll need tensorflow, plus tensorflow-gpu and the associated drivers if you have GPU instances. Keras is a nice set of machine learning tools that make TensorFlow easier to use. SparkDL is another framework, by Databricks, that has some convenience features. SparkFlow is another framework, by Lifeomic, that we'll look at a little bit, because it approaches the same parallel training from a slightly different direction, which I thought was a really neat implementation, so I wanted to talk about it. Also, notably, Yahoo has invested in this space as well. That's not in this presentation, but if you search for Yahoo, Spark, and TensorFlow, you'll find their TensorFlowOnSpark project.

Another fun thing: by default, when you're running a job, it will attempt to access HDFS, but the default image does not have the ubuntu user in the HDFS supergroup. So make sure to add that group, and add the user to it in your cloud config; otherwise, your jobs will not run out of the box. That was a fun one to debug. Additionally, Swift support seemed to have a few issues. This might have just been something I hit in my lab, but I hit that fun ClassNotFoundException; you may need to reinstall your hadoop-openstack.jar. Your mileage may vary. And last but not least, the job and job-execution framework in the OpenStack CLI and GUI expects you to have a Java program. So you might need to go in and run spark-submit manually to get some of your PySpark jobs working, since Python and TensorFlow are fairly common things to put together. There are ways to do it in Java, but I didn't, so.

Right, so now let's get into the fun stuff: the actual machine learning in Spark. Let's recap a little bit about AI and what we're really talking about, since some people may have Spark as their main skill set and some people may have TensorFlow. I want to do a recap on each so that we're all in the same place. When we talk about AI, we're really talking about curve fitting. We're talking about using graphs to approximate some function. AI is really a universal function approximator: you have a lot of data, you don't know what underlying function generated that data, and AI will help you approximate that function by fitting it.
It does this by creating a big graph of set functions with weights, and adjusting those weights based on the incoming data until it converges to some function that hopefully approximates what you're looking for. Each iteration refines it. As you can see in the diagram there, if you imagine the black line as your existing function, the dashed line is the new iteration of your training: it pushes the function in a new direction and may add more complexity, as you see in the tail there.

And just for another recap, there are a couple of important terms to remember when talking about AI. Features are just characteristics of any data point that you have. It may be size, weight, the price of houses near the one you're looking at, where it is, et cetera. Any kind of characteristic of a data point that you can imagine would be called a feature. And labels are your outputs. So you could take all of your features, say the words in an email, and label it as either spam or not spam. You could take all of the pixels in an image and label that image something. You could also have it predict: take in all the characteristics of houses in an area and label each one with what it thinks the price would be, optionally with some amount of confidence.

Additionally, two of the other terms we'll want to look at, specifically related to Spark and TensorFlow and parallel AI training, are the learning rate and the loss. Each iteration adjusts all of the weights by a certain amount; it pushes the function in a certain direction, and the learning rate affects how far. Now, if it's too high, your function will bounce back and forth around the answer and never converge to a single function. If it's too low, it'll just never get anywhere. So that's one you would want to optimize, and it's very domain specific: the proper learning rate depends on what your exact problem is. And one metric to evaluate whether you're in the right ballpark is the loss, which is, broadly, how far away from reality our answers are. Is our function accurately modeling reality or not? Now, this one is fairly complicated, and calculating it can involve some fairly heavy statistics. And you can have functions with low loss that don't necessarily model reality very well. For example, if you were predicting whether or not a user will click a banner ad, you could have a function that predicts false all the time. It would look very accurate from a loss perspective, but it wouldn't be predicting what you want.

Regularization is one other knob; it helps prevent some kinds of loss due to overfitting, which is a little more complex. Basically: does the function fit the training data too closely without being broadly applicable? Regularization helps make your function more broadly applicable. So those are all the knobs you need to adjust to make sure that your AI functions as you would expect. And it can take many, many tries to find the right values, but fortunately, Spark has a nice feature that helps us with this.
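To make the learning-rate and loss knobs concrete, here's a minimal sketch (my own illustration, not from the talk; plain Python, a single weight, made-up data) of the update loop that all of this training boils down to. Each iteration nudges the weight in the direction that reduces the loss, and the learning rate controls how far:

```python
# Fit y = w * x to toy data with gradient descent.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]  # roughly y = 2x

w = 0.0
learning_rate = 0.05  # too high and w oscillates and diverges; too low and it barely moves

for step in range(100):
    # Loss: mean squared error between predictions and labels.
    loss = sum((w * x - y) ** 2 for x, y in data) / len(data)
    # Gradient of the loss with respect to w.
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    # The learning rate scales how far each iteration pushes the weight.
    w -= learning_rate * grad

print(w, loss)  # w should end up near 2.0 with a small loss
```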
So, moving on. When dealing with Spark and machine learning, there are a few important concepts to remember, basically four: data frames, transformers, estimators, and pipelines. Pipelines are just a combination of transformers and estimators, and they operate on data frames. A data frame, for anyone who's used Pandas, is a very similar concept: it's basically a column-oriented, database-like data structure. It supports SQL-style querying, and you can also do things like MapReduce on data frames. It's very similar to what you would expect from Pandas or other libraries that implement something similar. Data frames are distributed across the entire cluster, so parts of one can be residing on each node. Additionally, and notably, they're lazy. A data frame won't be cached, and all of it won't be brought into memory, until you actually run a function on it that requires it. Notably, transformations don't trigger this; transformations do not cache the entire data structure in memory, which is great. That lazy evaluation is why Spark is so performant on a lot of big data tasks.

Moving on to transformers and estimators. Estimators are a subclass of transformers, and both are abstract, in that you can implement your own; Spark provides a bunch of pre-canned ones. A transformer is just any class that implements a transform method, which returns a new, modified data frame. So for cleaning your data before you hand it over to the AI to be trained, you may transform it in some manner: you may clean out empty strings if you expect the field to always be populated, or prune some outliers in some cases. Those transformations get applied before you pass the data to your AI. Estimators additionally implement a fit method, which will, in our case, train the AI on the data it's passed. Calling fit actually goes through and performs all of the training, running the data through the AI. It doesn't necessarily need to be TensorFlow; for this talk it is. When that fit method returns, it returns the trained model. So that's one difference in their outputs. Estimators also implement the transform function, and in most AIs, transform is what actually provides the labeling: if you pass it an unlabeled piece of data, it appends the new label column. That's how you actually use them in practice.

And finally, pipelines, as shown in the image, just comprise a set of transformers and estimators. This one takes your raw text, tokenizes it, hashes it, and then passes it off to the TensorFlow bit, and the TensorFlow bit returns the trained model at the end. So transformers and estimators are just blocks of code: transformers take in and modify the data and return a new data frame with columns appended, removed, et cetera. Oh, yes, and the red one in the diagram is the estimator. There we go. OK, cool.
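Since transformers are abstract and you can implement your own, here's a minimal sketch (my own illustration, not from the talk) of the kind of cleanup transformer described above: a hypothetical class that drops rows with empty strings before the data ever reaches the estimator:

```python
from pyspark.ml import Transformer

class EmptyTextFilter(Transformer):
    """Hypothetical cleanup step: drop rows whose 'text' column is empty."""

    def _transform(self, dataset):
        # Return a new DataFrame; the input is never mutated.
        return dataset.filter(dataset["text"] != "")

# An estimator works the same way, except you implement _fit(), which
# returns a trained model (itself a transformer) instead of a DataFrame.
```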
All right. So now on to the really powerful tool that Spark gives you for this, which is cross-validation. As we mentioned, there are a lot of knobs you can turn: batch size, learning rate, regularization, number of iterations, et cetera. Cross-validation lets you tune all of these programmatically. It allows you to select them, form a grid, and test against each of the different combinations. This does, however, come at a bit of a cost, because as you expand the grid, as you have more combinations to test against, it starts to get prohibitive in terms of performance. So what cross-validation does is wrap the pipeline: it takes your pipeline with some values abstracted out to the cross-validator. You pass it a grid of parameters, and it runs several copies of that pipeline in parallel and gives you back the one that performs best. You may need a lot of resources, but if you don't want to sit there and hand-tune an AI, this is a great way to do it. And this is where the ThunderDome concept comes in, because the AIs are sort of competing. Having fun with the naming, you know.

So now, how do we determine which is best? Well, we need to optimize against that loss metric from before. Spark helpfully has an Evaluator class with a bunch of common algorithms, AUC, mean squared error, et cetera, which are fairly common metrics to evaluate an AI against. Additionally, if you have a specialized AI that needs a specialized loss metric, Evaluator is also abstract, so you can implement your own.

After it goes through this parallel training, each pipeline is tested on its slice of the data; each one is not tested on the full data set, it's tested on the slice of data that resides on that node. After the best pipeline is selected, the one with the lowest loss, it is then retrained on all of the data, and that gives you your finished model. So, very powerful, but it comes at a resource cost, depending on how big your grid of parameters is. And of course, as powerful as it is, it's also not an end-all solution, because you still need to make sure the AI functions as you would expect. It may have all the right parameters but be solving a different problem from the one you think it's solving. So just remember that you do still have to sanity-check it.

Now, to really dig into this, let's take a look at some example code and actually look at all of the calls and what they're doing; hopefully that'll help when it comes to implementing your own AIs. This may be familiar: if anyone's browsed the code in Spark's ML guide, it's taken right from there, and it shows how to use cross-validation to find the optimal hyperparameters in a pipeline. I did have to doctor the code a little to get it to fit on the slide, but I wanted the whole thing here in case anyone was taking it straight off the slides.

So, to start off, we create our Spark session, and that gives us the Spark context we use to interact with data across the whole cluster. Following that, we create some simple training data. It's truncated on this slide; it's only a small set of strings that either contain the word "spark" or not. Once that data frame is created, it's got three columns: id, text, and label. The label is true if the text contains "spark", false if not, and we train the model to determine whether or not that word is in the string. After that, a Tokenizer breaks the text up into individual tokens, which adds a new words column to our data frame. The next call down adds a HashingTF, which hashes the tokens based on their frequency in each input. That gives us a new hashed features column, and with it our features and labels for a training data set. And LogisticRegression is just one of those nice pre-canned algorithms: if you just need simple logistic regression, you don't need TensorFlow at all. It's already there for you. You don't have to write any fancy code for it.
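Here's a reconstruction of that setup portion, along the lines of the example in Spark's ML guide (I've abbreviated the training rows, so treat the exact strings as placeholders):

```python
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CrossValidatorExample").getOrCreate()

# Tiny training set: label is 1.0 if the text contains "spark", else 0.0.
# (Abbreviated here; the real example has about a dozen rows.)
training = spark.createDataFrame([
    (0, "a b c d e spark", 1.0),
    (1, "b d", 0.0),
    (2, "spark f g h", 1.0),
    (3, "hadoop mapreduce", 0.0),
], ["id", "text", "label"])

# Split the raw text into tokens, then hash tokens into a feature vector.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")

# Pre-canned logistic regression estimator; no TensorFlow needed here.
lr = LogisticRegression(maxIter=10)
```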
And then after we set all these up, we create the pipeline, which takes in the strings, splits them into tokens, hashes the tokens based on frequencies, and then passes them into the logistic regression algorithm for training. If we were to call fit right here on that data frame, we would have the model we're looking for, given a fixed learning rate and so on. But instead, we're going to create a parameter grid, and this is where we get into the parallel part of the parallel hyperparameter tuning. The parameter grid just contains a set of different parameters to pass to each pipeline. For HashingTF, we hash into 10, 100, or 1,000 buckets, and for our logistic regression model, we try a regularization parameter of either 0.1 or 0.01. Now, this is a fairly small grid, but it's still going to run six pipeline variants in parallel. So hopefully, you have a big enough cluster to run those.

The CrossValidator then wraps the pipeline and executes it with the given parameters from the grid. numFolds controls how many folds the input data set is split into. In this case, since the training set only had 11 examples, we used two. But for regular data sets in the field, you'd want at least three, possibly more, depending on how you want to carve up your data. With three folds, each split uses two-thirds of the data for training and one-third for testing: two-thirds is used to train the AI, and the held-out third is used to validate that you haven't overfit to the first part. And then finally, we call crossval.fit and pass it the training data set, which executes all the pipelines in parallel and returns us the best one. So cvModel now points to the best model, since the last stage was an estimator.

So now that we have it, we can evaluate it. We do that by creating a test data set, which is data similar to the training data but without the labels; if we evaluated on the training data itself, that would actually be really bad. Our model is now going to label data it hasn't seen before, and that's really what you want an AI to be able to do: take an example it hasn't seen before and give it the proper label. So we call our model's transform method, and it returns two new columns, our prediction and our probability, which in this case is how confident this specific model is that the string contains the word "spark". With examples this simple, probably fairly confident, but larger and more complex models may need a lot more training data. All of AI is really data hungry.
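And here's the rest of it, again along the lines of the Spark ML guide example (the grid values match the ones just discussed; the test strings are placeholders):

```python
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Raw text -> tokens -> hashed features -> logistic regression.
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

# 3 bucket counts x 2 regularization values = 6 pipeline variants to race.
paramGrid = (ParamGridBuilder()
             .addGrid(hashingTF.numFeatures, [10, 100, 1000])
             .addGrid(lr.regParam, [0.1, 0.01])
             .build())

crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=BinaryClassificationEvaluator(),
                          numFolds=2)  # use >= 3 on real data sets

# Runs every variant across the folds and keeps the winner.
cvModel = crossval.fit(training)

# Data the model has never seen, and no labels this time.
test = spark.createDataFrame([
    (4, "spark i j k"),
    (5, "l m n"),
], ["id", "text"])

# transform() appends prediction and probability columns.
cvModel.transform(test).select("id", "text", "probability", "prediction").show()
```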
So now, as sort of an addendum to this, there's one other interesting framework that I just wanted to talk about, because I thought they had a really fascinating implementation, and it's also a good fit for some algorithms that the previous approach might not be: SparkFlow. I actually made this diagram; it was fun. This is sort of an alternative parallel training methodology. SparkFlow implements an algorithm called Hogwild, which approaches parallel training from a different direction. Rather than selecting the optimal parameters, it assumes you already roughly know the parameters you want to use, and it's great when you have extremely large data sets, where it could take days or weeks to go through all of the training. This can cut that down quite significantly.

It takes advantage of the fact that while actually using the neural network isn't parallel, the training bit can be made embarrassingly parallel. The way it works is it creates a parameter server: a master server that has a copy of the AI and all of its weights, so your full graph and all of the weights reside on that master server. It then distributes a copy of this graph to all of the workers, which are then trained on their slice of the data. After a certain number of training iterations on that data, it gathers the weights back from all of the AIs and aggregates them, so you've now essentially trained the master AI on all of that data in parallel. That can be great when you have extremely large data sets. At time of writing, though, there's no automatic cross-validation for this. As of right now, it doesn't automatically select the right parameters, so you do still have to pass the parameters in.

So let's dive into exactly how to do that. This code, again, will look fairly familiar to anyone who's been to their GitHub page; it's the exact example they use. It uses the same kind of Spark pipeline as before, but with this parallel training as the last estimator step rather than a built-in one. And this example uses the MNIST digit classification data set, which should hopefully look familiar to anyone who's looked into AI. Our version uses a CSV version of it, with each pixel as a column.

To start off, one of the key differences in SparkFlow versus the other frameworks is that in this case, you just write plain TensorFlow. So this could be a good fit if you're coming from a TensorFlow background and someone says, all right, let's do it on Spark. Our small_model function (that's hard to say) defines just the TensorFlow model, with inputs for all 784 pixels of an MNIST image. It passes that to another layer, and then there's an output layer, which is just a vector of 10 values, only one of which is ever true, because it's going to predict one of the digits 0 through 9. Notably, we also return a loss metric from this function, which is important because that's what we're going to use to evaluate the model during training.

Moving on, we now take our MNIST data set straight out of Swift. One other important thing to note: by default, CSV parsing gives you strings. In this case, we pass inferSchema=True because we want those values evaluated as integers. And the data is stored in Swift there. build_graph then creates the actual TensorFlow graph. It takes the pointer to that function from earlier, builds the graph, and serializes it into Google's protobuf format to be served off of the parameter server. It's actually served over plain HTTP, so you'll see a lot of GETs and POSTs in the training output while you're doing this. Then our VectorAssembler cleans the data from the CSV into vectors: it takes the 784 pixel columns (everything after the label column) and packs them into a features column, so we now have a bunch of vectors. And finally, we create the one-hot encoder, because only one entry of that label vector is ever true, and that just makes it easier to deal with the data at the end.
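Here's a sketch of that example, following the shape of the one on SparkFlow's GitHub page, including the SparkAsyncDL estimator we'll get to in a second (the Swift container name is made up, and the exact layer sizes are from memory, so treat the details as approximate):

```python
import tensorflow as tf
from pyspark.ml.feature import OneHotEncoder, VectorAssembler
from pyspark.ml.pipeline import Pipeline
from pyspark.sql import SparkSession
from sparkflow.graph_utils import build_graph
from sparkflow.tensorflow_async import SparkAsyncDL

spark = SparkSession.builder.appName("sparkflow-mnist").getOrCreate()

def small_model():
    # Plain TensorFlow: 784 pixel inputs, a hidden layer, a 10-way output.
    x = tf.placeholder(tf.float32, shape=[None, 784], name='x')
    y = tf.placeholder(tf.float32, shape=[None, 10], name='y')
    hidden = tf.layers.dense(x, 256, activation=tf.nn.relu)
    out = tf.layers.dense(hidden, 10)
    z = tf.argmax(out, 1, name='out')
    # Returning the loss is what lets SparkFlow drive the training.
    return tf.losses.softmax_cross_entropy(y, out)

# Read the CSV out of Swift; inferSchema so pixels parse as ints, not strings.
# ("mnist.sahara" is a hypothetical container/provider pair.)
df = spark.read.option("inferSchema", "true") \
               .csv("swift://mnist.sahara/mnist_train.csv")

# Serialize the graph to protobuf for the parameter server to hand out.
mg = build_graph(small_model)

# Pack the 784 pixel columns into a vector; one-hot encode the label column.
va = VectorAssembler(inputCols=df.columns[1:785], outputCol='features')
encoded = OneHotEncoder(inputCol='_c0', outputCol='labels', dropLast=False)

# The Hogwild-style parallel training estimator; note we pass in the
# learning rate ourselves, since there's no cross-validation here.
spark_model = SparkAsyncDL(inputCol='features',
                           tensorflowGraph=mg,
                           tfInput='x:0',
                           tfLabel='y:0',
                           tfOutput='out:0',
                           tfLearningRate=0.001,
                           iters=20,
                           predictionCol='predicted',
                           labelCol='labels',
                           verbose=1)

# fit() runs the distributed training and returns the trained pipeline model,
# which we can save off and load again later.
p = Pipeline(stages=[va, encoded, spark_model]).fit(df)
p.save("mnist_model")
```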
And now, this is the interesting part. SparkAsyncDL is the piece that does all of the major work. It creates the parameter server with the serialized graph, takes which input column to use as the features, and creates that parallel training estimator. So note specifically that in this one, we do pass in the learning rate and the other parameters. It doesn't optimize those for you, but it's great if you have a ton of data. Finally, we create our pipeline and call fit with the data frame from before, which runs all of this training. On the servers I was running, it took about 20 or 30 minutes, and it returns the trained model. My servers weren't using GPUs, though, so on GPUs I'd imagine that's significantly faster. And lastly, just as another cool thing, the example shows saving off that trained model so that we can call it back up later.

All right, and that was about all I had for going through example code. So if anyone has any questions, there are mics towards the center there. Please come up and ask away. Oh, of course you are. Go for it.

And this isn't just a hazing. A couple of questions from the presentation that I figured I'd hold until the end. Probably the biggest one: you talked about training on different slices of the data, and the selection of what data ends up in those slices could really change things. So how are those slices, if you know, how are those slices divvied up? Is it like a bootstrap method?

I don't know specifically how it hashes the data. I know that it will distribute the data roughly uniformly across the given nodes; I don't know specifically what hashing mechanism it uses. So in the non-SparkFlow case, there is the danger that you could have all your outliers residing on one node. That is true; that is something that can happen.

Do you know if that's something you can go in and tune, or use your own algorithm to bootstrap, basically?

That is a good question. I would imagine it's probably somewhere inside the Spark code that you could tune how it distributes the data; I do not know off the top of my head. What I can say, though, is that you can use MapReduce to sort of massage your data before you pass it in. So one thing you could do, before actually passing the data to your estimator, would be to have some early transformers make sure the data doesn't contain outliers.

Second question: can you talk about the amount of resources that you used to do this? You said "if you have a big enough cluster"; can you give a sense of the size of the VMs you were using and all that kind of stuff?

Let's see. I did this in a lab with a small Spark cluster. I only had, like, three nodes, and they were only about 32 gigs of RAM and 16 vCPUs each. They were not very big. I'm pretty sure the hypervisors could have held more, but during the run-up to the talk, I was just using some smaller VMs for training. So, any other? Feel free.

So the talk is about running this on OpenStack. Were there any OpenStack-specific performance tunings that you had to do, or any comparison or thoughts on running it on VMs versus bare metal?

Yes, there are specific OpenStack tunings you'll want to keep track of. You will definitely want to make use of vCPU pinning; that one is a performance optimization in just about all cases. So you'll pass hw:cpu_thread_policy=isolate, I think; I don't remember the parameter off the top of my head.
But you'll want to make sure that you're pinning exclusively to your CPUs. You'll, of course, want to make sure there's no memory overcommit, in the ideal case. The less overcommit you have on your Spark nodes, the better they're going to perform, so that you don't have the noisy neighbor problem. For GPUs, if you're using a multi-socket system, make sure that you have GPUs in both NUMA nodes and that you're not trying to access them across NUMA nodes. Those would be the main ones. I don't think you need to go as far as isolcpus, but CPU shielding would probably add another couple percent to your performance. In TensorFlow's case, it's mostly going to be bottlenecked by the bandwidth between system memory and GPU memory.

So, anybody else? Don't be shy. Well, all righty. I guess if nobody else has any questions, then thank you for having me.