Hello, everybody, and thank you for joining us today for the virtual Vertica BDC 2020. Today's breakout session is entitled "Tapping Vertica's Integration with TensorFlow for Advanced Machine Learning." I'm Paige Roberts, open source relations manager at Vertica, and I'll be your host for this session. Joining me is Vertica software engineer George Lariatoff. Right. That's George.

Before we begin, I encourage you to submit questions or comments during the virtual session. You don't have to wait; just type your question or comment in the question box below the slides and click Submit as soon as it occurs to you. There will be a Q&A session at the end of the presentation, and we'll answer as many questions as we're able to get to during that time. Any questions we don't get to, we'll do our best to answer offline. Alternatively, you can visit the Vertica forums to post your questions there after the session. Our engineering team is planning to join the forums to keep the conversation going, so you can ask an engineer afterwards, just as if it were a regular in-person conference. Also, a reminder: you can maximize your screen by clicking the double-arrow button in the lower right corner of the slides. And before you ask, yes, this virtual session is being recorded and will be available to view on demand this week. We'll send you a notification as soon as it's ready. Now, let's get started. Over to you, George.

Thank you, Paige. So, I've been introduced: I'm a software engineer at Vertica, and today I'm going to be talking about a new feature, Vertica's integration with TensorFlow. First, I'm going to go over what TensorFlow is and what neural networks are. Then I'm going to talk about why integrating with TensorFlow is a useful feature. And finally, I'm going to talk about the integration itself and give an example.

So, if we get started here: what is TensorFlow? TensorFlow is an open source machine learning library developed by Google, and it's actually one of many such libraries. The point of libraries like TensorFlow is to simplify the process of creating, training, and using neural networks, so that working with them is available to everyone, as opposed to just a small subset of researchers.

Neural networks are computing systems that allow us to solve various tasks. Traditionally, computing algorithms were designed completely from the ground up by engineers like me, and we had to manually sift through the data and decide which parts were important for the task and which were not. Neural networks aim to solve this problem by sifting through the data automatically and finding traits and features that correlate with the right results. So you can think of a neural network as learning to solve a specific task by looking through the data itself, without a human being having to sift through it.

There are a couple of necessary parts to getting a trained neural model, which is the final goal. By the way, "neural model" and "neural network" are synonymous. First, you need this light blue circle: an untrained neural model, which is pretty easy to get in TensorFlow. In addition to that, you need your training data. This involves both training inputs and training labels, and I'll talk about exactly what those two things are on the next slide. But basically, you need to train your model with the training data. Once it is trained, you can use your trained model to predict on just the new inputs, the purple circle, and it will produce the labels for you. You don't have to label the data anymore.
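To make that concrete, here is what the untrained model, the training data, and the training step can look like in TensorFlow's Keras API. This is a minimal sketch; the data, network shape, and numbers are made up purely for illustration:

```python
import numpy as np
import tensorflow as tf

# Training data: inputs paired with labels (the correct answers).
# These arrays are invented for illustration.
train_inputs = np.random.rand(1000, 3)                   # 1,000 examples, 3 features each
train_labels = train_inputs.sum(axis=1, keepdims=True)   # toy target to learn

# An untrained neural model (the light blue circle) is easy to get.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(3,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# Training: the network iterates over the labeled data many times.
model.fit(train_inputs, train_labels, epochs=10)

# Prediction: new, unlabeled inputs in; predicted labels out.
print(model.predict(np.random.rand(5, 3)))
```

Note the division of labor: fit() is the only call that consumes labels; once it finishes, predict() needs nothing but inputs.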
Training a neural network can be thought of as teaching a person how to do something. For example, if I wanted to learn to speak a new language, let's say French, I would probably hire some sort of tutor to help me with that task. I would need a lot of practice constructing and saying sentences in French, and a lot of feedback from my tutor on whether my pronunciation, grammar, et cetera, were correct. That would take me some time, but finally, hopefully, I would be able to learn the language and speak it correctly without any feedback. In a very similar manner, a neural network needs to practice on example training data first, and along with that data, it needs labels. The labeled data is analogous to the tutor: it provides the correct answers, so that the network can learn what those look like. But ultimately, the goal is to predict on unlabeled data, which is analogous to me knowing how to speak French.

So I've gone over most of the bullets: a neural network needs a lot of practice, and to do that, it needs a lot of good labeled data. And finally, since a neural network needs to iterate over the training data many, many times, it needs a powerful machine that can do that in a reasonable amount of time.

Here's a quick checklist of what you need if you have a specific task that you want to solve with a neural network. The first thing you need is a powerful machine for training; we discussed why this is important. Then you need TensorFlow installed on that machine, of course. And you need a data set and labels for your data set. Now, this data set can be hundreds of examples, thousands, sometimes even millions; I won't go into that, because the data set size really depends on the task at hand. But if you have these four things, you can train a good neural network that will predict whatever result you want to predict.

So we've talked about neural networks and TensorFlow. But the question is: if we already have a lot of built-in machine learning algorithms in Vertica, why do we need TensorFlow? To answer that question, let's look at this data set. This is a pretty simple toy data set with 20,000 points, but it simulates a more complex data set with two classes that are not related in any simple way. The existing machine learning algorithms that Vertica already has mostly fail on this pretty simple data set. Linear models can't really draw a good line separating the two types of points. Naive Bayes also performs pretty badly. And even the random forest algorithm, which is a pretty powerful algorithm, gets only 80% accuracy with 300 trees. However, a neural network with only two hidden layers gets 99% accuracy in about 10 minutes of training. So I hope that's a pretty compelling reason to use neural networks, at least sometimes.

As an aside, there are plenty of tasks that do fit the existing machine learning algorithms in Vertica; that's why they're there. And if one of the tasks you want to solve fits one of the existing algorithms well, then I would recommend using that algorithm, not TensorFlow. While neural networks have their place and are very powerful, it's often easier to use an existing algorithm if possible.
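For a sense of scale, a network like the one just described, two hidden layers on a two-class problem, takes only a few lines of Keras. The data set below is a stand-in (an XOR-like pattern that no straight line can separate), and the layer widths are arbitrary choices, not taken from the slides:

```python
import numpy as np
import tensorflow as tf

# A made-up stand-in for the 20,000-point toy set: two classes
# whose boundary is not a straight line (opposite quadrants).
rng = np.random.default_rng(0)
points = rng.uniform(-1.0, 1.0, size=(20000, 2))
labels = (points[:, 0] * points[:, 1] > 0).astype("float32")

# Two hidden layers, then a sigmoid output for the two-class decision.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(2,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])

# Validation accuracy on the held-out 20% is the number to watch.
model.fit(points, labels, epochs=20, validation_split=0.2)
```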
So now that we've talked about why neural networks are needed, let's talk about integrating them with Vertica. Neural networks are best trained using GPUs, graphics processing units, which are basically a different kind of processing unit than a CPU. GPUs are good for training neural networks because they excel at doing many, many simple operations at the same time, which is what a neural network needs in order to iterate through the training data many times. However, Vertica runs on CPUs and cannot run on GPUs at all, because that's not how it was designed. So to train neural networks, we have to go outside of Vertica. Exporting a small batch of training data is pretty simple, so that's not really a problem (there's a quick sketch of that export below).

But given this information, why do we even need Vertica? If we train outside, then why not do everything outside of Vertica? To answer that question, here is a slide that Philips was nice enough to let us use. This is an example production system at Philips, and it consists of two branches. On the left, we have a branch with historical device log data, which can be thought of as a bunch of training data. All that data goes through some data integration and data analysis; basically, this is where you train your models, whether or not they are neural networks. For the purposes of this talk, this is where you would train your neural network. On the right, we have a branch with live device log data coming in from various MRI machines, CAT scan machines, et cetera. And this is a ton of data: these machines are constantly running, constantly on, and there are a lot of them, so data just keeps streaming in. We don't want this data to have to take any unnecessary detours, because that would greatly slow down the whole system. So the data in the right branch goes through already-trained predictive models, which need to be pretty fast. And finally, this allows Philips to do maintenance on these machines before they actually break, which obviously helps Philips, and definitely helps the medical industry as well. So I hope this slide helps explain the complexity of a live production system and why it might not be reasonable to train your neural network directly in the system with the live device log data.

A quick summary of the neural network section: neural networks are powerful, but they need a lot of processing power to train, which can't really be done well in a production pipeline. However, they are cheap and fast to predict with; prediction with a neural network does not require a GPU at all. And they can be very useful in production, so we do want them there. We just don't want to train them there.

So the question now is: how do we get a neural network into production? We have basically two options. The first option is to take the data and export it to our machine with TensorFlow, our powerful GPU machine. Or we can take our TensorFlow model and put it where the data is; in this case, let's say that's Vertica. I'm going to go through some pros and cons of these two approaches.

The first approach is bringing the data to the analytics. The pros of this approach are that TensorFlow is already installed and running on the GPU machine, and we don't have to move the model at all. The cons, however, are that we have to transfer all the data to that machine. And if that data is big, gigabytes, terabytes, et cetera, that becomes a huge bottleneck, because you can only transfer in small quantities; GPU machines tend not to have that much storage. Furthermore, TensorFlow prediction doesn't actually need a GPU, so you would end up paying for an expensive GPU for no reason. It's not parallelized, because you just have the one GPU machine. And you can't put your production system on this GPU machine, as we discussed. So you're left with good results, but not fast and not where you need them.
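Here is the export sketch promised above: a minimal example of pulling a bounded batch of training rows out of Vertica into a CSV that the GPU machine can consume, using the vertica_python client. The connection details, table, and column names are all placeholders:

```python
import csv
import vertica_python  # Vertica's open source Python client

# Placeholder connection info; fill in your own cluster details.
conn_info = {"host": "vertica-host", "port": 5433,
             "user": "dbadmin", "password": "...", "database": "mydb"}

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()
    # A bounded batch keeps the transfer small, as discussed above.
    cur.execute("SELECT crim, indus, chas, medv "
                "FROM boston_housing_train LIMIT 100000")
    with open("train_batch.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([col[0] for col in cur.description])  # header row
        writer.writerows(cur.iterate())
```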
So now let's look at the second option: bringing the analytics to the data. The pros of this approach are that we can integrate with our production system. It's low-impact, because prediction is not processor-intensive. It's cheap, or at least pretty much as cheap as your system was before. It's parallelized, because Vertica is always parallelized, which we'll talk about on the next slide. There's no extra data movement. You get to benefit from model management in Vertica, meaning that if you import multiple TensorFlow models, you can keep track of their various attributes, when they were imported, et cetera. And the results are right where you need them, inside your production pipeline. The two cons are that TensorFlow is limited to just prediction inside Vertica, and that if you want to retrain your model, you need to do that outside of Vertica and then re-import it.

Just as a recap of parallelization: everything in Vertica is parallelized and distributed, and TensorFlow is no exception. When you import your TensorFlow model into your Vertica cluster, it gets copied to all the nodes automatically. TensorFlow runs in fenced mode, which means that if the TensorFlow process fails for whatever reason (it shouldn't, but if it does), Vertica itself will not crash, which is obviously important. And finally, prediction happens on each node. There are multiple threads of TensorFlow processes running, each processing different little bits of data, which is much faster than processing the data line by line, because it all happens in a parallelized fashion. The result is fast prediction.

So here's an example, which I hope is a little closer to what everyone is used to than the usual machine learning TensorFlow examples. This is the Boston housing data set, or rather a small subset of it. On the left, we have the input data (to go back to, I think, the first slide), and on the right is the training label. Each line of the input data is a plot of land in Boston, along with various attributes, such as the level of crime in that area, how much industry is in that area, whether it's on the Charles River, et cetera. And on the right, as the label, we have the median house value on that plot of land. So the goal is to put all this data into the neural network and finally get a model that can predict a good housing value for new incoming data.

Now I'm going to go through, step by step, how to actually use TensorFlow models in Vertica. The first step I won't go into in much detail, because there are countless tutorials and resources online on how to use TensorFlow to train a neural network. The second step is to save the model in TensorFlow's frozen graph format; again, this information is available online. And the third step is to create a small, simple JSON file describing the inputs and outputs of the model, what data types they are, et cetera.
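Here is a sketch of what steps two and three can look like in code, along with the import and prediction described next. To be clear about the assumptions: the stand-in model, paths, and table names are placeholders; the descriptor's field names are illustrative rather than the exact schema (the Vertica documentation has the authoritative tf_model_desc.json format and ships helper scripts for producing it); and vertica_python is used for the SQL calls:

```python
import json
import tensorflow as tf
import vertica_python
from tensorflow.python.framework.convert_to_constants import (
    convert_variables_to_constants_v2,
)

# Stand-in for the trained model from step one: 13 housing features in,
# one predicted median value out.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(13,)),
    tf.keras.layers.Dense(1),
])

# Step two: save the model in TensorFlow's frozen graph format.
concrete = tf.function(lambda x: model(x)).get_concrete_function(
    tf.TensorSpec(model.inputs[0].shape, model.inputs[0].dtype))
frozen = convert_variables_to_constants_v2(concrete)
tf.io.write_graph(frozen.graph, logdir="boston_housing_net",
                  name="model.pb", as_text=False)

# Step three: a small JSON file describing the model's inputs and outputs.
# Field names here are illustrative; check the docs for the exact schema.
desc = {"frozen_graph": "model.pb",
        "input_desc": [{"name": frozen.inputs[0].name, "dims": [1, 13]}],
        "output_desc": [{"name": frozen.outputs[0].name, "dims": [1, 1]}]}
with open("boston_housing_net/tf_model_desc.json", "w") as f:
    json.dump(desc, f, indent=4)

# Then: import the directory into Vertica, and predict with SQL.
conn_info = {"host": "vertica-host", "port": 5433,
             "user": "dbadmin", "password": "...", "database": "mydb"}
with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()
    cur.execute("SELECT IMPORT_MODELS('/home/dbadmin/boston_housing_net' "
                "USING PARAMETERS category='TENSORFLOW')")
    # The OVER clause is the bit of boilerplate mentioned in the talk.
    cur.execute("SELECT PREDICT_TENSORFLOW(* USING PARAMETERS "
                "model_name='boston_housing_net') "
                "OVER(PARTITION BEST) AS (id, value) "
                "FROM boston_housing_data")
    for row in cur.iterate():
        print(row)
```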
That JSON file is what Vertica needs to be able to translate from TensorFlow land into Vertica SQL land, so that it can use a SQL table instead of the inputs TensorFlow usually takes. So once you have your model file and your JSON file, you put both of those files in a directory on a node, any node, in the Vertica cluster, and you name that directory whatever you want your model to ultimately be called inside of Vertica. Once you've done that, you can go ahead and import that directory into Vertica. This import model function already exists in Vertica; all we added was a new category that it can import. So all you need to do is specify the path to your neural network directory and specify that the model's category is TensorFlow.

Once you've successfully imported, in order to predict, you run this brand new predict TensorFlow function. In this case, we're predicting on everything from the input table, which is what the star means. The model name is Boston Housing Net, which is the name of your directory. Then there's a little bit of boilerplate, and the two names after the AS, id and value, are just the names of the columns of your output. Finally, the Boston housing data is whatever SQL table you want to predict on, as long as it fits the input type of your network. This will output a bunch of predictions, in this case, the values of houses that the network thinks are appropriate, for all of the input data.

Just a quick summary. We talked about what TensorFlow is and what neural networks are. Then we discussed that TensorFlow works best on GPUs, that is, TensorFlow training works best on GPUs, because it needs their specific characteristics, while Vertica is designed to use CPUs; it's really good at storing and accessing a lot of data quickly, but it's not well designed for training neural networks inside of it. Then we talked about how neural models are powerful and we want to use them in our production flow; since prediction is fast and cheap, we can go ahead and do that, but we just don't want to train there. And finally, I presented Vertica's TensorFlow integration, which allows importing a trained TensorFlow model into Vertica and predicting on all the data inside Vertica with a few simple lines of SQL. So thank you for listening. We're going to take some questions now.