Hello, everybody, and thank you for joining us today for the Virtual Vertica BDC 2020. Today's breakout session is entitled Machine Learning with Vertica: Data Preparation and Model Management. My name is Sue Leclerc, Director of Marketing at Vertica, and I'll be your host for this webinar. Joining me is Waqas Dhillon, part of the product management team at Vertica. Before we begin, I want to encourage you to submit questions or comments during the virtual session. You don't have to wait: just type your question or comment in the question box below the slides and click Submit. There will be a Q&A session at the end of the presentation, and we'll answer as many questions as we're able to during that time. Any questions that we don't address, we'll do our best to answer offline. Alternatively, you can visit the Vertica forums to post your questions there after the session; our engineering team is planning to join the forums to keep the conversation going. Also, a reminder that you can maximize your screen by clicking the double-arrow button in the lower right corner of the slides. And yes, this virtual session is being recorded and will be available to view on demand later this week. We'll send you a notification as soon as it's ready. So let's get started. Waqas, over to you.

Thank you, Sue. Hi, everyone. My name is Waqas Dhillon, and I'm a product manager here at Vertica. Today we're going to go through data preparation and model management in Vertica. The session starts with an introduction and some of the considerations that come up when you're doing machine learning at scale. After that, we have two major sections. The first one is on data preparation: we'll go through what data preparation is, what the Vertica functions for data exploration and data preparation are, and then share an example with you.
Similarly, in the second part of this talk, we'll go through importing and exporting models using PMML and how that works with Vertica, and we'll share examples for that as well. So let's dive right in. Vertica is essentially an open architecture with a rich ecosystem. You have a lot of options for data transformation and for ingesting data from different tools, and you also have options for connecting through ODBC, JDBC, and other connectors to BI and visualization tools; there's a whole lot of them that Vertica connects to. In the middle sits Vertica itself, where you can use external tables or run in-place analytics on Hadoop, in the cloud, or on-prem; that choice is yours. Essentially, it offers you a lot of options for performing your data analytics at scale, and within data analytics, machine learning is a core component with a lot of options and functions of its own. Now, machine learning in Vertica is built on top of the architecture that the distributed analytical database offers, so it inherits a lot of those capabilities and builds on top of them. You eliminate the overhead of data transfer when you're working with Vertica machine learning; you keep your data secure; storing and managing the models is really easy and much more efficient; you can serve a lot of concurrent users at the same time; and it's really scalable and avoids the maintenance cost of a separate system. So, a lot of benefits here. One important thing to mention is that all of the algorithms you see, whether they are analytics functions, advanced analytics functions, or machine learning functions, are distributed not just across the cluster, where each node gets part of the workload, but also within each node, where multiple threads and processes might be running for each of these functions.
So it's a highly distributed solution, and sort of one of its kind in this space. When we talk about Vertica machine learning, it essentially covers the whole machine learning process. We see that as something starting with data ingestion, data analysis and understanding, going through the steps of data preparation, modeling, and evaluation, and finally deployment as well. So when you're using Vertica for machine learning, it takes care of all these steps, and you can do all of that inside the Vertica database. There are three main pillars that Vertica machine learning aims to build on. The first one is to have Vertica as a platform for high-performance machine learning: we have a lot of functions for data exploration and preparation, and we'll go through some of them here; we have distributed in-database algorithms for model training and prediction; we have scalable functions for model evaluation; and finally, we have distributed scoring functions as well. Doing all of this in the database is a really good thing, but we don't want to isolate Vertica in this space. We understand that a lot of our customers and users like to work with other tools alongside Vertica. They might use Vertica for data prep with another tool for model training, or use Vertica for model training and take those models out to other tools and do prediction there. So integration is a really important part of our overall offering, and it's a pretty flexible system. We have been offering UDxs in four languages, with a lot of development there over the past few years. But the capability of importing PMML models for in-database scoring, and exporting Vertica-native models for external scoring, is something that we have recently added. Another talk will actually go through the TensorFlow integration, a really exciting and important milestone where you can bring TensorFlow models into Vertica for in-database scoring.
For this talk, we'll focus on data exploration and preparation, and on importing and exporting PMML models. And finally, since Vertica is not just a query engine but also a data store, we have a lot of really good capability for model storage and management as well. So let's dive into the first part, on machine learning at scale. When we say machine learning at scale, there are a few really important considerations, each with its own implications. The first one is that we want speed, but we also want it to come at a reasonable cost, so it's really important to pick the right scaling architecture. Secondly, it's not easy to move big data around. It might be easy to do that with a smaller data set, an Excel sheet or something of the like, but once you're talking about data analytics at really big scale, it's really not easy to move that data from one tool to another. What you'd want to do instead is bring the models to the data, rather than moving the data to the tools. The third thing is that sub-sampling can compromise your accuracy. A lot of tools out there force you to take smaller samples of your data because they can only handle so much, but that can impact your accuracy; what you need is to be able to work with all of your data. We'll go through each of these really quickly. The first aspect is scalability. If you want to scale your architecture, you have two main options. The first is vertical scaling: you have a machine, a server essentially, and you keep adding resources like RAM and CPU, increasing both the performance and the capacity of that system. But there's a limit to what you can do here, and you can hit that limit in terms of cost as well as in terms of technology; beyond a certain point, you would not be able to scale any more.
The right solution to follow here is horizontal scaling, in which you keep adding more instances to get more computing power and more capacity. Essentially, what you get with this sort of architecture is a supercomputer that stitches together several nodes, with the workload distributed across each of those nodes for massively parallel processing and really fast speeds. The second aspect, big data and the difficulty of moving it around, can be clarified with this example. What usually happens, and this is a simplified version, is that you have a lot of applications and tools from which you might be collecting data, and this data goes into an analytics database. That database in turn might be connected to some BI tools, dashboards, and applications, with some ad hoc queries being run on the database. When you want to do machine learning in this architecture, what usually happens is that the data coming into the analytics database is exported out to your machine learning tools. You train your models there, and afterwards, when you have new incoming data, that data again goes out to the machine learning tools for prediction. But the results you get from those tools usually end up back in the distributed database, because you want to put them on a dashboard or power up some applications with them. So there's a lot of data-movement overhead involved here, and there are problems that come with that, including data governance, data movement, and other complications that you need to resolve. One possible solution to overcome that difficulty is to have machine learning as part of the distributed analytical database as well. You get the benefit of applying it to all of the data that's inside the database, without having to care about all that data movement.
If there are use cases where it still makes sense to train the models outside, you can do your data preparation inside the database, take the prepared data out, build your model, and then bring the model back to the analytics database, in this case Vertica. The model would then be hosted by Vertica, and you can keep applying predictions to the new data coming into the database. The third consideration for machine learning at scale is sampling versus the full data set. As I mentioned, a lot of tools cannot handle big data, and you are forced to sub-sample. But as you can see in the leftmost figure, figure A, if you have a single data point, essentially any model can explain it. If you have more data points, as in figure B, there's a smaller number of models that can explain them, and in figure C, with even more data points, fewer models still. But fewer here also means that these models would probably be more accurate, and the objective of building machine learning models is mostly to have prediction and generalization capability on unseen data. If you build a model that's accurate on one data point, it would not generalize very well. The conventional wisdom in machine learning is that the more data points you have for learning, the better and more accurate the models you'll get. So you need to pick a tool that can handle all of your data and does not force you to sub-sample; with all the data, even a simpler model might do much better than a more complex model trained on a sample. So let's go to the data exploration and data preparation part. Vertica is a really powerful tool and offers a lot of capabilities in this space, and as I mentioned, we support the whole process.
You can define the problem, gather your data, and construct your data set inside Vertica. Then comes the data prep part, followed by training, modeling, deployment, and managing the model. Data preparation is a really critical step in the overall machine learning process; by some estimates, it takes between 60 and 80% of the overall effort. There are a lot of functions here that you can use in Vertica for data exploration, deduplication, outlier detection, balancing, normalization, and essentially a lot more; you can go to Vertica's documentation and find them there. Within data prep, we actually divide them into two parts: exploration functions and transformation functions. For exploration, you have a rich set of functions that you can use in-database, and if you want to build your own, you can use UDxs to do that. Similarly, for transformation, there are a lot of functions around time series, pattern matching, and outlier detection that you can use to transform the data. This is just a snapshot of some of the functions that are available in Vertica right now. And again, the good thing about these functions is not just their presence in the database; it's their ability to scale to really, really large data sets and compute those statistics for you in an acceptable amount of time, which makes your machine learning process really practical. So let's go to an example and see how we can use some of these functions. As I mentioned, there's a whole lot of them and we won't be able to go through all of them, but just for our understanding, we can go through some and see how they work. We have here a sample data set of network flows. It's a simulated attack from some source nodes, and there are some victim nodes on which these attacks are happening. So let's look at the data real quick.
We'll load the data, browse it, compute some statistics around it, ask some questions, make plots, and then clean the data. The objective here is not to make a prediction per se, which is what we mostly do with machine learning algorithms, but just to go through the data prep process and see how easy it is to do that with Vertica and what kind of options are there to help you through it. The first step is loading the data. In this case, we know the structure of the data, so we create a table with the column names and data types. But let's say you have a data set for which you do not already know the structure. There's a really cool feature in Vertica called flex tables: you can use it to initially import the data into the database, then go through all of the variables and assign them types. You can also use it if your data is dynamic and changing, importing the data first and then creating these definitions. Once we've done that, we load the data into the database; it's one week of data out of the whole data set for now. Having done that, we'd like to look at the flows, just to see what the data looks like. Once we do a SELECT * FROM flows with a LIMIT here, we see that there's already some data duplication, and by duplication I mean rows that have the exact same value for each of the columns. So as part of the cleaning process, the first thing we want to do is remove that duplication. We create a table with distinct flows, and you can see here we have about a million flows that are unique. Moving on, this is essentially time series data, and there are seven days in a week, so we want to look at the trends in this data: for the network traffic, the flows, how does the traffic move by hour of the day, and how does it differ from one day to another?
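The loading and dedup steps described above might look roughly like this in SQL. This is a sketch: the table, column, and file names (`flows`, `netflows.csv`, and so on) are assumptions, since the actual schema isn't shown here.

```sql
-- Option A: known structure -- create the table explicitly, then bulk-load
CREATE TABLE flows (
    flow_start  TIMESTAMP,
    flow_end    TIMESTAMP,
    src_addr    VARCHAR(64),
    dst_addr    VARCHAR(64),
    src_port    INTEGER,
    dst_port    INTEGER,
    protocol    VARCHAR(8),
    duration    FLOAT,
    label       VARCHAR(64)
);
COPY flows FROM '/data/netflows.csv' DELIMITER ',' DIRECT;

-- Option B: unknown or changing structure -- load into a flex table first,
-- then let Vertica derive the keys and build a typed view
CREATE FLEX TABLE flows_flex();
COPY flows_flex FROM '/data/netflows.csv' PARSER fcsvparser();
SELECT COMPUTE_FLEXTABLE_KEYS_AND_BUILD_VIEW('flows_flex');

-- Peek at the data, then drop exact-duplicate rows
SELECT * FROM flows LIMIT 10;
CREATE TABLE unique_flows AS SELECT DISTINCT * FROM flows;
```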
This is part of the exploration process. There might be a lot of further explorations you want to do, but we can start with this one and see how it goes. You can see in the graph here that we have seven days of data, and the weekend traffic, which is in pink and purple here, seems a little different from the rest of the days, which are pretty close to each other. Definitely something we can look into, to see if there's some real difference and something we want to explore further. But the thing is, this is just data for one week; if we loaded data for 70 days, we'd have a longer graph with a lot of lines, and we would not really be able to make sense of that data. It would be a really crowded plot. So we have to come up with a better way of exploring that, and we'll come back to it in a little bit. What are some other things we can do? We can get some statistics. We can take one sample flow and look at some of the values. We see that the forwarding-status column and the ToS column have zero values, and when we explore further, we see that there are a lot of records for which these columns are zero, so they're probably not really helpful for our use case. Then we can look at the flow end; flow end is the time when the last packet in a flow was sent. You can do a SELECT with MIN and MAX on the flow end to see when the data started and when it ended, and you can see it's about one week of data, from the 1st till the 8th. We also want to look at whether the data is balanced or not, because balanced data is really important for a lot of the classification use cases we might want to try with this data set. Looking at source address, destination address, source port, and destination port, you can see the data is highly imbalanced in the source versus destination address space; probably something we need to deal with.
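As a sketch, the exploration queries just described could look like the following; column names such as `flow_end` and `tos` are assumptions standing in for the data set's actual schema.

```sql
-- Time span covered by the loaded week of data
SELECT MIN(flow_start), MAX(flow_end) FROM unique_flows;

-- How often the low-value columns are zero
SELECT COUNT(*) AS total_rows,
       SUM(CASE WHEN tos = 0 THEN 1 ELSE 0 END) AS tos_zero_rows
FROM unique_flows;

-- Distinct counts to gauge source vs. destination imbalance
SELECT COUNT(DISTINCT src_addr) AS src_addrs,
       COUNT(DISTINCT dst_addr) AS dst_addrs
FROM unique_flows;
```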
Vertica has really powerful balancing functions that you can use, with undersampling, oversampling, or hybrid sampling, and that can be really useful here. Another thing we can look at is the summary statistics of these columns. On the unique flows table that we created, we just use the SUMMARIZE_NUMCOL function in Vertica, and it gives us count, mean, standard deviation, and percentile information. Now, if we look at the duration, which is the last record here, we can see that the mean is around 4.6 seconds. But when we look at the percentile information, we see that the median is around 0.27, so there are a lot of short flows with a duration of less than 0.27 seconds, and a smaller number of much longer flows pulling the mean up to 4.6. The number of short flows is pretty high. We can ask some other questions of the data about the features. We can look at the protocols and their counts: we see that most of the traffic we have is TCP and UDP, which is sort of expected for a data set like this. Then we want to look at the most popular network services. Again, a simple query here, selecting the destination port and its count, and we can see that most of the traffic is web traffic, HTTP and HTTPS, followed by domain name resolution. Let's explore some more. We can look at the label distributions. This is a data set for which we already know whether a record was an anomaly or not, and we can build our algorithm based on that. We see that there's a background label with a lot of records, and then anomaly-spam seems to be really high; there are anomaly UDP scans and SSH scans as well. Another question we can ask is: among the SMTP flows, how are the labels distributed?
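The summary-statistics and balancing calls follow Vertica's documented signatures for `SUMMARIZE_NUMCOL` and the `BALANCE` meta-function; the table, column, and output-view names here are placeholders, not the actual ones from the demo.

```sql
-- Count, mean, stddev, min/max and percentiles for numeric columns
SELECT SUMMARIZE_NUMCOL(duration, src_port, dst_port) OVER()
FROM unique_flows;

-- Create a balanced view of the data; methods include
-- 'under_sampling', 'over_sampling', and 'hybrid_sampling'
SELECT BALANCE('balanced_flows', 'unique_flows', 'label', 'under_sampling');

-- Label distribution among SMTP flows (port 25)
SELECT label, COUNT(*)
FROM unique_flows
WHERE dst_port = 25
GROUP BY label
ORDER BY COUNT(*) DESC;
```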
We can see that anomaly-spam is highest, and then comes background-spam. So can we say from this that SMTP flows are spam? Maybe we can build a model that answers that question for us; that could be one machine learning model you build from this data set. We can also verify the destination port of flows that were labeled as spam. You'd expect port 25 for the SMTP service, and we can see that SMTP with destination port 25 has a very high count, but there are some other destination ports for which the count is really low. When we're doing an analysis at this scale, those data points might not really be needed, so as part of the data prep and data cleaning, we might want to get rid of those records. Now, going back to the graph that I showed earlier, we can try to plot the daily trends by aggregating them. Again, we take the unique flows and convert them into flow counts, a manageable number that we can then feed into one of the algorithms. PCA, principal component analysis, is a really powerful algorithm in Vertica. What it essentially does is this: a lot of times, you have a high number of columns which might be highly correlated with each other; you can feed them into the PCA algorithm, and it gives you a list of principal components which are linearly independent from each other. Each of these components explains a certain fraction of the variance of the overall data set. You can see here that component one explains about 73.9% of the variance and component two explains about 16%, so combined, those two components alone account for around 90% of the variance. You can use PCA for a lot of different purposes, but in this specific example, we want to see what happens if we combine all the data points together by day of the week.
What sort of information can we get out of it? Is there any insight this provides? Because once you're down to two dimensions, it's really easy to plot them. So we apply PCA: we first train it and then apply it to our data set, and this is the graph we get as a result. You can see component one on the x-axis and component two on the y-axis, and each of these points represents a day of the week. With just two dimensions, it's easy to plot. Compare this to the graph we saw earlier, which had a lot of lines, and where the more weeks or days we added, the more lines we'd have. In this graph, you can clearly tell that the five days of traffic from Monday till Friday are closely clustered together, so probably pretty similar to each other, while Saturday's traffic sits well apart from all of those days, and is also farther away from Sunday. So those two days of traffic are different from the other days, and we can always dive deeper, look at what exactly is happening, and see how that traffic is actually different. But with just a few functions and some pretty simple SQL queries, we were already able to get a pretty good insight from the data set that we had. Now, let's move on to the next part of this talk: importing and exporting PMML models to and from Vertica. A common current practice is that when you're putting your machine learning models into production, you have a dev or test environment, and in it you might be using a lot of different tools: scikit-learn, Spark, R. Once you want to deploy these models into production, you put them into containers, and there's a pool of containers in the production environment talking to your database, which could be your analytical database, and all of the new incoming data comes into the database itself.
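The PCA train-and-apply step might be sketched like this, assuming a hypothetical aggregated table `daily_flows` with one row per day and one hourly-count column per hour (the real column list comes from the aggregation step, which isn't fully shown in the talk).

```sql
-- Train a PCA model on the hourly flow-count columns
SELECT PCA('pca_flows', 'daily_flows',
           'h0, h1, h2, h3, h4, h5');  -- in practice, one column per hour

-- Project each day onto the first two principal components for plotting;
-- day_of_week is carried through as a key, not used as a feature
SELECT APPLY_PCA(day_of_week, h0, h1, h2, h3, h4, h5
                 USING PARAMETERS model_name='pca_flows',
                                  num_components=2,
                                  key_columns='day_of_week',
                                  exclude_columns='day_of_week') OVER()
FROM daily_flows;
```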
As I mentioned in one of the earlier slides, there's a lot of data transfer happening between that pool of containers hosting your machine learning models and the database: you're pulling data out for scoring and then sending the scores back to the database. So why would you need to transfer your models at all? The thing is that no single machine learning platform provides everything. There might be some really cool algorithms that Python provides, but Spark might have its own benefits in terms of additional algorithms or other things you're looking for, and that's why a lot of these tools end up being used in the same company at the same time. Then there might be some functional considerations as well: you might want to isolate your data between the data science team and your production environment, or you might want to score pre-trained models on edge nodes where you probably cannot host a big solution. So there's a whole lot of use cases where moving a model from one tool to another makes sense. One of the common methods for transferring models between tools is the PMML standard. It's an XML-based model exchange format, a standard way to define statistical and data mining models, and it helps you share models between different applications that are PMML-compliant. It's really popular, and it's the format of choice we have for moving models to and from Vertica. Along with this model movement capability, there are a lot of model management capabilities that Vertica offers. Models are essentially first-class citizens of Vertica. What that means is that each model is associated with a DB schema. The user that initially creates a model is its owner, but they can transfer the ownership to other users, and they can work with the ownership rights in much the same way as they would with any other relation in the database.
The same kinds of commands that you use for granting access to a table, changing its owner, changing its name, or dropping it, you can use for models as well. There are a lot of functions for exploring the contents of models, and that really helps in putting these models into production. The metadata of these models is also available for model management and governance. And finally, the import-export capability enables you to apply all of these operations to models that you have imported, or that you might want to export, while they're in the database. I think it would be nice to go through an example to showcase some of these model management capabilities, including PMML model export and import. The workflow for export would be: we train a logistic regression model on some data and save it as an in-DB Vertica model. Then we explore the summary and attributes of the model, looking at what's inside it, what the training parameters were, some coefficients and so on. Then we can export the model as PMML, and an external tool can import that model from PMML. Similarly, we'll go through an example for import. We'll have an external PMML model, trained outside of Vertica. We'll import that PMML model, and from there on, we essentially treat it as an in-DB PMML model. We'll explore the summary and attributes of the model in much the same way as for an in-DB model, then apply the model for in-DB scoring on some test data and get the prediction results. So first, we want to create a connection with the database. In this case, we are using a Python Jupyter notebook. We have the vertica-python connector here that you can use, a really powerful connector that allows you to do a lot of cool stuff with the database from the Jupyter front end.
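Because models are first-class objects, the usual object-management statements apply to them. A sketch, using a hypothetical model `my_model` and user `analyst`:

```sql
-- Let another user run the model
GRANT USAGE ON MODEL my_model TO analyst;

-- Transfer ownership, rename, or drop, as with other relations
ALTER MODEL my_model OWNER TO analyst;
ALTER MODEL my_model RENAME TO flow_model;
DROP MODEL flow_model;

-- Model metadata, available for management and governance
SELECT model_name, schema_name, owner_name, category, model_type
FROM v_catalog.models;
```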
But essentially, you can use any other SQL front end, or for that matter, any Python IDE that lets you connect to the database. So, exporting a model. First, we create a logistic regression model: we call LOGISTIC_REG and give it a model name, the input relation, which might be a table, temp table, or view, the response column, and the predictor columns. We get a logistic regression model as a result. Now we look at the models table and see that the model has been created; this is a table in Vertica that contains a list of all the models in the database. We can see that the model we just created, my_model, is there with VERTICA_MODELS as its category and logistic regression as its model type, along with some other metadata about the model. Now we can look at the summary of the model. We can look at the details: the predictor, coefficient, standard error, z value, and p value. We can look at the regularization parameters; we didn't use any, so it shows a value of 'none', but if you had used them, they would show up here. There's also the call string, and additional information on the iteration count, rejected row count, and accepted row count. We can also list the attributes of the model, using GET_MODEL_ATTRIBUTE with the parameter model_name set to my_model: for this particular model, it gives us the names of all the attributes that are there. Similarly, you can look at the coefficients of the model in a columnar format: the same query with the attribute name set to 'details', because we want the details attribute for that particular model, gives us the predictor name, coefficient, standard error, z value, and p value. Now we can export this model. We use the EXPORT_MODELS function and give it a path to where we want the model to be exported.
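The export workflow just walked through looks roughly like this; the model name, training table, response and predictor columns, and output path are illustrative stand-ins for whatever appears in the demo.

```sql
-- Train a logistic regression model in-database
SELECT LOGISTIC_REG('my_model', 'train_data', 'is_anomaly',
                    'duration, src_port, dst_port');

-- Inspect it: summary, attribute list, and coefficient details
SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='my_model');
SELECT GET_MODEL_ATTRIBUTE(USING PARAMETERS model_name='my_model');
SELECT GET_MODEL_ATTRIBUTE(USING PARAMETERS model_name='my_model',
                           attr_name='details');

-- Export it as PMML for use by external tools
SELECT EXPORT_MODELS('/home/dbadmin/models', 'my_model'
                     USING PARAMETERS category='PMML');
```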
We give it the name of the model that needs to be exported, because you might have a lot of models that you've created, and we give it the category, which in our example is PMML. We get a status message that the export was successful. Now let's move on to the importing-models example. In much the same way that we created a model in Vertica and exported it, you might want to create a model outside of Vertica in another tool and then bring it into Vertica for scoring, because Vertica holds all of the data, and scoring happens a lot more frequently than model training, so it can make sense to host that model in Vertica. In this particular case, we call IMPORT_MODELS, and we are importing a logistic regression model that was created in Spark; the category is again PMML. We get the status message that the import was successful. Now let's look at the models table and see that the model is really present there. Previously, when we ran this query, my_model was the only entry you saw, but now that this model is imported, you can see it as line item number two: the Spark logistic regression model, in the public schema. The category here, however, is different, because it's not a Vertica-native model but an imported one, so you get PMML, along with other metadata about the model. Now let's do some of the same operations that we did with the in-DB models. We can look at the summary of the imported PMML model: you can see the function name, data fields, predictors, and some additional information here. Moving on, let's look at the attributes of the PMML model, with essentially the same query that we applied earlier; the only difference is the model name. You get the attribute names, attribute fields, and number of rows.
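The import side mirrors the export; the path and model name below are placeholders for wherever the Spark-trained PMML file actually lives.

```sql
-- Import a PMML model trained elsewhere (e.g. in Spark)
SELECT IMPORT_MODELS('/home/dbadmin/spark_logistic_reg.pmml'
                     USING PARAMETERS category='PMML');

-- Confirm it's registered; its category shows PMML, not VERTICA_MODELS
SELECT model_name, schema_name, category FROM v_catalog.models;

-- The same inspection functions work on the imported model
SELECT GET_MODEL_SUMMARY(USING PARAMETERS
                         model_name='spark_logistic_reg');
```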
We can also look at the coefficients of the PMML model: the predictor name, exponent, and coefficient. So it's pretty much similar to what you can do with an in-DB model; you can perform all of those operations on an imported model as well. One additional thing we want to do is use this imported model for prediction. In this case, we do a SELECT PREDICT_PMML and give it some values, using the parameters model_name and match_by_pos. Match-by-position is a really cool feature, and it's set to true in this case. If you have a model imported from another platform in which, let's say, you have 50 columns, the names of the columns in the environment where you trained the model might be slightly different from the column names you have set up in Vertica. But as long as the order is the same, Vertica can match those columns by position, and you don't need the exact same names for those columns. So in this case, we have set that to true, and we see that PREDICT_PMML gives a prediction of one. Here we gave the imported model certain literal values, but you can also use it on a table. In that case, you get the predictions as well, and you can look at the confusion matrix to see how well you did. Now, just sort of wrapping this up, it's really important to understand the distinction between using your models in a single-node tool that you might already be using, like Python or R, versus Vertica. Let's say you build a model in Python; it might be a single-node solution. After building that model, you want to do prediction on really large amounts of data, and you don't want the overhead of moving that data out of the database every time you want to do prediction.
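Scoring with the imported model, both on literal values and on a table, could be sketched as follows; the feature columns, literal values, and table names are assumptions for illustration.

```sql
-- Score a single row of literal values, matching columns by position
SELECT PREDICT_PMML(5.0, 443, 'TCP'
                    USING PARAMETERS model_name='spark_logistic_reg',
                                     match_by_pos='true');

-- Score a whole test table in one distributed pass
SELECT id,
       PREDICT_PMML(duration, dst_port, protocol
                    USING PARAMETERS model_name='spark_logistic_reg',
                                     match_by_pos='true') AS prediction
FROM test_data;
```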
What you can do is import that model into Vertica, and what Vertica does differently than Python is that the PMML model is actually distributed across each node in the cluster. It's applied to the data segments on each of those nodes, with potentially multiple threads running for that prediction, so the speed that you get for model prediction is much, much faster. Similarly, when you build a machine learning model in Vertica, the objective mostly is to use all of your data and build a model that's accurate, not just trained on a sample of the data but on all the data that's available. The model building process goes through the same approach: it's distributed across all the nodes in the cluster, using all the threads and processes available to it within those nodes. So you get really fast model training. But let's say you wanted to deploy the model on an edge node, to do prediction closer to where the data is being generated: you can export that model in PMML format and easily deploy it on the edge node. So this is really helpful for a lot of use cases. Just to summarize the takeaways from our discussion today: Vertica is a really powerful tool for machine learning, for data preparation, model training, prediction, and deployment. You might want to use Vertica for all of these steps or just some of them; either way, Vertica supports both approaches. In the upcoming release, we are adding the model import and export capability through PMML. Initially we're supporting k-means, linear regression, and logistic regression, but we'll keep adding more algorithms, and the plan is to move toward supporting custom models. If you want to do that with the upcoming release, the TensorFlow integration is also there, which you can use; but with PMML, this is the starting point for us, and we'll keep improving it.
Vertica models can be exported in PMML format for scoring on other platforms, and similarly, models that you have built in other tools can be imported for in-DB machine learning and in-DB scoring with Vertica. There are a lot of critical model management tools provided in Vertica, with more on the roadmap as well, which we keep developing. Many ML functions and algorithms are already part of the in-DB library, and we keep adding to that too. So thank you so much for joining the discussion today, and if you have any questions, we'd love to take them now. Back to you, Sue.