Hello, my name is Mark Yarbrough, and this is Prasanth Anbalagan. We're going to co-present today on a machine learning use case in the program management problem domain. This is targeted not at developers or people who are advanced in machine learning; it's more of a walkthrough of an initial use case: grabbing some data, preparing it, and seeing how practical machine learning would be for our use case. I'm with the program management team at Red Hat, and our primary mission is to guide multiple product releases through development and test and through ship. That means there are a lot of dependencies across a lot of diverse systems. Frequently, in fact always, those systems work differently and they don't talk to each other. So the problem we have is how to anticipate and manage risk across heterogeneous systems like that. And that turns out to be complicated by the fact that most of the target dates in play are constantly in flux, so we have to do ongoing risk analysis against a set of changing target dates. How could you deal with a problem like that? Well, if you have a lot of systems that don't work together well, maybe you just go to one system and get rid of the problem that way. That's what we call the "dream on" scenario. You could also assign a dependency management guru to understand how all the systems work and manage them more or less manually. You could write some rules-based code in the form of scripts, plugins, or other applications, or you could try to point some machine learning at the problem. That's what we're going to explore today. For our use case, is machine learning a viable option? Do we have the data that we would need for the training that machine learning requires to be effective? If we do have that data, what steps would we need to undertake, as non-developers and non-data scientists, to get it ready for the process? And once we have some training data prepared, how would we get that into a training algorithm? That's where we leaned on the AI Center of Excellence with Dr. Anbalagan. Once we answer those questions, we would make a determination: do we want to add machine learning to our risk analysis process? So what we'll do today is I'll talk a little bit about the problem space we have; you may see that it resembles a problem space that you know, or possibly not. Then we'll go through preparing the training data for the machine learning process. I'll go through the first pass in a fair amount of detail, and it won't be super technical; it's just showing you the practical steps that we actually took to get the data ready. Then, if we have time, we'll talk a little bit about some of the iterations: once you do the initial data prep, what do you want to consider doing next, and how many times will you iterate? Then Prasanth is going to talk more about the actual training: the libraries that do the training and model development, how you measure to know when you're done, and how that fits into the AI Library strategy. So let's look at the problem space, and let's start out simply. Let's say that you've got an issue tracking system. We'll just pick a generic system and say that, in the hundreds of thousands of issues that could be in this issue tracking system, maybe one of them represents a new capability, a new capability that you care about for whatever reason.
When I talk about these systems, some examples would be JIRA, Bugzilla, GitHub or GitLab, Trello, Rally, and there are lots of others as well. Now, just to complicate things, there's probably not one system; there are probably at least two. And within those systems, what they're trying to achieve is pretty much identical in function and different in every detail. So for example, in Bugzilla, a new capability might be represented by something called a future feature, which would have a state: what state is the development of that future feature in? And it would be complemented by custom flags and other attributes that tell you more about that state. In JIRA, you might have something called an issue type of feature request. It would have a combination of status and resolution that is roughly equivalent to the Bugzilla state, but of course different. And then you would have custom workflows and other attributes wrapped around that issue to tell you how it's doing. You could probably have one or more schedules that want to incorporate both of these new capabilities. I've drawn them as dotted lines here to illustrate that you have a schedule, the schedule has a milestone, and success at that milestone depends on both of these new capabilities being ready. But it's a dotted line because it turns out tracking system A, tracking system B, and the scheduling repository don't talk to each other. So we know that there's a dependency, but there's no easy way to link the dependency in a way that's predictable. Let's extend the problem a little further for this one case and say: what if we were to abstract into a logical table all of the dependencies, or at least all of the critical dependencies, on new capabilities associated with this schedule? Then we could have a table that, at the very least, would need to contain a pointer to the new-capability issue in its native tracking system (that's why the line is no longer dotted), and it would need to be conditioned by the target date: when do we need this thing to be ready? So we need to know at least those two pieces of information. And then what we want to understand is, based on those two pieces of information, what is our risk at this moment? Well, it turns out there's not just one schedule in flight. In our domain there are usually on the order of 80 to 100, and depending on how granular you want to get, there could be hundreds of schedules. So there could be hundreds of tables. And then, just to spice things up, there's usually more than two systems. At Red Hat, we use Bugzilla and we have several JIRA instances. A lot of these have been in production for years or decades and have hundreds of thousands or millions of issues relating to community development and also product development. So now let's take a look at, given that problem domain, how we go from having a use case, having heard about machine learning, to figuring out whether we're a viable candidate for machine learning. This is where we conferred with the AI Center of Excellence, understood a little bit about the facilities that were available, and then made a few early decisions. Now we're going to go through some slides that are fairly detailed about just the mechanics. This is not theory. This is not advanced. This is just the mechanics that we followed to get this data ready.
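(Purely for illustration, here is a minimal sketch of what one row of that logical dependency table might look like in code. The field names and URLs are hypothetical, not the team's actual schema.)

```python
# Hypothetical sketch of one row of the logical dependency table described
# above; field names and URLs are illustrative, not the team's actual schema.
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class Dependency:
    issue_url: str                 # pointer to the new-capability issue in its native tracker
    target_date: date              # when the schedule needs this capability to be ready
    risk: Optional[float] = None   # what we ultimately want to estimate

schedule_dependencies = [
    Dependency("https://issues.example.com/browse/FEATURE-123", date(2019, 3, 1)),
    Dependency("https://bugzilla.example.com/show_bug.cgi?id=456789", date(2019, 3, 1)),
]
```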
But in a nutshell, we're going to be looking at the case where we need to find or fabricate a large data set to feed the machine learning algorithms. We don't really know beforehand how large the data set needs to be. We're guessing that 2 million records is probably great, and we're guessing that 10 records is probably not enough, but we don't really know until we do some training how many we actually need. And just to finish that thought: we're going to be munging this data, so I'd like it to be human readable. I'd like to be able to look at it and understand what I'm looking at. But at the same time, by the time we get it into a machine learning algorithm, it's going to have to be in a format that the algorithm understands. We're going to do the following example using no programming at all, just munging things around in a Google Sheet. If you remember the picture we looked at, the one with the fairly complex problem domain, let's zoom in on one of the dependencies and one of the tables associated with one of the schedules. Remember, that table had a pointer to the issue and it had a target date: when do we need this thing to be ready? And what we want is the risk. So if we look at the transformations we're going to need to do: to train, we're going to need to create training samples, and we're going to need to provide a value for the risk to check the training against. Later on, when we're doing prediction, we're going to transform the information into the same format, but what we want as output is the risk. So just keep that in mind as we look at the next few pictures. Of all those different systems that we've got, let's choose one. The first one we chose, in this case, is the JBoss JIRA instance. This has been in production at Red Hat for over 10 years, for both community and product work, and it has hundreds of thousands of issues in it. We're talking about new capabilities, not bugs, not tasks, but new capabilities: things like a new API or a new protocol, something that your product may want to leverage as soon as it's ready. In JIRA, the rough mapping of new capabilities to issue types would be feature requests, business requirements, and enhancements. And because there are so many of these issues in the system, we're going to try just grabbing the ones from 2018, the newest ones, and we'll see if we get enough data. So this will be anything that was opened after January 2018, and we took the snapshot in November 2018, so we're getting almost a full year. And we're getting a variety of states: some of them may be open, some of them may be in test, some of them may have been rejected. So hopefully we're going to get enough data, and we're going to get a spread of values that will be representative for a training set. The next step, after choosing the data source for our issues, is to go grab the data. This is getting the first raw data that we're going to start munging. In this case, we just went to the JIRA system and issued a JQL query to get all the feature requests, business requirements, and enhancements since the start of the year; recall, this was issued in November 2018. So we get that bucket of issues. We definitely want the status and the resolution, because between the two of them, they pretty much tell you the state of a JIRA issue. And there are probably other fields of interest: we probably want to know the priority, and there could be other things like target release, due date, et cetera. So we want to grab the things that make sense, but we don't want to grab everything in the JIRA, because there's a lot.
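(As a rough sketch of that "grab the data" step, a similar query could be issued against JIRA's REST search endpoint instead of exporting by hand. The URL, issue type names, and field list below are placeholders for whatever a particular instance actually uses, not the exact query from the talk.)

```python
# Hedged sketch: pull the 2018 new-capability issues from a JIRA instance with
# a JQL query through the standard REST search endpoint. The URL, issue type
# names, and field list are placeholders.
import requests

JIRA_URL = "https://issues.example.com"
JQL = ('issuetype in ("Feature Request", "Business Requirement", "Enhancement") '
       'AND created >= "2018-01-01"')

issues, start_at, page_size = [], 0, 100
while True:
    resp = requests.get(
        f"{JIRA_URL}/rest/api/2/search",
        params={
            "jql": JQL,
            "fields": "status,resolution,priority,created,duedate",
            "startAt": start_at,
            "maxResults": page_size,
        },
    )
    resp.raise_for_status()
    batch = resp.json()["issues"]
    issues.extend(batch)
    if len(batch) < page_size:   # last page reached
        break
    start_at += page_size

print(f"fetched {len(issues)} issues")
```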
Export all of those to a CSV file, import that into a Google Sheet, and that's where we can begin to prepare the data for the training process. We executed that query and we got 5,600 issues. Is that enough data? We don't know. It's less than 2 million and it's more than 10; once we run the training, we'll find out whether it's enough or not. Here's a detailed slide of what we did: the JIRA query showing the different issues that we got and the fields. Dump that to a Google Sheet, and then you can start to see, at a glance, some of the parameters we're going to be dealing with. The priority appears to be available for most of the issues, so it could be important. The status is present for all of the issues; it's definitely important. Resolution is sometimes there; it's a complement to status, and you need both of them to know the actual state of a JIRA issue. But now we're starting to see an interesting pattern: for a lot of the cases, the resolution isn't there yet. It's too early; they haven't filled it in. So we've got some data that's important but sparse. Something else that might be interesting is the creation date. We haven't really figured out how to use that yet, but it may matter, in terms of its state, whether a JIRA issue has been open a year versus a day. And then there are things that might be interesting, like target release. A quick glance tells us that while that might be useful information, we can't rely on it; it's too sparse, so we throw it out to get started. So let's do our first pass at preparing the data that we grabbed. What we're going to do is ask Prasanth what kind of training we're going to be doing. He's going to say we'll be doing supervised learning, linear regression, so we need to target that as our format. And we're going to grab, just to start, the status and resolution, just to calibrate the process: if we get this data ready for a machine learning algorithm, will we get a result that's useful? In our original query, those 5,600 records, it turned out there were 30 separate and distinct values of status and 15 values of resolution. So we convert those to the one-hot format that we talked about. Again, on the top is just a screenshot of a Google Sheet; we're doing this directly in a Google Sheet, creating these one-hot values. The green 30 columns on the left are the status values, the 15 columns on the right are the resolution values, including blank or not available, and then we have to assign the risk in that last column. We export that to a CSV file, which is shown on the bottom: 5,600-plus lines of the zeros and ones that you see.
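(For reference, the same one-hot preparation we did by hand in the Google Sheet could be scripted. Here is a minimal pandas sketch; the file and column names are illustrative, and it assumes the risk label has already been assigned in the export.)

```python
# Minimal sketch of the one-hot preparation done by hand in the Google Sheet.
# File and column names are illustrative; assumes a manually assigned "risk"
# column is already present in the export.
import pandas as pd

df = pd.read_csv("jira_export.csv")

# Treat a missing resolution as its own category ("blank / not available").
df["Resolution"] = df["Resolution"].fillna("NONE")

# One-hot encode status and resolution: roughly 30 status columns plus 15
# resolution columns of zeros and ones, one row per issue.
encoded = pd.get_dummies(df[["Status", "Resolution"]]).astype(int)

# Carry the risk label over as the last column of the training set.
encoded["risk"] = df["risk"]

encoded.to_csv("training_data.csv", index=False)
```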
We give that to Prasanth for training, and he'll talk more about that. All right, there are various iterations in the presentation that you can review, but we're going to turn it over to Prasanth now.

Thank you, Mark. Oh, I feel like a news anchor with all these wires coming out. So I'm Prasanth, and I'm from the AI Center of Excellence team at Red Hat. We have my wonderful teammates out here, so questions can be directed to them as well. Mark walked you through the first half of the process, when you're trying to move from a use case to a machine learning solution that addresses a particular problem. That's more the user's perspective: you come up with a problem definition and then prepare the data that you feel is relevant to addressing that problem. The next half of the journey is where you put on the hat of a data scientist and ask: what approach do I select? By approach, in simple terms, I mean what statistical algorithm do I use, and then how do I validate it? Say the algorithm works fine on my current data set; will it work on future data sets? How do I test that? And then you go in a loop and keep improving the model until you get a prediction accuracy that fits your process. Once the model is ready and it works fine, it's handed over to the team; that's more like model serving, where they start integrating it into their existing process. So how do you pick an approach to solve a particular use case? Well, you start by looking at the data and asking: is it well defined? Can I label the data? Do I know there's some kind of relationship across the different features, the input and the output columns? If you have that, then it's more like supervised learning. But if you don't have good knowledge about the data, it turns more toward unsupervised learning, where you're applying techniques to learn more about the data. And say you do choose a model and decide, OK, I'm going to use random forests or deep neural networks. Then you have to go implement the code, pick the infrastructure to deploy it, train it, and then go through the process of serving it. This is where AI Library comes in. It's an open source collection of AI components: pre-packaged machine learning algorithms and solutions for common use cases we found at Red Hat. It not only comes with the algorithms, it also has supporting libraries that let you handle the deployment, the storage, and the other subtasks that revolve around a typical machine learning experiment. AI Library is part of a much bigger project called Open Data Hub, which is a machine-learning-as-a-service platform built on top of OpenShift and Kubernetes. I'll explain AI Library in detail, time permitting, at the end of this talk, but this is the structure of AI Library; I thought it would be cool to point that out. So let's focus on the list of algorithms that's present in AI Library at the moment. For the data set that Mark handed over, it was definitely supervised learning, because he had cleaned up the data to make my life easier. In supervised learning, any time you're predicting a number, a real numerical value, the easy way to start is regression; that tells you whether there's a linear or a non-linear pattern within your data. So what we did here: we used correlation analysis to study the data that Mark handed over, and we found that there was a strong linear relation between the input columns and the output column. So we decided we had to move from correlation analysis to regression, because correlation analysis just tells you that there is a strong relation; it doesn't have predictive capabilities. So we had to implement a linear regression model. Here's a graph that shows what linear regression is. A very simple case is you have one input column and the output, and you're trying to see whether the input can predict the output: if the input increases, can I say the output increases as well, and if the input decreases, can I say the output is going to decrease too? So in the simple case, you would have one input column and one output column.
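(Going back a step, here is a minimal sketch of the correlation check just described, assuming pandas and the illustrative training CSV from earlier; it is not the team's actual analysis code.)

```python
# Minimal sketch of the correlation-analysis step: check whether the one-hot
# input columns show a linear relationship with the risk label before
# committing to a regression model. File and column names are illustrative.
import pandas as pd

data = pd.read_csv("training_data.csv")

# Pearson correlation of every input column against the risk column.
correlations = data.corr()["risk"].drop("risk")

# Show the strongest relationships first.
ranked = correlations.reindex(correlations.abs().sort_values(ascending=False).index)
print(ranked.head(10))
```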
In our case, though, there was more than one input column, something close to 50 or so, and we had to predict the risk value. So it ended up being a multivariate regression model, meaning I had multiple input columns and had to predict the risk, the output. Note the "days left until deadline" feature. Initially we thought it would be a time factor, the raw number of days, but then we turned it into binary data: we classified it as, is the deadline within two days, is it within a week, or is it within a month? You could actually go with logistic regression if you treat it as binary data, where you're not fitting a straight line but more of a squiggly curve. But in this case we knew there was a linear relationship, so we implemented a linear regression model using scikit-learn in Python, and that was the output. All the coefficients you see there go with each of the input columns used in the data. OK, so now you have the model; how do you know it's good? To start with, we divided the data set into two buckets, 80% and 20%. The 80% is the data you use to learn the line that best fits your data, and the other 20% is what you use as a validation or prediction data set: can I take this model, apply it to the remaining 20%, and see whether it fits well? How do you validate that? There are various measures of predictive accuracy, but the simplest one is to just look at the predicted value and see how close it is to the actual value. That's called the absolute error, and it should be as close to zero as possible. In our case, it turned out to be 0.09, so that was fairly good. So now we had decided the approach, we had the model, and it works on our data. What's left? We had to hand it over to Mark so that he can integrate it into his product, or into his process. And we also wanted to take the algorithm, the model, and push it into AI Library, so that any user who wants it for their own use case, not necessarily risk analysis but any regression model, can just leverage it: send data to the AI Library deployment and get the results back. Do we have time? So let me explain the workflow with AI Library. This is the simple serving model that we have: you save the data to S3-compatible storage, then run the model, and then you can get the results. We support any S3-compatible back end, like Ceph or AWS. The models themselves you can either run as Python modules, or you can use the REST APIs we provide to trigger the machine learning code, and they get executed as jobs on a container application platform. Once the training or the prediction is done, the results get pushed back to the storage. This is more like a sequence diagram: the user saves the data and invokes the action through OpenWhisk. If it's a fairly simple thing, like polling the status of your job, OpenWhisk does that by itself; if not, it hands the job over to OpenShift, executes it, and pushes the results back to your storage. If you want to learn more about all the other algorithms that are in AI Library, you can go to opendatahub.io, and you'll find all the documentation over there.
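(Pulling the modeling and validation steps together, here is a minimal scikit-learn sketch of the 80/20 split, the multivariate linear regression, and the absolute-error check described above. File and column names are again illustrative, not the team's actual pipeline.)

```python
# Minimal sketch of the train/validate flow described above: an 80/20 split,
# a multivariate linear regression over the one-hot columns, and mean
# absolute error as the accuracy check. File and column names are illustrative.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

data = pd.read_csv("training_data.csv")
X = data.drop(columns=["risk"])
y = data["risk"]

# 80% to learn the fit, 20% held back for validation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)

# One coefficient per input column, like the output shown on the slide.
print(dict(zip(X.columns, model.coef_)))

# How close are the predictions to the actual risk values?
print("mean absolute error:", mean_absolute_error(y_test, model.predict(X_test)))
```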
Note that Mark also mentioned, in his first slide, that you could define a set of rules to figure out whether one input column can dictate an output column. You could use association rule learning for that, but that works for a well-defined set of associations; for things that are too varied or too random, you need more of a machine learning process. So finally, we come to the conclusion, and I'm on time. Do you want to go? Yeah, so the conclusion: did this approach work, did machine learning work for the program management example? Yes, to start with. Why do I say "to start with"? Because for the data set that we have, at the level of abstraction we chose, it did work. The next step is to keep expanding the data and include more and more of the details, so that we capture the entire process the program management team is working on, and keep improving the model. It's an iterative process: you prepare the data, pick the approach, train and test the model, get the results, and keep going in a loop. So in this journey, moving from a use case to a machine learning solution, we learned a few things, and I'll expand on the modeling part. When it comes to the approach, always go from the simple to the more complex. Even if you know it's supervised learning and you want to use deep neural networks, if you go from the simple to the complex, life will be easier. And validation is the most important part. Once you implement the neural network, don't just put it on your resume or go to your boss and say you've done it: validate your model, because that's the most important part. Does it work or not? And then keep improving until you get to your expected accuracy levels. OK, yeah, thank you.