So, hey guys, can everyone hear me okay? Awesome. As I was just introduced, I'm Anish Astana, a software engineer on the Data Hub team in the AI Center of Excellence at Red Hat. And I'm Daniel Wolfe, senior data engineer within the Products and Technologies organization. We're here to talk to you about machine learning with open source infrastructure. More specifically, we'll be talking about the Open Data Hub and one of our internal users, the Grokket team.

The Open Data Hub was originally started to address some internal problems we were facing around machine learning at Red Hat. Instead of having a number of different teams manage their own infrastructure and develop their own support mechanisms and security policies, we figured it made more sense to have a centralized location, called the Data Hub, where teams could share data and run their ML and AI workloads. That way we could free our end users from having to worry about the complexities of managing these systems and let them focus on what is really interesting to them: experimentation, getting insights, building cool stuff. The Open Data Hub is a natural evolution of that. It is a meta open source project that brings together a number of open source technologies for data and machine learning pipelines, all running on Kubernetes, or in our case, OpenShift. One important thing to note is that it's cloud agnostic: you can run it anywhere you can run OpenShift. So if you want to run on AWS, on-prem, or on GCE, feel free to do so, or on all three of them at the same time. You can find the community at opendatahub.io.

This is a reference architecture diagram for the Open Data Hub. I know it looks like a lot, but I'll walk you through it. The components in this diagram can be broken into roughly three categories. The first category is technologies or projects we are looking into and vetting to see if they would really fulfill the needs we have; things like Red Hat Data Grid or Three Span would fit in there. The second category is technologies we have proved work for us internally and meet our requirements, but that we haven't yet integrated into the Open Data Hub operator; Elasticsearch or Hue, for example. And the third category is technologies and projects that we've proved work and that are part of the operator.

If you draw your attention to the bottom of the image, you'll see OpenShift. OpenShift is Red Hat's enterprise Kubernetes distribution, a container orchestration engine, and that's what we're running the Open Data Hub on. This is what lets us scale up to meet any requirements and run on any cloud.

Moving up a little bit, you can see the data engineer persona on the left. Data engineers are responsible for building out big data infrastructure. What this really means is that they're responsible for developing and planning out systems that can incorporate data from different sources and store it in one location, or several, for their users to work with. Data engineers are generally dealing with two main types of data: data in motion, which is data flowing from outside your system into your system, from point A to point B; and data at rest, which is data already in your system, where the question is how you store it effectively.
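Before we name the specific technologies, here is a tiny hypothetical sketch of what the data-in-motion side can look like in Python. It assumes the kafka-python package; the broker address, topic, and payload are placeholders for illustration, not the Data Hub's actual pipeline code.

```python
# Hypothetical "data in motion" sketch: publish events onto a Kafka
# topic and read them back. Broker and topic names are placeholders.
import json
from kafka import KafkaConsumer, KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="kafka.datahub.svc:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("system-logs", {"host": "app-01", "msg": "service started"})
producer.flush()

# Downstream, a consumer (say, a job loading the data lake) reads the
# same topic; each batch would then be written out to storage at rest.
consumer = KafkaConsumer(
    "system-logs",
    bootstrap_servers="kafka.datahub.svc:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    consumer_timeout_ms=5000,  # stop iterating once the topic is drained
)
for message in consumer:
    print(message.value)
```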
So, with those two categories defined: we are using Kafka, Logstash, and Fluentd internally, and as part of the Open Data Hub operator you can use Kafka. From a storage perspective, we settled on Ceph as our data lake for unstructured data, and we're looking into Red Hat Data Grid as an in-memory storage option. For your structured data, you can really use any database you like; we use PostgreSQL for the most part.

Question from the audience: do you need at least one component from the data-in-motion layer and one from the storage layer? Yes, and you can have more. Depending on your infrastructure, you may have multiple. Some of our internal users have systems with integrations built around syslog, so it made sense for us to incorporate that. But you really only need one from each layer.

Once you have your data in your data lake, you're not done, right? You need to actually get some insight or value out of that data, and this is where your data scientists and your business analysts come into the picture. If there's just some light visualization work they want to do, to get a very rough idea of what the data may look like, they can use projects like Hue, Kibana, or Superset. But if your data scientist wants to do something a little more involved, like processing data, transforming it, or starting to train models on it, they can use JupyterHub alongside Spark, with Python or whatever language and machine learning libraries they want. Once your data scientist is done creating a model, they probably want to deploy it somewhere so that end users can test it out. For that, we're using Seldon to serve the models our data scientists have developed.

The final bit I want to talk about in this portion is the Open Data Hub AI Library. The AI Library is a set of pre-built statistical and machine learning modules that any user can download and quickly get started with for rapid prototyping. Say I'm not a data scientist, but I have a lot of data and I want to test out what I can do with it: I can download the AI Library and get started with some prototyping work very easily.

Now, if you look over at the other side of the diagram, you'll see the data steward. These folks are responsible for restricting access to your data and to your platform; they're interested in authentication and credentials. (Question from the audience about JupyterLab: yes, we actually have it running, and you can deploy it as part of the operator. Just to clarify, we have Jupyter Notebook, JupyterHub, and JupyterLab.) So, your data stewards are concerned with restricting access to your data. For them, we're currently using OpenShift OAuth and the Ceph object store for credential management, and in the case of OpenShift OAuth you can integrate with LDAP, so it's very easy to restrict access to components which are sensitive.

Finally, you have your DevOps engineers. These are the folks responsible for keeping the lights on, so they're interested in monitoring the health of the system: seeing if something is down, and how the system is performing, with metrics.
For that use case, we settled on Prometheus and Grafana, since they're a pretty easy way to scrape metrics and visualize them. The last thing I want to talk about here is jobs. Your users may have jobs they need to run on a semi-regular basis, and they can vary in scope from something simple, like backups for your Elasticsearch indices, to something a lot more complicated, like a multi-stage data transformation for your machine learning modules. You could use cron jobs for that, but they're not very robust and not very reliable: if they fail, you have no idea what happened. To that end, we settled on using Argo and Jenkins to manage our jobs.

This isn't just a vision, right? We do have components already integrated, and if you were to go to the GitLab repository and download the operator, here's what you can deploy out of the box: Prometheus and Grafana for monitoring; Seldon, Spark, and JupyterHub for data processing, analysis, and model serving; and finally Ceph and Kafka for your data-at-rest and data-in-motion needs. And to answer the earlier question more generally: you don't have to deploy any component here if you don't want to. If you like JupyterLab a lot and you don't care about JupyterHub, you can just tell the operator not to deploy JupyterHub, and JupyterLab will slot in there perfectly. The operator just makes it easy to deploy and get a prototype going.

Next, I'll talk a little bit about some of our practical deployments. The first one is the Massachusetts Open Cloud, or MOC for short. The MOC is a collaboration between a number of universities in the greater Boston area, as well as some industry partners, to create an open public cloud for researchers in academia, non-profits, and industry to collaborate and innovate on. We have an Open Data Hub deployment in the MOC to provide a platform for these researchers and non-profits to develop AI services on, again collaborating and driving value for them. As a side note, if you are a researcher or a non-profit in the greater Boston area and you're interested in working with the MOC, feel free to reach out; we can put you in touch with some folks there.

The second deployment is the internal Data Hub, which is where it all started. We have three main goals internally. The first two are somewhat linked, so I'll talk about them together: we want to serve as customer zero for any new Open Data Hub components, and we want to prove that the Open Data Hub can run in a highly available manner at scale. To expand on that, running and installing all of these new technologies and projects on OpenShift isn't always the easiest thing to get started with. Working through the kinks that come with getting started, and sharing that knowledge upstream, makes life easier for everyone in the community. Going on from there, running at scale requires very specific configurations which may not be obvious initially; there are specific things you have to do. So as we have processed and stored large volumes of data for internal customers, we've learned a lot of lessons, and we've been contributing them upstream to the Open Data Hub community. Finally, we also want to help drive teams at Red Hat to be more data-centric.
And to that end, we're creating blog posts, talks, videos, and demos. I'm sure you've seen some talks about the work people have done, and you'll probably see more of them over the next few days. Some of these use cases are things like hard drive failure prediction, for example. These are useful applications, and there's a lot of data, but not every team realizes they can build these sorts of applications.

On that note, I'd like to talk a little bit about some of our internal customers. The first one I want to touch on is the Products and Technologies DevOps team. They have applications in their build and product release pipelines that generate a lot of logs, and all of these logs are stored in the Data Hub. As I'm sure most of you know, not all logs are created equal: some of them actually really matter, while most are just debug noise that no one cares about, until that particular component breaks. So detecting these important logs, anomaly detection, is something that some of the PnT DevOps teams are engaging with the AI Center of Excellence on. If you're interested in learning more, there's a talk later this afternoon by Michael Clifford and Zach Hassan about their experiences building an anomaly detection model for that.

We also have cluster operational metrics for OpenShift clusters flowing into the Data Hub for the telemetry project. This is all information about how the clusters are behaving once they're deployed and what the health of the system looks like, and it's helping OpenShift product managers and tech leads make decisions on what future work to do, what features to prioritize, and what people don't actually care about. On a somewhat similar note, we also have a lot of customer insights data in the Data Hub. This is data from things like Red Hat Insights or sosreports from customers, and we have a number of teams internal to Red Hat using that data to improve their decision making: knowing what work to prioritize, what customer or ISV partnerships to pursue, and things in that vein. Again, making smarter decisions with that data. One of the teams working with the Insights data is the Grokket team, so I'll hand it over to Daniel to talk about that.

All right, thanks, Anish. So again, my name is Daniel Wolfe, senior data engineer, and you're kind of getting a two-for-one talk, because we heard about the Open Data Hub infrastructure from Anish, and now we're going to hear about an actual use case that we're building with the Open Data Hub. I'm going to cover some slides with background, get into a little bit of detail, and then transition to a Jupyter notebook and look at some code.

The name of our internal application is Grokket, G-R-O-K-K-E-T, and it comes from the word grok, which means to understand intuitively. I wanted to ask, by a quick show of hands, who has heard of the term grok? It is an actual word, relatively recent. Okay, I'm impressed. So what we are interested in understanding better with Grokket is workload adoption across Red Hat products. By workload, I'm referring to categories of applications, like database or software development, and there are various key workloads that we're interested in tracking.
And so then we can have a better understanding of how customers are using OpenShift and RHEL. So why are we doing this? Here's an example business question that justifies the value of this effort: what software vendor partnerships should Red Hat create or enhance? The traditional approach would be interviewing customers and asking what they're running, or looking at online research like Gartner's. But with Grokket, we have quantitative data that helps us answer this question, so we know which vendor to prioritize for OpenShift certification or a co-marketing campaign.

So how does it work? The one-line answer is that Grokket works by grouping the running processes on a system into clusters based on their similarity, where ideally each cluster corresponds to a running software application on that system. Our data comes from customers that have opted into Red Hat Insights, which is shipped by default with RHEL 8. We get data from thousands of systems, and any particular system can have thousands of running processes. With that multiplier effect, this is really ideal for a machine learning approach as opposed to a manual review.

So let me take a step back and give a brief overview of the k-means clustering algorithm, which is the approach we're taking. K-means clustering intelligently groups data into k clusters based on the similarity of their features. It's a fairly popular algorithm, and one of its more well-known use cases is recommendation engines, for example recommending movies or songs. Say you're a streaming service like Netflix, a user just watched a movie, and you want to keep them engaged: how are you going to figure out what to recommend? What you can do is take all of your movies and all of the attributes of those movies, feed them into the algorithm, and it will automatically cluster similar movies together. So if you watched Avengers: Endgame, you might also like The Dark Knight. That's a somewhat trivial example, because those are both really popular action movies, but the algorithm can surface relationships that are not apparent to the naked eye, and it can do it faster and at scale. I'm not going to go into too much detail about how the algorithm works under the hood (you might need to be a mathematician to truly do that), but it is based on calculating the distance between numeric features, and based on how close or far apart those features are, it groups the points into clusters.
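To make that movie example concrete, here is a tiny scikit-learn sketch of the idea. The features and numbers are invented purely for illustration; they are not real data or the approach Grokket uses for features.

```python
# Minimal k-means sketch: cluster "movies" by toy numeric features.
# All values here are made up for illustration only.
import numpy as np
from sklearn.cluster import KMeans

# Rows: movies; columns: toy features (action score, runtime in hours,
# release year). Real pipelines would scale features before clustering.
movies = np.array([
    [9.0, 3.0, 2019],   # Avengers: Endgame
    [8.5, 2.5, 2008],   # The Dark Knight
    [2.0, 1.5, 1994],   # a quiet drama
    [1.5, 2.0, 1995],   # another quiet drama
])

km = KMeans(n_clusters=2, random_state=0).fit(movies)
print(km.labels_)  # movies sharing a label land in the same cluster
```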
Back to our use case with Grokket: we are intelligently clustering processes belonging to the same software application. Here I've got a k-means clustering visualization with k set to three, which means there are three clusters, color coded beige, blue, and green. Each black dot in the visualization is a running process on a system. Again, like I mentioned earlier, we can feed in millions of running processes and the algorithm will group them into similar clusters. Let's take a deep dive into this beige cluster as a simplified example. I've got four running processes shown here, and the end result of the clustering is that all of these processes have similar words and terms. You can see some of the similar terms for these processes: /bin and /fluentd are common across all of them, as well as ruby. The algorithm has determined that these processes share these differentiating terms, so it puts them into the same cluster.

Question from the audience: is it literally just the process strings? Yep, it's just the process strings, and when we flip to the Jupyter notebook, I'll show a little bit more about how we do that. Now that we have this cluster, we can do a little bit of manual research, and with some domain expertise we know that these processes all represent the Fluentd unified logging application. So we know that if we scan a system and see a process with a pattern of ruby and then bin/fluentd, that system and that customer are running the Fluentd application. So we can label the beige cluster Fluentd; a couple of other examples are Splunk and MongoDB.

Is there another question? Yes. Well, think about the fact that we're getting data from thousands of systems on a daily basis. If we were looking at a pure frequency count of these processes, there's enough variation in the process strings that the pattern would not become apparent. But the clustering can group them together, so we know, okay, this is showing up on a ton of systems, we don't know what it is, but it's one of the most important clusters, and then we tie it back to the Fluentd application. And from there, you have a string you can search all future systems against, on a daily basis, for Fluentd. Does that answer your question? Right, the data is huge, and there's enough variation in the process strings that it would be impossible to do with the naked eye or with a frequency count.

The end goal is to be able to model the presence and prevalence of workloads and software applications running on customer systems. Whereas before we'd have to interview a customer or do online research, this way we have a data-driven approach to understanding the software and workloads that customers are using with RHEL and OpenShift. Thank you very much. Just wrapping up this slide: all in all, we have over 200 clusters that we've tied back to software applications, and we can do the modeling from there.

Another question. (I was told to repeat the question for the video.) The question was: how do you choose the value of k? There are several approaches; we use one called the elbow method, and I can show that to you in the Jupyter notebook, which we'll get to in just a couple of minutes.

Okay, before we transition off the slides, I wanted to tie this back to Anish's Open Data Hub presentation. Again, we're building this with the Open Data Hub: we use Ceph for our data storage, and JupyterHub is our coding environment. Both are deployed on and orchestrated by Red Hat OpenShift, and since his team has set it up for us, that's all behind the scenes to us. We're just using JupyterHub, and we pull the data from Ceph into JupyterHub using the S3 file system protocol, which I think is a key point, because it reduces the barrier to entry for using the Ceph storage: the S3 protocol is the same as you have with AWS S3, so you don't need to learn any new packages.
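As a concrete illustration of that point, here is a minimal sketch of the same boto3 calls you would use against AWS S3, just pointed at a Ceph endpoint. The endpoint URL, bucket, prefix, and credentials are placeholders, not the real internal values.

```python
# Hypothetical sketch: the standard boto3 S3 client works against Ceph
# once you point endpoint_url at the Ceph gateway instead of AWS.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://ceph-rgw.example.com",  # Ceph, not AWS
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

# List one day's uploads under an assumed year/month/day layout.
resp = s3.list_objects_v2(Bucket="insights", Prefix="processes/2019/05/07/")
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```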
So you can use the most popular packages, like boto3, or read in PySpark S3 paths just like you would from AWS. And that's what we do: we use PySpark data frames to read in the data and run some preprocessing steps, and then we use the scikit-learn package, which comes with k-means, to run the ML algorithm. That's what I have for the slides, so without further ado, we can switch over to the code. Forgive me while I get it loaded.

(While this loads: yes, it spins up a new pod every time you start your server, and you can see it through the OpenShift UI. How it works is that you have your JupyterHub instance running, and that's like your master. Every time a user logs into that interface and says "start my server," they select what sort of Jupyter image they want to use, say with TensorFlow, scikit-learn, or Spark already installed, and that spins up a new pod based on that image.)

Okay, I'm just going to step through some code here. It does take a little while to run even with a fairly small sample size, so I'm not going to run it in real time, but I will scroll through. First we import our key packages, and then comes the Spark context setup. This is where we put in the S3-style credentials, but for Ceph storage: we point the endpoint URL at the Open Data Hub and set the access key and secret key. Then we set our key path and read in the data. It's partitioned by year, month, and day, so we can use wildcards or basic regular expressions to capture a certain range of months, or wildcards to capture all months or all days, and read it into a data frame.

This is a sample of what the raw data looks like: an attachment ID for the Insights upload, and then the process command. These are all the running processes, a very small sample of what we have out there. Some of them are short, like etcd and sleep, but most of the others are much longer and just truncated here. We then do some preprocessing steps to bring the data from data frame format into a list.

Question from the audience: is the attachment ID unique, or does it act as an index? It's a unique identifier for that particular upload, so yes, we could tie it back to an account. But it's not really acting as a key index; we don't have a key index for this data frame. We're just taking the processes out of the data frame and putting them into a normal Python list. So the attachment IDs are not unique in the data frame, and the process list is not unique either.

Back to the preprocessing steps: we remove some special characters that are not relevant to the analysis, we replace all the forward slashes with spaces so that each term becomes a separate word, and we remove trailing and leading whitespace. Printing out all the processes, this is how it looks after those preprocessing steps: just a lot of process strings. And back to the question from earlier: we aren't looking at memory or cache or anything like that, just purely the process strings.
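Here is a rough sketch of the read-and-clean steps just described. The file format, bucket layout, and column name are assumptions for illustration, and the endpoint and credentials are placeholders; the real notebook's paths and schema may differ.

```python
# Sketch of the setup: S3-style credentials pointed at Ceph, a
# wildcarded partitioned read, then the pre-processing steps.
import re
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("grokket").getOrCreate()
hconf = spark.sparkContext._jsc.hadoopConfiguration()
hconf.set("fs.s3a.endpoint", "https://ceph-rgw.example.com")  # Ceph, not AWS
hconf.set("fs.s3a.access.key", "ACCESS_KEY")
hconf.set("fs.s3a.secret.key", "SECRET_KEY")

# Partitioned by year/month/day, so wildcards select a date range.
df = spark.read.parquet("s3a://insights/processes/year=2019/month=*/day=*/")

# Pull the command strings out of the data frame into a plain list.
processes = [row["process"] for row in df.select("process").collect()]

def clean(cmd: str) -> str:
    cmd = re.sub(r"[^\w/ .-]", " ", cmd)  # drop irrelevant special chars
    cmd = cmd.replace("/", " ")           # each path segment becomes a word
    return cmd.strip()                    # trim leading/trailing whitespace

processes = [clean(p) for p in processes]
```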
I also mentioned earlier that k-means calculates its clusters based on the distance between numeric features, so we do need to convert the process strings into numeric vectors. For that, we're using the TF-IDF vectorizer; as you can see here, it's just two lines of code. There are various approaches to doing this in the scikit-learn package. We settled on TF-IDF because some of the others are more geared towards natural language, and in this use case we're dealing with machine language, so we preferred one that does not focus on natural language processing. All of the process strings get converted into numeric vectors at this step.

All right, here's the question from earlier about optimizing k. You do need to have k specified when you train the model, so this is kind of a brute-force approach to optimizing it, referred to as the elbow method. You choose how many clusters you want to go up to and a range for the value of k; in this example, we go from one to 200 clusters. This is the step that takes a while, because it trains a model for every value of k, as you can see here, and then calculates the sum of squared distances, which is a way of measuring the error, or impurity, of a clustering. As you can see in the chart, if your k value on the x-axis is small, you're naturally going to have much higher impurity in your clusters, because you're putting a lot of different types of processes into a small number of clusters. And out to the right, as your k value increases, that's where you can potentially overfit the model, and you may have similar processes put into different clusters, because it's going to return 175 or 200 clusters based on the value you give it. The reason it's called the elbow method, as you may suspect, is that the optimal value for k is going to be somewhere in the elbow of this curve, which in this example is somewhere in the 50-to-75 range.

After running that visualization, we train the model with k set to 50 clusters; that's the step done here. Once we've trained the model, we can start running some exploratory analysis on the results. For example, we can get the top terms for each cluster. These aren't necessarily the most common terms in each cluster, but rather what's at the center of that cluster from a numeric standpoint. I've already got it scrolled down here: the cluster number is labeled, and it's printing the top 10 terms for each cluster. Those labels are not significant in themselves; it's not like cluster zero is more important than cluster 40. But you can scroll through all of them, and in this training run, cluster 31 has some of the terms I was referencing earlier, with fluentd and ruby and usr and bin. You can do this for all of your clusters and see if any key terms jump out that make you think, oh, that could be a running software application.

Continuing on down: the number of processes per cluster does not stay consistent across clusters. You can see the cluster labels across the bottom and the number of processes on the y-axis, and it certainly fluctuates. We thought it was interesting that there was a big spike here, around cluster 19 or so, and when we looked into that one, it was what we refer to as a junk cluster: the algorithm just put a bunch of processes in there that didn't fit well anywhere else.
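Pulling those pieces together, here is a condensed sketch of the vectorize, elbow, train, and inspect flow just described. It assumes a recent scikit-learn (for get_feature_names_out) and that `processes` is the cleaned list from the previous sketch; the k range is stepped to keep the sketch fast, where the talk sweeps 1 to 200.

```python
# Sketch: TF-IDF vectorization, elbow method, final k-means training,
# and the top terms closest to each cluster center.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Two lines to turn process strings into numeric vectors.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(processes)

# Elbow method: train a model per candidate k and record the sum of
# squared distances (inertia); look for the bend in the curve.
ks = range(1, 201, 10)  # stepped by 10 here purely for speed
inertias = [KMeans(n_clusters=k, random_state=0).fit(X).inertia_ for k in ks]
plt.plot(ks, inertias, marker="o")
plt.xlabel("k (number of clusters)")
plt.ylabel("sum of squared distances")
plt.show()

# Train the final model with the chosen k, then print the terms
# nearest each cluster center.
km = KMeans(n_clusters=50, random_state=0).fit(X)
terms = vectorizer.get_feature_names_out()
order = km.cluster_centers_.argsort()[:, ::-1]  # highest-weight terms first
for i in range(km.n_clusters):
    print(f"cluster {i}:", ", ".join(terms[j] for j in order[i, :10]))
```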
But that can open up additional opportunities: you can run the clustering algorithm on the junk cluster itself and see if you can get a little bit deeper into it.

(Question from the audience: where does scaling fit into this? In this specific example, where are you kicking off into some larger system, and how does it know to do that?) Right. So in our scenario using PySpark, all of this is an ODH instance with the Ceph storage and with JupyterHub. And if we really need to scale up our compute resources, then from within this Jupyter notebook in ODH we can connect to a remote Spark cluster that can spin up more nodes and have as much compute power as needed. So it's seamless for us. And by the way, I forgot to repeat the question, but it was about how this example demonstrates being able to scale up compute power. Does that answer it? Towards the beginning of the notebook, we do specify the Spark instance that we're connecting to. (Follow-up: so by explicitly importing Spark with a URL in there, the notebook is explicitly going off to something else, which just happens to be the OpenShift implementation this notebook is running on; they're implicitly paired at this point.) Well, in this example, this is the Spark master URL right here. Anish, did you want to add to that?

A couple of things the Open Data Hub does behind the scenes. One is that if you selected a Spark-enabled Jupyter notebook in JupyterHub, it actually spins up a Spark cluster for you behind the scenes automatically, and it's your own personal Spark cluster. You can size that to whatever you want, so as you're doing your data discovery, data exploration, or machine learning, you can size it independently. If you have a Spark cluster somewhere else, say on a Hadoop cluster, you can also point to it, but by default we add some environment variables that make it automatic, so your Jupyter notebook knows how to use that resource. The other thing we're not showing here is that if you have a GPU-enabled OpenShift, the Open Data Hub implicitly allows you to run your workloads on a GPU just by selecting a certain parameter when you spin up your Jupyter notebook. That's not something we'll show today, but those are the types of things the Open Data Hub does for you: it's automatically configured to use those resources, outside of the code here.

(Question: so the code makes use of whatever resources you create when you spin it up; if you start needing a GPU, do you have to tear it down and reload?) You would basically click on the control panel and stop your server, then spin up a new one with whatever new parameters you want: you'd select the GPU, and now your workload would run on a GPU-enabled container.
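To make the Spark side of that concrete, here is a hedged sketch of what pointing a notebook at a remote Spark cluster can look like. The master URL, the environment variable name, and the config value are assumptions for illustration, not the Open Data Hub's actual wiring.

```python
# Sketch: connect a notebook to a remote Spark master. In Spark-enabled
# ODH images the master location is injected via environment variables;
# the variable name used here is a placeholder.
import os
from pyspark.sql import SparkSession

master = os.environ.get("SPARK_MASTER_URL", "spark://spark-master:7077")
spark = (SparkSession.builder
         .master(master)                         # remote cluster, not local
         .appName("grokket-scaled")
         .config("spark.executor.memory", "4g")  # size executors as needed
         .getOrCreate())
```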
All right, great. We're getting to the end of the notebook here, showing the number of processes per cluster. (Question from the audience about running this as a multi-user environment.) Yes, that's true. With JupyterHub, that's exactly what we're doing: it's a multi-user environment. Different users have different needs; maybe there's a power user with access to much more data than someone who's just doing exploratory work on smaller data sets. We have the ability to control how many resources, how much CPU and memory, are allocated per user, or we can set a default that every user gets. That is something we kind of glossed over: it is a multi-user environment, and it lets us control the quotas for each data scientist.

Okay, great. Well, that about wraps it up. On this last code cell, I just wanted to tie it back to the Fluentd example from earlier, which highlighted cluster number 31 as being Fluentd. Here we can see the example processes (just ignore that it says 10 there; it should be 31 from this latest run). The preprocessing steps have taken out the forward slashes and replaced them with spaces, but it looks pretty similar to the example processes from the slide earlier. And that concludes our presentation.