Welcome back again to another OpenShift Commons briefing this week. We seem to have an AI/ML theme going on. So we're really pleased to have Will Benton and Parag Dave from Red Hat with us today to talk about continuous development and deployment of AI models with containers and Kubernetes. I saw a preview of this earlier internally at Red Hat, and I thought it would be a great thing to bring here and have a discussion around. So I'm going to let the guys introduce themselves, talk about it, do some demoing, and at the end we'll have live Q&A. So if you have questions, you can type them in the chat, either here in BlueJeans or in Twitch or in YouTube or Facebook or wherever you're watching the live stream, and we'll relay them back here. So with that said, Will, please take it away, introduce yourselves, and let's rock and roll.

Thanks so much, Diane. Really appreciate the opportunity. I'm Will Benton. I'm an engineering manager and an engineer at Red Hat in the office of the CTO, and my focus has been on helping Red Hat's customers build machine learning systems in the cloud with Kubernetes and OpenShift. One of my particular passions lately has been figuring out how to use contemporary infrastructure to make data scientists' lives easier. So how can we improve that machine learning workflow, as well as running the system in production?

Right. I'm Parag Dave. I'm a member of the product management team in the Developer Tools BU. My focus has been aligned with what Will is looking at, which is how can we enable developers to create and deliver applications from dev to test to prod in the fastest way possible, so you can increase the frequency of delivery in the most efficient and optimal way. And then, what are the differences that come up when it's a specific kind of workload, whether it's an AI/ML application versus an IoT application versus a traditional Java application? What are the differences? What are the similarities? We want to bring that together and offer a solution that makes it easier for developers to just get going with it. So we are pretty excited to align continuous delivery and deployment from the traditional world to an AI/ML world.

So let's start with a few preambles around how businesses really benefit from AI/ML. We've been working with organizations globally, across the board, to understand what AI/ML initiatives they are chasing and what benefits they would like to derive from them. If you put them into various categories, you can see that by adopting certain AI-powered intelligent software, businesses can actually drive a lot of benefit in areas like customer satisfaction. You can gain knowledge about customer usage and what is happening, with sentiment analysis and trend analysis, and you'll be able to actually increase satisfaction with your products and services. You can gain a competitive advantage by creating differentiated digital services that are driven by AI concepts rather than a rule-driven or process-driven philosophy. And if you can optimize, you can leverage AI and deep learning to optimize your current business services that are out there, and hence increase your revenue because you're optimizing them, and also drive some new revenue streams by offering similar services.
And lastly, there is intelligent automation, right? If you're able to automate the manual, repetitive, time-consuming business operations, you can actually reduce your operational costs, and this allows you to be more efficient and yet offer higher customer satisfaction. So AI/ML is being leveraged internally in organizations to drive increased value in these four areas, and that is being done via out-of-the-box products or stuff that has been built in-house. Next slide, please.

So here are some examples from when we looked at how companies are leveraging AI and machine learning to achieve some positive business outcomes. For example, in financial services, which we are all a part of, you see outcomes like reduced fraud. You've heard of credit card fraud detection engines, things that can predict whether a transaction is real versus fake. That's driven with AI and machine learning, so it's something which is being pushed very strongly by the financial services markets. Similarly, in the medical field, in healthcare, we see a lot of medical diagnosis being done with AI. There's a lot of work happening there where they can speed up the time it takes to deliver a diagnosis and also increase the accuracy of the diagnosis by augmenting medical professionals with an AI- and ML-driven application. When we look at the insurance claims industry, we see that a lot of the automation happening around processing and approval of claims has increased the number of claims that get approved and also decreased the amount of time the customer has to wait to actually get a claim approved. And obviously, we have all heard of autonomous driving, right? The self-driving cars. These are all driven by lots and lots of data being processed on the edge with AI and ML applications.

So what's driving all this? Basically, why now? What is happening now is that the growth of AI is driven by easy access to abundant computing power, faster processing with specialized computing processors, rapid development with rich open source frameworks and technologies that you can actually use to build AI and ML models, and a widespread awareness and acceptance of AI amongst all of us. We don't see AI interfering in our lifestyles; we see AI actually augmenting our lifestyles. And when that awareness and acceptance spread, it led to all of these initiatives now being taken on by companies. And the computing power that used to require supercomputers can now be deployed to a cloud environment at a fraction of the cost and time. So these factors have combined to basically make AI/ML real.

So now let's look at this: we understand the benefits of AI/ML for companies, so what is the end-to-end flow that is followed in order to create an AI/ML application? Let's take a look at a typical development lifecycle, and you will see that it exhibits many similarities to traditional software development. It starts with the business leaders. Looking at the outcomes we discussed and the benefits we covered earlier, the business leaders will define the outcomes, the goals that they would like to achieve from the AI and ML application initiative that is being undertaken.
At that point, you see that data engineers will then work with the data to build the architecture and the systems that are ready to do the data processing and the data storage, and to make sure that it's in line with what's considered optimal and with the enterprise standards and policies that are in place. With the architecture and the goals defined and the systems available, data scientists will now start working with the data to build the models. They will develop the AI/ML models, and this will be done in collaboration with the data engineers to make sure that the models are able to leverage the architecture and the systems where the data resides. Once the models are created, the next part is, well, they need to get deployed. So the data scientists typically collaborate with the app developers to integrate the deployment of these models into the entire application development process. This is where the applications consume the models, productionize them, and put them into the application that the end user would now be consuming to leverage these models. The app developers also lead the deployment of the applications, because you have to deploy them out, right? So they deploy the application, and once it's deployed, at that point in time the AI/ML models will now start running. They're running their inference capabilities: new data comes in, and they have to infer on that data and see how accurate the model is. And these models are being monitored and managed together by the data scientists and the application developers, because they want to make sure that the models are delivering the desired outcomes. Now, IT operations is typically continuously engaged across all the aspects of this lifecycle; it handles the management, monitoring, and remediation of the entire system.

While these models are being monitored by the data scientists and app developers to make sure that the correct predictions are being made, it may require you to go back and retrain the models. So it's a big feedback loop that happens, right? Models will get retrained as needed. For example, you're making predictions, and your goal is to increase the accuracy of the predictions. You have training data, you deploy the model, you get new data coming in, you see what the outcome is, and then you go back and end up retraining the model to make sure that the predictions are accurate. Similarly, when some new data comes in and it's like, oh, we did not train on that data, our AI/ML models need to handle this new data, so now we've got to go back, recreate the models again, and then do the entire deployment process. This continuous feedback loop is always happening, from training and developing the model to deployment and back to retraining. And as we covered, this involves personas from all over the place.

Now, if we take a look at this typical lifecycle, you will see that at the top we have the project lifecycle, setting the business goals, and the data engineers that need to gather the data, prepare the data, and make it ready. In order to execute all this, you need what we'll call the machine learning software toolchain. For example, it starts with TensorFlow, Jupyter Notebooks, and Python stacks for developing the AI/ML models, and then comes the flow of getting the models across the board.
So you have CI/CD products, and the data lakes that you need for data services; you can have SQL Server, NoSQL, and that's where the data services architecture comes in, right? We take all the places where the data resides, and then the models have to work with the data that's being stored and made available in those systems. And then the CI/CD tools, which are basically your automation and testing for deployment, have to come in. And all of this has to be supported on a self-service hybrid cloud. This hybrid cloud platform basically empowers data scientists, data engineers, and developers to be agile and collaborative through the entire process, and you don't have to depend too much on IT operations for individual tasks.

Now, this hybrid cloud platform, because it's self-service, also needs to be optimized for the kind of AI/ML application you're building. For example, if there are hardware accelerators, like GPUs, that can help you speed up the development of the ML models and the inferencing tasks, and you want to bring them into the cloud, then the cloud has to support that. And finally, the hybrid cloud platform has to offer a consistent experience across all the different kinds of environments, whether it's on premise, whether it's on the public cloud, or whether it's processing data that's coming from an edge location; that needs to be covered as well. It has to be done in such a way that IT operations can manage it from a single place, in a consistent way, rather than having to adapt to each particular environment in the infrastructure where it's landing. So if you take a look at the entire lifecycle and the toolchain, this is where all of it lives, right? We start with the infrastructure at the bottom, where things are going to get deployed. We have a cloud platform that runs on top of it, has the compute power, has the knowledge of what goes onto GPUs and what doesn't, provides the architecture for the data, and provides the architecture for the tooling, including the CI/CD process for continuous and faster deployment. And all of this leads into the end-to-end flow of delivering the applications.

So now that we know this is the tooling and the infrastructure that is needed, let's look at the benefits that a container-based architecture and Kubernetes bring to the development and deployment of models through that lifecycle we just covered. With containers, and Kubernetes driving the orchestration of the containers, data scientists and software developers can develop ML models and the associated intelligent applications powered by these models with a very high degree of agility, flexibility, portability, and scalability. So think of leveraging the power of Kubernetes and infrastructure as code. You can now automatically set up your AI/ML environments across the infrastructure, across the hybrid cloud, whether it's public clouds or on premise. You can set it up automatically because you're declaring it as code. And you can now do on-demand provisioning of your compute resources during the development of the models, during the deployment of the models, and then while the models are running. As your demand grows, you can scale out the running part of it; or as your data demands grow, as your data gets bigger and bigger in your training sets, you can now have higher compute that helps you develop the models.
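As a rough illustration of that infrastructure-as-code and on-demand provisioning idea, here is a minimal sketch using the Kubernetes Python client to request a one-off training job with a GPU. The image name, namespace, and resource sizes are assumptions made for the example; they are not part of the demo that follows.

```python
# Minimal sketch: declaring an on-demand model-training Job in code with the
# Kubernetes Python client (pip install kubernetes). The image, namespace, and
# resource sizes below are illustrative assumptions.
from kubernetes import client, config

def submit_training_job(name="spam-model-train", namespace="ml-jobs"):
    config.load_kube_config()  # use config.load_incluster_config() inside a pod

    container = client.V1Container(
        name="train",
        image="quay.io/example/spam-trainer:latest",  # hypothetical trainer image
        command=["python", "train.py"],
        resources=client.V1ResourceRequirements(
            requests={"cpu": "4", "memory": "8Gi"},
            limits={"cpu": "4", "memory": "8Gi", "nvidia.com/gpu": "1"},
        ),
    )
    template = client.V1PodTemplateSpec(
        metadata=client.V1ObjectMeta(labels={"app": name}),
        spec=client.V1PodSpec(restart_policy="Never", containers=[container]),
    )
    job = client.V1Job(
        api_version="batch/v1",
        kind="Job",
        metadata=client.V1ObjectMeta(name=name),
        spec=client.V1JobSpec(template=template, backoff_limit=1),
    )
    client.BatchV1Api().create_namespaced_job(namespace=namespace, body=job)

if __name__ == "__main__":
    submit_training_job()
```

Because the environment is declared in code, the same job definition could be submitted to an on-premise cluster or a public cloud cluster without changes.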
So the power that Kubernetes and containers bring to the table, the biggest piece of it, is around scalability, because you can then scale as you need, and also around HA, right? Because in a real-world environment, when the applications are running, if you have downtime or failures, whether it's network failures or hardware failures, your entire solution can keep on running, and it can actually be automatically re-provisioned wherever it needs to go to provide uninterrupted service to your customers. When we look at portability, we talked about how the models need to run across various parts of the infrastructure, which means we don't want to create a model that can only run on premise and has to be re-architected or refactored if you take it to the public cloud. The idea is to use containers and Kubernetes to let us port these models to run no matter where the end environment ends up being. And so this just-in-time inventory, just-in-time scaling, HA, portability, and being able to quickly deploy changes to very specific pieces of the product, versus updating a monolithic application for one change you had to make to take care of maybe a new data model or maybe just some bug that was found: it's much harder to do that when it's being driven as a single application versus a containerized set of microservices that make up the application, because then you can update the respective pieces as you need. So it makes you more agile in how you respond to new requirements or bugs, and also to new computing requirements that you have on the scaling side. So now I'll turn it over to you, Will, so we can take a look at this in action with OpenShift.

Thanks, Parag. So I want to do a deeper dive into that machine learning lifecycle from a practitioner's perspective and talk about how we'd use this to solve a concrete problem. Parag talked about a lot of problems that are actually driving business value. I'm not going to talk about such a problem today. I'm going to talk about a problem that everyone is looking to build a new solution for right now, and that problem is spam classification. We're going to start with a hypothetical data set where we have two kinds of data sources. We have data like the excerpt on the top, which we're calling legitimate documents, and data that looks like the excerpt on the bottom, which we're calling spam documents. If you look at these and think about them, you could probably say, well, I can see some differences between these things; I can see a way to tell them apart. If you really think about it, you might think that the excerpt on the top sounds suspiciously like Jane Austen, and the excerpt on the bottom sounds suspiciously like it came from a user comment on an internet recipe site or a review of a food product. And that's in fact what our data sources are. Our legitimate documents are documents that have been generated by a generative model trained on Jane Austen's complete creative output, and our spam documents are documents generated by a model trained on fine food reviews from a large internet retailer. The idea is that we can tell these things apart by looking at them, and we should also be able to write a program to tell them apart. So let's dive into that workflow and see what we do in this specific case to solve that problem.
The first task that a data scientist is going to do, again in conjunction with business leaders and stakeholders, is figure out a way to formalize the problem. We need to figure out what it means to succeed at this problem and turn success into a number. That could be metrics that we're already collecting or metrics that we need to invent and record. In the case of document classification or spam filtering, success could mean not missing spam messages, right? Like, I never want to see a spam message in my inbox. Now, of course, we could ensure that I never see a spam message in my inbox by sending everything to the spam folder, so that's obviously not the whole story. It could also mean that we don't misfile legitimate messages: that we don't have a lot of legitimate messages winding up in someone's spam folder. Now, those are metrics that we can test when we have a training set, when we know what the truth is for something. There are also business metrics we might care about, and in this case it could be feedback from our users. How many messages did we send to someone's inbox that they then flagged as spam? How many messages did we send to someone's spam folder that they moved back into the inbox? Obviously, these aren't the whole story either, because people aren't perfect, right? Someone is not going to go through every message in their spam folder and say, did I really mean to read this? And even if they did, they might not give us the signal by sending it back. But these business metrics are an important part of the problem, and responsible data scientists will focus on the whole picture, looking at all of these metrics together.

Once we have those metrics out of the way, our next step is to collect, clean, and label data. In this case, that means going from raw messages where we have labels to labeled messages in a regularized format, where we have individual documents that are examples we've labeled as either spam or legitimate. We're now into the core of the data science workflow, where we go from that clean data to what we call feature vectors. That's going from regular data that we'd be happy to deal with in a database table or as a conventional programming language object to points in space. At a high level, machine learning algorithms are just figuring out ways to split up space or map from one space onto a smaller space; it's basically just summarizing points in space. And so what I want to do is encode every document that I see as a point in space in such a way that similar documents correspond to similar points in space. Then I can say interesting things like, oh, it looks like there are a lot of legitimate documents in this part of space, so maybe my model is going to distinguish between things that are in this part of the space and things that are in other parts of the space, just as an example.

Once we have those features, we can use them as input to a model training algorithm, where we take the labeled data where we know the truth and the approach that we use to turn that labeled data into feature vectors, and then we allow the model to identify patterns in those vectors that we can use to answer the question we care about. In this case: is this document spam or not? And really, at a high level, all this model training algorithm is doing is identifying good tradeoffs in how it summarizes the data. You want a simple model that has good performance by some metric that we care about.
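As a rough sketch of the steps Will is describing, the feature-vector and training stages might look something like this with scikit-learn. The tiny inline dataset, the TF-IDF vectorizer, and the logistic regression classifier are illustrative assumptions for this sketch, not the demo's actual notebooks.

```python
# Sketch of the workflow just described: turn labeled documents into feature
# vectors, train a simple classifier, and measure performance on held-out
# data. The tiny dataset here is purely illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

docs = [
    ("it is a truth universally acknowledged that a single man", "legitimate"),
    ("she was the youngest of the two daughters of a most affectionate father", "legitimate"),
    ("these k cups taste stale and the coffee is terrible do not buy", "spam"),
    ("my dog loves these biscuits five stars would order again", "spam"),
] * 25  # repeat the examples so the train/test split has something to work with

texts, labels = zip(*docs)
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The vectorizer maps each document to a point in space; the classifier
# learns how to split that space into "spam" and "legitimate" regions.
pipeline = Pipeline([
    ("features", TfidfVectorizer(ngram_range=(1, 2))),
    ("model", LogisticRegression(max_iter=1000)),
])
pipeline.fit(X_train, y_train)

# Validation: report metrics on data the training step never saw.
print(classification_report(y_test, pipeline.predict(X_test)))
```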
So once you have that, you basically have a function that takes these points and maps them to predictions. And those predictions, as you can see on this slide, are not perfect, right? Actually, we probably should be suspicious if they were, because that gets us to the next phase of our model training process. If we were to train a model just to memorize everything it had seen, we could have a perfect model, right? It would perform perfectly on our training set, but it would be extremely unlikely to perform well on data that it hadn't already seen. So the next step in our process, called model validation, is where the data scientist goes back and tests the performance, tests the metrics of the model, on data for which we know the answer but which we did not expose to the model training algorithm in the first place. We want to make sure that the performance we have on data we haven't seen is comparable to the performance we have on data that we have seen.

The last couple of steps in this process are actually putting the model into production as part of an application, as a service that the rest of your application can interact with, and then monitoring that behavior in production. If you think about the early days of automated spam filtering, you may recall that you would see a class of spam messages in your inbox. Maybe it was ads for online gambling one week, online gray-market pharmaceuticals the next week, and mortgage discounts the third, but there would be various topics that would elude the spam filter. Then someone would identify that these were getting through to the inbox and push out a new version of the spam filter that caught those things, and so the spammers and the spam filters were playing this cat and mouse game. In the real world in general, models can start misbehaving. And the interesting thing about models is that conventional software components, if we're lucky, break in obvious ways: they don't build, they don't deploy, they're obviously slow in production. Models, though, remember, are just a function that makes a prediction, and the way that this can misbehave is more insidious than the way that our conventional web apps might misbehave. The model could keep giving you answers; they might just be wrong far more often than you can accept. So by monitoring the behavior of the model in production, we can identify this before it causes us a business problem.

Now, as Parag said, this is not a waterfall, right? This is actually an iterative process, and at a lot of stages in the workflow we're backtracking and changing decisions we made earlier. Another really interesting thing about this lifecycle is that because of all these loops, we need to be really careful about the latency between phases. In a lot of organizations, when data scientists need new infrastructure to try a new approach to a problem, they have to file a ticket with IT; they have to get something supported. In a lot of environments, if data scientists want to build a model service that can be incorporated into an application, they're either going to develop that service themselves using a skill set that's probably not where they'd rather be spending their time, or they're going to have to have a communication exercise with an app dev team. They're going to have to say, hey, look at this technique I developed, can you figure out how to turn it into a production application?
And based on our experience of seeing this workflow in person, there are some teams where this works very well. But for some teams, this turns into a lot of time spent at a whiteboard, a lot of raised voices, and a lot of eventual apologies. So we really want to streamline every loop in this process as much as possible to increase the velocity of the teams that are building these applications. And we'll show you how we can do this on OpenShift now.

So I'm going to show you here how we solve this problem from end to end. The next thing I want to show you is the Open Data Hub operator, which is a community project sponsored by Red Hat that provides an end-to-end data science and data engineering discovery environment with a single click. Instead of filing a ticket with IT, if I'm a data scientist who has access to OpenShift, I can install this myself. And because this is already installed by my organization, I don't even have to install it; I can just go to an endpoint and get an interactive development environment for data science. Now, a lot of data scientists prefer to work in conventional IDEs, but a lot of data scientists also like to work in these so-called interactive notebook environments, and I'll show you what these look like. Here I have a directory of notebooks in an environment that I've launched from JupyterHub on the Open Data Hub, and this is basically just a way to do literate programming in a document. I have some prose here, I have some code, and then I have the output of that code, and I can change this code as it runs and edit it. So this is a really nice way to experiment with techniques interactively, right? I can say I want 23 rows of this data set instead of 50, and I get a different result, and I can edit this. And it's also a communication tool, right? For a lot of data scientists, a lot of their job is communicating results to stakeholders. So we want to explain what we're doing, show the code, let people run the code, and see how we reproduce our work. Now, the interesting thing is we can also have these sorts of tables, and we can also have plots, right? So we can say, is there a clean separation between the points in space for this problem? And we can see that, yeah, there basically is. So I could take this notebook, use it to develop a technique, and then hand it over to a stakeholder and use it as the basis for a presentation. This is more than code but less than a paper; it's a little bit of both. So thinking of it as a literate programming environment, where you can develop a technique and then explain it to someone else, is a really good way to look at it.

Now, for this concrete problem, we've looked at a couple of different approaches here, and I've run them already, but I can just restart and run this again. This is a feature engineering approach where we're basically going to turn documents into vectors so that we can feed them into a machine learning algorithm. And we can see that we're doing some sanity checking of our spam and legitimate documents. This spam document is talking about K-Cups and bad coffee. This legitimate document is talking about things that upper-middle-class people in 19th-century England are doing. And this spam document is talking about tea and dog biscuits and baby food and so on. So we see that there's some clear distinction between the kinds of things that these documents are talking about.
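The kind of notebook cell Will describes, checking whether the documents separate cleanly as points in space, might look roughly like the sketch below. The use of TruncatedSVD and matplotlib, and the `texts` and `labels` variables carried over from the earlier sketch, are assumptions for illustration rather than the demo notebook's actual code.

```python
# Sketch of a notebook cell that projects document feature vectors down to
# two dimensions and plots them, to eyeball whether spam and legitimate
# documents separate cleanly. Assumes `texts` and `labels` as in the earlier
# sketch; TruncatedSVD and matplotlib are illustrative choices.
import matplotlib.pyplot as plt
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

vectors = TfidfVectorizer().fit_transform(texts)  # sparse document-term matrix
points = TruncatedSVD(n_components=2, random_state=0).fit_transform(vectors)

for label, color in [("legitimate", "tab:blue"), ("spam", "tab:orange")]:
    mask = [lab == label for lab in labels]
    plt.scatter(points[mask, 0], points[mask, 1], color=color, label=label, alpha=0.6)

plt.legend()
plt.title("Documents as points in space (2-D projection)")
plt.show()
```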
If we go on and look at the rest of this, we can see that we're able to turn these into these large vectors and then save that pipeline. Crucially, there's nothing in this notebook that knows about OpenShift; this is just a communication tool that a data scientist would work with. Now we're going to train a model. Again, I ran this in advance just before we started, but we can go through and look at it, and here are some metrics on how well our model is doing. This picture basically shows how many of the legitimate messages we actually predicted were legitimate and how many of the spam messages we actually predicted were spam, and on the other diagonal, how many spam messages we called legitimate and how many legitimate messages we called spam. And again, this is just a communication tool, right? This doesn't look a lot like something you could immediately drop into production.

Now, in a conventional workflow, a data scientist would take these notebooks, send them to an app dev team, and have that team figure out how to implement them in a service. But we know that with OpenShift's developer experience we can do better, right? We actually have a Source-to-Image build as part of a Tekton pipeline here that will take these notebooks, extract the code that trains the model, and build a microservice around it after training the model in a build. I've already run this in advance, so we can take a look at what it did. If we look at the build logs, we can see that after we set everything up, we're actually going through and installing the Python requirements for this notebook, we're actually doing the model training, and we can see our metrics report from the model in a Tekton build, where we have this performance here. And this performance is not great, because these notebooks come from a workshop where we teach people about machine learning, so we let them tune the model to make it better; we're just showing you the first stage of this lifecycle. But then we go on from there, and we actually deploy a Knative service in production based on the model that we trained in those notebooks. So what we've done is we've extracted the code that does the feature engineering and the model training from these notebooks, we run that in a build, and then we take the code that does the feature engineering, plus the trained model, and put them together behind a REST endpoint. So we have a service that takes raw data and returns a prediction. We're running that in a Knative service right now, and I can show you what that looks like; it's running right here in this pipeline service. I've also built a parallel version of this that just uses a conventional Source-to-Image build, so if you haven't adopted Tekton yet, you can still use similar techniques. We like to show the latest and greatest, though.

All right. So here's how you might interact with this in an actual application. I have a couple of different URLs here, and we're using one of them. This is the one for the Knative service; we also have one for the conventional OpenShift service down here. So I'm defining the endpoint that I want to interact with, and I'm declaring a very simple client library where I just take the text that I pass in and post it to that REST service that I created. So again, I said the model performance isn't very great.
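That simple client might look something like the sketch below: post raw text to the model service's route and read back the prediction. The route URL and the JSON field names are assumptions here, since the actual service contract isn't spelled out in the talk.

```python
# Sketch of a minimal client for the prediction service: send raw text to the
# model's REST endpoint and read back the prediction. The route URL and the
# request/response field names are illustrative assumptions.
import requests

PIPELINE_SERVICE = "http://pipeline.opendatahub.example.com"  # hypothetical Knative route

def predict(text, endpoint=PIPELINE_SERVICE):
    response = requests.post(f"{endpoint}/predict", json={"text": text}, timeout=10)
    response.raise_for_status()
    return response.json()  # e.g. {"prediction": "spam"}

print(predict("dog food"))
print(predict("it is a truth universally acknowledged"))
```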
So we're going to look at some example predictions with two very short documents. One document is "dog food." The other document is "it is a truth universally acknowledged." We would hope that the second one would get predicted as Jane Austen, but again, we left some room for improvement in the model, so both of these show up as spam. But let's try this with some more documents. I'm going to load in the training data that I had, and I'm going to take a sample of it and look at how well the model performs on these examples. So we have a lot of examples here and a lot of predictions. And the interesting thing here is that we can actually go back and track metrics about the predictions we've made; we can look at this service and see what it's done.

Remember, we talked about data drift, right? We may not know whether a message is spam or legitimate in real life, but we may know that we expect a certain percentage of the messages we see to be spam. In the real world, maybe 90% of all email traffic is spam and most of it just never makes it to your inbox, maybe 95%. But if that distribution changes over time, we know that the data we're seeing in the real world no longer corresponds to the data that we trained our model on. So we can simulate this with an experiment where we track the values that we see over time with a given proportion of legitimate to spam messages. If we start with, say, 100,000 examples with 5% legitimate and 95% spam, we should expect that the distribution of these messages is roughly comparable to 95% spam and 5% legitimate. So we're tracking these metrics from the model, and we can actually see them in Grafana; as soon as Grafana catches up, we'll see those metrics reflected in this dashboard here. You can see how it builds up over time. This is a log-scale graph, so we're looking at the slope of each line to see how quickly the individual lines grow; we're looking at rates of growth rather than absolute numbers. And we'll just go back and run some different experiments, and we'll see when Grafana catches up with those experiments. If we say 25% of messages are legitimate, we should see different curves in the graph. So we're getting a little bit of a tick up here as the metrics system catches up, and you can see that these curves are going to catch up over time. We'll use a shorter time window so it's a little easier to see. And as we go on, we see that there are a lot more legitimate messages than spam messages with this latest run, and that's not what we'd expect; we'd expect these to be growing at the same rate, because the proportion between them would be staying the same. Now, in a real installation, we wouldn't just have a data scientist monitoring this dashboard waiting for something bad to happen; we'd want to let them do something more productive with their time. But we could define an alerting rule for, has this distribution changed, or even have another model that detects anomalous behavior in these predictions.
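One way to track those prediction metrics, sketched below, is for the model service to export per-label counters that Prometheus scrapes and Grafana graphs; an alerting rule could then watch whether the ratio between the spam and legitimate series drifts away from what the model was trained on. The metric name and port here are assumptions, not the demo's actual instrumentation.

```python
# Sketch of the monitoring idea: count predictions per label with
# prometheus_client so a Grafana dashboard or an alerting rule can watch
# whether the spam/legitimate ratio drifts over time. The metric name and
# port are illustrative assumptions.
from prometheus_client import Counter, start_http_server

PREDICTIONS = Counter(
    "model_predictions_total",
    "Predictions served, labeled by predicted class",
    ["predicted_class"],
)

def record_prediction(predicted_class: str) -> None:
    PREDICTIONS.labels(predicted_class=predicted_class).inc()

if __name__ == "__main__":
    start_http_server(8080)  # exposes /metrics for Prometheus to scrape
    # In the real service this would be called from the prediction handler:
    record_prediction("spam")
    record_prediction("legitimate")
```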
So just to recap what we've seen in this demo, end to end: we've seen how to use the Open Data Hub on OpenShift to provision a self-service discovery environment. We've seen how to use that Open Data Hub to do interactive development and produce a machine learning technique in an interactive notebook. We've seen how we go from that interactive notebook, which is really a communication tool and not what we think of as a conventional software artifact, to an actual production service that we can incorporate into our application, using OpenShift's developer experience and Tekton build pipelines. And then we've seen how we can monitor the behavior of that model in production so that we can detect when it misbehaves. So I want to close, I think, by putting this overall architecture picture back on the screen and showing folks, again, how we support that entire lifecycle from end to end. Thanks so much. And I think we have time for questions now.

Yeah, that was great, and thank you for that explanation. We have a couple of questions in here, and I'm going to unmute Pete Bray, who's also from Red Hat and has been answering some of the questions in the chat as we've been going; you can see Pete there. One of the questions, and I think it's a good conversation, was around storage. An attendee had asked what is trending now in storage for AI and ML data. I wonder if you could address that, Pete.

Sure. And I'll paraphrase a little bit of what I wrote in the response back, but the answer is, it really depends. We are seeing some particular trends, but it really depends upon the types of data. In general, there are really two large, actually three large, categories of data. There's structured data, which you normally would think of as things like customer records or things that would go into databases; they fit very nicely into a tabular, columnar type of format. But we know that not all data is nice and neat like that. In fact, there's another category of data called semi-structured, which is midway between being very structured and columnar and being very unstructured, which is actually the third category of data. In the unstructured category, these are things like files, and I think the attendee who asked the question was specifically asking about unstructured data, files basically, that it looks like they're using NFS for today. And I skipped over the middle section, which was semi-structured data; it's basically a combination of both structured and unstructured data, and you might hear that oftentimes referred to as data warehouses. Whereas for unstructured data, quite frequently, and I think Parag used this term at the beginning of the presentation here, although data lakes is a very overloaded term that can mean a lot of different things, most commonly it's where objects and files are stored. From a storage standpoint, increasingly what we're seeing, and much of this is being driven by the popularity of Amazon S3, is movement towards an object-storage-based data lake to be able to support these needs. Now, the challenge for many customers that we work with, and many upstream implementations also, is that not all applications are yet ready to support S3 interfaces. So we're in kind of a transition period right now where there still are a lot of file-based, NFS-based applications out there. And so quite often we find the challenge of, how do I support my file-based workloads, typically using NFS, but eventually migrate over to S3? We've had a lot of experience helping people do that, to get to this new S3 environment. You might ask, well, why would you want to do that?
There are a lot of different reasons. The primary reason is that S3 presents a very flat namespace, which is massively extensible, and when you're building a data lake that could potentially be hundreds of petabytes, that's very, very important. That's actually one of the challenges with traditional file systems like NFS: there are limits to their ability to scale, because it's more of a hierarchical type of namespace. So we are seeing a trend that people are increasingly adopting S3 as their underlying storage technology.

Awesome, thanks. I figured you could answer that one, Pete; that was right up your alley. Coming in, Mohamed Shafa is asking, how do you address data lineage, versioning, and reproducibility for data versus models? Will, would you like to take that one?

I'll take a crack at it, and Pete, you probably have some thoughts here too, so chime in if you'd like. So there are a lot of technologies in this space that solve issues of data lineage. I really look at this as managing the model lifecycle. A big concern here is reproducibility, right? And there are so many facets to reproducibility, so I'm going to start by level setting and then I'll get to your question. With Jupyter notebooks, you saw how I went back and edited things, right? And I ran things in different orders; you can do that in a notebook. If I do that in a notebook, the output in the notebook is not going to be what someone else is going to get if I send it to a colleague and she tries to run it, right? If I don't have the same libraries installed that you have installed, you will get different results than I will. If I have a library that has soft dependencies, where it behaves one way if an optional package is installed and another way if it isn't, you may get different results than I do running a model. And then finally, there are all sorts of other concerns related to making sure that I specify random seeds, making sure that I use random number generators in a way that's safe for the kind of parallelism I'm exploiting in my application, and making sure that any native libraries that my Python or JVM code is calling out to are the same versions and have the same behavior. If you really need bit-level reproducibility of your model, which many people do, then you have a whole host of challenges in the code, and that's what we focused on today. You also have a whole host of challenges with the data, right? Your model is only as good as the data it gets, and your model is only reproducible if you know which data you used to train it and how you got that data. And in terms of actual data lineage tracking, there are a lot of great projects in this space that address that component; it's not something we addressed in the demo today. But you can look at technologies like Pachyderm, for example. There are other projects as well: DVC is another good example, or the Quilt project, which has sort of a metadata layer for machine learning data sets. And it's a tricky problem, right? I think what a lot of people want is something that looks like a Git-style interface, where you have a content-addressable set of trees and you can say, I built this model against the immutable data that I had in this particular hash of a tree; this particular snapshot is what I used to train the model, and I can always go back to that. Now, Ceph, of course, is immutable by default.
You're not overwriting things unless you have to. So it's a case where our platforms here, the Red Hat platforms, provide a primitive that you could use to support this. But again, there are a lot of community projects that solve this problem really well, and I think those are all worth looking at in this case. The way we've been thinking about this problem is that you don't just want to track your code and your libraries and your hyperparameter settings and your random seeds; you also want to track your data. And in terms of actually thinking about lineage in terms of pipelines, if you have a classical data-lake-to-data-warehouse architecture where you're going from raw data to incrementally cleaned data in multiple steps, you need a way to replay those pipelines, and you need to track the identities of the data you're dealing with at every stage. Pete, do you have anything to add there?

You've made some really good points, Will, at a very high level about the code piece of the equation here as well as the data piece of the equation. And at a very high level, I think many of us have probably heard the statistic; as Parag was presenting this flow that's on the screen right now, I think the gather data and prepare data stage is actually probably one of the most problematic stages right now for data scientists. A lot of people cite the statistic that 80% of a data scientist's time is spent just gathering and preparing data, and I think that's an overarching issue. But what we're talking about here, and what Will was talking about, is an even more specific case of this problem, because not only do I have the problem with the data, but how do I ensure that there's reproducibility? And I was going to answer in exactly the same way: there are lots of different ways to address this. There are obviously commercial packages, but there are a lot of open source packages also that can help you with this particular problem. It is something that the industry, I think, is focusing on because it is such a broad problem. And with respect to my earlier comments about object storage technology becoming much more prevalent, this is an area also where object storage as a technology can help, because it has built-in versioning capabilities for objects. So you're able to maintain that data as the objects, the files, whatever it is, potentially change. You would hope that your data sets would be static, but not every situation is going to be a static environment. So it's important, I think, to think about the multiple layers of the cake here, so to speak: yes, thinking at the storage layer about whether I have the capabilities to provide not only the governance things that were mentioned, but also the ability to version objects and things like that; and then layering on top of that the tools to help you with this problem is also something to think about.

All right. Well, we mentioned a couple of potential open source projects and things like that, and there's another question in here.
And maybe we can tease out a little bit, too, about how to get started with all of this on OpenShift, through a question from an attendee asking: are there Open Data Hub cookbooks or recipes for all these AI/ML processes and steps that one can refer to? I think that's kind of what we've talked about and you've demoed. How do people get started? Where are the resources and things of that nature for this? What's the best next step if someone's looking to do this?

Yeah, absolutely. So I think OpenDataHub.io is a great place to learn about the Open Data Hub. We have a GitHub organization where we've collected some of these tutorial materials, and I'm happy to follow up offline with anyone who's interested in reproducing some of these things or trying to figure them out. As soon as I'm not sharing my screen, I can put a couple of links in the chat.

All right. There's another one coming in from YouTube. Rakesh is asking how to keep resources such as RAM and CPU from being completely used up by the ML application in OpenShift, because usage of resources beyond limits makes VMs freeze.

Yeah, we've been there. All right, so that's a great question, and I'm going to go back to the demo there, if that's okay. Let me find it. Just a second. It's always fun to see everybody's screen savers. I minimized my window, so I need to go back to where we were. And yeah, let's just go back to this OpenShift console. Oh, it's because I had it on full screen; that always catches me. So I'm going to go to the routes here, and I'm going to go back to JupyterHub. I didn't show you this during the demo because I wanted to streamline it, but this is the interface that the Open Data Hub is actually going to present to a data scientist when they first log in to JupyterHub on OpenShift. They're going to get this launcher, right? And there are a lot of interesting things going on here. Remember, we talked about how a key aspect of reproducibility is: do you have the right libraries installed? A great way to solve that problem is by having your development environment stored in a container image, because then I don't have to worry about whether or not the libraries I installed are even still available, which is a problem we see surprisingly often, or whether or not you installed the exact same versions I did. I can just say, hey, I published my development environment as a container, so I can use that one in particular. But the other thing is, we're not running in VMs. I mean, in many cases we actually are running on VMs, but we're running in containers, so we do have firm resource limits on these containers. If I go into one of those JupyterHub notebooks and try to allocate six gigabytes of memory in a way that might crowd out other people whose Jupyter notebooks happen to be running on the same VM or the same physical node, the Linux kernel, enforcing the limits OpenShift set, would terminate my notebook kernel and tell me I'm using too much memory. Now, the question is, can I get work done in this environment? And the facility for that is that we can set resource limits automatically. These are profiles that you can configure as an administrator when you install the Open Data Hub, but we have basically T-shirt sizing: do you want whatever the default is, which is typically small, or do you want small, medium, or large? And by requesting those environments, you can get more or fewer resources.
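Those "T-shirt sized" profiles are, roughly speaking, spawner configuration. Below is a sketch of what such profiles might look like using JupyterHub's KubeSpawner in jupyterhub_config.py; the sizes and the image name are illustrative assumptions, and the Open Data Hub's actual spawner configuration may differ.

```python
# Sketch of "T-shirt sized" notebook profiles with JupyterHub's KubeSpawner,
# placed in jupyterhub_config.py (the `c` object is provided by JupyterHub).
# Sizes and image are illustrative; the Open Data Hub's configuration may differ.
c.JupyterHub.spawner_class = "kubespawner.KubeSpawner"

c.KubeSpawner.profile_list = [
    {
        "display_name": "Small (default)",
        "default": True,
        "kubespawner_override": {"cpu_limit": 1, "mem_limit": "2G"},
    },
    {
        "display_name": "Medium",
        "kubespawner_override": {"cpu_limit": 2, "mem_limit": "4G"},
    },
    {
        "display_name": "Large",
        "kubespawner_override": {
            "cpu_limit": 4,
            "mem_limit": "8G",
            "image": "quay.io/example/large-notebook:latest",  # hypothetical image
        },
    },
]
```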
And the idea is that people who need more to get their work done will request those resources, and ideally you have the sort of cultural mores where people don't take more resources than they need and release them when they're done with them. But there are technical solutions to that problem as well. And then, as long as I'm in this launcher, we can talk about the other aspects. We have a way to pre-load things: we have a persistent volume backed by Ceph running in the Open Data Hub, and I can pre-populate that with the contents of a Git repository. We also have integration with Ceph object storage. I didn't use it for the demo today, but if I had an object store, the Open Data Hub would actually fill in my user's credentials as environment variables, so I don't have to have those in a notebook; I just have to refer to them as environment variables and access them. And we have a lot of other demos. If you go to opendatahub.io or search Open Data Hub on YouTube, you can see other demos that show that Ceph integration more in depth.

Perfect. Let's see if Rakesh comes back with any other questions. And I think we might have everyone silent on all of the streams right now, which is amazing. So anyway, do you have a final slide that links to resources or anything that you can throw up, in case people want to find you or do some interesting research on top of OpenShift and test your theories and practices out?

You know, I actually don't think we have one of those, but I can make one before the end of the call. So give me just a second and I'll put that together.

All right, that's all good. We're good, and Rakesh's question was answered, so that was great. So I'm wondering, too, if there are any final words, since we've got about five minutes left, from Pete or Parag or anyone about what's next for ML on Kubernetes, and maybe specifically OpenShift. Is there anything coming down the pipeline, new operators, new partnerships, or things we should watch for?

I think one of the things we are focusing on now, as we expand the lifecycle out, is, from a DevOps perspective, what can we do to make it better and easier for data scientists and for the app developers, the personas that we saw in the lifecycle, depending on the kind of AI/ML application that is being built? So the tooling and the interactivity part of it, right? Because you're touching a lot of points: you're looking at data, you're looking at models in a Jupyter Notebook, but then you also want to do IDE-level kinds of work as well, because it's broader than a notebook; you want to preview things or bring them in. So we'll be looking at, now that we have identified this and we have implemented it on OpenShift, how do we make it better and easier for developers and for data scientists to come in and start creating from scratch? If you're thinking, my company's got something going on with AI/ML, how do I start? Where do I go? How can we make it easier for them? So we are focused on that; we're leading a couple of good tracks around it, so we should definitely see some goodness coming out in the near future.

All right. So you see the resources screen here. There's one last short question.
I hope it's a short question, because we're almost at the end of the hour: how easy is it to customize the JupyterHub landing page? He's saying he's on prem and would not need the AWS fields.

So those AWS fields actually apply if you're on prem, because they are also credentials for Ceph, right? In the Open Data Hub, we're deploying OpenShift Container Storage; as I said, it's used to back the persistent volume, so your workspace is basically backed by Ceph in that case. And you can also refer to larger data that is stored in Ceph hosted on OpenShift as part of that Open Data Hub deployment. So those credentials are going to apply on prem to the storage back end that the Open Data Hub is provisioning; it's not necessarily for S3 the service.

All right. I'm going to give it a pause for a minute. I'll mention that we are probably, in the not too distant future, going to be hosting a virtual OpenShift Commons gathering with an ML/AI focus. So if there are topics you want to cover or people you want to hear from, reach out and let me know, and I'll try to curate a very interesting day for everybody and reach out to some of the folks that are on the call here today and others to make that happen. But I'm not seeing any more questions coming in anywhere. So please do check out opendatahub.io and all of the AI Center of Excellence resources and tools; they're doing awesome work. We've got lots of end users and customers doing really interesting things with this, from MassCloud to Anthem, who was on the other day; lots of people doing some really interesting work on ML and AI and data science. Next week we have the folks from HowsMyFlattening.ca, which is a bunch of data scientists who are using the Ontario data sets for COVID, so take a look at what they're doing; they'll be coming in and talking about their work. So there's a lot of interest in this use case on OpenShift, and we're learning as we go and hopefully enabling you to do what you need to on top of OpenShift too.

So Will, Parag, Pete, thank you very much for taking the time to give this talk and the demo today. Always insightful and educational. And thanks again to Chris Short for producing it and making the live streams flow so nicely. So with that, we'll let you all go and have a great day, and we'll talk to you again next week; Tuesday, I think, is the next OpenShift Commons briefing. Okay, everyone. Thanks. Take care, guys. Thank you. Thanks so much.