Good afternoon and good evening. Welcome, everyone. I am Michelle DePama, Principal Solutions Architect for Red Hat and the host of the Data Services Office Hour. For those of you that don't remember, last Tuesday was International Women's Day. And in honor of that, I'm so happy to present members of the Red Hat Data Science Women's Panel. They are going to be talking about the data science model life cycle, explained. So before we get into it, I just want to let our viewers know that the format of the show is going to be a little bit different. Since we have so many panel members, we're going to have each person present. We will take questions in chat, and we'll also have questions at the end of each section before moving on to the next one. So with that explained, welcome, ladies. How are you this morning? Hi. Okay, so we're good. Who's going to go first? It'll be me. Okay. All right. So I'm going to go ahead and disappear from the screen and let everyone else disappear, and you can take over. Awesome. All right. Yeah. So hello, everyone, and thank you for joining in today. We will be going over the data science model life cycle today. And as Michelle mentioned, we are part of the Office of the CTO here at Red Hat. We are part of different teams, but overall, we make sure that we are working on different use cases for data science, coming up with innovative solutions and leveraging all the open source tools that we have. So with that, I would just like to introduce myself. I'm Hema Viradi. I'm working as a senior software engineer here in the Office of the CTO's recently formed Open Services Group. And joining me today are my colleagues Akanksha, Oindrila, and Selby. Each of us will be going over a different piece of the data science model life cycle, and to start off with what exactly this entire AI/ML life cycle is that we're talking about, each of us will touch upon the different components involved in it. So becoming data powered is first and foremost about learning the basic steps and phases of your data science project and following them from your raw data preparation to building your machine learning model and ultimately to operationalization. And I will mainly be focusing on the data engineering part of the life cycle. So as you can see in this diagram here, you will be setting up goals for your project when you set off on an AI/ML use case. You will go ahead and gather all of the necessary data that's needed for solving this problem. Then you will develop your machine learning model, followed by deploying this model so that it's available somewhere in the cloud as a service that other people can interact with. And finally, you can implement ways to run inference against this model, as well as monitoring, to make sure the model is performing accordingly. So now let's start with the actual data engineering phase of this life cycle. Once we know exactly what problem you're trying to solve, you need to start looking at the data that you have available. And data can be available from multiple sources. Aggregating and mixing data from as many different sources as possible is what makes your entire project strong, because you can cast as wide a net as the problem requires. Now, once we identify these different data sources, you need to start actually fetching the data and figuring out the best way to store it.
And in our team, for this we use something called Ceph, which is an S3-compatible backend storage where you can split all of your data sets and arrange them into different buckets, directories, and folders so that you have all of your data stored. We also have something called Trino. Now, Trino is essentially a SQL-based query engine, which helps you connect with your data sources and create all of the required tables that you might need out of these data sets. It behaves like a database engine, right? So you can also query on top of it, explore on top of it, and then maybe connect it to the required visualization tools if your end goal is to also do some sort of visualization. So once we know that we have all of the data and it's available and ready, what do we do next? We need to further pre-process your data. Especially for a data science project, there can be a lot of noise in your data, and hence it's very critical that we filter out the right data so that we have it to support all of the model development that we start working on. To perform this sort of filtering and pre-processing, or feature engineering if you may call it that, we use Jupyter notebooks. Some of you may be familiar with them, but a Jupyter notebook is essentially a web-hosted environment where data scientists can write their code and create interactive notebooks, where the code is organized into interactive cells so you can execute and run the pieces as you're working on them. And ultimately you can store these notebooks inside your repositories; we put all of our notebooks in GitHub repositories so that other data scientists can also collaborate, provide feedback, and have these notebooks as references as part of your data and model management life cycle. So once we have all of this data pre-processed and filtered, what do we actually do next? Now it's time to enrich your data further, right? You need to manipulate your data so that you get the most value out of it. You can start grouping all of the different data sources that you have so that you narrow down on the essential features, while also taking into consideration other aspects, such as ensuring that you're not incorporating any bias into your data so that your machine learning model is ultimately not affected by it. So what we do in this enrichment phase is basically try to figure out what the model would require, making sure that the data is not compromised and that we have the right amount of information there to power your machine learning algorithm. And finally, once we have all of this done, you can start visualizing your data by creating some insightful dashboards that help data scientists further decide what type of machine learning model would be best suited. To do this visualization and exploration, we have a tool called Superset. Now, Superset is very similar to Tableau, with different capabilities and different features, of course, but it's an open source tool. And the great part about Superset is that it also integrates with Trino, our SQL-based engine, and you can create your basic charts, which can be time series graphs, bar charts, pie charts, all of the essential charts that you need.
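To make that setup a bit more concrete, here is a minimal sketch of what connecting to Ceph over its S3-compatible interface and querying Trino from Python might look like. Everything here is a placeholder rather than our actual configuration: the endpoint, bucket, catalog, table, and environment variable names are assumptions for illustration only.

```python
# A minimal sketch, assuming a Ceph/S3 endpoint and a Trino instance.
# All hostnames, bucket names, and credentials below are placeholders.
import os

import boto3
import trino

# Connect to Ceph through its S3-compatible interface
s3 = boto3.client(
    "s3",
    endpoint_url=os.environ["S3_ENDPOINT_URL"],  # e.g. the Ceph gateway endpoint
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
)

# List the objects (raw data files) stored under a prefix in a bucket
for obj in s3.list_objects_v2(Bucket="my-data-bucket", Prefix="raw/").get("Contents", []):
    print(obj["Key"])

# Connect to Trino and run an exploratory SQL query over an existing table
conn = trino.dbapi.connect(
    host=os.environ["TRINO_HOST"],
    port=int(os.environ["TRINO_PORT"]),
    user=os.environ["TRINO_USER"],
)
cur = conn.cursor()
cur.execute("SELECT * FROM hive.default.my_table LIMIT 10")
print(cur.fetchall())
```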
And you can create some dashboards in there so that they can be shared with not just data scientists, but also other members of your team, maybe the business team, the stakeholders, so that they have some insight into the initial phase of your data, and they can also help you come up with some key insights, which may actually help you figure out the right type of model that we need. So these are essentially the different pieces and phases of the data engineering part of the life cycle. And I will quickly share a quick demo of exactly what we've done here for each of these pieces. So over here we have one of the projects that we started working on recently. This project may not be an ML use case yet, but it still has some of the data engineering pieces involved in it. In this project, what we're trying to do is essentially look at some GitHub data that we have, and we're trying to come up with some metrics to help analyze the health of a repository or the health of a GitHub organization. So we're looking into different GitHub issues and pull requests and so on, so that we have some metrics to share with our team to show how the repository is doing overall. And we have an initial notebook where we have done all of this data exploration, pulling the data and storing it in Ceph over S3, as I mentioned earlier in the slide. So here what we do is create an environment file, which has all of the required variables to connect to the S3 bucket that you want to fetch your data from. Then we go ahead and extract all of the GitHub data that we were interested in exploring. And once you have all of this data up and ready, we can actually do some pre-processing of the data. We have some raw data, but we would like to look at the different columns that we have available inside of this data frame and see which columns are most relevant for our analysis. Then we only retain the columns we think are useful for our data analysis. We further process this and store it back as a new file into our S3 bucket, which we have over here. So once we have this data available and pre-processed, we can dive into creating those Trino tables that I mentioned. Trino is what we are using as our database, and we would like to connect Trino to Superset as our visualization tool. So hence we go ahead and create some tables in Trino. We have an Operate First Trino instance available, which we can connect to and have those tables created for us. We would just need to pass the different Trino-related environment variables like the host and the port, and once you have that, you can check whether your connection was successful or not. Then you can go ahead and create your table by passing the location of the bucket where your data is stored. So that pre-processed data frame that we created in the notebook is what we have stored as a Parquet file in this example, and we are basically creating a table out of that Parquet file inside of the Trino database. And all of this is being done inside your notebook. So this is where the collection of the data is done, the filtering of the data is done, and the enriching of the data is also covered within this notebook, as well as the connection pieces to Trino. Then there is something called CloudBeaver. Like I said, Trino is the database, right, but we also have a UI that can help you play around with Trino.
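As a rough, hypothetical sketch of that pre-process, store, and register flow (the column names, bucket paths, endpoints, and table schema below are invented placeholders, not the project's real ones):

```python
# A hedged sketch of the pre-process / store / register flow described above.
import pandas as pd
import trino

# Pretend this is the raw pull request data we extracted from GitHub
df = pd.read_json("raw_prs.json")

# Retain only the columns we think are useful for the analysis
df = df[["id", "title", "created_at", "closed_at", "created_by"]]

# Store the pre-processed frame back into our S3 bucket as a Parquet file
# (pandas hands the s3:// URL to s3fs; credentials come from the environment)
df.to_parquet(
    "s3://my-data-bucket/processed/prs/prs.parquet",
    storage_options={"client_kwargs": {"endpoint_url": "https://s3.example.com"}},
)

# Register a Trino table on top of that Parquet location so Superset
# (or anyone with SQL) can query it
cur = trino.dbapi.connect(host="trino.example.com", port=443, user="demo").cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS hive.default.prs (
        id BIGINT,
        title VARCHAR,
        created_at TIMESTAMP,
        closed_at TIMESTAMP,
        created_by VARCHAR
    )
    WITH (
        external_location = 's3a://my-data-bucket/processed/prs/',
        format = 'PARQUET'
    )
""")
```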
Let me just log in here. And what it does is it basically helps you run queries against your database. So we have the database instance here, and we are looking at a particular catalog where our data is stored. You can actually go and look further into the schema and the different tables that we have. So in the notebook, we created a table called PRs to store all the PR data, and we created a table called issues to store all the issue data. You can see those two tables here: there's a PRs table, there's an issues table. And inside of those tables, you have all the columns that we looked at inside of the notebook. So great. Now that you have this table created and you know that you have all the columns, we are actually ready to go and play with Superset and create those dashboards. So again, we have a Superset instance deployed in our Operate First cloud environment, so we can log in. Oops, let me make sure I'm in the right location. Yeah, okay. So you can log in to Superset over here and create those tables and dashboards. Once you're logged in, you should be able to make some charts and dashboards based on the tables that we had. And this is a dashboard that we recently created for that exact issues and PR related data. Here we're just trying to look at a certain repo. I can go ahead and select a given repo, apply, and it'll show us some metrics that we're looking at, like how many issues are open, what's the mean time to close issues for this repository, what is the trend of issues created over time, as well as who the main issue contributors are for a given repository. And the nice thing about your dashboard is you also have the option to create different tabs. So if you want people to focus on different sections of your insights, then you can create those tabs accordingly, and you can arrange your charts to reflect them. So here again, I want to look at all the PRs which are in this particular repository. And I have some metrics populating over here, like the number of open PRs, closed PRs, the PRs created over time, as well as some PR contributors for this particular repository. So this is where the visualization and insights are all made, inside of this Superset dashboard. And once this is shared with all of our stakeholders, as well as our teammates, we can get some more feedback on the data, and they can also help us understand if there is some other missing data required. And if it's an AI machine learning problem, then we can start moving on to the actual model development phase and figure out the best algorithm that would help us achieve the solution to your problem. So that concludes my section on the data engineering piece, but I'm happy to take any questions. Okay, so I don't see any questions in chat, but I have a couple of questions. Sure. First, what needs to be thought through once you identify the types of data desired? Yeah, so once you have figured out your problem and what the right data is, I think the next step is thinking about the ways to collect this data, as well as the data itself, right? You need to think about the reproducibility of your machine learning model, because the data that you have underpins the reproducibility of your model, as well as the automation piece of it.
So are we trying to train our model on a regular basis? If that's the case, the data also needs to be updated on a regular basis. So you should have some sort of pipeline in place that can support this automation and make sure that the database is frequently getting updated. Another thing is that sometimes data can be missing; you might not have all the data at hand. So in that case, what do you do? You need to figure out: is there a way we can generate synthetic data? Is there some crowdsourcing way of fetching all of this data? Can we look at other data sources out there which are open and public for us to use? And yeah, so there are these different challenges that you'll come across based on the use case and the data at hand, and just thinking about them beforehand would definitely help us through the entire life cycle. Okay, and you may have answered some of this already, but another question is: what are your challenges in bringing together and collecting data from all the different sources? You've already mentioned a few, but in your experience, what are the challenges there? Yeah, I think the biggest challenge is having everything in a certain unified format, because one data source may come in a certain format, while a different data source might structure its data differently. So figuring out the best way to combine and aggregate both, and have them in a common place where you can easily extract, pull, and explore the data, I think that would be the biggest challenge. But with that being said, you also have a lot of tools that can help you do this, and you can aggregate the data in spite of it being in different formats. There are ways to manipulate your data and make sure it's in the right format. And especially since data scientists are more comfortable using notebooks and Python, I think that gives us more flexibility to do all of that manipulation inside of notebooks using Python, because that gives us more in-depth exploration for figuring out the right way the data can be structured and ultimately fed into the model. And one more small question. Is this where you spend most of your time? If you look at the overall project, is the data engineering piece where you spend the largest bulk of your time? It really depends. Sometimes, yes, you might be spending weeks here because the data is just not there; it's just not enough. Even if you have a machine learning model that you're ready to explore, it will not perform as well as you would expect if you don't have the right data. So it's always crucial to make sure that you do have the right data before you actually dive into all those complex models, because the model might not even give you the right accuracy if the data is not in place. So sometimes there is a lot of challenge in figuring out the exact data that we need and how to get it. For example, we were playing with this GitHub data, and there are multiple ways to get GitHub data. There's an API that you can use, and there are some open source tools out there, but everything comes with some challenges and some roadblocks that you'll have to face. In the case of the API, for example, we were getting rate limited, because there are only so many requests you can make against the API to fetch that data. So such things will come up depending on how you want to extract the data.
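For illustration, here is a minimal sketch of paginated GitHub API fetching with basic rate-limit handling, the kind of thing this implies. The repository name and token variable are placeholders, and this is one simple approach rather than the exact code we use:

```python
# A rough sketch of paginated GitHub API fetching with rate-limit handling.
import os
import time

import requests

def fetch_issues(repo: str, token: str):
    """Yield all issues for a repo, sleeping when the rate limit is hit."""
    url = f"https://api.github.com/repos/{repo}/issues"
    params = {"state": "all", "per_page": 100, "page": 1}
    headers = {"Authorization": f"token {token}"}
    while True:
        resp = requests.get(url, params=params, headers=headers)
        # If we've exhausted our quota, wait until the limit resets
        if resp.status_code == 403 and resp.headers.get("X-RateLimit-Remaining") == "0":
            reset_at = int(resp.headers["X-RateLimit-Reset"])
            time.sleep(max(reset_at - time.time(), 0) + 1)
            continue
        resp.raise_for_status()
        page = resp.json()
        if not page:  # no more pages left
            break
        yield from page
        params["page"] += 1

# "org/repo" here is a placeholder repository name
issues = list(fetch_issues("org/repo", os.environ["GITHUB_TOKEN"]))
```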
But yeah, sometimes it's seamless and the data is already ready for you, so you're just ready to dive into the model pieces. So yeah, it really depends on the problem. Thank you. Wonderful. So there are no questions in chat. So if you're all done, I'm happy to move to the next section. Yes, definitely. Thank you. All right. So hang on. Who's presenting next? Who's there? That would be me. Oh, okay. Akanksha. Okay. Hang on a second. Would you like to present your screen? Sure. Okay. Awesome. So hey everyone, this is Akanksha. I'm a data scientist in the Office of the CTO with the Emerging Technologies Group. And today I will be talking about the AI/ML workflow life cycle, which consists of model development and model deployment. So a typical project life cycle starts with data engineers gathering and preparing the data to make it ready for the data scientists to develop some machine learning models. A data set is the starting point in the journey of building a machine learning model. Simply put, the data set would essentially be an m × n matrix, where the rows represent the samples and the columns represent the features. Next, the data scientists would start with exploratory data analysis to gain a preliminary understanding of the data, through things like descriptive statistics and some data visualizations. Next, this data will be subjected to various checks and cleaning processes to ensure its quality before we start to develop the machine learning models for your intelligent applications. Well, you may ask, what is a machine learning model? A machine learning model is a file that has been trained to recognize certain types of patterns. You train a model over a set of data, providing it an algorithm that it can use to reason over and learn from this data. And once you have trained this model, you can use it to reason over data that it may not have seen before and make predictions about this new data. A machine learning model could solve simple problems like identifying or distinguishing between cats and dogs, or bigger problems like predicting COVID cases, recognizing voice patterns, or text prediction for our emails. We'll go over this in more detail when we go over the demo. And once the model is in place, we have these powered intelligent apps that are deployed, and the machine learning models start inferencing, that is, making predictions based on the new data they see. And once all of this is done, you may have to monitor and manage your machine learning models in production to make sure that they are making the right predictions, and if not, they have to be retrained as needed. Now let's take a look at a demo where we are going to try to predict and recognize some car license plates. Just give me one second. Yes, here we go. So this is a JupyterHub environment, which gives you various interactive ways to create Jupyter notebooks, deploy models, and create pipelines. Starting with this notebook that we've already created, we have a machine learning model that will recognize and extract license plates just by looking at some car pictures. Starting with this, we first have to install some libraries. If you're already using an Operate First cluster or a Red Hat OpenShift Data Science cluster, most of these libraries would be pre-installed for you. If not, you can go ahead and pip install them using the requirements file. And once all this importing is done, let's move on to finding the plate.
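The next steps revolve around loading a trained detector and resizing input images. As a hedged preview of what such helper code can look like, assuming a Keras-style saved model and a 224x224 input size (both are stand-ins, not the demo's actual values):

```python
# A hedged sketch of a model-loading and image pre-processing helper.
# The model file name and the 224x224 input size are assumptions.
import cv2
import numpy as np
from tensorflow.keras.models import load_model

# Load a previously trained plate-detection model (hypothetical file name)
model = load_model("plate_detector.h5")

def preprocess(image_path: str, size: int = 224) -> np.ndarray:
    """Resize a car image to the size the model was trained on."""
    img = cv2.imread(image_path)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)  # OpenCV loads BGR by default
    img = cv2.resize(img, (size, size))
    img = img / 255.0                           # scale pixels to [0, 1]
    return np.expand_dims(img, axis=0)          # add a batch dimension

# Predict the license plate region for one test image
prediction = model.predict(preprocess("car.jpg"))
```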
For us humans, it's really easy to understand what a car plate is: it's just a rectangular object on the front and back of the car. But for a machine to understand this, it's slightly more complicated. So what we're going to do first is find where the plate is located. For a computer to understand, we will feed in some code for it to understand that it is a rectangular object on the front or back of the car. We start by writing some helper functions and processing functions that allow us to load a model that will help us identify the location of the license plate in a car image. So we start by loading a model, then resize the image of the car to the size that we had originally trained our model on. It can happen that your model is trained on images of a certain size, and the input that you want a prediction for is not the same size. So we'll pre-process this new image and fit it to the size that is most convenient for the model to understand. Moving forward, we will try to find the region of the image where the license plate lies, and finally detect the license plate in the entire image. So let me make a detection. We load the model and run our various functions on it. And let's see: for this car image, where the number plate was on the back side of the car, our model was able to identify the license plate area. So one job is done here. Now that we have a new image, which is the license plate itself, we would like to detect the various characters written on it. So we first try to grab the contour of each digit, starting from left to right, because that's how we read numbers. And one character at a time, we try to recognize all the characters written on this license plate, then put it all together in one function, and finally load this new model to detect the numbers on the license plate and read out the plate number. It might not be super accurate sometimes; it might miss one or two numbers here or there, but it does the job fairly okay, I assume. Once this model is completely up and running, we will feed all the test images to this model and see what the detection looks like. So looking at this, it was able to detect some of the numbers on this license plate, but detected an M as an N. These are a few inaccuracies that we might face when we are training models. It might not be 100% accurate, but it does serve the purpose. And this is how we try to get a lot of number plates out of these test images. Next up, we have already deployed a Flask application for this model, and you will see how we can test this application: we just give an input image, and our model will be able to detect the license plate number, even without ever having seen this car or this number plate before. So let's go ahead and start testing this application. First, we check if the server is running or not. Status okay, so that means the Flask application is running fine. Now we will pass our new image, which is cars 374. Let me go ahead and show you what this image looks like. So this is the car image, and we want to detect the number plate from this car. And we pass this data file to the predictions endpoint of our Flask application. Let's try to run that.
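Before looking at the result, here's a minimal sketch of what such a prediction request can look like from Python. The service URL, endpoint paths, and response shape are assumptions about the Flask app, not its actual API:

```python
# A minimal sketch of calling a deployed model service from Python.
import requests

base_url = "http://plate-detector.example.com"  # placeholder service URL

# First, check that the server is up
assert requests.get(f"{base_url}/status").ok

# Then send a car image to the prediction endpoint
with open("car.jpg", "rb") as f:
    resp = requests.post(f"{base_url}/predictions", files={"image": f})

print(resp.json())  # hypothetically something like {"plate": "S7JDV"}
```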
This was a command-line curl call that you can use to get predictions from your up-and-running model, but we can also do the same using Python code, along the lines of the sketch above. We will try to run this. And here we see that the number that was detected was S7 JDV. If you go back and look at the car image, it is absolutely the same. So this is what a typical model life cycle looks like: getting the data, training the model on tons of images of cars to help the model understand where the license plate is located, then moving forward to get each set of characters inside the license plate, and deploying this as a Flask application so you don't have to rerun your model again and again on heavy data sets and consume a lot of time. So this is my section on model development and deployment. Please let me know if you all have any questions. Oh, that was wonderful. Okay, so I don't see any questions in the chat, but I have a couple. How do you choose which model to develop? So this answer would depend on various factors, like what the problem statement is, what kind of input you have, what kind of output you're expecting, and the type and size of the data. It also depends on other factors, like how much time you are able to invest and what set of resources you have. But to give you a broad understanding of how we decide on a model: looking at the data set, there are three broad ways to classify this. If you have a labeled data set, say a set of images of cats and dogs where you already know which is a cat and which is a dog, you have a supervised machine learning approach, in which you have a labeled data set, and this is what you're going to go ahead with. In cases where you don't have a labeled data set, for example, you just have a bunch of images of cats and dogs, you would go ahead with an unsupervised learning approach, which would be something like clustering these images into just two groups, since all of the cats might look alike and all of the dogs might look alike. Then there is another approach, which is reinforcement learning, where your agent takes a course of action in its environment and tries to learn by trial and error in an effort to maximize a reward. For example, if we were to play tic-tac-toe, we make mistakes from time to time, and we realize after a point that, okay, this is not how I was supposed to play. Using that same mentality of learning from our mistakes, we try to train models using the reinforcement learning approach. So having said that, it might not be a very straightforward choice when you are picking a model. It might be a combination of a few different approaches. But this is the basic approach that we follow when we choose what model we want to use. Okay, so we have a question in chat, which is for you: what is the hardest part of building a model? All right, so there could be scenarios where you don't have enough data. I think that is one of the hardest parts when it comes to building a model. Sometimes you know what problem you want to solve, but you don't have enough data to train your models on. So in my opinion, that is the hardest part of building a model. Okay, and a last question before we move on. What are the best tools, in your opinion, for developing a machine learning model?
So there are tons of AI and machine learning libraries and packages to help you out when you are building a machine learning solution. But as I said earlier, it would definitely depend on the kind of data and the resources you have. To name a few, we have scikit-learn, PyTorch, TensorFlow, Keras, RapidMiner, and XGBoost. But again, you could maybe use a combination of these to come up with the best model for your problem statement. Wonderful. Thank you so much. Thank you. Okay, ladies, I'd like to ask who hasn't gone yet. Selby or Oindrila, who's going next? I can go next. Okay, wonderful. Would you like to share your screen? Absolutely. All right, so here we are. My name is Oindrila Chatterjee. I'm a senior data scientist in the Emerging Technologies Group at Red Hat. So I'm going to be talking a little bit about AI/ML pipelines. Now that we have an idea of how you engineer your data, how you train your machine learning models, and how you deploy them, let's take a look at how you would put all of these pieces together in an automated pipeline. So your data science project may involve all of the steps we saw earlier: data collection, feature engineering, and model training. And there might be several instances where you want to automate these different steps of your data science workflow and run your notebooks or scripts in an automated fashion, in series or in parallel. AI/ML pipelines come in handy there. They are basically reproducible workflows that can be used to automate repetitive steps in a machine learning workflow, from data extraction and pre-processing to model training and deployment. So how do AI/ML pipelines make a data scientist's life easier? First of all, ML pipelines help automate these manual and repetitive ML tasks, and they help create faster iteration cycles by parallelizing these tasks. We also know that predefined and automated components lead to better reproducibility and more consistent workflows, so that's a great reason to choose AI/ML pipelines. Next, pipelines can also help introduce better version control; they can help monitor your ML code and artifacts and make it easier to track different executions and runs. So what tools would you actually use to create these pipelines? Using Elyra and Kubeflow Pipelines, two tools available on Open Data Hub, a project that's being operated in an open, Operate First environment, we can directly use Jupyter notebooks and automate them in a pipeline without having to translate them into scripts. Elyra is basically a visual editor for notebook pipelines that is available as part of JupyterLab, where most data scientists develop, and it enables us to trigger Kubeflow Pipelines from a very intuitive UI. And Kubeflow Pipelines, which is a platform for building and deploying scalable machine learning workflows, allows us to actually put all of these pieces together and run them in an automated fashion. So now that we know how AI/ML pipelines can help us and make our lives easier, let's take a look at some prerequisites, what you would actually need to create these pipelines. First of all, you need a JupyterLab instance to actually interact with Elyra. Secondly, you would need a Kubeflow Pipelines instance deployed within your cluster; you can easily deploy it on a Kubernetes-based environment like OpenShift.
You would also need the Jupyter notebooks or scripts with the code you actually want to automate, or the environment where you're actually developing these models. And finally, you would need container images containing the notebook dependencies, which you feed into the Kubeflow Pipelines environment so that these notebooks can run seamlessly. So I'm going to quickly share a demo of a pipeline running in action. Let me pull that up. Great. So as you can see here, we have an already created pipeline in a JupyterLab environment. You can see two nodes which are connected using the Elyra UI. I'm going to quickly go over both of these steps in the pipeline. The first step collects some raw data from a CI data platform called TestGrid and saves the data onto S3 storage. The second step downloads that data from the S3 storage, calculates certain metrics, and stores the calculated metrics back onto S3 storage. So let's see how we can add a new node, or a new notebook, to this pipeline. For that, I would simply drag and drop a new notebook and connect the notebooks using the Elyra UI. Once we have done that, we also want to set the properties for the notebook, so we can select the runtime image that the notebook runs in, which we have already created. We can also select the resources needed to run these notebooks, like the CPU and the RAM. We'd also select any file dependencies that the notebook or the script depends on; I'm selecting the metric template notebook, which these notebooks need in order to run. And we can also set any environment variables that the notebooks need to function, such as the cloud object storage credentials or any other environment variables that you have defined within the notebook. Once you have done that, I would save this pipeline and run it to actually see it in action. For that, I'm going to simply hit this run or play button and provide it a name. I would also select a pre-existing Kubeflow Pipelines runtime configuration that I created earlier. And once I have selected that, I would hit OK. In order to view this pipeline in action and move over to the Kubeflow Pipelines UI, I can click Run Details here. This takes us to the Kubeflow Pipelines UI, where we can see the notebooks running as nodes. As you can see here, the first notebook has already started running. Now, in order to view and debug any logs during the execution of this notebook, we can go to the logs and detect if anything's going wrong. So this is what a pipeline looks like after it has finished executing successfully: there should be a green check mark next to each step of the pipeline. Now, to view any metrics that were captured during the execution of these notebooks, we can go to the run output. So if I'm capturing certain metrics like the number of tests, the number of build failures, let's say the time taken to run certain cells, and so on and so forth, we can also view that here. This can be especially helpful when you're trying to capture metrics during the execution of a model training pipeline, let's say the model performance metrics like accuracy, and you can also use this to track your hyperparameters, etc. So this was a simple demo of a small pipeline, three notebook nodes in action, but this can be extended to complicated AI/ML development workflows.
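For reference, the same kind of multi-step workflow can also be expressed directly in code with the Kubeflow Pipelines SDK, without the Elyra UI. Here's a hypothetical sketch using the v1-style API; the component bodies, base image, and host are placeholders:

```python
# A hypothetical two-step pipeline with the Kubeflow Pipelines SDK (v1 API).
import kfp
from kfp import dsl
from kfp.components import create_component_from_func

def collect_raw_data():
    """Pull raw data and save it to S3 (placeholder logic)."""
    print("collecting raw data...")

def calculate_metrics():
    """Download the raw data and compute metrics (placeholder logic)."""
    print("calculating metrics...")

collect_op = create_component_from_func(collect_raw_data, base_image="python:3.9")
metrics_op = create_component_from_func(calculate_metrics, base_image="python:3.9")

@dsl.pipeline(name="ci-metrics-pipeline")
def ci_metrics_pipeline():
    step1 = collect_op()
    step2 = metrics_op().after(step1)  # run only once collection finishes

# Submit a run to an existing Kubeflow Pipelines instance (placeholder host)
client = kfp.Client(host="https://kubeflow-pipelines.example.com")
client.create_run_from_pipeline_func(ci_metrics_pipeline, arguments={})
```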
You could automate your data engineering and your data preparation, and you can also include your model development nodes, all in an AI/ML pipeline, in order to make your AI/ML development cycle pretty seamless. So yeah, that was a quick demo. Please feel free to let me know if you have any questions. Sure, I actually do. Okay, so I assume a data scientist doesn't come out of school knowing how to create pipelines. So what kind of skill set would you recommend that they work on? Absolutely. So as you saw earlier, this ML pipeline consists of a sequence of steps which can actually help a data scientist automate these different bits. When you go from a prototyping phase into a production-ready model, that's when you want to take these bits and put them together in an automated fashion. The main goal of these pipelines is actually to make a data scientist's life easier, so luckily you do not really need any advanced skills to set these different pipelines up. The tools that we saw earlier, like Elyra, Kubeflow Pipelines, even MLflow, all make it really easy to set up these AI/ML pipelines from within a Jupyter notebook environment, which a data scientist is already comfortable with using; you can use a simple drag-and-drop UI or a command line interface, whatever the data scientist or the developer is comfortable with. And like you saw, most of these tools, like Kubeflow Pipelines, also provide a very neat UI for data scientists to compare model runs. So to answer your question: actually not a lot of advanced skills; most of the tools that are available luckily make it very easy for data scientists to build AI/ML pipelines. Okay, so you mentioned Elyra, which you showed, and Kubeflow, and then MLflow was one of the other tools you mentioned? Are there any others? Yeah, actually there are several tools you can use; these were just two examples. You can use Argo Workflows to create your own pipelines. You can even use simple cron jobs to tie together different notebooks or different scripts. A tool that I learned about just last week, Apache NiFi, is also a really cool tool that you can use to build AI/ML pipelines. It helps build out even the smallest of components, like data transformation, loading models, and exporting results, and you can have these individual steps or nodes for the smallest of tasks that a data scientist is familiar with doing. So yeah, there are actually several tools you can use to build out these pipelines. Fantastic. Thank you. Thank you so much. Okay. Selby, I think you're next. Let me try to share the demo. Should I have started? No worries. But let me just show my presentation slide first. Okay. Is it visible? Is it showing properly? On my left screen it does. Okay, hang on a second. But I see you sharing a window, so that works. I can share my screen, I guess, or I can just introduce myself in the meantime. I think it's okay. Hello everybody. My name is Selby. I am a software engineer in AI services, so I work with AI/ML ISVs, trying to bring their technology into Red Hat OpenShift and also Red Hat OpenShift Data Science; let me just market our product a little. But today I'll be showing a prerecorded demo on distributed training across several GPU nodes. This demo was done by me and Diane Padema, who's not present today. But let me start the demo.
In our short demo, we'll focus on how we can utilize all the resources available to us and speed up the training of a convolutional neural network model on the MNIST data set. We're going to use OpenShift Container Platform, running Kubeflow and PyTorch, to distribute machine learning training across multiple GPU-enabled nodes. We'll use the distributed data parallel feature in PyTorch to replicate the model across multiple GPUs on multiple nodes and divide the training data so that each of the model replicas processes a subset of the total training data. With DDP, you replicate the model on multiple GPUs, run the forward and backward passes of the model in parallel, and then do a synchronization operation to aggregate all of the gradients. And you choose how to do this synchronization operation based on the backend that you specify; you can choose from NCCL, Gloo, or MPI. In our case, we're using NVIDIA T4 GPUs, so we'll use NCCL for our backend. Now we're going to look at our PyTorch Python code, or script. So let's say you have a convolutional network model that consists of two separate layers, and what you want to do is distribute it on several GPUs across the nodes. The important pieces of information that you need are the master address, master port, rank, and world size. World size is the total number of your GPUs. Rank is the ranking of each GPU among all of them. And the master address and master port point to the master process that coordinates all the others. This information will actually be fed in by Kubeflow, so you don't have to worry about it at all; it will be set up for you. But the thing that we do need to worry about and set up is the dist.init_process_group call. This command will initialize the distributed process group, and here is where we choose our backend, which is NCCL in our case because of the GPUs. Another important part of this is distributed data parallel, as mentioned earlier. Here we wrap the model using DDP and split the data across the devices, and here we specify the device IDs the model runs on and where to send the output. Then for the rest, we just download the data set, do the sampling and load it, and run our machine learning model training. After that we do the forward and backward passes and optimize, and eventually we print out our epochs and steps and the loss at each step. And here we set the environment variables that I mentioned earlier; we just read the environment, and this data is fed in by Kubeflow. And here we're using 20 epochs. Now let's move to the Dockerfile. Now that we have our PyTorch script, let's actually containerize it, and in this Dockerfile, that's what we're doing. We have an already-built base image on top of which we're going to build further; it already has UBI, CUDA, PyTorch, and all the other related dependencies installed. What we're going to do is copy in that main.py, the PyTorch code that we looked at earlier, and the script will automatically execute inside the container upon deployment.
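To make that pattern concrete, here is a condensed sketch of such a DDP training script. The network layers and hyperparameters are simplified stand-ins, not the demo's actual main.py:

```python
# A condensed sketch of the DDP pattern described above.
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
from torchvision import datasets, transforms

# MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE are read from the
# environment, where Kubeflow has already set them for us
dist.init_process_group(backend="nccl", init_method="env://")

device = torch.device("cuda:0")  # one GPU per pod in this setup
model = nn.Sequential(           # a small stand-in convolutional net
    nn.Conv2d(1, 32, 3), nn.ReLU(),
    nn.Conv2d(32, 64, 3), nn.ReLU(),
    nn.Flatten(), nn.Linear(64 * 24 * 24, 10),
).to(device)
model = DDP(model, device_ids=[0])  # wrap the model for distributed training

# The sampler gives each replica its own shard of the training data
dataset = datasets.MNIST(".", train=True, download=True,
                         transform=transforms.ToTensor())
sampler = DistributedSampler(dataset)
loader = DataLoader(dataset, batch_size=64, sampler=sampler)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(20):
    sampler.set_epoch(epoch)  # reshuffle shards each epoch
    for step, (images, labels) in enumerate(loader):
        optimizer.zero_grad()
        loss = loss_fn(model(images.to(device)), labels.to(device))
        loss.backward()  # gradients are all-reduced across replicas here
        optimizer.step()
        if step % 100 == 0:
            print(f"epoch {epoch} step {step} loss {loss.item():.4f}")
```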
Now let's look at the YAML that we need to run this on OpenShift. There are four things that you need to specify when you create an object in OpenShift or Kubernetes. The first thing that you need is the apiVersion, which we have here, and this says which version of the Kubernetes API you're using when you create the object. Then the next field is kind, which says what kind of object you're creating, and we're creating a PyTorchJob object. We can do this because we've installed the PyTorchJob CRD in this cluster, so it's aware of this type of object. Then we specify metadata, which uniquely identifies the object and what namespace it will run in. And fourth, with the spec we specify the desired state for the object. So the Kubeflow operator is going to see this desired state in the spec, and it is going to cause it to happen. It's going to go to the Kubernetes scheduler and say, hey, I need a master with one GPU, so find me a node that has one GPU, and I need a worker with one GPU also, so please find me a node with one GPU. And that could happen on the same node or on separate nodes; in this case, it's going to happen on separate nodes. So now we'll take a look at what it looks like when you run this YAML, the file that we just looked at, creating our PyTorchJob custom resource. We can see that the pods are being created and initialized. We can go to the master pod, look at the logs, and see that it's now downloading the data sets. Meanwhile, we can also go to our NVIDIA GPUs and see what's happening there. We can see one is already being used. We can go to the other terminal and check the other one; the other GPU is also being used, and you can see it's going through the epochs. All right, now it's on epoch number six. Now we can see that our training has completed in one minute and 15 seconds. We can also go to the other pod and check how that went, and we can see that the other worker pod has also completed, in one minute and 16 seconds. The important part here that I would like to point out is the environment variables. As I mentioned earlier, for a distributed PyTorch job we need the master address, master port, world size, and rank, and this information is set by Kubeflow and fed into PyTorch. Yeah, and that should be all for my demo. Okay, so I have, of course, a couple of questions. Can you talk about how distributed training across GPU nodes fits into machine learning model monitoring and management? Sure, so basically with machine learning model monitoring and management, what one can do is track performance, model retraining, CI/CD pipelines, the distribution of input and output data, or even hardware metrics. And in our demo, we essentially collect the hardware metrics more or less manually. By testing various hardware setups for distributed training, such as GPUs, CPUs, and so on, we can collect those hardware metrics and optimize the setup accordingly for more efficient and faster model training and inference. Okay. Okay, and one last question. All right. So in your demo, I saw that the workload was distributed across two nodes with GPUs. So how does that scale? Like, let's scale that up to several GPUs on several nodes. Yeah, the beautiful point about Kubeflow is that it's actually agnostic to the GPU setup that you have. One organization may want to scale up by adding more GPUs on one node, but some others might want to scale out by adding more machines with GPUs. Either way, no matter what kind of setup change happens with these GPUs, there's no need to update or change the Kubeflow settings.
With the help of Node Feature Discovery and the NVIDIA GPU operator that we have, it's automatically able to identify how many GPUs there are in total, be it several GPUs on one node or several GPUs on several nodes. So it doesn't really matter. After Kubeflow requests the GPU resources, it will set one of the GPUs as the master and the rest will be the workers, and that way it's able to orchestrate this whole distributed training. Oh, wonderful, so you don't have to worry about it. Okay. So if everyone's still with us, I'd like to bring everyone back just to say a nice thank you and everything. Hang on, hang on, hang on. Okay, here we are. So I really want to thank all of you for joining me this morning. That was wonderful, and I certainly would like to do this again, maybe doing a deep dive on particular topics and hearing more about what you're working on. I hope you had fun, and I hope our viewers had fun as well. So thank you all. I appreciate it. Thanks. Thank you. Thank you for having us. Here we go.