Hello. Can you hear me? Good. Hi, everyone. I'm Tushar Katarki, a product manager on OpenShift, focused on AI/ML on OpenShift. I'm very pleased to introduce this next topic, which is really about how you do machine learning, automated, on top of OpenShift. With us today we have Etai and Guy from the IDF, and they are going to take it from here. Thank you, Tushar.

So hello, everybody. We are Etai and Guy from the IDF, and today we're going to talk about a machine learning platform we developed on top of OpenShift and Kubernetes that creates state-of-the-art machine learning models and streamlines the jobs of the data scientists and software engineers in our organization. A little bit about us: I'm Etai, the machine learning team leader in the IDF. Everything you'll see here in the demo is something that we built in our team. And Guy? I'm Guy, R&D manager of a private cloud managed-services project, in particular the OpenShift environment that we're going to see today.

A little bit about the IDF, the Israel Defense Forces, where Etai and I come from. Right now the IDF is in the middle of a digital transformation whose main goal is to accelerate the delivery and development of applications across our systems. We have a wide variety of systems, from management applications to what we're going to see today, which accelerates the data scientists' work.

So now let me move on to the main topic: how we can make each and every one of you in this room a machine learning expert and build state-of-the-art machine learning models in just minutes or hours. But first, we need to understand a little bit about machine learning and its basics. Machine learning is really just learning from previous data in order to predict the future, and we'll follow an example that demonstrates the whole process throughout the presentation. We have here a data set of medical diagnoses, and our mission is to predict whether James will get the flu or not, based on the parameters you can see here. How can we do it? We have a wide variety of machine learning models and algorithms; this is just a small sample of them, and we can use each and every one of them to create models that help us predict the future. But each model has its own configuration, its parameters, and it's usually a really exhausting task to select the best one and fit it to the data. So we need help, because it can take a lot of time to find the best solution to the problem.

Let's dive into what a data scientist really does when they get a new data set and start building models. First, we start with the data, in a step called data engineering, which includes removing irrelevant columns and making the data more predictive. In our flu example, we'll obviously remove the patient-name column, because it will not predict whether someone will get the flu. But we assume that fever, for example, will be more predictive of whether someone will get the flu, so we'll give it more weight. After that, we start the machine learning task itself, which is taking a lot of algorithms and models and fitting them to the data set, to the problem, in order to get a predictive model. You can see that it's a cycle, and it's almost always an exhausting process that takes a lot of time. It's based on trial and error, so it can take even a few months for a specific data set.
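To make that data-engineering step concrete, here is a minimal sketch in Python with pandas and scikit-learn. The file and column names (flu.csv, name, visit_date, gender, got_flu) are hypothetical stand-ins for the flu data set shown on the slide, not the actual schema:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical flu data set; the real file and columns were not shown in full.
df = pd.read_csv("flu.csv")

# Data engineering: drop columns with no predictive value.
df = df.drop(columns=["name"])

# Convert non-numeric columns so a mathematical model can consume them.
df["visit_date"] = pd.to_datetime(df["visit_date"]).astype("int64")
df = pd.get_dummies(df, columns=["gender"])

# Split features and label so the model can be evaluated later.
X = df.drop(columns=["got_flu"])
y = df["got_flu"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```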
And after that, we have the operations side. We need to serve this model as a production service, so that different applications and consumers can consume its results and predict the future. In our organization, for example, out of every 100 new data sets that come in, only about five models really make it to production in the end, and that's because of this exhausting process of building machine learning models. Some of them fail because of really, really small things, and it's just a waste of data and a waste of knowledge.

From this pipeline, we can identify four top challenges that we set out to solve. The first one is environment. Machine learning processes usually require unique resources, like GPUs or huge numbers of CPU cores and lots of RAM, and these things are really hard to allocate dynamically. Before our platform, data scientists just bought really big servers that they owned, and each server was assigned to one data scientist. It was really not cost effective, because the resources weren't shareable. We're going to show you next how we solved this. The second one is history. We saw that the machine learning building process is a cycle, and each iteration includes a lot of model evaluations that are usually not stored anywhere. Data scientists don't keep track of the results they got for each model and each configuration, and that's bad: not only does it waste a lot of time, we could also use this history later in other projects and experiments to make the building process more efficient. We're going to see that in the demo as well. The third one is optimization. We were looking for a tool that takes a wide search space and just gives us the best combination, the best model. We'll see how we solved it in a distributed way on OpenShift and how we utilize OpenShift to do it efficiently. And the last one is deployment. Here we have a gap between the data scientist's knowledge and the software engineer's knowledge. The data scientist doesn't know enough about dockerizing applications and operations, so we can't just take their mathematical model and deploy it to production as a REST API. On the other hand, a software engineer doesn't know how to handle the mathematical model the data scientist built and expose it as a REST API. This is why a lot of models just never make it to production, and we're going to see how we solved that.

Now that we understand the challenges, let's see how we solved them, starting with the first one, environment. We deployed JupyterHub, which is a common tool for data scientists these days, and we control how its resources are allocated; they are actually allocated dynamically. So, Guy, let's spawn a new notebook. Here you can see a variety of machine learning environments. Each one includes unique packages, unique environments, and unique resources, depending on which data set and which problem I'm trying to solve. Right now we'll select the data science notebook, because we want to solve the flu problem, and spawn it. If we go back to OpenShift to see what happens behind the scenes, we can see that we got a new pod here with its unique resources and unique packages. Each notebook is just a Docker image that is controlled by us.
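The talk doesn't show the JupyterHub configuration itself, but with the standard KubeSpawner this kind of per-profile, on-demand resource allocation might look roughly like the following jupyterhub_config.py sketch. The profile names, image names, and resource numbers are illustrative assumptions, not the IDF's actual settings:

```python
# jupyterhub_config.py -- a minimal sketch, assuming JupyterHub with KubeSpawner.
# Each profile maps to a notebook image with its own packages and resources,
# so a pod is spawned on demand when a user picks a profile.
c.JupyterHub.spawner_class = "kubespawner.KubeSpawner"

c.KubeSpawner.profile_list = [
    {
        "display_name": "Data science notebook (CPU)",   # hypothetical profile
        "kubespawner_override": {
            "image": "registry.example.com/notebooks/datascience:latest",
            "cpu_limit": 4,
            "mem_limit": "8G",
        },
    },
    {
        "display_name": "Deep learning notebook (GPU)",  # hypothetical profile
        "kubespawner_override": {
            "image": "registry.example.com/notebooks/gpu:latest",
            "extra_resource_limits": {"nvidia.com/gpu": "1"},
        },
    },
]

# Idle notebooks are typically culled by JupyterHub's idle-culler service,
# which deletes the pod and frees its CPU/GPU resources for other users.
```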
And when the research is over, when the data scientist goes home for the day, they just shut down the notebook, and the resources are freed for other data scientists. This includes GPUs, which we also have; nobody has enough GPUs today, so this is how we solve that.

Now let's move on to the environment I prepared beforehand. You can see Jupyter here with the flu data set we've just generated: the same data set we saw in the example, and a notebook that includes our demonstration. What I'm going to do now is fit the data to machine learning models, so we'll run some basic data science operations to prepare it. We read the data, drop irrelevant columns like the name column, convert some columns to numbers and dates so the mathematical algorithm can understand what's in the data, and split it so we can evaluate the model.

Now let's recall the second problem, the history problem. I'm going to build three decision tree models here. A decision tree is a predictive model that learns from the data and then knows how to predict future data. We can control the depth of the tree, and this parameter really influences the performance of the model. So we'll create three different decision tree classifiers, changing this max_depth parameter each time. We start with a depth of 3 and get 0.54 accuracy. We try again with another value and get another accuracy score. We try again with 5 and get yet another accuracy score. Now, who remembers the result for a depth of 3? I'm sure whoever was concentrating remembers it now, but if I ran another 1,000 experiments, I assure you no one would remember the results. Maybe one of them was a good result, and we lose a lot of data that way.

So now I'm going to show you how we solved it using our ML tracker platform, which is also hosted on OpenShift. This is its UI, and we are going to create a new project. You can see I just need to specify a project name and a description, and using the nodeSelector feature of Kubernetes I can also control on which nodes the pods and workloads of my experiment will run. We also have a direct connection to the object storage if my data is saved there, but we won't use it right now. Let's create that project; you see an empty project right now. Let's return to Jupyter and type in the newly created project name here. We're doing the same thing as before, running decision tree algorithms with different parameters. You can see that I'm importing our Python package here, reroute, and we'll use it throughout the presentation. We create a tracker and feed it the model object, the accuracy score, and any key-value metrics we want to keep track of. We run the same three experiments as before and move back to the tracker, and we can see three different experiments with their results. In the metrics, you can see that we know how to extract each and every parameter of the model, which helps us later on to understand which model was really the best one. So that's the history part. The third one was optimization.
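A minimal sketch of what those notebook cells might look like with scikit-learn, reusing the train/test split from the earlier sketch. The tracking calls are hypothetical: the package shown in the demo is internal to the IDF, so the commented-out tracker client below is just a stand-in illustrating the pattern of logging the model, score, and parameters per run:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Trying max_depth values by hand: this is the "history problem" in action.
# Without a tracker, nobody remembers which configuration scored what.
for depth in (3, 4, 5):
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"max_depth={depth}: accuracy={acc:.2f}")

    # Hypothetical tracker client, standing in for the internal package
    # shown in the demo: log the model plus key-value metrics per run.
    # tracker = Tracker(project="flu-demo")
    # tracker.log(model=model, metrics={"accuracy": acc, "max_depth": depth})
```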
And now we're going to see how we, as the machine learning platform team, can help data scientists optimize their models really easily, using the smart search optimization algorithms you'll see right now. I'm doing the same thing: I define the basic data science operations I did before, and I define an objective function that actually runs my models. Here I define a search space. Right now I'm searching across two different algorithms, decision trees and KNNs, and each algorithm has its own parameters; we're testing two parameters for each algorithm, and you can see the ranges here. We're actually defining a really large search space for our optimization problem. For example, the max_depth we tested before is now being tested across the range of 3 to 20. We define it and then deploy the optimization task to OpenShift through this platform. A good thing to mention is that we are not just sweeping through all the combinations; we have a smart search algorithm that knows how to do it really, really fast. Pay attention to what we need to specify here: the objective function we defined, which search algorithm I'm going to use, the number of workers (we'll see what that means in a moment), and how many evaluations I want to do, meaning how many models will actually be built. The more the better, but it's also resource consuming. Now let's run it and see what happens behind the scenes in OpenShift. We actually ran it before, so those 100 experiments are already done, but let's do it again and move back to OpenShift. You can see here that a new pod, the manager pod, is created, and after it three different machine learning workers are created, because that's what we specified. Their responsibility is to run models with the different parameter combinations that the manager is responsible for providing. As you can see, we specified 100 evaluations, and in about five seconds, I think, we're finished. If we go back to the UI of our platform, we can see all the results, and we got slightly better results than we got before. It's all been automated using this platform and OpenShift. So this is how we optimize, and make the job easier for the data scientists in our organization.
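The speakers don't name the search library, but the workflow they describe (an objective function, a search space spanning decision trees and KNNs, a smart search algorithm, workers, and a maximum number of evaluations) maps closely onto tools like hyperopt. Here is a minimal local sketch, assuming hyperopt and the train/test split from earlier; the distributed manager/worker execution on OpenShift is handled by their platform and is not shown here:

```python
from hyperopt import fmin, tpe, hp, STATUS_OK
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Search space spanning two algorithms, each with its own parameters.
space = hp.choice("classifier", [
    {
        "type": "decision_tree",
        "max_depth": hp.uniformint("max_depth", 3, 20),  # the range from the demo
        "criterion": hp.choice("criterion", ["gini", "entropy"]),
    },
    {
        "type": "knn",
        "n_neighbors": hp.uniformint("n_neighbors", 2, 30),
        "weights": hp.choice("weights", ["uniform", "distance"]),
    },
])

def objective(params):
    # Build whichever model the sampled parameters describe, then score it.
    if params["type"] == "decision_tree":
        model = DecisionTreeClassifier(max_depth=int(params["max_depth"]),
                                       criterion=params["criterion"])
    else:
        model = KNeighborsClassifier(n_neighbors=int(params["n_neighbors"]),
                                     weights=params["weights"])
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    # hyperopt minimizes, so return the negative accuracy as the loss.
    return {"loss": -acc, "status": STATUS_OK}

# TPE is the "smart search" that beats brute-forcing the grid;
# 100 evaluations matches the number used in the demo.
best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=100)
print(best)
```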
And the last challenge, the deployment challenge. I'm actually not going to show you how we deploy real models, but I am going to show you what is needed in order to deploy one. We only need to specify which framework the model was built in; the path to where it is saved as a file, and we support object storage today; and the name of the preprocessing function that is responsible for converting the data we want to predict into a form the model can understand, using the same basic data science operations we did earlier. After we run it, a scalable deployment is created on OpenShift, three pods that can scale under load. A deployment on OpenShift, just like that; this is really awesome. Will you show it? No, let's go back to the slides.

So let's talk a little bit about our architecture and how we actually run things. We have three main components. The first is the OpenShift control plane, the masters and the infra nodes, running on bare-metal servers. The second and main component of this architecture is the GPU compute nodes. Each compute node has two NVIDIA V100 cards on it, dedicated to all the GPU workloads that the system and its applications need. Actually, we had quite a problem there, because right now each pod can be assigned a GPU, and no other pod can share that GPU with it. When you assign one GPU to one pod, that GPU belongs only to that pod, and in real production workloads you don't utilize 100% of the GPU, so it's quite a problem. That's how the NVIDIA device plugin in use today works. We forked that plugin and wrote our own version of it, so the GPU scheduler actually splits the GPU and time-shares it among the other pods on the system. This is the main component of our GPU setup right now. We are also adding support for a multi-GPU node based on the NVIDIA DGX. When you want to train more complex machine learning models, you need to run them on multiple GPUs, more than one or two, maybe six or eight at a time, sometimes more than ten, and this machine can do that. It means one pod can use six or eight GPUs without sharing them with any other pods. The live demo we are showing right now is actually based on this DGX, and everything is running on GPUs.

So, AutoML. Until now, we saw optimization, smart tooling for how we improve the data scientist's life in our organization, but it still required some knowledge of coding and some knowledge of data science to use. Right now, I'm going to show you how we build machine learning models using only a data set, making this whole process automatic. AutoML is actually one of the most researched topics in many academic institutions today, and its goal is to automate the data scientist's job. We implemented it in our organization, and we're going to show you how we give as input the data set file, the CSV file we just worked with before, and get as output a state-of-the-art model with really good results that is already exposed as a REST API hosted on OpenShift.

So let's do it. We move to the AutoML tab here and upload the file. You can see that we know how to extract the relevant columns from the file. We just need to specify, again, a project name, and then click the column we're interested in predicting, in our case the "got flu" column. What is happening behind the scenes? We are trying a lot of different machine learning pipelines, really complex ones that scale the data and make it more predictive, and after that we're using the same optimization tool we used earlier. But this time we're running really complex models that usually require a lot of optimization and consume a lot of time. If we move back to OpenShift, we can see that right now we have only two workers running this task, but each one of them is really strong: each has eight cores and 16 GB of RAM for this task, because really complex models are being built inside. It may take a while, maybe an hour or even more; it depends on the size of the data and its complexity. But we already prepared the deployment that we ran earlier, so we're going to show the REST API that predicts whether someone will get the flu or not, based only on the relevant columns our model selected automatically. You can see our model just knew to remove the name column, because it's irrelevant. Now we just give it inputs, different values, and see the predicted result. So let's type in some values. Actually, this UI is for testing purposes only.
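The serving endpoint itself isn't shown in detail, so here is a hypothetical sketch of what a client-side prediction request against such a model-serving REST API could look like. The URL, route, payload schema, and field names are all made up for illustration:

```python
import requests

# Hypothetical route to the deployed model on OpenShift; the real URL and
# payload schema were not shown in the talk.
url = "https://flu-model-ml-platform.apps.example.com/predict"

# Only the columns the AutoML pipeline kept are sent; the irrelevant
# 'name' column was dropped automatically during training.
payload = {"fever": 38.9, "age": 34, "cough": True}

resp = requests.post(url, json=payload, timeout=10)
resp.raise_for_status()
print(resp.json())  # e.g. {"got_flu": true}
```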
And real applications hosted on OpenShift also just request this prediction through the REST API; this is how we really make smart applications. OK, let's click on predict and see. This person will get the flu, we think. But let's see how it is shown in our platform. Let's move to the tracker again (the AutoML run is still at 0%; it may take a while) and go to our prepared project. You can see here that we got a new deployment with a green status. We can also choose its URL, because we support, let's say, a kind of A/B testing for models: each project can have multiple deployments, and at any time we can switch which model is actually being served behind the API. This is how our data scientists test their newly built models. Let's move on to the presentation again. Do you want to take a look at OpenShift? Yeah, it's a point worth mentioning. Let's see what's really behind the scenes and what we have in our OpenShift project. We have the scalable deployments of models; we have JupyterHub and a lot of notebooks being spawned by it; we have MinIO, an object storage that helps us save models and keep track of some experiments; we have a PostgreSQL instance that hosts the database of our application; and we have a RabbitMQ cluster, because the manager and the workers of the optimization task communicate through this queue, which keeps the communication between them stable. So this is actually our deployment.

Right now, let's talk a bit about the impact of what we did in our organization. Each machine learning model that was built manually, once it got into our platform, improved on average by about 30%. By improvement, I mean performance, or whatever other metric we evaluate a model by. We increased the number of machine learning models by about 70%, and we've had huge growth in users in the past half year, about 600%.

We still face a lot of challenges today. One of them is remote code debugging. Not every data scientist works in Jupyter; some of them need to work on their local machine, but they actually can't, because they need a GPU. So we're looking for a way for them to work on the local machine while the code runs on a pod or a server that actually has a GPU. We're also looking for a better way to save experiments: as you saw in our demo, we needed to specify what we want to save, which model, which metric, which result, and we're looking for a way to do it automatically. We're also looking to make AutoML available for unstructured data, like images or video, which also includes neural architecture search for deep learning; this is a complex topic. We also don't want to keep maintaining our own fork of the device plugin; we want a more upstream, standardized device plugin that everyone can use to share a GPU among multiple pods, as we explained earlier in the architecture. And we have the challenge of running and managing multiple OpenShift and Kubernetes clusters: managing, operating, and monitoring a lot of them in different locations. So these are our current challenges. Now, to Tushar.

So that is great, right? Did you guys like that? So what I thought I would do here is take a step back and see what's happening. You are already familiar with OpenShift.
I thought we'd start with the OpenShift architecture. You can see your master and your worker nodes here, and your pods running. They are exposed as services, either within the cluster itself or through the routing layer. Around all of this you've got the storage and the registry that OpenShift is abstracting, and it can run on any hybrid cloud infrastructure, be it physical, virtual, private, or public. So you're familiar with all this. Where it helps a data scientist, where it helps machine learning as a service, is that you can confidently build your machine learning pipelines and workflows on top of OpenShift and bring machine learning into production. One of the stats that I recently learned is that 70% to 80% of models don't go into production, right? We think, and Etai and Guy showed this, that OpenShift is a great platform for overcoming that barrier. How do you actually create workflows? How do you use the best of the software development lifecycle that we saw earlier with Macquarie Bank, for example? How do you bring that to machine learning workflows, and how do you bring them into production? That's the first step.

The next step really is some of the stuff that Etai showed: the Jupyter notebooks, for example, that can spawn off multiple workers that use GPUs, et cetera. That's one example, but there was also the example of how the workers and the manager communicate with each other, in that particular case using RabbitMQ, and the interface to object storage where models are stored. So what we are doing, in collaboration internally and with external partners, is a project called Open Data Hub. Open Data Hub is really a reference architecture, and it has two aspects. One is that we have used the reference architecture internally to build machine learning as a service on OpenShift at Red Hat, and we are using it internally to do optimizations for our own data scientists. And then we are open sourcing all of it as a reference architecture. Some examples of its components are shown here. There is, for example, Kafka for streaming and messaging, and there is Spark for streaming and other real-time data processing. There is Jupyter itself, JupyterHub, with pre-built notebook images, and you can add more as you need. And then we have the AI Library, which has optimized frameworks such as TensorFlow, built on the Red Hat stack, which includes things such as UBI, the Universal Base Image. So that's, in a nutshell, what the Open Data Hub is: a reference architecture, all open source, that you can use to create these end-to-end workflows on OpenShift and Kubernetes for building ML as a service or deep learning as a service. That's one aspect. The second aspect really is to help partners. A partner such as Anaconda, for example, for Jupyter, could use this reference architecture to bring their products and services on top of OpenShift using operators and the Operator Framework. Etai showed AutoML; vendors such as H2O, with Driverless AI, can do the same thing. So all in all, we are excited about this.
So we welcome contributions; it's obviously open source. Go to opendatahub.io, send pull requests, et cetera, and give us input and feedback. The other thing that I wanted to quickly highlight before ending is our continued partnership with NVIDIA around this stuff. At Red Hat Summit, we introduced the idea of a program called Accelerated AI, and what it is, really, is an easy button for bringing AI and creating ML as a service in the enterprise data center. Here on the left you can see a stack. There is an x86 server; imagine a bunch of them having GPUs. Then you have the CUDA drivers, the NVIDIA drivers, and other drivers, for example Mellanox InfiniBand drivers if those cards are present, all pre-installed and automated. Then you have the device plugin, and on top of that you see the NVIDIA NGC containers, which are pre-built, CUDA-based containers optimized for GPUs. They have various frameworks and libraries for data scientists (TensorFlow, for example, is one of them), but they also have very industry-specific libraries for machine learning. So what you get with this so-called Accelerated AI program is the NGC containers on OpenShift, on x86 servers with GPUs, fully supported end-to-end by Red Hat, by NVIDIA, and by the OEM. That's the program we announced, and we are very excited about it; there is a lot of buy-in at both companies, as well as at the OEM level. You can find out more in these resources; there's a blog on it that we collaborated on with NVIDIA. And we are inviting customers who are interested in participating in this early access program to sign up. So anyway, that's where I wanted to conclude. Thank you to Etai and Guy for coming over and talking. We are very excited about what we're going to do with AI/ML on OpenShift. We'll take questions offstage, I suppose, if you have any. Thank you. Thank you.