Hi everyone, welcome to this presentation on how to build data science pipelines with OpenShift using Ceph, Kafka and Knative. My name is Guillaume Moutier, and don't worry, I don't usually carry my sword with me. I am a technical evangelist in the Data Services Business Unit at Red Hat. I'm formerly of Laval University in Quebec City, Canada, and I now have more than 20 years of experience in various roles in the industry. I'm now focusing my work on data science platforms, that is, everything that relates to data all throughout its life cycle, from its gathering or ingestion through processing and on to its archiving. But enough about me, let's talk about data. The most important thing that I want you to take away from this presentation is to embrace the cloud-native way of doing things. I'm not saying here that you must run everything in the cloud, no, not at all, but the architectural patterns that have emerged from cloud services can apply anywhere, including on-prem, and this is exactly what helps a lot with data pipelines. But first things first, how do we define this cloud-native approach? Here's my totally opinionated shortlist of characteristics that we must aim for in a cloud-native data platform. First is agility and elasticity: you know, tools and frameworks and datasets evolve constantly and very rapidly, so you must be able to adapt your infrastructure accordingly. Then cloud standards: it's important to avoid any vendor lock-in with proprietary tools and formats, and we must embrace widely recognized open source protocols and standards. Hybrid cloud architecture: what you are designing in terms of architecture must run anywhere without any change, or maybe some small configs that you can adapt, but not the architecture itself. Then automation: you must embrace the DevOps philosophy, everything must be automated and code-based.
And finally, separate compute from storage, so that you can take advantage of a rich computing ecosystem against central storage. To sum it up, it's all about agility, standards and this ability to run everywhere. And all of that will allow us to reach our business goals, which are speed, efficiency and adaptability. And I will say this last one again, as it seems to me to be the most important one: adaptability. Now let's take a look at the standard way of doing things. In this diagram, I illustrated a standard data interaction where a user produces a file that is later on consumed by an application. What I don't like about this is the coupling that you have at different points. You know, the user has to mount, let's say, a P drive on the computer, and it's the same thing on the server side: the application relies on a very specific configuration of the application server, therefore making this difficult to scale, especially on demand. So that's why I really prefer this approach, using, for example, object storage, where all interactions are done in a disconnected mode. It's purely a put or a get command from whatever location where you have network connectivity. So this is definitely more agile. And then we can also use the bucket notification feature that is available with Ceph object storage. What it allows us to do in this example is to send a message to a Kafka topic whenever a file is uploaded or deleted. This topic can then be consumed by an OpenShift Serverless service that can scale from zero to whatever is needed to process the file. This is what I call a cloud-native architecture pattern. And now let's go for a demo where we'll see this architecture pattern applied to an automated X-ray analysis pipeline. The use case in this demo is about pneumonia detection from chest X-rays using an automated data pipeline. So imagine the problem is this one.
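To make the pattern concrete, here is a minimal Python sketch of what a consumer of that Kafka topic might do with a Ceph/S3-style bucket notification. The payload layout follows the AWS S3 event format that Ceph RGW notifications also use, but the exact fields and the `handle_notification` helper are my assumptions, not code from the demo repo.

```python
import json

def handle_notification(raw_message: bytes) -> list[tuple[str, str]]:
    """Extract (bucket, key) pairs from an S3-style bucket notification.

    Only object-creation events (uploads) are kept, since deletions
    don't need a risk assessment.
    """
    event = json.loads(raw_message)
    uploads = []
    for record in event.get("Records", []):
        if record.get("eventName", "").startswith("ObjectCreated"):
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            uploads.append((bucket, key))
    return uploads

# Example notification, as it might land on the Kafka topic
# (bucket and object names here are made up for illustration):
sample = json.dumps({
    "Records": [{
        "eventName": "ObjectCreated:Put",
        "s3": {"bucket": {"name": "xray-incoming"},
               "object": {"key": "patient-0042.png"}},
    }]
}).encode()

print(handle_notification(sample))  # [('xray-incoming', 'patient-0042.png')]
```

The serverless service would then fetch the object named in each pair and run the inference on it.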
We have some X-ray images to review, some from people who have pneumonia, some from people with normal chest X-rays, and we want to automate this process. So of course, we think that an AI/ML model can help. And we can use tools that are provided by Open Data Hub, for example, with Jupyter notebooks and TensorFlow. And we can train a model to be able to do some inferencing on those images and determine if the new images that we want to process are from people who have pneumonia or not. So we have this model, but it has to scale, so we have to automate it. So now the question is, how can we analyze those images as they come in, as a continuous flow of thousands of images? And if we want to retrain the model and redeploy it seamlessly, at various locations simultaneously, how can we efficiently do that? And again, my answer is to use cloud-native architecture and patterns, with bucket notifications with OpenShift Container Storage, Kafka topics with AMQ Streams, and Knative eventing and serving with OpenShift Serverless. So here is our demo environment. Let's say we are at a hospital and we are generating new X-ray images. What we will do is send all those images into a bucket, a Ceph bucket that has been instantiated by OpenShift Container Storage. And this bucket has been configured to send notifications whenever a new image comes in. Those notifications will be sent to a Kafka topic that is linked to a Knative eventing and serving function. And the container that is spawned when a new message comes in will do a risk assessment on this new image, you know, basically using the model that we have trained to try to infer if there is a risk of pneumonia or not. In a standard production scenario, all results would be sent to a doctor. But here I have a special step, because of course no model is totally perfect. There is a certain degree of uncertainty. And this is what I'm doing here.
When the model is not able to reach a certainty above 80%, what the process will do is anonymize the image so that it can be further processed in a central data science lab, for example. And normally, again, in a standard production environment, you would have a doctor, a specialist, doing a manual assessment and the classification of this image for which the model was not able to infer the result. It would be classified as risk of pneumonia or as normal. And this would trigger a retraining of the model, which could be re-injected back into our origin hospital through a standard OpenShift CI/CD pipeline. This second part here, with the model retraining and everything, we won't see it in the demo. It's not implemented, because training a model like this takes a certain time. But I have a way to simulate a new model being used to do those inferences. And to make the link with the scenario that we described before, we can imagine that in multiple hospitals, the same model is also being used to do some inferencing on images. And again, images for which the model is not so sure about the result would be anonymized and sent for further processing here. Let's see that live. Let me walk you through the environment I have prepared. Here is my OpenShift cluster where I have a few things that I have created for this demo. First, there is here this deployment config of what I call the image generator. This is a container that will, well, you know, in fact it won't really generate X-ray images. It will just randomly pick some source X-ray images that I have in a bucket and copy them to an incoming bucket. We can see here that the image generator is deployed. There is one pod running. That's the blue circle here indicating that this deployment configuration is up and running. But at this moment, it doesn't do anything. I have a parameter here that is set to zero for now, which makes it sit idle, not sending any image.
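The 80% decision step described above can be sketched like this, assuming the model outputs a pneumonia probability between 0 and 1; the function name, the labels and the exact interpretation of "certainty" are my assumptions, not the demo's code.

```python
UNSURE_THRESHOLD = 0.80  # below this certainty, the image goes to the central lab

def assess(pneumonia_probability: float) -> str:
    """Turn a model probability into a routing decision.

    Certainty is how strongly the model leans either way, so an
    output near 0.5 is the least certain.
    """
    certainty = max(pneumonia_probability, 1.0 - pneumonia_probability)
    if certainty < UNSURE_THRESHOLD:
        return "unsure"       # anonymize and send for manual review
    if pneumonia_probability >= 0.5:
        return "pneumonia"    # flag as risk of pneumonia
    return "normal"

print(assess(1.00))  # pneumonia
print(assess(0.65))  # unsure (certainty 0.65 < 0.80)
print(assess(0.05))  # normal (certainty 0.95)
```

Only the "unsure" branch triggers the anonymization function; the other two results go straight back to the doctor.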
I have here a Kafka source that is called xray-images. So what this container is doing is just listening to a Kafka topic and waiting for some messages to come in. And we can see here that this Kafka source is linked to this serverless service, this Knative service, you can see the logo here, that is called risk-assessment. So this is a fully serverless deployment, meaning that right now, as there is nothing to process, it's also just sitting idle. We can see here there is no blue circle around the container, meaning that it is scaled down to zero. There is no instance of the risk assessment container running. I have a few other things that are deployed. First is my Kafka cluster, deployed through the AMQ Streams operator. Very basic here, only one instance each of Kafka and ZooKeeper. It's a totally ephemeral Kafka cluster. Please don't do this at home; you don't want to run with only one instance of each. But for resource purposes here, it's enough for what we want to do. So all the notifications from the Ceph bucket that will receive the images will be sent to a topic in this Kafka cluster. And it is to this cluster that our Kafka source subscriber is connected, listening to the specific topic we want. We have also a deployment of Grafana with its own operator. That's a dashboard that will allow us to see what's going on. And I have also a few helpers here. I have a small database, a very basic MariaDB database, where I will record the names and timestamps of the images as they're coming in, being processed or being anonymized. And this is what we will also display in the Grafana dashboard. And finally, I have a small image server. As you will see on the dashboard, we will display directly the images as they come in. So here is everything that I have deployed. And we are now ready to launch the image generation. What I will do now is launch the demo.
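The bookkeeping done by that small database can be sketched as follows. I'm using the stdlib sqlite3 module as a stand-in for the demo's MariaDB, and the table and column names are my invention, not the demo's schema.

```python
import sqlite3
from datetime import datetime, timezone

# In-memory stand-in for the demo's MariaDB instance.
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE image_events (
        image_name TEXT NOT NULL,
        event      TEXT NOT NULL,   -- 'uploaded', 'processed' or 'anonymized'
        at         TEXT NOT NULL    -- ISO-8601 timestamp
    )
""")

def record(image_name: str, event: str) -> None:
    """Record one step of an image's journey through the pipeline."""
    db.execute(
        "INSERT INTO image_events VALUES (?, ?, ?)",
        (image_name, event, datetime.now(timezone.utc).isoformat()),
    )
    db.commit()

record("patient-0042.png", "uploaded")
record("patient-0042.png", "processed")

# The kind of query a Grafana panel would run: the last 10 processed images.
rows = db.execute(
    "SELECT image_name FROM image_events "
    "WHERE event = 'processed' ORDER BY at DESC LIMIT 10"
).fetchall()
print(rows)  # [('patient-0042.png',)]
```

Each stage of the pipeline inserts one row, which is what lets the dashboard show the last 10 uploaded, processed and anonymized images separately.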
And to do that, I will patch the image generator. If you remember, the value was set to zero to idle it. I will now make it one, for one second. That means that a new image will be generated, meaning copied into our incoming bucket, every second. Let's launch that. So I've launched the command and the image generator has been patched with this new version. We can see here that it has already deployed. It went very fast. It has deployed the new version, and now it will begin to copy a new image every second. And we can already see that something is happening here. You have noticed here that my risk assessment pod has now been spawned. There is something happening. So that means that we have our workflow going. The image generator is putting new images inside our Ceph bucket, which triggers a notification to our Kafka topic. Here we have our Kafka listener that retrieves the message and passes it to the risk assessment pod for the image to be processed. Let's have a look at what it looks like now. So here is the Grafana dashboard that represents in real time what's happening in our pipeline. On the top left here, we have a summarized schema of the pipeline. So we can see the images being sent to an incoming bucket here. And we have the counter of the number of images that have been uploaded so far. Then, as notifications are sent to a Kafka topic and the risk assessment container has been launched, we have the number of images processed. And again, if the certainty of our model is less than 80%, then we will have another function that will anonymize those images. Okay, so we can see that the pipeline is running. On the right side, we have the list of the last 10 uploaded images. Okay, and don't worry, those are totally randomly generated names, birth dates and other personal information. Those are not real patients. We have also, again, the list of the last 10 uploaded images, then the last 10 processed images.
And we'll see in a few seconds what is happening to those images, and then the last 10 anonymized images. We have some counters on the left side, the CPU and RAM usage, which you can see have increased because now we have some processing to do. We have the number of risk assessment containers that have been launched so far to be able to handle the load. Again, this is something that is automatically scaled by OpenShift Serverless. And then we have here a risk distribution. So far, within all the images that have been uploaded, we have the distribution between the ones that have been assessed as normal, at risk of pneumonia, or unsure. Okay, we have here in this small graph the number of images that have been processed by model version, and we'll see in a few seconds what happens when we change the model. And we have here a counter of the number of deployments of the risk assessment pods. Okay, while I explain to you what is happening to the images, I will do two things. First is to increase the rate at which the images are sent. So far, it's only one per second. And I will also change here a parameter simulating that we now have a model V2 that will be used to do this processing. So I will do the first patch here and then the second patch. So while my containers are being updated to reflect those changes, let's have a look at our images. And here I have another special dashboard with a bigger version of the displayed images. And maybe I will wait for another one to refresh so that we can see better. It's refreshing every five seconds. Okay, let's stop here. So what happens is this: this is a base image, the image that I have prepared beforehand. I have about 800 of those images, which are chest X-rays, with some personal information that I've printed on them. These are, as I said, randomly generated information.
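Since the pipeline anonymizes an image whenever the model was unsure, here is a sketch of that step applied to a result record. The demo actually blurs the personal information printed on the image pixels; redacting metadata fields stands in for that blur here, and all the names and fields are my assumptions.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class XrayResult:
    image_name: str
    patient_name: str
    birth_date: str
    assessment: str     # 'pneumonia', 'normal' or 'unsure'
    certainty: float
    model_version: str  # feeds the per-model-version counter on the dashboard

def anonymize_if_unsure(result: XrayResult) -> XrayResult:
    """Strip personal fields when the model was unsure.

    Stand-in for the demo's pixel blur: only unsure images lose their
    personal information before being sent to the central lab.
    """
    if result.assessment != "unsure":
        return result
    return replace(result, patient_name="REDACTED", birth_date="REDACTED")

r = XrayResult("xray-0001.png", "Jane Doe", "1980-01-01", "unsure", 0.65, "v1")
print(anonymize_if_unsure(r).patient_name)  # REDACTED
```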
When a risk assessment is made by the model, what my processing container does is write on top of the image the assessment that has been made, here a risk of pneumonia, with the certainty at which the model made this assessment. So, risk 100%: the model is pretty sure that there is a risk of pneumonia for this specific X-ray image. But when the model is not sure and the certainty is less than 80%, we also do what you can see here: the personal information that was on this specific X-ray has been blurred. That's kind of a simulation of what you would do when you want to anonymize images. Let's go back to our main dashboard and see what happened. Lots of different things. First, the usage of CPU and RAM has further increased because, if you remember, I increased a lot the rate at which the images are sent. Now it's 10 per second. Okay, so those counters here are growing much faster. And we can see that OpenShift Serverless has done its magic and automatically scaled the number of containers, the number of pods it needs to be able to handle the load. Of course, we can see here that many more images have been assessed. And at the same time, we can see here that I am now using the V2 model to make the risk assessment. So here, with this model change, I am simulating that, following image anonymization and manual classification in the central data science lab, a model has been retrained and pushed back here, to our hospital, so that it can be used from now on. I hope you've enjoyed this presentation and the demo. If you have any questions, please feel free to reach out. You can also find the code used for the demo in my GitHub repo. Don't forget to check out our websites and YouTube channel to learn more about Data Science on OpenShift. Thank you.