Hello and welcome. My name is Kyle Bader, I'm the data foundation architect at Red Hat. Joining me today is Ryan Loney, OpenVINO product manager at Intel. Today we'll be speaking about scalable natural language processing using BERT, OpenVINO, AIKit, and Open Data Hub. To set the stage, I'll start by talking about Open Data Hub. Open Data Hub is a community project that was started out of Red Hat and aims to take a variety of open source components and package them into an end-to-end AI and machine learning platform for OpenShift. The types of components it draws from are Apache Spark, Kubeflow, TensorFlow, JupyterHub, and a variety of others, and we make them easily available to users of either OKD, the open source distribution of Kubernetes, or the commercial offering from Red Hat. In either case, you can go to the Community Operators section of OperatorHub, install Open Data Hub, and get going with your data science and data engineering problems.

To give an idea of the problem domain that Open Data Hub seeks to address, I present this workflow, which is more on the industrial, scalable side of machine learning. It starts with your existing data. Oftentimes, before you get to the part where you start doing data science, you'll need to process and refine the data into a state where a data scientist can begin an iterative loop of feature engineering and model experimentation to figure out a model that can be applied to solve some sort of business problem. When you're doing scalable machine learning, you'll often want to take the extracted features, flatten them, and store them in some sort of scalable storage system, like an S3-compatible object storage solution such as Ceph. Then, ultimately, you'll want to train and validate models using a machine learning framework like PyTorch or TensorFlow. The resulting model you'll want to preserve in a model repository. From there, you can do an iterative loop of model optimization to create intermediate representations of the model that can be used for various hardware targets, or that can potentially eliminate layers and make trade-offs between accuracy and throughput that may be appropriate for different situations. After that, it's time to push the model into production, in which case the model gets loaded into some sort of model serving engine. That model serving engine will be integrated with the application that's driving the business value, whether it's a recommendation service or image detection; that's ultimately when the model is in production and providing business value. But it doesn't stop there, because the data the model is acting on can change over time. You need to have monitoring in place so that you can see whether there's drift between the data the model is seeing in production and the data the model was trained on. If you do detect drift, that's when you go through another iterative loop of retraining on fresher data and then repeating the cycle. So, wrapping up on Open Data Hub: it's really about creating a blueprint and giving you an opinionated set of patterns and tools to streamline data science and machine learning on Kubernetes platforms like OpenShift.
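To make the feature-storage step of that workflow concrete, here is a minimal sketch, not part of Open Data Hub itself: it assumes a Ceph RADOS Gateway (or any other S3-compatible) endpoint, placeholder credentials, and a hypothetical features.parquet file, and uses boto3 to upload the flattened features so a later training job can read them back.

```python
import boto3

# Hypothetical Ceph RADOS Gateway endpoint and credentials -- substitute your own.
s3 = boto3.client(
    "s3",
    endpoint_url="https://ceph-rgw.example.com",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

# Upload a flattened feature set so a training job can consume it later.
s3.upload_file("features.parquet", "ml-features", "bert/v1/features.parquet")
```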
And recently, we've been doing some work with Intel to add some additional tooling to help with the lifecycle I walked through on the previous slide, specifically by integrating the open source OpenVINO technologies, which provide a variety of optimizations for models and also include a model server for when it's time to put the models into production. But I'll let Ryan go ahead and speak more on OpenVINO and the integration work we've been doing recently.

Thanks, Kyle. I'll quickly walk through some of the deep learning software from Intel. On the left, we have deep learning frameworks like TensorFlow and PyTorch, where we upstream some of our oneAPI optimizations to get better performance with training and inference using the frameworks directly. We also have the AI and Analytics Toolkit, which bundles together some of these third-party frameworks and tools for data science and classic machine learning, like scikit-learn, pandas, and Modin. And on the right, we have the OpenVINO toolkit, an open source toolkit that includes tools for optimizing and preparing models for deployment. So we have tools for quantization-aware training, post-training quantization, model serving, annotation, and more.

This is a look at OpenVINO in a nutshell. On the left, we have deep learning frameworks, like I said, TensorFlow and PyTorch. You take these fully trained models and optimize them with the OpenVINO tools, which are now available through Open Data Hub using the OpenVINO operator. Once you've optimized the models, you can deploy them on Intel hardware, whether it's an Intel integrated GPU at the edge or a Xeon Scalable processor in the cloud, and you can deploy on Windows, Linux, and Mac. Today, we're going to show how to deploy with Linux containers and Kubernetes. We're proud of our ecosystem adoption. You can see some of the partners who've integrated OpenVINO into their solutions, and we are always trying to grow this list. So if you're an AI developer, whether open source or enterprise, we would love for you to integrate OpenVINO and leverage it for your deep learning solution.

Kyle mentioned the workflow in Open Data Hub, and here I'm showing where OpenVINO fits in. We have an integration that plugs directly into the Jupyter Spawner in Open Data Hub, and we'll show a preview of that in a minute. We also have the Kubernetes operator for deploying and creating inference endpoints, so you can serve your models and serve predictions in a Kubernetes cluster. This is a high-level view of the Jupyter Spawner in Open Data Hub: once you've installed the OpenVINO operator and created a notebook resource, there's just a button to select the OpenVINO toolkit, choose the size of your Jupyter container, and click start, and you'll have access to a set of tutorials that show how to perform post-training quantization and quantization-aware training for different use cases, and how to take models from deep learning frameworks and convert them to the optimized OpenVINO intermediate representation, or OpenVINO IR for short. Here you can see an NLP example; we're actually going to show how to deploy a natural language processing model in a Kubernetes cluster at the end. OpenVINO also includes the Open Model Zoo, a collection of 220 pre-trained deep learning models, some provided by Intel and others that come from the open source community.
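As a rough sketch of what using a converted model looks like, the snippet below loads an OpenVINO IR and runs a single inference on CPU with the OpenVINO Python API. The model.xml path, dtype, and dummy input are placeholders, and the openvino.runtime namespace assumes a 2022.1-or-later release of the toolkit.

```python
import numpy as np
from openvino.runtime import Core  # 2022.1+ API

core = Core()

# Load an OpenVINO IR produced by the Model Optimizer (paths are placeholders;
# the weights are picked up automatically from the matching model.bin file).
model = core.read_model("model.xml")
compiled = core.compile_model(model, "CPU")  # "GPU" or "AUTO" on supported hardware

# Run one inference request on dummy input data shaped like the model's first input.
request = compiled.create_infer_request()
dummy = np.zeros(list(compiled.input(0).shape), dtype=np.float32)
results = request.infer({0: dummy})
print(next(iter(results.values())).shape)
```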
What this does is enable you to quickly get started and download models for object detection, image classification, automatic speech recognition, segmentation, natural language processing, you name it; there's a wide variety of use cases. All you have to do is launch a Jupyter notebook inside Open Data Hub, and you can quickly use our command line tools to download a model and start using it directly in Python in the notebook. We have a tutorial that shows how to use these tools, so if you install Open Data Hub and OpenVINO together, you can quickly dive in and start downloading and trying out some of these models.

As I mentioned, deploying and creating inference endpoints in Kubernetes environments requires a containerized microservice. OpenVINO Model Server provides a container that exposes gRPC and REST endpoints where you can send prediction requests from your application, whether it's written in Python, C++, Golang, you name it. Over the REST API or gRPC API, you send your input images or your text as input and get the results back to the client application. This is great for scaling deployments with Kubernetes: here you can see we have multiple pods with Model Server, and we're load balancing the requests coming in from our application. This is what we'll show a demo of at the end.

This is a high-level view of the OpenVINO Model Server architecture. As I said, it exposes gRPC and REST API endpoints. Additionally, it's doing configuration monitoring, which means checking which models need to be served; if there are additional models you've added that you want to serve in production, that configuration management and monitoring happens under the hood in OpenVINO Model Server. For model management, we have the concept of a model repository, which I'll talk about on the next slide. Basically, when you're ready to roll out a new version of a model, let's say you have some new data, you've improved the accuracy, and you want to roll it out in production, you can do that automatically without interrupting the service or having any downtime for your application. OpenVINO Model Server is built on the plugin architecture from OpenVINO, so it's also easy to switch between CPU, GPU, and other Intel devices, or combine them together for increased throughput.

So, model management: we have the concept of a model repository. Think of it like a GitHub repo for your deep learning models. It can live in Google Cloud Storage, in S3-compatible storage, or in a persistent volume in Kubernetes or OpenShift, and you just create a directory structure that has your models, the versions, and then the binaries, the graph files of those models, stored in those sub-directories. By default, the highest version sub-directory is served, and every time you add a new version, it is loaded automatically. The previous version of the model is not removed until the new one starts serving predictions, so there's no interruption in service; you can hot-swap the models, so to speak.

Now the fun part: we're going to show a demo. Dariusz Trawinski, the technical lead from the model server team, is going to show us a demo taking a pre-trained and quantized model, quantized to INT8 (8-bit integer) precision. This is a BERT natural language processing model, available for download from the Open Model Zoo. First he's going to download that model, and then he's going to deploy it using OpenVINO Model Server in a Kubernetes environment.
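Because OpenVINO Model Server implements the same prediction API as TensorFlow Serving, a client can be as small as one HTTP request. The following is only a sketch using Python's requests library against the REST endpoint; the host, port, model name, and input shape are placeholders that depend on how the server is actually deployed.

```python
import numpy as np
import requests

# Placeholder host, REST port, and model name -- use whatever your deployment exposes.
url = "http://ovms.example.com:8000/v1/models/my_model:predict"

# TensorFlow Serving "row" format: one list entry per input instance.
dummy_input = np.random.rand(1, 3, 224, 224).astype(np.float32)
payload = {"instances": dummy_input.tolist()}

response = requests.post(url, json=payload, timeout=30)
response.raise_for_status()
predictions = response.json()["predictions"]
print(len(predictions), "prediction(s) returned")
```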
Once we have the model deployed and we're serving predictions, we can use a lightweight client application to ask questions. Dariusz will define a corpus as the source, which is going to be a Wikipedia page, and then we can send queries over the gRPC API, asking questions like "what is BERT?" and getting a response back. So Dariusz, why don't you show us the demo?

Hello, my name is Dariusz Trawinski. I'm a software engineer at Intel, and I will walk you through an OpenVINO Model Server demo in OpenShift. I will show you how the model server and inference service can be deployed using an OpenShift operator, and how the cluster can be used to scale the inference execution. During the presentation I will use the BERT model from the Open Model Zoo, which performs question answering. It's a model quantized to INT8 precision and trained on the SQuAD 1.1 dataset.

Now let's go to the OpenShift console and see how to deploy the model server using the operator. The operator is installed, so I can create a new instance of the model server just by using the operator's graphical user interface. I just click "create model server", and that brings up a fully functional template with a ResNet model hosted on Google Cloud Storage. The parameters define various aspects of the model server, like the location of the model repository, the model name, the model configuration, and also performance tuning options. The model server can also be deployed using the OpenShift command line by creating a ModelServer resource with a defined configuration. Here I have prepared the configuration of the model server with the BERT model. The model is stored on S3-compatible storage. Let's check the data structure in the model repository; here are the model files. Now I will apply this configuration using the OpenShift command line tool and deploy the model server.

Let's check the results in the console. The model server is already initialized and deployed. Let's check the created resources; here we can see the operator created the pod. This model server is deployed in an OpenShift Service Mesh environment, which gives more capabilities for controlling and monitoring the traffic. It adds a sidecar proxy container to each pod, which load balances the calls to the service. That is the reason why the model server pod is reporting two containers. I can check the server logs in the console: the pod is in the ready state, so the logs should confirm that the model is loaded and the service has started. Yes, the logs confirm everything works as expected, so the model is loaded.

Now I will use the service created by the operator. This is the service; I will use it to run the predictions and answer the questions. The service is reachable inside the cluster, so I will expose it using the ingress component from the mesh. I will do that by adding the mesh VirtualService and Gateway resources. This is the configuration of the gateway and virtual service. In this cluster, the ingress controller is exposed using a node port, so I will be connecting to this port on any of the nodes. Now I'm ready to connect to the service; I'll switch to another terminal. First, I will use a gRPC client to query the model parameters. This client is from the model server GitHub repository, and I'm connecting to the node using the exposed node port. Okay, in the response I can see the information about the model inputs and model outputs.
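That metadata query can be reproduced in a few lines of Python. This is a sketch using the ovmsclient package from the model server repository; the node address, node port, and model name are placeholders for whatever the mesh gateway actually exposes, and the exact layout of the returned dictionary may differ between ovmsclient releases.

```python
from ovmsclient import make_grpc_client

# Placeholder address: a cluster node plus the node port exposed by the mesh gateway.
client = make_grpc_client("worker-node.example.com:30111")

# Ask the server what inputs and outputs the served model expects.
metadata = client.get_model_metadata(model_name="bert")
print("version:", metadata["model_version"])
for name, spec in metadata["inputs"].items():
    print("input: ", name, spec)
for name, spec in metadata["outputs"].items():
    print("output:", name, spec)
```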
Now I will use the BERT client written in Python. Its code is also in the GitHub repository of the model server, in the example clients, so you can learn more about it there. This client takes as arguments the URL of a web page with the knowledge source and the question to be asked. Here I'm connecting to the gRPC endpoint, here are the parameters related to the model input names and the vocabulary file, and here are the URL of the wiki page and the question. The script splits the whole content of the web page and asks the model server for answers for each part; it then gives the three most likely answers. I can also start this client in a loop, so it will keep sending the same question.

Let's check what the mesh monitoring can detect. Kiali and Grafana are the monitoring tools installed together with OpenShift Service Mesh. We can see the throughput results here, but we are not fully taking advantage of the cluster scalability: right now each request is sequential, so there is only one inference execution at a time. To show you the full advantage of the scalability, I will use another client which sends asynchronous gRPC calls to the service. It's written in C++ and it's not doing any printing or post-processing, to simplify the load generation. This client is documented in the GitHub repository in the cpp folder. I will stop this one. I have already prepared a Docker image with this client, so I will start a job with the client which connects to the model server and asks 10 million questions. I will use the OpenShift command line to deploy this job. Let's check the results in the console. Okay, the client is starting. The client has started and it's generating the load.

To improve the throughput in the model server, I will tune the CPU plugin config for automatic configuration of the OpenVINO execution streams. Here is the plugin config. That will swap the pod serving the model with the new configuration, but it will all happen without any interruption for the client. The reports show the utilization with some delay, so we will see the change in a moment. Besides the Kiali statistics, I can also open the Grafana dashboard to monitor the traffic. Okay, we can see the throughput has now increased and stabilized.

But let's imagine that a single node cannot deliver sufficient capacity for our needs; we might have hundreds or thousands of clients connecting to the model. With an OpenShift cluster we can easily scale the capacity by adding more replicas and nodes. I'm going to do this now. I will edit the model server configuration and add another replica. Okay, now two replicas are operational, so we should see increased throughput from the service. The calls from the clients are now distributed between two replicas and two nodes. With the default Kubernetes load balancing, which operates on the third OSI network layer, gRPC calls are connection-preserving; that means each request from the client would be routed to the same replica. With load balancing on the OSI application layer, which is the case for the mesh, each call from the gRPC client is dispatched separately, so it can utilize several replicas and nodes. In a moment we should see increased throughput from the service, in Kiali and also in Grafana. It's already increased. Let's increase the capacity even more by adding the third replica and see what happens. I will repeat similar steps. So we now have three operational replicas.
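The asynchronous C++ client is what actually saturates the replicas, but the idea can be sketched in Python as well: keep many independent prediction requests in flight instead of waiting for each answer. The snippet below is only an illustration, reusing the same placeholder REST endpoint as earlier; the input tensor names and sequence length are assumptions and should be checked against the model metadata.

```python
import concurrent.futures
import requests

# Placeholder endpoint and input names -- check the model metadata for the real ones.
URL = "http://ovms.example.com:8000/v1/models/bert:predict"
PAYLOAD = {"instances": [{
    "input_ids": [0] * 384,       # dummy token IDs
    "attention_mask": [1] * 384,  # input names/length are assumptions for illustration
    "token_type_ids": [0] * 384,
}]}

def ask_once(_):
    # Each worker sends an independent request, so a layer-7 load balancer
    # (like the mesh sidecar) can spread them across all model server replicas.
    r = requests.post(URL, json=PAYLOAD, timeout=30)
    r.raise_for_status()
    return r.elapsed.total_seconds()

# Keep many requests in flight at once instead of sending them sequentially.
with concurrent.futures.ThreadPoolExecutor(max_workers=16) as pool:
    latencies = list(pool.map(ask_once, range(1000)))

print(f"mean latency: {sum(latencies) / len(latencies):.3f}s")
```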
And again, we can check how that will impact the throughput results. The throughput should increase in Grafana and in Kiali in a moment. To summarize what was presented here: I explained how the model server can be deployed in OpenShift using the operator, I demonstrated how to use a gRPC client to run a query against the BERT model, and finally, you saw how the inference service can be scaled horizontally when adding more resources on a single node is not sufficient. Now we can see the impact of the third replica: the throughput has increased again and the response time is reduced. In a moment the same results will be visible here in Grafana. That concludes the demonstration. So back to you, Ryan.