Morning, everybody. Thank you so much to our previous speakers. We have a pre-recorded session starting now called Lift Wing, WMF's machine learning model serving infrastructure, so we're going to play the pre-record right now.

Hello and welcome to this session. Today we're going to be talking about Lift Wing, WMF's machine learning model serving infrastructure. This presentation is brought to you by the Machine Learning team at the WMF.

Let's take a look at what we're going to be discussing today. First, we'll go through a brief introduction. Then we'll cover what a machine learning model is and what it means to take a model into production. After that, we'll talk about ORES and the use of machine learning at the Wikimedia Foundation until now. Then we'll take a deep dive into Lift Wing, the platform that is the main subject of today, and how a model gets served through this infrastructure. Finally, we'll mention some future work planned around Lift Wing.

This is the Foundation's Machine Learning team. It consists of Chris, who is the manager and director of machine learning at the Foundation; Luca and Tobias, who are our site reliability engineers; and me, Elias, Aiko, and Kevin, who are the machine learning engineers on the team.

We usually talk about machine learning models, so let's look at what a machine learning model actually is. When we talk about deploying, we usually mean a trained model, and a model is usually built to solve a specific use case. Let's take an example: given an article revision, we want to predict how likely it is to be an act of vandalism. A model that solves this task is basically a function that takes the article revision as input and outputs the probability that the revision is vandalism. When we talk about the trained model, we are referring to this function.

What we want to do with the trained model is save it to disk in a way that it can be reused. Storing a model means storing its weights, biases, and parameters so we can reuse them in our infrastructure; a small sketch of this serialization step is shown at the end of this section. When we talk about model deployment, or taking a model into production, we mean taking this serialized version of the model from a development environment, which could be a Jupyter notebook, for example, as we can see here on the left, and moving it to an infrastructure that is more production-ready, for example a Kubernetes cluster. This new environment is reliable and scalable, it has monitoring and alerting, and on top of that we may wrap the model in an API that allows others to use it.

Machine learning has a long history in Wiki projects, and the most famous example is ORES. ORES is a scoring platform that has been around for many years. Its most famous usage is on the Recent Changes page of Wikipedia, where it helps patrollers fight vandalism by detecting damaging edits so they can be removed. Every revision is scored by ORES, and by using the filters, patrollers can select specific revisions. As we mentioned, ORES has been around for many years and is used widely in Wiki projects. It focuses on the areas of edit quality and article quality, it has a custom model server written in Python, and it is exposed via its API at ores.wikimedia.org.
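To make that serialization step concrete, here is a minimal sketch in Python. The classifier, the toy features, and the file name are illustrative assumptions, not the Foundation's actual training code.

```python
# Minimal sketch of training and serializing a model (illustrative only:
# the classifier, the toy features, and the file name are assumptions,
# not the Foundation's actual training code).
from sklearn.linear_model import LogisticRegression
import joblib

# Toy features for a few article revisions and labels (1 = vandalism).
X = [[0.1, 3, 120], [0.9, 47, 2], [0.2, 5, 300], [0.8, 60, 1]]
y = [0, 1, 0, 1]

model = LogisticRegression().fit(X, y)

# Persist the trained function (its weights and parameters) to disk so the
# serving infrastructure can load and reuse it later.
joblib.dump(model, "revision_model.joblib")

# Later, in the serving environment, the same function is restored:
restored = joblib.load("revision_model.joblib")
print(restored.predict_proba([[0.7, 50, 4]])[:, 1])  # probability of vandalism
```

The serialized file is what moves from the development environment into the production infrastructure.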
On the other hand, Lift Wing, which is what we are introducing, is a platform built for generic usage. It is based on Kubernetes, and it has a standard API that can serve multiple model server implementations. The community can propose models that will be wrapped in an API, and users can also replicate and try out the model servers locally.

Now, let's take a look at a high-level overview of Lift Wing. Lift Wing consists of a Kubernetes cluster where KServe is deployed. KServe is a standard model inference platform on Kubernetes built for highly scalable use cases. It is an open source project; it used to be part of the Kubeflow ecosystem, but it has now graduated and is a standalone open source project. Along with the cluster, we have the model cards. Model cards are model documentation and an essential part of our effort to apply ethical machine learning, as they introduce transparency and community model governance. Finally, Lift Wing connects to WMF's API gateway, which is how the model servers are exposed to the outside world.

All right. Thanks, Elias. We are going to be delving into the Lift Wing infrastructure. As Elias mentioned, Lift Wing is the Foundation's new machine learning model-serving infrastructure. It is a multi-year effort to modernize Wikimedia's machine learning systems and processes to enable rapid, self-service deployment of a wide variety of models. Next slide, please.

As Elias mentioned, we use model cards, and these are basically documentation. Once a user has created a model, they write documentation about it. This documentation answers questions like who the model creator is, what the motivation behind creating the model was, which ethical considerations were taken, what data was used to train the model, what the model architecture is, and which open source license it carries. We mainly use open source licenses because our models have to be open source. It also records where the model is currently used, that is, which wikis are using it, as well as which wikis should not use it: if a model is trained mainly for the English language, it may not serve the Swahili community, so it may not be suitable for the Swahili wikis. Next slide, please.

Once the model has been created and we have made the model card for it, we store the model in Swift. Swift is an open source object store that is widely used across the Foundation. It is better suited for this than Git LFS, which is what ORES used previously. Next slide, please.

So once we have created the model, written its documentation, and stored it in Swift, we go ahead and create a model server. The model server is created using Python and KServe: we create a KServe inference service and wrap it in a Docker image, and that Docker image is essentially the model server; a small sketch of such a server is shown at the end of this section. Next slide, please.

Once we have created the model server, we send it through the pipeline. How do we do that? We use Blubber, an internal Foundation tool, to define the build configuration of the previously created Docker image. Then we run this image through the continuous integration pipeline, for which we use Jenkins, an open source automation server. That is essentially how CI is set up at the Foundation. Next slide, please.

Once the model server has gone through CI, it is stored in the Wikimedia Docker Registry.
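As a rough illustration of what a KServe-based model server looks like, here is a minimal sketch in Python. The class name, model file, and payload shape are assumptions for the example; the Foundation's actual model servers are more elaborate.

```python
# Minimal sketch of a KServe custom model server (illustrative only: the
# class name, model file, and payload shape are assumptions, not one of the
# Foundation's actual servers).
import joblib
from kserve import Model, ModelServer


class VandalismModel(Model):
    def __init__(self, name: str):
        super().__init__(name)
        self.model = None
        self.ready = False

    def load(self):
        # Load the serialized model that was stored earlier (e.g. in Swift).
        self.model = joblib.load("revision_model.joblib")
        self.ready = True

    def predict(self, payload: dict, headers: dict = None) -> dict:
        # Expect a payload like {"instances": [[...feature values...], ...]}.
        features = payload["instances"]
        scores = self.model.predict_proba(features)[:, 1].tolist()
        return {"predictions": scores}


if __name__ == "__main__":
    model = VandalismModel("vandalism-model")
    model.load()
    ModelServer().start([model])
```

In practice, a script like this is wrapped in a Docker image, and that image is what the build configuration and the CI pipeline produce and publish.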
The Docker Registry is public, so users from the community can download these model servers, run them locally on their own machines to see how the models behave, and propose changes to them. Next slide, please.

Now, once the model server is in the Docker Registry, we use Helm to configure how these model server images are deployed to Kubernetes. Next slide, please.

We use Kubernetes to orchestrate the running of the model servers once they have been pulled by Helm and deployed. Kubernetes is used to bring the model servers up, take them down, and manage their compute resources, and this is done for model servers both in staging and in production. Next slide, please.

Once the model servers are running on Kubernetes, end users can query the models through the API gateway. We use an internal API endpoint for staging and an external API endpoint for production, and both can be accessed through the API gateway. Next slide, please.

We mentioned model cards previously, but now let's take a look at an example of such a card. As we mentioned, model cards are the documentation of the models, and they provide transparency, visibility, and governance for the community. In this example, we have the language-agnostic revert risk model. Every model card starts with a description, which states what the model is supposed to do, that is, what problem it is solving. In this example, the model identifies revisions that need to be patrolled: its goal is to detect revisions that are likely to be reverted.

In the next section, we have the uses and users of the model. This section states exactly which cases the model can be used for. For example, this model can automatically find revisions that require patrolling, it is used for vandalism detection, and it can be used to create bots that assist admins and patrollers. What it should not be used for is automatically reverting edits without a human editor in the loop, or as ground truth, that is, using the model's outputs as labels for training another model. Every model card also has a section listing the current users, where the use cases and products the model currently supports are stated.

A really important section is the one we see at the bottom: ethical considerations, caveats, and recommendations. In this part, we may see specific biases or characteristics of the model that may raise concern and that users should be aware of.

The next section is a more technical description of the model. A really important part is the features that were used to train the model. In this example, we have article features and user features, and users can find the specific features of each entity that were used during training. In the next section, we have a description of the performance metrics and the implementation of the model. For example, this model has been built using the XGBoost library, and there is also a link to the code used for training, so someone can go and audit the training procedure. We also have the output schema and an example input and output of the model.
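As an example of what such an input and output might look like when querying a model through the API gateway, here is a small sketch using Python's requests library. The endpoint path, model name, and payload fields are assumptions based on the public API gateway pattern, so please check the API documentation for the exact values.

```python
# Sketch of querying a Lift Wing model through the API gateway (illustrative:
# the endpoint path, model name, and payload fields are assumptions; consult
# the API docs for the exact values).
import requests

url = (
    "https://api.wikimedia.org/service/lw/inference/v1/"
    "models/revertrisk-language-agnostic:predict"
)
payload = {"rev_id": 12345, "lang": "en"}  # a revision ID and its wiki's language code

response = requests.post(url, json=payload)
response.raise_for_status()
print(response.json())  # e.g. the probability that the revision will be reverted
```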
Finally, every model card has a section about the data that was used during training. In this case, we can see the data snapshot that was used for training and how it was split into training and test data for evaluation. At the bottom, we have the license that ships with the model; for example, this model comes with an Apache 2 license.

Before we close, I would like to mention some future work that we plan to do around Lift Wing. One big area of work is hosting large language models on Lift Wing. Most people will have heard about large language models, which have been really popular over the last year. For us, the difference from traditional machine learning models when it comes to hosting mostly has to do with model size: the serialized version of such a model extends to several gigabytes, which in addition requires GPUs in order to provide a fast inference service. On top of that, we plan to work on having easily accessible data for our model servers, which will make the models easier to use; this work could involve a feature store.

Please feel free to try Lift Wing by running this command, for example. You can find more commands in the first link, the API docs, and you can also visit Lift Wing's documentation page on Wikitech.

We'd like to thank you very much for taking part in this presentation, and feel free to reach us about anything in one of these three preferred ways: via email at ml@wikimedia.org, by opening a task on our Phabricator board and having a discussion there, or on IRC in the Wikimedia ML channel. Thank you very much.