All right, so I think we're going to get started. Thank you very much to everybody for coming. I'm quite excited today to give you an insight into the state of production machine learning in 2018. This talk is going to be a high-level overview of the ecosystem and is going to dive into three key areas, the ones that I personally am focusing on the most. To tell you a bit more about myself, I am currently the chief scientist at the Institute for Ethical AI and Machine Learning and also engineering director at an open-source, open-core startup called Seldon Technologies, based in London. To tell you a bit more about both of my roles: with the Institute, I focus primarily on creating standards as well as open-source frameworks that ensure people have the right tools and infrastructure to align with all those ethical principles that are coming out, as well as industry standards. It basically asks the question of what infrastructure is required so that reality matches expectation. If there's a regulation like GDPR that demands the right to explainability, it's really questioning what that means at an infrastructure level and what would be required to even enforce it. And then day to day, I lead the machine learning engineering department at Seldon. Seldon is an open-source machine learning orchestration library, so you would basically use Seldon if you want to deploy models in Kubernetes and manage hundreds or thousands of models in production. Some of the examples that I'm going to be diving into are actually going to be using some of our open-source tools. You can find the slides as well as everything that we're using at the link in the top right corner. The link's going to stay there, so don't rush to take a picture. So let's get started. Data science projects, small or large, tend to boil down into two different steps. The first one is model development.
The second one is model serving. In the first one, the standard steps that you would go through are basically getting some data, cleaning the data based on some knowledge, defining some features to transform the data, then selecting a set of models with hyperparameters. And then with your scoring metrics, you would iterate many, many times until you're happy. Once you're happy with the results of the model that you've built, you would want to persist this model, and then you would go to the next step, which is serving it in production. That's when unseen data is going to pass through the model and you're going to get predictions and inference on that new data. That is a very big simplification, but we're going to be using it throughout the talk. However, as your data science requirements grow, we face new issues. It's no longer as simple as keeping track of the features and the different algorithms that you use at every single stage. You have increasing complexity in the flow of your data. You perhaps had a few cron jobs running the models that you pushed to production, and now that you have quite a few, you descend into cron job hell. I mean, I don't know who uses that color palette for the terminal, but I guess each data scientist has their own set of tools. Some love TensorFlow, some love R, Spark, you name it, and good luck trying to take them away. Not just because they really like them, but also because some are more useful for certain jobs than others. So you're going to see a lot of different things that you're going to have to put in production. Serving models also becomes increasingly harder. You actually have multiple stages that each have their own complexities: building models, hyperparameter tuning. Those become big themes in themselves. And then when stuff goes wrong, it's actually hard to trace back.
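The two-step flow just described — develop, persist, then serve — can be sketched in a few lines. This is a minimal illustration assuming scikit-learn, with the iris data standing in for a real project; none of the names here are from the talk's demo:

```python
import pickle
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# --- Step 1: model development — data, features, model selection, iterate ---
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))

# --- Persist the model once the scoring metrics look good ---
blob = pickle.dumps(model)

# --- Step 2: model serving — load the frozen model, run inference on unseen data ---
served = pickle.loads(blob)
predictions = served.predict(X_test[:3])
```

In practice the persisted artifact would go to disk or a model registry rather than an in-memory blob, but the shape of the loop is the same.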
If something goes bad in production, is it because of the data engineering piece, or the data scientist, or the software engineer? You always have the Spider-Men pointing fingers at each other. So what we build from there is that as your technical functions grow, so should your infrastructure. And this is what we refer to today as machine learning operations, or just production machine learning concepts. In this case, it is that green layer that involves model and data versioning, orchestration. And really, it's not just those two things. The reason why it's challenging is because we are now seeing an intersection of multiple roles. This is basically software engineers, data scientists, and DevOps engineers, which are condensing into this role of machine learning engineer. And the definition of this role in itself is quite complex because it does require expertise in all of those areas. You see that when you look at a job description. These AI startups are hiring for a PhD with 10 years' experience in software development, maybe three years of McKinsey-style consulting experience, for the salary of an intern. I mean, that's basically what you have a lot of the time. And the reason why it is challenging is because we're now seeing things like data science at scale. You have the requirement for the things that you would normally follow in the data science world to also apply, to a certain extent, in the software engineering and DevOps world. And when I say it's challenging, it's because it actually breaks down into a lot of concepts. We've actually broken down the ecosystem in an open-source awesome-production-machine-learning list, which we would love for you to contribute to if you see one of the tools that is missing. It's one of the most extensive lists specifically focused on production machine learning tools. Just the explainability piece alone has an insane number of open-source libraries.
The ones that we're going to be diving into today — not saying that the rest are not as important, for sure — are the ones that I myself work on mostly on a day-to-day basis: orchestration, explainability, and reproducibility. For each of these principles, we're going to be diving into the conceptual definition of what they mean, together with a hands-on example showcasing the extent of the ways that you can address the challenge, as well as a few shout-outs to other libraries that are available for you to check out. So to get started: model orchestration. This is basically training and serving models at scale. And this is a challenging problem because you are really dealing with, I guess in a very conceptual manner, an operating-system challenge at scale. You need to allocate resources as well as meet computational hardware requirements. For example, if you have a model that requires a GPU, then you need to make sure that the model executes on a node where a GPU is available. So it is really hard. It's important to be aware that this complexity involves not just the skill set of the data scientists, but may also require sysadmin and infrastructure expertise to tackle it. And the reason why it also gets hard is because having stuff in production that is dealing with real-world problems also dives into other areas. So you have this already ambiguous role of machine learning engineering, and it's currently intersecting with the roles of industry domain expertise, as well as policy and regulation, to create these centralized industry standards. This already introduces the ambiguity of how you have compliance and governance with the models that you deploy in production. And this is kind of the very, very high level. But some of the DevOps engineers may say, well, what about the standardization of metrics?
If you're in a large organization, you may actually have to abide by certain SLAs. And with microservices, these SLAs are quite standard. They are uptime. They could be latency. But when it comes to machine learning, you may actually have some metrics that you have to abide by, like accuracy, and things that you need to be aware of, like model divergence. And of course, you could actually put together the code required for every single one of your deployments. But to a certain extent, it is necessary to be able to standardize and abstract these concepts at an infrastructure level. And that's what we're going to be diving into to a certain level today. And it's not only metrics, as you would know from any microservice or web app that you deal with in production, but also logs and errors. If you have an error with a machine learning model, the error may not just be a Python exception. It may be an error because the new training data was potentially biased towards a specific class. So you had a class imbalance, with more examples in one class than in the other. That could lead to errors that are not specifically exceptions. So you may not get notified because something failed, but you may see stuff failing because of it. And it is also: how do you standardize the stuff that comes in and out of the models? How do you track this? And then also, for example, if you have images coming into a model, you can't just go into your Kibana log dashboard and look at a binary dump of the data. So it's really understanding what to log in those cases. Now, when you actually deal with machine learning in production, you also dive into complex deployment strategies. Normally, you may imagine just putting, I don't know, a text classifier in production, but perhaps you may want to reuse components. Or maybe you want a more complex computational graph where you have some sort of routing based on some conditional cases.
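As a sketch of what such a non-exception failure check might look like — a class-imbalance guard on incoming training data — here is a toy illustration. The function name and threshold are invented for this example, not part of any particular monitoring stack:

```python
from collections import Counter

def check_class_balance(labels, max_ratio=3.0):
    """Return (ok, ratio): flag when the majority/minority class ratio
    exceeds max_ratio — a silent failure no Python exception will raise."""
    counts = Counter(labels)
    ratio = max(counts.values()) / min(counts.values())
    return ratio <= max_ratio, ratio

ok, ratio = check_class_balance([0] * 90 + [1] * 10)
print(ok, ratio)  # the 9:1 skew trips the default 3:1 threshold
```

In a standardized serving layer, a check like this would be exported as a metric and alerted on, rather than printed.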
You may have some multi-armed bandit optimizations that route between different models at the end. Or you may have other things like explanations. We're going to dive into that in a moment, but explanations are a big thing in the machine learning space. And you may want to have those things in production so that your domain experts can make sense of what's currently deployed. And again, yes, you could actually do this custom for every single thing, but the reason why you wouldn't want to is that if you have manual work for every single model, what you're going to end up with is each data scientist having a maximum of, say, 10 models that they can maintain in production at any one time. So if you want to deploy more models, you're going to end up having to hire more staff. You actually want to avoid that linear growth of your technical resources with your internal staff. OK, and this is where the concept of GitOps comes in. This is the concept of using your Git repo or your version control system as your single source of truth. Whatever gets updated there will be reflected in what you have in production. This may not be limited to the code of your application, but may also reach the extent of the configuration that your cluster is currently following. In this case, we're going to be showing an example where we first start with a very, very simple model. We're going to be taking a very common data set that you've probably used and followed a tutorial with, which is the income classification data set. We're going to basically assume that we're taking this data set of people's details, like your number of working hours per week, your working class, et cetera, et cetera. And we're going to train a machine learning model to predict whether that person earns more or less than 50k.
And in essence, in this example, we're going to assume that we're using this model for approving someone's loan. If it predicts more than 50k, the loan is approved; otherwise it's rejected. I don't recommend anyone do this in production. This is just an example. And what we're going to be doing is wrapping this Python model, then deploying it, and seeing how we can get some of these standardized metrics, standardized logging, et cetera, et cetera. All of these examples are actually open source, and they're all available at the link, so you can go and try them yourself. Within an hour, we're only going to be able to cover them from a high-level perspective. So in this first part of the example, we're going to create a Python model, then we're going to wrap it, and then we're going to deploy it in a Kubernetes cluster. So it's going to be containerized with Docker, and then it's going to expose the internal functionality through a RESTful API. The way that we would do it is we would set up our environment, which basically requires you to have a Kubernetes cluster running. I'm not going to be trusting the conference internet to help us with that today, so I already have everything set up in tabs, as you can see, just in case. So what we're doing in this case is downloading the data set. This data set contains applications of people and whether they get approved or rejected. We do a train-test split, as you normally would. And let's actually have a look at the data set. Yeah, so you basically have an already normalized data set where you have, in the first column, the age of the people, and then the remaining columns for the rest of the features. And then we can print the labels as well. So I think I have them here. Feature names, actually. We can see the feature names.
And that's basically the order in which we have them: the age, the working class, education, et cetera, et cetera. Perfect. So the first thing we're going to be doing is using scikit-learn. Just to get a bit of an understanding, who here has used scikit-learn? Let's see a show of hands. OK, perfect, awesome. So what we're doing here is just building a pipeline. We're going to be scaling our numeric data points, as well as creating one-hot encodings of our categorical data points. And we're going to be transforming the data with that. Now that we've fit our preprocessor, we're going to train a random forest classifier on that data set, so that it takes the preprocessed data and then predicts whether a person would get a loan approval or rejection. Once we've trained our model, we can use the test data set to see how it performs. We can see that in terms of accuracy, it has about 85%, plus precision, recall, et cetera. So now we have a trained model. With our scikit-learn setup, clf is our random forest classifier, and the preprocessor is basically our pipeline of the standard scaler and the one-hot vectorizer. So then what we're going to do is take this model and containerize it. And for this, we're going to first dump those two models that we've created — the preprocessor and the classifier — into this folder. And we can actually see the contents, so we can see that we dumped them there. Once we have those two trained models, we basically create a wrapper. And this wrapper is just going to have a predict function that will take whatever input comes in. This predict function will be exposed through a RESTful API. But basically, whatever the input, we pass it through the preprocessor, we pass it through the classifier, and then we return the prediction.
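The steps just walked through — fit a preprocessor, train the classifier, dump both artifacts, then wrap them behind a predict function — can be sketched roughly like this. Synthetic data stands in for the income data set, and the wrapper signature is only indicative of the general shape, not the exact interface of the wrapping tool:

```python
import joblib
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in: column 0 numeric (age), column 1 categorical (working class).
rng = np.random.default_rng(0)
X = np.column_stack([rng.integers(18, 70, 300), rng.integers(0, 4, 300)]).astype(float)
y = (X[:, 0] > 40).astype(int)  # illustrative target, not the real >50k label

# Scale numeric features and one-hot encode categorical ones.
preprocessor = ColumnTransformer([
    ("num", StandardScaler(), [0]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), [1]),
]).fit(X)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(preprocessor.transform(X), y)

# Persist both artifacts, as in the demo.
joblib.dump(preprocessor, "preprocessor.joblib")
joblib.dump(clf, "model.joblib")

class LoanClassifier:
    """Thin wrapper: this predict function is what gets exposed over REST."""
    def __init__(self):
        self.preprocessor = joblib.load("preprocessor.joblib")
        self.clf = joblib.load("model.joblib")

    def predict(self, X, feature_names=None):
        # Whatever comes in over the API: preprocess, then classify.
        return self.clf.predict_proba(self.preprocessor.transform(X))

probs = LoanClassifier().predict(np.array([[35.0, 2.0]]))
print(probs)
```

The point of the class is that the serving layer only ever needs this one interface, whatever model sits behind it.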
So this is very simple. We load the models, and then we just run whatever is passed through this predict function and return the predictions. Super simple. This wrapper is basically the interface that we require so that we can containerize it. For the next step, the containerization, we just need to define any dependencies. In this case, we use scikit-learn — and the image library, because we were going to be sending... well, in this case, we actually don't need the image one. Just scikit-learn. And then we just define the name of our file, and we run the S2I CLI tool. What it basically does is take our standard image, which exposes and wraps this model file through a RESTful API and a gRPC API. So once we have this container — just to get a bit of an understanding of the room, who here has used Docker before? Great, awesome. So here you basically just have a Docker image called loan classifier 0.1. When you run this Docker image, the input command is basically just going to run a Flask API that exposes the predict function. Whatever you send to that predict endpoint will be passed through your wrapper. That is basically what it would be doing. Once we have that, we just specify it in our Kubernetes definition file. This is just saying: the container that we're going to have is this loan classifier. And your computational graph in this case has just one element, which is the loan classifier. And that's basically all you would have. Once you define that, if it's built, you can actually deploy it. Here you can see that it's being created in the local Kubernetes cluster. I think it is downloading the image, which is not great. But basically, what you would then see is that this model is now deployed in our Kubernetes cluster. It's going to be listening for any requests. So it's basically as if it was a microservice.
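The Kubernetes definition file mentioned here looks roughly like the following. This is a hedged sketch from memory of a Seldon-style custom resource with a single-node graph — the exact field names and API version may differ from what the current CRD expects:

```yaml
# Sketch of a SeldonDeployment with a one-element computational graph.
apiVersion: machinelearning.seldon.io/v1alpha2
kind: SeldonDeployment
metadata:
  name: loan-classifier
spec:
  predictors:
  - name: default
    replicas: 1
    componentSpecs:
    - spec:
        containers:
        - name: loan-classifier
          image: loan-classifier:0.1
    graph:
      name: loan-classifier
      type: MODEL
      children: []
```

Applying a file like this is what turns the wrapped container into a microservice-like deployment in the cluster.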
And then, as with any other RESTful endpoint, we can actually interact with it, in this case with curl. Here we're just sending it one instance on which to perform inference. The response is an ndarray with the positive and negative labels. In this case, it predicted the negative label. So what we've done is wrapped a model with a very, very simple, thin wrapper and put it in production. The wrapper itself also exposes a metrics endpoint which, for the people that have used Prometheus or Grafana in the past, you can hook up to Prometheus, and you're able to get some metrics out of the box. Let's see if I can actually show it. Here is basically our income classifier that we have deployed. And out of the box — this is a Grafana dashboard — you get all of the requests per second, you get the latency for that specific container, et cetera, et cetera. We're going to be diving a bit more into some of the metrics in a bit. You also get some of the logs. Again, this is basically just the output of the container. It's being collected with a Fluentd server and then stored in an Elasticsearch database. So for the ones that have used Kibana in the past, this is basically us querying Elasticsearch for the logs. And for tabular data, this is what we expose out of the box. So that is an initial overview of the orchestration piece and the benefits of containerizing your models. Of course, it's obvious in terms of making them available for business consumption. But the core thing here is the push towards standardization. If you were to have 100 models in production, you would be able to interact with them as if they were microservices. And what this allows you to do — we have just covered a very, very simple example.
But what this really allows you to do is to leverage this GitOps structure that I was talking about earlier. And just to see here, who here is familiar with PyTorch? And with PyTorch Hub? OK, so PyTorch Hub is basically a new initiative from PyTorch where they encourage people to share trained models like BERT or VGG, where you can actually submit your models to a Git repo. And what that allows you to do is to have a central, sort of standardized interface to already-trained models. So in this case, you're able to define any model — in this case, it's ResNet — and you say, this is how you load it, and this is where the trained binary is located. So it's an initiative from PyTorch. And what we have been able to do is create an integration with PyTorch Hub where, any time you point a new deployment configuration at a repo, a very thin wrapper just downloads that model, because the actual code to load it is standardized by the deployment. To be more specific, the way that we do it is a wrapper where you take the repo and the model name as input parameters that you can pass through the config files. And then when it loads, it downloads the model from PyTorch Hub. So you basically have the ability to dynamically publish any sort of BERT- or VGG-like models. I mean, anyone who has actually tried using BERT or one of those state-of-the-art models will know the pain of setting them up. So there's a lot of benefit in trying to standardize the way not only to define them, but also to deploy them. And again, you can jump in and try these examples. So that is a high-level overview of the orchestration part. Before we jump into the explainability piece, some other libraries to watch: one of them is MLeap Serving.
Their approach is that they have a single server that allows you to load a standardized serialization of models. So if anyone is familiar with the ONNX sort of serializable definition of models, you'd be able to have a single server that loads your trained binaries and exposes them through, again, an API. And then another one to watch is DeepDetect, which unifies a lot of these Python-based models behind a standardized API. And these are two of a large number of libraries to check out. I'd definitely advise you to have a look at the entire list. It's quite extensive. All right, so the second piece is explainability, so we're going to jump into that one. Explainability tackles the problem of black-box-model and white-box-model situations, where you have a trained model and you want to understand why the model predicted whatever it predicted. And the way that we tackle it requires the people tackling this issue to go beyond the algorithms. The reason why is because this is not just an algorithmic challenge. It takes a lot of the domain expertise into account. And the way that we emphasize this is that interpretability does not equal explainability. You may be able to interpret something, but that doesn't mean that you understand it. And of course, in terms of the English definitions of those words, that conceptual distinction is not really there, but we tend to push that way of thinking about it because it's not just the data scientist addressing these challenges. It may require the DevOps or software engineer, but also the domain expert, to be able to understand how the model is behaving. We actually did a three-and-a-half-hour tutorial at the O'Reilly AI conference. So for each of these things, we could dive into an insane amount of detail. But just for the sake of simplicity, today we're going to do a high-level overview.
The standard process that we often suggest to follow actually extends the existing data science workflow that we showed previously. It adds three new steps — not really new, but three steps that are explicitly outlined for explainability. These are data analysis, model evaluation, and production monitoring — production monitoring being the one that we're going to dive into today. In terms of data analysis, you would want to explore things like class imbalances, whether you're using protected features, correlations within the data — perhaps removing a feature may not mean that you are actually removing 100% of the signal it brings — as well as data representability. That is: how do you make sure that your training data is as close as possible to your production data? And this is a very well-known problem. The second one is model evaluation. This asks what techniques you can use to evaluate your models: things like feature importance, whether you're using black-box techniques or white-box techniques, whether you're using local methods or global methods, whether you can bring domain knowledge into your models. And this is important because what your models are doing is learning hidden patterns in your data. But if you can give those patterns upfront as features, or as combinations of your initial features that leverage some of the domain expertise, then you're able to have much simpler models doing the processing at the end. One of the use cases that we had is in NLP — automation of document analysis. We've been able to leverage a lot of the domain expertise of lawyers, asking meta-learning questions like: how do you know this answer is correct? Or: what is the process you go through to find an answer?
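To make the representability point concrete, here is a toy sketch of the kind of check one might run against a production feature: flag when its mean has drifted too far from the training distribution. The function and threshold are invented for illustration; real drift detection would use proper statistical tests:

```python
import numpy as np

def representability_check(train_col, prod_col, z_threshold=4.0):
    """Crude drift check: flag a feature whose production mean sits more
    than z_threshold standard errors away from the training mean."""
    train_col = np.asarray(train_col, dtype=float)
    prod_col = np.asarray(prod_col, dtype=float)
    se = train_col.std(ddof=1) / np.sqrt(len(prod_col))
    z = abs(prod_col.mean() - train_col.mean()) / se
    return z <= z_threshold, z

rng = np.random.default_rng(0)
train_age = rng.normal(40, 10, 1000)          # training distribution of "age"
ok, _ = representability_check(train_age, rng.normal(40, 10, 200))
drifted_ok, _ = representability_check(train_age, rng.normal(55, 10, 200))
print(bool(ok), bool(drifted_ok))             # in-distribution passes, shifted fails
```

A check like this is exactly the kind of constraint from experimentation that production monitoring should then enforce.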
Things like that allow you to build smarter algorithms, not just in the machine learning models but in the features as well. And then the most important one is production monitoring. How can you reflect the constraints that you introduced in your experimentation and make sure that you can enforce those in production? If you think that precision is the most important metric, and that you should not have a certain rate of false positives or false negatives, then you need to make sure that you have something in production that allows you to enforce and monitor that. So: evaluation of metrics, manual human review — not forgetting that you can leverage humans too. That is also something that you can definitely do with machine learning. And the cool thing about this is that with the push into the Kubernetes world, we're able to convert these deployment strategies, such as explainers, into design patterns. So instead of just having a machine learning model in production, you can have deployment strategies where you have another model deployed in production whose responsibility is to explain and reverse-engineer your initial model. This may get a little bit Inception-like, but this is actually a pattern that has proven quite effective and that a lot of organizations are starting to adopt, which we named the explainer pattern — not very original, but this is what we're going to be doing now. We already have our model deployed in production. We're saying that this model is predicting whether someone's loan should be approved or rejected, and assuming that this is a black-box model, we're now going to deploy an explainer that is going to explain why our first model is behaving as it is, right? So that's what we're going to be doing now, using that same example that we were leveraging. So now we have our initial model in production, and we can reach it through this URL.
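A stripped-down sketch of the explainer pattern's shape: two separate components, where the explainer only talks to the deployed model through its prediction interface, never its internals. Everything here — the names, the one-feature sensitivity heuristic — is invented for illustration; it is not how Alibi works:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy "deployed model service": in production this would sit behind a REST API.
X = np.array([[20.0], [30.0], [60.0], [80.0]])
y = np.array([0, 0, 1, 1])
model = LogisticRegression().fit(X, y)

def model_service(instances):
    """Stands in for the deployed loan classifier endpoint."""
    return model.predict_proba(np.asarray(instances))

class ExplainerService:
    """Second deployment in the explainer pattern: it reverse-engineers the
    first model purely from its inputs and outputs."""
    def __init__(self, predict_fn):
        self.predict_fn = predict_fn

    def explain(self, x, delta=1.0):
        # Crude sensitivity estimate: how much does the approval probability
        # move when the feature is nudged upward?
        base = self.predict_fn([x])[0, 1]
        nudged = self.predict_fn([x + delta])[0, 1]
        return nudged - base

explainer = ExplainerService(model_service)
print(explainer.explain(np.array([40.0])))
```

Swapping `model_service` for an HTTP call is all it takes to point the same explainer at the model running in the cluster, which is exactly the move the demo makes next.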
What we're going to do now is leverage an explainability library — of which there are actually many, but this is one that we maintain. It's called Alibi, and it offers basically three main approaches to black-box model predictions — sorry, black-box model explanations. The first one is anchors. Anchors answer the question: of the features that you sent to your model for inference, which are the features that influence the prediction the most? And the way it does that is by going through the features and replacing a feature with a neutral value, and then seeing which one affects the output the most, right? So this is anchors, and this is what we're actually going to be using. But there's another very interesting one called counterfactuals. And counterfactuals are basically the opposite — well, not really the opposite, but conceptually the opposite of anchors. They ask the question: what are the minimum changes that I can make to this input to make the prediction incorrect, or at least different from what it was? So if you were approving someone's loan, the question would be: what are the changes you can make to that input so that the loan is rejected? This basically allows you to understand things like, for example with MNIST, you can ask: what are the minimum changes you can make so that this four is not a four? But more interestingly, you can go from one class to another. You can say: what are the minimum changes I can make to this four to make it a nine? So what we're going to do first is anchors on our data set. Here, we're just using our Seldon client to also get the prediction. So we're literally just sending a request, and this is the response, which is the same as the curl one. But yeah, so we're going to create an explainer, and we're going to be using Alibi and the anchor tabular explainer.
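The counterfactual idea can be illustrated with a toy one-feature model: search outward for the smallest perturbation that flips the prediction. This is only a conceptual sketch, not the actual Alibi counterfactual algorithm (which optimizes a loss function over the input); all names and numbers are invented:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy loan model: a single "income-like" feature drives approval.
X = np.array([[20.0], [30.0], [60.0], [80.0]])
y = np.array([0, 0, 1, 1])  # 0 = rejected, 1 = approved
model = LogisticRegression().fit(X, y)

def counterfactual_1d(model, x, step=1.0, max_steps=100):
    """Greedy sketch of the counterfactual question: find the smallest
    nudge to the single feature that changes the model's prediction."""
    original = model.predict([x])[0]
    for i in range(1, max_steps + 1):
        for direction in (+1, -1):          # try both directions, nearest first
            candidate = x + direction * i * step
            if model.predict([candidate])[0] != original:
                return candidate
    return None

cf = counterfactual_1d(model, np.array([30.0]))
print(cf)  # the nearest input the model classifies differently
```

For a rejected applicant at 30, this walks upward until the model flips to "approved" — the same question as "what would it take for this loan to be accepted?".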
For this, we're going to take our classifier — that random forest predictor that we trained before — expose its predict function, and feed that into our anchor tabular explainer, right? Because it's going to be interacting with the model as if it was a black-box model. It's only going to be interacting with the inputs and outputs, right? You don't need the training data when using text or images, only when using tabular data. The reason why is that with tabular data, you need to ask the question of what the neutral values are that you use for the replacements, right? In this case, for numeric data sets, you get the minimum and the maximum, and then you say, well, I want it to be the quartiles, or something like that. That's the only reason why you would use the training data. But yeah, so you would fit it, and then you would see the input that we're going to be sending. We are actually sending this one — somebody of age 27, who is predicted as negative — and we're going to explain it. And it basically says, well, what made this prediction what it was is the feature marital status of separated and gender of female, right? So that's basically your explanation for this instance. And what's now starting to get interesting is that we're going to use our local explainer on the model that we already deployed, right? So in this case, instead of that local predict function, we're now going to be using sort of a remote model. We're actually going to be sending the request to the model that is currently in our Kubernetes cluster. And when we request the explanation, we're going to get the same thing, right? The only difference is that we're now actually reaching the model in production. And now we're going to follow the same steps.
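A toy sketch of the substitution idea behind the tabular case: replace each feature with a neutral value taken from the training data and see how far the prediction moves. Again, this is illustrative of the intuition only, not Alibi's actual anchors algorithm, and all names here are invented:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy tabular model: feature 0 drives the label, feature 1 is pure noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = (X[:, 0] > 0).astype(int)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

def substitution_importance(model, X_train, x):
    """For each feature, swap in a 'neutral' value (the training median)
    and measure how much the predicted probability shifts."""
    base = model.predict_proba([x])[0, 1]
    neutral = np.median(X_train, axis=0)   # this is why training data is needed
    shifts = []
    for j in range(len(x)):
        perturbed = x.copy()
        perturbed[j] = neutral[j]
        shifts.append(abs(model.predict_proba([perturbed])[0, 1] - base))
    return shifts

shifts = substitution_importance(model, X, np.array([2.0, 2.0]))
print(shifts)  # feature 0 should dominate feature 1
```

The need for training statistics to define "neutral" replacement values is exactly why the tabular explainer gets fit on training data while the text and image variants do not.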
We're going to containerize the explainer and put the explainer in production, right? So again, we create a wrapper. The wrapper has a predict function. The predict function basically takes the inputs, runs explain, and returns the explanation, right? So we containerize it, we deploy it, and now what we have in production is an explainer. We have our loan classifier explainer as well as our initial model. What's interesting is that now you can send one of these components a request to do an inference, and you can send the other a request to explain that inference by interacting with the model in production. And we can visualize it here. If you remember, with our income classifier, if we have a look at the logs, these are all the predictions that have gone through the model as requests. So what we can do now is take one of these and send a request for the explainer to explain what's going on. And look, this is the exact same thing that you just saw in the other one, but flashy, shiny, and colorful, right? This basically says, for that other explanation, you still have that marital status of separated influencing your prediction by this much, gender female by this much, and capital gain by this much. And then you can also see predictions that are similar or different. But in essence, you're still getting the same insights as if you were using it locally — and again, you're getting those standardized metrics. That explainer also has the metrics exposed, also has the logs exposed, et cetera, et cetera. So you get that benefit. And that's basically the example for the explainers. Now we're actually going to go one level deeper. But before that, I want to give some libraries to watch in the model explanation world. These are ELI5, which is Explain Like I'm 5 — this is a very cool project.
They cover a lot of different techniques. SHAP, which you've probably come across if you're in this space or have looked at model explanations. And XAI is one that focuses specifically on data, with techniques for class imbalance, et cetera. And as I mentioned, there are tons: with black box model explanations you can dive into so many different libraries. It's a very exciting field, so I do recommend having a look. Now for the last part, which is reproducibility. Reproducibility answers the question of how you keep the state of your model with the full lineage of data as well as components. And it really breaks down into the abstraction of its constituent steps. For every single part of your machine learning pipeline, you're going to have a piece of code, configuration, and input data. And for each of those, you may want to freeze that as an atomic step. The reason you may want to do that is that you may want to debug something in production, or, for compliance, have audit trails of what happened, when it happened, and what you had in there. The reason it's also hard is that the challenge isn't only at the level of an individual step; it extends to your entire pipeline. Each of the reusable components in your pipeline may require that level of standardization. You saw it with the configuration definition in the previous example, where we had a graph definition. There you can have multiple different components which are Docker containers, containerized pieces of your atomic steps. And one thing is to be able to keep those atomic steps.
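One way to make "freezing an atomic step" concrete is to fingerprint its code, configuration, and input data together, so a digest identifies that exact state for audit trails. This is a hedged sketch of the idea, not how any particular tool implements it:

```python
# Sketch: "freeze" an atomic pipeline step by hashing its code,
# configuration, and input data into one audit-trail identifier.
import hashlib
import json

def step_fingerprint(code: str, config: dict, data: bytes) -> str:
    h = hashlib.sha256()
    h.update(code.encode("utf-8"))
    # Canonical JSON so the same config always hashes identically
    h.update(json.dumps(config, sort_keys=True).encode("utf-8"))
    h.update(data)
    return h.hexdigest()

fp1 = step_fingerprint("def clean(x): ...", {"lowercase": True}, b"raw text")
fp2 = step_fingerprint("def clean(x): ...", {"lowercase": True}, b"raw text")
fp3 = step_fingerprint("def clean(x): ...", {"lowercase": False}, b"raw text")
```

Identical code, config, and data always yield the same fingerprint, while any change to the configuration produces a different one, which is exactly the property you want when tracing back what ran in production.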
Another is to be able to keep an understanding of the metadata of the artifacts within each of those steps, because metadata management is hard, right? And now we're getting to a point where it's not only metadata management, but metadata management for machine learning at scale. It's doable; it's just that in some areas it requires a new way of thinking. What we're going to dive into here is the point we haven't covered: we've talked about models that are already trained, but we haven't talked about the process of training models. We're currently contributors to a project called Kubeflow, which I'm not sure if you've heard about. Kubeflow focuses on training and experimentation of models on Kubernetes, and what it allows you to do is build reusable components. What we're going to dive into in this last example is a reusable NLP pipeline in Kubeflow. More specifically, let me actually open it, it's going to be this example, for which I have the Jupyter notebook. You can try it yourselves, but we're going to create a pipeline with these individual components. If you've ever done NLP tasks, we're going to be doing, let's call it sentiment analysis, with the usual steps: cleaning the text, tokenizing it, vectorizing it, and then running it through a logistic regression classifier. The first step just downloads the data; we're using the Reddit hate speech dataset, so from r/science, all the comments that were deleted by mods have been compiled. So what we have here are these components, and we want to create this computational graph in production that uses them as separate entities. And the reason you want that is that maybe you want to reuse your spaCy tokenizer for other projects.
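The four pipeline stages just listed can be sketched end to end with toy stand-ins. To keep this self-contained, the cleaning, tokenizing, vectorizing, and classifying below are deliberately simplified replacements for the real spaCy, TF-IDF, and logistic regression components in the demo:

```python
# Toy sketch of the pipeline stages: clean -> tokenize -> vectorize
# -> classify, with simple stand-ins for the real components.
import re
from collections import Counter

def clean_text(text: str) -> str:
    # Lowercase and strip everything except letters and spaces
    return re.sub(r"[^a-z ]", "", text.lower())

def tokenize(text: str) -> list:
    return text.split()

def vectorize(tokens: list, vocab: list) -> list:
    # Simple term-count vector over a fixed vocabulary
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

def classify(vector: list, weights: list, bias: float = 0.0) -> int:
    # Linear decision rule standing in for logistic regression
    score = sum(v * w for v, w in zip(vector, weights)) + bias
    return 1 if score > 0 else 0

vocab = ["hate", "love"]
weights = [1.0, -1.0]  # "hate" pushes toward the positive class
tokens = tokenize(clean_text("I HATE this!!"))
label = classify(vectorize(tokens, vocab), weights)
```

In the Kubeflow version, each of these functions becomes its own containerized component rather than a function call, but the data flow between them is the same.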
And you perhaps want to keep your own feature store, where you just pick and choose different things: that ultimate drag-and-drop data science world. But yeah, that's what we're going to be doing in this example. From a high-level perspective, it consists of five repeats of wrapping models, but in this case it's just wrapping scripts, in the same process we did previously. For example, the clean text step is again just a wrapper, a transformer with a predict function that takes the text as a NumPy array and runs the transformation. Oh, in this case it's actually the TF-IDF vectorizer: it runs the vectorization and then returns the vectorized output. And in terms of the interface to it, it's just like a CLI. Once we have these components, we're able to define our pipeline and upload it into Kubeflow, which then looks like this: basically all of the steps with all the dependencies. The only difference is that it uses a volume attached to each of the components to pass the data from one container to the other. So for each component, the volume is attached. And the interesting thing here is that you can create experiments through the front end. You can choose which parameters you expose, and here I can change the number of TF-IDF features, et cetera, and then just run our pipeline. Then you can see your experiments, you can see which ones have run, and for each of the steps you can see the input and output as you print it for each of the components. So here we can see the text coming in and then the tokens coming out the other side. And the last step is a deploy, which just puts it in production again, listening for any requests.
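The "wrap a script as a transformer" pattern described above can be sketched like this. The class names and the `predict` signature are illustrative rather than the exact wrapper API; the point is that every step, model or not, exposes the same interface, so the orchestrator can chain them:

```python
# Sketch: each pipeline step is a class with a predict method, so the
# same containerization machinery works for models and plain scripts.

class CleanTextTransformer:
    def predict(self, X, feature_names=None):
        # X is a batch of raw strings; the output feeds the next step
        return [x.lower().strip() for x in X]

class TokenizeTransformer:
    def predict(self, X, feature_names=None):
        return [x.split() for x in X]

# Chaining the steps locally mirrors what the computational graph does
# in production, where each predict output is piped to the next
# container over the shared volume.
steps = [CleanTextTransformer(), TokenizeTransformer()]
batch = ["  Some RAW Text  "]
for step in steps:
    batch = step.predict(batch)
```

Because every component has the same shape, reusing the tokenizer in a different pipeline is just a matter of referencing the same container in another graph definition.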
For this specific demo, you can see the deployed model here: it's an NLP Kubeflow pipeline. And you can see that there are actually live requests going through each of the components: the clean text, the spaCy tokenizer, the vectorizer, et cetera. What we're sending it live, and this is actually quite funny, is all the tweets related to Brexit. Do you guys know what Brexit is? Yeah? So it's doing hate speech classification. And the funny thing is that no matter what side you're on, there's a lot of hate. We can see here the nice-looking logs, but as I mentioned, you can also jump into Kibana. And here you can see things like "Celtic Brexit, Spring", yeah, well, I don't know. I don't want to read them out loud because some are not very appropriate. But basically, we now have this production Brexit classifier that can be retrained with different datasets and swapped out automatically through this step. The objective here is just to show the complexities of this reproducibility piece and how different tools are trying to tackle it. This dives more into the experimentation and training part, and I haven't even dived into the complexity of tracking metrics as you run experiments: I run ten iterations of the model, I want to know which performed better, how do I keep track of my metrics as well as the models I used? Each of the things I've covered has so many different dimensions to tackle it from. We actually have talks online where we spend an hour and a half on just one of these. Today was more of a high-level overview.
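The experiment-tracking problem just mentioned, "I ran ten iterations, which performed better?", reduces to recording parameters and metrics per run and querying over them. Tools like MLflow persist this to a tracking server; this hedged sketch keeps it in memory just to show the shape of the data:

```python
# Minimal sketch of experiment tracking: log params and metrics per
# run, then ask which run performed best on a chosen metric.
runs = []

def log_run(params: dict, metrics: dict) -> None:
    runs.append({"params": params, "metrics": metrics})

def best_run(metric: str) -> dict:
    return max(runs, key=lambda r: r["metrics"][metric])

# e.g. a few iterations with different TF-IDF feature counts;
# the accuracy formula here is a placeholder, not a real result
for n_features in [100, 500, 1000]:
    accuracy = 0.7 + n_features / 10000
    log_run({"max_features": n_features}, {"accuracy": accuracy})

winner = best_run("accuracy")
```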
And other libraries to watch: Data Version Control, DVC, which is basically a Git-like CLI that lets you run the usual commit and push workflows, but for those three components of code, configuration, and data. Another is MLflow from Databricks, which focuses on experiment tracking; we actually have some examples where we integrate with it. And Pachyderm, which dives into full compliance. So as you can see, this ecosystem is incredibly broad, but at the same time super, super interesting. And yeah, I'm going to wrap up and jump into questions, in case anyone has questions on this or any other libraries. But before that, I'll just give a few words to close. We covered three of the key areas I have been focusing on: orchestration, explainability, and reproducibility. But as I mentioned, the content is insanely broad. Things I haven't talked about which are also insanely interesting include adversarial robustness; as you saw, some of our explainability techniques explain through something like adversarial attacks, so it's interesting to see how much overlap there is across these areas, and not only overlap, but different levels at which some fit into other categories. Privacy is a super interesting one we haven't covered, which dives into privacy-preserving machine learning, an interesting area in itself. Storage, serialization, function as a service, et cetera. So with that, I've been able to give a high-level overview of the state of production machine learning in 2019. It wasn't exhaustive, though it might have felt like it was. But yeah, if there are questions, I'm happy to cover them now or later at the pub. Thank you very much, guys. It's a pleasure. Thank you very much for your talk. I'm actually chairing your session. So do we have questions? Please come ahead. Come to the microphones.
The microphone is working, for your questions, please. Hi, excellent talk, thank you. I was mostly inspired by this explainability idea and have two questions about it. First, let's assume we have a lot of features, and they produce a huge space of variants. It seems that when I try to explain this black box, I need to iterate over all these features, all variants of these features, and that seems like a performance issue. How can that be solved? So that's the first question. And the second one: some models have information about feature importance within themselves, like random forests. Have you compared results from this explainer with the internal results of the model itself? Yeah, okay, those are two really good questions. The first one was basically: you have a lot of features, what's the computational complexity around that and how do you deal with it? And the second one was, what was the second one again? Internal importance, yes: comparing internal importance to the black box model explainer. So yeah, okay, let's dive first into the computational challenges. That is 100% correct, and with anchors as a technique, we are conscious that explaining black box models as a whole often becomes quite expensive. The way we have been able to tackle it is by separating the way you request explanations and predictions. For explanations, you may not want something that is real time and for every single prediction that goes through; instead, it's for diving deeper into one or a few of the inference predictions you may have, right?
So perhaps if something went wrong, you can use explanations to debug how it performed; or if the threshold you set for accuracy was 90%, you would only request explanations for things that fall under it when you assess them. So that's from one side. On the other, interestingly enough, this week our data science team published a paper that proposes a way to deal with the computational challenges of counterfactuals specifically, and of contrastive explanations. And that is using the concept of prototypes, with neural networks to reduce the dimensionality of the features themselves. That paper is on arXiv and you can check it out, but there is a lot of research in this space on making it more feasible without sacrificing the power of the explanations. Unfortunately, there's no silver bullet, so I do acknowledge it's a challenge, but that is why there's also the benefit of leveraging white box model explanations in situations where you actually can. And following on to your second question, you can leverage some of the internal structures of models like random forests or neural networks, using, say, the weights of the networks, to explain much more easily. It's also worth mentioning that some of the explainers are themselves optimization problems: for example, we use gradient descent in some of the explanation techniques, such as the counterfactuals. Now, for the second piece, leveraging the internals and seeing how that performs against the black box approach: we actually haven't done benchmarks of how it performs against them, but that's definitely something we would be interested in. If you're interested in that, it's open source, so we would love a pull request or an issue on our documentation about it. But that's a really good question, yeah. Yeah, okay, thank you. Thank you.
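The cost-control pattern described in that answer, explaining only the predictions that fall under a threshold rather than every request, can be sketched in a few lines. The function and field names here are hypothetical, just illustrating the selection step:

```python
# Sketch: request explanations only for low-confidence predictions,
# since explaining every prediction is computationally expensive.

def select_for_explanation(predictions, threshold=0.9):
    # predictions: list of (instance_id, confidence) pairs;
    # return the ids whose confidence falls under the threshold
    return [pid for pid, conf in predictions if conf < threshold]

preds = [("a", 0.97), ("b", 0.62), ("c", 0.91), ("d", 0.85)]
to_explain = select_for_explanation(preds, threshold=0.9)
```

Only the flagged instances would then be sent to the deployed explainer component, keeping the expensive black-box explanation off the hot path.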
Do we have more questions? Please go ahead, hello. I have two questions. First: what are your views on setting up a pipeline for feedback loops, so that after your model has gone into production, you have a result at the end of the day saying, hey, you know what, for these records you had correct predictions and for these you had wrong ones? How do you go about retraining, or incrementally retraining, your model after it has gone out into production? Yeah, that is an excellent question. And that was one of the key things I discussed in, I guess they call it a three-hour workshop, I call it a three-hour rant, because I was trying to push how important that piece is. And unfortunately, again, there's no silver bullet: you can't just deploy a model and have that feedback loop out of the box, because you don't always have data that gets relabelled in production, right? I'll give you a specific example. If you're automating support ticket routing, then the support tickets will be resolved at some point, so you're getting data that is being labelled in real time, and you could get that feedback in real time. Other times, when labelling data is very expensive, you may not have that benefit, but you may still want that feedback loop, and in that case you may need to establish it manually. What that would mean, say, is every week, every month, or once a year, evaluating the performance of the model on a random set of data, perhaps on a balanced set of classes, that is labelled by hand and then compared with what the model predicts, to see the performance. So that feedback loop should definitely be in place; the way it should be set up differs depending on the use case.
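That manual feedback loop, periodically scoring the production model against a hand-labelled random sample and flagging it when accuracy drops, can be sketched like this. The threshold and data here are illustrative:

```python
# Sketch of a periodic audit: compare model predictions against a
# hand-labelled sample and flag the model for retraining when its
# accuracy falls below a chosen threshold.

def needs_retraining(predicted, hand_labelled, min_accuracy=0.9):
    correct = sum(p == y for p, y in zip(predicted, hand_labelled))
    accuracy = correct / len(hand_labelled)
    return accuracy < min_accuracy

# e.g. a monthly audit on a small hand-labelled batch
predicted     = [1, 0, 1, 1, 0, 1, 0, 0]
hand_labelled = [1, 0, 1, 0, 0, 1, 1, 0]
flag = needs_retraining(predicted, hand_labelled, min_accuracy=0.9)
```

Here the model gets 6 of 8 right (75% accuracy), so the audit would flag it for the data scientist to investigate.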
There is also that other part which is not a feedback loop in terms of model performance, but feedback on real-time performance metrics. For one of the things I mentioned, I think it was in orchestration: you may have three different models and want to optimize the routing between them in real time. That's another type of feedback. So in the API, in the SDK that we build, we actually have an endpoint called feedback that allows you to send things back. But yeah, the word feedback can mean so many things; on those two specific ones, that would be my thought. And just one thing more: you mentioned production monitoring, and you said that data scientists have to maintain a certain number of models in production. What would actually trigger manual action on a particular model? What are the KPIs? Prediction accuracy is one of them, but what would actually trigger a "yes, there's something wrong with this model and the data scientist needs to go and evaluate it from the ground up"? So I think it's not as explicit as the manual time coming only when things go wrong. The manual effort actually goes all the way back to the moment the data scientist says: my model is ready, I want to put it in production for business consumption. From that moment, the data scientist has to think, well, maybe I need to expose a RESTful API, so he or she needs to write the code to wrap it in a Flask server. Then they need to expose the endpoints, and the endpoints are quite custom and not standardized across all the models that other data scientists put in production. They need to assess how it's performing. If something goes wrong, the data scientist needs to jump in and assess why it went wrong. If it needs to be retrained, again, the data scientist needs to retrain it.
So it's a lot of little things that require not just manual input but also continuous thinking, because the responsibility for that model, even after it's ready, still falls on the data scientist. The idea is to push away from that. Once a model is done, it should become similar to a microservice to a certain extent: as a software engineer, you still have to jump in and debug it, but once the model is ready, it largely becomes a sysadmin or DevOps challenge. Then you can have hundreds of models under the same metrics, rather than individual people assessing their own things in production. And you have the same thing in software engineering when you deploy microservices: you want to avoid that and standardize it. Thank you. Awesome. We have two minutes left for questions. Do we have more questions for our speaker today? Yes. Do you have generic components to ensure that the confidence levels output by the models are calibrated in one way or another? So we don't have a standardized metric per se, but you are able to expose custom metrics. What is standardized is the way these metrics are collected: they're exposed through a metrics endpoint, collected by Prometheus, and then consumed by Grafana. Then it's very easy to set thresholds to get notified, so it is possible to set thresholds on any such standardized accuracy metric. But then again, when you say 90% accuracy, that may vary from use case to use case, and accuracy is often not the right measure, because sometimes a false positive has more impact than a false negative. So what we try to standardize is the way metrics come out and the way they can be evaluated, as opposed to which metrics should be evaluated. If that makes sense. Yeah.
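The metrics-endpoint arrangement just described boils down to rendering metric values in the Prometheus text exposition format so a scraper can collect them. A real deployment would use a client library; this hedged sketch just renders the exposition text by hand to show what Prometheus scrapes:

```python
# Sketch: render custom metrics in the Prometheus text exposition
# format, the payload a /metrics endpoint would serve for scraping.

def render_metrics(metrics: dict) -> str:
    lines = []
    for name, value in sorted(metrics.items()):
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

body = render_metrics({"model_accuracy": 0.92, "requests_total": 128})
```

Once Prometheus scrapes values like these, alert thresholds and Grafana dashboards follow without any per-model custom tooling.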
My question was more specifically on, for instance, classification, as in the example you gave: the model can output "I'm confident there's an 80% chance it's negative", but maybe that's not accurate. If you take 100 predictions and bin them by confidence level, you might see that the fraction of negatives in each bin doesn't actually reflect the confidence levels output by the model. And depending on the models you use, you might have different calibration issues. I was wondering if calibration is something generic that you could put in your pipeline, whether it's something requested by users, how to leverage calibration, or maybe it's not addressed yet. No, no, I think calibration is definitely one of the important things. I mean, we do have some open source work that exposes not only things like the multi-armed bandit but also techniques like outlier detection that you can use. We don't have a generic piece for calibration, but that's not because there's no demand; it's just because we don't have enough hands. So we'd love that. Again, it's open source: open an issue, and if we get enough thumbs up, we'll definitely prioritize it. And we actually have a bunch of examples; we'd love to have another Jupyter notebook example showcasing how you would do that. But that is definitely a good point, and it's a very interesting area in this space. Yeah, awesome, thank you. We have time maybe for one last question. Okay, if we don't have any further questions, let's have a very warm applause for Alejandro for his talk. Thank you.
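As an aside, the calibration check the questioner describes, binning predictions by stated confidence and comparing each bin's average confidence with the observed fraction of positives, can be sketched as follows. This is a generic reliability-diagram computation, not any specific library's implementation:

```python
# Sketch: bin predictions by confidence and compare each bin's mean
# confidence with the observed fraction of positive labels. A well-
# calibrated model has the two close together in every bin.

def reliability_bins(confidences, labels, n_bins=2):
    bins = [[] for _ in range(n_bins)]
    for conf, y in zip(confidences, labels):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, y))
    report = []
    for b in bins:
        if not b:
            report.append(None)  # no samples fell in this bin
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        frac_pos = sum(y for _, y in b) / len(b)
        report.append((avg_conf, frac_pos))
    return report

report = reliability_bins([0.2, 0.3, 0.8, 0.9], [0, 1, 1, 1], n_bins=2)
```

Here the low-confidence bin averages 0.25 confidence but contains 50% positives, a calibration gap the model's raw probabilities don't reveal on their own.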