Thank you. It's wonderful to be here in Madrid, and it's great to be here for the first Big Things. I love Big Data Spain, so it's good to be back.

No, I haven't actually managed sheep. I have spent some time managing people, though, and I was thinking of an old quote: you can't understand Virgil's Bucolics unless you've spent five years living as a shepherd. I thought, yes, if you want to understand engineering, you should spend five years leading a team. That became my Twitter bio, and it should probably be changed after today.

I'm Will Benton. I'm an engineer and an engineering manager at Red Hat, and today I'm going to talk about machine learning systems. Machine learning techniques have been employed to solve a lot of different problems in a lot of different domains, and they're more exciting than ever; we have not reached the peak of the current machine learning hype cycle. But going from an effective technique to a system that actually solves a business problem is really difficult, and there are a lot of ways you can get it wrong and end up with a worse result than if you hadn't used machine learning at all. In this talk we're going to look at some of those ways and sketch some solutions.

First, though, I want to talk about how machine learning practitioners solve problems, just to level set and give you a sense for how I see this space and how I want to talk about it. We'll do this in the context of a concrete problem. But before we get there, a word from our sponsors: if you're interested in machine learning workflows, my colleague Sophie Watson and I are giving a workshop tomorrow on exactly that. If you're a software developer who wants to do more with machine learning, you'll learn how to solve some of the problems I identify in this talk with contemporary software engineering techniques. If you're a machine learning practitioner who wants to use contemporary software engineering techniques to make your job more fun and less boring, you'll learn that too.

OK, so let's look at machine learning workflows in the context of a concrete problem that everyone understands: spam filtering. I won't be able to see your hands, but I'll be able to tell if someone is raising them. Who has looked at their spam folder lately? Yeah, neither have I; I can't see any hands. The thing about spam filtering is that it's what has made it possible to keep using email, and it's a very early commercially successful application of machine learning: we've been using machine learning for spam filtering for over twenty years. But if I wanted to solve this problem from scratch, how would I go about it?

The first thing I need to do is define what it means to succeed: can I turn whether or not I've succeeded at this problem into a number? This might be a number about how well my model performs. Does it correctly identify spam messages? Does it correctly identify legitimate messages?
It might instead be a number about how many people like to use my application once the spam filter is installed. It doesn't matter which, as long as I have some way to define success.

Once I've done that, I need to collect the data that my machine learning technique will learn from. In this case that means taking historical messages and going from raw messages to labeled messages in a regularized format, where a labeled message is a data structure that says either "this is spam" or "this is legitimate."

The next thing I'm going to do is engage in a process called feature engineering. This is basically trying to expose some structure in the data by mapping data to points in space. If I have a message, I want to map it to a point in space so that similar messages land at similar points, and ideally I want that mapping to expose something interesting about the data. In this case, if I look at some part of the space and see a bunch of legitimate messages clustered together, that indicates this is a problem I could solve with machine learning.

Once I have those features, I'm going to pass them into a model training algorithm, which is basically going to identify good trade-offs in summarizing those data. If you had to look at these things and develop a function that says whether a given feature vector, a given point in space, corresponds to a legitimate message or not, the model training algorithm will identify that function automatically by choosing good trade-offs that minimize some error metric.

Once I have a trained model, I'll want to test how well it performs on data it has never seen, because if I only look at how well it works on data it already knows the answers to, it's possible the model has just memorized that data and won't be good for anything in the real world. By checking it on data it hasn't seen, we can make sure it has generalized.

Finally, I have a model, and that's great, but we want to talk about how to put it into production and get value out of it. I'll need to deploy the model in a way that another application can use, often as a service with an API that I can call from other parts of my application; one way or another, we need it to be usable by the rest of the application. We also need to extend the process of validation, of tracking those model metrics, into production, to make sure the model is still performing well. Conventional software, when it breaks, breaks in an obvious way if we're lucky. Models may not break in such an obvious way, and we'll talk about that more later in the talk.

If you do a lot of software development, you might be saying that if you squint, this looks familiar: it looks a lot like a conventional software development workflow. There are a lot of stages. You go from thinking about the problem you're solving, to structuring data, to identifying how you want to use that structure to solve your problem, and then you're probably going to have to go back and revisit some decisions you made earlier when you realize the things you did didn't work out. And that arrow there is doing a lot of work, right?
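Just to make that experimental loop concrete before we move on, here's a minimal sketch of it in code. This is my own illustration, assuming scikit-learn and a toy labeled dataset; the feature technique and the choice of model are stand-ins, not recommendations from the talk.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Labeled messages in a regularized format: 1 = spam, 0 = legitimate.
messages = [
    "I am the attorney of a deceased dictator, send your bank details",
    "Claim your FREE casino bonus now",
    "Are you free for a bike ride on Saturday?",
    "Minutes from yesterday's team meeting attached",
]
labels = [1, 1, 0, 0]

# Feature engineering: map each message to a point in space so that
# similar messages end up near one another.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(messages)

# Hold out data the model never sees, so we can check that it generalizes
# rather than memorizes.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.5, stratify=labels, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
predictions = model.predict(X_test)

# "Define success as a number": does it catch spam, does it spare real mail?
print("precision:", precision_score(y_test, predictions, zero_division=0))
print("recall:   ", recall_score(y_test, predictions, zero_division=0))
```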
And what we really want to focus on in this talk is all the things that can go wrong in the ellipsis between those two blocks: between the validated model that's passing your experiments and the production service you can deploy, and the production system that depends on it.

So here's what we're going to talk about for the rest of the talk. We'll talk about what it looks like to have machine learning in production: what is a machine learning system, and how is it different from just a model? We'll talk about problems we can run into in our data cleaning pipeline, and problems we might run into with feature engineering. We'll talk about reproducing results, and how not having reproducible results makes it very difficult to publish models, and we'll look at some trade-offs you might have to consider when deciding how to publish a model. Finally, we'll look at the problem of data drift. That's where the data you see in the real world is no longer faithfully modeled by the data you trained your model on, and you have to figure out how to handle that; we'll look at some causes and see how to identify it. We'll close by looking at how contemporary software engineering practice can make a lot of these problems less critical to the overall success of your application.

Let's start by looking at how we put machine learning into production. We have this workflow, and the interesting thing is that while a lot of these stages are things a human does, a lot of them also correspond to software we want to run. The feature engineering part of our workflow turns into a feature extraction pipeline stage, where we take raw data and turn it into feature vectors. Our model training informs code that we run in production to train a model, and so on with validation and deployment: we obviously need some way to take the model we've trained and make it something another part of our application can interact with, and likewise for monitoring and validation. We can combine these components in different ways. I'll be using the gray gears to represent feature engineering, the orange gears to represent model training and tuning, and the cube for model deployment, and we'll see how these fit together in a couple of different ways.

One is our production training pipeline, where we take labeled data, data where we know the truth about whether it's spam or legitimate, extract features, and then train and validate a model. The input is raw data; the output is a model.
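Here is roughly what that production training pipeline might look like as code. It's a sketch only, assuming scikit-learn and joblib, with feature extraction and the model bundled into one pipeline object so the scoring side can reuse exactly the same feature code; how you actually load your labeled data is up to you.

```python
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

def train_spam_model(raw_messages, labels):
    """Raw labeled messages in, a persisted model artifact out."""
    # Bundle feature extraction and the model together so the scoring
    # pipeline uses exactly the same feature code as training did.
    pipeline = Pipeline([
        ("features", TfidfVectorizer()),
        ("model", LogisticRegression()),
    ])
    train_x, test_x, train_y, test_y = train_test_split(
        raw_messages, labels, test_size=0.25, stratify=labels)
    pipeline.fit(train_x, train_y)
    # Validate on messages the model has never seen before publishing it.
    print("held-out accuracy:", pipeline.score(test_x, test_y))
    joblib.dump(pipeline, "spam-model.joblib")  # the artifact we deploy
    return pipeline
```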
We also have a production scoring pipeline, where the input is raw data for which we don't know the answer, like the data we see in the real world. We want to turn that into feature vectors, pass those feature vectors into our model, and make a prediction. We're going to track metrics about the predictions we make, and, if we want to do repeatable experiments in the future, we're probably also going to archive what we saw and the predictions we made.

We're not just going to have this in isolation. Like I said, usually we want to wrap this up in a service, so you can imagine this part of your application living in a service with a REST interface or a gRPC interface or some other way for other application components to interact with it.

These pipelines are already more complicated than the model we train; the model is just the cube. But the systems we put them into production as part of are even more complicated than the pipelines. If you're trying to keep a social network or an email service free from spam or abuse in real time, if you're trying to decide on millisecond latency whether or not to make a trade with someone else's money, or whether a payments transaction is fraudulent, or if you're trying to decide which songs to recommend to a media customer before they skip the last song you recommended, you have a complex system. It's a hard software problem, and machine learning winds up being only a tiny part of that system, but it's a part that can make everything more difficult.
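As a side note, here's a minimal sketch of what wrapping the scoring pipeline in a service might look like. It assumes Flask and the joblib artifact from the training sketch above; the route and payload shape are illustrative choices of mine, not anything the talk prescribes.

```python
import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
pipeline = joblib.load("spam-model.joblib")  # feature extraction + model from training

@app.route("/score", methods=["POST"])
def score():
    message = request.get_json()["message"]
    is_spam = bool(pipeline.predict([message])[0])
    # A real system would also publish metrics about this prediction and
    # archive the input and the result so experiments stay repeatable.
    return jsonify({"spam": is_spam})

if __name__ == "__main__":
    app.run(port=8080)
```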
If this figure looks familiar, great, you can smile knowingly; if it doesn't, I'm about to introduce you to a great paper. I'm borrowing this figure, which details all the responsibilities of a machine learning system, from a paper called "Hidden Technical Debt in Machine Learning Systems"; D. Sculley is the lead author. The paper starts from the premise that machine learning techniques are easy to develop, but machine learning systems are hard to maintain, because they have all of these moving parts, and those parts are all interconnected by a more or less black-box machine learning model and by data pipelines that can be brittle.

So let's look at what can go wrong when we put these things into machine learning systems, and let's start at the beginning of the workflow, with our data cleaning and ingest pipelines: those first two stages, data collection and cleaning, and feature engineering. In a typical system you might be federating data from a database, from scale-out object storage, and from streaming data; we have three different data sources represented by three different colors of records here. Maybe you're going to join data from two of those sources to get a relation that combines information from both, maybe you'll join that with streaming data and get some result, and in the ideal case you'll be able to pass that collected data on to your feature extraction pipeline and get feature vectors, which are just vectors of numbers. But there are a lot of things that can go wrong in this part of the process.

Let's look at the first one by focusing on just one part of this pipeline, with a single data source. How many people in here have transformed a database table into a database table that is more like the one you want? A few of you; I can actually see a few hands this time, which is nice. I guess it really is true that no one checked their spam folder, huh?

So let's look at what happens inside this orange gear. A really common pattern with data cleaning is that there's some manual work involved. You figure out how to get part of the way to where you want to be, then you figure out how to get a little further, and you keep generating intermediate tables or intermediate relations or intermediate files; once you're happy with them, you work on the next step. So we might have something that looks like this: the original table, and then, through some iterative transformations, we finally get to where we want to be. If the source data are more or less static, we might just keep the final table, and maybe some of the intermediate ones, and write up some documentation: here's how I generated table four, which is what we ultimately want. You put together a README. The problem is that your teammates may not understand your documentation, or may not follow your instructions the way you intended. It's hard to write human-readable instructions that everyone will follow exactly as you intend; those of you who have children know this already.

So you might figure that human-readable instructions are error-prone because humans are imperfect, but the machines really have our best interests at heart, so machine-readable instructions are what you want. Maybe we capture each of these transformations in a script. Now, if you're a well-rounded engineer, you might say: I know a lot of different programming languages, I have the flexibility to choose the right tool for the job, so I'm going to do just that. I'm going to do some scale-out processing of my big data on the JVM, I'm going to run some good old SQL queries for the second stage, and then finally I'm going to use a general-purpose programming language that supports the statistical operations I care about to do some final clean-up before I have that last table. But these scripts are going to depend on one another, which makes them somewhat brittle. If the output format of one script changes, the table the next script consumes will have a different format than it expects, so you have to take care in what you generate, what you accept, and how you glue these scripts together.
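One way to make that hand-off between scripts less fragile is to check the shape of each intermediate table explicitly, so a change upstream fails loudly instead of silently producing garbage. This is a sketch of that idea, assuming the hand-off is a pandas DataFrame; the column names and dtypes here are made up for illustration.

```python
import pandas as pd

# The schema this stage of the cleaning pipeline expects to receive.
EXPECTED = {"message_id": "int64", "sender": "object", "body": "object"}

def check_schema(df: pd.DataFrame, expected: dict) -> pd.DataFrame:
    """Fail loudly if an intermediate table doesn't look the way we expect."""
    missing = set(expected) - set(df.columns)
    if missing:
        raise ValueError(f"intermediate table is missing columns: {missing}")
    for column, dtype in expected.items():
        if str(df[column].dtype) != dtype:
            raise ValueError(f"{column} is {df[column].dtype}, expected {dtype}")
    if df.empty:
        raise ValueError("intermediate table is empty; did the source data change?")
    return df

# Example hand-off between two cleaning stages:
# stage_two_input = check_schema(pd.read_parquet("stage_one.parquet"), EXPECTED)
```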
The other problem is that these scripts can break with any change to the data or to the environment in which they're run. If they're run infrequently, it's possible that the maintenance developer who has to fix them won't be as skilled and well-rounded a developer as you were when you wrote the code, and this is true even if that maintenance programmer is you in six or twelve or eighteen months, when you suddenly realize you're no longer as good at writing R as you were in grad school.

Since the source data are generally out of your control, a related problem is what happens if the source data change. Will your first script be able to handle it? Will your cleaning pipeline be able to handle it, or will it simply fail? If it simply fails, that's a bummer, but it's actually the best-case scenario, because the worst case is that it silently keeps doing the wrong thing. Instead of a failure, or instead of the relation we expect at the end of the process, maybe we get some garbage records, or far fewer records than we expect, or missing data. A special case of this problem shows up with joins: if the schema or the values change independently in the two tables to be joined, the results of the join can change dramatically. If the field you're joining on changes in one place but not the other, you might have far fewer results in your output, or your resulting relation might simply be wrong. So there are a lot of things to look out for in this first stage of data cleaning.

Once we get to feature engineering, though, we're not in the clear; there are a lot of things that can go wrong there as well. The first one is related to what we saw with data cleaning: if the format of the records we produce changes, our feature extraction approach may crash because it doesn't expect the values it's getting. But even if your feature extraction pipeline is resilient enough to handle changes in your data, the techniques you use matter too.

So let's look at this example of some very simple records. We have a color and we have a size, and we need to turn these into feature vectors. Who in here is a data scientist or a machine learning practitioner for their day job? OK, so you'll know some of what we're about to talk about, but you may not know all of it, and that's where we're going. Here I have some colors and some sizes, and I want to turn them into vectors of numbers. If you're a programmer of a certain age, you might say: well, I'm just going to look at these things in the order I run into them, because you can turn anything into an integer if you count it. So red is the first thing I've seen, and since I'm a programmer I'm going to start counting at zero; yellow is the second thing I've seen, so I'll call that one; and blue is the third thing I've seen, so I'll call that two. This may echo some programs you wrote that used the C preprocessor to provide enumerated types.
But what's wrong with just saying I'm going to turn all the reds into zeros, all the yellows into ones, and all the blues into twos? The point of feature engineering is to expose some structure in the data, and I'm not exposing anything useful by saying that red is zero, yellow is one, and blue is two, because I certainly don't mean that blue is one more than yellow, or that yellow is one more than red.

The approach I can take instead is something called one-hot encoding. Basically, this takes a field that has n possible values and turns it into an n-bit vector where exactly one bit is set. In this case, red is the first bit, yellow is the second, and blue is the third, so the rows that are red become one-zero-zero, the rows that are yellow become zero-one-zero, and so on. This works really well, but it makes your feature extraction pipeline brittle in a couple of ways. One is that you actually have to keep track of that dictionary: which value is bit zero, which is bit one, which is bit two? The other is: what happens if you run into a color you've never seen before in some new data? There are ways to deal with this, but you have to worry about it, or else your whole pipeline breaks, or just produces less useful values when it happens. Who here already knew about one-hot encoding before today? Anyone? OK, so I want to get a show of hands if this next thing surprises you.

Even if you use this technique, things can still go wrong, and it depends on the whole machine learning system. Here's an example. We can do one-hot encoding in one of two ways. We can do it the way we just saw, where three possible values get three bits. Or we could say that with three possible values we can also use the encoding where no bits are set, so we only need two bits: on the right-hand side, red is two zeros, yellow is one-zero, and blue is zero-one. If it's yellow or blue, we have one bit set; if it's red, we have no bits set. We can do the encoding either way, but does it matter which one we choose? It might, and it depends on what kind of model you're trying to train. If I'm training a linear model, if I'm saying I have this space of features and I want to identify a division between them, I actually want the approach with n minus one bits, where three values get two bits, because I don't want to build bias into these vectors, and if I have a lot of features encoded the n-bit way, I will be putting bias into the vectors. But if I'm training a different kind of model, I want a different encoding. If I'm training a decision tree or a random forest or a tree ensemble, I want the other approach, because decision trees can only split, can only make decisions, based on one question at a time. It's like playing twenty questions: you ask a question, but you can't ask a question with a conjunction in it, because that's two questions. I can say I'm splitting here if the third bit is zero, or if the second bit is one, but I can't say I want to split when the first two bits are both zero, because that's just not how the tree training algorithm works. So this is a case where the approach you take in feature extraction influences how well your model works, and it's a tight dependency between those parts of the pipeline; in general software development, you try to avoid that sort of thing.
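Here are those encoding choices as code, a sketch assuming pandas; the toy records mirror the color example from the slides.

```python
import pandas as pd

records = pd.DataFrame({"color": ["red", "yellow", "blue", "red"]})

# Naive approach: map each value to an arbitrary small integer.  This
# imposes an ordering (blue "is one more than" yellow) that the data
# doesn't actually have.
records["color_code"] = records["color"].astype("category").cat.codes

# One-hot encoding with n bits for n values: often the right choice for
# trees and tree ensembles, which split on one question at a time.
one_hot = pd.get_dummies(records["color"], prefix="color")

# One-hot with n-1 bits (one value becomes "all zeros"): often preferred
# for linear models, so we're not building a bias into the feature vectors.
one_hot_dropped = pd.get_dummies(records["color"], prefix="color", drop_first=True)

print(one_hot)
print(one_hot_dropped)
```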
Going back to our simple records, there are a couple of other kinds of problems you could run into that really get worse over time in a machine learning system. We saw what happens when you see a color you've never seen before, but what about these sizes? If you're being responsible with numeric features like this, you're going to standardize them: you'll convert them to a z-score, remapping them so the numbers aren't all of wildly different magnitudes. The issue is that the remapping is based on the distribution of the numbers I saw when I set up the scaling, when I set up the standardization. If the distribution of things I see in the real world changes a lot over time, I might see a lot of outliers in the future, and that could be a problem.

We also have a bonus problem, which is this word "size." What is a size? Define size in one sentence. It could be a length, it could be an area, it could be a volume. It's probably meters, but when I go back to the United States it might be feet or miles or inches. We don't know, and we're sort of asking for trouble by having data with a value where we don't know what kind of value it is; we just know what we want to call it. So we want to be careful with these pipelines, because if one part of the pipeline says it has a size in meters and another part has cubic meters, I could really get into trouble by combining them irresponsibly.
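A quick sketch of the standardization point, assuming scikit-learn and made-up numbers: the scaler's parameters are frozen at training time, so if the real-world distribution drifts, the same transformation starts producing extreme z-scores.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# The "size" values we saw when we set up the standardization.
train_sizes = np.array([[1.2], [0.8], [15.0], [2.5]])
scaler = StandardScaler().fit(train_sizes)

# The values we see later, after the real-world distribution has shifted.
new_sizes = np.array([[1.0], [250.0]])
print(scaler.transform(new_sizes))  # the 250.0 shows up as a huge outlier
```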
OK, so we've thought about some things that can trip us up with feature engineering. Now let's think about how model publishing is tricky in itself, starting with reproducibility.

How many people in here have used Jupyter notebooks before? A few. A Jupyter notebook is a literate programming environment where you combine prose and code, and you can interact with it: you edit the code, you run the code, you see what it does, you edit it again, you run it again, and see what it does. It's a really great, powerful environment for experimenting with new techniques. The issue is that it's really like getting a snapshot of someone's physical notebook: you have no way of knowing whether the document someone gives you will actually produce the results you expect. I might have a notebook that works for me, but it might not work for you if I hand it over. Fortunately, there's a way to solve this, which is that I no longer give you a notebook. Instead, I put my notebook in source control and refer you to a website called mybinder.org, a service that will take the source control repository with my notebook in it and publish a service in a controlled environment, so that you'll always be able to reproduce the results I had in that notebook.

Being able to reproduce the actual exploratory work we do is an important part of the problem, but we're going to look at an even more interesting part: once I have this model service, what do I really know about where that model came from, and how can I reproduce it? The classical workflow is that you have a data scientist working in interactive notebooks and communicating with an application developer who works in an IDE. The application developer is reimplementing the code; there's some dialogue going on, maybe not as productive as we want, and the feedback loop is not as tight as we want it to be. This is a lot of manual effort, and it takes a long time to put things into production this way; real organizations see it take up to eighteen months from a successful experiment to a model in production.

So we want to make this easier. How? We'll try to make it easier with software, because software solves all the problems it doesn't cause, right? One option: I have some job that trains a model, and I might have tooling that takes that model file and puts it in object storage in a distinguished location so I can refer to it later, and then publishes it as a service; I might also have tooling that just directly publishes the model as a service I can use again in the future. Another option is that I have my notebook environment, I run my experiment, I see it work, and I have a way to fire off a build in a controlled environment from my notebook, with the code and the techniques I developed in the notebook. I might even be able to fire off a bunch of different experiments with different parameters, see which one works best, and choose that one to publish as a service. That's a somewhat more attractive technique, and it's the one that projects like Kubeflow Fairing use. A third approach is to say: well, I don't trust the environment that someone is actually running that notebook in, and I want to run everything in a controlled environment. So once I have a model I like in my notebook, I extract the code that trained it and run that in a controlled environment, producing a model and publishing it as a service, so that everything happens in a controlled environment.
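Here's a sketch of the first option, tooling that takes the model file and puts it in object storage in a distinguished location. It assumes boto3 talking to any S3-compatible store; the bucket name and key layout are hypothetical choices of mine.

```python
import time

import boto3
import joblib

def publish_model(pipeline, bucket="models", name="spam-filter"):
    """Serialize a trained pipeline and push it to object storage."""
    artifact = f"/tmp/{name}.joblib"
    joblib.dump(pipeline, artifact)
    # Version the key so experiments stay reproducible and an older model
    # can be promoted back into service if a new one misbehaves.
    key = f"{name}/{int(time.time())}/model.joblib"
    boto3.client("s3").upload_file(artifact, bucket, key)
    return key
```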
So there are a few trade-offs I want you to think about when you're deciding how to publish a model. The first one is: who is automating this? Are you relying on a human to do the right things in a multi-step workflow, or do you have some kind of automation, with software taking that responsibility away from the human? The second question is: are you saving something to a file, or are you producing a service? There are trade-offs there, and there are valuable things about both options, but what you want to do will depend on what your ultimate goal is. And finally, are you choosing something fast and flexible for production that might not give you totally reproducible results, or are you choosing something that will give you the same results every time at the cost of a little additional flexibility? Considering these trade-offs will make it easier to choose which of the many open source model publishing frameworks you want to use.

OK, so let's talk about data drift. This is something that can still go wrong once we're in production; we're not out of the woods. Like I said before, conventional software, if we're lucky, breaks in obvious ways: it crashes, it doesn't compile, it fails unit tests, it starts getting really slow. Machine learning models don't fail in obvious ways, and we'll see why, going back to our spam filtering example.

Here's a message that we're going to classify. Does this look like spam or a legitimate message? I have someone presenting himself as an attorney to a deceased dictator, asking me for personal information so that he can transfer a large sum of United States dollars to my bank account in exchange for me helping him launder the money. This is a really common genre of spam. Any surprises? I think not; it looks like spam. Here's another message: someone inviting me out for a bike ride. This is probably not spam. It doesn't look like spam; it's not asking for my bank account number or anything.

What about this one? Does it look like spam? It's a much, much older head of state, also deceased, with some treasure hoard to distribute, but it's still in the same vein as the spam we've seen before. If most of our spam is people offering us money in exchange for helping launder it for deceased heads of state, then we have a pretty good idea of what spam looks like. But what if we get a message that looks like this? Is that spam? Well, my model didn't think it was spam, because none of the spam it had seen before looked like this. Maybe you've seen this in real life: maybe you get a lot of messages about deceased dictators who really need money laundered in the United States or Spain or wherever, and maybe you get a lot of messages about casinos, but it seems like you only get them for a while, and then the spam filter adapts, and then the spammers adapt and change things too. In this case we can be pretty sure that "free online casino with best odds" is going to be spam, but we don't know why it was misclassified. Was it a problem with the feature extraction code? Was it a problem with the model? Was it a problem with the training data we had? We don't know; we just know that there's a problem, and we know by looking at it that the answer should have been different.
So how can we address these problems? How can we track things and identify them before they cause business problems for us? One lesson of faster feedback in programming languages, which we especially see in interactive environments like Jupyter notebooks, is that you don't need to be perfect in advance if you can find out right away that something is wrong. If we make the way these machine learning systems behave easier to observe, especially as they get more complicated, with many data sources, multiple pipelines running at once, and ensembles of models cooperating to solve a problem, we can find out when things go wrong before they impact the overall system. If we publish metrics at each stage of the pipeline, we can tell when we're seeing words in emails that we've never seen before, and we know that maybe we have to go back and get more training data. Or we can tell when we've seen a color in a record that we've never seen before, so our predictions are going to be garbage if they depend on that color. We can identify these things as they happen and go back and fix the pipeline, or at least identify that something is wrong.

A special case of that is when the model itself is misbehaving. Say we have a model in production that performed more or less adequately in our experiments, but its performance decays over time because the data we're seeing shift: we're getting different kinds of spam, we're getting more spam, whatever. That performance decays, and we don't have a good way to know it, because in a lot of applications you're making a prediction without ever learning the truth, so you can't say whether you were right or wrong. But what you can do is collect other metrics that are proxies for truth. What does the distribution of input data look like? What does the distribution of predictions look like? If I expect that 90% of my email is spam and all of a sudden I'm saying that 80% of the emails I'm scoring are spam, then I know I'm probably letting a bunch of spam through. I could even look at what it looks like to score the model: how does this model behave interacting with my computer hardware? By tracking metrics like these, we can get an idea of when things are going wrong and have someone go back and look at the pipeline, using the metrics we're collecting elsewhere to debug what's going on. We don't just have a black box anymore; we also have documentation of how we got there. And by tracking metrics like this, I can have an ensemble of models that solve the same problem and automatically promote the ones that seem to be performing better, which is a really cool feature.
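As a sketch of what those proxy metrics might look like in code: we usually can't check predictions against the truth in production, but we can watch the distribution of inputs and predictions and flag when they move. The thresholds and the vocabulary check here are illustrative assumptions, not a recipe from the talk.

```python
def monitor_batch(messages, predictions, training_vocabulary, expected_spam_rate=0.9):
    """Flag drift using proxies for truth: prediction and input distributions."""
    spam_rate = sum(predictions) / max(len(predictions), 1)
    if abs(spam_rate - expected_spam_rate) > 0.05:
        print(f"warning: spam rate {spam_rate:.2f} is far from the expected "
              f"{expected_spam_rate:.2f}; the model may be letting spam through")

    # How much of what we're seeing did the model never see at training time?
    tokens = {word for message in messages for word in message.lower().split()}
    unseen = len(tokens - training_vocabulary) / max(len(tokens), 1)
    if unseen > 0.2:
        print(f"warning: {unseen:.0%} of tokens were never seen in training; "
              "maybe it's time to collect new training data")
```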
We want everything to be reproducible and repeatable, and a really great feature of contemporary container platforms like Kubernetes is that you can provide a declarative specification for how to run applications, whether your application is a training pipeline, an inference pipeline, or even an entire system consisting of data sources, ingest pipelines, training pipelines, inference services, and the application that depends on them. With a declarative specification you don't have to say how to get there; you just say where you want to be, and the platform will maintain it in working order in any environment you want to run it in, whether that's your computer, your data center, or the public cloud. Finally, continuous integration and continuous deployment, which we've used for software development for some time, really make machine learning easier and more repeatable, and if you want to see how that works, I'll again refer you to the workshop tomorrow afternoon that my colleague Sophie Watson and I are delivering: how to use software engineering practice if you're a machine learning practitioner, and how to use machine learning workflows if you're a software developer.

OK, so what have we talked about today? We looked at how things can go wrong. We started with problems in data cleaning and ingest pipelines. We looked at some of the ways that feature extraction can be confusing and cause problems. We looked at model publishing, from a stone-age approach where we rely on actual communication between humans, to more advanced approaches that use software to automate these things, and we looked at the trade-offs involved in the different approaches. Finally, we saw how monitoring, observability, and some of the other features we have in contemporary software platforms make it a lot easier to manage these systems, in all their complexity, at scale.

Again, thanks so much. It's super great to be here at Big Things, and I really appreciate your time and attention this late in the afternoon, especially with the beer at lunch. I'm available on Twitter and GitHub as @willb, I do answer email, and I have a personal blog that's updated rarely, but the last post on there is about publishing models automatically. And again, I'll remind you that the workshop is an excellent use of two hours of your day tomorrow. Thanks so much, and I'm happy to take questions offline.