So our next speaker is Theodor, who will be guiding us through all the things you might need to know if you're doing machine learning. Take it away.

Thank you. My name is Theodor, I'm a researcher at the Swedish Institute of Computer Science, and I work on large-scale machine learning systems. The goal of today's presentation is to lay out some guidelines for people looking to put their first machine learning project into production. The idea is that you already have a product, you have some data gathered, and you would like to use machine learning to improve some part of it. One thing I should clarify from the beginning is that almost none of what I will present here was written by me; it comes from other, much more experienced people. What I've done is gather it all together and put it in context to make it more useful. All the sources I have used are of course included at the end of the presentation.

So we'll jump straight into the content and see how we move from an idea to machine learning that we can actually put into production. What I'll try to do is present a simple framework that we can use to think about most machine learning problems. One thing that I have found very useful as a programmer is to create interfaces for things: we find the things that all our problems have in common, and then we use that language to describe what we want to do. We can do the same for machine learning problems and try to figure out what they have in common.

The first thing they have in common, and the reason we're using machine learning in the first place, is that we want to describe the behavior of a very complex system. Because the behavior is so complicated, it's not possible for us to sit down and actually write a program for it; we need some kind of algorithm that will learn it. And to do that, we need a way to describe the system that is concise and understandable. That is what models do for us: they allow us to reason about this complex behavior in a mathematical language.

The second thing that all machine learning problems have in common is data. Almost all of the models we're going to use are statistical models, which means we make assumptions about how the world behaves, and then we use the data, the observations we have, to narrow those assumptions down into something much more specific. So we always need data to train our algorithms.

We also need to be able to tell how well our model emulates that behavior. So we need some way to estimate the quality of the model, something that can guide us when we optimize it. That is our objective function: a measure of the quality of our model. For example, if we're trying to predict the temperature tomorrow, an objective function could be the squared difference between the actual temperature and the temperature we predicted.

And finally, we might have some knowledge and desires about the model. For example, maybe our data set has 10,000 features, but our assumption, or our desire, is that only a few of them are actually relevant.
So we might use something like regularization, or our prior knowledge about the problem, to bring down the number of features we actually rely on. And if we combine all of these, we get our machine learning program. A machine learning program is essentially a function that takes the model we have and the data we have and provides an answer, and to optimize this function we use the objective function we defined. This is usually done in an iterative manner: we take the model we have and update it by looking at the data, again and again, until we are satisfied with the quality of the model. These are the basic primitives we can use to describe almost any machine learning problem.

We can see how this works in a specific example. Let's say we are Twitter and we observe that our users are leaving the platform and never returning. After some research, we discover that users who leave usually did not engage with the platform, and the solution we decide on is to increase engagement by getting users to retweet more. This is where the engineering decisions start; everything up to this point is something that, for example, management could decide.

The engineering decision we can make is to take the interface we created before and fill in the blanks to solve our problem. First come the common components, and we start, as always, with the data. The data we have is features about the users themselves, like the user's profile, and features about the tweets: the words, the content, the hashtags, everything. Then we have the labels, which record what we have observed: did this tweet get retweeted by a specific user? That is the y in our case, what we're trying to predict.

Next up is our model. We said that we want to find tweets that are more likely to be retweeted, so ideally we want a probability as the output of our model. Logistic regression is a very good first model to try here, because it's a well-studied classification algorithm that gives us probabilistic output. In this particular case, the model itself also gives us the objective function, so we take that as a given. We should note that with more flexible algorithms, we can actually sit down and design our own objective function to fit our problem much better. And finally, we choose which algorithm we're going to use to optimize the objective function.

One thing I should note is that you're pretty much done with the algorithm design at this point, but there are a bunch of other problems that might come up as you go from the data all the way down to the algorithm. What I presented here is very simplified, and a lot can go wrong if we're not careful. We're going to talk a bit more about these problems now.

So data is the common thing that underpins all machine learning problems, and it's perhaps the most important part of your pipeline. Data can come with a myriad of problems: measurement errors, privacy issues, even errors spread into the data by the functions that process it. And in machine learning, as in most disciplines, the output you get from an algorithm is only going to be as good as the input you give it.
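Before moving on to data quality, here is a minimal sketch of how the retweet example above fits together in scikit-learn: data, a logistic regression model, its objective (log loss), and the optimizer hidden inside fit(). The file and column names are made up, and this is only an illustration of the interface, not a production recipe.

```python
# Minimal sketch of the retweet-prediction setup (hypothetical file and column
# names; assumes pandas and scikit-learn are installed).
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss

# Data: one row per (user, tweet) pair, with user and tweet features and a
# binary label "retweeted" (1 if the user retweeted the tweet, 0 otherwise).
df = pd.read_csv("user_tweet_pairs.csv")
X = df[["follower_count", "account_age_days", "tweet_length", "num_hashtags"]]
y = df["retweeted"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Model: logistic regression gives us probabilistic output. Its objective (log
# loss) comes with the model, and the regularization strength C encodes our
# prior preference for small weights.
model = LogisticRegression(C=1.0, max_iter=1000)
model.fit(X_train, y_train)

# Objective function evaluated on held-out data.
probs = model.predict_proba(X_test)[:, 1]
print("held-out log loss:", log_loss(y_test, probs))
```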
What we want to do here is find a way to quantify the quality of our data in a principled manner. Neil Lawrence is a researcher from the University of Sheffield who works a lot with medicine, which has notoriously bad data sets, and he recently presented the idea of data readiness levels, which I'll try to explain here. The aim is to make it easier to reason about how appropriate the data set we currently have is for learning. What Lawrence suggests is that we create different levels of readiness for our data set and move from one level to the next only when we are sure that our data fulfils certain quality criteria. We can start with three bands for these levels, say C, B, and A, and maybe create sub-levels within each one.

The lowest band is band C, and it's about the accessibility of our data. When starting out at this level, we might have what we could call hearsay data: data that somebody has told you exists, but that you're not actually sure is there. This is very common when starting a new machine learning project, where you might hear from other teams things like "yeah, we should have that" or "we've been logging that for years". But until you actually sit down, extract a data set, and look at it, there's no way to be sure. To graduate your data set from this band, you need to ensure that the data exists, find out what format it comes in, check whether there are any privacy or legal concerns in using it, and clear anything else that could make it difficult to actually obtain. At the end of this band, which we can call level C1, we have cleared all these obstacles and the data set is ready to be loaded into analysis software.

At band B, we are concerned with the faithfulness of the data. Did the data actually record the correct thing? What is the level of error in the measurements? Did any sampling occur, and how were missing values treated? All of these things are very important. To graduate from this band, we need to be fully aware of how faithfully our data represents what we originally wanted to record.

Band A, the last band, is the first level at which we can actually ask how appropriate the data is for the question we're trying to answer. This is the first time we ask: can we use this data set to predict ad clicks from users, or the time to failure of a specific component? Here we might discover that we need additional data sets, or human annotation, or that we need to iterate through the whole pipeline again. So this is an iterative process: every time you discover that you need more data, you go through the whole thing again and make sure that the data set you end up with actually fulfils all the quality criteria.

The idea with these levels is to provide a common language, so teams can communicate their data readiness and we can ask and answer concrete questions about the state of the data. Trying to skip any of these steps will almost always lead to problems. And as a final warning: do not underestimate the time and effort required to bring your data through bands C and B. They are perhaps not the most exciting bands, but they are just as important as the rest.
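This is not Lawrence's formal framework, but as a rough illustration, some of the band C and band B questions can be turned directly into a few lines of code. The file and column names here are made up.

```python
# Rough sketch of turning some band C / band B questions into checks
# (hypothetical file and column names; pandas assumed).
import pandas as pd

# Band C1: the data actually exists and loads into analysis software.
df = pd.read_parquet("user_events.parquet")
print(df.dtypes)                                       # what format does each field really come in?

# Band B: faithfulness -- missing values, duplicates, suspicious ranges.
print(df.isna().mean().sort_values(ascending=False))   # share of missing values per column
print("duplicate rows:", df.duplicated().sum())
print(df["session_length_seconds"].describe())         # obvious unit or measurement problems?
```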
So next, after you've selected your objective function and ensured that your data is in a good state to learn from, comes the selection of your algorithm and your software. I've labelled these "easy choices" here, because I actually believe the other parts of your pipeline matter much more for the success of your launch.

When I was first thinking about this presentation, I had two images in mind. The first one is from scikit-learn, a kind of cheat sheet to guide people towards a specific algorithm. You can already see that there's a multitude of algorithmic choices, but not too many; you could call this one a farm rather than a zoo, since there aren't that many animals in it. The second image is from the Asimov Institute, illustrating some of the most popular neural network architectures. My point with these two images is that already at the level of choosing an algorithm there is a staggering number of choices to make, and the reality is that for your first launch this is mostly unnecessary. For your first launch, what you should be focusing on is simplicity, and there are a lot of good reasons for that, which we'll see next.

If you've read a couple of things about machine learning, you've probably come across this suggestion already: picking the simplest model possible is often motivated theoretically, by things like Occam's razor or the higher chance of overfitting with a complex model. What I would like to point out is that there are also very tangible engineering benefits to using a simpler model.

First, the initial model you deploy is mostly about getting the infrastructure right. When you deploy your model, you already have to deal with serving your predictions, making sure the data is fed correctly into the algorithm, and getting the predictions out and delivered to the user. There's a lot of complexity there even before you deal with the algorithm itself. If you add algorithmic complexity on top of that, and have to figure out why the algorithm gave the reply it did, you're going to have a bad time. A recent Google article actually suggests that you aim for your first launch to be neutral: you just aim to get the thing out there, make sure it doesn't break anything, and focus on gains later.

Second, simpler models are usually interpretable. If you run a linear regression, every weight in your model actually means something, and that becomes very useful when you try to debug the algorithm, look at the predictions it made, and explain why it made them. All of these things are very important when you're starting out, so use simpler models precisely because they are interpretable. That is much, much harder to do with a neural network with a million weights; it's basically impossible.

And third, the use of complex models erodes boundaries. What do we mean by that? In software engineering we use concepts like abstraction and encapsulation to isolate different parts of the code, so they don't affect each other when we make changes. But in machine learning we very often mix signals: we have features that interact with each other, and this is in the nature of the algorithms themselves.
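To go back to the interpretability point for a moment, here is a tiny sketch of what "every weight means something" looks like in practice. It assumes the model and feature matrix from the earlier hypothetical retweet sketch.

```python
# Minimal sketch: inspecting the weights of a simple (generalized) linear model.
# Assumes `model` and `X` from the earlier logistic regression sketch.
import pandas as pd

weights = pd.Series(model.coef_[0], index=X.columns).sort_values()
print(weights)
# With unscaled inputs, each weight is the change in the log-odds of a retweet
# for a one-unit change in that feature (others held fixed), so a surprising
# sign or magnitude points directly at a feature worth debugging.
```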
That mixing of signals leads to a principle called CACE: changing anything changes everything. It applies not only to the features but also to the hyperparameters and the sampling process; pretty much every knob you can tweak in your machine learning pipeline will affect everything else in it. So by making every single part of the pipeline as simple as possible, you're making your life as an engineer much easier. Of course there are plenty of other things that can affect your choice of algorithm, but for people who are starting out, going for the simplest option possible is a very good piece of advice.

With that, we can move on to software, and I actually believe this is an even less important decision compared to the rest of the pipeline, so I only have a single slide on it. These are some of the most popular open source machine learning libraries on GitHub. Almost all of them have more than 1,000 stars, which means all of them are popular in their own way. My original plan was to pick a few of these and talk about them in more detail, but I think it's better to point out that by now machine learning software has become a commodity, and there is very little differentiation between the top choices. So for people starting out, I would suggest you just pick one that you are comfortable with, maybe something your team has already worked with, and focus on the other parts of the pipeline that will have a much bigger role in the success of your project.

With that, I'd like to move on to a more neglected part of machine learning: what happens when your model comes in touch with the world. This is something you won't find a lot of research on, and everyone seems to come up with their own solution, so what I would like to do in this section is point out problems that are common when deploying a machine learning model.

First, I'd like to note the expectation versus the reality of having a machine learning system in production. In an ideal world, the academic setting, we have data sets that are clean and standardized, we develop a model, we test it on some benchmark data set, and we're done. The problem comes when you actually deploy the machine learning model in production and it needs to interact with the real world, because then it ends up looking much more like this: to have a running machine learning system, you need a large number of components around it, each with its own complexities. In a recent paper from Google, the authors mention that in a mature running system, often only about 5% of the code is actual machine learning logic, and the other 95% is all the plumbing required to make the whole thing work.

So what are some common pitfalls when deploying machine learning programs in a complex setting like this one? First, we almost always have data dependencies. These are similar to the code dependencies you would have in a software project, but they're even harder to deal with, and they are to some extent unavoidable in machine learning, because at every point in a machine learning program we need our data set, and we usually pass it through a complex data processing pipeline to prepare it for learning.
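One small habit that helps keep those processing steps visible, and reusable between training and serving, is to package them together with the model. A minimal sketch with scikit-learn, purely as an illustration:

```python
# Minimal sketch: keeping the data processing steps and the model in one
# object, so the exact same transformations run in training and in serving.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),   # how missing values are treated
    ("scale", StandardScaler()),                    # units / scale of each feature
    ("model", LogisticRegression(max_iter=1000)),
])

# pipeline.fit(X_train, y_train) at training time,
# pipeline.predict_proba(X_new) at serving time -- one artifact, one code path.
```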
These data dependencies can create a bunch of problems. A common one is that the source data is unstable: it can change distribution, or it can change in even more dramatic ways. A typical example is when one team owns the data pipeline and a different team does the learning. Say the data team starts measuring the time users spend on the website in seconds, because that's what they need for their own purposes, and you, as the machine learning team, take that feature and use it for recommendations. Three months later the data team decides they want more accuracy and starts measuring the time in milliseconds. Now, if the data team doesn't have the infrastructure to detect all the consumers of their data set, and the machine learning team doesn't have the monitoring to detect the change in the distribution of the data, the model will keep working and quietly start producing bogus predictions. Solutions to this include very strict ACLs on your data and very good monitoring of your pipeline, but what I actually prefer is for teams to have full ownership of the pipeline, from serving predictions all the way down to creating the data set.

A second, related problem is feedback loops. Feedback loops can be direct, which are easier to deal with, or indirect. In a direct feedback loop, the model affects its own training set. For example, if your model ranks items in a list, the items it puts towards the top of the list will always get clicked more often, and the algorithm will then believe those items are even more likely to be clicked. One solution is to simply exclude from training everything whose placement was decided by the algorithm, but that would be very bad, because you would be throwing away a large part of your data set. A better idea is to include the position at which your algorithm showed the item as one of its features. If you include that as a feature, the algorithm itself has to figure out how much of the effect is due to the position the item was shown in, and that can help you quite a lot without much extra work.

Indirect feedback loops are much harder to deal with. For example, Netflix uses a learning system to decide which cover image each user sees for each item, it uses a learning system to figure out which cover works best for each item, and of course it has a recommendation system that decides to show the item in the first place. So you have to think about what happens when a good recommendation gets a bad cover. If Netflix is not careful in the way they implement these algorithms, it's very likely that one system will start influencing the output of the other system and the data set it's trained on, and that can become a hidden feedback loop that is very hard to detect.
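To make the position-as-a-feature idea from the direct feedback loop case concrete, here is a rough sketch with made-up file and column names: during training, the position at which an item was actually shown is included as a feature; at prediction time every candidate gets the same neutral position, so the learned position effect does not leak into the new ranking.

```python
# Sketch of the positional-feature trick for direct feedback loops
# (hypothetical file and column names; pandas and scikit-learn assumed).
import pandas as pd
from sklearn.linear_model import LogisticRegression

train = pd.read_csv("impressions.csv")          # one row per (user, item) impression
features = ["item_score", "user_activity", "shown_position"]
model = LogisticRegression(max_iter=1000)
model.fit(train[features], train["clicked"])    # position explains part of the clicks

candidates = pd.read_csv("candidates.csv")
candidates["shown_position"] = 1                # same neutral position for every candidate
candidates["p_click"] = model.predict_proba(candidates[features])[:, 1]
ranking = candidates.sort_values("p_click", ascending=False)
```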
So with that, I would like to more or less conclude. How do we bring all of this together? I think the key takeaways are these: you can use a common interface to define most machine learning problems, and that is very useful when starting out; you should determine the readiness of your data before you start learning, and keep monitoring that readiness at all times; you shouldn't spend too much time worrying at first about the selection of algorithms or software, because it's not as important as it may seem; and finally, you should worry much more about what will happen when your model comes in touch with the world, once you put it out in the wild. And with that, I thank you. Now for the questions.

Thank you. One of the things I'm experiencing where I work, and if anybody has any ideas or solutions or wants to chat about this you're very welcome, is that we've built a pipeline to ingest data, train models, run models, and optimize, and it's quite fragile, to put it politely. We've also got some monitoring, where we can get the AUC and other aspects of how the models are doing in production, but again, that's also quite fragile. And one thing that is quite contentious is that the business is really excited about this: they say, great, this works, let's build some more features, while we're sitting on very fragile infrastructure because we've been trying to get things out as quickly as possible. So does anyone know of good techniques to reduce that technical debt, or convincing arguments to get the business to invest more time in really tightening up the foundations?

Could you repeat the question, just to start? OK, so the question is about how to deal with fragile infrastructure and convince the business to pay down technical debt. I think the main thing is that as you move fast in a new company, you always accumulate more and more technical debt, and in machine learning, as anywhere else, you have to pay it off at some point. You cannot just keep piling on more and more fragile things, because at some point the whole thing is going to break. So it's actually a very good idea to stop at some point and do what we earlier called a neutral launch. It would be much better to say: OK, we stop worrying about the model, and the new thing we're going to deploy is new infrastructure, a simpler system where we can make sure that everything around the model works. The model might be a prediction that is, I don't know, the mean of everything; that's fine, that's our model now. It's very simple and we know exactly what it will do, so now we can concentrate on the infrastructure around it. It's much easier to develop one thing at a time; if you try to change the infrastructure at the same time as your algorithms, that's not going to work. So I think it's a very good idea to take one sprint, or whatever system you're using, and say: we're going to do a new model deployment, keep the model as simple as possible, and build the infrastructure around it. It's worth investing that time and effort, because it's going to pay off in the future.
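As a concrete illustration of that "mean of everything" baseline, here is a minimal sketch using scikit-learn's DummyRegressor. The data is a toy stand-in, purely for illustration.

```python
# Minimal sketch of a "neutral" first deployment: the model is just the mean
# of the training labels, so all the risk lives in the surrounding
# infrastructure rather than in the model itself. (Toy data for illustration.)
import numpy as np
from sklearn.dummy import DummyRegressor

X_train = np.random.rand(100, 4)        # whatever features the pipeline produces
y_train = np.random.rand(100)           # e.g. time on site, revenue, ...

baseline = DummyRegressor(strategy="mean")
baseline.fit(X_train, y_train)          # "training" learns a single number
print(baseline.predict(X_train[:3]))    # the same constant for every input

# Once ingestion, serving, and monitoring work around this trivial model,
# it can be swapped for a real one without touching the infrastructure.
```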
Yep. So the next question is about building your own model versus using an existing one. Are you talking about the theoretical model itself, or about using your own implementation of logistic regression instead of, say, the Spark one? OK, you're talking about building your own. What I've found from speaking to people at smaller and larger companies is that the bigger companies almost always end up building their own, because there is no open source infrastructure out there that actually works at the scale that companies with millions of clients need. I've talked with Criteo and Spotify, among others, and all of them ended up building their own, because there are so many moving parts that something always ends up breaking. For the smaller companies, I think it's fine to start with whatever they want, but they should be aware that at some point they will probably have to roll their own everything, basically.

The next question is whether it's a good idea to use different programming languages for training the model and for serving it. Ideally, you use as much of the same code for serving the model in production as you do for training it. Changing languages, changing the infrastructure, changing anything just adds a lot more complexity. One of the articles I've linked at the end, which comes from Google, makes a specific point of this: increase the code reuse between these two parts of the pipeline as much as possible. It also points out that the difference in the quality of your model between training time and serving time is one of the most important indicators you have. There needs to be very active monitoring of the gap between the error you get in training and the error you get in actual production, because it can indicate a lot of problems.

Okay, last question, and make it good. Can you speak up a little bit? So the question, as I understand it, is that it often takes a really long time to train the entire pipeline on the full data set, and whether there is a way to experiment effectively on a smaller amount of data and scale it up later. Yeah, it's definitely possible. Since a lot of models, for example neural networks, can take days or weeks to train, the question is how to make the whole pipeline go faster. As with any computer science problem, there are two ways to do that: you can add more hardware and hope that it works, or you can do sampling. And in machine learning, the sampling part is complicated enough that there's a whole subfield around it. If you have samples of users and you're trying to predict, say, a class, you want to make sure the sample is stratified by class as well; it's very hard to get a truly representative sample to train on. But there are cases where just taking a random sample of your data is fine. If you're doing exploratory analysis and you're not yet worried about the quality of the model, you just want to try things out, it's better to sample your data down and iterate on your own machine instead of running everything on the cluster; iterating on the complete data set is going to be much harder. So it's very important to finish, so to speak, with your exploratory analysis as soon as possible, and then you can go back and iterate through the whole pipeline again.
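As a rough illustration of that last point, here is one way to take a stratified sample with scikit-learn. The file and column names are made up.

```python
# Sketch of stratified down-sampling for faster exploratory iteration
# (hypothetical DataFrame with a binary or categorical "label" column).
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("full_training_data.csv")            # the large data set

# Keep 10% of the rows while preserving the class proportions of "label",
# which is usually enough to iterate on locally.
sample, _ = train_test_split(df, train_size=0.10, stratify=df["label"], random_state=0)
print(len(df), "->", len(sample))
print(df["label"].value_counts(normalize=True))
print(sample["label"].value_counts(normalize=True))    # proportions should match
```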
All right, thank you so much. Thank you.