Hi, thank you for being here. So, I'm Loris, as was just said, and we're here to talk about how to productionize your ML code seamlessly. As was quickly mentioned, I work for Yelp — is this working? Yes, awesome. Yelp is all about connecting people with great local businesses; that's the core of our mission. You've probably used the site, especially here in the UK: you type in what you're looking for, and you'll find restaurants, your local plumber, a moving company to help you move — your code, seamlessly, into production — anything that is needed.

Just to give you an idea of the scale at which Yelp works: we have a lot of reviews, a very high number of unique visitors per month, quite a big team, lots of services, a lot of code — which means a lot of people working on data. So we have learned quite a few things about how to make our lives a bit easier when putting models into production.

What's on the menu for today? First, presenting what it actually means to put a model into production — that's probably the least self-explanatory sentence ever made — and then I'll give some tips and tricks to make your life easier.

Before we go into the depths of the subject, I want to take a moment so we all agree on what I'm talking about and where all of this fits. I think every ML project in Python starts in a notebook; if it didn't, probably something weird was happening. You started writing this notebook because someone gave you a data set and a question, and you needed to answer it. In a way, that's the core of it: you don't do machine learning just to do machine learning. You're trying to predict a desirable behavior. You're trying to recommend.
You're trying to detect something — you have an objective when you start doing machine learning, and your notebook does something. Whether your notebook is simple — your features are already all nice, you just apply a few transformations, train your model, do some feature analysis, and maybe check that your model didn't train in too crazy a manner — or you end up with pages and pages of SQL queries that perform feature extraction, a complicated model, and everything in between, it doesn't really matter. What brought you to the stage where you think you can bring this to production is that at the end you had a result, something that made you think: yep, there's a problem, and I can crack it.

But here's the thing: you cracked it once, and now you want to crack it on a regular basis. You want to retrain regularly and make sure, every day, that your model is not too bad. Your model produces something you want to use, so you need to deliver this something — the predictions — to the final users. And finally: hey, you wanted to achieve something at the very beginning; now that everything is done, is it still happening?

So right now you are at the first step, and I will show this probably over-shown schematic from "Hidden Technical Debt in Machine Learning Systems", which Google presented at NIPS two or three years ago. Right now you're here: the small, very black circle, which I put in red because it's really hard to see, and you need to interface yourself with all the rest of your live systems. So, what does running an ML model in production involve? Well, it's much like putting any other piece of code in production.
In a way, there is not so much difference: you're going to interface with the surrounding infrastructure, and the goal is to go from something that runs under your benevolent supervision — skipping the few cells that don't actually run correctly in your notebook — to something that runs every day, tells you when it's wrong, and keeps running properly even after you stop looking at it.

Great. So, as you might start to guess, I'm not going to talk too much about tooling; that's really not the point. This conference was full of great people showing you awesome tools to do everything. I'm going to focus on what it actually means: how to decompose the problem of putting things into production into a series of steps and questions that you should be asking yourself and answering. Probably the answer to many of these questions is "use Airflow", but that's another topic.

For the sake of argument, and since I'm talking in generalities, let's agree on a simplified view of what the pipeline is. First you have data sources; these can be many different things. You perform some sampling on them — maybe, or not. You extract the features you have defined; probably your notebook told you that this set of features was working really well, so you should definitely use that. You train your model, you evaluate it, everything went well, you have a model. Rinse and repeat for production, and then you load the predictions you've obtained into your product. Fair enough — so why is this useful?
Well, your notebook probably performs exactly all of these operations already. What this decomposition tells you is which pieces should go together: which pieces should be one function, or one Python module, and be tested together, because they make sense together. Right — tests.

Cool, let's focus on specific parts of this pipeline. The first part I want to touch on is the data sources, even though I'm not going to talk much about them. They can be S3 logs, Redshift, MySQL, Postgres, something I didn't think of when I was writing the slides. They're likely to change regularly — you have new data coming in on a regular basis — and how your system ingests the data is just taken as given for the purposes of this talk; I won't go into detail. You might also have noticed, especially if you have worked with models before, that this describes a specific setup where everything happens offline: you have all the data you need at all times, offline training, offline predictions. Things with online prediction or online training would not be that different — it's mostly a constraint on the data sources — so I will stay in the simple case.

Cool. So, first things first: you need to update your model on a regular basis, and the question is, what is a regular basis? That's up to you, in a way: how often does your data change enough that your model needs to change too? The next point is what happens when things go wrong. How do you rerun your pipeline? What happens if some data is missing? Do you pick old data and fill it in? Maybe it's not worth it and you could just reuse an old model, or any other strategy you can think of. The other one is scale: how many predictions, what size is my training input, how long should this take? This is what allows you to think about how to dimension the infrastructure that all of this is running on.
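To make that decomposition concrete, here is a minimal sketch — entirely my own illustration with invented names and a toy "model", not anything shown in the talk — of the pipeline stages as separate, individually testable functions:

```python
# Toy illustration of the pipeline stages from the talk, split into
# functions that can be tested in isolation. All names are hypothetical.

def sample(rows, every_nth=2):
    """Sampling step: keep a subset of the raw data."""
    return rows[::every_nth]

def extract_features(rows):
    """Feature extraction: turn raw rows into (features, label) pairs."""
    return [({"length": len(r["text"])}, r["label"]) for r in rows]

def train(examples):
    """'Training': a majority-class baseline, purely for illustration."""
    labels = [label for _, label in examples]
    majority = max(set(labels), key=labels.count)
    return {"predict": lambda features: majority}

def evaluate(model, examples):
    """Evaluation step: accuracy of the model on a set of examples."""
    correct = sum(model["predict"](f) == y for f, y in examples)
    return correct / len(examples)

raw = [{"text": "great food", "label": 1},
       {"text": "meh", "label": 0},
       {"text": "lovely place", "label": 1},
       {"text": "awful", "label": 1}]

examples = extract_features(sample(raw, every_nth=1))
model = train(examples)
accuracy = evaluate(model, examples)
```

The point is not the (deliberately silly) model but the seams: each stage is one unit that can be swapped, logged, and tested on its own.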
We're talking about models, and we're talking about failure. Very often, when people write code and think about failures, they think: oh, I'll get a traceback from Python. But that's not the only way a model training can fail, and that's why you have an evaluation step, which I want to dig a bit deeper into. The first question you could ask yourself is: does the evaluation metric I'm using actually reflect the problem I'm trying to solve? We all use things like log loss or area under the ROC curve for various kinds of problems, and the question — to tie it back to what I was saying — is: does it solve your problem? Maybe, maybe not; think about it. Some functions have very good mathematical properties, but they might not represent what you're actually trying to move.

Last part: when you evaluate your model, look at which features are used, because this is the point where you can spot a feedback loop. Your model doesn't just run on its own any more: it's generating predictions, which means it's probably affecting how your data is generated. If, over time, you see that your model relies on just one feature, maybe that one feature is actually your model re-predicting what it was predicting before. So be aware: this is a point where you could say this training has failed, because the model doesn't behave the way it should.

Now let's go to the prediction side of the schema. The questions are exactly the same: what happens when things fail? How often does this run? How many predictions should I be generating every day? And the last one, which is a bit sneaky: how do the predictions get used, and how are they used in your product? That's what I want to push into right now.
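Going back to the feedback-loop point for a second: an evaluation-time check along these lines could flag a run as failed when one feature dominates, or when importances drift far from a stored baseline. This is a hypothetical sketch with invented thresholds and feature names, not anything from the talk:

```python
# Hypothetical evaluation-step sanity check: compare feature importances
# against limits, and flag the run if one feature dominates (a possible
# sign of a feedback loop) or importances drift far from a baseline.

def importance_shares(importances):
    """Normalize raw importances into shares that sum to 1."""
    total = sum(importances.values())
    return {name: value / total for name, value in importances.items()}

def check_importances(current, max_single_share=0.8, baseline=None, max_drift=0.3):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    shares = importance_shares(current)
    for name, share in shares.items():
        if share > max_single_share:
            problems.append(f"feature '{name}' carries {share:.0%} of importance")
    if baseline:
        base = importance_shares(baseline)
        for name in shares:
            drift = abs(shares[name] - base.get(name, 0.0))
            if drift > max_drift:
                problems.append(f"feature '{name}' drifted by {drift:.0%}")
    return problems

healthy = check_importances({"stars": 3.0, "reviews": 2.5, "distance": 2.0})
suspicious = check_importances({"previous_prediction": 9.0, "stars": 0.5, "reviews": 0.5})
```

A non-empty result would mark the training run as failed, exactly like a traceback would.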
I've said nothing about the order in which all these things should be done; I just said: that's your whole problem, deal with it. Actually, this is probably what you should start with: how are the predictions used in your product? Because you have predictions — you had them at least once — and if you can already start using those just to test, you can see whether you're actually successful or not. And that's probably the last thing: how are you measuring success? How do you say: I did my job, it worked, project over, thank you very much?

First, you need to track the business metric you're trying to move. You need to go back to your original problem and make sure you can actually track that your model is doing something, and you need to test it — confront it with reality. You might not get it right from the get-go, in which case you should test new versions against old versions, and against the status quo. And the last part: measuring success always looks very easy. You could say, yeah, I'm going to do this, and... well, story time.

I work for a team whose main objective is to get people to create a special kind of account, one tied to a business, so that they can manage it. One good way we found to do that was to show people a little pop-up. But if we show it to everyone, it doesn't really work. So we thought: hey, we're going to start predicting which people are actually likely to be business owners — and so would create the account as a special business account — and we'll show them the pop-up. This signal decides whether we show the pop-up or not, and then we move forward. And that's what we did: we built the whole model, trained it to predict whether someone was potentially an owner or not, showed them the pop-up — and much of the time, like 94% of the time, they would create the account immediately afterward.
It was great; we had great numbers, because we were measuring our success by whether they clicked the little button on the pop-up we were showing them and then created the account. But actually, the total number of accounts created didn't go up at all. So we looked again, and what our model was actually doing was predicting what would happen whether we did something about it or not. We checked this: we set aside a group of people we could have shown the little pop-up to, but didn't, just to test that our model was behaving correctly — and they were creating almost as many accounts as the people we were showing it to. A bit fewer, so we were still happy: it was still working, still worth investing in. Still, it's really hard to get this right; it's really worth spending some time thinking it through.

And with that story I've already transitioned smoothly into tips and tricks. Let's start with general advice. You might feel like you've seen this before when putting services into production, and for good reason: as I was saying, ML code is code. Use containers — Docker, Kubernetes, whatever you want — containers are great, and virtual environments are even more awesome. Spend some time persisting your work, with version control. Persist everything you do — the logs, everything; you might want to look back at it in two weeks or two months. Maybe not a year, so you might want to put a TTL on it.

The last two points are maybe a bit less common. First: use the production technology from the get-go. Story time again. We had data scientists who put a lot of effort into figuring out how search pages and lots of other pages were related.
It was all done in Redshift — and it wasn't really working in Redshift, because Redshift didn't have the whole data set — and then I had to rewrite a thousand lines of SQL queries in Spark, which took two or three months. That is a loss of time for everyone, and Spark is probably even easier to use than SQL. If the production technologies are made widely available, even at an extra cost, it actually saves development time most of the time. And the last point: if you're doing ML models in a company, there is probably already a lot happening there. You probably already have software that runs software on a regular basis; just don't reinvent the wheel.

Cool. Now let's dig into the separate parts. My schema was not that great, because it might have led you to believe that feature extraction for training and feature extraction for prediction were different things. They are not: you should not have two pieces of code for feature extraction — apart from the labels, it should be the same code in every case. This code should be unit tested; you can use things like Hypothesis to generate a bunch of random data to test all of your edge cases. And don't write it all as SQL — that makes people's lives hard. Write everything as code that can be tested. You need tests, mostly.

Now, on the training side, I'm going back to evaluation again. I might have said this in a previous part, but to put emphasis on it: perform feature importance analysis, and keep the results somewhere, so you can tell when your model goes off the rails. If you wrote any piece of code that checked that your data set was right — anything you did in your research phase that gave you confidence that your model was good — you should keep it implemented and run it regularly.
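The talk names Hypothesis for this; as a dependency-free sketch of the same property-testing idea, here is a hand-rolled version that throws random strings (plus explicit edge cases) at a hypothetical `extract_features` and checks invariants — everything here is illustrative:

```python
# Property-testing sketch for a feature-extraction function. A library
# like Hypothesis does this better; this hand-rolled version just shows
# the idea: invariants must hold for any input, edge cases included.
import random
import string

def extract_features(text):
    """Hypothetical feature extraction, shared by training and serving."""
    return {
        "length": len(text),
        "n_words": len(text.split()),
        "upper_ratio": sum(c.isupper() for c in text) / len(text) if text else 0.0,
    }

def random_text(rng, max_len=50):
    """Generate a random string over a small alphabet, length 0..max_len-1."""
    alphabet = string.ascii_letters + string.digits + "  \n"
    return "".join(rng.choice(alphabet) for _ in range(rng.randrange(max_len)))

def test_extract_features(n_cases=500, seed=0):
    """Check invariants over random inputs plus hand-picked edge cases."""
    rng = random.Random(seed)
    cases = ["", " ", "\n"] + [random_text(rng) for _ in range(n_cases)]
    for text in cases:
        f = extract_features(text)
        assert f["length"] >= f["n_words"] >= 0
        assert 0.0 <= f["upper_ratio"] <= 1.0
    return len(cases)

checked = test_extract_features()
```

With Hypothesis proper, the same test is a `@given(st.text())` function and the library finds and shrinks failing inputs for you.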
This is what makes sure that the assumptions you've made about the data set stay true over time. Also, a classic: have a small sanity-check test suite with just a small set of data, to be sure everything runs while you're developing — so you don't break production with a push. That is bad.

Now for the "all the things" advice. The first one is: log all the things. You want to know what happened with your pipeline; it's not running under your supervision any more, so get it to log everything. The sampling: log how the classes were selected, that kind of thing. Feature extraction: log everything that happens — how many features were extracted, maybe some small statistics about the features — so when you have a problem, you can just look at your logs instead of recalculating everything. Model training: log everything that happens. Evaluation — we just talked about it — log feature importances, that kind of thing. Log, log, log, log. And in your product, log when you actually use your predictions, and what usage is made of them. Every time you're doing something, you want logs; you want to know about it. This is how you measure things.

The second one is: version all the things. Yes, that's how you keep track of change. Feature extraction: you might add or remove features and write different functions, maybe with different names. You may also want to persist the extracted features before model training, in case the training fails. Same thing with files: if a file is written with the day the features were extracted and the version of the feature-extraction code that produced it, it's much easier to go back and understand what happened. The same goes for the model itself: give it a version. It can be several things — just a git commit, or semantic versioning.
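As a sketch of what that can look like in practice — the file-name scheme and field names here are my own invention, not Yelp's — the version, commit, and date can be baked into the artifact name and recovered later:

```python
# Hypothetical sketch of the versioning advice: name every model artifact
# with the semantic version, the git commit of the training code, and the
# date, so any prediction can be traced back to how the model was built.
from datetime import date

def model_artifact_name(name, version, git_commit, trained_on):
    """Build a name like 'owner_model-1.2.0-ab12cd3-2019-07-01.pkl'.

    Assumes `name` itself contains no hyphens.
    """
    return f"{name}-{version}-{git_commit[:7]}-{trained_on.isoformat()}.pkl"

def parse_artifact_name(filename):
    """Recover the traceability fields from an artifact file name."""
    stem = filename.rsplit(".", 1)[0]
    name, version, commit, trained_on = stem.split("-", 3)
    return {"name": name, "version": version,
            "git_commit": commit, "trained_on": trained_on}

artifact = model_artifact_name("owner_model", "1.2.0", "ab12cd3f9e", date(2019, 7, 1))
meta = parse_artifact_name(artifact)
```

Given such a name plus the logs, any prediction can be mapped back to the exact code that produced it.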
It can be both — it's great. You want to change from logistic regression to XGBoost? Bump the version. You're changing the hyperparameters? Bump the version. There is nothing more frustrating than knowing that a model was generated and not being able to remember how, or find out how it was done. Plus, if you have a version and you have logging, it's really easy to know which prediction was generated by what, especially when they are used — which makes evaluating success and running experiments much easier. This is basic data traceability.

So now, with all of this, we can think about some general ways to monitor the pipeline as a whole. Keeping track of the number of predictions generated is a good one — and of predictions used, too — because it tells you which part of the system could be broken when the numbers don't match. Keeping track of timings, to see which part of your code starts to become slower and why, is also very important: a job whose runtime has slowly crept up over a year of otherwise running fine is very different from one that was running really well and suddenly takes twice as long. Alert on errors in your pipeline code; it can be as simple as "send me an email at this address" if you don't have an alerting system set up — otherwise, set one up, it's really practical. And, as the last line of defense if everything else fails, alert on the metric you are trying to move. You had a goal at the beginning, and all this ML and production effort doesn't live in a void; it serves a purpose. So alert when the purpose is not being accomplished: maybe everything else looks fine, but if this doesn't, maybe there is a problem, and it's worth looking into.
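Those monitoring checks can be sketched in a few lines. The thresholds and record shapes below are invented for illustration; a real setup would feed an alerting system rather than return strings:

```python
# Hypothetical sketch of the monitoring advice: compare today's prediction
# count and runtime against the average of recent runs, and produce alert
# messages when the run looks abnormal. Thresholds are illustrative.

def check_run(history, current, max_drop=0.5, max_slowdown=2.0):
    """Compare today's run against the average of previous runs.

    history: list of dicts like {"n_predictions": int, "seconds": float}
    Returns a list of alert strings; an empty list means the run looks normal.
    """
    alerts = []
    avg_preds = sum(r["n_predictions"] for r in history) / len(history)
    avg_secs = sum(r["seconds"] for r in history) / len(history)
    if current["n_predictions"] < avg_preds * max_drop:
        alerts.append("prediction count dropped by more than half")
    if current["seconds"] > avg_secs * max_slowdown:
        alerts.append("pipeline runtime more than doubled")
    return alerts

history = [{"n_predictions": 1000, "seconds": 60},
           {"n_predictions": 1040, "seconds": 62}]
ok = check_run(history, {"n_predictions": 990, "seconds": 61})
bad = check_run(history, {"n_predictions": 300, "seconds": 200})
```

The sudden-doubling check deliberately uses a ratio against recent history, so slow drift over a year is a separate thing to watch on a dashboard.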
Last but not least: write runbooks. How is this system supposed to work, how did you initially think about it — and then, every time something you didn't think about actually happens (because reality is reality), add it there, so you can remember how you solved the problem and handle failures easily, especially when it breaks at four in the morning.

Now, if this was a bit boring and you fell asleep, I'm really sorry — but if you need to remember just three things, it would be these. Design for change: that's the main difference when you build systems for production; the code is probably going to outlive you at the company, sometimes, so shift your mindset slightly. Machine learning code is code: there are thirty good years of best practices in software engineering, so use them — it seems different, but it's not, and most of the good ideas are already there. And last: verify any assumption you make, because things change and evolve, so you need to be sure that the assumptions you've made still hold in your systems.

We are hiring, if any of this was interesting and you want to join us at Yelp; we have offices in Hamburg, London, San Francisco — we had a stand. End of advertisement. Thank you all very much; I'll be taking questions.

Host: Thank you for this great talk. Where are our questions? Let's begin here.

Q: Thank you for the talk. You said you weren't going to talk about tooling, but do you still have any advice? Some parts of software engineering still apply to data science, but we have some specific problems, like data unit testing or input validation — and also versioning everything: you can't just put all your data on git. So that's also a problem.

A: My advice is: we use S3 quite extensively, because storage is actually really, really cheap, and just throwing everything at S3 and figuring it out later actually works.
It sounds really bad, but it works.

Q: But how do you keep track that this model was learned on this extract of data? Do you have any way of —

A: I'm not sure I understand the question.

Q: When you get a model, it was learned on some extract of the data. How do you keep track that this model produced this result from this data?

A: Right now we use semantic versioning: every time we change something in a model that is in production, we bump its version. All the models, when they are written out, are written to a file name that contains the version they were generated with, and sometimes the git commit of the code that was running when this was done. Most of the time, combining the two, you can know exactly what happened, and just reset your repo to an earlier version if you need to reload it or remember what was done.

Q: Thank you very much. Thanks for the talk — any comments about regression testing after you deploy models? This is something I have faced in teams I've worked in, because regression testing is part of software development, and machine learning deployments are kind of non-traditional: when models change, the outputs for unchanged inputs change, and then QA gets tricky.

A: I don't really have a great answer, but I would say: if you look regularly at the metrics you're trying to move, and at what your model is supposed to do, you will know when there is a regression, because the numbers you're trying to grow will stop moving in the direction they used to. It's not regression in the same sense as with code; you look for regressions in the patterns, in the metrics. And I would even say this is more problematic than with code, because there can be all sorts of reasons — like, hey, today all the metrics are going down, what's happening, it's horrible... oh, today is a holiday, so actually everything's fine.
So, what we're doing right now is that we never deploy a new model immediately. We do progressive releases: you have the current model version, which serves most of the traffic; we release the new version to a small part of the traffic, and we check that everything is fine before actually switching over. That's how we solved it. Yeah, it works.

Q: Thank you for the interesting talk — one comment and a question. On measuring things: there's a great blog entry by Uber that talks about how they do it, actually applying machine learning to the metrics. Now the question, on the "log everything" approach: that goes very much against the way, for example, Jupyter notebooks make us write. How do you square that, and what's your advice?

A: Yeah, I might have gone over that a bit fast. I'm not really a fan of the tools that let you productionize a Jupyter notebook directly. There are several problems with that, which come down to code quality: structure in a Jupyter notebook is very linear, and code isn't. It's hard to write tests — I would be very interested in seeing actual, real, nice tests written in a Jupyter notebook; maybe it exists, I've not seen it. So it's completely a different beast, in a certain sense: you're going to check in your code for real, and it's going to stop being an experimental thing — it's going to be a git repo that runs and is deployed. Maybe there is a way to bridge that gap; at the same time, I'm not sure it's a good idea. Jupyter notebooks are really, really good at what they do, and checked-in code — which doesn't change too much, isn't that easy to change, and where you can track every change — is what you need for production. Does that answer the question? So: we use Jupyter notebooks for everything that happens before we put something into production, and then we start copy-pasting the code from the notebook into an actual repository.
You start writing tests; you start going back to the data scientists — so, you did these two things here, but the two intervals are not exactly the same — you ask questions to try to improve things, and you start doing code reviews and making sure everything works in a regular fashion. So they are really two separate workflows.

Q: Thank you. The talk was brilliant, and I think great advice for people who are developing the models. Do you have any advice for application software engineers who need to integrate a model into a product or a feature — or perhaps advice around collaboration between a data scientist coming up with a model and a software engineer implementing it?

A: Yeah, I have some advice, in a way, and it's more organizational. What I think is important is that everyone is on the same page: the data scientists should be using roughly the same tools you're going to use in production, and not have their own little world — one that works, but that you're never able to translate. The other part: if you're a data scientist who spent four months working on a model, and an engineer — that would be me — comes along, takes it, and does a lot of things with it to put it in production, I think it's very important to keep that person in the loop, because they have a sense of ownership over it. So: keep people in the loop, and make sure things stay aligned. That was a very vague answer; I hope it was understandable.

Host: Okay, two more questions, I think.

Q: Thanks. You mentioned S3 — do you also use S3 to trigger, say, the models running in a Lambda, or use serverless for that sort of thing?

A: I don't — I know some people are playing with it, but I have no experience on the subject.
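Coming back to the progressive-release approach described a couple of answers earlier: one common way to realize it is a deterministic traffic split by user id, so each user consistently sees one model version. A minimal sketch, with invented names and shares:

```python
# Hypothetical sketch of a progressive model release: route a small,
# stable fraction of users to the candidate model, based on a hash of
# the user id, before switching everyone over.
import hashlib

def bucket(user_id, candidate_share=0.05):
    """Deterministically assign a user to 'candidate' or 'current'."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    fraction = int(digest[:8], 16) / 0xFFFFFFFF  # map hash to [0, 1]
    return "candidate" if fraction < candidate_share else "current"

# The split is stable per user and close to the requested share overall.
assignments = [bucket(f"user-{i}", candidate_share=0.1) for i in range(10000)]
share = assignments.count("candidate") / len(assignments)
```

Hashing rather than random sampling is what makes the assignment stable across pipeline runs, which in turn makes the two groups comparable when you check the metrics before switching over.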
Thanks.

Host: One last question, anyone? No? Okay then — sorry, go ahead.

Q: It may be a little bit off topic, because you said you wouldn't be talking about tools themselves, but can you guide us on what to look for, and drop some names to check out?

A: Yes. I have a particular liking for Spark, to be honest, because it handles data sets of all sizes. It's very easy to run locally: you can write unit tests with it, pretend that your test cluster is gigantic when it's actually just using one CPU, and everything works — ish. I've had very good results with it, and lots of algorithms and models, XGBoost included, are compatible with Spark. I would always say to carry two guns: one good old tool that is well tested — Spark is kind of old and not trendy any more, which means it's stable and actually works in production — and then try out the new things, to see if you can leverage them and what their problems are.

Host: Thank you. So, thank you very much again.