So we are coming to the next talk. I'm happy to welcome Aliona. Hi, Aliona.

Hello, hello.

So, where are you streaming from?

From Amsterdam, the Netherlands.

Okay, I think we have many people streaming from Amsterdam. Oh, cool. It's nice there, of course. And you are doing machine learning?

Yeah, and everything related to it, indeed.

That's great. Yeah, that's what I sometimes do too. So please, please start your talk.

Okay, thanks so much. Great. So, first of all, happy to see everyone here. Thanks for joining. Also, I was happy to see the previous speaker. I was really, really surprised to see the youngest speaker ever at EuroPython, and I think the EuroPython org team did a really great job finding good speakers. So, yeah, without further ado, let's start. We have 25 minutes and then five minutes for Q&A.

So that's what we're actually going to do today. We start with an introduction, so you know who I am and why you should listen to me, of course. And then we're going to discuss the awesome solution. This solution will allow you to build an ML pipeline out of the box without writing a line of code, and there is also the possibility to add this pipeline anytime, even in a production environment. So, yeah, sounds too good to be true? Indeed, yes. That's not what my talk is about. If you want a one-stop solution that fits all sizes, I'm probably not the right person to talk to.

But let's get back to the point. That's indeed the real agenda for today. The whole conference is full of really awesome talks, tutorials, and also sprints that discuss a lot of wonderful Python open source libraries that help you fix specific issues or make life much easier via automation. In this case, I want to go a little bit, let's say, high-level, and discuss a pretty, pretty important concept that is usually neglected. And we usually start to think about this concept
unfortunately only when we are already in production environments. I just want to save you some time, nerves, and especially money.

So let's take a look at what we're going to discuss today. I will show you the concept of a machine learning pipeline, why we need to use it, and when it's useful. Then we move to its building blocks, and I'll also share some, let's say, hard learnings from my own experience. Then we'll take a look at engineering around failures and engineering for performance. Of course, there will be some time spent on debugging and monitoring. And as a bonus, there will be open source Python libraries to save your time, and quick tips. In my case, all these libraries are literally battle tested in production environments, so I'm going to share the things that I used to work with, at least the open source ones.

Okay, so who am I, and why should you listen to me? My name is Aliona Gareva, and I'm streaming right now from Amsterdam, the Netherlands. I have a full-time job, and in this job, as the title says, I am an Applied AI and Data Engineering Lead. So what I'm doing: I work with a great team of engineers, and we help different companies to find the best data engineering or AI engineering solutions. So we're literally helping organizations to become more data-driven, and we do it mostly in enterprise environments, which is an extra layer of complexity.

On top of it, I run a non-profit, PyLadies Amsterdam. Probably some of you just rolled your eyes right now and said, oh, PyLadies again. No, it's not about that. PyLadies Amsterdam is the Amsterdam chapter of a global non-profit, and our specialization is doing workshops and boot camps.
We do talks from time to time, but the majority of the time, every month you can expect a workshop, and we do it on two levels, complete beginners and more advanced, if you want to deep dive into specific topics in Python. The pictures here are just pictures of previous events. We were doing everything offline before COVID, and with COVID we moved completely online and are doing things only online right now. Also, I'm helping different AI startups as a tech mentor, and this year I was awarded the Microsoft Most Valuable Professional award in the field of AI.

So, yeah, as I mentioned, this talk is based on some high-level facts, but it's all based on my own experience, and I just want to share it with you to make sure that you're not making the same mistakes that I made in the past.

So, yeah, a machine learning pipeline: what is it, and when do you need it? Does every data science project need it or not? In a nutshell, think about a pipeline as a sequence of automated steps. These steps could be small or big, and they can even be implemented in different languages. And if you have this sequence of steps automated in a specific way, it makes everything that you do reproducible, easy to scale, and also easy to move between environments.

So, why do we need this pipeline? Okay, when I do something regarding machine learning, I already have a set of specific steps. Then why should I automate it? Pretty simple: it reduces the cost of any data science project in a huge way. From one side, it sounds like a time investment in the beginning, when you need to set up specific processes. But when it's set up, it frees up a lot of time for data scientists. Frankly speaking, I a little bit hate, or okay, probably not hate, I don't like the statement that data scientists are not engineers. It's true and false at the same time.
Sorry, binary classification is not going to work here. It depends whether we're talking about a team that has different types of engineers and data scientists on board, or about a startup where the data scientist has to do everything from scratch, from zero to hero. The situations are different. But if you have some reproducible, orchestrated steps, it will reduce the cost of any project, and in the majority of cases it does that by actually letting data scientists switch their focus to experimenting with new cases, or with new ad-hoc queries from business people. And if it's about modeling, if we are, let's say, at the modeling part, it also allows you to spend more time on doing proper feature engineering and hyperparameter tuning, instead of just fixing a stream of bugs coming in from production environments while your model is failing each day.

On top of that, if you have specific automation in place, it also prevents bugs. Later I will share, let's say, my top three time-eaters regarding bugs, and that's what I try to fix first when I start working on an MLOps project.

Another thing: it adds an auditable paper trail. What do I mean by that? If we know how the pipeline was run, by whom, when, and in which capacity, you can trace from the source of the raw data to the source of your predictions, the post-processing of your predictions, and the modeling feedback as well.

So overall, it's a huge cost reduction, because it frees up a lot of time for data scientists to do what they're good at, it minimizes and prevents a lot of bugs, and it adds this auditable trail. And you'll probably say, okay, sounds too good to be true, but when should we use it? If I'm just starting, if it's the first ever experiment in the life of the company, should I immediately apply this concept?
In this case, when it's just the first steps and nobody knows whether anything will be done in this direction in the future, it's probably not a good place to start. If you just want to experiment with one specific model, or just try out one scientific paper, it's probably not the case either. But suppose there is something that you as a data scientist, or your colleagues, or a team have already started to work on, and right now there's a conversation like: we finalize the business PoC, we finalize the technical PoC, and if it works, then the next iteration will be moving in the direction of the MVP, where our model, or a set of models, will be a part of it. There are different scenarios. So indeed, when it's time to go from proof of concept to MVP, that's a good place to start implementing a machine learning pipeline.

And on top of that, when it's time to scale. If you have only one model, and there is no conversation about any other models in the future, we can probably live with some manual work. But if it's: we need one model, we need more, we need two models, oh, by the way, now another department wants a model as well...
And then that's where it's time to stop and spend some time on creating the pipeline.

Also, what's the biggest difference between, let's say, having an automated pipeline and having a manual pipeline in place? In case you're just starting, your product will be the model; everything will be around the model. You prepare some data, engineer some features, train a specific model, evaluate it, check how explainable it is, check whether there are any biases in the data. You do all these checks, and then at the end you probably have a model artifact that you want to deploy somehow. There are millions of ways to do that. But then the question is: what do you do when you put this in production? How are you going to make sure that the feature engineering happens the same way? How are you going to make sure that all the features you need are present to run predictions? All these questions can't be solved if you only have one separate artifact, the model. When you create an automated machine learning pipeline, you have literally the whole pipeline, and then you can easily move this pipeline from one environment to another and further.

So overall, when is it a good moment to start with machine learning pipelines? If you're just making the first ever steps and just experimenting, probably wait until you have some processes in place. If there are conversations like, we have one model, we want more, that's definitely the time to start. And of course, when different departments want to do it as well. By the way, one of the biggest mistakes that I see in a lot of enterprises: they have separate departments doing separate things with different levels of maturity, but there is no overall strategy for how to orchestrate and organize the whole data science project management. That's also something to think about. But okay, enough words about this.
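To make the idea concrete, a pipeline as an ordered sequence of automated steps, where each step consumes the previous step's output, could be sketched in plain Python roughly like this. The `Pipeline` class and the toy steps are illustrative, not from any specific library:

```python
# A minimal sketch of a pipeline as an ordered sequence of named steps.
# Every name here is invented for illustration.

def extract(data):
    # Pretend to pull raw records from a source, dropping missing rows.
    return [row for row in data if row is not None]

def engineer_features(rows):
    # Toy feature engineering: scale every value.
    return [row * 10 for row in rows]

def train(features):
    # Stand-in for model training: return a trivial "model" (the mean).
    return sum(features) / len(features)

class Pipeline:
    def __init__(self, steps):
        self.steps = steps  # ordered list of callables

    def run(self, data):
        # Each step consumes the previous step's output, so the whole
        # run is reproducible from the raw input alone, in any environment.
        result = data
        for step in self.steps:
            result = step(result)
        return result

pipeline = Pipeline([extract, engineer_features, train])
model = pipeline.run([1, 2, None, 3])  # -> 20.0
```

Because the steps are plain callables in a fixed order, the same run can be replayed in dev, staging, or production with no manual glue in between.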
Let's take a look at the building blocks of a machine learning pipeline. Probably when I'm talking about the pipeline, you're thinking: one block extracts data, another block prepares preprocessed data, the next block engineers features, the next trains a model, then hyperparameter tuning, then evaluation, then validation. Actually not. I spent a lot of time focusing too much on only the inference pipeline. That starts with: I pick up this model from a registry, or from S3 buckets, Azure Blob Storage, wherever it's saved, and then I'm going to find a way to deploy it properly, then I will define how to do preprocessing and post-processing of the results of the model. And that's where I spent a lot of time. But when it comes to model monitoring, and to when and how the model should be retrained, if you don't have specific building blocks in place, you will spend a lot of time writing glue code and different Bash scripts or PowerShell scripts that try to keep all this glue together, and that's really, really prone to errors.

So the first step among the building blocks is exactly the moment when you orchestrate the development of the models, the experimentation. And I'm explicitly saying orchestrate: if you don't have specific steps running one after another, and if all these steps can't be reproduced or replicated as specific components of code, then it will be a manual, error-prone process. So in this case, what I'm asking for is to think about how you can organize the model development process so that no matter which environment you put it in, it will run the same way.
So in these steps, you're playing around with different machine learning algorithms, or doing some extra feature engineering, or doing some hyperparameter tuning as well. And in this case, what's expected is not a bunch of Jupyter notebooks scattered all over local laptops. It should be a repository with source code, where the source code represents the different steps of this pipeline. Different steps could also be in different languages. Usually when you work with big data, you start with a DataOps part that's executed, for example, in Spark, and it could be PySpark, it could be Scala or Java; there are different versions of it. And then the next steps, for example the model training, if it's not about big data but about working with small data, could be done pretty easily with the available Python libraries. So once again, think about it not as one Git repository, but as a place with the source code for each pipeline component.

And why do you need this? Because the next step will be continuous integration. What happens here is that we want to make sure we can take your source code, remember, it's reusable, separated components per step, and actually build it and run the tests. And when I'm talking about the tests, it's not only about unit tests and integration tests. It's also about the specific tests required for machine learning models: data validation tests, data input/output tests, and model tests to make sure the model returns the same results and stays within the specific limits that you expect from it. So as soon as the source code is there, for example with a merge to master across a set of repos, or if you're using a monorepo style, it could be different.
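As a rough illustration of what such ML-specific CI checks might look like next to the usual unit tests, here is a hedged sketch with a stub model; the function names, the stub classifier, and the threshold are invented for the example:

```python
# Illustrative ML-specific CI checks: a data validation test and a
# model output test. The stub classifier and the 100 threshold are
# made up for this sketch, not taken from any real project.

def validate_input_schema(rows, required_columns):
    # Data validation test: every record must carry the expected fields.
    return all(col in row for row in rows for col in required_columns)

def predict(row):
    # Stub standing in for a real trained binary classifier.
    return 1 if row["amount"] > 100 else 0

def model_output_within_limits(rows):
    # Model test: a binary classifier must only ever return 0 or 1.
    return all(predict(row) in (0, 1) for row in rows)

rows = [{"amount": 50}, {"amount": 250}]
assert validate_input_schema(rows, ["amount"])
assert model_output_within_limits(rows)
```

In a real pipeline these assertions would live in the test suite that continuous integration runs on every merge, alongside the unit and integration tests.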
There are specific triggers that start this continuous integration process: build the source code, run the tests, including the ML-specific ones, and then as an output you get a package. It could also be more than one package: don't think about one component, but a set of packages, possibly in different languages, and that set can easily be taken further by the next step.

The next step is the continuous delivery of this pipeline. So now you have your code ready and packaged properly, possibly as a set of different artifacts, and it's time to battle test it in different environments. I usually see a lot of issues while setting up proper environments for machine learning processes: some kind of sandbox and then immediately production, nothing in between. And when you start questioning, okay, where are the testing and staging environments? Oh, we don't have them, it's really hard to get data there. Well, if you have a pipeline that is capable of extracting data from specific sources, that's a really good first step to orchestrate and set up the other environments. Before going to production settings, and without, let's say, firefighting things in production that could easily be prevented with specific tests or orchestration steps, you do it in a proper environment.

So we had our pipeline that starts from data extraction and delivers a specific component, some machine learning model. Then we package it, and then we deploy it to a specific target environment to check how it's doing there, and also to try it with new subsets of data, for example, that the model has never seen before. And the next step is production. That's exactly when you have your automated pipeline just waiting for the trigger in a production environment. As soon as new data comes in, you can trigger your pipeline, it starts, it does the whole run, and at the end you get the pipeline with the trained and
preferably registered machine learning model. For this, there are also different tools to use; we'll get back to them at the end of my talk.

So once again, don't think that you only have the model as a separate artifact. You have the whole pipeline, so in case something goes wrong, you can always re-trigger it to get this model again. Also, the beauty of this approach is that you have the whole pipeline traceable, and you know exactly that everything you've done in the dev environment, and the results your model delivered there, will be almost the same results here. And if you see a huge difference, it's a sign that something went wrong in a specific step.

The next step: as soon as your model runs in production and the, let's say, most awaited model is delivered, another step comes in, usually triggered when, for example, the model is added to the registry, or moved in the model registry from one stage to another. That's where you start the so-called model continuous delivery process. And here, different pipelines could be running. It could be a pipeline that picks up your model, splits it into specific components, and exposes it via a RESTful API. It could be a pipeline that grabs your code and deploys it as part of, for example, a Kafka streaming application. It could be a pipeline that packs your code, deploys it on an edge device, and does some transformations. There are different options, so don't think about this step as only: I'm just going to expose my model with a RESTful API. There can be different approaches.

And the last step, the really, really underestimated step, is the monitoring. I will talk about it later.
What's important to say here: based on specific subsets of monitoring, you can set up a trigger to retrain the model. Or, if you see that the changes are so great that your machine learning pipeline will not be capable of handling them even with retraining, that's exactly where you can trigger the development and experimentation process again, where you can get new data, discuss new features, and really quickly test different approaches.

So, in a nutshell: the old way of thinking about a machine learning pipeline was that we have some ready model, we just pick this model up, and we talk only about the inference pipeline, where we work with this model, deploy it in a specific way, then monitor it, and then decide ad hoc what to do if something goes wrong. The much better approach, and it will save you a lot of time later, is to have a standard process in place that goes from the moment of raw data to the moment of generating insights and predictions. This process can be packaged, described in a specific state, and properly orchestrated, with, for example, experiment logging, so all experimentation runs are logged. And then you go to the end, where you have this service in production. It could be, once again, a combination of different models, a model deployed in one way or another, and you have monitoring specific to this use case that is capable of triggering model retraining, or of getting back to the start and doing the whole process again.

So, regarding engineering: of course there are a lot of words here, but usually what we are most interested in is making sure that we are ready for any failure that is going to come. We also know that it's literally not possible, during, for example, the development stage, to cover all edge cases.
There will definitely be some specific use cases you're not aware of or don't expect. But anyway, what can we do to make sure that our machine learning pipeline is properly engineered? In this case, I always say: it's a pipeline, it has input and it has output. Always do input checks. By input checks, what I mean here is not just testing or data validation, but making sure that if the data is not in the proper format, you don't start your pipeline at all.

Then the next question is the output checks. If something went wrong and the model delivers some really strange prediction, what should you do? It depends on how your model is consumed. Is it end users, the customers? Or is it a machine learning pipeline that delivers something in batches, and then a person just reads the model's predictions from SQL or from some object storage? What do you do if the model produced really weird results?

And that's the third point: there should always be a fallback. In case no proper input data is coming in, or the data is really weird and has a lot of strange values, that's where we stop the pipeline and give the user a specific message back: we're not capable of doing predictions because of this, please update the data. There could be different options. Output checks are about sanity-checking the results of your model: if you know that you're doing, for example, binary classification and you're expecting zero and one, it would be really strange if your model delivered something like three, right? And in this case, the model also needs a fallback: if you're seeing that something really weird is happening with your model, what do you want to give back to your users?

And then the next thing is engineering for performance.
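Before moving on, the three ideas just described, input checks, output checks, and a fallback, could be sketched like this; the validation rules, the stub model, and the fallback message are made-up examples, not a prescription:

```python
# Sketch of input checks, output checks, and a fallback response for a
# binary classifier. All names and rules here are invented examples.

FALLBACK = {"prediction": None,
            "message": "Prediction unavailable: input failed validation."}

def input_ok(features):
    # Input check: refuse to start the pipeline at all on malformed data.
    return (isinstance(features, list) and len(features) > 0
            and all(isinstance(x, (int, float)) for x in features))

def output_ok(prediction):
    # Output check: a binary classifier should only ever emit 0 or 1;
    # anything else (e.g. a 3) means something went wrong upstream.
    return prediction in (0, 1)

def predict_with_fallback(features, model):
    if not input_ok(features):
        return FALLBACK          # never run the model on bad input
    prediction = model(features)
    if not output_ok(prediction):
        return FALLBACK          # never surface a nonsense prediction
    return {"prediction": prediction, "message": "ok"}

# Stub model standing in for the real classifier.
model = lambda feats: 1 if sum(feats) > 0 else 0
```

The point of the structure is that the caller always gets a well-formed response, whether the model ran or the fallback fired.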
So here we always want to make sure that our model predictions are scalable, and if, for example, we are exposing our model via RESTful APIs, that we know the model was tested properly before that. Otherwise, we will not be able to scale it the way we want.

Another thing is the usage of caching. I've noticed that a lot of people completely forget about it, and actually it's really useful. You can cache inference results: for example, if you see that some specific requests are exactly the same, or there are some frequent types of requests, you can just pre-cache some inference results. There are some specific models that require caching, because otherwise it's really expensive to calculate the results each time. So that's also something to think about. And also, depending on how our model is deployed: how are we going to implement feedback collection about how our model performs, for example, from the perspective of the user? These are really important things to think about.

And then, let's say, the debugging and monitoring part. You could probably spend a lot of time on it; it's worth, let's say, a lot of weeks just to learn something about it. What I want to share here: there are things that make ML different from traditional software engineering, because on top of the usual code versioning, code tests, and, let's say, simple system monitoring, we get data tests, model tests, data monitoring, and prediction monitoring. And as I promised, I would like to share the top three debugging issues that I see almost every day in my work with different clients. And guess what?
This is the biggest issue, especially with Python libraries. We have a model trained in a specific environment with unpinned libraries, and then, for example, pandas or scikit-learn makes an upgrade and deprecates some methods. Guess what happens: the whole data processing part falls apart. If you have, for example, some specific pandas transformation, or you do one-hot encoding with scikit-learn or something similar, and it doesn't work at all, that's where we're going back in time: welcome, glue code, and welcome, unpinned libraries.

Another thing that also creates issues is literally scattered config for different environments. That's another story, and the hardest part of it is that we have some model-related config hardcoded in the code, other parts hardcoded in some JSON files, then some parts partially, let's say, deployed with configuration settings written from here, another set of configuration settings sitting in another JSON that is written, let's say, from a Bash script in a specific deployment mode. People, I'm just begging you, let's put some order in this chaos, because every time, with every quick fix in Bash, with every quick fix like, I'll just write config here or write config there, that's exactly the problem.
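One possible way to put order in that chaos is a single typed config object loaded from one file per environment, so every component reads the same source of truth. The field names here are invented for the sketch, and a temp file stands in for the per-environment config file:

```python
# Sketch: centralize scattered settings in one dataclass loaded from a
# single JSON file per environment, instead of values hardcoded in code,
# in extra JSON files, and in Bash scripts. Field names are invented.

import json
import tempfile
from dataclasses import dataclass

@dataclass
class PipelineConfig:
    environment: str
    model_name: str
    prediction_threshold: float

def load_config(path):
    # Every pipeline component calls this, so there is exactly one
    # source of truth per environment.
    with open(path) as f:
        return PipelineConfig(**json.load(f))

# Example: one JSON file holds every setting for a "staging" environment.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump({"environment": "staging",
               "model_name": "churn-model",
               "prediction_threshold": 0.5}, f)
    config_path = f.name

config = load_config(config_path)
```

The dataclass also fails loudly if a field is missing or misspelled, which is exactly the kind of error the scattered-config setup silently swallows.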
So those are exactly my top three debugging issues. If you just start pinning libraries, trust me, it saves so much time for all the engineers who are going to work on this. You wouldn't guess how much time it can free up.

Regarding monitoring, I see that our time is running out, but what I wanted to mention: remember that you have three types of monitoring for models. There is system monitoring, where you monitor specific technical metrics such as CPU and RAM, or, if you're working with APIs, the amount of requests; if you're working with serverless, there are different things to monitor. There should be data monitoring as well, checking the input data and changes in the data distributions. And there is model monitoring, checking what the model output looks like, what the differences are between, let's say, the different environments, and also the difference between the mapping of the incoming data and the output that you get.

But without further ado, as I mentioned, here is just a set of Python libraries to use. I skip the libraries you probably use every day, such as pandas and scikit-learn, and go a little bit high-level over exactly the tools that I was working with. One of my favorites is Great Expectations for data validation, then Feast as the feature store, and Alibi, Alibi Detect, and Alibi Explain, because they help a lot with model interpretability, debugging, and monitoring, and have a lot of algorithms already implemented for you, so you don't need to implement everything from scratch. And another one of my favorites is MLflow, because there are different components that you can reuse, and it adds traceability and reproducibility to your machine learning pipeline journey.

So, yeah, I think I'm probably out of time, so it's time for questions. So, Martin,
So martin Do you have any questions? Yes, thank you very much for this talk. I really enjoyed it and I really like your pipeline. I will adapt it for sure um There are some questions being typed and let's start with we have a little bit time for questions questions are important One question is do you still use jupyter notebooks for prototyping or not at all? Okay. Yes, it's possible. Uh, only one thing that I want to ask you Like data breaks notebooks dappling jupyter notebooks. It's awesome tools great. You can add extra things to it But please keep them indeed for prototyping for ada There are lots of issues with them in production There are tools to fix it, but trust me this tools just add in the layers of complexity So yes, I do. I use them although I'm not a great fan of it But if I need quick prototyping, I usually do it next one Next one. Do you have experience using dbc for setting up machine learning pipelines? What's your opinion on it? Opinion from like if you have nothing it's a great tool to use Although it depends on the environment you're working on Is it on prem? Is it hybrid? For example, is a cloud is a multi cloud because for example in the cloud you have a replacement for it So it can run differently usually if it's not about big data and There is a pretty simple on-prem setup. That's where I use dbc Or as well for example starting point while transition to the cloud And let's say my opinion. I use not the whole package. I usually only use it for a literally data set registration But I'm not use it as specifically for all for all pipelines tab because in my case I'm more probably I prefer the old ways where you define specific things at the young files and just give to any orchestrator card launch to use it So yeah, yes and no at the same time Unfortunately, we are running out of time and there are more questions I hope you will come to the chat and answer them While we are moving to the next talk. 
Yes, thanks. If not, you can always look up my name and find me on the internet, and I'll be happy to answer your questions, since we need to move further today.

Thanks a lot. Okay, okay. Thank you very much again for this talk, and see you around.