Okay, so as George mentioned, I'm Meder. I'm from Kyrgyzstan, but I live in Vienna nowadays, so if you're ever in Vienna, you can get in touch with me.

So what are we doing in this talk? I'm going to demo a project, I'm going to demo some tools, and hopefully we will learn something together at the end.

We're imagining that we are a one-person data team in an online shop. It's not the biggest shop in the world, but it has hundreds, maybe thousands, of customers all around the world, and we have to ship orders every day. We get a request from our bosses saying that we need to make predictions on orders: we need to be able to know whether an order will be returned within one week. They want this for, I don't know, storage planning or something, but they want to know if the order will be returned within one week. For now it's okay to do it in batches, so at the end of the day we look at all the orders for that day and we make a prediction. But ideally, and we get bonus points for this, they want it to be live: some kind of API endpoint that they can call and that gives a prediction right away, so that even before the user clicks "order", they know whether this order is likely to be returned or not.

Because we are an automated shop, we need to build a pipeline that trains ML models and uses ML models, and ideally it's an automated pipeline. We don't want to sit and train it by hand every day; ideally we have an automated system that trains models, chooses models, deploys them, and uses them, and everything is automated, so all we have to do is press a button and monitor from time to time.

Which technology are we going to use to build the MVP of this feature? We're going to use dbt for transforming and cleaning the data, dbt for training ML models, dbt for making predictions, and dbt for automated deployment.

Could you raise your hand if you know what dbt is? Okay, awesome. And could you raise your hand if you work with dbt at the moment? Awesome. Okay, so for you the next five minutes are going to be a little bit boring, but I hope I clarify a few things, and if I say something wrong, please tell me.

In a traditional data infrastructure setup we have ETL pipelines: we extract, transform, and load into our data warehouse. But nowadays compute and storage are separated, and storage is super cheap, so we can just dump everything into the data warehouse. This means we can load all the raw data from all of our data pipelines straight into the warehouse, and then the challenge is to go from raw data to transformed data so we can consume it downstream, when we do machine learning, reports, and dashboards. This is where dbt comes in: dbt is a tool that lets you relatively easily define these transformations, productionize them, and run them in a more or less reliable manner.

Okay, so how does it work? What actually is dbt? dbt comes with a project.
It's a CLI tool, so you run dbt run, and it needs a project, which is a centralized structure for your transformations. Within the project you have dbt models; models are like tables or views inside a database. You have tests, to make sure the models are being generated properly, and they can run automatically. And you have a DAG, because models can depend on other models, so you end up with a DAG of models: you say you want to run models A through B, and dbt runs them in the right order.

What is a dbt model? A dbt model has to be a transformation. When it's a SQL model, it has to be a select statement. Here, for example, we have a simple SQL model where we're doing some selection, and you'll notice that dbt supports Jinja: here we have "from ref model C". This is the dependency; it says that model D depends on model C, so before you compute model D you have to compute model C. (There are sketches of both kinds of model below.)

dbt also supports Python models, and Python models, like SQL models, have to define transformations. To define a transformation in dbt, it has to be a function: you have to have a function literally called model, and inside the model function you have to return a DataFrame. That DataFrame is then persisted as a table in your data warehouse. Here you see that this model has a dbt object that references model B, so model C depends on model B.

Now, dbt by itself is not a database; it's just a transformation tool, so you need to connect it to a database or data warehouse. And for Python models, not every dbt database is supported. Snowflake supports it and Databricks supports it, both natively. BigQuery supports it, kind of: you have to set up a Dataproc cluster to run it, and if you know how to do that, awesome; if you don't, well, it's an adventure. DuckDB supports it, but you have to run it on your local machine.

And then there's this thing called dbt-fal. dbt-fal is our open-source package, and what it allows you to do is run Python models on almost all dbt adapters. When you install dbt-fal, it does the Python transformations locally, on your computer, not in the data warehouse: dbt-fal downloads the data from the data warehouse, does the transformation, and then uploads the result back to the warehouse automatically.

This is how we define a profiles.yml; this is the credentials file in the dbt project (sketched below as well). Here you see we have a staging_db: these are the credentials for our staging database, a Postgres database that I have on my computer. And then there's another credential definition, of type fal, where we're basically telling dbt that we're going to use fal together with Postgres: fal will do the Python transformations and Postgres will do the SQL transformations.

All right, so we have a test project. This is the project I was talking about: we have a shop, we need to make predictions on the orders, and we need to be able to persist those predictions in our data warehouse. This is what the GitHub repo looks like. We have some requirements to set it up: dbt-postgres, dbt-fal, and Jupyter notebook, because we need to be training and predicting things.
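To make the SQL side concrete, a minimal dbt SQL model of the kind I showed looks roughly like this; the model and column names are made up for illustration:

```sql
-- models/model_d.sql: a minimal dbt SQL model.
-- ref() is the Jinja call that wires up the DAG: model_d now
-- depends on model_c, so dbt computes model_c first.
select
    order_id,
    total_price
from {{ ref('model_c') }}
where total_price > 0
```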
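And a Python model in the same spirit, again a sketch with made-up names. The function must be called model, and whatever DataFrame it returns is what gets persisted:

```python
# models/model_c.py: a minimal dbt Python model (illustrative names).
def model(dbt, session):
    # dbt.ref() plays the same role as ref() in SQL:
    # model_c now depends on model_b in the DAG.
    df = dbt.ref("model_b")

    # the transformation itself: any pandas logic is fine here
    df = df.dropna()

    # the returned DataFrame is persisted as a table in the warehouse
    return df
```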
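The profiles.yml pairing I described, fal on top of Postgres, looks roughly like this; treat the exact keys and values as a sketch rather than my real config:

```yaml
# profiles.yml (sketch): fal runs the Python models,
# Postgres runs the SQL models.
shop_project:
  target: staging_with_fal
  outputs:
    staging_db:
      type: postgres
      host: localhost
      port: 5432
      user: postgres
      password: "{{ env_var('PGPASSWORD') }}"
      dbname: shop
      schema: public
      threads: 1
    staging_with_fal:
      type: fal
      db_profile: staging_db   # delegate SQL models to the Postgres target
```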
Okay, so this is our models directory. Inside the dbt project... the code... this code... oh, the QR code, sorry. It has to do the animations first. There we go. I think there's one more person scanning... okay, awesome.

So we have a models directory, and this is where we actually define the transformations. As you can see, we have four models here. Don't worry about the bottom two; those are what we are actually building here. The top two are models that are already in our data warehouse, something that either we or somebody else already made: a customer_orders model and a customer_orders_labeled model. As you might guess, customer_orders_labeled is a table of customer orders labeled with whether or not they were returned within seven days, and customer_orders doesn't have that label.

Now, dbt, as I said, comes with a CLI, so all I have to do to run a calculation on these SQL models is dbt run; with --profiles-dir I tell it where my credentials are, and with --select I tell it which models to calculate. dbt (it's a Python package) talks to Postgres and runs the models in the order it knows from the references.

Now we go into the Jupyter notebook. This is the same project; I just opened a Jupyter notebook in it. Here we have "from fal import FalDbt". Again, this comes from us, from the dbt-fal package; it's an unofficial dbt Python SDK. You create this FalDbt object and tell it where the credentials and the project directory are, and once you've done that, the FalDbt object is aware of your entire dbt project context. What this means is that you can download your dbt tables as DataFrames into your Python runtime. Here, for example, I'm getting the customer_orders_labeled table, and you see I have a DataFrame with a return column that is one or zero, telling us whether or not the order was returned within seven days.

Since it's just a DataFrame, I can look at its statistics, just to see what the distribution is and how many distinct values there are. This is completely generated data, so it might show some AI bias; don't judge. So this is the data we have: total price and age. The red dots are the orders that were returned, the blue dots are the orders that were not returned, and as you can see there's some correlation between price and age: when a customer is younger and the price is higher, orders tend to be returned more often. So a very simple logistic regression would be able to give you the prediction.

We can do this straight in the notebook, of course, because we're in a Python runtime. Here I'm just taking a simple logistic regression, instantiating it, training a model with it, and then looking at the classification report to see how good the model was. In just these few lines you can train a simple ML model, and we have an accuracy here: the accuracy line says 0.84, which is 84 percent accuracy for this model on the given data.
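For reference, the notebook setup looks roughly like this, assuming the FalDbt SDK as it was at the time and my project layout:

```python
# Sketch of the notebook setup: pull a dbt table down as a DataFrame.
from fal import FalDbt

faldbt = FalDbt(project_dir=".", profiles_dir="~/.dbt")

# ref() gives you the dbt model as a pandas DataFrame
orders = faldbt.ref("customer_orders_labeled")
orders.describe()        # quick look at the distributions
orders["return"].mean()  # share of orders returned within 7 days
```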
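And the training itself, continuing from the DataFrame above; the column names are my assumption about the demo data:

```python
# Sketch of the notebook training step with scikit-learn.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X = orders[["total_price", "age"]]
y = orders["return"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

clf = LogisticRegression()
clf.fit(X_train, y_train)

# classification_report prints precision/recall/accuracy;
# in the demo the accuracy came out around 0.84
print(classification_report(y_test, clf.predict(X_test)))
```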
So at this point I would go to my boss and be like, hey, is 84 percent good enough? And they would be like, it's better than 60, go ahead and ship the MVP. This is the kind of thing you can improve over time; what's important for me, most of the time, is to get an MVP out so that we have something to improve.

Okay, so now that we have that, let's put it in actual dbt so that it can run automatically with dbt run. I have a model here, and in the model I'm calling a function called train_ml_model. This is the function I defined in my notebook; I just have to port it into a Python file, which is a dbt model. So this is the train_ml_model function, and you see it takes a features DataFrame and a labels DataFrame. It's exactly the same thing we were doing in the notebook, except now we take some metadata off the model we trained with scikit-learn and store that metadata in our dbt table, so that it's persisted in our data warehouse and we can come back to it whenever we want. We also store the model weights: we tell it the location where to store them, create a pickle file, and put it in the ML models home directory. Our model weights are now persisted, so we have a mapping between our model metadata and our model weights.

So we can just run this. First we tell it where to store the models, and then we run our dbt model; it runs, does a quick model fitting, and it's done. This means that in the ML models directory I now have a pickle file, d48-something. This is the model I just trained.

Now that I have a model, I have to make a prediction and then store the prediction in the data warehouse. First I do some experimentation. I go back to my notebook, and here I open the customer_orders table, the one that does not have a label. This is raw data that is not labeled; we don't know if the orders are going to be returned or not, and my objective is to make a prediction with our model. And here, this is now a dbt table, and you see it has a model named d48: this is the model I just trained and pickled. (I mean, I trained it in April, because I prepared this presentation in April, but it's okay.)

Now I write a couple of lines of Python just to say: pick the model with the highest accuracy, load it from local storage, and run a prediction using that loaded model. We do a prediction, and then this orders_new_df_sample is the sample of our DataFrame that now contains our predictions. We see that we have a predicted_return column with zeros and ones, and we can go ahead and plot it. It's plotted, and it looks about right: the colored dots are approximately where they're supposed to be. So with this we could be happy enough to put it in a dbt model, so that, again, it can run automatically.
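Put together, the training step ported into a dbt Python model could look something like this. This is a sketch under my assumptions (the names train_ml_model and ML_MODELS_HOME, and the metadata columns), not the exact code from the repo:

```python
# models/train_ml_model.py (sketch): train, persist the weights as a
# pickle, and return the metadata as the dbt table.
import os
import pickle
import uuid

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def train_ml_model(features: pd.DataFrame, labels: pd.Series) -> pd.DataFrame:
    X_train, X_test, y_train, y_test = train_test_split(
        features, labels, test_size=0.2
    )
    clf = LogisticRegression().fit(X_train, y_train)
    accuracy = accuracy_score(y_test, clf.predict(X_test))

    # persist the weights, keyed by a short model id (like the d48... file)
    model_id = uuid.uuid4().hex[:8]
    models_home = os.environ.get("ML_MODELS_HOME", "./ml_models")
    os.makedirs(models_home, exist_ok=True)
    with open(os.path.join(models_home, f"{model_id}.pkl"), "wb") as f:
        pickle.dump(clf, f)

    # the returned DataFrame is what dbt persists: metadata only,
    # mapping the model name to its accuracy (and so to its weights file)
    return pd.DataFrame([{"model_name": model_id, "accuracy": accuracy}])


def model(dbt, session):
    orders = dbt.ref("customer_orders_labeled")
    return train_ml_model(orders[["total_price", "age"]], orders["return"])
```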
And this is it: this is the prediction model definition. As you can see, we have two functions here. model is the dbt-specific function that tells dbt to run the transformation, and all it does is call the make_prediction function; and in make_prediction we're doing the same thing we did in the notebook: finding the best model, opening the pickled model definition and model weights, and running the prediction on the dbt table. So now we run dbt on this; it runs the Python and saves the result, and now I have a new table called order_return_predictions. This is a dbt table that's now in our data warehouse, and our predicted returns are in it.

Okay. Now there's a problem: dbt-fal runs Python models on our local computer, and I cannot deploy my local computer to run the dbt models for the company. So I have to put them somewhere, and you can put them anywhere: on GitHub Actions, in Docker on DigitalOcean or wherever. But we have fal serverless, our serverless offering; it's a product that we have, and what it lets you do is import this decorator called isolated. When you decorate a function with the isolated decorator, it runs on the serverless platform, fal serverless. You can set the requirements and you can set the machine type, and what happens when it runs the function is that it cold-starts a machine, cold-starts a process, installs the requirements if it needs to, runs the function, and then kills the machine. I'm doing the same thing here for the other model. This means that when these models are calculated, they can run on a GPU machine, or on an extra-large machine, depending on how much data and how much training you have; you control the machine size.

With those defined, I now have to tell it where to store the weights, because fal serverless stores weights in a serverless directory; this is a persistent directory. And now I can just do dbt run, and this is the training. Like I said, it takes a bit more time because it needs to start the machine and run the thing. There, it did it, and then it turns the machine off. That was training; now we do the predictions, and we're doing the same thing. Again it takes a little bit of time, because if the requirements are different, a different process has to start. And now that's done, and we can be happy.

This means we can now test everything end to end: if I remove all the select statements, it runs the SQL tables and the Python tables all together in the correct DAG, and we have the end-to-end solution. We did the training and we did the prediction.
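The prediction model, sketched the same way; make_prediction mirrors the notebook code, and the table and column names are assumptions:

```python
# models/order_return_predictions.py (sketch): pick the best model,
# load its pickled weights, and predict on the unlabeled orders.
import os
import pickle

import pandas as pd


def make_prediction(models_meta: pd.DataFrame, orders: pd.DataFrame) -> pd.DataFrame:
    # pick the model with the highest stored accuracy
    best = models_meta.sort_values("accuracy", ascending=False).iloc[0]

    models_home = os.environ.get("ML_MODELS_HOME", "./ml_models")
    with open(os.path.join(models_home, f"{best['model_name']}.pkl"), "rb") as f:
        clf = pickle.load(f)

    orders = orders.copy()
    orders["predicted_return"] = clf.predict(orders[["total_price", "age"]])
    return orders


def model(dbt, session):
    return make_prediction(dbt.ref("train_ml_model"), dbt.ref("customer_orders"))
```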
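And moving the heavy lifting off my laptop is mostly a matter of adding the decorator. A sketch, assuming the fal-serverless SDK as it was at the time; the machine-type string is illustrative:

```python
# Sketch: the same training function, cold-started on fal serverless.
from fal_serverless import isolated

@isolated(requirements=["scikit-learn", "pandas"], machine_type="M")
def train_ml_model(features, labels):
    # same body as in the dbt model above; it now runs in a fresh
    # process on a serverless machine that is started, used, and then
    # shut down. Swap machine_type for a GPU or XL machine if needed.
    ...
```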
Okay, so now I have an end-to-end pipeline, and it's time to deploy it somewhere so that it runs regularly without me having to press a button. For this, my favorite tool is actually GitHub Actions.

First, of course, I need to set the credentials, because it's not going to run against the Postgres database on my laptop. We have a Redshift cluster on AWS, and this is where we store our production data. So we have the dbt-fal and dbt-redshift adapters, and in our credentials we have to make sure we have production credentials as well, pointing at Redshift.

With that set, we can actually define a GitHub Action. We just have to create this .github/workflows directory, and there you can define a YAML file (there's a sketch of it below). In the YAML file I'm saying under which conditions it should be executed: here we have pull requests, a cron schedule, and workflow_dispatch, which is actually pressing a button. When the job starts, GitHub starts a machine and runs the process, and you see at the bottom we have dbt run: we tell it where the profiles are and we run with --target production, so that it writes to our production database. Once this runs, we're done: we have a batch-level ML pipeline that (a) trains ML models, (b) selects the best model out of the trained models, and (c) makes predictions with the best model.

But as I mentioned, our bosses actually want something extra: they want live predictions. And this is something you can do with fal serverless as well. But first, how would you even do a live prediction? To do a live prediction you have to take one row, a single row with the column values, and create a fake DataFrame out of it, just so that we can give it to the model; then the model does a prediction, and we return whether the prediction is positive or negative. Here we test this particular function: we give it different values and it predicts either true or false. This is our ML model making those predictions.

So we have this web_endpoints directory here, with a simple Python file where we basically have the same function. We have some environment variables where we tell it which model to use and where to find it, and the function itself does the same thing: it takes age and total price as arguments, opens the pickled model weights, does a prediction, and returns a dictionary saying "prediction" and the value, zero or one.

Our fal serverless package comes with a CLI tool. First I tell it where to find the models and which model to use, and then, as you can see, we have this command called fal-serverless function serve. We tell it which file to serve and which function to serve (will_return is the function name), and I give it an alias so that I remember what the function actually does. When I call this, it gives me a URL: the URL of that function, which is now an app, a cold-startable app. This means I can curl it, and when I do, it starts the machine, installs the requirements if it needs to, and returns the result. You have to set up credentials, and I'm passing a simple JSON that has age and total price as keys and values. So I run it, and it returns prediction: 0. I can try it again; now it should be faster, because the machine is already started, or maybe it's just not off yet. It's prediction zero again. I change the values, and now it's prediction one. So now we have a web endpoint that we can call; it loads the model we told it to and returns predictions live. We can hand it to a front-end engineer, and they can just call this endpoint and get the information they need.
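For completeness, the workflow file from the deployment step looks roughly like this; the names and the schedule are illustrative, and the credentials would come from repository secrets:

```yaml
# .github/workflows/dbt_run.yml (sketch): run the whole pipeline on
# pull requests, on a nightly schedule, or at the press of a button.
name: dbt-pipeline
on:
  pull_request:
  workflow_dispatch:          # the "press a button" trigger
  schedule:
    - cron: "0 22 * * *"      # end of day, every day
jobs:
  dbt-run:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - run: pip install dbt-redshift dbt-fal
      - name: Run the full DAG against production
        run: dbt run --profiles-dir . --target production
        env:
          REDSHIFT_PASSWORD: ${{ secrets.REDSHIFT_PASSWORD }}
```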
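And the served function in web_endpoints, again as a sketch: will_return is the function name from the demo, but the environment variable names and the exact CLI flags are my assumptions.

```python
# web_endpoints/return_prediction.py (sketch): one-row live prediction.
import os
import pickle

import pandas as pd
from fal_serverless import isolated


@isolated(requirements=["scikit-learn", "pandas"])
def will_return(age: int, total_price: float) -> dict:
    # environment variables say which pickled model to load, and from where
    models_home = os.environ["ML_MODELS_HOME"]
    model_name = os.environ["MODEL_NAME"]
    with open(os.path.join(models_home, f"{model_name}.pkl"), "rb") as f:
        clf = pickle.load(f)

    # fake one-row DataFrame so the model sees the shape it was trained on
    row = pd.DataFrame([{"total_price": total_price, "age": age}])
    return {"prediction": int(clf.predict(row)[0])}
```

Served with something like fal-serverless function serve web_endpoints/return_prediction.py will_return --alias return-prediction, and then called with curl and a JSON body like {"age": 25, "total_price": 150.0}, it returns the prediction payload the front end needs.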
All right, so in summary: we have a dbt pipeline that trains models, we have a dbt pipeline that predicts on batch data, we have an automated workflow that runs the whole batch process, we can serve our models by API, and everything is in a single dbt project, a single repository. You don't have to run around different places figuring out what does what; everything is in one place.

I'm Meder, from Features and Labels. We're a company focusing right now on serverless Python, and I'm very happy to be here. This is our team. If you have any questions about fal serverless, or about dbt, or about anything else, don't hesitate to find me. Thank you.

Host: Thank you, Meder, for your talk, it was great. If you have any questions, you can use the microphones; we have almost five minutes.

Q: A quick and unrelated question: the mouse cursor was moving while you were talking. What black magic was that? Was someone else controlling it?

A: I have a helper... no, no, it's a recording. That's why it said it was in March... I mean, in June.

Q: Thank you for the talk. I just wanted to ask: do I always need to return only a pandas DataFrame, or can I use something else instead of pandas?

A: That's a good question. For dbt-fal, it has to be a pandas DataFrame right now. But why do you ask; what's the alternative?

Q: So I can perform transformations using, I don't know, Polars or something like that.

A: The model has to return a pandas DataFrame in the end; that's the requirement right now.

Q: So inside, it can be Polars data or a Spark DataFrame?

A: I think so, yes.

Q: Okay, thank you.

Q: Thanks very much for the talk, one question. Do you support the entire machine learning model lifecycle, such as monitoring, data drift, model drift? And do you have a model registry?

A: We do, yes. We have a model registry, and we have models that can detect those things, but we also provide the platform to make it easy for you to deploy them. So if you have a model in mind that you want to use, and it's pickleable and loadable onto the fal serverless platform, it will run without an issue.

Q: And, for example, in the GitHub Actions pipeline you showed, is there any way to ensure that if a model doesn't reach a certain accuracy, it won't run in production, so it gets cancelled?

A: That's something dbt can handle on its own: if a parent model fails, none of its children are executed. Is that the question?

Q: Yeah, kind of: if the training stage does not reach a certain threshold, you want to cancel the entire run.

A: Yes, you can fail the training model, and the child models will not run in that case.

Host: You can find Meder outside to ask more questions, or on Discord. Thank you again.