Hello, everyone. My name is Tom. I'm a data engineer and architect at QuantumBlack, and this talk is Writing and Scaling Collaborative Data Pipelines with Kedro. So let's get started with a little bit about myself. I've been doing data engineering for quite some time. I started out at Palantir back in 2013, which was actually before data engineering really became a thing, so I've been doing it for as long as data engineering has existed in the mainstream, I suppose. I also have hobbies of doing yoga as well as teaching meditation, so if anyone is curious about yoga or meditation, I'm very happy to talk about those things here. Here I am doing a yoga pose balanced on two fingers. I also run a YouTube channel that talks about Kedro; more information about that later. So the talk itself, Writing and Scaling Collaborative Data Pipelines with Kedro, could also be called architecting your pipelines with the Kedro framework. As I mentioned, data engineering has only really become a thing in recent years. The truth is that it's a relatively new discipline, and since it's a relatively new discipline, there isn't much around it in terms of best practices or established methodologies. As a result, the data pipelines we write and the things we do go all over the place. This is something that we at QuantumBlack observed and wanted to address with Kedro. Really quickly, I also want to mention that my colleagues are in the Discord chat room at #talk-kedro, and one of them, Yetunde, is the product manager of Kedro.
So if you have any questions, feel free to ask in the Discord itself; that way they can see your questions and perhaps answer them better than I can. Okay, let me go to the next one. I have a little bit of a demo right away for us to play with, so let's talk about pipelines. With pipelines themselves it's a little easier to visualize some of these things, and we can begin to describe why data pipelines grow out of control, what the contributing factors are, and how we can rein that in. So let me stop this share really quickly and then reshare one of my other screens here. Okay, so you should be able to see a Chrome window — is that true? Yes. Here I have the demo. It should be popping up in just a second... here it is. What I have here is a modification of a data pipeline visualization tool that we use, called Kedro-Viz. It's a very powerful tool when coupled with Kedro, and in this case I've customized it to demonstrate how data pipelines can grow and change over time. So here is your typical data pipeline. It already is going all over the place, right? The real question is: how did we get here? Why is it so all over the place? And what can we do about it? Data pipelines, before they get to this kind of state, always start out very small. You have a data source, and then you have a cleaning function or some kind of transformation function on top of that data source. Once you run your transformation function, you get an output. In this case we have a clean-iris step, which cleans this iris data and then outputs the cleaned iris data. And once you have this formatted iris data, you want to begin your analysis.
So you have another function that does your analysis, which then outputs your iris analysis. This is usually how data pipelines start out: very simple, very typical. But as soon as you want to expand any of the work involved, things start to branch out. You might want to take your iris data and split it into training sets and test sets so you can try to predict some attributes of the data itself. You feed it through your modeling engine, you feed it through your accuracy-reporting mechanisms, and you can see how things start to expand. And this is just for one data source. What happens when several others come into the picture? You have your companies, your shuttles; you also want to do this very similar cleaning, processing, and saving of that data. And then finally, you're going to want to connect everything together, and this is where the mess starts to happen. The truth is that this process is very similar to how software also sprawls and grows. But because of how new data pipelines are to the software world, we don't really have established methodologies for controlling and constraining the way pipelines grow. As a result, growth becomes almost arrested: there's a plateau on how big and how powerful our data pipelines can become, and they never grow past the stage you see here. But thanks to Kedro as a framework for growing, scaling, and writing these data pipelines, you can go from something like this to something even larger. Here is actually another typical pipeline that you'd see on a project, and I myself have worked on projects where we have pipelines that are quite literally ten times the size, with a developer count in the dozens.
And because of the way that Kedro works, we're still able to collaborate effectively and properly maintain that data pipeline. So Kedro is a very powerful framework for how we architect things, and we're going to go into those details now. Let me go ahead and share my slides again. Okay, so now that we've talked about the pipelines themselves: how do pipelines grow? Pipelines grow because they're centered around our teams. We have teams of data engineers as well as data scientists, and the truth is that the typical data engineer and the typical data scientist are not necessarily super compatible, right? They're almost opposites in many ways. You'll see that data scientists may not be engineers, for example. They have amazing knowledge of particular modeling topics and ways to transform data, but they might not necessarily have those engineering practices in their tool belt. Furthermore, data science as a discipline tends towards experimentation, and this requires having faster data pipelines, or rather being closer to the data. You want to be as close to the data as possible when you're a data scientist, in order to manipulate, experiment, and play with the data. On the other side of the spectrum, data engineers have the opposite problems. They might not be scientists; they might have more engineering backgrounds. So when you're a data engineer, you might not know a lot about how someone can model pipelines or transform data. What you're most interested in is order: you want to keep things neat and tidy, and you want to abstract things away as much as possible in order to sustain robust stability. And the one problem shared by both of them is that they both still must clean the data. Data cleaning is a whole topic by itself.
But it's something that is inescapable. It's quite interesting, because the way that you clean the data changes the way that you analyze the data too. So this is a collaborative problem that both of these team members must tackle together: if the data engineer doesn't clean the data in the correct way, and the data scientist instead gets the data in an incorrectly cleaned format, then it's hard for them to do their analysis, and it's hard for the data engineer to take any analysis onward from the data scientist. So cleaning and cleansing is definitely an important portion of this. And finally, what you'll see is that data pipelines are almost always asked to be in production right now. The reason you write these data pipelines and do these analyses is because the business wants facts; they want insights; they want you to get things out as soon as possible and give them over immediately. So these are the problems centered around data pipelines. Can we find a balance between these two roles? How can we collaborate as data engineers and data scientists in a way that doesn't introduce so much friction and so much process, but still maintains the fluidity that the data scientists need, as well as what the data engineers need to keep their pipeline stable and still agile? And that's just the immediate need. The final need is: is it ready for the handoff? One day those data engineers and data scientists may not be working on that data pipeline anymore. Is that pipeline ready to be handed off to another team to pick up the baton? I think this one is actually the most fascinating to solve, because when you think about it, when you hand off a data pipeline, you're handing it off to another person, right? Someone down the line who has no knowledge of your data pipeline.
The truth is that that person can actually still be you. Because in the future — two, three, four months from now — you'll find that you might not remember anything about your data pipeline. When you come back to it, you look at it and have no idea what's going on. So effectively, you yourself can be that future person you're handing the baton off to. This is your past self and your future self: is your past self really creating a pipeline that your future self can handle? I think that's something we definitely need to keep track of. These problems are things that QuantumBlack discovered in their work. QuantumBlack, if you're unfamiliar, is a startup that came out of London, very famous for some of the cases they worked on in terms of data analysis for F1 cars. McKinsey picked them up and brought them in as the data science and data engineering arm of the firm. QuantumBlack does hundreds of these projects all over the world — data pipelines, data science, data analysis — and they found that they kept running into similar patterns across all of these different projects. So what they did was try to codify these processes and bring them back into Kedro, in order to make it easier not only for the data scientists and data engineers to work together, but also to allow handoff to the clients we would work with. QB open-sourced Kedro last year, and it's been growing ever since. So, why does Kedro exist? It really is that collective learning, trying to deliver those applications. And our product mission, I think, is really fantastic: it's this empathetic intention. How can we tweak our workflows so that our coding practices are the same?
I like this word "empathy" here, because it really is important to think about code in a way that allows other people to help you with the code and you to help other people with the code. I think of this as almost altruistic programming, where the way you write the code is not for your own selfish intention to get things done right now, but really for the benefit of the people who are going to be reading that code later on. What I've found is that the return on investment of making the code readable, maintainable, and understandable is really beneficial, and it pays dividends later down the line. Okay, so how does Kedro solve these problems? Let's think about how data pipelines are really set up and how we can break them down. Let's take an example: let's imagine audio as data — literally your audio signals. We can think of those as things we would want to push through data pipelines. For audio, you have standard inputs and outputs: standard mechanisms that can take input from the environment and then output things somewhere else. We have microphones, we have amplifiers, we have compressors — there's a lot of technology that goes into audio engineering, and we want those inputs and outputs to be standard. Next, we also want to have transforming mechanisms. The ones I mentioned earlier — compressors, for example, or mixing boards where you can modify the equalization of the audio — will transform the audio to suit your needs, with your low-pass filters, your high-pass filters, and so on.
And here is something that is often overlooked, and I would argue is one of the most important parts, if not the most important: you want to be able to redirect your output. This is very similar to abstracting your API: we want to make sure that each component in our audio system can easily talk with the other components. When you have a microphone input, you want to be able to put that microphone's output either into your mixing board, into those compressors, into those filters, and so on. Having the ability to plug and play different portions is really vital. And finally, you want a convention for organization. How do we structure our audio engineering setup in order to do our work most effectively? Here we have a setup: you've got your microphones here, your audio mixers here, your computer over here. You know where everything is, and because you know where everything is, you know where to find things, where to adjust things, and how to tweak things as you desire. Now let's map that back onto pipeline building. We have a quick demo here regarding Kedro, so let me pull up — where is my mouse? My mouse has disappeared. There it is. Okay, we're just going to stop the share really quickly, make sure I have this in place, and then share this screen once more. Okay, great. So here we have our CLI, and inside of the CLI it's very simple to get started with Kedro. Actually, before we do that, let me share a different screen, which shows an example of what data pipelines can look like, just as a raw example. So here we go — this is the one here. Okay, I think that should be showing up. Do you see a Jupyter notebook? I think that's available now.
Okay, so here we have an example pipeline in which we're breaking down iris data. The truth is, this is what you see in a typical data pipeline: a single Jupyter notebook. Oh, this is the wrong one — this is the one here. That's it. Okay. You have a single Jupyter notebook and you're suddenly inundated with a whole bunch of code. There's a lot going on and you don't really know how to approach it, right? If you're lucky, you're going to find that there are functions, and things are broken down in ways you can understand. But more often than not, you're going to find notebooks that just have code splatted on there, all doing different things, and it's difficult to trace and understand. So what usually happens is that you go through, you look at the notebook, you begin to run the notebook, and then you hope that it works, right? And of course what will happen is that you're missing some data, you don't know where things are, you don't know how things are tweaked, you don't know where parameters sit. That's an unfortunate thing that happens here. But we can start to break this down into those four components we mentioned earlier. The first one is standardized inputs and outputs. Right here we have an input, and here we have these outputs. Pandas DataFrames are, of course, our well-beloved mechanism for building our pipelines, and they come with different reading and writing mechanisms — so this is an example of standardized input and output. You can use this to interface with your system, to read in CSVs and then output CSVs as you please. That's the standardization there. Then we have our transformations, like this split_data function.
This split_data will take a DataFrame and split it out into train X, train Y, test X, and test Y — in modeling, of course, this is so we can train the model and then test it on held-out data. This right here can be considered a transformation. These are actually pure functions: they take an input, transform it, and give you an output, with no side effects. Next, we have these three in a row, and this is where notebooks as pipelines break down. How do we start to string all of these different DataFrames together? Where are we getting the inputs, where are the outputs going? It becomes hard to figure out where things are coming from; you have to pray that someone has named your variables correctly, and you have to manually trace things in this manner. This is where things break down completely, because there is literally no convention when you set up your notebooks and your normal data pipelines. So now let me show you what this pipeline looks like inside of Kedro. Kedro comes with a lot of really great command-line functions. Let me share this screen. To start a Kedro project, first you install Kedro — very easy, you can get it with pip install kedro. That'll go to the Python Package Index, grab it, download it, install it. Very simple and straightforward. Then you can use kedro new to create a new pipeline. This allows you to name your pipeline and create the project. So we're just going to quickly say EuroPython 2020 as the Kedro project name, and ep as the package name. We will generate the example pipeline. And there we go.
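To make the "pure function" point concrete, here is a minimal sketch of what a split_data transformation looks like as a pure function. This is not the notebook's actual code — the row layout, the split_ratio default, and the seed are illustrative assumptions — but it shows the key property: the input is copied, not mutated, and the output depends only on the arguments.

```python
import random

def split_data(rows, split_ratio=0.2, seed=42):
    """Pure function: shuffle a *copy* of the input rows and split them
    into train and test sets. No side effects on the caller's data."""
    shuffled = list(rows)  # copy, so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * split_ratio)
    test, train = shuffled[:n_test], shuffled[n_test:]
    # Separate features (all but the last column) from the target (last column).
    train_x = [r[:-1] for r in train]
    train_y = [r[-1] for r in train]
    test_x = [r[:-1] for r in test]
    test_y = [r[-1] for r in test]
    return train_x, train_y, test_x, test_y

# Ten tiny rows of (feature, feature, label), purely for demonstration.
rows = [(i, i * 2, i % 2) for i in range(10)]
train_x, train_y, test_x, test_y = split_data(rows, split_ratio=0.2)
print(len(train_x), len(test_x))  # 8 2
```

Because the function has no side effects, a framework can freely reorder, cache, or parallelize calls to it — which is exactly what makes such nodes composable into a pipeline.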
We've already created our pipeline. Let's open this up — I'll open it in PyCharm. Something that's really cool, and I'll show you a little bit later if we have time, is that Kedro now also comes with starter templates, which let you modify those initial conventions Kedro creates. But here I'm just going to show you what Kedro comes with out of the box. You can see here on the left a few different things going on: a configuration folder, a source folder, and then these data, logs, docs, and notebooks folders. All of these relate to that original template question we were talking about: how do we organize our pipeline? Right from the get-go, Kedro gives you an example of how to organize things, and it helps us with our separation of concerns. In the previous example, we didn't really know where data was going, where data was coming from, or how we were reading the data. Not only that, but everything inside of that previous pipeline was hard-coded. Let me share this desktop again and show you that example. Right here, for example, the read_csv path is hard-coded; these train-one, train-two, test-one, test-two file names are hard-coded. And then there are also parameters inside of our functions — we have our test data ratio, which is hard-coded here. All these things are hard-coded, and as a result pipelines become very difficult to maintain, because you don't know where anything is. But thanks to Kedro, we make that easier. We put our configuration inside of this configuration folder, and then we can instead parameterize how we find our data sets.
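The move from hard-coded values to configuration can be sketched in a few lines. This is not Kedro's actual config loader — the catalog entries, file names, and the test_data_ratio key below are hypothetical stand-ins — but it shows the separation: code asks for things by name, and one well-known place answers where they live.

```python
# Before: values buried inside the notebook code itself, e.g.
#   df = read_csv("data/iris_train_1.csv")   # hard-coded path
#   split_data(df, 0.2)                      # hard-coded ratio

# After: everything tweakable lives in one well-known place.
CATALOG = {
    "iris_train": {"type": "csv", "filepath": "data/iris_train_1.csv"},
}
PARAMETERS = {
    "test_data_ratio": 0.2,
}

def describe(name):
    """Resolve a dataset name to its configured location and format."""
    entry = CATALOG[name]
    return f"reading {entry['filepath']} as {entry['type']}"

print(describe("iris_train"))        # reading data/iris_train_1.csv as csv
print(PARAMETERS["test_data_ratio"])  # 0.2
```

In a real Kedro project the catalog and parameters live in the conf folder as files rather than Python dicts, so changing a path or a ratio never requires touching pipeline code.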
So here we have the location of the data set and how we want to read it — we're just using pandas to read in that CSV file. And then we use Kedro itself to bring things together. In that other version we didn't really know what the relationships were or how they were built; here, we instead have these pipelines, and inside of our pipelines we can start to see the relationships. We see those same examples — train, model, predict — taking these inputs and these outputs and putting them through the pipeline. And we can even visualize this. Because everything is standardized, we can programmatically extract the pipeline itself and present it to the user. If we go to this web page, we can see the Kedro visualization. Suddenly you begin to understand how your pipeline is built; it becomes easier to explore and easier to understand, because we are separating things into these different concerns. I only have a few minutes left, so let me rush through these really quickly. We have the catalog of standardized inputs — we support a lot of different inputs and outputs. We have the nodes and pipelines: the nodes are the transformations themselves, and the pipelines pull things together. The configuration is where we keep the variables we would otherwise be hard-coding, inside one folder so you know where they are. And then the project template is the standardization of everything together. Once we employed Kedro, we found consistent time to production, reusable analytics code, increased collaboration, and even upskilling of developers who otherwise would not have exposure to these kinds of software practices.
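The reason the pipeline can be "programmatically extracted" is that every node declares its inputs and outputs by name. The sketch below is not Kedro's actual Pipeline class — the node structure and runner are stripped-down stand-ins — but it shows the idea: once the wiring is data, a framework can resolve execution order itself, and a tool like Kedro-Viz can draw the graph without running anything.

```python
def clean(raw):
    """Drop missing values."""
    return [r for r in raw if r is not None]

def analyze(cleaned):
    """Compute a simple summary statistic."""
    return sum(cleaned) / len(cleaned)

# Each node declares its function plus *named* inputs and outputs.
NODES = [
    {"func": clean, "inputs": ["raw_iris"], "outputs": "clean_iris"},
    {"func": analyze, "inputs": ["clean_iris"], "outputs": "iris_mean"},
]

def run(nodes, catalog):
    """Resolve nodes in dependency order purely from the declared names."""
    pending = list(nodes)
    while pending:
        ready = [n for n in pending if all(i in catalog for i in n["inputs"])]
        if not ready:
            raise ValueError("unsatisfiable inputs — cycle or missing dataset")
        for n in ready:
            args = [catalog[i] for i in n["inputs"]]
            catalog[n["outputs"]] = n["func"](*args)
            pending.remove(n)
    return catalog

catalog = run(NODES, {"raw_iris": [1, None, 2, 3, None]})
print(catalog["iris_mean"])  # 2.0
```

Note that the node list never says "run clean first": the order falls out of the names alone, which is also what lets you visualize or re-wire the graph without touching the functions.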
So of course: pip install kedro. We can visualize, and we actually have some deployment mechanisms built in — Kedro-Docker, Kedro-Airflow — so you can immediately deploy these pipelines as they are. And we have a great support team. We have our own Slack channel, and then Stack Overflow, Read the Docs, and our GitHub, which all route back to our Slack channel. We also have a budding community. I run a YouTube channel called Data Engineer One where I mainly talk about Kedro, so if you'd like to learn a little more, I've got a ton of videos there. I think we have about two more minutes left, so why don't we open the floor for a few questions. I probably rushed through that last bit, but I'm sure you have a lot of questions and there's a lot of great stuff to talk about with Kedro. You can find us inside the talk-kedro Discord channel, where, again, the product manager as well as one of our tech evangelists are available to talk about Kedro a little more. Going through this final example here: normally the hardest part of a pipeline is figuring out how to run it. Thanks to Kedro, we have the ability to simply move into the directory and type kedro run, and this will run our Kedro pipeline. This standardization allows anybody familiar with the Kedro framework to enter any other Kedro project, run it as they wish, and break it down and understand it as necessary. I think this is why the benefits of Kedro become evident as you begin to use it, expand your data pipelines, and collaborate with the rest of your team members. And I think that's time for me. Thank you very much for having me; I hope I can come back again and speak with you.
Alright, thank you so much. Technically we're right up against the clock here, but the closing session isn't until 20 minutes from now, and there's no one else behind you, so I don't think it's a problem. And here's one of the questions now. The first one is: is it possible to just use the Kedro-Viz feature? Yes, absolutely. Kedro-Viz by itself is an open-source library that you can use with any kind of node or graph structure. It takes a simple JSON file which shows the links between different nodes, plus a list of nodes. In fact, we have a React component, which means you can embed that visualization into any kind of front end that uses React — it's actually really quite cool. All right, cool. The other question I have here is: Kedro configuration — is it difficult or not? For the configuration itself, it's actually quite straightforward to set up the pipeline. As for the configuration of your nodes: inside of data science we have that split_data function, which had a hard-coded value for what the split ratio was — here inside this example Jupyter notebook it's 0.2. The way Kedro works is that you give the pipeline the name of the data asset you wish to use. Here we have the iris data as a data asset, as well as the parameters — your parameters actually become data assets by themselves, which means you can keep all of your parameters inside a parameters configuration. Here our example test data ratio is written right here, and that means we don't need to change any hard-coded values.
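In Kedro, a node can list an input like "params:example_test_data_ratio" and the value is pulled from the parameters configuration instead of the data catalog — that prefix convention is real, though the lookup below is a simplified stand-in for Kedro's actual machinery, with hypothetical names.

```python
# Stand-in for conf/base/parameters.yml
PARAMETERS = {"test_data_ratio": 0.2}

def resolve(name, catalog):
    """Inputs prefixed with 'params:' come from configuration;
    everything else comes from the data catalog."""
    if name.startswith("params:"):
        return PARAMETERS[name.split(":", 1)[1]]
    return catalog[name]

catalog = {"iris_data": list(range(10))}
ratio = resolve("params:test_data_ratio", catalog)
data = resolve("iris_data", catalog)
print(ratio, len(data))  # 0.2 10
```

Because parameters resolve through the same naming mechanism as datasets, a node's signature never changes when a value moves from code into configuration.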
If we wish to update our parameters and rerun our pipeline — in this example I can easily change the ratio from 0.2 to 0.6, immediately rerun the pipeline, and get a different output. So it's very, very easy to handle the configuration. Excellent. Thank you so much. Anyone who wants to chat with Tom or any of the rest of the Kedro crew — the QuantumBlack crew — check out the talk-kedro chat room on Discord; they'll be hanging out there. And if I remember correctly, you also have a sprint tomorrow, don't you? Yeah, that's correct. We've opened it up for people to contribute and work on the Kedro project, and I myself have a lot of pull requests into Kedro. It's very easy and a great way to get started with open-source projects, because the community is very dedicated, and we're backed by a lot of great engineers in our London office. Awesome. So yeah, we are a little bit over time, everyone, so for these other two questions — Diego and Steven, I do see you — please do repost those over in the Kedro chat room. I really appreciate that. And everyone, thanks for joining us. Stay tuned: in about 15 minutes we're going to have the closing session. It's sad to see the end of another EuroPython, but all good things must come to an end, and I hope you've really enjoyed your time. Also, technically we're only past the talks section — we still have sprints over the next two days, so please do hang out for that. And after the closing session we have some fun at the after-party here in the Microsoft room: that is Word Peril, the Python word game where you win absolutely nothing. We'll be taking volunteers from the audience — maybe I'll see you there, Tom. Yeah, I don't need to go then. Anyway, anyone interested in joining in on that, it'll be starting at 21:30 in this channel. So see you all, hopefully, in the next 15 minutes for the closing session.