Our next speaker is Robson Jr., and he is online. Perfect. He works at Microsoft, where he is a developer, and he is involved with the software community, of course, especially the Python community. He will talk about data pipelines with Python: six years of lessons learned from mistakes. What is better than learning from mistakes? So please, Robson, start sharing your screen.

Okay, it's time to start. Thank you so much. Good morning, I'm talking from Berlin. My name is Robson. Thank you to the organization, and thank you to everybody who made this possible, especially online. It's hard at times; it's quite a challenge. I know how hard it is to organize a conference, especially online, where you have to keep everything on track and on time. So congratulations to the whole team; it seems to be a nice conference. It's my first time talking at EuroPython, so I'm quite happy to be here. Actually, I'm working at GitHub, which was bought by Microsoft in 2018, so basically I'm still inside Microsoft. Feel free to drop me a message any time, on Discord or on Zoom.

Let me introduce myself a bit. I'm originally from Brazil, and I moved to Europe four years ago. I have been a developer for more than 16 years. I learned programming with Python, and Python has followed me through my whole career; even when I transitioned to other technologies along the way, Python always followed me, and I'm quite happy about that. Feel free to reach out to me on Telegram. My Telegram is public, so if you have a question or want to talk, add me there: B-S-A-O-0. Or you can ping me on Twitter or, of course, on GitHub; it's the same username. All feedback is welcome, and I love to talk about everything.

What will I talk about today? It's very important to align expectations. Even though this is a fully Python conference, and of course we will talk about Python, about half of this presentation is about data products. Python is a keystone of data products, and that is why we are talking about Python here. But first of all, let's be clear: this is a beginner talk. I will present all my mistakes because, as I mentioned, I learned a lot from them; it helped me a lot, and now I think it's only fair to share that knowledge and give back to the conferences. It's not about code, and it's not about how to optimize code; it's about understanding the anatomy of a data product. A data product is not just data pipelines; you have to see the data product as a whole, and all these principles apply just the same. We will cover the concepts of the Lambda and Kappa architectures, which are very important to understand. We will cover the main qualities of a data pipeline: as with any other software in the world, a pipeline is software and should have the same qualities. And of course we will talk about Python. Python is the keystone, you glue everything together with Python, and we will talk about where Python matters.

My goal in this presentation is to help you start planning a great data product and to think as a data engineer: how to help your team of software engineers to produce, collect, store, process, and serve data at large scale.

The first thing is to understand the anatomy of the product. It's very important. When you are talking about a data pipeline or a data product, you need to understand which kinds of variables you are dealing with. Nowadays, modern software collects a bunch of data.
You can imagine that any kind of product today is collecting user behavior, telemetry, logs, and databases, and interacting through APIs, and you need to collect all those kinds of traces. It's very important. So there is the old concept of big data that follows us through the books and the theories, usually represented by four or five V's, depending on the author. Let's split them up.

One of these V's is volume. Volume means the quantity, the amount of data that you are collecting and storing. Another is variety, the types of data: JSON files, Parquet files, API calls, structured data, unstructured data. Imagine that you have images, tweets, web pages, logs, databases, and other things that need to be stored. In general, this kind of unstructured data is stored in something named a data lake. A data lake is basically where you store everything; you put everything there. Of course, you will have some techniques to organize the data in the data lake, but in general you put everything there and organize it, and that place is called the source of truth. From that place, you are able to collect your data and extract what we call datasets. For example, you are collecting a bunch of tweets, and you need just a slice of those tweets, by date or by username or something like that, so you extract that as a dataset.

In the middle, you have the processing. The processing is basically software as well; it's where the magic happens. Here you have veracity: can you trust your data? This is the most important question to ask. Can you rely on your data? Does your data represent something meaningful for the final user, for whoever needs to use this data? Then there is velocity: how fast is your software able to process all the data you need, and how frequently do you need to process and deliver this data? Imagine a bank that needs to process payments and run fraud detection. You will have machine learning models, or some kind of analysis, and you need to run it in near real time, or at least make that kind of decision in a short time. That is the V of velocity. And as another point, you have validity. It means your data represents exactly what you collected: after you process it, after you analyze it, your data is exactly what you were expecting. Not that the result itself is a good or a bad thing, but that the calculation, the analysis, is correct. You need to make sure of that, and it is basically about the code as well. As you can imagine, a data architecture is software, like any computer program: you have memory, you have files, you have inputs, you have functions and variables that the software deals with, and they are processed as part of the process.

And in the end, you need delivery. You need to spread the information that you extracted from your software. Sometimes you will have APIs, sometimes files, sometimes UIs, or you may even deliver it in another way, but it works exactly the same. The idea is the same.
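Going back to the idea of extracting a dataset from the lake: here is a minimal pandas sketch of what that can look like. The paths, the file layout, and the column names are assumptions of mine, not from the talk.

```python
import pandas as pd

# Read one day of raw tweets from the lake (the path layout is an assumption).
raw = pd.read_json("datalake/tweets/2021/07/28.jsonl", lines=True)

# Extract a dataset: only the columns we care about, filtered by username.
dataset = raw.loc[raw["username"] == "some_user",
                  ["created_at", "username", "text"]]

# Store the extracted dataset in a processed zone of the lake.
dataset.to_parquet("datalake/datasets/tweets_some_user.parquet", index=False)
```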
Another important aspect is to understand the different architectural styles. We will talk about the most accepted architectural styles in the market nowadays. They are not rivals; they complement each other. Lambda came first, so you need to understand Lambda and then Kappa, in order to determine which is simpler for you and lets you evolve better against your challenge. There is no silver bullet, so you need to understand what you need. For example, even though I work for a big company (actually, I stopped working with data science two months ago), before that I was working only with batch processing. For a long time, batch processing alone solved my problems. You see, I did not need real-time processing, but maybe you do, and that is why you need to understand the architectural styles.

This one is what we call Lambda. Lambda is an architectural style that tries to deal with a huge amount of data in an efficient way. Here you have two fronts: you need to reduce the latency between processing and serving the data, because you have high throughput; you need low latency to process this data, and you need to deliver it. For that, keep in mind that any change in the data's state needs to generate a new event. Every time you have an ingress, you are collecting events; some systems are sending events to you, like logs, API calls, whatever. And you need to decide whether to process them in real time or in the batch layer. It's up to you to decide, but you need to understand one concept, called event sourcing.

Event sourcing is a concept where you use events to make predictions, storing the changes to the system on a real-time basis. It means you will have, say, a column in your data storage based on date, or something that gives you historical data, to ensure that all the changes in your data, meaning all the events happening to your data, are stored in sequence. For example, you are trying to buy something with your credit card on an e-commerce site. You type your credit card number to pay for something, and then something goes wrong. The transaction is rolled back, and then you try again, and this creates a sequence of events that you may need to analyze later. That kind of situation can result, for example, in blocking a card on suspicion of fraud, because you need to understand what happened, and it can happen in either the speed or the batch layer. Imagine you are typing your credit card into an e-commerce site, and we identify your IP, and your IP is not from Berlin, it is not from Dublin; it is from some suspicious place in the world. You need to get all this information and cross-reference it in a very short time, with low latency, to determine whether this card is involved in fraud or not. This is called event sourcing because you are generating events upon events, and you are able to analyze this timeline quickly.
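To illustrate the event-sourcing idea with that credit-card scenario, here is a toy sketch of my own: an append-only log in plain Python, not production code, with invented event names and fields.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Event:
    kind: str        # e.g. "payment_attempted", "payment_rolled_back"
    payload: dict
    occurred_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))

event_log: list[Event] = []   # append-only: events are never updated in place

def record(kind: str, payload: dict) -> None:
    event_log.append(Event(kind, payload))

# The scenario from the talk: attempt, rollback, retry.
record("payment_attempted", {"card": "****1234", "amount": 99.90})
record("payment_rolled_back", {"card": "****1234"})
record("payment_attempted", {"card": "****1234", "amount": 99.90})

# A fraud check can analyze the whole timeline, not just the latest state.
attempts = [e for e in event_log if e.kind == "payment_attempted"]
if len(attempts) >= 2:
    print("suspicious: repeated attempts in a short sequence")
```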
But you have two splits here. You have the batch layer, which is the most common nowadays because it is the cheapest and the most established technique. Basically, everything that you consume from the ingress is stored in a data lake, and from this data lake you have periodic jobs, running hour by hour, day by day, or on whatever schedule, that take the data from the lake and create new datasets. They extract the datasets, they clean the data, they structure the data, they can generate machine learning models; they can do all the jobs. And at the end of the batch layer, you have something named batch views. Those are the representation of your data, of your data lake, so that any manager in your company, any person who wants to consume this data, can go there and query it. On the other hand, you have the speed layer. The speed layer consumes the ingress as a stream: as soon as the data comes into your system, it starts to be processed. As I mentioned, it can be used for financial purposes, to identify fraud or not. And in the end, whether you use the speed layer or the batch layer, you have all this data prepared and stored in a database that lets you consume or serve it.

On the next slide we talk about the main differences: the pros and cons of this Lambda architecture. For Lambda, you need to keep your data stored permanently. Based on that, you might have problems with the GDPR in Europe, because you retain user data, so you need to stay compliant with the privacy laws, and you may also spend a lot of money storing a huge amount of data to be processed. You need to plan for it. Second, all queries are based on immutable data. It means that nobody can change the data: if data changes, or the state of the data changes, a new event is created, and then you can follow everything that happened before. Since this has been used for a long time, it is reliable and safe; it is fault tolerant. Why is it fault tolerant? Because you have all the data stored on a permanent basis. If you find any bug, anything wrong in your code base, you can correct it and then run something called a backfill. It means you can go to your data lake, read all that data, and reprocess all the steps of your pipeline. This also makes a Lambda architecture scalable, horizontally scalable: if you need to process a huge amount of data, you can put more machines in your cluster to process it. And you can manage your historical data in a distributed file system. The most common, basically the de facto, distributed file system nowadays is Hadoop, whether you run it yourself or use the cloud equivalents at Amazon, Microsoft, or Google, for example.

But there are some cons you need to take care of. One is premature modeling. When you are talking with your team and preparing your data ingestion, when you are modeling your data and how you want to receive it, you need to think a little bit more. You need to avoid premature modeling: try to evolve your schemas and evaluate your tools, validating your schemas step by step, in order not to break your pipeline. And as I said, it can be expensive, because you are storing a huge amount of data, and depending on the batch cycle you decide to use, you may spend a lot of money processing a huge, complex pipeline.

Now let's talk about Kappa. As I mentioned, it is not a replacement for the Lambda architecture; it is an alternative, or you can even think of it as something like the speed and serving layers of a Lambda architecture on their own. The Kappa architecture is used purely for stream processing: things you need to do close to real time, especially analytics, segmentation, or fraud detection, anything where you need a fast response. Another nice thing about the Kappa architecture is that you can keep just a small code base for it.
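A quick aside to make the backfill idea from the Lambda pros above concrete: a schematic pandas sketch, with a hypothetical lake layout and payment schema of my own.

```python
import glob
import os
import pandas as pd

def transform(raw: pd.DataFrame) -> pd.DataFrame:
    # The corrected business logic lives here after the bug fix.
    return raw.assign(amount_eur=raw["amount_cents"] / 100.0)

# Replay every raw partition for the period being backfilled: the lake
# is immutable, so the same inputs can be reprocessed at any time.
for path in sorted(glob.glob("datalake/raw/payments/2021/*/*.jsonl")):
    raw = pd.read_json(path, lines=True)
    out = transform(raw)
    out_path = (path.replace("/raw/", "/batch_views/")
                    .replace(".jsonl", ".parquet"))
    os.makedirs(os.path.dirname(out_path), exist_ok=True)
    out.to_parquet(out_path, index=False)
```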
Unlike the Lambda architecture, where you have different components and can end up with different pipelines inside one super-pipeline, with Kappa you can in general share the same code base, because the data ingress tends to be restricted to a few data sources. And here is the most important point: if you do not need real-time answers, keep your life simple with batch processing and apply everything you know about software engineering to keep your code healthy, and you will probably succeed. That is the takeaway.

As for the applications of a Kappa architecture: you always have a well-defined event order, as you do in a Lambda architecture most of the time, so you can extract any dataset at any time. It is more often used on platforms that need fast answers, such as fraud detection, because you need to answer the questions fast. If you need to change and deploy your code quickly, to fix a bug or implement a new feature, you can do it in a few minutes or a few hours; you will not need to run an enormous test suite or do a huge amount of refactoring, provided you started out keeping your code clean enough. Another thing about stream processing is that it uses fewer resources than a Lambda architecture, because you can apply the idea of vertical scalability: you grow your machine instead of adding more machines, and then you can make better use of how each machine scales. And the Kappa architecture lets you leverage machine learning on a real-time basis.

Okay, do we have a question here? Let's see... something about streams. Okay, I will answer it afterwards. But you have a con as well: if you introduce a new bug into your code, you will probably need to reprocess part of your data if you suffered some data loss. And sometimes you need to stop your pipeline. It means that if you are running fraud detection and you need to stop your pipeline for a few minutes to deploy a fix or change something in the infrastructure, you need to have a solid plan for that; otherwise, you can bring problems to your business. So there are pros and cons here.

Now let's talk about the qualities of a pipeline, of a data product. Since a data product is a computer program, the problems are almost the same. Of course, I will mention where I see differences. If something is wrong in software development, you will probably see the same wrong thing in a data product as well. The first thing, and it is very, very important in data products, is access levels for the data. As a software developer, you probably deal with very sensitive data: user data, financial data. As a data engineer, you must implement access levels at all data levels. It means you need to implement access levels for your tables, for your data lakes, for your datasets, and control how your code interacts with those levels: for example, how your code can read the data from the data lake or from different datasets. This is more about ethics, and about how you implement compliance in your company under the GDPR in Europe or the equivalent in another country, than about technical things.

Another thing, about security, is to try to use a common format for most of your data: for example, JSON for text data, PNG for images, CSV for tabular data, and Parquet for columnar formats.
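As a small illustration of why a columnar format like Parquet is worth knowing, here is a minimal pandas sketch of mine, not from the talk:

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "country": ["BR", "DE", "IE"],
    "amount": [10.0, 20.5, 7.3],
})

df.to_csv("events.csv", index=False)          # row-oriented text format
df.to_parquet("events.parquet", index=False)  # columnar binary format

# With Parquet you can read back just the columns you need,
# which is the point of a columnar format.
amounts = pd.read_parquet("events.parquet", columns=["amount"])
print(amounts)
```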
There are many different kinds of formats, and you need to understand how best to use each of them; there is a bunch of formats and a bunch of applications. Separation of concerns matters as well: who needs to access which data, and when. And the same goes for the code. Of course, apply separation of concerns in your code, but also avoid any kind of hard-coding. In your application, in your code, and especially when you are writing SQL inside your code, spell out all your columns in your statements, because implicit queries break easily when you need to change something, and explicit ones are easier to fix.

Automation is the most important, I guess. I think that is a consensus. Since data products are code, version your code and make the best use of that versioning. Take advantage of the power of the tech platforms: try to distribute the automation across different platforms. You will have your server for continuous integration and continuous deployment; you will have your tools for code review and linting; you can use different tools for monitoring and logs, which we will talk about in a moment. Automation is the key to keeping the data product easy to fix and easy to improve.

Monitoring: do not waste your time creating monitoring; use monitoring. It is something I learned at every company, the small ones and the big one I am working for now. Delegate the problem of monitoring and logging to the cloud whenever you can, but try to avoid vendor lock-in: there are nice wrappers on the market that let you use different vendors from the same code base, with the same analysis. It is very important, but try not to waste your time deploying monitoring infrastructure. Even if it is easy to deploy, it is quite hard to maintain and tends to be very, very expensive, because the more data you start collecting, the more expensive your infrastructure becomes.

And here is the challenge: test and trace your code. Regression testing is a must in data engineering; this is the slight difference. If you change your schema at any time, if you change your input, if you change your code, you need to make sure that your old data still works, that you can still read your old data from your data lake. Remember that the data stored in the data lake is immutable, but your code and schemas change over time, so you need regression tests to make sure you can still read the data from the lake, or from a stream, after everything has changed. Remember what I said about inputs: inputs can change, and that can bring problems. Try to keep your inputs deterministic enough: define your inputs and your schema for testing, force your tests to break, correct them, and then you can identify and trace issues through all your code. And of course, as with any other software, try to focus on unit tests, especially for internal components.

And just two tricks here that are very important for data products. First, try to containerize all your third-party components. Imagine that you have Kafka, you have message queues, you have Spark, different software from different platforms that interacts with your Python code, with your Python platform. Instead of installing all of it on your machine, you can use containers and integrate them with your continuous integration and continuous deployment services. And once you reach that position, it is easy to get to end-to-end tests.
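On the regression-testing point above, a minimal pytest sketch might look like this; the file names, the fixture, and the expected schema are assumptions of mine:

```python
import pandas as pd

# The columns the pipeline has always promised to produce (assumed here).
EXPECTED_COLUMNS = {"created_at", "username", "text"}

def test_old_datalake_file_is_still_readable():
    # A frozen fixture file written with an older version of the schema.
    old = pd.read_json("tests/fixtures/tweets_2019.jsonl", lines=True)
    assert EXPECTED_COLUMNS.issubset(old.columns)

def test_transform_is_deterministic():
    # The same input must always produce the same output.
    raw = pd.DataFrame({"amount_cents": [1000, 250]})
    out = raw.assign(amount_eur=raw["amount_cents"] / 100.0)
    assert out["amount_eur"].tolist() == [10.0, 2.5]
```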
With those third-party components containerized, once a week, or for each nice new feature you develop and deliver for your pipeline, you can run an end-to-end test to make sure that everything is running.

That was a lot of concepts. Now let's talk about Python, our beloved technology that brought us to this nice conference. We will talk about six main areas of data products. The most common is what we call ETL: extract, transform, and load. These kinds of tools and APIs are responsible for reading the data, processing the data, and sending the data to another place. The most common and famous tool is Apache Spark. Apache Spark was developed in Java and Scala, on top of the JVM, but it has a native wrapper for Python, so you can write Python and interact directly with the JVM. There is a lot of magic behind the scenes, but don't worry about managing Spark clusters; you can just use PySpark. That is my recommendation for anyone who is starting to deal with data and data engineering stacks. Apache Spark is a must nowadays.

Okay, you don't want to use Java at all; you want something more Pythonic? Then I offer you Dask. Dask is a Python project written with low-level code underneath but with a Python wrapper on top, and it works basically the same way as Spark: it is a parallel computing library that you can integrate with pandas, and you can have a bunch of machines in a cluster and distribute your processing across the different machines. It is a very useful product. It is not as widely used as Spark across most of the market, as far as I know (correct me if I'm wrong), but it is a great choice if you want to keep working with pandas and Jupyter and do not want to spend so much time learning Java and Spark; you can go directly to Dask.

Luigi is an open-source project that orchestrates different tools but works as a Python library; it is written in Python, and it makes handling complex pipelines quite easy. Basically, you have a class where you define methods, and in those methods you put your logic to consume, process, and deliver your data. It is really useful. I love Luigi; it was created at Spotify, and I still use it today. mrjob is a quite old project, also written in Python, that runs MapReduce jobs over Hadoop. It is a more old-school way to create distributed jobs, but it is still useful; to be honest, I have not used mrjob in a long time. Ray I have only read the documentation for, so I am not able to recommend it or not, but it seems promising.

When you need to deal with streams, we do not have a fully Pythonic tool for streaming, but you have Kafka. Kafka is basically the de facto tool in the market nowadays; basically everyone is using Kafka. And Python has a library named Faust: you can just plug it into a topic and start consuming. As I said before, Python is the keystone here. You can do everything with Python, even integrate different platforms and technologies: you can use different platforms, extract the best of each platform, and bring everything together to work properly with Python. It is awesome. Storm is another streaming platform where Python is accepted as well, through streamparse. It is widely used too, and for those of you who use Amazon, it is supported on AWS.
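To make the ETL shape concrete, here is a minimal PySpark sketch; the paths, column names, and the aggregation are assumptions of mine, not from the talk:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

# Extract: read raw JSON events from the lake (the path is an assumption).
raw = spark.read.json("datalake/raw/events/2021/07/")

# Transform: derive a day column and aggregate per user.
daily = (raw
         .withColumn("day", F.to_date("created_at"))
         .groupBy("day", "username")
         .count())

# Load: write a batch view that other teams can query.
daily.write.mode("overwrite").parquet("datalake/batch_views/daily_counts")
```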
For quick analysis, pandas is the de facto tool. It gives you data structures and data analysis tools, it is integrated with NumPy, and there are different plugins you can use with pandas to get high-performance analysis; you can extract a lot from pandas. I will not talk about the other three or four libraries on the slide because my time is about to finish. Keep in mind that if you are going the analysis route, pandas is the de facto tool, and it is very important that pandas works with both Dask and Spark; you can use both if you want.

Management: Airflow is another de facto Python tool, a scheduling manager for pipelines. Please, if you want to work with data, learn Airflow. Airflow is a Python-based tool that acts like a cron job, and, as with Luigi, you can create very complex pipelines with it.

Testing: testing is very important. When you are testing your data code, you need to create fake databases, fake data for massive tests, seeded tests, and there is a bunch of libraries for that, especially spark-testing-base if you are using Spark, because it gives you a Python library for testing your Spark code; or pytest, which is an awesome Python test framework that integrates with data tooling.

And to finish, the most important one: it is very important that you validate your data all the time. Remember what I said about keeping inputs deterministic and validated. There is a bunch of tools that help you validate the schema, how your data arrives from the external data source, so you can check whether the types of the columns and the information are correct. These validation tools are really awesome, and I really recommend them: I use the schema library for small tasks, but I really recommend all three on the slide.

And I would like to say a big obrigado. Thank you, danke, and thanks in every other European language. If you have any questions, there are two questions here in the Q&A that I will try to answer, and if you have further questions, please drop me a message on email, Discord, Telegram, Twitter, whatever you want. I will be more than happy to interact with you. Thank you so much for having this talk; I am quite happy, it was a pleasure.

Thank you very much, Robson. We indeed have questions. One very interesting question is related to the Kappa architecture: how do you deal with streams that need a strict order, so that one event cannot be processed if the previous one failed?

Okay, great question. You can assume that in the Kappa architecture all your events will be on a temporal timeline, on a historical basis. What can happen, if you lose some of them, is that your streaming software failed and you could not retry; but you can assume that all your data will be in a time series. As I told you, the state of your data always changes through events: if you have one update, one deletion, one upgrade, whatever, you can follow it through your time series, and then you can reprocess and extract your datasets from there. But if your code is not able to process the event, you must make sure that you at least have a retention policy in your streaming software. In Kafka you can retain your messages, and the same applies if you are using other software on Azure or Amazon.

Okay, thanks. So, we have more questions; oh, they are popping up very quickly. The next question is: what is a good way to link data to the version of the code that processed it?
For example, a version column on every row does not seem like the most elegant solution.

It is quite a complex question. I would need to understand better what the person asking wants, but I will try to be generic and give you some insight. It does not make much sense to relate the version of the code to the version of the data, because what matters is the version of your schema. Let's separate the two: versioning your code with a tool is fine, but versioning your schema, as a technique and a good practice, is even better than relating your code version to your data. For example, when you use JSON, or some tool for creating schemas for your data, in general you specify the version of that schema. With that version in hand, you can run your regression tests and manipulate the data accordingly. For me, this is the most elegant solution, because you can focus on your schema and then go to your code to fix it.

Thanks. We have another question: any recommendation for end-to-end tests in the streaming world? Can a simple Python script handle it?

In my case, most of the time, I create a small Python script, or an Ansible script, that spins up Docker containers with all the third-party components I need. Then I run the end-to-end suite, and the Python script I mentioned checks the inputs, the processing, and the outputs to make sure that everything is running smoothly. There are three kinds of tests: regression tests, unit tests, and end-to-end tests; for the end-to-end ones, I first need to make sure that the previous tests are passing. You can use this technique for any kind of end-to-end test. For example, if you need to run a streaming process with Kafka and Spark, you can have two different Docker containers running Spark and Kafka, then deploy your Python code and create a small script that validates the results for you.

Thanks. We have one more question: when is it a good idea to replace a legacy SQL-based pipeline with Python code? Would the benefit be just readability, or can one count on a speed-up?

I think the answer is: when you are no longer able to keep your queries simple, and you start to create a lot of procedures, functions, or anything else that helps you automate your pipeline, that is probably a signal that you need to improve your architecture. Functions and procedures inside the database are quite hard to test, quite hard to maintain, and quite hard to version, so they probably bring more complexity to your architecture than the effort of migrating to code would cost, and a good data pipeline can work better than a SQL-based pipeline that has outgrown itself.

Okay, thanks. Just one more thing for Philippe: I still use a lot of SQL-based pipelines, because they solve most of the problems. That is not a problem in itself, but when your product gets more users or becomes more complex, you need to change the strategy. If your strategy is still working, that is very good.

So, thanks again, Robson. Thank you so much.