Hi everyone, good morning. This is another day of EuroPython, and we're starting with a talk by Chin. Hi Chin, how are you? No, no, you're okay, because I'm the first to start, so let's see how I do. Where are you streaming from? I'm streaming from Singapore. Oh, and how's the weather there? Very, very hot. I mean, global warming is taking its toll on us. Same here in Athens. Yeah, but we're not talking about the weather today. Okay, if you're ready, we can start.

Okay. So today I will be sharing how we design functional data pipelines for reproducibility and maintainability, using some functional programming features in Python.

A little about myself: I'm Chin Hui, a data engineer at DT1. What we do is provide services so that people from emerging economies can stay connected with their loved ones. Before data engineering, I had a background in aerospace engineering and computational modelling. Outside of my work, I speak at conferences and occasionally I write about data processing. If you'd like to follow along, you can find the slides at this link.

As a data engineer, what we typically do is build data pipelines. When we talk about data pipelines, we tend to think of all those complex pipelines out there, but fundamentally we need to look at the basic design pattern of a data pipeline: an input, a function (an operation), and a target output. In this case, we have a square as input, and I'd like to apply an operation such that I get a circle that fits the square; my output is then the square together with the circle that fits it. That is the basic structure of a data pipeline.

It looks pretty easy: we have an input, we apply a process, and we get an output. But in reality, when we are designing data pipelines at scale, it gets a bit more tricky, because we need to consider three key factors.

First, the data pipeline needs to be reliable: it must produce the desired output, which brings us to the point of reproducibility. Second, it has to be scalable: the data pipeline must be able to run independently across multiple nodes, even across multiple regions, and this brings us to the concept of parallelism. Last but not least, the data pipeline has to be extensible: as the business logic evolves, we need to be able to extend the pipeline with changing business logic, such as adding a new feature easily. This brings us to the concept of code maintainability.

Let me elaborate a little on some challenges in designing data pipelines at scale. Let's take the example of reproducibility in testing. Say we need to know whether a customer is trustworthy, based on the customer's credit rating.
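As a rough sketch of this input-operation-output structure, a pipeline can be expressed as a composition of small functions. The stage names here are illustrative, not from the talk:

```python
from functools import reduce

def compose(*funcs):
    """Compose functions left to right: compose(f, g)(x) == g(f(x))."""
    return lambda x: reduce(lambda acc, f: f(acc), funcs, x)

# Illustrative stages: each is a plain function from input to output.
def parse(raw: str) -> list[int]:
    return [int(v) for v in raw.split(",")]

def keep_positive(values: list[int]) -> list[int]:
    return [v for v in values if v > 0]

def total(values: list[int]) -> int:
    return sum(values)

pipeline = compose(parse, keep_positive, total)
print(pipeline("1,-2,3"))  # 4
```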
So when we process a loan, we need to be sure that the customer has a good, positive credit rating before we approve the loan. In this case, say at time t I process all that data and approve the loans. Two months later, when I need to audit those approvals for compliance reasons, I need to ensure that with the same set of data in the data source and the same computation logic, I get the same result in my target. So: same data source, same computation logic, whether I process it at time t or at time t plus two months, I need to get the same results.

The main challenge is that when we want to design for reproducibility during testing, we need to consider the two aspects of the data pipeline design that affect what you get in the target: first the data source, and second the computation logic. Given the same data source, how do we ensure that we replicate the same result every time we rerun the same process, whether at this point in time or two months later? It has to be the same result.

Now, say we have the data pipeline and we have ensured through testing that it works. We then need to ensure that we still have reproducibility in production, and this brings us to the concept of parallelism across multiple nodes. Take the case of sales transactions. Typically, with sales transactions, we need to compute our margins and then collect all the information into a target. In this scenario, because each row of transactions is independent of the others, we can chunk the sales transactions into independent chunks, compute the margins, and then collect the results into a single collection at the target.

But we also need to consider that it may not be that simple, depending on the dependencies in the data pipeline. Take the case of checking whether there is enough balance for a transaction. Imagine that at time t I have a transaction for a certain amount, and I need to check whether there is enough balance in the inventory. So I need to keep track of whether there is enough balance in the inventory in the first place before I accept the transaction. Now imagine I have another transaction, maybe smaller or maybe larger than the earlier one. Whether we have enough balance in the inventory depends on when each transaction is actually made. So if I swap the order, right?
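One way to think about the reproducibility requirement: if the approval logic is a pure function of a fixed snapshot of the source data, rerunning it later must give the same target. A hedged sketch, with illustrative names and data:

```python
# A fixed snapshot of the data source as of time t (immutable tuples).
snapshot_t = (
    ("alice", 720),
    ("bob", 580),
)

def approve_loans(snapshot: tuple, min_rating: int = 650) -> tuple:
    """Pure computation logic: same snapshot in, same approvals out."""
    return tuple(name for name, rating in snapshot if rating >= min_rating)

run_at_t = approve_loans(snapshot_t)              # processed at time t
run_two_months_later = approve_loans(snapshot_t)  # audited at t + 2 months
assert run_at_t == run_two_months_later == ("alice",)
```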
Then the inventory balance actually changes according to time. So we have this problem: do we have enough balance, and what exactly is the inventory balance at a given point in time? Whether we have enough balance for those transactions depends on when each transaction is actually made, so it is affected by time, and we need to know the current state of the data source at a particular time. This introduces problems when debugging parallel and concurrent code at runtime, due to the shared state, because we have this shared state called inventory. In this case, how do we know what the current state of the data source is? It depends on the time.

So the main challenge is: how do we design data pipelines that run the same computation logic across multiple nodes and reproduce predictable results every time? Because if we are not able to predict the results every time, due to some randomness introduced into the process by time, then how do we ensure that we get predictable results every single time we run the pipeline? A concrete sketch of this shared-state problem follows below.

Now we look at the second aspect of the challenges of designing a data pipeline, which is the problem of maintainability during debugging. There is a typical scenario of "works in testing, breaks in production". Why is this so? Because while the code works in testing, typically when we write tests we are only looking at a subset of the data source, and this subset may not be fully representative of the actual data in production. This leads to edge cases and inefficiencies that are not detected by your test cases, which cause performance issues and failures in production, because they couldn't be detected during testing.

When issues happen in production, this introduces complexity into debugging and logging, especially when running on a parallel cluster. Typical logging solutions don't work very well in a parallel cluster situation, and it is not really possible to keep track of exactly what is happening in each node while your pipeline is running in production. This causes a lot of inefficiency in development productivity. So the challenge is: how do we design data pipelines that are readable and maintainable at their core, to reduce inefficiencies in production debugging at scale? We know there are limitations to debugging and logging a pipeline that is running on a parallel cluster, so we need to be sure that we can understand the logic of the pipeline from the code itself.

And last but not least, we need to understand that whenever we add new features to our code base, reasoning about the code becomes more challenging as code complexity increases. This is to be expected, because as the business evolves and grows, there will keep being new features, and we need to keep adding them to adapt to the business requirements. If we keep adding those requirements without being careful about how we manage the addition of new features, we run the risk of introducing unintended behaviour due to dependencies that may not be well documented and managed during development. So the challenge is: how do we design data pipelines that adapt well to changing business and technical requirements, and at the same time ensure developer
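To make the shared-state problem concrete, here is a hedged sketch (illustrative names, not from the talk) of why an inventory balance that is read and written by concurrent transactions gives time-dependent results:

```python
import threading

inventory = {"balance": 100}  # shared mutable state

def purchase(amount: int) -> None:
    # Read-check-write on shared state: the outcome depends on timing.
    if inventory["balance"] >= amount:
        inventory["balance"] -= amount

threads = [threading.Thread(target=purchase, args=(60,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Depending on the interleaving, the balance may be 40 or -20:
# both threads can pass the check before either one subtracts.
print(inventory["balance"])
```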
productivity So that we don't end up introducing unintended dependencies that are hard to debug So this brings me to the concept of viewing data pipelines as functions And this also brings us to the paradigm of functional programming So what is functional programming? Functional programming is actually a declarative style programming That emphasizes writing software using only one pure functions and two immutable values So there are actually three key principles of functional programming That we need to be aware of Number one, we use pure functions and we avoid side effects Secondly, we need to ensure some immutability Property And last but not least, which is the most important principle of functional programming Is the concept of referential transparency, which I will elaborate later So firstly, about pure functions and nobody's side effects So what is a pure function? So a pure function is such that the output depends on one, the input, number two, internal algorithm And pretty much nothing else And secondly, a pure function must not have any side effects as illustrated in this diagram And what is the implication of the concept of a pure function Is that output depends only on its input parameters and its internal algorithms And there are no side effects Which means that if we have the same function and the same input parameter ends We will get the same result regardless of the number of invocations An example of pure function is, let's say, we are making pizza And then we have the dough, we have the tomatoes, we have the ingredients And we also have the pineapples, we put them together and then we make it into a nice pizza layout We put it in the oven, we have some temperature, we have some time And then ideally, we should be able to get a nicely cooked pizza So this is an example of pure function In reality, making pizza is an impure function And it's inevitable that we will end up making pizza with side effects So what do I mean by that? Because making pizza will actually cause certain side effects other than making the pizza So it could be in the form of radiation sheets So that's something that is not intended in the function And it could also be a case whereby when you keep reusing the oven And even though you are setting it at 160 degrees Celsius But the oven is actually cooking the pizza at 180 degrees Celsius So you end up with a side effect oven oven over heat And you end up with a burnt pizza With this pizza analogy, let's look at what exactly do we mean by a side effect So formally, a function with side effects changes the outside the local function scope Which in this case is the oven So some examples would be you are modifying a variable in today's Or you are modifying a variable state Or it could be IO operation or even throwing exception of the error in the case of like a burnt pizza Secondly, we look back at the concept of immutability So what do I mean by immutability? 
It means that once I define a variable, if I try to assign another value to the same variable, I am going to get an error, because it is not allowed: the state of the variable cannot be changed once a value has been assigned to it. The key implication of the concept of immutability is that it enables us to enforce some discipline in state management, especially when we are trying to ensure that the input of the data pipeline is not altered after initiation. This prevents side effects arising from state changes to whatever data input you have in the pipeline, which is related to ensuring that your functions remain pure.

Another key implication of immutability is the ease of writing parallel and concurrent programs. Now that you do not need to keep track of state changes to your input data, and you are sure the input data will not be changed, it is easier for us to partition the input data, safely parallelize the operations, and then gather and collect the results, knowing there are no state changes to worry about.

Next, we look at the concept of referential transparency. Referential transparency means that if I have a function whose result is equal to some value, I can interchange the function call and that value. The formal definition of referential transparency is that a function is referentially transparent when an expression can be substituted by its equivalent result without affecting the program's behaviour, for all programs. The conditions for referential transparency are: number one, it has to be a pure function; and number two, it has to be deterministic, which means that the expression always returns the same output given the same input, without any random factors. So what do I mean by deterministic versus non-deterministic?
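A quick sketch of both ideas, immutability and referential transparency, using illustrative values:

```python
# Immutability: tuples cannot be modified after creation.
point = (1, 2)
try:
    point[0] = 10
except TypeError as err:
    print(err)  # 'tuple' object does not support item assignment

# Referential transparency: a pure, deterministic expression can be
# replaced by its value without changing the program's behaviour.
def double(x: int) -> int:
    return x * 2

result = double(21) + double(21)
assert result == 42 + 42  # substituting each call with its value
```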
What I mean by non-deterministic is this: let's say we have bread and we put it in the toaster, and we expect toast. If the operation is deterministic, we should get toast every single time. But a non-deterministic operation is such that instead of getting toast, at some random time I end up getting burnt toast, which is not toast. That means my output depends on the time, which is not deterministic.

The last condition for referential transparency is that the function has to be idempotent, which means the expression can be applied multiple times without changing the result beyond its initial application. An illustration of idempotence is the absolute function: once I apply the absolute function to a negative number, if I keep applying the same operation multiple times, I still get the same result as the initial application. The key consequence of referential transparency is equational reasoning, which means an expression can be replaced with its equivalent result, as shown in the example.

Now that I've talked about the principles of functional programming, let's go on to how we actually write control flow in functional programming. The key principle of functional control flow is the concept of function composition, which can be illustrated with the following example: let's say I have x, I feed it through a function f, and then I feed the result through a function g; I can express that as a composite function.

Second, we need to understand that in Python, functions are first-class objects, which means a function can be (1) assigned to a variable, (2) passed as a parameter to other functions, and (3) returned as a value from other functions. The key consequence of first-class functions is that we can write higher-order functions. A higher-order function has at least one of these properties: either it accepts functions as parameters, or it returns a function as a value. In Python we also have the concept of anonymous functions, also known as lambda expressions, whereby we use a function as input without defining a named function object, as shown in this code snippet.

Now that we have talked about function composition and how it is expressed through higher-order functions in Python, we can go on to some key higher-order functions in Python. For example, map: I have a bunch of shapes, I map them with an operation, and I get an output. Let's say I want to add smiley faces to the shapes: I take the shapes, I feed them to map with the add-smiley-face operation, and map applies the add-smiley-face operation to each shape. After that I collect the results as a set of shapes with smiley faces. As for the filter operation, what it does is filter a set of elements that fulfil a certain condition; in this case, we only select those shapes which have finite edges. And with reduce, we reduce them into a single output, as in this example.

And how do we apply map and filter, and how do they compare to a for loop? The key difference is that with a for loop we need to manage the state changes of a whole bunch of mutable variables, such as the squared variable, but with the map function we do not need to manage state changes. And when we talk about functional control flow, I also need to
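A short sketch of map, filter and reduce next to the for-loop equivalent, with illustrative data:

```python
from functools import reduce

numbers = [1, -2, 3, 4]

# Higher-order functions: map, filter and reduce take functions as input.
squared = list(map(lambda x: x * x, numbers))       # [1, 4, 9, 16]
positive = list(filter(lambda x: x > 0, numbers))   # [1, 3, 4]
total = reduce(lambda acc, x: acc + x, numbers, 0)  # 6

# The for-loop equivalent of map must manage a mutable accumulator:
squared_loop = []
for x in numbers:
    squared_loop.append(x * x)  # state change on every iteration

assert squared == squared_loop
```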
talk about recursion as a form of functional iteration, because recursion is a form of self-referential function composition, which takes the result of itself as input into another instance of itself. However, if there is no end, we get an infinite recursion condition. To prevent this scenario, a base case is required as a terminating condition, so that we have a recursive call stack whereby an operation calls another instance of the operation, and it goes all the way down to the base case.

If we compare the recursive call stack with an iterative loop, there is a difference: as you keep building up the recursive call stack, it can get a little too heavy if you have many calls in the iteration. Hence we have the concept of tail-call optimization, which aims to reduce the number of stack frames in the call stack. What it does is look for a tail call, which does nothing other than return the value of a function call, and compile it into an iterative loop, as in this scenario. So if we look at this tail-call optimization example, it identifies an instance that keeps being repeated on the call stack as a tail call, and with that, the compiler turns the recursion into an iterative loop.

Now that we've talked about functional control flow and some principles, we can go into functional design patterns for data pipeline design in Python. Firstly, we have the built-in higher-order functions: we have map and filter, and we also have list comprehensions. If you look at this example, you will realize that list comprehensions are actually syntactic sugar for map and filter operations on a data collection. The way we use map and filter in data transformation is that we filter certain inputs, and then we map the values that pass the filter with an operation, as in this example. The benefit of using map and filter in data transformation is that we keep the data and the transformation logic separate, which improves code readability with better transparency of the transformation logic, so that we can apply the transformation logic to another instance of the data, or in another use case.

We can also extend the concept of map and filter to parallel and concurrent programming. You could use multiprocessing, or you could use concurrent.futures, which I talked about in an earlier talk. In this example, I am using multiprocessing.Pool to generate an iterator using map, and then we filter the results into a collection; a sketch of this pattern follows below. If you want to find out more about using concurrent.futures for parallel and concurrent programming, you can check out my EuroPython talk on speeding up your data processing.

Second is the use of immutable data structures in the pipeline. An immutable data structure, once created, cannot be changed. The benefits of using immutable data structures are: number one, it is easier to use, because what you see is what you get; it is easier to test, because you worry about the logic, not the state, since you can't change the state; and because you can't change the state, whatever data pipeline you design with that data structure is thread-safe, which makes parallelism easier. So instead of using a list, which is mutable, we can use a tuple, where if you try to do something
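A minimal sketch of the parallel map-filter pattern mentioned above, with an illustrative pure function rather than the speaker's actual code:

```python
from multiprocessing import Pool

def square(x: int) -> int:
    # Pure function: safe to run across multiple worker processes.
    return x * x

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        squares = pool.map(square, range(10))  # parallel map over inputs
    evens = [s for s in squares if s % 2 == 0]  # filter the collected results
    print(evens)  # [0, 4, 16, 36, 64]
```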
like what I did with the list, you will end up with a TypeError. And instead of using a class or a dictionary to define a point, we could use a NamedTuple, whereby if you try a similar operation you will get an AttributeError, because you are not allowed to modify a NamedTuple that has already had a value assigned. You can use the _replace operation, but what it does is create a new instance of the point object; you are not actually modifying the original object.

I think this is the part where it gets really interesting. There is a Python 3.10 feature, inspired by similar syntax in Scala, called structural pattern matching, which is especially useful for conditional matching of data structure patterns. So what is structural pattern matching? It is characterized by the match-case syntax: if I match a certain case, we do something. This code snippet is inspired by the syntax used in Scala, and it is meant to give you an idea of how structural pattern matching can be used. For instance, say I want to check whether a variable belongs to a certain type. If I use if/elif, you can see that I keep having to call the isinstance function, and it can get a little messy. Compare that with using match-case, where I perform certain operations based on certain cases, on certain characteristics of what is being matched. The reason I advocate using pattern matching is that it helps with the maintainability of data schemas, and this can be illustrated with the following example, based on code that was originally written in Scala. For the purpose of this talk, I use dataclasses as the Python equivalent of Scala case classes. In this case, you can see that I am matching the variable request according to certain characteristics.

A short note about recursion in Python: tail-call optimization is not supported in Python, so the optimization unfortunately has to be implemented manually. On top of that, we need to consider that there is a recursion limit of 1000 by default, as a prevention mechanism against stack overflows in the CPython implementation. So that's a small note about recursion in Python.

Last of all, I need to mention type systems, because this is an important part of functional programming. Python does have support for type hints, although they are not enforced at runtime, and we can make use of type checking with mypy to enforce a certain level of type safety in Python. The reason I advocate using type systems when writing functional code is that it prevents bugs at runtime by ensuring type safety and consistency across the data pipeline, so that we know exactly what the type of the input is and what the type of the output is.

Now, after I've talked about all those features of functional programming that are available in Python: can we write a purely functional data pipeline in Python?
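A hedged sketch of match-case over dataclasses, in the spirit of the Scala-inspired example; the request types here are illustrative, not the ones from the slides:

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen=True makes instances immutable
class Deposit:
    amount: float

@dataclass(frozen=True)
class Withdrawal:
    amount: float

def process(request: object) -> str:
    # Structural pattern matching (Python 3.10+) on the request's shape.
    match request:
        case Deposit(amount=a) if a > 0:
            return f"deposit {a}"
        case Withdrawal(amount=a) if a > 0:
            return f"withdraw {a}"
        case _:
            return "rejected"

print(process(Deposit(100.0)))    # deposit 100.0
print(process(Withdrawal(-5.0)))  # rejected
```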
It turns out: not really, because we need to consider that I/O operations are still needed for reading and writing data outside of the application domain. Hence there is the design pattern of a functional core and imperative shell, whereby we keep the core domain logic and the infrastructure code separate. We have the core domain logic in a functional form, and the I/O operations that interface with the infrastructure are kept separate in the imperative shell. If you'd like to find out more about this design pattern, there is a PyCon talk that covers it in detail. An example of how I use this design pattern is in this code snippet, whereby I have an I/O layer to read the data, a functional layer for the computation logic, and finally an I/O layer to write the data outside of the program; a sketch of this layering follows at the end.

The key takeaways I'd like you to take with you: we should adopt functional design patterns when designing data pipelines at scale, especially for parallel and distributed workflows. The reason is that we want to be sure that the pipelines we design are reproducible, scalable and maintainable. And on top of adopting functional design patterns in the core logic, we should also consider the functional core, imperative shell design pattern, to manage the side effects separately from the data pipeline logic.

That's the end of my talk, thank you very much. You can reach out to me or follow me via the social media links, and also check out my ongoing series on functional programming at the following link. Thank you.

We don't have time for Q&A; you can always continue the discussion in the breakout room. Okay, I think now we'll have a break. Thank you.
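A minimal sketch of the functional core, imperative shell layering described above; the file names and functions are illustrative, not the speaker's actual code:

```python
import json
from pathlib import Path

# Functional core: pure computation logic, no I/O, easy to test.
def compute_margins(transactions: tuple) -> tuple:
    return tuple(
        {"id": t["id"], "margin": t["price"] - t["cost"]}
        for t in transactions
    )

# Imperative shell: all side effects (reading and writing) live here.
def run(source: Path, target: Path) -> None:
    transactions = tuple(json.loads(source.read_text()))  # I/O: read data in
    results = compute_margins(transactions)               # pure core logic
    target.write_text(json.dumps(list(results)))          # I/O: write data out

if __name__ == "__main__":
    run(Path("transactions.json"), Path("margins.json"))
```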