Our first speaker will be Jin Hui. She will be talking about speeding up your data processing using parallel and asynchronous programming in data science. Jin, can you please unmute and then start sharing your screen? Thank you.

Hello everyone. I hope everyone has had your lunch, coffee, dinner or whatever. Today we will be talking about how to speed up your data processing using parallel and asynchronous programming in the context of data science.

A little bit about me. I'm Jin Hui and I'm a data engineer at ST Engineering. I'm part of a relatively small data science team which works on interesting data science problems in engineering. My background is in aerospace engineering and computational modelling. In the course of my work I use pandas pretty much every day during working hours, which is why I contributed to the documentation for the pandas 1.0 release. If you look at the pandas documentation, you might see something that I've contributed. In my free time, before this whole pandemic, I volunteered as a mentor at Big Data X, a community-driven group which organises data engineering workshops for people of all skill levels and backgrounds. So that's a bit about me, and that's why I have skin in the game for this presentation.

As I mentioned earlier, I work in a small-scale data science team, and this is what a typical data science workflow looks like. First, we extract the raw data from the data source. We get it from the business, some client with some business problem, and it could be some CSV files, a database, or an API. The second step is that we process the data: we massage it into the form required to train our model. The third step is where we fit the data to the model and train the model. Lastly, we evaluate the performance of the model, and if it looks good, that's where we deploy the model into production.

It looks pretty straightforward, a very nice pipeline. But when you're dealing with a real-life data science project, it doesn't really look like your Kaggle dataset or your nicely cleaned-up bootcamp problems. First, in the real world, a major bottleneck in a data science project is the lack of data. If you lack data, you don't need to think about how to process it. But usually the problem is poor quality data, and poor quality data means that you need to put more effort into your data processing. Some examples would be noisy images, noisy text, or missing values. All of these problems require data processing.

That brings us to the very famous 80-20 data science dilemma. What is the 80-20 data science dilemma? It says that 80% of the time is actually spent acquiring and cleaning the data, and only 20% of the time is spent developing the models. If you think that 80-20 seems pretty reasonable, I hate to burst your bubble, because in reality it's closer to 90-10. And the problem gets even worse when you have even more data.

So what sort of data processing do we use in data science? Python is the common language that most data scientists use, and we often have to iterate an operation over a list. Let's say I want to perform the square operation on a bunch of numbers. The first thing we learn is to use for loops in Python. How do we do that? We initialize an empty list, loop over the numbers with a for loop, and append each value to the list.
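To make that concrete, here is a minimal sketch (the numbers and sizes are made up for illustration) of the for-loop approach just described, alongside the list-comprehension alternative discussed next:

```python
import timeit

numbers = list(range(10_000))

def square_loop():
    # The pattern described above: initialize an empty list,
    # then append each squared value inside a for loop.
    result = []
    for n in numbers:
        result.append(n * n)
    return result

def square_comprehension():
    # The same operation as a list comprehension: the interpreter
    # recognizes the pattern and skips the per-iteration append
    # lookup and call.
    return [n * n for n in numbers]

print("for loop:     ", timeit.timeit(square_loop, number=1000))
print("comprehension:", timeit.timeit(square_comprehension, number=1000))
```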
But as mentioned by our previous speaker, for loops are actually a bad idea. Why is that so? Because for loops run on the interpreter and are not compiled, and if you compare the performance of for loops in Python, you'll see that they are terribly slow, I think somewhere from 10 to 100 times slower, which is quite disastrous.

So for loops are bad; why not list comprehensions? List comprehensions are slightly faster than for loops because they are optimized for interpretation: when the Python interpreter sees a list comprehension, it can identify the repeated pattern, and hence there is no need to call the append function at each iteration. This is in contrast to a for loop, where at each iteration the interpreter has to see that there is an append call and then call the append method on the list. So list comprehensions are slightly better than for loops, but that may not be enough.

And now we come to pandas. I think the previous speaker talked a lot about pandas and performance optimization. Pandas is designed to be optimized for in-memory analytics using DataFrames, and because of its elegance and ease of use, it is very popular among data scientists. However, when we look at large datasets, that is when we run into performance and out-of-memory issues. By large datasets I mean data that is at least one gigabyte. If I run pandas on a dataset that is less than one gigabyte, pandas is great. But if you're looking at hundreds of gigabytes or terabytes, then it's not a good idea.

And that brings us to the next problem: why not just use a Spark cluster? Because it's big data, right? If my data is very big, just throw it into a Spark cluster. Well, there is always a price to pay for such tools, because with a Spark cluster there will be communication overhead. What do I mean by communication overhead? A Spark cluster leverages distributed computing, and in distributed computing the computation involves communication between independent machines in a network. Let me give you an example of what communication overhead looks like. Let's say I have a phone, and I WhatsApp you a message; I'm in Singapore right now. That message has to go through my network, get transmitted to your network, and arrive at your phone. This is what I call communication overhead: both your phone and mine are machines, and the message has to go through the mobile network.

And secondly, there is the problem of small big data. What's the definition of big data? Big data is not just about data that is too big to fit in memory; it is also about how diverse the dataset is. You have the five V's: one is volume, another one is variety. So even if your data is too big to fit in memory, meaning it has large volume, it may not have a lot of variety, and it may also not be large enough to justify using a Spark cluster. If you want to find out more about this particular term, there is a talk on small big data that you can watch, so I will not elaborate so much on that.
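Coming back to the pandas memory problem for a second: if the data is bigger than memory but you still want to stay in pandas, one common workaround is chunked reading. A minimal sketch, assuming a hypothetical large file readings.csv with station_id and value columns:

```python
import pandas as pd

# Process the file in fixed-size chunks so only one chunk sits in
# memory at a time, instead of loading the whole dataset at once.
totals = None
for chunk in pd.read_csv("readings.csv", chunksize=1_000_000):
    # Aggregate each chunk, then fold the partial results together.
    partial = chunk.groupby("station_id")["value"].sum()
    totals = partial if totals is None else totals.add(partial, fill_value=0)

print(totals)
```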
So list comprehensions are not good enough, pandas is not good enough, and I don't have data that is big enough for a Spark cluster. That leaves me with parallel processing.

So what exactly is parallel processing? I don't like to just look at definitions, so let's imagine that I run a cafe that sells toast. I'm from Singapore, and a traditional Singaporean breakfast consists of coffee, toast and eggs. Today I shall not talk about the eggs; we will focus on the coffee and the toast.

Task one: I'd like to toast 100 slices of bread. Some assumptions that I make: one, I am using a single-slice toaster; two, each slice of toast takes two minutes to make; and three, and this is a major assumption, that there is no overhead time. In reality there will always be overhead time, so keep that in mind.

What we are used to is sequential processing, where we do things in sequence. If I have 100 slices of bread, I feed them one by one into the toaster, which in this case is the processor, and after this whole process I get 100 slices of toast. The whole execution is going to take me 200 minutes. Imagine that you have only produced 100 slices of toast in 200 minutes, and imagine you are in a cafe in Singapore, where my people are very impatient, and you have a lot of customers. You're not going to be able to serve your customers in time.

But now think about parallel processing. Same thing: we have 100 slices of bread, but we split them into four portions and feed them into four toasters. The task is executed in a pool of four toasting sub-processes, and after that I get four batches of toast. Each toasting sub-process runs in parallel and independently from the others, which means that even if one toaster is out of order, it is not going to affect how the other toasters are working. After that, I consolidate the batches of toast into one whole stack of 100 toasts, which means the output of each toasting process is consolidated and returned as an overall output. And I don't really care about the order of my toasts, so they may not be in order. This whole process is going to take around 50 minutes, and the speed-up compared with sequential processing will be around four times. Four toasters, a speed-up of four times. Sounds great.
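The toast example maps almost one-to-one onto code. A minimal sketch, using the concurrent.futures module that comes up later in this talk, with the two-minute toasting time scaled down to 0.2 seconds so the demo runs quickly:

```python
import concurrent.futures
import time

SLICES = 100

def toast(slice_id):
    # Stand-in for one slice in the toaster: 2 "minutes" per slice,
    # scaled down to 0.2 seconds for the demo.
    time.sleep(0.2)
    return slice_id

if __name__ == "__main__":
    start = time.perf_counter()
    # A pool of four "toasters" (worker processes); the batches are
    # consolidated back into a single list at the end.
    with concurrent.futures.ProcessPoolExecutor(max_workers=4) as pool:
        toasted = list(pool.map(toast, range(SLICES)))
    elapsed = time.perf_counter() - start
    # Sequential would take ~20 s at 0.2 s per slice; four workers
    # bring it down to ~5 s, the four-times speedup described above
    # (minus the real-world overhead the assumptions ignored).
    print(f"Toasted {len(toasted)} slices in {elapsed:.1f} s")
```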
Next, I will go through the concept of asynchronous versus synchronous execution. What do I mean by asynchronous? Let me give you another example. Let's go back to the example of a traditional Singaporean breakfast. Now we have the toast ready; now we need to brew the coffee. Same thing, some assumptions here. First, I can do other stuff while making coffee, so I can start the coffee and then make my toast. Second, one coffee maker makes one cup of coffee, because sometimes you have to do it manually and one person can make one cup of coffee at a time. Third, each cup of coffee takes five minutes to make.

When we talk about synchronous execution, what it means is that first I brew a cup of coffee on the coffee machine and just stand there and wait for five minutes. After my coffee is done, I toast my two slices of bread on the single-slice toaster, one after the other, which is two times two, four minutes. So the total execution time will be nine minutes. Which implies that if I want to make 100 of these sets, it will take me 900 minutes to make 200 toasts and 100 coffees. If we're looking at a cafe, that is 100 sets in 900 minutes, and 900 minutes is about 15 hours. I think by that time I would be out of business.

But the asynchronous way of execution is this: while I brew the coffee, which I know is going to take five minutes, I make some toast, which takes two minutes per slice. If I do this process asynchronously, it takes five minutes for the same output. So effectively your execution time is cut by almost half.

It looks good, right? If I buy four toasters, I get a four-times speed-up; if I execute asynchronously, I can do more things at a time. So this goes to the question of when it is a good idea to go for parallelism. Or to phrase it in another way, is it a good idea to simply buy a 256-core processor and just parallelize all your code? Well, it is not that good an idea if you consider some practical considerations.

One, is your code already optimized? Sometimes all you need to do is to rethink your approach. For example, if your code is slow because you are using for loops in your processing code, you might want to consider converting those for loops into list comprehensions or map functions over your arrays.

Secondly, we need to consider the problem architecture, because the nature of the problem limits how successful the parallelization can be. There are some computational problems which are embarrassingly parallel, which means it is very easy to parallelize everything. But if your problem consists of processes which depend on each other's outputs or intermediate results, then it's not a good idea. Dependency means that I have a first function with an input and an output, and a second function which depends on the output of the first function. If there is that kind of dependency between the processes, you might not be able to parallelize them. Or it could be that one task produces some intermediate output and the other process takes that intermediate output; you can't really just parallelize your code that way either.

And last but not least, there is no free lunch in this world. I repeat, there is no free lunch in this world, because there will always be parts of the work that cannot be parallelized. This is summed up in Amdahl's law, which I will go through in more detail. There is also the extra time required for coding and debugging parallelized code versus sequential code, because I have to refactor my code and arrange it in a way that allows the parallelization, so this adds complexity. And on top of that, there is also the problem of system overhead, including communication overhead.

Amdahl's law states that the theoretical speedup is defined by the fraction of the code that can be parallelized. If no part is parallel, you get no speedup; if all parts are parallel, you get infinite speedup in theory. But the speedup is limited by the fraction of the code that is not parallelizable, because there will always be, for example, initialization, and you can't parallelize the initialization. This is going to limit how much you can speed up your workflow, and it will not improve even with an infinite number of processors.
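Written as a formula, this is Amdahl's law in its standard textbook form (not something specific to this talk), where p is the fraction of the work that can be parallelized and n is the number of processors:

```latex
S(n) = \frac{1}{(1 - p) + \frac{p}{n}},
\qquad
\lim_{n \to \infty} S(n) = \frac{1}{1 - p}
```

With p = 0 there is no speedup (S = 1); with p = 1 the theoretical speedup is unbounded; and for anything in between, the serial fraction (1 - p) caps the speedup no matter how many processors you add. For example, if 90% of the work is parallelizable, the speedup can never exceed 10 times.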
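And to tie the coffee-and-toast example above back to code: a minimal asyncio sketch (times scaled from minutes to tenths of seconds), where the toast is made while the coffee brews:

```python
import asyncio

async def brew_coffee():
    # 5 "minutes" of brewing, scaled to 0.5 s.
    await asyncio.sleep(0.5)
    return "coffee"

async def make_toast():
    # 2 "minutes" per slice, scaled to 0.2 s.
    await asyncio.sleep(0.2)
    return "toast"

async def breakfast_set():
    # Start the coffee, then make two slices of toast while it brews:
    # total ~0.5 s instead of the synchronous 0.9 s.
    coffee_task = asyncio.create_task(brew_coffee())
    toasts = [await make_toast() for _ in range(2)]
    return [await coffee_task, *toasts]

print(asyncio.run(breakfast_set()))
```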
Now let's go into the difference between multiprocessing and multithreading. Multiprocessing executes multiple processes at the same time across multiple processors. Multithreading means that the system executes multiple threads at the same time within a single processor. So the difference is multiple processors versus a single processor. Multiprocessing is better for processing large volumes of data, while multithreading is best suited for I/O or blocking operations, and I will talk more about that using some examples.

But before we implement the code, there are some considerations. First, data processing tends to be compute-intensive, so switching between threads becomes increasingly inefficient. On top of that, there is also the Global Interpreter Lock (GIL), which does not allow parallel thread execution.

So how do we do parallel and asynchronous execution in Python without using any third-party libraries? It turns out that since Python 3.2 there has been a module called concurrent.futures, which is a high-level API for launching asynchronous parallel tasks, and it is an abstraction layer over the multiprocessing module. There are two modes of execution: the ThreadPoolExecutor for asynchronous multithreading, and the ProcessPoolExecutor for asynchronous multiprocessing. If you look at the Python standard library documentation, it explains how the executors work by splitting the iterables into chunks, so you can read more in the documentation.

So if we compare multiprocessing and multithreading here: the ProcessPoolExecutor uses the multiprocessing module and sidesteps the GIL, but the ThreadPoolExecutor is still subject to the GIL, so it is not truly concurrent, even though concurrent.futures has the word "concurrent" in its name. You need to take that into consideration. There are two main methods. There is the submit function, which takes a function and the input arguments for that function and returns a Future object that represents the execution of the function. And executor.map is quite similar to the built-in map function, in that it returns an iterator that yields the results of the function applied to every element of the list.

Now let me show you some examples of how we use the concurrent.futures module; minimal sketches of both cases follow below. The first case is about getting data from an API. I used the data.gov.sg real-time weather readings, where the response is in JSON format. First I import the modules, then I define the API request task; in this example I use the threading module to monitor the execution, and I initialize the submission list. First I tried a list comprehension, and it takes me about 16.3 minutes to process a certain number of dates. But if I use the ThreadPoolExecutor, the speedup is about 20.9 times compared with using a list comprehension. I just mentioned that list comprehensions are the most optimized way of iterating without parallel processing, so this speedup is quite significant when you compare it against a list comprehension.

The second case is some image processing. I used a chest X-ray image dataset, and the reason I need to do the data processing is that the images in the dataset are of different dimensions, so I need to standardize the sizes. Same thing: I import the Python modules and define the image-resize task; in this example, I use os.getpid to monitor the process execution, and I initialize the list of files in the directory. In this example, I am processing 1,431 images. If I use a plain for loop, it will take me about 29.48 seconds. If I use a list comprehension, there is not much difference: about 29.71 seconds. But if I use the ProcessPoolExecutor to process my images using 8 cores, I get a speedup of about 4.3 times, so effectively I take about 7 seconds to process 1,400 images. That is the power of leveraging the ProcessPoolExecutor for your parallel processing. And if you take a closer look at the code, you will realize that it is actually pretty simple to implement.
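Here is a minimal sketch of the first case. The endpoint and parameter names are assumptions for illustration, standing in for the data.gov.sg real-time weather readings API used in the talk:

```python
import concurrent.futures
import requests

# Hypothetical endpoint and parameters, for illustration only.
URL = "https://api.data.gov.sg/v1/environment/air-temperature"

def fetch_readings(date):
    # Each call blocks on network I/O, which is exactly the kind of
    # waiting that threads can overlap despite the GIL.
    response = requests.get(URL, params={"date": date})
    response.raise_for_status()
    return response.json()

dates = [f"2020-06-{day:02d}" for day in range(1, 31)]

# Baseline, one request at a time:
#   readings = [fetch_readings(d) for d in dates]

# ThreadPoolExecutor version: executor.map, like the built-in map,
# yields results in input order; executor.submit would instead
# return one Future per call.
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as executor:
    readings = list(executor.map(fetch_readings, dates))
```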
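And a sketch of the second case. The directory layout, target size, and use of the Pillow library are assumptions for illustration; the talk's actual code uses os.getpid to tag each worker process:

```python
import concurrent.futures
import os
from pathlib import Path

from PIL import Image  # assumes Pillow is installed

TARGET_SIZE = (224, 224)  # hypothetical standardized dimensions

def resize_image(path):
    # CPU-bound work like decoding and resampling pixels is what
    # ProcessPoolExecutor parallelizes well.
    with Image.open(path) as img:
        img.resize(TARGET_SIZE).save(Path("resized") / Path(path).name)
    return os.getpid()  # which worker process handled this image

if __name__ == "__main__":
    files = sorted(Path("chest_xrays").glob("*.jpeg"))
    os.makedirs("resized", exist_ok=True)
    with concurrent.futures.ProcessPoolExecutor(max_workers=8) as executor:
        pids = list(executor.map(resize_image, files))
    print(f"Processed {len(pids)} images on {len(set(pids))} workers")
```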
Some key takeaways that I'd like you to remember from my talk. Not all processes should be parallelized, because parallel processes come with overheads. There is no free lunch in this world, because of Amdahl's law, and you need to consider system overhead, including communication overhead; this is not just a problem of distributed computing, it also exists in parallelization, even though there the communication overhead is not as significant. And last but not least, if the cost of rewriting your code for parallelization outweighs the time savings from parallelizing your code, which usually happens when your dataset is not big enough, please consider other ways of optimizing your code instead. And if you don't remember anything else I said, just remember: please do not use for loops. Either use list comprehensions, or if that doesn't work, use the concurrent.futures module for your parallelization.

Here are some references. You can reach out to me on all these social media platforms, and you can check out my slides at this link.

Okay, excellent. Thank you very much. That was very good. We don't have any questions, other than the comment that the attendees love toast too. Actually, it's a good thing that we had lunch before, otherwise we would have gotten really hungry. So thank you very much again. Oh, there is one question there: why do you not use third-party packages like multiprocessing?

Okay, first I need to emphasize that multiprocessing is not a third-party library. In fact, multiprocessing is part of the Python standard library, and concurrent.futures is an abstraction layer over the multiprocessing module in the Python standard library. And this particular use case is one where I just want to process my data. It's not about trying to implement my machine learning algorithms: if you are trying to parallelize your model training process, third-party libraries like scikit-learn, TensorFlow and PyTorch all have their own parallel implementations, where all you need to do is set something like the n_jobs setting. But here I just want to do some processing that does not involve the model training process, the pre-processing that comes before model training. In the case of images, that is not really very clear-cut.

Okay, thank you. So thank you very much for the talk. Let me post your URL as well.