Thank you very much. I had a little bit of trouble getting these slides to work. They looked perfect on my screen, but connecting to the projector changed the resolution and threw everything off. This talk is about programming in parallel with threads, with a scientific use case. So I would like to present the use case first, and then show you one solution using threads and where threads are really useful. The use case is a scientific application, so we'll look a little at what this application is supposed to do. We're talking about contamination, in micro concentrations, from agriculture and domestic housing. You apply a lot of chemicals when you do agriculture, and when you have a house, you have paint, and the rain washes some of the paint off. So we're talking about very small concentrations here, and our study area is the river Rhine, which is a very big river in Europe; I will show it to you in a minute, it's this area here. It's a bit off the screen, sorry, but that's the best I could get in time. So micro-contamination: very small concentrations, in the range of nanograms to micrograms per litre (the "per litre" is cut off here). Even so, they have biochemical effects on water organisms, and these can be pretty strong. For instance, herbicides inhibit photosynthesis, and this has a very big impact on the whole ecosystem, because photosynthesis is where everything starts, where all the organic substances are produced, and if that is inhibited, then something may go wrong in the whole ecosystem. Likewise, chemicals can inhibit the reproduction of fish, and currently there are more than 100,000 registered chemicals in Europe. So there are a lot of them, and this simulation model, as we will see, should be able to model most of them. Okay, some of the sources. We have diffuse herbicide contamination from agriculture.
You see here we have agriculture at different places, and a lot of chemicals like herbicides are applied there. Not everything is actually used for its purpose; depending on the precipitation, on the rain, some of it is washed into the river, and it is diffuse because it is spread over a larger area. So that's one source: herbicides from agriculture. The second source would be diffuse and point-source contamination from buildings. If you have buildings, then you have chemicals that are used for constructing the building, paint for instance. Then it rains, and the rain carries many very small amounts of this stuff, and it finally ends up in the river. You also have contamination from households, like medications and radio-opaque substances. So if you go to the hospital and get one of those to give a better picture for an X-ray or something like that, eventually it will end up in the river when you go back home. Then cosmetics, household chemicals, and artificial sweeteners, which are very interesting: they don't have any calories, so they are not attractive at all for any microorganism to break down. Since they don't break down, they can accumulate in the environment. And also effluent from sewage treatment plants; they are pretty good nowadays, but they don't get everything out, especially those small concentrations, and some kinds of substances they are simply not able to remove. Okay, the catchment area of the river Rhine, you see it here: the river is more than 1,200 km long, and for more than 800 km it is navigable, you can use a ship. The area is about 185,000 km². The discharge varies between 1,000 and 2,000 cubic metres per second, depending on where you are, and there are about 58 million people living in this area. So this would be the area that's called the catchment: this green thing is the catchment, the area from which the water finally ends up in the river.
So all excess water from precipitation will end up in this river. Okay, and this is just the big catchment, but it consists of about 18,000 sub-catchments: small areas, you can see them here. You see the river and all these tributaries, small other rivers that eventually flow into the Rhine, and there are 18,000 of them. You can delineate them in space depending on the surface: the water flows downhill, and you can always say, okay, this flows in this direction, that flows in that direction, and so you get the different areas that we call sub-catchments here. Now, let's have a short look at how these numbers come about. We have a lot of statistics. You know roughly how much medication an average person consumes per year and how much of it actually ends up in the wastewater. You know pretty well how much of it will be removed in the sewage treatment plant, and some of these substances are also used in veterinary medicine, so a lot of animals get these medications too. Typically, this input is just a function of the population density. This is the population density: dark colours mean a lot of people, white means very few people. You can see where people live, around Zurich, here in the Ruhr area, and in the big cities; where there are a lot of people, there is of course more input into the river. Contamination from agriculture is a bit different. You know roughly how much herbicide is usually applied per crop, and we know which crops are growing where. And there is a loss rate. The loss rate depends on the precipitation: the more it rains, the more is lost. It also depends on how the rain falls; a very strong rain may wash off more than the same amount of rain spread out over time. And you can see the pattern is different: now it's not the big cities, of course it's more where the agricultural areas are, and you can see here the regions where a lot is applied.
These are the statistical numbers that will be fed into the model. So what data is available? We have the local discharge as a time series; you will see this, we have hourly data for one year. We have land-use data, so we know if there is agriculture, if there are cities, and how many square kilometres or square metres of each are in these catchment areas. We know when the substances are applied over time. And we also know the loss rates and decay rates of these substances; some of them decay, and some of them decay so little that we can neglect it. So the amounts will be reduced over time. There is some prior art already: this project didn't develop the core software itself. There is an external program, written in C++, but we only have the executable. It calculates the concentration loads for one catchment, and for one substance. Typically it reads an XML file with all this information, and it also reads a CSV file with the time-varying data, which I will show you in a minute, and it spits out a CSV at the end; the result will be a CSV file. So this is how it looks. We have this XML file, just a typical XML file with stuff inside, and you see it describes what the substance is and some of the properties of the substance. It's not very big; that's it already. So there are a few entries, and many of them don't really apply here. You specify the name of the input file, which is the other file I'll show in a minute, and some of these properties describe what the model is supposed to do. Okay. The time series is also very simple. We have a time step; every time step stands for one hour, so we have about 9,000 time steps, which corresponds to one year. We have a temperature in degrees Celsius, precipitation, which happens to be zero here, and we have a discharge, so we know how much water actually flows from this sub-catchment into the river for every hour. Okay.
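Reading such an hourly time-series CSV takes only a few lines of Python; here is a minimal sketch with the standard library's `csv` module. The exact column headers (`timestep`, `temperature`, `precipitation`, `discharge`) and the sample values are my assumptions for illustration, not the real file:

```python
import csv
import io

# Hypothetical hourly time series, one row per time step
# (headers and values are assumed, not from the real data).
SAMPLE = """timestep,temperature,precipitation,discharge
1,12.5,0.0,0.8
2,12.3,0.2,0.9
3,12.1,0.0,0.85
"""

def read_time_series(fobj):
    """Parse the CSV into a list of dicts with numeric values."""
    rows = []
    for row in csv.DictReader(fobj):
        rows.append({
            "timestep": int(row["timestep"]),
            "temperature": float(row["temperature"]),     # degrees Celsius
            "precipitation": float(row["precipitation"]),
            "discharge": float(row["discharge"]),         # m³/s into the river
        })
    return rows

rows = read_time_series(io.StringIO(SAMPLE))
```

For a full year of hourly data this loop runs about 8,760 times per catchment, which is why the real project reads the bulk data from HDF5 instead.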
The result will be a file that looks like this. Again we have this time-step column here, which is also cut off, and it echoes back the input. And then it gives you this, the most important number: the concentration. In this case it's atrazine, which is one of the substances. So that's what comes out. Okay. We had to prepare the data a little bit, because we have 18,000 catchments and we get them in the wrong format. There is a file with 18,000 columns, but we need 18,000 rows, so we need to transpose this matrix. How I did the transpose would be a different topic, so I won't go into it here. The land-use data comes in a DBF file, so we have to read it from there, as well as the sub-catchment connections, which we will use in a later step. Okay. So what do we need to do? First we have to get the data from this big file and, as I said, rearrange it per sub-catchment. Then, and that's actually what I want to talk about, this red box here: we have to calculate the loads for each of these 18,000 sub-catchments. And finally I have to rearrange the data again, because I need it per time step, not per catchment; that would be another post-processing step. Okay. Now, this is what we want to do: calculate all 18,000 catchments. For each catchment we get some input data, then we run our external model, and we get some output. This is Python 3; I think it was Python 3.5 at the time, but everything also runs with Python 3.6. We use NumPy, pandas, and PyTables, because we store all the data in the HDF5 format, which is a scientific format that can hold very large amounts of data; you can have terabyte-sized files if you like. In this case the output is not that big, just 6 gigabytes. Okay. So we calculate the load and concentration for one catchment, and that's what we need to do: we get the data from this HDF5 file, we put it in, and we generate the input files via this XML file.
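Generating the per-catchment XML input can be sketched with a template; here `string.Template` is my choice, and the placeholder names (`substance`, `input_file`) as well as the simplified XML structure are assumptions, not the project's actual template:

```python
from string import Template

# Hypothetical, much-simplified version of the model's XML input
# (the real file carries more substance properties).
XML_TEMPLATE = Template("""\
<model>
  <substance>$substance</substance>
  <inputFile>$input_file</inputFile>
</model>
""")

def make_input_xml(substance, input_file):
    """Fill the template for one catchment/substance combination."""
    return XML_TEMPLATE.substitute(substance=substance,
                                   input_file=input_file)

xml_text = make_input_xml("atrazine", "catchment_0001.csv")
```

Because only a handful of values change per catchment, a plain string template is enough here; no XML library is needed for writing.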
So I use a template, and I just have to fill in those missing values; that's not much. We also have to generate the time series with the temperature, the precipitation, and the discharge. Then we start an external process and wait till it's finished. And then we have to read the output and process it further. So there are four steps. And now we come to the core of this talk: the case for threads. You might know that CPython, the standard Python, has what's called the global interpreter lock, the GIL. CPython has supported threads from the very beginning, and it uses operating-system threads, but they don't actually run in parallel in terms of CPU. So if you have CPU-bound tasks, then even if you have multiple threads, they all run one after the other; they never run at exactly the same time. For CPU-bound problems, threads don't make sense; actually things will get slower, because you have the overhead of switching from one thread to the other. If you have I/O-bound tasks, input/output-bound tasks like reading and writing files or starting external processes, then threads do make sense, because for I/O-bound work Python releases the global interpreter lock and you can take advantage of the threads. This is very important: if you want to use threads to make things faster, you typically have to be working on I/O-bound things, otherwise it won't make things faster. Unless you just want concurrency and you don't care about speed, that's also fine. But the GIL is only released on input/output. That's a very important fact. So now let's re-examine the steps we have here. If we go through them, we see that all of them are pretty much dominated by input/output: we generate files, we write files, we start an external process, which is by far where the most time is spent, as we will see, then we wait for the result, and then we read the outputs. So most of these tasks are input/output-dominated; they are I/O-bound.
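The GIL behaviour just described can be demonstrated with a tiny self-contained sketch: `time.sleep` stands in for real I/O and, like real I/O, releases the GIL while waiting, so ten waits overlap instead of running one after the other:

```python
import threading
import time

def io_bound_task(duration=0.2):
    # time.sleep stands in for real I/O (file access, waiting on a
    # subprocess); like real I/O it releases the GIL while waiting.
    time.sleep(duration)

start = time.perf_counter()
threads = [threading.Thread(target=io_bound_task) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.perf_counter() - start

# Run serially, ten 0.2 s waits would take about 2 seconds;
# with threads they overlap, so this finishes in roughly 0.2 s.
print(f"10 overlapping 0.2 s waits took {elapsed:.2f} s")
```

Replace the sleep with a CPU-bound loop and the speedup disappears, because then the GIL is held the whole time.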
So we have a use case for threads. Okay, this is the big picture of what happens. We have a main thread and we have workers; these are the acting persons. In our program, the main thread is the one interacting with the data: it reads the data from this big HDF5 file and saves the results back. And it distributes the work to the workers; it starts a new worker for each calculation. The worker does its task, as we will see in a minute: the worker actually does the calculation, and then the worker doesn't return a value in this case; instead the worker puts the result into a queue. Whenever you can, when you work with threads, you should use queues. Don't try to do your own locking. If you want to coordinate different threads you can do locking, but it's really not recommended, because it's very, very easy to deadlock things. Here I use a queue: each worker does its work, and when it's done it puts the result into the queue, and the main thread takes the results out of the queue. Maybe we'll come back to this picture. So I actually start a new thread for every calculation, and now the question is: isn't that inefficient, wouldn't it be better to use a thread pool? Maybe it's more elegant to use a thread pool, but as for efficiency, I did a short test: starting 18,000 threads and joining them again takes about a second, more or less, on my machine, while the total run time of my whole program is two and a half hours. One second versus two and a half hours doesn't really matter for this use case. For this use case it doesn't matter that you create threads and throw them away afterwards, because you could potentially gain maybe half a second, or up to a second, and with a run time of two and a half hours you can forget that. So sometimes it makes sense to use thread pools, but in this case, in terms of time, it doesn't give you any advantage.
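The pattern just described, one short-lived thread per task with results handed back through a queue, can be sketched like this; the names are mine, and the toy calculation stands in for running the external model:

```python
import queue
import threading

def worker(task_id, value, result_queue):
    # Stand-in for one catchment calculation; instead of returning,
    # the worker hands its result to the main thread via the queue.
    result_queue.put((task_id, value * value))

result_queue = queue.Queue()   # thread-safe: no manual locking needed
tasks = list(range(8))

# The "boss" (main thread) starts one thread per task ...
threads = [threading.Thread(target=worker, args=(i, v, result_queue))
           for i, v in enumerate(tasks)]
for t in threads:
    t.start()

# ... and collects exactly one result per task from the queue.
# The task ID says which result belongs to which task, because
# the finishing order of threads is not deterministic.
results = {}
for _ in tasks:
    task_id, res = result_queue.get()
    results[task_id] = res

for t in threads:
    t.join()
```

The queue does all the synchronisation: `put` and `get` are safe to call from any thread, which is exactly why no locks appear anywhere in this sketch.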
Okay, so let's have a look at the code, which has shifted a bit here, I don't know why. This is the source code of the worker thread. I import the class Thread from the threading module in the Python standard library, and I inherit from Thread to make my worker a thread. This is the worker code that's doing all the work, and you see it has a lot of methods, but all of them have leading underscores, which means they're only used internally. The thread has to generate the input files, first the XML file and then the time-varying file; it has to execute the external program, run it, and get the result; and it has to read the output. These are the tasks the worker has to do, but the only thing the threading machinery really cares about is run: I have to override a method called run, and we will see how this method gets executed. So these are just the steps we saw, one, two, three, four, that are supposed to happen inside one of these workers, and the main thing is run, which will be executed. If you look at run, you see it just creates the input, calls the external program that does the scientific calculation, and reads the output back. That's what one of these threads is doing, and of course this run will be called 18,000 times. I just gave all the sub-tasks descriptive names; what's inside them doesn't really matter here, those are just details that have nothing to do with threads. Only the run method is important. And then there's the boss, the main thread. The boss is just a normal class, you can inherit from object, and it does a lot of preparation work, everything that has to be done once: I generate the templates first, which will be filled in later on; I generate some temporary files and directories, where each worker puts its files and reads the
output. The boss also opens the HDF5 file; I have an open method, and you have to do something a bit special with HDF5 here to keep the file open, plus a close method for the HDF5 file. Then I read the parameters, the global parameters and so on. The interesting part is this run_all method. The run_all method does the main work, and it is the only method that is concerned with threads; all the other ones are not concerned with threads, they do global tasks. It's always a good idea, when you work with threads, to put whatever concerns the threads into one method if possible; then, if something goes wrong with a thread, you only have to look in this one method. That might not always be possible, but in this case it is. So again, the boss does everything that concerns the global picture, everything that concerns all 18,000 calculations: I read from the HDF5 file and write the HDF5 output back. You could read from the file inside the threads, but I tried it and it didn't work; giving all the workers access to the HDF5 file to read separately didn't work. And I think it doesn't matter that much, because when you read from an HDF5 file, the hard disk is typically the limiting factor anyway. If 8 or 10 or 50 threads read from one file, it won't get any faster, because the data doesn't come out any faster than the hard drive delivers it. So I have the main thread reading the data and writing it back, and I don't think it's a bottleneck; HDF5 is a very fast way to read and write data because it's a binary format, so typically the hard drive is the limit, and it's not a problem. Okay, and now, how does the communication work? I found this joke; I don't know if you'll get it, for native speakers it might be understandable. The patient says, "I feel like a billiard ball," and the doctor says, "Please go to the end of the queue." We're using queues, so maybe someone gets it; otherwise I can give you the solution later. So we're using
queues to communicate here, and that's what I would suggest: when you use threads, always try to use a queue for communication between the workers and the main thread. It has a big advantage: if you later want to go to multiprocessing, you can move your code over to multiprocessing easily, because multiprocessing uses a very similar model with queues for communication. That makes your life much easier. Whenever you can, avoid shared data structures, where somebody puts something in and somebody else takes it out; use a queue, which is thread-safe, and then somebody else has already solved the problem for you. So the boss creates a lot of workers, typically as many workers as CPUs; actually you can use more, as I tried. I start a new thread, feeding it its initial parameters, and then the most important thing: the worker doesn't have a return value, but rather puts its result into a queue. So that's how it looks. This is the worker again; now I show you the parts relevant to the thread. It gets a queue, and then the important thing: it reads its result, and you see these are the column names; I use pandas here, which is very, very convenient. And then this is an important line: I put my result back into the queue. I have an ID, every thread gets a new ID so I can tell them apart, I just count a counter up, and I put my output, which is a pandas data frame, into the queue, and that's it. There's no return; I put the result there and then I'm done. That's how this communication works. And this is the main thread: the main thread makes one queue. In this case I use one queue; you could use different models, for example a separate queue for each thread, that's also possible, but I use one queue. Then I have a big loop here, and it's a bit involved, but you don't need to understand everything. You see I have IDs, and I try to get the next ID; after a while all the IDs will be used up, and then I get a StopIteration exception, and then I break out of the loop and stop
the whole thing, setting done to True, which will break things out later on, as we'll see. Now, the important part is here: I make an instance of the worker and give it the queue; that's the way the worker can hand back its result. Then I do a few checks down here, and here, it has unfortunately moved off the slide, you see the queue get. queue.get gives me a result; it effectively asks: is there something in the queue? As soon as there is something in the queue, I take it out and get the next result. Since I attached an ID to each task, and the worker gives me back the ID, I know which result it is, so I use this ID, and that's pretty much it. Then, at the end, I join the worker, so this thread is finished and cleaned up. I also have two places where I call get: one in the main loop, and then, when fewer and fewer results are left, a second get at the very end. But this is the communication: the thread calls put and puts its result in, and the main thread calls get and takes the result out. No return value, but put and get; that's pretty important. A queue is really the way to go here. Okay, now: what did I gain? How did it work out compared to a serial calculation, where I just have one thread, one program without any threading, doing it all one after the other? I tested it on an Intel i5, I think with four cores and hyper-threading, so I get eight logical CPUs, which is pretty good; you won't get much more out of such a machine than the 90% or so I reached. And amazingly, it scales up to 50 threads. At first I thought, if I have four cores and eight logical CPUs, maybe four or eight threads would be a good number, but I tried more, and I actually got a bit faster using more; the operating system juggles those processes for me. And 95% of the time is spent in the external processes, so my program is doing very little computation compared to the external processes; I just need to feed them to get them to run in parallel, and all
the heavy lifting is done by the operating system, because it handles the threads for me. That's what you want: minimal effort on your side, and somebody else doing the hard work, in this case the operating system scheduling all the threads. I ran this on a Windows machine, but it actually also runs on Linux, because I use Wine. I didn't touch the executable; I just run it with Wine and it works, because it's an ordinary program, so I can run it on my Mac too. That's what I did. There could be other ways of solving this; this is just my approach. You could use concurrent.futures, which is part of Python; I haven't used it here, I don't even remember why not, but you can, and it gives you a higher-level approach with a thread pool, so you don't have to handle the threads yourself. You could use multiprocessing, but I don't think that helps, because most of the time is spent in the external process anyway, and multiprocessing starts yet another external process with a Python interpreter in it; you don't gain anything here. Eventually you could move it to a cluster: I have 18,000 independent calculations, so theoretically I could scale it out on a cluster; with 18,000 cores it would work, something like that, and you could use something like Pyro4 or some other technology for this. If you use queues you're pretty close; you might have to change the flow a little bit, but you could switch to other technologies. For our project this was fine, because it was supposed to run on a single computer, and with two and a half hours the scientists are very happy; that's very close to something running overnight, and everything that takes less than overnight is fast enough. Good. Conclusions, the take-home message: use threads whenever you have I/O-bound problems. As you can see, most of the code looks very much like serial code; there's nothing special about threads, there are only a few places where I work with a queue
and try to isolate those in separate methods whenever possible. And whenever possible, also use queues for communication, which makes your life much simpler than using shared data structures. Okay, thank you very much, and there are a few minutes for questions, I guess. So, who has questions? In the back, of course. Thanks for the presentation; my question is about thread safety. You mentioned there might be some issues; which lines of your code could be changed so that wrong things happen? Yeah, the main thing is: if one of the subprocesses doesn't finish, then that one worker would block. That could be a problem here, so you would need some timeout, which I don't have here now, to say: if this one doesn't come back after two or three minutes, then you either need to restart it or at least raise a red flag that it didn't work. In my case everything finished; all the calculations finished, because they are simple statistical calculations and there are no numerical problems that can happen, so I didn't have much error handling for the case that a subprocess dies. In terms of threads, since every worker is totally independent and doesn't interfere with the others, there shouldn't be a problem like that, because nobody is really waiting; the only one waiting is the main thread, waiting for the worker to deliver a result, and if the worker never delivers a result, it will wait forever. If that were to change, you would need to say: ID 5,350 didn't finish, now somebody has to look at what went wrong. If you have 18,000 and you lose two or three, that might be okay. Okay, any more questions? Okay, my question is: can you make it possible for multiple threads to access the same data frame? You can, but it's difficult. Of course you have to do locking if you want multiple threads to access the same data structure. It can be done, but threads are not deterministic: you don't know when this thread or the other thread is active, the operating system takes that over, so typically I
would recommend trying to avoid it. It's possible, but it increases the level of complexity, possibly by an order of magnitude. Make it as simple as possible. If you really want to do this: if you have a shared array and four threads, let each thread work on its own part of the array, so they don't interfere with each other. Otherwise you need to lock, because now I'm doing something here while you're doing something there, and once you lock things you have the potential for deadlock, which is very difficult to debug, because it might happen only once and you can't reproduce it when you try to debug it. So avoid shared data structures whenever possible. And if you really need to do computationally heavy things in parallel, you might want to look at other technologies, like OpenMP, or something with Cython, or other parallel technologies; Python threads might not be the way to go then. My question is regarding the bottleneck: is there a specific reason why you didn't use a database, in order to avoid the bottleneck of writing to one file? Well, HDF5 is actually very close to a database; it's designed for scientific data, and there are a lot of benchmarks showing that for many use cases HDF5 is faster than Postgres for reading and writing. And the thing is, if you look at it, I spend 95% of the time in the external process; there's only a little bit of reading and writing, maybe a percent of the time. Even if I make that twice as fast, I go from one percent to half a percent; it doesn't make any sense. You want to optimize where most of the time is spent, and that part I cannot change anyway. The performance gain only comes from not waiting for one process after the other, but running 50 of them in parallel; that's where the time comes from. And as for writing and reading the file being the bottleneck: a database also has to write to the hard drive, or the SSD, eventually, so the physical writing would be the limiting factor either way, but in this case it doesn't really matter, because this
external process dominates by far. Okay, any more questions? Here's one more, I didn't see you. Hello, thank you for the talk, it was awesome. Regarding reading the file: were you reading it all at the same time, or were you reading a chunk and feeding it to a worker each time? I have one big file with all the data, an HDF5 file, and only the main thread accesses that file; the workers never access the main file. They do generate intermediate files for the external program: the XML file and the CSV files are generated by each worker, because the external program needs physical files. So they generate the input files, which are pretty small, then the external program reads them and writes an output, then the worker reads the output, turns it into a data frame, and hands the data frame back to the main thread, and the main thread eventually writes it into the big HDF5 file. In the end the output is an HDF5 file with 132 million rows, and then I need to rearrange them, which HDF5 does for me, because you can index HDF5 and it does all the sorting, so I don't have to load everything into memory; I can index it, and sorting 132 million rows takes about half an hour, which is pretty good. Anyone else? I have to run today. What do you think about using greenlets instead of threads for the same problem? Can you repeat that? What do you think about greenlets instead of threads?
I think it could also be done; there could be many approaches. You could use async programming if you want, you could use greenlets, but with greenlets or async you need to totally change the way you write the code. Here you can mainly write serial code, as you would write normal Python, and only in a few places, when you work with this, do you have to inherit from Thread, override the run method, and use the queue with get and put; that's pretty much it. I think greenlets can be much faster if you have hundreds of thousands of things going on at the same time; then threads are probably not the way to go. But here I have eight CPUs, and I don't have more than eight things going on at the same time, so threads are perfectly fine. Having a few thousand or 10,000 threads shouldn't be a problem for today's computers; you cannot run a million threads, but with greenlets I think you can easily run a million things at the same time. That's a bit of a different kind of task, though. So you could use greenlets, you could use many different approaches; it just depends on your preferences, what you know already, and what's available. Here I don't need any external library, and in my estimation it's pretty easy to follow what's happening; with greenlets you'd have to learn how they work first, but it's certainly possible. Okay, anyone else? Okay, thanks, Mike; give him a round of applause.