Thank you very much, it's always nice to be at EuroPython. I'm actually a long-time Pythonista: I started 15 years ago, in 2002. The reason for the X in the title is exactly this: I'm not sure how many years I've been doing parallel programming. I learned about it for the first time 15 years ago. Then, more or less 10 years ago, I worked in a company where we had a cluster, so I was doing parallel computing, but it was not real: I would not say I was really doing parallel computing, because I was just writing sequential code and the cluster was doing everything for me. But in recent years, let's say the last five years, I have been working for the GEM Foundation, which is the Global Earthquake Model; we will talk about it briefly. There I'm really in charge of the parallelism: how to parallelize the algorithms we use to simulate earthquakes, things like that. So in these last years I have really been doing a lot of parallel computing, and I learned some lessons that I would like to share with you.

This is a picture that shows what we produce at the end. This is a map of the earthquakes that may happen in the next 10,000 years in California, around San Francisco: you see the magnitudes, things like that. We do this kind of plot; we also do estimates of the damage that earthquakes can produce. And we have a number of sponsors. There are two reasons why I'm showing this slide. One, the sponsors are giving us money, so I'm here partly because of the sponsors. Thanks; thanks to Dan. The second reason is that you too can be a sponsor. Here you see big names: typically insurance companies, or states like Australia, New Zealand, Taiwan, Switzerland, the United Kingdom, Italy, Germany, Japan, several countries. The USA is giving us a lot of money, also because we have projects with developing countries: Tanzania in Africa, for instance; we have also worked with South America, Central America, Kyrgyzstan. It's very difficult to find a place in the world where we don't have any collaboration. So there are big names here, but I say you can collaborate with GEM even if you don't give us money: you can give us, I don't know, free access to computing resources, or, if you are Intel, you can give us software licenses or hardware to test on. You can help and you can contribute. This is a non-profit foundation, so it's also ethically good, because you contribute to saving people.

That was my introduction. I personally work on the OpenQuake engine, which is our computational engine. GEM is not only about the engine; there are several other things that we do, and we have a scientific staff. I'm actually the IT guy. Originally I am a scientist, I'm a physicist, but I never did anything at all with earthquakes, so I'm not an expert in earthquakes; please don't ask questions about earthquakes. As I was saying, there are other things we do, and there are some colleagues of mine here too. You can recognize them, at least one of them, by the special OQ logo, as in OpenQuake. For instance, that guy here, my colleague, is doing the visualization part: he has written a QGIS plugin to plot the results of the engine and to do analysis of what happens after earthquakes, how much time it takes to recover, things like that. And I have another colleague there.
He's working on the platform, because we also have a web platform, based on GeoNode, where you can share maps, download data, things like that. So, lots of things. The engine in particular is a mature project, I would say, because it started in 2009, eight years ago. I have been working on it for more or less four years and a half, nearly five, so it started before my time. It has all the good buzzwords: it's on GitHub, we use Travis and Jenkins, we support Python 3 too, and we actually have a plan to drop support for Python 2 next year. There is also a small part which is the web interface, but it is really small: most of it is the computational part. We focus a lot on scalability. This engine works on a Raspberry Pi, works on an Atom, works on your laptop, on a workstation, on a server, on a cluster. And we believe it also works on a supercomputer, but we don't have access to a supercomputer; so if you want to start a collaboration with us, we are open to it.

Okay, that was the introduction, and now we start the real talk. This is not my laptop, because of projector issues; I wanted to show a demo, but this is not my laptop, so I have to accept that. More on this later. This talk will say nothing about earthquakes; I want to be more general, because I expect that most of you are programmers, not necessarily versed in numerical simulations. And my point here is that parallelism can be useful even to people who think it is not useful to them because they are not doing numerical work. The reason it is so useful is that this kind of problem, the so-called embarrassingly parallel problem, is extremely common: any time you want to process some data, and you can split the data into chunks and apply the same algorithm to each chunk, that's an embarrassingly parallel problem. And there is good support in the Python world for solving these problems, so it can be even easier than you think; you don't need to be a computer scientist. There are also some tricky points that make this fun.

So I will give you a motivating example. It's very simple, and it also comes from real life, because before coming to GEM I was working for a finance company. I was a back-end engineer and did several things; among them, I was in charge of importing data into our Postgres database. The data were essentially CSV files with this structure: there was a code for each financial security, a date, and the price at that date. Very, very simple. The problem is that we had thousands of files, for hundreds of thousands of different financial securities of various kinds, and we were going to import, I don't know, 60 million rows into Postgres, something like that. And this was an issue, because we had a process where we imported the new data and then computed the prices of the derivative financial instruments during the night. The night is, I don't know, eight hours. If you can import things in two or three hours, then you have five hours to do the computation. But if the import is too slow, you don't have enough time to do the computation, and you cannot provide users with the output they want. So we had this problem, and I was thinking about how we could speed up the import.
One idea could be: okay, I will do this import in parallel. But I was not very convinced that this was a good idea, because in the end, you know, you have one disk, and everything is imported into the same database; there was no sharding or anything sophisticated like that. So I was not convinced that parallelism was going to help. On the other hand, you know that Postgres is able to make use of all the processors that you have. We have several processors, why waste them? Let's do some experiments; so, think again. The final message I'm going to give you is that this looks like an I/O bound problem, but actually parallelism can help a lot.

This is the demo I wanted to do. I generate 500 files, more or less; each file has 10,000 rows, so there are more or less 5 million rows, 200 megabytes to import. This is an example I could run on my laptop; in reality we had something much bigger. And I'm not doing any magic here, except that I'm not regenerating the indices, because, as you probably know if you have worked with this kind of process, the best way to import stuff is to drop the indices, import everything, and then restore the indices; and restoring the indices may take more time than the import itself, done that way.

So, how much time do you think it will take to import 5 million rows? Okay, make your guesses. This laptop is very old, it's actually six years old; it's a ThinkPad, 12 inches, not state of the art. The idea was to run the demo here, but I won't do that on a computer which is not mine, so I'll just give you the answer. You might think: oh, 5 million rows, I don't know, 5 hours, 5 minutes, 5 seconds? Can you raise your hands? How much do you think? Who says 5 hours? Nobody. Who says 5 minutes? Who says 5 seconds, sequentially? Somebody says 5 seconds or 5 minutes. Actually, you are right, because it takes essentially 1 minute on this laptop, which is really impressive: if you think about it, 5 million rows in 1 minute is really impressive. So it is fast sequentially.

This is the code I'm using, if you are curious. Of course I'm using COPY FROM, which is the right thing to do if you have to import CSV, and in this example I'm doing things sequentially and printing how much time it takes. In the first line I call psql, and db.sql is a file containing a CREATE TABLE; so I create the table, and then I look inside the directory where I have my data for the CSV files, I loop over all of them, and for each file I call an import function, which just does the COPY FROM for that file by calling psql. So for each file I start the psql command again, which is not the most efficient way, as you can imagine; you could do everything in Python. But since the start-up time of psql is small and the files are big enough, let's try it this way. Done like this, it takes 1 minute.

Now, suppose I want to do this in parallel, to see whether there is an improvement or not. Parallelizing is extremely easy: you know that the Python standard library has a multiprocessing module, from which you can instantiate a pool and then call pool.map. I assume most of you already know this. Is there anybody who does not know about the multiprocessing pool? No? So everybody knows it.
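Since I cannot run the demo, here is a minimal sketch of the kind of script I'm describing. It is not my actual code; the database name, the table name and the file layout (importdemo, prices, db.sql, a data directory) are just assumptions for the example:

    import os
    import subprocess
    import time
    from multiprocessing import Pool

    DATA_DIR = 'data'  # hypothetical directory containing the CSV files

    def importer(fname):
        # one psql process per file, doing a client-side COPY FROM
        subprocess.check_call(
            ['psql', '-d', 'importdemo', '-c',
             r"\copy prices from '%s' csv" % fname])

    def main():
        t0 = time.time()
        # db.sql is assumed to contain the CREATE TABLE statement
        subprocess.check_call(['psql', '-d', 'importdemo', '-f', 'db.sql'])
        fnames = [os.path.join(DATA_DIR, f)
                  for f in sorted(os.listdir(DATA_DIR)) if f.endswith('.csv')]
        for fname in fnames:  # sequential version: one file at a time
            importer(fname)
        # parallel version: replace the loop above with
        #     Pool().map(importer, fnames)
        print('imported %d files in %.1f seconds'
              % (len(fnames), time.time() - t0))

    if __name__ == '__main__':
        main()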
Maybe you don't know that in multiprocessing there is also a dummy module, and if you use this dummy module you get a thread pool instead of a process pool, so you avoid the penalty of instantiating a pool of processes; you just use a pool of threads. In this case that's fine, because the processes doing the real work are the psql processes anyway. So you can do that, very simple. And how much does it take? It takes 21 seconds, so it's three times faster on this machine. This machine has a dual-core processor with hyper-threading, so it looks like four processors; but if you have done this kind of thing, you know that the real cores are the ones that matter. So if this were dominated by the CPU, I should get a speedup factor of two, not four. That factor of three means there is something strange here: some parts are dominated by the disk, some parts are CPU-bound.

And you can do a lot of exercises of this kind; these are exercises that I suggest you think about. What happens if you have short CSV files? You may think the files are so small that the time to start the external psql command kills your performance, so maybe you want to use psycopg2 instead and access the database directly from Python; there is a copy_from in psycopg2, use that. Then you can think: if I do this concurrently, how does it work? Does it pay to instantiate one connection for each thread, or one cursor for each thread? Which is the most efficient? What happens if I use processes instead of threads? How much does this depend on the hardware? Because I was very surprised: this old laptop takes one minute, and in my office I have a new workstation with ten cores, very powerful, and it takes 80 seconds. The big machine is slower! All kinds of strange things may happen. And also, if you try something with one version of the software and it is very slow, you think: ah, this approach will fail. Then maybe six months later a new version of the same software appears, and what you thought was slow suddenly becomes fast. So it's very tricky.

What you really, absolutely need to do is measure. You need a way to measure, to instrument your code and your specific use case, and see how fast it is under all these different circumstances. This is the most important thing. And I want to show you how we do that at GEM, with the OpenQuake libraries, just to give you an idea of how the monitoring we do works. I'm not saying this is the best monitoring in the world, but it's really good for what we are doing. Essentially, you import a monitor object, which is a context manager, and then you write "with monitor:" and you monitor a piece of code; this will take the time at the start and at the end of the block, and it will also measure the memory occupation of that block of code (there is a small sketch of the idea below). We also have a Starmap class that you use like the starmap in itertools (there is also Pool.starmap in Python 3), where essentially you map a function over a list of tuples; the list of tuples can be an iterable, etc. This is just a thin wrapper to abstract away the underlying concurrency layer: we can use threads, we can use processes, we can use Celery, we can use a lot of different concurrency frameworks, but the interface stays homogeneous, and depending on a configuration variable you use one or the other.
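To make the monitor idea concrete, here is a minimal sketch of such a context manager. This is not the OpenQuake code, just the pattern; it assumes the third-party psutil package for the memory measurement:

    import time
    import psutil

    class Monitor:
        """Minimal sketch: measure the time and memory growth of a code block."""
        def __init__(self, operation):
            self.operation = operation
            self.proc = psutil.Process()

        def __enter__(self):
            self.t0 = time.time()
            self.mem0 = self.proc.memory_info().rss
            return self

        def __exit__(self, etype, exc, tb):
            self.duration = time.time() - self.t0
            self.mem = self.proc.memory_info().rss - self.mem0
            print('%s: %.2f seconds, %d bytes' %
                  (self.operation, self.duration, self.mem))

    # usage: monitor any block of code
    with Monitor('building a big list'):
        data = [i ** 2 for i in range(10 ** 6)]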
Also, we like the HDF5 format very much; it is really, really good if you need high performance and you have arrays, lots of numbers to store. So we store the performance information inside HDF5; in the past we stored it inside Postgres, now we store it there. The task function is the same as before, except that there is an additional argument: the last argument is the monitor. That is the instrumentation part. Then you instantiate a monitor, and you give the monitor the path to the HDF5 file where the performance information will be stored, like calc1.hdf5, calc2.hdf5; every time you run a new calculation, a new file is produced. These are the arguments you pass: the name of the file and the instance of the monitor. And your computation will be as fast as before, because essentially by default we use the process pool underlying the concurrent.futures module and we spawn Python processes. But depending on an environment variable you can also use threads; you can use Celery; or you can say 'no', which means: do not distribute at all (there is a small sketch of this selection mechanism below). That is really important if you want to debug something: if you have an error, you set distribute to 'no', run your calculation sequentially, and see what's happening. And we have others: for instance, we have an experimental implementation that works with a grid engine, but I have only tried it on my machine, since we don't have a supercomputer to try it on. Anywhere you have some kind of map-based concurrency framework, we can extend to it easily.

So, as I was saying, we have commands to inspect the performance information. We can see how long each task took; we can plot the runtimes of the tasks and see if there are, for instance, slow tasks. This is a very common problem we have: sometimes one task is so slow that it dominates the computation, and everybody is waiting for that one task. Another very important thing: it gives you a measure of how many bytes you are sending over the wire and how many bytes you are receiving. This is very important in a cluster situation, because there are limits, and it can happen that your calculation fails because the data transfer is too big: there is an error and it fails. Or maybe it does not fail, but it becomes slow, or you have memory issues. So you get all this kind of useful information.

Another nice thing is that this approach does not require you to decorate the task function. I don't know if you are familiar with Celery or other concurrency frameworks: you typically have to put a decorator on the task. With this, you don't need a decorator, so you can parallelize a function from a third-party module where you do not have access to the source and cannot change it. That is nice.

And of course we spent several years working on the OpenQuake libraries, so we know the problems. For instance, if you are on a single machine and you kill the parent process, with multiprocessing the children stay alive, which is really annoying. But we can work around that, because there is a process-control library (prctl, which works on Linux) that will automatically kill the children if the parent dies. Another point is that we fork the processes before loading the data: because if you have, I don't know, 5 gigabytes of data, and you first read the 5 gigabytes and then fork, with 10 cores you will consume 50 gigabytes of memory. So it's best to fork first.
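Here is the promised sketch of the selection mechanism: a map over tuples of arguments whose backend is chosen by an environment variable. This is not the OpenQuake implementation; the variable name DISTRIBUTE and the example task are made up:

    import os
    from multiprocessing import Pool
    from multiprocessing.dummy import Pool as ThreadPool

    def starmap(task, allargs):
        """Map a task over tuples of arguments, like itertools.starmap,
        with the backend chosen by the (hypothetical) DISTRIBUTE variable."""
        dist = os.environ.get('DISTRIBUTE', 'processes')
        if dist == 'no':  # run sequentially: invaluable when debugging an error
            return [task(*args) for args in allargs]
        pool_cls = ThreadPool if dist == 'threads' else Pool
        with pool_cls() as pool:
            return pool.starmap(task, allargs)

    def double_sum(x, y):  # a trivial example task
        return 2 * (x + y)

    if __name__ == '__main__':
        print(starmap(double_sum, [(1, 2), (3, 4)]))  # [6, 14]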
There is also a suicide functionality, in the sense that if a task measures that it is consuming too much memory, it does not produce anything and dies. So there are things like that. We also found problems in the libraries we were using. For instance, Celery has this feature that when you run a calculation, the results returned by the tasks are kept in memory, cached. So if I am returning 100 gigabytes of data, my cache is 100 gigabytes: you see a linear increase in memory, then you run out of memory and you die. We had to do a trick to avoid that, essentially touching a private method of Celery, to do what for us is the right thing; for them, the cache is a feature. Celery is meant for other use cases, not really for numerical calculations. Anyway, we are using Celery for historical reasons; it was not a decision of the current team, it comes from the past. But it works well enough, once you know the tricks. There are situations where it does not work, so I don't think it will stay with us forever, but for the moment it stays.

And we have other interesting things. There is an SQLite database which stores all the metadata for the calculations: when a calculation started, when it finished, who started it (the users), things like that. We can restore the database; and since each calculation is in an HDF5 file, we can copy those files around, and we can delete lots of things. There are facilities to convert from Python objects to HDF5 objects. We also have things like reading seismic sources from XML, but the library is totally generic, and you can use it as a generic serialization library between Python and XML, in both directions. We have some data structures too; for instance, there is an accumulating dictionary, which is convenient: typically we have a dictionary of arrays, the tasks return back such dictionaries, and then you can sum the arrays easily (there is a small sketch of the idea at the end of this section). So there are useful things here.

You may think: well, then, let's go and download the engine. And I recommend that you do that: you can, it's free, it's on GitHub, and you can now install it very easily on Mac, Windows or Linux. But I must warn you that this is a framework, and if you know me, you know that I'm not particularly in love with frameworks. There are people that, like in the keynote, say: oh, Django is really beautiful. And I agree that for many people and their use cases it is wonderful; but in my experience, whenever I had a framework I was always fighting against the framework, because I needed to do something slightly different. I started with Plone, so if you have worked with that, you know my pain. No, I'm joking; but not so much. So: it's very good to write your own framework. I think you must write your own framework for your job, for your application; but think twice before shipping it, okay? Think twice, because I don't know any web framework that I think is good; there is not even one that I like. So the idea here is that you should steal the ideas from the engine. Look at the code: the parallelization part is very small, a few hundred lines maybe; copying it is fine. If you really want, you can also use the code itself; after all, it is AGPL licensed. And this is the documentation; you can click on this, I'm not sure whether the internet connection works, but you can do all these things.
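As promised, a minimal sketch of such an accumulating dictionary; the real one in the OpenQuake libraries is more general, this just shows the idea of summing dictionaries of arrays:

    import numpy

    class AccumDict(dict):
        """Minimal sketch of a dictionary that supports +: values with the
        same key are added together (this works well with NumPy arrays)."""
        def __add__(self, other):
            new = AccumDict(self)
            for key, value in other.items():
                new[key] = new[key] + value if key in new else value
            return new

    # e.g. merging the partial results returned by two tasks
    res1 = AccumDict(x=numpy.array([1, 2]))
    res2 = AccumDict(x=numpy.array([10, 20]), y=numpy.array([3]))
    print((res1 + res2)['x'])  # [11 22]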
So now, very quickly, I want to give you some of the lessons that we learned. One very important lesson is that you have to do the real calculation. A small example is good, is nice, but when you scale up you will find problems; so always think twice before believing a trivial example. For instance, in this example the database import is totally different if you start from an empty database. Here I was starting from an empty database, but if you start with a database already filled with millions of rows, the performance is very different. It also depends on the connections: on my laptop I have four processes, so four connections to Postgres, but there is a limit to the number of connections that Postgres can take. You can raise the limit, you can tune it, you can configure it; but in the past we ran over that limit, and the computation just did not work. And even if you are under the limit, it will work, but maybe there is context switching, contention on resources, and things immediately become slow. So don't trust the small example. This is the first thing.

Another very important lesson: memory. When you unpickle things you need a lot of memory, and it is difficult to measure the memory you spend in pickling and unpickling, so be careful about that. The best advice is to use NumPy arrays everywhere, especially when you have to transfer things: transfer NumPy arrays. Another surprising thing is that running out of memory is, in some sense, good. I can have an algorithm which does not run out of memory but takes forever: my scientist runs the calculation, and it takes forever; one week, two weeks, one month, we don't know what's happening. It's a lot better if it dies after one hour, so the scientist looks at the log file, sees that a parameter was wrong, has a plan B, fixes it. It's important that you give information early.

I also discovered that a wrong algorithm, if simple, is better than a correct algorithm which is complex and slow. Because if you have something simple and fast, even if it is wrong, it's simple, so it's easy to fix; but if you have something complex and slow, precisely because it's so complex you don't know what to do. At the beginning I worked like this: I would make some changes, then fix all of my tests, spending hours or days to fix all of my tests; then I would put the change, not in production, but in testing, and discover that it worked on my machine but not on the cluster. Now I do it differently: I implement the algorithm, but I do not make all the tests green, so I know it's wrong in some cases, but I don't care; I put it in testing as soon as possible and I look at the performance. If the performance is good, then I fix all the other tests and continue; if the performance is bad, I change approach: it did not work, let's see how else to do it.

Other things: it's very good to have a concurrency layer which is independent of the parallelization technology, because you want to experiment; you want to see what happens if you use threads, what happens if you remove the parallelism entirely. That's important. Also, we have a very simple approach to concurrency, which is just this map. I used to have something more complicated, but last year I went to a high performance computing training, and I saw that they were only using maps; so I said: in reality, we only need maps. And I removed the extra features. And if you have problems with data transfer, as we have, then maybe instead of sending the data across the wire it is better to read them from a shared file system; for us it was a lot better. Also, when you read from a shared file system, you read the arrays directly from the HDF5 file, so you don't need to unpickle anything; reading that way is much better.
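For instance, with the h5py package, a worker on a shared file system can read just its slice of an array directly from the HDF5 file, with no pickling involved; the file and dataset names here are made up:

    import h5py

    def read_chunk(start, stop):
        # each worker opens the shared file read-only and reads only its slice;
        # 'calc1.hdf5' and 'gmfs' are hypothetical names for the example
        with h5py.File('calc1.hdf5', 'r') as f:
            return f['gmfs'][start:stop]  # a plain NumPy array, no unpickling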
On the other hand, you don't want to write directly from the workers, because that way you can easily get bottlenecks. And another thing that was very surprising to me: I thought that, doing high performance computing, I would be using the Python profiler all the time. Instead, I actually use the Python profiler once or twice per year. Sometimes you need it: one month ago I had a problem where a small change made everything slow, and I had to use the Python profiler to discover why. But in reality, 99% of the time I don't use the Python profiler: the problems have more to do with data transfer, with the way the arrays are stored, with how to handle large arrays; and this kind of information I get from the instrumentation that we wrote. So that's the final lesson, and I'm finishing with it: think about instrumenting your system. Write your own instrumentation, because only you know where the right place is to collect the information, and which information you need. That's all, thanks.

Thanks, Michele. Any more questions? Give him a big hand. Don't be scared; I think somebody here is scared. We will be available at lunch: if you see me, or people with this kind of t-shirt, ask us. And even at the coffee break, but only until 4; so if you have a question, ask him now.