Okay, thanks. Let me start with a broader introduction about why I am here today. Initially the idea was for me to give you an overview of trends in high-performance computing. Then I looked at the program of the school, and I saw that at some point you will have Professor Michele Parrinello, and afterwards Thomas Schulthess, who is the director of one of the main high-performance computing centres in Europe. And I thought: these people will definitely give you the trends. Michele Parrinello will talk more about the scientific challenges and the perspectives in computational science that you will run into over the next five or ten years, and I'm quite confident Thomas Schulthess will go into the technology trends, because that is mostly his field. So I said to myself: let's do some basics instead. Let's focus on the aspects of technology that have the biggest impact on software nowadays, and give you the foundations of what is at the base of a parallel application. I see some known faces here who have been to some of our programs before, but I think many of you will benefit from today's lecture, and especially from the lab session, where I will show you the basic programming paradigms behind the parallel applications that I think most of you are already using.

Why would a person like me, I wouldn't say an expert, let's say just experienced after the last ten years working in high-performance computing centres, be working in a centre for theoretical physics? The fact is that, as we will see today, the complexity of the technology and of the software packages is getting bigger and bigger, and so is the gap between theoretical scientists, who need the use of computers for science like oxygen, and the technological background needed to master today's state of the art in software packages. It requires people to be trained and exposed to topics that are not so common in general courses for computational science. In fact, one of my main roles here is to develop an educational profile: one part is to assist the community in accessing parallel systems, but the other, which is actually about 80% of my time today, is to develop and deliver short-term and long-term educational programs on parallel computing, parallel programming and performance.

Okay, so we are here in a course on computational science, so everybody here should know the answer to this question: why do we use computers in science? This is actually a historical issue. All three main pillars here are reasonable answers to that question, and if you want we can also add one I don't mention here: visualization can be another pillar of the answer. So, definitely, to solve complex problems that we couldn't solve otherwise; and to be able to predict
events that we would otherwise have to wait years for, or that would be impossible to study at all. And then, more recently, this last pillar has grown a lot: computing in science is more and more applied to evaluate models and to assess their correctness.

Now, since this is a computational school, it is trivial to understand why we need computers in science. What is not always trivial is why people, and recently all the politics and a lot of the fundraising, have moved from talking about scientific computing to talking about high-performance computing. It is important to understand the difference. Scientific computing is any time I use a computer for science. High-performance computing, even though nowadays a lot of different communities give a lot of different answers to what it is, I refer to as any time I want to make something faster, any time I run into the need to reduce the time of my simulation. So HPC can definitely be done on a workstation, HPC can be done on a desktop, and HPC can be done on a laptop or a smartphone. HPC, for me, and this is actually the way I will take through this whole session of lectures today, is about understanding which technological issues determine whether code runs fast, and which technological issues programmers deal with to deliver efficient programs.

And then, of course, HPC can be done on supercomputers, the most powerful platforms worldwide, which allow us to take on scientific challenges that would not be possible otherwise, for reasons of memory, simulation time or storage. But we don't really need a very exotic or massively parallel system to get faster results compared to what we are used to. We definitely do high-performance computing on Linux clusters like the one we have at ICTP, and the one we have recently deployed together with SISSA. We can also do high-performance computing on grids and clouds: if we target speeding up the time to solution even on those kinds of platforms, I think we are doing high-performance computing there too. So in general I would read high-performance computing as high-productivity computing: something I do in order to get faster results, or to exploit more efficiently the technology I have available.

And now here we come to the real question: why should this matter to you? Usually I am in contact with a lot of, for example, PhD students who have been given a small system to study, and they do it on their own desktop, or let's say they access a cluster and use only one node. And the feedback I usually get from these people is that they don't really see why they would need to go faster.
But maybe two or three weeks later, or a month or two later, the same person comes to me desperate, because he needs results for the week after and has no idea how to get them. Just yesterday it happened that one of the scientists at ICTP came to me desperate because he had developed a program on his own, a serial program, and he needs results for a conference next week. He was running a simulation that takes five or six days on his desktop. We have 2500 cores available at ICTP, so it was just a matter of applying some of the generic rules of parallel programming to what he had recently developed, and he wouldn't have been in this situation. Besides the fact that after three days of simulation his desktop crashed, so he won't be able to complete the simulation for next week anyway.

So at some point it is likely that, even in theoretical science, you will run into the issue of needing faster results, or of simply scaling your problem size up to the point where the problem no longer fits the technology you have available. Then, if you are lucky, you are at ICTP, where the IT people always try to accommodate the requests of scientists, and you might end up with a 64-gigabyte desktop, 64 gigabytes of RAM in a desktop, which is kind of crazy considering that we have hundreds of nodes available here between ICTP and SISSA that already have much more memory than that. But you know, sometimes people need immediate results, immediate feedback, and so we buy memory and we bring the memory to the office, when it should actually be the other way around: the scientist should move to the platform that is made for that. So it is important for us to do a little dissemination of these concepts, to, I don't want to say educate, but at least inform people about the main aspects and side effects of all this.

Problems are also becoming more complex, as I was saying before. But on the other hand, as we will actually see today, it is not only a matter of bigger problems or faster time to solution. Nowadays we are also at the point where, if we don't improve our software systems, we risk getting slower and slower times to solution as the technology evolves, because our software is no longer able to run efficiently on the current generation of computers. Even more: we will reach the point, and a number of scientific communities are already there, where the essential tool of our research is no longer suitable for running on platforms based on devices that need to be programmed differently.
Okay. On the other hand, I think HPC should matter in science nowadays, or at least these general concepts should, also because they represent a big opportunity: for visibility, why not for finding a job tomorrow, for having an additional card to play compared with colleagues who don't know these aspects. Basically, if you look at the most important computational groups worldwide, almost all of them nowadays have a figure within the group who is an expert in, or who supports the group on, software development and the technological side.

All right, so let's start. Basically, everything got more complicated when computer vendors had to change the business model for providing the next generation of computers. Up to a given point, the deal for vendors, for computer manufacturers, was to design a processor, shrink the features on the silicon wafer inside the chip, and deliver processors at higher frequency. Higher-frequency processors were very appealing to the market, because for the consumer higher frequency was equivalent to faster results, faster simulations. So people were attracted to buy each new generation of computers, because it would be built for higher frequency, and higher frequency meant faster time to solution.

At some point this business model broke, for a simple reason, and you can teach me the physics here: the physics of the transistor arrived at the point where, by shrinking the silicon inside the chip further, you roughly double the power consumption, the power needed to drive that particular chip. Meanwhile the so-called Moore's law kept going, the expectation that the computational power of a processor should double approximately every 18 months. So, while Moore's law continued, the builders of computers had to change strategy. They could not keep increasing the clock frequency because of the problem of power consumption, so they stopped increasing the frequency and started increasing the number of cores, of computing units, per chip.

That is how we arrived at so-called multicore systems. Any tablet in your pocket, any desktop you access here at ICTP, any computer you access nowadays has more than one core, more than one unit inside the chip that can perform operations. But on the other hand this means that, while serial programs used to run faster and faster with each new generation of processors, we have now arrived at the point where, in order to exploit the technology, to get an advantage from a new processor, we have to increase the parallelism of our application. Okay.
Yes. Okay, Moore's law basically says that the number of transistors per chip will double every 18 months, say every two years, and a consequence of this is that the computational power of a processor would also double every two years. But since the increase in the number of transistors per chip reached the point where it could no longer be turned into faster processors, because of the increase in power consumption, vendors said: okay, we limit the number of transistors per core, and we start to increase the number of cores per chip. And that is the technology we have available these days.

But it's not only this. It is also the fact that, while years ago the market trend was dominated by desktop computers and servers, things have changed. Consider that roughly one billion dollars is needed for a vendor to design a new generation of chip. You can imagine that before building a new chip these manufacturers do some market investigation to work out the target for their product. If, ten years ago, that market investigation said that the majority of computers were desktops and servers, then they were pushed to design processors for those kinds of devices. Nowadays, if someone does a market study for the next generation of processors, you can imagine there is something else that definitely dominates the market, and those devices that dominate the market today are big consumers of low-power processors, because they need to consume as little as possible so that the battery that powers them lasts as long as possible.

So here I am basically trying to summarize and give you a flavour of the fact that it is not only processor makers for smartphones that are interested in low-power processors: this trend has extended to all the main producers of processors for scientific computing, for desktops, for servers, for supercomputers. We have arrived at the point where Intel still delivers generic x86 processors, the ones you can find in all the desktops here, but at the same time it is making massive investments to be able to produce so-called many-core systems: computers based on very low-power cores, but with a massive number of them. The same goes for ARM, which is the major category of processors for smartphones and mobile phones, and basically the dominator of the computer market nowadays. We also have NVIDIA, which came along immediately from its history in computer graphics, with devices built on a massive number of low-power cores that together deliver very high computational power. But then IBM, too, is no longer interested in delivering general-purpose processors; it is actually more interested in the embedded market, and in building platforms that can host more powerful devices. And then there is AMD, which nowadays is almost out of the HPC market.
So you see, the trend of all the main device builders is going to a point where everybody is trying to build a largely parallel processor. But a largely parallel processor means largely parallel software, because if the single core reduces its frequency generation by generation, serial software will run slower generation by generation.

Okay, maybe you have never realized this, but there are actually two other historical components of a chip that require parallelism, that express parallelism. One is the fact that any computer system nowadays is able to schedule more than one instruction at a time. What does this mean? Well, consider a processor running at three gigahertz: you want this processor to deliver three billion operations per second. But each operation carries a big latency: the time to take the data from memory, move the data into the registers, perform the operation, and then move the data back to memory. If all this latency amounts to, I don't know, 10 or 20 CPU cycles, it means that instead of doing three billion operations per second you will do three billion divided by 20 operations per second. So the design itself would drastically reduce the performance of your computer. In order to hide this latency, the stages of the processor have been made independent, so that the different stages can be overlapped.

For example, here I am showing you a very basic picture of a full instruction: first you fetch the instruction, so you read it; then you decode the instruction, so that the computer can understand it; you load the data; you execute the operation; and then you store the data back. As you can see, at step 2 you can already have an overlap: while the first instruction is already in the decoding stage, the next instruction is at its first stage. All this complicated design is made so that, after n stages, you are basically able to get one result per clock cycle. This is why, if you run benchmarks, synthetic benchmarks for example, they can reach ninety or ninety-five percent of the peak efficiency of the CPU: if you expect to do three billion operations per second, you will actually get, say, 2.6 billion. This mechanism inside the chip is called pipelining, and it exists to hide the latency of memory access.

Okay, but now suppose you have an instruction like this one. This instruction is very simple, and it allows us to exploit our system perfectly, because while we are incrementing a(1) we can already load the data for a(2), since we know the next instruction will operate on it.
But if you have something like this, the schema completely breaks our pipeline, and the efficiency of the system drops from ninety-five percent back to, as I said before, three billion divided by 20, because in order to load the next data I need to wait for the previous instruction to complete. This is a very simple example of data dependency, and I'm showing it to you not because I expect you to learn pipelining, but because I want you to understand how complicated it is to develop software that reduces data dependencies to a minimum. Data dependency is a main obstacle to getting efficiency out of your technology. Why? Because this kind of implicit parallelism is implemented inside the chip to increase efficiency and hide memory latency, and even simple things like a data dependency can immediately break it.

Another trick that vendors, computer manufacturers, implement nowadays to increase compute power is the so-called vector unit. Any x86 processor you might be using these days, any general-purpose processor, is capable of performing more than one operation per clock. How does this work? Basically, instead of a scalar operation like this one, when a(1) is loaded from memory, a(2), a(3) and a(4) are loaded as well. So the data is no longer one single scalar double, as you might be thinking of it, but a series of elements of the same kind that can be processed in parallel. You already have this implemented in any x86 processor nowadays; the latest generation does 16 double-precision operations per clock.

But again, this works perfectly only if I don't have data dependencies, if my algorithm in general can be organized to perform operations in this way. Consider the most trivial example, matrix multiplication, where you have to multiply one row by one column: that is a perfect schema for exploiting this kind of technology. But as soon as you have something more complicated, all this technology becomes useless. So there are components in today's technology that are fundamental for getting processing power, but it is very easy for scientific applications, which are usually made of operations more complex than a matrix multiplication, to defeat all of this.

So basically we have seen that we are nowadays living in a world where computers are getting parallel, not only because, you know, we hear about parallel systems everywhere, but because even simple processors are increasing the parallelism within them. Beyond the explicit parallelism, by which I mean the growing number of cores, you also have this intrinsic, inside parallelism, which is also fundamental to getting high performance. So basically, today there is no way to benefit from the next generation of computers if we don't think in parallel.
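To make the data-dependency point concrete, here is a minimal C sketch of my own (not the slide's code; names and sizes are arbitrary), with one loop the compiler can pipeline and vectorize and one it cannot. With GCC you can ask for a vectorization report via -fopt-info-vec:

```c
/* Sketch: independent iterations vs. a loop-carried dependency.
 * Build e.g. with: gcc -O3 -fopt-info-vec dep.c -o dep */
#include <stdio.h>
#include <stdlib.h>

#define N 10000000

int main(void) {
    double *a = malloc(N * sizeof(double));
    for (long i = 0; i < N; i++) a[i] = 1.0;

    /* Independent iterations: while a[i] is being incremented, the
     * hardware can already load a[i+1] -- the pipeline stays full
     * and the compiler can use the vector unit. */
    for (long i = 0; i < N; i++)
        a[i] = a[i] + 1.0;

    /* Loop-carried dependency: each iteration needs the result of
     * the previous one, so loads cannot be issued ahead; the
     * pipeline stalls and the compiler will typically refuse to
     * vectorize this loop. */
    for (long i = 1; i < N; i++)
        a[i] = a[i] + a[i - 1];

    printf("%f\n", a[N - 1]);   /* keep the result alive */
    free(a);
    return 0;
}
```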
And here we come to the third pillar, which is just as important. So far I have always been talking about what makes operations fast, what makes the computer fast in terms of how many operations per clock we can do. But in order to push that number as high as possible, we need to keep our CPU fed with data. This would all be fine, except for the fact that access to data in main memory is one of the most expensive operations we have within a chip, both in terms of latency, the time before the data is inside the chip and we can perform the operation, and in terms of power consumption.

So engineers have invented layers of memory that should reduce the time to access data in memory. Again, this is true, this works, but it works only in some specific cases, and if we fall outside those cases we run into the problem where we always need to access data in main memory, and we pay a lot of CPU cycles every time we access one data item. What are those cases? Well, basically these levels of memory work in a way that, when a data item is retrieved from main memory, the nearby elements are also retrieved, also moved from main memory into intermediate, very fast memory levels. This is done because engineers bet on the fact that, if you load a data item from memory, it is likely that you will use the next element very soon, and also likely that you will reuse the same element very soon. So if at time t1 loading a(1) costs us, say, a hundred nanoseconds, then loading a(2), which sits next to a(1) but has already been brought into the deeper, faster levels of memory, can cost us as little as one nanosecond. After one load, if we are able to work through a number of contiguous elements and reuse them, the subsequent accesses to memory will cost us almost nothing.

Again, if we look at matrix multiplication, where we have to multiply a row by a column, this is the perfect schema: we multiply a(1,1) times b(1,1), and then we add a(1,2) times b(2,1), and so on, walking through contiguous data. But suppose instead you have random access to memory, for example particles that are all stored in different places: in those cases every access to memory will cost you the full price, and your performance will be drastically reduced. So it is important that algorithms are designed in a way that exploits what is called data locality, so that data is reused as much as we can, and at the same time we should write algorithms and code that perform operations that can be vectorized, as I said before, so that we perform parallel operations on a given set of data.

And then there is another technological issue, which is a trend: how CPU power is increasing with respect to memory bandwidth, so how fast, if I go back here, I can fill my computation with data. The processor is becoming a Ferrari that needs a lot of data, a lot of fuel, to deliver its maximum computational power, but at the same time the channel to fuel this Ferrari is no longer adequate, and the gap keeps increasing.
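Data locality is easy to demonstrate. Below is a small C sketch, again my own illustration: the same matrix multiplication written with two loop orders. In C, rows are stored contiguously, so the i-k-j order walks cache lines sequentially, while the i-j-k order strides through B column by column; on most machines the second version runs several times faster with exactly the same arithmetic:

```c
/* Sketch: the effect of data locality on matrix multiplication. */
#include <stdio.h>
#include <time.h>

#define N 512

static double A[N][N], B[N][N], C1[N][N], C2[N][N];

static double seconds(void) { return (double)clock() / CLOCKS_PER_SEC; }

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { A[i][j] = 1.0; B[i][j] = 2.0; }

    /* i-j-k order: the inner loop reads B by column, i.e. with a
     * stride of N doubles, so nearly every access misses the cache. */
    double t = seconds();
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                C1[i][j] += A[i][k] * B[k][j];
    printf("ijk order: %.2f s\n", seconds() - t);

    /* i-k-j order: the inner loop walks B and C row-wise, stride 1,
     * so each cache line brought in from memory is fully reused --
     * the data locality discussed above. */
    t = seconds();
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++)
            for (int j = 0; j < N; j++)
                C2[i][j] += A[i][k] * B[k][j];
    printf("ikj order: %.2f s\n", seconds() - t);
    return 0;
}
```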
Okay, this is what this picture is about: it shows that while processors keep their trajectory of growth in computational power, memory bandwidth, the rate at which we can feed these computers, is not increasing at the same pace. Moreover, nowadays, with these multi-threaded, multi-core processors and a number of processing units all working together, the memory bandwidth is shared among all of them, so the memory bandwidth I end up with per process is even lower when I fully load the processor.

And here I come to what I was just mentioning. It is not only that memory bandwidth is growing gradually, much more slowly than computational power. At the same time we are building designs like this one, the representation of a dual-socket node like the ones we have on our cluster, where we have two different processors that see the same memory. But as you can see, even if this is a shared-memory system, the time for this processor to access this part of the memory is not the same as to access that one, because the path is much longer. And at the same time you have the fact that the cores share layers of memory, all that caching I was describing before, the fact that when you load one data item it can then be reused. For this system to work, it is essential that there are no conflicts in data access among all these processes. In fact you can see the complexity of a simple dual-socket x86 system we have nowadays, in terms of how many cores we have, how many layers of memory, and the way the memory is shared: at this level the memory is shared per socket, at this level it is shared per chip. A really big complexity for the programmer. So it's not only the power crisis; it's actually the programming crisis as well.

And then, of course, everything becomes even more complicated if we think of parallel systems, where we have a single chip, or let's say a single motherboard, connected to a number of other motherboards through a network. This is the representation of a Linux cluster like the one we have available here, and this is how it looks in a big data centre, all of this plugged together.

So, what determines performance? What has a drastic impact on it? First, how fast my CPU can work: how many operations I can produce per cycle, and how many cycles I have per second. At the same time, if I want to reach that goal, I need to move data very fast from memory to the CPU, because if every time I have to pay the full path to get data from main memory, my CPU will just run extremely slowly. And then, finally, the performance, the time to my solution, also depends on how well we can parallelize our problem, on how we can express the parallelism of our problem.
We have to divide our problem into pieces, and the more our problem can be divided into pieces that are as independent of each other as possible, the more performance I can get out of the system. If my problem is not easily parallelizable, if it is not really parallelizable, then at some point I will run into problems of scalability. And if I consider a complex problem, where I have different algorithms that have to scale together, like most of the applications you will be dealing with these days, then it becomes even more complicated to understand how to parallelize, and how far to increase the amount of computational power on which I want to run my system, because the different parts will have different scaling issues. I will show you a picture about this that should clarify it.

And then, in the end, even if I have the most powerful machine and the most embarrassingly parallel algorithm, which means that it can scale without any problem because every task is independent, I always have to respect Amdahl's law, which says that there is always a fraction of my simulation that is serialized. So even if I can parallelize 95 percent of my application perfectly, I won't go faster than 20 times, because if I leave 5 percent of my application out of the parallelization, that is time I cannot reduce, and in the end my speedup reaches at most a factor of 20. You can do the math and see it; I will come back to this with the formula in a moment.

Okay, there are many different ways of thinking in parallel. The most used are these two that I am trying to represent here. The one you are commonly dealing with, any time you run, I don't know, CP2K or CPMD or any other molecular dynamics code, is data parallelism: you have a data set, this data set is divided among different processing units, and these processing units work together on the same data set to solve one problem. But there is another way of thinking in parallel, where instead of dividing my data set into pieces, I divide my operations into tasks and distribute them among the processing units. In this case I have different processing units solving different problems, but in the end they all work in concert to reach the goal of one single simulation.

So how is all this programmed nowadays? The basic idea, the foundation, is to think about our friend here, a multi-socket CPU. If we look at this, we are basically in front of a 12-core system which shares memory between the two different CPUs, and with this in mind I have to think about how I can program it. (You wanted to say something? No? Okay. How much time do I have, until 10? Okay.) So let's see how a program runs on a system like this, and how our programs are normally developed to be able to run on it.
There are two main philosophies, which depend on how the memory is considered. If we think of the system as one block, and we ssh into it on a Linux system and do cat /proc/cpuinfo, we see 12 cores, and we see a memory capacity equivalent to the overall memory available on the motherboard. So let's suppose we have, I don't know, 32 gigabytes of shared memory, which would be 16 per socket, and 12 cores in total. How do we program this?

Well, one main philosophy is shared memory. The shared-memory philosophy means that we look at the system as a single system, and we have one single process that is divided into smaller pieces called threads. I don't know if you have ever heard of this, but think of them as children, other flows of execution generated by one main process. Let's suppose this is our cpmd executable running on this system: at some point it is divided into pieces, each of which will run on one core. So the system is seen as a whole, and one single executable is at some point divided into small processes that run together to exploit our compute power. And this works; we'll see it today, we will experiment with multi-threaded programs. There are a number of applications and libraries, easily accessible, that are so-called multi-threaded, which means that even if you link such a library into a serial program, when you call that library to solve a given problem, the problem can be solved in parallel; it is designed to run on this kind of system.

But on the other hand this paradigm has some side effects. What are they? Well, first of all, if we work in shared memory, so if any piece of my process can access the same memory, I can have conflicts: I cannot write a memory location while another thread is reading it. So if at some point two tasks of this process, two threads, want to access the same memory area, a so-called race condition, I have to handle that. I need to be very careful that I don't have multiple threads stepping on the same memory area.

At the same time, I am limited in scalability: if I have 32 gigabytes of memory available, then even if my code would allow me to scale further, the size of my problem in this case is limited to the size of the computer I have available underneath me. Even with this schema, even if I have access to resources like the room I showed you, full of racks, full of hundreds or thousands of computers, the maximum size my problem can reach is the memory available on one single node. The shared-memory paradigm, threads, works only within a single node. This is very important, because it is very common that people see that a system has hundreds of cores, but maybe those hundreds of cores are divided into ten nodes, ten different nodes, each node equipped with ten cores and connected through a network. So people log into one computer and spawn hundreds of threads, but those hundreds of threads do not run on the whole hundreds of cores available; they run at most on the cores available in one single node, so on ten cores.
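As a preview of the lab, here is a minimal shared-memory sketch in C with OpenMP, a hypothetical example of mine rather than the lab code itself. The directive splits the loop among the threads of one process, and the reduction clause is exactly how you avoid the race condition mentioned above, where all threads would otherwise update the same variable:

```c
/* Sketch: one process, many threads, one shared array.
 * Build with: gcc -O2 -fopenmp sum.c -o sum */
#include <stdio.h>
#include <omp.h>

#define N 10000000

int main(void) {
    static double a[N];
    double sum = 0.0;

    for (long i = 0; i < N; i++) a[i] = 1.0;

    /* The directive divides the iterations among threads. The
     * 'reduction' clause gives each thread a private partial sum and
     * combines them at the end, so no two threads ever write the
     * same memory location concurrently. */
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < N; i++)
        sum += a[i];

    printf("threads: %d, sum = %f\n", omp_get_max_threads(), sum);
    return 0;
}
```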
Let me make another example. Suppose we have a room full of desktops, a hundred desktops, and each desktop has one gigabyte of RAM. In principle, our problem could reach a hundred gigabytes of RAM in size. But if I develop my code to be programmed for shared memory, the maximum shared memory my process can see is the memory available on one node, so one gigabyte: even though in that case I have a hundred gigabytes of memory available, my simulation is limited to a size of at most one single gigabyte. Okay? This model is usually more trivial to implement, it is easier to program, because it is based on directives that you add into your code at loop level, and the paradigm will trivially split the loop and divide it into tasks. But it has this limitation: it doesn't scale, or rather it scales only up to the memory available on one single node. You find multi-threaded libraries everywhere nowadays, because it is much easier to get multi-threaded parallelism than the other paradigm we will see in a minute, but those implementations are limited to the memory of a single node.

On the other hand, if instead we look at the system as a distributed-memory system, then our model can scale even outside our node. How is this done? Well, instead of having one single process that at some point becomes a set of threads working together on a problem, we program different processes, different instances of cpmd, that together work to solve the problem. But in this case the memory is not shared: not all processes see the same memory; the memory of each process is independent. So having one process running on this core and another process running on that core is exactly the same as having one process running on this core and one process running on that desktop over there, as long as the two systems are connected together.

Maybe we can go back here. If I program shared memory, my program is limited to one single node. If I program distributed memory, so if I think of my data as distributed sets that ideally all together represent one full data set, but physically spread among processes, then my model can scale over the whole machine. The lab today will give you a practical example of the difference between these two models.

So here we come: how does distributed memory work? Well, basically, any parallel programming paradigm is based on how to describe communication, on how to make those different instances work together. In the end, communication in programming means someone who writes and someone else who has to read the same information. It's very similar to email.
Someone writes an email, that email goes into a box, which could be our memory, and then someone else reads that email from the box. In parallel programming this communication is implemented in memory or via the network, but it is always the same thing: one process writes and one process reads.

So how is all this implemented in distributed memory? Well, in shared memory it is trivial, because all threads see the same memory: if I write a data item in the memory address space of the process, all the other tasks can read the same data. But with different processes this cannot be done, because each process sees only its own memory. This is true for any independent process you have on your Linux system: each of them has its own memory area, and one process cannot communicate with another except by going through files or other mechanisms. And this is exactly what happens in the case of distributed memory: to make processes talk together, I need to find a way to make this communication explicit, and in the message-passing paradigm this happens through the network. One process communicates with another by sending a message over the network. Process A sends, say, the value of phi to process B by sending the data through the network; process B has to know that this data is arriving, and it will be listening on the network until the content of phi has arrived.

Okay, so this is basically implemented through the so-called MPI paradigm. MPI is nothing more than a library that you include in your code and that allows you to express this kind of parallelism. Even if on some systems you see it as a compiler, as an additional tool for compiling, what you see is nothing more than a wrapper around your compiler that allows you to compile parallel programs. In the end it is just a library: while you compile a serial code with icc or gcc, say gcc -c hello.c, if you have an MPI program, a parallel program, you could just as easily compile and link it as gcc hello.c -lmpi, adding the needed library at the end as you would do with any other library, since MPI is a library that provides the routines to make processes speak together. We will do some training on this today as well.

Okay, so what are the aspects of this programming model that you, like any programmer, need to take care of? The fact that this paradigm is based on nothing being shared. Every process is completely independent from the others: it has its own memory area, its own file pointers, its own sockets; anything related to a process is replicated among the independent processes. And of course this is the model that, as I said before, allows you to scale much further, because it allows you to go beyond the memory available on one single node.
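To make the email analogy concrete, here is a minimal message-passing sketch in C, my own example along the lines of what we will try in the lab: process 0 sends the value of phi over the network, and process 1 waits for it to arrive.

```c
/* Sketch: explicit communication with MPI.
 * Build with: mpicc phi.c -o phi   (or: gcc phi.c -lmpi)
 * Run with:   mpirun -np 2 ./phi */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank;
    double phi = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        phi = 3.14;                     /* the value to communicate */
        MPI_Send(&phi, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* blocks, listening on the network, until phi has arrived */
        MPI_Recv(&phi, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("process 1 received phi = %f\n", phi);
    }

    MPI_Finalize();
    return 0;
}
```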
Nowadays many different paradigms are coming up. Other than MPI, which I was mentioning before, which is the distributed-memory model, we have the PGAS languages, which try to build an abstraction of distributed memory as if it were shared memory; we have OpenMP, which is the way to program shared-memory systems; and then you have specific languages for the devices that are nowadays available as external accelerators to the generic processor. All these languages, all these programming models, differ among themselves essentially in the way they express the communication among the different tasks.

Okay, so here we come, and this is the point. Having said all that: we have parallelism within cores, we have parallelism within nodes, and we have parallelism across multiple nodes. In the end, to reach the best performance, our codes, our programs, have to take care of all these aspects. Otherwise, the result is this: if we have a serial program that includes nothing of what we said before, we are basically at this level of performance. If we have a serialized program that can at least exploit the vector unit, so that it performs multiple operations at a time, we can reach this level. Very similar is the case where we cannot exploit the vector unit, cannot perform multiple operations at the same time, but we are able to parallelize the program: we arrive at more or less the same level of performance, because the number of cores per chip is roughly equivalent to the number of operations each core can do per clock. But in the end, if we want very high efficiency, we need to parallelize our codes, and parallelize them in a way that also hides memory latency and performs operations in parallel. And you see that between generic programming and the goal we have to reach there is a gap of about two orders of magnitude nowadays: if you consider a 20-core system able to perform 16 operations per clock, that means 20 times 16, which is an important speedup factor compared with a serialized program.

Okay, why is Amdahl's law important, and why must we consider it when we run in parallel? Because it is very common, when you run a program in parallel, to expect that program to give you faster time to solution, but it is also very common that, as you increase the number of processes, your time to solution, instead of scaling down, actually goes up. I regularly see people with this kind of problem. If this axis is the number of cores, or of processes, and this is the time, people quite often come to me and ask: why am I getting this? The time to solution scales down as the number of processes increases, up to an optimal point, and then it actually starts to go back up.

Well, first consider the case where parallelism ideally brings no overhead. Even then, Amdahl's law applies: in the best case I reduce the time of the parallel region down to zero, and I am still left with the so-called serialized part, the part that is not parallelizable. That is the ideal case.
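Doing the math from before, this is Amdahl's law in formula form, with p the parallelizable fraction of the runtime and n the number of processes:

```latex
% Amdahl's law: the speedup is bounded by the serial fraction (1 - p).
\[
  S(n) \;=\; \frac{1}{(1-p) + p/n},
  \qquad
  \lim_{n \to \infty} S(n) \;=\; \frac{1}{1-p}.
\]
% With p = 0.95, even with infinitely many cores:
\[
  S_{\max} \;=\; \frac{1}{1 - 0.95} \;=\; 20.
\]
```

So with 95 percent of the application parallelized, no matter how many cores you add, the speedup saturates at the factor of 20 quoted earlier.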
So even in the ideal world, you won't get that part down to zero: because of Amdahl's law, there is always a part of the program that I cannot parallelize. But at the same time we don't live in an ideal world; we live in the real world, and parallelism is not only something I can express to reduce my time, it is something that brings overhead, to the point where the overhead, instead of bringing me an advantage, brings me a disadvantage.

Look at this case. This is the profile of a generic application, where I divided the problem into pieces and tried to understand how the different pieces behave as the number of processes increases. You see, for example, that the blue one is perfect, ideal: the blue scales perfectly with the number of cores, so on its own the blue would give me a very good curve in this sense. But what I am showing here is how the simulation time is divided, in percentage, as the number of cores increases. So while the blue line represents a perfect algorithm for me, because it scales very well, on the other hand I have this yellow part which, as the number of processes increases, is actually costing me more and more within my application. Now, since I am showing the percentage of time spent, not the absolute time, it could well be that my time to simulation is still decreasing; but the yellow part will reach the point where it takes one hundred percent of my simulation at some scale, so I reach a limit of scalability in that case.

But I also have problems that I cannot see here, which I can see in a different chart, like this one, where part of the simulation, for example this green one here, even if it is barely visible, not only limits my scaling but actually, if I keep increasing the number of cores, increases my time to simulation, because the overhead of communication between the processes brings me a higher time to solution, not a faster one. Indeed, in this case I arrive at the point where my curve explodes upward.

Here I am showing you a Car-Parrinello simulation performed at Cineca by Carlo Cavazzoni and colleagues; you might know them. It is a profile of scaling up to thousands of cores, and you see that even though this application appears to scale very well, at some point I reach the limit where my algorithms don't scale anymore. And there is another important point here: all these coloured pieces represent different routines, and different routines represent different problems, I don't know, orthonormalization, FFT, update of the coordinates, and so on. And all these very different tasks, all these very different problems, do not scale in the same way. Some scale dramatically well, like the blue; but others scale completely differently: this green one, for example, stops scaling already at this stage.
Okay, so it is really common in today's applications, which are composed of different algorithms, of solutions to different problems, that the components of my application do not scale in the same way. So it takes an effort, when considering scaling, to understand what the best operating point of my system is, the best level in terms of scaling.

Okay. Then, with the increase in computational power available these days, there are also many ways to express parallelism. It is not only one single problem that is parallelized for faster time to solution; the available computational power actually gives us the opportunity to express many aspects of parallelism. There is an easy way of expressing parallelism called farming, in jargon, which basically means running millions of instances of the same process on different data sets: not one data set divided into pieces, but different data sets. This is very much used, for example, in high-energy physics, and also in quantum Monte Carlo. While this kind of parallel expression wasn't really common years ago, nowadays, with hundreds of processors available, easily accessible, it is starting to become an important way of using parallel machines. Then there is the case of people who run the same parallel system multiple times, to verify the quality of their model or basically just to pick the best combination of their input data set. And there is parameter-space exploration, which is also a very commonly applied way to fill all this available compute power. So it is no longer only one single parallel system; sometimes it is a combination, a framework of software working in concert to express this kind of potential parallelism.

Okay. Of course, everything gets even more complicated when we consider that, on top of everything I have been saying in this last hour, we add external devices such as accelerators. Why does this become even more complicated? Well, because you normally have different parallel languages to program these systems, and not only does the programming itself become quite complicated, but we also run into the problem that, these devices being completely separate and working completely differently from our generic processor, we have to handle all the memory transfers and all the computation on those external devices. And when I say handle, I don't mean only as a programmer: even as a user it becomes really complicated to use these systems efficiently. Sometimes because the programs themselves are not efficient, but even if you have a very efficient program for an accelerator, as a user it is very complicated to get performance out of it. Let's see why.

Basically, this is the general concept behind accelerated computing, which is very similar to the shared-memory paradigm: I have a program, I don't know, a serial program or an already parallel program, I have a section of that program which is extremely parallelizable, and I want to move it onto an even faster device, which I identify as an accelerator.
What is the generic schema behind it, logically? I mean, people also program it, but it is important to understand it logically, as users too. We need to copy the data from the CPU side, where the data initially live, because sometimes they come from disk, input data for example; we move them onto the accelerator device; we perform our computation on the accelerator, which is very fast; and then after that we move the data back.

First of all, this schema already makes one thing very evident: parallelism brings overhead. Even if this device is fast, and even if my algorithm is perfectly parallelizable and the time to solution of that part of my problem goes from 50 percent down to zero, I still need to pay the time to move the data back and forth, which I didn't have to pay before. This is an expression of how parallelism can bring overhead: in order to exploit the parallelism of this device, I have to bring into my problem the overhead of moving data back and forth to an external device, an overhead I didn't have before. And I can arrive at the point where that overhead will cost me much more than what I actually gain by going to that device.

Moreover, even if the data transfer between the external device and the CPU were extremely fast, so that I could reduce the overhead, there is still the fact that the channel between the external device and my CPU is the slowest channel I have in my system. So when I talk about accelerators, about whatever way of accelerating my computation outside my processor, people have to keep in mind that moving the data from the chip to the accelerator has to go through one of the slowest channels available. So it is not only an overhead because of the model; it is also an overhead because the technology I have for moving data back and forth is extremely slow compared with the others, by orders of magnitude.

Nowadays we arrive at up to 12 or 16 gigabytes of memory on modern GPU devices; the ones we have here at ICTP and SISSA have 6 gigabytes. And this is actually another limitation: on an accelerator you usually have a lower memory capacity than what you have on the host, so you cannot simply move your whole data set back and forth.
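For concreteness, here is one way the copy-in / compute / copy-out schema can look in code. I am using OpenACC directives in C as my own illustration, not the language the slides assume; the logical pattern is the same in CUDA or OpenCL. The data clauses are where the PCIe overhead discussed above is paid:

```c
/* Sketch: offloading a parallel section to an accelerator.
 * Build e.g. with: pgcc -acc saxpy.c    or: gcc -fopenacc saxpy.c */
#include <stdio.h>

#define N 10000000

int main(void) {
    static float x[N], y[N];
    for (long i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    /* copyin(x): pay the host -> device transfer for x;
     * copy(y):   pay the transfer of y in both directions.
     * These transfers are the overhead that did not exist before,
     * and they cross the slowest channel in the system. */
    #pragma acc data copyin(x[0:N]) copy(y[0:N])
    {
        /* the extremely parallelizable section, run on the device */
        #pragma acc parallel loop
        for (long i = 0; i < N; i++)
            y[i] = 2.0f * x[i] + y[i];
    }

    printf("y[0] = %f\n", y[0]);
    return 0;
}
```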
So now we come to a few conclusions. Why is everybody speaking about accelerators today? Because it is a promising market. Why is an accelerator promising for users? Because it is a very scalable device, it has very high-speed memory inside, and, fairly speaking, it is a low-power device compared with the compute power it can deliver. But we are not interested in why accelerators speed simulations up; we are interested in why accelerators do not speed applications up, because any time we try to run or perform something on an accelerator, we actually see that we don't benefit from it as much as we expected.

First of all, because we have the PCI bottleneck, which, as you know, is the slowest channel for moving data. Then, an accelerator represents massive computational power because of the large number of low-power cores inside it; but as soon as I move from a massively parallel algorithm to something that doesn't scale, and as I showed you before, within today's software there are various components that do not all scale in the same way, things change. I might benefit a lot from moving one part of my code, of my software, onto the accelerator, maybe also because I have less data to move; but it is likely that I will arrive at a point in my simulation where the problem I have to solve is no longer suitable for that device, and if I moved all my data onto the device, I now have to move my data back, and so on. And then people tend to forget, really forget, Amdahl's law: again, there is always a portion of my code that cannot exploit the overall compute power available, so it will never let me scale indefinitely.

So consider this approach: we hand our software to a very good developer, who identifies that 50 percent of my code can run on the accelerator, and who manages to reduce the time of that particular section to zero. My simulation as a whole still won't run more than twice as fast, because even if he reduced that 50 percent to zero, I still have the other 50 percent that doesn't benefit from the accelerator. So when you hear about ten times, twenty times, a hundred times speedup, that is all story-telling, or it has to be the outcome of an unfair comparison. If you do a fair comparison, if you consider complex software, not something you run synthetically on the accelerator to show it is faster than a CPU, but complex software that has some serialized parts and some algorithms that scale more than others, it is very, very difficult to find a section that can be moved to an external device and give you a massive overall speedup.

Now, the problem is that people nowadays are misled by the marketing. When I started this job ten years ago, companies would pay a lot of money to get five or ten percent faster time to simulation, because if you spend five million dollars a year on compute power, a ten percent reduction means half a million dollars. Okay, do we agree on this? Before this marketing wave of accelerated systems brought people to think that something magic is possible, a five, ten, fifteen, twenty percent speedup was considered a new era. So the message is not that accelerators are a waste of time, and it is not that accelerators are a fake technology. The message is that, if you consider all these aspects I have tried to mention today, you realize that when people speak about 10, 20, 100 times speedup, something is wrong in the equation.

Moreover, why is it so complex to use GPUs, let's say accelerators, and so complex to get performance out of them?
Moreover, why is it so complex to use GPUs, and accelerators in general, and why is it so complex to get performance out of them? Well, suppose we have an accelerator capable of, I don't know, I'm just throwing out numbers, three teraflops of double-precision operations per second. If we plug that accelerator into a computer that is able to deliver, say, ten gigaflops, so roughly three orders of magnitude less, then it is likely that I will benefit from that accelerator: if I identify a big portion of my code that can run on it, then, Amdahl's law notwithstanding, the time to simulation can in the end be reduced effectively. But if I consider today's modern systems, which are generically not one or two orders of magnitude slower, but one or two times slower than what an accelerator can deliver, then I have to consider the fact that even if I move all my code onto the accelerator, I am wasting a big computational power available on my CPU platform; up to the point where sometimes I should think that, instead of buying an accelerator, I should buy another CPU platform, which I can use more generically and exploit in the same way.

So, in order to get power out of an accelerated node, programmers have to think of a way to benefit from both the CPU and the accelerator platform, because if we discard one or the other, we are wasting a big portion of the computational power available. And this is exactly what happens when running applications on accelerators: to get performance, it becomes important for the user to be able to balance: what am I doing on the CPU? What am I doing on the device? How can I make them work together without running into load imbalance?

Here are some examples. I am taking them from the PW code of Quantum ESPRESSO, because it is the one I was most involved with, but it is the same for any other code, LAMMPS, GROMACS, whatever code is being enabled on accelerators nowadays. Suppose, as I said, we run only one process on the CPU side, and this process goes to the accelerator and parallelizes perfectly there. We are anyway wasting our CPU platform; in fact, this is the case we have here.

At the same time, I can say: OK, I don't want to waste my CPU platform, so what I do is spawn as many processes as I can on the CPU side, and then I make them all go to the accelerator together. But in that case I run into another problem: I am not only accessing the slowest channel I have on my system, I am also creating a lot of contention on that channel. It is like having an island connected to the mainland: I know that on Sunday there is big traffic of tourists going to that island to reach the beach; I have, say, several bridges available, and then I close all the fastest bridges and leave open only the slowest one, and I have all these millions of people trying to get across it. Big mess. In fact, this is exactly what happens in this case: if you force too many processes on the CPU side, you will have conflicts in accessing this channel, as the sketch below illustrates.
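A hedged sketch of how this contention arises, assuming an MPI code whose ranks all offload through the CUDA runtime; the rank-to-device mapping is a common idiom, and nothing here is taken from the lecture's slides.

```c
/* Each MPI rank selects a GPU; when ranks outnumber devices, several
   ranks funnel their transfers through the same PCIe link at once. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    int rank, ndev;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaGetDeviceCount(&ndev);

    /* With, say, 16 ranks and 1 GPU, all 16 ranks share one device:
       every transfer they issue queues up on the same slow channel. */
    cudaSetDevice(rank % ndev);

    /* ... allocate, copy in, launch kernels, copy out, exactly as in
       the earlier offload sketch ... */

    MPI_Finalize();
    return 0;
}
```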
OK, so what is the ideal case? Well, the ideal case, I honestly don't know what it is in general. That is why, as I was mentioning earlier, to get performance some preliminary benchmarking is sometimes needed to balance all of this. This was not needed at all in the years when I had only one processor with one high frequency. Nowadays I have a lot of components that work together to get the same result, and if I have a lot of components working together to get the same result, I am increasing the complexity of the system and of the software, and I am very likely to miss the load balance that gives me the perfect outcome.

So what I do here, basically, is mix the two paradigms I was presenting before. Instead of having one single process, which I know does not work well because it does not exploit the CPU side, and instead of having n processes, where n is the number of cores, which creates congestion in accessing the channel, the ideal case is that I reduce the number of processes to a minimum, and then each of those processes becomes a multi-threaded process.

So basically here I have two processes. You see the difference: this is one domain split into independent processes, the distributed-memory model, where each piece of the domain is not visible to the other processes unless you go through the message passing I was describing before. Then I have the model where I pack my whole system into one single process, but then I waste CPU time. Or I have the way where I can say: OK, I use a mixed model. I reduce the number of processes, but I exploit the CPU anyway, because I thread, so my cores don't sleep; and at the same time I reduce the traffic over the bridge to the island where all the magic happens. So even as users, all these considerations have to be made in order to find the right balance; a minimal sketch of the mixed model follows below.
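Here is a minimal sketch of that mixed model in C, assuming the usual MPI-plus-OpenMP combination (the lecture does not prescribe specific libraries, so take this as one common realization):

```c
/* Hybrid model: a few MPI processes, each running OpenMP threads,
   instead of one MPI process per core. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;

    /* Ask for threaded MPI: each process will host OpenMP threads. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Few processes -> less contention on the slow channel;
       the threads keep all the CPU cores busy within each process. */
    #pragma omp parallel
    {
        printf("rank %d, thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}
```

Launched, for example, as `OMP_NUM_THREADS=8 mpirun -np 2 ./hybrid` on a 16-core node, it keeps every core busy while only two processes compete for the channel to the accelerator.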
In the last two or three years, computer scientists, HPC centers, anyone working in this area, have been working a lot to find ways to reduce this possibility of load imbalance through auto-tuning, that is, finding an approach that is generic for this kind of system. But at the same time we live in an era where this technology changes every year, and the software available for this technology changes every year, so it is almost impossible for software developers to find a solution that does not move into the hands of the user the responsibility of balancing these things in order to get performance. Of course, if you don't care about performance, then everything gets easier; but here I am talking about HPC systems.

So, having seen all that, and having tried to give you a logical overview of what is in the background, you can now see that we have reached quite an important level of complexity in the technology, which translates immediately into a big complexity in terms of software development. In fact, today most of the applications that you know, LAMMPS, NAMD, GROMACS, Quantum ESPRESSO, VASP, whatnot, offer solutions based on shared-memory programming, distributed-memory programming, a mix of the two, possibly accelerators; basically those software packages are built over different layers, and this makes them really complicated.

So, in fact, nowadays we have the basic implementation based on distributed memory, MPI: I have to write my program so as to tell a process, at some point, "look, communicate with that other one, because it needs your data". At a deeper level, I arrive at the point where I program shared memory, because of what I was saying before: sometimes I don't want to create as many processes as the number of cores I have, I want to reduce the number of processes to a minimum to avoid conflicts, but at the same time take advantage of the CPU power, so I go threading, and OpenMP is the generic way to express multithreading. Then, if I have an accelerator available, I need to add to my program the part for the accelerator. Then, if we think also of what I was saying before about farming and so on, we get to the fact that maybe we have a Python layer added on top to handle pre-processing and post-processing. And then, even worse if you want, we arrive at the level where we basically talk with the system administrator and say: look, I have such a massive parallel production that I also want to instrument the scheduler, or, let's say, drive my scheduler to be as efficient as possible, I don't know, maybe because I have a client-server model or whatnot. So all this complexity in the technology that I have tried to present today translates into a big complexity of the software packages for computational simulation nowadays.

OK, I am at the end; I think I am also late, but I think five minutes more. So, as I said, from the technological point of view, memory bandwidth is one of the most critical bottlenecks of modern CPU design, especially because I always need to feed a faster and more powerful core. At the same time, I need to program efficient data patterns, because otherwise, if I create data dependencies or random accesses to memory, as in the sketch below, I am not able to exploit the intrinsic parallelism of the chip and of the memory, at the different levels of the memory hierarchy.
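A small, hedged illustration of the data-pattern point (the loops and array names are invented for this example; the lecture shows no code): the first loop has independent iterations that the hardware can exploit, while the second carries a dependency from one iteration to the next.

```c
/* Which data patterns let the chip's parallelism work. */
#define N 1000000
double a[N], b[N], c[N];

void independent(void)
{
    /* Each iteration is independent: the compiler can use the
       vector units and memory accesses stream through the caches. */
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];
}

void dependent(void)
{
    /* Loop-carried dependency: iteration i needs the result of
       iteration i - 1, so this loop cannot be vectorized as-is. */
    for (int i = 1; i < N; i++)
        a[i] = a[i - 1] + b[i];
}
```

With typical optimization flags, compilers will generally report the first loop as vectorized and the second as not.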
To get the maximum performance, I also need to arrive at exploiting the internal vector units; we will see how this is possible today. And, as I said at the beginning, this is the reason why I am here, and I think this is the reason behind programs like our master in HPC: the complexity of parallel systems requires a certain technological background to master load balancing, to use all these levels, and to obtain efficiency. I mean, even if you have a very good parallel software nowadays, if you don't understand how this software works in terms of data distribution and load distribution, it is very likely that you won't get any good performance out of it, or simply that you will keep using it inefficiently.

OK, so, on the software side: people have to design algorithms that can express more parallelism and so be more suitable for the kind of massively parallel systems we are going into. Compilers also have to work very hard to be able to optimize codes so as to take advantage of all of this. And of course this complexity is destined to increase even more, because with the wide availability of massively parallel systems there is always someone who thinks of a way to exploit those systems and to push further the possibilities of simulation that we have.

So, what if you want to learn how to program all this? Well, we have activities that we organize at ICTP every year, two or three activities per year: one dedicated to parallel programming, and one dedicated only to what we call collaborative software development. We do a school of two, sometimes three, weeks on that alone, with the ambition of teaching people how they can get into the development of complex software packages. Because there is a big gap between a scientist working on his own in a lab and developing his own code, and a scientist who is open to the community, who will not only take advantage of the products of that community but will also contribute to its tools. And when you want to contribute to the tools of the community, in terms of the software packages I was mentioning before, you don't only need to know how to program: you need to know how to produce clean programs, produce modularized programs, produce possibly efficient programs, and also be able to work in concert with the others. Because if you download NAMD today, and you spend two years of your life improving it or implementing something new for the NAMD package, and then after two years you go to, I don't know, John Stone or whoever, and you tell him, "look, I have this massive improvement, please upload it to the software package", then, when he does the merge of your code into the distribution, he will realize that your code is based on a two-year-old version that is no longer compatible with what is available today. He will disregard your work, and he won't give you the possibility to include in the repository of the software package the work you have been doing for two years, because the work needed from the main developers and maintainers to bring your changes into the current generation of the software would cost more than what you spent to develop them. So we teach people how to work with software versioning, with git, and, I mean, basically how to be compliant with what the main developers of complex software packages require these days.

Then we have the school on parallel
computing and parallel programming. And then, if you want to master all of this, we nowadays also have a master in high-performance computing, an 18-month program fully dedicated to high-performance computing and parallel programming, run in collaboration with the groups of ICTP and SISSA. It is a full-time commitment program: eight-hour days for the first six months; then three months in which members of the scientific community of the two institutions explain particular scientific topics, you know, parallel algebra, parallel FFT and so on; and then at the end there are six months for a thesis and a project. That was written on the last slide, but it is gone. Thank you for your attention.