For the Human Brain Project, and today I would of course add to this, but going a little bit more specific into technologies. I will first review what our applications are and what they might require, and then go into some technology trends. I must apologize that I cannot be exhaustive in the short time that I have. I will present some scaling laws and ideas about how technology will develop, go into the question of how to increase performance in all fields, that is computing, memory and memory bandwidth, and of course increasing it by more parallelism, discussing concurrency levels, discussing some memory developments, going into unconventional architectures where I believe we have a good chance to achieve the goals that we want to achieve in simulation and analytics, and maybe comment on the technology development that we are doing at this time in the Human Brain Project in the ramp-up phase, and give an outlook on what will come.

So the, let's say, claims I start with are these. First of all, simulations of the human brain at scale, so full-scale or large simulations, will require supercomputers approaching the exascale dimension, both in compute capability and of course also in memory capability; there it will certainly be hundreds of petabytes that have to be available. On the other hand, going for analytics, and I have the example of imaging the full brain, resolving the single neurons and going down to the sub-micrometer level, this will be constrained by the progress of data processing systems. It is very much related, but it is of course also different, and we have to discuss this.

Yesterday I spoke of our co-design projects. Part of their activities is intended exactly to address those questions, so to help build up such supercomputers, and you see the keywords coming up again: the NEST code for the simulation of the mouse brain, NEST also being used for the human brain; then NEURON for mouse-based cellular cortical and sub-cortical micro-circuit models; then the atlas of structure, connectivity and function; and visuo-motor integration, that is, let's say, matching top-down theories and ideas with bottom-up simulation and robotics. So this will give us the constraints and the requirements of the applications, and let's start with a discussion of the applications.

The neural network simulations with NEST and NEURON have as their main feature that they involve a very high level of concurrency. A high level of concurrency means huge parallelism, and they involve different levels of parallelism; I would call it different levels of concurrency, and it is not yet known how to utilize computers if you have programs with different levels of concurrency. In fact, all larger programs expose different levels of concurrency, but this is something that is only taken into account on the node level or on the processor level, not on the level of, let's say, different technologies. Today you have different types of processors.
You have multi-core, you have many-core, and different directions of those processors, and I will show you in this short talk that we also want to address this issue, which has not really been addressed yet, only in a rudimentary way through the CPU-GPU paradigm, where you accelerate parts of an application but where you do not use the accelerators as, let's say, a sort of shared parallel resource for a bunch of CPUs to carry out the highly scalable parts of the code.

The codes also have a very large memory footprint, and this is important: memory seems to become one of the really limiting factors, and it is expensive if it has to have high bandwidth. They have low arithmetic intensity; that is not completely true: NEST has a lower one, NEURON a higher one, and the more detailed the knowledge about the neuron becomes, and the more detailed the physiology and other data are, the more compute-intensive the models might become. So I would take this "low" only with a grain of salt; I will come back to this balance with a small sketch below. And we have a global communication pattern, so we cannot really go for localization of both data and communication. That is an important issue here.

Architectural challenges: we have to provide suitable models of parallelism in order to define those architectures. This holds for the node level, and it holds for the aggregation of nodes. We have to provide both high memory capacity and bandwidth with minimal hierarchization; minimal in the sense that it is also a question of on which level we introduce the hierarchies; and of course we have to optimize the communication. What I have sketched now is a less good fit to the current roadmaps of the technology providers, and that is an important issue. The current roadmaps go, let's say, towards combinations of CPUs, GPUs and Xeon Phi type processors, or ARM type processors will come. So it is not clear at this stage how this will really fit, and we have to do a lot in order to bring in our issues.

One of those issues is interactive supercomputing, and it has its own requirements. It is data-intensive, interactivity and steering should allow analysis and visualization on the fly, and given the large amount of local memory, moving data is simply a cost and power factor. Moving data means investing a lot of energy, and we have to avoid it by all means. So the mode of operation is probably also changing: users will run multiple jobs concurrently in a single session, large-scale simulation jobs together with data analysis jobs together with visualization pipelines, maybe on the same or on different types of computers that are connected, as well as a dynamic change of the required resources. It can be that at a certain point in the job you change the resources within the session and the resources available to it. So this will be a significant change of the operational model, again a challenge we do not yet meet.

Technical aspects: of course we have to increase the memory footprint. We need a huge memory space, first of all to enable checkpoint-restart type mechanisms, but also to cope at all with the huge memory requirements that we have. Maybe new non-volatile memory (NVM) technologies will help. This is our great hope, so that we can, let's say, boost the memory footprint by a factor of 10 to 50, which will be required for the full brain simulation; at this time it simply does not match the technology roadmap.
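To make the arithmetic-intensity point from above a little more concrete, here is a minimal roofline-style sketch. All numbers in it (the peak flop rate, the memory bandwidth, and the per-code intensities) are invented for illustration and are not measurements of NEST or NEURON.

```python
# Minimal roofline-style estimate: attainable performance is limited either by
# the peak compute rate or by memory bandwidth times arithmetic intensity.
# All numbers below are illustrative assumptions, not measured values.

def attainable_gflops(intensity_flop_per_byte, peak_gflops, bandwidth_gb_s):
    """Roofline model: min(peak, arithmetic intensity * memory bandwidth)."""
    return min(peak_gflops, intensity_flop_per_byte * bandwidth_gb_s)

peak = 1000.0   # hypothetical node peak, GFlop/s
bw = 100.0      # hypothetical memory bandwidth, GByte/s

for name, intensity in [("low-intensity code (NEST-like)", 0.2),
                        ("higher-intensity code (NEURON-like)", 15.0)]:
    perf = attainable_gflops(intensity, peak, bw)
    bound = "bandwidth" if perf < peak else "compute"
    print(f"{name}: ~{perf:.0f} GFlop/s ({bound} bound)")
```

The point of the sketch is simply that a code with low intensity never gets near the peak, no matter how many flops the device offers, which is exactly the balance problem discussed here.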
Scalable visualization has the same problem: it is constrained by lacking the ability to move all the data, so we will not move them. We will only move the visualization data out of the machine, and dynamic resource management has to be developed.

If we go to the imaging problem, let us take again the imaging of the brain atlas. Here the main feature is of course again a high level of concurrency, but of the less difficult sort: you have of course to exchange data at some stages of the process, but many of those processes can be done in, let us not call it trivial parallelism, but in parallel. And again you have architectural challenges: the efficient concurrent update of shared data structures, which is of course again connected to communication; one of the problems is that we do not yet have efficient accelerator-to-accelerator communication, and I will come back to this with my proposal. And the management of petascale data volumes is required. If you look at what data volumes can be managed in the current largest supercomputing centers or computer centers, I do not think that projects that reach up to five petabytes are easy; they would require nearly all the resources such centers have available. This means we really have to find ways to work within those environments, but here we have a good fit to the current technology. We know how to do it, but it comes at a cost, and we have made a sort of extrapolation: we think that the computing cost and the data management cost will dominate everything in this field if one wants to go down to the micrometer level.

What are the technology trends? Let me present some; the scaling laws are of course well known. Moore's law came from a cost estimate Moore did in 1965, where the evolution of the optimal manufacturing cost for integrated circuits finally results in an exponential increase of the number of components per circuit. So on the left side you have the development of the cost, with the sweet spot, as you see, at a minimum, and then there is a line where you see how it has developed. In recent years it has slowed down a little bit; we know that it is stalling. There is a reason for that, and it can best be understood through Dennard scaling. Dennard scaling was also found early, by Dennard, I think in 1974, and the idea is that you can change several parameters at constant power density as the gate length decreases. One thing you can do is increase the transistor density; this corresponds to Moore's law. Another is to increase the clock frequency; we know that this has broken down, so Dennard scaling is not followed anymore. And you can reduce the supply voltage. So you can play with all of this, and the trend, as it now appears, goes towards having more processors on one die, so we will have an increase of, let's say, cores, and we have already seen an increase of cores on the systems.

Another technology trend is expressed by Rent's rule, which relates the number of logic elements, the number of external connections and Rent's coefficient. The consequence is that it will become difficult to balance communication and computation; we will come back to this later. So the trend is that you can compute much more than you probably need for some problems, while those problems that require a very high memory rate are increasingly neglected. And that is interesting for all fields of science, especially for physics and chemistry, where we believe that the current trend is completely contrary to what we would need, namely very high memory rates at a given, let's say, high rate of computation.
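For reference, the two scaling laws just mentioned can be written down compactly; this is the standard textbook form with the usual symbols, not anything taken from the slides.

```latex
% Dennard scaling: shrink all linear dimensions by 1/\kappa (\kappa > 1)
L \to \frac{L}{\kappa}, \qquad V \to \frac{V}{\kappa}, \qquad
C \to \frac{C}{\kappa}, \qquad f \to \kappa f
\;\Longrightarrow\;
P \;\propto\; C V^{2} f \;\to\; \frac{P}{\kappa^{2}}, \qquad
\text{density} \to \kappa^{2}\,\text{density}, \qquad
\text{power density} \approx \text{const.}

% Rent's rule: external connections T of a block containing g logic elements
T = t\, g^{\,p}, \qquad 0 < p < 1 .
```

Once the supply voltage can no longer be reduced, the frequency can no longer grow at constant power, which is the breakdown referred to above; and a Rent exponent close to 1 means the wiring and communication demand grows almost as fast as the logic itself.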
So one potential strategy is a better memory hierarchy: to have a very large memory on the one hand, maybe through non-volatile types of memory, up to memory attached to the network. But essentially this is a very difficult field of research, and we will also touch on this in one of the next slides.

Concurrency itself: we have different levels of parallelism. We have the core level with instruction-level parallelism: each core can, at each clock cycle, execute multiple instructions concurrently. As you know, we have data-parallel instructions, SIMD instructions or vector units, however you want to call them, on those Intel and other chips. We have simultaneous multithreading, so that different threads of instructions can be executed simultaneously. On the processor level, we have multiple cores, we have many cores. This is an interesting issue: if we had proper many-core processors, which in my opinion we do not have, which are not provided by the vendors, then we could really play with using those on a different level of concurrency. On the node level, we have multiple sockets, we have multiple accelerators; this is partly a question of economy, maybe also of how to organize the communication in a system. On the system level, we have many nodes, but why don't we have many systems? This means, as I said before, why don't we have systems where you have, let's say, highly evolved vector cores, like for instance those provided by NEC, then other multi-cores like Intel's, and on top of that two different variants of cores, the Xeon Phi cores or the NVIDIA type of cores, in separate systems that can of course interact through highly evolved networks? I think that is one potentially important way to go to solve the exascale problems that we really have. On the device level, you see that more and more flops can be carried out, and this is another hint that there is a problem, because the data rates into the devices do not grow in the same manner as the flops they can carry out, and you have a very high level of parallelism on such systems. Sometimes, if you have problems, let's say molecular dynamics, where you do not have an infinite number of degrees of freedom to compute, you are of course restricted to a small part of the system, and it is of no use at all to have a huge parallel machine; maybe you would then be better off buying an Anton system from D. E. Shaw or so. All this is somewhat neglected by industry, because industry is not at all interested in science. They want to make money through selling systems, and sometimes that fits with our scientific needs, so we have to do something here.

Memory. The memory question is all-important, as I said before. We have the capacity-driven technologies, the DRAMs; their density scales roughly according to Moore's law, but we have only a very moderate or no increase of bandwidth. We can only increase bandwidth through parallelism again, which is limited because it is very costly, and you see the memory rates of current processor types. If you go to bandwidth-driven technology, where you want to be faster, the capacity is in turn of course limited by the cost. So, in future, to exploit this, we have to go from two-dimensional to three-dimensional stacking technologies, maybe even including processing-in-memory technologies. Again, you have some numbers here; they are not much larger than the numbers before. So here we are better, but we are still limited, and especially we are limited by a factor of 10 in capacity.
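To put rough numbers on the concurrency levels listed above, here is a back-of-the-envelope product of those levels for a hypothetical exascale-class machine. Every figure is an assumption chosen only for illustration, not the specification of any real system.

```python
# Back-of-the-envelope concurrency estimate: total parallelism is the product
# of the levels discussed above. Every number is an assumption for illustration.

nodes            = 50_000   # system level
sockets_per_node = 2        # node level
cores_per_socket = 64       # processor level
threads_per_core = 2        # simultaneous multithreading
simd_lanes       = 8        # data-parallel (SIMD/vector) width

total_concurrency = (nodes * sockets_per_node * cores_per_socket
                     * threads_per_core * simd_lanes)
print(f"concurrent operations in flight: ~{total_concurrency:,}")
```

Even without counting instruction-level parallelism, an application would have to expose on the order of a hundred million concurrent operations to keep such a machine busy.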
Non-volatile memory. This is a third kind; of course, you have it in your pocket with your memory sticks, but these are more highly evolved ones. They have a significantly higher density, but of course their bandwidth is very low, and they are packaged as I/O devices, as I said before. In the future they might be a possible replacement for DRAM; we have to find out. I will just present one example; there are several, but for lack of time I cannot go into detail here. You see there are different levels of memory, and the question is how, let's say, unconventional future computer architectures could be built up with those types of memories.

So, if you again make a sort of model, looking at the question of what the capacity or what the bandwidth of a memory is, you see that these high-bandwidth memories have a low capacity, the DRAM memories have higher ones, and of course the non-volatile ones, which can be network-attached in the future, might have a very high one. With the bandwidths it is the reverse, and if we then build the ratio of both, it is not a constant; thank goodness, these ratios differ by orders of magnitude, and this is actually the guiding parameter for putting together a balanced system. So you see that for the high-bandwidth memory this ratio is 0.01, for the DRAM, the way it is built here, it is 1, and for the non-volatile one it is 100, and the question is how those components can be composed for a given code to give an optimal type of performance; a small numerical example follows below. And this is a challenge for the programmer: how to improve the data locality, how to manage the data transport. Of course it is not only the programmer; the overall architecture has to be formed in this manner.

One of those unconventional architectures, if we focus on the processor, is processing-in-memory technology as in the IBM Active Memory Cube. Here we can avoid data movement almost entirely, just by having the compute layer below, or on top, however you see it, of the 3D memory stack. And then, of course, you would have a very high memory rate. Certainly such technologies will be one part of the future solutions that we will see, if they are realized as fast as we believe. We have lots of experience; just recently one of our people got a prize for evaluating this. It remains to be seen, but I am very, let's say, optimistic that this type of technology on the node will solve part of the problems, given such a hierarchical solution.

Another one we pursue in our lab, or in my lab: these are the so-called DEEP projects, again EU-funded projects with altogether, I think, 30 to 35 million euros of funding. They focus on concurrency and hierarchical memory. The idea is to match the application scalability patterns that we have (NEURON, for instance, is also one of our test programs) to the hardware characteristics, exploiting the benefits of processor heterogeneity; so you have the many-core and you have the multi-core processors, that is the idea. We also want to profit from the new memory technologies, so network-attached memory on a special network that can handle network-attached memory. That is the other question: you do not want to have a special node to do that; you want the memory simply plugged into the network, and the network itself is intelligent enough to handle the memory. That is the idea, within an overall efficient energy envelope.
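Before coming to the sketch of such a system, here is the small numerical example of the capacity-to-bandwidth ratio announced above. The capacities and bandwidths are hypothetical round numbers, chosen only to reproduce the orders of magnitude 0.01, 1 and 100 mentioned for high-bandwidth memory, DRAM and network-attached non-volatile memory.

```python
# Capacity-to-bandwidth ratio (seconds needed to stream the whole capacity once)
# for three memory levels. All capacities and bandwidths are hypothetical.

memories = {
    #  name                      capacity_GB   bandwidth_GB_per_s
    "high-bandwidth (stacked)": (16,           1600),
    "DRAM":                     (128,          128),
    "network-attached NVM":     (16_000,       160),
}

for name, (capacity, bandwidth) in memories.items():
    ratio = capacity / bandwidth   # seconds to touch every byte once
    print(f"{name:28s} capacity/bandwidth ~ {ratio:g} s")
# -> roughly 0.01, 1 and 100: three very different regimes that a balanced
#    architecture, and the code running on it, has to combine.
```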
So what you see here is a sketch of how such a system looks. It is, of course, separated in order to give you an idea. You would have a cluster system on the left-hand side with cluster nodes, and you would have something that again is a cluster system, but to distinguish it we call it a booster system: a machine with a very regular memory, a simple machine, maybe something like the Connection Machine of the 1990s or like the early parallel computers. It is not SIMD in that old sense, although on the processors it is of course heavily SIMD, but the point is that it has networks that allow for very fast and much cheaper communication. And it is many-core; in this case it is Xeon Phi, but in the end it can be any type of many-core. The major point is that this system, and you see here what we have today, will be able to operate by, let's say, assigning resources in a dynamic manner. So if on the left-hand side, let's say, 100 cluster nodes are assigned to a program, it can choose on the right-hand side what it needs, and an MPI process is spawned over the whole system; a minimal sketch of this spawning mechanism follows below. This will go on, and you can see it here as a sketch, to encompass the idea of network-attached memory, so that we also have non-volatile network-attached memory available on both sides in different, let's say, manners, directly on the node or off the node. And of course you then need a way to access that memory, maybe in such a way that you would not even know that it is on the network; that is the idea, the final one. And in order to realize this, you need a programming environment that can spawn those MPI processes as well as manage the associated network-attached memory.

So you see, we go more and more into the virtualization of systems. This was also, in some sense, asked about in my keynote yesterday. I think that the technologies developed for such a type of system will finally also help to really realize virtual supercomputing, high-performance computing as a service, as we do it today maybe with a single node or with farming types of applications. So that is one of those unconventional architectures where I believe the codes that we have in the Human Brain Project really will profit in one way or the other. I have shown it in two compartments; of course, such a technology might also be realizable within one system. Just for the benefit of the explanation I have done it this way here.
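As referenced above, here is a minimal mpi4py sketch of the spawning idea: a running cluster-side job dynamically spawns additional MPI processes for a highly scalable kernel. Only the generic MPI_Comm_spawn mechanism is shown; how the spawned ranks actually end up on booster nodes is runtime- and system-specific, and the file name booster_kernel.py is a hypothetical placeholder, not part of the real DEEP software stack.

```python
# cluster_side.py - minimal sketch of dynamically spawning "booster" processes
# from a running MPI job via MPI_Comm_spawn (mpi4py). Where the spawned ranks
# run (e.g. on booster nodes) is controlled by the MPI runtime and is system
# specific; this only illustrates the mechanism.
import sys
from mpi4py import MPI

comm = MPI.COMM_WORLD
if comm.Get_rank() == 0:
    # Spawn 4 extra processes running a hypothetical highly scalable kernel.
    inter = MPI.COMM_SELF.Spawn(sys.executable,
                                args=["booster_kernel.py"],  # hypothetical script
                                maxprocs=4)
    work = list(range(1000))                      # toy input for the offloaded part
    inter.bcast(work, root=MPI.ROOT)              # send work to all spawned ranks
    partial = inter.gather(None, root=MPI.ROOT)   # collect their results
    print("partial results from booster side:", partial)
    inter.Disconnect()
# The spawned booster_kernel.py would obtain the parent communicator via
# MPI.Comm.Get_parent(), take part in the matching bcast/gather, and disconnect.
```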
Just one last word in this talk on the question: what do all the activities that we do in the Human Brain Project, how do they touch these issues? We do a pre-commercial procurement, and again the same questions that I have asked so far are put to companies. They know the patterns of requirements that we have in the Human Brain Project, to be run on pre-exascale HPC platforms, and of course our mode of operation. We want to procure R&D services from companies, and for that we give them technical goals: dense memory integration, integration of visualization, dynamic resource management. So this is part of the questions I have asked. There is a timeline for this PCP process, and we are already now entering phase three. So you see we had five applicants. We have chosen from those five; not we, our international panel has chosen three of them, and then the panel has chosen two finalists. They will both provide prototypes, and these prototypes will already give us much more information on the future of the technology than we have had so far. So this will be finished within the ramp-up phase of the Human Brain Project. And if it is successful, and in part I already know today that it is, it will certainly define the final procurement of a first stage of a computer for the Human Brain Project, which we have a good chance of getting funded separately by the European Union. So the status is, as I said, that we have a good response to the technological goals, and we are just in this phase.

Summary. The technology trends go towards increasing parallelism and deeper memory hierarchies. This sounds like a triviality, but it means something, because we are already at a very, very high level of parallelism if we talk about 500,000 cores, for instance, on our Blue Gene/Q. What we think is that the simple roadmaps of the companies fit quite well with data analytics, but the cost of data analytics will for sure be dominated by computing if you go to very high resolution. We need better and faster hierarchical memory concepts and data locality for the simulations, that is very clear; what the vendors offer us is by far not enough. New memory technologies, like computing in memory with the Active Memory Cube of IBM, and other technologies, like what I call cluster-booster technologies, are very promising. The booster can scale to exascale quite easily; clusters today cannot. Those are the heavyweight computers, and they cannot scale in the next 20 years, as far as one can foresee it. And I guess it is rather the other way around today: supercomputing and data analytics as fields, as technology, certainly profit from the guidance by neuroscience. Thank you.

Okay, we've got time for one or two questions. Thomas, can I just ask you about the interaction with industry that you described? Is that something you expect will be, you know, iterative over the life of the HBP, for example? Do you see that this is going to produce the technologies that are more...

At least part of it. At least it allows us to, and we have already started to, steer companies, and I can name two of them, for instance Cray, for instance IBM, to really look at their roadmaps. And I think already this small project I am talking about has changed the roadmap a little bit towards what our needs are. But it is not only within the Human Brain Project. The idea to, let's say, support or really interest other supercomputer centers and technology users in helping to develop systems, this is really shaping the whole field. And that is the reason why I said that in the end the technology development profits a lot from an application field like neuroscience. That was one of my motivations, really, to go into this field, because I can foresee that going down to one micrometer, for instance, with the imaging questions, or simulating at a level of, I don't know, the 500 petabytes of degrees of freedom that you need for the full brain, is something that is unprecedented, and I do not see it as necessary in most other fields. I would not know one field, except maybe particle physics, where they can always go very high up, or climatology, which today is at a rather low resolution, but it is unclear to me at this time, because many of those inverse problems would not profit from going to more resolution, because the number of unknowns that they have to determine is so high.

Okay, would you like to join with me in thanking Thomas again for his talk? Thank you. It's my pleasure to introduce...