Alright, so thanks everyone for coming to this session, and thanks to the organizers for having this talk at LinuxCon this year. My name is Brad Chamberlain. I'm a distinguished technologist at Hewlett Packard Enterprise, but maybe more importantly than that, I am one of the founding members and the technical lead of the Chapel project, which I'll be telling you about today. I didn't notice till I made this slide that my title is a little bit long and unwieldy, but if you're not familiar with Chapel, it's a programming language. It's designed for parallel computing, not just on the laptop, but also scaling up to the largest-scale supercomputers. And the goal here is to make parallel programming, particularly at scale, much more productive than it is today using conventional techniques. I'll be telling you more about that today.

So I want to start out very basic, just because I don't like to assume that everyone is from the same community or mindset that I am. When I talk about parallel computing, and scalable parallel computing in particular, I'm talking about using the processors and memory of multiple compute resources. And the two reasons you may want to do this are, first, to run the program faster than you could otherwise, and second, to be able to run larger problem sizes. I'm going to introduce a little cartoon that I'm going to use throughout the talk to illustrate various things. This is maybe where I started programming, back when processors only had a single core and some memory. But nowadays, of course, we have multi-core processors, so everybody has parallelism on their laptops. And then, of course, I'm interested in scalable parallel computing, so I'm interested in hooking together multiple compute nodes with some sort of network, usually one more complicated than the one on this slide, and using all of the cores and all of the memories together to solve problems bigger than I could with just a single desktop or single node.

And this is probably stating the obvious, but parallel computing has become ubiquitous over the last decade or two. When I was in school, if you were doing parallel computing, traditionally you would either have to have access to a supercomputer, maybe because you were at a school that happened to have one, or otherwise you'd apply for time at a computing center. Or maybe you'd build your own commodity cluster in your department and run there. And of course, today the landscape is very different. You can't really buy a computer anymore without a multi-core processor. Most of them also have GPUs, which give you additional sources of parallelism. And if you want something larger and you're not wanting to buy a supercomputer or build your own cluster, of course, many cloud providers are happy to sell you those kinds of large resources for a reasonable fee, typically.

So some people sort of think maybe supercomputing is dead. People say to me sometimes, oh, you used to work for Cray, weren't they the ones that would build systems that filled rooms? And the answer is yes, we still do that. We're just doing it at HPE now. This is the Frontier supercomputer that was recently installed at Oak Ridge National Laboratory. It has 74 Cray cabinets, which you can think of as kind of small refrigerator-sized things. It's got 9,000-some CPUs and 37,000-some GPUs for a total of 8 million-some cores, tons of storage, and a fast network.
And as a result of all of this, this is the first computer to sustain an exaflop on the Top500 list, where it's currently number one. It's also number two on the Green500 and number one on HPL-MxP, which are two other ways that we rank supercomputers. So supercomputing is still very much alive and well.

And I think a lot of people, if you're not in the supercomputing business, maybe don't really have a sense of how these are programmed. And there may be sort of a sense or a hope that maybe we've got some really cool, exotic programming technologies to do it. And we do, but they also probably feel to a mainstream programmer like very low-level programming models. In the supercomputing business, like the race car business, everything is about speed. So you're building faster and faster hardware, and you come out with whatever you need to program it, but without really raising the level of abstraction much beyond that. These are obviously eye charts, but these are a couple of fairly simple computations. If I described them to you, I could do that in just a few minutes. We'll actually come back to one of them later on. But as you can see, they take quite a bit of code to express and to get running on the supercomputer.

So you can think of one of the goals of my project as: could we come up with a scalable parallel programming language that's as nice as Python? And I don't mean to hold up Python as a perfect language in any way, but it's obviously very popular, and for good reasons in a lot of cases. So the idea here is, imagine that we had a programming language for scalable parallel computing that was as programmable as Python, in terms of being easy to read code in and easy to write code in. But where Python doesn't really give you great performance and doesn't really have scalable parallelism, we want to retain all of those features from traditional HPC technology. So we want to be as fast, as far as scalar code goes, as Fortran, C, or C++; as scalable as MPI or SHMEM, which are the two main ways that we program supercomputers today; as capable of running on GPUs as CUDA, OpenMP, OpenCL, OpenACC, or one of the many, many other ways that people are programming GPUs; and as portable as C. And then the last one here sometimes gets snickers in the HPC community, but we want it to be a fun language. I think a lot of us get into programming because we think programming is fun. And in HPC, I would say that programming is thrilling in the sense that you're running on these really big systems, and maybe getting some really cool results. But few people really think of it as fun; it's sort of something you have to do to achieve your results. So these are our motivations for Chapel, basically trying to thread this needle between the attractiveness of Python and all of the goodness that we get from conventional technologies in HPC.

So in a nutshell, what is Chapel? It's a modern parallel programming language, and by modern, I sort of mean post-Fortran/C/C++ era. It's portable and scalable. So we do a lot of Chapel programming on our laptops; I do most of my development on this Mac laptop. But you can then recompile things and run on the largest supercomputers. It's open source; all of the development's being done on GitHub. And it's collaborative like any good open source project, so we get code contributions from our users and from the open source community. And at the highest level, Chapel has two main goals. The first one is to support general parallel programming.
And you can think of this as: if you have some parallel algorithm in your mind and some parallel hardware you'd like to run it on, you ought to be able to do that in Chapel, or we're not succeeding at this goal. And the second one I've already kind of touched on: making parallel programming at scale far more productive, easier to read and write the code.

And before I get into some of the details here, I'm going to go through five key characteristics of Chapel that address some commonly asked questions when people hear about it. The first is that it's a compiled language. This is part of our approach to getting the best performance possible; you invoke a compiler, it's not interpreted. It's statically typed, like a lot of traditional HPC languages, and this is to avoid simple errors showing up after hours of execution because of a type error that a dynamically typed language wouldn't have caught ahead of time. It's interoperable; no language can really be an island today, and so Chapel was designed to be interoperable with other languages. We support calls between Chapel and C and Fortran and Python, and probably most others, given that most languages can go through C. And it's portable, as I mentioned: it runs on laptops, clusters, the cloud, and supercomputers. And as I mentioned, it's open source, both to reduce barriers to adoption and also to leverage community code contributions.

All right, before I forget, I want to say: if you have questions during this talk, I should have about 10 minutes for questions at the end if I do my job right. So please feel free to interrupt me. I'm happy to be interrupted as we go, and if I feel panicked, I'll cut some slides or fall back on the question time. But here's the outline for my talk today. I've just given you a little bit of what Chapel is and why. Next, I'm going to take just a high-level tour through some benchmarks and applications to give you sort of a taste of how we compare to other languages and programming models. And then I'm going to spend most of my time actually giving you an introduction to Chapel features by example, so I'll walk through some sample computations. And this is because, as a programming languages person, if I'm in a talk and somebody says they have a great language, and I leave not really knowing anything about their language, I'm usually very skeptical. Then I'll give slightly more in-depth discussions of a couple of applications written in Chapel, and wrap up.

Okay, so first, Chapel benchmarks and applications. A lot of times people ask us how we compare to other languages, which is a completely reasonable thing to do. We all have our favorite languages, or ones we know, and knowing how things compare is useful. Of course, comparing languages can be a very time-consuming and difficult process, so here's our attempt to do that in a nutshell. This is a scatter plot generated from the Computer Language Benchmarks Game, which, if you're not familiar with it, is a site that compares a couple dozen languages across about 10 benchmarks. And each language can have multiple implementations of each benchmark. What we've done here is scrape the data and find the fastest implementations in each language and the most compact implementations in each language, and plot those as two extreme points. So take Smalltalk, for example: one of its points would be the geometric mean of its fastest entries, and the other would be the geometric mean of its smallest entries.
And so the further you are to the left on this graph, the smaller your codes are; the further down you are, the faster you are. And generally speaking, you might think of productivity as being kind of down and to the left. And so what you see on this graph is that the most compact languages are scripting languages, which is maybe not too surprising, but also that they tend to be slower. And then there's a bunch of fast stuff down here at the bottom that's all clustered together. So let me zoom in to a different scale, and we can start to see those resolved. We see C, Rust, C#, and Fortran all being quite fast, but also being larger codes. And down here in the bottom left, you see a couple of anomalies. One is Chapel and the other is Julia, where both of these are languages that let you write reasonably compact code, yet code that's also still reasonably fast. And I think there's room for both of these languages to push further on both axes, but this is where things stand today. So that gives you a bit of a sense of how Chapel compares against other languages. But I'd say one of the things that really distinguishes Chapel here is that of all these languages, we're the only one that I'm aware of that was designed from the outset for parallel computing at scale.

And of course, these are just desktop benchmarks. These are very simple kernels; a lot of them are serial, some of them are parallel. So what really matters is, well, how do we do once we start scaling? And for that, I'll return to the two benchmarks I showed you earlier, Stream Triad and Random Access. Again, those slides showed how they'd be programmed conventionally for supercomputers. This is the Chapel code that's equivalent to those eye-chart slides. So you can see it's much shorter, much more compact. You could probably read through these and get a pretty good sense of what the algorithm does, even not knowing Chapel yet. And the reasons for this are kind of twofold. One is, again, we're a modern language; we've learned a lot from modern languages and incorporated a lot of those lessons. And the second is that we've built parallelism and locality into the language as first-class features, and therefore we can express very rich computations very succinctly.

And of course, being compact is nothing if you don't have performance. So these are some scalability graphs. Higher is better. The x-axis is the number of compute nodes, which we call locales, and there are 36 cores per locale, so this is up to about 9,000 cores. You can see for the first benchmark that we're scaling neck and neck with the reference, which is in green; we're in blue. And for the second one, we're actually outperforming the reference. Okay, so not only are the Chapel codes compact, they also perform quite well.

Of course, doing well on benchmarks is one thing, but if nobody's using your language, that's another. So this is a slide that lists some applications of Chapel. I'm not gonna go through all of these in detail, but we'll come back to a few of them later on. At this point, I'll just point out that some of them are very traditional HPC-like computations, like computational fluid dynamics, and others are very computer-sciency, like branch-and-bound optimization, or data-sciency in terms of doing Pandas-like things. So a really rich set of different styles of applications here, and that's something we're really happy about. All right, with that, I'm gonna start teaching you a bit about Chapel itself.
And again, I'm gonna use some examples to kind of go through this. So first, let me return to the diagram I showed you before, and I'm gonna define this term I guess I've already used, but I'll define it a little bit better now, which is the notion of a locale. So in Chapel, the term locale refers to a compute resource that has processors, so it's able to run tasks, and that has memory, so it can store variables. I've been showing you this chart calling them compute nodes; we would basically call these locales. And for now, I think it's very safe to think of each compute node as being a locale, although that's a little bit of a white lie that we'll get into a bit later on. So let me relabel those as locales.

And I'll just say that when I think about scalable parallel computing and programming, you have to worry about all the things you normally have to worry about when programming, but then there are sort of two key concerns. The first one is parallelism: what should I actually run simultaneously in my program? And here, for example, maybe I say, here are four things I could do simultaneously, let me create four tasks for those to run in parallel. And the second is locality, which is to say, where should tasks run and where should data be allocated? So if I've identified these four tasks, should I run them all on a single node like I initially showed? Or should I run one task per node? Or maybe I could actually make it 16 tasks and run on all the cores of all the nodes, right? This is something that you really need to think about and get right for your program to work well. And then the same thing applies for data. Let's say I want to allocate, because I don't have much space here, a small two-by-two array, very small for HPC. Do I want to allocate that in one node's memory? Do I want to distribute that logical array across all of the nodes' memories? Or maybe I'd like to replicate it so that each node has its own copy of that array. Again, these are really crucial decisions to make, because where your data lives and where you access it from is going to heavily impact your execution time.

Let's talk about how we manage these concerns in Chapel. First, I'm going to tell you about some basic features for locality. One thing you need to know is that all Chapel programs start running as a serial program, a single task that's running on locale zero, or node zero. So if I just have this little writeln statement, 'here' is the way of referring to the locale I'm currently running on, and this will just print out hello from locale zero, because that's where the program starts running. And then one of the ways you can engage other locales... not there yet, sorry. If I declare a variable like this variable A, again two-by-two because I don't want to draw very many boxes on my slide, all variables by default are going to be stored in the memory where the task is running. So because I'm running on locale zero, this array is going to be stored in locale zero's memory. Very simple model.

All right, now we use an on-clause. This is our sort of manual way of moving a task from one locale to another. So if I say on Locales[1], all the code bracketed within there is going to run on locale one. And you can see in my picture, my task has sort of migrated itself over to locale one. So now I declare a second array B. I'm running on locale one now, so B is going to be allocated in locale one's memory. And then something very cool about Chapel is that we have a global namespace.
So even though I'm running on locale one and A is back on locale zero, I can still refer to A, and the compiler and runtime are going to basically take care of all the communication required to make that happen. So here I say B = 2 * A. I'm initializing my local elements using those remote elements. And I can just access those variables because they're in my lexical scope, even though they're remote. All right, now this is, oops, I'm not there yet. When I leave the on-clause, I sort of return from locale one and I'm back on locale zero again. So this is a serial but distributed computation, right? I'm not doing any parallel computing; I'm just moving a single task around the machine.

So next, we'll start looking at how we'd actually make this parallel. Before I get to that, let me write it slightly differently. I'm going to write a for loop here that loops over my array of locales, so the compute nodes that I'm running on, and basically does a similar kind of thing. I'm just going to go over to each locale, allocate a new array B on it, and assign A to it. Since locale zero is one of the locales, I start out by creating B on locale zero. Then I leave that scope, I go over to locale one, create B there, copy it over, leave that scope, go to locale two, create B there, copy it over, and so on. So I can just serially iterate through my locales. And again, this is another serial, but distributed, program. And if I want to make this into a parallel computation, I can do it as simply as changing that for keyword into a coforall. And a coforall is a loop form in Chapel. You can think of it as a concurrent parallel loop that literally creates a distinct task for each iteration of the loop. So in this case, since we're running on four locales, I'll create four tasks. Each one will execute the loop body, so it'll execute on its respective locale. And then when we get to that 'var B = A', everyone will create their own B array and copy over A in parallel. So this is sort of composing the locality features of the on-clause and the parallelism features of this coforall loop in order to drive all the compute nodes in my system.

Any questions so far? Just make sure you're with me. Give me a little nod or thumbs up or something. Yeah. Good question. Yeah, so I've left out a lot of interesting stuff in these little toy computations. The question was, is A initialized? So by default in Chapel, yeah, all variables are initialized. Each type has its own kind of default initial value; for reals, which are floating-point variables, it's 0.0. So A would be initialized to 0.0 here. Yep. More questions? I'll get back to you.

That's right. So the question was: you mentioned that there's a global namespace, but there are now copies of B on every locale. So that is accurate. The global namespace basically means, for any given task, if I look at the code that I'm executing right now and I look up the lexical scope, I can refer to everything that's within my lexical scope, sort of as in traditional programming languages. But your point is a good one that, because this B is declared within that coforall loop, I have multiple tasks executing that code, each of which has its own B. That said, they can't see each other's Bs, because those aren't within their lexical scopes. They can only see their own B.
And if I introduced additional scopes or moved to other locales after that, each one of those would only be able to see the B that corresponded to its parentage in terms of the task hierarchy. So it is still a global namespace, but you're right, it's one in which any given symbol may have multiple instances in the code due to the parallelism.

There was a question back here with the yellow and black shirt. Yeah, sorry. Yeah, yeah. So the question is, how are these locales actually networked together, and does Chapel take advantage of that? This is a place where my diagramming skills are not very good; I showed a bus. In practice, most of the time we are running either on proprietary Cray networks, like the Cray Aries network or the Slingshot network that's just come out on our HPE Cray EX line, or we also run on InfiniBand, we also run on Ethernet. We basically run on almost anything, but the more the network has capabilities like RDMA or atomic operations in the network, those are things that we can leverage in our runtime. If the network does not have those things, we fall back to slower implementations. So the choice of network has pretty big performance impacts on the code, but won't necessarily prevent something from running. And I should say, we rely a lot on a package from Lawrence Berkeley National Lab called GASNet, which takes care of a lot of the network portability issues. And on my team, we spend most of our time tuning for the Cray proprietary networks.

Yeah, so the question was, as a programmer, if I'm interested in squeezing everything out of my system, do I have to take into consideration what the network is? I would say that's still true, yeah. So if you want to really performance-tune down to the very best your system can do, then being aware of what network you're on and what it does well or poorly is definitely going to factor into that. That said, I would say that apart from big things like, does the network have RDMA and does it have atomics, different networks obviously have different performance characteristics, but we don't find ourselves rewriting code a lot for different networks. The main thing we do is try to steer people away from things like this: people sometimes say, I'm running on Ethernet, my performance isn't that good, why is that? It's often because of the lack of atomics and RDMA. So something to be aware of, but not something we find ourselves coding to very often in practice, I would say.

And was there one more question in the back? OK, so you're wondering about memory consistency and atomicity, is that correct? OK. So Chapel has, I would say by default, a fairly relaxed and loose memory consistency model. This is typical in high-performance computing in order to get performance, but we have a couple of special kinds of variables. One is actually just called an atomic variable, so you can have an atomic int or atomic real. And that's going to have stricter memory consistency properties as well as implying fences for other reads and writes. We also have synchronization variables that have full/empty state associated with them, and those are similarly designed to coordinate between tasks and fence memory accesses. So in practice, what people are typically doing is using a lot of normal variables to get speed and low overhead, and then using some atomics or synchronization variables to synchronize between tasks when necessary and get the appropriate fencing. OK. Yeah, great. Thanks for the questions.
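Just to tie those answers together before I move on, here's a minimal sketch of the features we've discussed so far: on-clauses, a coforall over the locales, and an atomic variable for cross-task coordination. The variable names and the little atomic counter are my own illustration, not code from the slides.

    // On-clauses, a coforall over the locales, and an atomic counter.
    var checkIns: atomic int;            // atomic, so concurrent updates are safe

    coforall loc in Locales {            // one task per locale
      on loc {                           // migrate that task to its locale
        var B: [1..2, 1..2] real;        // B lives in this locale's memory
        writeln("Hello from locale ", here.id, " of ", numLocales);
        checkIns.add(1);                 // remote atomic update back on locale 0
      }
    }
    writeln(checkIns.read(), " locales checked in");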
I'm going to now proceed, so I don't run out of time. But again, raise your hand as I go if you have questions, and I'll try to remember to pause again as well. Let's see, so where was I? So we were looking at this coforall loop. Right, so now we've got parallelism and distribution. This is a slide that, even before this round of questions, I'd decided I was going to skip past. It talks about a second way to get parallel distributed computation, which is to use distributed arrays. I'm actually going to come back to this in a few slides, and I think it is much clearer there. So let me just blow past this.

And what I'm going to do now is actually focus on one of those two benchmarks I was showing you earlier, and we'll look a little bit at how we might write it in Chapel. So this is the Stream Triad benchmark. You might be familiar with it; I think people use it a lot in single-node computing, but we also use it in distributed-memory computing to see how fast we can drive the memory of a system. And I want to just say at the outset, this is a super simple computation. There's nothing magic about this or difficult; it's just one that fits into a half-hour talk pretty well and lets you see a little bit more about Chapel. So if you're not familiar with it, we're basically going to create three vectors, A, B, and C. We're going to multiply one of those by a scalar, add that to a second, and assign it to the third. So this is, again, a very simple computation. And if you've done any parallel programming, you might recognize this as what we call a pleasingly parallel or embarrassingly parallel computation, in the sense that all we really have to do is chunk it up into a number of chunks. You do your chunk, I'll do my chunk, they'll do their chunk, and together we'll all solve this much faster, right? We don't really have to coordinate much with one another.

And I'm going to introduce a couple of visual idioms here that I'll use in my talk. These light blue lines are what I use for kind of shared-memory or multicore parallelism. So here, for example, maybe I'm running on a four-core processor; we can just chunk those vectors into four roughly equal-sized pieces, we can all do our piece of it, and then we're done. If I move to distributed memory, it's very similar; we're just kind of chunking across larger boundaries, if you will. But here, what I'm typically going to do is replicate that scalar alpha, so that each of the distributed nodes has its own alpha and we're not always communicating back to somebody to figure out what alpha is. And then, of course, on modern systems, when we run in distributed memory, we still have multicore processors within each node, so we end up with this hybrid of both the red distributed-memory lines and the light blue shared-memory lines.

All right, so let's look at how we would write this in Chapel. The first thing I'm going to do is declare a couple of constants: n, which is my problem size, and alpha, which is the scalar I'm multiplying by. And you'll see here I've used the keyword config in these declarations. When you put config on a declaration in Chapel, it allows you to override the default value from the source code on the command line. So for example, this is how I would compile this code; our compiler is called chpl. And then if I just run it with no arguments, I'm going to use these default values from the source. But if I run it and specify --n=10 and --alpha=3.0, those are going to override the values in the source code, and I'll use those instead.
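To make that concrete, here's a small sketch of what those config declarations look like; the default values and the file name are placeholders of my own choosing rather than the exact slide code.

    config const n = 1_000_000,      // override with:  --n=...
                 alpha = 3.0;        // override with:  --alpha=...

    writeln("n = ", n, ", alpha = ", alpha);

    // Compile and run, roughly:
    //   chpl triad.chpl -o triad
    //   ./triad --n=10 --alpha=3.0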
So, nothing deeply related to parallelism and distributed-memory computing, but the hope is that you don't have to spend your whole life writing argument parsing. So now I've got the basic scalars for my problem. Next, what I'm going to do is declare my three arrays of size n. This is the syntax for doing that: I've specified the index set, or what we call a domain, in those square brackets, and the element type afterwards is real. And then I can write the computation as A = B + alpha * C, using whole-array, promoted operations, and this is going to result in a parallel computation. Now, as you can see in my picture here, I've used the light blue lines; this is only a shared-memory, multicore program. I haven't done anything in the text of my program to refer to remote locales, either explicitly, like using the on-clause, or implicitly, using some abstraction that's built in terms of them.

So let's look at what we would need to do to turn this into a distributed-memory program. The first thing we could do is use some of the tools we've already seen. We could use a coforall across our locales to create a task per locale, then an on-clause to move each task to its respective locale. And then we could declare the variables and do the A = B + alpha * C using just our multicore implementation from before. So the picture this is going to give us is: each locale has its own local arrays, and it's using its own local cores to compute its section of those arrays in parallel. And across all of them, we've sort of done the aggregate Stream Triad computation.

Now, the second way we might write this is the thing I skipped past just a few minutes ago: using distributed arrays. And so here I'm gonna use one of our standard modules. This gives us a block distribution that chunks up arrays as evenly as possible across compute nodes. I then declare an index set called Dom. I say create a domain, which is a first-class concept of an index set, and I say the indices are 1 through n. So this is gonna chunk up those indices as evenly as possible across my locales. And I'll use this domain to declare my arrays, so I've replaced the anonymous 1..n from before with this block-distributed domain. And that's gonna give me an implementation of those arrays in which each locale owns a subset of the global array elements 1 through n. And then, as before, I can say A = B + alpha * C. And because those arrays themselves are distributed, the implementation will also be distributed. We use kind of an owner-computes style model, and each locale will compute, using all of its multicore resources, its local chunk of the globally distributed arrays. Okay, so this is another way you could write Stream, and this is what we call our global version.

So at this point, you've now become enough of an expert in Chapel to understand the code I showed you at the beginning. You've essentially seen it all; I think the only difference is that I actually initialize the arrays in that case with some non-zero values. And let's zoom in on that performance graph, now that we've seen this in a little bit more detail. It's hard to see, but there are three lines on this graph. There's the green line, which is the reference MPI and OpenMP version, and then there are two Chapel lines: one for that explicit coforall-and-on-clause-based implementation I showed you, and one that uses the global arrays. And you can see that these three are all essentially superimposed. This isn't too surprising; this is a memory benchmark.
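Pulling those pieces together, here's roughly what that global, block-distributed version of Stream Triad looks like as a sketch. The exact spelling of the distribution varies a bit by Chapel version (newer releases call it blockDist), and the initialization values here are just placeholders of mine.

    use BlockDist;

    config const n = 1_000_000,
                 alpha = 3.0;

    // Block-distribute the indices 1..n as evenly as possible across locales.
    const Dom = {1..n} dmapped Block(boundingBox = {1..n});

    var A, B, C: [Dom] real;   // distributed vectors: each locale owns a chunk

    B = 1.0;                   // placeholder initial values
    C = 2.0;

    // Whole-array statement: owner-computes, each locale uses all of its
    // cores on its local chunk of A, B, and C.
    A = B + alpha * C;

    // Run on multiple locales with, e.g.:  ./stream -nl 4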
And on that performance graph, if you're doing things right and saturating the memory, this is the performance you're gonna get. So the main thing you're seeing here is that Chapel doesn't introduce unnecessary overhead compared to traditional approaches.

All right, returning to my little picture of parallel computers, there's one thing that I sort of skirted past here, which is that compute nodes now often have GPUs, and those have their own processors and memory. So my little cartoon here is a little bit overly simplified, and perhaps a more accurate representation would be something like this, where I've got a bunch of GPUs that each have their own cores and their own memory. So I'm gonna talk briefly about how we deal with this, which is that we refer to these GPUs as sublocales, so kind of locales within our locales, and we're gonna use the same features to target them. I just wanna return to my Frontier slide here to point out how important this is. You know, 37,000 of the processors in this machine are GPUs, and if you look at the amount of cycles on the system, I think it's something like 95 to 99% of the cycles that are GPU cycles. So if you ignore the GPUs, as I've done up to this point in my talk, you're basically leaving a lot of the machine unused.

So these Stream Triads I've shown you, these only run on CPUs, and it's kind of for similar reasons as before: I haven't done anything to refer to the GPUs, either explicitly or implicitly. So let's look at how we could rewrite these to target the GPUs. I'm gonna start out with the coforall over the locales and the on-clauses from before to create a task per locale. But then I'm gonna use another coforall/on combination to iterate over all of my GPU sublocales. And I have each of those GPU sublocales declare its own arrays and do the A = B + alpha * C. Because I'm within a GPU on-clause, I'm gonna allocate those arrays in GPU memory, and the parallel computations can use the GPU's cores to do the computation. So now I've got a code that is running on all of my GPUs, but leaving my CPUs unused, other than some coordination to get things up and running.

Yeah, question? I'm sorry, I guess I... Mm-hmm. Yeah, yeah. So that's just... Yeah. So if I were doing this for real, that is what I would do. So what I've done here, and I guess I should have said this, is that this n, this is sort of the embarrassingly parallel version where I'm focusing on the per-node or per-GPU code. And here the problem size represents how much my specific GPU or CPU is gonna allocate, as opposed to the global problem size. You're right though, and in fact, you're making me realize I've been a little bit sloppy in my slides, because I started out talking about computing this on n-element vectors. Here, what it actually would be is something like n times the number of GPUs times the number of nodes. So if I were doing a real computation, not just this little toy benchmark that I can fit into this talk, you're right that what I would have to do, if I were using this explicit model, is take that global problem size and figure out, well, in the global space, I'm sort of GPU i out of k, so I own indices low through high, and I'm gonna allocate my array that way. And that's actually what we do in practice. Again, this is sort of what I could fit onto a slide. Now, if you're using the distributed arrays that we saw briefly, that bookkeeping is being done by the distributed-array abstraction itself. So I'm just saying 1 through n, and all the subdivision is happening underneath that.
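To make that per-GPU picture a little more concrete, here's a rough sketch of the coforall/on idiom over GPU sublocales, written with a local, per-GPU problem size n as we were just discussing; this is my own reconstruction rather than the slide code.

    config const n = 1_000_000,                // local (per-GPU) problem size
                 alpha = 3.0;

    coforall loc in Locales do on loc {        // one task per compute node
      coforall gpu in here.gpus do on gpu {    // one task per GPU sublocale
        var A, B, C: [1..n] real;              // allocated in this GPU's memory
        B = 1.0;                               // placeholder initial values
        C = 2.0;
        A = B + alpha * C;                     // runs using the GPU's cores
      }
    }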
So yeah, thanks for pointing that out. I'm gonna have to think about how to improve these slides to clarify that now. So yeah, here this is a local problem size, you're right. I'm sort of focused on me, myself, and I: what do I wanna allocate? I'm gonna allocate n-element arrays, and hopefully I have enough memory to do that. Okay.

So as I was saying, so far this is a GPU-only program. I haven't done anything to refer to the CPUs; in fact, I've referred only to the GPUs. But with a couple of other changes, we can bring the CPUs into the mix as well. So I'm gonna use this cobegin statement, which we haven't seen yet. This is basically a compound statement where each of the child statements will be executed by a distinct task. So the first task here is gonna go off and spawn all my GPUs and get them up and running. And then the second task is gonna run the traditional CPU triad that I started out showing you before. And so at this point, I sort of have a task firing up the GPUs and a task firing up the CPUs. The GPU task is using a coforall to spawn tasks for all of the GPUs; the CPU task is taking advantage of the fact that the whole-array operations automatically do that. And now I'm using all of the CPUs and GPUs across as many compute nodes as I have available to run on. There's one other thing to notice here, which is that I basically have some code duplication, which seems kind of unfortunate. So something you could do is refactor this code by having a little run-triad procedure, and then just calling that from both sections of your code. And our compiler is basically gonna go and specialize two versions of that, one for the CPUs and one for the GPUs.

I'll mention that these GPU features are quite new; even though GPUs have been around for a while, particularly in HPC, we've only really been targeting them for the last year and a half or so. So these are some performance results showing where we are today. The left graph is NVIDIA, the right graph is AMD. The x-axis is the number of elements, and this is for a Stream Triad targeting a single GPU. The reference version in CUDA or HIP is in green, the highest blue line is current Chapel, and the other blue lines are kind of stepping stones that have gotten us there over the past few months. So you can see that as the problem size grows, we get pretty competitive with the reference version. At smaller problem sizes, we're sometimes off by quite a bit still, so we still have some overheads there that we need to identify and optimize away. But so far we haven't seen any showstoppers, and we expect those curves to continue to track in sync with one another as time goes on.

All right, I'm rapidly running out of time. Let me just say, we've been focused on this very, very simple computation of Stream Triad, which again is not really representative of real HPC codes. Well, at least not unless you're really, really lucky; that doesn't happen very often. So I wanna return to the fact that many significant things are being done with Chapel, and I'm gonna focus in on two of them in particular. The first is this code called CHAMPS, which is a 3D unstructured computational fluid dynamics framework for aircraft simulation. And this is currently about 85,000 lines of Chapel that was written from scratch over the past three years or so. This is done by a team at Polytechnique Montréal. And this was a case where the students were really eager to use Chapel and the professor was very hesitant, because we're not sort of name-brand yet, right?
We're not C, we're not MPI, we're not Python. And the students basically convinced him by writing up some code and showing him how good it was and how easily they could do it. So they found that they got performance and scalability competitive with MPI and C++, which is what they would have used otherwise, and yet the students found it far more productive to use. My favorite anecdote is that the professor, who's now a convert, says that he has master's students who in three months can do a project that would have taken them two years before. And he's a believer as a result of that. The other thing is, you know, this is a fairly modest-sized university, I would say, and over these three years they've basically created software that allows them to compete with established computational fluid dynamics centers like NASA and JAXA and Stanford and sort of the big players in the field. So it's really allowed them to punch above their weight. And I don't want you to just take my word for it. This is the PI; he gave a keynote at our annual workshop a couple of years ago, and the link to his talk is here at the bottom. And there were some great quotes about how, again, he sort of became a believer and how much they've been able to get done, and how easily. But in the interest of time, I'm just gonna blow past that.

I'll mention that CHAMPS is kind of a traditional HPC application, sort of simulating the physical world. And this next one I wanna zoom in on is kind of the exact opposite; I would say it's very different from traditional HPC, at least as most of us think about it. And the idea here is: what if we wanted to do data science in Python at scale? So the motivation is, you've got some data science problems that sort of require HPC scales to solve, and solve well and quickly. And you've got a bunch of Python programmers and you've got some HPC systems to run them on. But how are you gonna leverage those Python programmers to get the work done? Because again, Python isn't really designed for scalability and performance, particularly on these large-scale systems. And so Arkouda is what was built by this team to deal with this. It's basically a framework for doing interactive HPC within Python. The idea is that the user is sitting in Jupyter, writing Python code and making what look like normal NumPy and Pandas calls. But that Arkouda library is basically a client that is communicating with a server written in Chapel on the HPC system. And so when you say, allocate me a 30-terabyte array, it goes up to the HPC system, allocates that array, and sends you back an ID for it. And then you can say, fill it with random values, or multiply it, or sort it, or whatever. And so you're sort of just sitting there happily doing Python, and you're sort of powering this HPC system as you go.

So what is it? It's a Python client-server framework, as I mentioned, for interactive supercomputing. It computes massive-scale results, which again, think terabyte-scale arrays, within the human thought loop. So the whole goal here was that we can't afford to have every operation take half an hour, because I'm gonna lose my train of thought; every operation had to complete in seconds to a small number of minutes. And it's an extensible framework, so you can sort of plug anything into this if you're willing to write it in Chapel. But for starters, it's focused on a key subset of NumPy and Pandas for data scientists. It's currently about 30,000 lines of Chapel and about 25,000 lines of Python, again written over about the past three years.
And happily, it's open source, so if this intrigues you and it's something you'd like to try out, you can download it and try it out. You don't need a supercomputer; you can use a nearby laptop or whatever server you have, if you want. This was written by Mike Merrill and Bill Reus at the DoD. And the reason they chose Chapel is that it's close to Pythonic, which allowed them to write it rapidly, which they needed to do for a deadline. It also meant that their Python users, if they needed to look under the hood or add their own functionality, wouldn't be repulsed by MPI and C++. And that's proven to be the case for them. It gave them the performance and scalability they needed. And they love this ability to do all the development on the laptop and then go and recompile it for their supercomputer and have it run.

This is a scalability graph. In Arkouda, argsort is one of the crucial routines; lots and lots of data science operations are doing sorting under the covers. So this is something we've worked on tuning with them. And these are a couple of performance runs. One is from just recently, in the past month or so, on a Cray EX; the other was a couple of years ago on an HPE Apollo. You can see the numbers here. They're running on 100,000 cores in one case and 73,000 cores in the other, sorting 28 terabytes in one case and 72 terabytes in the other. And the sort times are kind of in the 24-seconds-to-two-and-a-half-minutes time frame, and we're getting 1,200 gigabytes per second or 400 gigabytes per second. Now, if you're like me, these numbers mean nothing to you, but the curve maybe looks good. But I'm told by sorting experts that this is close to world-record performance, if not a record. And it was done in 100 lines of Chapel. And I think if we considered gigabytes per second per line of code, it probably is a record.

With that, let me wrap up. I'm realizing I am rapidly running out of time. Though I'm the one who gets to talk to you today, there's a team of 21 of us at HPE working on Chapel on a daily basis; this is our full-time job. We'd love to talk to you more if you're interested in it. And so let me just summarize, and I'll just do the high-level bullets here. I'll assert that I think Chapel is unique among programming languages; I don't think there are other languages as focused on parallelism, locality, and scalability as we are. It is being used by real users for productive parallel computing at scale, and we've seen this in this talk in a couple of very different kinds of applications. And if you or your users are interested in taking Chapel for a spin, please let us know. I know that learning a new language can always be intimidating, and we're very happy to work with users and user groups to help ease that learning curve. I mentioned briefly before that we have an annual workshop. The 10th instance of it is coming up at the beginning of June, this next month. It's an online and free workshop. There's a day of interactive programming, so if you'd like to sit down with us and do some Chapel programming, you can propose a project to do there. There's also a day of presentations, so if you want to hear about some of these applications I've been telling you about, many of them will be represented there. And the link is here. I've also got a slide with resources you can click on and follow up with later on: social media, ways we interact with the community. All these slides will be available, so you can, you know, explore this later on, and I will stop there. Thanks very much. And I really did run us right up to the end of our time.
In part, I think that's because we had some good questions along the way, but I'm happy to take another question if somebody has one, or just to chat with people informally during the break. Yeah, question back here.

Yeah, that's a great question. So the question was, you know, we want to run on these big systems, and I've sort of glossed over how you say which resources you can and should run on. So when you compile a Chapel program, you specify what we call a launcher, which is sort of how you want to get things up and running. A lot of systems are managed by something like Slurm or PBS, and so we have launchers that wrap those technologies. And there's a command-line argument I didn't show you, which is that whenever you're running on multiple locales, or distributed memory, you specify how many locales you want to run on on the command line. That will go off and talk to Slurm or PBS or whatever and say, hey, this person wants 1,000 nodes; it'll then give us back those nodes, and then we'll fire up the code that we've generated from the compiler across all those nodes. And then the details of, well, how do I say which nodes or which kinds of nodes? Those obviously depend on which of those launchers you're using, and you can use sort of the normal environment variables and things to set it. Lots of details, but the idea is that once it's configured properly, a typical user can just say, run my Chapel program on 1,000 locales, and you're off to the races. Well, once you get through the queue on your system, anyway.

Another question up here? Yeah, so that's a good question. So the point is that I've said that GPUs are just kind of another sublocale within my locale, yet we've seen that 'gpus', I guess it's not really a keyword, but it's an identifier in the code. And so how do we think about that? I'll say that this is something we're wrestling with ourselves to an extent. What we have is a notion of a locale model, and when you compile your code, you can say what you're gonna compile it for. Today we have sort of two main locale models. There's what we call the flat model, which is one that has no sublocales, so kind of a CPU-only resource. And then we've got our GPU locale model, and that's the one that has these GPU sublocales. And the idea is that you can actually define additional locale models within the Chapel language by sort of creating an object with the right interface and things, but those are the two instances we have today. So it isn't that the language embeds the keyword 'gpus' into it, for example; it's the fact that the GPU locale model has a field within it, this gpus field, that's an array of sublocales. That said, I think there is a tension here between how much the language and compiler need to know about in order to generate good code, and to what extent this can be completely user-defined and extensible, and that's something we're still wrestling with.

I'll take... I just wanna be respectful of the room. Let me just do some math real quick. So, 12 minutes. Why don't I take any other questions offline so the next speaker won't be held up? And again, thanks for attending the talk today, and thanks to the organizers for having us. And yeah, I'll be around if people wanna ask other questions. But thanks again.