Well, welcome. It's my pleasure to introduce Kunle Olukotun from Stanford, who was a student here back in 1991. He's going to tell us about his 25-year odyssey, so I guess it started back then. I'm not quite sure what he was doing then; there was parallel computing, but maybe he'll tell us. He's had a very distinguished career at Stanford. He was one of the first people, if not the first person, to propose that we start building multi-core machines on a single chip, which Intel subsequently "invented." After that, he founded a company called Afara, which built a multi-core processor on a chip, which you may know as the Niagara series. Sun subsequently bought that company because their own attempts at the same solution were failing miserably; they were missing their market deadlines, and so their solution was to buy a solution, essentially, and that was Afara, which Kunle was a founder of. And now he's back at Stanford, where he's director of the Pervasive Parallelism Lab. He's looking more at language-level issues, and specifically at domain-specific languages, which he thinks are a better approach to solving these problems. I won't go into it; I'll let him explain. So without further ado: Kunle, thank you.

So thank you very much, Trev, for that introduction. If Trev didn't know what I was doing when I was his graduate student, then I don't know who should know. Yeah, exactly. I'd like to thank the department of, what was it, CSE, for inviting me to give this lecture. And whoever orchestrated the weather, my hat's off; I've never seen it this warm at this time of year. I've also got to commend you on the building. This is the second time I've been in it. I was telling people earlier that I started life down on Central Campus in what is now known as East Hall; back then it was East Engineering.
And I bet it looked a lot worse than it does now, so this is certainly a big upgrade. I'm going to talk about making parallelism easy, a 25-year odyssey, and yes, it started before '91, at least if you count properly. It's good that there are lots of graduate students in the audience, because this talk is aimed at graduate students. Basically, what I want to do is give you the benefit of some of my experience in doing computer systems research, and hopefully the insights I've accumulated over time will spur you to do interesting work too.

Okay, so let's start with the computing stack. If you look inside a modern computer system, you'll see a number of levels of abstraction, starting at the top with applications. Applications are typically written in high-level programming languages, and then compilers translate those languages down to machine code, which runs on the hardware with the help of the runtime and the operating system. And of course the architecture, the hardware, is implemented using integrated-circuit technology. It's my experience that most of the interesting research problems in computing systems require crossing these stack boundaries. You can stay within any one layer, but I think crossing them leads to the most interesting sorts of problems that need to be solved. And I think this is especially true with explicit parallelism. Here we're talking about parallelism that is managed and orchestrated by the programmer: not created automatically under the hood, but with the programmer playing some role in how the parallelism gets executed.
And as I've shown, it basically crosses all of the stack boundaries. As we go through the talk, you'll see that we cross all of them, and that makes this a really challenging problem to work on, because of course there are trade-offs to be made between the different layers, and figuring out how to make those trade-offs correctly is pretty challenging. But the benefit of crossing so many stack boundaries is that, as a researcher, you typically can't be an expert in all the different layers, so you need to reach out to colleagues who are experts, and that can lead to lots of interesting and fruitful collaborations. One of the things that makes explicit parallelism so challenging is the tension between what you want to do at the upper levels of the stack and what you want to do at the lower levels. At the upper levels, you want higher productivity for programmers: you want it to be easy for them to write programs that are both correct and achieve good performance. And at the bottom, of course, you want high-performance hardware. As I said, there's a tension: the things you want to do for productivity will sometimes lead you to low-performance solutions, and the things you might want to do down in the hardware for performance may hurt productivity or lead to higher power. John Hennessy, who is of course the president of Stanford and a noted computer scientist, says that parallelism is a problem as hard as any computer science has ever faced. So this means that lots of ingenuity is going to be required to make parallelism easy. So let me start at the beginning. This is a picture that some of you with white hair will recognize.
This is a picture of the nCUBE 10. When I was getting started back in the mid-80s, we got a machine that was designed by some people who had left Intel: the nCUBE 10. This board has 64 processors, or nodes. Each node has 128 kilobytes of memory, so in total you had eight megabytes of memory on the whole board, and the performance was about half a megaflop. Of course this board is completely dwarfed by the iPhone you have in your pocket, so performance was not nearly as good as you could get today. So the question is, how did you write programs for this? Well, you had eight megabytes of memory, but it was divided into these 128-kilobyte chunks, and in order to communicate between the chunks you had to explicitly send messages back and forth between the different nodes. At that time there was no Linux available, so the host ran a custom flavor of Unix that nCUBE created, called Axis, and each of the individual nodes ran a special node kernel. So developing applications for this machine was actually not that easy. Writing message-passing programs was difficult; the hardware was flaky, so it would often die and you'd have to go reboot it. In fact, as I was discussing with Professor Stout earlier today, every graduate student who worked on this machine had a key to the machine room so they could go reboot it. And the software was fragile and buggy; by the way, nCUBE was a startup, so that was part of the problem. The overall process of developing software for this machine led to fairly low productivity. It was difficult to develop software, and so after this experience I thought parallelism was interesting, but it was clear that the hardware was much more capable than the software.
So the hardware was interesting, but the software made it difficult to use, and based on that I decided that the more interesting thing to do in architecture was to work on improving single-chip, single-CPU performance. That's what I did for my PhD, and I left parallelism by the side. So then I went off to Stanford, and the question became: what research direction should I go in? How should I focus my research so that I would do the sorts of things that would get me tenure at Stanford? In retrospect, this is the process that I went through, and I want to say a few words about picking a research direction. Fundamentally, of course, you want to work on something that's intellectually challenging, and I suppose most of us are working on things that are intellectually challenging; for me that means crossing multiple stack boundaries, not just staying within any one layer. Secondly, I think you really want to revisit conventional wisdom. Most areas have a bunch of widely accepted truths, and the question is: are these truths fundamental, or are they just an artifact of the way things have always been done? Questioning conventional wisdom, I think, is a key element in picking a research direction. Also, pick something that's different and new, because it's easier to make progress and easier to get your papers published; it's much better than trying to polish the ideas of others. And a key thing you want to do in an engineering field like computer engineering or computer science is to create new engineering ideas and change the way people actually do engineering. You should find some way to change the practice of engineering; in my case, to change the way people design and build computer systems.
And finally, it almost goes without saying that you don't want to be working on things that are close to what industry is doing, because industry has many more resources, capabilities, and potential knowledge about certain areas than you do, and trying to keep up with industry is typically not a good idea. Often the ideas you're working on are ones industry doesn't think are good, and the question is: do they think that because there are fundamental flaws, or just because they're still hewing to the conventional wisdom? Okay, so given that, let's think about what the situation was in the mid-90s, when I was trying to pick this research direction. On top-40 radio, U2 and Whitney Houston were playing, and in the microprocessor world we were going through a performance boom: clock frequency was increasing at 40% per year, and single-CPU performance was going even faster, at 50% per year. In the computer architecture research community, a lot of people were thinking about how to make superscalar architectures better and give ideas to Intel. I know there are a lot of people in the audience from EECS 470, and a lot of the ideas you've been exposed to in that class came out of this period in the 90s, when people were thinking about how to do deep and complex processor pipelines. The net result of all of this, of course, is that we had a free lunch for the software developers. So what does this mean?
Well, it means that software developers sell software by increasing the number of features, so last year's version has fewer features than this year's version. The problem, of course, is that as you add features you potentially make the software slower, and as far as the software developers were concerned that didn't matter, because if the software wasn't fast enough today, all they needed to do was wait a little while, and the underlying performance improvements from the processor would make it run fast enough. So they didn't have to work very hard; they just got to ride on the success of the microprocessor performance boom. But there were clouds on the horizon. One of the trends in the architecture area was that the techniques for improving the performance of a single CPU were starting to run out of steam. All the complex pipelining and out-of-order ideas, with branch prediction and speculative execution and so on, were starting to run out of steam, and the complexity of actually implementing those ideas was steadily rising. From the IC technology point of view, the interconnect, the wires that connect the transistors, was not scaling as well as the speed of the transistors themselves, so the interconnect was lagging behind. Given that, what did my research group decide to do? We decided to look at the idea of building multiple processors on a chip instead of building one complex CPU.
The idea is that you can take a die, and maybe half of it is CPU and half of it is cache; if you look at most microprocessors, that's what you'll see. The traditional way of building the chip was to take all the CPU area and build one complex CPU. The alternative we were proposing was to take that same area and build four simpler CPUs, and design the microprocessor that way. The advantage is that each of the processors is simpler to design, and the wires are shorter, so we got to mitigate the problem that long wires were becoming particularly problematic in the IC technologies of the day. Now of course, having multiple CPUs on a chip made it possible to exploit multi-threaded parallelism, and in fact, if you really wanted to take the most advantage of this sort of multi-core chip, you actually had to write a parallel program. Compared with the nCUBE board I showed you, writing a parallel program for this chip would be much easier, because the processors can share memory through the cache very efficiently. However, you still fundamentally had to change your programming model and write a parallel program. So what was the performance potential of this approach? I'm a computer architect by training, so I have to show you some performance graphs, otherwise they'd revoke my union card. Here I'm showing you a bunch of benchmarks, some fundamentally sequential and some more parallel, and a comparison of the four simpler CPUs versus the one complex CPU. As you see, on the left side the complex CPU is basically winning or keeping up with the multi-core CPU, and then as you move towards the right, as the programs get more parallel, you can get much better performance using the multi-core approach.
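The shared-memory style mentioned above can be sketched very simply. In this Python illustration (my own, with invented names), four threads all update one counter through the same memory, so no explicit messages are needed; but, as the next part of the talk explains, the access still has to be synchronized:

```python
import threading

counter = 0
lock = threading.Lock()

def work(n):
    global counter
    for _ in range(n):
        # The read-modify-write below is not atomic; without the lock,
        # two threads could interleave here and lose updates.
        with lock:
            counter += 1

threads = [threading.Thread(target=work, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # -> 40000
```

Contrast this with the message-passing version: the data is simply there for every thread to touch, which is what makes the model easier to program, and also what makes it easy to get wrong.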
So: reasonably close on the sequential applications, and much better on the parallel applications. That was the crux of the idea presented in the paper we wrote for ASPLOS in '96, "The Case for a Single-Chip Multiprocessor," which is basically the first paper that lays out the arguments for why this was a good idea. Some people thought it was interesting, but industry really wasn't convinced, and one of the reasons was that now you need to write parallel software. So why is writing parallel software fundamentally so difficult, even in a shared-memory environment? The problem is, of course, that the software has to be correct. And what does correctness mean? It means that if we're sharing memory, then we have to synchronize access to that shared memory so that we get the correct result, and typically we're going to use locking. The problem with locking is that if you do too little of it, you'll have accesses that aren't properly synchronized, you'll have races, and you'll get incorrect program operation. On the other hand, too much locking, or locking in the wrong order, can lead to deadlock, and your program will hang. So that's correctness. But the software also has to perform well. This means, fundamentally, that you've got to find enough parallelism in your algorithm for it to run well, and you have to make sure you don't have too much synchronization, since locking essentially means stalling: if you synchronize too much, you'll stall too much and you won't get good performance. And you can't communicate too much, because communicating too much also means stalling. So fundamentally you're between a rock and a hard place: if you do too little, you'll get races; if you do too much, you'll get bad performance and deadlock. This is a difficult problem. So what's the bottom line?
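The deadlock case is easy to sketch. In this toy Python example (entirely my own, purely illustrative), every thread needs both locks. If one thread took `a` then `b` while another took `b` then `a`, each could hold one lock and block forever waiting for the other; the standard fix, shown here, is to always acquire the locks in one global order:

```python
import threading

a, b = threading.Lock(), threading.Lock()
balance = {"x": 100, "y": 100}

def transfer(amount):
    # Always acquire a before b.  If another code path acquired b
    # before a, two threads could each hold one lock and wait on the
    # other forever: deadlock.  A single global lock order prevents it.
    with a:
        with b:
            balance["x"] -= amount
            balance["y"] += amount

threads = [threading.Thread(target=transfer, args=(10,)) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(balance)  # -> {'x': 50, 'y': 150}
```

Note how the fix itself illustrates the performance side of the dilemma: with both locks held for the whole transfer, the five threads run the critical section strictly one at a time.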
The bottom line is that there are lots of smart people, like you, who can write decent sequential programs, and they might even perform well every now and then. But few people can actually write correct parallel programs, because of all these locking issues, and there's a tiny number of gurus who can write parallel software that is both correct and performs really well. And if you have this problem, how do you produce all the software that's going to run on all these multi-core devices? Well, one idea is to say: if millions of people can write decent sequential programs, why can't we take decent sequential programs and turn them into parallel programs? So what's the fundamental problem here? Let's look at some code, because this is computer science, right? We're all used to looking at code. Here we have a while loop: basically we're reading sentences from a file, checking that we haven't hit the end of the file, and then parsing each sentence; if we get an error from the parsing, we print out an error message associated with the sentence. A fairly easy loop, and the question is: could you parallelize it? I'm not going to give you a quiz, but let's look at it. Maybe we could run these loop iterations in parallel, if we knew that we could run multiple instances of the read at the same time, and multiple instances of the parse function at the same time. But even if we could do that, we have the problem that this loop has an unpredictable exit: we never know when the data we're reading is going to run out, and we'd have to have some way of handling that.
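The loop being described looks roughly like this. This is a reconstruction in Python; `parse` and the sample input are stand-ins of my own for whatever the real routines and data were:

```python
def parse(sentence):
    # Stand-in parser: treat any sentence without a final period as an error.
    return sentence.endswith(".")

def process(sentences):
    errors = []
    it = iter(sentences)
    while True:
        s = next(it, None)      # read the next sentence from the input
        if s is None:           # the unpredictable exit: we only learn the
            break               # loop is over when the data runs out
        if not parse(s):        # parse it, and record any parse error
            errors.append(s)
    return errors

print(process(["The cat sat.", "oops", "Done."]))  # -> ['oops']
```

To parallelize this, you would need to know that concurrent reads and concurrent calls to `parse` are safe, and you would still have to cope with iterations that were launched past the point where the input actually ended.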
The problem is that if we give a compiler a loop and say, please parallelize this loop, the compiler has to be conservative. It has to guarantee that no matter what happens, it's always safe and legal to run the loop in parallel, and if it can't make that guarantee at compile time, it has to throw up its hands and say, sorry, this loop is not parallel. For many loops like this, most of the time the compiler's reaction is: I'm sorry, I can't guarantee this will run correctly in parallel, so I can't parallelize it. Okay, so compilers have to be conservative, and the question is: can hardware help you be more aggressive at finding parallelism? The idea we worked on in the Hydra research project was hardware support for speculation. The way to think about this is that it's a safety net: even if you go and do things that aren't strictly legal all the time, you will never generate a bad result. You'll always get the right answer; it may take as long as doing it sequentially, but you will at least get the right answer, and if there is parallelism, you'll get higher performance. With this sort of capability, you can potentially take sequential programs and automatically parallelize them. We did this in the context of Java: we had a complete system for dynamically parallelizing Java programs. You take a JVM and feed it a Java program; it analyzes the program and finds the loops, because that's where most of the parallelism is; it decides how to orchestrate those loops to run them in a speculatively parallel manner; it figures out how to move the code around to optimize performance; and then it generates the parallel code, which relies on the underlying hardware support for speculation. The performance is pretty good on floating-point apps, because you expect those to be dominated by loops; it was also pretty good on multimedia apps, again dominated by loops;
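To make the safety-net idea concrete, here is a toy software model of it in Python. This is entirely my own illustration: real thread-level-speculation hardware like Hydra's does this with versioned caches, not dictionaries. Each iteration runs against a private write buffer while logging what it read; iterations then commit in program order, and any later iteration that read a location an earlier one wrote is squashed and re-executed, so the final state always matches sequential execution:

```python
def run_speculative(mem, body, iterations):
    """Run loop iterations 'in parallel' with a sequential-commit safety net."""
    # Phase 1: every iteration executes speculatively against the same
    # stale memory snapshot, buffering its writes and logging its reads.
    specs = []
    for i in iterations:
        reads, writes = set(), {}
        def load(a): reads.add(a); return writes.get(a, mem[a])
        def store(a, v): writes[a] = v
        body(i, load, store)
        specs.append((i, reads, writes))

    # Phase 2: commit in program order.  Squash and replay any iteration
    # that read a location written by an earlier, already-committed one.
    committed = set()
    for i, reads, writes in specs:
        if reads & committed:            # conflict detected: squash...
            reads, writes = set(), {}
            def load(a): reads.add(a); return writes.get(a, mem[a])
            def store(a, v): writes[a] = v
            body(i, load, store)         # ...and re-execute against real memory
        mem.update(writes)
        committed |= writes.keys()

# A loop body with a true dependence between iterations: mem[i] = mem[i-1] + 1
mem = {0: 10, 1: 0, 2: 0}
run_speculative(mem, lambda i, ld, st: st(i, ld(i - 1) + 1), [1, 2])
print(mem)  # -> {0: 10, 1: 11, 2: 12}
```

Iteration 2 initially reads a stale `mem[1]`, the conflict is detected at commit time, and the replay produces the same answer the sequential loop would; when there is no dependence, no replay happens and the speculative work is simply kept.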
and of course on integer applications, which are not loop-intensive, you get a lot less performance, but still not bad. And we're starting with Java, which is not as optimized as C. So things looked good; it is possible to do. Unfortunately, it's not really a scalable solution: maybe you can do it on four or eight cores, but going much further doesn't give you that much benefit. So, we did a bunch of research on multi-core architectures, and as I said, one of the things I'd like to do from a research point of view is not just academic research, which entails publishing papers; I'd also like to change the way industry practices engineering, to change the way industry designs computers. So what do you do? Of course you've got to write the papers, because after all we're academics. You also want to develop prototypes, to be more convincing that your ideas make sense, and potentially give them away for people to play with. You want to give talks, go and talk to the engineers who design processors, tell them about your ideas, and convince them that this is the right way to do designs. But after doing all this, still no one was really convinced. What was limiting the acceptance of this approach? Well, one thing was the fact that single-thread performance was still improving at 50% per year, and as long as that was going on, people said: why should I care about this model of building microprocessors, which involves changing the way I do things and, more importantly, changing all the software that has to run on them? And beyond that, industry is naturally conservative: they want to keep doing the same old thing until it really hurts, and even when it really hurts, maybe they'll still keep going. So we were faced with the fact that we thought these were really good ideas, and we didn't think they would see the
light of day unless we actually played a part in transferring the knowledge from academia to industry. So how do you transfer from academia to industry? Well, you can go to industry yourself, go to an existing company and try to push your idea that way. But another way, a path that is well tried at Stanford, is to do a startup. And so that's what I did. The name of the startup was Afara Web Systems, and this was the logo: Afara means bridge in Yoruba, which is a West African language, and that's why the logo was a bridge. It was founded in 1999, at the height of the internet boom, when everybody was starting up some internet company to sell pet food and what have you on the internet. What was happening in the data centers was that the large websites were running out of both space and power, and part of that was because the microprocessors they were using, Pentium III and Pentium 4 at the time, were really not optimized for the kinds of workloads running in these internet data centers. The goal of Afara, then, was to revolutionize internet data centers, which of course was a huge, multi-billion-dollar market, by getting a ten-times performance-per-watt improvement with a new microprocessor based on CMP technology. The obvious question is: why do you, as a startup, think you can compete with Intel, Sun, DEC and all the other microprocessor companies, and actually come up with something better with far fewer engineers? The key is that the CMP approach allows you to do things both more simply and at higher performance. After I pitched these ideas to a few VCs, at least two of them bit, and I was able to develop a team; I had top people from all the existing processor companies and other systems companies. And the whole idea was to design both the chip and the system around the chip,
right? It turns out you can sell your product for a lot more money if it's a system, with software and a chassis, than if you're just trying to sell a chip. Doing chip companies is hard; selling appliances with software gives you more margin. One of the VCs said, we're going to start with twenty million dollars, but we'll commit to give you a hundred million dollars, all the way to market, with this idea; he said, this is a big-boy project, and it means big-boy money. So that's what it was about. Here is one of the slides from the pitch I gave to the VCs when I was raising money, and basically it argues that this approach is one that will really work. From a software point of view, we had lots of open standards, so it was possible to have software available that could run on a new architecture. From the point of view of the workload, there was lots of parallel work: multiple network packets, multiple requests, because these are server applications, and multiple sessions. And the key thing is that in a server environment, throughput is more important than latency: because you're accessing the server over a wide-area network link, it's more important to have multiple sessions and multiple requests being serviced at the same time than to decrease the latency of any single request. If you looked at the type of workload, you would find that exploiting ILP with a complex pipeline didn't give you much benefit, and the cache behavior of the workloads was also pretty bad, so small caches didn't buy you very much. These were the characteristics of the microprocessors of the time, and so the argument was: this isn't giving you a lot of benefit, it's wasting a lot of power, and you could do much better, a factor of ten better at least. So that was the idea
motivating Hydra, and the Hydra idea in turn motivated the Niagara approach. The key idea was high performance per watt, being energy- and power-efficient, with high throughput for commercial server applications: lots of thread-level parallelism, limited amounts of instruction-level parallelism, and very bad cache behavior. So the way you design the chip is with many simple cores instead of fewer complex cores. You throw out all the complex, fancy, sophisticated pipelining techniques, you throw out branch prediction, you throw out multiple issue, and what you get is a simpler processor that burns less power and is simpler to design; it can be done with a few smart engineers rather than teams of hundreds. So the idea was to have a microprocessor with thirty-two threads, and we'll see that those threads were organized as eight groups of four, and then a memory system that could feed all those threads: you want very high bandwidth, both to the shared cache and off-chip to the DRAM. So we founded the company in '99, and three years later the market looked a lot different. We went from the dot-com boom to the dot-com bomb, and the VCs who had promised to fund the company all the way to market with a hundred million dollars decided they really didn't want to do that; they wanted to cash out. At the same time, we had implemented SPARC processors, and Sun's own processor design group was not coming up with processors that were both meeting the deadlines and meeting the performance goals the company wanted. So they decided to buy us, and most of the team moved from Afara to Sun and continued designing the processor. We were basically almost done; we were about to tape out, and
then we got sold to Sun. About a year later, after changing some things, the key thing we had to change being the I/O bus, we came out with a chip called Niagara 1, or the UltraSPARC T1. As you see, it's got eight cores, and each of those cores has four threads. Each of the cores communicates with an on-chip level-two cache; you can see the four banks of the cache at the four corners, and then there are four interfaces to DRAM, so a lot of bandwidth to DRAM, basically a factor of two more than most microprocessors. There's a crossbar in the center, communicating between all the processors and the cache; the first VLSI project I ever did was a crossbar, so the idea of crossbars kind of stuck with me all the way through to this design. One other thing is that we had a single floating-point unit for the whole chip. It turns out that in server benchmarks floating point wasn't that important, so why spend a lot of chip real estate on floating point if it's not an important part of the instruction mix? So let me say a little bit about the performance of Niagara versus the alternatives at the time, and for that let me briefly describe what e-business applications look like in the data center. What you typically have is a three-tier model. The first tier is the web server tier, which generates the web content you see in your browser; there may be static and dynamic web content. The middle tier is the application server tier, which implements the business logic: if you've got a shopping cart, it manages the shopping cart and does the updates. And then
the last tier is the database server tier, which keeps the persistent data: when you finally check out and give the site your credit card number, it records the transaction in the database. At each of the tiers there are different benchmarks: SPECweb2005 for the web tier, SPECjbb2005 for the application server tier, and TPC-C for the database tier. Given that, let's look at some performance. This is a comparison with the Pentium 4, so the scale is relative to the Pentium 4, comparing the IBM POWER5+, the Opteron, and Niagara, which were all contemporary processors. What you see here is that the throughput SPECint_rate number for all four processors is roughly the same, with the other three all beating the Pentium 4 by a bit. This is SPECfp_rate, and here the IBM POWER5+ does really well, the Opteron is on par with Intel, and there's no number for Niagara, since Sun never released one; however, if they had, it's not clear the graph would look much different, because remember, we only have a single floating-point unit. And then on throughput performance you have Niagara doing a factor of three better on SPECjbb, a factor of six better on SPECweb, and a factor of three better on TPC-C. How about performance per watt? The other thing, of course, is that Niagara uses a lot less power than these other processors, and so here you start getting factors of five to ten better in terms of performance per watt on commercial server benchmarks. So we actually did show that this approach gets you the advantages that
we said you could get, and of course Intel and others started to take note once we showed these results. OK, so what happened? Well, time marched on from the nineties, and the tricks that were being used to increase single-thread performance started to reach a plateau. Power consumption became a really big deal, wire delays started to really limit the size of the processors you could design, and the amount of design and verification required for these complex pipelines became quite high. So Intel made what they called the "right-hand turn": they moved away from frequency as performance, and it was "multi" everywhere, multi-threading and CMP. This came from IDF in September 2004. They called it a right-hand turn; I prefer to think of it as a U-turn. They went away from this idea of frequency and started doing things with much shallower pipelines, and multi-threading and CMP were big elements of their designs, as we see now. If we look at the broader picture, we can see the microprocessor trends. The red line is the transistor count of microprocessors over the years, and you see it going up with Moore's law. The blue dots show performance, and you see performance leveling off around 2005. That is driven by the fact that power is leveling off; power leveling off levels off frequency, and frequency was the main driver of single-thread performance. OK, so my group that I brought into Sun, which is of course now Oracle, continues to design microprocessors around the principles we established with Niagara 1. There was the T2, or Niagara 2, which had 64 threads; then there was
the T3, which had 128 threads, so basically we were doubling performance each time. Then, more recently, there's the T4, which goes back to 64 threads, but the CPUs, or cores, themselves become more powerful, to broaden the number of application spaces you could use this sort of chip for. The idea was that Sun decided they didn't want to have multiple microprocessors; they wanted to use a single microprocessor for all their different classes of applications, and so it was necessary to make the cores more powerful. But notice the crossbar is still here, the multiple threads are still here, and the shared cache is still there. Then the T5, which was just announced at Hot Chips, basically again doubles the number of threads, going to 128 threads, but now of course we're running at above three gigahertz. So the idea is that this is now the dominant way of designing processors at Sun, and if we looked at Intel, of course, we wouldn't see designs that are as aggressive as this, but again you would see a lot of the multi-core ideas that we established with Niagara 1. So where are we today? Today we're in a power-constrained world. It doesn't matter whether we're talking about the mobile device in your pocket, which of course is running on a battery and is passively cooled, or about the data center, in which the amount of computing you can do is constrained by the amount of power you can deliver and the amount of cooling you can deliver to get the heat out. And so the issue then is: how do you
design computing systems that are more power-efficient? One of the ways you can do that is by coming up with something more specialized. Specialization means it's not one size fits all; now you have heterogeneity. And there's already heterogeneity today: if you look across the computing space, you'll see multi-core, GPUs, maybe some reconfigurable architectures based on FPGAs, and then of course you'll see clusters based on nodes that contain all three of these types of architectures. What does this mean to the programmer? It means the programmer has a pretty difficult time actually programming these heterogeneous architectures. You have multi-threaded programming models with locks for the multi-core; you have some sort of data-parallel programming model like CUDA for the GPU; you have to actually design hardware to map to an FPGA; and of course, to cross multiple address spaces, you have to fall back on message passing, or maybe a model like PGAS. So what does this mean? If you're an application developer who wants to simulate the future, or deal with the present (either virtually, in a virtual world, or with some application that deals with the real world in real time, in a robotics situation), or maybe you have a huge amount of data created in the past that you want to analyze, you now have to convert your application into these low-level programming models in order to take advantage of a modern heterogeneous parallel
computing environment, and there's a wide gulf between the ideas you have in the application space and what you need to do to get high performance. So the hypothesis that is currently driving my research: it is possible to write one program and run it efficiently on all these machines. OK, so what should that program look like? We think the program should look like a domain-specific language. So what is a domain-specific language? It's a language that is targeted at a specific domain, with operators and data types that match that domain, and abstractions that match that domain, so it's much easier and more natural to write applications using the DSL. The key thing about a DSL is that it's restricted. It's not general-purpose; it can't do everything. However, the restrictions allow it to be more productive in a particular domain. What are some good examples? Those of you who deal with matrices and linear algebra know that MATLAB is a fairly good example of a domain-specific language. If we're talking about manipulating relations in a database, then SQL is another example of a domain-specific language. And if you go to the graphics world, OpenGL can be thought of as a domain-specific language. So what are the advantages of using a domain-specific language to get both high productivity and performance? Well, the productivity idea is fairly straightforward: you shield the programmer from having to deal with these low-level programming models, and you allow the programmer to say declaratively what they want to achieve, rather than how they are going to do it, with all the detailed implementation issues. From the point of view of performance, the key idea is to take abstractions in the domain and map them to elements that we're going to call parallel patterns, which can be efficiently executed on a variety of different
architectures. And the key thing that is going to allow you to do this mapping between domain abstractions and parallel patterns in an efficient and effective way is to restrict what the domain-specific language can do. By constraining what the domain-specific language can do, you make it easy for the compiler of the domain-specific language to do this translation. Furthermore, what you want to do is use domain knowledge to do optimizations that you could never do in a general-purpose programming language, and this is going to give you high performance. And finally, as you move to architectures with more processors or with different types of heterogeneity, what you do is re-implement the compiler rather than change the programs: the applications don't change, just the DSL compiler and the runtime. So here's the picture of programming with domain-specific languages. The idea is that you start with the applications I listed before, and you have a number of DSLs: you might use an R-like DSL for statistics, you might use a graph-algorithm DSL, and you might have a MATLAB-like DSL, one that we call OptiML, which is used for machine learning and is basically based on matrices and linear algebra. Each of these DSLs can be merged together in a single program, so they can interoperate, and each has a DSL compiler that can take programs written in that DSL and translate them to a variety of different architectures, listed at the bottom here. OK, so let me show you some examples quickly. OptiML, as I said, is a DSL for machine learning, based mainly around matrices and linear algebra, and the motivations for the DSL are, as I said, to raise the level of abstraction and allow for domain-specific optimization.
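To make the idea of parallel patterns a little more concrete, here is a minimal sketch (my own hypothetical code, not OptiML or Delite itself) of a restricted vector DSL whose domain operations are expressed entirely through a few patterns. Because users can only compose these patterns, a framework keeps the freedom to execute them sequentially, across cores, or on a GPU.

```scala
// Hypothetical sketch of a restricted vector "DSL" built from parallel patterns.
object VectorDSL {
  type Vec = Array[Double]

  // The only building blocks the DSL exposes: three parallel patterns.
  def map(v: Vec)(f: Double => Double): Vec = v.map(f)
  def zipWith(a: Vec, b: Vec)(f: (Double, Double) => Double): Vec =
    a.zip(b).map { case (x, y) => f(x, y) }
  def reduce(v: Vec)(f: (Double, Double) => Double): Double = v.reduce(f)

  // Domain operations are defined purely in terms of the patterns, so a
  // compiler or runtime is free to choose how each pattern gets executed.
  def add(a: Vec, b: Vec): Vec      = zipWith(a, b)(_ + _)
  def scale(v: Vec, k: Double): Vec = map(v)(_ * k)
  def dot(a: Vec, b: Vec): Double   = reduce(zipWith(a, b)(_ * _))(_ + _)
}
```

Here the implementation is sequential, but nothing in the DSL's interface commits to that: the restriction is exactly what gives the framework room to retarget.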
Another example is Green-Marl, and Green-Marl is used for analyzing large social-network graphs, or graphs that have low diameter. You've all heard of six degrees of separation; well, that's typical of a social-network graph. Another graph you might want to analyze is a movie database, and so you might want to ask questions about the movie database network, such as: is Kevin Bacon really the center of the world? Or, how often do the actors Ben Stiller, Jack Black, and Owen Wilson appear in movies together? These are somewhat whimsical examples, but essentially this notion of analyzing data, what people have been calling "big data," is something that has lots of interest among lots of different constituencies today. In computational biology, in social-network analysis, and in data analysis in general, it's something that is often done, and you can often represent the information in some of these big data sets as graphs. So that's one example DSL. Another example would be work we're doing with a group of people in bioengineering. They have an NIH center for the simulation of biology, and this group wants to simulate biology at multiple levels: they want to be able to simulate how proteins fold, they want to simulate how cells and viruses interact, and they want to simulate how the muscles and the skeletal system interact, for artificial prostheses. Essentially, they've got their own set of DSLs that they have developed, and they also want them to run at very high performance on a variety of architectures, so there are more DSLs being developed there. And finally, recently I've been involved in a research grant to look at high-performance genomics.
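To go back to the Green-Marl example for a moment: the movie-database questions above are exactly the kind of query a graph DSL makes short work of. As a rough illustration, in plain Scala with a made-up data shape (a map from film title to cast), not actual Green-Marl syntax:

```scala
// Hypothetical movie-graph queries; the data shape (film -> cast) is invented
// for illustration and is not how Green-Marl represents graphs.
object MovieGraph {
  // How many films does this exact group of actors appear in together?
  def appearancesTogether(films: Map[String, Set[String]],
                          actors: Set[String]): Int =
    films.values.count(cast => actors.subsetOf(cast))

  // Distinct co-stars of one actor across the whole database.
  def coStars(films: Map[String, Set[String]], actor: String): Set[String] =
    films.values.filter(_.contains(actor)).flatten.toSet - actor
}
```

A graph DSL lets you state such queries at this level while the compiler decides how to traverse the graph in parallel.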
The idea is that there are a bunch of next-generation sequencers that are making the generation of genome sequences very cheap, so you've got huge amounts of data, and you want to analyze it to understand disease and a bunch of other medical things you can do through gene analysis, and again we are developing DSLs for this. So there are lots of DSLs, and the question then is: how are we going to make developing all of these DSLs tractable? The second hypothesis driving this work is that it is possible to significantly simplify DSL development using a common DSL infrastructure. The idea is: why reinvent the wheel for each DSL? Pick out some common elements within each DSL, put those in a framework, and then allow multiple DSLs to be developed on top of that framework. You want to be able to scale the DSL development approach. The way to think about DSLs is this: you are a smart CS graduate, and you want to code up a parallel program. One way to do that is to pick some application and code up the parallel program, but now the knowledge that you've imbued into that parallel program is fixed, and unless somebody wants to do exactly what you did in that program, there's no way for them to reuse your hard work. What we want to do with these DSLs is allow you to use your knowledge to create these high-performance applications in such a way that the knowledge can be reused by other people who want to do similar things. So you want to enable these CS graduates to easily create new DSLs and create reusable knowledge, and the whole goal is that a few smart CS graduates can enable a much larger number of people with less knowledge by creating DSLs. OK, so let me briefly say what the infrastructure looks like. The infrastructure is called Delite,
and the elements that Delite has are, fundamentally, these. I've talked about domain-specific languages; the way we actually implement them is as libraries. So think of these as domain-specific libraries, but they are smart libraries: these libraries can optimize themselves, they can analyze themselves, they can take programs written using these libraries, look across all the library-call boundaries, and do very high-level optimization. So you want to be able to do domain-specific optimization. These parts in blue get done by the DSL developer, the smart CS graduate, and then these parts in red get done by the framework. The framework takes the optimized domain abstractions and applies parallelism optimizations, which are generic across all the DSLs, so you get to leverage this part; then it can map those parallel operations to parallel patterns, and these patterns are meant to run on a variety of different heterogeneous architectures. The result is what we call "abstraction without regret": the notion that you can develop something that has high productivity but also has high performance; you didn't have to give up the performance to get the high productivity. OK, so let me summarize what I've talked about. If there's one takeaway message from the talk, it's this notion of breaking across boundaries. This is the computing stack, and we want to break across the boundaries, because that's the way we're going to get high performance. I've shown you some examples from explicit parallelism, but these are just examples; the same could be done in other areas, other domains. OK, so break boundaries. Breaking across the layers in the computing stack, of course, also means breaking the boundaries between your colleagues in academia,
and then of course, if you want to see your ideas get used in engineering practice, you need to break the boundary between academia and industry. As I said, these are examples from my experience in making parallelism easy, but I think that if you do anything really interesting, you will naturally break boundaries. So thank you. One thing I should do before I end: I have worked with a lot of great collaborators, and some of them are listed here. None of the stuff that I've been able to do would have been possible without these collaborators, advisors and students and colleagues alike, and working with really bright people is one of the hallmarks of being in academia. As I said, none of this would have been possible without them. [Question about how the system goes from a DSL to compiled code.] Yeah, I didn't want to get into the details of the compilation technology. Let me see if I have any more slides. Well, one thing we have is code that you can download, but essentially what we have is an environment that takes DSLs embedded in this programming language called Scala, and, on top of the libraries, you go through a process that we call staging. Staging takes the programs written using these libraries and converts them into an intermediate representation. That intermediate representation is optimized at multiple levels: it's optimized at the level of abstraction of the domain, it's optimized at the parallel level, and it's optimized at the generic Scala level. And then the result is something that can be used to generate code for a variety of different architectures. "This is in Scala?" Yes. "What does it target?" Right, so today it targets multi-core and GPU, and we're on the verge of doing clusters.
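The staging idea just described can be sketched in miniature: instead of computing values directly, DSL operations build an intermediate representation that the framework can optimize before generating code. This toy example is my own illustration (Delite's actual IR and optimization passes are far richer); it shows constant folding and algebraic simplification over such an IR.

```scala
// Toy staged IR: expressions are data, so they can be optimized before
// any code is generated or executed.
sealed trait Exp
case class Const(v: Double) extends Exp
case class Sym(name: String) extends Exp
case class Add(a: Exp, b: Exp) extends Exp
case class Mul(a: Exp, b: Exp) extends Exp

object Staging {
  // A generic optimization pass: constant folding plus simple identities.
  def simplify(e: Exp): Exp = e match {
    case Add(a, b) => (simplify(a), simplify(b)) match {
      case (Const(x), Const(y)) => Const(x + y)   // fold constants
      case (x, Const(0.0))      => x              // x + 0 = x
      case (Const(0.0), y)      => y
      case (x, y)               => Add(x, y)
    }
    case Mul(a, b) => (simplify(a), simplify(b)) match {
      case (Const(x), Const(y)) => Const(x * y)
      case (x, Const(1.0))      => x              // x * 1 = x
      case (Const(1.0), y)      => y
      case (_, Const(0.0)) | (Const(0.0), _) => Const(0.0)
      case (x, y)               => Mul(x, y)
    }
    case other => other
  }
}
```

In Delite the same principle applies at three levels (domain, parallel, and generic Scala), and the optimized IR then feeds code generators for each target.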
[Partly inaudible question.] "How does it compare to MATLAB?" It's better than MATLAB, much better than MATLAB. It's basically on par with coding CUDA by hand. "Would you do the startup again if you could, or would you avoid it? Was it a positive experience?" Yeah, it was positive, absolutely positive. It was positive because of the experience of starting a company. One of the things people ask is how scholarly doing a startup is, and I would say my most-cited paper is a paper that came out of a startup, so it definitely has an impact, and I think we explored ideas that could not have been explored in academia. I think it also gave the initial CMP ideas that we developed in Hydra the opportunity to see the light of day. So yeah, that's definitely something I would do again. Now, having done that, it was great to go back to academia and work with smart students and do whatever I wanted, without the pressure of VCs breathing down my neck, because there is this notion, when you're in a company, that you need to make money. "So what was the low point? Did you ever feel like you wanted to get out?" No, there was never really a low point. I mean, we were stuck in New York over 9/11; that was kind of not so hot, but apart from that external issue, I think things went incredibly well. As my co-founder said (and he'd been a veteran of many companies), he'd never found a group of engineers quite as good as the group we assembled at Afara, and part of that was that it was a neat idea, and part of it was him, because he's a very well-known guy; Les Kohn was his name. Other questions? [Question about Scala.] Scala has a life of its own. It's being used by a number of different
people: it's being used in the financial industry, it's being used by Twitter, it's being used at LinkedIn, and there are a number of people who use Scala. I haven't said much about Scala; for those of you who are not at all familiar, Scala is a language that tries to combine functional programming concepts with object-oriented concepts, and it generates Java bytecode, so it's completely interoperable with Java. It has a lot of nice features that make the sort of thing we want to do possible. I didn't say much about it because I didn't want to go into too much detail, but this whole notion of being able to embed one language inside another requires a very sophisticated type system and the capability to redefine all parts of the language, even things like "for", "if", and "while". You can redefine those inside the libraries, and this is why you've got so much power: you can essentially take in a representation of something that is legal Scala and then completely change the way it operates. I think that's OK, right? I mean, for some programmers, not for every programmer; there are lots of programmers who just think in MATLAB or just think in SQL. [Inaudible comment.] Yeah, exactly: you let them out of that space and they start doing things that, (a), you can't understand, and, (b), don't necessarily lead to performance. You also don't want them to do premature optimization; you want to do optimization based on the particular target they are aiming for. They might start out with some sort of DSP algorithm, and they might initially want to target a DSP, and then maybe they want to retarget to a GPU, and later to some sort of hardware, an FPGA, and the optimizations may change based on what the target is. So you don't want them to encode any extra knowledge that would impede you from making the right trade-offs for that particular target.
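Going back to the Scala-embedding point for a moment, here is a tiny illustration (my own example; Delite's lightweight modular staging goes much further) of why Scala is a good host for embedded DSLs: by-name parameters and operator definitions let an ordinary library introduce what read like new language constructs.

```scala
// Small demonstrations of Scala features that make library-defined
// "language constructs" possible.
object Embedding {
  // A library-defined control structure: the by-name parameter `body`
  // is re-evaluated on each iteration, just like a built-in loop body.
  def repeat(n: Int)(body: => Unit): Unit = {
    var i = 0
    while (i < n) { body; i += 1 }
  }

  // A library-defined operator on a domain type: `+` here is just a
  // method, but call sites read like native arithmetic.
  case class Complex(re: Double, im: Double) {
    def +(that: Complex): Complex = Complex(re + that.re, im + that.im)
  }
}
```

A staged DSL takes the same trick one step further: instead of executing the body or the operator immediately, the library records it into an IR for later optimization.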
"How much do you have to restrict the domain to do that kind of optimization? How much do you have to restrict MATLAB?" Well, so we don't do all of MATLAB; essentially we want to restrict it to basic linear algebra. But we can allow you to do sophisticated sorts of things: you can deal with objects, and you can structure your program using modern structured programming methods. "Is that in the alpha release?" Yeah. And to be clear, I was just using MATLAB as an example; we never intend to actually do MATLAB, because MATLAB is owned by MathWorks and it's not open. [Question about GPU-specific code generation.] Yeah, so we don't do anything that GPU-specific today, but we do generate code. Now that I have all these questions on the DSL compiler, I should have put more information in. What we have is the ability to have kernels for all these patterns and have them generate code for multiple different targets, so we'll generate all the targets, GPU, multi-core, and cluster, and then at runtime we'll specialize for the exact type and number of cores. We don't do specialization for the GPU today, but we potentially could; I mean, it's generic CUDA, it's not specialized to any particular NVIDIA part. [Question about whether this covers both algorithms and data types.] Oh, absolutely. So again,
the way to think about structuring domain-specific languages in Delite is that you need a way of structuring both the computation and the data, and so you need an abstract way of thinking about the data, so that the compiler can optimize the data layout and optimize the locality of your computation. So that's absolutely important. "Wouldn't that require that you actually knew the data itself, not what the layout looks like in an abstract sense, but the actual size of the data?" Well, that would require that you feed in information at runtime, and we currently don't do any optimizations at runtime, besides figuring out how much parallelism gets executed on some given number of processors; we don't do any cache-size optimization. Blocking is something that potentially we would have to do: you would optimize the blocking for the cache size, and we'd have to do that at runtime.
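The blocking idea just mentioned can be sketched concretely: a blocked matrix multiply in which the block size is an explicit tunable parameter. This is my own minimal illustration, not Delite code; in a DSL framework the parameter could in principle be chosen per target, or at runtime once cache and data sizes are known.

```scala
// Cache blocking sketch: C = A * B for n x n matrices stored row-major in
// flat arrays. Working one `block` x `block` tile at a time keeps the
// active data resident in cache; `block` would be tuned per target.
object Blocked {
  def multiply(a: Array[Double], b: Array[Double], n: Int, block: Int): Array[Double] = {
    val c = new Array[Double](n * n)
    var ii = 0
    while (ii < n) {
      var kk = 0
      while (kk < n) {
        var jj = 0
        while (jj < n) {
          // Accumulate the contribution of one tile of A and B into C.
          var i = ii
          while (i < math.min(ii + block, n)) {
            var k = kk
            while (k < math.min(kk + block, n)) {
              val aik = a(i * n + k)
              var j = jj
              while (j < math.min(jj + block, n)) {
                c(i * n + j) += aik * b(k * n + j)
                j += 1
              }
              k += 1
            }
            i += 1
          }
          jj += block
        }
        kk += block
      }
      ii += block
    }
    c
  }
}
```

The result is identical for any block size; only the memory access pattern, and hence the cache behavior, changes, which is exactly why a compiler or runtime can choose it freely.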