My specific research is in the area of irregular parallelism, but today I'm going to take you all the way from the basics: why you want parallelism at all, what regular parallelism is, and then the challenges we see beyond regular parallelism. I want to convince you that no matter what your field is and what sort of software you work on, you should be interested in these techniques and in how we're going to use them in the future.

So why do you want to write parallel programs in the first place? Most people are taught to write sequential programs first. Parallel programming is something you might get taught, a technique you might use every now and again, but it's not something most of us think to do by default when we're programming. It used to be the case that every computer had a single CPU running all of its code sequentially, and it was like that for decades, so it has informed the way we write software, all of our education, our techniques and ways of doing things. We're very used to thinking sequentially, and it's a nice, comforting place to be: everyone knows what they're doing on a sequential system, so we've tended to try and stay there. There were some systems in the past that had two CPUs (expensive servers, some specialist systems), but it's only one extra CPU, so if you ignored it, it didn't matter much, and generally if you had two CPUs it was because you wanted to use them for something specific, so again most people didn't worry about dealing with two CPUs in a single computer.

The problem is that as we've advanced processor technology, adding more and more transistors and making them tinier and tinier inside the CPU, we've used more and more elaborate mechanisms to get more performance out of our processors, and we're hitting a point where we can't put any more power into the system. There's only so much power you can put into a CPU, because we can only draw so much power and we can only get rid of so much heat, and these ever more elaborate CPUs with their tiny transistors draw more and more power. It's stopped working, because we can't shove any more power in to achieve it. This graph shows the number of transistors in processors over time, which just keeps going up and up and up (on a logarithmic scale, mind), but the amount of power we can put into a CPU is limited, and starting around 2004 we hit what we call the power wall.
This has meant that the clock speed of processors can't go up any further, because we can't put any more power through the processor, so we can't push the clock speed any higher. That's the blue line: again, around 2004, the clock speed flattens out. And this means processors aren't getting more powerful in the way they used to. They've got more transistors in them, but in terms of how fast you can actually run your programs, your benchmarks, your applications, you're not getting any more out of a single processor. It's likely we'll still get some performance increase over time, but the key difference is that you're not going to get the exponential increase of Moore's law handing you more and more performance again and again like we used to. We used to be able to build a program however we liked and trust that the next generation of processors would take up the performance slack for us. That's probably not going to happen on the same scale as it used to.

The way around this is to put two processing cores in a single processor. It's different from having two CPUs, because you have the same heat and power budget: it's integrated into a single package, so the cores can share cache and share memory. It's a much more efficient way to get the processing power of two into a single processor. And again, this is pretty easy to deal with: you can simply run one application in the foreground and one in the background, or a couple of background tasks, or you can just ignore it and not really use much of the processor. As this continued, we got more cores: my phone has two cores, as most phones do, and my laptop's got four. There you can run a couple of background applications, perhaps, but you're starting to wonder what you're going to do with the fourth core. As this process continues over time, you can see how the problem gets worse. When you've got 16 cores, which is probably something we'll see in the very near future, you very quickly run out of ideas for things to put on those cores. If your application is running on just one of them, what are you going to do with all the others? They're just taking up space. The key thing as well is that as we add more and more cores in the same space and the same power budget, the individual cores may get less and less powerful on their own. Eventually, some people think, we'll end up with hundreds or even thousands of cores; that's some of the research we're looking at here in Manchester. And if you have thousands of cores, what on earth are you going to do with them all? The danger is that a sequential application ends up running on some tiny little corner of your processor, ignoring all the power around it. And this applies no matter what your application domain is: if you're writing software for modern processors, you're going to end up needing to write parallel code. That's why we want to do it in the first place.

We do already know how to make parallel programs. We've been doing it for 30 to 40 years now; it's not something new, even though people talk about parallelism as if it's come from nowhere. We know lots of good techniques already. All you have to do is structure your program so you've got tasks that can run at the same time as each other: two tasks at least, each running at the same time, doing its own thing, making its own calculations, and you've got parallelism. That sounds really straightforward, really easy.
So instead of one big sequential program, split it up into lots of tasks that run alongside each other; we run them in parallel by putting them on separate cores, and if you create as many of these tasks as possible, we can fill up some of those cores on these potentially huge multicore processors.

A really good example is matrix multiplication. It's really simple, and it's one of those traditional scientific operations used in things like weather processing and large linear algebra systems, the sort of thing we've always wanted parallel computers for, and that's why we're good at it: it's where the money has been in the past. In matrix multiplication, for each cell of the result you use a column from one of the input matrices and a row from the other; that determines the result of one cell. The next cell in the matrix has different inputs and calculates its own value. The key thing to notice is that each cell of the result is entirely independent of every other cell, and therefore we've got tasks that can run in parallel. All we have to do is separate them out and run them on different cores. This is a really good situation, because some problems have very, very large matrices, and you can parallelize them pretty easily (there's a small code sketch of this below). There are still some issues about how you schedule the tasks and where you move the data around, but creating the parallelism in the first place is pretty straightforward.

This only works if the tasks are entirely independent of each other, as they are in matrix multiplication. There needs to be, ideally, no communication between them, and ideally they shouldn't be using the same memory. Each of these tasks needs to have nothing to do with any other task: nothing moving between them, no task stopping another, and then you have as much parallelism as you need. The problem you might find is that you create a choke point by having some kind of shared object. I don't mean a Java object, just any kind of conceptual object. If everyone wants to use it, you may find that every task funnels down into this one object and has to wait to use it. If you're familiar with parallel coding, you might have used locks before, and this is exactly the sort of problem locks cause: things become serialized as they all try to go through this single object. You can imagine that if we schedule those tasks onto different cores, we end up with the problem of having lots of cores and only one being used, all because of one point in your program that everyone's trying to use at the same time. You may think you've got all these tasks, but in reality, if they use a single shared object, you've effectively got one task for a significant portion of your program. Ideally, obviously, don't write software with choke points in the first place, and you can avoid them in lots of scientific applications. But perhaps there are some problems where you can't avoid these choke points, and what are we going to do about those?
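To go back to the matrix multiplication example above for a moment and make it concrete, here's a minimal sketch in Java (my own illustration, not code from the talk) of how the independent cells, grouped here by row, become tasks that the runtime can spread across cores:

```java
import java.util.stream.IntStream;

public class MatMul {
    // Multiply a (n x k) by b (k x m). Each cell of the result depends only
    // on one row of a and one column of b, so the rows of the result can be
    // computed as independent parallel tasks with no communication.
    static double[][] multiply(double[][] a, double[][] b) {
        int n = a.length, k = b.length, m = b[0].length;
        double[][] c = new double[n][m];
        // parallel() hands the independent row computations to the common
        // fork/join pool, i.e. to however many cores the machine has.
        IntStream.range(0, n).parallel().forEach(i -> {
            for (int j = 0; j < m; j++) {
                double sum = 0.0;
                for (int x = 0; x < k; x++) {
                    sum += a[i][x] * b[x][j];
                }
                c[i][j] = sum;
            }
        });
        return c;
    }
}
```

No locks are needed precisely because no two tasks ever touch the same result cell; that independence is exactly what a shared-object choke point destroys.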
It's traditionally been the case that this is what parallelism meant: scientific parallel computers, solving big problems on those sorts of machines. That's what people think of when they think of parallelism, and they think it doesn't apply to them. But in reality, with these massive multicores, that kind of parallelism is potentially coming into your single computer over the next few years as we get more and more cores. You'll be operating the kind of machine that at the moment needs specialist programming and specialist operation, on your own, for your own application. Even if what you're writing has nothing to do with parallelism, you'll have to try to manage and program this sort of computer.

So let's look at a tricky problem, one I'm looking at intensively in my PhD. It's designed to be really awkward, but it's not something I've just picked out of the air; it's a real problem. This is called Lee's algorithm, an algorithm that solves the problem of circuit routing on a printed circuit board. We have lots of black dots where components connect to the board, and we'd like to join them together by building wires; the coloured lines represent those wires. You can see that there are lots of wires and they all crisscross each other. They're allowed to go over the top of each other, though it costs a bit more, so you'd like to avoid that, but generally the wires cross the entire board, and it's a bit of a rat's nest of a problem. If we zoom in, you can see the kind of complexity: if you pick out a single line and follow it with your eyes, you can see it going under other wires and around others. The shape a wire takes depends on the shapes other wires have taken before it. The algorithm for working this out isn't particularly relevant (it's a simple breadth-first search; there's a sketch of it below), but what it points out is the kind of interdependency between the different routes. And this should be ringing alarm bells, because we said we wanted our tasks to be independent. If you think about trying to pick out routes here that are independent of each other, so they can be solved in parallel, you can start to see where the problem is.

If I pick out one route, it goes from one of these black dots, where some component connects, to another. There are already wires on the board, so in order to get the cheapest cost for the wire, it already moves around in a slightly funny way; it's quite hard to work out why it does that from looking at the single wire, because it's the state the program has built up so far that shapes where a particular wire can go. Then, if you have another wire you'd like to solve at the same time, the problem is that it crosses the first one. It goes from where it wants to start to where it wants to finish, looks for the best route, and tries to use the same bit of board. So you've got something that's shared between the two, and if we tried to run them in parallel they'd overwrite each other's memory and we'd end up in a big mess. What we have is a conflict: the two tasks are getting in each other's way at this point in the middle. The really tricky thing is that you don't know where a route is going to go until you've started trying to solve it. All we know at the start is that we've got these black dots and we're going to try and link them up; that's all the information you've got. You can try to use some heuristics, but in practice I've found that routes are so unpredictable that any kind of heuristic is not likely to be useful.
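For reference, here's a minimal sketch of the expansion phase of Lee's algorithm mentioned above: a breadth-first search that labels each reachable cell of the board with its distance from the source. The grid representation is my own assumption for illustration, and the backtracking step that actually lays the wire (following strictly decreasing distances back from the target) is left out:

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Queue;

public class LeeExpansion {
    static final int FREE = 0;  // any non-zero cell is an obstacle or existing wire

    // Breadth-first search from (sr, sc); returns a grid of distances, or
    // null if the target (tr, tc) can't be reached.
    static int[][] expand(int[][] board, int sr, int sc, int tr, int tc) {
        int rows = board.length, cols = board[0].length;
        int[][] dist = new int[rows][cols];
        for (int[] row : dist) Arrays.fill(row, -1);  // -1 means unvisited
        Queue<int[]> frontier = new ArrayDeque<>();
        dist[sr][sc] = 0;
        frontier.add(new int[]{sr, sc});
        int[][] moves = {{1, 0}, {-1, 0}, {0, 1}, {0, -1}};
        while (!frontier.isEmpty()) {
            int[] cell = frontier.poll();
            if (cell[0] == tr && cell[1] == tc) return dist;  // reached the target
            for (int[] mv : moves) {
                int r = cell[0] + mv[0], c = cell[1] + mv[1];
                if (r >= 0 && r < rows && c >= 0 && c < cols
                        && board[r][c] == FREE && dist[r][c] == -1) {
                    dist[r][c] = dist[cell[0]][cell[1]] + 1;
                    frontier.add(new int[]{r, c});
                }
            }
        }
        return null;  // no route exists
    }
}
```

The search itself is simple; the point for this talk is that every route reads and writes the same board, which is exactly the big shared object problem described next.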
So you end up with the problem that the thing the tasks conflict over, the thing they get in each other's way with, is basically anywhere on the board, and you can see instantly that this massive shared object causes exactly that problem: a conflict, with everyone trying to go down through the same object to use it. So it looks like we can't find any parallelism. This is a really annoying problem, because thinking about how the algorithm works, it should be really easy: there should be loads of routes you can solve in parallel, and if you knew what the solutions of the routes were, you could make it parallel really easily. But you don't know that beforehand, so you can't solve it that way. The entire board ends up being one big shared object, and no matter how many tasks you create, they end up going through one choke point, or in reality a few choke points, but still few enough to really constrict how much parallelism we get. This is what we'd like to try and avoid. What we're seeing is an irregular problem: we can't divide up the shared resource, so it stays as one big shared resource. That contrasts with the regular parallelism of things like matrix multiplication.

So how can we tackle this? I'm using this quote as inspiration. This is Admiral Grace Hopper, a famous US Navy programmer. Some people say she invented the term 'debugging', but that's debatable. What she said was: it's easier to ask forgiveness than it is to get permission. She was actually talking about US Navy politics, but you can take the same sentiment and apply it to solving irregular problems. It's easier to ask forgiveness than to get permission: we'll assume that tasks are not going to get in each other's way, and if they do, then we'll do something about it. That's quite an unusual thing to say in computer science, where we'd like everything to be deterministic, to know what's going to happen. Here we just say things can get in each other's way, it's not an issue, and when we find there's a problem, we'll sort it out. Now, that sounds like a little bit of magic. The idea is that these shared objects can be accessed by every task; they can all go through them, and instead of saying they all have to stop for one to go through, we let them all go through willy-nilly, see what happens, and sort out any problems afterwards. If two tasks conflict with each other, if they try to access the same state as each other, then we need to do something about it: we'll stop one of the tasks, forget about it, cancel it, and then run it again later, so it's got nothing to compete against. That takes a little bit more time, but we still get loads of parallelism out of most of the tasks, and a bit less after that. That's not too bad; it's a lot better than having everything queue up behind one shared object.

That brings up two questions, because it does sound like magic. How can you tell when one task gets in the way of another, if you're just letting them write to shared state however they like? And how can you cancel a task that's already running? You've already drawn half your route onto the board; if you try to take that away somehow, are you going to get in the way of other tasks by deleting it?
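Setting those questions aside for a moment, the bare control flow of this 'ask forgiveness' approach looks something like the following sketch (the Task, Transaction and ConflictException names are purely illustrative, not a real API from the talk): run the task speculatively, and on conflict throw the attempt away and run it again later.

```java
// Illustrative skeleton of optimistic execution; not a real library API.
interface Task {
    void run(Transaction tx) throws ConflictException;
}

class ConflictException extends Exception {}

class Transaction {
    // In a real system this would buffer the task's reads and writes;
    // see the transactional memory log sketch further on.
}

class Optimistic {
    static void runUntilCommitted(Task task) {
        while (true) {
            Transaction tx = new Transaction();
            try {
                task.run(tx);  // do the work speculatively
                // here the transaction would be committed to real memory
                return;
            } catch (ConflictException e) {
                // Another task got in the way: drop this attempt entirely
                // and loop round to run the task again later.
            }
        }
    }
}
```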
One solution to these problems is transactional memory. The idea is that instead of writing to memory, we're going to write to a log. Rather than all our reads and writes going straight to memory, we write them down in a log somewhere: we record what we've read, what we've written, and what the values were, and we store that in a separate data structure. Then we can tell whether two tasks are getting in each other's way, whether they're conflicting, simply by comparing the logs: search through the logs to see if they touch the same places. And the great thing is that we can cancel a task by simply throwing its log away, forgetting about it, pretending it never happened, because nothing has actually happened to memory. All we have to do at the end, once we decide a task is allowed to finish, is finish up by writing its log out to memory.

So say we have two tasks, these two routes again, running at the same time. It looks like they're going to conflict, but we don't know that at the start. Instead of writing their results into memory, they store them in a log somewhere, which says what memory accesses they made and what the values were. We can then search the logs and find that, wait a minute, these two tasks have finished their routes, but they've actually used the same piece of memory. Therefore we're going to have to throw one of the tasks away: just forget about it, scratch it off, throw away its log, and let the other one write the data it produced to memory. That way the tasks are stopped from interfering with each other, and we can rerun the cancelled one later, at some other time, when hopefully it won't conflict with another route. If the routes don't conflict, as with two routes that are never likely to get in each other's way, they can both do their calculations, write to their logs, and simply finish off by writing both logs to memory. They've run in parallel without getting in each other's way, so they're simply allowed to go ahead and write to memory at the end with no problems. So that's great, we've solved the problem.

And transactional memory is a real technique being applied today. It's moving out of research, where it's been for the last decade or so, and into production systems. There's transactional memory support in C, C++ and Java. We're working on transactional memory here at Manchester: we've written a library for Scala which implements transactional memory. In languages like Clojure and Haskell, and in functional languages more generally, there tends to be better support for transactional memory, because they control their side effects better, so they're easier to manage in this sort of way. And coming out in the next year or so is going to be a real hardware implementation of transactional memory from Intel, in their new Haswell processor (Haswell has nothing to do with Haskell; it's just similar in name). Your processor will actually understand the concept of redirecting memory accesses to a log rather than writing them straight to memory, and it's going to use the cache system to achieve that effect, so hopefully it'll be nice and fast, nice and simple to use, and it'll be in our hardware for real. The hardware people gave us all these cores without us really being sure how to program them, so it's nice that they're now coming back to us with hardware techniques to help us program them in the future.
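Pulling the log idea above together into code, here's a toy version of such a log, just to make the mechanism concrete. This is a minimal sketch assuming memory is a map from addresses to integers; a real software transactional memory is far more careful about versioning and atomicity:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Toy transactional log: reads and writes go to a private log instead of
// shared memory; conflict detection is comparing logs; aborting is throwing
// the log away; committing is copying the write log out to memory.
class TxLog {
    final Map<String, Integer> readLog = new HashMap<>();
    final Map<String, Integer> writeLog = new HashMap<>();

    int read(Map<String, Integer> memory, String addr) {
        if (writeLog.containsKey(addr)) return writeLog.get(addr);  // our own write
        int value = memory.getOrDefault(addr, 0);
        readLog.putIfAbsent(addr, value);  // remember what we observed
        return value;
    }

    void write(String addr, int value) {
        writeLog.put(addr, value);  // nothing touches real memory yet
    }

    // Two tasks conflict if one wrote a location the other read or wrote.
    boolean conflictsWith(TxLog other) {
        return overlaps(writeLog.keySet(), other.readLog.keySet())
            || overlaps(writeLog.keySet(), other.writeLog.keySet())
            || overlaps(readLog.keySet(), other.writeLog.keySet());
    }

    private static boolean overlaps(Set<String> a, Set<String> b) {
        for (String key : a) if (b.contains(key)) return true;
        return false;
    }

    // Finishing up: publish the buffered writes to real memory.
    void commit(Map<String, Integer> memory) {
        memory.putAll(writeLog);
    }
    // Aborting needs no code at all: just drop the TxLog object.
}
```

Committing one route's log while discarding the other's is exactly the resolution described above for the two conflicting routes.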
It's not only about transactional memory; there are other interesting ways to reverse computation. For example, you can do things at the level of semantic, mathematical operations. If you know you added x to a value, you can reverse that computation by subtracting x. If you know you inserted a node into a graph, you can reverse that computation by removing the node from the graph. This is where it gets really interesting for you in your own specific domains, because you know the right way to reason about the semantics of the operations within your domain: if you work in graphics, you know which graphics operations can be reversed, and a lot can. This is what my research is in. I'm trying to find higher-level ways to be intelligent about how you reverse computations, how you find out when things conflict, and how you stop the conflicts happening in the first place. Transactional memory can be a bit of a blunt tool, because it works at the machine-word level, at the level of raw memory reads and writes, so I'd like to use some of these higher-level techniques.

There's some fascinating research from the past here. In the 80s there was a system with an absolutely amazing name: the Jefferson Time Warp system. This was a system for reversing computations made across a network, where you're normally sending packets of messages around. They invented a system of anti-messages: when an anti-message collides with its original message, the two destroy each other, and more anti-messages are produced to delete the messages that the original had caused to be sent. These anti-messages spread throughout the distributed system, reversing the computation. It's actually fantastic, and it's got a great name.

So do we have a solution? It sounds like we've got a system that's going to sort out our irregular problems really well. Well, there are lots of downsides. Transactional memory can be slow: in the past, the software implementations have been prohibitively slow, and while they're getting a lot better, redirecting all these reads and writes turns each read or write instruction in the processor into hundreds or thousands of instructions. It can be a lot of work. The hardware will probably be quite limited too: a hardware implementation is always going to be restricted in how much work it can cover, with buffers of limited size, so perhaps hardware support won't really give us the ability to solve these larger problems. Generally, perhaps transactional memory isn't the magic bullet people thought it was. My own belief is that we need to apply a lot more knowledge about the problems we're solving in order to apply an approach similar to transactional memory at a higher level, and that's where we need people in their own domains: in your science, you know what sorts of operations you can reverse and what you can't. And optimistic execution in general can be wasteful. We said this whole problem started when we ran out of power, and here we are doing computation that we simply throw away at the end. There's a bit of a contradiction there: do we want to save power or don't we? In general it turns out to be a trade-off: it's OK to waste this amount of work if you can show that overall it produces a solution that's more parallel.
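As a sketch of what that higher-level reversal might look like (the Counter and Graph classes are my own illustrative examples, not from the talk), each operation records a single semantic inverse instead of a word-by-word memory log:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

// Reversal at the semantic level: each operation logs one inverse action,
// rather than logging every machine-word read and write.
class SemanticUndoLog {
    private final Deque<Runnable> inverses = new ArrayDeque<>();

    void addedToCounter(Counter c, int x) {
        inverses.push(() -> c.add(-x));       // adding x is undone by subtracting x
    }

    void insertedNode(Graph g, String node) {
        inverses.push(() -> g.remove(node));  // insertion is undone by removal
    }

    // Cancelling a task replays its inverses in reverse order.
    void rollback() {
        while (!inverses.isEmpty()) inverses.pop().run();
    }
}

class Counter {
    int value;
    void add(int x) { value += x; }
}

class Graph {
    final Set<String> nodes = new HashSet<>();
    void insert(String n) { nodes.add(n); }
    void remove(String n) { nodes.remove(n); }
}
```

One log entry per operation, instead of one per memory access, is why (as comes up in the questions below) the common case can stay cheap.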
Irregular problems are billion-dollar questions, they really are. They're the kind of problems that Facebook and Google work on all the time. Take computer graphics and physical simulations: a regular simulation might be less precise, because it's constrained to be uniformly divided everywhere, whereas with irregular computations you can divide things up further and get more precision in the particular areas where you want it. A weather simulation that's more precise over one particular area of the country is an irregular problem, and as we start to tackle those, we might get better at achieving that sort of thing. The web and social graphs, now these are hundred-billion-dollar questions. The kinds of data structures that Facebook and Google operate on are very irregular graphs, with nodes being added and removed, people changing around, things changing the whole time, and lots of computations wanting to operate on that same data structure at the same time. Really complicated problems, and they're irregular, and they're some of the biggest problems these companies face. There are also lots of irregular problems in machine learning and data mining, both big areas with lots of money and research being spent on them, here at Manchester as well. So perhaps there's something we can do to help you parallelize your problems when you find yourself with a 32-core processor and your sequential program isn't running very fast any more. Thanks very much. Any questions about my work or irregular parallelism?

Irregularity is a sort of pathology, something that goes wrong in problems, and it's quite a hard concept to define. It's hard to say in general that an algorithm or data structure is regular or irregular. People have tried to do that, because it would be useful, but generally it's a set of problems that arise. Having shared state generally gives you an irregular problem; having an irregularly shaped graph often gives you an irregular problem. Google, with their web graph, have a massive data structure that doesn't fit on one computer; bits of the computers it's running on are constantly failing; they're trying to update it at the same time as they're trying to read from it; it's constantly changing because so many people are using it at the same time; and it gets into an inconsistent state if they're not careful. Those are the sorts of things that build up into this pathology, and important problems like these can be solved using these sorts of irregular techniques.

You talked about transactional memory solving these sorts of problems; do you know of any big companies who have decided to try and use transactional memory? Those larger companies tend to be using higher-level abstractions. Google produce lots of their own systems: they produced MapReduce, which tries to tackle some of these problems, but MapReduce is quite a regular system, everything gets put through in big batches, and they've realized the problems with that. There are later systems you may want to look up that are really interesting, such as Percolator or Pregel, for processing these problems, and some of them have similar ideas about optimistic execution, about solving the problem of lots of little things wanting to change a big data structure and how you handle that. So Google and Facebook take the approach of making their own domain-specific solutions for their big problem domains.
That's because they've got lots of money to do it and they've got the experts. Somebody trying to write a machine learning algorithm who doesn't know anything about parallelism is going to hit these problems as well when they get themselves a 32-core machine and try to use any reasonable proportion of the parallelism available in it. And they'll want to, because potentially each core is going to get less performance, or at best stay the same. Performance will improve a lot more slowly than it used to. We were used to our processors getting faster and faster and solving all our problems for us; that's not likely to happen any more.

You said that companies like Google are dealing with these problems through abstractions; don't you think they have the money to put it into hardware, which would get rid of them? Well, Google like to use commodity hardware and build their own systems on top of it. When Haswell comes out, it's possible that within each compute node people like Google might want to use transactional memory, but you can't easily use transactional memory across distributed systems; that's a much more complicated problem. People often find they like to use regular parallelism between compute nodes and solve problems within a compute node using irregular parallelism, because a distributed system is a bit more regular, whereas once you've got shared memory you find more of those pathological cases where you'd like to use techniques for irregular parallelism.

In your research you spoke about lifting this concept of transactions to a more abstract layer, using a more abstract notion of undoing operations. It strikes me that these operations are going to be more expensive at a higher level; do you think you're just moving the expense upwards? I think it'll actually be a lot cheaper, because when you do things at the machine-word level you log a lot of stuff you're not actually interested in: temporary variables, writes to temporary data structures, all of that gets logged and you don't need it. If you can record just the essential information in the log, 'I inserted a node here, it needs removing afterwards', then logging is very easy; all you have to do is put one entry in the log. Then, in the rare cases when things get aborted and you have to undo them, you can worry about the cost of that operation. Typically we expect things not to conflict: in the Lee's algorithm circuit routing example, the routes don't often conflict, it's just that they always might. So you'd like to have as little information in the log as possible; you almost write it and forget about it. 'I inserted a node here, remember that for later' is a lot easier than instrumenting every single read and write the processor does, so it should be faster if we log at a high, semantic level. It's about the common case: we'd like the common path to be cheap, and that matches the whole philosophy here of doing things the easy way, forgetting about the problems, and dealing with the expensive case only when there actually is a problem. Because if you're undoing an operation, you've already wrecked your cache, you're already starting again, so it doesn't particularly matter if that part is expensive.

Any other questions? Thanks so much, Chris.