I spent a lot of time handing over my last job to the people, who I feel very sorry for, taking it on, and so this is not as complete as it might be. What I want to talk to you about today is something that's very interesting to me for lots of reasons, and two particular takes on it. My interest for the last few years, although probably not for the next one or two, has been in concurrency and parallelism. In particular, how we write good software to do that, which is known to be a particularly difficult and error-prone area of software development. It's one where we've had lots of people creating good ideas for making it easier, some of which have really taken off, many of which haven't. Often people choose to use the more error-prone ways of writing concurrent and parallel systems rather than the easier ways.

One good example of this is this little chip, which some of you will recognise as a transputer. This was created in the 80s, I think, by Inmos, led by Tony Hoare, who's a very well-known computer scientist. The programming language that came with this is called occam, and it was deliberately written to be a really safe, easy-to-understand, nice-to-use programming language for concurrency. After that, it seemed that a lot of the lessons of occam were rather sadly lost. But there's good news on the horizon, because people have since done a lot of other things and then come back to look at the work that was done in the 1970s and 80s. There are a lot of new and interesting pieces of hardware looking at multi-core and many-core ways of doing processing, where this sort of technology, and the maths that went behind it, might be really interesting and really applicable.

A couple of years ago, this board came out. This is a development board by a company called Adapteva, which some of you may know about. What's interesting about this is that it has two chips on it. One is a dual-core ARM chip, which is what you'd expect, I guess. The other is a co-processor, and that's the silver chip on the board there, on the left. That co-processor has 16 cores, and it has some interesting features. It's a little bit like the idea of the transputer before it, where you would put lots of these cores together in a big grid, and they would all be able to talk to each other, but they would be running their own software independently. That's sort of, but not quite, what the Epiphany chip does. The whole board is called a Parallella, the co-processor is called an Epiphany, and I will probably mix them up in a very confusing way, so shout at me if I do.

Why is this important? Well, one reason is that processors have really come to the end of what the manufacturers can do to make them much, much faster. This data set is looking at how clock frequency has increased over time, and it's reached a plateau.
If you want your code to go faster now, you've either got to write your software in a different way, or you've got to have lots of machines to run your software on, or lots of cores within your machines, or maybe use something like your GPU, but basically you've got to find ways of making your software faster other than just hoping that a chip manufacturer will bring out a better chip for you and you won't have to do anything. So that's one issue that's very current at the moment.

The other is that the dominant cost of computing is an interesting thing to look at. Hardware gets cheaper and cheaper and cheaper, and the cost of hardware is not a particularly dominant cost anymore. The cost of labour obviously is always going to be a dominant cost: you'll always need humans to write code and make hardware and do different things, and that's not something you can do a great deal about, except possibly persuade people to work for cheaper wages, but maybe that's not something that we as technologists would favour. Another dominant cost for a lot of people is electricity and utilities. So if you run one of these big data centres or a big supercomputer, the cost of your utilities is a very big deal to you, and anything you can do to bring down the cost of those utilities is going to be of great benefit to you. So another interesting feature of the Epiphany chip is that it runs on very, very low power. That's obviously nice for people who are interested in embedded systems, but it's also a very nice thing for people who are interested in computing generally to think about, particularly if you're doing work that requires some sort of parallelism or concurrency. Possibly the Epiphany is one of a new style of chip where parallelism is important, but power costs are important too.

So, packing code into that small space. I mentioned the less-than-two-watts figure there. What I didn't mention, and should have done before pressing next, was the amount of RAM on that co-processor. One of the things that's interesting from a software point of view about the Epiphany is that it has a very, very small amount of memory per core, and also it has no sense of memory protection. With the transputer you would have a whole grid of transputers, all able to talk to each other in a grid format, with north/south/east/west connections, but they had their own memory, and they couldn't corrupt the memory of another chip. That's not the case with the Epiphany. The Epiphany has a flat memory map, so any core can use or look at the memory of any other core. There may be different costs to those look-ups and so forth, but it can be done. And there's also a bit of general-purpose memory that all the cores can access. So that's a very different memory model. It's a memory model that allows the programmer to impose whatever programming discipline they like. So if you're like me and you think, ah, CSP, it's the best thing in the world, yay, you can write your code in that style; but if you prefer some other style, bulk synchronous parallelism or whatever it is, you can impose that style too. And there are people who've written really nice libraries to do that sort of thing already. So that's really neat, but it poses all these challenges. How do we get a reasonable amount of code and a reasonable number of algorithms into that small space? That's one.
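As an aside, here is a minimal sketch of how that flat memory map works, as I understand it from the Epiphany architecture reference: each core's local memory starts at local address 0x0, but it is also visible to every other core at a global address whose top bits encode the core's position in the mesh. The field widths and the example core coordinates below are my reading of the documentation, not something shown in this talk.

```python
# A toy sketch of the Epiphany's flat memory map: each core's local
# SRAM starts at local address 0x0, but it is also visible to every
# other core at a global address whose top 12 bits are the core's
# mesh coordinates. Field widths and coordinates are assumptions.

def coreid(row, col):
    """Pack a 6-bit row and 6-bit column into a 12-bit core ID."""
    assert 0 <= row < 64 and 0 <= col < 64
    return (row << 6) | col

def global_address(row, col, local_offset):
    """Translate a core-local offset into a flat global address."""
    return (coreid(row, col) << 20) | local_offset

# If core (32, 8) is the first core in the grid, a buffer at local
# offset 0x2000 on that core is visible to its neighbours at:
print(hex(global_address(32, 8, 0x2000)))  # 0x80802000
```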
And the designers at Adapteva had some neat ideas about packing code into that small space. One was that they do not have a fixed-width instruction set. Normally you'd say, well, this chip has a 32-bit instruction set. The Epiphany has 32-bit instructions, but it also has 16-bit instructions, and the compiler is relatively smart in that if it can fit your instruction into 16 bits, it does. Which is really nice, and makes it a lot more difficult to simulate. But hey, so that's neat.

There are other little tricks and neat things in the Epiphany to look at too. One is loading numbers into registers. We've got 32-bit instructions, and the registers, I think, are 32 bits wide, so a full 32-bit value won't fit into a single instruction alongside its opcode. And because there's a flat memory map, the registers are not registers in the sense that they're some special bit of hardware somewhere or other; the registers are just part of the memory, the RAM, that each little Epiphany core has available. So suppose you want to load a full 32-bit value into a register. The way the compiler generates this is rather neat: it uses two instructions. The first is a move immediate, MOV. It says that the number we want to move into the register is encoded in the instruction that's given to the CPU; that's the 32-bit form of the instruction, which is what this 32 says, and it loads the low 16 bits into the register. Then we want to load the other 16 bits into the same register, so we can use MOVT, and that moves the higher 16 bits in. (There's a little sketch of this pair below.) So there are some really interesting and unusual artefacts to the Epiphany ISA, and to the way that it's built, that make it a really intriguing architecture to look at. Those are the things that make it tricky.

But I want to say a few things about simulators in general. If you use the Epiphany and you want to write software for it, you will use the Epiphany SDK, which I think was written mostly by Jeremy Bennett's group. Is that right, Andrew? Yes, his company. And that comes with a simulator for the Epiphany, and that simulator is part of GDB. If you look at the source code for it, unless you're really familiar with GDB and the GNU toolchain, it's quite a tricky piece of source code to read and understand if you're interested in these things. Simulators for hardware are often written in C. They're often written in low-level languages, I guess because people write software for hardware in low-level languages. And they're often large pieces of code, and in my view quite tricky to understand. In my view, C is tricky to understand no matter what you're writing with it, to be fair. So this is something where a software person would say, hmm, maybe there's an easier way to write these sorts of simulators.

And a recent piece of work by a group at Cornell has looked at exactly this. Their view was that simulators for instruction set architectures and for CPUs are very much like interpreters for programming languages. If you use a programming language like Python or Ruby or PHP, or JavaScript in your browser, or Scheme, that code is not compiled; it's run instruction by instruction by an interpreter that's sitting underneath, looking at each bit of code as it comes past. And that's exactly the sort of thing that the CPU does: it fetches a bunch of instructions from somewhere in memory.
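Here is the little MOV/MOVT sketch promised above: a toy model, in plain Python rather than anything from the SDK, of what the two halves of that immediate load do. The register file and helper names are mine, purely for illustration.

```python
# A toy model of loading a full 32-bit immediate in two halves, the
# way the compiler does it for the Epiphany. The register file and
# helper names are illustrative, not the real instruction set.

regs = [0] * 64  # the Epiphany has 64 general-purpose registers

def mov_imm(rd, imm16):
    """MOV immediate: write a 16-bit value into the low half of rd."""
    regs[rd] = imm16 & 0xFFFF

def movt_imm(rd, imm16):
    """MOVT: write a 16-bit value into the high half of rd,
    leaving the low half intact."""
    regs[rd] = (regs[rd] & 0xFFFF) | ((imm16 & 0xFFFF) << 16)

# Loading 0xDEADBEEF into r0 takes two instructions:
mov_imm(0, 0xBEEF)   # r0 = 0x0000BEEF
movt_imm(0, 0xDEAD)  # r0 = 0xDEADBEEF
assert regs[0] == 0xDEADBEEF
```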
The CPU runs those instructions in some way, does something with its memory and its registers and so on, and then moves on to the next bit. So these two things are really very similar. The Pydgin people at Cornell thought about this, and looked at ways to use the technology that underlies interpreters to create a simulator.

One of the issues with that is that, as anybody who uses interpreted languages knows, they have a reputation for being a little bit slow. If you have a very large piece of software that you want to simulate for the Epiphany, you probably don't want that to happen really, really slowly, because simulation is something that you want to happen quickly; it's probably something that you're doing as part of your prototyping. So what this piece of work does is take some of the very modern technology that goes into making interpreters fast, which is JITing. JITing is just-in-time compilation, and the idea is that if your code is running in a loop, then the first few times around the loop you might run that code in the normal way, and it's going to be a bit slow, because you're in an interpreted language and you're doing all this extra work at runtime. But if you know that that code is going to run again and again and again, you might as well compile it down to native code and run it natively. So you're actually doing a little bit of compilation on the fly.

The outcome of this Pydgin work is a framework with which you can write simulators for any instruction set architecture, easily and simply, in something that looks a little bit like Python but is subtly different (RPython), and then compile that into a simulator that contains a JIT. You don't write the JIT, because writing a JIT is a difficult thing, and getting it really good in terms of its performance is particularly tricky. But you have in your toolchain a piece of code that gives you a JIT, gives you a garbage collector, gives you all the sorts of difficult things you need to create an interpreter. And then all you're writing is the instructions that say: if I see a bit of binary that's all zeros, that means jump, or that means do whatever.

So what I want to do, and I'm kind of most of the way through it now, is to write a simulator for the Epiphany that does this. The guys who wrote Pydgin already have one for some of the ARM chips, and they've already got one for SPARC. But the Epiphany is particularly interesting to look at because of all those things that I said before: the mixed-width instructions, the tricks that it does to compact the code, and so forth. And this is small. When I say small, I'm thinking less than a thousand lines of code. So far there are less than a thousand lines of code to implement all the instructions in the Epiphany ISA, apart from a couple of the multi-core ones, which I haven't done yet. In my view, that's quite a win. That's much better than the sort of enormous, hairy C simulators that we've had in the past.

I don't want to show you tons and tons of code in this talk, because I thought that may not be welcome, shall we say. But I wanted to show you just a little bit of code, so that you can see the style in which a simulator like this is written. The idea behind the Pydgin framework is to make it as easy as possible to do this.
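To give a flavour of the shape of that, here is a cut-down mock-up of my own of the "this bit pattern means this behaviour" idea. To be clear, this is not Pydgin's actual API, and the encodings are invented; it just shows how little you write when a framework does the fetch/decode/execute plumbing for you.

```python
# A toy fetch/decode/execute loop in the style of a Pydgin-like
# framework. The decorator registers "this bit pattern means this
# behaviour" rules; the encodings below are made up for the example.

encodings = []  # list of (mask, match, execute_fn)

def instruction(mask, match):
    def register(execute_fn):
        encodings.append((mask, match, execute_fn))
        return execute_fn
    return register

class State(object):
    def __init__(self, mem):
        self.mem = mem          # instruction words, keyed by address
        self.rf = [0] * 64      # 64 general-purpose registers
        self.pc = 0

@instruction(mask=0xF, match=0x0)
def execute_nop(s, bits):
    s.pc += 4                   # do nothing, step to the next word

@instruction(mask=0xF, match=0x8)
def execute_jr(s, bits):
    s.pc = s.rf[(bits >> 4) & 0x3F]  # jump to the address in a register

def step(s):
    bits = s.mem[s.pc]
    for mask, match, execute in encodings:
        if bits & mask == match:
            execute(s, bits)
            return
    raise ValueError("unknown instruction: %#x" % bits)

# Run a two-instruction program: a NOP, then a jump via register 1.
s = State({0: 0x0, 4: 0x18})    # 0x18: opcode 0x8, register field 1
s.rf[1] = 0x100
step(s); step(s)
assert s.pc == 0x100
```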
And in particular, the thought that the authors had was that you could almost take the reference manual for the instruction set architecture, copy bits of it out, make it proper Python syntax, and that would do for your simulator. So that's the idea here. What we have in this piece of code is the small amount of code that implements four jump instructions. There are two different sorts of jump instruction in the Epiphany ISA: one just jumps, and the other saves the current program counter in a link register before it jumps. And there are two different versions of each of those, because there's a 32-bit version and a 16-bit version as well. What you're seeing in this yellow bit of code is my copying and pasting from the Epiphany reference manual; this is what the reference manual says these four instructions should be doing. So the JALR instruction saves the return address in the link register: for the 16-bit version that's the program counter plus two, the next instruction along, and for the 32-bit version it's the program counter plus four. And then we've got these little bits of code which just do the actual saving. What's the time? Good. Okay. Does everyone think that looks relatively straightforward? Or am I completely crazy? Awesome.

So, benchmarking. I feel like it's my mission in life at the moment to talk to people about benchmarking. It seems to be something that comes up again and again and again. I've heard people say about the Epiphany: you have 16 cores on your Epiphany chip; your Epiphany chip runs at the same clock frequency as the ARM chip that sits right next to it on the Parallella; therefore, your algorithm will run 16 times faster on the Epiphany than it does on the ARM. Okay. I'm glad you're already upset about this. Obviously that's not the case. If you want to look at the speed-up of your code, you want to look at the speed-up of the whole thing, from start to finish, and if you're going to push some of that out to a co-processor, whether it's the Epiphany or a GPU or something else, you've got the time taken to transfer the data across and back. You've got time that's going to be taken by all the Epiphany cores talking to each other, by writing to each other's memory spaces, or whatever it is they do in your particular algorithm. And because the Epiphany has a very small amount of memory available, it's probably actually going to be slightly worse than that if you've got an algorithm that needs a lot of data to process. So you're probably not going to get 16 times faster. You're probably going to get something much slower, which may be fine for your particular use case, but all these things count and all these things matter.

One of the interesting things for me, because in the job I've just left I taught this sort of stuff every year, is that one of the things we did with our students was to give them code that's just serial code, straightforward code that runs on a CPU, and then get them to improve the speed of it by using a GPU, or using a cluster and some MPI programming. Very often what happens is that the students start by writing a naive translation of that serial algorithm, splitting up all its loops into one iteration per machine or whatever it is, and they end up with a slower piece of software than they started with, because it's actually not that simple to partition these things properly. So, according to Amdahl's law, you'll get at best something just under 16 times, but almost certainly a bit less.
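To put numbers on that last point, here is the Amdahl's law arithmetic; the parallel fractions below are illustrative assumptions, not measurements from the Epiphany.

```python
# Amdahl's law: if a fraction p of the work parallelises perfectly
# across n cores, the best possible overall speedup is
# 1 / ((1 - p) + p / n).

def amdahl_speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)

# Even a very parallel program tops out well short of 16x on 16
# cores, and this still ignores the transfer costs discussed above.
for p in (0.50, 0.90, 0.95, 1.00):
    print("p = %.2f -> %4.1fx" % (p, amdahl_speedup(p, 16)))
# p = 0.50 ->  1.9x
# p = 0.90 ->  6.4x
# p = 0.95 ->  9.1x
# p = 1.00 -> 16.0x
```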
But the main message I want to give you about benchmarking is: don't believe benchmarks. Even mine. Don't believe them. They're never right, and there are some fascinating publications on how badly they're wrong and why. This is a fantastic paper, a really readable one; even if you don't read academic papers, read this one. It's really nice, it's really easy to get on with, and it shows the effects on benchmarks of two really simple things that you would not have thought would affect a benchmark.

One of them is bytes added to environment variables. This is basically saying: if I log in as me, I have my environment variables, which might be however big; you log in as you, and you're going to have yours, which might be a different size; and if we both run the same benchmark, we will get different results. So this is changing the number of bytes in the environment and running the same benchmark, and getting, as you can see from those two outliers, vastly different results. This one here is link order, which you may not be able to read at the back. This is the default link order, which is whatever whoever wrote the benchmark put in the makefile; this is alphabetical; and these are a bunch of random orderings. So changing the linking order of your object files changes your benchmark results drastically. These are good reasons not to believe benchmarks. There's also been some great literature on GPU programming: when GPUs first came out, people said you'd get a billion times speed-up if you used a GPU. Well, they didn't quite, and it turns out that a 10-times speed-up is a really good speed-up for a GPU.

So don't believe anything at all, and if you're running your own benchmarks, think hard about how you make them solid and reproducible. It's really not an easy problem to solve. Things like always using the same linking order and always using the same environment are great, but so are things like disabling all the clever tricks that your OS plays. One thing Linux now does is put your stack at a random point in memory; this is to prevent people doing stack smashing, and it prevents those security issues. That's lovely if you want to prevent security issues; it's really not lovely if you want to take a benchmark, because your data becomes very fragile. And if you think, well, this probably doesn't affect the Epiphany, it kind of does, because you're starting with your code on the ARM CPU and sending it out to the Epiphany. So all these things affect almost any benchmark. Thank you very much.