Now we have the closing keynote. Martin, over to you. Right, well, I'm really happy to see quite a few people stayed right to the end. It's really an honour to be asked to be the closing keynote speaker. I was going to say, you know, "and now for something completely different", but I can't really say that now; we had two solutions in the Jugalbandi. Anyway, I'm definitely coming back next year. This is becoming one of my favourite times of the year, coming to India in September. Will it be in September? Yeah. So now I'm going to try and tell you a little bit about some features we added to Dyalog APL about a year ago. They're not dissimilar to the futures found in other languages like Clojure, but the difference in flavour is that we have arrays of these things, because as APL programmers we see everything as arrays. Very quickly, and I may not need to do this now that both Jay and I showed you some APL a few minutes ago, I'll spend a few minutes just showing you APL. How many people were at my talk last year and sort of remember how APL works? No, not that many. So I'll spend just a few minutes on that. Then, despite the fact that we have all these parallel, array-oriented language features in APL, it turns out they don't really allow us to achieve the goal of putting parallel hardware at the fingertips of all the domain-expert users we have using APL. Then I'll explain a little bit about how we arrived at the proposed solution and talk about what we're doing. So: a very brief APL refresher. I think this is one of my slides from last year. The syntax of APL is really very, very simple; you can't even fill a complete page with the different forms. Either you have an array, which is typically created by juxtaposing items of data with spaces in between them, or you have a function followed by one argument. This is the index generator.
So iota 6 returns the numbers from 1 to 6. Or you have a function with a left argument and a right argument, and the map is typically implicit in most of these functions, though not all of them. So this is adding item-wise on the left and right. Then we have things that we call operators, which you would normally refer to as higher-order functions. This slash here is an operator which takes the multiplication function as its left operand and produces, in this case, a times-reduction: the product of all those elements. A reduction with a number on the left gives you a sliding window, so this is a sum-reduction with a window size of 2 going through the array. And all of these extend to higher-rank arrays as well; it's not just vectors. Then we have things like the inner product. This is the vector product, where you're mapping the multiplication over the elements and then doing a plus-reduction at the end; that's the regular vector product from mathematics. And then finally, indexing. One thing which is a bit different again about APL is that you can index with an array, and you get a result back which is the same shape as the index array. So let's just very briefly play a little with this. Here's my interactive APL session. Is the font size about right? I'll try and bring it up; see if I can get away with that. If you're trying to learn APL, we have the language bar with all the symbols we use; you can basically see all of them up here now. Students of APL tend to just hover along this thing to get the definition of each one of these functions, and win programming competitions doing that. So here's a matrix: the 3-by-4 reshape of iota 12 is as close as you'll get to a type declaration in APL; I want a two-dimensional array of integers here. And the map is usually implicit, but if you want better control... I mean, you have each-left and each-right in some languages.
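For readers without an APL session to hand, the basic forms just walked through, index generation, implicit item-wise map, reduction, windowed reduction, inner product, and indexing with an array, can be sketched loosely in Python (an analogy only; the names here are mine, not APL's):

```python
# Loose Python analogues of the APL forms described above (APL is 1-indexed;
# Python is 0-indexed, so the indexing example differs in that respect).
from functools import reduce

iota = lambda n: list(range(1, n + 1))               # index generator: iota 6
print(iota(6))                                       # [1, 2, 3, 4, 5, 6]

a, b = [1, 2, 3], [10, 20, 30]
print([x + y for x, y in zip(a, b)])                 # implicit item-wise +  -> [11, 22, 33]

print(reduce(lambda x, y: x * y, iota(6)))           # times-reduction -> 720

v = iota(6)
print([v[i] + v[i + 1] for i in range(len(v) - 1)])  # window-2 sum-reduction -> [3, 5, 7, 9, 11]

print(sum(x * y for x, y in zip(a, b)))              # plus-dot-times inner product -> 140

idx, data = [2, 0], ['a', 'b', 'c']
print([data[i] for i in idx])                        # indexing with an array -> ['c', 'a']
```

The shape-preserving behaviour of array indexing is the detail worth noticing: the result has the shape of the index array, not of the data.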
This is an operator called rank. We're multiplying, and we're saying we want a left rank of 1 and a right rank of 0: take vectors from the left and scalars from the right. So it's multiplying each element on the right with an entire row on the left. We can control exactly how we split the arrays up and match the items when we do the map. We have outer products, combining all the elements: every item here has been paired with every item there using this function, which gives you the little multiplication table when you use the numbers from 1 to 10. Here's the vector product that we saw before. But in APL this is a general construct; it's not just the vector product, which is the specific case of plus-dot-times. You could also do a plus-dot-not-equals. So if I make a 2-by-5 character array with "Italy" and "Benin" and ask how many characters are different from "India", I get 4 for each, because there's only one letter that matches in each of the rows. So we have these generalisations rather than specific matrix multiplication, vector multiplication, and so on. Reduction we talked about. And of course, the definition of the vector product could also be written like this: the map of multiplication and then the plus-reduction. Something we've added very recently, which is also inherently a very parallel operator, is an operator called key, which takes a function on the left and keys and values on the right. For each unique key, it calls this function with the key as the left argument (alpha) and all the data items corresponding to that key as the right argument. So if we have a function here that says "give me the key and the sum of the data items", this is a group-by: SELECT key, SUM(value) ... GROUP BY key, if you like. And I don't know that we can quite run this as fast as Julia, but this is a dynamic, interpreted language: I'm generating 100 million random numbers and doing a frequency count by counting the items for each distinct key.
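The key operator just described behaves much like a group-by followed by an aggregation. A rough Python sketch of that behaviour (the function name `key_op` is mine, not Dyalog's):

```python
# Sketch of the "key" operator: for each distinct key, call f with all the
# values belonging to that key -- i.e. SELECT key, f(values) GROUP BY key.
from collections import defaultdict

def key_op(f, keys, vals):
    groups = defaultdict(list)
    for k, v in zip(keys, vals):
        groups[k].append(v)
    # keys come out in order of first appearance
    return [(k, f(vs)) for k, vs in groups.items()]

keys = ['a', 'b', 'a', 'c', 'b']
vals = [1, 2, 3, 4, 5]
print(key_op(sum, keys, vals))      # [('a', 4), ('b', 7), ('c', 4)]
print(key_op(len, keys, keys))      # frequency count: [('a', 2), ('b', 2), ('c', 1)]
```

Because each group can be processed independently, this shape of computation is, as the talk notes, inherently parallelisable.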
So when I don't provide a left argument to key, it uses the right argument as the keys and the indices as the data. So we have all these forms. And yet... oh yes, if you want to see more of the talk that I did here last year: I got some really good feedback from Ryan Lemmer, who was here, and that helped me turn it into a talk which was recorded by Google about two months ago. So if you want to see an improved version of that, thanks to Ryan, go to the Google Talks channel on YouTube and look for it. That's about an hour, and then half an hour of grilling by Google engineers at the end of it. So we have all these things where the user only has to write a few symbols, and the interpreter does all the looping to produce lots of stuff in parallel. We have implicit map. If you have a user-defined function, something which doesn't implicitly map, you can ask for it to be mapped. And of course reductions and scans can be broken down and parallelised as well. So you would think we're pretty much there. We also have asynchronous language features; we've had these for about 20 years. The user can launch a function in a separate thread and then wait for the result, and you can have critical sections, semaphores and so on. But we still had to do something last year; we can't just rest on our laurels. And why is that? Well, the existing time-slicing with the threading mechanism is actually cheating: we only have one OS thread, like some of the older languages. We have an interpreter codebase that's 30 years old, and we can't reasonably refactor it to be thread-safe in a short amount of time with an acceptable bug curve.
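The earlier remark that reductions and scans "can be broken down and parallelised" rests on associativity: a reduction can be split into chunks whose partial results are combined. A minimal Python illustration of that idea (a sketch of the principle, not Dyalog's implementation):

```python
# A sum-reduction split into four chunks, each summed on its own thread,
# then combined. Valid because + is associative: the answer cannot change.
from concurrent.futures import ThreadPoolExecutor

data = list(range(1, 1001))
chunks = [data[i:i + 250] for i in range(0, len(data), 250)]

with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(sum, chunks))   # four partial sums in parallel

total = sum(partials)                        # combine the partial results
print(total == sum(data))                    # True
```

The same decomposition is what lets an interpreter farm a large reduce out to several cores without the user changing the expression.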
But also, when you think about it, and you read books by experts on concurrency, it seems quite clear that if programming with threads and locks is so hard for real software engineers, then our domain experts, the petroleum engineers and chemical engineers who are using the product, won't stand a chance with those things. Now, of course, we can thread all that implicit stuff, and we have done work on that. So, if you're working with large arrays, you can ask the computer: how many threads have I got? Please use four of them in parallel when there are large array operations. You can set a threshold at which, if you're doing large floating-point array operations, the APL interpreter will automatically go parallel. We don't do that by default, because you might be running on a server where you're competing with other users for resources, and you don't want to just automatically go off and multi-thread stuff; if the machine's already running at 100% CPU utilisation, there's no point. But just to show the performance you can get if you have a dedicated workstation: here are two arrays. One of them is just above the threshold, and the other has had a negative-one drop applied, dropping the last item off. So these two vectors, called "single", are just short enough that if we benchmark them they'll run single-threaded, whereas the other ones will run multi-threaded. For plus, adding two vectors together, we actually get a slight slowdown, and that's because the chip can add numbers so fast that you just hit the memory bottleneck and contention, and things don't speed up at all. I think you'll get this kind of result on just about any typical multi-core machine today. Things get a bit better if we do division; there's a bit more work for the CPUs to do between each piece of data that arrives. There we actually got about a third more speed.
And then if you go off and do something like taking the base-A logarithm of B, which is really hard work, you might get a little more than double the speed on this kind of machine. But it turns out that although you can get those speed-ups, there are actually few applications where a significant quantity of this kind of number crunching is going on. There will be some of it; some applications can get significant speed-ups from that, such as fluid dynamics, image manipulation and so on, where these effects work. But for most of our users, who are doing things like asset management and risk calculations, only a few sections of their code can benefit. So parallelising SIMD primitives that are executed sequentially doesn't help much. We are funding work at a university to have a compiler written for APL, doing data-flow analysis and trying to do compilation in the interpreter. But idiomatic APL is shape-, type- and rank-invariant, so it's really hard for a compiler to work this out. We could add optional type declarations, and we'll probably do that, but we need something more in the short term to really help people use the cores that they now have available to them. We actually need to ask the user for help, to give us some hints: the user knows how big the arrays are and whether parallel threads or cores are available. So we needed to come up with some new language features that would make it easy for the user to express optionally asynchronous sections of algorithms without using the traditional locks, semaphores and so on. It turns out that there is actually one more parallel form in Dyalog APL that the other APL interpreters don't have, and that's because about 20 years ago we started working with objects. We have a thing that we call a namespace; I think in traditional jargon it would be a dynamic object.
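For readers from other languages, the closest mainstream analogue to such a dynamic container might be something like Python's SimpleNamespace (an analogy only; Dyalog namespaces do considerably more, as the demo that follows shows):

```python
# A loose analogue of a dynamic namespace: a container you can insert
# variables and code into after creation, and evaluate expressions "inside".
from types import SimpleNamespace

ns = SimpleNamespace()
ns.mat = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]   # insert a variable
ns.row_sum = lambda m: [sum(r) for r in m]               # insert some code

# "Executing an expression in the context of the space":
print(eval("row_sum(mat)", vars(ns)))    # [10, 26, 42]
```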
So I just created a space with a built-in function called NS, and into that I can insert anything that I want; it's a dynamic container. I can insert any code or variables into it as I wish, and I can refer to them, of course. And I can execute expressions inside of it: I can not only refer to its properties, but if I put a parenthesis after the dot, I can execute any APL language expression in the context of this space. I could create another one, and let's put a slightly smaller matrix into that one, just a 2-by-3. And now I can catenate these two spaces together and call the result NSS, so NSS is now an array of two namespaces. We decided that arrays are not objects in APL; arrays are more important to us than objects. So when you put the dot after an array of objects, that's a reference to each of the objects inside the array, and this is now executing that expression inside both of them. So we have, in each space, the sum of each row catenated to the matrix: this expression has been executed inside each one of these spaces. So in Dyalog APL specifically, if you have an array of objects, dot-expression is an implicit map operation. I don't know if there are other languages that do something like that; does anybody know of a language that uses that syntax? The only one I'm aware of is SQL, because in SQL a table name is a reference to the collection of rows that are in the table, so an expression is executed, if you like, on each object in the table. The comments are unfortunately running off the side of the screen, but I think it's better to keep a large font size. So what we thought was: what if we came up with a function that we call isolate, which, if you apply it to a namespace, creates an isolated namespace? To all intents and purposes it's right here; I can still do the same kind of thing with it, but the expressions are executed asynchronously. They are actually isolated from the main body of the interpreter; in the current model they run in a separate process that's been started for that purpose. Now, if all you could do was block and wait for all of those results to be computed in parallel, that would still be interesting. But this is actually a two-step process: the result of each one of those expressions is immediately returned as a future, no matter how long the expression takes to execute. So we immediately get an array of two futures back, but because I decided to display them in the session, the system had to block on the futures until they were materialised, and then display them. And the same thing here: I have my array of two isolates, and I call the delay function in APL, a built-in function called quad-DL, with an argument of 4 on the first isolate and 6 on the second. I'll do that again. We have to wait six seconds, since to display them both we have to wait until they've all materialised, and then we get the result back. But if I assign them to an array, I can immediately ask how many there are. And if I ask for the first one... I mean, well, I talked too much, so five seconds had passed and I immediately saw the result. I'll just do that again. So I wait; I ask how many there are; I ask for the first one and have to wait until five seconds have passed; and if I ask to see the whole array, I have to wait until the whole thing has materialised. So I can decide whether I want to wait on them individually, and there are also functions you can call to ask which ones are ready, if you're a service that can't afford to block, and so on.
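This behaviour, asking non-blocking questions about a collection of futures and blocking only when a value is actually needed, maps quite closely onto futures in other languages. A rough Python sketch using concurrent.futures (delays shortened; Dyalog isolates run in separate processes, unlike these threads):

```python
# Submit work asynchronously, inspect the collection immediately,
# block only when a concrete value is required.
from concurrent.futures import ThreadPoolExecutor
import time

def delay(seconds):
    time.sleep(seconds)          # stand-in for quad-DL
    return seconds

with ThreadPoolExecutor(max_workers=4) as pool:
    futs = [pool.submit(delay, 0.2), pool.submit(delay, 0.3)]
    print(len(futs))                     # 2 -- the "shape" question, no blocking
    print(futs[0].result())              # blocks ~0.2 s, then prints 0.2
    print([f.result() for f in futs])    # blocks until all have materialised
```

As in the talk's demo, you choose whether to wait on the futures individually or on the whole array at once.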
But this is sort of the key to the whole idea: you get these arrays of futures back. Just to recap what was happening there: we have our main workspace, as we call it in APL, where all our data, the working storage, lives. If I execute this expression to say I want three isolates, in this case passing not namespaces but just three empty arrays as arguments, it creates the illusion that my workspace has been extended with these namespaces. I can do things like assigning a three-element vector to x; that does a distributed assignment, because there are three objects on the left and a three-element array on the right, so it assigns x three different values. And then, when I execute a statement like "compute the average of x" inside each namespace, those expressions run in parallel. So, to give us an example to play with, here's a mathematical definition: two numbers are coprime if they have no common factors. In APL you can write that, because the boolean OR function has been extended beyond booleans as the greatest common divisor: if 1 is equal to the greatest common divisor of my two arguments, they are coprime. And of course, creating explicit isolates as we saw before is all very well, but if all you want to do is execute one function like that, which is expensive, on a large array of arguments, you don't really want to go through the process of creating all those isolates. So we propose a new operator called parallel: if you say function-parallel, that will automatically create an isolate containing just that one function, execute it, return a future, and then discard the isolate. So say we wanted to count all the coprimes smaller than n, for n equals 1 to 10. In APL we could say 1 equals omega GCD the numbers up to omega, so where that's equal to 1, those numbers up to omega are coprime; apply that with the each operator to the numbers from 1 to 10, and that would give
us a vector like this. Now that we have the parallel operator, the user can say: I happen to know this thing has no side effects, you can safely run it in parallel, and just insert the parallel operator there, giving the interpreter the hint that these things are safely parallelisable. We get the same result, of course, which is one of the really important things. And futures, as we saw: we can ask for the shape of an array containing futures, because the primitives don't block until they actually need a value. We can also pass arrays containing futures around as arguments to other functions; nothing blocks until somebody actually needs the numeric value of something. So we could do this computation we saw before on the numbers 1 to 100 and immediately get an array of a hundred futures back. Ask how many there are: a hundred. We could partition it: I'm creating a hundred-element vector here with a 1 every 25 elements, and then using that as a mask to splice up my data into four pieces of 25 each, because I have an idea that these things will materialise at different times. None of this has blocked yet, because I haven't asked any questions that require the actual values. But then I say: I'd like to compute the average of each one of those groups of 25 numbers, and I'll launch that computation in parallel threads as well. That will start four parallel threads, each of which will wait for its 25 inputs to materialise before it runs. So by inserting those two parallels, we have been using 105 threads to do this computation under the user's control. And just to show it actually running: our task is to fill up the fish tank. I just installed Windows 10 on this machine, and I found to my utter dismay, just before coming here to do this presentation, that even when I'm running all four cores it only shows about 30%; I don't understand. So here's our
co-prime ratio. We're not just going to count how many coprimes there are less than the right argument; we're going to express that as a fraction of the number itself: how large a fraction of the numbers less than omega are coprime to it? For 1 it's all of them, for 2 it's half of them, and so on up to 10; so we're generating a floating-point number for each number. And here's a little function: the min-reduction catenated with the max-reduction is something we call a fork. You have two functions that are both applied to the right argument, and then a joining function, which could be anything but in this case is catenate. So if I ask for the min-max of the coprime ratio, I get the smallest and the largest. Now, this is all very fast for small numbers, but of course as the numbers get bigger it all slows down. That's still very fast; at 10,000 you still can't really tell; at 100,000 there's a bit of a delay. So let's do 200 numbers in the 100,000 range, and we'll see. I don't know how many hundred million GCD computations we're doing, but by default, although we do that with each, you see only one of the cores was in use. This is an i7, so I guess it really has two cores that are hyper-threading. Now, in the current interpreter we haven't actually implemented the parallel primitive as I displayed it on the slide, but a defined operator has been given this name; you see, it looks a little bit like a parallel sign and an each. This is a valid name for a user-defined operator in APL, a defined one which gives the same effect, so that users can play with this and give us some feedback on whether this really is a good language design. And you see, that ran using all the cores on the machine; it still only gave a peak at about a third, but it ran in a bit under half the time, which is sort of par for this kind of laptop. You can't go much higher than that because
you're getting memory contention between the CPUs; we're using quite a large amount of memory here generating those arrays. And of course the good news is that the results are identical: the and-reduction of the element-wise equals is 1. OK. So the thing that's really important to us about this design is that it gives us deterministic parallelism. Here is a shortened version of the code that we had up on the slide before: we're creating a hundred partial results, dividing them into quarters, computing the average of each one, and getting this number here. The really important thing to us is that we can insert or remove the parallel operator without changing the meaning of the expression. And since APL is very often used as a specification language, a mathematical description of the problem to be solved, rather than just as a traditional programming language, we think this is really important to our users. You can sprinkle these things into your code where you think there is parallelism, measure the performance, find out whether it was a good idea, complain to us that we're not scheduling it well, but you can continue to use APL as a notation. As long as your functions actually have no side effects, of course: if these things are doing Oracle database insertions in there, which is not something that we can always easily detect (they may be doing it in a very indirect fashion, by communicating with some web service, so we can't see what's going on out there), then it's up to the user to make the statement that this is safe. And then there are errors: if you execute a function that returns a future and you never refer to the result, you might actually have errors occurring in your code and never detect them, because the future is returned immediately. When the tree falls in the forest and there's no one there to hear it... it might actually change the behaviour of your code if it needs to trap errors. So, at the moment it's a model implementation; that is, the futures are
fully implemented in the interpreter, because it needs to decide when to block on something, but all the machinery for manufacturing the isolates and launching function calls in them is still written in APL; it actually launches new processes connected over TCP. In the future we imagine we might very much optimise how that's done, using other communication forms than TCP between completely separate processes, but of course TCP allows us to run isolate servers on other machines, so you can create a compute farm very easily with these things. The full model implementation has more of these combinations of the parallel forms that we looked at earlier; for example, you can do a parallel each, a parallel key, a parallel rank and a parallel outer product, using creatively selected names. If you don't like those, they have very traditional names as well that you can use. I know some of my American friends are rather annoyed at me for picking these names, which I can all type on my Danish keyboard without any problems, but I feel I'm just getting my own back for 30 years of having to live with dollar signs and things that weren't on my keyboard. If you want to read more about this: it's a model, but it's fully documented on our website, and there are videos demonstrating its use much more extensively, showing how to use all the infrastructure-management functions to decide how many processes to start, whether to start them on other machines, and so on. So, apart from the obvious things, like implementing this much more efficiently once we get the feedback from the users (which so far is pretty good in terms of the design), and giving users knobs to twiddle to optimise the use of processes, fault tolerance, queue and batch management and so on, because as soon as you start using this you actually want to be able to schedule things and declare dependencies; we also want to make sure that for the casual user, who just needs to insert one or two parallels in their code to speed
things up in a small application, that's all there already. We have ideas for promises, where you create a future explicitly rather than having it created implicitly by making a call in an isolate. That leads immediately to the idea that, although APL is a very eager language at heart, we could have an operator where you give a function as an operand and it isn't evaluated until you actually ask for the result. The Schrödinger operator is the internal name for this, because you could have a function that tells you whether the cat is alive or dead, pass it as a left operand to this operator, and it wouldn't be evaluated until you asked. You could have an array of ten cats which were futures, but the fate of each cat would not be determined until you referenced the i-th item of the array. And there's the work being done at Indiana University by Aaron Hsu, who we are cooperating with; he contributed to the design of this, because he's planning to use it in his compiler, where he'll be able to do data-flow analysis at a very fine-grained level. If he's successful with that, it gives a whole bunch of new opportunities. So far, the typical results with this current naive model, achieved by domain experts mostly refactoring their own code, sometimes with a consultant to help them for a day to get started, are sort of what you would expect. I mean, most machines are doing hyper-threading; they don't really have the number of cores they claim to have when the pedal hits the metal. But these numbers are definitely worth having for the people who are writing actual applications. Your mileage will vary a lot, depending on whether you're doing enough number crunching compared to your memory consumption, and we're waiting for the compiler. These things are also quite fun: you can do crazy things like sitting on a Mac and creating a Windows UI on a remote computer with this. I'm going to show you a video in a second of what happens when you have two Raspberry Pi-controlled robots: you start an
isolate server on each one, and now you can just have bot one and bot two as objects in your workspace. You add the servers, which have these IP addresses, clone your bot-control namespace so that your code gets copied out onto each Raspberry Pi, and then you can write an expression like this. So bots is now a two-element vector of objects corresponding to the two robots, and you can say: for 500 milliseconds, for both of them, drive one of them with this power on the right and left wheels and the other one with that, which gives you this kind of effect. You can see the arguments... well, you can't read the arguments, but you can have dancing robots. Yeah, we'll take that slide as read. So I've actually managed to leave a little bit of time for questions this time; last time I was a little bit rushed, I seem to remember. Does anybody have any questions about this? Is it just all obvious and similar to what you're already using? Yes? ... So the question is: how could you use this technology if you had functions you wanted to run that actually did have side effects? There's nothing to prevent that; I mean, there's no problem with running functions that have side effects, they will just have side effects. But if your result relies on the order in which the functions are executed, then your result will be non-deterministic. They can have errors; they can have side effects. There is also a mechanism, which you'll see if you go and look at some of those videos, that allows these isolate processes, when they're running a function, to call back into the main process. For example, if you did have an Oracle database connection, you wouldn't want to create a hundred instances of it, so you'd want your functional code to be doing its functional stuff, and then, when it gets to the point where it needs to make the transaction, it can call back and ask the main process to use the database connection that it already has. And at the moment, when there are callbacks, we serialize them on
the server side, so you can create secure transactions when these little guys out there call back to the main process. Yes: error handling? Well, if there's an error inside one of these things, by default it's just trapped and returned to you. So, do I still have these guys around? If I say iss.delay with 4 seconds in the first one and negative 4 seconds in the second one... we do have an engineer, John Scholes, who should have been here but had to stay home at the last minute, who says he's working on this, and he hopes before he retires to have implemented negative arguments to the delay function; but I think he might be joking. Anyway, if we do this, we get a 4-second wait. Actually, it would be better to assign this to a result; let's do that. So we immediately get two futures back, and if we ask for the second one, it's failed immediately... ah, OK, this is what happens when you deviate from the script. Unfortunately I loaded an older version of the interpreter, which doesn't properly support this; it should have signalled that back to me. Just give me a second to load version 14.1; there have been some bug fixes since 14.0. So: two empty isolates, delay 4 and negative 4. If I ask for the second one (oh, quad-NL, that's the list of names; I wanted the delay), it signals the error when I refer to that item of the array. It's as if the errors have almost become first-class objects in the language; they haven't quite, but it gives that effect. If I refer to the first one, since 4 seconds have now passed, I get it; if it had been less than 4 seconds, I would just have waited. You can say: well, I'm developing code, and I want to be able to debug it, and it will then run the isolates in a debuggable version of the interpreter, so you can connect a debugger to them when they fail and fix them. And that, of course, is one of the really interesting challenges going forward: if you have a hundred of these things that you've launched, you actually want
the debugger to come up and say: well, you've got 23 domain errors, four length errors and one workspace-full; which one would you like to look at? And you look at one of them, you trace through it and you fix it. I mean, edit-and-resume has been the norm in APL, I think, since 1966, when the first APL interpreter appeared. So then you want it to say: OK, do you want that code fix redistributed to the other 22 isolates that are all currently suspended at the same place in the code? You say yes please, and it patches them all up, and they continue running. When this stuff really kicks in and starts being used, I think everybody will want that; certainly the more non-technical users would demand it. So this is the beginning of many years' worth of work. I don't know: do other languages that have futures have debuggers that have reached that level, where they will patch on the fly in multiple instances simultaneously? Yep. So, I think the comment that was made yesterday in the fishbowl, about us not having reached the industrial revolution, is very accurate for a lot of software. Consider the automobile: it was invented in the late 1800s, and the cars you had 50 years later are, I think, a good mental image of where we still are with software. There's a long way to go with this stuff. But I think writing code in a functional style is clearly the way forward; there is no alternative, because as everything becomes more parallel we must go that way, since nothing else can work. Fortunately there are lots of good languages popping up where you can do that. And APL, of course, originally had... well, it still allows you to write in a very imperative style, and at every user meeting, including the one I was just at last week, I tell my users: you've got to move away from the imperative style and the object style, and use the functional, curly-bracket style of writing functions, because otherwise in
five years' time you're going to be in big trouble. You may still be OK, but your competitors are going to be running rings around you in five to ten years if you don't solve this now. And as a vendor, as language designers, we have to provide the tools to make that easy for them. Yep. So I'm really looking forward to being back next year to see what's new. I thought the Julia presentation looked awesome; there must be some ideas worth stealing there. OK, so that's it, I think. Thank you very much. All right, that was a fantastic way to close the conference, so thanks, Martin, for that.