Good morning! It's nine, so we're going to get started. How many of you were here yesterday? Oh, that's a lot of people, and that's why you're here on time. Because we like to start on time and end late. That's how we do it. All right. A few people were asking about slides and other things; I'm going to give the welcome talk after this, where I'll cover some of those logistics. But I wanted to kick off the day with something that's very important, and hence we have Martin here. I actually met Martin a couple of months ago in Hong Kong. We were both presenting at a conference, and Martin was doing his talk on the performance work he's been doing. One of the things that occurred to me is that as functional programmers we seem to be so far away from the metal. We talk a lot about using multiple cores, and sitting in his talk I realized that I don't actually understand anything about using even a single core. So I think Martin's talk is really going to help us understand. It was actually a request to Martin to create a new talk for this conference about how we can leverage a single core, and then how we can leverage multiple cores. So without further ado, thanks Martin.

Okay, good morning, everyone. Hopefully you can all hear me. Everyone awake? Or still asleep, not had coffee yet? I'm going to ask questions. I'm going to be a scary presenter: I'm going to come out and talk to you for a while. It's going to be very scary. So, my normal programming languages are not the functional ones. Can I confess a little bit? I write Java most of the time, which is probably pretty horrible to most of the people here. I write a lot of C and C++, and I even write a fair bit of x86 assembler. So that's all horrible, horrible stuff. But truth be told, I actually like functional programming, and I do a fair bit of it. If anyone sees my C or Java, it actually looks quite functional. There are a lot of really good things in this space, and what I want to talk about today is how we bring together some of the functional side with an awareness of how computers actually work, plus some basic mathematical principles. We're going to dig into this a little bit.

So let's start off with a question. I want to look at memory operations. We're all used to thinking of memory as equal: it's random access memory, so surely it doesn't matter what we do. But does it? Let's test some of our thinking, test some of our understanding. Here's a simple question. I want to go through a big array of memory, a big vector, touch every single element in that array, and add up all of the values; they're all integers. If that array is one gigabyte in size, what is the average time per operation to do this? So, I said I was going to ask questions, so I am. What do we think? How much time, nanoseconds, microseconds, milliseconds, does it take per element in the array? And remember what's involved: I've got to load the element, sum it into a scalar with what I had before, then move on to the next element, load it, sum it, and move on. What is the average per operation to do that? Depends? Good answer; that's always the right answer in computing. It depends. But if we were to measure, what do we think we're going to get?
What would be the typical average on, say, a three gigahertz machine? Or a two gigahertz machine, actually, which is what I tested this on. Guesses? Ten. Ten what? Ten nanoseconds. Okay, cool. Let's see. If I actually run the test, this is the answer. I've written a benchmark, warmed up the JVM, and got the result. And I find that it's actually 0.8 of a nanosecond per operation. And you kind of go: what? How do I end up with roughly a nanosecond per element? Well, you'd probably be less surprised if you looked inside a modern processor. This is inside a Haswell CPU; sorry for the squashed screen, we had some trouble with the presentation mode. If you studied the old von Neumann architectures at school, you'd expect one ALU and some memory. But we don't have one ALU in our modern processors. In fact, we have a scheduler inside the processor. This is not an operating system scheduler; this is inside the processor core itself, not the processor socket. And that scheduler has eight ports, and across those ports we have multiple ALUs. Look, there's an ALU, there's an ALU, there's another one, and another one. Four inside a typical CPU core. So you can be doing four integer operations at the same time. And you might think: how can I do that? I've written single-threaded code. How can I get that level of parallelism out of my data? It turns out there's a lot of natural parallelism in code when you write it. For example, if I'm going over that array adding up all of the values, I'm incrementing the value that holds the summation so far, and I'm also incrementing the index that walks over the array. Those two things can happen at the same time. They can even happen slightly before or slightly after one another; it doesn't matter, as long as each is ready by the next iteration. So we get quite a lot of natural parallelism. What we're looking at is: load a piece of memory, store it in a register, increment the scalar I've been keeping for the summation by that register, store that result, and write it back out at the end. Our processors can do a lot of that at the same time.

[Audience question about paging.] Good question, good point. No, there's no paging here. The way the benchmark is set up, I've created an array one gigabyte in size and pre-initialized it with random values. It's particularly important to use random values so the compiler doesn't just eliminate all of the dead code and hand you the answer at the end. Our compilers are very smart: they won't compute something at runtime if they can do it at compile time and save all of that. So this is an array pre-allocated with random values put into it. And we get these sorts of results.
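In code, the loop being measured looks roughly like this. (A minimal Java sketch, not the actual benchmark harness from the talk; a real measurement would use a proper harness such as JMH and repeat the runs.)

```java
import java.util.Random;

// Sketch of the sequential-sum benchmark: 1 GB of ints, pre-filled with
// random values, summed in a single pass. Needs a heap over 1 GB, e.g. -Xmx2g.
public class SequentialSum {
    public static void main(String[] args) {
        int[] array = new int[256 * 1024 * 1024];   // 256M ints * 4 bytes = 1 GB
        Random random = new Random();
        for (int i = 0; i < array.length; i++) {
            array[i] = random.nextInt();            // random values defeat dead-code elimination
        }

        long start = System.nanoTime();
        long sum = 0;
        for (int i = 0; i < array.length; i++) {    // the index increment and the summation
            sum += array[i];                        // can run on separate ALUs in parallel
        }
        long elapsed = System.nanoTime() - start;

        System.out.printf("sum=%d, %.2f ns per element%n",
                sum, (double) elapsed / array.length);
    }
}
```

Printing the sum matters: if the result were never used, the JIT would be free to throw the whole loop away.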
There are some different patterns, though. What if, rather than going through the whole array linearly, I stay within an operating system page, a 4-kilobyte page, pick values at random within it and add them all up, then move on to the next page and do the same thing? Same amount of work, different access pattern. And then let's try other things. If I just randomly go around the whole heap, not staying inside the same operating system page, will that make a difference? Or, let's make it really interesting: what if I go from one step to the next, and I can't take the next step until I know the value read in the previous step? The step function reads the cell, and the random value in that cell feeds into the random step that comes next. I've written all of these so they do roughly the same amount of work, because we have all of these ALUs and all of the address generation units to do the work in parallel without really impacting the code. The only thing that differs is the memory access pattern. Let's see what happens if I benchmark those scenarios.

So here they are: the sequential pattern; randomly within a page, then step on to the next page; randomly within a page where the next step depends on having read the previous value; randomly through the whole heap; and randomly through the whole heap where the next step is based upon the previous value. Same amount of work, big difference in the results. For those at the back who can't quite read it, it's nearly 90 nanoseconds per operation when I go randomly through the whole one-gigabyte heap with each value depending on the previous one.

Why is this so different? Well, our hardware is doing some very nice things for us. In the simple case, where we're just linearly going through the array reading everything sequentially, we have a hardware prefetcher helping us: it goes ahead and fetches the memory before we need it, so it's sitting in our cache ready to use. The in-page random case benefits from something different. It doesn't get the hardware prefetcher, but it benefits from the TLB, the translation lookaside buffer, which caches the conversion of virtual memory addresses to physical addresses. If we had to do that lookup on every single access, we'd be slow; the lookup is done per page, not per address. So if you stay within the same operating system page, you benefit from that cache. Once we go random across the wider heap, the heap is too large, in this case, for the TLB cache to help, and too irregular for the prefetcher. But there's something interesting going on with the dependent loads. What's that about? Well, consider a cache miss going out to main memory: we check the cache, and if the data is there we get it; if not, we've got to go to memory. Our x86 processors can track ten concurrent cache misses at a time. So when you've got random addresses with no dependency between them, those ten loads can all go off at the same time and come back together; the cost is amortized, divided by ten, basically. But if you take one step and can't take the next until you know the value from the previous one, there's no going off in parallel. It's like going out to get some shopping: if the first item has to feed into choosing the next one, you can't go ahead and grab the other nine. If they're totally independent, you fetch them all and bring them back together.
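Here's the contrast between the two random patterns in code. (A minimal sketch with assumed sizes; a shuffled visit order is chained into a next array so the dependent walk touches every slot exactly once.)

```java
import java.util.concurrent.ThreadLocalRandom;

// Independent random loads (misses can overlap) versus a dependent walk,
// where each loaded value IS the next address: pointer chasing.
public class DependentLoads {
    interface Op { long run(); }

    public static void main(String[] args) {
        int n = 1 << 24;                              // 16M ints, ~64 MB: far bigger than cache
        int[] array = new int[n];
        int[] indices = new int[n];
        for (int i = 0; i < n; i++) {
            array[i] = i;
            indices[i] = ThreadLocalRandom.current().nextInt(n);
        }

        // Shuffle a visit order, then chain the slots so that following
        // next[] from slot 0 visits every slot once in random order.
        int[] order = new int[n];
        for (int i = 0; i < n; i++) order[i] = i;
        for (int i = n - 1; i > 0; i--) {             // Fisher-Yates shuffle
            int j = ThreadLocalRandom.current().nextInt(i + 1);
            int t = order[i]; order[i] = order[j]; order[j] = t;
        }
        int[] next = new int[n];
        for (int i = 0; i < n; i++) next[order[i]] = order[(i + 1) % n];

        time("independent", n, () -> {
            long s = 0;
            for (int ix : indices) s += array[ix];    // addresses known up front: misses overlap
            return s;
        });
        time("dependent", n, () -> {
            long s = 0;
            int i = 0;
            for (int k = 0; k < n; k++) { i = next[i]; s += i; }  // each load gates the next
            return s;
        });
    }

    static void time(String name, int n, Op op) {
        long start = System.nanoTime();
        long sum = op.run();
        System.out.printf("%s: %.1f ns/op (sum=%d)%n",
                name, (double) (System.nanoTime() - start) / n, sum);
    }
}
```

On typical hardware the dependent walk comes out several times slower, for exactly the reason above: only one miss can be outstanding at a time.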
So how much dependency we have in our data structures and our code starts to really matter. This becomes fundamental to a lot of our design: the design of our data structures, the design of our code. So let's go forward from that. I'm going to expose you to some fundamental laws, and then we're going to dig into some implementations and see what happens. Software and fundamental laws.

Anybody know what this is? Come on, this is India, the home of mathematics. Queuing, yeah. It's Kendall notation; that's an M/D/1 queue. So how does this play out? If we want to work out the mean response time of something, we need to know the service time and the utilization. Service time and utilization are very interesting in how they interact. Service time is the time to do a job. Utilization is, for some resource, the percentage of its time that's busy. If I have a job that takes 100 milliseconds and I run it once per second, I'm using 100 milliseconds per second: 10% utilized. If I run it five times per second, it's 50% utilized. So you'll often see the little equation rho equals lambda times S, where rho is the utilization, lambda is the arrival rate, and S is the service time. You'll also hear this in connection with Little's law. It appears in lots of places and is really fundamental to how things work.

So let's plug some figures in. What does it look like from a response-time perspective? You'll see that this is not a linear function (we'll sketch this curve in code in a moment). As I increase the utilization along the x-axis, what happens to the response time on the y-axis? Things look pretty good as you ramp up utilization until you get to about 70%. Then all of a sudden you go up this curve, and things become unresponsive. If you use things too much and you have no slack in your system, you end up with something very unresponsive. It doesn't respond. And if you work in teams, you'll know this: if there's no slack in your teams, you just can't react to requirements from the business. So anybody who wears the project manager hat sometimes: be aware that if you work your teams to the limit, they become totally unresponsive.

How can we put this in a bit more context? This is what happens when you get something fully utilized. I'm a huge fan of cars, and the road system here is insane; I have never seen anything like it before. I can look at the mathematics of queuing theory and see what's going on, and your road system is a wonderful example. When you use something too much, this is what happens: it ends up in gridlock. So, a tip for driving. If you leave no space to the car in front, you have no slack, and when something goes wrong you have no buffering, and that's what you end up with. If you want a tip to improve your driving: keep a little bit of distance. It's basic mathematics; you cannot get away from it. Yet most people drive right up to the car in front of them. It happens all over the world. We've got to give things a little bit more slack, and it's the same in our software. We'll see some of this later.
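To make that response-time curve concrete, here's the standard M/D/1 result plugged into a few utilization levels. (A minimal sketch; the 100 ms service time is just an assumed figure.)

```java
// Mean response time for an M/D/1 queue:
//   R = S + (rho * S) / (2 * (1 - rho))
// where S is the service time and rho the utilization.
public class Md1Curve {
    public static void main(String[] args) {
        double s = 100.0;                                  // service time in ms (assumed)
        for (int u = 10; u <= 90; u += 10) {
            double rho = u / 100.0;
            double r = s + (rho * s) / (2 * (1 - rho));
            System.out.printf("utilization %2d%% -> response %6.1f ms%n", u, r);
        }
    }
}
```

At 50% utilization you wait an extra half a service time; at 90% you wait an extra four and a half service times. That's the knee in the curve.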
What's this one? Maybe fewer people recognize it. Anyone heard of Amdahl's law? Yeah, well, this is not Amdahl's law. This is what's known as the Universal Scalability Law, which takes Amdahl's law further. In fact, Amdahl's law was just an argument put together by Gene Amdahl to scare people away from minicomputers. He wanted you to buy his mainframes, which were very fast single-threaded machines; he wanted people to use those rather than the minicomputers, so he wanted to scare people off. Neil Gunther, who came up with the Universal Scalability Law, discovered when he was measuring systems at Xerox PARC that they couldn't achieve Amdahl's law; they couldn't get the predicted scale-up factors. What he realized is that there are two major components to scaling. One is the contention penalty, which is what Amdahl talked about; that's the alpha in this equation. But there's also the beta that needs to be considered, and that is the coherence penalty. What is coherence? How do we think about it? It's really agreement. Whenever you've got multiple parties working together, it takes them time to agree to do something, and that's separate from the contended part. In a cache subsystem, it's the time to get the state to where everybody knows what's going on, so you have agreement. Without that, you really limit what can happen.

What does this look like if we feed some figures into a graph? Let's take a job that can be run 95% in parallel. That's pretty good, really, in many cases: there's 5% of it I have to do under a lock, under some form of contention, and 95% I can do in parallel. If I add CPU cores, or nodes of some description, how does it scale up? And then I'm going to mix in a coherence penalty, which in this case is 150 microseconds. What's the difference between pure Amdahl's law and the Universal Scalability Law? Well, you get this graph. The blue line is Amdahl's law in action. As I add processors, the speedup factor never reaches 20x. Think about the problem: if it's 95% parallel, then one twentieth of it, the 5%, cannot be done in parallel. It doesn't matter how much you subdivide the rest; you're still left with that one twentieth, so you cannot get more than a 20x speedup no matter how many processors you throw at the problem. But what if it takes time to get agreement? Agreement becomes interesting. On small numbers of processors, agreement is easy. As the number of processors goes up, the cost of agreement starts to become the dominant factor. You don't even get close to the Amdahl limit, and after a while the speedup actually starts to go backwards. Anybody noticed how this works in teams? Small teams are really good, really productive, and as teams grow past a certain size, they actually slow down. It's mathematics. It hunts you down. There is no escape; you cannot beat mathematics. That's the basic coherence penalty at work, and you see it in lots and lots of places.

Let's put this into real life and context again. This is what happens when you try to get agreement to cross a road without rules. Everything slows down because you're spending so much time reaching agreement rather than getting the flow.
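Going back to the graph for a second, here are the two formulas side by side for the 95%-parallel example. (A sketch: alpha is the 5% contended fraction from the talk; beta is an assumed, illustrative coherence coefficient, since the USL's beta is a dimensionless ratio rather than the raw 150 microseconds.)

```java
// Amdahl: speedup(n) = n / (1 + alpha*(n-1))
// USL:    speedup(n) = n / (1 + alpha*(n-1) + beta*n*(n-1))
public class Scalability {
    public static void main(String[] args) {
        double alpha = 0.05;      // contention: the 5% serial fraction
        double beta  = 0.001;     // coherence: assumed, for illustration
        for (int n = 1; n <= 512; n *= 2) {
            double amdahl = n / (1 + alpha * (n - 1));
            double usl    = n / (1 + alpha * (n - 1) + beta * n * (n - 1));
            System.out.printf("%4d processors: Amdahl %5.2fx, USL %5.2fx%n", n, amdahl, usl);
        }
    }
}
```

Amdahl creeps toward its 20x ceiling and stays there; the USL curve peaks at around 30 processors and then goes backwards. That's the retrograde region in the graph.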
What I find kind of fascinating in this picture is, if we look up here, that's a stoplight. Are those just decoration, a little bit of color for the street? It seems nobody uses them. I've seen this in Italy as well; it's a common factor. I think in Italy they're advisory. I've seen it here, in Brazil, and a few other places: they're really just decoration, street furniture, and people do their own thing. If we get organized and we reach agreement quicker, we can achieve really quite cool things.

Now, where is some of this going? It's kind of interesting. For a long time we've had single processors in the machine; what happens when we have many more? AMD are just bringing out their new Ryzen chips, and their server chips with the Infinity Fabric interconnect are kind of fascinating. This is what an AMD chip looks like in a server: four little groups of dies on the same socket, each with eight CPU cores, and each of those cores runs two hardware threads. So we've ended up with 32 here and 32 over there, all interconnected in different ways. What's interesting is that you've got so many cores freely available now, so we can go parallel, but if you look at the interconnects, they're not all the same. You've got different levels of coherence time, so you cannot escape this. Two cores talking together here are very different from a core here talking to a core over there, taking a walk through this mesh. The memory access times I showed you earlier are your best-case scenario, on a single core. It gets much worse with multiple cores, because what if you go to memory and join a queue, waiting? Queuing effects come in. If cores on different sockets are communicating, using the fabric with its longer latencies, you've got more coherence cost. So there's a lot of time here that we didn't think about. How do we design for this, particularly when we care about what's going on underneath? Who's heard of the NUMA effect? Non-uniform memory access? That's what this is. Memory costs are not all the same these days. Depending on the location there's a big difference, so it really is location, location, location; it matters a lot.

How do we put this together? There's an interesting concept called systems engineering, where we look at problems as a whole. In a mature engineering discipline in another field, like mechanical, chemical, or electrical engineering, you don't just do your one little bit. You can specialize in an area, but you need the wider picture of what's going on. Today that means some understanding of the stack we run on, our choice of data structures, and the programming patterns we apply to them. If we show some of that sympathy, we can achieve really good things, both for our hardware and for our software as a whole.

So I want to talk about how we put some of this in context, and what sort of stuff is useful. I want to use the example of a messaging system I worked on. I'm picking messaging because it's a cool example with lots of really interesting features, and it also happens to be something I'm quite experienced at. Messaging systems are highly concurrent, so we've got to deal with all of that. They're also distributed, so the coherence costs can get quite large. And they require not just CPU and memory knowledge: you also need a fairly good knowledge of networking and what can go on, if you build one of these things from scratch.
So it becomes a really good test case for how we put some of this stuff together. Look at the normal approaches: if you're going to build a messaging system, you can take the easy route, which is to build it on top of TCP, where a lot of the work is taken care of for you. The hard route is to build it on top of an unreliable protocol like UDP. You can get certain benefits from that, and the world is moving very much in this direction, because TCP was not designed for what most people are using it for. It's actually a really good protocol designed for something other than what everyone typically does with it today. Anybody here use Google Chrome? Do you use Google services like Gmail and Google Docs? When your Chrome browser talks to one of those Google services over HTTP, do you think you're using TCP? Anyone think they are? I think most people probably do. In fact, you're not, because Google realized that TCP is an inappropriate protocol for that. They're using a protocol called QUIC that runs over the top of UDP, and it's designed around a lot of the problems we've been having with queuing, connection setup, and different ways of working. I work on a messaging system called Aeron, which is UDP-based and does similar things, because that's more suitable for this sort of environment.

But once you go to UDP, you've got a couple of interesting problems, like how to deal with unreliable delivery. We send a packet from one machine to another: it may not get there. Send a bunch of packets: they may arrive in a different order than you sent them, and they may in fact arrive multiple times. You get some really interesting things to deal with. Typically, to do that, you use data structures like skip lists, trees, and scoreboards, where you put the packets back together on the side so they're back in the order you sent them, and then you can deal with them. It becomes an interesting challenge. Now, if you look at trees or structures like these, you end up with something that has a lot of depth. And what happens when you go from node to node? Remember the memory accesses that were dependent? Data dependencies? That's called a data-dependent load. You'll also hear it termed pointer chasing. This is one of the biggest performance hits you will take on a modern processor. So watch out if you've got linked lists, or any data structure where you can't get to one node without resolving the previous ones. How do I get to that node starting from up here? I've got to resolve this one, then that one, then that one. I can't jump; I can't do anything in parallel; I have to resolve everything on the way down. So we've got to watch what we're doing. And I think this is one of the biggest challenges for the functional programming community, because so many of the data structures are built on linked lists and trees. Why? Because of how we achieve immutability: we do path copying, all of the time. For example, if I want to change something down here, I take this tree and create a new version with a new root. I can reuse a lot of the old tree, but I've got to path copy from the leaf back to the root, replacing all of the nodes along the way. And now I've got a reference to a new immutable data structure that I can use.
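Here's path copying in miniature: an immutable binary search tree where insert returns a new root and shares everything off the copied path. (A minimal sketch, not any particular library's implementation.)

```java
// Path copying on an immutable binary search tree: inserting a key copies
// only the nodes on the root-to-leaf path and shares the rest with the
// old version.
public final class PersistentTree {
    final int key;
    final PersistentTree left, right;

    PersistentTree(int key, PersistentTree left, PersistentTree right) {
        this.key = key;
        this.left = left;
        this.right = right;
    }

    // Returns a NEW root; the old tree is untouched and still valid.
    static PersistentTree insert(PersistentTree node, int key) {
        if (node == null) return new PersistentTree(key, null, null);
        if (key < node.key)
            return new PersistentTree(node.key, insert(node.left, key), node.right);
        if (key > node.key)
            return new PersistentTree(node.key, node.left, insert(node.right, key));
        return node;   // already present: share everything
    }
}
```

Both the old root and the new root stay valid forever, which is the beauty of it; the price is a fresh allocation for every node on the path, every time.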
So this is kind of cool. But it gets a bit worse than that. Ah, what's wrong with this thing? Isn't it working anymore? The Wi-Fi just stopped working, and as a result it's locked my machine up, so I'm going to have to use this instead. So, if we do the path-copy approach, we end up with this: I create the whole path from bottom to top. And then I've got an interesting challenge with concurrency. The typical technique is that you build from the bottom up, and then you CAS in: you do a compare-and-swap on the root to replace it, and now concurrent access to the structure is safe and immutable. The problem is that if the CAS fails, because another thread has beaten you, you repeat the whole process again: you start at the bottom, copy right up to the top, CAS it in, and eventually it works. Now go back to the Universal Scalability Law at this point. You've got the contention penalty of building that path up, and then you've got the coherence penalty of reaching agreement, and the more threads you've got doing it, the more you repeat the process. Ten threads go at it at the same time: one of them succeeds and wins, the other nine fail and have to do it all again. Of those nine, one wins and eight fail. You can see the math unfolding here; it's a really difficult thing to do well. So you end up with these interesting costs, and you also end up in garbage collection hell, because all of those nodes have to be allocated and then collected again afterwards. This becomes a difficult and interesting challenge.
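That retry loop, sketched against the PersistentTree from a moment ago, looks like this. (Illustrative only; real implementations add backoff and other refinements.)

```java
import java.util.concurrent.atomic.AtomicReference;

// The CAS-retry publication pattern: each attempt rebuilds the copied
// path; if another thread wins the race, all of that work (and all of
// its allocation) is thrown away and redone.
public class CasPublish {
    static final AtomicReference<PersistentTree> root = new AtomicReference<>();

    static void insert(int key) {
        while (true) {
            PersistentTree current = root.get();
            PersistentTree updated = PersistentTree.insert(current, key);  // path copy
            if (root.compareAndSet(current, updated)) {
                return;            // we won: the new version is published
            }
            // Another thread beat us: our copied path is now garbage; retry.
        }
    }
}
```

Every failed compareAndSet throws away a whole path copy, which is exactly where the contention cost, the coherence cost, and the garbage collection pressure all come from.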
If you build these things, you end up with an interesting way of looking at them. In the UK, people love sausages. This is probably as horrible as you can imagine to a vegetarian culture: usually the skin is something like a pig's intestine or a sheep's stomach, and inside that they stuff meat and lots of other things. They're really quite gross. Many people find them tasty, but they're really not very nice once you understand anything about them. So I've got an interesting way of thinking about this: functional data structures are a bit like sausages. They may look beautiful on the outside and they may taste nice, but the more you know about how they're made inside, the less well you sleep. I've built a lot of these structures, so when people are using the nice, beautiful facade, I know you really don't want to see what's behind it. Actually, E6 is a really good example of this. So you've got to be aware of what we do and how we could do it better.

One of the really interesting techniques available now is CRDTs: conflict-free replicated data types. There are some great ideas in these that we can use for concurrency. What I want to look at is how we take some of the concepts from CRDTs, along with some of the other nice functional concepts, and build something that performs well but is also well-behaved underneath and easy to reason about. That gives us the benefits of functional programming in this space. There are two major types. One is the operation-based type, known as commutative replicated data types: CmRDT, with a small m. The other is the state-based type, the convergent replicated data types: CvRDT.

So what's the detail on these, and what good properties can we have? With the commutative ones, you replicate operations. Imagine I want to change the structure: I can send an operation like "decrement the value by 10" or "increment the value by 10", and if these all arrive at an endpoint, you eventually end up with the state changed on the other side. The catch is that they must all be commutative operations: I can send a plus 10 and a minus 10, and it doesn't matter what order they arrive in, we end up with the right state. That's a really nice property. But the operations aren't idempotent, so we can't apply the same one multiple times. If I'm going to build a data structure that goes over a network, where packets can be duplicated, I can't really do that; we need an underlying reliable transport. The state-based ones are interesting in a different way: we replicate the state of one structure to another. We can use this for concurrency and for distribution, but we have to replicate the whole state. The interesting constraint is that the state must increase monotonically; it's append-only. The problem comes when you need a removal: you need a solution for that, and the solution is a tombstone; I'll show an example of how that can work. And when we do this, we need a merge function, and that merge function must be associative, commutative, and idempotent, so it copes with duplicate delivery and resolves the conflicts. The really nice thing is that some of this can be handled at a lower level if we work with deltas, and deltas are a good way to go.
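A grow-only counter is the classic minimal example of a state-based CRDT, and it shows all three merge properties at once. (A sketch; production CRDTs also deal with replica membership, deltas, and removal via tombstones.)

```java
import java.util.Arrays;

// State-based CRDT sketch: a grow-only counter (G-Counter). Each replica
// only increments its own slot, so state grows monotonically, and merge
// is element-wise max: associative, commutative, and idempotent.
public final class GCounter {
    final long[] counts;          // one slot per replica
    final int replicaId;

    GCounter(int replicas, int replicaId) {
        this.counts = new long[replicas];
        this.replicaId = replicaId;
    }

    void increment() {
        counts[replicaId]++;      // only ever touch our own slot
    }

    long value() {
        return Arrays.stream(counts).sum();
    }

    void merge(GCounter other) {  // safe in any order, any number of times
        for (int i = 0; i < counts.length; i++) {
            counts[i] = Math.max(counts[i], other.counts[i]);
        }
    }
}
```

Because merge is idempotent, receiving the same state twice is harmless, which is exactly what lets this run over an unreliable transport; and because each replica only grows its own slot, the state is monotonic.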
So how did I build something like this and bring these techniques together? Aeron is an example. This isn't really a sales pitch; it's about how we take the ideas and apply them. One of the things I want to do is apply that systems engineering thinking. I've got to ship messages from one machine to another; I want to respect the hardware, respect the data structures, respect the properties of the network, and have them all work together for a good result. The log buffer is one way of doing that. I want something that streams through memory in a nice linear fashion, not in random patterns, and I also need to be able to deal with it concurrently. So, I have a buffer that contains a header for each message, followed by the message body, plus a tail counter for where the next message goes, and I can grow the state monotonically: I just keep appending messages. There are multiple steps to that: I need to increment the tail, copy in the message, and apply the header. If I do that, I can move forward in a sensible way. You may say: why not one big file that goes on forever? That's the way a lot of the academic research behind this goes. I keep finding that you read academic research papers and they don't work in practice, for lots of very practical reasons. One of the practical reasons here is page cache churn, virtual memory pressure, all sorts of issues, because you are constantly allocating new memory. So how do we deal with that?

Well, you use the good old technique from backpacking, where you wash one, wear one, dry one, and rotate them as you go. If I've got a clean buffer, a dirty buffer, and a currently active buffer, I can keep appending and reuse them over time. Where do some of these ideas come from? Is anybody familiar with John Carmack, who started id Software: Quake, Doom, all of that? He wrote the original games engine in C, and then he immersed himself in functional programming, properly, for maybe 18 months, and he liked a lot of what he found; he rewrote the whole thing in Haskell and learned a lot from it. When he came back from that, he realized there are really good elements to both. One of the things he wanted to bring across was immutability, because games are a highly parallel, highly concurrent problem, and immutability really helps with that. But when he tried a pure immutability approach, he discovered it didn't work and didn't scale: the garbage collection pressure and the management of it all became totally impractical. What he realized is that if you limit it over periods of time, you can have immutability within a time frame and then start over. Functional up to a point, a checkpoint; then go imperative, change the state of the world, and start again with a clean slate. He did this frame to frame within the game. I stole the idea and did it buffer to buffer: I get this nice immutability within the buffer, and once I reach the end of the buffer, I flip over, with an imperative technique, and then continue again. Sometimes it's not one or the other; we blend the techniques together.

How do we do concurrent publication into something like this? I don't want locks; I want something that works well with the Universal Scalability Law. So what do we need to care about? Say I have two threads racing to put messages in here. I could lock the structure, put in what I need, and release the lock. Then I've got a contention penalty, and I've also got a large coherence penalty, because handing locks off is quite expensive: it can be many microseconds. What can I do to cut down both penalties? For contention, I want to do the minimum possible while contended, and the minimum is just incrementing the tail counter: one message increments it, then the other message increments it. That is the only contention, and the coherence cost is a single cache line moving between processors, so we're down to very small amounts of time. Then, in parallel, each thread copies in its message, completely independent of the concurrency window we need to care about. The really hard case is when I've actually reached the end of the log and have to deal with rotation: the writer that hits the end takes responsibility for padding out the log, rotating to the next buffer, and putting its message in there. Look at the contention penalty and the coherence penalty, pull them apart, and fix your algorithms to deal with each. No locks, no complexity, and we can keep going forward, mostly because it's a monotonic function, and that gives you the beauty of doing a lot of this. So we can put it together that way.
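The claim step of that publication scheme can be sketched like this, with a plain AtomicLong standing in for the tail counter. (An illustrative sketch; Aeron's real log buffer is more involved.)

```java
import java.util.concurrent.atomic.AtomicLong;

// Lock-free claiming in a log buffer: the only contended step is one
// atomic fetch-and-add on the tail counter; copying the message into the
// claimed region then proceeds with no further coordination.
public class LogBufferClaim {
    final byte[] buffer;
    final AtomicLong tail = new AtomicLong();

    LogBufferClaim(int capacity) {
        this.buffer = new byte[capacity];
    }

    // Returns the claimed offset, or -1 when this writer hit the end
    // and must handle padding out the log and rotating to the next buffer.
    long claim(int frameLength) {
        long offset = tail.getAndAdd(frameLength);   // the ONLY contended operation
        if (offset + frameLength > buffer.length) {
            return -1;                               // end of log: rotate
        }
        return offset;                               // now copy the frame in, uncontended
    }
}
```

On x86 that getAndAdd becomes a single LOCK XADD instruction, so the contended window is one cache line for a few nanoseconds, and everything after it is uncontended.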
You may say: what's in a header, and how do we get this across to another machine? Well, here's our header, in typical IETF ASCII art, laying out the different fields: what's in the message, the length of the frame, the version, some data to rebuild the structure on the other side. Where it gets really interesting is failure. What happens if a publisher dies mid-operation? In the lock-based case, if I'm putting something in under a lock and the holder fails, I can revoke the lock and continue forward. Here it's concurrent, in fact concurrent even across processes, with no coordination other than what's in the buffer, so we need a protocol, and the protocol says quite nicely how we interact. The way to do it is this: I can't write the whole frame atomically, but I can write it in an ordered fashion. First I write the length, and I write it negative to begin with: if this is going to be a 100-byte frame, I write minus 100. I then fill in all of the rest of the header, fill in the message body, and finally say I'm done by flipping the length from negative back to positive. By putting in the negative value to begin with, if the writer fails and I need to go in and fix things up, I know the size that was reserved: I just take the absolute value. Writing the length negative first and flipping the sign at the end is a really nice way of dealing with this, and now I've stayed within the different laws I care about. And how do we handle removal in the failure case? Say I've put in the negative frame length here, and the thread, or the whole process, crashes. What do I do going forward? If you think of the Erlang world, we just let things die, but then we clean up: I fill in the frame type to say this is a tombstone, and I flip the length back to positive. So we still have monotonic state going forward, even in a buffer where a writer has died. And that gives us what we need.
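In code, the ordered frame write looks roughly like this. (A sketch: a real implementation needs a memory-ordered store for the final sign flip, e.g. a release write, so readers can't observe the positive length before the body.)

```java
import java.nio.ByteBuffer;

// Ordered frame write: reserve with a NEGATIVE length, fill in the header
// and body, then publish by flipping the sign. A cleaner that finds a
// negative length after a writer dies knows the reserved size from the
// absolute value, and can turn the frame into a tombstone (padding).
public class FrameWriter {
    static void writeFrame(ByteBuffer log, int offset, byte[] body, int headerLength) {
        int frameLength = headerLength + body.length;

        log.putInt(offset, -frameLength);          // 1. reserve: length goes in negative
        // 2. fill in the rest of the header here (type, version, ...)
        log.position(offset + headerLength);
        log.put(body);                             // 3. copy in the message body
        log.putInt(offset, frameLength);           // 4. publish: flip the sign
                                                   //    (real code: ordered/release store)
    }
}
```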
We replicate the messages to another machine and go forward from there. As the messages arrive, we copy them in on the other side. Copy in the first message: the header contains all the information for where it needs to go. If something arrives out of order, we just copy it to its location in the buffer on the other side, because the header says what offset it starts at. We don't need the trees, the skip lists, the scoreboards, anything like that. Then another message arrives out of order; fine. But notice there are two counters here: one tracking the contiguous completed position, and the high-water mark. By having those two counters, whenever they come together, we know we've got a complete log. If they're apart, we know we've got a gap, we've got loss, and we deal with it. Just protocols; very simple ways of dealing with this. And it gives us something that's strongly eventually consistent, and a nice way of putting it together. Now, what's interesting about this is to consider the memory layout. That memory will prefetch. It'll work really well. It'll vectorize on our modern processors. And we can apply a lot of really interesting techniques on top, like searches that are monotonic functions across the buffer to find the different states. Most of this just becomes mathematics: very little branching, just looking for values. It's all just math and how it applies, and it's been around for a long time; it's just not been applied. The APL folks got a lot of this right a long time ago, and a lot of the industry has just ignored it. So again, we can steal lots of cool ideas from other places.

And how do we get to the point where we can reason about all of this? This is another beautiful thing about the monotonic functions. Everything we have, publishers, senders, receivers, carries a counter that is the latest version of state it has ever seen. That way we can see progress right through the whole system. You can track it, you can debug it, you can deal with it in really nice ways.

So, wrapping up and closing now; we reach the interesting part. Can we all agree that shared mutable state is evil? Is that reasonable in this room? I think it's a fair thing to say. So we want to avoid it. What can we do to avoid it? Well, the different laws I've talked about tell us that if we must share memory and update it, we should only ever have single writers. If we've got concurrent updates to anything, we're in the world of shared mutable state, and it's hell, and we need to deal with it another way. The easy way to start thinking about it: if we must share process memory in some way, have only one owner of it from a write perspective. Many owners from a read perspective is fine; we can share reads. Then how can multiple processes or threads update it? Send a message, send something on a queue, to the one writer that owns it. It can update it with whatever type of structure it needs, and everything stays nice and clean. We can hide all sorts of ills provided only one thing ever updates it. Once it's updated concurrently, it becomes really, really difficult. The other way to think about this is shared nothing: if we do not share at all, we can do whatever we want. So you get a flavor of the different structures and how we need to think about them. We need to get away from the whole idea of shared state, keep our own local state, use whatever techniques we want on it, and look at messaging. I've spent a lot of my career looking at messaging and developing messaging systems, deliberately, to get that isolation. This is a really, really important point. We've got many languages where we share memory, and by sharing memory we make things difficult: it gives us issues for garbage collection, for the types of structures we can have, for failure detection. Instead, we push the evilness of the sausages into the communication mechanism. So if there's one thing I would say to the Erlang folks, it's that you got this right at a high level; now you really need to invest in your messaging infrastructure, so it's super fast, super efficient, and obeys all the right properties to let this stuff work really well. And we build protocols on top of it. The secret to making so much of this work well is to build things with monotonically increasing state. If we do that, it becomes so much easier: we don't go backwards, we always go forwards, and we tombstone when we need to remove something. If you want to see some examples of this in code, have a look at Aeron.
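The single-writer idea fits in a few lines. (A minimal sketch using a concurrent queue from the JDK; a serious version would use a bounded queue and an idle strategy rather than a busy spin.)

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Single-writer principle: any thread may send a message, but exactly one
// owner thread ever mutates the state, so the state itself needs no locks
// and no compare-and-swap retry loops.
public class SingleWriterCounter {
    private final Queue<Long> inbox = new ConcurrentLinkedQueue<>();
    private long total;                        // written by the owner thread only

    public void add(long delta) {              // any thread: send, don't mutate
        inbox.offer(delta);
    }

    public void ownerLoop() {                  // run this on exactly one thread
        for (;;) {
            Long delta = inbox.poll();
            if (delta != null) {
                total += delta;                // the only write to the shared state
            }
        }
    }
}
```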
Or come to the workshop I'm going to be giving tomorrow. And on that, thank you very much; I'll take questions. I think I've left enough time for some. Previous slide, please.