Right! Tonight I want to talk about performance. It's an interesting topic. Let's see if you can stay with us, because if I fall asleep, please shout and wake me up. You're the ones who'd normally fall asleep, but it might be me: in the last three weeks I have covered 19 time zones. Yesterday I flew in from Australia, having worked my way from San Francisco to London, Hong Kong, Australia and back here, so my own performance might be in question tonight.

So what do we want to talk about? Well, this question: is it difficult to write software with good performance? Who thinks it's hard to write high-performance software? Ooh, good. Do you think it's all a pile of compiler tricks and knowing all the secret bells and whistles of the operating system and the rest of the stack you're using? What I'm going to try to do is convince you that's not the case. Writing high-performance software is really about writing nice, good, clean software. I work a lot in this space — high performance, security, highly available systems — and there's a simple theme to most of it: writing nice, clean code is the right way to build pretty much all of those systems. It's not a bunch of tricks. Some of the highest-performance software out there, when you look at it in detail, is incredibly simple. It's very easy to reason about. I want to break that down, explore why that is, and see what we can learn from it.

Now, the way most people go about developing these days is to copy and paste stuff off the web. If you want high-performance software, don't start with that. I know we all do it — I've done it myself when I needed something done — but knowing what something does underneath becomes really important. In fact, I think this is one of our biggest problems in software at the moment. I call it resume-driven development: we want to add things to our CVs, and when we put all of that stuff into an application, it's usually one of the things that slows it down, makes it unreliable, and gives us a lot of problems.

But why should we really care about this? There's a great paper — if you get a chance, go look at it; I'm a big believer in reading papers rather than being insular. The semiconductor industry got together last year and talked about trends in our industry and where things are going. The really scary finding: by 2030, if we continue building software the way we're going, there will not be enough energy production on the planet for our data centers. That's not a trend we can continue. We're going to have to get better at how we write software. How many times have you heard "let's throw hardware at the problem — you don't really need to care about performance"? We got away with that for decades, because processors kept getting faster and faster. But they've slowed down a lot.
And it's going to get more and more difficult, because not only have clock speeds stopped increasing, the growth in the transistor densities we build from is slowing down as well. So things are going to change, and we're going to need to start thinking about this. It's fascinating reading if you get a chance to look into it.

I work a lot in finance, building systems that are trying to be the fastest out there, trying to win opportunities in markets. I've been doing that for a while, but more and more people are now talking to me about how to make their software more efficient, because the cost of their data centers is getting so significant, and because managing very large clusters at that scale becomes a difficult challenge in its own right. So it's really interesting how performance is becoming something a lot of people care about, for a number of different reasons.

The main topic I want to talk about is how we design for performance. I'm going to cover four things. I'll start with what performance actually is, because it's a nebulous concept and I want to break it down. I'll talk about what it means to have clean and representative code, because I alluded to that at the start: good clean code tends to be good-performing software, and the clean code and craftsmanship movement is good in many ways — I want to explore why. I'll talk about how we build the models themselves, because all of our software is just a model: a software simulation of some business problem. And then: how do we test this? How do we measure that what we're doing is the right sort of thing?

So, what is performance? If somebody says "make your code faster, make it high performance", what does that actually mean? If you start studying performance theory, a number of things stand out. One is throughput: when we say we want something faster, we need some way of measuring it, and throughput is how many units of work we do in a given period of time. An example of great throughput is a container ship — go outside and look at the ocean out there; I've noticed there are plenty of them around here. These things can move a large number of goods in a given period of time, but it takes them a long time to get there; there's a large duration to it. You hear the terms throughput and bandwidth — how are they related? Bandwidth is your maximum throughput. If you look at a 10-gigabit network, it could be running at any level of throughput up to its maximum of 10 gigabits. You'll hear the two terms interchanged, but one is the maximum of the other.

The other term you'll hear a lot, especially in finance, is latency. I prefer to talk about response time — I'll break down why later — but it's the time from when you go to do something until you get the response back. And throughput together with response time is what really matters when you start measuring systems.
So: what is your response time at a given level of throughput? Running at maximum throughput, at your full bandwidth, you'll tend to have terrible response time, because you get queueing effects — I'll pick into why that happens. Those two things together are what you want to measure. So when someone says they want high performance, you start asking questions: what level of throughput do you want, and what response time do you want at that level of throughput? Those are the questions we need to be asking when it comes to performance and how we pick it apart.

Related to this, but not really performance, is scalability. By scalability we mean, in the economic or any other practical sense: if you add resources to something, you get a proportional increase in throughput. For example, if you can do 10 units of work per second and you double the amount of hardware — CPU, memory, network, whatever — you should see a proportional increase: 20 units in the same time. If you double the resources and only get a 20% increase, that is not scalable; the costs are far too high. In economics you'd say it's not economically viable — you should be getting economies of scale as you go up, not the other way around. Yet in software we tend to get it the other way, and there's some maths behind why that we need to think about.

When it comes to what's worth scaling, I think ants are a fascinating example of how an optimal system is often not what a smart animal would build. They go out from their nest to collect leaves and bring them back for food. At the tree, these simple little animals grab the leaf with one pincer and use the other pincer to cut it. Really nice and simple — they have a small brain. And 50% of the time they grab the wrong side: they start snipping, oops, the leaf falls to the ground, start all over again. Now, if we looked at that problem we'd say: I want to optimize that, let's make sure they never grab the wrong side. But that would require a much larger brain for that little animal — a big increase in complexity in the evolutionary sense. Is it the right thing to optimize? Snip, snip, snip, drop. Grab another one, snip, snip, snip — on the second attempt there's a 50% chance you get it. The real cost is the trip back to the nest, not the snipping. I see this a lot when people optimize systems: they pick on the things they think they want to optimize and miss the important thing. The trip back to the nest is the big time; that's what should be optimized if you want to optimize the system. You see this so much when you watch teams working on problems, because people work on the code they know, the code they want to learn, the code that helps with their resume-driven development. We have to get more scientific about this — follow the theory of constraints and do the right sorts of things. So where does all this come from?
This is a lot of my life, as you can guess from what I said at the start, but it perfectly describes everything you need to know about performance — all the major components of it. I said I prefer response time to latency as a measure, and this is why: in queueing theory a lot of these things are well defined, whereas the terms many people use are not, and don't have good mathematics behind them when we start modeling our systems.

So take this case; there are well-defined terms. When you're at the front of the queue and someone is reading your passport and giving you the stamp, you're being serviced. That is your service time in queueing theory — a well-defined thing that we can easily measure. Whenever a request is actually being processed in your system, it's being serviced; we time how long that takes, and that's something useful we can feed into some basic mathematics, as I'll show in a second. Another component is that while you're waiting to be serviced, you have wait time — queueing delay, you'll hear it called — and that's another interesting measurement: how long am I waiting in those queues? And how long you wait in the queues is a direct function of the rate of arrival. Whenever a large plane lands, a lot of people turn up in the queue at once — a burst of traffic. That's what the real world is like for our systems too: we get bursts, not constant arrival rates, and so the time in the queue starts to matter. The response time is the overall time from when I go to interact with the system until I eventually get a response, and it is the combination of the queueing time and the service time.

That's why it's usually best to measure the overall thing, from the outside in. When we measure systems we'll often measure only the service time and not the whole response time. There are lots of products out there making claims about the performance of their system that only quote service time. You see this in many of the major databases — and because a lot of them are open source, you can go and look; Cassandra is a really good example. It measures how long a transaction takes from the moment it picks it up to process until it hands it back. If it has a big GC pause, it doesn't count how long that request sat in the network stack waiting to be serviced by a database that was busy pausing. But the end user doesn't care: they're waiting the whole time, and that's what matters. So we have to measure from the outside in. And you can see all the pieces in this picture: we can have multiple queues in effect, work stealing between queues — all of the different components are there.

So how does this play out mathematically? For most systems, as you load them up, the response time changes with the rate of utilization. We're talking here about the average response time of a given system. Say you're going to do an action and it takes one unit of time. Say that unit of time is 100 milliseconds.
If things arrive infrequently — say once every second, each taking 100 milliseconds — you're only using the service for 100 milliseconds out of every second. That's the utilization: 10 percent, with 900 milliseconds unused at that point in time, and you get a correspondingly good average response time. As you increase the arrival rate — say they're now coming in at five per second — you've got 500 milliseconds of service time used up: 50 percent utilized. Response time is still quite reasonable. But look what happens once you go beyond about 70 percent utilization: you get a probability effect where, when something is highly utilized, it's probably already in use at the moment you turn up, that compounds, and queues form. This is how queues happen on highways and motorways; it's why we deliberately slow down traffic entering a system. Sometimes to go faster you actually have to slow down — counterintuitive, but there's really simple mathematics behind why it works.

So if we run our systems under high load, we will have long response times, because queues form, and these queues are everywhere. This theory is a hundred years old — it's the work of Erlang studying telephone exchanges. He realized that at certain times of day you could not get a phone connection: you'd be waiting a long time because utilization had gone up and response time was really poor.

Here's where this stuff gets really fun from a performance perspective when you're writing code. Take a system running at 90% utilization: on average you're waiting ten times the typical service time — ten times as long as if you turned up to a completely idle system. Now profile that service, optimize it, so it takes 50 milliseconds rather than 100. If the arrival rate stays the same — say nine per second — nine of them at 50 milliseconds is only 450 milliseconds of each second, so utilization is only 45%. You've come right down the response-time curve: by making something twice as fast, you've made it roughly ten times more responsive. The math can play out really well — and it plays out the other way too: push utilization too far and you end up with real problems with the responsiveness of a system. The pro tip: make sure all of your systems have sufficient spare capacity, otherwise they won't be responsive. Go and look at the math behind this; it's quite easy to follow, and it's really interesting, good stuff.

And it applies to any system — software systems or systems of people. We all work in teams; a team is a system. If there are any project managers in this room: if you run your team at very high utilization, it will not be responsive. You cannot avoid it. You cannot run away from the math; it will get you. This real fundamental stuff applies at all levels of how we end up working.
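Before we go on, the figures I just quoted fall out of the simplest queueing model — M/M/1, random arrivals at a single server. A quick sketch of the math, using the numbers from above:

\[
\rho = \lambda S, \qquad R = \frac{S}{1 - \rho}
\]

where \(S\) is the service time, \(\lambda\) the arrival rate, \(\rho\) the utilization, and \(R\) the average response time. With \(S = 100\,\text{ms}\) and \(\rho = 0.9\), \(R = 100/0.1 = 1000\,\text{ms}\): ten times the service time. Halve the service time to \(S = 50\,\text{ms}\) at the same nine arrivals per second and \(\rho = 0.45\), so \(R = 50/0.55 \approx 91\,\text{ms}\) — a 2x optimization bought roughly a tenfold improvement in responsiveness.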
So can we go parallel? That was one queue feeding one server, and we saw in the earlier picture that with multiple queues I can get more work done. Well, let's look at what happens when we have some contention. Say at the front of those queues there's a person telling people which desk to go to, as there typically is. That's coordination overhead. It might be a lock in one of our pieces of code — that person distributing the work is a lock in these sorts of systems.

The way we normally look at this is Amdahl's law. Who's heard of Amdahl? Gene Amdahl. Good. Here's the thing: Gene Amdahl did not come up with Amdahl's law. People are interesting. He wrote a paper making what became known as Amdahl's argument, and his goal was to scare people away from parallel machines. He sold big, fast, single-threaded mainframes and competed with IBM; he wanted people to buy machines where you didn't have to deal with parallel programming, so he made this argument to scare people off. People later took his paper and turned it into a law attributed to him — so, funnily, the guy who didn't want people doing parallel programming is the one whose name is attached to the law about it.

So what is the argument? Say I've got a job made up of two parts, part A and part B. If part A can be done in parallel, I get a certain speed-up; if part B can be, I get a different speed-up. If I can throw four processors at the problem, I can shorten the elapsed time: response time goes down and throughput goes up as a result. That's a nice speed-up. But say A cannot be split apart and only B can: we end up with a much smaller speed-up. Fundamentally, whatever job you've got, if there is a sequential constraint within it, you're limited by how much of it there is. Look at the different percentages that can be done in parallel: if only 50% of a job can be parallelized and on the other 50% we just have to go single-threaded, we can never get more than a 2x speed-up, no matter how many processors we throw at it. Even if 95% of a job can be done in parallel, being limited by that 5% means we can never get more than a 20x speed-up.

Anybody like meetings? Do you like attending meetings? A meeting is a sequential component in an algorithm: everybody must join on it. So if you want more throughput from your teams: fewer meetings. It's really interesting how much performance theory works in organizations as well as in systems — because it's all systems.

So that's Gene Amdahl's argument — and remember, he wasn't stating a law; he just wanted to scare people off. In the real world, can you even get a 20x speed-up when something is 95% parallel? Who thinks you can, with an infinite number of processors? Good — everybody's either scared or knows the answer. You can't. When people started looking into this, they couldn't reproduce those numbers — in particular Neil Gunther, when he was at Xerox PARC looking at how to scale some of these systems up. So we now have the Universal Scalability Law, which has been around a while and goes into this in much more detail. When we've got a job to run in parallel, there are two things to care about: what percentage of the algorithm can be run in parallel, and what percentage cannot.
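To put formulas behind the caps just quoted — and behind the refinement coming next — here are the two laws side by side, in their usual textbook forms. For a parallel fraction \(p\) on \(n\) processors, Amdahl gives

\[
S(n) = \frac{1}{(1-p) + p/n}, \qquad \lim_{n\to\infty} S(n) = \frac{1}{1-p}
\]

so \(p = 0.5\) caps out at 2x and \(p = 0.95\) at 20x. Gunther's Universal Scalability Law adds a coherence term (in his notation, \(\sigma\) for contention and \(\kappa\) for coherence):

\[
C(n) = \frac{n}{1 + \sigma(n-1) + \kappa\, n(n-1)}
\]

That \(\kappa\,n(n-1)\) term grows with the square of the processor count, which is why, unlike Amdahl's curve, this one eventually turns over and heads back down.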
The part that cannot be parallelized is the sequential constraint — the contention penalty, as Gunther calls it. But when you've got all those parties working together, you also have a coherence cost: you've got to get all of those threads or processors to the same point and then disseminate the data. When you go to a meeting, you don't materialize from your desk straight into the meeting room: you've got to walk there and walk back. If that meeting happens to be in another building in another city, that's your coherence penalty. It takes a while. Nothing happens instantaneously; everything has to move, and even at the speed of light it still takes time. Inside silicon, signals move at roughly half the speed of light, so a nanosecond of travel doesn't get you far. Once you add distance, it starts to become a real problem.

Take the 95%-parallel case and let's add in 115 microseconds of coherence cost — roughly what you'd see on average as a round-trip time for a typical grid-processing problem in an Amazon cluster. How does this play out? The blue line is Amdahl's law, asymptotically heading up towards its 20x. Plug the same figures into the Universal Scalability Law and watch what happens as we throw more and more processors at the problem when there's a coherence penalty — and everything has a coherence penalty; you just have to work out its duration and feed it in. After a while we stop speeding up, and then we actually get worse and slow down. These things come to get us in the end.

So making things go parallel is actually a really difficult problem unless the job is embarrassingly parallel. That term comes from jobs that have no contention penalty: the work can be done completely in parallel, with no part requiring active coordination — and those jobs are quite rare. Most things have a contention point somewhere. So be aware: we can't just throw processors at a problem. Even though our processors aren't getting faster and we're getting many more of them, the core design starts to really matter.

Anyone know what this is a graph of? What we've got here is the number of processors, tested from one to eight, running some job that takes 16 to 17 microseconds on average. As I throw processors at it, I should be getting more throughput — but I seem to have this linear-time problem, which suggests whatever this is has a 100% sequential component. This algorithm gets no benefit at all from extra threads. Any guesses? False sharing? Could be, but the numbers are too high: we're looking at around 16 to 17 microseconds, and false sharing on the same socket costs you something like 60 cycles. This is different. That is the mean logging duration in pretty much any Java logging framework. If you want a system to be scalable, I recommend you go and read the code inside logging frameworks and do completely the opposite — because they are the absolute anti-pattern for how to design a scalable, high-performance system. A single big lock around the whole thing is just a sequential component.
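To make the shape concrete, here's a minimal sketch — hypothetical code, not lifted from any particular framework — of the design I'm describing: every caller funnels through one lock, so by the Universal Scalability Law the whole thing is one big contention term.

```java
import java.io.IOException;
import java.io.Writer;

// Hypothetical logger in the shape of most Java logging frameworks:
// one lock serializes every thread, so logging is a 100% sequential
// component no matter how many cores you add.
public final class SerialLogger {
    private final Object lock = new Object();
    private final Writer out;

    public SerialLogger(Writer out) {
        this.out = out;
    }

    public void log(String message) {
        synchronized (lock) {      // every caller contends here...
            try {
                out.write(message);
                out.write(System.lineSeparator());
                out.flush();       // ...and holds the lock across a blocking write
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }
    }
}
```

Throw eight threads at that and throughput stays flat — exactly the linear-time graph on the slide.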
I don't even need to name any particular logging framework, because they're pretty much all the same. There are very few that are any better — they exist, but they're not commonly used. All the major candidates have this as a serious problem inside them. I've had to fix many systems in the wild, and very often, within the top three performance problems I see is people doing too much logging. As you throw processors and cores at it, it doesn't scale up. Be aware of that: these things come and get us, sometimes where we don't expect them.

That's a little bit of groundwork. Let's look at what it means for code to be clean and representative — the thing I alluded to at the start. I love looking things up in the dictionary. Part of the background is that I'm dyslexic, so I personally struggle with words — but it also really annoys me that our whole industry keeps using the wrong words for things. I find it really frustrating; it sets the wrong example. A good example is "maintenance" of software: software doesn't require maintenance. It doesn't need greasing; it doesn't wear out. It has different characteristics and behaviors, and we've chosen the wrong words and metaphors. Another one: random access memory. If I go to access memory, I don't want a random address given to me — I want the specific address I asked for. I want arbitrary access, not random access. The list of these things is huge: we talk about "non-functional requirements", another terrible phrase that just doesn't fit what's going on. Please, pick up a dictionary if you're going to name anything significant.

So what does "clean" mean? I love this — look it up in the Oxford English Dictionary: uncontaminated; pure; innocent. Who can say that about their code? It would be really nice if it were true, but most code is far from uncontaminated. I like the clean code movement as an aspiration to get there, but we're far from it at the moment. And "representative" is interesting as well: it means serving as a portrayal of something. Again, that is exactly what our code should be. It's a software simulation, usually of some business problem, so it is a representation — a portrayal of the problem we're addressing. So I quite like these words.

How do they fit? The thing I like about representative code is that the code is usually the best place we capture our understanding of something. How often have you read code that bears no relationship at all to the thing it's simulating? It's one of the reasons I love domain-driven design — capturing your understanding in the code is so clearly the right thing to do, because documents and everything else go out of date while the software should still be running. If your software is full of things named totally differently from the domain, totally unrepresentative of it, you've got a translation layer between the code and the problem you're trying to solve, and that has a cost. So being clean and representative is really important, and you'll tend to get code much more appropriate for what it needs to do. That shows up in performance, in reliability, in security — all of those other things too.

So let's dig into some of these in a bit more detail. Abstractions — we use this term a lot; let's look at it more closely. I've got some rules of abstraction. The first rule is: don't use abstraction.
The second rule is: don't use abstraction. Like the rules from Fight Club. OK, I'm making a slightly facetious point here, but what I really want to get to is that we over-abstract too quickly in most design. Abstraction is good — a very useful thing — but if you look at most code, it's gone the wrong way. People rush to an abstraction before they truly understand the problem they're dealing with, and there's something characteristic about how that happens. You start working on something for the first time and think: I'm going to create some abstractions, because this may help me out in the future. But you don't totally understand the problem the first time you meet it, so you create an abstraction that's imperfect — imprecise for what you're dealing with. And now you've got a real problem, because you've created your baby. It's your child — your child forever — even though it's completely inappropriate.

It's much better to just build the thing exactly as it needs to be, and move on. Build the next thing. You might see some commonality — keep it completely independent anyway, totally decoupled, no dependencies. Do it again. Maybe the third time you start to see that yes, there is real commonality here, and the abstraction may help me be more precise about what I'm dealing with. That's when it starts to make sense. So when I say don't use abstraction, I am being a bit facetious — but don't rush to it. Do it the other way around: abstractions should be added when you understand the problem really well, not up front. We do this so much — particularly people with "architect" in the title, thinking they're helping the team with these cathedrals built up front. This is not a good way to work; it should go the other way.

Abstractions have a cost. Like anything, if we're going to use something, it's got to pay for itself: do we get more back in return than the cost of using it in the first place? If so, good — we should use it. If its cost is greater than what we get in return, that's not economic, and we shouldn't have it in our software. We've got to balance these things and make good engineering-based decisions — engineers work within constraints and choose things based on the payback they get.

And a final point here: be careful with DRY — don't repeat yourself. I've seen people rush into drying out code too much and introduce abstractions as a side effect that they never properly intended. Sometimes it's actually better not to dry something out — to have two completely independent copies, maybe worked on by different teams and evolving at different rates. Rushing into drying out code can be dangerous. I'm not saying everything should be copy-and-pasted everywhere, either; it's the same engineering trade-off, and you've got to develop a taste for when to do it. Don't rush to one extreme or the other: go at it with a conscious, open mind, look at the costs and the benefits, and weigh them up.

Now let's take a simple example. There are many forms of abstraction, and one of them is generalizing a type: we see something, we think it's the same kind of thing, so we extract an interface and start creating multiple implementations of it.
On our modern processors, some really interesting things play out here. One of the things processors do is speculate. They reach a branch in our code and, even before they know what the condition will evaluate to — if i is less than 7, do I go one way or the other? — they use statistics, guess, and run ahead in some direction; if they're wrong, they have to unwind. But if instead of a branch you have to read one value and then use it to go and read another value, that's called a data-dependent load — and data-dependent loads choke our processors faster than almost anything else.

Now, how does this appear with types? Take Java as the example — pretty much all managed languages have the same feature. If there's only one implementation of an interface or class loaded, the compiler and runtime can apply some optimizations: the call site is monomorphic, so the runtime can get rid of the indirection and go straight to the thing it wants, usually with a trap left in the code. Typically this is done through class hierarchy analysis: the runtime looks at how many classes of a type have been loaded. If there's only one, it can take certain optimization choices; if it later discovers another class of that type being loaded, it can undo the previous bets it took. With two implementations it can still do quite well — a bimorphic call site, where it looks at the call site, puts an if in, and goes one way or the other. In Java or C#, with more than two implementations the site typically goes megamorphic, and it has to use a jump table; in JavaScript it goes megamorphic at more than four. Different languages set different limits for when this kicks in. Once you're megamorphic, it's going to be slow, and it's hard to get past that. Compilers still have some tricks — they can find the most common case and do a thing called inline caching — but now you're depending on everything the compiler can do and hoping for the right case. So: don't jump to the abstraction unless you know it's a good thing and you're getting the benefit for it. Again, it's judging costs. And if the abstraction isn't representative, that's a big smell — little types creep in that aren't appropriate and just confuse what's going on.
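Here's a small sketch of what that looks like in code — hypothetical types, and the notes in the comments describe typical HotSpot behavior rather than a guarantee:

```java
// Hypothetical example: the cost of a call site depends on how many
// implementations of the interface the runtime has actually loaded.
interface Pricer {
    double price(double quantity);
}

final class FlatPricer implements Pricer {
    public double price(double quantity) { return quantity * 1.00; }
}

final class TieredPricer implements Pricer {
    public double price(double quantity) {
        return quantity > 100 ? quantity * 0.90 : quantity;
    }
}

class Demo {
    // With only FlatPricer loaded, class hierarchy analysis can prove this
    // call site is monomorphic: the JIT devirtualizes and inlines price()
    // into the loop. With two implementations it can stay bimorphic (an
    // inlined if over the two classes). Load a third and the site typically
    // goes megamorphic: a jump through a table and a data-dependent load
    // the processor cannot speculate through.
    static double total(Pricer pricer, double[] quantities) {
        double sum = 0;
        for (double q : quantities) {
            sum += pricer.price(q);   // the call site whose shape matters
        }
        return sum;
    }
}
```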
I think one of the problems we get into here is the big framework approach. Something happened in Java about 20 years ago that was different from most other languages. I remember being a C programmer: if you took a library into your code, it didn't do any logging, it didn't do anything odd with exceptions — it gave you callbacks, and you, as the user of that library, made the decisions about what to do. Then Java came along, and all of a sudden libraries started making their own logging decisions, implementing their own ways of doing things, imposing ways of working on their caller rather than the other way around. We went a kind of weird way, and I think this sums it up quite well: we end up carrying so much baggage that's totally unnecessary, rather than traveling light. If anybody here has ever backpacked, you'll know the two types of people. There's the person who backpacks a lot, understands it, travels with just the minimum of what they need, nice and efficient — and then there are the people who pack everything they can imagine and are a complete nightmare to deal with. It's the same with code. Who can admit they've seen people — or maybe done it themselves — turn up on a new project and install all their frameworks before they've even worked out what business problem they're trying to solve? We've got to be honest with ourselves: it does happen. So abstract when you're sure of the benefits.

It's interesting — we even have laws about abstraction. Presumably you've heard of Joel Spolsky, the FogBugz guy, who used to run an Excel team. He came up with the Law of Leaky Abstractions: all non-trivial abstractions, to some degree, are leaky, because the detail of the underlying complexity cannot be completely hidden. I'd say this is really a law about the wrong abstractions. It's not that abstraction is inherently leaky: people create the wrong abstractions and, as a result, they leak and we paper over the cracks. It's a smell of what the industry is doing. I prefer this description, from one of our Turing Award winners — going back to 1972, Dijkstra said: the purpose of abstracting is not to be vague, but to create a new semantic level in which one can be absolutely precise. That's not leaky abstraction. That's making something very clear, making the rules very simple for what's going on. That's understanding.

Let me take an example of how to apply this, because we've got good abstractions and bad abstractions. Most of our operating system abstractions — usually written by people who have refined them over decades — tend to be very good, and some of our hardware abstractions are similar. Take memory subsystems; I think this is a good example. I ranted earlier about "random access memory": it's not random, and it doesn't even give you uniform arbitrary access, because there are some interesting behaviors — there's two orders of magnitude difference in performance depending on how you choose to access memory. What's interesting is that we can abstract the hardware design down to three bets. If you study hardware systems — anybody coming to my workshop tomorrow will find out in painful detail what's in our memory subsystems — there are TLB caches, prefetchers, all sorts of interesting machinery in there. But fundamentally you don't need to know all of that if you understand the correct abstractions, and the correct abstraction for memory subsystems is three bets.

The first is the temporal bet: it's time-based. Anything you've used recently, you're likely to use again in the near future. That's one of the things our hardware friends bake in, and it's the one we all mostly know — it's how caches work; it's pretty intuitive, and I don't have to explain it to pretty much anyone. But there are a couple of other interesting bets in there that are probably less well understood. One of them is the spatial bet: things that are close together are likely to be used together.
That bet appears in things like how our pages are laid out and how our cache lines are laid out: for example, two fields in the same object are likely to be in the same cache line — spatially close together — so you're likely to access them together. I'll come back to that with a more detailed example: keep together the things you use together. And the last one is the pattern bet: your code will have a predictable access pattern to memory. For example, walking through an array summing up the values is very predictable; the processor can spot that pattern, prefetch ahead, and deal with it. If your pattern of access is random — random in the true sense — the processor cannot predict it, so it cannot hide the latency.

And what do I mean by the difference in latency? If you hit the L1 cache and fetch a cache line that happens to be there, the response time is about a nanosecond. If you go to main memory on a typical server these days, you're looking at around 100 nanoseconds. That's two orders of magnitude between hitting L1 and going to main memory. Whether these bets work for you or not makes a massive difference to the performance of your code — we're not talking 20 or 30 percent, we're talking orders of magnitude. So when people want their stuff to go fast, before reaching for parallelism there are much bigger wins in your algorithm design and in playing well with memory. Those are the abstractions that matter.

So let's take this into how we implement our models and what we need to care about. So much of modeling is about the fundamentals of design — really well-practiced fundamentals like coupling and cohesion. Everybody in this room has heard of coupling and cohesion; whether you truly understand what they mean, and practice them every day, is a more interesting question. One of my bugbears about our industry is that we practice, and care about, a lot of stuff that really doesn't matter in the long term, while not practicing some of the stuff that really matters and that we should be doing more often.

What's an example of this, on coupling and cohesion? A classic: I've got one object that's constantly calling another object to get fields out of it. We've got a term for that — feature envy. It's a design smell: this object constantly reads fields from another object to do its job, so those fields probably should be in this object. It's coupled to that other object. If I look at those fields and realize that yes, they were put in the wrong object to begin with, I move them across into the object that's using them, it's all encapsulated inside, and the object is now cohesive. Really simple design. It's one of the things I do a lot whenever I start learning a new code base: I walk around all the object relationships and look for these patterns. Another one you'll see is what's known as a train wreck — who's seen code that goes this object, dot this one, dot this one, dot this one? That's a train wreck, and it's a sign you've got a coupling problem. You'll hear of the Law of Demeter, of tell-don't-ask — these are fundamentals of design. You just tell an object to do something; you don't keep handing it stuff and pulling stuff out all the time. You try to encapsulate as much as possible.
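Here's the shape of that refactor in code — a hypothetical example, names invented for illustration:

```java
// Feature envy: Invoice keeps reaching into Customer for the data it
// needs to do its own job.
class Customer {
    double discountRate;
}

class Invoice {
    Customer customer;
    double amount;

    double total() {
        // Walks to another object -- likely another cache line, likely
        // another cache miss -- for a value this calculation really owns.
        return amount - amount * customer.discountRate;
    }
}

// After moving the field to where it's used: the data used together now
// lives together, in one object and most likely one cache line, and the
// object is cohesive.
class InvoiceRefactored {
    double amount;
    double discountRate;   // captured when the invoice is created

    double total() {
        return amount - amount * discountRate;
    }
}
```

The same move that fixes the design smell is the one that respects the spatial bet — which is exactly where this shows up next.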
And this shows up in numbers. What sort of difference can you get? On a typical code base, when I get involved in a project, just walking around the model and fixing all the train wrecks and feature envy and so on — besides cleaning it up and making it easier to understand — tends to get me around a 30 to 40 percent performance improvement, just from that exercise. That's because I'm now respecting the second bet, the spatial bet: things that are used together are kept together. Memory moves around our processors and memory subsystems in cache lines, which tend to be 64 bytes, so if you've got fields together in the same object, they'll tend to move around together. You go to access one, you take the cache miss; you go to access the next, and you won't take a cache miss, because it's likely in the same cache line. Good design gives you really good performance around these things. So the tip is: respect locality of reference — keep the things together that are used together. You'll also find that makes the code much easier to deal with, because things that change together should be kept together as well. If you're wondering how to work out which fields belong in which object, look at how the system changes over time: you'll find things tend to change in groups, and if you find yourself constantly traversing lots of relationships together, you've probably got a smell — go and find out why.

Yes — absolutely, good question; that's the next subject I'm going straight on to, so you're with me. Here's an example. An interesting question: can you implement an efficient B+ tree in your language? That may seem an odd question, but does anybody know why relational databases, which had been a kind of academic idea, suddenly became really successful around the end of the 1980s? The clue's on the slide: B+ trees. What used to happen is you'd go to disk to walk an index. Imagine what a binary tree is like: you hit a node, from that node you go one of two ways, then you get another node, one of two ways — and every one of those steps is another disk access, another block somewhere random on the disk, another page. What's nice about B+ trees is that they're not binary: they're wide, as wide as you want. Say each node holds eight values before fanning out. Rather than branching two ways, you're branching many ways; the tree becomes a lot more shallow, and you do far fewer disk accesses.

And the same now applies to data structures in memory. If you're going to implement a tree, a binary tree is a terrible idea in memory for any real size of data structure. With a B-tree or B+ tree you get much more value per node: you can perform lots and lots of operations on that node before taking another cache miss, and you get through things much more easily. You might think: but if I have to scan within a node, that's going to be more instructions. The instructions almost don't matter.
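A sketch of the difference — illustrative code, not a production tree:

```java
// A binary node buys you one key per cache miss; each hop chases a
// pointer to somewhere else in memory.
final class BinaryNode {
    long key;
    BinaryNode left, right;
}

// A wide, B-tree-style node buys you eight keys for the same miss:
// 8 longs = 64 bytes, about one cache line.
final class WideNode {
    final long[] keys = new long[8];
    final WideNode[] children = new WideNode[9];
    int count;                        // how many keys are in use

    // The linear scan inside the node is a handful of predictable,
    // prefetch-friendly instructions -- almost free next to the ~100 ns
    // the next pointer hop will cost.
    int childIndexFor(long key) {
        int i = 0;
        while (i < count && keys[i] <= key) {
            i++;
        }
        return i;
    }
}
```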
Play that back: what if I said the instructions almost don't matter? If a main-memory cache miss is 100 nanoseconds and your CPU is running at three gigahertz, that's three cycles per nanosecond — 300 cycles in the time of one cache miss, on a single thread. On top of that, we have between six and eight execution units per core: there's parallelism inside the core itself. That's a whole other interesting topic — the compiler and the hardware actually extract a lot of parallelism even out of a single thread, because the order in which you typed your statements is, sorry to break the illusion, only an illusion of how your code actually runs. The compiler and the CPU can do anything they want with your instructions as long as they give you the same result as if they'd run in the order you typed them, and they do a lot out of order: anything with no data dependencies gets done in parallel. So a core can be retiring, say, four instructions per cycle — four instructions times 300 cycles is 1,200 instructions in the time of one cache miss. We don't need faster CPUs. If you measure CPUs in the real world, they spend most of their time idle — you'll typically see CPUs running at single-digit percentages of utilization, choking, waiting for data they don't have. Feed them better data and we get more out of them. That's the interesting thing, and that's where B+ trees help — and it applies to many other types of data structures.

So let's go into relationships in detail. I work a lot in finance, so take a simple example: we have order books, and we have orders sitting on those order books. When someone models this, they tend to draw a couple of boxes and a line in between — and the line is where the really interesting stuff happens. Notice, too, that most people don't put a name on the line; they put the names on the boxes. I think this is an interesting flaw in our industry — and partly, I think, a consequence of our lack of diversity: men focus on the nouns and name the things, and tend not to name the relationships. Yet the relationships are where all the performance happens to be, because that line is really quite interesting. For example, it's actually two relationships — bids and offers — and it's more than that: orders are qualified on price, so we have to pull out price levels, and they're ordered within them. Taking the time to actually do the modeling, there's much more interesting stuff in between the boxes than in the boxes themselves, in many cases — and the navigation between those boxes is the cache misses. Picking your data structures is what really matters here: taking the time to understand how things relate, and then picking the right data structure for each relationship, is the key to a lot of this.

So make friends with your data structures. We don't just have maps and lists; there are many types of really interesting data structures, and they're well documented going right back through many decades of our industry. The way I look at this: I've been programming professionally for 25 years, and the stuff I learned about data structures at the beginning of my career is still absolutely applicable today. I've been writing code even longer than that — I was playing with games in the 1980s — and things I learned about maps and hashing from some of the early Knuth work I can still use today.
This is stuff that stays with you for your entire career. Learn a library in Java or C# and it may have a lifespan of one or two years; if it's JavaScript, maybe one or two weeks. Is that a good investment for your career? Data structures are a good investment for your career. This stuff is really useful and it sticks with you.

So let's look at some other techniques. I find batching really interesting, and the key to doing batching well is this: everything we deal with has an expensive cost somewhere, and we want to amortize those expensive costs. I take a cache miss, I go to memory, I pay the 100 nanoseconds — can I get a few things back for that rather than one? That's the cache lines again: the two fields together in the same object. I take the cost once and get the two things, the three things, whatever it happens to be. That's getting the modeling right — but it applies to other things too.

Say I want to write to disk — a logging framework, for example, where I'm writing to disk because I don't want to lose anything. And here's another really irksome thing about logging frameworks: the ones we have tend to write into an in-memory buffer, and when that buffer is full, write it down to disk. Guess what happens if your process crashes? The buffer that hasn't yet been written to disk is gone — and the most useful information to help you debug the crash was probably in that buffer. We miss the fundamentals of design when we put these things together. If that were a memory-mapped file and your process went pop, the operating system would keep the memory-mapped file, and you could go and look at it and maybe find out why your process crashed in the first place. Get the design right.

So say we want to write to disk, a write takes one unit of time, and there are ten producers — there are only three on the slide, but say ten. If we wrap the write with a lock and ten things turn up in a burst, we've got a queue: the first takes one unit of time, right through to the last taking ten, waiting at the back of the queue. On average, everything waits five units of time to get through that burst. Now suppose instead they can all just encode into a data structure — this is where things like the Disruptor, which I was involved with, came about: a structure that's very fast for doing this. Say all ten turn up at the same time and get written in, and on the other side there's a batching consumer that goes to the structure, picks up what's there, and writes it down — to the database, the file system, the network, whatever it happens to be. The worst case is that it catches just the first one, goes and writes that down, comes back, finds the other nine, and writes them all down: the whole burst done in two units of time, with the average a bit under that. More than likely a lot of the burst — maybe all of it — goes through in the first write. You've amortized the cost: the average is under two units of time, where it was five in the other case.
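A minimal sketch of that consume-the-burst-at-once shape — using a plain BlockingQueue for simplicity, not the Disruptor itself:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Producers enqueue concurrently; one consumer drains whatever burst has
// accumulated and pays the expensive write once for the whole batch.
final class BatchingWriter {
    private final BlockingQueue<String> queue = new ArrayBlockingQueue<>(1024);

    // Producers never pay the expensive cost themselves.
    void publish(String message) throws InterruptedException {
        queue.put(message);
    }

    // The single consumer loop.
    void drainLoop() throws InterruptedException {
        List<String> batch = new ArrayList<>();
        while (true) {
            batch.add(queue.take());   // wait for at least one item
            queue.drainTo(batch);      // then grab the rest of the burst
            writeAll(batch);           // one expensive write, amortized
            batch.clear();
        }
    }

    private void writeAll(List<String> batch) {
        // stand-in for the disk/network write that costs one unit of time
    }
}
```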
It's the mathematics of this stuff — we cannot avoid it, we've got to be aware of it, it's everywhere. And what we can do, if we think about the design up front and consider that we can batch and amortize every expensive cost, is change the shape of the curve. Remember the J-curve from the queueing effect — that's the red line, what we typically get. The blue line becomes possible if you design batching through your whole system. Anybody in financial trading knows the systems that handle the bursts in the market are usually the ones that win — and you have to do it right down the stack and know what's going on. Ultimately everything saturates: you will eventually reach full utilization and go up the curve again, but you stay reasonable for a lot longer, doing it a lot better. There are loads of examples of how this applies — from respecting the spatial bet, to how you use your network, to overall algorithm design and particularly API design, and I'll show some examples of that. Batching is not boring; it's actually really cool for design. In many ways I want to campaign for all the boring stuff: people should actually learn it, because it's useful and it's good.

Now, our code is full of branches — an if statement, a while, a for, a case. All of these are branches: we reach a point in the code where the compiler has to jump somewhere else based on some condition. As I mentioned earlier, processors speculate: they run forward as fast as they can, and if they get the bet right, great. They hit a branch — is i less than 7? It typically is, so go ahead — and find out later whether the bet paid off. If it didn't, they've got to unwind all that state and try again; the fewer times they have to do that, the better. There's also a finite buffer of branch history — around 16,000 branches on a typical processor. That's a lot in some ways, but many code bases are massive, with far more than 16,000 branches in them.

So where does this typically show up? Let me pick on one really simple case. Say I want to do some work iterating over a collection — a list of something. You'll get people doing silly things, thinking they're optimizing, like this piece of code here: they null-check it (that's another problem — I'll come to it in a second), or they check if it's empty first. Think about the common case: if there's usually something in there, you do the empty check and then iterate anyway. Why not just iterate? Why not have the same cost all the time? Unless the collection is empty, the check costs you extra — and when it is empty, the iteration itself is already nearly free. So you've added branches — and you've probably got the null check because you're passing nulls around your code, which you shouldn't be. Come on, it's nearly the end of 2016: we should not be using nulls in our code anymore. Pass around sensible first-class objects. I got to meet Tony Hoare a few years ago — a really proud moment — the man who invented the null reference, and he still badly regrets it. So get rid of that sort of code. Keep your code simple and clean: more of it will fit in the cache, and the predictable paths through it will work a lot better.
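The piece of code I'm picking on looks something like this — hypothetical names, but you'll recognize the shape:

```java
import java.util.List;

class OrderProcessing {
    static class Order { /* fields elided */ }

    static void handle(Order order) { /* the real work */ }

    // Branches "guarding" the common case. Iterating an empty list is
    // already nearly free, so the isEmpty check buys nothing, and the
    // null check only exists because nulls are being passed around.
    static void processGuarded(List<Order> orders) {
        if (orders != null) {
            if (!orders.isEmpty()) {
                for (Order order : orders) {
                    handle(order);
                }
            }
        }
    }

    // Same result, same cost every time, nothing extra to mispredict --
    // and the contract is: never pass null, pass an empty collection.
    static void process(List<Order> orders) {
        for (Order order : orders) {
            handle(order);
        }
    }
}
```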
Try to have least surprise in your code: don't surprise the processor, and don't surprise your fellow workmates by passing nulls and all this complicated stuff around. If you see code that's just a mess of ifs and whiles sprawling right across your screen, it's going to be really hard for the processor and the compiler to optimize. Take the time and refine it.

This particularly matters for loops. I've seen loads and loads of statistics — from other people, and from measuring our own systems — showing that Pareto's principle really applies here: typically about 80% of the time is spent in a few hot loops somewhere in your code. We see this over and over in all the systems we measure: there are always hot loops where most of the time goes. Find those loops and make them much more efficient and you get a lot of benefit.

I also think the old quote — "if I had more time, I would have written a shorter letter" — applies to our code. Usually, when we come back and re-read our code, we can make it much more efficient and cleaner, and I think that really matters for our working practices. Something I've tried to do over the years, and do more often, is this: I'll write some code, re-read it to catch the obvious mistakes, get the tests passing — but I would not say I'm done at that stage. I go and start something else, and then I come back and re-read it again. Re-reading after you've done something else has a really interesting effect, and it's down to how short-term memory works: if you look at something again very soon after writing it, without doing anything significant in between, you're not really re-reading it — you're just replaying your short-term memory. Who's ever written an important email, sent it straight away, and then the next day, when the firestorm hit, thought: I shouldn't have sent that — if only I'd re-read it after some time. It's the same with our code. Get a pipeline going where you come back to things and make them more elegant and cleaner. It's a natural way of working that respects how we actually function, and it gets the best out of us.

And why does some of this stuff matter so much for loops?
Well, inside our processors, as many people will be aware, we have main memory and various levels of cache; you'll typically hear of L1, L2, L3. What a lot of people don't know is that on the instruction side there's an L0 cache behind the L1. Most of us are writing x86 code these days in some form: one of our compilers, be it the C compiler, the Java compiler, the .NET compiler, the JavaScript compiler, whatever, is ultimately spitting out x86 instructions, and they get sent to our CPU. But that is not what our CPUs actually run. Our CPUs treat those as high-level instructions and decode them down to even simpler instructions called micro-ops. That decoding process is quite expensive in energy and time, but the decoded instructions can be kept in a little cache, called the L0 cache. It only holds around 1,500 micro-ops, which is a very small amount, but if, say, your major loop fits nicely inside it, your code goes a lot faster because you're just working inside that cache. It's also using a lot less energy, a lot less electricity, and so it can run faster, because our processors can boost higher whenever they're not using as much energy and overheating. If your code is just too big to fit in there, you're constantly churning and having to decode instructions over and over again. So keeping code small and elegant really matters. Follow the single responsibility principle; don't cram more and more stuff into the core loops of your code. And even behind the decoders and the L0 cache, we have another small queue of instructions, which tends to be about 28 instructions on the latest processors. If you've got really nice little loops, like running across an array summing things up, they fit really well in that and run really fast. If you start adding stuff into those loops for no good business reason, you can end up hurting performance quite a lot. So keeping things simple, small, and elegant can really work well here. Now, I'm not saying you must force all loops to be small; you don't need to do extra work. But the default shouldn't be that you cram extra code into loops just to get something done in a hurry. Keep things simple; keep them following single responsibility. Look at crafting your loops like good prose, especially the ones that are key to the core of what you're doing. The profiler will tell you where your key loops are.

Composition is interesting from a design perspective, and this is where size really matters. We can make our methods as big as we want, or as small as we want. In the past, especially if you were writing C code a long while ago, the function call overhead was quite expensive. That is not the case now with our managed runtime languages. There's a quote here from Cliff Click; if anybody knows who Cliff Click is, he wrote the HotSpot C2 compiler, the server compiler that Java is using. He talks about how the compiler can do such a good job of inlining the code that you should not be writing big methods. Smaller methods give the compiler the chance to compile the code based on its usage patterns, because it can collect data at runtime and assemble the code to be really efficient. It's really hard to break things apart; it's much easier to assemble things together. And we're talking limits like: over about 35 bytecodes, a method is not going to automatically inline, and over, I think it's around two to three hundred bytecodes, it can't be inlined at all, even if it's considered hot, and so the compiler can't do certain things.
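For example, a tiny single-purpose loop like this one (a generic sketch, not code from the talk) decodes once and can run out of the micro-op cache and loop buffer, and it's small enough for HotSpot to inline and often vectorize:

```java
final class LoopShapes {
    // A small, single-responsibility hot loop: it fits the micro-op cache
    // and loop buffer, and is well under the auto-inlining size threshold.
    static long sum(final int[] values) {
        long total = 0;
        for (int i = 0; i < values.length; i++) {
            total += values[i];
        }
        return total;
    }
}
```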
There's a simple rule for working out the right size of a method: if you can't cover the method with your hand held up in front of the screen, it's too big. And a 4K screen is not an excuse to get around that. Keep things really simple and the compiler can do the right thing. Just focus on single responsibility. If a method does one thing and does it well, it's easy to test, it's easy to reason about, and it's easy for the compiler to compose together with other things. Keep these things really simple, and the small pieces can come together to form whatever complex thing we need.

This next part is one I think is really, really important, so hopefully you've stayed awake for it: APIs. They have the biggest impact on performance, more than anything else, because they limit the design of how things work. Let's look at a simple example; well, a not-so-simple example. I read a lot of networking code, and this is NIO dealing with selectors. You're probably having to squint a bit; it kind of makes my eyes bleed looking at it. If I want to see whether data is available across a number of sockets, I've got a selector and I select across all the sockets. So I call selectNow, I then get the keys that have been selected, I get an iterator, I go over the iterator calling next, I get the attachment, I process it, and then I've got to remember to remove the key again, which is really easy to forget. This code is error-prone, and it's really inefficient. One of its big problems is that it's got a lot of allocation: you have to allocate the selected key set, you've got to allocate the iterator, you've got to deal with all of this, and it costs so much CPU. It's actually a really terrible design; if you work with it, you find it's very, very error-prone.

If you change the design of the API, what could it look like? Rather than having the selector create the selected key set, how about if we could just pass in a collection that we reuse over and over again? So at the top I create myself a list of keys, I call selectNow and pass in the keys, and maybe pass in filtering criteria to find out whether any of these things are readable, and then I can just iterate over them afterwards. Now, as people will say, this is a functional design, and it's the way Java's going now, but we could have done this from the very beginning: it's just an API that could have taken an inner class, an interface that's passed in. Now you don't have to do the allocation, and you don't have to take the whole mass of locks that the other API design imposes. It's interesting; I think it comes down to this framework idea of imposing ways of working versus giving the user the choice. I can choose the type of collection I pass in, as long as it meets the contract, rather than having the API allocate a collection and pass it back to me with no control on my side. If anybody's dug into the source code of Aeron, you'll see examples of where we had to work around stuff like this to get good performance. I've had to do nasty things like use reflection to go inside NIO and change some of its internal collections to better choices that are more efficient for what it's doing. It's kind of filthy, but you end up doing these things to get decent performance. So be aware; there are loads of examples of this around. Look at String.split, for example: the fact that it allocates and returns the result collection for you is really, really bad.
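Here is a sketch of the two shapes being contrasted. The first is the real NIO pattern; the second is the reshaped API suggested in the talk, where BetterSelector and its selectNow(keys, interestOps) overload are hypothetical, not part of NIO:

```java
import java.io.IOException;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

final class SelectorExample {
    // The real NIO shape described above: allocation-heavy and easy to get wrong.
    static void poll(Selector selector) throws IOException {
        selector.selectNow();
        Iterator<SelectionKey> it = selector.selectedKeys().iterator();
        while (it.hasNext()) {
            SelectionKey key = it.next();
            process(key.attachment());
            it.remove(); // easy to forget; the set and iterator are allocated for us
        }
    }

    // Hypothetical reshaped API: pass in a collection we own and reuse.
    interface BetterSelector {
        int selectNow(List<SelectionKey> intoKeys, int interestOps) throws IOException;
    }

    private final List<SelectionKey> keys = new ArrayList<>(); // created once, reused

    void pollReusing(BetterSelector selector) throws IOException {
        int n = selector.selectNow(keys, SelectionKey.OP_READ); // no allocation per call
        for (int i = 0; i < n; i++) {
            process(keys.get(i).attachment());
        }
        keys.clear();
    }

    static void process(Object attachment) { /* handle the data */ }
}
```

As it happens, later JDKs did eventually add a callback-style Selector.select(Consumer&lt;SelectionKey&gt;) in this spirit.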
If you could pass in the collection for it to fill, you'd have total choice, and you also wouldn't need to hold the memory for the whole result collection at once; you could stream through it, and streaming through memory gives you a much smaller working set.

Data is interesting too. We're having to deal a lot more with data in our applications, doing analysis of it, and data is often just big grids. Now, one way you could implement a typical big grid in Java would be to create an array of objects. Those objects will be scattered all over the heap, and that will cause lots of cache misses. If you want to go through, say, all customers and find the average date of birth for a given set of customers, or all customers below a certain age, it becomes really expensive. Instead, rather than creating one array of objects, create one array per field, for each of the properties of those objects, so you end up with a group of arrays. What's fascinating about this is that if I want to go down and find all the dates of birth, look at what the memory subsystem is going to do: it's going to be getting bunches of these together in cache lines, with a predictable pattern of access, so it can prefetch as it runs down through them. On top of that, it can do things like vectorized operations, where our processors apply the same instruction to multiple items of data per cycle, so you can be doing matches, summations, all sorts of interesting maths across a lot of that data. Think about data in different ways, and think about how the memory subsystem has to deal with it. If you want all the fields for one object, you just take them across the arrays at a given index, and this can all be hidden behind an interface and done in an interesting way. Start to think differently; this is just a plea not to fall into the usual traps. I can iterate down a column like that, and I can even iterate down multiple columns at a time, because our prefetchers can deal with many streams independently; it's not just one, and they deal with this really well. The important thing is that the accesses must be predictable; they can't just be random patterns of access. So start looking at other paradigms to get some of this thinking. If you're just doing OO, check out set theory; set theory is awesome. Things like functional programming are useful as well. Start combining multiple paradigms; they can all be really useful in different cases.
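A minimal sketch of that one-array-per-field (columnar) layout; the field names are illustrative:

```java
// Structure-of-arrays layout: each field is its own contiguous array,
// rather than an array of scattered objects.
final class Customers {
    final long[] id;
    final int[] birthYear;

    Customers(final int size) {
        id = new long[size];
        birthYear = new int[size];
    }

    // Scanning one column is a predictable, prefetch-friendly access
    // pattern that the JIT can often vectorize.
    double averageBirthYear() {
        long sum = 0;
        for (int i = 0; i < birthYear.length; i++) {
            sum += birthYear[i];
        }
        return (double) sum / birthYear.length;
    }
}
```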
So let's start wrapping this up a little bit and talk about performance testing itself; that's a lot of the theory out of the way, so what should you actually do? Fundamentally, with performance you've got to have goals set out, so you know when you're done. You need to establish: what are the throughput targets, and what are the response time targets? Don't just work with adjectives; you need units and numbers, and you work towards them. Then start measuring things like response time properly. Who here uses averages; anyone willing to admit that? Let's explore averages a little. Say I use a histogram: rather than just taking an average, I capture every single value, and for a given response time I increment a count. We end up with an interesting distribution that looks something like this, with a log scale along the bottom. Now look at the different types of averages; there are three basic ones people use. One is the mode, the most common occurrence that we see: reasonably useful in some ways. Another is the median: line everything up and take the middle of the range; sort of useful, but typically not very. And the mean, which is what most people normally mean by an average: add them all up and divide by the number of things. It's really not very useful at all. In this context, what does the mean tell you? It tells you nothing about the really common occurrence, which is really important, and it tells you nothing about all that interesting stuff out in the tail. In fact, it's given me a value that hardly ever occurs in the sample set, which is completely useless. So please do not use averages. If you're going to measure your systems, start capturing into histograms; histograms are really, really important.

Also be aware of a thing called coordinated omission. As I mentioned earlier, we often measure the service time of something but not the overall response time. How do you get around that problem, especially if you're measuring internally? You have to be aware that pause-type events happen. If you have a systemic pause, you know requests are coming in at a certain rate; they're going to be building up in buffers, and they're going to be subject to that delay. If you don't account for those, your numbers are going to be badly skewed. Look into it; Google "coordinated omission", there are talks on the subject. You capture a lot of this by using histograms. HdrHistogram is a great tool for recording sample times: if you record into it and get a cache hit, recording into the histogram itself takes about 6 nanoseconds. Getting the time from the operating system costs an order of magnitude more than recording into the histogram. So don't be shy with these things; they're really cheap, effective, and use very little space.

How do you run performance tests? Well, if you're in the Java world, JMH is the gold standard at the moment. It's a really useful tool that avoids a lot of the classic problems; have a look at it, and look at the samples that come with it, it's really worth seeing and playing with. But how do you find the low-level stuff? I've mentioned things like branch misses, cache misses, and CPU cycles. Our CPUs keep counters of all of these things, and you can query them to find out what's happening; they're exposed via the MSRs, the model-specific registers, which is where the CPU performance counters live. A tool like JMH can sample all of these if you're running on Linux. If you're running on your shiny OS X Mac, really, you're not going to be doing very good performance testing: it doesn't behave anything like the typical systems you're running in production, you can't get the performance counters, and you can't get a lot of the other useful timing metrics that come out of measuring. Macs may be nice for developing UIs, but if you're developing server-side apps, seriously, get yourself a Linux box; or get yourself a nice pretty Apple Mac, get rid of OS X, and put a real operating system on it instead. Much better.
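To make the histogram advice concrete, here's a minimal sketch using the open-source HdrHistogram library; doWork() is a stand-in for whatever operation is being measured:

```java
import org.HdrHistogram.Histogram;

public final class LatencyRecorder {
    public static void main(String[] args) {
        Histogram histogram = new Histogram(3); // 3 significant decimal digits

        for (int i = 0; i < 1_000_000; i++) {
            long start = System.nanoTime();
            doWork();
            // Recording is a few nanoseconds; cheaper than reading the clock.
            histogram.recordValue(System.nanoTime() - start);
        }

        // Report the distribution, not an average.
        System.out.println("median: " + histogram.getValueAtPercentile(50.0) + " ns");
        System.out.println("p99.9:  " + histogram.getValueAtPercentile(99.9) + " ns");
    }

    private static void doWork() { /* the code under test */ }
}
```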
But also, get your performance tests into continuous integration and see what your code's doing. The whole point of agile development is that you want to shorten the feedback cycle. Think about it: if you introduce a performance regression and you're constantly running performance tests, you'll know the commit that caused it, and you'll be much more aware of what's going on. If you just build the whole project and tack some performance testing on at the end, you're not going to learn very much, unlike when it's a continuous thing. Run these tools, capture the output, put it in a database, even graph it; it's really interesting to watch. In many of the offices I've worked in, we put big screens all around the place with all our CI details, including our performance graphs and whether we're getting faster or slower as we develop our applications. Really useful stuff. And if you've got them up on screens, even the non-developers will walk past and go, "Why is this going down?" "Well, we're getting slower." And whenever the CEO starts asking why your system is getting slower, you usually get the budget to go and fix it and make it better. So be aware of that.

Also think about your live systems: build telemetry into them. In any other engineering discipline, electronics for example, everybody puts telemetry into their systems. We're so far behind with this in software, and it makes things so much easier to understand when you put it in. I'll say it louder: please build telemetry into your systems, and don't do it as an afterthought. It stops us going blind in live systems, and there are some really good examples out there. Telemetry should be dirt cheap and should not be a performance impact on your system. And when I say this, I'm not talking about JMX and all sorts of other abominations like that, which make your code really slow and create a lot of garbage. Look at another world, say Formula One. I'm a big petrolhead, and it's kind of cool that the Marina Bay track is just outside there, where the Formula One cars will be going past. The reason those teams are so good is that they've got so much telemetry telling them what things are doing and how to make them better. I worked on a messaging system, and the good thing about putting it out as open source is that we put out examples of how to do good telemetry: things like queue lengths, packet rates, error rates. We capture them into memory-mapped files, and they can then be read from another process at effectively no cost or impact to the process that's creating them. The examples are the system counters in Aeron; it's open source, go look at it as an example. And do capture all sorts of interesting metrics. Going back to queuing theory: capture your service times, capture your queue lengths (your queues are everywhere), capture the arrival rates of your customers, your transactions, everything like that. Then you can graph it all out and plot the behaviour of your system; you can even predict problems before they happen. Capture them into histograms too: histograms like HdrHistogram can be serialized down to tens of bytes quite cheaply and logged, per minute, per hour, per whatever you want. You'll find out so much more about your system.
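As a rough sketch of the memory-mapped counters idea, in the spirit of Aeron's counters file (the file name, offset, and counter here are illustrative; a real implementation adds memory ordering, a label section, and lifecycle management):

```java
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public final class TelemetryCounters {
    private static final int MSGS_RECEIVED_OFFSET = 0; // one long slot per counter

    public static void main(String[] args) throws Exception {
        try (RandomAccessFile file = new RandomAccessFile("counters.dat", "rw");
             FileChannel channel = file.getChannel()) {
            MappedByteBuffer counters =
                channel.map(FileChannel.MapMode.READ_WRITE, 0, 4096);

            // Incrementing is just a store into shared memory; a monitoring
            // process can map the same file and read it with no impact here.
            long current = counters.getLong(MSGS_RECEIVED_OFFSET);
            counters.putLong(MSGS_RECEIVED_OFFSET, current + 1);
        }
    }
}
```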
Let's wrap this up now. In closing: by clean we mean uncontaminated. Get the rubbish out of your code; have the things in your code be things that add value and do the right sort of thing, not a burden on everything else. A great aircraft, or a good spacecraft, is one that has got nothing left that you can remove. Can you say that about your code? Here's an interesting thing to start measuring: measure the lines of code deleted on a project and see what that does for it. I guarantee it will speed up your team in the long run; not just the performance of your code, but the performance of your team will also improve. And this is all just good design fundamentals. Getting the design fundamentals right results in very good, very clean, very fast code, which is also much easier to secure and deal with. To take a quote from Bill Lear, the inventor of the Learjet, he had a great comment about aircraft: "If it looks good, it tends to fly good." I find it's very similar with the performance of software. Really good high-performance software tends to be clean, elegant, simple, and easy to understand, and you know it goes fast because there's nothing obvious in there slowing it down. To make it really clear and simple, on that I'll wrap up. Thank you all very much. Do we have questions?

[Audience] Yes. Now that there are columnar in-memory stores like Apache Arrow and things like that, how do messaging systems align with those? Messages arrive one by one, but we want to store and process things in columnar form. Do you have any examples?

Let me make sure I've got the question right: what were the products you listed to begin with?

[Audience] For example, you have something like Apache Arrow or some other columnar thing for processing, and you receive messages which arrive one by one, serialized. How do you align them together?

OK, it's a good question. Basically, how do we amortize the cost; how do we batch things up? We apply techniques like batching, and that's where things like the Disruptor have been used in the past. I've been doing a lot with Aeron and where it's going, and the reason Aeron has really good throughput and really low latency is that it naturally batches things on and off the network: when you get bursts, it completely fills the network packets. We've done similar things with how you write things down: things are coming in one by one, so how do you buffer them up and write them all down together, with nice handoffs and pipelines in the design? If you look at the internals of Aeron, it's completely linear in its memory access; you just go forward. The nice thing is that I've been involved in a number of different messaging products. It was Todd Montgomery who built the system with me, and he'd done LBM, PGM, SmartSockets, and lots of other products in the past, and we've both learned from the mistakes in those. One of the lessons is that you've got to respect modern hardware to get the best out of it. If you design any product like that these days, data stores, messaging, anything, and you don't respect how the hardware works, it just won't perform and it won't scale. So it is starting to happen, but a lot of the stuff out there is built by people who don't have any of that sort of knowledge, which is a shame.

[Audience] And what about JSON and all these text-based types of protocols; is that something you're trying to spread the word against?
Yeah, we go with all these text-based protocols because they're easy; they're easy to debug. But it's actually much easier to deal with binary protocols if you build the tools to deal with them. We can't actually debug JSON directly anyway; we do it because we use text editors, and text editors are the tools we have. We just need the tools for the binary stuff, and then it's much easier as well.

[Audience] Could you speak more on profiling, and how do you find bottlenecks in code?

OK, yes, it's a good question, and it's also a tricky one to answer well. So, how do you profile and find things in code? The first thing is that profilers lie to you. They lie to you in different and interesting ways; they tell you partial truths. So you've got to look at what you're looking for from different angles, and different profilers will help you out in different ways. I'll give you some examples. Anybody here use JVisualVM, JProfiler, YourKit, that sort of thing? Most of those use a technique called stack sampling. What they do is stop every thread that's running and then look at where each stack is at the point where they've stopped all the threads. That means they've got to use things called safepoints: your code gets safepoints inserted into it, and safepoints typically exist at the back edge of a loop and at the entry or exit of a method. Now, if you have lots of code that's been inlined because it's hot, it'll tend to have no safepoints, because there's no longer an entry and exit to the method. Also, with loops, there are two types: counted and uncounted. A counted loop is typically a for loop indexed by an integer; if it's indexed by a long, or if it's unclear how many times it's going to go round, it's known as an uncounted loop. Uncounted loops have a safepoint on every iteration; counted loops have no safepoints at all. So what you'll find is that if your problem is CPU-intensive, those profilers don't find it; they point at things far away from the real problem, because the really hot code, well optimized by the compiler, has no safepoints in it. So if you want to use JVisualVM, YourKit, JProbe, all of those, use them for finding things that take a long period of time, like "I've sent a JDBC call to the database". It's different for other types of profilers, things like Honest Profiler, or Mission Control with Flight Recorder: they sample the stacks of running threads without waiting for the threads to come to a safepoint, and they're very good for finding where your CPU time is spent on a thread. But they don't catch things like threads waiting on locks or threads waiting on the database, because those threads aren't running; they only catch the threads that are running. So there are two examples of totally different views, and it depends on what type of problem you've got. And then you've got things like allocation: is allocating objects causing a lot of the problem? If you use something like YourKit for that, it will use bytecode weaving to find out where your objects are being allocated. That will actually defeat things like escape analysis, make methods bigger, and change the whole behaviour of your system; it's actually a pretty poor way to catch that. If you use Mission Control with Flight Recorder, it uses hooks inside the JVM and finds object allocation really well.
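A sketch of the counted versus uncounted loop distinction just described (generic examples; the exact behaviour depends on JVM version and flags):

```java
final class SafepointLoops {
    // Counted loop: an int index with a simple bound. HotSpot typically
    // strips the safepoint poll from the loop body, so safepoint-based
    // samplers barely see time spent here.
    static long countedSum(int[] data) {
        long sum = 0;
        for (int i = 0; i < data.length; i++) {
            sum += data[i];
        }
        return sum;
    }

    // Uncounted loop: a long index (or an unclear trip count) keeps a
    // safepoint poll on the loop's back edge, so it remains visible to
    // safepoint-based samplers.
    static long uncountedSum(int[] data) {
        long sum = 0;
        for (long i = 0; i < data.length; i++) {
            sum += data[(int) i];
        }
        return sum;
    }
}
```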
So knowing the strengths and weaknesses of the various profilers is important. What you need to do is run multiple profilers, often, so you get different perspectives on the problem, and work it out from there. You may even want to look right down at the CPU level with things like perf and VTune and other such profilers. But again: different levels, different problems. I tend to use a whole range, just to get multiple views.

[Audience] About queuing: most of the time in finance you come into these situations where you get a burst of messages, and there's network latency and queuing latency, and we need to take care of all those things. So how do you minimize that; how do you drain the queue in the least amount of time so that there's no waiting in the queue?

Two very simple things. One is to focus on your service time and have it well optimized to deal with the burst as best you can, so your utilization stays low. The other major thing you can do is look at how you deal with the queue in batches: can you pick off multiple items and amortize the cost? There's a whole range of design techniques for dealing with the different cases, but those are the two fundamental things you've got to do, and you use different implementations to do it.

OK, well, thanks very much.