Okay, to introduce myself a little. I'm a Java Champion, and I have one of the highest answer counts in the world on Stack Overflow for Java and the JVM. I was the first gold badge holder in memory, file, and concurrency in any language on Stack Overflow. My most recent article got about 17,000 views in a couple of days, and some of the content here is based on that article and on previous ones.

As an organization, a lot of our clients are in banking, but a lot of our open-source software is used all over the world. We get about 7,000 downloads a month from distinct IP addresses for our individual open-source libraries, and about four million a month across all of our products. Commercially, though, 80% of our revenue comes from tier-one banks. One you probably know is UOB, for example; it's a local bank.

So today I'm talking about a number of topics that have really got my interest and attention recently, things I see again and again across different clients. One of the things that frustrates me the most is the level of what's called accidental complexity. Accidental complexity is where your application spends a lot of time doing things that are not actually business requirements, or even requirements from IT. We've seen cases quite recently where two or even three orders of magnitude of the time spent is not actually fulfilling the business need; it's just a consequence of the tools, the strategy, or the approach that was taken. So I'll look at a specific example that's covered in one of my articles.

Another article, the one that got the most interest, shows that allocations don't scale. As you have more and more threads creating objects, you reach a saturation point where the system cannot allocate more objects just by adding more threads. And this is shared globally: if you've got multiple virtual machines, it's not partitioned.
It's a shared resource, and one four-CPU virtual machine is enough to saturate an entire server, almost regardless of how many CPUs it has. So it is something you need to pay more attention to than you probably have in the past.

Then I'll look at both behavior-driven development and data-driven tests, and why they're relevant to low-latency systems in particular, but also make systems more maintainable and improve the velocity of development. Then I'll go into some coding samples. All of these have benchmarks and code; the articles have links to them from my blog, but I'll show you some of the code here today as well, to give you an idea of what it looks like.

If we have time, we'll get into some durability guarantees and why they're important to look at. In particular, if you have a database in your critical path, it can be such a dominant factor in your latency, and in determining your throughput, that nothing else you do probably matters much. So I'll look at some alternative strategies for keeping the database out of your critical path.

What we want to do is focus on clear requirements, ones that either come from the business, or from IT, or that we've established ourselves, and stick to those requirements. Then we ask: what's the most effective way of delivering on them, rather than merely something that works?
That discrepancy can be enormous. In particular, there are so many projects, we see this especially in banking, but I'd guess in other industries too, where the velocity of development declines quite rapidly, to the point where it's very hard to change anything compared to when the project started. What we want to do is avoid that happening. If, from the start, we make sure that the velocity of development can be maintained, then you have a much more maintainable project. You can keep delivering new changes to the market all the time, and your project won't essentially be put to one side with "we can't develop this anymore, it's just too hard, let's start again."

So accidental complexity is where your application is spending time or resources on things that aren't actually a requirement. This can also be developers spending time on things the business doesn't actually require, and that's often down to not having clear specifications: the business either can't tell you, or hasn't told you, or it's been communicated to someone else but not necessarily to you. Going back and focusing on what the business actually needs from you, because the business won't specify all the requirements, means you can cut down the amount of work your team has to do, cut down the amount of work the application has to do, and deliver much faster.

Essential complexity, by contrast, is what you actually need to do, what can't be taken away. Maybe you can do it a little differently, but fundamentally your application will need to do it, because it actually meets a requirement.

Now, how is it possible to get accidental complexity that large, two or three orders of magnitude?
One of the causes is that your application will have multiple levels of abstraction. Even in our case, we try to cut down the number of levels of abstraction, but still there are several. Each level of abstraction will often do a bit more, perhaps even a lot more, than is really required by your application at that time, and by the time you've run through multiple levels, each one adding a percentage, they multiply together until eventually you can get ten times, a hundred times, and in some cases even a thousand times more effort going in than is strictly required.

Let's take a concrete example, because so far I'm just throwing numbers at you. This is a benchmark where a message is sent via a queue to a service, or microservice, and that microservice sends back a reply, in fact the same message, via another queue, to be picked up by the client. Now look at the round-trip latency. Kafka is described as a low-latency product, and it is lower latency than a lot of alternatives; it's not a bad implementation, that's not what I'm suggesting. In fact, when I benchmarked it I was able to get latencies around half of what a lot of the vendors that support it publish, so I think they could actually do better: if they improved their benchmarks, they would get better numbers. So I can still get good results with Kafka. But doing exactly the same thing with Chronicle Queue, and this is the open-source version, the latency difference is around a factor of 750, to do the same thing: send a message, serialize it, deserialize it, persist it. This is all on one machine.
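To make the shape of that benchmark concrete, here is a minimal sketch with in-memory `BlockingQueue`s standing in for the real transports. Neither Kafka nor Chronicle Queue is involved, so the absolute numbers mean nothing; it only shows the send/echo/measure structure of a round-trip latency test.

```java
import java.util.Arrays;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class RoundTripSketch {
    // Sends `messages` requests through one queue to an echo thread, which sends
    // each one back on a second queue; returns the sorted round-trip times in ns.
    static long[] measureRoundTrips(int messages) {
        BlockingQueue<long[]> request = new ArrayBlockingQueue<>(1024);
        BlockingQueue<long[]> reply = new ArrayBlockingQueue<>(1024);
        Thread echo = new Thread(() -> {
            try {
                while (true) {
                    long[] msg = request.take();
                    if (msg.length == 0) return; // poison pill: stop the service
                    reply.put(msg);              // echo the same message back
                }
            } catch (InterruptedException ignored) { }
        });
        echo.start();
        long[] latencies = new long[messages];
        try {
            for (int i = 0; i < messages; i++) {
                long start = System.nanoTime();
                request.put(new long[]{start});
                reply.take();
                latencies[i] = System.nanoTime() - start;
            }
            request.put(new long[0]); // shut down the echo thread
            echo.join();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        Arrays.sort(latencies);
        return latencies;
    }

    public static void main(String[] args) {
        long[] l = measureRoundTrips(100_000);
        System.out.printf("median: %d ns, 99th percentile: %d ns%n",
                l[l.length / 2], l[l.length * 99 / 100]);
    }
}
```

A real benchmark would also warm up the JIT and control the send rate rather than sending as fast as the queues allow.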
So it's a very, very simple configuration, the simplest possible, and network traffic isn't really part of the equation. We're just looking at the overhead of passing a message to a service via a queue and getting it back via a queue.

This difference is so large that it can be difficult even to conceptualize. Part of the reason it's not such an obvious problem is the scale of the numbers. In this benchmark, on the 99th percentile, which is the worst one in a hundred, Kafka was getting about 2.6 milliseconds. In their own benchmarks they were getting about 5 milliseconds, so I think I was being reasonably fair to them by getting 2.6 milliseconds.

So how far can a signal travel in 2.6 milliseconds? Instead of representing it as an amount of time you can't even perceive, you can't see 2.6 milliseconds, let's turn it into a distance. That's something worth considering if you're trying to convince someone that this is a long time on a system where it's still shorter than anything you can perceive: how do you convey it another way? Well, 2.6 milliseconds is about the time it takes a signal to travel from Singapore to well past Kuala Lumpur. That's quite a long distance. With Chronicle, the time is around 30 microseconds, and the distance a signal can travel in that time is from Boat Quay to Clarke Quay, a much shorter distance. And, as I said, it's doing the same thing, meeting the same requirement.

It's even difficult to chart. This is a log chart, where the vertical axis is in powers of 10; it's the only way to get them both on the chart at the same time, because the difference is enormous. As you can see, on the 90th percentile you're getting around three microseconds down here, and up there
you're getting about 2,396 microseconds, and that's at a lower throughput. If you increase the throughput to something similar to what I was using for Chronicle, it goes up by another factor of 10, or even 100. So it's quite a huge discrepancy, and again, this is just doing the same work. And I'm not only talking about messaging here; we've seen other examples where doing the same function in a different manner can cut the cost by one, two, or even three orders of magnitude.

So what's the benefit of doing this? How does this help me as a developer? Because the business has said they're happy with a few milliseconds, so what is the benefit of being under a few milliseconds when you're already meeting the business requirement? Well, here's an example. This is how long it takes to build one of our key pieces of software, the eFX trading system: the time to compile all the code and run all the tests. Each of those tests starts up a service and shuts it down, so we're not timing how long it takes to send a message, but how long it takes to start up a service, test it, shut it down again, and check that it's doing the right thing. In this case it's running 466 tests, generating the Javadoc, building and creating the JAR, everything, and end to end it takes 77 seconds to do the whole lot on my development machine.
So this is just one machine testing the whole thing, including some integration tests. And one of the key points is that this is everything; obviously, if you're just running some point tests against one microservice for a key piece of functionality, you're talking about less than a second. The benefit to you is that your whole development life cycle is much, much shorter, because you can run through hundreds or even a thousand tests in a very short time frame, since everything runs very efficiently and very fast. And it can run on your local machine, without having to push the latest software up to some server, wait for it to upgrade, and then maybe run into some issue because you're using a server someone else is also using and you've got some sort of conflict going on. None of that happens here, because your system is efficient and fast enough to just run on a single machine. So that's a benefit to you: it helps you be more productive, not just run faster in production.

Moving on to the next topic. This is something I've seen again and again, though not so much in our space, because in the low-latency space you tend not to create a lot of objects, for exactly this reason. There is a perception that creating objects is cheap, and the truth is, if you've got one thread running a JMH test creating objects, they are cheap. But in production you're not running one thread; you're not just using one core.
You want to use all the cores, and if you use all the cores, they're all going through a small number of, maybe only one, L3 cache. They're all sharing that L3 cache, and it's a contended resource. So really, you want each of your threads to fit comfortably in its L1 and L2 caches. Even though you can now get gigabytes, even terabytes, of memory in a server, your L2 cache is around 256 KB per core, and that's for code and data. So the last thing you want to be doing is creating memory pressure by literally filling it with garbage as fast as you can.

In the example I've got here, the time spent allocating, compared to the time spent pausing, is a factor of 80: the cost of allocating is 80 times higher than the cost of the pauses themselves. Tuning the pauses reduces jitter, and that shows up in your metrics; it's something you record. But the cost of allocating can be much, much higher, and even a small number of allocations can make a big difference, at least to a benchmark. It won't necessarily affect your application directly, but it will affect your ability to handle high load or bursts of activity.

Again, the source for this is on my blog. In this case I've got two threads creating small objects of 44 bytes as fast as possible: how fast can two threads create 44-byte objects? You get a pretty good rate of 126 million a second, which is a pretty big number, and the allocation time is very low, 15 nanoseconds. Generally, if it's just 15 nanoseconds, you wouldn't need to worry about it, right?
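A crude stand-in for this kind of measurement looks like the following. The talk's numbers come from a proper benchmark; this sketch just has N threads allocating small byte arrays and counting them, and the timer call in the hot loop means the absolute figures will differ, but the scaling pattern as threads increase is the point.

```java
import java.util.concurrent.atomic.LongAdder;

public class AllocationScaling {
    static volatile Object sink; // written to so the JIT cannot remove the allocations

    // How many small objects can `threads` threads allocate per second?
    static long allocationsPerSecond(int threads, long runMillis) {
        LongAdder count = new LongAdder();
        long end = System.currentTimeMillis() + runMillis;
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                long local = 0;
                while (System.currentTimeMillis() < end) {
                    sink = new byte[44]; // a ~44-byte object, as in the talk's benchmark
                    local++;
                }
                count.add(local);
            });
            workers[i].start();
        }
        try {
            for (Thread t : workers) t.join();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return count.sum() * 1000 / runMillis;
    }

    public static void main(String[] args) {
        // Watch how little the rate improves past a few threads.
        for (int threads : new int[]{1, 2, 4, 8})
            System.out.printf("%d threads: %,d allocations/s%n",
                    threads, allocationsPerSecond(threads, 500));
    }
}
```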
However, by the time you get to just four threads, and this is a machine with 32 logical CPUs, it's starting to tick up. You're not getting twice the throughput anymore, and in fact the average latency has gone up by 20 percent, to 18 nanoseconds. But still, 18 nanoseconds is nothing to worry about; it's not going to make a big difference. However, as you add more threads, the throughput doesn't really go up. Six threads seems slightly better, but essentially at 32 threads, even across two JVMs, you get the same allocation rate as you did with four threads. All those additional threads weren't being utilized, or weren't adding any benefit, because, somewhat unrealistically, I'm creating objects as fast as I can. Importantly, in this case the average latency has gone up from 15 nanoseconds to 150 nanoseconds. It's taking ten times longer to do the same thing, because I've saturated the box. And it may not be you saturating the box: it could be another virtual CPU running on the same physical machine that's saturating the L3 cache or the memory bus and actually causing the problem.

Moving on, I've got a slightly more realistic test, where we're sending message events over network connections, over TCP. Again, it's all on one machine to keep it simple, so it's going over loopback, but I'm sending events as fast as possible. Similar to the previous scenario, I'm sending an event to an echo service whose only job is to send that event back again. Now, because we're in the low-latency space, we're not creating any objects to do this: the serialization, the deserialization, the lookup, the method calls, the proxies, none of them create any objects whatsoever. But then you might ask, why do that? Why go to all the effort of creating no objects? What difference would just one object make?
There are a lot of applications out there where, if you told the developers they could only create one object to handle a whole request, they'd probably think you're crazy. But even one object can make a difference. In the middle chart, I'm looking at how many events per second this system can process if it's not creating any objects whatsoever. I looked at different JVMs, different GCs and parameters, and the GC doesn't matter too much, because the bottleneck here, on the left-hand side, is the kernel's ability to push data around over loopback. The average latencies are very low, but they're so low that our 150 nanoseconds, although in this benchmark it's more like 170 nanoseconds, really makes a big difference.

So in this case, if you run the benchmark in a certain mode, it will create one object for every event instead of no objects. And that one object per event reduced my throughput by 25% already. That's one 44-byte object, not a big one. So just that one object has really cut down my throughput. You still get very good throughput, but obviously, as you add more and more objects, the throughput your system can sustain just goes down further and further.

One interesting thing here is that there's an uptick from Java 8 to Java 11, but otherwise the version didn't really make a lot of difference, for this use case at least. Java 17 happened to get a slightly better result; I'm not sure that would also happen on a different use case.
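The "one object per event" toggle can be sketched like this. It is single-threaded and in-process, so it will not reproduce the 25% figure from the networked benchmark; it only shows how such a switch is wired into an event handler.

```java
public class OneObjectPerEvent {
    interface Echo { void onEvent(long value); }

    static final Object[] SINK = new Object[1]; // keeps the allocation observable

    // Pushes `events` events through a trivial handler, optionally allocating
    // one small object per event, and returns the measured events/second.
    static long eventsPerSecond(boolean allocatePerEvent, long events) {
        Echo echo = value -> {
            if (allocatePerEvent)
                SINK[0] = new byte[44]; // the single ~44-byte object per event
        };
        long start = System.nanoTime();
        for (long i = 0; i < events; i++)
            echo.onEvent(i);
        long nanos = Math.max(System.nanoTime() - start, 1);
        return events * 1_000_000_000L / nanos;
    }

    public static void main(String[] args) {
        eventsPerSecond(false, 10_000_000); // warm-up runs for the JIT
        eventsPerSecond(true, 10_000_000);
        System.out.printf("no allocation:    %,d events/s%n", eventsPerSecond(false, 50_000_000));
        System.out.printf("one object/event: %,d events/s%n", eventsPerSecond(true, 50_000_000));
    }
}
```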
But anyway, let's go on. So, without going to all this work of benchmarking, is there something simple I can do to work out whether this is likely to be a problem? In fact, the peak allocation rate at which your system saturates doesn't actually vary that much between servers, to be honest. My high-end laptop, which has a battery life of an hour and something but goes very fast, peaks at about 8 gigabytes per second. A typical server will peak at about 10 gigabytes per second, which is a nice round number, and the Ryzen I was testing on before peaked at only about 12 gigabytes per second. I tried this on a number of different machines, and the variation between them really isn't that much. The figure Cliff Click quoted for a very similar point was around 10 gigabytes per second as well, as a good rule of thumb.

So from your perspective, all you need to do is look at the allocation rates of all the JVMs on your machine, total them up, and take that as a percentage of 10 gigabytes per second: that's roughly the proportion of time the system is spending allocating objects. If it's allocating at 1 gigabyte per second in total, then roughly 10% of your CPU is going towards allocation. That's quite high, really, and it's probably much higher than your GC time, if you added that up. In the benchmark earlier, allocation slowed the throughput down by 25%, yet the amount of time spent in GC was 0.3 percent: very low, very efficient. You'd look at that 0.3 percent and say, well, allocations aren't a problem for me.
GC and garbage aren't a problem for me, because it's only 0.3 percent; it's not making any difference. But actually, that's not the biggest problem. The biggest problem is the fact that you're creating memory pressure and then hitting a contended resource. So, to my earlier point: on some systems even two CPUs get close to saturating the machine, and on a good server even four is enough. Adding more NUMA regions and more sockets doesn't necessarily help very much. I mean, it will help, but it doesn't scale: with two sockets, the idea was that only about 25% more throughput could be achieved than with one.

Moving on to the next topic. Behavior-driven development is one of the best-practice techniques for ensuring that you extract requirements from the business that are clear, and it's something you do up front, making sure that, ideally, before you even start you have at least some high-level requirements for what they're looking for. It's a methodology for doing this. Now, this isn't just an exercise for the business; it will actually help you write low-latency, efficient software, because it allows you to focus on what you actually need to do end to end, not all the things going on in between. I don't know how many times people have told me something was a requirement when I know for a fact that if they tried to explain it to their business, the business would have no idea what they were doing or why. If you think something is a requirement, go and ask the business whether they even understand what you're doing, let alone whether it's something they would come to you and say: this field should be an integer and not a long. Then you'd have to explain what the difference is, why they should care, and how much money it will make them.
So this will help you focus at least on the requirements the business can give you. That doesn't mean all requirements come from the business; in reality, IT will have their own requirements. You will know from experience that there are things they are going to need which they don't even know they need yet, so some of the requirements will come from your own experience, or your team's. Your security team, for example, will have their own requirements, which are just the way they do things; you don't necessarily want to question them, you just say, right, that's what they're looking for, that's what we'll deliver. But even once you've got all of those, you find that focusing on just those things, and not getting too far into the detail too early, can help you deliver just the minimum the system needs to do.

The overall cycle here is that you start with some way of detailing the requirements at a high level; you create scenarios that fail, this is test-driven development; and then you create an implementation that fixes them. This isn't always the approach you need to take, but it isn't done enough in so many projects. Then you have a passing scenario, and one of the key steps here is refactoring: that's about reducing technical debt. Reducing technical debt might not seem important to the business, and this is one of the things you may need to explain to them: reducing technical debt will help you maintain velocity as the product matures and gets more complicated.
It will become progressively harder and harder to maintain if you're not paying off this technical debt: all of the things where, okay, yes, it works, but it does it in such a horrible way that it's going to be really hard to maintain later. So you want at least some of the time and effort spent making sure the code is clean, the implementations are clean, that it all makes sense and isn't just lots of things bolted together. That will help in the long run, particularly once you get to production: a lot of successful projects will cost three times as much to run over their life as they did to create in the first place, so making the system maintainable helps reduce that long-term cost.

One of the benefits of behavior-driven development, as I said, is that you can draw on the people who have the domain knowledge, either on the business or the IT side, and they can help give you the requirements. The problem is that they're not always accessible, and sometimes, when you get into the minutiae and the real detail that you still need to write a real program, they're not always interested, or don't have the time to go through absolutely every piece of detail. So a rough rule of thumb: say, for a large project, you want to end up with about a thousand tests, or a thousand different scenarios. In general, the business will be able to give you about ten off the top of their head before you've even started; right in the first conversation, they'll just come out.
These are the top ten things I need. Then over time you may extract more and more details and feedback, but you might only get to about a hundred. So there's still an enormous gap between what they can tell you and what you will probably need to really nail down the system and make it maintainable. Now, what you can also do, if you've got domain expertise and requirements from IT, is go through the same exercise there. You will know ten things off the top of your head, like DR or monitorability, all the things the business may take for granted, and over time you'll get another hundred or so things that you know need doing without being told. But there's still an enormous gap, and we'll get on to one way you can bridge it.

Because we're dealing with real-time, in our case low-latency, systems, we've modeled all of our key systems on an event-driven architecture: events in and events out, and in between you have a function whose behavior is entirely dependent on all the events it's ever seen. That way it's completely deterministic: the engine will process those events the same way every single time. And we record every event as well, so you can take all the events from production, for example, put them on a development PC, and either fix bugs or run them again and again to improve performance.
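This replay idea can be sketched minimally as follows. The service and event names here are illustrative, not our actual API: the point is only that the state is a pure function of the recorded event stream, so replaying the stream rebuilds the state exactly.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ReplayableOrderService {
    private final Map<String, Double> openOrders = new HashMap<>();

    // Every input arrives as an event; the handlers are the only way state changes.
    public void newOrder(String id, double qty) { openOrders.put(id, qty); }
    public void cancelOrder(String id)          { openOrders.remove(id); }

    public Map<String, Double> state() { return openOrders; }

    // Rebuild a service's state from a recorded event stream, e.g. from production.
    static ReplayableOrderService replay(List<String[]> recordedEvents) {
        ReplayableOrderService svc = new ReplayableOrderService();
        for (String[] e : recordedEvents) {
            if (e[0].equals("newOrder")) svc.newOrder(e[1], Double.parseDouble(e[2]));
            else if (e[0].equals("cancelOrder")) svc.cancelOrder(e[1]);
        }
        return svc;
    }

    public static void main(String[] args) {
        List<String[]> recorded = new ArrayList<>();
        recorded.add(new String[]{"newOrder", "A1", "100"});
        recorded.add(new String[]{"newOrder", "B2", "250"});
        recorded.add(new String[]{"cancelOrder", "A1"});
        // Replaying the same events always yields the same state: {B2=250.0}
        System.out.println(replay(recorded).state());
    }
}
```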
You can try out different performance approaches. Moving on: in this diagram, the ones in green are queues, and the boxes in the middle are the functions that are purely event-dependent. These are sometimes called lambda functions; more technically they're kappa functions, which are just event-driven systems. On the outside, where you're interacting with the outside world and other systems, we have gateways, whose job is just to normalize the data from whatever you're connecting to into your internal data model. They're usually stateless, and once you build a stateless component whose only job is to transform, it usually becomes quite stable: it only has to change when the protocol changes or your internal data model changes.

A lot of the really interesting stuff, the business logic, is in the control systems in the middle. They have to maintain some sort of state, in the sense that how they act depends on previous messages. For example, if you want to cancel an order, what happens will depend on what the order was, and that's a previous event. The way to still treat this as a function is to make it a function of every event it's ever seen: every input comes in as an event, every piece of reference data, every piece of configuration; absolutely everything gets pushed to it. And if it does need to go away and get some additional information, it emits a message on its output, which is read by some other system that picks up that information and feeds it back into the input.

So now, if you want to test this, or reproduce its state for a failover or for debugging, all you need is the inbound messages and the software for the control system, and you can recreate its state at any point, for debugging, for example, and for checking that you've fixed the problem. You don't need anything else: no access to production databases, no production systems of any kind.
You don't need to run any of the other components either, because when you're just recreating the state, they don't need to be running. That means you've got a nice, self-contained testing framework, and it makes it very easy to develop each microservice in isolation.

One of the key benefits of this approach is that it makes it very easy to create regression tests. A regression test isn't so much saying "for this input I expect that output", which would be nice; that would be the goal, that would be the hundred scenarios you got from the business and the hundred you got from IT. But sometimes that detail isn't available to you. So instead, you can create regression tests: you say, for this input, which I know is a scenario that will happen, or might happen, or that I need to consider could happen, what does the system actually produce? Then it's up to you or the team to decide how much of that to check. You can even make it a pure regression test where you don't look at the output at all. The benefit of that is that if something changes in the future, you will at least know that it changed; you'll know it now produces a different output than it used to. Which one is correct? You haven't spent the time investigating, but you at least know when something changes. Obviously, if you can spend some time asking whether the output looks sensible, or show it to the business and ask, does this look right?
Then that would be great, but it doesn't always happen in reality. You don't need it to: you can still generate a significant number of regression tests, have at least some stability, and at least be aware when things change unexpectedly. One of the big risks of any change is unintended consequences you're just not aware of, and this will at least help you cover those.

So regression tests are very simple. You take a series of inputs, and often you would just take valid inputs that have already come up previously, so you don't need to create them from scratch, and then you play with them: you make invalid inputs, you put duplicates in, you drop out messages that should be there, and just see what the system does. You record those results and then check whether that behavior ever changes in the future. And this can be used in realistic cases. In this diagram we've got a series of queues, in blue, in between a series of services. The same queue can feed multiple services; the pricer in this case takes data from multiple inputs. All of this can be canned and turned into tests, or regression tests, and the two techniques can be used together.

One of the areas where we differ, and I'm still talking about open-source products here, is that we advocate using a high-level interface, which is a Java interface with POJOs. That's it. We're not talking about flyweights over off-heap memory, or anything really super-low-level that can be very hard to work with.
We're talking about still doing low-latency coding, but with something that looks like what most Java developers would expect. So these are some sample interfaces. The top one is a really simple one I use in some of my examples: you've got one event, and that event is modeled as a single method call which takes one argument, a piece of text. In a slightly more realistic example, we've got an input interface to an OMS, an order management system: you can send in a new order single, or you can send a cancel request, and in each case it takes a DTO that carries all the associated information. So the order manager has an input interface, which is all the events that can come in, and another interface for all the events that can go out. Obviously you can use things like inheritance to compose these; it doesn't all have to be flat.

Going back to the simple example: it's asynchronous, a bit like asynchronous RPC in that you don't wait for the reply; it's just an event you've sent. But from a coding point of view it's really simple: you're just making a method call. You don't need to know all the details of how this is going to be serialized, what serialization format will be used, whether it's going to a queue or over TCP. None of that is in the business logic, because that's not what the business logic is about. The business logic is: I have this piece of information coming in, I need to do something with it, and I need to produce this piece of information going out. How you actually transport it isn't really a business concern. Sometimes it is; sometimes I'll know it needs to go over FIX, or it has to go to this Kafka queue, or over some other channel. But often, unless that's a specified requirement,
In that case you don't need to go through all of that in your business logic. The code to do this is really simple; there are two key elements. The component expects to be given an implementation of the interface it will write all of its outputs to - in this case, the one with a single method - and it has a method which is what gets called when an event comes in. In this case, because all it does is add an exclamation mark, when an event comes in to this component it produces the same event going out with an exclamation mark added. As you can see, this is not low-level coding: there are no bytes you're having to worry about, no off-heap you're having to worry about, none of those factors, because at this point you're only interested in what this function is supposed to do from the business-logic point of view. In reality, all of this can go to off-heap queues, be persisted to disk via memory-mapped files, and be shared over shared memory, but none of those details matter to what you're trying to describe and test at this level.

The way our testing framework works is that you have YAML for the input and YAML for the expected output. The benefit of separating the two is that the output can be maintained very easily. In the trivial case of a regression test, you don't need to know what the output should be - you can just generate it from scratch. Each event is modelled as a field of a document: "say", in this case, is the field name, but it's also the event name, which is the method name - it's a one-to-one match - and the string that comes after it is the string on the method call. When you make those method calls in a test harness, it produces exactly the same format as an output. Why go through all of this effort?
Well, the main benefit is when you need to maintain this. There are lots of different ways you can check whether you've got the right result or not, but a lot of the time is spent on: if you didn't get the right result, why didn't you get the right result? Sometimes, in complex examples, that's really hard to find out, and sometimes it can be really hard to fix, because - okay, I've worked out that this field has changed; I've now got to go into the code and figure out what to change in this test to make it pass. And then you have to do this for every test you've got. Say you've got several hundred tests to update: going through every test to figure out what's changed, what's wrong, fixing it, and then moving on to the next test can be a really, really tedious exercise.

Whereas with this approach you can even run it in a mode where it just does a regression: you run all 600 tests and it overwrites all the expected results, and instead what you do is review the changes when you check them in, or review them as a PR - you don't even have to touch the code. That's the ideal case. In reality it's not always that simple, because you may find some of the changes weren't intended or aren't desirable, and you actually have to fix the test or the code. But let's see what happens if there's a discrepancy. You're probably familiar with seeing this sort of thing when a unit test fails: it says click here to see the difference, and because it's doing a text comparison, the IDE already has support for this, and it looks like that. In this case I changed the input and ran the test again without changing the expected result. Can anyone spot the discrepancy?
As you can see, it's pretty clear, and if I want to treat it as a regression test I can take the actual result, copy, paste, overwrite the expected result, and now the test passes. Whether that's a good thing or not is down to your judgement, but if it is the right thing to do, it's really easy - and that's if you go through it manually; there's a mode where it will just overwrite it for you.

Let's look at a more realistic example, because the more complicated things get, the more this approach helps. This is a new order single from the previous interface, so we've now got a DTO with a number of fields of different types. One thing you may notice is the symbol, which is actually text - an instrument, something like euro-dollar or IBM shares. That could be implemented as a String or possibly an enum, but in this case it's being encoded as a long. All you have to do is add an annotation that specifies how to encode it to text and back again. You can come up with your own strategies, but Base85 has the nice property that, using the 85 most common ASCII characters, you can pack 10 characters into a long. So a variable-length string of up to 10 characters, covering most of the ASCII characters you're likely to use, still fits in a long. At the text level that doesn't make much difference, but at the binary level it makes a huge difference, because this creates no objects - it's stored as a long. It also means comparison is trivial: you're just comparing two longs, not having to look inside objects to compare them. You can also validate at this point - in your converter you can check whether the value is correct. That way you won't even import something invalid.
It's caught right from the start. Another very useful one is encoding the timestamp. Instead of using, say, LocalDateTime - which has nanosecond resolution, but is quite a complex object with quite a bit of overhead - you can just use a long which is the number of nanoseconds since 1970. And in a real example - this is a real test - you can see that the YAML is still readable as text: the timestamps appear as timestamps, the symbol appears as a string, but it's actually encoded for you as a long in binary, very efficiently. You can still have Strings, you can still have enums if you wish, but you have the option not to use them, for efficiency reasons.

So what happens if I break the test? Is it still easy to see what I broke? Well, you be the judge: can you see what I broke here? Can you also see that if I just wanted to take the expected result and make the test pass, that would be really easy - without even looking at the code? You could fix this without needing to understand it. If you decide that actually you shouldn't have done that and you need to fix the code, then obviously you may have to go into the test or into the original code, but the easy option couldn't be much simpler. We have examples with a megabyte of data in and a megabyte out, and tracing through all of that is really tedious if you break something; using this technique it will scroll down to the point and you'll see the exact thing that's different.

Now, in a slightly more realistic example we do add a couple of things for functionality, because we are interested in low latency, and these method names are quite long from our point of view.
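Going back to the field encodings just described, here is a rough sketch of the idea - packing a short symbol into a long via an 85-character alphabet, and a timestamp as nanoseconds since 1970. This is an illustration only, not the actual Base85 scheme or annotation-driven converter from the open-source library; a real implementation handles the full 10-character, unsigned-64-bit range, which this sketch does not.

```java
import java.time.Instant;

class LongEncodings {
    // An illustrative 85-character alphabet; index 0 also acts as "no character",
    // so symbols must not start or end with a space in this simplified scheme.
    static final String ALPHABET =
        " 0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz\"#$%&'()*+,-./";

    // Pack a short symbol into a long: comparisons become primitive ==,
    // and no object is allocated per event.
    static long encode(String symbol) {
        long value = 0;
        for (int i = 0; i < symbol.length(); i++) {
            int idx = ALPHABET.indexOf(symbol.charAt(i));
            if (idx < 0)
                throw new IllegalArgumentException("not encodable: " + symbol);
            value = value * 85 + idx;   // validate while converting, so bad data never gets in
        }
        return value;
    }

    static String decode(long value) {
        StringBuilder sb = new StringBuilder();
        while (value != 0) {
            sb.insert(0, ALPHABET.charAt((int) (value % 85)));
            value /= 85;
        }
        return sb.toString();
    }

    // A timestamp as a long: nanoseconds since 1970 instead of a LocalDateTime object.
    static long toNanos(Instant instant) {
        return instant.getEpochSecond() * 1_000_000_000L + instant.getNano();
    }
}
```

In the YAML the symbol still reads as text and the timestamp as a timestamp; only the in-memory and on-the-wire representation changes.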
Method-name length can actually make a difference in a benchmark, so you have the option to turn the names into numbers instead of strings. In fact, we were looking at some numbers today, and the four-nines latency - the worst one in 10,000 - can be reduced by a factor of two if you use some of the lower-level techniques; this is one of them. There are other techniques that, used in combination, can reduce your high-percentile latencies significantly. But you don't have to do this up front. Once you've got the system running and stable, you can play around with these annotations and see whether they help or not. Even the previous test, where I had a String and a date-time, is something you can change later: the YAML doesn't have to change, so once you've built all the tests, they're not touched by this - it's only the implementation you're playing with to optimize.

How are we going for time? I assume we're fine, so I'll just go through this quickly. One of the key things in any project, if you're looking at end-to-end latency, is: will there be a database in your critical path, and what durability guarantees do you need? If you're going to use a database, it will dominate, and then the question becomes how to use that database most effectively. However, there are other options that can still give you good durability guarantees - that is, how likely am I to lose data, and how much data might I lose?
You can get those guarantees without necessarily going down the database road, and that can significantly reduce the latencies involved. Then it does matter how your application runs. If you start with a database, the typical latencies you might see are in the order of 10 to 100 milliseconds, and that's often enough in certain applications. With other options, however, you might be looking at 2 to 20 milliseconds, 10 to 100 microseconds, or even 1 to 10 microseconds - and obviously we tend to be at the bottom end of that. Sometimes we're also looking at redundant messaging. In this scenario, the model is that you've got some client, gateway, or source of information, and a server that's processing that information. What needs to happen before you process it, or before you send a reply? Does it need to go to disk? Does it need a redundant copy on another server? Does it need to go to a file so that it survives if the application dies?
There are a number of different options, and they make a remarkably big difference to how much latency the operation takes. The lowest-latency option, with the least guarantee, is where all you care about is how long it takes to write the message - not how long it takes to process or persist it. Say you've got an application that's being logged, monitored, or recorded for compliance purposes, and all you care about is the write time. This is the one that's often benchmarked, because it's the lowest number you will get - it's doing the least. That's something to watch out for: if that's not your use case, a benchmark showing the time to publish isn't going to be very helpful, because that's not what you're looking at. But if all you care about is recording or logging - and that's a very common use case - then the time to publish is the number that really matters. As you can see, there are options out there - and we're not the only ones - where that delay is extremely small. These figures are in microseconds, so we're talking about 170 nanoseconds typical latency here; extremely low.

However, sending the same message with acknowledged replication - sending a message to a server and getting a response back saying that a second server has a copy - can still be very quick, but it brings the figure up to about 16 microseconds typical, on a low-latency network.
That means two machines quite close to each other, with low-latency infrastructure end to end; that's how you get it down to about 16 microseconds. On something like AWS you might be looking at 40 to 60 microseconds, but that's still a very low number - much lower than writing to disk, and much lower than writing to a database.

For the next level up, you have the option of giving a hint to the operating system. You're writing to a memory-mapped file, and you can tell the operating system that it would be a good idea to write this data out very soon. It's not really a guarantee as such, but it is a method call you can make to ask the operating system to prioritize - whatever that means to it - writing that data to disk, and it does minimize the delay between writing the data and it actually reaching disk. That brings the numbers up a little. What this combination does is tell the operating system it should write the data out to disk, while also waiting for at least one copy on a second machine close by. It's actually a pretty good combination: it gives you quite a lot of robustness - not for every use case, but quite a lot - and the latency is still very low. We're still under 40 microseconds at this point, and it's very consistent.
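A plain-NIO sketch of those two levels of "written" (Chronicle's products use their own memory-mapped files; this is just the idea). Writing into a memory-mapped file survives the process dying, because the page cache belongs to the OS, not the JVM; `force()` then asks the OS to push those pages to disk, so the data survives the machine dying too - at a much higher latency cost. Note that `force()` is the blocking variant; the asynchronous hint mentioned above (msync with MS_ASYNC) is not exposed by standard Java.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class MappedWriter {
    static void writeAndSync(Path file, String message, boolean syncToDisk) throws IOException {
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            byte[] bytes = message.getBytes(StandardCharsets.UTF_8);
            MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_WRITE, 0, bytes.length);
            map.put(bytes);      // visible to other processes immediately; survives a JVM crash
            if (syncToDisk)
                map.force();     // blocks until the OS reports the pages are on disk
        }
    }
}
```

The choice of when to pass `syncToDisk = true` is exactly the durability trade-off being discussed.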
The worst one in 100 is about 72 microseconds. Moving on: what if that's not the guarantee we need - what could be achieved? The high end here, the strongest requirement, is that the OS - or at least the hardware controller; not necessarily the disk itself, but the hardware RAID controller - has replied that the data has been written to disk. With spinning disk that was pretty high and quite bad, but with a decent enterprise SSD, which is what's being tested here, you can get that down to about seven milliseconds (seven thousand microseconds) typical, and a 99th percentile of just under 17 milliseconds. If that's enough for you, that's pretty much the best guarantee you can get anyway, so that's fine. You can add in acknowledged replication, but replication is usually so much faster that adding it won't make much difference at that point - it's only how long it takes to send a message and get a reply, so it's down to your network speed. If your acknowledged replication is from Singapore to Hong Kong, the distance is your bottleneck: the speed of light to send a message that far.

But before we go on: what if we can come up with something that's a bit of a best of both worlds? Say you're sending a heartbeat - does that really need to be committed to disk? Maybe not. In stocks and securities the size of a trade can vary enormously, so you'll have lots of really small trades and a small number of really big ones. So how do we arrange it so that, instead of paying the penalty on every single event, we only pay it on the ones that actually pose a key risk to the business?
The ones the business really cares about, where it says: these are the ones we can't possibly lose. So what I did here is to look at what happens if you take that strategy. It's an assumption, because it entirely depends on your use case, but: what if that only happened 10 times a second? Roughly every 100 milliseconds a message comes in that you can't possibly lose - you can't continue until it has been committed to disk - but every other message can go as soon as it's available. It still means that every message before a committed one has to be persisted when this happens, because you need all of them, but you're only doing it periodically. As you might expect, the typical latency improves significantly, because typically the commit doesn't happen: the typical latency is similar to just doing acknowledged replication. Your higher percentiles are dominated by how long it takes to commit to disk, but even they are improved - not hugely, but somewhat - compared to persisting absolutely everything.

Another step you can take is to use a faster disk. This is a high-end disk, usually used in desktops, and there are a number of vendors doing this at the moment. In the next couple of years - possibly by the time you have a project that needs it and the disk goes through, say, your bank or your organization as an approved product - these speeds may be available to you. We're talking about where the market is going: this is something you can buy today, but it may not be an option for you yet. As you can see, the typical latency is still quite high, but it's down to 1.7 milliseconds at this point, which is pretty good. And there's still a benefit from taking this hybrid approach, where only when there's a trade too big, a message too important, or too much outstanding data do you actually commit.
So, let me introduce Gerry, the local regional MD, and also Vatican, who actually lives here locally. They can help you out with a bit more detail - I'm based in the UK, even though I'm from Melbourne originally. Are there any questions? You can ping me, or if you've got any questions you're willing to ask now, go ahead; you can ask me or them.

[Audience question]

It's more the L3 - L3 cache memory bandwidth is the bottleneck. It's not entirely linear with the memory speed, either: if you use memory that's rated twice as fast, you don't quite get twice the allocation rate. It does help, but that's the limiting factor. That's why adding more CPUs doesn't really help: it helps to a point, but that point maxes out at two, three, or four CPUs, and after that, adding more CPUs just means they wait for each other more for the allocations to go through.

[Audience question]

Yes, there are core complexes - the CPU I was testing is AMD, and it has two core complexes, so it actually has multiple tiers, and I did test that. There's very little difference between having, say, 16 threads running on a single core complex and 16 threads split across two core complexes; it didn't really make a lot of difference.
So it may be something lower level than the L3, but that's roughly where I would place the bottleneck.

[Audience question]

Well, if you run in the cloud you will still run into the same restrictions, because fundamentally it's the same hardware - for the most part they run AMD or Intel, and the vendors tend to copy each other's CPU architectures. So I would expect you to get a very similar result: there will be some saturation point, and it will be reached with far fewer CPUs than you might expect. That's the main takeaway. In our experience, as you add more sockets and get bigger and bigger machines, accessing memory generally gets slower, not faster. You get more CPU, but your memory bandwidth tends to go down, because you've got a more complex structure and NUMA regions to worry about, so accessing that memory actually takes longer.

[Audience question]

Certainly in our use case we don't consider them at all, because they create too many objects. However, while our core systems will be very low-GC, they need to talk to systems that don't have such stringent requirements. Those may create a lot of objects; they may be using Spring Boot; they may be using JBoss for web services.
Not all the systems have to be designed this way. What we tend to advocate, partly for this reason, is putting that on a separate physical machine, so you isolate the problem to that machine. Then it's fine to use all of that, and at least you won't be impacting the critical processes. And there's garbage when you start up versus garbage in the steady state: we've got lots of customers using Spring to set up their system, and a few minutes after it's started there's no more garbage. There might be some Spring proxies floating around, so you have to be careful of those, but if you can avoid them, you can kick off a GC 30 seconds in, or whatever, and you should run fine. It doesn't have to be absolutely zero garbage - it just needs to be so low that it doesn't matter, and that can still be surprisingly high in terms of raw numbers.

A common example, and this goes back many years - I've done this many times - is a system creating, say, a gigabyte an hour of garbage. If they're 100-byte objects, that's about 10 million objects an hour per JVM, and you can have multiple JVMs doing this. But say you can keep it under a gigabyte an hour: that's 24 GB a day. You're still creating objects, still creating some garbage here and there, but at a low rate. Then you can start your system with an Eden space of 24 GB, watch Eden slowly fill up over the day, and - I've actually done this - run a timed maintenance task to do a full GC overnight. Once a day you do one GC; that's it. As long as you stay within the Eden space size, there are no more collections.
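As a back-of-envelope check on that arithmetic - all the input figures here are the talk's illustrative assumptions, not measurements:

```java
class GcBudget {
    static final long BYTES_PER_HOUR = 1L << 30; // ~1 GiB of garbage per hour
    static final long OBJECT_SIZE = 100;         // ~100-byte objects

    // Roughly 10.7 million objects allocated per hour, per JVM.
    static long objectsPerHour() {
        return BYTES_PER_HOUR / OBJECT_SIZE;
    }

    // 24 GiB per day: size Eden to hold a full day's allocations
    // (e.g. -Xmn24g), schedule one full GC overnight, and no other
    // collections occur during trading hours.
    static long edenGibPerDay() {
        return (BYTES_PER_HOUR * 24) >> 30;
    }
}
```

The point of the exercise is that "low garbage" can still mean tens of millions of objects a day, as long as Eden is sized so nothing is collected until the maintenance window.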
This just becomes a non-issue. And at a gigabyte an hour, even going back to my earlier estimate, you're looking at far less than 0.1% of your time being chewed up by allocations.

[Audience question]

Yes - that's partly why we emphasize that it's not just about making the applications efficient; it's also about making developers efficient, and often the cost of developers is more than the cost of the hardware. I mean, if you could halve your build time, how much more productive would you and the rest of the team be? And if you could reduce it by a factor of 10, you'd probably enjoy your day more - it's not just that you'd be more productive; it would be more enjoyable. Because when you do builds on an established system, you're talking about doing them many times, with many variations, quite often, and you'd just like all of that to be much quicker. Okay - thank you very much for listening.