Hello, my name is Todd Montgomery. Some of you know me, so I won't go into who I am, but I will talk to you about what Aeron is. Aeron is an open-source project, started around 2014 and sponsored by the CME, to design and build a high-performance, low-latency messaging transport — in other words, communication — with the intention of being used within trading organizations in the front office: exchanges, buy-side traders, anything you can think of with that low-latency requirement. Aeron Archive came along, also sponsored by the CME; it records and replays streams of Aeron data. And Aeron Cluster, which was sponsored by our friends at Adaptive — which I happen to be a part of — is a framework for building a Raft cluster on top of Aeron and Aeron Archive, with the same goals: low latency, high availability, and high throughput.

We have a lot of customers who depend on us. We're now able to show some of them a little more publicly — flash the logos up a bit. This is really neat, especially with an open-source project that started out with people not wanting to admit they used you; now some of them actually acknowledge that we exist and that they use us, which is nice.

But really, what we want to talk about here are some of the challenges of moving front-office systems to 24/7 availability. One question you may have — and a lot of people do — is: what's so hard about 24/7? If you're talking about consumer websites and things like that, what's so hard? They have to be available 24/7 anyway. It's a little different when you start to think about it in the context of low-latency, high-throughput, very demanding systems. There's nothing really hard about it, until you look at the SLAs — very aggressive SLAs that must be met not only contractually but sometimes for regulatory reasons. So you have to be available, and you have components that need to be upgraded; you can't just take the system down to upgrade it. That's one of the advantages of having dedicated downtime: you can upgrade the system during the window. Without one, how do you know when to upgrade? The answer is that you should be able to upgrade at any time you need to. But can you do that and continue to meet your SLAs? Actually, you can, if you think about it — even with transactions that must be processed in real time.
It's a little different when you don't have transactions that have to be processed in real time — not all systems that run 24/7 are exactly the same. So it isn't hard, but it is hard in certain situations. The things that really matter here are the fault-tolerance model in use, how disaster recovery is handled and how you can lean on it, and how you do things like upgrading your production systems.

First, fault tolerance. I'll take you on a little journey about how fault tolerance has evolved. Say you have a service running on some machine, with a client connecting to it, and you want to make it fault tolerant. Your service can be restarted — that may mean someone going in and restarting it by hand, or someone pushing the power button because it got turned off; all kinds of things. The first thing you might do to make it highly available is to put it in the cloud. How does that make it highly available? With cloud infrastructure you have the ability to do all kinds of different things, like move it — a little different from moving it when it's on-prem. This is a form of fault tolerance.

But it gets more complicated as you add different types of fault tolerance. You may have multiple services, and a client can go to any of them; you may have more clients, and how do they know which service to go to? These are all solved problems, with many different solutions. But the problem we're really trying to fix is state — the fault tolerance of state. It's the state in the services that we want to make highly available. Most of the time, a lot of systems just kick the can down the road: they put the state in storage somewhere else — a database or something. Again, that's a way of doing fault tolerance: it's not my problem now, it's a vendor's problem, because I'm going to Oracle — I'm relying on Oracle to have this problem. But notice the same problem still exists; it has just moved. That might be good enough.

So it really is the fault tolerance of state that matters, and when you start to look at this, there's a continuum between partitioning that state and replicating it. If you have the state in one place — what's the old adage? If you have it in one place, you have it in none. If you lose that one, it's not good enough. Regulatory requirements may demand that you keep this state in multiple places, quite apart from simply wanting to.

We do have methods of doing this, but I'm going to talk about one specific way of visualizing it for development, and the way Aeron Cluster works: the idea of a continuous log with snapshot and replay. The thing actually being replicated is a log of events. Those events are processed in order, and they generate state. At some point you can snapshot that state and later replay the log. To give an example, here's a sequence of messages. It could be anything: a new order add, a cancel, a modify — something like that.
It's very easy to think of it from the trading side. That sequence has an associated set of state that gets incrementally added to as it goes along. At any of these events you could take a snapshot of the state at that point in time; the snapshot is just a roll-up of the effects of all the state changes so far. You could then take that snapshot, rehydrate it, and replay, say, events five and six after it — and the two sides should be exactly the same: you should end up with the exact same state. Many systems built on this basic idea can do that.

What this gives you is a different way to think about state: instead of a snapshot of state within a database, it's a log with snapshots and replay. In fact, most databases work this way internally — they have a log and do essentially this, with some additional machinery attached.

Take it a step further. So far we've only talked about how one application might look at a stream of these events, apply the state changes, and hold that state. But there's a reason we're talking about this: we want multiple clusters of these services, each with a local archive of the log. The log is replicated across them, and the services run by consuming that same log. Each of those should then have the exact same state — and they do: that's deterministic execution. That log will always generate the same state. With deterministic execution, all the replicas are in the same state. This idea of replicated state machines has been around for quite some time. It's a simplification that lets you build systems that are very highly available.

There are a couple of requirements here. Each replicated service needs to read from the same event log, obviously, and with the same input ordering — if the input ordering varies between them, it won't work. Same log, same ordering, and the log is replicated locally: there is no central place where the log is stored; it's replicated at all the nodes of the cluster. Snapshots are really checkpoints — they're events in the log, and they roll up the state changes of all the previous log events.
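To make that concrete, here's a minimal sketch in plain Java — not Aeron Cluster's actual API; the events and the order-count state are invented for illustration — showing that snapshot-plus-replay ends in the same state as replaying the full log:

```java
import java.util.List;

// Minimal sketch: a deterministic state machine over an ordered event log.
// Replaying the same events in the same order always yields the same state,
// and a snapshot is just a roll-up of all state changes up to a log position.
public final class SnapshotReplayDemo
{
    static long apply(final long state, final String event)
    {
        return switch (event)
        {
            case "ADD" -> state + 1;      // new order
            case "CANCEL" -> state - 1;   // cancel order
            default -> state;             // e.g. MODIFY: no count change
        };
    }

    public static void main(final String[] args)
    {
        final List<String> log = List.of("ADD", "ADD", "CANCEL", "ADD", "ADD", "MODIFY");

        // Full replay from the beginning of the log.
        long full = 0;
        for (final String e : log)
        {
            full = apply(full, e);
        }

        // Snapshot after event four (index 3), then replay only events five and six.
        long snapshot = 0;
        for (int i = 0; i < 4; i++)
        {
            snapshot = apply(snapshot, log.get(i));
        }
        long recovered = snapshot;
        for (int i = 4; i < log.size(); i++)
        {
            recovered = apply(recovered, log.get(i));
        }

        // Deterministic execution: both paths end in the exact same state.
        System.out.println(full == recovered); // true
    }
}
```

This same property is what makes replay-based testing work later on: run the log twice, get the identical state twice.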
One question you come to first: I'm reading from this log, but each service might be reading from a different spot in it. How do I know when I can process a message? What happens if only one service processes a message and it's the only one to have it? Well, we don't want that.

There are a couple of different families of consensus algorithms. Paxos was one of the first, but they go back further than that — into the late '80s and early '90s, with work being done even in the '70s; they didn't call it consensus at the time, but it had the same building blocks. Raft consensus is very simple to distill: an event must be recorded at a majority of replicas before being consumed by any replica. That's the consensus part. You can't just receive a message and process it immediately; you have to wait until the message is locally replicated at a majority — in a three-node cluster, you have it and at least one other node has it. That's the basic idea of something like Aeron Cluster, and of the other Raft derivations that use the same technique: they're looking at that log, and they have the concept of snapshots, where snapshots are an optimization. Without intentionally making the replicas run in lockstep, they are in lockstep, because they're all reading from the same replicated log.

This has some properties that make it nice for 24/7, and some that can make it bad for 24/7. The first thing we'll talk about is disaster recovery. I'm using the term pretty loosely here: the disaster isn't necessarily that you lost a whole data center; maybe you lost one member of a three-node cluster. Let's consider even that a mini disaster. The metric we're really concerned with, when you have tight SLAs, is the recovery time objective — in other words, how long does it take before you can get back into operation and continue operating? You want that to be low, ideally in all cases of disaster recovery: lose a whole data center, lose all three members of the cluster, and you'd still like to bring things back up within seconds.

So how does this look when you take disaster recovery into account? The first thing to remember is that snapshots can be lazily disseminated. When a snapshot is taken, it's really an optimization — you could still just use the log. And because it's an event log, you can lazily disseminate snapshots to the other members or to DR, and you can stream the log to DR as it reaches consensus. When a new snapshot comes along, it can be lazily disseminated; if you have a failure before it arrives, you use the previous snapshot until the new one is disseminated.

This brings up an interesting point. You could run this cold, where whatever is holding this data in disaster recovery isn't running at all — it's just storing the snapshots and the log. You could even be really cold and put it in something like S3. Or it could be warm: you have service logic running against that log, sitting there ready to take over if it needs to.
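Before moving on to recovery, here's the majority rule from a moment ago distilled into a toy sketch — illustrative only, with invented names; Aeron Cluster's real consensus module tracks far more than this:

```java
import java.util.Arrays;

// Sketch of Raft-style commit: an event at a given log position may only be
// consumed once a majority of replicas have durably recorded it.
public final class CommitPosition
{
    // Highest log position each replica has recorded locally.
    private final long[] recordedPositions;

    CommitPosition(final int clusterSize)
    {
        recordedPositions = new long[clusterSize];
    }

    void onReplicaRecorded(final int replicaId, final long position)
    {
        recordedPositions[replicaId] = Math.max(recordedPositions[replicaId], position);
    }

    // The commit position is the highest position recorded by at least a
    // majority (n/2 + 1) of replicas: sort the per-replica positions and take
    // the entry counted from the top. 3 nodes -> index 1, 5 -> 2, 7 -> 3.
    long commitPosition()
    {
        final long[] sorted = recordedPositions.clone();
        Arrays.sort(sorted);
        return sorted[(sorted.length - 1) / 2];
    }

    boolean canConsume(final long eventPosition)
    {
        return eventPosition <= commitPosition();
    }
}
```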
Let's break that down, starting with cold replicated-state-machine recovery; it's a little different from warm. With a cold recovery you first have to load a snapshot, then you have to replay the log. Breaking it down end to end — load the snapshot, then the messages and the state they generate — there are two steps: the time to load the snapshot and the time to replay the log.

Why does this matter? Say you snapshot once a day, every 24 hours. What's the worst scenario? You've got a snapshot that is 23 hours, 59 minutes, 59.99 seconds old. How long does it take to replay that when you're receiving millions of messages a second? A long time. And it's not I/O-bound, it's compute-bound, because it's replaying and doing all the work needed to rebuild the state. So it's realistic to have a situation where you're replaying that much data to reconstruct state — and that's time when you're not participating as a node in the cluster, not until you catch up. The time to load a snapshot is typically small-ish; the time to replay the log can be large.

How do you fix this? A couple of options. The first is to snapshot more often. Suppose you snapshot every hour, and you know you can run through an hour's worth of data in, say, a minute. That might be good enough: your SLA may say that in this type of situation you can come back within a minute, replay, and be caught up to speed. Great. What you're really doing is leaving yourself a smaller log to replay — but now you have more snapshots. So what about the impact of taking a snapshot? Anything we do on a machine perturbs it. If you work in low latency you're very aware of this: quieting a machine is a very intense process of finding out what's changing on the machine and making something else slow. Taking a snapshot will perturb the machine in some way. In fact it does more than that: in Raft, taking a snapshot is an event in the log, and all the members then take that snapshot at the same point in the log. Guess what — nothing else is being done while they're taking that snapshot. That's a disruption. If you instead have something else that asynchronously takes the snapshot and disseminates those snapshots back to the cluster nodes, that's one way to reduce the impact of snapshots. Snapshotting more often helps, but it doesn't solve everything on its own.
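To put rough numbers on that snapshot-interval trade-off, here's a back-of-the-envelope sketch — the rates are invented inputs for illustration, not measurements:

```java
// Back-of-the-envelope worst-case cold recovery time:
// recovery ~ time to load snapshot + time to replay events since snapshot.
public final class RecoveryEstimate
{
    public static void main(final String[] args)
    {
        final double ingestRatePerSec = 1_000_000;   // assumed: 1M events/sec arriving
        final double replayRatePerSec = 5_000_000;   // assumed: replay is compute-bound
        final double snapshotLoadSec = 10;           // assumed: loading a snapshot is small

        for (final long intervalSec : new long[] { 24 * 3600, 3600, 60 })
        {
            // Worst case: the newest snapshot is almost a full interval old.
            final double eventsToReplay = ingestRatePerSec * intervalSec;
            final double recoverySec = snapshotLoadSec + eventsToReplay / replayRatePerSec;
            System.out.printf("snapshot every %ds -> worst-case recovery ~%.0fs%n",
                intervalSec, recoverySec);
        }
    }
}
```

With these made-up rates, a daily snapshot means hours of replay, hourly means minutes — which is exactly why the snapshot interval dominates cold recovery time.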
Because if you're going to load a snapshot and then replay a log, and you're not willing to be gone for a couple of minutes while you do that — how do you go faster? You have to have something that is at least warm. The idea is to have a service replica running your logic in DR, ready to go hot at a moment's notice. It doesn't have to be an active node in the cluster: the standby node is pulling across the snapshots, pulling across the log, and processing it just like a normal node would — it's simply not an active member of the cluster. If an active member of the cluster goes down and you want to replace it with the standby, you just have to tell it to join the cluster (there's a small sketch of this at the end of this section). Swapping a crashed node for a warm node and letting it go hot — what kind of delay is that? What we're seeing is only about a second or two, and most of that can be reduced further. So in DR it's possible to run a full copy of a cluster: you lose the data center, and you can start up in DR in seconds. That's kind of powerful, because it means we lost the data center — say we had another heat event like the one we had in AWS this summer — and we're right back in it within seconds. That's pretty nice. That's the kind of thing we think about: what has to happen for a lot of systems.

So that's one cut — what it actually means to come back from a disaster within a recovery time objective. Now let's look at another concern when you're running 24/7: how you do upgrades. You're probably going to do upgrades by stopping everything, right? A big bang. A forklift upgrade — and there really are people who do this: they take a forklift, pull out a whole rack, and put another rack in its place. I'm not joking; there are businesses that do this. But that's not going to work here: there's a significant amount of downtime no matter how you do it.

So how do you do this? Say we run two systems, one hot and one warm; we upgrade the warm one and then switch over. How long does that switchover take? A couple of seconds. Is that too long? It might be. If you're taking a couple-second outage four or five times a day, people won't want to upgrade. That's a barrier; you want to make it so upgrades can be done quickly.

The first thing is to really absorb the fact that components will be on varying versions. No matter how you do an upgrade, you can't do it all at once if you're going to stay live while you do it. That includes the infrastructure — Aeron, Aeron Archive, Aeron Cluster — which has to work in situations where multiple versions are running in production. That's something we take on: if we say you should do it, we have to do it too, and we do. And I mean all components: the cluster is only one piece of a usually larger system, so all of those pieces might be on different versions.
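Picking up that warm-standby idea, here's the promised hand-wavy sketch. The types and methods are hypothetical — Aeron Cluster's real standby mechanics differ — but the shape is the point: the standby applies the same log as everyone else, so going active is a membership change, not a replay:

```java
// Hypothetical sketch of a warm standby: it applies the replicated log like
// any other node, but doesn't vote or participate until told to join.
public final class WarmStandby
{
    enum Mode { WARM, ACTIVE }

    private Mode mode = Mode.WARM;
    private long appliedPosition;

    // Called continuously as committed log entries stream in from the cluster.
    void onCommittedEvent(final long position, final String event)
    {
        // Apply to local state regardless of mode. This is what keeps the
        // standby warm: its state is always within moments of the leader's.
        appliedPosition = position;
        // ... apply(event) to the deterministic state machine ...
    }

    // Operator command: replace a failed member. Because the state is already
    // caught up, the delay is membership and discovery, not snapshot + replay.
    void joinCluster()
    {
        mode = Mode.ACTIVE; // from here on, participate in consensus
    }
}
```

So how do you make multiple versions work together once you accept they'll coexist?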
The best tool here is protocol design, and the lessons that have come from it: how to design protocols so that you can have multiple versions running — backwards compatibility, obviously, but also forwards compatibility. What does forwards compatibility mean? It means putting forethought into how things might change, and how you can detect that there is a newer version — what is safe to interoperate with and what is not — by putting a few bits in messages. That's one way of making messages forward-compatible: you think up front, "how do I handle this when I don't know what to do?", and you carry a little additional information. Things like IPv4 and TCP options are designed with exactly this in mind, because they know they'll operate in situations where the two ends are never on the same version. They have ignore bits in options: if you don't understand an option but its ignore bit is set, you skip it and keep processing; if you don't understand it and the ignore bit isn't set, you have to throw the whole message away. That kind of simple mechanism can be leveraged when you're doing rolling upgrades.

The other thing: version everything, and I mean everything. Messages, data in flight, data at rest, data in the database — everything should have a version attached to it. Your future self will appreciate being able to know a lot more later on. The concept of semantic versioning is a powerful one: a versioning scheme with very strict rules about what is compatible, what is not, and how you bump the version accordingly. With that you can work out whether a given message or piece of data is one you know how to handle, or one you may not.

So, upgrading in production: rolling upgrades. What we've seen is this sort of scenario. You have services running — in this case all on the same version, because we're about to upgrade all of them. You do the simple thing, leveraging Raft and the high availability you already have: take one node down and upgrade it while you continue to operate, because you have to. (There's an obvious problem here, which I'll address in a minute.) That node comes back upgraded, and you start upgrading the second. Notice that at this point you have two versions in production — two different versions live and active. You have to have that; accepting it is fundamental to being able to do this at all. Then you do the last one, and everyone is upgraded and happy.

A couple of things to realize about this. The obvious problem is that while one node is down, you're operating at risk: one node failure and processing stops and stalls, because you cannot reach consensus. So during that upgrade, you'd better hope nothing else fails. How do you address that? Use a larger cluster. With five nodes you can upgrade one at a time and not run at risk. You can still only sustain one failure, but that's fine if that's your bar. Does that mean you should grow the cluster from three to five and then go back to three when you're done? Actually, no — if you're upgrading all the time, why would you want to do that?
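As an aside, that ignore-bit idea in code: a toy codec of my own invention — not SBE and not Aeron's wire format — just to show the shape of forward-compatible decoding:

```java
import java.nio.ByteBuffer;

// Toy forward-compatible decoding: each optional extension block carries a
// type, flags, and length. Unknown blocks are skipped if marked ignorable,
// otherwise the whole message must be rejected.
public final class ForwardCompatibleDecoder
{
    static final int CAN_IGNORE = 0x01;   // flag bit: safe to skip if unknown
    static final int MAX_KNOWN_TYPE = 3;  // this build understands types 0..3

    // Returns true if every block was understood or safely skipped.
    static boolean decode(final ByteBuffer message)
    {
        while (message.remaining() >= 4)
        {
            final int type = message.get() & 0xFF;       // extension type id
            final int flags = message.get() & 0xFF;      // includes CAN_IGNORE
            final int length = message.getShort() & 0xFFFF;

            if (length > message.remaining())
            {
                return false; // truncated message
            }

            if (type <= MAX_KNOWN_TYPE)
            {
                // ... handle a block we understand ...
                message.position(message.position() + length);
            }
            else if ((flags & CAN_IGNORE) != 0)
            {
                message.position(message.position() + length); // skip unknown block
            }
            else
            {
                return false; // unknown and not ignorable: reject the message
            }
        }
        return true;
    }
}
```

An old node can then keep processing traffic from a newer node, degrading gracefully instead of breaking, which is exactly what a rolling upgrade needs.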
Why not just run five? That's what some of our customers who use Aeron do: they run five, and this is how they do rolling upgrades. Simply take one down — they're not running at risk — upgrade that one, and keep going around the cluster. And here's an interesting bit of math: the majority of seven is four, so you can actually upgrade two at a time when you run a seven-node cluster — with two down you still have five up against a quorum of four, so you can even sustain one more failure. It's actually faster to do it that way than to run five: with five you have to do one right after the other, so the time commitment is five times whatever it takes for one; with seven it's four. An interesting thing to consider.

There are lots of other implications once you start to version everything. Your log now has all kinds of version history in it. That's kind of interesting, isn't it? You have all these versions; you can see when they changed — the exact message event at which they changed. This has some interesting side effects once you put versions all over the place and start running different versions in production.

Here are some hard lessons we've seen, though. Retaining old broken behavior forever: you almost have to. Here's an example. Say you're an exchange, and you had some broken logic, and at some point you executed it. It probably had no effect on the customer, but it has an effect on you for regulatory reasons. If you're operating somewhere with, say, a seven-year retention requirement, you have to be able to retain that data — and you have to be able to go back seven years and see the logic you executed on that data. You may not have to keep that broken data or broken logic live in your running application, but you may have to be able to go back to it and say: here's how this happened. That's worth thinking about, because that log stays around for a long time.

The cluster services are often the easy part — kind of. A lot of design goes into how you handle different versions, how you upgrade, doing rolling upgrades like this while continuing to operate, and how you coordinate all of this as a system with the external things attached: gateways, downstream consumers processing the data, even databases attached in other places. The cluster itself and the cluster nodes are usually the easy things to upgrade; the hard things are everything else that's attached. And once you think through having multiple versions running in production and how they interoperate, you actually make the whole system cleaner, because you can use the same techniques downstream as well.

Feature flags are not versions. I've seen this many times: people put feature flags inside a version, and then adding a feature flag means they have to bump the version. Why not keep feature flags totally separate from versioning? That's usually a lot better: decouple feature flags and versioning.
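A small sketch of that decoupling — the names here are hypothetical, just illustrating the separation: the version gates what a node can parse, while flags gate behavior independently, so enabling a feature never forces a version bump:

```java
import java.util.Set;

// Sketch: protocol version and feature flags travel separately. Version
// answers "can I parse this at all?"; flags answer "is this behavior on?".
public final class NodeCapabilities
{
    enum Feature { NEW_MATCHING_RULE, EXTRA_AUDIT_FIELDS } // hypothetical flags

    final int majorVersion;            // semantic versioning: major must match
    final int minorVersion;            // minor may differ between peers
    final Set<Feature> enabledFlags;   // toggled at runtime, no version bump

    NodeCapabilities(final int major, final int minor, final Set<Feature> flags)
    {
        this.majorVersion = major;
        this.minorVersion = minor;
        this.enabledFlags = flags;
    }

    // Compatibility is decided purely by version, never by flags.
    boolean canInteroperateWith(final NodeCapabilities peer)
    {
        return this.majorVersion == peer.majorVersion;
    }

    // A feature is only used when both sides can parse it AND both enable it.
    boolean useFeature(final NodeCapabilities peer, final Feature feature)
    {
        return canInteroperateWith(peer)
            && enabledFlags.contains(feature)
            && peer.enabledFlags.contains(feature);
    }
}
```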
And the last thing: leverage the determinism. You have determinism — realize that there's a lot you can do with that log and that logic. Leverage it for testing; leverage the fact that you can run the same test over and over and over, and it will give you the exact same state. That's a powerful thing. Just don't put non-deterministic things in there, because then you'll be chasing them for a long time.

Anyway, these are just some of the things we've seen as a lot of Aeron users move from having some dedicated downtime — maybe once a week, not a whole lot, but a window — to no window at all, where they have to operate 24/7 and maintain the pretty aggressive SLAs they normally have. Thank you. Any questions?

[Audience question about how backup and standby pull their data.]

Yes — the backup, and this works with standby as well. The backup, the standby — the entity that is in disaster recovery or somewhere else — goes to the archive of a node. It now goes to a follower; it had been going to the leader, but we changed that recently, so it goes to the followers instead and load-balances between them. It pulls things across as they're generated, in the case of the log, and as they're current, in the case of snapshots.

We're also in the process of doing asynchronous snapshot support, and in fact that can be one of the jobs of the standby, since it can be done there. What we've seen is that people have tried various approaches to asynchronous snapshots, and the most common mistake is not thinking of it asynchronously enough. There's a natural tendency, when you take a snapshot, to want to get it back into the cluster and disseminated as quickly as possible. It's actually better to let that be lazy. One of the things we added a while ago was the ability to put I/O limits on the work being done in the archive, with the intention of going slower deliberately so there's less perturbation on the system. Sometimes patience pays off: take the snapshot asynchronously, and if it takes ten minutes to disseminate, as long as you're doing it slowly, you're perturbing the system less — better that than doing it too aggressively. Now, if you need that snapshot disseminated quickly for other reasons — you may have other work you want to do with it — you have full control over how you do that. But the common failing is just not being patient enough, thinking it has to be there now.
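That "go slower on purpose" idea is essentially rate-limited copying. A generic sketch — this is not the Aeron Archive feature itself, just the shape of the technique:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Sketch: disseminate a snapshot with a deliberate bytes-per-second cap so
// the copy steals as little I/O and cache from the live system as possible.
public final class ThrottledCopy
{
    static void copy(final InputStream in, final OutputStream out, final long bytesPerSecond)
        throws IOException, InterruptedException
    {
        final byte[] chunk = new byte[64 * 1024];
        long windowStartNs = System.nanoTime();
        long bytesThisWindow = 0;

        int read;
        while ((read = in.read(chunk)) != -1)
        {
            out.write(chunk, 0, read);
            bytesThisWindow += read;

            // Once we've used this second's byte budget, sleep out the
            // remainder of the window before continuing.
            if (bytesThisWindow >= bytesPerSecond)
            {
                final long elapsedNs = System.nanoTime() - windowStartNs;
                final long remainingMs = Math.max(0, 1_000 - elapsedNs / 1_000_000);
                Thread.sleep(remainingMs);
                windowStartNs = System.nanoTime();
                bytesThisWindow = 0;
            }
        }
    }
}
```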
[Audience question about whether Kafka offers something similar.]

I have no idea. Kafka traditionally does things in a somewhat different manner and thinks about things a little differently. Unless they've added it recently, there isn't really a full Raft implementation in Kafka. You can build systems that look somewhat like this with Kafka — you could write something yourself that does consensus, has its own voting, and uses Kafka underneath the covers. It's going to be slow. And when I say slow, what does that mean? The fastest we have run a cluster is 1.8 million messages a second coming in while, at the same time, round-tripping a message from a gateway through the cluster and back out, with the p99 of that whole round trip being less than 20 microseconds. That's on-prem, Solarflare cards, ef_vi — you can get these systems to run very fast. To run something like 1.8 million messages a second with Kafka — and that was three machines — I don't think you'd run a Kafka cluster at 1.8 million messages a second with three machines. I don't think so. These were 32-byte messages, and we were maxing out two 25-gig NICs — one for consensus and one for the log. There's a lot of detail behind that, but essentially the question was how far you can push Cluster, and you can push it very, very far; that's as fast as we've been able to push it.

[Audience question about how Kafka handles DR.]

I'm not sure. From the way I believe that kind of architecture normally works, you're extending the Kafka nodes — the Kafka services — into DR in some way. It could look somewhat like how we do warm standby or cold backup, extending into DR, but I don't know the details of what they do. Traditionally, if you look at most brokered architectures, they do it by bridging into DR, and that means they get very slow, because they're round-tripping everything through DR. We're fundamentally not doing that — we're asynchronously streaming into DR. All right — thank you.