Okay, welcome everyone. This is the status update for the COLO project. If you expected to hear something about dual Android on the Nexus 10, that's next door. So let's begin.

Hi, I'm Will Auld, and Xiaowei and I are going to talk about COLO. This was presented last year at the Xen Summit in San Diego, so this is an update on what's been done in the meantime. We'll start off with an overview, and then Xiaowei will take it into the details of what's been going on.

So, what is COLO? COLO is coarse-grained lock-stepping for high availability. The idea is to provide a solution for a client-server model in which the server side is encapsulated so that it doesn't have to know anything about the high-availability features; it can just do its own thing, whatever the server application is. It uses dual VMs on separate systems, and it uses a more relaxed constraint mechanism than was previously done. We'll talk about that in a little bit.

In terms of the way it talks to the client: clients send in their requests for whatever it is they're getting from the server, the COLO system takes each request and replicates it out to the two VMs that are handling the work, and the VMs pass it into the application. The application responds with its response packets, and those packets are then compared between the two systems. When they're the same, the response is just sent back to the client. When they're different, we sync the primary and secondary VMs so that they're back in alignment, and then the response is given to the client.

This slide is a pictorial view of that. I didn't say it explicitly, but the application is running in both VMs simultaneously, so the two sides look very similar; the differences are all in Xen, where we embed the COLO functionality. In this case the picture shows a shared storage unit. That's one way to do it, but the storage doesn't need to be shared; that's not a requirement for COLO.
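To make that control flow concrete, here is a minimal sketch of the per-response decision the COLO manager makes. This is purely illustrative; every name here is a hypothetical stand-in, not the real Xen code.

```c
#include <stdio.h>
#include <string.h>

/* Toy model of the COLO manager's per-response decision.
 * All names are hypothetical, not the actual implementation. */

static int checkpoints; /* how many syncs divergence forced */

static void checkpoint_sync(void)
{
    /* In real COLO this resynchronizes the secondary VM with the
     * primary; here we just count the event. */
    checkpoints++;
}

/* Compare one response from each VM; release the primary's packet
 * to the client, forcing a sync first if the outputs diverged. */
static void handle_response(const char *primary, const char *secondary)
{
    if (strcmp(primary, secondary) != 0)
        checkpoint_sync();
    printf("to client: %s\n", primary);
}

int main(void)
{
    handle_response("HTTP/1.1 200 OK", "HTTP/1.1 200 OK"); /* match   */
    handle_response("seq=41",          "seq=42");          /* diverge */
    printf("checkpoints triggered: %d\n", checkpoints);
    return 0;
}
```

The only client-visible delay is the comparison itself; a checkpoint is paid only when the outputs actually diverge.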
Now I want to contrast this a little bit with Remus, which is a solution that's been in Xen for a while; we'll talk about it more later. Remus also uses two hardware systems and two VMs, but it's not running both VMs simultaneously; it just runs the primary VM. Requests come into a Remus system, and it also hides the high-availability work from the application, so the application in this case doesn't need to be aware of it either. Remus provides the request to the application and gets the response, but then it buffers the response rather than sending it right back to the client. This buffering goes on for a predefined period that the operator sets. At the end of that period there's a checkpoint between the primary host and the secondary host, which brings the secondary right up to the same state as the primary, and once that's completed, the responses that have been buffered are sent back to the clients. That allows Remus to keep the machine state exactly the same between the two hosts.

If there's a failure at some point (actually, I didn't mention this for COLO, but the two are almost the same here): for COLO, the secondary VM is already running and it will just take over; for Remus, the secondary VM will be started up and then it'll take over. (I didn't realize this slide was a build.)

Okay. So there are some problems with the existing approaches. One of the existing approaches is instruction-level lock-stepping, and it really doesn't compete very well performance-wise because there's just so much overhead, especially if it's done in software. The other is periodic checkpointing, which is something like Remus: you've got the extra latency for the packets that you're buffering up, and you're doing a checkpoint at the end of every period, so that has quite a bit of overhead associated with it.

Remus was a big part of the inspiration for COLO, and we tried to figure out whether there was a way to reduce the overhead there. What we did was look at how to relax the constraint of exact machine-state replication. We still update the machines, but at a slower rate. Essentially, we look at those response packets I talked about earlier: when we get the two packets back, we compare them to see if the VMs are in sync. The relaxed model depends on the assumption that the application itself is going to move through a valid set of states, so moving from one state to the next will always be valid for either the primary or the secondary. If they're in a state where they're delivering the same response, then the next step will also be a valid state. If the two are out of sync, then there's no guarantee that the move from one state to the next is valid, so at that point we need to sync them.

This ties the synchronization rate to the characteristics of the application. If you have an application whose output tends to diverge, you'll see more updates and synchronizations; when it doesn't, you'll see very few. In most cases we find that this sort of relaxation lowers the number of synchronizations required, so the overhead is lower than in the Remus case, but not in every case.
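For contrast with the COLO loop sketched earlier, here is the shape of a Remus-style epoch. Again, this is only an illustrative sketch; the epoch length and all the names are made up.

```c
#include <stdio.h>

#define EPOCH_RESPONSES 4 /* stand-in for the operator-set epoch */

/* Toy model of a Remus epoch: buffer every response, checkpoint at
 * the end of the epoch, and only then release the buffered output. */
int main(void)
{
    const char *buffered[EPOCH_RESPONSES];
    static char line[8][32];
    int n = 0;

    for (int seq = 0; seq < 8; seq++) {
        snprintf(line[seq], sizeof line[seq], "response %d", seq);
        buffered[n++] = line[seq];          /* hold back, don't send */

        if (n == EPOCH_RESPONSES) {         /* end of epoch */
            printf("checkpoint: primary -> secondary\n");
            for (int i = 0; i < n; i++)     /* now safe to release */
                printf("to client: %s\n", buffered[i]);
            n = 0;
        }
    }
    return 0;
}
```

Every client-visible byte waits for the next checkpoint; that buffering latency is exactly what COLO avoids whenever the two outputs happen to match.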
Now, looking at the picture that has a little more detail than the previous one: we see that there's a heartbeat within Dom0 on each host, and that's used to determine whether a failover is needed; when you don't get the heartbeat you're expecting, you do the failover. There's the checkpoint facility, which manages the synchronizations, and the COLO manager, where the response packets are compared and decisions are made about what to do. Other than that, it's just the primary and secondary VMs on each system, so it's fairly clean.

As for the current state: the patches have been put out on the mailing list; there's a paper, at the URL here, that you can get on the current state of the work on Xen; and Huawei, which is where Xiaowei is from, has announced that they will use COLO in their FusionSphere product. From here, I'll let Xiaowei take it.

[Xiaowei] Morning, everyone. I will give you a detailed COLO update. The first part is our TCP/IP optimization. The main goal of the TCP/IP optimization is to make the response packets from the primary VM and the secondary VM more similar, so that we get fewer checkpoints, which means better performance. Let's look at the optimizations one by one.

The first is per-connection comparison. As we know, in server scenarios there can be multiple connections between the client and the server, and it's hard to anticipate the packet ordering among multiple TCP connections, so comparing across connections triggers many checkpoints in our COLO solution. Fortunately, if we change the comparison a bit and compare per connection, the situation becomes much better.

The next one is the coarse-grained TCP timestamp. As we know, the timestamp is used to identify TCP timeouts, and in Linux it is taken from the system time. Even in a non-virtualized environment we cannot guarantee that the system time on two servers is exactly the same, and it's harder on a virtualization platform because we need to virtualize the time. Fortunately, we found that the TCP timestamp may not need to be that precise, so we introduced a coarse-grained TCP timestamp, which has a time granularity of 128 milliseconds.
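The rounding itself is just a mask, since 128 is a power of two. A minimal sketch (hypothetical names; the real change lives in the guest's TCP stack):

```c
#include <stdio.h>
#include <stdint.h>

#define COLO_TS_GRAIN_MS 128u /* granularity from the talk: 128 ms */

/* Round a millisecond timestamp down to a 128 ms bucket so both
 * VMs are likely to land on the same value. */
static uint32_t coarse_ts(uint32_t ms)
{
    return ms & ~(COLO_TS_GRAIN_MS - 1);
}

int main(void)
{
    /* Two VMs whose clocks differ by a few ms still agree. */
    printf("%u %u\n", coarse_ts(100000), coarse_ts(100037)); /* 99968 99968 */
    printf("%u %u\n", coarse_ts(100128), coarse_ts(100130)); /* 100096 100096 */
    return 0;
}
```

Clock skew smaller than the granularity no longer makes the two VMs' timestamps, and hence their packets, differ.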
The next one is the coarse-grained TCP notification (advertised) window size. It's basically the same story: packets from the primary and the secondary VM diverge just because they carry different window sizes, so we want the window sizes to be more likely to agree. Here's our coarse-grained window size algorithm: if the original window size is less than 256, we just round it down to the nearest power of two; if the window size is larger than that, we mask off the eight least significant bits.

The slide gives an example: without the coarse-grained window size, our comparison module thinks the two packets are different and triggers a checkpoint; with our modification, the two packets look exactly the same, and the packet is sent out to the client directly.
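As a sketch of that algorithm (an illustrative helper with made-up names, not the actual patch):

```c
#include <stdio.h>
#include <stdint.h>

/* Coarsen a TCP advertised window so the primary and secondary VMs
 * are more likely to advertise the same value. */
static uint32_t coarse_window(uint32_t win)
{
    if (win < 256) {
        /* round down to the nearest power of two */
        uint32_t p = 1;
        while (p * 2 <= win)
            p *= 2;
        return win ? p : 0;
    }
    /* otherwise mask off the 8 least significant bits */
    return win & ~0xffu;
}

int main(void)
{
    /* Slightly different windows coarsen to the same value. */
    printf("%u %u\n", coarse_window(200),   coarse_window(190));   /* 128 128     */
    printf("%u %u\n", coarse_window(14600), coarse_window(14720)); /* 14592 14592 */
    return 0;
}
```

The guest advertises a slightly smaller window than it strictly could, and in exchange the two VMs' headers agree far more often, so far fewer checkpoints are triggered.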
Okay, another example is deterministic segmentation. As we know, as application data goes down the network stack, it is assembled into sk_buff buffers, and the buffer size is determined by the MTU. Here's an example: an application sends a first chunk of data with a size of 3000 bytes, and it is divided into three sk_buffs. Later, the application sends another chunk of data, and that data may or may not be appended to the last sk_buff, depending on whether the last sk_buff has already been sent out. So this introduces another possibility of packet divergence. Our solution is simple: just prevent data from being appended to the last sk_buff.

Okay, next let's look at how COLO handles storage. In the COLO design, storage state is considered internal state, just like memory and CPU state, which means we need to guarantee that the state is exactly the same at each checkpoint. Here's how we do it. We add a module into the disk backend to intercept every read and write request from the guest, and it determines whether a request is sent down the storage stack or satisfied from a cache. First, let's look at how a write operation works. On the primary node, the write request is copied and sent to the remote side, where it is cached in what we call the PVM cache; in the meantime, the data is also sent down the storage stack. On the secondary node, a write request is simply cached in the SVM cache. Now let's look at how a read request works. On the secondary node, the request first looks in the cache; if the data exists there, the read is satisfied from the cache, otherwise it reads from storage. On the primary node, the read process doesn't change. On a checkpoint, the device manager calls the block driver to flush the primary VM's cache down to storage, because we need the primary node's state to be copied exactly to the secondary. On failover, which means the secondary node takes over, we just get rid of the primary VM's cache and flush the secondary VM's cache.

Next is our memory sync. In our previous design, the dirty memory was synced to the remote side at each checkpoint. But that can be a problem: when the time period between two checkpoints is long, many dirty pages can accumulate, which causes high CPU pressure and a long service downtime at the checkpoint. So we did some optimization of the checkpoint process: we now do the memory sync at run time. Basically, when the sync process wakes up, we check whether it was triggered by a checkpoint or whether it just needs to do one round of memory sync. In the latter case, we copy the dirty pages to the remote side, where they are cached until the next checkpoint; at the checkpoint, that cache is applied to the secondary VM.
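Here is a toy sketch of the secondary-side disk interception just described, modeling only the SVM cache (the secondary guest's own speculative writes). Everything here is hypothetical naming, not the real backend module, and dropping the overlay at a checkpoint is implied by the design rather than stated outright.

```c
#include <stdio.h>
#include <string.h>

#define SECTORS 8
static char disk[SECTORS][16];      /* the real storage           */
static char svm_cache[SECTORS][16]; /* secondary VM's write cache */
static int  cached[SECTORS];        /* which sectors are cached   */

static void svm_write(int sector, const char *data)
{
    strcpy(svm_cache[sector], data); /* never hits the disk */
    cached[sector] = 1;
}

static const char *svm_read(int sector)
{
    /* prefer the cache, fall back to storage */
    return cached[sector] ? svm_cache[sector] : disk[sector];
}

static void checkpoint(void)  /* primary's state wins: drop overlay */
{
    memset(cached, 0, sizeof cached);
}

static void failover(void)    /* secondary takes over: flush overlay */
{
    for (int s = 0; s < SECTORS; s++)
        if (cached[s]) {
            strcpy(disk[s], svm_cache[s]);
            cached[s] = 0;
        }
}

int main(void)
{
    strcpy(disk[0], "base");
    svm_write(0, "spec1");
    printf("%s\n", svm_read(0)); /* spec1: served from the cache */
    checkpoint();
    printf("%s\n", svm_read(0)); /* base: overlay was dropped    */
    svm_write(0, "spec2");
    failover();
    printf("%s\n", disk[0]);     /* spec2: flushed on failover   */
    return 0;
}
```

The PVM cache on the secondary works analogously for writes shipped over from the primary: they accumulate remotely and are flushed to the secondary's storage at each checkpoint.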
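And the run-time memory sync can be pictured as a loop shaped roughly like this (again, hypothetical names and numbers, just to show where the copying happens):

```c
#include <stdio.h>

static int dirty_pages = 1000; /* pages dirtied since the last sync */
static int remote_cache;       /* pages staged on the remote side   */

static void sync_round(void)   /* one background round of copying   */
{
    int batch = dirty_pages < 100 ? dirty_pages : 100;
    dirty_pages -= batch;
    remote_cache += batch;     /* cached remotely until checkpoint  */
}

static void checkpoint(void)
{
    while (dirty_pages)
        sync_round();          /* ship whatever is still dirty      */
    printf("checkpoint applies %d cached pages to the secondary\n",
           remote_cache);
    remote_cache = 0;
}

int main(void)
{
    /* Wakeups between checkpoints drain the dirty set early... */
    for (int i = 0; i < 5; i++)
        sync_round();
    printf("dirty pages left when the checkpoint fires: %d\n", dirty_pages);
    /* ...so the checkpoint itself has much less left to copy. */
    checkpoint();
    return 0;
}
```

The expensive copying is spread across the whole epoch, so the stop-the-guest window at the checkpoint itself shrinks.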
One more thing, which is very simple: on a checkpoint we need to save the PV device state. The previous solution uses XenStore to communicate between the frontend and the backend, but one problem is that XenStore can introduce high latency. Our change is very simple: avoid using XenStore and just use the event channel.

Okay, next let me show you some performance data; it's taken from our SoCC paper. The first one is the performance data for the WebBench benchmark. There are four configurations: the first one is native; the second one is Remus with the epoch (which is the period between two checkpoints) set to 20 milliseconds; the third one is Remus with the epoch set to 40 milliseconds; and the last one is our COLO performance. First, let's look at the two Remus results. We can see that if we set the epoch longer, we get a bit better performance; however, it also means the packets will be buffered for a longer time, so they will have a longer latency. Our COLO solution doesn't have that issue: we send the packets out directly if the compare module thinks the primary and secondary responses are the same. With regard to bandwidth, we can also see that COLO's performance is better than Remus's.

[Audience] Do you have an explanation for why it tails off?
[Xiaowei] Yes. It's simply because the packet divergence seen by the compare module increases, so we do checkpoints more frequently, and that hurts the performance. Of course, there is room to optimize in this area.

Okay, next let's look at our scalability results. As we said, one key advantage of COLO compared to the lock-stepping solutions is that it can support SMP guests, while the lock-stepping solutions can only support UP. So here we ran the benchmark with the number of guest vCPUs set to one, two, and four, and we can see that the performance scales very well as the vCPU count increases, which means the scalability is good.

Okay, next is the pgbench benchmark, which tests database transactions. It tells almost the same story: we get better throughput, lower latency, and the scalability is very good.

On upstreaming: as mentioned, we already posted the initial patches to the upstream mailing list this July. There have been a few comments, and more comments are welcome. On COLO and Remus: as Will said, COLO reuses Remus for VM checkpointing and heartbeat. When we did the development, we relied on xend, but as the community moves from xend to libxl, there could be more work to be done on libxl to support Remus.

Okay, the last slide is the summary. Our COLO work currently works well with the combination of HVM Linux guests and PV drivers; we are doing the development to support Windows guests; and we want more participation from the community and to get COLO into upstream as soon as possible. Okay, that's my slides. Questions?

[Audience] (a question about split-brain)
[Xiaowei] Actually, this is not a focus of COLO. We just use a heartbeat to monitor the health state of the other side; we don't do any specific implementation for that.

[Audience] So then, architecturally, if I'm implementing this, I'm looking at only the primary node, and the primary node itself is going to decide whether...
[Xiaowei] Yeah. If the secondary node thinks the primary has crashed, it will try to take over, so it will respond to the packets as well, and the client will see responses from both the primary and the secondary. It could be a false detection by the heartbeat; we didn't fix that problem in COLO.

[Moderator] Any additional questions? We've got about two minutes.

[Audience] Just a quick one. I noticed the heartbeat being associated with Dom0. How about what I would consider to be a suboptimal but possible use case: hardware machine one running VM one, hardware machine two running VM two, and hardware machine three running the COLO secondaries of VM one and VM two simultaneously. Is that not allowed, or just a bad idea?
[Xiaowei] It's a bad idea, because we want to survive a hardware crash. This deployment doesn't change anything in the design; it's just a bad idea.
[Audience] I've worked in enough data centers over the years that I can imagine some manager saying, well, we'll just have a machine in the secondary data center be the backup to all these other machines out in the field, without necessarily thinking about the consequences of doing that. I was just wondering whether that's even conceivable in this design.
[Will] So you're saying the case where you have various primaries, but you're using the same host for multiple secondaries? I don't think we've really looked at it. It's possible, but it will incur much higher overhead.
[Audience] I've definitely seen the hot-spare model used a lot over the years. My assumption is that this is very latency-sensitive between the primary and the secondary.