Hello everyone, my name is Jiachen Zhang, and I work on ByteDance's STE team. Hi, I'm Yongji Xie, I also work on ByteDance's STE team. Our topic today is high availability features for virtiofs. The talk is divided into four parts. First, I want to introduce some background of this work. Next, I'll talk about three high availability features: crash recovery, live upgrade, and live migration. Finally, I'll talk about the status of the features and possible future work.

Okay, let's start with the background of this talk. Virtiofs, as most of you may have heard, is a host-to-guest pass-through file system for virtual machines. It was first introduced in 2018 as a substitute for virtio-9p, and it has better local file system semantics and better performance. It is also actively developed by many open source communities, such as the Linux kernel, QEMU, Kata Containers, and libvirt. If you are interested, you can find more introductions to virtiofs on its website.

The basic usage of virtiofs is to avoid file copying when we want to share some files or a directory with a guest VM. In the cloud environment, the most promising usage is serving secure containers. For example, Kata Containers is one of the secure container runtimes. Basically, as shown in this figure, a secure container invoked by the Kata runtime is sandboxed by a virtual machine, and if the VM is booted with virtiofs, we can directly pass files through from host to guest, which saves the time and space needed to boot a secure container.

There are more possible usages for virtiofs. Just like the many user-space file systems based on FUSE, the virtiofs daemon can also be implemented in other customized forms. For example, we may implement a distributed file system client as a virtiofs daemon, or implement more efficient container image services. There was also an excellent talk at KVM Forum 2019 where you can find more ideas about virtiofs usage.

Now, back to our topic today: what is high availability, and why do we need it for virtiofs? As you may already know, high availability features are a class of technologies that eliminate single points of failure in large systems. For virtiofs, these features include crash recovery, live upgrade, and live migration. However, the virtiofs daemon, virtiofsd, does not support any of these features today. That means when virtiofsd crashes, the whole virtiofs file system becomes unavailable to the VM. There is also a concept for measuring the availability of a system: mean time to failure. As long as virtiofsd does not support any of these features, the overall mean time to failure of a virtiofs-enabled VM is bottlenecked by the virtiofsd process. Therefore, we decided to implement some of the high availability features for virtiofs.

Now, I'd like to move on to the second part of this talk: how can we support crash recovery for virtiofsd? First of all, we make an assumption about the crash: the failure model is fail-stop. The virtiofsd process could be mistakenly killed by other processes, killed by the kernel's out-of-memory (OOM) killer, or it may simply crash for other reasons such as segmentation faults. The fail-stop model means that when an internal or external error happens, the virtiofsd process is killed immediately, instead of running on for a while before exiting. This assumption covers most of the online crash cases, and it guarantees that no further internal state is mistakenly modified by an erroneously running program.
The diagram on the left shows a virtiofs-enabled VM running with virtiofsd. There are two communication channels between QEMU and virtiofsd. First, FUSE requests are continuously transferred from the guest to virtiofsd through the virtqueues, and virtiofsd handles the FUSE requests by further requesting the host file system. Second, as virtiofsd is a vhost-user service daemon, there is also a Unix domain socket connection between QEMU and virtiofsd.

For crash recovery of virtiofsd, the most intuitive thing you may think of is that we need a supervisor process to keep watching the virtiofsd process. The supervisor process could be the Kata container runtime or some other system service manager such as systemd. When virtiofsd crashes, the supervisor process notices that the process was killed and restarts a new virtiofsd. Additionally, as virtiofsd is a vhost-user device backend, QEMU keeps trying to reconnect to the vhost-user socket, and will reconnect once a new virtiofsd is restarted.

Once we have a supervisor process to restart a new virtiofsd, the next thing we need to pay attention to is how to resubmit the in-flight requests. For a virtiofs request, the guest kernel waits for the completion in an uninterruptible state. So if we simply drop the in-flight requests, the guest will block on the unfinished I/O; we must resubmit the unfinished FUSE requests. Fortunately, there is already an in-flight I/O tracking feature in the vhost-user protocol. This is because in-flight I/O tracking and resubmission is a common issue for vhost-user service crash recovery, and it was first introduced for the vhost-user-blk service. As shown in the left figure, the feature basically reserves a log area shared between the vhost-user daemon and QEMU. The log records the in-flight virtqueue descriptors, and if a crash happens, QEMU sends the log back to the new virtiofsd so that the in-flight requests can be resubmitted.

However, there is an idempotence issue with in-flight request re-handling. The FUSE request handler may crash in the middle of an execution, which would leave some residual internal state in virtiofsd. So we need to ensure that every request handler that may be re-executed is idempotent. I will come back to this issue later in this talk, but for now let's move on.

The last thing we need to think about for crash recovery is how to save and restore the internal state of virtiofsd. Virtiofsd has two kinds of internal state: the in-memory state, and the file descriptors opened on behalf of guest applications. As shown in the figure, we need a persistent data store for both kinds of internal state. In our implementation, we store the in-memory state in shared-memory files with an mmap-friendly data structure called FlatMap. For the opened FD state, we save the FDs as file handles using two system calls, name_to_handle_at() and open_by_handle_at(). The general idea of internal state saving and restoring is straightforward: we save the states when they are updated, and restore them when recovering.
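To illustrate why an mmap-friendly layout matters for recovery, here is a minimal sketch of the idea; the field names and sizes are my own illustrative assumptions, not the actual FlatMap code, which is described in more detail below. Because all element data lives inline in a flat slot array backed by a shared-memory file, a restarted daemon can recover the whole map simply by mapping the same file again.

```c
/* Sketch of an mmap-friendly map: no process-local pointers anywhere,
 * so the structure survives a daemon restart. Field names are
 * illustrative assumptions, not the actual FlatMap definition. */
#include <stdint.h>

struct flatmap_slot {
    uint8_t  in_use;             /* slot occupancy flag */
    uint64_t key;                /* e.g., a FUSE inode number */
    uint8_t  value[48];          /* element data embedded inline */
};

struct flatmap {
    uint64_t magic;              /* sanity check when re-mapping after a crash */
    uint64_t nslots;             /* capacity of the slot array */
    uint64_t used;               /* number of occupied slots */
    struct flatmap_slot slots[]; /* elements placed right after the metadata */
};
/* The whole struct is backed by a file and mapped with
 * mmap(..., MAP_SHARED, fd, 0), so updates persist across restarts. */
```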
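For the opened FDs, the following is a minimal sketch of how the two system calls can be used. Error handling is trimmed, and writing the handle bytes to the persistent data store is left out; note also that open_by_handle_at() requires the CAP_DAC_READ_SEARCH capability.

```c
#define _GNU_SOURCE
#include <fcntl.h>    /* name_to_handle_at(), open_by_handle_at() */
#include <stdlib.h>

/* Convert a path into a persistent file handle that can be saved. */
static struct file_handle *save_as_handle(const char *path, int *mount_id)
{
    struct file_handle *fh = malloc(sizeof(*fh) + MAX_HANDLE_SZ);

    fh->handle_bytes = MAX_HANDLE_SZ;
    if (name_to_handle_at(AT_FDCWD, path, fh, mount_id, 0) == -1) {
        free(fh);
        return NULL;
    }
    return fh; /* the handle bytes go into the persistent data store */
}

/* After recovery, turn the saved handle back into an open fd.
 * mount_fd is any open fd on the same mount, e.g. the shared directory. */
static int restore_from_handle(int mount_fd, struct file_handle *fh)
{
    return open_by_handle_at(mount_fd, fh, O_RDWR);
}
```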
With all the components ready for the crash recovery feature, let's review the overall procedure of virtiofsd crash recovery. First, the supervisor process detects the virtiofsd failure and restarts a new virtiofsd. Second, the new virtiofsd restores its internal state from the persistent data store and then listens on the vhost-user socket. Third, QEMU reconnects to the new virtiofsd, re-establishes the vhost-user state such as the virtual machine memory layout and the virtqueue addresses, and sends the in-flight I/O tracking log back to virtiofsd. Fourth, the new virtiofsd re-handles the in-flight FUSE requests from the in-flight I/O tracking log. Finally, the new virtiofsd starts handling normal FUSE requests from the VM guest again.

Okay, now let's go deeper into how we store the internal state of virtiofsd. First, for the in-memory state, we propose a new data structure called FlatMap. FlatMap is based on the QEMU virtiofsd struct lo_map, but as shown in this figure, it is more mmap-friendly: we embed the element data into slots instead of dynamically allocating elements and attaching them via pointers, and we place the elements right after the metadata fields. The crash consistency issues of FlatMap updates are also properly handled. To save the open FDs of virtiofsd, we use the persistent file handle mechanism provided by the host file system: we convert the open FDs to file handles in the host kernel with name_to_handle_at(), and restore them with open_by_handle_at() when performing recovery.

Let's go back to the idempotence problem we face when resubmitting the in-flight requests. Idempotence here means that if we execute the same request multiple times, it always produces the same output. If a FUSE request is not idempotent, re-executing it would leave residual state in virtiofsd, so we need to ensure every FUSE request is idempotent. We analyzed the FUSE request handlers one by one, and there are three situations. First, some of them are already idempotent, so we don't need to change anything. Some of them are relatively easy to fix: we only need to relax some error handling in the virtiofsd handler. The remaining requests are more complicated: we need to introduce journaling to change the internal state atomically.

Next, I'd like to give some examples of these situations. The first kind is the already-idempotent requests, which need no special handling. For example, the FALLOCATE FUSE request just extends or removes a file range at the given offset and length. Because the fallocate() syscall itself is idempotent, and no internal state is changed in the FALLOCATE handler, the request is totally idempotent.

The second type of request needs some special handling. For example, the MKDIR FUSE request calls the mkdir() syscall to create a new directory. If we execute the same request a second time, we can be sure that mkdir() will fail with an EEXIST error. So if the crash happens after mkdir() succeeds, re-executing this request as an in-flight request would produce an EEXIST error. To solve this problem, we relax the error handling of this kind of request: for the MKDIR request, if the mkdir() system call fails with errno EEXIST, we return success to the guest kernel directly. It should be noted that this modification is safe, because the guest kernel already checks the existence of the directory with a FUSE LOOKUP request before sending MKDIR; if the directory really existed, EEXIST would have been returned to the guest application without virtiofsd ever seeing the MKDIR request. With the relaxed error handling, the MKDIR handler is now idempotent.
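As a concrete illustration, here is a sketch of the relaxed MKDIR handling, based on the description above rather than the exact upstream patch:

```c
#include <errno.h>
#include <fcntl.h>     /* AT_* constants */
#include <sys/stat.h>  /* mkdirat() */

/* Idempotent MKDIR handling: on replay after a crash, mkdirat() may fail
 * with EEXIST because the first execution already created the directory.
 * We treat that as success, which is safe because the guest kernel has
 * already ruled out an existing directory with a LOOKUP before MKDIR. */
static int lo_mkdir_relaxed(int parent_fd, const char *name, mode_t mode)
{
    if (mkdirat(parent_fd, name, mode) == -1 && errno != EEXIST) {
        return -errno; /* a real error: report it to the guest */
    }
    return 0;          /* created now, or already created before the crash */
}
```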
The last and most complicated situation is the requests that need journaling. For example, the FORGET request changes the internal state of virtiofsd: as marked in this figure, it decreases an inode's nlookup counter by one. If we crash right after the decrement, the re-handling of this request will decrease the counter again, so nlookup would be decreased by two in total. That causes an unrecoverable internal inconsistency, which is not acceptable. To solve this problem, we introduce a lightweight journaling mechanism: before the nlookup counter is decreased, the old value is recorded in the journal, and when recovering, the journal entries of in-flight requests are rolled back first. With the lightweight journaling, the FORGET request is now idempotent.
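Here is a minimal sketch of that lightweight journaling idea for FORGET; the layout and names are illustrative assumptions, and memory-ordering and journal-entry management are omitted:

```c
#include <stdint.h>

/* One journal entry, living in the same persistent shared memory as the
 * rest of the state. Illustrative layout, not the actual implementation. */
struct journal_entry {
    uint64_t inode;       /* the inode whose nlookup is being changed */
    uint64_t old_nlookup; /* value to roll back to after a mid-handler crash */
    uint8_t  valid;       /* set while the decrement is in progress */
};

static void forget_journaled(struct journal_entry *je,
                             uint64_t inode, uint64_t *nlookup, uint64_t n)
{
    je->inode = inode;
    je->old_nlookup = *nlookup; /* record the old value first */
    je->valid = 1;              /* commit the journal entry */

    *nlookup -= n;              /* if we crash here, recovery rolls back to
                                 * old_nlookup before the request is replayed */

    je->valid = 0;              /* handler finished, entry can be dropped */
}
```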
To minimize the downtime of crash recovery, we also did some optimizations. For example, upstream QEMU keeps trying to reconnect to a vhost-user socket at a second-level interval; we made some changes to QEMU to support millisecond-level reconnection. We also delayed the restoring of open file descriptors from file handles: virtiofsd only restores an open file descriptor from its file handle when the file is first accessed. We tested the virtiofsd recovery downtime with fio, with 1, 100, and 1,000 files opened. As we can see in the figure, without the two optimizations the recovery downtime is at least one second. With optimization 1, we can achieve a recovery time within one second, and with optimization 2 we can further reduce the downtime to less than 100 milliseconds.

Okay, thanks, Jiachen, for the great presentation. Next, let me show our work on virtiofsd live upgrade and live migration.

For virtiofsd live upgrade, we could actually achieve it through crash recovery, so why do we need another mechanism? The main reason is to reduce the downtime as much as possible. In our tests, the vhost-user renegotiation has a significant impact on downtime, especially when there are lots of virtqueues, so we'd like to get rid of the vhost-user renegotiation during live upgrade. Another reason is that we'd like to get rid of the in-flight I/O replay, which makes things complicated. To implement this feature for virtiofsd, we first need a communication channel between virtiofsd and the supervisor process to launch a live upgrade; this can be achieved with QMP or something similar. We also need to save and restore the internal state, such as the in-memory state and the open file descriptors, but compared with crash recovery, these can now be inherited from the old virtiofsd in some way. At last, we need to find a point to stop the old virtiofsd, flush the in-flight I/Os, and start I/O processing in the new process. With these steps, we can upgrade virtiofsd without the vhost-user renegotiation and the in-flight I/O replay.

Live migration is also an important feature for virtiofsd. Currently, we cannot migrate a VM with virtiofs enabled, which limits the use cases of virtiofs. So we did some work to enable migration with a shared backend file system. How do we achieve that? Firstly, QEMU needs to support saving and loading the device state of vhost-user-fs. Secondly, the internal states should be migrated to the target node. The in-memory states such as the FlatMap can be sent directly, like the device state. But for the open file descriptors, since the backend file system will be accessed from another host, we cannot use file handles anymore; we must reconstruct that information on the target host. We also need to handle corner cases, such as files that are opened but unlinked, and atomic opens. Finally, the in-flight I/Os need to be handled: we can choose to drain the in-flight I/Os on the source host, or resubmit them on the destination.

Now, about the status: the first patch set of the crash recovery feature was sent to QEMU upstream last year, and we got a lot of valuable suggestions from the community. As for future work, firstly, we are going to post version 2 of the patch set to QEMU upstream soon. Then, we will try to enable these features for the Rust version of virtiofsd. Lastly, we will do some work to support them in rust-vmm-based hypervisors.

Okay, that's all for today's talk. Thanks for your attention.