So my name is Fernando. I work for the NTT Open Source Software Center, where I lead the kernel team and also the KVM team. Today I would like to tell you about all the trouble and all the issues we have with the filesystem freeze API. So what is filesystem freeze? Basically, it's the capability of suspending writes to your file system and putting that file system in a consistent state. By consistent state, I mean a state such that if we create a copy of that file system, we can actually mount it and access the data that is there, and all the metadata reflects what's on disk. The most common use case for this feature is storage backups, the reason being that filesystem freeze allows you to create storage snapshots that are consistent at the file system level. And how is this implemented in Linux? There are two APIs. One is accessible from user space and is implemented using two ioctls: FIFREEZE, which is used to freeze the file system, and FITHAW, which is used to unfreeze it. Then there's another API which is accessible only from inside the kernel and from kernel modules, implemented at the block device level. Since that one is not accessible to users, I will not talk about it in detail. So what does it look like internally? Let's assume that you want to create a backup of your file systems. The first thing you need to do is execute the FIFREEZE ioctl. Once you do that, all new writes to the file system will be suspended; the kernel will do that for you. So if a process, let's say, uses the write system call, that process will be put to sleep. And not only that: all the dirty pages that haven't been written back to disk will be written back to disk.
In some cases, for example if you use a file system such as XFS or ext4 which has a journal, we call filesystem-specific callbacks so that we can put a mark in the journal, so that if the system breaks or dies while the file system is frozen, you can actually recover your file system. So once we stop all new writes and write back all the data to the storage, we can create the snapshot or the backup, and we know that we will be able to use it, because the file system was in a consistent state when we created that snapshot. And once we are finished, we have to call the FITHAW ioctl to resume writes. It's pretty simple. It's pretty simple, but there are a lot of bugs and limitations that I will tell you about now. The first limitation, or the first problem, is that these days you can freeze a file system and unmount it right after that. But what happens if you unmount a frozen file system? Let's assume that after unmounting that frozen file system, you want to unfreeze it. The problem is that to unfreeze it, you need a file descriptor that refers to a file in the file system that you are trying to unfreeze. But the file system has just been unmounted, so it's not accessible. So there's no way for you to unfreeze the file system. You're stuck with a frozen, unmounted file system. In other words, the superblock inside the kernel is still alive, you know? Processes may still be using the file system: it's frozen, but it's in use.

Q: So how does the unmount actually happen? If there are no users?

A: Okay, if there are no users, a regular unmount succeeds. Or you can do a lazy unmount: even if there are users, a lazy unmount will succeed. So if you do a regular unmount with no users, or a lazy unmount with users, you're stuck with the frozen file system. If you want to get out of that situation, you have to mount the file system again.
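The freeze / snapshot / thaw sequence described above is easy to get wrong by hand, because any failure between FIFREEZE and FITHAW leaves the file system frozen. A hedged sketch of one way to structure it so the thaw always runs; freeze_fn, thaw_fn, and snapshot_cmd are placeholders for whatever your environment provides (the ioctls, an LVM or SAN snapshot command, ...), not a real API:

```python
import contextlib
import subprocess

@contextlib.contextmanager
def frozen(mountpoint, freeze_fn, thaw_fn):
    """Freeze `mountpoint`, yield, and always thaw again, even if the
    snapshot step in the body raises an exception."""
    freeze_fn(mountpoint)
    try:
        yield mountpoint
    finally:
        thaw_fn(mountpoint)

def backup(mountpoint, snapshot_cmd, freeze_fn, thaw_fn):
    """Take a snapshot while the file system is frozen. `snapshot_cmd`
    stands in for whatever your storage layer provides."""
    with frozen(mountpoint, freeze_fn, thaw_fn):
        subprocess.run(snapshot_cmd, check=True)
```

Of course, if the process running this dies between freeze and thaw, the file system stays frozen anyway; the current kernel API gives you no protection against that, which is the motivation for the freeze-fd proposal later in the talk.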
But if you do that, you will be writing to disk. For example, XFS and ext4 update some metadata on mount. But the file system is supposed to be frozen, right? There are several ways to fix this. A possible approach is not allowing users to unmount frozen file systems. This looks reasonable, but if you do that, you're breaking lazy unmounts, which is not acceptable to the VFS maintainer. The other approach is adding a block-level API to unfreeze a frozen file system. The problem with that is that not all file systems are block-device-based; I think the guys from Samsung are working on file systems that are not block-device-based, is that correct? So in that case, you couldn't use such an API. So what I decided to do is to automatically thaw file systems on unmount. With the current API, this is the only reasonable way to fix this problem. The next problem with the current API is that there's no check API. There's no reliable way to know whether a file system is frozen or not, so you have to keep track of the frozen state yourself. Maybe you make a mistake in a bash script and end up freezing several file systems, and there's no way for you to know what happened. You can freeze a file system and unfreeze it, but you cannot check what the current state is. This is quite easy to fix. I posted patches that add a new ioctl to check the frozen state. I also added a patch to export the freeze count through mountinfo, which is a file in the proc file system. Okay, more fun. In my opinion, this one is really cool. I don't know if you're familiar with the hung task watchdog. Are you familiar with it? The thing is that if a process gets stuck in an uninterruptible state for more than several seconds, the kernel will assume that something went wrong and that the kernel should panic or at least log something, you know?
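For illustration, here is how existing /proc/<pid>/mountinfo records split into their documented fields; the freeze count mentioned above would ride along as one more optional field, and is hypothetical (from the patch set, not mainline). The sample line in the test is the one used in the kernel's proc documentation:

```python
def parse_mountinfo_line(line):
    """Split one /proc/<pid>/mountinfo record into its documented fields.
    Optional fields (tagged key[:value] tokens) sit between the mount
    options and the '-' separator; a patched kernel could expose a
    frozen/freeze-count tag there (hypothetical, not in mainline)."""
    left, right = line.split(" - ", 1)
    l = left.split()
    fstype, source, super_opts = right.split()[:3]
    return {
        "mount_id": int(l[0]),
        "parent_id": int(l[1]),
        "major_minor": l[2],
        "root": l[3],
        "mount_point": l[4],
        "mount_options": l[5],
        "optional_fields": l[6:],   # zero or more tagged fields
        "fstype": fstype,
        "source": source,
        "super_options": super_opts,
    }
```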
And the thing is that when we freeze the file system, or in other words after freezing the file system, if a process tries to do a write, that write will be suspended. In other words, the process that attempted the write will be put to sleep, and it is going to be in an uninterruptible sleep. So the hung task watchdog will assume that that task is dead: that for some reason the CPU scheduler is not scheduling that task, or that the task is waiting for some request to finish or complete, but it never finishes. So depending on the settings of your system, the watchdog will panic your kernel. And the thing is, that task is just waiting for the administrator to call the FITHAW ioctl. So if you do it by hand, let's say you freeze the file system using this ioctl and then go and have a cup of coffee or whatever and forget about it, you may end up with a system which is panicked and unusable. What I did to fix this is give a hint to the hung task watchdog. I just added a flag to the task struct, so when the hung task watchdog looks at processes that are waiting for the file system to be unfrozen, it checks that flag, and if the flag is set, the watchdog knows that everything is okay and there's no need to panic the system. Okay, as I mentioned before, there are two different freeze APIs: one which is accessible from user space, and the other which is only for in-kernel users. freeze_bdev is the in-kernel API, and it is being used by both XFS and the device mapper code. So when we use DM snapshots, DM will try to freeze the file system that sits on top of the DM device. But the thing is that in some cases you have file systems that are multi-disk, such as Btrfs: with Btrfs you can build your RAID at the file system layer.
And the thing is that if you do that, there's no way for device mapper to know that what's sitting on top of the device mapper is Btrfs. The reason is that, traditionally, a file system like ext4 keeps a pointer to the device it is using in its superblock structure. That's okay for ext3 and ext4 because they can use only one disk at a time, right? But Btrfs would need a list of pointers to keep track of all the devices that it's using, and there's no such list. So there's no way for the device mapper code to get that information. So in that case, if you create a snapshot using DM snapshots and Btrfs is the file system that sits on top, chances are you will not be able to mount the snapshot that you created. You create a snapshot which is unusable. To fix this issue, what I did is modify the Btrfs code so that we keep a pointer to the Btrfs superblock in the struct block_device. Another funny thing is when you try to use fsfreeze inside a user namespace. The current fsfreeze code is not namespace-aware. So in some cases, for example, let's assume that you freeze a file system from inside a container, and the root container is using that file system too. You may end up hanging processes that belong to the root container, which breaks isolation. In my opinion, this is a security problem. It hasn't been addressed yet. I've worked on patches to fix this, but I still haven't had time to send them to the fsdevel mailing list, but I will. We have a public cloud, and this is a really big issue for us. Okay, so now let me talk about the cloud or virtualization use case for a while. Companies like NTT that have public clouds want to provide automatic backups for their customers. You can get more money from your customers if you provide that kind of feature.
And we've been providing such features for a few years, but what happened sometimes is that, okay, you create the backup for the user, and your user wants to use that backup, but they would not be able to mount the backup image. The reason was that we were not taking care of the guest's file system state. We were not freezing the guest's file systems, so we were taking the snapshot for the backup while the guest was doing writes. We ended up creating a snapshot of a file system that wasn't in a consistent state. If you want to fix that, you need some kind of cooperation from the guest, which means that you need to run a guest agent inside the guest. These days, in KVM, we have something called the QEMU guest agent, and that's what we are using. So what the cloud provider does is access that guest agent and ask the agent to freeze the file systems for us. After doing that, we can actually create the backup image, and we know that we will be able to actually use it. This looks simple, but as I mentioned before, we have no check API. What happens if the guest agent crashes or dies after freezing the guest's file system? Let's say that I'm a cloud provider, I've frozen my guest's file system through the guest agent, but the guest agent dies or crashes for whatever reason. We end up with a frozen file system. And the thing is that our user didn't know that we froze his or her file system, so our customers end up with hung systems. And the funny thing is that even if there was a way to reboot or restart the guest agent, the guest agent has no way to check the state of the file system to decide whether to unfreeze it. So with the current API, we cannot provide that kind of automatic backup service to our customers. It's not safe, because there's no check API. Okay, so what we did to address this issue is create a new freeze/unfreeze API from scratch.
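The guest-agent round trip can be sketched like this. guest-fsfreeze-freeze, guest-fsfreeze-thaw, and guest-fsfreeze-status are real QEMU guest agent commands spoken as line-delimited JSON; the socket path and the framing details here are simplified assumptions:

```python
import json
import socket

QGA_SOCKET = "/var/lib/libvirt/qemu/guest-agent.sock"  # example path, an assumption

def ga_command(sock, name):
    """Send one QEMU guest agent command as line-delimited JSON and
    return the 'return' member of the reply."""
    sock.sendall(json.dumps({"execute": name}).encode() + b"\n")
    reply = json.loads(sock.recv(4096).decode())
    return reply.get("return")

def freeze_guest(sock):
    # guest-fsfreeze-freeze reports the number of file systems it froze
    return ga_command(sock, "guest-fsfreeze-freeze")

def thaw_guest(sock):
    # guest-fsfreeze-thaw reports the number of file systems it thawed
    return ga_command(sock, "guest-fsfreeze-thaw")

def guest_freeze_status(sock):
    # guest-fsfreeze-status returns "frozen" or "thawed". Note that this
    # asks the *agent*, which is exactly the component that may have died;
    # the guest kernel itself offers no check API.
    return ga_command(sock, "guest-fsfreeze-status")

def connect(path=QGA_SOCKET):
    """Open the agent's UNIX socket (requires a configured virtio-serial channel)."""
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    s.connect(path)
    return s
```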
The way it works is that when you want to freeze the file system, you use a new ioctl, called FIGETFREEZEFD, which gives you a file descriptor. As long as you keep that file descriptor open, the file system will remain in a frozen state. For example, let's assume that the guest agent uses this new API. The guest agent uses this API, gets that file descriptor, and keeps it open. And let's assume that, for whatever reason, the guest agent dies. When the guest agent dies, the guest's kernel will close that file descriptor for you automatically, because when a process dies or is killed, all its file descriptors are closed automatically by the kernel. Since the kernel closes this file descriptor as part of the cleanup for the dead process, the file system gets thawed automatically. So there's no problem: even if the guest agent dies, we know that the file system will be unfrozen, and the whole problem goes away. There are several other issues that we are working on. The first one is that I'm trying to get my patches merged upstream. It's a pretty big patch set. I've been sending patches for something like two or three years, but the VFS maintainer, Al Viro, is really busy and doesn't have time to pick up my patches. I already got Acked-bys and Reviewed-bys from the XFS maintainer and the ext4 maintainer and several other people. Hopefully by the end of the year my patches will be upstream. Another thing I'm working on is VSS support. VSS stands for Volume Shadow Copy Service, a Windows API that works in such a way that when you try to freeze the file system, but before actually freezing it, it sends a notification to applications so that they can write whatever data they want to write back to disk. This is useful for databases such as Oracle or MySQL that may want to write all their internal buffers to disk before the file system is frozen.
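The semantics of the proposed freeze-fd API (not in mainline kernels) can be modeled in a few lines: the file system stays frozen exactly as long as the descriptor is open, and any close, whether explicit or part of the kernel's cleanup when the process dies, thaws it. freeze_fn and thaw_fn stand in for the real kernel-side operations:

```python
class FreezeHandle:
    """Toy model of the proposed freeze-fd semantics: freeze on open,
    thaw on close. The kernel's exit-time descriptor cleanup (which is
    what makes a crashed guest agent harmless) is modeled by __del__."""

    def __init__(self, freeze_fn, thaw_fn):
        self._thaw_fn = thaw_fn
        self.closed = False
        freeze_fn()          # obtaining the handle freezes the file system

    def close(self):
        # Closing the handle thaws the file system, exactly once.
        if not self.closed:
            self.closed = True
            self._thaw_fn()

    def __del__(self):
        # When the owning process dies, the kernel closes its descriptors;
        # here the garbage collector plays that role.
        self.close()
```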
So that the file system is consistent not only at the file system level, but also at the application level. Some people want both. There are some databases, such as PostgreSQL, that do their own journaling, so by doing log replay they can recover even if something crashes; if you take a snapshot without notifying PostgreSQL, it is capable of detecting that something happened without being notified. But that doesn't apply to MySQL and some Oracle databases. So for Oracle and MySQL this kind of pre-freeze notification would be cool, but some maintainers in the kernel community don't like it because it's something that comes from Microsoft. We have some customers that would like to have this, though, so even if it's not accepted upstream, we would be willing to maintain the patches ourselves. And that's it.

Q: Anything about file systems is interesting to me. Generally, I think there's some kind of gulf between the hardcore kernel community and ordinary users; there are some misunderstandings. I haven't been around for long, and now I'm much more aware of what kind of problems I might hit if I ever decided to use some of these features. Especially in the cloud, you have to be careful.

A: Yes, absolutely. The virtualization use case is really useful, but it's really broken. I work with the community, so hopefully my patches will be upstream by the end of the year and then backported to RHEL 7 or RHEL 6.5 or SUSE and so on. We are not quite there yet. That's why Red Hat wasn't providing support for this feature, even though it was part of the kernel. I mean, these ioctls can be used in RHEL 6.4 and 6.3 and 5.9; they're just out there, but Red Hat is not providing support for them.
I think they call it a technology preview, isn't that the name? So the API is there, and if you want to use it, go ahead, but they won't help you if it breaks, you know?

Q: So I understand that when, for example, a virtual machine is doing a snapshot, it fully relies on the agent running inside?

A: Yes.

Q: And it's the same for anything like Windows running inside the virtual machine? For example, if I make a snapshot of my Ubuntu guest running in KVM, what happens?

A: The freezing happens through the guest agent.

Q: But if I don't have any guest agent, nothing, and I take a plain snapshot from the host?

A: Okay, you may end up with a backup image that you cannot mount. That happens to us. Especially if the guest is not using write barriers for its file systems, chances are that you will not be able to mount the snapshot or backup image that you created. It's a lot of fun, or it's not, really; our customers were really pissed off. So that's why I decided to fix this.

Q: What do you think about the process freezer feature? I was curious to learn more about this.

A: Okay, so what would you need? What kind of...

Q: It might be better to freeze the entire process rather than just the file system. The whole stack on top of the file system.

A: Okay, so you want to freeze processes. Is that all you care about?

Q: Well, it's nice from a profiling perspective to be able to separate things out like that. So I thought, well, maybe I want to just freeze the thing that was doing bad I/O and let everything else continue, but I'm not sure.

A: I think it's interesting. The problem with this is that a process doing AIO will keep running. So maybe that doesn't work for you.

Q: Yeah, if it keeps running, it isn't the thing I'm looking for, but it was still interesting to learn about.
And it's kind of frustrating because, I mean, I fixed most of these issues like two years ago, but they're not upstream yet. Some of my patches date back to the 2.6 days, and there's no official release with the fixes yet. Well, you know, this is the VFS layer, and people have reason to be conservative there, because mistakes are costly, something like that. It's the VFS. And that was it.

Q: So you did find these kinds of issues with the hung task watchdog? You are using a flag to mark the processes that you want to skip?

A: I just attacked the problem; I didn't have to deploy anything special. Most of my work isn't deployed yet. It is frustrating, though. I was doing this for our customers.

Q: Maybe you could speed up the upstreaming with some social pressure: going to sysadmin conferences and saying, you know, you have this feature, your customers are going to be pissed off, they're going to fire you because of it. Then there would be some demand on vendors like Red Hat. Why isn't that working? You're applying pressure by talking about it.

A: Yeah. The thing is, I talked about this at the kernel summit a few years back. They don't really care about this one. It was like, oh yeah, it's broken, we'll fix it. But I fixed it, and the patches are still pending.

Q: It's like books: I try to learn more because we have a lot of I/O, so I go for kernel books, but they are written for developers. It's interesting to see the C code of what happens when you create a new process, but for sysadmins it's not very useful.
It's like ten sentences at the beginning of a chapter giving a description that would be useful to me, something I could use to tell my developers, Java developers: okay, it's not worth doing it like this. There are people who are developing the kernel, and there are people who are using it. So if you showed this not just to kernel developers but to people like DevOps, who might actually hit this problem, they might be more interested in investigating and asking how it happened and why.

A: Well, do you care about the namespace use case? Do you care about user namespaces, or containers in general? Let's see, if you have a container, the question there is: the file system is also being used by the root container. How should fsfreeze behave in that case? Should we allow containers to freeze file systems at all? That's just my opinion, but this gets really complicated, and you end up wanting to freeze just part of your file system, just one directory and not the rest. In some cases you can't do that, because you have a journal and there are dependencies, and it's a mess. So maybe what we should do is modify the VFS so that we simply don't allow containers to freeze file systems: return an error or whatever, I don't know, say it's something that we don't support.

Q: What would happen if we went down to the machine level and tried to freeze something from inside? Because you mentioned KVM. Is my container like a lightweight container, or like a KVM machine? How does KVM fit in here? Is it like a virtual machine?

A: It's not like a virtual machine.
Q: OpenVZ and LXC, because then you don't have to spin up a whole machine, and you can still isolate things.

A: Yes. In some cases, I imagine, if you are an ISP or something like that, you don't want that. So there are several possible approaches; I mean, you could say yes, allow freezing from inside the containers. That's one possible approach. We're still thinking about it. So yeah, any ideas? If you know what we should do, please, definitely pick up my patches. Any questions? All right. Thank you. Thank you.