So, hi guys, my name is Mikhail. Today I'll tell you something about the struggle I had with file locking while implementing a feature for libvirt. As you may know, libvirt allows you to start a virtual machine, or domain, as we call it, under just any user in your system. However, in order to do that, it needs to change the ownership, or set the security context, of all the files that the domain is going to touch. But that's not the end of the story. Later, when the domain is shut off, libvirt needs to restore the ownership of the files, because they may contain some sensitive data, like, I don't know, some passwords or some other kind of info. The problem was that when it restored a file, it returned it to root:root, because it hadn't stored the original owner anywhere. So my feature was that we store the original owner in extended attributes.

But this poses a problem, because chown and setfilecon are atomic on their own, but if you wrap them in some extended-attribute work, the whole section is no longer atomic. So you need some way to mutually exclude with other daemons that are trying to do the same over the same file. That's one scenario where you need file locking.

The other typical one is SQLite, for instance. If you have your database in one file, you might want to allow queries that run on two different tables to run in parallel, to enhance concurrency. But in order to do that, you need to be able to lock only some parts of the file, those parts that you are touching, not the whole file at once.

And the third typical example where you definitely need file locking are PID files. So, question: imagine that you have a daemon that is supposed to be running in at most one instance in your system. How would you achieve that? Anybody knows the answer?
Yeah, but, you know, that's a file lock again. Right, so you can have a PID file that you write your own PID into, and the second time you start the daemon, the second instance will come and read the PID file. It will probably be more clever and check whether the PID still exists. But that's not enough, because the first instance might have crashed, leaving the PID file behind, and another process might just have taken the PID, because of PID reuse. So we need file locking, so that the second instance, when it starts up, will either fail to lock the PID file, meaning the first instance is still running, or succeed in locking the file, meaning the first instance is long gone.

So, as I was researching what APIs I could use to fix my problem, I came to the conclusion that the situation we are currently in can really be depicted by this picture. I mean, we have some sense of security, but really, it's not security.

There are basically two major types of file locks that you can use. The first one is advisory, meaning the application itself is responsible for placing the locks, which means you have to change your code and add lock and unlock calls at appropriate places. The second one is mandatory, where the kernel is supposed to somehow come up with some sensible locking for applications, so you don't need to change anything and everything just works out of the box, which sounds promising, doesn't it? But not really, we will see that.

Of advisory locks, there are basically three types. This is not really specified anywhere, but they are supposed to be independent. So if you use a BSD lock in your application to lock one file and another application uses a POSIX lock to lock the same file, in theory both should succeed, but as we will see, that's not the case either. So, BSD locks, how they work.
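
To make the PID-file idea concrete, here is a minimal sketch in Python using a BSD lock through `fcntl.flock()`. The function name `acquire_pid_file` and the file layout are assumptions made for illustration; this is not libvirt's code.

```python
import fcntl
import os


def acquire_pid_file(path):
    """Try to become the only running instance; return an open fd or None."""
    # Open read-write: the PID is written into the file once the lock is held.
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # LOCK_NB makes flock() fail immediately instead of waiting.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None  # lock is held, so another instance is still running
    # Lock acquired: any PID left in the file is stale, so overwrite it.
    os.ftruncate(fd, 0)
    os.write(fd, f"{os.getpid()}\n".encode())
    return fd  # keep it open; the lock disappears when the process exits
```

The point of the design is that the lock, not the PID stored in the file, decides whether an instance is alive. A crashed daemon leaves the file behind but not the lock, so PID reuse cannot fool the second instance.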
By the way, all the APIs take a file descriptor to work with, so you need to open the file, and you need to open it for writing at least, because acquiring a lock is viewed as a write operation. Placing a BSD lock is really simple: you just take the file descriptor and say whether you want to lock the file exclusively or in shared mode. This is similar to the way read-write pthread locks work: you can have multiple readers or one writer.

And these are very good, except you will always lock the entire file, which doesn't really suit the database case. It might suit libvirt and PID files, but it's not going to fly for libvirt either. I mean, BSD locks are not POSIX. Even though they are widely available, they are still not POSIX, and on different platforms they might behave differently than you would expect. For instance, if you run an older Linux, and by older I mean 2.something and older, the API does nothing and returns success. Now call me crazy, but I don't think that's the way you're supposed to implement an API. If you run a newer Linux, the lock may get silently converted to a different type of lock. For instance, on NFS it's converted into a POSIX lock, so all of a sudden you might start clashing with other applications that are trying to lock the same files as you do. And also there's no atomic lock promotion. I mean, even the man page says lock promotion is done via unlocking and locking the file again, which definitely is not atomic.

Okay, but we have POSIX file locks, right? So they might be useful. Again, we open the file, and we set up a structure where we can tell whether we want to grab a read lock or a write lock, with the same meaning as before. We can then set which portion of the file we want to lock.
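
As an illustration of a POSIX range lock, here is a minimal Python sketch. It is not the code from the talk's slides. It uses `fcntl.lockf()`, which Python documents as a wrapper around the `fcntl()` locking calls. The helper names `lock_bytes`, `unlock_bytes` and `probe_from_child` are hypothetical.

```python
import fcntl
import os


def lock_bytes(fd, start, length, exclusive=True, wait=False):
    """Place a POSIX lock on [start, start + length); return True on success."""
    cmd = fcntl.LOCK_EX if exclusive else fcntl.LOCK_SH
    if not wait:
        cmd |= fcntl.LOCK_NB  # fail at once instead of blocking
    try:
        fcntl.lockf(fd, cmd, length, start, os.SEEK_SET)
    except (BlockingIOError, PermissionError):
        return False  # EAGAIN or EACCES: a conflicting lock is held elsewhere
    return True


def unlock_bytes(fd, start, length):
    """Release the lock on [start, start + length)."""
    fcntl.lockf(fd, fcntl.LOCK_UN, length, start, os.SEEK_SET)


def probe_from_child(path, start, length):
    """Fork a child that tries the same lock; True if the child could take it."""
    pid = os.fork()
    if pid == 0:
        code = 2
        try:
            fd = os.open(path, os.O_RDWR)
            code = 0 if lock_bytes(fd, start, length) else 1
        finally:
            os._exit(code)  # never fall back into the parent's code
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) == 0
```

Locking the second and third bytes of a file is then `lock_bytes(fd, 1, 2)`, since offsets start at zero. While that lock is held, `probe_from_child(path, 1, 2)` run against the same file should report that another process cannot take the range, whereas a probe of a range that does not overlap should succeed.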
So in this specific example, I'm locking the second and third bytes of the file. Then we can even choose whether we want to set the lock and return instantly if that fails, or wait for the lock to be set. We can even query the lock. I mean, it looks good so far, doesn't it? We can lock ranges, and the kernel even does some deadlock detection and prevention, which is good: if you would deadlock, you get an error from the call. And it works across NFS. So this looks promising, doesn't it?

Yeah, not really. The locks are not inherited by the child. Okay, you may argue that if you have an exclusive lock, there should be at most one PID which holds it, but shared locks are not inherited by the child either, which I don't think is sane. It doesn't work on Samba, which, again, you might argue that you don't care about Samba. But the worst part of POSIX locks is their semantics when it comes to close. I mean, if you have a trivial application where you open a file, lock it, read some bytes, write some bytes, close it and die instantly, you're not linking with any other libraries, you're running in single-threaded mode, then everything works nicely, and you really do want to unlock the file on close. But the problem is that it's the first close that releases the lock. So if you have a multi-threaded application and one thread opens the file and acquires the lock, and another thread opens the file and closes it, that will release the lock, leaving the first thread thinking it still holds the lock, which is a problem.

Okay, so you will work around it. You will re-implement open and close so that you defer closing the file until the very last moment when it's safe to do so. But the problem is, if you are linking with a library, you cannot do that. I mean, you can change your own application, but you cannot really re-implement the library, can you? Or you can, but do you want to, right? That's the question.
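
The close problem can be reproduced in a few lines. The following is a minimal Python sketch, assuming Linux semantics. The helper names `held_by_us` and `demo_lost_lock` are hypothetical, and the second path may be the same file name or another hard link to the same file.

```python
import fcntl
import os


def held_by_us(path):
    """Fork a child and let it probe: True if the child is refused the lock."""
    pid = os.fork()
    if pid == 0:
        code = 2
        try:
            fd = os.open(path, os.O_RDWR)
            try:
                fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
                code = 0  # nobody holds the lock any more
            except (BlockingIOError, PermissionError):
                code = 1  # the parent still holds it
        finally:
            os._exit(code)
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) == 1


def demo_lost_lock(locked_path, touched_path):
    """Return (held_before, held_after) around an unrelated open and close."""
    fd = os.open(locked_path, os.O_RDWR)
    fcntl.lockf(fd, fcntl.LOCK_EX)            # whole-file POSIX lock
    before = held_by_us(locked_path)
    other = os.open(touched_path, os.O_RDONLY)  # e.g. a library peeking at it
    os.close(other)                           # drops our POSIX locks on the file
    after = held_by_us(locked_path)
    os.close(fd)
    return before, after
```

On Linux, `demo_lost_lock(path, path)` is expected to return `(True, False)`: the lock is there before the unrelated open and close, and gone afterwards, even though the descriptor that took the lock was never closed.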
Okay, so you ditch all the libraries that you have, you re-implement everything yourself, and it still won't work, because, as you may know, files under Linux or basically any Unix can have multiple names. So if you open one file name and lock it, and another thread opens and closes the other file name, you guessed it, it will release the lock. So if you have a really simple, single-threaded application, POSIX locks work: they will mutually exclude with other processes, but not with the threads in the same process. And one thing that triggers my OCD: why is the structure called flock, which collides with the BSD call? Ah, I don't know. But okay. Ha, ha, ha. Historical reasons. Historical reasons, yeah.

Okay, but there's another API that we might use, again for POSIX locks. It's called lockf, but this is really just a fancy wrapper over fcntl. It does basically nothing more; it has the same capabilities.

Okay, so when Linux kernel developers saw that there's really nothing usable, they implemented open file description locks. It's not descriptor, it's description, meaning the lock is associated with the open file description itself, not the file descriptor. They took the good parts from the previous two approaches and combined them into one. And this looks really usable. I mean, you can use anything and the locking behaves exactly the way you would expect it to. But the problem is, this is Linux only. So if you care about portability, you cannot use it. It's not even on BSD or anything else. I mean, they're trying to get it into POSIX, but, you know, it's a long run, so we'll see about that. Oh, it doesn't? I thought it does. Okay, thank you.

But okay, so these are advisory locks, but we still haven't covered the mandatory locks, which look promising. I mean, you don't have to change anything, right?
And it just works out of the box, doesn't it? Well, not really. I mean, even the man page says do not use it, it's terribly broken, we are not going to fix it. I suspect it's because it uses POSIX locks under the hood, but really, just do not use it. They are not going to fix it any time soon. They're racy. And they are also racy, yeah. They're racy on the [unclear]. Yes, yes, there is a lot of trouble with it. So, I mean, the man page says do not use it. We're not. So for the record, we are not gonna compile them into RHEL 8.

So what's the solution? Because libvirt cares about portability, I had to use POSIX locks. But, as I said, they work only for really trivial applications. So I had to create a really small application that I could use, which basically is done by a fork. If I fork, then I have a single-threaded application with a small critical section. And that's basically the approach that libvirt uses. I mean, libvirt links with a ton of other libraries, so POSIX locks look like the best solution for us.

SQLite, as I said earlier, SQLite too uses POSIX locks. But if you read the code, and I really advise you to do that, they have like a three-screens-long description of how they are using POSIX locks. They re-implemented everything. They then use pthread locks to mutually exclude the threads. It's really messy code, but, again, they care about portability, so they cannot really use anything else. And they also narrowed down the libraries: they are linking with only three libraries, I guess, libc and some other two basic libraries.

And for PID files, it's safe to use BSD locks, because you don't care whether you lock a portion of the file or the whole file. By the way, libvirt couldn't use BSD locks because we are locking the same file that QEMU is locking.
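
The fork approach can be sketched like this in Python. This is an illustration under stated assumptions, not libvirt's implementation. The names `run_locked` and `bump_counter` are hypothetical, and the counter stands in for the real critical section (reading and writing extended attributes around chown).

```python
import fcntl
import os


def run_locked(path, critical_section):
    """Run critical_section(fd) in a forked child holding a POSIX lock on path.

    Returns True if the child finished cleanly.
    """
    pid = os.fork()
    if pid == 0:
        # The child has a single thread and only the descriptors it inherited,
        # so nothing else in this process opens or closes the file behind us.
        code = 1
        try:
            fd = os.open(path, os.O_RDWR)
            fcntl.lockf(fd, fcntl.LOCK_EX)  # blocks until the file is free
            critical_section(fd)
            code = 0
        finally:
            os._exit(code)  # exiting closes fd and thereby drops the lock
    _, status = os.waitpid(pid, 0)
    return os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0


def bump_counter(fd):
    """Example critical section: read-modify-write a decimal counter."""
    raw = os.pread(fd, 64, 0).strip()
    value = int(raw) if raw else 0
    os.ftruncate(fd, 0)
    os.pwrite(fd, str(value + 1).encode(), 0)
```

Because the critical section runs in the child, any change it makes to Python variables is invisible to the parent. Only its effects on the file system and its exit status come back, which fits work like changing ownership or extended attributes.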
So you might know that QEMU already does some locking of the disks that it uses. Because of what I said earlier, BSD locks might get translated into POSIX locks. So if we were to use BSD locks, we might actually try to lock the whole file, and therefore we would deadlock with QEMU. So POSIX was the only option that I could go with. And I think that's it. If you have any questions, comments, please. Yes?

Right, so the question is, what's the problem with POSIX locks and multi-threaded applications? So imagine you have an application that runs in two threads, right, and one thread opens a file and locks it because it wants to work with it. And the other opens the file and closes it immediately. Doesn't really matter. Doesn't really matter, because at the moment the second thread calls close, all the locks for the file are released. Right, but, okay, here it is, you may place read locks, right? You can have as many read locks as you want. So suppose you have two threads and they are taking read locks. The first one acquires the read lock, and let's suppose the second one places a read lock too. At the moment one of the threads closes the file, all the locks are released. It doesn't, yes, yeah, it's associated with the inode, which is the problem.

So the question is why I haven't used the new stuff. As I said, libvirt cares about portability. Libvirt has to run on BSD and some others. Yeah, that could be an option too, but then you would have to maintain two different code paths, which poses a problem of its own. Yes, and also, yeah, that's a good point. Like if you have BSD running and it wants to change extended attributes on, say, NFS, and you have a Linux machine running, you need to mutually exclude these two to, you know, yes, they will, yes. But then again, you have two code paths that you need to maintain. SMB, yeah, of course it works.
They don't work on Samba, Samba mounts. SMB mount. Okay, so through CIFS. Yes, yes, yeah. Well, they do. Yeah, technically. So essentially, yes. But they're mandatory, typically, on this machine. Now, if you're talking over a CIFS mount to Samba, it uses the POSIX extensions and those will get translated into a normal fcntl lock on the other end. And I'm not sure if Samba ever did use the OFD lock too. Did it come out? I love Jeremy. Jeremy, how would it be about it for ages? And then I'm not sure if he ever did make a CSI. So technically, they do work on SMB, but it really depends, as always, when you're mapping the system. Yeah, exactly. The system does not work on 21 transfers. Right, any other questions, comments? Okay, thank you.