I expect people can see my screen. Yeah. I cannot but I can refresh them with help. Does anyone see? Yeah, see it. Please continue. All right, very good. I won't be sharing my video today because I noticed that the quality of the presentation suffers because of the video processing. So forgive me for not being on your screens.
Great. So what we're going to talk about is a very interesting piece of machinery that we had to come up with to make sure that we have a very performant and yet very secure solution for our document jailing. First we're going to talk about the need for such a mechanism. Then I'm going to go into some detail on the technical level as to what it really took to get this whole thing working properly and securely. At the very end, I will leave some time for questions.
Let's take a step back. What I want to do is make sure that we're all on the same page. The architecture of Online is such that every document is isolated, both in terms of the process it is running in and in terms of its file system. This is roughly the big picture. There is WSD, which is a daemon. It is the client-facing, external-facing process that receives all the connections. Internally, it needs to figure out, for every client, which document they have access to or are trying to access. Behind the scenes, for every document, we have a separate LOKit instance. That instance is essentially loading all of the LibreOffice core libraries and the rest of it. Of course, in our deployment, it's typically Collabora Office. But you get the picture. Forkit, at the bottom, is the process that is responsible for spawning these processes in the background. And indeed, we do this proactively, meaning we create extra ones, so that when a new document is requested, we have spare processes and can load it as fast as possible.
So every process, the LOKit on the far right, has a jail associated with it. That's the file system that it can see, and it is the only file system that it should be able to recognize and access, meaning it can't go outside of the limits of that particular file system that we give it. So how do we do that? I believe the previous talk actually touched upon this. chroot is the key system call that we use, and that's core functionality supported by Linux. The way we do this is, essentially, we give it a path and we say, from now on, this is the root of your file system. Anything above that is invisible. Anything within that is what is available and what exists for that particular process. That's the basic idea.
Because chroot is obviously a privileged call, once we've done it we need to drop these privileged capabilities. So the LOKit process, once it is set up, reduces its capabilities to the bare minimum so that it isn't a liability.
But to do that, we obviously need to prepare this file system in advance. As I said, when you chroot to a path, that path becomes your root, so it needs to have everything that any process expects to find. On the right, you can see the template that we create, which is called systemplate. That's the template that becomes the jail root. You can see that we have dev in it, we have etc, and obviously the system libraries, even usr and var as well. So we have everything that a process needs. And this is just level one. There are literally thousands of files in this directory. It is created at installation time, so it has to be set up in advance, at the time of installing and setting up Online. The idea is that we have a setup script, loolwsd-systemplate-setup, which is responsible for creating this directory.
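As a rough illustration of the chroot idea described above, here is a minimal Python sketch. The function names `path_seen_in_jail` and `enter_jail` and the example paths are hypothetical and not taken from the Online code base, and the capability dropping that follows the chroot is left out.

```python
import os
from pathlib import PurePosixPath


def path_seen_in_jail(jail_root, real_path):
    """Return the path a jailed process would see, or None if it lies outside the jail."""
    try:
        relative = PurePosixPath(real_path).relative_to(PurePosixPath(jail_root))
    except ValueError:
        # Anything that is not under the jail root is simply invisible.
        return None
    return str(PurePosixPath("/") / relative)


def enter_jail(jail_root):
    """Make jail_root the root of the calling process.

    chroot is a privileged call, so this only works with the right
    privileges; a real process would drop them right afterwards.
    """
    os.chroot(jail_root)
    os.chdir("/")  # make sure the working directory is inside the new root
```

For example, with a jail at `/opt/jails/abc`, the real file `/opt/jails/abc/etc/hosts` is what the jailed process opens as `/etc/hosts`, while the real `/etc/shadow` has no path inside the jail at all.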
And once it is created, it is used as a template to create the jails for every document. Besides this main template, we also have LO, which is the LibreOffice installation. You don't need to copy it into the template. It has its own directory. But what we need to do is make it available inside the jail itself, and that is done in the lo directory. You can see in the screenshot on the right that it is provisioned already, and that's ultimately where it's going to live within the jail.
The jail bootstrapping itself used to be done in a very simple fashion. Logically, you would copy the systemplate, and then within it you would copy the LibreOffice installation files into the lo subdirectory. All of that becomes your new jail. Then you chroot to it, and that's it. You have a dedicated file system for a given document. Once the document is done, all you need to do is essentially rm -rf that directory, and it's gone.
The problem with this is that it's not very flexible and it's not very fast, and speed is the main challenge. Instead of copying, you can obviously link the files. But hard linking only works if you're on the same file system. It doesn't work across file systems. So you have a lot of challenges like that, and I will go a little bit more into it.
One minor thing to add is that the name of the jail directory on the real file system is a cryptographically secure random string. So even if somebody can break out of the jail, they cannot figure out where to find another document. It becomes a challenge for them. That's a basic security idea.
So as I said, the problem with copying or linking thousands of files is that the performance is not amazing. Even on a fast SSD, you will see that you're spending a couple of hundred milliseconds in the best-case scenario. That's pretty good.
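To make the old copy-or-link bootstrapping concrete, here is a small Python sketch of it. It is not the Online implementation: the function names are made up for illustration, the 16-byte name length is an assumption, and only regular files are handled (symlinks and device nodes are ignored).

```python
import os
import secrets
import shutil


def new_jail_path(jails_root):
    """Pick a jail directory whose name is a cryptographically secure random string."""
    return os.path.join(jails_root, secrets.token_hex(16))  # 32 hex characters


def link_or_copy_tree(template, jail):
    """Populate jail from template, hard-linking where possible and copying otherwise.

    Returns (linked, copied) so the caller can see which path was taken.
    """
    linked = copied = 0
    for dirpath, _dirnames, filenames in os.walk(template):
        relative = os.path.relpath(dirpath, template)
        dest_dir = os.path.normpath(os.path.join(jail, relative))
        os.makedirs(dest_dir, exist_ok=True)
        for name in filenames:
            src = os.path.join(dirpath, name)
            dst = os.path.join(dest_dir, name)
            try:
                os.link(src, dst)  # fails across file systems
                linked += 1
            except OSError:
                shutil.copy2(src, dst)  # slower fallback: a real copy
                copied += 1
    return linked, copied
```

Either way there is one system call (or more) per file, which is why the cost grows with the thousands of files in the template.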
And it's a cost that you can live with, given that you're actually doing all this preparation of the LOKit processes and their jails in advance. You always have a couple of spare processes that are ready to handle the next document. So this is not a terrible cost, until you realize that there are deployments that are done in containers like Docker, for example. In the Docker world, you have an abstraction layer that isolates the real system from the Docker image. Within the Docker image, if you try to link thousands of files, you will see that it's painfully slow, as in tens of milliseconds per link. It literally takes seconds to create any individual jail with all the files in it. That absolutely does not fly. You will time out, you will fail. The user in most cases will give up and just walk away. I mean, if it's taking a minute to open a document, well, they're not going to think very highly of the program.
These are very serious challenges. What we needed was a way to avoid all of these problems and, at the same time, maybe even improve the security of the system. You'll see how we've achieved that, albeit with great technical challenges, but nothing that was insurmountable. So ultimately, you will see a success story. This is the basic situation as it was before we went with this bind mount idea.
So what is a bind mount? Mounting is a very fundamental part of getting the file system set up. Usually the way you would do this is you would mount a disk. When you mount the disk, the file system is either recognized or specified. Once it is recognized, mount is responsible for making sure that it is available at a certain path in your file system tree, and everything works just fine. The case here is not one where we're trying to mount a new file system or a new device. What we're really trying to do is mount a single directory.
And we just want that directory to be available at a certain path, and that would be the jail path. That's the basic idea, and bind is exactly the recipe for that need. You can bind any directory at any other location, even across file systems, and that works just fine. You can imagine why that is: that's actually what mount does. It mounts drives. So it does support doing this across systems and drives and even virtual mount points as necessary.
We only need to deal with two syscalls. One is mount and the other one is umount, or umount2 with the second option. And we need to have the sysadmin capability. That is something we don't like to hold on to, so what we do is have a dedicated process for this that has the sysadmin capability. The loolmount process is the dedicated process for doing the mounting and unmounting. That limits the exposure of the elevated, privileged capability to that particular process only. So we reduce our security exposure.
It turned out that it's not that straightforward to simply say mount, give it a path, and be done with it. The fact is that the API is actually much more complicated. It has a lot of options and, as you would imagine, these options are not necessarily always compatible with one another. So you really have to choose your strategy. A lot of cycles were spent trying to understand not just how mount behaves with the different options, but what it is that really works for us. This is not as straightforward as people might imagine because, depending on the resources and capabilities that you have available in your API, and depending on how you want to set up your jail, you need to find the best combination. That was the first challenge. The second challenge is that you actually can't do a single call and get exactly the picture that you need.
So as you can see, we ended up doing three different calls. The first two are absolutely required because when you're doing a bind, you cannot also enable read-only. Read-only has to be a modification that you make in a second call, with the remount flag. Then you can say, I want this to be read-only, and I don't want access times, and no SUID, and all the rest of it. So you have to do this in two steps. But it turned out that even that is not enough, as we will see in a minute or two. We really had to do a third call which says, don't bind this mount again. We needed this because we actually had to mount within the mount, and we didn't want this to become a recursive problem. These were, as you can imagine, cases that were found during testing and during development.
So what are the real challenges with using mount? So far, I think it sounds like this is the silver bullet. You just mount. Okay, fine, you need to call the API three times, but once you do it, you're done. In reality, that was not the case at all. It's not that straightforward. The main problem is that when we do the mount, we actually need to modify the mounted directory. Remember that when we're doing the systemplate copy, we can create new directories there, we can create a writable temp directory, and we can do all sorts of modifications after that. Even if we link the files, we still have a new directory that we're playing with. But with mounting, you're sharing the systemplate. So your first obvious problem is that if you're sharing a writable directory across all the jails, well, that's a security hole. So systemplate must become read-only after mounting in the jails. The jail itself has to be read-only. And you saw in the previous slide that we were indeed mounting it read-only.
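The three-call sequence described at the start of this part of the talk can be sketched in Python through ctypes. This is only a sketch under stated assumptions: the flag values are the Linux ones from `<sys/mount.h>`, the function names are hypothetical, and the exact flag set the real mount helper passes is not given in the talk beyond read-only, no access time, and no SUID. For a writable mount, the second call simply leaves out the read-only flag, which is also an assumption.

```python
import ctypes
import os

# Linux mount(2) flag values.
MS_RDONLY = 1
MS_NOSUID = 2
MS_REMOUNT = 32
MS_NOATIME = 1024
MS_BIND = 4096
MS_UNBINDABLE = 1 << 17


def bind_mount_steps(read_only=True):
    """Return the flags for the three mount(2) calls, in order."""
    restrict = MS_BIND | MS_REMOUNT | MS_NOSUID | MS_NOATIME
    if read_only:
        restrict |= MS_RDONLY
    return [
        MS_BIND,        # 1. bind the directory at the target path
        restrict,       # 2. remount the bind with the restrictions applied
        MS_UNBINDABLE,  # 3. forbid binding this mount again
    ]


def bind_mount(source, target, read_only=True):
    """Run the three calls; needs CAP_SYS_ADMIN, so it fails for unprivileged users."""
    libc = ctypes.CDLL(None, use_errno=True)
    for flags in bind_mount_steps(read_only):
        result = libc.mount(source.encode(), target.encode(), None,
                            ctypes.c_ulong(flags), None)
        if result != 0:
            err = ctypes.get_errno()
            raise OSError(err, os.strerror(err), target)
```

Keeping the flag computation separate from the privileged call makes the sequence easy to inspect without actually mounting anything.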
That raises new challenges, because now we have to struggle with the fact that we need to create writable directories, including a writable temp directory. On top of that, we actually need to update some files, typically the network-related files. If the network config files change on the real system, you don't want to force your clients and partners to restart the daemon or, worse, ask them to shut down the daemon and reinstall. That wouldn't fly at all. So you need to do something at runtime. The next time you spawn a jail, you need to be able to update these files. That is a very serious challenge, especially if systemplate becomes read-only. If systemplate is still writable, then you can update systemplate, and the next time you mount from it, your files will be up to date. As we will see, that was actually another challenge that we had to deal with after all of this work was completed.
Another challenge is that mounting can fail, or might not be available, or might not be available with all the options that we need, depending on the system, the kernel, the patch. It's a complex world out there, and we have to be ready with fallbacks. On top of that, we obviously allow sysadmins to disable this in the event that they see some problem with it. So this is configurable. It is enabled by default, and the only reason we enable it by default is that we are confident in our fallback logic. But once it fails or is disabled, we have to do the fallback, and the fallback is really just to link and copy files.
But now you have another problem. If you try to unmount a directory that was really copied, first of all, that's going to fail, and that in itself is not a problem. But you will leak a lot of files on the file system, in the megabytes and in the gigabytes per document.
And soon enough, you're going to get a call from an angry sysadmin complaining that the disk is being filled by our product. So we need to differentiate between what is mounted and what was copied, so that we can delete the copied stuff and unmount the mounted ones. And obviously, if you try to delete a mounted directory that isn't read-only, you end up destroying your systemplate, and you essentially destroy your installation. So this isn't completely straightforward magic.
With all of that on the table, we had to come up with a new strategy. The idea of the new strategy is that we need an end-to-end approach that really gets everything working as expected. The first thing is that we need to make sure that the systemplate itself is created with the provision that it might actually be mounted, as opposed to copied. Copying or linking is straightforward, but mounting has to have a special setup, especially if systemplate becomes read-only. Again, I'm going to return to that briefly at the end.
Another thing is that we really need to make sure that the random devices are set up properly. If you mount read-only, and then you try to create the random nodes within the read-only path mounted in the jail, it won't work. So a trick that we came up with is to use the temporary directory, which is writable. We create symlinks into the temporary directory, to targets that don't exist yet. I will very briefly go back to the screenshot here, and you can see that /dev/random and /dev/urandom are actually symlinks to a non-existent temporary path, which is a relative path. This temporary directory is going to be created in the jail, as we will see, but doesn't exist in the systemplate. This is a trick that we use to overcome this challenge.
So the basic idea is that we split this into really four parts. The systemplate setup script does the basic setup of the systemplate directory.
loolwsd, which is the main daemon, is responsible for figuring out if mounting is enabled and is possible. So we do a test mount, and based on that we actually enable or disable mounting. Then Forkit, which is the middleware that is responsible for spawning these background LOKit processes and setting up their jails, is responsible for updating systemplate, if it is writable. It's also responsible for doing the cleanup. It holds the main cleanup logic, because that's where we do the spawning and where we make sure that we have extra processes in the background. The kit, though, is the one that does the heavy lifting. That's the LOKit process that hosts the documents themselves. And that's where the rubber meets the road, so to speak.
This is where we talk about what the kit is doing when it's trying to set up a jail. When the kit process is started, it is responsible for making sure that its jail is set up properly. First it checks if mounting is enabled, because it could be disabled, as we've already discussed. If it is enabled, first it mounts the systemplate. That's the root of your jail. Within that it needs to mount the lotemplate, which is the LibreOffice installation, and make it read-only as well. Then it needs to create a temporary directory that is writable, but is created with a cryptographically random path, and we bind mount that within the jail path.
So the visual of this is that the systemplate becomes your root at a new path, and that's the jail. Within it you have the lotemplate, and within it you have the temp directory. That's the basic structure. So you have three mounts, really, two of them inside the first. And this is why, as you would remember, we had to remount with unbindable: when we do a mount, we need to do a remount to make it read-only, and the remount can actually pull the other mounts with it, duplicating them, and you don't want that.
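The jail layout just described can be written down as a small mount plan. The Python sketch below only builds and counts the plan and does not mount anything. The directory names `lo` and `tmp` inside the jail, the example paths, and the function names are assumptions made for illustration.

```python
from pathlib import PurePosixPath

# Per the talk, each logical mount takes three mount system calls.
CALLS_PER_MOUNT = 3


def jail_mount_plan(systemplate, lotemplate, tmp_source, jail):
    """Return the three logical mounts for one jail, outermost first."""
    jail_path = PurePosixPath(jail)
    return [
        # The shared template becomes the jail root, read-only.
        {"source": systemplate, "target": str(jail_path), "read_only": True},
        # The LibreOffice installation is mounted inside it, read-only as well.
        {"source": lotemplate, "target": str(jail_path / "lo"), "read_only": True},
        # A per-jail writable directory is mounted inside it as scratch space.
        {"source": tmp_source, "target": str(jail_path / "tmp"), "read_only": False},
    ]


def syscall_count(plan):
    """Number of mount system calls needed to realise the plan."""
    return len(plan) * CALLS_PER_MOUNT
```

The order matters: the jail root has to be mounted before the two mounts that live inside it, which is why the plan is a list rather than a set.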
So again, there are technical details here that were very carefully managed. After all of this is successful, we can move on to creating the random devices in the temp directory. But if any of the above mount steps fails, then we have to do the copying or linking fallback. Once we are done with the random devices as well, we create the environment variables and we're done. That's the final step. Ultimately it costs us three logical mounts, and each mount, really, is three system calls. So there are nine system calls to set up a jail with this approach.
So this is my last slide before I go to questions. After we were done with all of this, we weren't done. First of all, there was a request for systemplate to be owned by root, from really nervous sysadmins who wanted to make sure that there was absolutely no way to hack into the template directory, which is used for all of the documents. That meant that we can't really update the systemplate post-installation. So we had to work very carefully with the dynamic network files that might need updating, and we had to rework a lot of that logic. We had the logic to detect whether these files are up to date or not. We link to them at installation time, but that link might actually fail. So there were a lot of special cases to get this to work.
Another major challenge was, as I had in my slides at the very beginning, that we have multiple flavors of Online. We have deployments on mobile that have special builds, and we have AppImage, and we have Docker installations. Depending on where and how the code is built, you get different flavors and different combinations. AppImage and mobile in particular don't really have a systemplate and they don't use chroot. So it's a special case, you can't do a lot of the things that we do here, and you need to handle them in a very special way.
But with all of that done, and all the special cases and corner cases handled correctly, everything really works very well. The performance is fantastic because you are always doing at most these nine system calls, unless you have to fall back, which should happen very rarely. So your performance, regardless of whether you're running in Docker or even on a spinning disk, will be milliseconds per jail. And with that, I'm happy to take your questions in the remaining few minutes. Thank you.
Ash, I'm Gabriel from 1-on-1. So I wanted to ask you something. I know that when you start a container from the Docker image that contains the Online application, you have that, let's say, time-consuming hard linking, right?
Yes.
And this bind mounting solves that issue?
Completely, indeed. Because previously you needed to make as many system calls as you have files, and the problem there is that each one is really slow, whereas with the mount you're really doing only nine at most. Even if each one is on the order of, let's say, 50 milliseconds, which is extremely slow, you're still going to be under half a second for the whole thing. So the performance should be amazing. That is the promise, that's the expectation. Unless we find a case in the wild that has some performance issue, which we should obviously address, we expect this to resolve the problem for everybody.
Yeah, I understand. Well, for that problem, at some point I found how to solve it, but not through bind mounting.
And that's good to hear.
That was an issue related to the overlay file system.
Right. And I would imagine others to have similar setups. And again, the game and how you get up.
Yeah. And I solved that just by copying those folders once again. I mean, when the container's last layer, the container layer, is created.
Yeah.
When the container starts, you just need to copy those folders again, systemplate and lotemplate.
And that solves the issue. I think I discussed this already with Tendient.
Very good.
I don't know if they shared that idea with you.
Can I stop you, so that we respect the people who are behind you with their talks?
Yes, indeed. We're out of time.