Good morning, everyone. The presentation will be about how to build a file system in Python. Well, don't worry, I'm going to talk in English. Our talk will be about GitFS, and we'll share some advice and some experiences on how you can actually build a file system in Python. I'm Emmanuel, and I'm a software engineer at PressLabs. Hi, I'm Vlad, and I'm also a software engineer at PressLabs.

Before we get to the details of the file system, I would like to introduce our company a bit, so you get a picture of what we do and the problems that we encounter. We are a Romanian startup, and we do WordPress hosting dedicated to publishers. Our main goals are performance, reliability, and humanity. As you can see at the bottom of the slide, we have hit some interesting numbers over the years. We had 45 million page views on a single site in a single day. We also had 6 million page views on a single site in a single hour. In our busiest month, summed across all sites, we had 2.2 billion page views. And in the past 12 months, we had only 0.0006% outage time, including maintenance.

Okay, so this is gone now, and so is this. And we haven't even started the demo, so we apologize; this was not planned at all. We are both on call, so this was an emergency alert. Problem solved. It was not planned, but it turned out well.

As you can imagine, the business is far from perfect, and one of the problems that we encountered over the years was the conflict between the publishers, namely the site owners, and their developers. Usually the workflow goes like this: someone has a website and a developer, the developer writes the code, and everyone is happy. Until the publisher, namely the site owner, tries to change things, even though they don't have the technical know-how. So this is it: we have chaos. They break the site. We don't know who changed what. The publisher starts blaming us.
The developers start blaming their publishers and blaming us. So yeah, we have a big pile of chaos. We thought really hard about how to fix this problem, and after some thinking we came up with GitFS.

But what is GitFS? GitFS is a self-versioning file system based on Git. Once you mount it, you can use it just like a normal file system, but behind the curtains it will do the versioning automatically. From a functional point of view, it takes Git's complicated tree structure, which is not really human-readable, and transforms it into this. As you can see, we have the root folder, which contains two main folders: current and history. The current folder holds the state of the repository at the latest moment, so in the current folder you will find the newest content. The history folder has a folder for each commit: basically, we take each commit, extract the content from the Git objects, and display it in a human-readable way. In the current folder you can write, change, and view the content; in the history folder you can only view the content of the commits. So yeah, this is it. Simple, right?

So let's do a demo, and hopefully it will work as planned. Okay. We have here the remote repository; I'm not using the network, hopefully, this time. And here we have the developer's clone of the remote repository. Now we're going to mount this file system at a mount point. It's very easy: you just pass the remote URL, the mount point, and some parameters, like the local repository path and some timeouts. Okay. Now in the mount point, you're going to see that structure with current and history. In current, you'll have the current state of the repository, which is just one file. And in history, you're going to see a very nice history of that repository, grouped by commit. Now let's go to the developer side and write some text: put 42 in the readme, commit it, and push it.
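The mounted layout described above looks roughly like this (the file names and the exact naming scheme of the per-commit folders here are illustrative, not necessarily what GitFS produces):

```
mountpoint/
├── current/                  # writable: the repository at its latest commit
│   └── README.md
└── history/                  # read-only: one folder per commit
    ├── 2014-11-23-14-07-a1b2c3d/
    │   └── README.md
    └── ...
```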
Now if we go to the GitFS mount point, into current, and open that file, we're going to see the 42 content. And in history, the last commit is now there. And pretty much that's it. History is a bit special: you cannot do any write operation there; writes only happen in the current directory. Thank you. As you can see, it's as easy as one, two, three.

It was built entirely in Python. It's open source, so if you find this project interesting, we welcome you to contribute, change it, adapt it to your needs, and maybe we can grow it further from this point on.

But how was it actually made? Well, since neither of us had previous experience in building file systems, we started with some research. We jotted down some requirements, and after analyzing those requirements, we defined two problems. First, how can we handle the Git objects in a very efficient manner, both time-wise and memory-wise? And second, how can we implement the file system operations, again very efficiently?

For solving the Git object management problem, we use pygit2. pygit2 is a wrapper on top of libgit2, a library written in C which handles the Git objects directly. So no command line, no time wasted. For implementing the file system operations, we use fusepy, which again is a wrapper, on top of the FUSE C library. And using fusepy, you'll see, we have a very elegant way of implementing the file system operations. And now I'm going to let Vlad tell you more details about the intricacies of how GitFS works. Vlad?

Thank you. Okay, to simplify our job a little, we introduced a concept called views. A view is basically just a class that implements some syscalls with some specific logic. For example, for each directory we created a view: the current view for the current directory, the history view for history, and so on and so forth. Between the actual syscall and those views, we introduced a router.
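Before the router details, here is a flavour of the fusepy style: you subclass `fuse.Operations`, implement syscalls as plain methods, and mount with `fuse.FUSE(fs, mountpoint)`. The toy below mimics that shape over an in-memory dict (it deliberately avoids the fusepy import so the sketch runs anywhere; it is not GitFS code):

```python
import errno
import stat

# Toy read-only filesystem in the shape fusepy expects.  With fusepy
# installed you would declare `class ToyFS(fuse.Operations)` and mount
# it with `fuse.FUSE(ToyFS(files), "/mnt/point", foreground=True)`.
class ToyFS:
    def __init__(self, files):
        self.files = files  # path -> bytes

    def getattr(self, path, fh=None):
        if path == "/":
            return {"st_mode": stat.S_IFDIR | 0o755, "st_nlink": 2}
        if path not in self.files:
            # with fusepy you would raise fuse.FuseOSError(errno.ENOENT)
            raise OSError(errno.ENOENT, "no such file", path)
        return {"st_mode": stat.S_IFREG | 0o444,
                "st_nlink": 1,
                "st_size": len(self.files[path])}

    def readdir(self, path, fh):
        return [".", ".."] + [name.lstrip("/") for name in self.files]

    def read(self, path, size, offset, fh):
        return self.files[path][offset:offset + size]
```

FUSE calls these methods on your behalf whenever a process touches the mount point, which is what makes the view classes below possible.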
Based on some regular expressions, when an open, for example, is passed to the router, the router routes the syscall to the specific view and executes the proper logic. It's a pretty obvious pattern: Django does it, everybody does it. Now, if I'm going to open a file from nine months ago, for example, I do an open syscall, that open syscall is passed to the router, the router decides that I need the commit view for that open, instantiates a new commit view, executes the open, and returns the file descriptor.

This is our very easy and useful diagram of the views. We have a main class, called View, which inherits from fusepy's LoggingMixIn and Operations. That view is inherited by the read-only view and the pass-through view. The read-only view is inherited by the history, commit, and index views, because, as you saw earlier, you cannot change the past. The current view inherits from the pass-through view: basically, the current directory is just a pass-through to the underlying repository, with some additional magic for the write operations.

Okay. As you know from real life, if you are doing a lot of pushing, pulling, committing and things like that, you eventually get a lot of conflicts. So did we. We implemented a simple push-pull mechanism in our file system, and in order to solve those conflicts, we chose to implement a strategy called "always accept mine", because for us it's one of the safest strategies. But pygit2 doesn't give you this option, so you need to implement your own strategy by hand. The strategy mechanism is also pluggable: if you want to implement or use another strategy, you just specify it at mount time.

Okay, let's simulate a conflict. We have a branch, let's call it master, with commits 1, 2 and 3. On the remote, the developer pushes commits 4, 5 and 6, and our file system locally writes commits 7 and 8.
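Over simplified linear histories, the outcome of "always accept mine" for the scenario above can be sketched like this (plain Python lists of commit ids stand in for the pygit2 branch work GitFS really does):

```python
def accept_mine(local, remote):
    """'Always accept mine' over simplified linear histories: lists of
    commit ids, oldest first.  A sketch of the idea only; real GitFS
    operates on pygit2 branches, not Python lists."""
    # walk both histories until they diverge to find the last common commit
    common = 0
    for ours, theirs in zip(local, remote):
        if ours != theirs:
            break
        common += 1
    local_only = local[common:]  # the commits only the file system made
    # replay our commits on top of everything the remote already has;
    # this becomes the new local branch
    return remote + local_only
```

With local history [1, 2, 3, 7, 8] and remote history [1, 2, 3, 4, 5, 6], the last common commit is 3, and the result is [1, 2, 3, 4, 5, 6, 7, 8]: the local changes always win, but nothing from the remote is lost.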
In order to always accept the local changes, what we need to do is take those commits 7 and 8 and push them after commits 4, 5 and 6. First, we copy the local and remote branches into a merging-local and a merging-remote branch. We can then easily find that commit 3 is the last common commit, and after that, that 7 and 8 are the local commits that need to be appended to merging-remote. Then we just append 7 and 8 to merging-remote and rename the merging-remote branch to the local branch. And that's how we solve conflicts.

Now we have a pretty stable file system. We have a basic push mechanism, and we solved conflicts; now let's see how it behaves in the real world. For that we need a really big repository, and we chose WordPress, which has about 70,000 commits. A simple listing of the history view took 34 minutes, and that was not fun. So, as you can imagine, after some profiling we found our bottlenecks, and the answer was to cache everything. We implemented three layers of cache.

The first layer, at the bottom, caches all the Git objects: when we mount the repository, we read all those Git objects and store them in an in-memory cache, and we invalidate that cache on each new commit. After that, we saw that the router just created a lot of new views and never reused them: each time you wanted to read a file, it would create a new view and do the same open-then-read operations. So we implemented a simple LRU cache for the views. And in the end we implemented a gitignore cache (for now we don't support submodules). We did that because each time you want to write to a file, you need to check whether the path you are writing to is matched by .gitignore or lies in a submodule. So basically, we just put all the .gitignore and .gitmodules content in memory, and we invalidate that cache on each new commit.
After we implemented all three layers of cache, we managed to do the same history listing on the WordPress repository in three seconds. From 34 minutes to three seconds: a big improvement.

Okay, now for the last part: we needed a smarter upstream synchronization mechanism. Just doing pull, push and merge is not enough because, for example, if you have a big archive and you unzip it, you don't want a thousand commits, one per file written to disk; you want one commit saying, okay, I just wrote 300 or 1,000 files. To achieve this, we added four more main components. First, we have the FUSE threads, which we don't have control over: I don't know how many FUSE threads will be spawned for the current view, the history view and the other views. We have a commit queue, which the FUSE threads and the sync worker use to communicate. The sync worker does all the syncing: the merging, the pushing and so on. And we also have the fetch worker, which just fetches from the remote at a certain interval.

The fetch worker has a special mode called idle mode. For example, if there is no activity on your file system for longer than a timeout, let's say a day, then it enters idle mode, and in idle mode the time between fetches is increased. So if there is no activity on your repository or file system for more than one day, it will fetch only once per week, or once per month, and so on. We do that to save some resources.
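A rough sketch of that division of labour, with made-up numbers and names (real GitFS builds this on pygit2, its own worker classes, and a write lock):

```python
import queue
import threading

commit_queue = queue.Queue()  # the FUSE write threads put commit jobs here

def sync_worker(push_batch, stop):
    """Consume commit jobs, batching everything that piled up into a
    single commit.  push_batch stands in for the commit/merge/push work
    GitFS actually performs while writes are locked out."""
    while not stop.is_set():
        try:
            job = commit_queue.get(timeout=0.05)
        except queue.Empty:
            continue
        batch = [job]
        while True:  # drain whatever else accumulated in the meantime
            try:
                batch.append(commit_queue.get_nowait())
            except queue.Empty:
                break
        push_batch(batch)  # one commit (and one push) for the whole batch

def fetch_interval(idle_for, base=30.0, idle_after=86400.0):
    """Idle mode: after a day without activity, fetch far less often
    to save resources (the exact factors here are invented)."""
    return base if idle_for < idle_after else base * 100
```

The queue decouples the FUSE threads, which we don't control, from the single sync worker, which is the only thing allowed to touch the upstream.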
Okay. Now, when our FUSE threads are done writing some files, commit jobs are put on the commit queue, and those jobs are consumed by the sync worker. The sync worker batches those jobs and creates only one commit, and as soon as the commit is created it wants to push it upstream. In order to do that, we first need to merge those commits; in order to merge, we need a clean staging area; and to get a clean staging area, we have to lock all the writes and wait until all the writes from the FUSE threads are done. We notify the FUSE threads: okay, we need to merge and push, so please don't do any write operations. And also the fetch worker: okay, please stop, I'm going to sync the changes. After the sync process is done and all the changes are upstream, the sync worker notifies the FUSE threads and the fetch worker that it's okay to resume their work. There is concurrency everywhere.

Now for the final remarks, I'll let Manu say some closing words. If you want to use GitFS, you can simply install it. We have created an Ubuntu package, and some folks from the community also created packages for Fedora and Arch. There is also one for OS X, so if you are a Mac user, you can use GitFS.

Okay, and now we want to leave you with some takeaways that we hope will be beneficial for you. First and foremost: you can actually create a file system in Python and use it. As you can see, we did it. We have been using it for almost a year now, and we had no problems related to the technologies we used. Lots of folks said, okay, you should write it in C or something faster, but we did it in Python and, as you saw, it works great. Writing a FUSE file system is pretty straightforward at first: you have to implement some operations. But getting the data model, and the operations associated with that model, right can sometimes be tricky, and we had some problems with concurrency. This is the actual model that Vlad spoke about; as you can imagine, it was not the first one that we came up with, and we had lots
of problems and did a lot of refinements to get here. So this is a word of caution: if sometime in the future you plan to write a file system, think really hard about the model first.

And last but not least: we enjoy working with shiny new tools and programming languages, after all this is a conference about a programming language, but sometimes it's good not to forget that our main purpose is making people's lives easier. We should sometimes focus on creating tools that allow non-technical people to get access to powerful systems, so that someone who is not technical can use Git, for example. That is, in our opinion, something pretty awesome. You can find the project here, and we are expecting you: if you think this project is interesting, we are eager to get more contributors and, as we said, to grow it further. It has a lot of use cases that are not yet implemented but could be. Now, if you have any questions or doubts, please ask.

Q: Hello. Can you explain how this helped you solve the first problem that you described? How is it put to use in the real world?
A: I will answer. Basically, our clients use SFTP, and they are pretty familiar with it, so we just mounted this file system under the SFTP server. They can keep using SFTP, but in the background they are using the Git file system, and their developers can now use Git, because usually the developers know how to use Git.
Q: So in case there is, say, a JavaScript error, do you go back to a previous version, or what do you do?
A: Yeah, you can do that, but it is not automatic: you need to go and do a copy from history, from the last good checkpoint of the repository, the last commit, and copy the entire directory over.

Q: I have a question. Is there a way to limit the number of revisions in the history of GitFS, so that if you are doing a lot of updates, your storage stays within certain limits and doesn't grow for too long?
A: Right now, no, but you can do some tricks here. For example, you can increase the sync timeout a little: that sync timeout is related to how often we commit, so you can batch an entire hour of changes into only one commit and limit it that way. But you don't have a hard limit saying, okay, you can keep only so many revisions.

Q: Hello, thanks, it feels like a pretty neat tool. I have two short questions. First one: you showed the Git rebase thingy, the slide with the 4, 5, 6 and the 7, 8 commits, the merging part. My question is whether you could reuse some parts of Git for that, or whether you had to implement it from scratch.
A: We don't use any of the Git command line tools, so basically we did it by hand: we commit, we merge. For example, when we append the 7 and 8 commits, we need to merge each commit manually.
Q: Second short one: do you profit from the natural tree structure of the file system itself?
A: Not that much, right now.

Q: Thanks for your talk. You were saying you are caching a lot of stuff from the Git repository, so I was wondering: does your memory consumption go up when your Git repository gets really big?
A: Yeah, but it's not linear. For example, the WordPress repository took only 200, maybe 300, megabytes, for a very, very big repository. Usually in production we have only around 60 megabytes per repository, so for us it's pretty low. It can get a bit higher, although libgit2 is really efficient in that respect, and you can tweak the caches a little: for the views cache, for example, you can say it has to stop at a certain memory size.
Q: How do you do a specific revert? How do you find it?
A: For now, you don't; it's pretty hard to model that in a file system. For example, you would need a special file, a meta-file with metadata, and when you open or write to that file, you could say, okay, please revert to that commit. For now you can only do it manually, by going to the commit in history and copying back all the files, or just the file you are interested in. But that would be a pretty cool feature.

Q: Do you have a problem with big binary data, like images, maybe?
A: Yeah, we don't actually support that well, and we have a limit on how much you can write; this is all tweakable from the mount options.
Q: And a second question: do you know about the trick for big files, where big files are moved out of the Git system to another file system, keeping a link? Is GitFS fast enough to keep those links?
A: For now, no, and I don't think we are going to implement that. Whether it's a good idea is something we could debate, because it has a lot of implications.
Q: Okay, thank you.

Q: Instead of using the repository as backing for the file system, is it possible to use the file system as a view on an already existing repository? That would give you a nice way of using a standard file browser and tools to just look through the history of a Git repository you already have.
A: For now, no, because what it does is clone that repository. But that's a nice idea. It doesn't work this way for now: I know that you cannot push to the local repository, it needs to be a bare repository to do the actual push, but maybe we can change it a little so it can be just a view of your repository.

No more questions? Okay, well, thank you. Thank you.