Hello. Are we live? Yes, thank you. So thanks for coming. I know it's the last talk of the session, so I'm kind of closing the show. And I'm sure it will be a super hit on the internet anyway, with people just waiting for it to be uploaded. So it'll be cool.
I'm here to talk to you about OpenStack Swift as a backend for Git. That's a use case we had at eNovance, where I work, and I thought it would be pretty cool to share it with you guys and show how we did it.
First, I'm sure you're dying to know who I am. I can tell you one thing: I'm not that guy. That guy is the actual author of the implementation. He was supposed to come over, but unfortunately he was unable to make it here. So I took over, because I thought we still needed to share this and see how it goes. We both work for eNovance. It's a managed services company, and it does pretty cool stuff like this. As for myself, as you can see from my t-shirt, I am a developer, a senior developer. And I've been an OpenStack Swift core developer really since the beginning of OpenStack, or even before OpenStack.
So the first question: what is Swift? I wasn't going to do "what is Git", because I'm assuming that everybody knows Git. It's pretty popular. Is there anyone in the room who doesn't know Swift? OK. Glad you came to OpenStack. Cool, thank you. So I'm just going to highlight a few things in case the people on the internet don't know. It's an object store. It's not a file system. It's based on eventual consistency. Everything you access goes through REST APIs. And it's very scalable: it can scale as much as you want, and it's been designed for that. It's what people have coined software-defined storage: it's the software that manages your storage, and it's not the hardware anymore that does the thing.
All the intelligence, the scaling, and the error checking, it's all happening in Swift. So it's a cool thing, a cool product. Git is cool. So we tried to make them play together.
The question is: aside from being cool, why would you want to use Git with Swift? That's a fair question. Is there really a need, before developing this? The first answer is: why not? It's cool. But when you dig into it, you see some problems with the normal Git workflow, where the repository sits on a file system, or maybe on an NFS share or whatever.
So what's the issue with it? You have to back up your repos regularly. A repo is a pretty valuable thing, and you want to back it up. People can argue that Git is distributed by default, so everybody has all the commit history. But in some cases you want that endpoint to be always available and to be able to restore it, because in the model we use there is a master repo at the end. Then you have to extend the master repo: when it starts to grow very, very large and you end up on too small a partition, you need to resize that partition. So it's a manual, operational thing to do. Most of the time you have to manage it yourself. And it's not available everywhere, because it's another piece in your infrastructure. In an OpenStack environment, you usually already have your OpenStack Swift, so why can't you use it? And it's not fun. We have all these new things going on with the cloud and all the services, blah, blah. You have your Swift, and you think it would be cool to use it for that.
So what are the benefits? That's a fair question. The benefits of using Swift are really the opposite of the disadvantages of not using it.
You can be sure that your data is going to be safe and stored properly. You can easily extend your storage. And Swift has been designed to have no single point of failure, so you can count on it being available. And in an OpenStack environment, as I said before, it's already there. You don't have to spin up a file server the old-school way. You can just use whatever Swift you have. Or you can even use a public Swift provider, like HP Cloud or Rackspace Cloud. So you can have backups everywhere if you want. Or even GitHub.
So now let's go deeper into the technical side and how everything works. So, how Git works internally. It's not really obvious for everyone, and it looks complicated, so I put up a nice diagram. I'm hoping you can see it. Yeah, you can see, cool.
So what's the workflow of Git? A fetch is based on git fetch-pack. When you fetch new changes into your Git repo, what happens? You have git fetch-pack, which runs on the client. It connects and starts git upload-pack on the server side. git upload-pack gets the references from the backend and sends those known references to git fetch-pack on the other side, so the client knows the state of the server's references at that time. The client then sends back the references it wants, along with the ones it already has, compared to what it was just sent. After that, git upload-pack on the server builds a custom pack: it reads the objects from the backend, with all the blobs, the trees, the tags, and the commits, and sends the custom pack back to the client, which unpacks it locally so that it's available in the client's Git repo. So that's the basics of how git fetch works.
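To make the fetch negotiation a bit more concrete, here is a toy sketch in plain Python. It is not Git's or Dulwich's actual code: the commit graph and the names `PARENTS`, `ancestors`, and `objects_to_send` are hypothetical, and it only tracks commits, whereas a real pack also carries trees, blobs, and tags. It only illustrates the idea that the server walks back from what the client wants and stops at what the client already has.

```python
# Toy model of fetch negotiation; the commit names and graph are made up.
# Each commit maps to the list of its parent commits.
PARENTS = {
    "c1": [],
    "c2": ["c1"],
    "c3": ["c2"],
    "c4": ["c3"],
}


def ancestors(tips, parents):
    """Return every commit reachable from the given tips (tips included)."""
    seen = set()
    stack = list(tips)
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue
        seen.add(commit)
        stack.extend(parents[commit])
    return seen


def objects_to_send(wants, haves, parents):
    """Commits the server must pack: reachable from wants, not from haves."""
    return ancestors(wants, parents) - ancestors(haves, parents)


# The server advertises its refs; the client compares them with what it has.
server_refs = {"refs/heads/master": "c4"}
client_has = ["c2"]
missing = objects_to_send(server_refs.values(), client_has, PARENTS)
print(sorted(missing))  # ['c3', 'c4']
```

In this sketch the client already holds `c1` and `c2`, so only `c3` and `c4` would go into the custom pack.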
So that was the fetch. On the push, the client runs git send-pack. It connects and starts git receive-pack on the server, the same way we saw for the fetch. Same thing as before: the server reads the references from the backend, and the client learns the known references. Then the client tells the server what the action is for each reference, whether it's a create, an update, or a delete, and sends a custom pack. git receive-pack verifies it against the backend and updates the references. So really, the push is almost exactly the same as the fetch, but in the other direction: the client computes the difference between the two sides and sends the missing objects over. That's proven pretty scalable for Git, and pretty efficient in the way it sends things over.
So now we want to know: how did we plug that into Swift? The first thing is that we didn't want to take the obvious way, which was to modify the Git binary. We didn't want to go inside the Git binary and start adding a new backend that would connect to object stores, with abstractions and things. Why is that? Mostly because you don't want to fork, first of all. That's the basics of open source. You could build some kind of plugin system, but that would take time, you would need to handle different architectures and different types of servers, and it would not be available right away. And it would not be available for anything that isn't the Git binary, like the other implementations of Git. So we couldn't really go that way.
What we did instead is use a pure Python Git library called Dulwich. I don't know if anyone has heard of Dulwich? OK, cool, someone. So I have a full slide explaining what Dulwich does. Dulwich is a pure Python library that implements the Git protocol.
And it lets you literally implement a Git server the way you want. So the way it works: it creates and manages the loose objects. The loose objects are the blobs, trees, commits, and tags, the things that live in a Git repo. It manages the pack files and the pack indexes. The pack files are the packed form of those objects, which is also what gets sent between the client and the server. And it implements the Git smart protocol, which is git upload-pack and git receive-pack. The cool thing about Dulwich is that it provides interfaces for storage backends. So you could say the storage backend is pluggable, and you decide where your blobs end up. That was a pretty cool thing, and that's what we went for.
So how does it all fit together? You can see that? Cool. In an environment like this, you end up with a Dulwich proxy. That Dulwich proxy sits between the Git client and the Swift cluster. So really, you do your git push or git fetch against that URL, and the Dulwich proxy takes care of uploading to the Swift cluster, of the deletes, and of all the blob handling. So it's pretty transparent. We didn't need a custom middleware inside Swift. We didn't need a modified version of the Git binaries or anything else. We just need this one piece of software, which can sit on the Swift boxes and does the translation for you.
As I said before, we did a backend implementation called SwiftRepo. Let's see how it works. We didn't want to use the standard Git way of storing things, like one file per reference, which is what Git usually does. We didn't want that because reading and writing a lot of small files would make a lot of HTTP requests when you start to send a bunch of blobs.
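The cost of many small objects, and the pack-plus-index layout that the talk turns to next, can be sketched with a toy in-memory store. Everything here is an assumption made for illustration: `FakeSwift`, the object names, and the JSON index are hypothetical and are not the SwiftRepo format or the Swift client API. Real Git packs are also compressed and delta-encoded, and the keys below are a plain SHA-1 of the content rather than real Git object IDs.

```python
import hashlib
import json


class FakeSwift:
    """In-memory stand-in for an object store that counts requests."""

    def __init__(self):
        self.objects = {}
        self.requests = 0

    def put(self, name, data):
        self.requests += 1
        self.objects[name] = data

    def get(self, name, byte_range=None):
        """byte_range is (first, last), inclusive, like an HTTP Range header."""
        self.requests += 1
        data = self.objects[name]
        if byte_range is None:
            return data
        first, last = byte_range
        return data[first:last + 1]


def store_loose(swift, blobs):
    """One PUT per item."""
    for blob in blobs:
        swift.put("objects/" + hashlib.sha1(blob).hexdigest(), blob)


def store_packed(swift, blobs):
    """Two PUTs in total: one pack, one index of (offset, length) per item."""
    index, pack = {}, b""
    for blob in blobs:
        index[hashlib.sha1(blob).hexdigest()] = (len(pack), len(blob))
        pack += blob
    swift.put("pack/pack-0001.pack", pack)
    swift.put("pack/pack-0001.idx", json.dumps(index).encode())


def read_packed(swift, key):
    """Read one item: fetch the index, then only the bytes that are needed."""
    index = json.loads(swift.get("pack/pack-0001.idx"))
    offset, length = index[key]
    return swift.get("pack/pack-0001.pack", (offset, offset + length - 1))


blobs = [b"first file", b"second file", b"third file"]
loose, packed = FakeSwift(), FakeSwift()
store_loose(loose, blobs)
store_packed(packed, blobs)
print(loose.requests, packed.requests)  # 3 2
```

With three items the difference is small, but the loose layout grows by one request per item while the packed layout stays at two uploads, and a single item can still be read back with one ranged read once the index is known.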
So what it does is use the pack format: it packs those objects, with an index alongside, and it stores those indexes and those packs straight in the object server. In addition to the index, which handles the mapping between the objects and the packs, meaning where they belong and where they're stored, we added an info object. That info object, stored in Swift, tells you where all the information about the packs is stored. And it uses the range request feature of Swift to transfer the objects and the packs concurrently: it sends Range headers so that the packs can be handled more efficiently. So that was the way we started.
We implemented that. And first things first, what you want to know is: does it really work? Is it efficient? That was the first question. So we started to benchmark it. We had different test scenarios. We had GitHub, with a Git client on the eNovance network, our company network, going directly to GitHub. That's the standard setup that most developers use. And we had Dulwich Swift, which is our implementation. That also goes from the eNovance network, to a Rackspace compute gateway that sits in front of Rackspace Cloud Files. We used Rackspace UK Cloud, because I used to work there as well. But yeah, that was just one of the tests that we ran. And we used three different sizes of repo: Swift Sync, eDeploy, and Swift. So a small one, a medium one, and Swift, which is a pretty big open source project.
All right, so how did it work? We had the full repo push. When you look at it, take the small one, Swift Sync.
I think the slide is not really well aligned. The first one is the one that uses GitHub, and after it you get Dulwich Swift. You see that for Swift Sync, which is the small repo, there's not that much difference. Same goes for the medium repo, eDeploy: it's almost the same. You have a bit of overhead, but it's not that much. But if you look at a really large repo, then the Dulwich version takes more time. It takes way more time. Why is that? It's because on the backend, on the Swift storage, you can have rate limiting. You can end up hitting the rate limit of your Swift, so you're not able to send as much as you'd like on that first push. That was the main issue. The developer who did this, Fabien, as I was saying, told me that he had a few ideas for increasing the pack size and other optimizations that were possible, and that it was his plan to improve this for really, really large repositories. So there's still some work to do on that.
When you look at the clone, when you clone the project you just pushed, it's a bit the same as before, except that for large projects it's even slower. So that optimization will need to be done. For small and medium repos, there is no problem at all; you can put them in your Swift easily. But for really large ones, there is more work and optimization to do.
So that's mostly what we did. It's out there. We implemented it in Dulwich. There is a large pull request on Dulwich's GitHub that implements it. It hasn't been merged, because it's a very, very large patch, and the author of the project wanted it done in a different way. But it's still available to use, and hopefully it should be merged sometime soon. So that's one of the things.
I'm almost done, but I have another thing to talk about. About my slides, a lot of people have been asking about eventual consistency. How do you deal with eventual consistency? Yeah, I see the ZeroVM guy nodding. So eventual consistency could be a problem. But it's not really a problem from the client's point of view, because the client is going to retry. If it doesn't get the right reference, it gets an older reference, and then the client does a pull and gets the right reference the next time. So Git clients don't need to have the exact reference at the exact time. It's just going to catch up later and then it will be fine. Because the workflow doesn't need strong consistency and can work with eventual consistency, it's not much of a problem. Please.
[Audience question, partly inaudible: what if you really have a split brain, with two clients pushing at the same time before the replicator runs?]
So the question is what happens if there is a split brain, when you have two clients working at the same time. In that case, the last one wins. And Git itself takes care of the difference. I don't know if you've seen this: when you push to Git and your reference is based on an older version than what the server has, the push fails on the client. That's what will happen in that case: only the last one wins, and the other developer needs to merge his Git repo by hand. That's mostly what happens as well with the Git smart protocol.
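The "push fails when your reference is older" behavior can be sketched as a fast-forward check. This is a simplified illustration, not Git's or Dulwich's code: the helper names `is_ancestor` and `try_push` and the tiny commit graph are hypothetical, and the sketch assumes the server reads an up-to-date reference, which an eventually consistent store does not always guarantee.

```python
def is_ancestor(old, new, parents):
    """True if `old` is reachable from `new` by following parent links."""
    stack, seen = [new], set()
    while stack:
        commit = stack.pop()
        if commit == old:
            return True
        if commit in seen:
            continue
        seen.add(commit)
        stack.extend(parents.get(commit, []))
    return False


def try_push(refs, ref, new, parents):
    """Accept the update only if it fast-forwards the current reference."""
    current = refs.get(ref)
    if current is not None and not is_ancestor(current, new, parents):
        return "rejected (non-fast-forward)"
    refs[ref] = new
    return "ok"


# Two developers both start from commit "a".
parents = {"a": [], "b": ["a"], "c": ["a"], "m": ["b", "c"]}
refs = {"refs/heads/master": "a"}

first = try_push(refs, "refs/heads/master", "b", parents)   # first developer
second = try_push(refs, "refs/heads/master", "c", parents)  # second developer
merged = try_push(refs, "refs/heads/master", "m", parents)  # after a merge
print(first, "|", second, "|", merged)
```

Here the second push is refused because `c` does not contain `b`; once that developer merges locally and pushes `m`, the update goes through.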
I mean, it's the same thing: you can have a concurrency problem when two people push at the same time, and then Git itself leaves it to the developers to do the merge and resolve that conflict themselves. So it's a bit of a workaround.
[Audience question, inaudible.]
Well, if you push different branches, then you have different objects. You could maybe have a problem when you push to the same branch, but it's going to be another commit ID anyway, so there's not going to be a conflict. It's going to be another object hash in that case. It's always unique, so you can't have a problem with it. The only concurrency problem that can happen would be on the same branch, on the pack file. But the objects in Swift are always independent of each other, so you cannot have a problem there, I think.
I don't think it's a pack per branch. It's a full pack. I don't know exactly how it works on the branch side. I don't have the information, but I guess it's aware of which pack goes with the branch. Yeah, I think it's a global one, I guess. That's what I don't know. But if the pack is not updated at the same time, then the next push will update it.
He did try to look into it. We actually had questions about it. I mean, that's what he told me: he had plenty of questions about eventual consistency, he tried to figure out a way to break the workflow, and he told me there was none, because the Git workflow relies on unique IDs. Even for tags, the update of a tag is another ID, another SHA ID, so you wouldn't be able to break it. The only thing is that you're going to have a last-version-wins kind of behavior on upload. Yeah, GitHub would kind of have the same problem as well.
So that's it. I think I'm mostly out of time.
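The point about object hashes always being unique comes from Git naming each object after a hash of its content. A minimal sketch for blobs, using only the Python standard library; the function name `git_blob_id` and the dictionary standing in for the store are hypothetical, while the header format (`blob`, a space, the size, a NUL byte) is the one Git uses for blob objects.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Compute the ID Git gives a blob: SHA-1 over a small header + content."""
    header = b"blob " + str(len(content)).encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# Two clients writing into the same (toy) store, keyed by object ID.
store = {}
for content in (b"version from client A\n", b"version from client B\n"):
    store[git_blob_id(content)] = content

# Different content gives different keys, so neither write replaces the other.
print(len(store))

# The empty blob has a well-known ID in every Git repository.
print(git_blob_id(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Because the key is derived from the content, two writers can only collide on a key when they are storing identical bytes, in which case overwriting is harmless. The references that point at those objects are the part that still needs the last-writer-wins handling described in the talk.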
If you have any questions, feel free to go to the mic so that the people on the internet can hear them. And yeah, that's it, I guess. I hope you enjoyed it.