Hi, hello. My name is Alvaro Leyva. I'm a production engineer at Facebook and Instagram, and I'm the only thing standing between you and lunch, so let's get this over with. Basically, my talk is about how we saw Lennart's talk last year about casync, saw that it was a really cool project, and started figuring out what we could do with it. We had this problem, so we tried to solve it with casync. I will start with why we wanted to experiment with this, what our strategy was, and then our results. OK, cool. So, by a show of hands: the system that you work on, can anybody raise their hand if that system is deployed once a week? OK, cool. Once a day? Cool. OK. Twice a day? Ten times a day? Twenty? Yeah. The reason I ask is that at Instagram, for the past two years, we have deployed more than 50 times a day. What we try to do is deploy every single commit that our developers land on master. We push it directly into production, leave it in production long enough to get a signal on whether that commit breaks anything, and then move on to the next one. This works really well for us, because it allows us to find things that will break, or security vulnerabilities and things like that, really quickly. The way this works is simple: a developer commits their code, we pack it with our own internal tooling, we run tests on it, and then we send it into production. This works really well because it allows us to ship small changes. After a developer lands their code, it's about an hour until it's in production, so they are still around: if they break production, they can help us fix it, and it's really easy to roll back to the previous state. So how do we do this? You have to imagine that this is version A, and this is a rough representation of our source tree. For those who don't know, Instagram is mainly a Django shop, so that's Python.
Most of these are just plain text files. So we basically have this package that is a representation of the tree: we strip things we don't want, we compile the few things that are C, we convert Python to bytecode, and then we package it all into our own format. Then a developer comes along and makes a commit, a small change. We run the same process and end up with a package B that is really similar to A but has all of its own components. Then we also have C, which probably changes different things again. You can see that if we do this many times a day, the sum of A, B, and C gets really big. So this looked like a really interesting problem to solve with casync. I will explain a little bit of how we view casync; for reasons of brevity, and because of what the abstraction of our problem is, this is maybe oversimplified, but OK. The way it works is that we take this version A, and casync divides it into little chunks of data. The magic of casync is that these chunks are variable length: this piece over here can weigh 10k, while that one can weigh 20k, and so on. So we take those packages, and casync outputs two things: the chunks, which are these files, and an index file. The index file is basically a recipe for taking these chunks and reconstructing the original directory. That is the serialization. The opposite process is that you take your index file, grab whatever chunks it says you need, assemble them in the right order, and you have your package back. The cool thing about this is that if we now have a package B that is a small change on top of A, the serialization results for A and B will be really similar. Maybe B will have one or two extra chunks and will yield a different index file. So now we don't have to think in terms of versions.
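To make the chunk-and-index idea concrete, here is a toy sketch in Python. It is not casync's actual algorithm (casync uses a more sophisticated rolling hash and much larger chunks); the chunker, window size, and boundary mask here are invented purely for illustration.

```python
import hashlib

# Toy content-defined chunker. casync's real chunker is more sophisticated;
# the window size and boundary mask here are invented for the demo.
WINDOW = 16    # minimum chunk size
MASK = 0x3F    # cut when the rolling sum hits a multiple of 64

def chunk(data: bytes):
    """Split data into variable-length chunks at content-defined boundaries."""
    chunks, start, rolling = [], 0, 0
    for i, b in enumerate(data):
        rolling = (rolling + b) % 65536
        if i >= start + WINDOW and (rolling & MASK) == 0:
            chunks.append(data[start:i])
            start, rolling = i, 0
    chunks.append(data[start:])
    return chunks

def serialize(data: bytes, store: dict):
    """Chunk the data into a content-addressed store; return the 'index',
    an ordered list of chunk hashes (the recipe)."""
    index = []
    for c in chunk(data):
        h = hashlib.sha256(c).hexdigest()
        store[h] = c  # identical chunks across versions are stored only once
        index.append(h)
    return index

def assemble(index, store):
    """The opposite process: follow the recipe to rebuild the package."""
    return b"".join(store[h] for h in index)
```

Because the boundaries depend only on the content, two versions that differ by a small change produce indexes sharing most of their chunk hashes, so the store grows only by the delta.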
We just have all these chunks stored in a single location, and what we distribute is the index file. The index file is what becomes our version. That is basically how we use casync; it's really simple. The way we worked on this is that we got an intern who was really good at his job, and we asked him to come up with the abstractions and so on. First of all, we wanted to create an abstract definition of a package that would capture this idea of having chunk stores and an index, but would not be tied to just file systems. Maybe instead of syncing the index file as a file in a directory, we want to sync it as a database record, because this lends itself to being a key-value thing. So the first thing we did was change the idea of an index into the idea of a manifest. The reason is really simple: what casync gives you as an index file is basically a recipe to reconstruct your package, but it doesn't give you any information about how the package was constructed. Who constructed it, where it was constructed, which compilers and compiler versions it used, which hash of the Mercurial repository, and which stores the chunks went into. This is really important for us: the information about the package is almost as important as the package itself. So we created this tool called CA Package as an abstraction over casync. The way you get the index file looks something like this: you give CA Package a stage command, and then you give it a URI. The URI in this case is an SQL query, so we can store all of these things in SQL, and this becomes our manifest. The data here is an encoded version of what you would get in the index file, and then you can see that we add all the other information we care about. Things like, for instance, the package name: at Instagram, of course, we deploy our Instagram package, but what if we want to deploy virtual environments, or other binaries like that?
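As a sketch of the manifest idea: the talk doesn't show CA Package's real schema, so every field name below is hypothetical. The point is that the casync-style index becomes just one field in a record that also carries provenance, and that the record is small enough to live in a SQL row rather than as a file on disk.

```python
import base64, json, time

def make_manifest(name, version, index_bytes, *, vcs_rev, stores, builder):
    """Wrap a casync-style index in a manifest that also records how the
    package was built. All field names are illustrative, not CA Package's."""
    return {
        "name": name,                 # e.g. the Instagram app, or a virtualenv
        "version": version,           # the same package gets built many times
        "built_at": time.time(),      # when it was constructed
        "built_by": builder,          # who/where it was constructed
        "vcs_revision": vcs_rev,      # Mercurial hash the build came from
        "stores": stores,             # where the chunks actually live
        "index": base64.b64encode(index_bytes).decode(),  # the recipe itself
    }

# Hypothetical values for illustration only.
manifest = make_manifest(
    "instagram", 150, b"\x00fake-index-data",
    vcs_rev="abc123", stores=["http://chunks.internal/store"], builder="build-host-1",
)
record = json.dumps(manifest)  # a key-value record, not a file in a directory
```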
Not just file trees, but plain binaries as well. We have versions: we build the same package multiple times, so it's really useful to have versions. All those things. Finally, we have store adapters: if we want to save or retrieve chunks from an HTTP server, we can; if we want to store them on local disk, we can; we can also do torrents. So that is basically how we did it. Now let's look at a few experiments we ran to see the results, and then we'll be done. The first thing we worked on was creation; that's why I put a little cookie up there, because the question is, how do we create the package? For the first experiment, we took a hundred versions of Instagram and built them in our regular format, and then built them with CA Package, which again is basically an abstraction over casync. The first thing we found is that we save about 90% of the space. This is kind of obvious, because this is the whole idea of casync: in our regular model, each version is a full package containing everything, while with casync you are only adding the new chunks, whatever the delta is. In terms of resources and time, creating the packages took about the same, and that makes sense, because we use the same technologies: we serialize the same way and compress using the same libraries. So the big win was in space. You can see that if you deploy more than 50 times a day, by the end of the day you are saving a lot of network bandwidth because you are not shipping all the components, you are saving space, and eventually reconstruction will be faster. The second part of this experiment was to actually take these packages and put them into production the same way we deploy our normal system. So, in parallel, we had a handful of machines where, when a commit landed on master, we created this package and shipped it into production.
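The store-adapter idea mentioned above can be sketched as a small interface. The class and method names here are invented; the real adapters (HTTP, local disk, torrent) would simply be more implementations of the same two operations.

```python
import os, tempfile
from abc import ABC, abstractmethod

class ChunkStore(ABC):
    """Hypothetical adapter interface: anything that can save and
    retrieve a chunk by its content hash."""
    @abstractmethod
    def put(self, chunk_id: str, data: bytes): ...
    @abstractmethod
    def get(self, chunk_id: str) -> bytes: ...

class MemoryStore(ChunkStore):
    """In-memory store, handy for tests."""
    def __init__(self):
        self._chunks = {}
    def put(self, chunk_id, data):
        self._chunks[chunk_id] = data
    def get(self, chunk_id):
        return self._chunks[chunk_id]

class DiskStore(ChunkStore):
    """Local-disk store: one file per chunk. An HTTP or torrent adapter
    would implement the same interface over the network."""
    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)
    def put(self, chunk_id, data):
        with open(os.path.join(self.root, chunk_id), "wb") as f:
            f.write(data)
    def get(self, chunk_id):
        with open(os.path.join(self.root, chunk_id), "rb") as f:
            return f.read()

disk = DiskStore(tempfile.mkdtemp())
disk.put("deadbeef", b"some chunk bytes")
```

Code that stages or assembles a package only ever talks to the interface, so swapping HTTP for torrents doesn't touch the packaging logic.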
We measured the total bandwidth, the time it took to stage, and the resource usage. The total download was again about 90% smaller, which makes sense: we are only downloading the new chunks. Moving from version A to B is really pain-free, but the cool thing is that moving from A to C without going through B is also really pain-free. The stage time was faster, and the resource usage was basically the same, again because we used the same technologies. So that concludes what we did. What's next? We want to try binary-heavy distributions. As I said, a Python application is mostly text files, but we want to try this with virtual environments, which contain a lot of binaries. We want to stop shipping chunks over HTTP and start using torrents, because if you have a big infrastructure, you can leverage the fact that most of your machines already have the chunks. We want to embed casync: right now we just shell out and execute it, and we would really like to start using casync as a library instead of shelling out to it. And finally, we also want to try other tooling besides casync, because the idea is really cool and we would like to stress it against other things in the market. So basically, that's it. I'm finished, yeah, don't worry. Questions, there will be questions. And also, if we run out of time, I'm going to be around, and I have stickers if people want them. So the question is, on the last point you raised, that you want to stop shelling out: do you want to turn casync into a library, or do you want to re-implement the code? So casync, and again I don't want to overstep my boundaries here, but casync is written in a way that resembles a library a lot. It's just that it's on version two, so I don't know whether the API is going to be stable or not. Basically, what we want is to take that same thing and put it behind a Python binding. Go ahead. Does it work? Yeah.
So my intention was always that it was supposed to be a library, and that's why it's written in a library style, but I haven't gotten around to making it a library yet. My other question was actually just that: what's the size of the images? The things that we produce? Yeah, the stuff that you actually store there; what's the average size? Oh, you mean the full size? OK, I don't know if I can give the exact size of my package. Just a rough figure? OK, so, I don't know, I would say 50 megabytes to 200 megabytes, depending on the version. These are mostly text files, so even if the package is really big, it compresses down a lot. Thanks. So you create a lot of packages per day. What do you do with all the chunks? After a while, do you garbage collect them, or do you keep everything forever? So the cool thing is that since we deploy every commit, by the time we deploy, say, commit number 150, there's no point in going back to anything before that, because we know we are in a good place, right? So we purge old chunks all the time. The way you do that is that you have a list of all the chunks that compose whatever indexes you want to keep, and then you just walk the store and delete all the chunks that don't belong to that list. Oh yeah, yeah, but then you cannot deploy at this speed with different branches. We always deploy from master, yes. I would like to ask if something like that can be achieved using git-annex or Git LFS, and if not, what are the advantages of using casync? So that's what we are going to discover in the next step, when we start trying other tooling. The thing I really like about casync is that it's general purpose. It may not be the best at this particular problem, but it's really general purpose, so we can apply the same techniques to binary distributions instead of just text.
Or maybe we want to ship an entire file system with it, and it works really well, while Git and those tools tend to be more focused on a single problem. But I cannot say that for sure, because we haven't tried it yet. So, one more question. Go ahead, he was trying to ask a question. So you have an incremental way to ship packages. Do you also have an incremental way to build, if there are only small changes each time? Let me see if I can explain this. Not really, because when you do A and B, you have to stop thinking of this as incremental; that's the first idea I had to get out of my mind. Think of it like this: I build A, I build B, and it happens that components of A and components of B are similar. But these could be two completely different applications; they don't have to be related. So don't think of this as going incrementally from A to B. With that in mind, you still need to serialize your whole directory and compare the chunks to figure out which ones you actually have to store. And at that point, you have already spent most of your time just doing the serialization and the chunking. I'm going to stay around, so if people want to ask me questions, you can do it after this. Thanks a lot.
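A rough sketch of the purge described in the Q&A: collect every chunk referenced by the indexes you still care about, then delete everything else from the store. Function and variable names are mine, not Instagram's tooling.

```python
def purge(store: dict, live_indexes):
    """Mark-and-sweep over the chunk store: keep only chunks referenced
    by an index we still want to be able to reconstruct."""
    live = set()
    for index in live_indexes:    # each index is an ordered list of chunk ids
        live.update(index)
    for chunk_id in list(store):  # sweep: drop everything unreferenced
        if chunk_id not in live:
            del store[chunk_id]

# e.g. keep only the chunks needed by the latest deploy's index
store = {"h1": b"a", "h2": b"b", "h3": b"c"}
purge(store, [["h1", "h3"]])
```

Since every deployed commit supersedes the ones before it, the live set is just the most recent index (or a few recent ones, to keep rollbacks possible).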