Hello everyone. Today I'm going to speak about our journey of building an object distribution system that uses a peer-to-peer networking protocol. As was mentioned just now, this is about torrents. We had a problem in Flipkart, which was basically slow deployments. I'll begin by explaining that problem, how BitTorrent came to the rescue for us, and how we built a platform that uses torrents to orchestrate downloads in our data center. Then we'll go through a demo and our tuning and adoption exercises.

When you want to distribute files from one machine to another over a network, there are tons of protocols to do that. You have the tried and tested FTP. You have everyone's favourite HTTP, which is what transfers CSS, JS, and HTML files from a server to a client. And you have BitTorrent, which actually accounts for around 5% of internet traffic in downloads and more than 30% in uploads. All these protocols exist, and they are used for different purposes.

Before we go into where we use BitTorrent, I'll talk about how we do deployments in Flipkart. We have our own data centers, so there are rows of physical servers out there. We have recently adopted LXC containers to drive containerization within the company; LXC containers are pretty similar to Docker. You generate an image when you create a build, and when you deploy a cluster, you want this image to go onto all the physical servers where the containers will spawn. So if I'm deploying a 100-node service all at the same point in time, I would have roughly 100 physical servers downloading this one image. Once the 100 containers come up, they are put behind a load balancer and your service is up.

The problem we started seeing was that when we deployed large clusters, our deployment time started to increase linearly. This was essentially because the upload bandwidth at the source where we stored these images was getting saturated. Imagine one server with finite upload bandwidth: if more and more machines request a particular file at the same point in time, that upload bandwidth gets divided among all the clients, which means the download speed goes lower and lower on every client trying to download the file.

The typical way to solve this and increase your download speed is caching. You could put an nginx cache layer in front of your source, host replicas of your files there, and configure your eviction strategy. But the thing with caching is that you have to do a lot of work. You have to worry about scaling your cache. If more and more builds are generated every day, and these builds generate images, you have to worry about eviction. You have to start worrying about whether the in-memory storage for the cache is enough; if it's not, you'll need to invest time in sharding, and once you're sharding, you'll invest time in routing.
When a request comes in for a particular file, you then have to think about which cache box it needs to go to, because you've probably implemented sharding in a way where all files do not map to the same cache box. So with caching we started seeing our infrastructure and maintenance costs increase, and we started looking for out-of-the-box solutions that would let us get rid of the cache and still speed up deployments in the data center.

So we've established the problem: when large files are being downloaded concurrently from a particular source, the upload bandwidth of that source gets congested, and I want to ensure that congestion does not happen. We started looking at BitTorrent, which I'll explain on the next slide, because BitTorrent has some unique characteristics that avoid this congestion and make downloads resilient to individual node failures.

Before going deep into BitTorrent, we quickly did a POC. What we figured out was that a 512-node cluster that used to take about an hour to set up, we could now set up within five minutes. That was a huge speed-up for us, and it validated our theory that we could deploy BitTorrent at scale in our data center and magically speed up deployments for all applications inside Flipkart without any change on the application side. For the purposes of this experiment, we capped ingress and egress at 64 Mbps on every node so that we could get repeatable results.

Before I explain how we adopted BitTorrent and designed a system around distributing images with it, let's talk a little about BitTorrent itself. I'll assume no prior knowledge, but we won't go into a lot of detail, because that by itself is a topic that could take an hour or more.

BitTorrent works with a simple premise. You have a large file, and let's say just one box has it. The goal is to distribute it effectively across the network. You divide the file into small pieces, and you create a metadata file, called a torrent file, which we all download legally or illegally. This torrent file contains the information about all the pieces and how they are ordered back together to recreate the original file. So if I want to distribute a file, I create this torrent and publish it to one of the websites out there that index torrents, where users can search for and download the torrent file.

If some other client wants to download a particular file, it finds the torrent for that file, gets a handle on the torrent file, and talks to a service called a tracker; this is very important. There are a lot of public trackers out there on the internet, and there are private trackers, like the ones we host inside Flipkart. The tracker's job is very simple: it does exactly what its name says. It tracks all the pieces for a particular torrent, so it knows who has the pieces of a particular file and who can share those pieces with you right now.
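To make that piece metadata concrete: a torrent file is essentially a fixed piece length plus a cryptographic hash for every piece. Here's a minimal sketch of how those piece hashes get computed, assuming classic BitTorrent's SHA-1; the file name and 256 KiB piece size are just example values.

```go
package main

import (
	"crypto/sha1"
	"fmt"
	"io"
	"os"
)

// hashPieces splits a file into fixed-size pieces and returns the SHA-1
// hash of each piece: essentially the "pieces" field a .torrent carries.
func hashPieces(path string, pieceLen int) ([][sha1.Size]byte, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	var hashes [][sha1.Size]byte
	buf := make([]byte, pieceLen)
	for {
		n, err := io.ReadFull(f, buf)
		if n > 0 {
			hashes = append(hashes, sha1.Sum(buf[:n])) // last piece may be short
		}
		if err == io.EOF || err == io.ErrUnexpectedEOF {
			break
		}
		if err != nil {
			return nil, err
		}
	}
	return hashes, nil
}

func main() {
	hashes, err := hashPieces("image.tar.gz", 256*1024) // 256 KiB pieces
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d pieces\n", len(hashes))
}
```

When a downloaded piece later fails to match its hash, the client throws it away and fetches it again; that is the integrity check which comes up again in the Q&A at the end.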
So if I'm a client, I go to a tracker and say, hey, I'm looking for these pieces because I want this file, and the tracker says, go connect to all these IPs, they probably have the pieces you want. Once I connect to those IPs (this is the swarm), I talk to those clients directly over the torrent protocol and start getting pieces from them. And once I have some pieces, I can start sharing them with the other clients that want them as well. I am not waiting for my download to finish before I start uploading: as and when I get pieces, I tell the tracker that I now also have this piece, so the tracker can hand my information to other clients that want it.

There are a few standard terms you should be aware of. Every machine participating in the download of a file through a torrent is called a peer. And there's a special term for a node that has all the pieces of a file: that node is called a seeder. The seeder is very important. When you're beginning a download, you need a seeder, because it has all the pieces, which means the file can be reconstructed in its original form by all the clients if they get those pieces. The seeder is the source of truth. Once I get all the pieces, my client also becomes a seeder, so there can be multiple seeders and multiple peers in a torrent network. And, as was mentioned in the previous talk, it works on a tit-for-tat principle: the more pieces you upload to other clients, the more you can download as well. This is the basic flow control the BitTorrent protocol uses.

Having established this premise, and having done the POC where we deployed a 512-node cluster within five minutes, we came up with some very simple design goals. HTTP is by nature very straightforward: you make a GET request and you get the response back. If that HTTP URL represents a file, there are standard libraries out there to make the request; you can just do a curl and simply get the file. There is no depth of complexity there. BitTorrent is not the same. With BitTorrent you have to think about the availability of the tracker. You have to think about the torrent file: you have to understand it, parse it, and reconstruct the pieces back together. And if the torrent is becoming unhealthy, you have to do something about it.

So we wanted to design a system that can run in our data centers and is dead simple for everyone to use. All you have to do is give it the HTTP URL you want to download, and the system internally downloads that HTTP resource using the torrent protocol. If I have a handle on a file hosted by a server as an HTTP URL, which earlier got bottlenecked because I was doing a standard HTTP download, I now hand that download over to our Shatabdi system (Shatabdi is the name of the system we built). It figures out whether to download the file using HTTP or using a torrent, and whether, if multiple clients are concurrently downloading the file, the download needs to be orchestrated in some manner. Dead simple to use, and speed up the download of any HTTP resource: these were the primary requirements.
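To sketch what "dead simple" looks like from the caller's side: a single HTTP call to a daemon on the same box. To be clear, the port, path, and JSON field below are placeholders I've made up for illustration, not Shatabdi's actual API.

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

func main() {
	// Hypothetical link-local daemon endpoint; the real Shatabdi API may
	// differ. The caller only supplies the HTTP URL of the resource.
	body := bytes.NewBufferString(`{"url": "http://repo.internal/images/app-v42.tar.gz"}`)
	resp, err := http.Post("http://127.0.0.1:9090/downloads", "application/json", body)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("download accepted:", resp.Status)
	// Whether the bytes arrive over plain HTTP or a torrent swarm is
	// entirely the daemon's decision; the caller never sees BitTorrent.
}
```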
At this point I would like to introduce Shatabdi. Shatabdi is the name of the system we built to solve this problem, tailor-made for fixing deployments in Flipkart. Downloads simply scale: these days we don't worry about whether I'm deploying a 100-node cluster, a 500-node cluster, or in some cases a 1000-node cluster, and whether my source is going to get choked. We do not scale the source, and we do not modify it in any way. We just hand the download operation over to Shatabdi, and it internally figures out what to do. We'll go into the internals of Shatabdi soon.

It's also resilient to node failures. What I mean by that is: if some of the seeders in the network, the nodes that have all the pieces of a particular file, were to go down, the BitTorrent protocol by design ensures that as long as there is at least one seeder in the network, your torrent can always finish, because the pieces are out there for your clients to grab. Shatabdi goes one level further. If all your seeders were to go down, Shatabdi figures out that the torrent is unhealthy and the download cannot complete. We've all seen this: you try to download a rare movie from the internet and discover you cannot finish, because not enough people are seeding that file. What Shatabdi does is this: it knows where the original file is located, that is, the source, so it makes the file available on the network again by creating a seeder at that point, so the download can finish for everyone involved.

We push all of this logic to the client. You just give the client the URL and it does the rest. And what does the client do? As I said, the client provides download guarantees: it ensures the download will finish no matter what, as long as the source is up. It does torrent discovery: once you give the Shatabdi client an HTTP URL, it figures out what the torrent for that HTTP resource is, and if there is none, it creates one. It ensures the source is not overwhelmed, which was our original problem statement, since the source getting overwhelmed was what led to slow deployments. So if the torrent isn't there and 100 clients are trying to download a file, only a few clients go to the source and download the original file, and the rest use torrents to distribute the file across the network. Those are the guarantees it gives.

We also do a lot of work to keep the different components of the BitTorrent setup highly available: the tracker, as I said, and a discovery store for torrents, and we keep the source highly available too. And we integrate deeply with our in-house alerting and metrics solutions, so we can monitor when something is going wrong and react to it, if not proactively prevent it. That is the observability part.

With this, I'll go into the component diagram and try to explain what happens under the hood. This is one peer, one of the physical servers in our data center. Every server in our data center runs a Shatabdi client, which is a Golang daemon. The Shatabdi client internally wraps a torrent client, which is used to do the actual torrent download.
You will all have used µTorrent, qBittorrent, or some other torrent client. We use an open-source torrent client, qBittorrent, and this part is plug-and-play. What happens first is that any VM, container, or process running on the physical server makes a link-local request to the Shatabdi client: I want to download this file, and this is the HTTP resource for it. The Shatabdi client then goes and talks to our indexing and consensus store, for which we use etcd, and figures out whether there is a torrent for this particular file and whether that torrent is healthy. If there is a torrent, it fetches it and hands it over to qBittorrent. At that point, qBittorrent talks to the tracker to figure out which other clients have the pieces, and downloads the pieces for that file from them. You never hit the original store at all for that resource.

If the torrent does not exist, or the torrent is unhealthy (I'll come to the unhealthy scenarios), the Shatabdi client registers its interest in that torrent in etcd. We decided we wanted to guarantee that the source is not overwhelmed, and this is where Shatabdi's magic comes into the picture. If there are 100 clients trying to download, all hundred participate in a leader election for that particular resource, and that leader election is hosted by etcd. One of the clients gets elected as the leader; that client puts its hand up and says, I will now go talk to the object store, all of you wait. It talks to the object store over the HTTP link, downloads the original file, creates a torrent for it, and publishes the torrent back to etcd. All the other clients wait for the leader to finish the actual download from the source, and once it has finished, they all get a notification. Then they simply start the torrent download by talking to the consensus store, since the torrent was already published by the leader.

As I said earlier, torrents can become unhealthy at different points in time because of unpredictable node failures, which happen all the time in a data center. So, to provide download guarantees, the clients have to figure out whether a torrent is healthy or not. A fair amount of probabilistic analysis goes on behind the scenes to gauge whether a torrent is about to become unhealthy, because download speeds are dropping drastically, or because the availability of pieces on the network is dropping drastically as peers go down. If so, the client tries to create seeders again using the same flow: it hosts a leader election, becomes the leader, and downloads the file from the source again, so that the torrent is healthy again and the other clients can proceed. That is the internals of Shatabdi.

I said the tracker is very important: the tracker is the glue that holds everything together. The tracker service is where all the clients announce what they have. One client says, I have piece one and piece two of a particular file; another says, I have piece three and piece four; and a seeder says, I have pieces one, two, three, four, all the pieces of the file.
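In stock BitTorrent, that announce is just an HTTP GET against the tracker, carrying the torrent's info-hash and the client's progress counters. A minimal sketch of the wire call, using the standard BEP 3 query parameters (the tracker URL here is made up):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
)

// announce tells the tracker what we have and asks for peers.
func announce(tracker string, infoHash, peerID [20]byte, port, left int) ([]byte, error) {
	v := url.Values{}
	v.Set("info_hash", string(infoHash[:])) // raw 20 bytes, percent-encoded by Encode()
	v.Set("peer_id", string(peerID[:]))
	v.Set("port", fmt.Sprint(port))
	v.Set("uploaded", "0")
	v.Set("downloaded", "0")
	v.Set("left", fmt.Sprint(left)) // bytes still missing; 0 means "I'm a seeder"
	v.Set("event", "started")

	resp, err := http.Get(tracker + "?" + v.Encode())
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	// The response is a bencoded dictionary carrying a re-announce
	// interval and the list of peers to connect to.
	return io.ReadAll(resp.Body)
}

func main() {
	var infoHash, peerID [20]byte // filled from the torrent file / client identity
	body, err := announce("http://tracker.internal:6969/announce", infoHash, peerID, 6881, 3<<30)
	if err != nil {
		fmt.Println("announce failed:", err)
		return
	}
	fmt.Printf("tracker replied with %d bytes of bencoded peers\n", len(body))
}
```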
So the tracker knows everything: it is what clients connect to in order to say what they have, and it is where clients go to find out which IPs have the pieces they want. We took an open-source tracker project called Chihaya, but some bits and pieces were missing before we could reliably deploy it in a production cluster at Flipkart. Chihaya keeps the announce data, the piece information for all the IPs, in an in-memory store, so if the Chihaya node goes down there is no durability. So we custom-built components on top of Chihaya. We now have an HA story where, if the Chihaya master goes down, we re-elect a new master on a different node, and we ensure Chihaya is always up and available behind a load balancer. We also take periodic snapshots of the piece information and ship them to etcd, so that a new Chihaya node coming up already knows what pieces are out there and torrents can resume at the same speed. Those are the improvements we made.

Before I go into the demo, I want to talk about another open-source project, called Dragonfly. It is by Alibaba, and it was not live when we started exploring. We had this problem of orchestrating concurrent downloads of large files, and before we started exploring BitTorrent we looked at what was out there, like we all do, but nothing was open source at that point. Recently, though, Alibaba has open-sourced a project very similar to Shatabdi, and if you like this talk, I'd urge you to go check it out. There are a few key differences in how they do things architecturally compared to Shatabdi. They do not use torrents as we know them; they have a custom peer-to-peer implementation. The catch with that is that if you are maintaining the system, you have to understand that P2P protocol before you can reliably solve issues in production. The other difference is their concept of super nodes. They do not use leader election to ensure seeders are created in the network. Instead, they have a set of nodes in the cluster that act as a mandatory cache: before any download begins on your client, one of these nodes downloads the actual file, and then the rest of the clients download from that node, which they call a super node. So they need a caching layer, which is not a requirement for Shatabdi.

These are some sequence flows. They mostly cover what I talked about in the component diagram, and we'll walk through more of them during the demo, but they are in the slides in case you want to refer to them or have questions. They go into a lot of detail about how we orchestrate downloads between the torrent path and the direct-download path, how we do leader election, how we do discovery, and how all these pieces talk to each other.
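Since so much of this orchestration hangs off the leader election, here is a minimal sketch of a per-resource election built on etcd's stock concurrency primitives. This is my own illustration of the pattern, not Shatabdi's actual code; the endpoint and key prefix are placeholders.

```go
package main

import (
	"context"
	"log"

	clientv3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/client/v3/concurrency"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{Endpoints: []string{"etcd:2379"}})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// One election per resource: every client interested in the same
	// file campaigns on the same key prefix.
	sess, err := concurrency.NewSession(cli)
	if err != nil {
		log.Fatal(err)
	}
	defer sess.Close()

	elect := concurrency.NewElection(sess, "/shatabdi/elections/app-v42.tar.gz")

	// Campaign blocks until this client becomes the leader (or ctx ends).
	if err := elect.Campaign(context.TODO(), "my-host"); err != nil {
		log.Fatal(err)
	}

	// Leader's job: fetch the file from the source, create the torrent,
	// publish it under a well-known etcd key, then resign.
	log.Println("elected leader; downloading from source and seeding")
	// ... download + publish ...
	_ = elect.Resign(context.TODO())
}
```

The clients that lose the election can simply watch the etcd key under which the leader publishes the torrent, which gives exactly the wait-then-notify behaviour described in the component walkthrough.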
This demo is about orchestrating the download of one file across five nodes, and we'll do it in two batches. First we download a file Shatabdi has never seen before on three nodes. What we expect is that one of the nodes becomes the leader and creates the torrent file while the other two wait; once the torrent file is available, those two download the file directly using the torrent. Then, when we start the second batch (deployment is usually done in batches inside Flipkart, to maintain certain availability ratios), those nodes should instantly start downloading using torrents, because the torrent is already available in the Shatabdi system. The object here is an S3-like object; it could be any file.

Right now we're setting up some settings on the daemon. As I said, Shatabdi has a daemon installed on every server, so we configure certain things: a download timeout, and the source URL, which is the actual URL we are going to download from. Now I'll enable the download on three of the nodes. Shatabdi exposes a REST API, hosted locally by the daemon, which we use to control downloads. All you can tell this API is: start a download, stop a download, and give me the status of a download. That's all you need to do; you don't need to worry about BitTorrent at all. I'm pinging the daemons on all these nodes using Ansible. These are the actual curl calls, the REST calls we make to start a particular download; we've written some wrappers around them for the purposes of the demo. We're also setting some network limits so the download doesn't finish too fast for the demo.

Here we start the download of this file on all three nodes concurrently. You can see these two are waiting in leader election, as the torrent was not found, and this third machine got elected as the leader. It started downloading directly from the object store, created a torrent file after the download was complete, and became the seeder. Once it became a seeder, the others got the notification that the torrent is available, and this is all happening in real time: they then downloaded using torrent magnets instead of downloading directly from the source. These are some of the metrics for the system, from our internal Grafana dashboards.

Next, we enable the Shatabdi daemon on the remaining two clients and start the download of this second batch on those two nodes; the download should start immediately on both. The moment I start it, it finds the torrent, because it was created by the first batch; the torrent download starts, finishes almost instantly, and the nodes begin seeding the file for everyone in the data center. These are the metrics for the two different batches you see here. We have more metrics, but in the interest of time, let's move on.

There were a lot of learnings. This was not the final state when we started building it; we arrived here after a lot of benchmarking and trial runs. We did not initially take torrent-unhealthy scenarios into account, and we actually burnt our fingers in production: we assumed that if we have four or five seeders, they would never all go down at the same point in time, but that actually happened, and we figured out we needed a way to make torrents healthy again, in the same way as the original seeding flow.
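As a toy illustration of those health heuristics: the signals below (seeder count, download rate, piece availability) are the ones mentioned in the talk, but the struct, the thresholds, and the reseed hook are invented for this sketch and are not Shatabdi's real logic.

```go
package main

import "log"

// SwarmStats is a hypothetical snapshot of what the torrent client and
// tracker can tell us about a swarm.
type SwarmStats struct {
	Seeders      int     // peers holding 100% of the pieces
	DownloadBps  float64 // our current download rate
	Availability float64 // fraction of pieces with at least one copy in the swarm
}

// unhealthy flags a torrent that is unlikely to complete: no seeders,
// a collapsing download rate, or pieces missing from the network.
// The thresholds are illustrative only.
func unhealthy(s SwarmStats) bool {
	return s.Seeders == 0 || s.DownloadBps < 64*1024 || s.Availability < 1.0
}

func main() {
	s := SwarmStats{Seeders: 0, DownloadBps: 12000, Availability: 0.7}
	if unhealthy(s) {
		// Same flow as torrent creation: campaign in etcd, and the winner
		// re-downloads the file from the source and seeds it again.
		log.Println("torrent unhealthy; triggering leader election to reseed")
	}
}
```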
I talked earlier about the network being the bottleneck, but once we moved to torrents, we started seeing the disk become the bottleneck instead. We have some really old hard disks on some boxes in the data center, and their disk speeds are abysmal. What happens with a torrent is that you're not doing sequential writes; you're doing a lot of random writes, because all these disparate pieces that constitute the file arrive out of order, so the disk spends a lot of time on random I/O and seeks. We started seeing that for 50 clients at 24 Mbps ingress and egress, it was taking about 150 seconds to complete a 3 GB file, which was a lot more than plain HTTP, and we figured there had to be a way to speed this up, otherwise our whole BitTorrent exercise was a failure.

So what we did was this: before starting a download, we mount an in-memory file system, a tmpfs volume, and we complete the entire download in memory. All the pieces being downloaded and uploaded while the torrent is in progress, while you are still assembling the file, go through memory. Once the torrent is complete and I have the entire file, I can serve it to the LXC container ecosystem so that it can spin up an LXC container out of it, and I don't need random writes anymore, because all the pieces are with me; all that remains is for this file to be uploaded to the other boxes out there. So I do a file flush, move it to the hard disk, and free up my in-memory file system.
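A minimal sketch of that staging trick: mount a tmpfs, let the torrent client write its random pieces there, then flush the finished file sequentially to disk. The paths and size are placeholders, and mounting needs root (CAP_SYS_ADMIN) on Linux.

```go
package main

import (
	"io"
	"log"
	"os"
	"syscall"
)

func main() {
	const stage = "/mnt/shatabdi-stage"
	if err := os.MkdirAll(stage, 0o755); err != nil {
		log.Fatal(err)
	}
	// In-memory file system: all the random piece writes land in RAM.
	if err := syscall.Mount("tmpfs", stage, "tmpfs", 0, "size=4g"); err != nil {
		log.Fatal(err)
	}
	defer syscall.Unmount(stage, 0)

	// ... torrent download writes its pieces into stage/image.tar.gz ...

	// Once complete, flush sequentially to the hard disk. A plain
	// os.Rename won't work across file systems, so copy and sync.
	src, err := os.Open(stage + "/image.tar.gz")
	if err != nil {
		log.Fatal(err)
	}
	defer src.Close()
	dst, err := os.Create("/var/lib/images/image.tar.gz")
	if err != nil {
		log.Fatal(err)
	}
	defer dst.Close()
	if _, err := io.Copy(dst, src); err != nil {
		log.Fatal(err)
	}
	if err := dst.Sync(); err != nil {
		log.Fatal(err)
	}
	log.Println("flushed image to disk; tmpfs can be freed")
}
```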
We also spent a lot of time tuning and tweaking libtorrent. libtorrent is the library that almost all standard torrent clients out there use internally; it's kind of the reference implementation, and it has many knobs you can tune: send buffers, receive buffers, timeouts, and certain torrent tit-for-tat behaviours. The torrent protocol has principles along the lines of "if you upload more, I will give you more data," and some of those principles did not make sense to us in a data-center setting, as opposed to the public internet, so we had to tune those. We also spent some time tuning our file sync operations.

These are early days for Shatabdi, but we are in production, and the Shatabdi client is deployed on every single physical server in Flipkart. Not everyone has moved to containers; that is an ongoing effort in Flipkart, and as teams adopt them, the orchestration of images for their containers happens through Shatabdi. As of today, about 600 GB is transferred every single day using the Shatabdi system; we see around 250 unique container images deployed in a day; and right now about 2,500 bare-metal servers participate in these LXC container downloads in production, while our benchmarks cover 1000-plus-node clusters. And flipkart.com, the website you see out there, gets deployed using Shatabdi right now; that cluster's size is 450.

These are some references; if you go through the slides, you can refer to them. This is the team, some of whom are actually here at the back, and I think we have some time for questions. The slides and the demo are on the HasGeek proposal page, and we're around at the conference, so let us know if you have more questions. One of the key learnings from this talk is that we traditionally do not...

Audience: Can I ask a question, please? One of the issues with BitTorrent could be file integrity, because you're doing multi-part writing; you're not downloading the complete file in one go. Have you run into issues with the integrity of the file?

So, torrents have this concept of checksums. As we saw when we went through the BitTorrent protocol, every piece has a hash, and the torrent file is basically nothing but a list of the hashes of all the pieces. Once you download a particular piece from some other client, your client compares the hashes and ensures the piece is correct. If the piece downloaded correctly, it does nothing more; if the hash mismatches, it begins the download of that piece again. This is actually a major problem on the public internet, because you have a lot of network drops. We see very few of these problems inside our data center, although the protocol remains the same: we don't remove the hashes, so we still have that integrity check.

On how the deployment pipeline works: say flipkart.com wants to deploy a 450-node cluster, and it is going to happen in certain batches. They do a build, release a container image for it, and this image goes into our source data store, which we do not want to scale, for various reasons; doing that is harder. When you want to deploy, say, 100 boxes out of those 450 nodes, those 100 containers get placed on certain physical servers in our data center, and once the placement is done by our IaaS stack, all these physical servers go to the source and say, I want this particular image. This is where the problem was: they would overwhelm the source. So now they use Shatabdi to get that image; a container is created out of the image and placed on that bare metal, and eventually the containers are put behind the load balancer. So typically, deployment means creating the container image, ensuring it reaches all the physical servers where placement was done, and bringing the containers up.

Audience: Not Docker images, but LXC images? Right. Docker has this concept of an overlay file system, so you are not really downloading really fat images; if you have just done a config change, you only download a small layer. LXC does not have that.