Okay, thank you. So I have the tough task of keeping you awake late into the afternoon; I hope I can do that. Hi, I'm Anuj Mithil, and I'm here to share our experiences scaling the catalog at Flipkart, our largest functional data set. There are many aspects to this scale, and it is tough to cover all of them in 40 minutes, so I'll cover the most interesting bits and go deeper into a few of them. If you really need to, ask a question in between; otherwise I hope to have enough time for questions at the end.

Okay, let's dive straight in. Catalog. Instead of giving you an English-dictionary definition, let's relate it to real-life examples. On the left-hand side you see an array of boxes, ordered by name, holding student grade cards or library cards; they carry all the student-related information. The catalog here is a student card catalog. On the right-hand side: if you have browsed any retail magazine, say on an IndiGo flight, you can flip through the products they sell on board, with all sorts of information about those products, price and so on. So that is the offline world.

What happens in the online world? This is a typical product page; if you have visited flipkart.com, and I hope most of you have, you know it has a lot of information. Let's look at some of the catalog information. "Moto X Play, Black, 32 GB" is the title of the page. Then there are specifications about the product, like battery life, camera quality, processor and screen size. Then there is another interesting bit of information that helps you decide which variant to buy: the 16 GB one or the 32 GB one. And then there are compatible accessories, like a charger or earphones, that may go with this phone, and you can navigate to all of those using these links. So there is a lot of information there.

The second aspect is that this catalog information is very diverse: different categories have different properties. For example, here we have a book, and what matters to a user browsing flipkart.com is the author, the language, the publisher and so on.

Now let's zoom out of the product page and look at another interesting facet. This is a book sold on Flipkart, To Kill a Mockingbird, written by Harper Lee. Harper Lee has written other books, like Go Set a Watchman. There is a movie adaptation of To Kill a Mockingbird in which Gregory Peck starred, and Gregory Peck has starred in other movies, like The Guns of Navarone. And finally, he won awards for To Kill a Mockingbird. So as we can see, there is a lot of information intertwined: many relationships among products, plus pieces of information like author or actor that help you enrich the catalog. It is basically a web of information: I can start at one node and, through discovery, reach somewhere else, in some other category, while browsing.
So there is a lot of data, a lot of complex data, and a lot of facets to it. But to serve this data to a user or to any client, I need to streamline it; I cannot just serve it as it is. That is the functional complexity. Now let's look at the challenge a bit more closely. Say this is the catalog data for an iPhone 6: price, dimensions, brand and so on. If I have to render its product page, I probably need the title, the dimensions and the camera information. Whereas when I am checking out, when I am purchasing this product, the most valuable information for the cart is the price and, say, the brand. It's just an example. What we see is that the catalog has a large number of attributes; there are different clients, internal as well as external, consuming this information; and most of the time each of them needs only a subset of it, and those subsets may or may not overlap.

So the catalog holds a lot of information, multiple clients want projections of that information, and the projections may or may not overlap. Let's call these projections views. Views are the concept this entire talk will revolve around. You can consider them analogous to MySQL views, where you construct a data set from various tables and keep it in a format that is ready to serve. The next important thing is that I want clients to define their views, because they need this information to execute further business logic, and also to own their views. It is basically a functional contract between the cataloging system and its clients.
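To make the idea of a view concrete, here is a minimal sketch of client-owned views as projections over catalog attributes. The class, view names and attribute subsets are assumptions for illustration, not Flipkart's actual view contracts.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch: a "view" is a named, client-owned projection over catalog
// attributes. View names, attribute names and this class are illustrative only;
// they are not Flipkart's actual view contracts.
public class ViewSketch {

    record View(String name, List<String> attributes) {
        // Project only the attributes this client's view asks for.
        Map<String, Object> project(Map<String, Object> catalogDoc) {
            Map<String, Object> out = new LinkedHashMap<>();
            for (String attr : attributes) {
                if (catalogDoc.containsKey(attr)) {
                    out.put(attr, catalogDoc.get(attr));
                }
            }
            return out;
        }
    }

    public static void main(String[] args) {
        Map<String, Object> iphone6 = Map.of(
                "title", "Apple iPhone 6",
                "brand", "Apple",
                "price", 50000,
                "dimensions", "138.1 x 67 x 6.9 mm",
                "camera", "8 MP");

        View productPage = new View("product_page", List.of("title", "dimensions", "camera"));
        View cart = new View("cart", List.of("price", "brand"));

        System.out.println(productPage.project(iphone6)); // title, dimensions, camera
        System.out.println(cart.project(iphone6));        // price, brand
    }
}
```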
Now, what are the underlying caveats? Most importantly, I want timeline consistency across this data. For example, if you are on a product page where a product costs around 50,000, and you move to the checkout page and see the price as 60,000, that is not an experience I want to give a user. I want to keep all these views consistent, and I want that consistency at low latency and very, very high throughput. Almost every time you click a link on flipkart.com, the catalog system is in play, and in a single call graph it may be invoked multiple times.

Before going further, let me throw some numbers at you so we know what scale we are talking about. There may be dozens of ways to approach these problems, but, as all of us know, things get funny at scale, so if there are things we can generalize at this scale, they may help us. The numbers: 4 million read QPS; a data set running into double-digit terabytes; two to three billion documents; and a write QPS, the velocity of updates to this data, of 15,000 to 20,000. So it is a very, very large data set we are talking about.

Now let's go deeper into our solution. Say these are our sources and those are our clients. By sources I mean the systems on the ingestion path, the ones helping you create the data set. The clients are the front-end systems a user generally sees, and they need processed information. So I need to compute my views somewhere in the middle. Let's see how we can connect these dots.

For the computation, there are two options: either I do it at runtime, or I pre-compute and keep the data set with me. Obviously I cannot do this computation at runtime, because that would mean all my sources have to be web-scale resilient; and secondly, the computation is intensive, so it may be highly latent if done at runtime. So I have to pre-compute all of this and keep it.

What should the main characteristics of my computation engine be? I want it to combine data from all these sources and also transform it. As we saw in the example, we had length, breadth and height separately, but on the website we show them as dimensions; the information is not changing, only the way it is viewed. Second, and most importantly, I want this to be loosely typed: I do not want to deploy code for a new view requirement or for editing an existing view, because that is very, very slow. And I do not want complex business logic creeping into my view computation layer, because that defeats separation of concerns.
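As a tiny illustration of that combine-and-transform step, here is a sketch that derives a single "dimensions" attribute from separate length, breadth and height fields. The field names are assumptions, and in the real engine this kind of mapping would be driven by loosely typed view definitions rather than hard-coded as it is here.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the combine-and-transform step the computation engine performs:
// the sources give length/breadth/height separately, the serve-ready document
// exposes a single "dimensions" attribute. Field names are illustrative only,
// and the real engine would drive this from view definitions, not hard-coded logic.
public class TransformSketch {

    static Map<String, Object> transform(Map<String, Object> source) {
        Map<String, Object> doc = new HashMap<>(source);
        Object l = doc.remove("length_mm");
        Object b = doc.remove("breadth_mm");
        Object h = doc.remove("height_mm");
        if (l != null && b != null && h != null) {
            doc.put("dimensions", l + " x " + b + " x " + h + " mm");
        }
        return doc;
    }

    public static void main(String[] args) {
        Map<String, Object> source = Map.of(
                "title", "Moto X Play",
                "length_mm", 148, "breadth_mm", 75, "height_mm", 10.9);
        System.out.println(transform(source));
        // {title=Moto X Play, dimensions=148 x 75 x 10.9 mm} (key order may vary)
    }
}
```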
Now, as I said, we need to pre-compute this data, and if we are pre-computing we need to store it somewhere. So we have a view store, and what do I require from a view store? It has to be a document-based store, where I can combine information from everywhere and store it as a single blob that is ready to serve. It is not just a plain key-value store, because I may need certain metadata that helps me do some admin tasks or prioritize things. And finally, I need a serve layer that exposes an endpoint to which all my clients can connect and get this information. All right, so we have connected the dots.

Now let's see what we need when we want to scale this up. In general, the limiting factor when scaling anything up is your data store, not the compute layer; it is relatively easy to scale the compute layer. So I have additional requirements for my view store. I want it to support sharding, so that I can scale horizontally with ever-growing data: I can just throw hardware at it and still be up. And a good-to-have is that if faster disks like SSDs are available, the view store should be able to exploit them and give me a significant performance boost. The next aspect of scale, obviously, is that I need a caching layer, which gives me low latency and high throughput and also adds to my resiliency on top of the view store.

Now, as we discussed, there are a lot of updates I need to take in. As I said, it is relatively easy to scale the compute layer, and we can have all sorts of parallelization, but what we also really need are priority queues: there will be business asks where you want some data set propagated very quickly to the end client, so you want priority queues that can be spun up on the fly. Secondly, you need the ability to work with patches. By patches I mean: your source may be dumping the entire data set on you, while your client only needs a subset of it, and even then is mostly interested in what has changed, not the entire data set. Across this whole pipeline it is costly, network-wise, to propagate huge chunks of data, and it leads to redundant computation.

So if I can calculate patches upfront and only propagate a few bytes of patch, that helps, because the data that is volatile is mostly things like price: a single key-value pair that changes far more often than, say, the camera specifications or the title. Those changes need to be propagated very fast. Thirdly, I want to update the things people are actually accessing. Whatever my clients are accessing should be in the cache, if I have got my caching strategy right; so I can have a shortcut from my computation engine that goes and updates the cache directly, making me relatively more consistent than the normal updates going via the view store.

I'd like to take a pause here. We have connected all the dots, but this is where we eventually reached; there were a lot of learnings in between. For cache updates there were multiple options: do you push updates or pull them, and what percentage of updates needs to be reliable? That is very case-specific, but the considerations were that either the serve layer knows about the backend and pulls data, or the backend writes to your cache. It is a case-by-case thing, and on reliability: if you are not 100% reliable, which is very tough to achieve, especially with a push mechanism, can you work with lower reliability and yet self-heal?

Again, on patches: we work with JSON as the common data exchange format, and we assumed there would be open-source libraries that could give me a diff of two JSON documents and apply that patch to a third. But when we ran those open-source implementations over 50-60 million products and the kind of updates we get, we found all of them to be buggy. Hence we went back to the RFC spec and wrote our own implementation, called ZJsonPatch. It is already open source, you can go have a look at it, and it has not caused any problems so far.
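To show what working with patches looks like, here is a minimal sketch of producing an RFC 6902 diff for a price change and applying it downstream, using Jackson together with a zjsonpatch-style API. The JsonDiff.asJson and JsonPatch.apply entry points are my assumption of the library's surface; treat the exact class and method names as illustrative rather than authoritative.

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.flipkart.zjsonpatch.JsonDiff;
import com.flipkart.zjsonpatch.JsonPatch;

// Sketch: compute an RFC 6902 patch for a price change and apply it downstream,
// so only the changed bytes travel through the pipeline instead of the whole
// document. The zjsonpatch class and method names are assumed for illustration.
public class PatchSketch {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();

        JsonNode before = mapper.readTree("{\"title\":\"iPhone 6\",\"price\":50000}");
        JsonNode after  = mapper.readTree("{\"title\":\"iPhone 6\",\"price\":48500}");

        // Upstream: diff old and new documents; only this tiny patch is propagated.
        JsonNode patch = JsonDiff.asJson(before, after);
        System.out.println(patch); // [{"op":"replace","path":"/price","value":48500}]

        // Downstream: apply the patch to the copy the consumer already holds.
        JsonNode updated = JsonPatch.apply(patch, before);
        System.out.println(updated); // {"title":"iPhone 6","price":48500}
    }
}
```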
Okay, let's move to our serve layer and see what funny things happen there at scale. What is the traditional approach? You have your clients and a bunch of server nodes, and for clients to reach the server nodes you typically put a hardware load balancer in between. It advertises a virtual endpoint to which your clients connect, and behind the scenes the load balancer takes care of the machines, bringing them into rotation, taking them out of rotation, and distributing your requests among the server nodes. All good.

Now let's say I have a requirement of 192 Gbps; by the way, that means serving 48 Blu-ray movies every second, and that was our requirement. So what do I do? We went to our network engineers and asked for a hardware load balancer that could give us this, and they basically showed us the door: please get your numbers right, this is not possible. We went back, did the numbers again, and obviously the number was still the same. So this didn't work. Now, what if I just removed the hardware load balancer from the middle and had a software load balancer co-hosted with the client, with all the capabilities of a hardware load balancer, able to distribute requests among the servers?

If all the clients do that, I immediately get additional bandwidth, additive in nature given the number of clients and servers. Each server-client connection is now a full-duplex fat pipe, and the load balancer is not a bottleneck. So if I have a 10 Gbps link on each server and some 200-odd servers, which I already need to serve the QPS, my bandwidth requirement is fulfilled. So we decided to serve this data without a hardware load balancer.

Again, it wasn't as simple as it looked. We have a software load balancer; how should it run? As a separate process, a daemon running on the client box, or embedded in the client's runtime? For me, this was my biggest learning as far as this project goes. We started by running it as a daemon. From the server to the daemon I was able to transfer data in the required amount of time; the latency was okay. But handing the data off from the software load balancer, running as a separate process, to my client amplified latency five or six times, and we were clueless why: on the loopback interface there should ideally be no such limit. We dug deeper and eventually saw that since the client at full throttle was heavily loaded, the load average on the client box was very high, and the daemon did not get enough CPU cycles to complete the communication in one go; it had to do retransmits, and only after five or six attempts did the communication finish. So that strategy, running it as a daemon, failed at our scale. We eventually embedded it in the client's runtime, and things were very smooth after that. Then there are other things your software load balancer must do, such as supporting different kinds of load balancing algorithms.
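Here is a minimal sketch of what "embedded in the client's runtime" means in practice: the balancer is just a library object inside the client process, with pluggable algorithms, rather than a separate daemon hop. The interface and the round-robin policy are assumptions for illustration; this is not the actual Mystique API.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of an in-process (embedded) software load balancer: no separate daemon,
// no extra hand-off hop, just a library the client calls to pick a server for each
// request. The interface and policy here are illustrative, not the Mystique API.
public class EmbeddedLbSketch {

    interface LoadBalancer {
        String pick(List<String> servers);
    }

    // One of several pluggable algorithms; round-robin shown here.
    static class RoundRobin implements LoadBalancer {
        private final AtomicInteger counter = new AtomicInteger();

        @Override
        public String pick(List<String> servers) {
            int idx = Math.floorMod(counter.getAndIncrement(), servers.size());
            return servers.get(idx);
        }
    }

    public static void main(String[] args) {
        LoadBalancer lb = new RoundRobin();
        List<String> servers = List.of("server-1:8080", "server-2:8080", "server-3:8080");
        for (int i = 0; i < 5; i++) {
            // The client then makes its HTTP call directly to the chosen server.
            System.out.println("request " + i + " -> " + lb.pick(servers));
        }
    }
}
```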
Now let's look at our caching layer, which sits just below our serve layer. As we already discussed, like my primary store, my caching layer also needs to be horizontally scalable, because I have an ever-growing data size. The traditional approach is to have the caching layer, say a memcached-style cluster, below the server nodes, with discovery nodes and so on that know the entire topology of the cluster. What are the cons of this approach? In a very latency-sensitive environment you are adding one extra hop; you are adding more hardware that needs to support a very large amount of network bandwidth, the same as between client and server; and it is eventually one more failure point.

What's the alternative? What if I could co-host my caching layer with my server nodes? If that were possible, apart from the cons that get eliminated straight away, we get the following pros. One thing I learned during this whole process was that benchmarking, and getting the right tests done, was key, and all of this helped with that: I can now say how much QPS and what latency one node can give, and add it up to do the hardware estimation for my entire cluster. I can independently benchmark the nodes, and it supports a shared-nothing architecture: if a node goes down, it goes down with finality, and there are no inconsistencies in my cluster. And if I am deploying a 500-600 node cluster, I do not want to be watching the uptime of each and every node.

So now we have the caching layer co-hosted with the serve layer, and I need to fill up this cache. Hence I need a logic for distributing the data that goes into the cache. What does that depend on? The scale we are talking about has two dimensions: one is the QPS each of these machines needs to support, and the other is the amount of data. If my data set is, say, 10 terabytes, not all of it is getting accessed; maybe only 500-600 GB is frequently accessed, but it is still ever-growing. So how can I support a cache for that amount of data, which obviously will not fit on a single node?

Apart from that, we needed consistent hashing. What I mean by that: I cannot afford to rehash the data set in the cache because a node went up or down. If there are fluctuations, I don't want keys getting rehashed; that creates a lot of chatter and is not desirable. I need to hash consistently: if a key belongs to one shard, it should always belong to that shard. Finally, I need a hashing function that can do this, and my requirements are that it be non-cryptographic, fast and simple; we used murmur hash for our case.

All right, let's look at a real-life scenario and see whether the architecture we have proposed makes sense. Say the maximum cache size each node in the cluster can host, apart from the memory your server process needs, is 50 GB, whereas the data set you want cached is 100 GB. Let's distribute this 100 GB over a key space K1 to K100. Since one node can only host 50 GB, let's partition it into two mutually exclusive sets, K1 to K50 and K51 to K100, and give each of these shards one more replica for high availability. So nodes two and three host K1 to K50, and nodes one and four host K51 to K100. I already used the term sharding, and this is very similar. Since I said I cannot rely on the uptime of my servers for hashing, I have to rely on something else that is constant: I pre-decide the number of buckets I want in the cluster, and then apply a hashing function that consistently maps a particular key to a bucket, unless the number of buckets itself is changed, which should not happen in normal operation. So with the number of buckets constant, keys K1 to K50 will always hash to bucket B1, and keys K51 to K100 will always hash to bucket B2.
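A minimal sketch of that key-to-bucket mapping: a fast, non-cryptographic hash (murmur3 here, via Guava, which is an assumed dependency) over a fixed, pre-decided bucket count, so a key always lands in the same bucket regardless of which nodes happen to be up.

```java
import com.google.common.hash.HashFunction;
import com.google.common.hash.Hashing;
import java.nio.charset.StandardCharsets;

// Sketch: keys hash to a FIXED, pre-decided number of buckets with a fast,
// non-cryptographic hash (murmur3 via Guava; Guava is an assumed dependency,
// any murmur implementation would do). Because the bucket count never changes,
// a key always maps to the same bucket no matter which cache nodes are up or down.
public class BucketHashSketch {

    static final int NUM_BUCKETS = 2;                      // pre-decided, kept constant
    static final HashFunction MURMUR = Hashing.murmur3_128();

    static int bucketFor(String key) {
        long h = MURMUR.hashString(key, StandardCharsets.UTF_8).asLong();
        return (int) Math.floorMod(h, (long) NUM_BUCKETS);
    }

    public static void main(String[] args) {
        System.out.println("K1  -> bucket " + bucketFor("K1"));
        System.out.println("K73 -> bucket " + bucketFor("K73"));
        // Re-running always prints the same buckets; rehashing would only happen
        // if NUM_BUCKETS itself were changed, not when servers fluctuate.
    }
}
```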
Now, instead of talking at the level of keys, let's go one level higher in abstraction and say that servers two and three host bucket B1 and servers one and four host bucket B2. Finally, a client wants to access key K1; how does it do that? To access K1, it has the software load balancer I showed, plus the consistent hashing algorithm, which will always map K1 to B1. The only additional information it needs is the list of servers on which B1 is hosted, so that it can go to servers two and three and say: you have key K1, give me its data. Now, this mapping of buckets to servers, what are its characteristics? It is a very dynamic list, because you have a dynamic cluster with nodes going up and down.

It needs to be propagated as soon as possible, because if you are taking a node down you don't want to wait an eternity before that is reflected on the client side, and it is only a few kilobytes of data that you need to exchange. So you basically need a metadata server that hosts this bucket information, which bucket is being served by which server, and propagates it to the clients. Now the client knows that bucket B1 is served by servers two and three, and the software load balancer we deployed there can choose an appropriate load balancing algorithm for the keys it is getting and distribute the load: all keys from K1 to K50 between servers two and three, and keys K51 to K100 to servers one and four.

Now let's look at the production scenario. Your max cache size per node is still 50 GB, but the cache size required is 500 GB, your full data set is five terabytes, and there are millions of keys and hundreds of buckets. What needs to happen is very simple: instead of one server hosting just one bucket, one server hosts multiple buckets, enough to cater to your QPS requirement. This gives rise to a concept we coined the "lava bucket". Let's say a sale is going on and all the Apple products are getting accessed, and assume all the Apple products hash to bucket BL. What I want is to dynamically serve BL from all the capacity I have, so that I can meet that spike in throughput and still provide the same quality of service. And all of that can happen with the click of a button or an API call; you can just sit back and do it very easily.

Apart from that, as I already mentioned, there are other admin controls that your hardware load balancer used to handle and that you now have to incorporate in your software load balancer: taking a node out of rotation, bringing it back into rotation, and so on; you have to implement all of those.

So if I want to characterize my software load balancer, what does it need to be? It needs to be lightweight, so that my clients will accept the request of embedding it in their runtime. It needs bounded resource consumption. It needs a variety of load balancing algorithms, and dynamic timeouts for various data sizes; as you already saw, some catalog documents can be huge, running into megabytes, while others are five or ten kilobytes. And it has to be fault-tolerant: that helps in the case where your metadata server is down and your clients have stale information about the cluster topology; they can self-heal and figure out which machines are down.

We call this the virtual bucket architecture. What are the wins? We have logically sharded the data, and we can scale our caching layer with the number of servers we have. It is a shared-nothing architecture, and I have avoided a network hop, which helps me gain on latency, especially when data sizes are huge.
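Here is a sketch of the client-side routing this implies: the client keeps a small bucket-to-servers map fed by the metadata server, hashes each key to a bucket, and lets its embedded balancer pick one of that bucket's servers; growing a hot "lava" bucket is just publishing more servers for it. All names are illustrative, not the actual implementation.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ThreadLocalRandom;

// Sketch of virtual-bucket routing on the client side. The client holds a small,
// dynamic bucket -> servers map (pushed by the metadata server), hashes each key
// to a bucket, and picks one of that bucket's servers. Growing a hot "lava" bucket
// is just publishing a larger server list for it. Names are illustrative only.
public class VirtualBucketRouterSketch {

    private final int numBuckets;
    // bucket -> servers currently hosting it, refreshed from the metadata server
    private final Map<Integer, List<String>> bucketToServers = new ConcurrentHashMap<>();

    VirtualBucketRouterSketch(int numBuckets) {
        this.numBuckets = numBuckets;
    }

    void onMetadataUpdate(int bucket, List<String> servers) {
        bucketToServers.put(bucket, List.copyOf(servers));
    }

    String serverFor(String key) {
        int bucket = Math.floorMod(key.hashCode(), numBuckets); // stand-in for murmur
        List<String> servers = bucketToServers.get(bucket);
        // Pick one replica; a real client would use a pluggable LB algorithm here.
        return servers.get(ThreadLocalRandom.current().nextInt(servers.size()));
    }

    public static void main(String[] args) {
        VirtualBucketRouterSketch router = new VirtualBucketRouterSketch(2);
        router.onMetadataUpdate(0, List.of("server-2", "server-3"));
        router.onMetadataUpdate(1, List.of("server-1", "server-4"));
        System.out.println("K1 -> " + router.serverFor("K1"));

        // Sale spike: the metadata server fans the hot bucket out over more capacity.
        router.onMetadataUpdate(0, List.of("server-1", "server-2", "server-3", "server-4"));
        System.out.println("K1 after lava-bucket expansion -> " + router.serverFor("K1"));
    }
}
```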
All right, so again, everything wasn't as hunky-dory as it seems; we had our set of learnings before we eventually got there. We were greedier: we had a server, we had a local cache, and we also wanted an in-memory cache inside the server process. But that strategy failed, because the cache size I could give to the server process was far less than the amount of cache that needed to be kept on one node. It led to a lot of cache evictions, which in turn led to a lot of GC pressure, and our servers were running in full GC mode in no time. So that did not work.

The other aspect was around the smart client. Initially we had an async implementation, and in a lot of corner cases, when we had network blips and so on, the async implementation failed. There were a lot of connection leaks, leading to dangling CLOSE_WAIT sockets, and the client machines were rendered useless because the entire connection pool got exhausted; the only way out was a restart. And if I have, say, 1,000 to 1,200 clients fronting the website, that is a very bad position to be in. So we moved to the sync implementations of the popular open-source HTTP clients that are available and implemented async behavior on top of that.

The third interesting bit: on the client side we have this smart client, so to say, and we were deserializing the response, converting it into JSON and handing it over to our client. But that had its own cost, and it was redundant. Why? Because the client layer that needs this data will deserialize it again anyway for its own business layer, into its own model, which I do not know. So it makes no sense for me to do any deserialization at all. We discovered this interesting approach where I send a request to many servers, get the responses back as byte streams, simply create a sequence of those byte streams and hand the handle to the combined stream over to my client; the data only gets deserialized where it is actually required. That alone gave us a performance boost of 30 to 40% in latency.
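A minimal sketch of that hand-over-the-raw-bytes idea: responses from several servers are chained into one stream and passed up untouched, so deserialization happens only once, in the consumer's own model. It uses java.io.SequenceInputStream purely to illustrate the pattern; it is not the actual client code.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.List;

// Sketch: instead of deserializing each server's response inside the smart client,
// chain the raw byte streams together and hand a single handle to the caller.
// Deserialization then happens exactly once, in the caller's own model.
public class ByteStreamPassThroughSketch {

    static InputStream combine(List<InputStream> responses) {
        // SequenceInputStream reads each underlying stream in order, lazily.
        return new SequenceInputStream(Collections.enumeration(responses));
    }

    public static void main(String[] args) throws Exception {
        // Pretend these are HTTP response bodies from two different serve-layer nodes.
        InputStream fromServer2 = new ByteArrayInputStream(
                "{\"sku\":\"K1\",\"price\":48500}".getBytes(StandardCharsets.UTF_8));
        InputStream fromServer3 = new ByteArrayInputStream(
                "{\"sku\":\"K2\",\"price\":999}".getBytes(StandardCharsets.UTF_8));

        InputStream combined = combine(List.of(fromServer2, fromServer3));
        // Only the consumer reads and deserializes; the smart client never did.
        System.out.println(new String(combined.readAllBytes(), StandardCharsets.UTF_8));
    }
}
```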
So what are the additional considerations? Going one level deeper: what needs to be cached? The obvious answer is whatever the clients are accessing. A couple of interesting thoughts on that. The caching algorithm we used was not just least recently used, keep the least recently used keys in the cache; it was a combination of least recently used and least frequently used, tuned to our requirements. The other thing is that the access key in the cache can be different from the routing key: the routing key is used by your backend and so on to decide which cache node to go and update, whereas the actual data your server is accessing can have a different access key. So you need to be very cognizant of what your access key is versus what your routing key is. On cache evictions, as I said, it depends on the algorithm, but whatever strategy you employ, try to keep evictions low. And the second part is refresh. As we discussed, the backend pushes notifications directly to the cache to update it, but, as we said, that cannot be 100% reliable, so the refresh strategy should auto-heal: if I have missed an update, that stale data should not reside in the cache forever; there should be some auto-heal mechanism that refreshes the data on its own.

Now a bit on data exchange. We had taken for granted that we have an HTTP API, we hit it, we get a response, and we did not even bother about the content encoding that was happening. We realized later, and somewhat coincidentally, because we were dealing with network bandwidth issues, that Gzip, which is the default encoding in most distros, was falling over at our scale: it was creating a lot of GC pressure, and that was very detrimental. So we tried serving entirely uncompressed data, and then we also stumbled upon a library called Snappy. Here were the results: taking Gzip's output as 1x, the raw data was 6x, so Gzip compressed roughly six to one, whereas Snappy's output was about 1.5 times the size of Gzip's. So space-wise Gzip performed better, but time-wise Snappy beat both of them hands down. That is the trade-off we took, and we used Snappy for our case. It helped us reduce the GC pressure and stay within our latency budgets.

So that was most of it, but it is still not a perfect platform; the journey has not ended. There is GC happening in our JVM-based services. The cache we use suffers a lot of memory fragmentation due to evictions, and it needs to do something like GC to defragment; unfortunately, right now we need to recycle nodes to defragment and bring a clean node back up, and on days when cache evictions are very high, that exercise is a lot more frequent. The other bit: we talked about timeline consistency. It was easier to do when we had only 5 or 10 views, but now we have more than 60 or 70 unique views, and it is very tough to keep consistency across all of them. Third, there is a lot of overlapping data, like the product title and specifications, that is part of many views; if I have 70 views, I am storing 70 copies of the same data, and that leads to a data blow-up. I would like to get to a solution where that blow-up is minimal and the serve side is still performant.

So finally, the takeaways, the new paradigms we came across. It was news to us that the network is not something you can take for granted: we had to replace our load balancers with smart clients and focus on how the data set is transferred over the wire. The patterns we implemented were a shared-nothing architecture, consistency tuned to match our uptime, and the different caching strategies we discussed. And on techniques: after each experiment with compression techniques, we had to redo our GC tuning so it was optimal for that particular caching setup, and then benchmark again, so all of that took a lot of time. The capabilities that came out of this were the view computation engine, which I talked about briefly; the smart client, which is called Mystique; ZooKeeper-based service discovery; and ZJsonPatch, which I already told you about. ZJsonPatch is already open sourced, and we will be looking to open source Mystique and the virtual bucket cluster very soon.

All right, thank you. We'll take Q&A. This was the team that did it.

Audience: Have you also used TCP-based load balancing?
No, all of it is HTTP-based. Okay, I can connect with people offline. Thank you.