Good evening, guys. This is Sriram, from Indix. I'm a principal engineer at Indix; I've been here for five years, have worked on a broad set of problems here, and am working on machine learning problems right now. To put in context what Indix does: Indix deals with collecting product information, structuring it, and providing it to end users. We crawl the web for product information from e-commerce sites and other affiliate sources about products, structure it, store it in our index, and then serve it to end users. At a high level, Indix right now has anywhere close to 900 million to a billion product records and close to 2 billion offers, spanning 50,000-plus brands and 7,000-plus categories. That's the scale of problem we deal with, and I'm going to briefly talk about what our pipeline looks like, because I need to set this context before we get into why we built some of the stuff we call Suuchi. At a high level, we crawl HTML pages from the web and then run a series of algorithms that process this unstructured HTML content through the pipeline: we stamp each product with a category, identify the brand of the product using machine learning techniques, and also try to go deeper into the product to understand what it is about, extracting more attributes, and so on. That's what we do in the pipeline at a high level. Throughout the pipeline, some common tenets hold. Scale: we just cannot deal with anything less than a few terabytes of data, so that's a given for us. And we also need to keep a check on latency, in how we process and respond to user queries, which are mostly internal. What we end up with is a bunch of indices across the pipeline to serve some of these requests, which is why I have put a couple of databases at a few significant places in the pipeline. The common requirements, from a technical angle, for these systems, and I'm talking about both the processing and the storage, are these: we need to handle scale, and it's terabytes of data; we need to be fault tolerant, because we cannot house everything on one box, we are naturally distributed, and that also means we need to be fault tolerant; and on top of this, we have tried multiple things and realized that operations in production is very, very critical to managing our systems and delivering a delightful user experience, so for us operations is of utmost importance. Given these, we went through our own story of finding the sweet spot for us, but I'm going to take a step back and look at how a problem like this is generally approached outside, how traditionally we have tried to address it when building software.
Traditionally, you split the system into multiple tiers, compartmentalize your problems, and say: I'm going to manage the application tier separately from the web tier, I'm going to have an ORM or something like it sitting in the application tier and talking to my database, which stores the information, and so on. More often than not we are very successful at scaling the web tier and the application tier, but we know the problem always goes back to the place where we touch the disk, our data. Since the morning there has been substantial content supporting this: we know that managing data at scale is very, very hard, and building distributed systems is also very hard. What we are essentially trying to solve at Indix, and of course at a lot of other places, is scaling data systems, which means scaling storage as well as scaling compute. With this context, let's see how the traditional system is helped by the set of technologies we have got in the last 10 years. The rise of key-value stores has taken away a lot of the problems the traditional data stores gave us; some of the pain points and bottlenecks we had with traditional databases have been taken out of the picture completely by these distributed systems. But even the previous talk, on how to pick a database, clearly pointed out that you need to know whether you want to push your computation into the application or leave it with the database; trying to force-fit a relational model into a key-value store is not going to help. So when we scale these applications with distributed data stores, one part is the design part, where we come up with different data stores to fit our needs. The other part is where we start seeing something called latency, and this latency starts to become the biggest monster you end up dealing with. If you are running HBase, Cassandra, or any other data store in the back, I'm sure almost all of you who have managed HBase or Cassandra have spent weeks of nights burning the midnight oil trying to squeeze the hell out of the GC, right? There is no doubt about that. You end up spending a lot of time tweaking these settings to ensure that the latency between your application and the data store stays within permissible limits. That problem is not going away, and as you scale to more data systems, it only gets bigger. So what we thought is: why not take it head on and try to solve it to some degree? One of the things we tried is: can we co-locate our compute and the data? If I can co-locate compute and data, I'm killing some part of the latency. Of course, because the nature of the problem itself is its scale, I won't be able to hold the entire data set on one box. Naturally, that means the latency win is limited to the portion where there is data locality. But that's still better than having nothing, right?
This idea of co-locating compute and data means we are moving away from the concept of just a data store to a data service, something a little more intelligent than a data store that knows nothing beyond organizing your data and keeping indexes on it. Building such a data service seems to solve some of these problems for you, but when you actually go about doing it there is a lot of complexity involved, because building a distributed system from scratch is not easy, and not everything you come up with will just work. One of the things we learned is that we need to take this problem head on, but not every application writer can be solving it every time. So we understood that we needed some middle ground: a data service component. We tried to find something in the open source world that fits this need. How many of you here have used HBase? A large portion of you. And how many of you know of coprocessors? What I want is something like this: I want data and compute to be co-located, and coprocessors fit the bill exactly; I don't need to look anywhere else. A coprocessor is basically a unit of processing that lives alongside the shard; any operation that happens on the shard triggers the coprocessor, and whatever action I need to perform there, I can get it done. But the problem is that coprocessors in HBase are an afterthought. Whether we like it or not, they are an afterthought; they are not designed from the ground up. Think about it: how do I release a new version of my application when my data store is a shared data store? When I'm running a 100-node HBase cluster and I have to share this data store with a couple of applications, what call do I take on the coprocessor running there? Do I throw in a new jar, do I do an application upgrade, or do I do a data store upgrade? What do I do there? These things are not very well defined, and the semantics around coprocessors are not really well understood in terms of failures and upgrades. So while there is a little bit of provision for you, the overall design is not set up in such a way that you can use it for production use cases where your business-critical logic or code needs to live. Now, with this situation and no other option, we realized that applications need to own scaling; developers need to be aware of how to handle scale. That's the bare reality now. How do we solve this so that it's easier for all of us? We debated a lot, tried some of this stuff, realized we couldn't make use of whatever was available, and then moved on to building our own. Along the way we realized that the call between off-the-shelf and in-house involves a bunch of parameters, and more often than not it's a question of your expertise and your need: is there a real need for you to build something and manage it, versus using something that's already available? And on top of this, given that we are a startup, we are extremely prudent about cost.
If there is a way for us to shave a cluster down from 20 nodes to 5 nodes, and I can do that with, say, a couple of people working on it for a month or two, I would rather do that, because the cost of managing a bigger infrastructure, and the constant attention it needs, is a lot more in the long run than solving the problem head on. With this setting, what we realized is that most of the problems we deal with in distribution fall into one of a few buckets; essentially, to build a distributed system, you need to solve, or at least have a hold on, these problems. First, you need to know how to establish communication between two machines: how do two machines talk to each other, what is the protocol for this cluster, am I going to talk HTTP, TCP, or some custom protocol? That's the number one problem to solve. On top of that, you need to understand who is part of the cluster: if I'm going to bring up 10 nodes, I need to know who's in the cluster right now, who's going out, who's newly added, what data is on what node, and a bunch of metadata around all of that. And if I have a cluster of 5 or 10 nodes, the idea is naturally to store partitions of the data across those nodes. So what is the strategy for partitioning data? Technically it's called sharding. How do you apply the sharding, and how do you route your queries so that the system identifies the right shard for each query? It could be just one node or multiple nodes depending on how much data the query spans, but how do you zero in on the right set of nodes to get the work done? On top of this, having a distributed system without replication is pointless; you are basically shooting yourself in the foot by not having replication and managing data properly. Replication is a must for us to do anything in a distributed world. With this context, we understood that these, communication, membership, sharding and replication, are the primitives, so to say, that we need to address.
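To make those four primitives a bit more concrete, here is a minimal sketch of how they could be expressed as interfaces. These names and signatures are purely illustrative, they are not Suuchi's actual API:

```scala
import java.net.InetSocketAddress

// Communication: how do two nodes talk to each other at all?
trait Transport {
  def send(to: InetSocketAddress, payload: Array[Byte]): Array[Byte]
}

// Membership: who is part of the cluster right now?
trait Membership {
  def members: Set[InetSocketAddress]
  def onChange(listener: Set[InetSocketAddress] => Unit): Unit
}

// Sharding / routing: which node(s) own a given key?
trait Router {
  def route(key: Array[Byte]): Seq[InetSocketAddress]   // primary first, replicas after
}

// Replication: copy a write to the replicas and decide when it counts as "done".
trait Replicator {
  def replicate(key: Array[Byte], value: Array[Byte],
                replicas: Seq[InetSocketAddress]): Boolean
}
```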
In the last three to four years we have built two such systems that are in production, distributed systems from scratch: very simple master-slave architectures that we understood, built from scratch, and ran in production. As part of the learning from that, we realized we could pull out these common primitives as abstractions and reuse them to build newer systems from the ground up. Around the same time, Ashwanth, who is the co-author of this project, and I stumbled upon Ringpop, which handles some of this for Node and Go. That's how Suuchi came about, and we started hacking on it. Like any other open source project, we spent a couple of weekends on it, getting our heads around it, coming up with the basic primitives, setting up the interfaces, and figuring out how to go about building it. So what does Suuchi provide you, as an application developer? It gives you basic blocks to establish communication between the nodes in your cluster. It lets you pick a routing strategy from the ones we have defined, or, if you want to define a new routing strategy, you can implement the interface and contribute it back. You also get different modes of replication, and you can choose different ways to set up your membership: a gossip-based membership, or a static membership where you predefine the set of nodes that will be part of the cluster. These are the basic primitives we thought would be essential to get any simple distributed system off the ground, and we went with them. Let's take each of these primitives and go a little deeper into what is supported out of the box right now. As far as communication goes, we support two major paradigms. One is the handle-or-forward paradigm, which says that for any request I get, I need to know whether I should handle it myself or forward it to whoever is responsible for it. This primitive is common when you know your queries are localized to a node or to a small set of nodes. The other primitive, very common in stats aggregation infrastructure, is scatter-gather, which basically says: my data is spread across a bunch of nodes, because it spans a timeline or my query spans different facets, so I need to hit a large portion of my data, run computations on it, and get the results back quickly. Scatter-gather is supported out of the box. The way you would think about this is that you write your application service and then configure it with the routing and replication strategies you want. So that was communication. Now, coming to the routing, or sharding, part: what decides that I am responsible for this chunk of data? We started off with consistent hashing, which supports routing and replication together, because the data structure gives you that out of the box. So we use simple consistent hashing; it's well understood by most people who are familiar with distributed systems and with how routing generally works in data systems. Consistent hashing comes out of the box, but if you want a total-order partitioner or some other form of partitioner or router, say a metadata-based router maintained by a master, you can always plug in your own router and take it from there.
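As a rough illustration of the routing idea, here is a minimal, self-contained consistent hash ring. This is just the concept described above, not Suuchi's implementation:

```scala
import java.util.TreeMap
import scala.util.hashing.MurmurHash3

// Each node is placed at several points ("virtual nodes") on a ring of hashes.
// A key belongs to the first node clockwise from the key's hash, so adding or
// removing a node only moves the keys in its neighbourhood.
class HashRing(nodes: Seq[String], vnodes: Int = 100) {
  private val ring = new TreeMap[Int, String]()
  nodes.foreach { n =>
    (0 until vnodes).foreach(i => ring.put(MurmurHash3.stringHash(s"$n#$i"), n))
  }

  // The node responsible for this key (handle-or-forward uses exactly this check).
  def nodeFor(key: String): String = {
    val h = MurmurHash3.stringHash(key)
    Option(ring.ceilingEntry(h)).getOrElse(ring.firstEntry()).getValue
  }

  // The next `n` distinct nodes clockwise can act as replicas for the key.
  def replicasFor(key: String, n: Int): Seq[String] = {
    var entry  = Option(ring.ceilingEntry(MurmurHash3.stringHash(key))).getOrElse(ring.firstEntry())
    val picked = scala.collection.mutable.LinkedHashSet.empty[String]
    while (picked.size < math.min(n, nodes.size)) {
      picked += entry.getValue
      entry = Option(ring.higherEntry(entry.getKey)).getOrElse(ring.firstEntry())
    }
    picked.toSeq
  }
}
```

Something like `new HashRing(Seq("node1", "node2", "node3")).nodeFor("sku-123")` tells a node whether it should handle that key or forward it, and `replicasFor("sku-123", 2)` gives the nodes a write would be copied to.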
The next primitive we support is membership. Like I said, you can bootstrap your application by saying these are the five nodes that are part of the cluster, and they will know about and talk to each other. Or you can have a bit more dynamism and use dynamic membership. We have colour-coded it on the slide: dynamic membership is still under testing; we are experimenting with Atomix, which uses Raft underneath for consensus. We are moving towards dynamic membership because it helps a lot of us get started without having to do some of this setup, and it also helps in environments where machines come up and go down, especially orchestrated environments like Mesos and Marathon. So that's membership. As for replication, I think one of the most critical things is deciding when to call a write done. Is it done when you have written it locally, when you have made the call to the remote machine, or when the remote machine has actually written it to disk? At every level you make some trade-off. Right now what we support is serialized synchronous replication: I write to disk, and only if the replica's write to disk also succeeds is the entire write considered complete. This of course adds latency overhead to the request, but it's a call we took because, for us, redundancy of data is more important than moving towards an eventually consistent model. We also support asynchronous replication; this is again under testing, and right now we don't have the bandwidth to exercise this mode properly, but hopefully we will be able to do that and release it. The next piece is optional; it's not part of a standard distributed system, but we realized that most of our tasks use state one way or the other, and across Indix we have become very comfortable with RocksDB; the performance and the benefits we get out of it seem much better than the other embedded stores we have used. So we provide support for an embedded KV, which is a RocksDB implementation right now, and of course you can plug in your own storage strategy.
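As a sketch of the embedded-KV idea, here is a minimal node-local store on top of RocksDB's Java API (org.rocksdb). The wrapper class and its layout are mine, not Suuchi's:

```scala
import org.rocksdb.{Options, RocksDB}

// Each node would own one such store for the shards it is responsible for.
class LocalStore(path: String) {
  RocksDB.loadLibrary()   // load the native RocksDB library once
  private val db = RocksDB.open(new Options().setCreateIfMissing(true), path)

  def put(key: Array[Byte], value: Array[Byte]): Unit = db.put(key, value)
  def get(key: Array[Byte]): Option[Array[Byte]]      = Option(db.get(key))  // null means absent
  def close(): Unit = db.close()
}
```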
So, at a high level, those are the various primitives and the support you get from Suuchi. Now, all this is fine; how do we get started? I spoke about the communication layer but didn't go into it very deeply: we use gRPC for communication. How many of you are aware of gRPC? gRPC uses protobuf underneath; you define your messages in protobuf, you write your service specs in protobuf, and you get the generated stubs. All you need to do as an application developer is fill in those stubs with whatever business logic you need and then wire them into Suuchi's server abstraction. The server abstraction comes up with the default routing strategy and the replication strategy, and your services are wired up, so when you bootstrap your service you know who the peers are and what services, or functionality, you support. The server abstraction basically wraps up your membership, your router, and your replication; that's pretty much it, and each of these is again pluggable, so you can implement your own version of all of them. So this is what we came up with, and with this set of abstractions we were able to accomplish a bunch of things; at least a couple of systems are running in production on top of them, and we have found it flexible enough to work with. At Indix we have two systems in production. One is an HTML archival system, which is responsible for doing some compute on top of the HTML we receive, indexing it internally, serving series-based queries over it, and serving pointed lookups. This is a write-heavy system; every payload is close to 300 to 400 KB uncompressed (compressed, it's different). That's the write payload it takes, and we have a five to seven node cluster, if I remember right, handling close to terabytes of data and hundreds of transactions per second just on writes. On the read side we have built generic MapReduce adapters, which we use to scan from this index, and we have also built a real-time stats aggregation system on top of Suuchi. What this meant is that we were able to push stats live and get the rollups done in this system. My favourite part of the implementation has been the stats piece, because it captures the essence of doing compute as well as storing data locally, along with distributed processing for your stats. We use the scatter-gather primitives wherever we need to within Suuchi to fan the compute out across multiple nodes, crunch the numbers, get localized results, and the node that originally received the request does one more round of processing and pushes the result out. We have set this system up in production, and I think two teams at Indix are pushing stats into this infrastructure right now. We haven't tested it exhaustively to understand whether it can handle millions of events or not, but at whatever scale we run, roughly 1,500 million events, we are able to handle it, so I have some idea of how much it can take. The third system, which is being built and which we are testing out in a smaller environment, is a real-time scheduler that involves rescheduling URLs dynamically; it needs data-local operations plus data storage plus a global view of the index. This is work in progress, probably a couple of months from being in production. So that's pretty much what Suuchi provides you. I'll now show you some code that will help you understand how simple it is to get started with a distributed service. This is the proto definition I spoke about. Are you able to see it in the back? Is it visible now? This is basically the service definition: you define your service as a protobuf service spec. I define the read service, which takes a get request and returns a get response, and you define what the get request looks like. This is just an example of a simple distributed key-value store, and the basic operations supported are get and put. Now if I go to the service implementation, the read service, you get this: the implementation of get gives you a response observer, which is the gRPC abstraction for the stream the request came in on, so you take the get request message you received and use the stream response instance to send back a response once you are done with your local processing. Similarly for the put request; in that case we are actually storing the data into a data store. And the cool part is that with just these two read and write service definitions, you have basically built a distributed server. I create the server abstraction with a Netty server, which is basically the transport; I define the routing for the read service using a routing strategy, in this case just a hash ring; I say I need parallel replication; and we have also added a scan service, just as a demo. Essentially, for every service you can define your router, and requests flow through it. That's pretty much how you get a basic distributed service going with Suuchi.
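To give a feel for that walkthrough, here is a hedged sketch of what the pieces could look like. I'm assuming grpc-java style stubs generated from the proto described above (ReadServiceGrpc, WriteServiceGrpc, GetRequest, GetResponse, PutRequest, PutResponse, with bytes key/value fields, all assumptions on my part), and the wiring at the bottom uses plain grpc-java's ServerBuilder together with the HashRing and LocalStore sketches from earlier, not Suuchi's actual server abstraction:

```scala
import com.google.protobuf.ByteString
import io.grpc.ServerBuilder
import io.grpc.stub.StreamObserver

// Read side: answer a get from the node-local store and push the reply onto the stream.
class ReadServiceImpl(store: LocalStore) extends ReadServiceGrpc.ReadServiceImplBase {
  override def get(req: GetRequest, obs: StreamObserver[GetResponse]): Unit = {
    val value = store.get(req.getKey.toByteArray).getOrElse(Array.emptyByteArray)
    obs.onNext(GetResponse.newBuilder().setValue(ByteString.copyFrom(value)).build())
    obs.onCompleted()
  }
}

// Write side: persist into the local store; replication to peers would sit around this call.
class WriteServiceImpl(store: LocalStore) extends WriteServiceGrpc.WriteServiceImplBase {
  override def put(req: PutRequest, obs: StreamObserver[PutResponse]): Unit = {
    store.put(req.getKey.toByteArray, req.getValue.toByteArray)
    obs.onNext(PutResponse.newBuilder().setSuccess(true).build())
    obs.onCompleted()
  }
}

// Wiring on one node: a gRPC (Netty) transport, a hash-ring router over a static
// member list, and the two services backed by the embedded store.
object NodeMain extends App {
  val members = Seq("node1:5051", "node2:5051", "node3:5051")
  val ring    = new HashRing(members)                    // routing strategy
  val store   = new LocalStore("/tmp/suuchi-node1")      // embedded KV

  val server = ServerBuilder.forPort(5051)
    .addService(new ReadServiceImpl(store))
    .addService(new WriteServiceImpl(store))
    .build()
    .start()

  // A real node would consult ring.nodeFor(key) to handle-or-forward each request,
  // and ring.replicasFor(key, 2) to replicate writes before acknowledging them.
  server.awaitTermination()
}
```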
Any questions?

We have about two minutes for questions, so there's probably time for one. If someone has a question, please raise your hand and we'll get you a microphone. Do I see a hand? I see a hand right up here; here's the microphone.

I can relate to some of the capabilities; you have obviously done a good amount of deep work on this. Today, when I look at it, I would place it similar to ZooKeeper as a set of capabilities, on top of which somebody went and defined a set of recipes you can build on. What I think would help Suuchi is to define a bunch of recipes, so that someone trying to build distributed infrastructure can see what kinds of systems this would actually be a good fit for. You don't have to answer that now, but I'm finding it hard to figure out: obviously you have used it very effectively at Indix, but outside of Indix, if someone were to use it, what would that be, or for what class of problems? Maybe those recipes would help.

We have actually added some documentation on that. The distributed key-value store is a very minimal example of what you could do. We could go deeper and talk about the various scenarios or use cases in the real world where you could apply or build on Suuchi. One of them is a distributed key-value store; another is a distributed stats aggregation system; if you had a requirement like that, you could probably use Suuchi. So we have put up a couple of recipes, and we will definitely add more and see if that guides a wider audience towards it.

Thank you, Sriram, for the interesting talk and for creating Suuchi. I'd like to make one final call for flash talks. If you're interested in giving a flash talk today, a 5-minute talk, they start at 5:10, and you need to give your name, phone number, and the topic of your talk to this man. If you don't want to do that right now, also, if you did not already turn in the . . . . .