Hello, hi — so welcome to the last talk. I am Umesh, and I have a decade of experience in search. I was the second employee at a startup for about 3.5 years, then I was at Flipkart for 4 or 5 years on the search platform, moved to Lucidworks for 6 months, and now I am consulting for Unbxd three days a week. This is an engineering talk: I will demystify Lucene, I will demystify Solr, and I will tell you where they work and where they do not. I have been a Lucene hacker for a while — the slides say Unbxd, but all the work described here was done at Flipkart, and we had a great experience. About me: I have been using Lucene since version 2.1, and I hacked on it up to 4.6. For the recent versions I have looked at the JIRAs and followed what is happening, but I have not really hacked on them. As a user and hacker I have built middleware, a schema designer, a whole bunch of things; I have contributed patches, and I specialize in unusual use cases. The agenda: this is about e-commerce search, and very specifically about real real-time search and building a real real-time index. I will mention the pipeline, but I am not going to pay much attention to it — the index is the topic I will discuss in depth. And I am going to get into data structures: how to build your own inverted index and integrate it with Lucene. So it is a first-principles approach, and hopefully we will have time for questions. Now, this is a product listing page: you run a search over an index of some 31 million documents; there is a parent SKU, and because it is a marketplace, there are multiple listings for it. One thing I want you to pay attention to: the top positions are at a premium. Only about 6 products fit on the user's screen — I took this screenshot specifically for that reason — which means if I show an out-of-stock product there, I am losing real estate, and that is a decline in revenue. Nobody likes that.
So, this is Big Billion Day 2014; the problems we faced on that day led us to build the system I am going to describe. A 24-hour sale with a target of 100 million, achieved in 10 hours — that was the catch. There was a famous flash deal, and I was one of the guys, by the way, who could not buy it. There was a lot of analysis afterwards — the links are on the slide — but the real issues were price changes and out-of-stock products being shown. It happened to a whole bunch of people: offers had expired, it spoiled the whole user experience, and there was a lot of backlash on social media. This is what was actually happening that day: offers which had expired were still showing up. We actually still had products in stock further down the page — but as I said, it is the top products that matter. So the impression was that all the products were gone. We took a step back and said: let us do engineering 101 — what is the real deal here? Because when something like this happens, you should know your system: how is this happening, and at what layer? This is what was happening. On a normal day, ranking is based on user intent and on how good the product is. For user intent you have implicit signals — historic behavior, session behavior — and explicit ones, like the query and the filters that have been selected. On the product side, you mostly have how good the product is: ratings and reviews and the other catalog attributes. But on a sale day, how do you influence user behavior? When you influence user behavior, you influence it through price and discounts.
I bought a TV myself — 16,000 rupees. I had been postponing that purchase for two years, and it was such a great deal. So: instant gratification — price, discounts, offers — and the delivery experience. These become the key things in sale-day ranking, and if you want to get that right, you have to reduce the data lag. What is the data lag? From the source of truth, data flows into the search index, and from the search index, results get served to the front end; you have a lot of lag at both ends — from source of truth to search index, and at the front end. You have to reduce that lag so you can give the customer a consistent experience. Now, what are the sources of truth? This is a product page: you can see products and listings — this is an iPhone 6, and there are multiple sellers for it. Where does the data come from? The catalog service, and then all these other sources: availability comes from one service, then the seller rating service, the promise service, offers, the pricing service. Everything you see on the page — all the real-time attributes — comes from a different service. On the product page, all of those services get called per request; the search index cannot do that. Now, the data pipeline: from the source of truth you have to send data to the search index, and this is a pipe, not a per-request call — I am deliberately drawing it as a big pipe. Streaming updates via push, Kafka plus Storm — that is the standard architecture — then caches, then the front end. The search index sits behind a cache layer before the front end. This is a standard Lambda architecture; there are a lot of talks about it, so I am not going to dwell on it. The focus is the search index.
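To make the "big pipe" concrete, here is a minimal, illustrative sketch (toy code, not Flipkart's actual pipeline): a plain Python list stands in for the Kafka topic, and each message is a partial update that gets merged into an in-memory serving store. The field names and SKU id are invented for illustration.

```python
# Toy sketch of a streaming-update consumer: a list stands in for the Kafka
# topic; apply_update patches only the fields present in each message.

def apply_update(store, update):
    """Merge a partial update (e.g. a price change) into the serving store."""
    doc = store.setdefault(update["sku"], {})
    doc.update({k: v for k, v in update.items() if k != "sku"})

def consume(topic, store):
    for update in topic:          # with real Kafka this would be a poll loop
        apply_update(store, update)

store = {}
topic = [
    {"sku": "IPHONE6-BLK", "price": 31999, "in_stock": True},
    {"sku": "IPHONE6-BLK", "price": 29999},        # price drop: partial update
    {"sku": "IPHONE6-BLK", "in_stock": False},     # sold out
]
consume(topic, store)
print(store["IPHONE6-BLK"])   # {'price': 29999, 'in_stock': False}
```

The point of the push model is visible even in the toy: updates arrive as small per-field patches, never as per-request calls back to the source services.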
Lucene. We say Solr, Elasticsearch — at the end of the day it is just Lucene, and Lucene is just a plain inverted index, with a lot of other things around it. We know the Hadoop ecosystem pretty well: HDFS is the storage system, and HBase, MapReduce and the rest fit on top of it. In Lucene, all of those layers are there too, but written inside one library. For storage you have Directory implementations — a RAM directory, an HDFS directory, mmap — and for data serialization you have pluggable codecs; this came in Lucene 4.x. So the postings lists have a codec, and similarly you have column-oriented fields, which are doc values; then you have stored fields, term vectors, and a whole bunch of other formats. This is the reality: these pieces are there, and they are pluggable. There is a whole history of why this came in 4.x, but mainly it was because people were using Lucene for much more than plain text search. Now, I have listed a whole bunch of Lucene resources that will be very useful for somebody who wants to demystify Lucene and understand it from first principles. Doug Cutting was of course the creator — he created Lucene, and later Nutch. He has a paper, "Space Optimizations for Total Ranking"; Lucene is based on that paper. Galene is LinkedIn's search architecture; it is similar to what we built — different, of course, but a comparable use case — and that talk demystifies a lot of Lucene. Then there is Earlybird, by Michael Busch: Twitter's real-time search engine. That is what I was targeting while we were building this. Then you have books and resources; on the blog side, Mike McCandless's blog is very good. And if you really want to look inside Lucene, the best tool is Luke.
I have used Luke a lot to learn how Lucene's indexing and analysis work. I am going to skip this slide. So, the final takeaway I want everybody to take from here: we built a custom NRT index and plugged it into a Lucene-based index. Essentially, what we built is a split index — a base index plus a real-time store — and I am going to show you how. It integrates with the Lucene index through callbacks, and it has better liveness than SolrCloud NRT. Why not just call the sources of truth at ranking time — make the service calls and get the availability? The thing is, hundreds of thousands of SKUs can match a single query. I cannot make that many calls to the promise engine, or any of the other services, just to rank. These would be tight-loop calls; the network latency would kill the downstream services. You have to have all this data streamed to you in advance. A brief word on SolrCloud: it is ID-based sharding. In Lucene, partial update is not supported; what you have is update = delete plus add. And we used the block-join index for nested (parent-child) documents, which means that to update anything you basically rewrite the whole block — and that adds complication. What happens in SolrCloud is that you send the document and it gets streamed to all replicas, which means you have contention between indexing and search; they do not really work well together, so I am not going to discuss it much more. Independent field updates are not supported, and there is still effectively a single point of failure or bottleneck. It is not that SolrCloud is a bad tool — but if you have high QPS and lots of replicas, indexing is going to be very slow.
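The "update is delete plus add, and block join makes it worse" point can be shown with a toy model (an assumption-laden sketch, not Lucene's API): blocks are immutable, the parent SKU and its child listings live in one block, so changing a single listing's price forces a rewrite of every document in the block.

```python
# Toy model of Lucene's update cost: blocks are immutable, an "update" is
# delete-plus-add, and a block holds the parent SKU plus all child listings.

index = []  # list of immutable blocks; each block = (parent, tuple of listings)

def add_block(parent, listings):
    index.append((parent, tuple(listings)))

def update_listing_price(sku_id, listing_id, new_price):
    """Delete the old block and re-add it with one field changed."""
    for i, (parent, listings) in enumerate(index):
        if parent["id"] == sku_id:
            del index[i]                              # the "delete" half
            new_listings = [
                {**l, "price": new_price} if l["id"] == listing_id else l
                for l in listings
            ]
            add_block(parent, new_listings)           # the "add" half
            return len(new_listings) + 1              # docs rewritten for 1 field
    raise KeyError(sku_id)

add_block({"id": "iphone6"}, [{"id": "l1", "price": 31999},
                              {"id": "l2", "price": 32500}])
rewritten = update_listing_price("iphone6", "l1", 29999)
print(rewritten)  # 3 -- three documents rewritten to change one price
```

With 10 million price updates an hour, multiplying every one-field change by the block size is exactly the amplification the talk is complaining about.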
And there are talks, very good papers, and published use cases on all of this. So, e-commerce marketplace: what are its special data characteristics? You have the product SKU, which is the parent document, and you have the listing, the child document, which basically takes all the attributes of the parent document and adds some more. The query is mostly over SKU attributes — brand, color, title, and so on. Filters can be SKU attributes plus listing attributes — price, seller rating, delivery estimates. And ranking uses SKU plus listing attributes; the listing attributes do a lot of the work in ranking, especially on a sale day. Now look at the update rates. The red ones on the slide are the attributes that get very high update rates at peak time: pricing gets something like 10 million updates in an hour, and offers also get about 10 million updates in an hour, because the whole catalog changes — it is just a whole bunch of tags. Combining all of these into a single pipeline really does not work. The bottleneck is your document build, where you gather data from all of these services, try to create one mega document, and send it to Lucene — the document builder becomes the bottleneck. And what happens inside Lucene? Segment merges. Lucene creates lots of intermediate segments, and merging them kills the performance. I am skipping these slides. So, what did we try? First, a basic fact about a Lucene segment: it is a standalone index by itself, with all the complete data structures, and it is immutable — which means once it is written, you can take the mappings out and try to manipulate them yourself. That is what we tried.
Take the segment mappings out: in the segment, what is the primary key — the external key — and what is the internal document ID? You hook a callback, take this mapping out, and try to manipulate it outside. We also tried Lucene codecs — the pluggable encoder/decoder that Lucene supports — and we tried backing them with Redis. It did not work; it had throughput issues. So what I have summarized here is about one year of work: three approaches. The standard codec route did not work — that was not something we could ship. The first approach — we prototyped it, watched it for three or four months, benchmarked it — and it did not work either. The final one is what I mentioned. So we sat down together. Until this point, by the way, we had been trying a fully consistent approach — consistency between the inverted index, ranking, filtering, everything. It did not work. So we went back to the drawing board and looked again at the e-commerce marketplace: what are the critical factors here? Seller marketplace, query, SKU attributes. Whatever is in green on this slide are the slowly-changing things; whatever is in red are the very fast-changing things. So we said: let us just split it. This is what Twitter has done, by the way — and not just Twitter, LinkedIn as well; of course, we only came across that related work at this point. So there is a base index, a normal Lucene index. It gives you text relevance, all the metadata fields, and all the non-real-time signals — product ratings, everything.
It has a low rate of change, and whenever you want to update it, you go and reindex the whole document. The NRT store is the part we built. The requirements for it: it must have streaming updates, and those updates must become visible on the serving nodes while they are serving. It has to be commit-less — there cannot be a commit delay, because any time you do a commit, there is going to be a delay. Remove single points of failure and bottlenecks. Support burst updates, because at peak time you have to absorb something like 120K updates per second — and we did get about 120K; we will talk about that in a second. And it has to be optimized for ranking; we were willing to sacrifice elsewhere. Some numbers: end-to-end lag of less than 10 seconds, on the order of 100K updates per second, and hundreds of independent sources or signals. Commit-less — why? Because Solr has internal caches, and every commit invalidates them, so your queries miss the cache and there is a lot of garbage creation. It does not scale: if you want very good performance, your caches have to keep working; otherwise there is a whole bunch of overhead. Now, this is the real talk — everything up to here was building context. Let us get to building an NRT store. When you are talking about an inverted index, what are the things a search index must have? Two things. One, a forward index, which is basically a columnar list of values; it answers "for this key, or this document number, give me the value". Two, an inverted index: "for this term, give me the full list of all matching documents". And what really happens is that you get the full forward document, and from there you build the inverted index.
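The two structures just named can be sketched in a few lines (toy data and field names, invented for illustration): the forward index is a column per field keyed by doc id, and the inverted index is produced by inverting those columns.

```python
# Toy forward index ("doc -> value") and inverted index ("term -> docs"),
# with the inverted side built by inverting the forward side.

docs = {0: {"brand": "apple",   "in_stock": True},
        1: {"brand": "samsung", "in_stock": False},
        2: {"brand": "apple",   "in_stock": False}}

# Forward index: per field, a column mapping doc id -> value.
forward = {field: {d: v[field] for d, v in docs.items()}
           for field in ("brand", "in_stock")}

# Inverted index: (field, term) -> set of matching doc ids.
inverted = {}
for field, column in forward.items():
    for doc_id, value in column.items():
        inverted.setdefault((field, value), set()).add(doc_id)

print(forward["brand"][2])                   # 'apple'
print(sorted(inverted[("brand", "apple")]))  # [0, 2]
```

Ranking wants the forward shape (per-doc values in a tight loop); filtering wants the inverted shape (all docs for a term) — which is why the talk treats them as two separate stores.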
That is what really happens — that is what every search engine does in its indexing chain. Lucene does it, and scales it well; there is a whole bunch of machinery there. So, what is the forward index? It is just columnar storage, optimized for very high update rates. All the data structures here are memory-resident; we basically used arrays, fixed bit sets, and sparse bit sets — I am going to talk about them. There is a whole body of prior work here: C-Store was the first columnar store, and Lucene supports doc values, which are basically like mini-ORC files, if you are familiar with the Hadoop work. Considerations for lookup: at the 50th percentile we have about 10K matches per query; at the 99th percentile, 1 million matches. Everything has to live on the Java heap, which means all the data structures have to be memory-efficient. This is the API: get — given a key and a field, give me the value, where the value can be numeric, float, double, string, whatever. And it gets called in tight loops. The callback goes: for each matching document, for each field that participates in scoring, give me the value. So if you have 1 million matching documents and, say, 20 fields, that is 20 million value lookups — actually, we have much more than that. And then you apply a whole bunch of possibly complicated functions on top; any mathematical function is supported. So it has to be very, very lookup-efficient. The naive implementation: you just want a value given a string key, so use a hash map. You have a product ID, say product A, and a field like availability, true or false.
Then your lookup goes like this: you get a callback with the internal Lucene document ID and a request like "field price, give me the value". You go to the Lucene segment's term dictionary and ask: what is the primary key for this document? Say it answers product 3. Then you look in the hash map: product 3, field price, what is the price? You get the answer. What is the cost? About 10 seconds for 1 million lookups — definitely unacceptable for a single query, because we have to respond within a couple of hundred milliseconds, maybe less. So what is the bottleneck? We profiled it with JFR and benchmarked it, and the thing is: Lucene is giving us an integer ordinal, and we are converting it to a string. That conversion is not efficient. Then in the application layer we take that string, compute its hash code, and go from the hash code to a bucket — and it is a very large hash map shared across requests, so there is not going to be any cache locality. The whole path — integer to string, string to hash code, hash code to bucket — is inefficient. So we said: just remove it. We removed it and came up with the second solution: make the lookup ordinal-based too. We map the primary key to an integer ordinal — this is the NRT dictionary.
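The naive path can be written out as a toy (stand-in data structures, not Lucene's): an array plays the segment's term dictionary, and a string-keyed dict plays the big hash map. Every call pays the int-to-string hop plus a string hash.

```python
# Toy version of the naive lookup: int doc id -> string primary key -> hash map.
# The two hops marked below are the bottleneck the profiler found.

segment_primary_keys = ["product_a", "product_b", "product_c", "product_d"]
price_by_key = {"product_a": 100, "product_b": 250,
                "product_c": 75,  "product_d": 310}

def naive_get(doc_id, field_store):
    key = segment_primary_keys[doc_id]   # hop 1: int -> string
    return field_store[key]              # hop 2: string hash + bucket probe

# The scoring callback runs this in a tight loop:
# every matching doc x every scoring field.
total = sum(naive_get(d, price_by_key) for d in range(len(segment_primary_keys)))
print(naive_get(2, price_by_key))   # 75
print(total)                        # 735
```

Multiply those two hops by 20 million lookups per query and the 10-second figure stops being surprising.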
On the update path, you find the ordinal and update the values for that ordinal; it just works, because Lucene already works this way internally. On the lookup path, Lucene has an internal document ID, and I maintain a mapping: what does that internal document ID, for a particular segment, map to in the NRT store? The thing to understand is that inside a Lucene segment, the ordinal is going to change — there is nothing global about it; it is internal to that segment, and Lucene actually recommends you do not use it outside. So the segment-internal ID changes, but for us, the ordinal in our store does not change: whatever ID a product gets, it stays constant, which makes updates easy. So we keep a mapping from doc ID to NRT ID. For example, product B is document 0 in the Lucene segment but ordinal 3 in our store, so 0 maps to 3. At callback time, we just go through this doc-ID-to-NRT-ID mapping, get the ordinal, and pick up the value. How much is the performance? It is actually about 100 milliseconds. Now, the data structures. I am comparing with a Lucene index: in Lucene you have the term dictionary, the postings lists — which are effectively sparse bit sets — and doc values. In our case, everything is keyed by these integer ordinals.
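Here is a minimal sketch of that ordinal scheme (my simplification of the talk's NRT dictionary; the product names and numbers are illustrative): values live in flat arrays indexed by a stable NRT ordinal, and each segment carries an int array mapping its doc ids to those ordinals. The hot path is two array reads — no strings, no hashing.

```python
# Toy ordinal-based forward store: stable NRT ordinals + a per-segment
# docid -> ordinal array. Lookup is two array indexes; update is one write.

nrt_ordinal = {"product_a": 0, "product_b": 1, "product_c": 2}  # NRT dictionary
price = [100, 250, 75]          # forward store: one value per NRT ordinal

# Per-segment mapping, rebuilt when a segment (re)opens. Segment doc ids may
# change across merges; NRT ordinals never do.
segment_doc_to_nrt = [1, 2, 0]  # segment doc 0 is product_b, doc 1 product_c, ...

def get_price(segment_doc_id):            # lookup path (the scoring callback)
    return price[segment_doc_to_nrt[segment_doc_id]]

def update_price(product_id, new_price):  # update path: dictionary + one write
    price[nrt_ordinal[product_id]] = new_price

print(get_price(0))     # 250  (segment doc 0 is product_b)
update_price("product_b", 199)
print(get_price(0))     # 199
```

The design choice is that the expensive string-to-int resolution happens once, on the update path and at segment-open time, instead of once per lookup in the scoring loop.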
So, basically: if it is a Boolean field — say availability, or whether a product is live — it is a fixed bit set. A bit set is very fast to update and fast to look up, and it is randomly updatable and searchable. For an enumerated field with low cardinality, it is a list of fixed bit sets. For a high-cardinality field it is an array; numeric fields, similarly, are arrays. For a tag field, you maintain a dictionary, and per tag you maintain a sparse bit set. This slide should have come earlier: document ID 3, field price — look up the NRT ID for 3, it says 2, then look up the price for ordinal 2. So this is about 100 ms against 10 s — a 100x difference in performance, just by using the right data structures and understanding your use case. And that allowed us to build a whole bunch of things. Now, the inverted index. The requirement is: you have one term, and you have to give back the whole list of matching documents. Here we could not keep it consistent. Maintaining — modifying — an inverted index in place is hard, because a posting list is a sparse bit set, and sparse bit sets will occupy a whole bunch of memory if you store them as a plain array or a dense bit set. Lucene does a whole bunch of tricks — run-length encoding and other storage-efficiency techniques — to reduce the space and make it faster, but the result is that you cannot update it in place. So what we do is invert periodically, and there is a lag. What you have here is a Lucene segment with its internal doc IDs mapped to products, and alongside it the NRT forward store.
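The per-field-type choices above can be sketched with toy classes (my assumptions on sizing and naming; Lucene's own `FixedBitSet` is a long-array variant of the same idea): a boolean field as a bit set, a numeric field as a flat array, and a tag field as a dictionary of tag to sparse doc-id set.

```python
# Toy versions of the three store shapes: bit set (boolean field),
# flat array (numeric field), dictionary of sparse sets (tag field).

class ToyFixedBitSet:
    """Boolean field, e.g. availability: one bit per NRT ordinal."""
    def __init__(self, size):
        self.bits = bytearray((size + 7) // 8)
    def set(self, i, value):
        if value:
            self.bits[i >> 3] |= 1 << (i & 7)
        else:
            self.bits[i >> 3] &= 0xFF ^ (1 << (i & 7))
    def get(self, i):
        return bool(self.bits[i >> 3] & (1 << (i & 7)))

NUM_DOCS = 1000
available = ToyFixedBitSet(NUM_DOCS)   # boolean field: 125 bytes for 1000 docs
price = [0.0] * NUM_DOCS               # numeric field: primitive array
tags = {}                              # tag field: tag -> sparse set of ordinals

available.set(42, True)
price[42] = 129.0
tags.setdefault("deal_of_the_day", set()).add(42)

print(available.get(42), price[42])    # True 129.0
print(42 in tags["deal_of_the_day"])   # True
```

Each shape trades generality for O(1) random update and lookup by ordinal, which is exactly what the streaming-update requirement demands.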
What you do is take the field — say availability is true — and compute, say, that documents 0 and 3 match. We do this periodically, and then we use something called an NRT filter. All of this is basically callbacks in Lucene: you provide an implementation which reads from your data — reads the forward store, creates this posting set — and then you plug that in as your custom filter class. So, the final solution: we use Lucene extension points, we integrate a custom inverted index, and we accept eventual consistency between the replicas — all replicas read from the stream independently, so there is no consistency between them for short windows of time. The reason we did it: we wanted better indexing throughput, and we wanted consistent latency. These are the Solr integration points: value sources; for filtering, a custom filter; for queries, a custom post-filter. And of course we had to write a whole bunch of custom components — faceting, collectors, and all of that. I have kept this slide for people who are familiar with Solr; it actually worked out. This is the whole architecture. You have an ingestion pipeline, and all the services — catalog, price, availability, offers — feed into Kafka. The catalog feeds through the ingestion pipeline into Lucene updates and goes to the Solr master. Then some of these — pricing, availability, offers, seller quality — feed into Kafka over here, and that goes through the NRT update path. And then you have the NRT forward store and the NRT inverted store.
And then you have all the other components — ranking, presentation — which plug into this NRT forward store. On the Solr master, the commit-plus-replicate-plus-reopen cycle still runs. And everything over here sits inside a single replica: all of this that you see runs inside one Solr replica. So, this was the experience: this is last year, 2016, and I am sure there were no out-of-stock products on top — people could tell. The accomplishments: real-time sorting; real-time filtering using post-filters, at a higher latency; near-real-time filtering, where you periodically take the forward index and invert it. There is no strict consistency between lookup and filtering — if you select "exclude out of stock", it may still briefly show an out-of-stock product. And it is all independent of Lucene commits, which is a very big win, with query latency comparable to before. Around 150 signals — we are using this for every one of them. Out-of-stock products shown went down by 2x. And 50K updates per second is what we saw on Big Billion Day; we had benchmarked for more. I am sorry — usually this is a longer talk. Any questions? There is a question down here.
[Audience] Hi — can you hear me? You said that you went commit-less on Solr, right? So how do you deal with system crashes, when a node goes down? You are not committing at all?
[Umesh] No — see, there are a thousand replicas, and the data is in Kafka. You can just replay from Kafka.
[Audience] So the entire cluster doesn't go down — it is one box that has an issue. But in that case, how many nodes are there — say, 100 nodes?
[Umesh] Yes. The source of truth is Kafka.
[Umesh] The replicas just read; and the snapshot is in Redis. So you read from Redis, build this inverted structure, build the data structures, and then replay Kafka. We would have trouble if Redis goes down while you are bootstrapping, say, 20 or 30 nodes — everybody hitting Redis at the same time. That is effectively a cluster restart, and you never do a full cluster restart of 100 nodes; it is always a rolling restart. Next question, please.
[Audience] Another thing — the NRT portion is the commit-less one, but the text part of the index still commits? How do you really avoid commits?
[Umesh] Right — the text fields are still committed; only the fast-changing part is real-time. For the committed fields you have commits, backups, the stored fields — the whole DR story. The pricing and the other very fast-changing fields live in the in-memory index, which is not committed: it is generated from Kafka — the source of truth is Kafka — with a snapshot in Redis, so you can regenerate it very fast, maybe 10-15 minutes.
[Audience] So the sparse bit sets do not take much memory?
[Umesh] One example: by mistake we had enabled a dense bit set — 2 GB, versus 22 MB. That is the kind of space saving sparse bit sets give you. And we wrote the encoding ourselves.
[Audience] Will you open-source it?
[Umesh] Yes, I am planning to put it on GitHub, tutorial-style — "this is how you build an inverted index" — under Umesh Prasad.
[Host] Are there any final questions? No other questions? Alright, thank you very much.
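The recovery story from the Q&A — snapshot plus log replay instead of commits — fits in a few lines (toy stand-ins: a dict for the Redis snapshot, a list for the Kafka topic; the offset bookkeeping is my simplification).

```python
# Toy commit-less recovery: load the last snapshot, then replay only the
# updates that arrived after the snapshot's offset.

kafka_log = [("p1", {"price": 100}), ("p2", {"price": 200}),
             ("p1", {"price": 90}),  ("p2", {"in_stock": False})]

snapshot = {"offset": 2,   # log position the snapshot covers
            "store": {"p1": {"price": 100}, "p2": {"price": 200}}}

def recover(snapshot, log):
    store = {k: dict(v) for k, v in snapshot["store"].items()}
    for key, patch in log[snapshot["offset"]:]:     # replay the tail only
        store.setdefault(key, {}).update(patch)
    return store

store = recover(snapshot, kafka_log)
print(store["p1"]["price"])        # 90
print(store["p2"]["in_stock"])     # False
```

Durability comes from Kafka retention plus the snapshot, so no per-update fsync or Lucene commit is ever on the write path — which is what makes the 10-15 minute cold-start figure plausible.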