He has almost eight years of experience in search, information retrieval, and recommendation engines. Anshum has been part of the core search teams at Naukri.com and Cleartrip. He's now back in the open-source world and is currently working with Lucidworks, the leading developer of search, discovery, and analytics software based on Apache Lucene and Apache Solr technology. Please welcome Anshum.

Hi, guys. So this is who I am. I'm Anshum Gupta, and I've been doing search and related work for almost eight years now. Currently I work for Lucidworks, which is pretty much the primary backer of Apache Solr and its community. Before this I was with AWS, helping launch a service called CloudSearch, and these are the other places I worked at prior to joining CloudSearch.

I'm pretty sure people have been talking all day about big data, how data is growing and hardware is getting cheaper, so pretty much all of you know about that. So where's the real value when it comes to big data? It's not just the storage of data, which was the case until probably a few years ago. The real value now is in processing and storing all of that data, as well as being able to search through it. And if you look at how search has changed over the years: earlier it was an expensive solution, really complicated, not accessible to most of the people who wanted to implement search, and very few people could do it. Now things are getting affordable. There are open-source solutions which are easy to implement and work out of the box, and you can still make them as complicated as you want. I'm going to throw out a lot of points now and try to tie them all together as the talk goes on.

This is what Wikipedia says about NoSQL databases: something that helps you store data but doesn't look exactly like a traditional relational database, and is meant to scale out and support distributed operation. So it's not traditional, it wasn't designed around SQL, and it may not give you ACID guarantees; you trade off the ACID guarantees to get a high level of scalability. And with scalability comes distribution, and you want the system to be fault-tolerant at the same time.

These are the most recent database rankings; I put this in yesterday. It's from July, from a site called db-engines.com. If you look at the rankings, the top five spots still go to relational DBMSes, probably because these are legacy systems that a lot of people run and it's tough to move away from. The interesting thing to note is that at number seven is MongoDB, which I'd say is the leader among NoSQL data stores as of now. And then there's an interesting entry, one of only a couple of search engines in the database ranking: Solr, at number eleven. There's a gap there, but it's still coverable, I guess. This is what the search engine rankings on the same site look like, and there's Solr right at the top, with a massive gap to Elasticsearch, which recently overtook Sphinx to get to number two. That's how the other solutions look, so Solr is pretty much up there when it comes to search engine rankings.
Let me talk about other NoSQL data stores very briefly. MongoDB uses a binary format, BSON; its distribution model is sharded master-slave with async replication, and it uses a database-level write lock when it comes to maintaining consistency. Talking about search in MongoDB: they have full-text search, but it's not really up there, probably because search isn't what they do. They do amazingly well at ingesting, storing, and managing data, but when it comes to searching across that data there's a big gap between a good search player and what MongoDB has to offer. The alternative people generally switch to is hooking Solr, or some other search engine, directly up to their MongoDB instance. So you externally figure out how to keep it consistent with your data store, and you explicitly figure out how to run search on top of MongoDB. It's not native, for sure.

Then there's Cassandra, which is a column-based data store. Its distribution model uses consistent hashing for distributing updates, and for consistency it uses timestamps. Talking about search, there are solutions like Solandra, but if you really look at them, they are again Lucene-based, Solr-based solutions which people have tried to hook into Cassandra.

Another one is Riak, which was built using principles from the Amazon Dynamo paper. Riak Search, as of now, uses two things: merge_index as the backend data store, and the Riak Solr API, which they say gives you Solr-like APIs and Solr-like search capabilities on top of Riak. And then there's Yokozuna, which, as their site says, is the next generation of Riak search that marries Riak with Apache Solr. It sits alongside Riak, so again it's not native.

So what have I said so far? I've spoken about MongoDB, Cassandra, and Riak. Everybody is using a different data model, and they work pretty well for their use cases. They differ in their update-handling capabilities and consistency management, and they seem to be doing fine on those too. They're doing well on storage, but when you look at search, there's really nothing that's native right now: almost everybody is trying to hook up Solr or Lucene, or build something off Lucene, and bolt it onto their NoSQL data store.

So how does adding search to NoSQL look? A lot of people I know are trying to hook Solr up with MongoDB, and I've already spoken about similar solutions for Cassandra. The thing is, these systems weren't designed for search; they were designed for storing a lot of data, and they do that pretty well. But on search they may not do so well, because they're still trying to figure out how to make it work. When you look at adding NoSQL to search, on the other hand, taking a search solution that was built for a lot of data, you get search capability as well as something that handles a lot of data at the same time, right? At least to me that feels more intuitive; it's easier to reason about, and so probably easier to implement than bolting search onto a NoSQL data store.
Again, I'd say there are still no key players in this space; there's no clear-cut winner as of now.

Now, Apache Solr. Apache Solr 4 happened a while ago, and I'd say it's reasonably different from previous versions of Solr. This is what it looks like: it's a document-oriented NoSQL search server; that's how you'll hear a lot of people who deal with Apache Solr describe it. It's data-format agnostic, supporting a lot of different formats. It's got schemaless options, which I'll talk about later. It's distributed for sure, fault-tolerant, and then there are these three things: atomic updates, optimistic concurrency, and near-real-time search, which led to a few key decisions in the design of SolrCloud.

Before I go any further, I'd like to clarify what I mean by SolrCloud. SolrCloud is not a hosted solution. SolrCloud is nothing but the subset of Solr features meant for distributed operation. Things designed to be distributed in nature in Solr are referred to as SolrCloud; it has nothing to do with a hosted service or an integration with AWS or Microsoft Azure or anything of that sort. Beyond that, you get full-text search, highlighting, specialized queries, and the other things you generally get with Solr.

These were the SolrCloud design goals before it was built and released to the world. There was a need for all of this: automatic distributed indexing, high availability of writes, durable writes, near-real-time search, real-time get, and optimistic concurrency. Near-real-time search and real-time get are two different things: a real-time get returns the most recent version of a document by ID even if it isn't searchable yet, whereas near-real-time search is about newly indexed content becoming searchable quickly. There's a small sketch of the difference below.
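Since the distinction trips people up, here is a minimal SolrJ sketch of real-time get, plus an atomic update while we're at it. The host, collection, and field names are hypothetical, and it assumes a stock Solr 4.x configuration where the update log and the /get handler are enabled, as they are in the example configs:

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrInputDocument;

import java.util.Collections;

public class RealTimeGetDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical standalone 4.x node.
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");

        // Index a document but don't commit it yet.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "book-1");
        doc.addField("title_s", "Solr in Action");
        server.add(doc);

        // A normal search won't see the uncommitted document
        // (typically 0 hits, unless a soft commit happened in between)...
        QueryResponse search = server.query(new SolrQuery("id:book-1"));
        System.out.println("search hits: " + search.getResults().getNumFound());

        // ...but a real-time get, served from the transaction log, will.
        SolrQuery rtg = new SolrQuery();
        rtg.setRequestHandler("/get");
        rtg.set("id", "book-1");
        System.out.println("real-time get: " + server.query(rtg).getResponse().get("doc"));

        // Atomic update: send only the change, not the whole document.
        SolrInputDocument partial = new SolrInputDocument();
        partial.addField("id", "book-1");
        partial.addField("title_s", Collections.singletonMap("set", "Solr in Action, 2nd ed."));
        server.add(partial);
        server.commit();
        server.shutdown();
    }
}
```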
Now, about SolrCloud itself. If you really look at SolrCloud, its distributed indexing was designed from the ground up, not borrowed from here or there. From what I've read, when people sat down to design it, they started off from the Amazon Dynamo paper. For those of you who don't know the CAP theorem: it's a result about distributed computing which says that of consistency, availability, and partition tolerance, a distributed system can get at most two of them perfect; you can't have all three. And in reality you have to handle partition tolerance: if a partition happens, you have to be able to handle it. So SolrCloud ended up as a CP system, high on consistency and partition tolerance, which means it may sacrifice availability. A client might see its updates fail for a while, rather than having an update go through and leaving consistency issues to resolve later. To handle that, optimistic concurrency is used, and I'm going to talk about that soon. So we value consistency over availability; MongoDB is probably the closest in architecture if you're comparing against SolrCloud. Having said that, we still do reasonably well on availability. We use Apache ZooKeeper for cluster coordination, and when you have a huge cluster and half of it is unable to talk to the other half, the bigger half is the one that stays active at any given point; updates to the smaller half generally fail, while requests get rerouted to the bigger half and continue to be served, so the cluster stays up.

This might look a little complicated, but it's what a basic SolrCloud setup generally looks like. When I say shard, and anybody who's read about SolrCloud will have seen "shard" and "slice" used pretty much interchangeably, a shard is a logical entity. Say you have a dictionary of words and you want to split it in two: shard one represents terms from A to M, and shard two represents terms from N to Z. Every physical node, every physical instance holding an index with this data, is a replica. It might sound a little confusing, but everything is a replica, and one replica per shard holds the role of leader. In the diagram, the ones in dark blue are the leaders. The leader's job differs from that of its peers only in that it versions incoming data and routes the updates to all of its followers; that's as different as a leader's job gets. Otherwise, every replica indexes its own documents.

Apache ZooKeeper is another project we use; it holds the cluster state, and all these replicas talk to ZooKeeper. ZooKeeper stores all of the configs, the cluster state, and everything to do with the distributed aspects, and nodes go to ZooKeeper to fetch the latest configs and so on. Specifically, it holds things like the nodes in the cluster, the collections, the schema and config for each of them, the shards and replicas, and collection aliases.

Distributed indexing: when you send a document to a SolrCloud cluster, you don't have to worry about where to send it or who the leader is. You can send it to any of the nodes and it will be forwarded to the leader of the shard it belongs to. The document gets hashed to figure out, say, that it belongs to shard one; if you sent it to replica three, and replica one happens to be the leader of that shard, the document is routed to replica one, where it gets versioned and then sent over to replica two as well. That's how updates work. And the way you get high availability is that if replica one goes down, replica two is automatically elected leader. When replica one comes back up, it talks to the current leader of that shard and checks: how far behind am I on updates? How much have I missed? If it realizes a lot has been missed, it does a full index sync; if it's just a few documents, there's no need to pull the entire index, and the missed updates are simply replayed from the current leader to the recovering replica.
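From the client's point of view, all of this routing is invisible. Here is a minimal SolrJ sketch using the ZooKeeper-aware CloudSolrServer; the ZooKeeper addresses and collection name are hypothetical, assuming a running 4.x cluster:

```java
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class CloudIndexingDemo {
    public static void main(String[] args) throws Exception {
        // The client reads the cluster state from ZooKeeper, so it always
        // knows which node leads which shard; no manual routing needed.
        CloudSolrServer server =
                new CloudSolrServer("zkhost1:2181,zkhost2:2181,zkhost3:2181");
        server.setDefaultCollection("collection1");

        // Send the document to the cluster; it is hashed on its id,
        // forwarded to the right shard leader, versioned there, and
        // then replicated to the followers.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-42");
        doc.addField("title_s", "Distributed indexing with SolrCloud");
        server.add(doc);
        server.commit();
        server.shutdown();
    }
}
```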
This is optimistic concurrency, and it's how consistency is managed in SolrCloud. Say you, as a client, want to update a document in Solr. The first thing you do is get the document, and along with it you get a version number. You modify the document, retaining that version number as part of it, and send the new document back to Solr. Solr looks it up, and if the version you sent is still the latest version it has, the update goes through. If that's not the case, meaning that between step one and step three somebody else fetched the document and updated it, so the version on Solr is now different, the update returns a failure and you have to retry the whole thing. Concretely, it returns an HTTP 409 code, which is a conflict. I'll show a small sketch of this flow at the end of this part.

Then there are the different ways of making distributed query requests. You can send a query to the cluster and say "this is my query, distribute it", and SolrCloud does everything for you: it load-balances, picks the right shards, queries them, aggregates the results, and sends them back. Or you can explicitly specify the addresses to query and load-balance over, something like shards=localhost:8983/solr|localhost:7574/solr. Everything separated by a pipe gets load-balanced over, and everything separated by commas gets queried and aggregated. You can also specify logical shards or multiple collections. And as in the sketch above, there's CloudSolrServer in the SolrJ API, which is the better option to use because it's ZooKeeper-aware: it figures out for itself which shard to send a particular update to.

Document routing: since SolrCloud shards your data, what it's really doing is hashing your key and assigning hash ranges to shards. Picture a circular ring of hash values: with four shards, each owns one quarter of the ring. One thing you can use here is the composite ID router. With it, you can have a particular shard co-host all the documents from one group. Say you're indexing books for an e-commerce site and the books fall into categories or vendors: you make the first part of the ID the category or vendor name, say "BigCo", followed by an exclamation mark and then the document ID. The entire thing is the document ID, but only the prefix, "BigCo" in this case, is used to decide where on the ring the document lands, so everything prefixed with "BigCo!" goes to the same shard. Then when a query comes in, you can say: only query the shard that holds these IDs. I'm not going to go into more detail about how routing works; there are some nice write-ups on a few blogs, and I can share those after the talk if you want.
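Here is the sketch of that get/modify/retry loop I mentioned. The field and document names are hypothetical; it assumes the stock `_version_` field and /get handler of a 4.x config, and, for brevity, that the document only carries an id and a title. Depending on the client path, the 409 may surface wrapped in a SolrServerException rather than a bare SolrException:

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrException;
import org.apache.solr.common.SolrInputDocument;

public class OptimisticUpdateDemo {
    // Retry loop for updating one field under optimistic concurrency.
    static void updateTitle(CloudSolrServer server, String id, String newTitle)
            throws Exception {
        while (true) {
            // Step 1: fetch the latest copy of the document, including
            // its _version_, via real-time get.
            SolrQuery rtg = new SolrQuery();
            rtg.setRequestHandler("/get");
            rtg.set("id", id);
            SolrDocument current =
                    (SolrDocument) server.query(rtg).getResponse().get("doc");

            // Step 2: build the modified document, carrying the version along.
            SolrInputDocument updated = new SolrInputDocument();
            updated.addField("id", id);
            updated.addField("title_s", newTitle);
            updated.addField("_version_", current.getFieldValue("_version_"));

            // Step 3: send it back. If somebody else updated the document in
            // the meantime, Solr answers with HTTP 409 (Conflict), and we
            // loop and retry with the fresh version.
            try {
                server.add(updated);
                return;
            } catch (SolrException e) {
                if (e.code() != 409) throw e;
            }
        }
    }
}
```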
So, durable writes. Apache Solr is built on top of Lucene, and Lucene flushes writes to disk only when a commit is called, which means uncommitted updates would be lost if you happened to kill the JVM at random. Solr 4, on the other hand, maintains its own transaction log. When an update comes in, it is first written to the transaction log, which resides on disk, and only then is a positive response passed back, meaning the update has been persisted. The index itself is not updated at that point. The transaction log is there primarily for two things: making your writes durable, and serving real-time gets, which are actually answered out of the transaction log. If something goes down, then when it comes back up it replays the transaction log, which means you won't have lost anything even if you accidentally killed the JVM. That's one half of durable writes. The other half is that in a cluster, your updates get forwarded to the other replicas of the same leader, which means that if the leader goes down, another replica can come up as leader, and you should never lose a document.

Then there's the collections API, a set of API calls that SolrCloud comes with; they're really straightforward mechanisms for collection-wide operations. Create, delete, alias, split shard, delete shard, and reload are pretty much the collections APIs we have right now; I don't think we have anything more than that yet. And that's how easy it is to create a collection: you just make the call with action=CREATE, the name of the collection, the number of shards you want, and a replication factor. I'm not sure the replication factor is persisted as of now, but it's used to figure out which router gets picked for you.

In Solr 4.3, an interesting thing went in. When you start up SolrCloud, you have to specify up front the number of shards to split your hash range over. Until recently, if you wanted to change this number later, you had to recreate the entire collection and re-index all of the data with the new numShards, which means setting it all up over again; there was no way to do it live, just by moving things around. So 4.3 brought shard splitting, a collections API call with action=SPLITSHARD, a collection name, and a shard name. It takes the shard you name and goes ahead and seamlessly splits it in two. As of now that's hard-coded, you can only split into two, so if you want to split further, you call it again on an already-split shard. The good thing about this is that you don't have to re-index anything: you can begin with an estimate that's small enough and grow out of it by splitting the existing shards that are loading up, whether from data volume or request traffic. Both calls look roughly like the sketch below.
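Before getting into how the split works under the hood, here is roughly what those two calls look like from a client. The host, collection, and shard names are hypothetical; the parameter names are the standard 4.x collections API ones:

```java
import java.io.InputStream;
import java.net.URL;

public class CollectionsApiDemo {
    public static void main(String[] args) throws Exception {
        // Create a two-shard collection with two replicas per shard.
        call("http://localhost:8983/solr/admin/collections"
                + "?action=CREATE&name=books&numShards=2&replicationFactor=2");

        // Later, split shard1 of that collection into two sub-shards,
        // without re-indexing anything.
        call("http://localhost:8983/solr/admin/collections"
                + "?action=SPLITSHARD&collection=books&shard=shard1");
    }

    static void call(String url) throws Exception {
        try (InputStream in = new URL(url).openStream()) {
            byte[] buf = new byte[8192];
            for (int n; (n = in.read(buf)) != -1; ) {
                System.out.write(buf, 0, n); // print Solr's response
            }
        }
    }
}
```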
What a split actually does is create sub-shards in a "construction" state. A shard in construction state doesn't receive any requests from the outside world, but as soon as the sub-shards are created, the parent shard they're being constructed from starts forwarding every update it receives to them. In this example, shard2_0 and shard2_1 are the sub-shards that get created from shard2. Initially they're in construction state, and all updates coming to shard2 are forwarded to both of them, where each one maintains a transaction log and buffers everything. Once that buffering has started, the index on shard2 is split and installed into the two sub-shards according to their hash ranges. After that, the buffered transaction log is replayed to bring them in sync with where shard2 is right now. Once that happens, replicas of shard2_0 and shard2_1 are created to match the parent's replication factor, so that they aren't overwhelmed the moment they go active, and then shard2 is automatically marked inactive. It isn't really shut down, but it no longer gets any requests, no queries, no updates; requests start going to shard2_0 and shard2_1 instead. As of now, among the released versions, you don't have a collections API to clean that up; you have to go and remove the parent shard manually, for now at least. So that's what shard splitting looks like. One thing I forgot: if you want to try shard splitting, do not go with 4.3. 4.3 is buggy and you're bound to run into issues; please use 4.3.1.

Solr 4.4 is in the works and should be released pretty soon; I believe the release candidate should be out shortly. Schemaless mode is the key thing in it. Now, "schemaless" doesn't mean no schema at all: when Lucene indexes a document, it needs a schema, so you have to specify what a particular field looks like and be consistent about it. The best bet so far was dynamic fields, which is convention over configuration: you establish a convention for how you name your fields and declare that any field matching a pattern belongs to a particular field type. For instance, you could say anything ending in _i is an integer field, and when a new integer field shows up in your data tomorrow, you just name it fieldname_i and Solr recognizes it as an integer. So you specify the naming conventions in the schema and can then just go ahead and add fields; that's dynamic fields, and there's a small sketch of the convention below. What's coming with 4.4 is, first, that you'll be able to add concrete fields on the fly rather than relying on dynamic fields, so you won't have to stick to a naming convention at all.
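As a small sketch of the dynamic-field convention: the field names below are hypothetical, but the *_s, *_i, and *_d patterns they rely on ship in the stock example schema, so none of these fields needs to be declared explicitly.

```java
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class DynamicFieldsDemo {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server =
                new HttpSolrServer("http://localhost:8983/solr/collection1");

        // None of these fields is declared explicitly in schema.xml.
        // They match dynamic-field patterns from the example schema, e.g.:
        //   <dynamicField name="*_i" type="int"    indexed="true" stored="true"/>
        //   <dynamicField name="*_s" type="string" indexed="true" stored="true"/>
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "book-7");
        doc.addField("category_s", "databases"); // *_s -> string
        doc.addField("pages_i", 304);            // *_i -> integer
        doc.addField("price_d", 29.99);          // *_d -> double
        server.add(doc);
        server.commit();
        server.shutdown();
    }
}
```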
The second thing coming is schema guessing: Solr looks at the JSON input and tries to guess what kind of field each value looks like. With guesswork in place, it certainly won't be optimal: it sees a number and has to decide whether it's an integer or a double, so you might run into optimization issues if you're really finicky about tuning your setup. The other downside is that you'll never catch field-naming errors. With guessing enabled, if your document contains a typo in a field name, you never get an error in response, because Solr goes ahead, guesses a field type, and adds that field. You might end up with ten extra fields purely because of typos in your raw data, which you never intended. So I'm not a big fan of type guessing, but that's what's coming. Also, if the guessing doesn't work out and you realize it later, that probably means you have to re-index that data.

So guys, that's it from me as far as talking about Solr is concerned. We run the Apache Lucene/Solr meetup in Bangalore, and we're already almost 150 people strong; we've had one meetup so far, with a good showing of around 60 people. Feel free to join the group if you're interested in anything search, or Solr in particular. That's the link, and we also have a desk outside, so feel free to come talk to us. We're planning another meetup soon; I'm not sure about the date, but it might be around the 26th of the month. And that's where you can follow me. Any questions?

Q: Did AOL move to SolrCloud?
A: AOL did move to SolrCloud, and as far as I remember they published some numbers on the user list. Nothing concrete from me right now; I don't have the numbers with me.

Q: You've made a pretty good point about using Solr as a NoSQL data store, but how well does it scale when the index becomes very large, say 80 GB or 100 GB?
A: I'd guess you should be fine with an 80-gig setup; I haven't personally tried it out to that extent.

Q: I've seen that it works pretty well while the index is small, but as the index grows, it starts crawling. All my fields are indexed, because if I want Solr as a data store, I want all my fields searchable. Do you have any design suggestions for when the index size becomes very large?
A: One thing is you could optimize your schema, and look at how you're sharding your data. We're also doing some custom sharding work that might come in handy if you want to control where your documents go and how they're sharded.

Q: We can always put in more resources, that's okay, but the thing is...
A: Have you tested it out and found that it doesn't work?

Q: Yes. Compared to a column-oriented database with the same resources, the other database handles 100 GB of data pretty well, but Solr as a NoSQL data store doesn't do nearly as well. I'm comparing Solr as NoSQL with other NoSQL data stores.
A: The thing is, if you really need your data to be searchable, SolrCloud is a good option. Are you able to query the data in your other store by any field, or only by ID and maybe one other field? Because here you're searching on all of the fields, which means you're actually using Solr as a NoSQL store that also supports search; you want the search part, you're not using it as a plain NoSQL store, right? So I really doubt that comparison.
Q: That's what I want to understand: is it designed to handle an index of, say, 100 GB? Can you give me anything generic?
A: We're trying out different shard sizes ourselves; catch me afterwards and give me a minute.

Q: The other thing is that Solr comes with an embedded ZooKeeper...
A: It does come with an embedded ZooKeeper, though in production we'd highly recommend against it; we'd actually say no to it.

Q: Not to use the embedded one?
A: Not to use it. The whole point of SolrCloud is that you have the liberty of a node going down and a new leader getting elected, or a replica going down, whatever. A Solr instance going down is not a problem; your ZooKeeper going down is a problem. So when you embed ZooKeeper inside Solr, you tie the two together, and you don't want to mess with ZooKeeper that way.

Q: Okay, thanks. I have a question. I'm using Elasticsearch, and one of the features it has is rivers: you can, for example, set up a river to tail CouchDB or MongoDB transaction logs. Is there something like that available? Say I want to tail a database transaction log or a file system and automatically index it, instead of running batch jobs.
A: I don't know about tailing directly, but you could probably use DIH, the DataImportHandler, to read that data. Otherwise, I don't think we have streaming support out of the box as of now, if that's exactly what you're asking for.

Q: Is there any third-party plugin available from the community which does the same thing for Solr?
A: I guess DIH should do that for you. One of my colleagues here may be in a better position to answer. [Colleague:] What Elasticsearch offers is basically plugins the community has built for indexing certain kinds of data sources specifically. What Solr has is the DataImportHandler, which can index your databases or XML files, or crawl your file system, and there are plenty of commercial partners who provide support for other things. But the biggest thing about Solr is that, because a lot of people have used it for a lot of very different use cases, there are a lot of plugin points available, so you can build whatever you want: you can either find something or write something and plug it in. But there's no direct streaming support other than DIH, which you could probably use.

Okay, you can find me outside anyway. Thank you.