Now we are moving ahead to the next session, which is on Elasticsearch. We have with us Dhananjay sir; he is a Python developer who designs and builds intelligent and slick software suites at Directi. He is going to speak about Elasticsearch and how he has leveraged it in their massive infrastructure. I hand over the podium to him.

Hello everybody. Yeah, that's a large crowd, so anyway, thank you. Good afternoon, all of you. Today I'll be talking about Elasticsearch. The idea of this talk is to get you acquainted with Elasticsearch. I'm really not an expert on it, but I've been using it for the last one and a half years at my job at Directi, where I'm responsible for designing and building the central operations platform, and we leverage Elasticsearch in multiple ways. I know that when we start introducing a new database, we tend to end up in this slugfest over "my database is better than yours", "I use Mongo", "I use SQL", or whatever. Okay, guys, can we please settle down first, so that we can start? Yeah, cool.

So the first questions that fanboys are going to put out there are: can it do X, Y, and Z? Is it open source? Is it scalable? Is it redundant? Does it support Hadoop and analytics? Is it REST-based? Well, I just thought I'd put that out of the way and say yes, it does most of those things. It comes with Hadoop connectors. It's based on Lucene, which is quite brilliant in its design and in the way it's actually built, and which powers everything from simple search applications to IBM's Watson, which played Jeopardy. So it's incredibly powerful and it comes with a large set of functions, but they are slightly hard to use and understand, and Elasticsearch builds around that.

The second question you're going to ask is: is it ready for production use? Is it web scale, and so on? For a project that came around in Feb 2010 and shipped its 1.0 release as recently as Feb 2014, it is incredibly stable, and it is used in production. Wikipedia's search is now powered by Elasticsearch; they moved over from Solr. Before Graph Search came to Facebook, Facebook's search was also powered by Elasticsearch. GitHub's code search is powered by Elasticsearch. At Directi, a large bunch of teams use Elasticsearch in different ways.

Coming to the next point, you need to understand something very fundamental: when you're using a NoSQL data store, you have to use it for what it's designed to do. You can't use a cannon to kill a fly, right? So what is it good for? Obviously, since it's called Elasticsearch: full-text queries. The second thing is geo queries. If you wanted an aggregate metric of how many people here like Thai food, or are talking about Angry Birds on Twitter, boxed by geo grids, that's incredibly hard to do in your usual database, and it's incredibly simple here. Big data: Elasticsearch is not ACID compliant and it does not support transactions, so obviously you cannot use it for something like a banking transaction; but you can use it for really large amounts of logs and similar data structures. It's incredible for aggregations. It's good for scripting and percolation, and these points I'll cover in more detail. Time series data is probably the most common use of Elasticsearch, where you pair it with the ELK stack: you ship all your logs in with Logstash, view them in Kibana, and Elasticsearch powers all of that.
And they have a special section on autocompletion, which I will show you. So, when you're dealing with large clusters of distributed databases, they're usually a pain to set up. Even setting up SQL replication is not the easiest thing to do, and with most other systems you'd probably want to set up something like ZooKeeper to synchronize your nodes. In Elasticsearch, the four commands you see here are all you need. You can automate them using whatever automation tool you like, you run the fourth command any number of times, the nodes discover each other in the cluster, and you have a really large cluster up. Hit any one of those nodes on port 9200 and you're dealing with the REST API. It's completely REST-based for any language but Java; in Java, the clients can talk over a custom binary protocol, so if you want to handle really, really large loads, I'd advise you to use Java. But if you're using Python, or Ruby, or whatever else (in fact, even JavaScript in your browser), you can talk to your database using JSON over REST.

So it's important to understand this mapping. The way we were brought up and started programming, we understand SQL constructs really well, and we try to map those when we start developing applications for a NoSQL case. Elasticsearch has a concept of indices, which are analogous to your databases in SQL. It has types, and types are pretty interesting: a type is like your definition of a table, and it would look something like this part over here; it gives your data structure and form. In the settings you can decide how many shards and replicas you want, the refresh interval, and a lot of other things you probably don't want to know about. And a mapping is where you describe what your JSON is about: if this field comes in, how do I map it to a certain data type?

Now, this is really interesting, because Elasticsearch by design is schema-less. The first time you pump in a JSON like the one over here (you can clearly see "info" is a nested document and "interests" is an array type), it automatically initializes the fields: the email and name to string types, the age to an integer type. This can cause problems, because if you later input float data, the mappings are no longer valid and you can't actually do math on those fields, and a lot of the use of Elasticsearch is to do a lot of calculations. So a document looks like this, and it's analogous to your SQL record. The advantages are pretty apparent: in SQL you'd have a record, a foreign key, and another table; essentially the developer has to flatten out his data structure, and that is not necessarily the best of practices. You end up with those crazy outer joins and inner joins, which is honestly a pain at times.
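To make the index/type/mapping analogy concrete, here is a minimal sketch of creating an index with explicit settings and a mapping over the REST API on port 9200, using Python's requests. This is not from the talk: the index name and fields are hypothetical, and the syntax assumes the Elasticsearch 1.x line current at the time.

```python
import json
import requests

# Hypothetical index: 2 shards, 1 replica, and an explicit mapping so that
# numeric fields don't get mis-inferred from the first document indexed.
index_definition = {
    "settings": {
        "number_of_shards": 2,
        "number_of_replicas": 1,
    },
    "mappings": {
        "attendee": {  # the "type", analogous to a table definition
            "properties": {
                "name":      {"type": "string"},
                "email":     {"type": "string", "index": "not_analyzed"},
                "age":       {"type": "integer"},
                "interests": {"type": "string"},  # arrays use the element type
            }
        }
    },
}

resp = requests.put("http://localhost:9200/pycon14",
                    data=json.dumps(index_definition))
print(resp.json())  # {"acknowledged": true} on success
```

Declaring the integer field up front avoids the float-versus-int mis-inference problem described above.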
So what is full-text search? How does this database work internally? Understanding this is pretty key. If I had a bunch of documents and I wanted to know which documents contain "football" and which contain "job", I could iterate through every document and look for those words. But clearly that won't scale; it would be too time consuming. So instead you invert the search on its head: you take the data you have, analyze it, and split it up into words. Each field (remember you had name and email) now becomes a hash map where, say, "Niners" is a key and the values are the documents it appears in. So if I search for "Niners", or if I search for "buildings", it tells me it belongs to document three.

Well, that solves the problem of a single word. What about a phrase? Say I wanted to search for "football job"; I'm looking to coach a team. How do I do that? I make a small modification and store the document and the position as a tuple. Now I know "football" appears in document A at index two and in document B at index three. Then I look for "job" with the same analyzer, and I find that in document A it's at index ten, so document A clearly doesn't match "football job". But in document B it's at the very next position, an offset of one, which is exactly what the query asked for. So the search really is turned on its head.

And for people who do use Elasticsearch: if you don't want to store something, don't store it. In SQL we're very used to keeping a field around and setting it to null or false. Don't do that here; just leave the field out. If it's not there, you can use a missing filter on it. It's simply easier to do a set operation than a search operation, which is a log(n) operation at best. So use that.

Breaking that down for you (I think it's really important to understand what happens to your data when it goes in), you can define analyzers. You can define character filters and token filters and say: turn every instance of an ampersand into "and", the characters a-n-d. Or you can do something more complicated using Hunspell, where you say: if the words are in German, translate them into English and store them that way. The user doesn't need to know that. The user inputs a recipe in, say, German; since I translated it to English while ingesting the data, when I later query for recipes with fish, it still hits the right documents. You can choose how to store this data, and you can keep both the German and the English terms. You can get rid of stop words; in most cases you don't need stop words, and you probably want to throw them out. Another common use case: you build a system that scrapes a lot of webpages, and you want to get rid of all the HTML tags, because that's noise. That comes built in. You put these together in a chain, and that is how your data goes in.

Now, that seems pretty simple, right? That's doable in any other database. The interesting bit is that you can apply the same filters and analyzers on a query. So even though you store data in a certain way, you can have your query analyzed in a different way that then maps onto the data you stored. If a user typed something in Italian and you defined a Hunspell translation there, it would just work. This gives you incredible control over what data actually hits your index, and that is something really amazing.
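As a rough illustration of such an analysis chain (not the speaker's exact example), this sketch defines a custom analyzer that strips HTML, rewrites ampersands, and drops stop words, then inspects the output with the _analyze endpoint. The index and analyzer names are made up, and 1.x-era syntax is assumed.

```python
import json
import requests

# Hypothetical index whose analyzer maps "&" to "and", strips HTML tags,
# and drops English stop words before tokens reach the inverted index.
analysis_settings = {
    "settings": {
        "analysis": {
            "char_filter": {
                "amp_to_and": {"type": "mapping", "mappings": ["&=> and"]}
            },
            "analyzer": {
                "scraped_pages": {
                    "type": "custom",
                    "char_filter": ["html_strip", "amp_to_and"],
                    "tokenizer": "standard",
                    "filter": ["lowercase", "stop"],
                }
            },
        }
    }
}

requests.put("http://localhost:9200/scraper", data=json.dumps(analysis_settings))

# The _analyze endpoint shows exactly what would get indexed.
r = requests.get(
    "http://localhost:9200/scraper/_analyze?analyzer=scraped_pages",
    data="<p>Fish &amp; Chips recipes</p>",
)
print([t["token"] for t in r.json()["tokens"]])  # ['fish', 'chips', 'recipes']
```

Note how "and" itself then disappears as a stop word; the chain runs char filters, tokenizer, and token filters in order.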
So let's get to the basic operations. What is inserting? Inserting is as simple as this; you don't even need the index created beforehand. Here "pycon14" is the index, "attendee" is your type, and 1 is the ID, and I send this as a PUT request over REST. This gets stored; it figures out the types and it works. Now I've got an attendee in. Similarly, for any tweets about Bangalore, you use an analyzer, extract all the hashtags, and map them to tags. That's how you input your data. If you defined mappings on them earlier, the mappings will be applied, and if it finds a mismatch, it throws back an error status. It's really that simple.

Okay, show of hands: how many of you have heard of Lucene, or worked with the Lucene syntax? So the simplest form of search, which is why you came to this database in the first place, is queries. The Lucene syntax looks something like this: q is my query-string parameter, which is a Lucene query, and I say tags:elasticsearch and user:dnscorsate. It's a way of saying: give me all documents that have this tag and this user. You can use AND, OR, minus; that's the Lucene syntax, you can look up the link. That's the simplest form of quick querying, and you can do it in your browser. All this time you don't even need any library set up; you're just interacting with it from your browser.

This is what a query would ideally look like. It could be longer; we have queries about 300 lines long that do certain things, and it gives you incredible power. Just to give you an idea: you see the filter part here, a filtered query, where I say range on timestamp, from now minus 20 minutes to now. So: give me the data from the last 20 minutes. And this is where we first get to learning what filters are. Filters are a way of saying: I do not want to query all this data; first filter the data set down, then run the query on what's left. The idea of filters is that they are incredibly fast, about 10x faster than queries, and they are boolean-only; boolean algebra can be done really quickly. They're designed so that you don't analyze or score them, you just apply boolean logic, and they're scoped to your query.

Filters also have a very cool form of caching. In most systems, you'd make a query, the cache would use a TTL, and it would be cleaned up. Elasticsearch does it differently: even a bunch of AND-ed filters get split up, and each filter's cached result is updated automatically as documents change. It's all cached in memory, which makes it really, really fast. And this is just an example of a geo-distance filter: if you were trying to find all the people talking about PyCon within a 100 kilometer radius of the latitude and longitude of where we are, you'd just run this filter, and it would pick up the tweets whose lat-longs fall within that 100 kilometer radius.

Coming to queries: queries can be quite detailed, and they can be fuzzy. The default clause is must. Must indicates that the clause must match the pattern. Must-not, obviously, is self-explanatory. And should is the fuzzy one: a should clause need not match, but documents that do match it score higher. What really happens is that as your data comes out, it gets scored with something called TF-IDF, which scores how well a document matches the query, and results are ranked automatically according to how you wrote your query.
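Here is a hedged sketch of the kind of filtered query being described, assuming a hypothetical tweets index whose timestamp is a date field and whose location is mapped as a geo_point, in 1.x filtered-query syntax.

```python
import json
import requests

# Hypothetical "tweets" index with a `timestamp` date field and a
# `location` geo_point field. Filters cut the data set down cheaply;
# the query part is scored only over what survives the filters.
query = {
    "query": {
        "filtered": {
            "query": {"match": {"message": "pycon"}},
            "filter": {
                "bool": {
                    "must": [
                        # only the last 20 minutes of data
                        {"range": {"timestamp": {"gte": "now-20m", "lte": "now"}}},
                        # only tweets geotagged within 100 km of Bangalore
                        {"geo_distance": {"distance": "100km",
                                          "location": {"lat": 12.97, "lon": 77.59}}},
                    ]
                }
            },
        }
    }
}

r = requests.get("http://localhost:9200/tweets/_search", data=json.dumps(query))
for hit in r.json()["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["message"])
```

The boolean filters are cacheable bitsets, which is where the roughly 10x speedup over scored queries comes from.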
So if you notice, I'm running a multi-field match: I'm checking whether the word Elasticsearch exists in either the tag or the message, on a stream of tweets coming in. And I can put a query-time field boost on tags: if someone has taken the pain to explicitly tag Elasticsearch in his tweet, it's probably more relevant than someone mentioning Elasticsearch in passing. So my query automatically gives the tag field a 10x boost, and those tweets show up on top, because search needs to be designed so that the most relevant results appear at the top. How many of you have ever gone to the second or third page of Google? Or anywhere beyond the fifth page? Okay, very few. Obviously, you had a lot of time.

Then there's something called fields, which is essentially like SQL: instead of select *, you select A or B, choosing which fields come back. And you can sort by date descending; you'll get results sorted by that, with the relevance-score ordering still applied within, so it's like sorting inside buckets.

One more insane thing it does is highlighting. When you do a Google search, you get a result with a bunch of highlighted text telling you what that document is about. Now try implementing that by yourself in any other search database; I guarantee it'll be a pain. Here you can actually define how many words show up in your result, and it gives you the highlighted terms explaining why a certain document matched, and in what way.

One more thing they've done really well is suggesters. In 80 to 90% of the cases where you provide a search box, you know what the user is looking for. Say you're building a shopping website, or a food site like Zomato: when someone starts typing "th", he's probably looking for Thai. A lot of the time you don't really need to run a query at all; you just need to autocomplete the input and infer what he's talking about. So what these guys have done is: when you put your documents in, you can declare a field as a completion type that suggests autocompletions, and you can define its payloads, and it automatically starts suggesting for those inputs. If someone starts typing "NIMHANS" on your page and you are HasGeek, you'd autocomplete straight to PyCon India, because that is what he's looking for; it's context. These suggesters are built using finite state transducers, so they're really quick, and you can have fuzziness in them, so even autocorrect works. You really don't need to run searches; you'll already have a cached result for things about PyCon, and you shouldn't be running a search in that case. This is something really, really cool, and a tip out there: if your suggester is taking more than one millisecond, you're probably doing it wrong.
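A minimal sketch of the completion suggester described here, assuming 1.x syntax; the venues index, field names, and payload are invented for illustration.

```python
import json
import requests

BASE = "http://localhost:9200"

# Hypothetical index with a completion-typed field; suggestions are served
# from an in-memory FST rather than by running a search.
mapping = {
    "mappings": {
        "venue": {
            "properties": {
                "name_suggest": {"type": "completion", "payloads": True}
            }
        }
    }
}
requests.put(BASE + "/venues", data=json.dumps(mapping))

doc = {
    "name_suggest": {
        "input": ["NIMHANS Convention Centre", "PyCon India venue"],
        "output": "NIMHANS Convention Centre, Bangalore",
        "payload": {"event": "PyCon India 2014"},
    }
}
requests.put(BASE + "/venues/venue/1", data=json.dumps(doc))

# Typing "nim" in the search box maps to one cheap _suggest call.
suggest = {"venue-suggest": {"text": "nim", "completion": {"field": "name_suggest"}}}
r = requests.post(BASE + "/venues/_suggest", data=json.dumps(suggest))
for option in r.json()["venue-suggest"][0]["options"]:
    print(option["text"], option.get("payload"))
```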
As I said, I'm not going to talk about using Elasticsearch in the conventional way; I'm going to cover lesser-known topics that you might like to discover and use. So: scripting. Scripting is incredibly powerful. How many of you have scripted in Redis? You can write little Lua scripts in Redis, and the idea that you can script a database is fairly impressive. In the usual case, you look for a record, fetch it, update it, and send an update request, correct? This matters a lot in NoSQL especially, because you don't have control over each row: in most NoSQL databases, every time you update a document you actually just mark the old document for deletion, create a new document at the end of the index, and merge later. That's fairly inefficient, but it's the way they're designed, for other very sensible reasons.

So what you can do is use your documents as finite state machines. This is some of our production code: when we see an action on a particular page inside our infra, on the tool that I build, it automatically refers to the document we're dealing with and alters its fields and its stats without doing anything else, and it guarantees atomicity for me. So even if this document is being updated at, say, 100 requests a second, the atomic state changes are guaranteed, and that becomes really important when you're trying to keep track of thousands of servers. It also takes less bandwidth, obviously: I just refer to a script (I call this one "on action") and pass only the parameters the script should execute with. The script is already sitting on my database, so I don't need to fetch data and send data back. It's just so much less work.

And you can write scripts in a variety of languages. The current default is MVEL; in the next release, coming out next month, that's going to change to something called Groovy. But there are modules that let you script in JavaScript, in Python, in a bunch of other languages, and in plain Java too, if you want those kinds of speeds. The interesting part about scripting is that you can also write a script that is then used in your query. If you wanted a script that says "if the document contains these two words and it's by this user, update 10 more things, get me the result, and run an aggregation on it", it happens at query time. So you can refer to scripts right inside your queries.
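A rough sketch of a scripted, atomic partial update like the one described, assuming the official elasticsearch-py client against a 1.x cluster with dynamic scripting enabled. The index, fields, and script are hypothetical, not the speaker's production code.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])

# Hypothetical server-status document treated as a little state machine:
# the script runs on the data node, so we ship only the parameters,
# not the whole document, and the step change is applied atomically.
es.update(
    index="ops",
    doc_type="server",
    id="web-042",
    body={
        # 1.x scripts default to MVEL (Groovy in later releases)
        "script": "ctx._source.hits += inc; ctx._source.state = new_state",
        "params": {"inc": 1, "new_state": "maintenance"},
    },
    retry_on_conflict=5,  # retry if concurrent updates race this one
)
```

Compare this to a read-modify-write cycle: there is no window between the get and the put for another writer to squeeze into.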
This is perhaps my favorite part. I ran a query over a stream: I rivered Twitter's firehose into my database. There's something called a river; I could river the whole of Wikipedia, as it changes, into my database, and the database does the pulling for me. What I did here is define something I named "split 10 minutes": a date histogram on the time field at an interval of 10 minutes. This takes all the documents that came in and puts them into 10-minute buckets. The second aggregation, nested under it, was a terms aggregation on tags: for each time bucket, it builds sub-buckets per tag. Say you had tags of pizza, Python, Cubbon Park, whatever else; it creates tag buckets out of those. And then, if you've scripted a little algorithm that infers the mood of what someone is talking about (you still have the documents with the entire tweet: is he happy when he's talking about pizza? is he enjoying the talk he's attending at PyCon?), you'd write that little script, it would give you all those aggregations in buckets, and you could graph them out using something like Angular or D3.js and plot them on a timeline.

There's something even more interesting. If I took all these tags, used a top-hits aggregation, and sub-aggregated that with something called a geo grid, it would divide the world map into geo grids (you can define everything from 1.6 centimeters up to a 5,000 kilometer by 5,000 kilometer box), and for each box it would give you the trending phrases every 10 minutes. You could zoom into a box and see what people in a particular region are thinking when they tweet. And that's just the Twitter example. All of this in one single query. I just wrote a stats app last week that graphs all our infrastructure, and I get data about every single response metric, every way people are interacting with the infra, in a single query, on a single page, in less than 0.6 seconds. I don't know of any other database you can do this with; perhaps Solr, but that's about it, and that's its one real competitor.

Now, this next thing is really, really unique. In 90% of your use cases, what you're traditionally used to in a database is: you store a lot of data and then you run queries on it. Simple enough; that is the idea we have of a database. But truth be told, you often don't need to do that. Say I developed a mobile application that learned about me (say I was building Google Now), and I want to start suggesting places around me, food around me, movies running in the mall next to me. You'd have a bunch of queries that you'd run against each data point that arrives; now scale that up to a billion people, and it won't really work. So what Elasticsearch has come up with for this is something called percolators. With percolators, you store your queries, each with a query ID; you do not store your data against a particular index. Then you run the data point that you were going to enter into the database against the percolator, and you get back all the stored queries that match this document. Say I registered a query that says match tags:elasticsearch, and I started pushing in a lot of tweets from the global firehose. Every message that matches this query returns something like this: it matched one query on index tweets, ID one, which is exactly the percolator ID that I registered. You can have a million queries registered at a time, and they can be used to trigger alarms. Take stock prices: suppose I have stock prices flowing into my database, and if a stock in the pharma category went up by 10 points in the last 10 minutes, then send me an email. That would be really hard to do with the amount of data flowing through a conventional stock market; here it's really, really simple. This is one of our favorite new things; we use it to tag events flowing into maintenance modes and things like that in our current architecture. That's percolation.
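A minimal percolation sketch, assuming elasticsearch-py against a 1.x cluster, where stored queries live under the special .percolator type; the index and query here are invented.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])

# Register the query itself, under the special .percolator type.
es.index(
    index="tweets",
    doc_type=".percolator",
    id="elasticsearch-alert",
    body={"query": {"match": {"tags": "elasticsearch"}}},
)

# Now run an incoming document *against the stored queries*, instead of
# storing the document and querying it later.
result = es.percolate(
    index="tweets",
    doc_type="tweet",
    body={"doc": {"message": "percolators are neat", "tags": ["elasticsearch"]}},
)
for match in result["matches"]:
    print("matched stored query:", match["_id"])  # -> elasticsearch-alert
```

Each match ID can then drive an alert, an email, or a tagging step, as in the stock-price example above.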
Obviously, it's PyCon, so you must be wondering where the hell all the Python is; I've been talking about a database over REST all this time. In all honesty, there have been a bunch of iterations on how you'd design a client library for Python, or for any language for that matter. I use a lot of Ruby these days with the same Elasticsearch Ruby clients. The thing is, Elasticsearch is evolving extremely fast. They come out with a new release every two months, and each release is actually worth upgrading to; I'm waiting for the next release myself because I want some of the new aggregators they've built. So there is no real way to keep a DSL in Python mapped onto these REST JSON queries. Even for Ruby it's really hard, and Ruby is much simpler to frame DSLs in. So finally they said: let's ditch that idea and build a basic library that takes a hash (a dict, as Python calls it), and you pass that dict as your query parameter, with the basics like index, doc type, and ID as arguments.

So why should you use the Python library rather than raw REST? The idea is that you get persistent connections. When you're building applications that fire tens of thousands of queries, you don't want to open a new TCP connection for every request, so the library has a connection pool manager. You get load balancing: you can fire your queries at any part of the cluster using round robin or whatever logic you choose. You get thread safety, and it's easy to integrate your models into Django or Flask or whatever you plan to use. Also, error handling: if you make a raw REST request and it gives you a 502, that's not too useful. Did my database fail? Was the query invalid? Did the shard fail? Did I run out of memory? What's the deal? These libraries have proper error and exception handling built in, which is super, super useful; you just put a try/except around it. And failed connections: connections to databases do fail, and you need to retry automatically. The official Elasticsearch Python library currently does all this work that you don't want to deal with, so you can focus on building your application, and you still use the same dict to send your queries across to the server. So you really should not use raw REST to talk to the server; use the library if you're building a proper, respectable application.
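A short sketch of what using the official client buys you, per the points above. The hosts are hypothetical; the exception classes are the ones exposed by elasticsearch-py.

```python
from elasticsearch import Elasticsearch
from elasticsearch.exceptions import ConnectionError, NotFoundError, TransportError

# One client, many nodes: the library keeps persistent connections,
# round-robins requests across the cluster, and retries failed nodes.
es = Elasticsearch(
    ["es1.example.com:9200", "es2.example.com:9200"],  # hypothetical hosts
    max_retries=3,
    retry_on_timeout=True,
)

try:
    result = es.search(
        index="tweets",
        body={"query": {"match": {"message": "pycon"}}},  # same dict as over REST
    )
    print(result["hits"]["total"])
except NotFoundError:
    print("index does not exist")        # a plain 404 becomes a typed exception
except ConnectionError:
    print("no nodes reachable")          # a network failure, not a query error
except TransportError as err:
    print("cluster said no:", err.error)  # bad query, shard failure, and so on
```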
Production gotchas. We've been running it in production for a while now, and some of the problems I thought I'd share with you: firstly, memory. It's written in Java, and it's slightly dumb in its design for memory. Most databases, when they want to load data sets into memory, first check whether there is enough. Elasticsearch does not check; it just tries to fill up memory and goes OOM, even if you have 48 GB of memory. So a sensible rule of thumb: if you gave your Java VM 8 GB of memory, keep 8 GB spare, because it will overshoot. Scaling out, on the other hand, is really simple; you can store terabytes of data, just add more nodes. It's not really hard to do that.

Manual sharding: as I told you, it is a NoSQL database that writes and does not really update in place. So in one of our use cases, where we were actually using these documents as state machines to toggle things, we took a shard and put it in memory, and we named the shard with a different suffix. You can query indices using just commas and wildcards in your query strings, so we could modify things in RAM, and you don't need to do segment merges; it just works.

Networks: always create your applications so that they can deal with network errors, because network errors will happen. If you design applications with the assumption that everything will work, you're probably doing it wrong; you have to assume things fail.

There's one concern that hasn't been solved in Elasticsearch yet, which is a slight pain: it's called split brain. We've thankfully not faced it to date. When you have a bunch of nodes in a cluster, it auto-elects a master node where the coordination happens. Elasticsearch essentially does internal map-reduces; they call it scatter and gather: you send a query, it scatters it to all shards and replicas, then gathers the results and compiles them back. So it needs some kind of coordination. The work can happen on any node, but there has to be one master, at least for writes. There's a configuration option called minimum master nodes, and a safe way to get past split brain in at least 99% of cases is to keep this value at N/2 + 1: if you have three nodes in your cluster, at least two should be up before a master is elected, so that your cluster doesn't split. Elasticsearch is designed to be run on cloud environments: if I set up 100 nodes and lost 20, I probably wouldn't be bothered, as long as I had enough replicas. You can bring nodes up and down as you please, according to your elastic needs, and it rebalances itself out. That's something really amazing.
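A sketch of setting that quorum, assuming elasticsearch-py; in 1.x the discovery.zen.minimum_master_nodes setting normally lives in elasticsearch.yml, but it can also be applied at runtime through the cluster settings API, as shown here.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])

# For a 3-node cluster: 3 // 2 + 1 == 2 master-eligible nodes must be
# visible before any node claims mastership, which stops the two halves
# of a partitioned cluster from both electing a master (split brain).
cluster_size = 3
quorum = cluster_size // 2 + 1

es.cluster.put_settings(
    body={"persistent": {"discovery.zen.minimum_master_nodes": quorum}}
)
```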
Security-wise, it comes with basic HTTP auth, as Kiran was telling you about, but you usually never expose your database to the outside world anyway. An easy way of getting around this is a proxy: there's something called es-proxy, a Node.js application, which can plug into any authentication system. And the best part is that there's an elasticsearch.js library you can put in your web browser, so you're actually talking to your database from JavaScript code, which is pretty insane. Try doing that with other databases with as much ease. And yes, your client should be designed to handle failure: nodes might go down and come up, and it should retry another node; that's not really hard to do.

So that's about it. And finally, the ELK stack, which is probably the most popular use case of Elasticsearch today: you take something like a Kafka broker, pipe all your logs through Logstash, and just chuck loads and loads and loads of logs in. We chuck millions upon millions of entries into our Elasticsearch cluster every day, and you can run analysis on them: who's calling from where, who's doing what, which server, what kind of issues, what the CPU history was, essentially anything. You could use it for Facebook, you could use it for tweets, as I was giving in the examples. I think you should go home and check out this database; it has some kick-ass features that you'll probably want to try. I'm done. Thank you. Any questions?

What is the difference between MongoDB and Elasticsearch, then? It provides all those features, like aggregation, text search, geospatial indexes. So, the aggregations are not really the same: in MongoDB, you actually write your own map and reduce functions for aggregations. And percolators: do you have percolators in Mongo? Well, it's a newer technology, and I liked it; Mongo wasn't really working for us. As I told you, you can get into these database wars where you say "hey, my database is better"; the idea here was to introduce you to a new one, which really works well. We have one more question.

If you were to build a chat app that had to do a lot of search across thousands of users, the use case of HipChat, right? So HipChat wanted to run full-text search on all their logs, for thousands or millions of users. What they would do is put the data into MongoDB, and there's no real way to search through all those documents quickly in Mongo. So their implementation put Redis in front of it, with connectors between them, and ran Lucene on top of that. When they shifted to Elasticsearch, they reduced their footprint by an order of magnitude. So that's that.

Is there a standard way to back up? Yeah, that is a very good question, once again: is there a standard way to back up an Elasticsearch DB? Yes. In the 1.x releases you can snapshot, you can back up, and it works really well. Does the backup need to run on all the nodes or just one of the nodes? So firstly, in Elasticsearch you rarely ever need to back up your data. I mean, okay, that's probably not entirely true, but it's perfectly fine to lose a bunch of nodes in your cluster; it's designed to lose a bunch of nodes, so in most cases your replication acts as your backup. But if you do want snapshotting backups, there's a plugin for it; I'm not sure what it's called, because we started before 1.x and ran custom scripts. You can just back up your entire data set using a cron job, it's that simple, and it would work.
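A sketch of the snapshot workflow that the 1.x releases introduced, assuming elasticsearch-py and a hypothetical shared-filesystem path reachable from every node.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])

# Register a repository once; the filesystem path must be accessible
# to every node in the cluster. Subsequent snapshots are incremental.
es.snapshot.create_repository(
    repository="backups",
    body={"type": "fs", "settings": {"location": "/mnt/es-backups"}},
)

# Take a named snapshot of the whole cluster. This is a cluster-wide
# operation: you call it against any one node, not on every node.
es.snapshot.create(
    repository="backups",
    snapshot="snapshot-2014-09-14",
    wait_for_completion=True,
)
```

Wrapping the second call in a cron job gives you exactly the scheduled backup described above.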
Elasticsearch and Solr, right? So Solr is really good, but there are certain things Elasticsearch is better at, especially its aggregations and percolations; the redundancy story is also slightly better. You could honestly use either; I just like the DSL of Elasticsearch more. There's a very good document on why Wikipedia moved away from Solr to Elasticsearch; maybe that will give you a better idea. Okay, one announcement: please submit the feedback forms at the registration desk.

How do logs go in? So ideally, have you seen syslog-ng or Logstash? By default, what you would do is put the log line into a specific format, into a JSON object, and send a PUT request, and that should map. If you have mappings, it enforces the type criteria on them, like float or int or whatever else. If you don't have type criteria, it assumes strings and stores them that way; you can still query them, but you lose things. So if I ended up storing an ISO 8601 timestamp as a string, because, say, you messed up and dropped your TZ data, it might get inferred as a string, and then you can't run time-range aggregations on it. So it's a good idea to set your schemas, even though it's not strictly necessary.

But what OS do you use? What distribution? Fedora: just add the repository, yum install elasticsearch, hit port 9200, and you're started. It's really that simple.

Do we have support to configure the indexing relevance? Sorry? Let's say in a document, I want to index the words in the first para with higher relevance and the words in the second para with lower relevance, so that I can define the order of the results being returned. So what you would ideally do is handle that during the analysis phase: you split up the document and index para one as its own field, and in your mapping you say copy para one into a total corpus field. So you'll have a copy in para one and a copy in the corpus, and you can query either one as you please. And the thing to remember about Elasticsearch mappings: they're not dynamic, sorry; once they are set, you cannot change them, and you have to re-index your entire database. You can add mappings, but you cannot remove or change them.

So using this, I can change the relevance of each word? Yes, you can. One more thing Elasticsearch does is give shorter fields a higher score, because shorter fields are generally more important than longer ones. Does TF-IDF come into the picture here? Yes, TF-IDF comes into the picture. And of course, you can tweak the scoring algorithm; there are five or six options (bloom filters, TF-IDF, and so on), you can select your scoring logic, and you can write a custom scoring logic. We've done that in one place; we've written our own custom scoring logic, sure. So the solution you pick is based on what you need.

Can we expect the same kind of throughput as Solr, or is there a difference in technique? So I'll be honest with you, I've not really used Solr, but from what benchmarks suggest, it's pretty much similar. It's all Lucene at the end of the day; but things like filters, percolators, the level of the analyzers, the number of tool sets you get around it, Marvel (their monitoring dashboard), the plugins available: I think Elasticsearch wins in those fields. I was thinking in terms of data ingestion. That should be very similar; to be honest, I've not really used Solr, so you should probably Google it.

Does Elasticsearch support query parsers, like Solr does? Solr has a query syntax, with parsers like edismax and dismax. Yes, it uses various kinds of parsers; they all basically map down to your Lucene parameters, so those are supported. I couldn't go into those details, it's a really large area, but you have a lot of configuration possibilities, and you have access to the raw Java data types and all that.

Sir, can you give me an example? For example, in Flipkart, if I'm searching for a headphone, and I've gone from laptops to headphones as a category, I want laptop headphones to be ranked higher; in a way, all 3.5 mm jacks should be listed first. Yeah, so I don't think that part will be handled by the database, because there's no real way for your database to know what the user's path was; your architecture would internally construct that query against your database. But I can prioritize the tags? Yes, you can give it a boost. You can give a query boost; you can give a term a query boost. You can do that.
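A hedged sketch of the copy_to-plus-boost approach just described, assuming 1.x mapping syntax; the index and field names are invented for illustration.

```python
import json
import requests

BASE = "http://localhost:9200"

# Hypothetical mapping: para_one is searchable on its own and is also
# copied into a catch-all "corpus" field alongside the other paragraphs.
mapping = {
    "mappings": {
        "article": {
            "properties": {
                "para_one": {"type": "string", "copy_to": "corpus"},
                "para_two": {"type": "string", "copy_to": "corpus"},
                "corpus":   {"type": "string"},
            }
        }
    }
}
requests.put(BASE + "/articles", data=json.dumps(mapping))

# At query time, a match in the first paragraph counts 3x more than a
# match anywhere else; the caret syntax is the per-field boost.
query = {
    "query": {
        "multi_match": {
            "query": "elasticsearch relevance",
            "fields": ["para_one^3", "corpus"],
        }
    }
}
print(requests.get(BASE + "/articles/_search", data=json.dumps(query)).json())
```

Remember the caveat above: once set, these mappings cannot be changed without re-indexing, so the paragraph split has to be decided up front.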
Yeah. Thank you. Thank you. Thank you, Dhananjay, for the wonderful talk on Elasticsearch. Now, we'll have a 15-minute break and then assemble back here for...