So I'd like to talk about Elasticsearch DSL, which is a new library for interacting with Elasticsearch that I've been working on. But first, let's take it a little slower and talk a bit about what Elasticsearch is. I did a little talk yesterday about what search engines are in general and how they work, so I will try to be brief on this part. Elasticsearch is an open source distributed search and analytics engine. That's quite a mouthful for what is essentially a distributed data store that can store your documents, search through them and analyze them. And by analyze them, I mean run different sorts of aggregations. By distributed, I mean just that: if you have one instance, it will work; if you start two instances, they'll find each other, form a cluster and automatically share your data and spread the load. That's where the "Elastic" part of the name comes in. As I mentioned, it's a document store, and it's JSON based, so anything that you can express as JSON, you can index and search through using Elasticsearch. It's not exactly schema free, but it has a dynamic schema. What that means is you don't need to tell us what your documents look like; we'll look at them and infer the schema from the data. Only in some cases, where you have knowledge that we don't, do you need to step in. For example, if you know that a number will never get above 256, you can tell us and we'll index it more efficiently for you. Or in some cases, you actually need to tell us what the data type is, because there is no way to know from the JSON. For example, if you index a geo point or a geo shape, there is no way for us to automatically distinguish it from just a list of two numbers. So typically you want to tell us the schema, but if you don't, if you just want to play around with it, just start indexing documents and you should be good to go. We also have support for some relationships.
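To make the schema point concrete, here is a minimal sketch of what such an explicit mapping might look like as a raw dictionary; the type name and field names ("place", "location", "comment_count") are my own illustrations, not taken from the talk's demo data.

```python
import json

# A hypothetical explicit mapping. Without it, Elasticsearch would have
# to guess: the two-number list would be indexed as plain floats rather
# than a geo_point, and the small integer would default to a 64-bit long.
mapping = {
    "place": {
        "properties": {
            # has to be declared: [lon, lat] is indistinguishable from
            # an ordinary list of two numbers in JSON
            "location": {"type": "geo_point"},
            # we know this number never gets above 256, so a 16-bit
            # short is enough and indexes more efficiently than a long
            "comment_count": {"type": "short"},
        }
    }
}

print(json.dumps(mapping, indent=2))
```

You would send something like this via the put mapping API before indexing; with no mapping at all, indexing documents directly still works thanks to the dynamic schema.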
So you can actually have nested documents, which are essentially sub-documents inside a bigger document that can be queried independently. We'll see an example in just a bit. And we also have parent-child, which is essentially a one-to-many relationship that you can query across: you can query the parents while asking conditions of the children, and vice versa. To give you an example, and don't worry, I don't expect you to be able to read this, this is a sample document from indexing data from Stack Overflow. I'll be doing a demo later, and this is the data that I'll be using. You can see that I have several interesting fields highlighted. One is the title and body; those are just text fields that do exactly what you would expect. We have a datetime as the creation date. We have comments, which is a list of nested documents, because each question on Stack Overflow can have comments, and so can each answer. What we don't have here, since this is a question, are the answers: we also index those, and we use the parent-child relationship in Elasticsearch to map the relationship between a question and an answer on Stack Overflow. And finally, I've also highlighted the field rating, which is an integer field: the rating from Stack Overflow of the quality of the question or the answer. That's important because you can actually take it into account when sorting or when outputting the results. You can either sort by it, or you can take the score that the search engine gives you and combine it with this number to produce the optimal ordering. So it's not one or the other, it's a combination of both. Unfortunately, that will have to be left as an exercise for the user; not enough time. So I've talked about queries, so what do they look like? How do we query Elasticsearch? Elasticsearch is HTTP and JSON; everything we do is HTTP and JSON. So if you want to query, you send JSON over HTTP, surprisingly.
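The slide itself isn't reproduced in the transcript, but based on the description, the kind of question document being indexed would look roughly like this; the field names and values are my reconstruction, not the exact demo data.

```python
question = {
    "title": "How do I parse JSON in PHP?",
    "body": "I have a string containing JSON and I want to ...",
    "creation_date": "2009-03-17T14:22:01",  # a datetime field
    "rating": 7,                             # human-contributed quality score
    "tags": ["php", "json"],
    # nested documents: sub-documents that can be queried independently
    "comments": [
        {"body": "What have you tried?", "owner": "commenter-1"},
        {"body": "See the json_decode docs.", "owner": "commenter-2"},
    ],
}
```

Answers would be indexed as separate documents linked back to this one through the parent-child relationship, rather than embedded inside it the way comments are.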
And the JSON that contains the query is essentially an abstract syntax tree. It is a serialized version of an expression tree that contains, amongst other things, the two most important ones: queries and filters. There's an important distinction between them, but what you need to know is that they are fully interchangeable from the outside: wherever you can use one query, you can use any other. The queries in Elasticsearch can really be overwhelming for beginners. When you look at a query and it's a full page of JSON, it's really distracting. But if you start to think about it as a tree, as an expression tree with a very simple grammar, it gets easier: you have a query, and each query type has its own grammar. For example, a filtered query can contain a query and a filter. So it's a simple grammar that can be recursive. Once you understand these concepts, it's fairly easy. Queries represent the unstructured part of Elasticsearch. They're the full text part, the part that not only tells you which documents match your query, but also how well they match: is this a good match or just so-so? That's why we have several different types. We have match queries, which do what you would expect. We have fuzzy queries, which are able to take typos into account by matching across Levenshtein distance, so across different misspellings of a word. We also have queries like regexp or wildcard, which allow you to do partial matches. You also have compound queries: if you have multiple of those core queries, you can put them together. Typically you do that using a bool query, which is short for boolean, and which essentially just takes a bunch of other queries and says: you must match all of these, some of these, and none of these. Again, we'll see an example in a bit. So there are core queries and there are compound queries. The queries rely heavily on analysis, which I talked a lot about yesterday, and they produce a score.
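As a concrete illustration of the "all of these, some of these, none of these" structure, a bool query in the raw JSON DSL looks roughly like this; the fields and values are made up for the example.

```python
bool_query = {
    "bool": {
        # every query in "must" has to match
        "must": [{"match": {"title": "python"}}],
        # "should" queries are optional, but improve the score when they match
        "should": [{"match": {"tags": "django"}}],
        # "must_not" queries exclude documents; fuzzy tolerates typos
        "must_not": [{"fuzzy": {"body": "ruby"}}],
    }
}

print(bool_query["bool"].keys())
```

Each element of those lists is itself a full query, which is exactly the recursive expression-tree shape described above.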
And because the score depends on the actual form of the query and on the state of the index, these queries are not cached. I wouldn't say that they're slow, but filters are faster. Filters do the same thing as queries in that they limit the result set, but they don't have to bother with the score, with the relevancy, because they only narrow it down. And because of that, they're much more suitable for caching. What we actually do inside is represent the result of each filter as just a bit set: we literally have one bit per document that matched that filter. You can imagine that's a very efficient representation, and also something that can be very easily cached. And once you have these caches for each of your individual filters, say a term filter where you're looking for an exact match, or a range filter where you're looking for a range of a numeric or date value, combining them using a bool filter is very efficient: you have multiple bit sets and you want to see the documents that are in both of them. Well, that's an AND, a binary bitwise AND, pretty much one of the most efficient operations you can do on any CPU. So it gets very fast, and it allows us to cache the individual core filters, so you get a lot of reuse from the caches. This is all transparent to the user. It's just important to keep in mind that there is a difference between filters and queries, and you should always use filters if you can, if you don't care about the relevancy. So again, with filters you have the core filters and the compound filters. Pretty much the only compound filter that you need to care about is the bool filter, which allows you to compound the individual filters very effectively. By default it uses the caches and the bit sets inside.
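The bit set idea is easy to demonstrate in plain Python, using an integer as a toy bit set over eight documents; this illustrates the principle, not Elasticsearch's actual implementation.

```python
# One bit per document; a set bit means "this document matches the filter".
term_filter = 0b10110110   # cached result of a term filter
range_filter = 0b11010100  # cached result of a range filter

# A bool filter combining both is a single bitwise AND over the bit sets.
both = term_filter & range_filter

# Which document ids (bit positions) survived both filters?
matching_docs = [doc_id for doc_id in range(8) if both >> doc_id & 1]
print(matching_docs)  # → [2, 4, 7]
```

The cached per-filter bit sets stay valid and reusable across queries, which is why combining many cached filters stays cheap.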
So this is actually one of the smaller typical queries that you would send to Elasticsearch. This is a query; inside it is a filtered query, and a filtered query has two components, a filter and a query. In this case, as the filter, we use a range filter: we're looking for questions on Stack Overflow from the year 2000 and newer. That's the filter part. For the query part, we have a bool query with three parts in there. We're saying that the title or body must have PHP in them, and there must be an answer to this question, so has_child, which has Python in the body. And we're also saying, in the must_not branch of the bool query, that the title and body must not contain Python. So effectively what we're looking for is some poor sap on Stack Overflow asking a question about PHP and some smartass replying, yeah, you should use Python. We've all been there, we've all done it, and this is how we can identify ourselves. So that's what this query is. You can see that I was right that it can be confusing to people: it's a lot of text, a lot of weird characters, and that's one of the reasons why I created the DSL. And the bottom half of the text is actually aggregations. I'm looking at the distribution per tag, and for each tag I want to see the average comment count. So in the result set I will have something like: for the tag design-patterns I had 24 documents, and on average they had three comments each. Again, something that we'll see in more detail. This is just to give you an overview of how it would look if we had to write everything by hand. So that's pure Elasticsearch. Now, we're at a Django conference, so that means Python. How do you interact with Elasticsearch using Python? Unfortunately, many people immediately jump to this question with: how hard can it be? It's just HTTP and JSON, right? Yeah, so the problem is Elasticsearch can be a little difficult.
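The query on the slide isn't reproduced in the transcript, but based on the description it would have roughly this shape as a Python dictionary; field names like comment_count and the exact date boundary are my guesses.

```python
stackoverflow_query = {
    "query": {
        "filtered": {
            # filter part: questions from the year 2000 and newer
            "filter": {"range": {"creation_date": {"gte": "2000-01-01"}}},
            # query part: a bool query with three branches
            "query": {
                "bool": {
                    "must": [
                        {"multi_match": {"query": "php",
                                         "fields": ["title", "body"]}},
                        # there is an answer (child doc) mentioning python
                        {"has_child": {"type": "answer",
                                       "query": {"match": {"body": "python"}}}},
                    ],
                    "must_not": [
                        {"multi_match": {"query": "python",
                                         "fields": ["title", "body"]}},
                    ],
                }
            },
        }
    },
    # bottom half of the slide: distribution per tag, and the
    # average comment count inside each tag bucket
    "aggs": {
        "per_tag": {
            "terms": {"field": "tags"},
            "aggs": {"avg_comments": {"avg": {"field": "comment_count"}}},
        }
    },
}
```

Even in this compressed form you can see why a full page of this is distracting to read, which is the motivation for the DSL.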
Not that it's unpredictable or anything like that, it's just that there's a lot going on. For example, it's distributed. So which node do you talk to? If you only talk to one, you run the risk of overloading that node while the rest of the nodes in the cluster just sit there idling. Even though they share all the load, some of the work will always go through that one node. Not ideal. And what happens if that node goes down? The cluster is still fully operational, but your application cannot reach it anymore. So that's one aspect, the distributed aspect. Then there are different environments. Many people deploy Elasticsearch behind a load balancer, or they use alternative transports. For example, you can use Thrift via a plugin to Elasticsearch, because some people prefer binary protocols for some reason. And then there is the fact that we have almost 100 API endpoints with almost 700 parameters between them. If you want to use raw HTTP, that's knowledge that you have to carry around in your head, and trust me, it's not pleasant. It's a huge amount of information that's essentially useless, and you just want something to handle it for you. That's why last year we released a bunch of official clients that are very low level. It's for all those people who would prefer to use raw HTTP, but we think they shouldn't; they should use elasticsearch-py instead. elasticsearch-py is what you get when you do pip install elasticsearch. It's a very low level client, essentially just a one-to-one mapping to the REST API. There is nothing added, there are no opinions, because we really wanted nobody to have an excuse not to use this client. It's very extensible, it's modular, you can override its different parts. It supports all the APIs and all the parameters, and we actually have documentation for them. It's tested as part of the release cycle of Elasticsearch itself.
So if you're using Python and you're using Elasticsearch, there should be no reason not to use this client. But as I said, it's very raw, it's very low level. The only things it gives you on top of raw HTTP are the different methods for the different API endpoints, and it will do the serialization for you properly: it will take your Python dictionary, serialize it into JSON and send it over the wire. It will also do some smart things. For example, if it cannot reach a node, it will put that node on a timeout and talk to a different node instead. Or it can even ask the cluster, hey, what are the current nodes that are part of the cluster, so it can do the load balancing properly. But aside from that, it's fairly dumb. You still have to write the queries yourself, as Python dictionaries, which is much better than JSON because you can actually use trailing commas. Yay. But it's still fairly painful. For example, imagine that you have a query and you want to add a filter. First you need to determine: is it already a filtered query, so I can just add a filter, or is it a raw query that I need to convert to a filtered query first? Then inside the filter: is it already a bool filter that I just need to add something into, or do I need to convert the existing filter to a bool filter and add the new filter to it? That's painful. It's certainly doable, it's just Python dictionaries and nothing complicated, but it's painful and it should be easier. So I said, I don't want this; there should be a simpler way to do this. Enter Elasticsearch DSL. For now, it's essentially just a query builder for Elasticsearch. It relies on elasticsearch-py, the raw client, for transport and everything network and communication related.
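The conversion dance just described can be sketched as a small helper; this is not the library's code, just an illustration of why doing it by hand on plain dictionaries is painful.

```python
def add_filter(query_body, new_filter):
    """Merge new_filter into query_body, wrapping and converting as needed."""
    if "filtered" not in query_body:
        # a raw query: wrap it in a filtered query first
        return {"filtered": {"query": query_body, "filter": new_filter}}
    existing = query_body["filtered"].get("filter")
    if existing is None:
        query_body["filtered"]["filter"] = new_filter
    elif "bool" in existing:
        # already a bool filter: just append another condition
        existing["bool"].setdefault("must", []).append(new_filter)
    else:
        # a single filter: convert it into a bool filter holding both
        query_body["filtered"]["filter"] = {
            "bool": {"must": [existing, new_filter]}
        }
    return query_body

q = {"match": {"title": "php"}}
q = add_filter(q, {"term": {"tags": "python"}})       # wraps into filtered
q = add_filter(q, {"range": {"rating": {"gte": 1}}})  # converts to bool
print(q)
```

And this handles only one of the cases; the DSL does this kind of bookkeeping for queries, filters and aggregations so you never have to.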
So what it essentially does is build a query, serialize it into a Python dictionary, send it over, get the results back and present them to you in a nice wrapper. So you don't get a dictionary that contains a dictionary that contains a list of dictionaries which contain a dictionary with your actual data. This is how it looks. You basically define a search, you associate it with the low-level client so it knows how to communicate with the cluster, and you start querying. We'll look more deeply into how it works; suffice to say that you can just issue individual queries or filters and anything else against a search object, and we'll figure out under the hood how to combine them into the compound queries and filters, and we'll do the same for aggregations. And when you want the results back, we'll give you a nice class that you can actually access attributes on, so you don't need to use brackets everywhere and have a lot of work with that. So that's the high-level overview. What were the design decisions that we made? Well, the first one was that I was just sick and tired of typing brackets, square, curly; I felt like a Lisp programmer, and not in a good way. So that's the first thing that I really wanted to get rid of. Dictionaries are easy, it's a great data structure and it's very fast and easy to work with, but it's not really fun to write. So that's one part: we wanted no more brackets than we absolutely needed. The second part was we wanted to do the automatic composition. You don't need to know how to combine two queries, or what the logic is of combining a bool query with a match query, how does that work? We have simple rules that will actually do that for you. All you need to do is say, add this query to the mix, add another condition essentially, and we'll figure it out underneath.
All this while still allowing you to do it yourself if you absolutely need to, if you know what you're doing, and hopefully without any additional pain in that case. Also, one of our very important points is that we don't want to pretend to be something that we're not. We are not SQL. The query DSL is completely different from SQL: it has different capabilities, different semantics, and different syntax, for sure, and we don't want to shoehorn something like the Django ORM onto Elasticsearch. That would make no sense, because it would cut you off from 90% of Elasticsearch's features while still not supporting everything that the ORM can do. So it would be sort of a lowest common denominator, which in this case is very small. So we just own up to the fact that we are not SQL, we are not anything else, we are still Elasticsearch, and you should be familiar with the queries and filters that you can run against Elasticsearch. We'll try to take the pain away, but not the actual work, sorry. If you look at the example again, you can see that I'm manually specifying that, yes, this is a match query, and what I'm passing in via title= is the absolutely same thing that I would create a dictionary for in the raw DSL. So it's essentially just syntactic sugar for, in this case, creating a dictionary with one key, match, whose value is a dictionary with the key title and the value python. It maps very easily, so you don't need to learn another tool, you don't need to learn another DSL. You do need to know Elasticsearch (or you should learn it, if you don't), and then you can start using this immediately. And you can see we do the same for filters: there's a range filter with creation_date=, and where in the raw DSL you would have a nested dictionary, here you have the same, because I didn't want to try and invent some syntax, or overload the double underscores borrowed from Django, because that would get really hairy. So again, explicit is better than implicit. Go for it.
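The sugar really is that thin; a toy version of the translation from keyword arguments to the raw dictionary could look like this (the real library does much more, this just shows the mapping).

```python
def to_raw(query_name, **params):
    # s.query("match", title="python") corresponds to exactly this dict
    return {query_name: params}

print(to_raw("match", title="python"))
# → {'match': {'title': 'python'}}

# no invented __gte syntax: you pass the nested dict exactly as in the raw DSL
print(to_raw("range", creation_date={"gte": "2000-01-01"}))
# → {'range': {'creation_date': {'gte': '2000-01-01'}}}
```

Because the keyword arguments mirror the raw DSL one-to-one, any query you can read in the Elasticsearch documentation translates mechanically.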
So it's very close. You can, however, see one thing: in the second query I have something with a capital Q. That's a name that I borrowed from Django, and it's essentially a shortcut for when you want to create a query manually, outside of the search object, if you need to manipulate it. For example, if you need to negate it, or if you need to combine it with another query using an OR operator instead of an AND. We have those shortcuts for all the important objects in the DSL: queries, filters, aggregations and some others that we'll keep secret for now. What it will do underneath is look up the class that corresponds to the given query or filter type and just instantiate it. So it's really literally just a shortcut. You can even pass it the raw dictionary that you would otherwise use as the query; we'll see later how that can be used to facilitate the migration process if you want to switch over to this new library. And it of course supports boolean logic, so you can do AND, OR, and negations, and it will actually do the right thing. It even tries to be a little smarter: if you, for example, do a double negation, you will end up with the same filter or query, just so that you don't get ridiculously big queries once you work with them a little bit. So that's how you can construct queries outside of the search object and how you can work with them. Once you've constructed a query this way, you can just pass it into the search and everything will work as expected. The way you pass it into the search is by using the .query or .filter methods, and those, like almost everything else, will actually return a modified copy of the search object. Here we borrowed from Django's design, where the queryset is essentially immutable and every time you do something on it you get back a copy.
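The double-negation behavior can be illustrated with a toy Q-like class; this is a sketch of the idea, not the library's actual implementation.

```python
import copy

class ToyQ:
    """Minimal stand-in for the Q shortcut, only to show negation."""
    def __init__(self, name, **params):
        self.name, self.params, self.negated = name, params, False

    def __invert__(self):
        # ~q flips a flag instead of blindly nesting must_not clauses,
        # so ~~q serializes back to the original query
        clone = copy.copy(self)
        clone.negated = not self.negated
        return clone

    def to_dict(self):
        d = {self.name: self.params}
        return {"bool": {"must_not": [d]}} if self.negated else d

q = ToyQ("match", title="python")
print((~q).to_dict())
print((~~q).to_dict() == q.to_dict())  # double negation collapses: True
```

Without this kind of simplification, repeatedly manipulating a query would keep wrapping it in ever-deeper bool layers.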
So you shouldn't be afraid to pass it over to someone else or anything like that, and you can actually fork it and have two different versions. The only exception to this is aggregations, because there we needed the chaining behavior to be a little different. For queries you can do search.query(...).query(...).query(...) and add multiple queries on the same line. With aggregations, you want to do something a little different, at least that's what we came to expect. In Elasticsearch, when you define aggregations, you define a bucket and then metrics inside, because essentially any form of aggregation, be it SQL or NoSQL or anything else, is dividing your data into buckets and then calculating a metric, a computation, inside each of those buckets. So if you have a group by in SQL, you say group by this column, so you'll have a bucket for each value of that column, and then you want to see a count or a sum over some value; that's the calculation you run inside the bucket. Elasticsearch is very explicit about this, so we actually call it a bucket. Here in the first line, we are creating a bucket per tag, and inside we're looking for an average over something. This is just shorthand; I'm omitting all the parameters so it can actually fit on a slide. And then we're adding another metric, so we have one bucket with two metrics. On the other line, however, we have two buckets that are nested: we have one bucket, then a sub-bucket, and inside of that we have a metric. So you can see that the chaining behavior is a little different: bucket() will actually return itself, so you can call an aggregation, a metric, on it, whereas a metric will return its bucket, so that you can add another metric next to it. Just something to keep in mind once you start using this: the behavior there is slightly different. So the last thing we have is the response object.
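In the raw JSON, the two shapes just described, one bucket with two metrics next to each other versus two nested buckets with a metric inside, come out like this; the field names are illustrative.

```python
# One bucket, two metrics side by side inside it:
one_bucket_two_metrics = {
    "per_tag": {
        "terms": {"field": "tags"},
        "aggs": {
            "avg_comments": {"avg": {"field": "comment_count"}},
            "max_rating": {"max": {"field": "rating"}},
        },
    }
}

# Two nested buckets, with the metric inside the inner bucket:
nested_buckets = {
    "per_tag": {
        "terms": {"field": "tags"},
        "aggs": {
            "per_month": {
                "date_histogram": {"field": "creation_date",
                                   "interval": "month"},
                "aggs": {"avg_rating": {"avg": {"field": "rating"}}},
            }
        },
    }
}
```

The chaining rules (bucket returns itself, a metric returns its bucket) exist precisely so that both of these nesting shapes can be written left to right on one line.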
I mentioned it several times that you get back a fancy response object instead of just a huge dictionary containing nested data structures. The response object has a success method which will tell you: did I actually reach all the data that I needed? Because Elasticsearch will happily keep serving your search requests even if half the cluster is down; it will tell you that half the cluster is down and that it couldn't reach half of the data, but it will still try to return you something. So you can ask, hey, was this a success, did I reach everything? Yes. And then you can just iterate over it and get the individual hits. With the raw response, you get the metadata, and as part of the metadata you have the source. That isn't really practical for the normal use case, so we inverted it: you get the object back directly, so you can see that I'm doing hit.title, and if you want to access any of the metadata, you just do hit.meta.id, or the document type, or the index, or any of the metadata that are typically associated with a document in Elasticsearch, the score as well. So you can use attribute access; you don't need to use square brackets and string keys to access the data, and the same goes for the overall response. You can just do response.aggregations.per_tag.buckets, then access the first element and do .value and things like that. So it's much more convenient to work with. We even added hooks for introspection, as we'll see in the demo, so IPython will correctly autocomplete everything. So this is essentially all that we have done. Now, what do you do if you want to start using it? If you have a fresh project, congratulations, I envy you from all of my heart. If you don't, hopefully you're already using the low-level client. In that case, you already have the dictionaries with your queries lying around.
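The attribute-access idea is simple to sketch in plain Python; this toy wrapper conveys the principle, though the library's real response classes do considerably more (metadata, iteration, introspection hooks).

```python
class AttrWrapper:
    """Wrap nested dicts so keys can be read as attributes."""
    def __init__(self, data):
        self._data = data

    def __getattr__(self, name):
        try:
            value = self._data[name]
        except KeyError:
            raise AttributeError(name)
        # wrap nested dicts on the fly so access chains keep working
        return AttrWrapper(value) if isinstance(value, dict) else value

hit = AttrWrapper({
    "title": "How do I parse JSON in PHP?",
    "owner": {"display_name": "someone"},
})
print(hit.title)
print(hit.owner.display_name)
```

Because nested dicts get wrapped lazily, arbitrarily deep chains like response.aggregations.per_tag.buckets work without precomputing anything.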
So what you can actually do is create a search object from the dictionary, manipulate it however you wish, and then either execute it directly, or serialize it back to a dictionary and plug it back into your existing code. For example, if you have a query somewhere and you wish it were simpler to add a filter to it, just create a search object from it, add the filter, serialize it again, and nobody needs to know that you actually cheated and used a different library instead of doing the work yourself. So now let's see if everything works. Can you read this? Somewhere in the back, can you read this? Thank you. The first thing I'll show you is how the migration actually works. Let's assume that we have a dictionary like this, containing a typical query to Elasticsearch. It's a pleasure to read. What we can do is create a search object from it. We can actually already see how it would look if we wrote it using the DSL, using the Q notation; that's its representation. We can associate it with the low-level client (es is just an instance of the Elasticsearch client), and now we can finally execute it to get a response. You can see the response has hits.total: in total we have hit 48 documents out of the approximately 500,000 that I currently have loaded. You can get the first one, and you can see that it has a title, it has comments, plural. And it even has something like an owner, which is actually a nested document, so we can continue: hit.owner.display_name. So the first question that we found was actually asked by Joel Fan. That's unsurprising, given the data set. So this is the basics: if you already have a query and you just want to plug it in, you create a search object from it and start querying. If instead you're starting fresh, you just create a search object yourself.
Now this search object, if I do a count, will actually match absolutely everything. So we have, okay, just 200,000. We can also limit it to a certain doc type; we're only looking to search for questions. We can see how this has changed, and we have no questions. It actually should have been 'question'; this is what happens when I'm not copy-pasting things as I should be. The corresponding doc type is 'question', and if I just query for that doc type, it will actually add the doc type in there. So if I do a count now, it correctly returns. Now let's say that I actually want to do a query. I want to have a match, and again, for the title, just use python. You can immediately see that it's exactly the same as the dictionary would look. If I now add some more queries and filters and aggregations, which I will not type in, you can see that it gets more complicated. Through gradual steps, you don't need to know that you should have used a filtered query with a bool filter with all this stuff; by just adding a filter and then adding another filter, it will first be converted from a query to a filtered query, and then the filter in the filtered query will be converted to a bool filter. You can also see the aggregations that I defined in this step: I defined a bucket per_tag, which is a terms aggregation over the field tags, and inside I'm asking for a metric. I want it returned under the name max_score, and I'm saying it's a max aggregation over the field rating. When I execute this, I get the aggregations back. I can see per_tag, I can see the buckets; these are all the tags that were in the result set. I can get the first one and just get the key. The first one was, obviously, python. So we just learned that when you query Stack Overflow for Python, the most common tag in the results will actually be python. What a surprise.
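The aggregation section of the response that the demo walks through would have roughly this shape; the doc counts here are invented, and 134 is the max score quoted in the demo.

```python
aggregations = {
    "per_tag": {
        "buckets": [
            {"key": "python", "doc_count": 48,
             "max_score": {"value": 134.0}},
            {"key": "django", "doc_count": 12,
             "max_score": {"value": 87.0}},
        ]
    }
}

first = aggregations["per_tag"]["buckets"][0]
print(first["key"])                 # → python
print(first["max_score"]["value"])  # → 134.0
```

With the DSL's response wrapper, the same data is reachable as response.aggregations.per_tag.buckets[0].key instead of the bracket chains above.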
And we can actually also ask for the max score within python: for python, the max score of a question was 134. So this is the analytics part of Elasticsearch, where you could easily take all these values and visualize them very nicely using JavaScript or matplotlib or anything else. For the last part of the demo, I'll show you how to construct the queries yourself. Let's start with creating a query: we are looking for title python and not body ruby. Then we'll create filters the same way: we're looking for tags python, or a range on the creation date smaller than now. So we're looking either for documents that are not from the future, I know it makes no sense, but bear with me, or that are tagged with python. Yes, this filter will match everything, but it's really hard to come up with demos that actually do something. Then what we can do is manually wrap this query. We have a construct called a function_score query, and that allows you to take a query and provide Elasticsearch with a formula for how to actually calculate the score, if you know better. And we do: we have a field in our documents called rating that is human contributed. Some humans actually said that this is a good question, and it would be a shame for us to ignore that information. So with this line I'm saying the query is now a function_score query wrapping the original one: I'm saying query=q, and the function that I want to run on it is a script_score function, and the script just multiplies the score by 10. Here, instead of 10, I would typically access the field, but that wouldn't fit on a slide, so that part is left to your imagination. And now we just create a search object from it, which hopefully we will be able to execute. Yes. And you can also see what we created by the three steps that I showed; this is the query that we put together.
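Put together, the wrapped query from these steps would serialize to roughly this dictionary; the "not from the future" range filter is omitted for brevity, and the script is the simplified times-ten version from the slide.

```python
function_score_query = {
    "function_score": {
        # the original query: title python, and not body ruby
        "query": {
            "bool": {
                "must": [{"match": {"title": "python"}}],
                "must_not": [{"match": {"body": "ruby"}}],
            }
        },
        "functions": [{
            # in practice you would reference the human-contributed
            # rating field here instead of a constant factor
            "script_score": {"script": "_score * 10"},
        }],
    }
}

print(function_score_query["function_score"]["functions"])
```

The wrapping is the point: the inner query still decides what matches, while the script only reshapes the score that comes out of it.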
It would not be impossible to write this by hand, but I certainly would not want to. So that's the goal of this library: to allow you to create and run this kind of query easily. So that was the DSL, and now let's see how you can actually plug it in, how you can use it. If you want to use Elasticsearch from your Django application, this is all the code you need, more or less. This is the code to actually index all your data into Elasticsearch. In the first part you just do a bulk load: you iterate over all your model instances (the model is called model, for some reason), you call a to_dict method on them, and then just index that. The second example is a simple function that you can register as a signal handler for post_save, and it will update Elasticsearch after any change in a document. This is literally all you need. You probably want to get a little fancier by specifying the schema; that's the line with put_mapping. But this is all you need, and I do want to make this more automatic in the future, but for now this is what you need, and then you query as usual, as I showed in the demo. Just construct the query, run it, and you get data back. So that was the Django integration, the helicopter overview. What is next for this library? First, I want to extend it to cover not only queries but also mappings, because that's also something people struggle with. How do I define the mapping? The mapping is the schema, and it also has a fairly complicated syntax and semantics, which is very powerful but sometimes a little overwhelming. So that's the first part. Once we have the mappings, we also have the information about the types that are stored in Elasticsearch, and at that point we can implement a persistence layer: essentially a model, something that has a .save() method.
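The slide's code isn't in the transcript, but the indexing flow it describes can be sketched with stub classes so it runs standalone; the model, the to_dict method, and the stub client are all assumptions for illustration, and only the overall pattern (bulk load plus a post_save handler) follows the talk.

```python
class StubClient:
    """Stand-in for the low-level Elasticsearch client, records index calls."""
    def __init__(self):
        self.docs = {}

    def index(self, index, doc_type, id, body):
        self.docs[(index, doc_type, id)] = body

class Question:
    """Stand-in for a Django model."""
    def __init__(self, pk, title, rating):
        self.pk, self.title, self.rating = pk, title, rating

    def to_dict(self):
        return {"title": self.title, "rating": self.rating}

es = StubClient()

def update_index(sender, instance, **kwargs):
    # the function you would register as a post_save signal handler
    es.index(index="stackoverflow", doc_type="question",
             id=instance.pk, body=instance.to_dict())

# bulk load: iterate over all model instances and index each one
for q in [Question(1, "Why PHP?", 7), Question(2, "Why not Python?", 134)]:
    update_index(Question, q)

print(len(es.docs))  # → 2
```

With the real client, the handler body stays essentially the same; the signature matches what Django's post_save signal passes to receivers.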
And we can do that because now we know how to serialize and deserialize even things like nested documents, so we can wrap them in their respective document classes, and we know how to deserialize a datetime. Currently we return a datetime just as a string, because JSON has no support for datetimes, so the only other way to do it would be by matching a regex against every single field; that's not very good and by far not performant enough. And once we have the persistence layer, it's only a short step to proper Django integration, to actually be able to correlate the documents with the models. So that's all from me. I would love to thank Rob Hudson and William from Mozilla; they helped me a lot when designing this library and they tested it. Mozilla has been brave enough to already run this in production, so kudos to them. I still haven't gotten any complaints, so I'm guessing I haven't interfered with their operations by creating this library. That's good news for me, and now, if you have any questions, I'll be more than happy to answer them.