If you weren't awake before, you are now. A little room to run here. Cool. Sounds loud enough. Everyone good in the back? Sweet. Cool. Well, thank you everybody for coming. Howard and I are going to tell you about Elasticsearch and how we fell in love. We think after spending a little bit of time with it, you're going to fall in love, too. Just for a quick reference, I am the red lightsaber up here, laser pointer. I'm the green one. So if you see us, we might be playing. Yeah. I'm the good guy.

Just to get a pulse from the audience and get warmed up: who here is using Elasticsearch? Cool, maybe a third or so. Who here is interested and curious about Elasticsearch? OK, cool. Excellent. You came to the right place. And it's funny that you all wandered in here. And of the people that are using Elasticsearch, how many are using it in production, or with a cluster bigger than size one? Yeah. So a little bit. Cool, that gives us a frame for some of the stuff we're going to go into and what might make the most sense for you guys.

I'm Nick Stielau, I'm director of engineering at Pantheon. And yeah, I like Elasticsearch. It gets me really excited. You can do really cool stuff. And I want to share with you guys today some of what makes it cool. And Howard thinks I'm a smooth operator. So you can see that right there.

Hey, I'm Howard Tyson. I'm tizzo on Drupal.org and everywhere else I can get it. I work at Zivtech. ZivTechnician is something I'm trying out. I think it's probably not the right call, though. Smooth ZivTechnician. Smooth, smooth ZivTechnician. Yeah, I'm the vice president of engineering, which mostly means that I'm the one that gets to play with new stuff and decide whether it's going to eat our lunch or not. And I came to the conclusion that Elasticsearch was not going to eat our lunch. It was going to buy us lunch.
And I'm hoping to convince you of the same thing today. Yeah, so we're going to do that by giving you an intro, going through some examples, looking at what it looks like to interact with Elasticsearch on the command line in some basic ways, and talking about clustering and stuff. And we have a really cool demo lined up at the end, where we've indexed all of the Git log information from Drupal core. We're going to show you that and hopefully get you guys excited about the possibility of querying it and doing some fun stuff.

Yeah, I think the thing that makes Elasticsearch so cool is how flexible it is. So that's our example to walk you guys through: here's a use case of looking at some data that we're all familiar with and we all care about, and what can we do with that in 10 minutes at the end of the session? And it's a live demo, so what could possibly go wrong on the last day of DrupalCon? This is all up online, as well as a little README that explains how to do the demo we're showing here yourself. It gives you all the resources you need to set it up. The slides that are up there need to be updated; we've got a couple of new commits that we just committed, but we'll push right after the session here.

So, Nick, what's Elasticsearch? That's a good question. I'm so glad you asked, Howard. Elasticsearch is a distributed, open source search and analytics engine designed for horizontal scalability and easy management. That's a bit of a mouthful, so we're going to dig into what that means. First, it's distributed, which basically means you can have one or more of these nodes and they can build out a cluster, which we'll talk about and look at a little bit more. That's the Elastic in Elasticsearch: you can start with one of these nodes and just keep adding to it.
Also, we'll go over some concepts which are generally relevant to other distributed systems, maybe Cassandra, something like that. Distributed systems are a whole interesting area of computing: analyzing failure modes, either at scale or with real-world problems like the network not being reliable, that kind of thing.

It's open source, like all good things at this conference and everywhere else. It's under the Apache license, so it's pretty liberal; you can do what you want with it.

And it's document-oriented. Yeah, so that basically means you take whatever can be represented as JSON, key-value pairs, and throw it in there. You don't have to pre-define the schema or anything like that. It accepts nested documents, that kind of stuff. So, within reason, you can take any JSON you have, put it right into Elasticsearch, and start playing around with it. Our example document here is what we're going to be showing in the demo at the end: an individual commit, as you might see from the core queue. That @timestamp field is a pattern people use with Elasticsearch because it makes this tool Kibana, which we're going to talk about, just work out of the box, although you can do whatever.

So Elasticsearch is described as near real time, and CosmicAlph over here is going to point out that the only difference between real time and near real time is that near real time systems are not actually real time, right? So what does near real time mean? Basically, ideally, when you put a document into Elasticsearch, the next query you do will include it in the result set if it's relevant.
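To make that concrete, here's a sketch of the kind of commit document described above. The field names (`ref`, `author_name`, `message`) are made up for illustration, not the demo's actual schema; `@timestamp` is the Kibana-friendly convention just mentioned.

```python
import json

# A hypothetical commit document; field names are illustrative.
# "@timestamp" follows the convention that lets Kibana pick up
# the time field automatically.
commit_doc = {
    "@timestamp": "2014-06-05T14:32:00Z",
    "ref": "a1b2c3d4",
    "author_name": "xjm",
    "committer_name": "webchick",
    "message": "Issue #2233456 by xjm: Fixed Views field handler.",
}

# This is the JSON body you would POST to the index.
body = json.dumps(commit_doc)
print(body)
```

Any nesting you like is fine too; Elasticsearch will infer a mapping for whatever shape you throw at it.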
This is really to distinguish Elasticsearch from batch-based indexing. It's not instantaneous, but it's near real time: if you insert a document and then query at the speed of a human, you should get that document in the result set.

And it is built on Apache Lucene, like almost all good open source search stuff. Lucene is a Java library; it's the same thing that powers Solr and a number of other projects. Lucene provides a Java API, so you can write Java code and use Lucene to really easily add all those fancy features, like language stemming: telling that running, runs, and maybe even ran are actually the same word meaning-wise, or apple and apples, right? Getting all that stuff right is really hard. So thankfully, the open source community has largely standardized on Lucene, and then it's a question of, well, a lot of us aren't Java application developers and don't want to embed the search directly into whatever we're doing, so how do we wrap that up into a nice interface?

Yeah, so that gives you an idea of the differences and similarities between Solr and Elasticsearch. They both have the same core search technology, Lucene, and they're basically just packaging it up with different APIs and different ways to manage it and make it easy to use.

And again, the elastic in Elasticsearch is that it's clusterable. So here we've got a snapshot of a plugin for Elasticsearch that lets you visualize what's going on with the cluster. This is our cluster at Pantheon, which consists of five different nodes; you can see those in the left column. There are a couple of different things going on here. We have two data nodes and three master nodes. In a distributed system, one responsibility within that system is the management of the cluster.
By definition, a distributed system is in multiple locations, usually multiple servers, could even be different data centers, and something needs to take responsibility for understanding which data should be where and what the health of the overall cluster is. In this case, we have three management master nodes. They use a consensus algorithm to go through a process called leader election to pick the current master, who's going to know the most about the cluster and be the authoritative source at that point in time for what's going on. You can see the little star on the left designating which node out of the three is the current master. So here we're using different servers to separate the different roles: one role is to hold and serve the data we're indexing, and the other is to manage the cluster and know which node should have which data, that sort of thing.

Moving over to the right a little bit, each of the verticals there represents a different index. On that leftmost index that Howard is pointing to, there are five what are called shards, which are sections of your data. And in the leftmost column, the two data nodes each have a box indicating that they have data. So shard zero is on each node, shard one is on each node. That's because we've configured it to have a replica: that data is stored on two different nodes, so that if one goes down, we still have it. Now if you look at the other indexes we have, which are log data, those do not have a replica configured. That's a trade-off: maybe this data isn't super important, we don't mind if it gets lost, so we don't need to replicate it. And that lets us scale out a heavier data set across more servers.
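To put numbers on that trade-off, here's a sketch (in Python, just to build the JSON settings bodies) of the two kinds of index configuration described above. The shard and replica counts are the only knobs being turned; everything else is left at defaults.

```python
import json

# Settings body for an index whose data we can't afford to lose:
# five shards, each with one replica held on a different node.
durable_index = {"settings": {"number_of_shards": 5, "number_of_replicas": 1}}

# Settings body for throwaway log data: no replicas, so the same
# hardware can hold more primary data instead.
log_index = {"settings": {"number_of_shards": 5, "number_of_replicas": 0}}

# With one replica, every shard exists on two nodes:
# total shard copies = primaries * (1 + replicas).
total_copies = durable_index["settings"]["number_of_shards"] * (
    1 + durable_index["settings"]["number_of_replicas"]
)
print(total_copies)  # 10
```

So the replicated index costs twice the storage of the log index for the same data, in exchange for surviving the loss of a data node.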
An important thing, which we got into talking with some people about quite a bit last night, is data security and integrity in Elasticsearch. Elasticsearch is not MySQL. It's not trying to be a transactional, highly consistent data store. Joe Miller is an engineer at Pantheon who helped set up our cluster, and he wanted, sarcastically, to let you all know that it's a great primary data store for all your critical, mutable data. That's not true. That's not what you want to be using Elasticsearch for.

There's a great series called the Call Me Maybe series of blog posts about distributed systems. The joke is the pop culture reference: in a distributed system, you may or may not get that call back, and the data may or may not actually get there. And I love this one little quote from the Call Me Maybe analysis of Elasticsearch. The Elasticsearch marketing website is all, we put your data first, and after really digging into the code, the analysis comes out saying, well, to be precise, it's not putting your data first, it's putting it zero to five seconds after you write the data. And if the server were to blow up within those five seconds, you're not guaranteed to have that data. So this should help frame how to use Elasticsearch: not as a primary data store. It's great if you can use MySQL to regenerate the index data. Or other stuff; in our case, we're going through the Git log, so we can blow away all of Elasticsearch, rerun the Git log import from Drupal core, and get all that back in as index data.

The Call Me Maybe series is really fascinating, too. I highly recommend checking it out as extended reading. The author takes all these distributed systems that promise to be able to survive all kinds of partitions and failures and stuff, and he starts finding out what happens when you really start just pulling the plugs while they're running. Does a really nice job of that.
Dude, Nick, I don't know why I should care about Elasticsearch. I've been searching for nodes since like 2005, man. You may have been searching, but you haven't been searching in style.

Yeah, so, like a lot of people, I think I started out doing search in MySQL. You just throw everything into MySQL, it's your answer for everything, so you just start doing your search there. It's the kind of thing that we all sort of experiment with in college, like a lot of other things. But like those other things, at some point you need to grow out of that college-age experimentation and realize that actually, I'm a grown-up now and I shouldn't be doing these filthy things to my database. And even if you don't run into the performance issues, I think you really just end up with uninspired results, and we want to inspire ourselves and our customers and make awesome websites. Yeah, MySQL search not only will bring your site to a screeching halt, but it also has the benefit of giving you terrible search results.

So, Elasticsearch versus Solr. We got at this a little bit before. They're both Apache 2 licensed. They're both supported by big commercial companies that are pushing them. They both support indexing over HTTP. We're not going to spend the whole time running this down. solr-vs-elasticsearch.com will give you your giant wall of checklists. Every time somebody adds a checkmark on one side, the other organization runs and tries to add it on the other side as well. There are some features that are on one and not the other, and if you really care about one of those, it could be a decider for you. Yeah, but really, when you look at the checkbox-based comparison, you're like, well, they're pretty much the same.
This one has this feature that I don't really get or need, and this one has this other one, but really, what it comes down to for me is more of a gut feeling, that kind of inspiration, those aha moments that I get just playing with Elasticsearch. Talking with people who use Elasticsearch for Drupal and WordPress stuff, some pretty big sites doing some really cool stuff, they're kind of like, well, I guess it is apples to apples, but it's like comparing this mealy, wormy, gross apple to a fresh-off-the-tree Red Delicious. Elasticsearch is the new apple. Yeah, it's the new apple.

Yeah, I think also it's not just about the product, it's also about the documentation. I mean, I've been working with Solr for, I don't know, eight years or something now, and every time I need to learn how to do something, it's just this deep sigh, and I'm like, can I just not do that? Can I make Views do this, maybe? Because every time you dig into their documentation, it's just this stupid rabbit hole, and you keep getting links to Javadocs for Lucene queries, and maybe you'll find a way to do it, but you don't even know where to look. And when you download Solr, even today, you get an example folder. It's not like, here's the thing, run it. It's like, here's a way you could run it with Jetty, I guess. You download Elasticsearch, there's a bin folder with a bash script called elasticsearch, and you just run that. And just wait for the demo, because we're going to run that script about six times.

Does anyone know what Sphinx is? Has anyone used it? Sweet. Awesome. Well, I wasn't too sure about it. So Sphinx is another option. I was going to make a joke about nobody raising their hand and not needing to talk about it much, which is why there's no slide content. We were so sure there wouldn't be a single hand, and there were two.
So talk to anyone who raised their hand if you want to know how Sphinx compares to Elasticsearch. And then I was just going to make a little jab here: in case you're wondering how many capitals are in Elasticsearch, it's just one. So you don't need to write code that changes that. Nick's really excited that Pantheon now supports WordPress.

You know, there ain't no party like a curl party. So we figured that was probably the best way to show you guys how you actually use this stuff. Yeah, you're probably not going to be shelling out to bash and curl in your production systems, but just the fact that you can do it with curl on the command line gives you an idea of how simple it is, and we can all translate in our heads how that would map to the code we're writing. That second curl there is a bug, by the way.

But yeah, fun. As you can see here, if you don't create an index, you can just start throwing documents into Elasticsearch, and it'll create one for you and try to figure out what the mappings should be. But if you do deliberately create an index, you can actually tell it how many shards to use. That's how many slices Elasticsearch should carve your data into, and those slices are what get distributed among the servers. A lot of the time you could probably let Elasticsearch make that decision for you, but if you only have a single node, it might be a good idea to just say one. I'm not sure. Number of replicas is how many duplicate safe copies you want of this data sitting around. So here we're ensuring that if some node dies, we're not going to lose any data. These are details you might learn along the way; we're just trying to fall in love here, so we can ignore that stuff. It's the first date, you know? You don't have to. So, and here is inserting a document.
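If you want to follow along at home, the create-an-index and insert-a-document steps might look roughly like this. Here we just build the requests as (method, URL, body) tuples rather than actually curling, and the index name `commits` is made up for the example.

```python
import json

# Hypothetical requests mirroring the curl slides: explicitly create
# an index with chosen shard/replica counts, then index a document.
create_index = (
    "PUT",
    "http://localhost:9200/commits",
    json.dumps({"settings": {"number_of_shards": 1, "number_of_replicas": 0}}),
)

insert_doc = (
    "POST",
    "http://localhost:9200/commits/commit/",  # POST lets ES assign the ID
    json.dumps({"author_name": "xjm", "message": "Initial commit."}),
)

for method, url, body in (create_index, insert_doc):
    print(method, url, body)
```

The response to the insert includes the index, the generated ID, and a version number, which is what the next part walks through.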
Pretty simple, and here again you can see this is all stuff you'd get out of the Git log: commit ref, committer name, author name, some date stuff, what the commit message was. And when you get this back, it tells you where it put it, in the logstash index, what the ID is, and what the version of that document is. And hey, it's CRUD, so you can update it: replace that POST with a PUT, put in the ID you got back when you inserted the first record, and you can push some updates to the same thing, and it'll tell you that the version of this document is now two, because you've updated it. And two comes after one, so it works out well. You can actually do partial updates too, but that involves going deeper. So this is the simple CRUD: you push a new document and it completely replaces the old one. And here, again, this is CRUD, so this is the retrieve part. Separate from search, you can just grab a document by ID.

Cool, let's get to the meaty stuff. So here, maybe the use case is: we have all this Git data in there, and I want to find commits by xjm with a subject matching views. The two key parts of the search are the query, which can be a Lucene query that you can get fancy with, and the filter. In this case, the filter is going to restrict results to ones where the author name is xjm, and then within that, it's going to use the query to find matching results.

Nick, did you? No, I didn't put this in. Did you? Is this like a metaphor for Elasticsearch or something? I think. I mean, I feel like Solr's the thing that kind of blows. Yeah, maybe I can't. Let's just move on, let's just move on.

Okay, so far you've just seen that you can throw some stuff in, you can pull some stuff back out, you can do a search, right? Solr probably would have looked pretty much exactly the same so far. So what can Elasticsearch do for you that Solr can't?
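In the Elasticsearch 1.x API that was current here, that author-plus-subject search is a `filtered` query: a `term` filter pinning the author, with a `match` query inside it. The field names (`author_name`, `message`) are guesses at the demo's schema.

```python
import json

# "Commits by xjm whose message matches views": the filter restricts
# the candidate set exactly (and is cacheable), while the query
# scores matches within it. ES 1.x "filtered" query syntax.
search_body = {
    "query": {
        "filtered": {
            "query": {"match": {"message": "views"}},
            "filter": {"term": {"author_name": "xjm"}},
        }
    }
}

print(json.dumps(search_body, indent=2))
```

You'd POST that body to `/<index>/_search`; the filter/query split is the thing to notice, since filters don't affect scoring.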
Yeah, so I think about Arthur C. Clarke's laws. I guess he just got to make laws; he was a science fiction writer. The one about any sufficiently advanced technology being indistinguishable from magic. And especially when you start using Elasticsearch, it kind of feels like magic, so we're going to try to communicate that to you.

One really cool thing Elasticsearch has is the ability to create histograms. If you've ever used Photoshop, you can see, for an image, how many pixels are totally dark and how many are totally light, and what the distribution is. I feel like if most people have encountered a histogram outside of this context, that's probably the kind of thing they've seen. Well, Elasticsearch makes it super easy to make histograms of pretty much any kind of data. It's built in as a primitive kind of query, where you can say how you want to do a histogram query, how you want to do different bucketing, even in a nested way: take all the things that fit into this category, give me overall counts, and then subdivide within that.

So this is a little snapshot from our centralized logging at Pantheon. This is being bucketed on the timestamp field, per one hour, and you can see at the bottom how many of those different events on Pantheon are happening over those one-hour periods. And then the same thing for counts split up by role. So there are like 164,000 events that were caused by the superuser.

Nick, do you want to talk us through the score boosting? I sure do. So this is a pretty cool use case, right? This is where we start to get a little bit of inspiration into our search results. So you're like, okay, I have a lot of nodes.
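That hour-by-hour, per-role dashboard is an aggregations request underneath. A sketch, assuming a hypothetical `role` field for the breakdown (this is the 1.x aggregations syntax):

```python
import json

# Bucket events into one-hour buckets on @timestamp, then within
# each bucket count events per role. "size": 0 skips returning the
# matching documents themselves, since we only want the counts.
agg_body = {
    "size": 0,
    "aggs": {
        "events_per_hour": {
            "date_histogram": {"field": "@timestamp", "interval": "1h"},
            "aggs": {
                "by_role": {"terms": {"field": "role"}},
            },
        }
    },
}

print(json.dumps(agg_body, indent=2))
```

Nesting a `terms` aggregation inside the `date_histogram` is the "subdivide within that" part: one request gives you both the overall hourly counts and the per-role split.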
I have, I don't know, 10,000 nodes, and what would be really cool is if we could boost the search results based on what the popular nodes are. Okay, cool, let's figure out how to do that. Where do we store data about what our popular nodes are? Okay, Google Analytics. All right, so you could write a quick script that just goes to Google Analytics, gets a list of the top 100 most popular pages on your site, and puts that into Elasticsearch along with each document. We could just make a custom field, maybe call it GA rank or something like that. So we'd have 100 articles with a GA rank between one and 100. And we can use the function score query. As you can see in the snippet, we say the origin is zero, which means the closer the rank is to zero, the more you want to boost that search result. And we only really care about the top 100 or so; past that, don't give it much of a boost factor. So in this case, it's pretty straightforward to have a little cron script or whatever that hits Google Analytics, gets your most popular articles, updates Elasticsearch with that ranking, and then, really easily, all your search results boost those popular articles to the top. You could even take this further using the same kind of function and boost popular articles that were recently published. This is where you can start to see the power and inspiration and flexibility in Elasticsearch.

Elasticsearch also has geo fields. I've heard about people arranging things in a two-dimensional space so that you can do searches based on proximity: not just latitude and longitude, but any two values that create a two-dimensional plane, and then do regional searches based on that. You can get really crazy with this stuff.

Percolate. Percolate is a kind of bizarre feature that's really cool.
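A sketch of that Google Analytics boost, assuming a hypothetical `ga_rank` field (1 = most popular) that your cron script writes into each document. `gauss` is one of the decay functions `function_score` supports; origin 0 and scale 100 mean the boost falls off smoothly as the rank approaches 100:

```python
import json

# Boost documents whose ga_rank is close to 0 (i.e. most popular).
# Beyond a rank of roughly 100, the gaussian decay contributes
# little, which matches "only care about the top 100."
search_body = {
    "query": {
        "function_score": {
            "query": {"match": {"body": "some search terms"}},
            "functions": [
                {"gauss": {"ga_rank": {"origin": 0, "scale": 100}}},
            ],
        }
    }
}

print(json.dumps(search_body, indent=2))
```

The recency variant mentioned above would just add a second decay function on a date field, so popular-and-recent articles float highest.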
What it lets you do, in the abstract, is store a set of queries and then use documents to ask: which queries would find this document? So you can basically say, hey, Solr... sorry, Solr. Solr's just like, I don't know what you're talking about. So then you ask Elasticsearch: hey, Elasticsearch, I want you to remember this search where I want to find all of the commits by Jess, and I want you to remember this one for all of the commits that are related to security. And then you can start taking individual documents and asking, which of these does this match? One, both, all? So yeah: yo dawg, I heard you like search, so I put a search in your search so you can search for your searches. Which is actually just taken from the Elasticsearch documentation page.

To bring that back to our example with the Drupal core commit Git log stuff: a cool way we could use this is to classify those commits. Maybe we'll make one percolator called security, and we'll say, hey, if it has XSS in the message, or security, then it matches that. We'll have another one called admin UI, and we'll say, if it has admin and UI in the message, that counts as a hit for admin UI. What we can do then is, say we get one of our commit messages coming through, we can query Elasticsearch and say, hey, I have this commit with this message, what does it match? In this case, we'd get some results back that say, hey, you know what, that matches security. And we could use that to tag that commit and have that tag in the database. So it's a cool way to classify data. So yeah, basically here we just push a couple of queries, the same kind that we could run against the indexed documents, and that stores them.
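In ES 1.x terms, registering those two classifiers and then asking which ones a commit matches might look roughly like this. The `.percolator` type is the 1.x mechanism for storing queries, but the index name, query bodies, and document are made up for the example.

```python
import json

# Register two named percolator queries against a hypothetical
# "commits" index (ES 1.x stores them under the ".percolator" type).
register_security = (
    "PUT",
    "http://localhost:9200/commits/.percolator/security",
    json.dumps({"query": {"match": {"message": "XSS security"}}}),
)
register_admin_ui = (
    "PUT",
    "http://localhost:9200/commits/.percolator/admin_ui",
    json.dumps({"query": {"match": {"message": "admin UI"}}}),
)

# Then ask: which stored queries match this document? The response
# lists the IDs of matching queries, e.g. "security".
percolate = (
    "GET",
    "http://localhost:9200/commits/commit/_percolate",
    json.dumps({"doc": {"message": "SA-CORE: fixed an XSS security issue"}}),
)

for method, url, body in (register_security, register_admin_ui, percolate):
    print(method, url, body)
```

Note the inversion: the stored things are queries, and the `_percolate` request carries the document.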
And then we can just chuck a document at it, and it'll tell us which of those queries match, which can be a really cool way to do something like alerting. Maybe if you're aggregating PHP error logs, you could say, any time I find a stack trace... or you could follow syslog and say, if I ever find OOM, then wake somebody up. So it's a really cool way to do that. One note: it doesn't auto-tag. It's not like you can say, I'm adding this, run these queries, and tag it with what matched; you'd have to do that yourself. But it's just a way to take search and turn it inside out: instead of storing all the documents and then asking questions about them, you store all the questions and then throw documents in to find which questions matched. Which opens you up to doing a whole bunch of stuff that I never would have thought about before.

So, integrating Drupal with Elasticsearch. There are a few ways to do it, and some of them are better than others. There are a couple of different Drupal modules you can use. The one that you might be inclined to reach for: who here uses Search API? Yeah, so you'd search for Search API Elasticsearch and be like, oh hey, there's an integration, perfect. This'll be just like Solr, I'll just plug it in and it'll work, right? What's nice is that Search API Elasticsearch has really complete integration right now with all of Search API's features. You can set up all the different analysis steps, you can reorder them, you can do all kinds of different query creation, and it does all the same things you would expect after working with Search API and Solr. The bad is that it's going to error all the time and maybe lose your data. So if you want to chip in on getting Drupal working well with Search API and Elasticsearch, this is a really great place for contributions. I don't know the maintainer, but I suspect patches are probably welcome.
There's another module called Elasticsearch Connector. This, I think, is sort of more interesting and much more complete, and it's a nice framework of an API where you can define the clusters and indexes you have, in the abstract, globally within your Drupal site, and then use them from all these different submodules it ships with. So there's one with Views integration, which I can show if we have time. When you're creating a view, it'll find all the indexes in Elasticsearch and make them available as things you can create views of, and then it introspects the index, finds all the fields in there, and exposes those as well. So you don't even have to be using Drupal to put the stuff into Elasticsearch: it can be getting into Elasticsearch from curl, and then you can use Views to build lists of it on your site, which is pretty awesome. It also comes with a cool statistics module that, Google Analytics style, will just chuck every authenticated page request, every non-cached page request, anything that's not hitting Varnish or whatever, in there, so that you can make views and histograms and all kinds of things of what's actually happening on your site.

The bad: this module also ships with Search API integration, but it doesn't enable a lot of the features you'd be looking for. You can't adjust whether it's doing stemming, or highlighting, or a whole bunch of other features that are really nice, that are built into Search API and work with Solr.

If you're trying to decide, for my Drupal site, should I use Elasticsearch or Solr for my primary content search, I just want to index all my nodes? I'm sorry to say this at an Elasticsearch talk, but just use Solr. You'll be happier with your life.
It's more complete, it's more well tested. Search API gives you not quite the lowest common denominator, but a common interface across these backends, so that in theory you can't even tell whether it's Solr or Mongo or Elasticsearch or whatever. And through a common interface that only has the shared features, you're not going to get much out of Elasticsearch over Solr. Maybe next year we'll have better news. What's that? Maybe next year we'll sing a different tune, yeah. Yeah, and maybe Acquia and Pantheon will add Elasticsearch hosting, so we can really use it on all our sites, right? You bet, Sean.

The other thing that you can do with Drupal and Elasticsearch is log aggregation. There are a couple of different ways to do that. One is this cool new module by Amitai Burstein, Logs HTTP. You don't actually need any other intermediary. You can just wire it up and say, let's say you had your Elasticsearch instance running locally, my endpoint is that URL, wherever you can hit Elasticsearch. It'll create an index if it doesn't already exist and just start creating those messages in that log. You can say what log severity to send, and you can define a field for each environment, so that dev, staging, and production can be differentiated. And that's all you need to do. Now you have all of your log data coming through Watchdog, automatically being put into Elasticsearch, which is pretty cool.

Another way to do this is to use the GELF module, which I think Mark Sonnabaum wrote originally, right? The illustrious and lovely Mark Sonnabaum, in the front of the room here. With GELF, you can pair it with this tool Logstash, which we're going to talk about more in a minute. GELF is the format that was actually designed to be used with Graylog, which is another log aggregation system that also uses Elasticsearch, actually.
They have this JSON format, and this module will post that JSON format, so you can either have it go straight into Graylog, or you can pass it through Logstash into Elasticsearch. Two really good ways to get those logs in, especially if you're doing other log aggregation: feeding in your MySQL slow query log, your Apache access log, whatever. You can make a unified log stream of all of that stuff, so you can look in one place and identify which sorts of requests are having which kinds of problems.

Yeah, and this is an awesome way just to get some data into Elasticsearch so you can start playing with it. It also speaks to the two general primary use cases right now on the Drupal side. One is helping you serve your search results better to your end users. The other is more of an operational tool to understand what's happening on your site. That's where having the logs might help you see if you're being attacked, or what pages are 404-ing, or what your traffic patterns are, anything like that.

So the other way... wow, this monitor is really low contrast. The other way you can do it is just using custom code. This is a little snippet from a module that I wrote for a Drupal 6 Ubercart site. So this Drupal 6 Ubercart site has some of the screaming fast, most cutting edge search technology of any site I've worked on. Sorry, I'm just sharing my pain. I just want a little validation here. So here's an example of how you can just write this code, and you can see it's going to look a lot like the stuff we were seeing in the examples before. This is using one of the Elasticsearch clients that are officially listed in the Elasticsearch documentation.
So you can see here, I just have this client class and I'm calling indices create, and I'm handing in this description that says the number of shards, one; the number of replicas, zero. I don't care if this data goes away, I can just reindex it. It's defining auto-suggest completion things, which is a built-in Elasticsearch feature. You don't have to just do a full search; it has a bunch of memory optimization stuff for doing type-ahead. So you can tell it which fields you should be able to do type-ahead on, and it'll make that really fast for you. And then here's some custom code to query. You'll see here, I'm taking just some search phrase and I'm telling it what fields to search across, and then I'm giving it Lucene, what are they called? Boosts? Yeah, boost factors, where I'm saying if it's in the name, which is the user name, give it a boost of four. If it's within the full name, then a two, and then if it's in any of these other fields, neutral. No better, no worse. Right, create the index, perform searches on it. I don't have a slide for pushing stuff in, but it's kind of exactly what you'd expect: you just make an object and you call this method on the client to say, index this. So in like 100 lines of code you can actually just create your own custom search if this other stuff doesn't work for you. If you can't just plug in one of these Elasticsearch integration modules or the Solr integration module, it's really not a lot of code to write your own custom integration, especially as clean and well documented as the Elasticsearch API is. So here I needed really good user search in Drupal 6 with address data from UberCart, and I was able to write that in a day, and it way outperformed all the other stuff that we had tried for it.
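The boosted query being described maps onto the Elasticsearch query DSL's `^` boost syntax. A sketch of roughly what that query body looks like (not the speaker's actual code, and the field names here are made up to mirror the example):

```python
# Build a query-DSL body that searches several fields with Lucene boost
# factors: name counts 4x, full name 2x, everything else is neutral.
def build_user_search(phrase):
    return {
        "query": {
            "query_string": {
                "query": phrase,
                # "field^N" is the query DSL's per-field boost syntax
                "fields": ["name^4", "full_name^2", "mail", "address"],
            }
        }
    }

body = build_user_search("howard")
```

This dict is what you'd JSON-encode and send to `/{index}/_search`, whether via curl or one of the official clients.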
So that's kind of most of the Drupal use case, but the thing is, part of what's given Elasticsearch so much in the way of legs is how well it works for DevOps-y people. Sweet. So we wanna introduce you to what's called the ELK stack. So this is Elasticsearch, Logstash and Kibana. These are three tools provided by Elastic, the company behind all this. This gets us to some cool use cases. This is what we use it for at Pantheon: centralized logging. This is basically the killer app for Elasticsearch. I think this is the way a lot of people and organizations will start using Elasticsearch: the operations team, tech ops, DevOps, whatever it is, are like, hey, let's use this. It's a really great way to look at our logs and see what's happening on the system, so we don't have to log into each different server when something goes wrong. And then this might be a way that people fall in love with Elasticsearch and then bring it to other parts of the organization. So we went over a couple ways, right? The key is to get your data into Elasticsearch, and that's when the fun begins. Howard went over a couple of cool, easy ways to get that in, maybe something like the Drupal Logs HTTP module. You can just hook up syslog, or journald if you're running a systemd-based distro like Fedora. There's lots of ways to get your logs in. And there's a contrib project that has even more, like a thing that'll read DB log records out of your Drupal database. That ships with logstash-contrib. Don't do that. But it's super flexible, so there's a lot of cool ways. So as for the logs, you're kind of thinking about it wrong if a log is just an entry in a file, right? Logs are events that are happening on your system. So the idea of a log stream is that these events are happening in different places. Maybe that place is nginx, or maybe it's Drupal.
And those events are emitted. They go through this kind of log pipeline. And Logstash is gonna be the first step on this logging pipeline. So bringing it back to the Drupal Git data, you see something like, hey, issue whatever: fixing XHTML, slash. And in our Logstash configuration, we can have this little grok. And what the grok will do is split the issue ID out into its own field. So before, we just had straight text, which is kind of dumb. It's just one string. It's not super interesting. But then you can pass it through Logstash, through these kind of regexy grok things, and you get rich data out. So the rich data is more like key-value pairs. You can see at the bottom, the issue ID is broken out. The issue message is broken out. And the original message. And yeah, so rich data is really cool. We'll go into that in the demo. But basically, the richer your data is, the more you'll be able to query it and play with it. If every document you put in Elasticsearch is just one line, one string, you can still search against that, but it's not super interesting. So the richer you can get, the more key-values, the more you can start to do some fun stuff. Yeah, so Logstash plus Elasticsearch is a great way to just understand what's happening in your system in near real time. Maybe that's attacks happening on your system. Maybe that's investigating a security breach, or something that broke, and how did that user get to that step? Diagnosing an outage that happened. All the things that you'd use Watchdog for, but in a way that you can query and play with in interesting ways. For example, that informed Nick and the guys at Pantheon that I am at the top of the leaderboard for failed login attempts.
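The grok step being described is, at heart, a named-capture regex. Here's a Python stand-in for that idea, splitting a Drupal-convention commit subject ("Issue #NNNNN by names: message") into rich fields; the actual grok pattern the speakers' Logstash config used may well differ, so treat this pattern as illustrative.

```python
import re

# Rough equivalent of a grok filter: named groups become fields.
COMMIT_RE = re.compile(
    r"Issue #(?P<issue_id>\d+)(?: by (?P<authors>[^:]+))?:\s*(?P<issue_message>.*)"
)

def parse_commit_subject(subject):
    match = COMMIT_RE.match(subject)
    if not match:
        # Not every commit follows the convention; keep just the raw string.
        return {"message": subject}
    fields = match.groupdict()
    fields["message"] = subject  # keep the original alongside the rich fields
    return fields

parsed = parse_commit_subject(
    "Issue #26420 by Dries: Split modules into their own directories."
)
```

Once the issue ID is its own field, you can facet on it, count commits per issue, or rebuild a Drupal.org link from it, which is exactly what the demo later does.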
Apparently somebody knows that my email address has an account on Pantheon and they're just trying to brute-force my password. And so he was able to show me on this graph, like, oh look, there you are. You're right at the top of the list of all the people trying to break into Pantheon. It's either that or Howard wrote some bad automation in a tight loop; it really could go either way. Yeah. Sweet. So we're gonna jump into a demo and you guys can get a better feel for the power at your fingertips. So this is the K in the ELK stack, Kibana. It's this dashboard which lets you query and interact with Elasticsearch on the back end. So this right now is a web UI, a little dashboard we threw together on top of Elasticsearch with all of the Git log data for Drupal core. Howard can go into more detail. Yeah, so I'm gonna show you how I put this together, but figured I'd start with the context of what this is and show you the power of all this stuff that we've been talking about. At the last DrupalCon, in Austin, I also had a talk that overlapped with Elasticsearch, and I also had a live demo, and that one exploded in flames. So fair warning. So here what we have is a Kibana dashboard, like Nick said, of all of the commits in Drupal core. This is a couple of days old, a little bit old. I think there's actually like 28,000 now. This last import, I'm not sure I got quite everything. But yeah, so what we can do is just play with this stuff and start making these lists. Kibana doesn't know anything about core commit data, right? But what I was able to do was create this Logstash rule, and Nick and I put together this thing that would parse the output from a custom Git log format and then ingest all that stuff into Elasticsearch. So you can see here, I know this is a little small, but you can see here the timestamp, and the author; the Git commit author was webchick.
You can see that the email address she was using was her magic Drupal.org one, to keep her personal email private. Then you'll also see some other structured data that wasn't structured in the Git commit, right? So like the issue, which used that grok filter that Nick mentioned, and I'll show you the code in a second, to extract it, because it followed that normal Drupal pattern of issue, colon, space, pound, and the issue number, as you can see on the next line. And then we also used that to repackage it as a link to the Drupal.org issue. So you can just click that and jump right to it. So, yeah, this is pretty cool. So right, there's colors, there's pie charts, there's a histogram, we see some log entries, right? So let's maybe just start playing around. So who wants to look up, who knows a core committer by name that we wanna investigate a little bit? Call it out. Sonnabaum. Hey, there we go. So there are 43 commits that are tagged Mark Sonnabaum. And we can see of Mark Sonnabaum's commits, we've got a tie, oh no, Alex Pott. Alex Pott commits the most of his stuff. So if Mark wants to get something in, he now knows who he should probably talk to. Alex is gonna be the most sympathetic to Mark, followed maybe by webchick and catch. I had a killer year. It's almost making like a middle finger with that. Yeah, Mark is flipping us all off with his commit statistics. 25 of them in 2012, Mark. Big year. What happened? You're done? He's done. You did it all. And then the little table there is the histogram by issue. So on the left is the issue number, and then it's how many commits were committed for that issue number. So you can see that when Mark was committing, every issue he addressed, he addressed with a single commit. Bam. But if we take off the filter on Mark, we can see that, wow, this one issue, 26420, took 21 commits to do. We're like, what is that? Okay, they were all done by Dries. Let's dig in a little bit more. See, it was all in 2005.
Okay, we're learning the story here a little bit. There we go. We see that it had something to do with renaming the aggregator module. But then also look at all the other commits. Do we want to know more? Well, yeah, so look at the other commits there. That's renaming, looks like a lot of module renaming. So I don't know, I'm just kind of putting this together, but maybe there's an issue that's doing a refactor in the code base, renaming or moving things around, and instead of lumping that into one huge commit, which would be scary to push, there was one issue on d.o, but it actually happened in a number of commits, I would imagine, just to make the refactor a little more sane. And then we have that nice feature here. We can actually just click that link. Any link will automatically format itself properly. And yeah, we can see, here's the Drupal.org issue: split modules into their own directories. It's interesting. So that required more commits than any other thing in Drupal core history. We're learning here. Howard and I are excited to use this database to give us a leg up at trivia tonight. We think we can really, yeah. Yeah, and then we can just say, oh, what about a different time interval? So I was looking at the last 15 years, but what if I just wanted to look at the last year? So there we go, we have a lot fewer committers and we can see kind of a different graph here. It's a little bit more detailed. It's now automatically splitting things up by day. So this is the last year, by day. I can see that the most commits happened on January 11th. Interesting. Alex Pott, 62% of the commits. So he is either just really working hard or his standards are a little lower. Just teasing. You can see that we've got 35 by catch. We've got 33 by xjm. Jennifer Hodgdon, 64, right? It's just really easy to start exploring these. And again, we can click on any one of these and it'll refilter.
So now we're seeing just those commits, and again, this is being pulled from author data. So there's some overlap between user names and actual names, depending on what people had in their Git config at the time they committed. Yeah, you can also track people's identity as their name or handle changes over time. One other nice feature: Logstash automatically will create two different indexed copies of each field. One is the raw and the other is the analyzed copy of the field. So one is broken up into tokens, but the other one has the full value, which allows you to do different things, depending on the nature of that data. Yeah, and if you remember back to when we were writing a query in curl, there were two parts to it. There was the query, and then there were filters. And that totally maps to the top section of Kibana there. So the top is the query, and then below that there are the filters. And so we're filtering by a specific author, and then we're querying for, what did you type in there? Security. Security, so for security commits by this author. And you can see those issues there. Actually, I split out the author, so we are requiring an author, and anything that's missing an author doesn't show up. That's because of the way I did the log output; we're losing any commit that had quotes in the message. It's a little imperfect. That'll be fixed in version two. Yeah, version two of this talk. Yeah, can someone shout out something else we want to search for in Drupal commits? What do you want to search for? Yes. Oh, interesting, yeah. So we'd first probably want to go back for a while, because we're not sure when that happened, and might want to query on both of those names. And, I don't know. Let's see if I pick this. So we can see where it started. Actually, it might be easier to do up here with catch, or, oh yeah. Yeah, Elasticsearch just fixes that for you, so you don't have to worry about it.
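That "query on top, filters below" split in Kibana maps directly onto the query DSL of that era: a `filtered` query wrapping a scoring query plus a non-scoring filter. A sketch, with the field and author names taken from the demo purely for illustration (this is 1.x-era syntax; later Elasticsearch versions replaced `filtered` with `bool`):

```python
# Combine a full-text query (scored) with an exact-match term filter
# (unscored, cacheable), the way Kibana's top bar plus filter pills do.
def build_filtered_query(search_text, author):
    return {
        "query": {
            "filtered": {
                "query": {"query_string": {"query": search_text}},
                "filter": {"term": {"author": author}},
            }
        }
    }

q = build_filtered_query("security", "Mark Sonnabaum")
```

The same body works from curl, which is exactly what the "inspect" feature shown in a moment prints out.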
So it's nice. So if I clicked on either one of these, I could see that the oldest one here is from December 2012. If I dropped that filter, I could see that the oldest one here is December 2012. I think he just has two computers and they're configured differently, maybe. Yes, I think we could. It seems like the exact same breakdown. What's that? Yeah, type in one query. Oh, right, the plus, yeah. Yeah, and this is really fun to play with, pretty visual. You can get kind of funky with it, but it maps really well to what you'd be doing with curl, and it's a pretty cool way, once you can get some data into Elasticsearch, to start playing with it. So we can try to get two queries up here, and then another thing I want to show you, exactly right. Cool, so we can see those two queries showing up with the different colors on the histogram on the left. Yeah, so we can even see how much of each one, right? So of this year, 473 of them were catch, and 891 were Nathaniel Catchpole. So do you think the yellow is his laptop and the other is his desktop? I don't know, but you can do some fun stuff. Or catch is actually just in the email address, so that's why it's hitting both. Can you do an inspect on that histogram? First I wanted to see, oh, interesting, huh. So if we do an inspect, how awesome is this: it actually prints out the curl command that you'd run to run this query yourself on the command line against the index. So while we're doing live demos of my crash-and-burn, Howard, do you want to just copy and paste that and see if we get results on the command line? What could possibly go wrong? Amitai always tells me you should do more live demos. Something with the white space could go wrong. And this is also a cool way, right?
So you get some data in, and then you can actually just do this inspect and see the more advanced queries that Kibana is generating for you, to kind of learn some of that cool syntax. So there we go; what we see coming back out: this took three milliseconds, I guess, didn't time out, it hit five different shards. Five were successful, zero failed. We had 17,000 hits, and then we can see that it's breaking things down by term here, and we can see exactly how many fell into each one of these categories. Cool, so we have about 10 minutes left, and again, on github.com slash tizzo, the examples and all the slides are up there so you guys can play around with that. We're gonna also walk you through it; if you have any questions, ping either of us. But this is a cool way to have some fun, do some exploring. This is the kind of stuff I love to do at Pantheon with our logs: try to do a little bit of sleuthing and figure out what stories the logs are telling you, in some cool ways. So I think Howard's gonna continue the demo on the command line and show how we spin up the clustering aspect. Can you guys see that? You can just command-plus. Yeah, I just wanted to see if I could. So this is just the Elasticsearch download. If I just run elasticsearch, I just spun up a new node. It automatically named itself with a Marvel character, Lady Killer. I don't know what her deal is, but apparently she's one of the characters. And in theory, in just a second here, we should see a new node just sort of pop up. Wow, I think I have a sort of split brain on my own machine; live demos. Cool, I think maybe now is a good time to just open up for Q&A if anybody has any particular questions about Elasticsearch or this commit data. If you could get up to the mic, that would help us, or you can shout and I'll repeat it. But the question was about Kibana 3 versus Kibana 4. Kibana 4 is gonna be awesome.
It's gonna be better than Kibana 3. I haven't used Kibana 4, but it's on the list. What's up? Yeah, so the question was about experience using existing web crawlers like Nutch or otherwise to put data into Elasticsearch. No, I haven't, but it sounds like an awesome project. And yeah, it seems like it would be totally possible. You know, to me, it's so clear what's happening, because you can use Elasticsearch so easily, that, and I'm not even sure of the specifics of Nutch, but it seems like there's gotta be a way to make that happen. It might take a little duct tape and glue to go from whatever format Nutch likes to an output that looks like Elasticsearch. But yeah, it sounds like a cool project, and it sounds like something that you could throw together pretty quickly. And for what it's worth, you'll notice that now, as I add Elasticsearch nodes, I had to kill everything, I had some networking stuff going on, I think, but now you can see that we're building out a cluster. I just added three new nodes to it. Why not? Let's make it four. All I'm running is ./bin/elasticsearch, and these nodes are just showing up, popping into the list, and Elasticsearch will even automatically start distributing the shards among those nodes as we have data to pull in. See, now I'm seeing the smiles of, like, that's the magic of Elasticsearch. That stuff is awesome. If you've ever set up other distributed systems, this is unusual: you just put them on the same network and they just find each other. Although they did invent their own magic for that, so it's kind of got its own eccentricities, and the fact that you might have already worked with a distributed system like ZooKeeper might not actually mean you know how to troubleshoot their own magic Zen discovery thing. But if something does go wrong, you can just blame it on the names, like, Lady, who is it?
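Once nodes start joining like this, the usual sanity check is `GET /_cluster/health`. The JSON below is a hypothetical example of that response (real field names, made-up values) used to sketch how you might check that the cluster actually formed:

```python
import json

# Hypothetical _cluster/health response for the four-node demo cluster.
SAMPLE_HEALTH = json.loads("""{
  "cluster_name": "elasticsearch",
  "status": "green",
  "number_of_nodes": 4,
  "active_shards": 5,
  "unassigned_shards": 0
}""")

def cluster_ok(health, expected_nodes):
    """True when all expected nodes joined and no shards are unassigned."""
    return (health["number_of_nodes"] >= expected_nodes
            and health["status"] != "red"
            and health["unassigned_shards"] == 0)

ok = cluster_ok(SAMPLE_HEALTH, expected_nodes=4)
```

Fewer nodes than expected, a "red" status, or unassigned shards would all be signs of the split-brain and discovery problems mentioned above.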
Something wasn't up to it, but Lord Chaos is totally up to clustering, so much so that he became the master. Hulk, smash. Mark, you wanna kick stuff? Right, so, by default, if you don't define a schema and map different document types to what their structure is going to be, Elasticsearch just tries to figure it out. Now, a lot of the time Elasticsearch makes the right call and it works really nicely, but it doesn't always, and depending on what your first document that came in was, you could end up with an index different from what you want, and you can't change those mappings later without dropping the index and recreating it. So, right, you wanna make sure that you've described to Elasticsearch that that's not a long, it's a date, right? You might wanna tell it what that timestamp is, or Elasticsearch might guess wrong. And then when you go to start trying to do date range queries, saying find me all the stuff in this month, Elasticsearch has some magic for understanding dates, and if you didn't create your index ahead of time and tell it what that mapping was, then you might not end up getting the kind of nice date indexing that you're looking for. Did I get any of that wrong, Mark? Splunk? If you can afford Splunk, Splunk's probably pretty good. If you have your laptop or one server. Yeah, they're similar, right? So there's Splunk on demand, or whatever that one's called, Sumo Logic, Loggly; there's a bunch of SaaS and licensed software that can help you do stuff around log analysis and the same kinds of things. And those are all great options. They just have different trade-offs in terms of cost and cost to maintain. I think there is a big difference. Like, if you wanna get excited and start playing with Elasticsearch, spin up one node and you're golden.
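The explicit mapping being argued for might look roughly like this (1.x-era mapping syntax; the type and field names come from the commit demo, and the `raw` sub-field is the raw/analyzed multi-field trick mentioned earlier). This is a sketch of the idea, not the speakers' actual mapping:

```python
# Declare field types up front so Elasticsearch doesn't have to guess from
# the first document it sees.
def commit_mapping():
    return {
        "commit": {
            "properties": {
                # without this, a numeric-looking timestamp can be mapped as
                # a long, which breaks nice date-range queries later
                "timestamp": {"type": "date"},
                "author": {
                    "type": "string",
                    "fields": {
                        # untokenized copy for exact-match faceting
                        "raw": {"type": "string", "index": "not_analyzed"}
                    },
                },
                "issue_id": {"type": "string"},
            }
        }
    }

mapping = commit_mapping()
```

You'd PUT a body like this when creating the index; remember the warning above that changing it afterwards means dropping and recreating the index.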
The trouble is when you start to cluster it, that's when you kind of need to understand the different failure scenarios and that kind of thing, and how many replicas should I have, how many shards should I have. So to operate a cluster, you kind of need to understand what's going on there. So I'd say it's really just the operational cost; does it make sense for you and your business to scale out a cluster? There's an overhead to that, but the proprietary solutions might work for you as well. So. Cluster health equals yellow by default. Yeah, right. So, go ahead. My model for securing Elasticsearch has always been the same as my model for securing Solr and Memcache and Redis, which is: just firewall it off and call it good. Communicate with it either over an unobservable LAN or through an encrypted tunnel. And there's probably other ways to do it, but for the most part, you probably don't want people hitting your Elasticsearch cluster from outside your data center anyway. Yeah, so a couple of quick things: just like MySQL or anything else, the defaults are not necessarily your friend, and if you're trying to be secure, are probably not your friend. So you need to look at those defaults. I think just not exposing it externally is great. Basic stuff like iptables, fronting it with nginx, fronting it with SSL, are all cool ideas and pretty easy to do. Also, one easy thing you can do with nginx is restrict based on the HTTP method, so you can at least say, hey, if you come in through this domain or whatever, you're only doing GETs. So that's more of a data integrity aspect, but you also need to understand the confidentiality parts of it as well. Oh, one other gotcha. But a good note for sure. If you're starting to play around and you get some log data in, there can very often be sensitive data in there, and so it's on you to make sure that's secure.
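The nginx method restriction mentioned here can be sketched as a small config fragment. This is a hypothetical minimal example, not a hardened setup: the port and upstream address are assumptions, and note that some clients (Kibana included) POST their searches, so a real allow-list usually needs to match on path as well as method.

```nginx
# Front Elasticsearch with nginx and only proxy read-style requests, so
# outside clients can query but never index or delete.
location / {
    if ($request_method !~ ^(GET|HEAD)$) {
        return 405;
    }
    proxy_pass http://127.0.0.1:9200;
}
```

Pair this with firewalling the Elasticsearch port itself so the only way in is through the proxy.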
One other gotcha with hooking up Elasticsearch and Kibana: by default, well, that actually is one reason that you might wanna allow Elasticsearch to be accessible from outside, because I think Kibana hits its API directly from the browser. You need to set a flag in the Elasticsearch config to turn on the CORS allowed headers so that the Kibana instance can talk to the Elasticsearch instance. The documentation's all inside our examples folder inside this presentation. And I think we have gotta cut it off. Cool, we'll grab anyone who has any questions. Good question, we'll jump into that with you right now.