Okay, building a highly available ELK stack for Drupal. Let me introduce myself. I'm Margie, a systems engineer at Morpht, a small Drupal agency based in Sydney, Australia. Being small is a good thing, but we are lucky enough to have a few enterprise clients from the pharmaceutical, finance and media sectors. That gave me the opportunity to do some cool systems and DevOps stuff, and I hope to show you part of it. I realized this is a really difficult topic to present on. When I was presenting at DrupalCon New Orleans, that was for beginners; it was more about playing with the data and putting your ELK stack together using Docker, with a little bit of live demo that was playful. But this, actually talking about a highly available solution, I found a very dry topic. There are not too many opportunities to laugh, so if you find any, please do; I need a little bit of encouragement to make it interesting. This is targeting sysadmins, mostly, and technically savvy developers. I will present a highly available Elasticsearch cluster solution, and it will be AWS oriented. It's for an intermediate audience, so I expect you to know a little bit about the ELK stack: you have installed it, you have touched it, you know what it is about, and you are interested in how to make it highly available. So I'll go through a few topics. First, I will present a potential design which is scalable, enterprise-grade and highly available. I will talk a little bit about how to auto-scale different components of the solution, how to prevent the stack from running out of space, how to secure some of the data flow in it, and how to patch your stack without having any downtime, because that is a high availability requirement. And, being at DrupalCon, I will also cover a few ways of getting Drupal logs into Logstash. So what's this ELK again? This is how my slide at DrupalCon started. ELK are three open source programs:
Elasticsearch, Logstash and Kibana. This is from Elastic, the company behind these products. When they introduced another component in the stack called Beats, they were trying to come up with a representation of ELK and Beats together, so this is BELK. It was meant to be fun, and I really liked it; some people call the ELK stack the BELK stack because of Beats being there as well: Beats, Elasticsearch, Logstash, Kibana. But the official name is the Elastic Stack, especially now with version five, which I think went beta last week, five days ago. So that is the official name of the stack of technology I'm talking about. So what's the goal, just quickly? I know you know this, so maybe just to refresh: we want to be able to take data from any source, in any format, transform it, process it and enrich it, and store it so you can search, analyze and visualize it. That's the goal of the stack. And as I said, it has four components. Elasticsearch is where we store the data; it's distributed and highly available, and I just want to say there are plugins you can install in it to enhance its functionality. I'm saying that here because I will cover a few plugins as well. You could spend all day talking about Elasticsearch, it's a really big topic. Logstash is the tool which collects, processes and enriches the data; that's your transformation pipeline. It has many inputs, many outputs, and it can do stuff in between. So this picture kind of describes what you can do: you can get data from a file, from a socket, from a database, then you transform and enrich it, and then you store it somewhere. We will mainly use the Elasticsearch output for storage. As I said, there are input plugins, and we will focus on Beats; there are output plugins, and we will focus on Elasticsearch; and there are ways of manipulating the data in the process. And then you look at it via Kibana, which is an open source data visualization platform.
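To make that pipeline idea concrete, here is a minimal Logstash config sketch; the file path, pattern and host are my own illustrative choices, not from the talk:

```conf
# Minimal Logstash pipeline sketch: file in, parse, Elasticsearch out
input {
  file {
    path => "/var/log/apache2/access.log"
  }
}

filter {
  # Parse each raw line into structured fields using a stock Apache pattern
  grok {
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
  }
}
```

Everything in the talk's bigger design is a variation of this three-stage shape: different inputs, different amounts of filtering, different outputs.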
It gives you a browser UI to interact with your data, which you can search from Elasticsearch, where it's stored. And Beats, the latest addition to this family: these are open source, lightweight data shippers written in Go. That's Elastic's effort to find a really lightweight way of shipping data from the source to either Logstash or straight to Elasticsearch. Say you install a little Filebeat on your Apache log: it reads your Apache log and streams it to Logstash or Elasticsearch. Also popular is Topbeat, which gathers system metrics and streams them to Logstash or Elasticsearch, so you get some kind of metrics for free. I will show you. So I talked about these four components; this is just so you know how they talk to each other. In Elasticsearch I have the data. Kibana connects to Elasticsearch and visualizes the data, right? But the raw data somehow needs to get there. So you stream it to Logstash, and you can do it via a Beat; you can see that one of the sources has a Beat, so maybe there is a Filebeat installed which does that. Or maybe it's the source itself, like syslog streaming it via TCP. It can be the application itself, opening a socket and sending it to Logstash. Or, as I said, that's what Elastic tries to do: maybe the Beat is capable of streaming the data straight into Elasticsearch, bypassing Logstash, so you have one component less. It's easier. This is just to represent what Logstash does in between: this is an example of a source, a log file on an Apache server, and when it goes through the pipeline, I can look at it and see this pie chart of response codes. That's what the pipeline gives me. So that was a quick walk through the four components, but I really want to focus on how to build this in AWS and how to make it highly available. I thought it would be good to have a use case.
So imagine you have an enterprise client, maybe from the banking sector. They have a few dozen sites and servers, and they want all their logs in one place. They cannot lose any log. They might have data retention policies; they might need to be able to produce audits and maybe respond to customer complaints. I think the Elastic Stack is a potential solution for that requirement. And maybe we are lucky enough that the customer approved AWS as an environment they can deploy into, rather than doing it on bare metal in-house. So that is what I will cover. This is basically the stack, but how can we make it highly available? Well, we modify it a little bit. You can see we still have the same components, but what happened is that in the middle I suddenly have a message queue, and there are more Logstash instances than before. You have data coming from different sources, maybe via Beats, maybe directly from syslog or the application, not to Logstash directly but to a load balancer. Then I have more than one Logstash shipper behind the load balancer, to make that part highly available. And the shipper has only one purpose: store that event, the log line, into the message queue. That part must be highly available; it must not go down. I don't want to lose a single line there, so it must be up and running all the time. And then there is the other part, on the right side. The first part stores events in the message queue, and the other part is the Logstash indexers. The indexers do the actual data processing, enrichment and manipulation; they just fetch events from the message queue, and you can have one to many of them. You auto-scale: basically, if your queue is growing, you provision more of these.
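One way to wire up that queue-driven scaling with CloudWatch might look like the following sketch. Every name, the queue, the threshold and the policy ARN are illustrative placeholders, not values from the talk; the scaling policy itself would be created separately against the indexer auto scaling group:

```shell
# Alarm when the SQS backlog stays above 10,000 visible messages for 10 minutes,
# and trigger a scale-out policy that provisions another Logstash indexer.
aws cloudwatch put-metric-alarm \
  --alarm-name logstash-indexer-backlog \
  --namespace AWS/SQS \
  --metric-name ApproximateNumberOfMessagesVisible \
  --dimensions Name=QueueName,Value=logstash-events \
  --statistic Average --period 300 --evaluation-periods 2 \
  --threshold 10000 --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:autoscaling:REGION:ACCOUNT:scalingPolicy:EXAMPLE
```

A mirror alarm on a low threshold can drive the scale-in policy that destroys indexers again when the queue is empty.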
If your queue is empty, you can destroy some of them. And you store the data in Elasticsearch, but here we can see that we have a cluster with at least three nodes, and Kibana reads from that Elasticsearch cluster. So let's dive into some of the components and explain why I would do it this way. The data shipping part goes either via Beats, or from the application or the server itself, into the load balancer. I would recommend using Beats; Filebeat is really robust. Before, it was called logstash-forwarder. It's a really nicely written application which basically reads a log file and, line by line, transmits it to either Logstash or Elasticsearch. The beauty is that if it has any connection trouble, it waits and tries again. So unless it's down for so long, like a day, that your log file rotates more than once, you would not lose any data. Syslog can stream as well, but syslog has just an output buffer, and if it cannot reach the target for a while, for any reason, it starts dropping data. That's why I would rather put Beats in between than stream from syslog directly. The application can stream as well; as I said, it can be a socket, a web socket, TCP. Don't use UDP, of course: you can lose packets, or they can arrive in a different order. And if you can, of course, put SSL encryption on that TCP connection. So now I'm looking at the load balancer being in front of two Logstash shippers. I suddenly started saying "Logstash shipper" without explaining what it is. The shipper is just a Logstash, and Logstash, as we said, can take data from one or many sources and write it to one or many destinations. The shipper here just ships data from the source to the message queue. So it's configured in a lightweight way: I don't do any heavy processing or regular expression matching, I just want it to act as a pipe.
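The Filebeat side just described can be sketched roughly like this; note that option names moved around between Beats versions (for example the TLS section was renamed), and the paths, hosts and field values here are my examples:

```yaml
# filebeat.yml sketch (Filebeat 1.x-era layout; check your version's reference config)
filebeat:
  prospectors:
    - input_type: log
      paths:
        - /var/log/apache2/access.log
      # Tag the source so Logstash does not have to guess where it came from
      fields:
        service: apache

output:
  logstash:
    # The load balancer sitting in front of the Logstash shippers
    hosts: ["logs.example.com:5044"]
    # Encrypt the connection if you can
    tls:
      certificate_authorities: ["/etc/pki/tls/certs/logstash.crt"]
```

If Filebeat cannot reach the endpoint, it keeps its place in the file and retries, which is exactly the reliability advantage over a plain syslog TCP stream.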
So the only purpose of the shipper is to store events in the message queue, in the format the message queue understands. And I put the load balancer in front of it, of course, to make it highly available: as long as at least one shipper is up and running, I can take the others down, they can fail, I can update them, reboot them, reprovision them, and I'm not losing any data. I also discovered that there is nice ELB support for CPU autoscaling when you use it for SSL offloading. Originally I was experimenting with the Logstash shipper doing the SSL termination, which it can be configured to do. But that takes a lot of CPU, and it's really hard to scale: if you have a sudden spike in data, you want to be a pipe, and with a sudden jump in CPU requirements you don't have enough time to auto-scale. But if you put the SSL termination on the load balancer in front, it happily auto-scales for you. So it's a beautiful way of using that fact. Also, by using an Elastic Load Balancer, you protect yourself from a zone failure: if you have, say, three shippers and one zone goes down, you still have two zones up, so you don't lose any data. There are some disadvantages of using an Elastic Load Balancer. An ELB does not give you a static IP address, and some enterprises want you to give them an IP address and port so they can open their firewall; with this, you cannot. I tried to work around that by putting HAProxy in front of it with an EIP, an Elastic IP, but then you once again have the problem of making HAProxy itself highly available, and how do you float the IP address between those two instances? You make it too complicated. Ideally you don't do that; ideally you don't need a static IP address. Also, ELB does not support client-side SSL authentication; the Logstash shipper does.
So with the shipper you could say: only clients with certificates I trust can connect to me, otherwise I don't establish the connection. With ELB, any client with SSL can connect to you. But you can overcome that: once again, you know what kind of data you expect; maybe you have some kind of signature, or you know that a certain field must have a certain value, and if it does not, you drop the event straight away. It's a lightweight operation. Behind the shippers we have the message queue; that's where all the logs end up. I selected SQS because it's a fast, reliable, scalable and fully managed message queue. It's surprisingly cheap, and it handles a practically unlimited number of queues and messages. I remember once being out during the weekend, somewhere in a forest, when I got an alert that I was not getting data into Elasticsearch. I didn't really rush, and when I got back I found two and a half million messages waiting in the message queue, because the Logstash indexer had gone down; my auto scaling hadn't worked properly. The funny thing is that when I fixed it, the queue got drained within a couple of hours, and the client didn't even notice, because they hadn't lost any data. There is one disadvantage of using SQS: the SQS protocol is not supported by Beats. I'm saying that only because Redis is supported by Beats. So if I used Redis instead of SQS, I could potentially configure my Beats to stream straight to the Redis queue, make that highly available, and avoid the Logstash shippers in front of it. I'm just pointing it out so you don't find out too late; Redis is still very popular in this kind of deployment. So, the indexer. The indexer is just a standalone box which fetches the log lines from the message queue, processes them and stores them in Elasticsearch.
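Going back to the shipper for a second: in Logstash config terms, a shipper that just pipes Beats traffic into SQS can be as small as the sketch below. The queue name and region are my examples, and the exact option set depends on your logstash-output-sqs plugin version:

```conf
input {
  beats {
    port => 5044
  }
}

# No filter section at all: the shipper stays a dumb, fast pipe,
# and all the expensive grokking happens later on the indexers.

output {
  sqs {
    queue  => "logstash-events"
    region => "ap-southeast-2"
  }
}
```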
High availability in this context means that you are processing logs close to real time; that's what high availability means here, in my opinion, because the indexer just does fetch, process, store; fetch, process, store. It's nice to use an auto scaling policy here: you can monitor the number of messages in the SQS queue, and if it keeps growing, provision another Logstash instance to help, then terminate it later. Or you can monitor the CPU of the Logstash instances, and if it stays close to 100 percent for a period of time, provision more. And then Elasticsearch. Elasticsearch, as I said, is designed to be highly available; it's scalable by design. I have seen people using two nodes; it makes absolutely no sense. You need at least three nodes, because with two nodes you either have no high availability or you can get split brain. Surprisingly many people don't realize that. So you want at least three master-eligible nodes; that's the minimum for high availability. Master-eligible means the node can become master; there is an election going on, based on quorum. Each node can be master-eligible, or a data node storing data, or both, or a client node. A client node cannot become master and cannot store any data, but it can talk to all the other nodes, so it's kind of a smart load balancer. If you are starting out, I would recommend just using three nodes and configuring them to be both master-eligible and data nodes; that's a good start. Only when you need to grow the size of the cluster would I go to three dedicated masters, meaning they hold no data at all, with all the remaining nodes as data nodes. There is no need to put an Elastic Load Balancer in front of the Elasticsearch cluster, because the load balancing is built in.
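For that three-node starting point, the relevant elasticsearch.yml settings might look like this sketch (Elasticsearch 2.x-era option names). The quorum rule is (master-eligible nodes / 2) + 1, so for three nodes it is 2, and that is what prevents split brain:

```yaml
# Each of the three starter nodes is both master-eligible and a data node
node.master: true
node.data: true

# Quorum for three master-eligible nodes: (3 / 2) + 1 = 2.
# A master can only be elected when at least 2 master-eligible nodes are visible,
# so a partitioned single node can never elect itself.
discovery.zen.minimum_master_nodes: 2
```

With only two nodes this value would have to be either 1 (split brain possible) or 2 (no availability when one node dies), which is exactly why two-node clusters make no sense.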
As I said, I have seen how-tos which basically recommend: take a load balancer, put one, two, three Elasticsearch nodes behind it, and this is how you talk to your cluster. There is no point in doing this, in my opinion; you just get an extra moving part for no reason, with the need for a health check. There is no need, because Logstash supports multiple hosts in its output: when it streams data to the Elasticsearch cluster, you can list this node, this node and this node, and it just round-robins between them. And Kibana recommends something other than a load balancer for talking to the Elasticsearch cluster; I will cover that a little bit later. Each Elasticsearch node which stores data has a configuration which says: this is the directory, or directories, where the data is stored. I recommend using the SSD instance store for that, if it's big enough for you. Some EC2 instances come with a local SSD drive attached, and they can be reasonably big. If that size is good for you, use it; that's the fastest Elasticsearch cluster you can get. The instance store, of course, is not persistent, so if you power off that instance you lose the data, but why would you do that? If you reboot, it persists. And because Elasticsearch by default keeps one replica of every piece of data, if you lose one node completely you always have the data somewhere else, and when you reprovision the node it will once again create the replica. So as long as you don't power off more than one node at a time, you don't lose any data, even though you store it on the non-permanent instance store.
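The round-robin output mentioned above, together with the queue, makes the indexer config a mirror image of the shipper. A sketch, with hostnames, queue name and region as my examples:

```conf
input {
  sqs {
    queue  => "logstash-events"
    region => "ap-southeast-2"
  }
}

# All the heavy grok / enrichment filters live here, on the indexer,
# where instances can be scaled out and in freely.

output {
  elasticsearch {
    # Logstash round-robins across these hosts itself,
    # so no load balancer is needed in front of the cluster
    hosts => ["es1:9200", "es2:9200", "es3:9200"]
  }
}
```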
If that is not big enough for you, I would recommend using Elastic Block Store; SSD is the standard these days. I found that if you need a small volume, I would go for provisioned IOPS, which guarantees you a certain amount of I/O per second. But with general purpose SSD, the bigger it is, the more IOPS you get, so you might find that at the maximum size of a general purpose SSD you get the same IOPS and throughput as if you were paying for the provisioned one; that's good to know. I think it recently changed, and provisioned IOPS now gives you roughly twice the IOPS and throughput of a maxed-out general purpose volume, but when I was looking a few months ago the parameters were exactly the same if you maxed it out. That was very interesting. I would avoid making these nodes too full. When a node becomes too full, the cluster becomes more chatty: there is a watermark at 85 percent, and if a node gets to 85 percent disk usage, it stops placing data on itself and tries to send it to some other node, and maybe you don't have space there either. It's better to never get there; have alerts, don't let this happen. When you have this cluster, you need to decide how much data you want to store. Maybe you are happy to have only the last two weeks of all your traffic, or the last four weeks; you basically need to decide how many days you want to keep. Whatever you don't need to keep live, you can snapshot. There is a snapshot and restore module built into Elasticsearch which allows you to create snapshots in a remote repository, and it supports several backends. If you install the AWS Cloud plugin, you get the ability to back up to S3. So you can, say, create a snapshot every day and delete everything older than 30 days, and you know that if somebody needed to
go and search the logs from, say, a Monday last month, you just restore from the snapshot. There is a really nice tool designed for this called Curator, which is the perfect tool for creating and deleting snapshots and operating on Elasticsearch; I highly recommend it for maintaining the space. I have Curator running, basically saying: delete everything in that index pattern beyond 800 gigabytes. I don't limit by number of days, because you might get a surprise; spiking data can fill your disk. I limit based on space, so I know what can happen: suddenly, instead of having three months of data, maybe I have only two, because there was a spike I didn't expect, but I will never get into a situation where my cluster is full. Okay, as I said, you can talk about Elasticsearch all day long, and I'm definitely not the guy with the knowledge to do that, but I covered the dos and don'ts I learned with this deployment. And Kibana. I have to admit I run Kibana as a single instance, and it has never died on me, but I'm ready to reprovision it very quickly by running a simple script which would just provision a new one. If you have many heavy Kibana users, put a load balancer in front of a few Kibanas, and maybe auto-scale them. When you have Kibana, you wonder how to connect it to the cluster. My first thought was: okay, I'll use a load balancer in front of the cluster; that's the not-recommended configuration. Then I thought it would be nice if it talked locally to an Elasticsearch node. And what they actually say is: provision a new Elasticsearch instance on the same box your Kibana is on, and configure it to be a client node. It cannot become master, it cannot store any data, but it's the smart load balancer which can see all the other members of the cluster and routes your Kibana requests smartly. Progress check. When I was reading about how to present to a technical audience, they say: make
a joke. And I could not figure out what kind of joke to make about Elasticsearch and an ELK cluster, but I have a colleague who has been to many DevOps sessions today, and he said there were some people sleeping and leaving. So I think if that happens to me, I won't feel bad. That's my joke. Okay, so we have looked at a possible design that I consider a reasonable, scalable and highly available ELK stack. I talked a little bit about auto-scaling its components, about how important it is to prevent Elasticsearch from running out of disk space, and we touched on the SSL offloading ELB trick. There are still two topics I want to cover: how would you upgrade such a stack? Once again, it's supposed to be scalable, supposed to be highly available, running all the time. And, being at DrupalCon, we need to talk about how to get Drupal logs in too. So how would we patch this? Look at the Logstash shippers: we know that part is highly available already. We just provision another Logstash shipper instance, connect it to the load balancer, and after it becomes part of the pool, we de-register the old instance running the older version. Instead of shutting it down, I would de-register it, because there is a feature on the Elastic Load Balancer called connection draining, which is enabled by default. When you de-register an instance, it just removes itself from the load balancer, so it does not get any new connections, but it does not terminate the already established TCP connections. So if you are killing an old instance, de-register it from the load balancer first; there is a setting for how many minutes to wait, and hopefully most of the connections will finish gracefully. It's a nice way of doing it, and if you have auto scaling policies, I believe this will be done for you.
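With the classic-ELB CLI, that swap is two calls. A sketch, where the load balancer name and instance IDs are made up, and connection draining is assumed to be enabled on the ELB:

```shell
# 1. Register the freshly provisioned shipper so it starts taking traffic
aws elb register-instances-with-load-balancer \
  --load-balancer-name logstash-shippers \
  --instances i-0aaaa1111

# 2. De-register the old shipper; connection draining lets its established
#    TCP connections finish before you terminate the instance
aws elb deregister-instances-from-load-balancer \
  --load-balancer-name logstash-shippers \
  --instances i-0bbbb2222
```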
Indexers: as we said, if you take a Logstash indexer, the thing which keeps taking data from the queue and processing it, out of line for a little while, you are not losing any data; you are just not getting close-to-real-time data into Elasticsearch, and the messages grow in the queue. So that's one way of doing it: you provision a new one, and once again you can have auto scaling policies saying "I always want one member in the group", and then you just kill the old one and a new one gets provisioned. When you're patching the Elasticsearch cluster, there are two ways of doing it. Usually, when you go from minor to minor version, so 2.2 to 2.3, you can do something called a rolling upgrade: you take one node offline, you install the newer version of the Elasticsearch software, and then you bring it online again. You wait for it to rejoin the cluster and for the cluster to become healthy, and then you do the next one; that's why it's called a rolling upgrade. There is absolutely no downtime: you can still write into it, you can still search it; you might see some performance degradation, but there is no outage at all. The opposite of that is a full cluster restart. You usually have to do that when you go between major versions, from one to two, and with five going to be released very soon, I'm pretty sure you will have to do a full cluster restart for it too. And I got burned by this: if you upgrade the version of the Elasticsearch program, you have to reinstall your plugins; they are minor-version specific. There is one more thing I realized you can do, if you don't know Elasticsearch that well, which is my case. I know that I will want to roll out version five, but say I have one terabyte of data in the cluster, and I am honestly scared to just do an in-place upgrade. How about a live migration, where I provision a brand new Elasticsearch cluster running version five? So I have two: I have the old one and the new one. Then I
configure my Logstash indexers to stream to both Elasticsearch clusters, the new one and the old one, because they support many outputs; nothing is stopping me from streaming the data to both. Then I provision a Kibana on my local Docker just to look: am I getting the same data? Is there any trouble? No trouble. I restore the data I have from snapshots, because I do a daily snapshot; in the new cluster I just run Curator and restore the last however many days I want to keep. I verify with Kibana that all the data and all the visualizations work as before, and then I just flip the production Kibana for the users over to the new cluster. I can keep streaming to both for some time; no complaints; and only when I'm sure, a few days later, do I stop streaming to the old cluster and deprovision it. I'm not sure whether it's officially called live migration, but I call it live migration. I have done this, and I realized how powerful it is: when you are not sure, you can always go back. And as I said, patching Kibana is super easy: you just provision the new version, then either take over its Elastic IP, if you use one, or change the DNS record; if you use Route 53, it can do this automatically. Then kill the old one. Good, so I think I got through the boring stuff; now maybe something a little bit more interesting: a cost estimate. I thought you might be interested. Don't get scared before I show you the numbers, please. My numbers currently are something like this: let's say I store about 500 gigabytes a month in my cluster, and I want to keep the data for three months, so that's already 1.5 terabytes of data, and it means something close to 200 events per second coming in. That, of course, costs me CPU and memory. So let's look at the solution. This is the minimum number of components: two shippers to be highly available, one Logstash indexer, so that's already three
EC2 instances; minimum three nodes for Elasticsearch, that's six; and one Kibana. So I cannot go under seven instances, whatever I do. This is the rough estimate of the cost per month in US dollars. As you can see, I'm using c4.large instances; these are big instances, and that's why it costs so much, but I'm getting three Logstash boxes and three Elasticsearch nodes with 16 gigabytes of memory each; they like memory, they like memory a lot. And you can see Kibana doesn't take anything; it sits on a t2.small, very happy. Then we have three one-terabyte EBS volumes, which are also pricey, plus SQS, the load balancer and S3 traffic, which add almost nothing. So in total it's roughly 1,200 a month, which would be close to 1,000 euro a month. If I had much less data but still wanted it highly available, I cannot change the number of instances, but I probably wouldn't need the SSD EBS volumes; I would just use the local SSD drives, and I think I could get to a half, maybe one third, so about 400 USD, but not below. That is the starting price of this solution, in my opinion. There are alternatives. Elastic, the company behind this, offers Elastic Cloud, which is hosted Elasticsearch and Kibana on AWS; note that there is no Logstash. They host it as software as a service, and it starts at 45 dollars per month. It is highly available; it gives you your Elasticsearch and Kibana, and some security as well. There is no Logstash, and that's why I believe their shippers, the Beats, are being designed so they can stream straight to Elasticsearch: if they don't host Logstash, there is no easy way to do it otherwise. They try to figure out how to get the data straight into Elasticsearch, with the logic already on the shippers where the data source is, so Logstash can be avoided completely, people are still happy, and they get their money. Really nice and
smart. And there are other platforms, like Loggly, Sumo Logic and many others I don't have experience with. They also start somewhat small; I checked two of them, and you can start at maybe 50 or 80 a month, but when you start playing with how many days you want to keep and how much data you have, if I set the sliders to numbers similar to what I keep in my cluster, I was also getting to something like 1,000 or 1,200. Of course, by having condensed infrastructure, they can be more competitive. But please do not forget that by doing this in-house, your time is an extra cost; it's not something you walk away from. You have to consider that when you are deciding whether to do it yourself or do it somewhere else. A few complements. Monitoring, I think, is a must-have; you cannot run this without monitoring. Elasticsearch needs to be monitored for cluster health; that's a high-level check which is either green, yellow or red. Green means everything is fine. Yellow means something is missing, a node is rebooting or has dropped out; you haven't lost any data and everything still works, but you have no high availability. And red means there is a problem, data inconsistency, you cannot write. So you definitely need to monitor that. I would alert at least on these few things: the status of the cluster, disk usage and inode usage. I would also alert on Kibana availability, especially when you have management looking at it. And if you know that you are getting, for example, a hundred events per second, it's very likely that the most recent record in your cluster won't be older than five minutes. So it's a good test to run every now and then: what is the youngest entry in my cluster? If it's too old, you can see that the data is not coming through. There is also something called the Logstash heartbeat: I have a Logstash shipper here and a Logstash indexer there, and Logstash has a heartbeat input plug-in which
just injects a message at the shipper, in front of the queue, and the other one processes it and recognizes the heartbeat; so you can monitor that as well, and I actually do have that in place. You also need metrics to figure out what kind of instance and how much memory you need: CPU load, swap, all of this. And Elastic, the company behind this, has really good documentation, I have to say; there are pages on what you can monitor for Elasticsearch, and the bigger your cluster gets, the more you need to monitor and understand what's going on in there: JVM performance, all of this. This is just to show you what you can get for free; these are metrics you get when you use Topbeat. Here I'm comparing two Logstash shippers and one Logstash indexer. The shippers are behind the load balancer, and you can see that their CPU load mirrors each other; and even the indexer, which sits somewhere else, still follows the amount of data flowing through the solution. It's nice to see. But please realize that if you want to use this, you need another little ELK somewhere else: you don't stream the metrics data to the same cluster you want to monitor, because if it breaks, you have no data to look at. Of course it's very tempting. And they try to use Kibana to give you these nice visualizations; some of them make sense, some of them don't, but it is available and costs close to zero time to put together. There are also a few Elasticsearch web admin plugins. One of the most popular is kopf. It gives you this overview: I can see if my cluster is red, I can see the data distribution, and if shards were relocating I would see what's happening there. You can see that I have three nodes, that es32 in the middle is currently the master, heap usage is at 52 percent, disk usage is about 50 percent, and I have 1.5 terabytes of data in the cluster and two billion documents. So that's
something you put on the dashboard elastic at quarter, similar statistic you can see like how much a top beat index cost you per day to 0.3 data of course it's irrelevant depends how many nodes you have but you can see what it costs you on space good, so I think we are getting there Drupal watchdog, clocks, how do we get them in Drupal how do we ship them to Drupal there is this Drupal DB log input filter in lockstash don't use it, because as you can see from the syntax it tries to connect straight to the database and fetch the lines from the database so that's something we shouldn't be even be possible in production situation and you shouldn't have DB log enabled anyway so just to, so what I recommend is syslog enable the syslog module, make it stream to the servers R syslog or ng syslog, ng log I would configure the R syslog to create a dedicated file like here I can see I'm creating like viral log Drupal log, so that gives me of course I could use this default syslog and then if all the syslog goes to lockstash I then parse it, but why would I use like a CPU power to distinguish between oh this is Drupal, this is PHP if I have the luxury of streaming it from a file and tag it saying this is Drupal so when I get it already in lockstash I already have the tag, I already know that this is Drupal it doesn't cost me any CPU to find out and also if I actually use filebeat on that file rather than configuring the syslog to stream directly I as I mentioned before I have it's more reliable because filebeat is designed to wait if the cluster, if the connectivity is bad for some kind of reason while R syslog logging remotely via TCP will drop database. 
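As a rough sketch of that setup (the facility, file paths, and Logstash endpoint here are my assumptions, not from the talk; Drupal's syslog module uses the LOCAL0 facility by default):

```
# /etc/rsyslog.d/40-drupal.conf
# Route Drupal's syslog facility into its own dedicated file
local0.*    /var/log/drupal.log

# filebeat.yml (Filebeat 1.x style)
# Ship that file, already tagged as Drupal, to the Logstash shippers
filebeat:
  prospectors:
    - paths:
        - /var/log/drupal.log
      document_type: drupal
output:
  logstash:
    hosts: ["logstash-shippers.internal:5044"]
```

With the document type set at the edge like this, the indexer can route on it without spending any CPU working out which lines are Drupal's.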
I also prepared how I would parse the watchdog log in Logstash. This is a little bit out of scope, but you can see that you get a line from Drupal, and you create a pattern, which I call WATCHDOG. The bottom line is the actual stream: the hostname gets matched into a variable called drupal_vhost, the timestamp ends up in drupal_timestamp, and the 127.0.0.1 ends up in drupal_ip. I put it in a gist if you wanna have a look. Then I define SYSLOG_WATCHDOG just by using the WATCHDOG pattern and putting the timestamp in front of it, and this is how you put it in the Logstash indexer configuration: you basically try to match the line against the regular expression pattern.

There is also a module called Logs HTTP that streams JSON events from Drupal watchdog straight to an endpoint, and you would use that one when you are not in charge of the syslog on the server, when somebody hosts it for you. I know that Acquia also has live log streaming through their GUI, and there is also a gem you can get which connects to a web socket, so you can get a close-to-real-time stream of logs from your Acquia subscription into a file somewhere, and then get it into your Logstash.

So I think I have covered, hopefully, most of the topics I wanted to tell you about, and I did it in reasonable time; nobody's sleeping. To wrap up: I think building your own highly available ELK is a joy. The joy does not finish with its deployment; it's a continuous joy, in my opinion, and monitoring is a must-have. So I prepared a few links if you wanna know where to start next. There are official Ansible roles, Puppet modules, and cookbooks for Elasticsearch by Elastic; those are the official ones.
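Coming back to the watchdog parsing for a moment: a grok filter along those lines might look roughly like this, assuming Drupal's default pipe-delimited syslog format (the field names are my guesses, not the exact contents of the gist):

```
# Logstash indexer filter (sketch)
filter {
  if "drupal" in [tags] {
    grok {
      match => { "message" => "%{SYSLOGTIMESTAMP:drupal_timestamp} %{HOSTNAME:drupal_vhost} drupal: %{URI:drupal_base_url}\|%{NUMBER:drupal_unixtime}\|%{DATA:drupal_type}\|%{IP:drupal_ip}\|%{DATA:drupal_request_uri}\|%{DATA:drupal_referer}\|%{DATA:drupal_uid}\|%{DATA:drupal_link}\|%{GREEDYDATA:drupal_message}" }
    }
  }
}
```

The conditional on the tag is the payoff of tagging at the edge: only lines Filebeat already marked as Drupal ever hit this regular expression.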
There is a simple Kibana role, just to show you how you can do it quickly in Ansible, and Filebeat as well, plus the Drupal watchdog gist. And as I said, the Elastic documentation is brilliant; if you just read it, if you wanna understand, it's brilliant. There are also a few references to my previous presentation, and that's it. Any questions?

So I have two questions. The beats you were showing, the Filebeat that delivers to Logstash: can you configure them so that they deliver chunks of information, so that it's not every line registered by Apache, but you use them as a buffer to collect, say, five megabytes and then make one transaction? And the second part of the question: does it work over an HTTP or HTTPS connection? We actually have a proxy issue, and we don't want to ship every line through the proxy. Our idea was that if we can pack the lines into chunks of five or ten megabytes and then ship them through the proxy in one connection, we probably won't get fired by the other team that is managing the proxy.

Okay, I understand the question. I am not sure; I think you can define the chunks, but if you cannot, I would probably have a log rotation policy which creates chunks of a given size. When you stream with Filebeat, you can use wildcards as well, so it keeps matching all the names. So maybe you can have a logrotate policy that creates a new five-megabyte piece, gives it this name, and then Filebeat realizes, oh, there is a new file, and streams it at once. That would be my way of getting around the first requirement you had. Whether it supports HTTPS instead of TCP, that part I don't know, but I would check the documentation, because as I was checking, it supports TCP with SSL, it supports Redis, it supports Elasticsearch, so it's possible they will keep adding outputs; as I said, they try to let you avoid Logstash and stream directly. So it's likely that the beat might support HTTP as well.
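That rotation workaround might be sketched like this (the size, paths, and names are assumptions, and whether it actually batches transactions the way the questioner wants would need testing):

```
# /etc/logrotate.d/apache-access (sketch)
# Rotate whenever the file reaches ~5 MB, keeping the chunks around uncompressed
/var/log/apache2/access.log {
    size 5M
    rotate 10
    missingok
    notifempty
}

# filebeat.yml
# A glob prospector picks up each rotated chunk as it appears
filebeat:
  prospectors:
    - paths:
        - /var/log/apache2/access.log*
      document_type: apache
```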
Can I ask you as well: you said that you can actually get rid of the first load balancer and the Logstash shippers if you use Redis directly. Why don't you do that? Why do you complicate your...

I understand the question, so let's go back; that is the cluster. Honestly, when I found out that there is this SQS, and it gives me this unlimited storage and it's cheap, I thought it was better for my case than using Redis, which I would have to make highly available myself. I think ElastiCache probably supports highly available Redis, but the auto-scaling is not there, and I cannot be sure it can cope. I thought that using the queue was actually easier, honestly, and it didn't cost that much. Plus I like it. Yeah, okay guys, one more.

Just a minor question regarding your slides; I didn't get something. You said that Kibana is recommended to run against a local ES node, but on another slide you said don't run Kibana on an existing ES node, a master node. I'm confused.

Yeah, okay, so that depends on what kind of node it is. If you have your Elasticsearch cluster, which consists of masters and data nodes, don't connect Kibana to any of these. You connect Kibana to a different Elasticsearch node, one which you configure not to be a master and not to be a data node. It becomes a so-called client node, and that still talks to all the other nodes. It's a retriever: Kibana talks to it because you install it on the same server Kibana is on, but it holds no data. It's just load balancing to the other nodes. So that's the recommended solution. Yes, it's no master and no data; it's basically an internal load balancer within the Elasticsearch cluster.

Another question; I don't know if it's in the scope of this presentation. Do you have any way to query Elasticsearch via JSON, or REST services in general, in order to retrieve data that has been indexed?
Because from what I see, Elasticsearch has the magic to combine data from several sources, be it an Apache server or MySQL or PostgreSQL or whatever system you have, so all the data is already indexed in Elasticsearch. Can you use this indexed data somehow, with an API, to present it on a Drupal site or something like that?

Okay. I know it's out of scope, but as I said, I don't understand Elasticsearch that deeply. But I know that, once again, you can use Logstash with the Elasticsearch input plugin, so you can basically have Logstash hooking into Elasticsearch, reading the already existing data and delivering it somewhere else. That's one way of doing it. Also, Kibana shows you which data is indexed and which is not; if you browse an index pattern which hasn't been refreshed, it tells you, and the GUI guides you to do so. But I don't know more about that, unfortunately. Yes.
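On the REST part of that question: Elasticsearch itself exposes everything over a JSON HTTP API, so any client can pull indexed data back out. A sketch with curl (the host and index pattern here are assumptions):

```
# Search the logstash-* indices for Drupal events, newest first
curl -s 'http://localhost:9200/logstash-*/_search?pretty' -d '{
  "query": { "match": { "type": "drupal" } },
  "sort": [ { "@timestamp": { "order": "desc" } } ],
  "size": 5
}'
```

From PHP, the official elasticsearch-php client wraps this same API, which is one route to surfacing the indexed data on a Drupal site.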