All right, thanks everyone for coming. I know it's kind of a late afternoon, so great to see you all here. My name is Martin Lanner, I'm with SwiftStack. We do Swift, obviously. And I'm going to be talking about monitoring and analytics using the ELK stack: Elasticsearch, Logstash, and Kibana. I'm assuming this works, maybe not, there we go.

Okay, so here's my little agenda; this is what we're going to be talking about. Why would you want to do this? Who would want this data? I'm going to look a little bit at the components that I have deployed and talk about the setup and configuration, and a little bit about the grok patterns that I have created for this. There were none that I could find originally, so they're brand new. I'm going to do a walkthrough and live demo on the VMs on my laptop, and then we're going to look a little bit at the visualizations and dashboards that Kibana provides.

All right, so why is the question. I work at SwiftStack as an engagement manager, which means I work a lot with our customers. I deploy Swift for them, and we have multiple machines, and we need to get the logs into a place where we can help them troubleshoot, understand what's going on, all kinds of things. Typically your logs are all dumped locally on each machine, and when you have tens, maybe hundreds of machines, it's kind of hard to troubleshoot anything because you need to get the logs from everywhere. So to a lot of people that are running Swift, it's like this: it's just a big black box. They don't know what it is and what's in there. And everyone knows that the internet, and Swift in particular, is all intended to store cat photos. And since we're in Tokyo, it's gotta be a Hello Kitty.

All right, so in my daily job I also have a bunch of people that work with me in our support and professional services organization, and we have to support these systems, and it's challenging, like I said. We have to understand what is happening, where and when. We need to be able to troubleshoot the systems faster, but we also want to know what the cluster is being used for. Who's doing what, when, where? There's a lot of data, but there's not a whole lot of information. With the data coming in from the logs, we can turn that into information that's actionable in different ways: providing easy access to it, and then when we see something going wrong, whatever it may be, we can proactively start taking action if needed.

So then the question becomes, who actually wants this data? I would argue that the devs want the data because they can better understand if there's a bug, something that needs to be fixed. Ops absolutely needs to look at it all the time to understand how to better manage the clusters. Support people obviously want it as well. And ultimately, business people probably want the data too, because you want to understand, for example, what your workloads look like. Workloads are critical to being able to put the data out there and make the applications run well and serve customers, whether they're internal or external.

So with that said, the components here are OpenStack Swift, Elasticsearch, Kibana, and Logstash. Most of what we'll be doing here will be focused on the Logstash piece; a lot of work has gone into that. And then we'll look at it through Kibana and Elasticsearch. So here's the ELK stack. The only other thing I've added in here is Nginx, which just adds a web frontend for Kibana.
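[Editor's note: as a minimal sketch of that Nginx frontend piece, not shown in the talk, a server block along these lines would proxy port 80 through to Kibana. It assumes Kibana is on the same host listening on its default port 5601, and the server_name is a placeholder.]

```
# Minimal Nginx frontend sketch for Kibana (hypothetical hostname, default Kibana port assumed).
server {
    listen 80;
    server_name elk.example.com;

    location / {
        proxy_pass http://localhost:5601;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}
```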
Then I have the Swift node, and I have a single Swift node that I've used in this example. That Swift node is running proxy, account, container, and object, so all four main services on a single box. Of course, in reality you'd have much more than that, but it's a lot of VMs to run on a single laptop for a demo.

All right, so hopefully you can see this. Elasticsearch runs on port 9200, Logstash on 5000, Kibana on port 5601, and Nginx in this case on port 80. And on the Swift cluster side, I have Logstash Forwarder installed to send the data into the ELK environment.

All right, so we have a couple of key configurations here. The key configurations really come down to, I think, the conf.d filters for Logstash on the server. Grok patterns, I'm not sure how familiar everyone is with them, but they're basically based on regular expressions. And if you know anything about regular expressions, that can be kind of tricky. So like I mentioned earlier, I went out and kind of scoured the internet to see if anyone had done anything before with actually parsing the Swift logs for indexing on the Elasticsearch side. I didn't find anything, so I went ahead and started creating grok patterns specifically for those logs. Generally speaking, Swift comes with two logs: the proxy access log, which has the proxy stuff in it, and the storage log, which has the account, container, and object pieces, plus all the replicators, auditors, and things like that. And again, Logstash Forwarder is installed on the node, and in this particular example it only ships the system logs and the Swift logs. In real life you'd probably include a lot more than that, but that would be really confusing and too much data for this demo.

All right, some ELK server configs. You can see here I've basically broken it out into three different files: the lumberjack input conf, the filters, and the lumberjack output. The one I'm going to be focusing on is the filters part, and I based all of this on the Swift log format. This is available on the web, so if you just go to the OpenStack docs for Swift, it's there, and you can see the client_ip, remote_addr, datetime, request_method, and so on and so forth. These are all very actionable pieces of information that you can use to get really good insight into what's going on.

So here's the example. This is the grok pattern for the proxy access log, and I'm sure a lot of you are probably pretty savvy with this. Later on I put this up in a public GitHub repo, so if you want to help contribute to it or play around with it, you can go download it and check it out. In addition to this, the Swift logs have a few different ways of doing timestamps, so I created some extra patterns just to adjust for those pieces. And you can see the extra patterns file, which I call swift; they're pretty simple, but they're there.

All right, so the Swift node configs. Hopefully you can see this, it's kind of small maybe, but it's okay. This is the simple output that you saw before the filter, and specifically you can see the piece that says /var/log/swift with the *.log in there, and then, under the fields, I've added a type called swift, so that it ends up as a field in the interface that we can easily search on. And I've also specifically named it swift.example.com; that's just my cluster name, so you can call that whatever you want.
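[Editor's note: to make those two pieces concrete, here is a rough reconstruction rather than the actual files from the talk's repo. The hostnames, file paths, and the cluster field name are placeholders, and the grok pattern simply follows the proxy access log field order from the OpenStack Swift docs, so the real patterns may differ. First, a logstash-forwarder config on the Swift node that ships /var/log/swift/*.log with a type of swift and the cluster name:]

```
{
  "network": {
    "servers": [ "elk.example.com:5000" ],
    "ssl ca": "/etc/pki/tls/certs/logstash-forwarder.crt"
  },
  "files": [
    {
      "paths": [ "/var/log/swift/*.log" ],
      "fields": { "type": "swift", "cluster": "swift.example.com" }
    }
  ]
}
```

[And on the ELK server side, a filter along these lines would parse proxy access log lines, with a small custom patterns file handling Swift's slash-separated timestamps:]

```
# /etc/logstash/conf.d/11-swift-filter.conf  (hypothetical filename and paths)
filter {
  if [type] == "swift" and [message] =~ /proxy-server/ {
    grok {
      patterns_dir => [ "/etc/logstash/patterns" ]
      # Field order follows the proxy access log format in the OpenStack Swift docs.
      match => { "message" => "%{SYSLOGTIMESTAMP:logged_at} %{SYSLOGHOST:syslog_hostname} proxy-server: %{NOTSPACE:client_ip} %{NOTSPACE:remote_addr} %{SWIFT_PROXY_DATE:datetime} %{WORD:request_method} %{NOTSPACE:request_path} %{NOTSPACE:protocol} %{NUMBER:status_int} %{NOTSPACE:referer} %{NOTSPACE:user_agent} %{NOTSPACE:auth_token} %{NOTSPACE:bytes_recvd} %{NOTSPACE:bytes_sent} %{NOTSPACE:client_etag} %{NOTSPACE:transaction_id} %{NOTSPACE:headers} %{NUMBER:request_time} %{NOTSPACE:source} %{NOTSPACE:log_info}" }
    }
  }
}
```

```
# /etc/logstash/patterns/swift  (custom pattern for timestamps like 26/Oct/2015/21/38/19)
SWIFT_PROXY_DATE %{MONTHDAY}/%{MONTH}/%{YEAR}/%{HOUR}/%{MINUTE}/%{SECOND}
```

[With something like this in place, the lumberjack input receives the shipped lines and the filter breaks them into the named fields that show up in Kibana.]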
All right, so let's dive a little bit into the demo here and I'll show you what's going on. This has been running for about half an hour now, so this is just the standard view. I selected some fields here: the cluster, so that I can search for a particular cluster; if I have more than one cluster, I can run all of them into the same instance and search across them. I included the host field up top here so that I can see where the data is coming from; I call this SwiftStack node one. And I also show which log each message is coming from.

If you want to dive down into this a little bit, you can see all the different pieces. So I showed you the log definition from the OpenStack Swift page earlier, which is actually this right here. You see the host field, you see headers, you see the file, you see a datetime stamp, you see the cluster, the client IP, there's none in this case, but here you see bytes sent and the full message, and a bunch of other information like request time and so on.

All right, so let's go and pick just the proxy log. Here you can see the different proxy log entries, and if I want to, I can search for "hello kitty". Oh, no, I don't get anything there, unfortunately. Let's see... oops, okay. Let's upload something, and there the upload goes. All right, there you see. Now you can see that there's a Hello Kitty image in there, and we can see a transaction ID specifically for that. So if I wanted to, for example, I can take that particular transaction ID and search for it across the other log entries I've generated, and you can see how that shows up here. All the entries that have that same transaction ID will show up.

So why is that important? Well, there are times when people say they lost data. So far I've never seen Swift ever lose data, but I have a transaction record of everything that goes on in the cluster. So if I know that the Hello Kitty image was in there, I can search for it, I can find a delete record for it, I can find the transaction ID, and I can track it all the way through and see everything that it did. I can see which disks it actually was on, and I can see what IP address deleted that item. So it's incredibly powerful to be able to do that, and we at SwiftStack have had to do this for some customers at times when they said, oh, we lost data, and we've actually been able to go through, find the data, and say, well, you know, this IP address actually deleted the data. So it's not that we lost the data; someone deleted it intentionally.

So there's a bunch of visualizations you can create here. Here's one example, hopefully you can see this, but this example is for the different status codes, so you can see what kind of status codes are being generated by the system. Most of these are 200 OKs, which seems like a good thing, and then we have some others here like 201 Created, 204 No Content, and so on and so forth. So I've created a bunch of visualizations here. You can refresh them as well and you can open new ones; we can look at the different ones that I have here. Now, these visualizations can be really powerful. You can create whatever you want; you just have to kind of dream up what you're looking for and what's important, and you can put it up on a dashboard.
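[Editor's note: the transaction trace done above through the Kibana search bar can also be run straight against Elasticsearch. This is an illustrative sketch only: the index pattern is Logstash's default, the transaction_id field name assumes the grok pattern extracted it under that name, and the transaction ID itself is a made-up placeholder.]

```
# Hypothetical example: pull every log line carrying one Swift transaction ID,
# oldest first, from the Elasticsearch API on port 9200.
curl -s 'http://localhost:9200/logstash-*/_search?pretty' -d '{
  "query": {
    "query_string": { "query": "transaction_id:\"tx0123456789abcdef01234-0055aabbcc\"" }
  },
  "sort": [ { "@timestamp": { "order": "asc" } } ],
  "size": 100
}'
```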
So I created a couple of simple dashboards here, and right before I walked in here I started dragging and dropping a bunch of information, deleting some data, uploading some data. One of these is just a CRUD profile; it lists the puts, gets, deletes, and posts. Here are the request methods again the same way, but broken out into percentages instead.

Here's some user agent data, and I'm not sure if you can see this, it's a little hard to see maybe. It's maybe better like that. User agent data can be really important. If you've ever used Swift, you might have come across something called python-swiftclient. In this particular case, it's hard to see here, but I'm using swiftclient 2.6.0. A lot of people may use different Swift clients. So the question is, well, why don't they upgrade to the latest? It may have new things, new features; maybe we want to actually push them to upgrade to the latest. And by looking at this, I can go out and find out what people are using. Are they using the C++ libraries? Are they using Java libraries? Go? python-swiftclient? It gives you a lot of good information in terms of what you want to do with the cluster: who's using it, how they're using it, and so on.

Other things: request times over time. Also pretty informative data. You can see how long it takes to upload things. If you start having problems in the cluster, that request time may go up, but you can baseline it and understand how the cluster is doing based on looking at the request data over different time ranges. The status codes one we looked at before. Here's another one, a pie chart that I created for object uploads in bytes, so you can see the size of all the different objects. You can create another one that shows the distribution specifically. And that may be important, because if you have, let's say, an application that renders really small mapping images, you may actually want a cluster that is built on a particular kind of hardware. If, on the other hand, you're using the cluster as a backup target or an archival thing, it may be a different set of hardware that you want or need. Then you can customize your cluster based on that, and that may have implications on how much money you pay for the various pieces in the cluster. So that's pretty much it there. We'll go back to the presentation. And let's see here, there we go.

All right, so I created a couple of to-dos here for myself, and also maybe to inspire other people to take a look at this and understand how it works. Some of these grok patterns I created can definitely be refined. You can also make additional grok patterns to understand things like what the replication cycles look like in the cluster. How fast are they? Do you need to take action to make sure your replication cycles are faster, and do you have any kind of problems? So for example, one thing that would be nice to do would be to separate out the replication and auditor log files to have more fine-grained information about that. Like I kind of highlighted before, there are a million different ways of looking at data. It depends on who it is; you make different dashboards for different types of people. Ops people may want to see one thing while devs might want to see something completely different, or devs might not even care; they just want to search through the logs.
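[Editor's note: picking up the replication to-do mentioned above, here is a very rough idea of where additional grok patterns could go, aimed at object-replicator progress lines. The example message wording is only an approximation of what the replicator logs and varies across Swift versions; none of this comes from the talk's repo.]

```
# Hypothetical additions to a custom patterns file, targeting object-replicator lines such as
#   "54/54 (100.00%) partitions replicated in 0.12s (449.97/sec, 0s remaining)"
#   "Object replication complete. (0.02 minutes)"
SWIFT_REPL_PROGRESS %{NUMBER:parts_done}/%{NUMBER:parts_total} \(%{NUMBER:pct_done}%\) partitions replicated in %{NUMBER:repl_seconds}s
SWIFT_REPL_COMPLETE Object replication complete. \(%{NUMBER:repl_minutes} minutes\)
```

[A filter could then match these on messages from the object-replicator, and a visualization on repl_seconds would work the same way as the request-time chart shown earlier.]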
Ultimately, what I'd like to do is to actually push these grok patterns upstream to Logstash, get them into the Logstash distribution, so that when you install it, it comes pre-configured and ready to go. All you need to do is start it, point your logs at it, and you're done. So I created a Git repo yesterday. It's available under the SwiftStack account; it's called Swift to ELK. If you want to take a look at it and deploy it, it's there. I need to put a little bit more information in on how to actually deploy it, but it's pretty straightforward. And that's pretty much it for my presentation. If you have any questions about what I did, how I went about it, or if you want me to demo anything else, we can play around with the actual demo. It's live and up and running; we can unmount disks and do things like that. Anyone?

Thank you, this is very helpful. I have one question. If you have a multi-region Swift cluster, what's your recommendation for this ELK setup? What does it look like?

That's a good question. So obviously if you have multiple regions, you'd probably have one ELK server on one side, and then you may not have another one unless you build it as a cluster, which was a little bit beyond the scope of what I was trying to achieve here.

Right, so you're saying you'd have one ELK server on each side? I don't want to lose ELK, right? It needs to be a distributed solution.

Yeah, you can build a distributed ELK setup if you want. I did not really look into that, to be honest, at this point. But if you wanted to, you could point, so to say, data center A to one side and data center B to the other side. Ultimately, though, if you want all the data in one place, you'd have to combine it and point it at a single Logstash instance.

Have you ever considered having two separate ELK setups and then using, I think Elasticsearch has some capability, what is it called? I forgot the name, but basically it can combine multiple Elasticsearch clusters into one view. Versus doing just one, because you'd have a latency problem, right?

Yeah, I have thought of it, but I didn't really look into how to do it.

Okay, thank you.

But if you have any suggestions for it, I'd happily take them. Any other questions? Yes?

So, Martin, have you thought about adding log analysis, like event correlation? If you see high latency, what is the root cause of it? Right now, from what I've seen, it's mainly the search or the dashboard display; the root cause analysis part is pretty much manual, right? Have you thought about pushing it to the next level, to something like that?

Yeah, that's a good question too. There are some tools out there that you can connect to Logstash and Elasticsearch so you can alert on the data and learn more and more over time. As you see things coming in in the logs with potential issues, you can start alerting on them ahead of time so that you know when something's coming up. Beyond that, once you see that and start learning about it, I think that's something the developers would really be interested in: seeing what those patterns are and understanding how they work.
Yeah, I think the problem we have right now with other types of storage is that when there's a client impact, they say, you know, the latency is high, and it's a pain to find the root cause of it. It could be some disk dying, or something else, the network. But you know, the debugging process is really a piece of art, right?

Yeah, for sure it is. It's hard, but I think with things like the latency, or the time to completion that I showed earlier in the examples, if you see those going up to abnormal levels, that's where you really need to start looking into the logs. And if you see a disk unmounted in there, I didn't actually show an example of that, but if you start losing disks rapidly, that can obviously be a problem in terms of how the cluster needs to replicate on the back end, because it's going to have to take data from other parts of the cluster and move it around to create a third replica of that data. And that is something you can also see: if you unmount a disk, you will see a high amount of replication activity in the logs. So you can make aggregations on that and correlate it to a disk failure, or correlate it to, well, suddenly I have a huge influx of data in the cluster. There could be different reasons why that would happen, but if you aggregate all that data you can definitely see what's going on and why it's happening.

So, any other questions? All right, well, I will be hanging around here for a few more minutes if anyone wants to talk afterwards or take a look at what I have running. And thank you very much.