All right, thanks for coming, everyone. We're going to be talking about monitoring Swift tonight, or today, this morning. I guess it's so dark in here, it feels like night. My name is Martin Lanner, I'm an engagement manager with SwiftStack. I deal with a lot of our customers; we do implementations. And the reason I picked this topic was because everyone asks, hey, now that we've got this up and running, how do we actually monitor Swift and make sure that this thing stays up and running and everything? And with me I have Adam Taclam, senior systems engineer, doing most of the pre-sales kind of stuff with the customers. So he runs into pretty much the same questions I get asked all the time when I do implementations.

So we've picked out a couple of monitoring solutions. Some of these you may be aware of; some of them you may have used yourselves. And it's just an example, really. You can use whatever you want, but we thought these ones were interesting and fun to work with, so that's what we did. So we're going to be using the ELK stack, Elasticsearch, Logstash, and Kibana, which is the graphical interface for looking at logs. We are going to have a Zabbix environment that we have loaded up, and we have Prometheus, a newish monitoring system, which is going to be fronted by Grafana. And in the middle here, of course, is a Swift cluster. It's actually just a single node, so it's a cluster of disks, and that's what we're going to be looking at.

So here's a little bit of an overview of what we're looking at today. What are the problems? The problems are really usage intelligence: what is the cluster doing, who is using it, all those kinds of things. Capacity planning: how fast is data growing in the cluster, and when do we need to add more? Operational health: just sort of system metrics type of stuff. And then audit trails: who did what, when, and maybe why. The background here is that we're basically looking at the logs from the Swift cluster, and we're looking at the system metrics from each individual node in the cluster. We will then dive in a little bit into interpreting the metrics around this, and as a follow-on to that, we'll be doing thresholds and alerting on top of that. And then we'll talk about the monitoring concepts of what to monitor and, obviously, how to monitor it. And then the methods we are using: logging is going to be ELK, trending and forecasting with Prometheus, and systems monitoring with Zabbix. These overlap a little bit, and it's basically pick your poison; whatever you happen to like better, you choose. And it may not even be one of these.

So, all right, after all, this is just Linux. What would you do if you had an Apache web server that you needed to monitor? You would monitor it as a Linux machine, with potentially some additional Apache log metrics that you were trying to figure out, and stuff like that. So it's really not all that different, and you will see that as we go along this journey.

So, hopefully most of you have some knowledge about Swift, but if you don't, I'm just going to run through really quickly what Swift does and what some of the key properties of Swift are. It's a distributed system, many, many nodes, many, many disks. And it stores data extremely durably. I have not, to this day, seen Swift lose data at any time, although people have claimed that it did, but we'll go through that too. And I can show you how we figured out that it didn't lose data; they actually deleted it.
Swift does this through what's called storage policies. It's either replica-based, which is by default three replicas in the system, or it's done through erasure coding, which is a newer type of storage policy, but it's really interesting. Of course, there are lots and lots of different machines and disks. We have no single point of failure in the system; it's designed from the ground up to be like that. There's even distribution of data in the system, meaning that you can actually have four terabyte drives in your old gear and throw six terabyte drives in the new gear, and as you fill up the system and it's 50% full, a four terabyte drive will be two terabytes full and a six terabyte drive will be three terabytes full in the same system. So it's kind of smoothing this out. It's really a resilient system. You can do a lot of bad things to it, and it will still just continue working. To my last point here, I have seen cases where people have ignored the system; you can abuse it and put it through a lot of negligence, and it will still run. And the nice thing about that is it has all these self-healing capabilities, so that if a disk dies or if an object gets corrupted, it will still replicate that out and make sure that there's a third replica of it, and so on. So Swift is extremely resilient and very nice to work with in that sense, because really there's no rush to fix problems. You have time to think about how you need to do this. And with that, we're going to go through and figure out with you how you can look at the problems that occur, figure out what the problem is, and then fix it. And with that, I'm going to turn it over to Adam, who's going to talk about the anatomy of the solutions.

So as we go through and look at a few different types of monitoring solutions for Swift, it might be helpful to keep in mind the four major components that make up any kind of monitoring solution. The agent is just the bit that runs on the individual server that's collecting the metrics, and it may be collecting those metrics through the kernel, looking at CPU and memory, or through the network drivers, or it may be parsing the logs, in the case of Logstash, to gather them that way. Whatever that methodology is, that's what makes that particular agent unique and able to expose those metrics to the aggregation engine. The aggregation engine then takes all those metrics that it is getting from all the different servers and brings them together into a common place. Usually that common place is some type of time series database, and it makes that available to the visualizer. Now, the way it makes it available could be just very simply that you can query and get metrics within a particular time range, but usually there are additional functions to be able to do things like the linear regressions that we're going to see later on, to make that a more enhanced experience. The visualizer then, as I kind of already alluded to, allows you to define graphs and visual elements so that a human being looking at this, instead of just a long list of metrics, either numbers or log lines, can easily visualize what's going on and spot trends or anomalies in the data really quickly. Then lastly is alerting. Alerting, there are two pieces to that. First is thresholding. Thresholding is defining what's normal and what's not. So if a metric exceeds a particular value, called the threshold, then that's when we want to trigger an alert.
We want to send, say, an email to an administrator to let them know that things are out of whack. Oh, excuse me, I need water.

So, developing a monitoring strategy. We can't pick a monitoring solution if we don't know what it is we're trying to do. So the major points of a monitoring strategy are really two things: what we want to monitor and why we want to monitor it. I've listed a few. I don't know that this is necessarily an exhaustive list of all the different possible forms of monitoring you could have, but I think this hits the major ones. So, system utilization, kind of the basic one that we're all familiar with: gathering CPU, memory, and network I/O. But that can also include service-specific metrics, in this case things like auditing cycles and replicator timing. We could look at consistency metrics around Swift, so all that would fall into that category. Then there's monitoring for performance: trying to see, well, it's not necessarily an error, but are things just starting to slow down. For error monitoring, I have two here, errors and outages, and they differ in that errors are looking at whether the user did something that provoked an error response from the service. Are they asking for something that doesn't exist, or is the request malformed, or have they just somehow gotten themselves into a bad state? That's different from an outage, where the service just isn't there anymore. So either the entire host is gone, or the service has crashed, or some dependent service, something the service needs, is gone. Feature usage. Feature usage and audit trails also kind of go together. Feature usage is saying, well, what's popular in my application? With a web app, this is a little easier to see. Let me just take an example: if I have a web application that allows users to log in, then they can view certain videos, and maybe they can read certain documents, and we see, okay, well, 80% of the traffic is going to the videos and 20% is going to the documents. So now we can start to make some intelligent decisions about what we need to optimize for based on that information. The audit trail is a little more specific. It's not rolling that information up. It's more saying, you know, Bob, a specific user, went and did this specific operation at this specific time. Like Martin was mentioning earlier, people quote-unquote lose data. You can go back with an audit trail and say, nope, I can see right there, that's exactly where that specific piece of data was removed by this specific user. Certainly people who work in regulated environments are used to this as well: being able to have full traceability on your data, so you know everybody who's touched it, when they touched it, and more or less what happened to it.

So once we've identified, through the different forms of monitoring, why it is we're trying to monitor the system and what we want to get out of it, then the life cycle of a metric is obviously measuring the metric itself, reporting on that measurement, and then characterization. Characterization is the process of figuring out what's normal and what isn't: setting those thresholds. Characterization can be difficult, and it can be difficult in a couple of ways. One, because sometimes metrics don't cooperate. It's not that it's always a three all the time; there could be a significant standard deviation within what's considered normal for a particular metric. But also any time you get a new version of the software.
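To make that characterization step itself a little more concrete (we'll come back to why new software versions complicate it), here's a minimal sketch of the usual statistical approach: treat the mean and standard deviation of a window of historical samples as "normal," and flag anything too many deviations outside that band. The sample values and the three-sigma cutoff are just illustrative assumptions, not numbers from our demo.

```python
import statistics

def characterize(samples, sigmas=3.0):
    """Derive a 'normal' band from historical samples of a metric."""
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)
    return mean - sigmas * stdev, mean + sigmas * stdev

# e.g. recent response-time measurements in seconds (made-up numbers)
history = [0.21, 0.19, 0.25, 0.22, 0.20, 0.24, 0.23, 0.18, 0.22, 0.21]
low, high = characterize(history)

latest = 0.95
if not (low <= latest <= high):
    print(f"alert: {latest}s is outside the normal band [{low:.3f}, {high:.3f}]")
```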
Again, take our web application example. Every time engineering drops a new version on you, that could change what the normal response time is to be able to start playing one of those videos. So the characterization has to be redone every time somebody messes with it. So, threshold numbers, then alerting. Those are really the two pieces: identifying when a metric has exceeded that threshold, and then defining a method to alert people and who it is that gets alerted. So usually you have various groups of people, and they're going to be alerted via SMS or email or whatever. Root cause analysis. Once you get an alert, what are you going to do about it? Often the main alert that gets triggered, the one that indicates a problem from the user's perspective, isn't the metric that actually points to what went wrong. So there are really two classes of metrics there. One I call the canary metric, and for a web application that's usually response time: looking at the latency it takes to respond to various different requests from users. But just because that slows down, or users start getting a bunch of errors, or it's not responding at all, that probably doesn't tell you a whole lot about the underlying cause. So usually the system utilization metrics are going to be the ones that start pointing you toward where that problem really is and how to resolve it. And that is the last step, which is remediation: actually doing something about it, either in an automated fashion or manually. Sorry.

So we're looking at three different monitoring packages here today. There's really a lot of overlap between the different packages. But what makes ELK unique, for example, is that it's looking at log data. So it's going to be both empowered and limited by what information is reported to the logs from the various services, in our case Swift. It's going to be looking at who. Who is it that's accessing data? What are they accessing? What agents? Since we're looking at user agents: what platforms are they running, what browser? All that can be helpful both in identifying areas to optimize and in diagnosing issues. Triggering on those response codes, like I mentioned, is really going to be in the realm of ELK, because the proxy will be logging every transaction that comes through. So any time we get a request or response, it'll be there in the log. That makes this a very appropriate tool for gathering that information, and the errors and audit trails kind of follow on in the same way. Prometheus and Zabbix are both agent-based monitoring solutions. They both have agents that run on the machines, looking primarily at kernel metrics and reporting them back. But they both also have the ability to take additional plugins, extensions, to gather service-specific metrics as well. In this case, though, I wanted to focus on capacity planning with Prometheus, really just for one method, and we'll see that later on: the regression method that's useful for forecasting and predicting what's going to happen within some period of time based on what's been happening, which isn't necessarily always true, but we'll see that in a moment. And then Zabbix for operational health, so that's the more traditional utilization of that tool, to gather network, CPU, and memory consumption.
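Since everything ELK gives you here comes from parsing the proxy access log, here's a rough sketch of what the Logstash side of that can look like. The grok pattern below assumes the default space-separated proxy-logging line format; the field names and their order are assumptions you'd want to verify against your Swift version's proxy-logging documentation.

```
# Logstash filter sketch; verify field order against your Swift version
filter {
  grok {
    match => {
      "message" => "%{NOTSPACE:client_ip} %{NOTSPACE:remote_addr} %{NOTSPACE:timestamp} %{WORD:method} %{NOTSPACE:path} %{NOTSPACE:protocol} %{NUMBER:status:int} %{NOTSPACE:referer} %{NOTSPACE:user_agent}"
    }
  }
  # With fields extracted like this, Kibana can aggregate on method,
  # status, or user_agent instead of free-text searching the raw line.
}
```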
So when we're looking at monitoring Swift, there are a few things that really jump out as being the most important, because, as Martin mentioned, Swift can take a lot of abuse. But where do we draw the line? Where is it that, okay, we really can't put this off any longer and we need to take some action? First on the list is cluster full. This is something that we've seen several customers run into, because it's tricky, and I think a lot of people don't initially, until it happens to you, really appreciate the importance of it or what exactly it is to be looking out for. Because it's not a matter of saying, oh, okay, we're going to completely run out of disk space, we're going to go to just zero available space on the file system within some period of time; your cluster will actually completely fail long before that ever happens. It's almost like flying into the sun. Touching the sun is irrelevant, because you have already vaporized before you got there. So in the case of Swift, really, it's about 20%. Once we get to about a 20% margin of available disk space, we need to be throwing some really critical alerts that we need to do something, because that buffer is necessary in order to actually run a lot of the resilience and durability mechanisms that exist in Swift. So say you had your cluster sitting at 80% utilization, and then you have a couple of disk failures. Now the cluster has to rebalance and put that data somewhere else. You need that buffer to be able to do that. If you get within that buffer zone, and then something like that happens, the cluster needs to be rebalanced, and you run out of space to do that rebalance, you're going to have a really bad time.

So, networking, proxy states, and looking at the auditing cycle. These are really getting down into understanding some of the causes of problems that can happen in Swift. One of the main things around networking is really performance. In most clusters, and it does depend on a few factors, but most of the time, network bandwidth is going to be your bottleneck for performance. Looking at it from a throughput perspective, you can completely saturate a 20 gigabit link through Swift without the cluster hardly breaking a sweat, with a sufficient number of disks anyway. So watching that, and being aware of where your current utilization sits versus your total available bandwidth, is going to be a very critical indicator to know when you're about to run into some performance problems, and maybe even be able to get ahead of that before it happens. So, the proxy state. It was already mentioned a bit on the previous slide, but it's important because that's the place where all the requests ultimately funnel in from the beginning. And if something's going wrong with the proxy, if it's not responding or it's down or it's slow, none of the services behind it have any chance. So that's why it makes sense for it to be a focus of your monitoring strategy. And we mentioned here account replication and container replication. These are cycle times of the workers that are going on in the background, and they can help us understand if there are insufficient resources being allocated to maintaining consistency in the cluster, based on the rate at which we're ingesting data. So if data's coming in faster than we can replicate it across, we would know that by watching the cycle times increase.
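Going back to that cluster-full point for a second: the 80% line is easy to check as a back-of-the-envelope number straight from df. This sketch assumes the common convention of mounting Swift data drives under /srv/node; a real deployment would run this per node and feed it into the monitoring system rather than eyeballing it.

```sh
# Aggregate utilization across all Swift data drives on this node
# (assumes drives are mounted under /srv/node, the common convention)
df -P /srv/node/* | awk '
  NR > 1 { used += $3; total += $2 }
  END {
    pct = used / total * 100
    printf "swift drives: %.1f%% used\n", pct
    if (pct > 80) print "WARNING: inside the 20% rebalance buffer"
  }'
```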
So, load balancers. This is kind of building on what I was talking about earlier, more specifically, because this is something that Martin and I get asked a lot as well: how do I configure a load balancer for Swift? Because it has to balance load across the different proxy nodes. And why? Why do I need a load balancer with storage? It's really weird. Particularly the traditional storage guys will ask us that, because when they deploy different block storage systems, that's not an issue. So we go through that and explain it.

The first thing, by default, that most load balancers are going to do in order to determine that they're sending load to valid proxies is to just ping those hosts. And that really gives you an absolute minimum amount of information. I mean, it tells you whether the host is up or down. That's really all you know based on a ping. So then the next level after that is trying a TCP connection to the target port. So if that port was 80 or 443, then actually having the load balancer try that, to see if the service is accepting connections on that port, gives you more information. Now you know not just whether the host is up or down, but whether the service is up or down. Beyond that, you can have a specific health check URL that will do more checking on the back end to ensure not only that the service is up, but that it's actually in a good state to service requests. In the case of Swift, there is a health check middleware that you can install that will give you that health check URL. And it's also used for an interesting, kind of a side purpose, which is doing rolling upgrades.

So we'll just walk through the workflow really quickly. If we're going to do a rolling upgrade of the Swift software across, say, three nodes that are running proxy, account, container, and object on them, then we can set the proxy into a state, in this case by creating a particular file in a particular location, that will cause that health check URL to return failure. When that returns failure back to the load balancer, the load balancer says, oh, that proxy isn't in a good state to handle requests, so I'm going to remove it from the pool. And once it does that, the second thing we can do is watch for all of the pending transactions to flush out of the proxy and complete. Once the proxy is completely idle, perform the upgrade of the software and remove the file that was causing the health check to fail. Then, as the load balancer continues to check that failed host, it'll say, oh, now it's back and working again, and it'll redirect traffic back to that host. And then we rinse and repeat for the other proxies, and perform a rolling upgrade in that way. So it's sort of a clever way to leverage the monitoring infrastructure to do a rolling upgrade in a way that doesn't require any specific knowledge about the load balancer or having to interface directly with that load balancer's API.

So let's take a look at this. I failed to mention when we started that underneath here we actually have a little demo environment, and if we have time, we can go into it later and actually run live queries and stuff. But we have picked out a few of these things that we've found interesting and wanted to highlight for you.
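To make that rolling-upgrade trick concrete before the demo examples: Swift's healthcheck middleware supports exactly this "particular file in a particular location" behavior through its disable_path option. The file path and the HAProxy line below are illustrative assumptions; any load balancer with an HTTP health check works the same way.

```
# /etc/swift/proxy-server.conf (excerpt)
[filter:healthcheck]
use = egg:swift#healthcheck
# If this file exists, GET /healthcheck returns "503 DISABLED BY FILE"
disable_path = /etc/swift/healthcheck_disabled

# Load balancer side (HAProxy, illustrative): poll the URL, don't just ping
#   option httpchk GET /healthcheck

# Rolling upgrade on each proxy node, one at a time:
#   touch /etc/swift/healthcheck_disabled   # LB drains this proxy
#   ...wait for in-flight requests to finish, upgrade Swift...
#   rm /etc/swift/healthcheck_disabled      # LB puts it back in the pool
```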
In this example, we're going to look at audit trails, like I mentioned before. Here I have made a query in Kibana against the logs, and I'm looking at the proxy access log specifically, looking for anything that has an EXE extension on it, and I'm also looking for a delete. And I had uploaded, as you can see here if you read the message down at the bottom, the Swift client 3.0.exe. It was uploaded into the cluster, and I then issued a delete from the Python Swift client, the command line tool, on my laptop, and I deleted that. And you can see the EXE being highlighted, you can see the delete being highlighted, and the blue highlight up above, that is my IP address, so that's the client that the call was originating from. So if someone came to me and said, hey, you know, Swift lost data, okay, what data did you lose? Oh, I lost this Swift client.exe. I would be able to go in here, and having all the different nodes reporting all their Swift logs into Logstash, I could audit the whole thing and say, well, who had that IP address, 109.100? Oh, yeah, that was Martin. Well, clearly you didn't lose data, you deleted it. So that's a really powerful tool. And in our support organization at SwiftStack, because we have so many different clusters that we help our customers manage, we get this on occasion, like, you lost data. And you're like, well, I don't think we did, actually. Let's take a little deeper look into this. And every single time, this has been the solution: just find that thing in the logs and prove that no data was lost; Joe or Mary deleted that thing on this particular day, at this time. So that's really, really powerful, and of course, this is just one example. You can grep or search these logs through Logstash and Elasticsearch, and you can do that for anything. You can just keep on reading the message here. You can look for the agent, like Adam said before. You can look at all the different timestamps, how long it took. You can do all kinds of cool things with it.

I also created, if you're familiar with Kibana, some visualizations in terms of dashboards, and this is one of the examples: object size distribution. Why did I pick that? Because when we go in and we deploy clusters, one of the things people say is, oh, hey, I want to tune this so that it works really well for my use case. Okay, great. What does your object distribution look like? What's the size of the objects, and how many of each would you have, in percentage terms? Usually I get a blank stare back, like, I have no idea. Okay, well, cool, here's what we're going to do. We're going to stand up your cluster, we're going to load stuff into it, and then we're going to try to figure it out. The ELK stack is really good at helping with that kind of stuff, because we can now start looking at what is actually going on and how it's being used, and then we can start tuning the cluster based on that. So that's really helpful. Another one would be distribution of operations over time. This lets you understand a little bit what the workflows look like, when traffic is coming in, and what the distribution of those operations is. Is it predominantly puts? Are they gets? Are they deletes? Most of the time, it's not a whole lot of deletes; generally, it's just mainly a lot of puts going in, and data never stops being put into these systems.
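For reference, the search behind that audit-trail example can be as simple as a Lucene query in the Kibana search bar. The field names here are assumptions: "message" works against the raw log line, and the structured variant only exists if your Logstash pipeline extracts those fields.

```
# Kibana search bar (Lucene syntax), against the raw proxy access log line:
message:*.exe AND message:DELETE

# Or, if your pipeline extracts structured fields (hypothetical names):
method:DELETE AND path:*.exe
```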
So those are some interesting metrics that you can play around with using ELK. Your imagination is really the limit here for what you want to do.

Moving over to Zabbix. Zabbix is kind of a traditional monitoring system. We have some little agent scripts running and sitting on each Swift node. We have written a Zabbix template that is Swift specific, so in addition to the standard Linux monitoring tools we put on these nodes, we also apply this Swift template. And this is just a long list of them. There's a severity associated with each of these, and that's how it triggers alerts. For example, you start losing devices. It's not a big event, actually. When I say device, I mean a disk. It's really not a big event in Swift if you start losing disks; Swift will do this whole self-healing thing behind the scenes. You can just go home if it's Friday afternoon, spend time with your friends and family or whatever, and deal with it on Monday. But at least it's good if you know that it happened. The other thing, to Adam's point earlier, is drive utilization. How much data is being loaded onto these disks, and when do you need to start getting new machines or new disks in? Again, at the bottom of this are all the Swift daemons, and we're looking to make sure that they're up and running, and making sure they're not exceeding a certain threshold in terms of the hours it takes them to complete their cycles. So as you can see, on drive utilization, 50% is kind of a low value, but it's just a warning, and when you start hitting 85%, it's critical. You need to get new machines in fast, or you may start running into a full cluster. This is an agent that we have in a Git repository if you ever want to download it. If you're interested in it, just come up and talk to us afterwards and we can give you the link.

Next, memory usage in Zabbix, a standard Linux metric really. It's going to bump up; it's usually not a problem. Linux will use as much memory as it can, but there are times when just looking at that memory consumption is really useful, and it can help you troubleshoot what's going on with the system. So, a simple thing there. Excuse me. Here's an example of exactly what I was talking about earlier: drive utilization. What I did here was, this is a tiny little cluster of two gig drives, and I just loaded up a ton of data on those two gig drives, and as you can see at the bottom here, I have now reached the threshold of going above 50%, which tells me I may need to take some action. And then I deleted the data, and it's back to just an okay status again. And if you look at the top there, you can see that it's specifically targeting the Paco1SwiftStack.OSS node.

All right, disk I/O. This is important a lot of the time, going back to my example earlier about people wanting to understand what their loads look like and so on. You just need to have enough disk I/O in the cluster, ultimately, to be able to operate at a certain performance level, and if you don't have enough disks, you're not going to have enough I/O. This particular screenshot here is from a SwiftStack controller. It shows you how many IOPS are in the system and how much is being used. At times you can become disk bound, and this is a helpful thing to look at when you're doing benchmarking and things like that, and going, well, I've completely saturated the number of disks I have; in order to get better performance I would need to add more disks.
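Those two drive-utilization thresholds translate into Zabbix trigger expressions along these lines, using the agent's built-in vfs.fs.size item key. The hostname and mount point are placeholders, and this is the classic trigger syntax (Zabbix changed it in 5.4), so adjust for your version.

```
# Warning: a Swift data drive passed 50% used (placeholder host and mount)
{swift-node-1:vfs.fs.size[/srv/node/d1,pused].last()} > 50

# Critical: the same drive passed 85%; time to add capacity fast
{swift-node-1:vfs.fs.size[/srv/node/d1,pused].last()} > 85
```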
And object replicator operations: another metric that's not necessarily showing a problem, unless it continues over time. If, for example, you have two racks of gear and you put in two more racks of gear, you'll start having all this replication of data come flying over from the two original racks to the two new ones, spreading the data equally over all the disks in the system, so you're going to see a lot of replication going across, and that's perfectly normal. However, if you didn't do that, if you didn't do any capacity adjustment at all, then for the replicators to suddenly start moving a lot of data is problematic. It could be that you're looking at disks suddenly going bad. Now, you have many other metrics that will probably tell you that, but that's what's going to happen. We had one customer that had a batch of bad disks, and about 30 of them went out in a week; they just went pop, pop, pop. And what happens? The replicators start moving data around to protect your data. So it's a great metric to take a look at and understand. And with that, Adam, a little Prometheus?

So, as I was mentioning earlier, we wanted to look at Prometheus for doing trending and forecasting, for one particular function that it provides as part of the aggregation engine. That's illustrated here. The first graph on the top that we're looking at is available storage capacity. That is really mirroring exactly what you saw before in Zabbix, in that it's giving you real-time information about what the available capacity is. Now, for the sake of the demo, this is data from last night. We had seven gigs of available storage space, and then we loaded a couple of gigs of data onto the system, so that we could show a big drop in the available space and what that would do to the forecast. Beneath that graph, on the bottom, is where we see the 24-hour forecast. What it's telling us is that if you keep doing what you're doing now, in 24 hours this is what's going to happen. Now, as we all know, that's not necessarily the case. You may not continue doing what you're doing now. In our case, that's exactly what happened: we had this sudden drop in the available capacity because we loaded data, and then we stopped. Some administrator just killed a bunch of disks or something in our system? Yes, some administrator did, that's right, without a service window or anything. So that's another thing that can impact available capacity as well; disk failures are another example. So forecasting can be very helpful to give you early warning of potential problems that you're about to run into, but it is certainly prone to error, as it does make the assumption that things will continue being the way they have been. So as you're looking at the rows from the bottom, we have the 24-hour forecast being recalculated on an hourly basis, and it's basing that on the previous day's operations in this case. But the only thing it has is really a steady state and then that one drop, so we can see how it says, oh, you're going to be fine, you're going to be fine, and then, oh my God, the world is coming to an end. But it's not. So, on to the next slide.
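That 24-hour forecast panel boils down to a single PromQL expression built on predict_linear. The job and mountpoint label values below are assumptions matching the demo's naming, and newer node_exporter versions call the metric node_filesystem_free_bytes, so substitute your own.

```
# Grafana panel query (sketch): free bytes across all Swift drives,
# projected 86400 seconds (24 hours) out via a linear regression
# over the last day of samples.
sum(
  predict_linear(
    node_filesystem_free{job="swiftstack", mountpoint=~"/srv/node.*"}[1d],
    86400
  )
)
```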
So then, logically, the next thing after that is to be able to alert when it looks like we're about to run off a cliff. I couldn't think of a really great way to demonstrate that effectively, but I thought this might be helpful as an example of exactly how you would configure such a thing in Prometheus, using the Alertmanager. So I promise I didn't just do this to be mean or anything, but I'll take you through it really quick.

So we have an alert, and the alert's called storage critical 24 hours. That's telling us, okay, if things keep going like they have been over the last day, then within the next day we're going to be having problems. So how is that defined? I'll skip some of it for a minute and just look at the next function. It's called predict_linear. That's the important one; that's the one that does the linear regression against the previous series of data. So it's doing a predict_linear on node_filesystem_free, which is just like it sounds: free available space on the file system. That's for a job called, in this case, swiftstack, which indicates a node or a group of nodes, so that's the set that we're interested in. And the mount point identifies the drives that we're interested in. In this case it's a regular expression match against /srv/node, and that's going to give us all of the Swift drives; that will always give you all your Swift drives. So we're grabbing that, and in the square brackets you see 1d. That's saying look over the last day; that's the data that we're going to take to make our prediction. And then all of that gets summed, so that's the sum of that prediction across all of the disks that match that regular expression.

If that is less than all this other stuff... all the other stuff is just saying 20% of the available space. Rather than writing that as a constant, I wrote it as a variable, so that in case you add capacity to the system, this alert would automatically float and adjust, and not be locked to that constant value. It's similar to what we saw before; it's node_filesystem_size instead of free, for the same job and same mount points, summed together, and then we just multiply by 0.2, because it's 20%.

For one hour. That is an interesting bit, because one thing that can be really, really troublesome and aggravating about monitoring, particularly when you have alerts configured, is spikes: just random spikes and outliers in the data. So by specifying "for one hour," what I'm saying is that we're going to reevaluate that expression every five minutes. The five minutes isn't shown here; that's just based on the server configuration. When that particular job was defined, I said we're going to have a five-minute evaluation interval. So every five minutes we keep evaluating that, and if it keeps evaluating to true for an entire hour, then we want to trigger the alert. At that point we're sure that this isn't just an anomaly, that this is really real, and we won't bother somebody with something that we think is not true. The labels then just identify the group, being storage admins. So that's like I was saying earlier: we need to know who we need to notify and how to notify them.
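Assembled, the rule Adam just walked through looks roughly like this in Prometheus's current YAML rule format. Treat it as a sketch: the talk predates this format (the older ALERT syntax expressed the same thing), and older node_exporters report node_filesystem_free and node_filesystem_size where newer ones append _bytes.

```yaml
groups:
  - name: swift-capacity
    rules:
      - alert: StorageCritical24Hours
        # Fire when the 24h linear forecast of free space across all
        # Swift drives drops below 20% of their total size...
        expr: |
          sum(predict_linear(node_filesystem_free{job="swiftstack",mountpoint=~"/srv/node.*"}[1d], 86400))
            <
          sum(node_filesystem_size{job="swiftstack",mountpoint=~"/srv/node.*"}) * 0.2
        # ...and only once that has held for a full hour of five-minute
        # evaluations, so a single spike or outlier doesn't page anyone.
        for: 1h
        labels:
          severity: critical
          group: storage-admins
```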
In this case it's a critical alert, so that could either just appear as the word critical in the subject line of an email, or it may correspond to an actual mode of delivery: if alerts are critical, then we want to do SMS, and if they're just warnings, we want to do email or some other method.

And so that brings us to the end. We have a demo that we can do if we... I think we're actually kind of running up against time here. Are we out of time? Or we can answer questions. I think we are, but if there are any quick questions, we'll be happy to take them. You know, we can't see very well with the lights, so we're going to try. But we can also hang out afterwards outside, and if you have any additional questions, feel free to come up to us. We can also set up the demo outside and poke around on it and do all kinds of fun stuff, if you have time. Does anybody have any questions? Or everyone's either a master or they're all completely lost, one of the two. All right, thank you so much.