Okay, so thank you very much for coming. Last session before lunch. My name is Christian Schwede, I'm a principal software engineer working at Red Hat, and today I want to talk to you a little bit about monitoring your Swift cluster health. I've been working on Swift for about two and a half years now, and I've been one of the core members for about a year. Most of the time I'm helping our customers and partners with their Swift deployments and trying to contribute that work back upstream to the Swift community.

So let's get this talk started. All good things come in threes, right? So this talk is separated into three parts. First, we're going to have a look at the Swift architecture, because this is important: you need to know what you want to monitor. So we'll look at what processes are running on your Swift cluster and what your Swift cluster consists of. Next is some basic monitoring. Swift ships with a few tools that are really helpful if you want to ensure that your Swift cluster runs in good shape and your data is stored in a durable way. And then we'll continue with metrics. Metrics are really interesting, especially when you face customer requests, because with metrics you can ensure that the quality you deliver to your customers and users stays good, you can detect bottlenecks early, and so on.

A short overview of Swift itself. Swift is an object storage system. That basically means you don't have a block device or a file system that you mount; all your requests to the storage system are done using HTTP REST API calls, so you basically send PUT or GET requests with your data. For this you need a URL. Say a user wants to store a document: the user sends a PUT request, and this PUT request goes to a specific URL made up of an account name, a container name and an object name. It is sent to the proxy server. Only one proxy server is shown here, but for larger Swift clusters you of course have a lot more proxy servers. The proxy server is the gateway into your Swift cluster: it accepts all the calls from the users, and it forwards these requests to the backend storage servers. If you're running Swift with the default of three replicas, which is most likely at least until Kilo, the request is fanned out across multiple storage nodes. The object is not stored just once, but three times, on different storage nodes and different disks, to ensure high availability and durability of the data. So even if one of these servers fails, or a disk breaks and so on, you can ensure that the data is still there.
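To make the API shape concrete, here is a hedged sketch of such calls; the endpoint, token and names are placeholders, but the URL path really is account, container, object:

```
# Store an object with PUT, then fetch it back with GET
# (endpoint and credentials are examples only)
curl -X PUT -H "X-Auth-Token: $TOKEN" --data-binary @report.pdf \
     https://swift.example.com/v1/AUTH_demo/backups/report.pdf

curl -H "X-Auth-Token: $TOKEN" -o report.pdf \
     https://swift.example.com/v1/AUTH_demo/backups/report.pdf
```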
On the backend servers there are a lot of processes running. I've only shown a single server on each node here, but there are multiple servers running: you have servers for the account databases, for the container databases, and object servers as well. At the same time you also have replicators for all these different kinds of data, you have auditors, and you have updaters running there.

What are these processes doing? Well, the replicators ensure that you have multiple copies of these objects across your cluster. For example, let's say this disk here has a copy of your object, and the disk fails: the replicator will take care of that later and create a new copy, on a replacement disk for example. If you're running Kilo or later, there is a new feature called erasure coding. In that case you don't have replicators, but a reconstructor that rebuilds the missing fragments of your objects.

There are more processes running. The auditors ensure that the data you stored in the past is still readable, and that it still matches the data you stored one or two years ago, or whenever. They do this by reading the objects or the container databases, building a hash, and comparing that hash to a previously calculated hash. In case of a mismatch, the object, for example, is moved away to a quarantine area and replaced later on by the replicators.

So there are a lot of things going on in a Swift cluster. A lot of processes, and of course a lot of network traffic as well, for example from the replicators, and from the proxy servers to the storage nodes and vice versa. You need to monitor all of this. Happily, Swift is quite resilient to errors: even if a complete server fails, your data is still accessible to the user, which is really nice and makes it a very good solution.

What you can also do is group multiple servers into different regions and zones. For example, you can ensure that at least one copy is always stored in a different data center or a different rack. By doing this you get really, really high availability and durability for your data, because even if one of the data centers burns down, you still have a copy available somewhere else. You can also tell the proxy server to store data in a specific region first. This is a feature called write affinity: the proxy server might use, or prefer, a local storage server first, and these servers then replicate the data later on to the other regions and zones. We'll come back to that in a few minutes.

So let's talk a little bit about some basic monitoring. What I always recommend is to start with your existing monitoring tools. Most likely you already have a data center, so you already have some kind of monitoring system in place: Nagios, Icinga, maybe HP OpenView. There are a lot of tools out there, and most customers prefer their own set of tools. So I always recommend starting with them and doing some basic monitoring, like: is a server available, or do we have a disk that is filling up, and things like that.

Swift comes with a few tools included that make this a little simpler. For example, if you want to know whether all your services and servers are available, there's a healthcheck middleware. You can enable this middleware in the pipeline of your servers, and if you then query the server with the specific healthcheck URL at the end, you get a 200 OK back, in case the server is available, of course. Another tool that is really helpful is swift-drive-audit, which parses the log files on your server for errors on your disks. If there's an error on a disk, it unmounts the disk and writes another log entry. Why unmount the disk? If the disk is unmounted, the servers and the replicators already know that something is broken there, so they avoid trying to store data on, or retrieve data from, that disk and just use the other disks instead.
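For reference, a minimal sketch of the healthcheck middleware setup; the filter section is the standard Swift one, while the rest of the pipeline shown here is just an example:

```
# /etc/swift/proxy-server.conf (works the same way on the storage servers)
[pipeline:main]
pipeline = healthcheck cache tempauth proxy-server

[filter:healthcheck]
use = egg:swift#healthcheck
```

Your existing monitoring system can then simply poll each server:

```
curl -i http://proxy.example.com:8080/healthcheck
# HTTP/1.1 200 OK
```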
One of the really useful tools is called swift-recon. Basically it consists of two parts: one part is a middleware inside your servers, and the other one is a command line interface. This tool gathers data from the replicators, the auditors, the updaters and so on. You can send a basic REST API request to the servers and it returns a JSON dictionary that says, for example: okay, my latest replication run took 20 or 30 minutes, or whatever, or it failed. You would need to do this for all your servers, which means a lot of configuration if you use your existing monitoring tools. swift-recon simplifies this a little, because it just parses your ring files, and the ring files tell it: okay, we have 20 servers at these IP addresses and these ports. swift-recon then queries all of these servers and tells you, okay, we have five servers here and two of them are down, or one of them has a really long replication time, and so on.

The next one is a tool called swift-dispersion-report. Again, it consists of two parts. The first part is swift-dispersion-populate, which stores containers and objects across your cluster. Later on, swift-dispersion-report queries these objects and containers, not through the proxy server but by talking directly to the storage servers, and not just one copy of your data but all copies. So swift-dispersion-report is able to detect whether replicas are missing. This happens from time to time, for example when you change the ring: if you add a new disk, remove a disk, change servers and so on, then for a little while only two out of three replicas might be available, and swift-dispersion-report reports that back to you.

If you want to audit a specific account or container, there's also a tool available for this. It basically lists all the containers inside your account, as well as the objects, and checks whether they are available.
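A hedged sketch of how these tools are typically run; the swift-recon flags are the standard ones, while the dispersion credentials are placeholders:

```
swift-recon object -r   # last replication run per object server
swift-recon -d          # disk usage across the ring
swift-recon -u          # unmounted drives
swift-recon --all       # everything the recon middleware exposes
```

```
# /etc/swift/dispersion.conf (credentials are examples)
[dispersion]
auth_url = http://proxy.example.com:8080/auth/v1.0
auth_user = test:tester
auth_key = testing
```

Then populate once and run the report regularly, for example from cron:

```
swift-dispersion-populate
swift-dispersion-report
```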
Okay, so we have some basic checks available to tell whether Swift is running. But knowing that Swift is running is not enough to ensure that your customers are happy. Maybe a customer calls you on a Sunday night and tells you: you know what, my bandwidth for incoming object writes is like one megabyte per second now, and I'm not able to store any backups any longer because they will never finish. In that case you need some kind of metrics. You need metrics from the past, because you probably want to compare them to your current metrics, and you need metrics from right now to see whether there's a problem at the moment, and where the problem is. So you start collecting metrics. Metrics are fine, but we humans are not really good at analyzing metrics by just looking at the raw numbers. From numbers like these you won't see much: maybe you can tell there's a timestamp in one column, and the other column could be whatever it is. Nothing obvious jumps out at you.

So what we really want is not just the metrics themselves, we want a visualization of these metrics. If you visualize your data, you immediately see, for example, some spikes here, a spike down there (this data actually comes from this region), and you immediately see that there's a problem, or maybe not a problem, but at least an occurrence in your cluster, and then you can start looking into it further.

So how do I get this visualization? Well, Swift is able to send metrics to a StatsD server. StatsD is a project that was developed by Etsy a while ago: a very small server that collects metrics, aggregates these metrics, and can do something else with them. What we're doing here is sending these metrics from StatsD on to Graphite, and Graphite actually builds these nice graphs for you. Graphite itself consists of multiple parts. The tool that accepts the metrics from StatsD, for example, is called carbon-cache. Then there are database files stored on the server in a specific format called Whisper. A Whisper database is a fixed-size database with a kind of round-robin layout, so you can store very fine-grained data for a short period of time and then tell carbon-cache: okay, for data from two months ago I'm not really interested in one-second resolution, maybe I'm only interested in a five-minute resolution or something similar, to limit the amount of data I store in the long run. Finally, Graphite-web retrieves the data from the Whisper files and allows you to visualize the data and build graphs.

There's another tool here, collectd. collectd basically collects the metrics that are not collected by Swift itself: for example CPU usage, memory usage, the free space on your disks and so on. There are a lot of plugins available for collectd, so you can collect a huge number of different metrics, whatever you're interested in. You can even write your own parsers and add them as plugins, and this data is then also sent to carbon-cache.

One nice thing here is that the communication from the Swift object server, the replicators, or any other Swift process is done using UDP datagrams. That basically means that if the StatsD instance is down for whatever reason, you're not blocking Swift and you're not blocking the operations your users are doing. Of course you lose metrics then, because you can't collect them anymore, but at least the Swift cluster itself keeps working without any interruption, even if the backend that stores these metrics is not available. Also, if you have a lot of servers, you can scale this out, so you don't have just a single StatsD instance but multiple ones, which aggregate the data later on.
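To illustrate why a dead StatsD never hurts the request path, here is a minimal Python sketch of the fire-and-forget pattern; the wire format ("name:value|type") is StatsD's real line protocol, while the metric name and addresses are examples:

```python
import socket

def send_statsd(metric, value, type_code, host="127.0.0.1", port=8125):
    # StatsD line protocol: "name:value|type" (ms = timer, c = counter)
    payload = ("%s:%s|%s" % (metric, value, type_code)).encode("utf-8")
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # UDP: no connection, no acknowledgement, so this never blocks
        # the request even if nothing is listening on the other side
        sock.sendto(payload, (host, port))
    except socket.error:
        pass  # metrics are best effort; losing one sample is acceptable

# e.g. report a 28 ms GET timing under a per-node prefix
send_statsd("node1.proxy-server.object.GET.200.timing", 28, "ms")
```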
So, installation and configuration of these tools. In the past this was actually a little difficult, especially StatsD: StatsD is written in Node.js, so you have a lot of dependencies there. Graphite itself is written as a Django application, so you have more dependencies there, and it was all a little cumbersome, at least. Now, I'm with Red Hat, and we have a community distribution called Fedora, and as of Fedora 21 everything you need is already packaged for you. You only need to install the packages on the left side and you have everything you need to run a Graphite and StatsD instance, which makes it really simple to start playing around with this and to see whether it's something you might like and might add to your data center. The first packages, StatsD, python-carbon and graphite-web, run only on this monitoring instance, not on your Swift nodes. Only collectd should run on the Swift nodes, because you want to collect data from the Swift nodes themselves.

There are a few configuration files; all of them are sketched after this section. In Swift itself, it's the proxy-server, account-server, object-server and container-server configuration files. There's an example in the sample configuration file; it basically just tells Swift: okay, you have a StatsD server somewhere at this IP address, send the data there. There are two or three more settings. One is the prefix, and the prefix is quite useful: I always tell our customers to use the node name as the prefix, because then you can later on separate the metrics coming from different hosts. Otherwise you have a huge set of metrics but you can't see which hosts they're coming from. For collectd it's also quite simple: you enable the plugin that sends the data to Graphite. For StatsD it's the same: you tell StatsD, here's my Graphite instance, send the data there. And then there are two more files, storage-schemas.conf and storage-aggregation.conf, to configure Carbon. You really should have a look at these, because Carbon aggregates data later on and only stores data for specific periods.

What does that mean? Let's have a look at a very simple example. Let's assume you have a 10-second period and you're collecting two metrics. One metric is always at the same level, always a seven; that's the green one here. The other one, the red or orange one, is a different metric. By default all your metrics are averaged, and in this case that means both metrics have the same average value: the red one totals 21 over three samples, the green one is five times seven, and the average of both is seven. So if you just average this data, it's not really useful to you. For example, if you're interested in the number of replicated partitions in your Swift cluster, there's a big difference between one run that handled seven partitions and 10,000 runs that each handled seven, but after averaging you can't see that anymore. So you need to take care of the aggregation method. In this case, for partitions or for any counted values, you want to sum the data, because summed up, the green one makes 35 and the red one only 21. There's a huge difference.

Also, if you're running slightly older versions of StatsD and Graphite, only the latest sample is kept. If, for example, Graphite is configured with a retention resolution of 20 seconds and you send samples every five seconds, only the latest sample within each period gets stored in the end. So you have to make sure that the retention resolution is not coarser than the interval at which you're sending data.
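As promised, hedged sketches of those configuration files. The option names are the real ones for each tool; the hosts, ports, prefixes, patterns and retention values are examples, and file paths may differ between distributions:

```
# /etc/swift/proxy-server.conf (the same options go into the account-,
# container- and object-server configs on the storage nodes)
[DEFAULT]
log_statsd_host = 192.168.1.10
log_statsd_port = 8125
log_statsd_metric_prefix = node1        # node name, so hosts stay separable
log_statsd_default_sample_rate = 1.0    # lower this on very busy clusters
```

```
# /etc/collectd.conf -- forward collectd's metrics to Carbon
LoadPlugin write_graphite
<Plugin write_graphite>
  <Node "graphite">
    Host "192.168.1.10"
    Port "2003"
    Protocol "tcp"
  </Node>
</Plugin>
```

```
// StatsD config.js -- point StatsD at Graphite
{
  graphiteHost: "192.168.1.10",
  graphitePort: 2003,
  port: 8125
}
```

```
# /etc/carbon/storage-schemas.conf -- 10-second resolution for a week,
# then five-minute resolution for two years
[swift]
pattern = ^node
retentions = 10s:7d,5m:2y

# /etc/carbon/storage-aggregation.conf -- sum counted values when
# rolling up, instead of the default averaging discussed above
[counts]
pattern = \.(async_pendings|replication\.partition)
xFilesFactor = 0
aggregationMethod = sum
```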
Okay, so hopefully you then have a Graphite-web installation up and running. If you open it in your web browser, this is what it might look like. On the left side you have a lot of different metrics; in this case I used my node names as prefixes, so I have multiple storage servers here, as you can see. Below each of them you have a subtree with all the Swift processes running there. You can simply select a metric and it will show up as a graph on the right side. What you can do then is apply transformations to these metrics; in this case it's a moving average over the last 10 samples. There are a lot more, there's very extensive documentation available, and this tool is really handy for playing around and seeing what kind of data you can get out of it.

What is even more useful, once you have an idea of which metrics you want to watch, is the part called the dashboard. With the dashboard you can arrange multiple graphs on a single page. And not only that: you can also merge multiple graphs into a single graph. For example, if you're interested in the number of quarantined containers, let's say, on different storage nodes, you can just drag and drop these graphs on top of each other and create graphs with multiple lines, multiple metrics, in a single graph. There's also a small editor included in Graphite, and what you have there is basically a JSON dictionary: that's the configuration of the dashboard. This makes it really handy if you want to exchange dashboards with your colleagues, or store the configuration in a repository, or whatever else. If you look into the source code of this tool, you can even just fetch and store this data with GET and PUT requests using curl, for example, and keep the dashboards lying around as files. You can also store multiple dashboards, and I think this is really useful: by default I create a dashboard with a general overview, then one with an overview of the storage servers, one for the proxy servers, and even more dashboards going down to each single server in my cluster. Doing this, you have all of your data quickly available later on, because if you face a problem, you don't want to start building dashboards at that point. You want to do that in advance.
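To give a feel for the transformations and the API, a few hedged examples; movingAverage, sumSeries and the render endpoint are standard Graphite, while the metric paths are made up for illustration:

```
# smooth a noisy timing series over the last 10 samples
movingAverage(node1.proxy-server.object.GET.200.timing, 10)

# merge the quarantine counters of several nodes into one line
sumSeries(node*.object-auditor.quarantines)
```

Every graph is just a URL, so the same data can be fetched as JSON and fed into other tools:

```
curl 'http://graphite.example.com/render?target=node1.proxy-server.object.GET.200.timing&from=-6h&format=json'
```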
So, that said, let's have a look at a few selected metrics. First we have the proxy traffic here. As you can see in the lower left corner, you have GET and PUT requests, and the same on the right side; these are aggregated values. On the scale on the left side these are megabytes per second, and the red line and the pink line here are the traffic. And here are some milliseconds, which is the response time to the user on the proxy server. It's quite stable; the data is not fluctuating very much. What you can see here is a slight increase in the response time at the beginning of the graph, but it levels out at the end, so it's quite stable in this area. In this case the reason was that some heavy replication was still running for a lot of smaller objects, and some of the servers were a little busy: the servers were busy in this area, not that busy in that area, so the response time was higher here.

What you're probably also interested in is the number of errors that your users are seeing. Most likely these are 404 errors, because your users, or maybe an application, are trying to access an object or a container that is not available at the moment. What you might look out for are errors like 500 errors, anything beginning with a five, because these might at least be a problem on your cluster. If you see a lot of 500 errors, or 500 errors ramping up, you want to have a look at those.

Here we have the average CPU usage on one storage node. As you can see, there's a kind of periodic interval: at 6:05, and then again at 6:35, there's an increase in CPU usage. In this case the storage servers were configured to start replication processes every 30 minutes, so when the replication kicks in, there's a lot more traffic and a lot more CPU usage on your servers, and that's why the load is increased here. What you want to look out for is, in this case, the red line: the CPU wait time. As a general rule of thumb, I think if it's a single-digit value, below 10, that's okay, that's normal. If it increases a lot and you have a lot of CPU wait time, it's mostly because your container database servers are somehow overloaded, probably because you're running the container databases on spinning disks rather than SSDs. That's a problem. So you want to avoid a high CPU wait time here, and if you see one, you should definitely have a look at what is going on on those nodes.

Okay, the cluster I used for this is a little smaller. You can see here the free disk space on a few selected nodes, and as you can see on the left side of the graph, there are three disks, and the free disk space is actually quite low. So I added more disks to this small cluster. First a new disk here, and what you can see is that replication kicks in: the free disk space on the existing disks is increasing, and as you would expect, the new disk starts using more space. That's fine, that is what you would expect. But then another disk was added, the brown line here. You would expect all the other lines to go up as well, but that doesn't happen. If you look at this line, the new disk was added and the usage on this disk even dropped further. The reason for this was that the ring was changed again, and a different storage server was changed in the ring as well, so you had multiple partitions moving at the same time, because the changes were applied at the same time. Only after a while did the free disk space on this disk also increase. So you probably want to avoid operations like this, where you apply multiple changes to a ring and run them at the same time.

Talking about replication: these are the numbers of replicated partitions on a cluster. As you can see, it's quite periodic. Take the green one: it runs for some time, then stops, and 30 minutes later it starts again, runs for some time, and so on. Same for the brown one, the red one, and so on. What is interesting here is the blue line: it runs much faster, maybe a minute or 30 seconds, so it looks like a very high peak. The reason for this is that this cluster was configured to use write affinity, with settings roughly like the sketch below.
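These are the real proxy-server options for affinity; the region names and the node count shown are examples only:

```
# /etc/swift/proxy-server.conf -- prefer the local region
[app:proxy-server]
use = egg:swift#proxy
sorting_method = affinity
read_affinity = r1=100                    # read from region 1 first
write_affinity = r1                       # write to region 1 first
write_affinity_node_count = 2 * replicas  # how many local nodes to try
```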
What does write affinity mean? The proxy server sends new incoming data to all the other storage nodes first. These other storage nodes store the data, and later on at least one copy of the data is replicated from them back to storage server number one. So storage server number one is not replicating any data out to all the other nodes; the other nodes are always replicating data back to storage server number one. And this is what you can see here.

Sometimes your storage servers are busy, which is normal, and in that case the proxy server selects another node. These are so-called handoff nodes, and you see handoff requests. Seeing handoff requests is quite normal; what you should look out for are spikes like these. In this case it's quite easy, at least on smaller clusters, to provoke something like this: for example, by storing a lot of zero-byte objects while your container databases live on spinning disks. In that case the object servers don't have much to do, but your container servers need to apply a lot of small updates, a lot of small entries, and then they might get stuck, and you see an increase in the handoff requests.

At the same time you'll see so-called async pending files appearing on your cluster. An async pending file is written when the object server wants to update a container database entry and that container database is currently locked, because another process is updating the same database. The object server then stores an async pending file, which is later picked up by an updater, and this not-yet-applied update is then applied to the container database. So if you have a smaller cluster and you're sending a lot of zero-byte objects at the same time, you will see the async pending files increase a lot, which is okay at that point in time. What should happen then is that the number of async pending files lying around on your disks drops again later on. You shouldn't see these lines increasing all the time; if they're somewhat stable, that's okay, but you always want to see them either stable or going down again later.

And of course you don't want to look at all these graphs one at a time; you want to build yourself some dashboards. This is a dashboard I used in the past as a general entry point, a basic overview of what is going on in the cluster. I was interested in the average CPU usage, and as you can see here, the wait time is increasing a little, which should be avoided. I'm always interested in the proxy traffic, because if the proxy traffic shows drops or heavy spikes, then I know one of the users is doing something that probably wasn't done in the past, or something else is going on; and if it drops a lot, maybe there's an issue with the network connectivity. The proxy timing in this case is also interesting, so I put that onto the overview as well, together with some other metrics like the free disk space, error rates and so on.

Okay, so I think I'm done for now. I will upload the slides to the website, and at the end of the slides there are some example configuration files. I don't want to go into detail during the talk, but there you'll find the few configuration examples you need to build your own Graphite instance so you can start doing this. So thank you very much.
And if you have any questions, please feel free to ask. And please use the mic. No, Clay, you're not allowed. Then I'm gonna jump in.

So I do have one question. When you were sketching out the block diagrams where you had the StatsD and the carbon-cache and all that kind of stuff, how do you deploy that? Are you doing a centralized StatsD listener, or do you run that on local machines and then coordinate across them? What's your pattern for doing that?

Well, it really depends on the size of the cluster, because if you have a lot of StatsD requests coming in, you probably want to have multiple StatsD instances running, right? What I did in the past was to separate the StatsD and Graphite servers onto completely separate machines, so they don't have anything to do with the storage or proxy server nodes. They ran on separate bare-metal machines to have enough performance. The last time I did this it was only one machine, but it was a very beefy machine, so it was fine.

What kind of ratio do you look for? Like one machine per 10 storage nodes?

No, it was like 24. And what you can actually do is tell Swift not to send a metric on every request: you can tell Swift in the configuration, okay, I only want a metric sent out for every 10th request, for example, to lower the number of metrics coming in to StatsD.

Yes, we also use StatsD for metrics collection. Well, obviously we use StatsD for the emission and Graphite for the collection, but we don't use the Graphite dashboards directly. You alluded to the fact that it has the API, so you can collect the metrics and then put them into different kinds of tools to organize your graphs. But I think for somebody getting started with Swift, we say you should do StatsD and Graphite, but we don't necessarily have anything we can turn over as a starting point. And I don't really know much about the Graphite ecosystem, and I was wondering: is there any way that, as a community, we could collaborate on some of these dashboards and share graphs? Is there a Chef cookbook for Swift that would include a base set of graphs to get you started? And then even if you decide to layer something on top of that, that's not just the Graphite dashboards, but something that presents them to your operations team or something. Do you think that might work?

Yes, definitely. I think that's a really good idea, and I was thinking about the same thing for one of the Friday sessions, to talk a little bit about this. It's even simpler because the dashboard configurations can be exchanged as a JSON dictionary, which is really easy to read, easy to ship, and easy to store somewhere else. We should definitely look into this, I think. And to add to what Clay just said: if you don't want to use the Graphite dashboard, each of these small graphs is basically a URL, so if you have a web page for your internal monitoring, you can also embed these graphs there if you want to.

Any other questions? Clay has one more, I think.

Oh, well, do you know if any other projects are doing that? If I'm running Nova, does anybody have, like: hey man, you should check out the Nova Graphite thing over here to get started?

I don't know about the other projects. Does anybody else know?
I'll reply to that one. All right. So at Catalyst we are using Graphite and also Grafana as our dashboards, and we've published sample dashboards for Ceph on GitHub, based on pretty much the same architecture, the same structure, collectd. So it's definitely achievable, and we would be very happy to collaborate with other people on doing the same for Swift, because we're implementing Swift for object storage. So, yeah.

Awesome. Sounds great.

(Some off-mic discussion.) So just to repeat this for the mic: the idea is that we have more and more data, you have Kibana data, you have Graphite data and so on, and we should really collaborate in the community to exchange a common set of dashboards, because everybody probably wants to run some kind of dashboard, and to have a good starting point for the metrics you're collecting, because a lot of the time we're probably doing the same work.

Yeah, please. Could you please use the mic? That would be great.

So with your stats, do you ever look at trends and history and do anything with that?

Yes, personally I do. I didn't go into detail in this talk, but there are some tools included in Graphite to monitor trends. For example, let's look at this line: it looks quite stable, right? But in the long run it might be that one week ago it was not 80 megabytes per second but maybe only 60, and next week it's 100. You don't see that if you just look at a very short time frame, and that's why I normally add trends as well, so that I can see whether in the long run it's going up or going down. Where it really comes in handy is the free disk space, because then you can somehow predict when your disks will run out of space, and you know: okay, I need to order new disks within the next six weeks or so. That's where it's really helpful, and the same goes for a lot of other metrics. With Graphite itself you only get graphs, but because all the data behind these graphs is available as a JSON dictionary, you can just fetch the data and then parse it in other, external tools if you want to.

Hi, my name is Sridhar, I work for a company called Vedams. We collect the StatsD metrics and run some machine learning algorithms on them to try to figure out whether some services are down or disks are bad. It's a self-healing system, so it automatically steps in and the system keeps working, but maybe with less performance. So if an IT admin is interested in knowing right away that services have failed or disks are bad, we try to use StatsD to figure out whether some hardware or some services are bad. We have open-sourced that; it's on GitHub, it's called Tulsi, T-U-L-S-I. So if people are interested, just take a look at it. Thank you.

Hello, it's a bit related to this project, I think. Do you have some kind of tool to alert when something strange happens? I mean, if 500 errors suddenly increase?
No, at least not a ready-made tool in one of my repositories. But what I normally do, if there are existing monitoring systems in the data center, is build scripts or configurations around them that send an alert to someone if, for example, there have been more than 10 or 100 500-errors in the last minute, so that you get a notice: okay, something really bad is going on right now. But it's not included by default.

Any more questions?

How have you found the Graphite language, the filters and expressions? For example, you have one on free disk space, and that graph gets cumbersome in a big cluster. One of the things we try to do is apply some filters to wrap some of this up, so that on this storage node you see the aggregate storage, but also a line that shows you the most-full disk, so you can watch that. You don't have to have 6,000 lines.

No, definitely not 6,000 lines; maybe down to 100 nodes. Especially for disks, which is a good example, there are some filters like: show me only the highest metrics, or the lowest metrics. So if you have, let's say, 8,000 disks in your cluster, maybe you're only interested in the 20 disks that are running out of disk space right now, and something like that. Or you might even want to group these metrics into different regions or zones, so that you know: in this rack I have three disks running out of space, that rack is fine, and the third rack is completely full, or whatever. So yes, it can be overwhelming at the beginning, I fully agree, but I think you get used to it, especially if you go through the documentation. And once you notice that you can apply multiple operations to the same metrics, you just need to play around with it a little. Definitely.

You did this all the time, right? Yeah, I knew it, but it really feels like something we can collaborate on. Yeah, definitely. It's something I've had to string together to get it to what I wanted. Oh yeah, that's true. All right, Christian's next talk, in Tokyo, will be on Grafana. Yeah. Yeah, thank you.
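To make the filter and trend ideas from this discussion concrete, a few hedged expressions; lowestCurrent, minSeries and timeShift are real Graphite functions, while the metric paths assume collectd's df plugin and the node-name prefix used earlier:

```
# only the 20 disks with the least free space, out of thousands
lowestCurrent(collectd.node*.df-sd*.df_complex-free, 20)

# one summary line per node: its most-full (least-free) disk
minSeries(collectd.node1.df-sd*.df_complex-free)

# overlay proxy traffic with the same series shifted back one week
node1.proxy-server.object.GET.200.xfer
timeShift(node1.proxy-server.object.GET.200.xfer, "7d")
```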