Hello. Hi. Welcome to the last in a series of talks we've given at the summit on visibility into what's going on in your OpenStack cluster. As we've mentioned in the last two talks, to diagnose problems, or even just to operate your cluster efficiently, it's very often important to know what's going on inside it. Without that, as you might have noticed if you've been an operator, it's extremely hard to figure things out from just the Ceilometer data you're seeing or all the logs you're collecting. So the last two talks were on network visibility. This last talk is on storage visibility and how we can improve storage performance, and in this particular talk we are going to go deeper into Ceph. And here is my awesome team. We've done all three pieces of work that we've presented together, and we have a lot of things in the pipeline. Thanks for all your feedback on the previous talks, and we hope you'll give us some more in this one. So the obvious question is: why do we need storage visibility? If you look at OpenStack as a layer cake, you have the OpenStack APIs, and you have the Cinder, Swift, and Glance APIs. But suppose you're running a big data or Hadoop job and your HDFS node fails. How do you know whether it's a network error, a storage error, or maybe a disk drive that has physically failed? You have no way of knowing unless you have visibility into the back end of the storage layer. And now I'll hand it over to Mark, who'll talk to you about the back end we chose, which is Ceph, how we can get more visibility into it, and, what's more important than visibility alone, how we can use the insights we get out of our analytics to optimize Ceph even further.

Thank you, Debo. So what do we have at the last layer? We have the storage jungle. This is a mysterious land.
Only the very few experts trained in the art of storage can navigate it, and even they sometimes find it very, very hard to find what they're looking for quickly. For example, the other day we were running a storage experiment and one of our nodes failed. That's a picture of us after debugging for two hours with no success. This is why we need storage visibility. What can we do with it? The possibilities are endless. For example, once we have the data, we can calculate in real time the optimal distribution for our objects. Also, as has been mentioned, we can perform failure detection: if something goes wrong and we have real-time data, we will know right away. A very cool feature is that we can expose the underlying configuration and make it easier for the system administrator to change. Any slight change in this configuration will show up as different data, so you can spot any performance change right away. This will allow you to tune your cluster. So OK, you can do very cool stuff, but we need a use case to prove it. We decided to get visibility into Ceph. Just one question: how many of you know how Ceph works? Can you raise your hand? OK, not bad. How many of you use it as part of OpenStack? Not so many. OK, so I'm going to spend two slides on a very, very quick introduction to Ceph. In this slide, you can see the Ceph architecture diagram. At the very bottom we have RADOS; anything you do with Ceph will get stored as an object by RADOS. On top of it we have librados, which allows us to interact with RADOS from the application level. Then there's the RADOS Gateway, the one on the left, the gray one, which is a REST gateway compatible with Swift and S3. We also have RBD, which is block storage, useful for Cinder when we're talking about OpenStack. Finally, we have CephFS, the Ceph file system, which is a distributed file system. So now that we know the architecture of Ceph, let's see how Ceph stores an object.
For this, we need to introduce three very simple concepts. First, the OSD, which stands for Object Storage Device; this is essentially the physical node where your object is going to be stored. Then we have placement groups, or PGs, each of which is a collection of objects. And finally we have pools, which are collections of placement groups. So let's say we want to store a file called foo in a pool called bar. First, Ceph will look up the pool and retrieve the number of PGs in that pool. With this number and a hash of the object name, it applies the modulo operation, and the result is the PG where the object will be stored. Then Ceph retrieves the CRUSH rule for this PG, and with that CRUSH rule it gets the list of OSDs where the object will be saved. The object is written to the first OSD, and that OSD takes care of the replication. So we write it one time, and then it gets replicated. Nowadays there are several good Ceph dashboards that provide visibility and work well, but we wanted to create one that's tightly integrated inside Horizon. For that, we decided we wanted this kind of visibility. First of all, we want to be able to monitor our Ceph cluster. What's going on? What's the OSD status in real time? How is any object distributed across our Ceph cluster? We want to know that. We also want to be able to see which CRUSH rules we are using at any given time, and also to be able to add new ones, edit them, or remove them. To do that, we created the Ceph Horizon panel, and now my co-worker Kai is going to demo it.

Thank you, Mark. Hello, everyone. Let me introduce our dashboard to you. There are four tabs in our dashboard, as you can see. Now let's look at the first tab. The overview tab has some basic information about our cluster, including the health, the space usage, and some information about the monitors, the OSDs, and the metadata servers. Let's scroll down a little bit.
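The placement logic described above (object name hashed modulo the pool's PG count, then a CRUSH rule mapping the PG to an ordered list of OSDs, with the first OSD acting as primary) can be sketched in a few lines of Python. This is a simplified toy model for illustration only: real Ceph uses a stable rjenkins hash and the actual CRUSH algorithm, and the pool, PG count, and OSD lists here are made-up assumptions.

```python
import hashlib

# Toy model of Ceph's object placement (NOT the real rjenkins hash / CRUSH).
POOLS = {"bar": {"pg_num": 8}}  # hypothetical pool with 8 placement groups

# Stand-in for CRUSH output: for each PG, an ordered list of OSD ids
# (the first entry plays the role of the primary OSD).
PG_TO_OSDS = {pg: [(pg + i) % 5 for i in range(3)] for pg in range(8)}

def object_to_pg(pool: str, name: str) -> int:
    """Hash the object name and take it modulo the pool's PG count."""
    h = int.from_bytes(hashlib.md5(name.encode()).digest()[:4], "little")
    return h % POOLS[pool]["pg_num"]

def locate(pool: str, name: str):
    """Return (pg, osds): the write goes to osds[0], which replicates to the rest."""
    pg = object_to_pg(pool, name)
    return pg, PG_TO_OSDS[pg]

pg, osds = locate("bar", "foo")
print(f"object 'foo' -> PG {pg} -> OSDs {osds} (primary: osd.{osds[0]})")
```

The key property this mimics is that placement is computed, not looked up: any client that knows the pool's PG count and the PG-to-OSD map can find an object without asking a central metadata server.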
And here it is: this graph represents the historical and real-time I/O data. We have several metrics, and you can simply click on one, and the graph updates automatically. You can see the read and write operations and the read and write bytes going to the cluster, and you can see the history of space usage. This graph can be really useful for detecting failures. For example, you wake up in the morning and find that your cluster is not working. So what happened last night? You simply look at our dashboard and see the exact data, like this. You can even export it as a picture and attach it to your report. Let's go to the second tab. The second tab represents the topology of our cluster. In this example, it presents the physical layout of the cluster, including the node information for each OSD: you can see the CPU, disk, and memory usage and some status information. When you create a new OSD, it will show up here, and green means it's up and running. And let's check out the watch feature. Say we trigger a benchmark here; it's running from a VM in OpenStack. OK, now on every OSD you can see a donut chart. It represents the percentage of the overall load, the requests, going to that OSD. What's more, you can see the heat of each OSD in terms of the number of requests, just like this. You can even click on an OSD and see its detailed information. Let's go to the third tab. This tab lists the CRUSH rules in use in the cluster. You can create a CRUSH rule called tempo1, and it will use the default bucket. A CRUSH rule can be used to determine how your objects are going to be distributed across the cluster. For example, one of the copies is placed on an SSD, and the other two copies are placed on normal OSDs. Then you can use that CRUSH rule. OK, let's go to the fourth tab.
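The "one copy on SSD, two copies on normal OSDs" rule mentioned above can be illustrated with a small Python toy. This is only a sketch of what such a rule achieves: real Ceph evaluates CRUSH rules against the cluster map, and the device names, bucket contents, and hashing scheme below are invented for the example.

```python
import hashlib

# Hypothetical device buckets standing in for CRUSH's device-class buckets.
DEVICES = {
    "ssd": ["osd.0", "osd.1"],
    "hdd": ["osd.2", "osd.3", "osd.4", "osd.5"],
}

def pick(bucket, pg, n):
    """Deterministically pick n distinct devices from a bucket for a given PG."""
    ranked = sorted(bucket, key=lambda d: hashlib.md5(f"{pg}:{d}".encode()).hexdigest())
    return ranked[:n]

def place(pg):
    # Mimic the example rule: first replica on an SSD, two more on HDDs.
    return pick(DEVICES["ssd"], pg, 1) + pick(DEVICES["hdd"], pg, 2)

for pg in range(3):
    print(f"PG {pg} -> {place(pg)}")
```

The point of the sketch is that the rule, not an administrator, decides the device class of each replica, so changing the rule immediately changes where new data lands.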
As we know, when we use Ceph as the backend for Cinder, for example, your volume will consist of thousands of objects distributed across several OSDs. On the placement group tab, you can see what the distribution looks like: we have 44 objects in placement group 3.2e. Let's scroll down a little and see. The second graph shows the object distribution across the OSDs, and you can see that OSD6 is acting as the primary OSD for 18% of the objects. Let's go back to the slides. We've now seen how the panel works and what kind of information we can get from it. So is it useful? Our answer is yes, but our work doesn't stop there. Let's imagine there is a bottleneck in our cluster. We go back to the panel and run a compute job again, and what we see is something like this picture: the load on OSD6 is abnormally high. So we try to reduce the weight of OSD6, and after all the placement groups come back to the active+clean state, we run the benchmark again. This time, we see a much better load balance. So far, we have seen how we gain visibility into Ceph and how we present it in a panel in Horizon. It's now time to see what we can do with this visibility in terms of optimization. First, we can do performance tuning directly. If you have a high-performance OSD, you may want to increase its weight a little so that it can take more responsibility, and with the help of our dashboard, you can see how it reacts to the workload in real time. In the following few weeks, we are going to add a benchmark tab in which you can similarly trigger a 10-second or 20-second benchmark so that you can analyze the performance in terms of bandwidth, latency, and IOPS. And in the future, we plan to have all these analytics take place in a completely automatic way. So finally, let's summarize what we can do with storage visibility.
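The reweighting step described above (lowering an overloaded OSD's weight so that objects migrate away from it) can be simulated with a weighted rendezvous hash standing in for CRUSH. This is a toy simulation, not Ceph's algorithm: the OSD names, weights, and object counts are made up, and in a real cluster you would adjust the weight with the `ceph osd reweight` command instead.

```python
import hashlib
import math
from collections import Counter

def score(osd, obj, weight):
    """Weighted rendezvous score: higher weight -> higher chance of winning."""
    h = int.from_bytes(hashlib.md5(f"{osd}:{obj}".encode()).digest()[:8], "big")
    u = (h + 0.5) / 2**64          # map the hash to (0, 1)
    return -weight / math.log(u)

def primary(obj, weights):
    """The OSD with the best score acts as the primary for this object."""
    return max(weights, key=lambda osd: score(osd, obj, weights[osd]))

weights = {f"osd.{i}": 1.0 for i in range(6)}   # six equally weighted OSDs
objs = [f"obj{i}" for i in range(2000)]

before = Counter(primary(o, weights) for o in objs)
weights["osd.5"] = 0.5                          # reduce the hot OSD's weight
after = Counter(primary(o, weights) for o in objs)

print("osd.5 primaries before:", before["osd.5"], "after:", after["osd.5"])
```

Because lowering a weight can only lower that OSD's score, objects only ever move away from it, which mirrors the behavior seen in the demo: after reweighting, OSD6's share of primaries drops and the load evens out.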
First, we can reduce downtime by detecting potential problems as soon as they occur, or even before they occur. We can use this visibility to tune our system, see immediate results, and check whether a configuration is good or not. What's more, we can provide real-time and historical data for future prediction. The most important part is that there are plenty of insights waiting for us to find. We really want to hear what the community thinks and what kind of insights the community would be interested in. So please scan this QR code and provide some feedback. That's all. Thank you.