All right, let's get started on time. Sorry for stealing your thunder; I'm just going to introduce myself then. Yeah, welcome everyone. My name is Lenz Grimmer. I'm the component lead for the Ceph Manager Dashboard, the project I'm talking about, within the Ceph project. I also happen to be the person in charge of supporting the team that works on this project at SUSE. I have been speaking about this topic in slight variations at various FOSDEMs before. Previously, I mostly spoke about the project where this work originated, which was the openATTIC storage management project. At first, I think I had a talk about openATTIC for storage and Ceph management, then it was openATTIC for Ceph management, and now our project has evolved in the sense that it has, well, not really been ported, but we have basically taken the experiences from openATTIC and our key learnings upstream, and we are now working on a management tool called the Ceph Manager Dashboard that is part of the upstream Ceph project. For a few releases now, Ceph has shipped with a web UI for management and monitoring out of the box.

The topic of this talk is to give you an update on the progress and the current state of the work, a bit of history and background on where we are coming from and the evolutionary steps the dashboard has gone through, and, if the demo gods are kind to me, we may end this talk with a live demo on this measly little laptop. It's not going to be that impressive in terms of the number of nodes and OSDs and everything, but I hope it gives you a glimpse and an impression of what to expect in the upcoming Nautilus release.

Going back in history, the first dashboard that Ceph ever shipped was included in the Ceph Luminous release. That is the predecessor we are building on. As you can tell from the name, it was supposed to be, well, a dashboard: a web page that you can put up in a browser somewhere, maybe in your data center, that shows you the basic stats of your Ceph cluster at a glance. Some of the features it provided are listed on the slide; I'm not going to read them all out, but the key point to keep in mind is that it was read-only. There were no options to modify your cluster or make changes to it. It had no notion of a user login system, and it was not even encrypted, because you couldn't make any changes anyway. In its whole architecture and structure, and also in the JavaScript framework it used for rendering the front end, it was fairly simple, but good enough for the scope and the intention it was supposed to serve.

But even after the Luminous release, people submitted patches and contributed features to add more functionality to it, and there was this desire for more than just a read-only dashboard. Almost a year ago by now, we approached the Ceph project and basically introduced ourselves: hey, we have this thing called openATTIC, a standalone open source project that happens to do Ceph management fairly well. How about we just contribute our work upstream and see where this ends? And this is basically what happened within the last year. We proceeded in several phases. At the very beginning, we just did a prototype where we took the existing dashboard and ported it to our code base and infrastructure, with the JavaScript framework that we used and the backend that we were proposing.
But it's not a one-to-one port of openATTIC per se. So even though the UI might look familiar in some places, and there's a huge overlap in functionality, and one of our goals is to reach openATTIC feature parity with the Nautilus release, it's basically a rewrite from scratch. We started with a new Python backend and created our own REST API controllers. Yeah, and the GUI, all of these things: some components may look similar, but in general it's completely new work based on the experiences and learnings that we made in the openATTIC project.

Over the course of the last year, while SUSE started this project, we have grown a community around it. Red Hat is contributing a number of people who are dedicated to working on the Ceph Manager Dashboard, and we also get the occasional drive-by pull request from people in the Ceph community who are not paid to work on the dashboard.

The first official dashboard v2, as we call this release, shipped with the Mimic release in June last year. That included all of the functionality of the original read-only dashboard, plus some additional features we were able to finish within the six-month timeframe that we had. That included, for example, things like SSL/TLS support, and you had a way to define a username and password to protect this new dashboard from unauthorized users logging into it (a sketch of that initial setup follows below). Some management of Ceph objects started to appear: block devices, object gateway management was there. Some of the newer features, like the capability to browse and view all the configuration settings known to the MONs, could be seen, and of course a completely new UI, new design, and new layout compared to the original dashboard.

So this was the foundation that we set with the new architecture, and, well, since Mimic we haven't been standing still. We have been working on quite a long list of features. I did try to summarize them on the next two slides, and I'll pick a few out to mention. I'm probably missing a ton, because there are so many detailed improvements in so many areas. Many of them aren't really user-visible, but there's a lot of change that has been going on in the background: re-architecting, refactoring of things. So quite an evolution, and if I now look back at what the team has achieved in the course of this year that we've been working on this, it's pretty amazing and impressive what we have come up with. The thing that I really enjoy about it is that, since we are upstream in Ceph, we are just getting much more exposure and many more users, and it's not just the original core of people working on it, but really a growing community of its own within the Ceph ecosystem that is dedicated to working on the dashboard and improving it.

Once Nautilus is released, I anticipate that we will actually see an increased uptake in users, simply based on the fact that Nautilus is likely the release that other vendors will derive their downstream products from. For example, SUSE is currently working on that: our next version of SUSE Enterprise Storage will be based on Nautilus and will, of course, prominently ship the dashboard. We have just started our public beta testing, so if you want to take a look at a packaged version of Ceph, almost Nautilus, and don't want to build it from scratch, you could join the SUSE Enterprise Storage beta test.
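As an aside, that initial setup, enabling the module, generating a certificate, and creating the first login, takes only a couple of commands. Here is a minimal sketch using the Nautilus-era CLI, wrapped in Python for illustration; the username and password are placeholders, and later releases read the password from a file via `-i` instead of the command line.

```python
import subprocess

def ceph(*args):
    """Run a ceph CLI command; raises if the command fails."""
    subprocess.run(["ceph", *args], check=True)

# Enable the dashboard manager module and protect it with TLS.
# A self-signed certificate is fine for a lab; use a real one in production.
ceph("mgr", "module", "enable", "dashboard")
ceph("dashboard", "create-self-signed-cert")

# Create an administrator login ("admin"/"s3cr3t" are placeholders).
# Nautilus-era syntax; newer releases take the password via '-i <file>'.
ceph("dashboard", "ac-user-create", "admin", "s3cr3t", "administrator")
```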
All right, so what have we achieved for Nautilus so far? I have two slides. This one is more specific to the Ceph features, and another one that talks about structural or feature changes that are dashboard-specific. A long list, as you can see; I'm going to go through, hopefully, most of them in the live demo part. You now have more features on the OSD management page: it's not just a table that shows you which OSDs exist, but you can actively manage them by, for example, marking them out or down, or changing configuration settings. Recovery profiles are one thing that I'm going to show; that's quite interesting.

The previously read-only config settings browser has now become a config settings editor. You are able to quickly search across the, I think, nearly 1,700 configuration settings that Ceph provides. There's embedded help that explains what each config option is about, what format it expects, and what the default value is. So it's a very flexible way to quickly tweak your Ceph cluster's parameters if you want.

Pool management has been added. You can now create Ceph pools within the UI, edit existing ones, and choose, for example, the erasure code profile; you can choose between replicated and erasure-coded to begin with. There are lots of small details and options available here. A while ago, we added support for configuring the asynchronous mirroring of block devices, RBD mirroring, from one Ceph cluster to another. So you can now use the dashboard UI to enter the credentials and the various parameters to select the pool or the RBDs that you want to replicate, which hopefully makes setting this up a bit easier.

And the whole story around monitoring your cluster has been vastly improved. We have now integrated support for Grafana: there's a Ceph Manager module plug-in that exports metrics to Prometheus, and we then use Grafana to visualize those metrics from a running Prometheus instance. This is optional; you can enable it if you have this set up in your environment, but the dashboard also supports a number of graphs by itself. Grafana just gives you additional, more detailed insight into many more runtime parameters.

A CRUSH map viewer was added, which gives you a tree-like overview of the layout of your cluster: how the OSDs are organized, how your failure domains are designed. That's read-only for now. There are a lot of requests for a CRUSH map editor at some point, but for now we are postponing this a bit due to other priorities.

NFS Ganesha management is one thing we're working on; the pull request is under review right now. You will be able to use the dashboard to configure new NFS shares, select which Ganesha node they should be exposed on, and choose whether you want them CephFS- or S3-based. That is functionality that existed in openATTIC, but we have pretty much overhauled the concept of how the management of the shares is done, due to the fact that Ganesha itself has also evolved in the meantime, which makes it much easier. So instead of having to copy actual config files with the share definitions to the nodes, we can now use a Ceph pool in which we store the config objects using RADOS, which makes it much easier because we don't have to depend on an external orchestrator or tool that copies the config files onto the nodes, as sketched below.
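To make that RADOS-backed configuration idea concrete, here is a rough sketch of storing a Ganesha export definition as an object in a pool. The pool name, object name, and export details are all hypothetical; this illustrates the mechanism, not the dashboard's actual object layout.

```python
import subprocess

# A Ganesha export definition, stored as an object in a RADOS pool so
# that no external tool has to copy config files to the gateway nodes.
# Export_ID, paths, and names below are made up for illustration.
export_conf = """
EXPORT {
    Export_ID = 1;
    Path = /;
    Pseudo = /cephfs;
    FSAL { Name = CEPH; }
}
"""

with open("export-1.conf", "w") as f:
    f.write(export_conf)

# Store the export in a (hypothetical) 'nfs-ganesha' pool; Ganesha can
# then reference it via a rados:// URL instead of a local file.
subprocess.run(["rados", "-p", "nfs-ganesha", "put",
                "export-1", "export-1.conf"], check=True)
```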
iSCSI target management is another thing that's still in the works, but nearing completion: yeah, creating iSCSI targets based on RBDs, with all the bells and whistles that you would expect. That's a fairly big project. openATTIC used a different tool for managing the iSCSI targets, which was named lrbd, so we also had to move that support over to the ceph-iscsi project, which is part of the upstream Ceph project. But yeah, we realized in the process of adding the iSCSI functionality to the dashboard that ceph-iscsi itself had a few limitations and was missing features that lrbd provided. And for us, as the team that worked on openATTIC, we really want to make sure that there is almost zero gap between the dashboard and what openATTIC provided, because at some point, well, we will be recommending that existing openATTIC users go with the dashboard instead, so there should be a smooth migration path.

RBD QoS is another thing that we're working on. Well, yeah, I'm not too happy with the term QoS at this point, because it's more of a throttling mechanism: you can configure how much bandwidth and IOPS a single RBD should consume, or you can also set it at the pool level for all RBDs contained in that pool, in order to make sure that a single client can't basically tax and hog all resources. So you have kind of a rate-limiting feature here.

Also work in progress is the integration of Prometheus alerts. Right now, we just use Prometheus for collecting metrics, but of course, since Prometheus provides alerting out of the box, it makes sense to also tap into this. So we're working on a dashboard feature that will talk to the Prometheus Alertmanager and will visualize all the alerts that have been triggered. At some point, you will also be able to further configure individual alerts. I am probably missing something, but that's the high-level list of features that I have been able to gather since the Mimic release.

And by the way, if you have any questions or thoughts about what I'm talking about, I usually prefer that you just raise your hand and we have a conversation about the things that are of interest to you, rather than doing it all at the end. So there's a question here. You have a question? Yes. [Audience question about where the Ceph metrics are stored.] Right, so the question was what kind of backend is used for storing the metrics, and that's Prometheus; that's the default by now. Yeah, we had to make a choice, especially, as I just said, because the alerting in Prometheus is very strong and very promising. That's the current primary solution we're focusing on. But, well, at some point we may of course look into other tools as well, given bandwidth and the willingness of others to submit patches.

Right, so that's a broad overview of the Ceph-specific management features. We also added a number of features to the dashboard itself. Primarily, and most importantly, you can now define not just a single user, but multiple ones, and you can create roles for your admins. If you want to give people access to your dashboard and they should only manage the RADOS Gateway, for example, you can create a user and assign them the RADOS Gateway manager role. If, in your environment, you use an external identity provider that supports the SAML protocol, you can configure the dashboard to offload the authentication of these users to that identity provider. You still need to create the user and assign roles to them in the dashboard, but the verification of the username and password can be done through an external entity. David? [Audience question.] The question was whether that is integrated with CephX at all, and no, it is not; it's completely separate. Yes. But it may be worthwhile evaluating whether CephX could somehow be used as an external authentication mechanism. I am looking at Ricardo to see if that's actually possible, but I honestly don't know. So far, the primary request that we've received is for SAML, which is somewhat the lingua franca, an overarching standard that also makes it possible to use Active Directory or LDAP underneath. And so SAML was, yeah, one of the lowest-hanging fruits, basically; that's why we went for that. But the architecture in the dashboard itself theoretically supports other authentication mechanisms as well, so it's not limited to that.
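For reference, wiring the dashboard to a SAML identity provider is done through the CLI. A minimal sketch, assuming the Nautilus-era `ceph dashboard sso` subcommands; the dashboard base URL and IdP metadata URL are placeholders.

```python
import subprocess

def ceph(*args):
    subprocess.run(["ceph", *args], check=True)

# Point the dashboard at an external SAML IdP. Users and their roles are
# still defined locally; only credential verification is offloaded.
ceph("dashboard", "sso", "setup", "saml2",
     "https://dashboard.example.com:8443",    # placeholder dashboard URL
     "https://idp.example.com/metadata.xml")  # placeholder IdP metadata
ceph("dashboard", "sso", "enable", "saml2")
ceph("dashboard", "sso", "status")
```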
Auditing was another request. As you can see, many of these features are enterprise features, and some of these requirements come from downstream discussions that SUSE engineers had with Red Hat engineers and the product managers. One of the things I really like about the dashboard in particular is that this is an area where we have really managed to come together as two companies that usually compete in this space on the product level. After we had spoken and held meetings, we realized that in the end we all have the same requirements, and there's really no point in trying to differentiate and implement all of these features just in a downstream product. So the dashboard really benefits from the fact that all of these functionalities are high on the wish list for the downstream products, so we actually get the bandwidth and the time to work on them as well.

Most visible compared to Mimic is likely the new landing page that you will see once you have logged in, which is, well, what I would basically call a dashboard with metrics and graphs and overall health information: your cluster's health at a glance, basically.

Internationalization was added. The dashboard is now capable of speaking different languages, in theory; the infrastructure is there, even though the translations for some languages are still a bit behind. On the one hand, that's due to the fact that our development is still ongoing and the messages and strings still change, and we haven't really banged the drum to ask the community to support us or submit more translations in various languages. We are using Transifex as a public translation platform for that. So if you're curious about adding your own language, by all means please get in touch with us, and then we can make this possible. But the whole translation and internationalization story is something that can basically be done at any point in time, so right now we're focusing more on finishing the remaining features. Once that's done, we can start doing the translations into the various languages.

For developers, the dashboard backend provides a browsable REST API based on Swagger. You can point your web browser at the REST API endpoint and see all the various endpoints; you can manually try them out, send a GET request, and see the JSON that's being returned. At the moment, we have an Outreachy intern; she's doing a three-month project, and she's working on improving the REST API in the sense that it's being automatically documented. So as a developer, if you add a new REST API endpoint to the backend code, there will be, I think it's based on Python docstrings, if I'm not mistaken? Decorators, yeah, a kind of decorator that you add to the definition, which will then be rendered as documentation when you open Swagger.
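To give an idea of what consuming that REST API looks like, here is a minimal sketch against a Nautilus-era dashboard. The base URL and credentials are placeholders, and the exact endpoint paths are assumptions based on that era's API.

```python
import requests

BASE = "https://ceph-mgr.example.com:8443"  # placeholder dashboard URL

# Authenticate; the Nautilus-era backend returns a bearer token.
resp = requests.post(f"{BASE}/api/auth",
                     json={"username": "admin", "password": "s3cr3t"},
                     verify=False)  # lab setup with a self-signed cert
token = resp.json()["token"]

# Query any documented endpoint, e.g. the health summary that backs
# the landing page, with the token in the Authorization header.
health = requests.get(f"{BASE}/api/health/minimal",
                      headers={"Authorization": f"Bearer {token}"},
                      verify=False)
print(health.json())
```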
All right, so that's basically the dashboard's current state in a nutshell. Do you have any questions about this so far? Otherwise, I'm going to switch to the browser, and we'll take a look at the live instance, if my cluster is still up and running. Give me a moment to just switch. I'm going to switch to mirroring my display, because I don't want to look back all the time. That looks good. We need this one. Is that visible? Let me try to increase the font size a bit.

OK, so right now you can already see one of the big changes: it's now possible to select multiple languages. In the current development branch, it's just German and Portuguese, which incidentally also represents the majority of the people working on this project; we have people in Germany and Portugal. But also now people in India, somebody working in Japan, China; the community is growing, which is very exciting. And we have a complete translation into Chinese, which hasn't been merged yet, which is pretty cool.

OK, so this is a development environment, so I'm just using a default user here. And if this were an HD display, the dashboard would actually fit nicely on the screen; that's the screen resolution we have optimized this dashboard for. And since this is a development cluster that is idle, there isn't really much to see at this point. But if you go down here, for example, you get the utilization: how much space in your cluster is occupied, how much is available, how many objects. Placement groups per OSD is an important metric. Or here, the PG status, which basically shows you how much of the data is currently clean, as it all is right now: the redundancy is at the right level, and there's no data in flight being replicated to another OSD. All of these metrics can be seen here at a glance, as I said, if your resolution is high enough.

But from here, you can also move to the various subsystem pages, so to say; you could either use the top-level navigation, for example, let's go here to the hosts. And this table, I'm going to increase this a bit more, shows you all the hosts that your cluster consists of. This is just a development environment, so it's a single node that runs all of the services, as you can see from the long list of services running here. This is probably more exciting if you have a cluster with thousands of nodes and you quickly want to get an overview. So, like with most web applications, you can sort the hosts by various criteria, and you can also quickly drill down to maybe just a selected group of hosts, say, if I just want to see the hosts that are named ceph-dev. And if I now enter something else, you see that the rows disappear; but if I had any host that matched, let's say dev01, I would only see that one in the list here. So it's updating in real time, and you can quickly query it for the information that you're looking for.

From here, you can jump to more specific metrics for each of those services. Let's see the OSD, for example: lots of runtime parameters that can also be queried here. Let me go back for a moment. And here is the first demo of how we embedded the Grafana dashboards. If I click on an individual host below, I have an embedded Grafana dashboard that shows me some basic metrics just for this particular node. That's not very exciting at the moment, because Grafana isn't happy with me running this on localhost while it's looking for ceph-dev; a configuration error on my part. But this is basically pretty stock information that you get out of Prometheus and Grafana. It's just nice to have it quickly available within the same dashboard. You could, of course, point your browser at Grafana directly and see the same dashboards there as well. But the way they're embedded here, they are always associated with the object that you're currently working with.
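For anyone who wants to reproduce this monitoring setup, the wiring consists of enabling the manager module that Prometheus scrapes and telling the dashboard where Grafana lives. A sketch with placeholder host names; the port and the `set-grafana-api-url` command follow the Nautilus-era defaults.

```python
import subprocess
import urllib.request

def ceph(*args):
    subprocess.run(["ceph", *args], check=True)

# Enable the manager module that exposes cluster metrics for Prometheus
# (by default on port 9283 under /metrics).
ceph("mgr", "module", "enable", "prometheus")

# Sanity check that metrics are served; the host name is a placeholder.
with urllib.request.urlopen("http://ceph-mgr.example.com:9283/metrics") as r:
    print(r.read(300).decode())

# Tell the dashboard where to find Grafana for the embedded panels.
ceph("dashboard", "set-grafana-api-url", "https://grafana.example.com:3000")
```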
We also have one that gives you an overall performance view. These are specific metrics that are aggregated across all nodes in your cluster, so you will see the top 10 busy hosts, the top 10 by network load. Especially in a large cluster, you don't want to click on each individual host to check its metrics, so it's really important to figure out, OK, how can you aggregate those numbers that you're interested in without overwhelming the user, and the browser as well? We used to have a Grafana dashboard that tried to plot every OSD's performance metrics in a single graph. That didn't scale well; I think the browser crashed because it was running out of memory at some point. So this is an approach to drill down and minimize the information needed to quickly get an overview of where your cluster is heading and how it's doing. So, that's hosts. And I have 15 minutes left. Thank you. I'm speeding up.

MONs. That hasn't changed much from the original dashboard, maybe a few layout changes, but basically you get a quick overview of how your MONs are doing and whether they are all in quorum, plus some runtime information for MON-specific internals.

The OSDs page is likely one that you'll want to look at more frequently. Again, you have a way to filter it, so you can drill down to individual OSDs by their ID, or to an individual host to see a list of all the OSDs running on that particular host. Again, if I click on one of them, I get additional information down here. There's an I/O histogram; that's a legacy widget from the original dashboard, and I don't really know what it looks like when an OSD is really busy. More interesting is probably the Grafana dashboard that we've embedded here, which drills down to the OSD-specific runtime information.

You can make changes to the OSDs in various ways. One example is that there are a number of OSD-specific cluster-wide flags that you can set. For example, if you're doing maintenance, one of the things you might want to add here is the noin flag, so that it's more under your control when an OSD rejoins the cluster and the rebalancing starts happening. I'm actually keeping this setting now, because I want to quickly demo how the dashboard reacts to changes in your cluster when there is actually a problem happening. So I'm selecting that OSD, and I can mark it as out here. This basically means it's no longer available, and Ceph needs to do something about it in order to make sure that the placement groups stored on this OSD are replicated elsewhere. And we can now take a look at the top-level dashboard here. It's now telling me: oh, we're having a health warning. And down here in the PG status, you can now see that some PGs are in an unclean state, and now it's replicating the data to the other OSDs. And then things are happy again, even though the health warning still remains because of the noin flag that I've enabled; that is a health-warning situation for Ceph.

So let's go back to the OSDs. Not much happening. Let's shoot another one, just because we can. This time we'll take a look at the Ceph pools while it happens, because that's likely something that is more interesting to see on the pools page. Right, so now here you can see which Ceph pools are affected by me taking out that particular OSD and what the status of the placement groups is, and you see the I/O that is happening between the OSDs, and you can quickly see how Ceph slowly recovers again.
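The same maintenance flow looks like this from the CLI side: set the noin flag, mark an OSD out, and watch the placement groups recover. A sketch assuming OSD id 0; the PG summary format varies by release, so this just prints the raw parsed output.

```python
import json
import subprocess
import time

def ceph(*args, parse=False):
    """Run a ceph command, optionally parsing its JSON output."""
    cmd = ["ceph", *args] + (["--format", "json"] if parse else [])
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return json.loads(out) if parse else out

# Keep OSDs from being marked back in automatically during maintenance,
# then take OSD 0 out -- the same two clicks as in the dashboard.
ceph("osd", "set", "noin")
ceph("osd", "out", "0")

# Watch the PGs go degraded and then return to active+clean.
for _ in range(5):
    print(ceph("pg", "stat", parse=True))
    time.sleep(5)
```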
So hopefully this is quite useful for seeing how your cluster is doing at a glance and getting some more insight without having to use the command line. Again, there's an overall performance view for pools. The nice thing about Prometheus and Grafana is that they react very quickly to changes: if I add a new OSD, if I add a new pool, I don't have to do anything in Prometheus or Grafana to instruct them that, hey, there's a new object you need to start monitoring. They take care of that by themselves, and the new metrics and graphs are available in a very short time. So that makes it very flexible to use.

Let's see, what else can we show? The configuration editor that I spoke about is this one. Again, you can quickly query for specific parameters. Let's see: osd_memory_target, I think that's one of the new Nautilus features, so let's take a look at that one. I get a description, the default value, and so forth, and of course I can make changes if I want to. So: make the changes here, press Save, and they are saved. How am I doing time-wise, Jan? 10 more minutes? OK, thanks.

And let's see, the CRUSH map viewer: that was something contributed by someone in the community recently. Again, since this is a minimal system, it's not that exciting to look at, but basically you have a tree-like structure, like I mentioned, and on each level you can click to get a few more details about that particular element. Right now we have a fairly small structure, just OSDs on one host, but of course, if you have defined, let's say, racks or data centers, that hierarchy would look a bit more interesting.

Logs: of course, you want to see what's happening in your cluster. The cluster log is empty, which kind of surprises me, because we just made some changes to it. But that may be a bug we need to fix. So that's the demo failing here. Yeah, I don't know. But the audit log is the separate log that basically shows you any change that has been made to the cluster. If you enable auditing in the dashboard, the dashboard itself will also inject audit messages for any operation that you as an operator perform. We have a pull request in the pipeline that will add a feature to search the logs and trim them down to a certain date range; that's ongoing work.

All right, let's hop back to pools real quick. I just want to add one to show you this, and it's going to be an RBD pool. Here I can select which type it should be. With erasure-coded, it would look like this: I could choose an erasure code profile here, if I had several ones, and I can define compression. I'm going with replicated; it's a small cluster, so few placement groups, and the replicated rule is the default. Replicated size three means that each object is duplicated to two other OSDs. And I need to define which applications should have access to, or should work with, that pool: it's RBD. Compression, yeah, let's go with aggressive compression using the snappy algorithm. And I'm not sure if I need to define anything else. Creating the pool, there we go. And as you can now see at the bottom, it's now working on adding the pool to my cluster. OK, there we go.

This might be a good opportunity to quickly show this feature over here: you basically have a list of notifications and a list of background tasks. Some operations in a Ceph cluster may take quite some time, and in that case the dashboard will basically queue them in its internal background-task mechanism, and you will get notifications once the operation has finished.
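For reference, the pool created through that form a moment ago corresponds to a handful of CLI calls. A sketch with a hypothetical pool name, matching the choices from the demo: replicated, size three, aggressive snappy compression, tagged for RBD.

```python
import subprocess

def ceph(*args):
    subprocess.run(["ceph", *args], check=True)

# Replicated pool with a small PG count, suitable for a toy cluster.
ceph("osd", "pool", "create", "demo-pool", "32", "32", "replicated")
ceph("osd", "pool", "set", "demo-pool", "size", "3")  # two extra copies

# Aggressive inline compression using the snappy algorithm.
ceph("osd", "pool", "set", "demo-pool", "compression_mode", "aggressive")
ceph("osd", "pool", "set", "demo-pool", "compression_algorithm", "snappy")

# Tag the pool for RBD so the dashboard and rbd tooling will offer it.
ceph("osd", "pool", "application", "enable", "demo-pool", "rbd")
```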
OK, going to block images. These are, well, RBD images. If you were using Ceph in, let's say, an OpenStack environment where other applications are creating RBDs, they would of course also be visible here. If you create RBDs on the command line, they would immediately appear. And if you quickly want to create one manually, you can of course do this using the Add button. Let's create an RBD test image. Since I only have one pool with the RBD label, it's pre-selected; if I had more, I could choose from the list here. If I were using a dedicated data pool, I would have had to create one, so I'm skipping this. There are a number of features that you can select; I'm just going to take the defaults here. Oh, I need to at least tell it how big it should be. Let's go with ten gig. And now it goes ahead and creates an RBD. So, well, what you would expect.

Maybe something to mention here: many of the dialogs and the information that is displayed are still, I would call it, a bit raw in many ways. In several places, it basically resembles the same thing that you would do on the command line with regard to which options and values you need to provide, and sometimes also the terminology used that you need to be aware of. We know this, and one thing that we would like to do, based also on user feedback, is to make this slightly more user-friendly by hiding the information that's not really necessary for the operation at hand, or maybe by starting to do something like a more guided approach. You as an administrator usually have a clear goal that you want to reach, and instead of you having to know which steps to take, we would do something like a wizard, a more guided step-by-step approach, where you basically just select, OK, I want to serve RBDs, and then the dashboard guides you through the various steps involved in achieving that goal.

Time is ticking; I'm going to be rushing. RBD mirroring, which I mentioned: basically a way to connect two Ceph clusters to one another so that RBD images created in one cluster are replicated to the other. This is your UI for seeing how this is doing and for configuring it. It's not set up in the demo, so it's not of much value here.

iSCSI: this is currently a read-only view into how your iSCSI daemons are doing and what images are being served. This will be completely overhauled once the iSCSI management pull request is merged.

CephFS: basically showing you which metadata servers are around and what clients are connected to them. Again, an embedded Grafana dashboard gives you a bit more information about how your MDS is doing, for example. Here you would see all the clients that are connected to CephFS at the moment, with some metrics.

Let's see, what else can we show? RADOS Gateway: object gateways that serve objects using the S3 protocol. You can see the version that is running, and again you have various performance counters for each of these; some of them are also visualized in custom Grafana dashboards here. So if you're serving data through RGW, this might be useful for you. And again, we have an aggregated Grafana dashboard that shows you overall how your RADOS Gateways are doing. If you want to manage RADOS Gateway users, this is the dialog to do it: you can add new users here, manage their keys, set quotas, create buckets for them, and see which buckets they have created. Everything that is available through the RADOS Gateway admin ops REST API is accessible here.
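Under the hood, those user management operations correspond to what radosgw-admin does on the command line. A rough sketch with a placeholder user and quota, just to show the shape of it.

```python
import subprocess

# Create an RGW user with a quota, mirroring what the dashboard's user
# management dialog drives through the admin ops API.
# ('alice' and the quota value are placeholders.)
subprocess.run(["radosgw-admin", "user", "create",
                "--uid=alice", "--display-name=Alice"], check=True)
subprocess.run(["radosgw-admin", "quota", "set", "--uid=alice",
                "--quota-scope=user", "--max-size=10G"], check=True)
subprocess.run(["radosgw-admin", "quota", "enable", "--uid=alice",
                "--quota-scope=user"], check=True)
```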
And let's see, maybe finally the user management, just to show you this. So we have just one admin user here at the moment; I can try to create a new one. So: username, password, the usual thing. And then down here, I can select various roles. The dashboard ships with some default roles that we have defined, but you are free to add new roles in case you have a use case that we haven't thought about yet. So, maybe going back to the roles page here: these are the predefined ones. And if I, for example, click on the RADOS Gateway manager, I can see in which scopes the user has which permissions. Read, create, update, and delete are the various permissions that you can have, and for the RADOS Gateway, this is of course the only thing you need, plus being able to read configuration options, so that's another flag that we've added here. Yeah. So again, this is the initial approach to this; I guess over time we will get user feedback on whether this matches expectations or whether we need to make changes. We're looking forward to any feedback we can get here.

And with that, I think I'm going to stop here. So we have at least a few, one, two, three minutes, one minute for a question. Do you have any questions, or feedback on what you've seen so far? Is this crap? Is this awesome? All right, there are no questions. Sorry for the rush at the end. I hope this at least gave you a short insight and makes you curious about what's coming in Nautilus. Thank you.
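As a closing reference, the role and user model shown in the demo can also be driven entirely from the CLI. A sketch assuming the Nautilus-era access-control commands, with placeholder names, building a role similar to the predefined RADOS Gateway manager.

```python
import subprocess

def dashboard(*args):
    subprocess.run(["ceph", "dashboard", *args], check=True)

# A custom role limited to managing the object gateway, plus read access
# to configuration options (the extra flag mentioned above).
dashboard("ac-role-create", "rgw-operator", "Manages RGW only")
dashboard("ac-role-add-scope-perms", "rgw-operator", "rgw",
          "read", "create", "update", "delete")
dashboard("ac-role-add-scope-perms", "rgw-operator", "config-opt", "read")

# Create a user ('alice'/'s3cr3t' are placeholders) and attach the role.
dashboard("ac-user-create", "alice", "s3cr3t")
dashboard("ac-user-set-roles", "alice", "rgw-operator")
```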