Hello, everyone. Welcome to my presentation about Red Hat OpenShift Data Foundation monitoring. I will first introduce what Red Hat OpenShift Data Foundation actually is. It is a Red Hat product, part of OpenShift, that provides higher-level data services and persistent storage for Red Hat OpenShift. It is composed of three components: Red Hat Ceph Storage, Rook, and the Multicloud Object Gateway. Ceph Storage and the Multicloud Object Gateway each ship their own monitoring, which is incorporated into OpenShift Data Foundation, and both of those components need to be monitored. OpenShift Data Foundation provides persistent storage through object store, block store, and file system interfaces, and each of them has different monitoring capabilities.

Red Hat OpenShift, as a platform, has monitoring implemented through Prometheus, and it also contains Alertmanager, Grafana, the Telemeter client, operators, and some other components. Prometheus collects and stores metrics, exposes them through an API endpoint, and handles querying those metrics. Alertmanager provides the interface for alerts; you can also set up additional receivers to get alerts from OpenShift delivered to your mailbox, for example. Grafana provides dashboards, which are integrated into the OpenShift console, and the Telemeter client sends some data about your cluster back to Red Hat so it can be monitored better. In the Red Hat OpenShift platform you can view your monitoring data in dashboards, you can get alerts, and you can view and work with metrics. When you log in to your cluster and you have OpenShift Data Foundation installed, you see something like this.
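To make the Prometheus part concrete, here is a minimal sketch of how a metrics query is addressed over the Prometheus HTTP API that the talk mentions. The host name is a made-up placeholder (in OpenShift you would typically go through the thanos-querier route with a bearer token); only the URL construction is shown.

```python
from urllib.parse import urlencode

def prometheus_query_url(base_url: str, promql: str) -> str:
    """Build the /api/v1/query URL for a PromQL expression.

    base_url is assumed to be the Prometheus (or thanos-querier) route;
    the example host below is hypothetical.
    """
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"

# Example: ask for the Ceph health metric exposed to Prometheus.
url = prometheus_query_url("https://thanos-querier.example", "ceph_health_status")
print(url)  # https://thanos-querier.example/api/v1/query?query=ceph_health_status
```

In a real cluster you would send this request with an authorization header for a service account that can read monitoring data.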
It's the overview page that is visible right after login, and if you have Red Hat OpenShift Data Foundation installed, you see a green tick next to Storage in the central part of the page. If you click on it, you get a link to more OpenShift Data Foundation reports. It's good to have the green tick there, because anything else means you have a problem and you need to solve it. If you follow that link, you get to pages like this. They contain only some basic monitoring statistics; there is not much information there, but there are more dashboards. You can click the link at the bottom of the page, or go to the Storage Systems tab, and you will reach them.

The first dashboard is Block and File, and here you can see basic monitoring for your storage. You can see information about the platform you are running on. The status section contains the Ceph health status and the statuses of the storage systems you have there, plus data resiliency and capacity. If you scroll through the page, you get more information, mainly about utilization: time-based data for capacity, IOPS, throughput, latency, and recovery. That's it for the dashboard covering file systems and block storage.

For the object store you also have a dashboard, but this one is more complicated, because with the object store you can monitor both the Multicloud Object Gateway, which is not Ceph-based, and the Ceph-based RADOS Object Gateway (RGW), and in this dashboard you can switch between the two. You see the same kind of information as before, aggregated health of your storage, and in the bottom part of the page some performance charts where you can switch between those two providers and look at provider-specific metrics like latency and bandwidth. But you may not want only to look at dashboards; it's also useful to set up some alerting.
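A capacity panel like the one on the Block and File dashboard boils down to a ratio of two gauges. As a sketch, the metric names below are the ones commonly exported by the Ceph mgr Prometheus module; treat them as assumptions for your exact OpenShift Data Foundation version.

```python
# Capacity-style PromQL, as a plain string: used bytes over total bytes.
# Metric names ceph_cluster_total_used_bytes / ceph_cluster_total_bytes
# are assumed from the Ceph mgr Prometheus exporter.
used_ratio_query = "ceph_cluster_total_used_bytes / ceph_cluster_total_bytes"

def as_percent_query(ratio_expr: str) -> str:
    """Wrap a ratio expression so the panel shows a percentage."""
    return f"100 * ({ratio_expr})"

print(as_percent_query(used_ratio_query))
```

The same pattern (rate of an IO counter, a latency histogram quantile) sits behind the throughput and latency panels.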
For basic cluster health, we have two alerts: CephClusterWarningState and CephClusterErrorState. When you get either of those alerts, you should search for more alerts and investigate your system, because it's in a bad state. All alerts in OpenShift also have a grace period: if you are receiving a warning-state alert, the system has been in that state for a while, so it's probably not just some random event.

For capacity, you can watch the CephClusterNearFull, CephClusterCriticallyFull, and CephClusterReadOnly alerts. Those are useful for block, file, and object storage alike. If you get the read-only alert, your cluster is effectively broken: no data is being stored, and you need to resolve it as soon as possible. Before that happens you usually get the first two, but when the third one comes, you probably need to contact support. Also, if you use OpenShift Data Foundation as a managed service add-on through the Red Hat OpenShift Cluster Manager, then these are the only alerts you get, and everything else is handled by site reliability engineers. This is currently not available to all customers, only to a limited number.

The product should also monitor the health of the Ceph components behind your persistent storage. Here we have four of them that we can go through: the Ceph manager, the Ceph metadata server, the Ceph monitors, and the object storage daemons. We can start with the Ceph manager. This one is important because it's responsible for collecting and sharing all metrics and alerts that come from Ceph. If it is not working, then you have no alerting for Ceph and you are basically blind: you don't know what is happening in the system. Its monitoring and alerts are, however, separate from the rest of Ceph, so even if Ceph is not working, you should still get the CephMgrIsAbsent or CephMgrIsMissingReplicas alerts. If you get those, you won't see any cluster error states, but that doesn't mean everything is fine; you are just blind.
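To illustrate working with these alerts programmatically, here is a minimal sketch that filters firing Ceph alerts out of an Alertmanager-style payload. The nested label layout mirrors what Alertmanager's alert list API returns; the sample data is invented for illustration.

```python
def ceph_alerts(alerts, severity=None):
    """Return names of alerts whose labels mark them as Ceph alerts.

    `alerts` is a list of dicts shaped like Alertmanager API entries:
    each has a "labels" dict with "alertname" and "severity".
    """
    names = []
    for alert in alerts:
        labels = alert.get("labels", {})
        name = labels.get("alertname", "")
        if not name.startswith("Ceph"):
            continue  # skip non-Ceph alerts such as Watchdog
        if severity and labels.get("severity") != severity:
            continue
        names.append(name)
    return names

# Invented sample payload for illustration only.
sample = [
    {"labels": {"alertname": "CephClusterWarningState", "severity": "warning"}},
    {"labels": {"alertname": "CephClusterErrorState", "severity": "critical"}},
    {"labels": {"alertname": "Watchdog", "severity": "none"}},
]
print(ceph_alerts(sample, severity="critical"))  # ['CephClusterErrorState']
```

A routine like this is handy in a webhook receiver that pages on critical Ceph alerts but only logs warnings.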
You need to investigate what is happening in your cluster, because you have no data. The next Ceph component is the Ceph metadata server (MDS). This one is not so critical: it only stores metadata for the file system, so it concerns you only if you are using CephFS. If it's missing, your file system doesn't work correctly and you also need to investigate, but it doesn't affect block devices or the object store. The alert for this is CephMdsMissingReplicas, which tells you that the MDS has no running replicas.

Next, the Ceph monitors. Those maintain the cluster state, the map of where all data is stored and everything related to it. There are usually at least three of them. You need an odd number of monitors, because Ceph always elects a leader among them, and you need a quorum of monitors to elect the correct leader and to know which monitor contains the latest and most correct data. There are alerts related to maintaining that quorum: CephMonQuorumAtRisk, CephMonQuorumLost, and CephMonHighNumberOfLeaderChanges. The second is critical; the first is not yet. Usually it means you lost a monitor and it will soon recover, because OpenShift Data Foundation has mechanisms to recover from almost every state, but it may happen that you lose the quorum, and then you also lose monitoring and have a real problem with your cluster. The last alert only means there were changes in leadership. It typically happens when you lose a monitor, or a node with a monitor, and it then recovers; it's usually not a serious state and it resolves quickly.

The last Ceph component I will talk about is the object storage daemon (OSD). OSDs are the components that actually contain the data, and they handle data operations like replication, recovery, and rebalancing. Your data is stored in them, so you need them.
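The quorum rule described above is simple majority arithmetic, which can be sketched directly: with n monitors, a majority of them must agree, so the cluster tolerates losing the rest.

```python
def quorum_size(monitors: int) -> int:
    """Smallest majority of a monitor set: floor(n/2) + 1."""
    return monitors // 2 + 1

def tolerable_failures(monitors: int) -> int:
    """How many monitors can be lost while quorum survives."""
    return monitors - quorum_size(monitors)

# Three monitors (the usual minimum) tolerate one failure;
# five monitors tolerate two.
print(quorum_size(3), tolerable_failures(3))  # 2 1
print(quorum_size(5), tolerable_failures(5))  # 3 2
```

This also shows why even counts add little: four monitors still tolerate only one failure, which is why odd monitor counts are recommended.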
Losing one of them is usually not a problem, because in OpenShift you have replication of almost everything, and you usually set up your cluster to have replication across OSDs. Ceph also handles this itself; that's why we have all those monitors and all the complex structures to maintain it. But it's still serious if you lose too many object storage daemons. In that case you also get a Ceph health error state and you have a problem. There are three kinds of alerts for the daemons. For availability, you have CephOSDDiskNotResponding and CephOSDDiskUnavailable, which indicate a real problem. For capacity, an OSD can be near full or critically full; you need to resolve those, and they come almost every time accompanied by cluster capacity issues, because rebalancing is constantly running to avoid exactly these states. There are also some alerts for data recovery and self-healing. If you see some of them, then you probably don't have enough resources, or you are running too heavy a workload on your cluster, and you will usually see slow operations.

For object storage, OpenShift Data Foundation integrates the Multicloud Object Gateway. This component is based on the NooBaa project, a startup acquired by Red Hat in November 2018, and it handles most of the object-based interfaces. It offers a lot of capabilities around S3-compatible APIs. You work with object buckets there: you can mirror them, aggregate them, and do many other things with them. You can use the Ceph-based storage you already have, or you can use any external bucket; for example, you can take an AWS bucket you have prepared somewhere else, import it, and share it through the storage you have in your OpenShift. And of course you can monitor it, in the object dashboard that you saw, but you probably also want to monitor it through the monitoring that is part of this project.
Basically, all monitoring for the Multicloud Object Gateway is prefixed with NooBaa. That comes from the original startup that was acquired; it was never changed, so when you see NooBaa you know it's something to do with objects. Here are some examples of alerts for buckets. You can monitor bucket state with NooBaaBucketErrorState, quota with NooBaaBucketExceedingQuotaState and NooBaaBucketReachingQuotaState, and bucket capacity with NooBaaBucketLowCapacityState and NooBaaBucketNoCapacityState. There are more, because it's a fairly complex product, but these are the main things you care about when you are dealing with an object store.

If you don't want to use only alerts, you can also work with metrics. In OpenShift you have two ways to access them. You can use the OpenShift console, which provides an interface for time-based series and lets you query multiple data sources. If you want just a simple query, you can type some text there and it will find the metrics most similar to what you typed: type the ceph prefix and it will find all metrics relevant to Ceph, or type the NooBaa prefix and you will get all the object store metrics you can look at.

Here we have resources. The first two links are to documentation: the first one for OpenShift Container Platform and its monitoring, and the second one for OpenShift Data Foundation. The third one is for troubleshooting; you can use it to mitigate any alert that you come across. That documentation contains a list of all alerts and the actions to take when they appear. There are some basic mitigations that you can do on your own. Many of those alerts are mitigated by contacting support, so it's not always self-service, but at least you always know what to do. Thank you. Do you have any questions? I think that it could be possible, but I haven't done that yet.
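The prefix-based browsing described above is easy to mirror in code: given a catalog of metric names, narrow it by component prefix the same way you would by typing "ceph" or "NooBaa" into the metrics page. The sample names below are invented stand-ins for real exported metrics.

```python
def metrics_with_prefix(names, prefix):
    """Return metric names starting with `prefix`, case-insensitively."""
    return sorted(n for n in names if n.lower().startswith(prefix.lower()))

# Hypothetical catalog mixing Ceph, NooBaa, and node exporter metrics.
catalog = [
    "ceph_health_status",
    "ceph_osd_up",
    "NooBaa_bucket_status",
    "node_cpu_seconds_total",
]
print(metrics_with_prefix(catalog, "ceph"))    # ['ceph_health_status', 'ceph_osd_up']
print(metrics_with_prefix(catalog, "NooBaa"))  # ['NooBaa_bucket_status']
```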
You have a Prometheus endpoint available and it should work, but I don't know for sure. Thank you, everyone.