Hello, everyone. My name is Karina Angel from Red Hat, and welcome to All Things Data. Today we have a very special guest speaker for our second episode of All Things Data. We have Sagy Volkov from Red Hat, who is here to give a live demo of workload resilience on OpenShift with Ceph storage: updating Ceph and adding storage to your persistent volumes while everything just keeps running. Thank you, Sagy. Please take it away.

Hi, my name is Sagy Volkov. I'm a storage performance instigator. Today's demo will be workload consistency during a Ceph upgrade and storage expansion. This demo is based on components of OpenShift Container Storage version 4.2. OpenShift Container Storage 4 is based on the OCS operator and the Rook operator, and the Rook operator is the director, or orchestrator, of all things Ceph related. The reason this demo actually uses Rook and Ceph rather than OCS4 is that OCS4 is very closed and very opinionated about which versions we can run, and how, and to which versions we can update. So it's just much easier to perform the demo on Rook and Ceph than on OCS4, but the method of updating the software or expanding the storage in OCS4 is completely identical.

In this demo, I'm going to have a MySQL pod preloaded with some data, and a small job that runs another pod, a sysbench pod, that will stress the MySQL pod. I'm going to show several terminal windows, and we're going to monitor the transactions per second of the sysbench job. During this process, we are going to update the Ceph version by changing the CephCluster CRD, and we're going to watch for any changes during the update. Then we're going to add storage, or expand the storage that the Ceph cluster has, and again monitor to see whether there are any IO performance hiccups or anything like that during this stage. This is a completely live demo on an AWS cluster consisting of one master and three worker nodes, so we also have to cross our fingers and pray to the gods of AWS.

I will now start the demo. The version that I'm going to update to is 14.2.6. Before starting, I'll just explain the other windows here. On the top left is a list of the pods Ceph is currently running. On the bottom left are the components Ceph needs in order to run and support RBD, the block device option; as you can see, all of them are at version 14.2.5. On the top right is the Rook operator log, which we're constantly tailing. On the bottom right are the transactions per second running against MySQL. I will just run this patch script in the middle window, and it will start the process. What we will see immediately is that the Rook operator gets the request to change the version and acts upon it.

Now, many of the things I'm showing here are not going to be needed on OpenShift Container Storage, or OCS4. In OCS4, you are able to choose whether to update the OCS software automatically or manually, and the OCS4 operator will take care of everything I've done manually here, such as patching the Rook operator and things like that. On the top left, what we're seeing is that mon A was chosen as the first one to be restarted and deployed with the new software. We can also see on the bottom left that it's currently the only one already at the new version, 14.2.6.
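For reference, a minimal sketch of what a patch script like the one in the middle window boils down to, assuming Rook's default namespace and cluster name (rook-ceph):

```sh
# Ask the Rook operator to roll the cluster to a new Ceph release by
# changing the image in the CephCluster CRD (Rook default names assumed).
kubectl -n rook-ceph patch CephCluster rook-ceph --type merge \
  -p '{"spec":{"cephVersion":{"image":"ceph/ceph:v14.2.6"}}}'
```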
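The bottom-left versions window can be produced from the ceph-version label that Rook stamps on each daemon deployment; roughly like this, with Rook's default namespace and label selector assumed:

```sh
# Show each Ceph daemon deployment with the Ceph version it is running.
watch --exec kubectl -n rook-ceph get deployments -l rook_cluster=rook-ceph \
  -o jsonpath='{range .items[*]}{.metadata.name}{"  ceph-version="}{.metadata.labels.ceph-version}{"\n"}{end}'
```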
The way this update process works is that it goes through the three mon pods that we have. Every time it updates a mon, it waits for that mon to be back in the mon quorum and will not continue to the next one before it's actually working. The wait time is roughly 60 seconds: if the Rook operator decides that the mon is not back in the quorum, it just pauses for 60 seconds and tries again. So we see that the Rook operator has decided to move on to the next one, and we can see it being initialized. On the bottom left, we can also see an enumeration for each component of how many pods are required, whether they have been updated, and whether they are available. These are the values Rook uses to decide whether to move on to the next phase.

One note about the pods we have here: this Rook-Ceph cluster only runs the RBD portion of Ceph, the block device option. In a fuller deployment you're going to see a lot more pods with other components of Ceph, for example the gateway for objects and CephFS, and again they are updated in a fashion where only once a certain component has completed its update will the Rook operator continue to the next one. The Rook operator is now going to update the third monitor pod, the third mon. As you can see on the bottom left, it's waiting for it to be updated, and then it will wait for it to be available. The next component to be updated is the manager pod; there's only a single pod of this. Once mon C is back in quorum, the operator continues to updating the manager pod.

Now, one of the questions I posed in the presentation is whether we should see any kind of IO disturbance, or IO pause as I call it, while running these updates. The answer is that, typically, we should not. Most software-defined storage products that are available can be updated live; however, most places will probably prefer to create some kind of maintenance window in which no live IO goes to the storage system for the duration. In this demo, though, we're actually updating the software while IO is running against the MySQL pod.

Now, we have three OSDs. These are the pods that provide the storage, forming the Ceph cluster for any application on this OpenShift cluster to use. We can see that the first one, OSD 0, was already updated, and the second one is being updated right now. If you look at the bottom right, you can see that the IOs are dropping a little. That is because at this point Ceph has to tell the Ceph clients, hey, do not use this copy of the data because we are now changing it; use this other copy of the data, because it's valid and it's already on the new Ceph version. So OSD 1 is also updated, and the Rook operator will soon move on to updating OSD 2. By the way, there's a pod here called ceph-tools; disregard it. That's a Rook-Ceph management pod, and OCS does not use this pod.

So again, the 60-second cycle is going to be enforced by the Rook operator if it does not think that all the OSDs are up and running. And now we are seeing that after the 60 seconds, OSD 2 has been terminated, and we can see on the bottom left how one pod is required, one pod has already been updated, and it is not yet available to the Ceph cluster.
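That required/updated/available enumeration is the same deployment watch extended with the replica counters, along the lines of what Rook's upgrade guide suggests (again a sketch assuming Rook's defaults):

```sh
# For every Ceph deployment, print required vs. updated vs. available
# replicas; Rook moves to the next daemon only when these agree.
watch --exec kubectl -n rook-ceph get deployments -l rook_cluster=rook-ceph \
  -o jsonpath='{range .items[*]}{.metadata.name}{"  req/upd/avl: "}{.spec.replicas}{"/"}{.status.updatedReplicas}{"/"}{.status.readyReplicas}{"\n"}{end}'
```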
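The exact sysbench job behind the bottom-right window isn't shown in the recording, but a typical invocation looks roughly like this; the host, credentials, and table sizing here are illustrative assumptions:

```sh
# Run an open-ended OLTP read/write mix against the MySQL pod and report
# throughput once per second. Assumes the tables were already loaded with
# "sysbench oltp_read_write ... prepare" (the preloading step mentioned above).
sysbench oltp_read_write \
  --mysql-host=mysql.demo.svc --mysql-user=sbtest --mysql-password=secret \
  --tables=4 --table-size=100000 --time=0 --report-interval=1 run
```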
And once this OSD is updated and back in, the Rook operator log shows that it succeeded in updating the cluster. My little script here says it took about six minutes and 50 seconds to update. We saw a small blip where the IOs to the MySQL pod dropped a little, and everything else continued as is.

I'm going to continue the demo now with how we add additional storage to the Ceph cluster using Rook. Again, in OCS4 all of this would be done from the OpenShift console in OpenShift Container Storage. What we have, as I previously stated, is an AWS cluster. On the top left, I have a watch that looks at all the OSD pods. Again, the OSD pods, or the OSD processes, are what consume storage, form all of that storage into a Ceph cluster, and then provide it back for other applications to use. On the bottom left, we have a watch running the ceph osd tree command; as we can see, we have three hosts, and each of them has a single OSD running on it. On the top right is the same Rook operator log, and on the bottom right are our transactions, which are continuing to run.

What I'm going to do is edit the CephCluster object. As you can see here, when I installed everything, I actually filtered which devices could be used, and I specifically told the cluster not to use all the devices on the AWS VMs. So I'm going to change the use-all-devices setting to true, change the device filter to null, and save the object. On the top right we immediately see that the Rook operator has identified a change to the CephCluster object, and it's going to initiate a search on each of our VMs for new devices that can be used.

Now, as we can see, the OSD prepare pods on the top left are all being restarted. OSD prepare pods have one task: to prepare a host, or a VM, to be used for Ceph. They go over all the devices available on that host and report the information back to the Rook operator, and if the Rook operator decides there are new devices to be used, it creates new OSDs for them. That's why you see that some of these prepare pods have already completed while some are still running. What you can see on the bottom left is that the OSD tree has already listed three new OSDs, because we just provided more storage devices to the Ceph cluster. They are not really usable yet, though, because there is no specific OSD pod running and using the devices the VMs are providing. Now we're seeing on the top left how the OSD pods for OSD 3, 4, and 5 are being initialized, and once each pod is up and running, we see on the bottom left how each of these new OSDs joins its existing host, is marked as up, and is available to be used by any application that is using the Rook-Ceph block option.

As we can see on the bottom right, there's a little bit of a slowdown in the IOs. The explanation is simple: we had three devices that provided the storage for everything, with some data on them, and now we have six devices providing storage for everything. Ceph is going to take the existing data we have and spread it over all the devices. This is actually one of Ceph's biggest strengths: it's massively distributed.
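The bottom-left topology watch can be reproduced through the Rook toolbox pod, roughly like this; the app=rook-ceph-tools label is Rook's default for the toolbox and is an assumption here:

```sh
# Refresh the OSD topology every two seconds via the Rook toolbox pod.
TOOLS_POD=$(kubectl -n rook-ceph get pod -l app=rook-ceph-tools -o name)
watch --exec kubectl -n rook-ceph exec "$TOOLS_POD" -- ceph osd tree
```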
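And the edit to the CephCluster object is equivalent to a patch along these lines; useAllDevices and deviceFilter are Rook's storage-selection fields, and the assumption is that the original spec set useAllDevices to false with a deviceFilter in place:

```sh
# Let Rook use every unused device on every node; sending null in a merge
# patch clears the old deviceFilter. In the demo this was done interactively
# by editing the CephCluster object instead.
kubectl -n rook-ceph patch CephCluster rook-ceph --type merge \
  -p '{"spec":{"storage":{"useAllDevices":true,"deviceFilter":null}}}'
```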
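While Ceph redistributes the existing data across the new OSDs, one way to watch the rebalance is the cluster status from the same toolbox pod (again a sketch assuming Rook's defaults):

```sh
# ceph -s reports overall health plus recovery/rebalance throughput while
# data is being spread across the new OSDs.
TOOLS_POD=$(kubectl -n rook-ceph get pod -l app=rook-ceph-tools -o name)
kubectl -n rook-ceph exec "$TOOLS_POD" -- ceph -s
```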
The more devices Ceph has, the better the performance, so Ceph as a cluster is now redistributing the data. Of course, all the data remains available the whole time. And with that, we have just finished adding storage to our three available VMs in AWS. This is the end of the demo.

Thank you so much, Sagy. That was excellent. And thank you, everybody, for joining us for another All Things Data session on the OpenShift Commons briefing channel. All of these sessions are posted to YouTube, so please look at the openshift.com/storage page, as well as the OpenShift Commons YouTube channel, where the All Things Data recordings are posted, and look out for this on the OpenShift blog. Thank you, everybody. See you next time.