So today I'm going to show you two servers in AWS. One is running ASCS, the other is running ERS. We're going to inject a kernel failure into the ASCS server. It's going to go down. The secondary server is automatically going to take over and start running the ASCS process. And then, when the server that failed comes back up, ERS is going to automatically move from the currently active server to the one that failed, in order to maintain SAP best practices.

Hi, welcome to Let's See. Today we have with us Don, Solutions Architect at SIOS Technology, and today we are going to talk about configuring SAP S/4HANA for high availability. Of course, we'll talk about it, and we're also going to see a demo of it in action.

Thank you. I really appreciate you having me back.

Let's talk about some basics. It may not always be clear how to build a highly available SAP infrastructure. Can you tell us what options are available to users?

There are really three options for high availability in the SAP infrastructure: the Red Hat Enterprise Linux High Availability Add-On, the SUSE High Availability Extension, and the SIOS LifeKeeper Protection Suite. Both the Red Hat and SUSE options are integrations of Pacemaker, while SIOS is its own custom clustering software. The big SIOS advantage is simplicity and ease of use. Pacemaker is going to require you to build a lot of custom scripts to manage failover. With SIOS, we have wizards to configure the HA environment, which saves a great deal of time. SIOS Protection Suite is SAP certified, handles data replication at the application level for your ASCS and ERS volumes, and handles database re-registration, with manual or automatic switchback when a source comes back online. There are a ton of manual tasks when it comes to Pacemaker.

Now, it may look like building high availability and disaster recovery in SAP HANA environments can be complicated, can be complex. Is it really complex? And if it is, what leads to that complexity?

SAP HANA environments are incredibly complex, especially when you want to do HA and DR. They're so complex because you're dealing with multiple applications at different layers of the application stack. You've got your presentation layer, which is easily protected: your Windows servers. Then you've got the application layer in the middle, which is where ASCS and ERS reside, and those need to be running on different servers to follow SAP best practices. So if you have a failure of the ASCS node, ASCS needs to fail over to the ERS node, and ERS needs to move to another node. And you have to anticipate all the places where there can be a failure. You also have the integration with the database; you've got things like takeover with handshake for maintenance of the database. There's just an absolute ton going on in the SAP environment at any point in time, so when you have a failure, being able to automate the failover and minimize your RTO and your RPO is incredibly complex and difficult.

Can you share with us some tips, some strategies, on how to simplify your SAP HANA environment to ensure business continuity?

Yeah, so the best strategy is really to identify, plan, train, and test. You need to identify all the places where you could possibly have a failure, which could be environmental, human error, hardware failure, software failure, power failure. There are just a number of places where you can have a failure or a glitch.
You want to identify as many potential points of failure as you can. Then plan to make sure that there are no single points of failure. Then train all the people responsible for supporting that SAP HANA environment and keeping it up, working, and available. And then test. That's where many companies really struggle: being able to test the failover, test different types of failures within their system, and make sure that when a hurricane does come and take your data center out, you're ready for it, and your high availability and DR systems actually work the way you expect them to.

I also want to talk about the cultural aspect of high availability and disaster recovery in SAP HANA environments. Talk a bit about how SIOS tools help build that culture, and how the right culture can help teams get the most out of these tools.

Yeah, so the real challenge with providing high availability for SAP and HANA stems from the fact that SAP has done such an excellent job of providing you scripts, ways to monitor the database and the application, and functions that you can call to determine health and fail the system over. They've got all of these building blocks, but there's no real packaged way to implement all of those building blocks and functions that SAP has given you to easily create this high availability infrastructure, unless you custom-build scripts with Pacemaker, whether it's the SUSE or the Red Hat implementation. And then you've got to interpret and account for all of the SAP best practices. What SIOS has done is we've gone ahead and done all that work for you, so that you can easily implement this high availability infrastructure. When there are upgrades or changes to SAP, we've already handled any changes required in our software, where an upgrade to SAP or to Red Hat may change the Pacemaker implementation and what you have to do to make sure that you maintain high availability and keep those SAP best practices.

And now it's time for a demo. Before we start the demo, tell us what you are going to show us today, and then let's see it in action.

So today I'm going to show you two servers in AWS. One is running ASCS, the other is running ERS. We're going to inject a kernel failure into the ASCS server. It's going to go down. The secondary server is automatically going to take over and start running the ASCS process. And then, when the server that failed comes back up, ERS is going to automatically move from the currently active server to the one that failed, in order to maintain SAP best practices.

Perfect. Now it's time to see the demo.

Okay, so here we have the two servers in AWS, SAP Demo ABAP1 and SAP Demo ABAP2. And I'm going to go over here to the LifeKeeper GUI. Here you can see I've got a two-node cluster. There is a witness on these, but the witness is a storage witness using an S3 bucket, and that just prevents us from having any split-brain scenario. So if these two servers lost connectivity, the witness would ensure that they don't both run the same process at the same time.

So now, you can see the hierarchy here. At the top of the hierarchy, we have ASCS. We have the IP address, which will move; this is a virtual IP address using the SIOS IP Recovery Kit. Then we have the EC2 resource, which is basically an object that will change the route table, so that when the address moves, you can find the server in another availability zone.
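As an aside for readers: the route-table update Don describes here is the standard overlay-IP pattern for moving a virtual IP between availability zones on AWS. The sketch below shows the underlying API call with hypothetical resource IDs (the route table, the overlay CIDR used for the virtual IP, and the takeover node's network interface); in the demo, the SIOS EC2 resource in the hierarchy performs this automatically, not by hand.

```python
# Minimal sketch: repoint an overlay/virtual IP at the new active node by
# rewriting the VPC route table entry, as the EC2 resource in the hierarchy
# does on failover. All resource IDs below are hypothetical placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

ROUTE_TABLE_ID = "rtb-0123456789abcdef0"   # route table shared by both AZs (hypothetical)
VIRTUAL_IP_CIDR = "192.168.100.10/32"      # overlay address clients use to reach ASCS
NEW_ACTIVE_ENI = "eni-0123456789abcdef0"   # network interface of the takeover node

# Replace the existing route so traffic to the virtual IP now reaches
# the secondary server in the other availability zone.
ec2.replace_route(
    RouteTableId=ROUTE_TABLE_ID,
    DestinationCidrBlock=VIRTUAL_IP_CIDR,
    NetworkInterfaceId=NEW_ACTIVE_ENI,
)
```

Because the overlay address sits outside the VPC CIDR, pointing its /32 route at the active node's ENI is what lets clients in either availability zone keep reaching ASCS after a failover.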
And then we've got the ASCS volume, which is being replicated. So right now you can see the primary, or the source, is ABAP1 for ASCS, and the target is ABAP2. So this is a mirror. When we fail over, this mirror will be the first thing to come up. If any part of the hierarchy fails, the whole hierarchy will fail over to the secondary server. ERS is running on the secondary server; you can see here that the process is active. And again, the source of the ERS volume is now ABAP2, and that's being replicated over to server one. And you can see that they're synchronous mirrors, and they're fully replicated at this point in time.

So my next step here is to create a kernel panic on server one (a sketch of this failure injection follows the transcript). And you can see that that panic has caused everything to go down on server one. So right now, refresh here; it takes a minute, because we've got to miss the keepalives to recognize that it's gone down. And you can see that the data volume on server two has now become the source of the mirror for the ASCS volume. And I'm running this GUI on server two so that we didn't lose the GUI, as we would have if it were running on server one. The volume is showing out of sync to server one because it's down, but it'll reboot here shortly. And everything's up; the IP address is up. The only thing we're waiting for is the ASCS process to start. So that's started now; it's now active. And that quickly, we've failed server one and everything's up and running on server two.

Now we have to wait a minute for the server to reboot. It should be fairly quick. So right now it's coming back up. You can see here it's already back up in standby mode, but the replicated disk is out of sync, so it'll take a few minutes for that to sync up. And then there's a little bit of a wait period to make sure everything is healthy before the ERS process will start to move. So right now you can see that the ERS process has started the move. We've already moved the source of the ERS volume over to ABAP1. The volume's up, and the IP address has moved over for the ERS process. And now we're just starting the application. As we go up the hierarchy, everything starts at the bottom and goes up. So in just a minute or so... yep, there it goes. Now we're active on server one.

So by injecting that failure on server one, we've effectively failed ASCS over to ABAP2, and then ERS, once ABAP1 became available, automatically moved over. So we've reversed which process is running on each server. Here you'll see that the disk is waiting to resync as we reverse the mirror direction. It'll pause for a few minutes there. Now it's restarted the synchronization process. And that's it.

Really appreciate your time today. Thank you very much, Swapno.

Don, thank you so much for the excellent insights and a great demo there. I look forward to talking to you again soon.

Thank you.
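For anyone who wants to reproduce the failure injection from the demo on a test cluster, the usual way to crash a Linux node on demand is the magic SysRq trigger. This is a generic sketch, not SIOS tooling; it assumes root access on a disposable test system, because writing "c" panics the kernel immediately.

```python
# Minimal sketch: inject a kernel panic via the Linux magic SysRq trigger,
# simulating the hard node failure shown in the demo. Run as root on a
# TEST system only: the machine goes down the moment "c" is written.
from pathlib import Path

# Ensure the sysrq facility accepts the crash command (1 enables all functions).
Path("/proc/sys/kernel/sysrq").write_text("1")

# "c" asks the kernel to perform a crash/panic, taking the node down hard.
Path("/proc/sysrq-trigger").write_text("c")
```

The cluster then sees the panic exactly as it would a real hardware fault: keepalives stop, the surviving node declares its peer dead, and the ASCS hierarchy fails over.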