So hi, and welcome everyone to "Building an Antifragile, Highly Scalable System to Assure Business Resilience" by Ramya Murthy and Sai Subramaniam. We're glad you could join us today. And without further ado, over to you, Ramya.

Good evening and good morning, everybody. Glad to be part of this India light version of the Covertible Conference. I would like to start with a thank-you note to COVID, primarily because the back-to-back pandemic waves have created plenty of chaos, real-world chaos attacks, to test our personal resilience levels. I'm sure everybody would agree with that. Now stay tuned to hear from us how to build a resilient, highly scalable IT system that can help assure business resilience.

Let's get started with the session. Here's a quick outline of how we plan to spend the next 20 minutes. We will start with an introduction to set the context and set the stage before we dive into how we made IT resilience a reality for one of our global FinTech customers. We'll begin with the drivers for enterprise resilience. I'm sure everybody would agree that IT forms the backbone here: it plays a vital role in enabling businesses to meet the highest possible service levels, to recover applications or infrastructure quickly enough to prevent data or revenue losses, and, not least, to reduce the impact of unexpected disruptions.

Do you want to share your screen? It is already shared. Let me reshare it for you. Is it in the shared mode? It's shared now. Okay, sorry about that.

So, let me go back to the drivers, where we stopped. We talked about the need for IT resilience; it forms the backbone for any business to meet the highest service-level demands. Now let's quickly look at what IT resilience is and the factors that are emphasizing the need for it. First, the expectation that applications are always on; especially post-COVID, this has become the de facto expectation from our customers. Second, the digitalization initiatives we see across the market and across business domains, with cloud adoption and architecture modernization happening at a scale we have never seen before. And third, the rising cost of every hour of IT downtime. I'm sure everybody would agree: we are used to seeing businesses make headlines every now and then with devastating disruptions and large revenue losses caused by IT outages.

With this, we understand the importance of IT resilience and how it enables business resilience: by keeping systems always available for end customers, by improving the user experience, by fostering revenue growth for the business, and by enhancing brand value. This is how the importance of IT resilience is reflected in how it contributes to business resilience.

So, why does resilience matter, in a nutshell? Obviously, resilience increases system availability, which is the million-dollar expectation from every business now. It reduces IT outages, saving businesses from devastating disruptions, and it reduces MTTR: the mean time to repair, the time to fix a problem and bring the system back to its normal working state.
Resilience engineering also helps us validate our preparedness to make this happen and reduce MTTR. And not least, it thereby helps enhance the customer experience, build and enrich trust, and so helps businesses maintain their brand value.

With that context set on resilience, let me quickly spend a minute on how chaos engineering, as a lever, helps assure resilience. Taking the broader view first: the resilience engineering discipline implements best practices and brings strategies for building resilient, reliable applications holistically, across the end-to-end SDLC phases. That means thinking about it proactively right from the inception phase: is my architecture, is my design, built with resilience design patterns? Is my development team aware of the need for resilience, be it in testing, in releases, or, at the rightmost part of the life cycle, in how we operate the system? So resilience engineering is the broader perspective on making a system resilient, without missing the proactive management part of it.

Chaos engineering, on the other hand, is a subset within resilience engineering that focuses on only the testing part: test and observe the resilience mechanisms that have been implemented, understand whether they meet your expectations, and then focus on improvement measures. Resilience engineering is the more proactive way of looking at it, from the inception phase onward. If you skip that and focus only on chaos engineering, it will definitely still help, I am not saying it won't, but the rate at which you mature toward the highest expectations will be far lower than if you keep the bigger, holistic view that resilience engineering demands.

Now, with that said: building a resilient application wasn't this challenging before, so why is it becoming increasingly complex and difficult? Look back at the market forces transforming the IT landscape. We were used to monolithic architectures and on-prem deployments, but with everything moving to the cloud and to highly distributed architectures, the complexity has increased exponentially; that's the right word for it. With that increase in complexity comes the challenge for the technical team: how do I ensure the fault-tolerance characteristics are as expected, that applications are built with fault-tolerance capabilities validated by the right chaos engineering principles, so that IT resilience is no longer a dream but a possibility?

With that said, I think we have set the stage. Now let's quickly look at the case study. For the next 15 minutes, we are going to talk about how we were able to transform delivery culture, building "failure as a culture," for a wealth management platform that we developed for a US-based global FinTech leader. We will cover how we achieved what, at the start, looked like a dream.
We'll walk through the best practices that helped us achieve the 99.99% high-availability SLO that we committed to our client, in spite of implementing a highly complex distributed architecture with containerized technologies; this was made possible by holistic resilience engineering principles.

So here we go. We will talk through the topics covered in this quick representation. Let me start with the engagement background, so you understand the problem statement and the pain points we had at hand, before we dive into the solution.

The scope of the engagement was legacy modernization. The client was running a monolithic wealth management platform, which was expected to be transformed to operate on a multi-cloud, distributed architecture, obviously using cloud-native principles, microservices, containerization, and whatnot. The customer's expectations were very high, primarily because this platform had a variety of 100-plus applications running under it, was known for failures, with a large number of incidents coming in day in and day out, and had a history of problems managing and sustaining operations in production.

Because of that experience, the client was very particular about four key pain points. First: when we modernize the architecture and make it cloud-ready, how are we going to make this platform fault tolerant enough to adapt or auto-heal in case there is a failure, at any level, network, infrastructure, you name it? They had bad experiences with such failures in the past, so we were very particular about addressing that. Second: with 300-odd microservices running, how do we bring the ability to quickly nail down and isolate a problem when failures are reported, given the history of problems and given that a distributed architecture was the expected implementation? Third: with a hybrid infrastructure deployment across multiple clouds and on-prem, how do we bring end-to-end visibility to connect the dots and see where a failure originates from a full-stack perspective? And last but not least: once the application is ready, how do we sustain and meet the commitments we gave our customer, and how can we monitor whether we are overrunning those commitments or continuing to meet expectations, thereby tracking whether we are keeping the customer happy and enhancing customer happiness day by day?

How do we do all of this? That was the million-dollar question put forth by the client. Against this background, we proposed resilience engineering as a delivery culture, revolving around three key areas: bringing shift-left resilience engineering practices; bringing a full-stack continuous observability platform that gives good visibility into what is happening internally, so we can do root cause analysis quickly, fix problems, and make things work;
and third, bringing robust site reliability engineering practices that stitch the dots from the shift-right side. Combining that with the proactive, shift-left resilience engineering focus on how we develop the system, not just how we sustain it, is what let us make resilience engineering a delivery culture.

Okay, so now, double-clicking on these three aspects: we created a resilience engineering COE, which brought a lot of governance into place. It defined the operating model and what the maturity journey would look like, how we transition between the different maturity levels; it is not that we could operate at the expected high-availability SLO from day one or month one, so we had to lay out and plan how we would mature over time. It defined how we bring shift-left interventions as part of the pipeline, and how we mature SRE by bringing in its many functions. As we know, SRE is not one single function, and it is not just about SLO monitoring; it covers multiple service areas, such as performance and capacity management and monitoring.

The COE also covered a governance layer that went beyond the connection points, diving into the people, process, and technique aspects. It focused on bringing methods, techniques, processes, and SOPs, standard operating procedures, for each category of application. That was more challenging than any other task, because of the variety of applications we had: we couldn't bucket them into two or three categories. The applications varied widely in scale and business criticality, in their test data management strategies, and in the kinds of failure attacks required. So we built an SOP for each category for onboarding an app of that category. We also looked at how to enhance engineering capabilities with reusable assets, be it at the process level, in defined methods, or in automation scripts built as part of toil reduction, and at bringing in the right quality gates.

The governance layer did us one more good service: we had one governance team for the shift-left part and another for shift-right, and the layer smartly balanced the two so they did not operate in silos and connected well with each other. The COE also had to work through the tooling aspect, what ecosystem of tools should be in place to make observability a possibility, and come to a conclusion.

And last, the people perspective. Resilience engineering is not just about bringing in a performance engineering team, chaos engineering folks, or SRE folks.
So, it was not just about bringing in a new set of roles to validate or apply chaos engineering or SRE principles; we had to define a lot of collaboration, and that was a much, much bigger initiative to focus on, for the primary reason that resilience engineering, as we discussed, is not an independent, siloed activity: it takes a holistic view, connecting the dots. The collaboration protocols we defined helped us bring in certain interventions for how to collaborate with the existing personas, who were distributed throughout the SDLC, and how to make the resilience engineering COE team effective, not just as governance and not just as a bottom-up team embedded in the scrum teams. In a nutshell: at the different layers, how do we define collaboration protocols to interact with the existing personas, and beyond that, how do we transform the skills and responsibilities of those existing personas? Resilience engineering demands not just a new set of roles and responsibilities but also skill transformation of existing people.

Just a quick time check: two more minutes left. Sure, thanks.

With that said, what is not to be taken for granted is that, beyond the technical expectations of putting processes and technical solutions in place, we had much bigger cultural challenges than we had envisioned. Primarily, they revolved around making failure a culture: bringing chaos in as part of BAU was a bigger challenge than the technical ones. We found the people transformation, be it skill transformation activities, changes to the roles and responsibilities of existing personas, or, more than that, getting people out of their comfort zone to treat failure as a norm, to be a much harder task than the technical challenges we took up. With that, I think we have set the context on our approach and the problem at hand, and I'll hand it over to Sai, who will do a deep dive on the how and the methodology that worked for us.

Thank you, Ramya. So what we understand now is that resiliency is not just an afterthought, right? How do we build resilient applications by design, first time right? This is where we need a smart balancing of the shift-left and shift-right principles. The shift-left principles essentially revolve around bringing resilience engineering practices into your application development life cycle. In the inception phase, you define your NFR requirements as well as your SLAs and SLOs, and you look at your production analytics and production logs to find out which critical business processes need to be tested for resilience. From an architecture and design standpoint, we had our resilience engineering COE collaborate closely with the enterprise IT architect team to review the architecture: find out whether there are any single points of failure, what our application redundancy strategy is, what our data replication strategy is, and whether auto-scaling can be provisioned for handling unexpected user traffic spikes, and so on.
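To make the SLO talk from the inception phase concrete, here is a minimal sketch of the error-budget arithmetic behind an uptime SLO such as the 99.99% target discussed in this session. The figures, window, and function names are illustrative assumptions, not details from the engagement:

```python
# Rough error-budget math for an availability SLO (illustrative values only).
SLO_TARGET = 0.9999  # the 99.99% uptime target discussed in the talk

def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes over a rolling window for a given SLO."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo)

def budget_remaining(slo: float, observed_downtime_min: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative = SLO breached)."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - observed_downtime_min) / budget

if __name__ == "__main__":
    # A 99.99% SLO leaves only ~4.3 minutes of downtime per 30 days.
    print(f"Budget: {error_budget_minutes(SLO_TARGET):.1f} min / 30 days")
    print(f"Remaining after 2 min down: {budget_remaining(SLO_TARGET, 2.0):.0%}")
```

This is also why SREs track the budget rather than raw uptime: a team that has burned most of its 4.3 minutes can slow feature releases before the SLO is actually breached.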
From the early sprints through the hardening sprints: in the early sprints, we ran early chaos engineering tests with a limited blast radius, and we did the observability piece as well, correlating with the application business transactions. In the hardening sprints, we did our end-to-end performance engineering and full-fledged chaos engineering tests. This is where we used a high blast radius, bringing down a pod or bringing down a VM, watching how the application behaved, and taking that as an input for our performance and capacity planning. Quickly on the shift-right side: we brought in site reliability engineers who acted as a bridge between the operations and engineering teams, balanced the velocity of application feature releases against the reliability aspect, and monitored the key SLIs and SLOs, where we aimed for the 99.99% uptime SLO.

From a chaos engineering standpoint, we approached it in three phases. The first was building the hypothesis: we looked at the current stable-system baseline, the steady-state behavior, and then defined the SLIs for hypothesis validation, basically, how the system is expected to meet its SLAs when you run the chaos engineering test. We also finalized our failure attacks and blast radius. The failure attacks we ran at four levels: infrastructure, code, network, and Kubernetes. For blast radius, we always recommend, as a best practice, starting with a low blast radius and increasing to a high blast radius as your system hardens. Validating the hypothesis means running the actual failure attacks on the target environments and monitoring via the observability platform. And remedy means comparing the stable-system behavior against the system behavior during the chaos injection and identifying any action items for remediation: it could be increasing infrastructure capacity or provisioning additional microservice containers. We take those learnings back to the build-hypothesis phase and rerun the cycle with a new hypothesis (a rough code sketch of this loop follows below).

On tools and platforms: we understand that one tool does not fit all purposes. For example, we had siloed chaos engineering tools, APM and infrastructure monitoring tools, and a log analytics tool. But given the complexity of the engagement, about 3,000 microservices in a highly distributed architecture deployed on multiple clouds, what we required was a unified platform: one with the capability to carry out the various chaos engineering attacks and integrate them into our continuous delivery pipeline, plus a powerful observability platform that can not only do infrastructure and application business-transaction monitoring but also predict capacity, for example by looking at historical infrastructure utilization data and predicting likely infrastructure anomalies, which can drastically cut down production incidents, and correlate those infrastructure anomalies with application logs. These were some of the key components of the observability platform that we built. So this is our LTI platform-led approach.
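As a rough sketch of the three-phase chaos loop described above (build hypothesis, validate, remedy), here is what one iteration might look like in code. The injection and metric-query calls are stand-ins for whatever chaos and observability tooling is in place, not a real API, and the thresholds are invented for illustration:

```python
import random  # stand-in randomness for the placeholder SLI reads below

# --- Phase 1: build hypothesis ------------------------------------------
# Steady-state SLI thresholds taken from the stable-system baseline.
HYPOTHESIS = {"p99_latency_ms": 800, "error_rate": 0.01}
ATTACK = {"level": "kubernetes",      # infra | code | network | kubernetes
          "action": "kill_pod",
          "blast_radius": 0.05}       # start small, widen as system hardens

def inject_failure(attack: dict) -> None:
    """Placeholder: trigger the attack via your chaos tool of choice."""
    print(f"Injecting {attack['action']} at {attack['blast_radius']:.0%} radius")

def read_slis() -> dict:
    """Placeholder: query the observability platform during the attack."""
    return {"p99_latency_ms": random.uniform(200, 1200),
            "error_rate": random.uniform(0.0, 0.03)}

# --- Phase 2: validate hypothesis ---------------------------------------
inject_failure(ATTACK)
observed = read_slis()
violations = {k: v for k, v in observed.items() if v > HYPOTHESIS[k]}

# --- Phase 3: remedy -----------------------------------------------------
if violations:
    # e.g. add capacity or replicas, then rerun with an updated hypothesis
    print("Hypothesis failed; remediation needed:", violations)
else:
    print("Steady state held; consider widening the blast radius")
```

Wired into a continuous delivery pipeline, a non-empty `violations` dict would be the signal to fail the stage rather than promote the build.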
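The talk does not describe the predictive models behind the anomaly-prediction capability, so purely as an assumed illustration of the general idea, one simple way to flag likely infrastructure anomalies from historical utilization data is a rolling z-score over a metric series:

```python
from statistics import mean, stdev

def anomaly_flags(series, window=12, threshold=3.0):
    """Flag points whose z-score vs. the trailing window exceeds the threshold."""
    flags = []
    for i in range(window, len(series)):
        hist = series[i - window:i]
        mu, sigma = mean(hist), stdev(hist)
        z = (series[i] - mu) / sigma if sigma > 0 else 0.0
        flags.append((i, z > threshold))
    return flags

# Hourly CPU% readings with a suspicious spike at the end (made-up data).
cpu = [41, 43, 40, 42, 44, 41, 43, 42, 40, 44, 43, 42, 41, 95]
print([i for i, is_anomaly in anomaly_flags(cpu) if is_anomaly])  # -> [13]
```

A production platform would use far richer models, but even this shape of signal, fed into capacity planning, shows how anomalies can be caught before they become incidents.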
Coming back to the platform: if you look at it, there were three major components. Chaos engineering, for injecting the attacks, both on-prem and in the cloud; observability, at both the infra layer and the application business-transaction layer; and predictive analytics, for predicting infrastructure anomalies and feeding that as an input into our production capacity planning. These were integrated with our continuous integration pipelines and with the various other channels like Slack, Teams, and so on.

Hi, sorry to interrupt. I think we're going to have to stop the session soon. Yeah, so last 30 seconds, Maulika, I think we are done. From a benefits standpoint: we were able to cut down production incidents by over 60%, primarily because of the predictive capabilities we gained from our AI/ML platform; we saw a 90% reduction in our mean time to detect and mean time to recover, primarily because of the correlation capabilities we gained; and we were able to achieve the 99.99% SLO, which led to zero business disruptions. In fact, we haven't had any outages in the application in production for the last 10 months, obviously improving service availability and durability for a better user experience. With this, we are open for any questions.

Yeah, so I just want to thank you, Ramya and Sai, so much for the session, and for sharing your experience with us.