Hello everyone, my name is Rahul Miglani. I have over 15 years of experience in DevOps, SRE and cloud, with a primary focus on build, release and delivery automation with high availability in mind. My core responsibility across my career has involved setting the overall delivery automation strategy via investments in automation at several layers of the technology stack, to implement a continuous deployment or delivery pipeline which is highly scalable and highly available. I have experience in building SaaS, IaaS and PaaS solutions. I have worked on Puppet, MCollective, ActiveMQ, Jira, Confluence, Bitbucket, Artifactory, Docker, Kubernetes, AccuRev and Ansible, and on continuous integration using Jenkins or Bamboo along with Argo CD. I have worked on .NET, Android, Java and Scala projects. I have experience using DevOps tools such as Docker, Puppet, Chef and Ansible, plus CloudFormation, OpsWorks and Elastic Beanstalk, along with good exposure to cloud technologies and tools. I have vast experience in AWS, Oracle Cloud and GCP, and I am certified in AWS, GCP and Oracle Cloud Infrastructure. I have proven skills as a programmer and as a DevOps engineer as a whole. Overall, I have also done project management of SRE and DevOps cloud projects, with experience in RFP, RFQ and RFI documentation, Agile, Scrum and Kanban, and obviously building go-to-market strategies and architecting cloud development with in-depth business consulting and product management.
Most recently, I have been working on the observability tech stack: monitoring, logging and tracing with OpenTracing; monitoring and logging with Splunk, Grafana and Prometheus; and creating custom dashboards with the ELK and EFK stacks. I have been creating complex architectures with SecOps and FinOps in mind, managing Kubernetes and cloud cost with tools like Kubecost and Infracost, keeping security in mind with vulnerability testing using tools like Snyk, and getting deeper into the broader FinOps and SecOps areas. I also have hands-on experience managing DevOps audits, CI/CD maturity audits and Kubernetes maturity audits for complex architectures in hybrid or multi-cloud environments. Coming to this course, I welcome you all, and I hope you will enjoy it and have fruitful learning. If you have any questions regarding the course, you can reach me via any of the social media channels or send a message through the course platform as well. So welcome once again, and happy learning. The topic for today is DevOps Incident Management, one of the most important aspects as we grow from ITIL to DevOps to SRE and make sure our infrastructure is available 24x7 with as little downtime as possible. So sit back and relax, and let's begin with the session. Moving on to the content of this session: we will talk about what DevOps incident management is, then look in detail at its components, how ITIL incident management has grown into SRE and DevOps incident management, and how we can learn and recommend to our dev teams how to proceed in case of any incidents and how to work on our alerts. We'll also see in detail the benefits of our incident management processes and how to make them meaningful even when we have millions of alerts.
Moving on, we'll look at the flow of our incident management process and how it should be made blame-free rather than pinning blame on any team. At the end of the session, we'll look at the best practices of DevOps incident management and see how we can make the whole process meaningful and helpful to all the teams at stake. Now, these are some of the incident management tools. You can integrate them with tools like PagerDuty or any other tool, and with that you can aggregate all application, infrastructure and network monitoring data in one place to initiate and execute any necessary response. You can send alerts via phone call, SMS, push notification or email to the correct subject-matter expert. You can respond faster to alerts and give incident response teams more time to mitigate and reduce customer impact. There are multiple integrations available, and there are extensible APIs with custom logic. For example, with Nagios you can write your own plugin; with Splunk you can customize the integration; and you can also have Slack alerts for all the issues you face. So depending on the type of incidents you're working on, you can use any of these stacks. For example, AWS has its own monitoring and logging tools, and you have New Relic. ServiceNow was mainly used as a ticketing tool, but people have now understood its importance in the incident management tech stack as well. Similarly, Datadog sends you alerts, and you can customize those alerts too. All in all, these tools can help you mitigate the incidents and problems you might face in your delivery processes. First of all, what is incident management exactly? It is the process used by DevOps and software development teams to respond to an unplanned event or service interruption and restore the service to its operational state.
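To make the aggregation-and-routing idea concrete, here is a minimal sketch in Python. The webhook URL, service names and severity labels are all hypothetical assumptions for illustration, not any specific vendor's API; it only shows normalizing an alert payload and posting it to a Slack-style incoming webhook.

```python
import json
from urllib import request

def format_alert(service, severity, message):
    """Build a normalized alert payload that an aggregator
    (PagerDuty, Slack webhook, etc.) could consume."""
    return {
        "service": service,
        "severity": severity,  # e.g. "critical", "warning", "info"
        "text": f"[{severity.upper()}] {service}: {message}",
    }

def send_to_slack(webhook_url, alert):
    """POST the alert text to a Slack-style incoming webhook.
    The URL is assumed; this performs a real network call."""
    body = json.dumps({"text": alert["text"]}).encode("utf-8")
    req = request.Request(webhook_url, data=body,
                          headers={"Content-Type": "application/json"})
    return request.urlopen(req)

alert = format_alert("checkout-api", "critical", "p99 latency above 2s")
print(alert["text"])  # [CRITICAL] checkout-api: p99 latency above 2s
```

In practice the same normalized payload could be fanned out to phone, SMS, push or email channels, which is exactly the routing described above.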
Incident management refers to a set of practices, processes, workflows, runbooks and solutions that enable teams to detect, investigate and respond to incidents. It is a crucial element for businesses of all sizes and an important requirement for meeting most data compliance standards. Incident management processes ensure that your IT teams can quickly address vulnerabilities and issues. A faster response helps reduce the overall impact of an incident; it mitigates damage and ensures that systems and services continue to operate as planned. If you look at the ideal incident management process, it starts with incident identification, categorization and prioritization. Incidents are identified through user reports, solution analysis or manual identification. Moving on to incident notification and escalation: incident alerting takes place, although the timing may vary according to how incidents are identified or categorized. The main idea is to have incident alerts managed automatically, and from there we move through investigation, diagnosis, resolution, recovery and, at the end, closure. If you have to define incident management, it has to include easy access to the report of a particular incident, an effective communication strategy, an automated notification system, an automated alerting system, and alerts for, let us say, ticket updates, user replies and status updates, which are the DevOps and SRE essentials. If you do that, you will end up with prevention of incidents, reduction or elimination of downtime, improved mean time to resolution, improved customer experience, increased data fidelity and probably improved productivity as well. There are many tools; this webinar is not inclined towards tools, but you have probably already worked on Atlassian tools, Opsgenie, ServiceNow or the legacy Remedy.
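The lifecycle just described (identification through closure) can be sketched as a tiny state machine. The exact stage names and their order here are an illustrative assumption condensed from the text, not a fixed standard.

```python
# Assumed lifecycle stages, condensed from the process described above.
LIFECYCLE = [
    "identified", "categorized", "prioritized", "notified",
    "investigating", "resolved", "recovered", "closed",
]

class Incident:
    def __init__(self, title):
        self.title = title
        self.state = "identified"  # every incident starts here

    def advance(self):
        """Move the incident to the next lifecycle stage."""
        i = LIFECYCLE.index(self.state)
        if i == len(LIFECYCLE) - 1:
            raise ValueError("incident already closed")
        self.state = LIFECYCLE[i + 1]
        return self.state

inc = Incident("checkout latency spike")
inc.advance()   # -> "categorized"
inc.advance()   # -> "prioritized"
print(inc.state)
```

Forcing transitions through a single method is one simple way to guarantee no stage (for example the post-resolution recovery check) gets skipped.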
So we are not going to talk about tools but about the underlying framework, the underlying phenomena of why it is important, because incident management is one of the most critical processes a software development team has to get right. Service outages can be very costly, dramatically more so than the actual infrastructure cost. So let us move on and look in detail at why incident management is important and what incident management activities we need to perform as DevOps and SRE professionals. Moving on to common incident management activities: what are the activities that a typical SRE who is handling incidents will do? First of all, detecting and recording the incident details. Let us say you have a tool like Remedy or ServiceNow which does the detection and the logging of the ticket, but does it have the necessary details everyone requires to understand the situation? Our first responsibility would be to record the incident details, and then ask what our known error database looks like. For example, in Remedy or ServiceNow we have the matching incidents: what action was performed in the past, and can we do the same and resolve this incident? That would be the main activity here. At the end of the day, we will be resolving incidents as quickly as possible. We will end up prioritizing incidents in terms of impact and urgency depending on our past experience with matching incidents, and we will probably escalate incidents to other teams to ensure timely resolution. But the most important part here is our recommendation: what have we done in the past so that this incident doesn't occur again? For example, point number two states that we will find out in our known error database, maybe on Confluence or maybe in Remedy or ServiceNow itself, what the matching incidents were.
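A known error database lookup can be sketched very simply. The entries and keyword-matching scheme below are made-up assumptions standing in for what a Remedy or ServiceNow matching-incident query would return.

```python
# Hypothetical known error database (KEDB): symptom keywords from past
# incidents mapped to the action that resolved them.
KNOWN_ERRORS = [
    {"keywords": {"disk", "full"},
     "resolution": "rotate logs and extend the volume"},
    {"keywords": {"oom", "killed"},
     "resolution": "raise the container memory limit"},
    {"keywords": {"certificate", "expired"},
     "resolution": "renew the TLS certificate"},
]

def match_known_error(description):
    """Return the past resolution whose keywords best overlap the new
    incident description, or None if nothing matches."""
    words = set(description.lower().split())
    best, best_score = None, 0
    for entry in KNOWN_ERRORS:
        score = len(entry["keywords"] & words)
        if score > best_score:
            best, best_score = entry["resolution"], score
    return best

print(match_known_error("pod was oom killed during deploy"))
```

Real tools use far richer matching, but even this toy version captures the core activity: resolve from past experience first, then record a recommendation so the incident stops recurring.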
Our recommendation is what ensures this particular type of incident doesn't occur in future. It could be, for example, a recommendation to increase, decrease or update infrastructure, or to update a process, or to delete a process which doesn't make sense, so that the incident doesn't recur. More than the detection, recording and matching of incidents against known problems, it is the recommendation the SRE makes that matters. For example, if you have faced an incident 10 times in the past, you might end up recommending an automation which makes sure that this type of incident doesn't occur in future. That then becomes the main activity of an incident management SRE. Now, if you look at the process and its implementation, we'll find certain benefits, because it is important for any IT department to have a plan for managing incidents. After all, no matter how good you are at predicting events, incidents can still happen. So let us look at the benefits of the implementation. Let us say you are trying to maintain SLAs: when it comes to meeting SLAs, avoiding incidents is ideal, so the best-case scenario of incident management is no incident at all. We can use a risk mitigation solution to help with that, but we still need a plan in place to keep our services up and running, and just in case an incident happens, we need to be ready. We can use incident management to outline how we'll ideally deal with incidents and resolve them as quickly as possible. Moving on, how do you meet service availability requirements? Our business obviously cannot afford downtime, so it is important that we are able to consistently meet service availability requirements, and in this regard our incident management processes can make it easier.
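The SLA-and-availability point above has simple arithmetic behind it, sketched here with assumed numbers: a 99.9% monthly target and a made-up 50 minutes of downtime.

```python
# Sketch: check measured availability against an assumed SLA target.
def availability(total_minutes, downtime_minutes):
    """Availability as a percentage of the measurement window."""
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes

# A 99.9% monthly SLA allows roughly 43.2 minutes of downtime:
# 30 days * 24 h * 60 min * 0.001.
MONTH_MINUTES = 30 * 24 * 60              # 43200
budget = MONTH_MINUTES * (1 - 0.999)      # ~43.2 error-budget minutes

measured = availability(MONTH_MINUTES, downtime_minutes=50)
print(f"{measured:.3f}% available, budget left: {budget - 50:.1f} min")
```

Tracking the remaining error budget like this is what turns "we cannot afford downtime" into a concrete number the incident management process can be measured against.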
We can use these processes to define how we'll detect incidents, and remember, time is of the essence, so consider a performance management solution to help us keep a finger on the pulse of our activity. Now, if we are trying to increase our staff efficiency and productivity, our incident management will help us achieve that, but our staff should also have the tools. For example, they should have something like Remedy or ServiceNow so they can mature the incident management infrastructure at our end; so consider using performance and capacity management software. Again, when you bring in a software tool, make sure that the return on investment against the cost of the software makes sense for the project as a whole. At the end of it, obviously, the customer is king and user satisfaction is of utmost importance. Your customers, and your employees for that matter, don't want to be disrupted by incidents. Practicing incident management processes can help you improve their satisfaction too. You'll probably need to do a dry run, wherein you hold a disaster recovery trial at your end and see how you use your capacity planning software and your incident management in case of disasters. That is how you can do a demo practice of your incident management process, and at the end of it you'll end up maturing your incident management processes as a whole. So, starting with maintaining the SLAs and going on to improving user satisfaction, incident management is the key. Now let's have a look at the overall process which should be followed in DevOps incident management. The DevOps approach to managing incidents isn't radically different from the traditional ITIL steps we already know. In fact, DevOps incident management adds an explicit emphasis on involving developer teams from the beginning, including opening up a bridge, setting up on-call and pager duties, assigning work based on expertise, and so on.
Moving on to the first step of the flow, detection. Instead of hoping incidents will never happen, DevOps and SRE teams place a high value on preparedness. They work collaboratively to plan their responses to potential incidents by identifying weaknesses in the infrastructure and systems. They set up monitoring tools, alerting systems and runbooks that help each member know who to contact, what to do, what process to follow, whom to call, when to open a bridge, and so on. Whenever an incident occurs, they open the runbook and see what to do in that particular set of circumstances. This leads to response: rather than having a single on-call engineer responsible for responding to all incidents in an on-call shift, DevOps incident management teams designate multiple team members to be available for escalations. If the designated on-call engineer cannot resolve an incident independently, there is a runbook ready to act as a guide, and the on-call engineer can bring in the right people to assess the impact and severity level of the problem and escalate it to the right responder. It might also include opening up a bridge to have developers on the team, or a problem manager on the call, and so on. Moving on to resolution: when it comes to responding to an incident, DevOps incident management teams can often get to resolution very quickly by applying a hotfix. This is because, as a whole, we are more familiar with the application or the system code, because we wrote it ourselves. With the benefit of advance preparation and good communication systems, together we can reach resolution faster than a third-party response team could. So making sure the right set of people is available to look at the right set of incidents is actually an important part of resolution.
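The escalation logic described here can be sketched as a small routing function. The roster names, expertise domains and severity convention (sev1 most severe) are all hypothetical assumptions, not anyone's real schedule.

```python
# Hypothetical on-call roster and expertise map.
ONCALL = {"primary": "alice"}
EXPERTS = {"database": "bob", "network": "carol", "application": "dave"}

def responders(severity, domain):
    """Return who should be paged: the primary on-call always, plus the
    domain expert (and an open bridge) for sev1/sev2 incidents."""
    page = [ONCALL["primary"]]
    if severity <= 2:                        # sev1/sev2: escalate
        page.append(EXPERTS.get(domain, "duty-manager"))
        open_bridge = True
    else:                                    # sev3+: on-call handles alone
        open_bridge = False
    return page, open_bridge

print(responders(1, "database"))   # (['alice', 'bob'], True)
print(responders(3, "network"))    # (['alice'], False)
```

Encoding the escalation path in the runbook or in tooling like this is what lets the on-call engineer "hit the ground running" instead of deciding whom to call under pressure.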
When we do a post-mortem, we reach the next step in the flow, which is analysis. This post-mortem has to be blameless. DevOps incident management teams close out an incident with a blameless post-mortem process: they come together, share the information, the metrics and the lessons learned, with the goal of continuously improving the resilience of our systems. As we do that, we make sure future incidents are resolved more quickly and efficiently, and even if there are multiple teams involved, the analysis has to remain blameless. At the end of it comes the readiness of the system and the infrastructure. Once the incident is resolved, the remediation steps have been completed and the system is restored, DevOps incident management teams take a step back to assess their readiness for the next incident. They take what they have learned in the post-mortem process and update their runbooks accordingly, and the DevOps focus on continuous improvement applies to the people and the team, not just to the infrastructure and technology itself. So after an incident, each team member is better prepared for the next one. As long as we follow this flow of incident management, we arrive at the best practices which are effective for a DevOps incident management team. In the previous flow we learned what our actions are as DevOps incident management SREs. Adopting a DevOps approach to incident response can lead to improved communication between your various teams, be it the dev team or the ops team, a faster incident response and remediation, and a more resilient system.
To achieve that, the first best practice would be to automate processes and workflows. We need to integrate our service desk, monitoring, ticketing, asset management, CMDB and even chat tools, and streamline these incident-related teams as well as our alerts and workflows, so that the right people get the right notification at the right time, and the correct set of people is working on a particular set of incidents. And when they do, they have with them the information they need to get started on a resolution. We need to set up runbooks, I am repeating this, with predefined workflows so that people can hit the ground running when an incident hits. Another best practice is communication between the teams. We need to ensure the members of our teams can communicate across the organization with real-time chat tools, and use tools that create a record of the incident so anyone can jump in at any time and get up to speed on what happened, what's going to happen and what the current state of the incident is. It can be any of your ticketing tools, be it Remedy or ServiceNow, or any of your Slack channels. And again, as we saw in our flowchart, we need to use the blameless approach: after we have resolved the incident, we come together as a team to review what happened in a blameless post-mortem. We should avoid finger-pointing and focus on sharing information: what went wrong, what could have been prevented, and what we could have done better. That helps everyone do their job better and contributes to a more reliable system. Another best practice is to identify and focus on the business bottom line. That is very important, because DevOps incident response is more than a means to better communication; it is a culture that ensures developers and operations are working together to deliver real business value. We track metrics such as mean time to detection, mean time to repair and mean time between failures to understand our team's rate of improvement; we'll probably talk about them in the later slides. Moving on, we need to utilize on-call tooling to position developers and sysadmins as site reliability engineers, to be precise. On a DevOps team, the lines between a developer and a sysadmin start to blur, and those responding to incidents often become SREs, but individuals will still have specialized knowledge in either the application code or the infrastructure as code, for example Ansible and Terraform. So we need to set up our on-call schedule to ensure we've got the right mix of expertise available to respond to the right set of incidents. Setting up an on-call schedule is tricky, but if we communicate properly what set of expertise is available, I think we'll do a better job, because we understand the system better. That brings us close to the end of the session, but as a recommendation: we need to measure everything. We don't need to send alerts for everything, obviously, because we only need to send meaningful alerts, but we need to measure at least everything. What is the deployment frequency? It will indicate our readiness and our preparedness. What is the change volume, and how many hotfixes have gone through? Or, let us say, what is the lead time from dev to deployment, so that our testing can be ready on time, and our multiple environments can be test-ready on time, so that we avoid dependencies and do not encounter incidents where there is a dependency management problem because an environment wasn't ready on time. Again, we need to see the percentage of failed deployments, more importantly what the reasons for them were, and most importantly, what were our
recommendations on those. Our mean time to recovery also depends on our incidents, because we need to see how many incidents of a particular type are occurring, recurring or have occurred in the past, and what our recommendations or action items on them are. Customer ticket volume, or change in user volume, will probably come under usability and accessibility, but again we need to see what type of tickets and issues are coming in. Are they coming in because there is a change in user volume, or because there is a change in infrastructure, or because it is a particular time of the year where we could forecast more tickets because there is more activity on the system? If we track points 6 and 7, we'll probably have greater control over the high availability of the system, because then we will be able to forecast. Let us say at a particular time of year, or at a particular point in our product life cycle, there is a need for high availability or for a larger set of infrastructure, because our performance or our response time to the end user needs to stay at a particular value. Again, I'll say: measure everything, but don't alert on everything. If we send out 5,000 alerts a day, we'll probably end up ignoring all of them. But let us say 5,000 alerts are generated and we have an automation in place which sends out only, say, 2,000 alerts which are meaningful, and the alert itself states which alerts need action and which are FYI, just for the information of the SREs. So you need to make sure that your alerts are intelligent, that they trace back to where the incident occurred, and that they are intelligent enough to point you towards the recommended solution. And obviously, when you have a known error database and you end up looking at multiple alerts on the same set of problems, you already have a recommendation in place so that the same set of incidents does not recur. I hope you'll keep these things in mind when you set up your DevOps infrastructure and monitor it. In the previous webinars we've already spoken about observability at scale, but the ground rules still remain the same. I hope you have learned something today and will go back and make sure that your alerts and incidents are meaningful, and that you have your runbooks in place. Well, this brings us to the end of the current session; I would like to thank you all for participating. And that brings us to the end of the course. I hope you liked the course and had fruitful learning. If you have any suggestions, queries or questions, do reach out to us by email or via any of the social media handles. Thank you very much once again, happy learning, bye.
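As a closing appendix to the measurement discussion above (mean time to detection, mean time to repair, mean time between failures), here is a minimal sketch of how those metrics fall out of incident timestamps. The incident records are made-up sample data.

```python
from datetime import datetime

# Made-up incident records with start, detection and resolution times.
incidents = [
    {"started": datetime(2024, 1, 1, 10, 0),
     "detected": datetime(2024, 1, 1, 10, 5),
     "resolved": datetime(2024, 1, 1, 11, 0)},
    {"started": datetime(2024, 1, 3, 9, 0),
     "detected": datetime(2024, 1, 3, 9, 15),
     "resolved": datetime(2024, 1, 3, 9, 45)},
]

def mean_minutes(deltas):
    """Average a list of timedeltas, expressed in minutes."""
    return sum(d.total_seconds() for d in deltas) / len(deltas) / 60

mttd = mean_minutes([i["detected"] - i["started"] for i in incidents])
mttr = mean_minutes([i["resolved"] - i["detected"] for i in incidents])
# MTBF here: time between the starts of consecutive incidents.
mtbf = mean_minutes([b["started"] - a["started"]
                     for a, b in zip(incidents, incidents[1:])])

print(f"MTTD {mttd} min, MTTR {mttr} min, MTBF {mtbf / 60} h")
```

Watching these three numbers trend over time is the simplest way to verify that the recommendations and runbook updates discussed in the session are actually improving the team.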