Hi, this is your host, Swapnil Bhartiya, and welcome to another episode of TFiR: Let's See Your Demo. Today we have with us Drew Stokes, Head of Engineering at Jeli. Drew, it's great to have you on the show.

I'm excited to be here.

Today's discussion is going to be interesting because you folks had some incidents internally, and you ended up using Jeli yourselves. A lot of the things we have talked about with Jeli came up again, including some of the practices we discussed earlier, so I want to talk about all of those. But before we get started, just remind our viewers: what is Jeli all about, and how are you folks different from typical incident response companies or solutions? ("Traditional" is not the right word there.)

Happy to talk through that. Jeli is an end-to-end incident management platform. What we're trying to do is provide products that help our customers respond effectively to their incidents, and then also learn how to adjust their organizational processes and teams in order to be more effective in building the products their customers need and delivering the reliability that customers expect.

For those who saw the previous discussions with Jeli, one recurring theme was that it's really important to also focus on what happens after an incident; that plays a big role. So talk about this focus on post-incident analysis, its importance, and how it's different from other approaches.
It's really easy to imagine incident management as being primarily about identifying the issue, fixing the issue, and then moving on. We've come to recognize at Jeli that effective incident management requires going beyond technology and the contributing factors for a specific issue, and looking at how to bridge the gaps in understanding, priorities, and context between people in an organization. There's this sociotechnical aspect of managing software. Nora Jones, our CEO and founder, started Jeli to help organizations recognize those gaps, and brought a bunch of folks onto the team who have experience navigating these complex systems in these complex organizations. They know how this stuff works, so we can bake that knowledge into the product and help fill those gaps on behalf of our customers. For us, post-incident analysis is one of the most important things organizations can do, because understanding the way the organization operates and the way these systems are maintained can help you make changes that reduce turnover, help your bottom line as a company, and overall help you deliver more reliable products to your customers.

If you look at some companies, there are tools they already use: Slack is there, email is there, Google Docs too. Those are used not only to record things; a lot of post-incident analysis happens on those platforms as well. Talk a bit about what role Jeli plays there and what additional value you folks bring. Of course, it's not replacing those, but working in addition to, on top of, the platforms folks already use.

Incidents generate a lot of data, and we see this with a lot of our customers. There's this combination of discussion in your Slack channel.
There's a Google Doc running where you're capturing details on what remediation steps the team has taken and what they think is going on. Sometimes there are conference bridges where folks are talking in real time and coming up with solutions that maybe don't make it back to the team longer term. So I think Jeli's value proposition for those types of complex, data-rich situations is that we save you time and money. At the end of the day, that's what we need when we're dealing with these customer-impacting incidents and trying to learn how to improve efficiency in an organization. When you're in an incident, keeping your stakeholders updated on what's going on, be they VPs of engineering or customers or whomever, so they can set expectations with folks, and then conducting the investigations once the incident is mitigated: all that stuff takes a lot of time, right? You've got to make sure you're communicating things to the right people. You've got to make sure you're pulling in the Google Docs, the Slack conversations, all of the details from your various systems. The way we try to help with that is we've got an incident response bot that standardizes the process of responding to incidents, so responders don't have to look at incident runbooks or remember the sequence of steps they need to take to get an incident going. And our narrative builder, which we'll look at in a little bit, makes it easier to build those incident narratives and do that analysis after the incident by doing some of that work for you.
It makes it really easy to gather all of that information. Finally, over time, as you're responding to incidents, analyzing them, and coming up with themes and takeaways, that data can be presented back in a way that helps you make decisions about what engineering work needs to be prioritized and where there may be gaps in headcount or skill sets on particular teams. Knowing all of those things helps leaders make better decisions about organizational growth and operational efficiency.

Would it be wrong to say that post-incident analysis, or incident management in general, is more about people and processes than tools? Of course tools are important, and with a lot of other things you get the right tool and that's done. But when I listen to you, and when we look at these things, it's about the engagement of the whole team, the whole organization. They have to look at it from a really different perspective; a holistic view has to be there. It's not just a problem of one team, because when something goes wrong, you don't really know what went wrong or where, so whole teams have to come together.
Of course, Jeli cannot go out and change cultures, but sometimes tools do encourage teams to come together. So talk a bit about the importance of culture, and then about how Jeli's tools actually help, how they become a catalyst in changing the culture as well.

That's a really great question. Culture is really important, and we've found, and this will come as no surprise, that every organization's culture is different, both in the way they respond to and think about incidents and in the ways they analyze and incorporate what they learn into their organizational process. One thing that we're really proud of, and that we think is really important about Jeli as a product, is that it is people-centric. With every feature we develop, we're asking ourselves: how does this help responders make better decisions, or focus more on the incident rather than pushing information around as things are going on? And then on the tail end, in the analysis: how are we telling the story of the people who were involved?
What they knew at the time they responded, what they didn't know, and how these specific types of events affect them, their teams, and the broader organization. As I mentioned earlier, this is a huge area of opportunity that can be difficult to invest in because of the amount of time it can take up if you're doing this all by hand. So one of the things we hope is that this people-centric focus in the product, and the ways in which we support existing processes and make this process customizable for our customers, will help shape some of that cultural perspective on incidents: what they mean for an organization and what can be learned from them.

And I think one of the best examples can be Jeli itself. As we were discussing before the recording, you folks had some incidents a few weeks ago, and once again you leveraged Jeli itself to understand what was going on. That is a very good example not only of the right tool, but also of the kind of culture that Jeli either had or triggered: hey, this is the kind of culture you should have. So let's talk about the incidents and the lessons. And of course, we would love to see a demo as well of how Jeli helped you folks.

If it's okay with you, I'll show you how we run these incidents very quickly, and then we'll talk about some of what my team did to analyze this big, multifaceted incident, what we learned from it, and how Jeli helped. So let me show you a little bit of how the incident response process works for us. A fun thing about Jeli is that we are all incident response and analysis nerds, and we get to build a tool that we use every day to respond to our incidents. What we're looking at here is a Slack demo environment where I'm going to show you some of these features.
So, you mentioned culture. One interesting thing about incidents is that sometimes it can be difficult for an organizational culture to make space for folks to declare incidents. You're worried: is this an incident? Should I declare it? I'm going to pull people out of the work they're doing. I'm going to page them, maybe at a time that's inconvenient for them. So one thing that was really important for us is that when there's an issue, we wanted to make it as easy as possible to get an incident started. In Slack, you can run /jeli open, and you're presented with a modal that allows you to add a bunch of optional data about an incident. For example, you can add channels that you want to send status updates to, to keep your stakeholders updated on what's going on. You can create conference bridges with Zoom. You can create incident tickets in your ticket management platform. You can also specify who is the incident commander, or whatever role you use internally (those are all customizable), as well as declare incidents in private. Sometimes you have incidents where the information is sensitive and you need to conduct those in a private channel. So this is the modal, and one button gets you into an incident channel. We talked earlier about how Jeli saves time. One of the things that ends up happening when you're a responder is that there's a series of steps you need to take to get the incident going: you've got to create a new channel, you've got to set up your broadcast, you've got to start a Zoom call, you've got to do all these things. What we try to do is take care of that with the press of a button. So we now have an incident channel. We've got a Zoom bridge in there. Freddie Mercury (I'm Freddie Mercury in this demo) is the incident commander, and I can start responding in whatever way makes sense given the incident, right?
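To make the "one button" idea concrete, here is a minimal sketch of how the optional fields in the open modal could map to the setup steps a bot performs on the responder's behalf. Everything in it is an assumption for illustration: `IncidentRequest`, `plan_setup`, the field names, and the channel naming are hypothetical and are not Jeli's actual API or data model.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class IncidentRequest:
    """Hypothetical model of the optional data collected by the open modal."""
    name: str
    commander: Optional[str] = None  # or whatever role name is used internally
    broadcast_channels: list[str] = field(default_factory=list)
    create_bridge: bool = False      # e.g. a Zoom call
    create_ticket: bool = False
    private: bool = False            # for incidents with sensitive information


def plan_setup(req: IncidentRequest) -> list[str]:
    """Return the setup steps a bot would run so responders don't have to."""
    visibility = "private" if req.private else "public"
    steps = [f"create {visibility} channel #incident-{req.name}"]
    for channel in req.broadcast_channels:
        steps.append(f"send status updates to {channel}")
    if req.create_bridge:
        steps.append("start conference bridge")
    if req.create_ticket:
        steps.append("open incident ticket")
    if req.commander:
        steps.append(f"assign incident commander: {req.commander}")
    return steps


if __name__ == "__main__":
    # Made-up values that mirror the demo described above.
    request = IncidentRequest(
        name="search-down",
        commander="Freddie Mercury",
        broadcast_channels=["#eng-status"],
        create_bridge=True,
    )
    for step in plan_setup(request):
        print(step)
```

Only the incident name is required in this sketch; every other field defaults to "off", which reflects the point made above that declaring an incident should take as little effort as possible.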
Maybe search is down. So we can coordinate the response there. I can bring other folks into the channel and assign them their specific roles, and we can figure out how to resolve the incident, drop into the bridge, and start talking about it. Once everything is cleared, or we understand what was going on and we've fixed the issue, we can close the incident out. There are a lot of different options for things we can do along the way, but at the end we can close that incident. This will mark the incident as closed, as well as indicate that the incident has been mitigated. In the Jeli platform, we can control who can view the resulting investigations. Sometimes we want these investigations to be private until we've gone through all the details. We can set an optional severity during or at the end of the incident, and then we can provide a summary to our stakeholders. I'll just give a really brief one here, and then we'll close it. What happens when this incident is done is that Jeli takes care of all of the work of gathering that data for you. We don't have a lot of data in here, but the entire Slack transcript, as well as all the on-call rotations for the folks involved and all of the communication, is pulled into Jeli. We'll go ahead and archive this channel because we won't need it later. And that's how we respond to incidents almost every day. We declare them, we run through them, and then Jeli takes care of all the heavy lifting of setting up what we need to respond and getting all that data into a place where we can analyze it. I'll just show you an example of what's available during the incident. Here's our incident view. This is a live-updating view that folks who are not part of the incident response can look at to keep up to date on what's going on.
This is an incident that's closed, obviously, but it lasted 25 minutes. There were a couple of folks in the channel responding, shown along with their specific roles and what group they're in. So this is really cool for keeping everybody on the same page about what's happening and how we're resolving it. That's the front end of the incident management process.

As you mentioned, we recently did a pretty comprehensive cross-incident investigation. Jeli had four incidents over the course of about two days, which is kind of an unusual situation for us. So a colleague of mine and I came together to do what we called the galaxy brain investigation. This was the one that was going to help us see everything there was to see about the organization, because we were looking at a couple of different incidents. What we're looking at here is the result of that investigation. We've got this overview where we provided a summary of each of the specific incidents that were involved. We've got a lot of details on who was involved in each and what sorts of terms or tags came up in each of those. These little sparkles are automatically tagged technologies from the transcript, as well as some things that are related. In the process of working through this, we spent about a day or two working asynchronously, looking at the incident conversation and all the detail around it, and coming up with a narrative of what actually happened and what it meant for our organization. You can see here there's a timeline that shows everything that happened over the course of those four incidents, including when we detected them, how we figured out what was going on, and then what we did to repair. For each of these, you can look at the specific narrative marker, as we call it, and you can see details about what was happening from the actual transcripts, the conversation inside the incident channel.
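As a rough illustration of the cross-incident timeline described above, the sketch below merges markers from several incidents into one chronological list. It is an assumption-laden toy, not how Jeli builds its timeline: the `Marker` type, its fields, the phase names, and the sample timestamps are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Marker:
    """Hypothetical stand-in for a narrative marker on the timeline."""
    at: datetime
    incident: str
    phase: str  # "detect", "diagnose", or "repair"
    note: str


def build_timeline(*incidents: list[Marker]) -> list[Marker]:
    """Merge markers from several incidents into one chronological list."""
    merged = [marker for markers in incidents for marker in markers]
    # sorted() is stable, so markers with equal timestamps keep input order.
    return sorted(merged, key=lambda marker: marker.at)


if __name__ == "__main__":
    # Made-up markers, for illustration only.
    first = [
        Marker(datetime(2023, 1, 9, 10, 0), "incident-1", "detect", "alert fired"),
        Marker(datetime(2023, 1, 9, 10, 25), "incident-1", "repair", "rolled back"),
    ]
    second = [
        Marker(datetime(2023, 1, 9, 10, 10), "incident-2", "detect", "customer report"),
    ]
    for marker in build_timeline(first, second):
        print(marker.at.isoformat(), marker.incident, marker.phase, marker.note)
```

Keeping the incident name on every marker is what lets a single merged view still show which incident each detection, diagnosis, or repair step belonged to.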
So this is really cool for telling a story, and in our learning review we walked through each of these narrative markers and talked about how we got there and what the implications were. This is not exactly helpful for us, but we can also see where folks were located at that time, what their time zone was, and how the situation might have impacted them. But the real value of these types of exercises is the themes and takeaways. We had a couple of things come out of this investigation that we found helpful. One is that in this specific week, we were making a lot of changes very quickly. Balancing quality and speed is challenging; this is no surprise to anyone. So it gave us a chance to reevaluate ways that we could manage those competing priorities. Tuning monitoring is another one that lots of folks deal with pretty often. We found that there was some opportunity to tune our monitoring so we could get better insights about what was going on in the system. And then we noticed some other things, like there were a couple of folks who were involved in all four incidents, and we talked a little bit about why that is and how those folks know what's going on and keep up to date. All of this stuff was really helpful for us, and something that we learned as we were going through this investigation is that we could not have done it without Jeli. I mentioned saving time and money. Not only did Jeli help us coordinate the response for each of those incidents, but it made it easy to gather the incident data from a ton of different channels. If we look at the event log here, which we used to tag and annotate data, there are a lot of channels in Slack that we were using to coordinate the response across all four incidents. Copy-pasting data from Slack into a Google Doc and then working in comments to try to construct narratives can be really difficult, especially for these large issues.
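The observation that a couple of folks were involved in all four incidents is the kind of pattern that falls out of simple set operations once participant data is gathered in one place. The sketch below is a hypothetical illustration under that assumption; the function names and sample data are made up and do not describe Jeli's implementation.

```python
from collections import Counter


def responders_in_every_incident(participants: dict[str, set[str]]) -> set[str]:
    """Return the people who appear in all of the given incidents."""
    if not participants:
        return set()
    return set.intersection(*participants.values())


def incident_counts(participants: dict[str, set[str]]) -> Counter:
    """Count how many incidents each person took part in."""
    return Counter(person for people in participants.values() for person in people)


if __name__ == "__main__":
    # Made-up names and incidents, for illustration only.
    participants = {
        "incident-1": {"ana", "ben", "chi"},
        "incident-2": {"ana", "ben"},
        "incident-3": {"ana", "ben", "dee"},
        "incident-4": {"ana", "ben", "chi"},
    }
    print(sorted(responders_in_every_incident(participants)))
    print(dict(incident_counts(participants)))
```

A result like this is only a prompt for the conversation described above: why the same people keep showing up, and how their knowledge could be spread across the team.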
It's one command to start and then just the incident and we can pull in additional channels via import And so it makes it all really easy Furthermore doing this feeds into our broader understanding of what's going on in our organization So this is jelly's learning center and we can start to see patterns and when our incidents are happening and who's involved in what technologies Are implicated in those so it was it was a really interesting experience to do such a kind of all-encompassing Investigation and like I said, it would have taken a lot more time. I think then The two days we were we were working on it asynchronously if we didn't have something like jelly to kind of collate and give The ability to tell a story about all of that data Can you talk about when we look at, you know, the actual value of the post, you know Incident analysis is if you look at some of those practices that are there chaos injuring a data force if you look at sorry Is you know, how how does you know a jelly kind of helps with those kind of practices also? Because when we look at the themes and all those things that can help a lot with some of those strategies Some of those practices as well. 
Do you see that?

I think there are two primary ways that it helps, and we actually talk with our customers a lot about this. I could go on and on, but let's focus on the two primary ways. The first is that it can be very difficult to communicate effectively during an incident. If I'm a site reliability engineer on the team, which I have been in the past, my primary goal is to reduce the impact of an incident as much as possible, understand what's going on, and get us to a point where the incident is mitigated. Some of the features I showed around incident response, and also looking back at past incident data, can really help not only with coordination of incidents, but with helping SREs expand their understanding of complex systems over time. So education is a really important part of what we're building here. Skills and expertise in specific technologies do not just spread on their own; you've got to work at that, and so we think it helps there. The second area where I think this helps site reliability engineers and folks in the incident space is that telling stories about what's happening in an organization can be really challenging. It requires data, it requires finesse, and it requires a shared understanding of what's going on. We talk a lot at Jeli about a recent specific case where we had a customer with a piece of technology that had been causing them problems for months on end. A senior member of the org came in and said, you know what, we're going to deprecate this technology; we're going to get rid of it and replace it with something else. The folks who responded to that incident were able to tell an alternative story about the current implementation of that technology not using best practices or standards. The problem wasn't the technology itself. It was the organization's ability to invest in getting that piece of
technology to a good state. Moving off of that technology would have been extremely expensive for this organization, but investing a little bit of time to get it configured in the proper way saves a bunch of time and avoids some of the roadmap thrash that would have been introduced by that change. So I think the second way that it helps is by giving those folks a storytelling tool, so that they can bring the things they know to groups that don't have the same context. You've got these folks who are working with the technology, who are experts and understand it very deeply, and you've got a lot of other folks in the organization who need to understand it, but not at the same depth. That's really the gap we're trying to bridge with this tool: to make it easy for them to do their job, and to give them tools to tell better stories about how to make the organization more successful.

Drew, thank you so much for, of course, talking about Jeli and incident response, but I really love the way you talked about how you folks use it internally. I also love the whole process you showed step by step. As I was saying earlier, tools or companies like Jeli can sometimes become a catalyst in changing the whole culture, or enable the many teams that already have those practices. We talk about practices a lot, but we often see a disconnect between the tool and the practice. Here we see a very good relationship between the two. So thanks for sharing all of that, and I would love to chat with you folks again. Thank you.

Thanks for having me. It was great talking with you.